Interview: MarkMail Indexes KDE Mailinglist Archives

Several weeks ago MarkMail, a project sponsored and run by Mark Logic, started indexing the KDE mailinglist archives. After about a week of hard work, the KDE archives are now directly searchable from MarkMail. Besides interesting analytics, this brings some powerful search capabilities to the table. Read on for a short interview with Jason Hunter, who was responsible for engineering on the project.

Hi Jason! Could you give a little introduction of yourself and Mark Logic?

Hi, KDE! I'm a Silicon Valley hacker. I've been working at Mark Logic for about 5 years now, since the days it was an early startup. We sell MarkLogic Server, a special-purpose database built for content (where "content" is the stuff that's textual, hierarchical, irregular, and not often regularly repeating - like books, articles, and presentations). We use XML as our native data type instead of tables, and pride ourselves on performing very well at high scale.

Until about a year ago I worked with our customers to help them write content apps. I had the idea that we could use the core server to build a public email archive repository, using some of the product features to push the envelope of what people had done before with email archives. That's where MarkMail came from. We started with 4,000,000 emails from the Apache Software Foundation mailing lists.

I've been involved with open source for a long time, leading JDOM and participating as a member of the Apache Software Foundation, so it felt natural to put MarkMail to work initially on the problem of getting more value from open source mailing lists.

Konqueror showing MarkMail's search results

Why did you decide to grab the KDE mailinglists?

Cornelius Schumacher started the ball rolling when he asked if we could load the KDE lists. OK, that's not quite true. We have a long list of communities whose lists we hope to load, and KDE was actually on that list since the very beginning. It's just that one day in April we heard from Cornelius, and the next day received a separate request from Adriaan de Groot. That popped KDE to the top of the priority list.

The KDE mailinglists aren't the largest you have at MarkMail, but they sure aren't small. Did that pose any problems?

Yes, KDE is Big. At current count there are 2.7 million KDE emails. Hosting those emails isn't an issue (we're designed to scale to hundreds of millions) but we had to work hard to gather clean historical archives. We have one person on the MarkMail team dedicated only to this (we like to call him an email archaeologist; I'm not sure he's happy about that nickname).

Why the challenge? Well, the most authoritative archives for KDE were the web-based Pipermail archives (I'm using past tense because I'd like to think that today the most authoritative archives are in MarkMail). Pipermail exposes a set of "mbox" files for each archived list. Very handy. The mbox file format is a classic storage format for email and a format from which we can readily load. But as we found out, the mbox files aren't really mbox and there was a lot of post-processing we had to do. Some examples:

  • Pipermail "scrubs" attachments from its mbox files. Instead of placing the attachment content into the message as normal, it gets placed at an external URL with a marker in the message dictating where you can find it. We had to recognize the scrubbed references, fetch the attachments, and then inline the contents. Sounds simple, doesn't it? It probably would be if the external links were always accurate. Sometimes we could guess and fix things and sometimes we couldn't - bonus points go to anyone who finds an email in MarkMail mentioning an attachment that doesn't really exist. Extra bonus points if you know our search syntax well enough to write a query that directly lists those emails.
  • Then there's the problem with character encodings in old emails. If you look at an mbox file it seems like ASCII, but in fact it's a binary file. That's because each message may have a different character encoding for its body (or even portion of the body). The Pipermail list archiver didn't always realise this, and fixing that was non-trivial and imperfect.
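
To illustrate the two problems above, here is a minimal Python sketch; it is not MarkMail's actual code. It assumes the scrubbed-attachment notice has the shape Mailman's Pipermail commonly writes (a "was scrubbed..." line followed by header-like lines ending in "URL: <...>"). The function names and the Latin-1 fallback are hypothetical choices made for this example.

```python
import re
from email import message_from_bytes

# Assumed notice layout: "... was scrubbed..." followed by "Key: value" lines,
# the last of which is "URL: <...>". Real archives may differ.
SCRUBBED = re.compile(r"was scrubbed\.\.\.\n(?:[A-Za-z-]+: .*\n)*?URL: <?([^>\s]+)>?")


def find_scrubbed_urls(body: str) -> list[str]:
    """Return the external URLs named by scrubbed-attachment notices."""
    return SCRUBBED.findall(body)


def decode_text_parts(raw: bytes) -> list[str]:
    """Decode each text part of one message using that part's own charset."""
    msg = message_from_bytes(raw)
    texts = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and binary attachments
        payload = part.get_payload(decode=True)  # bytes, transfer encoding undone
        charset = part.get_content_charset() or "latin-1"  # fallback is a guess
        try:
            texts.append(payload.decode(charset))
        except (LookupError, UnicodeDecodeError):
            # Unknown or wrongly declared charset: Latin-1 never fails to decode,
            # although the resulting text may be wrong.
            texts.append(payload.decode("latin-1"))
    return texts
```

Reading the message as bytes and decoding part by part is the point here: treating the whole mbox file as a single text encoding would garble any message whose parts use a different charset.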

There are more examples, but I don't need to bore you. I should make clear it's nothing special with KDE or even with Pipermail. Turns out if you load a couple million emails you'll see at least one example of almost every problem that's ever existed. It's the same for every community, just with different challenges.

Graphically drilling down to a specific date

You mentioned pushing the envelope. Can you give an example of that?

Sure, here's a good example: When you do a search, besides getting the top 10 most relevant emails, you see lots of analytics. You see a histogram chart showing the number of messages matching your query each month across time. With it you can watch trends for lists, people, ideas, or any combination. Every query also shows the top senders, lists, attachment types, and message types for the messages matching the query. You can learn who's an expert on a topic, on what lists something is being discussed, which people are most involved on lists, and so on. By dragging across bars on the graph you can limit the view to just a particular time period. You can also click on any person's name or list name to limit the search. It's convenient to start with a simple query and refine interactively.

We've also strived to make the site easy to navigate. You can hit "n" and "p" to go to the next and previous search results. To move up and down the thread view you hit "j" and "k" (a homage to vim users). If you find an attachment (search for ext:pdf) you can view it inline in your browser.

Oh, and here's a little-known tip. If your screen is sufficiently wide, we give you all three panes (analytics, results, messages) at once. If not, you get the "slide".

Do you have any tips for the KDE community to take advantage of the available capabilities in MarkMail?

The first thing to remember is that you can limit your view to KDE-related mails by going to kde.markmail.org. The use of a subdomain adds an implicit constraint to all your queries.

Another is that you can do negations. For example, KDE has a huge number of automated emails generated by bug reports and code check-ins. You can search without those by adding -type:bugs -type:checkins. For example: http://kde.markmail.org/search/?q=-type%3Abugs+-type%3Acheckins
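
A query like the one above can also be built programmatically. The sketch below uses Python's standard urllib.parse to percent-encode the search terms. The helper name build_search_url is hypothetical, and the /search/?q= path is taken only from the example link above.

```python
from urllib.parse import urlencode


def build_search_url(query: str, host: str = "kde.markmail.org") -> str:
    """Return a search URL for the query (hypothetical helper, not a MarkMail API)."""
    params = urlencode({"q": query})  # ":" becomes %3A, spaces become "+"
    return f"http://{host}/search/?{params}"


# Exclude automated bug and check-in mail, as described in the interview.
print(build_search_url("-type:bugs -type:checkins"))
```

The same helper works for other operators mentioned in the interview, such as ext:pdf.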

Lastly, if there are any other lists people want to see, let us know at our feedback page. You can track what we're up to at our blog.

Thank you very much, Jason, both for the work you and Mark Logic have been doing, and of course for granting this interview!

Comments

by Lucianolev (not verified)

I've played a few minutes with the service and it rocks!
Great job really, well presented, fast and, most important, accurate results.
Thanks for this useful tool :)

by velocifer (not verified)

Are there also services like these for non-public mailing lists or means to stop at least google indexing?

The reason I detest public archives is that it gets indexed by google and may also pose legal liabilities under German law (Abmahnungen) you can hardly manage if enforced in a ruthless manner. It further invades privacy.

I always had a problem with attachments to mailing list archives under mailman.

by Mark (not verified)

This is a nice idea, but I think there is a big privacy issue. First, a lot of bug reports get automatically forwarded to mailing lists, exposing the email address of the reporter to the public and search engines. I know this has happened with the old archives, but now it is even easier. I guess most of the bug reporters are not aware of this fact. A similar problem arises through some support email addresses of some KDE programs, which are in fact mailing lists. Here also the email addresses of people asking for support are exposed to the public through the archives being scanned by search engines. It would be great if there was a solution for this.

by velocifer (not verified)

Yes, this is why I always refrain from sending the crash reports, e.g. of Amarok.

However, I don't really understand why it's necessary that your email gets revealed.

by Mark Kretschmann (not verified)

Geez, being concerned about exposing your email address is so 90s.

My address is all over the web (you can google it in 10 seconds), and still Gmail deals fine with the spam. Absolutely manageable.

kretschmann(replace_tinfoil_hat_with_@)kde.org

by BiddyBoo (not verified)

Can I get free advertising for my proprietary privacy invading web20 web site just cause I'm barely helping a FOSS project so I can get free advertising for my proprietary privacy invading web20 web site?

by Anonymous (not verified)

Have you ever heard about lists.kde.org ? It already has everything to invade your privacy, this site adds nothing new with it. There is also gmane.org - here you even have a search interface (for example: http://dir.gmane.org/gmane.comp.kde.devel.core ). The problem with privacy is more general - all data is available regardless of MarkMail.

What MarkMail adds is a really good interface for browsing the archives (which is really important for people who have not subscribed to those lists for ages and thus don't have a local copy of the history). For example one can use it to check whether a given topic was already discussed before asking it themselves. Yes, of course it's not free, and it sucks. But hey, the google is also not free, but you don't blame it when it sponsors FOSS projects, do you ?

And about free advertising - well, it's actually not free. The company does something for the project and as a reward gets advertisement. It's quite common in the free software world, and it's good for both the company and the project and is a very natural way of working together.

by BiddyBoo (not verified)

There are plenty of websites who offer nice interfaces for searching mailing list, source code, coders etc... Are we going to advertise each and every one of them that helps FOSS a little more than other especially if they are not FOSS themself? (not to mention this one is full of flash nasty stuff).

> But hey, the google is also not free, but you don't blame it when it sponsors FOSS projects, do you ?

What does it have to do with it? I've never seen an announcement on the dot about how google products should be used to help KDE developers. The only google announcements here are about Google SoC which are not about how people should use google products. And for what I know, markmails is not sending any checks to young kde hackers.

by Janne (not verified)

"Are we going to advertise each and every one of them that helps FOSS a little more than other especially if they are not FOSS themself?"

Probably not. And why should they?

"not to mention this one is full of flash nasty stuff)."

Then don't use it. Sheesh. "we should not talk about this web-service because it uses Flash!"...

And if you consider this service to be "invasion of privacy".... Well, how on earth can you expect one shred of privacy on a PUBLIC mailinglist? All the data they display and use is 100% public by default.

by velocifer (not verified)

Well, the search is for the "better tool". In the recent years the concern for privacy is getting higher as no data is lost anymore and governments apply anti-terror madness. People are not sensitive enough and reveal everything about themselves they can over myspace and the like.

On the other hand, mail encryption is still immature and poses a usability burden. Whoever uses Gmail makes the decision to go public. Fine.

Paranoid mode: when I reveal what software version I am using I consider this a personal security problem for a targeted attack.

Read this:
http://ec.europa.eu/public_opinion/flash/fl_225_en.pdf
http://ec.europa.eu/public_opinion/flash/fl_226_en.pdf

Two-thirds of respondents throughout the EU (65%) indicated that their company transferred personal data via the Internet. One in three respondents (32%) admitted that their company did not take any security measures when transferring personal data over the Internet.

Among companies that transferred personal data to non-EU countries, almost half of respondents (46%) indicated that this data mostly concerned clients’ or consumers’ data for commercial purposes, and 27% said it was human resources data for HR purposes.

by Vide (not verified)

"Paranoid mode: when I reveal what software version I am using I consider this a personal security problem for a targeted attack."

So you are all for security by obscurity?

by velocifer (not verified)

You mix a matter of principle, it is a matter of nondisclosure.

You don't document your infrastructure for an attacker.

"Security by obscurity" relates to non-transparency of the tools as such.

If someone has the ability to find out that I am using a service x he can target me specifically. E.g. I get a lot of phishing messages for financial institutions I am not a client of. But if an attacker has access to my customer data or knows that I contacted the bank and who the person at the bank is, he might succeed as he can customize the attack in a more intelligent way. Don't reveal what you don't have to reveal.

Even the fact that you use Linux may be an issue you have no desire and no reason to disclose to the public at large. It is as private as which books you have in your room. I don't want "my room" to communicate to the rest of the world which books I store.

For submitting a defect report an association with my real name is factually irrelevant.

by jospoortvliet (not verified)

Oh, yes, if you help us, we help you.

That's normal in FOSS - if someone does something for the project, we try to thank them. MarkMail didn't ask for this interview, you know... WE asked them to index our stuff, they did, we decided it would be cool to let the world know about it.

by Jason Hunter (not verified)

All email addresses shown by MarkMail are obfuscated. Only by clicking on an address and solving a CAPTCHA will you be able to see what the true address is. This is true for any emails mentioned in the body of a message also.

The goal is to make the site useless to address harvesters but useful to people who want to communicate with each other. We think this strikes a nice balance.

Hi Jason,

indeed! I like that feature very much. Thanks for that.

Marcel

by Anonymous (not verified)

Thanks for the great work. It would be also very cool to have threaded browsing available.

by Ian Monroe (not verified)

Just click on an email and it opens up the thread.

by André (not verified)

No, if I click on a message, I get a *list* of messages in a conversation, not a tree-like view that gives me an idea of which message is a reply to which other message (the different threads in the conversation). Or is there something I am overlooking here?

by axel (not verified)
by christoph (not verified)

I have globally disabled JavaScript, and only enable it using per-site rules in Konqueror. However, markmail search site does not display any results this way. Please fix.

Your internets are broken.

I guess your rules are broken then. Using per-site rules with Firefox (sorry, I am at windows at this machine) just works.

I needed to whitelist both markmail.org and kde.markmail.org, sorry for the bogus alert, maybe a Konqueror bug.

by taj (not verified)

I was a big fan of JDOM back when I was a Java programmer, it really did make XML processing in Java tolerable. Nice to see Jason's name pop up again, this time in relation to KDE!

by Ian Monroe (not verified)

I like the kind of statistics you can easily get out of it. (Like with any search engine, the first thing I do is search myself :D). And the UI is really well done. Good stuff.

by Debian User (not verified)

Hello,

it's good times if people provide free services like this one.

I personally don't like that it wants to use Flash for the graph. The layout (results in a scrollbar container) seems broken, but that could be my outdated KHTML (3.5.9). Overall, when searching for myself, I found a pretty nice set of results and a good way to view them.

I would say this site is very useful. Only a pity that its software is not Free, is it?

Yours,
Kay

by Robin (not verified)

markmail is excellent if you are searching for specific topics or just want some statistics, but it is very bad for just browsing the mailing lists.

i.e. select a specific list and get an overview of all threads

lists.kde.org does a better job here.. would be nice to see an improvement there

by Robin (not verified)

oh, just found the browsing mode:
http://kde.markmail.org/docs/faq.xqy#browse

however, it's very basic...

by jospoortvliet (not verified)

sounds like something for a feature request ;-)

by Chaz6 (not verified)

I really wish markmail provided an nntp interface. I far prefer using a good nntp client with threading to any online forum/list archive. I find discussions a whole lot easier to follow. I also prefer it to mailing lists since I don't have to worry about my mail account filling up with lots and lots of messages.