Several weeks ago MarkMail, a project sponsored and run by Mark Logic, started indexing the KDE mailing list archives. After about a week of hard work, the KDE archives are now directly searchable from MarkMail. Besides interesting analytics, this brings some powerful search capabilities to the table. Read on for a short interview with Jason Hunter, who was responsible for engineering on the project.
Hi Jason! Could you give a little introduction of yourself and Mark Logic?
Hi, KDE! I'm a Silicon Valley hacker. I've been working at Mark Logic for about five years now, since the days it was an early startup. We sell MarkLogic Server, a special-purpose database built for content (where "content" is the stuff that's textual, hierarchical, and irregular rather than regularly repeating, like books, articles, and presentations). We use XML as our native data type instead of tables, and pride ourselves on performing very well at high scale.
Until about a year ago I worked with our customers to help them write content apps. I had the idea that we could use the core server to build a public email archive repository, using some of the product features to push the envelope of what people had done before with email archives. That's where MarkMail came from. We started with 4,000,000 emails from the Apache Software Foundation mailing lists.
I've been involved with open source for a long time, leading JDOM and participating as a member of the Apache Software Foundation, so it felt natural to put MarkMail to work initially on the problem of getting more value from open source mailing lists.
Why did you decide to grab the KDE mailing lists?
Cornelius Schumacher started the ball rolling when he asked if we could load the KDE lists. OK, that's not quite true. We have a long list of communities whose lists we hope to load, and KDE was actually on that list since the very beginning. It's just that one day in April we heard from Cornelius, and the next day received a separate request from Adriaan de Groot. That popped KDE to the top of the priority list.
The KDE mailing lists aren't the largest you have at MarkMail, but they sure aren't small. Did that pose any problems?
Yes, KDE is Big. At current count there are 2.7 million KDE emails. Hosting those emails isn't an issue (we're designed to scale to hundreds of millions), but we had to work hard to gather clean historical archives. We have one person on the MarkMail team dedicated only to this. We like to call him an email archaeologist, though I'm not sure he's happy about that nickname.
Why the challenge? Well, the most authoritative archives for KDE were the web-based Pipermail archives (I'm using past tense because I'd like to think that today the most authoritative archives are in MarkMail). Pipermail exposes a set of "mbox" files for each archived list. Very handy. The mbox file format is a classic storage format for email and one we can readily load from. But as we found out, the mbox files aren't really mbox, and there was a lot of post-processing we had to do. Some examples:
There are more examples, but I don't need to bore you. I should make clear that it's nothing specific to KDE or even to Pipermail. It turns out that if you load a couple million emails you'll see at least one example of almost every problem that's ever existed. It's the same for every community, just with different challenges.
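For readers curious what a first sanity pass over an mbox archive might look like, here is a minimal sketch using Python's standard `mailbox` module. It is not MarkMail's actual post-processing, and it does not reproduce the specific problems Jason refers to. The function name, the choice of required headers, and any file path you pass in are assumptions made purely for illustration.

```python
import mailbox


def count_messages_missing_headers(path, required=("From", "Date", "Message-ID")):
    """Return (total, missing) for the mbox file at `path`.

    `missing` counts messages that lack at least one header in `required`.
    The header list is an illustrative assumption, not MarkMail's rule set.
    """
    # create=False makes a nonexistent path raise instead of creating a file.
    box = mailbox.mbox(path, create=False)
    total = 0
    missing = 0
    try:
        for message in box:
            total += 1
            # Indexing a message by header name yields None when it is absent.
            if any(message[name] is None for name in required):
                missing += 1
    finally:
        box.close()
    return total, missing
```

Running a check like this across a few million messages is one simple way to see how often archives deviate from what a loader expects.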
You mentioned pushing the envelope. Can you give an example of that?
Sure, here's a good example: When you do a search, besides getting the top 10 most relevant emails, you see lots of analytics. You see a histogram chart showing the number of messages matching your query each month across time. With it you can watch trends for lists, people, ideas, or any combination. Every query also shows the top senders, lists, attachment types, and message types for the messages matching the query. You can learn who's an expert on a topic, on what lists something is being discussed, which people are most involved on lists, and so on. By dragging across bars on the graph you can limit the view to just a particular time period. You can also click on any person's name or list name to limit the search. It's convenient to start with a simple query and refine interactively.
We've also strived to make the site easy to navigate. You can hit "n" and "p" to go to the next and previous search results. To move up and down the thread view you hit "j" and "k" (a homage to vim users). If you find an attachment (search for ext:pdf) you can view it inline in your browser.
Oh, and here's a little-known tip. If your screen is sufficiently wide, we give you all three panes (analytics, results, messages) at once. If not, you get the "slide".
Do you have any tips for the KDE community to take advantage of the available capabilities in MarkMail?
The first thing to remember is that you can limit your view to KDE-related mails by going to kde.markmail.org. The use of a subdomain adds an implicit constraint to all your queries.
Another tip is that you can negate search terms. For example, KDE has a huge number of automated emails generated by bug reports and code check-ins. You can search without those by adding -type:bugs -type:checkins, like this: http://kde.markmail.org/search/?q=-type%3Abugs+-type%3Acheckins
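If you want to generate such links from a script, the sketch below builds the same kind of URL with Python's standard library. The helper name and its parameters are hypothetical. Only the URL shape and the -type: syntax come from the example above, so treat anything beyond that as an assumption.

```python
from urllib.parse import urlencode


def markmail_search_url(terms=(), exclude_types=(), subdomain="kde"):
    """Build a MarkMail search URL (hypothetical helper, not an official API).

    `terms` are plain search words; each entry in `exclude_types` becomes a
    negated -type: constraint, as in the example above.
    """
    parts = list(terms) + [f"-type:{name}" for name in exclude_types]
    query = " ".join(parts)
    # urlencode percent-encodes ":" as %3A and turns spaces into "+".
    return f"http://{subdomain}.markmail.org/search/?" + urlencode({"q": query})


print(markmail_search_url(exclude_types=["bugs", "checkins"]))
```

Called with no search terms and the two excluded types, this should print the same link as the one given above.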
Thank you very much, Jason, both for the work you and Mark Logic have been doing, and of course for granting this interview!