FOSDEM 2005: Desktop Search Interview

The schedule for the KDE developers room talks at FOSDEM is now online. Our final interview with the speakers is with Scott Wheeler who will be giving a talk titled "KDE 4: Beyond Hierarchical Data, The Desktop as a Searchable Web of Context". FOSDEM is this weekend, see you there.

Please introduce yourself and your role in KDE.

I feel like I've been asked this question enough times that I should have an exciting answer by now. But well, I wrote JuK and TagLib as well as a couple of other small applications in KDE CVS and do some work on a handful of things in kdelibs and elsewhere across KDE.

What kind of search capabilities do you think a modern desktop should have?

Well, I think I'd like to step back a bit first and look a little at the problem — and the problem isn't a lack of a search tool, the problem is that it's hard to find things. Search tool or no, all of the ideas flow from the idea of solving the problem rather than just creating a new tool. So, in a sense, I don't think a modern desktop should have a search tool; I think a modern desktop should make it easy to find stuff — we're then left with how to get there.

And I suppose with all of the buzz around search tools these days people have a much more concrete idea in mind when they hear about searching on the desktop. But such wasn't the case when I started kicking these ideas around a while back. Spotlight was announced a few days after I'd submitted my abstract for the KDE Developer's conference, Beagle was relatively low profile, Google for the Desktop and its successors hadn't entered the scene yet, etc.

So, I think — fundamentally "what sort of search should the desktop have" is almost the wrong question. "How should we make it easier to work with the data we accumulate on the desktop?" is closer to the right question. I think search is just part of the answer.

Where did the idea of integrating a search capability throughout KDE come from?

Well, a few things actually. It mostly came from not being able to find things and asking some fundamental questions about how we organize and access information on the desktop. The first step — and this is tied up with the first part of the name of both this talk (which is related to the one that I gave at Linux Bangalore) and the one at the KDE conference this summer — is that hierarchical interfaces simply don't make sense in a lot of cases.

When I started looking around for examples of how this had played out in other domains of information, the most obvious example was the World Wide Web, where we've already moved from hierarchical interfaces to search based interfaces. It seemed logical that we could learn from that metaphor.

On the technical side of things I'd just written the listview search line class (used in JuK) that's now fairly prevalent in KDE that makes filtering of information in lists much easier, so that played into things too.

What do you think of other search tools such as GNOME's Beagle and Google's Desktop Search?

Well, they're fundamentally different in scope. Again, right now the term "desktop search" actually means something; that wasn't really true when I started working on these ideas this summer. So while there are some things in common, they're really pretty different approaches.

Beagle, Spotlight, Google for the Desktop, and their relatives are more interested in static indexing and search through that information. That's kind of where I was at conceptually early this summer when I coded the first mock-up. Since then however the ideas have moved on quite a bit and I think we've actually got something rather more interesting up our proverbial sleeves. (I should note however that I think the Beagle group is doing fine work, but it's something pretty different from what I'm interested in.)

The first difference is that this is a framework, not a tool. Beagle has some elements of this, but it's still not integrated into the core of the desktop. Google for the Desktop is mostly just a standalone tool from what I know of it. Honestly I think it's really below the level of innovation that I tend to expect from Google.

What we're now looking for in the KDE 4 infrastructure is a general way of linking information and storing contextual information — that information can come from meta-data, usage patterns, explicit relationships and a host of other places.

There won't be a single interface to this set of contextual information; we'll provide some basic APIs for accessing the components in KDE applications, but we're quite interested in seeing what application authors will think to do with it. Really I think they'll surprise us.

We're looking at everything from reorganizing KControl to make search and related items and usage patterns more prevalent to annotating mails or documents with notes to reworking file dialogs. Really the scope is pretty broad.

Do you think Free Software solutions from KDE and GNOME can compete with the likes of Google and Microsoft?

Sure. I mean — I don't think the ability to compete with commercial players is significantly different with desktop search than it is with other components of the desktop. And honestly I think we've kind of got a head start here.

Has there been any progress on planning or coding search into KDE yet? Is anyone helping you? What problems are you facing?

There have been a number of cycles through some API and database design sketches. But right now we tend to write code and as soon as it's done we've realized the flaws in it and start rewriting. This will probably continue for a while, but I think we'll be able to have something pretty useful in KDE 4.

There are a number of folks involved in discussion of these issues from various sub-projects inside of KDE. Thusfar it's been mostly myself and Aaron Seigo banging on the API, but others have contributed to the discussions.

I think the biggest problem that we're dealing with is moving from the abstract set of ideas that we're working with into real APIs — trying to keep things general enough to stay as extensible as we'd like them to be, but not so lofty that they're convoluted and useless.

What technologies do you plan on using, e.g. which database?

Well, we've gravitated towards Postgres, but mostly because of licensing. Other than that, well, uhm, we're using Qt. The Qt 4 SQL API seems much improved, so I've kind of been mentally stalling on really finishing up the current code until I can just work with that since otherwise everything would just have to be rewritten in a few weeks.

Is the KDE search tool likely to be cross desktop compatible so we could have a common base with Gnome?

Well, again, this really isn't about a "KDE search tool" -- and the chances of it being GNOME compatible out of the box aren't particularly high. That said, as the data store will just be a Postgres database and ideally we won't have to use too many complex serialized types, there wouldn't be a reason that a GNOME frontend couldn't be written. But generally speaking I'd like to get the technology laid down and then see if we can convince others to adopt it rather than the other way around.

What does the project need most now?

Time. And I mean that in a few ways — we need time to finish fleshing out the ideas, time to let the stuff mature inside of KDE and well, the couple of us working on it could use more time for such. But really as most of the framework for things like metadata collection and whatnot are already inside of KDE this won't be a huge project from the framework side. What will take a good while will be porting over applications to use it where appropriate.

Dot Categories: 

Comments

by me (not verified)

oh...interesting!

I'd like to store thousands of big images (between 10 and 200mb a piece) and need to be notified when they are changed...

Does your plugin have a webpage? I've been looking for something like that, and I've grown frustrated with the available solutions (fam, enhanced dnotify and even www.dazuko.org suck), so I'd like to know more...

Is reiser4 stable?

by wilbert (not verified)

> Is reiser4 stable?

According to www.namesys.com/v4/v4.html: "We must caution that just as Linux 2.6 is not yet as stable as Linux 2.4, it will also be some substantial time before V4 is as stable as V3." (as of 29 dec 2004).

So may be not yet, but testing will definitely be very, very interesting!

by Charles Samuels (not verified)

In my experience, it is.

by Arun Raghavan (not verified)

Would inotify (http://lwn.net/Articles/104343/) help?

by Derek Kite (not verified)

The discussions about what DB backend really are irrelevant. Even the front end, user interface is not really the most important. The middleware, what data is archived and indexed, and how contexts and patterns are matched is the key.

I want something that recognizes contexts of activity. My work patterns are usually by blocks; I sit down and write and assemble the digest. I sit and read my favorite blogs. I read and sort my email. Those are the regular blocks. Then the projects that I work on, ie taxes, planning a trip, researching a specific subject, work tasks such as proposals and product research, etc.

For example, in june of last year I was researching travel in europe since my daughter was travelling there, and I needed to figure out how to her get somewhere. I found interesting sites, some helpful emails came in, including correspondance with my daughter. There was a pattern to that activity. Say I want to arrange a trip for myself and want to find all those sites I found helpful. So I start looking, and keywords london, paris, europe, airline, low-cost come up. Same keywords, same context. The indexing/data retrieval system that recognizes the context, suggests how to replicate the previous context.

It isn't simply data that is indexed, but time, duration, frequency, context, what application. I can search my datafiles quite easily with grep. But I can't for the life of me remember what tax filing software I used last year. Or where that interesting blog on the NHL strike was. The only time recently where I remembering wishing I had an index of a bunch of data files is when reading product documentation pdf's on a cdrom where the filenames were 6 digits.

I don't want something that tells me I have Results 1 - 10 of about 220,000,000 for linux. I know I got thousands of references to KDE on my hard drive. I want a maximum of 20 selections based on the context I am working in.

This would obviously entail hooks into the various data streams. And some kind of realtime archiving and pattern matching. And possibly background data mining. An api is best since applications sometimes know the best way to work with the data that they produce.

This is neat stuff.

Derek

by Aaron J. Seigo (not verified)

> It isn't simply data that is indexed, but time, duration,
> frequency, context, what application.

bingo! you've got it!

and add identity, source and destination to your list. probably others as we go along =)

by Jos (not verified)

Oh yes, i'd love to have a source meta data for every file I've downloaded.

Of course, I'd like to have a source for every line I copy and paste etc. but's going a bit to far.

If file storage is being worked over so thoroughly, an integrated versioning solution would be an ultracool feature.

Cheers, Jos

by James Richard Tyrer (not verified)

Perhaps I am missing something here, but what I want to start with is that I have a directory with a bunch of HTML files in it and I want to be able to search them for content just like Google searches the web.

Will you project do this? or am I talking about something else?

It seems like a KDE front end for ht:/Dig would do what I want.

--
JRT

full text search is a subset of what this will do.

you could certainly say "html documents containing foo bar baz" and it would return what you're suggesting.

but it allows so much more than that as well.

by jameth (not verified)

"that information can come from meta-data, usage patterns, explicit relationships and a host of other places"

Whenever I talk to people or look into the stuff out there, they only seem concerned about meta-data and full-text search! I love that you are planning for both usage patterns and explicit relationships to be included.

Also, on that note, I hope you are considering how the data will be inputted by the user. It can be extremely useful if the user input method allows for them to understand the organizational system without adding complexity. The only model for this that I've found to be useful is that of categorization.

The user has categories which they define and place files into, allowing a file to be in as many categories as they desire. Then, the system can use those categories as a good way to narrow searches. For example, I would categorize everything into at least one of four categories: Work, Personal Work, Entertainment, or Belonging to Someone Else. But, at the same time, something in any one of those categories could be in several others. For example, for myself, a lot of all those categories would be in: Writing, Gaming, and/or Art.

With a good categorization system (maybe visualize it as a set of directories with check-boxes to determine which categories it goes into) I could swiftly and easily place a file while saving it, at least as quickly as I can organize into directories right now. And, if I didn't categorize something right away, it could be automatically be tossed into a category such as 'unsorted' or whatever, so I knew that I hadn't organized it yet.

Further, this is a portion of the organizational system that might be representable in a real filesystem, which means the save dialog wouldn't be completely useless. I asked someone who knows more about file-system performance than I do, and they said that it was perfectly feasible to have directories for each category and hard-link files throughout them. They even said there shouldn't be any performance issues if you used a modern OS. And, a search organization system which can also be somewhat used from a standard browser would be nifty.

And, to go back to my original point, cool! I hope the entire system can be gotten working for KDE 4.0, because it sounds awesome.

by Aaron J. Seigo (not verified)

> Also, on that note, I hope you are considering how the data will be
> inputted by the user.

the goal is to have as little extra data input overhead as possible. expecting users to label documents with descriptive tags breaks down really quickly for a variety of reasons. instead we will be using information that is already there as well as offering ways to author information that implicitly builds this information (on the world wide web, hyperlinks provide such a thing).

> this is a portion of the organizational system that might be representable in
> a real filesystem

locally stored files are only a subset of the information that will make up your personal linkage store. the idea of "information equals document, document equals file on disk, therefore information equals file on disc" is one that we feel is antiquated and getting in the way. not all information is document or storable as a file on disk. this is one reason why we aren't using the file system to store this information.

another reason is that linkage meshes get very dense very quickly when all the relationship information is poured in. i really don't see this scaling well on a literal filesystem layout, nor do i see it being sensicle to someone browsing such a file system.

> I hope the entire system can be gotten working for KDE 4.0

we'll do our best =)

by Illissius (not verified)

just thinking out loud a bit...
I assume you've tried GMail? GMail's labels are essentially the same thing as the grandparent's 'categories'. And while applying the labels to everything manually would indeed be a chore and a half, luckily GMail also has filters :) (so f. ex. everything addressed to [email protected] automatically gets the 'kde-cvs' label). So perhaps something analogous to these filters is what's necessary to make the idea workable. Or on second thought, this would serve basically the same purpose as 'virtual folders', eg, saving a search for quick access later... whether you 'filter' everything into a 'label' on the fly, or just execute the search again whenever the 'virtual folder' is accessed, is basically an implementation detail (albeit, a rather large one), the result to the user is essentially the same... though with the former approach you would also have the flexibility of manually adding/removing items to/from it.

by Derek Kite (not verified)

>as little extra data input overhead as possible

The basic assumption is that the information to categorize, or contextualize (sp?) is in the data and usage patterns. Otherwise this becomes another complicated organizational scheme that needs maintenance.

Another example. I've got an extensive and complicated list of kmail filters that attempt to categorize the incoming emails. I've got a dozen or so family and personal correspondants in one folder. The vendors, such as amazon in another. Registrations and authentications for various sites in another. kde-cvs is in one large folder, divided by module. Spam is handled another way, etc.

The patterns are obvious and quite simple. A few days worth of traffic and usage pattern would give enough information to come up with a similar sorting scheme. Or at least something close that could be easily fine tuned. With the basic contexts recognized, then my inherent skills at noticing anomalies would be used to put the final touches on the system.

If the engine spits out garbage, the interface would be very tricky to put together. Another vast font of useless information thrown in your face. If the engine is capable of narrowing down contexts, and the information is genuinely useful, the interface issues become quite simple. It's easy to present delicious food or beautiful art.

Derek

by jameth (not verified)

>>as little extra data input overhead as possible

>The basic assumption is that the information to categorize, or contextualize (sp?) is in the data and usage patterns. Otherwise this becomes another complicated organizational scheme that needs maintenance.

The system cannot solely rely on usage patterns. Such a reliance results in very easy to lose data. For example, I record the data for my FAFSA and update it every year. That data is used one day a year and has been used twice ever. If usage patterns were how this data were tracked, I wouldn't be able to find it. Of course, that might be found just by the fact that I know the name, but there are some things that are not so easily tracked by name, have no searchable contents, and are rarely used but important. Thus, the system needs make it very easy for users to intentionally track data.

Of course, the intent is not to replace all other ways to access information (at least, I think it isn't) but that is fairly likely to happen if everything goes smoothly. If the system is done right, it will be more efficient 90% of the time, which means that users will learn how to use it and will usually go to it. About 5% of the time, it will be less efficient but they won't know that going in, so they'll use it anyway. Then, once they're using it 95% of the time, they'll start using it all the time as they stop using other methods and forget about them. For that reason, it needs to be more efficient 100% of the time from the start, or some very serious problems can arise.

(Of course, that isn't some proven theory and the statistics are just random examples, but I've seen it happen. Many people have trouble with offline data sources because they are so used to using Google, or even have trouble with page navigation online because they can't just type in a search query.)

by jameth (not verified)

>> this is a portion of the organizational system that might be representable in a real filesystem

> locally stored files are only a subset of the information that will make up your personal linkage store. the idea of "information equals document, document equals file on disk, therefore information equals file on disc" is one that we feel is antiquated and getting in the way. not all information is document or storable as a file on disk. this is one reason why we aren't using the file system to store this information.

That some of the information isn't online doesn't change that categorization is a good way to organize it. Many of my offline categories overlap with my online categories, such as Art and Writing. The categories are just a way to browse it.

> another reason is that linkage meshes get very dense very quickly when all the relationship information is poured in. i really don't see this scaling well on a literal filesystem layout, nor do i see it being sensicle to someone browsing such a file system.

I was referring to storing only the categorization information in the filesystem, not the rest, and that just being a mirroring. The purpose of that is to avoid having the filesystem itself being incapable of organizing data. This is a serious concern because, once using mostly this new organizational system, many people will just rely on it. Then, when browsing with another method, the filesystem may be 100% nonsensical.

That mirroring this the categories to the filesystem would be worse than current organizational systems doesn't mean it wouldn't be better than doing nothing of the sort.

(I'm not trying to tear down the idea as a whole or rebuild it in some new image, just trying to point out some potential problems. Thanks for all the good work.)

by jesusfish (not verified)

I remember hearing Nat Freidman at RealWorld Linux last year speak on something similar, and he showed an app that demonstrated this (what may actually be Beagle now, I'm not quite sure). It would be a really big step to have technology and tools like this I think.

The whole concept resolves around connecting ideas, whether they be data, programs, time, etc. I think Derek hit it right where it's at. When I use my computer, my actions are all motivated by thoughts. I want to know about this, I want to do that, etc. Imagine how nice it would be to connect everything on your desktop relative to a particular thought. An example, say you're working on a wesbite...you could essentially find every program, file, search, etc, that is related to that at one time. It would be like telling your desktop that you want to work on that site, and everything you need is neatly retrieved and organized for you. That is convenience.

Hopefully I've got this whole concept correct, or I sound quite dumb.

by Aaron J. Seigo (not verified)

> Nat Freidman at RealWorld Linux last year speak on something similar

just to repeat Scott here: Beagle and this project really aren't the same sort of thing. you can do similar things with both concepts, but the approach taken is vastly different and there is a host of possibilities that just aren't possible with Beagle/Spotlight/Google Desktop Search type systems that are with where we're going with this. this isn't to say Beagle et al are uninteresting or poorly done, they are just a different type of technology with a slightly different set of goals.

> The whole concept re[v]olves around connecting ideas ... That is convenience.

exactly. =)

by Ferdinand (not verified)

Not that a little anecdote about BeOS would lead us anywhere, but since ReiserFS4 was brought up, it inevitably reminded me of BeOS and its radical goals to build an OS around a, what may be called, database driven storage paradigm rather than the hierarchical organization of file systems. Something that was tried several times, technologically superior but never widely and successful disseminated - you may be forgiven for thinking that there are many software and hardware de-facto standards out there that had better not come into existence.

That much said, it may be helpful and also a bit insane to encourage such efforts that implement more effective solutions by designing them with a user centric focus rather than a feasibility driven approach in mind. Disseminating such new systems may require overcoming seemingly insurmountable thresholds imposed by incompatibilities of existing applications with previous APIs, concepts, designs and last but not least architectures.

The ineffectiveness of machines is typically the result of fundamental design flaws. So, beware and balance the requirements to befriend software evolution!

by Aaron J. Seigo (not verified)

the trick is getting, as you seem to be hinting at, useful applications of these tools out there for public consumption. just because something is theoretically cool, who cares unless it lets you do something practically useful. something so wildly useful that it becomes a compelling reason to seek out that technology.

before visicalc, PCs were not all that interesting to most people. PCs were a radical concept that had so much potential, but it sat there largely unrealized until visicalc. but visicalc didn't add any new capabilities to personal computer hardware or operating systems, but they exposed the existing capabilities through an actual, unique and useful application of those capabilities. suddenly PCs were interesting. and in demand.

database centric desktop concepts have been tossed around for many, many years, particularly in academia but even occasionally in the marketplace as well. none have faired particularly well. it's the old "PC without visicalc" problem.

and this is where working within the KDE project is important. we have a whole desktop here with hundreds of applications in KDE's CVS. most of the developers tend to be in close virtual proximity to each other. as this technology becomes available for consumption, we won't have to go out and shop for third party developers or try and tack ourselves onto the side of a web browser. as a whole community of developers, KDE can deliver an entire environment of applications that make the most of this technology from day 1.

these applications will make all the difference IMO. which is why we are designing this primarily from an application developer's needs (which includes things like reusable, easily integrated user interface components). we believe that given a functional and open-ended API to these tools that we will be amazed and surprised at the unique uses application developers find for them.

by Ian Monroe (not verified)

For being so early in development, this plan reeks of not being vaporware. :)

by Derek Kite (not verified)

Think horsepower.

OS/2 had a useful ability in the Workplace Shell where you could have a folder containing documents, and when you opened or closed the folder, the documents would be opened or closed, ie. applications would be started or shutdown.

At the time I was using a 386. I remember buying memory at $100 for a megabyte. This useful feature was useless because it was so slow. Not get a cup of coffee slow, but go eat lunch slow. At least on the machine that I could afford.

This wasn't rocket science. It was a matter of not having adequate resources.

Some kind of desktop data indexing and retrieval is possible because of the enormous power we have in our desktop computers.

Derek

by Dan Housman (not verified)

I found this article/project interesting. We are a small software company working on a Windows based solution similar in scope. The product is called Viapoint (http://www.viapoint.com). I doubt the true Linux folks will want to play with it but it could be interesting to keep track of how the products evolve across operating systems. We have been calling the product a Smart Organizer and are looking to avoid building desktop search components by calling on Google's Desktop Search APIs or an equivalent so that we can focus on building the context sensitive part of the application as well as functionality to help a user actually work. You can download for free if you want to check it out. I'll have to get Linux or watch videos to see this tool.

by Chris (not verified)

Anybody ever looked at Kat Desktop Search project?

http://kat.mandriva.com/

What's everyone's opinion on using this technology instead of a completely new one?

by cm (not verified)

Actually, if I read http://websvn.kde.org/trunk/playground/base/tenor/ correctly then kat's main developer Roberto Cappuccio is the last person to commit to Tenor. So don't worry.