Pervasive Search in KDE?

It seems that other news web sites are picking up on the news of the KDE Desktop event in Ludwigsburg. ZDNet is running an article about the new desktop search engine KDE hackers are working on here at aKademy. Aaron Seigo, prominent KDE hacker and usability expert, answered some questions during a telephone interview and gave the outside world some insight into what is happening here at aKademy. You can also find a transcript of the talk by JuK hacker Scott Wheeler on search and meta data ideas for KDE 4 here.

Comments

by Richard Moore (not verified)

Having some sort of indexing daemon that uses meta data is hard to do well, but certainly worth the attempt. One thing to consider is the possibility of making use of derived information as well as information that can be extracted from file contents etc. An example would be noticing that a pair of files are generally used together and providing some mechanism for seeing them as a group.

by superstoned (not verified)

I hope they can make this work with every filesystem, and fast... And I hope KDE will be able to utilise the features of ReiserFS4, to speed up searching meta-data on systems that have ReiserFS4.

Umm, what's your second guess? There's a small but loud group of BSD/KDE users out there. It would not be the first time KDE's capabilities had to be limited because of porting issues.

by Kamil Kisiel (not verified)

So? The indexer can support ReiserFS 4 on systems that have it, and fall back to a different database on systems that don't.

by More-OSX_KDE_In... (not verified)

BSD supports XATTRs ... what else is needed?

Well, I've suggested similar things a few times with no luck. Reason? Not portable. Hope you're right and this case is different (more hyped).

by Derek Kite (not verified)

Hard to do well indeed. Done wrong it simply becomes more information to clog the mental arteries.

Your idea of a pair of files is interesting. How about a little more, making up a context? The metadata would already have what app created it and the time, and the data itself contains the subject matter. Throw this all together in a way that could be analyzed, and patterns would emerge. For example: oh, he's looking for a plane ticket to Europe, more precisely Paris; he's asking about it on IRC using ksirc, searching Google with Konqueror, writing an email using KMail to so-and-so. That is how I work, and I am frustrated at having to recreate the work environment the next time I want to look further.

Having the ability to tag with metadata could help, but the system could also come up with reasonable tags based on the context it perceives. Files are only a small part of the workflow that happens on a desktop so metadata would limit this capability to files that I save.

Derek

by fred (not verified)

Semantic analysis of content could be more important. Language analysis, etc.

by LightweightIndex (not verified)

... well that is, theoretically, anything that supports XFS and metadata.

http://sourceforge.net/projects/doxfs/

by Aaron J. Seigo (not verified)

it's interesting to note how much information we throw away when working with data in applications, such as source and identity. emails may come from the office, from friends, from the Internet, etc... as just one example. we don't use any of this information in our applications currently, which represents a huge loss in information. there is more than simply indexing the data that needs to be looked at, just as Google uses links between pages in addition to the data itself.

and remember that this is all applicable to more than just files in the file system. it's applicable to all sorts of data sets, such as control panels and documentation.

by Stefan Q. (not verified)

...And bookmarks!
As my bookmarks file is growing constantly, I'm losing control more and more.

Often I need to retrieve some information I read on the internet some time before.
But I cannot because it's simply too much work to sort everything somewhere into a deeply nested bookmark tree.

A solution could be a search based on web sites I have visited (or bookmarked).

by Vajsravana (not verified)

Scott says:
"One hurdle is that when saving, a dialogue should prompt the user to add metadata as well as the filename."

I would strongly advise you KDE people not to count on this!
You say the hierarchical model is obsolete because there is "too much data"; well, for exactly the same reason, the last thing you can expect is for a human user to manually "rank on save" all of that data!

Let me explain with some examples:
Would you accept to fill a metadata dialogue (even if it's just some pair of fields!) for every file you download with konqueror? For every attachment you receive with kmail? For every file of every archive you unpack with ark???
I bet these three examples are not even the beginning of "Scott's 30000 files", and it's already way, way too much!
Imagine adding metadata via a manual dialogue to all the FTP directories you mirror, or all the sources you download via CVS, just to be able to find a snippet of code you need every once in a while. No one will do it.

Of course, any ranking system works better if the user chooses to provide metainfo manually... but this will always be the exception, not the rule.

What is needed is a smart way to automatically analyze files, recovering metainfo that can then be edited or completed by hand via an optional dialogue. The same old story of "providing a good default and a way to edit it", isn't it? :)

Put another way: one cannot expect a human user to manually rank more than a few hundred files... but for a subset of data that small, he'd be better off using the old hierarchical model!

Just my 2 eurocents, hoping you'll find them useful :)

by Maurizio Colucci (not verified)

> Would you accept to fill a metadata dialogue
> (even if it's just some pair of fields!) for
> every file you download with konqueror? For
> every attachment you receive with kmail? For
> every file of every archive you unpack with ark???

You are already doing that. Specifying a directory for the file is a way of adding metadata, only with many more unnecessary limitations.

by Boudewijn Rempt (not verified)

But saving somewhere sensible is very easy to do -- but, in fact, most people don't even do _that_. Most people save their files where the application does so by default. When Winword 2.0 was still current, the place to go look for files people had lost was c:/winword. Nowadays, everything gets saved on the desktop.

How many people do something useful with the document information dialog KOffice has for each and every application? I've never met anyone who wasn't a compulsive menu-option checker who even knew it was there.

Other office suites have the same thing: a dialog to enter meta data. One version of Winword (or was it WordPerfect? -- I forget, it's a decade since I used Windows for real) even presented that dialog by default on saving the document. That was one option you could reliably count on people to find & disable, even in the notorious settings dialog of WordPerfect (or was it Winword?).

iTunes has metadata, too. But I've never entered any of it: if iTunes cannot automatically grab the information from somewhere, the music gets classified under Unknown Artist/Unknown Album. And even that's not irritating enough for me to do something about it. But that's only me: I've never seen (with my own eyes) anyone else use iTunes for anything but listening to internet radio stations.

And why should people be required to manually add metadata? All important information is already _in_ the document, innit? That's why they created it, right there, where they can use it, not their computer.

Asking, requiring or begging people to add metadata to the files they add does not work. At all. Never. Nowhere. Categorically. Not on your nelly. Useful metadata derives from the content of the document, the revision history, the number of times accessed/altered, original creator and contributors, in short, all the things that can be done automatically.

by Sven Langkamp (not verified)

In a completely meta-based system there will be no more directories.

It should be possible to enter meta data manually or to generate it automatically.
This could be a KWord document:
There is no real need for a filename or a directory, because this could be handled by the system.
A unique document can be described automatically by its mimetype, author, date, things that can be generated from the document itself like headings or image data, etc.

The way this meta data is generated should be specified by freedesktop.org so that it can be shared all over the desktop. Acceptance would also be greater if this project were hosted there.

I think it would be more a "semantic model" of the environment: something like Cyc (http://www.opencyc.org) for the KDE environment, with data structures that describe the metadata and the structural relationships of your operating system. No, collecting and storing meta data is not enough.

by Aaron J. Seigo (not verified)

> What is needed is a smart way to automatically analyze files, recovering
> metainfo that can then be edited or completed by hand via an
> optional dialogue.

this is the essence of what Scott's prototype demo does using KMetaFileInfo and a database to store the results for quick querying. and yes, it's damn fast once the initial indexing is done. the data scope needs to be extended, the indexing looked at and the db schema made more generic but for a proof-of-concept it was pretty compelling.
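
A minimal sketch of the general idea (extract metadata once, then answer queries from a database), written in Python rather than against KMetaFileInfo. It is not the prototype's code: the table layout, the function names `build_index` and `find_by_mimetype`, and the choice of SQLite are all assumptions made for illustration, and the "metadata" here is only what the file name and `stat` give for free.

```python
import mimetypes
import os
import sqlite3


def build_index(conn, root):
    """Walk `root` once and store cheap, automatically derived metadata."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, name TEXT, mimetype TEXT, "
        "size INTEGER, mtime REAL)"
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            # Guess the type from the file name only (hypothetical stand-in
            # for a real metadata extractor).
            mimetype = mimetypes.guess_type(name)[0] or "application/octet-stream"
            conn.execute(
                "INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?, ?)",
                (path, name, mimetype, info.st_size, info.st_mtime),
            )
    conn.commit()


def find_by_mimetype(conn, mimetype):
    """Answer a query from the database instead of re-reading the disk."""
    rows = conn.execute(
        "SELECT path FROM files WHERE mimetype = ? ORDER BY path", (mimetype,)
    )
    return [row[0] for row in rows]
```

The slow part is the initial walk; after that, a lookup is a single indexed query, which matches the "fast once the initial indexing is done" observation.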

by Mark (not verified)

I am looking forward to ending my reliance on a HFS. I don't want to traverse a tree just to find a file I want to work with.

If they can get Google-like quality results (and speed), I'll be quite happy.

But please, don't make me _always_ manually enter metadata. Instead, have the system generate the metadata, including keywords, itself. Then give me the option of accepting or modifying it as required.

They could also use classes of files, specified in the search box, to make things easier. Maybe like `sys:network ethernet config` to find system files related to configuring an ethernet network.
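
A rough sketch of how such a search box string might be split into file classes and plain keywords. The `prefix:value` syntax is taken from the example above; the function name `parse_query` and the rule that a token without both a prefix and a value counts as an ordinary keyword are assumptions.

```python
def parse_query(text):
    """Split a search string into optional file classes and plain keywords."""
    classes = []
    keywords = []
    for token in text.split():
        prefix, sep, value = token.partition(":")
        if sep and prefix and value:
            # e.g. "sys:network" -> ("sys", "network")
            classes.append((prefix.lower(), value.lower()))
        else:
            keywords.append(token.lower())
    return classes, keywords
```

With this, `parse_query("sys:network ethernet config")` yields one class, `("sys", "network")`, and the keywords `ethernet` and `config`; a search backend could then restrict the candidates by class before matching keywords.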

IMHO, this is the single most important improvement to be made for KDE 3.4/4. Can I vote for it on bugs.kde.org?

Oh, and thanks for KDE 3.3. It is yet another quality release!

Mark

by Jeroen (not verified)

Some people will look for eth0 or eth config or eth cfg - if it's supposed to work like google it'll be gobbling up lots of disk space, CPU time and memory to hold, index & suspend its text index.

What might be handy is a ranked system that uses something like mysql's full text search on all text and text-like documents.

A system could weight the relevancy of the text documents. If I have the whole kernel tree installed I'd never really need the search to go there because I am probably looking for text in my own files. But a kernel developer might want a different weighting. The system somehow needs to learn or ask about the user's bias for certain file types/systems/directories.
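
One way to picture this per-user weighting, as a small Python sketch. The `bias` table mapping directory prefixes to multipliers, the function name `rank`, and the example paths and scores are all hypothetical; how the bias would be learned or asked for is left open.

```python
def rank(matches, bias, default_weight=1.0):
    """Order (path, text_score) pairs by score scaled with a per-directory bias."""
    def weight(path):
        # Use the longest configured prefix that contains the path.
        best = ""
        for prefix in bias:
            if path.startswith(prefix) and len(prefix) > len(best):
                best = prefix
        return bias.get(best, default_weight)

    scored = [(score * weight(path), path) for path, score in matches]
    # Highest weighted score first; ties are broken alphabetically.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [path for _score, path in scored]
```

An ordinary user might give `/usr/src/linux/` a weight of 0.1 so that kernel sources sink below personal files, while a kernel developer might give the same prefix a weight of 2.0.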

I think a really good search system has to have an involved user; on google you can find the right results when you know how to look for them, on KDE that might mean getting users involved in maintaining relevancy in the index. It's far too much to ask users what a file is when it's saved.

by Derek Kite (not verified)

How about watching how many times the files or folder is accessed? Or determining in what context the document was created (by content or other app activity)?

Derek

by More-OSX_KDE_In... (not verified)

And separating

XATTR values could be set for data vs. system files, say, the same way SELinux MAC security policies use a sort of "state machine" per file to set contexts for access etc. by using extended attributes.

This would work on BSD and Linux ... actually on any system where the FS has extended attributes.

Per-user and per-folder search would be great, and some of those advanced graphical "image" search applications (there are OSS libs for visual searching of image data) would be awesome.

Reiserfs4 has the concept of plugins of course... so maybe some of this would be handled at a lower level on those sorts of systems.

by Spy Hunter (not verified)

Yes. If we want this to work like Google we need to find some source of useful information about files that people don't have to enter manually. File usage data is the most obvious candidate. Automatically extracted summary data is a second choice, and full-text search is a third.

If you want to get really fancy you could notice which files a user is using at the same time and associate them. Then when a search finds one file, the other related files could be presented as well. I imagine this would work really well if you were working on a project that consisted of several files. You could sit down at your computer, type in a keyword that brings up one of the project files, and the others would already be on the screen ready to be clicked on. That way you wouldn't have to organize the files yourself.
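
A small sketch of that association idea, assuming the desktop can already report which files were open together in a "session" (how sessions are detected is not addressed here). The names `co_usage` and `related`, the `min_count` threshold, and the example file names are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def co_usage(sessions):
    """Count how often each pair of files was open in the same session."""
    pairs = Counter()
    for files in sessions:
        # sorted(set(...)) gives each unordered pair one canonical key.
        for a, b in combinations(sorted(set(files)), 2):
            pairs[(a, b)] += 1
    return pairs


def related(pairs, path, min_count=2):
    """Return files seen together with `path` at least `min_count` times."""
    hits = []
    for (a, b), count in pairs.items():
        if count >= min_count and path in (a, b):
            hits.append((count, b if a == path else a))
    hits.sort(key=lambda item: (-item[0], item[1]))
    return [other for _count, other in hits]
```

When a search finds `report.kwd`, calling `related(pairs, "report.kwd")` would list the files that have repeatedly been open alongside it, so they can be offered next to the hit.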

by wilbert (not verified)

On a side note: in Digikam I tag my photos, in JuK or amaroK my music (by tagging and by creating playlists), in Kontact my addresses (as belonging in one or more groups) and appointments, etc.

It would be very nice if I could tag other files and directories as well, so I can relate projects, e-mails, people, music, pictures, etc. to each other. These tags would work across all apps. (Reiser4 should be able to do that.)

So if I open an email from somebody, somehow I could quickly open a file I was working on, together with him/her. If I work on a project, I could quickly open related files, mail or page related persons, visit related websites etc.

But I think I would like files and directories to continue to exist.

by PaulSeamons (not verified)

Article said, "and were replaced by, for example, Google, which makes things easy to find..."

They were not replaced by Google, otherwise Google would have no function. Google is a layer placed on top. It is a non-realtime index of mechanically discoverable data. That is what we need for the desktop: a non-realtime index that will mechanically discover meta-data about the files on the system, and that can layer on top of any filesystem (local or remote). Local filesystems can become more realtime, but usually if you are looking for data, it is because you haven't used it lately, and it should be indexed by the time you need to look for it.
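
A sketch of what "non-realtime" could mean in practice: a pass that runs every so often, walks the tree, and only reprocesses files whose modification time has changed. It uses nothing but ordinary directory listing and `stat`, so it would sit on top of any mounted filesystem. The function name `refresh` and the plain dictionary used as the index are assumptions, and the index is assumed to cover a single root.

```python
import os


def refresh(index, root):
    """Bring `index` (path -> mtime) up to date; return the paths that changed."""
    changed = []
    seen = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                continue  # vanished or unreadable since the walk began
            seen.add(path)
            if index.get(path) != mtime:
                index[path] = mtime
                changed.append(path)
    # Forget files that were deleted since the previous pass.
    for path in set(index) - seen:
        del index[path]
    return sorted(changed)
```

Only the paths returned by `refresh` would need their metadata extracted again, which keeps the repeated passes cheap compared with the first one.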

by Aaron J. Seigo (not verified)

yes, and it's also important to realize that many datasets change rarely if ever during the lifespan of a desktop. our files are very dynamic, tend to be huge in number and diverse in data types so they are something of a worst case scenario. but having a layer on top of our (programmatically friendly) hierarchical systems can make these large, dynamic data sets human friendly.

by Jupp (not verified)

... grammar checking module and style checker module

by Anonymous (not verified)

What about perverse searches in KDE? IMHO, this would be quite useful for porn collections.

by yo (not verified)

It would be nice if the application, or a separate process, indexed the content of the data a user is working on. The filesystem could also remember which files or data are accessed most, and that could be used as a ranking signal for search.
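
A toy version of access-based ranking in Python. Here the counting is done by a hypothetical `AccessLog` object that applications would have to call on every open; whether the filesystem, KIO, or something else would actually record this is an open question.

```python
from collections import Counter


class AccessLog:
    """Remember how often each file is opened and rank search hits by that."""

    def __init__(self):
        self.counts = Counter()

    def record(self, path):
        self.counts[path] += 1

    def rank(self, hits):
        # Most-used first; ties fall back to alphabetical order.
        return sorted(hits, key=lambda path: (-self.counts[path], path))
```

Files that were never recorded simply count as zero, so they end up after anything the user has actually opened.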

by Mike (not verified)

That's what I have been doing for a long time. ALT+F2, which is something like the input field of a search engine :-), then typing the name of the program saves me from clicking through the hierarchical K-Menu.
Thanks KDE-Team!

by Fred Schättgen (not verified)

I'm really happy to see so many people talk about the idea of a search engine for KDE. We find things faster with Google than on our own hard disk - there is no better way to describe that deficit of current desktops than with Scott's words.

KDE already has a lot of the infrastructure needed to solve this problem in place. There is a framework to extract metadata from files, and KDE's search utility can even search this metadata. But still the real potential of all this is not utilized enough.

Now I'm just hoping that the plans for a desktop search engine don't get *too* ambitious. Some are already talking about getting rid of the usual hierarchical filesystems completely in favour of a metadata based search facility. While such ideas are great for a research project, I don't think it makes sense for a production system like KDE at this time.

Like Scott said - ranking is probably the most difficult part of the problem. But the consequence should be that we solve the more simple parts of the problem first.

The key to google's success was not only the sophisticated ranking system, which we can't use for arbitrary files anyway, but also its speed. Google is *simple*. You can't use arbitrarily complex search queries. Instead you pass it a few keywords and you get the result *instantly*.

The first step should be to solve the "simple" problem fast and efficiently.
We shouldn't waste too much time thinking about how to make hierarchical file systems obsolete while we still can't even do a fast search for a letter saved as PDF, for instance.

A good example is KMail. You could already search for documents a long time ago. But now there is a search bar, where you can simply type a few keywords and you get the results immediately. No sophisticated ranking, no complicated query language. Just fast to access and fast results.
This is the way to go in my opinion.

by AntiGuru (not verified)

according to the article ... sheesh ... and MS Research produces what exactly?? .NET ... hmm, most of what MS produces they have bought elsewhere, and .NET/C# don't seem worth 7 billion somehow (I'm sure they will be cool and all that, but Python, Java and Perl/Parrot are cool too and didn't cost 7 billion).

The spare change must get spent on HID since the Windows UI is not exactly easy to use ...

Hopefully this is one more reason MS is going to go the way of the dinosaur: extremely expensive and unproductive research.

by Jeroen (not verified)

http://www.apple.com/quicktime/qtv/expo04/

It's about halfway through the movie and shows search in the upcoming OS X Tiger. It's very nice and it does pretty much exactly what it needs to do...

The real advantage that MS and Apple have with their desktops is that they can make assumptions about what email clients people are using; KDE can assume people use KMail, but we have to allow searching of Thunderbird and Evolution data as well.