The Road to KDE 4: Strigi and File Information Extraction

After a short delay due to a heavy dose of Real Life(tm), I return to bring you more on the technologies behind KDE 4. This week I am featuring Strigi, an information extraction subsystem that is being fully deployed for KDE 4.0. KDE has previously been able to extract information about files of various types, and has used that information in a variety of contexts, such as the Properties Dialog. Strigi promises many improvements over the existing mechanisms. Read on for more...

Strigi is a library that sits at a lower level than KDE. It is written in C++, and is designed to present a series of generic calls that a program can use to find more information about a given file or files. It is in no way tied to KDE except that the development version lives in KDE's SVN repository. It also has search capabilities, which are not really the focus of this article.

The Strigi libraries are used to get information from within files, such as the dimensions of an image, the length of an audio clip, embedded thumbnails, the number of lines in a log, or source code licensing info, or just to search a text file for a given string. Strigi has other advantages, as it can work seamlessly inside compressed files, archives, and so forth. In fact, it ships with a couple of useful utility programs, called deepgrep and deepfind. These command line programs allow you to search for information within binary file formats as easily as using grep or find on plain text files. KDE is inheriting the same libraries, so we also get this unique advantage of being able to pull information out of files that are buried within binary formats, such as .tgz files.

There is a prototype kio_jstreams powered by Strigi that treats archives like local folders, allowing you to visit /home/user/tarball.tar.gz/icons/ for example... This works great when you are using solely KDE integrated applications, but there are currently problems when mixing with other programs. For example, if you're browsing with Konq, click on a file within a tarball, and want to open it in the Gimp, passing that sort of path would obviously break the Gimp, which has no idea how to look inside the archive. So for the time being, this mode of operation is available only as an experimental KIO slave, and will remain so until these sorts of problems are solved. (The other problem is making a tgz or odp file behave as both a file and a directory simultaneously.)
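To make the deepgrep idea a little more concrete, here is a minimal Python sketch of searching inside (possibly nested) tar archives as if they were ordinary directories. It is not Strigi code and does not use Strigi's API; the function name deep_grep and its parameters are made up for illustration, and it relies only on Python's standard tarfile module, so it handles far fewer formats than Strigi does.

```python
import io
import re
import tarfile


def deep_grep(data, pattern, prefix=""):
    """Yield (path, line_number, line) for every line matching `pattern`.

    `data` is the raw bytes of a file. If it is a tar archive (optionally
    gzip/bzip2/xz compressed), each member is searched recursively, so
    archives nested inside archives are handled too. Hypothetical helper,
    not part of Strigi.
    """
    regex = re.compile(pattern)
    try:
        # "r:*" lets tarfile detect the compression transparently.
        tar = tarfile.open(fileobj=io.BytesIO(data), mode="r:*")
    except tarfile.ReadError:
        # Not an archive: treat the bytes as text and search line by line.
        text = data.decode("utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                yield prefix, number, line
        return
    with tar:
        for member in tar:
            if member.isfile():
                inner = tar.extractfile(member).read()
                # Build a folder-like path such as tarball.tar.gz/icons/file
                yield from deep_grep(inner, pattern, f"{prefix}/{member.name}")
```

Calling list(deep_grep(open("tarball.tar.gz", "rb").read(), "GPL", "tarball.tar.gz")) would report matches under paths like tarball.tar.gz/icons/..., which mirrors the archive-as-folder addressing described above. Reading every member fully into memory keeps the sketch short; a streaming design would be the better choice for large archives.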

There are many useful ways that Strigi can return data, once a query has been performed. For example, Jos notes: "The program xmlindexer is useful for extracting data from files in a very efficient manner. Because it outputs xml, it is easy to use from any program. Other search projects such as Beagle and Tracker would greatly benefit from using xmlindexer." The xmlindexer program is a standalone executable, so other programs can easily call it as an external process without having to link to Qt or use C++. That said, there are many ways to directly use the Strigi libraries...
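As a rough illustration of what "easy to use from any program" can mean, the following Python sketch runs an external indexer command and flattens whatever XML it prints into simple rows. It deliberately assumes nothing about xmlindexer's actual element names or command line options: both helper names are hypothetical, and the command to run is passed in by the caller.

```python
import subprocess
import xml.etree.ElementTree as ET


def flatten_xml(xml_data):
    """Return a list of (element_path, attributes, text) for every element.

    Works on any well-formed XML (str or bytes), so it makes no assumption
    about the schema the indexer emits.
    """
    root = ET.fromstring(xml_data)
    rows = []

    def walk(element, path):
        current = f"{path}/{element.tag}"
        text = (element.text or "").strip()
        rows.append((current, dict(element.attrib), text))
        for child in element:
            walk(child, current)

    walk(root, "")
    return rows


def run_and_flatten(command):
    """Run `command` (a list of arguments) and flatten its XML output.

    The exact command line is an assumption left to the caller.
    """
    result = subprocess.run(command, capture_output=True, check=True)
    return flatten_xml(result.stdout)
```

A caller might try something like run_and_flatten(["xmlindexer", "/some/directory"]), assuming the tool accepts a path argument; check its help output for the real invocation. The point of the design is that the calling program needs only a process launcher and an XML parser, nothing from Qt or C++.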

The KDE libraries have had methods of extracting information from files before (such as metadata via KFileMetaInfo), but in many cases they were either slow or of limited functionality. With Strigi, we have seen as much as a several-fold increase in speed for extracting data from PNG files. I am not aware of any other speed tests actually being performed, but the general impression is that it is much faster at retrieving file data than most of the previously existing methods.

In KDE, there are not really any good screenshots to show Strigi in action, as it's really just a library. That's not to say that its effects will be invisible though, as things like the File Properties dialogs are already taking advantage of the Strigi backend to pull the data that was previously provided by KFileMetaInfo. Strigi is also planned to be used (and in some cases already is) for thumbnails and other metadata displayed in the file browsers, and preliminary results show massive speed improvements. So far, this has had little visible effect on the KDE experience for the end user. However, as more KDE subsystems become aware of Strigi, we should start to see more novel and useful applications of all the features that Strigi supports.

For example, one of the biggest beneficiaries of the Strigi work is NEPOMUK. According to Jos: "Nepomuk is a big European research project on enhancing computer applications to make them semantic and connected. Nepomuk-KDE is the work on a KDE implementation of the standards and ideas that come out of that project. I work together with the people of Nepomuk and especially Sebastian Trueg of Nepomuk-KDE to make sure our work fits together. At the moment Sebastian is writing [an] additional index implementation for Strigi that is better able to work with semantic data." This project uses a lot of metadata and other file contents (like the text of IRC logs, for example) to provide an easy-to-use search system for the desktop. NEPOMUK will undergo a name change before its final implementation is set.

So while Strigi does the actual digging through the data, other applications such as Dolphin/Konqueror, the File Properties Dialog, or NEPOMUK are the ones that will see the fruits of this work. At the moment, however, work is mostly focused on porting the previously existing KFilePlugins to use the new backend classes. For status reports on this effort, check out the Porting KFilePlugins Progress page on the KDE wiki.

To learn more about Strigi, visit the website or join #strigi on irc.kde.org.

Comments

by cm (not verified)

Hu? What does that have to do with what I said?

I only explained how the grandparent poster came up with "Kumopen" given the name Nepomuk,
and said that that would not be a good idea.

Not that I think the suggestion was meant seriously in the first place ...

by Anon (not verified)

I'm not sure where the name NEPOMUK originated, but looking at it the only thing I could think of was that it read as "Kumopen" backwards.

by whatever noticed (not verified)

Network Environment for Personal Ontology-based Management of Unified Knowledge

by funnyfanny (not verified)

i support Aaron on including Nepomuk in Kde 4.0 already - please see the thread on kde-core-devel

http://lists.kde.org/?t=117613635500003&r=1&w=2

it does not need to freeze the api, but please include it as mandatory, no matter if it works on windows or not.

by KubuntuUserExMa... (not verified)

If you are using Kubuntu, go ahead install it and give it a shot. I just did. It looks very, very promising (and already useful)

Great work!

by cies breijs (not verified)

this is a real nice kde innovation. this application nicely bridges the command line and the desktop.

i sincerely hope this can become a standard for all unix desktops as desktop search is getting more and more important.

hoooray for kde.

by Marc Driftmeyer (not verified)

Great for us KDE users and nice to see them doing this as Apple is updating its technologies for Cocoa devs.