The Road to KDE 4: Strigi and File Information Extraction

After a short delay due to a heavy dose of Real Life(tm), I return to bring you more on the technologies behind KDE 4. This week I am featuring Strigi, an information extraction subsystem that is being fully deployed for KDE 4.0. KDE has previously had the ability to extract information about files of various types, and has used it in a variety of contexts, such as the Properties Dialog. Strigi promises many improvements over the existing implementation. Read on for more...

Strigi is a library that sits at a lower level than KDE. It is written in C++, and is designed to present a series of generic calls that a program can use to find more information about a given file or files. It is in no way tied to KDE except that the development version lives in KDE's SVN repository. It also has search capabilities, which are not really the focus of this article.

The Strigi libraries are used to get information from within files, such as the dimensions of an image, the length of an audio clip, embedded thumbnails, the number of lines in a log, or source code licensing info, or just to search a text file for a given string. Strigi also works seamlessly inside compressed files, archives, and so forth. In fact, it ships two useful command line utilities, deepgrep and deepfind, which let you search for information within binary file formats as easily as using grep or find on plain text files. KDE is inheriting the same libraries, so we also get this unique advantage of being able to pull information out of files that are buried within binary formats, such as .tgz files.

There is a prototype kio_jstreams powered by Strigi that treats archives like local folders, allowing you to visit /home/user/tarball.tar.gz/icons/ for example. This works great when you are using solely KDE integrated applications, but there are currently problems when mixing with other programs. For example, if you're browsing with Konq, click on a file within a tarball, and want to open it in the Gimp, passing that sort of path would obviously break the Gimp. So for the time being, this mode of operation is an experimental ioslave only, and will continue to be until these sorts of problems are solved. (The other problem is making a tgz or odp file behave as both a file and a directory simultaneously.)
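
To make the idea of searching inside an archive a little more concrete, here is a minimal Python sketch. It is not how deepgrep or Strigi is implemented (Strigi is a C++ library); it only uses Python's standard tarfile module to look for a byte string inside the members of a .tar.gz without unpacking anything to disk. The function names deep_grep and make_demo_archive are made up for this illustration.

```python
import io
import tarfile


def deep_grep(archive_bytes: bytes, needle: bytes) -> list[str]:
    """Return the names of archive members whose contents contain `needle`."""
    hits = []
    # "r:*" lets tarfile detect the compression (gzip, bz2, ...) by itself.
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links and other special entries
            stream = tar.extractfile(member)
            if stream is not None and needle in stream.read():
                hits.append(member.name)
    return hits


def make_demo_archive() -> bytes:
    """Build a small .tar.gz in memory so the example is self-contained."""
    files = [("notes/a.txt", b"hello strigi"), ("notes/b.txt", b"nothing here")]
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


if __name__ == "__main__":
    print(deep_grep(make_demo_archive(), b"strigi"))
```

This sketch reads each member fully into memory and handles only tar archives, so it is a toy next to a real stream-based analyzer that also understands other container formats.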

There are many useful ways that Strigi can return data, once a query has been performed. For example, Jos notes: "The program xmlindexer is useful for extracting data from files in a very efficient manner. Because it outputs xml, it is easy to use from any program. Other search projects such as Beagle and Tracker would greatly benefit from using xmlindexer." The xmlindexer program is a binary, so programs can easily call it externally without having to link to Qt or use C++. That said, there are many ways to directly use the Strigi libraries...
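
As a rough illustration of why an XML-emitting binary is easy to use from other programs, the Python sketch below runs an external command and parses whatever it prints as XML. It makes no assumption about xmlindexer's real command line options or output schema: the command is passed in by the caller, and the demo uses a stand-in child process with made-up element names (files, file, path).

```python
import subprocess
import sys
import xml.etree.ElementTree as ET


def run_xml_tool(argv: list[str]) -> ET.Element:
    """Run an external command and parse its standard output as XML."""
    result = subprocess.run(argv, capture_output=True, check=True)
    return ET.fromstring(result.stdout)


def demo_command() -> list[str]:
    """Stand-in for a real indexer: a child Python process printing XML.

    The element and attribute names are invented for this demo only.
    """
    script = "print('<files><file path=\"a.png\"/></files>')"
    return [sys.executable, "-c", script]


if __name__ == "__main__":
    root = run_xml_tool(demo_command())
    print([entry.get("path") for entry in root.iter("file")])
```

The caller never links against the tool's libraries; it only needs to start a process and read text, which is the property the quote above is pointing at.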

The KDE libraries have had methods of extracting information (such as metadata via KFileMetaInfo) from files before, but in many cases they were either slow or of limited functionality. With Strigi, we have seen a several-fold increase in speed for extracting data from PNG files. I am not aware of any other speed tests actually being performed, but the general impression is that it is much faster at retrieving file data than most of the previously existing methods.

In KDE, there are not really any good screenshots to show Strigi in action, as it's really just a library. That's not to say that its effects will be invisible though, as things like the File Properties dialogs are already taking advantage of the Strigi backend to pull the data that was previously provided by KFileMetaInfo. Strigi is also planned to be used (or is already in use in some cases) for thumbnails and other metadata displayed in the file browsers, and preliminary results show massive speed improvements. So far, though, this has had little visible effect on the KDE experience for the end user. However, as more KDE subsystems become aware of Strigi, we should start to see more unique and useful uses for all the features that Strigi supports.

For example: One of the biggest beneficiaries of the Strigi work is NEPOMUK. According to Jos: "Nepomuk is a big European research project on enhancing computer applications to make them semantic and connected. Nepomuk-KDE is the work on a KDE implementation of the standards and ideas that come out of that project. I work together with the people of Nepomuk and especially Sebastian Trueg of Nepomuk-KDE to make sure our work fits together. At the moment Sebastian is writing [an] additional index implementation for Strigi that is better able to work with semantic data." This project uses a lot of metadata and other file contents (like the text of IRC logs, for example) to provide an easy-to-use search system for the desktop. NEPOMUK will undergo a name change before its final implementation is set.

So while Strigi does the actual digging through the data, other applications such as Dolphin/Konqueror, the File Properties Dialog or NEPOMUK are the ones that will see the fruits of this work. At the moment, however, work is mostly focused on porting the previously existing KFilePlugins to use the new backend classes. For status reports on this effort, check out the Porting KFilePlugins Progress page on the KDE wiki.

To learn more about Strigi, visit the website or join #strigi on irc.kde.org.

Comments

by Lans (not verified)

Thank you Troy for another great article about interesting technology behind KDE4. It has become a habit to read this series every week, and I was very happy to see this new article about Strigi today.

Once again, thank you, and keep up the good work.

by Darkelve (not verified)

Yes thank you!

This is kind of more abstract information than usual,
but to me it is one of the coolest aspects of the coming KDE4! B)

by Troy Unrau (not verified)

Yeah, I actually moved during the break here, and haven't had a net connection until recently... no net means no SVN builds... therefore I had to choose a more abstract topic. Thankfully Jos was very helpful in answering all the questions I had while preparing the article :)

I can't guarantee it'll be weekly articles (at least for the next few weeks as I'm now in my exam block at the uni) but I'll try to keep 'em rolling...

You're welcome :)

by Darkelve (not verified)

"I can't guarantee it'll be weekly articles (at least for the next few weeks as I'm now in my exam block at the uni) but I'll try to keep 'em rolling..."

Well, good luck on the exams.

by Debian User (not verified)

Hi there,

please don't mix files and directories. You will create tremendous confusion.

I do see that I should be able to use tar://path/to/tarfile and file://path/to/tarfile, and it sure would be nice if there were a way to get from one to the other by means of a double-click or open action in Dolphin in KDE4.

If done correctly, the "up" action of the file browser Dolphin would switch to the file:// protocol again.

Unsolved, forever, is the nesting of IO-Slaves, isn't it? What if I want to do tar IO-slave over ssh? fish://path/to/tarfile can't be browsed with tar://, can it? The chaining of IO-Slaves would be nice.

Yours,
Kay

by Jos van den Oever (not verified)

Strigi partially solves this problem. You can do e.g.
jstreams://message.eml/data.zip/familytree.tar.gz/mother.jpg
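
A nested path like the one above can be resolved one container at a time. The Python sketch below is only an illustration of that idea, not Strigi's code: it handles tar and zip containers (no e-mail layer), loads each level fully into memory instead of streaming, and the names list_members, open_nested and make_demo are invented for the example.

```python
import io
import tarfile
import zipfile


def list_members(data: bytes) -> dict[str, bytes]:
    """Return {member name: contents} for a tar (any compression) or zip."""
    try:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as tar:
            return {m.name: tar.extractfile(m).read() for m in tar if m.isfile()}
    except tarfile.ReadError:
        pass  # not a tar archive, try zip next
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as z:
            return {n: z.read(n) for n in z.namelist() if not n.endswith("/")}
    raise NotADirectoryError("not a tar or zip archive")


def open_nested(container: bytes, path: str) -> bytes:
    """Follow `path` through archives nested inside `container`."""
    parts = path.strip("/").split("/")
    data = container
    while parts:
        members = list_members(data)
        # Take the longest prefix of the remaining path that names a member;
        # whatever is left of the path is looked up inside that member.
        for cut in range(len(parts), 0, -1):
            name = "/".join(parts[:cut])
            if name in members:
                data, parts = members[name], parts[cut:]
                break
        else:
            raise FileNotFoundError("/".join(parts))
    return data


def make_demo() -> bytes:
    """Build a zip that contains a .tar.gz that contains mother.jpg."""
    payload = b"JPEGDATA"
    tar_buf = io.BytesIO()
    with tarfile.open(fileobj=tar_buf, mode="w:gz") as tar:
        info = tarfile.TarInfo("mother.jpg")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    zip_buf = io.BytesIO()
    with zipfile.ZipFile(zip_buf, "w") as z:
        z.writestr("familytree.tar.gz", tar_buf.getvalue())
    return zip_buf.getvalue()


if __name__ == "__main__":
    print(open_nested(make_demo(), "familytree.tar.gz/mother.jpg"))
```

Because every level is plain bytes handed to the next level, no step needs a real file on disk, which is the same reason a stream-based reader nests so naturally.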

by Michal (not verified)

What does it mean, partially? IOW what are the limits of that concept using Strigi? If done correctly we might finally fix bug 73821, which would be really cool.

by Jos van den Oever (not verified)

Strigi reads files as a stream. This fits very well with kioslaves. On its own, Strigi can read embedded files from other files. To read files embedded in other files that are read over a network protocol like ftp is a bit more tricky. There would need to be a way to really nest kioslaves to make that possible.

by André (not verified)

I don't get it. I can do this stuff already without using Strigi on KDE 3? It sounds like nothing new at all. KIO slaves have rocked a long time, and being able to navigate tar's and other archives right from your trusted Konq is a very old feature. Works like a charm.

by Troy Unrau (not verified)

Yes and no. You can do some of what Strigi does in KDE 3, but it's slower than using Strigi, and you can't extract the same detail of information (the infrastructure is not there). For example, you can navigate a tarball in KDE 3, but you can't pull embedded thumbnails out of images when browsing within a tarball... Strigi can do that, and fast...

by KDE User (not verified)

Please, don't bring up the Dolphin debate again. Let's focus on fixing Konqueror first.

by Debian User (not verified)

I used Dolphin (the IO-Slave shell) to specifically express that I regard the job as different from what Konqueror (the IO-Slave and KPart shell) would do, like using a KPart for tar and an IO-Slave for the ssh.

That said I didn't think of bringing up the pointless debate. Which is btw mostly pointless, because it's about a decision already done, and only about people not understanding what it is.

And other than that, the debate is by itself not bad. I think it helps to show the developers that the "people" (the mass of slightly informed users) really appreciate Konqueror and want it to stay and want to see continuation of this success story.

So please don't police my use of "Dolphin". While I feel that it was not well communicated by the developers, I do feel and appreciate the role of it. And my wish is exactly a point where Konqueror and Dolphin should behave different. In Konqueror browsing a tar should open a complex KPart with all the details, and in Dolphin it should be like browsing the files inside the tar.

Bah, I really am annoyed to see this police style of comment.

Thanks,
Kay

by KDE User (not verified)

Sorry, not getting it. Why should Konqueror get a complicated KPart for this and Dolphin get the polished one? Who wants a complicated KPart exactly? I want Konqueror to be awesome, not complicated.

by Thomas (not verified)

du... man complex != complicated.

konqu may load Ark as a kpart (which translates to "embed ark in konqu")

dolphin can _not_ load any kpart, so it's limited to the one file-browsing interface hardcoded into dolphin. Still it uses the same kio-slaves like konqu or all other kde-apps can use (making it possible to dive into a tgz-file e.g.)

by Diederik van de... (not verified)

See the following comparison how efficient Strigi is compared to Beagle:
http://www.kdedevelopers.org/node/2639

by Matt (not verified)

That comparison is old enough that I'm not sure it really still applies, especially given that it exposed some bugs that were causing Strigi to return fewer results than the competition.

Still, everything I've seen regarding performance has been very impressive.

> For example, if you're browsing with Konq, and click on a
> file within a tarball, and you want to open it in the Gimp,
> well passing that sort of directory would obviously break the
> Gimp.

You can easily view all KIO slaves in non-KDE apps (such as GIMP, Firefox, OpenOffice, even commandline utilities) through KIO-FUSE. It works by mounting remote locations (or tar archives, in your example) into the root filesystem hierarchy:

http://kde.ground.cz/tiki-index.php?page=KIO+Fuse+Gateway

FUSE is Linux-only (ok, + some BSDs and a hackish Mac OS X-Part).
KDE is not.

by Debian User (not verified)

Except that of course FUSE is not the solution, what gives?

I can already use OpenOffice today to open files via IO-Slaves. It just takes a temporary file, created behind my back. And why not monitor that file for changes and push these back to the IO-Slave where it came from?

Admitted, a lame work-around, but with inotify it's going to work nicely.

Yours,
Kay

> I already today can use OpenOffice to open files via IO-Slaves.
> It just takes a temporary file, created behind my back. And
> why not monitor that file for changes and push these back to
> the IO-Slave where it came from?

Because when OpenOffice crashes or misbehaves it leaves your /tmp directory with Gigs of orphaned temporary files.

It's a pain to make OpenOffice and other non-KDE applications aware of IO slaves, and it's outright impossible to do so in closed-source apps. With FUSE, they don't have to be recompiled or modified at all - they see remote files as normal local files. So it's great for backward compatibility.

by superstoned (not verified)

this already works in KIO, it CAN create a temporary file, monitors it for changes, and uploads the changed file back to the original location. I agree FUSE is cool, but not the cross-platform solution we need.
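
The temporary-file round trip discussed here can be sketched in a few lines of Python. This is not KIO's actual implementation: download, upload and run_app are hypothetical callables supplied by the caller, and instead of watching the file with something like inotify, the sketch simply compares checksums once the application has finished.

```python
import hashlib
import os
import tempfile
from typing import Callable


def edit_via_temp_file(
    download: Callable[[], bytes],
    upload: Callable[[bytes], None],
    run_app: Callable[[str], None],
) -> bool:
    """Copy remote data to a temp file, let an app edit it, upload if changed.

    Returns True when the file was modified and uploaded.
    """
    original = download()
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(original)
        run_app(path)  # blocks until the application is done with the file
        with open(path, "rb") as handle:
            edited = handle.read()
        before = hashlib.sha256(original).digest()
        after = hashlib.sha256(edited).digest()
        changed = before != after
        if changed:
            upload(edited)
        return changed
    finally:
        os.remove(path)  # clean up even on errors, so no orphans are left


if __name__ == "__main__":
    remote = {"data": b"old contents"}

    def fake_editor(path: str) -> None:
        with open(path, "wb") as handle:
            handle.write(b"new contents")

    print(edit_via_temp_file(lambda: remote["data"],
                             lambda blob: remote.update(data=blob),
                             fake_editor))
```

The cleanup in the finally block only helps when the wrapper itself keeps running, and the whole file is copied twice, which is where a mounted filesystem has the edge for very big files.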

Using FUSE to create a temporary mountpoint is a lot cleaner than creating temporary files and transmitting the changes back. Especially if you're working on a very big file.

But FUSE isn't on all of KDE's platforms. I think the best solution would be to have FUSE where you can, and temporary files when it's not an option.

> I think the best solution would be to have FUSE where you can,
> and temporary files when it's not an option.

Exactly!!!!

> FUSE is Linux-only (ok, + some BSDs and a hackish Mac OS X-Part).
> KDE is not.

Fuse already works in Linux, FreeBSD, DesktopBSD and Solaris. I don't see why the users of these systems should be dragged down by the ineptitude of some of the more obscure OS's.

by superstoned (not verified)

It doesn't work on Mac OS X and Windows, thus it's still not an option.

Will KIO slaves ever run on Windows/OS X?

imho this is not necessary.

Applications (Amarok, Krita, Quanta, ...) yes - but KIO-slaves?

Getting the KIO-Slaves to work on the other platforms should be much easier than getting the applications. Especially on OS X where the primary differences between OS X and FreeBSD (or OS X vs Darwin+X) exist only with regards to the GUI.

by Pino Toscano (not verified)

> Will KIO slaves ever run on Windows/OS X?

> imho this is not necessary.

> Applications (Amarok, Krita, Quanta, ...) yes - but KIO-slaves?

http is a KIO-Slave too. Do you want Konqueror on MacOSX or Windows?

Getting the KIO-Slaves to work on *all* the platforms KDE supports *IS* necessary.

by ben (not verified)

I really like where KDE and its libraries are going. Except for the ioslaves.

It is becoming very obvious that ioslaves do not belong into kde/gnome/openoffice, but at least one layer deeper than that. They should be part of linux, available everywhere, so that commands like "less http://kdenews.org" become possible.

I guess that there is some good reason why this has never happened, but the current state really sucks!

by michael (not verified)

The main problem is that it gets a lot more difficult to implement at a deeper layer. Even now ioslaves have some complex issues to solve.

One of them is that most interesting ioslaves may require user interaction.
Let's say "less http://kdenews.org" actually works but the given URL requires authentication. "less" obviously can't handle that, so what will?

What about progress bars? Even small files may take a long time to access. And in many cases "download, work locally, upload" is the only reasonable approach to work with a file.

by Philipp (not verified)

I'm pretty much sure all of the issues can be solved.

It would just need a deeper interaction between a generic library and the above DE / applications. Even the mentioned interactions can be done.

For sure it is a huge task getting it done in a way every DE / app is accepting it and the flexibility in adding new / adjusting current IO-slaves from one to another KDE version is gone. I.e. it needs to be at least pure LGPL so it cannot be done with Qt.

But I don't see a fundamental reason why it shouldn't be possible.

by Kevin Krammer (not verified)

> I'm pretty much sure all of the issues can be solved.

There has been at least one attempt for a unified VFS implementation (look for D-VFS in the archives of the xdg mailinglist on freedesktop.org).

However, main author/developer got lots of negative feedback from people not associated with any desktop project and gave up.

> it needs to be at least pure LGPL so it cannot be done with Qt

That's not an issue. Any such system would be out of process, i.e. a single daemon or multiple daemons (one for each connection like KIO slaves, or one for each protocol or one for each host, etc).

Since communication would happen through a specified protocol (also the way KIO slaves work), the licence of each side of the communication does not influence the licencing options of the other.

by Martin Stubenschrott (not verified)

"less http://kdenews.org" IS possible (at least in my gentoo build with some lessopen-trickery).

But I think you're right, I would also prefer to have things like that in a DE-independent way.

by Diederik van de... (not verified)

> I guess that there is some good reason why this has never happened

Yes, it needs kernel support from BSD, Solaris, HP-UX, Linux and Win32. That's a bit of a challenge to get right. So that's why the KIO slaves are still implemented at a higher level.

Perhaps some of the recent user-space filesystems (FUSE) open a new way to implement this.

by Philipp (not verified)

>Yes, it needs kernel support from BSD, Solaris, HP-UX, Linux and Win32.
If this would really be the case, then KDE would not be at all possible on these platforms.
If you can do it within KDE, then you can do it separately as well.

by ac (not verified)

what?

its possible in kde on all platforms because kde is programmed to do it on all platforms. sure, you can do this with every application, but that means you need to change the code. thats the problem - you just can't change the code of all applications for half a dozen platforms...

so if you want something like kio that works with all applications, you need kernel support.

by Philipp (not verified)

Kernelsupport? Nonsense. This is mostly pure userspace.

For sure some support is in the kernel, as some basic network- and filesupport is there for other reasons, but why on earth needs a kernel knowledge about IMAP or EXIF?
The kernel doesn't need to know all these different things.

What you want is a functionality available on Linux platforms and this can be done with a library as well.
You are worried it cannot be used in the bash? For sure bash needs then to be adjusted as well.

Yes, we can't change the code of all applications on all platforms. But if you keep it in KDE alone and in Gnome alone and in other Apps, it for sure will never happen.

by ac (not verified)

you should read the whole of the thread...

just putting ioslaves into another library will not do anything. noone is going to rewrite their software just because some geeks like to really use networks and such.

the only way to make ioslaves fully work for non-kde software now (not in 10 years...) is to use kernel extensions. not because the kernel needs to know about imap or whatever, but because its the only common api for filesystems access every application uses. thats what fuse is about, to make things like that easier.

rewriting the ioslaves in pure c, with a minimum of dependencies - so that maybe someone else would use it, is not a good idea. the kde devs already have enough work ahead. and you still won't get less-over-http anytime soon, because these apps aren't maintained anyway.

by Evan "JabberWok... (not verified)

Since when has KDE limited itself to being only a Linux desktop?

by Gummi Bear (not verified)

With so many changes for KDE 4 I wonder how long will it take to make KDE 4 stable/usable.
I stopped using Konqueror because it crashes a lot when dealing with embedded multimedia. Kaffeine since a couple of months crashes when I open the playlist and a video is running... I hope we get a stable KDE 4 before jumping to an even cooler KDE 5.

by litb (not verified)

please see this:
http://kaffeine.sourceforge.net/index.php?page=news&details=22

it really was not konquerors' fault.
my suggestion for you:
kplayer , it embeds cleanly - no crashes for me since i used it, caused by kplayer.

for embedding videos into firefox, my suggestion for you is mplayerplug-in:
http://mplayerplug-in.sourceforge.net/

all in all, that nasty crash when viewing embedded videos with kaffeine (which i personally don't like (ohh, see how lame it is)) was not konquerors fault but a misuse of Xlib in means of multithreading.

happy hacking

by MamiyaOtaru (not verified)

Nice of them to have fixed it. Unfortunately I only use Kaffeine 0.4.3. After that, I hate the UI. (Old UI: http://img64.imageshack.us/my.php?image=kaffeine7gu.png ) To prevent the crashes, a small patch of Konqueror is sufficient. That's obviously not the "right" answer, fixing Kaffeine was, but this is nice for those of us using older versions.

The differences amount to
+#include
and
+ XInitThreads(); // fix for kaffeine

(patch file attached)

Again, not so useful now that Kaffeine is fixed, but it allows me to continue using my preferred 0.4.3

by superstoned (not verified)

I started to use Codeine after Kaffeine decided to bloat their UI with all kinds of useless stuff...

by Gleb Litvjak (not verified)

Hmmm... That may be caused by poor packaging (if you used binary packages) or extreme CXXFLAGS and LDFLAGS (if you compiled KDE from source). Also, if many other programs start segfaulting with no apparent reason, check your computer - this may be a hardware failure (faulty cooling, dust, bad capacitors etc) -- but this is less likely.

by David Taylor (not verified)

Odd you should say that, mine also did a lot but since Kubuntu edgy I haven't seen that problem at all; if you point out a website and it crashes for me I shall file a bug report.

by forget Kaffeine (not verified)

Everyone is switching to KPlayer it seems. It's so much more stable, and the default interface is more simple and easier to use. But it also lets you get under the hood and tweak things the way you like them. And it has a multimedia library to organize your stuff. And it uses MPlayer, which means more compatibility with formats, codecs, and so on.

by somecoward (not verified)

i, for one, think the name Kumopen would be great. and it starts with a k \o/

by Sebastian Trüg (not verified)

What does Kumopen stand for?

by cm (not verified)

Just Nepomuk backwards.

And the same kind of marketing blunder as the name Kant instead of Kate would have been, BTW.

by whatever noticed (not verified)

Nepomuk is not a kde technology, so no point in using a K in the name of it.
Nepomuk is an acronym for
Network Environment for Personal Ontology-based Management of Unified Knowledge