We are here today to talk about the Strigi project - the indexing and search technology of KDE 4 - and to interview Flavio Castelli, a key developer of Strigi. Read on for the interview.
This interview was initially released to KDE Italia and is available for Italian readers here.
Flavio can you introduce yourself to the KDE Italia readers? What did you study? Have you got a job?
I was born 25 years ago in Bergamo, a city near Milan in Italy. I have just taken a second level degree in computer engineering. Now I'm working as a consultant for an IT company in Milan.
When did you hear of KDE for the first time? When did you start using Linux and why?
I discovered KDE and Linux at the same time. In fact the first Linux distribution I installed was shipped with KDE 2 as window manager. I was only eighteen and I had just heard about Linux from one of my schoolmates and some magazines. I found a Red Hat 6.2 installation disk in a magazine and I installed it just for fun.
I tried to use Linux for some months but I ended up removing it because I wasn't able to solve lots of problems. In those days I didn't have internet at home nor did I know other Linux users.
Then, during the first year of university, I met Linux again. Since I discovered that some of the previous summer's internships had required Linux knowledge, I installed it again. When the summer arrived there were no interesting internships, but in the meantime I had discovered a new world...
How and when did you get involved in KDE?
I joined the KDE development with the birth of Strigi. That happened during February / March 2006. I had never taken part in such a big and important project before.
How was Strigi born? Can you tell us the story of the metadata search engine designed for KDE?
The first desktop search program for KDE was Kat. It was a promising project sponsored by Mandriva and maintained by an Italian guy called Roberto Cappuccio.
Unfortunately Kat never reached a stable official version. Its latest versions had some serious bugs, which showed the need to reorganize the source. Roberto had just begun to rewrite some parts of Kat when, for personal reasons, he had to leave the development. So the project was left without its leader, with serious problems to fix and a simpler layout to be found. Since the Kat development team was really small, nobody tried to continue Roberto's work.
In the same period, Jos van den Oever (Strigi's maintainer) created the Strigi project. At the beginning Jos had just written some plugins for Kat. He needed a stable version of it to test his code. Since Roberto was really busy and his work was going on slowly, Jos decided to create a small program for his tests. So when the Kat project was closed, Jos expanded this small program and the Strigi project was born.
Strigi recently entered KDE 4 through the kdesupport SVN module. Does this decision by the KDE 4 core developers make you proud of your contribution?
Well, I'm really happy and proud of it. I think I'll be happier when, with KDE 4 official release, more people will discover, use and (I hope) appreciate Strigi.
What makes you contribute to KDE instead of the competitors?
When I started using Linux I tried lots of window managers and desktop environments. I liked some of them, but in the end I realized that KDE was my favourite one.
Every day I have lots of advantages using tons of open-source programs. So I decided to offer my time and capabilities to the KDE project. My aim is to contribute to its evolution and to permit other people to use a good and always up-to-date product.
In short I would like to do something useful for other people... :)
Can you say that programming for KDE was an investment? You got C++/Qt programming experience that helped you enrich your CV. Can this be of help in getting a job? Can it be a good calling card to bring to a job interview with a software company?
By working on KDE I'm constantly improving my skills, and that's really good. It is also a good point on my CV. I think that programming for KDE can help during a job interview, but unfortunately this isn't assured (especially here in Italy).
Are you part of a Linux Users Group? Have you ever presented some works for the LUG or in Free Software events?
I'm one of the members of BGLug, which stands for Bergamo Linux User Group. As part of it I had the chance to organize lots of events related to the spreading of Linux and Open Source.
Have you ever attended an aKademy or a Free Software event? If so, can you tell us briefly how it was? What did you do? Do you think a KDE user/developer has to take part in the KDE developers conference (aKademy) at least once in his/her life? Or in a Free Software event?
Unfortunately I have never attended an aKademy. I was going to attend two of them but in both cases I couldn't find a good (that is, cheap) flight.
However, last February I took part in FOSDEM in Brussels. This is an annual meeting of European Open Source developers. At FOSDEM I gave a talk on Strigi desktop integration. It was a really positive experience that I'll try to repeat next year!
I think that an open-source developer should take part in events like this because it can be really useful. By joining these events you can meet lots of interesting people and share your opinions with them. You can't even imagine how many ideas can be born from these debates.
What has been your best experience with KDE? Getting to know the other developers? Or something else?
Actually the best moment took place after my speech at FOSDEM. When people started asking questions about Strigi I felt first-hand the interest in my work. It was gratifying.
Do your parents and friends use Linux and KDE?
I have lots of friends using Linux. While my parents are still using Windows, my sister used Linux for some time and finally switched to Mac OS X.
Also my girlfriend used Linux and KDE for some time. She liked it, but now she uses Windows all the time (that's a choice of her company). Obviously she knows that, living with me, she will meet Linux and KDE again :) .
What could be your slogan to attract people to KDE? Can you give also some "reasons to stay with *nix/KDE"?
Choose the best, switch to Linux & KDE! Ok, I'm not a great advertising man :) .
I suggest using KDE on Linux (or anything else from the *nix family) because in this way you will obtain a complete and stable system with a good user experience. But, most important of all, you'll have a totally free system.
If one day you stop working on KDE, what might the reason be? Too much time taken up by a new job, your family, or something else? Or would you simply have decided to leave your passion for KDE behind and so leave the KDE team? What would you miss about the KDE experience? Obviously we hope you will work in the KDE team for a long time yet.
I hope to work on KDE for a long time. I think that a bad interaction between work and family could make me leave KDE.
How much time do you usually spend on KDE?
Every day I spend two hours on KDE, that's the time the train takes to reach my office and bring me back home. Then there're two or more evenings per week, but these ones depend on my "real life" matters. Unfortunately the good times of university are over... :( .
Flavio what are your plans for KDE 4?
Make Strigi better and better. I would like to see it become KDE's "Spotlight".
Personally I want to make the file system monitoring feature stable and multi-platform. Strigi currently offers this functionality only on Linux systems. I would like to extend it to Solaris and BSD.
I'll also try to improve and extend the unit testing suite that I have just rewritten; the main goal is to obtain a good quality assurance tool.
What was your first Linux distribution and why? Did you try many before you found the right one?
My first Linux distribution was Red Hat. Then on my laptop I've used Slackware for a couple of years. In the meantime on my home PC I tried Mandrake, Red Hat again and in the end, Gentoo.
I immediately fell in love with Gentoo, so I left Slackware and I installed this beautiful distribution on all my computers. I continued to use it also when I changed the architecture of my laptop, switching to an iBook G4.
Anyway, after some years I no longer liked waiting for every program to build, so I switched to Debian. I chose this distribution because it gives good support to the PowerPC architecture.
Which distribution do you use now? Why?
I'm still using Debian. I like it because it is available on different architectures, offers lots of binary programs and, most of all, has a good package manager. I don't care too much about the new Linux distributions or the evolutions of the other ones. I'm really happy with Debian and I don't feel the need to change it.
Mac OS X or Linux?
Linux forever. Since I have two Macintoshes I used Mac OS X for some time. I liked some aspects of this OS, but there are lots of things I don't like. I found that Linux is the operating system that fits my needs.
What is your favourite place in the world?
A green place with broadband :) .
Flavio, thanks for your time.
How nice to read this interview. I'll never forget when I first met Flavio. After giving the Kde-Nepomuk talk at FOSDEM in 2007, someone came up to me and started questioning me about the, in his view, inadequate state of file change notification in Strigi. I tried to respond calmly but was a bit careful in how to deal with the tone of the questions.
Says the questioner: "Hey, don't worry, it's me: Flavio!"
"Oh, you bastard!" was the first thing I thought. "You tricked me!" That was really an excellent joke and emphasized how we'd been working together for months and had never spoken to each other or seen one another at all.
All this talk about "desktop search" either on Windows or on Linux has so far been not very useful for me. I just don't own that many text documents on my harddisk. The whole thing becomes very interesting if network search would be considered more and worked easily and "out of the box". So - here is my dream of the ultimate "desktop search":
I can easily (via the UI) create and delete "search spaces" on different servers. SSH authentication (i.e. server, user name, password) and a folder to search (say /var/www) would be sufficient. I give a name to this search space, e.g. "webhtml". Other folks on different PCs can connect to this server via the same authentication data. They can subscribe to my existing search spaces or create additional ones. Vice versa, I can of course subscribe to their search spaces. Now, e.g. every night while nobody is at the office, the indexing is done efficiently on that server only once, no matter how many PCs are subscribed to that search space. The index is also stored on that server. Everyone can now make use of the same index. For instance, all my PHP functions would be in that index. If I now select search (perhaps even embedded in the file open dialog of Kate) and enter a function name, the 10 most recently modified files containing that string would be returned. I can open one in a text editor or KWord or whatever immediately via KIO/FISH. Wouldn't that be great? Perhaps this is possible right now but just too difficult for me to set up.
remote search indexes are indeed a cool thing, and something i hope kde does get to one day. however:
> I just don't own that many text documents on my harddisk
no mp3s, emails or digital photos with metadata? no bookmarks or contacts or im conversations? i think there's a lot more textually representable data around than most people consider. add in dynamic information and it can get pretty interesting =)
Right; it would be great if it would search my digiKam database and report how many of my pictures have a particular tag. What about finding segments of lyrics that are linked to my .ogg files! That would be somethin':
You have 45 pictures that are tagged "Amy".
You have 1 song titled "Amy".
You have 3 songs that contain the phrase "Amy".
That last one seems difficult, but would be _sweet_!
Since Amarok stores the lyrics locally somewhere the last one should be more than possible, especially with a little help from Amarok!
I'm not sure if Amarok embeds the lyrics into the file's metadata or stores them in its own database, but either way it shouldn't be much of a problem (and if in the metadata, it should be trivial since it would already be scanning the metadata on the file anyway).
What I think would be very interesting would be stuff like this in addition to yours:
You have 1 PDF labeled "useless and confusing name" that was sent to you by "Amy [email protected]". (i.e. the PDF was sent to you by her via IM or email or some other file transfer method)
You have 2 songs by "Konqi" that were recommended to you by "Amy". (i.e. she recommended you the song via a service like Last.fm)
And other things like that, which is the general idea of what Tenor wanted to achieve, and I believe is (at least somewhat) the goal of NEPOMUK (both AFAIK and IIRC). Even though Strigi (as best I know) won't have these features in KDE 4.0, I'm still looking forward to seeing if it can be useful while not getting in the way (until I forced beagle off my laptop, about once a week my laptop's hard drive would start thrashing like mad when beagle updated its index, which would last quite a while!).
lyrics are stored in the database, not in the files themselves.
There is however at least one script available to automatically store the lyrics in the ID3 header. Maybe this could be turned on by default in amarok?
"Maybe this could be turned on by default in amarok?"
I'd love this feature! Why can Amarok write ID tags and not lyrics? There should be a button at the top of the lyrics sidebar which allows us to write the current lyrics into the music file when we think the lyrics are OK.
This might be a legal problem. Although using much more of the ID3 tags would be great. But even the bought music on magnatune.com has only the very basic tags (not even the cover art) :-(
My dream would be finding things as fast as Spotlight can do. When I downloaded something, it's immediately visible in Spotlight (I assume the file dialog calls some hook to spotlight).
Apple iterated on Spotlight, and is now at the point where it searches network shares too. I'd love to see the same thing happening to Strigi: first get local searching really good, so remote searching will be cool too (e.g. with KDE 4.1 or 4.2).
Reverse that, and you'll get a tool which does all sorts of cool things, but doesn't have its core functionality worked out well enough. So my dream is they get local file searching right first! :)
While both strigi and beagle are both nice tools when you only consider your personal files sitting on your local hard disk, both are totally insufficient when it comes to files sitting on network shares, either NFS or Samba.
Fileservers should be capable of creating the indexes themselves, and giving the client an interface to query the indexes. Results should be presented taking the permissions of the querying user into account, which might be quite complex (think of ACLs, there is more than user/group/world). For permission checks, Kerberos comes to mind, but it's IMHO too complex and demanding for simple setups.
If there is a good interface, it should be expandable to email/groupware servers as well.
If I understand correctly, Windows Server is capable of indexing shared files, but I have never used it, so no experience.
As far as I know, the architecture of strigi would support this, and Nepomuk is specifically doing research in the area of sharing meta-data with other people over networks etcetera... So you can expect 'things like this' to come up and be integrated in KDE in the coming years, though of course they won't be KDE 4.0 things.
Ciao Flavio! Compliments for your work!! strigi is a critical kde4 component, and it needs to be really good. It's a pleasure you're developing it ;-)
Continua cosi', viva KDE!!
Since search tools came to be based on word indices, we have been gradually losing the ability to search for arbitrary chunks of text. That is mostly fine for web search, where you are throwing away 99.9999% of the results anyway (most of the time), but becomes an issue for desktop search.
I am asking these questions in general terms since I honestly have not been able to figure out how Strigi works in these respects, despite using it quite a bit. Perhaps Strigi already solves some of the issues I present; it is not apparent that it does.
1) First Question: File name search
The problem of not being able to search for parts of a word becomes especially bad for file names, which are often of the form MyStrigiPresentation.odp. To search for that using a word index, I already need to know the exact file name! Searches for "Strigi" and "Presentation" come up blank.
Does someone have a brilliant solution to this? For instance a clever way to break up long words and also index the parts of them? More realistically; will Strigi be configurable to do full text search (also finding parts of words) on a data source basis. It could then be set up to do this more expensive search for file names, which should be a small amount of data anyway?
2) Second Question: Special characters
The other issue that has bitten me is how non-alphanumeric characters are treated. Suppose I am looking for the text "foo:bar". Will the query "foo" find that? Will "foo:bar"? It depends on how the indexer treats the non-alphanumeric character (a colon in this case). It also depends on how clever the tool is both at indexing time and at search time: if "foo:bar" was split into "foo" and "bar" at indexing time, the query "foo:bar" may or may not produce a match.
As a user I would love to see some friendly documentation on how different characters are treated and when you can search for things that contain them. Another example: can you search for "C++"? The answer is not obvious. As a user I would like to know what queries are futile without trying them all out.
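To make the "may or may not produce a match" point concrete, here is a minimal sketch of a toy word index. It assumes a tokenizer that splits on every non-alphanumeric character; all function names are hypothetical and this is not how Strigi itself is implemented. It only shows why a query finds "foo:bar" when the query is broken up the same way as the indexed text, and fails when it is looked up as typed.

```python
import re

def tokenize(text):
    # Assumption: every run of non-alphanumeric characters separates words.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def build_index(docs):
    # Map each token to the set of document ids that contain it.
    index = {}
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index.setdefault(token, set()).add(doc_id)
    return index

def search_raw(index, query):
    # Look the query up exactly as typed (only lowercased).
    return index.get(query.lower(), set())

def search_analyzed(index, query):
    # Break the query up like the indexed text and require every token.
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

docs = {1: "foo:bar", 2: "C++ primer", 3: "C compiler"}
index = build_index(docs)
```

With this toy index, `search_raw(index, "foo:bar")` finds nothing because no single token contains a colon, while `search_analyzed(index, "foo:bar")` finds document 1. A query for "C++" is reduced to the token "c" and therefore cannot tell documents 2 and 3 apart.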
The filename /home/you/my-strigi-presentation.odp is indexed under the keywords home, you, my, strigi and presentation, so it's easy to find it. The indexer does not understand CamelCase though.
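As an illustration only, the keyword splitting described above could look roughly like the sketch below. The splitting rule (any non-alphanumeric character is a separator) is an assumption, and unlike the list above this sketch also emits the file extension as a keyword. The `split_camel_case` helper is a hypothetical extra step that shows one way CamelCase names could be broken up; it is not something the indexer does.

```python
import re

def tokenize_path(path):
    # Assumption: any run of non-alphanumeric characters is a separator.
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", path) if t]

def split_camel_case(word):
    # Hypothetical extra step: split "MyStrigiPresentation" into its parts.
    # Runs of capitals (acronyms), capitalized words and digit runs are
    # each treated as one part.
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", word)
    return [p.lower() for p in parts]
```

With these rules, `tokenize_path("/home/you/my-strigi-presentation.odp")` yields the separate keywords, whereas `tokenize_path("MyStrigiPresentation.odp")` yields one long keyword plus the extension, which is why a search for "strigi" misses it. Running `split_camel_case` on the long keyword before lowercasing would recover the parts, at the cost of a larger index.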
Indexes that index all parts of words take a lot of disk space. For this reason, we try to make sane choices about how to break up words we index. Note that you can search for 'foo*' and even for '*foo'. The first is fast, the second is slow on the default lucene index.
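The speed difference between 'foo*' and '*foo' can be illustrated with a sorted list of terms. This is a general sketch of why prefix lookups are cheap on a sorted term list, not a description of Lucene's internals; the function names are made up for the example.

```python
from bisect import bisect_left

def prefix_search(sorted_terms, prefix):
    # Binary search jumps to the first candidate; matching terms are adjacent.
    i = bisect_left(sorted_terms, prefix)
    matches = []
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches

def suffix_search(sorted_terms, suffix):
    # Sort order says nothing about word endings, so every term is checked.
    return [t for t in sorted_terms if t.endswith(suffix)]

terms = sorted(["bar", "foo", "food", "football", "kfoo", "zoo"])
```

Here `prefix_search(terms, "foo")` only touches the neighbouring entries "foo", "food" and "football", while `suffix_search(terms, "foo")` has to walk the whole list to find "foo" and "kfoo". A common workaround, at the price of extra disk space, is to keep a second index of reversed terms.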
Indexing non-alphanumeric text is not done for the same reason. Yes, this means that looking for 'c++' is not possible.
Lucene has the concept of an 'Analyzer' (do not confuse this with a Strigi analyzer) which is a class that can break up text into relevant tokens. For different languages, different analyzers can be used. Strigi does not have this abstraction at the moment. We have only one way of breaking up text.
Thanks for your reply! Strigi does store the full text content, in addition to the word index, does it not? Would it then be possible to turn on full text search for certain kinds of data? That could be turned on by default for e.g. file- and application names and could be switched on by the user for other kinds of data when needed: searching through the Strigi-collected full text would still be faster than going back to the original data source.
Let me also again plead for full documentation of all the details of how search works! For instance: what happens when I search for C++? Will that find matches to "C" (including "C++")? Will it find the specific string "C++" anyway, by means of first finding "C" and then checking the full text data for the surrounding characters? Or will it find nothing, either because the search front end does not drop the "++" or because "C" is just one character long and that was deemed too short to index? No need to answer that particular question here, but it would be great if that sort of thing were addressed by the user documentation.
Have a good day!
this is an important problem.
for instance, go to gmail, and try to search for emails with a subject that starts with "**SPAM**".
you just... can't. and yes, I've tried filing a bug; the gmail devs never respond.
you get all sorts of emails that contain the word spam, or Spammer, or "my email address is [email protected]" or other such nonsense, and it drowns out the results you're looking for. also makes it completely, 100% impossible to make a filter based on the "**SPAM**" text.
this was the last straw with gmail for me, and I finally switched back to kmail.
once in a while, I just really, really need a way to do a search for either the literal text I've typed, or a regexp. not having either makes me feel like I'm stuck in some kiddie-program that's not meant for real adults doing real work.
I would love to be able to drag and drop a photo onto a search bar, and then have it search that picture's meta-data. Or find similar photos with the Gnu image search.
why would you want to 'search the metadata' from a single picture? you use metadata to find something... showing the metadata in KDE 4 is just rightmousebutton -> properties.
finding similar photos could probably be done with a strigi plugin (or it might be more a Nepomuk thing). Maybe Jos knows...