KDE's Next Generation Semantic Search

For years, KDE software has included a semantic (relationship-based) search infrastructure. KDE's Semantic Search was built around concepts previously developed in NEPOMUK, a European Union-funded research project that explored the use of relationships between data to improve search results. Based on these ideas, KDE's implementation of Semantic Search made it possible to search for all pictures - taken in - a particular place. On top of that, it added text search and tagging.

Incremental improvements

Since its implementation, our developers have received and digested a lot of feedback. Application developers requested and received easier-to-use APIs (Application Programming Interfaces, the glue for integration) and widgets (such as the star rating and tagging user interface). For end users, stability and performance were crucial. Much work was put into improving the speed of indexing, keeping it out of the way of users and making Search more reliable.


Vishesh Handa talking about relationships* at conf.kde.in
(*the technical kind)

What is coming

The upcoming release of KDE Applications (version 4.13) will introduce the next step in the effort to improve the performance and stability of search features in KDE software. The improved Semantic Search is lighter on resources and more reliable than before and, thanks to considerable reuse of existing code, it is mature and offers a complete feature set. Users will find that features such as search are exposed in the same, familiar manner, but searching in a variety of applications will be faster and more reliable.

To accomplish this, developers looked at how Search was being used in practice. The major use-cases they identified are:

  • Finding objects (like files) based on their content, requiring a full text index of files on a system
  • Storing and retrieving simple objects such as tags, ratings, activities etc.
  • Storing and searching through relationships such as "this file is related to this contact"

With a better understanding of use-cases that developed during years of deployment and development, the improved Search technology was specifically designed to do these three things and do them well.
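
As a rough illustration only, the three use-cases above can be modelled in a few lines of Python. This is a toy sketch: the class TinySearch and all of its method names are hypothetical and are not part of KDE's Semantic Search API.

```python
from collections import defaultdict


class TinySearch:
    """Toy model of the three use-cases; hypothetical, not KDE's API."""

    def __init__(self):
        self.words = defaultdict(set)   # word -> paths containing it (full text index)
        self.tags = defaultdict(set)    # path -> tags (simple stored objects)
        self.relations = set()          # (subject, relation, object) triples

    def index_file(self, path, text):
        # Use-case 1: index a file by its content.
        for word in text.lower().split():
            self.words[word].add(path)

    def find(self, word):
        return sorted(self.words.get(word.lower(), set()))

    def add_tag(self, path, tag):
        # Use-case 2: store a simple object such as a tag.
        self.tags[path].add(tag)

    def tagged(self, tag):
        return sorted(p for p, tags in self.tags.items() if tag in tags)

    def relate(self, subject, relation, obj):
        # Use-case 3: store a relationship between two entities.
        self.relations.add((subject, relation, obj))

    def related_to(self, obj, relation):
        return sorted(s for s, r, o in self.relations if r == relation and o == obj)


# Made-up sample data for illustration.
search = TinySearch()
search.index_file("/home/user/report.txt", "Quarterly sales report")
search.add_tag("/home/user/report.txt", "work")
search.relate("/home/user/report.txt", "related-to", "contact:alice")
```

Each of the three structures can be queried on its own, which is the point of designing for exactly these use-cases rather than for arbitrary relationships.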

Advancements for end users

The improved Semantic Search brings our users a number of tangible benefits. Its design is more robust, delivering search results quicker and with less overhead. The simplicity of the design will not only reduce failures, but will also make it easier for current and new contributors to add and improve functionality.

Improvements that users will notice:

  • Faster searching and indexing
  • More accurate search results
  • Greater reliability
  • Faster software development

Applications like the Kontact Suite, Dolphin and Gwenview, as well as the Plasma Desktop itself, already benefit from the changes.

For developers

The changes are small enough to make it relatively easy for application developers to move their applications over to the improved Semantic Search, which many have already done for the 4.13 release of KDE Applications. Instead of having a single RDF-based database for all information, Semantic Search now provides separate data stores and search interfaces. This allows it to store and search each type of content in an optimal way. Under the hood, the Semantic Search infrastructure uses SQLite and Xapian to index and retrieve data. More information about the information retrieval architecture can be found on the Community Wiki.
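
To make the idea of separate data stores more concrete, here is a minimal sketch using Python's built-in sqlite3 module. It assumes one small SQLite database per content type, each with its own search function. The table names, columns and function names are invented for illustration and do not reflect the real schema or API. Xapian is left out to keep the sketch dependency-free.

```python
import sqlite3

# Hypothetical layout: one small SQLite database per content type,
# instead of a single store holding everything.
files_db = sqlite3.connect(":memory:")
files_db.execute(
    "CREATE TABLE file_tags (path TEXT, tag TEXT, PRIMARY KEY (path, tag))"
)

emails_db = sqlite3.connect(":memory:")
emails_db.execute(
    "CREATE TABLE emails (id INTEGER PRIMARY KEY, sender TEXT, subject TEXT)"
)


def search_files_by_tag(tag):
    # Search interface for the file store only.
    rows = files_db.execute(
        "SELECT path FROM file_tags WHERE tag = ? ORDER BY path", (tag,)
    )
    return [row[0] for row in rows]


def search_emails_by_sender(sender):
    # Search interface for the email store only.
    rows = emails_db.execute(
        "SELECT subject FROM emails WHERE sender = ? ORDER BY id", (sender,)
    )
    return [row[0] for row in rows]


# Made-up sample data.
files_db.execute(
    "INSERT INTO file_tags VALUES (?, ?)", ("/home/user/chart.png", "diagram")
)
emails_db.execute(
    "INSERT INTO emails (sender, subject) VALUES (?, ?)",
    ("alice@example.org", "Draft 2"),
)
```

Because each store has its own schema, each can be indexed and queried in the way that suits its content, which is the trade-off described above.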

As of today, Semantic Search offers developers:

  • An API for searching
  • A way of storing relations between entities
  • File indexing
  • Email and contact indexing
  • Timeline KIO slave

Developers can find more information on the Baloo wiki page.

KDE Platform 4 and KDE Frameworks 5

When upgrading to KDE Platform 4.13, existing tags, ratings and comments will be transparently migrated to the new storage system. Looking forward even further, Semantic Search is in the process of being ported to Frameworks 5. This Frameworks 5 version will use the same storage system as the version included in Platform 4.13 (and newer) and will be fully compatible with it.

Learn more about Frameworks 5 in the tech preview announcement.

Conclusion

The change to Semantic Search in KDE is a natural next step in the process of taking technology that came out of an academic research project and adapting it to real world use cases. KDE's Semantic Search is at a point where it has become a core part of our infrastructure. It is now well positioned to provide the required robustness and functionality.

Article contributed by Vishesh Handa (KDE Search project maintainer), Stuart Jarvis, Aaron Seigo and Jos Poortvliet

Comments

by Gamaliel (not verified)

Thank you for all the hard development effort. I've used the semantic infrastructure to be more productive in everyday life, and since its inception I knew it was the way to go. It helps me catalog my research inventory; it helps me track my files throughout the different stages of their creation, including drafts by date and e-mails associated with those drafts. It served me once to deliver an "instant presentation", by finding all my tagged diagrams and charts in one go; all I had to do was press "slideshow" in Gwenview to make it a visual. The most intriguing thing is its dimensionality, since I usually can organize files only by one dimension, then rely on versions to index the files by date, etc. With Search, I only have to press a folder icon and get a visualization by time, by type, and so on. I don't get the critics who say this search has no added value, when it is one of the highest added values of the KDE desktop for my work.

by Foo (not verified)

> KDE's implementation of Semantic Search made it possible to search for all pictures - taken in - a particular place.
How do I do that in KDE 4.12? How will I be able to do that in KDE 4.13?

by tracyanne (not verified)

You have to create metadata for your pictures; if you don't do that, semantic search works no better than any other search. So if you are the sort of person who is not good at that sort of tedious stuff, semantic search won't be very effective.

Biggest issue with Nepomuk has always been that application authors didn't fully make use of its abilities. Digikam and Gwenview for example would have to try and add tags like these automatically (digikam actually has functionality like this, afaik) but Nepomuk's performance and stability issues often diminished enthusiasm for using this.

With this new generation we hope that while we diminished some of the technical possibilities, the practical benefits will actually mean that we 'actualize more of its potential' (sorry for the buzzwords) ;-)

by anonymity is great (not verified)

This is great! I like the continuous effort to make the semantic desktop work better and faster. I look forward to using your improvements.

If someone would now start to improve akonadi in the same way, then maybe kdepim will one day become usable again. I cannot recount all the times I had to reintroduce my notes in kjots, my correspondents in kaddressbook and my email accounts in kmail because akonadi decided to screw things up again. Nowadays kmail is running extremely slowly again. The day that I decide to no longer support this slowness and I have to reset everything again, I will reset it in another program (probably thunderbird). I am sick of such basic things like email, addressbook and notes not working properly (knowing that these things worked properly in the 1990s, which is 20 years ago!) because of akonadi.

Actually, Akonadi has been getting a large number of improvements in the last few releases and it was Nepomuk that was holding it back. The Search team has worked closely with the KDE PIMsters to let the two help each other out far better and the most noticeable improvement will probably be in KMail. So, good news for you and other KDE PIM users ;-)

by Joe (not verified)

Akonadi continues to ruin Kmail. 3/4 times I start it up, it fails with akonadi errors. Every...single...workaround has been tried...

I'm sorry it has been a bad experience for you. If you ever find a groupware suite which covers the ground KDE PIM does and works perfectly for everybody, I'd love to hear about it, especially if it is open source!

by Heikki Välisuo (not verified)

I have faithfully used KDE PIM in the belief that the next version will work better. KDE PIM seems quite attractive. It promises to cover a large ground, but then in practice it does not.

Does it work (perfectly) for any one?
It is far too easy to lose mails, calendar events, notes, contacts, tags of emails ...
The search works sometimes and then does not work anymore.
Once I had useful tag buttons on the toolbar to tag my mail. Not anymore. Disappeared.
Filtering. Worked. Then did not. Then worked. Not working.

There is all kinds of advice on how to recover one's data, but repeating it over and over again, sometimes with less, sometimes with more success, gets annoying.

Most software evolves so that users report bugs and after some iterations the problem is solved. KDE PIM seems to be immune to this type of cure. My feeling is that there is something basically wrong in the development process. I guess the truth is that the NEPOMUK stuff and everything is a very challenging thing to implement.

Semantic search is fascinating and would be useful if it worked.
I still only occasionally find the search function that uses the index of the contents of my files. Actually I still use recoll, which is always there when I call it.

I am grateful for people working on open source software and having the enthusiasm to develop fancy new stuff. Usually I find it exciting to use experimental software. I sort of feel like part of the project even if the best I do is to send a couple of bug reports a year.
However, for some reason KDE PIM does not give me this feeling. It feels as if users are only an annoyance to be ignored. Maybe it is because NEPOMUK is such a challenging research project.

Rationally thinking I should move to some simple mail client but I am still waiting for the next version of KDE PIM and trying to figure out when it is time for me to start testing it, because it sounds as if all the problems will be solved in the next version. Anyhow, I only lose some personal mail, I do not have that many appointments in my calendar, I write my notes in a text file etc. Honestly I do not have to search anything that often.

KDE PIM as part of 4.13 works perfectly for the most tested/supported scenarios like IMAP (especially with a Kolab server) and server-side filtering. But due to the enormous complexity of the problem, such as the many variations of IMAP and POP servers and all the issues they each have with the protocol, not every combination of everything works perfectly. Part of the issue is that the developers simply don't USE every combination. However, massive numbers of bugs have been fixed in the last 2 releases and we're getting much closer to a 99% solution. 'keep trying' might indeed be the best advice, even though it isn't satisfying...

by Luke-Jr (not verified)

It may not work perfectly, but KMail from KDE 4.4 was far far better than any version since. I continue to use it with much success. And now that Pali has forked it as "kdepim-noakonadi", I am happy to finally once again have working address book and calendar software (that is, pre-Akonadi KAddressBook and KOrganizer).

Why not leverage what Akonadi does and use MySQL? SQLite sometimes struggles to scale up and we already have a MySQL process running for Akonadi, so it seems like an obvious choice. Unless Akonadi is thinking about changing its architecture?

The issue with SQLite is that it simply isn't capable of handling the load Akonadi needs. But most of the performance-related issues in Akonadi have been taken care of in the last 2 releases (with more coming in 4.13) and the last remaining problem was actually Search. Which, due to the work written about above, has now been addressed. You won't recognize KMail 4.13 ;-)

There is still room for further improvement (isn't there always?) but it should certainly be on par with the 1.x series part of KDE 3.

by Kanenas (not verified)

"The issue with SQLite is that it simply isn't capable of handling the load Akonadi need"

IMHO, this is not true. In our database group, we have found SQLite to run circles around MySQL under similar query loads (processing ~0.5 TB of data). You have to design your DB schema correctly, use the correct *covering* indexes and optimize the plans of your queries.

In general, MySQL is faster with simplistic queries. MySQL also caches query results so it *appears* to be faster. SQLite doesn't have a query cache, but in more complex (properly designed) queries it can fully compute a query's answer from scratch at the same speed as MySQL using its internal query caches.

Finally, take into consideration that MySQL still plays it fast and loose with the ACID properties of its transactions (meaning you can corrupt your DB), whereas SQLite's ACID integrity is *better* than Oracle's.

I don't want to go into more details concerning the quality of MySQL from a DB design perspective, but compared to SQLite's quality it isn't even a comparison. SQLite is a considerably better transactional-relational DB than MySQL. Also, SQLite's C code is in my top 5 "work of art" quality C codebases.
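
The covering-index point in the comment above can be illustrated with a small sketch using Python's built-in sqlite3 module. The table, column and index names are made up for the example and have nothing to do with Akonadi's or Baloo's real schema. The index contains every column the query needs, so SQLite can answer it from the index without touching the table, and EXPLAIN QUERY PLAN reports this.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE mail (id INTEGER PRIMARY KEY, sender TEXT, subject TEXT, body TEXT)"
)
# The index holds both the filter column (sender) and the selected column
# (subject), so the query below never needs to read the table rows.
db.execute("CREATE INDEX idx_mail_sender_subject ON mail (sender, subject)")
db.executemany(
    "INSERT INTO mail (sender, subject, body) VALUES (?, ?, ?)",
    [
        ("alice@example.org", "Draft 1", "..."),
        ("bob@example.org", "Minutes", "..."),
    ],
)

query = "SELECT subject FROM mail WHERE sender = ?"
params = ("alice@example.org",)

# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = " ".join(row[-1] for row in db.execute("EXPLAIN QUERY PLAN " + query, params))
subjects = [row[0] for row in db.execute(query, params)]

print(plan)
print(subjects)
```

Adding the body column to the SELECT list would make the index non-covering, because body is not stored in it.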

The problem was not SQL performance per se but issues with transaction-related locking in a heavily multithreaded access environment.

At some point SQLite implemented concurrency safe transactions using a global lock. That might have changed since then.

Obviously, help with improving the way Akonadi deals with MySQL is very welcome, I'm sure. Honestly I'm not very much convinced that it is possible to make SQLite performant - I just updated my ownCloud installation to MySQL and it's 1000x faster - there, too, SQLite just couldn't keep up. But that might of course also be due to incorrect usage. Perhaps, then, the SQLite developers need to think about making their tool a little more capable of working with real world code instead of only highly optimized code... ?

One reason could be that they didn't want to have to support multiple backends. Since Akonadi server can be configured for other DBs than MySQL, Baloo would have to be able to do the same in order to use the same backend. And in case of Akonadi being configured for SQLite, it would probably not even work to use the same file.

Ah, I didn't realize Akonadi supported multiple backends.

Why does Akonadi support multiple backends? :D

Could be a big benefit if KDE just picked one solid database and used it everywhere.

People asked for it, so they got it ;-)

The curse of KDE, I suppose, giving users what they want and then getting yelled at :(

As Jos said, some people really wanted it. There are PostgreSQL die-hards that will not allow any MySQL on their systems :)

But it was also needed. MySQL was chosen as a default because it supported the required features best at the time of evaluation. But in order to evaluate different backends, code was needed to make it run with different backends. Also, as part of a project to make Kontact mobile run on a rather restricted WinCE device, the need for SQLite support arose again.

by Bruno Friedmann (not verified)

However good that database is, I'm using a centralized PostgreSQL server for all my needs. So I don't have several DB daemons running on multi-user computers, and I am able to fine-tune the DB :-)

So not using MySQL is the right choice, at least for me.

This is why the DB backend is configurable :-)
PostgreSQL being one of the options supported by the code.

by Hans Bezemer (not verified)

Ok, the first generation has just failed, users are scrambling to disable it. Lots of articles are popping up with in-depth analysis of how it works and how to untie it from the KDE4 desktop - and you're going for a second try? It eats resources, not to mention valuable processor time; wouldn't it be smart to make it OPTIONAL this time?

You seem to be misunderstanding what search is, what it is for and what has happened. Yes, it has gone through its share of issues, which have been addressed by now. Of course it eats resources, keep your computer off if you don't want it to. It has always been optional to some extent, but with better performance come more applications ditching their own search implementation and using the KDE-wide implementation. Search is something a large number of applications have: Amarok, Digikam, KDE PIM of course. You use it all the time. Having a centralized solution, in theory, not only saves developers work but also has the potential to enable new use cases and cover the existing ones better. It has taken a painfully long time to actually deliver on those promises but any new, complicated technology goes through a period of maturation. Moreover, we unfortunately don't have the resources big companies do and while they can spend extra resources on problematic technologies, in Free Software the rants from users actually have the opposite effect of demotivating developers from fixing the problems.

And unfortunately, people are rarely limited by their lack of knowledge when they rant about technologies so a lot of good, innovative technologies in Free Software go nowhere because people just don't realize their potential nor how counter-productive and pointless their 'input' is.

by Sander van Grieken (not verified)

I also have to say that many people, IT people even, seem to not see the potential that is in the semantic desktop. The 'it uses resources' argument is just silly with today's hardware, especially if the alternative is a 'grep -R .. /', which if run the third time has already used up more disk wear and electricity than an indexed solution would.

I agree though that the user experience and QA on quite a few occasions could have been handled a lot better, although I also see that that responsibility would partly land on the distributions as well.

by Glenn Holmer (not verified)

"Of course it eats resources, keep your computer off if you don't want it to."

That's just irresponsible. A search system like this should always be optional, and any program that breaks without it (instead of putting up a warning dialog about reduced functionality) is broken. No matter how well it performs, there are going to be users who choose not to have the files on their machine indexed, and KDE should respect that.

I recall seeing an option to turn it off in system settings

I just updated to KDE 4.13 on Arch and just like for the last 3 major updates akonadi stopped working.
Just to be sure on my notebook, I looked for a way to disable baloo but the configuration options in Systemsettings are minimal and unlike for previous versions there is no clear way to disable it. The solution seems to be to add the home folder to the exclusion list of folders ( which shows no folder name for me, just the folder icon? ). This is not obvious.

by Renato (not verified)

Congratulations, you are doing a great job. It's very important to make clear that semantic search is actually a research project and not a solved problem. And, while it's sad that nepomuk didn't make it, the learning process made it possible to build this new infrastructure.

In the end the money spent on nepomuk wasn't thrown away as some news sites tried to imply.

by Sven Brauch (not verified)

In the end, no money was spent on KDE's implementation of nepomuk, different from what some news sites tried to imply ;)
The money was spent on a research project which KDE's implementation built on. That project was totally unrelated to KDE and KDE received none of the money.

by Mathias (not verified)

so the next generation of "KDE Middleware" will still use sqlite?
Will it still give the user the option to use a "real" database instead? sqlite is *not* happy when you run it on db files that are stored on nfs shares... like, any user home directory in a larger environment is...

by Martin Steigerwald (not verified)

So then I wonder why Iceweasel works just nicely on my NFS based home at work. It uses SQLite3 databases as well. Heck, even MySQL based Akonadi works on it. Maybe it's due to using a NetApp FAS with its ONTAP operating system on the server side, but I highly doubt this.

by Richard Z. (not verified)

does not look like rocket science, does it work? Could it be added?

Does the new KDE search engine, Baloo, support morphology search? If not, are there any plans to add morphology search to future KDE versions? An open source example of a working morphology search engine is Sphinx - http://sphinxsearch.com - maybe we can integrate it into KDE Desktop search?

The way Baloo is designed is that it can use any database and search tech - it simply depends on an application implementing it. So yeah, if there is any usecase for it, it would be used... But, unlike Nepomuk, Baloo is explicitly designed to be NOT a solution looking for a problem - which is what implementing Sphinx without any need for it would be ;-)

For working morphology search in KDE Baloo we need to normalize each word, i.e. add each word to the Baloo database in its canonical form - http://en.wikipedia.org/wiki/Text_normalization
After this, apply the same normalization to the search query string before searching.
As a result, we will find all documents that match the words in the query in any word-form.
This is not too hard to implement, but we need to select a library that can do the "Text normalization" process.
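
The two steps described above (normalize at indexing time, then normalize the query the same way) can be sketched in Python as follows. This is not Baloo code. The normalize function is a deliberately crude, hypothetical stand-in for a real text normalization library and only strips a few English suffixes.

```python
from collections import defaultdict


def normalize(word):
    # Crude stand-in for a real stemmer or lemmatizer: lowercase the word and
    # strip one of a few English suffixes. Real morphology needs a proper library.
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


index = defaultdict(set)  # canonical form -> documents containing any word-form


def add_document(name, text):
    # Step 1: store every word in its canonical form.
    for word in text.split():
        index[normalize(word)].add(name)


def find(word):
    # Step 2: the query goes through the same normalization as the indexed text.
    return sorted(index.get(normalize(word), set()))


# Made-up sample documents.
add_document("a.txt", "Searching indexes")
add_document("b.txt", "search index")
```

With this in place, a query for "searches" or "indexing" matches both documents, because all word-forms collapse to the same canonical key.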

At the moment, if I have the files "1.txt", "2.txt" and "1 2.txt" (yes, with a space) in a folder and I enter "1 2" as the search string, I see ALL the files ("1.txt", "2.txt", "1 2.txt") instead of just one ("1 2.txt"), as if search works with OR instead of AND.
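
The difference between the two behaviours can be reproduced with a small Python sketch over the same three file names. It only illustrates OR versus AND term matching. How Baloo actually tokenizes names and combines query terms is not established here, and the tokens and search functions are hypothetical.

```python
import re

files = ["1.txt", "2.txt", "1 2.txt"]


def tokens(text):
    # Split on anything that is not a letter, digit or underscore.
    return set(re.findall(r"\w+", text.lower()))


def search(query, mode):
    terms = tokens(query)
    # OR: at least one term must match; AND: every term must match.
    combine = all if mode == "AND" else any
    return [name for name in files if combine(t in tokens(name) for t in terms)]


print(search("1 2", "OR"))   # all three files
print(search("1 2", "AND"))  # only "1 2.txt"
```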