FOSDEM 2005: Desktop Search Interview

Wednesday, 23 February 2005 | Jriddell

The schedule for the KDE developers room talks at FOSDEM is now online. Our final interview with the speakers is with Scott Wheeler who will be giving a talk titled "KDE 4: Beyond Hierarchical Data, The Desktop as a Searchable Web of Context". FOSDEM is this weekend, see you there.

Please introduce yourself and your role in KDE.

I feel like I've been asked this question enough times that I should have an exciting answer by now. But well, I wrote JuK and TagLib as well as a couple of other small applications in KDE CVS and do some work on a handful of things in kdelibs and elsewhere across KDE.

What kind of search capabilities do you think a modern desktop should have?

Well, I think I'd like to step back a bit first and look a little at the problem — and the problem isn't a lack of a search tool, the problem is that it's hard to find things. Search tool or no, all of the ideas flow from the idea of solving the problem rather than just creating a new tool. So, in a sense, I don't think a modern desktop should have a search tool; I think a modern desktop should make it easy to find stuff — we're then left with how to get there.

And I suppose with all of the buzz around search tools these days people have a much more concrete idea in mind when they hear about searching on the desktop. But such wasn't the case when I started kicking these ideas around a while back. Spotlight was announced a few days after I'd submitted my abstract for the KDE Developer's conference, Beagle was relatively low profile, Google for the Desktop and its successors hadn't entered the scene yet, etc.

So, I think — fundamentally "what sort of search should the desktop have" is almost the wrong question. "How should we make it easier to work with the data we accumulate on the desktop?" is closer to the right question. I think search is just part of the answer.

Where did the idea of integrating a search capability throughout KDE come from?

Well, a few things actually. It mostly came from not being able to find things and asking some fundamental questions about how we organize and access information on the desktop. The first step — and this is tied up with the first part of the name of both this talk (which is related to the one that I gave at Linux Bangalore) and the one at the KDE conference this summer — is that hierarchical interfaces simply don't make sense in a lot of cases.

When I started looking around for examples of how this had played out in other domains of information, the most obvious example was the World Wide Web, where we've already moved from hierarchical interfaces to search based interfaces. It seemed logical that we could learn from that metaphor.

On the technical side of things I'd just written the listview search line class (used in JuK and now fairly prevalent in KDE), which makes filtering of information in lists much easier, so that played into things too.

What do you think of other search tools such as GNOME's Beagle and Google's Desktop Search?

Well, they're fundamentally different in scope. Again, right now the term "desktop search" actually means something; that wasn't really true when I started working on these ideas this summer. So while there are some things in common, they're really pretty different approaches.

Beagle, Spotlight, Google for the Desktop, and their relatives are more interested in static indexing and search through that information. That's kind of where I was at conceptually early this summer when I coded the first mock-up. Since then however the ideas have moved on quite a bit and I think we've actually got something rather more interesting up our proverbial sleeves. (I should note however that I think the Beagle group is doing fine work, but it's something pretty different from what I'm interested in.)

The first difference is that this is a framework, not a tool. Beagle has some elements of this, but it's still not integrated into the core of the desktop. Google for the Desktop is mostly just a standalone tool from what I know of it. Honestly I think it's really below the level of innovation that I tend to expect from Google.

What we're now looking for in the KDE 4 infrastructure is a general way of linking information and storing contextual information — that information can come from meta-data, usage patterns, explicit relationships and a host of other places.

There won't be a single interface to this set of contextual information; we'll provide some basic APIs for accessing the components in KDE applications, but we're quite interested in seeing what application authors will think to do with it. Really I think they'll surprise us.

We're looking at everything from reorganizing KControl to make search and related items and usage patterns more prevalent to annotating mails or documents with notes to reworking file dialogs. Really the scope is pretty broad.

Do you think Free Software solutions from KDE and GNOME can compete with the likes of Google and Microsoft?

Sure. I mean — I don't think the ability to compete with commercial players is significantly different with desktop search than it is with other components of the desktop. And honestly I think we've kind of got a head start here.

Has there been any progress on planning or coding search into KDE yet? Is anyone helping you? What problems are you facing?

There have been a number of cycles through some API and database design sketches. But right now we tend to write code, and as soon as it's done we realize the flaws in it and start rewriting. This will probably continue for a while, but I think we'll be able to have something pretty useful in KDE 4.

There are a number of folks involved in discussion of these issues from various sub-projects inside of KDE. Thus far it's been mostly myself and Aaron Seigo banging on the API, but others have contributed to the discussions.

I think the biggest problem that we're dealing with is moving from the abstract set of ideas that we're working with into real APIs — trying to keep things general enough to stay as extensible as we'd like them to be, but not so lofty that they're convoluted and useless.

What technologies do you plan on using, e.g. which database?

Well, we've gravitated towards Postgres, but mostly because of licensing. Other than that, well, uhm, we're using Qt. The Qt 4 SQL API seems much improved, so I've kind of been mentally stalling on really finishing up the current code until I can just work with that since otherwise everything would just have to be rewritten in a few weeks.

Is the KDE search tool likely to be cross-desktop compatible so we could have a common base with GNOME?

Well, again, this really isn't about a "KDE search tool" -- and the chances of it being GNOME compatible out of the box aren't particularly high. That said, as the data store will just be a Postgres database and ideally we won't have to use too many complex serialized types, there wouldn't be a reason that a GNOME frontend couldn't be written. But generally speaking I'd like to get the technology laid down and then see if we can convince others to adopt it rather than the other way around.

What does the project need most now?

Time. And I mean that in a few ways -- we need time to finish fleshing out the ideas, time to let the stuff mature inside of KDE and, well, the couple of us working on it could use more time for such. But really, as most of the framework for things like metadata collection and whatnot is already inside of KDE, this won't be a huge project from the framework side. What will take a good while will be porting over applications to use it where appropriate.

Comments:

Video of presentations? - LB - 2005-02-23

Is it possible to create a video of the KDE-related presentations? Unfortunately I'm not able to go to FOSDEM, but I'm very interested in the presentations.

Choice of database - AC - 2005-02-23

Has anyone looked at Derby? http://incubator.apache.org/derby/ It is licensed under the Apache license, supports ANSI SQL, is portable (pure Java), has a small footprint and is supposedly very easy to use. So it seems to fit the bill perfectly.

Lucene? - Michael Schuerig - 2005-02-23

Have any of you had a look at the Java search framework Lucene (http://lucene.apache.org/java/docs/index.html) and its C++ port, CLucene (http://sourceforge.net/projects/clucene/), in particular? Lucene is an excellent, sophisticated and yet easily usable framework for indexing and searching. It might be usable as is or for inspiration only. Michael

Re: Lucene? - Claes - 2005-02-23

Definitely agreed. Also note that Beagle uses a C# port of Lucene. Lucene's index format is well documented, and there are also ports to languages like Python (http://pylucene.osafoundation.org). It supports a flexible query syntax and many natural languages. I think the best choice KDE could make is to choose Lucene as its index format.

Re: Lucene? - Scott Wheeler - 2005-02-23

Lucene is a fine tool, and CLucene as well, and it's something that I've looked at, but part of what I was trying to indicate in the interview is that we're not working on a "search tool" -- search is just one of the things that we'll be using it for. Lucene is well set up for static document indexing, but isn't particularly useful for a graph based contextual web. This is kind of a problem right now that we didn't have when these ideas were hatched -- people have ideas in mind for what "desktop search" is and that's not really what we're working on. The questions even kind of indicated that -- if I'd done an interview on this stuff last June or so the questions would have been very different because that was before all of this stuff kind of burst into the mainstream. KDE already has a plugin based metadata layer; extending that to where it can extract the needed information is likely the direction that things will move.
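To make the "graph based contextual web" idea a little more concrete, here is a minimal sketch of how typed links between pieces of information could be kept in two relational tables and walked with a join. Everything in it is an assumption made for illustration: the table names `node` and `link`, the relation names and the sample rows are hypothetical and are not the schema Scott and Aaron are working on. Python's bundled `sqlite3` module is used only so the example runs without a server; the discussion here is about Postgres.

```python
import sqlite3

# Hypothetical schema: a "node" is anything worth linking (a file, a mail,
# a contact, a note) and a "link" is a typed, directed edge between two nodes.
SCHEMA = """
CREATE TABLE node (id INTEGER PRIMARY KEY, kind TEXT NOT NULL, label TEXT NOT NULL);
CREATE TABLE link (
    source   INTEGER NOT NULL REFERENCES node(id),
    target   INTEGER NOT NULL REFERENCES node(id),
    relation TEXT NOT NULL
);
CREATE INDEX link_source ON link(source);
CREATE INDEX link_target ON link(target);
"""


def build_example(conn):
    """Create the tables and load a handful of made-up nodes and links."""
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO node VALUES (?, ?, ?)", [
        (1, "contact", "Aaron"),
        (2, "mail", "API sketch, round 3"),
        (3, "file", "context-api.h"),
        (4, "note", "rework the link types"),
    ])
    conn.executemany("INSERT INTO link VALUES (?, ?, ?)", [
        (2, 1, "sent-by"),
        (3, 2, "attached-to"),
        (4, 3, "annotates"),
    ])


def neighbours(conn, node_id):
    """Return (relation, label) for every node linked to node_id, in either direction."""
    rows = conn.execute(
        """
        SELECT l.relation, n.label FROM link l JOIN node n ON n.id = l.target
        WHERE l.source = ?
        UNION
        SELECT l.relation, n.label FROM link l JOIN node n ON n.id = l.source
        WHERE l.target = ?
        ORDER BY 2
        """,
        (node_id, node_id),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
build_example(conn)
print(neighbours(conn, 3))
```

Asking for the neighbours of the file returns both the mail it was attached to and the note that annotates it, which is the kind of context a document index alone does not hold.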

search tool ?? - Jakob - 2005-02-23

And what is it all about? A kind of dashboard?

Re: search tool ?? - Pat - 2005-02-23

I think it's only gonna be an API available to every app. So if you're writing a dashboard app you could use the API in it or if you're writing a mediaplayer u could use the API for your playlist etc... it's neither a tool nor a piece of software, just some API usable by any KDE app.

Re: search tool ?? - Christian Loose - 2005-02-23

Scott's talk at aKademy "Beyond Hierarchical Data: Search and Meta Data as Fundamental Interface Elements" http://conference2004.kde.org/sched-devconf.php (Slides, Transcript, Audio, Video)

Search API progress? - ac - 2005-02-23

Do you do the discussion on some particular mailing list or do you have some place where you show the current drafts (wiki?) or is a proof of concept already available in one of the numerous kdenonbeta modules? I'd think catching all the metadata kfile reads out for ages already would be an excellent start. ;)

why postgres? - Pat - 2005-02-23

does this mean we'll all need a full installation of PostgreSQL? isn't that a bit heavy? why not sqlite or mysql (i know that they're GPL (not sure about sqlite) but they're faster and lighter than postgres, which is great but maybe a bit too much). Just because MS is going to use some kind of reworked sqlserver with winfs on longhorn doesn't mean we should do the same with postgres :)

Re: why postgres? - Haden - 2005-02-23

I'm surprised too, sqlite is very small and is in the public domain.

Re: why postgres? - Scott Wheeler - 2005-02-23

Well, Postgres is the third database that I've tried. I did the original mockup prior to the KDE conference this summer with SQLite and the performance for the type of queries that we're doing was so bad that it simply isn't an option. Also SQLite is really only designed to be used from a single process, so we'd have to implement locking and multi-user access in a daemon on top of it, which, well, at that point you're just implementing a database server, really, but without the performance of more robust databases. So I then ported that mockup to MySQL, which performs fine, but is GPL'ed rather than LGPL'ed. As performance is similar for MySQL and Postgres, but Postgres has more flexible licensing (i.e. suitable for use in things linked to kdelibs) Postgres wins there. WinFS is something completely different. It's what's called an Attribute Value System or object database -- and it's only being used for "My Documents", not the complete FS. What we're working on isn't something that's going to replace parts of the FS, it will supplement it with contextual information.

Re: why postgres? - Marco Menardi - 2005-02-23

Have you ever tried FirebirdSQL? It's a true database server, ACID compliant, multiplatform, its license is a modified version of the Mozilla Public License (so unfortunately no pure GPL), it takes really few resources and is developed by an active community: http://www.firebirdsql.org/ "Firebird is a relational database offering many ANSI SQL-99 features that runs on Linux, Windows, and a variety of Unix platforms. Firebird offers excellent concurrency, high performance, and powerful language support for stored procedures and triggers"

Re: why postgres? - AC - 2005-02-23

MPL isn't GPL compatible; there's not even a reason to look further. We're going to have to use the client libraries for these databases. I don't see any reason not to use Postgres; there are loads of ways that all of this could be done, but the datastore is the boring part and Postgres fits our requirements. I don't really see a reason to look further at the moment.

Re: why postgres? - Scott Wheeler - 2005-02-23

Err, that was me. Konqueror is having fun with my cookies.

Re: why postgres? - superstoned - 2005-02-23

cookies are nice :D

Re: why postgres? - Marco Menardi - 2005-02-24

I've never heard of problems with FirebirdSQL and GPL, since you have to connect with the database, not include its code. Maybe their "modified" MPL is modified enough to be able to interface with GPL programs ;) But I'm not a lawyer. I was replying to your message about PostgreSQL vs MySQL, and I'm sure you will be surprised by Firebird's performance, low footprint, high stability, low maintenance needs, etc. So mine was just a suggestion, since if you "don't see any reason not to use Postgres", I don't see any reason to use Postgres instead of Firebird ;) So it would be great if you could have a look at Firebird as well :) thanks

Re: why postgres? - Ian Monroe - 2005-02-24

Well, I'd assume the client libraries for FirebirdSQL are in MPL as well. You do have to link against those.

Re: why postgres? - Marco Menardi - 2005-02-24

Well, I use Firebird and, at the moment, my program is in Delphi under Windows, so I'm not an expert and I don't know what "client libraries" you need. I know, for instance, that the JCA/JDBC Driver: http://www.firebirdsql.org/index.php?op=devel&sub=jdbc is distributed free of charge under the GNU Lesser General Public License (LGPL). I don't know about the ODBC/JDBC Driver http://www.firebirdsql.org/index.php?op=devel&sub=odbc or the ADO.NET Data Provider http://www.firebirdsql.org/index.php?op=devel&sub=netprovider but the license is probably included in the package you can download. In any case, I suggest you visit the main site, http://www.firebirdsql.org/, where you can probably find out more. Thanks again

Re: why postgres? - brockers - 2005-04-15

So KDE is going to choose a database that is LGPL because of licensing restrictions from using a straight GPL'ed database when we use a GPL'ed toolkit to do EVERYTHING related to Qt? I hope this is not really the case because it would seem to lend credibility to the Gnome-huggers who say that KDE is worthless as a general purpose business desktop because Qt is GPL'ed and not LGPL'ed. If we can use Qt in KDE why not MySQL? Don't get me wrong, I don't have a problem with Postgres, I just hate to see KDE use the same argument against something that has been leveled against us.

Have you looked at MetaKit? - Mr. Fancypants - 2005-02-23

(See http://www.equi4.com/mkdocs.html) It seems to support most of the usual relational operators without the yuckiness (quoting differences, various conventions for parameters, non-standard syntax, etc.) of SQL. There also seems to be support for concurrent reading and writing (maybe only 1 writer at a time, I'm not sure) and it looks quite mature.

Re: Have you looked at MetaKit? - ac - 2005-02-23

Please stop listing existing databases. I'm sure that Scott and Aaron know that there are many DBs out there.

Re: Have you looked at MetaKit? - anon - 2005-02-24

We're using it in Akregator now. MetaKit is very good.

Re: Have you looked at MetaKit? - Aaron J. Seigo - 2005-02-24

as an embedded database it looks good. but we don't need an embedded database, we need a database that can be accessed simultaneously and, preferably, over a network. the TODO list for this project is already big enough without adding "write a scalable RDBMS" ;) once the first edition is out using an external RDBMS then perhaps all the data storage fans can swoop in with their super dooper file systems and coolio ultra-tiny database-like engines and experiment/optimize that area of the software. but it's not the interesting nor a critical part of the project. =)

Re: Have you looked at MetaKit? - Pat - 2005-02-24

maybe u could make the data storage part "pluginable" so that developers could easily implement different db backends à la kexi or even like that damn amarok while you could focus on the postgresql part :)

Re: why postgres? - Carlo - 2005-02-23

Oh well, please stay with Postgres and don't give a flying fart on those wanting sqlite or some java based db. Very good choice!

Re: why postgres? - muesli - 2005-02-24

hey scott! we (amarok) have dealed with quite all sqlite issues you could imagine ;-) so, let me assure you, that i still believe it could be used, but some simple things have to be done: a) you can set a threadsafe variable in the Makefile. this will help improve the situation! b) why not use a singleton interface for searching? this way you can take care of the locking easily. if you need code: look at collectiondb.cpp/.h in amarok/src. there is even a connection pool n everything. that is also useful for other db interfaces, even though i dunno what qsql offers wrt this issue. c) most sqlite issues can be easily solved by setting proper indices. when you do that, sqlite is barely slower than mysql. although, one must admit that there are more situations where sqlite is not able to use an index at all. anyways, mysql or postgresql is to have imho. reconsider sqlite! it's worth the little hassle. regards, muesli

Re: why postgres? - muesli - 2005-02-24

"anyways, mysql or postgresql is to heavy imho. reconsider sqlite! it's worth the little hassle." s/have/heavy ;-) ...muesli

Re: why postgres? - Scott Wheeler - 2005-02-24

> we (amarok) have dealed with quite all sqlite issues you could imagine

Honestly I doubt it. amaroK is a very simple application (in database terms) relative to what we're working on. A music player with a couple MB of data is a very different beast from trying to store and query graphs in a database that may easily grow to a few hundred MB.

> why not use a singleton interface for searching?

Because that's only useful for one process.

> most sqlite issues can be easily solved by setting proper indices

Not for heavy use of cross table selects on interrelated values. I literally had several queries that took 15 minutes on SQLite that were done in less than 1-2 seconds on PGSQL or MySQL. And that was just on a 10 MB test database.

I'm not saying that the limitations of SQLite couldn't possibly be worked around, I'm just saying that there's no compelling reason to work around them when there are better databases available that already solve these problems. SQLite also locks the entire database on write, which just isn't acceptable in a tool used frequently by multiple processes.

Basically, as Aaron already said -- it might be possible to work around all of the issues with SQLite. But in the end we'd just be implementing the features that other databases already provide and we'd need a daemon process anyway to handle communication with it, which, well, makes no sense.
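The whole-database write lock mentioned above can be seen with a short sketch. It assumes the SQLite library bundled with a current Python, in its default journal mode, which is not the 2005 version under discussion; the table names and the helper `second_writer_is_blocked` are made up for the example. Two connections stand in for two processes: while the first holds a write transaction, the second is refused even though it writes to a different table.

```python
import os
import sqlite3
import tempfile


def second_writer_is_blocked(db_path):
    """Return True if a second connection cannot write while the first holds a write transaction."""
    # isolation_level=None puts the sqlite3 module in autocommit mode, so
    # BEGIN and ROLLBACK are issued explicitly below.
    first = sqlite3.connect(db_path, isolation_level=None)
    first.execute("CREATE TABLE IF NOT EXISTS links (source TEXT, target TEXT)")
    first.execute("CREATE TABLE IF NOT EXISTS notes (body TEXT)")

    # timeout=0 makes the second connection fail at once instead of waiting.
    second = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    try:
        first.execute("BEGIN IMMEDIATE")  # take the write lock
        first.execute("INSERT INTO links VALUES ('mail', 'contact')")
        try:
            # A write to a *different* table is still refused: the lock
            # covers the whole database file, not one table or row.
            second.execute("INSERT INTO notes VALUES ('hello')")
        except sqlite3.OperationalError:
            return True
        return False
    finally:
        if first.in_transaction:
            first.execute("ROLLBACK")
        first.close()
        second.close()


with tempfile.TemporaryDirectory() as tmp:
    print(second_writer_is_blocked(os.path.join(tmp, "context.db")))
```

A client/server database handles this arbitration between writers itself, which is the point being made about not reimplementing it in a daemon on top of SQLite.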

Re: why postgres? - Carlo - 2005-02-24

I'd rather like to see Amarok integrate into such a new framework, instead of every application having its own db. E.g. to access covers and the corresponding metadata with Digikam, without having to add a new album in digikam and the metadata stored twice. If it would be possible to specify something like "apply <filter> on amarok:covers which are smaller than (x,y), jpegs and greenish, pipe result to <dialogX>,[...]", without having joe user to think about what the hell he is doing... ;)

Re: why postgres? - Bryan Feeney - 2005-02-23

SQLite would be really great. Would it be possible to specify a list of directories to be indexed/monitored, with the option to recurse through the directory structure and make it available for searches by other users? E.g. I would have /home/bryan, and mark it recursive and hidden. As root I would add /usr/share/music (all my oggs etc., shared between users), and mark it as recursive and visible to all. Other users do the same. The list of public search folders is stored in /etc/indexeddirs. Each person's private list is stored in ~/.indexeddirs. For each directory have a sqlite DB in the root of that directory called .dirindex.sqlite or something. Then when someone does a search, open and concatenate the private list (~/.indexeddirs) and the public list (/etc/indexeddirs), open all the directories, and search through each? That might be a bit much though, I don't know the specifics of search. Is this being developed using Qt4 or KDE/Qt? It'd be cool if the backend was developed using Qt4, so it would be a nice small dependency that other projects could use.
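The lookup step of this proposal can be sketched in a few lines. This is only an illustration of the comment, not anything KDE implements: the list files ~/.indexeddirs and /etc/indexeddirs and the index name .dirindex.sqlite are the commenter's placeholders, the one-directory-per-line file format is an assumption, and the recursive/hidden flags are left out.

```python
from pathlib import Path

INDEX_NAME = ".dirindex.sqlite"  # placeholder name taken from the comment above


def read_dir_list(list_file):
    """Return the directories named in list_file, one per line; a missing file yields []."""
    path = Path(list_file)
    if not path.is_file():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    return [Path(line.strip()) for line in lines if line.strip()]


def indexes_to_search(private_list, public_list):
    """Concatenate the private and public lists and return each directory's index path."""
    seen = set()
    result = []
    for directory in read_dir_list(private_list) + read_dir_list(public_list):
        if directory not in seen:  # a directory listed twice is searched once
            seen.add(directory)
            result.append(directory / INDEX_NAME)
    return result


if __name__ == "__main__":
    for index in indexes_to_search(Path.home() / ".indexeddirs", "/etc/indexeddirs"):
        print(index)
```

Each returned path would then be opened and queried in turn, so a search costs one database open per listed directory.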

Re: why postgres? SQLite is the way to go! - ac - 2005-02-23

SQLite is the way to go, small, fast, the Right Thing for this task

Re: why postgres? SQLite is the way to go! - Aaron J. Seigo - 2005-02-23

> small,

yes, it is small. pgsql isn't exactly huge however. the postgresql system isn't very large. the rpms on SUSE are ~10MB (includes the docs, stored procedure language and what not) and its memory usage is also pretty good. we're not talking 100s or even dozens of MB of ram.

> fast,

not for the types of queries that are required. and, as Scott mentioned above, this needs to support multi-process access which means locking and the whole bit. sqlite is great for the purposes it was intended for; this isn't one of them. =)

Re: why postgres? SQLite is the way to go! - David - 2005-02-23

There are some issues with using SQLite which I think Scott has gone over above. For what SQLite does it is very, very good, but probably not what they're looking for for the purpose of this. However, if you've ever built KDE you'll know just how many other projects KDE depends on. Postgres is actually small-fry compared to the total.

Re: why postgres? SQLite is the way to go! - ac - 2005-02-24

I really don't care what database backend an application uses, but it would be nice if I didn't have to run 5 different DB servers in the background just to use KDE.

Re: why postgres? SQLite is the way to go! - Ian Monroe - 2005-02-24

This is a very valid point... kdelibs requiring a database (which it sounds like it will be doing) will make things a lot easier for a variety of KDE programs that currently have to come packaged with sqlite.

Re: why postgres? SQLite is the way to go! - Aaron J. Seigo - 2005-02-24

> but it would be nice if I didn't have to run 5 different DB servers in the background just to use KDE

i don't think you will. at most you may have to run one, and even that may well turn out to be optional (at the cost of the features that rely on it).

Re: why postgres? SQLite is the way to go! - Arun Raghavan - 2005-09-03

While my first reaction was, "*Groan*, WhyTF do I need to run a DB server on my PC just to have a good desktop experience", I think this is a good decision. Modern systems will not be loaded excessively by a Postgres server running in the background, and the payoff would be *much* more than worth it. But I do wish it could all be independent of the actual database used. Or at least give a choice between 2-3 popular databases. Maybe eventually, closer to productization, when such pragmatic issues are more important.

reiser4? - me - 2005-02-23

just thinking... I know that using reiser4 should probably not be a requirement to using the !"search tool", but would it make sense to create a reiser4 plugin to be used with your ideas? Maybe you could store some information not in the database, but right with the files. One could argue that this is where the information belongs: in the filesystem. IIRC, Hans Reiser said that whenever you use a database, its because of the shortcomings of your filesystem, and the now-released reiser4 is supposed to fix that.

Re: reiser4? - Pat - 2005-02-23

i think kernel devs want to implement some of reiser4's unique functionalities into the kernel vfs so that other fs can use them and people won't be forced to use reiser4 but that ain't gonna happen anytime soon so I guess we have to wait. I wonder if we'll get it before winfs :) (the real winfs, not the one that will come next year with longhorn).

Re: reiser4? - Aaron J. Seigo - 2005-02-23

> but would it make sense to create a reiser4 plugin to be used with your ideas

seeing as our time is not exactly limitless, i don't think doing two implementations with different storage designs is realistic, especially when very, very few people have the target data storage mechanism (reiser 4 file system) available to them. this is meant to be a practical project rather than a research topic.

> One could argue that this is where the information belongs: in the filesystem

yes, that's a viable argument given that the information you are indexing/linking exists in the filesystem in the first place. this assumption isn't universally true, however. not only is a lot of our data dynamically generated on demand these days, there's also a lot of data that is implied by our usage forming context that isn't a "document" or even necessarily very suitable for storage as a "file" on disk. trivial example: how would you store a personal annotation of a web page in a filesystem based approach? if we wish to do more than just index and search local data, it becomes apparent that the file system is not the catch-all locus of our information anymore.

Re: reiser4? - me - 2005-02-23

well, I'd probably store all annotations in C:\Progra~1\Common~1\Annota~1\HSDS6SB.TMP\ddf7d6s7.txt I guess you're right :)

Re: reiser4? - brockers - 2005-04-15

Dude, you almost made me pee my pants. lol

Re: reiser4? - Simon Edwards - 2005-02-23

> trivial example: how would you store a personal annotation of a web page in a filesystem based approach?

The direction that Reiser is heading is towards a general filesystem/retrieval system. Kind of a mix of a traditional filesystem plus a database, while being very flexible and 'plastic' (i.e. you wouldn't have to define a fixed schema before you could use it). Searching using partial chunks of info being a big part of it. Basically you could build and search almost arbitrary data-structures (on disk). A traditional Unix style filesystem is just one thing that you could make using this kind of system, but much more would be possible. Read, (and try to get your head around, it's hard!), the Future Vision paper at: http://www.namesys.com/

-- Simon

Re: reiser4? - Aaron J. Seigo - 2005-02-24

> The direction that Reiser is heading is towards a general filesystem/retrieval system

yes, it's a very interesting and ambitious goal, one Oracle failed at in the 90s, though for market reasons rather than technical ones.

> Basically you could build and search almost arbitrary data-structures (on disk)

of course, and we're using an RDBMS to do exactly that at the moment. the reason that a file system doesn't offer anything (featurewise) above what the RDBMS does to make it more attractive is that not everything in the necessary "arbitrary data structures" refer/link to things that are local or even storable (e.g. time). it's much more practical and reasonable to require people to install 10MB of software that provides a database engine than it is to require them to reformat their disks and migrate all their data over. Reiser isn't even available for all the platforms KDE runs on. this removes it as a potential target for a practical tool, though it would make a really cool target for a research project.

i do think that applications will drive the success of Reiser4, however. and once we have tools such as what we are building, i wouldn't be surprised if someone worked the storage layer to optionally use Reiser4 to produce something smaller and more performant. at that point there's a real motivator to use these kinds of file systems that goes beyond the theoretical, at which point they become interesting from the point of view of "off the shelf" software users and manufacturers.

> Read, (and try to get your head around, it's hard!), the Future Vision paper at:

i have =) fun stuff... i've been watching Reiser's project for some years now with great interest.

Re: reiser4? - David Neeley - 2006-10-28

Frankly, I believe that SQL databases in general may not be the best solution for this problem. Some years ago, I was one of the investigators into a potential application in which we benchmarked various database solutions. The "post-relational" databases of which the old Pick system was progenitor had performance many times that of the relational ones. In case you are unaware of it, these databases comply with all but one of Codd and Date's Rules of Normalization--the first, that every item must be "unique and atomic." In other words, a single record could store an entire table.

Consider the common introduction to relational databases in which a video rental store is used. One table contains the individual customer records, with a unique customer number. Another table contains the rental records, with one column containing the customer ID numbers. To get a look at rental history for a given customer, then, a join must be performed in which the rental table is searched with each rental by the given customer's ID being extracted to build the list. In a Pick-style database, the individual rental info can be stored directly in the customer table. Extracting rental info merely means looking up that customer record and viewing the ever-expanding rental list. No joins, far less memory, and much more speed!

Another possibility might be a datastore similar to the "Titanium" database of Micro Data Base Systems. In that one, many-to-many relations can be directly mapped as selected by the programmer. This results in a well-written program having even more performance yet. In our tests, the Pick style database ran about 25 times faster than the then-current SQL databases; the Titanium engine about 40 times faster and with much lower system overhead. In short, the only advantage I see for an SQL system is the large number of tools available for them.

David
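The video rental comparison can be put side by side in a small sketch. It only contrasts the two data layouts and makes no claim about the performance figures quoted above. The customer names and titles are invented, the relational side uses Python's bundled `sqlite3` module, and a plain dictionary stands in for a multi-valued, Pick-style record, which is an assumption made for illustration rather than how Pick actually stores data.

```python
import sqlite3


def rentals_relational(conn, customer_id):
    """Relational layout: the rental history is found by joining rental to customer."""
    rows = conn.execute(
        """
        SELECT r.title FROM customer c
        JOIN rental r ON r.customer_id = c.id
        WHERE c.id = ? ORDER BY r.id
        """,
        (customer_id,),
    )
    return [title for (title,) in rows]


def rentals_nested(customers, customer_id):
    """Multi-valued layout: the rental list is stored inside the customer record."""
    return customers[customer_id]["rentals"]


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE rental (id INTEGER PRIMARY KEY, customer_id INTEGER, title TEXT);
INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO rental (customer_id, title) VALUES
    (1, 'Metropolis'), (2, 'Brazil'), (1, 'Stalker');
""")

# Stand-in for records that hold their own rental lists.
customers = {
    1: {"name": "Ann", "rentals": ["Metropolis", "Stalker"]},
    2: {"name": "Bob", "rentals": ["Brazil"]},
}

print(rentals_relational(conn, 1))
print(rentals_nested(customers, 1))
```

Both calls give the same history for a customer; the nested form skips the join, but asking which customers rented a given title then means scanning every customer record.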

Re: reiser4? - Evan "JabberWokky" E. - 2005-02-23

:: One could argue that this is where the information belongs: in the filesystem.

The problem is, KDE targets the nebulous "Unix" as a goal. POSIX does not define much (are ACLs even defined as a standard, or are they an optional part of the standard?).

:: IIRC, Hans Reiser said that whenever you use a database, its because of the shortcomings of your filesystem, and the now-released reiser4 is supposed to fix that.

I agree so wholeheartedly it is difficult to express with words. I think a rich filesystem that is used by apps is as revolutionary as the desktop metaphor. But I also think that KDE is right not to make such a fundamental requirement. As an *optional* way of storing data, on the other hand...

Not to mention that a rich filesystem plus an advanced code rights system (a la jail) will result in a very secure, powerful and stable system - far more abuse tolerant and flexible than has ever been commonly available ("Open all the attachments you want - they run fine, anything they change rolls back, and they can't send Spam").

Re: reiser4? - Aaron J. Seigo - 2005-02-24

> I think a rich filesystem that is used by apps
you hit the nail on the head: it has to be used by applications. if no applications use it, it's a theoretically cool idea with no real world benefits and becomes an unimportant interesting footnote in computing history. this is why instead of targeting a unique storage system or creating one application/tool, we're creating a system that will allow any (and hopefully as close to "all" as possible) application to easily take advantage of these paradigms. the location of the storage, filesystem or database or clay tablet, is an implementation detail with implications for performance and ease of implementation only. it's the application APIs that matter, and which are also missing. to analogize, Reiser and RDBMS's are like X Window: low level technologies that provide a means to accomplish the task (with varying degrees of success); things like what this interview is about are like Qt: a layer that makes application development leveraging the possibilities of the platforms possible. innovation is not just the creation of a new idea, it's the implementation of that new idea in the marketplace.

Re: reiser4? - uddw - 2005-02-23

Reiser4 as is won't fix a lot, except maybe for more efficient storage of small files. If you want to add a plugin for reiser4 you have to recompile your kernel. If you want other people to use it, you have to distribute a kernel patch. If you want to reach a broad audience in finite time you'd better put another layer on top of the FS to get things done. I started playing with reiser4 a while ago to write a plugin to enumerate changed files fast and reliably. But even if I find the time to get it done some day, it would still be hard to make people use it, at least with reiser4's notion of a plugin. If you want to introduce features which completely redefine a filesystem... what can I say... good luck?
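
To make the "another layer on top of the FS" idea concrete, here is a minimal user-space sketch in Python that enumerates changed files by comparing two snapshots of a directory tree. It is not the reiser4 plugin mentioned above and says nothing about how such a plugin would work; `snapshot` and `changed_files` are hypothetical helper names. It also has to walk the whole tree on every pass, which is exactly the cost a filesystem-level solution would try to avoid.

```python
import os
import tempfile


def snapshot(root):
    """Record (mtime in ns, size) for every regular file below root."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            state[path] = (info.st_mtime_ns, info.st_size)
    return state


def changed_files(before, after):
    """Return paths that are new or whose signature differs (deletions are ignored)."""
    return sorted(path for path, sig in after.items() if before.get(path) != sig)


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    first = os.path.join(root, "a.txt")
    second = os.path.join(root, "b.txt")
    for path in (first, second):
        with open(path, "w") as handle:
            handle.write("v1")
    before = snapshot(root)

    with open(first, "w") as handle:
        handle.write("version 2")  # the size changes, so the signature changes
    with open(os.path.join(root, "c.txt"), "w") as handle:
        handle.write("new")
    after = snapshot(root)

    changed = [os.path.basename(path) for path in changed_files(before, after)]
```

Here `changed` ends up listing the modified file and the newly created one, while the untouched file is left out.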

Re: reiser4? - superstoned - 2005-02-23

but if this were an option in KDE 4.0, the Gentoo guys would start using and testing it, then Suse and Lindows (ReiserFS4 sponsors) would test and maybe start using it, and others would follow... thanks to the Gentoo users we don't have much of a chicken-and-egg problem: they love new things (as a Debian user I do too, but it's easier to try them with Gentoo).

Re: reiser4? - me - 2005-02-23

oh...interesting! I'd like to store thousands of big images (between 10 and 200mb a piece) and need to be notified when they are changed... Does your plugin have a webpage? I've been looking for something like that, and I've grown frustrated with the available solutions (fam, enhanced dnotify and even www.dazuko.org suck), so I'd like to know more... Is reiser4 stable?

Re: reiser4? - wilbert - 2005-02-24

> Is reiser4 stable?
According to www.namesys.com/v4/v4.html: "We must caution that just as Linux 2.6 is not yet as stable as Linux 2.4, it will also be some substantial time before V4 is as stable as V3." (as of 29 Dec 2004). So maybe not yet, but testing will definitely be very, very interesting!

Re: reiser4? - Charles Samuels - 2007-02-22

In my experience, it is.

Re: reiser4? - Arun Raghavan - 2005-09-03

Would inotify (http://lwn.net/Articles/104343/) help?

What I want. - Derek Kite - 2005-02-24

The discussions about which DB backend to use are really irrelevant. Even the front end, the user interface, is not really the most important. The middleware, what data is archived and indexed, and how contexts and patterns are matched is the key. I want something that recognizes contexts of activity. My work patterns are usually by blocks; I sit down and write and assemble the digest. I sit and read my favorite blogs. I read and sort my email. Those are the regular blocks. Then there are the projects that I work on, i.e. taxes, planning a trip, researching a specific subject, work tasks such as proposals and product research, etc. For example, in June of last year I was researching travel in Europe since my daughter was travelling there, and I needed to figure out how to get her somewhere. I found interesting sites, some helpful emails came in, including correspondence with my daughter. There was a pattern to that activity. Say I want to arrange a trip for myself and want to find all those sites I found helpful. So I start looking, and keywords london, paris, europe, airline, low-cost come up. Same keywords, same context. The indexing/data retrieval system recognizes the context and suggests how to replicate the previous context. It isn't simply data that is indexed, but time, duration, frequency, context, what application. I can search my datafiles quite easily with grep. But I can't for the life of me remember what tax filing software I used last year. Or where that interesting blog on the NHL strike was. The only time recently where I remember wishing I had an index of a bunch of data files is when reading product documentation PDFs on a CD-ROM where the filenames were 6 digits. I don't want something that tells me I have Results 1 - 10 of about 220,000,000 for linux. I know I've got thousands of references to KDE on my hard drive. I want a maximum of 20 selections based on the context I am working in. This would obviously entail hooks into the various data streams. And some kind of realtime archiving and pattern matching. And possibly background data mining. An API is best since applications sometimes know the best way to work with the data that they produce. This is neat stuff. Derek
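
As a rough illustration of "a maximum of 20 selections based on the context", the following Python sketch ranks previously recorded activity by how many keywords it shares with what the user is doing now. Everything in it is an assumption made for the example: the `Activity` record, its fields, the sample history and the scoring rule are hypothetical and not taken from the project discussed in the interview.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    resource: str        # what was used (URL, mail, file)
    application: str     # which application touched it
    keywords: frozenset  # keywords observed while it was in use
    visits: int          # how often it was used (a stand-in for frequency)


def suggest(history, current_keywords, limit=20):
    """Rank past activity by keyword overlap, then by frequency; cap the result."""
    current = frozenset(current_keywords)
    scored = []
    for activity in history:
        overlap = len(activity.keywords & current)
        if overlap:
            scored.append((overlap, activity.visits, activity.resource))
    scored.sort(key=lambda item: (-item[0], -item[1], item[2]))
    return [resource for _overlap, _visits, resource in scored[:limit]]


history = [
    Activity("http://example.org/low-cost-flights", "browser",
             frozenset({"london", "paris", "airline", "low-cost"}), 5),
    Activity("mail: trip plans", "mail", frozenset({"europe", "paris"}), 2),
    Activity("taxes-2004.dat", "tax-app", frozenset({"tax", "filing"}), 1),
]

suggestions = suggest(history, ["paris", "airline", "europe"])
```

With this sample history the two travel-related entries are suggested and the tax file is not. A real system would also need to weigh time, duration and the other signals listed in the comment.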

Re: What I want. - Aaron J. Seigo - 2005-02-24

> It isn't simply data that is indexed, but time, duration, frequency, context, what application.
bingo! you've got it! and add identity, source and destination to your list. probably others as we go along =)

Re: What I want. - Jos - 2005-02-24

Oh yes, I'd love to have source meta data for every file I've downloaded. Of course, I'd like to have a source for every line I copy and paste etc., but that's going a bit too far. If file storage is being worked over so thoroughly, an integrated versioning solution would be an ultracool feature. Cheers, Jos

What I need/want and I don't think that it is SQL - James Richard Tyrer - 2005-02-24

Perhaps I am missing something here, but what I want to start with is that I have a directory with a bunch of HTML files in it and I want to be able to search them for content just like Google searches the web. Will your project do this? Or am I talking about something else? It seems like a KDE front end for ht://Dig would do what I want. -- JRT

Re: What I need/want and I don't think that it is SQL - Aaron J. Seigo - 2005-02-24

full text search is a subset of what this will do. you could certainly say "html documents containing foo bar baz" and it would return what you're suggesting. but it allows so much more than that as well.

Wonderful (and some suggestions) - jameth - 2005-02-24

"that information can come from meta-data, usage patterns, explicit relationships and a host of other places" Whenever I talk to people or look into the stuff out there, they only seem concerned about meta-data and full-text search! I love that you are planning for both usage patterns and explicit relationships to be included. Also, on that note, I hope you are considering how the data will be inputted by the user. It can be extremely useful if the user input method allows for them to understand the organizational system without adding complexity. The only model for this that I've found to be useful is that of categorization. The user has categories which they define and place files into, allowing a file to be in as many categories as they desire. Then, the system can use those categories as a good way to narrow searches. For example, I would categorize everything into at least one of four categories: Work, Personal Work, Entertainment, or Belonging to Someone Else. But, at the same time, something in any one of those categories could be in several others. For example, for myself, a lot of all those categories would be in: Writing, Gaming, and/or Art. With a good categorization system (maybe visualize it as a set of directories with check-boxes to determine which categories it goes into) I could swiftly and easily place a file while saving it, at least as quickly as I can organize into directories right now. And, if I didn't categorize something right away, it could be automatically be tossed into a category such as 'unsorted' or whatever, so I knew that I hadn't organized it yet. Further, this is a portion of the organizational system that might be representable in a real filesystem, which means the save dialog wouldn't be completely useless. I asked someone who knows more about file-system performance than I do, and they said that it was perfectly feasible to have directories for each category and hard-link files throughout them. 
They even said there shouldn't be any performance issues if you used a modern OS. And, a search organization system which can also be somewhat used from a standard browser would be nifty. And, to go back to my original point, cool! I hope the entire system can be gotten working for KDE 4.0, because it sounds awesome.
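
The hard-link idea in the comment above can be tried out with the standard library alone. The sketch below assumes a filesystem that supports hard links and that the category directories live on the same filesystem as the file; `categorize` is a hypothetical helper, and the fallback category name follows the 'unsorted' example from the comment.

```python
import os
import tempfile


def categorize(file_path, category_root, categories):
    """Hard-link file_path into one directory per category; default to 'unsorted'."""
    name = os.path.basename(file_path)
    links = []
    for category in list(categories) or ["unsorted"]:
        directory = os.path.join(category_root, category)
        os.makedirs(directory, exist_ok=True)
        link_path = os.path.join(directory, name)
        if not os.path.exists(link_path):
            os.link(file_path, link_path)  # same inode, second name
        links.append(link_path)
    return links


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    original = os.path.join(root, "essay.txt")
    with open(original, "w") as handle:
        handle.write("draft")

    category_root = os.path.join(root, "categories")
    links = categorize(original, category_root, ["Work", "Writing"])
    same_file = all(os.path.samefile(original, link) for link in links)

    unsorted_links = categorize(original, category_root, [])
    unsorted_name = os.path.basename(os.path.dirname(unsorted_links[0]))
```

Because every link points at the same inode, editing the file through any category directory changes it everywhere, but programs that save by writing a new file and renaming it would silently break the links.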

Re: Wonderful (and some suggestions) - Aaron J. Seigo - 2005-02-24

> Also, on that note, I hope you are considering how the data will be inputted by the user.
the goal is to have as little extra data input overhead as possible. expecting users to label documents with descriptive tags breaks down really quickly for a variety of reasons. instead we will be using information that is already there as well as offering ways to author information that implicitly builds this information (on the world wide web, hyperlinks provide such a thing).
> this is a portion of the organizational system that might be representable in a real filesystem
locally stored files are only a subset of the information that will make up your personal linkage store. the idea of "information equals document, document equals file on disk, therefore information equals file on disk" is one that we feel is antiquated and getting in the way. not all information is document or storable as a file on disk. this is one reason why we aren't using the file system to store this information. another reason is that linkage meshes get very dense very quickly when all the relationship information is poured in. i really don't see this scaling well on a literal filesystem layout, nor do i see it being sensible to someone browsing such a file system.
> I hope the entire system can be gotten working for KDE 4.0
we'll do our best =)

Re: Wonderful (and some suggestions) - Illissius - 2005-02-24

just thinking out loud a bit... I assume you've tried GMail? GMail's labels are essentially the same thing as the grandparent's 'categories'. And while applying the labels to everything manually would indeed be a chore and a half, luckily GMail also has filters :) (so f. ex. everything addressed to kde-cvs@kde.org automatically gets the 'kde-cvs' label). So perhaps something analogous to these filters is what's necessary to make the idea workable. Or on second thought, this would serve basically the same purpose as 'virtual folders', eg, saving a search for quick access later... whether you 'filter' everything into a 'label' on the fly, or just execute the search again whenever the 'virtual folder' is accessed, is basically an implementation detail (albeit, a rather large one), the result to the user is essentially the same... though with the former approach you would also have the flexibility of manually adding/removing items to/from it.
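
The two approaches compared above, filtering into a stored label versus re-running a saved search, can be contrasted in a short Python sketch. The message list, the `is_kde_cvs` rule and the helper names are invented for the illustration and are not GMail's or KMail's actual mechanisms.

```python
messages = [
    {"id": 1, "to": "kde-cvs@kde.org", "subject": "kdelibs commit"},
    {"id": 2, "to": "me@example.org", "subject": "lunch?"},
    {"id": 3, "to": "kde-cvs@kde.org", "subject": "kdebase commit"},
]


def is_kde_cvs(message):
    """The rule shared by both approaches."""
    return message["to"] == "kde-cvs@kde.org"


# Approach 1: a filter runs once per incoming message and stores a label.
labels = {}


def apply_filter(message):
    if is_kde_cvs(message):
        labels.setdefault("kde-cvs", set()).add(message["id"])


for message in messages:
    apply_filter(message)

# Stored labels can also be edited by hand afterwards.
labels["kde-cvs"].add(2)


# Approach 2: a virtual folder stores only the query and re-evaluates it on access.
def virtual_folder(predicate):
    return {message["id"] for message in messages if predicate(message)}
```

The stored label keeps the manually added message, while the virtual folder always reflects the rule exactly, which is the flexibility trade-off the comment describes.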

Re: Wonderful (and some suggestions) - Derek Kite - 2005-02-24

>as little extra data input overhead as possible
The basic assumption is that the information to categorize, or contextualize (sp?) is in the data and usage patterns. Otherwise this becomes another complicated organizational scheme that needs maintenance. Another example. I've got an extensive and complicated list of kmail filters that attempt to categorize the incoming emails. I've got a dozen or so family and personal correspondents in one folder. The vendors, such as amazon, in another. Registrations and authentications for various sites in another. kde-cvs is in one large folder, divided by module. Spam is handled another way, etc. The patterns are obvious and quite simple. A few days' worth of traffic and usage patterns would give enough information to come up with a similar sorting scheme. Or at least something close that could be easily fine tuned. With the basic contexts recognized, then my inherent skills at noticing anomalies would be used to put the final touches on the system. If the engine spits out garbage, the interface would be very tricky to put together. Another vast font of useless information thrown in your face. If the engine is capable of narrowing down contexts, and the information is genuinely useful, the interface issues become quite simple. It's easy to present delicious food or beautiful art. Derek

Re: Wonderful (and some suggestions) - jameth - 2005-02-25

>>as little extra data input overhead as possible
>The basic assumption is that the information to categorize, or contextualize (sp?) is in the data and usage patterns. Otherwise this becomes another complicated organizational scheme that needs maintenance.
The system cannot solely rely on usage patterns. Such a reliance makes data very easy to lose. For example, I record the data for my FAFSA and update it every year. That data is used one day a year and has been used twice ever. If usage patterns were how this data was tracked, I wouldn't be able to find it. Of course, that might be found just by the fact that I know the name, but there are some things that are not so easily tracked by name, have no searchable contents, and are rarely used but important. Thus, the system needs to make it very easy for users to intentionally track data. Of course, the intent is not to replace all other ways to access information (at least, I think it isn't) but that is fairly likely to happen if everything goes smoothly. If the system is done right, it will be more efficient 90% of the time, which means that users will learn how to use it and will usually go to it. About 5% of the time, it will be less efficient but they won't know that going in, so they'll use it anyway. Then, once they're using it 95% of the time, they'll start using it all the time as they stop using other methods and forget about them. For that reason, it needs to be more efficient 100% of the time from the start, or some very serious problems can arise. (Of course, that isn't some proven theory and the statistics are just random examples, but I've seen it happen. Many people have trouble with offline data sources because they are so used to using Google, or even have trouble with page navigation online because they can't just type in a search query.)

Re: Wonderful (and some suggestions) - jameth - 2005-02-25

>> this is a portion of the organizational system that might be representable in a real filesystem
> locally stored files are only a subset of the information that will make up your personal linkage store. the idea of "information equals document, document equals file on disk, therefore information equals file on disk" is one that we feel is antiquated and getting in the way. not all information is document or storable as a file on disk. this is one reason why we aren't using the file system to store this information.
That some of the information isn't online doesn't change that categorization is a good way to organize it. Many of my offline categories overlap with my online categories, such as Art and Writing. The categories are just a way to browse it.
> another reason is that linkage meshes get very dense very quickly when all the relationship information is poured in. i really don't see this scaling well on a literal filesystem layout, nor do i see it being sensible to someone browsing such a file system.
I was referring to storing only the categorization information in the filesystem, not the rest, and that just being a mirroring. The purpose of that is to avoid having the filesystem itself being incapable of organizing data. This is a serious concern because, once using mostly this new organizational system, many people will just rely on it. Then, when browsing with another method, the filesystem may be 100% nonsensical. That mirroring the categories to the filesystem would be worse than current organizational systems doesn't mean it wouldn't be better than doing nothing of the sort. (I'm not trying to tear down the idea as a whole or rebuild it in some new image, just trying to point out some potential problems. Thanks for all the good work.)

Very Interesting - jesusfish - 2005-02-24

I remember hearing Nat Friedman at RealWorld Linux last year speak on something similar, and he showed an app that demonstrated this (what may actually be Beagle now, I'm not quite sure). It would be a really big step to have technology and tools like this I think. The whole concept resolves around connecting ideas, whether they be data, programs, time, etc. I think Derek hit it right where it's at. When I use my computer, my actions are all motivated by thoughts. I want to know about this, I want to do that, etc. Imagine how nice it would be to connect everything on your desktop relative to a particular thought. An example, say you're working on a website...you could essentially find every program, file, search, etc, that is related to that at one time. It would be like telling your desktop that you want to work on that site, and everything you need is neatly retrieved and organized for you. That is convenience. Hopefully I've got this whole concept correct, or I sound quite dumb.

Re: Very Interesting - Aaron J. Seigo - 2005-02-24

> Nat Friedman at RealWorld Linux last year speak on something similar
just to repeat Scott here: Beagle and this project really aren't the same sort of thing. you can do similar things with both concepts, but the approach taken is vastly different and there is a host of possibilities that just aren't possible with Beagle/Spotlight/Google Desktop Search type systems that are with where we're going with this. this isn't to say Beagle et al are uninteresting or poorly done, they are just a different type of technology with a slightly different set of goals.
> The whole concept re[v]olves around connecting ideas ... That is convenience.
exactly. =)

remember some of the architecture of BeOS? - Ferdinand - 2005-02-24

Not that a little anecdote about BeOS would lead us anywhere, but since ReiserFS4 was brought up, it inevitably reminded me of BeOS and its radical goals to build an OS around a, what may be called, database driven storage paradigm rather than the hierarchical organization of file systems. Something that was tried several times, technologically superior but never widely and successfully disseminated - you may be forgiven for thinking that there are many software and hardware de-facto standards out there that had better not come into existence. That much said, it may be helpful and also a bit insane to encourage such efforts that implement more effective solutions by designing them with a user centric focus rather than a feasibility driven approach in mind. Disseminating such new systems may require overcoming seemingly insurmountable thresholds imposed by incompatibilities of existing applications with previous APIs, concepts, designs and last but not least architectures. The ineffectiveness of machines is typically the result of fundamental design flaws. So, beware and balance the requirements to befriend software evolution!

Re: remember some of the architecture of BeOS? - Aaron J. Seigo - 2005-02-24

the trick is getting, as you seem to be hinting at, useful applications of these tools out there for public consumption. just because something is theoretically cool, who cares unless it lets you do something practically useful. something so wildly useful that it becomes a compelling reason to seek out that technology. before visicalc, PCs were not all that interesting to most people. PCs were a radical concept that had so much potential, but it sat there largely unrealized until visicalc. visicalc didn't add any new capabilities to personal computer hardware or operating systems; it exposed the existing capabilities through an actual, unique and useful application of those capabilities. suddenly PCs were interesting. and in demand. database centric desktop concepts have been tossed around for many, many years, particularly in academia but even occasionally in the marketplace as well. none have fared particularly well. it's the old "PC without visicalc" problem. and this is where working within the KDE project is important. we have a whole desktop here with hundreds of applications in KDE's CVS. most of the developers tend to be in close virtual proximity to each other. as this technology becomes available for consumption, we won't have to go out and shop for third party developers or try and tack ourselves onto the side of a web browser. as a whole community of developers, KDE can deliver an entire environment of applications that make the most of this technology from day 1. these applications will make all the difference IMO. which is why we are designing this primarily from an application developer's needs (which includes things like reusable, easily integrated user interface components). we believe that given a functional and open-ended API to these tools that we will be amazed and surprised at the unique uses application developers find for them.

Re: remember some of the architecture of BeOS? - Ian Monroe - 2005-02-25

For being so early in development, this plan reeks of not being vaporware. :)

Re: remember some of the architecture of BeOS? - Derek Kite - 2005-02-24

Think horsepower. OS/2 had a useful ability in the Workplace Shell where you could have a folder containing documents, and when you opened or closed the folder, the documents would be opened or closed, ie. applications would be started or shutdown. At the time I was using a 386. I remember buying memory at $100 for a megabyte. This useful feature was useless because it was so slow. Not get a cup of coffee slow, but go eat lunch slow. At least on the machine that I could afford. This wasn't rocket science. It was a matter of not having adequate resources. Some kind of desktop data indexing and retrieval is possible because of the enormous power we have in our desktop computers. Derek

Windows attempts at similar objective - Dan Housman - 2005-03-01

I found this article/project interesting. We are a small software company working on a Windows based solution similar in scope. The product is called Viapoint (http://www.viapoint.com). I doubt the true Linux folks will want to play with it but it could be interesting to keep track of how the products evolve across operating systems. We have been calling the product a Smart Organizer and are looking to avoid building desktop search components by calling on Google's Desktop Search APIs or an equivalent so that we can focus on building the context sensitive part of the application as well as functionality to help a user actually work. You can download for free if you want to check it out. I'll have to get Linux or watch videos to see this tool.

Desktop Search - Chris - 2006-03-24

Anybody ever looked at Kat Desktop Search project? http://kat.mandriva.com/ What's everyone's opinion on using this technology instead of a completely new one?

Re: Desktop Search - cm - 2006-03-24

Actually, if I read http://websvn.kde.org/trunk/playground/base/tenor/ correctly then kat's main developer Roberto Cappuccio is the last person to commit to Tenor. So don't worry.