Every once in a while, the KDE community stumbles across a third party application that is well integrated into KDE, but has somehow managed to fly completely beneath the radar. One such application is called simon (small 's' intentional), a speech recognition program that integrates well with KDE and provides a means of interacting with KDE using voice recognition.
What follows below is an overview of simon and excerpts from an interview that the Dot conducted with Peter Grasch, lead developer.
What is simon?
Simon is a speech recognition subsystem that can be tied into X11 or Windows and, using a plugin-based architecture, can be used to control the user interface. The name simon is derived from the children's game 'Simon Says', but in this case, 'Simon Listens'. The application name is intentionally lowercased, while the financial support organization is called SIMON listens. It is also, apparently, a backronym for Speech Interaction MONitor. Peter Grasch writes on the project's origin:
The school [I was attending in 2006] has a subject called "project development" where students team up to develop smaller projects with external companies. During the search for a project we were approached by Franz Stieger, a teacher for special needs children who in his day-to-day work was confronted with children who suffer from spasticity. Because of their limited motor control, they are not able to write clearly with a pen, nor are they comfortable writing texts on a keyboard. Franz wanted to know if it would be possible to use speech recognition software to help them participate in the classroom. With this idea he came to us. Under my leadership, we (a team of four students) then researched available speech recognition software and quickly concluded that none of the commercial or free offerings on the market were capable of adapting to the speech impairments of our test subjects. And because we had no idea what we were up against, we decided - why not create our own solution?
By 2007, Grasch and his team had a working prototype. The first word that simon recognized was thunfisch (German for tuna) which, when spoken, triggered a full-screen image of a tuna. Within a short time, simon had two available commands: Executables and Places. Since then, more commands have been added, which are perhaps best illustrated by a video of simon 0.2 (the most recent release) in action.
Caption: Simon 0.2 beta 3 controlling KDE.
At the moment, commands are implemented as configurable plugins. For example, while there is currently no plugin that issues D-Bus commands to running programs, once such a plugin were written, it would be possible to configure and train the new commands using the graphical interface.
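As a rough illustration of the plugin idea, here is a minimal sketch in Python of a trigger-to-action command plugin. Simon's real plugins are written in C++ against the KDE platform; the class, interface, and configured command below are invented purely for illustration.

```python
import subprocess

class ExecutablePlugin:
    """Hypothetical sketch of a command plugin: maps a spoken trigger
    word to launching a program, like simon's Executables command type."""

    def __init__(self, trigger, command):
        self.trigger = trigger    # the word the recognizer must report
        self.command = command    # argv list to execute

    def matches(self, recognized):
        # Case-insensitive match against the recognized word.
        return recognized.lower() == self.trigger.lower()

    def run(self):
        # Launch the program without blocking the recognition loop.
        subprocess.Popen(self.command)

# A configured plugin list; a D-Bus command plugin would slot in the same way.
plugins = [ExecutablePlugin("calculator", ["kcalc"])]

def dispatch(recognized):
    """Hand a recognition result to the first plugin that claims it."""
    for plugin in plugins:
        if plugin.matches(recognized):
            plugin.run()
            return True
    return False
```

The point of the sketch is the decoupling: the recognizer only produces words, and each plugin decides independently whether a word belongs to it.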
Under the Hood
Simon is written in C++ using Qt and KDE for the user interface and thus fully integrates with KDE 4. Grasch explains their decision to integrate with KDE:
It just seemed a logical step. While simon was [originally] a Qt-only application, we had a lot of supporting libraries that in the end were just cheap clones of similar KDE widgets/routines. With time it just got harder and harder to maintain all this extra code. Of course, the KDE implementations were also a lot better tested and were all in all just more mature and complete. What made the decision easy was the foreseeable maturing of KDE on Windows. Because Windows was a target platform we wanted to support, it was of course important that such a huge dependency as KDE would work well on win32. Once that was ensured, the porting was a breeze.
Applications built on the KDE 4 platform, such as simon, will happily run under GNOME, Xfce, KDE 3 or any other X11 desktop environment when the appropriate libraries are installed.
I asked Grasch whether the transition to KDE 4 brought any major advantages for simon, and he replied: More features, less crashes. Or to quote a great tagline: "Code less, create more". [ed: this is part of the tagline for Nokia's Qt Development Frameworks]
Grasch also comments on the relative stability of running on Linux versus Windows: Simon on Windows is a tick more reliable because of the huge mess that is the sound stack on Linux and especially Ubuntu where their questionable pulseaudio setup sometimes breaks simon (portaudio, actually) in the most unbelievable ways.
To get around these problems, the team provides updated packages for Ubuntu that should allow simon to run reasonably well on that distro. Packages are also available for openSUSE from the homepage, or you can compile simon yourself with the usual KDE toolchain, following instructions from the wiki.
Simon uses a number of technologies under the hood that allow voice recognition to happen. Grasch provides a rough overview:
The whole recognition process is basically a statistical calculation. Sound input is recorded and tested against a speech model, which consists of an acoustic model and a language model.
The language model contains information about the language to be used. This can of course be whatever you define; Standard English, French, Klingon, whatever. The language model most importantly defines what "words" there are and which sounds they consist of. The sounds are represented with phonemes (which are to spoken speech what characters are to written speech). There are well-defined phonetic alphabets like the IPA or X-SAMPA, but again you can of course define and use your own - it doesn't matter. The language model also defines what word combinations are valid ("grammar").
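Grasch's description of the language model can be sketched as plain data: a lexicon mapping words to phoneme sequences plus a set of valid word combinations. The words and X-SAMPA-style transcriptions below are invented for illustration and are not taken from simon's actual file formats.

```python
# Toy lexicon: each word mapped to its phoneme sequence.
lexicon = {
    "thunfisch": ["t", "U", "n", "f", "I", "S"],
    "open": ["oU", "p", "@", "n"],
}

# Toy "grammar": the set of valid word combinations (commands).
grammar = {("open", "thunfisch")}

def is_valid(words):
    """A sentence is valid if every word is in the lexicon
    and the word combination is allowed by the grammar."""
    return all(w in lexicon for w in words) and tuple(words) in grammar

print(is_valid(["open", "thunfisch"]))   # True
print(is_valid(["thunfisch", "open"]))   # False: not in the grammar
```

Because the whole model is user-defined data like this, nothing ties it to any particular language - which is exactly what lets simon handle impaired speech or small languages.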
The acoustic model defines how those phonemes sound. This is done by feeding a lot of transcribed speech (speech samples where the system already knows what is being said) into a training algorithm. The output of this training procedure is a Hidden Markov model (HMM). So during recognition, the microphone records speech data, which is then digitized to an uncompressed waveform. This waveform is compared with the data stored in the acoustic model to find out what phonemes are most likely being said. These phonemes are then - with the help of the language model - transformed into words and "sentences".
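In grossly simplified form, the matching step described above could look like the sketch below: each audio frame is reduced to a single number, scored against a per-phoneme model value, and the resulting phoneme sequence is looked up in the lexicon. Real systems use HMMs over high-dimensional feature vectors; every number and symbol here is invented for illustration.

```python
# Toy "acoustic model": one learned feature value per phoneme.
# (A real acoustic model is a Hidden Markov Model over feature vectors.)
acoustic_model = {"t": 0.9, "u": 0.2, "n": 0.5}

# Toy "language model": which phoneme sequences form valid words.
words = {("t", "u", "n"): "tun"}

def likeliest_phoneme(frame):
    # Pick the phoneme whose model value is closest to the observation.
    return min(acoustic_model, key=lambda p: abs(acoustic_model[p] - frame))

def recognize(frames):
    phonemes = tuple(likeliest_phoneme(f) for f in frames)
    return words.get(phonemes)  # None if the sequence spells no known word

print(recognize([0.85, 0.25, 0.45]))  # -> tun
```

The "training" Grasch mentions is what fills in the acoustic model's values from transcribed samples; here they are simply hard-coded.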
In comparison to commercial speech recognition offerings, simon does not ship with any predefined speech model. Instead, it makes it very easy for the end-user to create his own - a process that is normally extremely complicated (a very similar process to what simon does internally is outlined in the HTK book, where the instructions span about 100 pages, and the HTK book is targeted at linguistics professionals). This makes simon independent of any existing language or pronunciation and gives all the control to the end-user.
The process of creating the speech model and the output models are state of the art - this has also been verified by the Signal Processing and Speech Communication Laboratory of Graz University of Technology. The acoustic model is trained with the HTK toolkit, which has to be installed separately, and Julius is used for the recognition (a modified version is shipped with simon).
Grasch goes on to talk about dictation, and the problem it presents for the way simon has been designed: As detailed above, simon purposefully does not ship with a speech model. Our design decision to enable (require) every user to create his own model opens the door to users with speech impairments or languages that just don't have a large enough user base to be interesting to commercial alternatives. It also makes dictation unrealistic. To achieve a speech model large and well trained enough to enable dictation, commercial offerings use thousands of hours of transcribed speech from representative professional speakers to create a "standard" speech model which is later just slightly adapted to match the individual pronunciation of the end-user. As we don't ship with a default model, every user would have to invest a couple of thousand hours into training the model.
However, we are investigating methods to keep simon as flexible as it currently is while adding the ability to use base models if needed. At the moment this is only at the planning stage and no code has been written. However, we are already planning a few projects that would give us access to large amounts of transcribed speech. As soon as any form of adapting a speaker-independent model is added to simon, we will also try to make using VoxForge models as easy as possible. VoxForge is a project to build large speaker-independent speech models.
Currently, the VoxForge project does not have models large enough to properly enable dictation; however, this GPL-licensed project could benefit from simon driving users to contribute. Grasch suggests using this release for simple control of public kiosks or home computers, and he even provides an example of one user who has adapted simon to do home automation. Whether it can be used to control a spaceship remains to be seen.
Current Release and Future Plans
The current release, simon 0.2, was focused on stability, and in that respect it is a very successful release. Grasch notes: The whole flexibility of the system is really the killer feature [of simon 0.2]. For testing purposes we once quickly created a speech model that would assign commands to: coughing, yawning, snoring, whistling, tongue flicking, etc. After five minutes of training we were able to surf the web by coughing into the microphone!
At this point, the project has started to mature and is no longer purely an educational project. It currently operates with a team of three developers: Grasch, Franz Stieger and Matthias Stieger, who all work nearly full-time on simon. The long-term goal is, of course, full dictation, but in the meantime the team's goals for 0.3 include integration of KDE's 'Get Hot New Stuff' functionality and improving the recognition process by taking into account the confidence scores of the recognizer. The ability to download new commands, training texts, and vocabularies with less hassle will contribute to a growing community around simon. So, if you want to control Amarok, you simply pick the Amarok control package from a list, read a few texts to train the acoustic model, and start controlling Amarok using voice commands.
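The planned use of confidence scores can be sketched as simple filtering. This is a hypothetical illustration that assumes the recognizer hands back (word, confidence) pairs; the actual output of Julius, and how simon will consume it, differ, and the threshold value is invented.

```python
# Hypothetical sketch of confidence-based filtering of recognition results.
THRESHOLD = 0.7  # invented cutoff; a real system would tune this

def accept(hypotheses):
    """Keep only words the recognizer was sufficiently confident about,
    so a mumbled word triggers nothing rather than the wrong command."""
    return [word for word, confidence in hypotheses if confidence >= THRESHOLD]

print(accept([("open", 0.92), ("thunfisch", 0.41)]))  # ['open']
```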
In my investigations into the GPL-licensed simon, I've discovered two licensing snags that will likely inhibit its uptake, at least for the near future. When I first started researching this software, I thought it would be great if simon could ship with the main KDE software releases; however, two of its dependencies have unusual licenses that prevent it from being shipped that way.
HTK, the toolkit responsible for the HMM training, is distributed under a restrictive, GPL-incompatible license that prevents redistribution. In order to install simon, one must download HTK separately from its website, which requires registration. The source is available, and you are encouraged to modify and contribute to it, but it cannot be redistributed.
Additionally, Julius, used for the voice recognition, has an attribution clause that causes problems with the GPL in a way reminiscent of the old-style BSD license (the one with the advertising clause). Any research conducted with simon would thereby require a reference to the Julius authors in the bibliography.
These problems will, for the immediate future, prevent many distributions from shipping simon. Perhaps we should read the text of the GPL to simon to see what it thinks. Grasch notes that there are relatively few installations of simon, so for those using it, these issues are not likely to be pressing.
Simon provides a unique way of interacting with your computer using voice recognition (not dictation, yet) that integrates well with KDE. Installation requires obtaining and installing HTK separately from simon, due to a licensing conflict that will probably inhibit uptake. However, if there are enough users, we as a community may be able to gently push Cambridge to dual-license HTK in a compatible way. In the meantime, for the adventurous at heart, or those with special computer interaction needs, simon exists to fill this niche. With the right plugins, one could potentially even order "Tea, Earl Grey, hot" from within KDE.