Re: That's nice and all but...
by dmalloc on Sunday 15/Jul/2001, @23:11
Well, first of all, this is a text-to-speech plugin, not a voice control plugin. Text-to-speech does use synthesized patterns of speech, but it is not dependent on huge databases. Very smart people have found a way to describe a "language" from which a speech synthesizer, based on a "grammar", can produce valid-sounding output that our brain recognizes as a word or a sentence.
Yet, even though it sounds stupid, many big voice recognition packages have come to the conclusion that a simple "comparison" between spoken text and known text is not good enough. There are a few more approaches to this by now, for example analyzing key parts of a word, recognizing the sequence of certain sound triplets, and other stochastic means of categorizing data. Basically, they are developing very complex yet precise heuristic algorithms for natural speech.
Since that requires analyzing gazillions of gigabytes of actually spoken data, this research is very expensive and is therefore mostly carried out by universities of big corporations (see IBM).
The Fine Print: The following comments
are owned by whomever posted them.
Re: That's nice and all but...
by Carbon on Monday 16/Jul/2001, @02:21
>universities of big corporations (see IBM)
Well, I knew it wouldn't take long for IBM to buy a university or two! :-)
Well, text-to-speech does require (somewhat large, but not huge) databases too. What I think you're referring to by "grammar", to explain it in a little more detail, is the databases that Festival, Mbrola, and (I think) the Macintosh TTS use.
Basically, these (about 10MB) databases consist of two things. The first is a database of the pronunciation of many words. The other part is a sound database containing a sample for each sound that the TTS system can play.
For every word it tries to read, the system looks in the pronunciation database for the sounds the word is composed of, gets those sounds from the sound database, and strings them together. If a word isn't found, it attempts to guess how to pronounce it, often with hilarious results.
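To make that lookup-and-concatenate idea concrete, here is a toy sketch in Python. It is not how Festival or Mbrola are actually implemented; the tiny word list, the sound names, the fake sample values, and the letter-by-letter fallback are all made up for illustration.

```python
# Toy illustration only: every table below is hypothetical, not real TTS data.

# Pronunciation database: word -> list of sound names.
PRONUNCIATIONS = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

# Sound database: sound name -> audio samples (here just two fake numbers each).
SOUND_NAMES = ["HH", "AH", "L", "OW", "W", "ER", "D"]
SOUND_SAMPLES = {name: [i, i] for i, name in enumerate(SOUND_NAMES)}

# Crude fallback for unknown words: one sound per known letter.
LETTER_TO_SOUND = {
    "h": "HH", "a": "AH", "l": "L", "o": "OW",
    "w": "W", "r": "ER", "d": "D",
}


def guess_pronunciation(word):
    """Guess sounds letter by letter, skipping letters we have no sound for."""
    return [LETTER_TO_SOUND[ch] for ch in word if ch in LETTER_TO_SOUND]


def lookup(word):
    """Return the listed pronunciation, or a guess if the word is unknown."""
    word = word.lower()
    return PRONUNCIATIONS.get(word) or guess_pronunciation(word)


def synthesize(text):
    """String together the samples for every sound of every word."""
    samples = []
    for word in text.split():
        for sound in lookup(word):
            samples.extend(SOUND_SAMPLES[sound])
    return samples


if __name__ == "__main__":
    print(lookup("hello"))       # found in the pronunciation database
    print(lookup("lol"))         # not found, so it is guessed
    print(synthesize("hello"))
```

The guessing step is where the "hilarious results" come from: a per-letter rule like this one knows nothing about silent letters or letter combinations, so a real system would need far better rules than this.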
I don't really know all _that_ much about it (not nearly enough to code something like this myself, anyways), so if you really want more info on how this is done, go to the Festival homepage (listed above) and read their thesis-like explanation yourself.