Okay folks, this thread got me thinking about text-to-speech again. It's a huge wish of mine to get practical text-to-speech working on the Empeg. The applications are numerous, from having it read driving directions, to voice prompting in the player app, to more frivolous applications like having my Trivia game read you the questions (my original reason for looking into this).

Here are my findings so far:

1. The *only* text-to-speech engine we even need to be thinking about is Flite. It's probably the only one small enough to run on the Empeg, and it does a very good job. The version 1.1 binary release comes with a 16 kHz voice that sounds pretty damn good. Flite is 90% of the puzzle, and it's open source and free.

2. The other 10% of the puzzle comes from the fact that Flite can't write to the Empeg's sound device directly, due to the device's peculiar buffer and sample rate requirements. This means that the raw sound data produced by Flite needs to be upsampled from 16 kHz to 44.1 kHz and then written to the sound device the way the Empeg expects it to be written (4608 bytes at a time). These limitations aside, Flite is amazing. Running on the Empeg (but without the player app running), the Flite engine did text-to-speech of the first paragraph of the GNU General Public License in less than 4 seconds (the paragraph when spoken is at least 20 seconds long).

3. So tonight, inspired by the aforementioned PhatBox thread, I dug deeper on the web and finally found some sample rate conversion software (rateconv) that will do what we need it to do pretty easily. It takes WAV on stdin and writes it to stdout after a sample rate conversion that seems to be both high quality and pretty fast.

4. Using the above programs (flite and rateconv) I can generate a WAV file that the trusty ol' pcmplay example program can play to the Empeg's sound device. So we have this chain working:

source text --> flite --> wav file --> rateconv --> pcmplay --> empeg sound output

In the UNIX shell, it looks something like this:


flite16k "Pink Floyd. Another Brick in The Wall Part II. 1979" -o test.wav;
rateconv -m 16000 7200 400 65 5 1 0.8 < test.wav | pcmplay


When I run this and the player isn't running, the whole process is very quick. Maybe two seconds, which is shorter in duration than the sound output itself. However, it's far from real time, because nothing plays until every stage has finished in turn: it runs the Flite engine, writes to a file, then reads that file into the sample rate converter, which passes its stdout into the stdin of pcmplay.

That's not ideal, of course. The Holy Grail is to modify the Flite source code to use the Empeg's sound device directly, and to graft in the sample rate conversion code from rateconv so that the output doesn't sound like a 33 RPM record going at 78 RPM. The other modification would be running this modified Flite under the real-time round-robin scheduler so that it can play nicely while the player app is running.
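To make the "resample, then write 4608 bytes at a time" step concrete, here is a rough sketch in Python (for illustration only; the real change would be C inside Flite). It uses plain linear interpolation, which is NOT the algorithm rateconv uses and will sound worse. It also assumes mono signed 16-bit samples and zero-pads the final block; both are assumptions, not something confirmed about the Empeg. The function names are made up.

```python
import array

SRC_RATE = 16000      # Flite's 16 kHz voice
DST_RATE = 44100      # rate the Empeg sound device wants
BLOCK_BYTES = 4608    # bytes per write, per the figure quoted above


def upsample_linear(samples, src_rate=SRC_RATE, dst_rate=DST_RATE):
    """Resample signed 16-bit samples by linear interpolation.

    Stand-in for rateconv's filter; assumes mono input.
    """
    out = array.array("h")
    if not samples:
        return out
    n_out = len(samples) * dst_rate // src_rate
    last = len(samples) - 1
    for i in range(n_out):
        # Position of output sample i on the input time axis, kept as
        # an integer numerator so rounding errors don't accumulate.
        num = i * src_rate
        idx = num // dst_rate
        frac = (num % dst_rate) / dst_rate
        a = samples[idx]
        b = samples[min(idx + 1, last)]
        out.append(int(round(a + (b - a) * frac)))
    return out


def to_blocks(samples, block_bytes=BLOCK_BYTES):
    """Split samples into fixed-size byte blocks, zero-padding the last."""
    data = samples.tobytes()
    data += b"\x00" * ((-len(data)) % block_bytes)
    return [data[i:i + block_bytes] for i in range(0, len(data), block_bytes)]


if __name__ == "__main__":
    one_second = array.array("h", [0, 1000] * 8000)   # 16000 samples
    blocks = to_blocks(upsample_linear(one_second))
    print(len(blocks), "blocks of", BLOCK_BYTES, "bytes")
```

Each element of the list returned by to_blocks() would then be handed to the sound device in a single write. Padding the tail with silence is just one way to satisfy a fixed block size.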

I was hoping to make all this happen, but I don't think I'm the guy to do it. With sufficient hacking and thrashing about, I can probably do it. But I know there are people reading this who are better equipped to take this on. If you're one of those people, please raise your hand. Or if you have anything else to say on this topic, let's hear it. Quasi-realtime TTS on the Empeg would be useful for dozens of applications, and all the software we need is already there. We just need to integrate it all and make it one "speech server" that other user apps can connect to.
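One possible shape for such a "speech server", sketched in Python purely to show the idea: clients connect over TCP and send one line of text per utterance, and a single worker speaks them in turn so that nothing overlaps on the one sound device. The protocol, the class names, and the speak() callable are all hypothetical; speak() stands in for whatever ends up driving the Flite/rateconv/sound-device chain.

```python
import queue
import socket
import socketserver
import threading


class LineHandler(socketserver.StreamRequestHandler):
    """Queue every non-empty line a client sends as one utterance."""

    def handle(self):
        for raw in self.rfile:
            text = raw.decode("utf-8", errors="replace").strip()
            if text:
                self.server.pending.put(text)


class SpeechServer(socketserver.ThreadingTCPServer):
    """Accepts text from many clients, speaks it one utterance at a time."""

    allow_reuse_address = True
    daemon_threads = True

    def __init__(self, address, speak):
        super().__init__(address, LineHandler)
        self.speak = speak              # hypothetical callable: speak(text)
        self.pending = queue.Queue()    # utterances waiting their turn
        threading.Thread(target=self._drain, daemon=True).start()

    def _drain(self):
        # Single worker: only one utterance uses the sound device at once.
        while True:
            text = self.pending.get()
            try:
                self.speak(text)
            except Exception as exc:
                # Keep serving even if one utterance fails.
                print("speak failed:", exc)
            finally:
                self.pending.task_done()


def send_text(address, text):
    """Client side: send one utterance to the server."""
    with socket.create_connection(address) as conn:
        conn.sendall(text.encode("utf-8") + b"\n")
```

A user app would then only need something like send_text(("127.0.0.1", 9999), "Pink Floyd"), with the server started via SpeechServer(("127.0.0.1", 9999), speak).serve_forever(). Port 9999 is an arbitrary pick.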

So who's in?
_________________________
- Tony C