Text to Speech on the Empeg - Let's do it!

Posted by: tonyc

Text to Speech on the Empeg - Let's do it! - 18/04/2002 22:23

Okay folks, this thread got me thinking about text to speech again. It's a huge wish of mine to get practical text to speech working on the Empeg. The applications are numerous, from having it read driving directions, to voice prompting in the player app, to more frivolous applications like having my Trivia game read you the questions (my original reason for looking into this.)

Here are my findings so far:

1. The *only* text-to-speech engine we even need to be thinking about is flite. It's probably the only one small enough to run on the Empeg, and it does a very good job. The version 1.1 binary release comes with a 16 kHz voice that sounds pretty damn good. Flite is 90% of the puzzle, and it's open source and free.

2. The other 10% of the puzzle comes from the fact that Flite can't write to the Empeg's sound device directly, because of the device's peculiar buffer and sample rate requirements. This means that the raw sound data produced by Flite needs to be upsampled to 44.1 kHz and then written to the sound device the way the Empeg expects it (4608 bytes at a time). These limitations aside, flite is amazing. Running on the Empeg (but without the player app running), the flite engine did text-to-speech of the first paragraph of the GNU General Public License in less than 4 seconds (the paragraph when spoken is at least 20 seconds long).

3. So tonight, inspired by the aforementioned PhatBox thread, I dug deeper on the web and finally found some sample rate conversion software that will do what we need it to do pretty easily. It takes WAV on stdin and writes it on stdout after some sample rate conversion that seems to be both high quality and pretty fast.

4. Using the above programs (flite and rateconv) I can generate a WAV file that the trusty ol' pcmplay example program can play to the Empeg's sound device. So we have this chain working:


source text --> flite --> wav file ---> rateconv --> pcmplay --> empeg sound output

In the UNIX shell, it looks something like this:


flite16k "Pink Floyd.  Another Brick in The Wall Part II. 1979" -o test.wav;

rateconv -m 16000 7200 400 65 5 1 0.8 < test.wav | pcmplay


When I run this and the player isn't running, the whole process is very quick. Maybe two seconds, which is shorter than the sound output itself. However, it's far from real time, because it runs the flite engine, writes to a file, then reads that file into the sample rate converter, which pipes its stdout into the stdin of pcmplay.
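
For anyone picking this up, the missing 10% from point 2 can be sketched in a few lines. What follows is a rough Python illustration, not flite or rateconv code: it upsamples 16 kHz samples to 44.1 kHz with plain linear interpolation (much cruder than whatever filtering rateconv does) and cuts the result into 4608-byte blocks. The function names, the 16-bit little-endian mono format, and the zero padding of the last block are assumptions; only the two sample rates and the 4608-byte write size come from the findings above.

```python
import struct

SRC_RATE = 16000     # flite's 16 kHz voice (from the post)
DST_RATE = 44100     # rate the Empeg sound device wants (from the post)
BLOCK_BYTES = 4608   # write size the Empeg expects (from the post)


def upsample_linear(samples, src_rate=SRC_RATE, dst_rate=DST_RATE):
    """Resample a list of ints by linear interpolation (assumed technique)."""
    if not samples:
        return []
    n_out = len(samples) * dst_rate // src_rate
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # position in the input signal
        left = int(pos)
        right = min(left + 1, last)        # clamp at the final sample
        frac = pos - left
        value = samples[left] * (1 - frac) + samples[right] * frac
        out.append(int(round(value)))
    return out


def to_blocks(samples, block_bytes=BLOCK_BYTES):
    """Pack as 16-bit little-endian PCM (assumed) and split into blocks.

    The final block is zero-padded so every write has the same size.
    """
    raw = struct.pack("<%dh" % len(samples), *samples)
    raw += b"\x00" * ((-len(raw)) % block_bytes)
    return [raw[i:i + block_bytes] for i in range(0, len(raw), block_bytes)]
```

Each block would then go to the sound device in a single write. One second of 16 kHz input becomes 44100 samples, or 88200 bytes, which rounds up to 20 blocks of 4608 bytes.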

That's not ideal, of course. The Holy Grail is to modify the Flite source code to use the Empeg's sound device directly, and to graft in the sample rate conversion code from rateconv so that the output doesn't sound like a 33 RPM record going at 78 RPM. The other modification would be running this modified Flite under the realtime round robin scheduler so that it can play nicely while the player app is running.
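
On the scheduling side, asking Linux for the round robin realtime policy is a single call. Here is a minimal sketch in Python; on the Empeg itself this would more likely be the equivalent C sched_setscheduler() call inside the modified Flite. The helper name and the default priority of 10 are arbitrary choices for illustration, and the call normally needs root.

```python
import os


def request_round_robin(priority=10):
    """Try to switch this process to SCHED_RR; return True on success."""
    if not hasattr(os, "sched_setscheduler") or not hasattr(os, "SCHED_RR"):
        return False  # platform without the Linux scheduling calls
    lowest = os.sched_get_priority_min(os.SCHED_RR)
    highest = os.sched_get_priority_max(os.SCHED_RR)
    priority = max(lowest, min(highest, priority))  # clamp to valid range
    try:
        # pid 0 means "the calling process"
        os.sched_setscheduler(0, os.SCHED_RR, os.sched_param(priority))
    except OSError:
        return False  # typically a permission error when not root
    return True
```

A speech process would call this once at startup and fall back to normal scheduling if it returns False.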

I was hoping to make all this happen, but I don't think I'm the guy to do it. With sufficient hacking and thrashing about, I can probably do it. But I know there are people out there who are reading this who are better equipped to take this on. If you're one of those people, please raise your hand. Or if you have anything else to say on this topic, let's hear it. Quasi-realtime TTS on the Empeg would be useful for dozens of applications, and all the software we need is already there. We just need to integrate it all and make it one "speech server" that other user apps can connect to.

So who's in?

Posted by: Terminator

Re: Text to Speech on the Empeg - Let's do it! - 18/04/2002 22:39

Doesn't Kim Salo's GPS project use voice prompts?

Posted by: tonyc

Re: Text to Speech on the Empeg - Let's do it! - 18/04/2002 22:45

Yup, but they're prerecorded, not realtime, on-the-fly. I want realtime TTS and all the ingredients are there.

Posted by: Shonky

Re: Text to Speech on the Empeg - Let's do it! - 19/04/2002 00:12

That would be so cool... Just add it to the ever-growing list of things I want to have a crack at.

Exactly what were you going to use the TTS for? I can see song titles/album names etc but what else?

Posted by: Shonky

Re: Text to Speech on the Empeg - Let's do it! - 19/04/2002 00:32

When you say speech server, that could be implemented in a device ala /proc/kernel which lets you upgrade the kernel. You could have /proc/tts and then to say something you would/could simply go:

echo "This is the empeg talking" >/proc/tts

or the equivalent in the userland program. So this might be possible for a hijack thing.
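
A userland take on the same idea can be sketched with a named pipe instead of a /proc entry. This is only an illustration under assumptions: the FIFO path, the function names, and the speak callback (which would wrap flite plus the resampling step) are all hypothetical and not part of hijack or the player.

```python
import os


def handle_stream(stream, speak):
    """Pass each non-empty line of a text stream to the speak callback."""
    count = 0
    for line in stream:
        text = line.strip()
        if text:
            speak(text)
            count += 1
    return count


def serve(fifo_path, speak):
    """Loop forever, speaking whatever clients write into the FIFO."""
    if not os.path.exists(fifo_path):
        os.mkfifo(fifo_path)
    while True:
        # open() blocks until a writer connects; the inner loop ends
        # when that writer closes its end, then we wait for the next one.
        with open(fifo_path, "r") as fifo:
            handle_stream(fifo, speak)
```

With serve("/tmp/tts", speak) running (the path is just an example), echo "This is the empeg talking" > /tmp/tts from any other program would hand that sentence to speak.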

Posted by: rob

Re: Text to Speech on the Empeg - Let's do it! - 19/04/2002 04:24

The problem with flite is that it sounds pretty awful, and the car environment is one place you need good quality speech. There are a few very good quality commercial systems, but the problem with them (apart from being commercial) is their large footprint.

Of course, the engine itself can be replaced later in the project if something better comes along. It'll be good to get the basic infrastructure in place with what's available now.

Rob

Posted by: Shonky

Re: Text to Speech on the Empeg - Let's do it! - 19/04/2002 04:35

Bummer. I hadn't got to try it yet. I assumed it sounded OK from what Tony was saying. Is it the standard very robotic computer voice?

Posted by: thenominous

Re: Text to Speech on the Empeg - Let's do it! - 19/04/2002 05:09

I assume that Festival falls into the "too large" category, or possibly the "too much CPU time required" one?

Posted by: tonyc

Re: Text to Speech on the Empeg - Let's do it! - 19/04/2002 05:16

that could be implemented in a device ala /proc/kernel which lets you upgrade the kernel

Exactly what I had in mind.

Posted by: Shonky

Re: Text to Speech on the Empeg - Let's do it! - 19/04/2002 05:34

Does it really sound that bad, yn0t?

Posted by: tonyc

Re: Text to Speech on the Empeg - Let's do it! - 19/04/2002 05:35

The problem with flite is that it sounds pretty awful

Well, I'm not sure if you've used v1.1 yet, but it comes with a 16 kHz voice that isn't all that bad. It's not perfect. In fact, it's definitely stripped down from Festival, and it's not going to compete with something like AT&T's Natural Voices or anything, but *it works* and it works right now, and it's getting better.

I wrote the author a while back and he does have a commercial version of flite available via his website http://www.cepstral.com/. But for me, flite does an admirable job considering it's running on the Empeg in real time.

I'm attaching a sample mp3 file for people to make their own judgements. No, it doesn't sound like a real person talking, and yes, the commercial systems are advanced, but I'm not imagining anyone getting them working on the Empeg anytime soon. I personally think the TTS provided by flite would be perfectly acceptable for voice prompts reading you the artist/title, playing GPS directions, etc. But that's just my opinion.

Posted by: tonyc

Re: Text to Speech on the Empeg - Let's do it! - 19/04/2002 05:41

Bummer. I hadn't got to try it yet. I assumed it sounded OK from what Tony was saying. Is it the standard very robotic computer voice?

I would say it's three steps above my answering machine's voice, but certainly not as advanced as other TTS systems I've tried on my PC. I'm not sure if this is a limitation, or if they just don't want to ship flite with a very high quality voice so they can sell their commercial product instead, but it sure does sound robotic.

However, it does a very good job at analyzing sentence structure, inserting pauses, etc. In the example MP3, I didn't have to insert any words phonetically to make it sound right, or tell it where to pause in sentences, etc... I think it would do a good enough job that it would be useful in a variety of applications.

Posted by: tonyc

Re: Text to Speech on the Empeg - Let's do it! - 19/04/2002 05:44

I assume that Festival falls into the "too large" category, or possibly the "too much CPU time required" one?

Ohhhhhhhhhhh yeah. I took a stab at Festival but it doesn't even come CLOSE to running in the Empeg's memory footprint. So CPU isn't even an issue.

Festival is the Emacs of TTS systems. It uses a bunch of LISP files to define its various diphone and speech engines... It's amazingly configurable and tweakable, but massive and bulky. The author saw this as an opportunity to make the same kind of thing work on smaller platforms, and that's Flite. LISP is replaced by tight C code. I think it's a great effort.

Posted by: Shonky

Re: Text to Speech on the Empeg - Let's do it! - 19/04/2002 05:58

I assume Flite is Festival Lite then? The demo doesn't sound fantastic. The demos on www.cepstral.com sound really good though in my opinion. Pity it's not free....

Posted by: genixia

Re: Text to Speech on the Empeg - Let's do it! - 19/04/2002 06:43

Hey great, we can get Stephen Hawking on our empegs

Posted by: tonyc

Re: Text to Speech on the Empeg - Let's do it! - 19/04/2002 08:15

I didn't say it sounded great, I said it works.

Fine, guys, wait for perfectly human-sounding TTS to suddenly appear and be able to run in the Empeg's limited resources.

Incidentally, a while back I spoke with Alan Black, author of Flite. He seemed very interested in selling me Cepstral. I didn't follow up on it because I had other things to do at the time. The thing is, I don't think Cepstral has source code available, which would mean we'd have to work out a way to modify it (or have them add support for the Empeg's sound device and for upsampling the sound).

If anyone really wants Cepstral I can try to resume dialog with him. The thing is I think we should start working with what we have. Too often around here people want the Mercedes without trying things out on the Chevy Impala first.

Posted by: tfabris

Re: Text to Speech on the Empeg - Let's do it! - 19/04/2002 08:16

Yeah, but won't you be infringing on a patent now if you do this?

Posted by: tonyc

Re: Text to Speech on the Empeg - Let's do it! - 19/04/2002 08:18

Yeah, but won't you be infringing on a patent now if you do this?

I think I'll get over the mental anguish.

Posted by: rob

Re: Text to Speech on the Empeg - Let's do it! - 19/04/2002 10:47

Fine, guys, wait for perfectly human-sounding TTS to suddenly appear and be able to run in the Empeg's limited resources.

That'll be sometime around Q3 then.. but I suspect it'll take a bit longer for a free package to meet those requirements.

Rob

Posted by: jwtadmin

Re: Text to Speech on the Empeg - Let's do it! - 19/04/2002 11:08

I think that this is a great idea, and I would love to see it implemented. This would be great for announcing tracks, say for example if your empeg was in a convertible and you couldn't read the display.

Or definitely for the trivia game, which I am eagerly awaiting.

The voice is shaky but as good as my TI-99/4A, and if a better voice comes out then we can upgrade.

Posted by: snoopstah

Re: Text to Speech on the Empeg - Let's do it! - 19/04/2002 12:57

Mmmm... it probably sounds better if you knew what it was meant to say!

But it could be pretty good - ideally, you'd have an option in emplode or similar to type in an alternative album/track name for it to say, if it messed up something really big - and otherwise it would just say the stored name.

Cheers,

A.

Posted by: grgcombs

Re: Text to Speech on the Empeg - Let's do it! - 19/04/2002 14:51

No, not really. The key is it works, now. It works well enough for our early adopter purposes. It's on par with Apple's Text To Speech in about 1998. And it takes up nothing like the footprint Apple's did.

Greg

Posted by: TommyE

Re: Text to Speech on the Empeg - Let's do it! - 19/04/2002 15:28

Or Amiga in early 1986....

TommyE

Posted by: rob

Re: Text to Speech on the Empeg - Let's do it! - 19/04/2002 15:46

The problem comes with proper nouns - have it speak a selection of artist names and album titles. We've tested this a lot - even the best commercial solution is only just acceptable. This is OK if you know what it's meant to be saying, but for eyes up navigation it's a bit of a pain.

This area of technology does seem to be advancing more rapidly now than ever before (both in the open source and commercial worlds) - which is nice.

Rob

Posted by: ninti

Re: Text to Speech on the Empeg - Let's do it! - 19/04/2002 15:52

I was thinking older still. It is amazing that it really sounds little better than what a 2 MHz computer could manage 20 years ago.

"H'ello, my name is Sam, I am a speech synthesizer for the Ataaari home compuuuter"

20 years....Man, I am getting old.

Posted by: rob

Re: Text to Speech on the Empeg - Let's do it! - 19/04/2002 15:58

Those old speech synths were mostly getting fed phonetic data, which is an altogether easier proposition than true TTS.

Rob

Posted by: ninti

Re: Text to Speech on the Empeg - Let's do it! - 19/04/2002 16:08

True enough, they did sound better with phonetic spelling, but Sam would read ordinary typed text as well, and it wasn't that much worse... i.e., it got most of the major curse words right, which is of course worth literally hours of amusement to junior high kids.

Posted by: tfabris

Re: Text to Speech on the Empeg - Let's do it! - 19/04/2002 16:18

it got most of the major curse words right, which is of course worth literally hours of amusement to junior high kids.

And their parents, too.

I'm not kidding. One day my friend and I came home from school to find his dad sitting at the C-64 typing text into SAM:

"<friend's dad's supervisor's name> IS A FUCKING ASS HOLE"
"<friend's dad's supervisor's name> IS A DIP SHIT"
(etc.)

Seems he'd had a bad day at work...

Posted by: loren

Re: Text to Speech on the Empeg - Let's do it! - 19/04/2002 17:53

That literally made me gut laugh... picturing an older man sitting at a C-64... LOL. Nice.

Posted by: dcosta

Re: Text to Speech on the Empeg - Let's do it! - 20/04/2002 10:06

I would like to see something where I could specify a sound file to play for each menu item.
You could use all kinds of neat custom sounds for bands...
and maybe even sound clips of the songs for each song....
that would be super neat....

Posted by: thenominous

Re: Text to Speech on the Empeg - Let's do it! - 20/04/2002 11:31

OK, so the sample mp3 ain't wonderful, but on the other hand it's not too bad either. Listening to it on crappy screenbeat3 speakers, it wasn't all that bad. Like Yn0t said, at least this will work and it's a starting point!

Thanks for the efforts you've made so far on this, and please don't let the griping interfere!

At the end of the day, how many other car stereos talk to you?

(Although I still want HAL to talk to me on boot up...)

Posted by: number6

Re: Text to Speech on the Empeg - Let's do it! - 20/04/2002 17:20

In reply to:

(Although I still want HAL to talk to me on boot up...)
...
Dave

I'm sorry Dave, I'm afraid I can't do that....

(sorry couldn't resist making the joke)...

Posted by: DomoKun

Re: Text to Speech on the Empeg - Let's do it! - 20/04/2002 23:13

hehe, I programmed my empeg to have Hal greet me on bootup.

Posted by: dcosta

Re: Text to Speech on the Empeg - Let's do it! - 21/04/2002 15:28

How did you do that?
... me wants, too.

Posted by: DomoKun

Re: Text to Speech on the Empeg - Let's do it! - 21/04/2002 17:32

Check out these 2 threads. Somewhere posted there is the pcmplay program and my bootup init. Just make a similar init script and you'll have bootup sounds too.

http://empeg.comms.net/php/showflat.php?Cat=&Board=empeg_general&Number=78613

http://empeg.comms.net/php/showflat.php?Cat=&Board=hackers_prog&Number=74361

Posted by: thenominous

Re: Text to Speech on the Empeg - Let's do it! - 22/04/2002 02:38

ok ok ok...

I guess I left myself wide open for that one

Posted by: shadow45

Re: Text to Speech on the Empeg - Let's do it! - 22/04/2002 06:11

Damnit, I think I'm going to have to change my picture now. I don't want Domo-kun to kill me!!

*fear*

Posted by: Anonymous

Re: Text to Speech on the Empeg - Let's do it! - 23/04/2002 02:09

http://www.npr.org/programs/atc/features/2002/apr/computervoices/index.html

Posted by: TheAmigo

Re: Text to Speech on the Empeg - Let's do it! - 23/04/2002 22:54

Here's what I posted over a year and a half ago.

I'd still be interested in doing the same thing.

Posted by: gbeer

Re: Text to Speech on the Empeg - Let's do it! - 24/04/2002 22:06

Eh-Ahem...

In reply to:
altman
(addict)
05/10/99 10:46 AM
Voice prompts, including a self-recorded audio tag for your own presets (and all the "factory" voice prompts being in a replaceable/updateable playlist, so you can rerecord the prompts yourself) are on the to-do list, you'll be glad to know

Hugo

Unless PhatBox's pending patent dates earlier than this...

Posted by: Cliff

Re: Text to Speech on the Empeg - Let's do it! - 24/04/2002 23:06

IBM has ViaVoice running on the iPaq which I believe is the same processor as the empeg. Their "flagship" product does both speech recognition and TTS. Maybe we could do a group buy. ;^)
See more here: http://www-4.ibm.com/software/speech/enterprise/ms_evvee.html

Posted by: tonyc

Re: Text to Speech on the Empeg - Let's do it! - 25/04/2002 06:51

IBM... "Inferior, But Marketable."

I dunno, given the responses above, it doesn't sound like people around here are very interested in TTS/voice prompting, so the concept of pooling our money and buying a closed-source product seems pretty far-fetched.

Posted by: Terminator

Re: Text to Speech on the Empeg - Let's do it! - 25/04/2002 08:54

How much money does the Cepstral guy want per license? I would be willing to pay for a voice that doesn't sound computerish and annoying.

Sean

Posted by: tonyc

Re: Text to Speech on the Empeg - Let's do it! - 25/04/2002 10:10

Not sure, I can drop him another email. He's a professor at Carnegie Mellon, and he does the Cepstral stuff in his free time. He seems partial to the ARM platform (he released v1.1 of flite with ARM binaries) but more so due to the iPAQ. He didn't seem to have as much zest for the Empeg as the guy who's giving us a free Vorbis license, but I'll see what I can work out.

The problem is we'd need source code so we can graft in sample rate conversion and native use of the Empeg's quirky sound buffer. Not sure we'd be able to convince him to develop a custom Empeg version of Cepstral due to its sample rate/buffer size quirks.

Posted by: Terminator

Re: Text to Speech on the Empeg - Let's do it! - 25/04/2002 10:40

See what he has to say on what it would take and how difficult it would be. Do we have any empeg owners that live in Pittsburgh that would be willing to demo the empeg for him? Once he sees how it works, he might be more interested in sharing source code under NDAs etc.

Maybe the empeg guys could give us some tips if they think it could be useful in a future product.

Sean

Posted by: Terminator

Re: Text to Speech on the Empeg - Let's do it! - 25/04/2002 10:48

While surfing the Cepstral site, I stumbled across an open source speech recognition system.

http://www.speech.cs.cmu.edu/sphinx/

It mentioned it was a decent candidate for embedded devices. Is this useful at all for the empeg? Any speech recognition experts around?

If you'd like to have a chance to try out an application that uses CMU Sphinx, try the Communicator, an experimental system that helps you plan air travel. You can reach it at the toll-free number 1-877-CMU-PLAN (1-877-268-7526) or at +1 412 268 1084. The system will provide real flight information.

Sean

Posted by: rob

Re: Text to Speech on the Empeg - Let's do it! - 25/04/2002 11:07

We're going in our own (commercial) direction with TTS. We got some of the free systems up and running and were not impressed.

Rob

Posted by: Daria

Re: Text to Speech on the Empeg - Let's do it! - 25/04/2002 11:12

In reply to:
Do we have any empeg owners that live in Pittsburgh that would be willing to demo the empeg for him? Once he sees how it works, he might be more interested in sharing source code under NDAs etc.

Better yet, I work at Carnegie Mellon. I should probably put my dash back together before I show it off.

Posted by: JeepBastard

Re: Text to Speech on the Empeg - Let's do it! - 30/06/2002 07:50

Noise-Robust Speech Recognition Technology

Motorola Clamor™ software provides developers with a highly accurate, noise-robust, small vocabulary speech recognizer for use in a variety of electronic and computing applications. The Clamor software-based recognizer was designed to be specifically insensitive to ambient noise and to work with most languages. This makes it suitable for embedded command-and-control applications in phones, PDAs, stereos, set-top boxes, games, automotive equipment, and countless other products.

Speech recognition is coming of age in today's high-tech world. Mobile professionals need access to their pertinent data at all times, often in situations where having one's hands free is important. Imagine a situation in which you could navigate your voicemail system, moderate your car radio volume, or dial your cellular phone using only your voice. Clamor software enables these and many other scenarios in which mobile professionals will be able to access information on the fly.

Features:

Noise Robust - highly accurate even in noisy environments.
Works well in cars with windows open, airport lobbies or with loud music playing.
Dependable - Press-to-talk helps guarantee reliable recognition
Flexible Vocabulary - Up to 40 words or short phrases (up to two seconds in duration) active at one time.
Unlimited number of switchable dictionaries allows for expanded vocabularies.
Accommodates Multiple Users - More than one user can train the system for accurate recognition of his/her voice.
Only two repetitions needed to train each word or phrase.
Works Across Languages - Language independent system will work with virtually any language.
American and British English, French, German, Farsi, Japanese, Cantonese and Mandarin have been tested.

Embedded Platforms:

Motorola's M*CORE®, 56166 and 56800 DSP series, ARM 610/710, StrongARM™, Intel™ X86, OS-9, Windows™ 98, NT and CE. Written in ANSI C, the Clamor software system is portable to most selected platforms.

Requirements:

Microprocessor capable of 10 MIPS or more is desirable
8kHz or 16kHz sampling rate
Nonvolatile memory is required to store user training data
ARM implementation requires 32K Flash for voice templates and 20K RAM for code and buffering
Response Time- One-second average response time with 13MHz ARM 610 implementation.
0.15 second average response time with Pentium™ 133 MHz Windows 95/98 implementation.
Recognition Engine Size- Hand assembled DSP implementation code size for the recognition engine is 1KB.
Speech templates are approximately 16KB for 10 words or short phrases (or .5KB for an average two second voice template)

Windows® evaluation SDK available.

Posted by: JeepBastard

Re: Text to Speech on the Empeg - Let's do it! - 30/06/2002 08:06

http://www.okisemi.com/public/docs/PrdSpeech.html

Speech ICs for external dev.

Posted by: jheathco

Re: Text to Speech on the Empeg - Let's do it! - 30/06/2002 10:47

I just tried calling that number, pretty neat. The voice recognition seems pretty accurate, don't know how it would do in a noisy area like the car though, maybe I should try calling on my cellphone while driving. What do you guys think of this?

Posted by: Terminator

Re: Text to Speech on the Empeg - Let's do it! - 30/06/2002 16:58

Cool, too bad there isn't an evaluation version that someone could get working on the empeg.

Sean