The Baltimore Sun published an excellent, detailed article on expressive speech synthesis [requires free registration] that focuses mostly on work going on at IBM. The resulting speech synthesizer would not only be able to laugh, cough, pause for a breath, and say uh and um, but would also smoothly switch between an uptone voice and a downtone or neutral voice when appropriate.
The article also discusses speech recognition research aimed at detecting when a speaker is frustrated or angry. One of the claimed benefits of this capability would be more quickly detecting when a caller should be transferred to a live agent. I don’t think this is a huge step forward, though, since most speech applications can already handle this pretty well. Most callers know that if they press zero or say “agent”, they will get transferred to an agent relatively quickly, depending on how the speech application was programmed. One catch is that the other grammars need to be designed to avoid collisions with words like “agent” and “operator” as much as possible.
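The escape-to-agent behavior described above can be sketched in a few lines. This is a hypothetical illustration, not how any particular speech platform works: real systems typically declare this in grammars and event handlers rather than application code, and the function and word list here are invented for the example.

```python
# Hypothetical sketch of escape-word routing in a speech application.
# ESCAPE_WORDS and route_utterance are invented names for illustration.

ESCAPE_WORDS = {"agent", "operator", "representative"}

def route_utterance(utterance, dtmf=None):
    """Decide whether caller input should escape to a live agent.

    Returns "live_agent" when the caller presses zero or says an
    escape word; otherwise "self_service" so the automated flow
    continues.
    """
    if dtmf == "0":
        return "live_agent"
    words = set(utterance.lower().split())
    if ESCAPE_WORDS & words:
        return "live_agent"
    return "self_service"

print(route_utterance("get me an agent"))     # live_agent
print(route_utterance("check my balance"))    # self_service
print(route_utterance("", dtmf="0"))          # live_agent
```

The collision problem mentioned above shows up immediately in a sketch like this: in a travel application, “agent” might legitimately mean “travel agent” within another grammar, so the escape vocabulary has to be chosen with the rest of the application in mind.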
Also, some speech applications are designed to completely automate a service. If the company providing the service would lose a lot of money by allowing transfers to live agents, they might decide not to offer that feature. While that might sound short-sighted, there are definitely some scenarios where this makes a lot of sense.
Not only does the article provide a concise description of concatenative speech synthesis, but it also includes an interesting update on laughter research. Recent studies have shown that fewer than 15% of laughs come in response to intentional jokes. You’ll have to read the article to find out why people do laugh, and what the heck a laugh note is.
[via ACM News Service]