Tuesday morning I attended sessions on core speech technology and dialog design.
Dr. Randy Ford from Sonum Technologies talked about using strong Natural Language Processing (NLP) to improve speech recognition. He claimed that by using N-gram substitution (e.g., replacing the likely misrecognition “think you” with “thank you”), phonetic tumbling, or a hybrid of the two, you can reasonably achieve a 20% improvement over base recognition accuracy. I wish I could provide more detail on phonetic tumbling, but he had to rush through the end of his presentation and I was too busy taking notes on what he had previously said.
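If it helps to picture the N-gram substitution idea, here is a minimal sketch of it as a post-recognition cleanup step. The substitution table, function name, and matching strategy are my own illustration, not Sonum's implementation; a real system would presumably learn likely misrecognitions from data rather than hard-code them.

```python
# Illustrative sketch of N-gram substitution as a post-recognition step.
# The substitution table below is invented for this example; a production
# system would derive likely misrecognitions from transcription data.

# Map of commonly misrecognized word sequences to their likely intended forms.
LIKELY_MISRECS = {
    ("think", "you"): ("thank", "you"),
    ("wreck", "a", "nice", "beach"): ("recognize", "speech"),
}

def apply_ngram_substitution(words):
    """Scan the recognized word sequence and replace known misrec n-grams."""
    result = []
    i = 0
    while i < len(words):
        replaced = False
        # Try the longest n-grams first so multi-word fixes win over shorter ones.
        for n in sorted({len(k) for k in LIKELY_MISRECS}, reverse=True):
            ngram = tuple(words[i:i + n])
            if ngram in LIKELY_MISRECS:
                result.extend(LIKELY_MISRECS[ngram])
                i += n
                replaced = True
                break
        if not replaced:
            result.append(words[i])
            i += 1
    return result

print(apply_ngram_substitution("i think you for calling".split()))
# ['i', 'thank', 'you', 'for', 'calling']
```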
Yoon Kim from Novauris discussed using phonetic techniques to improve recognition for large lists. By taking into account syllable structure and stress, they feel they can significantly improve recognition performance over that of conventional SR engines for items in a large corpus. With respect to lexical stress, they are analyzing the stress placed on consonants coming immediately before or after vowels, as well as many aspects of how the vowels themselves are pronounced. Recognition accuracy drops for unstressed vowels, so they have found it particularly helpful in that case to also look closely at the stress placed on nearby consonants. Language-specific syllable structure affects how important this differentiation can be: while English generally has more complicated syllable structure and stress distinctions than Korean, Korean can make much finer distinctions between consonants preceding or following vowels.
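To make the general idea concrete, here is a toy sketch, entirely my own construction and not Novauris's algorithm, of rescoring candidates from a large list by trusting evidence from stressed syllables more than from unstressed ones. The data structures, weights, and example scores are assumptions for illustration only.

```python
# Toy illustration (not Novauris's method): rescore list candidates by
# weighting stressed syllables more heavily, since unstressed (reduced)
# vowels tend to be recognized less reliably.

def rescore(candidates, stress_weight=2.0):
    """
    candidates: list of (name, syllable_scores, stress_pattern) where
      syllable_scores - per-syllable acoustic match scores in [0, 1]
      stress_pattern  - booleans marking which syllables carry stress
    Returns the candidates sorted best-first by stress-weighted score.
    """
    def weighted(scores, stresses):
        total, norm = 0.0, 0.0
        for score, stressed in zip(scores, stresses):
            w = stress_weight if stressed else 1.0
            total += w * score
            norm += w
        return total / norm if norm else 0.0

    return sorted(candidates,
                  key=lambda c: weighted(c[1], c[2]),
                  reverse=True)

# Toy numbers: plain averaging would pick "Anderson", but weighting the
# stressed syllables flips the ranking to "Andersen".
print(rescore([
    ("Anderson", [0.6, 0.9, 0.9], [True, False, False]),
    ("Andersen", [0.9, 0.5, 0.9], [True, False, True]),
]))
```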
Vlad Sejnoha from Nuance then gave a talk on current speech technology work at Nuance. The talk was very similar to ones given at last Fall’s Nuance Conversations conference. One of these days, I’ll post my notes from that conference.
In the Dialog Design panel, Sondra Ahlén spoke about Spanish voice talents, including Spanish language TTS voices. She provided a lot of interesting trivia on Spanish speakers (the countries with the most Spanish speakers are, in order, Mexico, Colombia, the US, and Spain; 12% of US residents speak Spanish, and half of those don’t speak English; Colombian Spanish is considered the standard dialect for Latin America; Mexican Spanish is generally considered the standard dialect for the US). She also gave some examples of the differences between the dialects. She recommended that you always use a native speaker and that you match the dialect to the greatest common population in the expected audience. She then played some sample recordings from the most popular Spanish TTS voices, pointing out that while they are not as well tuned as English TTS voices, they are still quite usable.
My friend Bob Cooper from Avaya then spoke about an older product developed when he was at Conita, which was later acquired by Avaya. He discussed dialog design considerations for power users who use an application multiple times per day: use auditory feedback and lots of audio cues, optimize for the common path, and replace separator words with distinctive sounds.
A grad student working with Intervoice presented his work on automatic tuning of context-free grammars. He used the SONIC large-vocabulary (~75k words) SR engine from CU Boulder to transcribe previously recorded utterances. He then re-ranked n-best lists using phonetic, local, syntactic, and semantic weights. Or at least, so say my hastily scribbled notes. He then employed Princeton’s WordNet to provide automatic categorization via synonyms. Lexical chains were also used to classify the transcriptions. The most common utterances were automatically added to the grammar, and common sub-sequences were favored over longer sequences. He claimed that for one test with an initially untuned application, his automatic grammar tuner performed within 2% of a manual tuning performed by someone at Intervoice.
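For the WordNet step, here is a minimal sketch, my own and not the presenter's code, of grouping transcribed utterances by synonym overlap using NLTK's WordNet interface. It assumes NLTK and the WordNet corpus are installed, and the crude head-word choice is purely illustrative.

```python
# Minimal sketch of synonym-based categorization with WordNet via NLTK.
# This illustrates the general idea only, not the presenter's system.
# Requires: pip install nltk, then nltk.download('wordnet').

from nltk.corpus import wordnet as wn

def share_synset(word_a, word_b):
    """True if the two words appear together in at least one WordNet synset."""
    return bool(set(wn.synsets(word_a)) & set(wn.synsets(word_b)))

def categorize(utterances):
    """Greedily group utterances whose head words share a WordNet synset."""
    categories = []  # list of (representative_word, [utterances])
    for utt in utterances:
        head = utt.split()[-1]  # crude head-word choice; a real system would parse
        for rep, members in categories:
            if head == rep or share_synset(head, rep):
                members.append(utt)
                break
        else:
            categories.append((head, [utt]))
    return categories

print(categorize([
    "pay my bill",
    "pay my invoice",
    "check my balance",
]))
# Expected: the two billing utterances grouped together ("bill" and "invoice"
# share a synset), with "check my balance" in its own category.
```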