A work colleague (thanks, Rob!) emailed me the text of an article in the NY Times about speech recognition applications (free registration required). It’s a mostly favorable discussion of commercial speech applications and speech recognition technologies, although, to no one’s surprise, they ran across a couple of frustrated users.
One of the areas taking the biggest hit in the article is natural language speech recognition (NLSR), e.g., “How can I help you?” A few people out there (e.g., the guy from IBM interviewed in the article and one of the Microsoft guys who spoke at last Spring’s SpeechTek) seem to be living in the fusion world, where the big delivery is always just ten years away. But if you look at the pace at which NLSR has improved in the last couple of years, it’s really hard to believe that it’s going to be hugely better in ten years. Moore’s Law will definitely help out, but if it were just a CPU cycle problem, why don’t you see anybody using grids or supercomputers to deliver human-like NLSR? I believe it’s going to take several major, major scientific breakthroughs before NLSR is good enough to be widely used.
NLSR works great if you have only a small set of questions to ask but need to handle a wide range of answers spoken in a wide variety of ways. The problem is that building the statistical language models for the questions and answers is a lot of work, and it gets very expensive very quickly. Still, there are obviously significant advantages to allowing people to respond in full sentences.
Directed dialog works great when you have a large set of questions whose answers are more predictable. While any good-quality speech application will beef up the grammar to handle the extra “uhs”, “ums”, “please”, and “thanks” of everyday speech, speakers are still restricted to a more limited set of utterances: short phrases rather than full sentences. Nonetheless, a well-designed directed dialog application can be highly usable and still relatively inexpensive to build.
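To make the "beefed up grammar" idea concrete, here is a toy sketch in Python. It is not how any real speech platform does this (those use proper grammar formats and work on audio, not text). It assumes the recognizer has already produced a text transcript, and the prompt, the filler list, and the name `match_utterance` are all made up for illustration.

```python
import re

# Hypothetical filler words a directed-dialog grammar might tolerate.
FILLERS = {"uh", "um", "please", "thanks"}

# Hypothetical prompt: "Which account: checking or savings?"
EXPECTED = {"checking", "savings"}


def match_utterance(utterance, expected=EXPECTED, fillers=FILLERS):
    """Return the expected phrase the caller said, or None if out of grammar."""
    # Lowercase the transcript and split it into words, dropping punctuation.
    words = re.findall(r"[a-z']+", utterance.lower())
    # Throw away the fillers; whatever is left must be exactly an expected phrase.
    core = " ".join(w for w in words if w not in fillers)
    return core if core in expected else None
```

With this sketch, "Um, checking, please" matches "checking", but a full sentence like "I'd like to look at my savings account" falls outside the grammar and returns `None`, which is exactly the restriction described above.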
For some applications, a hybrid of the two can work well, with the initial question or two handled via NLSR, and the rest of the conversation handled as directed dialog. The downside, though, is that hybrid applications can be much more expensive to build. You have to license more products and you need developers experienced in more technologies. Also, callers can be misled by the open-ended nature of the initial question. They then get frustrated when full sentences aren’t understood as responses to the other prompts. As they say in project management, it’s all about setting expectations appropriately.