Attendance at SpeechTEK West 2006 seems lighter than in past years, though it was a little hard to judge the total because the technical sessions were in meeting rooms far away from the business sessions. The business sessions were definitely more lightly attended than the technical ones. I wanted to catch up on business issues the first day, so I focused on the industry workshops.
I started out with the Retail Industry Workshop Monday morning. The CTO of Voxify, Amit Desai, was one of the panelists. I'm obviously biased, but I think he did a great job, and his presentation was very informative. One of the key points he covered was the ability of speech applications to help companies handle huge spikes in call volume. For retailers, the peaks sometimes stretch over longer periods, such as the few months before the end-of-year holidays. That spike is becoming even more compressed as people increasingly buy holiday gifts online and have grown comfortable having gifts shipped directly to the recipients. Even when the spike lasts a few months, it is very difficult for a retailer to plan, hire, and train enough staff to handle all the calls it will receive.
The spikes can be even more dramatic when a retailer runs a short-duration promotion. For example, Voxify handled the calls for some commercials that ran during a couple of recent major sporting events, with another big one still to come. Our speech applications received around 1,000 simultaneous calls each time the commercial ran, and no caller had to wait in a queue. Since most people called right after the commercial aired, the volume had mostly fallen off within thirty minutes. Matching that with live agents would have required an equal number of agents on hand to keep every caller out of the queue. Even if you did force many of the callers to listen to hold music, hundreds of additional trained agents would have been needed for only about twenty minutes of work. This is clearly a situation where speech applications bring a huge benefit.
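Rough arithmetic on those figures: a 1,000-agent peak that lasts about twenty minutes per airing amounts to only around 300 agent-hours of actual talk time, yet you would have to recruit, train, and schedule all 1,000 people to capture it, and that capacity would sit idle the rest of the time.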
Companies in the travel and hospitality business have similar seasonal spikes, but they also suffer from unpredictable, weather-related ones. We saw huge increases in calls to our travel and hospitality client applications after Hurricane Katrina, and we see similar spikes every winter when a big snowstorm strikes part of the US or Canada.
Back to SpeechTEK. Someone from Versay (looking at my notes, I realize I was pretty bad about writing down speakers' names) talked about VoIP and speech. One of their clients had IVRs in many branches so it could offer local-number access. They have since moved that customer over to a VoIP network; many of the big VoIP providers, like Level3, provide local-number access for most of the US. I think the use of VoIP networks for hosted speech applications is going to be a big trend over the next few years.
Currently, they are using G.711 as the audio codec. This doesn't save any bandwidth (in fact it eats up quite a bit: 64 kbps for the RTP payload plus roughly 30 kbps of overhead for the RTP, UDP, IP, and Ethernet headers), but he said they felt the bandwidth costs weren't that bad. Although VoIP brings the promise of lower-bitrate codecs, speech recognition engines need all the signal they can get in order to recognize speech accurately. Many of the lower-bitrate codecs exploit limitations in human audio perception; speech recognition algorithms don't share those limitations, so those codecs throw away data the recognizer could have used. He did say they were evaluating some of the lower-bitrate codecs for potential future use.
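That overhead figure checks out with a quick back-of-envelope, assuming the common 20 ms packetization: 50 packets per second, each carrying 160 bytes of audio (the 64 kbps) plus 12 bytes of RTP header, 8 of UDP, 20 of IPv4, and about 38 of Ethernet framing once you count the frame header, checksum, preamble, and inter-frame gap. That is 78 bytes of overhead per packet, or 78 × 50 × 8 ≈ 31 kbps, so each direction of a call consumes about 95 kbps on the wire.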
I then attended the Financial Services workshop. Someone from Loquendo, a spin-off from Telecom Italia, started off with a demo of their TTS engine. It was extremely impressive. I've listened to the output of quite a few TTS engines, but this one was by far the best. The base engine is quite good, but the ability to fine-tune the prosody via SSML tags is amazing.
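For anyone who hasn't played with SSML, the tuning looks something like this (a generic SSML 1.0 sketch of my own, not Loquendo's demo; the say-as categories an engine accepts vary by vendor):

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  Your payment of
  <!-- say-as interpretation values like "currency" are engine-specific -->
  <say-as interpret-as="currency">$1,234.56</say-as>
  is due
  <break time="200ms"/>
  <prosody rate="slow" pitch="+10%">this Friday</prosody>.
  <emphasis level="strong">Please call us</emphasis> if you have questions.
</speak>

A good engine turns those rate, pitch, break, and emphasis hints into output that sounds deliberate rather than robotic, which is exactly what the Loquendo demo showed off.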
She then listed quite a few of their customers and many of the applications built for them, though I would have preferred more detail about just a few of the apps rather than a long list. Many of the apps were very simple, but some sounded quite complicated. One very interesting one is a Java app running on mobile phones for ebankinter.com that generates about 2% of the trading volume on the Madrid Stock Exchange. This multi-modal app (you can speak to it and also see and interact with related text on the phone's screen) comes pre-packaged on mobile phones, primarily BlackBerry devices. In the near term, I think pre-packaging is the only viable way to do a large-scale deployment of a multi-modal app; you face too many issues getting the app to work reliably on all the different devices customers will want to use.
Someone from Adeptra gave a really interesting presentation on auto-resolution, i.e., automatically verifying suspected fraud by calling a cardholder after a purchase. Other vendors provide tools for rating transactions on their likelihood of fraud, and credit card issuers use these ratings to decide when to call a cardholder and confirm that they actually initiated the transaction. The issuers can save a lot of money by catching fraud early.
The problem is that these systems produce a lot of false positives. While issuers want to err on the side of safety, they don't want to annoy their customers, and paying people to make these routine calls costs a lot of money. Adeptra offers speech applications that automate placing the outbound call, detecting whether a person answered the phone (which can be as simple as asking them to press any key), verifying their identity using the same questions a live agent would use, and then asking whether they initiated the transaction in question.
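The answer-detection step of an outbound app like that could be sketched in a few lines of VoiceXML (my own illustration, not Adeptra's code; the form names are made up):

<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="detect_person">
    <!-- A key press distinguishes a live person from an answering machine -->
    <field name="anykey" type="digits">
      <prompt>
        This is an important call from your card issuer.
        If you are there, please press any key on your phone.
      </prompt>
      <noinput>
        <!-- No key press: probably voicemail, so leave a callback message -->
        <goto next="#leave_message"/>
      </noinput>
      <filled>
        <!-- A person answered: proceed to identity verification -->
        <goto next="#verify_identity"/>
      </filled>
    </field>
  </form>
  <!-- verify_identity, confirm_transaction, and leave_message forms follow -->
</vxml>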
They also offer apps for collections. He said that about 85% of the targets are people who just need to be reminded to make a payment, and they usually prefer a call from a computer to one from a live person because it is less embarrassing. Another 10% are in some financial trouble but will typically pay companies in the order those companies contact them; by automating the calls, Adeptra's clients can get in line first. The final 5% are the deadbeats who won't pay a human or a computer.
The Financial Services workshop also included a presentation from someone from TellMe. While it was not particularly specific to financial services, it was a useful general discussion of UI design for speech apps. They have developed a quantitative approach to rating speech apps, and as part of building it they had to reach internal agreement on the relative importance of the major elements of a caller's experience. I think there is a lot of value in an app development team having that discussion before building an application. They feel the most important elements are interaction quality and production quality, but they also rate things like accessibility and how seamlessly the app hands callers off to an agent. He played a lot of demos of really bad DTMF and speech apps, and a couple of decent ones.
Finally, I caught the end of the Healthcare workshop. Healthcare calls can be difficult to automate due to privacy issues, but there are still a lot of opportunities. Medicare-related apps are particularly difficult to develop because of all the privacy and general regulatory issues. Even so, there are plenty of opportunities to provide these applications in a hosted environment as well as on premise.