Kai-Fu Lee is the VP for Speech Solutions at Microsoft. He spoke at SpeechTEK after Bill Gates last week, going into much more detail on Microsoft Speech Server. Microsoft is targeting medium (25-250 agent equivalents) and large (250+ agent equivalents) enterprises. This was a bit surprising to me, as Speech Server appears to be a typical Microsoft 1.0 release: lacking in features, average in performance, and somewhat less than stable (based on the demos, anyway). I expect they will actually have far more success on the low end, but I understand the need to put on a good show about the product being enterprise ready.
There isn’t a lot that is innovative in their solution. It’s good and it’s cheap, and there’s a lot to be said for that, but mostly it’s a clone of what many other companies have been doing with VoiceXML for quite a few years. As Kai-Fu said, Microsoft is good at volume sales. I think they should be proud of what they have created, but they’re still a few years behind most of their competitors. The race is on.
Kai-Fu said that customers have told them that speech systems are too expensive, too complex, too inflexible with respect to scaling and deployment, and not well integrated. Microsoft appears to have taken a good shot at the first. We’ll have to wait to see how they do against the other objectives.
A product manager then gave a really basic demo of changing a hotel reservation. The first call failed to connect, but Speech Server managed to respond to the second call. This was followed by a demo of a simple multimodal app using Pocket Internet Explorer and speech recognition. The Pocket PC UI giving feedback on microphone signal strength was cool.
The biggest news by far was their pricing. They do pricing per simultaneous speech channel and per processor. They also provide a low end Standard Edition and a high end Enterprise Edition.
- Standard Edition – 4-24 channels – $8,000 per processor
- Enterprise Edition – 24-96 channels per node – $18,000 per processor
The packages include the development tools, a SALT browser, ASR, and TTS. Both editions include the ScanSoft Speechify TTS engine. That’s good, because my experience with Microsoft’s own TTS software had been pretty iffy. Their ASR software was mediocre, too, but I have heard from several sources that it is significantly improved. You can use the Enterprise edition with ScanSoft’s OSR ASR engine, but you can use only Microsoft ASR with the Standard edition. Nuance fans need not apply for either version. Perhaps Nuance would not bend to the OEM pricing levels that Wal-Mart, I mean Microsoft, demanded.
Kai-Fu glossed over the fact that you still have to buy all the telephony hardware and software from Intel/Dialogic and Intervoice if you actually want to use Speech Server with live telephone calls. Also, VoIP is not supported, just plain old PSTN style calls. The Microsoft website links you to some partner sites where you can request a price quote for a starter system. Microsoft offers a full-featured 180-day trial version of Speech Server, but you still have to buy all the telephony equipment. Even the most basic setup will cost you about $1000 at a deep discount from their partners trying to grab market share.
Standard edition is an all-in-one box. Everything has to run on the same server, so you might see some performance problems if your applications are complex and have elaborate recognition requirements. Also, you get no failover capabilities.
Kai-Fu said that speech application development costs are way too high. They hope to unleash a significant portion of the alleged 7 million developers using Visual Studio onto building speech apps. I worry that this will be like the early versions of VB and FrontPage all over again, with a sea of really bad speech apps to replace the bad desktop apps and bad websites. What makes it even worse is that voice user interfaces are even harder to design than graphical user interfaces. The Microsoft speech tools are not bad, but they have a very long way to go before your average developer is going to be able to write a speech app that you can tolerate using more than once.
The presentation was followed by a couple of customer demos, none of which went smoothly. First up was the NYC Department of Education. They have a web portal that parents can use to get info (absences, grades, food menus, etc.) about their kids and their school. They wanted to speech enable it so as to offer access to those families without computers. However, my understanding from Kai-Fu’s speech was that Speech Server supports English only. I suspect that the parents in many of the families without computers speak little to no English. The presenter called the number three times before he finally got ringback, but Speech Server never answered the call. After a couple of minutes, an assistant finally got it to work. They did a pretty basic speech enabling of the web portal. Nothing exciting, but it did show that Speech Server actually worked. It wasn’t clear whether the problems were operator error or Speech Server failing to answer calls.
The next demo was a semi-disaster. The executive director for some part of the State of Alabama Corrections had trouble seeing the keyboard and the phone. Like the previous presenter, he brought up their web portal. He made a big deal about claiming that he would use his own personal information, so as to not release any private information for a citizen of Alabama. He then proceeded to type in his OWN SOCIAL SECURITY NUMBER, in plain view of a couple hundred people whom he did not know. This brought up a web page with his birth date, height, weight, driver’s license number, license tag on his car, description of his car, etc. The NYC guy at least had a fake family in his system to use for demos. I couldn’t believe this guy hadn’t done the same.
Then the fun began. To demonstrate how a police officer would use the speech app, he called into the system and read in a license tag. Although his voice sounded pretty clear to me, the ASR engine (they didn’t say if it was MS or ScanSoft) misrecognized several characters. It then read back some private information about a vehicle owned by a tow truck company in Clayton, Alabama. So much for the protection of the private info of Alabama citizens. After another try using the NATO phonetic alphabet (Alpha, Bravo, Charlie, etc.) for the letters, he got it to work.
This was followed by a demo from two people from Grange Insurance in Seattle. Their demo actually worked on the first try.
Finally an ISV, Solar Software, and an SI, Accenture, gave demos. Their demos went very well. Solar Software speech enabled Microsoft CRM. Accenture showed a multimodal app of questionable value, but at least it worked. Their argument for going with Speech Server was that it was inexpensive (Kai-Fu Lee prefers the term “better economics”) and they could use the Visual Studio environment that they were already familiar with. Given the short timeframe to give these demos, it’s a little tough to do something really fancy. So, I probably shouldn’t be so hard on these guys.