In the afternoon, I attended two sessions on multimodal applications.
Dave Raggett from the W3C started the session with a talk on Speech Enabling Web Browsers. He has been working on some prototype applications that combine AJAX with speech. He uses a local HTTP server to handle audio on the device (which, for now, is a laptop), while a remote HTTP server provides the speech services. He uses AJAX (more specifically, the XMLHttpRequest object and JavaScript) to interact with the remote server. Audio is sent in the request, and the interpretation comes back as EMMA markup (SRGS + SISR). He presented a sample application for ordering a pizza that even handled compound utterances. For a prototype, it worked reasonably well. The application was implemented in XHTML, CSS, and JavaScript. He also used AJAX for logging, which allowed him to maintain a synchronized log on the server.
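Roughly, the browser-side call might look something like the sketch below; the endpoint URL, the content type, and the callback shape are my own guesses for illustration, not Raggett's actual prototype code.

```javascript
// A rough sketch of the AJAX round trip described above. The endpoint URL,
// the content type, and the callback shape are assumptions, not the real code.
function recognize(audioData, onResult) {
  var req = new XMLHttpRequest();
  // POST the captured audio (handed over by the local audio-handling HTTP
  // server) to the remote speech server.
  req.open("POST", "http://speech.example.org/recognize", true);
  req.setRequestHeader("Content-Type", "audio/x-wav");
  req.onreadystatechange = function () {
    if (req.readyState !== 4 || req.status !== 200) return;
    // The response is EMMA markup; hand the first interpretation element
    // back to the caller, which can then fill in the pizza-order form.
    var emmaNs = "http://www.w3.org/2003/04/emma";
    var interp = req.responseXML.getElementsByTagNameNS(emmaNs, "interpretation")[0];
    onResult(interp);
  };
  req.send(audioData);
}
```

The same XMLHttpRequest channel also covers the logging he mentioned, presumably by POSTing each event to the server as it happens, which keeps the server-side log in step with the client.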
Mark Randolph talked about how Motorola is trying to evolve push-to-talk into “push-to-ask”, i.e., making speech queries against an online database. They are working with SandCherry to commercialize speech apps that use a radio network rather than a telephone network. One nice thing about the push-to-talk model is that it creates a clear endpoint for turn-taking in a speech app. They’ve introduced the +V Framework, which provides APIs to interface with local codecs. They are also doing distributed speech recognition by putting the front end of the recognizer on the device. An ANR codec is used for audio played back on the device. DMSP (Distributed Multimodal Synchronization Protocol), which uses binary XML, is used to sync the local app with the remote app. Cepstral analysis and some noise reduction are done up front, and endpoint markers are added to aid with transcription. Noise reduction is applied only to the sound captured during the push-to-talk phase, partly because of battery-life concerns.
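To make the turn-taking point concrete, here is a minimal sketch, assuming a device that exposes button and audio-frame callbacks. Everything in it is illustrative, not anything Motorola showed, and it is in JavaScript only to stay consistent with the rest of this post; the real front end is of course native code on the handset.

```javascript
// Sketch of the push-to-talk gating described above. The function names,
// the stand-in DSP steps, and the server URL are all mine, not Motorola's
// +V Framework APIs.
var frames = [];
var talking = false;

function onPttPress() {          // button down: unambiguous start of the turn
  talking = true;
  frames = [];
}

function onAudioFrame(samples) { // called with each frame of captured audio
  if (talking) frames.push(samples);  // buffer only while the button is held
}

function onPttRelease() {        // button up: unambiguous end of the turn
  talking = false;
  // Noise reduction and the recognizer front end run only over the captured
  // turn, which is cheaper on the battery than processing audio continuously.
  var features = frames.map(function (frame) {
    return logEnergy(removeDc(frame)); // stand-ins for noise reduction and
  });                                  // cepstral analysis
  // Ship the features (not the raw audio) plus endpoint markers to the server.
  var req = new XMLHttpRequest();
  req.open("POST", "http://asr.example.org/turn", true);
  req.setRequestHeader("Content-Type", "application/json");
  req.send(JSON.stringify({ start: 0, end: features.length, features: features }));
}

function removeDc(frame) {       // crude noise-reduction stand-in: remove DC offset
  var mean = frame.reduce(function (a, b) { return a + b; }, 0) / frame.length;
  return frame.map(function (x) { return x - mean; });
}

function logEnergy(frame) {      // crude feature stand-in: log frame energy
  var e = frame.reduce(function (a, b) { return a + b * b; }, 0);
  return Math.log(e + 1e-10);
}
```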
Luisa Cordano from Loquendo kicked off the second session. She talked about work they are doing with Airbus: SNOW is a project to provide multimodal access to maintenance information for workers. She played a video demo in which a worker captured video with a head-mounted camera, called up manuals via speech, and had information displayed on a PDA. The speech and PDA media channels were synchronized.
Someone from Nortel talked about the benefits of standards and gave a high-level overview of the kinds of speech and multimodal apps that companies have been building for many years.
Jim Barnett talked in more depth about X+V and SALT plus XHTML. He explained how the X+V sync element provides explicit binding of slots. There was some good info in his talk, but not enough of it; this happened to a lot of the speakers at the end of sessions, as their time slots got compressed by earlier speakers.
Finally, Dave Burke of VoxPilot gave a glossy, and yet very informative and technical, presentation on video interactive services. He talked about what they are doing with 3G mobile video (H.324M, 64 kbps per channel) and with video over IP. Video is H.264 or MPEG-4 and audio is AMR or G.723; for video over IP, RTP carries the video stream. They use the VoiceXML audio element for video. It works, but there has been some discussion on the Voice Browser Working Group mailing list about adding tags for other media, such as video. He also talked about video streaming with Skype and Sony IVE (Instant Video Everywhere).