Microsoft Can Recognize Speech as Well as Humans on Switchboard Task

Microsoft researchers have been steadily improving the accuracy of their speech recognition technology—a necessary precursor to many other artificial intelligence applications—achieving parity with human performance on a benchmark task called Switchboard last year.

Since then, the bar for human parity has been moved. Another research group measured a lower human word error rate of 5.1 percent by using multiple human transcribers. “This was consistent with prior research that showed that humans achieve higher levels of agreement on the precise words spoken as they expend more care and effort,” writes Xuedong Huang, Microsoft technical fellow and leader of the company’s speech and dialogue group.

Now, Microsoft’s technology has matched the 5.1 percent word-error rate.
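For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions, and insertions needed to turn the system's output into the reference transcript, divided by the number of words in the reference. The sketch below is a minimal illustration of that standard definition. It is not Microsoft's scoring pipeline, and the function name and whitespace tokenization are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumes simple whitespace tokenization; real benchmark scoring
    also normalizes text before comparing (an assumption not shown here).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

By this definition, a 5.1 percent rate means roughly one word in 20 is wrong relative to the reference transcript.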

“We reduced our error rate by about 12 percent compared to last year’s accuracy level, using a series of improvements to our neural net-based acoustic and language models,” Huang writes, announcing the milestone. “We introduced an additional CNN-BLSTM (convolutional neural network combined with bidirectional long-short-term memory) model for improved acoustic modeling. Additionally, our approach to combine predictions from multiple acoustic models now does so at both the frame/senone and word levels.”
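Huang's post does not spell out how the predictions are combined. As a rough illustration of what frame-level combination can look like, the sketch below takes a weighted average of each model's per-frame probabilities over senones (the small sound units an acoustic model predicts) and then picks the most likely senone in each frame. The function names, the equal default weights, and the toy numbers are all assumptions for the example, not details of Microsoft's system.

```python
from typing import List, Optional

# posteriors[m][t][s] = probability of senone s at frame t under model m
Posteriors = List[List[List[float]]]


def combine_frame_posteriors(
    posteriors: Posteriors, weights: Optional[List[float]] = None
) -> List[List[float]]:
    """Weighted average of per-frame senone probabilities across models."""
    n_models = len(posteriors)
    if weights is None:
        # Hypothetical default: trust every model equally.
        weights = [1.0 / n_models] * n_models
    n_frames = len(posteriors[0])
    n_senones = len(posteriors[0][0])

    combined = []
    for t in range(n_frames):
        frame = [
            sum(weights[m] * posteriors[m][t][s] for m in range(n_models))
            for s in range(n_senones)
        ]
        combined.append(frame)
    return combined


def best_senone_per_frame(combined: List[List[float]]) -> List[int]:
    """Index of the highest-probability senone in each frame."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in combined]


if __name__ == "__main__":
    # Two toy models, two frames, two senones.
    model_a = [[0.6, 0.4], [0.2, 0.8]]
    model_b = [[0.3, 0.7], [0.4, 0.6]]
    combined = combine_frame_posteriors([model_a, model_b])
    print(best_senone_per_frame(combined))
```

In the first toy frame the two models disagree, and the average sides with the more confident one. Word-level combination, which the quote also mentions, works on whole recognized words rather than frames and is not shown here.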

I profiled Huang and his team’s efforts last year as they marked the previous milestone toward human parity on this foundational artificial intelligence task. As part of my reporting, I paid careful attention to what I do as a human when I transcribe a recorded interview. It’s a painstaking process: It takes me about 2.6 minutes to accurately transcribe each minute of recorded conversation. Automating it could save countless hours.

Huang says his team improved the recognizer’s language model by using “the entire history of a dialog session to predict what is likely to come next, effectively allowing the model to adapt to the topic and local context of a conversation.”

That’s something human transcriptionists do, too. My transcription speed improves as I get further into a conversation, adjusting to new vocabulary, colloquialisms, and accents.

Other improvements came purely from the realm of machines. Huang says the speech recognition team used the Microsoft Cognitive Toolkit to design and optimize models, and graphics processing units now available in Microsoft Azure to run those models.

Major cloud computing providers including Amazon (NASDAQ: AMZN) and Google (NASDAQ: GOOGL) are racing with Microsoft (NASDAQ: MSFT) and other players to provide the best platforms and tools for artificial intelligence tasks.

Speech recognition technology matters because it’s an important building block for a broad range of A.I. applications, particularly those in which machines interact with humans. Huang demonstrated one such application during a keynote address at last year’s Xconomy Intersect tech conference in Seattle. His spoken words were displayed in real time as closed captions, making it easier to follow along with the presentation. The captioning technology is now part of a product called Presentation Translator.

Microsoft chief scientist of speech R&D Xuedong Huang demonstrates a real-time captioning system at Xconomy Intersect 2016 in Seattle. Photo by Danilo Bonilla for Xconomy

Recognizing speech in a controlled environment, such as the recorded telephone conversations that make up the Switchboard task, is just one step, however. Huang identifies ongoing challenges of recognizing speech in “noisy environments with distant microphones, in recognizing accented speech, or speaking styles and languages for which only limited training data is available.”

Meanwhile, natural language understanding—the ability of a computer to extract meaning from blocks of speech or text—remains an unsolved problem.

“Moving from recognizing to understanding speech is the next major frontier for speech technology,” writes Huang, who has been working on the technology in one way or another since the early 1980s.

Author: Benjamin Romano

Benjamin is the former Editor of Xconomy Seattle. He has covered the intersections of business, technology and the environment in the Pacific Northwest and beyond for more than a decade. At The Seattle Times he was the lead beat reporter covering Microsoft during Bill Gates’ transition from business to philanthropy. He also covered Seattle venture capital and biotech. Most recently, Benjamin followed the technology, finance and policies driving renewable energy development in the Western US for Recharge, a global trade publication. He has a bachelor’s degree from the University of Oregon School of Journalism and Communication.