Microsoft researchers have been steadily improving the accuracy of their speech recognition technology—a necessary precursor to many other artificial intelligence applications—achieving parity with human performance on a benchmark task called Switchboard last year.
Since then, the bar for human parity has moved. Another research group improved the human word error rate to 5.1 percent by using multiple human transcribers. “This was consistent with prior research that showed that humans achieve higher levels of agreement on the precise words spoken as they expend more care and effort,” writes Xuedong Huang, Microsoft technical fellow and leader of the company’s speech and dialogue group.
Now, Microsoft’s technology has matched that 5.1 percent word error rate.
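For readers unfamiliar with the metric, word error rate is the fraction of words a transcriber gets wrong—substitutions, deletions, and insertions—measured against a reference transcript. Here is a minimal Python sketch of that standard calculation, with made-up example sentences; it illustrates the metric itself, not the evaluation code used in the Switchboard benchmark.

```python
def word_error_rate(reference, hypothesis):
    """Standard WER: (substitutions + deletions + insertions) / reference length,
    computed with word-level Levenshtein edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One wrong word out of four gives a 25 percent error rate; roughly one
# error per 20 reference words corresponds to the 5 percent range.
print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```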
“We reduced our error rate by about 12 percent compared to last year’s accuracy level, using a series of improvements to our neural net-based acoustic and language models,” Huang writes, announcing the milestone. “We introduced an additional CNN-BLSTM (convolutional neural network combined with bidirectional long-short-term memory) model for improved acoustic modeling. Additionally, our approach to combine predictions from multiple acoustic models now does so at both the frame/senone and word levels.”
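For a sense of what a CNN-BLSTM acoustic model looks like in practice, here is a minimal sketch in Python with PyTorch. The layer sizes, pooling choices, and number of senone classes are hypothetical placeholders, not details of Microsoft’s system; the point is simply the shape of the architecture Huang describes—a convolutional front end over audio features feeding a bidirectional LSTM that scores each frame.

```python
import torch
import torch.nn as nn

class CNNBLSTMAcousticModel(nn.Module):
    """Toy CNN-BLSTM acoustic model: convolutions over log-mel features,
    a bidirectional LSTM over time, and a per-frame senone classifier.
    All dimensions here are illustrative, not Microsoft's."""
    def __init__(self, n_mels=40, n_senones=9000, hidden=512):
        super().__init__()
        # Convolutional front end over the (frames, frequency) spectrogram.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),  # pool along frequency only
        )
        self.blstm = nn.LSTM(32 * (n_mels // 2), hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_senones)

    def forward(self, feats):                    # feats: (batch, frames, n_mels)
        x = feats.unsqueeze(1)                   # (batch, 1, frames, n_mels)
        x = self.conv(x)                         # (batch, 32, frames, n_mels//2)
        x = x.permute(0, 2, 1, 3).flatten(2)     # (batch, frames, 32 * n_mels//2)
        x, _ = self.blstm(x)                     # (batch, frames, 2 * hidden)
        return self.classifier(x)                # per-frame senone scores

model = CNNBLSTMAcousticModel()
scores = model(torch.randn(8, 200, 40))          # 8 utterances, 200 frames each
print(scores.shape)                              # torch.Size([8, 200, 9000])
```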
I profiled Huang and his team’s efforts last year as they marked the previous milestone toward human parity on this foundational artificial intelligence task. As part of my reporting, I paid careful attention to what I do as a human when I transcribe a recorded interview. It’s a painstaking process: it takes me about 2.6 minutes to accurately transcribe each minute of recorded conversation. Automating that work could save countless hours.
Huang says his team improved the recognizer’s language model by using “the entire history of a dialog session to predict what is likely to come next, effectively allowing the model to adapt to the topic and local context of a conversation.”
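Huang doesn’t spell out how the model uses that history, but a simple way to picture this kind of adaptation is a cache-style language model that blends a fixed background model with word counts accumulated over the session. The sketch below is a toy illustration of that general idea, with made-up vocabulary and interpolation weight, not Microsoft’s approach.

```python
from collections import Counter

class SessionAdaptedLM:
    """Toy cache-style language model: interpolates a fixed background
    unigram model with counts gathered from the dialog session so far,
    so words already used in the conversation become more likely."""
    def __init__(self, background_counts, cache_weight=0.2):
        total = sum(background_counts.values())
        self.background = {w: c / total for w, c in background_counts.items()}
        self.cache = Counter()
        self.cache_weight = cache_weight

    def observe(self, utterance):
        # Fold each recognized utterance into the session cache.
        self.cache.update(utterance.split())

    def prob(self, word):
        cache_total = sum(self.cache.values())
        p_cache = self.cache[word] / cache_total if cache_total else 0.0
        p_background = self.background.get(word, 1e-6)
        return (1 - self.cache_weight) * p_background + self.cache_weight * p_cache

lm = SessionAdaptedLM({"the": 50, "of": 30, "model": 5, "switchboard": 1})
print(lm.prob("switchboard"))   # rare under the background model
lm.observe("we evaluated on the switchboard benchmark")
print(lm.prob("switchboard"))   # boosted after it appears in the session
```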
That’s something human transcriptionists do, too. My transcription speed improves as I get further into a conversation and adjust to new vocabulary, colloquialisms, and accents.
Other improvements came purely from the realm of machines. Huang says the speech recognition team used the Microsoft Cognitive Toolkit to design and optimize models, and graphics processing units now available in Microsoft Azure to run those models.
Major cloud computing providers including Amazon (NASDAQ: [[ticker:AMZN]]) and Google (NASDAQ: [[ticker:GOOGL]]) are racing with Microsoft (NASDAQ: [[ticker:MSFT]]) and other players to provide the best platforms and tools for artificial intelligence tasks.
The speech recognition technology matters because it’s an important building block for a broad range of A.I. applications, particularly those in which machines interact with humans. Huang demonstrated one such application during a keynote address at last year’s Xconomy Intersect tech conference in Seattle: his spoken words were displayed in real time as closed captions, making it easier to follow along with the presentation. That capability is now part of a product called Presentation Translator.
Recognizing speech in a controlled setting, such as the recorded telephone conversations that make up the Switchboard task, is just one step, however. Huang notes that challenges remain in “noisy environments with distant microphones, in recognizing accented speech, or speaking styles and languages for which only limited training data is available.”
Meanwhile, natural language understanding—the ability of a computer to extract meaning from blocks of speech or text—remains an unsolved problem.
“Moving from recognizing to understanding speech is the next major frontier for speech technology,” writes Huang, who has been working on the technology in one way or another since the early 1980s.