The Journey to a Machine That Transcribes Speech as Well as Humans

As a student at the elite Tsinghua University in the early 1980s, Xuedong Huang confronted the same challenge as all other Chinese computer users.

“In China, typing was fairly difficult with a Western keyboard,” says Huang, now a Microsoft (NASDAQ: MSFT) distinguished engineer and its chief speech scientist.

To answer that challenge, he helped develop a Chinese dictation prototype, launching him on a three-decade quest that yielded, earlier this fall, a speech recognition system capable of matching human transcriptionists for accuracy.

“Having a natural user interface—that vision always did inspire many people to pursue advanced speech recognition, so I never stopped since 1982,” says Huang, who will speak at Xconomy’s upcoming Intersect event in Seattle on Dec. 8. (See the full agenda and registration details.)

The system he and his Microsoft Research colleagues developed achieved a word error rate of 5.9 percent on the benchmark “Switchboard” task—automatically transcribing hundreds of recorded human-to-human phone conversations. An IBM team previously held the record with a 6.9 percent word error rate, announced earlier this year.
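(Word error rate is the standard yardstick here: the number of word substitutions, insertions, and deletions in a transcript, divided by the number of words in the reference transcript. For the curious, below is a minimal Python sketch of that calculation. It is the textbook metric, not Microsoft's evaluation code, and the sample sentences are made up.)

```python
# Minimal word error rate (WER) sketch: word-level edit distance between
# the hypothesis and the reference, divided by the reference length.
# This illustrates the standard metric, not Microsoft's evaluation code.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn hyp[:j] into ref[:i]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[-1][-1] / len(ref)

# One wrong word out of ten reference words -> 0.10, i.e. 10 percent WER.
print(word_error_rate("the quick brown fox jumps over the lazy dog today",
                      "the quick brown fox jumped over the lazy dog today"))
```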

Huang, pictured above, marvels at the technology advancements and collective effort of the speech research community that reached this “historic milestone.”

“That’s a really big moment,” Huang says. “It’s a celebration of the collective efforts over the last 25 years for everyone in the speech research community, everyone in the speech industry, working together, sharing the knowledge.”

As a journalist, I marvel at this achievement, too. Transcription is a necessary part of my job. After interviewing Huang last week in a quiet conference room at Microsoft Research headquarters in Redmond, WA, I paid careful attention to what I actually do when I listen back to a spontaneous conversation and convert it to text. I rewind certain passages repeatedly trying to decipher what was said through cross-talk or mumbles; pause to look up unfamiliar terms, acronyms, proper names; use my knowledge of context, my understanding of colloquialisms; and adjust to an individual’s accent and patterns of speech. (More on this at the bottom.)

That machines can now do this as well as flesh-and-blood professionals—at least in certain situations—shows just how far we’ve come in giving computers human-like capabilities.

While the Microsoft team achieved a new best for machine transcription, its claim of “human parity” is in part based on a better understanding of actual human performance on the same task. It had previously been thought that the human word error rate was around 4 percent, but the source of that claim was ill-defined. Microsoft had a third-party professional transcription service undertake the same Switchboard task in the course of its normal activities. The humans erred at the same rate as the system, making many of the same kinds of mistakes.

The system that hit the human parity mark for transcription is no ordinary machine, of course. It begins with a hardware layer, codenamed Project Philly, that consists of a massive, distributed computing cluster outfitted with Nvidia graphics processing unit (GPU) chips. (GPUs, originally designed for handling video and gaming, have become workhorses of the artificial intelligence world.)

On top of that is Microsoft’s Cognitive Toolkit, an open source deep learning framework, updated last month, that makes efficient use of all that computing power.
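(To give a flavor of what the toolkit looks like in use, here is a minimal, illustrative sketch of defining a small feed-forward classifier with the Cognitive Toolkit's Python API. The layer sizes and the feature and label dimensions are placeholders, not the configuration Microsoft's team used.)

```python
# Illustrative sketch of a tiny feed-forward classifier in the Cognitive
# Toolkit (CNTK) Python API. Dimensions and layer sizes are placeholders,
# not the acoustic models described in the Microsoft Research paper.
import cntk as C

features = C.input_variable(40)      # e.g., one frame of acoustic features
labels   = C.input_variable(9000)    # e.g., one-hot targets

model = C.layers.Sequential([
    C.layers.Dense(512, activation=C.relu),
    C.layers.Dense(512, activation=C.relu),
    C.layers.Dense(9000)              # unnormalized scores per class
])

z = model(features)
loss = C.cross_entropy_with_softmax(z, labels)   # training criterion
metric = C.classification_error(z, labels)       # frame-level error rate
```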

The next layer is an ensemble of 10 complementary neural network models. Six perform the acoustic evaluation—the work of recognizing the speech—and four focus on word understanding, parsing things like context and punctuation, Huang says.
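(How do ten models produce one answer? One common approach, sketched roughly below in Python, is to combine their frame-level probability estimates, for instance by averaging them. The team's actual combination method is described in their paper and is more sophisticated than this toy version; the model and class counts here are illustrative only.)

```python
# Toy illustration of ensemble combination: average the per-frame
# probability distributions produced by several acoustic models.
# The real system's combination (see the paper) is more involved.
import numpy as np

def combine_acoustic_scores(model_posteriors: list) -> np.ndarray:
    """Average frame-level posteriors from complementary models.

    Each array has shape (num_frames, num_classes) and rows sum to 1.
    """
    stacked = np.stack(model_posteriors)   # (num_models, frames, classes)
    return stacked.mean(axis=0)            # (frames, classes)

# Six hypothetical acoustic models scoring 100 frames over 50 classes.
rng = np.random.default_rng(0)
models = [rng.dirichlet(np.ones(50), size=100) for _ in range(6)]
combined = combine_acoustic_scores(models)
print(combined.shape, combined[0].sum())   # (100, 50), rows still sum to ~1.0
```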

The various models were trained on the Switchboard conversations and several other commonly used conversational datasets, ranging in size from a few hundred thousand words to 191 million words (the University of Washington conversational Web corpus). Huang says the system relies mostly on machine learning to improve its accuracy, but it also includes what he calls “semi-supervised rules.” For example, it has access to a dictionary, which provides word pronunciations.
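(The dictionary he's referring to is essentially a pronunciation lexicon: a hand-built mapping from words to the phoneme sequences they can be spoken as, which the system can consult rather than relearn. Below is a toy illustration in Python; the entries and phone symbols are invented for the example, not taken from any lexicon Microsoft uses.)

```python
# Toy pronunciation lexicon: hand-curated knowledge the system can look up
# instead of relearning. Entries and phone symbols are invented examples.
PRONUNCIATIONS = {
    "speech":  [["S", "P", "IY", "CH"]],
    "read":    [["R", "IY", "D"], ["R", "EH", "D"]],   # multiple pronunciations
    "seattle": [["S", "IY", "AE", "T", "AH", "L"]],
}

def phone_sequences(word: str) -> list:
    """Return the known pronunciations for a word (empty list if unknown)."""
    return PRONUNCIATIONS.get(word.lower(), [])

print(phone_sequences("read"))   # [['R', 'IY', 'D'], ['R', 'EH', 'D']]
```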

“That’s the knowledge people have accumulated, and we find that actually is useful to not relearn everything,” Huang says. “With human knowledge and machine learning combined, that will give us the best performance so far.”

(It should be noted that this is a simplified description of some amazingly complex technology. Here’s a PDF of the paper in which the Microsoft Research team explains their process and results in detail.)

While the Switchboard task is in English, Huang says the system that achieved the human parity milestone is language independent—provided it is trained with enough data. “So whether it’s German or Thai or Chinese, it’s really just amazing how powerful this is,” he says.

He’s careful to note several caveats on this scientific achievement: The system is still

Author: Benjamin Romano

Benjamin is the former Editor of Xconomy Seattle. He has covered the intersections of business, technology and the environment in the Pacific Northwest and beyond for more than a decade. At The Seattle Times he was the lead beat reporter covering Microsoft during Bill Gates’ transition from business to philanthropy. He also covered Seattle venture capital and biotech. Most recently, Benjamin followed the technology, finance and policies driving renewable energy development in the Western US for Recharge, a global trade publication. He has a bachelor’s degree from the University of Oregon School of Journalism and Communication.