The Journey to a Machine That Transcribes Speech as Well as Humans

The system that achieved this milestone was very expensive, benefiting from essentially unlimited computing resources and more than 20 years of Microsoft research and development. The team spent more than a year focused on the Switchboard task to reach human parity. And it doesn’t transcribe speech in real time.

There’s still lots of work to do to bring this capability from the realm of research to a production system that could improve Microsoft products like Xbox and Cortana.

Meanwhile, Huang is confident that hardware advances will continue. “Don’t worry,” he says, holding up his iPhone and describing the increasingly powerful computers available in the cloud. Beginning in December, Microsoft will offer GPU-powered machines in its Azure cloud computing service. “You have a cloud and client working together. So that trend will not stop,” he says.

The next big scientific challenge is tackling the “cocktail party” problem. Computers still struggle to capture speech in settings that have multiple speakers who may be far from the microphone; echoes and background noise such as a television or music; and other complications.

“Humans have no problem,” Huang says. “They can just adapt, zero in, and intelligently have a good understanding. Even the best Microsoft human parity system performs badly with that kind of open environment.”

He says improvements in microphone technology will help, noting devices such as the Amazon Echo, with its seven microphones that can pinpoint a distant speaker.

Over Thanksgiving, my family had a great time trading knock-knock jokes with Alexa, the intelligence underlying the Amazon device. It was competent, if inconsistent, even in a crowded kitchen.

Huang emphasizes the difference between the Switchboard task—the transcription of conversations between human strangers on an assigned topic—and other computer speech contexts. “When you talk to a computer, you know you are talking to a computer so the way you articulate is different,” Huang says.

Apart from hardware, Huang says improvements in natural language understanding will help solve the cocktail party problem. Humans can better understand the signal in the noise, thanks to common sense and contextual knowledge. But machine understanding “is far from being solved,” Huang says.

So when will that milestone be reached? Reflecting on the journey to the transcription milestone, Huang says he “totally underestimated” the advancements that would be made over the course of his career. He paraphrases a famous Bill Gates quote: “He thought most people… overestimate what they can achieve in a year and underestimate what the community can achieve in 10 years. So, that can be applied to my own prediction. I’m not going to predict what’s going to happen, but it’s just phenomenal.”

How One Human Transcribes

In my career as a journalist, I’ve spent uncounted hours—lost days, perhaps weeks of my life—transcribing. I record many of my interviews and then go about the time-consuming process of turning the spoken words into text. The goal is an error rate of zero. We’re trying not to misquote people here, which is a big impetus for recording interviews in the first place, rather than typing or writing notes in real time.

I’d never tracked exactly how time-consuming transcription is. That’s in part because it’s not something I do as a defined activity, separate from writing the story. For me, it’s part of the writing process. I often stop transcribing to fit a fact or a quote into my draft as I encounter it, rather than transcribing the whole thing and then going back to pick out the bits that will actually appear in the story.

For my interview with computer speech expert Xuedong Huang, which ran just shy of 32 minutes, I kept a running stopwatch to measure how long the transcription took: one hour, 23 minutes. Put another way, each minute of conversation took me about 2.6 minutes to transcribe. I consider myself a fast typist, but I regularly had to rewind the recording to make sure I heard a word or phrase correctly. This interview took place in a quiet meeting room in Building 99, headquarters of Microsoft Research in Redmond, WA. There were still patches that were difficult to hear on the recording—when both Huang and I were talking simultaneously, for example—or difficult for me to understand, such as when Huang used acronyms, proper names, and other linguistic and computer science terms that were new to me.

I broke the interview up into 12 segments to make it more manageable. Even for an interesting interview, transcription is a tedious process; breaks are necessary, interruptions inevitable. This presented an opportunity to measure whether my transcription speed increased over the course of the interview. I assumed it would as I gained experience, training my own neural network with some high-quality data, accumulating valuable context from earlier in the interview, and adjusting to new vocabulary and to Huang’s accent and speaking rhythm. By the end of the interview, a minute of conversation took only 2.3 minutes to transcribe.
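For readers who want to check the arithmetic, here is a minimal sketch in Python of the ratios above. Only the aggregate figures from the text are used—the 32-minute interview, the 83-minute total, and the 2.3 ratio at the end—since the per-segment times aren’t given; the “savings” line is just an illustration of what that improvement works out to.

```python
# Back-of-the-envelope check of the transcription ratios described above.
interview_minutes = 32            # interview ran just shy of 32 minutes
transcription_minutes = 60 + 23   # one hour, 23 minutes on the stopwatch

overall_ratio = transcription_minutes / interview_minutes
print(f"Overall: {overall_ratio:.1f} minutes of transcription per minute of audio")
# -> Overall: 2.6 minutes of transcription per minute of audio

# By the final segments the ratio had dropped to about 2.3, i.e. roughly
# 18 seconds of work saved for every minute of recorded speech.
final_ratio = 2.3
savings_seconds = (overall_ratio - final_ratio) * 60
print(f"Improvement: about {savings_seconds:.0f} seconds saved per minute of audio")
```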

That said, a production system capable of accurately transcribing spontaneous human-to-human conversations can’t come soon enough.

Author: Benjamin Romano

Benjamin is the former Editor of Xconomy Seattle. He has covered the intersections of business, technology and the environment in the Pacific Northwest and beyond for more than a decade. At The Seattle Times he was the lead beat reporter covering Microsoft during Bill Gates’ transition from business to philanthropy. He also covered Seattle venture capital and biotech. Most recently, Benjamin followed the technology, finance and policies driving renewable energy development in the Western US for Recharge, a global trade publication. He has a bachelor’s degree from the University of Oregon School of Journalism and Communication.