In 1999, the futurist and speech recognition technology pioneer Ray Kurzweil predicted that by 2009, deafness would be a mere inconvenience rather than a disability.
That’s because deaf people would be carrying small machines that would listen to their companions and display real-time text transcripts of their conversations, Kurzweil imagined in his book The Age of Spiritual Machines.
It’s five years past 2009, and we’re not there yet.
No software company has yet offered a product that can deliver highly accurate speech-to-text transcription of multiple voices. Such technology would be a boon not only to the deaf, but also to a host of business customers, such as companies that record their meetings, lawyers who record depositions for court cases, and journalists who publish interviews and quotes. But in the meantime, significant business opportunities remain for next-generation transcription companies such as Berkeley, CA-based TranscribeMe.
Human beings still do most of the professional transcribing work on multiple-voice audio files, but TranscribeMe was founded in 2011 to improve on services provided by the traditional lone worker with a transcription machine at home. The company uses a combination of its own speech recognition software and a network of about 30,000 freelance transcriptionists, with the aim of increasing efficiency while controlling costs, says CEO and co-founder Alexei Dunayev (pictured above).
“That gets us the speed of the computer as well as the quality of people,” Dunayev says.
Although Kurzweil’s forecast was premature, speech recognition software has improved substantially since 1999, making the voice a key element of digital communication. Smartphone users can have short conversations with their digital assistants, and receive text versions of their friends’ voicemail messages, even though the transcriptions are sometimes hilariously off the mark. Plenty of people now speak to their computers rather than type when they’re creating longer text documents, because they use transcription software such as the Dragon products made by Burlington, MA-based Nuance Communications (NASDAQ: NUAN). But those programs must still be “trained” to recognize their owners’ speech patterns, so they can produce accurate text copies.
The greater challenge—and one not yet overcome by software companies—is the transcription of conversations involving two or more speakers. Transcriptions get muddled if a speech-to-text program is confronted with a mixture of different voices, rather than the familiar voice of the software owner alone. “The accuracy will plummet for the non-primary speaker,” says Peter Mahoney, Nuance’s chief marketing officer.
Nuance’s labs are working on the problem. “It certainly is an important area for us to do research on,” Mahoney says. (More on Nuance’s efforts later.) But in the meantime, companies such as TranscribeMe are trying to make the most of what technology can already do.
TranscribeMe can’t transcribe a cocktail party conversation in real time for a deaf guest, but Dunayev says the company can turn out a transcript of a business conference session in about three hours. “Typically it would take three times as long for a single worker,” he says.
Here’s how the TranscribeMe system works: Audio files without a lot of background noise are put through the company’s proprietary speech recognition program to get a first draft as a starting point, Dunayev says. Poor-quality recordings skip that step, because software can’t glean much from them. All audio files submitted by customers are sent to human transcribers who sign into TranscribeMe’s online workroom. But first, the files are split up into many slices only a few minutes long each—and sometimes less than a minute, Dunayev says.
Speed is the first reason for dividing the files up. If many transcribers work at the same time on audio snippets, they can produce a full document faster than a single person tackling the whole file from start to finish. TranscribeMe’s software later stitches the scattered text passages together in the right order. The transcribers, if they like, can choose to work for short stretches of time, rather than committing to complete a lengthy assignment. “We let people monetize their downtime,” Dunayev says.
Confidentiality is the second reason for fracturing the files into small segments, Dunayev says. No transcriber hears a full version of a client’s audio file, he says.
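For readers who want to see the split-and-stitch idea in concrete terms, here is a minimal Python sketch. It is an illustration under stated assumptions, not TranscribeMe's actual software: the names `Slice`, `split_into_slices`, `transcribe_slice`, and `transcribe_recording` are hypothetical, the 60-second slice length is an arbitrary choice, the human transcriber is replaced by a placeholder function, and the speech recognition first draft is left out entirely.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass


@dataclass
class Slice:
    index: int      # position of this slice in the original recording
    start_s: float  # start offset, in seconds
    end_s: float    # end offset, in seconds


def split_into_slices(duration_s: float, slice_len_s: float = 60.0) -> list[Slice]:
    """Cut a recording into consecutive slices of at most slice_len_s seconds."""
    slices: list[Slice] = []
    start = 0.0
    while start < duration_s:
        end = min(start + slice_len_s, duration_s)
        slices.append(Slice(len(slices), start, end))
        start = end
    return slices


def transcribe_slice(s: Slice) -> tuple[int, str]:
    """Hypothetical stand-in for one transcriber working on one slice.

    Each worker sees only its own slice, never the whole recording.
    """
    return s.index, f"[text for {s.start_s:.0f}s-{s.end_s:.0f}s]"


def transcribe_recording(duration_s: float, slice_len_s: float = 60.0,
                         workers: int = 4) -> str:
    """Transcribe slices in parallel, then stitch them back in order."""
    slices = split_into_slices(duration_s, slice_len_s)
    results: list[tuple[int, str]] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(transcribe_slice, s) for s in slices]
        # Slices may finish in any order, just as freelancers would.
        for future in as_completed(futures):
            results.append(future.result())
    # Stitching step: restore the original order using the slice index.
    results.sort(key=lambda pair: pair[0])
    return " ".join(text for _, text in results)


if __name__ == "__main__":
    print(transcribe_recording(150.0, slice_len_s=60.0, workers=3))
```

The slice index is what makes the stitching possible: because each piece carries its position in the recording, the pieces can come back in any order and still be reassembled correctly.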
TranscribeMe aims for an accuracy rate of 98 percent or better, and offers options such as editing to correct speakers’ grammar mistakes or to remove stuttering. The company’s quality assurance staffers do a final review of each transcript. Its customers include lawyers, law enforcement agencies, insurance companies, medical centers, conference attendees, and researchers who do a lot of interviews, Dunayev says. “The real advantage of our model is quick delivery and almost any volume,” he says.
Although thousands of U.S. transcription companies compete in a market that has existed for decades, Dunayev says demand is growing as the amount of audio and video production rises