In Search of a State-of-the-Art Transcription App

Since reporting on the world’s most capable machine transcription system earlier this year, I’ve been pining for one that would help me with the tedious work of transcribing recorded journalistic interviews.

Not long after I wrote about Microsoft’s speech transcription milestone—researchers at the company built a system that matches the human error rate in transcribing spontaneous conversations—Kelly Altom, a contract program manager working with the Microsoft Translator team, alerted me to an iPhone app he developed called BleuText that he says will “reduce your transcription time drastically.”

As I look ahead to the new year, and the countless hours of transcription stretching out before me, I’m eager to put this technology into practice. I’ve played around with BleuText a bit this week, and while it’s good at transcribing a single voice speaking in a quiet room, the challenge of creating a usable machine transcription of multiple voices in conversation remains. Cross-talk, echoes, and background noise can throw off even the best machine transcription systems.

Altom, who helps customers use the Microsoft Translator API by creating sample applications (but does not speak for Microsoft), built the app to highlight what the technology can do. He’s planning periodic updates to improve performance and add features, and intends to make a version for OS X.

Translator is one of two dozen machine learning APIs Microsoft is selling to customers. These Cognitive Services allow developers to easily imbue their apps with capabilities such as facial recognition, complex text analysis, recommendations, and search.

Amazon and Google are offering their versions of many of these services, too, as are smaller players, who make them available to developers over marketplaces such as Seattle-based Algorithmia. The provision of these services is what passes for the democratization of artificial intelligence technologies.

The tech superpowers, with their legions of computer scientists pushing the frontiers of machine intelligence, make available to anyone the kind of capabilities that were confined to the biggest corporations and governments not so long ago. These capabilities create more hooks into their cloud computing services—large, fast-growing, and fiercely competitive areas of business for the likes of Microsoft (NASDAQ: [[ticker:MSFT]]), Amazon (NASDAQ: [[ticker:AMZN]]), and Google (NASDAQ: [[ticker:GOOG]]).

It is the availability of on-demand, amazingly powerful computing resources in the cloud, along with massive stores of training data to help transcription and translation systems improve, that makes possible something like BleuText.

The BleuText app. Courtesy of Kelly Altom

The app streams audio (.WAV files) recorded on an iPhone or iPad to the cloud-based Microsoft Translator Service, which connects to a version of the Bing Speech API, which transcribes the speech to text. Another technology filters out filler words such as “uh” and “um” and repeated words. The translator service then translates the resulting text to more than 60 languages, streaming the results back to the device.
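To make the filtering step concrete, here is a minimal sketch in Python of how filler words and immediately repeated words might be stripped from a raw transcript. It is an illustration only, not the technology Microsoft or BleuText actually uses: the function name `clean_transcript` and the `FILLERS` list are hypothetical, and the list holds only the two examples named above.

```python
import re

# Hypothetical filler list; only "uh" and "um" are named in the article.
FILLERS = {"uh", "um"}


def _normalize(token: str) -> str:
    """Lowercase a token and strip punctuation so "Um," compares equal to "um"."""
    return re.sub(r"[^\w']", "", token).lower()


def clean_transcript(text: str) -> str:
    """Drop filler words and collapse immediately repeated words."""
    cleaned = []
    for token in text.split():
        key = _normalize(token)
        if key in FILLERS:
            continue  # skip "uh", "um"
        if cleaned and key and key == _normalize(cleaned[-1]):
            continue  # skip a word that repeats the previous kept word
        cleaned.append(token)
    return " ".join(cleaned)


print(clean_transcript("I um I think the the app uh works"))
```

A rule this simple would also collapse intentional repetitions ("that that") and discard punctuation attached to a dropped word, which is one reason a production system would presumably rely on something more sophisticated.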

It’s a powerful tool, to be sure, but it’s not quite at the level of the machine transcription system that a Microsoft Research team built to set a new accuracy record. That system used an ensemble of 10 complementary neural network models to perform acoustic evaluation and word understanding, and ran atop essentially unlimited computing resources.

BleuText—which is available for free download in the App Store through January and will likely be sold on a subscription basis after that—can handle recordings up to two hours in length, Altom says. The speech is recorded using an iPad or iPhone microphone, though Altom says he’s had even better results using external microphones, particularly in noisy spaces.

Early adopters are using both the transcription and translation services for things like recording interviews during legal investigations (with the transcripts later reviewed and corrected by humans) and for communicating with non-English-speaking renters, Altom says. Earlier this month, Microsoft debuted a live translation service.

If you’re a journalist or someone who does a lot of your own transcription or translation, tools like BleuText, and the APIs on which it depends, are a welcome development—though I know it would require some adjustments to my usual workflow to implement efficiently. But what if you’re a professional transcriptionist? Is this technology poised to eliminate the need for your human labor?

In 2017, as more jobs are replaced by automation, expect to hear more questions along these lines. Amazon, for example, reported some 45,000 robots at work across 20 fulfillment centers, though we don’t know yet whether the growth of its robot workforce—up 50 percent from a year ago, as The Seattle Times reports—outpaced the growth of its human workforce.

Altom doesn’t think we’re ready to take humans out of the loop in translation and transcription. At least not yet.

“I don’t think we will ever replace human transcribers or translators completely,” he says via email. “Too much depends on context and complexity of language. Legal proceedings is a good example of complexity and context.”

With the rapid pace of progress in areas such as deep neural networks and graphics processing chips—which have proven well-suited to handling complex machine learning workloads—Altom thinks that within four or five years, most transcription and translation will be performed by technology, with humans called in to handle difficult situations or for quality control. “I see that most of the established transcription companies are using, or experimenting with some form of speech-to-text technology,” he says.

Employment data gives a glimpse of the kinds of jobs that require transcription, and therefore might be affected. There were an estimated 57,830 medical transcriptionists employed in the U.S. in May 2015, according to the latest Department of Labor Occupational Employment Statistics, earning a median annual wage of $34,890. Millions of other jobs involve at least some transcription in occupations such as word processors and typists, office clerks and secretaries, and court reporters and legal aides.

Feature image credit: Typing, photo by flickr user Sebastien Wiertz used under a Creative Commons license.

Author: Benjamin Romano

Benjamin is the former Editor of Xconomy Seattle. He has covered the intersections of business, technology and the environment in the Pacific Northwest and beyond for more than a decade. At The Seattle Times he was the lead beat reporter covering Microsoft during Bill Gates’ transition from business to philanthropy. He also covered Seattle venture capital and biotech. Most recently, Benjamin followed the technology, finance and policies driving renewable energy development in the Western US for Recharge, a global trade publication. He has a bachelor’s degree from the University of Oregon School of Journalism and Communication.