In late 2010, the Nieman Journalism Lab surveyed reporters for their predictions about what 2011 would bring for the future of journalism. My favorite prediction came from Matt Thompson, an editorial product manager at National Public Radio and a widely respected evangelist for digital journalism. (I happened to meet Thompson in person around the same time at News Foo, a future-of-news conference in Phoenix sponsored by Google and O’Reilly Media. He’s an amazing guy, brimming with about a dozen great ideas per minute.)
Anyway, his prediction was this: Soon—perhaps not in 2011, but in the near future—automatic speech recognition and transcription services would become “fast, free, and decent.” In a jocular reference to the Singularity, Ray Kurzweil’s label for the moment when we’ll be able to upload our minds to computers and live forever, Thompson called the arrival of free and accurate speech transcription “the Speakularity.” He predicted it would be a watershed moment for journalism, since it would mean that all the verbal information reporters collect—everything said in interviews, press conferences, courtrooms, city council meetings, broadcasts, and the like—could easily be turned into text and made searchable.
“Obscure city meetings could be recorded and auto-transcribed,” Thompson wrote. “Interviews could be published nearly instantly as Q&As; journalists covering events could focus their attention on analyzing rather than capturing the proceedings. Because text is much more scannable than audio, recordings automatically indexed to a transcript would be much quicker to search through and edit.”
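To make the indexing idea concrete, here is a minimal sketch in Python. It assumes a transcription service has already produced word-level timestamps (the data and format below are invented for illustration); once text and timing are paired, finding a moment in hours of audio becomes an ordinary text search that hands back the exact offset in the recording.

```python
# Minimal sketch: a transcript indexed to its recording by word-level
# timestamps. The format and the data are invented for illustration;
# real transcription services expose timing in their own formats.

transcript = [
    # (seconds into the recording, word)
    (0.0, "the"), (0.4, "council"), (0.9, "voted"),
    (1.5, "to"), (1.7, "approve"), (2.3, "the"),
    (2.5, "rezoning"), (3.2, "measure"),
]

def find_in_audio(phrase):
    """Return the offsets (in seconds) where a phrase begins."""
    words = phrase.lower().split()
    hits = []
    for i in range(len(transcript) - len(words) + 1):
        window = [w for _, w in transcript[i:i + len(words)]]
        if window == words:
            hits.append(transcript[i][0])
    return hits

# Instead of scrubbing through audio, search the text and jump straight
# to the right moment in the recording.
print(find_in_audio("voted to approve"))  # -> [0.9]
```

The point is the data structure, not any particular service: any transcriber that emits timing information supports this kind of lookup.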
The implications are obviously immense. But what excited me personally about Thompson’s concept was the prospect that, as a reporter, I might finally be able to start thinking more about the content of my interviews (analyzing) and less about taking notes (capturing).
Not that I have a problem taking notes. If I had to reveal my superpower, it’s this: I can type extremely fast. I’m talking Clark Kent fast. So fast that I walk away from most interviews with a verbatim transcript. There are always typos in the text, but nothing that can’t be easily deciphered.
It’s a great skill to have, because it means I don’t have to record interviews and waste time transcribing them later. But it comes at a cost. If I’m transcribing during an interview, my brain is running three separate operations at once. First, I’m typing whatever the speaker said a few seconds ago; to use a computational analogy, you might say my finger movements on the keyboard are drawing from the bottom of my short-term memory buffer. Second, my ears are listening to the speaker’s words in the current moment and adding them to the top of the buffer. Third, I’m trying to comprehend the content and think ahead to the next question I want to ask.
This procedure usually works fine, but it’s exhausting. And if I stop concentrating for even a second, I suffer from buffer overflow, which is just as disastrous for me as it is for a computer program. With automatic and accurate speech transcription, I’d be able to dispense with all the typing and focus fully on the interviewee and their ideas, which would be heavenly.
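For what it’s worth, the analogy is fairly literal. Here is a toy sketch in Python of the same routine, modeling short-term memory as a bounded first-in, first-out queue (the capacity is invented); when listening outpaces typing for too long, the result looks just like the overflow described above.

```python
from collections import deque

# Toy model of a note-taker's short-term memory as a bounded FIFO buffer.
# The capacity of seven items is an invented, illustrative figure.
BUFFER_CAPACITY = 7
buffer = deque()

def hear(word):
    """Listening: new words land on the top of the buffer."""
    if len(buffer) >= BUFFER_CAPACITY:
        raise OverflowError("lost the thread: incoming words dropped")
    buffer.append(word)

def type_word():
    """Typing: fingers drain words from the bottom of the buffer."""
    return buffer.popleft() if buffer else None

# Normal operation: hearing and typing roughly keep pace.
for word in "the mayor said the budget is balanced".split():
    hear(word)
    type_word()

# A lapse in concentration: words keep arriving but nothing gets typed.
try:
    for word in "but only if we count the pension fund twice this year".split():
        hear(word)
except OverflowError as err:
    print(err)
```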
So, how far off is the Speakularity? The idea itself is not nearly as outlandish as the Singularity (which still has plenty of skeptics, even within the irrationally optimistic population of startup entrepreneurs). Continuous dictation software has been available since the 1990s from companies like