Inside Google’s Age of Augmented Humanity: Part 2, Changing the Equation in Machine Translation

automated systems for translating Arabic and Chinese records into English, and he entered the software in yearly machine translation “bake-offs” sponsored by DARPA. “I got very good results, and people at Google saw that and said ‘We should invite that guy,’” Och says.

Och was getting his results in part by setting aside the old notion that computers should translate expressions between languages based on rules. In rules-based translation, Och says, “What you write down is dictionaries. This word translates into that. Some words have multiple translations, and based on the context you might have to choose this one or that one. The overall structure might change: the morphology, the extensions, the cases. But you write down the rules for that too. The problem is that language is so enormously complex. It’s not like a computer language like C++ where you can always resolve the ambiguities.”

It was the heady success of British and American cryptographers and cryptanalysts at breaking Japanese and German codes during World War II, Och believes, that set the stage for the early optimism about rules-based translation. “If you look 60 years ago, people said ‘In five years, we’ll have solved that, like we solved cryptography.’” But coming up with rules to capture all the variations in the ways people express things turned out to be a far thornier problem than experts expected. “It didn’t take five years [to start to solve it], it took 60 years,” Och says. “And the way we are doing it is different. For us, it’s a computer science problem, not a linguistics problem.”

The pioneers in statistical machine translation in the 1990s, Och says, came from the field of speech recognition, where it was already clear that it would be easier to bootstrap machine-learning algorithms by feeding them lots of recordings of people actually speaking than to codify all the rules behind speech production.

The more such data researchers have, the faster their systems can learn. “Data changes the equation,” says Och. “The system figures out on its own what is correlated. Because we feed it billions of words, it learns billions of rules. The magic comes from these massive amounts of data.”
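To make the idea of a system that "figures out on its own what is correlated" concrete, here is a minimal sketch of a classic textbook technique, IBM Model 1 word alignment trained with expectation-maximization. It is not Och's or Google's system: the three-sentence German-English corpus and the names `train_model1` and `best_translation` are invented for illustration, and a production system would train on vastly more text with far richer models.

```python
# Toy parallel corpus (hypothetical data, invented for this illustration).
CORPUS = [
    ("das haus", "the house"),
    ("das buch", "the book"),
    ("ein buch", "a book"),
]


def train_model1(pairs, iterations=20):
    """Estimate t[(e, f)], roughly P(target word e | source word f), with EM."""
    pairs = [(src.split(), tgt.split()) for src, tgt in pairs]

    # Start with equal weight for every word pair that shares a sentence pair.
    t = {(e, f): 1.0 for src, tgt in pairs for f in src for e in tgt}

    for _ in range(iterations):
        count = {}  # expected number of times f is translated as e
        total = {}  # expected number of times f is translated at all

        # E-step: split each target word's "credit" among the source words
        # in the same sentence, in proportion to the current estimates.
        for src, tgt in pairs:
            for e in tgt:
                norm = sum(t[(e, f)] for f in src)
                for f in src:
                    frac = t[(e, f)] / norm
                    count[(e, f)] = count.get((e, f), 0.0) + frac
                    total[f] = total.get(f, 0.0) + frac

        # M-step: renormalize the expected counts into probabilities.
        t = {(e, f): c / total[f] for (e, f), c in count.items()}

    return t


def best_translation(t, source_word):
    """Return the target word with the highest estimated probability."""
    candidates = {e: p for (e, f), p in t.items() if f == source_word}
    return max(candidates, key=candidates.get)


table = train_model1(CORPUS)
for word in ["das", "haus", "buch", "ein"]:
    print(word, "->", best_translation(table, word))
```

No dictionary is written down anywhere in this sketch. The pairing of "haus" with "house" emerges only because "das" is better explained by "the" across the other sentences, which is a small-scale version of the effect Och attributes to massive amounts of data.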

But back in 2004, when Google was taking a look at Och’s DARPA bake-off entry, the magic was still slow, limited by his team’s computation budget at USC. “We had a few machines, and the goal was to translate a given test sentence, and it would take a few days to translate just that sentence,” he says. Translating random text? Forget it. “Building a real system would have needed much bigger computational resources. We were CPU-constrained, RAM-constrained.”

But Google wasn’t. Access to the search company’s data centers, Och figured, would advance his project by several years overnight. Then there was Google’s ability to crawl the Web, collecting examples of already-translated texts, which are the key to bootstrapping any statistical machine translation system. But what clinched the deal when Google finally hired Och in early 2004, he says, was the opportunity to work on

Author: Wade Roush

Between 2007 and 2014, I was a staff editor for Xconomy in Boston and San Francisco. Since 2008 I've been writing a weekly opinion/review column called VOX: The Voice of Xperience. (From 2008 to 2013 the column was known as World Wide Wade.) I've been writing about science and technology professionally since 1994. Before joining Xconomy in 2007, I was a staff member at MIT’s Technology Review from 2001 to 2006, serving as senior editor, San Francisco bureau chief, and executive editor of TechnologyReview.com. Before that, I was the Boston bureau reporter for Science, managing editor of supercomputing publications at NASA Ames Research Center, and Web editor at e-book pioneer NuvoMedia. I have a B.A. in the history of science from Harvard College and a PhD in the history and social study of science and technology from MIT. I've published articles in Science, Technology Review, IEEE Spectrum, Encyclopaedia Britannica, Technology and Culture, Alaska Airlines Magazine, and World Business, and I've been a guest of NPR, CNN, CNBC, NECN, WGBH and the PBS NewsHour. I'm a frequent conference participant and enjoy opportunities to moderate panel discussions and on-stage chats. My personal site: waderoush.com