automated systems for translating Arabic and Chinese records into English, and he entered the software in yearly machine translation “bake-offs” sponsored by DARPA. “I got very good results, and people at Google saw that and said ‘We should invite that guy,'” Och says.
Och was getting his results in part by setting aside the old notion that computers should translate expressions between languages based on rules. In rules-based translation, Och says, “What you write down is dictionaries. This word translates into that. Some words have multiple translations, and based on the context you might have to choose this one or that one. The overall structure might change: the morphology, the extensions, the cases. But you write down the rules for that too. The problem is that language is so enormously complex. It’s not like a computer language like C++ where you can always resolve the ambiguities.”
It was the heady success of British and American cryptographers and cryptanalysts at breaking Japanese and German codes during World War II, Och believes, that set the stage for the early optimism about rule-based translation. “If you look 60 years ago, people said ‘In five years, we’ll have solved that, like we solved cryptography.'” But coming up with rules to capture all the variations in the ways people express things turned out to be a far thornier problem than experts expected. “It didn’t take five years [to start to solve it], it took 60 years,” Och says. “And the way we are doing it is different. For us, it’s a computer science problem, not a linguistics problem.”
The pioneers in statistical machine translation in the 1990s, Och says, came from the field of speech recognition, where it was already clear that it would be easier to bootstrap machine-learning algorithms by feeding them lots of recordings of people actually speaking than to codify all the rules behind speech production.
The more such data researchers have, the faster their systems can learn. “Data changes the equation,” says Och. “The system figures out on its own what is correlated. Because we feed it billions of words, it learns billions of rules. The magic comes from these massive amounts of data.”
But back in 2004, when Google was taking a look at Och’s DARPA bake-off entry, the magic was still slow, limited by his team’s computation budget at USC. “We had a few machines, and the goal was to translate a given test sentence, and it would take a few days to translate just that sentence,” he says. Translating random text? Forget it. “Building a real system would have needed much bigger computational resources. We were CPU-constrained, RAM-constrained.”
But Google wasn’t. Access to the search company’s data centers, Och figured, would advance his project by a matter of several years overnight. Then there was Google’s ability to crawl the Web, collecting examples of already-translated texts—which are the key to bootstrapping any statistical machine translation system. But what clinched the deal when Google finally hired Och in early 2004, he says, was the opportunity to work on