the onscreen keyboard is really hard to use,” Fleming says. The company can also track that sentiment over time. Adoring blog commentary about iPhone apps, for example, spiked after July 11, when Apple introduced new iPhone software that lets users download third-party applications.
Fleming, the former CEO of another Cambridge, MA-based Web analytics company called Icosystem, says Crimson Hexagon is already doing consulting work with customers from a range of industries, including computer hardware, mobile phones, travel, finance, advertising, and international aid. “So we’re finding that this is applicable in a lot of different ways,” she says. “The key thing all of these people have in common is that they care and they want to know what people are saying about them online.”
But how does Crimson Hexagon sort through the enormous mess of material added to the blogosphere every day to find nuggets of opinion about specific brands—and more importantly, how does it sort these opinions into the buckets that interest its clients? That’s where ReadMe comes in. King says the idea for the algorithm crystallized from two initially unrelated projects he was working on at Harvard.
The first was a project to track opinion about the 2008 presidential candidates, back before the primaries, when there were quite a few of them. “We figured out how to find and download the information in all the political blogs, but when we tried all of the standard computer science approaches to classifying them into the categories we were interested in, they were a disaster,” King recounts. “Some of them would work as much as 60 or 70 percent of the time—which means, of course, that 30 or 40 percent of the time you’d get a completely wrong answer. We tried method after method, and they were all failing.”
At the same time, King says, he was helping the World Health Organization tackle the problem of obtaining accurate mortality data in developing countries, where autopsies and death certificates are rare. “They had a way of doing this by surveying the relatives of people who had recently died, and asking them a series of uncomfortable questions like ‘Were they bleeding from the mouth?’ and ‘Did they have a stomach ache?’ And then they’d show the answers to physicians, who would decide what the cause of death was. They called this a verbal autopsy. The problem was that if you showed this data to more than one MD, they would never agree. We found a way to automate the sorting that worked much better than showing the data to physicians, and the World Health Organization has now implemented this all over the world.”
At a certain point, says King, “I realized that the mathematics underlying the two problems was equivalent.” The same method that he had used to reach nearly 100 percent accuracy in sorting the verbal autopsy data, in other words, could also be used to accurately categorize opinions about political candidates.
What is that method? That’s the part that’s a little bit like magic. And to understand it, it might help you to recall the debate that raged during the 2000 U.S. Census about “statistical adjustment” versus “direct enumeration.” Statisticians at the Census Bureau argued that traditional methods of finding and interviewing Americans during the decennial census inevitably overcount some groups and undercount others, and that greater accuracy could be achieved by conducting a separate, large sampling survey, then using its results to adjust the traditional count.
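What that kind of adjustment looks like for text is easier to see in miniature. Below is a toy Python sketch of the general idea King and his collaborators have described for ReadMe: rather than classifying each blog post one by one, estimate what fraction of posts fall into each opinion category, using a small hand-coded sample the way the Census Bureau's statisticians proposed using a separate survey. Everything here is invented for illustration, and this is not Crimson Hexagon's actual code; the published method is considerably more sophisticated (among other things, it works with many random subsets of word stems and averages the results).

```python
# Hypothetical sketch of proportion estimation for text. Each document is
# reduced to a binary word-presence "profile"; the identity
#     P(profile) = sum over categories of P(profile | category) * P(category)
# is then solved for the category proportions P(category), using a small
# hand-coded sample to estimate P(profile | category).

from itertools import product

import numpy as np
from scipy.optimize import nnls

def word_profiles(docs, vocab):
    """Reduce each document to a tuple of 0/1 word-presence indicators."""
    return [tuple(int(w in doc.split()) for w in vocab) for doc in docs]

def profile_distribution(profiles, space):
    """Empirical distribution of profiles over the enumerated profile space."""
    counts = np.array([profiles.count(p) for p in space], dtype=float)
    return counts / counts.sum()

# Invented example data: a small hand-coded sample plus an unlabeled corpus.
vocab = ["love", "hate", "keyboard"]
labeled = [("love this phone", "pos"), ("hate the keyboard", "neg"),
           ("love love love", "pos"), ("hate hate it", "neg")]
unlabeled = ["love it", "hate the keyboard", "love this", "love the keyboard"]

categories = sorted({c for _, c in labeled})
space = list(product([0, 1], repeat=len(vocab)))  # all 2^K possible profiles

# Columns of A: P(profile | category), estimated from the hand-coded sample.
A = np.column_stack([
    profile_distribution(
        word_profiles([d for d, c in labeled if c == cat], vocab), space)
    for cat in categories
])
# b: P(profile), measured directly on the whole unlabeled corpus.
b = profile_distribution(word_profiles(unlabeled, vocab), space)

# Solve A @ pi ~= b with pi >= 0, then renormalize so proportions sum to 1.
pi, _ = nnls(A, b)
pi /= pi.sum()
print({c: round(float(p), 2) for c, p in zip(categories, pi)})
# e.g. {'neg': 0.33, 'pos': 0.67}
```

The part that echoes the census debate is the role of the hand-coded sample: like the statisticians' adjustment survey, a small amount of carefully gathered ground truth corrects an aggregate count that would otherwise be skewed by unreliable case-by-case judgments.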
Politicians in Congress—some of whom stood to lose their seats to redistricting if the adjusted census found larger minority populations than expected—balked at this proposal. And to the dismay of statisticians, the Supreme Court ultimately declared that