the official population of the United States is the number of people physically enumerated by census interviewers.
“If you cared about the real number of people in the country, you would want to use the statistical approach,” says King. But at Crimson Hexagon, he says, “we are not bound by an arbitrary legal ruling. We care about the actual fraction of blog opinions in a particular category.” That means the company is free to use statistical methods to adjust its actual counts—which, as it turns out, hugely simplifies the problem of categorizing large numbers of opinions, whether they’re about presidential candidates, causes of mortality, or digital cameras.
Like the census statisticians’ proposal, ReadMe is all about correcting errors in the original data. Using standard search algorithms, Crimson Hexagon can sort blog posts into categories with about 80 percent accuracy. That sounds good, but it wouldn’t be useful to the company’s clients, since at that level of error some categories could be off by 50 percent or more. “But suppose,” says King, “that you knew that 10 percent of the documents in Category One should actually be in Category Three. That might not help you classify individual documents better, but since we’re not interested in that, it doesn’t matter. You just subtract that many from Category One and add them to Category Three. It gets you the right answer, proportionally.” In other words, you can stick with an 80-percent accurate categorization scheme, but still get a nearly 100-percent accurate sense of the proportions among the categories. Or as King rephrases the point: “Classifying every needle in the haystack is very difficult, but we can still classify the whole haystack.”
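King’s subtract-and-add bookkeeping can be sketched in a few lines of Python. This is only an illustration of the arithmetic he describes, not Crimson Hexagon’s code: the function name `correct_counts`, the category counts, and the data layout are all assumptions made up for the example.

```python
def correct_counts(raw_counts, misfiled_share):
    """Shift a known share of each category's documents to where they belong.

    raw_counts:     {category: number of documents the classifier filed there}
    misfiled_share: {(filed_under, belongs_in): fraction of the documents
                     filed under `filed_under` that belong in `belongs_in`}
    Both structures are hypothetical; the article does not describe a format.
    """
    corrected = dict(raw_counts)
    for (filed_under, belongs_in), share in misfiled_share.items():
        moved = raw_counts[filed_under] * share
        corrected[filed_under] -= moved
        corrected[belongs_in] = corrected.get(belongs_in, 0) + moved
    return corrected


# Made-up counts for 1,000 blog posts sorted by an imperfect classifier.
raw = {"Category One": 500, "Category Two": 300, "Category Three": 200}

# King's example: 10 percent of what landed in Category One belongs in Three.
fixed = correct_counts(raw, {("Category One", "Category Three"): 0.10})

for category, count in fixed.items():
    print(category, round(count))
```

With these invented numbers, Category One falls from 500 to 450 and Category Three rises from 200 to 250. No individual post has been reclassified, yet the totals now reflect the known error.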
(The sizes of the errors in each category, of course, are critical. The company determines them by having human coders create a “training set” by initially classifying, say, 100 blog posts into categories. “Then you run your statistical method on those same documents and see how well it does,” says King. “It may be doing really well in Category Seven, but 10 percent of the documents that should be in Category Three get put into Category One. Then we know what the correction is” for the whole data set.)
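One textbook way to turn a hand-coded training set into a correction is to measure how often each true category gets filed under each machine category, then solve for the true proportions that would produce the observed ones. The sketch below assumes that approach; the article does not say this is ReadMe’s actual procedure, and the function names, the 30-post training set, and the observed shares are all invented for illustration.

```python
from collections import Counter


def misclassification_rates(hand_labels, machine_labels, categories):
    """rates[i][j] = share of posts truly in categories[j] that the
    classifier files under categories[i]. Assumes every category appears
    at least once among the hand labels."""
    pairs = Counter(zip(machine_labels, hand_labels))
    true_totals = Counter(hand_labels)
    return [[pairs[(filed, true)] / true_totals[true] for true in categories]
            for filed in categories]


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


categories = ["One", "Three", "Seven"]

# Hypothetical training set: 30 posts coded by hand, then by the classifier.
hand = ["One"] * 10 + ["Three"] * 10 + ["Seven"] * 10
machine = ["One"] * 10 + ["One"] + ["Three"] * 9 + ["Seven"] * 10

rates = misclassification_rates(hand, machine, categories)

# Made-up category shares reported by the classifier on the full data set.
observed = [0.35, 0.45, 0.20]

# observed = rates * true, so solving the system recovers the true shares.
estimate = solve(rates, observed)
for name, share in zip(categories, estimate):
    print(name, round(share, 3))
```

In this toy training set, one of the ten true Category Three posts lands in Category One, matching King’s 10 percent example. Solving the system shifts the observed shares of 35, 45, and 20 percent to roughly 30, 50, and 20 percent.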
Undoubtedly, that’s all an oversimplification, but you probably get the gist. Fleming says Crimson Hexagon’s clients put its findings to a variety of profitable uses. She told me a story about “a major networking hardware company” (it’s not hard to read between the lines about which company this might be) that asked Crimson Hexagon to monitor general opinion in the blogosphere about whether its stock was overpriced or underpriced. The company was particularly interested in how bloggers would react to a negative earnings announcement it planned to make. “It shocked them to see that there was a little blip in opinion, but it wasn’t significant,” says Fleming. “A PR campaign they had launched two days prior on a new product they were releasing was completely overshadowing the negative reaction from the quarterly earnings. Two direct actions came out of that. One was that they decided not to do this damage control campaign they were about to launch, because it looked like the impact from the negative earnings was not as bad as they thought. And they also had no idea how much reaction this new ad was generating in the market, and they decided to accelerate that campaign.”
More and more companies, says Fleming, are looking for this kind of “fast feedback” about how a new product, campaign, or piece of bad news is playing in the market. For now, companies have to work with Crimson Hexagon consultants to formulate the questions they want answered, come up with sorting categories for ReadMe, and create an initial training set. But by next year, she says, the startup plans to roll out a self-service, Web-based version of the tool.
“There are so many things that people in the business world want to know about what consumers are saying and thinking, where they’d like to do a quick opinion poll if they could,” Fleming says. “But you don’t have to go out and do an opinion poll, because people are naturally talking about these things on the Web” (today’s real Library of Babel).