How Crimson Hexagon Translates the Blogosphere’s Babel Into Wisdom

the official population of the United States is the number of people physically enumerated by census interviewers.

“If you cared about the real number of people in the country, you would want to use the statistical approach,” says King. But at Crimson Hexagon, he says, “we are not bound by an arbitrary legal ruling. We care about the actual fraction of blog opinions in a particular category.” That means the company is free to use statistical methods to adjust its actual counts—which, as it turns out, hugely simplifies the problem of categorizing large numbers of opinions, whether they’re about presidential candidates, causes of mortality, or digital cameras.

Like the census statisticians’ proposal, ReadMe is all about correcting errors in the original data. Using standard search algorithms, Crimson Hexagon can sort blog posts into categories with about 80 percent accuracy—which sounds good, but wouldn’t be useful to the company’s clients, since at that level of error some categories could be off by 50 percent or more. “But suppose,” says King, “that you knew that 10 percent of the documents in Category One should actually be in Category Three. That might not help you classify individual documents better—but since we’re not interested in that, it doesn’t matter. You just subtract that many from Category One and add them to Category Three. It gets you the right answer, proportionally.” In other words, you can stick with an 80-percent accurate categorization scheme, but still get a nearly 100-percent accurate sense of the proportions among the categories. Or as King rephrases the point: “Classifying every needle in the haystack is very difficult, but we can still classify the whole haystack.”
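King's subtract-and-add arithmetic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `shift_counts` and the raw counts are made up, and nothing here is drawn from Crimson Hexagon's actual software.

```python
def shift_counts(counts, source, target, fraction):
    """Move `fraction` of the documents counted in `source` over to `target`.

    Hypothetical helper for illustration; returns a new dict and leaves
    the input untouched.
    """
    moved = counts[source] * fraction
    corrected = dict(counts)
    corrected[source] -= moved
    corrected[target] += moved
    return corrected


# Made-up raw counts from an imperfect automatic classifier.
raw = {"Category One": 500, "Category Two": 300, "Category Three": 200}

# Suppose we know 10 percent of Category One really belongs in Category Three.
corrected = shift_counts(raw, "Category One", "Category Three", 0.10)

# Convert the corrected counts into proportions of the whole collection.
total = sum(corrected.values())
proportions = {name: count / total for name, count in corrected.items()}
```

No individual post is reclassified here; only the category totals move, which is the point of King's haystack remark. With these invented numbers, Category One's share drops from 50 percent to about 45 percent and Category Three's rises from 20 percent to about 25 percent.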

(The sizes of the errors in each category, of course, are critical. The company determines them by having human coders create a “training set” by initially classifying, say, 100 blog posts into categories. “Then you run your statistical method on those same documents and see how well it does,” says King. “It may be doing really well in Category Seven, but 10 percent of the documents that should be in Category Three get put into Category One. Then we know what the correction is” for the whole data set.)
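The training-set step can be sketched the same way. The code below is a simplified illustration, not ReadMe's actual algorithm: it assumes the hand-coded posts are a random sample of the full collection, so that the error rates measured on them carry over, and every name and number in it is hypothetical.

```python
from collections import Counter, defaultdict


def estimate_reassignment_rates(training_pairs):
    """For each machine-assigned category, estimate what share of its
    documents the human coders placed in each category.

    training_pairs: iterable of (human_label, machine_label) tuples.
    """
    by_machine = defaultdict(Counter)
    for human_label, machine_label in training_pairs:
        by_machine[machine_label][human_label] += 1
    rates = {}
    for machine_label, human_counts in by_machine.items():
        total = sum(human_counts.values())
        rates[machine_label] = {h: n / total for h, n in human_counts.items()}
    return rates


def correct_counts(machine_counts, rates):
    """Redistribute machine-assigned totals using the estimated shares."""
    corrected = defaultdict(float)
    for machine_label, count in machine_counts.items():
        # A category never seen in the training set is left unchanged.
        shares = rates.get(machine_label, {machine_label: 1.0})
        for human_label, share in shares.items():
            corrected[human_label] += count * share
    return dict(corrected)


# A made-up training set of 100 hand-coded posts: 4 of the 40 posts the
# machine filed under "One" were judged by the coders to belong in "Three".
training = (
    [("One", "One")] * 36
    + [("Three", "One")] * 4
    + [("Two", "Two")] * 30
    + [("Three", "Three")] * 30
)
rates = estimate_reassignment_rates(training)

# Made-up machine totals for the full collection of 10,000 posts.
machine_totals = {"One": 5000, "Two": 3000, "Three": 2000}
estimated_totals = correct_counts(machine_totals, rates)
```

In this invented example the correction moves roughly 500 posts from "One" to "Three" across the full collection, even though only 100 posts were ever read by a person. A production system would need a larger training set and some estimate of the uncertainty in the measured rates.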

Undoubtedly, that’s all an oversimplification—but you probably get the gist. Fleming says Crimson Hexagon’s clients put its findings to a variety of profitable uses. She told me a story about “a major networking hardware company” (it’s not hard to read between the lines about which company this might be) that asked Crimson Hexagon to monitor general opinion in the blogosphere about whether its stock was overpriced or underpriced. The company was particularly interested in how bloggers would react to a negative earnings announcement it planned to make. “It shocked them to see that there was a little blip in opinion, but it wasn’t significant,” says Fleming. “A PR campaign they had launched two days prior on a new product they were releasing was completely overshadowing the negative reaction from the quarterly earnings. Two direct actions came out of that—one was that they decided not to do this damage control campaign they were about to launch, because it looked like the impact from the negative earnings was not as bad as they thought. And they also had no idea how much reaction this new ad was generating in the market, and they decided to accelerate that campaign.”

More and more companies, says Fleming, are looking for this kind of “fast feedback” about how a new product, campaign, or piece of bad news is playing in the market. For now, companies have to work with Crimson Hexagon consultants to formulate the questions they want answered, come up with sorting categories for ReadMe, and create an initial training set. But by next year, she says, the startup plans to roll out a self-service, Web-based version of the tool.

“There are so many things that people in the business world want to know about what consumers are saying and thinking, where they’d like to do a quick opinion poll if they could,” Fleming says. “But you don’t have to go out and do an opinion poll, because people are naturally talking about these things on the Web”—today’s real Library of Babel.

Author: Wade Roush

Between 2007 and 2014, I was a staff editor for Xconomy in Boston and San Francisco. Since 2008 I've been writing a weekly opinion/review column called VOX: The Voice of Xperience. (From 2008 to 2013 the column was known as World Wide Wade.) I've been writing about science and technology professionally since 1994. Before joining Xconomy in 2007, I was a staff member at MIT’s Technology Review from 2001 to 2006, serving as senior editor, San Francisco bureau chief, and executive editor of TechnologyReview.com. Before that, I was the Boston bureau reporter for Science, managing editor of supercomputing publications at NASA Ames Research Center, and Web editor at e-book pioneer NuvoMedia. I have a B.A. in the history of science from Harvard College and a PhD in the history and social study of science and technology from MIT. I've published articles in Science, Technology Review, IEEE Spectrum, Encyclopaedia Britannica, Technology and Culture, Alaska Airlines Magazine, and World Business, and I've been a guest of NPR, CNN, CNBC, NECN, WGBH and the PBS NewsHour. I'm a frequent conference participant and enjoy opportunities to moderate panel discussions and on-stage chats. My personal site: waderoush.com. My social media coordinates: Twitter: @wroush; Facebook: facebook.com/wade.roush; LinkedIn: linkedin.com/in/waderoush; Google+: google.com/+WadeRoush; YouTube: youtube.com/wroush1967; Flickr: flickr.com/photos/wroush/; Pinterest: pinterest.com/waderoush/