You can learn a lot from Wikipedia, despite all its faults—and Jeff Catlin’s company has done just that.
Boston-based Lexalytics said today that its latest text-analysis software incorporates insights from combing the whole of Wikipedia, the user-generated online encyclopedia, for relationships between words, phrases, and their meanings. The company says its new software, which powers products used by big brands and other organizations to quantify the meaning and sentiment behind conversations on the Web, will be available this summer.
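Lexalytics hasn't detailed its algorithms publicly, but to give a sense of what "mining Wikipedia for relationships between words and phrases" can look like in practice, here is a minimal Python sketch built on an assumed co-occurrence model: it derives a co-occurrence vector for each word from a few stand-in article snippets and scores how related two terms are with cosine similarity. The `articles` list and the `relatedness` function are hypothetical illustrations, not the company's actual pipeline.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt

# Toy stand-ins for Wikipedia article text; a real system would stream
# the full encyclopedia dump instead of these few sentences.
articles = [
    "the cabin is a type of room on a cruise ship",
    "chicken tikka masala is a popular indian food dish",
    "a golf club is used in the outdoor sport of golf",
]

# Count how often each pair of words appears in the same article.
cooccur = defaultdict(Counter)
for text in articles:
    words = set(text.split())
    for a, b in combinations(sorted(words), 2):
        cooccur[a][b] += 1
        cooccur[b][a] += 1

def relatedness(w1, w2):
    """Cosine similarity between two words' co-occurrence vectors."""
    v1, v2 = cooccur[w1], cooccur[w2]
    dot = sum(v1[w] * v2[w] for w in v1)
    norm = sqrt(sum(c * c for c in v1.values())) * sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

print(relatedness("cabin", "ship"))  # relatively high: they share context
print(relatedness("cabin", "golf"))  # lower: little shared context
```

A production system would run over the full Wikipedia dump, weight the counts (with TF-IDF or pointwise mutual information, say), and handle multi-word phrases, but the underlying idea of inferring meaning from shared context is the same.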
Before diving into the technology, here are some business metrics. Lexalytics is "quite profitable this year," according to Catlin, the firm's CEO. It saw 65 percent revenue growth last year and is continuing to grow in 2011 in a number of new markets, he says (more on that in a minute). The company currently has 18 employees in the U.S. and U.K. Catlin himself splits his time between offices in Boston and Amherst, MA.
About a year ago, my colleague Wade profiled Lexalytics and its humble beginnings in 2003, when Catlin was running an engineering group at LightSpeed Software, a Woburn, MA-based content management startup. LightSpeed was consolidating and closing its East Coast operation, but Catlin convinced the firm’s investors to let him run his division as a separate company (which became Lexalytics).
Lexalytics has gone on to provide “sentiment analysis” technology for companies that help brands and organizations monitor and manage their reputations online, such as Cymfony, ScoutLabs, and social-media firms like Bit.ly. Lexalytics recently landed Newton, MA-based TripAdvisor as a customer and partner; TripAdvisor (which is being spun out of Expedia as a separate public company) uses Lexalytics’ software to understand user sentiment—what people like and don’t like—in their online reviews of hotels, restaurants, cruises, and other attractions.
But the opportunity for Lexalytics goes far beyond understanding sentiment in blogs, tweets, and other social media. As I see it, the technology is really about getting a computer to understand the meaning of sentences and the deeper relationships between words and phrases in documents. So it’s about classifying “wonderful day” as positive and “horrible disaster” as negative, sure, but it’s also about identifying names and acronyms; detecting sarcasm or hype amidst praise or insults; and being able to classify things like “cabin” as a type of room on a ship, “chicken tikka masala” as Indian food, “golf club” as having to do with outdoor recreation, and “Red Sox” as a (currently bad) baseball team.
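To make the sentiment half of that concrete, here is a deliberately tiny lexicon-based scorer in Python. The word lists and the `score` function are illustrative assumptions, not Lexalytics' engine, which also has to cope with negation, sarcasm, entity recognition, and the kind of concept classification (a "cabin" as a ship's room) that a toy like this ignores.

```python
# Illustrative only: a minimal lexicon-based sentiment scorer.
POSITIVE = {"wonderful", "great", "love", "praise"}
NEGATIVE = {"horrible", "disaster", "bad", "hate"}

def score(text):
    """Return a rough sentiment score: >0 positive, <0 negative."""
    words = text.lower().split()
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

print(score("what a wonderful day"))        #  1 -> positive
print(score("a horrible disaster at sea"))  # -2 -> negative
```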
The technologies behind such “semantic” analysis of text—natural language processing, machine learning, and statistical modeling techniques—have been around for more than a decade. But they have continued to improve in recent years, enhanced in part by the availability of big, user-generated databases, like Wikipedia. And, crucially, the market for these