allow us to get all of the visual and geometric information out of the page. For every rectangle, we pull out things like the x and y coordinates, the heights and widths, the positioning relative to everything else, the font sizes, the colors, and other visual cues. In much the same way, when I was working on the self-driving car, we would look at a patch and do edge detection to determine the shape of a thing or find the horizon.
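To make those rectangle features concrete, here is a minimal sketch of what one per-rectangle feature record might look like; the `PageRectangle` class and its field names are illustrative assumptions, not Diffbot's actual schema.

```python
from dataclasses import dataclass

@dataclass
class PageRectangle:
    """Visual and geometric features for one rendered rectangle on a page.

    Illustrative only: the fields are assumptions, not Diffbot's schema.
    """
    x: float          # left edge of the rectangle, in pixels
    y: float          # top edge, in pixels
    width: float
    height: float
    font_size: float  # dominant font size of the text inside
    color: str        # dominant foreground color, e.g. "#1a1a1a"
    rel_x: float      # horizontal position relative to the page, 0.0-1.0
    rel_y: float      # vertical position relative to the page, 0.0-1.0

    def as_feature_vector(self) -> list[float]:
        """Flatten the numeric cues into a vector for a classifier."""
        return [self.x, self.y, self.width, self.height,
                self.font_size, self.rel_x, self.rel_y]
```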
X: Once you identify those shapes and other elements, how do you say, “This is a headline, this is an article,” et cetera?
MT: We have an ontology. Other people have done good work defining what those ontologies should be—many of them are collected at schema.org, which reflects the ontologies the search engines have proposed. We also have human beings who draw rectangles on the pages and teach Diffbot “this is what an author field looks like, this is what a product looks like, this is what a price looks like,” and from those rectangles we can generalize. It’s a machine learning system, so it lives and breathes on the training data that is fed into it.
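As a hedged illustration of how those human-drawn rectangles might become training data, the sketch below pairs each rectangle's feature vector with its ontology label and fits an off-the-shelf classifier. The feature layout, the labels, and the choice of scikit-learn are assumptions made for illustration, not a description of Diffbot's actual stack.

```python
from sklearn.ensemble import RandomForestClassifier

# Each training example: (feature vector from a human-drawn rectangle, label).
# Labels come from the ontology: "author", "price", "product_title", ...
X_train = [
    [12, 40, 300, 18, 11.0, 0.05, 0.08],    # small text near the top-left
    [12, 70, 600, 48, 28.0, 0.05, 0.12],    # large text block just below it
    [820, 200, 120, 24, 16.0, 0.85, 0.30],  # short text in the right column
]
y_train = ["author", "product_title", "price"]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Generalize: classify a rectangle from a page the system has never seen.
print(clf.predict([[15, 65, 580, 50, 27.0, 0.06, 0.11]]))
# likely ['product_title'], given the nearby training example
```

A real system would use far more examples and richer features, but the shape of the data is the same: one labeled rectangle per row.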
X: Do you actually do all the training work yourselves, or do you crowdsource it out somehow?
John Davi: We have done a combination of things. We always have a cold-start problem when firing up a new type of page—products versus articles, or a new algorithm for press releases, for example. We leverage grunt work internally—just grinding out our own examples, which has the side benefit of keeping us informed about the real world—but also crowdsourcing, which gives us a much broader variety of input and opinion. We have used everything, including off-the-shelf crowdsourcing tools like Mechanical Turk and CrowdFlower, and we have built up our own group of quasi-contract crowdsourcers.
Our basic effort is to cold-start it ourselves, then get an alpha-level product into the hands of our customers, which drastically increases the amount of training data we have. Sometimes we look at the stream of content, eyeball it, and manually tweak and correct. In a lot of cases our customers get involved. If they have an interest in helping to train the algorithm, it not only makes it better for them; if they are first out of the gate, they can also tailor the algorithm to their very particular needs.
X: How much can your algorithms tell about a Web page just from the way it looks? Are you also analyzing the actual text?
MT: First we take a URL and determine what type of page it is. We’ve identified roughly 20 types of pages that the whole Web falls into: article pages, people pages, product pages, photos, videos, and so on. So one of the fields we return is the type of the thing. Then, depending on the type, there are other fields. For the article API [application programming interface], which is one we have out publicly, we can tell you the title, the author, the images, the videos, and the text that goes with that article. And we not only identify where the text is, but we can tell you the topics. We do some natural language processing on the text, so we can tell you “This is about Apple,” and that it’s about Apple Computer, not the fruit.
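Here is a sketch of what a client call to the article API might look like, assuming a token-plus-URL GET interface that returns JSON. The endpoint, parameter names, and response field names are assumptions inferred from the fields MT lists, not documented API details.

```python
import requests  # third-party HTTP client

# Hypothetical call shape; the endpoint and parameter names are assumptions.
resp = requests.get(
    "https://api.diffbot.com/article",
    params={"token": "YOUR_TOKEN", "url": "https://example.com/some-story"},
    timeout=30,
)
resp.raise_for_status()
article = resp.json()

# Fields MT describes in the interview (the names here are illustrative):
print(article.get("type"))    # e.g. "article", one of the ~20 page types
print(article.get("title"))
print(article.get("author"))
print(article.get("images"))
print(article.get("videos"))
print(article.get("text"))
print(article.get("topics"))  # e.g. ["Apple Inc."], disambiguated from the fruit
```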
JD: Another opportunity we are excited about is how Diffbot can help augment what is natively on the page. Just by dint of following so many pages through our system, we can augment [the existing formatting] and increase the value for whoever is reading. In the case of an article, the fact that we see so many articles means it’s relatively easy for us to generate tags for any given text.
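One simple way that seeing many articles could translate into tag generation is to weight each term by how distinctive it is relative to the whole corpus. The TF-IDF sketch below is purely illustrative and is not a claim about Diffbot's actual method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A large corpus of previously seen articles makes rare-but-distinctive
# terms stand out; those terms are tag candidates for a given article.
corpus = [
    "Apple unveiled a new iPhone at its Cupertino event ...",
    "The orchard's apple harvest was the largest in a decade ...",
    "Tesla reported record deliveries this quarter ...",
]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(corpus)

def suggest_tags(doc_index: int, top_k: int = 3) -> list[str]:
    """Return the top-k highest-weighted terms for one document."""
    row = tfidf[doc_index].toarray().ravel()
    terms = vectorizer.get_feature_names_out()
    top = row.argsort()[::-1][:top_k]
    return [terms[i] for i in top if row[i] > 0]

print(suggest_tags(0))  # distinctive terms from the first article, e.g. 'iphone'
```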
X: How do you turn this all into a business?
MT: We are actually selling something. We are trying to build the Semantic Web, but in a profitable way. We analyze the pages that people pay us to analyze. That’s currently over 100 million URLs per month, which is a good slice of the Web. Other startups have taken the approach of starting by crawling and indexing the whole Web, and that is very capital-intensive. Another benefit of doing it our way is that people only send us the best parts of