With the arrival of next-generation gene sequencing machines like the Illumina (NASDAQ: ILMN) HiSeq X Ten, medicine has been moving to develop new ways of using genomic data to treat patients. Last month, for example, J. Craig Venter unveiled plans to sequence the entire genome of every patient entering the UC San Diego Moores Cancer Center as an initial goal for his latest startup, Human Longevity Inc.
At the same time, though, it’s becoming clear that generating genomic data for thousands of cancer patients involves working with very large numbers, and that means a wave of new opportunities for innovation is emerging as genomics and Big Data come together. One company moving to catch this wave is Edico Genome, a San Diego startup founded last year to fix a bottleneck in the way data generated by the HiSeq X Ten and other next-generation sequencing machines is processed.
Edico has developed a specialized computer processor for ordering the readout of nucleotides—A, C, T, or G—from short segments of DNA generated by next-generation sequencing technology so they align with a reference genome. It’s a process that genomics specialists refer to as “mapping.”
It is a Big Data problem. The human genome consists of roughly 3.2 billion nucleotide base pairs (made of that four-letter alphabet of DNA) that encode between 20,000 and 25,000 genes. Next-generation sequencing technology cuts the DNA molecule into millions of short segments to “read” the sequence and digitize the results. What comes out is a very large data file that can range from 150 gigabytes to more than 320 gigabytes. An average-size, 200-gigabyte data file would be roughly equivalent to 800 big city phone books—from the days when people used their phone books.
But the data file still consists of millions of segments of DNA that must be mapped to a reference genome. Think of throwing 800 telephone books into a paper shredder, and then trying to reassemble the millions of strips to make sense of the information.
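To make the idea of mapping concrete, here is a toy sketch in Python of one common textbook approach, often called seed-and-verify: index every short "seed" of the reference, look up where a read's seeds occur, and then count mismatches at each candidate position. This is only an illustration under simplifying assumptions (a tiny made-up reference, no insertions or deletions, no reverse strand), and it is not a description of how Edico's processor or Illumina's software works. The function names `build_index` and `map_read` and all the sequences are hypothetical.

```python
from collections import defaultdict


def build_index(reference, k):
    """Record every position where each k-letter seed occurs in the reference."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Return (start, mismatches) for the best placement of a read, or None.

    Each k-letter seed of the read proposes candidate start positions;
    every candidate is then checked letter by letter against the reference.
    """
    best = None
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for hit in index.get(seed, ()):
            start = hit - offset
            # Skip candidates that would hang off either end of the reference.
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best


# A made-up 26-letter "reference genome" and three short reads.
reference = "GATTACAGGCTTAACCGTAGCATCGA"
k = 4
index = build_index(reference, k)

exact = map_read("GGCTTAAC", reference, index, k)      # exact copy of a stretch of the reference
one_off = map_read("CGTAGGAT", reference, index, k)    # same as the reference except one letter
unmapped = map_read("TTTTTTTT", reference, index, k)   # shares no seed with the reference

print(exact, one_off, unmapped)
```

A real human-genome run repeats this kind of lookup for the millions of segments in the data file against a reference of roughly 3.2 billion letters, which is why the workload is so heavy.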
Today, companies like Illumina use clusters of computer servers to map these random DNA segments to a reference genome, a process that typically takes about 20 hours, depending on