San Antonio—Even as genome sequencing has become more prevalent and less expensive over the last decade, the data it generates still leaves many scientists perplexed.
As it becomes more common for pharmaceutical and biotech companies to use data from sequenced genomes in drug testing, the question of how they will effectively process that immense amount of information remains. That discussion was at the core of much of the Big Data & Data Analytics conference hosted this week by the University of Texas at San Antonio. Companies like Janssen, the pharmaceutical division of New Brunswick, NJ-based healthcare giant Johnson & Johnson, have developed internal teams to create solutions.
“This is something we need to work with partners in biotech, startups, academia, to enable the effective capture, interpretation and assimilation of this data,” said Guna Rajagopal, the vice president of Janssen’s computational sciences division, whose team does that data work for the drug developer.
For example, two years ago, Rajagopal’s 60-person division took the results of a drug’s effect on 500 people whose genomes it had sequenced. That produced 90 terabytes of data, which Janssen stored with Amazon. Analyzing that data, which Janssen did at the San Diego Supercomputing Center at the University of California, San Diego, required eight weeks and 257 terabytes of computing power, he said. The company also worked with Intel on the project, he said.
Rajagopal didn’t reveal the results, or the name of the drug, citing restrictions from his company’s legal office. The point of his example: As sequencing becomes more prevalent in drug development, those partnerships among people in different industries are going to become more important, he said.
“The data must flow across all the organizations we have so the right data comes to the right people to make the right decision,” Rajagopal said. “If you’re talking about 1 million genomes or 10,000 genomes, how are we going to address this bigger challenge?”
The conference itself was a confluence of academics and enterprise, of technology and life sciences. At the University of Texas MD Anderson Cancer Center in Houston, researchers in the computational biology and bioinformatics department have started archiving data because of the sheer volume they’re bringing in, which makes the data harder to access easily, according to John Weinstein, the department chair.
“It’s harder to get to, and to get to quickly, but at least it will be preserved,” Weinstein said. “The question is, what comes next? I’m sure there are those at this meeting who know what availability [of storage] there will be.”
One group might be another institution in the University of Texas System. The Texas Advanced Computing Center, based at the system’s flagship campus in Austin, has built a computing system it calls Stampede with 100,000 cores (or processing units) of computing power and 14 petabytes of storage (one petabyte is 1 million gigabytes).
“We’ve got this tsunami of data coming. This is where the Texas Advanced Computing Center wants to come to help,” said Niall Gaffney, the center’s director of data intensive computing. “Stampede is the sort of system that might be able to tackle this million genome problem.”
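To put rough numbers on that challenge, here is a back-of-envelope sketch in Python that uses only the figures quoted in this story: Janssen’s 90 terabytes of data for 500 sequenced genomes and Stampede’s 14 petabytes of storage. The assumption that storage needs scale linearly with the number of genomes is ours for illustration, not something anyone at the conference claimed.

    # Back-of-envelope estimate of raw storage for the "million genome problem,"
    # using only the figures quoted in this story. The linear-scaling
    # assumption is ours, not Janssen's or TACC's.

    JANSSEN_GENOMES = 500        # people sequenced in the Janssen example
    JANSSEN_DATA_TB = 90         # terabytes that study produced
    STAMPEDE_STORAGE_PB = 14     # Stampede's storage, per TACC

    tb_per_genome = JANSSEN_DATA_TB / JANSSEN_GENOMES   # ~0.18 TB (180 GB) per genome

    for n_genomes in (10_000, 1_000_000):
        total_tb = n_genomes * tb_per_genome             # assumes linear scaling
        total_pb = total_tb / 1_000                      # 1 petabyte = 1,000 terabytes
        print(f"{n_genomes:>9,} genomes -> ~{total_pb:,.0f} PB "
              f"({total_pb / STAMPEDE_STORAGE_PB:.1f}x Stampede's storage)")

By that crude arithmetic, 10,000 genomes would come to roughly 2 petabytes of raw data and 1 million genomes to roughly 180 petabytes, more than ten times Stampede’s current storage, which gives a sense of why Rajagopal calls it a bigger challenge.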