databases were announced, including datasets from the Haplotype Reference Consortium and the Exome Aggregation Consortium. These, like others in the works, aim to accelerate studies to identify genetic variants and translate them into clinically useful markers.
Distributed global access: While a growing amount of genomic data is being amassed in repositories across many locations, what matters is the ability to access and share the knowledge derived from these data to inform patient-driven and crowd-sourced solutions. To manage and distribute massive volumes of raw genomic information (a single whole genome sequence contains some 100 gigabytes of data), new technologies are being designed to annotate, characterize, and organize it so that it can be searched and analyzed for relevant patterns, markers, and variants that might predict disease risk or response to a treatment.
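To give a sense of scale, the figure of roughly 100 gigabytes per whole genome can be turned into a back-of-the-envelope storage estimate. The sketch below is illustrative only: the function name `estimate_storage_tb` and the use of decimal units (1 TB = 1,000 GB) are assumptions, and real footprints vary with sequencing depth, file format, and compression.

```python
# Rough storage estimate for a collection of whole genomes.
# Assumption: ~100 GB of raw data per genome, the approximate figure quoted in the text.
GB_PER_GENOME = 100


def estimate_storage_tb(num_genomes: int, gb_per_genome: float = GB_PER_GENOME) -> float:
    """Return the estimated raw storage in terabytes (decimal: 1 TB = 1000 GB)."""
    if num_genomes < 0:
        raise ValueError("num_genomes must be non-negative")
    return num_genomes * gb_per_genome / 1000


if __name__ == "__main__":
    # Hypothetical collection sizes, chosen only to illustrate the growth.
    for n in (1, 2_000, 100_000):
        print(f"{n:>7,} genomes ~ {estimate_storage_tb(n):,.1f} TB")
```

At this rate, a collection of a few thousand whole genomes already runs to hundreds of terabytes, which is why moving raw files between sites is less practical than querying them where they sit.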
New technologies are emerging that have two key capabilities. First, the ability to manipulate and mine massive volumes of genome data. Second, the ability to share these large amounts of data instantly, via web browsers. Together, these capabilities will open up new opportunities:
• Researchers in rare diseases will have the critical mass of genomes needed to power their studies and, potentially, to find new clues to and cures for these conditions.
• Researchers in common diseases will have access to large enough reference databases to find meaningful genomic signals and augment their patient data.
• Access to broader genomic data can help companies design clinical trials with better patient criteria, leading to more efficient development of new treatments.
• Over time, instant access to such data will enable physicians everywhere to bring the best genomic insights to the point of patient care.
From availability to accessibility: We predict the start of an accelerated pace of innovation for new technologies that enable distributed global access to genomic data. Here are some examples of how large collections of genomes are becoming more easily accessible:
• The NextCODE Exchange: In collaboration with rare disease researchers from medical institutions in the U.S., Europe, Australia, and Japan, NextCODE Health has developed a browser-based Exchange that allows genomic data to be shared, instantly, across the globe. The purpose of this Exchange, first and foremost, is to help researchers crack more difficult diagnostic cases and rare diseases, and to help accelerate new discoveries for common conditions. Through the Exchange, for instance, the Simons Foundation Autism Research Institute is providing researchers with real-time access to 10,000 exomes and phenotypic data from 2,600 families with one child on the autism spectrum.
• The 1000 Genomes Project: This project was launched in January 2008 as an international research effort to establish a detailed catalogue of common human genetic variation, and now includes more than 2,000 genomes. Data generated by the 1000 Genomes Project are widely used by the genetics community, and all the sequencing data (including variant calls) are freely available and can be downloaded via standard file transfer protocol (FTP) from the project’s website.
• The Exome Aggregation Consortium: This group, led by the Broad Institute of MIT and Harvard, has just released allele frequency data derived from 63,000 exomes. A collection of this size serves as reference data for judging how common or rare a given variant is. The raw data is aggregated from 25 institutions, and while it is not currently possible to query the raw data directly, it may be possible to do so in the future.
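The allele frequency data mentioned above rest on a simple calculation: the number of times a variant allele is observed, divided by the total number of alleles examined at that site. The sketch below shows how counts from several contributing cohorts could be pooled into one frequency. It is a minimal illustration, not the consortium's actual pipeline; the names `SiteCounts` and `pooled_allele_frequency` and the example counts are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class SiteCounts:
    """Per-cohort counts for one variant site (hypothetical structure)."""
    allele_count: int   # times the variant allele was observed (AC)
    allele_number: int  # total alleles successfully genotyped at the site (AN)


def pooled_allele_frequency(cohorts: Iterable[SiteCounts]) -> Optional[float]:
    """Pool counts across cohorts and return AC / AN, or None if no data."""
    total_ac = 0
    total_an = 0
    for c in cohorts:
        total_ac += c.allele_count
        total_an += c.allele_number
    if total_an == 0:
        return None  # no cohort genotyped this site
    return total_ac / total_an


if __name__ == "__main__":
    # Made-up counts from two contributing cohorts, for illustration only.
    site = [SiteCounts(allele_count=3, allele_number=1000),
            SiteCounts(allele_count=1, allele_number=3000)]
    print(pooled_allele_frequency(site))
```

Pooling the counts before dividing, rather than averaging per-cohort frequencies, weights each cohort by how many alleles it contributed. This also shows why summary frequencies can be shared even when the underlying raw data cannot be queried.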
As this kind of global knowledge continues to be made available in a distributed, accessible way – supported by advanced information technology – it will enable researchers and clinicians to effectively work with global partners and will lead to new insights and discoveries about diseases.
We are on the cusp of the next wave of data sharing in genomics. It promises to open up new frontiers for collaboration, and revolutionize how we use genomics in medicine.