One Giant Leap for Human Genomics Science and Business

[Editor’s Note: This post was co-authored by Becky Drees, Mark Minie, and Richard Gayle.]

Several months back, Spiral Genetics CEO Adina Mangubat lamented the difficulty of getting actionable information from now-abundant human DNA sequence data in an Xconomy post (“First Comes The $1,000 genome, Then Comes The $10,000 Analysis”). With the simultaneous publication of over 30 research papers and the activation of a novel publicly available web publication and analysis tool by the National Human Genome Research Institute’s (NHGRI’s) Encyclopedia of DNA Elements (ENCODE) Consortium this month, the nature of the game has changed.

The Space Age could be said to have started in 1913 when Robert Goddard described the first multistage rocket. It would take almost 45 years before the first manmade object, Sputnik, orbited the Earth in 1957. A little over a decade later, on July 20, 1969, mankind landed on the moon, part of a flurry of research activity that continues to this day, driving mankind further and further into the reaches of space. We are seeing a similar historic arc of progress in biology: from 1953, when Watson and Crick discovered the structure of DNA, to the initial draft of the Human Genome Project in 2000, to the new findings about DNA functionality published this year by the ENCODE Consortium. We now stand on the precipice of scientific exploration that may very well be as transformative for mankind as the first steps on the moon. And this wave of understanding will have implications far beyond the halls of academia: the new understanding of human genomics will pave the way for many useful applications in biology-based businesses.

We will discuss some immediate impacts of the ENCODE Consortium: a better definition of functional DNA; new approaches toward examining the human genome; and insights into some novel management practices used by the ENCODE Consortium.

The biochemical and computational analysis done by the ENCODE Consortium changes our view of the genomic universe and resolves the most serious roadblock to understanding human genomics by formally re-defining the gene of a multi-cellular organism as a simple, easily studied biomolecular unit: the RNA transcript of the DNA sequence.

The classic definition – that genes are “heritable units” – provides very little insight into the question of DNA function in the genome. Over the years, many attempts have been made to clarify what we mean by a gene and what we mean by function.

For example, take the vast amounts of DNA that are transcribed into RNA but removed by splicing before any protein is made. Are these parts of the genome “functional?” Are they part of the gene if they are not part of the gene’s ultimate products, the proteins, which are responsible for most (but not all) of the structural features and enzymatic functions of the cell?

The new ENCODE Consortium data and research approach eliminate this problem by formally defining the genes of higher cells (as opposed to bacterial cells) as the RNA transcript of the DNA sequence and the regions controlling that transcription.

By doing so, the ENCODE Consortium has simplified such complexities as alternative splicing, RNA editing, the fact that only 1.5 percent of human DNA encodes proteins, the dominance of non-coding RNAs, and epigenetics into a comprehensible and usable model. It also heightens the functional importance of RNA, perhaps even over that of protein.

From thousands of genome-scale data sets, we now see millions of distinct features: RNA transcripts, transcription factor binding sites, and other functional elements. The structure of the “3-D code,” also known as the  “epigenetic code” that specifies how 2 meters of DNA is crammed into a nucleus only a few microns wide, is revealed. The majority of DNA sequences—quite possibly an astounding 80 percent of the genome (!)—can now be linked to a molecular function thanks to the new ENCODE Consortium synthesis. Scientists have argued for years about the functional importance of the non-coding regions of the genome – the so-called “junk DNA.” Whether the molecular function identified by the ENCODE Consortium is directly important, indirectly important or serves no purpose at all to the cell will be the focus of years of research.

The novelty and power of these important new approaches are described beautifully and clearly by one of the ENCODE Consortium’s leading researchers, Dr. John A. Stamatoyannopoulos, from the Departments of Genome Sciences and Medicine, University of Washington School of Medicine, in an open access paper published along with the others. These approaches in turn make it possible to rapidly and relatively cheaply explore the basic biology of the human genome and to translate this new information into novel tools for biology-based businesses.

The ENCODE Consortium project will change the scope of DNA sequence analysis in research, medicine, and new areas of bioscience. To get actionable information from an individual sequence, we must first accurately detect sequence variation.

Then come the big questions: Which DNA variants have functional effects in the cell?  Which are linked to disease or disease risk? How do we connect a change in DNA sequence to an effect on biochemical function, such as a misfolded protein that leads to cystic fibrosis?  To be effective, we must be able to assess the cumulative impact of both coding and non-coding variants across the genome.

Complete genome sequencing is rapidly replacing the protein-focused exome sequencing  now widely used in medical research, while quickly moving into clinical practice as a diagnostic tool for cancer and heritable disorders.

Exome sequencing, which targets only the protein-coding regions, is currently a favored approach.  It is less expensive than sequencing the whole genome, and our ignorance of genome function makes it difficult, if not impossible, to assess the impact of non-coding variants.

The ENCODE Consortium’s annotation of functional elements will make exome sequencing largely a thing of the past. It has created a foundational dataset for assessment and interpretation of sequence variation in the majority of the genome, which goes far beyond just the protein coding regions. These new data are one of several factors that will tip the balance from exome sequencing toward whole genome sequencing. The rest of the genome, the non-coding majority, is full of functional sequence elements: binding sites for regulatory proteins, genes for functional RNAs, and organizational elements that “open” and “close” large regions of the genome.

It does not make sense to ignore non-coding sequences anymore, especially as the cost of whole genome sequencing drops and as instruments improve in their ability to process large volumes of DNA, RNA and proteins. As a consequence, the “data deluge” gets bigger, which is both a challenge and an opportunity for bioinformatics. We will keep pace by running analysis software on cloud computing platforms to process whole genome datasets quickly and cost-effectively.

Most importantly, we have taken a giant step towards actionable interpretation of human genome variation. Until now, newly discovered variants have been evaluated primarily by their predicted protein-coding effects, relying on annotation of protein-coding genes and amino acid substitution models.

The ENCODE Consortium data reveal previously hidden connections between DNA sequence and biochemical function that will be invaluable in evaluating functional variant effects and links to disease. This is powerfully demonstrated by a study published in a recent issue of Science that re-evaluated non-coding, disease-associated sequence variants identified in Genome-Wide Association Studies (GWAS) in light of the ENCODE Consortium’s new information. The study found new leads to the biological mechanisms affecting disease risks, which is particularly relevant to ongoing research into Crohn’s disease and multiple sclerosis. The new ENCODE Consortium synthesis also opens up important new non-medical applications, as shown in a recently published PLoS Genetics paper demonstrating that GWAS data could potentially be used to predict facial features, eye color, and hair color from DNA sequences collected at crime scenes, for use in forensics.

There are big opportunities in this dataset for software companies and high-performance computing operations. The ENCODE Consortium mapped thousands of new biochemical functions onto the genome.   As a result, we can more completely annotate genome sequences and develop fast, cloud-based analysis algorithms trained on high-confidence data for more accurate, actionable interpretation of individual genomes.
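To make the idea of annotating an individual genome concrete, here is a minimal sketch in Python of the most basic step: asking which functional elements overlap a given variant position. Everything in it is an assumption for illustration only. The element list, coordinates, labels, and the function names `build_index` and `annotate` are hypothetical and are not taken from the ENCODE Consortium's data or tools. A simple linear scan stands in for the interval trees or indexed files a production system would more likely use.

```python
from collections import defaultdict

# Hypothetical functional elements: (chromosome, start, end, label).
# Coordinates are 0-based and half-open, i.e. start <= position < end.
ELEMENTS = [
    ("chr1", 100, 200, "transcription factor binding site"),
    ("chr1", 150, 400, "open chromatin region"),
    ("chr1", 1000, 1500, "non-coding RNA transcript"),
    ("chr2", 50, 80, "protein-coding exon"),
]


def build_index(elements):
    """Group elements by chromosome and sort each group by start position."""
    index = defaultdict(list)
    for chrom, start, end, label in elements:
        index[chrom].append((start, end, label))
    return {chrom: sorted(intervals) for chrom, intervals in index.items()}


def annotate(index, chrom, pos):
    """Return the labels of all elements that cover a variant position."""
    return [
        label
        for start, end, label in index.get(chrom, [])
        if start <= pos < end
    ]


if __name__ == "__main__":
    index = build_index(ELEMENTS)
    # Hypothetical variants, given as (chromosome, position).
    for chrom, pos in [("chr1", 160), ("chr1", 500), ("chr2", 60)]:
        labels = annotate(index, chrom, pos)
        print(chrom, pos, labels if labels else "no annotated element")
```

A variant that falls outside every protein-coding exon can still land in a binding site or an open chromatin region, which is the kind of non-coding signal that exome-only analysis would miss.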

One important consequence of the ENCODE Consortium’s new work for biomedical science-based businesses is the clear implication that RNA, not protein as many scientists believed for decades, is the major player in human biology. With many pipelines failing or drying up, this may indicate that the reason for such failures is a lack of understanding of basic biology rather than marketing or investment strategies. It may also suggest that biopharma companies should be actively pursuing RNA-based research into treatments and diagnostics rather than retreating from RNA work.

The ENCODE Consortium’s new synthesis also heralds the final demise of the old and broken Central Dogma of molecular biology and its replacement with a more accurate, robust, networked “GPS” meme proposed

Author: Becky Drees

Becky Drees is the chief scientist at Seattle-based Spiral Genetics. Dr. Drees is an accomplished molecular biologist. Becky’s professional background includes research on genome-wide protein-protein interaction networks at the University of Washington, genetic interaction maps at the Institute for Systems Biology, and expression profiling of lymphocytes for HIV research at the Fred Hutchinson Cancer Research Center. Her areas of scientific expertise include high-throughput screening, genome annotation, gene expression analysis, and next-generation sequencing technologies. She holds a B.S. in Biochemistry from Texas A&M University and a Ph.D. in Molecular Biology from the University of California, Berkeley.