Anand Jagatia The human genome is a code 3 billion base pairs long that contains the instructions for our cells to make proteins, and ultimately, us. In 2003, scientists successfully completed the Human Genome Project, managing to sequence every single one of these pairs. But it turns out that was just the beginning. It was like we had the blueprint for how to make a human but no idea how to interpret it. What did this vast series of As, Ts, Cs and Gs actually encode?
Researchers now think it contains between 21,000 and 25,000 genes for proteins, but there's also a huge amount going on in the rest of the genome. For one thing, lots of it doesn't code for protein at all. It codes for sections of RNA that have a function in and of themselves. And there are at least a million stretches of DNA in the genome known as 'switches' or 'elements' that are involved in turning genes on or off, or changing their levels of expression.
Picking up where the Human Genome Project left off, ENCODE, which stands for Encyclopaedia of DNA Elements, is an ambitious project to identify all of these elements and how they affect gene expression. This week, several papers on the third iteration of the project, ENCODE 3, have been published in Nature.
Rick Myers from the HudsonAlpha Institute for Biotechnology in the US has been part of the project since it began. He explained to me why these areas of the genome are of such interest.
Rick Myers You don't get to be a liver cell or a neuron by expressing the entire genome, you have subsets of the genes being activated, and others being inactivated, and the hope was to identify these switches and what turns those switches on and off in different cell types. So when ENCODE started in 2003, it was a pilot project really looking at only 1% of the genome. So then the second phase of ENCODE was to do that on a genome-wide level. And then ENCODE 3 that we're talking about now, greatly, greatly expanded that so that we're not only looking genome-wide, we're looking at a lot of different cell types, and starting to learn a whole lot more detail of this atlas, essentially, like a collection of maps, but now putting these maps together so that they make sense.
Anand Jagatia So can you give us a sense of the scale of this? How many of these switches have you identified in ENCODE 3?
Rick Myers So now, with ENCODE 3, we've identified in the human genome almost a million of these switches, there may well be more of them, these are the ones we've identified, in a few dozen different cell types. As an aside, we also, during this phase, did a mouse ENCODE project. There are several hundred thousand in the mouse that are identified, and having the interplay between those two organisms' datasets has been really helpful for interpreting the human genome, for instance.
Anand Jagatia And part of ENCODE 3 was trying to figure out how these million or so switches affect gene expression when bound by different molecules in the cell, which could be proteins or even RNA. So what are some of the molecules that you were looking at?
Rick Myers A big push was on the DNA binding proteins that are called transcription factors, proteins that bind to DNA or bind to other proteins bound to DNA, and turn genes on or off or determine the levels of the genes in different cell types and at different times during development, and there are a lot of them. 1,600 of these means that we put a lot of our genome, and the energy that goes into making cells, into controlling when and where all the genes get expressed.
In addition to what we call transcription factors, there are other more general DNA binding proteins that are called chromatin regulators. They play a role in what the whole genome looks like in a particular cell at any time, in terms of opening up regions in the genome for transcription or helping to keep them closed. So that was another really important part of ENCODE because they bind to many, many more places in the genome than do transcription factors.
Anand Jagatia So what form does all of this data actually take?
Rick Myers So ENCODE 3 is the first time we actually generated the encyclopaedia, which is freely available to everyone. It has all the annotations that ENCODE has generated to date: genes, the switches, transcriptomes, epigenetics, in many different cell types. And it's organised and meant to be easy to use. So computational biologists and many creative scientists in this helped to build tools to take these very large, complex datasets in huge numbers of different contexts, and get out what you want to look at.
Anand Jagatia I mean, looking back, how do these kinds of datasets and tools that you've used to build this encyclopaedia compare to what you were using back in the 90s when the Human Genome Project was set up?
Rick Myers It's fun for me, at least, to look back on the history. When we started the Human Genome Project in 1990, the goal was to figure out one person's or one composite human genome sequence. Truth is we really didn't know how we were going to do it. The technology was pretty crude back then. And in 1990, the internet didn't exist or at least we weren't using it yet. And we were copying data onto floppy disks. And, you know, providing that to people as much as we could. And of course, in that subsequent 30 years, we've had enormous increases in computational ability. And thank goodness we do, because the amount of data we have is millions of times more than what we had in 1990.
Anand Jagatia In lots of ways, this is basic science, really, you're trying to annotate the genome to figure out what these different elements are and what they do and how they affect gene expression. But scientists are using the data and there are practical applications too. Have you got any examples you can share?
Rick Myers Yes, so one of them is a severe gastrointestinal disorder in babies. The cause was unknown. And researchers used ENCODE data to identify particular switches that control the expression of a gene that was suspected to be involved in this terrible disorder in babies. They tested the region and sure enough, it was involved in regulating the gene in the digestive system, and that actually identified then the cause of this disease that is also probably related to many other similar diseases in children, and even some adults. And that example is one of many, many where being able to understand how the gene is expressed and how the regulation of the gene is controlled has helped us understand and even work our way towards not just diagnosis and prognosis, but treatments.
Anand Jagatia So ENCODE 3 is now available in the form of this encyclopaedia, but things don't stop there. What's happening with the next phase of the project?
Rick Myers ENCODE 4 is well underway. The goals in ENCODE 4 are to greatly expand the data collection, a lot more cell types are being included, and we're really working towards analysing all 1,600 transcription factors and all the chromatin marks. Those are some of the major goals. One of the really important parts about ENCODE 4 is actually integrating all of these data types. You don't have one little element and one protein controlling the expression of a gene, you have a massive group of components that interact to give you that specificity of cell type, when it's going to be expressed, and when it happens during development.
Anand Jagatia Rick Myers from the HudsonAlpha Institute for Biotechnology. The data from ENCODE 3 is available at encodeproject.org, and as Rick says, it’s completely free for researchers to use with no restrictions. And you can read the accompanying papers on ENCODE 3 at Nature.com.