Reference genome assemblies provide a map of a species’ DNA sequence and its spatial context—that is, where along the chromosomes a specific piece of DNA sequence can be found. In the past, the generation of reference assemblies was prohibitively expensive and labour-intensive, so they were only produced for humans and the most important model organisms, and still contained gaps and errors. Draft genomes generated using more affordable second-generation sequencing technologies could be assembled for a larger number of species, but these were of lower quality because they were highly fragmented and their annotation was erroneous in some parts.
However, for a complete understanding of evolutionary processes and other fundamental questions in biology, high-quality reference genome assemblies of all species are required. Technological advances, improved computational methods and the ever-decreasing cost of sequencing enabled the Vertebrate Genomes Project (VGP), which was launched in 2017, to pursue the ambitious goal of producing a reference genome assembly for each of the extant vertebrate species on Earth. In the first phase of the project, the VGP has focused on testing and improving genome sequencing and assembly approaches, on assembling a first set of 260 high-quality genomes of species representing all vertebrate orders (work that is still in progress), and on the initial reporting of insights into genome evolution in vertebrates.
The milestone for phase II is the production of assemblies for about 1,159 vertebrate families, and the milestone for phase III is the generation of assemblies for more than 10,000 genera; finally, in phase IV, assemblies will be completed for all vertebrate species. All sequence data and assemblies are being made freely available as they are produced and can be downloaded or browsed at GenomeArk, GenBank, Ensembl, and UCSC.
Did you know that…?
…the VGP aims to produce a high-quality reference genome assembly for each of the 71,657 named vertebrate species. Currently, about 3 assemblies are produced per week and this will be scaled up to 125 per week to achieve this goal within 10 years.
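As a rough check on these figures, the throughput arithmetic can be sketched in a few lines of Python. This is a back-of-the-envelope illustration only: it assumes a constant weekly rate and 52 working weeks per year, and it ignores both the ramp-up period and the genomes already completed. The function names are hypothetical.

```python
WEEKS_PER_YEAR = 52  # assumption: production runs every week of the year


def years_to_finish(total_species: int, assemblies_per_week: float) -> float:
    """Years needed to assemble every species at a constant weekly rate."""
    return total_species / (assemblies_per_week * WEEKS_PER_YEAR)


def required_weekly_rate(total_species: int, years: float) -> float:
    """Constant weekly rate needed to finish within the given number of years."""
    return total_species / (years * WEEKS_PER_YEAR)


TOTAL_SPECIES = 71_657  # named vertebrate species, as quoted in the text

print(f"At 3 per week:   {years_to_finish(TOTAL_SPECIES, 3):.0f} years")
print(f"At 125 per week: {years_to_finish(TOTAL_SPECIES, 125):.1f} years")
print(f"Rate for 10 years: {required_weekly_rate(TOTAL_SPECIES, 10):.0f} per week")
```

Under these simplified assumptions, the current rate of about 3 per week would take several centuries, whereas 125 per week brings the total to roughly a decade, which is the scale the project is targeting.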
The four phases of the VGP. Adapted from: Towards complete and error-free genome assemblies of all vertebrate species.
…the selection of species for the different phases is based on taxonomic hierarchy and includes particular consideration of a species’ conservation status. Phase I will end with the assembly of the genome of a representative of each order within the vertebrate subphylum, phase II will include a representative of each vertebrate family and phase III will assemble the genomes of representatives of all vertebrate genera before the VGP concludes with phase IV and the assembly of the genome of each vertebrate species.
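The rank-by-rank selection described above can be illustrated with a small Python sketch. It is not the VGP's actual selection procedure: the `Species` record, the `pick_representatives` function, the toy species names and the rule of preferring the most threatened candidate are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumption: more threatened statuses are preferred when choosing a
# representative. Codes follow the IUCN Red List abbreviations.
STATUS_PRIORITY = {"CR": 0, "EN": 1, "VU": 2, "NT": 3, "LC": 4}


@dataclass(frozen=True)
class Species:
    name: str
    order: str
    family: str
    genus: str
    status: str  # conservation status code (toy values below)


def pick_representatives(species, rank):
    """Return one species per distinct value of `rank` ('order', 'family' or 'genus')."""
    chosen = {}
    for sp in species:
        key = getattr(sp, rank)
        best = chosen.get(key)
        # Keep the first species seen, unless a more threatened one turns up.
        if best is None or STATUS_PRIORITY[sp.status] < STATUS_PRIORITY[best.status]:
            chosen[key] = sp
    return chosen


# Toy data with placeholder names, not real taxa.
catalogue = [
    Species("sp1", "OrderA", "FamA1", "GenA1a", "LC"),
    Species("sp2", "OrderA", "FamA1", "GenA1b", "EN"),
    Species("sp3", "OrderA", "FamA2", "GenA2a", "VU"),
    Species("sp4", "OrderB", "FamB1", "GenB1a", "LC"),
]

for phase, rank in [("I", "order"), ("II", "family"), ("III", "genus")]:
    reps = pick_representatives(catalogue, rank)
    print(f"Phase {phase}: {len(reps)} representatives at the {rank} level")
```

Each phase works at a finer taxonomic rank than the one before, so the number of representatives grows from phase to phase until, in phase IV, every species is included.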
…the VGP is setting quality standards in the assembly of genomes and has made recommendations on a minimum set of quality criteria for a high-quality reference genome. An Editorial in Nature Biotechnology covered these in detail in 2018.
…a standard VGP reference genome is assembled using a combination of long-read sequencing, linked-read sequencing, optical mapping and Hi-C data in an automated workflow, and includes a final manual curation step to ensure the highest possible quality. The long reads constitute the basic building blocks of these assemblies, while linked reads, optical maps and Hi-C data provide the ‘scaffold’ information needed to put these building blocks together in their correct order and orientation, organized into the different chromosomes. A polishing step, using more accurate short-read sequencing data, corrects residual errors arising from the long reads. In the final curation step, genome curators inspect the generated ‘draft assemblies’ to identify and correct any anomalies, in order to create the final curated assemblies.
The VGP standard assembly pipeline. Adapted from: Towards complete and error-free genome assemblies of all vertebrate species.
…some of the biological discoveries from the 16 genomes in the flagship paper, and 25 genomes in total from more than 20 papers in the first wave of VGP publications, include:
- A canonical rapid rise in G and C nucleotides in the regulatory regions of protein-coding genes.
- Repeated evolution of chromosomal rearrangements involving immune genes in bats.
- A universal evolution-based understanding of oxytocin and vasotocin and their receptors across vertebrates.
- The evolution of a complex sex chromosome system in monotreme mammals with multiple X and Y chromosomes.
- The unexpected amount of genetic diversity between maternal and paternal chromosomes in a non-human primate, the marmoset.
- Extensive gene duplications in mitochondrial genomes.
The Vertebrate Genomes Project has used an optimized pipeline to generate high-quality genome assemblies for sixteen species (representing all major vertebrate classes), which have led to new biological insights.
A revised, universal nomenclature for the vertebrate genes that encode the oxytocin and vasopressin–vasotocin ligands and receptors will improve our understanding of gene evolution and facilitate the translation of findings across species.
New reference genomes of the two extant monotreme lineages (platypus and echidna) reveal the ancestral and lineage-specific genomic changes that shaped both monotreme and mammalian evolution.
Reference-quality genomes for six bat species shed light on the phylogenetic position of Chiroptera, and provide insight into the genetic underpinnings of the unique adaptations of this clade.
A fully phased, high-quality assembly of the common marmoset genome provides insights into the evolution of sex chromosomes and the conservation of brain-related human disease genes in this primate model for biomedical research.
A new computational method, FALCON-Phase, makes it possible to resolve haplotypes in genome assemblies by using the information from natural intrachromosomal interactions identified by Hi-C, without the need for parental data.
The Vertebrate Genomes Project (VGP) has developed a fully automated pipeline for de novo assembly of mitochondrial genomes and reports the completion of mitogenome assemblies for 100 vertebrate species, which reveal errors and missing sequences in previous mitogenome assemblies.