The evolution and revolution of clinical genomics

The rise of genomics

Resolving the genetic basis of disease seemed like a certainty once the human genome sequence was completed. This new comprehensive map of the body’s operating system meant that the historically difficult task of mapping diseases with evidence of Mendelian inheritance patterns to a causative locus was now readily accessible to the scientific community.

This map also enabled the development of tools with unprecedented capacity to survey genomes at population scale. Microarray-based interrogation of common single-nucleotide polymorphisms (SNPs) revolutionized our understanding of genetic inheritance patterns through the HapMap project1 (Fig. 1), and genome-wide association studies (GWAS) seemed set to unravel the genetic basis of monogenic and complex disease2.

Fig. 1: The rise of genomics at a glance.

DNA sequencing costs (blue) over time compared to the number of publications containing specific phrases in PubMed. Some key events in genomics are shown in green

The increased power and precision of GWAS facilitated the mapping of monogenic traits but highlighted “missing heritability”, where observed inherited traits could not be explained by observed genetic variance. In the clinical genetics field, this was considered likely due to the inability of SNP-chip technology to adequately measure rare variants, structural defects, and polygenic and/or complex inheritance patterns. Furthermore, epistatic interactions, where coinheritance of two or more variants could more adequately explain heritability, continue to be difficult to estimate due to computational limitations3,4. At the same time, next-generation sequencing was poised to further revolutionize approaches to not only survey the genome, but also redefine the understanding of how the genome behaved and the diverse mechanisms through which genetic disease could manifest.

The rise of clinical genomics

Exome sequencing, where the coding regions of genes are enriched from DNA and sequenced, has allowed the direct measurement of variance at the genetic level. It has provided clinicians and researchers with base-scale resolution of the coding genome, giving rise to a quantum leap in the resolving power of associating genetic variation directly with altered protein. The scalability and relatively low cost of the approach have resulted in large databases characterizing and annotating the variation in the coding genome, such as ExAC5. But like many technological advances, exome sequencing has exposed its own limitations, namely technical artefacts of the DNA capture used and overdependence on extant genome annotations. This latter limitation proved significant as the emergence of exome sequencing coincided with the observation of the pervasively transcribed genome6 and the rise of lncRNAs as important functional transcripts, which for the most part were overlooked by the approach.

Importantly, the understanding that the majority of informative variants identified by GWAS occurred within the noncoding genome, and a shift in how the genome was known to encode function through transcribed noncoding regulatory RNAs gave rise to many theories regarding the genetic basis of disease, particularly in providing an explanation for the source of missing heritability. Broadly, theories explaining missing heritability fell into two areas—(1) that variants in regulatory DNA sequences such as promoters, enhancers, and structural elements and regions encoding regulatory RNAs were responsible or (2) that large numbers of individual genetic features, potentially with complex interactions, contributed collectively to inherited traits.

Increasingly inexpensive whole-genome sequencing, particularly with PCR-free library preparations, has made it possible to overcome many of the technical artefacts of exome sequencing, thus yielding high-quality surveys of coding gene variants including SNPs, copy-number and structural variations, and insertion/deletion events. These technical advantages in analyzing coding regions alone have improved the diagnostic yield of genome sequencing and have led to growing numbers of whole-genome clinical sequencing services worldwide.

Whole-genome sequencing consortia producing large databases of genomic variation, such as gnomAD5, the 100,000 Genomes Project7, and the Million Veteran Program8 (Fig. 1), are making their data publicly available for interrogation. As well as assisting in the distinction between rare pathogenic variants and those common in the population, this abundance of data provides measurements of the variation in the 98% of the genome that is non-protein-coding. As a result, the numbers of observed noncoding variants and of protein-coding variants of unknown significance are increasing, and there is now potential to advance the use of the noncoding genome to improve clinical diagnostic rates of genetic disease9.

The challenge

When the Human Genome Project was completed, the implications of the complexity evident in the noncoding genome were staggering10. After more than a decade of research, considerable advances have been made in understanding how the genome instructs the development and function of organisms, and it is increasingly pertinent that this knowledge is harnessed to maximize diagnoses in clinical genomics practice.

In essence, clinical genomics seeks to causatively associate a clinical feature (disease, drug response, risk) with one of the ~5 million variants (relative to a reference genome) present in every individual. This poses the challenge of effectively developing variant filtering algorithms that narrow the search space for variants to regions where pathogenicity can be most clearly determined, i.e., protein-coding regions related to well-described biological function.
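To give a sense of the scale involved, the following minimal sketch counts the hypotheses that would need to be considered for a single genome, assuming the figure of roughly 5 million variants quoted above. The function name and the choice to show single, pairwise, and three-way combinations are illustrative assumptions, not part of any published method.

```python
from math import comb


def candidate_hypotheses(n_variants: int, order: int) -> int:
    """Number of distinct combinations of `order` variants drawn from a genome."""
    return comb(n_variants, order)


# Approximate number of variants per individual, as quoted in the text.
N_VARIANTS = 5_000_000

for order in (1, 2, 3):
    count = candidate_hypotheses(N_VARIANTS, order)
    print(f"combinations of {order} variant(s): {count:.3e}")
```

Under this assumption, pairwise combinations alone number on the order of 10^13, which illustrates why filtering to a small, interpretable search space is the usual first step.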

Typical approaches for interpreting clinical genomes involve reducing a genome down to rare coding variants with the appropriate inheritance patterns in a gene list of interest. This approach typically yields a handful of variants for consideration. Various annotations of variant impact are then added, including predictions of the impact on protein structure (SIFT11, PolyPhen12, and VEP13) and observations in disease (COSMIC14, ClinVar15, and HGMD16). The proliferation of these tools has led to aggregator services such as VarCards17 that allow multiple scores for a given variant to be interrogated in one place. A clinical molecular geneticist, molecular genetic pathologist, or other certified professional can interpret these data to assign a likely causal variant (Fig. 2). If a candidate is not apparent from these approaches, even in cases where there is a strong genetic component, a diagnosis becomes difficult since biochemical testing of variants of unknown significance is not feasible in a typical pathology laboratory setting and may not be considered cost-effective. Furthermore, although expanding the search space to include more variants increases the number of candidates, there is typically insufficient evidence to associate any particular variant with the phenotype.
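The filtering step described above can be sketched as follows. This is a simplified illustration only: the `Variant` fields, the consequence labels, the 1% allele-frequency cutoff, the gene names, and the two inheritance models are all assumptions made for the example, and real pipelines work from annotated VCF files with phased genotypes and family data.

```python
from collections import Counter
from dataclasses import dataclass

# Consequence labels treated as "coding" here; an illustrative, non-exhaustive set.
CODING_CONSEQUENCES = {"missense", "nonsense", "frameshift", "inframe_indel", "splice_site"}


@dataclass(frozen=True)
class Variant:
    gene: str
    consequence: str      # hypothetical label, e.g. "missense" or "intronic"
    population_af: float  # allele frequency in a reference database
    genotype: str         # "het" or "hom" in the patient


def filter_candidates(variants, gene_list, max_af=0.01, inheritance="dominant"):
    """Keep rare coding variants in the gene list that fit the inheritance model."""
    rare_coding = [
        v for v in variants
        if v.gene in gene_list
        and v.consequence in CODING_CONSEQUENCES
        and v.population_af <= max_af
    ]
    if inheritance == "dominant":
        # Simplification: any rare coding variant is a candidate.
        return rare_coding
    if inheritance == "recessive":
        # Homozygous variants, or genes with two or more heterozygous variants
        # (possible compound heterozygotes; phase is not checked here).
        het_per_gene = Counter(v.gene for v in rare_coding if v.genotype == "het")
        return [v for v in rare_coding if v.genotype == "hom" or het_per_gene[v.gene] >= 2]
    raise ValueError(f"unsupported inheritance model: {inheritance}")


# Hypothetical genes and values, for illustration only.
example = [
    Variant("GENE_A", "missense", 0.0001, "het"),
    Variant("GENE_A", "intronic", 0.0002, "het"),
    Variant("GENE_B", "nonsense", 0.2, "hom"),
    Variant("GENE_C", "frameshift", 0.0, "hom"),
]
panel = {"GENE_A", "GENE_B"}
print(filter_candidates(example, panel))
```

Note that the intronic variant in the hypothetical `GENE_A` is discarded at the first step regardless of the inheritance model, so under the recessive model no candidate remains for that gene.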

Fig. 2: How coding and noncoding variation can impact gene function.

Variants (arrows) at a hypothetical locus are shown along with potential functional impacts

Efforts worldwide are attempting to expand the annotation of the genome beyond purely coding regions and to better understand how variation in noncoding regions can have biological impact, in order to expand the understanding of the genetic basis of disease18 and thus fully realize the clinical utility of the whole genome.

Advances in functional annotation of the genome

Resolving the annotation of gene-level variation

The interpretation of disease-associated variation at the level of the gene is undergoing a shift in understanding. Protein-coding mutations have historically been considered deleterious where they lead to truncations (nonsense/deletions), amino acid alterations (missense/in-frame in/del), frame shifts (in/del), and splicing defects (splice-site donors/acceptors). However, these kinds of mutations have been shown to be relatively common, even in healthy genomes19. Furthermore, these variants can be difficult to interpret in a clinical setting if the mutation occurs in a region not previously reported, or in a gene whose function within the context of the disease in question has not been investigated20. It is also becoming apparent that mutations that do not affect the encoded amino acid (synonymous) can affect gene products in the context of codon frequency and RNA structure21,22. Furthermore, the concept of multiplicity, where gene expression can be impacted by combinations of genetic alterations23, is only starting to be addressed. This implies that even the annotation of coding variants is far from complete.

It is also important to note that the coding proportion of a gene comprises a small percentage of the genetic information encoded by the locus and that alterations in the noncoding sequence can have an impact on gene function (Fig. 2). Variation at gene promoters can impact the expression of the gene24, e.g., the TERT promoter is frequently mutated, which leads to overexpression and, in turn, can be a pathogenic basis for causing or driving cancer development25. Variation at imprinted loci can drive the deposition of epigenetic marks responsible for imprinting26, which can lead to aberrant expression. Alterations in 5′ and 3′ untranslated regions of genes can impact transcript stability and translation, primarily through RNA structural alterations27,28. Introns can similarly contain important genetic information that can be influenced by mutation29, e.g., disease-associated SNPs within branch points can be associated with altered splicing patterns30. Together, these investigations show that a significant proportion of the clinically relevant genetic information elucidated by whole-genome sequencing is not typically interpreted in diagnostic laboratories.

Resolving the transcriptional and regulatory landscape of the genome

Ever since the first observation of the pervasively transcribed genome more than a decade ago, there has been an explosion in the identification and functional characterization of long noncoding RNA (lncRNA)31 and other noncoding transcript types32. The Encyclopedia of DNA Elements (ENCODE) consortium raised considerable controversy in 2012 by using tissue-specific transcript profiling, supported by epigenetic profiling of the genome, to suggest that 82% of the human genome was functionally important33. As the vast majority of transcribed species of the genome are noncoding, and little is still known about them31, efforts are ongoing to describe the detail and regulation of noncoding RNA. LncRNAs are of particular interest to the field of clinical genomics as their exquisite tissue-specific expression and regulatory behavior34 indicate that a role in disease will become apparent as more is understood about lncRNA biology.

As a result, several large-scale efforts have been undertaken to comprehensively annotate the noncoding transcriptional landscape, particularly through the FANTOM projects6,35,36, ENCODE33, and Roadmap Epigenomics37. The large-scale GTEx project38 has set out to further understand the genetic drivers of tissue-specific gene expression via expression quantitative trait loci (eQTL) analysis. Large-scale screens for noncoding RNA function have elucidated functional annotations for thousands of lncRNAs39, and the development of molecular tools tailored to the unique biology of lncRNAs is ongoing40. These efforts have enhanced the understanding of gene transcription and hint at a complexity that requires expanded resolution of functional annotation at the genetic level to inform interpretation in a clinical diagnostic setting.

Interpreting functionality at the whole-genome level

Traditional indicators of functionality (and thus of potential clinical utility), such as conservation, have thus been challenged by this expanding annotation of the genome. The volume of available data has fueled recent computational efforts to annotate functional parts of the genome without necessarily depending exclusively on the coding genome (Table 1). Early attempts used existing annotations to train computational models that could assess the potential function of a variant genome-wide (CADD41/GWAVA42). Newer approaches have used genome-wide data itself to assign functional importance, either through association with DNA binding proteins (Eigen43), or direct measures of resistance to variation (Orion44), to provide comprehensive maps of coding and noncoding regions likely to be impacted by variation. These maps are expanding the pool of potentially clinically relevant variants and continue to evolve with growing interest and innovation.

Table 1 Genome-wide tools for estimating impact of noncoding variation

Noncoding variation and disease

Structural alterations

The physical arrangement of the genome is also critical to homeostasis. Copy-number alterations are associated with many diseases, but can also have no pathogenic effect45,46. The study of disease-associated genomic translocations has typically focused on the generation of gene fusions, which are particularly clinically relevant in cancer47. However, studies have shown that intergenic translocations can also perturb local gene expression, possibly by interrupting chromatin looping and by rearranging regulatory sequence48,49,50. Moreover, chromatin looping51 and nucleosome occupancy52 are also susceptible to alteration by DNA mutation and structural rearrangement.

Localized DNA structures have been associated with genetic disease such as Huntington’s disease, mostly as recognition sites for genomic rearrangements53. However, such quaternary structures have recently gained traction as important mediators of biological information in themselves, with left-handed helices (Z-DNA54), G-quadruplexes55,56, and DNA:DNA/DNA:RNA triplexes57,58 showing evidence of regulatory function. Indeed, the physical state of the DNA appears to be intimately associated with the process of gene expression59 and transcription factor binding60. Importantly, it was recently shown that disease-associated variations that disrupt G-quadruplex formation in RNA can affect post-transcriptional regulation of genes27, suggesting that variants in structural features can directly impact cellular function.

Noncoding transcription at GWAS loci

The prevalence of intergenic, disease-associated SNPs from GWAS provoked diverse studies into how these variants were contributing to disease, revealing impacts on DNA conformation51, DNA-protein interactions61, and epigenetic marks62. Recent application of RNA-capture sequencing63 to haplotype blocks associated with GWAS disease-associated SNPs revealed a multitude of transcripts, of which fewer than half were in extant transcript databases64. Combined with fine mapping of SNPs associated with breast cancer, this approach revealed enhancer alterations affecting novel transcript expression65. These studies raise the possibility of direct and indirect impacts of disease-associated SNPs on tissue-specific transcription patterns and illustrate that both the resolution of disease-associated variants and genome annotation remain incomplete. The ongoing accumulation of whole-genome data worldwide will eventually resolve the exact disease associations, and a greater understanding of the noncoding transcriptome will continue to provide context for elucidating the impact of these variants34.

New classes of functional repeats in the human genome

In a similar vein, pseudogenes have classically been regarded as nonfunctional byproducts of retrotransposition66. With the observation of transcription and evidence of disease linkage67, pseudogene biology is being revisited; however, consensus as to a generic biological role has not yet been reached68,69. Indeed, the role of retrotransposition itself in shaping the genome is undergoing a renaissance through evidence of gene regulatory roles70.

A place for noncoding annotations in clinical genomics

Rules of evidence

In 2015, the American College of Medical Genetics and Genomics (ACMG) described a set of evidence lines that could be used to ascribe degrees of pathogenicity to a particular variant71. Importantly, these recommendations sought to distinguish deleterious impacts on a gene from contribution to disease. Predicting the impact of coding variation is a more mature process, especially in the case of missense and nonsense mutations. Tools like PolyPhen and VEP are commonly used to estimate genic pathogenicity, although the likely impact of the variant can be open to interpretation. Evidence for disease contribution is usually obtained by cross-referencing rare variants with lists of genes with known roles in the disease of interest, reports in the literature, and clinical databases such as COSMIC and ClinVar. The point at which there is sufficient evidence of a variant causing a disease is becoming refined20. However, due to the complexities in the WGS data, interpretation, and phenotyping, associations can be subject to how the data are evaluated by genetic professionals and can still require in vitro testing. Including non-protein-coding variants in this framework would add complexity, predominantly due to the lack of functional data to support the impact of a particular variant with precision, given the ongoing genome annotations outlined above (Fig. 3). However, noncoding variants can clearly be clinically relevant and their inclusion in clinical genomics frameworks is necessary for realizing the full clinical utility of genomic information.

Fig. 3: The challenge of assigning variant impact in a complex genome.

Assigning variants (red arrows) at a hypothetical locus where protein-coding transcripts (blue), lncRNA (green), and regulatory regions (magenta) are incorporated

A framework for noncoding inclusion in clinical genomics

The clinical interpretation of variants typically begins strictly as an informatics exercise where variants are filtered and ranked according to likelihood of clinical trait association. One of the earliest steps is to omit variants that are noncoding, which, in light of the evidence outlined above, may miss vital insights into the molecular basis of a disease. To address this limitation, resources that estimate noncoding impact, such as the GTEx eQTLs and the tools outlined in Table 1, should be integrated into existing variant interpretation frameworks such as GEMINI72. While fewer data are available for accurately calculating variant frequency in noncoding regions, growing whole-genome reference databases are now available for this purpose. These annotations can then be interpreted alongside existing lines of evidence within the context of disease.

The primary paradigm shift required by these additions to clinical genome interpretation workflows will be the expansion of the concept of what part of the genome constitutes a gene. Impacts on a specific gene function can theoretically occur anywhere within the genome. This represents a currently insurmountable computational obstacle for the same reason that epistasis remains an intractable issue in genomics. However, splicing and promoter variations are directly linked to genes and are currently well annotated. For this reason, we propose that variants occurring at splice sites and branch points, as well as at promoters annotated by ENCODE, should be included in clinical genomics where they occur in disease-relevant genes. We expect that a more inclusive approach to impacts on gene function will facilitate an improved picture of the clinical landscape, particularly in the case of disease with strong evidence of inheritance where no coding candidate can be found. For example, a promoter variant may be the second hit at a heterozygous locus in a recessive disorder, leading to total loss of a gene product. Furthermore, as our knowledge of the biology of the genome grows, more interpretative power will become available in the context of clinical genomics. We contend that the potential to improve diagnostic rates using a multi-level whole-genome annotation approach will outweigh the necessary increased time for manual variant review and ruling out of false positives.
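The proposal above can be illustrated with a minimal sketch that retains noncoding variants only where they overlap gene-linked regions (promoters, splice sites, branch points) of disease-relevant genes, and then flags genes where such a variant could act as a second hit alongside a heterozygous coding variant. All names, coordinates, and genes here are hypothetical; in practice the regions would be derived from resources such as ENCODE promoter annotations, and phase would need to be confirmed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    gene: str
    kind: str   # "promoter", "splice_site", or "branch_point"


@dataclass(frozen=True)
class NoncodingVariant:
    chrom: str
    pos: int       # 0-based position
    genotype: str  # "het" or "hom"


def gene_linked_hits(variants, regions, disease_genes):
    """Pair each variant with the gene-linked regions it overlaps in disease genes."""
    hits = []
    for v in variants:
        for r in regions:
            if r.gene in disease_genes and r.chrom == v.chrom and r.start <= v.pos < r.end:
                hits.append((v, r))
    return hits


def second_hit_candidates(coding_het_genes, hits):
    """Genes with a heterozygous coding variant plus a heterozygous gene-linked noncoding variant."""
    return sorted({r.gene for v, r in hits if v.genotype == "het" and r.gene in coding_het_genes})


# Hypothetical annotations and variants, for illustration only.
regions = [
    Region("chr1", 1000, 1500, "GENE_A", "promoter"),
    Region("chr1", 5000, 5002, "GENE_A", "splice_site"),
    Region("chr2", 200, 700, "GENE_Z", "promoter"),
]
variants = [
    NoncodingVariant("chr1", 1200, "het"),
    NoncodingVariant("chr1", 3000, "het"),
    NoncodingVariant("chr2", 300, "het"),
]
hits = gene_linked_hits(variants, regions, disease_genes={"GENE_A"})
print(second_hit_candidates({"GENE_A"}, hits))
```

The nested loop is adequate for a gene panel but scales poorly; a genome-wide implementation would more plausibly use an interval index. Restricting the search to gene-linked regions is what keeps the number of additional candidates manageable for manual review.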

The future

Understanding the genetic basis of disease has been an aim of science since heritable traits were first observed. Technological and conceptual progress has given rise to a picture of the genome that is as complex as one would expect from a four-letter code that gives rise to living multicellular organisms. Research is currently at the point of attempting to describe and unravel this complexity, as discussed above. We expect that tools and knowledge of the noncoding genome will continue to expand and that continued refinement of an integrated coding and noncoding genomic landscape through comprehensive genomic, transcriptomic, and epigenomic profiling will improve the prediction of variant outcomes. The computational issues of epistasis and polygenic impacts will become more tractable as more data are generated and more powerful computational frameworks emerge, such as quantum computing, to enable large combinatorial calculations that are currently infeasible. These will go hand in hand with more widespread adoption of moderate-throughput screens for rapid and direct measurement of the impact of candidate variants, such as CRISPR-Cas9 tools in patient-derived iPS cell lines. It will be important for clinical scientists involved in variant interpretation to remain mindful of the growing clinical significance of the whole genome, and for developers of software and knowledgebases used to inform variant interpretation to consider non-protein-coding data sources and algorithms that act on noncoding genomic regions in their workflows.