There is great potential for genome sequencing to enhance patient care through improved diagnostic sensitivity and more precise therapeutic targeting. To maximize this potential, genomics strategies that have been developed for genetic discovery — including DNA-sequencing technologies and analysis algorithms — need to be adapted to fit clinical needs. This will require the optimization of alignment algorithms, attention to quality-coverage metrics, tailored solutions for paralogous or low-complexity areas of the genome, and the adoption of consensus standards for variant calling and interpretation. Global sharing of this more accurate genotypic and phenotypic data will accelerate the determination of causality for novel genes or variants. Thus, a deeper understanding of disease will be realized that will allow its targeting with much greater therapeutic precision.
Precision medicine describes the definition of disease at a higher resolution by genomic and other technologies to enable more precise targeting of subgroups of disease with new therapies. Prominent examples include cystic fibrosis and cancer.
Clinical genomics exists at the intersection of sequencing-led discovery genetics in population cohorts and historical low-throughput approaches to genetic diagnosis in patients. As a result of the different aims of these two endeavours, technologies and algorithms that have been developed for discovery genomics need to be optimized before application to clinical medicine.
Areas of need include the improvement of sequencing technologies. Current short-read approaches are limited in areas of the genome of low complexity (such as repeats), regions of high GC content, regions that are highly polymorphic or that include small-scale (indel) or large-scale (structural variant) disruption of the open reading frame.
Possible routes to such improvements include long-read sequencing, improved algorithms for indel and structural variant calling, graph reference approaches and standardization of nomenclature.
One area that requires specific attention is the quality and coverage of sequence data for clinical genetic testing. In general, the emerging consensus standard is that the coding regions of interest (plus two base pairs on either side) should be covered by 20 high-quality (Q20) reads that are uniquely mapped.
To improve assertions of the disease causality of genetic variants, data sharing of both phenotypic and genotypic information across communities will be required. Projects such as ClinGen and its associated database ClinVar represent an important step in this direction. Large-scale population sequencing projects such as the UK Biobank and the US Precision Medicine Initiative Cohort Program will enhance our understanding of population-scale genetic variation in a way that optimizes our care of the individual with genetic disease.
The sequencing of the human genome led many to speculate on the near-term potential for clinical medicine1. Understanding the genetic basis of disease was naturally expected to lead to better targeted therapies. Indeed, the steep decline in the cost of sequencing, pursuant to the invention of 'next-generation' technologies, facilitated the discovery of many more causative genes2,3 and, more recently, application to individual patients, including several widely reported examples of genome-driven medical decision making4,5,6. Pilot studies explored the use of genomic information more broadly in patient care7,8,9 and the US National Human Genome Research Institute (NHGRI) laid out a 20-year plan for translating insights from genomics to medicine10,11. Additionally, direct-to-consumer companies put genotypes in the hands of interested participants12. However, the brightest spotlight was provided in 2015 by President Obama in his State of the Union address where he laid out a vision for a national Precision Medicine Initiative in the United States13,14.
The term 'precision medicine' (Box 1) was first given prominence by a publication from the US National Research Council that sought to inspire a new taxonomy for disease classification via a knowledge network15. In the appendix of that publication, the authors clarify that the term was coined, in preference to the more commonly used 'personalized medicine', to convey the principle that although therapeutics were rarely developed for single individuals, subgroups of patients could increasingly be defined, often by genomics, and targeted in more specific ways. Worldwide internet searches for the term increased dramatically after the State of the Union address and have remained at levels similar to those for 'personalized medicine' ever since (Fig. 1a).
The timing does seem right for a new approach: genomic data are more readily available, we have a greater understanding of population-scale genetic variation16,17, and approaches to data integration with electronic medical records will lead to much improved characterization of phenotypes18. However, for precision medicine to succeed it also needs to be more accurate. The current algorithms for genome analysis were developed for population or cohort variant discovery where the consequences of reduced accuracy are a lost opportunity for discovery. By contrast, an inaccurate clinical genetic test could lead to very serious consequences for individuals and families with genetic disease. In this Review, I describe promising applications of precision medicine as it currently exists then move on to discuss the challenges our community needs to face, in the areas of sequencing technology, algorithm development and data sharing, to bring genomics up to clinical grade.
Promising applications of precision medicine
Cystic fibrosis. In the State of the Union address, President Obama specifically gave as an example the drug ivacaftor, which was developed for patients with cystic fibrosis. Cystic fibrosis is an autosomal recessive disease that affects approximately 70,000 people worldwide and that is caused by variants in the cystic fibrosis transmembrane conductance regulator (CFTR) gene. The protein product of this gene is an epithelial ion channel located on the cell surface where it regulates cellular chloride transit. Mutations of CFTR cause abnormal regulation of salt and water, which particularly affects the function of the lungs, pancreas and sweat glands. Recurrent pulmonary disease and resistant infection represent the major therapeutic challenges of cystic fibrosis, and traditional therapies have focused entirely on the secondary consequences of the disease. Genetic understanding of cystic fibrosis has facilitated its categorization into molecular subgroups (Fig. 1b). In some subgroups, the channel reaches the cell surface but there is insufficient ensemble channel activity, whereas in other subgroups, defective trafficking leaves the channel in the cell cytoplasm. The oral agent ivacaftor was designed to increase the opening time of activated CFTR channels at the cell surface. Thus, for patients with mutant channels that do not reach the cell surface, ivacaftor would have minimal effect, whereas in patients with channels that are adequately transported, effect sizes for the improvement of pulmonary function could be dramatic. This was the case for the 5% of patients with the G551D mutation who were initially targeted19,20. A newer approach, which was recently approved by the US Food and Drug Administration (FDA), includes the use of ivacaftor and a second agent, lumacaftor, that improves the intracellular processing and delivery of the mutant channel21. This is particularly important for the 85% of patients with cystic fibrosis who have the most common genotype, F508del. For these patients, the mutant channel protein is misfolded, which leads to intracellular degradation. However, if there is proteasomal escape, the protein reaches the cell surface but with a gating abnormality that is similar to G551D. Thus, a combination approach may be optimal for these patients21,22. In this case, detailed understanding of the genetics of cystic fibrosis allows much more precise targeting of specific agents to individuals with specific functional defects.
Precision oncology. Another major area of promise for precision medicine is oncology. Traditional approaches to the classification of solid tumours focused on the tissue of origin. However, since the early success of the ABL1 kinase inhibitor imatinib for chronic myeloid leukaemia (which is driven by a BCR–ABL1 fusion protein), oncology has moved towards molecular classification. Crucial to this recognition of cancer as a genetic disease was the discovery of the central role of somatic mutation of genes that are involved in DNA repair, cell division and apoptosis. Genomic characterization has in fact been standard of care for some time for lung adenocarcinoma: testing for specific epidermal growth factor receptor (EGFR) mutations and anaplastic lymphoma receptor tyrosine kinase (ALK) rearrangements allows the personalization of therapy with targeted kinase inhibitors, such as gefitinib for EGFR and crizotinib for ALK23,24. Similarly, BRAF inhibition in BRAF-mutant melanoma25 was a much heralded early application of precision targeting, but like many attempts to target 'driver' mutations with specific agents, the overall duration of response was disappointingly short owing to the acquisition of secondary resistance through additional somatic events.
A newer approach with the potential for longer term effects is the harnessing of the immune system26 (Fig. 1c). Tumours present antigens in the form of oncogenic viruses, fetal developmental proteins or neoantigens that are formed by somatic mosaicism27. Initial attempts to harness T cell responses to such antigens through vaccination were disappointing but led to a greater appreciation of the importance of the antigen-presenting cell and co-stimulation with, for example, CD28. This led to the identification, not only of the critical steps required for T cell activation, but also of the autoinhibitory pathways mediated by the checkpoint receptors cytotoxic T lymphocyte-associated protein 4 (CTLA4), programmed cell death 1 (PD1; also known as PDCD1) and others. Antibodies to these proteins were rapidly developed and clinical trials of 'immune checkpoint therapy' found broad success in various tumours with, in some cases, prolonged effects28. It is speculated that the sustained effects of these therapies are mediated by memory T cells.
Most recently, trials combining genomic targeting with checkpoint therapy have begun26. In fact, genomic approaches, which have been greatly facilitated by resources such as The Cancer Genome Atlas (TCGA)29, can also enable checkpoint targeting in other ways: RNA sequencing can confirm the expression of the checkpoint ligand in the tumour and the checkpoint receptor in the T cell. In addition, newer computational approaches to detecting neoantigens are beginning to show success30. Indeed, a seminal example of personalized tumour therapy is to combine a neoantigen-led vaccination strategy with the detection of circulating tumour cells and cell-free DNA from tumour cells in plasma31,32.
Pharmacogenomics. Pharmacogenomics was perhaps the earliest application of personalized medicine. Trials of genotyping VKORC1 (which encodes the enzyme that recycles vitamin K, a cofactor required for the activation of several blood clotting factors) and CYP2C9 (a member of the cytochrome P450 drug-metabolizing enzyme family) to optimize warfarin dosing led to some success, including approaches to automated dose estimation33. Indeed, the FDA embraced the possibility of such testing with black box warnings that encouraged the use of genetic testing where possible. However, some debate regarding cost-effectiveness34,35 and the lack of readily available genomic information on large numbers of patients left these potentially valuable tools in the hands of only a small number of clinics, while pharmaceutical companies worked to develop drugs with alternative pharmacokinetics that did not require companion diagnostics. A similar situation emerged for clopidogrel, which is an anti-platelet agent used to prevent coronary artery stent thrombosis. A common loss-of-function polymorphism in CYP2C19 (*2), which is present in up to 35% of individuals of European and African ancestry and 60% of individuals of Asian ancestry, is associated with the reduced conversion of the pro-drug to the active metabolite7. Large studies showed adverse outcomes in poor metabolizers following coronary stent placement procedures36,37 but other studies in different contexts did not show major effects on outcomes38. This was a confusing message for the cardiovascular community39,40, and despite the development of point-of-care diagnostic monitoring41 and a recommendation in the form of a black box warning from the FDA, the availability of platelet activation assays and newer agents that are not metabolized by this pathway42 led to limited use.
However, the promise of pharmacogenomics remains very great as it could apply to every individual taking any medication. Indeed, there have been some estimates that 98% of people carry a high-risk pharmacogenomic diplotype43. Catalysed by carefully curated knowledge bases such as the Pharmacogenomics Knowledgebase (PharmGKB)44,45, professional guidelines already detail many potential uses46. However, for pharmacogenomics to succeed broadly, genotype information that is relevant to drug metabolism needs to be available at the time of prescribing, which usually means a priori genotyping. Several major centres have deployed systems to enable this genotyping47.
In summary, applications of genomics to genetic diseases such as cystic fibrosis and cancer, as well as for pharmacogenomics sit within a broader landscape of promise for the application of genomics to medicine (Table 1). Applications that are further from routine medical application, such as microbiome sequencing and predictive analytics for common variants in complex disease, as well as some targeted approaches already in clinical practice, including non-invasive prenatal testing, are not discussed in this Review.
The US Precision Medicine Initiative
One of the central features of the US Precision Medicine Initiative is the establishment of a 1-million-person cohort of individuals willing to contribute their partnership and data for scientific discovery13,14. This was not, in fact, the first time either President Obama or the Director of the US National Institutes of Health, Francis Collins, had suggested such an idea. As Senator for Illinois, USA, Barack Obama introduced the Genomics and Personalized Medicine Act of 2006 (Ref. 48) that included planning for a national biobanking initiative. Meanwhile, Francis Collins had called for a large-scale prospective cohort study of genes and environment as early as 2004 (Ref. 49) to mirror those of the United Kingdom50, Iceland, Denmark, Canada, Germany and others. In particular, Collins pointed out the advantages of studying the natural history of disease. One feature that seems particularly prominent in the planning of the US Precision Medicine Initiative was the idea of including participants as partners and connecting participants and researchers via mobile technology devices. Such devices could be used for more sophisticated phenotyping or to monitor large populations at risk for disease51.
The convergence of discovery and clinical genetics
Human discovery genetics and clinical genetics began together with family pedigrees and descriptions of inheritance in the absence of knowledge of the molecular cause. The advent of increasingly dense genome markers facilitated the first examples of forward genetics: positional cloning by linkage analysis of pedigrees followed by the discovery of causative genes and variants in those linkage regions. Fuelled by the HapMap project52, the characterization of common variation at a genome-wide scale became possible when hundreds of thousands of markers could be simultaneously analysed on microarray platforms. When next-generation sequencing first became tractable, it was applied as low-coverage sequencing in large populations for single nucleotide variant (SNV) discovery (for example, the 1000 Genomes Project)53. These approaches were successful in the discovery of robustly replicable associations between traits and SNVs of small effect in mostly non-coding regions of the genome54.
Meanwhile, in clinical medicine, diagnostic testing has historically focused on karyotyping to detect chromosomal abnormalities or fluorescence in situ hybridization for large-scale rearrangements. The association of genes to diseases and the facilitation of knowledge through curation in databases such as Online Mendelian Inheritance in Man (OMIM) led to an era of Sanger sequencing of the coding regions of small numbers of genes. If a rare or disrupting variant was not found in a control group of typically 100 Caucasian blood donors, it was deemed important. Meanwhile, crossing over from the discovery world, the microarray was the first high-throughput technology to truly affect medical genetics, offering the detection of deletions and duplications at increased resolution55 and leading to the possibility of a genome-wide test that could be used for undiagnosed disease where no single candidate gene was identified56,57. In a similar manner, laboratories have extended gene panels using next-generation sequencing approaches to include many more genes, even including some for which gene–disease causality is less well established.
History, then, draws an interesting contrast between a clinical genetic testing community that was focused on large-scale disruptions to the open reading frame of genes, and an emerging population genetics community, who first defined themselves through genome-wide common SNV associations with complex disease. The excitement of the present era of precision medicine is driven by their convergence. This convergence was exemplified by a NHGRI-sponsored workshop that brought together clinical geneticists, population geneticists, genetic epidemiologists and statistical geneticists to agree on a framework for the determination of causality for sequence variants in human disease58.
Making genomics more precise and accurate
Implicit in the term precision is an approach to genomics that includes accuracy. Although the formal definitions of precision and accuracy are distinct (Box 1), the use of the term by the US National Research Council panel was intended to convey both meanings15. Semantics aside, there is clearly nothing more important to precision medicine than accurately representing the genomes of individual patients or their tumours59 (Fig. 2). Key challenges to the attainment of accuracy in genomic medicine are described below along with their medical relevance and possible solutions.
Achieving accuracy: anatomy of the genome. The human genome has historically been defined by the reference sequence. The product of the publicly funded human genome project, the human reference was derived from the DNA of more than 50 individuals from whom clones representing single haplotypes were sequenced by a shotgun approach and then patched together in one haploid sequence60. Although the largest contribution probably came from one African American individual, this was an ethnically diverse group and so the reference genome switches from one ethnic haplotype to another at multiple places.
The newest human reference assembly GRCh38 was the result of many years of meticulous work from the Genome Reference Consortium. It adds 178 regions with 261 alternative loci60 and 150 genes that were not previously represented. The genome itself (GRCh38.p5)61 is 3.23 billion bases long, with 51,087 genes and pseudogenes (GRCh38.2), of which 20,576 are protein-coding genes, although some algorithms estimate that this number may be as low as 19,000 (Ref. 62). The genes vary enormously in size from 8 base pairs (a transfer RNA) to 2,473,559 base pairs (the CNTNAP2 gene encoding the CASPR2 protein). The genes may have as few as 1 exon (for example, a gene encoding a G-protein-coupled receptor) or as many as 363 exons (titin). In the original assembly, there were 198 Mb of heterochromatin gaps and 28 Mb of euchromatin gaps63. The GC richness of the genome, important as a challenge to DNA capture and sequencing chemistry, varies dramatically: first exons generally have a higher GC content than the overall average of around 40%63. The functional importance of GC-rich regions is driven by CpG motifs. These are thought to be particularly sensitive to mutation and are clustered in islands near the 5′ end of genes.
A major challenge to the accurate representation of the genome takes the form of repeated sequence, which represents more than 50% of the genome64,65,66,67. Common types of repeats include segmental duplications, simple repeats, short tandem repeats (recently shown to have an important role in gene expression68), transposon-derived repeats and processed pseudogenes.
Genome anatomy: challenges to clinical diagnosis. Much of this genomic complexity is only challenging because of the prevailing technology used to assess it: short-read sequencing. With extensive paralogy, originating in gene families, segmental duplication or pseudogenes, the genomic location of many short reads cannot be determined with confidence67. With simple repeats, the challenge is different. If the overall length of the repeat region is shorter than the read length, it is possible to resolve length by local re-assembly. However, if the repeat tract is longer than the read length, the length of the repeat region is very challenging to discern. Yet, important genetic diseases are encoded by simple repeats that expand because of the instability of the resulting secondary structures during replication. Indeed, most repeat tracts are pathogenic in a range greater than the typical size of a short read (100–250 base pairs)69,70 (Fig. 3a). For example, in Huntington disease, the risk begins at 40 trinucleotide CAG repeats (120 base pairs) in HTT and increases from there.
Highly polymorphic regions also cause major challenges for short-read sequencing. The prototypical region is the major histocompatibility complex (MHC) that encodes human leukocyte antigens (HLAs). This is a 3.6 Mb segment on chromosome 6p21 that contains more than 100 genes of which six are the basis of the most commonly reported immune typing. The HLA region is fundamental for our definition of self and is associated with more than 100 diseases and many drug reactions, including some that are potentially fatal — for example, carbamazepine-associated toxic epidermal necrolysis71,72 and abacavir-induced liver injury73. However, the MHC is challenging to resolve using only short-read approaches because of the lack of a comprehensive catalogue of haplotypes and the intrinsic lack of phase information — that is, knowledge of the parental chromosome of origin — in short reads74.
This lack of phase information is challenging in other clinically relevant situations, for example, the demonstration of compound heterozygosity, where two variants are found in the same gene. Knowing whether a mixture of doubly mutated and wild-type protein is expressed versus whether two singly mutated proteins are expressed is a critical distinction75. This is important in pharmacogenomics for which current standard practice is that combinations of variants mapped from association study evidence are assumed to be in trans. A related example of compound heterozygosity that can be resolved with a simple algorithm is the multiple nucleotide variant (MNV) where two variants appear at consecutive positions (Fig. 3b) — understanding the consequences for protein coding from each gene copy requires a variant-calling algorithm that distinguishes phase. MNVs seem to be particularly frequent in cancer76.
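Why phase matters for an MNV can be made concrete with a small sketch. The example below takes a reference codon and two single nucleotide variants at consecutive positions within it, and then reports the codons and amino acids carried by each gene copy under the two possible phasings. The codon, the variants and all function names are hypothetical and chosen purely for illustration; only the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def apply_snvs(codon, snvs):
    """Return the codon after applying (offset, alt_base) substitutions."""
    bases = list(codon)
    for offset, alt in snvs:
        bases[offset] = alt
    return "".join(bases)


def haplotype_consequences(ref_codon, snv_a, snv_b, in_cis):
    """List (codon, amino_acid) for the two gene copies.

    in_cis=True  -> both variants sit on the same copy; the other is reference.
    in_cis=False -> each copy carries exactly one of the two variants.
    """
    if in_cis:
        haplotypes = [apply_snvs(ref_codon, [snv_a, snv_b]), ref_codon]
    else:
        haplotypes = [apply_snvs(ref_codon, [snv_a]),
                      apply_snvs(ref_codon, [snv_b])]
    return [(h, CODON_TABLE[h]) for h in haplotypes]


# Hypothetical example: reference codon AAG (lysine) with A>T at the first
# position and A>G at the second position.
REF = "AAG"
SNV_1 = (0, "T")
SNV_2 = (1, "G")
CIS = haplotype_consequences(REF, SNV_1, SNV_2, in_cis=True)
TRANS = haplotype_consequences(REF, SNV_1, SNV_2, in_cis=False)
```

In this illustrative case, annotating each variant separately would suggest one stop-gain (TAG) and one missense change (AGG, arginine). If the variants are in cis, however, the affected copy actually carries TGG (tryptophan) and the other copy is unchanged, so no premature stop is present at all.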
Long-read sequencing approaches, which involve either barcoding fragments of longer molecules for short-read sequencing and subsequent in silico reassembly77,78 or direct sequencing of the longer molecule79, can theoretically provide answers to many of these currently unsolved challenges80. Long-read sequencing facilitates de novo assembly that automatically provides phase information81,82,83. It improves the likelihood that any given structural variant will be sequenced with a localizing non-duplicated region. Tracts of simple repeats that are even thousands of base pairs long can theoretically be captured. Read lengths for the currently available long-read technologies now have a median in the 5–10 kb range, with a long tail reaching to tens of thousands of base pairs or more84,85, while short-read reconstruction approaches can have median haplotype blocks as large as several megabases78. In addition, such sequencing provides a more complete picture of the genome. Recently, interstitial euchromatic gaps in the human reference genome (GRCh37) were closed by a long-read sequencing method79. These gaps were identified as predominantly long runs of short tandem repeats embedded within GC-rich regions. In a second approach, combining two long-read technologies improved the length of assembly scaffolds and structural variant detection. Such haploid79 or diploid86 approaches demonstrate previously unrecognized genomic complexity, particularly in structural variation. Unfortunately, however, long-read approaches remain between one and two orders of magnitude more expensive than short-read approaches80 and also require larger amounts of DNA, partly to overcome a high error rate, delaying their widespread adoption for human genome sequencing.
Quality scores and compression. As more and more individuals are sequenced as part of clinical medicine, there will be an increasing need for long-term storage and retrieval of their data. Indeed, some have estimated that the data size of genomics will surpass that of online video and particle physics87, making this a major challenge for precision medicine. Some compression methods encode differences from a reference sequence, while others focus on quality scores88. Each base that is called by a sequencing machine has an associated quality score. Traditionally, these values have been reported in a format known as 'Phred', which was originally derived from chromatogram traces of early sequencers89,90. The score q is expressed as −10 times the base-10 logarithm of the probability of error p: $q = -10\log_{10}(p)$.
The most common cut off is Q20, which corresponds to a 1-in-100 chance that a base call is incorrect. Notably, this score is calibrated by each sequencing vendor according to internal protocols. The scores, however, represent a large amount of data in an alignment file. And this provides an opportunity for compression. In fact, several approaches to the compression of genomic data, both lossy compression and lossless compression, have been proposed91,92,93. Some have even experimented with mapping these scores to a single byte (8 bits). Although compression of the reads themselves provides modest gains, compression of the quality scores offers much greater potential.
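The relationship between a Phred score and its error probability, and the idea behind lossy compression of quality scores, can be sketched in a few lines. The sketch below assumes the widely used FASTQ convention in which each score is stored as an ASCII character offset by 33; that encoding, the function names and the bin edges are assumptions made for illustration and are not taken from any particular vendor's protocol.

```python
import math


def phred_from_error(p):
    """Phred quality q = -10 * log10(p) for an error probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return -10.0 * math.log10(p)


def error_from_phred(q):
    """Inverse mapping: error probability p = 10 ** (-q / 10)."""
    return 10.0 ** (-q / 10.0)


def decode_qualities(quality_string, offset=33):
    """Decode an ASCII quality string (assumed Phred+33) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def bin_qualities(scores, edges=(2, 10, 20, 30, 40)):
    """Lossy compression sketch: snap each score down to a coarse bin edge.

    The edges are hypothetical. Scores below the lowest edge are raised to it.
    Fewer distinct values means longer runs and a more compressible stream,
    at the cost of discarding fine-grained quality information.
    """
    return [max((e for e in edges if e <= s), default=edges[0]) for s in scores]
```

For example, `phred_from_error(0.01)` gives Q20 and `error_from_phred(30)` gives a 1-in-1,000 error rate, while `bin_qualities` reduces a stream of up to 40 or more distinct values to five.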
Alignment and assembly. The output from sequencing is a large text file of short or long reads along with their quality scores. Deriving a complete picture of a single human's genome from these 'raw' reads requires assembly and comparison with a reference genome. Typically, short reads are aligned to a reference genome using an algorithm that searches for the best match. First principles might suggest advantages to de novo assembly (assembling the genome by overlapping the reads without the aid of a reference sequence) using methods such as the De Bruijn or string graph83,94. However, de novo assembly, particularly of short reads, is computationally intense and impractical for clinical genome sequencing79. Currently, the vast majority of human exome and genome sequences are aligned to a reference sequence. The reference sequence itself has been the focus of some concern because it was derived from a pool of individuals and, as such, contains risk variants. In addition, it does not accurately represent longer range haplotypes owing to the switching between reference individuals in some regions95. Mapping quality will also be poorer in regions of variation95.
Several algorithmic approaches to optimal alignment exist. One approach takes advantage of dynamic programming to yield an exact match for pairwise local or global alignment. It involves the generation of a similarity matrix of two sequences, in which a score or penalty is awarded for each match, mismatch or gap, followed by a traceback step that recovers the optimal alignment (starting, in the local case, from the highest-scoring matrix cell). This was originally proposed by Needleman and Wunsch96 and was later adapted for local alignment by Smith and Waterman97. The approach is computationally expensive but maximizes the sensitivity and specificity of downstream variant calling, especially with respect to gapped alignment98. Several methods for speeding up these algorithms have been recommended; for example, using graphics-processing units99.
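A minimal sketch of the local (Smith–Waterman-style) form of this dynamic programming approach is given below. The scoring values (match +2, mismatch −1, linear gap −2) are arbitrary assumptions chosen for illustration; production aligners use affine gap penalties and heavily optimized implementations, so this should be read as a teaching example rather than a clinical-grade method.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for a local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # Similarity matrix; the first row and column stay at zero.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill step: each cell is the best of diagonal, up, left, or a restart at 0.
    for i in range(1, rows):
        for j in range(1, cols):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub(i, j),
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback step: start at the highest-scoring cell, stop at a zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

The matrix has (len(a) + 1) × (len(b) + 1) cells, which is why the cost grows with the product of the sequence lengths and why the approach is considered computationally expensive at genome scale.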
Despite these advantages, a compression heuristic became the most commonly applied approach to alignment for human genomes100,101. This approach is based on the Burrows–Wheeler transform (BWT), which is closely related to the suffix array102, a representation of sub-strings in a format that is efficient for searching and compression. Similar to many compression approaches, the BWT aims to group similar letters: it sorts lexicographically so that identical letters tend to cluster into runs, each of which can then be stored as the letter and the number of times it is repeated before changing. Importantly, what this approach offers over a simple sort is the possibility of inversion. That is, the original sequence can be recreated from the compressed output. For both compression and alignment use cases, reversing the transform is crucial. Notably, although much more rapid in a cohort discovery context, the BWT is less optimal in a single-patient ('N = 1') clinical context. Although certain contexts demand speed103,104, in most cases accuracy is primary for clinical genomics and an exact match global alignment generally performs better105.
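The transform and its inversion can be illustrated with a deliberately naive sketch. The version below sorts all rotations of the text explicitly, which is far too slow and memory-hungry for a genome (real aligners build the transform through suffix-array-based indexes); the `$` end-of-text sentinel and the function names are assumptions made for this example.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (naive, for illustration)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, sentinel="$"):
    """Recover the original text by repeatedly prepending and re-sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The row that ends with the sentinel is the original text plus sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


def run_length_encode(s):
    """Store each run as (letter, count), as described for the compression step."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs
```

Applied to the classic toy string `banana`, `bwt` returns `annb$aa`, in which the repeated letters have been brought together, and `inverse_bwt` restores `banana` exactly.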
Even with an exact match algorithm, a major challenge for short-read sequencing arises when a read maps to more than one place. The read could be placed at the best aligning position, or, if it aligns equally well to more than one position, it could be placed at a randomly chosen position, at every position or not placed at all. Remarkably, there is no consensus regarding which of these placements is best, and different algorithms adopt different approaches with some allowing this placement to be specified in the command line. Clearly, the longer the read the less likely this issue of placement will be a problem, but for 100 bp reads, fully 5% of the genome will originate non-unique reads67. Given that a typical whole genome sequenced to 30× coverage generates approximately 1.3 billion reads, this represents 65 million reads that have no possibility of being accurately located106,107. In practice, it is typically closer to 10% of reads in a whole-genome sequence alignment that remain unplaced, meaning that a further 65 million reads are lost that will probably be enriched for paralogous areas under variable evolutionary constraint (for example, gene families or pseudogenes) or places where the genome being tested differs from the reference genome in ways more complex than single nucleotide variation. Unaligned reads may also represent non-human DNA, in which case, new approaches to the diagnosis of infectious diseases can take advantage by mapping these to databases of viruses and bacterial organisms.
Variant calling. After assembly or alignment comes variant calling. The most common approach is to compare the most likely genotype at each position to that of a standardized reference sequence. This is usually the most current version of the human genome reference but in tumour sequencing might be the patient's germline sequence. Notably, the human reference sequence is haploid. Thus, a homozygous disease-risk variant in a clinical genome sequence will not be called if it also occurs in the haploid reference sequence95. In the case of the factor V Leiden variant found in the reference, for example, a person with an up to 80-fold risk of thromboembolic disease would be undetected by the analysis of any standard variant call format (VCF) file108. Some solutions to this issue take the form of ethnicity-specific, major allele reference sequences95 and family-based diploid reference approaches56. The move towards graph-based assembly approaches, in which the sequence and population variation are contained within a single structure, is underway74. Another solution involves calling all known risk-associated positions8,105 or calling every position into a genomic variant call file, including both reference and variant calls: gVCF (Fig. 2). This has the advantage of distinguishing between a 'no call' and a 'homozygous reference call', which is unable to be distinguished using the standard approach. The challenge with calling every position is the loss of the advantageous drop in file size from raw data to variant call file (from ~five orders of magnitude drop to only ~two orders of magnitude drop).
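The distinction between a 'no call' and a 'homozygous reference call' can be sketched as follows. The toy caller below classifies each position from its read counts, then shows what survives in a variant-only view versus an every-position view, and finally flags known risk positions at which the sample matches a risk allele carried by the reference. The depth threshold of 20 reads, the allele-fraction cut-offs, the positions and all names are hypothetical; real callers use probabilistic genotype models rather than fixed fractions.

```python
def call_site(ref_base, base_counts, min_depth=20):
    """Classify one position from a dict of base -> supporting read count."""
    depth = sum(base_counts.values())
    if depth < min_depth:
        return "no_call"            # too little evidence to say anything
    ref_fraction = base_counts.get(ref_base, 0) / depth
    if ref_fraction >= 0.8:         # hypothetical cut-off
        return "hom_ref"
    if ref_fraction <= 0.2:         # hypothetical cut-off
        return "hom_alt"
    return "het"


def variant_only_view(calls):
    """Standard-VCF-like view: only positions that differ from the reference."""
    return {pos: c for pos, c in calls.items() if c in ("het", "hom_alt")}


def reference_risk_matches(calls, risk_positions_in_reference):
    """Risk positions where the sample is confidently homozygous reference."""
    return sorted(pos for pos in risk_positions_in_reference
                  if calls.get(pos) == "hom_ref")


# Hypothetical pileup: position -> (reference base, observed base counts).
PILEUP = {
    100: ("A", {"A": 30}),            # confidently matches the reference
    101: ("A", {"A": 3}),             # poorly covered
    102: ("A", {"A": 15, "G": 15}),   # heterozygous
    103: ("A", {"G": 30}),            # homozygous alternate
}
CALLS = {pos: call_site(ref, counts) for pos, (ref, counts) in PILEUP.items()}
```

In the variant-only view, positions 100 and 101 are both simply absent, so a well-covered match to a risk-carrying reference is indistinguishable from missing data. Keeping every position in `CALLS` preserves that distinction, at the cost of a much larger file.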
Different classes of variation have widely varying call accuracy and reproducibility109, which is something made more challenging by the lack, until recently, of a fully characterized single human's diploid genome. In its place, the NA12878 genome available in cell lines from Coriell has been adopted, led by a consortium from the US National Institute of Standards and Technology that is called Genome In A Bottle67,110. The consortium made a consensus call set freely available that was derived from 14 data sets from five sequencing technologies, including seven read mappers and three variant callers. The initial work demonstrated a lack of concordance across different technologies but a clear theme emerged, which was also reflected in work with a more clinical focus8, that the accuracy of calling varied widely across different variant classes.
Single nucleotide variation. Overall, single nucleotide variation is called with high sensitivity and specificity for approximately 77% of the genome, approaching 99% concordance8 with genotyping microarray-based approaches in those regions110. This is nevertheless encouraging not only because important Mendelian disease is encoded by this class of variation but also because of the ever-expanding genome-wide association study evidence of single nucleotide variation that is confidently associated with complex human disease. Notably, common single nucleotide variation associations remain overall less relevant from a clinical perspective, as there is currently only very limited evidence of clinical utility in predictive scores derived from common variation111.
Insertions and deletions. In contrast to single nucleotide variation, calling of small insertions or deletions (indels) is less accurate. In one study, the concordance between two platforms for indel calling was only 57% across the genome and 33% for inherited disease risk genes8. This is particularly concerning for clinical genomics, as variation that disrupts the reading frame or that affects the structure of the protein in a major way is likely to be more clinically important. A further challenge to the appropriate identification of indels is the lack of standardization of nomenclature (Fig. 3c). The customary approach to naming genetic variants in the clinical domain is known as HGVS (from the Human Genome Variation Society) and relates the variant position relative to the gene rather than to the chromosome as is more common in discovery genetics112. Parsers now exist to map such variants to the more commonly used chromosome location113,114 but this does not resolve all the issues. Although with single nucleotide variation there can be challenges in appropriate localization given its dependency on alignment and transcript diversity, with insertions or deletions the challenge is greater. Specifically, the locating coordinate could be left or right justified. This is not a theoretical problem, but rather one with very clear clinical implications. For example, the F508del variation in CFTR (discussed above) is the most common variant that is causative of cystic fibrosis (Fig. 1b) but it is represented in two different ways. HGVS convention requires right shifting or justification of ambiguous indel variants for reporting relative to the transcript (the most 3′ position possible should be assigned). However, when calling variants on the genome from aligned sequences, the convention for genomic reporting of ambiguous indel variants in VCF is to left shift or justify relative to the published reference sequence, which represents one, arbitrarily chosen, strand. 
Because transcripts can be notated in either direction (that is, on either strand), unifying the justification to the left or right would still lead to discordance approximately 50% of the time (Fig. 3c). Careful manual curation is currently the only approach that can resolve these issues. Notably, this error was recently reconciled by ClinVar (but not by dbSNP).
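The justification problem can be illustrated with a minimal Python sketch for a deletion inside a short repeat. The reference string is a toy sequence, not a real locus, and the function names are hypothetical. The sketch assumes 0-based coordinates on the reported genomic strand and a deletion described by its start and length.

```python
def left_justify(ref: str, start: int, length: int) -> int:
    """Shift a deletion as far left as possible without changing the result."""
    while start > 0 and ref[start - 1] == ref[start + length - 1]:
        start -= 1
    return start


def right_justify(ref: str, start: int, length: int) -> int:
    """Shift a deletion as far right as possible without changing the result."""
    while start + length < len(ref) and ref[start] == ref[start + length]:
        start += 1
    return start


def apply_deletion(ref: str, start: int, length: int) -> str:
    """Return the sequence that remains after removing `length` bases at `start`."""
    return ref[:start] + ref[start + length:]


def transcript_3prime_start(ref: str, start: int, length: int, strand: str) -> int:
    """Genomic start of the deletion at the most 3' position of a transcript.

    For a transcript on the '+' strand, 3' is to the right on the genome.
    For a transcript on the '-' strand, 3' is to the left.
    """
    if strand == "+":
        return right_justify(ref, start, length)
    if strand == "-":
        return left_justify(ref, start, length)
    raise ValueError(f"unknown strand: {strand}")


REF = "GGCACACATT"  # toy reference containing a CA repeat
left = left_justify(REF, 4, 2)    # left-justified start, as in genomic reporting
right = right_justify(REF, 4, 2)  # right-justified start
```

Here the same two-base deletion can be written with a start anywhere from index 2 to index 6, and every representation yields the identical sequence `GGCACATT`. For a transcript on the '+' strand the most 3′ representation differs from the left-justified genomic one, whereas for a transcript on the '-' strand the two coincide, which is why a single genome-wide rule cannot reconcile the two conventions.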
At their most fundamental, algorithms for calling indels remain inferior to those for calling SNVs. Dindel was widely adopted, including into the Genome Analysis ToolKit (GATK) framework115, and local de novo assembly approaches as well as use of 'known' indel positions improved this further. However, sensitivity still drops very rapidly to below 50% in simulated genomes as the size of the indel increases, even above three base pairs. Newer approaches116,117,118,119 show substantial improvements by including prior knowledge of existing indels and by the use of local de novo re-assembly. However, a great deal of work needs to be done before this class of variation can confidently be called for clinical purposes. Although false-positive indel calls may be resolved by validation with alternative approaches, false-negative calls remain a considerable concern for precision medicine because, if a convincingly causal disease-associated variant remains undetected, this represents a missed opportunity for diagnosis and intervention.
Structural variation. In discovery projects, structural variation has been detected through microarrays and sequencing, but algorithms to detect structural variation from short-read sequencing are fundamentally limited by the length of the short read. Indeed, the extent to which the discovery of structural variation has been missed has been illustrated by recent long-read sequencing approaches that have revealed novel variation that was previously undetected by short-read technologies120. In many ways, this is not surprising given that the aim of the Human Genome Project was to produce one sequence and the fact that long-read data on even one human genome have only recently become available79. However, the detection of structural variation remains a high priority for precision medicine because it is an especially important class of variation, particularly for neurodevelopmental disorders121. Current clinically deployed microarray approaches are limited by the distance between markers, by the lack of adequate control populations, and by the sensitivity of the detection technology (fluorescence). Improved algorithms for calling structural variants from genome data are a major prerequisite for the advancement of precision medicine, particularly as sequencing brings some intrinsic benefits, such as the ability to detect copy-number-neutral structural variants including balanced translocations. Approaches to maximize the diagnostic yield for structural variation from sequencing data have existed for some time122 but have only recently been improved and rigorously tested123,124.
Sequencing gene panels, exomes or genomes
Medical diagnostics that take advantage of next-generation sequencing can follow various strategies that differ in the proportion of the genome that they interrogate (Fig. 4). These are discussed below and include: the capture of the coding regions of a limited panel of genes (often between ten and 100 genes); the capture of the coding regions of almost all genes (the exome); or whole-genome sequencing (sequencing all of the genome that is accessible to short-read sequencing).
Capture-based interrogation of gene panels and exomes. The enrichment of selected areas of the genome by hybridization to known sequences is known as capture. Capture was initially developed for the research market and, in the case of the exome, was designed to balance genome-wide coverage with commercial viability2,3 (Fig. 4a,b). Coverage metrics for these products were typically quoted as a mean or median (for example, 100-fold coverage) but this average greatly belied the vast differences in coverage in different regions (Fig. 4b). Indeed, certain exons in medically important genes (for example, potassium voltage-gated channel subfamily H member 2 (KCNH2)) were effectively missing. In addition, capture oligonucleotides were designed to bait sequences that exactly matched the human reference assembly, so they captured the regions of the genome that we care most about less efficiently: those parts that are variant.
Although in a research context these issues reduced the power to detect variation in certain areas, they did not impede the overall goal of finding some important new variations and so the total incremental benefit over microarray-based methods for discovery in the coding regions overshadowed any major concerns.
By contrast, use in the clinical world is for a single patient with a potentially devastating medical problem. In this case, missing any region of a gene could have serious consequences were it to contain the causative variant. Alternatively, it could result in false reassurance. In either case, the consequences have meaning even beyond the individual to all family members who are potentially at risk of inheriting the disease. Metrics such as '90% coverage at 10× or more' that were common for exome research products are not appropriate for clinical diagnostics. This created a challenge for clinical laboratories for which the existing standard was that a clinical report would not be signed out unless every coding base pair (as well as the two bases on either side to account for splice dinucleotides) was called.
Groups have responded to these challenges by augmenting coverage in certain regions, both coding and non-coding125,126 (Figs 4c,5), through the addition of extra probes in these regions (known as augmented exome sequencing). Some laboratories also use targeted PCR to fill in gaps127. However, increasing capture in certain areas only goes so far in improving the ability to make a call at every position. GC-rich regions — for example, the first exon of most genes — cannot be optimized simply with extra coverage. These regions require library preparation and sequencing conditions that are tailored to their high GC chemistry125.
Whole-genome sequencing. Sequencing the whole genome seems at first to be an answer to these problems. As all genomic DNA is included, concerns relating to capture are not relevant and coverage is clearly more evenly distributed (Fig. 4d). In addition, regulatory areas of the genome are included. Given that most variants that were associated with disease from genome-wide association studies (GWAS) were in non-coding regions, and given that ENCODE (The Encyclopedia of DNA Elements) suggested that large portions of the non-coding genome might be in some way important, this could be valuable data for clinical genomics. For discovery, this remains true. However, for clinical application, GWAS hits with low magnitude of effect remain of limited, though increasing, value as very few associations between Mendelian disease and regulatory variation have been described (and these regions can be added to a capture kit). The current major benefit of whole-genome sequencing for clinical medicine is likely to be in the identification of structural variation, but the algorithms have not so far been accurate enough for short reads to allow this at clinical grade. Overall, replacing exome sequencing with whole-genome sequencing at 30× would lead to the sacrifice of confident callability of the coding genome to provide coverage of the non-coding genome. This 50-fold increase in sequencing has an unclear value for clinical application, as well as for research groups looking to maximize study size per dollar spent on sequencing128.
Comparison of approaches. In comparing different genome diagnostics, a standard metric is helpful. Advances in diagnostics or therapeutics in medicine are judged by the standard of 'non-inferiority'. Here, non-inferiority to Sanger sequencing requires that every coding base pair +/− two bases should receive a confident call. In addition, as Sanger sequencing usually requires PCR of the specific exon, non-inferiority should include only uniquely mapped reads. Thus, the idea of a quality-coverage-mappability metric for comparing different research and commercial products has been gaining traction (Fig. 5). This metric quantifies the number of base pairs per gene of interest that are not covered by 20 or more uniquely mapped Q20 bases. An absolute base pair count is preferable to a percentage because any base can theoretically harbour a disease-causing variant and genes widely vary in their size. For example, if 10% of the gene titin is not callable this would represent >10,000 base pairs of potentially disease-causing variation that are missed (titin is associated with dilated cardiomyopathy). We recently made available129 a tool to generate this metric for a given set of genes based on raw data output from various providers8,67,129 (Fig. 5). An important finding from the application of this metric is that a clinical diagnostic that is based on whole-genome sequence at the standard coverage meets this standard far less often for known disease genes than does augmented exome sequencing (the reverse would be expected for non-disease genes, as these genes are not currently augmented by any vendor). Application of this metric may provide independent verification of new sequencing approaches. For example, data from the Illumina X Ten sequencing machine meets this standard less often than data from the prior HiSeq 2500 (Ref. 130). 
Although this reduced calling confidence could potentially be overcome by increasing whole-genome coverage, this is at the cost of having to generate a large amount more genome-wide sequencing data and, notably, increasing the cost per genome beyond US$1,000.
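The quality-coverage-mappability metric described above can be sketched in Python as a count of positions that fall short of the threshold. This is a minimal illustration and not the published tool: the function names and the per-position input format (a list of base quality and unique-mapping observations for each position of a gene) are assumptions.

```python
from typing import Sequence, Tuple

# One sequenced base observed at a position:
# (base quality score, whether the read carrying it is uniquely mapped).
Observation = Tuple[int, bool]


def qualifying_depth(pileup: Sequence[Observation],
                     min_base_quality: int = 20) -> int:
    """Count bases at one position that are uniquely mapped and of sufficient quality."""
    return sum(1 for quality, unique in pileup
               if unique and quality >= min_base_quality)


def uncallable_bases(gene_pileups: Sequence[Sequence[Observation]],
                     min_depth: int = 20,
                     min_base_quality: int = 20) -> int:
    """Absolute number of positions in a gene that miss the depth threshold."""
    return sum(1 for pileup in gene_pileups
               if qualifying_depth(pileup, min_base_quality) < min_depth)


# Toy positions: only the first meets the 20x, Q20, uniquely mapped standard.
well_covered = [(30, True)] * 25
low_quality = [(10, True)] * 25      # deep, but bases are below Q20
multi_mapped = [(30, False)] * 25    # deep, but reads are not uniquely mapped
shallow = [(30, True)] * 19          # good bases, but one short of 20x
toy_gene = [well_covered, low_quality, multi_mapped, shallow]
```

For the toy gene, three of the four positions are counted as uncallable. Reporting this absolute count rather than a percentage follows the rationale given in the text: a small fraction of a very large gene can still amount to many bases.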
Consistent with independent community verification is the emerging field of community-led regulatory science. Indeed, an important aspect of the US Precision Medicine Initiative is funding for new regulatory approaches. The first manifestation of this is precisionFDA — a website and development environment that will provide tools to allow easier comparison of products from different sequencing vendors and informatics companies131,132. The tool used to generate Figure 5 is one of the launch tools of precisionFDA129.
A genome diagnostic combining multiple technologies. For clinical application, there is some value to whole-genome coverage, although it is more important to cover every coding base pair, especially of genes already known to be important for disease. This observation suggests the concept of a 'coding-enhanced' genome (Fig. 4d). In this concept, coding regions are covered at a high depth through specialized capture but there is some coverage of the whole genome to allow structural variant discovery from the same assay. Until long-read approaches are cost-effective for genome-wide coverage, targeted capture of long molecules for complex areas of the genome maximizes the cost–discovery balance. Such a combination approach has the advantage of maximizing the opportunity to diagnose disease through excellent coverage of the coding medical genome, maximizing the accuracy of structural variant calling, repeat calling and variant characterization in complex areas of the genome, and at the same time rationalizing sequencing and data storage costs. As with all clinical genetic tools in an environment of rapidly expanding knowledge, the captured regions will probably need to be updated every few months to account for newly discovered genes and variants.
Causality and disease categorization
Accurate and precise genomic approaches will greatly facilitate the central tenet of precision medicine: more sophisticated definitions of disease133. The concept of causality is fundamental and recurrent in clinical genetics, as science has provided an abundance of association evidence. Indeed, discovery genetics has identified robust statistical associations between diseases and genetic variants but for a variant to be useful as a diagnostic test or therapeutic target, it is crucial to demonstrate a causal link. Achieving confidence in the determination of causality between a gene or variant and a disease is a complex task that requires various types of supportive data58. Clinical genetics has historically embraced a univariate paradigm in its approach to causality. Even the professional guidelines for variant classification134 force variants into categories on a linear (but not proportional) scale between 'pathogenic' and 'benign'. However, it is clear that the clinical expressivity of a particular variant will depend on the magnitude and dependency of its effect. In this case, dependency incorporates genetic background and other factors such as age and environmental exposure in determining whether the clinical variant is expressed as disease. If the magnitude of the effect of a given variant is large and its dependency small (for example, a chromosomal abnormality) then the disease will almost always be evident if the particular variant is present. If the magnitude of the effect of a given variant is small and its dependency is large (for example, a common variant for a complex disease) then the effect of the variant may never be discernible in isolation. In between these two extremes is a highly variable relationship between variant and disease that is better conceptualized as a multivariate model with a large number of inputs. 
For Mendelian disease, one or more variants will be highly weighted, with other inputs having a substantially lower weighting (perhaps ten 'modifying variants'). For complex disease, recent data have suggested that there will probably be hundreds of variants with small weightings135,136. Significant weighting would also be given to environmental modifiers that may interact with genetic effects.
Therefore, a major challenge is the convenient storage and retrieval of the causal evidence for each variant. Until recently, data on clinically relevant variants were to be found in the literature and in the proprietary databases of commercial testing companies. Sharing occurred but not in any structured or efficient way. The initiation of the ClinVar database and its population by the ClinGen project137, as well as efforts such as Decipher in the UK, have led to more global sharing of rigorously curated evidence. However, the challenges of implementing even standardized guidelines for the interpretation of this evidence are considerable138. Nevertheless, the goal of accumulating and sharing clinical evidence is worth pursuing because for many the 'second case' — another patient who presents in a similar way with a variant in the same gene — provides the highest level of evidence possible for causality. In fact, the newest work from ClinGen indicates that gene-specific or disease-specific overlays should be added to guidelines to maximize concordance between interpreters in a domain-specific way139.
The past decade has witnessed a rapid acceleration in our understanding of the genetic basis of many diseases. With this greater understanding comes the possibility of re-defining disease at higher resolution and, along with this, targeting with more precise therapy. However, for precision medicine to succeed, genomics must also be more accurate. In cohort discovery projects, if a base is not covered, or an algorithm is insensitive, all that is missed is an opportunity for discovery, whereas in clinical medicine, failing to make a diagnosis, or making a diagnosis in error, could have devastating consequences for individuals and families. In discussing this extraordinary opportunity for precision medicine to fulfil the promise of the Human Genome Project, I have described surmountable challenges in advancing the accuracy of clinical genomics. Reducing reliance on reference sequences, making phasing routine, improving calling of indels and structural variants, characterizing complex areas of the genome through long-read sequencing and maximizing the cost effectiveness of genomic coverage will all be crucial. Advancing regulatory processes in parallel will be a necessary step to ensure high standards and patient safety. Creating large cohorts of individuals committed to partnering in discovery will maximize the benefit and speed its global dispersion. Finally, educating the next generation of physicians and laboratory directors will be crucial to the generation of the workforce that is required to sustain the initial promise.
Fuelled by technological advancement, fundamental discovery of genetic elements related to health and disease has been the engine of human genetics for decades. Building on this foundation, precision medicine will use the knowledge gained to redefine disease, to realize new therapies and to provide hope for generations of patients to come.
The author extends his grateful thanks to R. Goldfeder, A. Dainis, M. Grove, D. Church, M.J. Clark, S. Garcia, G. Chandratillake and C. Caleshu for helpful discussion and suggestions on the manuscript.
- Checkpoint receptors
Mediate important immune autoinhibitory pathways, including programmed cell death 1 (PD1) and cytotoxic T lymphocyte-associated protein 4 (CTLA4).
- Pharmacogenomics
The study and application of the effect of genetic variation on the response to pharmaceuticals.
- Black box warnings
Named for the black border surrounding the text of the warning on the package insert or label of a drug. They detail the safety concerns that are of a more serious nature than those described elsewhere on the package or label. The border is used when a serious adverse event can be caused by the medication or can be prevented by appropriate use of the medication.
- Companion diagnostics
Diagnostic tests that help to direct the appropriateness of a specific drug therapy.
- Linkage analysis
An approach to establish the probability that a given genomic region is associated with a phenotype, usually in an extended pedigree.
- HapMap project
An international consortium aimed at characterizing the haplotype diversity of the human genome.
- Shotgun sequencing
In shotgun sequencing, longer DNA fragments are broken into smaller fragments for sequencing using chain termination (Sanger) chemistry.
- Pseudogenes
Copies of a gene that are no longer functional in the same way as the original gene, usually because of deactivating mutations, such as premature stop codons. Pseudogenes can be either processed (derived from retrotransposition of a mature transcript) or non-processed (derived from a DNA duplication event that includes a modification leading to a loss of transcription or translation).
- Segmental duplications
Typically pericentromeric or subtelomeric duplications, concentrated in the Y chromosome, generally tens to hundreds of kilobases in length.
- Short tandem repeats
Microsatellite DNA motifs consisting of 2–6 bp repeated elements of median length 25 bp and accounting for 1% of the genome. They predispose to DNA polymerase slippage events and high mutation rates. Recent work suggests an important role in gene expression.
- Transposon-derived repeats
Repeats derived from transposons, which are DNA elements that can change their positions within the genome.
- Paralogous
A paralogue is a gene related to another by duplication. In this Review, the words paralogy and paralogous are used as umbrella terms for areas of the human genome that are identical to each other. Note that paralogues can be formally distinguished from homologues (genes related to one another by descent from a common ancestor) and orthologues (genes related to one another by speciation).
- De novo assembly
Arranging DNA sequence reads in the most likely order of origination without alignment to a reference sequence.
- Structural variant
A region of DNA, usually greater than 500 bases in length, that differs from a defined reference.
- Lossy compression
A class of data encoding that reduces data size for storage, handling and transmission at the expense of loss of content.
- Lossless compression
A class of data encoding where the original can be perfectly restored from the compressed file.
- Compression heuristic
An approach to compression that is not designed to be optimal but is rather designed to be practical.
- Variant call format
(VCF). A file format standard for the cataloguing of genetic variation in one or many genomes.
- Major allele
The most common allele in a given population.
- Mendelian disease
A genetic disease that follows traditionally recognized patterns of simple inheritance, for example, autosomal dominant.
- Parsers
Algorithms with a specific application in translating one terminology to another.
- ClinVar
A curated database of clinically relevant human genetic variation along with the evidence for its disease causality.
- dbSNP
A minimally curated database of single nucleotide human genetic variation.
- Splice dinucleotides
The almost invariant canonical dinucleotides that are crucial for splicing (GT: donor; AG: acceptor).
- Univariate
Depending on only one variable.
- Multivariate
Depending on multiple variables.