Introduction

Exome and genome sequencing (ES and GS) using massively parallel sequencing (also known as next-generation sequencing, NGS) is increasingly being implemented in clinical settings.1 At present there is the potential for wide variation in sequencing metrics between institutions, between samples sequenced at the same institution, and even within a single sample. Thus, as sequencing moves increasingly into the clinical arena, the application of these methods needs to be accompanied by the development of performance metrics and by an understanding of potential technical limitations of GS and ES as a clinical test. Indeed, the misnomers of “whole” exome and “whole” genome sequencing demonstrate that our field is communicating a confusing message to end users—neither is truly whole. To this end, there has been increased focus on clinical sequencing standards, including the development of professional standards and guidelines for the use of NGS in clinical laboratories.2,3,4,5,6,7

Under current recommendations, putative clinically relevant variants identified through NGS should be validated using Sanger sequencing or other orthogonal methods,3,4 although this practice has been challenged.6,8,9 In addition to knowing that positive results are accurate, clinicians and patients need information to accurately interpret a “negative” clinical sequencing result. This includes distinguishing when negative findings may be attributable to incomplete sequencing results. A key contributor to incomplete sequencing is reduced coverage in regions lacking sufficient high-quality aligned bases for variant calling.3,10 Understanding the effects of reduced coverage requires a number of steps, including (i) setting definitions for “reduced coverage” regions that are not well represented in NGS results, (ii) establishing methods for measuring and reporting reduced coverage regions as part of clinical sequencing quality, and (iii) examining the potential impact of reduced coverage in the interpretation of clinically relevant regions of the genome.

The Clinical Sequencing Exploratory Research (CSER) Consortium, funded by the National Human Genome Research Institute and the National Cancer Institute (NHGRI/NCI), supports both the methods development needed to integrate sequencing into the clinic and the ethical, legal, and psychosocial research required to responsibly apply personal genomic sequence data to medical care.11 The CSER Sequencing Standards Working Group worked to collectively establish a framework for identifying reduced coverage regions in the clinical sequencing setting. This report provides a summary of, and rationale for, the definitions and methods used in this framework. As a demonstration we examined clinical sequencing data on 500 patients generated between 2011 and 2015 from 10 CSER centers (8 performing germ-line ES and 2 performing germ-line GS) and identified reduced coverage regions within and across projects. To provide clinical context, we examined reduced coverage regions in an exemplar gene list: 4,656 genes taken from the GeneTests database, a collection of genes for which a clinical test is available in a diagnostic lab (as of February 2015).

By presenting a framework for identifying reduced coverage bases, this report provides a process road map for other clinical sequencing centers looking to perform similar analyses on their data. This work summarizes factors, such as capture methods and guanine–cytosine (GC) content, that contribute to reduced coverage. In addition, the demonstrative analysis on 500 samples identifies regions of clinically relevant genes that appear to be universally difficult to sequence using Illumina-based sequencing technology. This highlights the importance of communicating sequencing standards in clinical reports and suggests that orthogonal or advanced methods may be needed to identify variants in some clinically relevant regions.

Materials and methods

Sequencing at each center

All human subjects provided informed consent to participate in these studies. Institutional review boards at each center approved their respective studies. Data sets analyzed in this study have been deposited in dbGaP, corresponding to the CSER studies at each center.

Generation of GeneTests target files

The February 2015 version of GeneTests was obtained on 24 February 2015. Genomic coordinates for coding exons associated with transcripts of coding genes and genomic coordinates for all exons associated with noncoding genes were compiled.

Analysis of reduced coverage regions

The Exome Coverage and Identification (ExCID) Report is a software tool used to assess sequence depth in user-defined regions from read data (BAM file); annotate regions with gene, transcript, and exon information; and report intervals below a user-defined coverage threshold. For this study we consider a base to have reduced coverage if the base is covered <20 × in at least 90% of the samples within each center. ExCID Version 2.1 was used for all analyses, and is available on GitHub (https://github.com/cbuhay/ExCID).

Regions of clinical interest

We used two curated databases, the 4 May 2015 release of ClinVar2 and the 2015.1 release of Human Gene Mutation Database (HGMD),3 to assess if any of these reduced coverage regions in the GeneTests list contained clinically actionable variants. Variants in each database were separately intersected with reduced coverage regions identified in GeneTests across 8 ES centers, across 2 GS centers and across all 10 centers using BEDTools.1 We also used the 3 August 2015 release of OMIM to relate reduced coverage regions identified in GeneTests to clinical phenotypes.
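The intersection step can be illustrated with a short sketch. This is not the BEDTools command used in the study; it is a minimal pure-Python stand-in that assumes variants are single positions and that reduced coverage regions are closed, 1-based, non-overlapping intervals. The function name, type aliases, and coordinates in the usage note are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # closed, 1-based (start, end); assumed layout
Variant = Tuple[str, int]   # (chromosome, position); assumed layout


def variants_in_intervals(
    variants: List[Variant],
    intervals_by_chrom: Dict[str, List[Interval]],
) -> List[Variant]:
    """Return the variants that fall inside any interval on their chromosome.

    Intervals are sorted here and are assumed not to overlap one another.
    """
    sorted_intervals = {c: sorted(iv) for c, iv in intervals_by_chrom.items()}
    starts = {c: [s for s, _ in iv] for c, iv in sorted_intervals.items()}
    hits: List[Variant] = []
    for chrom, pos in variants:
        if chrom not in sorted_intervals:
            continue
        # Index of the last interval starting at or before this position.
        i = bisect_right(starts[chrom], pos) - 1
        if i >= 0 and pos <= sorted_intervals[chrom][i][1]:
            hits.append((chrom, pos))
    return hits
```

For example, with made-up intervals chr1:100-110 and chr1:200-205, a variant at chr1:205 is reported and a variant at chr1:111 is not.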

Results

Clinical sequencing cohorts across 10 institutions

To compare reduced coverage regions across multiple clinical sequencing projects, we collected data from 10 centers conducting clinical sequencing research projects as part of the NHGRI CSER consortium. Each center provided sequencing data for a cohort of 50 patient samples sequenced in their respective projects. An overview of the sequencing approach taken in each project at the time of data collection is shown in Table 1. Eight of the 10 centers used germ-line ES for their projects, while two centers used germ-line GS. All 10 projects used Illumina sequencing, though the read length, targets captured, depth of coverage, and other parameters varied between projects (Table 1). Standard sequencing metrics for each cohort using these respective approaches are shown in Supplementary Table 1 online.

Table 1 Overview of the sequencing approach used at each center

An approach for identifying reduced coverage regions in sequencing data

To identify the reduced coverage regions in these projects, we first needed to establish definitions for regions that had insufficient coverage for variant calling, and therefore clinical utility. Although raw coverage count (the number of reads at any given locus) is frequently used as a marker for usability, coverage is not the sole determinant of the ability to call variants. In addition to coverage, the assessment of a region requires base pair–level inspection of the mapping quality (MQ) of the reads placed in this region and base quality (BQ) of the individual base pairs in each read. Therefore, we established the concept of usable bases: a high-quality base (BQ ≥20) that comes in a properly paired read with high mapping quality (MQ ≥20). Setting quality thresholds at 20 assures there is a 99% probability that each base in each read is correctly called and uniquely mapped.12,13 We defined any site (locus) to be well covered if it contains at least 20 usable bases (≥20 × coverage with BQ ≥20 and MQ ≥20). It is important to note that although raw coverage may exceed 20 ×, requiring 20 usable bases increases confidence that most germ-line variants will be detected14 and that stochastic sequencing errors are not detected as false-positive variants.15 To focus on loci that were consistently unusable across multiple samples at any given sequencing center, we defined the reduced coverage loci as those that had fewer than 20 usable bases in at least 90% of the samples in that center’s cohort.
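These definitions can be expressed compactly in code. The sketch below is illustrative only and is not the ExCID implementation; the function names and the in-memory pileup layout (one tuple of base quality, mapping quality, and properly-paired flag per read base) are assumptions made for this example. Only the thresholds come from the text.

```python
from typing import Iterable, Sequence, Tuple

# Thresholds stated in the text.
MIN_BQ = 20              # minimum base quality (Phred scale)
MIN_MQ = 20              # minimum mapping quality (Phred scale)
MIN_USABLE_DEPTH = 20    # usable bases required for a well-covered locus
MIN_SAMPLE_PERCENT = 90  # percent of samples that must fall short


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


def usable_depth(pileup: Iterable[Tuple[int, int, bool]]) -> int:
    """Count usable bases at one locus.

    Each entry is (base_quality, mapping_quality, properly_paired).
    """
    return sum(
        1
        for bq, mq, paired in pileup
        if paired and bq >= MIN_BQ and mq >= MIN_MQ
    )


def is_reduced_coverage(depths_across_samples: Sequence[int]) -> bool:
    """True if a locus has <20 usable bases in at least 90% of samples."""
    n = len(depths_across_samples)
    if n == 0:
        raise ValueError("at least one sample is required")
    low = sum(1 for d in depths_across_samples if d < MIN_USABLE_DEPTH)
    # Integer arithmetic avoids floating-point edge cases at exactly 90%.
    return low * 100 >= MIN_SAMPLE_PERCENT * n
```

In a cohort of 50 samples, a locus is therefore flagged when 45 or more samples have fewer than 20 usable bases. A quality score of 20 corresponds to an error probability of 0.01, which is the 99% figure quoted above.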

To identify the reduced coverage loci in any collection of sequencing data, we used a tool called Exome Coverage and Identification (ExCID), which assesses sequence depth in user-defined targeted regions from read data (BAM file); annotates targets with gene, transcript, and exon information; and reports intervals below a user-defined coverage threshold. Input parameters (see Materials and Methods) include the input BAM files, the targets to interrogate, and the definitions of reduced coverage loci (i.e., at least 90% of samples in the cohort with less than 20 × usable base coverage). Targets may be specified to include any regions of particular interest. The information reported includes coverage metrics and bases covered below the coverage threshold for each sample in the cohort; exact bases that have reduced coverage; the length of the reduced coverage region; gene, transcript, and exon information; and percentage of genes covered greater than the threshold. We further determined the GC content of each interval, position in the original target, and mappability over the reduced coverage regions using standard approaches (see Materials and Methods).
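Two of the reporting steps described here, collapsing individual reduced coverage bases into contiguous intervals and computing the GC content of an interval, can be sketched as follows. This is an assumption-laden illustration rather than ExCID's own code: the function names are hypothetical, positions are treated as 1-based integers on a single chromosome, and the reference sequence for an interval is assumed to be available as a string.

```python
from typing import Iterable, List, Tuple


def merge_positions(positions: Iterable[int]) -> List[Tuple[int, int]]:
    """Collapse 1-based positions into closed, contiguous intervals."""
    intervals: List[Tuple[int, int]] = []
    for pos in sorted(set(positions)):
        if intervals and pos == intervals[-1][1] + 1:
            # Position directly follows the current run, so extend it.
            intervals[-1] = (intervals[-1][0], pos)
        else:
            # Gap found, so start a new run.
            intervals.append((pos, pos))
    return intervals


def gc_fraction(sequence: str) -> float:
    """Fraction of G or C among unambiguous A/C/G/T bases in a sequence."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0  # no unambiguous bases to evaluate
    return (seq.count("G") + seq.count("C")) / called
```

Ignoring ambiguous bases such as N in the denominator is one design choice among several; a pipeline could equally count them as non-GC.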

Defining critical genes

While these tools could be applied to all sequencing data, a more practical approach might be to apply these sequencing standards to a list of critical genes for any given clinical application. Such a list might include the genes of interest in a particular clinical condition or a set of genes selected for interpretation by a clinical sequencing lab. To demonstrate the approach of identifying critical regions, we selected an exemplar gene list that might be representative of the types of genes a clinical sequencing lab would find of interest. For this exemplar gene list, we used a publicly available curated list of 4,656 genes for which a clinical test is available in a diagnostic lab, as registered in GeneTests (https://www.genetests.org) as of February 2015 (Supplementary Table 2).

To examine reduced coverage loci within this gene list, we first needed to convert the list of gene names to specific genomic coordinates that corresponded to coding regions of interest. We compiled the coordinates for the coding exons for the canonical isoform of each gene, using the RefSeq transcript annotated in HGMD16 (http://www.hgmd.cf.ac.uk/ac/index.php). The RefSeq and HGMD nomenclatures and genomic coordinates continued a standard that was already being used at the participating institutions; it adheres to the nomenclature guidelines issued by the HUGO Gene Nomenclature Committee (http://www.genenames.org/guidelines.html). We chose the coding sequence coordinates to include each coding gene. For the GeneTests gene list, the total size of the target regions was 9 Mbp. Sequencing metrics across the GeneTests genes for each cohort are shown in Supplementary Table 3.

Reduced coverage regions in GeneTests genes across 10 centers

Using the ExCID tool and the GeneTests list, we surveyed the clinical sequencing data from each of the 10 project cohorts. The goal of this analysis was to determine the reduced coverage regions—characterized as bases below 20 × usable base coverage in ≥90% of the 50 samples in each cohort—specifically within the GeneTests genes.

The survey of reduced coverage regions in the 4,656 genes in the GeneTests list demonstrated some variability among the 10 projects (Figure 1). The total reduced coverage bases in each of the 10 projects ranged from 107 kb to 817 kb, comprising between 1.2% and 9.1% of the total coding bases contained within the GeneTests genes (Figure 1a). The number of reduced coverage exons (defined as exons containing at least one reduced coverage base) across the projects varied from 1,237 to 6,519, comprising between 1.8% and 9.6% of the total number of exons within the GeneTests list (Figure 1b). There were 533 exons (0.8%) that had reduced coverage at all 10 centers. There was wider variation in the number of reduced coverage genes at each site (defined as genes containing at least one reduced coverage base), with a range of 526 to 2,816, comprising between 11.3% and 60.5% of the 4,656 GeneTests genes (Figure 1c). One hundred forty-six genes (3.1%) were affected by bases that had reduced coverage at all centers, totaling up to 66.4 kb (Supplementary Table 4). To test the robustness of results, we also used the Genome Analysis Toolkit DepthOfCoverage tool on the same data sets, which yielded similar results (Supplementary Table 5).
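The exon-level and gene-level counts follow directly from the definitions in this paragraph: an exon or gene is counted if it contains at least one reduced coverage base. A minimal sketch is shown below, assuming a single chromosome, closed 1-based exon coordinates, and a hypothetical `Exon` record; it is not the code used to produce Figure 1.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Exon(NamedTuple):
    gene: str   # gene symbol
    start: int  # first base of the exon (1-based, inclusive)
    end: int    # last base of the exon (inclusive)


def count_affected(
    exons: Iterable[Exon], reduced_bases: Set[int]
) -> Tuple[int, int]:
    """Return (affected exons, affected genes).

    An exon is affected if any of its bases has reduced coverage;
    a gene is affected if any of its exons is affected.
    """
    affected_exons: List[Exon] = [
        exon
        for exon in exons
        if any(pos in reduced_bases for pos in range(exon.start, exon.end + 1))
    ]
    affected_genes = {exon.gene for exon in affected_exons}
    return len(affected_exons), len(affected_genes)
```

Because a single reduced coverage base is enough to flag a whole gene, gene-level percentages are expected to be much larger than base-level percentages, as seen in the ranges reported above.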

Figure 1

Reduced coverage regions in the GeneTests list. (a) Comparison of reduced coverage bases among all centers. (b) Comparison of GeneTests exons affected by the reduced coverage bases among all centers. (c) Comparison of GeneTests genes affected by the reduced coverage bases among all centers. (d) Pairwise comparisons of reduced coverage regions between any two centers. Absolute values represent the reduced coverage bases common to two centers. Percentages represent the overlap in reduced coverage bases between two centers as compared with the union of reduced coverage bases at the two centers. High correlation existed between the two GS centers (sites I and J) and among the ES centers using the same capture design. (e) Disease-associated ClinVar variants overlapping the reduced coverage bases in each center.

This 66.4-kb region that had reduced coverage in all 10 centers comprises 0.74% of the coding bases in the GeneTests list. These loci represent the regions that had reduced coverage in at least 90% of the 50 samples from each of the eight ES projects as well as the two GS projects. Repeat analyses using 100 samples from nine centers demonstrated similar results, suggesting that this is not a function of sample size (Supplementary Table 6). Figure 1d illustrates a pairwise comparison of the centers, demonstrating both the percent overlap and absolute overlap in reduced coverage regions between any two centers. Greater correlation was seen between the two GS centers (I and J) and between the ES centers using similar capture designs.
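The pairwise comparison can be sketched as an intersection-over-union calculation, matching the Figure 1 legend, which describes the percentage as the overlap relative to the union of reduced coverage bases at the two centers. The function name and the representation of each center's reduced coverage bases as a set of integer positions are assumptions for illustration.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def pairwise_overlap(
    bases_by_center: Dict[str, Set[int]]
) -> Dict[Tuple[str, str], Tuple[int, float]]:
    """For each pair of centers, return (shared bases, percent of union)."""
    result: Dict[Tuple[str, str], Tuple[int, float]] = {}
    for a, b in combinations(sorted(bases_by_center), 2):
        shared = bases_by_center[a] & bases_by_center[b]
        union = bases_by_center[a] | bases_by_center[b]
        # Guard against two centers with no reduced coverage bases at all.
        percent = 100.0 * len(shared) / len(union) if union else 0.0
        result[(a, b)] = (len(shared), percent)
    return result
```

The set of bases with reduced coverage at every center is the intersection of all the sets, which can be obtained with `set.intersection(*bases_by_center.values())` when at least one center is present.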

Among the centers performing ES, 201,011 bp had reduced coverage at all eight centers (Figure 2a). Additional bases with reduced coverage uniquely at each ES center ranged from 171 bp to 205,293 bp. Figure 2b illustrates the number of bases, complete exons, and complete genes in the GeneTests list successfully covered at one or more centers using ES. Ninety-four percent of the genes in GeneTests (4,370) were successfully covered in their entirety by at least one of the eight ES centers. Only 17% of the genes (814) were completely covered by all eight centers. Two hundred eighty-six genes from GeneTests contained at least one region that had reduced coverage in all eight ES centers. The specific genes involved in these reduced coverage regions are listed in Supplementary Table 4. Different regions had uniquely reduced coverage at the ES centers and at the GS centers. Of the 201 kb that had reduced coverage at all eight ES centers, 134.7 kb were successfully sequenced in the two GS centers (Supplementary Figure 1). Conversely, of the 105 kb that had reduced coverage in both GS centers, 39.4 kb were successfully sequenced in at least one of the exome centers.

Figure 2

Reduced coverage bases in GeneTests. (a) Comparison of reduced coverage regions among eight exome-sequencing centers. (b) The percent of total bases, total whole exons, and total whole genes in the 4,656-gene GeneTests list successfully covered at one or more centers. To be included, every base in an exon or gene must have been covered by at least 20 usable bases (coverage ≥20 ×, mapping quality ≥20, base quality ≥20). ES, exome sequencing.

To assess potential clinical relevance, these reduced coverage regions were cross-referenced with two curated databases of relationships between genetic variants and clinical phenotypes, ClinVar17 and HGMD.16 The goal of this analysis was to determine if any of the reduced coverage regions contained any known disease-associated loci. Among 30,861 disease-associated variants in ClinVar, the number overlapping reduced coverage bases at each sequencing center ranged from 93 to 1,323 (Figure 1e). Of the 146 GeneTests genes that had reduced coverage at all 10 centers, 22 genes had a reduced coverage region that overlapped with a clinical variant in HGMD or ClinVar (Table 2). Similarly, of the 286 GeneTests genes that had reduced coverage at all eight ES centers, 28 genes had a reduced coverage region that overlapped with a clinical variant in HGMD or ClinVar (Table 2). Details of the phenotypes associated with these specific clinically relevant positions can be found in Table 2.

Table 2 Genes containing reduced coverage bases found in all 10 centers or all 8 ES centers that overlap with known disease-associated variants in ClinVar or HGMD

Characteristics of reduced coverage regions and exons

The reduced coverage bases across all centers were made up of 735 distinct contiguous intervals. Forty-two percent of these intervals (309) had lengths ranging from 1 to 5 bp (Figure 3a), and these short fragments were predominantly GC-rich (>70%). The remaining reduced coverage intervals were longer than 5 bp, ranging up to 490 bp, and though they made up only 58% of the reduced coverage intervals, they accounted for 65.7 kb (97%) of the total bases for the reduced coverage regions in this analysis. A similar analysis for just the exome centers is shown in Supplementary Figure 2.

Figure 3

Analysis of reduced coverage regions common to all centers. (a) Overall, there were 735 missing intervals totaling 66.4 kbp in the intersection of exome and genome centers. Forty-two percent of all missing intervals had lengths that were 5 bp or shorter. The remaining missing intervals varied widely in length and occurred less frequently, but they accounted for 65.7 kbp or 97.7% of the total length of all missing intervals combined. (b) Of the >67,000 exons accounting for 4,656 genes in GeneTests, 533 had reduced coverage regions. These regions fell into two distinct groups: either a small part (<20%, with the vast majority less than 10%) of the entire exon had reduced coverage, or most of an exon (>90%) had reduced coverage. (c) Comparison of guanine–cytosine (GC)% distribution between the GeneTests baseline and the reduced coverage regions in all centers.

The 4,656 genes in the GeneTests list are made up of 67,759 exons. Of those, 533 exons had reduced coverage loci at all 10 centers. The reduced coverage loci fell into two distinct groups. In the first group, less than 20% (and most often less than 10%) of the entire exon had reduced coverage bases. In the second group, most of the exon (>90%) had reduced coverage (Figure 3b). Exons with missing intervals spanning >90% of the entire coding sequence had lengths ranging from about 17 bp to over 2,112 bp, with the median length around 105 bp. Thirty-six percent (194 of 533) of the problematic exons were in either the first or last exon of the gene, much higher than expected by chance alone, given that only 17% of the 67,759 exons are either first or last exons (Supplementary Table 7). The GC content also contributed to a large number of the reduced coverage loci, with 29% of the reduced coverage regions with a GC content >80%, as compared with 0.15% of the bases in GeneTests overall (Figure 3c).

Potential clinical implications of reduced coverage regions

One goal of identifying the reduced coverage regions in clinical sequencing data was to understand the clinical implications of incomplete sequencing of potentially relevant genes. Several reduced coverage regions were identified that could affect the molecular diagnosis of patients for a variety of phenotypes (Table 2; Supplementary Table 4). Many of these appeared to be due to issues with high similarity with other parts of the genome leading to an inability of reads to be uniquely aligned. For instance, the STRC gene has recently been revealed to be a major contributor to congenital sensorineural hearing loss; however, single-nucleotide variants and small indels cannot be reliably detected via NGS due to STRC having 99.6% identity with the coding region of a nearby pseudogene.18,19,20 Similarly, current ES and GS approaches have a difficult time detecting pathogenic variants for adult-onset polycystic kidney disease, a disease with a prevalence of 1/400–1/1,000 and for which 85% of pathogenic variants identified are in PKD1.21 PKD1 is part of a genomic region that has been duplicated six times on chromosome 16, leading to an inability to accurately map reads over a large portion of the gene.22,23,24 For both STRC and PKD1, specific targeted sequencing strategies can be deployed to improve the coverage of these genes to address the presence of reduced coverage bases resulting from ES or GS.

Other clinically relevant genes that have reduced coverage likely have simpler solutions to accurately detect variation. SHOX, a gene associated with short stature, is part of the pseudo-autosomal regions that occur at the ends of both the X and Y chromosome. Since ES and GS typically align against the default human reference, this region is indicated as occurring on two different chromosomes (X and Y), thus mimicking a region with high similarity. Defining this region as its own chromosome (XY) would mitigate the mappability issues associated with this region.

Discussion

Massively parallel sequencing data are increasingly being used for clinical decision making. In these clinical contexts, it is critical to understand parameters that affect sequencing quality and to establish methods for measuring and reporting clinical sequencing results. In addition to knowing that data are accurate, clinicians and patients using clinical sequencing data need reassurance that a lack of clinically relevant findings reflects true negatives rather than inconclusive sequencing results. Moreover, it is important for everyone to understand that no test is 100% sensitive and that clinical decisions need to be made in light of a valid metric of sensitivity. Coverage metrics are a crucial component of the overall sensitivity of NGS.

In this report, we have proposed a definition for reduced coverage regions and have established a set of standards for variant calling in clinical sequencing applications. To enable sequencing centers to assess the regions of poor sequencing quality in their own data, we optimized a tool, ExCID (now publicly available for use), which provides a list of reduced coverage loci within genes or regions of particular interest. To demonstrate an approach for examining reduced coverage regions in clinical sequencing data, we used these tools on clinical data generated in 10 projects from different sequencing centers. This approach identified reduced coverage regions in clinically relevant genes, including known clinically relevant loci that were uniquely missed at individual centers, in multiple centers, or in all centers.

Comparing the reduced coverage regions across the various centers allowed categorization of the problematic regions and suggests possible solutions to improve standards. Reduced coverage regions that were unique to an individual center conducting ES were likely due to specific methodology at that center (Table 1). Choices about capture method, mean target coverage, or analysis pipeline may account for differences between centers. Since these parameters can be changed, regions that have reduced coverage only at a particular center can likely be salvaged by modifying the sequencing approach. For example, deeper sequencing is likely to reduce the number of reduced coverage regions that are unique to any single center (Supplementary Figure 3).

In contrast, loci that had reduced coverage at all or most of the ES centers but not at the GS centers were likely due to difficulty with hybridization capture of the region. This might include difficulty generating appropriate baits for the relevant regions or difficulty with capture itself. Alternative bait design or orthogonal methods for sequencing these regions might help salvage these specific regions. Finally, there were certain regions in the genome that had reduced coverage at all 10 centers, regardless of sequencing strategy. Although the total number of regions in this category was small, they did contain some potentially clinically relevant sites (Table 2). While it may be difficult to develop additional methods to salvage these regions, it is important for clinical sequencing centers to note the inability to provide conclusive sequencing information about these regions, especially those that may have clinical implications.

Centers conducting clinical sequencing can use the tools and approach described in this study to analyze their own clinical sequencing data, which may be particularly important for clinical laboratories conducting test validation. Once a list of genes or regions of interest has been identified, it can be converted to a list of coordinates (see Materials and Methods). Using these coordinates and a collection of representative sequencing data, centers can run ExCID on these data to identify those regions that have reduced coverage in at least 90% of the samples. By comparing these reduced coverage regions with the data presented in this study, centers may be able to better understand the reasons for poor coverage in their own data and develop potential salvage approaches, as detailed above.

While this study provides a generalized approach for analyzing clinical sequencing data for reduced coverage regions, there are a number of caveats. First, the data from each center represented specific cohorts from each project. These were intended to be exemplars to demonstrate the approach to analyzing data at an individual center and between multiple centers; they do not represent a comprehensive survey of human sequencing data. Second, although eight centers using ES data were represented, only two of our centers used GS data. Therefore, conclusions regarding the differences between GS and ES must be considered with that limitation in mind. Third, this analysis assumes that centers are performing germ-line sequencing to find heterozygous single-nucleotide polymorphisms. For labs seeking to identify variants with a lower allelic fraction (for example, cases of germ-line mosaicism or somatic mutations), additional considerations will be relevant and standards proposed here are unlikely to be sufficient. Finally, all 10 centers used Illumina-based sequencing approaches. At present, there are several other platforms available for clinical sequencing, which may provide different results. Moreover, the rapid evolution of platforms suggests that the specific data here may not be exactly reproduced using the most current technologies. Indeed, the technology used at most of the centers in this study has already evolved since the generation of data for this study. Nevertheless, though the data on reduced coverage regions at 10 centers may not be representative of the breadth of technologies used today, the approach to analyzing reduced coverage regions remains platform agnostic, and can be applied to any clinical sequencing data.

When communicating results to clinicians and patients, sequencing centers need to be able to recognize and address reduced coverage regions. The tools and framework defined in this study can provide information about regions that have systematically reduced coverage at the center, and regions that have reduced coverage in any individual sample, which may also include additional regions that have sporadically reduced coverage. Both are important for quality control and communication of clinical sequencing test results. Centers need to understand systematic reduced coverage regions in order to recognize the limitations of their clinical sequencing testing and, when appropriate, to modify approaches to improve sequencing quality. At the individual level, regions that have reduced coverage in any particular sample may have direct clinical implications. Therefore, it is important that these regions be accurately communicated to the clinicians using the reports, so that they can correctly factor this into their decision making. Communicating clinical sequencing data to clinicians and patients remains challenging, and including issues of sequencing standards adds further complexity, but doing so is important. Further research into specific approaches to documenting and communicating these standards will be required.