Characterizing reduced coverage regions through comparison of exome and genome sequencing data across 10 centers

Sanghvi, Rashesh V; Buhay, Christian J; Powell, Bradford C; Tsai, Ellen A; Dorschner, Michael O; Hong, Celine S; Lebo, Matthew S; Sasson, Ariella; Hanna, David S; McGee, Sean; Bowling, Kevin M; Cooper, Gregory M; Gray, David E; Lonigro, Robert J; Dunford, Andrew; Brennan, Christine A; Cibulskis, Carrie; Walker, Kimberly; Carneiro, Mauricio O; Sailsbery, Joshua; Hindorff, Lucia A; Robinson, Dan R; Santani, Avni; Sarmady, Mahdi; Rehm, Heidi L; Biesecker, Leslie G; Nickerson, Deborah A; Hutter, Carolyn M; Garraway, Levi; Muzny, Donna M; Wagle, Nikhil

doi:10.1038/gim.2017.192

Article
Published: 16 November 2017

Rashesh V Sanghvi MS,
Christian J Buhay BS,
Bradford C Powell MD, PhD,
Ellen A Tsai PhD,
Michael O Dorschner PhD,
Celine S Hong PhD,
Matthew S Lebo PhD,
Ariella Sasson PhD,
David S Hanna BS,
Sean McGee PhD,
Kevin M Bowling PhD,
Gregory M Cooper PhD,
David E Gray MS,
Robert J Lonigro MS,
Andrew Dunford BS,
Christine A Brennan BS,
Carrie Cibulskis BSc,
Kimberly Walker MS,
Mauricio O Carneiro PhD,
Joshua Sailsbery PhD,
Lucia A Hindorff PhD,
Dan R Robinson PhD,
Avni Santani PhD,
Mahdi Sarmady PhD,
Heidi L Rehm PhD,
Leslie G Biesecker MD,
Deborah A Nickerson PhD,
Carolyn M Hutter PhD,
Levi Garraway MD, PhD,
Donna M Muzny MSc &
Nikhil Wagle MD
on behalf of the NHGRI Clinical Sequencing Exploratory Research (CSER) Consortium

Genetics in Medicine volume 20, pages 855–866 (2018)

Abstract

Purpose

As massively parallel sequencing is increasingly being used for clinical decision making, it has become critical to understand parameters that affect sequencing quality and to establish methods for measuring and reporting clinical sequencing standards. In this report, we propose a definition for reduced coverage regions and describe a set of standards for variant calling in clinical sequencing applications.

Methods

To enable sequencing centers to assess the regions of poor sequencing quality in their own data, we optimized and used a tool (ExCID) to identify reduced coverage loci within genes or regions of particular interest. We used this framework to examine sequencing data from 500 patients generated in 10 projects at sequencing centers in the National Human Genome Research Institute/National Cancer Institute Clinical Sequencing Exploratory Research Consortium.

Results

This approach identified reduced coverage regions in clinically relevant genes, including known clinically relevant loci that were uniquely missed at individual centers, in multiple centers, and in all centers.

Conclusion

This report provides a process road map for clinical sequencing centers looking to perform similar analyses on their data.

Introduction

Exome and genome sequencing (ES and GS) using massively parallel sequencing (also known as next-generation sequencing, NGS) is increasingly being implemented in clinical settings.¹ At present there is the potential for wide variation in sequencing metrics between institutions, between samples sequenced at the same institution, and even within a single sample. Thus, as sequencing moves increasingly into the clinical arena, the application of these methods needs to be accompanied by the development of performance metrics and by an understanding of potential technical limitations of GS and ES as a clinical test. Indeed, the misnomers of “whole” exome and “whole” genome sequencing demonstrate that our field is communicating a confusing message to end users—neither is truly whole. To this end, there has been increased focus on clinical sequencing standards, including the development of professional standards and guidelines for the use of NGS in clinical laboratories.^2,3,4,5,6,7

Under current recommendations, putative clinically relevant variants identified through NGS should be validated using Sanger sequencing or other orthogonal methods,^3,4 although this practice has been challenged.^6,8,9 In addition to knowing that positive results are accurate, clinicians and patients need information to accurately interpret a “negative” clinical sequencing result. This includes distinguishing when negative findings may be attributable to incomplete sequencing results. A key contributor to incomplete sequencing is reduced coverage in regions lacking sufficient high-quality aligned bases for variant calling.^3,10 Understanding the effects of reduced coverage requires a number of steps, including (i) setting definitions for “reduced coverage” regions that are not well represented in NGS results, (ii) establishing methods for measuring and reporting reduced coverage regions as part of clinical sequencing quality, and (iii) examining the potential impact of reduced coverage in the interpretation of clinically relevant regions of the genome.

The Clinical Sequencing Exploratory Research (CSER) Consortium, funded by the National Human Genome Research Institute and the National Cancer Institute (NHGRI/NCI), supports both the methods development needed to integrate sequencing into the clinic and the ethical, legal, and psychosocial research required to responsibly apply personal genomic sequence data to medical care.¹¹ The CSER Sequencing Standards Working Group worked to collectively establish a framework for identifying reduced coverage regions in the clinical sequencing setting. This report provides a summary of, and rationale for, the definitions and methods used in this framework. As a demonstration we examined clinical sequencing data on 500 patients generated between 2011 and 2015 from 10 CSER centers (8 performing germ-line ES and 2 performing germ-line GS) and identified reduced coverage regions within and across projects. To provide clinical context, we examined reduced coverage regions in an exemplar gene list: 4,656 genes taken from the GeneTests database, a collection of genes for which a clinical test is available in a diagnostic lab (as of February 2015).

By presenting a framework for identifying reduced coverage bases, this report provides a process road map for other clinical sequencing centers looking to perform similar analyses on their data. This work summarizes factors, such as capture methods and guanine–cytosine (GC) content, that contribute to reduced coverage. In addition, the demonstrative analysis on 500 samples identifies regions of clinically relevant genes that appear to be universally difficult to sequence using Illumina-based sequencing technology. This highlights the importance of communicating sequencing standards in clinical reports and suggests that orthogonal or advanced methods may be needed to identify variants in some clinically relevant regions.

Materials and methods

Sequencing at each center

All human subjects provided informed consent to participate in these studies. Institutional review boards at each center approved their respective studies. Data sets analyzed in this study have been deposited in dbGaP, corresponding to the CSER studies at each center.

Generation of GeneTests target files

The February 2015 version of GeneTests was obtained on 24 February 2015. For coding genes, we compiled the genomic coordinates of the coding exons of the associated transcripts; for noncoding genes, we compiled the genomic coordinates of all exons.

Analysis of reduced coverage regions

The Exome Coverage and Identification (ExCID) Report is a software tool used to assess sequence depth in user-defined regions from read data (BAM file); annotate regions with gene, transcript, and exon information; and report intervals below a user-defined coverage threshold. For this study we consider a base to have reduced coverage if the base is covered <20 × in at least 90% of the samples within each center. ExCID Version 2.1 was used for all analyses, and is available on GitHub (https://github.com/cbuhay/ExCID).
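The cohort-level rule above can be sketched in a few lines of Python. This is an illustration of the definition only, not ExCID's implementation: the function name, the input layout (one per-base depth list per sample over the same target), and the example data are all assumptions made for the sketch.

```python
from typing import Dict, List

DEPTH_THRESHOLD = 20    # a sample fails at a base when its depth is below this
SAMPLE_FRACTION = 0.90  # fraction of samples that must fail for the base to be flagged


def reduced_coverage_positions(
    depths: Dict[str, List[int]],
    depth_threshold: int = DEPTH_THRESHOLD,
    sample_fraction: float = SAMPLE_FRACTION,
) -> List[int]:
    """Return 0-based indices of bases with reduced coverage in a cohort.

    `depths` maps a (hypothetical) sample name to its per-base depth over one
    shared target, so every list must have the same length.
    """
    samples = list(depths.values())
    if not samples:
        return []
    n_bases = len(samples[0])
    if any(len(s) != n_bases for s in samples):
        raise ValueError("all samples must cover the same positions")

    flagged = []
    for i in range(n_bases):
        failing = sum(1 for s in samples if s[i] < depth_threshold)
        # "At least 90% of the samples" is an inclusive comparison.
        if failing / len(samples) >= sample_fraction:
            flagged.append(i)
    return flagged


# Toy cohort of 10 samples over a 3-base target.
cohort = {f"s{i}": [5, 5, 50] for i in range(8)}
cohort["s8"] = [5, 30, 50]
cohort["s9"] = [30, 30, 50]
print(reduced_coverage_positions(cohort))
```

In the toy cohort, only the first base is flagged: nine of ten samples fall below 20× there, whereas the second base fails in only eight of ten.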

Regions of clinical interest

We used two curated databases, the 4 May 2015 release of ClinVar² and the 2015.1 release of Human Gene Mutation Database (HGMD),³ to assess if any of these reduced coverage regions in the GeneTests list contained clinically actionable variants. Variants in each database were separately intersected with reduced coverage regions identified in GeneTests across 8 ES centers, across 2 GS centers and across all 10 centers using BEDTools.¹ We also used the 3 August 2015 release of OMIM to relate reduced coverage regions identified in GeneTests to clinical phenotypes.
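The intersection step can be illustrated without BEDTools itself. The sketch below is a plain-Python stand-in for a point-in-interval query, assuming BED-style 0-based, half-open coordinates and non-overlapping regions per chromosome; the function names and the toy regions and variants are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(regions):
    """Group (chrom, start, end) regions by chromosome and sort them.

    Coordinates are assumed 0-based and half-open, and regions on the same
    chromosome are assumed not to overlap.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()
    return by_chrom


def variants_in_regions(variants, index):
    """Return ids of (chrom, pos, variant_id) variants that fall inside a region."""
    hits = []
    for chrom, pos, variant_id in variants:
        intervals = index.get(chrom, [])
        # Locate the last interval whose start is <= pos, then test its end.
        i = bisect_right(intervals, (pos, float("inf"))) - 1
        if i >= 0 and intervals[i][0] <= pos < intervals[i][1]:
            hits.append(variant_id)
    return hits


reduced = [("chr1", 100, 110), ("chr1", 200, 205), ("chr2", 50, 60)]
known_variants = [
    ("chr1", 100, "v1"),  # first base of a region
    ("chr1", 110, "v2"),  # one past the end of a half-open region
    ("chr1", 204, "v3"),
    ("chr2", 10, "v4"),
    ("chr3", 55, "v5"),   # chromosome with no reduced coverage regions
]
print(variants_in_regions(known_variants, build_index(reduced)))
```

Running the query separately per database and per center grouping (8 ES centers, 2 GS centers, all 10) mirrors the three intersections described above.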

Results

Clinical sequencing cohorts across 10 institutions

To compare reduced coverage regions across multiple clinical sequencing projects, we collected data from 10 centers conducting clinical sequencing research projects as part of the NHGRI CSER consortium. Each center provided sequencing data for a cohort of 50 patient samples sequenced in their respective projects. An overview of the sequencing approach taken in each project at the time of data collection is shown in Table 1. Eight of the 10 centers used germ-line ES for their projects, while two centers used germ-line GS. All 10 projects used Illumina sequencing, though the read length, targets captured, depth of coverage, and other parameters varied between projects (Table 1). Standard sequencing metrics for each cohort using these respective approaches are shown in Supplementary Table 1 online.

Table 1 Overview of the sequencing approach used at each center

Full size table

An approach for identifying reduced coverage regions in sequencing data

To identify the reduced coverage regions in these projects, we first needed to establish definitions for regions that had insufficient coverage for variant calling, and therefore clinical utility. Although raw coverage count (the number of reads at any given locus) is frequently used as a marker for usability, coverage is not the sole determinant of the ability to call variants. In addition to coverage, the assessment of a region requires base pair–level inspection of the mapping quality (MQ) of the reads placed in this region and base quality (BQ) of the individual base pairs in each read. Therefore, we established the concept of usable bases: a high-quality base (BQ ≥20) that comes in a properly paired read with high mapping quality (MQ ≥20). Setting quality thresholds at 20 assures there is a 99% probability that each base in each read is correctly called and uniquely mapped.^12,13 We defined any site (locus) to be well covered if it contains at least 20 usable bases (≥20 × coverage with BQ ≥20 and MQ ≥20). It is important to note that although raw coverage may exceed 20 ×, requiring 20 usable bases increases confidence that most germ-line variants will be detected¹⁴ and that stochastic sequencing errors are not detected as false-positive variants.¹⁵ To focus on loci that were consistently unusable across multiple samples at any given sequencing center, we defined the reduced coverage loci as those that had fewer than 20 usable bases in at least 90% of the samples in that center’s cohort.
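A small Python sketch may make the usable-base definition concrete. It assumes the standard Phred scale, on which a quality of Q corresponds to an error probability of 10^(-Q/10), so Q = 20 gives 1%. The `BaseObservation` class and function names are hypothetical; a real pipeline would read these fields from aligned reads in a BAM file.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BaseObservation:
    """One read base overlapping a locus (hypothetical container)."""
    base_quality: int      # Phred-scaled BQ of this base
    mapping_quality: int   # Phred-scaled MQ of the read carrying it
    properly_paired: bool  # whether the read is part of a proper pair


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred-scaled quality to an error probability."""
    return 10 ** (-q / 10)


def is_usable(obs: BaseObservation, min_bq: int = 20, min_mq: int = 20) -> bool:
    """A usable base: BQ >= 20 on a properly paired read with MQ >= 20."""
    return (
        obs.properly_paired
        and obs.base_quality >= min_bq
        and obs.mapping_quality >= min_mq
    )


def usable_depth(pileup: Iterable[BaseObservation]) -> int:
    """Count the usable bases among all observations at one locus."""
    return sum(1 for obs in pileup if is_usable(obs))


def is_well_covered(pileup: Iterable[BaseObservation], min_usable: int = 20) -> bool:
    """A locus is well covered when it has at least 20 usable bases."""
    return usable_depth(pileup) >= min_usable


# 25 raw observations, of which only 19 pass all three filters.
pileup = (
    [BaseObservation(30, 60, True)] * 19
    + [BaseObservation(10, 60, True)] * 3   # low base quality
    + [BaseObservation(30, 5, True)] * 2    # low mapping quality
    + [BaseObservation(30, 60, False)]      # not properly paired
)
print(len(pileup), usable_depth(pileup), is_well_covered(pileup))
```

The example shows why raw depth alone can mislead: the locus has 25× raw coverage but only 19 usable bases, so it does not count as well covered.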

To identify the reduced coverage loci in any collection of sequencing data, we used a tool called Exome Coverage and Identification (ExCID), which assesses sequence depth in user-defined targeted regions from read data (BAM file); annotates targets with gene, transcript, and exon information; and reports intervals below a user-defined coverage threshold. Input parameters (see Materials and Methods) include the input BAM files, the targets to interrogate, and the definitions of reduced coverage loci (i.e., at least 90% of samples in the cohort with less than 20× usable base coverage). Targets may be specified to include any regions of particular interest. The information reported includes coverage metrics and bases covered below the coverage threshold for each sample in the cohort; exact bases that are reduced coverage; the length of the reduced coverage region; gene, transcript, and exon information; and percentage of genes covered above the threshold. We further determined the GC content of each interval, position in the original target, and mappability over the reduced coverage regions using standard approaches (see Materials and Methods).
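Two of the reported quantities, the contiguous reduced coverage intervals and their GC content, are straightforward to derive once the flagged bases are known. The sketch below is one possible way to do so and does not reproduce ExCID's own code; the function names are hypothetical, and ignoring non-ACGT characters in the GC calculation is an assumption of this sketch.

```python
def positions_to_intervals(positions):
    """Collapse base positions into half-open (start, end) contiguous runs."""
    intervals = []
    for pos in sorted(set(positions)):
        if intervals and pos == intervals[-1][1]:
            # Position extends the current run by one base.
            intervals[-1][1] = pos + 1
        else:
            intervals.append([pos, pos + 1])
    return [tuple(interval) for interval in intervals]


def gc_fraction(sequence):
    """Fraction of G or C among the called (A, C, G, T) bases of a sequence."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return sum(base in "GC" for base in called) / len(called)


flagged = [9, 3, 4, 5, 10, 20]
for start, end in positions_to_intervals(flagged):
    print(start, end, end - start)  # interval and its length in bp

print(gc_fraction("GGCCAT"))
```

Interval length falls out as `end - start` under the half-open convention, which avoids off-by-one errors when summing reduced coverage bases.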

Defining critical genes

While these tools could be applied to all sequencing data, a more practical approach might be to apply these sequencing standards to a list of critical genes for any given clinical application. Such a list might include the genes of interest in a particular clinical condition or a set of genes selected for interpretation by a clinical sequencing lab. To demonstrate the approach of identifying critical regions, we selected an exemplar gene list that might be representative of the types of genes a clinical sequencing lab would find of interest. For this exemplar gene list, we used a publicly available curated list of 4,656 genes for which a clinical test is available in a diagnostic lab, as registered in GeneTests (https://www.genetests.org) as of February 2015 (Supplementary Table 2).

To examine reduced coverage loci within this gene list, we first needed to convert the list of gene names to specific genomic coordinates that corresponded to coding regions of interest. We compiled the coordinates for the coding exons for the canonical isoform of each gene, using the RefSeq transcript annotated in HGMD¹⁶ (http://www.hgmd.cf.ac.uk/ac/index.php). The RefSeq and HGMD nomenclatures and genomic coordinates continued a standard that was already being used at the participating institutions; it adheres to the nomenclature guidelines issued by the HUGO Gene Nomenclature Committee (http://www.genenames.org/guidelines.html). We chose the coding sequence coordinates to include each coding gene. For the GeneTests gene list, the total size of the target regions was 9 Mbp. Sequencing metrics across the GeneTests genes for each cohort are shown in Supplementary Table 3.
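Converting a transcript to coding-exon coordinates amounts to clipping each exon to the coding sequence (CDS) boundaries. The following sketch illustrates that step under stated assumptions: 0-based, half-open coordinates, a single transcript described by its exon list and CDS start and end, and hypothetical function names. It does not reflect how any particular annotation database stores transcripts.

```python
def coding_exon_intervals(exons, cds_start, cds_end):
    """Clip (start, end) exons to the coding region [cds_start, cds_end).

    Exons that lie entirely in untranslated regions are dropped; exons that
    straddle a CDS boundary are trimmed to their coding part.
    """
    coding = []
    for start, end in sorted(exons):
        clipped_start = max(start, cds_start)
        clipped_end = min(end, cds_end)
        if clipped_start < clipped_end:
            coding.append((clipped_start, clipped_end))
    return coding


def total_length(intervals):
    """Total number of bases covered by non-overlapping half-open intervals."""
    return sum(end - start for start, end in intervals)


# Toy transcript: three exons, with the CDS starting inside the first exon
# and ending inside the last one.
exons = [(100, 200), (300, 400), (500, 600)]
coding = coding_exon_intervals(exons, cds_start=150, cds_end=550)
print(coding, total_length(coding))
```

Summing `total_length` over all selected transcripts gives the size of the target, the quantity reported above as 9 Mbp for the GeneTests list.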

Reduced coverage regions in GeneTests genes across 10 centers

Using the ExCID tool and the GeneTests list, we surveyed the clinical sequencing data from each of the 10 project cohorts. The goal of this analysis was to determine the reduced coverage regions—characterized as bases below 20 × usable base coverage in ≥90% of the 50 samples in each cohort—specifically within the GeneTests genes.

The survey of reduced coverage regions in the 4,656 genes in the GeneTests list demonstrated some variability among the 10 projects (Figure 1). The total reduced coverage bases in each of the 10 projects ranged from 107 kb to 817 kb, comprising between 1.2% and 9.1% of the total coding bases contained within the GeneTests genes (Figure 1a). The number of reduced coverage exons (defined as exons containing at least one reduced coverage base) across the projects varied from 1,237 to 6,519, comprising between 1.8% and 9.6% of the total number of exons within the GeneTests list (Figure 1b). There were 533 exons (0.8%) that had reduced coverage at all 10 centers. There was wider variation in the number of reduced coverage genes at each site (defined as genes containing at least one reduced coverage base), with a range of 526 to 2,816, comprising between 11.3% and 60.5% of the 4,656 GeneTests genes (Figure 1c). One hundred forty-six genes (3.1%) were affected by bases that had reduced coverage at all centers, totaling up to 66.4 kb (Supplementary Table 4). To test the robustness of results, we also used the Genome Analysis Toolkit DepthOfCoverage tool on the same data sets, which yielded similar results (Supplementary Table 5).

This 66.4-kb region that had reduced coverage in all 10 centers comprises 0.74% of the coding bases in the GeneTests list. These loci represent the regions that had reduced coverage in at least 90% of the 50 samples from each of the eight ES projects as well as the two GS projects. Repeat analyses using 100 samples from nine centers demonstrated similar results, suggesting that this is not a function of sample size (Supplementary Table 6). Figure 1d illustrates a pairwise comparison of the centers, demonstrating both the percent overlap and absolute overlap in reduced coverage regions between any two centers. Greater correlation was seen between the two GS centers (I and J) and between the ES centers using similar capture designs.
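A pairwise comparison of this kind can be sketched with set operations on the reduced coverage bases of each center. The text does not state how percent overlap was normalized, so the sketch assumes shared bases divided by the union of the two centers' reduced coverage bases; the function name and toy data are hypothetical.

```python
from itertools import combinations


def pairwise_overlap(reduced_by_center):
    """Absolute and percent overlap of reduced coverage bases per center pair.

    `reduced_by_center` maps a center label to a set of reduced coverage
    positions. Percent overlap is computed as shared / union * 100, which is
    an assumption of this sketch.
    """
    results = {}
    for a, b in combinations(sorted(reduced_by_center), 2):
        shared = len(reduced_by_center[a] & reduced_by_center[b])
        union = len(reduced_by_center[a] | reduced_by_center[b])
        percent = 100.0 * shared / union if union else 0.0
        results[(a, b)] = (shared, percent)
    return results


toy = {"A": {1, 2, 3}, "B": {2, 3, 4}, "C": set()}
for pair, (shared, percent) in pairwise_overlap(toy).items():
    print(pair, shared, percent)
```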

Among the centers performing ES, 201,011 bp had reduced coverage at all eight centers (Figure 2a). Additional bases with reduced coverage uniquely at each ES center ranged from 171 bp to 205,293 bp. Figure 2b illustrates the number of bases, complete exons, and complete genes in the GeneTests list successfully covered at one or more centers using ES. Ninety-four percent of the genes in GeneTests (4,370) were successfully covered in their entirety by at least one of the eight ES centers. Only 17% of the genes (814) were completely covered by all eight centers. Two hundred eighty-six genes from GeneTests contained at least one region that had reduced coverage in all eight ES centers. The specific genes involved in these reduced coverage regions are listed in Supplementary Table 4. Some regions had reduced coverage only at the ES centers and others only at the GS centers. Of the 201 kb that had reduced coverage at all eight ES centers, 134.7 kb were successfully sequenced in the two GS centers (Supplementary Figure 1). Conversely, of the 105 kb that had reduced coverage in both GS centers, 39.4 kb were successfully sequenced in at least one of the exome centers.
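The three categories discussed here (bases with reduced coverage at every ES center, bases unique to one center, and ES-wide gaps that GS covers) map directly onto set intersection, difference, and union. The sketch below uses hypothetical function names and toy position sets; it illustrates the bookkeeping rather than the tooling used in the study.

```python
def shared_by_all(sets):
    """Positions present in every set (empty if no sets are given)."""
    sets = list(sets)
    return set.intersection(*sets) if sets else set()


def unique_to_each(reduced_by_center):
    """For each center, positions with reduced coverage at that center only."""
    unique = {}
    for center, bases in reduced_by_center.items():
        others = set().union(
            *(b for c, b in reduced_by_center.items() if c != center)
        )
        unique[center] = bases - others
    return unique


# Toy reduced coverage positions for three ES centers and two GS centers.
es = {"A": {1, 2, 3, 4}, "B": {2, 3, 4, 5}, "C": {3, 4, 6}}
gs = {"I": {4, 7}, "J": {4, 8}}

es_shared = shared_by_all(es.values())
# Reduced at every ES center but well covered at both GS centers, i.e. not
# flagged at either GS center.
rescued_by_gs = es_shared - set().union(*gs.values())

print(es_shared, unique_to_each(es), rescued_by_gs)
```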

To assess potential clinical relevance, these reduced coverage regions were cross referenced with two curated databases of relationships between genetic variants and clinical phenotypes, ClinVar¹⁷ and HGMD.¹⁶ The goal of this analysis was to determine if any of the reduced coverage regions contained any known disease-associated loci. Among 30,861 disease-associated variants in ClinVar, the number falling on reduced coverage bases at each sequencing center ranged from 93 to 1,323 (Figure 1e). Of the 146 GeneTests genes that had reduced coverage at all 10 centers, 22 genes had a reduced coverage region that overlapped with a clinical variant in HGMD or ClinVar (Table 2). Similarly, of the 286 GeneTests genes that had reduced coverage at all eight ES centers, 28 genes had a reduced coverage region that overlapped with a clinical variant in HGMD or ClinVar (Table 2). Details of the phenotypes associated with these specific clinically relevant positions can be found in Table 2.

Table 2 Genes containing reduced coverage bases found in all 10 centers or all 8 ES centers that overlap with known disease-associated variants in ClinVar or HGMD

Characteristics of reduced coverage regions and exons

The reduced coverage bases across all centers were made up of 735 distinct contiguous intervals. Forty-two percent of these intervals (309) had lengths ranging from 1 to 5 bp (Figure 3a), and these short fragments were predominantly GC-rich (>70%). The remaining reduced coverage intervals ranged in length from 5 to 490 bp and though they only made up 58% of the reduced coverage intervals, they accounted for 65.7 kb (97%) of the total bases for the reduced coverage regions in this analysis. A similar analysis for just the exome centers is shown in Supplementary Figure 2.

The 4,656 genes in the GeneTests list are made up of 67,759 exons. Of those, 533 exons had reduced coverage loci at all 10 centers. The reduced coverage loci fell into two distinct groups. In the first group, less than 20% (and most often less than 10%) of the entire exon had reduced coverage bases. In the second group, most of the exon (>90%) had reduced coverage (Figure 3b). Exons with missing intervals spanning >90% of the entire coding sequence had lengths ranging from about 17 bp to over 2,112 bp, with the median length around 105 bp. Thirty-six percent (194 of 533) of the problematic exons were either the first or last exon of the gene, much higher than expected by chance alone, given that only 17% of the 67,759 exons are either first or last exons (Supplementary Table 7). The GC content also contributed to a large number of the reduced coverage loci, with 29% of the reduced coverage regions having a GC content >80%, as compared with 0.15% of the bases in GeneTests overall (Figure 3c).
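The comparison with chance for first and last exons can be made concrete with a one-sided binomial tail probability. The text does not say which test, if any, was used, so the calculation below is an illustrative assumption: it treats the 533 exons as independent draws with a 17% background rate (the rounded figure quoted above), and the function names are hypothetical.

```python
import math


def log_binom_pmf(k, n, p):
    """Natural log of the Binomial(n, p) probability mass at k."""
    return (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
    )


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    logs = [log_binom_pmf(i, n, p) for i in range(k, n + 1)]
    largest = max(logs)
    return math.exp(largest) * sum(math.exp(v - largest) for v in logs)


n_exons = 533          # exons with reduced coverage at all 10 centers
observed = 194         # of those, first or last exons
background_rate = 0.17  # share of first or last exons among all 67,759 exons

expected = n_exons * background_rate
tail = binom_upper_tail(observed, n_exons, background_rate)
print(f"expected about {expected:.1f}, observed {observed}, P(X >= observed) = {tail:.3g}")
```

Under these assumptions roughly 91 first or last exons would be expected against 194 observed, and the tail probability is vanishingly small, consistent with the statement that the enrichment exceeds chance.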

Potential clinical implications of reduced coverage regions

One goal of identifying the reduced coverage regions in clinical sequencing data was to understand the clinical implications of incomplete sequencing of potentially relevant genes. Several reduced coverage regions were identified that could affect the molecular diagnosis of patients for a variety of phenotypes (Table 2; Supplementary Table 4). Many of these appeared to be due to issues with high similarity with other parts of the genome leading to an inability of reads to be uniquely aligned. For instance, the STRC gene has recently been revealed to be a major contributor to congenital sensorineural hearing loss; however, single-nucleotide variants and small indels cannot be reliably detected via NGS due to STRC having 99.6% identity with the coding region of a nearby pseudogene.^18,19,20 Similarly, current ES and GS approaches have a difficult time detecting pathogenic variants for adult-onset polycystic kidney disease, a disease with a prevalence of 1/400–1/1,000 and for which 85% of pathogenic variants identified are in PKD1.²¹ PKD1 is part of a genomic region that has been duplicated six times on chromosome 16, leading to an inability to accurately map reads over a large portion of the gene.^22,23,24 For both STRC and PKD1, specific targeted sequencing strategies can be deployed to improve the coverage of these genes to address the presence of reduced coverage bases resulting from ES or GS.

Other clinically relevant genes that have reduced coverage likely have simpler solutions to accurately detect variation. SHOX, a gene associated with short stature, is part of the pseudo-autosomal regions that occur at the ends of both the X and Y chromosome. Since ES and GS typically align against the default human reference, this region is indicated as occurring on two different chromosomes (X and Y), thus mimicking a region with high similarity. Defining this region as its own chromosome (XY) would mitigate the mappability issues associated with this region.

Discussion

Massively parallel sequencing data are increasingly being used for clinical decision making. In these clinical contexts, it is critical to understand parameters that affect sequencing quality and to establish methods for measuring and reporting clinical sequencing results. In addition to knowing that data are accurate, clinicians and patients using clinical sequencing data need reassurance that a lack of clinically relevant findings reflects true negatives and is not due to inconclusive sequencing results. Moreover, it is important for everyone to understand that no test is 100% sensitive and that clinical decisions need to be made in light of a valid metric of sensitivity. Coverage metrics are a crucial component of the overall sensitivity of NGS.

In this report, we have proposed a definition for reduced coverage regions and have established a set of standards for variant calling in clinical sequencing applications. To enable sequencing centers to assess the regions of poor sequencing quality in their own data, we optimized a tool, ExCID (now publicly available for use), which provides a list of reduced coverage loci within genes or regions of particular interest. To demonstrate an approach for examining reduced coverage regions in clinical sequencing data, we used these tools on clinical data generated in 10 projects from different sequencing centers. This approach identified reduced coverage regions in clinically relevant genes, including known clinically relevant loci that were uniquely missed at individual centers, in multiple centers, or in all centers.

Comparing the reduced coverage regions across the various centers allowed categorization of the problematic regions and suggests possible solutions to improve standards. Reduced coverage regions that were unique to an individual center conducting ES were likely due to specific methodology at that center (Table 1). Choices about capture method, mean target coverage, or analysis pipeline may account for differences between centers. Since these parameters can be changed, regions that have reduced coverage only at a particular center can likely be salvaged by modifying the sequencing approach. For example, deeper sequencing is likely to reduce the number of reduced coverage regions that are unique to any single center (Supplementary Figure 3).

In contrast, loci that had reduced coverage at all or most of the ES centers but not at the GS centers were likely due to difficulty with hybridization capture of the region. This might include difficulty generating appropriate baits for the relevant regions or difficulty with capture itself. Alternative bait design or orthogonal methods for sequencing these regions might help salvage these specific regions. Finally, there were certain regions in the genome that had reduced coverage at all 10 centers, regardless of sequencing strategy. Although the total number of regions in this category was small, they did contain some potentially clinically relevant sites (Table 2). While it may be difficult to develop additional methods to salvage these regions, it is important for clinical sequencing centers to note the inability to provide conclusive sequencing information about these regions, especially those that may have clinical implications.

Centers conducting clinical sequencing can use the tools and approach described in this study to analyze their own clinical sequencing data, which may be particularly important for clinical laboratories conducting test validation. Once a list of genes or regions of interest has been identified, it can be converted to a list of coordinates (see Materials and Methods). Using these coordinates and a collection of representative sequencing data, centers can run ExCID on these data to identify those regions that have reduced coverage in at least 90% of the samples. By comparing these reduced coverage regions with the data presented in this study, centers may be able to better understand the reasons for poor coverage in their own data and develop potential salvage approaches, as detailed above.

While this study provides a generalized approach for analyzing clinical sequencing data for reduced coverage regions, there are a number of caveats. First, the data from each center represented specific cohorts from each project. These were intended to be exemplars to demonstrate the approach to analyzing data at an individual center and between multiple centers; they do not represent a comprehensive survey of human sequencing data. Second, although eight centers using ES data were represented, only two of our centers used GS data. Therefore, conclusions regarding the differences between GS and ES must be considered with that limitation in mind. Third, this analysis assumes that centers are performing germ-line sequencing to find heterozygous single-nucleotide polymorphisms. For labs seeking to identify variants with a lower allelic fraction (for example, cases of germ-line mosaicism or somatic mutations), additional considerations will be relevant and standards proposed here are unlikely to be sufficient. Finally, all 10 centers used Illumina-based sequencing approaches. At present, there are several other platforms available for clinical sequencing, which may provide different results. Moreover, the rapid evolution of platforms suggests that the specific data here may not be exactly reproduced using the most current technologies. Indeed, the technology used at most of the centers in this study has already evolved since the generation of data for this study. Nevertheless, though the data on reduced coverage regions at 10 centers may not be representative of the breadth of technologies used today, the approach to analyzing reduced coverage regions remains platform agnostic, and can be applied to any clinical sequencing data.

When communicating results to clinicians and patients, sequencing centers need to be able to recognize and address reduced coverage regions. The tools and framework defined in this study can provide information about regions that have systematically reduced coverage at the center, and regions that have reduced coverage in any individual sample, which may also include additional regions that have sporadically reduced coverage. Both are important for quality control and communication of clinical sequencing test results. Centers need to understand systematic reduced coverage regions in order to recognize the limitations of their clinical sequencing tests and, when appropriate, to modify approaches to improve sequencing quality. At the individual level, regions that have reduced coverage in any particular sample may have direct clinical implications. Therefore, it is important that these regions be accurately communicated to the clinicians using the reports, so that they can correctly factor this into their decision making. Communicating clinical sequencing data to clinicians and patients remains challenging, and including issues of sequencing standards adds further complexity, but it is important. Further research into specific approaches to documenting and communicating these standards will be required.

Acknowledgments

The authors received the following financial support: NHGRI/NIH U01HG006500 and NHGRI/NIH U41HG006834 (M.L. and H.L.R.); NHGRI/NCI U01 HG006507 and U01 HG007307 (M.O.D.); NHGRI U01 HG006487 (B.C.P.); NHGRI U01 HG006500 (E.A.T.); 5R01DA030976-04, 5U01HG006487-03, and 1U19HD077632-02 (K.C.W.); NHGRI/NCI U01 HG06507, NHGRI/NIH U01 HG007307 and NHGRI/NIH U01 HG007292 (D.A.N.); NHGRI U01 HG006492 (L.A.G. and N.W.); NHGRI/NCI 1U01HG00648, NHGRI U54-HG003273, and NHGRI/NIH 1UM1HG008898-01 (C.J.B., R.V.S., K.W., and D.M.M.); and NHGRI ZIA HG200359-09 (L.G.B. and C.S.H.). L.A.H. and C.M.H. are members of the National Institutes of Health CSER staff team, responsible for scientific management of the CSER program.

Author information

Current affiliation: New York Genome Center, New York, New York, USA
Current affiliation: Bioinformatics Core, Regeneron Pharmaceuticals, Inc., Tarrytown, New York, USA
Current affiliation: National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland, USA
The first two authors are co-first authors.
The last four authors are co-senior authors.

Authors and Affiliations

Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
Rashesh V Sanghvi MS, Christian J Buhay BS, Kimberly Walker MS & Donna M Muzny MSc
Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
Bradford C Powell MD, PhD
Laboratory for Molecular Medicine, Partners HealthCare Personalized Medicine, Cambridge, Massachusetts, USA
Ellen A Tsai PhD, Matthew S Lebo PhD & Heidi L Rehm PhD
Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, Massachusetts, USA
Ellen A Tsai PhD
UW Medicine Center for Precision Diagnostics, and Department of Pathology, University of Washington, Seattle, Washington, USA
Michael O Dorschner PhD & David S Hanna BS
Medical Genomics and Metabolic Genetics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
Celine S Hong PhD & Leslie G Biesecker MD
Department of Pathology, Brigham & Women’s Hospital and Harvard Medical School, Boston, Massachusetts, USA
Matthew S Lebo PhD & Heidi L Rehm PhD
Department of Biomedical and Health Informatics, Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, USA
Ariella Sasson PhD
Division of Genomic Diagnostics, Department of Pathology and Laboratory Medicine, Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, USA
Ariella Sasson PhD
Department of Genome Sciences, University of Washington, Seattle, Washington, USA
Sean McGee PhD & Deborah A Nickerson PhD
HudsonAlpha Institute for Biotechnology, Huntsville, Alabama, USA
Kevin M Bowling PhD, Gregory M Cooper PhD & David E Gray MS
Department of Pathology, University of Michigan, Ann Arbor, Michigan, USA
Robert J Lonigro MS, Christine A Brennan BS & Dan R Robinson PhD
Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
Andrew Dunford BS, Carrie Cibulskis BSc, Mauricio O Carneiro PhD, Heidi L Rehm PhD, Levi Garraway MD, PhD & Nikhil Wagle MD
Michigan Center for Translational Pathology, University of Michigan Medical School, Ann Arbor, Michigan, USA
Robert J Lonigro MS, Christine A Brennan BS & Dan R Robinson PhD
Renaissance Computing Institute, Chapel Hill, North Carolina, USA
Joshua Sailsbery PhD
Division of Genomic Medicine, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA
Lucia A Hindorff PhD & Carolyn M Hutter PhD
Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
Avni Santani PhD & Mahdi Sarmady PhD
Department of Pathology and Laboratory Medicine, Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, USA
Avni Santani PhD & Mahdi Sarmady PhD
Center for Cancer Precision Medicine and Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, Massachusetts, USA
Levi Garraway MD, PhD & Nikhil Wagle MD
Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, Massachusetts, USA
Levi Garraway MD, PhD & Nikhil Wagle MD
Howard Hughes Medical Institute, Chevy Chase, Maryland, USA
Levi Garraway MD, PhD

Consortia

on behalf of the NHGRI Clinical Sequencing Exploratory Research (CSER) Consortium

Corresponding authors

Correspondence to Donna M Muzny MSc or Nikhil Wagle MD.

Ethics declarations

Conflict of Interest

L.G.B. is an uncompensated consultant for Illumina and receives royalties from Genentech and honoraria from Wiley Blackwell. L.A.G. is a consultant for Foundation Medicine, Novartis, Boehringer Ingelheim, and Third Rock; an equity holder in Foundation Medicine; and a member of the Scientific Advisory Board at Warp Drive. L.A.G. receives sponsored research support from Novartis, Astellas, BMS, and Merck. N.W. is a consultant for Novartis; is an equity holder in Foundation Medicine; and receives sponsored research support from Novartis, Genentech, and Merck.

About this article

Cite this article

Sanghvi, R.V., Buhay, C.J., Powell, B.C. et al. Characterizing reduced coverage regions through comparison of exome and genome sequencing data across 10 centers. Genet Med 20, 855–866 (2018). https://doi.org/10.1038/gim.2017.192

Received: 09 August 2017
Accepted: 05 September 2017
Published: 16 November 2017
Issue Date: August 2018
DOI: https://doi.org/10.1038/gim.2017.192
