Introduction

Whole-exome sequencing (WES) has been used increasingly in clinical diagnostics for a variety of indications to identify the underlying genetic cause of disease. Although single-gene testing and gene panels are still often used when a specific disorder associated with a small number of genes is suspected, WES is increasingly being utilized earlier in the diagnostic evaluation, especially for disorders that are genetically heterogeneous, such as complex neurologic diagnoses and multiple congenital anomalies.1 WES has been used as a method of gene discovery in large series of patients with autism, epilepsy, brain malformations, congenital heart disease, and neurodevelopmental disabilities, and it has effectively identified many novel disease genes and pathways.2,3,4,5,6 The yield of WES in clinical series ranges from 22 to 26%; however, it is still unclear which clinical indications are most likely to yield a diagnosis using WES.7,8,9,10,11 For example, the diagnostic yield in patients with ataxia was 12.8% in one clinical case series and 44.1% in another.7,8 Furthermore, because WES is comprehensive and unbiased in its analysis of all known disease-causing genes, it has the advantage of identifying more than one genetic condition even when the clinical presentation does not make it obvious that there is more than one diagnosis.8,9

As a clinical diagnostic laboratory, we sought to review our experience analyzing 3,040 consecutive cases utilizing WES over the past 3 years to better understand the clinical contexts in which WES can provide a plausible explanation for the presenting symptoms.

Materials and Methods

We reviewed 3,040 consecutive cases referred for clinical WES from January 2012 until October 2014.

Clinical categorization

The referring physician provided the primary clinical diagnoses and International Classification of Diseases–version 9 (ICD-9) codes, which were used to inform the selection of a primary phenotype using top-level Human Phenotype Ontology terms. The amount of clinical information provided by the referring physicians was variable. Some physicians provided only ICD-9 codes, others completed the phenotype checklist on the submission form or sent a clinical summary, and others sent multiple clinic notes and previous laboratory results. All clinical information provided by the referring physician was thoroughly reviewed to ensure that the primary phenotype selected was appropriate. Specific neurological diagnoses such as seizures or autism spectrum disorder were selected as the primary phenotype category for patients in whom it was the primary feature and in whom other neurological diagnoses were not present. Otherwise, “abnormality of the nervous system” was selected for patients with multiple neurological diagnoses. Patients with multiple congenital anomalies were assigned to that category rather than any single-organ Human Phenotype Ontology term. In addition to the primary phenotypic categorization, all other symptoms were noted by selecting any applicable lower-level Human Phenotype Ontology terms and Human Gene Mutation Database (HGMD) disease terms.12,13

Genomic DNA was extracted from whole blood obtained from the affected individual and any submitted family members. Clinicians were encouraged to provide blood or DNA specimens for both parents and all available affected family members whenever possible. WES was performed for the proband and two other family members (mother and father or up to two additional affected family members if available) on exon targets isolated by capture using the Agilent SureSelect Human All Exon V4 (50 Mb) kit (Agilent Technologies, Santa Clara, CA). One microgram of DNA was sheared into 350- to 400-bp fragments, which were then repaired, ligated to adaptors, and purified for subsequent polymerase chain reaction (PCR) amplification. Amplified products were then captured by biotinylated RNA library baits in solution according to the manufacturer’s instructions. Bound DNA was isolated with streptavidin-coated beads and reamplified. The final isolated products were sequenced using the Illumina HiSeq 2000 or 2500 sequencing system with 100-bp paired-end reads (Illumina, San Diego, CA). DNA sequence was mapped to the published human genome build UCSC hg19/GRCh37 reference sequence using the Burrows-Wheeler Aligner (BWA), with the latest internally validated version at the time of sequencing, progressing from BWA v0.5.8 through BWA-MEM v0.7.8 (ref. 14).

Targeted coding exons and splice junctions of known protein-coding RefSeq genes were assessed for average depth of coverage and data quality threshold values, with a minimum depth of 10× required for inclusion in downstream analysis. Local realignment around insertion-deletion sites and regions with poor mapping quality was performed using the Genome Analysis Toolkit Indel Realigner v1.6 (ref. 15). Variant calls were generated simultaneously on all sequenced family members using SAMtools v0.1.18 (ref. 16). All coding exons and surrounding intron/exon boundaries up to 13 bp 5′ and 6 bp 3′ of the splice junction were analyzed. Automated filtering removed common sequence changes (defined as ≥10% minor allele frequency in the 1000 Genomes database). As an additional quality control measure, the kinship coefficient was calculated from the WES data for all sequenced family members using kinship-based inference.10 This allowed for the identification of misspecified relationships and unreported consanguinity for relations as distant as second cousins.
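The automated filtering step described above can be illustrated with a minimal sketch. It is not the laboratory's pipeline: the `Variant` record, its field names, and the choice to apply the 10× depth threshold at each variant site are assumptions made for illustration. Only the two thresholds (10× depth, 10% minor allele frequency in 1000 Genomes) come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_DEPTH = 10          # minimum depth of coverage stated in the methods
COMMON_MAF = 0.10       # variants at or above this 1000 Genomes MAF are removed


@dataclass
class Variant:
    """Hypothetical variant record; field names are illustrative only."""
    gene: str
    depth: int
    maf_1000g: Optional[float]  # None when the variant is absent from 1000 Genomes


def passes_automated_filter(v: Variant) -> bool:
    """Keep variants with adequate coverage that are not common in 1000 Genomes."""
    if v.depth < MIN_DEPTH:
        return False
    if v.maf_1000g is not None and v.maf_1000g >= COMMON_MAF:
        return False
    return True


def filter_variants(variants: List[Variant]) -> List[Variant]:
    """Return the subset of variants that would go on to further analysis."""
    return [v for v in variants if passes_automated_filter(v)]
```

In this sketch a variant missing from 1000 Genomes is retained, on the assumption that absence from a population database indicates rarity rather than poor data quality.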

WES data for all sequenced family members was analyzed using GeneDx’s XomeAnalyzer (a variant annotation, filtering, and viewing interface for WES data), which includes nucleotide and amino acid annotations, population frequencies (NHLBI Exome Variant Server, 1000 Genomes, and internal databases), in silico prediction tools, amino acid conservation scores, and mutation references. Variants were filtered based on inheritance patterns, variant type, gene lists of interest developed internally, phenotype, and population frequencies, as appropriate (Figure 1). Phenotype-driven gene lists were generated based on Human Phenotype Ontology and HGMD gene–phenotype associations, but analysis was not restricted to these gene lists. Resources including the HGMD, 1000 Genomes database, NHLBI GO Exome Sequencing Project, OMIM, PubMed, and ClinVar were used to evaluate genes and detect sequence changes of interest. Variants were interpreted using guidelines similar to the recently published American College of Medical Genetics and Genomics (ACMGG) guidelines.11 The general assertion criteria for variant classification are publicly available on the GeneDx ClinVar submission page (http://www.ncbi.nlm.nih.gov/clinvar/submitters/26957/). Identified sequence changes of interest were confirmed, and segregation was determined in all family members for whom biospecimens were available, by conventional dideoxy DNA sequence analysis, using a new DNA preparation when blood was provided.

Figure 1

Method for analysis and categorization of genetic variants identified from exome sequencing. The median number of results requiring human analysis is provided for each. Noisy copy-number variant (CNV) samples, defined as >50 CNV calls, have been omitted in calculating the CNV search number because CNVs are not evaluated in these cases except to check for very large events. MAF, minor allele frequency; WES, whole-exome sequencing.

Uniparental disomy (UPD) and regions of homozygosity were identified directly from the WES data using custom Perl scripts that identify segments of at least 80% Mendelian error or homozygosity, respectively, over regions of at least 8 Mb. Copy-number variants (CNVs) were also called from the WES data using a relative coverage method we have described previously.17 All potentially diagnostic UPD and CNV results were confirmed before reporting by an appropriate orthogonal measure, such as chromosome microarray, exon aCGH, multiplex ligation-dependent probe amplification, Sanger sequencing, or qPCR.
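The segment criterion above (at least 80% of informative sites flagged, over at least 8 Mb) can be sketched as follows. The original analysis used custom Perl scripts that are not described further, so this Python version is only an illustration under stated assumptions: the input is a position-sorted list of `(position, flagged)` pairs for one chromosome, where `flagged` means homozygous (for regions of homozygosity) or Mendelian error (for UPD), and the greedy scan shown here is a hypothetical choice of algorithm.

```python
from typing import List, Tuple


def find_segments(
    sites: List[Tuple[int, bool]],
    min_span_bp: int = 8_000_000,   # 8 Mb minimum region size from the text
    min_fraction: float = 0.80,     # 80% of sites must be flagged, from the text
) -> List[Tuple[int, int]]:
    """Greedy scan for non-overlapping segments that meet both thresholds.

    `sites` must be sorted by position. Each reported segment starts and
    ends on a flagged site.
    """
    # Prefix sums of flagged sites allow O(1) fraction lookups.
    prefix = [0]
    for _, flagged in sites:
        prefix.append(prefix[-1] + (1 if flagged else 0))

    segments = []
    n = len(sites)
    i = 0
    while i < n:
        best = None
        if sites[i][1]:
            # Try the longest candidate first, shrinking from the right.
            for j in range(n - 1, i, -1):
                if sites[j][0] - sites[i][0] < min_span_bp:
                    break  # any shorter segment is also below the size cutoff
                if not sites[j][1]:
                    continue
                fraction = (prefix[j + 1] - prefix[i]) / (j - i + 1)
                if fraction >= min_fraction:
                    best = j
                    break
        if best is None:
            i += 1
        else:
            segments.append((sites[i][0], sites[best][0]))
            i = best + 1
    return segments
```

The scan is quadratic in the number of sites, which is acceptable for a sketch but would likely need a more efficient approach on real exome data.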

For those requesting mitochondrial genome testing, the entire proband mitochondrial (Mt) genome (16,569 base pairs) was amplified by two separate long-range PCRs using different sets of primers in the D-loop region, with each set of primers amplifying the entire mitochondrial genome.18 The resulting long PCR products were pooled together at equal molar ratio and enzymatically fragmented using Illumina’s Nextera transposase reaction. The final isolated products were sequenced using the Illumina MiSeq sequencing system with 151-bp paired-end reads. Alignment and variant calling were performed as described for WES, except additional variant calls were added using the Genome Analysis Toolkit Unified Genotyper v1.6 and a custom GeneDx program designed to detect heteroplasmic variants down to 1.5% variant frequency. Large deletions of 1,500 bp or more were also called from the mitochondrial genome sequencing data by evaluating coverage using custom Perl scripts and the R DNAcopy package.19 All reportable variants with heteroplasmy 15% or higher were confirmed using Sanger sequencing, and all definitive or predicted pathogenic variants with heteroplasmy lower than 15% were confirmed in the proband using real-time ARMS qPCR (ref. 20).

Fifty-eight cases were reanalyzed within a year from first analysis at the request of the ordering physician with the hope that advances in the field would provide clarification of the diagnosis. In those cases, variants were recalled and annotated and variant analysis was performed as described above.
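The confirmation rule for mitochondrial variants given in the methods can be written as a small decision function. This is a sketch under assumptions: the function name and return values are hypothetical, the input is assumed to be a reportable variant, and returning no confirmation method for low-level variants that are not definitive or predicted pathogenic is an interpretation, since the text does not say how those were handled.

```python
from typing import Optional

DETECTION_FLOOR = 0.015   # heteroplasmy detection limit stated in the text (1.5%)
SANGER_THRESHOLD = 0.15   # at or above 15%, Sanger sequencing was used


def confirmation_method(heteroplasmy: float, pathogenic_or_predicted: bool) -> Optional[str]:
    """Return the confirmation assay for a reportable mitochondrial variant."""
    if heteroplasmy < DETECTION_FLOOR:
        return None  # below the stated detection limit
    if heteroplasmy >= SANGER_THRESHOLD:
        return "Sanger sequencing"
    if pathogenic_or_predicted:
        return "real-time ARMS qPCR"
    return None  # assumption: not covered by the rule in the text
```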

The primary results were classified into the four categories below. More than one result could have been provided for a patient.

  • Category 1 (definitive result): pathogenic or likely pathogenic variant(s) in a known disease gene associated with the reported phenotype.

  • Category 2 (possible/probable diagnosis): variant(s) in a known disease gene possibly associated with the reported phenotype. This category includes novel variants, including missense variants or in-frame insertions/deletions in disease genes, that overlap the phenotype provided for the proband. This category also includes recessive conditions that overlap with the phenotype provided for the proband in which only a single pathogenic variant is identified.

  • Category 3 (novel candidate gene): variant(s) predicted to be deleterious in a novel candidate gene that has not previously been implicated in human disease or for which the published data to support human disease association may not yet be definitive. Supporting data could be based on model organism data, CNV data, tolerance of the gene to sequence variation, data about tissue or developmental timing of expression, or knowledge of the gene function and pathway analysis. Further research is required to evaluate any of the suggested candidate genes.

  • Category 4 (negative result): no variants in genes associated with the reported phenotype identified.

Any of these primary result categories could also have had secondary findings from the ACMGG list of 56 genes reported concurrently. The presence of a secondary finding was not factored into the overall classification of the case for the primary analysis. For the secondary findings, results were analyzed in 2,091 cases. Starting in May 2013, 56 genes, as recommended by the ACMGG, were examined for variants definitively known to be pathogenic or expected to be pathogenic due to a predicted loss of function allele. For the secondary findings, results were classified as either positive or nothing to report based on the ACMGG guidelines.

Statistical analysis was performed with Pearson’s chi-square test.

Results

We analyzed 3,040 unique, consecutive WES cases, with 532 (17.5%) submitted as proband only, 200 (6.6%) with one additional family member, 2,081 (68.4%) with two additional family members, most often parents, and 227 (7.5%) with three or more additional family members. WES was performed in up to three family members to maximize sensitivity and specificity of variant calling and interpretation and to enable identification of de novo variants and determination of phase. Mitochondrial genome sequencing was also performed in 40% of cases (N = 1,221). The mean age of the probands was 11.4 ± 13.2 years, and the median age was 6.8 years (Supplementary Table S4 online).

Exome sequencing produced an average of 11 GB of sequence per sample. Mean coverage of targeted regions was 140× per sample, with >98% covered with at least 10× coverage. Sequencing produced an average Illumina Q score of 36, with an average Q30 value of 93%. Variant calling on the entire genome produced ~100,000 variants per sample. Limiting variants to just those within the region of interest and filtering of common SNPs (>10% frequency present in 1000 Genomes database) resulted in ~5,000 variants per proband sample. The median number of results requiring human evaluation for each of the automated searches ranged from 5 to 70 (Figure 1). Note that there is some overlap between categories because a variant could, for instance, be a known HGMD variant, de novo, and missense, and thus appear once in each of three searches. For a typical trio, approximately 200 unique variants were generated by the automated searches for manual review.

A large number of patients were referred primarily for indications related to the nervous system (35.6%) or multiple congenital anomalies (24.0%) (Table 1); 5.1% had isolated seizures and 4.3% were referred for the primary diagnosis of autism spectrum disorder. Many patients had a neurodevelopmental disorder, often with other associated clinical features. Intellectual disability or developmental delay was observed as a clinical feature in 51.8%, seizures were observed in 27.3%, and autism spectrum disorder was observed in 16.0%; 5.7% of patients were referred with the primary diagnosis of a mitochondrial disorder. Other less common indications for referral were muscular disorder (3.6%), metabolic disorder (2.8%), connective-tissue disorder (2.7%), and immunological disorder (2.1%).

Table 1 Distribution of primary Human Phenotype Ontology categories among probands analyzed by whole-exome sequencing

Overall, across the 3,040 cases, a definitive result was provided in 28.8%, a possible/probable result was provided in 51.8%, a candidate gene result as the only finding was provided in 7.6%, and a negative result was provided in 11.8%. A candidate gene result was reported in 24.2% of all cases; 16.6% of all cases received a report of a candidate gene in addition to other findings. The frequency of a definitive result was 23.6% when only the proband was analyzed (N = 542) and increased to 31.0% when three members of the family were analyzed by WES (N = 2,088) (P = 0.0008). When the age of the proband was restricted to 30 years or younger, the results were similar, with definitive results in 20.8% (N = 332) when only the proband was analyzed and increasing to 30.9% (N = 1,686) when three members of the family were analyzed (P = 0.0002). When the age of the proband was restricted to 12 months or younger, there was a definitive result in 32.0% (N = 25) when only the proband was analyzed, and it increased to 32.8% (N = 134) when three members of the family were analyzed (P = 0.95).
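The proband-only versus trio comparison above can be approximately reproduced with a Pearson chi-square test on a 2×2 table. The sketch below is illustrative only: the positive-case counts are reconstructed by rounding the reported percentages, which is an assumption, and no continuity correction is applied because the text does not say whether one was used. For one degree of freedom the P value follows from the complementary error function, so only the standard library is needed.

```python
import math
from typing import Tuple


def counts_from_yield(n: int, percent: float) -> Tuple[int, int]:
    """Reconstruct (positive, negative) counts from a group size and a reported yield."""
    positive = round(n * percent / 100)
    return positive, n - positive


def pearson_chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (chi-square statistic, two-sided P value with 1 degree of freedom).
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # survival function of chi-square, df = 1
    return chi2, p_value


# Definitive results: proband only (23.6% of 542) versus three family members (31.0% of 2,088).
proband_pos, proband_neg = counts_from_yield(542, 23.6)
trio_pos, trio_neg = counts_from_yield(2088, 31.0)
chi2, p_value = pearson_chi_square_2x2(proband_pos, proband_neg, trio_pos, trio_neg)
print(f"chi-square = {chi2:.2f}, P = {p_value:.4f}")
```

Because the counts are back-calculated from rounded percentages, the result should be read as a consistency check on the reported P value rather than an exact replication.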

In the analysis of the cases with a definitive result by mode of inheritance, 37.7% were autosomal dominant and de novo, 29.4% were autosomal recessive, 4.8% were X-linked and de novo, 6.5% were X-linked and inherited, 7.2% were autosomal dominant and inherited, 8.5% were autosomal dominant of unknown origin due to lack of parental samples, 3.5% were encoded in the mitochondria, and 2.3% were indeterminately autosomal dominant or recessive (Figure 2).

Figure 2

Mode of inheritance for positive cases. AD, autosomal dominant; AR, autosomal recessive.

UPD was identified in 11 cases, including one case of segmental UPD, and a definitive diagnosis was established in four of these cases, with three cases due to homozygosity for a recessive disorder and one case due to paternal UPD of chromosome 6. In 15 different families, somatic mosaicism in blood was identified and confirmed with Sanger sequencing in seven probands (0.8%) and eight parents (0.9%). The lower boundary for detecting mosaicism was approximately 20%. The genes in which mosaic pathogenic variants were identified included ADCY5, CDKL5 (two cases), GJA, NF1, TSC2, and GABRA1. WES data allowed us to identify 22 cases with a partial or complete gene deletion and one case with a gene duplication (2.6% of cases with deletions/duplications), all of which were confirmed with an alternate method. WES data also identified five cases with CNVs ranging from 1.4 to 21.6 Mb (Supplementary Table S3 online).

When examining the diagnostic yield for a definitive diagnosis, the yield was highest for problems with hearing (55%, N = 11), vision (47%, N = 60), the skeletal muscle system (40%, N = 43), the skeletal system (39%, N = 54), multiple congenital anomalies (36%, N = 729), the skin (32%, N = 31), the central nervous system (31%, N = 1,082), and the cardiovascular system (28%, N = 54) (Figure 3). The sample size is still modest for many of these indications but is large for the categories of multiple congenital anomalies and central nervous system disorders. For probands with specific neurological diagnoses either in isolation or associated with other clinical features, the overall diagnostic yield for a definitive diagnosis was 30% (N = 1,576) for intellectual disability or developmental delay, 28% (N = 830) for seizures, and 20% (N = 487) for autism spectrum disorder. Pathogenic variants in several disease genes were identified in multiple cases (Supplementary Table S1 online).

Figure 3

Percentage of cases with a definitive diagnosis based on the primary phenotype. N indicated in parentheses. CNS, central nervous system; MCA, multiple congenital anomaly.

Twenty-five of the cases in this series revealed two genetic diagnoses, and three cases had three distinct genetic diagnoses (Table 2). Of these 28 cases, 14.3% (N = 4) were consanguineous families. Although cases with multiple diagnoses are rare and there may be disagreements about the interpretation of variants between laboratories, in most of these cases a single diagnosis did not explain all of the clinical features, but all the clinical features were accounted for by both (or all three) molecular findings. One patient with developmental delay, hypotonia, dysmorphic features, nystagmus, ptosis, and laryngomalacia had de novo pathogenic variants in three genes: KIF21A, NF1, and DYRK1A. One of the consanguineous cases had both a de novo nonsense pathogenic variant in SATB2 and an inherited homozygous pathogenic variant for dihydrolipoamide dehydrogenase deficiency. In a small number of cases, there were overlapping clinical features in the two genetic diagnoses, and identification of a second genetic etiology was important because one of the conditions was treatable or there were clinical trials for treatment (glutaric aciduria type 1/myofibrillar myopathy, dihydrolipoamide dehydrogenase deficiency/SATB2 pathogenic variant, Tay-Sachs/muscle, eye, brain disease, adenylosuccinate lyase deficiency/STXBP1 pathogenic variant, and CASK pathogenic variant/adrenoleukodystrophy). Reanalysis of 58 cases resulted in definitive diagnoses in 11 cases (19.0%), due largely to novel associations of genes with human disease since the time of initial test report.

Table 2 Clinical characteristics and pathological variants identified in patients found to have more than one genetic diagnosis accounting for the phenotype

Secondary findings as recommended by ACMGG were offered to 2,382 participants, of whom 291 opted out (12.2%). Secondary findings (Supplementary Table S2 online) were analyzed in 2,091 cases, and 6.2% (N = 129) had reportable secondary findings. The distribution of clinical conditions for which secondary findings were reported was 34.9% (N = 45) for cardiomyopathies, 31.0% (N = 40) for long QT/Brugada syndrome/catecholaminergic polymorphic ventricular tachycardia, 14.7% (N = 19) for hereditary cancer syndromes, 10.9% (N = 14) for familial hypercholesterolemia, 7.0% (N = 9) for Marfan syndrome/aortic aneurysm, and 1.6% (N = 2) for malignant hyperthermia.
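The secondary-findings percentages above follow directly from the reported counts, as the short calculation below shows. The dictionary keys are shortened labels chosen for illustration; all counts are taken from the text.

```python
offered = 2382      # participants offered secondary findings
opted_out = 291     # participants who declined
analyzed = offered - opted_out

# Reportable secondary findings by clinical condition (counts from the text).
findings = {
    "cardiomyopathies": 45,
    "long QT/Brugada/CPVT": 40,
    "hereditary cancer syndromes": 19,
    "familial hypercholesterolemia": 14,
    "Marfan syndrome/aortic aneurysm": 9,
    "malignant hyperthermia": 2,
}
total_findings = sum(findings.values())

opt_out_rate = round(100 * opted_out / offered, 1)
positive_rate = round(100 * total_findings / analyzed, 1)
shares = {name: round(100 * n / total_findings, 1) for name, n in findings.items()}

print(f"analyzed: {analyzed}, opt-out: {opt_out_rate}%, positive: {positive_rate}%")
for name, share in shares.items():
    print(f"{name}: {share}%")
```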

Discussion

Our large consecutive series of 3,040 clinical WES cases provides data about the diagnostic yield. Overall, 28.8% of cases were provided with a definitive diagnosis. Compared with a diagnostic yield of 23.6% for proband-only cases, an increase to 31.0% was achieved when three members of the family were evaluated. This increase is due largely to the ability to detect de novo variants and determine phase when two variants in a gene are identified when both parents are tested, accounting for 42.5% of the cases solved. Our series has yielded results similar to, but slightly higher than, other case series, in which yields of 25% to 26% were reported, probably because we have performed WES on multiple family members. In addition to the cases with a definitive diagnosis, for 24.2% of the cases a candidate gene was reported that may in the future be reclassified as a presumed definitive diagnosis. Utilizing the power of our large data set, we were able to identify novel candidate genes in which multiple patients with similar phenotypes had rare variants predicted to result in loss of function, many of which were de novo. Access to this large number of cases has proven extremely valuable, and 8.3% (61/735) of the cases with reported candidate genes now include genes that have been published as disease-causing, including DDX3X, ARID2, CRB2, KAT6A, AHDC1, SLC1A4, SPATA5, PURA, SLC13A5, KCNB1, ASNS, NR2F2, CTCF, TUBB2A, COQ4, WDR45, and DYRK1A.21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37 As a result of our experience, we believe that candidate genes should be reported from WES analysis and shared in databases, such as GeneMatcher, so that clinicians, researchers, and clinical laboratories can be connected to facilitate more rapid dissemination of information and validation of novel disease genes.

Our series is larger than previously reported clinical diagnostic series and has the power to begin to address the yield of WES for a large number of clinical indications. Although the numbers are still modest for some indications, our series demonstrates a diagnostic yield of at least 25% for indications of deafness, blindness, muscular dystrophies/myopathies, skeletal dysplasias, dermatologic conditions, multiple congenital anomalies, intellectual disabilities/developmental delay/hypotonia, cardiac diseases, metabolic disorders, hematologic conditions, and seizures. The yield for isolated autism spectrum disorders was 14%. A total of 28 cases had two or three independent genetic diagnoses producing an aggregate phenotype that was usually the sum of independent phenotypes, confounding the clinician’s ability to arrive at a single diagnosis of the patient using only clinical information. Comprehensive approaches of WES enable diagnoses of these extremely complex cases.

In addition, many of the cases expanded the phenotype associated with the previously described genetic condition and would probably not have been diagnosed with targeted gene panels because they were not suspected clinically. All of these examples highlight the utility of an unbiased approach to genetic diagnosis.

Many of the patients referred had been through extensive prior genetic and metabolic evaluations, including karyotype, chromosomal microarray, gene panels, and/or single-gene testing. The yield of the test would have been expected to be even higher if WES had been used earlier in the diagnostic evaluation, and we suggest that WES will increasingly be considered as a first-line test for many indications, including neurodevelopmental disorders/intellectual disability, autism spectrum disorder, and certain congenital anomalies. This could shorten the time to diagnosis and limit the expense and burden of other approaches to diagnosis, including imaging studies and invasive procedures.

Additionally, methods to detect both mosaicism and CNVs have improved since we started testing, and we have identified mosaicism in the proband or parent in 1.6% of cases and copy-number changes ranging in size from a few exons of a gene to multiple genes in 2.4% of cases. Thus, the yield we report is likely to be the lower bound for this test in the future (Supplementary Table S5 online).

Importantly, several of the diagnoses resulted in immediate changes in treatment for the patient. A homozygous pathogenic variant in Aldolase B (ALDOB) diagnosed hereditary fructose intolerance in a child with recurrent coma and seizures and suggested the simple intervention of eliminating fructose from the diet. The diagnosis of pyridoxamine 5-prime-phosphate oxidase deficiency was made in a child with intractable seizures that are treatable with pyridoxal phosphate. The diagnosis of GLUT1 deficiency syndrome was made in a child with absence seizures, ataxia, and learning disability and immediately suggested treatment with a ketogenic diet. Segawa syndrome due to a tyrosine hydroxylase deficiency was diagnosed in a patient with progressive hypotonia, seizures, and developmental delay, indicating treatment with l-DOPA and selegiline. The diagnosis of acute intermittent porphyria in a patient with postural orthostatic tachycardia syndrome immediately suggested medications to prevent future episodes that can be lethal.

We identified secondary findings in 6.2% of the cases analyzed, a frequency slightly higher than that reported by other WES laboratories.7,8 This difference is probably due to the inclusion of likely pathogenic variants before consensus emerged about interpretation of the ACMGG guidelines to report secondary findings and other groups utilizing more stringent criteria for calling a variant pathogenic or likely pathogenic.

The unbiased approach of WES provides a valuable tool of increasing clinical utility to diagnose genetic conditions, especially in children and patients with severe and/or multisystemic conditions. It can be especially useful to detect de novo pathogenic variants using a trio design analyzing both biological parents and the proband. The clinical yield of this test will continue to increase over time, allowing providers to efficiently arrive at a diagnosis. This information will help avoid needless diagnostic procedures and begin support and care tailored to the patient’s diagnosis, inform reproductive decisions, and provide patients and families with closure by providing a diagnosis.

Disclosure

K.R., J.J., M.T.C., P.V., F.M., F.G., A.V.-B., K.G.M., D.M., R.B., S.S., B.F., D.P.-A., G.R., T.B., and S.B. are employees of GeneDx. N.S. is an employee of Takeda. J.N. is an employee of Pathway Genomics. J.T. and E.H. are employees of Invitae. W.K.C. is a consultant for BioReference Laboratories. Clinical informed consent was obtained for all subjects. The study was approved by the Institutional Review Board of Columbia University.