Introduction

Carrier screening is performed either preconception or during pregnancy to determine a couple’s risk of having a child with a recessive genetic disorder. The number of individuals who could benefit from such screening is substantial because roughly 2 million women give birth to their first child each year in the United States1. The disorders for which testing is recommended vary based on a number of different patient-specific factors. For instance, the American Congress of Obstetricians and Gynecologists recommends that screening for cystic fibrosis be offered to all women of reproductive age2 and that testing be performed for additional disorders if indicated by family history, partner’s carrier status, or ethnicity.3,4,5

Today, carrier screening is typically performed using focused genotyping technologies that are designed to interrogate specific mutations within a gene of interest. However, because of cost and complexity, these tests often do not include all known disease-causing mutations. By contrast, next-generation DNA sequencing (NGS) can comprehensively genotype a set of genes in a cost-efficient manner and is therefore poised to supplant current technologies for routine, high-volume carrier screening.

For NGS to be used for carrier screening in a clinical setting, it must satisfy at least three requirements. First, analytical accuracy must be both high and well characterized within the clinically relevant genes or regions. Previous reports have demonstrated a broad range of accuracy values, and in some cases it is unclear whether these values hold within the relevant regions of the genome.6,7,8,9 In addition, accuracy for insertions and deletions is generally either substantially lower or uncharacterized, and measured to lower precision. Second, the NGS workflow employed should yield data sufficient to cover the vast majority of targeted bases at a depth sufficient to make high-quality genotype calls (both variant and nonvariant). It has been noted, however, that the percentage of bases callable at a given depth varies widely with both the sample preparation workflow and the total amount of sequencing.8,10 Finally, the workflow must be highly robust and reproducible, which can often be achieved through automation. However, typical NGS sample preparation workflows are not amenable to high-throughput automation because of rate-limiting mechanical shearing, reaction purifications, size selections, and kitted reagent costs (typically $50–200 per sample).

Here, we describe an integrated NGS workflow that meets these requirements for carrier screening. The workflow combines automated, optimized molecular inversion probe target capture with molecular barcoding to maximize the sample throughput of an NGS machine and employs a novel read assembly–based alignment method that enables accurate identification of both substitution and insertion/deletion lesions ( Figure 1 ). We apply this workflow to sequence the protein-coding regions of 15 genes in which loss-of-function mutations cause recessive Mendelian disorders often included as part of routine carrier screening, and demonstrate through realistic simulation and comparison with Sanger sequencing data that our approach achieves high accuracies and detects the vast majority of disease-causing mutations.

Figure 1
figure 1

Next-generation DNA sequencing workflow. Genomic DNA samples are input to a molecular inversion probe capture reaction. Each target (depicted by blue and orange regions) is captured by multiple probes that anneal to nonoverlapping genomic intervals. PCR is performed using primers containing patient-specific barcodes, yielding barcode libraries (turquoise and purple). Equal volumes of the libraries are pooled and enter the Illumina HiSeq high-throughput sequencing workflow. Following sequencing, reads enter either the alignment only (AO, left) or Genotyping by Assembly-Templated Alignment (GATA, right) analysis pipeline. AO first partitions reads by sample molecular barcode, then in parallel for all samples performs short-read alignment, base-quality recalibration, realignment around putative indels, and genotyping. GATA partitions reads first by sample molecular barcode, then by target. Reads are assembled into contiguous sequences (contigs) that are then aligned to the reference genome. Raw reads are then aligned to the contigs, and raw-read mapping and variant information relative to the reference are determined using reference-contig and read-contig alignments. Finally, base-quality score recalibration and genotyping are performed on the mapped, raw reads. gDNA, genomic DNA.

Materials and Methods

Molecular inversion probe design

Molecular inversion probes11,12 were designed to capture the coding regions and certain well-characterized noncoding regions of 15 genes (see Supplementary Tables S1 and S2 online). The 5′ and 3′ targeting arms (extension and ligation, respectively) comprised a total of 40 nucleotides and were designed to flank 130-bp target regions. Further details can be found in the Supplementary Materials and Methods online.

Target capture, barcoding, and NGS

Genomic DNA was purchased from the Coriell Cell Repositories (Camden, NJ) or isolated from whole blood by the Gentra Puregene method (Qiagen, Gaithersburg, MD) concluding with an overnight incubation at 65 °C. All samples were considered “IRB Exempt” by Liberty IRB, our independent institutional review board. Genomic DNA was subjected to multiplex target capture using molecular inversion probes. Captured product was subjected to PCR to attach molecular barcodes in a manner that allowed sequencing from either end of the captured region.12 The PCR product was pooled and sequenced on the Illumina HiSeq 2000. Further details can be found in the Supplementary Materials and Methods online.

NGS data analysis with alignment-only algorithm

Raw .bcl files were converted to qseq files using bclConverter (Illumina). Fastq files were generated by “debarcoding” genomic reads using the associated barcode reads; reads for which barcodes yielded no exact match to an expected barcode, or contained one or more low-quality base calls, were discarded. The remaining reads were aligned to hg18 on a per-sample basis using Burrows-Wheeler Aligner version 0.5.7 for short alignments,13 and genotype calls were made using Genome Analysis Toolkit version 1.0.4168 after base-quality score recalibration, realignment (with GATK version 1.0.5083),14 and targeting arm removal (to prevent synthetic, reference-designed molecular inversion probe arm sequence from interfering with genotype calling). High-confidence genotype calls were defined as having depth ≥50 and strand bias score ≤0. Clinical significance of variant calls was determined by matching against a VCF-formatted database of disease-causing mutations curated from the literature, with equivalent insertion/deletion regions calculated as previously described.15

NGS data analysis with Genotyping by Assembly-Templated Alignment algorithm

Debarcoded fastq files were obtained as described above and partitioned by capture region (exon) using the target arm sequence as a unique key. Reads were assembled in parallel by exon using SSAKE version 3.7 with parameters “-m 30 -o 15”.16 The resulting contiguous sequences (contigs) were aligned to hg18 using BWA version 0.5.7 for long alignments17 with parameter “-r 1”. Short-read alignment was performed as described above, except that sample contigs (rather than hg18) were used as the input reference sequence. Software was developed in Java to accurately transfer coordinate and variant data (gaps) from local sample space to global reference space for every BAM-formatted alignment. Genotyping and base-quality recalibration were performed on the coordinate-translated BAM files using GATK version 1.6.5.

Sanger sequencing

PCR was carried out with the genomic DNA described in the “Target capture, barcoding, and NGS” section using a modified version of the protocol from Zimmerman et al.18 and using PCR primers from Jones et al.19 with M13 tails removed (regions in Supplementary Table S3 online). More information can be found in the Supplementary Materials and Methods online.

Sanger data analysis and cross-validation to NGS

Mutation Surveyor software (MS; Softgenetics, State College, PA) version 4.0.5 was used in batch-mode with default parameters to align ab1 files to target reference sequence and make genotype calls. Positions at which MS base calls did not match in the forward and reverse directions were removed from consideration. All high-quality NGS genotype calls (both reference and non-reference) within 10 bp (inclusive) of target exons were subjected to cross-validation against VCF-converted MS variant calls and orthogonal confirmation, if necessary (see Supplementary Figure S1 online). A detailed description of cross-validation to NGS is provided in the Supplementary Materials and Methods online.

Assessment of detectability of clinical mutations by simulation

A total of 145 Coriell samples were sequenced and analyzed by Genotyping by Assembly-Templated Alignment (GATA, described above). Specific fields (base sequence and qualities) within aligned reads (BAM records) from the Illumina sequencer were manipulated in silico to introduce the clinically relevant DNA lesion pattern within the context of empirical sequencing errors and sample-specific proximal variants. To simulate heterozygous carriers, input reads covering the mutation were chosen at random for sequence manipulation with an average probability of 0.5. All reads, whether manipulated or not, were output in fastq format for subsequent GATA analysis as described. This process was repeated for each of 81 clinically significant mutations whereupon genotyped (observed) alleles were cross-referenced back to the original simulated (expected) allele. Samples for which the allele was already present were excluded from simulation (e.g., many Coriell samples in the set contained the common CFTR F508del mutation). Mutations with detection rates <100% between the expected and observed alleles were classified as undetectable by NGS.

Determining clinical significance of variant allele calls

Each NGS-detected variant allele is annotated for functional (clinical) significance by determining its relative position within the corresponding consensus coding sequence. Truncating mutations were considered clinically significant lesions for the purposes of the work presented here. A more detailed description of how clinical significance of variant allele calls was determined can be found in the Supplementary Materials and Methods online.

Results

Completeness and reproducibility

We performed automated target capture and molecular barcoding followed by NGS on a set of 194 samples derived from immortalized cell lines (55 containing specific disease-causing mutations and 139 chosen to represent ethnic diversity, see Supplementary Figure S2 online) and 59 samples derived from whole blood (see Supplementary Table S4 online). We targeted all exons including 10 nt of flanking intronic sequence, plus additional intronic regions known to contain disease-causing mutations in 15 genes causative of 14 recessive Mendelian diseases (see Supplementary Table S1 online) using tiling molecular inversion probes (see Materials and Methods section). A total of 25,907,612,945 base pairs (bp) of de-multiplexed sequence were generated, corresponding to an average per-base coverage per sample of 2,399× (minimum: 891×; maximum: 4,000×), see Supplementary Table S4 online. Of the 42,858 bases targeted for capture in each sample, we made high-confidence genotype calls (both reference and non-reference) at an average of 97.3% (minimum: 92.2%; maximum: 99.8%) for cell line-derived DNA and 99.9% (minimum: 99.8%; maximum: 99.9%) for blood-derived DNA (see Supplementary Table S2 online).

The DNA extraction protocol used for our blood samples concluded with an overnight incubation at 65 °C in a Tris-based buffer. Subsequent experiments showed that this step reduced the mean size of the purified DNA (see Supplementary Figure S3 online); shearing was likely caused by acid hydrolysis during a temperature-induced pH shift of the buffer.20 We hypothesize that lower-molecular-mass genomic DNA is more readily denatured and therefore more accessible to molecular inversion probes, resulting in improved capture performance. Consistent with this hypothesis, we find that reducing the overnight incubation temperature to 25 °C significantly reduces the percentage of target bases that yield high-confidence genotype calls (see Supplementary Figure S3 online). To assess reproducibility, a subset of 126 samples derived from cell-line DNA (see Supplementary Table S4 online) was processed twice, each time by a different operator on different liquid-handling equipment. At least 92% of bases were called at ≥50× coverage in all samples, with high agreement between replicates (Pearson correlation coefficient 0.868). Of 5,177,206 total genotype calls compared, 17 were discordant, for a concordance rate of 0.999997. These occurred at only five unique genomic positions, consistent with systematic sequencing error as the primary cause.

Sanger concordance

To assess the overall accuracy of our NGS genotype calls (reference and non-reference) on a set of 194 samples (see Supplementary Table S4 online), we compared genotype calls for the target region from the NGS pipeline with those generated by automated analysis (Mutation Surveyor) of bidirectional Sanger sequence of PCR amplicons. Within a total of 6,997,906 bp of sequence called by both methods, we observed 3,973 concordant and 1,220 discordant single-nucleotide variant (SNV) genotype calls. We performed a manual review of discrepant calls and ultimately determined that nine high-quality SNV calls were true discrepancies, corresponding to eight NGS false positives and one NGS false negative ( Table 1 , see Supplementary Note 1 online). We observed a total of 4,000 true-positive SNV calls and 6,992,746 true-negative SNV calls ( Table 1 ). The NGS SNV false-positive rate was 1.14 × 10−6 (95% Wilson binomial confidence interval [5.80 × 10−7 − 2.26 × 10−6]). Positive predictive value for common (i.e., in dbSNP) SNV calls was 100%; it was 97% for novel (i.e., not in dbSNP) calls. The false-positive calls occurred at five unique genomic loci, three of which were at adjacent positions in a single exon of the gene MCOLN1 and were caused by GATK realignment.

Table 1 Comparison of NGS genotype calls (alignment-only algorithm) to Sanger-derived genotype calls

The NGS SNV false-negative rate was 2.50 × 10−4 (95% Wilson binomial confidence interval [1.28 × 10−5–1.41 × 10−3]. Sensitivity for common SNV calls was 99.97% (95% CI: 99.9–100%), and sensitivity was 100% (95% CI: 98.6–100%) for novel calls. The false-negative call observed occurred in chromosome 11 of a sample previously characterized as aneuploid.21 Of 473 NGS reads covering the false-negative locus, 9.5% supported the correct heterozygous A/C genotype call ( Figure 2a ), with Sanger sequencing showing low peak height for the alternate A allele ( Figure 2b ). Shotgun full-genome sequencing of this sample demonstrated a bimodal distribution of allele ratios for heterozygous calls in chromosome 11 ( Figure 2c ) and illustrated variable chromosome copy numbers ( Figure 2d ), supporting the conclusion that this sample was aneuploid.

Figure 2
figure 2

Skewed allelic fractions in aneuploid cell-line GM18540. (a) IGV view of NGS data from GM18540 for the non-reference genotype call of interest (shown between vertical lines). (b) Bidirectional Sanger data for the variant-containing region. (c) Histogram of allele ratios for all non-reference genotype calls in chromosome 11 derived from whole-genome shotgun sequencing (WGSS) of GM18540 and control sample GM18537. (d) Genome-wide relative coverage for GM18540. WGSS coverage data for each of the autosomes was binned into 50 kb intervals and the log-ratios of the per-sample mean normalized values were plotted versus chromosome position. Dashed vertical lines denote chromosome boundaries; within a chromosome the ratios are arranged according to genomic position. IGV, integrative genomics viewer; NGS, next-generation DNA sequencing.

For insertions and deletions (indels), we observed a total of 61 true positives, 394 false positives (27 unique alleles), and 3 false negatives (2 unique alleles, both in exon 1 of SMPD1), for a sensitivity of 95.3% overall (95% CI: 86.9–99.0%), and 100% if exon 1 of SMPD1 is excluded from analysis (95% CI: 94.1–100%). Of 31 clinically relevant disease mutations, we detected all 31, for a sensitivity at disease-causing loci of 100% (95% CI: 88.8–100%).

Detection of pathogenic mutations

We next sought to assess our ability to detect variants that cause the Mendelian diseases targeted by our panel (see Supplementary Table S1 online) in a set of 194 cell line–derived samples. Of these samples, 55 were derived from individuals who were either carriers of or affected by one of the diseases being assayed and collectively contained a total of 95 previously characterized disease mutations. During the design of our NGS workflow, we determined that three of these lesions would be inaccessible by our approach—two were large deletions spanning multiple exons, and one was within a region of CFTR exon 10 that is paralogous to other genomic regions (see Supplementary Table S5 online). Of the 92 mutations we could expect to detect by NGS, we detected all 92 (see Supplementary Table S5 online), and coverage-based analyses of the regions harboring the two large deletions illustrated that evidence of these lesions is present in the sequencing data (see Supplementary Figure S4 online). We also identified truncating (and likely disease-causing) mutations in two affected samples in which previously only one mutation was known (see Supplementary Figure S5 online and Supplementary Table S5 online), as well as 9 carriers in the set of 139 previously uncharacterized HapMap, 1000 Genomes Project, and Human Diversity Panel samples (see Supplementary Table S5 online).

GATA

Although substitutions comprise the majority of coding variation in the human genome, indels are often clinically relevant. Indels, especially when large or present in cis with substitutions, are notoriously difficult to detect with short NGS reads. Assembly of short reads can improve indel detection sensitivity, but this is often at the cost of decreased SNV and indel specificity due to the presence of spurious contigs. We devised an algorithm, termed GATA, that first forms an assembly from reads partitioned into subsets by targeting arm sequence and then performs base quality– and coverage-informed genotyping by alignment of raw reads back to the assembled contigs ( Figure 1 ). We compared the performance of GATA for indel genotyping to the more conventional genotyping-by-alignment-only (AO) algorithm used in the Sanger concordance studies. Across a set of 147 samples analyzed, both indel sensitivity and specificity were increased with GATA relative to AO ( Table 2 ). GATA detected 23 unique insertions and deletions, which were confirmed by manual review of Sanger traces. Of these, nine (39%) were not detected by AO in one or more samples, including BLM c.2207_2212delinsTAGATTC—the most common disease-causing mutation for Bloom syndrome in people of Ashkenazi Jewish descent22—as well as several alleles in SMPD1 (see Supplementary Table S6 online), the gene associated with Niemann–Pick disease ( Figure 3 ). Performance for substitutions was identical for both detection methods (AO and GATA). GATA and AO both utilize GATK; however because versions and functionality differ, we assessed whether the newer (GATA) version of GATK, when used with AO, would improve indel performance. Although this eliminates false-positive indel calls, sensitivity is only marginally improved ( Table 2 ).

Table 2 Genotyping by Assembly-Templated Alignment (GATA) algorithm improves detection of insertions and deletions
Figure 3
figure 3

Genotyping by Assembly-Templated Alignment correctly genotypes insertions and deletions that are undetectable by the alignment-only method. Read from top to bottom, each panel provides tracks for cumulative depth of coverage (vertical gray bars); representative MIP alignments (horizontal gray bars) with mismatches (colored letters), insertions (purple bars), and gaps (dashed lines); chromatogram; reference DNA and amino acid sequence for (a) heterozygous BLM c.2207_2212delinsTAGATTC in sample GM04408 as well as several alleles in the first exon of SMPD1 including (b) a heterozygous 18-bp deletion in sample GM20342 (minus strand), (c) a heterozygous 12-bp insertion and homozygous substitution in sample GM17282 (plus strand), and (d) compound heterozygous 6- and 12-bp deletions in sample GM00502 (minus strand). Chromatogram trace offsets corresponding to specific heterozygous insertion and deletion patterns are indicated with slanted lines color coded by reference base. For clarity, offsets are shown for (c) and (d) only. MIP, molecular inversion probe.

Simulation to assess detectability of rare pathogenic mutations

Although we were able to empirically demonstrate detectability for all disease-causing mutations present in our sample set, there exist a number of disease-causing mutations for which samples cannot be readily obtained. To assess whether our NGS workflow can detect these additional mutations, we sought to perform simulations in silico. Because detectability can be affected by any element of the workflow, we implemented a simulator that employed read sets from actual samples rather than model reads derived from the reference genome at uniform coverage. This allowed for realistic representation of target abundance distribution, neighboring in cis variants, as well as cycle- and context-dependent sequencing errors. Disease-causing variants were introduced into raw reads by a Bernoulli process, with an average 0.5 probability of introducing the lesion, to simulate the heterozygous genotypes carrier screening aims to detect. A total of 81 heterozygous variants were simulated in a read set of at least 144 samples, with the exception of CFTR c.1521_1523delCTT (F508del), the most common disease-causing mutation for cystic fibrosis in Caucasian populations23,24 (see Supplementary Table S7 online). This mutation was present in several samples, which were removed from simulation analysis (see Materials and Methods section). Of the simulated variants, 67 (83%) were correctly genotyped in all (generally 145 of 145) samples and only four relatively large (>7 bp) deletions were undetected in one or more samples (see Supplementary Table S7 online). We were unable to make high-confidence genotype calls for the remaining 10 variants. We did not find any variants to be undetectable in all samples (see Supplementary Table S7 online).

Discussion

Robustness, completeness, and accuracy are three of the main factors that define the utility of a genetic carrier testing workflow in a clinical laboratory. By utilizing a target enrichment methodology that is performed in a single tube and requires no mechanical shearing or purifications of individual samples, we have developed an automated NGS workflow that yields highly reproducible results across samples and operators. This reproducibility ensures that samples will not have to be rerun frequently, minimizing both turnaround time and per-sample cost.

Because each clinically meaningful base pair must be sequenced before an actionable medical report can be generated, a high level of completeness minimizes the amount of costly rework necessary for a sample. We have demonstrated completeness consistent with low to no rework for the samples and target set studied here, and substantially better than other previously reported methods using multiplex target capture or PCR with NGS.8,11,12,25 This improvement is probably the result of a number of optimizations we have made, including the use of a tiling MIP design that ensures multiple probes capture each base, thereby reducing the probability of allele dropout and systematic sequencing error, and the use of a DNA extraction protocol that effortlessly shears the DNA to a lower molecular mass. It should be noted, however, that performance with other sets of genes may vary due to interfering effects such as extreme GC content (see Supplementary Figure S6 online), repetitive or low-complexity regions, and paralogous sequence.

Regarding accuracy, the only SNV false negative that we observed was in a sample that exhibited skewed allele ratios along the chromosome, which should not commonly occur when testing for germline mutations in clinical specimens derived from whole blood. In addition, the SNV false-positive rate of ~1.1 per million base pairs corresponds to a low confirmation burden for clinical testing and surpasses values previously reported. Given our small target set and the rare nature of indels, it is difficult to provide a precise measurement of our accuracy for indels, although our data do suggest that the use of GATA substantially improves our ability to detect small lesions. The high level of coverage used in this study probably had a positive effect on both the accuracy and completeness achieved. Although lower levels of coverage could be used, it would be reasonable to expect that this would lead to either lower accuracy (ability to detect heterozygous lesions) or completeness, and might be unnecessary because list sequencing reagent costs here approximate those for genotyping arrays, although capital costs may be substantially different.

It is worth noting that measuring accuracy to a sufficient level of precision and generality can be challenging within conserved coding regions because selective pressure limits the spectrum of variation present. Although we sequenced a large number of samples, the relatively small size of our target limited the number of unique alleles observable and meant that ~90% of such variants were common (i.e., present in dbSNP). Nonetheless, there is no a priori reason to believe that our measured accuracy will not generalize to other rare and private mutations present in the targeted loci. Supporting this point, our simulations using real data and controlled for sample-to-sample variability indicate that we can detect a number of very rare disease-causing alleles of different types and sequence contexts, including insertions (up to 12 bp), deletions (up to 22 bp), and complex combinations thereof.

The reference standard one considers ground truth can impose a ceiling on measurable accuracy. We employed a large-scale automated analysis of what is widely deemed the “gold standard” for DNA sequencing: bidirectional Sanger traces derived from PCR amplicons.26 The NGS workflow detected allele dropout in the Sanger data, a known limitation of that technology (see Supplementary Figure S7 online)27 and not surprising because each base sequenced by NGS was captured by multiple probes with independent targeting arms. Had we instead employed the less laborious and more commonly used reference of HapMap Project genotyping data, we would have observed 12 NGS false negatives and 7 false positives in the subset of samples characterized by this approach (see Supplementary Table S8 online). Because these were all shown by our Sanger analysis to be HapMap Project genotyping errors, this would have underestimated both sensitivity and specificity.

Indel detection methods that only employ gapped alignment of short reads to reference are often limited by false positives introduced by systematic, context-dependent sequencing error, and false negatives introduced by failure of the aligner to open or extend gaps. An assembly-based paradigm would address these limitations, but raw contigs do not always carry base-quality and coverage information. The GATA algorithm combines these approaches to deliver sensitive and specific indel detection with SNV performance on par with a traditional AO pipeline.

Many alleles detected exclusively by GATA were from a short-tandem-repeat region encoding the N-terminal signal peptide in SMPD1 (see Supplementary Table S6 online). Consistent with previous reports, GATA detected non-reference alleles in 96% of samples, a rate that is strikingly high because hg18 contains a minor allele that is frequently substituted (V36A). Although common hexanucleotide indels at this locus are clinically benign,28,29 any pathogenic mutation present in cis would probably be missed using a conventional approach for variant detection. Indeed, when reads were aligned independently, several genomic positions in this region consistently fell below our specified coverage threshold. GATA therefore should yield higher sensitivity for rare mutations linked to polymorphisms in the first exon of SPMD1 and potentially other short-tandem-repeat loci as well.

The simulation methodology applied here attempts to assess detectability of rare pathogenic mutations in a highly realistic manner. Simply deriving reads from a reference genome modified to include the mutation of interest can overestimate the detection probability because of real-world factors that would otherwise render the mutation undetectable. In addition, we are able to determine whether a mutation is sometimes, rather than always or never, detectable because it is simulated in the read sets of hundreds of samples; for example, this could occur in a particular genetic background with a low-frequency in cis variant that interferes with alignment of reads containing the mutation. Nonetheless, certain mutation types, particularly large deletions, are still not amenable to this paradigm because they could fundamentally alter the distribution of reads generated across the relevant region. In these cases, either human samples or synthetic templates remain the only way to assess detectability. Furthermore, the outcome of this analysis can feed back into the probe design phase if it identifies genes or regions that harbor variants that will be difficult to accurately detect.

We have presented an automated, integrated workflow that converts human genomic DNA isolated from blood or cell lines into clinically relevant variant calls within 6 days. We achieved high genotype concordance with conventional electrophoretic sequencing across a set of 15 genes, and demonstrated the ability to detect a range of important disease-causing mutations. Our new analysis pipeline allows for sensitive and specific detection of indels, while simultaneously incorporating raw base quality and coverage into SNV (non-reference) genotype calls. Realistic simulation on actual run data indicates that a number of pathogenic mutations undetectable by a traditional alignment-based genotyping approach are accessible by GATA. Collectively, the data presented here indicate that our workflow has met our three requirements for a carrier screening test, and hence is ready for clinical use.

Disclosure

M.A.U., C.J.K., P.S., B.B., N.C., J.E., V.G., S.H., D.M., C.M., M.M.N., C.F.T., and G.J.P. are or were employees of, and shareholders in, Good Start Genetics. M.A.U., C.J.K., B.B., V.G., M.M.N., C.F.T., G.M.C., and G.J.P. are listed as inventors on patents or patent applications related to this work. G.M.C. is an adviser to, and shareholder in, Good Start Genetics.