Introduction

Approximately 10–15% of colorectal adenocarcinomas arise in the setting of genetic alterations disrupting the DNA mismatch repair mechanism [1]. Roughly 20% of the mismatch repair-deficient tumors, or 2–4% of all colorectal cancers, occur in Lynch syndrome patients with germline mutations that impair the function of mismatch repair proteins [1,2,3,4]. Mismatch repair-deficient and -proficient tumors are important to distinguish because of implications for prognosis, treatment, and cancer screening in patients found to have Lynch syndrome and their family members. In addition, mismatch repair deficiency is an important marker for immunotherapy, with the accelerated Food and Drug Administration approval of pembrolizumab for the treatment of any advanced mismatch repair-deficient solid tumor and nivolumab for the treatment of metastatic mismatch repair-deficient colorectal cancer in 2017 [5, 6].

Many institutions now routinely screen for mismatch repair deficiency in newly detected colorectal carcinomas with immunohistochemical staining for MLH1, MSH2, MSH6, and PMS2. Loss of expression of at least one of these proteins occurs in 90–95% of microsatellite instability high tumors [1, 2], which include tumors with germline or somatic loss-of-function mutations involving mismatch repair genes and those with epigenetic silencing of mismatch repair protein expression via MLH1 promoter hypermethylation. While immunohistochemical screening is highly sensitive and specific in predicting microsatellite instability status [7], some cancers with pathogenic Lynch syndrome variants have been reported to demonstrate microsatellite instability high status while maintaining intact mismatch repair protein expression [8].

Due to a higher frequency of replication errors in repetitive DNA sequences, mismatch repair-deficient tumors have increased numbers of insertion and deletion mutations (indels) in microsatellite regions [9, 10], including in mononucleotide repeats in the genome. These indels are the basis of clinically used polymerase chain reaction techniques, which target a well-described set of microsatellites [11]. With the increasing availability of next-generation sequencing for cancer in clinical settings, it is now possible to use algorithms to routinely search for indels in microsatellites contained in next-generation sequencing panel genes. Although the detection of long indels may be informatically challenging, the detection of single-nucleotide indels is significantly easier and can be performed automatically using available software packages [12].

To take advantage of these data, efforts have been made to use targeted next-generation sequencing panels and whole-exome sequencing to determine mismatch repair status [13,14,15]. Our group and others have developed algorithms that use total mutational burden with and without the incorporation of mononucleotide insertion and deletion mutations [16, 17]. Algorithms dependent on total mutational burden do not distinguish POLE-associated ultramutated colorectal carcinomas from mismatch repair-deficient carcinomas.

Here, we refine our approach by specifying the types of insertion and deletion mutations associated with mismatch repair-deficient tumors and using single-nucleotide indel events as the sole marker in the detection of mismatch repair-deficient tumors. This refinement leaves us with a simple metric that can be applied to achieve equal sensitivity and improved specificity in a previously published training cohort. We then apply this metric to a large and independent cohort to further validate its utility in colorectal adenocarcinomas. Ultimately, our results demonstrate that a simple metric can be applied to next-generation sequencing data to accurately screen for mismatch repair deficiency.

Materials and methods

Patient selection for training and validation cohorts

The study population was prospectively enrolled via Profile, an institutional cancer genotyping cohort study [18], and included cases in which both clinical immunohistochemical screening for mismatch repair deficiency and targeted next-generation sequencing had been performed on tumor tissue. The training set was a subset of the 243 cases previously described [17]. One case was excluded because next-generation sequencing had been performed on a metastasis in a patient with two synchronous primary tumors, one of which was mismatch repair-deficient by immunohistochemistry and the other of which had unknown mismatch repair immunohistochemistry status. Another case was excluded because the patient had both colorectal neoplasia and a metastatic gallbladder adenocarcinoma, the latter of which had received next-generation sequencing. After these exclusions, the final training cohort numbered 241 cases. The validation set included 436 additional sequenced colorectal carcinomas. Demographics of the training and validation cohorts are shown in Table 1.

Table 1 Demographic information for training and validation sets

All patients provided written informed consent. This study was approved by the institutional review board of the Dana-Farber Cancer Institute and the Partners Human Research Committee.

Library preparation and next-generation sequencing

Targeted next-generation sequencing was performed on tumor specimens as previously described [19]. In brief, specimens were macrodissected to enrich for regions with at least 20% tumor nuclei, and DNA was extracted from tissue frozen in optimal cutting temperature compound or formalin-fixed and paraffin-embedded tissue. Tumor-only sequencing was performed without a paired nonneoplastic specimen.

At least 50 ng sonically sheared DNA was used for library preparation using Illumina TruSeq LT reagents (Illumina Inc., San Diego, CA). Solution-based hybrid capture was performed using a custom RNA bait set (Agilent Technologies, Santa Clara, CA) for the coding region of genes of interest. There were two versions of the in-house targeted next-generation sequencing panel used in this project: in the first, there were 275 genes covering a total of 757,787 bp, and in the second, there were 298 genes covering a total of 831,033 bp [17]. Both panels of genes included Lynch syndrome-associated DNA mismatch repair genes MLH1, MSH2, MSH6, and PMS2. The first and second panels were used for training set cases, and the second panel was used in the validation set. Massively parallel sequencing was performed using an Illumina HiSeq2500 (Illumina, San Diego, CA). Sequencing results were analyzed via a custom informatics pipeline. The presence of insertion and deletion mutations was determined using GATK Indelocator (Broad Institute, Cambridge, MA).

Mismatch repair testing by immunohistochemistry

Immunohistochemical staining for MLH1, MSH2, MSH6, and PMS2 was performed as per previously published lab protocol [17]. Results of immunohistochemical staining were determined by chart review of pathology reports.

PCR testing of microsatellite loci

Microsatellite instability was evaluated by polymerase chain reaction amplification of five different microsatellite loci – four mononucleotide repeats (BAT25, BAT26, BAT40, and BAT34c) and one dinucleotide repeat (D18S55) – using fluorescently labeled primers in paired tumor-normal samples. Polymerase chain reaction products were analyzed using capillary gel electrophoresis (3130xl Genetic Analyzer, Applied Biosystems, Foster City, CA). Microsatellite instability was defined by alteration in the distribution of lengths of the PCR products in the tumor sample relative to the normal sample. Samples with instability in two of five loci or more were classified as microsatellite instability high.
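The two-of-five rule described above can be written down as a short check. This is a minimal sketch only: the function name and the dictionary input format are illustrative assumptions, and the per-locus instability calls are taken as given rather than derived from electrophoresis traces.

```python
from typing import Mapping

# The five loci named in the text.
LOCI = ("BAT25", "BAT26", "BAT40", "BAT34c", "D18S55")


def is_msi_high(unstable: Mapping[str, bool], min_unstable: int = 2) -> bool:
    """Return True when at least `min_unstable` of the five loci are unstable.

    `unstable` maps each locus name to a per-locus call (True = unstable).
    The input format is a hypothetical convenience, not the laboratory's.
    """
    missing = [locus for locus in LOCI if locus not in unstable]
    if missing:
        raise ValueError(f"missing loci: {missing}")
    return sum(bool(unstable[locus]) for locus in LOCI) >= min_unstable
```

The sketch returns only the microsatellite instability high call, because that is the only category the text defines.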

Statistical analysis

Two-sided Welch’s t tests were used in statistical comparisons, with a threshold of p < 0.05 used to define statistical significance. Where applicable, we report standard errors.
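For reference, the Welch statistic and the standard error can be computed as in the following sketch. It uses only the Python standard library and therefore stops at the t statistic and the Welch-Satterthwaite degrees of freedom; the function names are illustrative and this is not the analysis code used in the study.

```python
import math
from statistics import mean, variance


def standard_error(x):
    """Standard error of the mean, using the sample variance (n - 1)."""
    return math.sqrt(variance(x) / len(x))


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom).

    Unlike Student's t test, the two groups are not assumed to share
    a common variance.
    """
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

A two-sided p value then follows from the t distribution with `df` degrees of freedom. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and returns the p value directly.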

Determination of next-generation sequencing metric to distinguish mismatch repair-deficient and mismatch repair-proficient cases

The training and validation cohorts were separated by mismatch repair immunohistochemistry status into mismatch repair-deficient and mismatch repair-proficient groups. In the training cohort, next-generation sequencing results were used to identify single-base pair indels in mononucleotide repeat sequences, defined as nucleotide repeats of length 2 or more. For each mononucleotide repeat sequence length, we compared the number of indel events in mismatch repair-deficient and mismatch repair-proficient cases. In mismatch repair-deficient cases, indel events were shown to occur preferentially in longer mononucleotide repeat sequences. We determined the minimum number of base pairs within a mononucleotide repeat (“mononucleotide repeat length”) at which there was a statistically significant difference between pooled mismatch repair-deficient and mismatch repair-proficient indel events. For the remainder of the analysis, we only considered indel events occurring in mononucleotide repeats of at least this threshold length.
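The repeat-length filter can be illustrated with a small helper that measures the mononucleotide run containing a given reference position. This is a sketch under stated assumptions: the reference is a plain string, the position is a 0-based index of the affected base, and the function names are hypothetical. How the indel caller reports repeat context is not specified here, so this should not be read as the pipeline's own logic.

```python
MIN_REPEAT_LENGTH = 4  # threshold length reported in the Results


def run_length_at(ref: str, pos: int) -> int:
    """Length of the run of identical bases in `ref` that contains index `pos`."""
    base = ref[pos].upper()
    left = pos
    while left > 0 and ref[left - 1].upper() == base:
        left -= 1
    right = pos
    while right + 1 < len(ref) and ref[right + 1].upper() == base:
        right += 1
    return right - left + 1


def in_qualifying_repeat(ref: str, pos: int, min_len: int = MIN_REPEAT_LENGTH) -> bool:
    """True when the base at `pos` lies in a mononucleotide run of at least `min_len`."""
    return run_length_at(ref, pos) >= min_len
```

For example, in the toy sequence `ACGTTTTTAGG` a single-base event inside the run of five T bases would be retained, whereas one in the trailing `GG` would be discarded.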

Considering only these indel events, we examined the total number of indel events per case, normalizing to the number of megabases covered in the panel, and we compared training set mismatch repair-deficient and mismatch repair-proficient cases. We selected an indel/case cutoff that optimized discrimination of mismatch repair-deficient and mismatch repair-proficient training set cases. Finally, we applied the same mononucleotide repeat length threshold and indel event cutoff to the validation cohort to validate the performance of our metric.
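The normalization and cutoff steps amount to the following sketch. The panel sizes and the 3 indels/Mbp cutoff are taken from this paper; the function names, the panel labels `v1` and `v2`, and the choice to treat a value exactly at the cutoff as deficient are assumptions made for illustration.

```python
# Panel sizes in base pairs, as stated in the Methods.
PANEL_BP = {"v1": 757_787, "v2": 831_033}

# Cutoff reported in the Results (indels per megabase).
CUTOFF_INDELS_PER_MBP = 3.0


def indels_per_mbp(n_indels: int, panel_bp: int) -> float:
    """Normalize a per-case indel count to the panel size in megabases."""
    return n_indels / (panel_bp / 1_000_000)


def classify(n_indels: int, panel_bp: int, cutoff: float = CUTOFF_INDELS_PER_MBP) -> str:
    """Call a case deficient when its normalized count reaches the cutoff.

    Using >= rather than > at the boundary is an assumption of this sketch.
    """
    rate = indels_per_mbp(n_indels, panel_bp)
    if rate >= cutoff:
        return "mismatch repair-deficient"
    return "mismatch repair-proficient"
```

On either panel size, two qualifying indels fall below the cutoff and three exceed it, so the boundary convention does not affect integer counts at these panel sizes.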

Simulation of small gene panels for detection of mismatch repair deficiency

Using the same cohort, we simulated a limited panel to detect mismatch repair deficiency. The training set data were pooled and stratified by gene. For each gene, we tabulated the total number of indel events detected and created a rank-order list of genes in which the first-ranked gene had the highest number of total indel events, the second-ranked gene had the second highest number of indel events, and so on. We then calculated the sensitivity and specificity of increasingly large panels in discriminating mismatch repair-deficient and mismatch repair-proficient tumors. Based on these results, we excluded genes that negatively impacted specificity, and we assessed the limited panels using validation set data.
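The ranking and panel-growth procedure can be sketched as follows. The case representation (a dictionary with an immunohistochemistry label and per-gene counts of qualifying indels), the function names, and the placeholder gene names in the usage note are all hypothetical; the sketch illustrates the procedure and does not reproduce the study's code or data.

```python
from collections import Counter


def rank_genes(cases):
    """Rank genes by pooled indel events across all cases, most events first."""
    totals = Counter()
    for case in cases:
        totals.update(case["indels_by_gene"])
    return [gene for gene, _ in totals.most_common()]


def panel_performance(cases, panel, min_events=1):
    """Sensitivity and specificity when a case is called deficient if it has
    at least `min_events` qualifying indels within the panel genes.

    Assumes `cases` contains at least one deficient and one proficient case.
    """
    tp = fn = tn = fp = 0
    for case in cases:
        events = sum(case["indels_by_gene"].get(gene, 0) for gene in panel)
        called_deficient = events >= min_events
        if case["mmr_deficient"]:
            tp += called_deficient
            fn += not called_deficient
        else:
            fp += called_deficient
            tn += not called_deficient
    return tp / (tp + fn), tn / (tn + fp)


def growth_curve(train_cases, eval_cases, excluded=()):
    """Performance on `eval_cases` as top-ranked training genes are added one by one."""
    ranked = [g for g in rank_genes(train_cases) if g not in excluded]
    return [panel_performance(eval_cases, ranked[:k]) for k in range(1, len(ranked) + 1)]
```

Calling `growth_curve(train, validation, excluded={"GENE_X"})` with placeholder inputs returns one (sensitivity, specificity) pair per panel size, which is the kind of curve the simulation describes.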

Results

Selection of mononucleotide repeat length cutoff

Immunohistochemical analysis, performed for clinical screening purposes at the time of biopsy or resection, was used to identify 23 mismatch repair-deficient and 218 mismatch repair-proficient tumors in the training set, and 46 mismatch repair-deficient and 390 mismatch repair-proficient tumors in the validation set. Considering mononucleotide sequences of two or more repeats, the training set contained a total of 236 single-nucleotide indel events in mismatch repair-deficient tumors and 76 single-nucleotide indel events in mismatch repair-proficient tumors. These events were stratified by the length of the mononucleotide repeat sequence in which the indel event was detected (Fig. 1). Most indel events in mismatch repair-proficient tumors (43 of 76, 57%) occurred in shorter mononucleotide repeat sequences of length 2 or 3 base pairs, while most indel events in mismatch repair-deficient tumors were detected in mononucleotide repeat sequences of length 4 or more base pairs (228 of 236, 97%). Proportionally, mismatch repair-deficient tumors were found to have more indels in mononucleotide stretches of length 4–8 base pairs, relative to mismatch repair-proficient tumors (p ≤ 2.1 × 10⁻⁵, Fig. 1b). Based on these differences in the training set, we determined that mononucleotide repeats of length 4 or more base pairs were most useful in distinguishing mismatch repair-deficient tumors from mismatch repair-proficient tumors.

Fig. 1

Analysis of single-nucleotide indel events at mononucleotide repeats. (a) Absolute number of indel events as a function of mononucleotide repeat length (base pairs). The total number of indel events is plotted for mismatch repair-deficient tumors (light gray) and mismatch repair-proficient tumors (dark gray). There are 23 mismatch repair-deficient and 218 mismatch repair-proficient tumors in the training set, from which single-nucleotide indel events are counted. (b) Number of indel events normalized to the total number of megabases sequenced as a function of mononucleotide repeat length (base pairs). The values shown in the panel are p values comparing indels/Mbp occurring in mononucleotide repeat regions of a given length. For mononucleotide repeat lengths between 4 and 8 base pairs, mismatch repair-deficient and mismatch repair-proficient tumors have a statistically significant difference in indels/Mbp. The error bars represent standard errors

Application of next-generation sequencing criteria for mismatch repair deficiency in training and validation datasets

We next quantified how many indel events occurred in mononucleotide repeat regions of length 4 or more base pairs in mismatch repair-deficient and mismatch repair-proficient tumors. To enable comparison across cases, we normalized the number of indels per case to the number of megabases sequenced (indels/Mbp). In the training set, mismatch repair-deficient tumors had an average of 13.0 ± 1.2 indels/Mbp, and mismatch repair-proficient tumors had an average of 0.45 ± 0.05 indels/Mbp. True-positive cases in the training set contained an average of 13.6 ± 0.3 indels/Mbp. We found that using a cutoff of 3 indels/Mbp, next-generation sequencing results were concordant with immunohistochemistry for 22 of 23 mismatch repair-deficient tumors and 218 of 218 mismatch repair-proficient tumors, achieving 96% sensitivity and 100% specificity (Fig. 2a).

Fig. 2

Total indels/Mbp by next-generation sequencing compared to mismatch repair immunohistochemistry. (a) Jitter boxplot for training set. This boxplot shows the number of indels for each case occurring in mononucleotide repeats of length 4 or more base pairs, normalized to the number of Mbp sequenced; the lower and upper boundaries of the boxes correspond to the 25th and 75th percentiles, and the line within the box represents the median. Of the 218 mismatch repair-proficient tumors, 190 had a total of 0 indels/Mbp, such that the 25th percentile, 75th percentile, and median values are equal. A cutoff value of 3 indels/Mbp (horizontal line) is chosen as a metric to classify cases as mismatch repair-deficient or mismatch repair-proficient. (b) Jitter boxplot for validation set. 432 of 436 cases are correctly classified compared to mismatch repair immunohistochemistry

Applying the cutoff of 3 indels/Mbp to the validation set, next-generation sequencing results were concordant with immunohistochemistry for 44 of 46 mismatch repair-deficient tumors and 388 of 390 mismatch repair-proficient tumors, achieving 96% sensitivity and 99% specificity (Fig. 2b), with two false-negative cases and two false-positive cases compared to immunohistochemical staining (see “Analysis of discordant cases”, below). In the validation set, mismatch repair-deficient tumors had an average of 13.3 ± 1.1 indels/Mbp, and mismatch repair-proficient tumors had an average of 0.31 ± 0.06 indels/Mbp. True-positive cases in the validation set were found to have an average of 13.9 ± 1.1 indels/Mbp.
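The sensitivity and specificity figures follow directly from the concordance counts reported for the two cohorts. The short calculation below uses only those counts; the helper name is illustrative.

```python
def sens_spec(tp: int, fn: int, tn: int, fp: int):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


# Training set: 22 of 23 deficient and 218 of 218 proficient tumors concordant.
training = sens_spec(tp=22, fn=1, tn=218, fp=0)

# Validation set: 44 of 46 deficient and 388 of 390 proficient tumors concordant.
validation = sens_spec(tp=44, fn=2, tn=388, fp=2)

for name, (sens, spec) in (("training", training), ("validation", validation)):
    print(f"{name}: sensitivity {sens:.3f}, specificity {spec:.3f}")
```

Immunohistochemistry is the reference standard in this calculation, so the two validation false positives count against specificity even though both were microsatellite instability high by polymerase chain reaction testing.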

Analysis of discordant cases

The only discordant case in the training set was false-negative by next-generation sequencing criteria. On immunohistochemical screening, the tumor was called mismatch repair-deficient due to heterogeneous absence of MSH2/MSH6 expression. Subsequent PCR testing showed microsatellite instability in zero of five loci. Our next-generation sequencing metric classified the case as mismatch repair-proficient, with no indels in mononucleotide repeats of length 4 or more. Further examination of our next-generation sequencing data showed that the tumor had two APC nonsense mutations (p.K670* and p.E1322*) and a TP53 missense mutation (p.R175H). Based on these results, this tumor might have been misclassified by immunohistochemical screening or might exhibit heterogeneity with mismatch repair deficiency involving a subclone not tested by next-generation sequencing. Although reclassification of the case as mismatch repair-proficient would improve the sensitivity of our approach, we retained the mismatch repair-deficient classification because detailed review of this case was prompted by our next-generation sequencing findings.

There were four discordant cases in total in the validation set. In one false-negative case, the tumor was called mismatch repair-deficient because of very weak MLH1 staining and loss of PMS2 expression. Our next-generation sequencing metric classified the case as mismatch repair-proficient, with no indels in mononucleotide repeats of length 4 or more. MLH1 promoter methylation studies were performed clinically and found that the promoter was not hypermethylated. Next-generation sequencing also detected an MLH1 p.L260R missense mutation. In the second false-negative case, our next-generation sequencing metric detected a single indel in the ASXL1 gene, in a mononucleotide repeat of length 8. Immunohistochemistry showed loss of MSH2/MSH6 expression, and next-generation sequencing detected p.R389* and p.V840fs mutations in MSH2. In both false-negative cases, pathogenic variants were identified in mismatch repair genes; however, the neoplasms did not demonstrate the characteristic elevated burden of indels in our targeted sequencing panel. Although our study was designed to focus only on indel events as a phenotypic feature of mismatch repair-deficient cancers, the incorporation of variant interpretation for pathogenicity in mismatch repair genes could have improved our sensitivity to 100% in validation.

In both false-positive cases in the validation set, PCR testing for microsatellite instability had been performed due to high clinical suspicion of mismatch repair deficiency despite negative immunohistochemistry results. Both tumors were found to be microsatellite instability high. In one case, instability was detected in five of five microsatellite markers, and the case was found to have a BRAF p.V600E mutation, which is typically associated with sporadic microsatellite instability high colorectal cancers. The initial immunohistochemical analysis was performed on a biopsy specimen. We repeated immunohistochemistry on the surgical resection specimen using the same tissue block that was tested by next-generation sequencing, which showed loss of MLH1 and PMS2 protein expression, consistent with the presence of the BRAF p.V600E mutation.

In the second false-positive case, microsatellite instability was detected in four of five microsatellite markers, and the tumor was found to harbor MSH2 p.G751R, a pathogenic missense variant. In this case, the initial immunohistochemical analysis was performed on a primary resection specimen. We repeated mismatch repair immunohistochemistry on the subsequent lymph node metastasis using the same tissue block that was tested by next-generation sequencing and confirmed intact expression of all four mismatch repair proteins. We concluded that our next-generation sequencing metric accurately detected microsatellite instability high status in both false-positive cases. One discordant case was potentially attributable to intratumoral heterogeneity, and the second was attributable to an MSH2 missense mutation with intact immunohistochemical staining.

Analysis of gene distribution of indel events

We analyzed which genes in the training dataset were more likely to harbor mismatch repair deficiency-associated indel events. The two genes with the most indel events in mismatch repair-proficient tumors were APC and TP53, well-characterized tumor suppressor genes frequently mutated in colorectal cancer (Fig. 3a) [20]. APC was also the gene with the most indel events per case in mismatch repair-deficient tumors (Fig. 3b). However, the gene with the second highest number of indel events was DMD, a gene encoding 3685 amino acids with no known link to the pathogenesis of colorectal cancer. Overall, the frequency of indel mutations is multifactorial and is likely due to a combination of the mechanism by which mismatch repair deficiency introduces errors in repeat regions, gene size (and therefore the likelihood of acquiring a new mutation), and biological selection for pathogenic alterations that may drive tumor progression [21].

Fig. 3

Analysis of genes in which indel events occur. (a) Genes in which multiple indel events are detected across all mismatch repair-proficient tumors. Pooling data from all sequenced mismatch repair-proficient tumors, there are at least two indel events detected in the genes shown in this panel. The most commonly affected genes are APC and TP53, both of which are tumor suppressors implicated in tumorigenesis. (b) The 10 genes in which indel events most commonly occur across all colorectal cancers. The plot shows the total number of indel events detected, normalized to the number of mismatch repair-deficient (light gray) and mismatch repair-proficient (dark gray) tumors. There are 5 genes across all 218 mismatch repair-proficient tumors with at least 2 detected indel events in mononucleotide repeats of length 4 or more. In contrast, there are 51 such genes across the 23 mismatch repair-deficient tumors

Selection of a limited gene panel—number of genes and number of indels

We recognize that next-generation sequencing panels of around 300 genes are not common in clinical practice; therefore, we asked whether a smaller panel including a subset of genes from our analysis could be used to infer mismatch repair deficiency. Using data from the same cohort, we simulated a limited next-generation sequencing panel, considering genes with the most frequently occurring indel events (Fig. 3).

Due to the relatively high number of indel events in the APC gene in mismatch repair-proficient tumors, we excluded APC to improve specificity. DMD indels were associated with mismatch repair-deficient status, with 1 event detected in 218 mismatch repair-proficient tumors and 7 events detected in 23 mismatch repair-deficient tumors in the training set (p = 0.006). However, this gene was removed from our institutional next-generation sequencing panel during the validation set enrollment period; hence, we excluded DMD from our limited panel since we could not adequately assess its inclusion using our validation cohort.

After the exclusion of DMD and APC, we examined the changes in sensitivity and specificity as a function of increasing next-generation sequencing panel size, preferentially including the genes most frequently affected by indel events (Fig. 4). To improve sensitivity in detecting mismatch repair-deficient cases, we used a cutoff of one indel event in a mononucleotide repeat of four or more nucleotides.

Fig. 4

Sensitivity and specificity of a simulated limited gene panel. (a) Performance of targeted next-generation sequencing panel using a detection threshold of one indel event as more genes are included. The genes in the assay (horizontal axis) are numbered in order of decreasing indel events detected in the training set, and the presented data are reanalyzed as discussed in the Methods. Data corresponding to the gene with the most indel events (ARID1A) are the first to be included in the next-generation sequencing assay, data corresponding to the gene with the second highest number of indel events (KMT2D) are the second to be included, and so on, for all 296 sequenced genes after exclusion of APC and DMD. (b) An expanded view of the first 10 genes in a shows the diagnostic utility of a limited next-generation sequencing panel

We found that there were significant gains in sensitivity with the progressive inclusion of the most frequently affected genes from the training set (Fig. 4a). At the low extreme, the three-gene panel of ARID1A, KMT2D, and SOX9 achieved 76% sensitivity and 98% specificity in the validation set (Fig. 4b). The inclusion of up to 10 additional genes led to modest improvements in sensitivity. Because 40–50 gene panels are used at some institutions for targeted next-generation sequencing, we reanalyzed our data examining only the 40 most commonly affected genes; we found that this 40-gene panel achieved a sensitivity of 96% and a specificity of 92% in the validation set.

Discussion

The increasing availability of next-generation sequencing technology has made the broad profiling of cancer genomes feasible as part of clinical care. Benefits of next-generation sequencing include the simultaneous analysis of multiple oncogenes and tumor suppressor genes using small amounts of pathological tissue. For advanced colorectal cancer, pathway activating mutations of KRAS and NRAS have been associated with lack of response to anti-EGFR antibody therapy [22, 23]. ERBB2 amplification has been associated with clinical response to targeted therapy in patients with treatment-refractory disease [24]. BRAF mutations hold prognostic significance in microsatellite stable and unstable phenotypes [25, 26], in addition to providing a therapeutic target [27]. PIK3CA mutations represent potential therapeutic targets for PI3K pathway inhibition and are associated with favorable outcomes with aspirin use [28]. In addition to single gene targets, mismatch repair deficiency is an important biomarker to predict response to immune checkpoint inhibitors [5, 6].

The clinical algorithm for Lynch syndrome screening is also complex. Clinical guidelines now recommend universal screening of colorectal cancers by mismatch repair protein immunohistochemistry [29]. This may be followed by a variety of molecular tests, including microsatellite instability testing, BRAF mutation analysis, MLH1 promoter methylation analysis, and eventually germline sequencing. It is reasonable to imagine a future where a single next-generation sequencing test can replace most molecular assays currently performed for both therapy selection and Lynch syndrome screening.

Total mutational burden and increased indels in DNA microsatellites are parameters for mismatch repair deficiency that can be detected by next-generation sequencing analysis. To this end, algorithms have been developed to distinguish mismatch repair-deficient and mismatch repair-proficient tumors. For example, MSIsensor [15], MSIseq [13], and MANTIS [30] utilize whole-exome sequences of paired normal-tumor DNA samples to distinguish mismatch repair-deficient and mismatch repair-proficient tumors based on the number of indels in microsatellites. These algorithms also have been shown to successfully determine mismatch repair status with subsets of whole-exome sequences (i.e., targeted next-generation sequencing panels). Another algorithm, mSINGS [31], performed well using whole-exome tumor sequences, without the need for paired normal data. Similar algorithms have been applied to search for driver events in microsatellite regions across many tumor types [32, 33].

Previously, we had used total mutational burden and number of indels in mononucleotide repeat sequences to distinguish mismatch repair-deficient and mismatch repair-proficient tumors using targeted next-generation sequencing data [17]. In the prior publication, mononucleotide repeats were defined as nucleotide repeats of length 2 or more. Here, we refine our analysis and demonstrate that indels in repeats of at least four consecutive nucleotides are enriched in mismatch repair-deficient specimens. We show that consideration of this single parameter of the number of single-nucleotide indels in mononucleotide repeat regions is equally sensitive to and more specific than the prior algorithm. In total, we evaluate 677 colorectal adenocarcinomas with concurrent immunohistochemical analysis and achieve 96% sensitivity and 99% specificity in a validation cohort.

The current molecular gold standard for microsatellite analysis by polymerase chain reaction examines longer DNA repeats in noncoding regions of the genome that are consistently altered in the setting of mismatch repair deficiency. In contrast, our current analysis evaluates shorter mononucleotide repeats that are inconsistently mutated from case to case. Any single gene has at most 0.3 single-nucleotide indel events per mismatch repair-deficient carcinoma. However, the ability of next-generation sequencing to examine multiple genes at high throughput allows a high degree of discrimination between mismatch repair-deficient and mismatch repair-proficient cancers when indel events are compiled across the targeted regions. These mutational patterns support a functional role for deficiency of the mismatch repair machinery in mutagenesis involving coding regions, leading to potential frameshift mutations with functional significance in cancer-associated genes.

At our institution, we currently perform universal screening for mismatch repair deficiency for all colorectal cancers by immunohistochemistry. For cases that undergo clinical next-generation sequencing, we prospectively analyze all cancers for mismatch repair status using next-generation sequencing data. Thus, we use existing sequencing data to apply a secondary screening method to detect mismatch repair-deficient tumors at little additional cost. Furthermore, we demonstrate here that limited gene panels can achieve high accuracy after adjusting the indel/Mbp cutoff. At the extreme end of the spectrum, a panel of only three genes achieves 76% sensitivity and 98% specificity for discriminating mismatch repair-deficient and mismatch repair-proficient tumors in a validation cohort. More practically, use of a one-indel cutoff in a panel of the 40 most informative genes detects mismatch repair deficiency with 96% sensitivity and 92% specificity in a validation cohort. These results serve as a proof of concept that panels already used in practice are sufficiently large to serve as a secondary screen, and that carefully selected, limited panels have potential for mismatch repair deficiency screening.

We recognize certain limitations of our study. Our laboratory uses hybrid capture for panel gene enrichment, and the generalizability of the analysis to next-generation sequencing assays using amplicon-based library preparation is unknown. Since polymerase chain reactions performed as a part of library preparation are prone to errors in repeat regions, validation of analytical pipelines to distinguish true somatic events from sequencing artifacts is essential. Due to variations in library preparation chemistry, sequencing platform, and informatics pipeline, we recommend that laboratories considering implementing mismatch repair analysis by next-generation sequencing independently validate these algorithms for their assay; an adjustment of analytical thresholds for distinguishing mismatch repair-deficient from mismatch repair-proficient tumors may be necessary depending on next-generation sequencing panel design. While our study and others have demonstrated that next-generation sequencing can establish mismatch repair and microsatellite instability status for colorectal cancers, the broad application of such algorithms to other solid tumor types requires additional research.

In this study, we show that a simple metric can be applied to accurately distinguish mismatch repair-deficient and mismatch repair-proficient tumors using targeted next-generation sequencing of tumors without paired normal data. The simplicity of our approach allows for potential generalizability across targeted next-generation sequencing panels as well as rapid adoption into preexisting sequence analysis pipelines.