Introduction

Short-read next-generation sequencing (NGS) has become the diagnostic assay of choice in many clinical laboratories given its potential for efficient large-scale analysis. For genetically heterogeneous disorders, especially those presenting with clinical heterogeneity, clinical laboratories are beginning to transition from targeted gene panels to whole-exome sequencing (WES). However, with the expansion to exome-wide analysis, the limitations of NGS in regions of high homology will become increasingly apparent. Diagnostic laboratories have historically had deep gene-specific knowledge regarding the presence of homologous sequences. Such expertise does not scale easily, and unless this knowledge is precurated, a clinical laboratory may risk reporting false-positive and false-negative variant calls resulting from inaccurate mapping of short reads to highly homologous regions, including pseudogenes. Although most bioinformatic NGS data analysis pipelines are homology-aware, adequate resources and guidance tailored toward use at early stages of test development are lacking. Awareness of problematic regions is critical at the test design stage as well as the reporting stage to guide decision making regarding whether to exclude regions from the test and to decide whether alternative assays must be used for critical genes. This is particularly the case for WES, in which sets of genes to be analyzed can be configured ad hoc based on a specific clinical scenario.

With the GENCODE project identifying 11,216 unique pseudogenes,1 the potential for homology to interfere with clinical sequencing is widespread and must be examined before launching a clinical test. Homology is especially concerning for genes with high detection rates for the disease of interest or for genes where professional societies suggest return of results regardless of the patient’s diagnosis. The American College of Medical Genetics and Genomics (ACMG) recognizes the inherent difficulty in interrogating regions of high homology and says in its most recent guidelines that “the laboratory must develop a strategy for detecting disease-causing variants within regions with known homology.”2 Likewise, the College of American Pathologists states that laboratories testing highly homologous genes must devise methodology that can distinguish between gene and pseudogene and document the accuracy of their testing.3 However, translating these recommendations into clinical practice can be challenging.

Multiple resources have been developed computationally to describe pseudogenes on a genomic level.4,5,6,7 Although these resources are extremely useful to define pseudogene structure, location, and extent of homology to the parent gene, they have not been created with diagnostic applications in mind. Some features of high interest to molecular diagnostic laboratories include the extent to which a homologous gene is problematic for NGS and/or Sanger sequencing, information about the medical importance of the gene, as well as information on the genomic context of the homology. Annotation of homologous genes and exons can facilitate decisions regarding whether to exclude such regions from the NGS assay or to develop ancillary assays to ensure accuracy and optimize clinical sensitivity.

An often underappreciated portion of the NGS bioinformatic pipeline is the alignment of the short reads to the reference genome, which is challenged by reads deriving from regions with high homology. Mapping quality (MQ), an output of alignment algorithms that is associated with each read,8,9 provides an estimate of the probability that the read was aligned to the wrong location in the reference genome. This metric is derived from a complex calculation that takes into account multiple factors, including the probability of the read arising from other areas of the genome and base quality (the probability of a sequencing error) at bases that differ from the reference sequence. MQ may be misleading in specific instances, especially in cases where genuine variants are present in the read. An alternative approach to measuring homology is mappability.10 Mappability is easy to compute and can be calculated in silico using only the sequence of the reference genome. Consequently, this approach does not incur the time and materials cost of running actual samples and can be run before finalizing the design of an NGS assay. Importantly, mappability is not affected by variations across experiment runs or different sequencing platforms, making it a good candidate to serve as the foundation of a universal homology resource.

We adjusted the mappability algorithm10 to mimic the sizes of typical Sanger sequencing amplicons and NGS library fragments and searched for sites of either ≥98% or 100% homology elsewhere in the genome. From this analysis, we generated exon and gene level lists relevant to current clinical testing that include regions that are difficult or impossible to analyze by standard Sanger or NGS approaches. These data confirm the widespread nature of high homology throughout the human exome, which is well understood in general but poses risk as laboratories expand their analyses beyond small panels of familiar genes.

Materials and Methods

Mappability analysis

Whole-genome sequence (hg19, GRCh37) was downloaded from http://genome.ucsc.edu/ and indexed using gem-indexer (http://algorithms.cnag.cat/wiki/The_GEM_library). The gem-mappability tool10 was run with the following arguments: (i) -l 1000 -m 0 -t 2, (ii) -l 250 -m 0 -t 2, and (iii) -l 250 -m 5 -t 7. The mappability outputs were converted to .wig files using gem-2-wig and subsequently to .bed files using BEDOPS wig2bed.11 These .bed files were adjusted to 1-based coordinates using a custom Python script. Genomic coordinates for each exon in the human exome (hg19) were downloaded from http://genome.ucsc.edu and extended 65 bp upstream and downstream of each exon to include flanking sequences that are typically analyzed in hybrid capture NGS approaches and contain clinically important regulatory sequences for RNA splicing. These coordinates are referred to as “start_minus_65bp” and “end_plus_65bp” positions for each exon in the Supplementary Tables online and constitute the regions examined in our analysis.
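The coordinate-handling script itself is not published, so the following is only a minimal Python sketch of the two coordinate steps described above. It assumes that the wig2bed output follows the usual 0-based, half-open BED convention; the names Interval, bed_to_one_based, and pad_exon are hypothetical and the example coordinates are made up.

```python
from typing import NamedTuple

FLANK_BP = 65  # flank size stated in the text


class Interval(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def bed_to_one_based(chrom: str, bed_start: int, bed_end: int) -> Interval:
    """Convert a 0-based, half-open BED interval to 1-based, inclusive."""
    return Interval(chrom, bed_start + 1, bed_end)


def pad_exon(exon: Interval, flank: int = FLANK_BP) -> Interval:
    """Extend an exon by `flank` bp on each side, clamping at position 1."""
    return Interval(exon.chrom, max(1, exon.start - flank), exon.end + flank)


# Made-up example: BED [999, 1100) covers bases 1000..1100 in 1-based terms.
exon = bed_to_one_based("chr1", 999, 1100)
padded = pad_exon(exon)  # start_minus_65bp / end_plus_65bp
print(exon, padded)
```

Keeping the conversion and the padding as separate functions makes it harder to apply the off-by-one adjustment twice, which is a common source of error when BED and 1-based tables are mixed.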

A pipeline of custom Python scripts was run that extracts the mappability score for each position within an exon and calculates homology metrics at the exon level (Supplementary Tables S1–S4 online). These tables were sorted to identify exons that contain high homology to sequences elsewhere in the genome using the criteria outlined in the Results section. Scripts used for this analysis are available upon request.
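As an illustration of the kind of exon-level summary described here, the sketch below computes the percentage of affected positions and the longest contiguous run of affected positions from a list of per-position mappability scores. It is not the pipeline used in this study; the function name and the rule that a position is "affected" when its score is below 1 are assumptions based on the mappability definition given in Figure 1.

```python
from typing import Dict, Sequence


def exon_homology_metrics(scores: Sequence[float]) -> Dict[str, float]:
    """Summarize per-position mappability scores for one padded exon.

    A position is treated as affected when its score is below 1,
    i.e., its k-mer matches more than one genomic location.
    """
    affected = [s < 1.0 for s in scores]
    longest = current = 0
    for flag in affected:
        current = current + 1 if flag else 0  # extend or reset the run
        longest = max(longest, current)
    n = len(scores)
    n_affected = sum(affected)
    return {
        "n_positions": n,
        "n_affected": n_affected,
        "pct_affected": 100.0 * n_affected / n if n else 0.0,
        "longest_affected_run": longest,
    }


# Made-up scores for an 8-position toy exon.
metrics = exon_homology_metrics([1, 0.5, 0.5, 1, 0.33, 0.5, 0.5, 1])
print(metrics)
```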

Generating lists of homologous exons and genes

A mappability score10 was assigned to each genomic position that reflects the degree of homology associated with the local sequence of length l. We varied settings on the mappability algorithm to derive exon-level lists of particular relevance to clinical genetics laboratories (see Figure 1 for details).

Figure 1

Mappability-based generation of homologous exon and gene lists. A sequence of length “l” is scanned across the genome to find homologous sequences allowing for “m” mismatches. Matched sequences at homologous loci with zero (m = 0, 100% match) or five (m = 5, up to 2% mismatch) mismatches are shown. The mappability score is defined as the reciprocal of the number of matches in the genome and is assigned to the first position in the sequence. A unique sequence will be present only once in the genome and will have a mappability score of 1. By contrast, any sequence that matches to more than one location will have a mappability score <1. Mappability scores were calculated for each position in the exome. Four lists were generated using the indicated mappability settings and criteria for assembling exon-level and gene-level lists. For all NGS-relevant mappability analyses, the k-mer length was set to 250 bp (l = 250), because this approximates a commonly used library fragment size. To conservatively mimic the amplicon size used in standard Sanger sequencing approaches, the k-mer length was increased to 1,000 bp (l = 1,000). Distinct exon-level criteria were used to generate four lists. The total number of exons, genes, and genes with medical relevance is presented for each list.
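The mappability score defined in the Figure 1 legend can be illustrated with a brute-force toy calculation. The sketch below is not the GEM algorithm: it scans only the forward strand of a short made-up string, compares every k-mer against every other, and is quadratic in sequence length. The names hamming and toy_mappability are hypothetical.

```python
from typing import List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def toy_mappability(genome: str, l: int, m: int = 0) -> List[float]:
    """Brute-force mappability for every l-mer start position.

    Each score is 1 / (number of l-mers in `genome`, including the query
    itself, that lie within `m` mismatches of the query).
    Forward strand only; an illustrative simplification.
    """
    kmers = [genome[i:i + l] for i in range(len(genome) - l + 1)]
    return [
        1.0 / sum(1 for other in kmers if hamming(query, other) <= m)
        for query in kmers
    ]


toy_genome = "ACGTACGTTT"  # made-up sequence in which ACGT occurs twice
exact = toy_mappability(toy_genome, l=4, m=0)    # 100% identity only
relaxed = toy_mappability(toy_genome, l=4, m=1)  # one mismatch tolerated
print(exact)
print(relaxed)
```

Allowing mismatches can only add matches, so every score under the relaxed setting is less than or equal to the corresponding exact-match score. This is why the "NGS problem" lists are supersets of the "dead zone" regions.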

Homology-type annotation

Sequences of the 11,557 exons along with flanking ±65 bp from the “NGS Problem List–Low Stringency” were obtained and a local alignment for each region was performed against the genome using the BLAT algorithm12 at 90% minimum identity threshold. The target coordinates of the BLAT hits were recorded and annotated as follows: (i) “same gene”—target coordinates do not overlap with the query coordinates but the target coordinates overlap with the query gene coordinates; (ii) “different gene”—target coordinates overlap with another gene coding sequence (CDS) in the exome and do not overlap with the query gene coordinates; (iii) “pseudogene”—target coordinates overlap with psiDRv0 pseudogene regions6 but do not overlap with any CDS in the exome; and (iv) “non-CDS”—target coordinates do not overlap with psiDRv0 pseudogene regions or with any CDS in the exome.
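The four annotation rules can be expressed as a short decision procedure. The sketch below is an illustrative reading of those rules, not the code used in this study: the Region type, the function names, and the example coordinates are hypothetical, and BLAT output parsing is omitted.

```python
from typing import Iterable, NamedTuple, Optional


class Region(NamedTuple):
    chrom: str
    start: int  # inclusive
    end: int    # inclusive


def overlaps(a: Region, b: Region) -> bool:
    """True when two closed intervals on the same chromosome intersect."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def classify_hit(hit: Region, query: Region, query_gene: Region,
                 other_cds: Iterable[Region],
                 pseudogenes: Iterable[Region]) -> Optional[str]:
    """Assign a homology type to one BLAT hit, following rules (i)-(iv)."""
    if overlaps(hit, query):
        return None  # the query matching itself is not a homologous locus
    if overlaps(hit, query_gene):
        return "same gene"
    if any(overlaps(hit, cds) for cds in other_cds):
        return "different gene"
    if any(overlaps(hit, psi) for psi in pseudogenes):
        return "pseudogene"
    return "non-CDS"


# Made-up coordinates for illustration only.
query = Region("chr2", 1000, 1200)
gene = Region("chr2", 500, 5000)
cds_elsewhere = [Region("chr7", 100, 300)]
psi = [Region("chr9", 50, 400)]
for hit in (Region("chr2", 3000, 3200), Region("chr7", 150, 350),
            Region("chr9", 100, 200), Region("chr11", 1, 100)):
    print(hit, classify_hit(hit, query, gene, cds_elsewhere, psi))
```

The order of the checks encodes the precedence in the text: a hit is labeled "pseudogene" only if it overlaps no CDS, and "non-CDS" only if it overlaps neither a CDS nor an annotated pseudogene.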

Medical relevance gene filter and annotation

To focus our analysis on genes with suspected or established medical relevance, our lists of genes containing problematic exons were intersected with a list of 4,773 genes obtained from OMIM (http://omim.org, accessed November 2012), HGMD (http://www.hgmd.cf.ac.uk, accessed November 2012), and ClinVar (http://www.ncbi.nlm.nih.gov/clinvar/, accessed December 2012) (Supplementary Table S13 online). Medically relevant genes identified on the “NGS Problem List–High Stringency” (193 genes; Figure 1) were further annotated to provide basic information useful in a molecular diagnostic setting through searches of OMIM, HGMD, ClinVar, GeneReviews (http://www.ncbi.nlm.nih.gov/books/NBK1116/), and PubMed (http://www.ncbi.nlm.nih.gov/pubmed). Evidence from the published literature was recorded with corresponding PubMed IDs when available. Additional annotations include age of onset, prevalence, and categorization (Mendelian, association, somatic, pharmacogenetic). The evidence for each gene–disease association was graded using specific criteria (Supplementary Figure S1 online) from evidence level 0 (undetermined association) through evidence level 3 (definitive association).

MQ analysis

MQ scores were obtained for each read from 30 WES experiments performed on the Illumina HiSeq 2500 platform, with reads aligned using the BWA-MEM algorithm.13 The scores of the reads aligned to each base within the exome (±15 bp surrounding each exon) were averaged at that position. The percentage of positions with low average MQ scores (<17) was obtained for each gene and reported in the gene-level tables (Supplementary Tables S5–S12 online). This threshold was chosen because it is consistent with settings commonly used in NGS bioinformatic pipelines14 to exclude low-quality reads from the variant calling process.
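To make the MQ summary concrete, the sketch below averages read-level MQ values at each position and reports the percentage of positions whose average falls below the threshold of 17. It also converts MQ to a mismapping probability using the Phred scaling defined in the SAM specification (MQ = −10 log10 of the probability that the mapping position is wrong), under which MQ 17 corresponds to roughly a 2% chance of misplacement. The function names and the input layout (a dictionary from position to a list of read MQ values) are assumptions for illustration, not the structure of the pipeline used here.

```python
from statistics import mean
from typing import Dict, List

LOW_MQ_THRESHOLD = 17  # threshold stated in the text


def mapq_to_error_probability(mq: float) -> float:
    """Phred-scaled MQ -> probability that the read is mismapped."""
    return 10 ** (-mq / 10)


def pct_low_mq_positions(read_mqs: Dict[int, List[int]],
                         threshold: float = LOW_MQ_THRESHOLD) -> float:
    """Percentage of covered positions whose mean read MQ is below threshold.

    `read_mqs` maps a genomic position to the MQ values of the reads
    aligned over it; positions without reads are ignored.
    """
    means = [mean(values) for values in read_mqs.values() if values]
    if not means:
        return 0.0
    low = sum(1 for value in means if value < threshold)
    return 100.0 * low / len(means)


# Made-up MQ values for four positions of a hypothetical gene.
toy_gene = {100: [60, 60], 101: [0, 20], 102: [10, 16], 103: [60]}
print(pct_low_mq_positions(toy_gene))
print(mapq_to_error_probability(LOW_MQ_THRESHOLD))
```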

Results

Generation of lists of homologous regions in the exome for clinical diagnostic use

To identify regions of the exome that may be problematic for diagnostic testing due to high homology, we used the gem-mappability algorithm10 and tailored it to reflect standard NGS and Sanger sequencing approaches (Figure 1).

We compiled four exon-resolution lists with increasing degrees of sequence homology, each containing quantifiable measures of the degree of involvement by high homology, such as the percentage of affected positions and the largest stretch of contiguously affected positions. “Dead zone” lists indicate that identical matches exist elsewhere in the genome, making it impossible to unambiguously map an NGS read or generate a standard-length Sanger amplicon in these affected regions. “NGS problem” lists allow for up to 2% mismatch. In these regions, a read that matches the reference sequence has the potential to align correctly. However, there is a significant risk that a read containing a sequencing error or variant may misalign to a highly homologous region elsewhere. Data from these exon-level lists were aggregated by gene name and served as the basis for building gene-level lists expressing the degree to which a gene is affected by high sequence homology. These lists include any gene with at least one affected exon (Figure 1). Summary metrics are provided for each gene detailing the percentages of affected exons and positions. Gene-level lists were subdivided by intersection with a list of 4,773 genes with known or suspected medical relevance to highlight clinically important genes.
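The aggregation from exon level to gene level can be sketched as follows. This is an assumed reconstruction for illustration: the row layout (gene, positions, affected positions, list membership), the function name, and the choice to compute the percentage of affected positions over all exons of a gene are not specified in the text, and the gene names are placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

ExonRow = Tuple[str, int, int, bool]  # gene, n_positions, n_affected, on_list


def aggregate_by_gene(exon_rows: Iterable[ExonRow],
                      medically_relevant: Set[str]) -> Dict[str, dict]:
    """Build a gene-level summary; keep genes with at least one listed exon."""
    totals = defaultdict(lambda: {"exons": 0, "listed": 0, "pos": 0, "aff": 0})
    for gene, n_positions, n_affected, on_list in exon_rows:
        t = totals[gene]
        t["exons"] += 1
        t["listed"] += 1 if on_list else 0
        t["pos"] += n_positions
        t["aff"] += n_affected
    summary = {}
    for gene, t in totals.items():
        if t["listed"] == 0:
            continue  # gene has no affected exon, so it is not on the list
        summary[gene] = {
            "pct_affected_exons": 100.0 * t["listed"] / t["exons"],
            "pct_affected_positions": 100.0 * t["aff"] / t["pos"] if t["pos"] else 0.0,
            "medically_relevant": gene in medically_relevant,
        }
    return summary


# Placeholder genes and counts, for illustration only.
rows = [("GENE_A", 200, 200, True), ("GENE_A", 300, 0, False),
        ("GENE_B", 100, 0, False), ("GENE_C", 400, 100, True)]
gene_summary = aggregate_by_gene(rows, medically_relevant={"GENE_A"})
print(gene_summary)
```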

List 1. “NGS Dead Zone” (2.2% of exons; 619 genes): These regions are entire exons or large contiguous portions of exons that have 100% identity to other loci. A contiguous region of 250 or more affected positions indicates a stretch of 100% homology that extends 499 or more base pairs. When attempting to sequence the central positions of a homologous sequence of this length, a read’s mate pair cannot assist with unambiguous alignment of the read because it also falls inside the region of 100% homology, given a library fragment size of 250 bp.
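The relationship between the number of contiguous affected positions and the length of the underlying homologous stretch follows from the fact that each position represents the start of one k-mer. A run of r consecutive start positions with k-mers of length l covers r + l − 1 bases. The helper below, whose name is hypothetical, simply encodes that arithmetic.

```python
def homologous_span(run_length: int, kmer_length: int = 250) -> int:
    """Bases covered by `run_length` consecutive k-mer start positions."""
    if run_length < 1 or kmer_length < 1:
        raise ValueError("run_length and kmer_length must be positive")
    # The first k-mer covers kmer_length bases; each further start adds one.
    return run_length + kmer_length - 1


print(homologous_span(250))  # 250 + 250 - 1 = 499
```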

Lists 2 and 3. “NGS Problem Lists–High Stringency” (3.9% of exons; 1,168 genes) and “Low Stringency” (5.9% of exons; 2,512 genes): After defining regions that are definitively problematic for standard NGS approaches, we adjusted the mappability settings to allow for up to 2% mismatches (Figure 1) to capture additional regions that may pose problems due to high homology but do not share 100% identity with other loci. The purpose of the NGS problem lists is to warn users about regions that may pose problems for standard NGS, especially under certain conditions, but are not altogether unanalyzable. A “high stringency” exon-based list included exons with ≥90% of their positions affected by high homology or a contiguous stretch of 250 or more affected positions to account for significant regions of high homology within large exons (Figure 1). In these cases, the overwhelming majority of reads generated from this region will possess high homology to other loci, which may lead to misalignment to other portions of the genome. Variant calls based on reads mapped to these affected regions should be viewed with extreme caution. The “low stringency” list captures any exon with at least one position whose associated 250-bp sequence fragment maps to more than one place in the genome with 2% or fewer mismatches (Figure 1). This analysis aims to detect all regions at risk for homology interference, and thus a substantial fraction of exons on this list may not cause analytical problems in practice.
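The two exon-level criteria for the NGS problem lists can be written as a small predicate. The sketch below assumes that its inputs were derived from the l = 250, m = 5 mappability run; the function and key names are hypothetical, and only the thresholds (≥90% of positions, a run of ≥250 positions, at least one position) are taken from the text.

```python
from typing import Dict


def ngs_problem_list_membership(n_positions: int, n_affected: int,
                                longest_run: int, min_pct: float = 90.0,
                                min_run: int = 250) -> Dict[str, bool]:
    """Decide membership of one exon in the two NGS problem lists."""
    pct_affected = 100.0 * n_affected / n_positions if n_positions else 0.0
    return {
        # >=90% of positions affected, or a contiguous run of >=250 positions
        "high_stringency": pct_affected >= min_pct or longest_run >= min_run,
        # any single affected position is enough
        "low_stringency": n_affected >= 1,
    }


# Made-up exons: mostly affected, large exon with a long run, lightly affected.
print(ngs_problem_list_membership(300, 280, 280))
print(ngs_problem_list_membership(2000, 300, 260))
print(ngs_problem_list_membership(300, 10, 10))
```

The run-length criterion matters for large exons, where a long homologous block can represent only a small percentage of positions and would otherwise be missed by the 90% rule.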

List 4. “Sanger Dead Zone” (1.6% of exons; 467 genes): Some clinical laboratories routinely use Sanger sequencing to confirm variants detected by NGS. A homologous region that is difficult to analyze by NGS may still be interrogated accurately by Sanger sequencing provided that it is possible to design amplicons larger than the region affected by sequence homology. We identified regions in the exome that cannot be Sanger sequenced using standard short amplicon protocols, assuming a maximum amplicon size of 1,000 bp.

To test whether our mappability-based approach adequately flags regions with empirically demonstrated poor MQ, we compared our data with MQ scores derived from 30 WES data sets (see Methods for details). The majority of genes on our lists contained positions with low average MQ (Figure 2, Supplementary Tables S5–S12 online). Additionally, there was a positive correlation between the percentage of positions with low MQ and the percentage of positions flagged as affected by our analyses. The “NGS Problem List–Low Stringency” identified 1,557 of the 1,676 genes (92.9%) in the exome with at least one position with a low average MQ score. This comparison serves as an empirical validation of our method and suggests that our resources can help identify regions that are at risk for problems due to high homology before implementing and running an NGS assay.
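The comparison reported here amounts to asking what fraction of genes with empirically low MQ is also flagged by the mappability-based list. A minimal sketch of that calculation is shown below; the function name and the placeholder gene sets are hypothetical, and only the counts 1,557 and 1,676 come from the text.

```python
from typing import Set


def fraction_flagged(low_mq_genes: Set[str], flagged_genes: Set[str]) -> float:
    """Fraction of low-MQ genes that also appear on the mappability list."""
    if not low_mq_genes:
        raise ValueError("low_mq_genes must not be empty")
    return len(low_mq_genes & flagged_genes) / len(low_mq_genes)


# Placeholder gene sets, for illustration only.
toy_fraction = fraction_flagged({"A", "B", "C", "D"}, {"A", "B", "C", "X"})
print(toy_fraction)

# Counts reported in the text.
reported = 1557 / 1676
print(f"{100 * reported:.1f}%")  # 92.9%
```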

Figure 2

Gene-level lists include genes with high medical relevance. Medically relevant genes with definitive disease association (evidence level 3, Supplementary Figure S1 online) from the “NGS Dead Zone” and “Sanger Dead Zone” lists are shown with selected annotations sorted by percentage of affected positions. % Observed low mapping quality (MQ) indicates the percentage of positions with low average MQ scores within the gene calculated from whole-exome sequencing and is presented for comparison with our mappability-based metrics. Homology type classifies the genomic context of the homologous sequences for each gene. Categories of mutant alleles are abbreviated as follows: M, Mendelian; A, risk association; S, somatic. Taking stereocilin (STRC) as an example, one can see that it has been associated with sensorineural hearing loss. Homologous loci elsewhere in the genome include an annotated pseudogene. Ten of its exons fall within the “NGS Dead Zone” and cannot be reliably analyzed by standard next-generation sequencing (NGS). A candidate orthogonal approach to “fill in” these homologous regions after NGS might be Sanger sequencing. However, four of the STRC exons are also found on the “Sanger Dead Zone” list, and an alternative method, such as long-range polymerase chain reaction, should be used to avoid pseudogene interference.18

High-priority clinically relevant genes affected by homology

It is particularly important to be aware of genes that are both affected by homology and strongly associated with a clinical phenotype because these are the most likely to be tested, and variants found in these genes are most likely to be causally linked to a patient’s phenotype.

To focus attention on “cannot miss” genes affected by high homology that should be considered when designing and interpreting clinical assays, we selected the 193 medically relevant genes from the “NGS Problem List–High Stringency” analysis (Supplementary Table S8 online) for further manual annotation to highlight well-known genes with established roles in disease. Efforts to fully and deeply curate all gene–disease relationships are underway by the Clinical Genome Resource (https://clinicalgenome.org/), but this level of curation is beyond the scope of our work. However, we curated this set of genes as a first step toward understanding the clinical importance of the highly homologous genes identified in our analysis, with the understanding that this curation will need to be frequently updated as new knowledge on gene–disease relationships emerges. Of the 193 genes, 85 (44.0%) were well-established disease genes (evidence level 3, Supplementary Figure S1 online). Additionally, genes with definitive gene–disease associations from the “NGS Dead Zone” and the “Sanger Dead Zone” lists are presented in Figure 2 and sorted by the percentage of affected positions. These genes are significant contributors to prevalent and severe genetic disorders, including PMS2 (colon cancer), STRC (hearing loss), and TTN (dilated cardiomyopathy). Three genes on the “NGS Problem List–High Stringency” (Supplementary Table S8 online), namely MYH7, PMS2, and SDHC, appear on the ACMG incidental findings list,15 with the recommendation that they be analyzed during exome or genome sequencing. Other genes with high homology frequently tested in clinical practice include PKD1 (autosomal dominant polycystic kidney disease), CYP2D6 (pharmacogenetics), and CHEK2 (cancer predisposition).

Categorization by type of homology

Although all highly homologous sequences pose the same challenges when reads are aligned to the reference genome, the specific genomic context of the affected regions can have significantly different implications for assays that are chosen to supplement NGS and allow for comprehensive interrogation. The subsequent sections describe four common scenarios, their implications for clinical testing, and possible solutions. These solutions may not fit every laboratory’s needs because operational implementation of NGS assays can differ, but they provide a template for the clinical testing of genes with these specific difficult genomic contexts. We identified the locations of the sequences throughout the genome homologous to each gene on our lists and classified these regions to provide a homology-type annotation (Figure 3). It is important to note that a gene may be affected by multiple types of homology due to the presence of multiple highly homologous sequences or due to a single highly homologous sequence that closely matches multiple regions of differing homology types.

Figure 3

Annotation of homology type. (a) Schematic overview of classification into four homology types—same gene, different gene, pseudogene, and non-CDS (1–4). An example of a local alignment search is shown using a homologous exon (black) as the query that returns a match to each of the four genomic contexts (gray, 1–4). ψ indicates an annotated pseudogene. An actual match from our analysis is shown for each homology type in the table. Note that exon numbering is arbitrary and does not correspond to any specific transcript. (b) Table listing the percentage of genes with each homology type annotation in the “NGS Problem List–Low Stringency.” Note that individual genes may be affected by multiple types of homology.

Intragenic homology (same gene). In this case, the genomic context for homology resides within the CDS of the same gene (Figure 3). An example is the nebulin (NEB) gene, in which an 8-exon segment is triplicated (together spanning exons 82–105) with nearly 100% sequence identity between the repeated blocks. For this type of intragenic large tandem repeat, commonly used approaches to discriminate homologous sequences (e.g., long-range polymerase chain reaction) are inappropriate. Solutions include gene-aware bioinformatic filtering that applies a gene-specific allelic fraction threshold, recognizing that a heterozygous genotype in any one of the repeated regions can yield a reduced percentage of variant reads because of read misalignment. Other solutions include visual analysis of NGS read alignments or Sanger sequencing traces to identify the presence of variant alleles at low read fractions, although any variant detected through this approach could not be specifically assigned to any one of the triplicated regions.
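The reduced variant read fraction expected in a repeated region can be estimated with simple arithmetic. The sketch below assumes, as an idealization, that reads from all copies of the repeat are indistinguishable and therefore pile up evenly across the copies; under that assumption a heterozygous variant in one of n diploid copies contributes 1 of 2n alleles. The function name is hypothetical, and real data will deviate from this expectation.

```python
from fractions import Fraction


def expected_variant_fraction(n_copies: int, variant_alleles: int = 1) -> Fraction:
    """Expected variant read fraction when reads from `n_copies` identical
    repeat copies (two alleles each) cannot be told apart."""
    total_alleles = 2 * n_copies
    if not 0 <= variant_alleles <= total_alleles:
        raise ValueError("variant_alleles must be between 0 and 2 * n_copies")
    return Fraction(variant_alleles, total_alleles)


unique_locus = expected_variant_fraction(1)  # ordinary heterozygote: 1/2
triplicated = expected_variant_fraction(3)   # one variant allele among six: 1/6
print(unique_locus, triplicated, float(triplicated))
```

A caller tuned to expect roughly 50% variant reads for a heterozygote could therefore discard a genuine variant in a triplicated segment, which is the motivation for a gene-specific threshold.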

Homology to functional genes (different gene). In the case of high homology between paralogous functional genes, both may need to be analyzed if they are relevant for the same disorder (Figure 3). One example is the cardiac muscle alpha and beta heavy chain genes, MYH6 and MYH7. Both are associated with hypertrophic cardiomyopathy, are located adjacent to each other on chromosome 14, and share significant homology of exon 26 such that NGS read mapping is problematic. However, the high homology between the two genes is restricted to the coding sequences and does not extend into the introns. In such cases where homology is restricted to coding sequences, it may be possible to design a unique Sanger sequencing assay to resolve variant calls post-NGS.

Homology to nonfunctional pseudogenes (pseudogene). Pseudogene sequence can be seen as a contaminant and possible source of assay interference because unique analysis of the functional gene is desired (Figure 3).16 One approach to enable clinical testing of genes in this category is to use long-range polymerase chain reaction sequencing assays to uniquely amplify the gene of interest. This approach has been used successfully for the analysis of PKD1 in the diagnosis of autosomal dominant polycystic kidney disease17 and STRC in the diagnosis of autosomal recessive sensorineural hearing loss.18

Homology to other sequences not annotated above (non-CDS). These regions of homology fall outside the exonic CDS and annotated pseudogenes (Figure 3). High homology to these sequences can potentially be overcome using long-range polymerase chain reaction as described for annotated pseudogenes.

Resource availability

Our exon-level and gene-level lists are freely accessible on the precisionFDA (https://precision.fda.gov/) and NCBI GeT-RM (http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/) websites.

Discussion

Exome sequencing allows for the large-scale analysis of genes without the intimate knowledge that comes with testing single genes or small gene panels. Highly homologous sequences are prevalent within the exome and have the potential to confound molecular diagnostic testing. We present a resource to empower clinical laboratories to recognize and overcome challenges presented by homology. This resource is primarily geared toward laboratory directors and probably has the greatest utility during the test design phase, but it can be used at all steps in the diagnostic testing process, including bioinformatic filtering to recognize homologous sequences that may result in a false-positive or false-negative variant call.

Our gene-level lists provide a quick reference to gauge possible challenges, design test content, and determine whether ancillary tests need to be developed for certain genes. These lists can be sorted by medical relevance and the degree of affectedness at the exon or base pair levels. Alternatively, specific genes can be queried by name. Use of the exon-level lists provides additional detail such as the number of positions affected within the exons as well as their precise location. The genomic coordinates can be interfaced with bioinformatic pipelines or used to generate tracks that will flag regions with high homology.
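One way to interface the exon-level coordinates with a genome browser or pipeline is to write them out as a BED track. The sketch below assumes that the list coordinates are 1-based and inclusive, as described in the Methods, and converts them to the 0-based, half-open BED convention. The function name, track name, and example rows are hypothetical placeholders rather than entries from the Supplementary Tables.

```python
from typing import Iterable, List, Tuple

ExonRecord = Tuple[str, int, int, str]  # chrom, start (1-based), end, label


def exons_to_bed(exons: Iterable[ExonRecord],
                 track_name: str = "high_homology") -> List[str]:
    """Render flagged exons as BED lines preceded by a track header."""
    lines = [f'track name="{track_name}" '
             'description="Exons flagged for high homology"']
    for chrom, start, end, label in sorted(exons):
        # 1-based inclusive -> 0-based half-open: shift the start only
        lines.append(f"{chrom}\t{start - 1}\t{end}\t{label}")
    return lines


# Placeholder rows, for illustration only.
bed_lines = exons_to_bed([("chr1", 2000, 2150, "GENE_A_exon3"),
                          ("chr1", 1000, 1100, "GENE_A_exon2")])
print("\n".join(bed_lines))
```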

We compared our mappability-based approach to MQ, which is an output of commonly used read alignment algorithms such as BWA.8,9 Mappability exclusively measures homology within the reference genome, whereas MQ is calculated empirically for each read within an NGS experiment and is influenced by multiple variables in addition to the presence of homologous sequences. It is critical to recognize that MQ scores are not designed to handle sequence variants and can be misleading in the presence of variants that increase the percent identity of a read with a homologous region. Although our mappability-based approach and MQ should be considered distinct metrics that are not redundant or interchangeable, there is substantial overlap in a global sense due to the shared effect of homology. The agreement that we observed between these measures validated that our method could identify problematic portions of the exome that are at risk for homology interference before performing an NGS assay. We believe that our mappability-based approach and MQ provide complementary measures that should be used together in NGS assay development and implementation.

The latest versions of many variant callers attempt to remedy homology issues by not calling variants in regions of low MQ. Although this approach reduces, but does not completely eliminate, false-positive variant calls in regions of high homology, it also raises the possibility of false-negative results because actual variants in homologous regions may be filtered out. Moreover, if a clinical laboratory is not aware that reads in these regions are excluded due to low MQ scores, then there can be an illusion of comprehensive analysis, when in reality pertinent genomic regions remain ineffectively analyzed. Thus, laboratories should closely scrutinize homologous regions to avoid both missing important variants and promising coverage of regions that cannot be accurately assessed. This is especially important for genes with high medical relevance, including those on the ACMG incidental findings list.15 For other regions of lesser importance whose analysis is impaired by homology, a laboratory may choose to recognize the technical limitations of the assay and forgo analysis of these loci.

Limitations

The gene curation performed as part of this resource is limited in scope and focused on a set of homologous, medically relevant genes that are likely to pose problems for NGS assays. Knowledge of medical relevance is continually expanding and quickly outdated, and the information presented in this resource should not be relied upon solely. Using the databases that were the basis for the data in this article, the list of (potentially) disease-associated genes has now grown to more than 7,000 genes; therefore, future curation will undoubtedly add critical disease genes to the analysis presented here. The data presented here should be regarded as a starting point for clinical laboratories, keeping in mind that they will continue to evolve.

Future directions

We anticipate that future developments will bring new challenges and opportunities with regard to the problem of homology interference in NGS. The inclusion of additional sequences in the GRCh38 genome assembly may increase the number of sites affected by high homology and exacerbate the problems encountered in currently affected regions. Additionally, polymorphic pseudogenes have been identified that may compromise clinical testing in a small percentage of the population. For example, a processed pseudogene that is homologous to SMAD4 has been found in ~0.26% of the population.19 Further studies are necessary to continue to elucidate the variability in homologous sequences among individuals. Long-read sequencing technologies, such as those developed by Pacific Biosciences and Oxford Nanopore Technologies, can largely overcome the problem of inaccurate alignment of homologous reads, although these platforms have not yet been implemented in many diagnostic laboratories. Additionally, synthetic long-read approaches can be used with current-generation sequencing instruments to try to circumvent problems posed by homology. Although these approaches and technologies hold promise for more accurate sequencing read alignments, the resource described herein should help laboratories using standard NGS and Sanger sequencing approaches to clinical testing minimize analytical errors due to homology interference.

Disclosure

The authors declare no conflict of interest.