We comprehensively assessed the influence of reference minor alleles (RMAs), one of the inherent problems of the human reference genome sequence.
The variant call format (VCF) files provided by the 1000 Genomes Project and the Exome Aggregation Consortium (ExAC) were used to identify RMA sites. All coding RMA sites were checked for concordance with UniProt and for the presence of same-codon variants. RMA-corrected predictions of functional effect were obtained with the SIFT, PolyPhen-2, and PROVEAN standalone tools and compared with dbNSFP v2.9 for consistency.
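As an illustration of the first step, the sketch below flags candidate RMA sites in VCF records. It is not the pipeline used in the study. It assumes that each record carries an `AF` entry in its INFO column giving the alternate allele frequencies, and that a site counts as an RMA when the implied reference allele frequency falls below 0.5. The function names `parse_info` and `find_rma_sites` and the `threshold` parameter are hypothetical.

```python
def parse_info(info):
    """Split a VCF INFO column into a dict of key -> raw string value."""
    fields = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            fields[key] = value
    return fields


def find_rma_sites(vcf_lines, threshold=0.5):
    """Return (chrom, pos, ref, alt, ref_af) for sites whose reference
    allele frequency is below `threshold`.

    Assumption: INFO/AF lists one frequency per ALT allele, so the
    reference allele frequency is 1 minus their sum.
    """
    sites = []
    for line in vcf_lines:
        # Skip blank lines and header lines (both '##' and '#CHROM').
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue
        chrom, pos, _id, ref, alt = cols[:5]
        info = parse_info(cols[7])
        if "AF" not in info:
            continue
        try:
            alt_afs = [float(x) for x in info["AF"].split(",")]
        except ValueError:
            continue  # malformed or missing frequency, skip the record
        ref_af = 1.0 - sum(alt_afs)
        if ref_af < threshold:
            sites.append((chrom, int(pos), ref, alt, round(ref_af, 6)))
    return sites
```

Calling `find_rma_sites(open("calls.vcf"))` on an uncompressed VCF (the file name is a placeholder) would return the flagged sites. A production analysis would more likely rely on a dedicated VCF parser and on population-specific frequency fields.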
We systematically characterized the problem of RMAs and identified several ways in which RMAs could interfere with accurate variant discovery and annotation. We discovered a systematic bias in automated variant effect prediction at RMA loci, as well as widespread switching of functional consequences for variants located in the same codon as an RMA. As a convenient way to address the problem of RMAs, we developed a simple bioinformatic tool that identifies variation at RMA sites and provides correct annotations for all such substitutions. The tool is available free of charge at http://rmahunter.bioinf.me.
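The switching of functional consequences for same-codon variants can be illustrated with a toy example. The sketch below is not the published tool. It compares the consequence of a substitution annotated against the reference codon with the consequence obtained once the RMA position is replaced by the major allele. It assumes the standard genetic code and a codon given on the coding strand. The function names and the example codon are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def consequence(codon_before, codon_after):
    """Classify a codon change using coarse consequence labels."""
    aa_before = CODON_TABLE[codon_before]
    aa_after = CODON_TABLE[codon_after]
    if aa_before == aa_after:
        return "synonymous"
    if aa_after == "*":
        return "stop_gained"
    if aa_before == "*":
        return "stop_lost"
    return "missense"


def substitute(codon, index, base):
    """Return `codon` with the base at 0-based `index` replaced."""
    return codon[:index] + base + codon[index + 1:]


def same_codon_effects(ref_codon, rma_index, major_allele, var_index, var_allele):
    """Return (naive, corrected) consequences for a same-codon variant.

    naive:     variant applied to the reference codon (carries the RMA).
    corrected: variant applied to the codon carrying the major allele.
    """
    naive = consequence(ref_codon, substitute(ref_codon, var_index, var_allele))
    background = substitute(ref_codon, rma_index, major_allele)
    corrected = consequence(
        background, substitute(background, var_index, var_allele)
    )
    return naive, corrected
```

For a hypothetical reference codon `TTA` (Leu) whose first base is an RMA with major allele `C`, an A>T change at the third base is annotated as missense (`TTA` to `TTT`, Leu to Phe) against the reference, but is synonymous (`CTA` to `CTT`, Leu to Leu) on the major-allele background, so `same_codon_effects("TTA", 0, "C", 2, "T")` returns `("missense", "synonymous")`.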
Correction of RMA annotation enhances the accuracy of next-generation sequencing–based methods in clinical practice.
The study was supported by Russian Science Foundation grant 14-50-00069. Equipment from the Biobank of the Research Park of St. Petersburg State University was used for the whole-exome sequencing experiments analyzed in the present study.
Supplementary material is linked to the online version of the paper at http://www.nature.com/gim