Complete knowledge of the genetic variation in individual human genomes is a crucial foundation for understanding the etiology of disease. Genetic variation is typically characterized by sequencing individual genomes and comparing reads to a reference. Existing methods do an excellent job of detecting variants in approximately 90% of the human genome; however, calling variants in the remaining 10% of the genome (largely low-complexity sequence and segmental duplications) is challenging. To improve variant calling, we developed a new algorithm, DISCOVAR, and examined its performance on improved, low-cost sequence data. Using a newly created reference set of variants from the finished sequence of 103 randomly chosen fosmids, we find that some standard variant call sets miss up to 25% of variants. We show that the combination of new methods and improved data increases sensitivity by several fold, with the greatest impact in challenging regions of the human genome.
We thank the Broad Institute Genomics Platform for generating data for this project and J. Bochicchio for project management. We thank the 1000 Genomes Project for early access to 250-base reads from the Centre d'Etude du Polymorphisme Humain (CEPH) trio. We thank Illumina for the use of sequence data from the 17-member CEPH pedigree and M. Eberle for detailed information on these data. We thank Z. Iqbal for detailed advice regarding running Cortex on 250-base reads and M. Carneiro and E. Banks for detailed advice regarding running GATK on the same data. We thank H. Li for early access to bwa-mem and for his extensive examination of our manuscript. We thank S. Fisher for help preparing the data cost estimate. We thank M. Daly, M. DePristo, S. Gabriel, Z. Iqbal, D. MacArthur, D. Neafsey and M. Zody for helpful comments. This project has been funded in part with federal funds from the National Human Genome Research Institute, US National Institutes of Health, US Department of Health and Human Services, under grants R01HG003474 and U54HG003067, and in part with federal funds from the National Institute of Allergy and Infectious Diseases, US National Institutes of Health, US Department of Health and Human Services, under contract HHSN272200900018C.
Integrated supplementary information
This data set provides a tab-delimited list of the variants called in the fosmid regions, showing the fosmid identifier, position, reference allele, alternate allele and a comma-separated list of notes. Each note is either “fosmid,” indicating that the variant is present in the fosmid reference set, or of the form caller=state, where caller is Platinum-100, GATK-250, DISCOVAR or Cortex and state is het, hom or hom+. Here het means that both the variant and reference were called, hom means that the variant but not the reference was called and hom+ means that the variant was called as homozygous. This data set was automatically generated. See also the manually identified defects noted in Supplementary Table 1a.
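As an illustration of the record layout described above, the following is a minimal Python sketch of a parser for one line of such a file. It assumes the five columns appear in the order listed, that the position is an integer and that the file has no header row; none of this is confirmed by the data set itself. The names `VariantRecord` and `parse_variant_line` and the sample line are hypothetical and are not part of the published data or software.

```python
from dataclasses import dataclass, field
from typing import Dict

# Caller and state names as listed in the data set description.
CALLERS = {"Platinum-100", "GATK-250", "DISCOVAR", "Cortex"}
STATES = {"het", "hom", "hom+"}


@dataclass
class VariantRecord:
    """One variant line (hypothetical structure, for illustration only)."""
    fosmid_id: str
    position: int
    ref: str
    alt: str
    in_fosmid_reference: bool = False
    calls: Dict[str, str] = field(default_factory=dict)  # caller -> state


def parse_variant_line(line: str) -> VariantRecord:
    """Parse one tab-delimited line, assuming the column order
    fosmid identifier, position, reference allele, alternate allele, notes."""
    fosmid_id, position, ref, alt, notes = line.rstrip("\n").split("\t")
    record = VariantRecord(fosmid_id, int(position), ref, alt)
    for note in notes.split(","):
        note = note.strip()
        if not note:
            continue  # tolerate an empty notes field
        if note == "fosmid":
            record.in_fosmid_reference = True
            continue
        caller, _, state = note.partition("=")
        if caller not in CALLERS or state not in STATES:
            raise ValueError(f"unrecognized note: {note!r}")
        record.calls[caller] = state
    return record


# Made-up example line, not taken from the actual data set.
example = "F1\t1234\tA\tG\tfosmid,DISCOVAR=het,GATK-250=hom"
record = parse_variant_line(example)
```

With records in this form, a caller's missed variants in the reference set would be those for which `in_fosmid_reference` is true and the caller is absent from `calls`.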