Brief Communication

Efficient genotype compression and analysis of large genetic-variation data sets

Nature Methods volume 13, pages 63–65 (2016)

Abstract

Genotype Query Tools (GQT) is an indexing strategy that expedites analyses of genome-variation data sets in Variant Call Format based on sample genotypes, phenotypes and relationships. GQT's compressed genotype index minimizes decompression for analysis, and its performance relative to that of existing methods improves with cohort size. We show substantial (up to 443-fold) gains in performance over existing methods and demonstrate GQT's utility for exploring massive data sets involving thousands to millions of genomes. GQT can be accessed at https://github.com/ryanlayer/gqt.
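As a rough illustration of what a sample-oriented genotype index can look like, the sketch below stores one bitmap per sample and genotype class, so that a query such as "variants at which every selected sample is non-reference" reduces to bitwise operations. It is a minimal sketch under stated assumptions, not GQT's implementation or file format: the genotype encoding, the function names `build_index` and `variants_where_all`, and the toy data are all hypothetical, and the compression step described in the abstract is omitted.

```python
from typing import Dict, List

# Assumed genotype encoding for this sketch only:
# 0 = homozygous reference, 1 = heterozygous,
# 2 = homozygous alternate, 3 = unknown
N_CODES = 4


def build_index(genotypes: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Build one bitmap per genotype code for each sample.

    Bit i of bitmap g is set when the sample has genotype g at variant i.
    Python ints are used as arbitrary-length bitsets.
    """
    index = {}
    for sample, calls in genotypes.items():
        bitmaps = [0] * N_CODES
        for i, g in enumerate(calls):
            bitmaps[g] |= 1 << i
        index[sample] = bitmaps
    return index


def variants_where_all(index: Dict[str, List[int]], samples: List[str],
                       codes: List[int], n_variants: int) -> List[int]:
    """Return variant indices at which every listed sample has one of `codes`."""
    result = (1 << n_variants) - 1  # start with all variants selected
    for s in samples:
        mask = 0
        for g in codes:
            mask |= index[s][g]  # union of the allowed genotype classes
        result &= mask           # intersect across samples
    return [i for i in range(n_variants) if (result >> i) & 1]


# Hypothetical toy data: three samples, four variants.
genotypes = {
    "sampleA": [0, 1, 2, 1],
    "sampleB": [1, 1, 0, 2],
    "sampleC": [0, 1, 1, 3],
}
index = build_index(genotypes)
non_ref_in_a_and_b = variants_where_all(index, ["sampleA", "sampleB"], [1, 2], 4)
```

Because each query touches only the bitmaps of the selected samples, the work depends on which individuals are queried rather than on parsing every genotype field of every record.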

Acknowledgements

We are grateful to C. Chiang for conceptual discussions, I. Levicki for helpful advice on AVX2 operations, S. McCarthy and P. Danecek for their guidance with htslib, and Z. Kronenberg for advice on population genetics measures. We also thank the Exome Aggregation Consortium and the groups that provided exome variant data for comparison. A full list of contributing groups can be found at http://exac.broadinstitute.org/about. This research was supported by a US National Human Genome Research Institute award to A.R.Q. (NIH R01HG006693).

Author information

Affiliations

  1. Department of Human Genetics, University of Utah, Salt Lake City, Utah, USA.

    • Ryan M Layer
    • Neil Kindlon
    • Aaron R Quinlan
  2. Analytical and Translational Genetics Unit, Harvard Medical School, Boston, Massachusetts, USA.

    • Konrad J Karczewski
  3. Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA.

    • Aaron R Quinlan
  4. USTAR Center for Genetic Discovery, University of Utah, Salt Lake City, Utah, USA.

    • Aaron R Quinlan

Consortia

  1. Exome Aggregation Consortium

    A list of members and affiliations appears in Supplementary Note 1.

Authors

  1. Ryan M Layer

  2. Neil Kindlon

  3. Konrad J Karczewski

  4. Aaron R Quinlan

Contributions

R.M.L. designed and wrote GQT and analyzed the data. N.K. wrote a fast output method. K.J.K. analyzed ExAC data. A.R.Q. conceived and designed the study. R.M.L. and A.R.Q. wrote the manuscript.

Competing interests

The authors declare no competing financial interests.

Corresponding authors

Correspondence to Ryan M Layer or Aaron R Quinlan.

Supplementary information

PDF files

  1. Supplementary Text and Figures

    Supplementary Figures 1–8, Supplementary Note 1 and Supplementary Table 1

About this article

DOI

https://doi.org/10.1038/nmeth.3654