Efficient genotype compression and analysis of large genetic-variation data sets

Abstract

Genotype Query Tools (GQT) is an indexing strategy that expedites analyses of genome-variation data sets in Variant Call Format based on sample genotypes, phenotypes and relationships. GQT's compressed genotype index minimizes decompression for analysis, and its performance relative to that of existing methods improves with cohort size. We show substantial (up to 443-fold) gains in performance over existing methods and demonstrate GQT's utility for exploring massive data sets involving thousands to millions of genomes. GQT can be accessed at https://github.com/ryanlayer/gqt.

Figure 1: Creation and data exploration of an individual-centric genotype index.
Figure 2: GQT query performance and applications of the genotype index.


Acknowledgements

We are grateful to C. Chiang for conceptual discussions, I. Levicki for helpful advice on AVX2 operations, S. McCarthy and P. Danecek for their guidance with htslib, and Z. Kronenberg for advice on population genetics measures. We also thank the Exome Aggregation Consortium and the groups that provided exome variant data for comparison. A full list of contributing groups can be found at http://exac.broadinstitute.org/about. This research was supported by a US National Human Genome Research Institute award to A.R.Q. (NIH R01HG006693).

Author information

R.M.L. designed and wrote GQT and analyzed the data. N.K. wrote a fast output method. K.J.K. analyzed ExAC data. A.R.Q. conceived and designed the study. R.M.L. and A.R.Q. wrote the manuscript.

Correspondence to Ryan M. Layer or Aaron R. Quinlan.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Additional information

A list of members and affiliations appears in Supplementary Note 1.

Integrated supplementary information

Supplementary Figure 1 Sorting variants by allele frequency improves compression.

(a) A graphical comparison of the genotype distribution of individuals (rows) and variants (columns) before and after sorting. These data represent genotypes from the 1,000 Genomes Project, phase 3, for a portion of chromosome 20. (b) The run length distribution for unsorted and sorted genotypes. (c) The distribution of the number of runs for sorted and unsorted data. In both cases the second peak is composed predominantly of individuals of African descent (AFR); 604/661 AFR individuals are in the second peak in the sorted case, and 640/661 in the unsorted case.
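
The effect of sorting on run counts can be illustrated with a small simulation. The following is a minimal sketch, not the GQT implementation: the genotype matrix is randomly generated, the skewed allele-frequency model and all function names are assumptions made for illustration, and genotypes are reduced to a single carrier/non-carrier bit.

```python
import random

def count_runs(bits):
    """Number of maximal runs of identical values in a sequence."""
    if not bits:
        return 0
    return 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)

def simulate(n_individuals=50, n_variants=400, seed=0):
    """Hypothetical carrier matrix: rows are individuals, columns are variants."""
    rng = random.Random(seed)
    # Assumed allele frequencies, skewed toward rare variants.
    freqs = [rng.random() ** 4 for _ in range(n_variants)]
    return [[1 if rng.random() < f else 0 for f in freqs]
            for _ in range(n_individuals)]

def sort_variants_by_frequency(genotypes):
    """Reorder columns by alternate-allele count; also return the new order."""
    n_variants = len(genotypes[0])
    counts = [sum(row[v] for row in genotypes) for v in range(n_variants)]
    order = sorted(range(n_variants), key=lambda v: counts[v])
    return [[row[v] for v in order] for row in genotypes], order

genotypes = simulate()
sorted_genotypes, order = sort_variants_by_frequency(genotypes)

# Fewer runs per individual means longer runs, which run-based encodings favor.
runs_before = sum(count_runs(row) for row in genotypes)
runs_after = sum(count_runs(row) for row in sorted_genotypes)
print(runs_before, runs_after)
```

The `order` list plays the role of a mapping from the sorted variant order back to the source order, which any sorted index needs in order to report results in the original coordinates.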

Supplementary Figure 2 Bitmaps enable rapid genotype comparisons en masse.

(a) A bit array marks the existence of one genotype state (for example, homozygous reference; “0/0”) for all variants. Similarly, a bitmap index is composed of a distinct bit array for each possible genotype state. (b) Example genotypes in VCF format are presented for three individuals (I1, I2, I3) at 10 variant sites (V1-V10). A bitwise AND of the bit arrays corresponding to the heterozygous genotype yields the variant that is heterozygous in all individuals.
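
A bitmap index of this kind can be sketched in a few lines. This is an illustrative example only: the genotypes below are invented (they are not the ones shown in the figure), Python integers stand in for bit arrays, and the function name and the choice of four genotype states are assumptions.

```python
from functools import reduce
from operator import and_

GENOTYPE_STATES = ("0/0", "0/1", "1/1", "./.")

def build_bitmap_index(genotypes):
    """Return {state: int}; the leftmost bit of each int marks the first variant."""
    n = len(genotypes)
    index = {state: 0 for state in GENOTYPE_STATES}
    for k, gt in enumerate(genotypes):
        index[gt] |= 1 << (n - 1 - k)
    return index

# Hypothetical genotypes for three individuals at 10 variant sites (V1..V10).
individuals = {
    "I1": ["0/0", "0/1", "0/1", "1/1", "0/0", "0/1", "0/0", "0/0", "0/1", "1/1"],
    "I2": ["0/0", "0/0", "0/1", "0/0", "0/1", "0/1", "1/1", "0/0", "0/0", "0/0"],
    "I3": ["0/1", "0/0", "1/1", "0/0", "0/0", "0/1", "0/0", "0/0", "0/0", "0/1"],
}
n_variants = 10
indexes = {name: build_bitmap_index(gts) for name, gts in individuals.items()}

# One bitwise AND per individual over the heterozygous bit arrays.
shared_het = reduce(and_, (idx["0/1"] for idx in indexes.values()))

het_in_all = [f"V{k + 1}" for k in range(n_variants)
              if (shared_het >> (n_variants - 1 - k)) & 1]
print(het_in_all)  # ['V6']
```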

Supplementary Figure 3 Efficiency improvements for genotype comparisons when using a bitmap index.

(a) When considering genotypes in ASCII format (e.g., VCF), an algorithm searching for the set of variants that are heterozygous in all individuals must operate on every genotype of every individual separately. (b) In contrast, when genotypes are represented with a bitmap index, where a set of genotypes is encoded into a single CPU word (for brevity, only the bit arrays associated with the heterozygous state are shown), bitwise logical operations can be used to operate on all of the genotypes in the word with a single operation. This example assumes a word size of 8 bits, but modern CPUs support words of up to 64 bits, and the speedup from bitmaps is linear in the word size. For the 24 genotypes given here (3 individuals, 8 genotypes each), the ASCII-based algorithm executes the “if” statement 24 times, whereas the bitwise algorithm executes the logical AND (“&”) only three times, with both algorithms producing equivalent results.
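
The operation counts in this comparison can be reproduced with a short sketch. The genotype rows below are made up for illustration, and the function names and explicit counters are assumptions added to make the 24-versus-3 difference visible; packing the words is treated as a one-time indexing cost and is not counted.

```python
WORD_SIZE = 8  # bits per word in this toy example

def het_in_all_ascii(rows):
    """Test every genotype string of every individual separately."""
    flags = [True] * len(rows[0])
    comparisons = 0
    for row in rows:
        for k, gt in enumerate(row):
            comparisons += 1          # one "if" per genotype
            if gt != "0/1":
                flags[k] = False
    return flags, comparisons

def pack_het_word(row):
    """Pack one individual's heterozygous calls into a single word."""
    word = 0
    for gt in row:
        word = (word << 1) | int(gt == "0/1")
    return word

def het_in_all_bitwise(words):
    """AND the heterozygous words together, one operation per individual."""
    result = (1 << WORD_SIZE) - 1     # start with all bits set
    and_ops = 0
    for w in words:
        result &= w
        and_ops += 1
    return result, and_ops

# Hypothetical genotypes: 3 individuals, 8 variants each.
rows = [
    ["0/1", "0/0", "0/1", "0/1", "1/1", "0/0", "0/1", "0/0"],
    ["0/1", "0/1", "0/0", "0/1", "0/0", "0/0", "0/1", "1/1"],
    ["0/0", "0/1", "0/1", "0/1", "0/0", "1/1", "0/1", "0/1"],
]

flags, comparisons = het_in_all_ascii(rows)
result, and_ops = het_in_all_bitwise([pack_het_word(r) for r in rows])
flags_from_word = [bool((result >> (WORD_SIZE - 1 - k)) & 1)
                   for k in range(WORD_SIZE)]

print(comparisons, and_ops)      # 24 3
print(flags == flags_from_word)  # True
```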

Supplementary Figure 4 A comparison of binary encodings and associated bit-wise logical operations.

(a) Two bit arrays are given in three different binary encodings: the uncompressed bit array (in 7-bit words) on top, followed by the run-length encoding (RLE) and word-aligned hybrid encoding (WAH), which each add an additional bit as part of the encoding to make 8-bit words. RLE maps each set of consecutive bits to a new value that uses one bit for the run value and the remaining bits for the number of bits in the run. WAH maps bits to one of two types of values: those that encode runs and those that encode the raw binary. The first bit in a WAH value (the fill bit) indicates whether the value encodes a run. If that bit is set, then the second bit gives the run value and the remaining bits give the run length in number of words (i.e., not number of bits as in RLE). If the fill bit is not set, then the remaining bits are the uncompressed binary values. (b) The logical OR for the three encodings is given. For the uncompressed binary, the OR follows bit for bit across both values. For RLE, the logical OR is undefined (without inflation) because the two encoded values are no longer aligned and have different lengths. For WAH, the values are aligned based on their run length, then the logical OR is performed (in this case) between the value bit and the associated uncompressed values.
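
The WAH scheme described in this legend, with 7-bit payloads in 8-bit words, can be sketched as follows. This is a toy illustration under stated assumptions, not GQT's encoder: the bit layout (fill flag in the top bit, run value in the next bit, run length in the low 6 bits) follows the description above, while the function names, the use of Python integers as words, and the example bit arrays are all hypothetical.

```python
PAYLOAD = 7                       # raw bits per word; the flag bit makes 8
ALL_ONES = (1 << PAYLOAD) - 1     # 0b1111111
FILL_FLAG = 1 << PAYLOAD          # top bit set: word encodes a run
MAX_FILL = (1 << (PAYLOAD - 1)) - 1   # longest run (in words) per fill word

def _append(out, payload, n_words):
    """Append n_words copies of a 7-bit payload, merging fills when possible."""
    if payload not in (0, ALL_ONES):
        out.extend([payload] * n_words)   # literal word, fill flag unset
        return
    value = 1 if payload else 0
    while n_words:
        last = out[-1] if out else 0
        same_fill = (last & FILL_FLAG) and ((last >> 6) & 1) == value
        if same_fill and (last & MAX_FILL) < MAX_FILL:
            take = min(n_words, MAX_FILL - (last & MAX_FILL))
            out[-1] = last + take         # extend the existing run
        else:
            take = min(n_words, MAX_FILL)
            out.append(FILL_FLAG | (value << 6) | take)
        n_words -= take

def wah_encode(bits):
    """Encode a list of 0/1 values whose length is a multiple of PAYLOAD."""
    assert len(bits) % PAYLOAD == 0
    out = []
    for i in range(0, len(bits), PAYLOAD):
        payload = 0
        for b in bits[i:i + PAYLOAD]:
            payload = (payload << 1) | b
        _append(out, payload, 1)
    return out

def _runs(words):
    """Yield (payload, n_words); a fill word yields its whole run at once."""
    for w in words:
        if w & FILL_FLAG:
            yield (ALL_ONES if (w >> 6) & 1 else 0), w & MAX_FILL
        else:
            yield w, 1

def wah_or(a, b):
    """OR two encodings of equal uncompressed length, aligned by run length."""
    ia, ib = _runs(a), _runs(b)
    pa, na = next(ia, (0, 0))
    pb, nb = next(ib, (0, 0))
    out = []
    while na and nb:
        step = min(na, nb)                # words covered by both current runs
        _append(out, pa | pb, step)
        na, nb = na - step, nb - step
        if na == 0:
            pa, na = next(ia, (0, 0))
        if nb == 0:
            pb, nb = next(ib, (0, 0))
    return out

def wah_decode(words):
    bits = []
    for payload, n in _runs(words):
        group = [(payload >> (PAYLOAD - 1 - k)) & 1 for k in range(PAYLOAD)]
        bits.extend(group * n)
    return bits

x = [0] * 21 + [1, 0, 1, 0, 0, 0, 0] + [1] * 14
y = [0] * 7 + [0, 0, 0, 0, 0, 1, 1] + [0] * 28
ex, ey = wah_encode(x), wah_encode(y)
print([hex(w) for w in ex])   # ['0x83', '0x50', '0xc2']
print(wah_decode(wah_or(ex, ey)) == [p | q for p, q in zip(x, y)])  # True
```

Because run lengths are counted in whole words, two fill words can be combined without expanding them, and a fill meeting a literal reduces to a single word-level OR.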

Supplementary Figure 5 De novo mutation discovery in the CEPH 1463 pedigree.

(a) The CEPH 1463 pedigree. Our analysis focuses on the discovery of de novo mutations in NA12878, the daughter of NA12891 and NA12892. (b) A GQT query for de novo mutations based on the expected genotypes (homozygous for the reference allele) in NA12878’s parents, as well as an expected heterozygous genotype in NA12878. (c) True de novo mutations in NA12878’s germline should be passed on to 50% of her offspring, on average. Allowing for genotyping error and binomial expectation, we filter for more confident de novo mutation candidates by requiring that the apparent mutation allele is passed on to at least 30% of NA12878’s children. (d) A GQT query that further filters candidate mutations by excluding those lying in low-complexity regions of the genome.
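
The genotype logic behind panels (b) and (c) can be written out as a per-variant predicate. This is a plain Python sketch of the filtering criteria described above, not GQT query syntax; the function name, the genotype labels, and the example genotypes are assumptions for illustration.

```python
def is_candidate_de_novo(parent1, parent2, child, grandchildren,
                         min_fraction=0.3):
    """Apply the de novo criteria at a single variant.

    Genotypes are labeled 'HOM_REF', 'HET', 'HOM_ALT' or 'UNKNOWN'.
    `grandchildren` holds the genotypes of the child's own offspring.
    """
    # Both parents homozygous reference, child heterozygous.
    if parent1 != "HOM_REF" or parent2 != "HOM_REF" or child != "HET":
        return False
    if not grandchildren:
        return False
    # Require transmission of the apparent mutation to enough offspring.
    carriers = sum(gt in ("HET", "HOM_ALT") for gt in grandchildren)
    return carriers / len(grandchildren) >= min_fraction

# Hypothetical genotypes at two variants, with 10 offspring each.
transmitted = ["HET"] * 4 + ["HOM_REF"] * 6       # 40% carry the allele
rarely_seen = ["HET"] * 1 + ["HOM_REF"] * 9       # 10% carry the allele

print(is_candidate_de_novo("HOM_REF", "HOM_REF", "HET", transmitted))  # True
print(is_candidate_de_novo("HOM_REF", "HOM_REF", "HET", rarely_seen))  # False
print(is_candidate_de_novo("HET", "HOM_REF", "HET", transmitted))      # False
```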

Supplementary Figure 6 BCF and GQT file composition.

VCF files are composed of a variant data section and a sample genotype section, and each of these includes both core information (e.g., variant position and alleles, and genotype, respectively) and extra metadata. BCF encodes both sections into a binary format and then compresses the data using blocked LZ77 compression. GQT uses three files: a BIM file that contains all variant data compressed with LZ77, a GQT file with WAH-compressed genotypes (without metadata), and a VID file that stores the mapping between the allele-frequency variant order and the original source VCF variant order. PLINK (not shown) omits all metadata, storing uncompressed binary-encoded genotypes in a BED file and variant positions in a BIM file.

Supplementary Figure 7 GQT compression performance for 1,000 Genomes data (phase 3) and simulated genomes.

Compression ratios describe the fold reduction in file size relative to an uncompressed VCF file. GQT represents solely the total size of the GQT genotype index. GQT* reflects the total size of the GQT genotype index plus the size of the BIM and VID indices. BCF* represents the size of the full BCF file including genotype metadata, whereas BCF omits genotype metadata. Lastly, PLINK reflects the size of PLINK v1.9 BED and BIM files. (a) File size reduction for 1,000 Genomes phase 3, which comprised 2,504 individuals and more than 84 million variants. (b) A comparison of BCFTOOLS, GQT, and PLINK for simulated genotypes on a 100-Mb genome with between 100 and 100,000 individuals.

Supplementary Figure 8 Fixation index (FST) analysis between populations.

FST for 1,000 Genomes phase 3 of Europeans versus East Asians and Europeans versus Africans on chromosome 12.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–8, Supplementary Note 1 and Supplementary Table 1 (PDF 1285 kb)

About this article

Cite this article

Layer, R., Kindlon, N., Karczewski, K. et al. Efficient genotype compression and analysis of large genetic-variation data sets. Nat Methods 13, 63–65 (2016) doi:10.1038/nmeth.3654
