Efficient genotype compression and analysis of large genetic-variation data sets

Layer, Ryan M; Kindlon, Neil; Karczewski, Konrad J; Quinlan, Aaron R

doi:10.1038/nmeth.3654

Brief Communication
Published: 09 November 2015

Ryan M Layer¹,
Neil Kindlon¹,
Konrad J Karczewski²,
Exome Aggregation Consortium &
…
Aaron R Quinlan^1,3,4

Nature Methods volume 13, pages 63–65 (2016)

Abstract

Genotype Query Tools (GQT) is an indexing strategy that expedites analyses of genome-variation data sets in Variant Call Format based on sample genotypes, phenotypes and relationships. GQT's compressed genotype index minimizes decompression for analysis, and its performance relative to that of existing methods improves with cohort size. We show substantial (up to 443-fold) gains in performance over existing methods and demonstrate GQT's utility for exploring massive data sets involving thousands to millions of genomes. GQT can be accessed at https://github.com/ryanlayer/gqt.

**Figure 1: Creation and data exploration of an individual-centric genotype index.**

**Figure 2: GQT query performance and applications of the genotype index.**

Acknowledgements

We are grateful to C. Chiang for conceptual discussions, I. Levicki for helpful advice on AVX2 operations, S. McCarthy and P. Danecek for their guidance with htslib, and Z. Kronenberg for advice on population genetics measures. We also thank the Exome Aggregation Consortium and the groups that provided exome variant data for comparison. A full list of contributing groups can be found at http://exac.broadinstitute.org/about. This research was supported by a US National Human Genome Research Institute award to A.R.Q. (NIH R01HG006693).

Author information

Authors and Affiliations

Department of Human Genetics, University of Utah, Salt Lake City, Utah, USA
Ryan M Layer, Neil Kindlon & Aaron R Quinlan
Analytical and Translational Genetics Unit, Harvard Medical School, Boston, Massachusetts, USA
Konrad J Karczewski
Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, USA
Aaron R Quinlan
USTAR Center for Genetic Discovery, University of Utah, Salt Lake City, Utah, USA
Aaron R Quinlan

Consortia

Exome Aggregation Consortium

Contributions

R.M.L. designed and wrote GQT and analyzed the data. N.K. wrote a fast output method. K.J.K. analyzed ExAC data. A.R.Q. conceived and designed the study. R.M.L. and A.R.Q. wrote the manuscript.

Corresponding authors

Correspondence to Ryan M Layer or Aaron R Quinlan.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Additional information

A list of members and affiliations appears in Supplementary Note 1.

Integrated supplementary information

Supplementary Figure 1 Sorting variants by allele frequency improves compression.

(a) A graphical comparison of the genotype distribution of individuals (rows) and variants (columns) before and after sorting. These data represent genotypes from the 1,000 Genomes Project, phase 3, for a portion of chromosome 20. (b) The run-length distribution for unsorted and sorted genotypes. (c) The distribution of the number of runs for sorted and unsorted data. In both cases the second peak is composed predominantly of individuals of African descent (AFR); 604/661 AFR individuals are in the second peak in the sorted case, and 640/661 in the unsorted case.
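The effect described in this caption can be illustrated with a minimal Python sketch. It assumes a toy 0/1 matrix (1 marks a non-reference genotype) and uses the per-variant carrier count as a stand-in for allele frequency; the data and the function names `count_runs` and `sort_variants_by_allele_count` are hypothetical and are not taken from GQT.

```python
def count_runs(bits):
    """Return the number of maximal runs of identical values in a sequence."""
    if not bits:
        return 0
    return 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)


def sort_variants_by_allele_count(matrix):
    """Reorder columns (variants) by ascending non-reference carrier count.

    matrix[i][j] is 1 if individual i carries a non-reference genotype at
    variant j. Returns the reordered matrix and the column order used.
    """
    n_variants = len(matrix[0])
    counts = [sum(row[j] for row in matrix) for j in range(n_variants)]
    order = sorted(range(n_variants), key=lambda j: counts[j])
    return [[row[j] for j in order] for row in matrix], order


# Hypothetical toy cohort: 4 individuals (rows) x 8 variants (columns).
genotypes = [
    [1, 0, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 1, 0, 1, 0],
]

sorted_genotypes, order = sort_variants_by_allele_count(genotypes)
runs_before = sum(count_runs(row) for row in genotypes)
runs_after = sum(count_runs(row) for row in sorted_genotypes)
print(f"runs before sorting: {runs_before}, after sorting: {runs_after}")
```

Fewer, longer runs per individual are what a run-length-based encoding can exploit, which is why the column order matters for compression.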

Supplementary Figure 2 Bitmaps enable rapid genotype comparisons en masse.

(a) A bit array marks the existence of one genotype state (for example, homozygous reference; “0/0”) for all variants. Similarly, a bitmap index is composed of a distinct bit array for each possible genotype state. (b) Example genotypes in VCF format are presented for three individuals (I1, I2, I3) at 10 variant sites (V1-V10). A bitwise AND of the bit arrays corresponding to the heterozygous genotype yields the variant that is heterozygous in all individuals.
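A bitmap index of this kind can be sketched in a few lines of Python. The genotypes below are hypothetical (they are not the values shown in the figure), Python integers stand in for bit arrays, and the names `build_bitmap_index` and `set_positions` are illustrative only.

```python
STATES = ("0/0", "0/1", "1/1", "./.")


def build_bitmap_index(genotypes):
    """Return {state: int}; bit k (counted from the left) is set when
    variant k has that genotype state."""
    n = len(genotypes)
    index = {state: 0 for state in STATES}
    for k, gt in enumerate(genotypes):
        index[gt] |= 1 << (n - 1 - k)
    return index


def set_positions(bits, n):
    """Return the 0-based variant positions whose bit is set."""
    return [k for k in range(n) if (bits >> (n - 1 - k)) & 1]


# Hypothetical genotypes for three individuals at ten variant sites (V1..V10).
individuals = {
    "I1": ["0/1", "0/0", "0/1", "1/1", "0/1", "0/0", "0/0", "0/1", "1/1", "0/0"],
    "I2": ["0/0", "0/1", "0/1", "0/1", "0/1", "0/0", "1/1", "0/0", "0/1", "0/0"],
    "I3": ["0/1", "0/1", "0/1", "0/0", "1/1", "0/0", "0/1", "0/1", "0/1", "./."],
}
n_variants = 10
indexes = {name: build_bitmap_index(gts) for name, gts in individuals.items()}

# AND the heterozygous bit arrays of all individuals.
het_in_all = (1 << n_variants) - 1
for index in indexes.values():
    het_in_all &= index["0/1"]

shared = [f"V{k + 1}" for k in set_positions(het_in_all, n_variants)]
print(shared)
```

With these toy genotypes, only V3 is heterozygous in all three individuals.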

Supplementary Figure 3 Efficiency improvements for genotype comparisons when using a bitmap index.

(a) When considering genotypes in ASCII format (e.g., VCF), an algorithm searching for the set of variants that are heterozygous in all individuals must operate on every genotype of every individual separately. (b) In contrast, when genotypes are represented with a bitmap index, where a set of genotypes is encoded into a single CPU word (for brevity, only the bit arrays associated with the heterozygous state are shown), bitwise logical operations can be used to operate on all of the genotypes in the word with a single operation. This example assumes a word size of 8, but modern CPUs support words of up to 64 bits, and the speedup of bitmaps is linear in the word size. For the 24 genotypes given here (3 individuals, 8 genotypes each), the ASCII-based algorithm executes the “if” statement 24 times, while the bitwise algorithm executes the logical AND (“&”) only three times, with both algorithms producing equivalent results.
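The operation counts in this caption can be reproduced with a short Python sketch. It assumes an 8-bit word, three hypothetical individuals with eight genotypes each, and an all-ones accumulator that is ANDed with each individual's heterozygous word; the function names are illustrative and do not come from GQT.

```python
WORD_SIZE = 8


def het_in_all_ascii(cohort):
    """Test every genotype string separately; count the comparisons made."""
    shared = [True] * len(cohort[0])
    comparisons = 0
    for genotypes in cohort:
        for k, gt in enumerate(genotypes):
            comparisons += 1
            if gt != "0/1":
                shared[k] = False
    return shared, comparisons


def pack_het_word(genotypes):
    """Pack the heterozygous state of up to WORD_SIZE genotypes into one int."""
    word = 0
    for gt in genotypes:
        word = (word << 1) | int(gt == "0/1")
    return word


def het_in_all_bitwise(words):
    """AND one word per individual; count the logical operations made."""
    acc = (1 << WORD_SIZE) - 1
    operations = 0
    for word in words:
        acc &= word
        operations += 1
    return acc, operations


# Hypothetical cohort: 3 individuals, 8 genotypes each.
cohort = [
    ["0/1", "0/0", "0/1", "0/1", "1/1", "0/1", "0/0", "0/1"],
    ["0/1", "0/1", "0/0", "0/1", "0/1", "0/1", "0/0", "0/0"],
    ["0/0", "0/1", "0/1", "0/1", "0/1", "0/1", "1/1", "0/1"],
]

shared, comparisons = het_in_all_ascii(cohort)
acc, operations = het_in_all_bitwise([pack_het_word(g) for g in cohort])
unpacked = [bool((acc >> (WORD_SIZE - 1 - k)) & 1) for k in range(WORD_SIZE)]
print(comparisons, operations, unpacked == shared)
```

Packing the words is done once at indexing time, so only the AND operations are paid per query.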

Supplementary Figure 4 A comparison of binary encodings and associated bit-wise logical operations.

(a) Two bit arrays are given in three different binary encodings: the uncompressed bit array (in 7-bit words) on top, followed by the run-length encoding (RLE) and word-aligned hybrid encoding (WAH), which each add an additional bit as part of the encoding to make 8-bit words. RLE maps each set of consecutive bits to a new value that uses one bit for the run value and the remaining bits for the number of bits in the run. WAH maps bits to one of two types of values: those that include runs and those that encode the raw binary. The first bit in a WAH value indicates whether the value encodes a run (the fill bit). If that bit is set, then the second bit gives the run value and the remaining bits give the run length in number of words (i.e., not number of bits as in RLE). If the fill bit is not set, then the remaining bits are the uncompressed binary values. (b) The logical OR for the three encodings is given. For the uncompressed binary, the OR follows bit for bit across both values. For RLE, the logical OR is undefined (without inflation) because the two encoded values are no longer aligned and have different lengths. For WAH, the values are aligned based on their run length, then the logical OR is performed (in this case) between the value bit and the associated uncompressed values.
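The 8-bit WAH layout described in this caption can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not GQT's implementation: inputs must be a multiple of 7 bits long, a uniform 7-bit group is always stored as a fill word (even when the run is one word long), and the names `wah_encode`, `wah_decode` and `wah_or` are hypothetical.

```python
GROUP = 7                 # payload bits carried by one 8-bit word
MAX_RUN = (1 << 6) - 1    # a fill word has 6 bits for the run length


def _append(words, payload, n_groups):
    """Append n_groups copies of a 7-bit payload, merging uniform groups
    (all zeros or all ones) into fill words."""
    if payload not in (0x00, 0x7F):
        words.extend([payload] * n_groups)   # literal words: fill bit is 0
        return
    value = 1 if payload else 0
    while n_groups:
        last = words[-1] if words else 0
        if last >> 7 and ((last >> 6) & 1) == value and (last & MAX_RUN) < MAX_RUN:
            take = min(MAX_RUN - (last & MAX_RUN), n_groups)
            words[-1] = last + take          # extend the previous fill word
        else:
            take = min(MAX_RUN, n_groups)
            words.append(0x80 | (value << 6) | take)
        n_groups -= take


def _runs(words):
    """Yield (payload, n_groups) for each word of a WAH stream."""
    for w in words:
        if w >> 7:                           # fill word
            yield (0x7F if (w >> 6) & 1 else 0x00), w & MAX_RUN
        else:                                # literal word
            yield w, 1


def wah_encode(bits):
    assert len(bits) % GROUP == 0
    words = []
    for i in range(0, len(bits), GROUP):
        _append(words, int("".join(map(str, bits[i:i + GROUP])), 2), 1)
    return words


def wah_decode(words):
    bits = []
    for payload, n in _runs(words):
        bits.extend([(payload >> (GROUP - 1 - k)) & 1 for k in range(GROUP)] * n)
    return bits


def wah_or(a, b):
    """OR two WAH streams of equal decoded length, aligning on run lengths."""
    out = []
    runs_a, runs_b = _runs(a), _runs(b)
    pa, na = next(runs_a, (0, 0))
    pb, nb = next(runs_b, (0, 0))
    while na and nb:
        step = min(na, nb)                   # words both runs have in common
        _append(out, pa | pb, step)
        na -= step
        nb -= step
        if na == 0:
            pa, na = next(runs_a, (0, 0))
        if nb == 0:
            pb, nb = next(runs_b, (0, 0))
    return out


# Hypothetical 28-bit arrays (four 7-bit groups each).
x = [0] * 21 + [1, 0, 1, 0, 0, 0, 1]
y = [0] * 7 + [0, 1, 1, 0, 0, 0, 0] + [1] * 14
x_wah, y_wah = wah_encode(x), wah_encode(y)
or_wah = wah_or(x_wah, y_wah)
print([f"{w:08b}" for w in or_wah])
```

Because run lengths are counted in whole words, the two streams stay word-aligned and `wah_or` can consume a fill in one step instead of expanding it bit by bit.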

Supplementary Figure 5 De novo mutation discovery in the CEPH 1463 pedigree.

(a) The CEPH 1463 pedigree. Our analysis is focused on the discovery of de novo mutations in NA12878, the daughter of NA12891 and NA12892. (b) A GQT query for de novo mutations based on the expected genotypes (homozygous for the reference allele) in NA12878’s parents, as well as an expected heterozygous genotype in NA12878. (c) True de novo mutations in NA12878’s germline should be passed on to 50% of her offspring, on average. Allowing for genotyping error and binomial expectation, we filter for more confident de novo mutation candidates by requiring that the apparent mutation allele is passed on to at least 30% of NA12878’s children. (d) A GQT query that further filters candidate mutations by excluding those lying in low-complexity regions of the genome.
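The genotype logic of panels (b) and (c) can be written out as a small Python predicate. This is a sketch of the filter only, not GQT's query syntax; the child labels, the number of children, and the example genotypes are hypothetical, and the low-complexity filter of panel (d) is not covered.

```python
def is_de_novo_candidate(genotypes, child, parents, offspring,
                         min_transmission=0.30):
    """Apply the de novo filter at one variant.

    genotypes maps a sample name to its genotype string at this variant.
    """
    # Both parents must be homozygous for the reference allele.
    if any(genotypes[p] != "0/0" for p in parents):
        return False
    # The child must be heterozygous.
    if genotypes[child] != "0/1":
        return False
    # The apparent mutation must reach enough of the child's own offspring.
    carriers = sum(genotypes[o] in ("0/1", "1/1") for o in offspring)
    return carriers / len(offspring) >= min_transmission


parents = ["NA12891", "NA12892"]
child = "NA12878"
offspring = [f"child_{i}" for i in range(1, 11)]   # hypothetical labels


def make_site(father, mother, proband, n_carriers):
    """Build a hypothetical genotype table for one variant."""
    site = {"NA12891": father, "NA12892": mother, "NA12878": proband}
    for i, name in enumerate(offspring):
        site[name] = "0/1" if i < n_carriers else "0/0"
    return site


sites = {
    "var_a": make_site("0/0", "0/0", "0/1", 5),   # transmitted to 50%
    "var_b": make_site("0/0", "0/0", "0/1", 1),   # transmitted to 10%
    "var_c": make_site("0/1", "0/0", "0/1", 5),   # inherited from the father
}
candidates = [name for name, site in sites.items()
              if is_de_novo_candidate(site, child, parents, offspring)]
print(candidates)
```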

Supplementary Figure 6 BCF and GQT file composition.

VCF files are composed of a variant data section and a sample genotype section, and each of these includes both core information (e.g., variant position and alleles, and genotype, respectively) and extra metadata. BCF encodes both sections into a binary format and then compresses the data using blocked LZ77 compression. GQT uses three files: a BIM file that contains all variant data compressed with LZ77, a GQT file with WAH-compressed genotypes (without metadata), and a VID file that stores the mapping between the allele-frequency variant order and the original source VCF variant order. PLINK (not shown) omits all metadata, storing uncompressed binary-encoded genotypes in a BED file and variant positions in a BIM file.
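The role of the VID mapping can be illustrated with a minimal Python sketch. It assumes the mapping is held as a plain list in memory; the on-disk layout of the real VID file is not described here, and the names `build_vid` and `to_source_order` as well as the allele counts are hypothetical.

```python
def build_vid(allele_counts):
    """Return vid, where vid[k] is the source VCF row of the variant stored
    at position k of the allele-frequency-sorted index."""
    return sorted(range(len(allele_counts)), key=lambda j: allele_counts[j])


def to_source_order(hits_sorted, vid):
    """Translate hit positions in the sorted index back to source VCF rows."""
    return sorted(vid[k] for k in hits_sorted)


# Hypothetical alternate-allele counts for four variants, in VCF order.
allele_counts = [5, 0, 3, 1]
vid = build_vid(allele_counts)

# A query on the sorted index reports hits at sorted positions 0 and 2.
source_rows = to_source_order([0, 2], vid)
print(vid, source_rows)
```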

Supplementary Figure 7 GQT compression performance for 1,000 Genomes data (phase 3) and simulated genomes.

Compression ratios describe the fold reduction in file size relative to an uncompressed VCF file. GQT represents solely the size of the GQT genotype index. GQT* reflects the total size of the GQT genotype index plus the size of the BIM and VID indices. BCF* represents the size of the full BCF file including genotype metadata, whereas BCF omits genotype metadata. Lastly, PLINK reflects the size of PLINK v1.9 BED and BIM files. (a) File size reduction for 1,000 Genomes phase 3, which comprised 2,504 individuals and more than 84 million variants. (b) A comparison of BCFTOOLS, GQT, and PLINK for simulated genotypes on a 100-Mb genome with between 100 and 100,000 individuals.

Supplementary Figure 8 Fixation index (F_ST) analysis between populations.

F_ST for 1,000 Genomes phase 3 of Europeans versus East Asians and Europeans versus Africans on chromosome 12.
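For readers unfamiliar with the statistic, the following Python sketch computes a simple heterozygosity-based F_ST at a single biallelic site for two equally weighted populations. It is not necessarily the estimator used by the authors; the function name `fst_two_populations` and the example frequencies are hypothetical.

```python
def fst_two_populations(p1, p2):
    """Heterozygosity-based F_ST, (H_T - H_S) / H_T, at one biallelic site.

    p1 and p2 are the alternate-allele frequencies in the two populations,
    which are given equal weight. Returns 0.0 when the site is monomorphic
    in the pooled sample (an assumption made here to avoid dividing by zero).
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        return 0.0
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


# Hypothetical allele frequencies.
print(fst_two_populations(0.2, 0.8))   # strongly differentiated site
print(fst_two_populations(0.5, 0.5))   # identical frequencies
print(fst_two_populations(0.0, 1.0))   # fixed difference
```

The per-population allele frequencies needed as input are the kind of quantity that can be obtained by counting set bits in the genotype bit arrays of each population's samples.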

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–8, Supplementary Note 1 and Supplementary Table 1 (PDF 1285 kb)

About this article

Cite this article

Layer, R., Kindlon, N., Karczewski, K. et al. Efficient genotype compression and analysis of large genetic-variation data sets. Nat Methods 13, 63–65 (2016). https://doi.org/10.1038/nmeth.3654

Received: 05 June 2015
Accepted: 07 October 2015
Published: 09 November 2015
Issue Date: January 2016
DOI: https://doi.org/10.1038/nmeth.3654
