Sketching algorithms for genomic data analysis and querying in a secure enclave

Kockan, Can; Zhu, Kaiyuan; Dokmai, Natnatee; Karpov, Nikolai; Kulekci, M. Oguzhan; Woodruff, David P.; Sahinalp, S. Cenk

doi:10.1038/s41592-020-0761-8

Article
Published: 04 March 2020

Sketching algorithms for genomic data analysis and querying in a secure enclave

Can Kockan ORCID: orcid.org/0000-0002-1011-3071^1,2^na1,
Kaiyuan Zhu ORCID: orcid.org/0000-0001-9976-6505^1,2^na1,
Natnatee Dokmai¹^na1,
Nikolai Karpov¹,
M. Oguzhan Kulekci³,
David P. Woodruff⁴ &
…
S. Cenk Sahinalp²

Nature Methods volume 17, pages 295–301 (2020)Cite this article

3477 Accesses
31 Citations
15 Altmetric
Metrics details

Subjects

Abstract

Genome-wide association studies (GWAS), especially on rare diseases, may necessitate exchange of sensitive genomic data between multiple institutions. Since genomic data sharing is often infeasible due to privacy concerns, cryptographic methods, such as secure multiparty computation (SMC) protocols, have been developed with the aim of offering privacy-preserving collaborative GWAS. Unfortunately, the computational overhead of these methods remain prohibitive for human-genome-scale data. Here we introduce SkSES (https://github.com/ndokmai/sgx-genome-variants-search), a hardware–software hybrid approach for privacy-preserving collaborative GWAS, which improves the running time of the most advanced cryptographic protocols by two orders of magnitude. The SkSES approach is based on trusted execution environments (TEEs) offered by current-generation microprocessors—in particular, Intel’s SGX. To overcome the severe memory limitation of the TEEs, SkSES employs novel ‘sketching’ algorithms that maintain essential statistical information on genomic variants in input VCF files. By additionally incorporating efficient data compression and population stratification reduction methods, SkSES identifies the top k genomic variants in a cohort quickly, accurately and in a privacy-preserving manner.

Access through your institution

Buy or subscribe

This is a preview of subscription content, access via your institution

Access options

Access through your institution

Buy this article

Purchase on Springer Link
Instant access to full article PDF

Buy now

Prices may be subject to local taxes which are calculated during checkout

**Fig. 1: Overview of the SkSES pipeline.**

**Fig. 2: The impact of increasing cohort size on the runtime of SkSES, with a fixed sketch size.**

**Fig. 3: The fraction of true top k significant SNPs (according to χ² statistic) included in the query of top l SNPs returned by SkSES (accuracy) as a function of k.**

Fig. 4: The impact of normalizing genotype matrix A and multiplying the sketching matrix (\(\hat{{\it{X}}}={\it{XT}}\)) for reducing the space needed for PCA on the left singular vectors \({\hat{{\rm{U}}}}_{L}\) as well as the ranks of Cochran–Armitage trend χ² statistics across all unique SNPs in the iDASH2017-chr1 dataset.

Secure secondary utilization system of genomic data using quantum secure cloud

Article Open access 02 November 2022

Citizen-centered, auditable and privacy-preserving population genomics

Article 25 March 2021

Truly privacy-preserving federated analytics for precision medicine with multiparty homomorphic encryption

Article Open access 11 October 2021

Data availability

The VCF files from the original benchmarking dataset for the iDASH-2017 competition (iDASH2017-chr1) can be found at http://www.humangenomeprivacy.org/2017/competition-tasks.html. The VCF files consisting of whole-genome data (iDASH2017-wg) and the synthetic VCF files are available from the corresponding author upon request. The AMD dataset (dbGaP phs001039.v1.p1) from ref. ⁴⁴ can be obtained via dbGaP authorized access.

Code availability

The source code for SkSES, under MIT license, is available for download at GitHub: https://github.com/ndokmai/sgx-genome-variants-search.

References

Numanagić, I. et al. Comparison of high-throughput sequencing data compression tools. Nat. Methods 13, 1005 (2016).
Article Google Scholar
Alberti, C. et al. An introduction to MPEG-G, the new ISO standard for genomic information representation. Preprint at bioRxiv https://doi.org/10.1101/426353 (2018).
Davies, R. GA4GH File Encryption Standard https://github.com/samtools/hts-specs/blob/master/crypt4gh.pdf (2017).
Kelleher, J. et al. htsget: a protocol for securely streaming genomic data. Bioinformatics 35, 119–121 (2018).
Article Google Scholar
Hach, F., Numanagic, I. & Sahinalp, S. C. DeeZ: reference-based compression by local assembly. Nat. Methods 11, 1082 (2014).
Article CAS Google Scholar
Anonymous. CRAM format specification (version 3.0) https://samtools.github.io/hts-specs/CRAMv3.pdf (2017).
Grabowski, S., Deorowicz, S. & Roguski, Ł. Disk-based compression of data from genome sequencing. Bioinformatics 31, 1389–1395 (2014).
Article Google Scholar
Hach, F., Numanagić, I., Alkan, C. & Sahinalp, S. C. SCALCE: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics 28, 3051–3057 (2012).
Article CAS Google Scholar
Ginart, A. A. et al. Optimal compressed representation of high throughput sequence data via light assembly. Nat. Commun. 9, 566 (2018).
Article Google Scholar
Chandak, S., Tatwawadi, K. & Weissman, T. Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis. Bioinformatics 34, 558–567 (2017).
Article Google Scholar
Roberts, A. & Pachter, L. Streaming fragment assignment for real-time analysis of sequencing experiments. Nat. Methods 10, 71 (2013).
Article CAS Google Scholar
Patro, R., Mount, S. M. & Kingsford, C. Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat. Biotechnol. 32, 462 (2014).
Article CAS Google Scholar
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417 (2017).
Article CAS Google Scholar
Flajolet, P. & Martin, G. N. Probabilistic counting. In 24th Annual Symposium on Foundations of Computer Science (ed. Snyder, L.) 76–82 (IEEE, 1983).
Karp, R. M. On-line algorithms versus off-line algorithms: how much is it worth to know the future? IFIP Congress 1, 416–429 (1992).
Zhang, Q., Pell, J., Canino-Koning, R., Howe, A. C. & Brown, C. T. These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure. PloS ONE 9, e101271 (2014).
Article Google Scholar
Alon, N., Matias, Y. & Szegedy, M. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58, 137–147 (1999).
Article Google Scholar
Charikar, M., Chen, K. & Farach-Colton, M. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming 693–703 (Springer, 2002).
Cormode, G. & Muthukrishnan, S. An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55, 58–75 (2005).
Article Google Scholar
McGuire, A. L. et al. Confidentiality, privacy, and security of genetic and genomic test information in electronic health records: points to consider. Genet. Med. 10, 495 (2008).
Article Google Scholar
Bloss, C. S. Does family always matter? Public genomes and their effect on relatives. Genome Med. 5, 107 (2013).
Article Google Scholar
Shringarpure, S. S. & Bustamante, C. D. Privacy risks from genomic data-sharing beacons. Am. J. Hum. Genet. 97, 631–646 (2015).
Article CAS Google Scholar
Ayday, E., Raisaro, J. L., Hengartner, U., Molyneaux, A. & Hubaux, J.-P. Privacy-preserving processing of raw genomic data. In Data Privacy Management and Autonomous Spontaneous Security (eds García-Alfaro, J. et al.) 133–147 (Springer, 2014).
He, D. et al. Identifying genetic relatives without compromising privacy. Genome Res. 24, 664–72 (2014).
Article CAS Google Scholar
Kamm, L., Bogdanov, D., Laur, S. & Vilo, J. A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics 29, 886–893 (2013).
Article CAS Google Scholar
McLaren, P. J. et al. Privacy-preserving genomic testing in the clinic: a model using HIV treatment. Genet. Med. 18, 814 (2016).
Article CAS Google Scholar
Shimizu, K., Nuida, K. & Rätsch, G. Efficient privacy-preserving string search and an application in genomics. Bioinformatics 32, 1652–1661 (2016).
Article CAS Google Scholar
Xie, W. et al. Securema: protecting participant privacy in genetic association meta-analysis. Bioinformatics 30, 3334–3341 (2014).
Article CAS Google Scholar
Zhao, Y., Wang, X., Jiang, X., Ohno-Machado, L. & Tang, H. Choosing blindly but wisely: differentially private solicitation of DNA datasets for disease marker discovery. J. Am. Med. Inform. Assoc. 22, 100–108 (2014).
Article Google Scholar
Shahbazi, A., Bayatbabolghani, F. & Blanton, M. Private computation with genomic data for genome-wide association and linkage studies. In Proc. 3rd International Workshop Genome Privacy Security (2016); https://www.acsu.buffalo.edu/~mblanton/publications/genopri16.pdf.
Chen, F. et al. Premix: privacy-preserving estimation of individual admixture. In AMIA Annual Symposium Proceedings Vol. 2016, 1747–1755 (American Medical Informatics Association, 2016).
Lauter, K., López-Alt, A. & Naehrig, M. Private computation on encrypted genomic data. In International Conference on Cryptology and Information Security in Latin America (eds Aranha, D. F. & Menezes, A.) 3–27 (Springer, 2014).
Wang, S. et al. Healer: homomorphic computation of exact logistic regression for secure rare disease variants analysis in GWAS. Bioinformatics 32, 211–218 (2015).
PubMed PubMed Central Google Scholar
Zhang, Y., Blanton, M. & Almashaqbeh, G. Secure distributed genome analysis for GWAS & sequence comparison computation. BMC Med. Inform. Decis. Mak. 15, S4 (2015).
Article Google Scholar
Halevi, S. & Shoup, V. Algorithms in HElib. In International Cryptology Conference (Garay, J. A. & Gennaro, R.) 554–571 (Springer, 2014).
Yao, A. C. Protocols for secure computations. In 23rd Annual Symposium on Foundations of Computer Science (ed. Pippenger, N.) 160–164 (IEEE, 1982).
Wang, X., Chan, H. & Shi, E. Circuit ORAM: on tightness of the Goldreich–Ostrovsky lower bound. In Proc. of the 22nd ACM SIGSAC Conference on Computer and Communications Security (eds Ray, I., Li, N. & Kruegel, C.) 850–861 (ACM, 2015).
Anati, I., Gueron, S., Johnson, S. P. & Scarlata, V. R. Innovative technology for CPU based attestation and sealing. https://software.intel.com/en-us/articles/innovative-technology-for-cpu-based-attestation-and-sealing (2013).
Lewis, C. M. Genetic association studies: design, analysis and interpretation. Brief. Bioinformatics 3, 146–153 (2002).
Article CAS Google Scholar
Devlin, B. & Roeder, K. Genomic control for association studies. Biometrics 55, 997–1004 (1999).
Article CAS Google Scholar
Yang, J., Zaitlen, N. A., Goddard, M. E., Visscher, P. M. & Price, A. L. Advantages and pitfalls in the application of mixed-model association methods. Nat. Genet. 46, 100 (2014).
Article Google Scholar
Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
Article CAS Google Scholar
Wang, X. et al. IDASH secure genome analysis competition 2017. BMC Med. Genomics 11, 85 (2018).
Article Google Scholar
Cho, H., Wu, D. J. & Berger, B. Secure genome-wide association analysis using multiparty computation. Nat. Biotechnol. 36, 547 (2018).
Article CAS Google Scholar
Celis, P. Robin Hood Hashing. PhD thesis, Univ. Waterloo (1986).
Deng, F. & Rafiei, D. New estimation algorithms for streaming data: count-min can do more. http://webdocs.cs.ualberta.ca/~drafiei/papers/cmm.pdf (2007).
Armitage, P. Tests for linear trends in proportions and frequencies. Biometrics 11, 375–386 (1955).
Article Google Scholar
Boutsidis, C., Woodruff, D. P. & Zhong, P. Optimal principal component analysis in distributed and streaming models. In Proceedings of the 48th Annual ACM Symposium on Theory of Computing (eds Wichs, D. & Mansour, Y.) 236–249 (ACM, 2016).
Cohen, M. B., Elder, S., Musco, C., Musco, C. & Persu, M. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the 47th Annual ACM Symposium on Theory of Computing (eds Servedio, R. A. & Rubinfeld, R.) 163–172 (ACM, 2015).
Patterson, N., Price, A. L. & Reich, D. Population structure and eigenanalysis. PLoS Genet. 2, e190 (2006).
Article Google Scholar

Download references

Acknowledgements

N.K. was partially supported by NSF CCF-1844234, NSF CCF-1525024 and IIS-1633215. M.O.K. was partially supported by TUBITAK grant 114E293. Part of the work was done while D.P.W. was visiting the Simons Institute for the Theory of Computing. S.C.S. was supported in part by NSF CCF-1619081, NIH GM108348, NIH HG010798 and the Indiana University Grand Challenges Program, Precision Health Initiative, before moving to his current post at NCI.

We thank S. Simmons and H. Cho from the Computer Science and Artificial Intelligence Laboratory at MIT (now at the Broad Institute) for useful discussions and their help in benchmarking SkSES on the AMD dataset⁴⁴ against the SMC tool. We also thank L. Wang and D. Bu at Indiana University for providing the iDASH2017-wg dataset. We finally thank the Linux team and B. Shei from University Information Technology Services at Indiana University for useful instructions on software installation and preparation.

Author information

These authors contributed equally: Can Kockan, Kaiyuan Zhu, Natnatee Dokmai.

Authors and Affiliations

Department of Computer Science, Indiana University, Bloomington, IN, USA
Can Kockan, Kaiyuan Zhu, Natnatee Dokmai & Nikolai Karpov
Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
Can Kockan, Kaiyuan Zhu & S. Cenk Sahinalp
Informatics Institute, Istanbul Technical University, Istanbul, Turkey
M. Oguzhan Kulekci
Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
David P. Woodruff

Authors

Can Kockan
View author publications
You can also search for this author in PubMed Google Scholar
Kaiyuan Zhu
View author publications
You can also search for this author in PubMed Google Scholar
Natnatee Dokmai
View author publications
You can also search for this author in PubMed Google Scholar
Nikolai Karpov
View author publications
You can also search for this author in PubMed Google Scholar
M. Oguzhan Kulekci
View author publications
You can also search for this author in PubMed Google Scholar
David P. Woodruff
View author publications
You can also search for this author in PubMed Google Scholar
S. Cenk Sahinalp
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

C.K., N.D., M.O.K. and S.C.S. initially participated in the iDASH-2017 competition. K.Z., N.K. and S.C.S. formulated the problem with limited memory. K.Z., S.C.S. and D.P.W. further formulated the problem to correct for population stratification. C.K., K.Z. and N.D. implemented the proposed solution. C.K., K.Z., N.D., N.K. and S.C.S. co-wrote the manuscript. M.O.K., D.P.W. and S.C.S. supervised the study.

Corresponding author

Correspondence to S. Cenk Sahinalp.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Lei Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Fig. 1 and Supplementary Notes 1–5

Reporting Summary

Rights and permissions

Reprints and permissions

About this article

Cite this article

Kockan, C., Zhu, K., Dokmai, N. et al. Sketching algorithms for genomic data analysis and querying in a secure enclave. Nat Methods 17, 295–301 (2020). https://doi.org/10.1038/s41592-020-0761-8

Download citation

Received: 17 April 2019
Accepted: 22 January 2020
Published: 04 March 2020
Issue Date: March 2020
DOI: https://doi.org/10.1038/s41592-020-0761-8

This article is cited by

Functional genomics data: privacy risk assessment and technological mitigation
- Gamze Gürsoy
- Tianxiao Li
- Mark B. Gerstein
Nature Reviews Genetics (2022)
Sociotechnical safeguards for genomic data privacy
- Zhiyu Wan
- James W. Hazel
- Bradley A. Malin
Nature Reviews Genetics (2022)
Treating medical data as a durable asset
- Amalio Telenti
- Xiaoqian Jiang
Nature Genetics (2020)