Sketching algorithms for genomic data analysis and querying in a secure enclave

Abstract

Genome-wide association studies (GWAS), especially on rare diseases, may necessitate exchange of sensitive genomic data between multiple institutions. Since genomic data sharing is often infeasible due to privacy concerns, cryptographic methods, such as secure multiparty computation (SMC) protocols, have been developed with the aim of offering privacy-preserving collaborative GWAS. Unfortunately, the computational overhead of these methods remain prohibitive for human-genome-scale data. Here we introduce SkSES (https://github.com/ndokmai/sgx-genome-variants-search), a hardware–software hybrid approach for privacy-preserving collaborative GWAS, which improves the running time of the most advanced cryptographic protocols by two orders of magnitude. The SkSES approach is based on trusted execution environments (TEEs) offered by current-generation microprocessors—in particular, Intel’s SGX. To overcome the severe memory limitation of the TEEs, SkSES employs novel ‘sketching’ algorithms that maintain essential statistical information on genomic variants in input VCF files. By additionally incorporating efficient data compression and population stratification reduction methods, SkSES identifies the top k genomic variants in a cohort quickly, accurately and in a privacy-preserving manner.

Access options

Rent or Buy article

Get time limited or full article access on ReadCube.

from$8.99

All prices are NET prices.

Fig. 1: Overview of the SkSES pipeline.
Fig. 2: The impact of increasing cohort size on the runtime of SkSES, with a fixed sketch size.
Fig. 3: The fraction of true top k significant SNPs (according to χ2 statistic) included in the query of top l SNPs returned by SkSES (accuracy) as a function of k.
Fig. 4: The impact of normalizing genotype matrix A and multiplying the sketching matrix (\(\hat{{\it{X}}}={\it{XT}}\)) for reducing the space needed for PCA on the left singular vectors \({\hat{{\rm{U}}}}_{L}\) as well as the ranks of Cochran–Armitage trend χ2 statistics across all unique SNPs in the iDASH2017-chr1 dataset.

Data availability

The VCF files from the original benchmarking dataset for the iDASH-2017 competition (iDASH2017-chr1) can be found at http://www.humangenomeprivacy.org/2017/competition-tasks.html. The VCF files consisting of whole-genome data (iDASH2017-wg) and the synthetic VCF files are available from the corresponding author upon request. The AMD dataset (dbGaP phs001039.v1.p1) from ref. 44 can be obtained via dbGaP authorized access.

Code availability

The source code for SkSES, under MIT license, is available for download at GitHub: https://github.com/ndokmai/sgx-genome-variants-search.

References

  1. 1.

    Numanagić, I. et al. Comparison of high-throughput sequencing data compression tools. Nat. Methods 13, 1005 (2016).

  2. 2.

    Alberti, C. et al. An introduction to MPEG-G, the new ISO standard for genomic information representation. Preprint at bioRxiv https://doi.org/10.1101/426353 (2018).

  3. 3.

    Davies, R. GA4GH File Encryption Standard https://github.com/samtools/hts-specs/blob/master/crypt4gh.pdf (2017).

  4. 4.

    Kelleher, J. et al. htsget: a protocol for securely streaming genomic data. Bioinformatics 35, 119–121 (2018).

  5. 5.

    Hach, F., Numanagic, I. & Sahinalp, S. C. DeeZ: reference-based compression by local assembly. Nat. Methods 11, 1082 (2014).

  6. 6.

    Anonymous. CRAM format specification (version 3.0) https://samtools.github.io/hts-specs/CRAMv3.pdf (2017).

  7. 7.

    Grabowski, S., Deorowicz, S. & Roguski, Ł. Disk-based compression of data from genome sequencing. Bioinformatics 31, 1389–1395 (2014).

  8. 8.

    Hach, F., Numanagić, I., Alkan, C. & Sahinalp, S. C. SCALCE: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics 28, 3051–3057 (2012).

  9. 9.

    Ginart, A. A. et al. Optimal compressed representation of high throughput sequence data via light assembly. Nat. Commun. 9, 566 (2018).

  10. 10.

    Chandak, S., Tatwawadi, K. & Weissman, T. Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis. Bioinformatics 34, 558–567 (2017).

  11. 11.

    Roberts, A. & Pachter, L. Streaming fragment assignment for real-time analysis of sequencing experiments. Nat. Methods 10, 71 (2013).

  12. 12.

    Patro, R., Mount, S. M. & Kingsford, C. Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat. Biotechnol. 32, 462 (2014).

  13. 13.

    Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Methods 14, 417 (2017).

  14. 14.

    Flajolet, P. & Martin, G. N. Probabilistic counting. In 24th Annual Symposium on Foundations of Computer Science (ed. Snyder, L.) 76–82 (IEEE, 1983).

  15. 15.

    Karp, R. M. On-line algorithms versus off-line algorithms: how much is it worth to know the future? IFIP Congress 1, 416–429 (1992).

  16. 16.

    Zhang, Q., Pell, J., Canino-Koning, R., Howe, A. C. & Brown, C. T. These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure. PloS ONE 9, e101271 (2014).

  17. 17.

    Alon, N., Matias, Y. & Szegedy, M. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58, 137–147 (1999).

  18. 18.

    Charikar, M., Chen, K. & Farach-Colton, M. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming 693–703 (Springer, 2002).

  19. 19.

    Cormode, G. & Muthukrishnan, S. An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55, 58–75 (2005).

  20. 20.

    McGuire, A. L. et al. Confidentiality, privacy, and security of genetic and genomic test information in electronic health records: points to consider. Genet. Med. 10, 495 (2008).

  21. 21.

    Bloss, C. S. Does family always matter? Public genomes and their effect on relatives. Genome Med. 5, 107 (2013).

  22. 22.

    Shringarpure, S. S. & Bustamante, C. D. Privacy risks from genomic data-sharing beacons. Am. J. Hum. Genet. 97, 631–646 (2015).

  23. 23.

    Ayday, E., Raisaro, J. L., Hengartner, U., Molyneaux, A. & Hubaux, J.-P. Privacy-preserving processing of raw genomic data. In Data Privacy Management and Autonomous Spontaneous Security (eds García-Alfaro, J. et al.) 133–147 (Springer, 2014).

  24. 24.

    He, D. et al. Identifying genetic relatives without compromising privacy. Genome Res. 24, 664–72 (2014).

  25. 25.

    Kamm, L., Bogdanov, D., Laur, S. & Vilo, J. A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics 29, 886–893 (2013).

  26. 26.

    McLaren, P. J. et al. Privacy-preserving genomic testing in the clinic: a model using HIV treatment. Genet. Med. 18, 814 (2016).

  27. 27.

    Shimizu, K., Nuida, K. & Rätsch, G. Efficient privacy-preserving string search and an application in genomics. Bioinformatics 32, 1652–1661 (2016).

  28. 28.

    Xie, W. et al. Securema: protecting participant privacy in genetic association meta-analysis. Bioinformatics 30, 3334–3341 (2014).

  29. 29.

    Zhao, Y., Wang, X., Jiang, X., Ohno-Machado, L. & Tang, H. Choosing blindly but wisely: differentially private solicitation of DNA datasets for disease marker discovery. J. Am. Med. Inform. Assoc. 22, 100–108 (2014).

  30. 30.

    Shahbazi, A., Bayatbabolghani, F. & Blanton, M. Private computation with genomic data for genome-wide association and linkage studies. In Proc. 3rd International Workshop Genome Privacy Security (2016); https://www.acsu.buffalo.edu/~mblanton/publications/genopri16.pdf.

  31. 31.

    Chen, F. et al. Premix: privacy-preserving estimation of individual admixture. In AMIA Annual Symposium Proceedings Vol. 2016, 1747–1755 (American Medical Informatics Association, 2016).

  32. 32.

    Lauter, K., López-Alt, A. & Naehrig, M. Private computation on encrypted genomic data. In International Conference on Cryptology and Information Security in Latin America (eds Aranha, D. F. & Menezes, A.) 3–27 (Springer, 2014).

  33. 33.

    Wang, S. et al. Healer: homomorphic computation of exact logistic regression for secure rare disease variants analysis in GWAS. Bioinformatics 32, 211–218 (2015).

  34. 34.

    Zhang, Y., Blanton, M. & Almashaqbeh, G. Secure distributed genome analysis for GWAS & sequence comparison computation. BMC Med. Inform. Decis. Mak. 15, S4 (2015).

  35. 35.

    Halevi, S. & Shoup, V. Algorithms in HElib. In International Cryptology Conference (Garay, J. A. & Gennaro, R.) 554–571 (Springer, 2014).

  36. 36.

    Yao, A. C. Protocols for secure computations. In 23rd Annual Symposium on Foundations of Computer Science (ed. Pippenger, N.) 160–164 (IEEE, 1982).

  37. 37.

    Wang, X., Chan, H. & Shi, E. Circuit ORAM: on tightness of the Goldreich–Ostrovsky lower bound. In Proc. of the 22nd ACM SIGSAC Conference on Computer and Communications Security (eds Ray, I., Li, N. & Kruegel, C.) 850–861 (ACM, 2015).

  38. 38.

    Anati, I., Gueron, S., Johnson, S. P. & Scarlata, V. R. Innovative technology for CPU based attestation and sealing. https://software.intel.com/en-us/articles/innovative-technology-for-cpu-based-attestation-and-sealing (2013).

  39. 39.

    Lewis, C. M. Genetic association studies: design, analysis and interpretation. Brief. Bioinformatics 3, 146–153 (2002).

  40. 40.

    Devlin, B. & Roeder, K. Genomic control for association studies. Biometrics 55, 997–1004 (1999).

  41. 41.

    Yang, J., Zaitlen, N. A., Goddard, M. E., Visscher, P. M. & Price, A. L. Advantages and pitfalls in the application of mixed-model association methods. Nat. Genet. 46, 100 (2014).

  42. 42.

    Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).

  43. 43.

    Wang, X. et al. IDASH secure genome analysis competition 2017. BMC Med. Genomics 11, 85 (2018).

  44. 44.

    Cho, H., Wu, D. J. & Berger, B. Secure genome-wide association analysis using multiparty computation. Nat. Biotechnol. 36, 547 (2018).

  45. 45.

    Celis, P. Robin Hood Hashing. PhD thesis, Univ. Waterloo (1986).

  46. 46.

    Deng, F. & Rafiei, D. New estimation algorithms for streaming data: count-min can do more. http://webdocs.cs.ualberta.ca/~drafiei/papers/cmm.pdf (2007).

  47. 47.

    Armitage, P. Tests for linear trends in proportions and frequencies. Biometrics 11, 375–386 (1955).

  48. 48.

    Boutsidis, C., Woodruff, D. P. & Zhong, P. Optimal principal component analysis in distributed and streaming models. In Proceedings of the 48th Annual ACM Symposium on Theory of Computing (eds Wichs, D. & Mansour, Y.) 236–249 (ACM, 2016).

  49. 49.

    Cohen, M. B., Elder, S., Musco, C., Musco, C. & Persu, M. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the 47th Annual ACM Symposium on Theory of Computing (eds Servedio, R. A. & Rubinfeld, R.) 163–172 (ACM, 2015).

  50. 50.

    Patterson, N., Price, A. L. & Reich, D. Population structure and eigenanalysis. PLoS Genet. 2, e190 (2006).

Download references

Acknowledgements

N.K. was partially supported by NSF CCF-1844234, NSF CCF-1525024 and IIS-1633215. M.O.K. was partially supported by TUBITAK grant 114E293. Part of the work was done while D.P.W. was visiting the Simons Institute for the Theory of Computing. S.C.S. was supported in part by NSF CCF-1619081, NIH GM108348, NIH HG010798 and the Indiana University Grand Challenges Program, Precision Health Initiative, before moving to his current post at NCI.

We thank S. Simmons and H. Cho from the Computer Science and Artificial Intelligence Laboratory at MIT (now at the Broad Institute) for useful discussions and their help in benchmarking SkSES on the AMD dataset44 against the SMC tool. We also thank L. Wang and D. Bu at Indiana University for providing the iDASH2017-wg dataset. We finally thank the Linux team and B. Shei from University Information Technology Services at Indiana University for useful instructions on software installation and preparation.

Author information

C.K., N.D., M.O.K. and S.C.S. initially participated in the iDASH-2017 competition. K.Z., N.K. and S.C.S. formulated the problem with limited memory. K.Z., S.C.S. and D.P.W. further formulated the problem to correct for population stratification. C.K., K.Z. and N.D. implemented the proposed solution. C.K., K.Z., N.D., N.K. and S.C.S. co-wrote the manuscript. M.O.K., D.P.W. and S.C.S. supervised the study.

Correspondence to S. Cenk Sahinalp.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Lei Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Fig. 1 and Supplementary Notes 1–5

Reporting Summary

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Kockan, C., Zhu, K., Dokmai, N. et al. Sketching algorithms for genomic data analysis and querying in a secure enclave. Nat Methods 17, 295–301 (2020). https://doi.org/10.1038/s41592-020-0761-8

Download citation