Secure genome-wide association analysis using multiparty computation

Cho, Hyunghoon; Wu, David J; Berger, Bonnie

doi:10.1038/nbt.4108

Letter
Published: 07 May 2018

Nature Biotechnology volume 36, pages 547–551 (2018)

Abstract

Most sequenced genomes are currently stored in strict access-controlled repositories^1,2,3. Free access to these data could improve the power of genome-wide association studies (GWAS) to identify disease-causing genetic variants and aid the discovery of new drug targets^4,5. However, concerns over genetic data privacy^6,7,8,9 may deter individuals from contributing their genomes to scientific studies^10 and could prevent researchers from sharing data with the scientific community^11. Although cryptographic techniques for secure data analysis exist^12,13,14, none scales to computationally intensive analyses, such as GWAS. Here we describe a protocol for large-scale genome-wide analysis that facilitates quality control and population stratification correction in 9K, 13K, and 23K individuals while maintaining the confidentiality of underlying genotypes and phenotypes. We show the protocol could feasibly scale to a million individuals. This approach may help to make currently restricted data available to the scientific community and could potentially enable secure genome crowdsourcing, allowing individuals to contribute their genomes to a study without compromising their privacy.
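
As a rough illustration of how multiparty computation can keep inputs confidential, the sketch below shows additive secret sharing, one common building block of such protocols. It is a minimal toy under stated assumptions, not the protocol described in the paper: the modulus `PRIME`, the three-party setting, and the helper names `share`, `add_shared`, and `reconstruct` are all hypothetical choices made for this example.

```python
import secrets

# Illustrative field modulus (an assumption for this sketch only).
PRIME = 2**61 - 1


def share(value, n_parties=3):
    """Split an integer into n_parties random shares that sum to value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def add_shared(a, b):
    """Each party adds its own two shares locally; no communication is needed."""
    return [(x + y) % PRIME for x, y in zip(a, b)]


def reconstruct(shares):
    """Only the combination of all shares reveals the underlying value."""
    return sum(shares) % PRIME


if __name__ == "__main__":
    # Two individuals' minor-allele counts at one SNP (toy values).
    alice, bob = share(1), share(2)
    total = add_shared(alice, bob)
    print(reconstruct(total))
```

Any single share is a uniformly random field element, so a party holding one share learns nothing about the genotype on its own; only the aggregate is revealed when the shares are combined.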

**Figure 1: Overview of our secure GWAS pipeline.**

**Figure 2: Our secure GWAS protocol achieves practical runtimes, and all of our scalability metrics follow a linear trend.**

References

Sudlow, C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
Gaziano, J.M. et al. Million Veteran Program: a mega-biobank to study genetic influences on health and disease. J. Clin. Epidemiol. 70, 214–223 (2016).
Chen, Z. et al. China Kadoorie Biobank of 0.5 million people: survey methods, baseline characteristics and long-term follow-up. Int. J. Epidemiol. 40, 1652–1666 (2011).
Ioannidis, J.P.A., Trikalinos, T.A. & Khoury, M.J. Implications of small effect sizes of individual genetic variants on the design and interpretation of genetic association studies of complex diseases. Am. J. Epidemiol. 164, 609–614 (2006).
Moonesinghe, R., Khoury, M.J., Liu, T. & Ioannidis, J.P.A. Required sample size and nonreplicability thresholds for heterogeneous genetic associations. Proc. Natl. Acad. Sci. USA 105, 617–622 (2008).
Brenner, S.E. Be prepared for the big genome leak. Nature 498, 139 (2013).
Gymrek, M., McGuire, A.L., Golan, D., Halperin, E. & Erlich, Y. Identifying personal genomes by surname inference. Science 339, 321–324 (2013).
Shringarpure, S.S. & Bustamante, C.D. Privacy risks from genomic data-sharing beacons. Am. J. Hum. Genet. 97, 631–646 (2015).
Harmanci, A. & Gerstein, M. Quantification of private information leakage from phenotype-genotype data: linking attacks. Nat. Methods 13, 251–256 (2016).
Sanderson, S.C. et al. Motivations, concerns and preferences of personal genome sequencing research participants: baseline findings from the HealthSeq project. Eur. J. Hum. Genet. 24, 14–20 (2016).
Majumder, M.A., Cook-Deegan, R. & McGuire, A.L. Beyond our borders? Public resistance to global genomic data sharing. PLoS Biol. 14, e2000206 (2016).
Cramer, R. & Damgård, I. Secure Multiparty Computation (Cambridge University Press, 2015).
Gentry, C. Fully homomorphic encryption using ideal lattices. STOC '09 Proceedings of the Forty-First Annual ACM symposium on Theory of Computing 169–178 (2009).
Yao, A.C. Protocols for secure computations. IEEE Annual Symposium on Foundations of Computer Science 160–164 (1982).
Jiang, X. et al. A community assessment of privacy preserving techniques for human genomes. BMC Med. Inform. Decis. Mak. 14 (Suppl. 1), S1 (2014).
Kamm, L., Bogdanov, D., Laur, S. & Vilo, J. A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics 29, 886–893 (2013).
Lu, W., Yamada, Y. & Sakuma, J. Efficient secure outsourcing of genome-wide association studies. IEEE Security and Privacy Workshops 3–6, doi:10.1109/SPW.2015.11 (2015).
Wang, S. et al. HEALER: homomorphic computation of ExAct Logistic rEgRession for secure rare disease variants analysis in GWAS. Bioinformatics 32, 211–218 (2016).
Constable, S.D., Tang, Y., Wang, S., Jiang, X. & Chapin, S. Privacy-preserving GWAS analysis on federated genomic datasets. BMC Med. Inform. Decis. Mak. 15 (Suppl. 5), S2 (2015).
Bogdanov, D., Kamm, L., Laur, S. & Sokk, V. Implementation and evaluation of an algorithm for cryptographically private principal component analysis on genomic data. 3rd International Workshop on Genome Privacy and Security http://2016.genopri.org/uploads/3/9/9/9/39999711/genopri16_paper_10.pdf (2016).
Bonte, C. et al. Privacy-preserving genome-wide association study is practical. Cryptology ePrint Archive https://eprint.iacr.org/2017/955 (2017).
Jagadeesh, K.A., Wu, D.J., Birgmeier, J.A., Boneh, D. & Bejerano, G. Deriving genomic diagnoses without revealing patient genomes. Science 357, 692–695 (2017).
Freedman, M.L. et al. Assessing the impact of population stratification on genetic association studies. Nat. Genet. 36, 388–393 (2004).
Price, A.L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
Ben Or, M., Goldwasser, S. & Wigderson, A. Completeness Theorems for Non-Cryptographic Fault-Tolerant Distributed Computation. STOC '88 Proceedings of the Twentieth Annual ACM symposium on Theory of Computing 1–10 (1988).
Bogdanov, D., Laur, S. & Willemson, J. Sharemind: a framework for fast privacy-preserving computations. ESORICS 5283, 192–206 (2008).
Damgård, I., Pastro, V., Smart, N. & Zakarias, S. Multiparty computation from somewhat homomorphic encryption. CRYPTO 2012, 643–662 (2012).
Keller, M., Orsini, E. & Scholl, P. MASCOT: faster malicious arithmetic secure computation with oblivious transfer. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security 830–842 (2016).
Halko, N., Martinsson, P.-G. & Tropp, J.A. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 53, 217–288 (2011).
Galinsky, K.J. et al. Fast principal-component analysis reveals convergent evolution of ADH1B in Europe and East Asia. Am. J. Hum. Genet. 98, 456–472 (2016).
Hosgood, H.D. III et al. Interactions between household air pollution and GWAS-identified lung cancer susceptibility markers in the Female Lung Cancer Consortium in Asia (FLCCA). Hum. Genet. 134, 333–341 (2015).
Xie, G. et al. Association of granulomatosis with polyangiitis (Wegener's) with HLA-DPB1*04 and SEMA6A gene variants: evidence from genome-wide analysis. Arthritis Rheum. 65, 2457–2468 (2013).
Knight, A., Askling, J., Granath, F., Sparen, P. & Ekbom, A. Urinary bladder cancer in Wegener's granulomatosis: risks and relation to cyclophosphamide. Ann. Rheum. Dis. 63, 1307–1311 (2004).
Lee, S. et al. Inferring fine-grained control flow inside SGX enclaves with branch shadowing. Proceedings of the 26th USENIX Security Symposium 557–574 https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-lee-sangho.pdf (USENIX Association, 2017).
Xu, Y., Cui, W. & Peinado, M. Controlled-channel attacks: deterministic side channels for untrusted operating systems. Proceedings of the 2015 IEEE Symposium on Security and Privacy 640–656 (2015).
Simmons, S., Sahinalp, C. & Berger, B. Enabling privacy-preserving GWASs in heterogeneous human populations. Cell Syst. 3, 54–61 (2016).
Simmons, S. & Berger, B. Realizing privacy preserving genome-wide association studies. Bioinformatics 32, 1293–1300 (2016).
Lan, Q. et al. Genome-wide association analysis identifies new lung cancer susceptibility loci in never-smoking women in Asia. Nat. Genet. 44, 1330–1335 (2012).
Figueroa, J.D. et al. Genome-wide association study identifies multiple loci associated with bladder cancer risk. Hum. Mol. Genet. 23, 1387–1398 (2014).
Fritsche, L.G. et al. A large genome-wide association study of age-related macular degeneration highlights contributions of rare and common variants. Nat. Genet. 48, 134–143 (2016).
Tryka, K.A. et al. NCBI's Database of Genotypes and Phenotypes: dbGaP. Nucleic Acids Res. 42, D975–D979 (2014).

Acknowledgements

H.C. and B.B. are partially supported by the US National Institutes of Health GM108348 (to B.B.). H.C. is also partially supported by the Kwanjeong Educational Foundation. D.J.W. is supported by fellowships from the Simons and National Science Foundations.

Author information

Authors and Affiliations

Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
Hyunghoon Cho & Bonnie Berger
Department of Computer Science, Stanford University, Stanford, California, USA
David J Wu
Department of Mathematics, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
Bonnie Berger

Contributions

H.C., D.J.W., and B.B. developed the methods. H.C. implemented the software and performed experiments with assistance from D.J.W. and B.B. B.B. supervised the project. All authors wrote the manuscript.

Corresponding author

Correspondence to Bonnie Berger.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Our secure GWAS protocol obtains accurate association statistics.

Using our protocol, we securely performed GWAS on three published case-control data sets for lung cancer (n = 9,098 after quality control), bladder cancer (n = 10,678), and age-related macular degeneration (AMD; n = 20,679). All of the tested SNPs passing quality control are shown in the figure: 378,492 loci for lung cancer, 389,868 loci for bladder cancer, and 221,295 loci for AMD. Our securely computed Cochran-Armitage trend test p-values (one-sided) accurately matched the ground truth we obtained based on plaintext data.
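
For readers unfamiliar with the statistic, the following is a minimal plaintext sketch of the Cochran-Armitage trend test on a single SNP, using the standard textbook formula with additive weights (0, 1, 2). It is not the secure protocol itself; the function name `cochran_armitage_trend` and the toy genotype counts are assumptions made for illustration.

```python
import math


def cochran_armitage_trend(cases, controls, weights=(0, 1, 2)):
    """Return (z, one_sided_p) for a 2 x k table of genotype counts.

    cases[i] and controls[i] are the numbers of cases and controls
    carrying i copies of the minor allele.
    """
    r1, r2 = sum(cases), sum(controls)          # row totals
    n = r1 + r2                                 # total sample size
    cols = [a + b for a, b in zip(cases, controls)]  # column totals
    k = len(weights)

    # Trend statistic: weighted difference between cases and controls.
    t_stat = sum(w * (a * r2 - b * r1)
                 for w, a, b in zip(weights, cases, controls))

    # Variance of the trend statistic under the null hypothesis.
    s1 = sum(w * w * c * (n - c) for w, c in zip(weights, cols))
    s2 = sum(weights[i] * weights[j] * cols[i] * cols[j]
             for i in range(k) for j in range(i + 1, k))
    var = r1 * r2 / n * (s1 - 2 * s2)

    z = t_stat / math.sqrt(var)
    # Upper-tail (one-sided) p-value from the standard normal distribution.
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p


if __name__ == "__main__":
    # Toy counts for genotypes 0, 1, 2 (not taken from the paper's data sets).
    z, p = cochran_armitage_trend(cases=[10, 20, 30], controls=[30, 20, 10])
    print(z, p)
```

The squared statistic equals the sample size times the squared Pearson correlation between genotype dosage and case status, which is one reason the test is inexpensive to evaluate across many SNPs.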

Supplementary Figure 2 Our secure GWAS protocol accurately estimates the effect size of associated SNPs via logistic regression.

We implemented logistic regression in our secure computation framework and applied it to a subset of 100 SNPs (randomly chosen among the top 1000 associations) in the lung cancer data set (n = 9,098 after quality control). The odds ratio of a SNP is given by the exponential function evaluated at the estimated weight associated with the SNP's minor allele dosage in a logistic regression model. Analogous to our main GWAS protocol, we included 10 additional phenotypes (e.g., age group) and five principal components securely obtained by our GWAS protocol as covariates in the model. As shown in the scatter plot, the odds ratios securely obtained by our protocol accurately matched those computed by a plaintext implementation of logistic regression, which used a plaintext PCA algorithm to obtain the top principal components. Logistic regression on 100 SNPs took about a day in our experimental setup. Although performing logistic regression genome-wide is still prohibitively expensive, our method enables a heuristic two-step approach in which the odds ratios are computed only for the SNPs passing a certain significance threshold in our main GWAS protocol. Note that our logistic regression pipeline provides the same security guarantees as our main GWAS protocol; namely, no information about the underlying genotypes and phenotypes is revealed during the process other than the final output.
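
The relationship between the fitted weight and the odds ratio can be sketched in plaintext as follows. This is a simplified illustration, not the secure pipeline: it fits only an intercept and the minor allele dosage by Newton's method, omits the 10 phenotype covariates and five principal components described above, and uses hypothetical names (`fit_logistic`) and toy data.

```python
import math


def fit_logistic(dosage, y, iters=25):
    """Fit logit P(y = 1) = b0 + b1 * dosage by Newton's method.

    Returns (b0, b1). Only two parameters are fitted; covariates are omitted.
    """
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = 0.0              # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0      # entries of X^T W X
        for x, t in zip(dosage, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += t - p
            g1 += (t - p) * x
            w = p * (1.0 - p)
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton update: solve the 2 x 2 system explicitly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


if __name__ == "__main__":
    # Toy data: (dosage, number of controls, number of cases).
    groups = [(0, 30, 10), (1, 20, 20), (2, 10, 30)]
    dosage, y = [], []
    for d, n_controls, n_cases in groups:
        dosage += [d] * (n_controls + n_cases)
        y += [0] * n_controls + [1] * n_cases

    b0, b1 = fit_logistic(dosage, y)
    odds_ratio = math.exp(b1)      # odds ratio per copy of the minor allele
    print(odds_ratio)
```

Exponentiating the dosage weight gives the multiplicative change in the odds of being a case for each additional copy of the minor allele, which is the quantity compared in the scatter plot.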

About this article

Cite this article

Cho, H., Wu, D. & Berger, B. Secure genome-wide association analysis using multiparty computation. Nat Biotechnol 36, 547–551 (2018). https://doi.org/10.1038/nbt.4108

Received: 25 August 2017
Accepted: 19 February 2018
Published: 07 May 2018
Issue Date: July 2018
DOI: https://doi.org/10.1038/nbt.4108
