Secure genome-wide association analysis using multiparty computation

Abstract

Most sequenced genomes are currently stored in strict access-controlled repositories [1,2,3]. Free access to these data could improve the power of genome-wide association studies (GWAS) to identify disease-causing genetic variants and aid the discovery of new drug targets [4,5]. However, concerns over genetic data privacy [6,7,8,9] may deter individuals from contributing their genomes to scientific studies [10] and could prevent researchers from sharing data with the scientific community [11]. Although cryptographic techniques for secure data analysis exist [12,13,14], none scales to computationally intensive analyses, such as GWAS. Here we describe a protocol for large-scale genome-wide analysis that facilitates quality control and population stratification correction in 9K, 13K, and 23K individuals while maintaining the confidentiality of underlying genotypes and phenotypes. We show the protocol could feasibly scale to a million individuals. This approach may help to make currently restricted data available to the scientific community and could potentially enable secure genome crowdsourcing, allowing individuals to contribute their genomes to a study without compromising their privacy.

Figure 1: Overview of our secure GWAS pipeline.
Figure 2: Our secure GWAS protocol achieves practical runtimes, and all of our scalability metrics follow a linear trend.

References

  1. Sudlow, C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).

  2. Gaziano, J.M. et al. Million Veteran Program: a mega-biobank to study genetic influences on health and disease. J. Clin. Epidemiol. 70, 214–223 (2016).

  3. Chen, Z. et al. China Kadoorie Biobank of 0.5 million people: survey methods, baseline characteristics and long-term follow-up. Int. J. Epidemiol. 40, 1652–1666 (2011).

  4. Ioannidis, J.P.A., Trikalinos, T.A. & Khoury, M.J. Implications of small effect sizes of individual genetic variants on the design and interpretation of genetic association studies of complex diseases. Am. J. Epidemiol. 164, 609–614 (2006).

  5. Moonesinghe, R., Khoury, M.J., Liu, T. & Ioannidis, J.P.A. Required sample size and nonreplicability thresholds for heterogeneous genetic associations. Proc. Natl. Acad. Sci. USA 105, 617–622 (2008).

  6. Brenner, S.E. Be prepared for the big genome leak. Nature 498, 139 (2013).

  7. Gymrek, M., McGuire, A.L., Golan, D., Halperin, E. & Erlich, Y. Identifying personal genomes by surname inference. Science 339, 321–324 (2013).

  8. Shringarpure, S.S. & Bustamante, C.D. Privacy risks from genomic data-sharing beacons. Am. J. Hum. Genet. 97, 631–646 (2015).

  9. Harmanci, A. & Gerstein, M. Quantification of private information leakage from phenotype-genotype data: linking attacks. Nat. Methods 13, 251–256 (2016).

  10. Sanderson, S.C. et al. Motivations, concerns and preferences of personal genome sequencing research participants: baseline findings from the HealthSeq project. Eur. J. Hum. Genet. 24, 14–20 (2016).

  11. Majumder, M.A., Cook-Deegan, R. & McGuire, A.L. Beyond our borders? Public resistance to global genomic data sharing. PLoS Biol. 14, e2000206 (2016).

  12. Cramer, R. & Damgård, I. Secure Multiparty Computation (Cambridge University Press, 2015).

  13. Gentry, C. Fully homomorphic encryption using ideal lattices. STOC '09 Proceedings of the Forty-First Annual ACM symposium on Theory of Computing 169–178 (2009).

  14. Yao, A.C. Protocols for secure computations. IEEE Annual Symposium on Foundations of Computer Science 160–164 (1982).

  15. Jiang, X. et al. A community assessment of privacy preserving techniques for human genomes. BMC Med. Inform. Decis. Mak. 14 (Suppl. 1), S1 (2014).

  16. Kamm, L., Bogdanov, D., Laur, S. & Vilo, J. A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics 29, 886–893 (2013).

  17. Lu, W., Yamada, Y. & Sakuma, J. Efficient secure outsourcing of genome-wide association studies. IEEE Security and Privacy Workshops 3–6, doi:10.1109/SPW.2015.11 (2015).

  18. Wang, S. et al. HEALER: homomorphic computation of ExAct Logistic rEgRession for secure rare disease variants analysis in GWAS. Bioinformatics 32, 211–218 (2016).

  19. Constable, S.D., Tang, Y., Wang, S., Jiang, X. & Chapin, S. Privacy-preserving GWAS analysis on federated genomic datasets. BMC Med. Inform. Decis. Mak. 15 (Suppl. 5), S2 (2015).

  20. Bogdanov, D., Kamm, L., Laur, S. & Sokk, V. Implementation and evaluation of an algorithm for cryptographically private principal component analysis on genomic data. 3rd International Workshop on Genome Privacy and Security http://2016.genopri.org/uploads/3/9/9/9/39999711/genopri16_paper_10.pdf (2016).

  21. Bonte, C. et al. Privacy-preserving genome-wide association study is practical. Cryptology ePrint Archive https://eprint.iacr.org/2017/955 (2017).

  22. Jagadeesh, K.A., Wu, D.J., Birgmeier, J.A., Boneh, D. & Bejerano, G. Deriving genomic diagnoses without revealing patient genomes. Science 357, 692–695 (2017).

  23. Freedman, M.L. et al. Assessing the impact of population stratification on genetic association studies. Nat. Genet. 36, 388–393 (2004).

  24. Price, A.L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).

  25. Ben Or, M., Goldwasser, S. & Wigderson, A. Completeness theorems for non-cryptographic fault-tolerant distributed computation. STOC '88 Proceedings of the Twentieth Annual ACM symposium on Theory of Computing 1–10 (1988).

  26. Bogdanov, D., Laur, S. & Willemson, J. Sharemind: a framework for fast privacy-preserving computations. ESORICS 5283, 192–206 (2008).

  27. Damgård, I., Pastro, V., Smart, N. & Zakarias, S. Multiparty computation from somewhat homomorphic encryption. CRYPTO 2012, 643–662 (2012).

  28. Keller, M., Orsini, E. & Scholl, P. MASCOT: faster malicious arithmetic secure computation with oblivious transfer. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security 830–842 (2016).

  29. Halko, N., Martinsson, P.-G. & Tropp, J.A. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 53, 217–288 (2011).

  30. Galinsky, K.J. et al. Fast principal-component analysis reveals convergent evolution of ADH1B in Europe and East Asia. Am. J. Hum. Genet. 98, 456–472 (2016).

  31. Hosgood, H.D. III et al. Interactions between household air pollution and GWAS-identified lung cancer susceptibility markers in the Female Lung Cancer Consortium in Asia (FLCCA). Hum. Genet. 134, 333–341 (2015).

  32. Xie, G. et al. Association of granulomatosis with polyangiitis (Wegener's) with HLA-DPB1*04 and SEMA6A gene variants: evidence from genome-wide analysis. Arthritis Rheum. 65, 2457–2468 (2013).

  33. Knight, A., Askling, J., Granath, F., Sparen, P. & Ekbom, A. Urinary bladder cancer in Wegener's granulomatosis: risks and relation to cyclophosphamide. Ann. Rheum. Dis. 63, 1307–1311 (2004).

  34. Lee, S. et al. Inferring fine-grained control flow inside SGX enclaves with branch shadowing. Proceedings of the 26th USENIX Security Symposium 557–574 https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-lee-sangho.pdf (USENIX Association, 2017).

  35. Xu, Y., Cui, W. & Peinado, M. Controlled-channel attacks: deterministic side channels for untrusted operating systems. Proceedings of the 2015 IEEE Symposium on Security and Privacy 640–656 (2015).

  36. Simmons, S., Sahinalp, C. & Berger, B. Enabling privacy-preserving GWASs in heterogeneous human populations. Cell Syst. 3, 54–61 (2016).

  37. Simmons, S. & Berger, B. Realizing privacy preserving genome-wide association studies. Bioinformatics 32, 1293–1300 (2016).

  38. Lan, Q. et al. Genome-wide association analysis identifies new lung cancer susceptibility loci in never-smoking women in Asia. Nat. Genet. 44, 1330–1335 (2012).

  39. Figueroa, J.D. et al. Genome-wide association study identifies multiple loci associated with bladder cancer risk. Hum. Mol. Genet. 23, 1387–1398 (2014).

  40. Fritsche, L.G. et al. A large genome-wide association study of age-related macular degeneration highlights contributions of rare and common variants. Nat. Genet. 48, 134–143 (2016).

  41. Tryka, K.A. et al. NCBI's Database of Genotypes and Phenotypes: dbGaP. Nucleic Acids Res. 42, D975–D979 (2014).

Acknowledgements

H.C. and B.B. are partially supported by the US National Institutes of Health GM108348 (to B.B.). H.C. is also partially supported by Kwanjeong Educational Foundation. D.J.W. is supported by fellowships from the Simons and National Science Foundations.

Author information

Contributions

H.C., D.J.W., and B.B. developed the methods. H.C. implemented the software and performed experiments with assistance from D.J.W. and B.B. B.B. supervised the project. All authors wrote the manuscript.

Corresponding author

Correspondence to Bonnie Berger.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Our secure GWAS protocol obtains accurate association statistics.

Using our protocol, we securely performed GWAS on three published case-control data sets for lung cancer (n = 9,098 after quality control), bladder cancer (n = 10,678), and age-related macular degeneration (AMD; n = 20,679). All of the tested SNPs passing quality control are shown in the figure: 378,492 loci for lung cancer, 389,868 loci for bladder cancer, and 221,295 loci for AMD. Our securely computed Cochran-Armitage trend test p-values (one-sided) accurately matched the ground truth we obtained based on plaintext data.
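For readers unfamiliar with the statistic named above, the following is a minimal plaintext sketch of the Cochran-Armitage trend test on a single SNP. It is not the secure protocol and not the authors' C++ code: the function name `trend_test`, the additive genotype weights (0, 1, 2), and the choice to report the upper-tail p-value of the 1-degree-of-freedom chi-square statistic are all assumptions made for illustration.

```python
import math

def trend_test(case_counts, control_counts, weights=(0, 1, 2)):
    """Plaintext Cochran-Armitage trend test for one SNP (illustrative only).

    case_counts / control_counts: numbers of individuals carrying
    0, 1, or 2 copies of the minor allele. The weights (0, 1, 2) are an
    assumed additive coding.
    """
    r1 = sum(case_counts)            # number of cases
    r2 = sum(control_counts)         # number of controls
    n = r1 + r2
    cols = [a + b for a, b in zip(case_counts, control_counts)]

    # Trend statistic T = sum_i t_i * (N_1i * R2 - N_2i * R1)
    t_stat = sum(w * (a * r2 - b * r1)
                 for w, a, b in zip(weights, case_counts, control_counts))

    # Var(T) under the null: (R1 * R2 / N) * (N * sum t_i^2 C_i - (sum t_i C_i)^2)
    sx = sum(w * c for w, c in zip(weights, cols))
    sxx = sum(w * w * c for w, c in zip(weights, cols))
    var = (r1 * r2 / n) * (n * sxx - sx * sx)
    if var <= 0:
        raise ValueError("degenerate table (e.g. monomorphic SNP or one empty group)")

    z = t_stat / math.sqrt(var)
    chi2 = z * z                                  # 1-d.f. chi-square statistic
    p_value = math.erfc(abs(z) / math.sqrt(2))    # upper tail of chi-square(1); an assumed convention
    return z, chi2, p_value

# Hypothetical genotype counts, not taken from any data set in the paper.
z, chi2, p = trend_test((10, 20, 30), (30, 20, 10))
print(z, chi2, p)   # chi2 is approximately 20 for this table
```

In the secure setting, the counts would never be visible to any single party; this sketch only shows what quantity is being compared against the plaintext ground truth.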

Supplementary Figure 2 Our secure GWAS protocol accurately estimates the effect size of associated SNPs via logistic regression.

We implemented logistic regression in our secure computation framework and applied it to a subset of 100 SNPs (randomly chosen among the top 1,000 associations) in the lung cancer data set (n = 9,098 after quality control). The odds ratio of a SNP is given by the exponential function evaluated at the estimated weight associated with the SNP's minor allele dosage in a logistic regression model. As in our main GWAS protocol, we included 10 additional phenotypes (e.g., age group) and five principal components securely obtained by our GWAS protocol as covariates in the model. As shown in the scatter plot, the odds ratios securely obtained by our protocol accurately matched those computed by a plaintext implementation of logistic regression, which also used a plaintext PCA algorithm to obtain the top principal components. Logistic regression on 100 SNPs completed in about a day using our experimental setup. Although performing logistic regression genome-wide is still prohibitively expensive, our method enables a heuristic two-step approach in which the odds ratios are computed for only the SNPs passing a certain significance threshold in our main GWAS protocol. Note that our logistic regression pipeline provides the same security guarantees as our main GWAS protocol; namely, no information about the underlying genotypes and phenotypes is revealed during the process other than the final output.
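The relationship between the fitted weight and the odds ratio can be illustrated with a small plaintext sketch. This is an assumption-laden toy, not the secure pipeline: the function name `fit_logistic`, the Newton-Raphson solver, and the toy data are hypothetical, and the model has only an intercept and one dosage term, whereas the analysis described above also includes phenotype and principal-component covariates.

```python
import math

def fit_logistic(xs, ys, iters=25):
    """Fit P(y = 1) = sigmoid(b0 + b1 * x) by Newton-Raphson (plaintext toy)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p            # gradient of the log-likelihood w.r.t. b0
            g1 += (y - p) * x      # gradient w.r.t. b1
            h00 += w               # entries of X^T W X
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (X^T W X) * delta = gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical toy data: dosage x in {0, 1}, case status y in {0, 1}.
xs = [1] * 30 + [0] * 20 + [1] * 10 + [0] * 40
ys = [1] * 50 + [0] * 50

b0, b1 = fit_logistic(xs, ys)
odds_ratio = math.exp(b1)   # odds ratio = exp(weight on the dosage term)
print(odds_ratio)           # close to (30 * 40) / (20 * 10) = 6 for this 2x2 table
```

With a single binary predictor, the maximum-likelihood odds ratio equals the cross-product ratio of the 2x2 table, which gives an easy sanity check on the fit. In the two-step approach described above, a fit like this would be run only for SNPs that pass the significance threshold of the main association test.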

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–2 (PDF 264 kb)

Life Sciences Reporting Summary (PDF 170 kb)

Supplementary Tables

Supplementary Tables 1–3 (PDF 199 kb)

Supplementary Notes

Supplementary Notes 1–12 (PDF 1959 kb)

Supplementary Code

An implementation of our secure GWAS protocol in C++. (ZIP 372 kb)

Cite this article

Cho, H., Wu, D. & Berger, B. Secure genome-wide association analysis using multiparty computation. Nat Biotechnol 36, 547–551 (2018). https://doi.org/10.1038/nbt.4108