Most sequenced genomes are currently stored in strict access-controlled repositories [1–3]. Free access to these data could improve the power of genome-wide association studies (GWAS) to identify disease-causing genetic variants and aid the discovery of new drug targets [4,5]. However, concerns over genetic data privacy [6–9] may deter individuals from contributing their genomes to scientific studies [10] and could prevent researchers from sharing data with the scientific community [11]. Although cryptographic techniques for secure data analysis exist [12–14], none scales to computationally intensive analyses, such as GWAS. Here we describe a protocol for large-scale genome-wide analysis that facilitates quality control and population stratification correction in 9K, 13K, and 23K individuals while maintaining the confidentiality of underlying genotypes and phenotypes. We show the protocol could feasibly scale to a million individuals. This approach may help to make currently restricted data available to the scientific community and could potentially enable secure genome crowdsourcing, allowing individuals to contribute their genomes to a study without compromising their privacy.
Sudlow, C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
Gaziano, J.M. et al. Million Veteran Program: a mega-biobank to study genetic influences on health and disease. J. Clin. Epidemiol. 70, 214–223 (2016).
Chen, Z. et al. China Kadoorie Biobank of 0.5 million people: survey methods, baseline characteristics and long-term follow-up. Int. J. Epidemiol. 40, 1652–1666 (2011).
Ioannidis, J.P.A., Trikalinos, T.A. & Khoury, M.J. Implications of small effect sizes of individual genetic variants on the design and interpretation of genetic association studies of complex diseases. Am. J. Epidemiol. 164, 609–614 (2006).
Moonesinghe, R., Khoury, M.J., Liu, T. & Ioannidis, J.P.A. Required sample size and nonreplicability thresholds for heterogeneous genetic associations. Proc. Natl. Acad. Sci. USA 105, 617–622 (2008).
Brenner, S.E. Be prepared for the big genome leak. Nature 498, 139 (2013).
Gymrek, M., McGuire, A.L., Golan, D., Halperin, E. & Erlich, Y. Identifying personal genomes by surname inference. Science 339, 321–324 (2013).
Shringarpure, S.S. & Bustamante, C.D. Privacy risks from genomic data-sharing beacons. Am. J. Hum. Genet. 97, 631–646 (2015).
Harmanci, A. & Gerstein, M. Quantification of private information leakage from phenotype-genotype data: linking attacks. Nat. Methods 13, 251–256 (2016).
Sanderson, S.C. et al. Motivations, concerns and preferences of personal genome sequencing research participants: baseline findings from the HealthSeq project. Eur. J. Hum. Genet. 24, 14–20 (2016).
Majumder, M.A., Cook-Deegan, R. & McGuire, A.L. Beyond our borders? Public resistance to global genomic data sharing. PLoS Biol. 14, e2000206 (2016).
Cramer, R. & Damgård, I. Secure Multiparty Computation (Cambridge University Press, 2015).
Gentry, C. Fully homomorphic encryption using ideal lattices. STOC '09 Proceedings of the Forty-First Annual ACM symposium on Theory of Computing 169–178 (2009).
Yao, A.C. Protocols for secure computations. IEEE Annual Symposium on Foundations of Computer Science 160–164 (1982).
Jiang, X. et al. A community assessment of privacy preserving techniques for human genomes. BMC Med. Inform. Decis. Mak. 14 (Suppl. 1), S1 (2014).
Kamm, L., Bogdanov, D., Laur, S. & Vilo, J. A new way to protect privacy in large-scale genome-wide association studies. Bioinformatics 29, 886–893 (2013).
Lu, W., Yamada, Y. & Sakuma, J. Efficient secure outsourcing of genome-wide association studies. IEEE Security and Privacy Workshops 3–6, doi:10.1109/SPW.2015.11 (2015).
Wang, S. et al. HEALER: homomorphic computation of ExAct Logistic rEgRession for secure rare disease variants analysis in GWAS. Bioinformatics 32, 211–218 (2016).
Constable, S.D., Tang, Y., Wang, S., Jiang, X. & Chapin, S. Privacy-preserving GWAS analysis on federated genomic datasets. BMC Med. Inform. Decis. Mak. 15 (Suppl. 5), S2 (2015).
Bogdanov, D., Kamm, L., Laur, S. & Sokk, V. Implementation and evaluation of an algorithm for cryptographically private principal component analysis on genomic data. 3rd International Workshop on Genome Privacy and Security http://2016.genopri.org/uploads/3/9/9/9/39999711/genopri16_paper_10.pdf (2016).
Bonte, C. et al. Privacy-preserving genome-wide association study is practical. Cryptology ePrint Archive https://eprint.iacr.org/2017/955 (2017).
Jagadeesh, K.A., Wu, D.J., Birgmeier, J.A., Boneh, D. & Bejerano, G. Deriving genomic diagnoses without revealing patient genomes. Science 357, 692–695 (2017).
Freedman, M.L. et al. Assessing the impact of population stratification on genetic association studies. Nat. Genet. 36, 388–393 (2004).
Price, A.L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
Ben Or, M., Goldwasser, S. & Wigderson, A. Completeness Theorems for Non-Cryptographic Fault-Tolerant Distributed Computation. STOC '88 Proceedings of the Twentieth Annual ACM symposium on Theory of Computing 1–10 (1988).
Bogdanov, D., Laur, S. & Willemson, J. Sharemind: a framework for fast privacy-preserving computations. ESORICS 5283, 192–206 (2008).
Damgård, I., Pastro, V., Smart, N. & Zakarias, S. Multiparty computation from somewhat homomorphic encryption. CRYPTO 2012, 643–662 (2012).
Keller, M., Orsini, E. & Scholl, P. MASCOT: faster malicious arithmetic secure computation with oblivious transfer. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security 830–842 (2016).
Halko, N., Martinsson, P.-G. & Tropp, J.A. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 53, 217–288 (2011).
Galinsky, K.J. et al. Fast principal-component analysis reveals convergent evolution of ADH1B in Europe and East Asia. Am. J. Hum. Genet. 98, 456–472 (2016).
Hosgood, H.D. III et al. Interactions between household air pollution and GWAS-identified lung cancer susceptibility markers in the Female Lung Cancer Consortium in Asia (FLCCA). Hum. Genet. 134, 333–341 (2015).
Xie, G. et al. Association of granulomatosis with polyangiitis (Wegener's) with HLA-DPB1*04 and SEMA6A gene variants: evidence from genome-wide analysis. Arthritis Rheum. 65, 2457–2468 (2013).
Knight, A., Askling, J., Granath, F., Sparen, P. & Ekbom, A. Urinary bladder cancer in Wegener's granulomatosis: risks and relation to cyclophosphamide. Ann. Rheum. Dis. 63, 1307–1311 (2004).
Lee, S. et al. Inferring fine-grained control flow inside SGX enclaves with branch shadowing. Proceedings of the 26th USENIX Security Symposium 557–574 https://www.usenix.org/system/files/conference/usenixsecurity17/sec17-lee-sangho.pdf (USENIX Association, 2017).
Xu, Y., Cui, W. & Peinado, M. Controlled-channel attacks: deterministic side channels for untrusted operating systems. Proceedings of the 2015 IEEE Symposium on Security and Privacy 640–656 (2015).
Simmons, S., Sahinalp, C. & Berger, B. Enabling privacy-preserving GWASs in heterogeneous human populations. Cell Syst. 3, 54–61 (2016).
Simmons, S. & Berger, B. Realizing privacy preserving genome-wide association studies. Bioinformatics 32, 1293–1300 (2016).
Lan, Q. et al. Genome-wide association analysis identifies new lung cancer susceptibility loci in never-smoking women in Asia. Nat. Genet. 44, 1330–1335 (2012).
Figueroa, J.D. et al. Genome-wide association study identifies multiple loci associated with bladder cancer risk. Hum. Mol. Genet. 23, 1387–1398 (2014).
Fritsche, L.G. et al. A large genome-wide association study of age-related macular degeneration highlights contributions of rare and common variants. Nat. Genet. 48, 134–143 (2016).
Tryka, K.A. et al. NCBI's Database of Genotypes and Phenotypes: dbGaP. Nucleic Acids Res. 42, D975–D979 (2014).
H.C. and B.B. are partially supported by US National Institutes of Health grant GM108348 (to B.B.). H.C. is also partially supported by the Kwanjeong Educational Foundation. D.J.W. is supported by fellowships from the Simons and National Science Foundations.
The authors declare no competing financial interests.
Integrated supplementary information
Using our protocol, we securely performed GWAS on three published case-control data sets for lung cancer (n = 9,098 after quality control), bladder cancer (n = 10,678), and age-related macular degeneration (AMD; n = 20,679). All of the tested SNPs passing quality control are shown in the figure: 378,492 loci for lung cancer, 389,868 loci for bladder cancer, and 221,295 loci for AMD. Our securely computed Cochran-Armitage trend test p-values (one-sided) accurately matched the ground truth we obtained based on plaintext data.
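For readers unfamiliar with the statistic named in this caption, the following is a minimal plaintext sketch of a Cochran-Armitage trend test for a single SNP. It is not the secure protocol and omits the covariate and principal-component correction; it only illustrates the kind of plaintext reference value a securely computed statistic can be compared against. The function name `cochran_armitage_trend`, the additive weights (0, 1, 2), and the choice to report the upper tail of a chi-square distribution with one degree of freedom are assumptions made for illustration, and the p-value convention may differ from the one used for the figure.

```python
import math


def cochran_armitage_trend(cases, controls, weights=(0, 1, 2)):
    """Plaintext Cochran-Armitage trend test for one SNP (illustrative only).

    cases[i] and controls[i] are the numbers of cases and controls that carry
    i copies of the minor allele. Returns (chi2, p), where p is the upper tail
    of a chi-square distribution with one degree of freedom (an assumed
    convention, not necessarily the one used in the paper).
    """
    if not (len(cases) == len(controls) == len(weights)):
        raise ValueError("cases, controls and weights must have equal length")

    per_genotype = [r + s for r, s in zip(cases, controls)]
    n_total = sum(per_genotype)
    n_cases = sum(cases)
    n_controls = sum(controls)

    # Trend statistic: T = sum_i t_i * (r_i * S - s_i * R)
    t_stat = sum(
        t * (r * n_controls - s * n_cases)
        for t, r, s in zip(weights, cases, controls)
    )

    # Variance of T under the null hypothesis of no association:
    # Var(T) = (R * S / N) * (N * sum_i t_i^2 n_i - (sum_i t_i n_i)^2)
    spread = n_total * sum(t * t * n for t, n in zip(weights, per_genotype)) - sum(
        t * n for t, n in zip(weights, per_genotype)
    ) ** 2
    variance = n_cases * n_controls / n_total * spread
    if variance == 0:
        raise ValueError("degenerate table: the SNP or the phenotype is constant")

    chi2 = t_stat ** 2 / variance
    # Upper tail of chi-square(1): P(X > chi2) = erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

Calling `cochran_armitage_trend([10, 20, 30], [30, 20, 10])` on this hypothetical table returns a statistic equal to N times the squared Pearson correlation between minor allele dosage and case status, which is a convenient cross-check for any alternative implementation.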
Supplementary Figure 2 Our secure GWAS protocol accurately estimates the effect size of associated SNPs via logistic regression.
We implemented logistic regression in our secure computation framework and applied it to a subset of 100 SNPs (randomly chosen among the top 1,000 associations) in the lung cancer data set (n = 9,098 after quality control). The odds ratio of a SNP is given by the exponential function evaluated at the estimated weight associated with the SNP's minor allele dosage in a logistic regression model. As in our main GWAS protocol, we included 10 additional phenotypes (e.g., age group) and five principal components securely obtained by our GWAS protocol as covariates in the model. As shown in the scatter plot, the odds ratios securely obtained by our protocol accurately matched those computed with a plaintext implementation of logistic regression, which also used a plaintext PCA algorithm to obtain the top principal components. Logistic regression on 100 SNPs took about a day in our experimental setup. Although performing logistic regression genome-wide is still prohibitively expensive, our method enables a heuristic two-step approach in which odds ratios are computed only for the SNPs passing a certain significance threshold in our main GWAS protocol. Note that our logistic regression pipeline provides the same security guarantees as our main GWAS protocol; namely, no information about the underlying genotypes and phenotypes is revealed during the process other than the final output.
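The relationship between the fitted weight and the odds ratio described above can be sketched in plaintext as follows. This is a simplified illustration, not the secure pipeline: it fits an intercept and a single dosage term by Newton's method and leaves out the 10 phenotype covariates and five principal components. The function names `fit_logistic` and `odds_ratio`, the iteration count, and the toy data are all hypothetical.

```python
import math


def fit_logistic(dosage, phenotype, iterations=25):
    """Fit logit P(case) = b0 + b1 * dosage by Newton's method (plaintext).

    dosage: minor allele dosage per individual; phenotype: 1 for case,
    0 for control. Covariates are deliberately omitted in this sketch.
    Returns (b0, b1).
    """
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0              # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0      # entries of the negative Hessian
        for x, y in zip(dosage, phenotype):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det == 0:
            raise ValueError("singular Hessian: dosage is constant")
        # Newton step: solve the 2x2 system H * delta = gradient
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def odds_ratio(dosage, phenotype):
    """Odds ratio per copy of the minor allele: exp of the dosage weight."""
    _, b1 = fit_logistic(dosage, phenotype)
    return math.exp(b1)


# Toy data (hypothetical): 30 cases and 10 controls with dosage 1,
# 20 cases and 40 controls with dosage 0.
toy_dosage = [1] * 40 + [0] * 60
toy_phenotype = [1] * 30 + [0] * 10 + [1] * 20 + [0] * 40
toy_or = odds_ratio(toy_dosage, toy_phenotype)
```

With a binary dosage the model is saturated, so the fitted odds ratio should agree with the odds ratio read directly off the 2x2 table, (30/10)/(20/40). In the two-step approach described in the caption, a function like this would be run only for the SNPs that pass the significance threshold in the first pass.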
Supplementary Figures 1–2 (PDF 264 kb)
Supplementary tables 1–3 (PDF 199 kb)
Supplementary notes 1–12 (PDF 1959 kb)
An implementation of our secure GWAS protocol in C++. (ZIP 372 kb)
About this article
Cite this article
Cho, H., Wu, D. & Berger, B. Secure genome-wide association analysis using multiparty computation. Nat Biotechnol 36, 547–551 (2018). https://doi.org/10.1038/nbt.4108