Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • Article
  • Published:

Quantification of private information leakage from phenotype-genotype data: linking attacks

Abstract

Studies on genomic privacy have traditionally focused on identifying individuals using DNA variants. In contrast, molecular phenotype data, such as gene expression levels, are generally assumed to be free of such identifying information. Although there is no explicit genotypic information in phenotype data, adversaries can statistically link phenotypes to genotypes using publicly available genotype-phenotype correlations such as expression quantitative trait loci (eQTLs). This linking can be accurate when high-dimensional data (i.e., many expression levels) are used, and the resulting links can then reveal sensitive information (for example, the fact that an individual has cancer). Here we develop frameworks for quantifying the leakage of characterizing information from phenotype data sets. These frameworks can be used to estimate the leakage from large data sets before release. We also present a general three-step procedure for practically instantiating linking attacks and a specific attack using outlier gene expression levels that is simple yet accurate. Finally, we describe the effectiveness of this outlier attack under different scenarios.

This is a preview of subscription content, access via your institution

Access options

Buy this article

Prices may be subject to local taxes which are calculated during checkout

Figure 1: Illustration of a linking attack.
Figure 2: Computation of the ICI and correct predictability of genotypes.
Figure 3: Estimates of ICI leakage versus predictability.
Figure 4: Genotype-expression associations and linking attacks.
Figure 5: Accuracy of linking attacks.

Similar content being viewed by others

References

  1. Sboner, A., Mu, X., Greenbaum, D., Auerbach, R.K. & Gerstein, M.B. The real cost of sequencing: higher than you think! Genome Biol. 12, 125 (2011).

    Article  Google Scholar 

  2. Rodriguez, L.L., Brooks, L.D., Greenberg, J.H. & Green, E.D. The complexities of genomic identifiability. Science 339, 275–276 (2013).

    Article  CAS  Google Scholar 

  3. Erlich, Y. & Narayanan, A. Routes for breaching and protecting genetic privacy. Nat. Rev. Genet. 15, 409–421 (2014).

    Article  CAS  Google Scholar 

  4. Sweeney, L., Abu, A. & Winn, J. Identifying participants in the Personal Genome Project by name. Social Science Research Network 10.2139/ssrn.2257732 (2013).

  5. Golle, P. Revisiting the uniqueness of simple demographics in the US population. in Proc. 5th ACM Workshop on Privacy in Electronic Society 77–80 (ACM, 2006).

  6. Golle, P. Revisiting the uniqueness of simple demographics in the US population. in Proc. 5th ACM Workshop on Privacy in Electronic Society 77–80 (ACM, 2006).

  7. GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585 (2013).

  8. Ardlie, K.G. et al. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348, 648–660 (2015).

    Article  Google Scholar 

  9. Pakstis, A.J. et al. SNPs for a universal individual identification panel. Hum. Genet. 127, 315–324 (2010).

    Article  Google Scholar 

  10. Wei, Y.L., Li, C.X., Jia, J., Hu, L. & Liu, Y. Forensic identification using a multiplex assay of 47 SNPs. J. Forensic Sci. 57, 1448–1456 (2012).

    Article  CAS  Google Scholar 

  11. Gymrek, M., McGuire, A.L., Golan, D., Halperin, E. & Erlich, Y. Identifying personal genomes by surname inference. Science 339, 321–324 (2013).

    Article  CAS  Google Scholar 

  12. Homer, N. et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet. 4, e1000167 (2008).

    Article  Google Scholar 

  13. Im, H.K., Gamazon, E.R., Nicolae, D.L. & Cox, N.J. On sharing quantitative trait GWAS results in an era of multiple-omics data and the limits of genomic privacy. Am. J. Hum. Genet. 90, 591–598 (2012).

    Article  CAS  Google Scholar 

  14. Lunshof, J.E., Chadwick, R., Vorhaus, D.B. & Church, G.M. From genetic privacy to open consent. Nat. Rev. Genet. 9, 406–411 (2008).

    Article  CAS  Google Scholar 

  15. Church, G. et al. Public access to genome-wide data: five views on balancing research with privacy and protection. PLoS Genet. 5, e1000665 (2009).

    Article  Google Scholar 

  16. Narayanan, A. & Shmatikov, V. Robust de-anonymization of large sparse datasets. in Proc. IEEE Symposium on Security and Privacy 111–125 (IEEE, 2008).

  17. Lappalainen, T. et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511 (2013).

    Article  CAS  Google Scholar 

  18. Abecasis, G.R. et al. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2015).

    Google Scholar 

  19. Erlich, Y. et al. Redefining genomic privacy: trust and empowerment. PLoS Biol. 12, e1001983 (2014).

    Article  Google Scholar 

  20. Sweeney, L. k-anonymity: a model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10, 557–570 (2002).

    Article  Google Scholar 

  21. Ninghui, L., Tiancheng, L. & Venkatasubramanian, S. t-closeness: privacy beyond k-anonymity and l-diversity. in Proc. IEEE 23rd International Conference on Data Engineering 106–115 (IEEE, 2007).

  22. Machanavajjhala, A., Gehrke, J., Kifer, D. & Venkitasubramaniam, M. λ-diversity: privacy beyond k-anonymity. Proc. 22nd International Conference on Data Engineering 24 (IEEE, 2006).

  23. Meyerson, A. & Williams, R. On the complexity of optimal K-anonymity. in Proc. 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems 223–228 (ACM, 2004).

  24. Dwork, C. Differential privacy. in Proc. 33rd International Colloquium on Automata, Languages and Programming 1–12 (Springer-Verlag, 2006).

  25. Fredrikson, M. et al. Privacy in pharmacogenetics: an end-to-end case study of personalized warfarin dosing. in Proc 23rd USENIX Security Symposium 17–32 (USENIX, 2014).

  26. Cover, T.M. & Thomas, J.A. Elements of Information Theory 2nd edn. (John Wiley & Sons, 2005).

  27. Shabalin, A.A. Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics 28, 1353–1358 (2012).

    Article  CAS  Google Scholar 

Download references

Acknowledgements

We acknowledge support from the US National Institutes of Health and from the A.L. Williams Professorship funds. We thank A. Serin Harmanci for constructive comments and discussions on the study design and running of external tools. We also thank D. Clarke for comments on the manuscript.

Author information

Authors and Affiliations

Authors

Contributions

A.H. designed the study, gathered data sets, performed experiments and drafted the manuscript. M.G. conceived the study, oversaw the experiments and wrote the manuscript. Both authors approved the final version of the manuscript.

Corresponding author

Correspondence to Mark Gerstein.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Schematic comparison of linking attacks and detection of a genome in a mixture of attacks.

(a) Each box in the figure represents a dataset in the form of a matrix. Multiple boxes next to each other correspond to concatenation of matrices. Linking attacks aim at linking genotype and phenotype datasets. The phenotype datasets contain both “predicting” phenotypes and other phenotypes, some of which can be sensitive. The attacker first predict genotypes using each of the predicting phenotype. The predicted genotypes are then compared with the genotypes in the genotype dataset. After the linking, all the datasets are concatenated where the identifiers can be matched to the sensitive phenotypes. Different colors indicate how the linking merges different information. (b) The detection of a genome in a mixture attacks start with a genotype dataset. The attacker gets access to the statistics of a GWAS or genotyping dataset (for example, regression coefficients or allele frequencies). Then the attacker generates a statistic and tests it against that of a reference population. The testing result can be converted into the study membership indicator (attended/not attended) which shows whether or not the tested individual was in the study cohort.

Supplementary Figure 2 Illustration of the expression and genotype data sets.

Variant genotype dataset contains the genotypes for q eQTL variants for nv individuals. jth entry for kth eQTL is denoted by vk,j . Similarly, the expression dataset contains the expression levels for q genes. The kth expression level for jth individual is denoted by ek,j. The variant genotypes for kth variant are distributed over samples in accordance with the random variable Vk. Likewise, the expression levels for kth gene is distributed per random variable Ek. These random variables are correlated with each other with correlation coefficient, denoted by p(Ek, Vk) (right).

Supplementary Figure 3 The attacker’s presumed strategy for a linking attack.

(a),(b) The phenotype and variant pairs are sorted by descending absolute correlations values. For the top n pairs, joint predictability and ICI are computed. (c) The average joint predictability of genotypes versus the average cumulative ICI leakage for multiple eQTLs. The error bars (one standard deviation) for ICI and predictability are shown on the real eQTLs.

Supplementary Figure 4 Illustration of computation of the individual characterizing information (ICI) and correct predictability of genotypes.

ICI for a set of n variant genotypes is computed using the genotype distributions of the variants, as illustrated by the histogram plots under each variant. Each genotype contributes to ICI additively with the logarithm of reciprocal of the genotype frequency (illustrated by the genotype distributions). Given an eQTL where genotype of variant V1 is correlated to expression of gene 1 (E1), the predictability of the genotype given expression level is e is computed in terms of the entropy of conditional genotype distribution, given expression level e. The conditional distribution is built by slicing the joint distribution at expression level e.

Supplementary Figure 5 Illustration of genotype prediction and prediction accuracy.

(a) Illustration of prior, joint, and posterior distributions of genotypes and expression levels. The leftmost figure shows the distribution of genotypes over the sample set, which is labelled as the prior distribution. The middle figure shows the joint distribution of genotypes and expression levels with a significant negative correlation between genotype values and the expression levels. The rightmost figure shows the posterior distribution of genotypes given that the gene expression level is 10. The posterior distribution has a maximum (MAP prediction) at genotype 2, which is indicated by a star. (b) The number of selected and average correctly predicted eQTL genotypes with changing absolute correlation threshold. The error bars (one standard deviation) are shown for correctly predicted eQTL genotypes.

Supplementary Figure 6 Models of joint genotype-expression distribution with varying numbers of parameters for a positively correlated eQTL.

(a) The true genotype-expression distribution. Grey boxes represent the expression distributions given different genotypes. Red line indicates the gradient of correlation between genotype and expression. (b) First simplification of the joint distribution. The expression distribution can be modeled with Gaussians with different means and variances with total of 6 parameters. (c) Simplification of joint distribution with equal variances. The variances can be assumed same for different genotypes, resulting in a 4-parameter model. (d) A representation of the uniform expression distribution given genotypes, where 4 parameters are required. The conditional distribution of expression is uniform (blue rectangles) over the ranges (e1, e2), (e2, e3), and (e3, e4) given genotypes 0, 1, and 2, respectively. The transparent grey rectangles show the original distributions. (e) A simplification of (d) where conditional probability of expression is zero given genotype is 1. In this model, only one parameter (emid) is necessary. The conditional probability of expression given genotypes 0 and 2 are uniform for expression levels below emid and above emid, respectively (shown with blue rectangles). The original distribution is shown with grey rectangles for comparison. Extremity-based prediction uses an instantiation of the model in (e).

Supplementary Figure 7 The median absolute gene expression extremity statistics over 462 individuals in the GEUVADIS data set.

(a) For each individual, the extremity is computed over all the genes (23,662 genes) reported in the expression dataset. The median of the absolute value of the extremity is plotted. X-axis shows the sample index and y-axis shows the extremity. The absolute median extremity fluctuates around 0.25, which is exactly the midpoint between minimum and maximum values of absolute extremity. (b) The plot shows the extremity threshold versus the median number of genes (over 462 individuals) above the extremity threshold. Around half of the genes (indicated by dashed yellow lines) have higher than 0.3 extremity on average over all the individuals. Also, around 1000 genes have higher than 0.45 extremity over all individuals (indicated by green dashed lines). (c) Accuracy of extremity based genotype prediction with changing absolute correlation threshold. (d) The linking accuracy with changing absolute extremity (x-axis) and absolute correlation thresholds (y-axis). The heatmap colors indicate the accuracy.

Supplementary Figure 8 A representative example of extremity-based linking.

The phenotype dataset (Consisting of gene expression levels for 6 genes) is shown above. Each phenotype measurement is represented by blue (negative extreme), yellow (positive extreme), or grey (non-extreme) dots. Based on the extremity of phenotypes, the attacker performs prediction of genotypes, which are shown below in (2). She uses the eQTL dataset (with genes and SNPs) for prediction. Blue and brown triangles correspond to the correct genotype predictions. The grey crosses correspond to the incorrect or unavailable genotype predictions. The attacker compares the predicted genotypes to the genotype dataset in (3), where triangles show the genotypes, and performs linking. 3 individuals (Bob, Alice, and John) are highlighted. The attacker can link Bob and John by matching them to their genotypes using the 4 SNPs indicates by pink shaded rectangles. The correct prediction of rs7274244 (in yellow dashed rectangle) enables the attacker to distinguish between correct entries and reveal both of their disease status as positive. For Alice, the predicted genotypes are equally matching to two entries both of which match at 2 genotypes (Out of the 3 SNPs indicated by grey shaded rectangles). In addition, the matching entries, PID-b and PID-k, have negative and positive disease status, respectively. Thus the attacker cannot exactly reveal Alice’s disease status.

Supplementary Figure 9 Illustration of linking for the jth individual.

The attacker first predicts the genotypes (Ṽj) which are then used to compute the distance to all the individuals in the genotype dataset. The computed distances are then sorted in increasing order. The top-matching individual with smallest distance (in the example, individual a) is assigned as the linked individual. The first distance gap, d1,2, is computed as the difference between the second (dj,(2)) and the first (dj,(1)) distances in the sorted list.

Supplementary Figure 10 The accuracy of linking attacks under different scenarios.

(a) The distribution of ranks for close relatives (blue) and for random individuals (red) in the linking in 30 HAPMAP CEU trio dataset. Assigned rank is shown in x-axis and frequency is shown on y-axis. (b) The positive predictive value (PPV) versus sensitivity with changing i1,2 threshold for the eQTL selection in (c) where linking accuracy is around 70%, indicated by dashed yellow line. The grey dashed line marks the 95% PPV. The magenta lines show the same plot for random threshold selections. (c) The accuracy of linking attack when the eQTLs are discovered on the training set of 210 individuals and linking is performed on testing set of 211 individuals. The association strength (for eQTL selection) as reported by Matrix eQTL is plotted on x-axis and linking accuracy is plotted on y-axis. (d) The accuracy of linking when the simulated set of 100,211 individuals are used in the genotype dataset, using the eQTLs identified in training sample set in (a).

Supplementary Figure 11 Illustration of risk-assessment procedure for joint genotyping and phenotyping data generation.

There are two paths of risk assessment to be performed. The first path evaluates the risks associated with release of the QTL datasets. The genotype and phenotype data (on the left) are first used for quantitative trait loci identification (QTL identification box). This generates the significant QTLs. These are then utilized, in addition to the list of external QTL databases, in quantification of leakage versus predictability, as presented in Section 2.2. These results are then relayed to the risk assessment procedures. The second risk assessment procedure evaluates the release of genotype and phenotype datasets. For this, the datasets are tested under a set of linking attacks for evaluation of characterization risks. The results are then relayed to risk assessment procedures.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–11, Supplementary Table 1 and Supplementary Note 1 (PDF 2232 kb)

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Harmanci, A., Gerstein, M. Quantification of private information leakage from phenotype-genotype data: linking attacks. Nat Methods 13, 251–256 (2016). https://doi.org/10.1038/nmeth.3746

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1038/nmeth.3746

This article is cited by

Search

Quick links

Nature Briefing: Translational Research

Sign up for the Nature Briefing: Translational Research newsletter — top stories in biotechnology, drug discovery and pharma.

Get what matters in translational research, free to your inbox weekly. Sign up for Nature Briefing: Translational Research