Imputing gene expression from selectively reduced probe sets


Measuring complete gene expression profiles for a large number of experiments is costly. We propose an approach in which a small subset of probes is selected based on a preliminary set of full expression profiles. In subsequent experiments, only the subset is measured, and the missing values are imputed. We developed several algorithms to simultaneously select probes and impute missing values, and we demonstrate that these 'probe selection for imputation' (PSI) algorithms can successfully reconstruct missing gene expression values in a wide variety of applications, as evaluated using multiple metrics of biological importance. We analyze the performance of PSI methods under varying conditions, provide guidelines for choosing the optimal method based on the experimental setting, and indicate how to estimate imputation accuracy. Finally, we apply our approach to a large-scale study of immune system variation.
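The select-then-impute workflow described above can be illustrated with a minimal sketch. This is not the PSI software or any of the authors' algorithms: it assumes a simple greedy forward selection paired with ordinary least-squares imputation, runs on synthetic data, and uses hypothetical function names (`fit_imputer`, `impute`, `greedy_select`, `demo`) chosen only for illustration.

```python
import numpy as np


def fit_imputer(train, selected):
    """Fit a least-squares map from the selected probes (plus intercept) to all probes."""
    X = np.column_stack([train[:, selected], np.ones(len(train))])
    coef, *_ = np.linalg.lstsq(X, train, rcond=None)
    return coef


def impute(measured, coef):
    """Reconstruct full profiles from measurements of the selected probes only."""
    X = np.column_stack([measured, np.ones(len(measured))])
    return X @ coef


def greedy_select(train, k):
    """Add, one at a time, the probe that most reduces training reconstruction error."""
    selected = []
    for _ in range(k):
        best, best_err = None, np.inf
        for j in range(train.shape[1]):
            if j in selected:
                continue
            cand = selected + [j]
            coef = fit_imputer(train, cand)
            err = np.mean((impute(train[:, cand], coef) - train) ** 2)
            if err < best_err:
                best, best_err = j, err
        selected.append(best)
    return selected


def demo(seed=0):
    """Synthetic example: 20 probes driven by 3 latent factors (an assumption)."""
    rng = np.random.default_rng(seed)
    factors = rng.normal(size=(60, 3))
    loadings = rng.normal(size=(3, 20))
    data = factors @ loadings + 0.05 * rng.normal(size=(60, 20))
    train, test = data[:40], data[40:]  # preliminary full profiles vs. later experiments

    selected = greedy_select(train, k=3)
    coef = fit_imputer(train, selected)
    imputed = impute(test[:, selected], coef)

    mse = float(np.mean((imputed - test) ** 2))
    # Baseline: predict every probe by its mean in the preliminary set.
    baseline = float(np.mean((train.mean(axis=0) - test) ** 2))
    return selected, mse, baseline


if __name__ == "__main__":
    selected, mse, baseline = demo()
    print("selected probes:", selected)
    print("imputation MSE:", mse, "mean-baseline MSE:", baseline)
```

In this toy setting the held-out error of the imputed profiles can be compared against the mean-profile baseline, which loosely mirrors the idea of estimating imputation accuracy before committing to a reduced probe set. Real expression data would call for regularization and a more careful evaluation than this sketch provides.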

Figure 1: An integrated approach to probe selection and imputation.
Figure 2: Relative and absolute imputation accuracy.
Figure 3: Additional evaluation metrics of biological relevance.
Figure 4: Cost-benefit analysis to determine the optimal number of selected probes and modular decomposition subsets.
Figure 5: The samples, probe ratio, linearity (SPRL) predictor and ImmVar results.

Accession codes


Gene Expression Omnibus




This work was supported by grant RC2 GM093080 (funded through the American Recovery and Reinvestment Act) from the US National Institutes of Health–National Institute of General Medical Sciences to C.B. and D.K. We thank I. Amit and J. Ye for useful comments on this manuscript.

Author information


Y.D. and D.K. designed the methods; Y.D. implemented the methods, wrote the code, performed the experiments and analyzed the data; T.F. and C.B. provided data and gave feedback on the results; Y.D. and D.K. wrote the manuscript; C.B. reviewed and commented on the manuscript.

Corresponding author

Correspondence to Daphne Koller.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–8, Supplementary Table 1, Supplementary Results and Supplementary Note (PDF 1946 kb)

Supplementary Data

The data sets used in the experiments described in the paper, in the file format used by the PSI software. (ZIP 11389 kb)

Supplementary Software

PSI software. (ZIP 16 kb)


About this article

Cite this article

Donner, Y., Feng, T., Benoist, C. et al. Imputing gene expression from selectively reduced probe sets. Nat Methods 9, 1120–1125 (2012).
