Abstract
Missing data are an unavoidable complication in many machine learning tasks. When data are ‘missing at random’ there exist a range of tools and techniques to deal with the issue. However, as machine learning studies become more ambitious, and seek to learn from ever-larger volumes of heterogeneous data, an increasingly encountered problem arises in which missing values exhibit an association or structure, either explicitly or implicitly. Such ‘structured missingness’ raises a range of challenges that have not yet been systematically addressed, and presents a fundamental hindrance to machine learning at scale. Here we outline the current literature and propose a set of grand challenges in learning from data with structured missingness.
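As a rough illustration of the distinction drawn above, the sketch below contrasts missing values dropped independently with missing values that occur together in blocks. It is a toy simulation, not the authors' method: the two variables A and B, the 30% missingness rate, the function names `make_masks` and `indicator_correlation`, and the use of a Pearson correlation between missingness indicators as a measure of "association" are all assumptions made for this example. Independent dropout is also only one simple special case of unstructured missingness.

```python
import random


def make_masks(n_rows, structured, seed=0):
    """Return per-row missingness indicators (1 = missing) for two
    hypothetical variables A and B."""
    rng = random.Random(seed)
    masks = []
    for _ in range(n_rows):
        if structured:
            # Block-wise missingness: a row from a hypothetical data source
            # that recorded neither A nor B is missing both together.
            block_missing = int(rng.random() < 0.3)
            masks.append((block_missing, block_missing))
        else:
            # Unstructured: each value is dropped independently
            # with the same probability.
            masks.append((int(rng.random() < 0.3), int(rng.random() < 0.3)))
    return masks


def indicator_correlation(masks):
    """Pearson correlation between the two missingness indicators."""
    n = len(masks)
    mean_a = sum(a for a, _ in masks) / n
    mean_b = sum(b for _, b in masks) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in masks) / n
    var_a = sum((a - mean_a) ** 2 for a, _ in masks) / n
    var_b = sum((b - mean_b) ** 2 for _, b in masks) / n
    return cov / (var_a * var_b) ** 0.5


random_corr = indicator_correlation(make_masks(5000, structured=False))
structured_corr = indicator_correlation(make_masks(5000, structured=True))

print(f"independent dropout: {random_corr:.3f}")
print(f"block-wise dropout:  {structured_corr:.3f}")
```

Under these assumptions the indicators are close to uncorrelated in the independent case and perfectly correlated in the block-wise case. Real structured missingness is rarely this clean, so a check of this kind is best read as a first diagnostic rather than a definition.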
Acknowledgements
This work was sponsored by the Turing-Roche Strategic Partnership. We thank C. Matus for her talents in figure illustrations and design and V. Hellon for her expert community management.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Subho Majumdar, Girmaw Abebe Tadesse and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
About this article
Cite this article
Mitra, R., McGough, S.F., Chakraborti, T. et al. Learning from data with structured missingness. Nat Mach Intell 5, 13–23 (2023). https://doi.org/10.1038/s42256-022-00596-z
DOI: https://doi.org/10.1038/s42256-022-00596-z