Despite living in the richest economy in the world, children in the United States have worse health outcomes as assessed by quantitative, standardized metrics than children in other upper income countries.1 Improvements in child health outcomes over the past 30 years associated with availability of vaccines against common childhood diseases, greater chance of survival after preterm birth, and improved nutrition have not reduced race-, ethnicity-, geography-, and poverty-based disparities in child health.2,3 Access to large medical, biological, and environmental data sets and progress in data science provide multidimensional data and data analytics that more fully capture the dynamic and complex interactions among genetic background, culture, social determinants of health, development, environment, and biologic risk.1,4,5,6,7,8 These new tools build upon the progress from systems biology omics-based algorithms in which mechanistic explanations of outcomes were anchored in biologic problems to permit integration of environmental (e.g., pollution), cultural, demographic, and geographic factors that interact with child development and biology to create adverse health outcomes.6,9,10 For example, machine learning algorithms (approaches that improve automatically with additional data experience), a branch of artificial intelligence,11 recognize and apply patterns in multidimensional data that train models for prediction and stratification of disease risk based on biologic risk as well as developmental, cultural, and environmental risk determinants of adverse child health outcomes.5,10,12,13,14

The epidemiology of child health (children are generally healthier than adults, represent a smaller fraction of the population (22% <18 years of age in the United States),15 have a higher risk of rare diseases, and have an increasing prevalence of medical complexity (3–5%)),15 the lack of progress in reducing disparities in child health outcomes, and the difficulties in applying adult-based strategies to improve child health outcomes highlight the potential for Big Data and new data science to impact child health. By combining data from traditional sources, e.g., electronic health records, imaging results, biobanks/registries, and omics measurements, with demographic, cultural, and environmental sources, artificial intelligence-based data science methods may identify previously unrecognized patterns associated with childhood disease without starting with an a priori hypothesis.6 Recent reports have described combining Big Data from different sources in pediatric oncology, nephrology, and sepsis diagnosis.16,17,18 However, these reports underline both the potential and the difficulties of combining data sets with limited data dimensions.19 For example, in pediatric oncology, combining existing institutional registries improves descriptions of natural history and responses to therapies but does not provide insight into contributions of environment, geography, or other less studied factors to disease risk, response to therapy, or prognosis.16 Similarly, in pediatric nephrology, development of quantitative definitions of acute kidney injury and descriptions of acute kidney injury epidemiology have not yet fully integrated omics or other potentially informative data elements.17 Big Data strategies for diagnosis of sepsis in low- and middle-income countries have helped improve quality of epidemiologic data but have not identified pathophysiologic or structural proposals to improve outcomes.18 Big Data and data science can accelerate improvements in child health outcomes through reductions in health disparities, improvements in clinical best practices and in quality and safety outcomes, prediction of individual risk of disease progression and response to therapies, increased family and patient involvement in health care diagnosis and decisions, and discovery of personalized disease mechanisms. We will begin with a brief description of the characteristics of Big Data and data science strategies that will enable improvement in best practices and discovery of disease mechanisms, followed by examples of pediatric Big Data initiatives.

Big Data characteristics and strategies for analysis


Data production in healthcare is high volume and complex, encompassing a wide variety of inputs and formats including administrative, biomarker (e.g., genomic), physiologic, biometric, laboratory, and imaging. Data may be derived from multiple sources such as electronic health records, medical devices, mobile health platforms, clinical registries, biobanks, and patient self-reports.5 Within each of these sources, many different data types exist that can be further categorized as structured (e.g., demographics, laboratory results), unstructured (e.g., free text in notes and comments), and semi-structured (a combination of structured and unstructured data).20

The fundamental features of Big Data were originally defined in the early 2000s by the 3V model,21 which includes Volume, Variety, and Velocity. This model has since been extended to 6Vs, adding Variability, Veracity, and Value (Fig. 1).4,21,22,23,24,25

Fig. 1: Big Data features defined by the 6V model. Descriptions of each Big Data feature.

With these core features in mind, healthcare data can be further characterized by many quantitative properties that have been previously categorized by Shilo et al. into 7 axes of health data5 (Fig. 2). “Sample size,” a representation of volume, is important for achieving sufficient statistical power. However, data with an n of 1 (rare diseases) can also provide value in the definition of disease trajectory and clinical response.26 “Depth of phenotyping” describes the variety of medical data used to characterize individuals and ranges from the molecular level (e.g., omics, microbiome) to the social level (e.g., demographics, lifestyle, environment, and social determinants of health). Integration of this array of data into a valuable understanding of health can be challenging due to lack of data interoperability and of common definitions. “Longitudinal follow-up” is critical for child health and includes data gathering over different time points, a characterization of variability. The Barker hypothesis27 is a prime example of the importance of life course tracking to improve child and adult outcomes. Similarly, outcomes of underrepresented minorities can be improved with advocacy for engagement and adherence to follow-up in order to obtain complete long-term data.1 “Interactions between subjects” is an application of value to find the connections between subjects (e.g., shared environments, twins), which can increase the statistical power to discover disease mechanisms. “Heterogeneity and diversity” of a cohort to include appropriate representation of the real-world population (factors such as age, sex, ethnicity, socioeconomic status, exposure to different social determinants of health) contribute to the variety of data. “Standardization of data” and “linkage between data sources” are vital to veracity and tackling the challenge of having variety in data.

Fig. 2: Quantitative properties represent the complexity of healthcare data. Descriptions of the 7 axes of health data. Adapted from Shilo et al.5

Analysis strategies

As this wide array of data is gathered, finding meaningful results requires application of appropriate analytic strategies, the basis of data science. Three overarching approaches to analysis have been used: descriptive, predictive, and counterfactual.28,29 Descriptive analysis utilizes conventional parametric and non-parametric statistics to provide quantitative estimates of central tendency and variability. Strengths of this approach include the ability to condense large amounts of data into single summative metrics and a high degree of explainability. It is the most common form of analysis for real-time and historical data used in clinical research and is frequently the basis for intervention guidelines and protocols.29 Predictive analysis utilizes observational data to identify relational patterns between variables, either correlation or anti-correlation. Additional variables can be added to the model to control for factors which independently influence outcomes of interest. Once a model has been constructed, it can be “reversed” by entering values for each of the variables to generate a prediction of the probability of the outcome (for binary models) or an estimated value of the outcome (in continuous models).
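The "reversal" step for a binary model can be sketched as follows. This is a minimal illustration that assumes a logistic regression whose coefficients have already been estimated; the variable names and coefficient values are hypothetical and do not come from any published model.

```python
import math

# Hypothetical, illustrative coefficients (log-odds per unit); not from a real model.
INTERCEPT = -3.0
COEFFICIENTS = {
    "lactate_mmol_per_l": 0.5,
    "on_ventilator": 1.0,  # 1 = yes, 0 = no
    "age_years": -0.1,
}


def predicted_probability(patient):
    """Enter one patient's values into the fitted model; return P(outcome)."""
    log_odds = INTERCEPT + sum(
        coefficient * patient[name] for name, coefficient in COEFFICIENTS.items()
    )
    # The logistic function maps log-odds onto a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-log_odds))


patient = {"lactate_mmol_per_l": 4.0, "on_ventilator": 1, "age_years": 2.0}
probability = predicted_probability(patient)
print(f"Predicted probability of the outcome: {probability:.2f}")
```

For a continuous model, the same pattern would apply without the logistic transformation, so that the weighted sum itself is the estimated value of the outcome.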

In contrast to associations identified by descriptive and conventional predictive modeling, counterfactual prediction analysis is the foundation of causal analysis and inference. In this approach, one starts with the outcome and works backwards to the model inputs to evaluate how changes in the input might reverse or “flip” the outcome. Counterfactual explanations describe the smallest change to the input variables that causes a change from one predicted outcome to another. It is currently the least commonly used analytic approach. However, it has great potential to answer causal questions, for example in the genetic analysis of complex diseases.30
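The idea of a "smallest change that flips the outcome" can be illustrated with a deliberately simple case. The sketch below assumes a linear score whose sign determines the predicted class, inputs that are already on comparable (for example, standardized) scales, and a search restricted to changing one variable at a time. All names and numbers are hypothetical.

```python
def smallest_single_variable_change(weights, intercept, inputs):
    """Return (variable, change) for the smallest one-variable change that
    moves a linear score to the decision boundary (score == 0)."""
    score = intercept + sum(weights[name] * inputs[name] for name in weights)
    # For a linear score, changing variable j by -score / w_j lands on the boundary.
    changes = {
        name: -score / weight for name, weight in weights.items() if weight != 0
    }
    variable = min(changes, key=lambda name: abs(changes[name]))
    return variable, changes[variable]


# Hypothetical standardized inputs and weights, for illustration only.
weights = {"biomarker_a": 2.0, "biomarker_b": -0.5}
inputs = {"biomarker_a": 1.0, "biomarker_b": 1.0}
variable, change = smallest_single_variable_change(weights, -1.0, inputs)
print(variable, change)
```

The returned change places the patient exactly on the decision boundary, so any change slightly beyond it flips the predicted class. Nonlinear models generally require a numerical search rather than this closed-form step.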

Machine learning analysis

Machine learning refers to computer algorithms which identify salient variables or “features” from a large pool of candidate variables which, in association, predict outcomes with the greatest accuracy.31 Although machine learning is widely considered an “advanced” technique, it encompasses a broad array of approaches ranging from simple, conventional regression modeling through artificial neural networks and deep learning. The strategy for utilizing supervised machine learning is common across model types: first, the system is “trained” by exposing it to a representative sample cohort of patients with known and labeled outcomes. As the system learns with each pass through the training samples, predictive accuracy improves. Once training is complete, the system is validated using a separate cohort of patients never before seen by the model.

As previously noted, “machine learning” is a broad term encompassing many different computational strategies including neural networks, decision trees, support-vector machines, and deep learning. Neural networks are loosely patterned after human neural networks: each variable is conceived as a node where the pathway through the network diverges based on the value at the node before eventual mapping to the outcome at the end of the network. In a tree-based system, the algorithm makes observations of individual patients and develops a decision tree, a branching graph where the value of a given variable increases or decreases the predicted probability of the outcome and is also influenced by the path through prior branches in the tree.32,33 As the algorithm is exposed to more examples, the weights at each node or branch point are iteratively increased or decreased until accuracy can no longer be increased. Not surprisingly, machine learning systems generally achieve their best performance when extremely large datasets are available for training.
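The train-then-validate cycle and the iterative adjustment of weights can be sketched with the simplest model in the machine learning family, a logistic regression fitted by stochastic gradient descent. The cohort below is synthetic (one made-up feature, with the label defined by its sign), and the learning rate and number of passes are arbitrary assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, labels, epochs=50, learning_rate=0.1):
    """Adjust weights a little after every labeled example, for several passes."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            predicted = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            error = predicted - y  # positive if the model over-predicts
            weights = [w - learning_rate * error * v for w, v in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def accuracy(weights, bias, rows, labels):
    correct = 0
    for x, y in zip(rows, labels):
        predicted = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
        correct += int((predicted >= 0.5) == bool(y))
    return correct / len(rows)


# Synthetic cohort for illustration: label is 1 when the single feature is positive.
rng = random.Random(0)
rows = [[rng.uniform(-1.0, 1.0)] for _ in range(200)]
labels = [1 if row[0] > 0 else 0 for row in rows]

# Train on the first 150 patients; validate on 50 the model has never seen.
weights, bias = train(rows[:150], labels[:150])
validation_accuracy = accuracy(weights, bias, rows[150:], labels[150:])
print(f"Validation accuracy: {validation_accuracy:.2f}")
```

Reporting accuracy only on the held-out cohort is the point of the design: performance on the training samples alone would overstate how well the model generalizes.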

A significant drawback of machine learning analysis is the challenge of explaining the findings of the algorithm. As opposed to traditional regression modeling, where each variable is actively chosen by the investigator, and the relationship to the outcome is quantified in understandable units, machine learning decisions are made by a computer based on optimizing outcome prediction. In “unsupervised” machine learning, outcome labels are not provided to the computer, and grouping decisions are made without human input. While this approach offers the potential to discover previously unknown connections between variables, it may also result in convoluted and clinically implausible relationships.
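As a small illustration of grouping without outcome labels, the sketch below runs a one-dimensional, two-group k-means procedure on made-up measurements. The choice of k-means, the starting centers, and the data are assumptions for illustration; the text does not name a specific unsupervised method.

```python
def two_group_kmeans(values, iterations=10):
    """Split unlabeled values into two groups by 1-D k-means."""
    centers = [min(values), max(values)]  # simple deterministic starting points
    for _ in range(iterations):
        groups = ([], [])
        for value in values:
            # Assign each value to whichever center is currently closer.
            nearest = 0 if abs(value - centers[0]) <= abs(value - centers[1]) else 1
            groups[nearest].append(value)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Made-up measurements with no outcome labels attached.
values = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
centers, groups = two_group_kmeans(values)
print(centers, groups)
```

The algorithm returns two groups regardless of whether the split is clinically meaningful, which is why results of this kind still need expert interpretation.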

Big Data challenges

The rapid growth of Big Data utilization in healthcare has unmasked many limitations and challenges. Data quality, accuracy, completeness, and availability are major hurdles in using large healthcare datasets4,34 and can lead to inaccurate analysis,1 biased inference, and false discoveries.35 Similarly, management and storage of large amounts of data present challenges with maintaining data security and accuracy over time, archiving data, managing data warehouses, and removing/disposing of information. Applying the most appropriate analytic approach to Big Data requires understanding of the technical and quality limitations of Big Data.

Data integration is both a key challenge and a critical component for improving seamless access to robust, reproducible, and diverse sources of data.36 Sharing data across institutions, for example, can be difficult due to differences in data types, definitions, and formats. Data security and privacy are also concerns when clinical information is shared, restricting the use of patient identifying information. Addressing these issues will require advancements in the standardization of data. The FAIR Guiding Principles (Findability, Accessibility, Interoperability, and Reusability) of data management provide guidelines for data production and data publishing focused on maximizing data quality and usability and are required by the National Institutes of Health for data management and sharing plans for all award applications submitted after January 25, 2023.37,38 Improvements in data management which incorporate these fundamental principles will be crucial to harnessing the potential of clinical Big Data.

Examples of Big Data initiatives and strategies for improvement of child health

Although still early in development, pediatric-specific Big Data projects have begun to emerge that include data commons for integration and interrogation of data from inpatient and ambulatory cohorts of children and electronic health record- or vital sign-based data for development of disease risk scores. Here we highlight examples of pediatric Big Data initiatives. Other multi-institutional pediatric data networks are summarized in Table 1.

Table 1 Pediatric Big Data networks.

Genomic Information Commons

Genomic Information Commons (GIC) is a National Institutes of Health (NIH)-funded, multi-institutional effort to provide an extensive, linked database of genotype, phenotype, biospecimens, and electronic health record-derived metadata in a highly accessible, federated database.39 A cooperative effort that has recently expanded from three to nine academic pediatric centers in the United States, it leverages robust, easy to use computational infrastructure for preliminary data queries, retention and control of all data and biospecimens at member institutions, scalable inclusion of additional member institutions, executed, inter-institutional data use and material transfer agreements, active participation by patients and families in defining network operations and research priorities, and large and diverse patient populations for genomic discovery and identification of potential therapies. Considerable effort has been made to maximize the value and usability of the data for investigators while maintaining high privacy and security standards.


PEDSnet

PEDSnet is a national pediatric learning health system with a multi-specialty network of collaborators from Children’s Hospitals across the United States.40 Using a common data model from its original funding source, the Patient-Centered Outcomes Research Institute (PCORI), this network formed a centralized data sharing environment with executed, institutional data use agreements to create large datasets of pediatric clinical data extracted from electronic health records which enable communities of patients and clinicians to perform research and quality improvement projects that improve child health.40,41 PEDSnet currently includes longitudinal clinical data from 2009 for over 6.5 million children, about 9% of all the children in the US.42

Clinical data from electronic health records are gathered in quarterly cycles from the contributing sites in a structured and templated manner. The PCORI Common Data Model ensures terminology and data details are standardized and permits interoperability with other PCORI-sponsored Clinical Data Research Networks. An extensive data quality assessment process is used for careful analysis of the quality and characteristics of the data from each participating site.43 Data issues are categorized,44,45 reviewed by data scientists, and discussed with each submitting site for resolution to address four dimensions of data quality: fidelity, consistency, accuracy, and completeness.43

Data security and privacy are addressed with several steps that include storage of limited datasets in the PEDSnet Data Coordinating Center without patient identifiers. When data are requested by researchers, the minimum necessary aggregated data are provided, and institution-specific information is combined into “counts.” As of June 2022, PEDSnet data had been used in 61 publications that address disease-specific, quality and safety, and coronavirus disease 2019 (COVID-19)-related questions.46,47,48,49


PhysioNet

PhysioNet is a large collection of clinical and physiologic data from both inpatient (e.g., traumatic brain injury) and ambulatory (e.g., gait analysis) venues that includes open-source tools for computational analysis.50 Initially established in 1999 with NIH support, it is now structured into three components:

  1. “PhysioBank”: an extensive archive of digitized physiologic signals from fetal, pediatric, and adult sources.

  2. “PhysioToolkit”: an open-source collection of tools for the processing and analysis of physiologic signals.

  3. Extensive documentation and tutorials for new and advanced users.

Of the 202 PhysioNet databases, 12 are fetal- or pediatric-specific. Although most database access is free and without restriction, a subset of databases requires registration and completion of a data use agreement. In addition to providing a data repository that is compliant with the FAIR guidelines, PhysioNet has helped to standardize multiple types of data file formats.50,51,52 This standardization expands the number of software tools that researchers can use to conduct analyses, including many free and open-source options, and provides equitable data access for researchers with limited resources such as those in low- and middle-income countries.

Electronic Health Record-based risk scores

Late onset neonatal sepsis is a significant contributor to morbidity and mortality. The nSOFA score53 is a sepsis prediction score, based upon the adult SOFA (Sequential Organ Failure Assessment) score and pSOFA (a pediatric variation developed for older children54) score, which utilizes elements from electronic health records to identify infants at high risk for sepsis related mortality. As with the adult SOFA score, a rise in the nSOFA score was highly correlated with sepsis related mortality—a difference that could be detected within 6–12 h of sepsis evaluation. This score has since been validated in a multi-center cohort of more than 600 infants55 with excellent performance, noting an area under the curve (AUC) of 0.88 for the prediction of mortality. The nSOFA score has also been shown to discriminate between survival and non-survival on the first day of life in extremely preterm infants.56,57 These examples highlight the potential for application of data science analytics on data extracted from electronic health records to generate useful tools for severity of illness stratification and targeted treatments.
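For readers less familiar with the AUC, it can be read as the probability that a randomly chosen non-survivor has a higher score than a randomly chosen survivor, with ties counted as one half. The sketch below computes it directly from that definition. The scores are invented for illustration and are not nSOFA data.

```python
def area_under_curve(scores_with_outcome, scores_without_outcome):
    """AUC as the fraction of (with, without) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for with_outcome in scores_with_outcome:
        for without_outcome in scores_without_outcome:
            if with_outcome > without_outcome:
                wins += 1.0
            elif with_outcome == without_outcome:
                wins += 0.5
    return wins / (len(scores_with_outcome) * len(scores_without_outcome))


# Invented risk scores, for illustration only.
non_survivor_scores = [4, 6, 8]
survivor_scores = [0, 1, 4, 2]
auc = area_under_curve(non_survivor_scores, survivor_scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of the two groups.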

In older, hospitalized children, identification and stratification of illness severity and need for critical care have used tools initially developed and validated in adult populations58 (e.g., Early Warning Scores (EWS) such as the National Early Warning Score (NEWS)).59,60 In a retrospective study, 2–3% of pediatric hospital admissions experienced cardiopulmonary arrest and required resuscitation.61 The Pediatric Early Warning System (PEWS) score62 provides a similar predictive model for the pediatric inpatient population. In subsequent validation, PEWS scores identified patients at risk of deterioration 12 h in advance of clinically apparent deterioration,63 reduced the risk of emergency response calls shortly after admission from the Emergency Department,64 improved timely and orderly transfer of patients to the ICU,65 and increased the number of days without medical codes outside the ICU.66

Physiology-based risk scores

Over the last two decades in the NICU, the use of multiple devices for monitoring physiology-based signals including the electrocardiogram (ECG), pulse oximetry, respiratory rate, arterial blood pressure, transcutaneous partial pressure of carbon dioxide (CO2), cerebral and organ oximetry (via near-infrared spectroscopy (NIRS)), and electroencephalogram (EEG) has increased. Despite this broad array of available data, clinicians use these physiologic biomarker data almost exclusively for in-the-moment decisions and most often from only one signal, such as targeted oxygen saturation thresholds to reduce risk of retinopathy of prematurity (ROP). Computational integration of individual or multiple monitored physiologic biomarkers over time may reveal unrecognized patterns of pathology.

For example, the Heart Rate Observation (HeRO) score uses ECG characteristics,67 comprising beat-to-beat variability in heart rate, accelerations, and decelerations (indicative of autonomic nervous system tone), to provide a continuous prediction of sepsis risk in the next 24 h.68 Extensive neonatal validation testing has demonstrated that the HeRO score outperforms more traditional laboratory or clinical assessments and that its use can reduce sepsis-related mortality by as much as 20%.69,70
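To make "beat-to-beat variability" concrete, the sketch below computes two widely used heart rate variability summaries from a short series of intervals between successive beats (RR intervals). These are generic statistics chosen for illustration and are not the HeRO algorithm, whose details are not given here. The interval values are invented.

```python
import math
import statistics


def sdnn(rr_intervals_ms):
    """Sample standard deviation of the RR intervals (overall variability)."""
    return statistics.stdev(rr_intervals_ms)


def rmssd(rr_intervals_ms):
    """Root mean square of successive differences (beat-to-beat variability)."""
    differences = [
        later - earlier
        for earlier, later in zip(rr_intervals_ms, rr_intervals_ms[1:])
    ]
    return math.sqrt(sum(d * d for d in differences) / len(differences))


# Invented RR intervals in milliseconds, for illustration only.
rr_intervals_ms = [400, 410, 395, 405, 400]
print(f"SDNN  = {sdnn(rr_intervals_ms):.2f} ms")
print(f"RMSSD = {rmssd(rr_intervals_ms):.2f} ms")
```

In practice such statistics would be computed over sliding windows of continuously monitored data, which is what allows a risk estimate to be updated over time.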

Using a similar approach to the HeRO score, several groups have used other quantitative physiologic biomarkers (e.g., heart rate variability) to predict adverse long term neurodevelopmental outcomes,71,72,73 moderate-severe MRI abnormalities, or death.72,74 Similarly, quantitative characteristics of continuous EEG monitoring can be used to predict the later occurrence of seizures75 and outcomes at 24 months76 and 5 years.77

Research gaps and future directions

Despite the potential of Big Data strategies to improve child health and patient safety by trans-institutional identification of rare patient phenotypes, adverse patient events (e.g., sentinel pediatric events or adverse drug reactions),78 and responses to therapies, significant barriers remain in the practical application of research strategies to real-world bedside care.79,80,81 One of the most significant barriers is the lack of a universal, interoperable, modular system for capturing and sharing medical data. Proprietary and institutionally cloistered electronic health record systems limit discovery of critical components of best clinical and nursing practices82 and of pediatric- and disease-specific patient characteristics, disease risk, and adverse events. In addition, the large volume of data generated during healthcare delivery requires intentional system design that optimizes future data usability and minimizes cost for data extraction, reformatting, and loss. Similarly, although linkage of individual medical records longitudinally across maternal/fetal, neonatal, child, and adult epochs would be of great value for discovery of fetal, neonatal, and childhood origins of pediatric and adult diseases, such a system remains largely impracticable unless all the care for an individual is obtained within a single health system across the life course. Even within open source electronic health record systems, multiple data formats which vary according to system and region reduce interoperability across institutions.83 The PRISM model described by Hirschfeld et al.84 provides a theoretical framework for the future of intentional system design that captures elements of the health phenotype across four dimensions (experience, performance, adaptability, potential) and results in a life course model of an individual’s health (an Ideal Health Prism) which could enable comprehensive study of the fetal and childhood origins of childhood and adult diseases.

Through several different programs, such as the Big Data to Knowledge (BD2K) program, the NIH has emphasized the importance of using FAIR principles to ensure availability of NIH-funded project data for the scientific community.37 Through resources such as GIC, PhysioNet, and PEDSnet, secure pediatric data with common formatting models are becoming available. However, compliance with Health Insurance Portability and Accountability Act of 1996 (HIPAA)-associated privacy rules often necessitates extensive manual data review and modification to ensure that all protected health information has been removed. For example, HIPAA protection requires removal or shifting of all elements of date and time (high value Big Data components); doing so in a truly random fashion, yet uniformly across elements to preserve linkage consistency, is both laborious and prone to introducing errors.
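One common way to reconcile random shifting with linkage consistency is to draw a single random offset per patient and apply it to every date in that patient's record, which preserves the intervals between events. The sketch below illustrates that idea only. The shift window, record layout, and identifiers are hypothetical, and whether any particular scheme satisfies HIPAA requirements is a separate question that this example does not address.

```python
import secrets
from datetime import date, timedelta

MAX_SHIFT_DAYS = 365  # hypothetical window; chosen only for illustration


def make_offsets(patient_ids, max_shift_days=MAX_SHIFT_DAYS):
    """Draw one random offset per patient, between -max and +max days."""
    return {
        patient_id: timedelta(
            days=secrets.randbelow(2 * max_shift_days + 1) - max_shift_days
        )
        for patient_id in patient_ids
    }


def shift_events(events, offsets):
    """Apply each patient's offset uniformly to all of that patient's dates."""
    return [
        (patient_id, label, when + offsets[patient_id])
        for patient_id, label, when in events
    ]


# Invented records: (patient identifier, event label, event date).
events = [
    ("p1", "birth", date(2020, 1, 1)),
    ("p1", "sepsis_evaluation", date(2020, 1, 15)),
    ("p2", "birth", date(2021, 6, 1)),
]
offsets = make_offsets({"p1", "p2"})
shifted = shift_events(events, offsets)

# The 14-day interval between p1's events survives the shift.
print((shifted[1][2] - shifted[0][2]).days)
```

Because the offset table links shifted dates back to real ones, it would need to be protected or discarded according to the governing data use agreement.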

Another challenge is the integration of multimodal data. Most of the analytic strategies previously described utilized data from a single source (e.g., the electronic health record) and along a similar time scale. Although this approach simplifies the data collection process, such “siloed” data provide an incomplete understanding of the disease process. Several recent projects85,86,87 have demonstrated that building a complex model using multimodal data with different scales can generate neonatal outcome predictions with greater accuracy than single-domain predictions alone. Examples of different scales include race or genetic background, which are immutable characteristics; sepsis status, which is discrete but may evolve over time; and vital signs, such as heart rate or blood pressure, which are continuously changing. Further development of artificial intelligence-based data science tools which integrate genomic susceptibility with developmental epoch, environmental factors, social determinants of health, maternal/fetal characteristics, and family/patient self-reported data will be necessary.88 Meta-dimensional analysis, specifically concatenation-based integration, is one potential strategy but is not yet in routine use.89
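A minimal sketch of concatenation-based integration follows, reading the term as joining each patient's feature vectors from several modalities into a single vector before any modeling. That reading, the restriction to patients present in every modality, and all feature values are assumptions made for illustration.

```python
def concatenate_modalities(*modalities):
    """Join per-patient feature vectors from each modality into one vector.

    Each modality maps a patient identifier to a list of numeric features.
    Only patients present in every modality are kept.
    """
    shared_ids = set(modalities[0])
    for modality in modalities[1:]:
        shared_ids &= set(modality)
    return {
        patient_id: [value for modality in modalities for value in modality[patient_id]]
        for patient_id in sorted(shared_ids)
    }


# Invented features: two record-derived flags and one summarized vital sign.
health_record_features = {"p1": [1.0, 0.0], "p2": [0.0, 1.0]}
vital_sign_features = {"p1": [150.0], "p3": [140.0]}

combined = concatenate_modalities(health_record_features, vital_sign_features)
print(combined)
```

Dropping patients who lack a modality is the simplest choice but can shrink and bias the cohort; continuously sampled signals also need to be summarized to a fixed length before they can be concatenated with static characteristics.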

Life course research is an example of integration of multiple data types from diverse sources (e.g., institutional data warehouses and research repositories) to capture the complexity of health trajectories.14,90,91 Incorporation of geocoded data and environmental factors as well as patient-reported measures such as social well-being into the electronic health record represents a concrete strategy for greater inclusion and more accurate representation of populations currently underrepresented in research and permits analysis of the impact of social determinants of health on disease pathogenesis and response to therapies.91

A significant note of caution must be raised about racial bias in the use of physiology and electronic health record-based Big Data. Two sources of error may contribute to data-related bias. First, devices may not reliably capture measures owing to differences in physiology or phenotype. For example, significant recent attention has focused on the poor performance of pulse oximetry in adult and neonatal African-American patients.92,93 Lack of inclusion of melanin’s light absorption in the red and infrared spectrum in the underlying pulse oximeter algorithm increases the risk of occult hypoxemia in both adult and neonatal populations94 and of adverse neonatal outcomes.95,96 Second, even when collected data are reliable, machine learning models must be trained on representative samples that avoid racial bias. As recently demonstrated in a comparison of a new, intentionally designed machine learning algorithm for the prediction of ICU mortality with several widely used scores (APACHE, SAPS II, and MEWS), at least two of these systems (SAPS II and MEWS) were found to have significant racial bias.97 As with occult hypoxemia, the potential risk of harm comes from false negatives or underestimated disease risk. The equal opportunity difference analysis performed in this study is an optimal tool to identify these deficiencies.
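The equal opportunity difference is commonly defined as the difference between two groups in the true positive rate, that is, the fraction of patients who experienced the outcome and were correctly flagged by the model. The sketch below uses that common definition; the cited study's exact computation may differ, and the data are invented.

```python
def true_positive_rate(outcomes, predictions):
    """Fraction of patients with the outcome whom the model flagged as high risk."""
    flagged = [pred for outcome, pred in zip(outcomes, predictions) if outcome == 1]
    return sum(flagged) / len(flagged)


def equal_opportunity_difference(outcomes, predictions, groups, group_a, group_b):
    """True positive rate in group_a minus true positive rate in group_b."""

    def rate(group):
        rows = [
            (outcome, pred)
            for outcome, pred, member in zip(outcomes, predictions, groups)
            if member == group
        ]
        return true_positive_rate([o for o, _ in rows], [p for _, p in rows])

    return rate(group_a) - rate(group_b)


# Invented data: 1 = died (outcome) or flagged high risk (prediction).
outcomes = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
predictions = [1, 1, 1, 0, 0, 1, 0, 0, 1, 1]
groups = ["A"] * 5 + ["B"] * 5

difference = equal_opportunity_difference(outcomes, predictions, groups, "A", "B")
print(f"Equal opportunity difference (A - B): {difference:.2f}")
```

A value of zero indicates that the model misses true cases at the same rate in both groups; a positive value here means more missed cases (false negatives) in group B. The function assumes each group contains at least one patient with the outcome.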

The longstanding problem of gaps in studies of medications and medical devices in pediatric age groups represents an important priority for Big Data and data science to improve child health. As a consequence of these gaps, between 25% and 90% of medications are prescribed to children in an “off label” manner without regulatory approval.98 Instead of data-driven use, treatment options expand organically through extrapolation of adult data,99 anecdotal reports by providers, and practice drift.100,101 Although the primary focus of Big Data and data science has been on improving diagnosis of disease, elucidating mechanisms, and predicting outcomes, these same data science tools can and should be used to identify treatment response in children from real world data. For example, multicenter, federated data commons and advanced data analytics can be leveraged to identify and pool small numbers of infants and children at individual institutions into sample sizes which permit statistically valid examination of treatment response and adverse outcomes. Recently, real world data from electronic health records and other sources have been successfully analyzed to obtain regulatory approval for previously off-label medications in children.102


The urgency of the COVID-19 pandemic has demonstrated the potential for rapid application of Big Data and data science to integrate and analyze electronic health record data across health care systems and countries for identification of child-specific disease characteristics, best clinical practices, and responses to therapeutic interventions.103,104 These studies suggest the feasibility of the application of Big Data and data science to child health questions and the potential impact of such studies on prediction and mitigation of disease risk over decades of life. Realizing the potential of these tools for integrating genetic risk with developmental epoch, environmental factors, social determinants of health, patient- and family-reported data, and disease biology will require funding prioritization from the NIH and other agencies, unprecedented collaboration among institutions, investigators, and patients/families, consolidation of existing data networks, and child health-specific innovation.