Improving child health through Big Data and data science

Vesoulis, Zachary A.; Husain, Ameena N.; Cole, F. Sessions

doi:10.1038/s41390-022-02264-9

Download PDF

Review Article
Published: 16 August 2022

Improving child health through Big Data and data science

Zachary A. Vesoulis¹,
Ameena N. Husain¹ &
F. Sessions Cole¹

Pediatric Research volume 93, pages 342–349 (2023)Cite this article

5248 Accesses
7 Citations
25 Altmetric
Metrics details

Abstract

Child health is defined by a complex, dynamic network of genetic, cultural, nutritional, infectious, and environmental determinants at distinct, developmentally determined epochs from preconception to adolescence. This network shapes the future of children, susceptibilities to adult diseases, and individual child health outcomes. Evolution selects characteristics during fetal life, infancy, childhood, and adolescence that adapt to predictable and unpredictable exposures/stresses by creating alternative developmental phenotype trajectories. While child health has improved in the United States and globally over the past 30 years, continued improvement requires access to data that fully represent the complexity of these interactions and to new analytic methods. Big Data and innovative data science methods provide tools to integrate multiple data dimensions for description of best clinical, predictive, and preventive practices, for reducing racial disparities in child health outcomes, for inclusion of patient and family input in medical assessments, and for defining individual disease risk, mechanisms, and therapies. However, leveraging these resources will require new strategies that intentionally address institutional, ethical, regulatory, cultural, technical, and systemic barriers as well as developing partnerships with children and families from diverse backgrounds that acknowledge historical sources of mistrust. We highlight existing pediatric Big Data initiatives and identify areas of future research.

Impact

Big Data and data science can improve child health.
This review highlights the importance for child health of child-specific and life course-based Big Data and data science strategies.
This review provides recommendations for future pediatric-specific Big Data and data science research.

Approaches to protocol standardization and data harmonization in the ECHO-wide cohort study

Article Open access 16 February 2024

SPR perspectives: Environmental influences on Child Health Outcomes (ECHO) Program: overcoming challenges to generate engaged, multidisciplinary science

Article 15 June 2021

A new era: improving use of sociodemographic constructs in the analysis of pediatric cohort study data

Article 18 February 2021

Introduction

Despite living in the richest economy in the world, children in the United States have worse health outcomes as assessed by quantitative, standardized metrics than children in other upper income countries.¹ Improvements in child health outcomes over the past 30 years associated with availability of vaccines against common childhood diseases, greater chance of survival after preterm birth, and improved nutrition have not reduced race-, ethnicity-, geography-, and poverty-based disparities in child health.^2,3 Access to large medical, biological, and environmental data sets and progress in data science provide multidimensional data and data analytics that more fully capture the dynamic and complex interactions among genetic background, culture, social determinants of health, development, environment, and biologic risk.^1,4,5,6,7,8 These new tools build upon the progress from systems biology omics-based algorithms in which mechanistic explanations of outcomes were anchored in biologic problems to permit integration of environmental (e.g., pollution), cultural, demographic, and geographic factors that interact with child development and biology to create adverse health outcomes.^6,9,10 For example, machine learning algorithms (approaches that improve automatically with additional data experience), a branch of artificial intelligence,¹¹ recognize and apply patterns in multidimensional data that train models for prediction and stratification of disease risk based on biologic risk as well as developmental, cultural, and environmental risk determinants of adverse child health outcomes.^{5,10,12,13,14}

The epidemiology of child health (children are generally healthier than adults, represent a smaller fraction of the population (22% <18 years of age in the United States),¹⁵ have a higher risk of rare diseases, and have an increasing prevalence of medical complexity (3–5%),¹⁵ the lack of progress in reducing disparities in child health outcomes, and the difficulties in applying adult-based strategies to improve child health outcomes highlight the potential for Big Data and new data science to impact child health. By combining data from traditional sources, e.g., electronic health records, imaging results, biobanks/registries, and omics measurements, with demographic, cultural, and environmental sources, artificial intelligence-based data science methods may identify previously unrecognized patterns associated with childhood disease without starting with an a priori hypothesis.⁶ Recent reports have described combining Big Data from different sources in pediatric oncology, nephrology, and sepsis diagnosis.^16,17,18 However, these reports underline both the potential and the difficulties of combining data sets with limited data dimensions.¹⁹ For example, in pediatric oncology, combining existing institutional registries improves descriptions of natural history and responses to therapies but does not provide insight into contributions of environment, geography, or other less studied factors to disease risk, response to therapy, or prognosis.¹⁶ Similarly, in pediatric nephrology, development of quantitative definitions of acute kidney injury and descriptions of acute kidney injury epidemiology have not yet fully integrated omics or other potentially informative data elements.¹⁷ Big Data strategies for diagnosis of sepsis in low and middle income countries have helped improve quality of epidemiologic data but have not identified pathophysiologic or structural proposals to improve outcomes.¹⁸ Big Data and data science can accelerate improvements in child health outcomes through reductions in health disparities, improvements in clinical best practices and in quality and safety outcomes, prediction of individual risk of disease progression and response to therapies, increased family and patient-involvement in health care diagnosis and decisions, and discovery of personalized disease mechanisms. We will begin with a brief description of the characteristics of Big Data and data science strategies that will enable improvement in best practices and discovery of disease mechanisms, followed by examples of pediatric Big Data initiatives.

Big Data characteristics and strategies for analysis

Characteristics

Data production in healthcare is high volume and complex, encompassing a wide variety of inputs and formats including administrative, biomarker (e.g., genomic), physiologic, biometric, laboratory, and imaging. Data may be derived from multiple sources such as electronic health records, medical devices, mobile health platforms, clinical registries, biobanks, and patient self-reports.⁵ Within each of these sources, many different data types exist that can be further categorized as structured (e.g., demographics, laboratory results), unstructured (e.g., free text in notes and comments), and semi-structured (a combination of structured and unstructured data).²⁰

The fundamental features of Big Data were originally defined in the early 2000s by the 3V model,²¹ which includes Volume, Variety, and Velocity. This model has since been extended to 6Vs, adding Variability, Veracity, and Value (Fig. 1).^{4,21,22,23,24,25}

**Fig. 1: Big Data features defined by the 6V model.**

With these core features in mind, healthcare data can be further characterized by many quantitative properties that have been previously categorized by Shilo et al. into 7 axes of health data⁵ (Fig. 2). “Sample size,” a representation of volume, is important for achieving sufficient statistical power. However, data with an n of 1 (rare diseases) can also provide value in the definition of disease trajectory and clinical response.²⁶ “Depth of phenotyping” describes the variety of medical data used to characterize individuals and ranges from the molecular level (e.g., omics, microbiome) to the social level (e.g., demographics, lifestyle, environment, and social determinants of health). Integration of this array of data into a valuable understanding of health can be challenging due to lack of data interoperability and of common definitions. “Longitudinal follow-up” is critical for child health and includes data gathering over different time points, a characterization of variability. The Barker hypothesis²⁷ is a prime example of the importance of life course tracking to improve child and adult outcomes. Similarly, outcomes of underrepresented minorities can be improved with advocacy for engagement and adherence to follow-up in order to obtain complete long-term data.¹ “Interactions between subjects” is an application of value to find the connections between subjects (e.g., shared environments, twins), which can increase the statistical power to discover disease mechanisms. “Heterogeneity and diversity” of a cohort to include appropriate representation of the real-world population (factors such as age, sex, ethnicity, socioeconomic status, exposure to different social determinants of health) contribute to the variety of data. “Standardization of data” and “linkage between data sources” are vital to veracity and tackling the challenge of having variety in data.

**Fig. 2: Quantitative properties represent the complexity of healthcare data.**

Analysis strategies

As this wide array of data is gathered, finding meaningful results requires application of appropriate analytic strategies, the basis of data science. Three overarching approaches to analysis have been used: descriptive, predictive, and counterfactual.^28,29 Descriptive analysis utilizes conventional parametric and non-parametric statistics to provide quantitative estimates of central tendency and variability. Strengths of this approach include the ability to condense large amounts of data into single summative metrics and a high degree of explainability. It is the most common form of analysis for real-time and historical data used in clinical research and is frequently the basis for intervention guidelines and protocols.²⁹ Predictive analysis utilizes observational data to identify relational patterns between variables, either correlation or anti-correlation. Additional variables can be added to the model to control for factors which independently influence outcomes of interest. Once a model has been constructed, it can be “reversed” by entering values for each of the variables to generate a prediction of the probability of the outcome (for binary models) or an estimated value of the outcome (in continuous models).

In contrast to associations identified by descriptive and conventional predictive modeling, counterfactual prediction analysis is the foundation of causal analysis and inference. In this approach, one starts with the outcome and works backwards to the model inputs to evaluate how changes in the input might reverse or “flip” the outcome. Counterfactual explanations describe the smallest change to the input variables that causes a change from one predicted outcome to another. It is currently the least commonly used analytic approach. However, it has great potential to answer causal questions, for example in the genetic analysis of complex diseases.³⁰

Machine learning analysis

Machine learning refers to computer algorithms which utilize artificial neural networks to identify salient variables or “features” from a large pool of candidate variables which, in association, predict outcomes with the greatest accuracy.³¹ Although machine learning is widely considered an “advanced” technique, it encompasses a broad array of approaches ranging from simple, conventional regression modeling through deep learning. The strategy for utilizing machine learning is common across all model types; first, the system is “trained” by exposing it to a representative sample cohort of patients with known and labeled outcomes. As the system learns with each pass through the training samples, predictive accuracy improves. Once training has been maximized, the system is validated using a separate cohort of patients never before seen by the model.

As previously noted, “machine learning” is a broad term encompassing many different computational strategies including neural networks, decision trees, support-vector machines, and deep learning. All machine learning is patterned after human neural networks—each variable is conceived as a node where the pathway through the network diverges based on the value at the node before eventual mapping to the outcome at the end of the network. When such a system is “learning,” the algorithm makes observations of individual patients and develops a decision tree, a branching graph where the value of a given variable increases or decreases the predicted probability of the outcome and is also influenced by the path through prior branches in the tree.^32,33 As the algorithm is exposed to more examples, the weights of each branch point are iteratively increased or decreased until accuracy can no longer be increased. Not surprisingly, machine learning systems achieve optimal performance when extremely large datasets are available for training.

A significant drawback of machine learning analysis is the challenge of explaining the findings of the algorithm. As opposed to traditional regression modeling, where each variable is actively chosen by the investigator, and the relationship to the outcome is quantified in understandable units, machine learning decisions are made by a computer based on optimizing outcome prediction. In “unsupervised” machine learning, outcome labels are not provided to the computer, and grouping decisions are made without human input. While this approach offers the potential to discover previously unknown connections between variables, it may also result in convoluted and clinically implausible relationships.

Big Data challenges

The rapid growth of Big Data utilization in healthcare has unmasked many limitations and challenges. Data quality, accuracy, completeness, and availability are major hurdles in using large healthcare datasets^4,34 and can lead to inaccurate analysis,¹ biased inference, and false discoveries.³⁵ Similarly, management and storage of large amounts of data present challenges with maintaining data security and accuracy over time, archiving data, managing data warehouses, and removing/disposing of information. Applying the most appropriate analytic approach to Big Data requires understanding of the technical and quality limitations of Big Data.

Data integration is both a key challenge and a critical component for improving seamless access to robust, reproducible, and diverse sources of data.³⁶ Sharing data across institutions, for example, can be difficult due to differences in data types, definitions, and formats. Data security and privacy are also concerns when clinical information is shared, restricting the use of patient identifying information. Addressing these issues will require advancements in the standardization of data. The FAIR Guiding Principles (Findability, Accessibility, Interoperability, and Reusability) of data management provide guidelines for data production and data publishing focused on maximizing data quality and usability and are required by the National Institutes of Health for data management and sharing plans for all award applications submitted after January 25, 2023.^37,38 Improvements in data management which incorporate these fundamental principles will be crucial to harnessing the potential of clinical Big Data.

Examples of Big Data initiatives and strategies for improvement of child health

Although still early in development, pediatric-specific Big Data projects have begun to emerge that include data commons for integration and interrogation of data from inpatient and ambulatory cohorts of children and electronic health record- or vital sign-based data for development of disease risk scores. Here we highlight examples of pediatric Big Data initiatives. Other multi-institutional pediatric data networks are summarized in Table 1.

Table 1 Pediatric Big Data networks.

Full size table

Genomic Information Commons

Genomic Information Commons (GIC) is a National Institutes of Health (NIH)-funded, multi-institutional effort to provide an extensive, linked database of genotype, phenotype, biospecimens, and electronic health record-derived metadata in a highly accessible, federated database.³⁹ A cooperative effort that has recently expanded from three to nine academic pediatric centers in the United States, it leverages robust, easy to use computational infrastructure for preliminary data queries, retention and control of all data and biospecimens at member institutions, scalable inclusion of additional member institutions, executed, inter-institutional data use and material transfer agreements, active participation by patients and families in defining network operations and research priorities, and large and diverse patient populations for genomic discovery and identification of potential therapies. Considerable effort has been made to maximize the value and usability of the data for investigators while maintaining high privacy and security standards.

PEDSnet

PEDSnet is a national pediatric learning health system with a multi-specialty network of collaborators from Children’s Hospitals across the United States.⁴⁰ Using a common data model from its original funding source, the Patient-Centered Outcomes Research Institute (PCORI), this network formed a centralized data sharing environment with executed, institutional data use agreements to create large datasets of pediatric clinical data extracted from electronic health records which enable communities of patients and clinicians to perform research and quality improvement projects that improve child health.^40,41 PEDSnet currently includes longitudinal clinical data from 2009 for over 6.5 million children, about 9% of all the children in the US.⁴²

Clinical data from electronic health records are gathered in quarterly cycles from the contributing sites in a structured and templated manner. The PCORI Common Data Model ensures terminology and data details are standardized and permits interoperability with other PCORI-sponsored Clinical Data Research Networks. An extensive data quality assessment process is used for careful analysis of the quality and characteristics of the data from each participating site.⁴³ Data issues are categorized,^44,45 reviewed by data scientists, and discussed with each submitting site for resolution to address four dimensions of data quality, fidelity, consistency, accuracy, and completeness.⁴³

Data security and privacy are addressed with several steps that include storage of limited datasets in the PEDSnet Data Coordinating Center without patient identifiers. When data are requested by researchers, the minimum necessary aggregated data are provided, and institution-specific information is combined into “counts.” PEDSnet data have been used in 61 publications (June, 2022) that include disease-specific, quality and safety, and coronavirus disease 2019 (COVID-19)-related questions.^46,47,48,49

PhysioNet

PhysioNet is a large collection of clinical and physiologic data from both inpatient (e.g., traumatic brain injury) and ambulatory (e.g., gait analysis) venues that includes open-source tools for computational analysis.⁵⁰ Initially established in 1999 with NIH support, it is now structured into three components:

1.
“PhysioBank”—an extensive archive of digitized physiologic signals from fetal, pediatric, and adult sources.
2.
“PhysioToolkit”—an open-source collection of tools for the processing and analysis of physiologic signals.
3.
Extensive documentation and tutorials for new and advanced users.

Of the 202 PhysioNet databases, 12 are fetal or pediatric specific. Although most database access is free and without restriction, a subset of databases requires registration and completion of a data use agreement. In addition to providing a data repository that is compliant with the FAIR guidelines, PhysioNet has helped to standardize multiple types of data file formats.^50,51,52 This standardization expands the number of software tools that can be used to conduct analyses by researchers including many free and open-source options, providing equitable data access for researchers with limited resources such as those in low and middle-income countries.

Electronic Health Record-based risk scores

Late onset neonatal sepsis is a significant contributor to morbidity and mortality. The nSOFA score⁵³ is a sepsis prediction score, based upon the adult SOFA (Sequential Organ Failure Assessment) score and pSOFA (a pediatric variation developed for older children⁵⁴) score, which utilizes elements from electronic health records to identify infants at high risk for sepsis related mortality. As with the adult SOFA score, a rise in the nSOFA score was highly correlated with sepsis related mortality—a difference that could be detected within 6–12 h of sepsis evaluation. This score has since been validated in a multi-center cohort of more than 600 infants⁵⁵ with excellent performance, noting an area under the curve (AUC) of 0.88 for the prediction of mortality. The nSOFA score has also been shown to discriminate between survival and non-survival on the first day of life in extremely preterm infants.^56,57 These examples highlight the potential for application of data science analytics on data extracted from electronic health records to generate useful tools for severity of illness stratification and targeted treatments.

In older, hospitalized children, identification and stratification of illness severity and need for critical care have used tools initially developed and validated in adult populations⁵⁸ (e.g., Early Warning Scores (EWS) such as the National Early Warning Score (NEWS)).^59,60 In a retrospective study, 2–3% of pediatric hospital admissions experience cardiopulmonary arrest and require resuscitation.⁶¹ The Pediatric Early Warning System (PEWS) score⁶² provides a similar predictive model for the pediatric inpatient population. In subsequent validation, PEWS scores identified patients at risk of deterioration 12 h in advance of clinically apparent deterioration,⁶³ reduced the risk of emergency response calls shortly after admission from the Emergency Department,⁶⁴ improved timely and orderly transfer of patients to the ICU,⁶⁵ and increased the number of days without medical codes outside the ICU.⁶⁶

Physiology-based risk scores

Over the last two decades in the NICU, the use of multiple devices for monitoring physiology-based signals including the electrocardiogram (ECG), pulse oximetry, respiratory rate, arterial blood pressure, transcutaneous partial pressure of carbon dioxide (CO₂), cerebral and organ oximetry (via near-infrared spectroscopy (NIRS)), and electroencephalogram (EEG) has increased. Despite this broad array of available data, clinicians use these physiologic biomarker data almost exclusively for in-the-moment decisions and most often from only one signal, such as targeted oxygen saturation thresholds to reduce risk of retinopathy of prematurity (ROP). Computational integration of individual or multiple monitored physiologic biomarkers over time may reveal unrecognized patterns of pathology.

For example, the Heart Rate Observation (HeRO) score uses ECG characteristics,⁶⁷ comprised of beat-to-beat variability in heart rate, accelerations, and decelerations (indicative of autonomic nervous system tone), to predict continuous sepsis risk in the next 24 h.⁶⁸ Extensive neonatal validation testing has demonstrated the HeRO score’s superior performance over more traditional laboratory or clinical assessments and can reduce sepsis-related mortality by as much as 20%.^69,70

Using a similar approach to the HeRO score, several groups have used other quantitative physiologic biomarkers (e.g., heart rate variability) to predict adverse long term neurodevelopmental outcomes,^71,72,73 moderate-severe MRI abnormalities, or death.^72,74 Similarly, quantitative characteristics of continuous EEG monitoring can be used to predict the later occurrence of seizures⁷⁵ and outcomes at 24 months⁷⁶ and 5 years.⁷⁷

Research gaps and future directions

Despite the potential of Big Data strategies to improve child health and patient safety by trans-institutional identification of rare patient phenotypes, adverse patient events (e.g., sentinel pediatric events or adverse drug reactions),⁷⁸ and responses to therapies, significant barriers remain in the practical application of research strategies to real-world bedside care.^79,80,81 One of the most significant barriers is the lack of a universal, interoperable, modular system for capturing and sharing medical data. Proprietary and institutionally cloistered electronic health record systems limit discovery of critical components of best clinical and nursing practices⁸² and of pediatric-and disease-specific patient characteristics, disease risk, and adverse events. In addition, the large volume of data generated during healthcare delivery requires intentional system design that optimizes future data usability and minimizes cost for data extraction, reformatting, and loss. Similarly, although linkage of individual medical records longitudinally across maternal/fetal, neonatal, child, and adult epochs would be of great value for discovery of fetal, neonatal, and childhood origins of pediatric and adult diseases, such a system remains largely impracticable unless all the care for an individual is obtained within a single health system across the life course. Even within open source electronic health record systems, multiple data formats which vary according to system and region reduce interoperability across institutions.⁸³ The PRISM model described by Hirschfeld et al.⁸⁴ provides a theoretical framework for the future of intentional system design that captures elements of the health phenotype across four dimensions (experience, performance, adaptability, potential) and results in a life course model of an individual’s health (an Ideal Health Prism) which could enable comprehensive study of the fetal and childhood origins of childhood and adult diseases.

Through several different programs, such as the Big Data to Knowledge (BD2K) program, the NIH has emphasized the importance of using FAIR principles to insure availability of NIH-funded project data for the scientific community.³⁷ Through resources such as GIC, PhysioNet, and PEDSNet, secure pediatric data with common formatting models are becoming available. However, compliance with Health Insurance Portability and Accountability Act of 1996 (HIPAA)-associated privacy rules often necessitates extensive manual data review and modification to ensure that all protected health information has been removed. For example, removal or shifting of all elements of date and time (high value Big Data components) in a truly random fashion and uniformly across elements for linkage consistency due to HIPAA protection is both laborious and prone to introduce errors.

Another challenge is the integration of multimodal data. Most of the analytic strategies previously described utilized data from a single source (e.g., the electronic health record) and along a similar time scale. Although this approach simplifies the data collection process, such “siloed” data provide an incomplete understanding of the disease process. Several recent projects^85,86,87 have demonstrated that building a complex model using multimodal data with different scales can generate neonatal outcome predictions with greater accuracy than single-domain predictions alone. Examples of different scales include race or genetic background, which are immutable characteristics, sepsis status which is discrete but may evolve over time, and vital signs, such as heart rate or blood pressure, which are continuously changing. Further development of artificial intelligence-based data science tools which integrate genomic susceptibility with developmental epoch, environmental factors, social determinants of health, maternal/fetal characteristics, and family/patient self-reported data will be necessary.⁸⁸ Meta-dimensional analysis, specifically concatenation-based integration, is one potential strategy but is not yet in routine use.⁸⁹

Life course research is an example of integration of multiple data types from diverse sources (e.g., institutional data warehouses and research repositories) to capture the complexity of health trajectories.^14,90,91 Incorporation of geocoded data and environmental factors as well as patient-reported measures such as social well-being into the electronic health record represent concrete strategies for greater inclusion and more accurate representation of populations currently underrepresented in research and permit analysis of the impact of social determinants of health on disease pathogenesis and response to therapies.⁹¹

A significant note of caution must be raised about racial bias in the use of physiology and electronic health record-based Big Data. Two sources of error may contribute to data-related bias. First, devices may not reliably capture measures owing to differences in physiology or phenotype. For example, significant recent attention has focused on the poor performance of pulse oximetry in adult and neonatal African-American patients.^92,93 Lack of inclusion of melanin’s light absorption in the red and infrared spectrum in the underlying pulse oximeter algorithm increases the risk of occult hypoxemia in both adult and neonatal populations⁹⁴ and of adverse neonatal outcomes.^95,96 Second, even when collected data are reliable, machine learning models must be trained on representative samples that avoid racial bias. As recently demonstrated in a comparison of a new, intentionally designed machine learning algorithm for the prediction of ICU mortality with several widely used scores (APACHE, SAPS II, and MEWS), at least two of these systems (SAPS II and MEWS) were found to have significant racial bias.⁹⁷ As with occult hypoxemia, the potential risk of harm comes from false negatives or underestimated disease risk. The equal opportunity difference analysis performed in this study is an optimal tool to identify these deficiencies.

The longstanding problem of gaps in studies of medications and medical devices in pediatric age groups represents an important priority for Big Data and data science to improve child health. As a consequence of these gaps, between 25% and 90% of medications are prescribed to children in an “off label” manner without regulatory approval.⁹⁸ Instead of data-driven use, treatment options expand organically through extrapolation of adult data,⁹⁹ anecdotal reports by providers, and practice drift.^100,101 Although the primary focus of Big Data and data science has been on improving diagnosis of disease, elucidating mechanisms, and predicting outcomes, these same data science tools can and should be used to identify treatment response in children from real world data. For example, multicenter, federated data commons and advanced data analytics can be leveraged to identify and pool small numbers of infants and children at individual institutions into sample sizes which permit statistically valid examination of treatment response and adverse outcomes. Recently, real world data from electronic health records and other sources have been successfully analyzed to obtain regulatory approval for previously off-label medications in children.¹⁰²

Summary

The urgency of the COVID-19 pandemic has demonstrated the potential for rapid application of Big Data and data science to integrate and analyze electronic health record data across health care systems and countries for identification of child-specific disease characteristics, best clinical practices, and responses to therapeutic interventions.^103,104 These studies suggest the feasibility of the application of Big Data and data science to child health questions and the potential impact of such studies on prediction and mitigation of disease risk over decades of life. Realizing the potential of these tools for integrating genetic risk with developmental epoch, environmental factors, social determinants of health, patient- and family-reported data, and disease biology will require funding prioritization from the NIH and other agencies, unprecedented collaboration among institutions, investigators, and patients/families, consolidation of existing data networks, and child health-specific innovation.

Data availability

All data generated or analyzed for this review are included in this published article.

References

Perez, L. G., Peet, E. D., Vegetabile, B. & Shih, R. A. Big Data needs and challenges to advance research on racial and ethnic inequities in maternal and child health. Womens Health Issues 32, 90–94 (2022).
Article PubMed Google Scholar
GBD 2017 Child and Adolescent Health Collaborators & Reiner, R. C. et al. Diseases, injuries, and risk factors in child and adolescent health, 1990 to 2017: findings from the Global Burden of Diseases, Injuries, and Risk Factors 2017 Study. JAMA Pediatr. 173, e190337 (2019).
Article PubMed Central Google Scholar
Ely, D. M. & Driscoll, A. K. Infant mortality in the United States, 2017: data from the period linked birth/infant death file. Natl Vital Stat. Rep. 68, 1–20 (2019).
PubMed Google Scholar
Pablo, R. G. J. et al. Big data in the healthcare system: a synergy with artificial intelligence and blockchain technology. J. Integr. Bioinform. 19, 20200035 (2021).
Shilo, S., Rossman, H. & Segal, E. Axes of a revolution: challenges and promises of big data in healthcare. Nat. Med. 26, 29–38 (2020).
Article CAS PubMed Google Scholar
Hulsen, T. et al. From big data to precision medicine. Front. Med. 6, 34 (2019).
Article Google Scholar
U.S. Department of Health and Human Services, Office of Disease Prevention and Health Promotion. Healthy people 2030. https://health.gov/healthypeople/objectives-and-data/social-determinants-health (2022).
Bennett, T. D. et al. Data science for child health. J. Pediatr. 208, 12–22 (2019).
Article PubMed PubMed Central Google Scholar
Obermeyer, Z. & Emanuel, E. J. Predicting the future - big data, machine learning, and clinical medicine. N. Engl. J. Med. 375, 1216–1219 (2016).
Article PubMed PubMed Central Google Scholar
MacEachern, S. J. & Forkert, N. D. Machine learning for precision medicine. Genome 64, 416–425 (2021).
Article PubMed Google Scholar
Hunt, X. et al. Artificial intelligence, big data, and mhealth: the frontiers of the prevention of violence against children. Front. Artif. Intell. 3, 543305 (2020).
Article PubMed PubMed Central Google Scholar
Camacho, D. M., Collins, K. M., Powers, R. K., Costello, J. C. & Collins, J. J. Next-generation machine learning for biological networks. Cell 173, 1581–1592 (2018).
Article CAS PubMed Google Scholar
Lu, C. Y., McMahon, P. M. & Wu, A. C. Modeling genomic screening in newborns. JAMA Pediatr. https://doi.org/10.1001/jamapediatrics.2021.5798 (2022).
Goulooze, S. C. et al. Beyond the randomized clinical trial: innovative data science to close the pediatric evidence gap. Clin. Pharm. Ther. 107, 786–795 (2020).
Article Google Scholar
Boat, T. F. & Whitsett, J. A. How can the pediatric community enhance funding for child health research. JAMA Pediatr. 175, 1212–1214 (2021).
Article PubMed Google Scholar
Major, A., Cox, S. M. & Volchenboum, S. L. Using big data in pediatric oncology: current applications and future directions. Semin. Oncol. 47, 56–64 (2020).
Article PubMed PubMed Central Google Scholar
Sutherland, S. M. Big data and pediatric acute kidney injury: the promise of electronic health record systems. Front. Pediatr. 7, 536 (2019).
Article PubMed Google Scholar
Iregbu, K. et al. Global health systems’ data science approach for precision diagnosis of sepsis in early life. Lancet Infect. Dis. 22, e143–e152 (2021).
Article PubMed Google Scholar
Martínez-García, M. & Hernández-Lemus, E. Data integration challenges for machine learning in precision medicine. Front. Med. 8, 784455 (2022).
Article Google Scholar
AnalytixLabs. Characteristics of Big Data. A complete guide. Blogs & updates on data science, business analytics, AI machine learning. https://www.analytixlabs.co.in/blog/characteristics-of-big-data/ (2021).
Bello-Orgaz, G., Jung, J. J. & Camacho, D. Social big data: recent achievements and new challenges. Int. J. Inf. Fusion 28, 45 (2016).
Article Google Scholar
Ishawarappa & Anuradha, J. A brief introduction on Big Data 5Vs characteristics and Hadoop technology. https://cyberleninka.org/article/n/1071853/viewer (2015).
SearchDataManagement. The 5 V’s of Big Data. https://searchdatamanagement.techtarget.com/definition/5-Vs-of-big-data (2022).
Luo, J., Wu, M., Gopukumar, D. & Zhao, Y. Big Data application in biomedical research and health care: a literature review. Biomed. Inf. Insights 8, 1–10 (2016).
CAS Google Scholar
Andreu-Perez, J., Poon, C. C. Y., Merrifield, R. D., Wong, S. T. C. & Yang, G. Z. Big Data for health. IEEE J. Biomed. Health Inf. 19, 1193–1208 (2015).
Article Google Scholar
Brokamp, E. et al. One is the loneliest number: genotypic matchmaking using the electronic health record. Genet. Med. 23, 1830–1832 (2021).
Article PubMed Google Scholar
Kwon, E. J. & Kim, Y. J. What is fetal programming?: a lifetime health is under the control of in utero health. Obstet. Gynecol. Sci. 60, 506–519 (2017).
Article PubMed PubMed Central Google Scholar
Hernán, M. A., Hsu, J. & Healy, B. A second chance to get causal inference right: a classification of data science tasks. CHANCE 32, 42–49 (2019).
Article Google Scholar
ProjectPro. Types of analytics: descriptive, predictive, prescriptive analytics. https://www.projectpro.io/article/types-of-analytics-descriptive-predictive-prescriptive-analytics/209 (2022).
Hu, P., Jiao, R., Jin, L. & Xiong, M. Application of causal inference to genomic analysis: advances in methodology. Front. Genet. 9, 238 (2018).
Shalev-Shwartz, S. & Ben-David. S. Understanding Machine Learning: From Theory to Algorithms (Cambridge University Press, 2014).
Chollet, F. Deep Learning with Python (Manning Publications Co., 2018).
Bishop, C. M. Pattern Recognition and Machine Learning (Springer, 2006).
Sanders, C. et al. Understanding the limits of large datasets. J. Cancer Educ. 27, 664–669 (2012).
Article PubMed PubMed Central Google Scholar
Wang, W. & Krishnan, E. Big Data and clinicians: a review on the state of the science. JMIR Med. Inf. 2, e1 (2014).
Article Google Scholar
Sinha, A., Hripcsak, G. & Markatou, M. Large datasets in biomedicine: a discussion of salient analytic issues. J. Am. Med. Inf. Assoc. 16, 759–767 (2009).
Article Google Scholar
Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016).
Article PubMed PubMed Central Google Scholar
NIH. NOT-OD-21-013: Final NIH policy for data management and sharing. https://grants.nih.gov/grants/guide/notice-files/NOT-OD-21-013.html (2022).
Mandl, K. D. et al. The Genomics Research and Innovation Network: creating an interoperable, federated, genomics learning system. Genet. Med. 22, 371–380 (2020).
Article CAS PubMed Google Scholar
Forrest, C. B. et al. PEDSnet: a national pediatric learning health system. J. Am. Med. Inf. Assoc. 21, 602–606 (2014).
Article Google Scholar
Forrest, C. B., Margolis, P., Seid, M. & Colletti, R. B. PEDSnet: how a prototype pediatric learning health system is being expanded into a national network. Health Aff. 33, 1171–1177 (2014).
Article Google Scholar
PEDSnet. Home. http://pedsnet.org (2022).
PEDSnet. PEDSnet data quality. http://pedsnet.org (2022).
Khare, R. et al. Predicting causes of data quality issues in a clinical data research. Netw. AMIA Summits Transl. Sci. Proc. 2018, 113–121 (2018).
Google Scholar
Khare, R. et al. A longitudinal analysis of data quality in a large pediatric data research network. J. Am. Med. Inf. Assoc. 24, 1072–1079 (2017).
Article Google Scholar
Davis, S. M. et al. Population-based assessment of cardiometabolic-related diagnoses in youth with Klinefelter syndrome: a PEDSnet study. J. Clin. Endocrinol. Metab. 107, e1850–e1859 (2022).
Khare, R. et al. Development and evaluation of an EHR‐based computable phenotype for identification of pediatric Crohn’s disease patients in a National Pediatric Learning Health System. Learn Health Syst. 4, e10243 (2020).
PubMed PubMed Central Google Scholar
Denburg, M. R. et al. Using electronic health record data to rapidly identify children with glomerular disease for clinical research. J. Am. Soc. Nephrol. 30, 2427–2435 (2019).
Article PubMed PubMed Central Google Scholar
Bailey, L. C. et al. Assessment of 135794 pediatric patients tested for severe acute respiratory syndrome coronavirus 2 across the United States. JAMA Pediatr. 175, 176–184 (2021).
Article PubMed Google Scholar
Goldberger, A. L. et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101, E215–E220. https://doi.org/10.1161/01.CIR.101.23.e215 (2000).
Kemp, B. & Olivan, J. European data format “plus” (EDF+), an EDF alike standard format for the exchange of physiological data. Clin. Neurophysiol. J. 114, 1755–1761 (2003).
Article Google Scholar
Shafranovich, Y. Common format and MIME type for comma-separated values (CSV) files. RFC Editor. https://www.rfc-editor.org/rfc/rfc4180.txt (2005).
Wynn, J. L. & Polin, R. A. A neonatal sequential organ failure assessment score predicts mortality to late-onset sepsis in preterm very low birth weight infants. Pediatr. Res. 88, 85–90 (2020).
Article PubMed Google Scholar
Matics, T. J. & Sanchez-Pinto, L. N. Adaptation and validation of a pediatric sequential organ failure assessment score and evaluation of the sepsis-3 definitions in critically ill children. JAMA Pediatr. 171, e172352 (2017).
Article PubMed PubMed Central Google Scholar
Fleiss, N. et al. Evaluation of the neonatal sequential organ failure assessment and mortality risk in preterm infants with late-onset infection. JAMA Netw. Open 4, e2036518 (2021).
Article PubMed PubMed Central Google Scholar
Travers, C. P., Carlo, W. A. & Ambalavanan, N. The future of outcome prediction for preterm infants in the neonatal ICU. Am. J. Respir. Crit. Care Med. 205, 6–8 (2022).
Article PubMed Google Scholar
Lavilla, O. C. et al. Hourly kinetics of critical organ dysfunction in extremely preterm infants. Am. J. Respir. Crit. Care Med. 205, 75–87 (2022).
Article PubMed Google Scholar
Morgan, R., Williams, F. & Wright, M. An early warning scoring system for detecting developing critical illness. Clin. Intensive Care 8, 100 (1997).
Google Scholar
Smith, G. B., Prytherch, D. R., Meredith, P., Schmidt, P. E. & Featherstone, P. I. The ability of the National Early Warning Score (NEWS) to discriminate patients at risk of early cardiac arrest, unanticipated intensive care unit admission, and death. Resuscitation 84, 465–470 (2013).
Article PubMed Google Scholar
Alam, N. et al. The impact of the use of the Early Warning Score (EWS) on patient outcomes: a systematic review. Resuscitation 85, 587–594 (2014).
Article CAS PubMed Google Scholar
Reis, A. G., Nadkarni, V., Perondi, M. B., Grisi, S. & Berg, R. A. A prospective investigation into the epidemiology of in-hospital pediatric cardiopulmonary resuscitation using the international Utstein reporting style. Pediatrics 109, 200–209 (2002).
Article PubMed Google Scholar
Duncan, H., Hutchison, J. & Parshuram, C. S. The Pediatric Early Warning System score: a severity of illness score to predict urgent medical need in hospitalized children. J. Crit. Care 21, 271–278 (2006).
Article PubMed Google Scholar
Akre, M. et al. Sensitivity of the pediatric early warning score to identify patient deterioration. Pediatrics 125, e763–e769 (2010).
Article PubMed Google Scholar
Frascogna, M. N., Merkle, E., Dowdy, K. & Seals, S. The effect of pediatric early warning score use on emergency response calls after admission from the pediatric emergency department. Pediatr. Emerg. Care 37, e930–e933 (2021).
Article PubMed Google Scholar
Agulnik, A. et al. Impact of implementing a Pediatric Early Warning System (PEWS) in a pediatric oncology hospital. Pediatr. Qual. Saf. 3, e065 (2018).
Article PubMed Central Google Scholar
Demmel, K. M., Williams, L. & Flesch, L. Implementation of the Pediatric Early Warning Scoring System on a Pediatric Hematology/Oncology Unit. J. Pediatr. Oncol. Nurs. 27, 229–240 (2010).
Article PubMed Google Scholar
Fairchild, K. D. & O’Shea, T. M. Heart rate characteristics: physiomarkers for detection of late-onset neonatal sepsis. Clin. Perinatol. 37, 581–598 (2010).
Article PubMed PubMed Central Google Scholar
Griffin, M. P. et al. Abnormal heart rate characteristics preceding neonatal sepsis and sepsis-like illness. Pediatr. Res. 53, 920–926 (2003).
Article PubMed Google Scholar
Moorman, J. R. et al. Mortality reduction by heart rate characteristic monitoring in very low birth weight neonates: a randomized trial. J. Pediatr. 159, 900.e1–906.e1 (2011).
Article Google Scholar
Fairchild, K. D. et al. Septicemia mortality reduction in neonates in a heart rate characteristics monitoring trial. Pediatr. Res. 74, 570–575 (2013).
Article PubMed PubMed Central Google Scholar
Goulding, R. M. et al. Heart rate variability in hypoxic ischemic encephalopathy: correlation with EEG grade and 2-y neurodevelopmental outcome. Pediatr. Res. 77, 681–687 (2015).
Article PubMed Google Scholar
Massaro, A. N. et al. Heart rate variability in encephalopathic newborns during and after therapeutic hypothermia. J. Perinatol. 34, 836–841 (2014).
Article CAS PubMed PubMed Central Google Scholar
Vesoulis, Z. A., Rao, R., Trivedi, S. B. & Mathur, A. M. The effect of therapeutic hypothermia on heart rate variability. J. Perinatol. 37, 679–683 (2017).
Article CAS PubMed PubMed Central Google Scholar
Vergales, B. D. et al. Depressed heart rate variability is associated with abnormal EEG, MRI, and death in neonates with hypoxic ischemic encephalopathy. Am. J. Perinatol. 31, 855–862 (2014).
Article PubMed Google Scholar
Jain, S. V., Zempel, J. M., Srinivasakumar, P., Wallendorf, M. & Mathur, A. Early EEG power predicts MRI injury in infants with hypoxic-ischemic encephalopathy. J. Perinatol. 37, 541–546 (2017).
Article CAS PubMed Google Scholar
Murray, D. M., Boylan, G. B., Ryan, C. A. & Connolly, S. Early EEG findings in hypoxic-ischemic encephalopathy predict outcomes at 2 years. Pediatrics 124, e459–e467 (2009).
Article PubMed Google Scholar
Murray, D. M., O’Connor, C. M., Ryan, C. A., Korotchikova, I. & Boylan, G. B. Early EEG grade and outcome at 5 years after mild neonatal hypoxic ischemic encephalopathy. Pediatrics 138, e20160659 (2016).
McMahon, A. W. et al. Big Data in the assessment of pediatric medication safety. Pediatrics 145, e20190562 (2020).
Article PubMed Google Scholar
Hoodbhoy, Z. et al. Machine learning for child and adolescent health: a systematic review. Pediatrics 147, e2020011833 (2021).
Article PubMed Google Scholar
van de Sande, D. et al. Developing, implementing and governing artificial intelligence in medicine: a step-by-step approach to prevent an artificial intelligence winter. BMJ Health Care Inf. 29, e100495 (2022).
Article Google Scholar
Slopen, N. & Heard-Garris, N. Structural racism and pediatric health—a call for research to confront the origins of racial disparities in health. JAMA Pediatr. 176, 13–15 (2022).
Article PubMed Google Scholar
Cole, F. S. Improving VLBW infant outcomes with big data analytics. Pediatr. Res. 90, 20–21 (2021).
Article PubMed PubMed Central Google Scholar
Purkayastha, S., Allam, R., Maity, P. & Gichoya, J. W. Comparison of open-source electronic health record systems based on functional and user performance criteria. Health. Inf. Res. 25, 89–98 (2019).
Article Google Scholar
Hirschfeld, S. et al. Health measurement model-bringing a life course perspective to health measurement: the PRISM model. Front. Pediatr. 9, 605932 (2021).
Article PubMed PubMed Central Google Scholar
Temko, A. et al. Multimodal predictor of neurodevelopmental outcome in newborns with hypoxic-ischaemic encephalopathy. Comput. Biol. Med 63, 169–177 (2015).
Article PubMed Google Scholar
Vesoulis, Z. A., El Ters, N. M., Herco, M., Whitehead, H. V. & Mathur, A. M. A web-based calculator for the prediction of severe neurodevelopmental impairment in preterm infants using clinical and imaging characteristics. Children 5, 151 (2018).
Na, J. Y. et al. Artificial intelligence model comparison for risk factor analysis of patent ductus arteriosus in nationwide very low birth weight infants cohort. Sci. Rep. 11, 22353 (2021).
Article CAS PubMed PubMed Central Google Scholar
Topol, E. J. Individualized medicine from prewomb to tomb. Cell 157, 241–253 (2014).
Article CAS PubMed PubMed Central Google Scholar
Ritchie, M. D., Holzinger, E. R., Li, R., Pendergrass, S. A. & Kim, D. Methods of integrating data to uncover genotype–phenotype interactions. Nat. Rev. Genet. 16, 85 (2015).
Article CAS PubMed Google Scholar
Hanson, H. A. et al. Charting the life course: emerging opportunities to advance scientific approaches using life course research. J. Clin. Transl. Sci. 5, e9 (2020).
Article PubMed PubMed Central Google Scholar
Hanson, H. A. et al. Opportunities for life course research through the integration of data across Clinical and Translational Research Institutes. J. Clin. Transl. Sci. 2, 156–162 (2018).
Article PubMed PubMed Central Google Scholar
Sjoding, M. W., Dickson, R. P., Iwashyna, T. J., Gay, S. E. & Valley, T. S. Racial bias in pulse oximetry measurement. N. Engl. J. Med. 383, 2477–2478 (2020).
Article PubMed PubMed Central Google Scholar
Vesoulis, Z., Tims, A., Lodhi, H., Lalos, N. & Whitehead, H. Racial discrepancy in pulse oximeter accuracy in preterm infants. J. Perinatol. 42, 79–85 (2022).
Article PubMed Google Scholar
Feiner, J. R., Severinghaus, J. W. & Bickler, P. E. Dark skin decreases the accuracy of pulse oximeters at low oxygen saturation: the effects of oximeter probe type and gender. Anesth. Analg. 105, S18–S23 (2007).
Article PubMed Google Scholar
Vesoulis, Z. A. et al. Early hypoxemia burden is strongly associated with severe intracranial hemorrhage in preterm infants. J. Perinatol. 39, 48–53 (2019).
Article PubMed Google Scholar
BOOST II United Kingdom Collaborative Group et al.Oxygen saturation and outcomes in preterm infants. N. Engl. J. Med. 368, 2094–2104 (2013).
Article Google Scholar
Allen, A. et al. A racially unbiased, machine learning approach to prediction of mortality: algorithm development study. JMIR Public Health Surveill. 6, e22400 (2020).
Article PubMed PubMed Central Google Scholar
Ristovska, L. Regulations and data sources on pediatric clinical studies in the United States and European Union (White Paper). https://www.nber.org/sites/default/files/2020-08/Regulations%20and%20Data%20Sources%20on%20Pediatric%20Clinical%20Studies%20in%20the%2CUnited%20States%20and%20European%20Union.pdf (2020).
Novak, E. & Allen, P. J. Prescribing medications in pediatrics: concerns regarding FDA approval and pharmacokinetics. Pediatr. Nurs. 33, 64–70 (2007).
PubMed Google Scholar
Durrmeyer, X., Vutskits, L., Anand, K. J. S. & Rimensberger, P. C. Use of analgesic and sedative drugs in the NICU: integrating clinical trials and laboratory data. Pediatr. Res. 67, 117–127 (2010).
Article PubMed Google Scholar
Sharpe, C. et al. Levetiracetam versus phenobarbital for neonatal seizures: a randomized controlled trial. Pediatrics 145, e20193182 (2020).
Article PubMed Google Scholar
Bolislis, W. R., Fay, M. & Kühler, T. C. Use of real-world data for new drug applications and line extensions. Clin. Ther. 42, 926–938 (2020).
Article CAS PubMed Google Scholar
Bourgeois, F. T. et al. International analysis of electronic health records of children and youth hospitalized with COVID-19 infection in 6 countries. JAMA Netw. Open 4, e2112596 (2021).
Article PubMed PubMed Central Google Scholar
Klann, J. G. et al. Validation of an internationally derived patient severity phenotype to support COVID-19 analytics from electronic health record data. J. Am. Med. Inf. Assoc. 28, 1411–1420 (2021).
Article Google Scholar

Download references

Funding

This work was supported by grants from the National Institutes of Health K23 NS111086 (to Z.A.V.) and the Children’s Discovery Institute (F.S.C.).

Author information

Authors and Affiliations

The Edward Mallinckrodt Department of Pediatrics, Washington University in St. Louis School of Medicine, and St. Louis Children’s Hospital, St. Louis, MO, USA
Zachary A. Vesoulis, Ameena N. Husain & F. Sessions Cole

Authors

Zachary A. Vesoulis
View author publications
You can also search for this author in PubMed Google Scholar
Ameena N. Husain
View author publications
You can also search for this author in PubMed Google Scholar
F. Sessions Cole
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Z.D.V., A.N.H., and F.S.C. contributed to conception and design of this manuscript; Z.D.V. and A.N.H. drafted the manuscript; Z.D.V., A.N.H., and F.S.C. critically revised the manuscript and reviewed and approved the manuscript for publication.

Corresponding author

Correspondence to F. Sessions Cole.

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethics approval and consent to participate

No patient consent was required.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Vesoulis, Z.A., Husain, A.N. & Cole, F.S. Improving child health through Big Data and data science. Pediatr Res 93, 342–349 (2023). https://doi.org/10.1038/s41390-022-02264-9

Download citation

Received: 07 March 2022
Revised: 10 June 2022
Accepted: 28 June 2022
Published: 16 August 2022
Issue Date: January 2023
DOI: https://doi.org/10.1038/s41390-022-02264-9

This article is cited by

Proceedings from the CIHLMU 2022 Symposium: “Availability of and Access to Quality Data in Health”
- Sabita Tuladhar
- Kimothy Mwamelo
- Sarah Scholze
BMC Proceedings (2023)
Emerging role of artificial intelligence, big data analysis and precision medicine in pediatrics
- Atul Malhotra
- Eleanor J. Molloy
- Sarah B. Mulkey
Pediatric Research (2023)