Introduction

The Apgar scoring system was developed by Dr. Virginia Apgar over 70 years ago as a tool to assess the condition of a newborn at birth based on five variables: heart rate, respiratory effort, muscle tone, reflex irritability, and color. Its original purpose was to allow for immediate observation and prompt identification of newborns who need resuscitative measures during the transition to extrauterine life [1]. Over the ensuing seven decades, despite the advances made in evidence-based newborn resuscitation and the advent of the Neonatal Resuscitation Program (NRP), which evaluates the newborn infant without any role for the Apgar score [2], its international recognition and universal use have continued.

Over those years, the use of the Apgar scoring tool has gone far beyond its original purpose and has been extended to guiding clinical management decisions and to establishing correlations with long-term infant health outcomes [3,4,5,6]. In a review of 501 papers published in 2018–19, the Apgar score was used as a prognostic factor for outcomes in 19%; more than half of these focused on short-term morbidities [7]. Numerous studies have examined the association between a low Apgar score and a variety of short-term neonatal morbidities [8,9,10], but the significance and value of a low Apgar score in identifying newborn infants likely to manifest these morbidities have not been systematically examined.

Methods

We conducted a retrospective study using data from the medical records of infants born at >22 weeks gestational age between 9/1/2013 and 3/30/2020 at VCU Health Systems in Richmond, VA, a regional academic medical center with a 40-bed NICU and an average of 2600 live births per year. The Labor and Delivery service maintained a database of live births, including medical record number, mode of delivery, and Apgar scores. The electronic medical records (EMR) for the mother and newborn of each delivery were then queried for demographic information and all discharge diagnoses. Information contained in both the database and the EMR (birthdate, medical record numbers) was used to confirm matching of the two data sets.

Ten short-term outcomes, defined as occurring during the initial hospital stay, were selected. Each was identified using any of the ICD9 (before 2016) or ICD10 codes that might be applied. For example, in ICD10, Respiratory Distress of the Newborn is coded by either P22.0 or P22.9. The selected common morbidities were as follows: bronchopulmonary dysplasia (BPD); necrotizing enterocolitis (NEC); intraventricular hemorrhage (IVH); retinopathy of prematurity (ROP); hypoxic ischemic encephalopathy (HIE); respiratory distress syndrome (RDS); transient tachypnea of the newborn (TTNB); newborn sepsis; hypoglycemia (HypoG); and meconium aspiration syndrome (MAS).
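The code-matching rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the query actually run for the study: only the P22.0/P22.9 pair comes from the text, and treating that pair as the code set for the RDS outcome is itself an assumption. The names `OUTCOME_CODES` and `has_outcome` and the sample code lists are hypothetical.

```python
# Hypothetical mapping from outcome label to the set of ICD-10 codes counted
# toward it. Only the P22.0 / P22.9 pair is taken from the text; assigning it
# to the "RDS" label is an assumption made for illustration.
OUTCOME_CODES = {
    "RDS": {"P22.0", "P22.9"},
}


def has_outcome(discharge_codes, outcome):
    """Return True if any discharge diagnosis code is in the outcome's code set."""
    wanted = OUTCOME_CODES[outcome]
    # Normalize whitespace and case so " p22.0" still matches "P22.0".
    return any(code.strip().upper() in wanted for code in discharge_codes)


# Made-up discharge code lists for two infants.
infant_a = ["Z38.00", "P22.9"]
infant_b = ["Z38.00"]
print(has_outcome(infant_a, "RDS"), has_outcome(infant_b, "RDS"))
```

Counting an outcome when any one of several candidate codes is present favors capturing all potential cases over specificity, which is the trade-off the text describes.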

Eight predictor variables that would be available during the initial hospitalization and that have been previously associated with one or more of the short-term outcomes were recorded for each mother-infant dyad. These included gestational age, birthweight, gender, race, mode of delivery, being small for gestational age (SGA) and 1- and 5-min Apgar scores.

The study was approved as exempt by the Virginia Commonwealth University IRB.

Analysis

Because of the well-documented variability in Apgar scoring across countries [11], individuals [12] and by newborn conditions [13], we chose to use cut-off scores at both 1 and 5 min. The most common definition of significance in the 2018–19 review was a total score of less than or equal to 6 [7], while a score of less than or equal to 3 has been commonly used to identify possible asphyxia [14]. We therefore defined four Apgar score groups: 1L (1 min ≤ 6), 1VL (1 min ≤ 3), 5L (5 min ≤ 6), and 5VL (5 min ≤ 3). We did not use any scores beyond 5 min because they were generally only assigned when the previous scores were low, which creates a likely selection bias for these data.
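The four group definitions can be written out as a short sketch. The function name `apgar_groups` and its argument names are hypothetical; only the thresholds (≤ 6 and ≤ 3 at 1 and 5 min) come from the text.

```python
def apgar_groups(apgar_1min, apgar_5min):
    """Return the set of low-Apgar group labels that apply to one infant.

    Thresholds follow the definitions in the text: L means a score <= 6 and
    VL means a score <= 3, evaluated separately at 1 and 5 minutes.
    """
    groups = set()
    if apgar_1min <= 6:
        groups.add("1L")
    if apgar_1min <= 3:
        groups.add("1VL")
    if apgar_5min <= 6:
        groups.add("5L")
    if apgar_5min <= 3:
        groups.add("5VL")
    return groups


# Example: a 1-min score of 3 that recovers to 7 by 5 min.
print(sorted(apgar_groups(3, 7)))
```

Because the thresholds are nested, every infant in a VL group also belongs to the corresponding L group, so the four groups overlap rather than partition the cohort.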

The sensitivity, specificity, negative (NPV) and positive (PPV) predictive values, and odds ratio for the four low Apgar scores were calculated for each short-term morbidity. Multivariable logistic analysis was performed for each morbidity using each of the low Apgar scores together with gestational age by week, gender, race, mode of delivery, and SGA status. Odds ratios, 95% confidence intervals, and p values were calculated for all models. Receiver operator characteristic (ROC) curves for the multivariable models were calculated with and without each low Apgar score, and the differences in the area under the curve (AUC) when each was included or omitted were calculated. For comparison, a similar analysis was done using the Apgar score first and then adding in the clinical factors. The statistical significance of the differences between these AUCs was calculated using DeLong’s test [15].
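For readers less familiar with these measures, the following sketch shows how they are derived from a 2×2 table of low Apgar score (yes/no) against outcome (yes/no). It uses the standard textbook definitions; the function name `diagnostic_metrics` and the counts in the example are hypothetical and are not study data.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard 2x2 measures for a binary test against a binary outcome.

    tp: low score and outcome present     fp: low score, outcome absent
    fn: normal score, outcome present     tn: normal score, outcome absent
    Assumes every cell is non-zero; no continuity correction is applied.
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "odds_ratio": (tp * tn) / (fp * fn),
    }


# Made-up counts for illustration only.
for name, value in diagnostic_metrics(tp=30, fp=70, fn=20, tn=880).items():
    print(f"{name}: {value:.3f}")
```

With a rare outcome, as in this made-up table, the NPV stays high and the PPV stays low almost regardless of the test, which is why predictive values depend on incidence.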

Results

Figure 1 is a consort diagram of the study, which began with 17,135 mothers in the Labor and Delivery database from September 1, 2013 to March 30, 2020. Of these, 16,703 had complete EMR data for the newborns. From this group we excluded 533 with missing discharge diagnosis codes, 372 with a variety of congenital anomalies that might impact the outcomes, 189 who were <23 weeks gestation, and 67 who died in the delivery room. The final cohort consisted of 15,542 (90.7%) infants. The median length of stay was 2 days (IQR 2–3 days).
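The cohort arithmetic in the paragraph above can be checked directly. All counts are taken from the text; the variable names are hypothetical.

```python
# Counts as reported in the text.
deliveries_in_database = 17_135
with_complete_emr = 16_703

exclusions = {
    "missing discharge diagnosis codes": 533,
    "congenital anomalies": 372,
    "<23 weeks gestation": 189,
    "died in the delivery room": 67,
}

# Final cohort = infants with complete EMR data minus all listed exclusions.
final_cohort = with_complete_emr - sum(exclusions.values())
percent_of_database = round(100 * final_cohort / deliveries_in_database, 1)

print(final_cohort, percent_of_database)
```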

Fig. 1: Consort diagram for study.

Of the 17,135 deliveries listed in the service database, 905 were eliminated, due primarily to missing data, leaving the 15,542-subject data set.

Table 1 presents the incidence of each short-term outcome for the study cohort, which ranged overall from 7.3% for MAS to 0.4% for NEC, along with the incidence of each potential risk factor for each outcome. By univariate analysis of clinical factors, gender was a significant risk factor for TTNB, RDS, and HypoG; race for all outcomes except HIE; gestational age for all outcomes; mode of delivery for all outcomes except MAS; and SGA status for TTNB, HypoG, and IVH. Focusing on the Apgar score, as noted in Table 2, the four different low Apgar scores were significantly associated with the ten outcomes in 38 of the 40 scenarios. The only exceptions were TTNB and HypoG for 5VL.

Table 1 Demographics and Low Apgar Score distribution for all subjects and short-term outcome (%).
Table 2 Sensitivity, Specificity, Positive (PPV) and Negative (NPV) Predictive Values and Odds Ratio for Low Apgar Scores for discharge diagnosis outcomes.

To examine whether a low Apgar score remained a significant risk factor when other clinical risk factors (gestational age, gender, race, mode of delivery, and SGA status) were accounted for, multivariable logistic models were created with each of the four low Apgar scores for each outcome. As shown in Table 3, in this analysis, there were 11 scenarios where a low score was not significant. For HypoG, only 5VL remained significant, and for NEC, none of the Apgar scores remained significant. In addition, a 1L score was not significant for BPD or ROP, a 5VL score was not significant for IVH and the 5L score was not significant for TTNB.

Table 3 Odds ratio for Low Apgar Score in multivariable logistic models for each short-term outcome. Other variables were gestational age, gender, race, mode of delivery, and SGA status.

To further define how much a low Apgar score contributes to the risk identification of newborn infants for the ten outcome diagnoses, ROC curves were created for each diagnosis using multivariable equations incorporating the clinical factors listed above, with and without each of the 4 low Apgar scores, and the significance of the difference in the AUC when the low Apgar score was present or absent was determined. Birthweight was not included, as gestational age and SGA status were, in combination, stronger contributors to the final model. We also examined the AUC for the ROC curves created using the Apgar score first and then adding the clinical factors. In Tables 4 and 5, for each Apgar score, the upper rows are for the condition where the score is added to the ROC constructed from the clinical factors, while the second group of rows shows what happens to the AUC when the clinical factors are added to the model constructed first with the Apgar score. When the 1 min Apgar score was added (Table 4), the AUC increased significantly for HIE at both the L and VL levels and for RDS and MAS at the L level (score ≤ 6). There was no significant effect for the remaining outcome/score combinations. The average change in AUC was 3.92% (CI 0.60 to 7.25). In contrast, adding the clinical factors to the Apgar score curve increased the AUC significantly for all the outcomes and far more substantially (average 27.64%, CI 22.81 to 32.47). The results for the 5 min L and VL scores (Table 5) are similar. Adding 5 min scores to clinical factors changed the AUC of the ROCs by 1.836% (CI −0.3593 to 4.031), while adding clinical factors to the 5 min score increased the AUCs by 45.01% (CI 36.93 to 53.09). Figure 2 illustrates the difference in the ROC curves when the addition of a low Apgar score is significant (HIE) and when it is not (sepsis).
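As an illustration of the comparison described above, the sketch below computes the AUC of two sets of predicted risks for the same infants, one from a model without and one from a model with a low Apgar indicator, and reports the difference. It uses the rank-based (Mann-Whitney) formulation of the AUC, which is not necessarily how the study software computed it. The function name `auc` and all scores and labels are hypothetical, and the significance test (DeLong's test) is not implemented here.

```python
def auc(scores, labels):
    """AUC as the probability that a random case outranks a random non-case.

    scores: predicted risks from a fitted model
    labels: 1 if the infant had the outcome, 0 otherwise
    Tied scores count as half a win. Assumes both classes are present.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for c in cases:
        for n in controls:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Made-up predicted risks for six infants, two of whom had the outcome.
labels = [0, 0, 0, 0, 1, 1]
risk_clinical_only = [0.10, 0.40, 0.35, 0.80, 0.45, 0.70]
risk_with_low_apgar = [0.10, 0.30, 0.35, 0.50, 0.60, 0.70]

auc_without = auc(risk_clinical_only, labels)
auc_with = auc(risk_with_low_apgar, labels)
print(auc_without, auc_with, auc_with - auc_without)
```

An AUC of 0.5 corresponds to ranking cases and non-cases no better than chance, so the quantity of interest is how far adding a predictor moves the AUC above what the other factors already achieve.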

Table 4 AUC of ROC of clinical factors without and with low Apgar scores (top 4 rows of 1VL and 1L sections) and for ROC of low Apgar score without and with clinical factors (bottom 4 rows of 1VL and 1L sections).
Table 5 AUC of ROC of clinical factors without and with low Apgar scores (top 4 rows of 5VL and 5L sections) and for ROC of low Apgar score without and with clinical factors (bottom 4 rows of 5VL and 5L sections).
Fig. 2: ROC curves for multivariable logistic regression models.

A HIE without (blue/light gray) and with (red/dark gray) 1VL. B HIE without (blue/light gray) and with (red/dark gray) 1L. C Sepsis without (blue/light gray) and with (red/dark gray) 1VL. D Sepsis without (blue/light gray) and with (red/dark gray) 1L.

Discussion

The primary and predominant purpose of the Apgar score has been to assess the status of an infant in the first few minutes after birth [1, 7]. The rationale for such a scoring system is based on the understanding that difficulties in the transition to extrauterine life are associated with worse outcomes for the newborn, so identifying these babies could lead to interventions that mitigate those outcomes. This is supported by the observation that the adoption of the Apgar score did not become widespread, and then universal, until there was evidence that low scores occurred far more frequently in babies who either died or had neurological deficits in the first year of life [16, 17].

Over the decades, the score has consistently been used as a risk factor in clinical studies [7]. It has been associated not only with an increased incidence of long-term neurological conditions, including cerebral palsy and seizures [18], but also with a wide variety of conditions such as attention deficit disorder/hyperactivity [19], permanent dentition [20], cancer [21], food allergy [22], autism spectrum disorder [23], polycystic kidney disease [24] and amblyopia [25]. The Apgar score is used as often for research into morbidities that manifest in the post-natal period, including all the discharge diagnoses used as short-term outcomes in this current study [9,26,27,28,29,30,31,32,33,34]. Short term outcomes have also been used for all studies that have examined modifications or replacements for the Apgar, and any future such efforts are likely to do the same [35,36,37]. It is noteworthy that the NRP does not use the 1- or 5-min Apgar score.

This study used a range of morbidities occurring during the initial hospital stay to determine, first, whether a low Apgar score is more frequent in babies who were given these diagnoses than in those without the conditions. We confirmed previously reported associations between the clinical risk factors and the various short-term morbidities, and we confirmed that low scores at 1 or 5 minutes were significantly associated with the outcomes examined. Of the 40 combinations of 10 outcomes and 4 low Apgar score groups, low scores were significant in 38; the exceptions were the 5VL score for HypoG and TTNB. When other clinical factors were included in the analysis (Table 3), low scores lost their significance in an additional 9 outcome/score groups. 1L was significant for six, 5VL for eight, and 5L for six outcomes respectively.

The AUC value of the ROC is often used to assess the clinical value of a predictive model [38], with higher values above 0.5 indicating a better model. We have used this to further analyze how much the presence of a low Apgar score contributes to identifying newborn infants who will go on to have one of the short-term outcomes included in this study. This confirmed that low Apgar scores can make a major and significant contribution in predicting HIE, which is not surprising since low scores are often part of the diagnosis [4], and this finding supports the validity of the analytic method. Overall, the addition of a low Apgar score added little to the inclusive predictive model. Apart from HIE, it was statistically significant only for the 1L Apgar score for RDS and MAS. Otherwise, it improved the AUC by less than 3.5%, and in many cases by less than 1%. In contrast, the addition of clinical factors to ROCs constructed from Apgar scores alone increased the AUC by 14–86%, indicating that the Apgar score does not contribute as much to identifying newborns at risk for short-term morbidities as the other clinical factors combined.

There are several significant limitations to this study that should be noted. The study used retrospective data from a single center. The ten outcomes had a wide range of incidences, which can affect predictive values, and several (specifically RDS, BPD, ROP, IVH, and NEC) are associated primarily with prematurity, although previous studies have included Apgar scores in risk assessments for these conditions [27, 31, 33]. We chose to include all gestational ages in our analysis because all newborns receive an Apgar score. Additionally, even though the incidence of several of the prematurity-related outcomes is very low in the term subjects, the proportion of cases made up of term infants is high. For example, for RDS, the incidence in term infants was 1.5% compared to 63% in those <33 weeks, but because there were 25 times as many term as very preterm subjects, term infants made up 29% of those with RDS. Prematurity was also a cofactor used in the multivariable analyses to help account for its influence. The accuracy of discharge diagnosis codes has been questioned [39]. As one example, we found several instances where codes for both TTNB and RDS were assigned to the same subject. Our goal was to ensure that all potential cases were captured, so we used a wide range of codes. As a result, for some short-term outcomes, such as MAS, there was a high incidence. Additionally, other risk factors such as maternal age or maternal chorioamnionitis were not included. We chose to use the two most common [7] cut-off values at 1 and 5 minutes rather than all the Apgar scores from 0 to 10 to account for some of the known variability in scoring and to capture a sufficient number of subjects per outcome to analyze. Other investigators have used the complete Apgar scale, usually in long term outcome studies involving over one million subjects [9]. Finally, Dr. Apgar designed her system to assess the status of the infant immediately after birth. Starting in 1966 [16], however, it has been used as a risk factor hundreds of times.

Strengths include a larger number of subjects than most studies that have examined the Apgar score in relation to short-term outcomes. We looked at the common ways of assessing a risk factor, such as sensitivity, specificity, positive and negative predictive value, and the odds ratio within the context of a multivariable analysis, as well as the AUC of the ROC graphs; adding the AUC analysis with and without the low Apgar score is a way to directly assess the question of its utility.

The Apgar score has been assigned to an estimated three billion or more newborn infants around the world over the last 70 years. During that time, concerns have been repeatedly raised about it. Yet it remains an important tool in the delivery room for assessing the immediate condition of the newborn. It appears to have good utility for assessing risks of long-term outcomes when applied to large populations, but our findings suggest that it is not a significant contributor to identifying newborn infants who would benefit from a higher level of care because of the risk of short-term outcomes.