Introduction

Infants born preterm are physiologically and metabolically immature and have higher rates of morbidity and mortality, and poorer long-term neurodevelopmental outcomes than those born at term [1]. Amongst other issues, they are at risk of apnea of prematurity [2] and intermittent hypoxemia [3], which result in a decrease in oxygen saturation and bradycardia and have been associated with increased risk of neurodevelopmental impairment [4, 5]. Rates of apnea are correlated with the degree of prematurity, occurring most frequently in extremely preterm infants, though late preterm infants are also affected [2]. Late preterm infants also experience frequent episodes of intermittent hypoxemia [3] and poorer neurodevelopmental outcomes than term-born infants [6].

Methylxanthines are respiratory stimulants that have been used in preterm neonates for decades to both prevent and treat apnea of prematurity and to facilitate extubation [7]. Caffeine is a naturally occurring methylxanthine used extensively worldwide for hundreds of years for its central nervous system stimulant properties [7]. Caffeine and other methylxanthines, such as theophylline, have been used in the treatment of apnea in newborn infants since the 1970s [8]. The precise mechanism by which methylxanthines improve respiratory function continues to be debated, but caffeine is known to stimulate the respiratory center in the medulla by antagonizing adenosine A1 and A2A receptors, increasing sensitivity and response to carbon dioxide and PO2 and enhancing diaphragmatic function [9]. Caffeine is now used in preference to other methylxanthines due to its wider therapeutic window and longer duration of action in neonates, which allow for daily dosing and remove the need for therapeutic drug monitoring [10, 11].

Despite this longstanding clinical use there remain several evidence gaps, including indications for treatment, dosing regimen, the most appropriate patient population, and the short- and long-term effects of caffeine therapy [12]. The aim of this systematic review was to assess the effectiveness of caffeine in reducing the rate or occurrence of apnea and reducing long‐term neurodevelopmental impairment in preterm infants (<37 weeks’ post-menstrual age [PMA]). A secondary aim was to assess if there is any difference in these outcomes between caffeine given at standard doses (≤10 mg·kg−1 caffeine citrate equivalent) and high doses (>10 mg·kg−1 caffeine citrate equivalent).

Methods

This systematic review was guided by the Cochrane Handbook for Systematic Reviews of Interventions [13] and is reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement [14]. Prior to the literature search being conducted, the protocol was registered with the Prospective Register of Systematic Reviews (PROSPERO, CRD42020154678).

We included all randomized controlled trials (RCTs) in preterm infants (<37 weeks’ PMA) of caffeine (at any dose and for any reason) vs. placebo or no treatment (comparison one), or high-dose caffeine (>10 mg·kg−1 caffeine citrate equivalent) vs. low-dose caffeine (≤10 mg·kg−1 caffeine citrate equivalent) (comparison two), which reported one or more prespecified outcomes. We included published studies and those published in abstract form if they included sufficient information to confirm eligibility and allow Grading of Recommendations Assessment, Development and Evaluation (GRADE) [15]. We did not include observational or non-randomized studies. No limit was placed on year of publication, and studies in any language were included and translated if an English abstract was available for the initial screening stage.

We reported outcomes across four developmental epochs: neonatal/infancy (<1 year of age), early childhood (ages 1–5 years), middle childhood (ages 6–11 years) and adolescence (ages 12–19 years). If longitudinal studies reported multiple assessments of an outcome within the epoch, the last reported assessment in each epoch was included in the analysis.
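The epoch rule above can be sketched in code. This is a minimal illustration, not part of the review's methods: the function names are hypothetical, and treating each age band as a half-open interval (so that a child aged 5.9 years falls in early childhood) is an assumption.

```python
from typing import Optional

# Epoch boundaries in years, taken from the text. The half-open intervals
# [lower, upper) are an assumption about how fractional ages are handled.
EPOCHS = [
    ("neonatal/infancy", 0.0, 1.0),   # <1 year of age
    ("early childhood", 1.0, 6.0),    # ages 1-5 years
    ("middle childhood", 6.0, 12.0),  # ages 6-11 years
    ("adolescence", 12.0, 20.0),      # ages 12-19 years
]


def epoch_for_age(age_years: float) -> Optional[str]:
    """Return the developmental epoch for an age, or None if out of range."""
    for name, lower, upper in EPOCHS:
        if lower <= age_years < upper:
            return name
    return None


def last_assessment_per_epoch(assessments):
    """Keep only the latest (age_years, value) assessment within each epoch."""
    latest = {}
    for age, value in assessments:
        epoch = epoch_for_age(age)
        if epoch is None:
            continue  # outside the four epochs defined in the text
        if epoch not in latest or age > latest[epoch][0]:
            latest[epoch] = (age, value)
    return latest
```

For a hypothetical longitudinal study with assessments at 1.5 and 5 years, only the 5-year assessment would be carried into the early childhood analysis.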

The primary outcome for the neonatal/infant epoch was apnea, defined as a pause in breathing of ≥20 s, or <20 s with bradycardia (heart rate <100 beats per minute [bpm]), cyanosis or pallor [16], or as per author definitions. For all other epochs, the primary outcome was neurocognitive impairment, defined by authors, using standardized tests appropriate for age.
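The default apnea definition can be written as a simple predicate. This is a sketch only: the function and parameter names are hypothetical, and it does not cover trials where author definitions were used instead.

```python
def is_apnea(pause_seconds, heart_rate_bpm=None, cyanosis=False, pallor=False):
    """Apply the default definition from the text to a single breathing pause.

    A pause of >=20 s counts on its own; a shorter pause counts only when
    accompanied by bradycardia (heart rate <100 bpm), cyanosis or pallor.
    """
    if pause_seconds <= 0:
        return False  # no pause recorded
    if pause_seconds >= 20:
        return True
    bradycardia = heart_rate_bpm is not None and heart_rate_bpm < 100
    return bradycardia or cyanosis or pallor
```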

Secondary outcomes for the neonatal/infant epoch included bronchopulmonary dysplasia (BPD), defined as ongoing requirement for oxygen or respiratory support at 36 weeks’ PMA; intermittent hypoxemia, expressed as events per hour and defined as a fall in oxygen saturation (SpO2) of 10% or more from baseline, or as defined by authors; retinopathy of prematurity (ROP) Stage III or worse [17]; intraventricular hemorrhage (IVH) grade III or IV [18]; patent ductus arteriosus (PDA), defined as use of medical or surgical treatment for ductal closure; tachycardia, defined as mean heart rate ≥160 bpm or as per authors; duration of mechanical ventilation; duration of positive pressure support; growth velocity, including weight gain (g·kg−1·day−1), linear growth (cm·week−1) and head growth (cm·week−1) to 36 weeks’ PMA (or as defined by authors); death; survival without neurosensory impairment (including, but not limited to, deafness, blindness and cerebral palsy); and time to establish full enteral feeds (as defined by authors).

For all other epochs, secondary outcomes included: motor impairment, defined by authors using standardized tests appropriate for age; hearing impairment, defined as requiring one or more hearing aids or worse, or as per authors; visual acuity less than 1 LogMAR, or as per authors; death; survival without neurosensory impairment, including, but not limited to, deafness, blindness, death and cerebral palsy; emotional-behavioral difficulties, as defined by authors; cerebral palsy; chronic lung disease, defined as physician-diagnosed asthma or ≥2 episodes of parent-reported wheeze, or as per authors; and height and weight expressed as Z-scores.

Search strategy

We searched PubMed, Medline, Embase, the Cumulative Index to Nursing and Allied Health Literature (CINAHL Plus) and the Cochrane Central Register of Controlled Trials (CENTRAL) databases from inception to 11 July 2022 using relevant MeSH terms and keywords (caffeine and premature/ prematurity/ preterm/ low birthweight and variations). The search was limited to studies involving humans, with no limit on year of publication or language. No limits on study type were applied at the initial search stage. We also searched the World Health Organization International Clinical Trials Registry Platform (ICTRP) (who.int/ictrp/search/en/), the US National Library of Medicine Clinical Trials Registry (clinicaltrials.gov), and the Australian New Zealand Clinical Trials Registry (ANZCTR) (anzctr.org.au) for any additional trials meeting the inclusion criteria not located through the above searches. Where results of trials were not available in the public domain, we contacted the authors listed in the trial registration to confirm the status of the trial, and whether any results were available for inclusion. We hand-searched bibliographies of included studies, review papers and conference abstracts to identify any additional studies. Covidence (Covidence Systematic Review Software, Veritas Health Innovation, 2020) was used to manage search results and screen studies for inclusion.

Study selection

Two review authors independently screened all retrieved titles and abstracts to assess eligibility for inclusion. The full text of all potentially relevant studies was retrieved and assessed independently by two authors to determine eligibility. Any disagreements were resolved by mutual discussion and consultation with a third author if required. Summary characteristics of each study were extracted and tabulated.

Data extraction, bias, and quality assessment

Two authors independently extracted data from all included studies using a prespecified data form. Any discrepancies were resolved by mutual discussion and consulting a third author if required. Additional information was sought from study corresponding authors if information was unclear or not published.

Two review authors independently assessed the risk of bias (RoB) of all included trials using the Cochrane RoB tool [19] for the following domains: sequence generation (selection bias); allocation concealment (selection bias); blinding of participants and personnel (performance bias); blinding of outcome assessment (detection bias); incomplete outcome data (attrition bias); selective reporting (reporting bias); any other bias. Any disagreements were resolved by mutual discussion or consulting a third author if necessary. For one study, where EO, JA, and CM were investigators, an alternative independent colleague (AW) with no association to the study conducted the data extraction and RoB assessment in conjunction with SH.

Review Manager (RevMan version 5.4.1. The Cochrane Collaboration, 2020) was used to summarize and analyze the data. Meta-analysis using fixed effects was performed if data from >2 RCTs were available. Apnea was reported using different measures that precluded a single meta-analysis; therefore, apnea was analyzed both as a dichotomous and continuous variable. We calculated the risk ratio (RR) for dichotomous outcomes and mean difference (MD) for continuous outcomes, with confidence intervals (CI) of 95%. If data were reported as median and interquartile range, means and standard deviations were estimated [20]. Planned secondary analyses included subgroup analysis by indication for caffeine and gestation length. Statistical heterogeneity was defined as an I2 > 50% and low p-value for the Chi-Square test, and categorized according to GRADE guidelines [15]. Methodological causes of heterogeneity were explored via subgroup analysis and sensitivity analysis, excluding studies at high risk of bias.
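The pooling described above can be illustrated with a short sketch. Several points are assumptions rather than the review's actual computation: it uses generic inverse-variance weighting on the log risk ratio, which may differ from the weighting RevMan applies to dichotomous outcomes; it does not handle trials with zero events; the median/interquartile range conversion is one commonly used approximation and not necessarily the method of reference [20]; and all function names are hypothetical.

```python
import math


def log_rr_and_se(events_trt, n_trt, events_ctl, n_ctl):
    """Log risk ratio and its standard error for one trial (no zero cells)."""
    log_rr = math.log((events_trt / n_trt) / (events_ctl / n_ctl))
    se = math.sqrt(1 / events_trt - 1 / n_trt + 1 / events_ctl - 1 / n_ctl)
    return log_rr, se


def pool_fixed_effect(trials, z=1.96):
    """Inverse-variance fixed-effect pooled RR with 95% CI, Cochran's Q and I2.

    trials: list of (events_trt, n_trt, events_ctl, n_ctl) tuples.
    """
    estimates = [log_rr_and_se(*trial) for trial in trials]
    weights = [1 / se ** 2 for _, se in estimates]
    total_w = sum(weights)
    pooled = sum(w * y for w, (y, _) in zip(weights, estimates)) / total_w
    se_pooled = math.sqrt(1 / total_w)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, (y, _) in zip(weights, estimates))
    df = len(trials) - 1
    # I2: share of variability beyond chance, floored at 0%.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return {
        "rr": math.exp(pooled),
        "ci": (math.exp(pooled - z * se_pooled), math.exp(pooled + z * se_pooled)),
        "q": q,
        "i2": i2,
    }


def mean_sd_from_median_iqr(q1, median, q3):
    """Approximate mean and SD from quartiles, assuming near-normal data."""
    return (q1 + median + q3) / 3, (q3 - q1) / 1.35
```

Called with made-up counts such as `pool_fixed_effect([(10, 100, 20, 100), (30, 100, 30, 100)])`, the function returns a pooled RR between the two trial estimates and an I2 value that could then be compared with the 50% threshold.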

Outcomes were classified by all authors according to their importance for decision-making using GRADE classifications (7–9 critical, 4–6 important but not critical, 1–3 less important) [15]. Certainty of the evidence was assessed using the GRADE framework [15] and agreed by all authors. Imprecision was assessed using optimal information size (OIS) assuming alpha 0.05 and beta 0.2 [21] and considered serious if the total number of participants was less than the OIS for the outcome, or very serious if total participants numbered less than half the OIS. For continuous outcomes we assumed alpha 0.05 and beta 0.2, and delta 0.33.
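As an illustration of the imprecision rule, the sketch below computes an OIS with the conventional two-group sample size formulas and then applies the thresholds stated above. The review does not spell out its formula, so several points are assumptions: the choice of formula, the reading of delta 0.33 as a standardized mean difference, a two-sided alpha, and all function names and example inputs.

```python
import math
from statistics import NormalDist


def _z_sum(alpha=0.05, beta=0.2):
    """z for two-sided alpha plus z for power (1 - beta)."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(1 - beta)


def ois_continuous(delta=0.33, alpha=0.05, beta=0.2):
    """Total OIS (both groups) for a standardized mean difference delta."""
    per_group = math.ceil(2 * _z_sum(alpha, beta) ** 2 / delta ** 2)
    return 2 * per_group


def ois_dichotomous(control_risk, relative_risk_reduction, alpha=0.05, beta=0.2):
    """Total OIS (both groups) for a given control risk and relative reduction."""
    p1 = control_risk
    p2 = control_risk * (1 - relative_risk_reduction)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    per_group = math.ceil(_z_sum(alpha, beta) ** 2 * variance / (p1 - p2) ** 2)
    return 2 * per_group


def imprecision_rating(total_participants, ois):
    """Apply the thresholds from the text: <OIS serious, <OIS/2 very serious."""
    if total_participants < ois / 2:
        return "very serious"
    if total_participants < ois:
        return "serious"
    return "not serious"
```

Under these assumptions, a continuous outcome with delta 0.33 needs 145 participants per group, so a pooled sample of 200 participants would be rated as having serious imprecision.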

Study characteristics and results were tabulated, and forest plots were generated for all comparisons where data were available.

Results

Literature search and study selection

Our search identified 6509 studies (Fig. 1). Following the removal of 3542 duplicates, 2968 studies were screened and 2801 excluded. The full text of 159 papers was reviewed, resulting in the inclusion of 15 studies in the final review.

Fig. 1: Preferred Reporting Items for Systematic Reviews and Meta-Analyses flow diagram of study selection.

Study characteristics

We identified 15 eligible RCTs enrolling a total of 3530 premature infants. Most trials enrolled infants born at <32 weeks’ PMA [22,23,24,25,26,27,28,29,30], although some included infants up to 35 [31] or 36 [32] weeks, or defined eligibility based on birthweight [24, 33,34,35] or clinical decision to treat with caffeine [36]. Eight trials compared caffeine to placebo or no treatment [22,23,24,25, 31, 33,34,35]. Seven trials compared different doses of caffeine [26,27,28,29,30, 32, 36], including one [32] with four different dosing arms and a placebo arm, which contributed to both comparisons. Trials were widely geographically located and all except one [32] enrolled only infants in neonatal units. Most trials were small, with only one enrolling more than 300 infants [33] (Table 1). Eight of the included trials had high RoB in one or more domains [24, 25, 28, 30,31,32, 34, 35], especially for ‘incomplete outcome data’ (Table 2). All included studies reported at least one outcome for the neonatal/infant epoch. Two studies [33, 36] reported outcomes in early childhood, and only one study [33] reported outcomes in middle childhood. No studies reported results in adolescence.

Table 1 Study characteristics.
Table 2 Overall risk of bias of included studies.

Caffeine vs. placebo/no treatment

Primary outcome

Neonatal/infancy: For the primary outcome of apnea (dichotomous), evidence of very low certainty from five trials showed possible benefit from receiving caffeine compared to placebo or no treatment (risk ratio [RR] 0.59, 95% confidence interval [CI] 0.46, 0.75, 453 infants) (Table 3) [23, 24, 32, 34, 35].

Table 3 GRADE summary of findings for caffeine vs placebo comparison.

There was statistical heterogeneity (I2 = 78%) among trials, although the direction of effect consistently favored caffeine (Fig. 2). In sensitivity analysis, exclusion of two trials at high risk of bias [24, 34] did not substantially alter the results (RR 0.62, 95% CI 0.50, 0.77, three trials, 263 infants).

Fig. 2: Forest plots of the neonatal/infant primary outcome, and critical and selected important secondary outcomes.

(a) Apnea results are presented as a dichotomous measure (for the caffeine vs. placebo comparison) or a continuous measure (for the high vs. low-dose comparison), based on how apnea was measured in the majority of studies in each comparison. The forest plot for the alternate measure for each comparison is presented in Fig. 3. (b) Death before one year of age was also considered a critical outcome, but only one study reported this measure (in the low vs. high-dose comparison). These data are included in Fig. 3, with other secondary outcomes.

Early childhood: For the primary outcome of neurocognitive impairment, evidence of low certainty from one trial could not exclude clinical benefit or harm from receiving caffeine compared to placebo (RR 0.98, 95% CI 0.63, 1.51, 1518 children) (Table 3) [37].

Middle childhood: For the primary outcome of neurocognitive impairment, evidence of moderate certainty showed possible benefit from receiving caffeine compared to placebo (RR 0.84, 95% CI 0.71, 1.01, one trial, 920 children) (Table 3) [38].

There were no data for the primary outcome of neurocognitive impairment in adolescence.

Secondary outcomes

Moderate certainty evidence indicated probable clinical benefit of receiving caffeine compared to placebo or no treatment for BPD (RR 0.77, 95% CI 0.69, 0.86, three trials, 2059 infants, I2 = 31%) and PDA (RR 0.67, 95% CI 0.60, 0.74, four trials, 2242 infants, I2 = 0%) in the neonatal/infant epoch, and for motor impairment in middle childhood (RR 0.72, 95% CI 0.57, 0.91, one trial, 930 infants) (Table 3). Caffeine therapy may reduce neurocognitive impairment and cerebral palsy (Table 3). It is possible that caffeine reduces weight gain velocity after birth, but it does not appear to affect body size in childhood (Table 3). The evidence was too uncertain to determine the effect of caffeine on intermittent hypoxemia, respiratory support, feeding, other major neonatal morbidities, death, other developmental outcomes in childhood, and asthma/wheeze (Table 3; Fig. 3).

Fig. 3: Forest plots of additional neonatal/infant secondary outcomes.

Secondary analysis

There were insufficient data to undertake the planned subgroup analyses.

High-dose vs. low-dose caffeine

Primary outcome

Neonatal/infancy: For the primary outcome of apnea (continuous), evidence of very low certainty from four trials showed possible benefit from receiving high-dose caffeine compared to low-dose caffeine, although the effect size was small (mean difference [MD] −0.2, 95% CI −0.3, −0.2, 560 infants) (Table 4) [27, 28, 30, 36]. There was statistical heterogeneity (I2 = 87%) among trials, although the direction of effect consistently favored high-dose caffeine (Fig. 2). In sensitivity analysis, exclusion of one trial at high risk of bias [28] did not alter the results (MD −0.2, 95% CI −0.3, −0.2, 530 infants, I2 = 81%).

Table 4 GRADE summary of findings for low vs high-dose caffeine comparison.

Other epochs: No trials of high-dose vs. low-dose caffeine reported on neurocognitive impairment.

Secondary outcomes

Moderate certainty evidence from four trials showed probable benefit for BPD with high-dose vs. low-dose caffeine (RR 0.71, 95% CI 0.55, 0.91, 586 infants, I2 = 0%) (Table 4). Evidence of very low certainty from seven trials suggested that high-dose vs. low-dose caffeine may increase the rate of tachycardia (RR 2.29, 95% CI 1.41, 3.72, 839 infants, I2 = 0%) (Table 4). The evidence was too uncertain to determine the effect of high-dose vs. low-dose caffeine on other neonatal outcomes (Table 4; Fig. 3). For the critical outcome of survival without neurosensory impairment in early childhood, low certainty evidence from one trial meant that a benefit of high-dose vs. low-dose caffeine could not be excluded (RR 0.92, 95% CI 0.82, 1.03, 236 children) (Table 4).

Secondary analysis

There were insufficient data to undertake the planned subgroup analyses.

Discussion

Currently, there is no high-certainty evidence for use of caffeine in preterm neonates for any critical or important outcomes from birth to adolescence. However, in very preterm neonates, caffeine therapy probably reduces the rate of BPD and PDA; possibly increases survival without neurosensory impairment in early childhood and reduces cerebral palsy; and probably reduces the rate of neurocognitive impairment and motor impairment in middle childhood. Although traditionally given for apnea of prematurity, the evidence supporting this benefit of caffeine was of very low certainty, given the considerable heterogeneity in contributing studies, RoB inherent in these studies and the relatively small number of infants for whom data are available.

In general, evidence for the relative effectiveness of high vs. low-dose caffeine is even less certain, but moderate certainty evidence indicates higher doses probably reduce the rate of BPD more than lower ones, and very low certainty evidence suggests higher doses may cause more tachycardia.

Quantifying the effect of caffeine on longer-term outcomes is limited by the available studies, with only two trials presenting any outcome data beyond the neonatal/infancy period (one in each comparison) and only one of those reporting significant follow-up assessments and results. As a result, meta-analysis was not possible in epochs beyond neonatal/infancy, and the certainty of the findings is limited. No information was available comparing the effects of high and low-dose caffeine on neurodevelopmental outcomes.

This review provides a current and comprehensive summary of the available literature on the use of caffeine in preterm infants and included 15 RCTs covering 3530 premature infants. In contrast to previous systematic reviews, we included all studies enrolling preterm infants (<37 weeks’ PMA), rather than limiting the population to infants born at earlier gestational ages [39,40,41]. This was because moderate and late preterm infants may experience apnea of prematurity [2] and are known to have episodes of intermittent hypoxemia [3], and so may also benefit from caffeine therapy, though the evidence in this area remains uncertain. Previous systematic reviews have addressed a single question (either caffeine vs. placebo, or high vs. low-dose regimens), rather than considering both together as in this review, and have often included trials of other methylxanthines, which are no longer routinely used, in addition to caffeine. Furthermore, these older systematic reviews did not apply the explicit and comprehensive GRADE criteria to the assessment of the quality of the evidence, and so have perhaps overstated the certainty of the evidence underlying their recommendations [10, 42]. Recently published Cochrane reviews present GRADE analysis for only a subset of outcomes [41, 43], whereas in this review GRADE analysis was performed for all outcomes with available data.

The Cochrane Neonatal Group have recently published reviews of caffeine dosing regimens in preterm infants [41] and of methylxanthines vs. placebo or no treatment [43]. However, the latter Cochrane review includes a substantial number of trials that used other methylxanthines (aminophylline and theophylline) that are no longer routinely used in clinical practice, and does not include some of the more recent trials of caffeine [24, 32, 34] included in this review. Both the Cochrane and other reviews of low-dose vs. high-dose caffeine therapy have concluded that higher doses of caffeine are [44] or may be [40, 41] more effective in reducing the occurrence of extubation failure. Analysis of the evidence for the important outcome of BPD has resulted in different conclusions in different reviews: either that higher doses reduce the rate of BPD compared with lower doses [39, 41, 44] or that higher doses do not alter the rate of BPD [40]. In contrast to previously published reviews [39, 41, 44], we pre-defined high (>10 mg·kg−1·day−1 caffeine citrate equivalent) and low doses (≤10 mg·kg−1·day−1) of caffeine on the basis of maintenance dose, avoiding cross-over of doses included in the comparison groups and hence producing a more meaningful comparison. This may explain the differences in findings, as some other reviews have included trials where the only difference in dose was in the loading dose [39, 40], or where both doses used would be considered low doses in current clinical practice [39]. We also included all trials where infants received caffeine, regardless of indication, as we wished to include caffeine given for the prevention of neurodevelopmental impairment, as well as solely for the prevention or treatment of apnea, or to assist in extubation.

As a systematic review, the robustness of the conclusions is limited by the quality and quantity of the included studies. The caffeine vs. placebo comparison identified and included a number of recent studies that have not previously been included in published meta-analyses [24, 32, 34], but some of these studies have domains with high risk of bias and there was a high degree of heterogeneity between studies, limiting the quality of the evidence. Furthermore, this comparison is dominated by a single study, which contributed over 2000 infants of the 2592 participants identified [45]. We had planned to undertake subgroup analysis to assess the effectiveness of caffeine based on the indication for use (prophylaxis, treatment of apnea or for extubation, late hypoxemia or established lung disease) and by gestation (extremely, very, moderately or late preterm) but were unable to undertake these analyses due to the lack of data broken down by these variables in the identified studies. The lack of data on the effectiveness of caffeine in these different subgroups remains an important evidence gap, and further research is needed to inform evidence-based decision-making in clinical care.

While caffeine is widely used in neonatal units, the evidence remains uncertain, and other reviews on the topic have called for further clinical trials in this area [40, 44, 46]. We join previous authors in this call for further research, and this systematic review indicates the evidence gaps where more information is required to guide clinical practice. In particular, there is a lack of data on long-term outcomes following different doses of caffeine in the neonatal period, and longer-term follow-up of infants in recent trials should be conducted to address this evidence gap. This is particularly important given the indications in this and other [39, 40, 44] meta-analyses that higher doses may be more effective in improving short-term outcomes, as use of higher doses in clinical practice should be preceded by evidence of the long-term safety of such doses. In addition to dose, more information is required on how the indication for treatment, infant gestation, duration of treatment, and timing of initiation and discontinuation influence outcomes. Whether caffeine should be used during mechanical ventilation, and whether the dose should be decreased in response to tachycardia or increased with gestational age, should also be ascertained.

Conclusion

Caffeine administered to preterm infants probably reduces BPD, PDA, and motor impairment, with higher doses probably conferring additional benefit in reducing BPD but possibly increasing the occurrence of tachycardia. However, most of the current evidence is of low certainty, and establishing the optimal dose requires more research, including long-term outcome assessment.