Main

Preterm birth, defined by the World Health Organization as the delivery of an infant before 37 completed weeks of gestational duration (1), is a major cause of infant mortality and long-term child morbidity in the United States (2,3,4). The search for the causes of preterm birth and its prevention has engaged multiple disciplines including the basic, clinical, and social sciences. While these research initiatives have documented associations between preterm birth and a wide range of biologic, demographic, social, and clinical influences (5,6,7), definitive insight into the primary etiologic pathways of preterm birth remains elusive (8).

There has been growing interest in the use of new analytic frameworks and techniques (9) to explore broad spatial and temporal patterns of preterm birth in the hope that these approaches might provide new clues to important etiologies or preventive opportunities (10,11). Often colloquially labeled “big data” analyses, these strategies take advantage of expanding computing power and machine learning techniques to illuminate relationships or patterns in very large genetic, clinical, or epidemiologic datasets (12,13). Among the most useful of these approaches are cluster analyses which have been employed to identify complex patterns in health service utilization (14) and a variety of health outcomes, including chronic respiratory disease (15), schizophrenia (16), and necrotizing enterocolitis (17).

This study assesses recent spatial and temporal trends in preterm birth in the United States. Of special concern is the exploration of patterns in these trends, particularly among different gestational age groupings, geographic regions, and social groups over time.

Results

Temporal Trends

Trends of preterm birth in the United States are presented in Figure 1. Total births of <37 wk of gestation rose steadily between 1981 and 2006, increasing 35.6% over this time period. However, after 2006, the percentage of births at <37 wk began to decline, falling 8.4% between 2006 and 2011. This trend was largely accounted for by changes in births of between 34 and 37 wk gestation. The percentage of births of <32 wk gestation was generally more stable over the study period but did rise 12.7% from 1981 through 2006 and then declined by 5.4% between 2007 and 2011.

Figure 1

Trends in preterm birth rates for the United States by gestational age groups <32 wk, 32–37 wk, and total births <37 wk gestation: 1981–2011.


Rates of preterm birth and trends for the 1995 to 2010 period and for the 2011 year for each of the 50 states and the District of Columbia are presented in Table 1. Considerable variation in trends over these periods was noted. States in the highest quintile of <34 wk birth rates in 1995 showed little change between 1995 and 2005 but averaged a 4.6% decline in this category between 2005 and 2010. States in the lowest quintile of <34 wk birth rates in 1995 experienced modest increases between 1995 and 2005 and a 3.0% decline in the 2005–2010 period.

Table 1 Trends in preterm births and birth rates: US states and the District of Columbia

The examination of gestational age trends between 2000 and 2010 revealed a general inverse relationship between the absolute rate of <34 wk births in 2000 and the percent reduction in this rate over the decade. However, this relationship was significant only for births to white women (P = 0.037). The association between initial absolute rates and subsequent trends was similarly weak when births <31 wk gestation were assessed independently.

The associations between trends for the <34 wk and 34–36 wk gestational age groups are presented in Figure 2. Significant associations were found for total (P < 0.05) (see Figure 2a), white (P < 0.05) (see Figure 2b), and African American births (P < 0.01) (see Figure 2c). Note that only states reporting at least 1,000 African American births were included in the figure and that the scale of % change was greater for African American births than that for whites. In addition, a significant relationship was also observed between trends in <34 wk gestation birth rates for white and African American births (P ≤ 0.001) (see Figure 2d).

Figure 2

The relationships between the percent change in gestational age <34 wk birth rates and the percent change in gestational age 34–36 wk birth rates between 2000 and 2010 in all US states and the District of Columbia and for White and African American births. Each data point represents one state or the District of Columbia. (a) Births in all states and the District of Columbia; (b) White births; (c) African American births; (d) Percent change in gestational age <34 wk birth rates for White births by percent change in gestational age <34 wk birth rates for African American births. Note that states reporting fewer than 500 African American births for any year during the study period are not plotted on the figure.


Gestational age-specific birth rates were then calculated for all US counties between the years 1971 and 2008. A total of 905 counties reported more than 10 births of <34 wk gestation annually for this period. However, total national counts of year-to-year changes in county preterm birth rates were examined only for the 1985–2008 period, the years for which all states reported county-based data. Figure 3 presents the smoothed, moving average annual change in the percent of births occurring <34 wk for the 905 counties. The number of counties reporting >5% increase in the birth rate of <34 wk infants was generally similar to the number of counties reporting >5% decrease. However, counties reporting relative stability in the <34 wk birth rate generally outnumbered those reporting >5% changes. The figure also suggests that some periodicity may exist in the number of counties reporting increased rates of <34 wk births. Significant peaks occurring at ~7-y intervals were noted.

Figure 3

Number of counties reporting an annual >5 percent and >10 percent increase in the birth rate of <34 wk gestation births: United States, moving 2-y average (total counties = 905).


Cluster Analysis

The k-means cluster analysis generated 10 patterns that best described the year-to-year changes in the distribution of births among nine gestational age groups. In this manner, each county was assigned to one of these characteristic clusters for each year during the study period. Figure 4 presents both the gestational age definitions generated by the k-means cluster analysis (Figure 4a) and the number of counties that were characterized by each of these 10 clusters for each year under study (Figure 4b). The colors were assigned randomly to each of the 10 clusters but are consistent in both Figure 4a and b. Figure 4a suggests considerable variation in the gestational age patterns generated by the cluster analysis, primarily in the gestational age groups >36 wk. Figure 4b documents that while the cluster analysis was structured to identify 10 unique clusters, only 3 clusters accounted for more than half of all the county-years. The prevalence of these three clusters, cluster 1 (color-coded as pink), cluster 4 (color-coded as orange), and cluster 7 (color-coded as tan), is shown in Figure 4b. Note that the y-axis for the prevalence of each of the clusters in the figure varies, given the wide variation in the number of counties falling into each cluster in any given year. Cluster 4 (orange) and cluster 7 (tan), which predominated in the earlier years under study, were characterized by relative increases in late preterm births. Cluster 1 (pink), which predominated in the latter years of study, was characterized by relative declines in mid and late preterm births. More broadly, the early years of the study were characterized by considerable heterogeneity in cluster assignments, with cluster 4 (orange) accounting for the largest number of counties. However, by the early 1980s, cluster 7 (tan) began to dominate the county assignments, creating a far more homogeneous cluster pattern. The 2000s were characterized by greater heterogeneity in county assignments. During this latter period, cluster 1 (pink) was the most common cluster assignment.

Figure 4

K-means cluster assignments by year: All counties, United States. (a) The gestational age definitions derived from the cluster analysis. (b) The prevalence of each cluster among US counties (note differences in prevalence scales for each cluster). Each color represents a different cluster and is consistently assigned in both a and b. The colors in descending order in the Figure are: pink, red, green, orange, purple, light purple, tan, light green, turquoise, and light blue.


The geographic distribution of the cluster assignments is depicted in Figure 5. Each of the national maps presents the cluster assignments (designated by the assigned colors) for 5 representative years. The map for 1972 reflects areas of missing data as county data were not available for a considerable number of states during the early years of the study period. These were omitted from the cluster analyses until data became available from all these states in the mid-1970s, except from Arizona which reported data beginning in the mid-1980s. Missing county data are depicted in the geo-temporal mapping as blank areas. However, for the majority of states reporting county data, there was substantial heterogeneity in cluster patterns throughout the United States. The maps for 1982, 1992, and 2002, on the other hand, depict the predominance of cluster 7 (tan) and cluster 1 (pink) throughout the country. This suggests that over this time period, the trends in gestational age distribution noted in national data reflected fairly generalized trends with relatively little geographic variation among different counties or regions. However, the map for 2008 reveals greater variation with cluster 2 (red), cluster 5 (purple), and cluster 9 (dark blue) growing in prevalence, particularly in parts of the Midwest.

Figure 5

Maps of county cluster assignments for selected years. (a) 1972; (b) 1982; (c) 1992; (d) 2002; (e) 2008. Colors represent one of the 10 cluster assignments as defined in Figure 4 (note gray areas represent counties that did not report sufficient data for the specified year).


Discussion

This study utilized four decades of national birth data and machine learning algorithms to examine spatial and temporal patterns in preterm births for the total population of the United States. States with high absolute rates at the beginning of the study period tended to have more precipitous declines over subsequent years; however, this was not uniformly observed as some states with high initial rates experienced little improvement. While early and late preterm birth rates may reflect different phenotypes or etiologies, our observations indicate that trends in these rates tend to move together within states. Moreover, while the disparity in preterm birth rates between African American and white births in any given state remains profound, the absolute rates also tend to move together over time. County-level birth rates of preterm infants also revealed some temporal synchrony with the suggestion of a possible multi-year periodicity. When machine learning techniques were used to assess temporal and spatial patterns in preterm births, a dynamic picture emerged characterized by periods of both heterogeneity and homogeneity among US counties.

There was considerable variation in state preterm birth rate trends over the study period. Overall, jurisdictions with the highest absolute preterm birth rates in 2000, such as the District of Columbia, experienced on average the largest declines over the subsequent decade. This observation could reflect the general tendency of initially aberrant rates to regress to the mean. However, the relationship between initial absolute rates and subsequent trends was not strong to begin with and there were states that did not conform to this general observation. The reasons for this variation are unclear; however, trends in a variety of demographic or obstetrical management strategies are not likely to have been responsible for the observed variations (18,19).

While it has been difficult in large datasets to accurately distinguish between the wide variety of clinical phenotypes associated with preterm birth (20), the stratification of preterm births into early and late preterm births has served as a useful proxy (21). While there was considerable differentiation in the state-specific trends in early and late preterm birth rates, the analyses presented here suggest that these two rates do not move independently within states. Rather, there was a significant relationship in the movement of these rates over time raising the question of whether these gestational age categories may share some common influences.

Periodicity in preterm birth rates has been noted for some time. However, this has largely taken the form of seasonal patterns, including patterns that may be related to annual infectious outbreaks, such as influenza (10). There has been little evidence suggesting significant periodicity beyond 1 y. The finding in this study of some periodicity in multi-year time scales warrants confirmation and is undergoing more detailed analysis. Nevertheless, these findings do raise questions regarding potential infectious etiologies that tend to operate on multi-year cycles (22). In addition, social phenomena, including economic cycles (23,24), demographic shifts, including trends in maternal age (25), or other environmental influences (26,27) could produce complex, multi-year cycles and are worthy of more directed research into the specific role these potential influences have on trends in preterm birth.

The cluster analysis revealed both periods of relative congruity and incongruity in preterm birth trends across the United States. Unsupervised, agnostic clustering techniques are increasingly being used to identify heretofore unrecognized patterns in large, multidimensional datasets concerned with complex health disorders (28,29). In this analysis, clusters were determined by seeking the 10 patterns that best described year-to-year changes in gestational age distributions for all counties in the United States from 1971 through 2008. While the predominance of shifts in late preterm and term births appears to have influenced the most prevalent clusters, the generalized distribution of these trends at the county level was of special interest. The transformation over time from a highly heterogeneous pattern to a long, homogeneous pattern and then, most recently, back to a more heterogeneous pattern raises important questions about the broad influences shaping gestational age patterns in specific geographic areas. There is a large literature documenting a wide variety of individual social, behavioral, and environmental contributors to preterm birth, many of which vary substantially by geographic location. Periods of substantial spatial heterogeneity in preterm birth patterns could reflect a time frame in which such local, variable influences predominate. However, during periods of substantial homogeneity in preterm birth patterns, these local variations in etiologic influences may become relatively aligned or combined with broad currents of influence that affect gestational age distributions across large geographic areas, transcending county, state, and regional borders. While this analysis cannot identify the distinct contribution of any specific influence, it is useful to document that such broad spatial and temporal influences exist and may be important in shaping patterns of preterm birth. Such influences could include rapidly disseminated changes in clinical practice (30,31), regional or national trends in economic well-being (23,24), or wide-scale environmental exposures, some of which remain somewhat speculative (27).

The findings of this analysis should be interpreted with caution. The determination of gestational age in US Natality Files has traditionally been based on reported last menstrual period, a variable long recognized as being susceptible to error, particularly for low gestational ages (32). In addition, the relatively long period under study in this analysis would likely involve a changing capacity to assess gestational age correctly. However, for this concern to introduce biased estimation, the error in gestational classification would need to vary with the geographic and spatial clusters. Nevertheless, the US Natality Files have been used successfully to identify a variety of important risk factors for preterm birth as well as in documenting long-term trends in adverse birth outcomes (33). Moreover, the focus of this analysis was on broad patterns of preterm birth using relatively large geographic aggregations of births or clusters of births. The presented analyses also utilize both state and county-level data. These spatial jurisdictions are hierarchical in nature (states are composed of counties) and, while the county and state-level analyses are presented separately, some caution should be used in assessing potential influences that may vary or be shared by individuals or across different spatial aggregations (34).

The cluster analysis should also be interpreted with care. The construction of 10 clusters in the k-means analysis was an arbitrary decision and it is possible that levels of heterogeneity could change if more or fewer clusters were constructed. In addition, unlike the temporal trend analyses, the cluster analyses were confined to singleton births in order to seek patterns generally unassociated with factors related to multiple gestations. The clustering models utilizing three temporal periods could generate anomalous results at the earliest and latest margins of the period under study. However, a sensitivity analysis was conducted to assess this concern and confirmed the robustness of the presented cluster assignments. It should also be noted that there have been a variety of new techniques in the spatial and temporal analysis of complex datasets (35), some of which have been applied to the study of birth outcomes (36). The current analysis used k-means cluster strategies because of their computational efficiency for high-dimensional analyses in very large datasets; however, these new statistical refinements could provide enhanced analytic approaches to spatial and temporal health-related research.

While the exploration of broad spatial and temporal patterns of preterm birth using very large databases and machine learning techniques cannot substitute for detailed epidemiologic, clinical, and basic investigations, the analyses presented here utilize new analytic techniques to explore broad patterns of preterm birth in the United States. The findings suggest potential periodicity and common influences among different gestational age groups and disparate geographic jurisdictions over time. These broad temporal and spatial patterns could help frame new analytic perspectives and promising investigative hypotheses into preterm birth, one of the most profound and persistent adverse health outcomes in the United States today.

Methods

Study Population

Our analyses utilized birth certificate information for all births in the United States between 1971 and 2011 and were derived from the National Center for Health Statistics Natality Files (37). These files provide information on a variety of demographic, maternal, and infant characteristics, including maternal and infant race/ethnicity, estimated gestational age, and county and state of maternal residence. All counties across 51 regions (50 states and the District of Columbia) were included in the analysis. The analyzed data set consisted of more than 3,000 counties over 40 y, yielding about 120,000 county-year data points representing ~145 million live births.

Temporal Trends

Gestational age-specific birth rates (i.e., birth prevalences) were calculated for each year in the study period. These rates were generated by dividing the number of births in a specified gestational age category in a given year by the total number of births for that population and year and presented as a percentage of all births. For the secular trend analyses, gestational age-specific birth rates were calculated for all live-births (including all singleton and multiple births; stillbirths were not included) in each state and the District of Columbia. State and county designation was allocated based on the residence of the mother at the time of delivery and not on the state or county location of birth. Similarly, the designation of a birth as white or African American was based on maternal report. Analyses were performed by using SAS 9.1 (SAS Institute, Cary, NC) statistical software. Simple trends were calculated for 5-y periods between 1995 and 2010. In addition, analyses were conducted to assess whether states with relatively high absolute rates of preterm births in 2000 were more or less likely to have experienced a decline in these rates over the subsequent 10-y period. Analyses were also conducted to examine the relationship between trends in <34 wk and 34–36 wk categories and between different racial-ethnic groups.
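The rate definition above (births in a gestational age category divided by all births for that population and year, expressed as a percentage) can be illustrated with a minimal Python sketch. The function name, the category labels, and the counts are hypothetical and chosen only for illustration; the published trend analyses were run in SAS, not with this code.

```python
def birth_rates_percent(counts_by_category):
    """Return each category's share of all births as a percentage.

    counts_by_category maps a gestational age label to the number of
    live births in that category for one population and one year.
    """
    total = sum(counts_by_category.values())
    if total == 0:
        raise ValueError("no births recorded for this population and year")
    return {label: 100.0 * n / total for label, n in counts_by_category.items()}


# Hypothetical counts for one state in one year (not real data).
example_counts = {"<34 wk": 30, "34-36 wk": 70, ">=37 wk": 900}
example_rates = birth_rates_percent(example_counts)
print(example_rates)
```

Because the denominator is all births in the same population and year, the category percentages sum to 100 whenever the categories are exhaustive and non-overlapping.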

County-Level Trends

Distinct from the state-based analyses, trends were calculated for all US counties for the 1985 through 2008 period. County-level data were not available subsequent to 2008. Total counts of all counties experiencing a >5% and/or >10% year-to-year increase of <34 wk births were calculated for the 1985–2008 period, the years for which all states reported county-based data. Trends were calculated as the percent change in preterm births over the examined period, ((Rt2 − Rt1)/Rt1) × 100, where Rt2 represents the latter year or years and Rt1 the initial study year or years. Simple and smoothed, moving average plots of counties experiencing annual changes of >5% and >10% increases in <34 wk births were calculated.
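As a hedged illustration of the county-level procedure, the sketch below computes the percent change between an initial rate Rt1 and a later rate Rt2 (taken here as the difference divided by the initial rate, times 100), counts the counties whose year-to-year increase exceeds a threshold, and smooths the counts with a moving average. The function names, the toy county rates, and the 2-y window (borrowed from the Figure 3 caption) are assumptions for illustration, not the authors' code.

```python
def percent_change(r_t1, r_t2):
    """Percent change from the initial rate r_t1 to the later rate r_t2."""
    return (r_t2 - r_t1) / r_t1 * 100.0


def count_increases(rates_by_county, threshold=5.0):
    """Count, for each pair of consecutive years, the counties whose rate
    rose by more than `threshold` percent. Every county must supply a list
    of annual rates covering the same years."""
    n_years = len(next(iter(rates_by_county.values())))
    counts = [0] * (n_years - 1)
    for rates in rates_by_county.values():
        for i in range(1, n_years):
            if percent_change(rates[i - 1], rates[i]) > threshold:
                counts[i - 1] += 1
    return counts


def moving_average(values, window=2):
    """Simple trailing moving average over `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# Hypothetical <34 wk birth rates (%) for three counties over four years.
toy_rates = {
    "county_a": [2.0, 2.2, 2.2, 2.5],
    "county_b": [3.0, 3.0, 3.3, 3.2],
    "county_c": [2.5, 2.4, 2.6, 2.6],
}
increase_counts = count_increases(toy_rates, threshold=5.0)
smoothed_counts = moving_average(increase_counts, window=2)
print(increase_counts, smoothed_counts)
```

Passing `threshold=10.0` gives the corresponding >10% counts.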

Cluster Analysis

We used cluster analysis to characterize temporal-spatial patterns of gestational age at delivery for all singleton births in the United States between 1971 and 2008. Clustering is used to detect latent structures or regularities within a dataset. We used the k-means procedure, a clustering strategy that requires the choice of the desired number of cluster centers, with the k-means procedure iteratively estimating the cluster centers by minimizing the total variance within each cluster (38). The first step of the analysis was the calculation of gestational age-specific birth rates for singleton births for each county in the United States for the period 1971 through 2008. The percentage of births that fell into each of the following gestational age categories for each year in each county was computed: under 20, 20–27, 28–31, 32–35, 36, 37–39, 40, 41, and 42 wk and over. Next, temporal trends for each category were characterized using a model which breaks the date range 1971–2008 into three periods, 1971-A, A-B, and B-2008 (where the model selects A and B independently for each county and each age category), and fits a straight line to the data in each time range, forcing the lines to meet at the boundary years A and B. The slope of the trend line in each year was taken to characterize the trend for that age group in that county for that year. Each year in each county was thus characterized by nine gestational age trend slopes, and each county-year pair was represented by this vector of nine values. The number of clusters is set a priori and here the vectors were associated into 10 clusters via the k-means clustering methodology (26). The above analysis was carried out in Python (Python Software Foundation, Beaverton, Oregon) using the SciPy library (SciPy.org). The presented analyses utilized publicly available vital statistics datasets and were approved by the Human Subjects Institutional Review Board of Stanford University.
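The two steps described above, a continuous three-segment trend fit followed by k-means on nine-slope vectors, might be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the boundary years A and B were selected, so the sketch uses an exhaustive least-squares search over candidate pairs; the function name `three_segment_slopes`, the `min_len` constraint, the shortened year range, and the synthetic input series are all hypothetical. `scipy.cluster.vq.kmeans2` is one SciPy k-means routine; whether the authors used it, or scaled the slopes beforehand, is not stated.

```python
import itertools
import numpy as np
from scipy.cluster.vq import kmeans2


def three_segment_slopes(years, values, min_len=3):
    """Fit three straight lines that meet at two boundary years and
    return the fitted slope for every year (assumed search strategy)."""
    years = np.asarray(years, dtype=float)
    values = np.asarray(values, dtype=float)
    best_sse, best = np.inf, None
    # Candidate boundaries leave at least min_len years at each end.
    for a, b in itertools.combinations(years[min_len:-min_len], 2):
        if b - a < min_len:
            continue
        # Hinge terms force the segments to join at a and b.
        X = np.column_stack([np.ones_like(years), years - years[0],
                             np.maximum(years - a, 0.0),
                             np.maximum(years - b, 0.0)])
        coef, *_ = np.linalg.lstsq(X, values, rcond=None)
        sse = float(np.sum((X @ coef - values) ** 2))
        if sse < best_sse:
            best_sse, best = sse, (a, b, coef)
    if best is None:
        raise ValueError("series too short for three segments")
    a, b, coef = best
    # The slope changes by coef[2] after a and by coef[3] after b.
    return coef[1] + coef[2] * (years > a) + coef[3] * (years > b)


# Synthetic stand-in data: 20 counties, 9 gestational age categories,
# and a shortened year range to keep the demonstration fast.
rng = np.random.default_rng(0)
demo_years = np.arange(1990, 2009)
n_counties, n_categories = 20, 9
blocks = []
for _ in range(n_counties):
    series = [10.0 + rng.normal(0.0, 0.1, demo_years.size).cumsum()
              for _ in range(n_categories)]
    # One row per year, one column per gestational age category.
    blocks.append(np.column_stack(
        [three_segment_slopes(demo_years, s) for s in series]))
vectors = np.vstack(blocks)  # one nine-slope vector per county-year

# Group the county-year vectors into 10 clusters, as in the text.
centroids, labels = kmeans2(vectors, 10, minit="++")
print(vectors.shape, centroids.shape, labels.shape)
```

Each entry of `labels` is the cluster assigned to one county-year, which is the quantity that would be tallied by year or mapped by county. Results vary between runs because k-means initialization is random.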

Statement of Financial Support

This study was funded in part by the March of Dimes (Grant # 50185), White Plains, New York.