Introduction

Estimated fetal weight (EFW) is a clinically important indicator for managing perinatal risk, and it informs the timing and route of delivery1,2,3,4. Fetal weight estimation also helps to identify fetuses that are small for gestational age (SGA), large for gestational age (LGA), or macrosomic5,6,7. Reliable fetal weight estimation is therefore of value to clinicians. Numerous EFW models8,9,10,11,12,13 have been developed using ultrasound measurements such as head circumference (HC), abdominal circumference (AC), biparietal diameter (BPD) and femur length (FL). Most of these models were established from ultrasound measurements in Caucasian populations. However, the NICHD Fetal Growth Studies showed that fetal weight differed significantly by race after 20 weeks14.

Although some researchers have developed EFW models using 2D and 3-dimensional (3D) ultrasound measurements in a Chinese population, and these models were confirmed to perform better than those based on Caucasian populations15,16,17, sonographers still use 2D ultrasound far more widely than 3D ultrasound in clinical practice. Thus, it is necessary to develop more accurate birth weight prediction models based on 2D ultrasound measurements.

In addition, a common feature of the published EFW models is that fetal weight is estimated by a single model across the whole gestational age range. According to the Intergrowth-21st study18, fetal growth velocity differs with gestational age. Moreover, Sotiriadis et al.19 found that the divergence between birth weight and the EFW based on Hadlock’s formula is greater at earlier gestational ages, and the same finding was confirmed in a similar study using the Norwegian birth registry20. A single EFW model may therefore not estimate fetal weight accurately across different gestational age stages.

We postulated that developing a separate prediction model for each gestational age stage may improve the accuracy of birth weight prediction in a Chinese population. In this study, we established staged birth weight prediction models from 28 to 42 weeks using 2D ultrasound measurements in a large Chinese population, and validated their prediction performance at 28–42 weeks.

Results

A total of 20,619 women who had an ultrasound scan within 3 days prior to delivery were identified, of whom 11 had a stillbirth, 42 delivered before 28 weeks or after 42 weeks, and 1,256 lacked complete ultrasound measurements (HC, BPD, AC or FL). Ultimately, 19,310 women were analyzed in the study. The clinical and demographic characteristics of the women and newborns are shown in Table 1. The median maternal age was 28 years (interquartile range [IQR], 4 years). The prevalence of gestational diabetes mellitus (GDM) and preeclampsia was 3.66% and 1.95%, respectively. Birth weight had a median of 3,300 g and an IQR of 560 g. Of the 19,310 newborns, 1,827 (9.46%) were small for gestational age (SGA, <10th weight percentile for the local population21), 2,790 (14.45%) were large for gestational age (LGA, >90th weight percentile), and 1,073 (5.56%) were macrosomic (≥4,000 g). The median gestational age at delivery was 39.43 weeks (range, 28–42 weeks). Overall, 987 (5.11%) babies were born preterm (<37 weeks), 194 (1.00%) were born at term with low birth weight (<2,500 g and ≥37 weeks), and 1,072 (5.55%) were macrosomic and born at term (≥4,000 g and ≥37 weeks) (Table 1). The mean ultrasound-to-delivery interval was 1.23 days (range, 0–3).

Table 1 Clinical and demographic characteristics of pregnant women and newborns in the study who had an ultrasound scan within 3 days prior to delivery.

Demographic information for study population

Of the 19,310 subjects, 17,377 were randomly assigned to the training group and 1,933 to the validation group. Comparisons of demographic characteristics (maternal age, maternal weight, maternal height, parity, gestational age, ultrasound-to-delivery interval, newborn’s gender, and birth weight) between the training group and the validation group are shown in Table S1. No demographic characteristic differed significantly between the two groups.

Development of new staged birth weight prediction models

Multiple linear regression (MLR), fractional polynomial regression (FPR) and a volume-based model (VM) were used to establish birth weight prediction models for five gestational age stages (Table 2). To compare the performance of the new staged birth weight prediction models, 21 previously published formulas were selected (Table S2). Figure 1 shows the comparison of all birth weight prediction models for the five gestational age stages. For the first three gestational age stages (28–30, 31–33 and 34–36 weeks), the new VM model had the lowest systematic errors (0.3%, 0.08% and 0.03%, respectively), none of which was significantly different from zero (p = 0.832; 0.923; 0.918). Compared with the MLR and FPR models, the VM model showed at least a 67.39%, 1.21 times and 7 times decrease in systematic error and a 6.97%, 0.26% and 0.36% decrease in random error at 28–30, 31–33 and 34–36 weeks, respectively. The systematic errors of the previously published EFW models were higher than those of the new VM models at 28–30, 31–33 and 34–36 weeks, except for the Hadlock (A,B,F) formula (systematic error: 0.23%) (Fig. 1a–c). For the last two gestational age stages (37–39 and 40–42 weeks), the lowest systematic errors (0.09% and 0.29%, respectively) were found in the MLR model for 37–39 weeks and the FPR model for 40–42 weeks; these were significantly lower than those of the other two new models and the 21 previously published models (all p < 0.05) (Fig. 1d,e). In addition, the random errors of the VM model for the first three gestational age stages, the MLR model for 37–39 weeks and the FPR model for 40–42 weeks were the smallest among all models. Therefore, the VM, MLR and FPR models were used as the new staged birth weight prediction models, and their prediction performance was validated for 28–36, 37–39 and 40–42 weeks, respectively.

Table 2 New staged birth weight prediction models for the training group in different gestational age stages.
Figure 1

The systematic errors and random errors of the newly established staged birth weight prediction models and the previously published models for different gestational age stages in the training group. (a–e) Represent the systematic errors (±random errors) for 28–30, 31–33, 34–36, 37–39 and 40–42 weeks, respectively. *Indicates a systematic error significantly different from zero, p < 0.05; A: abdominal circumference (AC); B: biparietal diameter (BPD); H: head circumference (HC); F: femur length (FL); MLR: multiple linear regression; FPR: fractional polynomial regression; VM: volume-based model.

Validation of the different prediction models

Figure 2 displays the systematic errors and random errors of the different prediction models in the validation group. The VM model was compared with the 21 previously published models in the first three gestational age stages, and the MLR model and the FPR model were used in the last two stages, respectively. The new staged models gave the lowest systematic errors in all five gestational age stages (−0.08%, 0.51%, −0.09%, 0.36% and 0.29%), none of which was significantly different from zero (p = 0.989; 0.853; 0.922; 0.116; 0.367) (Fig. 2). At 28–30 weeks, the absolute systematic errors of the previously published models ranged from 1.98% to 21.81%, at least 23.75 times higher than that of the new model (Fig. 2a). At 31–33 weeks, the absolute systematic errors of the 21 published models (range, 0.69% to 18.69%) were at least 35.29% higher than that of the new model (Fig. 2b). Similarly, at 34–36 weeks the new model showed a 1.33 times decrease compared with the Combs (A,H,F) model, which had the second lowest systematic error (Fig. 2c). At 37–39 and 40–42 weeks, the new models had significantly lower systematic errors than the 21 published models (both p < 0.001) (Fig. 2d,e). The random errors of the new staged models were close to those of the published models, although slightly lower in every gestational age stage.
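The “times higher” and percentage comparisons above appear to be relative differences in absolute systematic error, with the new model as the reference. The sketch below reproduces two of the reported figures under that assumption; the function name `relative_increase` is introduced here for illustration only and does not come from the study.

```python
def relative_increase(published_error, new_error):
    """Relative excess of |published_error| over |new_error| (both in %).

    Assumed reading of the comparison: (|published| - |new|) / |new|.
    """
    reference = abs(new_error)
    if reference == 0:
        raise ValueError("reference error must be non-zero")
    return (abs(published_error) - reference) / reference


# 28-30 weeks: smallest published |error| is 1.98%, new model error is -0.08%
ratio_28_30 = relative_increase(1.98, -0.08)  # roughly 23.75 ("23.75 times higher")

# 31-33 weeks: smallest published |error| is 0.69%, new model error is 0.51%
ratio_31_33 = relative_increase(0.69, 0.51)   # roughly 0.3529 (a 35.29% increase)
```

Under this reading, a value of 1 means that the published model's absolute error is twice that of the new model.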

Figure 2

The systematic errors and random errors of the new staged birth weight prediction models and the previously published models for different gestational age stages in the validation group. (a–e) Represent the systematic errors (±random errors) for 28–30, 31–33, 34–36, 37–39 and 40–42 weeks, respectively. New staged birth weight prediction models: VM model for 28–30, 31–33 and 34–36 weeks, MLR model for 37–39 weeks, FPR model for 40–42 weeks. *Indicates a systematic error significantly different from zero, p < 0.05; A: abdominal circumference (AC); B: biparietal diameter (BPD); H: head circumference (HC); F: femur length (FL); MLR: multiple linear regression; FPR: fractional polynomial regression; VM: volume-based model.

The aggregate systematic errors, random errors, RMSE and prediction rates within 1%, 5% and 10% of birth weight were calculated for each model in the validation group (Table 3). The new staged models had the lowest systematic error (0.31%) among all models for all fetuses in the validation group. The systematic errors of the published models were all significantly different from zero (all p < 0.001). The random errors of the new staged models were equal to or smaller than those of the published models, and the RMSE values of the new models were lower than those of the published models. The Hadlock (A,B,H,F) model and the new staged models had the same prediction rate within 1% of birth weight (10.86%), which was better than that of the other published models. The prediction rates within 5% and 10% of birth weight for the new models (54.47% and 85.10%, respectively) were higher than those of the published models.

Table 3 Comparisons of staged birth weight prediction models with previously published models for all newborns in the validation group.

The accuracy parameters (sensitivity, specificity, PPV, NPV, +LR, −LR, and overall accuracy) of each birth weight prediction model for the detection of SGA, LGA and macrosomia at birth are shown in Tables 4–6, respectively. For the prediction of SGA, LGA and macrosomia, the models varied considerably in sensitivity, specificity, PPV, +LR and −LR, but only slightly in NPV and overall accuracy. Compared with the previously published models, the new staged models showed better accuracy for the detection of SGA, LGA and macrosomia in a Chinese population.

Table 4 Accuracy of staged birth weight prediction models for detection of SGA at birth.
Table 5 Accuracy of staged birth weight prediction models for detection of LGA at birth.
Table 6 Accuracy of staged birth weight prediction models for detection of macrosomia at birth.
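The accuracy parameters reported in Tables 4–6 follow the standard definitions for a 2×2 table of predicted versus actual status. The sketch below computes them from four cell counts. The function name and the example counts are hypothetical and are not taken from the study data.

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Standard accuracy measures from a 2x2 table.

    tp: predicted positive, truly positive   fp: predicted positive, truly negative
    fn: predicted negative, truly positive   tn: predicted negative, truly negative
    Assumes every row and column total is non-zero and 0 < specificity < 1.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "positive_lr": sensitivity / (1 - specificity),
        "negative_lr": (1 - sensitivity) / specificity,
        "overall_accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Hypothetical counts for illustration only (not study results)
example = diagnostic_accuracy(tp=60, fp=40, fn=40, tn=860)
```

Because NPV and overall accuracy are dominated by the large number of true negatives when the condition is uncommon, they can be expected to vary less between models than sensitivity or PPV.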

Comparison of new staged models and single models

New single models covering all gestational ages were also established by MLR, FPR and VM in the training group (Table S3). Table S4 shows the systematic errors and random errors of the single models, which were higher than those of the staged models for all newborns in the validation group. The new single models also had higher RMSE values and lower prediction rates within 1%, 5% and 10% of birth weight than the new staged models (Table S4). The accuracy parameters of the single models for the prediction of SGA, LGA and macrosomia are displayed in Table S5. The staged models had better accuracy for the prediction of SGA, LGA and macrosomia at birth.

Discussion

Although many EFW models based on 2D ultrasound measurements have been developed, their accuracy has been unsatisfactory22. Developing an EFW formula requires as many pregnant women as possible with both a standardized ultrasound scan and a birth weight measurement18. To our knowledge, this study is the first report on birth weight prediction for different gestational age stages (from 28–30 to 40–42 weeks) in a large Chinese population including 19,310 newborns. We developed new staged birth weight prediction models, which were best fitted by the VM, MLR and FPR models for the first three gestational age stages (28–30, 31–33 and 34–36 weeks), 37–39 weeks and 40–42 weeks, respectively.

For the first three gestational age stages, the VM model showed the best prediction performance in the training group among the three new models (VM, MLR, FPR) and the previously published models (Fig. 1a–c). At 28–30 weeks, the systematic error of the Hadlock (A,B,F) model (0.23%) was lower than that of the VM model (0.3%) and not significantly different from zero (p = 0.834), but its random error was 7.85% higher (Fig. 1a). At 31–33 weeks, the systematic error of the Hadlock (A,H,F) model (0.17%) was 1.13 times higher than that of the VM model (0.08%) (Fig. 1b). At 34–36 weeks, the Hadlock (A,B,H,F) model had the second lowest systematic error (−0.06%), which in absolute value was twice that of the VM model (0.03%) (Fig. 1c). Together with the comparisons against the previously published models in the validation group (Fig. 2a–c), these results suggest that the VM model could provide a more accurate prediction of birth weight for the first three stages.

For the last two gestational age stages, the systematic errors of the previously published models were much higher than those of the new staged models in the training group, and were all significantly different from zero (all p < 0.05) (Fig. 1d,e). At 37–39 weeks, the MLR model had the lowest systematic error (0.09%) and random error (7.23%) compared with the FPR model (0.3%, 7.73%) and the VM model (0.5%, 7.84%) (Fig. 1d). At 40–42 weeks, the FPR model showed a decrease of at least 44.83% in systematic error and 9.03% in random error compared with the MLR and VM models (Fig. 1e). Likewise, according to the comparisons of prediction models in the validation group (Fig. 2d,e), the MLR and FPR models were considered the best prediction models for 37–39 and 40–42 weeks, respectively. The lowest aggregate systematic error and random error were also found in the new staged models (Table 3).

Moreover, a model is considered acceptable if its prediction rate within 10% of birth weight exceeds 80%16. In our study, the prediction rate within 10% of birth weight for the new staged models was 85.1%, which was higher than that of the previously published models. Furthermore, the new staged models showed better accuracy (sensitivity, specificity, PPV, NPV, +LR, −LR and overall accuracy) than the previously published models for the detection of SGA, LGA and macrosomia at birth. To further illustrate the accuracy of the staged models, single models were developed using the same methods. Our results showed that the single models had higher systematic errors, random errors and RMSE values, and lower prediction rates within 1%, 5% and 10% of birth weight, than the staged models. Similar results were observed when the single and staged models were compared for the detection of SGA, LGA and macrosomia at birth. These findings suggest that the staged models performed better than the single models because growth velocity varies across gestational age stages. Thus, we think that the new staged models could improve the accuracy of birth weight estimation.

Dudley23 compared 11 EFW formulas and concluded that no model was to be preferred for the estimation of fetal weight, owing to population differences, maternal factors and measurement methods. To account for the significant racial differences in fetal weight14, some studies have reported EFW models using ultrasound measurements in a Chinese population. Liao et al.15 established an EFW formula using biometric data from 1,197 fetuses delivered between 37 and 41 weeks. Yang et al.16 developed a new birth weight prediction model that included 290 pregnant women in Hong Kong who delivered at 37–42 weeks. However, these prediction models were established using 2D and 3D ultrasound, and the samples of the two studies were not large and were limited to late third-trimester fetuses. Woo et al.24 developed an EFW formula with only 125 subjects, for whom detailed information was not reported. The prediction errors of Woo’s formula were reported to be higher than those of Hadlock’s formula25. Furthermore, our study showed that Woo’s model had a higher systematic error, random error and RMSE value, and lower prediction rates within 1%, 5% and 10% of birth weight, than the new staged models (Table 3).

Melamed et al.26 indicated that even the most precise models tend to have large prediction errors. There are several potential sources of error. First, observers differ: ultrasound measurements, especially AC, are known to vary between operators, even experienced ones27,28. Second, because body composition differs between fetuses, the same circumference (AC) or length (FL) measurement may correspond to different weights29. Third, fetal position affects the measurement of fetal biometrics, which may be addressed by 3D ultrasound15. The combined use of 2D and 3D ultrasound measurements in birth weight prediction should be examined in future studies.

The subjects in this study were women who had both a delivery and an ultrasound scan within 3 days prior to delivery, which could in theory cause selection bias. However, with the universal use of sonography in clinical practice, pregnant women in China receive regular ultrasound scans during pregnancy, and prenatal ultrasonography in particular has become an essential part of prenatal diagnosis. Additionally, Cohen et al.30 found that ultrasound-to-delivery intervals of more than 3 days tended to affect the accuracy of EFW. Thus, the selection of the population in this study is, to some extent, unlikely to be biased. Nevertheless, the new staged birth weight prediction models should be applied with caution in other populations, because in many other countries (e.g., USA, UK) pregnant women are not routinely scanned in late pregnancy but are selected for ultrasonography based on pre-pregnancy risk factors and obstetric complications31,32. Therefore, further studies should be undertaken to verify the accuracy of our new staged birth weight prediction models in other populations, and to explore how population selection for different ultrasound-to-delivery intervals influences birth weight prediction.

This study has several strengths. First, it had a large sample size, including 19,310 Chinese fetuses. Second, the population was split into a training group and a validation group, so that the models could be established and their prediction performance validated separately. Third, we included pregnant women who had an ultrasound scan within 3 days prior to delivery, to avoid the prediction bias caused by long ultrasound-to-delivery intervals30. Fourth, we proposed staged birth weight prediction models instead of a single model, that is, a different prediction model for each gestational age stage. Some studies have focused on the accuracy of fetal weight estimation for preterm or term fetuses33,34. For example, Hadlock’s9 and Warsof’s11 formulas usually underestimate preterm fetal weight, and Shepard’s10 formula is likely to overestimate fetal weight at term. This study provides direct evidence that the new staged prediction models could estimate birth weight more accurately.

Our study has several limitations. First, several types of ultrasound machines were used in the study, and differences in their calibration may have had a slight impact on the variability of the measurements35. Second, because the sample size for 28–30 weeks was smaller than that for the other stages, the accuracy of fetal weight estimation may be affected before 30 weeks.

In conclusion, compared with the previously published models, the new staged prediction models showed higher accuracy in a Chinese population at 28–42 weeks. This suggests that the new staged models could be more accurate than a single formula for predicting birth weight at a given gestational age stage.

Materials and Methods

Study population

This was a retrospective cross-sectional study of all women who had an ultrasound examination within 3 days prior to delivery. The study subjects were from two sites: Wuhan Women and Children Medical Care Center (between May 2012 and June 2015), and Shenzhen Luohu Maternity and Children Health Care Hospital (between January 2011 and December 2015). Data were extracted from the healthcare information systems of the two hospitals. The following inclusion criteria were applied: (1) singleton pregnancy, (2) delivery between 28 and 42 weeks’ gestation, (3) live birth without any congenital malformation, (4) complete measurements (HC, BPD, AC and FL). Gestational age was determined using the self-reported last menstrual period (LMP) if it agreed with the ultrasound estimation within 7 days; otherwise, the ultrasound estimation based on crown-rump length was used36,37.
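The dating rule described above can be written as a short function. This is a minimal sketch rather than the authors' code: the function name, the use of days as the unit, and the inclusive reading of “within 7 days” are all assumptions made for illustration.

```python
def gestational_age_days(lmp_ga_days, ultrasound_ga_days, tolerance_days=7):
    """Return the gestational age (in days) to use for analysis.

    Uses the LMP-based estimate when it agrees with the ultrasound
    (crown-rump length) estimate within `tolerance_days`; otherwise
    falls back to the ultrasound estimate.
    Assumption: a difference of exactly 7 days counts as agreement.
    """
    if abs(lmp_ga_days - ultrasound_ga_days) <= tolerance_days:
        return lmp_ga_days
    return ultrasound_ga_days


# Hypothetical examples: a 5-day discrepancy keeps the LMP estimate,
# a 10-day discrepancy switches to the ultrasound estimate.
ga_agree = gestational_age_days(275, 280)
ga_disagree = gestational_age_days(275, 285)
```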

All participants gave signed informed consent prior to engaging in any study activities. This study was approved by the ethics committee of Tongji Medical College, Huazhong University of Science and Technology, and Wuhan Women and Children Medical Care Center. All research procedures were performed in accordance with the relevant guidelines and regulations.

Ultrasound measurements

The 2D ultrasound measurements for all participants included the following biometrics: BPD, HC, AC and FL, which were obtained from the ultrasound images and uploaded electronically to the data management system. Three types of ultrasound machines were used at the two sites: an ALOKA SSD-5500SV (Tokyo, Japan), a Philips iu22 or HD15 (Bothell, WA, USA), and a GE Voluson E10 or E8 (Zipf, Austria). All four biometrics were measured in millimeters during the ultrasound examination. BPD was measured from the outer border of the proximal parietal bone to the inner border of the distal parietal bone (“outer to inner”) at the widest part of the skull. HC was obtained by placing the calipers on the outer border of the skull and using the ellipse facility to follow the outer perimeter of the skull. AC was measured at the outer surface of the skin line, using the ellipse facility. For FL, the calipers were placed at the ends of the ossified diaphysis, without including the distal femoral epiphysis if it was visible. Birth weight was measured within 1 hour after birth by experienced obstetric nurses using standardized procedures.

To control the quality of the fetal ultrasound measurements, ultrasound examinations were performed according to a standard protocol by experienced, certified sonographers with subspecialty training in ultrasound imaging. All sonographers had their scans evaluated for quality control early in the study.

Statistical analysis

The participants were divided by random sampling into two groups, a training group and a validation group, which were used to establish the birth weight prediction models and to validate their accuracy, respectively. The Student t test and the Pearson chi-square test were used to compare the clinical and demographic characteristics of the two groups, including maternal age, maternal weight, maternal height, parity, gestational age, ultrasound-to-delivery interval, newborn’s gender, and birth weight.
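A random split of this kind can be sketched as follows. The group sizes (17,377 and 1,933 of 19,310) are those reported in the Results. The sampling procedure itself is not described in detail, so the shuffle-and-slice approach, the function name and the seed below are assumptions for illustration.

```python
import random


def split_indices(n_total, n_validation, seed=0):
    """Randomly partition record indices into training and validation sets.

    Assumption: simple random sampling without replacement; the seed is
    arbitrary and only makes the illustration reproducible.
    """
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)
    validation = sorted(indices[:n_validation])
    training = sorted(indices[n_validation:])
    return training, validation


# Group sizes as reported: 19,310 subjects, 1,933 of them for validation
training_idx, validation_idx = split_indices(19310, 1933)
```

Each record lands in exactly one group, so the validation set is never used when the models are fitted.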

It has been reported that the growth velocity of fetal weight in Chinese fetuses shows an inverted “V” shape, peaking at 34 weeks; furthermore, fetal weight differed significantly from published Caucasian data at 28, 30, 32 and 38 weeks38. To determine growth velocity, the “interval method” proposed by Guihard-Costa39,40 was used, in which the length of pregnancy is divided into 3-week intervals. Therefore, given the effect of growth velocity on body weight, we divided gestational age into five stages of 3 weeks each: 28–30, 31–33, 34–36, 37–39 and 40–42 weeks. Multiple linear regression (MLR), fractional polynomial regression (FPR)41,42 and a volume-based model (VM)12,43 were used to establish new birth weight prediction models for the different gestational age stages. In the regression models, we also considered the interactions among BPD, HC, AC and FL. In the volume-based model, fetal body weight was calculated as the sum of the weights of the fetal trunk and head. Based on physical and geometric theory, the volume of the trunk was expressed as FL × AC², so the weight of the trunk was taken to be proportional to FL × AC². Similarly, the weight of the head was taken to be proportional to the volume of the head, which was modeled as HC³, BPD³, BPD × HC² and BPD² × HC. The models, including the variables, the coefficients and the fractional polynomial powers (only for FPR), were selected by the backward elimination algorithm.
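The staging and the candidate volume terms of the volume-based model can be sketched as below. This illustrates only the inputs to the model: the fitted coefficients are given in Table 2 and are not reproduced here. The function names, the use of completed weeks for staging and the example measurements are assumptions.

```python
# Five 3-week gestational age stages, in completed weeks (inclusive bounds)
STAGES = [(28, 30), (31, 33), (34, 36), (37, 39), (40, 42)]


def stage_of(ga_completed_weeks):
    """Map gestational age in completed weeks to its stage label."""
    for low, high in STAGES:
        if low <= ga_completed_weeks <= high:
            return f"{low}-{high}"
    raise ValueError("gestational age outside 28-42 weeks")


def vm_candidate_terms(bpd, hc, ac, fl):
    """Candidate volume terms (all inputs in mm, so each term is in mm^3).

    Trunk volume is taken as proportional to FL * AC^2; head volume is
    represented by four alternative cubic terms in BPD and HC.
    """
    return {
        "FL*AC^2": fl * ac ** 2,
        "HC^3": hc ** 3,
        "BPD^3": bpd ** 3,
        "BPD*HC^2": bpd * hc ** 2,
        "BPD^2*HC": bpd ** 2 * hc,
    }


# Hypothetical measurements in mm, for illustration only
terms = vm_candidate_terms(bpd=90, hc=330, ac=340, fl=70)
stage = stage_of(39)
```

A staged model would then fit a separate set of coefficients to these terms within each stage, with backward elimination deciding which terms are retained.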

We evaluated the accuracy of the new staged birth weight prediction models and the 21 previously published EFW formulas8,9,10,11,12,13,18,24,44,45,46,47,48 in two ways: (1) by comparing the systematic error, random error, root-mean-square error (RMSE) and proportion of predictions within 1%, 5% and 10% of actual birth weight for all prediction models. Systematic error was calculated as the mean of the percentage error (PE), defined as \(\mathrm{PE} = [(\mathrm{EFW} - \mathrm{Birth\ Weight})/\mathrm{Birth\ Weight}] \times 100\%\). Random error was evaluated as the standard deviation (SD) of the PE; (2) by comparing the sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio (+LR), negative likelihood ratio (−LR) and overall accuracy for the detection of SGA, LGA and macrosomia. Finally, to assess the benefit of staging, we established single models for all gestational ages in the training group and compared their accuracy with that of the new staged models in the validation group.
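The error measures defined above can be computed as in the following sketch. Several details are not specified in the text and are assumed here: the sample (n − 1) standard deviation is used for the random error, RMSE is computed on the gram scale, and “within x%” is treated as inclusive. The function name and the example weights are hypothetical.

```python
import math
import statistics


def error_summary(efw, birth_weight):
    """Summarise prediction error for paired EFW and birth weight values (g)."""
    # Percentage error: PE = (EFW - BW) / BW * 100
    pe = [100.0 * (e - b) / b for e, b in zip(efw, birth_weight)]
    n = len(pe)
    squared = [(e - b) ** 2 for e, b in zip(efw, birth_weight)]
    return {
        "systematic_error": statistics.mean(pe),   # mean PE, in %
        "random_error": statistics.stdev(pe),      # SD of PE (n - 1), in %
        "rmse": math.sqrt(sum(squared) / n),       # in grams (assumed scale)
        "within_1pct": 100.0 * sum(abs(p) <= 1 for p in pe) / n,
        "within_5pct": 100.0 * sum(abs(p) <= 5 for p in pe) / n,
        "within_10pct": 100.0 * sum(abs(p) <= 10 for p in pe) / n,
    }


# Hypothetical values in grams, for illustration only
summary = error_summary(
    efw=[3100, 2900, 3400, 2980],
    birth_weight=[3000, 3000, 3200, 3000],
)
```

A systematic error near zero shows that over- and underestimates cancel on average. The random error and the within-x% rates describe how far individual predictions scatter.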

All the statistical analyses were carried out in R statistical software version 3.4.1 and SAS Software version 9.4. The statistical significance was set at an α level of 0.05 with a two-sided test.