Association of pre- and early-pregnancy factors with the risk for gestational diabetes mellitus in a large Chinese population

Gestational diabetes mellitus (GDM) has aroused wide public concern, as it affects approximately 1.8–25.1% of pregnancies worldwide. This study aimed to examine the association of pre-pregnancy demographic parameters and early-pregnancy laboratory biomarkers with later GDM risk, and further to establish a nomogram prediction model. This study is based on the big obstetric data from 10 “AAA” hospitals in Xiamen. GDM was diagnosed according to the International Association of Diabetes and Pregnancy Study Group (IADPSG) criteria. Data were analyzed using Stata (v14.1) and R (v3.5.2). A total of 187,432 gestational women free of pre-pregnancy diabetes mellitus were eligible for analysis, including 49,611 women with GDM and 137,821 women without GDM. Irrespective of confounding adjustment, eight independent factors were consistently and significantly associated with GDM: pre-pregnancy body mass index (BMI), pre-pregnancy intake of folic acid, white cell count, platelet count, alanine transaminase, albumin, direct bilirubin, and creatinine (p < 0.001). Notably, each 3 kg/m2 increment in pre-pregnancy BMI was associated with a 22% increased risk [adjusted odds ratio (OR) 1.22, 95% confidence interval (CI) 1.21–1.24, p < 0.001], and pre-pregnancy intake of folic acid was associated with a 27% lower GDM risk (adjusted OR 0.73, 95% CI 0.69–0.79, p < 0.001). The eight significant factors exhibited decent prediction performance as reflected by calibration and discrimination statistics and decision curve analysis. To enhance clinical application, a nomogram model was established by incorporating age and the above eight factors, and importantly this model had a prediction accuracy of 87%. Taken together, eight independent pre-/early-pregnancy predictors were identified in significant association with later GDM risk, and importantly a nomogram modeling these predictors has over 85% accuracy in the early detection of pregnant women who will progress to GDM later.

www.nature.com/scientificreports/
to identify high-risk women for effective surveillance and prevention efforts, which can gain 12 to 16 weeks of intervention time. Currently, published data on this subject mainly focus on demographic parameters. In addition, considering the complex nature of GDM, the impact of any risk predictor on the development of GDM may be small when assessed in isolation, but may be more obvious in combination with other risk factors 13,14.
In the literature, dozens of studies have attempted to construct a risk prediction model for GDM 14–16, yet the prediction performance remains untested or less than satisfactory, curbing translation into clinical application.
To fill this gap in knowledge and yield more information for future research, we used the big obstetric data from Xiamen, China, to examine the association of potential risk predictors (including pre-pregnancy demographic parameters and early-pregnancy laboratory biomarkers) with later GDM risk, and further to establish a nomogram prediction model by regressing conventionally recognized and newly identified predictors of significance.

Methods
Study design and ethical approval. This is a multicenter hospital-based cohort study. Maternal and child health data from 10  Ethical approval was obtained from the Institutional Review Boards at all participating hospitals, and informed consent was obtained from all participants undergoing direct interview. Data sharing certification complied with the relevant policies set forth by the Xiamen Health Bureau.
Study participants. Study participants were restricted to gestational women aged ≥ 18 years who had the expected date of confinement falling from the year 2008 to 2018, as well as data on standard glucose challenge test and/or OGTT. Gestational women with pre-pregnancy diabetes mellitus were excluded from the current analysis.
Diagnosis of gestational diabetes mellitus. GDM was diagnosed according to criteria set forth by the International Association of Diabetes and Pregnancy Study Group (IADPSG) 17. During the 24th to 28th gestational weeks, women who had non-fasting plasma glucose ≥ 7.8 mmol/L on a 1-h 50-g glucose challenge test were requested to undertake a 2-h 75-g OGTT, which was carried out in the morning after an overnight fast of over 8 h, with blood samples drawn at fasting and at 1 h and 2 h after the glucose load. A pregnant woman was diagnosed with GDM if one or more of the following criteria were satisfied: fasting plasma glucose ≥ 5.1 mmol/L, 1-h plasma glucose ≥ 10.0 mmol/L, or 2-h plasma glucose ≥ 8.5 mmol/L.
Demographic characteristics. Pre-pregnancy demographic data were self-reported by study participants at the first pre-natal visit during the 8th to 12th weeks of pregnancy, including age, age at menarche, cigarette smoking, alcohol drinking, education, medical histories of diabetes mellitus and hypertension, pre-pregnancy intake of folic acid, pregnancy week, the presence of hemopathy, epilepsy, hyperthyroidism, cardiovascular diseases, liver diseases, kidney diseases, and lung diseases, as well as maternal family histories of diabetes mellitus and hypertension. Data on the previous histories of GDM and macrosomia were missing.
Cigarette smoking status was classified as never smoking and ever (former or current) smoking. Alcohol drinking was classified as never drinking and ever (former or current) drinking. Education was classified as high (college or equivalent degree or above) and low (high school degree or below) education. A maternal family history of diabetes mellitus or hypertension was defined as one or more of affected relatives within three generations who had clinically confirmed diabetes mellitus or hypertension.
Body height (to the nearest 0.1 cm) and pre-pregnancy weight (to the nearest 0.1 kg) were measured by nurses or trained staff. Pre-pregnancy body mass index (BMI) was calculated as pre-pregnancy weight (kg) divided by the square of body height (m).
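The two derived quantities defined in the Methods so far, the IADPSG diagnostic rule and pre-pregnancy BMI, can be written down compactly. The following is only a minimal sketch of those definitions; the function and constant names are illustrative assumptions and do not come from the study's own data pipeline.

```python
# IADPSG thresholds for the 2-h 75-g OGTT, in mmol/L, as quoted in the text.
# The constant and function names below are hypothetical, chosen for illustration.
FASTING_CUTOFF = 5.1
ONE_HOUR_CUTOFF = 10.0
TWO_HOUR_CUTOFF = 8.5


def has_gdm(fasting: float, one_hour: float, two_hour: float) -> bool:
    """Return True if one or more OGTT values meet the quoted thresholds."""
    return (
        fasting >= FASTING_CUTOFF
        or one_hour >= ONE_HOUR_CUTOFF
        or two_hour >= TWO_HOUR_CUTOFF
    )


def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index: weight (kg) divided by the square of height (m)."""
    height_m = height_cm / 100.0  # height is recorded in cm in the text
    return weight_kg / (height_m ** 2)
```

A single value at or above its threshold is sufficient for a diagnosis, which is why the three comparisons are combined with `or`.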

Laboratory biomarkers.
Besides recording demographic information, fasting blood samples were also collected at the first pre-natal visit during the 8th to 12th weeks of pregnancy for the measurement of laboratory biomarkers. In this study, because coverage of laboratory biomarkers differed across participating hospitals, only white cell count, platelet count, hemoglobin, alanine transaminase (ALT), aspartate aminotransferase (AST), albumin, direct bilirubin, conjugated bilirubin, creatinine, and blood urea nitrogen (BUN) were included for analysis. In view of the strong biological relevance between direct bilirubin and conjugated bilirubin, only direct bilirubin was retained in the analysis. The concentrations of these biomarkers were quantified by the clinical laboratory or department of each participating hospital.

Statistical analyses.
Utilizing the extract-transform-load process in SQL Server 2008 R2, crude variables with more than 50% missing values were removed. In addition, implausible values or extreme outliers that might represent transcription or data entry errors were checked. All outliers were reported to the data entry technicians, who corrected the database by comparing against the paper records or in consultation with the obstetricians. Then, variables were imported into the Stata software version 14.1 for Windows (Stata Corp, TX) for data cleaning and management. All study participants were divided into the patient group and the control group according to the diagnosis of GDM. The distributions of continuous characteristics, summarized as mean (standard deviation) and median (interquartile range), were appraised for normality using the 1-sample Kolmogorov-Smirnov test. Continuous characteristics that deviated from normality were compared between the two groups using the Wilcoxon-Mann-Whitney rank sum test, and the t test otherwise. Categorical characteristics, summarized as count and percentage, were compared using the χ2 test.
To assess the possibility of non-random measurement error for clinical biomarkers resulting from procedural differences across the ten "AAA" hospitals in this study, the intraclass correlation coefficient (ICC) was employed. The ICC is a statistic that quantifies the degree to which observations within a cluster differ from those between clusters 18. The confidence limits for the ICC were estimated using the multivariable delta method 19. The ICC ranges from 0 to 1, with an ICC of 0 indicating that the variance in clinical biomarkers is not due to variation between the hospitals.
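The text does not state which ICC estimator was used. As an illustration only, the sketch below assumes the common one-way ANOVA estimator for clusters of unequal size, with each hospital treated as a cluster; `icc_oneway` and its argument layout are hypothetical names.

```python
from statistics import mean


def icc_oneway(groups):
    """One-way ANOVA estimate of the ICC (assumed estimator, not the paper's).

    `groups` is a list of lists: one list of biomarker values per hospital.
    """
    g = len(groups)
    sizes = [len(x) for x in groups]
    n_total = sum(sizes)
    grand_mean = sum(sum(x) for x in groups) / n_total
    group_means = [mean(x) for x in groups]

    # Between-cluster and within-cluster sums of squares.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, group_means))
    ss_within = sum((v - m) ** 2 for x, m in zip(groups, group_means) for v in x)

    ms_between = ss_between / (g - 1)
    ms_within = ss_within / (n_total - g)

    # Adjusted mean cluster size for unbalanced designs.
    k0 = (n_total - sum(n * n for n in sizes) / n_total) / (g - 1)
    return (ms_between - ms_within) / (ms_between + (k0 - 1) * ms_within)
```

This estimator can return a slightly negative value when the between-hospital variance is very small; such values are usually read as an ICC of 0.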
Significant factors for the risk of GDM were identified using logistic regression analysis before (model 0) and after adjusting for confounding factors (model 1 and model 2). Confounders in model 1 included age, cigarette smoking, alcohol drinking, education, and age at menarche, and confounders in model 2 additionally included maternal family histories of diabetes mellitus and hypertension, and the presence of hemopathy, epilepsy, hyperthyroidism, cardiovascular diseases, liver diseases, kidney diseases, and lung diseases. The risk for GDM was denoted by the odds ratio and its 95% confidence interval (95% CI). A factor was identified as significant if statistical significance (p value < 0.05) was attained simultaneously across the three models.
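For readers less familiar with how a logistic regression coefficient maps to an odds ratio per multi-unit increment (such as the 3 kg/m2 BMI step reported in this study), here is a minimal sketch. It assumes a log-linear effect and a Wald-type interval; the function name and the `z = 1.96` default are illustrative assumptions, not the study's code.

```python
import math


def odds_ratio_per_increment(beta, se, increment=1.0, z=1.96):
    """OR and Wald 95% CI for an `increment`-unit rise in a predictor.

    `beta` and `se` are the per-unit log-odds coefficient and its standard
    error from a fitted logistic model (hypothetical inputs here).
    """
    point = math.exp(beta * increment)
    lower = math.exp((beta - z * se) * increment)
    upper = math.exp((beta + z * se) * increment)
    return point, lower, upper
```

Under the same log-linearity assumption, the reported OR of 1.22 per 3 kg/m2 corresponds to a per-unit coefficient of `math.log(1.22) / 3`, that is, an OR of roughly 1.07 per 1 kg/m2.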
For the continuous significant factors identified, Spearman rank correlation coefficients were calculated to check for collinearity. If a pairwise correlation coefficient exceeded 0.6, only one factor of the pair was retained for analysis.
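The collinearity screen can be sketched in a few lines. The example below computes Spearman's coefficient as the Pearson correlation of average ranks and flags pairs above the 0.6 cut-off. Applying the cut-off to the absolute value is an assumption made here (the text only says "over 0.6"), and all function names are hypothetical.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def collinear_pairs(columns, cutoff=0.6):
    """Return (name_a, name_b, rho) for pairs with |rho| above the cut-off."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        rho = spearman(a, b)
        if abs(rho) > cutoff:  # absolute value is an assumption, see lead-in
            flagged.append((name_a, name_b, rho))
    return flagged
```

A constant column has zero rank variance and would raise a division error, so such variables would need to be dropped before the screen.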
To examine the prediction performance of significant independent factors, two models were constructed: a basic model and a full model. Factors in the basic model included age, alcohol drinking, cigarette smoking, education, age at menarche, maternal family histories of diabetes mellitus and hypertension, as well as the presence of hemopathy, epilepsy, hyperthyroidism, cardiovascular diseases, liver diseases, kidney diseases, and lung diseases. The full model additionally included the significant independent factors. Prediction accuracy gained by adding significant independent factors to the basic model was appraised from both calibration and discrimination aspects using the following statistics or tests: Akaike information criterion (AIC), Bayesian information criterion (BIC) 20, likelihood ratio test, net reclassification improvement (NRI), integrated discrimination improvement (IDI) 21, and area under the receiver operating characteristic curve (AUROC). Moreover, the net benefits gained by adding significant independent factors were visually appraised in decision curve analysis 22. In the plot of decision curve analysis, the X-axis represents thresholds for GDM risk, and the Y-axis represents the net benefit at each threshold. The higher a model's curve lies, the greater its net benefit.
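The Y-axis quantity in a decision curve can be made concrete with a short sketch. It uses the standard net-benefit definition (true positives per person minus false positives per person, weighted by the odds of the threshold); the function names and the toy inputs mentioned afterwards are illustrative assumptions, not study data.

```python
def net_benefit(y_true, risk, threshold):
    """Net benefit of treating everyone whose predicted risk >= threshold.

    y_true: 0/1 outcomes; risk: predicted probabilities; 0 < threshold < 1.
    """
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, risk) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, risk) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference curve: net benefit when every woman is treated as high risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)
```

Evaluating `net_benefit` over a grid of thresholds for the basic and the full model, alongside the treat-all reference and a treat-none line at zero, gives the curves that are compared visually.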
To facilitate clinical application, a risk prediction model illustrated as a nomogram was established by regressing conventionally recognized and newly identified significant independent factors. The performance of this nomogram model was appraised using both the concordance index (C-index, which equals the AUROC) and the calibration curve. The larger the C-index, the more accurate the risk prediction for GDM. The C-index ranges from 0.0 to 1.0, and it is generally accepted that a C-index of < 0.7 suggests no improvement in model performance 23. In the calibration curve, the 45° line denotes optimal prediction, and the distance of the curve from this line shows how far the predicted probabilities of the nomogram are from the actual observations. The nomogram model was established using the "rms" package in the R programming environment (version 3.5.2) 24.
All reported p values are based on two-sided tests of significance, and a p value less than 0.05 was considered statistically significant.
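For a binary outcome, the C-index has a simple pairwise reading that the sketch below spells out: among all pairs of one woman with GDM and one without, it is the share in which the model assigns the higher predicted risk to the woman with GDM, with ties counted as one half. The function name is hypothetical and this is not the "rms" implementation.

```python
def c_index(y_true, risk):
    """Concordance index for a binary outcome (equals the AUROC).

    Compares every case (y == 1) with every control (y == 0).
    """
    cases = [p for y, p in zip(y_true, risk) if y == 1]
    controls = [p for y, p in zip(y_true, risk) if y == 0]
    concordant = 0.0
    for p_case in cases:
        for p_control in controls:
            if p_case > p_control:
                concordant += 1.0
            elif p_case == p_control:
                concordant += 0.5  # ties contribute half a concordant pair
    return concordant / (len(cases) * len(controls))
```

The double loop is quadratic in sample size, so for a cohort of this scale a rank-based computation would be the practical choice; the pairwise form is shown only because it mirrors the definition.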

Results
Study participants. Data on 258,466 gestational women at 10 "AAA" hospitals were extracted from the Xiamen Primary Health Information System. After excluding 64,335 women with missing values on glucose challenge test and/or OGTT, 3161 women with abnormal values, and 578 women with pre-pregnancy diabetes mellitus, 187,432 gestational women were eligible for inclusion, with 49,611 women diagnosed with GDM and 137,821 women free of GDM in the final analysis.
Baseline characteristics. Table 1 shows the baseline characteristics of all study participants. Women with GDM were older (mean: 29.33 vs. 28.34 years, p < 0.001), had higher pre-pregnancy BMI (mean: 21.27 vs. 20.59 kg/m2, p < 0.001) and lower education levels (23.20% vs. 25.70%, p < 0.001) than women free of GDM. No differences were noted for age at menarche, cigarette smoking, alcohol drinking, and maternal family history of hypertension between the two groups.
The possibility of measurement error for clinical biomarkers resulting from procedural differences across multiple hospitals was assessed using the ICC statistic (Supplementary Table 1). The ICCs for all clinical biomarkers were relatively low (< 0.07), indicating a low probability of clustering within hospitals and a lower likelihood of differences in measurement techniques between hospitals.
Identification of significant factors. Three models, namely model 0, model 1, and model 2, were constructed using logistic regression to identify potential factors in significant association with GDM risk (Table 2). Before and after adjusting for confounders, eight factors were consistently and significantly associated with GDM at a significance level of 0.001, including pre-pregnancy BMI, pre-pregnancy intake of folic acid, white cell count, platelet count, alanine transaminase, albumin, direct bilirubin, and creatinine. Of note, per
Correlation analysis of significant factors. Spearman correlation analysis was performed to test collinearity of the significant continuous factors identified above. The Spearman correlation coefficients (Table 3) ranged from −0.08 to 0.29, below the 0.6 cut-off set for collinearity.
Prediction performance assessment. The prediction performance of the eight significant independent factors was assessed by means of calibration and discrimination statistics. As shown in Table 4, the differences in AIC and BIC values between the basic model and the full model were far greater than 10, indicating a significant improvement in prediction from adding the eight significant factors, which was further confirmed by the likelihood ratio test (p < 0.001).
In addition, the significance of the NRI and IDI statistics revealed that adding the eight significant independent factors to the basic model improved the differentiation of women with GDM from the other gestational women under study, which was further reinforced by the significant AUROC difference between the two models (p < 0.001).
Furthermore, in the decision curve analysis, there were evident net benefits after adding these eight factors to the basic model (Fig. 1).
Table 1. The baseline characteristics of the study participants. SD standard deviation, IQR inter-quartile range (25% quantile to 75% quantile), OGTT oral glucose tolerance test. Except for age, which is expressed as mean (SD), continuous variables are expressed as median (IQR). Categorical data are summarized as count (percentage). *Between patients and controls, age was compared using the t test, and the other continuous variables were compared using the Wilcoxon-Mann-Whitney rank sum test; all categorical variables were compared using the χ2 test.
Table 3. Correlation analysis of continuous significant factors in predicting gestational diabetes mellitus in both patients and controls. coef. coefficient, ALT alanine transaminase. The lower triangular data represent the correlation coefficients in patients, and the upper triangular data represent the correlation coefficients in controls.
Establishment of a risk prediction model. In view of the nonlinear relationship between continuous significant factors and the decent prediction performance, a risk prediction model was established using the nomogram technique by modeling age and the eight identified factors of significance, including pre-pregnancy BMI, pre-pregnancy intake of folic acid, white cell count, platelet count, alanine transaminase, albumin, direct bilirubin, and creatinine, as illustrated in Fig. 2. This nomogram model had good prediction accuracy, with a C-index of 87%, indicating that, across all possible pairs of pregnant women with and without GDM, the model assigned the higher risk to the woman who went on to experience GDM in 87% of pairs. The calibration curve for this nomogram model is presented in Supplementary Figure 1.
In addition, positive predictive value and negative predictive value for differentiating the presence and absence of GDM were estimated to be 72.91% and 93.69%, respectively.
Figure 1. Decision curve analysis of eight pre- and early-pregnancy significant independent factors in predicting later gestational diabetes mellitus. GDM gestational diabetes mellitus. The orange solid line corresponds to the basic model that includes age, alcohol drinking, cigarette smoking, education, age at menarche, maternal family histories of diabetes mellitus and hypertension, as well as the presence of hemopathy, epilepsy, hyperthyroidism, cardiovascular diseases, liver diseases, kidney diseases, and lung diseases. The green solid line corresponds to the full model that includes both the factors in the basic model and the eight newly identified unrelated significant factors, including pre-pregnancy body mass index, pre-pregnancy intake of folic acid, white cell count, platelet count, alanine transaminase, albumin, direct bilirubin, and creatinine. At threshold probabilities above 0.2, the net benefit of the full model with the eight significant factors was greater than that of the basic model.
Figure 2. Establishment of a risk prediction nomogram based on pre- and early-pregnancy significant independent factors for later gestational diabetes mellitus. BMI body mass index, ALT alanine transaminase, Cr creatinine, GDM gestational diabetes mellitus. This nomogram can be used to manually obtain predicted values from a regression model that was fitted with the pre- and early-pregnancy significant independent factors. In detail, there is a reference line at the top for reading scoring points (range 0-100) for each factor in the regression model; these points are summed to give the total points, and the predicted value is then read at the bottom.

Discussion
The aim of this study was to examine the association of promising pre-pregnancy demographic parameters and early-pregnancy laboratory biomarkers with the later risk of GDM, and further to establish a prediction model.
The key findings of our analysis were the identification of eight independent pre-/early-pregnancy predictors in significant association with the later risk of GDM, and importantly the incorporation of these significant predictors in a nomogram model that had over 85% accuracy in the early detection of pregnant women who will progress to GDM at the third trimester. A growing number of epidemiological parameters and laboratory biomarkers have been evaluated for the prediction of GDM in the medical literature. For instance, Guo and colleagues retrospectively analyzed 3956 Chinese women who underwent their first antenatal visits, and found that age, pre-pregnancy obesity, first-trimester fasting plasma glucose, and a family history of diabetes mellitus were significant predictors of later GDM 16. In addition, many laboratory biomarkers in circulation such as fibroblast growth factors 25, fatty acids 26, and ferritin 27 have been listed as promising drivers of GDM. Currently, one of the greatest challenges facing obstetricians worldwide is the identification of proper early-pregnancy laboratory biomarkers and the establishment of a prediction model incorporating some well-established factors for GDM; yet, except for a few established risk factors such as age and obesity, the results of most studies are often not reproducible for other parameters or biomarkers. The reasons for the repeated failure are not fully understood, and may be attributable to inter-population heterogeneity in genetic backgrounds, study designs, phenotype definitions, analytical methodologies, and unaccounted environmental exposures or lifestyle factors 28–30. In addition to determining the key reasons for inconsistent replications, given the distinct genetic heterogeneity and epidemiologic characteristics, it is highly advisable to construct a database of potential determinants of GDM in each racial or ethnic population.
To derive a relatively reliable estimate, we resorted to a big database from ten "AAA" hospitals in Xiamen, involving 258,466 gestational women between 2008 and 2018, of whom 187,432 gestational women free of pre-pregnancy diabetes mellitus were finally analyzed. To control for confounders, we adopted a graded adjustment method, and only factors that were consistently and significantly associated with the risk of GDM were identified. After removing factors with strong evidence of correlation, we identified eight significant factors independently associated with GDM, and six of them are laboratory biomarkers. Consistent with the results of most previous studies 16,31-34, we here confirmed the contribution of pre-pregnancy obesity to the increased risk of having GDM, as well as the beneficial impact of pre-pregnancy intake of folic acid. Although the six laboratory biomarkers of significance identified in this study are routinely measured in clinical practice, their association with GDM is either the subject of debate due to conflicting data or rarely reported. Taking albumin as an example, Piuri and colleagues observed a significantly higher level of albumin in women with GDM than in the general population 35, whereas there was no material difference in albumin in the study by Gungor and colleagues 36. By contrast, we found that albumin level was significantly lower in women with GDM than in women without GDM. A real finding can fail to replicate for numerous reasons, including divergent genetic backgrounds and insufficient statistical power. Nevertheless, it is widely recognized that the risk attributable to a single index or biomarker is small, considering that GDM is a multifactorial disease to which inherited, environmental, and lifestyle factors contribute independently or interactively 37,38, and such a small effect may also be exacerbated by the presence of other factors.
For practical reasons, constructing a multivariable prediction model with decent prediction performance for GDM is imperative.
To shed some light on this issue, and to test the prediction accuracy and justify the gained benefits of the eight significant factors identified in this study, we employed multiple statistics from both calibration and discrimination aspects 39 and visual tools in decision curve analysis. On the basis of decent prediction performance, we regressed age and the eight significant factors in a nomogram model, and found that this model had an 87% prediction accuracy. The importance of this nomogram prediction model lies in facilitating the early clinical appraisal of the risk of developing GDM during the third trimester of pregnancy. For further practical application, we acknowledge that the results from the present population in Xiamen will require validation in an independent Chinese population and additional follow-up to confirm the nomogram risk prediction model presented here.
Our study findings have important public health implications. In clinical practice, the diagnosis of GDM is made during the 24th to 28th weeks of pregnancy. If we can predict the later occurrence of GDM by using pre-pregnancy or early-pregnancy markers, the time window for preventing adverse gestational consequences can be substantially widened by immediate intervention in high-risk pregnant women. It is worth noting that the nomogram prediction model we established had a C-index of 87% for distinguishing women who will progress to GDM later.
Limitations. There are some limitations to the present analysis. First, all gestational women were exclusively enrolled from ten "AAA" hospitals in Xiamen, China, and the extrapolation of our findings to other regions or racial groups is limited. Second, other important factors such as the previous histories of GDM and macrosomia 40, as well as sleep quality 41 and ambient air pollution exposure 42, which have been reported to be strong predictors of future GDM, were not available to us. Third, all laboratory biomarkers were measured only once, and it would be of great interest to monitor their dynamic changes in relation to the later development of GDM. Fourth, most demographic data in this cohort, especially pre-pregnancy intake of folic acid, were self-reported and error-prone, and so the possibility of measurement error and residual confounding remains.

Conclusions.
Taken together, through a big data analysis, we have identified eight independent pre-/early-pregnancy predictors in significant association with the later risk of GDM, and importantly a nomogram modeling these predictors has over 85% accuracy in the early detection of pregnant women who will progress to GDM at the third trimester. For practical reasons, we hope the current investigation will not remain just an endpoint