Prediction of gestational diabetes mellitus using machine learning from birth cohort data of the Japan Environment and Children's Study

Recently, prediction of gestational diabetes mellitus (GDM) using artificial intelligence (AI) from medical records has been reported. We aimed to evaluate GDM-predictive AI-based models using birth cohort data with a wide range of information and to explore factors contributing to GDM development. This investigation was conducted as a part of the Japan Environment and Children's Study (JECS). In total, 82,698 pregnant mothers who provided data on lifestyle, anthropometry, and socioeconomic status before pregnancy and in the first trimester were included in the study. We employed machine learning methods as AI algorithms, such as random forest (RF), gradient boosting decision tree (GBDT), and support vector machine (SVM), along with logistic regression (LR) as a reference. GBDT displayed the highest accuracy, followed by LR, RF, and SVM. Exploratory analysis of the JECS data identified health-related quality of life in early pregnancy and maternal birthweight, which have rarely been reported to be associated with GDM, along with variables already reported to be associated with GDM. Decision tree-based algorithms, such as GBDT, showed high accuracy, interpretability, and superiority for predicting GDM using birth cohort data.


For the GDM-PH(−) group, wherein overfitting occurred frequently, the results of changing the sampling methods are shown in Table 3. The results obtained using the SVM model did not change, even after altering the sampling methods. However, the results of undersampling in the RF model improved; the TPR increased to 0.18 (95% CI 0.14-0.22). The TPR also improved with both undersampling and oversampling in the GBDT and LR models as follows: undersampling in GBDT, 0.35 (95% CI 0.34-0.38); oversampling in GBDT, 0.21 (95% CI 0.16-0.27); undersampling in LR, 0.24 (95% CI 0.17-0.30); and oversampling in LR, 0.23 (95% CI 0.17-0.28). Following changes in sampling methods, undersampling showed higher accuracy than oversampling in the GBDT, LR, and RF models (but not in the SVM model).
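The two resampling strategies compared above can be illustrated with a minimal sketch. It assumes simple random undersampling of controls and random duplication of cases; the function names, the `ratio` parameter, and the toy data are hypothetical and are not taken from the study's implementation.

```python
import random

def undersample(cases, controls, ratio=1.0, seed=0):
    """Keep all cases; draw controls without replacement so that
    len(controls) is about ratio * len(cases)."""
    rng = random.Random(seed)
    n_controls = min(len(controls), int(round(ratio * len(cases))))
    return cases, rng.sample(controls, n_controls)

def oversample(cases, controls, ratio=1.0, seed=0):
    """Keep all controls; duplicate randomly chosen cases (with replacement)
    until len(cases) is about ratio * len(controls)."""
    rng = random.Random(seed)
    target = max(len(cases), int(round(ratio * len(controls))))
    extra = [rng.choice(cases) for _ in range(target - len(cases))]
    return cases + extra, controls

# Toy data (assumption): 28 cases and 972 controls, i.e. a 2.8% case rate.
cases = [("case", i) for i in range(28)]
controls = [("control", i) for i in range(972)]

under_cases, under_controls = undersample(cases, controls)
over_cases, over_controls = oversample(cases, controls)
print(len(under_cases), len(under_controls))  # 28 28
print(len(over_cases), len(over_controls))    # 972 972
```

Undersampling discards most controls, whereas oversampling adds no new information because every added case is an exact copy of an existing one.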
Using GBDT modeling for the GDM-PH(−) group, the relationship between TPR, false-positive rate (FPR), and change in AUC on altering the risk threshold is shown in Fig. 1. When the risk threshold was reduced, the TPR increased faster than the FPR. The AUC yielded a unimodal graph with a maximum value of 0.66 when the risk threshold was 0.025. In other models, the probability of GDM occurrence was zero in most input data due to overfitting; thus, altering the risk threshold was ineffective.
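A threshold sweep of this kind can be sketched as follows. The sketch assumes that the AUC reported at each threshold is the area under the single-point ROC polyline, (1 + TPR − FPR) / 2; the text does not state this definition. The labels and predicted probabilities are invented toy values, not JECS data.

```python
def rates_at_threshold(y_true, y_prob, threshold):
    """Return (TPR, FPR) when GDM is predicted for probability >= threshold."""
    tp = fp = fn = tn = 0
    for y, p in zip(y_true, y_prob):
        predicted_positive = p >= threshold
        if y == 1 and predicted_positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr

def single_point_auc(tpr, fpr):
    """Area under the ROC polyline (0,0) -> (FPR,TPR) -> (1,1) (assumed definition)."""
    return (1.0 + tpr - fpr) / 2.0

def sweep(y_true, y_prob, start=0.0, stop=0.5, step=0.005):
    """Evaluate TPR, FPR, and single-point AUC on a grid of risk thresholds."""
    n_steps = int(round((stop - start) / step))
    rows = []
    for i in range(n_steps + 1):
        threshold = round(start + i * step, 6)
        tpr, fpr = rates_at_threshold(y_true, y_prob, threshold)
        rows.append((threshold, tpr, fpr, single_point_auc(tpr, fpr)))
    return rows

# Toy labels and probabilities (assumption, for illustration only).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.30, 0.04, 0.01, 0.20, 0.03, 0.02, 0.02, 0.01, 0.01, 0.01]

rows = sweep(y_true, y_prob)
best = max(rows, key=lambda row: row[3])
print(f"best threshold={best[0]}, TPR={best[1]:.2f}, FPR={best[2]:.2f}, AUC={best[3]:.2f}")
```

With rare outcomes, predicted probabilities cluster near the prevalence, so the best-performing threshold tends to lie far below the default of 0.5.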
Variables with high variable importance (VIP) identified in the analysis of the GBDT model without changing the sampling methods are shown in Table 4. Variables with high VIP in the GDM-PH(−) group included HbA1c levels, BMI before pregnancy, and maternal age. Variables with high VIP in the GDM-PH(+) group included triglyceride levels, platelet count, and firstborn child's birth year. The SHAP (SHapley Additive exPlanations) summary plot (Mean (|SHAP Value|)) is shown in Fig. 2a. Variables with high Mean (|SHAP Value|) in the GDM-PH(+) group included number of previous deliveries, firstborn child's birth year, and BMI before pregnancy. Figure 2b shows variables with a high Mean (|SHAP Value|) in the GDM-PH(−) group, including maternal age, HbA1c levels, and BMI before pregnancy.
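The Mean (|SHAP Value|) ranking can be sketched as below, assuming a matrix of per-record SHAP values has already been computed by some explainer. The feature names and numbers are placeholders invented for illustration and do not correspond to any JECS variable or result.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by the mean absolute SHAP value across records.

    shap_rows: list of per-record lists, one SHAP value per feature.
    Returns (feature, score) pairs sorted from most to least important.
    """
    n_records = len(shap_rows)
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for j, value in enumerate(row):
            totals[j] += abs(value)
    scores = {name: totals[j] / n_records for j, name in enumerate(feature_names)}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Placeholder features and SHAP values (assumption, not study output).
features = ["feature_a", "feature_b", "feature_c"]
shap_rows = [
    [0.20, -0.05, 0.01],
    [-0.30, 0.10, 0.00],
    [0.10, -0.15, -0.02],
]

ranking = mean_abs_shap(shap_rows, features)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Taking the absolute value before averaging matters: the signed SHAP values of `feature_a` sum to zero here, yet it ranks as the most influential feature.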

Discussion
We compared four machine learning methods to improve GDM prediction models based on a large birth cohort.
GBDT exhibited the highest accuracy, followed by LR, RF, and SVM. Without changing the sampling methods, overfitting occurred with all algorithms except GBDT for GDM-PH(−). The accuracy of GDM prediction improved without overfitting for all algorithms, except SVM, when undersampling or oversampling was used. Changing the risk thresholds improved the accuracy of GBDT. Furthermore, GBDT results were more accurate than the existing method wherein LR used only maternal age, pre-pregnancy BMI, and laboratory results of specimens (see Supplementary Table S1 online). This could be because the JECS data provide more variables useful for GDM prediction and because GBDT can construct a non-linear decision boundary. There were some differences in variables important for predicting GDM in the GBDT model between the GDM-PH(+) (recurrent GDM) and GDM-PH(−) (new-onset GDM) groups. Thus, differences in VIP between recurrent GDM and new-onset GDM in JECS data were not based on parity.
The RF, GBDT, and SVM algorithms used are reportedly effective for structured data; thus, we compared them to determine the most appropriate one for the JECS data. For the GDM-PH(+) group, overfitting occurred in the SVM model. Other algorithms yielded stable results without overfitting. For the GDM-PH(−) group, overfitting occurred in all models, except for the GBDT model. Owing to the exploratory approach for predicting GDM, the data set used here was unique because it included many variables that do not affect GDM. Such noisy data have compounding negative effects on generalizability and overfitting 22. Imbalanced datasets often result in an overfitted model to achieve high classification accuracy 23. The GDM-PH(+) group (N = 624) had a much smaller sample size than the GDM-PH(−) group (N = 82,074). Both groups included many variables that did not affect GDM. However, the ratio of the GDM and non-GDM groups was almost balanced in the GDM-PH(+) group. In contrast, the GDM-PH(−) group had a very low incidence of GDM (2.8%). In the SVM model, the presence of many variables that do not improve predictability negatively affects the analysis. SVM extracts records near the boundary surface as support vectors and creates a discrimination surface using only support vectors 10. Thus, SVM can reduce the number of records used for analysis. However, SVM is not an algorithm designed to properly select variables from a large number of variables. Therefore, the choice of support vectors in our study may have been inappropriate, possibly leading to overfitting.
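The effect of the low incidence in the GDM-PH(−) group can be made concrete with a short calculation. The sketch below evaluates a hypothetical model that predicts "no GDM" for every mother; the case count is an approximation derived from the reported 2.8% incidence and N = 82,074, not a figure reported in the study.

```python
def majority_baseline(n_total, n_cases):
    """Metrics for a model that predicts 'no GDM' for every record."""
    n_controls = n_total - n_cases
    accuracy = n_controls / n_total  # every control is classified correctly
    tpr = 0.0                        # no case is ever detected
    fpr = 0.0                        # no control is ever flagged
    return accuracy, tpr, fpr

n_total = 82074
# Approximate number of cases implied by a 2.8% incidence (assumption).
n_cases = round(0.028 * n_total)

accuracy, tpr, fpr = majority_baseline(n_total, n_cases)
print(f"accuracy={accuracy:.3f}, TPR={tpr:.2f}, FPR={fpr:.2f}")
# accuracy=0.972, TPR=0.00, FPR=0.00
```

A model can therefore reach about 97% accuracy while identifying no cases at all, which is why TPR and FPR, rather than accuracy alone, are informative for this group.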
In contrast, RF and GBDT models use decision tree algorithms. The decision tree requires repeated binary decision-making. Therefore, variables that do not improve the predictability of GDM are not included in the decision tree 11,12. Consequently, decision tree algorithms are highly robust to data with many variables. In the RF algorithm, random sampling of the training dataset is performed as the first step to create multiple datasets 11. Subsequently, the RF algorithm creates a decision tree model for each dataset to predict results by the majority rule. Sampling datasets from the GDM-PH(−) group using this particular algorithm may not ensure the model diversity generated by random sampling, possibly leading to overfitting. In the GBDT model, each subsequent tree is fitted by gradient descent to the errors remaining after the preceding trees 12. Therefore, unlike the RF model, the decision tree in the GBDT model may be trained while reducing the bias between the case and control groups. However, the TPR was not high even in the GBDT model (the only model without overfitting).
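The sequential training that distinguishes GBDT from RF can be sketched with a deliberately simplified example. It assumes a single feature, one-split regression trees (stumps), and squared-error loss, for which the negative gradient equals the residual. The function names, learning rate, and toy data are illustrative assumptions and do not reproduce the study's configuration.

```python
def fit_stump(xs, targets):
    """Fit the best one-split regression stump (least squared error)."""
    best = None
    values = sorted(set(xs))
    for low, high in zip(values, values[1:]):
        split = (low + high) / 2.0
        left = [t for x, t in zip(xs, targets) if x <= split]
        right = [t for x, t in zip(xs, targets) if x > split]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, split, left_mean, right_mean)
    _, split, left_mean, right_mean = best
    return lambda x: left_mean if x <= split else right_mean

def gradient_boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_trees):
        # Residuals are the negative gradient of the squared-error loss.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)
    return predict

# Toy data (assumption): the outcome switches from 0 to 1 between x=3 and x=4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

model = gradient_boost(xs, ys)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(f"training MSE after boosting: {mse:.2e}")
```

Each new stump depends on the errors of all earlier ones, whereas RF trees are built independently on resampled data and combined by majority vote.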
As in this study, the development of prediction models using data with a low case-to-control ratio requires adjustment of the sample size of the training data by changing the sampling method 24. The TPRs of the RF, GBDT, and LR models were improved by changing the sampling method (Table 3). Undersampling could prevent overfitting caused by excessive control data. However, in the SVM model, the accuracy of GDM prediction did not improve, possibly because changes in sampling methods do not solve the problem of multidimensional data with many variables that do not improve the predictability. With oversampling, the TPR improved slightly in the GBDT and LR models, but overfitting occurred in the RF model. The oversampling technique randomly duplicates data until the case-to-control ratio reaches a specific value; thus, this technique may not solve the problem of the RF model (i.e., model diversity). In contrast, in the LR model, imbalance corrections, including changes in sampling methods, may even worsen model performance 25. In this study, the method of changing the risk threshold was used for imbalance correction without changing the sampling method. The JECS data were not designed to estimate imbalance correction; thus, it was not possible to evaluate such effects in this study. However, our study demonstrated the potential to improve TPRs while keeping the FPR low by changing the thresholds (Fig. 1). Generally, lowering the risk threshold increases both the TPR and FPR, but setting an appropriate risk threshold for LR and GBDT enables imbalance correction without changing the sampling method. For setting the risk threshold, Goorbergh et al. used two fixed values: the prevalence of malignancy in the training dataset and the default risk threshold of 0.5. However, in this study, when the risk threshold was varied from 0 to 0.5 in steps of 0.005, the AUC reached its maximum value at 0.025, which was close to the GDM prevalence of 0.027.
We performed an exploratory analysis of factors contributing to GDM using AI. Typically, such exploratory methods have the following disadvantages: (1) the results may be inappropriate depending on the AI algorithms used, and (2) due to the cost, it is difficult to increase the number of participants to obtain enough variables that are acceptable for the exploratory analysis. However, by using sufficient data, selecting appropriate algorithms, and comparing VIPs, it was possible to identify variables previously not associated with GDM and verify previously reported associated factors.
In this study, we predicted the development of GDM based on information that could be collected in the early stages of pregnancy. Mothers are more likely to be diagnosed with GDM at 24-28 weeks of gestation. In this study, the average date of completion of the collected questionnaires was 14-15 weeks, which is considered early enough to predict the diagnosis, even when considering the time between the blood collection and the results of the tests. A meta-analysis reported a protective association (21-46%) of physical activity against GDM when comparing any type of physical activity to none in either the pre-pregnancy or early pregnancy period 20. If a high-risk group can be identified around the first trimester, it may lead to GDM prevention.
In this study, 775 questions were used to predict the incidence of GDM. It would not be practical to build a prediction model using all of these questions, as it would take a lot of time to enter the predictors. Therefore, it is important to screen for the variables that are important for prediction. In this study, two evaluation criteria, VIP and Mean (|SHAP value|), were used to select predictors. High VIP variables identified in this study are listed in Table 4. Previous GDM studies identified a history of GDM in previous pregnancies, maternal age, and obesity as risk factors for GDM 26. Additionally, the effect of GDM on the interpregnancy interval was reported 27. The JECS data do not include the interpregnancy interval. Therefore, the firstborn child's birth year was considered as an alternative variable. One study reported a significant difference in white blood cell count and platelet count between the GDM and non-GDM groups in the second trimester 28. A meta-analysis showed a significant increase in lipid levels (e.g., triglyceride) in mothers with GDM in the first and second trimesters 29. Variables that are reportedly associated with GDM in studies conducted before the JECS were also identified as factors with high VIP in this study. In contrast, regarding urinary creatinine concentration, a study on the associations between urinary metals in early pregnancy and the subsequent risk of GDM reported no significant difference in urinary creatinine between GDM and non-GDM groups 30. Nevertheless, urinary creatinine concentration had a higher VIP in the GDM-PH(−) group, especially in the nulliparous group.
Although the reason for this is unknown, it may be a surrogate indicator for some other factor, such as physique.
The items in the questionnaire administered at enrollment in this study include the 8-item Short-Form Health Survey (SF-8) items for health-related quality of life (HRQOL) 31. Physical component summary and mental component summary were variables with high VIP regardless of the presence or absence of a history of GDM. Regarding GDM and HRQOL, a systematic review examining the short- and long-term progression of HRQOL and their association with GDM diagnosis was reported; GDM does not directly lead to reduced QOL in mothers but causes some complicated interactions with psychological factors, resulting in reduced QOL 32. The SF-8 data in this study were collected before 22 weeks of gestation. Our study results suggest that mothers' HRQOLs are related to the risk of GDM; thus, GDM may further reduce HRQOL.
Recent studies examining the association between mothers' birth weights and GDM revealed that mothers with low birth weights or macrosomia were at higher risk of GDM 33. We identified mothers' birth weights as factors with high VIP in the GDM-PH(−) group. Hales et al. reported a correlation between low birth weight and subsequent glucose intolerance 34. GDM is a mild form of glucose intolerance; thus, mothers with low birth weights may have an increased risk of GDM.
High Mean|SHAP| variables identified in this study are shown in Fig. 2a,b. Although variables similar to those with high VIP were found in the top 20, SF-8 MCS and SF-8 PCS were absent from the top 20 for both GDM-PH(+) and GDM-PH(−). On the other hand, the GDM-PH(+) group showed new variables, chocolate and vitamin D intake from the dietary questionnaire, and the GDM-PH(−) group showed a new variable, supplement intake (folic acid). Both vitamin D and folic acid have been reported to have an association with GDM 35,36.
Although variables including those already reported to be associated with GDM were detected in this study, the AUC score of gradient boosting for those with GDM history (0.67) was below the acceptable minimum for clinical use (0.70). However, in the JECS, we are currently analyzing maternal genetic data, which will be provided in the future. By reconstructing the model after taking these genetic backgrounds into account, we expect to improve the prediction accuracy.
This study has some limitations. First, in Japan, the diagnostic criteria of the Japanese Society of Obstetrics and Gynecology are used to determine GDM. However, because the JECS is a multi-region, multi-institution cohort study and GDM data were obtained from medical record transcripts, we could not review in detail the diagnostic criteria for GDM used by the cooperating health care providers 21. Second, analysis in this study was performed considering information collected at the time of study enrollment. However, we did not consider the effects of other factors not identified in the JECS, especially genetic information and family history of diabetes mellitus. Third, information on diet was collected using self-administered questionnaires. Therefore, the results may not accurately reflect the actual food or nutrient intake. Fourth, the incidence of GDM in Japan is 7-13% 37. However, the incidence of GDM in this study was 2.7%. This may indicate that the JECS included more health-conscious mothers or that enrollment of high-risk pregnancies was low, leading to sampling bias. Fifth, the JECS was conducted in Japan, and most participants were Japanese. Thus, generalization to populations from other countries may be inaccurate because the JECS results reflect the unique living environment and lifestyle in Japan. Finally, given the size of the JECS data, it is difficult to obtain predictions by physicians as an external evaluation. Studies combining the findings of this study (mother's birth weight and psychological factors) with previously reported factors, including genes, are needed for more accurate prediction compared to prediction by other means, such as using genes.
In conclusion, we demonstrated that exploratory analysis using AI for a large birth cohort is possible through the appropriate use of algorithms. Algorithm comparison revealed high accuracy, interpretability, and superiority of decision tree-based algorithms, including GBDT, for the datasets in this study. Further studies regarding GDM prediction using AI are needed to improve the TPR by collecting other variables, including genetic information and family history of diabetes mellitus. Using exploratory analysis of the JECS data, we identified the importance of previously reported variables related to GDM and new variables, such as HRQOL in early pregnancy and mothers' birth weights, related to GDM.

Data sources
The JECS is a nationwide birth cohort study. The design of the JECS study is described elsewhere 18. The eligibility criteria for participants in the JECS did not consider the presence or absence of disease. This study used the jecsta-20190930 dataset, which was released in October 2019. The following data were identified from the dataset and used for analysis: maternal questionnaire data and dietary data from the survey administered at study enrollment (T1), medical record transcripts during pregnancy, child's sex determined after delivery, laboratory results of specimens collected by 21 weeks of gestation, parental education and household income data collected during mid-late pregnancy, and parents' birth weights (reported by the mother) after delivery. For the study outcome, GDM cases were defined as GDM during pregnancy per medical record transcripts. The maternal questionnaire included the Kessler Psychological Distress Scale (K6) as an indicator of psychological distress 38, a short version of the International Physical Activity Questionnaire as an indicator of physical activity 39,40, SF-8 Health Survey (SF-8) as an indicator of health-related quality of life 31, and environmental exposures 41. Food-intake and

Table 5. Settings of hyperparameters of each algorithm. GBDT, gradient boosting decision tree; LR, logistic regression; RF, random forest; SVM, support vector machine.

Table 2.
AUC, TPR, and FPR for the training dataset in various algorithms. AUC, area under the receiver operating characteristic curve; CI, confidence interval; GBDT, gradient boosting decision tree; GDM, gestational diabetes mellitus; GDM-PH(+), past history of GDM; GDM-PH(−), no past history of GDM; LR, logistic regression; RF, random forest; SVM, support vector machine.

Table 3.
Result of resampling for the GDM-PH(−) group. AUC, area under the receiver operating characteristic curve; CI, confidence interval; GBDT, gradient boosting decision tree; GDM, gestational diabetes mellitus; GDM-PH(+), past history of GDM; GDM-PH(−), no past history of GDM; LR, logistic regression; RF, random forest; SVM, support vector machine.

Columns: AUC (mean, 95% CI), TPR (%) (mean, 95% CI), FPR (mean, 95% CI).
Figure 1. Changes in true-positive rate and false-positive rate by differences in risk thresholds in GBDT. AUC, area under the receiver operating characteristic curve; GBDT, gradient boosting decision tree.