Identification of PCB congeners and their thresholds associated with diabetes using decision tree analysis

Few studies have investigated the potential combined effects of multiple PCB congeners on diabetes. To address this gap, we used data from 1244 adults in the National Health and Nutrition Examination Survey (NHANES) 2003–2004. We used (1) classification trees to identify serum PCB congeners and their thresholds associated with diabetes; and (2) logistic regression to estimate the odds ratios (ORs) and 95% confidence intervals (CIs) of diabetes with combined PCB congeners. Of the 40 PCB congeners examined, PCB 126 had the strongest association with diabetes. The adjusted OR of diabetes comparing PCB 126 > 0.025 to ≤ 0.025 ng/g was 2.14 (95% CI 1.30–3.53). In the subpopulation with PCB 126 > 0.025 ng/g, a lower PCB 101 concentration was associated with an increased risk of diabetes (comparing PCB 101 < 0.72 to ≥ 0.72 ng/g, OR 3.3, 95% CI 1.27–8.55). In the subpopulation with PCB 126 > 0.025 ng/g and PCB 101 < 0.72 ng/g, a higher PCB 49 concentration was associated with an increased risk of diabetes (comparing PCB 49 > 0.65 to ≤ 0.65 ng/g, OR 2.79, 95% CI 1.06–7.35). This nationally representative study provides new insights into the combined associations of PCBs with diabetes.

to include an over-sampling of minorities and to represent national non-institutionalized U.S. populations 24. Information on sociodemographic characteristics, lifestyle characteristics, diet, and medical conditions is collected via an in-person interview and a physical examination in a mobile examination center (MEC). The NHANES data are released publicly every two years. The study was approved by the National Center for Health Statistics (NCHS) Research Ethics Review Board.
For this study, we used data from NHANES 2003-2004 because it provided the most recent measurements of serum PCBs for each participant. We limited the analysis to non-pregnant adults aged ≥ 20 years who had data available on serum PCBs and diabetes information (n = 1,258). We additionally excluded individuals whose body mass index (BMI) data were unavailable (n = 30) and individuals with missing covariate information (n = 4). As a result, 1224 adult participants were included in the study.

Exposure assessment
Serum PCBs were measured by high-resolution gas chromatography/isotope-dilution high-resolution mass spectrometry (HRGC/ID-HRMS) among a randomly selected one-third of participants who were 12 years old or older. Briefly, around 2-10 ml of serum spiked with 13C-labeled internal standards was extracted using a C18 solid phase extraction (SPE) procedure with hexane 25. Each congener had a specific limit of detection (LOD). According to NHANES analytic guidance, values below the LOD were assigned the value of the LOD divided by the square root of 2.
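The substitution rule for values below the LOD can be sketched as follows. This is a minimal illustration, assuming each measurement comes with a flag marking whether it fell below the congener-specific LOD; the function name, the flag, and the example numbers are hypothetical and are not taken from the NHANES files.

```python
import math

def impute_below_lod(value, lod, below_lod):
    """Return LOD / sqrt(2) for a non-detect, otherwise the measured value."""
    if below_lod:
        return lod / math.sqrt(2)
    return value

# Illustrative numbers only: a congener with an assumed LOD of 0.4 ng/g lipid.
detected = impute_below_lod(1.2, lod=0.4, below_lod=False)
non_detect = impute_below_lod(0.1, lod=0.4, below_lod=True)
```

Because the LOD is congener-specific, the substituted value differs from congener to congener.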

Diabetes ascertainment
Diabetes status was ascertained through a questionnaire administered by trained interviewers and through laboratory tests. Specifically, participants were defined as having diabetes if they reported a previous physician diagnosis of diabetes, or if they had no previous diagnosis but had glycohemoglobin (A1C) ≥ 6.5% or fasting plasma glucose concentrations ≥ 126 mg/dl 26,27. This method of diabetes ascertainment was found to be 63.2% sensitive and 97.4% specific for diabetes in a previous NHANES validation study 28.
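The case definition above amounts to a simple logical rule, sketched here under the assumption that missing laboratory values are represented as None. The function and argument names are illustrative and do not correspond to NHANES variable names.

```python
def has_diabetes(self_reported_diagnosis, a1c_percent=None, fasting_glucose_mg_dl=None):
    """Apply the case definition: self-report, A1C >= 6.5%, or fasting glucose >= 126 mg/dl."""
    if self_reported_diagnosis:
        return True
    if a1c_percent is not None and a1c_percent >= 6.5:
        return True
    if fasting_glucose_mg_dl is not None and fasting_glucose_mg_dl >= 126:
        return True
    return False

# Hypothetical participants, for illustration only.
diagnosed = has_diabetes(True)
undiagnosed_by_a1c = has_diabetes(False, a1c_percent=6.8, fasting_glucose_mg_dl=110)
no_diabetes = has_diabetes(False, a1c_percent=5.4, fasting_glucose_mg_dl=98)
```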

Sociodemographic and lifestyle characteristics assessment
Information on age, sex (male/female), race/ethnicity (non-Hispanic White, non-Hispanic Black, Hispanic, and other), education (less than high school, high school, and higher than high school), family history of diabetes (yes/no), family income, smoking status, alcohol consumption, and physical activity was assessed by self-reported questionnaires during the in-person interview. Family income-to-poverty ratio (PIR) was categorized as ≤ 1.30, 1.31-3.50, and > 3.50 29. Smoking status was categorized as never (smoked less than 100 cigarettes in their lifetime), ever (did not smoke at the time of the survey), and current smoker (smoked at the time of the survey) 30. Physical activity was categorized as < 600, 600-1200, and > 1200 metabolic equivalents of task (MET) min per week 31. Weight and height were measured following a standardized protocol during the physical examination, and BMI was calculated as weight in kilograms divided by height in meters squared. BMI categories were defined as underweight (< 18.5 kg/m2), normal (18.5-24.9 kg/m2), overweight (25.0-29.9 kg/m2), and obese (≥ 30.0 kg/m2). Sixteen underweight participants were combined with normal-weight participants for statistical analyses. Dietary information was obtained through 24-h dietary recall. Total energy intake (kcal/day) and alcohol intake were calculated using the USDA food composition database. Alcohol intake was then categorized as non-drinker (0 g/day), moderate drinker (0.1-28 g/day for men and 0.1-14 g/day for women), and heavy drinker (≥ 28 g/day for men and ≥ 14 g/day for women) 32. Diet quality, represented by the Healthy Eating Index-2010 (HEI), has been found to be associated with a decreased risk of diabetes 33. A higher HEI score indicates a higher diet quality based on 12 food components including total fruit, whole fruit, total vegetables, greens and beans, whole grains, dairy, total protein foods, seafood and plant proteins, fatty acids, refined grains, sodium, and empty calories (e.g., added sugars) 32.
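As an illustration of the BMI definition and categories described above, the following sketch computes BMI and assigns the three analytic categories, with underweight merged into normal weight as in the analysis. The function names are hypothetical, and treating values between 29.9 and 30.0 kg/m2 as overweight is an assumption about how the category boundaries were applied.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2

def bmi_category(bmi_value):
    """Analytic BMI category, with underweight merged into normal weight."""
    if bmi_value < 25.0:
        return "underweight/normal"
    if bmi_value < 30.0:
        return "overweight"
    return "obese"

# Hypothetical participants, for illustration only.
example_normal = bmi_category(bmi(70.0, 1.75))   # BMI about 22.9
example_obese = bmi_category(bmi(95.0, 1.75))    # BMI about 31.0
```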

Statistical analysis
For descriptive statistical analyses, we accounted for the complex, multistage design of NHANES by using appropriate sample weights, strata, and primary sampling units. We compared population characteristics by quintile of lipid-adjusted serum concentration of the sum of 40 PCBs (∑40-PCBs) using the t-test for continuous variables and the chi-square test for categorical variables. Then, we examined the potential combined effects of the 40 PCB congeners on diabetes in two steps.
In our first step, we used the decision tree classification model to identify serum PCB profiles in relation to diabetes with a corresponding threshold. The classification tree, a non-parametric supervised learning method, was chosen for several reasons. First, it can perform dimensionality reduction and classification simultaneously, which is helpful for analyzing serum PCBs, a complex mixture of different congeners. Second, it can identify potential interactions among a mixture of PCBs. Third, it can identify threshold values for each PCB congener. Last, it is robust to outliers in PCB concentrations and makes no assumptions about data distributions. The participants were classified as living with diabetes or not based on all 40 measured PCB congeners. The entire dataset was randomly split into a 70% training set (n = 858) and a 30% test set (n = 386). A ten-fold cross-validation procedure was used to optimize the parameters and prune the tree to avoid overfitting. We used the confusion matrix and computed the accuracy on the test set to evaluate the tree's performance (Supplemental Code). This analysis was performed using the rpart package in R version 4.1.2.
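To illustrate how a classification tree arrives at a threshold for a single congener, the sketch below searches candidate cut points and keeps the one that minimizes the weighted Gini impurity of the two resulting groups. It is a simplified, single-variable illustration in Python rather than the rpart analysis itself; the use of the Gini criterion is an assumption (the text does not state the splitting criterion), and the data are invented for the example.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value < threshold'."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # start from the unsplit impurity
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no cut point between identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = ((lo + hi) / 2, score)  # midpoint between neighbors
    return best

# Invented concentrations (ng/g) and diabetes labels, for illustration only.
concentrations = [0.01, 0.02, 0.02, 0.03, 0.04, 0.05]
diabetes = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(concentrations, diabetes)
```

A full tree repeats this search over every congener at every node and then prunes the result, which is how thresholds for several congeners can appear along one branch.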
In our second step, logistic regression was used to estimate odds ratios (ORs) and 95% confidence intervals (CIs) of diabetes associated with the identified serum PCB profiles. We followed NHANES analytic guidelines accounting for sample weights and sample design. In the basic models, we adjusted only for demographic variables including age, sex, and race/ethnicity. In the full models, we additionally adjusted for variables that could serve as potential confounders, including BMI, education level, family income-to-poverty ratio, smoking status, alcohol intake, physical activity level, 2010 healthy eating index, and family history of diabetes.
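The relation between a logistic regression coefficient and the reported OR with its 95% CI can be sketched as follows. This uses a normal approximation (z = 1.96), which is an assumption: survey procedures may instead use a t distribution with design-based degrees of freedom. The function names are illustrative. The example back-calculates a standard error from the fully adjusted PCB 126 estimate quoted in the abstract (OR 2.14, 95% CI 1.30-3.53) and then reconstructs the interval.

```python
import math

def odds_ratio_ci(beta, se, z=1.96):
    """Convert a log-odds coefficient and its standard error to (OR, lower, upper)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

def se_from_ci(lower, upper, z=1.96):
    """Back-calculate the standard error on the log scale from a reported CI."""
    return (math.log(upper) - math.log(lower)) / (2 * z)

# Reported fully adjusted estimate for PCB 126 > 0.025 vs <= 0.025 ng/g.
beta = math.log(2.14)
se = se_from_ci(1.30, 3.53)
or_estimate, ci_lower, ci_upper = odds_ratio_ci(beta, se)
```

The reconstructed limits agree with the reported ones to rounding, which reflects that the interval is symmetric around the estimate on the log-odds scale.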
Although NHANES does not explicitly collect information on the type of diabetes, we considered participants to have type 1 diabetes if they started insulin within 1 year of diabetes diagnosis, or were currently using insulin, or were diagnosed with diabetes under age 30 (62). To explore the influence of diabetes type, we performed a sensitivity analysis excluding those possible type 1 diabetes cases; therefore, the vast majority of the remaining cases would be type 2 diabetes cases. This second step was performed using survey procedures with SAS software (version 9.4; SAS Institute Inc., Cary, NC, USA).
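The rule used to flag possible type 1 diabetes in the sensitivity analysis can be written as a short predicate. This is a sketch under stated assumptions: the argument names are hypothetical, missing values are represented as None, and "within 1 year" is interpreted as a gap of at most one year between diagnosis and insulin start.

```python
def possibly_type1(age_at_diagnosis, years_to_insulin_start, currently_on_insulin):
    """Flag a diabetes case as possible type 1 if any of the three criteria holds."""
    started_within_1y = (
        years_to_insulin_start is not None and years_to_insulin_start <= 1
    )
    diagnosed_young = age_at_diagnosis is not None and age_at_diagnosis < 30
    return started_within_1y or currently_on_insulin or diagnosed_young

# Hypothetical cases, for illustration only.
flag_young = possibly_type1(22, None, False)
flag_late_no_insulin = possibly_type1(55, None, False)
```

Cases flagged by this rule would be dropped before refitting the models, leaving cases that are mostly type 2.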

Results
Among the 1224 eligible participants, the weighted mean (SE) age was 46 (0.6) years, 50.8% (95% CI 47.2-54.4%) were female, and 70.9% (95% CI 64.0-77.7%) were non-Hispanic White. The prevalence of diabetes was 13.2% in the study population, and the weighted median serum concentration of the sum of 40 PCBs (∑40-PCBs) was 153.9 ng/g lipid adjusted (interquartile range [IQR] 87.9-266.4). Compared to participants with a lower serum concentration of ∑40-PCBs, those with a higher serum concentration of ∑40-PCBs were more likely to be older, have a lower total energy intake, a better dietary quality as assessed by the HEI-2010, and diabetes; and less likely to be Hispanic, current smokers, and have a lower family income (Table 1).
Using a non-parametric supervised learning method, a classification tree consisting of a combination of PCB congeners and their thresholds related to diabetes was learned from the 858 training samples (Fig. 1). Identified PCB profiles related to diabetes are indicated in the internal nodes. Each node separated the participants into two more homogeneous subpopulations based on whether their serum PCB concentrations were higher or lower than the threshold. The proportion of each subpopulation is indicated above each identified PCB profile. The red color indicates a higher probability of having diabetes. At the root node, the PCB profile (ng/g lipid weight) most related to diabetes was identified: participants with a serum concentration of PCB 126 ≥ 0.025 had a higher probability of having diabetes. At the internal nodes, among participants with a serum concentration of PCB 126 ≥ 0.025, six additional PCB profiles involving PCB 101, 49, 151, 149, and 169 were identified (Fig. 1, Table 2). The accuracy of the model on the test data was 0.842, meaning that the model classified 84.2% of the test samples correctly.
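Accuracy is derived from the confusion matrix as the share of correctly classified samples; sensitivity and specificity come from the same four counts. The sketch below uses made-up counts purely for illustration; they are not the counts from this study, and the function name is hypothetical.

```python
def tree_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,      # correct predictions / all samples
        "sensitivity": tp / (tp + fn),      # true cases correctly identified
        "specificity": tn / (tn + fp),      # non-cases correctly identified
    }

# Invented counts for a hypothetical test set of 100 samples.
metrics = tree_metrics(tp=5, fp=5, tn=80, fn=10)
```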
Table 3 presents adjusted ORs and 95% CIs of diabetes by the identified PCB profiles. After adjusting for confounders, PCB 126 was still the most consistent congener associated with diabetes; the ORs (95% CIs) of diabetes were 2.11 (1.24-3.61) in the basic model and 2.14 (1.30-3.53) in the full model for participants with a higher serum concentration of PCB 126 (> 0.025 ng/g), compared to those with a lower PCB 126 (≤ 0.025 ng/g). When we performed sensitivity analysis by sex, we found a higher odds ratio for females than males, although the confidence intervals overlapped (OR (95% CI) for females: 2.85 (1.08-7.51); males: 1.87 (1.03-3.37)).
Interestingly, in the subpopulation with a higher serum concentration of PCB 126, a lower serum concentration of PCB 101 was associated with an increased risk of diabetes (comparing PCB 101 < 0.72 to ≥ 0.72 ng/g, fully adjusted OR 3.3, 95% CI 1.27-8.55). In the subpopulation with a higher serum concentration of PCB 126 and a lower serum concentration of PCB 101, a higher serum concentration of PCB 49 was associated with an increased risk of diabetes (comparing PCB 49 > 0.65 to ≤ 0.65 ng/g, fully adjusted OR 2.79, 95% CI 1.06-7.35). Although the last two identified PCB profiles with PCB 126, 101, 49, 151, 149, and 169 were also significantly associated with diabetes, these findings were inconclusive because of the wide confidence intervals. In the sensitivity analyses excluding those who possibly had type 1 diabetes, similar results were observed (Supplemental Table 1).

Discussion
In this nationally representative sample of US adults, we identified serum PCB congeners and their thresholds associated with diabetes using classification tree analysis. After adjustment for demographic, socioeconomic, dietary, and lifestyle factors, we found that serum PCB 126, a dioxin-like PCB, was the congener most consistently associated with diabetes. Further, we identified the combined associations of serum PCB 126, 101, and 49 with diabetes. All three of these congeners are constituents of commercial Aroclor products produced as large-volume chemicals by Monsanto.
Our finding that a higher serum concentration of PCB 126 was associated with an increased risk of diabetes in NHANES 2003-2004 was consistent with previous findings in NHANES 1999-2002 and in a Belgian study 11,17. Comparing our threshold of PCB 126 identified by the classification tree to that in NHANES 1999-2002, our threshold (≥ 0.025 ng/g) was lower than the medium group (0.031-0.084 ng/g) and high group (≥ 0.084 ng/g) that were associated with total diabetes in that study (medium vs. low OR 1.67, 95% CI 1.03-2.71, and high vs. low OR 3.68, 95% CI 2.09-6.49). That PCB 126 was the congener most consistently associated with diabetes is plausible because it is the most potent dioxin-like PCB congener that can interact with the aryl hydrocarbon receptor (AhR), alter glucose transport and insulin tolerance in mice through an AhR-dependent mechanism 34-36, and inhibit adipogenesis, which leads to alteration in fatty acid metabolism 37.
With respect to the findings of the combined associations, to the best of our knowledge, the only other comparable study is a recently published study that compared the multipollutant effects of persistent organic pollutant (POP) mixture exposure on gestational diabetes mellitus (GDM) risk 38. That study evaluated six non-dioxin-like (NDL) PCBs (PCB 28, 52, 101, 138, 153, and 180) with other POPs and found that PCB 101 was the most important predictor for glucose homeostasis but the least important predictor for GDM. This discrepancy and […] enzymes including cytochrome P450 (CYP) 2B1 and CYP3A1. It also has an unsubstituted ring at the 3 and 4 positions, making it more readily metabolized. Since PCB 101 is metabolized through CYP 3A1, and PCB 126 can induce activation of CYP 3A 39,40, PCB 126 may enhance the metabolism of PCB 101. However, we did not detect a strong correlation between PCB 101 and PCB 126 levels (Spearman correlation = 0.18), suggesting that the interaction between PCB mixtures and diabetes is complex and may involve other contaminants. As the Liu et al. study did not include PCB 126 in the analysis, it is possible that the observed positive association between PCB 101 and GDM actually reflects the effect of PCB 126 or other contaminants. In addition, it is very common that environmental exposure and health outcomes are not linearly associated. The non-linear relationship between PCB 101 and GDM was observed among pregnant women in a prior study 41. Inverse associations of GDM with PCB 101 at relatively low or high concentrations were shown in their dose-response curves. Although GDM tends to be a temporary condition, the risk of developing diabetes is tenfold higher among women with a history of GDM than among those without 42. In the subpopulation with a higher PCB 126 and a lower PCB 101, we observed a positive association between PCB 49 and diabetes. PCB 49, like PCB 101, is a non-dioxin-like PCB with the 3,4 positions unsubstituted in one of the biphenyl rings. Although relatively little is known regarding the toxicity of PCB 49, it has been shown to have estrogenic activity and can disrupt normal endocrine function 43. However, this finding was different from those in an Anniston cohort study that observed a null association between the estrogenic congener group and diabetes 16. Differences in the PCBs examined (specific PCB profile vs. the sum of 7 estrogenic congeners), race/ethnicity (nationally representative vs. 46% African American), and exposure level (general population vs. highly exposed) likely complicate the comparison of the findings.
A major strength of this analysis was that we used a data-driven approach to analyze a complex mixture of serum PCBs, which can assess the associations between 40 serum PCB congeners and diabetes simultaneously. Another strength was the use of nationally representative data from NHANES, which allows us to generalize our findings to the U.S. population. This study also had some limitations. First, we examined the combined associations of PCBs with diabetes in a smaller subpopulation with a higher serum PCB 126 concentration. Although this method can provide interpretable results for the exposed populations, the referent groups were different populations of varying size. Thus, we cannot compare the magnitude of the observed associations across the subpopulations. Moreover, the data-driven approach presents difficulties in consistently reproducing the identified PCB congeners and their thresholds across various studies. Second, we cannot establish a temporal relation for the observed association between PCBs and diabetes because of the cross-sectional study design. Third, as the NHANES study does not differentiate type 1 from type 2 diabetes, we cannot definitively distinguish the effects on type 1 and type 2 diabetes separately. Since type 2 diabetes contributes 90% or more of total diabetes in adults in the U.S. 44, the observed association was likely to largely reflect type 2 diabetes. In addition, we performed a sensitivity analysis excluding those who possibly had type 1 diabetes and found results similar to those of our main analysis. Fourth, although we controlled for BMI as a potential confounder, it remains unclear whether BMI is in the causal pathway between PCB exposure and diabetes 45. Furthermore, we did not investigate other contaminants or the combined actions of PCBs and other persistent organic pollutants in contributing to diabetes, both of which could also have influenced our observations.

Conclusions
In conclusion, in one of the few studies to investigate the combined associations of PCBs with diabetes risk, we identified serum PCB congeners and their thresholds associated with diabetes using classification tree analysis. Our findings provide new insights into the combined associations of PCBs with diabetes. Additional prospective studies with more detailed diabetes type information are needed to replicate these findings.

Table 3. Multivariable-adjusted odds ratios (ORs) and 95% confidence intervals (CIs) of diabetes by the combined associations of PCB congeners, NHANES 2003-2004. 1 Basic model was adjusted for age, sex, and race/ethnicity. 2 Full model was adjusted for age, sex, race/ethnicity, BMI, education level, family income-to-poverty ratio, smoking status, alcohol intake, physical activity level, 2010 healthy eating index, and family history of diabetes. 3 The fully adjusted odds ratio was very large due to the small sample size. Some covariates had few observations in the sub-category group (e.g., only three people with diabetes were normal weight). Significant values are in bold.

Figure 1. Classification tree from step one analysis of serum PCB congeners (ng/g lipid adjusted weight) and diabetes status, NHANES 2003-2004. The root node and internal nodes present the identified PCB profiles including PCB congeners and their thresholds; the leaf nodes (endpoints) present the predicted diabetes status (No: non-diabetes and Yes: diabetes) with its probability. The red color indicates a higher probability of having diabetes.

Table 1. Population characteristics by quintiles of total serum PCB concentrations in NHANES 2003-2004. Data are presented as the weighted mean and standard error for continuous variables, and weighted percentages and standard error for categorical variables. Some percentages may not sum to 100% because of missing values. BMI body mass index, HEI-2010 2010 healthy eating index, MET metabolic equivalent of task. Significant values are in bold.

Scientific Reports | (2023) 13:18322 | https://doi.org/10.1038/s41598-023-45301-1

Table 2. PCB congeners and their thresholds associated with diabetes as determined by decision tree analysis.