Prediction models and risk assessment for silicosis using a retrospective cohort study among workers exposed to silica in China

This study aims to develop a prognostic risk prediction model for the development of silicosis among workers exposed to silica dust in China. The prediction model was performed by using retrospective cohort of 3,492 workers exposed to silica in an iron ore, with 33 years of follow-up. We developed a risk score system using a linear combination of the predictors weighted by the LASSO penalized Cox regression coefficients. The model’s predictive accuracy was evaluated using time-dependent ROC curves. Six predictors were selected into the final prediction model (age at entry of the cohort, mean concentration of respirable silica, net years of dust exposure, smoking, illiteracy, and no. of jobs). We classified workers into three risk groups according to the quartile (Q1, Q3) of risk score; 203 (23.28%) incident silicosis cases were derived from the high risk group (risk score ≥ 5.91), whilst only 4 (0.46%) cases were from the low risk group (risk score < 3.97). The score system was regarded as accurate given the range of AUCs (83–96%). This study developed a unique score system with a good internal validity, which provides scientific guidance to the clinicians to identify high-risk workers, thus has important cost efficient implications.

silicosis and the subsequent premature mortality were frequently observed among tin, tungsten, and gold miners 4,5 . Silicosis is an incurable disease with irreversible progressive nature and all treatment strategies are relevant to postponing the progression and control of subsequent medical conditions 6 , mainly from secondary tuberculosis, functional disability, autoimmune diseases, or lung cancer 7,8 . The limited value of current treatment toward silicosis leads urgent needs for developing risk prediction models of silicosis for guiding better disease prevention and control among workers exposed to silica dust in China.
Risk prediction of silicosis is an approach of developing statistical models to estimate the probability of developing disease over an adequate time period (latency) that helps clinicians identify workers at higher risk of silicosis, which allows for an earlier counseling of risk factor modifications to decrease risk. Risk prediction model research has been well established in absolute cancer risk prediction, such as breast cancer [9][10][11][12][13] ; however, it is a relatively new area in the occupational health studies 14 . Although a few of occupational epidemiological studies had explored prediction models to predict the probability of the occurrence of pneumoconiosis or work-related illnesses (e.g., shoulder pain, occupational allergic disease) [15][16][17][18][19] , most models were developed based on cross-sectional studies and only the Australia study explored the 'prognostic model' to project the occurrence of silicosis and lung cancer in the next 40 years but no discriminatory power was reported 15 .
This study aims to develop a risk model to predict the probability of silicosis occurrence among 3,492 workers exposed to silica dust in China who had been followed up over 33 years. Results from this study provide new insights for risk prediction models among workers exposed to silica dust and thereby offer significant contributions for guiding strategic measures for silicosis prevention in occupational health practice.

Results
By the end of 2008, we observed 298 incident cases of silicosis with a crude cumulative incident rate of 8.53% after an average of 32.90 ± 8.70 years of follow up with a follow-up rate of 95%, contributing an overall 125,814.69 person-years of observation. As described in Supplementary Table 1, a total of 1,347 workers (38.57%) died and the average age at death was 63.13 ± 12.59 years old. Over 96% of workers were married but more than half of them were illiterate. There were 77.38% workers who had ever smoked (current smoker, 43.44%; former smoker, 33.93%). The average age at first exposure to silica dust was 23.73 ± 6.36 years, while the age at entry into cohort was 27.62 ± 7.81 years, which indicated that some workers had dust exposure prior to their entry into the cohort. The mean concentration of respirable silica dust was 0.08 ± 0.04 mg/m 3 and the net years of dust exposure was 24.03 ± 9.09 years. Overall, more than half of workers had 2 or more jobs experienced (38.43% had 2 jobs and 23.05% had 3 jobs or more) in the iron ore. Table 1 presents the results of univariate Cox regression analyses of the potential predictors for the risk of silicosis, including exposure related variables (age at first exposure to silica dust, age of entering the cohort, average exposure to respirable silica dust, cumulative dust exposure time), jobs experienced in the iron ore, socio-demographics (education level, illiteracy; marriage status), smoking habits, and history of lung diseases (pulmonary tuberculosis, chronic bronchitis, and asthma). Six predictors (Table 1) identified to have significant contributions to the risk prediction were age at entry of the cohort (Wald χ 2 = 155.06, p < 0.001), mean concentration of respirable silica dust exposure (Wald χ 2 = 282.95, p < 0.001), net years of dust exposure (Wald χ 2 = 110.52, p < 0.001), no. of jobs experienced in the iron ore (Wald χ 2 = 43.77, p < 0.001), ever smoking (Wald χ 2 = 5.08, p = 0.024), and illiteracy (Wald χ 2 = 60.75, p < 0.001). Results from Schoenfeld residual plots showed that the hazard ratios of all the predictors (Table 1) were likely to be proportional, while the smoothing-spline curves fluctuated around zero; this phenomenon suggested that the assumption of proportionality was satisfied for all the predictors included in the Cox model (data not shown).
We included six significant predictors into the multivariate full Cox model, stepwise selection procedure (0.05 for entry and 0.051 for removing the variables), and the LASSO method for further selection. LASSO model is a shrinkage method which was applied to deal with model's over-fitting and resulted in a greater reduction in the magnitude of coefficients for the weak predictors than those for the strong predictors. Results of comparing coefficients of models of all predictors (Model1) and two selection methods (stepwise selection procedure, model2; the LASSO method, model3) are summarized in Table 2. Six significant predictors (obtained from the Table 1) were all retained in the Cox model using stepwise selection and LASSO selection approach. Although the coefficient of each selected predictor varied less by using three different selection models, we prefer the LASSO coefficients because the LASSO method reduced coefficients more for the predictors with a weaker association than the stronger ones, and also presented a smallest AIC value (AIC = 4937.97, Akaike information criterion) 20,21 .
A bootstrap re-sampling method was applied to generate a large bootstrap dataset 22 , which allows all predictors included in the model becoming statistically significant when 200 bootstraps datasets for LASSO selection was applied. The coefficients of age of entering the cohort, mean concentration of respirable silica dust exposure, net years of dust exposure, no. of jobs experienced in the iron ore, and illiteracy were virtually not affected by the LASSO method, and the results were consistent with their 100% selection in the bootstrap samples. Smoking was selected in 94% of the bootstrap samples. Further analyses revealed that the coefficients for age at first exposure to silica dust, unmarried, pulmonary Scientific RepoRts | 5:11059 | DOi: 10.1038/srep11059 tuberculosis, chronic bronchitis and asthma were shrunk towards zero, and this provide evidence that the unselected variables (as shown Table 1) had less contributions to the risk prediction of silicosis occurrence. These simulation results showed a clear consistence between the estimated effect of a predictor according to the original LASSO selection and the bootstrapped results. Figure 1 demonstrates the results of risk score analysis from the final model by LASSO selection (Model3) incorporating with a linear combination of the six risk predictors weighted by their Cox We further split the risk score for silicosis into three subgroups according to its quartiles (Q1, Q3): low risk group (risk score < 3.97), moderate risk group (3.97 ≤ risk score < 5.91), and high risk group (risk score ≥ 5.91). In the iron ore cohort and based on the classification of this score system, 875 workers were classified into the low risk group with 4 (0.46%) silicosis cases, 1,745 workers were classified into the moderate risk group with 91 (5.21%) silicosis cases, and 872 workers were classified into the high risk group with 203 (23.28%) silicosis cases.

Discussion
This retrospective cohort study with a long follow-up duration (33 years) demonstrated that the risk to silicosis for workers in the iron ore mining could be well predicted based on socio-demographic data and exposure related risk factors according to a newly developed risk score scheme. Our results are thus  the novel findings that provide scientific evidence to guide clinicians to differentiate workers at high risk from those at low risk so that targeted prevention can be applied cost efficiently. The significant predictors ruled in this study are age at the entry of cohort (which stands for age of first exposure to silica dust), smoking, illiteracy, no. of jobs experienced in the iron ore, mean concentration of respirable silica dust exposure, and net years of dust exposure. These findings are consistent with many occupational epidemiological studies that a dose-response relationship was indicated between the occurrence of silicosis and the mean concentration of respirable silica dust exposure or net years of dust exposure [23][24][25][26][27] . There is evidence on a synergistic effect between crystalline silica dust and smoking to enhance the risk of silicosis and lung cancer 25,27 , and this provides supportive evidence for inclusion of smoking as an independent predictor in our newly developed score scheme to predict the risk of silicosis. On the other hand, age at the first exposure of silica dust might have contributed to the risk of silicosis occurrence, but it was not a significant predictor which was thus not included in our score system. Among 2,396 workers who had past dust exposure prior to the entry of the cohort, the mean dust concentration was quite similar among workers with different age at first dust exposure ( Supplementary  Table 2A & 2B), which explained the statistical non significance for the relationship between age at first exposure and risk of silicosis, a further possible explanation for this situation should be that those workers were enrolled into this cohort while they were assigned to relatively high-exposure tasks.
Previous studies reported that chronic obstructive pulmonary disease (e.g., chronic bronchitis, emphysema) was associated with a prolonged dust exposure [23][24][25][26][28][29][30] . A study in Mongolia reported that dust-induced chronic bronchitis and silicosis accounted for 67.8% of occupational diseases 24 . Prolonged exposure to silica dust may lead to airflow obstruction causing chronic bronchitis and emphysema. We included histories of chronic bronchitis diagnosed by doctors as an independent predictor in the Cox model for the risk prediction of silicosis, but it was not statistically significant. It is likely that insufficient cases of chronic bronchitis observed throughout the follow-up period yielded an inadequate power for detection of a significant result based on the context of our current study. On the other hand, biomass burning is an important indoor air pollutant in the remote areas of China, in particular in the early years of follow-up. Exposure to fumes or particular matters emitted from the biomass burning may be associated with the occurrence of respiratory disorders (e.g., chronic bronchitis) or even enhances the risk of silicosis occurrence. We did not collect information on the use of biomass burning, and it thus posed another limitation to the current study.
We developed an unique score system for risk prediction of silicosis using shrinkage estimates with LASSO method in fitting Cox model that enables to adjust for model's over fitting and avoid extreme predictions 21 ; this approach had been applied in a number of case studies [31][32][33] but it is less applied in occupational epidemiological studies. We recognized that results from bootstrap re-sampling procedure presented a good internal validity for the LASSO model applied in this study. On the other hand, risk score values developed from our study are evident to have a good predictive accuracy for prediction the occurrence of silicosis in this study (the AUCs ≥ 80%). Moreover, coefficients of the LASSO model transferred to risk score values demonstrated more convenient application for the clinicians in their routine practices and consulting patients at different levels of risk to silicosis, which offers another major advantage of the current study. Nevertheless, there are concerns on the model's generalizability, as we used cohort data collected from an iron ore to fit the prediction model, which may not represent the whole profile of silica dust exposure workforce in different industrial circumstances. Further validation and calibration of our score system in other industrial settings with large population is warranted.
In conclusion, this study developed a unique risk score system to classify silica dust exposed workers into different risk to silicosis using a LASSO penalized Cox model, with a good internal validity and high model discrimination. This newly developed score system provides clinicians scientific guidance and convenient approaches to identify high-risk workers in routine consultation, thus has important cost efficient implications. Nevertheless, further larger well-designed cohort studies are warranted to validate our results, while cautions should be born in mind when applying the results from our study to other industrial settings.

Materials and Methods
The cohort and follow-up. The detailed information for the study design and subjects recruitment has been described in the previous studies 4 . In brief, a total of 3,658 workers who had ever been exposed to silica dust were enrolled in this retrospective cohort from an iron ore mine in China from January 1, 1964 through December 31, 1974 and followed up from 01/01/1964 to 31/12/2008. We included all workers in this cohort with a history of exposure to crystalline silica dust who had worked at least 1 year in the iron ore mining, and excluded workers if she was a female worker (139 individuals), or the male workers who had sufficient evidence of silicosis at the entry of cohort (15 individuals) or he had already died at the entry of the cohort (12 individuals). Eligible subjects finally included in the study were 3,492 workers who had completed information on socio-demographics, lifetime occupational silica dust exposure, and a history of previous diseases diagnosed by the  Diagnosis of silicosis and important predictors. The diagnosis of silicosis was originally made by a diagnostic expert panel comprising of three readers according to the Chinese National Diagnostic Criteria of Pneumoconiosis 1986 (i.e., the routine diagnoses) 34 , hereas all the chest radiographs were re-evaluated by a panel of experts using the Diagnostic Criteria in 2009 35 and showed a good agreement between the two diagnostic criteria. Following the criteria, silicosis was classified into three stages: stage one, stage two, and stage three, which was very similar with the ILO (International Labour Organization) classification 36 . Silicosis stage one was approximate to the category 1, silicosis stage two was similar to the category 2, and silicosis stage three was equal to the category 3 in concurrence with the appearance of large opacities in ILO classification. Historical industrial hygiene data were collected from the periodical monitoring data for total dust for the period 1964-2008. The total dust concentrations and exposure duration (number of hours) per shift were summarized for each job title in a 3-year interval during 1950-1986 and then measured on an annual basis afterwards. There were 33 job titles confirmed to have exposure to silica dust, number of jobs experienced for workers were included as a measurement for different dust exposure. The concentration of total dust was monitored in the iron ore three times per month by the local industrial hygienists with an overall 55,333 monitored sampling data in total for the period 1964-2008. The average concentration of total dust for each specific job title in a defined calendar year was calculated by taking arithmetic mean of all the samplings of the specific job measured in the corresponding year. It has been noted that workers used half-face respirator as personal protective equipment in the recent decade but only the wet process was applied in the early years of follow-up.
The local industrial hygienist performed the direct measurement for total dust by placing the samplers in a location near workers' routine practice in a typical working day, and then collected dust samples for a 15-20 minutes interval when the work was in progress according to the Chinese National Sampling Regulations 37 . All samples collected were then weighed to estimate the total dust level in mg/m3; meanwhile the occupational hygienists transformed the total dust level to the concentration of crystalline silica dust for each specific job title according to a validated approach 38 . Statistical methods and prediction model specification. We used the Visual FoxPro (VFP) and Microsoft excel software to establish datasets. We then doubly checked the data and cleaned accordingly using SAS software (9.1.3; SAS Institute, Cary, NC). We transformed relevant continuous variables to the categorical scale for clinical use 16,17,39 . As shown in Supplementary Figure 1, we performed three steps to assess the risk prediction model for the occurrence of silicosis following the principles of Clinical Prediction Model proposed by Ewout W. Steyerberg 21 Step I -Predictors selection. We performed univariate Cox regression analyses to screening out potential predictors by estimating the Hazard Ratios (HR) and their 95% Confidence Intervals (CIs), predictors with p < 0.05 will be retained and evaluated further. Three models were performed to check the potential multicollinearity among predictors: full model predictor selection (Model1, all predictors entered into the Cox model at the same time), stepwise Cox model selection (Model2) and Least Absolute Shrinkage and Selection Operator (LASSO) selection (Model3). The LASSO method was used to shrink the coefficients toward zero by setting a constraint on the sum of the absolute standardized coefficients 20,40 . Schoenfeld residual plots were used to assess the assumption of proportional hazard ratio according to Harrell's rho results. If the assumption of proportional hazard ratio was not fulfilled, the extended Cox model was carried out for the development of prediction model 41-43 . Step II -Risk model construction. We selected the optimal model from the Step I for risk model construction according to the smallest Akaike Information Criterion (AIC). We then assessed the effect of potential predictors using a risk score analysis on the basis of a linear combination of the selected predictors weighted by the Cox regression coefficient, using a formula as following: S x i n i 1 = ∑ = , where S is risk score, x is selected predictors, n is the number of selected predictors.
Step III -Risk model evaluation. Bootstrap re-sampling was used to evaluate the model's internal validity for the optimal model finally selected 22,44,45 . The discriminative ability of the model was determined according to the time-dependent Receiver-Operator Characteristic (ROC) curves and the corresponding Area Under the Curve (AUC) were calculated to assess the predictive accuracy of the model; the larger the area is under the AUC, the better the model is for the risk prediction 46 Sensitivity analyses were conducted to evaluate the performances of different risk scores by plotting (t, AUC (t)) for different values (cutoff points) of follow up time t to test robustness of the results. All statistical analyses were two-sided and performed with R software (Version 3.1.0, 2014-04-10; R Foundation for Statistical Computing).