Introduction

In the current global environment there continues to be the potential of military conflicts, terrorist activities and/or accidents involving large-scale exposures of people to ionizing radiation. The general population is not equipped with physical radiation dosimetry devices, and consequently there is an important need to develop and perfect biodosimetry1,2,3 approaches, which can reliably estimate the radiation dose absorbed by each exposed person, based on samples of biological materials (e.g. blood) from that person. Obtaining such dose reconstructions for the potentially large number of exposed persons in a reasonably short time is important for providing accurate individual-specific information to potentially exposed persons (including the “worried well”), for performing appropriate triage and prescribing treatment regimens if needed, and potentially for predicting long-term health risks.

Among the large and growing variety of currently available radiation biodosimetry techniques, cytogenetic damage endpoints measured by the dicentric chromosome (DCA) and cytokinesis-block micronucleus (CBMN) assays remain the most accurate and reliable options4. The DCA assay is the current “gold standard” in biodosimetry5. Cytogenetic biodosimetry techniques continue to be improved and developed6,7,8,9,10,11, and implementation of these assays using high-throughput automated approaches is becoming more widespread12,13,14,15,16,17. The automatic scoring technologies are advantageous because they can be employed in mass-casualty situations.

There is also growing recognition of the potential utility of the rapidly evolving field of machine learning (ML) for radiation biodosimetry, particularly in combination with high-throughput automated scoring techniques18,19,20. Such combined approaches have the capacity to rapidly generate individualized radiation dose reconstructions based on data from blood samples obtained from vast numbers of people affected by a large-scale radiological event such as an improvised nuclear device (IND) detonation2,11,21,22,23,24,25,26. This is possible because ML can integrate multiple types of data inputs, such as different biodosimetry assay results as well as other variables including demographic data on the exposed individuals, and use these data to produce predictions (i.e. reconstructions) of the absorbed radiation dose. As an example of this approach, we have recently demonstrated that ML tools are promising for high-throughput biodosimetry in complex exposure scenarios where neutrons are present together with photons27.

In the current study, we investigated the capabilities of ML to improve absorbed dose reconstructions by combining data from automated DCA and CBMN assays. We hypothesized that using DCA and CBMN output together, in the context of high-throughput automatic scoring systems, would provide superior dose reconstruction accuracy compared with using either assay alone. The rationale for this hypothesis is that dicentric chromosomes and micronuclei have different radiation dose response shapes and, after exposure to different dose rates of ionizing radiation, different dependences on demographic variables28,29, which allows the two assays to complement each other. In other words, the information provided by the DCA and CBMN assays is not completely redundant, and the contributions of each assay can be exploited by ML to improve the accuracy of dose predictions. For example, CBMN yields can turn over and start to decrease at lower doses than DCA yields, which implies that if DCA is high (did not turn over) but CBMN is low (already turned over) in a given sample, then the sample could have received a high radiation dose.

This hypothesis is novel for the following reason: Although ML approaches were previously used for biodosimetry-related image analysis tasks, we consider that our group at the Columbia University Irving Medical Center (CUIMC) is perhaps the first to implement ML directly for dose reconstruction, using data from automated scoring systems. To our knowledge, dose reconstruction using both DCA and CBMN data as predictor variables within a single ML model (instead of using each assay alone) has not been implemented previously.

Such a combined biodosimetry strategy could be particularly useful in realistic scenarios for mass-casualty events, where: (1) The population of potentially irradiated individuals is very heterogeneous in terms of age, sex and other factors. (2) Radiation can be delivered at very different dose rates, e.g. extremely high dose rate “prompt” exposures within the first fraction of a second after a nuclear device detonation30, followed by protracted exposures from radioactive fallout days to weeks later31. To mimic such scenarios in the laboratory, we collected and analyzed blood samples from 155 donors of different ages (3–69 years) and sexes (49.1% males), irradiated ex vivo with 0–8 Gy of photon or electron beams at dose rates varying from 0.08 Gy/day to > 600 Gy/s. Both DCA and CBMN assays were performed on aliquots taken from each blood sample, and analyzed in the same laboratory. The potential advantages of this approach include the following aspects: (1) Both assays use the same biological material, peripheral blood. (2) Both assays are cytogenetic, and can thus be performed on the same equipment and using the same reagents, such as culture medium, hypotonic solution, fixative and DNA dye (DAPI), during some of the assay steps. (3) The approach described here can be introduced in different cytogenetic biodosimetry laboratories that use both of these assays.

The current study represents a proof of principle for the innovative concept of combining data from several automated radiation biodosimetry assays to improve radiation dose reconstructions. Here, we used the DCA and CBMN assays, but the concept of combining multiple predictor variables in ML-based biodosimetry is potentially extendable to other types of radiation-induced damage biomarkers. Combining data from multiple assays has the potential to increase the reliability of dose reconstructions, and possibly to overcome confounders, which may not affect all assays equally.

Materials and methods

Experimental procedures

Our group’s methodology for ex vivo irradiation of human donor blood samples and implementation of automated DCA and CBMN assays on these samples is described in detail elsewhere8,9,10,32. It is summarized briefly below.

This study was approved by Columbia University’s Institutional Review Board (IRB) protocol IRB-AAAF2671. Blood from pediatric donors (Sterling IRB protocol #8933) was collected by Jean Brown Clinical Research (Salt Lake City, UT) into sodium heparin vacutainer tubes, and shipped overnight in a temperature-controlled shipper (22 °C Credo Cube, Fisher Scientific, Pittsburgh, PA). Blood from adults was collected at Columbia University Irving Medical Center. Blood was then aliquoted into 2D-barcoded tubes (Matrix Storage Tubes, Thermo Fisher). All recruited blood donors filled in a questionnaire, which included questions about exposure to X rays, CT scans, chemotherapy, or radiotherapy within the last 6 months. Potential donors with such exposures were excluded from the study. All methods were performed in accordance with the relevant guidelines and regulations. Informed consent was obtained from all subjects and/or their legal guardian(s).

High dose rate irradiations were performed using a custom-modified Varian Clinac irradiator30. Blood sample aliquots from different donors were irradiated using 6 or 9 MeV electron pulses. Pulse durations were between 0.1 and 4 µs. Detailed dosimetry was performed using EBT-3 film and/or a NIST-traceable Advanced Markus Ion Chamber30.

Low dose rate irradiations that simulate radioactive fallout after a nuclear explosion were performed using a combination of our modified XRAD 320 machine33 and the VADER31, which delivers a photon dose rate of 0.1–1.0 Gy/day. Blood sample aliquots were housed in a customized plastic incubator33, which was placed into the irradiation chamber. Dosimetry was performed daily using a calibrated 10X6-6 ion chamber (Radcal, Monrovia, CA).

Before irradiation, we diluted whole blood in RPMI at a 1:4 ratio. We then transferred 150 µl (30 µl blood + 120 µl RPMI) of diluted blood into plates, centrifuged it and replaced 120 µl of old RPMI with 270 µl of fresh PB-max. After each type of irradiation, the diluted blood aliquots were transferred into 96 well plates, and processed for DCA and CBMN assays using the RABiT-II automatic scoring system8,9,10. Imaging-based identification of dicentric chromosomes or micronuclei was implemented using a BioTek Cytation Cell Imaging Multi-Mode Reader (with 20× objective), and the images were analyzed using custom software, FluorQuant v.6.1 (micronucleus assay) and FluorQuantDic v.4 (dicentric assay), written in Visual C++ using the OpenCV computer vision libraries (Version 3.1, http://www.opencv.org).

Data set

The main variables in the resulting data set were:
The yield of dicentric chromosomes per monocentric chromosome (Yield).
The yield of micronuclei per binucleated cell (Mi_BN).
Sex, coded as a binary variable with 0 = males and 1 = females.
Race, coded as a categorical variable with 0 = unreported; 1 = African American; 2 = Asian; 3 = White; 4 = Mixed.
Ethnicity, coded as a categorical variable with 0 = unreported; 1 = Hispanic/Latino; 2 = non-Hispanic/Latino.
The radiation dose (Dose, in Gy), which was eventually treated as the target variable to be predicted by the ML model using the variables listed above as predictors.

The data set consisted of 1349 blood sample aliquots from 155 donors32.

The radiation dose rate was converted into ordinal categories (Dose_rate_category) as follows: 0 = very low dose rate (approximately 0.08 Gy/day, with dose delivered over 48 h) using our custom-built VADER irradiator31; 1 = 1 Gy/min; 2 = 1 Gy/s; 3 = approximately 600 Gy/s (3 Gy in 2 electron pulses or 8 Gy in 3 pulses with 5.6 ms between pulses); 4 = single 5 µs electron pulse. By default, we did not include Dose_rate_category in the set of predictor variables used to train ML algorithms for dose reconstruction because: (1) dose rate information would not be available in a realistic mass casualty biodosimetry scenario; (2) we wanted to investigate whether the proposed approach of combining DCA with CBMN data by ML could reconstruct radiation doses adequately even when dose rate is unknown and can vary over a very wide range. However, to assess the potential effect of dose rate on ML model predictions, we also fitted a model version where Dose_rate_category was included in the predictor set. In this case, unexposed samples (with Dose = 0 Gy) were randomly assigned to any of the dose rate categories from 0 to 4.

Data pre-processing

The data set, composed of DCA and CBMN data for each aliquot of blood from each donor for each irradiation condition and replicate, was imported for analysis using the R programming language (version 4.2.0)34. Blood sample aliquots with < 20 binucleated cells (BN) or < 20 monocentric chromosomes (MC) were removed because these samples were likely to produce unreliable DCA or CBMN data due to low numbers of counted events. The initial number of samples was 1349, and 1122 samples were retained, so approximately 17% of the samples were excluded from analysis. The retained samples are provided in Supplementary_File_1 online. Summary statistics for these samples, split into training and testing halves, are shown in Table 1. Among the 1122 samples retained for analysis, 145 received 0 Gy, 541 received 3 Gy, 6 received 4 Gy, and 430 received 8 Gy.
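The exclusion rule described above can be illustrated with a minimal Python sketch. It is not the authors' R code: the dictionary keys `BN` and `MC` and the three example aliquots are assumptions introduced only for illustration, while the threshold of 20 and the sample counts come from the text.

```python
# Minimal sketch of the quality filter; field names are hypothetical.
MIN_COUNT = 20  # threshold stated in the text for both BN and MC

def keep_sample(sample, min_count=MIN_COUNT):
    """Return True if both assays scored at least `min_count` objects."""
    return sample["BN"] >= min_count and sample["MC"] >= min_count

# Made-up aliquots, for illustration only.
samples = [
    {"id": "a1", "BN": 250, "MC": 900},
    {"id": "a2", "BN": 12, "MC": 850},   # too few binucleated cells
    {"id": "a3", "BN": 300, "MC": 19},   # too few monocentric chromosomes
]
retained = [s for s in samples if keep_sample(s)]

# Exclusion fraction implied by the counts reported in the text.
excluded_fraction = (1349 - 1122) / 1349
```

With the reported counts, `excluded_fraction` evaluates to about 0.168, which matches the stated value of approximately 17%.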

Table 1 Summary statistics for the training and testing halves of the analyzed data set.

Since the raw micronucleus yield per binucleated cell (Mi_BN) decreased at high radiation doses (at 8 Gy compared with 3 Gy), we created a corrected “linearized” micronucleus index (Mi_BN_c). It was calculated as follows, where Mi is the number of micronuclei in the sample, BN is the number of binucleated cells, MN is the number of mononucleated cells, and k is an adjustable model parameter:

$$Mi\_BN\_c=\frac{Mi}{BN}+ \left (\frac{1}{k}\right)\frac{MN}{BN}$$
(1)

We used quantile regression (quantreg R package implementation) to model the dose response of the median (50th percentile) of Mi_BN_c using a linear quadratic (LQ) function. The parameters of this function are α and β, and a baseline value (intercept) c, described by the following equation, where D is dose (in Gy):

$$Median\left(Mi\_BN\_c\right)=c+\alpha D+\beta {D}^{2}$$
(2)

During the fitting procedure for Eqs. (1, 2), we varied parameter k in Eq. (1) in increments, so that the β term in Eq. (2) approached zero and became statistically consistent with zero. Therefore, parameter k was incrementally adjusted (in steps of 10 units) such that the median of the resulting Mi_BN_c index would approach a linear dose response over the studied dose range of 0–8 Gy. This goal was achieved by k = 70. The resulting Mi_BN_c index, which represents an additional engineered feature for dose reconstruction, was added to the data set provided in Supplementary_File_1 online.
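As a small illustration of Eq. (1), the sketch below computes the corrected index for a single sample in Python. The function name and the example counts are hypothetical, and the quantile regression scan over k is not reproduced here; only the default k = 70 is taken from the text.

```python
def mi_bn_c(mi, bn, mn, k=70.0):
    """Corrected micronucleus index, Eq. (1): Mi/BN + (1/k) * MN/BN."""
    if bn <= 0:
        raise ValueError("BN must be positive")
    return mi / bn + (mn / bn) / k

# Hypothetical counts, chosen only to show the effect of the correction term.
# Few mononucleated cells per binucleated cell: the correction is small.
low_dose_like = mi_bn_c(mi=30, bn=300, mn=600)     # 0.1 + 2/70
# Many mononucleated cells per binucleated cell: the correction dominates.
high_dose_like = mi_bn_c(mi=20, bn=100, mn=3500)   # 0.2 + 35/70
```

The second term grows with the ratio MN/BN, so under these assumed counts a sample with many mononucleated cells receives a larger corrected index even if its raw Mi/BN value is modest.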

Machine learning analyses

Our main goal in this study was to implement ML-based regression approaches to estimate (reconstruct) the radiation dose received by each blood sample. A schematic representation of the experimental design for this study is provided in Fig. 1.

Figure 1

Schematic representation of the experimental design for this study. Details are explained in the main text.

We used the Boruta feature selection algorithm (implemented by the Boruta R package)35 to identify and discard any weak predictor variables, which would not be useful for reconstructing dose in this data set. Boruta iteratively compares the importance score of each predictor with the importance score of its randomly shuffled “shadow”, in the context of a random forest model36. It duplicates the data set and randomly shuffles the values in each column. These shuffled values are called shadow features, and they are re-created in each iteration. Those predictors that had significantly (p-value < 0.05 with Bonferroni correction) worse importance scores than the shadow features during Boruta implementation on a randomly selected training half of the data were discarded from further analysis.
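The shadow feature construction can be sketched in a few lines of Python. This shows only the shuffling step, not the full Boruta procedure (no random forest is fitted and no importance test is performed), and the function name, column names and toy data are assumptions made for illustration.

```python
import random

def add_shadow_features(rows, feature_names, rng):
    """Return copies of `rows` with one shuffled 'shadow_<name>' column per feature."""
    out = [dict(r) for r in rows]
    for name in feature_names:
        column = [r[name] for r in rows]
        rng.shuffle(column)  # breaks any link between this column and the target
        for r, value in zip(out, column):
            r["shadow_" + name] = value
    return out

rng = random.Random(0)
# Toy data with hypothetical values.
rows = [{"Yield": 0.01 * i, "Age": 20 + i} for i in range(10)]
augmented = add_shadow_features(rows, ["Yield", "Age"], rng)
```

Each shadow column keeps the same distribution of values as its source column but carries no information about dose, which is why it can serve as a noise reference for importance scores.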

We considered several state of the art tree ensemble ML methods as useful approaches for the task of dose reconstruction on the data set containing the retained predictors. Tree ensembles such as random forest (RF)36, XGBoost37, LightGBM38, and CatBoost39 represent a popular group of ML algorithms, which tend to perform well on data sets composed of tabular data, such as the one analyzed here. Such methods fit many models (decision trees) and combine them into an ensemble, which tends to produce more reliable predictions than a single model. Their strengths include the ability to model non-linear relationships and interactions between variables, and low sensitivity to correlations between predictor variables and to outlier observations.

The RF algorithm, pioneered by the famous American statistician Leo Breiman36, generates many uncorrelated decision trees by bootstrap aggregation, or “bagging” (randomly selecting samples from training data with replacement) and feature randomness (selecting a random subset of predictor variables for each tree). Predictions from all trees are then averaged for regression problems such as the one here.
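The bagging and averaging steps can be illustrated with a deliberately simplified Python sketch. Each "tree" here is reduced to a single leaf that predicts the mean of its bootstrap sample, so feature randomness and tree splitting are not shown. The function names and the toy dose values are assumptions.

```python
import random
from statistics import mean

def bootstrap_sample(values, rng):
    """Draw len(values) items with replacement (the bagging step)."""
    return [values[rng.randrange(len(values))] for _ in values]

def bagged_mean_predictor(targets, n_trees, rng):
    """Toy ensemble: each 'tree' is one leaf predicting its bootstrap mean."""
    leaf_values = [mean(bootstrap_sample(targets, rng)) for _ in range(n_trees)]
    return mean(leaf_values)  # regression forests average the tree outputs

rng = random.Random(42)
doses = [0.0, 0.0, 3.0, 3.0, 3.0, 8.0, 8.0]  # toy training targets in Gy
prediction = bagged_mean_predictor(doses, n_trees=500, rng=rng)
```

Because every leaf value is an average of training targets, and the ensemble output is an average of leaf values, the prediction always stays within the range of the training doses.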

By comparison, the boosting strategy uses an iterative approach where trees are added to the ensemble sequentially, so that each next tree attempts to improve the fit to those data instances that were poorly fitted by previous trees. State of the art boosting algorithms include XGBoost, LightGBM, and CatBoost. XGBoost was created at the University of Washington in the USA and became widely popularized due to its strong performance, for example in various ML competitions37. LightGBM, developed by the Microsoft corporation, differs by using the Gradient-based One-Side Sampling (GOSS) technique, which updates a given tree using a selection of the largest gradients and randomly sampled small gradients38. CatBoost was developed by the Yandex company in Russia, and is optimized for handling categorical variables automatically, with no need for manual pre-processing (such as one hot encoding) by the user39.

We implemented these ML algorithms in the Python 3.10.5 programming language, using the Jupyter notebooks interface (https://jupyter.org/). To establish a performance “baseline” for comparison with the algorithms listed above, we used several other algorithms: linear regression, elastic net regression, support vector machines regression (SVR), and the linear-tree package (https://github.com/cerlymarco/linear-tree), which builds trees with linear models at the leaves. Linear regression was used because it can be regarded as the “simplest” type of modeling tool for the data analyzed here. Elastic net is a regularized regression method which implements both L1 and L2 penalties, often resulting in improved performance compared with linear regression, and/or with regularized algorithms which use one of these penalties but not both40. SVR adapts the powerful support vector machines ML algorithm, which uses a “geometric” approach to separate data classes, from classification to regression problems41. The first three methods were implemented using sklearn in Python: the LinearRegression, ElasticNetCV and SVR classes, respectively. We expected all of these algorithms to perform somewhat worse than the more flexible RF and boosting methods listed above.

To mitigate the problem of overfitting, which can affect all model types, we trained each ML model using repeated k-fold cross validation (fivefold, repeated 30 times) on a randomly selected ¼ of the data, and evaluated each model on another ¼ of the data. The remaining ½ of the data was set aside for ultimate testing (validation) of the preferred model, which was identified using the initial comparison of models.
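The data partitioning and repeated cross validation scheme can be sketched as follows. This is a standard-library Python illustration under stated assumptions: the function names are hypothetical, the rounding of the two quarters (280 and 281 samples) is our choice, and the exact splitting code used in the study is not specified in the text.

```python
import random

def split_half_quarter_quarter(n_samples, rng):
    """Shuffle indices; return (train quarter, evaluation quarter, test half)."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    half = n_samples // 2
    test, rest = indices[:half], indices[half:]
    quarter = len(rest) // 2
    return rest[:quarter], rest[quarter:], test

def repeated_kfold(indices, n_splits, n_repeats, rng):
    """Yield (fitting, validation) index lists for repeated k-fold CV."""
    for _ in range(n_repeats):
        shuffled = list(indices)
        rng.shuffle(shuffled)  # new random fold assignment in each repeat
        folds = [shuffled[k::n_splits] for k in range(n_splits)]
        for k in range(n_splits):
            fitting = [i for j, fold in enumerate(folds) if j != k for i in fold]
            yield fitting, folds[k]

rng = random.Random(0)
train, evaluate, test = split_half_quarter_quarter(1122, rng)
splits = list(repeated_kfold(train, n_splits=5, n_repeats=30, rng=rng))
```

Fivefold cross validation repeated 30 times yields 150 fitting and validation pairs, all drawn from the training quarter only, so the evaluation quarter and the testing half stay untouched during training.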

Three performance metrics were used to evaluate each ML model during the initial model comparison: mean absolute error (MAE), root mean squared error (RMSE) and coefficient of determination (R²). The first two metrics are described in Eqs. (3, 4) below, where D represents the actual dose and \(\widehat{D}\) represents the predicted (reconstructed) dose, calculated for i = 1..N data points.

$$\mathrm{MAE }= \frac{1}{N }\sum \limits_{i=1}^{N}|{D}_{i }-{\widehat{D}}_{i}|$$
(3)
$$\mathrm{RMSE }=\sqrt{\frac{1}{N }\sum \limits_{i=1}^{N}{({D}_{i }- {\widehat{D}}_{i})}^{2}}$$
(4)

The last metric (R²) is the square of the Pearson correlation coefficient between actual and predicted doses.
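For concreteness, the three metrics can be written out in plain Python as below. This is a minimal sketch following Eqs. (3, 4) and the definition of the last metric given above, not the code used in the study; the function names and the three example dose pairs are made up.

```python
from math import sqrt

def mae(actual, predicted):
    """Mean absolute error, Eq. (3)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error, Eq. (4)."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def r_squared(actual, predicted):
    """Square of the Pearson correlation between actual and predicted values.

    Assumes both inputs have non-zero variance.
    """
    n = len(actual)
    mean_a, mean_p = sum(actual) / n, sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_a * var_p)

# Made-up doses in Gy, for illustration only.
actual = [0.0, 3.0, 8.0]
predicted = [0.5, 3.0, 7.0]
example_mae = mae(actual, predicted)    # (0.5 + 0 + 1) / 3 = 0.5
example_rmse = rmse(actual, predicted)  # sqrt(1.25 / 3)
example_r2 = r_squared(actual, predicted)
```

RMSE weights large errors more heavily than MAE, which is why it is the larger of the two values in this example.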

We compared all three metrics across the evaluated ML models to select the best-performing preferred model, and the second-place model. Both of those models were refined by hyperparameter tuning in Python and R using grid search strategies, and the best tuned versions were evaluated using the same three performance metrics on the originally withheld ½ of the data—the testing set.

We used Shapley Additive exPlanations (SHAP)42 to identify which features (predictor variables) had the greatest impact on the dose reconstructions generated by the top two models. The SHAP approach originated in the fields of economics and game theory, but it is also quite useful for interpreting ML models. An important strength of the SHAP methodology is that it estimates the contribution of each feature to the model's predictions, taking into account the multitude of possible orders in which the feature of interest could be added to the model. The SHAP calculation is summarized below, where F represents the set of features in the model (so that |F| is the number of features), S represents a subset of these features that does not contain the feature of interest, v is the function that generates the value of the model’s prediction based on the features (the reconstructed dose in this case), i is the index of the feature of interest, and SHAPi is the SHAP value of feature i:

$$\begin{aligned}{SHAP}_{i} & =\sum_{S\subseteq F\setminus \{i\}}\left[\frac{\left|S\right|!\left(\left|F\right|-\left|S\right|-1\right)!}{\left|F\right|!}\left(v\left(S\cup \{i\}\right)-v\left(S\right)\right)\right] \\ & =\frac{1}{\left|F\right|}\sum_{S\subseteq F\setminus \{i\}}\left[{\left(\begin{array}{c}\left|F\right|-1\\ \left|S\right|\end{array}\right)}^{-1}\left(v\left(S\cup \{i\}\right)-v\left(S\right)\right)\right]\end{aligned}$$
(5)

The terms in this equation have the following interpretations. \(\frac{1}{\left|F\right|}\) is a scaling factor. \(S\subseteq F\setminus \{i\}\) indicates that the feature of interest (i) is excluded from the set for the current calculation. \({\left(\begin{array}{c}\left|F\right|-1\\ \left|S\right|\end{array}\right)}^{-1}\) is the inverse of the number of groups of size |S| that can be formed from the |F| − 1 remaining features. \(v\left(S\cup \{i\}\right)-v\left(S\right)\) represents the marginal value of adding feature i to the set, i.e. a comparison of the model’s prediction values when feature i is included vs excluded from the set. We used the shap.Explainer in Python and the fastshap and SHAPforxgboost packages in R to calculate SHAP values for various features in the selected ML models.
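Equation (5) can be evaluated exactly when the number of features is small. The Python sketch below does this by brute-force enumeration of subsets. It is not how the shap, fastshap or SHAPforxgboost packages compute their values, and the value function `toy_value` is a hypothetical stand-in for a model's prediction, invented purely for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, v):
    """Exact Shapley values by enumerating all subsets S of F without feature i."""
    n = len(features)
    result = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(n):  # |S| runs from 0 to |F| - 1
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (v(s | {i}) - v(s))  # marginal contribution
        result[i] = total
    return result

def toy_value(subset):
    """Hypothetical 'prediction' given the available features (arbitrary units)."""
    value = 0.0
    if "Mi_BN_c" in subset:
        value += 2.0
    if "Yield" in subset:
        value += 1.0
    if "Mi_BN_c" in subset and "Yield" in subset:
        value += 0.5  # interaction term, shared equally by the two features
    return value

phi = shapley_values(["Mi_BN_c", "Yield", "Age"], toy_value)
```

In this toy case the values are 2.25 for Mi_BN_c, 1.25 for Yield and 0 for Age, and they sum to the difference between the value for all features and the value for the empty set (3.5). Exact enumeration scales exponentially with the number of features, which is why practical SHAP implementations rely on model-specific or approximate algorithms.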

Results

Radiation dose responses for the DCA and CBMN assays are shown in Fig. 2. Linear quadratic (LQ) quantile regression for the median of dicentric chromosome yield (Yield, Fig. 2A) produced the following parameters: baseline value in unexposed samples, c = 1.63 × 10⁻² ± 1.30 × 10⁻³ (standard error), p-value < 10⁻⁶; linear dose response term α = 3.22 × 10⁻³ ± 8.20 × 10⁻⁴ Gy⁻¹, p-value 1.0 × 10⁻⁴; quadratic dose response term β = 5.60 × 10⁻⁴ ± 1.00 × 10⁻⁴ Gy⁻², p-value < 10⁻⁶. For raw micronucleus yield (Mi_BN, Fig. 2B), the regression parameters were: c = 3.19 × 10⁻² ± 1.56 × 10⁻³, p-value < 10⁻⁶; α = 9.60 × 10⁻² ± 5.04 × 10⁻³ Gy⁻¹, p-value < 10⁻⁶; β = −1.05 × 10⁻² ± 9.50 × 10⁻⁴ Gy⁻², p-value < 10⁻⁶. The decrease in Mi_BN at 8 Gy compared with 3 Gy prompted us to develop the corrected “linearized” micronucleus index Mi_BN_c. Using k = 70 in Eq. (1) led to a small and non-significant β term in Eq. (2): 1.60 × 10⁻⁴ ± 9.60 × 10⁻⁴ Gy⁻² (p-value 0.87). Dropping this non-significant β term produced the following parameters for a linear dependence of the median Mi_BN_c (Fig. 2C) on dose: c = 0.200 ± 2.14 × 10⁻³, α = 9.45 × 10⁻² ± 2.33 × 10⁻³ Gy⁻¹.

Figure 2

Dose responses for the yields of dicentric chromosomes (per total number of chromosomes in the sample) (A), raw micronuclei per cell (B) and corrected (linearized) micronuclei per cell (C). Circles represent the data for individual blood samples, and curves represent linear-quadratic quantile regression fits that describe the median (50th percentile) of the distribution of each variable. The data (circles) were randomly “jittered” by small amounts along the x-axis to improve visualization by reducing overlap. The dose values were 0, 3, 4, and 8 Gy.

The dependences of Mi_BN, Mi_BN_c and Yield on age and radiation dose are shown in Fig. 3. The fitted curves represent LQ quantile regressions for the median of each variable at each dose (0, 3 or 8 Gy). These results suggest that the DCA and CBMN assay yields, particularly Mi_BN_c and Yield, tended to increase with age. This age-related increase, especially in baseline values (i.e., at 0 Gy), may reflect reduction in DNA repair efficiency and induction of genomic instability due to aging and factors such as tobacco smoking43,44,45.

Figure 3

Age dependences for yields of dicentric chromosomes (A), and raw micronuclei (B) and corrected (linearized) micronuclei (C). Green = 0 Gy, blue = 3 Gy, red = 8 Gy. Circles represent the data for individual blood samples, and curves of each corresponding color represent linear-quadratic quantile regression fits that describe the median (50th percentile) of the distribution of each variable.

The initial feature selection procedure using the Boruta algorithm rejected the Sex variable as a weak predictor, which outperformed random noise in only 35.8% of the iterations. By comparison, Race outperformed noise in 82.8% of iterations, Ethnicity in 98.5%, and all other predictors (Age, Mi_BN, Mi_BN_c and Yield) in 100%. Consequently, Sex was discarded from further analysis. This finding suggests that sex of the blood donors did not have a significant effect on dose reconstructions in this data set, although other research suggests that sex may play a role in cytogenetic assays28,46. Specifically, the absence of a sex effect is consistent with the results from manual scoring of dicentrics47, but not micronuclei48.

Using the retained predictor variables, we compared the performances of several state-of-the-art ML methods, assessing their abilities to reconstruct radiation dose in a heterogeneous population of samples exposed to different dose rates. The results of the initial model comparison (Table 2) suggest that RF and CatBoost had the best and second-best performances, respectively.

Table 2 Initial comparison of ML model performances for dose reconstruction.

Hyperparameter tuning by grid search approaches was used on these top two ML algorithms: RF and CatBoost. Their performances were evaluated on the testing data set, which was not seen by any of the models during training. RF, tuned in R using the ranger and caret packages, ultimately performed the best on testing data, with R² for actual vs. reconstructed doses = 0.845, RMSE = 1.160 Gy and MAE = 0.628 Gy. Its best hyperparameters were: number of decision trees in the forest, num.trees = 500; the variable importance measure, importance = "permutation"; the rule by which each split in a tree occurs, splitrule = "extratrees"; minimum number of data instances in a terminal node, min.node.size = 1; the number of features considered as candidates at each split, mtry = 6. To assess how RMSE and MAE metrics varied by the actual radiation dose, we calculated them separately for 0, 3 and 8 Gy samples. For 0 Gy, RMSE = 0.954 Gy and MAE = 0.476 Gy. For 3 Gy, the corresponding values were 1.128 and 0.566 Gy, and for 8 Gy they were 1.255 and 0.750 Gy, respectively. As expected, the error metric magnitudes increased somewhat as a function of increasing dose, but this tendency was not very dramatic.
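The dose-stratified error calculation can be sketched as a simple grouping operation. The Python example below is illustrative only: the function name is hypothetical and the six dose pairs are made-up values, not the study's predictions.

```python
from collections import defaultdict
from math import sqrt

def errors_by_dose(actual, predicted):
    """Return {dose: {'MAE': ..., 'RMSE': ...}} computed within each actual dose."""
    residuals = defaultdict(list)
    for a, p in zip(actual, predicted):
        residuals[a].append(a - p)
    return {
        dose: {
            "MAE": sum(abs(r) for r in res) / len(res),
            "RMSE": sqrt(sum(r * r for r in res) / len(res)),
        }
        for dose, res in residuals.items()
    }

# Made-up actual and reconstructed doses in Gy.
actual = [0.0, 0.0, 3.0, 3.0, 8.0, 8.0]
predicted = [0.2, 0.0, 2.5, 3.5, 7.0, 8.0]
by_dose = errors_by_dose(actual, predicted)
```

Stratifying in this way shows whether the pooled metrics hide a dose group with systematically poorer reconstructions.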

Predictions (dose reconstructions) for the best-performing RF model on testing data are provided in Supplementary_File_2 online and displayed graphically in Fig. 4. In addition to predictions for the mean reconstructed dose, we calculated quantile predictions from an RF model with the same hyperparameters, using the ranger R package with the quantreg = TRUE option. The quantiles were 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, and corresponding predicted values are labeled as Reconstructed_Dose_q5 to Reconstructed_Dose_q95 in Supplementary_File_2 online.

Figure 4

Visualization of actual and reconstructed radiation doses. The reconstruction was performed by the random forest algorithm on the testing half of the data set. The left panel shows median dose predictions, generated using quantile random forest, and the right panel shows mean dose predictions. The violin plots for 0, 3 and 8 Gy show the distributions of corresponding reconstructed dose values. The model’s performance metrics were: R² for actual vs. reconstructed doses = 0.845, RMSE = 1.160 Gy and MAE = 0.628 Gy. The median reconstructed doses which corresponded to actual doses of 0, 3 and 8 Gy were 0.132, 3.043 and 7.700 Gy, respectively.

The mean and median (Reconstructed_Dose_q50) predictions for reconstructed dose are shown in Fig. 4. The small number of samples at a dose of 4 Gy were excluded from the figure to improve visualization, but RF predictions for all testing data samples are provided along with corresponding actual doses in Supplementary_File_2 online (Reconstructed_Dose and Actual_Dose columns). The median for absolute errors was only 0.15 Gy, the 75th percentile was 0.72 Gy, and 1 Gy corresponded to approximately the 80th percentile. In other words, for approximately 80% of testing data points, absolute errors on the radiation doses were ≤ 1 Gy.

Out of the 561 testing data points, 553 actual doses fit within the 0.1 to 0.9 quantiles of RF predictions, and 548 fit within the 0.25 to 0.75 quantiles. In other words, for only 13 out of 561 testing data points (about 2.3%) did the actual dose fall outside the 25th to 75th percentile range of the quantile predictions by the RF model, which suggests that the model made large errors in dose reconstruction only infrequently.
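The coverage check described above amounts to counting how many actual doses fall between two predicted quantiles. A minimal Python sketch follows; the function name and the four example rows are hypothetical, while the counts 548 and 561 are taken from the text.

```python
def interval_coverage(actual, lower, upper):
    """Count actual values lying within [lower, upper] and return (count, fraction)."""
    inside = sum(lo <= a <= hi for a, lo, hi in zip(actual, lower, upper))
    return inside, inside / len(actual)

# Made-up example: actual doses with predicted 25th and 75th percentiles (Gy).
actual = [0.0, 3.0, 3.0, 8.0]
q25 = [0.0, 2.6, 3.2, 7.1]
q75 = [0.4, 3.3, 3.9, 8.0]
count, fraction = interval_coverage(actual, q25, q75)

# Arithmetic for the counts reported in the text.
outside = 561 - 548
outside_fraction = outside / 561
```

In the made-up example, three of the four actual doses are covered (the second 3 Gy sample lies below its predicted 25th percentile). For the reported counts, `outside` is 13 and `outside_fraction` is about 0.023.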

CatBoost tuned in Python using the GridSearchCV procedure from sklearn.model_selection also performed decently, but somewhat worse than RF: R² = 0.800, RMSE = 1.304 Gy, MAE = 0.783 Gy. Its best hyperparameters were: the number of trees in the ensemble, iterations = 425; learning rate during the training process, learning_rate = 0.085; L2 regularization of the loss function, l2_leaf_reg = 3; maximum allowed tree depth, depth = 7. The squared errors for CatBoost on testing data were significantly larger than those for RF (p-value 1.89 × 10⁻⁴ using the scipy.stats.wilcoxon test).

Predictions for the CatBoost model on testing data are displayed graphically in Supplementary Fig. 1. Their distributions look visually similar to those from RF (Fig. 4), but there is a small fraction of predicted dose values outside the range of the training data (i.e. < 0 or > 8 Gy). Since the RF algorithm uses bagging to build the tree ensemble, it cannot extrapolate beyond the range of training data. Boosting algorithms such as CatBoost, however, successively fit trees to the residuals of the fit from the previous step and can stray somewhat out of the training range (from −0.34 to 8.69 Gy in this case). This difference in ensemble building approach may explain why on this data set RF produced somewhat better performance metrics than any of the boosting algorithms, but both strategies may ultimately be useful for biodosimetry under field conditions.

Expectedly, both the RF and CatBoost models performed slightly worse on testing data than during initial training (Table 2). The magnitude of performance decrease between training and testing was not severe for either model, and suggests that both algorithms were able to generalize relatively well from one portion of the data set to another.

The most important predictor variables in the RF model, assessed by the SHAP metric42, were the CBMN and DCA data, followed by age (Fig. 5). SHAP values for the CatBoost model showed very similar patterns (Supplementary Fig. 2). Partial dependence plots, which provide additional details about how the predictions of each model were related to values of features of interest, are displayed in Figs. 6, 7 for RF and Supplementary Fig. 3 for CatBoost. As intuitively expected, the SHAP and partial dependence plots suggest that larger values of corrected micronucleus index (Mi_BN_c) and dicentrics yield (Yield) were associated with higher predicted doses, whereas the corresponding relationship for raw micronucleus index (Mi_BN) was different because of the tendency of this index to decrease at high doses (Fig. 2B).

Figure 5

Visualization of how each predictor variable in the random forest model affected the model’s predictions (dose reconstructions). The SHAP metric, explained in the main text, was used to assess the importance of each predictor. Predictor variables (features) are listed on the left side in descending order, based on the mean absolute SHAP value (shown in bold black font). For example, corrected micronuclei yield (Mi_BN_c) was the predictor with the highest mean absolute SHAP value of 1.433 Gy. Negative SHAP values (left side of the figure) represent reductions in reconstructed dose, and positive ones (right side of the figure) represent increases in reconstructed dose. Each circle represents a blood sample (data point). Blue circles represent high feature values, and yellow ones represent low values. For example, high (blue) values of Mi_BN_c were associated with positive SHAP values, i.e. increased reconstructed dose, and low (yellow) values had the opposite effect. Details are discussed in the main text.

Figure 6

Partial dependence plots, which show the influence of selected predictor variables (A–D) on reconstructed dose in the random forest model. Each black curve represents an Individual Conditional Expectation (ICE) plot for a given blood sample (data point), which shows how the reconstructed dose for this blood sample changed when the selected predictor variable was changed along the x-axis. Each red curve represents the average of all black curves in each panel.
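The construction of ICE curves and their average can be sketched in a few lines. The example below uses a hypothetical two-feature model and made-up samples (neither is taken from this study): for each sample, one feature is swept over a grid while the other is held at its observed value, and the partial dependence curve is the pointwise mean of the resulting ICE curves.

```python
def toy_model(features):
    """Hypothetical stand-in for a trained regressor (not the study's model)."""
    mi_bn_c, yield_dic = features
    return 3.0 * mi_bn_c + 2.0 * yield_dic + 1.5 * mi_bn_c * yield_dic

def ice_curves(model, samples, feature_index, grid):
    """One curve per sample: predictions as the chosen feature sweeps the grid."""
    curves = []
    for sample in samples:
        curve = []
        for value in grid:
            modified = list(sample)          # copy, so the sample is not altered
            modified[feature_index] = value  # override only the swept feature
            curve.append(model(modified))
        curves.append(curve)
    return curves

def partial_dependence(curves):
    """Pointwise average of the ICE curves."""
    n = len(curves)
    return [sum(c[k] for c in curves) / n for k in range(len(curves[0]))]

# Made-up samples: [Mi_BN_c, Yield]
samples = [[0.1, 0.0], [0.4, 0.5], [0.9, 1.0]]
grid = [0.0, 0.5, 1.0]

curves = ice_curves(toy_model, samples, feature_index=0, grid=grid)
pd_curve = partial_dependence(curves)
print(pd_curve)
```

Because the toy model contains an interaction term, the individual curves have different slopes; this kind of spread among ICE curves is information that the averaged curve alone would hide.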

Figure 7

A two-dimensional partial dependence plot, which shows the influence of dicentric chromosomes (Yield) and corrected micronuclei (Mi_BN_c) on reconstructed dose in the random forest model. The plot shows that low values of both Yield and Mi_BN_c are associated with low reconstructed doses (blue), whereas high values of Yield and Mi_BN_c are associated with high reconstructed doses (yellow).

Older ages were associated with somewhat lower predicted doses, probably as a counterpart to the tendency of DCA and CBMN yields to increase at older ages (Fig. 3). The effects of the Ethnicity and Race variables were generally small and may need to be investigated further in future studies involving even larger data sets.

An alternative calculation, where Dose_rate_category was included in the set of predictor variables for dose reconstruction, was performed with the goal of assessing the magnitude of dose rate effects for the proposed analysis approach. The addition of this extra feature improved the RF performance metrics on testing data only slightly (R2 = 0.851, RMSE = 1.140 Gy, MAE = 0.630 Gy). Squared errors were not significantly reduced, compared with the default model without dose rate among the predictors: p-value = 0.157 for a paired Wilcoxon signed rank test in R. Therefore, radiation dose reconstructions by the proposed combined method using DCA and CBMN data in an ML framework were not very sensitive to dose rate, even when dose rate was varied over several orders of magnitude. Dose rate effects in this data set were discussed in more detail in our previous publication.

In agreement with our hypothesis that combining DCA with CBMN data would improve dose reconstruction accuracy, removing the Mi_BN variable from the best-performing RF model significantly increased the model’s squared errors on testing data (p-value 3.4 × 10⁻⁸, using a paired Wilcoxon signed rank test in R). R2 on testing data was reduced to 0.808, and RMSE and MAE were increased to 1.290 and 0.751 Gy, respectively. As expected, removing both the Mi_BN and Mi_BN_c variables from the model reduced performance even more: the p-value for a test on squared errors was < 2.2 × 10⁻¹⁶, R2 was reduced to 0.472, and RMSE and MAE were increased to 2.148 and 1.596 Gy, respectively. Alternatively, retaining the Mi_BN and Mi_BN_c variables, but removing Yield from the predictor set, also decreased performance compared with the default model. For the model variant without Yield, R2 was reduced to 0.771, RMSE and MAE were increased to 1.396 and 0.761 Gy, respectively, and the p-value for a test on squared errors, compared with the default model, was 1.1 × 10⁻⁶. Therefore, using DCA or CBMN data alone resulted in significantly worse dose reconstructions, compared with combining data from both assays.
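The type of comparison described above can be illustrated with a simplified paired Wilcoxon signed rank test on per-sample squared errors. The sketch below is an assumption-laden stand-in written for illustration only: it uses the normal approximation without continuity or tie corrections, whereas R may report an exact p-value for small samples, so its output will not match R exactly. All squared error values are made up.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided paired Wilcoxon signed rank test (normal approximation).

    Returns (W_plus, p_value). Zero differences are dropped and tied
    absolute differences receive average ranks. No continuity correction
    and no tie correction of the variance are applied.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0  # no non-zero differences: nothing to test
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the block while the next absolute difference is tied.
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        avg_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = avg_rank
        start = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean_w = n * (n + 1) / 4.0
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean_w) / sd_w
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return w_plus, p_value

# Made-up squared errors (Gy^2) on the same testing samples for two model variants.
sq_err_full = [0.01, 0.04, 0.09, 0.16, 0.25, 0.36, 0.49, 0.64, 0.81, 1.00]
sq_err_reduced = [0.02, 0.09, 0.20, 0.30, 0.50, 0.60, 0.90, 1.10, 1.50, 0.90]

w_plus, p_value = wilcoxon_signed_rank(sq_err_reduced, sq_err_full)
print(w_plus, p_value)
```

The test is paired because both model variants are evaluated on the same testing samples, so each sample contributes one difference in squared error.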

Discussion

We hypothesized that combining data from the DCA and CBMN assays using ML approaches can improve the accuracy of radiation dose reconstructions in demographically heterogeneous populations and exposures at different dose rates of ionizing radiation. The rationale for this hypothesis was based on the idea that the DCA and CBMN assays can provide partially complementary (rather than redundant) information, because their dose response shapes and dependences on other factors are not identical (e.g. linear quadratic for DCA and more linear for the “corrected” linearized CBMN index, Fig. 2), and that ML algorithms can extract and utilize this information. To test this hypothesis, we assembled a large data set of ex vivo irradiated blood samples from adult and pediatric blood donors, which was intended to mimic a realistic heterogeneous population of people exposed to a mass-casualty radiological event. We compared the performances of several state-of-the-art ensemble ML methods, e.g. RF, XGBoost, LightGBM, CatBoost, and found that RF and CatBoost models generated the best results based on R2, RMSE and MAE metrics. The ensemble tree-based models generally outperformed other algorithms (Table 2). For the RF and CatBoost models, absolute dose reconstruction errors on the testing half of the data were generally well below 1 Gy even though the studied dose range included a high dose of 8 Gy (Fig. 4 and Supplementary Fig. 1). In other words, model performance for dose reconstruction achieved accuracies of < 1 Gy despite the heterogeneity of the subject population (by age, sex, ethnic background, etc.) and the very wide range of investigated dose rates: from 0.08 Gy/day to ≥ 600 Gy/s.

For both the RF and CatBoost models, the most important predictor variables, assessed by the SHAP metric, were the CBMN and DCA data, followed by age (Fig. 5 and Supplementary Fig. 2). Removing either CBMN or DCA data significantly worsened model performance (p-value 3.4 × 10⁻⁸ for removing Mi_BN and 1.1 × 10⁻⁶ for removing Yield), compared with using both assays together. These findings demonstrate the promising potential of combining automated CBMN and DCA assays to reconstruct the radiation dose in heterogeneous populations exposed to a mass radiological event. We argue that such a strategy of using ML to integrate the output of different radiation damage assays in the context of high-throughput radiation biodosimetry can help to mitigate the challenges (e.g. different dose rates, radiation qualities) posed by potential improvised nuclear device detonations or other types of malicious or accidental large-scale radiological events in populated areas.

The strengths of the current study include a large and diverse data set, innovative radiation delivery methods which enabled us to investigate both very low and very high dose rates, a state-of-the-art ML implementation, and a novel hypothesis. The study also had limitations. For example, each blood sample was assumed to have the same weight during ML regression analysis, regardless of the number of cells scored for the DCA or CBMN assays performed on this sample, although samples with very few cells were removed from the analysis as described in Materials and Methods. The actual dose assigned to each sample was the nominal prescribed dose (e.g. 3 or 8 Gy), rather than a detailed dosimetry estimate for each sample. Radiation type (photons vs. electrons) was also not explicitly considered under the assumption that, at the energies used here, both types of exposures represent sparsely ionizing low-LET radiation with similar biological effectiveness. Another reason not to discriminate between photons and electrons in the current context was that electron pulses were used to mimic the very high dose rate prompt photon irradiation after an IND detonation, whereas such dose rates with photons could not be technically achieved in our irradiation facility. Finally, the DCA and CBMN assays were scored automatically by a high-throughput methodology, which may not be as accurate as manual scoring.

In summary, we proposed a high-throughput radiation biodosimetry approach, which uses ML algorithms to combine the output from automated DCA and CBMN assays. The results showed that combining the assays produced more accurate dose reconstructions, compared with using either assay alone. Although automated scoring is likely to be more error-prone than traditional manual scoring, it is advantageous for use in mass-casualty scenarios. High-throughput sample preparation, liquid handling and imaging techniques allow the DCA and CBMN assays to be performed on the same sample without excessive use of resources or time. The results of this study demonstrate the promising potential of combining DCA and CBMN assays within the ML framework to reconstruct radiation doses in clinically relevant radiation exposure scenarios, where the potentially affected population is demographically heterogeneous and radiation dose rates may vary considerably. We also plan future experiments to increase the sample size and diversity of the data set, and potentially to integrate the cytogenetic damage assays with other radiation-responsive biomarker types.