Machine Learning Approaches to Radiogenomics of Breast Cancer using Low-Dose Perfusion Computed Tomography: Predicting Prognostic Biomarkers and Molecular Subtypes

Radiogenomics investigates the relationship between imaging phenotypes and genetic expression. Breast cancer is a heterogeneous disease that manifests complex genetic changes and various prognosis and treatment response. We investigate the value of machine learning approaches to radiogenomics using low-dose perfusion computed tomography (CT) to predict prognostic biomarkers and molecular subtypes of invasive breast cancer. This prospective study enrolled a total of 723 cases involving 241 patients with invasive breast cancer. The 18 CT parameters of cancers were analyzed using 5 machine learning models to predict lymph node status, tumor grade, tumor size, hormone receptors, HER2, Ki67, and the molecular subtypes. The random forest model was the best model in terms of accuracy and the area under the receiver-operating characteristic curve (AUC). On average, the random forest model had 13% higher accuracy and 0.17 higher AUC than the logistic regression. The most important CT parameters in the random forest model for prediction were peak enhancement intensity (Hounsfield units), time to peak (seconds), blood volume permeability (mL/100 g), and perfusion of tumor (mL/min per 100 mL). Machine learning approaches to radiogenomics using low-dose perfusion breast CT is a useful noninvasive tool for predicting prognostic biomarkers and molecular subtypes of invasive breast cancer.

or recurrence scores have been investigated. A recent radiogenomic study using breast MRI and RNA sequencing by Yeh et al. 4 demonstrated that tumor size characteristics are associated with proliferation and replication genetic pathways and better prognostic imaging features such as smaller size, increased sphericity, and that sharper margins are associated with immune pathway activation. Radiogenomic studies using mammography have shown the associations between breast parenchymal density with UGT2B gene variations and between radiographic texture analysis and BRCA gene mutations 6,7 .
The use of computed tomography (CT) to image the breast is limited because of the risk of high radiation exposure and poor image quality. However, CT has many advantages in oncology imaging. CT can be used to evaluate tumor angiogenesis using contrast agents and to assess lymph node or distant metastasis to lung, bone, or liver as well as breast itself. We recently developed a breast perfusion CT protocol using a low dose of 1.30-1.40 mSv, which was performed with the patient in the prone position, and reported the feasibility of quantifying tumor vascularity using this method 8 . Our preliminary results showed correlations between CT perfusion parameters and prognostic biomarkers and molecular subtypes, and between CT parameters and MRI enhancement characteristics. Low-dose breast perfusion CT requires a short scan time of only 3 minutes, produces satisfactory image quality, and can, therefore, be applied to patients who cannot undergo MRI because of implantable metallic devices, allergy to gadolinium, marked obesity, severe kyphoscoliosis, or claustrophobia.
We wanted to verify the application of low-dose perfusion CT to breast cancer radiogenomics by using a robust statistical method in a prospective study of breast cancer patients. Machine learning is a statistical tool that can allow computers to perform tasks by learning from examples without being explicitly programmed 9 . As data from electronic medical record become available, machine learning extracts knowledge from this data pool and produces output that can be used for individual outcome prediction analysis and clinical decision making. A few studies have applied a machine learning approach to radiogenomics and predictive analysis 10,11 . A machine learning approach allows for effective generalization of imaging data and patient stratification.
The purpose of our study was to investigate the clinical value of a machine learning approach to radiogenomics using low-dose perfusion breast CT for predicting prognostic biomarkers and molecular subtypes in invasive breast cancer. Although only 4 CT perfusion parameters with the maximum slope algorithm were extracted in the preliminary study 8 , the current study evaluated 18 quantitative CT parameters according to the maximum slope, deconvolution, and Patlak algorithms. The Patlak algorithm assumes the double-compartment method in which the intravascular and extravascular spaces are separate compartments 12,13 . The concentration of contrast agent within an organ at any time will have intravascular and extravascular components. This model quantifies the movement of the contrast agent from intravascular to extravascular space. The intravascular component is determined by the blood volume of the organ and the blood concentration of the contrast agent used at that time. The extravascular component is determined by the capillary permeability and the amount of contrast agent having passed through the organ. To create a CT feature-specific model, we used standard logistic regression and 5 machine learning models. We tried to identify the best machine learning model and which CT parameters are important for predicting prognostic biomarkers and molecular subtypes.

Methods ethics statement. This prospective study was approved by the Institutional Review Board of Korea
University Ansan Hospital (2016AS0071 and 2017AS0037) and written informed consent was obtained from all patients. All methods were performed in accordance with the relevant guidelines and regulations. patients. Perfusion CT was performed in 246 consecutive women who were scheduled to undergo treatment for invasive breast cancer from November 2016 to March 2019. All patients received a core needle biopsy and were not subjected to vacuum-assisted or excisional biopsies. Five of the 246 patients were excluded because of recent breast surgery within 2 years (n = 3), only ductal carcinoma in situ on final pathological examination after surgery (n = 1), and poor identification of a tumor in perfusion images because of small size (n = 1). Finally, a total of 241 breast cancer patients (age range, 25─84 years; mean, 51 years) were enrolled in this study. ct acquisition. Perfusion breast CT scans were performed according to the previous preliminary study 8 .
One of 2 radiologists (B.K.S. and P.E.K.), who had 20 or 8 years' experience in breast imaging, performed targeted ultrasound to localize the cancer before CT examination. After the cancer was identified on ultrasound, a dot skin marker (X-spot ® ; Beekley Medical, Bristol, CT, USA) was attached at the cancer site. If the patient had multiple cancers, the largest cancer was marked.
We used a spectral CT scanner (IQon Spectral CT; Philips Health Systems, Cleveland, OH, USA). For CT scanning in the prone position, an additional table pad with a rectangular hole to position the breast was placed on a standard CT table 8 . CT was performed at 80 kV tube voltage, 25 mAs or 30 mAs tube current, 64 × 0.625 mm collimation, 0.5 s rotation time, 512 × 512 matrix, and 5 mm slice thickness. 138 patients had a CT scan at 25 mAs and 103 patients at 30 mAs. The perfusion scan range was 40 mm along the z-axis, including skin markings on the cancers. Precontrast images were obtained to determine the scan range. After the perfusion range was determined, the skin markers were removed and 60 mL of contrast agent (Xenetix 350; Guerbet, Aulnay-sous-Bois, France) was administered at 4 mL/s. Scanning was performed 18 times at 3-second intervals followed by 4 times at 30-second intervals after the contrast administration. The CT dose index at 80 kVp and 25 mAs or 30 mAs using a 32 cm body phantom was 0.7 mGy or 0.9 mGy. The CT effective dose for each patient ranged from 1.01 to 1.40 mSv. ct analysis. The CT data were sent to a dedicated workstation (Extended Brilliance Workspace; Philips Health Systems). All 18 perfusion parameters (independent variables) were obtained using the 2 software programs ( Table 1). The time-attenuation curves and perfusion color maps of the cancer were computed automatically using a commercial software (Functional CT (FUNCTION); Philips Health Systems) and a prototype (Advanced Perfusion and Permeability Application (APPA); Philips Health Systems). (Fig. 1).
Ten parameters were obtained using the APPA program (Fig. 1A); peak enhancement intensity (PEI; Hounsfield units, HU), perfusion on deconvolution model (PD; mL/min per 100 mL), perfusion on maximum slope model (PM; mL/min per 100 mL), blood volume (BV; mL/100 g), mean transit time (MTT; seconds), time to peak (TTP; seconds), permeability (Permeability; mL/min per 100 g), blood volume permeability (BV permeability; mL/100 g), standardization of perfusion on deconvolution model (Standard PD; mL/min per 100 mL), and standardization of perfusion on maximum slope model (Standard PM; mL/min per 100 mL). Standardization meant calculated perfusion values based on cardiac output, body weight, volume and density of the contrast agent, and conversion factors (contrast agent density unit to HU).
A perfusion map of breast cancer was obtained by (1) manually selecting the images between the beginning and end of enhancement in the descending aorta area, (2) obtaining the reference artery input curve by placing a region of interest (ROI) in the aorta, and (3) placing a ROI in the tumor hot spot or the whole tumor range. Perfusion parameters for each patient were measured 3 times at intervals of 3 months by the radiologist (E.K.P.). The tumor size was measured on CT scans using the maximum diameter of the enhancing tumor, and the size was then classified into 2 groups for statistical analysis: ≤20 mm or >20 mm. The tumor size ranged from 6.0 mm to 82.0 mm (mean, 23.8 ± 13.4 mm).
Histopathological evaluation. We reviewed the histopathological reports for evaluation of prognostic biomarkers and molecular subtypes (dependent variables) of breast cancer. Lymph node status, tumor grade, estrogen receptor (ER), progesterone receptor (PR), human epidermal growth factor receptor 2 (HER2), and Ki67 were dichotomized for statistical analysis. Lymph node status was divided into positive or negative metastasis. Tumor grade was dichotomized as low (grade 1 and 2) and high (grade 3) 14,15 . The immunohistochemical results of ER, PR, HER2, and Ki67 were classified as positive or negative. The Allred scoring system was used for ER and PR status, and a score of >2 was considered positive. Positive HER2 overexpression was considered

Statistical analysis.
The primary goal of this study was to evaluate whether popular machine learning models, that is, decision tree, naïve Bayes, random forest, support vector machine (SVM) and artificial neural network (ANN) based on CT perfusion features, can predict prognostic biomarkers and molecular subtypes of breast cancer. A decision tree comprised (a) internal nodes (each denoting a test of an attribute or independent variable), (b) branches (each denoting an outcome of the test), and (c) terminal nodes (each representing a class label or dependent variable). A naïve Bayesian classifier is a predictor based on Bayes' theorem. A random forest creates many training sets, trains many decision trees, and makes a prediction with a majority vote (bootstrap aggregation). An SVM makes a prediction by maximizing a margin among hyperplanes separating data. The ANN of this study included 1 input layer, 2 hidden layers, and 1 output layer with 9774 neurons as data units in the input layer, 15 in each hidden layer, and 4 (or 2) in the output layer, i.e., 4 for the molecular subtypes and 2 for the dichotomized status of a prognostic biomarker for breast cancer. Here, the number of neurons in the input layer, 9774, was derived from the multiplication of 18 and 543, the numbers of attributes and cases in the training set, respectively. Neurons in the input or previous hidden layer were combined with weights in the next hidden or output layer (feedforward algorithm). The weights in the output layer and its previous hidden layers were then adjusted according to how much they contributed to the loss of the ANN, i.e., a gap between the actual and predicted class labels (backpropagation algorithm). Initially, the weights were set as small random numbers around 0, and the feedforward and backpropagation algorithms were iterated until certain criteria were met for the accurate prediction of a class label 16 .
The model input variables included 18 quantitative CT parameters ( Table 1). The model output variables were prognostic biomarkers and molecular subtypes of breast cancer. The data on 723 cases were divided into training and test sets at a 75:25 ratio. The models were built based on the training set of 543 cases, and the trained models were then validated based on the test set of 180 cases. Two often used criteria were introduced for validating the trained models and finding the best prediction model; accuracy, which represents the ratio of correct predictions among 180 cases in the test set, and the area under the receiver-operating characteristic curve (AUC), which represents the plot of the true positive rate (or sensitivity) vs. the false positive rate (or 1 -specificity  www.nature.com/scientificreports www.nature.com/scientificreports/ this reason, the random split and the statistical analysis were repeated 50 times and their average accuracy and AUC were calculated for each of six statistical methods, i.e., logistic regression and 5 machine learning models above 17 . Here, an additional step with a validation set is considered to be essential when hyper-parameter tuning is essential as in the case of deep learning models with many layers. In this case, (1) the validation set is held back from the training set besides the test set; (2) the validation set is used for tuning hyper-parameters of a certain model (e.g., the learning rate in a deep learning model); and (3) the test set is used for selecting the best model based on performance measures such as accuracy and the AUC. However, this additional step with the validation set can be skipped when hyper-parameter tuning is not essential, as in this study. Therefore, a validation set was not used here. Finally, the importance ranking of CT parameters for predicting the prognostic biomarkers and the molecular subtypes were derived from the variable importance measure of the best prediction model, which gave an accuracy (or mean impurity) gap between the complete model and a model excluding a certain variable. Python 3.52 was used for the analysis in April 2019.

Results
Descriptive statistics of prognostic biomarkers, molecular subtypes, and CT perfusion parameters. Table 2 shows the descriptive statistics for the prognostic biomarkers and molecular subtypes of invasive breast cancer. The statistics for the positive prognostic biomarkers were 43%, 67%, 63%, 21%, and 53% for lymph node, ER, PR, HER2, and Ki67, respectively. Similarly, the percentages of cases with molecular subtypes, luminal A, luminal B, HER2 overexpression, and triple negative cancers were 41%, 29%, 17%, and 13%, respectively. Table 3 indicates the descriptive statistics for the 18 CT perfusion parameters expressed as average, minimum, maximum, and 25%, 50%, and 75% values.
Diagnostic performance of the logistic regression and machine learning models for predicting prognostic biomarkers and molecular subtypes. Table 4 shows the diagnostic performance of the logistic regression and 5 machine learning models in terms of accuracy and AUC values, averaged over the 50 random splits and statistical analyses described above. Indeed, Fig. S1 shows the AUCs from one of the 50 analyses. In terms of accuracy and the AUC, the random forest model was the best model for predicting prognostic biomarkers and molecular subtypes of breast cancer. The accuracy was higher for the random forest model than for the logistic regression model by 13% Variable importance of ct parameters in predicting prognostic biomarkers and molecular subtypes. According to the importance of CT variables from the 50 random forest model, PEI, TTP, BV permeability, Perfusion-Function, and Perfusion-Function-Whole were the most important parameters for predicting prognostic biomarkers and molecular subtypes of breast cancer (Table 5). These results were based on the number  Table 3. Descriptive statistics for CT perfusion parameters. * The meanings of independent variables are described in Table 1. The importance rankings of PEI were 4 th , 3 rd , 1 st , 5 th , 3 rd , and 5 th for lymph node status, tumor grade, tumor size, HER2 status, Ki67 status, and molecular subtypes, respectively. TTP ranked 1 st , 2 nd , 2 nd , 2 nd , and 2 nd among the most important predictors for tumor grade, PR status, HER2 status, Ki67 status, and molecular subtypes, respectively. BV permeability ranked 4 th , 1 st , 3 rd , 4 th , 1 st , and 1 st among the most important predictors for tumor grade, ER status, PR status, HER2 status, Ki67 status, and molecular subtypes, respectively. The measures of Perfusion-Function-Whole were 1 st , 2 nd , 5 th , 5 th , 5 th , and 1 st for lymph node, tumor grade, tumor size, ER, PR, and HER2, respectively. The measures for Perfusion-Function were 2 nd , 4 th , 4 th , 5 th , and 3 rd for predicting tumor size, ER status, PR status, Ki67 status, and molecular subtypes, respectively.
The multinomial logistic regression results yielded useful information about the direction and magnitude of the effects of the top 5 CT parameters on the prognostic biomarkers and molecular subtypes (Table S1). For example, a 10-second increase in TTP would increase the odds of ER being positive (vs. negative) by 78%. A 10-second decrease in TTP would lead to a 137% increase in the odds of HER2 overexpression cancer (vs. luminal A type cancer). Similarly, a 10-second decrease in TTP would result in a 97% increase in the odds of triple negative cancer (vs. luminal A type cancer). An increase of 10 mL/min per 100 mL in Perfusion-Function would increase the odds of HER2 overexpression cancer (vs. luminal A type cancer) by 25%. Indeed, the results of univariate analysis were consistent with their multivariate counterparts in general (Supplementary Table S2 Table 5. Variable importance of CT parameters from the random forest in predicting prognostic biomarkers and molecular subtypes. * The meanings of independent variables are described in Table 1 www.nature.com/scientificreports www.nature.com/scientificreports/ mean of a CT parameter was positive (or negative) in cases when the odds ratio was greater (or smaller) than 1 with respect to a prognostic biomarker/molecular subtype. For instance, a 10-second decrease in the TTP would result in a 97% increase in the odds of triple negative cancer (vs. luminal type A cancer). Likewise, the TTP mean was significantly smaller for triple negative cancer (44.45) than for luminal type A cancer (51.14). These results would supplement the random forest, the best model for predicting the prognostic biomarkers and the molecular subtypes in terms of accuracy and the AUC.

Discussion
Until recently, radiogenomic investigations of breast cancer have been carried out using MRI or mammography. This prospective study has shown the possibility of using CT for radiogenomic investigations of different breast cancer subtypes. Our CT protocol provides a quantitative variety of perfusion parameters associated with tumor angiogenesis, with fairly low radiation doses while maintaining high image quality. In addition, the feasibility of this innovative and convenient CT technology was evaluated using a powerful statistical method, a machine learning approach. Our results show that quantitative perfusion CT parameters are associated with the prognostic biomarkers and molecular subtypes of breast cancer, and that the random forest model was the best machine learning model in terms of accuracy and AUC for predicting the prognosis using CT parameters. The random forest model had better accuracy and AUC than the logistic regression model. The accuracy of the random forest model was on average 13% higher than that of the logistic regression model, and the mean AUC difference between the random forest and logistic regression models was 0.17. In addition, we identified the 5 most important CT variables for prediction: PEI, TTP, BV permeability, Perfusion-Function, and Perfusion-Function-Whole.
Our study shows 3 key advances in radiogenomics of breast cancer. First, we have provided an innovative breast CT protocol that measures tumor vascularity using a very low radiation dose. In our previous preliminary study of low-dose perfusion CT, we used 80 kVp tube voltage and 30 mA tube current, and the CT effective dose of the protocol ranged from 1.30 to 1.40 mSv 8 . We have continued to strive to reduce the radiation exposure while maintaining the quality of the images, and we performed CT scans with tube currents down to 25 mAs. Of a total of 241 patients, 138 underwent the CT scan at 25 mAs and the remaining 103 underwent the CT scan at 30 mAs. The CT protocol using a 25mAs tube current reduced the effective radiation dose to 1.01 mSv. In the United States, the average annual effective dose of natural background radiation is about 3 mSv 18 . The average effective dose of 2-view film-screen mammography is 0.56 mSv and that of digital mammography is 0.44 mSv 19 . Therefore, this low-dose perfusion CT protocol is clinically applicable in terms of radiation exposure. If a flexible scan range customized for breast cancer size is available, the effective dose should be reduced to less than 1 mSv for small breast cancers. Reducing the frequency of the acquisition to create time-attenuation curves and developing radiation-reducing algorithms and reconstruction programs may reduce the radiation dose even further.
Second, we have demonstrated the clinical value of quantitative hemodynamic parameters in perfusion CT for predicting biomarkers and molecular subtypes of breast cancer using machine learning models. We evaluated the performance of 5 machine learning models and standard logistic regression for prediction using the measurement of accuracy and AUC values. The best model in this study was the random forest model. The accuracy and AUC values were consistently higher for the random forest model than the standard logistic regression model. In particular, for predicting the molecular subtypes, luminal A, luminal B, HER2 overexpression, or triple negative cancer, the random forest model had superb results, as shown by the comparisons of the random forest vs. logistic regression models: 66% vs. 48% for accuracy and 0.82 vs. 0.69 for AUC. The random forest model is known for its robust performance and strong generalization power from bagging (or bootstrap aggregation) 20 . As noted in the Statistical Analysis section above, a random forest model creates many training sets, trains many decision trees, and makes a prediction with a majority vote (bagging). The random forest of this study comprised 1000 decision trees. Intuitively, a majority vote made by 1000 doctors would be more reliable than a vote made by 1 doctor. In a similar vein, a majority vote made by 1000 decision trees would be more reliable than a vote made by a single machine learning method. To our knowledge, no research has investigated the value of machine learning approaches to radiogenomics using breast CT scanning, breast perfusion imaging, or breast imaging parameters reflecting tumor vascularity to predict biomarkers and molecular subtypes of breast cancer. The results of this study confirm the robust performance and strong generalization power of the random forest model using bagging for the clinical purpose of disease diagnosis and prognosis.
Third, our machine learning approach identified important CT hemodynamic parameters in relation to the treatment and prognostic indicators of breast cancer. In the previous preliminary study of low-dose perfusion CT 8 , 4 CT parameters were obtained with 1 perfusion algorithm (maximum slope) of the FUNCTION analysis software. In the current study, we extracted 18 CT parameters with 3 perfusion algorithms (maximum slope, deconvolution, and Patlak). We sought to identify which CT parameters are more important for predicting biomarkers and molecular subtypes using a machine learning approach. According to the importance of CT variables in the random forest model, PEI, TTP, BV permeability, Perfusion-Function, and Perfusion-Function-Whole were the 5 most important CT parameters for prediction overall. The following CT parameters were valuable for predicting particular prognostic biomarkers or the molecular subtypes: BV permeability, TTP-Function-Whole, Permeability, Perfusion-Function, and Perfusion-Function-Whole for ER status; Perfusion-Function-Whole, TTP, PEI-Function-Whole, BV permeability, and PEI for HER2 status; BV permeability, TTP, PEI, PEI-Function, and Perfusion-Function for Ki67 status; and BV permeability, TTP, Perfusion-Function, Permeability, and PEI for the molecular subtypes.
Tumor angiogenesis plays a vital role in the growth and metastasis of tumors 21 . In breast cancer, angiogenesis is associated with tumor hypoxia and is an independent predictor of overall disease-free survival 22,23 . New tumor vessels in malignant tumors generally contain immature microvessels 24,25 , which increase permeability, and arteriovenous shunts and hyperpermeable vessels, which increase the blood flow within the tumor. Therefore, Scientific RepoRtS | (2019) 9:17847 | https://doi.org/10.1038/s41598-019-54371-z www.nature.com/scientificreports www.nature.com/scientificreports/ aggressive tumors can demonstrate a faster time to reach a peak enhancement. Our multinomial logistic regression results showed that a decrease in TTP would lead to a rise in the odds of breast cancer with positive lymph node metastasis, higher tumor grade, larger tumor size, hormone receptor negativity, HER2 positivity, or higher Ki67 level. A decrease in TTP would result in a rise in the odds of HER2 overexpressing cancer compared to luminal type cancer. Our results also demonstrate the importance of TTP as a noninvasive image parameter for predicting prognostic biomarkers and molecular subtypes of breast cancer. TTP ranked within the top 5 for the predictions of tumor grade, PR status, HER2 status, Ki67 status, and the molecular subtypes.
Blood flow through the vasculature is also related to the velocity within the tumor. In this study, CT parameters that indicated blood flow were PD, PM, Standard PD, Standard PM, Perfusion-Function, and Perfusion-Function-Whole. These are obtained from various perfusion algorithms (maximum slope or deconvolution), various analyzing software (FUNCTION or APPA), various tumor regions (hot spots or whole tumor), and the use of standardization based on cardiac output, body weight, volume and density of the contrast agent, and conversion factors. Perfusion-Function and Perfusion-Function-Whole ranked within the top 5 in predicting biomarkers and molecular subtypes of breast cancer. The multinomial logistic regression results demonstrated that a growth in Perfusion-Function (i.e., blood flow at the hot spot of tumor) would lead to a rise in the odds of cancers with the worse prognostic factors such as positive lymph node metastasis, larger tumor size, hormone receptor negativity, HER2 positivity, or higher Ki67 level, and that a growth in Perfusion-Function would result in a rise in the odds of HER2 overexpressing cancer or triple negative cancer compared to luminal type cancer. These results were consistent with the previous preliminary study 8 .
Immature microvessels in tumors are fragile and hyperpermeable. Vascular permeability indicates the capacity of the blood vessel walls to allow for the flow of molecules or cells in and out of the vessel. In cancer, the vascular wall is important because it acts a barrier to drugs, nutrients, and immune cells. Disorganized and leaky vessel networks cause tumor cell extravasation, blood flow disturbance, and inflammatory cell infiltration, and vascular permeability is associated with tumor progression and can affect drug delivery. In perfusion CT, the permeability is evaluated using the Patlak algorithm 13 . In this study, after administration of the contrast agent, CT was performed 18 times at 3-second intervals followed by 4 times at 30-second intervals. Four scans were taken at 84, 114, 144, and 174 seconds to obtain information about permeability. We measured 2 parameters related to permeability using the Patlak algorithm; Permeability and BV permeability. Permeability indicates the flow of molecules through the capillary membranes in a certain volume of tissue (mL/min per 100 g). BV permeability indicates BV and the movement of the contrast agent from the intravascular space into the extravascular space (mL/100 g). The Patlak algorithm assumes the double-compartment method in which the intravascular and extravascular spaces are separate compartments. This algorithm quantifies the movement of the contrast agent from the intravascular space into the extravascular space and is described in the following equation: where b is the blood volume, a is the permeability, C(t) is the tissue time-attenuation curve, Cb(t) is the input artery time-attenuation curve, rbv is the tissue BV, and Pm is the permeability coefficient. Two required parameters, y and x, are calculated using linear regression, which finds the line that best fits the measured data. In the algorithm application, this calculation is performed for each voxel in the scanned volume. Based on our results, BV permeability was a more important CT parameter for predicting biomarkers and molecular subtypes than permeability. In addition, the logistic regression results showed that an increase in BV permeability values would lead to a rise in the odds of cancers with the worse prognostic factors such as higher tumor grade, hormone receptor negativity, or higher Ki67. However, the relationship between permeability and prognostic biomarkers was not consistent in logistic regression analysis and further studies are needed to determine the clinical relevance of the two permeability-related parameters. This study had several limitations. First, this study was performed using one spectral CT device from one institution. However, the preliminary study of low-dose breast perfusion CT has already shown the feasibility to quantify tumor vascularity and significant correlations of CT parameters with prognostic biomarkers and molecular subtypes in 70 patients with invasive breast cancers 8 . In this prospective study, we measured CT parameters three times at 3-month intervals for each patient and generated a large data set with 723 cases from 241 invasive breast cancer patients. Second, we evaluated fewer imaging parameters than did the general studies related to radiomics or radiogenomics because we only extracted perfusion-related imaging phenotypes for the assessment of tumor angiogenesis. We extracted 18 imaging features given that perfusion CT scans can only measure perfusion-related variables. However, we calculated perfusion CT parameters using all existing applicable perfusion CT algorithms in this study, the maximum slope, deconvolution, and Patlak. Therefore, our study has obtained the most perfusion parameters in perfusion CT studies including breast as well as other organs. Moreover, this study introduced innovative machine learning methods to demonstrate that low-dose perfusion breast CT is a useful noninvasive tool for predicting prognostic biomarkers and molecular subtypes of invasive breast cancer. Third, we used hand-drawn ROIs in the evaluation of perfusion parameters in the tumor hotspots. Future development of automatic selection and calculation software are helpful for overcoming subjectivity. Fourth, this study was designed to make predictions based on a predetermined set of predictors, namely 18 CT parameters. Conducting various experiments based on random sets of predictors and comparing their performance measures would be a good topic for further research. (2019) 9:17847 | https://doi.org/10.1038/s41598-019-54371-z www.nature.com/scientificreports www.nature.com/scientificreports/ conclusion This study demonstrates that a machine learning approach to radiogenomics using low-dose perfusion CT is a useful noninvasive tool for predicting prognostic biomarkers and molecular subtypes of invasive breast cancer. The machine learning random forest model was the best model for predictions with an average accuracy improvement of 13% and an average AUC improvement of 0.17 compared to logistic regression. The combination of innovative and convenient CT technology and a robust statistical model is a promising tool for radiogenomics of breast cancer and can help with risk stratification and precise treatment.

Data availability
The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.