Automatic scoring of COVID-19 severity in X-ray imaging based on a novel deep learning workflow

In this study, we propose a two-stage workflow for the segmentation and scoring of lung diseases. The workflow inherits the quantification, qualification, and visual assessment of lung diseases on X-ray images as performed by radiologists and clinicians. It comprises two core stages, devoted to lung and disease segmentation, as well as an additional post-processing stage devoted to scoring. The latter block estimates the per-segment scores and computes the overall severity score of a patient. The models of the proposed workflow were trained and tested on four publicly available X-ray datasets of COVID-19 patients and two X-ray datasets of patients with no pulmonary pathology. On a combined dataset consisting of 580 COVID-19 patients and 784 patients with no disorders, our best-performing algorithm combines DeepLabV3+ for lung segmentation with MA-Net for disease segmentation. The proposed algorithm's mean absolute error (MAE) of 0.30 is significantly lower than that of the established COVID-19 algorithms BS-net and COVID-Net-S, with MAEs of 2.52 and 1.83, respectively. Moreover, the proposed two-stage workflow is not only more accurate but also computationally efficient, running approximately 11 times faster than the mentioned methods. In summary, we propose an accurate, time-efficient, and versatile approach for the segmentation and scoring of lung diseases, illustrated here for COVID-19 and with broader future applications to pneumonia, tuberculosis, and pneumothorax, amongst others.


Scientific Reports | (2022) 12:12791 | https://doi.org/10.1038/s41598-022-15013-z

To determine the optimal workflow, we evaluated nine state-of-the-art lung and disease segmentation networks (U-net 19 , U-net++ 20 , DeepLabV3 21 , DeepLabV3+ 22 , FPN 23 , Linknet 24 , PSPNet 25 , PAN 26 , MA-Net 27 ) and selected the best-performing configurations as determined by combined accuracy and complexity. The latter is of particular importance, as it allows the broader scientific community to adopt the determined hyperparameters for further research extending beyond the scope of this work. To study algorithm performance, we collected, cleaned, and pre-processed three lung segmentation datasets as well as four disease segmentation and scoring datasets acquired from COVID-19 and pneumonia-infected patients. The datasets are made publicly available 28,29 . We compared our results against two known tailor-made solutions, BS-net 30 and COVID-Net-S 31 . The source code used for model development, tuning, training, and testing, as well as the obtained segmentation models, are also made publicly available 32 .

Related work
Recent advances in medical imaging, together with the added utility of portable X-ray machines, have led to a rise in the adoption of CXRs for monitoring patients admitted to ICUs 33 . Furthermore, the analysis of CXRs plays a crucial role in diagnosing and tracking the progression of acute respiratory distress syndrome (ARDS) and lower-respiratory tract infection (LRI) 34,35 . Access to quantifiable methods for radiograph-based severity scoring can thus greatly assist in a patient's risk stratification 36,37 .
Despite its potential advantages, investigation into severity scoring of radiograph images has thus far been limited. Sheshadri et al. developed a radiologic severity index (RSI) to predict 30-day mortality after an LRI diagnosis 38 . RSI scores (0-72) were based on pulmonary infiltrates as well as the degree of their geographic involvement. To assist with the interpretation of CXRs, Taylor et al. developed a severity scoring tool for severe acute respiratory infections (SARI) 39 . The scoring tool made use of a 5-point system based on the extent of lung abnormalities, with the resultant scores validated against those assigned by trained clinicians. In 2018, Warren et al. introduced the radiographic assessment of lung oedema (RALE) score to evaluate the degree of pulmonary oedema in ARDS 40 . The score (0-48) is calculated by dividing a radiograph image into quadrants, with each quadrant scored on the extent of involvement (0-4; 0 meaning no involvement and 4 meaning 75% or more geographic involvement) and the degree of opacification (1-3; hazy, intermediate, and dense). The method has shown diagnostic promise for the evaluation of ARDS in ICU-admitted patients 41 and has recently been extended to diagnose the severity of a patient's COVID-19 infection 31,42-45 .
Wong et al. adapted RALE by assigning a score (0-4) to each lung based on the extent of lung involvement by consolidation or ground-glass opacity 44 . This methodology, though further adapted, has readily been applied when investigating the performance of deep learning models developed for severity detection 31,46,47 .
Adopting such scoring methodologies, Cohen et al. made use of seven non-COVID-19 datasets to pre-train a DenseNet model for feature extraction and task prediction 46 . Subsequently, a linear regression model was used to predict the score of each image. When evaluating model performance against scoring performed by experts, the authors report an R-squared of 0.62 for the degree of opacity and 0.67 for the extent of involvement.
Extending upon the COVID-Net architecture 14 , Wong et al. propose COVID-Net-S to capture the severity of the COVID-19 infection 31 . By making use of 396 CXRs and the adapted RALE score, the authors report an R-squared score of 0.739 and 0.741 between predicted and expert scores for geographic involvement and opacity extent, respectively.
In 2020, Borghesi et al. 11 introduced an alternative semi-quantitative score (0-18), namely the Brixia score. Here, each lung is divided into three regions, with each region ranked (0-3) based on the extent of lung abnormalities. The Brixia score has shown promise when used to predict the risk of in-hospital mortality 48 and the need for ventilatory support 49 . Subsequently, Signoroni et al. 30 made use of the Brixia score to evaluate their end-to-end deep learning network when tasked with assessing the severity of a COVID-19 infection. Utilizing a dataset consisting of 5000 CXRs, the authors reported a mean absolute error of 0.441 compared to expert radiologists.

Data
This section describes the data used to evaluate the proposed method. Since the proposed workflow is based on two independent stages, we adopted two different datasets, one per stage, for the training and testing of the respective neural networks.
Stage I: lung segmentation dataset. The first stage of the proposed workflow is utilized for lung segmentation. Here, we collected and pre-processed three publicly available datasets: the Darwin, Montgomery, and Shenzhen datasets 50-52 . These datasets include CXRs acquired from patients diagnosed with COVID-19, pneumonia, or tuberculosis. It is worth noting that we utilize CXRs with different diagnoses solely for model training during Stage I. Since Stage I is tasked with lung segmentation, the use of images with different pathologies, differing in nature and disease patterns, acts as an augmentation technique and improves the generalization ability of the studied networks. Table 1 contains a short description of these datasets and their split into the training, validation, and testing subsets.
The Darwin dataset images include most of the heart, revealing lung opacities behind the heart, which may be relevant for assessing the severity of viral pneumonia. The lower-most part of the lungs, where visible, is defined by the extent of the diaphragm. Where present and not obstructive to the distinguishability of the lungs, the diaphragm is included up until the lower-most visible part of the lungs. A key property of this dataset is that image resolutions, sources, and orientations vary, with the smallest image being 156 × 156 pixels and the largest being 5600 × 4700 pixels. Furthermore, we include portable CXRs. Despite the latter being of significantly lower quality, such image variety allows for the improvement of the generalization ability of the studied networks.

In order to obtain the ground-truth segmentations and scores for the abnormal datasets and validate the normal datasets, two senior radiologists in our team, from the United States and Russia, annotated the anteroposterior and posteroanterior radiographs. Each radiologist has more than ten years of experience and was active during the COVID-19 pandemic. The annotators labeled each dataset independently. Such a strategy, although time-consuming, allows us to obtain a consensus segmentation and severity score, which in turn helps us to determine the scoring ability of the proposed workflow when compared to the "gold standard", i.e. the radiologists' consensus.
Having analyzed the COVID-19 datasets, we summarized their main demographic characteristics in Table 3. Accordingly, subject ages fall between 36 and 70 years. Analysis of the age distribution, shown in Appendix A (Figure A2), indicates that COVID-19 affects 91.6% of subjects aged from 25 to 80 years, highlighting the severity of the global disease burden. We emphasize that the majority of the combined dataset records (53.7%) were acquired in European countries (Germany, Italy, Spain, and the United Kingdom). The distribution of the COVID-19 records by country is provided in Appendix A. We note that certain demographic information, such as age, gender, and the name of the collecting organization, is absent in specific cases due to the need to preserve patient anonymity.
To evaluate the agreement of the segmentation and scoring determined by the two radiologists, their scores were compared pairwise for each CXR, as shown in Fig. 2. The visual summary of the 1085 pairs available from both radiologists is provided as a heatmap in Fig. 2c, where a darker color indicates a higher density of points and points on the red dashed line indicate perfect agreement. Most of the points are concentrated tightly around the dashed line, with the only difference being that the first radiologist tended to give slightly higher (by one point) scores than the second. For a more formal, numerical comparison between the judgments of the two radiologists, the correlation coefficient (ρ = 0.97) and Cohen's kappa statistic (κ = 0.64) were computed over all pairs of scores, where each element of a pair is the score of the corresponding radiologist for a given image. The computed correlation coefficient of 0.97 is close to 1, indicating a strong positive correlation and good agreement between the judgments. Similarly, the Cohen's kappa value of 0.64 is interpreted as "substantial" 62 and indicates good agreement between the two judgments. Therefore, the radiologists' scores were deemed to be robust measures with limited variability from radiologist to radiologist. These paired scores were then used to compute an "averaged" score for each pair, when available from both radiologists, which was later used as a "gold standard", or consensus, for method evaluation and comparison. When a score was available from only one radiologist, that score was used as the "gold standard", which increased the total number of available scored images to 1364.
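The agreement statistics above can be reproduced on any set of paired scores. The following minimal Python sketch (with illustrative scores, not the study data) computes both the Pearson correlation and Cohen's kappa for two raters on the 0-6 scale:

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation between two score vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.corrcoef(a, b)[0, 1])

def cohens_kappa(a, b, n_classes=7):
    """Cohen's kappa for two raters assigning integer scores 0..n_classes-1."""
    a, b = np.asarray(a, int), np.asarray(b, int)
    cm = np.zeros((n_classes, n_classes))
    for i, j in zip(a, b):            # build the rater-vs-rater confusion matrix
        cm[i, j] += 1
    cm /= cm.sum()
    p_o = np.trace(cm)                # observed agreement
    p_e = cm.sum(1) @ cm.sum(0)       # agreement expected by chance
    return float((p_o - p_e) / (1 - p_e))

# Illustrative paired scores from two hypothetical radiologists:
r1 = [0, 1, 2, 3, 4, 5, 6, 3, 2, 1]
r2 = [0, 1, 2, 4, 4, 5, 6, 3, 3, 1]
rho = pearson_r(r1, r2)
kappa = cohens_kappa(r1, r2)
```

For the study's 1085 pairs, the same two functions would simply be applied to the full score vectors of both radiologists.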

Methods
In this section, we provide a detailed explanation of the proposed workflow, including both stages and the post-processing block used for the final score estimation. Additionally, we describe how the studied networks are hyper-tuned and trained, and how their performance is estimated in terms of accuracy, error rate, complexity, and processing speed. The proposed workflow inherits the quantification and qualification of lung diseases (scoring and decision-making) from expert radiologists and fulfills the processing steps shown in Fig. 3. When developing the proposed workflow, we tried to mimic the visual inspection and evaluation of X-ray images performed by radiologists and clinicians when assessing and estimating the extent of lung disorders. We relied on input from expert radiologists to further understand the nature of this visual assessment. Here, an expert would first assess whether a suspected infection is present and then grade the degree of pulmonary involvement on a 0 to 6 scale (low to high).
Stage I: lung segmentation. In Stage I, we employ and compare different segmentation neural networks, including nine state-of-the-art solutions 19-27 . Lung segmentation is followed by a processing block used to exclude unnecessary areas. As shown in Fig. 4, this block applies the bitwise AND operator (calculating the conjunction of pixels in both images) to the source image, using the predicted binary mask. Once this operation is performed, the image with the extracted lungs is cropped by the lung bounding box and then resized. The latter operation is useful during training: if the ground-truth region is small, the magnitude of the training signal will also be small. Such an issue is similar to the vanishing gradient problem and may lead to the general inability of a network with many layers to learn on a given dataset, or cause it to prematurely converge to a poor solution. In such a scenario, due to its weak magnitude, the gradient may diminish dramatically as it is propagated backward through the network; by the time it reaches the layers close to the input, the error may be very small and have very little effect. To circumvent this, Stage I excludes the non-informative regions beyond the lungs with the aim of increasing the gradient magnitude during back-propagation.
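The processing block described above (bitwise AND, bounding-box crop, resize) can be sketched as follows. This is an illustrative NumPy-only implementation; the nearest-neighbour resize and the default 512 × 512 output size are our assumptions, not the study's exact settings:

```python
import numpy as np

def extract_and_crop(image, mask, out_size=(512, 512)):
    """Bitwise-AND the image with the predicted lung mask, crop to the
    lung bounding box, then resize (nearest-neighbour) to a fixed size."""
    lungs = np.where(mask > 0, image, 0)          # conjunction of image and mask
    ys, xs = np.nonzero(mask)
    if ys.size == 0:                              # no lungs predicted
        return np.zeros(out_size, dtype=image.dtype)
    crop = lungs[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    # nearest-neighbour resize without external dependencies
    h, w = crop.shape
    rows = (np.arange(out_size[0]) * h / out_size[0]).astype(int)
    cols = (np.arange(out_size[1]) * w / out_size[1]).astype(int)
    return crop[rows][:, cols]

# toy example: a 100x100 image with a 40x30 "lung" region
img = np.random.randint(0, 255, (100, 100), dtype=np.uint8)
msk = np.zeros((100, 100), dtype=np.uint8)
msk[20:60, 10:40] = 1
roi = extract_and_crop(img, msk, out_size=(64, 64))
```

In a real pipeline, the crop would then be fed to Stage II with the library's own interpolation routine in place of the nearest-neighbour indexing shown here.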
Stage II: disease segmentation. Stage II is concerned with the application of Multi-Task Learning (MTL) 63,64 , as opposed to the Single-Task Learning (STL) of the first stage. The proposed MTL disease segmentation model, shown in Fig. 5, has two branches, namely classification and segmentation. The need for the classification branch stems from the limited size of the training subset; it predicts the class of an input image in order to regularize the shared encoder and impose additional constraints on its layers. MTL aims to learn multiple different tasks simultaneously while maximizing the performance across all tasks. The workflow of Stage II is based on the Hard Parameter Sharing variant of MTL because of the need to simultaneously predict the disease label (classification) and the affected lung area (segmentation). In the proposed MTL workflow, the encoder plays the role of a Convolutional Neural Network (CNN) feature extractor, while the head of the classification branch (classifier) is used for making class predictions. The classifier predictions thus adopt the role of a regularizer and are used to refine the segmentation predictions. Taking into account highly confident classification outputs, the model refines the output of the segmentation branch, i.e. the segmentation mask, according to Eqs. (1) and (2). The classification branch outputs the probability of an image being a normal case (probability tends to 0) or a disease case (probability tends to 1). The refined segmentation mask f_seg_ref(x) is computed in the following manner:

f_cls_bin(x) = 1 if f_cls_raw(x) ≥ T, and 0 otherwise, (1)

f_seg_ref(x) = f_cls_bin(x) ⊙ f_seg_raw(x), (2)

where x is an input image, f_seg_raw(x) is the raw probability map produced by the segmentation branch, f_cls_raw(x) is the probability of an image being normal or disease-infected, produced by the classification branch, f_cls_bin(x) is the binarized f_cls_raw(x), ⊙ is the element-wise multiplication, and T is a threshold equal to 0.5 in our study.
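Eqs. (1) and (2) amount to gating the segmentation map with the binarized classifier output. A minimal sketch (the behaviour at exactly f_cls_raw(x) = T is our assumption):

```python
import numpy as np

T = 0.5  # classification threshold used in the study

def refine_mask(seg_raw, cls_raw, threshold=T):
    """Refined mask per Eqs. (1)-(2): binarize the classification output
    and multiply it element-wise with the raw segmentation map."""
    cls_bin = 1.0 if cls_raw >= threshold else 0.0   # Eq. (1): f_cls_bin(x)
    return cls_bin * seg_raw                          # Eq. (2): f_cls_bin ⊙ f_seg_raw

seg = np.array([[0.9, 0.2], [0.4, 0.8]])
kept = refine_mask(seg, 0.95)     # confident "disease": mask passes through
zeroed = refine_mask(seg, 0.05)   # confident "normal": mask is suppressed
```

This gating is what drives down the false-positive rate: a confidently "normal" image produces an empty disease mask regardless of the raw segmentation output.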
We have found, experimentally, that the adopted regularization helps the proposed workflow decrease the false-positive rate of the segmentation branch. For the regularizer architecture, we used a reliable set of layers with proven stability on classification tasks, which includes an adaptive two-dimensional pooling layer, a flatten layer, a dropout layer with a dropout rate of 0.20, and a dense one-neuron layer with a sigmoid activation function.
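A NumPy sketch of the inference-time forward pass of this head (dropout is inactive at inference; the weights w and bias b are illustrative placeholders for trained parameters):

```python
import numpy as np

def classifier_head(features, w, b):
    """Inference-time forward pass of the described head: adaptive average
    pooling -> flatten -> (dropout, inactive at inference) ->
    one-neuron dense layer -> sigmoid."""
    pooled = features.mean(axis=(1, 2))     # adaptive avg-pool each channel to 1x1
    flat = pooled.reshape(-1)               # flatten
    logit = flat @ w + b                    # dense layer, single neuron
    return 1.0 / (1.0 + np.exp(-logit))    # sigmoid -> disease probability

rng = np.random.default_rng(0)
feat = rng.normal(size=(16, 8, 8))          # C x H x W encoder output (toy)
w = rng.normal(size=16)
p = classifier_head(feat, w, b=0.0)
```

In the actual model, this corresponds to a small stack of framework layers attached to the shared encoder; only its output probability enters Eqs. (1)-(2).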
It has been shown that MTL models can often improve accuracy relative to independent STL models 65-67 . However, even when the average task accuracy improves, individual tasks may experience negative transfer, where the MTL model's predictions are worse than those of STL models. To avoid this, we integrate Dynamic Loss Weighting (DLW) 68 into the MTL model, which combines and expands upon ideas from Reinforced MTL 69 and Gradient Normalization for Adaptive Loss Balancing (GradNorm) 70 . According to this approach, loss weights have to be dynamic, meaning that a specific task weight differs for different inputs, in contrast to GradNorm, where the task weight is static, being identical across all batches. DLW assumes that the task-specific loss is informative for balancing different tasks. For each task and batch, DLW considers the loss ratio between the current loss and the initial loss (Algorithm B1 in Appendix B and Eq. 3), which is a proxy for how well the model has trained. Poorly trained tasks have ratios close to one and contribute more to the overall loss and gradient. Having applied the DLW approach, the loss for the proposed MTL-based disease segmentation model is calculated as

L_B = w_cls_B · L_cls_B + w_seg_B · L_seg_B, (3)

where w_cls_B and w_seg_B are the dynamic weights of the classification and segmentation losses L_cls_B and L_seg_B for batch B.

Figure 5. Disease segmentation stage (Stage II) and score estimation. The prefixes "P" and "GT" stand for prediction and ground truth, respectively.

Post-processing: severity scoring and visualization. For quantification, qualification, and visualization of the affected area of the lungs, we utilize the post-processing block shown in Fig. 6, which is primarily used to estimate the scores for the lung segments as well as the overall score. Having obtained the extracted lungs from Stage I, we first separate the two lungs. Here, we employ an algorithmic application of graph theory called Block-Based Connected Components 71 , which is used to determine the connectivity of BLOB-like regions in a binary image.
Once this algorithm is applied, we detect the two biggest BLOBs, which represent the lungs in an image. After the lungs are detected, we take each BLOB, representing a single lung mask, and nullify the other pixels using the bitwise AND operator. Having performed this set of operations, the lung splitter outputs two separated lung masks, M_l (mask of the left lung) and M_r (mask of the right lung). Once the two binary lung masks are obtained, we use a segmenter block to divide each lung into three segments. In clinical practice, the frontal X-ray image is similarly divided into three zones per lung (six zones in total) 72 : upper, middle, and lower. The upper zone extends from the apices to the superior portion of the hilum. The middle zone spans the space between the superior and inferior hilar margins. The lower zone extends from the inferior hilar margins to the costophrenic sulci. This approach, however, does not consider the area of the estimated zones, which leads to a non-uniform segment division and, in turn, to a different impact made by each segment on the estimation of the total score. Furthermore, such a methodology depends on the image alignment, leading some solutions, for instance BS-Net 30 , to integrate an alignment block inside their workflow. Besides being of high complexity, due to the application of a neural network, the alignment block is typically placed right before the core model; as such, any error originating from this block may significantly influence the error of the entire pipeline. In this regard, we propose a segmenter that is invariant to affine and geometric transformations and divides each lung into three segments, maintaining a consistent area within each segment and across both lungs (Algorithm B2 in Appendix B).
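The lung-splitting step above can be illustrated with a plain connected-components labeling; the BFS below is a simple stand-in for the optimized Block-Based Connected Components algorithm 71 :

```python
import numpy as np
from collections import deque

def two_largest_blobs(mask):
    """Label 4-connected components of a binary mask and return masks of
    the two largest blobs (here standing in for the two lungs), largest first."""
    labels = np.zeros(mask.shape, dtype=int)
    sizes, current = {}, 0
    for sy, sx in zip(*np.nonzero(mask)):
        if labels[sy, sx]:
            continue                          # pixel already labeled
        current += 1
        labels[sy, sx] = current
        queue, size = deque([(sy, sx)]), 1
        while queue:                          # breadth-first flood fill
            y, x = queue.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = current
                    size += 1
                    queue.append((ny, nx))
        sizes[current] = size
    top = sorted(sizes, key=sizes.get, reverse=True)[:2]
    return [(labels == lab).astype(np.uint8) for lab in top]

m = np.zeros((10, 10), dtype=np.uint8)
m[1:8, 1:4] = 1      # "left lung", 21 px
m[2:7, 6:9] = 1      # "right lung", 15 px
m[9, 9] = 1          # small noise blob, discarded by the size ranking
left, right = two_largest_blobs(m)
```

Keeping only the two largest components also discards small spurious blobs, mirroring the nullification step described above.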
To divide each lung into three equal-area segments, we apply a binary search algorithm, which runs twice, to compute both the upper y_top and lower y_bot coordinates. The segmenter first searches for the upper coordinate y_top, above which one-third of the lung area lies, and then for the lower coordinate y_bot, below which another third lies. Having estimated the lung mask, the disease mask, and the limits of the six lung segments, we utilize two estimators (Algorithm B3 in Appendix B) to compute the severity score per segment and the total severity score for a given subject. For this procedure, we take the intersection of the predicted disease mask and the segments of each lung obtained by the segmenter. If the intersection of these regions is large enough, i.e. greater than a predefined threshold value T, we score this segment as 1, otherwise 0. At the end of the proposed pipeline, the severity score estimator sums the values of all segments, giving a total score in the range of 0 to 6.
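The equal-area segmenter and the per-segment scoring can be sketched as follows; the threshold value T = 0.25 and the exact boundary handling of the binary search are illustrative assumptions, not the study's settings:

```python
import numpy as np

def equal_area_bounds(lung_mask):
    """Binary-search the rows y_top and y_bot that split a lung mask into
    three segments of (approximately) equal area."""
    total = lung_mask.sum()
    def search(target):
        lo, hi = 0, lung_mask.shape[0]
        while lo < hi:                          # smallest y with area(above) >= target
            mid = (lo + hi) // 2
            if lung_mask[:mid].sum() >= target:
                hi = mid
            else:
                lo = mid + 1
        return lo
    return search(total / 3), search(2 * total / 3)

def segment_scores(lung_mask, disease_mask, threshold=0.25):
    """Score each of the three segments 1 if the fraction of its area
    covered by disease exceeds the threshold, else 0."""
    y_top, y_bot = equal_area_bounds(lung_mask)
    scores = []
    for a, b in ((0, y_top), (y_top, y_bot), (y_bot, lung_mask.shape[0])):
        seg_area = lung_mask[a:b].sum()
        hit = (lung_mask[a:b] & disease_mask[a:b]).sum()
        scores.append(int(seg_area > 0 and hit / seg_area > threshold))
    return scores

lung = np.ones((9, 4), dtype=np.uint8)
disease = np.zeros_like(lung)
disease[6:9] = 1                            # only the lower segment is affected
scores = segment_scores(lung, disease)      # per-lung score = sum(scores)
```

Because the split is driven by cumulative mask area rather than anatomical landmarks, it is invariant to the affine transformations discussed above.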
Hyperparameter tuning. In the proposed pipeline, both stages rely on neural networks. Moreover, Stage II is based on an MTL neural network, requiring hyperparameter tuning to obtain optimal performance. Although different techniques have been explored to address the problem of neural network hyperparameter optimization, such as Neural Architecture Search 73 by Google and Generative Synthesis 74 by DarwinAI, one challenge stands out: computational cost. Even though Grid Search and Random Search 75 are more efficient in terms of optimization time, they are completely uninformed by past trials and, as a result, often spend a significant amount of time evaluating irrelevant hyperparameters. Thus, the Bayesian methodology 76 was chosen:

x* = argmin_{x ∈ X} [f_cls(x) + f_seg(x)],

where f_cls(x) and f_seg(x) represent the classification and segmentation objective scores to be minimized, evaluated on the validation subset, x* is the set of hyperparameters that yields the lowest value of the overall score, and x can take on any value in the domain X. We note that, due to the MTL architecture, f_cls(x) is only used for networks in Stage II; otherwise, it is equal to 0. In essence, Bayesian optimization builds a probability model, and its key advantage is compatibility with black-box functions, while being data-efficient and robust against noise. However, it works poorly with parallel resources, as the optimization process is sequential. In this regard, we additionally used HyperBand 78 , an extension of the Successive Halving algorithm 77 . Successive Halving suffers from a trade-off between the number of configurations and the number of cuts when allocating the budget; as a solution, HyperBand performs Successive Halving with different budgets to find the best configurations. HyperBand evaluates whether tuning has to be stopped or permitted to continue at one or more pre-set iteration counts, called brackets.
When a trial reaches a bracket, its metric value is compared to previously reported metric values, and the trial is terminated if its value is too high (minimization goal) or too low (maximization goal). The goal of our tuning is the maximization of a segmentation score, the Dice similarity coefficient (DSC). To specify the bracket schedule and maximize the aforementioned metric, we use:

• I_max = 16 (the maximum number of iterations);
• S = 2 (the total number of brackets);
• ETA = 2 (the bracket multiplier schedule).
The brackets are computed using the equation B_k = I_max / ETA^k, where k is the bracket number. This means that the brackets for tuning are [16/2^1, 16/2^2], equaling [8, 4]. In addition to HyperBand, we use a complementary stopping strategy, the Early Stopping algorithm, which helps to reduce the computation time. This algorithm acts as a regularization that removes poorly performing trials and allows more configurations to be attempted. The key settings of the Early Stopping algorithm used for tuning are:

• Metric to be monitored: Dice similarity coefficient;
• min = 0.01 (minimum change in the monitored quantity to qualify as an improvement);
• e = 6 (number of epochs with no improvement after which the training is stopped).
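Under the settings above, the bracket schedule and the Early Stopping rule behave as in the following sketch (the bookkeeping details are our illustration, not the exact tuner internals):

```python
def hyperband_brackets(i_max=16, s=2, eta=2):
    """Brackets B_k = I_max / ETA^k for k = 1..S."""
    return [i_max // eta ** k for k in range(1, s + 1)]

class EarlyStopping:
    """Stop after `patience` epochs without an improvement larger than
    `min_delta` in the monitored metric (here: DSC, maximized)."""
    def __init__(self, min_delta=0.01, patience=6):
        self.min_delta, self.patience = min_delta, patience
        self.best, self.bad_epochs = float("-inf"), 0

    def step(self, metric):
        if metric > self.best + self.min_delta:
            self.best, self.bad_epochs = metric, 0   # meaningful improvement
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience      # True -> stop training

brackets = hyperband_brackets()            # [8, 4] for I_max=16, S=2, ETA=2
stopper = EarlyStopping()
# DSC plateaus after epoch 2; training stops 6 epochs later
history = [0.50, 0.60, 0.606, 0.606, 0.606, 0.606, 0.606, 0.606]
stops = [stopper.step(m) for m in history]
```

The 0.006 gain at epoch 3 is below min = 0.01 and therefore does not reset the patience counter.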
In summary, instead of a blind repetition algorithm on top of Successive Halving, we use Bayesian optimization with constraints imposed by two algorithms: HyperBand and Early Stopping. In this approach, Early Stopping acts as an intra-trial regularizer that estimates the performance of a single trial in an epoch-by-epoch manner, whereas HyperBand plays the role of an inter-trial regularizer that compares performance between trials and terminates poorly performing ones in a bracket-by-bracket manner.
Conducting a hyperparameter search is a non-trivial task due to the variability in hyperparameter priority, i.e. models are more sensitive to certain hyperparameters than others, necessitating a focused strategy 79 . As a result, we did not optimize hyperparameters such as batch size, non-linearity type, optimizer options, or kernel sizes. Instead, we paid attention to the encoder architecture, input image size, loss function, optimizer, and learning rate. In Table C1 of Appendix C, we summarize the explored hyperparameters along with the values used during tuning.
The dataset used during tuning differs from the one used in training. First, the testing subsets are not used for tuning purposes, and the termination or early stopping of trials is based on the DSC value computed on a validation subset. Second, the overall tuning dataset includes fewer images than the dataset used for final training and testing. A comparison of the datasets used in both steps is given in Table 4. According to the displayed distribution, we use 10% of the whole dataset for the tuning of lung segmentation networks and 50% for the tuning of disease segmentation networks. This difference is explained by the fact that the region of both lungs is more distinctive than the COVID-19-affected regions. The typical appearance of such regions presents ground-glass opacities (with or without consolidation) or a "crazy-paving" pattern, i.e. ground-glass opacities with superimposed interlobular and intralobular septal thickening. The presence of such patterns forced us to enlarge the tuning dataset in order to find the best configurations of the studied networks.
Hyperparameter correlation and importance. In addition to the best configurations, we estimate which of the investigated hyperparameters are the best predictors of, and are highly correlated with, the desired metric, the DSC. For hyperparameter quantification, we compute two metrics: correlation, between the hyperparameter and the chosen metric, and importance. Correlation ranges from −1 to 1, where positive values represent a positive linear correlation, negative values a negative linear correlation, and 0 no correlation. Generally, a value greater than 0.7 in either direction represents a strong correlation. Correlation alone cannot capture second-order interactions between inputs, and it can get messy when comparing inputs with different ranges. As such, we estimate a complementary metric, importance, for which we train a random forest with the hyperparameters as inputs and the metric as the target output, and report the feature importance values of the random forest. We were inspired by a methodology proposed by Jeremy Howard, who pioneered the use of random forest feature importance to explore hyperparameter spaces at Fast.ai 80 . We chose this over linear regression, which works well only if the dataset is carefully prepared, as random forests are more tolerant of different data types and scales. We note that hyperparameters with importance values of 0.05 and lower are likely not important. Below we outline key factors for interpreting correlation and distinguishing it from importance:

• Correlation shows evidence of association, but not necessarily causation;
• Correlations are sensitive to outliers, which may turn a strong relationship into a moderate one, especially if the investigated sample of hyperparameters is small;
• Correlations only capture linear relationships between hyperparameters and metrics, i.e. a strong non-linear relationship may not be captured.
Disparities between importance and correlation result from the fact that importance accounts for interactions between hyperparameters, whereas correlation only measures the effect individual hyperparameters have on metric values. Secondly, correlations strictly capture linear relationships, whereas importances can capture more complex ones. Nevertheless, both importance and correlation are powerful metrics for the understanding of how hyperparameters influence model performance, and are both used in our study.
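Both quantities can be computed from a table of tuning trials. The sketch below uses a synthetic log (constructed for illustration, not the study's trials) and assumes scikit-learn is available for the random forest:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 200

# synthetic tuning log (one row per trial): learning rate and training time
lr = 10 ** rng.uniform(-5.0, -2.0, n)          # learning rate, log-uniform
t_train = rng.uniform(10.0, 300.0, n)          # training time, minutes
# illustrative DSC: decreases with learning rate, increases with training time
dsc = 0.9 - 0.05 * (np.log10(lr) + 5) + 5e-4 * t_train + rng.normal(0, 0.01, n)

X = np.column_stack([np.log10(lr), t_train])

# (1) linear correlation of each hyperparameter with the metric
corr = [float(np.corrcoef(X[:, j], dsc)[0, 1]) for j in range(X.shape[1])]

# (2) importance: random forest fit on the same table, inputs -> metric
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, dsc)
importance = rf.feature_importances_           # sums to 1 across inputs
```

On real tuning logs, categorical hyperparameters (encoder, loss, optimizer) would be encoded numerically before fitting; the forest's tolerance of mixed types and scales is precisely why it is preferred here over linear regression.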

Model training.
Once the tuning of the networks in both stages is performed, we train nine models with their best configurations, i.e. the best configuration per model (U-net, U-net++, DeepLabV3, etc.). During training, we used the Early Stopping strategy described in "Hyperparameter tuning". Additionally, we employ a set of augmentation transformations used during both the tuning and training steps. Besides allowing us to increase the size of the dataset, augmentation acts as a regularizer and helps reduce overfitting during model training. The proposed augmentation workflow consists of the following transformations:

• Contrast-limited adaptive histogram equalization with a probability of 20%;
• Random-sized crop with a probability of 20% (width-to-height ratio of the crop equal to 1; the crop size is picked from the range 0.7 × I_h to 0.9 × I_h, where I_h is the source image height);
• Rotation with a probability of 50% (a random angle is picked from −15° to +15°);
• Horizontal flip with a probability of 50%;
• Random brightness and contrast adjustment with a probability of 20% (the factor range for changing both brightness and contrast is picked from −0.2 to +0.2).
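A NumPy-only sketch of two of the listed transforms (horizontal flip and brightness/contrast adjustment) with their stated probabilities; in practice, a library such as albumentations provides all five:

```python
import numpy as np

def augment(image, rng):
    """Probabilistic augmentation sketch mirroring two of the listed
    transforms; CLAHE, random-sized crop, and rotation would be chained
    analogously with their own probabilities."""
    img = image.astype(np.float32) / 255.0
    if rng.random() < 0.5:                       # horizontal flip, p = 50%
        img = img[:, ::-1]
    if rng.random() < 0.2:                       # brightness/contrast, p = 20%
        brightness = rng.uniform(-0.2, 0.2)
        contrast = 1.0 + rng.uniform(-0.2, 0.2)
        img = np.clip(img * contrast + brightness, 0.0, 1.0)
    return (img * 255).astype(np.uint8)

rng = np.random.default_rng(42)
rng_img = np.random.default_rng(0)
x = rng_img.integers(0, 256, (128, 128), dtype=np.uint8)  # toy CXR
y = augment(x, rng)
```

Because every transform is applied with an independent probability, each epoch effectively sees a different variant of each image, which is what provides the regularizing effect described above.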
In contrast to the tuning step, where the batch size was fixed at 4, the training step does not use a fixed batch size. Since the studied models are of different complexity, they require different amounts of GPU memory for training at a fixed batch size. In this regard, we decided to equalize the trained models by GPU memory utilization, i.e. each model is trained using a batch size that allocates approximately 90-100% of the GPU memory.

Results
Hyperparameter tuning. Following the approach described in "Hyperparameter tuning", we sequentially tuned the networks for both stages, lung and disease segmentation. We found that approximately 100 runs were minimally sufficient to select optimal hyperparameters for the lung segmentation networks; however, to extend the hyperparameter space, we tripled the number of runs. Each network was trained with a batch size of 4 on an NVIDIA GeForce RTX 3090 (24 GB). The small batch size was selected due to the physical limitation of the GPU memory and the out-of-memory errors thrown during training. We decided to focus on the encoder rather than the batch size; in this regard, we evaluated a wide variety of encoders and models rather than several lightweight/middleweight models and encoders with different batch sizes. Besides the accuracy metric (DSC), we estimated the number of parameters of each model and its complexity, represented by the theoretical number of multiply-accumulate (MAC) operations in the CNNs.
Having performed 300 trials with different hyperparameter combinations, we found the optimal solutions for lung segmentation (Table 5). According to our results, the Adam optimizer, or one of its variants, turned out to be the optimal choice for network training and effective convergence. The optimal learning rate falls into the range of 10^-3 to 10^-4, while the median input size for the training and inference of an optimal network  FPN (9.1 G). Having compared all nine models, DeepLabV3+ turned out to be the optimal network in terms of the accuracy-complexity ratio, i.e. the DSC-MAC ratio. As described in "Hyperparameter correlation and importance", we additionally estimate two metrics (correlation and importance) showing the degree to which each hyperparameter was useful in predicting the chosen metric. In Fig. 7, we summarize the average correlation and importance values. As described before, we calculate the importances using a tree-based model rather than a linear model, as the former is more tolerant of categorical and non-normalized data. Based on our results, the learning rate and the training time turned out to be the most important hyperparameters, significantly affecting the accuracy of the studied networks. These hyperparameters also correlate significantly with the estimated segmentation metric (the Dice similarity coefficient), but with opposite signs: higher DSC values are achieved with smaller learning rates, while the training time and DSC are positively correlated, meaning that networks trained longer achieve better accuracy. Additionally, in Fig. C1 of Appendix C, we provide the original correlation and importance values per hyperparameter.
To find the optimal configurations for the COVID-19 segmentation networks, we employed a similar strategy, with the key difference being the presence of a regularization branch. In contrast to the lung segmentation networks, the number of runs for hyperparameter tuning was increased sevenfold, resulting in 2,100 runs. Such a large difference between the two segmentation approaches reflects the deeper exploration of the hyperparameter space of the disease segmentation networks, because (a) the latter play a crucial role in the proposed scoring workflow and (b) segmenting indistinguishable COVID-19-affected regions is of higher complexity than segmenting the more distinctive lungs. Having performed the hyperparameter tuning of the COVID-19 segmentation networks, we obtained the results shown in Table 6. Similarly, the Adam optimizer and its variants proved to be optimal. The most accurate models are U-net (0.894), PSPNet (0.879), and MA-Net (0.876). PSPNet with the EfficientNet B0 encoder has the lowest complexity (0.1 G MACs), which makes it the fastest of the evaluated solutions. The second and third most lightweight models are Linknet (0.5 G) and FPN (4.5 G). Figure 8 displays the average correlation and importance per hyperparameter for the COVID-19 segmentation networks. We again observe that the learning rate and the training time are the most important hyperparameters. Additionally, in Fig. C1 of Appendix C, we provide the original correlation and importance values per hyperparameter.

Model training.
Having performed hyperparameter optimization and found the best configurations, we trained nine lung segmentation networks according to the methodology described in "Model training". Once the networks were trained and validated, we tested them, summarizing their main specifications and Dice similarity coefficients in Table 7. Additional metrics related to the quality of lung segmentation are given in Appendix D, Table D1. All networks reach a high level of segmentation accuracy with relatively similar DSC values; however, some of them (U-net, Linknet, and PSPNet) are computationally expensive. In this regard, DeepLabV3+ (test DSC = 0.963, 7.4 M parameters, 2.2 G MACs) was chosen as the optimal solution and is used in our scoring pipeline as the core block of Stage I.
Similarly, we trained nine MTL networks for both segmentation and classification of COVID-19 and normal cases, where the latter refers to patients with no diseases or no findings. Table 8 summarizes the obtained results. Since the Stage II networks are MTL-based, we report both segmentation and classification accuracies (DSC and F1 score). More detailed results, including Accuracy, Precision, Recall, Dice similarity coefficient, and F1 score, are presented in Appendix D, where Table D2 is devoted to the segmentation results and Table D3 to the classification results. U-net and FPN proved to be the most accurate networks in both the classification and segmentation tasks. U-net achieved an F1 score of 0.985 on the testing subset, which is 0.019 (1.9 percentage points) higher than the 0.966 of FPN. However, in terms of the number of parameters, U-net is almost 26 times larger than FPN (115.8 M vs 4.5 M), which usually translates into a lower prediction speed. Thus, either network may be employed as the solution for Stage II, depending on the required trade-off between performance (processing time) and accuracy (DSC and F1).
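The F1 score used for the classification head has a standard definition; a minimal reference implementation for the binary COVID-19-vs-normal case (labels 1 and 0) is:

```python
def f1_score(y_true, y_pred):
    """Binary F1 score (COVID-19 = 1, normal = 0) from two label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The DSC reported for the segmentation head follows the same harmonic-mean structure, applied pixel-wise to the predicted and ground-truth masks.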
Scoring results and comparison with the state-of-the-art solutions. After training the networks with their optimal configurations, we obtained nine networks for each stage (18 networks in total). As discussed in "Model training", all networks perform lung segmentation with approximately the same accuracy. In "Post-processing: severity scoring and visualization" and in Algorithm B3 of Appendix B we introduced a threshold parameter T. To estimate the score of each lung segment, we take the intersection of the predicted disease mask and the corresponding lung segment. If this intersection is greater than or equal to the predefined threshold T, we count the segment as 1 (affected by the disease); otherwise as 0 (non-affected or only slightly affected). At the end of the proposed pipeline, the severity score estimator sums the values over all segments and outputs the total score, which falls in the range of 0 to 6. Before testing our workflow, we chose an optimal threshold T based on the training and validation subsets (Table 9); it differs for each Stage II network. We selected the threshold values yielding the lowest MAE and RMSE, with the minimum MAE as the deciding criterion for the subsequent workflow testing.
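The per-segment scoring rule can be sketched as follows. Masks are represented as 2-D lists of 0/1, and we assume the threshold T is applied to the fraction of the segment covered by the disease mask; the exact normalization in Algorithm B3 may differ.

```python
def segment_score(disease_mask, segment_mask, threshold):
    """Score one lung segment as 1 (affected) or 0 (non/slightly affected).

    Assumption: T is a relative threshold on the covered fraction of the
    segment, which may differ from the exact rule in Algorithm B3.
    """
    seg_area = sum(sum(row) for row in segment_mask)
    if seg_area == 0:
        return 0
    overlap = sum(d & s for d_row, s_row in zip(disease_mask, segment_mask)
                  for d, s in zip(d_row, s_row))
    return 1 if overlap / seg_area >= threshold else 0

def severity_score(disease_mask, segment_masks, threshold):
    """Total severity score: sum over the six lung segments, range 0..6."""
    return sum(segment_score(disease_mask, seg, threshold)
               for seg in segment_masks)
```

In the full workflow, `segment_masks` are the six lung segments derived from the Stage I lung mask, and `disease_mask` is the binarized Stage II output.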
To strictly validate the proposed workflow, we tested it on an unseen dataset (testing subset), described in "Stage II: disease segmentation and scoring dataset". Additionally, we estimated the workflow performance on the overall testing subset, including both COVID-19 and normal cases (Table 10), on the testing subset with only COVID-19 cases (Table E1 in Appendix E), and on the testing subset with only normal cases (Table E2 in Appendix E). Besides the internal comparison of the different Stage II models, we compared the obtained results against tailor-made state-of-the-art solutions for disease scoring, namely BS-net and COVID-Net-S. Although the MAE and RMSE of BS-net and COVID-Net-S on the pure COVID-19 dataset are at a relatively acceptable level, their errors on the dataset with healthy subjects (normal cases) turned out to be high, meaning that most healthy subjects are scored as mild or intermediate COVID-19 cases. To make the scoring results visually distinct, we use a blue-white-red color scale, where blue refers to better performance, red to worse performance, and white is intermediate. From our studies, we found several optimal solutions for lung disease scoring. The most accurate variants of the proposed workflow use U-net, FPN, or MA-Net for disease segmentation in Stage II. The MA-Net-based workflow achieved the lowest MAE of 0.30 on the testing subset, while the U-net and FPN MAEs are of similar levels, 0.33 and 0.36 respectively. It should be noted that we considered the MAE and RMSE dynamics over the training, validation, and testing subsets to choose the best three networks. As shown in Table 10, the obtained optimal solutions did not overfit.
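The MAE and RMSE used both for threshold selection (Table 9) and for the final comparison are standard; a minimal sketch, with the threshold chosen by minimum MAE as described above:

```python
import math

def mae(pred, true):
    """Mean absolute error between predicted and reference severity scores."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def rmse(pred, true):
    """Root mean squared error between predicted and reference scores."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def pick_threshold(preds_by_threshold, true_scores):
    """Choose the threshold T whose predicted scores minimize MAE.

    preds_by_threshold: mapping of candidate T -> list of predicted scores.
    """
    return min(preds_by_threshold,
               key=lambda t: mae(preds_by_threshold[t], true_scores))
```

In the reported experiments, this selection is performed on the training and validation subsets only, so the testing subset remains unseen.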
We note that the key to comparing the proposed workflow with the state-of-the-art solutions is the separate estimation of the COVID-19 cases (Table E1 in Appendix E) and the normal cases (Table E2 in Appendix E).
While MAE and RMSE provide formal quantifications of the overall performance of each network and allow for a formal comparison between the networks, they do not evaluate performance as a function of the underlying score values. The latter is desirable since some networks may perform better on higher scores, some on lower scores, and some equally across all score values. To address this, visual summaries of each network's performance on the testing subset are provided as heatmaps in Fig. 9: a darker color indicates a higher density of points, and points on the red dashed line indicate perfect agreement between the network's score and the consensus score, while deviations to either side reveal whether the network underscores or overscores. The comparison between the radiologists' scoring is also provided in Fig. 9 (bottom-right chart). For visualization purposes, we provide probability maps of each network in Fig. 10 and Appendix F. Each probability map represents an output of Stage II scaled to the 0-1 range. As shown, the probability maps differ in character. For instance, networks such as U-net++, PSPNet, and PAN produce blurred mask edges; Linknet usually outputs many erroneous binary objects, while DeepLabV3 fails to segment the COVID-19-affected region. In turn, U-net, FPN, and MA-Net perform well, segmenting with no artifacts or blurred borders, which is consistent with these three networks achieving the best severity scoring results (Table 10, Tables E1 and E2 in Appendix E). Figure 11 showcases CXR images of a subject taken from the ACCD dataset, where we additionally provide the ground-truth severity score and the scores obtained by BS-net and COVID-Net-S. Additional cases taken from other COVID-19 datasets (CRD, CCXD, and FCXD) are given in Appendix G.
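Scaling a Stage II output to the 0-1 range for display, and binarizing it into a disease mask, can be sketched as below. Min-max scaling is one plausible choice and is an assumption here, as is the 0.5 cutoff; a flat list of values stands in for a 2-D map for simplicity.

```python
def to_probability_map(values):
    """Min-max scale a raw Stage II output to the 0-1 range for display.

    Assumption: min-max scaling; the paper only states that outputs are
    scaled to 0-1.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def binarize(prob_map, cutoff=0.5):
    """Turn a probability map into a binary disease mask (cutoff assumed)."""
    return [1 if p >= cutoff else 0 for p in prob_map]
```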
For the overall comparison of the proposed solutions, Fig. 12 shows the MAE estimated on the testing subset, the frame rate (FPS), the overall number of parameters, and the MACs. The Y-axes "Parameters" and "MAC" refer to the overall number of parameters and the theoretical number of multiply-accumulate operations for both stages of the proposed workflow. As in the accuracy estimation, we chose DeepLabV3+ as the core network of Stage I and tested nine networks in Stage II. All networks were tested in evaluation mode, meaning that (a) normalization and dropout layers operate in inference rather than training mode and (b) the automatic differentiation engine is deactivated. Adopting the evaluation mode reduces memory usage and speeds up computations by turning off back-propagation through the network. The GPU used for testing was an NVIDIA RTX 2080 Ti 11 GB. The best throughput (12.5 images/s) was achieved by the pipeline consisting of DeepLabV3+ (Stage I) and PSPNet (Stage II), which ranked sixth by severity-score MAE. The most accurate solution consisted of DeepLabV3+ (Stage I) and MA-Net (Stage II), ranking eighth in throughput (7.9 images/s). In contrast, the prediction speeds of the tailor-made solutions, BS-net and COVID-Net-S, turned out to be the lowest, at 0.7 and 0.6 images/s respectively.
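Throughput in images/s can be measured as sketched below. With PyTorch, `predict` would wrap a model placed in `model.eval()` inside a `torch.no_grad()` context, matching the evaluation mode described above; here it is any callable, and the warm-up count is an assumption.

```python
import time

def measure_fps(predict, images, warmup=2):
    """Estimate throughput in images/s, excluding a few warm-up calls.

    The first `warmup` predictions are run but not timed, since initial
    calls often include one-off allocation or caching costs.
    """
    for image in images[:warmup]:
        predict(image)
    start = time.perf_counter()
    for image in images[warmup:]:
        predict(image)
    elapsed = time.perf_counter() - start
    return (len(images) - warmup) / elapsed
```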

Discussion
Issue of the source data. In this study, we utilized four publicly available COVID-19 datasets. According to the available protected health information, these datasets represent patients from over 40 countries, 107 cities, and 40 organizations. However, from a scientific and statistical point of view, we cannot ensure that all of the specified and/or unspecified medical organizations followed the same detailed protocol while gathering the data. According to generally accepted data collection rules 81,82 , a study design should be reproducible, so that the protocol can be followed by any other research party. All of the data needs to be gathered in a consistent manner, and if data is collected by different individuals, a sufficient degree of inter-rater reliability must be guaranteed. In this regard, we cannot verify that there were no collection or processing biases in the data used for the analysis.
Comparison with existing scoring solutions. Having evaluated both state-of-the-art solutions (BS-net and COVID-Net-S), we found them unsuitable for use on CXRs of healthy patients, signifying the limits of their usage in daily clinical practice. As described in "Scoring results and comparison with the state-of-the-art solutions", BS-net and COVID-Net-S fail on the normal CXR cases, and their MAEs (on the 0-6 severity scale) turned out to be high. From an architectural perspective, both BS-net and COVID-Net-S are relatively lightweight. COVID-Net-S utilizes a lightweight residual projection-expansion-projection-extension design pattern discovered by a machine-driven design exploration strategy and uses 1 × 1 convolution layers and 3 × 3 depth-wise convolution layers. The authors of BS-net also chose a lightweight solution for the processing of input images, based on ResNet-18, while the segmentation is performed by a nested version of U-net, U-net++. We are inclined to believe that such lightweight networks do not generalize well to COVID-19 and pneumonia cases, which in turn leads to low-quality scoring requiring additional validation by radiologists or clinicians. In contrast, the proposed workflow is based on network architectures with proven stability and generalization ability across a wide variety of tasks, including medical ones. Moreover, having hyper-tuned the networks in both stages of the workflow, we found solutions optimal in both accuracy and performance. Despite the lightweight design of BS-net and COVID-Net-S, the proposed workflow, based on modern architectural solutions, outperforms them in prediction speed (Fig. 12), suggesting that these tailor-made networks are in practice of higher complexity, and potentially include more parameters, than the modern networks highlighted in this study.
The majority of current disease classification solutions focus primarily on distinguishing whether an infection is present or not, without paying much attention to where the network is looking. Previously 13 , we extended the classification of COVID-19 and pneumonia by utilizing a popular visualization technique known as Grad-CAM 83 . Using Grad-CAM, we checked where the four best-performing networks (MobileNet V2, EfficientNet B1, EfficientNet B3, VGG-16) were focusing, verifying whether they attend to the correct patterns in the image and activate around them. We noticed that some networks were not focusing on the image patterns of interest, instead activating around patterns that lead to incorrect predictions (see Fig. 13c-e). Moreover, the complex patterns of COVID-19 and pneumonia, distinguishable only by radiologists, motivated our move to the segmentation-based approach proposed here. The Grad-CAM technique uses the gradients flowing into the final convolutional layer to produce a coarse localization heatmap highlighting the regions of the image important for predicting the target concept. After a visual comparison of the classification-based and segmentation-based approaches, we may state that visualization and decision-making based on segmentation masks are of higher quality than the localization heatmaps obtained by Grad-CAM; thus, the results of the proposed two-stage workflow are more precise. This is to be expected, however, as Grad-CAM is usually used for approximate localization and requires less effort and fewer resources for data labeling and neural network training.
Figure 11. Comparison of the segmentation and severity score estimation of a COVID-19 subject from the ACCD dataset. The cyan delineation refers to the lung segmentation obtained by Stage I; the red mask is the disease mask obtained by Stage II; the yellow mask refers to the ground-truth segmentation of the disease.

Conclusion
In this study, we proposed a workflow for the scoring and segmentation of lung diseases. The development was influenced by the standard practice adopted by trained clinicians when estimating the severity of a lung infection from an X-ray image. Central to our approach is the use of two core independent stages which investigate the regions of interest on an X-ray image, resulting in a lung mask and a disease mask. An additional block at the end of the workflow uses these masks to estimate the overall severity score of a given patient. To select the best solution, we compared the performance of nine neural networks in both stages. The most accurate solution, in terms of MAE and RMSE, turned out to be based on DeepLabV3+ for lung segmentation and MA-Net for disease segmentation. Having compared this solution with the state-of-the-art BS-net and COVID-Net-S, we found our proposal to be more stable in terms of accuracy and more time-efficient in terms of prediction speed.

Data availability
To study algorithm performance, we collected, cleaned, and pre-processed three lung segmentation datasets as well as four disease segmentation and scoring datasets acquired for COVID-19 and pneumonia-infected patients.