High sensitivity methods for automated rib fracture detection in pediatric radiographs

Rib fractures are highly predictive of non-accidental trauma in children under 3 years old. Rib fracture detection in pediatric radiographs is challenging because fractures can be obliquely oriented to the imaging detector, obfuscated by other structures, incomplete, and non-displaced. Prior studies have shown up to two-thirds of rib fractures may be missed during initial interpretation. In this paper, we implemented methods for improving the sensitivity (i.e. recall) performance for detecting and localizing rib fractures in pediatric chest radiographs to help augment performance of radiology interpretation. These methods adapted two convolutional neural network (CNN) architectures, RetinaNet and YOLOv5, and our previously proposed decision scheme, “avalanche decision”, that dynamically reduces the acceptance threshold for proposed regions in each image. Additionally, we present contributions of using multiple image pre-processing and model ensembling techniques. Using a custom dataset of 1109 pediatric chest radiographs manually labeled by seven pediatric radiologists, we performed 10-fold cross-validation and reported detection performance using several metrics, including F2 score which summarizes precision and recall for high-sensitivity tasks. Our best performing model used three ensembled YOLOv5 models with varied input processing and an avalanche decision scheme, achieving an F2 score of 0.725 ± 0.012. Expert inter-reader performance yielded an F2 score of 0.732. Results demonstrate that our combination of sensitivity-driving methods provides object detector performance approaching the capabilities of expert human readers, suggesting that these methods may provide a viable approach to identify all rib fractures.


Rib fracture detection
Numerous deep learning methods for rib fracture detection have been developed in the last 5 years for volumetric CT images [18][19][20]. Yao et al. implemented a three-stage process for rib fracture detection, beginning with a U-Net for bone segmentation of the CT image, isolating the ribs and removing additional bony structures such as scapulae, and classifying whether a fracture was present via a 3D DenseNet [21]. A similar approach was taken by Zhang et al., utilizing an nnU-Net to segment areas of ribs that may contain a fracture and a secondary stage with a DenseNet to classify the segmented region [22]. MICCAI hosted the RibFrac challenge in 2020 that invited methods to detect the location of fractures and classify them into four clinical categories on CT images [23]. The three leading methods from this challenge used RetinaNet or variations of Mask R-CNN. Fewer methods have been developed for rib fracture detection on 2D radiographs.
There have been substantial efforts to apply deep learning to chest X-ray images, thanks partially to large publicly available data sets of common chest pathologies [24,25]. These efforts largely focus on classification of the image and on lung diseases such as fibrosis, pneumonia, and COVID-19 [26]. There have been successful efforts for improved wrist fracture detection on radiographs [27], but there has been less effort focused on rib fracture assessment [28]. Gao et al. performed rib fracture detection on radiographs with their proposed CCE-Net where multiple feature extraction modules were fused together as inputs to a two-stage detection network demonstrating improved performance compared to competing R-CNN and YOLOv4 architectures [29]. To our knowledge, there are no published methods attempting to automatically detect the location of rib fractures on pediatric chest X-rays.
Detection of rib fractures on pediatric chest radiographs is challenging for a host of reasons: complex anatomy of the ribs, age-related variations of the rib structure, rib fracture location and orientation, and perceptual difficulties from overlying anatomy and artifacts (e.g., monitor leads, support devices, clothing) [30]. The age of fractures also makes identification difficult. In children, acute rib fractures may be undetectable on chest radiographs and in some cases only become evident after callus formation (new bone growth) develops 10-14 days into the healing process [31]. It is therefore not surprising that missed rib fractures in children are quite common and that the sensitivity for detection of any rib fracture on pediatric chest radiography by experts is only about 31% [32,33].
We have previously presented preliminary results in support of pediatric rib fracture detection. In one effort, we developed a method for chest segmentation tailored for pediatric radiographs [34] that serves as a pre-processing step in this current work. Likewise, we proposed a sensitivity-driving approach named "avalanche decision schemes" [9]. This current work expands on our prior efforts with the following innovations: (1) use of a larger custom annotated data set, (2) development of varied-input-processing methods, (3) application of ensembled models, and (4) comparison against expert reader performance for the same task on matched data.

Methods
We propose a start-to-finish methodology for detection of rib fractures in pediatric chest radiographs. The pipeline for processing through evaluation for the rib fracture detection task is as follows: original DICOM radiograph file → thoracic region segmentation via trained U-Net → cropping around segmented region → image processing/filtering → CNN architecture training and inference → evaluation of CNN detection proposals with applied avalanche decision schemes and/or ensembling. Each step in this pipeline is elucidated in the following sections. The major contributions of this work are presented in Fig. 1.

Dataset curation and labeling
This study was approved by the institutional review board at Seattle Children's Hospital (STUDY00000853) with informed consent waived due to the study design. Data were collected for this minimal risk retrospective analysis and all methods and experiments in this study were carried out in accordance with relevant guidelines and regulations of the institution (Seattle Children's Hospital). In this convenience sample, we first searched the medical record for patients with chest radiographs with confirmed rib fractures and identified 624 cases. Gender and age statistics for these fracture-present studies were extracted. An age- and gender-matched sample of chest radiographs with no rib fractures was created. Chest radiographic images were extracted from the medical record and fully anonymized. These images had (height by width) dimensions on average of 2348 ± 685 by 2134 ± 500 pixels with pixel spacing of 0.128 ± 0.023 mm. All images were quantized from 12- or 16-bit integer to 8-bit integer precision and analyzed with their original pixel spacing.
In total, the dataset contains 1109 unique patients, of which 624 are fracture present and 485 are fracture absent. There are 385 (34.7%) female and 724 (65.3%) male patients. The average age of patients is 281.74 ± 769.42 (range 0-7300; median 84; IQR 224) days. In order to perform and evaluate object detection, we obtained hand-drawn ground-truth annotations for all images. In short, the fracture-present cases were each interpreted by one of seven board-certified pediatric radiologists with 5-20 years of experience. During interpretation, the radiologists had access to all of the available radiographic views of the chest (usually supine anterior-posterior (AP), although occasionally other views were available). They were instructed to draw bounding boxes as closely around each detected fracture as possible on the AP view only; the object detection methods discussed below were only applied to the AP view image. Of the total 624 fracture-present images, 338 were read by two board-certified pediatric radiologists to enable estimation of inter-reader variability. This inter-reader variability estimate served as a performance baseline for evaluating the proposed methods.

U-Net segmentation and cropping
In our prior work [34], we trained a U-Net model [35] to segment chest radiographs into multiple anatomic regions. Here, we improved that effort with the use of a more advanced U-Net3+ architecture that includes full-scale skip connections [36]. Labels for training were manually drawn to segment the chest into seven non-overlapping regions: left and right lung, left and right "subdiaphragm" (the thorax below the superior boundary of diaphragm), spine, mediastinum, and background. In total, users manually labeled 469 radiographs. After each inference from the U-Net3+ model, the proposed segmentation maps are automatically refined with basic morphological operations to remove small spurious disconnected regions from background and close all foreground regions. Of the 469 labeled images, the U-Net model was trained with 422 and tested on the remaining 47 images. Representative segmentations from the test set are presented in Fig. 2. The mean Dice coefficient for each of the seven regions exceeded 0.88 and visual assessment confirmed that all final cropped images in the test set contained only the thoracic cavity. For this current work, we use this U-Net segmentation to provide a tight cropping window around the region of the image containing ribs and for the varied input processing ensembles described below.
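The cropping and Dice evaluation described above can be illustrated with a minimal sketch. It assumes the segmentation is available as a 2D label map (rows of integers) in which 0 denotes background; the function names `crop_window` and `dice`, the label convention, and the optional `margin` are illustrative assumptions rather than the implementation used in this work.

```python
def crop_window(label_map, background=0, margin=0):
    """Return (top, left, bottom, right), inclusive, around all non-background pixels."""
    height, width = len(label_map), len(label_map[0])
    rows = [r for r, row in enumerate(label_map)
            if any(v != background for v in row)]
    if not rows:
        return None  # nothing segmented; the caller decides how to handle this
    cols = [c for c in range(width)
            if any(row[c] != background for row in label_map)]
    # Optional margin (an assumption), clamped to the image borders.
    top = max(rows[0] - margin, 0)
    bottom = min(rows[-1] + margin, height - 1)
    left = max(cols[0] - margin, 0)
    right = min(cols[-1] + margin, width - 1)
    return top, left, bottom, right


def dice(pred, truth, label):
    """Dice coefficient for one label between two equally sized label maps."""
    n_pred = n_truth = n_both = 0
    for pred_row, truth_row in zip(pred, truth):
        for pv, tv in zip(pred_row, truth_row):
            n_pred += pv == label
            n_truth += tv == label
            n_both += (pv == label) and (tv == label)
    if n_pred + n_truth == 0:
        return 1.0  # label absent from both maps: treated as perfect agreement
    return 2.0 * n_both / (n_pred + n_truth)
```

In this sketch the crop window is simply the tightest axis-aligned rectangle containing every foreground label, which mirrors the "tight cropping window" idea without the morphological refinement step.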

Deep learning models
For RetinaNet, we used the ResNet-50 backbone with pre-trained weights on ImageNet [37]. We trained all RetinaNet models on our dataset using an NVIDIA V100S for a maximum of 300 epochs at a batch size of 8 using the Adam optimizer. The learning rate was set initially to 0.0001 and was decreased by one-tenth if validation performance did not improve within 4 epochs. The dataset was augmented with a 50% chance of applying any of the following transformations to each image: shift/scale/rotate, horizontal flip, random brightness, random contrast, or Gaussian blur. Training would cease via an early stopping clause if performance on the validation set had not improved in 30 epochs.
We also utilized Ultralytics' open-source YOLOv5 repository [38], using the large L6 model pre-trained on the COCO dataset prior to training on our dataset. Similarly, all YOLOv5 models were trained on an NVIDIA V100S for a maximum of 300 epochs and a batch size of 8. A stochastic gradient descent (SGD) optimizer was used with momentum 0.937 and weight decay of 0.0005. The learning rate was initially 0.01 and decreased linearly each epoch. There was also an early stopping feature that stopped training if no improvement in validation was observed after 100 epochs.
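The plateau-based learning-rate reduction and early stopping described for RetinaNet can be sketched as a small framework-independent controller. This is an illustration under stated assumptions: the class name `PlateauController` is hypothetical, the validation metric is assumed to be "higher is better", and the learning-rate step is assumed to multiply the rate by 0.1, which is one possible reading of the description above.

```python
class PlateauController:
    """Tracks a validation metric to drive LR reduction and early stopping."""

    def __init__(self, lr, lr_patience=4, lr_factor=0.1, stop_patience=30):
        self.lr = lr
        self.lr_patience = lr_patience      # epochs without gain before reducing LR
        self.lr_factor = lr_factor          # assumed multiplicative factor
        self.stop_patience = stop_patience  # epochs without gain before stopping
        self.best = float("-inf")
        self.epochs_since_best = 0
        self.epochs_since_lr_change = 0

    def step(self, val_metric):
        """Record one epoch; return True when training should stop."""
        if val_metric > self.best:
            self.best = val_metric
            self.epochs_since_best = 0
            self.epochs_since_lr_change = 0
        else:
            self.epochs_since_best += 1
            self.epochs_since_lr_change += 1
            if self.epochs_since_lr_change >= self.lr_patience:
                self.lr *= self.lr_factor
                self.epochs_since_lr_change = 0
        return self.epochs_since_best >= self.stop_patience


# Example: a metric that never improves after the first epoch.
controller = PlateauController(lr=1e-4)
for epoch, metric in enumerate([0.50] + [0.40] * 40):
    if controller.step(metric):
        break  # stops once 30 consecutive epochs show no improvement
```

Setting `stop_patience=100` would mirror the YOLOv5 early-stopping window; the linear learning-rate decay used for YOLOv5 is not covered by this sketch.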

Avalanche decision schemes
Our preliminary work proposed adjusting the decision threshold for fracture-positive proposals as a function of the number of already accepted bounding box proposals [9]. The decision threshold is not fixed, but rather changes depending on the number of high probability proposed regions. This approach is motivated by the reality that if a subject has one fracture they are very likely to have more than one fracture, and a subject with two fractures is very likely to have three, and so on. The likelihoods presented in Table 1 are updated using the now larger dataset relative to our preliminary work. In this table, the first row shows that 444 images in the training set have at least 1 fracture; if an image has at least one fracture, there is a 73.6% likelihood of having more than one fracture. Similarly, if an image has at least 2 fractures, there is an 81.3% likelihood of having more than 2 fractures. We explored different relationships for decision thresholds versus apparent number of accepted fractures as presented in Fig. 3. For the "Standard" approach the decision threshold is constant no matter how many proposed regions clear this threshold level, and typically the threshold is set at 0.5 (all proposals with confidence greater than 50% are accepted). For the avalanche approaches, the decision threshold decreases as more proposed regions are accepted. If $n$ regions have a probability greater than $a_{n-1}$, then the new threshold is set to

$$a_n = r \cdot a_{n-1}$$

We evaluated different schemes for setting the reduction, $r$. Specifically, the "Posterior" method uses threshold reductions based on the likelihood information presented in Table 1, with r = (1 …).
The other methods, labeled with "γ", fixed the reduction to $r = (1 - \gamma)$ for all $n$. The values for γ were selected based on initial testing of multiple values ranging from 0.05 to 0.5 in increments of 0.05; please see preliminary work for additional details [9]. We improved on our previous work by implementing a non-maximum suppression (NMS) step, a common approach to filter out region proposals that largely overlap one another. This NMS step is applied after the avalanche decision schemes have been applied to the given trained model predictions. This is particularly effective on networks like RetinaNet where the number of bounding box proposals per image is significantly higher than more reserved models such as YOLOv5.
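The avalanche idea can be sketched in a few lines. The sketch below assumes a multiplicative update, in which each accepted proposal lowers the threshold from a to r·a with r = 1 − γ; this reading is consistent with γ being described as a constant rate reduction, but it is an assumption here, as are the function name `avalanche_accept` and its arguments. The "Posterior" scheme would use a different r at each step and is not reproduced.

```python
def avalanche_accept(scores, a0=0.5, gamma=0.2):
    """Accept proposals under a threshold that drops after each acceptance.

    scores: confidence scores of the proposed regions for one image.
    a0:     starting decision threshold.
    gamma:  fixed rate reduction; gamma=0 recovers the standard fixed threshold.
    Returns (indices of accepted proposals, final threshold).
    """
    r = 1.0 - gamma  # assumed multiplicative reduction factor
    threshold = a0
    accepted = []
    # Visit proposals from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for idx in order:
        if scores[idx] > threshold:
            accepted.append(idx)
            threshold *= r  # each acceptance lowers the bar for the next one
        else:
            break  # remaining proposals are even less confident
    return accepted, threshold


scores = [0.9, 0.1, 0.45, 0.35]
standard, _ = avalanche_accept(scores, a0=0.5, gamma=0.0)   # only index 0 accepted
avalanche, _ = avalanche_accept(scores, a0=0.5, gamma=0.2)  # indices 0, 2, 3 accepted
```

In this toy example the fixed threshold keeps only the 0.9 proposal, while the avalanche scheme also admits the 0.45 and 0.35 proposals because the threshold has fallen to 0.40 and then 0.32 by the time they are considered. A non-maximum suppression pass would follow in the full pipeline.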

Input processing
We wanted to explore how varying the type of processing performed on the input images changed the performance of the trained models. All types of processing were applied following the segmentation and cropping via the U-Net discussed above. Figure 4 provides a visualization of the different types of processing and how they are combined to create input images for training and evaluation. Method a applies histogram equalization to the single-channel, grayscale image array, after which the array is replicated three times to provide the three-channel input to the object detector models.
The two additional variations go a step further by utilizing the pixel-level segmentation information from the U-Net in the previous pre-processing stage. After cropping the image around the segmented thoracic region, all background pixels are masked out to generate a masked foreground image containing only anatomical structures. In method b, adaptive thresholding is applied using a Gaussian weighting method to determine threshold values in a given neighborhood of pixels. This transforms the image from grayscale to binary, with 1 (white) representing pixels above the threshold and 0 (black) below, providing a rough segmentation of just the ribs. This binary mask is then stacked three times as the final image.
In method c, the masked image goes through two separate filtering operations, inspired by Heidari et al. [39]: histogram equalization (like method a) for increased contrast and bilateral low-pass filtering for edge-preserving noise reduction. The low-pass filter uses mid-line σ-space and range values of 150, with a 9 pixel neighborhood diameter. The original masked image, histogram equalized masked image, and bilateral filtered masked image are then stacked as the three channels for the detector input, which we label as "blended" input.
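To make method a concrete, the following minimal sketch equalizes an 8-bit grayscale image and replicates it into three channels. It uses the textbook cumulative-histogram formula in pure Python so that it runs without dependencies; in practice a library routine such as OpenCV's `equalizeHist` would typically be used. The function names and the list-of-rows image representation are assumptions for illustration, and methods b and c are not covered.

```python
def equalize_histogram(image):
    """Histogram-equalize a non-empty 8-bit grayscale image (rows of ints 0-255)."""
    hist = [0] * 256
    for row in image:
        for p in row:
            hist[p] += 1
    # Cumulative distribution of intensities.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    total = cdf[-1]
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        return [list(row) for row in image]  # constant image: nothing to stretch
    # Map each intensity so the output histogram is approximately flat.
    lut = [max(0, round((c - cdf_min) * 255 / (total - cdf_min))) for c in cdf]
    return [[lut[p] for p in row] for row in image]


def to_three_channels(image):
    """Replicate a single-channel image into three identical channels per pixel."""
    return [[(p, p, p) for p in row] for row in image]


low_contrast = [[100, 101], [102, 103]]
equalized = equalize_histogram(low_contrast)   # [[0, 85], [170, 255]]
detector_input = to_three_channels(equalized)
```

The example shows the intended effect: four nearly identical gray levels are spread across the full 0-255 range before being stacked as a three-channel input.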

Ensembles
We also investigated the impact of model ensembles on rib fracture detection. Model ensembles with deep neural networks have shown better generalizability as well as improved performance on tasks with smaller datasets [40][41][42].
To survey this, we tested the following ensembles: (1) Same-Model Ensemble: The simplest form of ensembling is the combination of the proposal results of multiple identical models, each from a different training run initialized with a different seed, similar to the deep ensembles analyzed by Lakshminarayanan et al. [43]. (2) Hybrid-Model Ensemble: This is a slight variation on the same-model ensemble, combining an equal number of training runs of both deep learning architectures we tested; for example, combining one run of RetinaNet with one run of YOLOv5. (3) Varied-Input-Processing Ensemble: The final type of ensemble incorporated all three of the different image pre-processing operations summarized in Fig. 4, requiring at minimum three models, one trained on each of the input processing variations.
Prior to final evaluation, proposed bounding boxes from all members of each ensemble were aggregated together and overlapping boxes were then removed via non-maximum suppression (NMS) with an intersection-over-union (IOU) threshold of 0.45. This threshold was set based on initial validation experiments, but not fully optimized across all model variants.
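The aggregation step can be sketched as pooling every member's proposals and applying greedy NMS. This is a minimal illustration, assuming boxes are (x1, y1, x2, y2) tuples paired with a confidence score and that suppression keeps the highest-scoring box in each overlapping group; the function names are hypothetical, and in practice a library routine would normally be used.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def ensemble_nms(per_model_detections, iou_threshold=0.45):
    """Pool (box, score) detections from all models, then apply greedy NMS."""
    pooled = [det for dets in per_model_detections for det in dets]
    pooled.sort(key=lambda det: det[1], reverse=True)  # most confident first
    kept = []
    for box, score in pooled:
        # Keep a box only if it does not heavily overlap an already kept box.
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


model_a = [((0, 0, 10, 10), 0.9)]
model_b = [((1, 1, 11, 11), 0.8), ((50, 50, 60, 60), 0.6)]
merged = ensemble_nms([model_a, model_b])
```

Here the two heavily overlapping boxes collapse into the higher-scoring one, while the box found by only one member is retained, which is why this style of merging can add detections but never removes a non-overlapping one.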

Training and evaluation
Twenty percent of the total dataset was withheld as the fixed test set (N = 222 images), with half randomly drawn from fracture-present images and the other half randomly drawn from fracture-absent images. The remaining 80% of data was then used to create the training (70%) + validation (10%) sets. All evaluations were performed after training with a 10-fold cross-validation strategy in order to examine the range in model performance; the ten separate training and validation sets were randomly drawn with replacement between each set. Object detection performance was evaluated in terms of recall, precision, and F2 score on the fixed test set. An intersection-over-union (IOU) threshold of 0.30 was applied across all model and ensemble evaluations to identify concordance between model predictions and labeled annotations. Supplementary Fig. S3 provides rationale for the selection of this IOU threshold. We used F2 score rather than F1 to give recall/sensitivity twice the importance of precision, considering this task warrants high sensitivity performance as discussed above. Max F2 scores are also provided for each combination by finding the highest F2 score achieved across all potential decision thresholds. Furthermore, to summarize average performance across a range of settings, we calculated mean average precision (mAP) by computing the areas under multiple precision-recall curves generated at IOU thresholds ranging from 0.25 to 0.75 in 0.05 increments. Note that mAP was not calculated for the avalanche decision schemes since precision-recall curves are not analogous between fixed and dynamic decision thresholds.
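For reference, the F2 score is the β = 2 case of the general F-beta score, F_β = (1 + β²)·P·R / (β²·P + R), where P is precision and R is recall. The short sketch below computes it and also finds the maximum F2 over a set of operating points; the function names and the (threshold, precision, recall) tuple layout are illustrative assumptions.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta=2 weights recall twice as heavily as precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def max_f_beta(operating_points, beta=2.0):
    """Return (best score, threshold) over (threshold, precision, recall) tuples."""
    return max((f_beta(p, r, beta), t) for t, p, r in operating_points)


# A high-precision, low-recall detector is penalized more by F2 than by F1.
f2 = f_beta(0.90, 0.43)            # about 0.48
f1 = f_beta(0.90, 0.43, beta=1.0)  # about 0.58
```

Because the β² term multiplies precision in the denominator, a shortfall in recall lowers F2 more than the same shortfall in precision would.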
For single-model calculations, we evaluated performance metrics for all ten trained models (trained on each of the ten folds) and report the average ± standard deviation across these models. For two-, three-, and six-model ensembles, twenty ensemble combinations were arbitrarily selected from the multitude of different ways to combine 2, 3, or 6 models from the 10-fold data; in other words, twenty model combinations were taken from the 10 choose 2 (45), 10 choose 3 (120), and 10 choose 6 (210) possible combinations, respectively, that were then evaluated and averaged.
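The combination counts above, and one way of drawing twenty ensembles from them, can be reproduced with the standard library. The sampling strategy shown (uniform sampling without replacement using a fixed seed) and the function name `sample_ensembles` are assumptions for illustration; the selection is described above only as arbitrary.

```python
import itertools
import math
import random


def sample_ensembles(n_models=10, size=3, n_samples=20, seed=0):
    """Draw distinct ensembles, each a tuple of fold indices, without replacement."""
    combos = list(itertools.combinations(range(n_models), size))
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    return rng.sample(combos, min(n_samples, len(combos)))


# Number of possible ensembles of each size from the ten fold-specific models.
counts = {size: math.comb(10, size) for size in (2, 3, 6)}  # {2: 45, 3: 120, 6: 210}
three_model_ensembles = sample_ensembles(size=3)
```

Each sampled tuple lists which fold-specific models form one ensemble; metrics would then be computed per ensemble and averaged across the twenty draws.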

Results
Representative test set images with ground truth annotations and model predictions are presented in Fig. 5.

Inter-reader variability
Of the total 624 fracture-present images, 338 were read by two board-certified radiologists. We calculated inter-reader performance for two different data sets: (1) images from the test set (which contains 222 images, although only 111 of these are fracture-present and therefore interpreted by radiologists), and (2) images from the set of 338 fracture-present images that have been read by two radiologists. For clarity, this inter-reader study was performed on fracture-present only images, while the deep learning training and testing was performed on present and absent images. On the test set, the first reader marked 536 total rib fractures for an average of 4.83 ± 3.30 (range 1-14; median 4; IQR 5) fractures per image. The second reader marked 486 fractures overall, averaging 4.38 ± 3.74 (range 1-27; median 4; IQR 4) fractures per image. Setting the first reader's annotations as "ground truth" between the two, fractures were scored as true positive (reader 2 box matches a reader 1 box), false positive (reader 2 box has no corresponding reader 1 box), or false negative (reader 1 box has no matching reader 2 box), dictated by an intersection-over-union (IOU) threshold of 0.30.
Three hundred eighty-five fractures were counted as true positive matches, with 101 false positives and 151 false negatives. This led to the second reader scoring a precision of 0.792, recall of 0.718, and F2 score of 0.732. Essentially, the second reader "detected" just under 72% of the rib fractures discovered by the first reader. With these scores, the second reader's boxes overlapped reader 1 on average by 84% with a mean intersection-over-union of 0.63 across the 111 images. For clarity, overlapping represents the percentage of reader 1's annotated box pixels that are covered by the pixels from reader 2's matching box.
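The scoring described above can be sketched as box matching followed by precision and recall from the resulting counts. The sketch assumes greedy one-to-one matching (each reference box can be claimed by at most one candidate box, choosing the highest IOU at or above the threshold); the matching procedure is not specified in detail here, so that choice and all function names are illustrative assumptions.

```python
def _intersection(a, b):
    """Intersection area of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return iw * ih


def _area(a):
    return (a[2] - a[0]) * (a[3] - a[1])


def iou(a, b):
    inter = _intersection(a, b)
    union = _area(a) + _area(b) - inter
    return inter / union if union > 0 else 0.0


def coverage(reference, candidate):
    """Fraction of the reference box covered by the candidate box."""
    return _intersection(reference, candidate) / _area(reference)


def match_boxes(reference, candidate, iou_threshold=0.30):
    """Greedy one-to-one matching; returns (tp, fp, fn)."""
    unmatched = list(range(len(reference)))
    tp = 0
    for box in candidate:
        best, best_iou = None, iou_threshold
        for i in unmatched:
            value = iou(reference[i], box)
            if value >= best_iou:
                best, best_iou = i, value
        if best is not None:
            unmatched.remove(best)
            tp += 1
    return tp, len(candidate) - tp, len(unmatched)


def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


reader1 = [(0, 0, 10, 10), (20, 20, 30, 30)]
reader2 = [(1, 1, 11, 11), (100, 100, 110, 110)]
tp, fp, fn = match_boxes(reader1, reader2)   # one match, one extra box, one miss
```

Applying `precision_recall` to the reported counts (385 true positives, 101 false positives, 151 false negatives) reproduces the stated precision of 0.792 and recall of 0.718.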
Inter-reader performance metrics remained very similar when looking at all 338 multi-read images. Reader 1 marked 1719 fractures, averaging 5.09 ± 4.30 (range 1-22; median 4; IQR 5) fractures per image, and reader 2 marked 1567 fractures for an average of 4.64 ± 4.08 (range 1-27; median 4; IQR 4) fractures per image. Percent overlap and IOU remained essentially identical at 84% and 0.62. Precision, recall, and F2 score all decreased slightly to 0.777, 0.709, and 0.721. If we were to assume the first reader caught all fractures during their reads, the second reader was able to find 71% of the fractures in their reads. This again leaves over one-quarter of all fractures undetected between expert radiologists.

Base network performance
Base network performance was evaluated for single-model performance of either RetinaNet or YOLOv5 using histogram equalization image pre-processing (method (a) from Fig. 4). These results are presented in the Standard rows in Table 2 (and below with the nomenclature 1x-R a and 1x-Y a). RetinaNet achieved 0.892 ± 0.015 precision, 0.430 ± 0.014 recall, and 0.480 ± 0.014 F2 score, whereas YOLOv5 scored 0.897 ± 0.032 precision, 0.434 ± 0.040 recall, and 0.484 ± 0.037 F2 score. When compared to expert-level human performance, both networks had markedly higher precision but lower recall and therefore lower F2 scores. When either network proposed a region as a potential rib fracture, that prediction was correct roughly 90% of the time. However, both networks detected less than half of all rib fractures in the test set.

Avalanche decision
Table 2 presents results with the avalanche schemes applied to single RetinaNet and YOLOv5 models trained on the histogram equalized inputs. The posterior scheme with RetinaNet reduced precision to 0.141 ± 0.015, an 84.19% decrease, whereas recall increased 102.8% to 0.872 ± 0.013. This, however, led to an F2 score of 0.427 ± 0.026, which is 11% lower than standard. The best performing avalanche scheme for RetinaNet was the conservative scheme, where the 40.6% reduction in precision and 69.8% increase in recall saw the F2 score increase to 0.679 ± 0.010 (+41.5%).
Interestingly, YOLOv5 had a relatively minor decrease in precision but marked improvement in recall, and therefore F2 score, with the avalanche decision schemes. Precision decreased by 7-15% across the schemes while recall increased by 35-49%. As a result, the lowest performing YOLOv5 model with an avalanche scheme had an F2 score of 0.615 ± 0.050 and was still 27.1% better than standard; the best performance came from the posterior scheme with an F2 score of 0.652 ± 0.051 (+34.7%). Considering the large number of different avalanche schemes, the remaining results will only present schemes corresponding to best F2 score performance for each model or ensemble.

Combining avalanching, input processing, and ensembling
The performance of the combined methods is presented in Tables 3 and 4, with the former including evaluations only with the standard decision scheme (i.e., fixed acceptance threshold of 0.50 for all model predictions) and the latter including the best avalanche scheme for each model and/or ensemble. Table 3 also includes inter-reader variability performance at the top for comparison between expert human readers and deep learning models. For full results of all models and ensembles, see Supplementary Tables S1 and S2. An explanation of the model nomenclature is presented in Fig. 6.
Using the standard decision scheme, the four three-model ensembles, 3x-R c, 3x-R*, 3x-Y c, and 3x-Y*, performed very similarly with regard to precision, scoring within 0.6% of one another. Compared to their single-model versions, ensemble methods resulted in improved recall and thus F2 scores. While the best single-model recall was 0.464 ± 0.043, the worst performing three-model ensemble (3x-R*) achieved 0.523 ± 0.014 (+12.7%) and the best performing ensemble (3x-Y c) reached 0.599 ± 0.021 (+29.1%). This led to an F2 score of 0.633 ± 0.018 with the 3x-Y c model. The mean average precision (mAP) trended upward as ensemble size increased and had similar trends as the F2 score, demonstrating that these models have similar rankings in performance across a range of inference hyper-parameters. After applying avalanche decision schemes, we saw the expected decrease in precision and increase in recall. The 3x-Y c ensemble with the γ = 0.20 decision scheme had the superior performance among three-model ensembles, achieving an F2 score of 0.725 ± 0.012, which is within 1% of expert human-level performance. Notably, the three-model ensembles with standard decision schemes have lower F2 scores than single models with avalanche schemes, but in exchange they maintain much higher precision values that exceed the inter-reader performance.
Table 2. RetinaNet and YOLOv5 single-model results comparing performance with the standard fixed decision threshold and applying the various avalanche schemes. γ represents the constant rate reduction between each decision threshold in the avalanche scheme. Bolded values highlight the highest value for each metric for each model.
Six-model ensembles with standard decision schemes have lower precision scores than prior model and ensemble sizes, though they remain on par with expert-level performance. Once again, YOLOv5 with the blended, method (c) input images achieved the highest F2 score at 0.671 ± 0.007, a 6% improvement over its three-model variant. With avalanche schemes incorporated, the 3x-Y c ensemble using the γ avalanche scheme with a fixed rate of 0.20 provided the highest F2 score of 0.725 ± 0.012. The most complex 3x-R*+3x-Y* model was unable to achieve the highest performance in any metric, and in fact even had a slightly lower mAP score than the 1x-Y c models, though its recall and F2 performance were still second-highest among the standard decision threshold models/ensembles.

Max F2 score between standard and avalanche schemes
In order to get a better understanding of how well the avalanche schemes perform compared to the standard inferencing technique, we plotted the F2 scores of a handful of test cases across all possible decision threshold values in Fig. 7. For the avalanche methods, the x-axis of these plots represents the starting threshold ($a_0$) that is then potentially reduced if proposed regions have probabilities greater than $a_0$.
We chose one model from the single-model group: the single RetinaNet using the histogram equalized images. We then chose the best three-member ensemble (3x-Y c) and the most diverse ensemble (the six-member 3x-R*+3x-Y*). Each of their best corresponding avalanche decision schemes was plotted along with their traditional decision scheme performances. In the 1x-R a and 3x-Y c cases, the avalanche scheme performs better than the standard decision scheme across a majority of the possible decision thresholds. For the 3x-R*+3x-Y* ensembles, the standard scheme outperforms the avalanche scheme for thresholds less than around 0.50. In all three cases, not only do the avalanche schemes generally perform better than their standard decision scheme counterparts, but the maximum F2 scores were also higher than what the standard decision versions could attain. This maximum F2 performance can also be seen in the last column of Tables 3 and 4. For each model and/or ensemble, the Max F2 value of its corresponding avalanche scheme is significantly higher than the standard schemes, many reaching a score above 0.9.
Table 3. Performance results of selected models with standard decision threshold. I.R.V. represents inter-reader variability performance between two radiologists. Bolded values represent the top two scores for each metric. Superscripts a, b, and c represent the type of input processing to train the models as shown in Fig. 4. Ensembles with * have hybrid inputs, i.e., each ensemble member was trained on a different input processing method.

Limitations and future work
This work demonstrated the value of ensembling models to increase recall. We acknowledge that there could be multiple ways to combine model results. In this effort, we only explored combining model results with non-maximum suppression (NMS) of all proposed boxes. This is an affirmative approach that inherently will only result in equal or better recall (coupled with equal or worse precision). Future work could explore additional methods for merging model results. This work uses a relatively small dataset from a single institution; future efforts are needed to replicate these results and ensure generalizability in larger, more diverse datasets. Moreover, our 624 fracture-present images were labeled with single reads and therefore our fracture-present labels are noisy and likely contain errors. This is especially likely considering the challenge of detecting pediatric fractures and given that our inter-reader variability sub-study showed that, along with other measures, the recall between radiologists was just under 71%, demonstrating that over one-quarter of all fractures were missed by the second reader. Future work using consensus interpretation is needed to improve our labels. Finally, the performance was evaluated on a test set with half fracture-present cases and half fracture-absent cases. While the prevalence of rib fractures in real-world clinical settings will vary depending on the site and nature of the practice (out-patient versus emergency room setting, etc.), this 50% prevalence test set contains a higher likelihood of fractures than would be encountered in practice. Future work is needed to determine performance in realistic clinical settings with more thorough comparisons to expert human performance. Furthermore, future work is needed to evaluate the proposed methods as an in-line augmentation strategy with a radiologist serving as the arbitrator of model predictions; these types of future evaluations are critical for determining the ultimate clinical impact of AI-assisted and AI-augmented interpretation tools.

Conclusion
We demonstrate multiple methods that improve the sensitivity (i.e., recall) performance of two state-of-the-art object detectors on a custom curated dataset of pediatric chest radiographs. This includes an improvement to our novel dynamic decision threshold avalanche scheme as well as three methods of pre-processing the images. Additionally, various ensembling approaches combined with the aforementioned techniques were investigated. These techniques traded reduced precision for higher recall, resulting in improved F2 scores. Simple ensembles, such as same-model three- and six-model ensembles, offered straightforward improvements over single-model detectors. This is likely due to the enhanced generalizability gained by training each ensemble member on different cross-validation folds of the training and validation data sets. Interestingly, many of the best performing models utilized the blended method (c) of pre-processing, where each channel of the input images was processed differently. The method with the highest F2 score was an ensemble of three YOLOv5 models using the input (c) pre-processing and the γ = 0.20 avalanche scheme. This model achieved an F2 score of 0.725 ± 0.012, which was only approximately 1% below the inter-reader F2 score of expert radiologists of 0.732, with a recall score exceeding the experts at 0.780 ± 0.030 versus 0.718. Overall, this work demonstrates promising …

Figure 1. Summary of the major contributions of this paper including curation of a labeled dataset and the addition of up to three high-sensitivity methods (avalanche decision scheme, varied-input processing, and ensembling) to state-of-the-art pre-trained object detectors. In addition, inter-reader variability between expert radiologists was evaluated on 338 of the 624 fracture-present radiographs.

Figure 2. Representative results from multi-class segmentation showing manually labeled images (left) and U-Net results (right) with final automatically-generated cropped region represented by red box.

Figure 3. Plot of relative decision threshold for bounding box acceptance as a function of the number of accepted proposals.

Figure 6. Explanation of model nomenclature for ensembling combined with different input processing. The selection of input processing [a,b,c] is described in Fig. 4, and varied input processing [*] uses one from each input processing type. For example, results presented for method 3x-R* would be for an ensemble of three RetinaNet models (trained on different folds) using varied input processing.

Figure 7. F2 scores across all possible confidence thresholds for 1x-R a models, 3x-Y c ensembles, and the hybrid 3x-R* + 3x-Y* ensemble. Each dashed line represents performance from one model or combination of ensembles. The best performing avalanche scheme ('Conservative') is compared to the 'Standard' decision scheme. Generally, the avalanche scheme performed better than conventional inferencing, except for low decision thresholds. In every case, the avalanche decision scheme reaches higher max F2 scores than standard.

Table 4. Performance of models from Table 3 with their best corresponding avalanche decision scheme result with respect to F2 score. Bolded values represent the top two scores for each metric. Superscripts a, b, and c represent the type of input processing to train the models as shown in Fig. 4. Ensembles with * have hybrid inputs, i.e., each ensemble member was trained on a different input processing method.