Main

Advances in robotics and artificial intelligence (AI) are bringing autonomous surgical systems closer to reality. However, developing the AI backbones of such systems currently depends on collecting training data during routine surgeries. This remains one of the largest barriers to widespread use of AI systems in interventional clinical settings, versus triage or diagnostic settings, as the acquisition and annotation of interventional data is time intensive and costly. Furthermore, while this approach can contribute to the automation or streamlining of existing surgical workflows, robotic and autonomous systems promise even more substantial advances: novel and super-human techniques that improve outcomes, save time or mitigate human error. This is perhaps the most exciting frontier of computer-assisted intervention research.

Conventional approaches for curating data for AI development (that is, sourcing it retroactively from clinical practice) are insufficient for training AI models that benefit interventions that use novel instrumentation, different access points or more flexible imaging. This is because they are, by definition, incompatible with contemporary clinical practices and such data do not emerge from routine care. Furthermore, these novel systems are not readily approved, and thus not easily or quickly introduced into clinical practice. Ex vivo experimentation does not suffer the same ethical constraints; however, it is costly and requires mature prototypes, and therefore does not scale well.

A promising alternative to these strategies is simulation, that is, the in silico generation of synthetic interventional training data and imagery from human models. Simulation offers a rich environment for training both human and machine surgeons alike, and sidesteps ethical considerations that arise when exploring procedures outside the standard of care. Perhaps most importantly, in silico surgical sandboxes enable rapid prototyping during the research phase. Simulation paradigms are inexpensive, scalable and rich with information. While intra-operative data are generated in highly unstructured and uncontrolled environments, and require manual annotation, simulation can provide detailed ground-truth data for every element of the procedure, including tool and anatomy pose, which are invaluable for AI development.

However, simulations can fall short of real surgery in one key aspect: realism. The difference in characteristics between real and simulated data is commonly referred to as the ‘domain gap’. The ability of an AI model to perform on data from a different domain, that is, with a domain gap from the data it was trained on, is called ‘domain generalization’. Domain gaps are problematic because of the well-documented brittleness of AI systems1, which exhibit vastly deteriorated performance across domain gaps. This may happen even with simple differences, such as noise statistics, contrast level and other minutiae2,3,4,5. This unfortunate circumstance, which applies to all machine learning tasks, has motivated research in the AI field on simulation-to-reality (Sim2Real) transfer, the development of domain transfer methods.

In this Article, we present SyntheX, a framework for developing generalizable AI algorithms for X-ray image analysis solely based on synthetic data simulated from annotated computed tomography (CT). Using realistic simulation of X-ray image formation from CT and using domain randomization to train AI models, SyntheX creates AI models that retain their performance under domain shift, enabling evaluation and deployment on clinical X-rays acquired in the real world. The overall concept of SyntheX is illustrated in Fig. 1 and we demonstrate its utility and validity on three clinical applications: hip imaging, surgical robotic tool detection and coronavirus disease 2019 (COVID-19) lesion segmentation.

Fig. 1: Overall concept of SyntheX.
figure 1

Top: conventional approach for learning-based tasks on medical imaging. Curating a relevant database of real X-ray samples requires real-data acquisition and costly annotation from domain experts. Bottom: SyntheX enables simplified and scaled-up data curation because data generation is synthetic and synthesized data can be annotated automatically through propagation from the 3D model, which can be CT scans or volumetric surgical tool models. SyntheX results in deep learning image analysis models that perform comparably to or better than real-data-trained models. Figure created with Biorender.com.

At the core of our report is an experiment on precisely controlled data from the hip-imaging task that isolates and quantifies the effect of domain shift for AI-based X-ray image analysis. Using CT images from human cadavers and corresponding C-arm X-ray images acquired from two different imaging systems during surgical exploration, we generated a hip-image dataset consisting of geometrically identical images across the various synthetic domains and the real domain to train AI models for hip-image analysis. To our knowledge, no study so far has isolated the effect of domain generalization using precisely matched datasets across domains. This work also demonstrates a feasible and cost-effective way to train AI image analysis models for clinical intervention on synthetic data that provides performance comparable to training on real clinical data in multiple applications. We also demonstrate that the model’s performance increases substantially as the number of synthetic training samples increases, which highlights the key advantage of SyntheX: making available large amounts of well-annotated data for model training or pre-training.

Clinical tasks

We demonstrate the benefits of SyntheX on three X-ray image analysis downstream tasks: hip imaging, surgical robotic tool detection and COVID-19 lesion segmentation in chest X-ray (Fig. 2). All of the three tasks use deep neural networks to make clinically meaningful predictions on X-ray images. We introduce the clinical motivations for each task in the following sections. Details of the deep network and training/evaluation paradigm are described in ‘Model and evaluation paradigm’.

Fig. 2: Clinical tasks.
figure 2

a, Hip imaging. The hip anatomical structures include left and right hemipelvis, lumbar vertebrae, upper sacrum, and left and right femurs, which are illustrated by different colours in the leftmost hip rendering. The anatomical landmarks consist of left (L.) and right (R.) anterior superior iliac spine (ASIS), centre of femoral head (FH), superior pubic symphysis (SPS), inferior pubic symphysis (IPS), medial obturator foramen (MOF), inferior obturator foramen (IOF) and the greater sciatic notch (GSN). These landmarks are useful in identifying the anterior pelvic plane and initializing the 2D/3D registration of both pelvis and femur81,82. b, Surgical robotic tool detection. An illustration of the image-guided robotic surgical system is shown on the left. A picture of the continuum manipulator (CM) is shown in the top right corner. An example real X-ray image and the corresponding segmentation and landmarks of the CM are shown on the right. c, COVID-19 CXR lesion segmentation. A real CXR image of COVID-19 infection is shown with its lesion segmentation mask.

Hip imaging

Computer-assisted surgical systems for X-ray-based image guidance have been developed for trauma surgery6, total hip arthroplasty7, knee surgery8, femoroplasty9, pelvis osteotomy10 and spine surgery11. The main challenge in these procedures is to facilitate intra-operative image-based navigation by continually recovering the spatial tool-to-tissue relationships from two-dimensional (2D) transmission X-ray images. One effective approach to achieving spatial alignment is the identification of known structures and landmarks in the 2D X-ray image, which then are used to infer poses12,13.

In the context of hip imaging, we define six anatomical structures and fourteen landmarks as the most relevant known structures. They are shown in Fig. 2a. We trained deep networks using SyntheX to make these detections on X-ray images. Synthetic images were generated using CT scans selected from the New Mexico Decedent Image Database14. The three-dimensional (3D) anatomical landmarks were manually annotated and the anatomical structures were segmented using the automatic method described in ref. 15, which were then projected to 2D as labels following the simulation X-ray geometries. We evaluate the performance of our model on 366 real X-ray images collected from 6 cadaveric specimens using the Siemens CIOS Fusion imaging system and another 60 real X-ray images from a separate cadaveric specimen using the BrainLab LoopX imaging system. On real images, ground-truth target structures were annotated semi-automatically. This real dataset also serves as the basis for our precisely controlled experiments that isolate the effect of the domain gap. We provide substantially more details on the creation, annotation and synthetic duplication of this dataset in ‘Precisely matched hip dataset’.

Surgical robotic tool detection

Automatic detection of the surgical tool from intra-operative images is an important step for robot-assisted surgery as it enables vision-based control16. Because training a detection model requires sufficient image data with ground-truth labels, developing such models is possible only after the surgical robot is mature and deployed clinically. We demonstrate AI model development for custom and pre-clinical surgical robotic tools.

We consider a continuum manipulator (CM) as the target object. CMs have been investigated in minimally invasive robot-assisted orthopaedic procedures because of their substantial dexterity and stiffness17,18, but they are not currently used clinically nor easily manufactured for extensive cadaveric testing. Using SyntheX, we address CM detection, which consists of segmenting the CM body and predicting distinct landmarks in the X-ray images. The semantic segmentation mask covers the 27 alternating notches that discern the CM from the other surgical tools; the landmarks are defined as the start and end points of the CM centreline19. Synthetic images were generated using CT scans selected from the New Mexico Decedent Image Database14 and a computer-aided design model of the CM. Three-dimensional CM segmentations and landmark locations were determined through forward kinematics and then projected to 2D as training labels using the X-ray geometry. The performance was evaluated on 264 real X-ray images of the CM during pre-clinical cadaveric testing. These images were acquired under different scenarios, including different cadaver specimens, the presence or absence of an inserted drilling tool, varying tool positions and multiple scanner acquisition settings. We present example simulation and real X-ray images in Extended Data Fig. 1. On real images, ground-truth segmentation masks and landmark locations were annotated manually.

COVID-19 lesion segmentation

Chest X-ray (CXR) has emerged as a major tool to assist in COVID-19 diagnosis and guide treatment. Numerous studies have proposed the use of AI models for COVID-19 diagnosis from CXR, and efforts to collect and annotate large numbers of CXR images are underway. Annotating these images in 2D is expensive and fundamentally limited in its accuracy due to the integrative nature of X-ray transmission imaging. While localizing COVID-19 involvement is possible, quantitative analysis cannot be derived from CXR images alone. Given the availability of CT scans of patients suffering from COVID-19, we demonstrate lung-imaging applications using SyntheX.

Specifically, we consider the task of COVID-19 lesion segmentation, which is also possible from CXR, enabling comparison. We used the open-source COVID-19 CT dataset released by ImagEng lab20 and the CT scans released by the University of Electronic Science and Technology of China (UESTC)21 to generate synthetic CXR images. A 3D infection mask was created for each CT using the automatic lesion segmentation method COPLE-Net21. We followed the same realistic X-ray synthesis pipeline and generated synthetic images and labels using the paired CT scan and segmentation mask from various geometries. The lesion labels were projected following the same geometries. The segmentation performance was tested on the benchmark dataset QaTa-COV1922, which contains 2,951 real COVID-19 CXR samples. Ground-truth segmentation masks for the COVID-19 lesions in these CXR images are supplied with the benchmark, and were created in a human–machine collaborative approach.

Precisely controlled investigations on hip imaging

Beyond presenting SyntheX for various clinical tasks, we present experiments on a unique dataset for hip imaging that enables the isolation of the effect that the domain gap has on Sim2Real AI model transfer. On the task of anatomical landmark detection and anatomy segmentation in hip X-ray, we study the most commonly used domain generalization techniques, namely, domain randomization and domain adaptation, and further consider different X-ray simulators, image resolution and training dataset size. We introduce details on these experiments next.

Precisely matched hip dataset

We created an accurately annotated dataset of 366 real hip fluoroscopic images and corresponding high-resolution CT scans of six lower torso cadaveric specimens with manual label annotations12, which constitutes the basis of our unique dataset that enables precisely controlled benchmarking of domain shift. For each of the real X-ray images, the X-ray camera pose was accurately estimated using a comprehensive 2D/3D image registration pipeline12. We then generated synthetic X-ray images (digitally reconstructed radiographs (DRRs)) that precisely recreate the spatial configurations and anatomy of the real X-ray images and differ in only the realism of the simulation (Fig. 3a). Because synthetic images precisely match the real dataset, all labels in 2D and 3D apply equally. Details of the dataset creation are introduced in ‘Benchmark hip-imaging investigation’.

Fig. 3: Precisely controlled hip-imaging X-ray database.
figure 3

a, Generation of precisely matched synthetic and real X-ray database. Real X-rays and CT scans are acquired from cadaveric specimens and registered to obtain the relative camera poses. Using these poses, synthetic X-rays can be generated from the CT scans that precisely match the real X-ray data in all aspects but appearance. b, Changes in (synthetic) X-ray appearance based on simulation paradigm.

We studied three different X-ray image simulation techniques: naive DRR generation, xreg DRR10 and DeepDRR23,24, which we refer to as naive, heuristic and realistic simulations. They differ in the considerations of modelling realistic X-ray imaging physical effects. Figure 3b shows a comparison of image appearance between the different simulators and a corresponding real X-ray image.
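The simulators differ mainly in how much of the X-ray image formation physics they model. As a rough illustration of the naive end of this spectrum, the sketch below traces rays from the source to each detector pixel and integrates a crude Hounsfield-to-attenuation mapping, without the spectral, scatter or noise modelling of DeepDRR; all parameter names are illustrative and this is not the xreg or DeepDRR implementation.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def naive_drr(ct_hu, spacing_xyz, src, det_origin, det_u, det_v, det_shape, n_steps=512):
    """Naive DRR: integrate a crude attenuation proxy along source-to-pixel rays.

    ct_hu        : CT volume in Hounsfield units, indexed (z, y, x)
    spacing_xyz  : voxel spacing in mm, ordered (x, y, z)
    src          : X-ray source position in volume coordinates (mm, (x, y, z))
    det_origin   : position of detector pixel (0, 0) in mm
    det_u, det_v : detector column/row direction vectors scaled by pixel size (mm)
    det_shape    : (rows, cols) of the detector
    """
    src, det_origin, det_u, det_v = map(np.asarray, (src, det_origin, det_u, det_v))
    spacing_xyz = np.asarray(spacing_xyz)
    mu = np.clip((ct_hu + 1000.0) / 1000.0, 0.0, None)                 # crude HU -> attenuation proxy
    rows, cols = det_shape
    u, v = np.meshgrid(np.arange(cols), np.arange(rows))
    pix = det_origin + u[..., None] * det_u + v[..., None] * det_v     # pixel centres, shape (rows, cols, 3)
    t = np.linspace(0.0, 1.0, n_steps)[:, None, None, None]
    pts = src + t * (pix - src)                                        # sample points along each ray (mm)
    vox = (pts / spacing_xyz)[..., ::-1]                               # mm -> voxel index, reordered to (z, y, x)
    samples = map_coordinates(mu, vox.reshape(-1, 3).T, order=1, mode="constant")
    line_integral = samples.reshape(n_steps, rows, cols).sum(axis=0)
    step_len = np.linalg.norm(pix - src, axis=-1) / n_steps            # mm per sample, per pixel
    return np.exp(-line_integral * step_len)                           # Beer-Lambert intensity
```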

We collected data on an additional lower torso cadaveric specimen using the BrainLab LoopX imaging system, which differs from the Siemens CIOS Fusion C-arm system used to collect the 366 images of the controlled study. High-resolution CT scans of the specimen were acquired. We collected 60 X-ray images of the cadaveric specimen to test our model’s generalization performance. These data differ from all images previously used in the controlled investigations for training and testing with regard to anatomy, acquisition protocol and X-ray scanner characteristics. We performed the same 2D/3D image registration pipeline and generated 2D segmentation and landmark labels.

Domain randomization and adaptation

Domain randomization is a domain generalization technique that inflicts marked changes on the appearance of the input images during training. The markedly altered training samples force the network to discover more robust associations between input image features and the desired target. Such robust associations have been demonstrated to improve the generalization of machine learning models when transferred from one domain to another (here, from simulated to real X-ray images). We implemented two levels of domain randomization effects, namely, regular domain randomization and strong domain randomization. Details are described in ‘Domain randomization’.

Unlike domain randomization, which does not assume knowledge or sampling of the target domain at training time, domain adaptation techniques attempt to mitigate the domain gap’s detrimental effect by aligning features across the source domain (training domain; here, simulated data) and target domain (deployment domain; here, real X-ray images). As such, domain adaptation techniques require samples from the target domain at training time. Recent domain adaptation techniques have increased the suitability of the approach for Sim2Real transfer because they now allow for the use of unlabelled data in the target domain. We conducted experiments using two common domain adaptation methods: CycleGAN, a generative adversarial network with cycle consistency25, and adversarial discriminative domain adaptation (ADDA)26. The two methods are similar in that they attempt to align properties of real and synthetic domains, and differ based on what properties they seek to align. While CycleGAN operates directly on the images, ADDA seeks to align higher-level feature representations, that is, image features after multiple convolutional neural network layers. Example CycleGAN-generated images are shown in Fig. 3b. More details of CycleGAN and ADDA training are provided in ‘Domain adaptation’.

Model and evaluation paradigm

As the focus of our experiments is to demonstrate convincing Sim2Real performance, we rely on a well-established backbone network architecture, namely, TransUNet27, for all tasks. TransUNet is a state-of-the-art medical image segmentation framework, which has shown convincing performance across various tasks27. Segmentation networks for all clinical applications are trained to minimize the Dice loss (Lseg)28, which evaluates the overlap between predicted and ground-truth segmentation labels. For hip-image analysis and surgical tool detection, we adjust the TransUNet architecture as shown in Extended Data Fig. 2 to concurrently estimate landmark locations. Reference landmark locations are represented as symmetric Gaussian distributions centred on the true landmark locations (zero when the landmark is invisible). This additional prediction target is penalized using (Lld), the mean squared error between network prediction and reference landmark heatmap.
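For concreteness, a minimal PyTorch-style sketch of this multi-task objective is given below. The relative weight w_ld between the Dice and heatmap terms is an assumption rather than a value taken from this work, and the tensor layouts (one-hot segmentation targets, one heatmap channel per landmark) are illustrative.

```python
import torch
import torch.nn.functional as F

def dice_loss(seg_logits, seg_target, eps=1e-6):
    """Soft Dice loss (L_seg) over per-class probability maps; seg_target is one-hot, shape (B, C, H, W)."""
    probs = torch.softmax(seg_logits, dim=1)
    dims = (0, 2, 3)                                   # reduce over batch and spatial dimensions
    intersection = (probs * seg_target).sum(dims)
    union = probs.sum(dims) + seg_target.sum(dims)
    return 1.0 - ((2.0 * intersection + eps) / (union + eps)).mean()

def multitask_loss(seg_logits, seg_target, lmk_heatmaps, lmk_target, w_ld=1.0):
    """Combined objective: Dice segmentation loss plus mean-squared-error heatmap loss (L_ld)."""
    l_seg = dice_loss(seg_logits, seg_target)
    l_ld = F.mse_loss(lmk_heatmaps, lmk_target)        # Gaussian heatmaps, zero where the landmark is invisible
    return l_seg + w_ld * l_ld
```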

For evaluation purposes, we report the landmark accuracy as the l2 distance between predicted and ground-truth landmark positions. Further, we use the Dice score to quantitatively assess segmentation quality for hip imaging and surgical tool detection. The COVID-19 lesion segmentation performance is reported using confusion matrix metrics to enable comparison with previous work21.

For all three tasks, we report both Sim2Real and Real2Real (reality-to-reality) performances. The Sim2Real performance was computed on all testing real X-ray data. The Real2Real experiments were conducted using k-fold cross-validation, and we report the performance as an average of all testing folds. For the hip-imaging benchmark studies, we further carefully designed the evaluation paradigm in a leave-one-specimen-out fashion. For each experiment, the training and validation data consisted of all labelled images from all but one specimen while all labelled images from the remaining specimen were used as test data. The same data split was strictly preserved also for training of domain adaptation methods to avoid leakage and optimistic bias. On the scaled-up dataset, we used all synthetic images for training and evaluated on all real data in the benchmark dataset.

A specially designed assessment plot is used for reporting pelvic landmark detection performance. This way of measuring landmark detection performance provides detailed information on the two desirable attributes of such an algorithm: (1) completeness and (2) precision of detected landmarks. The direct network output for each landmark prediction is a heatmap intensity image (I). To quantify the landmark prediction confidence, we compute a normalized cross-correlation between I and the Gaussian landmark heatmap Igauss, ncc(I, Igauss)12. Landmarks are considered valid (activated) if ncc(I, Igauss) is higher than a confidence threshold ϕ, that is, ncc(I, Igauss) > ϕ. The kth predicted landmark location \({{{\bf{x}}}}_{{\mathrm{p}}}^{k}\) is reported as the image coordinate of the maximum-intensity pixel. Given the ground-truth location \({{{\bf{x}}}}_{{\mathrm{g}}}^{k}\), the mean landmark detection error (eld) is reported as the average l2 distance over all activated landmarks: \({e}^{{\mathrm{ld}}}=\frac{1}{K}\mathop{\sum }\nolimits_{k = 1}^{K}\| {{{\bf{x}}}}_{{\mathrm{p}}}^{k}-{{{\bf{x}}}}_{{\mathrm{g}}}^{k}{\| }_{2}\), \(k\in \{{\mathrm{ncc}}({I}^{k},{I}_{{\mathrm{gauss}}}^{k}) > \phi \}\), where K is the total number of activated landmarks. The ratio (p) of the activated landmarks over all landmarks is a function of ϕ. Thus, we created plots to demonstrate the relationship between eld and p, which show how the error changes as we lower the threshold to activate more landmarks. Ideally, a model would have a 0.0 mm error with a 100% activation percentage, corresponding to a measurement in the bottom right corner of the plots in Fig. 4. Following the convention in previous work12, we selected a threshold of 0.9 (ncc(I, Igauss) > 0.9) to report the numeric results for all ablation study methods in Table 1. This threshold selects the network’s confident predictions for evaluation.
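A compact sketch of this evaluation is given below, under the assumption that predicted and reference heatmaps are available as 2D arrays and that a pixel-to-millimetre scale factor is known; the helper names are illustrative.

```python
import numpy as np

def ncc(a, b, eps=1e-8):
    """Normalized cross-correlation between a predicted heatmap and the reference Gaussian heatmap."""
    a = (a - a.mean()) / (a.std() + eps)
    b = (b - b.mean()) / (b.std() + eps)
    return float((a * b).mean())

def landmark_error_curve(pred_maps, gauss_maps, gt_xy, px_to_mm, thresholds):
    """Sweep the confidence threshold phi and report (activation percentage, mean error in mm) pairs."""
    scores = np.array([ncc(p, g) for p, g in zip(pred_maps, gauss_maps)])
    # predicted location: image coordinate (x, y) of the maximum-intensity pixel
    pred_xy = np.array([np.unravel_index(p.argmax(), p.shape)[::-1] for p in pred_maps])
    errors = np.linalg.norm(pred_xy - np.asarray(gt_xy), axis=1) * px_to_mm
    curve = []
    for phi in thresholds:
        activated = scores > phi                       # landmarks with ncc above the confidence threshold
        if activated.any():
            curve.append((100.0 * activated.mean(), errors[activated].mean()))
    return curve
```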

Fig. 4: Plots of average landmark detection error with respect to activated landmark percentage.
figure 4

The Real2Real performance on the controlled dataset is shown in gold. An ideal curve should approach the bottom right corner: all landmarks detected with perfect localization. Each plot compares the baseline Real2Real performance curve to various Sim2Real methods that are evaluated on the same real data test set. The Sim2Real technique of the specific method is identified in the top legend of each plot. We use real, realistic, heuristic and naive to refer to the image domains with decreasing level of realism, which are defined in ‘Benchmark hip-imaging investigation’. Domain names followed by ‘CycleGAN’ mean the training data are generated using CycleGAN trained between the specific image domain and the real image domain; ‘reg DR’ and ‘str DR’ refer to regular domain randomization and strong domain randomization, respectively. a–c, Performance comparison of methods trained on precisely matched datasets. d–f,i, Evaluation of the added effect of using domain adaptation techniques, again using precisely matched datasets. g,h, Improvements in Sim2Real performance on the same real data test set when a larger, scaled-up synthetic training set is used. All the results correspond to an input image size of 360 × 360 px.

Table 1 Hip-imaging landmark detection errors and segmentation Dice scores

Results

Primary findings

We find that across all three clinical tasks, namely, hip imaging, surgical robotic tool detection and COVID-19 lesion segmentation, models trained using the SyntheX Sim2Real model transfer paradigm perform, when evaluated on real data, comparably to or even better than models trained directly on real data. This finding suggests that SyntheX, that is, the realistic simulation of X-ray images from CT combined with domain randomization, is a feasible, cost- and time-effective approach to the development of learning-based X-ray image analysis algorithms that preserve performance during deployment on real data.

Hip imaging

We present the multi-task detection results of hip imaging on images with 360 × 360 px in Extended Data Tables 1 and 2. Both landmark detection and anatomical structure segmentation performance achieved using SyntheX Sim2Real model transfer are superior to those of Real2Real when considering averaged metrics. The Sim2Real predictions are also more stable: the standard deviation of the landmark error is 3.52 mm and that of the Dice score is 0.21, compared with 8.21 mm and 0.25, respectively, for Real2Real. We attribute this improvement to the flexibility of the SyntheX approach, which makes it possible to simulate a richer spectrum of image appearances from more hip CT samples and varied X-ray geometries than the limited data sourced from complex real-world experiments.

Our Sim2Real model’s performance on the 60 real X-ray images acquired by the BrainLab LoopX imaging system achieves a mean landmark detection error of 6.16 ± 5.15 mm and a Dice score of 0.84 ± 0.12, which is similar to the performance reported on the 366 Siemens real X-rays. This result suggests the strong generalization ability of the SyntheX-trained model across different imaging acquisition systems.

Considering individual anatomical landmarks and structures, we noticed that the Sim2Real detection accuracies of most landmarks are superior or comparable to the Real2Real accuracies, except for the superior and inferior pubic symphysis. This is potentially because the left and right positions of the superior and inferior pubic symphysis are very close, and thus their local geometric features are ambiguous during simulation. The Sim2Real segmentation performance is consistently better than Real2Real for all six structures. The detection accuracy of the ASIS landmarks and the segmentation accuracy of the sacrum are the worst in both Sim2Real and Real2Real, because their feature appearance changes more drastically with varying projection geometry than that of the other landmarks and structures.

In addition, we specifically studied how the Sim2Real performance changes with respect to the number of generated training samples. In the hip-imaging task, we generated an increasing number of scaled-up simulation images as training data using CT scans from the New Mexico Decedent Image Database14. We generated 500 synthetic X-ray images for every CT scan following the same randomized geometry distribution, and created four training datasets that contain 1,000, 2,000, 5,000 and 10,000 images. We trained the same network model using the same hyperparameters on these four datasets until convergence and reported testing performance on the 366 real hip X-ray images. The landmark performance curves are presented in Extended Data Fig. 3 and the numeric results in Extended Data Table 3. We can clearly observe that the Sim2Real performance consistently improves as the number of training samples increases.

Surgical robotic tool detection

The results of the surgical tool detection task are summarized in Extended Data Tables 4 and 5. The landmark detection errors of Sim2Real and Real2Real are comparable, with mean localization errors of 1.10 mm and 1.19 mm, respectively. However, the standard deviation of the Sim2Real error is substantially smaller: 0.88 mm versus 2.49 mm. Further, with respect to segmentation Dice score, Sim2Real outperforms Real2Real by a large margin, achieving a mean Dice score of 0.92 ± 0.07 compared with 0.41 ± 0.23. Overall, the results suggest that SyntheX is a viable approach to developing deep neural networks for this task, especially when the robotic hardware is in the prototype stage.

COVID-19 lesion segmentation

The results of COVID-19 lesion segmentation are presented in Extended Data Table 6. The overall mean accuracy of SyntheX training reaches 85.03% compared with 93.95% for the real data training. The Sim2Real performance is similar to Real2Real in terms of sensitivity and specificity, but falls short in the other metrics. As the 3D CT scans for training X-ray image generation were from different patients compared with the real X-rays and the lesion annotations were performed by different expert clinicians, there is an inconsistency in the lesion appearance between training data and real X-ray data, which potentially causes the performance deterioration. Similar effects have previously been reported for related tasks, such as lung nodule detection29 and thoracic disease classification30. The results suggest that SyntheX is capable of handling soft tissue-based tasks, such as COVID-19 lesion segmentation.

Sim2Real benchmark findings

On the basis of our precisely controlled hip-imaging ablation studies, including comparisons of (1) simulation environment, (2) domain randomization and domain adaptation effects, and (3) image resolution, we observed that training using realistic simulation with strong domain randomization performs on a par with models trained on real data or models trained on synthetic data with domain adaptation, yet does not require any real data at training time. Training using realistic simulation consistently outperformed naive or heuristic simulations. The above findings can be observed in Fig. 4 and Table 1, where the model trained with realistic simulation achieved a mean landmark detection error of 6.44 ± 7.05 mm and a mean Dice score of 0.80 ± 0.23. The mean landmark and segmentation results of the Real2Real and realistic-CycleGAN models were 6.46 ± 8.21 mm, 0.79 ± 0.25, and 6.62 ± 6.82 mm, 0.80 ± 0.23, respectively. The mean landmark errors of the heuristic and naive models were all above 7 mm, and their mean Dice scores were all below 0.80. Training using scaled-up realistic simulation data with domain randomization achieved the best performance on this task, even outperforming real-data-trained models owing to the larger training set. The best performance results are highlighted in Table 1. Thus, realistic simulation of X-ray images from CT combined with domain randomization, which we refer to as the SyntheX model transfer concept, is a highly promising approach to catalyse learning-based X-ray image analysis. The specially designed landmark detection error plot, which summarizes the results across all ablations on images with 360 × 360 px, is shown in Fig. 4. We plot the Real2Real performance as gold curves to serve as a baseline comparison for all the other ablation methods.

The effect of domain randomization

Across all experiments, we observed that networks trained with strong domain randomization consistently achieved better performance than those with regular domain randomization. This is expected because strong domain randomization introduces more drastic augmentations, which sample a much wider spectrum of possible image appearances and promote the discovery of more robust features that are less prone to overfitting. The only exception is training on naively simulated images, where strong domain randomization results in much worse performance than regular domain randomization. We attribute this to the fact that the contrast of bony structures, which is most informative for the task considered here, is already much less pronounced in naive simulations. Strong domain randomization then further increases problem complexity, to the point where performance deteriorates.

From Fig. 4a–c, we see that realistic simulation (DeepDRR) outperforms all other X-ray simulation paradigms in both regular domain randomization and strong domain randomization settings. Realistic simulation trained using strong domain randomization even outperforms Real2Real with regular domain randomization. As our experiments were precisely controlled and the only difference between the two scenarios is the image appearance due to varied simulation paradigm in the training set, this result supports the hypothesis that realistic simulation of X-rays using DeepDRR performs best for model transfer to real data. The strong domain randomization scheme includes a rich collection of image augmentation methods. The Sim2Real testing results on real X-ray images acquired from a different acquisition system, the BrainLab LoopX system, have shown similar performance. This suggests that models trained with SyntheX generalize to images across acquisition settings.

The effect of domain adaptation

From Fig. 4d,f, we observe that both realistic-CycleGAN and naive-CycleGAN achieve comparable performance to Real2Real. This means that images generated from synthetic images via CycleGAN have similar appearance, despite the synthetic training domains being different. The improvements over training purely on the respective synthetic domains (Fig. 4a,c) confirm that CycleGAN is useful for domain generalization. ADDA training also improves the performance over non-adapted transfer, but does not perform at the level of the CycleGAN models. Interestingly, ADDA with strong domain randomization shows deteriorated performance compared with regular domain randomization (Fig. 4e,i). This is because the marked and random appearance changes due to domain randomization complicate domain discrimination, which in turn has adverse effects on overall model performance.

Scaling up the training data

We selected the best-performing methods from the above domain randomization and domain adaptation ablations on the controlled dataset: realistic simulation with strong domain randomization and CycleGAN training based on realistic simulation, respectively. These methods were then trained on the scaled-up dataset, which contains a much larger variety of anatomical shapes and imaging geometries, that is, synthetic C-arm poses.

With more training data and geometric variety, we found that all scaled-up experiments outperform the Real2Real baseline on the benchmark dataset (Fig. 4g,h). The model trained with strong domain randomization on realistically synthesized but large data (SyntheX, as reported above) achieved a mean landmark distance error of 5.95 ± 3.52 mm and a mean Dice score of 0.86 ± 0.21. For segmentation performance, SyntheX is substantially better than the Real2Real baseline (P = 2.3 × 10−5 using a one-tailed t-test). Landmark detection also performed better, but the improvement was not significant at the P = 0.05 significance level (P = 0.14 using a one-tailed t-test), suggesting that our real dataset was adequate to train landmark detection models. Figure 5 presents a collection of qualitative visualizations of the detection performance of this synthetic-data-trained model when applied to real data. This result suggests that training with strong domain randomization and/or adaptation on large-scale, realistically synthesized data is a feasible alternative to training on real data. Training on large-scale data processed by CycleGAN achieved comparable performance (6.20 ± 3.56 mm) to pure realistic simulation with domain randomization, but comes with the disadvantage that real data with sufficient variability must be available at training time to enable CycleGAN training.

Fig. 5: Qualitative results of the segmentation and landmark detection.
figure 5

The results are presented as overlays on testing data using the model trained with scaled-up SyntheX data. Anatomical segmentation structures are blended with various colours. Landmark heatmap responses are visualized in green. The projection geometries corresponding to the images relative to a 3D bone mesh model of the anatomy are presented in the centre. The X-ray sources are shown as green dots and the principal rays are shown as green lines.

Discussion

We present general use cases of SyntheX for various scenarios, including purely bony anatomy (the hip), a metallic artificial surgical tool and soft tissue (lung COVID-19 lesion). Our experiments on three varied clinical tasks demonstrate that the performance of models trained using SyntheX—on real data—meets or exceeds the performance of real-data-trained models. We show that generating realistic synthetic data is a viable resource for developing machine learning models and is comparable to collecting large amounts of annotated real clinical data.

Using synthetic data to train machine learning algorithms is receiving increasing attention. In general computer vision, the Sim2Real problem has been explored extensively for self-driving perception31,32,33,34,35,36 and robotic manipulation37,38,39,40,41,42. In diagnostic medical image analysis, GAN-based synthesis of novel samples has been used to augment available training data for magnetic resonance imaging43,44,45,46,47,48, CT46,49, ultrasound50, retinal51,52,53, skin lesion54,55 and CXR56 images. In computer-assisted interventions, early successes on the Sim2Real problem include analysis on endoscopic images3,57,58,59 and intra-operative X-ray60,61,62. The controlled study here validates this approach in the X-ray domain by showing that Sim2Real compares favourably to Real2Real training.

The hip-imaging ablation experiments reliably quantify the effect of the domain gap on real data performance for varied Sim2Real model transfer approaches. This is because all aleatoric factors that usually confound such experiments are precisely controlled for, with alterations to image appearances due to the varied image simulation paradigms being the only source of mismatch. The aleatoric factors that we controlled include anatomy, imaging geometries, ground-truth labels, network architectures and hyperparameters. The number of training samples is the same for all experiments. Use of domain randomization and adaptation techniques does not create additional samples but merely changes the appearance of samples on the pixel level. In particular, the viewpoints and 3D scene recreated in the simulation were identical to the real images, which to our knowledge has not yet been achieved. From these results, we draw the following conclusions.

  • Physics-based, realistic simulation of training data using the DeepDRR framework results in models that generalize better to the real data domain compared with models trained on less realistic, that is, naive or heuristic, simulation paradigms. This suggests, not surprisingly, that matching the real image domain as closely as possible directly benefits generalization performance.

  • Realistic simulation combined with strong domain randomization (SyntheX) performs on a par with both the best domain adaptation method (CycleGAN with domain randomization) and real-data training when models are trained on matched datasets. However, because SyntheX does not require any real data at training time, this paradigm has clear advantages over domain adaptation. Specifically, it saves the effort of acquiring real data early in development or designing additional machine learning architectures that perform adaptation. This makes SyntheX particularly appealing for the development of novel instruments or robotic components, real images of which can simply not be acquired early during conceptualization.

Realistic simulation using DeepDRR is as computationally efficient as naive simulation, both of which are orders of magnitude faster than Monte Carlo simulation23. Further, realistic simulation using DeepDRR brings substantial benefits in regards to Sim2Real performance and self-contained data generation and training. These findings are encouraging and strongly support the hypothesis that training on synthetic radiographs simulated from 3D CT is a viable alternative to real data training, or at a minimum, a strong candidate for pre-training.

Compared with acquiring real patient data, generating large-scale simulation data is more flexible, time-efficient and low-cost, and it avoids privacy concerns. For the hip-image analysis use case, we performed experiments based on 10,000 synthetic images from 20 hip CT scans. Training with realistic simulation and strong domain randomization outperformed Real2Real training at the 90% activation level and generally improved performance, as seen by a flatter activation-versus-error curve (Fig. 4g). The performance of training with CycleGAN on larger datasets was similar. These findings suggest that scaling up data for training is an effective strategy to improve performance both inside and outside of the training domain. Scaling up training data is costly or impossible in real settings, but is easily possible using data synthesis. Having access to more varied data samples during training helps the network parameter optimization find a more stable solution that also transfers better.

We have found that Sim2Real model transfer performs best for scenarios where real data and corresponding annotations are particularly hard to obtain. This is evidenced by the change in the performance gap between Real2Real and Sim2Real training, where Sim2Real performs particularly well for scenarios where little real data are available, such as for hip imaging and robot tool detection, and hardly matches Real2Real performance for use cases where abundant real data exist, such as COVID-19 lesion segmentation. The value of SyntheX thus primarily derives from the possibility of generating large synthetic training datasets for innovative applications, for example, including custom-designed hardware19,63 or novel robotic imaging paradigms64,65, the data for which could not otherwise be obtained. Second, SyntheX can complement real datasets by providing synthetic samples that exhibit increased variability in anatomies, imaging geometries or scene composition. Finally, the SyntheX simulation paradigm enables generation of precise annotations, for example, the lesion volume in the COVID-19 use case, that could not be derived otherwise.

Interestingly, although domain adaptation techniques (CycleGAN and ADDA) have access to data in the real domain, these methods outperformed domain generalization techniques (here, domain randomization) by only a small margin in the controlled study. The performance of ADDA training heavily depends on the choices of additional hyperparameters, such as the design of the discriminator, number of training cycles between task and discriminator network updates, and learning rates, among others. Thus, it is non-trivial to find the best training settings, and these settings are unlikely to apply to other tasks. Because CycleGAN performs image-to-image translation, a complicated task, it requires sufficient and sufficiently diverse data in the real domain to avoid overfitting. Further, using CycleGAN requires an additional training step of a large model, which is memory intensive and generally requires a long training time. In certain cases, CycleGAN models could also introduce undesired effects. A previous study found that the performance of CycleGAN is highly dependent on the dataset, potentially resulting in unrealistic images with less information content than the original images66. Moreover, although ref. 67 showed that image-to-image translation may more closely approximate real X-rays according to image similarity metrics, our study shows that the advantage over domain randomization in terms of downstream task performance is marginal. Finally, because real domain data are being used in both domain adaptation paradigms, adjustments to the real-data target domain, for example, use of a different C-arm X-ray imaging device or design changes to surgical hardware, may require de novo acquisition of real data and re-training of the models. In contrast, SyntheX resembles a plug-and-play module that can be integrated into any learning-based medical imaging task and is easy to set up and use. Similar to multiscale modelling68 and in silico virtual clinical trials69,70, SyntheX has the potential to envision, implement and virtually deploy solutions for image-guided procedures and evaluate their potential utility and adequacy. This makes SyntheX a promising tool that may replicate traditional development workflows solely using computational tools.

Our scaled-up hip-imaging experiments using SyntheX achieved a mean landmark detection error of 5.95 ± 3.52 mm. A pelvic landmark detection error of 5–6 mm is frequently reported in the literature: ref. 12 reported a mean error of 5.0 mm and ref. 5 reported a mean error of 5.6 ± 4.5 mm. This accuracy was shown to be effective in initializing the 2D/3D pelvis registration and achieving less than 1° error for 86% of the images12. We consider this detection accuracy to be sufficient for related hip-imaging tasks. Extended Data Fig. 4 shows histograms of the C-arm geometry variations in the real hip-imaging dataset. The C-arm geometry is reported as the rotation difference of each view’s pelvis registration pose with respect to the standard anterior/posterior pose. We observed that most of the C-arm geometries are within 30°. This range of C-arm geometry distribution is typical for pelvic procedures, such as osteotomy10.

Despite the promising outlook, our study has several limitations. First, while the real X-ray and CT datasets of cadaveric specimens used for the hip imaging and robotic tool segmentation tasks are of a respectable size for this type of application, they are small compared with dataset sizes in general computer vision applications. However, the effort, facilities, time and, therefore, costs required to acquire and annotate a dataset of even this size are substantial due to the nature of the data. Further, we note that using a few hundred images, as we do for the hip-imaging X-ray tasks, is a typical size in the literature5,12,71,72,73,74,75,76, and most of the existing work on developing machine learning solutions for intra-operative X-ray analysis tasks, such as 2D/3D registration, neither develops nor tests its methods on any real data13. In summary, while datasets of the size reported here may not accurately reflect all of the variability one may expect during image-based surgery, the models trained on our datasets performed well on held-out data, using both leave-one-subject-out cross-validations and an independent test set, and performed comparably to previous studies on larger datasets5,77.

Second, the performance we report is limited by the quality of the CT and annotations. The spatial resolution of CT scans (between 0.5 mm and 1.0 mm in hip imaging and surgical robot tool segmentation; between 1.0 mm and 2.0 mm in COVID-19 lesion segmentation, isotropic) imposed a limitation on the resolution that can be achieved in 2D simulation. Pixel sizes of conventional detectors are as small as 0.2 mm, smaller than the highest-resolution scenario considered here. However, contemporary computer vision algorithms for image analysis tasks have considered only downsampled images in the ranges described here. Another issue arises from annotation mismatch, especially when annotations are generated using different processes for SyntheX training and evaluation on real 2D X-ray images. This challenge arose specifically in the COVID-19 lesion segmentation task, where 3D lesion labels generated from the pre-trained lesion segmentation network and used for SyntheX training are not consistent with the annotations on real 2D X-ray data. This is primarily for two reasons. First, because CT scans and CXR images were not from the same patients, COVID-19 disease stages and extent of lesions were varied; second, because real CXRs were annotated in 2D, smaller or more opaque parts of COVID-19 lesions may have been missed due to the projective and integrative nature of X-ray imaging. This mismatch in ground-truth definition is unobserved but establishes an upper bound on the possible Sim2Real performance. Further, realism of simulation can be improved with higher-quality CT scans, super-resolution techniques and advanced modelling techniques to more realistically represent anatomy at higher resolutions.

Third, SyntheX performs X-ray image synthesis from existing human models, which does not manipulate pathologies/lesions within healthy patient scans. For example, in the application of COVID-19 lesion segmentation, the CT scans were acquired from patients that were infected by COVID-19 and contained lesions naturally. Our X-ray synthesis model followed the same routine to generate images from the CT recordings, which then present lesions in the 2D domain as well. Future work will consider expanding on our current work by researching possibilities to advance human modelling.

Conclusion

In this paper, we demonstrated that realistic simulation of image formation from human models combined with domain generalization or adaptation techniques is a viable alternative to large-scale real-data collection. We demonstrated its utility on three varied clinical tasks, namely hip imaging, surgical robotic tool detection and COVID-19 lesion segmentation. On the basis of controlled experiments on a pelvic X-ray dataset, which is precisely reproduced in varied synthetic domains, we quantified the effect of simulation realism and of domain adaptation and generalization techniques on Sim2Real transfer performance. We found promising Sim2Real performance for all models that were trained on realistically simulated data. The specific combination of training on realistic synthesis and strong domain randomization, which we refer to as SyntheX, is particularly promising. SyntheX-trained models perform on a par with real-data-trained models, making realistic simulation of X-ray-based clinical workflows and procedures a viable alternative or complement to real-data acquisition. Because SyntheX does not require real data at training time, it is particularly promising for the development of machine learning models for novel clinical workflows or devices, including surgical robotics, before these solutions exist physically.

Methods

We introduce further details on the domain randomization and domain adaptation methods applied in our studies. We then provide additional information on experimental set-up and network training details of the clinical tasks and benchmark investigations.

Domain randomization

Domain randomization effects were applied to the input images during network training. We studied two domain randomization levels: regular and strong domain randomization. Regular domain randomization included the most frequently used data augmentation schemes. For strong domain randomization, we included more drastic effects and combined them together. We use x to denote a training image sample. The domain randomization techniques we introduced are as follows.

Regular domain randomization included the following. (1) Gaussian noise injection: x + N(0, σ), where N is the normal distribution and σ was uniformly chosen from the interval (0.005, 0.1) multiplied by the image intensity range. (2) Gamma transform: norm(x)γ, where x was normalized by its maximum and minimum value and γ was uniformly selected from the interval (0.7, 1.3). (3) Random crop: x was cropped at a random location using a square shape with 90% of the side length of x. Regular domain randomization methods were applied at every training iteration.
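A minimal sketch of these three regular effects is given below, assuming single-channel floating-point images and a NumPy random generator; the function name and implementation details are illustrative.

```python
import numpy as np

def regular_domain_randomization(x, rng):
    """Regular DR: Gaussian noise, gamma transform and a 90% random square crop, applied every iteration."""
    # (1) Gaussian noise with sigma drawn from (0.005, 0.1) of the intensity range
    sigma = rng.uniform(0.005, 0.1) * (x.max() - x.min())
    x = x + rng.normal(0.0, sigma, size=x.shape)
    # (2) gamma transform on the min-max normalized image, gamma in (0.7, 1.3)
    x = (x - x.min()) / (x.max() - x.min() + 1e-8)
    x = x ** rng.uniform(0.7, 1.3)
    # (3) square crop with 90% of the image side length at a random location
    h, w = x.shape
    s = int(0.9 * min(h, w))
    i, j = rng.integers(0, h - s + 1), rng.integers(0, w - s + 1)
    return x[i:i + s, j:j + s]

# usage: rng = np.random.default_rng(0); x_aug = regular_domain_randomization(x, rng)
```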

Strong domain randomization included the following. (1) Inverting: max(x) − x, where each pixel is subtracted from the maximum intensity value. (2) Impulse/pepper/salt noise injection: 10% of the pixels in x were replaced with one type of noise among impulse, pepper and salt. (3) Affine transform: a random 2D affine warp including translation, rotation, shear and scale factors was applied. (4) Contrast: x was processed with one of the contrast manipulations linear contrast, log contrast or sigmoid contrast. (5) Blurring: x was processed with a blurring method, either Gaussian blur N(μ = 0, σ = 3.0), where μ is the mean of the normal distribution, or average blur (kernel size between 2 × 2 and 7 × 7). (6) Box corruption: a random number of box regions were corrupted with large noise. (7) Dropout: either 1–10% of the pixels in x were randomly dropped to 0, or pixels were dropped in a rectangular region covering 2–5% of the image size. (8) Sharpening and embossing: sharpening blended the original image with a sharpened version using an alpha between 0 (no sharpening) and 1 (full sharpening); embossing added the sharpened version rather than blending it. (9) Pooling: one of average pooling, max pooling, min pooling and median pooling was applied to x, with kernel sizes between 2 × 2 and 4 × 4. (10) Multiply: either the brightness was changed or x was multiplied element-wise with 50–150% of the original value. (11) Distort: local areas of x were distorted with a random piece-wise affine transformation. For each image, we still applied regular domain randomization but randomly concatenated only up to two strong domain randomization methods during each training iteration to avoid overly heavy augmentation.
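The sketch below illustrates only the composition logic with a small, non-exhaustive subset of the listed effects (inverting, impulse noise, log contrast and box corruption); it reuses regular_domain_randomization from the previous sketch, and the specific parameter choices inside each helper are illustrative rather than taken from this work.

```python
import numpy as np

def invert(x, rng):
    return x.max() - x                                            # (1) inverting

def impulse_noise(x, rng):
    y = x.copy()
    mask = rng.random(x.shape) < 0.10                             # (2) corrupt 10% of the pixels
    y[mask] = rng.choice([x.min(), x.max()], size=int(mask.sum()))
    return y

def log_contrast(x, rng):
    xn = (x - x.min()) / (x.max() - x.min() + 1e-8)               # (4) one of the contrast manipulations
    return np.log1p(rng.uniform(0.5, 1.5) * xn)

def box_corruption(x, rng):
    y = x.copy()
    h, w = x.shape                                                # (6) corrupt a few random box regions
    for _ in range(rng.integers(1, 4)):
        bh, bw = rng.integers(8, h // 4), rng.integers(8, w // 4)
        i, j = rng.integers(0, h - bh), rng.integers(0, w - bw)
        y[i:i + bh, j:j + bw] += rng.normal(0.0, 3.0 * x.std(), size=(bh, bw))
    return y

STRONG_OPS = [invert, impulse_noise, log_contrast, box_corruption]

def strong_domain_randomization(x, rng):
    """Apply regular DR, then concatenate at most two randomly chosen strong DR effects."""
    x = regular_domain_randomization(x, rng)
    n_ops = rng.integers(0, 3)                                    # 0, 1 or 2 strong effects per iteration
    for op in rng.choice(STRONG_OPS, size=n_ops, replace=False):
        x = op(x, rng)
    return x
```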

Domain adaptation

We select the two most frequently used domain adaptation approaches for our comparison study, which are CycleGAN25 and ADDA26. CycleGAN was trained using unpaired synthetic and real images before task network training. All synthetic images were then processed with trained CycleGAN generators, to alter their appearance to match real data. We strictly enforced the data split used during task-model training so that images from the test set were excluded during both CycleGAN and task network training. ADDA introduced an adversarial discriminator branch as an additional loss to discriminate between features derived from synthetic and real images. We followed the design of ref. 26 to build the discriminator for ADDA training on the task of semantic segmentation. Both CycleGAN and ADDA models were tested using realistic and naive simulation images.

CycleGAN

CycleGAN was applied to learn mapping functions between two image domains X and Y given training samples \({\{{x}_{i}\}}_{i = 1}^{N}\) with \({x}_{i}\in X\) and \({\{{y}_{j}\}}_{j = 1}^{M}\) with \({y}_{j}\in Y\), where i and j index the N and M samples, respectively. The model includes two mapping functions G: X → Y and F: Y → X, and two adversarial discriminators DX and DY. The objective contains two terms: an adversarial loss to match the distribution between the generated and target image domains; and a cycle-consistency loss to ensure that the learned mapping functions are cycle-consistent. For the mapping function G: X → Y with its discriminator DY, the first term, the adversarial loss, can be expressed as:

$$\begin{array}{rcl}{{{{\mathcal{L}}}}}_{{{{\rm{GAN}}}}}(G,{D}_{Y},X,Y)&=&{{\mathbb{E}}}_{y \sim {p}_{{{{\rm{data}}}}}(y)}[\log {D}_{Y}(y)]\\ &&+{{\mathbb{E}}}_{x \sim {p}_{{{{\rm{data}}}}}(x)}[\log (1-{D}_{Y}(G(x)))],\end{array}$$
(1)

where G generates images G(x) with an appearance similar to images from domain Y, while DY tries to distinguish between translated samples G(x) and real samples y. Overall, G aims to minimize this objective against an adversary D that tries to maximize it. Similarly, there is an adversarial loss for the mapping function F: Y → X with its discriminator DX.

The second term, cycle-consistency loss, can be expressed as:

$$\begin{array}{rcl}{{{{\mathcal{L}}}}}_{{{{\rm{cyc}}}}}(G,F)&=&{{\mathbb{E}}}_{x \sim {p}_{{{{\rm{data}}}}}(x)}\left[\parallel F(G(x))-x{\parallel }_{1}\right]\\ &&+{{\mathbb{E}}}_{y \sim {p}_{{{{\rm{data}}}}}(y)}\left[\parallel G(F(y))-y{\parallel }_{1}\right],\end{array}$$
(2)

where for each image x from domain X, x should be recovered after one translation cycle, that is, x → G(x) → F(G(x)) ≈ x. Similarly, each image y from domain Y should be recovered as well. A previous study25 argued that learned mapping functions should be cycle-consistent to further reduce the space of possible mapping functions. The above formulation using domain discrimination and cycle consistency enables unpaired image translation, that is, learning the mappings G(x) and F(y) without corresponding images.

The overall objective for CycleGAN training is expressed as:

$$\begin{array}{rcl}{{{\mathcal{L}}}}(G,F,{D}_{X},{D}_{Y})&=&{{{{\mathcal{L}}}}}_{{{{\rm{GAN}}}}}(G,{D}_{Y},X,Y)\\ &&+{{{{\mathcal{L}}}}}_{{{{\rm{GAN}}}}}(F,{D}_{X},Y,X)\\ &&+\lambda {{{{\mathcal{L}}}}}_{{{{\rm{cyc}}}}}(G,F),\end{array}$$
(3)

where λ controls the relative importance of cycle-consistency loss, aiming to solve:

$${G}^{* },{F}^{* }=\arg \mathop{\min }\limits_{G,F}\mathop{\max }\limits_{{D}_{X},{D}_{Y}}{{{\mathcal{L}}}}(G,F,{D}_{X},{D}_{Y}).$$
(4)

For the generator network, 6 blocks for 128 × 128 images and 9 blocks for 256 × 256 and higher-resolution training images were used with instance normalization. For the discriminator network, a 70 × 70 PatchGAN was used.
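A condensed PyTorch-style sketch of this objective is shown below, splitting equation (3) into the generator and discriminator updates as is usual in practice. It follows the log-likelihood form of equation (1) with a binary cross-entropy on discriminator logits (the original CycleGAN implementation substitutes a least-squares GAN loss), and λ = 10 is an assumed, commonly used weight rather than a value stated here.

```python
import torch
import torch.nn.functional as F

def cyclegan_losses(G, F_, D_X, D_Y, x, y, lam=10.0):
    """Return (generator loss, discriminator loss) for one batch of unpaired images x in X, y in Y."""
    bce = F.binary_cross_entropy_with_logits
    def adv_real(d): return bce(d, torch.ones_like(d))
    def adv_fake(d): return bce(d, torch.zeros_like(d))

    fake_y, fake_x = G(x), F_(y)                     # translated samples G(x) and F(y)
    # discriminator loss: distinguish real from translated samples (eq. 1 and its Y -> X counterpart)
    d_loss = (adv_real(D_Y(y)) + adv_fake(D_Y(fake_y.detach())) +
              adv_real(D_X(x)) + adv_fake(D_X(fake_x.detach())))
    # generator adversarial loss: fool both discriminators
    g_adv = adv_real(D_Y(fake_y)) + adv_real(D_X(fake_x))
    # cycle-consistency loss (eq. 2): x -> G(x) -> F(G(x)) ~ x and y -> F(y) -> G(F(y)) ~ y
    l_cyc = F.l1_loss(F_(fake_y), x) + F.l1_loss(G(fake_x), y)
    return g_adv + lam * l_cyc, d_loss               # eq. 3, split across the two optimizers
```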

Adversarial discriminative domain adaptation

We applied the idea of ref. 26 to our pelvis segmentation and landmark localization task. The architecture consists of three components: the segmentation and localization network, a decoder and a discriminator. The input to the segmentation and localization network is an image x and the output prediction feature is z. The losses are Lseg and Lld, as introduced in ‘Clinical tasks’. The decoder shares the same TransUNet architecture, takes z as input and outputs the reconstruction R(z). The reconstruction loss, Lrecons, is the mean squared error between x and R(z). The discriminator was trained using an adversarial loss:

$$\begin{array}{rcl}{L}_{{\mathrm{dis}}}(z)&=&-\frac{1}{H\times W}\mathop{\sum}\limits_{h,w}\left[s\log (D(z))\right.\\ &&\left.+\,(1-s)\log (1-D(z))\right],\end{array}$$
(5)

where H and W are the spatial dimensions of the discriminator output, s = 0 when D takes a target-domain prediction (Yt) as input, and s = 1 when it takes a source-domain prediction (Ys) as input. During training, the discriminator contributes an adversarial loss that transfers domain knowledge by encouraging target-domain predictions to resemble source-domain predictions. This adversarial loss is defined as:

$${L}_{{\mathrm{adv}}}({x}_{t})=-\frac{1}{H\times W}\mathop{\sum}\limits_{h,w}\log (D({z}_{t})),$$
(6)

where t refers to the target domain. Thus, the total training loss can be written as:

$$\begin{array}{rcl}{L}_{t}({x}_{s},{x}_{t})&=&{L}_{{\mathrm{seg}}}({x}_{s})+{L}_{{\mathrm{ld}}}({x}_{s})+{\lambda }_{{\mathrm{adv}}}{L}_{{\mathrm{adv}}}({x}_{t})\\ &&+{\lambda }_{{\mathrm{recons}}}{L}_{{\mathrm{recons}}}({x}_{t}),\end{array}$$
(7)

where λadv and λrecons are weighting hyperparameters, empirically set to 0.001 and 0.01, respectively, as suggested by ref. 26.
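
A minimal sketch of how the total loss in equation (7) could be assembled in PyTorch follows. The network, decoder, discriminator and task-loss callables are placeholders, and the use of binary cross-entropy with logits for the adversarial term is an assumption consistent with equation (6), not a description of the exact implementation.

```python
import torch
import torch.nn.functional as F

def adaptation_loss(net, dec, D, task_loss, x_s, x_t,
                    lam_adv=0.001, lam_recons=0.01):
    """Total training loss of equation (7) for one source/target batch.

    net: segmentation and localization network returning prediction features z.
    dec: decoder reconstructing the input image from z.
    D:   discriminator returning per-location logits.
    task_loss: callable computing L_seg + L_ld from the source prediction
               (labels are captured inside the callable in this sketch).
    """
    z_s = net(x_s)  # source-domain (simulated) prediction features
    z_t = net(x_t)  # target-domain (real) prediction features

    # Supervised losses on the labelled source domain.
    loss_task = task_loss(z_s)

    # Reconstruction loss on target images: decoder output vs. input image.
    loss_recons = F.mse_loss(dec(z_t), x_t)

    # Adversarial loss (eq. 6): ask the discriminator to classify the
    # target-domain prediction as source-like (s = 1).
    d_out = D(z_t)
    loss_adv = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))

    # Weighted sum of equation (7), with the weights quoted in the text.
    return loss_task + lam_adv * loss_adv + lam_recons * loss_recons
```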

Clinical tasks experimental details

The SyntheX simulation environment was set up to approximate a Siemens CIOS Fusion C-arm, which has image dimensions of 1,536 × 1,536 pixels, an isotropic pixel spacing of 0.194 mm per pixel, a source-to-detector distance of 1,020 mm and a principal point at the centre of the image.
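
For reference, the quoted detector size, pixel spacing and source-to-detector distance translate into a pinhole-style intrinsic matrix as in the following sketch; the helper function is hypothetical and is not part of the SyntheX code.

```python
import numpy as np

def cios_fusion_intrinsics(n_px=1536, pixel_mm=0.194, sdd_mm=1020.0):
    """Pinhole-style intrinsic matrix for the geometry quoted above.

    The focal length in pixels is the source-to-detector distance divided by
    the pixel spacing (about 5,258 px); the principal point sits at the
    image centre.
    """
    f_px = sdd_mm / pixel_mm
    c = (n_px - 1) / 2.0
    return np.array([[f_px, 0.0,  c],
                     [0.0,  f_px, c],
                     [0.0,  0.0,  1.0]])
```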

Hip imaging

Synthetic hip X-rays were created using 20 CT scans from the New Mexico Decedent Image Database14. During simulation, we uniformly sampled the CT volume rotation in [−45°, 45°], and the translation left/right in [−50 mm, 50 mm], inferior/superior in [−20 mm, 20 mm] and anterior/posterior in [−100 mm, 100 mm]. We generated 18,000 images for training and 2,000 images for validation. Ground-truth segmentation and landmark labels were projected from 3D using the projection geometry.
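
This sampling scheme can be sketched as follows; the axis conventions and the single rotation parameter are illustrative, and the simulator's actual parameterization may differ. The same pattern, with different ranges and distributions, applies to the other two tasks.

```python
import numpy as np

def sample_hip_pose(rng):
    """Draw one CT pose within the ranges quoted above (units: degrees, mm)."""
    rotation = rng.uniform(-45.0, 45.0)      # CT volume rotation
    t_lr = rng.uniform(-50.0, 50.0)          # left/right translation
    t_is = rng.uniform(-20.0, 20.0)          # inferior/superior translation
    t_ap = rng.uniform(-100.0, 100.0)        # anterior/posterior translation
    return rotation, np.array([t_lr, t_is, t_ap])

rng = np.random.default_rng(0)
poses = [sample_hip_pose(rng) for _ in range(18000 + 2000)]  # train + validation
```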

We trained the model for 20 epochs and selected the final converged model for evaluation. Strong domain randomization was applied at training time (see ‘Domain randomization’). During evaluation, a threshold of 0.5 was applied to the segmentation output and each landmark prediction was taken as the location of the highest heatmap response.
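
A short sketch of this decoding rule, assuming per-class probabilities and per-landmark heatmaps as PyTorch tensors (shapes and names are illustrative):

```python
import torch

def decode_predictions(seg_probs, heatmaps):
    """seg_probs: (C, H, W) class probabilities; heatmaps: (L, H, W) responses."""
    seg_mask = seg_probs > 0.5                       # binary mask per class

    # Each landmark is the pixel with the highest heatmap response.
    n_landmarks, _, width = heatmaps.shape
    flat_idx = heatmaps.reshape(n_landmarks, -1).argmax(dim=1)
    rows = torch.div(flat_idx, width, rounding_mode='floor')
    cols = flat_idx % width
    return seg_mask, torch.stack([rows, cols], dim=1)  # (L, 2) as (row, col)
```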

Robotic surgical tool detection

We created 100 voxelized models of the CM in various configurations by sampling its curvature control point angles from a Gaussian distribution N(μ = 0, σ = 2.5°). The CM base pose was sampled uniformly in left/right anterior oblique angle (LAO/RAO) in [−30°, 30°], cranial/caudal angle (CRAN/CAUD) in [−10°, 10°] and source-to-isocentre distance in [600 mm, 900 mm], with translation along the x and y axes following a Gaussian distribution N(μ = 0 mm, σ = 10 mm). We created DeepDRR synthetic images by projecting randomly selected hip CT scans (from the 20 New Mexico Decedent Image Database scans used for hip imaging) together with the CM model, yielding 28,000 images for training and 2,800 for validation. Ground-truth segmentation and landmark labels were projected following each simulation geometry.

The network training details are given in ‘Network training details’, and strong domain randomization was applied (see ‘Domain randomization’). The network was trained for ten epochs and the final converged model was selected for evaluation. Performance was evaluated on 264 real CM X-ray images with manual ground-truth annotations. During evaluation, a threshold of 0.5 was applied to the segmentation output and each landmark prediction was taken as the location of the highest heatmap response. For the fivefold Real2Real experiments, the network was trained for 50 epochs; the testing and evaluation routines were the same.

COVID-19 lesion segmentation

We used 81 high-quality CT scans from ImagEng lab20 and 62 CT scans with resolution less than 2 mm from UESTC21, all diagnosed as COVID-19 cases, to generate synthetic CXR data. The 3D lesion segmentations of the ImagEng lab CTs were generated using the pre-trained COPLE-Net21. During DeepDRR simulation, we uniformly sampled the view pose of the CT scans, with rotation in [−5°, 5°] about all three axes and source-to-isocentre distance in [350 mm, 650 mm], resulting in 18,000 training images and 1,800 validation images with a resolution of 224 × 224 px. A random shearing transformation in [−30°, 30°] was applied to the CT scans, and segmentations were obtained by applying a threshold of 0.5 to the predicted response. The corresponding lesion masks were projected from the 3D segmentations using the simulation projection geometry.

The network training set-up follows the description in ‘Network training details’. Strong domain randomization was applied during training (see ‘Domain randomization’). We trained the network for 20 epochs and selected the final converged model for testing. Performance was evaluated on a benchmark dataset of 2,951 real COVID-19 images22. During evaluation, the segmentation mask was created by applying a threshold of 0.5 to the network prediction. For the fivefold Real2Real experiments, the network was trained for 50 epochs; the testing and evaluation routines were the same.

Benchmark hip-imaging investigation

For every X-ray image, the ground-truth camera pose relative to the CT scan was estimated using automatic intensity-based 2D/3D registration of the pelvis and both femurs12. Every CT scan was annotated with segmentations of anatomical structures and anatomical landmark locations defined in Fig. 2a. Two-dimensional labels for every X-ray image were then generated automatically by forward projecting the reference 3D annotations using the corresponding ground-truth C-arm pose.

We generated synthetic data using three DRR simulators: naive DRR, heuristic DRR (xreg) and realistic DRR (DeepDRR). Naive DRR generation amounts to simple ray-casting and does not consider any imaging physics; it assumes a mono-energetic source, single-material objects and no image corruption, for example, due to noise or scattering. Heuristic simulation (xreg) performs a linear thresholding of the CT Hounsfield units to differentiate air from anatomy before ray-casting. Although this increases tissue contrast and yields more realistic-looking DRRs, it still does not model the physics of image formation. Realistic simulation (DeepDRR) models the imaging physics by considering the full spectrum of the X-ray source and relies on machine learning for material decomposition and scatter estimation. It also considers signal-dependent noise, readout noise and detector saturation.
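
As a conceptual illustration of the naive baseline only (not the xreg or DeepDRR implementations), a mono-energetic line-integral projector could be sketched as follows; the geometry inputs, the crude HU-to-attenuation mapping and all parameter names are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def naive_drr(ct_hu, source, det_origin, det_u, det_v,
              n_u=128, n_v=128, n_samples=256, voxel_mm=1.0):
    """Mono-energetic line-integral projection of a CT volume.

    ct_hu:      CT volume in Hounsfield units, indexed (z, y, x).
    source:     X-ray source position, (x, y, z) in voxel units.
    det_origin: position of detector pixel (0, 0), (x, y, z) in voxel units.
    det_u/v:    detector axis step vectors per pixel, (x, y, z) in voxel units.
    """
    # Crude HU -> linear attenuation mapping (roughly water-like, in mm^-1).
    mu = 0.02 * np.clip((ct_hu + 1000.0) / 1000.0, 0.0, None)

    # Positions of all detector pixels.
    uu, vv = np.meshgrid(np.arange(n_u), np.arange(n_v), indexing='ij')
    pixels = det_origin + uu[..., None] * det_u + vv[..., None] * det_v

    # Sample points uniformly along every source-to-pixel ray.
    t = np.linspace(0.0, 1.0, n_samples)
    pts = source + t[:, None, None, None] * (pixels - source)

    # Order-1 interpolation of attenuation at the sample points; note the
    # (x, y, z) -> (z, y, x) reordering to match the volume indexing.
    vals = map_coordinates(mu, [pts[..., 2], pts[..., 1], pts[..., 0]],
                           order=1, mode='constant', cval=0.0)

    # Approximate the line integral and apply Beer-Lambert attenuation.
    ray_len_mm = np.linalg.norm(pixels - source, axis=-1) * voxel_mm
    line_integral = vals.sum(axis=0) * ray_len_mm / n_samples
    return np.exp(-line_integral)
```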

Network training details

We used PyTorch for all implementations and initialized the networks from the pre-trained vision transformer model78, as suggested in the TransUNet paper27. The networks were trained using stochastic gradient descent with an initial learning rate of 0.1, Nesterov momentum of 0.9, weight decay of 0.00001 and a constant batch size of 5 images. The learning rate was decayed by a factor of 0.5 every 10 epochs. The multi-task training loss weights the landmark detection and segmentation losses equally. All experiments were conducted on an Nvidia GeForce RTX 3090 graphics card with 24 GB of memory. Generating 10,000 synthetic hip-imaging images takes around 2 h, and training a network on 10,000 images takes about 5 h until convergence.
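
These settings map directly onto standard PyTorch components, as in the following sketch; the stand-in model, random data and placeholder loss are illustrative substitutes for the TransUNet network, the actual data loader and the multi-task loss.

```python
import torch

# Stand-in for the TransUNet-based multi-task network defined elsewhere.
model = torch.nn.Conv2d(1, 8, kernel_size=3, padding=1)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            nesterov=True, weight_decay=1e-5)
# Halve the learning rate every 10 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(20):                        # e.g. 20 epochs for the hip task
    for _ in range(10):                        # placeholder for the data loader
        images = torch.randn(5, 1, 224, 224)   # constant batch size of 5
        optimizer.zero_grad()
        out = model(images)
        # Placeholder for the equally weighted segmentation + landmark loss.
        loss = 0.5 * out.pow(2).mean() + 0.5 * out.abs().mean()
        loss.backward()
        optimizer.step()
    scheduler.step()
```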

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.