Introduction

Traditional radiomic methods seek imaging biomarkers that quantify disease through analysis of regions of interest (ROIs) in medical imaging. With radiomics, models built from relevant and explainable imaging biomarkers could provide additional insights to guide clinical decisions and facilitate personalised treatment strategies1,2,3,4. Radiomics is a growing topic of interest in oncology, which extensively uses imaging throughout diagnosis and treatment. In oncology, radiomics features usually come from ROIs defining tumours or organs at risk.

Since the term radiomics entered the literature lexicon, numerous studies have explored—and addressed—many challenges surrounding the reproducibility of key features that describe shape, statistics, and texture5,6. With the traditional radiomics approach, assessing and selecting reproducible features as potential imaging biomarkers is essential for model generalisability. Only models built with reproducible features can survive external validation and see routine use in the clinic.

To enable reproducibility of radiomic studies, the Image Biomarker Standardisation Initiative (IBSI)7 represents a significant effort to produce a set of standardised reference values for a number of feature families. The IBSI determined benchmarks in two phases, iterating towards consensus-based reference values considering a variety of extraction settings. These benchmarks were developed using a digital phantom and a single lung computed tomography (CT) image with a gross tumour volume (GTV) segmentation. An extensive reference manual was produced detailing the image processing steps for extraction. Finally, a cohort of 51 patients was used in a third validation phase to evaluate the reproducibility of software from different participants, after the benchmarking phases were completed. The validation phase showed excellent reproducibility for successfully benchmarked features.

A number of studies have investigated the effect of segmentation on radiomic features. Typically, studies test the individual robustness of features holding extraction settings constant, whilst changing the contour defining the ROI. Often, this is tested in the context of manual vs. automatic methods. For instance, Belli et al.8 quantified the robustness of radiomic features in pancreatic cancer on 18F-fluorodeoxyglucose (FDG) positron emission tomography (PET), where FDG-based contours were delineated following manual, semi-automatic and automatic segmentation methods. Haarburger et al.9 analysed the reproducibility of radiomic features using expert manual segmentations and neural network-based CT segmentations on lung, kidney and liver lesions. Fiset et al.10 evaluated the stability of radiomic features from T2-weighted MRI of cervical cancer using test-retest, simulated MRI data and inter-observer segmentation. Due to intra- and inter-observer variability of manual segmentation, semi- and fully automated methods have been shown to facilitate more repeatable radiomic results6. Notably, results from a phantom study in PET by Pfaehler et al.11 found that segmentation of smaller volumes led to lower repeatability of radiomic feature values. In essence, the effect of segmentation on features is emphasised in volumes with fewer voxels.

In addition to comparing manual vs automatic segmentation, differences in delineation can be simulated by computationally manipulating the ROI binary mask through adaptation of the volume and randomisation of voxel inclusion at the contour edges. Image and mask perturbation methods were proposed by Zwanenburg et al.12 as a way to select robust features for modelling. These perturbations include small adaptations to the ROI mask, such as translation, rotation and randomised erosion and dilation at the edges. They concluded their perturbation approach was a viable alternative to feature comparison on test-retest imaging when this is not available.

The widely accepted standard for storage and handling of medical data is the “Digital Imaging and Communications in Medicine” (DICOM) format (https://www.dicomstandard.org). In a radiotherapy (RT) department, manual segmentations are drawn on top of an image by expert clinicians using specialised medical devices, including image contouring software and treatment planning systems. Within the vast DICOM standard, the Radiotherapy Structure Set (RTSTRUCT) module is the recognised format to store segmentation data for RT ROIs such as target volumes and organs at risk.

Contours in RTSTRUCT format are defined as 3-dimensional (3D) coordinate points (x, y, z) that represent closed polygon loops. The main data is stored in the Contour Data attribute of the file (tag (3006, 0050)), and the attributes Image Position (tag (0020, 0032)) and Image Orientation (tag (0020, 0037)) of the associated DICOM images are vital to align contour and voxel data correctly. Image Position specifies the x, y, z coordinates of the upper left hand corner of the image, and Image Orientation specifies the direction cosines of the first row and column with respect to the patient, where axis directions are determined by the patient orientation. A full list of the mandatory modules and attributes recorded for RTSTRUCT can be found in the DICOM standard13.
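As an illustration of the layout described above, the sketch below converts a flat Contour Data list into (x, y, z) points describing one closed loop. This is an illustrative helper of our own, not taken from any particular software; with a library such as pydicom, the flat list would typically be read from the ContourData attribute of an RTSTRUCT dataset.

```python
def contour_data_to_points(contour_data):
    """Convert a flat RTSTRUCT Contour Data list
    [x1, y1, z1, x2, y2, z2, ...] into a list of (x, y, z)
    tuples describing one closed polygon loop."""
    if len(contour_data) % 3 != 0:
        raise ValueError("Contour Data length must be a multiple of 3")
    return [tuple(contour_data[i:i + 3])
            for i in range(0, len(contour_data), 3)]
```

For an axial contour, all points in one loop share the same z coordinate, which is how the loop is matched to an image slice via Image Position.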

Most automatic segmentation methods inherently produce binary masks that are then converted to RTSTRUCT format for use in data communication and storage protocols within the hospital environment. Conversely, working with RTSTRUCT data collected from a clinical setting requires conversion of a polygon contour into a binary mask by determining which voxels lie sufficiently within the enclosed polygon space. A binary mask representation of a segmentation is needed for traditional radiomic analysis; hence the need for conversion between the two representations.

The reliability of polygon to mask conversion strategies is a well-known problem in computational geometry14. The IBSI reference manual provides details on a common technique to determine whether a point lies within a two-dimensional (2D) polygon, known as the crossing number algorithm7, alongside a description of a naïve implementation. However, there are many techniques that could be used14 and no single method was selected or suggested by the IBSI. Particularly with commercial products, the underlying mask generation method is often obscured from the end user. Furthermore, strategies such as super-sampling are used to fine-tune the selection of region voxels. The many different parameter settings for super-sampling strategies result in slight variations in the final mask.
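For concreteness, a minimal Python sketch of the crossing number test is given below. The IBSI manual describes the general technique; this particular implementation is illustrative only, and like most naïve versions it does not handle edge cases such as a point lying exactly on a polygon edge or vertex.

```python
def point_in_polygon(x, y, polygon):
    """Crossing-number test: count how many polygon edges a horizontal
    ray from (x, y) crosses; an odd count means the point is inside.
    `polygon` is a list of (x, y) vertices of a closed loop."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the ray's y level?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses the ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside
```

Applied per slice, a voxel is assigned to the ROI when its centre passes this test, which is the direct (non-super-sampled) conversion strategy discussed later.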

In the radiomics reporting guidelines of the IBSI7, and a subsequent checklist by Pfaehler et al.6, it is explicitly suggested to describe the method used to convert polygon segmentations to a binary mask. However, this remains rarely reported in practice6. Knowledge of the polygon to binary mask conversion algorithm is necessary to ensure consistency and interoperability of radiomic features extracted from imaging and contouring data processed with different software applications. As DICOM is a non-proprietary industry standard for data exchange—and the most widely used file structure for biomedical images and metadata, including contouring data—it is necessary to understand the full impact that discrepancies in binary mask generation could have on reportedly standardised features, particularly when radiomics algorithms are integrated into different medical image processing software platforms.

Despite an abundance of studies evaluating the robustness of radiomics to manual and automatic segmentation, to the best of our knowledge, no study has yet properly assessed standardised radiomic feature sensitivity to underlying mask generation algorithms, which still differ across both commercial and research-based medical imaging software.

The first aim of this study was therefore to assess this sensitivity. The second aim was to determine whether mask generation alone could have a meaningful impact on quantitative analysis, by examining the changes to raw feature values and to patient clustering, which in turn can influence radiomics-based models.

Study design

Overview

The study workflow is shown in Fig. 1. The sensitivity of standardised radiomic algorithms to mask generation from DICOM RTSTRUCT was tested with 3 main experiments.

Figure 1

Study overview. This work used an open dataset of 51 patients with soft-tissue sarcoma (STS); all patients had CT, MRI, and PET imaging. Original data was provided in both DICOM + RTSTRUCT and NIfTI formats. The study was split into 3 experiments. Experiment 1: DICOM data was imported with 4 software packages: MIM, CERR, MICE Toolkit, and Velocity. The resulting mask generation and feature extraction results were compared to a baseline extraction using the NIfTI files. Experiment 2: mask generation with super-sampling was evaluated using the MICE Toolkit structure processor with different voxel acceptance thresholds. Feature values were again compared to baseline. Experiment 3: the dataset was split into 3 subsets and imported using different mask generation techniques. The dataset was then recombined for radiomics, and feature family-wise clustering was compared to baseline. This experiment aimed to simulate a potential multi-centre radiomics collection.

Experiment 1

We imported the DICOM dataset into four different imaging platforms (described in “Methods” section) with the default settings for each application. Using the API of each software, we passed the generated image, mask and voxel resolution to SPAARC for radiomic feature extraction. Correspondingly, we calculated features directly on the NIfTI version of the dataset, which were considered the baseline results.

The raw feature values calculated from the output of each imaging software were compared pairwise with the baseline using percentage difference and Spearman rank correlation \(\rho\). We then assessed changes in hierarchical clustering of patients from choice of import software alone.
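Spearman's \(\rho\) measures whether two extractions rank the same patients in the same order. As a rough illustration (assuming no tied values, unlike a production implementation such as the one in common statistical packages), it is the Pearson correlation of the ranks:

```python
def spearman_rho(x, y):
    """Spearman's rank correlation of two equal-length sequences:
    the Pearson correlation of their ranks (no tie handling)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A \(\rho\) near 1 indicates that, even if raw feature values shift, the relative ordering of patients by that feature is preserved.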

Experiment 2

Additionally, we used functionality available in MICE Toolkit to test the impact of mask generation with super-sampling using different voxel inclusion thresholds. This experiment aimed to systematically assess how voxel inclusion and exclusion at the mask edge impacts quantitative radiomics values. These were adjusted using settings in the MICE Structure Processor node. With this grid super-sampling method, each voxel is divided into sub-voxels, and whether each sub-voxel centre lies within the closed polygon is then determined. For a voxel to be included, a threshold is set to define the percentage of sub-voxels within that voxel that are within the polygon. A comparison between the direct conversion of a polygon contour into a binary mask and the grid super-sampling method is shown in Fig. 2. In this study, the sub-voxel threshold was adjusted from 10 to 90%, in increments of 10%. Here we used a single-split grid super-sampling that yielded 8 sub-voxels per voxel (i.e. each voxel is split down the centre of each axis).
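A minimal 2D sketch of this grid super-sampling procedure is given below. It is illustrative only: for simplicity it uses 4 sub-pixels per unit pixel (a single split in two axes) rather than the 8 sub-voxels of the 3D single-split case, and the function names are our own rather than anything from MICE Toolkit.

```python
def point_in_polygon(x, y, polygon):
    # Crossing-number test: odd number of edge crossings -> inside.
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside

def supersampled_mask(nx, ny, polygon, threshold):
    """2D sketch of grid super-sampling: each unit pixel is split once
    per axis into 4 sub-pixels; the pixel joins the mask when the
    fraction of sub-pixel centres inside the polygon reaches
    `threshold` (a fraction between 0 and 1)."""
    offsets = [0.25, 0.75]  # sub-pixel centres within a unit pixel
    mask = []
    for j in range(ny):
        row = []
        for i in range(nx):
            hits = sum(point_in_polygon(i + dx, j + dy, polygon)
                       for dx in offsets for dy in offsets)
            row.append(1 if hits / 4.0 >= threshold else 0)
        mask.append(row)
    return mask
```

Raising the threshold erodes the mask at the boundary and lowering it dilates the mask, which is exactly the systematic effect examined in this experiment.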

Figure 2

Illustrating conversion of coordinate points that define a closed polygon to a binary mask. (a) The coordinate positions of voxel centres are shown as green circles, and the perimeter of each voxel is highlighted with a grey dashed line. The closed polygon loop is shown with connected red crosses. (b) The resulting voxels that are within the region mask (green block). Here, if the centre of the voxel is within the closed polygon the voxel is considered part of the region. (c) Demonstrates new voxel centres (orange circles) after super-sampling the mask matrix in (a) by splitting each voxel down the centre in both axes. (d) The resulting super-sampled voxel centres that are within the closed polygon are highlighted in orange. (e) The resulting mask generated on the original dimensions of (a), with the criterion that at least 25% of super-sampled voxels are within the closed loop polygon for a voxel to be within the region (in this case 1 sub-voxel). Note this leads to differences in the mask, comparing (b) and (e).

Experiment 3

Finally, we split the dataset into 3 and imported with different mask generation settings before recombining into a single dataset. This is referred to here as the mixed dataset. The aim was to mimic collection and combination of radiomic data from 3 different centres where there was not a homogeneous approach to mask generation. We normalised the mixed and baseline feature extractions separately and compared hierarchical clustering differences at the feature family level.

The mixed import mask generation approaches were: (1) standard import (with CERR), (2) super-sampled import (with MICE—threshold 90%, grid split 1), and (3) super-sampled import (with MICE—threshold 50%, grid split 3).

Analysis

Mask difference

Mask differences produced by each import software were analysed pairwise with the corresponding baseline NIfTI mask. All masks for each patient were compared. For pairwise comparison, a difference map \({\textbf{d}}_{map}\) was used to highlight the areas of discrepancy between masks. When comparing two mask matrices, \({\textbf{m}}_{a}\) and \({\textbf{m}}_{n}\), the difference map is simply the element-wise subtraction of one matrix from the other:

$$\begin{aligned} {\textbf{d}}_{map} = {\textbf{m}}_{a} - {\textbf{m}}_{n}. \end{aligned}$$
(1)

Voxels within the region in \({\textbf{m}}_{a}\) but not in \({\textbf{m}}_{n}\) correspond to a value of 1 in \({\textbf{d}}_{map}\). Conversely, voxels within the region in \({\textbf{m}}_{n}\) but not in \({\textbf{m}}_{a}\) correspond to -1 in the difference map. Voxels with the same assignment have a value of 0. Taking the absolute value of \({\textbf{d}}_{map}\) and summing provided the voxel discrepancy number, \(V_d\), which was scaled by the number of voxels in the ROI of the baseline, \(N_m\) (in this case the NIfTI file), and stated as a percentage:

$$\begin{aligned} V_d = 100\times \frac{\sum _i \vert {\textbf{d}}_{map}^i\vert }{N_m}. \end{aligned}$$
(2)
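Equations (1) and (2) can be sketched in a few lines of Python (illustrative only, assuming the two masks are supplied as flat 0/1 sequences of equal length):

```python
def voxel_discrepancy(mask_a, mask_n):
    """Element-wise difference map (Eq. 1) and voxel discrepancy
    percentage V_d (Eq. 2). `mask_n` is the baseline (NIfTI) mask;
    both masks are flat sequences of 0/1 values."""
    d_map = [a - n for a, n in zip(mask_a, mask_n)]
    n_m = sum(mask_n)  # number of ROI voxels in the baseline mask
    v_d = 100.0 * sum(abs(d) for d in d_map) / n_m
    return d_map, v_d
```

A value of 1 in `d_map` marks a voxel present only in the imported mask, -1 a voxel present only in the baseline, matching the sign convention above.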
Figure 3

Measured differences in feature values and subsequent hierarchical clustering of patients when importing DICOM RTSTRUCT using 4 different imaging software. There is a heatmap and dendrogram comparison for each software, for each of the 3 modalities (CT, MRI, PET). Percentage difference between values was colour-coded into 4 groups (green: < 0.5%, yellow: 0.5–3%, orange: 3–10%, and red: > 10%). Hierarchical clustering using the baseline features was compared to the different software results (left dendrogram: baseline clustering). Feature differences from mask generation at import can, alone, lead to significant unsupervised clustering differences.

Feature comparison

For each software import with DICOM, features were extracted using the SPAARC interface and compared pairwise with the baseline extraction set from NIfTI files with pre-defined ROI masks. The pairwise variation between respective raw feature values (e.g., \(f_A\) and \(f_B\)) was expressed as a percentage difference:

$$\begin{aligned} P_d = 100 \times \left| \frac{f_A - f_B}{(f_A + f_B)/2} \right| \end{aligned}$$
(3)

To avoid division by zero, any pair of matching zero-valued features was explicitly defined as having no variation. For visualisation, the measured difference, \(P_d\), was used to categorise each result into 4 groups: < 0.5%, 0.5–3%, 3–10%, and > 10%. The four groups were chosen to span a wide range of values, from 0 to infinity; four groups provided enough granularity to distinguish between small and large differences while keeping the number of groups manageable. The cutoffs between the groups were chosen to represent thresholds of increasing significance. For example, the jump from [0.5–3%] to [3–10%] represents a three-fold increase in the magnitude of the difference being reported. We produced a heatmap for comparison, with the groups coloured green, yellow, orange and red respectively. The intraclass correlation coefficient (ICC) was used to assess feature consistency when extracted using different mask generation approaches. ICC values lie between 0 and 1, with a threshold > 0.9 often used to express high repeatability. For the ICC, we considered each mask generation method as a rater and each patient as a subject. ICC was calculated using the Pingouin open-source statistical package15 written in Python 3 (Python Software Foundation, https://www.python.org). This implementation reports all six cases of reliability of ratings described by Shrout and Fleiss16. Here we report ICC2, which considers a random sample of raters rating each target and measures absolute agreement in the ratings. In addition, for each feature we assessed the stability of patient ranking using Spearman’s rank correlation coefficient (\(\rho\)), as changes in patient ranking are important for the use of radiomic-based models.
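The percentage difference of Eq. (3), the zero-value convention, and the four-group categorisation could be sketched as follows (illustrative; the function names are our own):

```python
def percentage_difference(f_a, f_b):
    """Symmetric percentage difference (Eq. 3); two matching
    zero-valued features are defined as having no variation."""
    if f_a == 0 and f_b == 0:
        return 0.0
    # Note: opposite-signed values cancelling the denominator would
    # need separate handling in a production implementation.
    return 100.0 * abs((f_a - f_b) / ((f_a + f_b) / 2.0))

def difference_group(p_d):
    """Map P_d onto the four colour-coded reporting groups."""
    if p_d < 0.5:
        return "green"
    if p_d <= 3:
        return "yellow"
    if p_d <= 10:
        return "orange"
    return "red"
```

The symmetric denominator (the mean of the two values) means neither extraction is privileged as the reference when computing \(P_d\).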

Hierarchical clustering

With the feature data, we performed hierarchical clustering using R software (ver. 4.2.1, https://www.r-project.org) with the following packages: cluster17 and dendextend18. For each extraction, features were first scaled using the Z-score. Hierarchical clustering was then performed using complete-linkage clustering and visualised using dendrograms, which were compared using dendextend. Entanglement between two trees was measured (using dendextend) with the L norm value set to 1. This is a gauge of cluster similarity between 0 and 1, where no entanglement (i.e. 0) is found for matching cluster ordering.
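As a rough illustration of the clustering step (the study itself used R’s cluster and dendextend packages), the following pure-Python sketch Z-scores one feature column and records the merge order of a naive complete-linkage agglomeration:

```python
def z_score(column):
    """Scale one feature column to zero mean, unit (population) SD."""
    m = sum(column) / len(column)
    sd = (sum((v - m) ** 2 for v in column) / len(column)) ** 0.5
    return [(v - m) / sd for v in column]

def complete_linkage(points):
    """Naive agglomeration with complete linkage: repeatedly merge the
    two clusters whose farthest pair of members is closest, recording
    each merge as a frozenset of point indices."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    clusters = [frozenset([i]) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: distance between farthest members.
                d = max(dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        merges.append(merged)
    return merges
```

Because the merge order depends on relative distances between Z-scored feature vectors, even small shifts in a few raw feature values can reorder early merges and so change the resulting dendrogram.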

Experiment 1

Differences in feature values (\(P_d\)) between the 4 import software packages and the baseline extraction from NIfTI are visualised in Fig. 3. For a given heatmap, each column corresponds to 1 of the 51 patients, and each row to one of 158 radiomic features, with colour categories as outlined above. Patient ranking with \(\rho\) was assessed for every feature, for each of the 3 modalities. Ranking remained highly consistent across all tests, despite measured mask discrepancies and raw feature value differences, with no result \(\rho <0.9\) (see Fig. 13 in Supplementary Materials Section C). A high level of consistency between features extracted using different mask generation approaches was also confirmed by the intraclass correlation coefficient analysis, with no result \(ICC<0.9\) (see Fig. 14 in Supplementary Materials Section D).

Figure 3 also visualises hierarchical clustering changes due to import software choice. The left dendrogram is the baseline clustering, which is visually compared to the dendrogram resulting from the same clustering performed on feature data from a given import software (A, B, C and D). It can be noted that small differences in raw feature values due to mask discrepancy have the potential to significantly affect unsupervised patient clustering. The comparison of dendrograms, referred to as a tanglegram18, highlights unique nodes with dashed lines that contain combinations not present in the other tree. Connected lines are coloured to show common sub-trees between the two clusterings.

For the CT data import, mask generation and feature extraction remained highly consistent across all software compared to the baseline, with only 2 small mask discrepancies that had no effect on subsequent normalisation and hierarchical clustering of features at the cohort level. For the MRI data import, software B had 36/51 patients with voxel discrepancy (\(V_d\)) in the masks, which led to many small feature variations and a significant difference to hierarchical clustering (entanglement = 0.57). For the PET data import, both software B and C measured \(V_d\) in 16/51 and 36/51 masks respectively, which also led to measurable feature variation and changes to hierarchical clustering. Detailed, full size versions of all heatmaps including specific feature tags can be found in the Supplementary Materials Section B.

In general, the greater the \(V_d\), the more features varied from the baseline result. This is illustrated in Fig. 4. For each patient, we plotted the number of features with \(P_d>\) 0.5% against \(V_d\), which yields a positive trend. It can be noted that many points lie at the origin, where masks exactly matched the baseline NIfTI version. Notably, this is the case for all patients from Software A and D. Two main types of mask discrepancy were observed: (1) discrepancy occurred at the superior and inferior planes of the masks, resulting in an additional slice in one of the masks; (2) alternatively, discrepancies occurred throughout the slices at points along the edge of the region.

Figure 4

Voxel discrepancy in the mask (\(V_d\)) compared to number of features showing variation \(P_d>0.5\%\). Each point corresponds to a patient. There is a separate plot for each imaging modality (CT, MRI, PET). Each software is colour-coded (green: A, red: B, blue: C, and black: D). As software A and D measured no mask discrepancy in any patients for any modality compared to the baseline, all points lie at the origin.

In Fig. 5, we visualise a representative example of mask discrepancy for a PET imaging case. The mask generated by Software C is compared to the corresponding mask provided as a NIfTI. A montage of all relevant mask slices is compared. The difference map (\({\textbf{d}}_{map}\)) is visualised along 3 selected slices in greater detail. In \({\textbf{d}}_{map}\), red indicates voxels in the NIfTI mask that are not in the generated mask, while green indicates voxels in the generated mask that are not in the NIfTI mask. These discrepancies lead to differences in the image voxel intensities that are included for feature analysis.

Figure 5

Representative example of mask discrepancy when importing with different software (Patient 5, PET). DICOM RTSTRUCT was converted to a mask by Software C and compared to the corresponding NIfTI. The top row shows montages of all mask slices (cropped for visualisation). Top right: difference map \({\textbf{d}}_{map}\) (red: voxel in NIfTI mask and not in generated mask; green: voxel in generated mask and not in NIfTI mask). Three slices (i, ii, and iii) are highlighted on the montages with coloured borders, and a zoomed comparison is provided in the bottom two rows, alongside the PET image for that slice.

Experiment 2

Feature discrepancy due to mask generation with super-sampling is visualised with a range of heatmaps in Fig. 6. We report 5 heatmaps for each modality: CT (Fig. 6a), MR (Fig. 6b) and PET (Fig. 6c). Results from each super-sampled voxel inclusion threshold (10–90% in steps of 20%) were compared to the equivalent radiomic analysis extracted using the NIfTI files. It can be noted that the raw values for many features vary significantly (\(P_d >10\)%) when using a super-sampling strategy. The threshold chosen has a clear systematic influence on the features and masks, with more features changing as the threshold setting moves away from 50% sub-voxel inclusion. The use of stricter or more lenient thresholds (e.g. 90% or 10%, respectively) leads to erosion or dilation effects at the mask boundary, resulting in large numbers of voxels being excluded or included. We demonstrate this effect for PET in Supplementary Materials Section E. Mask differences due to super-sampling had a particularly pronounced effect on features in the PET imaging, as these images contain fewer voxels within the ROI on average because of their lower resolution. Notably, in this particular case some threshold settings yield very similar or identical feature difference results (e.g. heatmaps corresponding to the 30% and 50% settings in Fig. 6c), because very similar or identical masks were generated.

Experiment 3

Using the mixed import dataset, changes in hierarchical clustering were compared at the family level for all 3 imaging modalities. The results are shown in Fig. 7 for PET, where the impact of mask generation algorithms was found to be highest. The discrepancy and entanglement of the family clusters make it clear that a mixed import with different mask generation strategies can be enough to severely affect the patient groupings based on their radiomic feature values, even though the underlying data remains the same. The results for CT and MRI are reported in Supplementary Materials Section F.

With clustering, features are normalised against other patients in the cohort; the effect of raw feature differences in some patients caused by mask discrepancy is thus distributed across the normalised feature values of the cohort. If mask discrepancy occurs in patients with outlier feature values, this can have a greater impact. This is an important consideration, as the underlying segmentation data is not different here (e.g. from a different automated method or a different clinician); it is purely an effect of the chosen import software’s mask generation strategy. In multi-centre studies where the import strategy may not be homogeneous, this variability should be a concern. Robustness strategies that remove features overly sensitive to slight variations of the mask boundary are a clear necessity.

Discussion

This study assessed the sensitivity of standardised radiomics algorithms to binary mask conversion from DICOM RTSTRUCT data using different imaging software. We integrated our standardised radiomics platform with four imaging software packages to access their internal representation of volumes and structures to calculate imaging features. We preliminarily tested all four packages against IBSI reference values. All four produced images and masks that led to features rigorously compliant with the benchmarks; thus, their integration with our tool can be considered IBSI compliant. Despite this, we show in this study that the same raw imaging features are not always achieved with a larger collection of data, and this was due to mask discrepancy.

With the exception of one case, where 53 features were affected by 1% voxel discrepancy in the mask, CT mask conversion was consistent across all four software packages, resulting in exact agreement in feature values. However, we found that MRI and PET masks varied from the baseline for some software (B and C). The magnitude of discrepancy in the mask is patient and ROI specific, and the form of discrepancy was not consistent across patients. In other words, converting polygon contours led to mask differences in some cases, while in others the masks were exactly the same. This is clearly visualised in Fig. 3, where many heatmap columns remain entirely green (matching masks), whilst others show feature variation due to mask difference. We found discrepancy can manifest as an extension of the mask, superiorly and/or inferiorly, or as voxel differences at the edge of the ROI throughout the volume (Fig. 5). However, two of the software packages tested (A and D) produced masks that matched the baseline completely and consequently showed no feature difference.

This work has highlighted that the magnitude of feature difference due to mask discrepancy is also modality and volume dependent. PET imaging in particular showed greater raw value variation. In general, PET images contain fewer voxels due to their lower resolution. We found that smaller volumes or low resolution scans with fewer voxels are more affected if mask discrepancy occurs. This is intuitive, as each individual voxel constitutes a greater proportion of the total volume. A difference of a few voxels in a CT ROI containing several hundred voxels does not lead to a large feature value change. However, a small PET volume with lower resolution may only contain a few tens of voxels, making the relative impact of voxel loss on feature computation higher. It is worth noting that in this work images and masks were not resampled with interpolation, and features were extracted from the original image dimensions passed to the feature extraction software. This was done to focus purely on the mask generation discrepancy, which is the gap in knowledge that this paper addresses.

In Experiment 1, we have shown that despite a very high level of ICC between features extracted from software implementing different polygon to mask conversion algorithms, changes in hierarchical clustering can still occur due to import software choice alone. This is non-trivial and highlights that rankings, as well as outcomes, can be affected by various complex interactions not fully captured by measurements of reliability or consistency.

In Experiment 2, we further demonstrated systematically the effect of super-sampling mask generation on radiomic features. Inherently, masks are different when this strategy is used, and features are not stable under different hyperparameters. As such, we recommend that super-sampling and non-super-sampling mask generation methods should not be combined in a single radiomics study. We provide clear evidence that a consistent mask conversion strategy is required for completely replicable results on the same data. This is notable in the context of large multi-centre and federated learning studies, where the mask conversion strategy may not be consistent across participating institutions, which we simulated in Experiment 3.

To address the issues documented in this work, we recommend that the type of polygon to mask conversion algorithm implemented in medical image processing applications (for both clinical and research use) be published as part of the software documentation. This will help harmonise the mask generation task and enable the definition of selection criteria for the inclusion of software packages when designing future quantitative image analysis studies. In addition, the use of the DICOM Segmentation (DICOM-SEG) standard, which can include a binary mask as the encoded type of segmentation data, would be recommended to store and exchange contours, ensuring interoperability and compatibility between different systems and applications. At the heart of this issue, radiomics modelling should select imaging features that are not susceptible to small discrepancies in the mask. This work complements the findings of Zwanenburg et al.12, who recommended deliberate image and mask perturbation strategies as a form of robustness testing for radiomic features. The degree of mask discrepancy we measured from different software imports alone is important in the context of standardisation.

One of the limitations of the study is that we did not assess the impact on feature computation of the different feature parameters, such as the different values of angle and distance in the case of the Gray Level Co-occurrence Matrix (GLCM). Furthermore, since this work focused on polygon to mask conversion, we did not investigate whether the different physical or computational processes and parameters behind the generation of the imaging modalities could be related to the variations found. Another limitation is that we used a publicly available dataset that included only one type of cancer. However, the focus of this study was on polygon to mask conversion, and we have demonstrated the importance of this issue using 4 different independent software implementations, 3 different APIs and 2 programming languages for the same baseline radiomics code applied to 3 different imaging modalities. While this work focused on the application of engineered (also known as handcrafted) radiomic features, future investigations could explore the impact that polygon to mask conversion could have on the learning process of automatic features, such as those based on deep learning, and determine whether neural networks can compensate for such variations.

In conclusion, we assessed the sensitivity of standardised radiomic features to the process of converting contours to binary masks, which differs across both commercial and research-based medical imaging software. Our findings show that mask generation does have a significant impact on raw feature values and on patient clustering.

Figure 6

Overview of feature difference due to mask super-sampling threshold settings. In each case, the features are compared to the corresponding extraction with the NIfTI format image and mask. There is a heatmap quantifying feature differences (15 in total) for CT (a), MR (b) and PET (c), for each grid super-sampling threshold setting: 10%, 30%, 50%, 70% and 90%. For each heatmap, each column corresponds to 1 of 51 patients, and each row corresponds to a feature. Features were categorised into 4 groups (0–0.5%: green, 0.5–3%: yellow, 3–10%: orange, and > 10%: red) based on percentage difference with the baseline extraction. Stacked bar charts summarise % variations divided into the 9 feature families.

Figure 7
figure 7

Comparison of hierarchical clustering of patients for different families of features, using the mixed dataset (combining different import strategies) compared to the baseline. The 9 feature families: Morphology (morph), Intensity based Statistics (stat), Intensity Histogram (ih), Gray Level Co-occurrence Matrix (cm), Gray Level Run Length Matrix (rlm), Gray Level Size Zone Matrix (szm), Gray Level Distance Zone Matrix (dzm), Neighbouring Gray Tone Difference Matrix (ngt), and Neighbouring Gray Level Dependence Matrix (ngl). Hierarchical clustering using the baseline features (left dendrogram) was compared to the mixed results (right dendrogram), for PET. Clustering was greatly impacted by the lack of a consistent mask generation strategy, compared to the baseline data.
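The kind of dendrogram comparison shown in Figure 7 can be sketched with SciPy on synthetic data (our own illustration; SPAARC's actual clustering settings and the real feature values are not reproduced here):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Synthetic stand-in: 51 patients x 16 features for one feature family
baseline = rng.normal(size=(51, 16))
# Perturbed values, mimicking a mixed (inconsistent) mask-generation strategy
mixed = baseline + rng.normal(scale=0.3, size=baseline.shape)

# Hierarchical (Ward) clustering of patients for each feature set
Z_base = linkage(pdist(baseline), method="ward")
Z_mixed = linkage(pdist(mixed), method="ward")

# Cut both dendrograms into the same number of clusters and compare memberships
labels_base = fcluster(Z_base, t=3, criterion="maxclust")
labels_mixed = fcluster(Z_mixed, t=3, criterion="maxclust")
agreement = np.mean(labels_base == labels_mixed)
```

Note that directly comparing cluster labels is only indicative, since cluster numbering is arbitrary; permutation-invariant measures such as the adjusted Rand index are more robust for this purpose.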

Methods

Imaging data

For this study, we opted to use the dataset from the validation phase of the IBSI study7. This cohort contains 51 patients with soft-tissue sarcoma (STS) who underwent 3 types of volumetric imaging: computed tomography (CT), T1-weighted magnetic resonance imaging (MRI) and fluorine 18-fluorodeoxyglucose positron emission tomography (\(^{18}\)F-FDG PET). The dataset originates from Vallières et al.19 and was made publicly available on The Cancer Imaging Archive20; see their work for more details21. The pre-processed versions used by the IBSI were made available in an online GitHub repository and, importantly for this work, are provided in both DICOM and Neuroimaging Informatics Technology Initiative (NIfTI) formats (https://github.com/theibsi/data_sets).

Imaging software

In order to test independent polygon-to-mask conversion algorithms, the STS DICOM imaging dataset was imported into a range of applications that satisfied the following requirements: (1) available in either the research or clinical domain, (2) compliant with the DICOM standard, including the RTSTRUCT modality, (3) providing a suitable application programming interface (API) that allows internal data (images and binary masks) to be accessed by third-party applications such as our quantitative image analysis software. The following medical imaging applications were used in this study: (a) computational environment for radiological research (CERR) (commit 278f008): an open-source, Matlab-based platform for working with medical imaging data. CERR data import functionality creates an extensible and accessible data structure in a .mat file, convenient for prototyping algorithms and conducting research in medical imaging within the Matlab ecosystem22,23; (b) MIM (v7.1.5, MIM Software Inc., Beachwood, OH): a commercial imaging product with an optional MIM Extensions component, allowing users to develop extensions via a Matlab API, in which images and ROI structures can be passed from MIM to a Matlab environment for processing; (c) Medical Interactive Creative Environment (MICE) Toolkit (v2022.4.9, NONPI Medical AB): a graphical programming interface for performing image analysis using workflows. MICE enables creation of custom plugin nodes to run Matlab or Python on imaging imported into a MICE database; (d) Velocity (v4.1.2.1194, Varian Medical Systems, Palo Alto, CA): a commercial product with a focus on cancer imaging solutions. Velocity has a Python API called VelocityEngine24 that facilitates scripting over databases. This API was used to collect imaging and mask data from scans loaded into a designated workstation database.
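To illustrate why independent implementations can disagree, the sketch below (our own illustration, not code from any of the listed applications) rasterises the same polygon contour with two common strategies: including a pixel when its centre lies inside the contour, versus super-sampling each pixel on a sub-grid and applying an inclusion-fraction threshold. Boundary pixels can be classified differently by the two strategies, which is the source of the mask discrepancies studied here.

```python
import numpy as np

def point_in_polygon(px, py, poly):
    """Even-odd rule point-in-polygon test (poly: list of (x, y) vertices)."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > py) != (y2 > py):  # edge crosses the horizontal ray at py
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

def rasterise_centres(poly, shape):
    """Include a pixel when its centre lies inside the contour."""
    mask = np.zeros(shape, dtype=bool)
    for j in range(shape[0]):
        for i in range(shape[1]):
            mask[j, i] = point_in_polygon(i + 0.5, j + 0.5, poly)
    return mask

def rasterise_supersampled(poly, shape, grid=4, threshold=0.5):
    """Include a pixel when the inside-fraction of a grid x grid sub-sampling
    exceeds the threshold (the setting varied in Figure 6)."""
    mask = np.zeros(shape, dtype=bool)
    offs = (np.arange(grid) + 0.5) / grid  # sub-sample offsets within a pixel
    for j in range(shape[0]):
        for i in range(shape[1]):
            hits = sum(point_in_polygon(i + dx, j + dy, poly)
                       for dy in offs for dx in offs)
            mask[j, i] = hits / grid**2 > threshold
    return mask

# A triangular "contour" on a 10x10 pixel grid
tri = [(1.2, 1.1), (8.7, 2.4), (4.1, 8.9)]
m_centre = rasterise_centres(tri, (10, 10))
m_super = rasterise_supersampled(tri, (10, 10), threshold=0.5)
print(m_centre.sum(), m_super.sum())  # pixel counts may differ at the boundary
```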
For reporting our experiments in this manuscript, we randomly anonymised the above applications under the labels Software A, Software B, Software C and Software D.

Feature extraction

Radiomic feature extraction was performed with the Spaarc Pipeline for Automated Analysis and Radiomics Computing (SPAARC) (https://www.spaarc-radiomics.io), developed at our institution through active participation in the IBSI. With SPAARC we tested a total of 158 standardised features, comprising: 23 Morphology, 18 Intensity based Statistics, 23 Intensity Histogram, 25 Gray Level Co-occurrence Matrix, 16 Gray Level Run Length Matrix, 16 Gray Level Size Zone Matrix, 16 Gray Level Distance Zone Matrix, 5 Neighbouring Gray Tone Difference Matrix, and 16 Neighbouring Gray Level Dependence Matrix.

The SPAARC radiomics pipeline is implemented in 2 separate languages (MATLAB and Python), and both versions rigorously match the IBSI benchmarks for the features tested in this study. The MATLAB version of SPAARC (v1.8.1) was run in MATLAB 2021a; in addition, the MATLAB code natively supports the data format of CERR. The Python version of SPAARC was built in Python 3.9.10.

SPAARC requires 4 inputs: an image matrix, a mask matrix, image metadata (i.e., voxel resolution and image orientation), and a configuration file detailing the feature extraction settings. The configuration file is provided in JavaScript Object Notation (JSON), an open, language-independent and human-readable data interchange format. For each of the medical imaging applications described in the Methods section, a plugin was written to access the 4 inputs via the application's specific API and make them available to SPAARC. The masks that were passed to SPAARC from each platform were also saved for further analysis and assessment of differences in their generation from RTSTRUCT.
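As an illustration, loading a modality-specific JSON configuration might look like the following (a hypothetical sketch; the key names and values are ours and do not reflect SPAARC's actual schema):

```python
import json

# Illustrative extraction configuration for one modality; in practice one such
# file per modality would be shared across all software plugins.
config_json = """
{
  "modality": "PET",
  "discretisation": {"method": "fixed_bin_size", "bin_width": 0.25},
  "resampling": null,
  "feature_families": ["morph", "stat", "ih", "cm", "rlm",
                       "szm", "dzm", "ngt", "ngl"]
}
"""
config = json.loads(config_json)
print(config["modality"], len(config["feature_families"]))  # PET 9
```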

For each of the 3 imaging modalities, the same configuration file defined the settings used across all plugins (3 configuration files in total). Furthermore, feature extraction was repeated directly on the available NIfTI versions of the data, which provided the baseline feature values for this study. To focus purely on mask generation discrepancies, images and masks were not resampled with interpolation, so features were extracted at the original image dimensions passed to the software. The precise feature extraction settings used for each modality, and the configuration files, can be found in the Supplementary Materials Section A.

We preliminarily tested all 4 software integrations with SPAARC against the IBSI reference values, which were derived from a digital phantom and a lung CT image7. For each software, all benchmarks were met for the features tested in this work; hence, these software integrations can be considered IBSI compliant.