Multi-view convolutional neural networks for automated ocular structure and tumor segmentation in retinoblastoma

In retinoblastoma, accurate segmentation of ocular structure and tumor tissue is important when working towards personalized treatment. This retrospective study serves to evaluate the performance of multi-view convolutional neural networks (MV-CNNs) for automated eye and tumor segmentation on MRI in retinoblastoma patients. Forty retinoblastoma and 20 healthy-eyes from 30 patients were included in a train/test (N = 29 retinoblastoma-, 17 healthy-eyes) and independent validation (N = 11 retinoblastoma-, 3 healthy-eyes) set. Imaging was done using 3.0 T Fast Imaging Employing Steady-state Acquisition (FIESTA), T2-weighted and contrast-enhanced T1-weighted sequences. Sclera, vitreous humour, lens, retinal detachment and tumor were manually delineated on FIESTA images to serve as a reference standard. Volumetric and spatial performance were assessed by calculating intra-class correlation (ICC) and dice similarity coefficient (DSC). Additionally, the effects of multi-scale, sequences and data augmentation were explored. Optimal performance was obtained by using a three-level pyramid MV-CNN with FIESTA, T2 and T1c sequences and data augmentation. Eye and tumor volumetric ICC were 0.997 and 0.996, respectively. Median [Interquartile range] DSC for eye, sclera, vitreous, lens, retinal detachment and tumor were 0.965 [0.950–0.975], 0.847 [0.782–0.893], 0.975 [0.930–0.986], 0.909 [0.847–0.951], 0.828 [0.458–0.962] and 0.914 [0.852–0.958], respectively. MV-CNN can be used to obtain accurate ocular structure and tumor segmentations in retinoblastoma.

www.nature.com/scientificreports/ algorithm in the presence of tumor tissue. U-nets, on the other hand, have the disadvantage of a limited field of view (in the 2D-case) 16 or do require an extraordinary amount of data (in the 3D-case) 11 for proper training. A one-step solution for segmentation of ocular and tumor structures would greatly simplify the use of radiomics in the clinic and research and, importantly, also has the potential of increasing accuracy. Therefore, the purpose of the current work is to evaluate the performance of multi-view convolutional neural networks (MV-CNN) in RB patients. MV-CNN has been successfully applied in other medical segmentation problems where relatively little data was available (e.g. MS lesions 17,18 lymph nodes 19 ) and is likely to be robust even in a longitudinal manner 20 . In contrast to above-mentioned ASM-based models, MV-CNN allows for multi-class segmentation of healthy and pathological ocular regions in a single step, without the need for feature engineering. Specifically, we train the classifier to discriminate background, sclera, vitreous humour, lens, retinal detachment and tumor. Throughout this work, the retinal detachment is regarded as retina and sub-retinal fluid resulting from retinal detachment. We compared our results to an established 'baseline' model published in literature.

Results
Manually segmented volumes of healthy and RB eyes were on average 5.82 ± 1.22 mL and 5.46 ± 1.17 mL respectively. Median [interquartile range (IQR)] manual tumor volume was 0.88 [0.53-1.61] mL. An overview of performances for different MV-CNN model alterations is given in Supplementary Table S1. The MV-CNN model that used all sequences (Fast Imaging Employing Steady-state Acquisition; FIESTA, T2 and T1c) and multi-scale information showed the highest volumetric and spatial performance in 4 out of 6 classes. This MV-CNN configuration was regarded as best performing model and is further reported below.
For the best performing MV-CNN model, healthy and RB eye volumes were on average 6.15 ± 1.27 mL and 5.82 ± 1. 22  Tumor size dependency. Terciles were used to group results into small (< 0.55 mL; N = 10), medium (> 0.55 mL and < 1.51 mL; N = 9) and large tumors (> 1.51 mL; N = 10). Analysis of these groups showed an average MV-CNN spatial performance of DSC = 0.72 ± 0.36, 0.90 ± 0.04 and 0.92 ± 0.02, respectively. Two very small tumors with a volume of < 0.1 mL were completely missed by the MV-CNN network.

Independent validation set.
In the independent validation set, manually segmented healthy and RB eye volume were on average 5. 30

Discussion
The purpose of the current work was to evaluate the performance of MV-CNN to provide a one-step solution for segmentation of ocular structures and tumor tissue on MR images in RB patients. MV-CNN displayed good volumetric and spatial performance of MV-CNN when compared to manual reference segmentations and an established automated segmentation methodology. These findings were confirmed in an independent validation sample, underlining the practical usability of the approach for automatic delineation and incorporation in a radiomics pipeline.
Our study demonstrated that MV-CNN provides tumor segmentations that have very high volumetric (ICC > 0.99) and spatial consistency (DSC > 0.8) compared with manual delineations. Comparing our tumor segmentation results to other publications should be done with care, since measured performance is highly dependent on the dataset and the reference used for validation. Factors known to influence segmentation performance include the pulse-sequence used, construction of the reference dataset and overall burden. In addition, definition of anatomical regions can be an issue and some studies using other class definitions as compared to the current work. Taking these considerations into account, an overview of ocular segmentation literature is provided in Table 2. Bach Cuadra et al. 21 achieved moderate to high sclera and lens segmentation performance using a parametric model on computed tomography (CT) and ultrasound (US) images of the eye to improve external beam radiotherapy (EBRT) planning for RB. Rüegsegger et al. 13 used an ASM on adult head CT data to further improve segmentation of the eye and lens for RB EBRT planning. Comparing these works with our results is not straightforward for two reasons. First, they were done for the purpose of radiotherapy planning in which safety margins are used depending on the location of the tumor, and thus different boundary criteria and evaluation criteria may be used. Second, these studies used CT for segmentation which has less soft tissue contrast than MRI used in our study. More recent studies constructed segmentations on MRI, for example Ciller et al. 12 and Nguyen et al. 15 used ASM segmentation of healthy ocular structures (sclera, vitreous humour and lens average DSC = 0.949, 0.947 and 0.882, respectively).
Only four studies used deep learning methods based on CNN and U-Net architectures to segment healthy and tumor ocular tissue. First, Ciller et al. expanded their ASM method with an input for an 8-layer 3D CNN to also obtain tumor tissue 10 . At the time, this method served as a new state-of-the-art because it resulted in tumor segmentation performances up to DSC = 0.62. A weak point of the method is however that it depended on two steps requiring feature engineering, as tumor-specific features are used as input for the CNN. Second, Nguyen et al. 11 proposed a single-step 3D U-Net CNN to achieve a reported tumor DSC of 0.59. Third, De Graaf et al. 14 used an ASM as input for a 2D U-Net CNN to segment healthy ocular structures and tumor with DSC = 0.64, respectively. Fourth, Nguyen et al. 16 explored a weakly supervised approach based on class activation maps to train a 2D U-Net CNN to segment UMs in 24 patients with on average DSC = 0.84. However, these methods still use post-processing steps 11 , or need an ASM to provide prior knowledge of the inside eye volume [12][13][14] .
Compared to the previously discussed methods, MV-CNN shows superior spatial performance in tumor segmentation, and similar performance in vitreous humour and lens segmentation. A comparison for sclera segmentation performance is unfortunately less straight forward, because in previous works it was common practice to define sclera as the sum of sclera and vitreous humour. This resulted in considerably larger sclera volumes which positively biased DSC as a performance metric 18 . Considering that the size of the sclera segmentation volume in our definition is almost twice as small compared to the former papers, we argue that the average spatial performance of DSC = 0.84 in our work was very high.
To overcome the difficulties in comparing performance metrics between studies (e.g. due to differences in data set, manual segmentation quality, or class definition), we also compared our results with an established baseline model 12,15 . This direct comparison demonstrated substantial increases of volumetric and spatial performance www.nature.com/scientificreports/ for almost all tissue classes except lens. This ruled out the possibility that our data set or manual reference segmentation biased the results. Several factors may contribute to the superior performance of the MV-CNN network topology. First, the number of parameters versus the number of training samples is more efficient in a 2.5D versus 3D network, which can be beneficial in the presence of limited training data. Secondly, it is believed that the branched architecture of MV-CNN can more effectively learn and propagate higher-level features, when compared to a U-Net architecture. This is because during the down-sampling procedure, details specific to informative branches can vanish when mixed with less informative branches 18 . Finally, MV-CNN uses a multi-scale pyramid representation to integrate contextual information in the segmentation verdict. This is important because it can be argued that anatomical information within the direct vicinity of a query voxel can be of great descriptive value, resolving local ambiguities (e.g. it is unlikely that tumor is detected in or near the lens) 22 . Integration of contextual information is therefore likely to enhance model performance.
During evaluation, we also noticed some issues that may be improved in future work. First, we observed that the ASM segmentations showed generally higher spatial agreement of the lens with the manual reference compared with the MV-CNN. This can probably be explained by the fact that the ASM is superior in dealing with structures that have little shape variability among subjects. Second, we observed that MV-CNN has the tendency of a slight, but systematic, over-estimate of the total eye volume (see Fig. 3). Post-hoc investigation of the segmentation masks showed that this phenomenon is most likely driven by overestimation of the sclera. Three explanations may account for this. First, the manual annotation protocol was very conservative in this area. This may have led to a less optimal ground truth at the edges of the eye. Second, the effect could have been caused by an interpolation artefact due to the 2.5D nature of our kernel. And third, the issue might have been caused by the fact that our loss function was non-weighted. Future studies may resolve the issue by using a 3D kernel at the border or class-weighting.
Our work also has a number of limitations. First, we used very high-quality data (e.g. both in terms of image quality and labels) from only a single scanner for training and evaluation. The current method would require training for every new scanner, which is not practical. Real world applications should be able to handle data from multiple sources, especially in a rare disease such as retinoblastoma. Future work should therefore invest in multi-center labelled data and methods that are able to handle real-world scanner diversity. Second, we did not extensively investigate the effects of class imbalance and loss function. Such class imbalance is intrinsically present in data where malignant tissue is one of the target classes, and may be handled better by other loss functions such as generalized dice or boundary loss 23,24 . Future studies may investigate whether even better performance can be achieved by tuning these aspects. Finally, we did not investigate different network topologies within the MV-CNN branches themselves. It is known that the conventional double convolutional layer may be affected by loss of gradient. This is not the case with several other designs, such as ResNet 25 . Future work may investigate whether alternative branch topologies lead to even better performance.
In conclusion, we validated a multi-view convolutional neural network for automated, single-step segmentation of ocular and pathological structures for MRI in RB, and compared its performance to the current state-ofthe-art. The MV-CNN model demonstrated superior performance when compared to the baseline model, both in terms of volumetric and spatial performance. In addition, we demonstrated the benefit of multi-view networks over axial-view and 3D-view networks for ocular structure and tumor segmentation in retinoblastoma. Our results indicate that MV-CNNs have great potential for further development towards automated segmentation for radiomics applications. Manual reference segmentation. Reference segmentations of ocular structures (sclera, vitreous humour, lens, retinal detachment) and tumor were manually constructed on the 3D FIESTA images by one rater (CdB) using 3D Slicer (Version 4.10.1, MIT, USA) 26 . All reference segmentations were validated by a neuro-radiologist with 14 years of experience (PdG). Manual segmentations were carefully constructed in approximately 10 hours per eye and were considered as ground truth in the analyses. Image preprocessing. Prior to automatic segmentation, images were automatically preprocessed using tools from the Insight Toolkit (ITK; https:// itk. org/) and FMRIB Software Library (FSL; https:// fsl. fmrib. ox. ac. uk/). First, a rough outline of the eye was constructed on each sequence by using the 3D Hough filtering approach implemented in ITK 27 . These masks, extrapolated by a radius of 25 mm, were used as a region of interest for co-registration of the images of a specific subject. The rigid transformations between high-resolution FIESTA and lower resolution T2 and T1c space were obtained using FSL FLIRT. Both transformation matrices were inverted to obtain all 2D sequences in 3D FIESTA space using spline interpolation. Finally, the intensities of each contrast were re-scaled such that image intensities had a mean and variance of 0 and 1, respectively, within the union of 5-mm masks of the left and right eye. with the MV-CNN approach. In summary, the ASM approach previously used 12,15 was retrained on the FIESTA images to segment the sclera, lens and vitreous humour. Subsequently, adopting recent ocular tumor segmentation methods, a 2D U-Net architecture was trained to obtain tumor and retinal detachment masks using the combined FIESTA, T2 and contrast-enhanced T1 as inputs 14,16 . We refer the reader to the Supplementary material for details on 2D U-Net implementation and to Supplementary Fig. S1 online for a schematic representation of the baseline model. These state-of-the art methods for healthy and pathological structures proceed to each structure segmentation separately and as such they need afterwards to combine their outputs to assign one class per voxel. Similarly to previous studies 11,13 , tumor and retinal detachment predictions were constrained to be inside the eye as defined by ASM output of sclera. Moreover, lens was prioritized over retinal detachment and tumor, and retinal detachment was prioritized over tumor. Finally, as ASM segmentation is based on the structure outer contour, the output sclera and vitreous humour are converted to binary masks that include all voxels inside their fitted contours 14,15 . Scleral segmentation is obtained by removing all other structures' subsets, and finally vitreous humour is obtained by removing retinal detachment and tumor from the ASM segmentation.

Materials and methods
Multi-view convolutional neural network. MV-CNN is a network topology that combines information from different views into fully connected layers to classify the voxel where the planes cross. The multi-view approach (see Fig. 1) can be considered as a 2.5D CNN given that it incorporates information from each image plane, but does not use the full 3D neighborhood of the queried voxel. This results in a lower computational complexity when compared to 3D-kernel methods. Multi-scale contextual information is incorporated from different scales λ in a pyramid representation of each patch. Increasing image scale beyond scale 3 was investigated but did not improve segmentation accuracy as the field of view would simply fall outside of the region of interest. One MV-CNN block contains three equally structured network branches for each imaging plane. The input to each branch is a 32 × 32 patch from each MR sequence (FIESTA, T2, T1c; total number of sequences ch = 2 or 3), which are fed as channels, and scales 0, 1 and 2. Here, scale 0 refers to an unaltered patch with no larger-scale contextual information, where the considered scale's reception field is widened by a factor of two for each subsequent pyramid level. Batch normalization is always applied before applying the activation function. Each branch contains two hidden convolutional layers (3 × 3 convolution kernel; activation function: rectified linear unit (ReLu)) with a max-pool layer (2 × 2 max-pool kernel) and a dropout layer (dropout proportion p = 0.25). This is followed by a dense layer with 32 output neurons (activation function: ReLu). The results from each anatomical plane are then concatenated and the procedure is repeated for each scale in parallel. In a similar fashion, results from each scale are then concatenated. Following, another dense layer with 32 output neurons (activation function: ReLu) is used with a dropout layer (p = 0.25). Finally, a dense layer (activation function: softmax) is used for voxel classification.
Experiments. FIESTA and T1c images were used as input for every experiment. In addition, we investigated the effects of the following alterations: (i) addition of T2 to the input; (ii) addition of multi-scale information (i.e., λ = 1,2); and (iii) random left-right mirroring of the input data to facilitate data-augmentation, resulting www.nature.com/scientificreports/ in 8 configurations. Each sub-model was trained once for performing multi-class segmentation in one step. Performances were evaluated by leave-one-subject out cross-validating each possible configuration. To further demonstrate the benefits of MV-CNN, post-hoc analyses were done including axial-view and 3D-view networks.
Here, the exact same settings and architecture were used as the proposed best MV-CNN sub-model (inclusion of T2, multi-scale (λ = 2), no left/right mirroring), but differed only at input-level (axial-view: one 32 × 32 viewbranch and three context branches; 3D: one 32 × 32 × 32 view-branch and three context branches).
Model training. MV-CNN training was done on a NVIDIA GeForce GTX 1080 TI graphics processor unit (GPU) using the GPU-version of TensorFlow version 1.9.0 with Cuda 9.0 and Python 3.6.9. TensorBoard (Version 0.4.0) callback was used for tracking training and validation scores. Categorical cross-entropy was used as a loss function for multi-class segmentation: Here, p(a) represents a reference distribution of a ∈ A given by the manual annotations, where q(a) is a query distribution and A is a set of observations. c ∈ [0, 1, . . . , C] denotes class indices. The loss function was minimized for 50 epochs (batch size = 64) using the ADAM optimizer 28 . A random sub-set of 5% of all training voxels was sampled to reduce computational demand and thereby accelerate training, and random reshuffling of samples was done to allow for varied training. Dropout was switched off at test time. Statistical analysis of model performance. The performance of each MV-CNN model and the baseline were assessed using leave-one-subject-out cross-validation (i.e., K-fold cross-validation, where N = 23 subjects). Performance was measured by quantifying volumetric and spatial agreement. Volumetric agreement was quantified by calculating the intra-class correlation coefficient (ICC; single measure and absolute agreement 29 ) and spatial agreement was quantified by calculating DSC: where A and B are sets that refer to the manual reference and segmentation of interest, respectively. Because DSC measures were not normally distributed upon histogram inspection, differences in spatial performance were evaluated by a two-sided Wilcoxon rank signed test. Bonferroni correction was applied to account for multiple comparisons. Since the DSC spatial performance measure is dependent on size of the underlying burden 30 , additionally, spatial performance was grouped according to tumor size.

Independent validation set.
The MV-CNN model reaching best performance was additionally evaluated in an independent validation set that consisted of 7 subjects (mean age: 16.0 ± 17.8 months, range  months), with 3 healthy and 11 RB eyes. The images of these subjects were acquired on the same scanner and using the same imaging protocols as specified above.  www.nature.com/scientificreports/

Data availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author upon reasonable request.