iW-Net: an automatic and minimalistic interactive lung nodule segmentation deep network

We propose iW-Net, a deep learning model that allows for both automatic and interactive segmentation of lung nodules in computed tomography images. iW-Net is composed of two blocks: the first provides an automatic segmentation and the second allows the user to correct it by analyzing 2 points that the user places on the nodule's boundary. For this purpose, we propose a physics-inspired weight map that takes the user input into account and is used both as a feature map and in the system's loss function. Our approach is extensively evaluated on the public LIDC-IDRI dataset, where we achieve a state-of-the-art performance of 0.55 intersection over union vs the 0.59 inter-observer agreement. We also show that iW-Net allows the user to correct the segmentation of small nodules, which is essential for proper patient referral decisions, and to improve the segmentation of the challenging non-solid nodules; it may thus be an important tool for increasing the early diagnosis of lung cancer.

predictions, inside a cube containing a lung nodule, of the abnormal tissue. Each predicted voxel corresponds to the center of a fixed-size patch to be processed by the network, and thus predicting an entire segmentation requires the evaluation of a large number of patches. Furthermore, this model has an inherent lack of global context, since the network only evaluates patches, and thus the 3D reconstruction of the nodule may be affected. A common solution is to adapt 3D U-Net 13 architectures, since they capture both local and global context. With this in mind, Wu et al. 14 proposed a multi-task scheme for pulmonary nodule segmentation together with the prediction of the nodules' expected malignancy, achieving state-of-the-art performance in both tasks. This malignancy prediction is performed by concatenating the features of the segmentation network's bottleneck with a convolved version of the produced segmentation prediction and processing them via a set of fully-connected layers.
Despite the high performance of deep learning methods, their application in the medical field has been criticized due to (1) the inherent lack of explanations behind the decision and (2) the production of deterministic outputs, which ignores the existing inter-observer variability of the annotations and prevents the medical specialist from interacting with and changing the decisions of the system. With this in mind, Kohl et al. 15 proposed to model the inter-observer variability by combining a conditional variational auto-encoder (cVAE) with a U-Net. The cVAE is used for drawing a set of feature maps sampled from the trained latent space representation. These features are then concatenated with the last feature maps of the U-Net, which are then convolved to produce the segmentation output. By varying the set of features sampled from the cVAE, this model is capable of producing different, yet plausible, nodule segmentations. However, the method of Kohl et al. does not allow the clinician to alter the segmentation, instead forcing the specialist to opt for the result closest to his/her expectations.
Recently, Wang et al. 16 proposed a scribble-based approach to refine 2D and 3D segmentations resulting from a fully-convolutional neural network. First, the user selects a bounding box containing the anatomical structure to segment. For each unseen image, the top of a pre-trained segmentation model is trained to accommodate the foreground and background scribbles by minimizing, via an expectation-maximization (EM) approach, a loss function composed of two terms: (1) a pixel-wise weighted categorical cross-entropy term that prioritizes the inclusion of foreground and the removal of background scribbles, and (2) a pair-wise smoothness term that encourages the aggregation of neighboring pixels of similar intensity 17. Even though this scheme achieves state-of-the-art results on organ segmentation in MRI images, its application to lung nodule segmentation is limited by the nature of the abnormalities. For instance, nodules are often attached to structures of similar intensity, such as the pleural wall and blood vessels, so the EM scheme may include these structures in the segmentation and potentially demand extra manual correction effort. Also, sub-solid and non-solid nodules do not have a clear boundary, which can further hinder the minimization of the smoothness term.
With this in mind, we propose an end-to-end deep learning scheme, iW-Net (interactive W-Net), that allows for both automatic and optional interactive 3D lung nodule segmentation, as suggested in Fig. 1. The network receives as input a cube of fixed dimensions whose centroid is indicated by the user, or by an automatic nodule detection framework, and proposes a corresponding segmentation. If the user is not satisfied, the segmentation can be corrected by using the end-points of a manually inserted stroke along the nodule's diameter. For this purpose, we use a second segmentation network that integrates the 3D image of the nodule, the initial segmentation and the coordinates of the end-points. Namely, this paper shows that the end-points can be represented by a physics-inspired weight map that, when used as a feature map and as a loss function term, allows the model to approach the inter-observer agreement in the LIDC-IDRI public dataset. Our approach allows a simple and fast segmentation correction when that information is available, without introducing a significant overhead in comparison to the non-guided version of the model.

Results and Discussion
Experiment 1: comparison with 3D U-Net. iW-Net without user interaction outperforms the baseline 3D U-Net 13. As shown in Table 1, the nodule segmentation performance shows a relative increase of approximately 26% (IoU of 0.48 vs 0.38) while the number of parameters is reduced by a factor of 12. In fact, the reduction of the size of the network contributed to the disparity between these IoUs by making it possible to increase the batch size during training. The larger batch size allows for a more robust batch normalization, easing the error's back-propagation and thus improving the convergence and performance of iW-Net in comparison to 3D U-Net.
As expected, iW-Net's prediction without user interaction tends to be better for larger nodules (see Fig. 2A). Indeed, since most segmentation errors occur near the nodules' boundary, smaller nodules, which have a higher surface-area-to-volume ratio, should be more challenging. Interestingly, the inter-observer agreement follows the same tendency, indicating that smaller nodules are particularly difficult to segment.
Experiment 2: user interaction assessment. The proposed simplistic user interaction approach improves the baseline segmentation in more than 75% of the cases. Figure 3 depicts examples where iW-Net significantly alters the 3D shape of the segmentation just by the introduction of two points, being capable of correcting, at least partially, poor segmentations (middle) as well as changing the orientation of the proposed region of interest (right). In fact, 44% of the user-introduced points are inside the new segmentations, further showing the tendency of iW-Net to alter the shape of the segmentation. Also, as detailed in Table 2 and Fig. 2B, iW-Net especially enables the delineation correction of the challenging non-solid nodules.
Our proposed approach also has promising results for computer-aided lung cancer screening. As depicted in Fig. 2C, the radius range [1, 4] mm is where iW-Net (user supervised) most improves the quality of the nodules' segmentation. Importantly, several international lung cancer screening guidelines, such as LUNG-RADS 3, point to this size range as essential to classify a nodule as either benign or malignant.
iW-Net with the simulated user interaction improves over the baseline for nodules of different dimensions and textures, as summarized in Figs 2B,C and 3. However, the achieved IoU is still, on average, 0.04 lower than the inter-observer agreement. A possible reason for this is that, due to the variability of the ground-truth in the data (i.e. several segmentations for the same nodule), the network is likely to learn an average segmentation in order to minimize the loss over the redundant training images. Also, during the segmentation correction we always select the two farthest points on the nodule boundary. This is a challenging scenario since there is no guarantee that the selected points lie in the direction in which the segmentation needs to be corrected. Instead, we assume that providing an estimate of the nodule's largest axis is sufficient to improve the segmentation.

Table 1. Intersection over Union (IoU) ± the standard deviation of the prediction of the first block of iW-Net in comparison to a 3D U-Net and the inter-observer agreement.

| Method | IoU | Number of parameters |
|---|---|---|
| Inter-observer | 0.59 ± 0.14 | - |
| 3D U-Net 13 | 0.38 ± 0.08 | 19 080 001 |
| iW-Net first block | 0.48 ± 0.19 | 1 592 093 |

www.nature.com/scientificreports
Despite always using the two farthest points to correct the segmentation, iW-Net improves the baseline segmentation's ASD for all nodule types by 24% (Fig. 2D). Namely, the baseline's average ASD is 1.09 and that of the corrected segmentation is 0.827, meaning that iW-Net has a segmentation error that is on average less than 1 voxel. Also, similarly to the IoU's behavior, the simplistic user interaction significantly improves the quality of the nodules' segmentation in non-solid and sub-solid abnormalities.
Comparison with other approaches. iW-Net achieves a performance on par with the inter-observer agreement, similarly to other state-of-the-art approaches. Note that making a direct comparison between the approaches is non-trivial since (1) there is great variation in the size of the test set, the type and size of the nodules used, and the minimum inter-observer agreement; (2) different methods use different voxel scales, and the inherent re-sampling affects the shape of the ground-truth; (3) there are different ways of combining the ground-truth annotations from the different observers (using all, the average or the median, for instance) to produce the final evaluation mask. Nevertheless, for reference, Table 3 shows the achieved IoUs of different approaches on the LIDC-IDRI dataset. The performance of our method is close to the inter-observer agreement even though a significantly larger number of samples has been studied. Advantageously, iW-Net does not rely on computationally heavy pre-processing steps and can segment nodules of all sizes and textures without the need to define bounding boxes or other specific parameters. In fact, the average inference time per nodule is only 0.12 ± 0.08 s. Furthermore, our system allows segmentations to be corrected without requiring an external algorithm, which none of the others do. Finally, unlike the model of Wu et al. 14, training iW-Net does not require other metadata, making it easier to enrich the training set and thus improve the generalization capability of the system.

Hyper-parameters.
The best performing set of parameters, selected by random search, is γ = 0.59, p = 0.44 and λ 1 = 0.68. These values achieve an average validation IoU of 0.59 in the first train/test split. Intuitively, a p near 0.5 (see Fig. 4B) creates a weight map that prioritizes the inclusion of the points and the respective connection region without overspreading (Fig. 4A) or over-emphasizing the points (Fig. 4C,D). Likewise, the found γ gives the binarized weight map an ellipsoidal structure, following the approximate shape of most of the nodules. Finally, λ 1 balances the contribution of the initial manual segmentation and the added weight map during model training. In the limit where λ 1 = 0 the network would be trying to approximate the nodule segmentation to an ellipsoid. On the other hand, λ 1 = 0.68 ensures that the manual segmentation is the prioritized target during training and that the weight map (see Fig. 3) is used for local corrections.

Methods
iW-Net makes it easy to correct lung nodule segmentations according to the specialists' perception. As depicted in Fig. 5, iW-Net first performs (1) an automatic 3D segmentation of lung nodules, predicted by the first block (i.e. U) of the network, followed by (2) an optional segmentation correction, which is performed by the second block of the model by processing the end-points of a manually drawn nodule diameter. For this, we propose a pixel-wise weight map to guide the segmentation, as detailed in Section Weight map for segmentation control. This weight map is then used as a feature map of iW-Net and in a loss function term to train an auto-encoder segmentation network, as described in Sections iW-Net for nodule segmentation and Loss function.
Weight map for segmentation control. Our weight map is inspired by the attraction field generated by point electric charges of opposite sign. Let S define a sphere of undetermined radius, $S = (x_i - x_0)^2 + (y_i - y_0)^2 + (z_i - z_0)^2$, where $(x_0, y_0, z_0)$ is the center of the sphere and $(x_i, y_i, z_i)$ are Cartesian coordinates. The unit-norm gradient field is $\nabla S / \lVert \nabla S \rVert$. The norm of the vectors of this field can be weighted as a function of the distance to the center of the sphere, yielding a field Q in which $p \in \mathbb{R}$ controls the decay of the vectors' magnitude and $a \in \{0, 1\}$ makes the field centripetal or centrifugal, respectively. Then, W = Q 0 + Q 1 is a vector field that moves from Q 0 to Q 1. In our approach, Q 0 and Q 1 correspond to the user-introduced points and the weight map is |W|, a 3D feature map indicating how valuable each voxel is for the segmentation. In terms of magnitude, the weight map has high intensity in the region between the centers of Q 0 and Q 1 and low intensity elsewhere, indicating to the network that the region between the two points is of high interest for the segmentation. Changing p affects the strength of the interaction between the two points, as shown in Fig. 4. Namely, a lower p increases the focus on the central region but also increases its overall volume, whereas a high p leads to more spherical regions of interest surrounding the points. Note that if no points exist, then the weight map is a zero-valued tensor with the same size as the input volume.

Table 3. Average Intersection over Union ± the standard deviation for lung nodule segmentation methods on the LIDC-IDRI dataset, and the reported inter-observer agreement (Inter). NA: information is not available. *Sub-solid nodules only.

iW-Net for nodule segmentation. The proposed nodule segmentation scheme is an adaptation of the 3D U-Net 13. As shown in Fig. 5, iW-Net is composed of two auto-encoders: the first outputs an initial segmentation, which is then used as an input for the second block to produce the corrected segmentation. Each of the auto-encoders has a reduced number of filters in the encoding and decoding parts in comparison to the 3D U-Net, resulting in fewer parameters to tune and thus easing the back-propagation process.
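The weight map described in Section Weight map for segmentation control can be illustrated with a minimal NumPy sketch. The exact weighting of the unit gradient field is not reproduced in this text, so the form used here (unit radial vector divided by distance to the power p), the clamping of distances below one voxel, the normalization of the result to [0, 1] and the function names `point_field` and `weight_map` are all assumptions made for illustration only.

```python
import numpy as np

def point_field(shape, center, p, a):
    """Radial unit field around `center`, scaled by distance ** -p.

    Assumed convention: a = 0 points towards the center (centripetal),
    a = 1 points away from it (centrifugal).
    """
    axes = [np.arange(s) for s in shape]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).astype(float)
    offset = grid - np.asarray(center, dtype=float)   # parallel to the gradient of S
    dist = np.linalg.norm(offset, axis=-1, keepdims=True)
    dist = np.maximum(dist, 1.0)                      # assumption: avoid dividing by zero
    sign = -1.0 if a == 0 else 1.0
    return sign * (offset / dist) / dist ** p

def weight_map(shape, q0=None, q1=None, p=0.44):
    """Magnitude of W = Q0 + Q1; all zeros when no points are given."""
    if q0 is None or q1 is None:
        return np.zeros(shape)
    w = point_field(shape, q0, p, a=1) + point_field(shape, q1, p, a=0)
    mag = np.linalg.norm(w, axis=-1)
    return mag / mag.max()                            # assumption: scale to [0, 1]

wm = weight_map((32, 32, 32), q0=(16, 16, 8), q1=(16, 16, 24))
```

Under these assumptions the two fields add up between the points and partially cancel elsewhere, so a voxel midway between the points receives a larger weight than a voxel at the same distance on the far side of one of them.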
We include the proposed segmentation weight map by concatenating it to the initial feature maps of the encoding part of the second block of the model, since preliminary experiments showed a significant performance drop when the weight map was included in the upsampling part only. Adding it to the initial part of the segmentation correction block ensures that all weights of the model are affected by these external features. Due to the skip connections, the weight map is also included in the final segmentation layer, thus directly affecting the model's output.
Loss function. The loss is composed of two terms. The first is an IoU term computed from I t and I p, the ground truth mask and the soft prediction mask, respectively, using the Hadamard product ○. The second term forces the network to take the manually introduced points into account by evaluating whether there are segmentation points in the defined region of interest. The two terms are combined as

$\mathcal{L} = \lambda_1 \mathcal{L}_{IoU} + (1 - \lambda_1)\,\mathcal{L}_{attraction}$,

where λ 1 controls the relative importance of the terms.
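As a rough illustration of how the two loss terms could be combined, the sketch below uses a common soft IoU formulation built from the Hadamard product and a hypothetical attraction term that measures how much of the binarized weight map is covered by the prediction. Neither formula is taken from this text: the attraction term, the use of γ as binarization threshold and the names `soft_iou_loss`, `attraction_loss` and `total_loss` are assumptions.

```python
import numpy as np

EPS = 1e-7  # small constant to avoid division by zero

def soft_iou_loss(i_t, i_p):
    """1 - soft IoU between ground truth i_t and soft prediction i_p."""
    inter = np.sum(i_t * i_p)                  # Hadamard product, then sum
    union = np.sum(i_t + i_p - i_t * i_p)
    return 1.0 - inter / (union + EPS)

def attraction_loss(w, i_p, gamma=0.59):
    """Hypothetical term: fraction of the region of interest left unsegmented."""
    roi = (w >= gamma).astype(float)           # assumed binarization of the weight map
    if roi.sum() == 0:                         # no user points: term vanishes
        return 0.0
    return 1.0 - np.sum(roi * i_p) / roi.sum()

def total_loss(i_t, i_p, w, lambda_1=0.68, gamma=0.59):
    """Weighted sum: lambda_1 = 0 leaves only the attraction term."""
    return (lambda_1 * soft_iou_loss(i_t, i_p)
            + (1.0 - lambda_1) * attraction_loss(w, i_p, gamma))

i_t = np.zeros((4, 4, 4))
i_t[1:3, 1:3, 1:3] = 1.0
empty = np.zeros_like(i_t)
```

With this weighting, λ 1 = 0 would train the network against the region of interest alone, which matches the limiting behavior discussed in the Hyper-parameters section.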
Dataset and training details. iW-Net was developed using the LIDC-IDRI 6 dataset, which contains 1 012 LDCT scans with variable slice thickness. In this dataset, nodules with diameter ≥3 mm have voxel-wise annotations from up to 4 different expert radiologists, and the corresponding inter-observer agreement level is indicative of how likely an abnormality is in fact a nodule. The dataset also contains a numeric description of several nodule characteristics. Namely, nodule texture ∈ [1, 5] indicates the opacity of the nodule, with 1 being a pure non-solid nodule and 5 a pure solid nodule. We considered the 888 scans used for the LUNA16 challenge 18 and studied 2 284 nodules (some samples were discarded due to annotation inconsistencies, poor scan reconstruction or excessive slice thickness). From those, 1 593, 1 190 and 790 have agreement level ≥2, ≥3 and ≥4, respectively. In our experiments, a nodule is considered non-solid if it has an average texture ≤2, solid if its average texture is 5 and sub-solid otherwise. For an agreement level ≥2, the dataset has 135 non-solid, 300 sub-solid and 1 695 solid nodules.
Figure 5. iW-Net: a network for guided segmentation of lung nodules, composed of a block responsible for predicting the initial segmentation and a second block for its correction. S is the side of the feature map. Legend: input image; intermediary feature maps; initial segmentation prediction; weight map computed from the user's input; corrected segmentation. ▸ 3 × 3 × 3 × N convolution, followed by batch normalization and rectified linear unit activation (N is the number of feature maps, indicated on the top of each layer); ▾ 3 × 3 × 3 × N convolution with stride 2 × 2 × 2, followed by batch normalization and rectified linear unit activation; ▴ 2 × 2 × 2 nearest neighbor up-sample; ▻ 3 × 3 × 3 × N convolution with sigmoid activation.
(2019) 9:11591 | https://doi.org/10.1038/s41598-019-48004-8

All nodules were collected by extracting a 51 × 51 × 51 mm cube centered at the average center of mass of the specialists' annotations and were then isotropically resized to 64 × 64 × 64 voxels. The intensity of the volume image was linearly mapped from [−1000, 400] Hounsfield Units to [0, 1]. Adam 19 was used as optimizer (learning rate 0.001) and the network was trained using a batch size of 8 samples.
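The linear intensity mapping from Hounsfield Units to [0, 1] can be sketched as follows. Clipping values outside [−1000, 400] before rescaling is an assumption, since the text only states that the mapping is linear, and `normalize_hu` is a hypothetical helper name.

```python
import numpy as np

def normalize_hu(volume, hu_min=-1000.0, hu_max=400.0):
    """Linearly map Hounsfield Units in [hu_min, hu_max] to [0, 1]."""
    v = np.clip(np.asarray(volume, dtype=float), hu_min, hu_max)  # assumed clipping
    return (v - hu_min) / (hu_max - hu_min)

example = normalize_hu([-1200, -1000, -300, 400, 1000])
```

For instance, −300 HU lies halfway between the two bounds and is mapped to 0.5.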
The dataset was artificially augmented by performing random rotations, translations, flips and zooms. For each epoch, user input was simulated by selecting the two most distant points on the middle axial slice of the segmentation. All agreement levels were considered to account for the inter-observer variability and thus no segmentation combination was performed, i.e. the same nodule was paired with different viable ground-truths to train the model. Furthermore, iW-Net was evaluated via stratified 5-fold cross-validation with partition at scan level, and we used 20% of the training set for validation. All hyper-parameters were found via random search 20, with values sampled from uniform distributions. Optimization was performed on the validation set of the first train-test split.
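The simulated user input described above, i.e. the two most distant points on the middle axial slice of a segmentation, can be sketched with a brute-force search. The axis order (z, y, x) and the helper name `simulate_end_points` are assumptions; the search runs over all foreground pixels of the slice, which gives the same pair as a search over the boundary because the farthest pair of a region always lies on its boundary.

```python
import numpy as np

def simulate_end_points(mask):
    """Return the two most distant foreground points of the middle axial slice.

    `mask` is a binary volume indexed as (z, y, x) (assumed axis order).
    Returns None when the slice has fewer than two foreground pixels.
    """
    z = mask.shape[0] // 2
    ys, xs = np.nonzero(mask[z])
    if ys.size < 2:
        return None
    pts = np.stack([ys, xs], axis=1).astype(float)
    # Squared distance between every pair of foreground pixels
    d2 = np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=-1)
    i, j = np.unravel_index(np.argmax(d2), d2.shape)
    return (z, int(ys[i]), int(xs[i])), (z, int(ys[j]), int(xs[j]))

mask = np.zeros((5, 9, 9), dtype=bool)
mask[2, 4, 1:8] = True          # a thin horizontal structure on the middle slice
points = simulate_end_points(mask)
```

The pairwise distance matrix grows quadratically with the number of foreground pixels, so for large nodules it may be preferable to restrict the search to boundary pixels first.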
iW-Net was trained in two steps. The first block was initially trained separately using the IoU loss term until the validation loss stopped improving for 3 epochs. The weights were then frozen and the entire iW-Net was trained using the complete loss function, the output of the first segmentation block and the artificially generated user interaction until the loss stopped improving for 5 epochs. Since each nodule can have multiple segmentations (one per expert), iW-Net had to perform different corrections according to the expert's annotation and the respective simulated user input. Experiments were performed on a desktop with an Intel Core i7-5960X, 32 GB RAM and 2× GTX1080, using Python 3.5 and Keras 2.2. Code is available at https://github.com/gmaresta/iW-Net.
Experiments and evaluation. iW-Net produces pixel-wise predictions ∈ [0, 1], which are thresholded at 0.5 for the model's evaluation. The predictions are evaluated in terms of 3D Intersection over Union (IoU), i.e. the ratio between the volume of the intersection and the volume of the union of S and Ŝ, and Average Surface Distance (ASD), which is computed from the minimum Euclidean distances d (mm) between the surface elements of S and Ŝ. Here, S is the expert's annotation, Ŝ is the model's prediction, and N S and N Ŝ are the respective numbers of surface elements.
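Both metrics can be sketched in a few lines of NumPy. The IoU follows directly from its definition; for the ASD, the symmetric form used here (sum of the minimum distances in both directions divided by N S + N Ŝ), the 6-connected definition of a surface voxel and the helper names are assumptions, since the exact expression is not reproduced in this text. Both masks are assumed to be non-empty.

```python
import numpy as np

def iou(s, s_hat):
    """3D Intersection over Union of two binary masks."""
    s, s_hat = s.astype(bool), s_hat.astype(bool)
    union = np.logical_or(s, s_hat).sum()
    return np.logical_and(s, s_hat).sum() / union if union else 1.0

def surface(mask):
    """Coordinates of mask voxels with at least one 6-connected background neighbor."""
    m = np.pad(mask.astype(bool), 1)                 # pad with background
    core = m[1:-1, 1:-1, 1:-1]
    interior = (m[:-2, 1:-1, 1:-1] & m[2:, 1:-1, 1:-1] &
                m[1:-1, :-2, 1:-1] & m[1:-1, 2:, 1:-1] &
                m[1:-1, 1:-1, :-2] & m[1:-1, 1:-1, 2:])
    return np.argwhere(core & ~interior)

def asd(s, s_hat, spacing=1.0):
    """Symmetric average surface distance (assumed normalization)."""
    a = surface(s) * spacing
    b = surface(s_hat) * spacing
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return (d.min(axis=1).sum() + d.min(axis=0).sum()) / (len(a) + len(b))

s = np.zeros((8, 8, 8), dtype=bool)
s[2:6, 2:6, 2:6] = True
s_shift = np.zeros_like(s)
s_shift[2:6, 2:6, 3:7] = True                        # same cube, shifted by one voxel
```

Passing the voxel spacing in mm as `spacing` expresses the distance in mm; with the default value the result is in voxels.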
For each nodule, the average inter-observer IoU performance is computed by iteratively considering one expert's annotation as the ground-truth and the remaining ones as predictions and then averaging the results. For instance, the inter-observer IoU performance for an agreement level 4 nodule is the average of 12 IoU results (4 annotators × 3 predictions). For better comparison with the observers, iW-Net is only evaluated on nodules with agreement level ≥2. The segmentation performance is also analyzed in terms of nodule radius and texture. We consider the radius of each nodule to be the average of the equivalent spherical radius over all the annotators.
Experiment 1. We study the performance of the non-guided segmentation unit (the first block of iW-Net) using as comparison the average inter-observer agreement and the segmentation produced using the 3D U-Net 13. This U-Net is trained and tested on the aforementioned dataset. Due to computational constraints, its batch size is reduced to 2. Evaluation is performed according to Eq. 9:

$\mathrm{IoU}_j = \frac{1}{N_j} \sum_{n=1}^{N_j} \mathrm{IoU}(S_{j,n}, \hat{S}_j)$ (9)

where N j is the expert's agreement level for nodule j, i.e. the number of radiologists that annotated that nodule, S j,n is the annotation of expert n and Ŝ j is the model's prediction. Since a nodule can have multiple segmentations, the model is not expected to outperform the inter-observer agreement.
Experiment 2. The goal of this experiment is to evaluate the impact of the user's input on the segmentation of iW-Net. For that, we artificially generate user inputs on the axial plane of the slice that contains the nodule's centroid. Similarly to the training procedure, the two most distant points on the ground-truth boundary of that slice are selected.
The performance of the full iW-Net is compared with the output of the first block in terms of IoU and ASD for different nodule sizes and textures. As in a real case scenario, we consider that the experts can keep either the initial or the corrected (Cr) segmentation, according to which better fits their needs. The evaluation is thus performed via Eq. 10:

$\mathrm{IoU} = \max\left(\mathrm{IoU}(S, \hat{S}),\ \mathrm{IoU}(S, \hat{S}_{Cr})\right)$ (10)

where S is the expert's annotation, Ŝ is the initial prediction and Ŝ Cr is the corrected one. This principle is also applied to the ASD metric, using the IoU as the decision criterion, i.e., the same nodules are considered.
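The evaluation protocol can be summarized in a short sketch: the inter-observer IoU averages all ordered annotator pairs, the model's IoU for a nodule is averaged over its annotations, and the interactive setting keeps the better of the initial and corrected segmentations. The helper names and the assumption that each annotation is a binary NumPy array are illustrative, not taken from the released code.

```python
from itertools import permutations

import numpy as np

def iou(a, b):
    """Intersection over Union of two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

def inter_observer_iou(annotations):
    """Average IoU over all ordered (ground-truth, prediction) annotator pairs."""
    pairs = permutations(range(len(annotations)), 2)
    return float(np.mean([iou(annotations[i], annotations[j]) for i, j in pairs]))

def model_iou(annotations, prediction):
    """Average IoU of one prediction against every expert annotation of a nodule."""
    return float(np.mean([iou(s, prediction) for s in annotations]))

def best_of_iou(annotation, initial, corrected):
    """Keep whichever segmentation (initial or corrected) fits the annotation better."""
    return max(iou(annotation, initial), iou(annotation, corrected))

a = np.zeros((4, 4, 4), dtype=bool)
a[1:3] = True
b = np.zeros((4, 4, 4), dtype=bool)
b[1:4] = True
```

For a nodule with 4 annotations, `permutations(range(4), 2)` yields the 12 ordered pairs mentioned in the text.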