Towards automatic pulmonary nodule management in lung cancer screening with deep learning

The introduction of lung cancer screening programs will produce an unprecedented amount of chest CT scans in the near future, which radiologists will have to read in order to decide on a patient follow-up strategy. According to the current guidelines, the workup of screen-detected nodules strongly relies on nodule size and nodule type. In this paper, we present a deep learning system based on multi-stream multi-scale convolutional networks, which automatically classifies all nodule types relevant for nodule workup. The system processes raw CT data containing a nodule without the need for any additional information such as nodule segmentation or nodule size and learns a representation of 3D data by analyzing an arbitrary number of 2D views of a given nodule. The deep learning system was trained with data from the Italian MILD screening trial and validated on an independent set of data from the Danish DLCST screening trial. We analyze the advantage of processing nodules at multiple scales with a multi-stream convolutional network architecture, and we show that the proposed deep learning system achieves performance at classifying nodule type that surpasses the one of classical machine learning approaches and is within the inter-observer variability among four experienced human observers.

Scientific RepoRts | 7:46479 | DOI: 10.1038/srep46479 (in the range between − 750 and − 300 HU), also referred to as ground glass opacities. Part-Solid nodules contain both a non-solid and a solid part, the latter normally referred to as the solid core. Compared with solid nodules, non-solid and in particular part-solid nodules occur less frequent but have a higher frequency of being malignant lesions 9 . Finally, calcified nodules are characterized by a high intensity and a well-defined rounded shape on CT. Completely calcified nodules represent benign lesions.
In Lung-RADS, the workup for pulmonary nodules is mainly defined by nodule type and nodule size. However, presence of imaging findings that increase the suspicion of lung cancer, such as spiculation, can modify the workup. In the PanCan model, spiculation is a parameter that together with nodule type, nodule size and patient data contribute to the estimation of the malignancy probability of a nodule. Furthermore, completely calcified and perifissural nodules are given a malignancy probability equal to zero. In a scenario in which CAD systems are used to automate the lung cancer screening workflow from nodule detection to automatic report with decision on nodule workup, it is necessary to solve the problem of automatic classification of nodule type. In this context, the classes that have to be considered are: (i) solid, (ii) non-solid, (iii) part-solid, (iv) calcified, (v) perifissural and (vi) spiculated nodules.
Although the general characteristics of nodule types can be easily defined, recent studies 10,11 have shown that there is a substantial inter-and intra-observer variability among radiologists at classifying nodule type. In this context, researchers have addressed the problem of automatic classification of nodule type in CT scans by (1) designing a problem-specific descriptor of lung nodule and (2) training a classification model to automatically predict nodule type. In ref. 11, nodules were classified as solid, non-solid and part-solid. A nodule descriptor was designed based on information on volume, mass and intensity of the nodule, and a kNN classifier was applied, but the used features strongly rely on the result of a nodule segmentation algorithm, whose optimal settings also depend on nodule type. The authors propose to solve this problem by first running the algorithm multiple times using different segmentation settings in order to extract features and then classify nodule type. In practice, this strategy hampers the applicability of such a system to an optimized scan reading scenario. In ref. 12, the SIFT descriptor was used to classify nodules as juxta, well circumscribed, pleural-tail and vascularized, and a feature matching strategy was used for classification purposes. Despite the good performance reported, the considered categories are not relevant for nodule management according to current guidelines. A descriptor specifically tailored for lung nodule analysis was introduced in ref. 13, which was used to assess presence of spiculation in detected solid nodules 14 and to classify nodules as perifissural 15 . Although this approach could be extended to other nodule types, it strongly relies on the estimation of nodule size in order to define the proper scale to analyze data.
Scale is an important factor to consider in automatic nodule type classification. As an example, discriminating a pure solid nodule from a perifissural nodule involves the detection of the fissure, which on a 2D view of the nodule can be differentiated from a vessel only if a sufficiently large region surrounding the nodule is considered (see Fig. 1). On the other hand, discriminating non-solid from part-solid nodules strongly relies on the presence of a solid core, which can consist of a tiny part of the lesion that can only be clearly detected on a small scale.
In recent years, the advent of deep learning 16,17 has emerged as a powerful alternative to designing ad-hoc descriptors for pattern recognition applications by using deep neural networks, which can learn a representation of data from the raw data itself. The most used incarnation of deep neural networks are convolutional networks 16,18,19 , a supervised learning algorithm particularly suited to solve problems of classification of natural images [19][20][21] , which has recently been applied to some applications in chest CT analysis 6,15,22-24 . In this paper, we address the problem of automatic nodule classification by introducing three main contributions. For the first time, we propose a single system that classifies all nodule types relevant for patient management in lung cancer screening according to the Lung-RADS assessment categories and the PanCan malignancy probability model, namely solid, non-solid, part-solid, calcified, perifissural and spiculated nodules. Differently from what has been done in previous work, we design a classification framework based on Convolutional Networks (ConvNets) [17][18][19] . In particular, we propose a multi-stream multi-scale architecture in which ConvNets simultaneously process multiple triplets of 2D views of a nodule at multiple scales and compute the probability for the nodule to belong to each one of the six considered nodule types. The proposed approach does not require nodule segmentation or the estimation of nodule size. Inspired by recent work 6,15 , we formulate the analysis of a nodule as a combination of 2D patches. Relying on the experimental results of Setio et al. 6 , which showed that performance increase by increasing the number of analyzed patches, we go beyond a limited number of patches by introducing a novel approach to extract an arbitrary number of 2D views from a nodule. We trained the deep learning system using data from 943 patients and 1,352 nodules from the Multicentric Italian Lung Detection (MILD) trial 25 and we validated the trained system using independent data from 468 patients and 639 nodules from the Danish Lung Cancer Screening Trial (DLCST) 26 . Furthermore, in order to compare the performance of our deep learning architecture versus classical approaches of patch classification, we trained a linear support vector machines classifier to classify both features based on the raw intensity of nodules and features learned from raw data via an unsupervised learning approach. Finally, in order to compare the performance of our method versus human performance, we designed an observer study in which four observers, including experienced radiologists, classified a subset of 162 nodules extracted from the test set. We show that the proposed system achieves performance that surpasses classical patch classification approaches and is comparable with the inter-observer variability among human observers.

Results
Training data. We trained the deep learning system using data from the Multicentric Italian Lung Detection (MILD) trial 25 . For this purpose, we considered all baseline CT scans from the MILD trial. The study was approved by the Institutional review board of Fondazione IRCCS Istituto Nazionale Tumori di Milano, and the written informed consent was waived for the retrospective examination of the analyzed data. For all patients, non contrast-enhanced low-dose CT scans were acquired using a 16-detector row CT system, with section collimation 16 × 0.75 mm. Images were reconstructed using a sharp kernel (Siemens B50 kernel, Siemens Medical Solutions) with a slice thickness of 1.0 mm. Nodules were detected and annotated based on the following procedure. All CT scans were first read by a workstation (CIRRUS Lung Screening, Diagnostic Image Analysis Group, Radboudumc, Nijmegen, Netherlands) with automatic nodule detection (CAD) tools integrated. Two medical students, trained by a radiology research in detecting pulmonary nodules, either accepted or rejected CAD marks and labeled nodules as one of the considered nodule types. Accepted nodules were segmented using the algorithm presented in ref. 27, which is implemented in CIRRUS Lung Screening. The students manually adjusted parameters to obtain the best possible nodule segmentation, which allowed to compute the equivalent diameter of the lesion. Nodules with label disagreement were reviewed by a thoracic radiologist (ES) with more than 20 years of experience in reading chest CT scans. Nodules with label agreement were further reviewed by two radiology researchers (SvR, KC) independently. From the set of annotated nodules, we removed all cases with a diameter smaller than 4 mm, which is considered as an irrelevant finding in lung cancer screening 1 . The final set of data consisted of 1,805 nodules from 943 subjects (see Table 1), which were split into two non-overlapping sets: a training set (1,352 nodules), used to train the deep learning system and a validation set (453 nodules), used to monitor the performance of the system during training.
In the development of the proposed deep learning system, we defined a nodule data sample as a set of triplets of patches (axial, coronal and sagittal view), where each triplet was used to feed three streams of convolutional network (details on data preprocessing, system design and training are detailed in the Methods section). For training purposes, several different samples were extracted from the same nodule by rotating triplets around the center of mass and by using techniques of data augmentation at patch level. In this way, ≈ 0.5 M training samples were extracted and used to train the deep learning system. In our experiments, we investigated the performance of the system when data at different scales were considered. For this purpose, we extracted nodule data with patches of size 10 mm, 20 mm and 40 mm, which represent 3 different scales. We built and trained three network architectures where one scale (40 mm), two scales (20 mm, 40 mm) and three scales (10 mm, 20 mm, 40 mm) were processed, and compared the performance of the three networks with both classical patch classification approaches based on machine learning and human performance.
Test data. The performance of the trained deep learning system was assessed on data from the Danish Lung Cancer Screening Trial (DLCST) 26 . In particular, we used the subset of data used in a study recently published by the DLCST research group 28 , where the authors also describe the procedure used to annotate nodule types. Agency. The trial is registered with ClinicalTrials.gov (NCT00496977). All participants provided written informed consent. Non contrast-enhanced low-dose CT scans were acquired using a multi-slice CT system (16-row Philips Mx 8000, Philips Medical Systems) with section collimation 16 × 0.75 mm. Images were reconstructed using a sharp kernel (kernel D) with a slice thickness of 1.0 mm. From the initial data set, we removed nodules with a diameter smaller than 4 mm, as done for data from the MILD trial, and discarded scans with incomplete or corrupted data (e.g., missing slices). Finally, we obtained a set test ALL of 639 nodules from 468 subjects (see Table 1), which we used for testing purposes.
Observer study. In order to compare the deep learning system with human performance, we selected a subset of nodules from the set test ALL and asked three observers to label nodule type. For this purpose, we built a dataset by including all spiculated nodules in test ALL (27 nodules) and the same number of nodules randomly selected from the other classes. Therefore, a dataset test OBS of 162 nodules was built for the observer study. Two chest radiologists (ES, CSP) with more than 20 years of experience reading chest CT and a radiology researcher (KC) were involved in the observer study. Readers independently labeled nodule types. Nodules were shown at locations indicated by annotations provided by the DLCST trial, and readers had the possibility to either label the nodule as belonging to one of the six considered categories, or label it as not a nodule. For evaluation purposes, we considered annotations made by the three observers involved in this study as well as annotations coming from the DLCST trial, which we considered as an additional observer. In the rest of the paper we will refer to annotations coming from these four different sources as observers O 1 , O 2 , O 3

and O 4 (where O 4 indicates the DLCST annotations).
Evaluation. After training, all nodules in test ALL were classified using the trained deep learning system. In order to compute the computer-observer agreement, we compared the results from the computer with the nodule type given by each observer independently in the test OBS set. Furthermore, we computed the inter-observer agreement by considering all possible pairs of observers Oi vs. Oj (i, j = 1, … , 4, i ≠ j). In this case, since observers were given the possibility of labeling a given nodule as "not a nodule", the additional class not a nodule is considered to assess the inter-observer variability. The results in terms of k value are reported in Table 2, when all pairs of observers and the results from the three deep learning architectures working with different scales are considered. It can be noted that human observers have a moderate to substantial agreement, with k between 0.59 and 0.75, and that the deep learning system achieves a variability in the same range of human observers, with a level of agreement that increases with the number of scales used for nodule classification. When the 3-scale architecture is considered, the k value between the computer and each observer under test is between 0.58 and 0.67 and in half of the cases, it is higher than the agreement between the observer under test and at least one of the other observers.
We also evaluated the classification performance of the best performing network, namely the one working with 3 scales, in terms of accuracy and per-class F-measure and compared it with human performance (Table 3).   It is worth noting that the average performance among human observers are comparable with the average performance between the computer and observers, with an average accuracy of 72.9% versus 69.6% respectively. A similar trend can be observed for all the other classification parameters. Furthermore, we used the test ALL dataset to compare the performance of the proposed deep learning system with two classical approaches where a linear Support Vector Machines (SVM) classifier was trained in a supervised fashion using features extracted from 2D nodule patches. In the first approach, features based on the raw pixel intensity of 2D patches were used (intensity features). In the second approach, features were not engineered but learned from raw data via an unsupervised learning approach using the K-Means algorithm (unsupervised features), as proposed in ref. 34. Details on the design of these two additional experiments are given in the Methods section. The proposed approach based on deep learning, together with these two approaches based on classical machine learning, covers a scenario where the problem of nodule classification is tackled by (1) manually defining features based on raw image data and use them to train a classifier, (2) learning features from raw data in an unsupervised fashion and use them to train a classifier, (3) learning a hierarchical representation of nodules from raw data, using convolutional networks trained end-to-end. The results of the comparison are reported in Table 4, where the gradual improvement from using intensity-based features and SVM to a 3-scale approach based on deep learning can be observed both in terms of accuracy and F-measure.
In Fig. 2, examples of nodule type classification are depicted, grouped based on labels provided by the DLCST trial. For each nodule type, nodules classified by the deep learning system are ordered by increasing probability. As a consequence, atypical examples for each nodules type can be found on the left side of the figure, while typical examples can be found on the right side of the figure.

Discussion
The deep learning system produces a score by classifying an internal representation learned from raw data. In order to get insights on the kind of features learned by the network, we extracted an embedded representation of each nodule and applied multidimensional scaling to project the embedded representation onto a bidimensional plane. For this purpose, we applied the t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm 29 to the output of the last fully-connected layer of the network. In this way, each nodule is represented by a feature vector of 256 values. The result of the multidimensional projection is depicted in Fig. 3, where close nodules have a similar representation in the network. Clearly defined clusters of nodules with similar characteristics can be identified. Examples are clusters of large solid nodules, calcified or perifissural nodules, but also groups of nodules of a particular class that was not used in this study, namely juxtapleural nodules.
One of the clusters in the t-SNE representation shows a direct association between large solid nodules and spiculated nodules. Based on training data, the system implicitly learns that large solid nodules are likely to be spiculated nodules. This effect can be observed in the quantitative evaluation reported in Table 3, where spiculation has an F-measure of 62.7% when the system is compared with O 4 on the subset of 162 nodules, while it decreases to an F-measure of 43.4% when all nodules are considered. The reduction in precision observed in the second experiment is therefore related to the presence of more large solid nodules that are misclassified as spiculated. This suboptimal behavior of the system can be compensated by increasing the amount of spiculated nodules in the training set, for example by including follow-up cases. Nevertheless, in the clinical context of lung cancer screening, labeling large solid nodules as spiculated may not hamper the nodule workup, since large solid nodules without spiculation are also considered as suspicious lesions.
The values of precision and recall per nodule type when the test ALL set is classified with the 3-scale network are reported in Table 5. We can observe that the system tends to classify solid, calcified and non-solid nodules  Table 3. Nodule classification performance in terms of accuracy and F-measure per nodule type. Results for each pair of human observer Oi vs. Oj and for observers versus the computer on the test OBS dataset (167 nodules) are reported. Averages of measures across observers and across computer-observers are also indicated. The additional class "not a nodule" is added to observers since they could exclude nodules during the observer study. The performance of the system on the test ALL set (639 nodules) is also reported. In this case, annotations from DLCST radiologists (O 4 ) are considered as the reference standard.
with high performance. As a consequence, since nodule type distributions are skewed (see Table 1), the overall accuracy for test ALL is higher than for test OBS . The low value of precision and recall for part-solid and spiculated nodules in test ALL corroborates what is observed for test OBS and can be compensated in the future by adding more training samples for underrepresented classes, therefore increasing the variability of nodule appearance in the learning procedure. The performance of the system is within the inter-observer variability. This corroborates the effectiveness of the system at classifying nodules and also indicates that even experienced radiologists do not fully agree on nodule types. The concept of nodule type has been coined by radiologists, who have to differentiate opacities in CT scans according to their appearance and, most importantly, to their frequency of malignancy. The fact that there is no complete agreement among experienced radiologists implies that no gold standard for nodule type classification can be made, and that there will always be doubtful cases even in the training set. In this context, the range of variability within the one among humans reached by the proposed system makes it the first suitable system to be integrated in workstations for automatic analysis of CT scans in lung cancer screening.

Methods
The input of the proposed framework is a chest CT scan and the position q = [x, y, z] of the nodule (e.g., its center of mass) to classify. The output of the system is the probability for the nodule to belong to each one of the six considered classes. The framework is based on convolutional networks (ConvNet), which process input samples via a "multi-stream multi-scale" architecture (see Fig. 4). We define an input sample as a triplets of 2D patches obtained by intersecting the 3D domain of the nodule with triplets of orthogonal planes, and crop triplets of patches at different resolutions. Therefore, an input sample to feed the deep learning system is given by three triplets of patches from the same nodule (see Fig. 4). Each step of the proposed framework is detailed in next sections.  1, … , N), which defines the rotation of each plane of T n with respect to the axes x, y, z. In this way, T 1 is the triplet of planes that define the default axial, coronal and sagittal views of a CT scan, and any other triplet T n is obtained by sequentially rotating the triplet with respect to the x, the y and the z axis by an angle θ n . Rotating all the planes by the same angle guarantees that orthogonal planes are always obtained. Examples of triplets for several values of N are depicted in Fig. 4(a), where the axial, coronal and sagittal planes are represented in different colors.

Generation of triplets of 2D patches.
The intersection of a triplet of planes and a CT scan generates 2D views of the nodule of interest. From each intersection, we generate triplets of 2D patches by cropping a square area of size d centered on q. Increasing the value of N allows to increase the number of extracted patches per nodule, which also increases the coverage of the volume of a nodule in 3D. Furthermore, adapting the value of N per nodule type has the advantage of (1) balancing classes distribution in the presence of skewed distribution of classes by using a larger value of N for underrepresented classes, and (2) using it as a kind of data augmentation, in which many different views of the same object are extracted.
The parameter d defines the scale at which patches are considered. Using multiple values of d allows to crop triplets of patches with information that range from local content to more global context of nodule appearance. In order to train the proposed deep learning system, we extracted triplets of patches at three different scales, namely d = 10, 20, 40 mm and fed three streams of the network with three triplets at the same time. This allows the network to focus both on the local appearance of a nodule (10 mm), where small structures like the solid core can be analyzed, and on more global context (40 mm), in which structures like the fissure can be recognized. Before feeding the network, each patch was rescaled to a fixed size of 64 × 64 pixels using bicubic interpolation and the pixel intensity Deep learning network. Network design. The architecture of the used deep learning system is depicted in Fig. 4(b). The system consists of nine streams of ConvNets, grouped into three sets of three streams. Each set of streams is fed with a triplet of orthogonal patches extracted at the same scale. Different sets of streams process triplets of orthogonal patches with exactly the same orientation in the CT scan, but at different scales. Each stream      of the set is fed with one patch from a triplet of orthogonal patches. The 2D input patch is then processed by a series of convolutional and pooling layers, with one last fully-connected layer. The size of each patch is 64 × 64 pixels, which covers a size of ≈ 40 mm at the used in-plane resolution of 0.67 mm/px. In order to define the optimal architecture for each stream, we followed the VGG-net approach proposed in ref. 30. We set a fixed size of convolutional kernels to 3 × 3 px and used 32 filters in the initial layer. Similarly to ref. 30, we added pairs of convolutional and max-pooling layers, keeping a fixed filter size of 3 × 3 and doubling the number of filters in convolutional layers after each max-pooling, as long as the performance on the validation set were improving. We slightly deviated from the fixed procedure of ref. 30 by increasing the filter size in the first convolutional layer to 5 × 5 and by using 2 layers of 64 filters in cascade before the second max-pooling layer, since this configuration showed to perform slightly better than the standard one. The described architecture represents one of the three streams used in a set, which we define as multi-stream network. All the parameters of the network are shared across the three streams in the same multi-stream network. It is worth noting that a multi-stream network processes triplets of 2D patches extracted with the same resolution d.
We used three scales with patch size of 10 mm, 20 mm and 40 mm, respectively, and for each scale we trained a multi-stream network. Each multi-stream network has the same architecture, but parameters are optimized independently at each scale. The multi-stream networks at different scales are finally merged in a final fully-connected layer (see Fig. 4(b)). The final soft-max layer has six neurons, which produce the probability for the six considered classes. We implemented the network using Theano 31 .
Training. We trained the proposed multi-stream multi-scale convolutional network with data from the MILD trial. For training purposes, we split the dataset into two parts, a training set containing 75% of the data, and a validation set, containing the remaining 25% of the data. We defined the two data sets without any overlap of patients or nodules across the sets and distributing all nodule types in the two sets based on the same proportion 75-25%. The statistics of the two data sets are reported in Table 1.
For training purposes, for each nodule, three triplets of patches were extracted. Each triplet was extracted at a given scale by setting the values d 1 = 10 mm, d 2 = 20 mm and d 3 = 40 mm for the streams 1, 2, and 3 respectively. Since the distribution of nodule types were skewed, we adapted the number of angles N per nodule type. In order to set the proper value for N, we decided to initially extract 5,000 training samples per nodule class. Specific values for N for each class are reported in Table 1. Adapting the value of N per nodule type produced 30,000 training samples. We further augmented the size of the training data set by adding three shifted versions of each training sample. Data augmentation was therefore done by randomly shifting the position q of the center of mass of the nodule to shift x y z , where (δ x , δ y , δ z ) were drawn from a normal distribution with mean value μ = 0 and standard deviation σ = 1 3 3 , which ensures shifting within a sphere of radius 1 mm centered on q. Finally, each patch of the triplet and its shifted version were flipped along the vertical, the horizontal axis, and a combination of the two axes. As a result, 16 different views of each nodule sample were included in the training set, which resulted in approximately 500,000 training samples.  N. (b) Examples of pyramidal triplets of patches used to feed the proposed deep learning systems. The system consists of three groups of three streams, one for each considered scale (namely 10 mm, 20 mm and 40 mm for patch size). Convolutional layers, max-pooling layers, fully-connected layers and one soft-max layer are the building blocks of the proposed network. The last fully-connected layer with 256 neurons serves as a combiner of the three sets of three streams, and a 6-value probability vector is generated as output.
In order to train the ConvNet, we initialized the parameters according to the method in ref. 32 and trained using stochastic gradient descent, minimizing the categorical cross-entropy loss. During optimization, we set an initial learning rate η = 10 −3 and decreased it by a factor 3 every 50 epochs. The parameters of the network were updated using the ADAM algorithm 33 . We set the batch size to 256 and used dropout 19 with a probability of 0.5 in the last fully-connected layer. Additionally, L2 normalization was used, with a weight decay parameter of 10 −6 . We empirically noticed that the training converges after ≈ 200 epochs.
Prediction. Given an input sample x, consisting of a set of triplets extracted at multiple scales, the trained architecture is able to predict a probability P k (x) for each considered nodule type class k. Since one set of triplets is extracted for a given angle θ, the prediction also depends on the angle θ. Therefore, the input triplet for a given nodule can be written as a function of θ, namely x θ . In order to classify a given nodule, the prediction becomes a function of the parameter θ as well, which we can write as P(x θ ). The final prediction is obtained as a combination of the N predictions obtained by varying the parameter θ. The adopted combination strategy consisted in averaging the per-class probability, and finally assigning the nodule the label = ∑ . This prediction strategy was applied both during training to assess the performance of the network on the validation set, and during the final evaluation on the DLCST data set. For validation purpose, after each epoch, all nodules in the validation set were tested and performance was assessed. For this purpose, 30 samples per nodule were extracted (N = 30), meaning that patches at rotation steps of 6° were taken. At each iteration, nodule type was predicted using the proposed combination of predictions, and quantitative performance parameters were computed. Since the distribution of nodule types in the validation set is skewed (see Table 1), we considered the F-measure per class instead of the commonly used accuracy, since the F-measure is less sensitive to skewed distributions. Based on this, during training we maximized the mean F-measure across classes. For the final evaluation on DLCST data, the same settings using N = 30 was used, and the results for the three considered architectures reported in Tables 2 and 3 were obtained.
Nodule classification using Support Vector Machines. In this section, we describe the details of the experiments based on classical machine learning approaches, where we used two different sets of features. The first set consists of features based on the intensity of pixels in 2D patches. The second set consists of features automatically learned from raw data in an unsupervised fashion, using the K-means algorithm.
Intensity features. The first set of features consists of the raw pixel intensity (HU values) extracted from 2D patches. Given a patch of size 64 × 64 px, we extracted a feature vector by vectorizing the values of pixel intensities in the patch. In this way, each patch had a 4,096-dimension feature vector. We built a training set by considering all the nodules used to train the methods based on deep learning, balancing samples across classes using the coefficients reported in Table 1. We used the training set to train a linear Support Vector Machines (SVM) classifier. Data were normalized prior to training to have zero mean and unit variance, and the one-vs-one strategy was used to deal with the multi-class problem. After training, we applied the classifier to the test ALL dataset, which contains 634 nodules. As done for the evaluation of deep learning approaches, 30 patches per nodules were considered at test time, which were all classified using the trained SVM classifier. Finally, majority voting of the predicted labels was used to obtain the final prediction of nodule type.
Unsupervised features. The approach used to learn a representation of pulmonary nodules in an automatic unsupervised fashion is based on the work of Coates et al. 34 . The original method presented in 34 was developed based on the CIFAR10 dataset, which contains RGB images of 32 × 32 px. Since the size of the patches used in this paper is 64 × 64 px, in order to apply the method in ref. 34 to our data we doubled the receptive field size, which we set to 12 px, and set the number of centroids to 1,600, which gave a feature space of 6,400 dimensions. We kept the rest of parameters of the algorithm at their default value. As done for the experiment using intensity features and linear SVM, at test time we classified 30 samples per nodule and considered the label given by the majority voting on the predicted labels as the final prediction of nodule type.