An improved deep learning approach for detection of thyroid papillary cancer in ultrasound images

Unlike everyday photographs, ultrasound images are usually monochrome and low-resolution. In ultrasound images, cancer regions are usually blurred, with vague margins and irregular shapes. Moreover, the features of cancer regions are very similar to those of normal or benign tissue. Therefore, training an unmodified Convolutional Neural Network (CNN) directly on ultrasound images does not give satisfactory results. In this study, inspired by the state-of-the-art object detection network Faster R-CNN, we develop a detector that is better suited to thyroid papillary carcinoma detection in ultrasound images. To improve detection accuracy, we add a spatial constrained layer to the CNN so that the detector can extract features of the surrounding region in which the cancer regions reside. In addition, by concatenating the shallow and deep layers of the CNN, the detector can detect blurrier or smaller cancer regions. The experiments demonstrate that this methodology has the potential to reduce the workload of pathologists and increase the objectivity of diagnoses. We find that 93.5% of papillary thyroid carcinoma regions could be detected automatically while 81.5% of benign and normal tissue could be excluded without the use of any additional immunohistochemical markers or human intervention.

diagnosis of breast cancer. Basavanhally et al. 15 presented a multiple field-of-view classifier that uses fields of view of different sizes to identify the important features of an image. This method was used to classify breast cancer pathological images. However, because of their computational cost, the discriminating power of these methods is limited when it comes to identifying definitive features, characterizing subsets and optimization. In addition, these methods rely on limited manual annotations and are only applicable to fixed feature matching. Once the image characteristics change (for example through twisting, flipping, illumination changes or corruption), the performance of these algorithms degrades. Therefore, they do not generalize well.
Recently, machine learning has made rapid progress in the automatic classification and interpretation of medical image data. Over the last few years, the Convolutional Neural Network (CNN) 3,16 has become one of the most rapidly developing areas of deep learning. As a kind of artificial neural network, it has become a research focus in speech analysis 17 and image recognition [2][3][4] . Its shared-weight structure makes it more akin to biological neural networks. Because the weights are shared, a CNN has fewer weights and a less complex network model. The advantage is more significant when the input is a multidimensional image: using the whole image as input avoids the explicit feature extraction and data reconstruction steps of traditional recognition algorithms. The multi-layer perceptron structure of a CNN is particularly suited to recognizing two-dimensional images, and it is highly invariant to translation, scaling, skewing and other forms of deformation 3 . Lutjanus et al. 18 used a CNN to automatically identify the features of sentinel and breast cancer metastasis in MR images. This method can reduce the workload of the pathologist and increase the objectivity of the diagnosis. They concluded that deep learning holds great promise to enhance the efficacy of prostate cancer diagnosis and breast cancer staging. Angel Cruz-Roa, Ajay Basavanhally et al. 19 achieved automatic segmentation of invasive breast cancer MR images and generated cancer distribution maps with a CNN. The authors compared the performance of the CNN, a hand-crafted image feature extraction method and a random forest method. The experiment showed that the CNN worked best. Petersen, Kersten and Chernoff et al. 20 proposed a combination of supervised and unsupervised learning to perform breast density segmentation and risk assessment. First, they used a deep unsupervised CNN to extract image features. 
Then they used labeled images to adjust the network weights and bias parameters. This strategy is called fine-tuning 3 . The model can easily be extended to many areas of image segmentation and classification. Su et al. 21 used stacked denoising auto-encoders to detect and segment cells in lung cancer and brain tumors.
In this paper, we analyze in detail the shortcomings of the state-of-the-art object detection network Faster R-CNN when it is applied to ultrasound images (see Section 0.7). Unlike everyday images, cancer regions in ultrasound images are usually blurred, with vague margins or irregular shapes. To address these problems, we validate two strategies, layer concatenation and a spatial constrained layer. Experimental results show that each strategy improves detection performance, and combining them yields the best results. In the following, we refer to this approach as CS Faster R-CNN.

Methods
This section presents the mechanism of CS Faster R-CNN and the pipeline used to evaluate the benefits of the representation in the detection task. We first explain the related concepts, CNN and Faster R-CNN.

Figure 1. Some ultrasound images of thyroid papillary carcinoma. The cancer regions are marked by yellow crosses. The cancer regions are blurred, with vague margins or irregular shapes.

CNN. Given a training set $\{(x_i, y_i)\}_{i=1}^{N}$, where x represents an annotated region and y represents its label, a CNN can, through training, estimate a model F that maps input vectors x to output vectors y. The training process includes two phases, the forward propagation phase and the back propagation phase. During the forward propagation phase, when a training sample $(x_i, y_i)$ is given as input to the network, $x_i$ is transferred from the input layer to the output layer step by step, finally producing the output $o_i$. This process can be formulated as

$$o_i = F_L(\cdots F_2(F_1(x_i; w_1); w_2) \cdots; w_L), \tag{1}$$

where L is the number of layers and $w_j$ is the weight vector of the jth layer $F_j$. Commonly, $F_j$ is defined as a series of layers that perform operations such as convolution with a kernel function, max pooling or non-linear activation. Estimating the weight vectors $w_1, w_2, \ldots, w_L$ can then be posed as the following optimization problem,

$$\underset{w_1, \ldots, w_L}{\arg\min} \; \frac{1}{N} \sum_{i=1}^{N} \ell(o_i, y_i), \tag{2}$$

where $\ell$ is usually defined as the cross-entropy cost function. Using back-propagation and stochastic gradient descent, we can solve the numerical optimization problem (2). In Fig. 2, the CNN model ZF is framed with the green dashed rectangle, and its detailed architecture is shown in Fig. 3. ZF has 5 convolution layers and 3 fully connected layers. Conv5 is the top convolution layer; it uses a 3*3 kernel with a stride of 1 and outputs 256 feature maps of size 13*13, which are given as input to the RPN to generate object proposals.
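As a toy illustration of the forward propagation and cross-entropy objective described above, the following sketch composes two small layers and averages the loss over a few random samples. The layer functions, shapes and data are hypothetical and are not the ZF architecture; the snippet only mirrors the structure of the computation.

```python
import numpy as np

def forward(x, layers):
    """Apply F_1, ..., F_L in order; each layer is a (function, weights) pair."""
    out = x
    for fn, w in layers:
        out = fn(out, w)
    return out

def softmax(z):
    z = z - z.max()               # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

def dense_relu(x, w):
    """Hypothetical hidden layer: linear map followed by ReLU."""
    return np.maximum(0.0, w @ x)

def dense_softmax(x, w):
    """Hypothetical output layer: linear map followed by softmax."""
    return softmax(w @ x)

def cross_entropy(o, y):
    """Cross-entropy between predicted probabilities o and integer label y."""
    return -np.log(o[y])

rng = np.random.default_rng(0)
layers = [(dense_relu, rng.normal(size=(8, 4))),
          (dense_softmax, rng.normal(size=(2, 8)))]

# Five random (x_i, y_i) pairs standing in for annotated regions and labels.
samples = [(rng.normal(size=4), int(rng.integers(0, 2))) for _ in range(5)]

# Empirical objective: mean loss over the training samples.
mean_loss = np.mean([cross_entropy(forward(x, layers), y) for x, y in samples])
print(mean_loss)
```

In practice the weights would be updated by back-propagation and stochastic gradient descent; the sketch stops at evaluating the objective.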
Faster R-CNN. As

Improved Faster R-CNN for ultrasonic image detection.
In view of the problems discussed in Section 0.7, we investigate a series of strategies to make Faster R-CNN suitable for detection in ultrasound images. As shown in Fig. 2, in the CNN used in Faster R-CNN, the conv3 and conv5 layers of ZF are normalized and concatenated. In addition, we add a spatial constrained layer before the output layer. These strategies are introduced in detail below. In our research, we mainly aim at improving the CNN used in Faster R-CNN.
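A minimal NumPy sketch of the normalize-then-concatenate step outlined in this section is given below. It is an illustration under stated assumptions, not the modified Faster R-CNN source: the 6*6 RoI pooling size, the 384-channel conv3 tensor, the random inputs and the unit scale are placeholders chosen for the example, and the normalization is taken to be a per-pixel L2 norm over channels.

```python
import numpy as np

def l2_normalize_and_scale(feat, gamma, eps=1e-12):
    """L2-normalize each pixel of a (C, H, W) tensor over its channels,
    then multiply by gamma (a scalar or one value per channel)."""
    norm = np.sqrt((feat ** 2).sum(axis=0, keepdims=True)) + eps
    gamma = np.asarray(gamma, dtype=feat.dtype).reshape(-1, 1, 1)
    return gamma * feat / norm

rng = np.random.default_rng(0)
# Assumed shapes: RoI-pooled conv3 (384 channels) and conv5 (256 channels).
conv3_roi = rng.normal(scale=1.0, size=(384, 6, 6))
conv5_roi = rng.normal(scale=50.0, size=(256, 6, 6))   # deliberately larger scale

# Without normalization the conv5 values would dominate the concatenation.
fused = np.concatenate(
    [l2_normalize_and_scale(conv3_roi, 1.0),
     l2_normalize_and_scale(conv5_roi, 1.0)],
    axis=0,
)
print(fused.shape)
```

After normalization both tensors contribute pixel vectors of the same length, so neither layer overrides the other when the channels are stacked.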
Layer concatenation. In a CNN, the receptive fields become larger as the layers become deeper. Therefore, deeper layers have larger-scale values while shallower layers have smaller-scale values. Because of this large scale difference, it is difficult for the following layers to adjust and tune the weights. If we directly concatenate the tensors of the conv3 layer and the conv5 layer of the ZF model, the "big" features in the conv5 layer will override the "small" features in the conv3 layer. As a result, the output is likely to express the "big" features and ignore the "small" ones. According to 28 , these two tensors need to be normalized. During Faster R-CNN training, the system can automatically learn the scaling factor of each layer. Therefore, the normalization operation maintains the stability and precision of the system 28 . As shown in Fig. 2, we apply L2 normalization to the tensors in the conv3 layer and the conv5 layer. The normalization is performed within each pixel of the pooled feature map tensor. After the normalization, scaling is applied to each tensor independently:

$$\hat{X} = \frac{X}{\|X\|_2}, \qquad \|X\|_2 = \left(\sum_{i=1}^{d} |x_i|^2\right)^{1/2},$$

where $X$ is the original pixel vector, $\hat{X}$ is the normalized pixel vector and d stands for the number of channels in each RoI pooling tensor. The scaling factor $\gamma_i$ is then applied to each channel of every RoI pooling tensor:

$$y_i = \gamma_i \hat{x}_i.$$

During training, the updates for the scaling factor $\gamma_i$ and the input $X$ are calculated with back-propagation and the chain rule:

$$\frac{\partial \ell}{\partial \hat{X}} = \frac{\partial \ell}{\partial Y} \cdot \gamma, \qquad \frac{\partial \ell}{\partial X} = \frac{\partial \ell}{\partial \hat{X}} \left(\frac{I}{\|X\|_2} - \frac{X X^{T}}{\|X\|_2^{3}}\right), \qquad \frac{\partial \ell}{\partial \gamma_i} = \sum_{y_i} \frac{\partial \ell}{\partial y_i}\, \hat{x}_i,$$

where $Y = [y_1, y_2, \ldots, y_d]^{T}$.
Spatial constrained layer. During the training phase, the CNN only extracts features from the annotated regions. However, depending on the pathologists' experience and cognitive level, the annotated regions are often subjective or even inaccurate. In addition, the cancer regions depend on the regions in which they reside, which are hard to define. That is to say, the output y may depend not only on the input x but also on the topological domain in which it resides. In order to extract features from the unknown residing regions, as shown in Fig. 2, we add a spatial constrained layer before the output layer. We define $\Omega$ as the residing region on which the output y depends. The spatial constrained regression model m is expressed in terms of $\theta(x)$, an unknown parameter vector that can be estimated. Here we suppose that m is known a priori. As in Fig. 2, $\theta(x)$ is estimated from $c_i$, the coordinates of $y_i$, and $c_o$, the center of the cancer region within $\Omega$. We define d as the constant radius of the cancer region, which can be estimated experimentally. As shown in Fig. 2, the predicted output $\hat{y}$ is the output of the spatial constrained layer. Through training, we obtain the probability map (11). Following (11), the ith element $\hat{y}_i$ of the predicted output is expressed in terms of $(\hat{c}_r, \hat{c}_q)$, an estimated center of the probability mask. In our experiments, we set d in (11) and (12) to 150 pixels. $\hat{c}_r$ and $\hat{c}_q$ can be estimated in the (L − 1)th layer using (10); they are defined in terms of the weight vectors $w_{L-1,r}$ and $w_{L-1,q}$, the bias variables $b_r$ and $b_q$, and the sigmoid function sigm(⋅). To learn all the variables (i.e., weight vectors and bias values) in the network, we solve (2) using the cross-entropy loss function.
Annotation. To ensure objectivity, we provide the annotators only with original ultrasound images without any annotations. With the help of tagging software, two experienced physicians annotate the training samples with red rectangular boxes. Testing samples with cancer regions are also annotated to generate the ground truth. A screenshot of the software interface during annotation is shown in Fig. 4. A rectangular box is dropped if its shortest side is less than 2 mm. In addition, the rectangular box must completely surround the cancer region and be as small as possible. As referred in Section 0.3, all the images used for annotation are selected from ultrasound reports containing ready-made annotations and descriptions generated by the previous physicians. In addition, following the corresponding ultrasound reports, another two physicians review the annotated images. With the original annotations in the ultrasound reports, the experienced physicians' annotations and the inspectors' verification, we can ensure the accuracy of the annotation. In this way, we annotate 6727 and 1881 regions in the training samples and testing samples respectively. Finally, we create XML files for all labeled training samples as required by Faster R-CNN.
Training. Our experiments are run on Ubuntu 14.04 (64-bit) with the Python version of Faster R-CNN (the Matlab and Python versions of Faster R-CNN can be downloaded at https://github.com/rbgirshick/), using 32 G Nvidia TITAN X (Pascal) acceleration. Because of the lack of training samples, we fine-tune Faster R-CNN using the VOC2007 image dataset. Specifically, we use the CNN model ZF pre-trained on the VOC2007 database. Theoretically, fine-tuning all convolutional layers gives the best performance. To balance time consumption and efficiency, we fine-tune the weights of all layers except the first two convolutional layers. During fine-tuning, we use the approximate joint training scheme in 29 to train the RPN and the detector simultaneously with a multi-task loss.
As referred in Section 0.2.1, we modify the source code of Faster R-CNN to normalize the conv3 and conv5 layers and concatenate the features pooled from them. As illustrated in 28 , the scale applied after the features are concatenated can be either refined or fixed. Here we use a fixed scale of 4700 for the entire blob in both the training and test phases. In the training phase, we set the iteration numbers to [50000, 25000, 50000, 25000] and use a fixed learning rate of 0.0001. With the above parameters and data, training a CS Faster R-CNN model takes about 12 hours. We then run the model on the training samples. Output regions whose confidence scores are above 0.8 while their IoU values with any ground-truth annotation are less than 0.3 are considered hard negatives. A selected region proposal is regarded as cancer if its confidence score is higher than 0.8. Results show that CS Faster R-CNN is a real-time system: detecting an image takes 0.15 s on average.
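The hard-negative rule above can be sketched as follows. The thresholds (confidence above 0.8, IoU below 0.3 with every ground-truth box) come from the text; the box format (x1, y1, x2, y2) in pixels and the function names are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def hard_negatives(detections, gt_boxes, score_thr=0.8, iou_thr=0.3):
    """Keep confident detections that overlap no ground-truth box well.

    detections: list of (box, score); gt_boxes: list of boxes.
    """
    selected = []
    for box, score in detections:
        best_iou = max((iou(box, g) for g in gt_boxes), default=0.0)
        if score > score_thr and best_iou < iou_thr:
            selected.append((box, score))
    return selected

gt = [(0, 0, 100, 100)]
dets = [
    ((0, 0, 100, 100), 0.95),      # confident and correct: not a hard negative
    ((200, 200, 300, 300), 0.90),  # confident but off-target: hard negative
    ((200, 200, 300, 300), 0.50),  # off-target but low score: ignored
]
print(hard_negatives(dets, gt))
```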

Results
In the following, true positive, false positive, true negative, false negative, true positive rate, false positive rate, true negative rate and false negative rate are abbreviated as TP, FP, TN, FN, TPR, FPR, TNR and FNR respectively. To gain further insight into the improvements obtained by the proposed method, we conduct additional ablation experiments, listed in Table 1, to examine the effectiveness and contribution of each strategy. 10-fold cross-validation is used to estimate the performance of all strategies. We present results for splits per image (i.e., the training set and the validation set do not share any image). Table 2 presents the results pooled over the ten folds. In Table 2, using ID3, that is, CS Faster R-CNN, 93.5% of papillary thyroid carcinoma regions can be detected automatically while 81.5% of benign and normal tissue can be excluded without any additional immunohistochemical markers or human intervention. Therefore, this methodology has the potential to reduce the workload of pathologists and increase the objectivity of diagnoses.
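For reference, the four rates abbreviated above follow from the confusion counts in the usual way. The sketch below uses the standard definitions; the counts in the example are made up and are not taken from Table 2.

```python
def rates(tp, fp, tn, fn):
    """Standard rates derived from confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),   # sensitivity: share of positives detected
        "FNR": fn / (tp + fn),   # share of positives missed
        "TNR": tn / (tn + fp),   # specificity: share of negatives excluded
        "FPR": fp / (tn + fp),   # share of negatives flagged as positive
    }

# Hypothetical counts, for illustration only.
example = rates(tp=90, fp=20, tn=80, fn=10)
print(example)
```

TPR and FNR sum to one, as do TNR and FPR, so reporting TPR and TNR is enough to recover all four.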
To better validate the effectiveness of each strategy, we compare the strategies one by one. As shown in Table 1, ID1 does not use any strategy, ID2 uses layer concatenation, and ID3 uses both layer concatenation and the spatial constrained layer. In the following, we discuss the performance of each strategy in detail.

Results of the ablation experiments.
Layer concatenation. As in Fig. 5, the receiver operating characteristic (ROC) curve of ID2, which uses layer concatenation, is closer to the top left corner than that of ID1. From Table 2, compared to ID1, the TPR and TNR of ID2 increase by 3.2% and 7.1% respectively. Intuitively, as shown in the first two columns of Fig. 6, for the same ultrasound image, both ID1 and ID2 can identify the correct cancer regions. However, compared to ID1, the detection result of ID2 is closer to the ground truth. Fig. 7a and Fig. 7b show the detection results of ID1 and ID2 respectively; ID1 cannot identify the cancer region while ID2 can. However, neither ID1 nor ID2 can identify the cancer regions in Fig. 7d and Fig. 7e, which shows that ID2 still needs to be improved. In addition, ID1 (Fig. 8a) and ID2 (Fig. 8b) both generate an FP in the same image, but the classification score of ID2 is lower than that of ID1. In Fig. 8d and Fig. 8e, ID2 eliminates the FP while ID1 cannot. In summary, compared to ID1, ID2 eliminates more FN and FP results. These results confirm the statements in the first paragraph of Section 0.7. Because of the RoI pooling mechanism, the original Faster R-CNN cannot capture the local texture of cancer regions well, especially in monochrome, low-resolution ultrasound images. With layer concatenation, our approach can extract both local and whole texture features of the cancer regions and achieves better performance.
Spatial constrained layer. As in Fig. 5, the ROC curve of ID3, which uses layer concatenation and the spatial constrained layer, is clearly closer to the top left corner than those of ID1 and ID2. From Table 2, compared to ID2, the TPR and TNR of ID3 increase by 3.50% and 3.30% respectively. Intuitively, as shown in Fig. 7b and Fig. 7c, compared to ID2, ID3 produces a more accurate rectangle that is closer to the ground truth. In Fig. 7d-Fig. 7f, only ID3 identifies the cancer region correctly. In Fig. 8a-Fig. 8c, both ID1 and ID2 generate FP results while ID3 correctly gives the TN result. As referred in Section 0.2.2, with the spatial constrained layer, our approach can fully utilize the features of the environment surrounding the annotated cancer regions during training. Therefore, ID3 performs better than ID2.

Results of comparison with approaches based on SVM.
For image classification, SVM is reported to have the best performance among machine learning approaches 10 . In our study, we compare CS Faster R-CNN with several state-of-the-art ultrasound image classifiers based on SVM. For an unbiased comparison, we only compare classification performance rather than regional detection performance. Both CS Faster R-CNN and the existing methods are trained and validated with the same samples. For the SVM approaches, it is worth noting that the training samples are no longer annotated with cancer regions but only with a classification label, e.g., 1 represents a positive sample and −1 represents a negative sample. For CS Faster R-CNN, a test image is counted as positive if it contains at least one positive region; otherwise it is negative. Note that all experiments were carried out using 10-fold cross-validation, i.e., one tenth of the cases were used for testing and the rest for training. We compare our results with several SVM-based approaches: Moradi et al. 30 , Virmani et al. 31 , Acharya et al. 32 , Acharya et al. 33 , Tsiaparas et al. 34 and Güler et al. 35 . Table 3 shows the results of the comparison. Because the source codes of these studies are not public, we used LIBSVM 36 , a publicly available Matlab implementation of the SVM algorithms, as the basic platform. For each method, we used the optimal configuration proposed by its authors. As shown in Table 3, CS Faster R-CNN has the best performance for both TPR and TNR. Acharya et al. 33 has the highest TPR and TNR among the SVM-based approaches. Compared to Acharya et al. 33 , CS Faster R-CNN increases TPR and TNR by 2.7% and 4.7% respectively. CS Faster R-CNN thus identifies more samples correctly, especially negative samples. For SVM, it is difficult to judge whether a region is positive or negative because the features of cancer regions are very similar to those of normal or benign tissue in ultrasound images. 
However, through a series of strategies, CS Faster R-CNN achieves better performance than the state-of-the-art SVM-based approaches.
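The image-level decision rule and the per-case 10-fold protocol used in this comparison can be sketched as follows. The labels 1 and −1 and the "at least one positive region" rule come from the text; reusing the 0.8 confidence threshold from the training section, the shuffling seed and the function names are assumptions for the example.

```python
import random

def image_label(detections, score_thr=0.8):
    """Return 1 if any detected region is positive, otherwise -1.

    detections: list of (box, score) pairs for one image.
    """
    return 1 if any(score > score_thr for _box, score in detections) else -1

def ten_fold_splits(case_ids, k=10, seed=0):
    """Split cases into k folds; each fold serves once as the test set."""
    ids = sorted(set(case_ids))
    random.Random(seed).shuffle(ids)
    folds = [ids[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        train = [c for j, fold in enumerate(folds) if j != i for c in fold]
        splits.append((train, folds[i]))
    return splits

splits = ten_fold_splits(range(100))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Splitting by case rather than by region keeps every region of a given image on the same side of the train/test boundary.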

Discussion
Many studies have begun to use the state-of-the-art object detection network Faster R-CNN for image classification and detection. However, it is rarely used for detection in ultrasound images. Unlike everyday photographs, ultrasound images have several shortcomings that make it infeasible to use a CNN directly for detection. First, deep learning needs a large amount of labeled training data, whereas ultrasound images are limited and difficult to obtain. Second, as shown in Fig. 1, cancer regions in ultrasound images are usually blurred, with vague margins or irregular shapes. In particular, it is a considerable challenge to distinguish malignant tumor tissue from benign tumor tissue. Finally, cancer tissues are lodged in their surrounding environment, and it is difficult to identify their boundaries. For the first problem, as referred in 37 , though there are substantial differences between natural and medical images which may advise against knowledge transfer, fine-tuning a CNN that has been pre-trained with a large set of labeled natural images still outperforms or, in the worst case, performs as well as a CNN trained from scratch. Therefore, we fine-tune the CNN used in Faster R-CNN with the public image dataset VOC2007. For the second problem, in order to detect the details of the cancer regions, we must thoroughly identify their local texture features. However, Faster R-CNN cannot extract local texture features well, for the following reasons. The Region of Interest (RoI) pooling layer of Faster R-CNN only uses the feature maps of the deepest convolution layer. As reported in 38 , as the layers become deeper, the receptive fields become larger. Therefore, deeper layers have larger-scale values while shallower layers have smaller-scale values. For instance, given that the overall stride of the conv5 layer in the ZF model is 16, once the object size is less than 16 pixels, Faster R-CNN can no longer project the region proposal for RoI pooling. 
From the viewpoint of feature visualization, as the authors pointed out in 39 , the conv5 layer (the deepest layer) captures features of the entire object. Therefore, because of the RoI pooling mechanism, Faster R-CNN cannot capture the local texture of an object well, and it is difficult for Faster R-CNN to extract local texture features from low-resolution images. Inspired by 39 , we concatenate the conv3 layer and the conv5 layer so that the RoI pools both local and global features. For the last problem, in order to extract features from the unknown residing regions, we add a spatial constrained layer before the output layer.
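The stride argument above can be made concrete with a small calculation. The overall stride of 16 for conv5 comes from the text; the box coordinates and helper names are hypothetical, and the sketch only measures how many feature-map cells a box spans rather than reproducing the RoI pooling code.

```python
def feature_extent(box, stride=16):
    """Width and height of an image-space box, measured in feature-map cells."""
    x1, y1, x2, y2 = box
    return ((x2 - x1) / stride, (y2 - y1) / stride)

def spans_full_cell(box, stride=16):
    """True if the box covers at least one feature-map cell in both directions."""
    return min(feature_extent(box, stride)) >= 1.0

small_box = (100, 100, 112, 110)   # 12 x 10 pixels, smaller than the stride
large_box = (0, 0, 160, 160)       # 160 x 160 pixels

print(feature_extent(small_box), spans_full_cell(small_box))
print(feature_extent(large_box), spans_full_cell(large_box))
```

A region smaller than the stride covers only a fraction of a single conv5 cell, which is why such regions are hard to pool from the deepest layer alone.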
With layer concatenation, we concatenate the features pooled from the conv3 and conv5 layers of the ZF model 39 used in Faster R-CNN. This strategy enhances the ability of the detector to capture more detailed features of the RoI, especially for low-resolution images. Experiments showed that this strategy could increase the TPR by 3.3%. With the spatial constrained layer, the detector can extract the features of the surrounding host environment in which the cancer regions reside, increasing the TPR and the TNR by 6.3% and 7.5% respectively. As seen in Fig. 5 and Table 2, by combining layer concatenation and the spatial constrained layer, ID3, that is, CS Faster R-CNN, dramatically improves detection performance, exceeding any single strategy. Fig. 9 shows that ID3 gives good recognition results for blurred, low-resolution cancer regions with vague margins and irregular shapes.
In terms of efficiency, as shown in Table 4, using the same training samples referred to in Section 0.3, training ID1, ID2 and ID3 takes about 8.5 hours, 11 hours and 12 hours respectively. ID2 takes about 2.5 extra hours to compute the layer concatenation and normalization. Similarly, ID3 takes about 1 extra hour to compute the spatial constrained layer. For testing an ultrasound image, ID1, ID2 and ID3 take about 0.10 s, 0.13 s and 0.15 s respectively. That is, all of them are real-time detection systems. Experiments show that we have achieved good results by adopting these strategies. We further investigated why the conv3 and conv5 layers, rather than the conv4 and conv5 layers or another pair of layers, should be concatenated. As mentioned in 39 , the conv1 layer just responds to some blocks or lines, which have no semantic features. The conv2 layer responds to some corners and other edge/color conjunctions, which still have no semantic features. The conv3 layer has more complex invariances, capturing similar textures (e.g. mesh patterns), which have semantic features. The conv4 layer shows significant variation and is more class-specific: dog faces or bird legs. The conv5 layer shows entire objects with significant pose variation, e.g. keyboards or dogs. Therefore, the conv1 and conv2 layers cannot represent the local texture features. Under the same experimental conditions, we compared the performance of single layers and of different layer concatenations. The results are shown in Table 5. Among the individual layers, the conv5 layer performs best: its TPR reaches 0.897, which is 0.238 (23.8%) higher than that of the conv4 layer alone, which has the second-best TPR of 0.659. Among the layer concatenations, the concatenation of the conv3 and conv5 layers performs best: its TPR reaches 0.935, which is 0.025 (2.5%) higher than that of the concatenation of the conv4 and conv5 layers, which has the second-best TPR of 0.910. 
However, among the layer concatenations that do not involve the conv5 layer, the best result is produced by the concatenation of the conv3 and conv4 layers, whose TPR is only 0.691. Therefore, we can conclude that the conv5 layer works best when an individual layer is used, and the concatenation of the conv3 and conv5 layers works best when layers are concatenated. The reason is that the conv5 layer is necessary for extracting the whole texture features of cancer regions; the detector cannot identify the cancer regions without them. The conv3 layer extracts the local texture features of the cancer regions, and concatenating the whole and local texture features achieves the best effect.
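The comparisons quoted above from Table 5 can be checked with a few lines of arithmetic. Only the TPR values stated in the text are used; the remaining entries of Table 5 are not reproduced here, and the dictionary keys are labels chosen for the example.

```python
# TPR values quoted in the discussion of Table 5 ('&' denotes concatenation).
tpr = {
    "conv4": 0.659,
    "conv5": 0.897,
    "conv3&conv4": 0.691,
    "conv4&conv5": 0.910,
    "conv3&conv5": 0.935,
}

# Gap between the best and second-best single layers.
single_gap = round(tpr["conv5"] - tpr["conv4"], 3)
# Gap between the best and second-best concatenations.
concat_gap = round(tpr["conv3&conv5"] - tpr["conv4&conv5"], 3)
# Best configuration among the quoted entries.
best = max(tpr, key=tpr.get)

print(single_gap, concat_gap, best)
```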

Conclusion
This study investigates strategies to improve the ability of Faster R-CNN to detect cancer regions in thyroid papillary carcinoma images. Faced with limited training samples and blurry cancer regions, we validate two strategies, layer concatenation and the spatial constrained layer. Experimental results show that each strategy improves detection performance, and combining them yields the best results. In the future, we will investigate new strategies to detect more kinds of cancer regions by considering the context. In addition, we will further study how to generate an exhaustive and practical diagnostic report.

Table 5. Performance of single layers and of different layer concatenations. The numbers in the first column represent the convolutional layers. '&' represents concatenation.