Stochastic co-teaching for training neural networks with unknown levels of label noise

Label noise hampers supervised training of neural networks. However, data without label noise is often infeasible to attain, especially for medical tasks. Attaining high-quality medical labels would require a pool of experts and their consensus reading, which would be extremely costly. Several methods have been proposed to mitigate the adverse effects of label noise during training. State-of-the-art methods use multiple networks that exploit different decision boundaries to identify label noise. Among the best performing methods is co-teaching. However, co-teaching comes with the requirement of knowing label noise a priori. Hence, we propose a co-teaching method that does not require any prior knowledge about the level of label noise. We introduce stochasticity to select or reject training instances. We have extensively evaluated the method on synthetic experiments with extreme label noise levels and applied it to real-world medical problems of ECG classification and cardiac MRI segmentation. Results show that the approach is robust to its hyperparameter choice and applies to various classification tasks with unknown levels of label noise.


Numerical stability
The benefit of selecting and rejecting an arbitrary number of training instances from mini-batches may come at the cost of numerical instability. If all instances from a mini-batch were rejected, an empty training set would be generated, which would cause numerical instability. To combat this, we employ two procedures. First, we clamp the randomly selected thresholds between 0.01 and 0.99. This ensures rejection of cases where the posterior probability is very low and selection of cases where the posterior probability is very high. Note that a threshold of 0 would result in inclusion of all instances, and a threshold of 1 would result in rejection of all instances and thus an empty set. Second, we monitor the fraction of selected training instances and impose the following selection criterion: when < 10% of the instances in a mini-batch are selected, we generate a new selection threshold. When five consecutive thresholds do not satisfy the selection criterion, the mini-batch may consist exclusively of instances with label noise, and a new mini-batch is sampled.
Similarly to ref. 42, we gradually introduce stochastic co-teaching via a schedule. The schedule introduces the selection threshold, sampled from Beta(α, β), gradually by scaling it with a multiplication factor (Eq. 1), where n is the current time-step and n_0 is the time delay after which the threshold is introduced gradually in δ steps.
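The two safeguards described in this section can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' implementation: the function names, the list-based representation of posterior probabilities, and the convention of returning None to request a new mini-batch are assumptions made for the example. The introduction schedule is deliberately left out.

```python
import random


def sample_threshold(alpha, beta, low=0.01, high=0.99, rng=random):
    """Draw a selection threshold from Beta(alpha, beta), clamped to [low, high]."""
    return min(max(rng.betavariate(alpha, beta), low), high)


def select_instances(probs, alpha, beta, min_fraction=0.10, max_tries=5, rng=random):
    """Return indices whose posterior probability exceeds a sampled threshold.

    A new threshold is drawn whenever fewer than min_fraction of the
    mini-batch is selected. After max_tries consecutive failures the function
    returns None, signalling (by assumption) that the caller should sample a
    new mini-batch.
    """
    for _ in range(max_tries):
        tau = sample_threshold(alpha, beta, rng=rng)
        selected = [i for i, p in enumerate(probs) if p > tau]
        if len(selected) >= min_fraction * len(probs):
            return selected
    return None
```

Because the threshold never exceeds 0.99, an instance with posterior probability 1.0 is always selected, and because it never drops below 0.01, an instance with posterior probability 0.0 is always rejected.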

Rejection rate
In contrast to conventional co-teaching, stochastic co-teaching can reject an arbitrary number of training instances. By calculating the rejection rate (per epoch), training stability can be monitored. Instability appears when the majority of instances are rejected during training. Furthermore, assuming that only training instances with incorrect labels are rejected, the rejection rate provides an estimate of label noise. Note that the rejection rate is equivalent to the term forget rate used in conventional co-teaching, but for clarity we use the two terms distinctly.
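As a small worked example of the per-epoch rejection rate, the sketch below accumulates the number of selected instances over the mini-batches of one epoch. The function name and the mini-batch counts are hypothetical and serve only to illustrate the arithmetic.

```python
def rejection_rate(selected_counts, batch_sizes):
    """Fraction of training instances rejected over one epoch."""
    total = sum(batch_sizes)
    if total == 0:
        raise ValueError("no instances were seen in this epoch")
    return 1.0 - sum(selected_counts) / total


# Hypothetical epoch: three mini-batches of 128 instances each.
rate = rejection_rate(selected_counts=[70, 64, 58], batch_sizes=[128, 128, 128])
majority_rejected = rate > 0.5  # the instability condition described in the text
print(f"rejection rate: {rate:.3f}, majority rejected: {majority_rejected}")
```

Here 192 of 384 instances are selected, so the rejection rate is exactly 0.5 and the majority-rejected flag is not raised.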

Experiments and results
For a comparison with previously published methods we performed baseline experiments with synthesized label noise in the MNIST 51, CIFAR-10 52, and CIFAR-100 53 datasets. To generate label noise we employed two types of noise transition matrices, visualized in Fig. 1. Bias noise, also referred to as label-flipping noise 42,49, mimics observer bias by substituting a ground-truth label with the label of the neighboring class. Uniform noise is achieved by replacing a ground-truth label with a randomly selected other label. For a direct comparison with related methods, experiments were performed similar to those in refs. 17,26,42,54, i.e. experiments with high levels of label noise: bias noise with a noise rate of 45%, and uniform noise with noise rates of 50% and 20%. Note that experiments with 45% bias noise are exceptionally difficult, because there is only a marginal majority of correct samples; a bias noise rate of more than 50% would flip the majority of instances to the incorrect label. We implemented two CNN architectures: a standard four-layer CNN for MNIST experiments, and a nine-layer CNN for CIFAR experiments. The former was specifically designed for MNIST and is efficient and fast; the latter has been used for experiments with weak supervision and noisy labels 42,55 and is less efficient and therefore slower. Table 2 lists the architectures of these networks.
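The two noise models can be illustrated with explicit transition matrices. The sketch below is an assumption-laden example rather than the code used in the experiments: the helper names are hypothetical, and "neighboring class" is interpreted as the next class index with wrap-around, which the text does not specify.

```python
import random


def bias_transition_matrix(num_classes, noise_rate):
    """Row i keeps label i with probability 1 - noise_rate and otherwise
    flips it to class (i + 1) % num_classes (assumed neighbor)."""
    T = [[0.0] * num_classes for _ in range(num_classes)]
    for i in range(num_classes):
        T[i][i] = 1.0 - noise_rate
        T[i][(i + 1) % num_classes] += noise_rate
    return T


def uniform_transition_matrix(num_classes, noise_rate):
    """Row i keeps label i with probability 1 - noise_rate and spreads the
    remaining mass evenly over the other classes (needs num_classes >= 2)."""
    off = noise_rate / (num_classes - 1)
    return [[1.0 - noise_rate if i == j else off for j in range(num_classes)]
            for i in range(num_classes)]


def corrupt_labels(labels, T, rng=random):
    """Resample every label from the row of T that belongs to its true class."""
    classes = range(len(T))
    return [rng.choices(classes, weights=T[y])[0] for y in labels]
```

With ten classes and 45% bias noise, each row holds 0.55 on the diagonal and 0.45 on the assumed neighbor, so the correct label keeps only a narrow majority.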
Each network was trained for 200 epochs in mini-batches using the Adam optimizer with a learning rate of 0.001. Regular co-teaching was performed using the optimal settings reported in ref. 42. In each experiment, stochastic co-teaching was introduced over ten epochs; in MNIST experiments without a delay and in CIFAR experiments with a delay of ten epochs. If not stated otherwise, the reported results and corresponding standard deviations are determined from the last ten epochs of each experiment.
All experiments were implemented using PyTorch 56 and were performed in accordance with relevant guidelines and regulations.

Hyperparameter stability
We propose a stochastic co-teaching approach, where a randomly chosen threshold is used to select or reject training instances based on posterior label probability. We have performed extensive experiments to investigate the influence of hyperparameters α and β, i.e. the hyperparameters that determine the different beta distributions from which the instance-selection thresholds are sampled. Figure 2 shows the α and β parameters used in the experiments and their corresponding beta distributions. The probability densities show a wide variety of shapes, ranging from uniform, to parabolic, to bell-shaped distributions that are symmetric, or right- or left-tailed. Selection thresholds are stochastically sampled from these distributions and used in the experiments.
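The selection step can be sketched for two jointly trained networks as follows. This is an illustrative assumption rather than the published implementation: posteriors are plain nested lists, each network draws its own threshold, the function names are hypothetical, and threshold clamping and scheduling are omitted for brevity.

```python
import random


def label_probability(posteriors, labels):
    """Probability that each posterior assigns to the provided (possibly noisy) label."""
    return [p[y] for p, y in zip(posteriors, labels)]


def cross_select(posteriors_a, posteriors_b, labels, alpha, beta, rng=random):
    """Each network thresholds its own label probabilities and hands the
    selected indices to its peer.

    Returns (keep_for_a, keep_for_b): the instances network A trains on
    (chosen by B) and the instances network B trains on (chosen by A).
    """
    tau_a = rng.betavariate(alpha, beta)  # threshold used by network A
    tau_b = rng.betavariate(alpha, beta)  # threshold used by network B
    probs_a = label_probability(posteriors_a, labels)
    probs_b = label_probability(posteriors_b, labels)
    keep_for_b = [i for i, p in enumerate(probs_a) if p > tau_a]
    keep_for_a = [i for i, p in enumerate(probs_b) if p > tau_b]
    return keep_for_a, keep_for_b
```

In a training loop, each network would then compute its loss only on the indices selected for it by its peer.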
Figure 3 shows the effect of different hyperparameters on stochastic co-teaching. The different beta distributions have different impact, but there are dominant hyperparameter pairs that achieve optimal results in nearly all experiments. In general, the distributions above the α = β diagonal (i.e. the right-tailed distributions) show suboptimal results and the distributions below the α = β diagonal (i.e. the left-tailed distributions) show optimal results. Right-tailed distributions sample lower thresholds on average, meaning an increased chance of selecting instances with low posterior probabilities. Left-tailed distributions sample higher thresholds on average, meaning an increased chance of rejecting instances with low posterior probabilities, leaving mainly high posterior probability instances.

Figure 1. Noise transition matrices used in synthetic experiments. Transition matrices are equivalent to confusion matrices but they visualize noise distribution among classes. Bias noise, or label-flipping noise, mimics observer bias by substituting the true label with the label of the neighboring class. Uniform noise is achieved by replacing the true label with a randomly selected other label. Note that bias noise should always be <50% to ensure that the majority of instances remain correct.
Not all hyperparameters perform equally in all experiments. CIFAR-10 and CIFAR-100 image classification tasks are generally considered more challenging. In these applications, there are narrower sweet spots for hyperparameters. The left-tailed, somewhat wider bell-shaped distributions appear to be optimal, with α = 32, β = 2 as an optimum in all applications. Nevertheless, hyperparameters show relatively wide sweet spots, and different hyperparameters have limited impact on accuracy.
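The link between tail direction and average threshold follows from the standard closed-form mean and skewness of the beta distribution. The sketch below evaluates both for α = 32, β = 2 and, for contrast, for two hypothetical pairs that are not necessarily part of the experimental grid.

```python
import math


def beta_mean(alpha, beta):
    """Mean of Beta(alpha, beta): alpha / (alpha + beta)."""
    return alpha / (alpha + beta)


def beta_skewness(alpha, beta):
    """Skewness of Beta(alpha, beta); negative means left-tailed."""
    return (2 * (beta - alpha) * math.sqrt(alpha + beta + 1)
            / ((alpha + beta + 2) * math.sqrt(alpha * beta)))


for alpha, beta in [(32, 2), (2, 32), (2, 2)]:
    skew = beta_skewness(alpha, beta)
    tail = "left-tailed" if skew < 0 else "right-tailed" if skew > 0 else "symmetric"
    print(f"alpha={alpha}, beta={beta}: mean threshold {beta_mean(alpha, beta):.3f}, {tail}")
```

For α = 32 and β = 2 the mean threshold is 32/34, so on average only instances with a posterior probability above roughly 0.94 are selected.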

Rejection rate
Stochastic co-teaching can estimate the amount of label noise present in the data by monitoring the rejection rate. Figure 4 shows the development of the rejection rate during training in several applications. From the figure we observe that when the accuracy converges, the rejection rate converges towards the noise rate. For CIFAR classification, the rejection rate overshoots for the more challenging tasks with higher noise rates. However, this faulty estimation can be inferred from deteriorating test or validation accuracy.

Comparison with other methods
When the label noise rate is known, conventional co-teaching outperforms competing methods on synthetic tasks, as was reported by Han et al. 42. The results of our experiments are shown in Table 3. In all but one experiment, stochastic co-teaching outperforms its conventional counterpart. The benefit of stochastic co-teaching is that the method allowed the same parameter setting (α = 32 and β = 2) for all experiments. In addition, the rejection rate of stochastic co-teaching provides insight into the label noise rate.
In the replication of conventional co-teaching experiments, we achieved higher accuracies than reported by Han et al. 42. To inspect the cause of this, we studied the effect of the forget-rate hyperparameter and the effect of different random seeds used for parameter initialization and mini-batch sampling. For each experiment, we used forget rates between 5% and 95% in steps of 5%. We repeated each MNIST experiment ten times with different random initializations. For CIFAR experiments we repeated each experiment five times, because training this network was more time-consuming (0.5 h vs. 3 h).
The results, shown in Fig. 5, reveal that conventional co-teaching is sensitive to its forget-rate hyperparameter. When the noise rate is known and the forget rate is chosen equal to it, a suboptimal accuracy is achieved, concurring with the findings reported by Han et al. 42. Additionally, our results demonstrate that the impact of overestimating the forget rate is larger than the impact of underestimating it, particularly for experiments with CIFAR-10 and CIFAR-100. Furthermore, different random seeds resulted in a large range of achieved accuracies, specifically in the experiments with MNIST data.

Real-world medical tasks
To show the applicability of our method to medical data, we employed stochastic co-teaching for multi-label classification of medical signals, namely ECGs, and for semantic segmentation of medical images, namely cardiac cine MRI. As in any other medical task, these data inherently contain label noise caused by, e.g., inter- and intra-observer variability.

ECG classification
The ECG is the primary tool for cardiologists to assess the cardiac condition of patients. A typical ECG exam acquires 10 seconds of data at 500 Hz using 12 leads. ECG characteristics are sometimes automatically detected, but an ECG is thereafter manually assessed for diagnosis. This manual interpretation task can be cumbersome and it is often non-trivial in the presence of pathology 58. Automatic interpretation of ECGs using deep neural networks is currently a subject of intensive research, but training such networks is non-trivial, because ECG interpretation is complex due to label noise and observer bias.
In this experiment, we apply stochastic co-teaching to ECG classification using the PTB-XL dataset 59,60. A full description of the data can be found in ref. 59. Briefly, the dataset consists of 21,837 clinically acquired 12-lead ECGs of 10 seconds (16 bit, 500 Hz) from 18,885 patients. The data are divided into ten folds of equal size on the patient level. The 71 different features and diagnoses are aggregated into 5 different classes. The task is posed as non-exclusive multi-label classification in the following classes: normal, conduction disturbance, myocardial infarction, hypertrophy, and ST/T changes. We divide the folds into training, validation, and test data as proposed in ref. 59. Only some of the training data labels were scrutinized by an expert and therefore the training data contain label noise. The validation and test data were checked by an expert and can be considered to contain minimal levels of label noise.
We have performed experiments using a ResNet adapted for time-series 61, which is a top-scoring neural network on the PTB-XL data 60. In Table 4, we present baseline results as reported in Strodthoff et al. 60, our replication of the baseline method, and results of our proposed stochastic co-teaching. The results show similar AUCs between our implementation of the baseline method and stochastic co-teaching: 0.913 and 0.917, respectively, with overlapping confidence intervals. Note that there is a difference between the results reported in Strodthoff et al. and our replication, likely caused by implementation differences. Our implementation of the trained baseline network achieved an accuracy of 0.618; with stochastic co-teaching the accuracy increases to >0.640, as shown in Fig. 6a. Different hyperparameter settings show a pattern of performance similar to that in Fig. 3: higher performance is found below the α = β diagonal.

Cardiac MRI segmentation
In this experiment we evaluate stochastic co-teaching for left ventricle segmentation in cardiac cine MRI images. These images are typically acquired to evaluate cardiac function. One of the primary indicators of cardiac function is the ejection fraction. Ejection fraction is the fraction of blood that leaves the heart when it contracts. It is calculated from annotations of the endocardium (the inner wall) of the left ventricle at two time-points: at end-diastole (maximum expansion) and end-systole (maximum contraction). While segmentation may seem trivial, the endocardium contains many papillary structures (i.e. protruding muscle tissue) that make the task prone to high intra- and inter-observer variability. Moreover, some of the papillary muscles are quite large and this may affect measurements if inconsistently segmented. However, segmenting papillary muscles is cumbersome.
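For reference, the ejection fraction follows the standard definition EF = (EDV − ESV) / EDV, where EDV and ESV are the left-ventricular blood volumes at end-diastole and end-systole. The sketch below uses hypothetical volumes; how volumes are derived from the segmentations is not covered here.

```python
def ejection_fraction(edv_ml, esv_ml):
    """Ejection fraction from end-diastolic and end-systolic volumes (in ml)."""
    if edv_ml <= 0:
        raise ValueError("end-diastolic volume must be positive")
    if not 0 <= esv_ml <= edv_ml:
        raise ValueError("end-systolic volume must lie between 0 and the end-diastolic volume")
    return (edv_ml - esv_ml) / edv_ml


# Hypothetical volumes, for illustration only.
ef = ejection_fraction(edv_ml=120.0, esv_ml=50.0)
print(f"ejection fraction: {ef:.1%}")
```

Because both volumes come from endocardial annotations, assigning papillary muscle to the blood pool in one time-point but not the other shifts the computed fraction, which is why consistent segmentation matters.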
In this experiment we use publicly available MRI images from the Sunnybrook challenge 62. The dataset consists of short-axis cardiac cine MRIs from 45 patients. Multiple slices are acquired that encompass the heart. Image resolution is 1.25 × 1.25 mm in-plane. Each slice is a time-series of 24 frames visualizing one heartbeat. For this dataset three structures are annotated at end-diastole and end-systole. In approximately half of the images, the two largest papillary muscles have been annotated as a separate class. We have included those in our experiments.
We divided images into a training set (104 images) and a test set (49 images) on the patient level such that the test set does not contain images from patients in the training set. We mimic segmentation errors by assigning papillary muscles to the blood pool in 40% of the image slices in the training set. We did not modify the test set. We performed segmentation experiments with a U-Net 63, because this is one of the most used architectures for medical image segmentation. The network was trained for 1000 epochs using mini-batches containing 16 randomly selected image patches of 128 × 128 pixels (original image size is 256 × 256). Other augmentations were random flipping and random rotations around all axes in steps of 90 degrees.
We implemented stochastic co-teaching to select or reject individual voxels during training. By visualizing selection and rejection of voxels as masks, valuable qualitative information is revealed during training, such as areas with label errors and areas of observer variability, as is shown in Fig. 7. The selection masks ignore borders of segmentations. This is logical considering that the outlines can be quite arbitrary. Furthermore, the selection masks include inner voxels of papillary muscles when they are correctly labeled, and they exclude them when they are not labeled, meaning that stochastic co-teaching has effectively ignored incorrect labels. Note that a single threshold might be used to select pixel-instances in a mini-batch, but we chose to generate a selection threshold for each pixel. However, given that generating random parameters is time-consuming, we generate one 16 × 16 map per training image patch and tile these to the patch size.

Table 3. Comparison of the accuracy achieved by stochastic co-teaching and previous methods. For a fair comparison we report the results from Han et al. 42, as well as our replication of these experiments. We report the results of related methods Decoupling and MentorNet, and we report results of standardly trained neural networks (Standard), co-teaching (CoT), and stochastic co-teaching (StoCoT). For stochastic co-teaching we chose a left-tailed beta distribution with parameters α = 32 and β = 2. Note that there is a performance increase in our CoT experiments in all but the CIFAR-100 experiments. Figure 5 shows that random initialization has an impact on an ill-tuned forget rate.

Figure 8 shows several qualitative examples of a U-Net trained with and without stochastic co-teaching. The results show that the method using stochastic co-teaching achieves a more consistent output. The method more consistently segments the papillary muscles and the results also show that it outlines the myocardium more consistently. Quantitative results, listed in Table 5, demonstrate that Dice and distance metrics improve when stochastic co-teaching is applied compared to standard training. Finally, Fig. 6b demonstrates the robustness of stochastic co-teaching in this segmentation task with varying settings of hyperparameters α and β.
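The tiled per-pixel thresholds mentioned earlier in this section can be sketched as follows. The example rests on two assumptions that the text does not confirm: "tiling" is read as periodic repetition of the 16 × 16 map, and the thresholds are clamped to [0.01, 0.99] as in the classification setting. The function names are hypothetical.

```python
import random


def tiled_threshold_map(alpha, beta, tile=16, patch=128, rng=random):
    """Sample a tile x tile map of Beta(alpha, beta) thresholds and repeat it
    periodically to cover a patch x patch image patch."""
    if patch % tile != 0:
        raise ValueError("patch size must be a multiple of the tile size")
    base = [[min(max(rng.betavariate(alpha, beta), 0.01), 0.99)
             for _ in range(tile)] for _ in range(tile)]
    return [[base[r % tile][c % tile] for c in range(patch)] for r in range(patch)]


def selection_mask(prob_map, threshold_map):
    """True where the posterior probability of the labeled class exceeds the
    threshold at that pixel, i.e. where the pixel is selected for training."""
    return [[p > t for p, t in zip(p_row, t_row)]
            for p_row, t_row in zip(prob_map, threshold_map)]
```

For a 128 × 128 patch this draws 256 random numbers instead of 16,384, which is the cost saving that motivates tiling.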

Discussion
Stochastic co-teaching employs two neural networks that are jointly trained. Each network selects mini-batch examples for the other using a stochastically determined threshold on the posterior probability. This approach does not require a priori knowledge about label noise, and as a result it eliminates the need for meticulous parameter tuning, which is especially useful in real-world tasks. Stochastic co-teaching achieves excellent results and outperforms state-of-the-art approaches on a variety of classification tasks with extreme and unknown levels of label noise. It is robust to varying levels of label noise and it can be used to estimate the level of label noise by monitoring the rejection rate. Furthermore, we have demonstrated the applicability of our method in two real-world medical tasks: classification of ECG signals, and semantic segmentation of cardiac MRI images.

Incorrect estimation of the noise rate may be detrimental to model performance. As Fig. 4 shows, the noise rate is incorrectly estimated for the uniform 50% and bias 45% noise cases in the CIFAR experiments. This can be caused by the complexity of the problem, and by poorly chosen α and β. However, we have demonstrated that with a priori unknown noise rates, stochastic co-teaching requires relatively little hyperparameter tuning compared to conventional co-teaching. While conventional co-teaching requires extensive tuning of the forget rate, in stochastic co-teaching the forget rate is automatically determined. The only hyperparameters for stochastic co-teaching pertain to the shape of the sampling distribution, defined by α and β. As the hyperparameter grids shown in Fig. 3 suggest, the performance is consistent for a wide range of settings and problems. Similarly, the results from the cardiac MRI segmentation problem demonstrate relatively homogeneous performance across the different hyperparameter settings. The ECG classification results on the PTB-XL dataset show a slightly lower fault tolerance with respect to choosing α and β, but they agree with the general trend that it is beneficial to choose α > β, as observed in MNIST and CIFAR. We note that other distributions could potentially be chosen for threshold sampling: the only requirement is that the distribution is defined exclusively on [0, 1], and that it is unimodal, which the beta distribution satisfies for α, β > 1. Additionally, left-tailed distributions are preferred, like the beta distribution with α > β. However, these results may be specific to experiments performed with softmax outputs and cross-entropy. Since different losses are differently calibrated 64, they might show other responses to stochastic co-teaching.
Replication of baseline experiments with conventional co-teaching resulted in higher accuracies than reported by Han et al. 42. Differences in results may be ascribed to differences in random initialization. Namely, in our experiments we have shown that different random initializations of conventional co-teaching resulted in highly variable outcomes. In cases of extreme label noise, mini-batches might be sampled that consist predominantly of training instances with label noise. Such situations are difficult for conventional co-teaching, where a predetermined number of training instances is selected. Stochastic co-teaching handles these situations better since it can reject an arbitrary amount of label noise. However, we have observed that mini-batches containing only instances with label noise led to numerical instability when all instances were rejected. We have addressed this by generating new selection thresholds or by resampling new mini-batches.

The stochasticity has the additional effect of enforcing different decision boundaries on each of the models during training. Although we did not study this effect, we presume that stochastic co-teaching allows training of models that were initialized identically, meaning that stochastic co-teaching might be readily applicable to pre-trained networks.

Similar to conventional co-teaching, we employed a schedule that introduces the selection threshold of stochastic co-teaching. For complex tasks we found that a delay of several epochs benefited performance. This delay effectively utilizes the tendency of deep neural networks to learn general patterns first 1,2. While we did not perform an in-depth study of the effect of different schedules on different classification tasks, network architectures, or loss functions, we observed a limited effect in preliminary experiments.
Stochastic co-teaching in the PTB-XL ECG classification task leads to improved accuracy, whereas the AUC remains similar to the baseline. This may indicate that the improvement from stochastic co-teaching is caused by improved classification of the majority classes. The difference between our baseline experiments on the PTB-XL ECG classification dataset and those from Strodthoff et al. 60 may be attributed to differences in randomness-based operations, such as weight initialization and data sampling.
A drawback of co-teaching is that training with two networks imposes increased computational demands of approximately twice the amount of compute and memory compared to supervised training. A more efficient alternative is bootstrapping, which uses just one network for selection or rejection of samples 26. Bootstrapping handles label noise well, but it is outperformed by co-teaching 42. This indicates that it is beneficial to train with two predictors, each with its own decision boundaries. An alternative method including more than two networks was also proposed 43. However, such an approach increasingly impacts hardware demands, especially for semantic segmentation tasks, which require high-resolution outputs. Alternative directions might be in training Bayesian networks, and synthetically increasing the number of networks. However, each of these methods would require non-trivial voting schemes. Furthermore, consensus voting and selecting instances for training renders the method more similar to Decoupling 41, where the error propagation is strongly coupled among networks 49, contrasting with the core benefits of co-teaching, where networks are decoupled and thus decision boundaries can develop individually.
Although we only evaluated stochastic co-teaching for classification tasks using the cross-entropy loss, other losses could be applied. However, threshold selection should be re-evaluated, because the effect of hyperparameters might be different than in our experiments. Additionally, co-teaching could be recast for application in regression problems. The decision of selection and rejection should then be based on a distance metric, e.g. an L1 or L2 norm, and the inclusion threshold should be sampled from a distribution without an upper bound, e.g. a Gamma or a Chi-squared distribution.
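A speculative sketch of the regression variant outlined above: instances are kept when the absolute (L1) error falls below a threshold drawn from a Gamma distribution. Nothing here was evaluated in the paper; the function name, the scalar targets, and the shape and scale parameters are assumptions made for illustration.

```python
import random


def select_regression_instances(predictions, targets, shape, scale, rng=random):
    """Keep instances whose absolute error is below a Gamma-distributed threshold.

    Returns the selected indices together with the sampled threshold.
    """
    tau = rng.gammavariate(shape, scale)  # support is (0, inf), no upper bound
    kept = [i for i, (p, t) in enumerate(zip(predictions, targets))
            if abs(p - t) < tau]
    return kept, tau
```

In contrast to the classification case, a small distance indicates agreement with the label, so the comparison direction is reversed relative to thresholding a posterior probability.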

Conclusion
We have presented a method for stochastic co-teaching. The method employs training of two networks where each network selects training instances for the other network. Training instances are selected based on the posterior probability of each network and a selection threshold sampled from a left-tailed beta distribution. The method does not require any a priori knowledge about the level of label noise and it can be applied to a variety of classification problems including medical tasks such as classification of medical signals and semantic segmentation of medical images.

Figure 2. Variations for α and β used in experiments result in varying Beta probability distributions.

Figure 3. Hyperparameter-sweep experiments showing accuracy averaged over the last 10 of 200 epochs. Panels (a)-(i) show the experiments that were performed, using the corresponding variations of Beta distributions shown in Fig. 2. Each colorbar is scaled to the specific range of results.

Figure 4. Stochastic co-teaching provides an estimate of label noise rates, if both the rejection rate and the test/validation performance converge. Rejection rates and test accuracy are shown for image classification in (a) MNIST, (b) CIFAR-10, and (c) CIFAR-100 classification experiments. We used Beta distributions with parameters α = 32 and β = 2. The real noise rates are provided as a dashed horizontal line in matching colors.

Figure 5. Accuracy of conventional co-teaching using different settings for the forget-rate hyperparameter. The solid line shows average accuracy and dashed lines show the range of accuracies, calculated for 10 experiments per forget rate for MNIST, and 5 experiments per forget rate for CIFAR-10 and CIFAR-100. The results demonstrate the impact of an ill-chosen forget rate and different random initializations of conventional co-teaching.

Figure 6. Proposed stochastic co-teaching on real-world medical tasks for different hyperparameter settings of α (horizontal axis) and β (vertical axis): (a) Accuracies for multi-label ECG classification (PTB-XL). (b) Dice scores for left ventricle myocardium segmentation in cardiac MRI (CMR). The baseline ECG classification network trained without stochastic co-teaching achieved an accuracy of 0.618, and the baseline CMR segmentation network a Dice score of 0.71.

Figure 7. Three cardiac MRI training images (first column), with the original reference labels (second column), and with synthetic label noise applied (third column). The fourth to sixth columns show different selection masks generated by stochastic co-teaching. Left ventricle myocardium is indicated in yellow, and the blood pool in red. The selection masks indicate selection (white) or rejection (black) of pixels for training. Top row: A training example with annotated papillary muscles, hence no synthetic label noise applied. Note that pixels are rejected exclusively along segment borders, while the pixels in the center of the papillary muscles are selected. Middle row and bottom row: Two training examples with synthetic label noise. Synthetic label noise is added by including pixels representing papillary muscle in the blood pool. Note that for all selection maps in the middle and bottom row, pixels representing papillary muscles are rejected for training. These examples show that the method rejects the noisily labeled papillary muscles, while preventing the overfitting of correctly labeled papillary muscles in the first example. See Table 5 for quantitative results.

Figure 8. Examples showing the segmentation performance on the test set of a standard CNN and a similar CNN with stochastic co-teaching. Co-teaching shows more sensitivity towards segmentation of papillary muscles, even though their segmentations were excluded in 45% of training data. See Table 5 for quantitative results.

Table 2. Two network architectures used in the experiments. The MNIST model uses ReLU as activation function. The CIFAR model, as used in refs. 42,57, uses batch normalization after each convolution layer and uses a leaky ReLU for activation with a negative slope of 0.01.