Measuring internal inequality in capsule networks for supervised anomaly detection

In this paper we explore the use of income inequality metrics such as the Gini or Palma coefficients as a tool to identify anomalies via capsule networks. We demonstrate how the interplay between primary and class capsules gives rise to differences in behavior on anomalous and normal input, which can be exploited to detect anomalies. Our setup for anomaly detection requires supervision in the form of known outliers. We derive several criteria for capsule networks and apply them to a number of Computer Vision benchmark datasets (MNIST, Fashion-MNIST, Kuzushiji-MNIST and CIFAR10), as well as to a dataset of skin lesion images (HAM10000) and a dataset of CRISPR-Cas9 off-target pairs. The proposed methods outperform the competing approaches in the majority of the considered cases.

www.nature.com/scientificreports/

Detecting anomalies with Capsule Networks would benefit from exploiting this "part-whole" relationship expressed in the dynamics of primary capsules voting for a class. Previous works11,12 on supervised anomaly detection with capsule networks use the reconstruction ability and class probabilities to separate outliers from inliers, while the methods proposed in this work are based on evaluating the unequal response of the routing mechanism to normal and aberrant inputs. Class probability is given by the computation of the class capsule output via routing. Routing by agreement has an intrinsic property of polarization13: convergence on a single route from primary to class capsules. This property gives rise to inequality between a well-predicted and a poorly predicted class in case of class imbalance. We can measure such discrepancy using economic inequality metrics, such as the Gini14 and Palma15 coefficients. Our main contributions can be summarized as follows:
1. We propose a new approach for supervised anomaly detection using capsule networks;
2. We suggest a new application of economic inequality metrics to machine learning, which also allows investigating the internal mechanisms of capsule networks;
3. We perform a comprehensive review and comparison of different capsule network-based anomaly detection methods on standard benchmarks and real-world data, which confirms the state-of-the-art performance of the proposed methods.

Materials and methods
Capsule networks as an ensemble of weak learners. The main difference between Capsule Networks and conventional architectures lies in the output: instead of class probabilities, the network outputs vectors carrying information about the learned features. This is achieved by combining the outputs of different small models and summarizing their decisions. There are two types of capsules, primary and secondary, so Capsule Networks are inherently two-layered. Class capsules perform prediction based on the combined information from all primary capsules. The output of a primary capsule acts as an asset for class capsule N: it enlarges or shrinks the class capsule's output vector, yielding a large or small probability of an example belonging to class N. Different primary capsules are of different value for class capsule N, and the vectors of class capsules act as a kind of summary statistic over the distribution of primary capsule contributions. We hypothesize that the distribution of these "values" for a normal example differs from that for an anomalous one, so we can detect anomalies based on that difference. The architecture of Capsule Networks proposed by Sabour et al.10 is shown in Fig. 1.
In case of image classification, capsule networks process the input image in three steps: preprocessing, routing and reconstruction. Preprocessing in our setting consists of a 2D convolutional layer followed by an activation function (ReLU or ELU in our experiments). It is followed by a number of 2D convolutional layers with N_f filters each, producing outputs of size (N_f, output width w, output height h). The outputs of every convolutional layer are joined together and flattened to form a tensor of size (number of primary capsules N_P = N_f × w × h, dimension of primary capsule output O_P). Then the squash activation function is applied separately for each element in the batch and for each 2D convolutional layer:

$$\text{squash}(x) = \frac{\|x\|^2}{1 + \|x\|^2} \, \frac{x}{\|x\|}, \qquad (1)$$

where x is the input to the squash activation, e.g. the output of the preprocessing stage. The outputs of primary capsules are then fed into N_S secondary, or class, capsules (in our setting N_S = 2), and we enter the routing stage. Routing is an iterative algorithm, and the following steps describe a single iteration. Usually the routing is repeated for 3 iterations.
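As an illustration, the squash nonlinearity of Eq. (1) can be sketched in NumPy (a minimal sketch, not the authors' implementation; the small epsilon guarding against division by zero is our addition):

```python
import numpy as np

def squash(x, axis=-1, eps=1e-8):
    """Squash activation: scales a vector by ||x||^2 / (1 + ||x||^2),
    keeping its direction while bounding its norm below 1."""
    sq_norm = np.sum(x ** 2, axis=axis, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)
    return scale * x / np.sqrt(sq_norm + eps)

# An input of norm 5 is squashed to norm 25/26, direction unchanged.
v = squash(np.array([3.0, 4.0]))
assert np.linalg.norm(v) < 1.0
```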
During the routing we first compute the coupling coefficients: a tensor c of shape (N_P, N_S, O_S). For each secondary capsule j we compute the prior probability of coupling for all primary capsules and save it into the coupling coefficient table. We will use the following notation below:
1. c_ij: a slice of tensor c that corresponds to primary capsule i and secondary capsule j;
2. W_ij: the corresponding slice of the weight tensor;
3. u_i: the output of primary capsule i.
Then for each secondary capsule j we compute the unsquashed secondary capsule output from the products of the weight tensor slices W_ij and the primary capsule outputs u_i:

$$s_j = \sum_i c_{ij} \odot (W_{ij} u_i),$$

where ⊙ denotes elementwise multiplication.
The outputs of secondary capsules are computed as in the case of primary ones, by applying the squashing function [see Eq. (1)] for every secondary capsule j:

$$v_j = \text{squash}(s_j).$$

The "agreement" is computed by scalar multiplication of the output of secondary capsule j with the product of the weights and primary capsule outputs that correspond to primary capsule i and that secondary capsule,

$$a_{ij} = v_j \cdot (W_{ij} u_i),$$

and at the end of the routing iteration the routing table is updated with these agreements.
To get the class probabilities for an example, we compute the Euclidean norms of the vectors produced by the secondary capsules:

$$P(C_j) = \|v_j\|,$$

where C_j is the class label that corresponds to the j-th class capsule.
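The routing loop described above can be sketched as follows. This is a simplified illustration with a scalar routing logit per (primary, secondary) pair, as in Sabour et al.; the paper's coupling tensor c additionally carries an output-dimension axis, so the exact shapes here are an assumption:

```python
import numpy as np

def squash(x, axis=-1, eps=1e-8):
    sq = np.sum(x ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * x / np.sqrt(sq + eps)

def routing(u, W, n_iters=3):
    """Dynamic routing by agreement (sketch).
    u: (N_P, O_P) primary capsule outputs
    W: (N_P, N_S, O_S, O_P) transformation weights
    Returns v: (N_S, O_S) class capsule outputs and the couplings c."""
    n_p, n_s = W.shape[0], W.shape[1]
    # Prediction vectors u_hat[i, j] = W_ij @ u_i
    u_hat = np.einsum('ijkl,il->ijk', W, u)                   # (N_P, N_S, O_S)
    b = np.zeros((n_p, n_s))                                  # routing logits
    for _ in range(n_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over j
        s = np.einsum('ij,ijk->jk', c, u_hat)                 # weighted sum
        v = squash(s)                                         # (N_S, O_S)
        b = b + np.einsum('jk,ijk->ij', v, u_hat)             # agreement update
    return v, c

# Class probabilities are the norms of the class capsule output vectors.
rng = np.random.default_rng(0)
v, c = routing(rng.normal(size=(6, 4)), rng.normal(size=(6, 2, 3, 4)))
probs = np.linalg.norm(v, axis=-1)
assert probs.shape == (2,) and np.all(probs < 1.0)
```

For each primary capsule, the couplings form a softmax distribution over the class capsules, which is the quantity the inequality analysis below operates on.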
Loss function and regularization. We can train the network with any kind of classification objective, but following Sabour et al.10 we use the margin loss:

$$L_k = T_k \max(0, m^+ - \hat{Y}_k)^2 + \lambda (1 - T_k) \max(0, \hat{Y}_k - m^-)^2,$$

where k is the index of the output capsule, T_k = 1 if the k-th capsule denotes the class that corresponds to the real label and 0 otherwise, m^+, m^- and λ are hyperparameters, and Ŷ_k is the norm of the k-th capsule output vector. A full description of all hyperparameters used can be found in the Supplementary Note.
As a regularization, as in the original paper10, an additional reconstruction subnetwork R is used:

$$L_{rec} = \text{MSE}(X, R(X)).$$

So the total loss we optimize is:

$$L = \sum_k L_k + \alpha L_{rec},$$

where α > 0 is a hyperparameter chosen to balance the contribution of the reconstruction loss. Most of the learning happens when the weights of the class capsule layer are adjusted via gradient descent during the backward pass. The adjustment is based on the output of the layer during the forward pass, which is computed iteratively according to the equations above. Training capsule networks the usual way is not easy: one needs a non-standard loss and an additional decoder for regularization. Yet the most interesting computations happen within the coupling coefficients.
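A sketch of the margin and total losses, using the commonly cited defaults m+ = 0.9, m− = 0.1 and λ = 0.5 (the values actually used are in the paper's Supplementary Note, so these defaults are an assumption):

```python
import numpy as np

def margin_loss(y_norms, t, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss: for each class capsule k,
    T_k * max(0, m+ - Y_k)^2 + lam * (1 - T_k) * max(0, Y_k - m-)^2."""
    pos = t * np.maximum(0.0, m_pos - y_norms) ** 2
    neg = lam * (1.0 - t) * np.maximum(0.0, y_norms - m_neg) ** 2
    return np.sum(pos + neg)

def total_loss(y_norms, t, x, x_rec, alpha=0.0005):
    """Margin loss plus alpha-weighted reconstruction (MSE) regularizer."""
    rec = np.mean((x - x_rec) ** 2)
    return margin_loss(y_norms, t) + alpha * rec

# A confident, correct prediction incurs no margin loss.
assert margin_loss(np.array([0.95, 0.05]), np.array([1.0, 0.0])) == 0.0
```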

Main idea.
Both previous studies11,12 base their work on the differences between estimated class probabilities and the reconstruction subnetwork. We do not use reconstructions in our method, although they could indeed help in a real-life application. Instead, we focus on the estimation of class probabilities. In the capsule network setting, the probabilities are formed by a softmax of the capsule output vector norms. Output vectors provide information beyond class probability: according to the original Capsule Networks paper10, they capture interpretable properties like stroke thickness, localization and shape. This information gets lost when we compute the norm.
To gather as much information as possible, we dive deeper into the routing mechanics. The coupling coefficients c, computed from the routing table, contain all the information about the way primary and class capsules would route for a given example. We base our research on the assumption that the couplings of the normal and the anomalous capsule behave differently when the network encounters examples of the abundant class versus the rare one.
To do the classification, we ideally need a summary statistic for the couplings. Let c_j be the part of the coupling table c that corresponds to the j-th secondary capsule. It is a tensor of size (N_P, O_S), where N_P is the number of primary capsules and O_S is the dimension of the class capsule output vector. For each secondary capsule j we first sum the respective couplings along the O_S axis:

$$M_j = \sum_{l=1}^{O_S} c_j^l,$$

where c_j^l is a vector of dimension N_P obtained as a slice of tensor c for secondary capsule j and its output dimension l ∈ {1, ..., O_S}. The distribution of M_j for different cases is shown in Fig. 2 below. This vector can be interpreted as a vector of total contributions of the primary capsules to the j-th secondary capsule's result. Due to the polarization property of capsule networks, those contributions would be highly unequal if the network is well-trained. Figure 2 clearly shows a significant change in the distribution for the anomalous capsule, which can be captured by the inequality measures discussed below.
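Computing the contribution vectors M_j from a coupling tensor of shape (N_P, N_S, O_S) is a one-line reduction; a NumPy sketch:

```python
import numpy as np

def primary_contributions(c):
    """Sum the coupling tensor c of shape (N_P, N_S, O_S) over the
    output-dimension axis, giving each class capsule j a vector M_j of
    total primary capsule contributions; returns shape (N_S, N_P)."""
    return c.sum(axis=-1).T

# Toy couplings: 32 primary capsules, 2 class capsules, O_S = 16.
c = np.random.default_rng(1).random((32, 2, 16))
M = primary_contributions(c)
assert M.shape == (2, 32)
```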
To evaluate internal inequality in capsule networks, we require a short detour into econometrics. Income inequality in economics is measured by a number of statistical criteria15. The most popular ones, which we use here, are the Gini14 and Palma16 coefficients. For a sample sorted in ascending order, the Gini coefficient is

$$G = \frac{2 \sum_{i=1}^{n} i \, z_i}{n \sum_{i=1}^{n} z_i} - \frac{n+1}{n}, \qquad (3)$$

where n is the sample size, i is the rank of the example in the sorted sample and z_i is its value. The Gini coefficient ranges from 0 to 1, and in the most unequal case (only one non-zero example) its value, (n − 1)/n, approaches 1.
More recently, the Palma16 coefficient started to displace Gini as a go-to measure of income inequality:

$$P = \frac{\sum_{z_i > Q_{90}} z_i}{\sum_{z_i < Q_{40}} z_i}, \qquad (4)$$

where Q_90 and Q_40 are the 90th and 40th percentiles respectively. The key assumption behind the Palma coefficient is that the tails of the income distribution contribute to inequality the most, while the middle ground remains relatively stable over time. This assumption makes the Palma coefficient work rather well for yearly assessment of the economic inequality of countries. We hypothesize that the Palma coefficient would work better than Gini because, due to polarization, the assumption holds in our case. Now we are equipped to apply income inequality metrics to Capsule Networks. The first proposed criterion applies the Gini coefficient of Eq. (3) to the contribution vector: Gini(M_j). The second criterion is computed from M_j according to the Palma coefficient of Eq. (4): Palma(M_j). Coming back to the example from Fig. 2, we clearly see that both the Gini and Palma coefficients capture the difference in the distribution for the anomalous capsule. We use both of these criteria as measures of data point "outlierness" and compute the AUC directly. It is possible to use Logistic Regression or any other classification model (SVM, XGBoost, Random Forest, ...) on the values of Gini, Palma or both, and also to consider adding other features derived from the data or reconstruction properties, but we defer this to future work.
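Both inequality measures are straightforward to compute; a NumPy sketch using the rank formula for Gini and the percentile-share ratio for Palma (the exact percentile convention is an assumption):

```python
import numpy as np

def gini(z):
    """Gini coefficient, rank formula for an ascending-sorted sample:
    G = 2 * sum_i(i * z_i) / (n * sum_i z_i) - (n + 1) / n."""
    z = np.sort(np.asarray(z, dtype=float))
    n = len(z)
    ranks = np.arange(1, n + 1)
    return 2.0 * np.sum(ranks * z) / (n * z.sum()) - (n + 1.0) / n

def palma(z):
    """Palma ratio: total mass above the 90th percentile divided by
    the total mass below the 40th percentile."""
    z = np.asarray(z, dtype=float)
    q90, q40 = np.percentile(z, 90), np.percentile(z, 40)
    return z[z > q90].sum() / z[z < q40].sum()

# A single non-zero value out of four gives the maximal Gini, (n - 1) / n.
assert np.isclose(gini([1.0, 0.0, 0.0, 0.0]), 0.75)
```

In our setting, z would be the contribution vector M_j of a class capsule, so a polarized (well-routed) capsule yields high Gini and Palma values.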

Data. MNIST-like benchmarks.
The previous studies11,12 are conceptually similar, but they offer different ways to measure the performance of the anomaly detection metrics. We first considered an experiment inspired by the work of Piciarelli et al.12, organized following the model generation procedure for the Diverse Outlier setup:
1. Extract all examples of class i from the training set with N classes, and assign the label l = 0;
2. Randomly extract A examples of any other class and assign the label l = 1;
3. Train a model to classify the data into two classes.
We apply this procedure to all classes in the datasets and to outlier fractions of 10%, 1% and 0.1%, so we get 4 × 3 × 10 = 120 models (4 datasets, 3 fractions, 10 classes) to test our results on. This procedure gives us a coherent normal subset and a diverse subset of outliers.
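The Diverse Outlier construction can be sketched as follows; treating the outlier fraction as relative to the size of the normal class is our assumption:

```python
import numpy as np

def diverse_outlier_split(x, y, normal_class, outlier_fraction, rng):
    """Diverse Outlier setup (sketch): all examples of `normal_class`
    become inliers (label 0); a small random sample drawn from every
    other class becomes the outlier set (label 1)."""
    normal_idx = np.where(y == normal_class)[0]
    other_idx = np.where(y != normal_class)[0]
    n_out = max(1, int(round(outlier_fraction * len(normal_idx))))
    out_idx = rng.choice(other_idx, size=n_out, replace=False)
    idx = np.concatenate([normal_idx, out_idx])
    labels = np.concatenate([np.zeros(len(normal_idx)), np.ones(n_out)])
    return x[idx], labels

rng = np.random.default_rng(0)
y = np.repeat(np.arange(10), 100)   # toy labels: 10 classes x 100 examples
x = np.arange(len(y))               # stand-in "images"
xs, ls = diverse_outlier_split(x, y, normal_class=3, outlier_fraction=0.01, rng=rng)
assert int(ls.sum()) == 1 and len(xs) == 101
```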
The approach based on the work of Li et al.11 is the reverse in nature: we use a single class as our anomaly label and consider all other classes normal, so the normal set is diverse and the anomalous set is coherent. We train the models following the model generation procedure for the Diverse Inlier setup:
1. Extract all examples of class i from the training set with N classes, and assign the label l = 1;
2. Randomly extract examples of the other classes to form the normal set, and assign the label l = 0;
3. Train a model to classify the data into two classes.
This gives us another 120 models to test.
We use MNIST17, FashionMNIST18, KuzushijiMNIST19 and CIFAR1020 with the diverse outlier and diverse inlier setups to compare the proposed methods, the previous studies11,12 and the baselines. Each dataset except CIFAR10 has 60,000 single-channel images of size (28, 28), separated into 10 classes. CIFAR10 has 60,000 three-channel images of size (32, 32), also separated into 10 classes.
Malignant skin lesion classification. The HAM1000021 dataset contains high-quality photos of 7 skin lesion types, three of which are malignant. The dataset contents are shown in Table 1.
Inspired by the anomaly-based cancer detection pipeline3, we consider malignant skin lesions anomalies: aberrations of the correct skin cell life-cycle. While benign skin lesions are also a kind of aberration, we consider them the base for the normal classes in our experiments. From this dataset we derive four experiments:
1. Diverse normal set, diverse anomalous set: all malignant types as the anomaly set, all benign types as the normal set;
2. Diverse normal set, homogeneous anomalous set: melanoma (the most common skin cancer) images as the anomaly set, all benign types as the normal set;
3. Homogeneous normal set, diverse anomalous set: all malignant types as the anomaly set, melanocytic nevus (the most common benign lesion, a birthmark) images as the normal set;
4. Homogeneous normal set, homogeneous anomalous set: melanoma vs melanocytic nevus.
CRISPR-Cas9 off-target detection. Off-target cleavage in CRISPR/Cas9-based gene editing can lead to various unforeseen consequences. When designing a gene editing experiment, it is of high importance to select gRNAs that minimize the probability of Cas9 performing a double-strand cleavage in the wrong place (the off-target effect). To do so using Machine Learning, a dataset of gRNA-target pairs is used. In this setting we classify the pairs into two classes: no cleavage (0) and off-target (1). The dataset of CRISPR-Cas9 off-targets taken from the work of Peng et al.22 consists of 215 low-throughput off-target pairs, 527 high-throughput off-target pairs and a negative subset of 408,260 pairs. The low- and high-throughput pairs are labelled 1, and the negative subset 0. Each pair is two strings of "A", "T", "G" and "C" symbols. We convert pairs into images using the following preprocessing routine:
1. One-hot encode the target string (which is 23 nucleotides long) to get I_1;
2. One-hot encode the gRNA string (which is also 23 nucleotides long) to get I_2;
3. Join the encodings to get a tensor of size (2, 4, 23).
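The pair-to-image preprocessing can be sketched in NumPy (the channel ordering and the nucleotide-to-row mapping are our assumptions):

```python
import numpy as np

NUCS = {'A': 0, 'T': 1, 'G': 2, 'C': 3}

def one_hot(seq):
    """One-hot encode a nucleotide string into a (4, len) array."""
    m = np.zeros((4, len(seq)))
    for pos, nuc in enumerate(seq):
        m[NUCS[nuc], pos] = 1.0
    return m

def encode_pair(target, grna):
    """Stack the one-hot encodings of a 23-nt target and a 23-nt gRNA
    into a (2, 4, 23) 'image' for the capsule network."""
    assert len(target) == len(grna) == 23
    return np.stack([one_hot(target), one_hot(grna)])

pair = encode_pair('GAGTCCGAGCAGAAGAAGAAGGG', 'GAGTCCGAGCAGAAGAAGAATGG')
assert pair.shape == (2, 4, 23)
assert np.all(pair.sum(axis=1) == 1.0)  # one nucleotide per position
```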
We do not use the pairs that have more than 6 mismatches (since the off-target cleavage is considered to be impossible in that case) so the final dataset consists of 615 anomalous pairs (off-targets) and 26,038 normal pairs from negative subset.

Related work
Within the supervised framework, a model for anomaly detection is trained to discriminate normal examples from anomalous ones. It has a certain advantage in discrimination ability over models trained within the unsupervised framework, and in cases where the anomalies are rather homogeneous, it usually performs best. One of the basic methods in supervised anomaly detection using deep learning is Negative Learning (NL23): to distinguish between outliers and normal data points, one uses the reconstruction error of an autoencoder that is trained to reconstruct normal data points perfectly while failing to reconstruct the outliers. Negative learning-based anomaly detection, however, suffers from an impaired ability to reconstruct normal data points. To overcome this issue, the work of Yamanaka et al.6 introduces the Autoencoding Binary Classifiers (ABC), which extend the negative learning approach by providing lower and upper boundaries on the loss function with respect to the reconstruction errors. NL and ABC will be used in our work as baselines.
The main idea of negative learning23 is to permanently damage the reconstructive ability of an autoencoder by forcing it to maximize the reconstruction error on anomalous samples while minimizing it on normal ones. As an autoencoding model, the work of Munawar et al.23 uses a Restricted Boltzmann Machine with a visible layer and a hidden layer. The network is trained using single-step contrastive divergence24 with the update sign flipped for anomalous samples:

$$\delta w = \sigma \left( v h^\top \big|_{\text{data}} - v h^\top \big|_{\text{reconstruction}} \right), \qquad \sigma = \begin{cases} +1 & \text{for a normal sample,} \\ -1 & \text{for an anomalous sample,} \end{cases}$$

where v is the visible layer, h is the hidden layer, σ is the sign and δw is the gradient of the weights. In the ABC objective, the logarithm term caps the loss for anomalous samples (y = 1). Additionally, the ABC paper6 uses multilayer perceptrons instead of RBMs for the architecture and gradient descent instead of CD as the training algorithm.
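A sketch of an ABC-style objective of the kind described above (the exact form used by Yamanaka et al. may differ; here y = 1 marks an anomaly, and the log term bounds the loss as the reconstruction error grows):

```python
import numpy as np

def abc_loss(x, x_rec, y):
    """ABC-style objective (sketch): for a normal sample (y = 0) the
    loss is the reconstruction MSE; for an anomaly (y = 1) it is
    -log(1 - exp(-MSE)), which decreases as the error grows and caps
    the contribution of anomalous samples."""
    err = np.mean((x - x_rec) ** 2)
    if y == 0:
        return err
    return -np.log1p(-np.exp(-err))

# For an anomaly, a larger reconstruction error gives a smaller loss.
x = np.ones(4)
assert abc_loss(x, x + 2.0, y=1) < abc_loss(x, x + 1.0, y=1)
```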
Capsule networks were already used for anomaly detection in the works of Piciarelli et al.12 and Li et al.11. The first paper considers a supervised anomaly detection setup, while the second one proceeds with an unsupervised formulation. They propose anomaly and normality metrics, respectively, that are based on the regularizing decoder and on the difference between the estimated probabilities of the normal and anomalous classes. In our paper we show that those metrics are a direct consequence of the internal inequality of "assets" in Capsule Network representations. The probabilities and reconstruction errors are at the end of the pipeline, and a lot of the information that helps distinguish anomalies from normal data is lost in the process of their computation.
We compare our work with both previous studies11,12 and a few baselines (NL and ABC). Our working hypothesis implies that the information discarded when the probabilities are computed helps detect anomalies better. We also compute all available anomaly metrics and show that our work provides the best results in most cases. The work of Piciarelli et al.12 proposes the following anomaly measure:

$$A(X, \hat{X}, \hat{Y}_{\text{normal}}, \hat{Y}_{\text{anomaly}}) = \hat{Y}_{\text{normal}} - \hat{Y}_{\text{anomaly}} + \text{MSE}(X, \hat{X}), \qquad (5)$$

where X is the input image, X̂ is its reconstruction, and Ŷ_n, n ∈ {normal, anomaly}, is the norm of the corresponding capsule output vector. It is based on the observation that for an anomaly the difference between the class probabilities tends to be less drastic than for a normal example. Additionally, this paper12 proposes filtering the anomaly class from the input to the reconstruction subnetwork, so that it is trained to reconstruct only normal images. We include this feature in every experiment with Capsule Networks.

The work of Li et al.11 provides two metrics. The first one, given by Eq. (6), is the largest probability of a class:

$$N_{pp}(\hat{Y}) = \max_{j \in \{1, \ldots, N_S\}} \hat{Y}_j, \qquad (6)$$

where Ŷ is the vector of norms of the capsule output vectors and N_S is the number of output capsules. Equation (6) is introduced under the assumption that for a normal example there would be only one capsule with a norm close to 1, while for an outlier the norms of both capsules would be close to each other. Hence, for normal images the result of Eq. (6) would be close to 1, but for an outlier it would be close to 0.5. In the second metric, Eq. (7), the authors of the previous work11 normalize the MSE by the Euclidean norm of the input, because the MSE depends on the number of non-background pixels in the input and the reconstruction, and the authors sought a metric unaffected by this issue:

$$N_{re}(X, \hat{X}) = \frac{\text{MSE}(X, \hat{X})}{\|X\|}, \qquad (7)$$

where X is an input image and X̂ is the reconstruction computed by the reconstruction subnetwork. We compare the proposed methods with a selected set of previous works6,23 because those works provide clear, simple and accurate approaches that are similar enough to ours, so the design of a comparison study is straightforward.
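The three competing scores can be sketched as follows (the exact normalization in Eq. (7) is an assumption):

```python
import numpy as np

def anomaly_score(y_normal, y_anomaly, x, x_rec):
    """Piciarelli et al.'s anomaly measure, Eq. (5): difference between
    the normal and anomaly capsule norms plus the reconstruction MSE."""
    return y_normal - y_anomaly + np.mean((x - x_rec) ** 2)

def normality_pp(y_norms):
    """Li et al.'s class-probability normality score, Eq. (6):
    the largest capsule output norm."""
    return np.max(y_norms)

def normality_re(x, x_rec):
    """Li et al.'s normalized reconstruction error, Eq. (7); dividing
    the MSE by the Euclidean norm of the input is assumed here."""
    return np.mean((x - x_rec) ** 2) / np.linalg.norm(x)

# A confident prediction yields N_pp near 1; an ambiguous one near 0.5.
assert normality_pp(np.array([0.97, 0.04])) > 0.9
assert normality_pp(np.array([0.55, 0.52])) < 0.6
```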

Results
MNIST-like benchmarks. We measure performance using the AUROC metric. For each dataset and outlier fraction we compute AUROCs for all classes (10 values), then report the average AUROC and its standard deviation. We denote the performance of the capsule network without any additional metrics as "Plain", the non-capsule baselines as NL and ABC, the anomaly score as A, and the normality scores as N_pp and N_re. The results for diverse outliers are shown in Table 2.
The proposed methods, either Palma or Gini, outperform the other metrics and baselines in the 1% and 10% cases for diverse outliers, for both AUROC and average precision (shown in Supplementary Table 1). For CIFAR10, Palma and Gini also perform best in the 0.1% case. This is probably due to the loss of information incurred when computing the norms according to Eqs. (5) and (6). For the 1% KMNIST and 1% CIFAR10 cases, Palma and Gini respectively come second to N_pp. In MNIST and FMNIST at 0.1% though (in AUROC, and for KuzushijiMNIST additionally in average precision), Palma and Gini perform much worse than the anomaly score and both normality scores. Overall, as the number of anomalous examples grows, the performance of the normality measures decreases, the performance of the anomaly measure increases slightly, and the performance of Palma and Gini increases by a large margin. For the diverse inlier settings, the Palma and Gini coefficients outperform almost everything in all cases except CIFAR10 0.1% and 1% and KuzushijiMNIST 0.1%, in which the Gini and Palma coefficients respectively perform second best to N_pp (Table 3 and Supplementary Table 2). As in the diverse outlier settings, the proposition of the previous study12 that a plain capsule network performs poorly for anomaly detection holds. As in the diverse outlier case, the Palma and Gini coefficients are very close to each other.

Malignant skin lesion classification.
This constitutes the first application of supervised anomaly detection with capsule networks to a real-world, non-benchmark dataset. Following in the footsteps of Quinn et al.3, we consider that anomaly detection can facilitate the search for actual biological anomalies: malignant skin lesions. The main conceptual difference from this work3 (apart from using photos instead of transcriptomics data) is that we actually use examples of such anomalies; the detection is not unsupervised.
The results, as Table 4 shows, are rather similar to the results on the MNIST-like benchmarks (Tables 2 and 3). Palma and Gini outperform every other metric by a large margin and provide almost the same performance. For case B, diverse outliers and homogeneous inliers, Gini outperforms Palma, though not by much. The N_pp measure performs close to Palma and Gini, while the rest are far behind.
Analysis of average precision (Supplementary Table 3) for this task also shows a clear superiority of Gini and Palma over the other metrics, closely mirroring the AUROC results, but the difference here is even more pronounced, because every metric except Gini and Palma scores only about 2% more on average than the respective proportion of the positive class (anomaly). Such a result is close to the one we would expect from a degenerate model that outputs 1 regardless of the input (0.5147 ± 0.0483 AUROC, and a whopping 0.0271 ± 0.0007 and 0.0264 ± 0.0044 average precision, respectively). The normalized reconstruction error N_re again performs worse than its N_pp counterpart: 0.6756 ± 0.0372 AUROC and 0.2725 ± 0.0328 average precision versus 0.9147 ± 0.0144 AUROC and 0.304 ± 0.0701 average precision, respectively. The anomaly score and plain capsules give not the worst but middling quality in AUROC, 0.6131 ± 0.0471 and 0.7518 ± 0.0404 respectively, while in average precision plain capsules score below only Gini and Palma (0.4059 ± 0.034). The anomaly score performs similarly in average precision and in AUROC: slightly worse than N_re (0.2535 ± 0.0327). The advantage of the inequality-based measures over the rest is clearly seen.

Discussion
This paper is based on the parallels between economic inequality and the inequality in the coupling coefficients of capsule networks with dynamic routing by agreement. We apply income inequality analysis to the couplings and propose two metrics for capsule network-based anomaly detection. Inequality in the size of the coupling coefficients arises naturally in Capsule Networks: if we consider Capsule Networks an ensemble of weak learners, the secondary capsules are not supposed to treat all primary capsules equally when considering their signals as evidence for the presence of an object of a particular class in the image, so they have to weight them to achieve specialization.
Another way to look at routing in capsule networks is through the lens of Hebbian learning. Routing, especially routing-by-agreement, is conceptually similar to the idea of Hebbian learning25: "neurons that fire together, wire together". In the case of Capsule Networks, connections between primary and secondary capsules that give a large scalar product between weighted output vectors grow larger with training time. Such behavior also leads to larger connections between a well-learned class capsule and its primary counterparts, and to smaller connections otherwise.
This behavior is closely related to the "polarization problem"13: the tendency of Capsule Networks to converge to a single route from primary to class capsules during training. While it can be a problem for usual classification, polarization seems beneficial for anomaly detection: we observe the degree of polarization through the inequality in the couplings, and we can catch it, measure it and use it as an anomaly detection tool.
Our inequality-based approach extends the previously proposed metrics based on differences in class probability, and it shows an advantage over the competing methods, especially on the real-world datasets we considered. We have tested our approach on MNIST-like benchmarks and two complex real-world tasks: malignant skin lesion detection and CRISPR off-target cleavage identification. The experiments show that the proposed methods perform significantly better than the other Capsule Network-based anomaly detection metrics and the baselines in most cases. For rather simple datasets (MNIST-like, except CIFAR10) in the case of high imbalance (0.1% outliers), the economic inequality methods perform below the anomaly and normality measures, but this does not happen with more complicated datasets. We hypothesize that there are two possible reasons for the pronounced superiority of the Palma and Gini coefficients. First, the information averaged away by other methods appears genuinely helpful for solving imbalanced classification. Second, there is an inherent inequality within the computations of the network, formed by the statistical differences between the abundances of the rare and common classes, and this inequality is exploited by the Gini and Palma coefficients, which can be seen as indicators of class rareness.
An obvious direction for future studies would be replication of the current study with models based on different routing mechanisms (OptimCaps routing26, EM routing27, spectral capsules28, attention-based routing29 and others). We predict that the results would be close, since inequality in connection strength is a necessary outcome of a Hebbian-like learning approach (and is typical behavior in ensemble learning). Another direction is further exploration of the nature of this inequality and the invention of other metrics based on its various aspects.

Table 4. AUROCs for HAM10000 with the following setups: A: diverse outliers, diverse inliers; B: diverse outliers, homogeneous inliers; C: homogeneous outliers, homogeneous inliers; D: homogeneous outliers, diverse inliers. The best and the second-best results are in bold.