E pluribus unum interpretable convolutional neural networks

The adoption of convolutional neural network (CNN) models in high-stakes domains is hindered by their inability to meet society's demand for transparency in decision-making. A growing number of methodologies have emerged for developing CNN models that are interpretable by design. However, such models are generally not capable of providing interpretations in accordance with human perception while maintaining competitive performance. In this paper, we tackle these challenges with a novel, general framework for instantiating inherently interpretable CNN models, named E pluribus unum interpretable CNN (EPU-CNN). An EPU-CNN model consists of CNN sub-networks, each of which receives a different representation of an input image expressing a perceptual feature, such as color or texture. The output of an EPU-CNN model consists of the classification prediction and its interpretation, in terms of the relative contributions of perceptual features in different regions of the input image. EPU-CNN models have been extensively evaluated on various publicly available datasets, as well as on a contributed benchmark dataset. Medical datasets are used to demonstrate the applicability of EPU-CNN for risk-sensitive decisions in medicine. The experimental results indicate that EPU-CNN models can achieve a comparable or better classification performance than other CNN architectures while providing human-perceivable interpretations.


Introduction
Recently the commercial applicability of Machine Learning (ML) algorithms has been regulated through legislative acts that aim at making the world 'fit for the digital age', with requirements, safeguards, and restrictions regarding ML and automated decision-making in general [1]. A crucial aspect of the compliance of ML models with these regulations is interpretability. But how is the interpretability of ML models defined? According to the recent literature [2], interpretability refers to a passive characteristic of a model, indicating the degree to which a human understands the cause of its decision. Hence, the provided interpretations of the decision-making process of a model can limit its opaqueness and earn users' trust, e.g., by offering interpretations for risk-sensitive decisions in medicine.
In real-world tasks, the discriminative power of ML models, as expressed by their performance measures, e.g., their predictive accuracy, is regarded as an insufficient descriptor of their decisions [3].
Various approaches have tackled interpretability from a post hoc perspective, i.e., using methods that receive as input a fitted black-box model to determine the causality of its predictions [4]. Post hoc approaches include image perturbation methods applied to the network by masking, substituting features with zero or random counterfactual instances, occlusion, conditional sampling, etc. Such approaches aim at revealing impactful regions in the image that affect the classification result [5,6]. Other post hoc methodologies handle the interpretation problem by constructing simple proxy models, with similar behavior to the original model, and implement the perturbation notion at a feature level [7,8]. This approach limits the credibility of the explanations, since the proxy model only approximates the computations of the black box [9]. Another set of techniques that reduce the complexity of operations to achieve interpretability utilize the gradient that is backpropagated from the output prediction to the input layer. These methods construct saliency maps by visualizing the gradients to present areas that are considered important by the network [10]; solely relying on their explanations, however, can be misleading [11]. In general, these methods aim at interpreting the inference of a deep learning model after its development and training, which can lead to unreliable interpretations [12].
A different approach to interpretability is the development of ML models that are interpretable by design, e.g., decision trees, lists, and sets [13]. Such models are also referred to as inherently interpretable, and usually introduce a trade-off between interpretability and accuracy. The structure of such a model is simpler; thus, its predictive performance may be inferior to that of a more complex black-box model. However, this trade-off might be preferable in high-risk decision-making domains, due to the importance of understanding, validating and trusting ML models [14]. CNNs with feature guiding and self-attention mechanisms embedded in their architecture can also be regarded as inherently interpretable [15]. These mechanisms derive interpretations by visualizing saliency maps and CNN features indicating certain concepts on the input image [16]. However, such models usually do not associate the saliency maps with human-perceivable features, and do not account for the contribution of these salient regions to the result. Other methods quantify the alignment of predefined concepts with learned filters in different layers of a network, or aim towards the disentanglement of features [17]; however, they do not address the direct contribution of the concept representations to the prediction [18]. Also, training such models requires a considerable manual effort for additional annotations with respect to the human-understandable concepts illustrated in each image [19]. Approaches extending regular CNNs to encode object parts in deeper convolutional layers have also been proposed; nevertheless, they usually result in performance degradation [20]. Another approach is to leverage the intelligibility and expressiveness of Generalized Additive Models (GAMs) [21], which are recognized for their interpretability [22]. The interpretation of a GAM is based on observations associating the effect of each input feature with the predicted output. A variety of applications incorporate GAMs into their methodology to leverage their expressiveness in domains such as healthcare [23]. GAMs based on Multilayer Perceptrons (MLPs) [24] were recently proposed for interpretable data classification and regression; however, these particular models are not tailored for contemporary, CNN-based, computer vision tasks.
State-of-the-art interpretable CNN models usually exploit the information deriving from saliency maps, indicating image regions on which the model focuses its attention; however, it is not apparent how these regions contribute to the predictions. A recent relevant methodology incorporates interpretable components into a CNN model to explain its predictions [25]; nevertheless, the provided interpretations are intertwined with predefined edge kernels, and the selection of the color components does not consider any aspects of human perception. In general, there is a lack of methodologies that could explain the classification of an image based on perceptual features, i.e., features such as color and texture, described in a way that can be easily perceived and interpreted by humans [26].
In this paper, we propose a novel framework for the construction of inherently interpretable CNN models for computer vision tasks, motivated by the need for perceptual interpretation of image classification. The proposed framework is named E Pluribus Unum Interpretable CNN (EPU-CNN), after the Latin expression meaning "out of many, one". A major advantage of the proposed framework is that it is generic, in the sense that it can be used to render conventional CNN models interpretable. Given a base CNN architecture, an EPU-CNN model can be constructed as an ensemble of base CNN sub-networks, by following the GAM approach. The EPU-CNN framework requires that each sub-network of the model receives one of a set of orthogonal (complementary) perceptual feature representations of the same input image. EPU-CNN is therefore scalable, as it can accommodate an arbitrary number of parallel sub-networks corresponding to different perceptual features. The sub-networks are jointly trained, working as one, to automatically generate interpretable class predictions. An EPU-CNN model associates the perceptual features with salient regions, as computed by the different sub-networks, and it explains a classification outcome by indicating the relative contribution of each feature to this outcome.
To the best of our knowledge, EPU-CNN is the first framework based on GAMs for the construction of interpretable CNN ensembles, regardless of the base CNN architecture used and the application domain. Unlike current ensembles, the models constructed by EPU-CNN enable interpretable classification based both on perceptual features and on their spatial expression within an image, thus enabling a more thorough and intuitive interpretation of the classification results. Notably, ensembling shallower CNN architectures can be more efficient than training a single large model [27]. Furthermore, unlike previous interpretable CNN models [20,28], the classification performance of EPU-CNN models is comparable to or higher than that of their non-interpretable counterpart, which in the case of EPU-CNN is the base CNN model. This is demonstrated with an extensive experimental evaluation on various biomedical datasets, including datasets from gastrointestinal endoscopy and dermatology, as well as on a novel contributed benchmark dataset, inspired by relevant research in cognitive science [29].

Methodology
As a framework, EPU-CNN follows the GAM approach for the construction of interpretable image classification models. GAMs represent a class of models extending linear regression models by using a sum of unknown smooth functions f_i, i = 1, 2, …, N. A GAM is formally expressed as follows:

g(E[y]) = β_0 + f_1(x_1) + f_2(x_2) + … + f_N(x_N)

where g(•) is a link function, β_0 is a bias term, and each smooth function f_i models the effect of the input feature x_i on the predicted output y. In EPU-CNN, each f_i is implemented by a CNN sub-network C_i(I_i; η_i), with trainable parameters η_i, which receives a Perceptual Feature Map (PFM) I_i of the input image I. The scalar output of each sub-network, referred to as a Relative Similarity Score (RSS), quantifies the contribution of the respective perceptual feature to the classification result, and a Perceptual Relevance Map (PRM), S_i, localizes this contribution, based on the spatial arrangement of the observed features within the input image. The details about the PFMs considered in this study, the formulation of the classification model, and its interpretable output, are described in the following paragraphs.
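The additive structure above can be sketched in a few lines of Python. The two stand-in functions below are hypothetical linear placeholders for trained CNN sub-networks, and a logistic (inverse-logit) link is assumed, matching the binary classification setting considered in this paper.

```python
import math

def sigmoid(z):
    """Inverse of the logit link: maps a real-valued sum to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Stand-ins for the CNN sub-networks C_i: each maps its perceptual
# feature representation to a real-valued contribution (an RSS).
# These linear forms are hypothetical, for illustration only.
def f_color(x):
    return 1.2 * x - 0.3

def f_texture(x):
    return -0.8 * x + 0.1

def gam_predict(features, bias=0.0):
    """Additive model: g^{-1}(beta_0 + sum_i f_i(x_i))."""
    total = bias + sum(f(x) for f, x in features)
    return sigmoid(total)

p = gam_predict([(f_color, 0.7), (f_texture, 0.2)])
```

Each term f_i(x_i) is inspectable on its own, which is precisely what makes the additive structure interpretable.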

Opponent Perceptual Feature Maps
In this study the generation of PFMs is motivated by the theory of human perception of color vision proposed by Hering in the 1800s, and the opponent-process theory proposed in the 1950s by Hurvich and Jameson [31]. The behavior of a cell in the retina of the human visual system is determined by a pattern of photoreceptors, which comprises a receptive field. Receptive fields have a center-surround organization, which causes the cell to exhibit spatial antagonism, e.g., a cell that is excited by a light stimulus in the center of its receptive field will be inhibited by a light stimulus in the annulus surrounding the excitatory center. There are different types of photoreceptors, with different sensitivities to light frequency and intensity, responding differently to chromatic and luminance variations. Depending on the type of the photoreceptors, receptive fields can be color-opponent, or spatially-opponent without being color-opponent [32].
Studies have provided indications that the stimuli transmitted to the retina can be decomposed into independent luminance and chromatic-opponent sources of information, and that the chromatic and luminance information of an image are processed through separate pathways by the human visual system [33,34]. Also, computer vision experiments have indicated that encoding the chromatic and luminance components separately is a more effective approach for image recognition [34,35].
Motivated by these studies, the proposed framework considers an opponent representation of the input images, focusing on both color and texture, which are two decisive properties for image understanding [26]. Also, color and texture provide cues enabling inferences about the shapes of objects and surfaces present in the image. Opponent color spaces have been proposed to cope with drawbacks of the RGB color space, such as the high correlation between the R, G and B color components, and its incompatibility with human perception. Representative examples include Ohta's color space, which is obtained as a linear transformation of RGB and has been proposed in the context of color image segmentation, and CIE-Lab, which is obtained as a non-linear transformation of RGB and has been proposed as a device-independent, perceptually uniform color space (i.e., a color space where a given numerical change corresponds to a similar perceived change in color) [36]. Considering the effectiveness of CIE-Lab in numerous applications in computer vision, especially in biomedicine [37], in this study CIE-Lab is considered as a basis for the derivation of three PFMs, corresponding to its components. All the components of CIE-Lab are approximately orthogonal.
Components a and b encode two antagonistic colors that cannot be perceived together simultaneously, e.g., there is no "reddish-green" or "bluish-yellow" color. Specifically, component a expresses the antagonism between green and red hues (redness is expressed for a > 0, and greenness for a < 0), and component b expresses the antagonism between blue and yellow hues (yellowness is expressed for b > 0, and blueness for b < 0). The L component represents the perceptual lightness, which expresses an antagonism in luminance, between light and dark. This component, which is practically a greyscale representation of the RGB image, is usually characterized by the highest variance, as it concentrates rich information about the texture of the image contents [34].
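As a concrete illustration, the per-pixel sRGB-to-CIE-Lab conversion below (standard CIE formulas, D65 white point) yields exactly the three opponent components described above; in practice a vectorized library routine such as scikit-image's rgb2lab would be used to build whole PFMs.

```python
def srgb_to_lab(r, g, b):
    """Convert one sRGB pixel (components in [0, 1]) to CIE-Lab (D65)."""
    def lin(c):  # undo the sRGB gamma
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = lin(r), lin(g), lin(b)
    # linear RGB -> XYZ (sRGB primaries, D65)
    x = 0.4124 * r + 0.3576 * g + 0.1805 * b
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    z = 0.0193 * r + 0.1192 * g + 0.9505 * b
    # normalize by the D65 white point
    x, y, z = x / 0.95047, y / 1.0, z / 1.08883
    def f(t):
        return t ** (1.0 / 3.0) if t > 0.008856 else 7.787 * t + 16.0 / 116.0
    fx, fy, fz = f(x), f(y), f(z)
    L = 116.0 * fy - 16.0        # light-dark (perceptual lightness)
    a_op = 500.0 * (fx - fy)     # green (a < 0) vs. red (a > 0)
    b_op = 200.0 * (fy - fz)     # blue (b < 0) vs. yellow (b > 0)
    return L, a_op, b_op
```

For example, a pure red pixel maps to a strongly positive a (redness) and positive b (yellowness), whereas a pure green pixel yields a negative a.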
In the field of computer vision, several studies have been based on spatial frequency representations of images to effectively model texture for machine perception [38]. Aiming at the interpretation of the classification outcomes based on perceptual texture characteristics, the L component is further analyzed with respect to its spatial frequency.
The human eye has the capacity to focus on the right range of spatial frequencies to capture the relevant details of images; thus, visual perception treats images at different levels of resolution. At lower resolutions, these details correspond to larger physical structures in a scene, whereas at higher resolutions the details correspond to smaller structures. The concept of multiresolution image representation can be modeled by the 2D Discrete Wavelet Transform (DWT) [39]. This representation is computed by decomposing the original image using a wavelet orthonormal basis.
The computation of the 2D DWT is performed efficiently using the à trous algorithm, which is based on convolutions of the image with a pair of low- and high-pass filters, called Quadrature Mirror Filters (QMFs), and dyadic downsampling of the filtered outputs.
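A minimal sketch of one decomposition level illustrates the separable QMF filtering and dyadic downsampling; the Haar pair is assumed here as the simplest QMF example, since the specific wavelet used is not restated at this point.

```python
def haar_dwt2(img):
    """Single-level 2D DWT with the Haar QMF pair:
    low-pass h = [1, 1]/sqrt(2), high-pass g = [1, -1]/sqrt(2),
    applied separably with dyadic (factor-2) downsampling.
    img is a list of equal-length rows with even dimensions."""
    s = 2.0 ** 0.5
    # filter and downsample along rows
    lo = [[(row[2 * i] + row[2 * i + 1]) / s for i in range(len(row) // 2)]
          for row in img]
    hi = [[(row[2 * i] - row[2 * i + 1]) / s for i in range(len(row) // 2)]
          for row in img]
    # then filter and downsample along columns
    def cols(mat, sign):
        return [[(mat[2 * j][i] + sign * mat[2 * j + 1][i]) / s
                 for i in range(len(mat[0]))]
                for j in range(len(mat) // 2)]
    LL = cols(lo, +1)   # approximation (coarse) sub-band
    LH = cols(lo, -1)   # detail sub-band
    HL = cols(hi, +1)   # detail sub-band
    HH = cols(hi, -1)   # diagonal detail sub-band
    return LL, LH, HL, HH
```

On a constant image, all detail sub-bands vanish and only the approximation sub-band carries energy, which is the expected behavior of an orthonormal wavelet decomposition.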

Classification Model
Given that EPU-CNN performs binary classification, the expected output is a probability in (0, 1). Accordingly, the logit(•) function, defined as:

logit(p) = ln(p / (1 − p))

can be used as a suitable link function, g(•). Considering that the inverse of logit is the log-sigmoid function σ, the GAM formulation can be rewritten as:

ŷ = σ(β_0 + Σ_i C_i(I_i; η_i))

where ŷ is the predicted class probability and C_i(I_i; η_i) is the output of the i-th sub-network for the PFM I_i.
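The link function and its inverse can be checked numerically; this small sketch only restates the two definitions above, with hypothetical RSS values standing in for sub-network outputs.

```python
import math

def logit(p):
    """Link function g(p) = ln(p / (1 - p)), mapping (0, 1) onto the reals."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Log-sigmoid, the inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))

# RSSs are combined additively on the logit scale, so the class
# probability is recovered by applying the sigmoid to their sum.
rss = [0.8, -0.2, 0.1, 0.4]   # hypothetical RSS values, one per PFM
y_hat = sigmoid(sum(rss))
```

Because logit and sigmoid are mutual inverses, the model can be trained on the unbounded real line while its output is always read as a probability.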

Interpretable Output
Given an input image, EPU-CNN provides three outputs, as illustrated in Fig. 1, namely, a) the predicted class, b) the RSSs, quantifying the contribution of each perceptual feature to the prediction, and c) the PRMs S_i, highlighting the image regions on which this prediction is based. Each S_i is constructed from the feature maps of a convolutional layer l of C_i, since such a layer is intertwined with its capacity to highlight regions that contribute to the derivation of C_i(I_i; η_i). The deeper the layer l that the feature maps are extracted from, the more approximate the correspondence between these feature maps and the input image I; thus, a middle layer l of C_i is considered for the construction of each S_i [42] (Section 3.4).
To quantify the amount of information that each feature map of layer l encodes, we compute its Shannon Entropy (SE) score. Then, the most informative half of the feature maps, i.e., those that correspond to the highest entropy scores, are aggregated to construct S_i. The aggregation is performed by averaging these feature maps, which results in the initial S_i estimation. S_i is then further refined by applying a thresholding method that maximizes the entropic correlation between the foreground and background of S_i, for maximum information transfer [43]. The entropy-based thresholding operation is performed to exclude values associated with lower saliency and communicate to the user the most informative regions. Examples of the S_i generated for different input images I are illustrated in Fig. 3.
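The selection and averaging steps can be sketched as follows (pure Python, with a histogram-based entropy estimate and a hypothetical bin count; the subsequent refinement by entropic-correlation thresholding [43] is omitted for brevity):

```python
import math

def shannon_entropy(fmap, bins=16):
    """Shannon entropy of a feature map's value histogram.
    fmap is a 2D list; the bin count is a hypothetical choice."""
    vals = [v for row in fmap for v in row]
    lo, hi = min(vals), max(vals)
    if hi == lo:          # constant map carries no information
        return 0.0
    hist = [0] * bins
    for v in vals:
        k = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        hist[k] += 1
    n = float(len(vals))
    return -sum((c / n) * math.log(c / n) for c in hist if c)

def build_prm(fmaps):
    """Initial PRM estimate: average the most informative half
    of the feature maps, ranked by their entropy scores."""
    ranked = sorted(fmaps, key=shannon_entropy, reverse=True)
    top = ranked[: max(1, len(ranked) // 2)]
    h, w = len(top[0]), len(top[0][0])
    return [[sum(m[y][x] for m in top) / len(top) for x in range(w)]
            for y in range(h)]
```

A constant (zero-entropy) map is ranked last and therefore excluded from the aggregation when more informative maps are available.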

Datasets
EPU-CNN was trained and evaluated on six different datasets. Initially, a dataset specifically created for the evaluation of the interpretability capabilities of EPU-CNN was considered. The purpose of using this dataset was to demonstrate the capabilities of EPU-CNN with clear, simple, and perceptually meaningful examples. Considering biomedicine as a critical application area for explainable and interpretable artificial intelligence (AI), four well-known biomedical benchmark datasets, consisting of endoscopic and dermoscopic images, were used for further evaluation. Furthermore, to demonstrate the generality of the proposed approach, a well-known benchmark dataset for real image classification was considered.
Interpretability Dataset: For the purposes of this study, a novel dataset was constructed, named Banapple. The dataset consists of images of bananas and apples. It was created by collecting images, under the Creative Commons license, from Flickr. The images illustrate bananas and apples with variations regarding their color, placement, size, and background. The motivation for the construction of this dataset stems from studies in cognitive science, where human perception is investigated using examples with discrete properties of bananas and apples [29]. The experiments performed aim to demonstrate that EPU-CNN is capable of capturing the discriminative characteristics of bananas and apples by the perceptual features it incorporates, i.e., apples have a circular shape and usually a red color, whereas bananas have a bow-like shape and usually a yellow color. In addition, samples that deviate from the average appearance of these objects can provide insights regarding the reliability of the interpretation of the model.
Endoscopic Datasets: Publicly available datasets of endoscopic images were considered for the evaluation process, namely, KID [44], Kvasir [45] and a dataset that was part of the MICCAI 2015 EndoVis challenge [46].
Dermoscopic Dataset: Dermoscopic images from the ISIC 2019 dataset were used to define three binary classification tasks. Task b) discriminates abnormal and normal (melanocytic nevus) skin lesions, whereas task c) discriminates two abnormal categories of different incidence and survival rates, i.e., melanomas have higher mortality rates than carcinomas [47]. Task a) comprised 9000 images, whereas tasks b) and c) comprised 8200 and 8500 images, respectively.

CIFAR-10:
To demonstrate the generality of the proposed framework, EPU-CNN was further validated on the longstanding benchmark dataset CIFAR-10. CIFAR-10 consists of 60,000 color images of natural objects belonging to 10 different classes. The dataset is split into 50,000 training and 10,000 test images, and each class comprises 6,000 images with a size of 32 × 32.

Classification Performance Assessment
For the comparison of the classification performance of EPU-CNN, we selected three well-established CNN models, namely, VGG16 [48], ResNet50 [49] and DenseNet169 [50], and an inherently interpretable CNN model abbreviated as TT [51]. In this study, VGG16 was used as a base for the TT model. The same training parameters, i.e., batch size, optimization algorithm and data augmentation, were applied to all networks involved in the evaluation process. The custom base architecture comprises three convolutional blocks in total, followed by an FCNN. The first two convolutional blocks are identical and include two convolutional layers followed by a max-pooling and a batch normalization layer. The convolutional layers of these blocks have a depth size of 64 and 128, respectively. The following convolutional block consists of three convolutional layers with a depth size of 256, followed by a max-pooling and a batch normalization. Variants of this base architecture, as well as VGG16, ResNet50 and DenseNet169, were used for the construction of EPU I, EPU II, EPU VGG, EPU ResNet and EPU DenseNet, respectively. The evaluation followed a 10-fold cross-validation procedure, reporting the average Area Under the receiver operating Characteristic curve (AUC) score among all folds. The AUC was selected as an overall summary measure of binary classification performance, which, unlike accuracy, is relatively robust for datasets with imbalanced class distributions [52]. The performance of all models is summarized in Table 1.
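For reference, the AUC used here can be computed directly from its probabilistic definition, i.e., the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. This is a minimal O(n²) sketch, not the implementation used in the evaluation:

```python
def auc(scores, labels):
    """AUC as the probability that a positive sample is ranked above
    a negative one; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Perfect separation yields 1.0, perfectly inverted scores yield 0.0, and uninformative (constant) scores yield 0.5, which is why the AUC is robust to class imbalance: it depends only on the ranking of positives against negatives, not on their proportions.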

[Table: trainable parameters (in millions) of the compared models.]
The EPU-CNN models built on simpler base architectures involve fewer trainable parameters; however, they provide higher classification performance when compared to more complex EPU-CNN models. Furthermore, the less complex EPU II has comparable or even better classification performance when compared to EPU I, while it substantially outperforms ResNet50, which is a more computationally demanding base model (~27M parameters). On the other hand, the inherently interpretable (TT) model, which has a complexity similar to that of EPU II, i.e., ~20M parameters, provides the lowest overall classification performance amongst all models.

Quantitative Interpretability Analysis
To quantitatively evaluate the interpretability of the proposed framework we exploited the properties of the Banapple benchmark dataset. Banapple is suitable for this purpose because our perception of the class-related objects is directly associated with the way we categorize them, based on their visual attributes regarding color and shape [29]. Therefore, the subsequent task of annotating the images of Banapple did not require any domain-specific knowledge. Images of bananas and apples have distinguishable characteristics with respect to all PFMs utilized by the EPU-CNN models, i.e., light-dark, coarse-fine, blue-yellow and green-red. Thus, in the case of a correct class prediction, ideally, all RSSs should trend towards the same direction, as indicated by the sign of an RSS, e.g., all RSSs for an apple image should be positive, whereas for a banana image they should be negative. Hence, given that the EPU-CNN models in this study use four PFMs, a ground truth, t_I, and a predicted, p_I, interpretability label are expressed as follows:

t_I = (y, y, y, y),  p_I = (p_1, p_2, p_3, p_4), with p_i = 1 if sign(C_i(I_i; η_i)) > 0 and p_i = 0 otherwise,

where y is the ground truth class label of an image I, 1 and 0 denote the apple and banana class respectively, and sign(C_i(I_i; η_i)) returns the sign of an RSS. Given a set of ground truth and predicted interpretability label pairs, the interpretability accuracy a_int is calculated as the average Jaccard Index [53], J(•), among them. In addition, PRMs were computed, from feature maps of different layers, for correctly classified predictions from the Banapple, Kvasir and ISIC2019 datasets. As can be observed, the regions identified as meaningful regarding each PFM are approximately consistent with each other, regardless of the degree of abstraction that each set of feature maps encodes. However, the feature maps estimated by the intermediate 5th layer provide less noisy PRMs, which highlight with more precision the areas of the input image that are estimated to be meaningful with respect to each PFM.
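Under one plausible reading of the labels defined above (each PFM votes for a class through the sign of its RSS, and the Jaccard index compares the set of PFMs expected to agree with those that actually do), the interpretability accuracy a_int can be sketched as:

```python
def interpretability_accuracy(samples):
    """samples: list of (y, rss) pairs, where y in {0, 1} is the class
    label (1: apple, 0: banana) and rss is the list of RSS values, one
    per PFM.  A positive RSS votes for class 1, a non-positive one for
    class 0.  This set-based reading of the Jaccard index is an
    interpretation of the paper's definition, not a verbatim copy."""
    total = 0.0
    for y, rss in samples:
        expected = set(range(len(rss)))        # ideally, all PFMs agree
        agreeing = {i for i, s in enumerate(rss)
                    if (s > 0) == (y == 1)}    # PFMs voting for the true class
        total += len(expected & agreeing) / len(expected | agreeing)
    return total / len(samples)
```

An image whose four RSSs all carry the correct sign contributes 1.0; an image with one disagreeing PFM contributes 3/4.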

Qualitative Interpretability Analysis
Banapple: The global bar-charts illustrated in Fig. 6(a) indicate that all the perceptual features contribute to the classification of the images. This result is in accordance with our perceptual understanding [29], since apples and bananas are discriminated with respect to all PFMs considered in this study. It can be observed that the negative green-red and coarse-fine scores have corresponding PRMs that do not focus on the class-related object, i.e., they highlight regions of the hand and the background. Also, the greenish color of the bananas in Fig. 7(b)-D can be descriptive for both classes (as both bananas and apples can be green), which is also expressed by the disagreement between the relative scores of the color PFMs. Interestingly, the disagreement between the light-dark and coarse-fine scores can also be justified by the highlighted regions in the respective PRMs, i.e., the outline of the banana object in S coarse-fine and the circular region, resembling an apple, in S light-dark.
To further investigate the behavior of EPU-CNN, we have chosen an image depicting an apple (Fig. 7(c)-A), which was digitally processed to obtain 3 variations: a) a bitten apple (Fig. 7(c)-B); b) an apple with a shape resembling that of a banana (Fig. 7(c)-C); and c) an apple resembling both the shape and color of a banana while maintaining a reddish region (Fig. 7(c)-D). The interpretation changes that can be observed include the following: a) When the shape resembles a bitten apple, the coarse-fine PFM is still guiding the prediction towards the apple class but with greater uncertainty (Fig. 7(c)-B), whereas the S coarse-fine discriminates the image based on its textural variations, i.e., the curvature of the left side of the apple.
b) When the shape resembles a banana, the coarse-fine PFM strongly suggests that the image belongs to the banana class (Fig. 7(c)-C, (c)-D).The magnitude of light-dark RSS has also changed, but still trends towards the apple class.
The S light-dark and S coarse-fine appear to contribute to the segregation of the depicted object; however, only the RSS of the coarse-fine PFM suggests the opposite class, indicating that it is more sensitive to shape variations. c) When the yellow region is added, the light-dark and blue-yellow PFMs guide the prediction to the banana class.
However, the green-red RSS trends towards the apple class. As expected, S blue-yellow and S green-red focus on the yellow and red segments of the object, respectively (Fig. 7(c)-D). This justifies the trend of each PFM towards either class, i.e., yellow and red are representative colors of the banana and apple class, respectively.
These interpretations reveal that the coarse-fine PFM enables the respective sub-network to respond to different shape variations and infer relevant decisions. In addition, the color-related PFMs, i.e., blue-yellow and green-red, are very sensitive to the class-related colors, and this is clearly reflected both in the respective PRMs and RSSs. When both of the class-related colors, i.e., yellow and red, co-occur in the image, the blue-yellow and green-red PFMs guide the prediction towards the banana and apple class, respectively.

Endoscopic Datasets:
The experiments showed that EPU II tends to discriminate normal from abnormal images of the endoscopic datasets mainly based on the blue-yellow and green-red PFMs. This is illustrated in the respective global bar-charts (Fig. 6(b-d)). As can be observed, the light-dark and coarse-fine PFMs are biased towards a specific class in all endoscopic datasets. On the other hand, the chromatic PFMs are the main contributors to the correct classification predictions. This finding is in accordance with the literature, since color has been shown to have a leading role in the detection of abnormalities in the gastrointestinal tract [37].
An example of local bar-charts visualizing the prediction interpretations of EPU-CNN on endoscopic images is presented in Fig. 8(a, b). Since the light-dark and coarse-fine features are not informative, only the color-related PFMs were considered. Figures 8(a, b) illustrate correctly classified endoscopic images of both the normal and the abnormal class. In the case of the image depicting an abnormality (Fig. 8(a)), S green-red indicates that the sub-network corresponding to the green-red PFM focuses on the abnormality, i.e., blood, whereas S blue-yellow focuses on normal tissue and only partially on the abnormal region.
To assess the behavior of EPU II in the endoscopic datasets in a more controlled way, we proceeded to digitally replace the abnormal region with normal tissue. After this modification, the RSS of the green-red PFM shifts from trending towards the abnormal class (red) to the normal class (green). Furthermore, S green-red, in the absence of an abnormality, indicates that the respective sub-network focuses on normal tissue.
Dermoscopic Datasets: The evaluation of the interpretability of EPU II on the dermoscopic datasets revealed that all the PFMs participate actively in the classification process, with the exception of the green-red PFM, which appears biased towards either the Melanoma or the Carcinoma class in all trials (Fig. 6(e-g)). This is an indication that the green-red PFM is not informative to the network regarding the classification of dermoscopic images. Furthermore, as can be observed in Fig. 6(e-g), the classification process of EPU II relies on both chromatic and textural cues (i.e., blue-yellow, light-dark and coarse-fine), which are also considered by the ABCD rule of skin lesion classification to assess the malignancy of a lesion [54].
An example of local bar-charts of classified dermoscopic images is illustrated in Fig. 8(c, d). The local bar-chart includes the most informative PFMs, i.e., light-dark, coarse-fine and blue-yellow. In Fig. 8(c, d) all RSSs are trending, correctly, towards the abnormal (carcinoma, red) and normal (nevus, green) class, respectively. Similarly to the other datasets, we proceeded to digitally modify the image of Fig. 8(g), which illustrates a melanoma, to resemble a nevus. The modification was implemented according to the rule-based diagnostic criteria expressed by the ABCD rule [54]; in detail, we removed the part of the lesion that introduced color variation on the same mole and obtained a more symmetrical shape. The qualitative results of this process are illustrated in Fig. 8(h), where it is shown that after the modification all the RSSs trend towards the nevus class (green). Furthermore, the S blue-yellow PRM, in the absence of an abnormal region, indicates that the respective sub-network does not focus on the skin lesion. The S light-dark and S coarse-fine PRMs seem to maintain a similar behavior with the unmodified image.

Comparison with State-Of-The-Art Interpretable Methods
Even though there is an increasing research interest regarding the interpretation of CNNs, there is still no standard procedure to evaluate and compare interpretable outputs. Nevertheless, a qualitative comparison can reveal strengths and weaknesses of such methods. In this study, the interpretations that EPU-CNN provides are qualitatively compared to seven methodologies that have been proposed to interpret CNNs and have been widely used in the literature. These methods provide saliency maps indicating regions or points of the input image that are estimated to be crucial for a prediction inferred by a CNN.
In detail, six post hoc methodologies, namely, Grad-CAM [55], LIME [8], XRAI [56], Shapley Additive exPlanations (SHAP) [57], SmoothGrad [58] and Vanilla Gradients [59], as well as one inherently interpretable model [51] (TT), were utilized in this evaluation. The post hoc methodologies were applied to the CNN models that achieved the highest performance on each dataset according to Table 1, whereas TT was trained on each dataset from scratch. All the methods provide interpretations in the form of saliency maps, while TT can also provide bounding boxes that discretely specify the estimated region of interest. These methods were selected since they can render CNN models interpretable without the need for training on datasets specifically annotated for interpretable learning, e.g., with annotations regarding the concepts that are depicted on images. Figure 9 summarizes the interpretations provided by each method on exemplary images that are presented in Figs. 7 and 8. All the images have been correctly classified by the respective models that were used. Among the compared methods, only XRAI and SHAP were successful at highlighting regions of interest on the images that can be regarded as crucial for classification, i.e., areas of the apple, the skin lesion and the blood depicted in the endoscopic image. The gradient-based interpretation approaches, i.e., Grad-CAM, SmoothGrad and Vanilla Gradients, also revealed that the respective CNN models focus on image regions that can be regarded as meaningful; nevertheless, the fuzziness of their visualization makes their interpretations difficult to comprehend. On the other hand, EPU-CNN can provide different visualizations that highlight the most relevant regions with respect to each PFM, as estimated by the layer activations of each sub-network. This can also be expressed quantitatively, since the RSSs indicate the degree to which each highlighted region affects the classification result.
Furthermore, the dataset-wide plots that can be constructed with an EPU-CNN model give insights into which PFMs are important for classifying the images of a particular dataset. To the best of our knowledge, no other interpretation approach can incorporate all this information into its explanations while remaining applicable to non-specialized datasets, i.e., datasets where each image is annotated only with respect to its class membership.
To demonstrate the applicability of the EPU-CNN framework, the best EPU-CNN model in terms of classification [...] all the PRMs focus on the body of the airplane and on the wheels, whereas no PRM focuses on the wings. Therefore, the respective RSSs of light-dark and coarse-fine drive the prediction, with a higher magnitude, towards the truck class.
Figure 11 presents a comparison in terms of classification accuracy between EPUII (orange bar) and other state-of-the-art CNN models [50,49,48,60,61] (gray bars) on the CIFAR-10 dataset. EPUII achieved an accuracy of 93.31%, which is comparable to or better than that of the other models considered. A major advantage over the other models, however, is that the EPU-CNN model can also provide interpretations of the classification outcome.

Conclusions and Future Work
In this study we propose a novel, generalized framework, called EPU-CNN, that provides a guideline for the development of interpretable CNN models, inspired by GAMs. A model designed according to the EPU-CNN framework consists of an ensemble of sub-networks with a base CNN architecture that is trained as one. The proposed framework can be used to render a conventional CNN model interpretable by using it as a base model. Each sub-network receives as input a different PFM of an input image, chosen according to the literature of cognitive science and human perception. EPU-CNN is designed in such a way as to enable human-friendly interpretations of its classification results based on the utilized perceptual features. The interpretations provided by EPU-CNN take the form of RSSs that quantify the resemblance of a perceptual feature to a respective class. These interpretations are complemented by PRMs indicating the image regions on which the network focuses to infer its interpretable decisions. Furthermore, EPU-CNN provides a spatial expression of an explanation on the input image. The most important conclusions of this study can be summarized as follows:
- EPU-CNN models satisfy the need for interpretable models based on human perception, i.e., the proposed framework is able to provide interpretations in accordance with human perception and cognitive science; e.g., EPU-CNN classifies endoscopic images based on chromatic perceptual features.
- Unlike other inherently interpretable CNN methodologies [20], [28], the classification performance of EPU-CNN models is not affected by their capacity to provide interpretations. In fact, the results obtained from the comparison of EPU-CNN models with respective non-interpretable CNN models show that their performance is better than, or at least comparable to, that of the non-interpretable models.
- When an image is modified with respect to a perceptual feature, e.g., color, the interpretations derived from the EPU-CNN model change accordingly, both on natural and biomedical images (Figs. 6, 7).
- Since EPU-CNN is a generalized framework, it provides a template for the development of interpretable CNNs that fulfill the requirements imposed by current legislation regarding the commercial applicability of ML models.
An aspect that can be considered a limitation of the proposed framework is the manual selection of PFMs. The selection process, however, enables the user to leverage specific PFMs that are relevant to a particular application and, as a result, to acquire meaningful interpretations and insights into the internal process of an EPU-CNN model. For example, in the case of endoscopic images, the EPU-CNN models considered only the color PFMs as most important, which is in accordance with the respective literature. Most inherently interpretable models proposed in the literature can only be applied to datasets that are further annotated with respect to human-understandable concepts illustrated in each image, which limits their applicability [18,19].
The PFM selection of an EPU-CNN model can be considered a less demanding and time-consuming procedure than the annotation of huge datasets with human-understandable concepts. Additionally, since the selection of the textural and color perceptual features, based on the 2D DWT and CIE-Lab respectively, is empirical, as future work we intend to automate the PFM selection in a direction that minimizes human intervention and is more compatible with the principles of deep learning.
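The PFM construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: the opponent color channels are approximated directly from RGB (whereas the paper uses the CIE-Lab space), and a single-level Haar decomposition stands in for the 2D DWT used for the coarse-fine (texture) feature.

```python
import numpy as np

def opponent_color_pfms(rgb):
    """Split an RGB image (H x W x 3, floats in [0, 1]) into opponent
    color maps: light-dark, green-red and blue-yellow.
    A rough stand-in for the CIE-Lab channels the paper uses."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    light_dark = (r + g + b) / 3.0          # achromatic (intensity) channel
    green_red = r - g                        # red-green opponency
    blue_yellow = b - (r + g) / 2.0          # blue-yellow opponency
    return light_dark, green_red, blue_yellow

def haar_coarse_fine(gray):
    """One level of a 2-D Haar decomposition, standing in for the 2-D DWT
    used to obtain the coarse-fine (texture) PFM. Assumes even H and W."""
    a = (gray[0::2, :] + gray[1::2, :]) / 2.0   # row averages
    d = (gray[0::2, :] - gray[1::2, :]) / 2.0   # row details
    coarse = (a[:, 0::2] + a[:, 1::2]) / 2.0    # LL band: coarse approximation
    fine = (d[:, 0::2] - d[:, 1::2]) / 2.0      # HH band: fine detail
    return coarse, fine
```

The resulting maps would then be stacked into the N×H×W input tensor of the ensemble, one map per sub-network.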
x = (x1, x2, …, xN)T, x ∈ ℝN, denotes an input feature vector, g(·) is a link function (e.g., logit), β is a bias term and E[Y | x] denotes the expected value of the response variable Y given an input x. Each fi(·) represents a univariate smooth function, fi: ℝ → ℝ, mapping each xi ∈ ℝ to a latent representation, fi(xi), through which xi participates in the result. This structure is easily interpretable because it enables the user to explore how each input variable xi affects the predicted output. The EPU-CNN framework considers Eq. (1) as a template for constructing an interpretable ensemble of CNNs from a conventional, non-interpretable CNN base model (Fig. 1). The sub-networks of the ensemble are arranged in parallel, and each sub-network has the same architecture as the base model. Each sub-network receives a different input, which should be a perceptual feature representation of an input image. This representation will be referred to as a Perceptual Feature Map (PFM) of the input image; it can be obtained by an image transformation revealing a physical property of choice that can be easily perceived and interpreted by humans over the input image space, e.g., color and texture [26]. The number of sub-networks is determined by the number of different PFMs required to render a CNN interpretable for a particular application. Considering that each sub-network of an ensemble with a parallel topology should receive inputs with complementary information [30], the PFMs should be orthogonal. Let us consider N different PFMs Ii, i = 1, 2, …, N, of an input image I. Each Ii is provided as input to a corresponding sub-network Ci(·; ηi), which is parametrized by ηi and trained jointly with the rest of the sub-networks. Hence, the input of an EPU-CNN model is a tensor I = (I1, I2, …, IN) with dimensions N×H×W, where N, H, and W denote the number, height, and width of the PFMs, respectively. Each sub-network provides a univariate output Ci(Ii; ηi). The output of
the EPU-CNN ensemble is computed by summing all Ci(Ii; ηi), i = 1, 2, …, N. The output of each Ci(Ii; ηi) can be regarded as a Relative Similarity Score (RSS), quantifying the resemblance of image I to a class with respect to Ii. Considering a binary classification problem, the RSS takes values within the range [-1, 1]. It represents the degree of similarity of an input image to a particular class with respect to a particular PFM Ii. An absolute RSS value closer to 1 implies a greater similarity, whereas the positive or negative sign of the RSS associates the similarity with one class or the other. By visualizing these scores, it becomes easier for a human to understand how each Ii affects a classification result of the EPU-CNN model. Furthermore, by examining the layer activations of Ci(Ii; ηi), the scores can be associated with respective image regions, thus enabling a deeper interpretation of the classification.
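The GAM-style aggregation of the sub-network outputs can be sketched as follows; the function name and the sigmoid squashing to [0, 1] are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def epu_forward(rss_scores, beta=0.0):
    """Combine per-sub-network Relative Similarity Scores (each in [-1, 1])
    into a final class probability, GAM-style: sum the univariate
    contributions, add a bias term, then squash to [0, 1]."""
    total = beta + np.sum(rss_scores)      # lies in [beta - N, beta + N]
    prob = 1.0 / (1.0 + np.exp(-total))    # sigmoid maps to (0, 1)
    return total, prob

# Example: four PFMs all pull the prediction towards the positive class.
total, prob = epu_forward([0.8, 0.4, 0.9, 0.6])
```

Because each RSS stays inspectable before the summation, the contribution of every PFM to `total` remains directly readable.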

Figure 2 .
Figure 2. Illustration of the opponent perceptual features utilized by EPU-CNN.
Given an input tensor I composed of N input PFMs, an EPU-CNN model performs feature extraction and classification. An EPU-CNN model is constructed from N CNN sub-networks Ci, with each sub-network receiving a PFM Ii, i = 1, 2, …, N, as input. Each Ci can be regarded as a function Ci(·; ηi), Ci: XH×W → Z, where XH×W and Z are the input and univariate output space of each Ci, respectively. A sub-network Ci consists of two parts: a) a feature extractor Di(·; ωi), parametrized by ωi; and b) a univariate function Fi(·; θi), parametrized by θi. Thus Eq. (1) becomes:

g(E[Y | I; ω{N}, θ{N}]) = β + Σi=1..N Fi(Di(Ii; ωi); θi)   (2)

where Di represents a feature extraction model composed of a CNN followed by a Fully Connected Neural Network (FC-NN) utilizing activation functions that are not conditioned to be smooth, Fi represents a single FC-NN layer utilizing a smooth activation function that provides the final univariate output of a CNN sub-network, and ω{N} = {ω1, ω2, …, ωN} and θ{N} = {θ1, θ2, …, θN} are the parameters of Di and Fi, respectively. Equation (2) encapsulates the properties and definition of GAMs while extending their capacity to exploit CNN models for computer vision tasks. The feature extractor Di can be implemented by a conventional CNN architecture, whereas the number of output neurons and the activation function of Fi should be chosen so as to appropriately represent the classification outcome, in a binary or multiclass setting. In the context of binary classification, which is considered in this study, Fi is formulated with a single output neuron and the hyperbolic tangent (tanh) activation function, resulting in sub-network responses within the range [-1, 1]. Ultimately, this allows the contribution of each feature to the final prediction to be expressed intuitively, as a positive or negative contribution with respect to a class label. Since EPU-CNN is applied in the context of binary classification, we chose the final output of an EPU-CNN model to be within the interval [0, 1]. However, the formulation of
EPU-CNN presented in Eq. (2) indicates that the final output can fall outside the range [0, 1]; i.e., given N sub-networks Ci and a bias term β, the right-hand side of Eq. (2) provides values within the range [β − N, β + N].
explaining why the image is classified into that class; and c) a set of Perceptual Relevance Maps (PRMs) Si explaining which image regions are responsible for each RSS. Figure 3 illustrates the outputs of the model for two images that belong to different classes. The classification result is indicated as a textual label characterizing the input image, and the RSSs are visualized through bar-charts. Each bar-chart consists of horizontal red or green bars, indicating the magnitude of resemblance that each Ii is estimated to have to the banana and apple class, respectively. Additionally, the model provides, with respect to each Ii, areas (PRMs) highlighting their resemblance to the predicted class. The color scaling from orange to yellow in the maps indicates ascending activation intensity. Image-specific visualizations of RSSs enable the interpretation of the classification process of unlabeled input images. This is the most important aspect of an EPU-CNN model. For example, the image of Fig. 3(a) is classified as a banana because all PFMs, i.e., light-dark, coarse-fine, blue-yellow and green-red, as indicated by the respective RSSs, guide the prediction towards the banana class, which corresponds to negative Ci(Ii; ηi) responses (red). Accordingly, the image of Fig. 3(b) is classified as an apple because all PFMs guide the prediction towards the apple class, i.e., positive Ci(Ii; ηi) responses (green). However, it is not necessary for all the Ci(Ii; ηi) responses to be negative for an image to be classified as a banana, since EPU-CNN models consider the consensus of the sub-networks. Perceptual Relevance Maps, Si, are generated to visually inspect the relevant regions of the input image I with respect to each RSS Ci(Ii; ηi). Let Al = (Al,1, Al,2, …, Al,n) denote a tensor of feature maps with Al ∈ ℝn×h×w, where n, h and w denote the depth, height and width of Al, and Al,k ∈ ℝh×w, as computed by a convolutional layer l of a Ci. The selection of

Figure 3 .
Figure 3. Example of EPU-CNN output visualization using bar-charts and saliency maps. The numbering indicates the interpretation order of the EPU-CNN output. The label field indicates the predicted label. (a) Interpretation of an image classified as a banana. (b) Interpretation of an image classified as an apple.
The PRMs are overlayed on the input images (Fig. 3(a, b)). The highlighted regions indicate the spatial association of the similarity scores Ci(Ii; ηi) with the respective input image. Moreover, the numbers in the images of Fig. 3 indicate the order in which the different outputs of an EPU-CNN can be considered by the user. Initially, a user can examine the regions highlighted by the generated PRMs of each PFM (1). These regions participate in the classification outcome, towards either class, with a magnitude indicated by the RSSs (2). Finally, the PRMs (1) along with the RSSs (2) can assist the user in interpreting the class prediction of an EPU-CNN (3).
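The derivation of a PRM from the layer activations of a sub-network can be sketched roughly as follows; the channel-averaging, min-max normalisation, and nearest-neighbour upsampling are illustrative assumptions, with the actual layer selection following the paper.

```python
import numpy as np

def perceptual_relevance_map(feature_maps, out_h, out_w):
    """Collapse an (n, h, w) tensor of convolutional feature maps into one
    relevance map in [0, 1], upsampled (nearest neighbour) to the input
    image size so it can be overlayed on the image."""
    m = feature_maps.mean(axis=0)                      # average over channels
    m = (m - m.min()) / (m.max() - m.min() + 1e-8)     # normalise to [0, 1]
    ys = np.arange(out_h) * m.shape[0] // out_h        # nearest-neighbour rows
    xs = np.arange(out_w) * m.shape[1] // out_w        # nearest-neighbour cols
    return m[np.ix_(ys, xs)]
```

The resulting map can then be rendered with an orange-to-yellow color scale and blended with the input image, as in Fig. 3.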
The KID dataset consists of 2,352 annotated wireless capsule endoscopy (WCE) images of abnormal findings, i.e., inflammatory, vascular and polypoid lesions, as well as images depicting normal tissue from the esophagus, stomach, small bowel and colon. The Kvasir dataset consists of images of the gastrointestinal (GI) tract, annotated and verified by medical experts. These include 4,000 images of anatomical landmarks, i.e., Z-line, pylorus and cecum, and pathological findings of esophagitis, polyps and ulcerative colitis. The dataset also contains sets of images related to endoscopic polyp removal that were not utilized in this work. The MICCAI 2015 Endovis challenge dataset consists of 800 gastroscopic images of normal and abnormal findings, such as gastritis, ulcer, and bleeding. Dermoscopic dataset: the evaluation process of EPU-CNN also included the International Skin Imaging Collaboration Challenge 2019 (ISIC2019) dermoscopic image collection. The ISIC2019 challenge provides a publicly available archive of 25,331 dermoscopic images of eight different categories of skin lesions, namely, melanoma, melanocytic nevus, carcinomas (both of basal and squamous cells), actinic and benign keratosis, dermatofibroma, and vascular lesions. These images were used to construct three different binary classification problems: a) melanomas vs. melanocytic nevi (Me. vs. Ne.); b) carcinomas vs. melanocytic nevi (Ca. vs. Ne.); and c) carcinomas vs. melanomas (Ca. vs.
Me.). Tasks a) and b) are characterized as a classification between abnormal (carcinomas, melanomas) [...] In detail, the batch size was set to 64 and Stochastic Gradient Descent was used as the optimization algorithm; the training data were augmented only with respect to their orientation. The weights of all networks were randomly initialized before training. Five different CNN architectures were considered for the construction of EPU-CNN models. In detail, two indicative CNN architectures, namely BaseI and BaseII, along with VGG16, ResNet50 and DenseNet169, were incorporated as base models in the EPU-CNN framework. These models were selected to demonstrate the generality of the proposed framework, i.e., its applicability to rendering different conventional CNN architectures interpretable. Regarding the indicative CNN architectures, BaseI consists of 3 [...] The best results are in boldface typesetting and the results ranked second are underlined. It can be observed that the results obtained by the EPU-CNN models indicate an overall better or comparable classification performance relative to their non-interpretable counterparts, i.e., BaseI, BaseII, VGG16, ResNet50 and DenseNet169. In detail, on Banapple, Endovis-MICCAI, Kvasir and ISIC 2019 (Me. vs. Ne.), EPUII provided substantially better results compared with the other EPU-CNN models and the majority of base models.

Figure 4
Figure 4 illustrates the number of trainable parameters of each model. It can be observed that the complexity of an EPU-CNN model is analogous to that of its base model. Additionally, an EPU-CNN can provide competent results even with base models of low complexity, i.e., EPUI and EPUII utilize ~40 and ~19 million parameters, respectively.

Figure 4 .
Figure 4. Visualization of the complexity of the compared models in terms of the number of trainable network parameters.

Figure 5 .
Figure 5. Example of PRMs generated by features maps extracted from different layers of EPUII.

Figure 5
Figure 5 illustrates indicative PRMs constructed using feature maps estimated by different convolutional layers [...] The qualitative analysis of EPU-CNN was investigated by considering the PRMs and both global and local bar-charts generated by the EPUII model, for each dataset. Given a validation set of images with a priori known class memberships, global bar-charts are constructed by averaging the RSSs per class, as provided by each sub-network of EPUII. Global bar-charts enhance the transparency of the model and reveal the overall contribution of the PFMs to the data discrimination process. In a global bar-chart, PFMs of low or high significance can be identified by their dataset-wide score, which can lead to the selection of a subset of the most informative PFMs, i.e., by pruning or replacing the sub-networks of the PFMs of low significance. The respective results obtained per dataset are provided in the following paragraphs.
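Constructing a global bar-chart reduces to averaging the per-image RSS vectors over the images of one class; a minimal sketch, with hypothetical variable names:

```python
import numpy as np

def global_rss(rss_matrix, labels, class_of_interest):
    """Average the per-image RSS vectors (rows: images, columns: PFMs)
    over the images of one class, yielding the dataset-wide contribution
    of each PFM, i.e., the values plotted in a global bar-chart."""
    mask = labels == class_of_interest   # select images of the given class
    return rss_matrix[mask].mean(axis=0)
```

PFMs whose dataset-wide score stays near zero for both classes are candidates for pruning, as discussed above.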

Figure 7
illustrates examples of local bar-charts along with the respective PRMs of classified images. Specifically, the images presented in Fig. 7(a) were correctly classified by EPUII, and this is reflected by the visualization of the RSSs. The PRMs of each sub-network indicate the regions of the input image that resemble the class that each RSS suggests. For instance, in Fig. 7(a)-B the highlighted areas of the PRMs corresponding to green-red and blue-yellow are overlayed precisely on the class-related object, i.e., the bananas. Interestingly, the difference between the light-dark and coarse-fine RSSs can be justified by the obscurity of the highlighted regions of Slight-dark and Scoarse-fine, i.e., both PRMs highlight the table.

Figure 7 (
Figure 7(b) illustrates wrongly classified images. Notably, each of these images has resemblances to the opposite class with respect to color and shape. For example, in Fig. 7(b)-A the perceptual features of light-dark, coarse-fine

Figure 8 .
Figure 8. Example of EPU-CNN interpretations, as generated by EPUII, on biomedical images.The label field indicates the predicted label.(a) Abnormal and (b) normal endoscopic image; (c) Carcinoma and (d) (normal) nevus skin lesion; (e) Abnormal endoscopic image and (f) modification of (e) to resemble a normal endoscopic image; (g) Melanoma skin lesion and (h) modification of (g) to resemble nevus.
8(c)), Slight-dark focuses on the entirety of the image, whereas Scoarse-fine and Sblue-yellow focus on regions with color variations, e.g., on the yellow spot and the little cuts on the left and bottom sides of the image, respectively. In the case of the nevus (Fig. 8(d)), Slight-dark and Scoarse-fine isolate the lesion by segregating it from the rest of the image, either by focusing on it or around it, whereas Sblue-yellow indicates only slight attention of the network to the lesion.

Figure 9 .
Figure 9. Example of CNN interpretations provided by various methodologies.

Figure 10 .
Figure 10. Example of EPU-CNN interpretations, as generated by EPUII, on images of the CIFAR-10 dataset. The label field indicates the predicted label. Rows A and B illustrate interpretations of correct and wrong predictions, respectively.

Figure 11 .
Figure 11. Classification performance in terms of accuracy on the CIFAR-10 dataset.
To train an EPU-CNN model in the context of binary classification, the Binary Cross-Entropy (BCE) is chosen as the loss function to be minimized:

L(η{N}) = −(1/M) Σj=1..M [ yj log(EPU-CNN(Ij; η{N})) + (1 − yj) log(1 − EPU-CNN(Ij; η{N})) ]   (6)

where EPU-CNN(Ij; η{N}) is the class probability of Ij, yj is the ground-truth label of Ij, and M is the number of training images. As can be observed from Eq. (6), the total error of the EPU-CNN, deriving from the responses of the CNN ensemble consensus, is used to update the parameters η{N} of all sub-networks Ci(·; ηi) of the parallel sub-network ensemble topology simultaneously, as illustrated in Fig. 1. It is worth noting that an EPU-CNN model can also be adapted for multiclass datasets, e.g., using n > 1 output neurons instead of one in the case of n > 1 classes. Then, the network's output can be interpreted by considering the contribution of each CNN sub-network's multiclass classification outcome to the final classification result (see Section 3.6).
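The BCE objective above can be written compactly as follows; this is a NumPy sketch, where the clipping constant is an illustrative numerical-stability detail rather than part of the formulation.

```python
import numpy as np

def bce_loss(probs, labels, eps=1e-7):
    """Binary cross-entropy over a batch: probs are the EPU-CNN class
    probabilities in (0, 1), labels the ground-truth 0/1 targets. The
    mean error is back-propagated through all sub-networks jointly."""
    p = np.clip(probs, eps, 1.0 - eps)   # guard against log(0)
    return -np.mean(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))
```

Minimizing this single scalar over the whole ensemble is what couples the sub-networks, since every Ci contributes to the class probability through the consensus sum.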

Table 1 .
Classification results (AUC) of EPU-CNN and CNN models. BaseII follows the same architecture as BaseI, with an additional convolutional block, at the beginning of the architecture, utilizing an inception module. BaseI, BaseII, VGG16, ResNet50

Table 2 .
Interpretability accuracy results of EPU-CNN models. EPUI and EPUII achieved the highest aint, with scores of 72.40±1.51% and 72.62±1.63%, respectively. This means that the EPUI and EPUII models are comparable with respect to their capacity to interpret the classification of bananas and apples. Since EPUII achieves a better overall classification performance and aint score, and is more computationally efficient, it was chosen for the qualitative investigation of interpretability presented in the following sections.