Interpretable classification of Alzheimer’s disease pathologies with a convolutional neural network pipeline

Neuropathologists assess vast brain areas to identify diverse and subtly-differentiated morphologies. Standard semi-quantitative scoring approaches, however, are coarse-grained and lack precise neuroanatomic localization. We report a proof-of-concept deep learning pipeline that identifies specific neuropathologies—amyloid plaques and cerebral amyloid angiopathy—in immunohistochemically-stained archival slides. Using automated segmentation of stained objects and a cloud-based interface, we annotate > 70,000 plaque candidates from 43 whole slide images (WSIs) to train and evaluate convolutional neural networks. Networks achieve strong plaque classification on a 10-WSI hold-out set (0.993 and 0.743 areas under the receiver operating characteristic and precision recall curve, respectively). Prediction confidence maps visualize morphology distributions at high resolution. Resulting network-derived amyloid beta (Aβ)-burden scores correlate well with established semi-quantitative scores on a 30-WSI blinded hold-out. Finally, saliency mapping demonstrates that networks learn patterns agreeing with accepted pathologic features. This scalable means to augment a neuropathologist’s ability suggests a route to neuropathologic deep phenotyping.


Introduction
The extracellular deposition of amyloid beta (Aβ) in the form of plaques is a pathological hallmark of Alzheimer's Disease (AD) 1,2 , a common chronic neurodegenerative disease. Plaques have a diverse range of morphologies and locational distributions 1 . The current consensus criteria for a neuropathological diagnosis of AD [3][4][5] incorporate protocols assessing plaque density and location; some hypothesize that plaques may be an initial event in AD 5,6 . Precise measures of plaque morphologies (such as cored, neuritic, and diffuse) can serve as a basis for understanding disease progression and pathophysiology, providing guidance and insight into disease mechanisms 2,7-10 .
For neuropathologic diagnosis, established semi-quantitative scales are used to assess plaque burden ( Fig. 1a ) 4,8,11,12 . The standard criteria put forth by the Consortium to Establish a Registry for Alzheimer's Disease (CERAD) are based upon the semi-quantitative assessment of the highest density of neocortical neuritic plaques 4,13 . Diffuse plaques, which may be the initial morphological type of Aβ 14,15 , can account for over 50% of plaque burden in preclinical cases but are not included in CERAD 16 . Furthermore, data on anatomical location ( i.e. Thal amyloid phase) are based on the presence of plaques regardless of type or density 5 . The potential for neuropathologic deep phenotyping efforts that account for anatomic location, diverse sources of proteinopathy, and quantitative pathology densities motivates the need for effective and scalable quantitative methods to differentiate pathological subtypes [17][18][19] .
Existing quantitative methods, such as positive pixel count 20 algorithms, rely on human-defined 21 RGB or HSV ranges (i.e., pixel color and intensity) and are thus sensitive to batch differences or to the variable effects of formalin fixation on tinctorial properties. Manual counts or stereological 22,23 methods can be tedious, difficult to score, and time-consuming. Consequently, studies using limited-range scores or overall pathology burden 3,8,13,24 are powerful but may suffer from interrater variability 4,13 , be difficult to adapt to statistically meaningful disease-correlation analysis, or be blind to selective locational vulnerability 20 . New methods introducing detailed and sensitive quantification of pathologies would reduce the burden placed on pathologists, increase reliability, and enable studies at a scale that is currently prohibitive.
Deep learning has transformed medical image analysis 25,26 . Convolutional neural networks (CNNs) have achieved expert-level performance in complex visual recognition tasks, including the diagnosis of skin 27 and breast 28,29 cancers. These flexible models learn to recognize intricate patterns directly from visual data without the need for manually-defined image features or expert-delineated templates, and can account for non-trivial variations in image quality and color. Deep learning approaches have been reported for the classification of AD pathophysiology in magnetic resonance and positron-emission tomography images 30 , and for relating gene expression to neuropathology datasets 31 .
We hypothesized deep learning methods could augment neuropathological whole slide image (WSI) analysis 32 . As a proof of concept, we posited CNN models could be employed for recognition and classification of Aβ pathologies, especially plaques, with the downstream goal of providing reliable, scalable, and interpretable measures based on neuroanatomical location. Despite their strong predictive power, deep learning models have been criticized for their poor interpretability and reliance on massive annotated datasets 33 . At the outset, we recognized these factors represented significant challenges in the development of useful tools for neuropathology. An approach tailored to neuropathology would require 1) careful delineation of the machine learning task; 2) construction of a curated image dataset with high-resolution annotations by experts; and 3) extensive model interpretability. To develop a useful tool to aid neuropathologists, we deemed it critical that predictive performance should result from learning meaningful patterns within the images 34 .
In this study, we present a pipeline for the neuropathological analysis of Aβ pathologies in WSIs generated by digitizing glass microscope slides of temporal gyri of human brain ( Fig. 1b ). We describe an end-to-end pipeline for image processing, a custom web interface for rapid expert annotation, and training of CNN models that result in high performance multi-task classifiers capable of distinguishing Aβ in the form of cored plaques, diffuse plaques, and cerebral amyloid angiopathy (CAA). We demonstrate how prediction confidence maps visualize distributions as an interpretable and complementary means to understand Aβ burden. Finally, we demonstrate that these models are interpretable, using deep learning introspection methods to show that trained models learn relevant features of each of these Aβ pathology classes. To the best of our knowledge, these studies constitute the first report of CNNs for Aβ pathology analysis.

A Platform for Rapid Manual Annotation of 77,000 Plaque Candidates
CNNs operate most effectively when trained on datasets exceeding tens of thousands of example images. Indeed, we found over 500,000 individual candidate images could be extracted from 43 digitized glass microscope slides (WSIs, see Supplementary Table 1 for case details) used in this study. We set out to build an annotated dataset of at least ten percent of them ( Fig. 2 , see Data Availability ). Manual annotation at this scale would have been a daunting task by the conventional approach of a neuropathologist manually drawing bounding boxes within a standard "10x" (~700 micron) visual field. We therefore developed an automated preprocessing procedure ( Fig. 2a and Supplementary Fig. 1, 2 ) to normalize slide color and to place bounding boxes around all of the immunohistochemically-stained objects within WSIs that might be plaques, using open-source image analysis tools (see Methods ). As the native resolution of a WSI is too large (typically 50,000 by 50,000 pixels at 20x magnification) to use as the direct input for a standard CNN, we designed the dataset to contain uniform 256x256 pixel tiles centered on individual plaque candidates.
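The tile-generation step can be sketched as follows. The helper `extract_tile` is hypothetical (the pipeline's actual border handling is not described here); zero-padding is assumed so that candidates near the slide edge still yield uniform 256x256 tiles centered on the stained object.

```python
import numpy as np

def extract_tile(wsi: np.ndarray, cy: int, cx: int, size: int = 256) -> np.ndarray:
    """Return a size x size tile centered on a candidate object at (cy, cx).

    Hypothetical helper: the slide is zero-padded so every tile has a
    uniform shape, even for candidates near the border.
    """
    half = size // 2
    padded = np.pad(wsi, ((half, half), (half, half), (0, 0)), mode="constant")
    # After padding, original coordinate (cy, cx) maps to (cy + half, cx + half),
    # so the slice below is centered on the candidate.
    return padded[cy:cy + size, cx:cx + size]

# Toy example: a small "slide" with a candidate near the top-left corner.
wsi = np.ones((1000, 1200, 3), dtype=np.uint8)
tile = extract_tile(wsi, cy=10, cx=10)
```

Because the candidate sits near the edge, part of the returned tile is zero padding while its center still falls on tissue.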
We created a simple web interface to rapidly annotate Aβ pathology-candidate image tiles and deployed it on the Amazon Web Services Elastic Beanstalk 35 for reliability and scalability (illustrated in Fig. 2c ).
An expert neuropathologist annotator used unique credentials and a rapid keystroke-entry format to annotate the tiles, which were stored in a SQL database (see Supplementary Fig. 3 ). Using this platform, nearly 66,000 candidate images were annotated at rates up to 2,500 tiles per hour into three major categories: "cored" plaques, cerebral amyloid angiopathy ("CAA"), or "diffuse" plaques. Additional categories such as "not sure" or "flag" denoted uncertainty, image segmentation failures, or other special cases ( Supplementary Fig. 3 ). The dataset was then built in three phases ( Table 1 ). In Phase I, 55,000 images were expert-labeled using the web application. The majority of candidate images had diffuse plaque morphologies (84.8% of the annotations), with cored plaques (2.2%) and CAAs (1.1%) making up the minor classes. Given this initial distribution, the goal for the second phase was to enrich the minority classes. We trained an intermediate CNN to classify objects based on the Phase I dataset, then used its predictions to prioritize 101,671 as-yet unprocessed tiles in favor of cored plaques and CAAs (see Supplementary Fig. 4 ). Thus in Phase II, an additional 11,029 tiles were annotated, having been evaluated in rank-order of their predicted likelihood to contain either of the minority-class plaques. In Phase III, we annotated an additional 10,873 plaque candidate tiles extracted from a separate hold-out test set of 10 WSIs not in the original 32-WSI collection, without any prioritization procedures. We performed manual annotation using the web application for all phases.
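The Phase II prioritization can be sketched as below. The array `probs`, its assumed column order [cored, diffuse, CAA], and the "rank by the best minority-class confidence" rule are illustrative assumptions, not the study's exact procedure.

```python
import numpy as np

# Hypothetical sketch: rank unannotated tiles by an intermediate CNN's
# predicted probability of containing a minority class (cored or CAA),
# so the annotator labels the most promising candidates first.
rng = np.random.default_rng(0)
probs = rng.random((101_671, 3))            # stand-in for model outputs [cored, diffuse, CAA]

minority_score = probs[:, [0, 2]].max(axis=1)   # best of cored/CAA confidence per tile
priority_order = np.argsort(-minority_score)    # highest confidence first

# The annotator then works down `priority_order`; Phase II ultimately
# covered 11,029 tiles this way.
top_tiles = priority_order[:11_029]
```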

Convolutional Neural Networks Effectively Discriminate Among β-Amyloid Morphologies
We trained CNNs to classify tiles as CAA, cored, or diffuse plaques. At 20x magnification, a single 256x256 pixel tile (128 microns) could contain more than one plaque, so we trained the CNNs for multi-task classification: CNNs were asked to determine the presence or absence of all morphologies in each tile. We combined the Phase I and Phase II datasets, then randomly split the resulting 70,000 tiles (66,030 annotated and 3,970 IHC-negative) into training (from 28 WSIs) and validation (from 4 WSIs) sets, while stratifying by case ( i.e. , WSI source) to ensure that models generalize to new cases. A search of CNN architectures identified a six-layer convolutional architecture with two dense layers ( Fig. 3a ) with strong performance. Through subsequent hyperparameter optimization we found that data augmentation 36,37 and minority class oversampling 38 further improved model performance. CAA prediction performance was also strong on the validation set ( Supplementary Fig. 6 ), but was omitted from the Phase-III test-set benchmarking in Fig. 4 because Phase III was derived from neuroanatomic regions predominantly lacking CAAs (e.g., Supplementary Fig. 7 ). Notably, this model performance was achieved using fewer than 2,000 training examples from each minority plaque class. Representative accurate ( Fig. 3b ) and misclassified ( Fig. 3c ) examples from the 10,873 hold-out tiles illustrate cases where the model succeeded or went astray.
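Minority class oversampling of the kind used here can be sketched as below. The label matrix, the "match the majority count" ratio, and the column order [cored, diffuse, CAA] are illustrative assumptions; the study's exact sampling ratios are not specified in this text.

```python
import numpy as np

# Toy multi-task label matrix: 90 diffuse-only, 7 cored-only, 3 CAA-only tiles.
labels = np.array([[0, 1, 0]] * 90 + [[1, 0, 0]] * 7 + [[0, 0, 1]] * 3)
idx = np.arange(len(labels))

majority = idx[labels[:, 1] == 1]          # diffuse-containing tiles (majority class)
out = [majority]
for col in (0, 2):                         # cored, CAA (minority classes)
    minority = idx[labels[:, col] == 1]
    # Repeat minority indices until each class matches the majority count.
    reps = int(np.ceil(len(majority) / len(minority)))
    out.append(np.tile(minority, reps)[: len(majority)])

epoch_indices = np.concatenate(out)        # balanced sampling order for one epoch
```

Each epoch now presents 90 examples of every class, so the loss is no longer dominated by diffuse plaques.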

Model Performance Improves Nonlinearly with the Number of Training Examples
To determine whether similar performance could be achieved with fewer manual annotations, we performed two retrospective studies to investigate the effect of training dataset size. In the first study, we randomly selected subsets of the 61,370-example training dataset, maintaining stratification by case (i.e., WSI source), and plotted model performance as a function of the number of training examples ( Fig. 4c ). Each random selection was repeated five times, with a fresh model trained each time, for a total of 19 independently trained and evaluated identically-architected CNN models. All models were benchmarked against the same hold-out test set (Phase III, as in Fig. 4a-b ). As expected, model performance positively tracked with the total number of training examples. Notably, models trained on a 50% smaller training set still achieved an average AUROC above 0.99 and an AUPRC above 0.74, with minimal loss in overall performance.
In the second study, we investigated model performance as a function of the chronological dataset growth during the project, where training examples were included in the order of original expert annotation ( Fig. 4d ). Model performance at 15 expert-hours fell short of model performance at 50% of dataset size ( Fig. 4c ) because the latter could contain chronologically later-annotated tiles in its training. Accordingly, the goal of this second study was to determine whether a training example's annotation chronology played a role in its use for CNN training. As above, performance steadily increased as the annotated dataset grew. However, performance trends between the studies differed in two ways. First, chronologically-trained models did not converge in AUPRC performance as early as the equivalent-sized random-subset-trained models benefitting from later annotations did. Second, the chronology study shows a distinct AUPRC boost in Phase II, illustrating the positive effect of enriching for cored-plaque prevalence.

Prediction Confidence Maps Show Plaque Localization
To visualize the distribution and neuroanatomic location of pathologies in a broader context, we applied a sliding window approach 39 to generate WSI heatmaps of predictions ( Fig. 5 ). These heatmaps plot the confidence and location of each prediction by the CNN, which may then be visualized from the sub-tile resolution ( Fig. 5c ) up to the full WSI view ( Fig. 5a ). By progressively zooming in from larger anatomical views, the visualization shifts from the broad distribution of plaques to their detailed 20x morphology. A single cored plaque can be distinguished from a dense region of neighboring diffuse plaques ( Fig. 5c ). In this cohort, diffuse plaques are densely distributed across the grey matter, whereas cored plaques are located in deeper and lower cortical layers, in accordance with known neuroanatomic distributions 1 . Furthermore, CAA predictions appear predominantly at the periphery, proximal to the cortical surface where leptomeninges are present 40 , although predictions are made independently of the surrounding field or neuroanatomic context. These maps highlight other locational aspects of the plaques, such as their presence in white matter immediately beneath the gray matter 41 .
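The sliding-window heatmap generation can be sketched as below. The window size, stride, and dummy model are illustrative assumptions; in the pipeline, `model` would be the trained multi-task CNN returning per-class confidences.

```python
import numpy as np

def prediction_heatmap(wsi, model, tile=256, stride=128):
    """Sliding-window confidence map (sketch).

    `model` maps a tile x tile x 3 array to a 3-vector of class
    confidences (assumed order [cored, diffuse, CAA]); the published
    stride and window settings may differ.
    """
    h, w, _ = wsi.shape
    rows = (h - tile) // stride + 1
    cols = (w - tile) // stride + 1
    heat = np.zeros((rows, cols, 3))
    for r in range(rows):
        for c in range(cols):
            y, x = r * stride, c * stride
            heat[r, c] = model(wsi[y:y + tile, x:x + tile])
    return heat

# Dummy stand-in model: confidence proportional to mean pixel intensity.
dummy = lambda patch: np.full(3, patch.mean() / 255.0)
wsi = np.zeros((1024, 1024, 3), dtype=np.uint8)
heat = prediction_heatmap(wsi, dummy)
```

The resulting grid can be rendered at any zoom level, from the full-WSI overview down to sub-tile resolution.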

Classification Performance Does Not Vary by Tissue Landmarks
The CNNs perform classification (e.g., Fig. 4 ) directly on small anatomical areas (128 microns; green box in Fig. 6a ). Human experts typically assess larger fields of view, such as ~700 microns viewed at 10x magnification, when conducting semi-quantitative plaque scoring. To visualize prediction performance in this context, we also assessed cored-plaque agreement maps on contiguous 6-by-6 tile (768 micron) regions ( Fig. 6a ). In the leftmost column, a green box surrounds the cored plaque within the tile, as labeled by a neuropathologist during the Phase-III dataset annotation ( Fig. 6a ). The middle column overlays the prediction map (as in Fig. 5c ) onto the original IHC-stained image. Finally, the rightmost column summarizes agreement between the expert label and the prediction, with blue and cyan representing correct prediction areas, while red and orange denote misclassification 42 . For this analysis, we used a CNN prediction confidence threshold of 0.90. A more permissive threshold would decrease false negatives (red) at the cost of more false positives (orange). Interestingly, this agreement map highlights the limitations of bounding-box annotations: the correct cored-plaque prediction shown is nonetheless penalized by this view (red halo) for accurately predicting the rounded boundaries of the actual plaque instead of anticipating its square "ground truth" bounding box.
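The agreement-map logic amounts to a per-pixel confusion matrix at the 0.90 threshold. A minimal sketch follows; the assignment of blue vs. cyan (and red vs. orange) to the specific true/false quadrants is an assumption for illustration, since the text only states that blue/cyan mark correct areas and red/orange mark misclassification.

```python
import numpy as np

def agreement(confidence, truth_mask, threshold=0.90):
    """Assign each pixel an agreement color by comparing the thresholded
    prediction against the expert's ground-truth mask (sketch)."""
    pred = confidence >= threshold
    out = np.empty(confidence.shape, dtype=object)
    out[pred & truth_mask] = "blue"      # correctly predicted plaque area (assumed)
    out[~pred & ~truth_mask] = "cyan"    # correctly predicted background (assumed)
    out[~pred & truth_mask] = "red"      # missed plaque area (false negative)
    out[pred & ~truth_mask] = "orange"   # spurious prediction (false positive)
    return out

conf = np.array([[0.95, 0.10],
                 [0.95, 0.92]])
truth = np.array([[True, True],
                  [False, False]])
colors = agreement(conf, truth)
```

Lowering `threshold` moves pixels from red (missed) to blue or orange, matching the trade-off described above.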
Taking a step yet further out to regions of 3840 microns ( Fig. 6b ), these maps (see Supplementary Fig. 8 ) extend the same agreement view across larger neuroanatomic contexts.

Machine Learning Introspection Techniques Identify Salient Plaque Features
To investigate the CNN model's internal logic, we performed two studies to determine the importance of morphology features contributing to accurate predictions ( Fig. 7 ). In the first, we applied guided gradient-weighted class activation mapping (guided Grad-CAM) 43 to identify the image features most salient to each classification task. For the CAA task in Fig. 7c , activation localizes to amyloid within the vessel wall, consistent with CAA's defining feature; while for the cored and diffuse tasks, Grad-CAM highlights the punctate deposit (red arrow) beneath the CAA. Lastly, Fig. 7d , which contains both a diffuse (red arrow) and a cored plaque (yellow arrow), shows cored-task activation maps localizing to the amyloid core, with broader feature activations for the diffuse and CAA tasks. Crucially, Grad-CAM activation mapping may highlight certain image features as salient because they help determine that an object is not present in the image: despite strong localized activation for the cored and diffuse maps in Fig. 7c at the punctate deposit (red arrow), the CNN predicts that neither plaque is present.
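The core Grad-CAM weighting can be sketched in a few lines. Note this shows the plain Grad-CAM map; the guided variant used in the study additionally combines it with guided backpropagation. The feature maps and gradients below are dummy arrays; in practice both come from a backward pass through the trained CNN's last convolutional layer.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM sketch: weight each feature map by its global-average-pooled
    gradient, sum over channels, and keep only positive evidence (ReLU).

    feature_maps, gradients: arrays of shape (channels, h, w).
    """
    weights = gradients.mean(axis=(1, 2))              # pooled gradient per channel
    cam = np.tensordot(weights, feature_maps, axes=1)  # weighted sum over channels
    return np.maximum(cam, 0)                          # ReLU

# Dummy inputs: two 4x4 feature maps with opposite-signed gradients.
fmaps = np.stack([np.ones((4, 4)), -np.ones((4, 4))])
grads = np.stack([np.full((4, 4), 0.5), np.full((4, 4), -0.5)])
cam = grad_cam(fmaps, grads)
```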
Whereas guided Grad-CAM provides a fine-grained view of feature salience, it does not differentiate features indicative of a plaque from those that contradict its presence. To complement the analysis, we performed a feature occlusion study 44 on the same examples. In this experiment, a small occlusion patch (shown in Fig. 7a , black box) is systematically moved across the image, and the model makes a prediction on the occluded image at each increment. Blue-to-yellow-to-red colors indicate increasing CNN prediction confidence from 0.0 to 1.0. Consequently, color shifts in occlusion maps show which image features, when occluded, change prediction confidence. When the patch occludes an important feature such as the amyloid core of a cored plaque ( Fig. 7a , yellow arrow), the model fails to predict the object correctly: cored-task confidence drops to zero (blue dot on red background, yellow arrow). Occluding less cored-task-relevant regions, such as within the off-center diffuse stain (red arrow), has little effect, indicated by the solid red coloring in the cored task's confidence map for this area. Conversely, confidence maps may also show where occlusion of a critical feature makes an alternative class more likely. If the amyloid core in Fig. 7a is occluded, a diffuse-plaque prediction becomes likely (signified by yellow arrow).
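The occlusion sweep itself is straightforward to sketch. The patch size, stride, and dummy model below are illustrative, not the study's settings; in practice `model` is the trained CNN's confidence for one task.

```python
import numpy as np

def occlusion_map(image, model, patch=8, stride=8):
    """Slide a black occlusion patch across the image and record the
    model's confidence on each occluded copy (sketch)."""
    h, w = image.shape[:2]
    rows, cols = (h - patch) // stride + 1, (w - patch) // stride + 1
    heat = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            occluded = image.copy()
            y, x = r * stride, c * stride
            occluded[y:y + patch, x:x + patch] = 0   # black out one patch
            heat[r, c] = model(occluded)
    return heat

# Dummy "cored-plaque" model: confident only while the bright core
# (top-left 8x8 region of this toy image) remains visible.
core_model = lambda img: float(img[:8, :8].max() > 0)
image = np.zeros((32, 32))
image[:8, :8] = 1.0   # stand-in for the amyloid core
heat = occlusion_map(image, core_model)
```

Confidence collapses only at the position that covers the core, mirroring the blue-dot-on-red-background pattern described for Fig. 7a.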
Where more than one plaque occurs within the same tile, the two feature-importance studies differ. Grad-CAM activation maps identify salient pixels for plaque classes independently, whereas occlusion maps highlight the interplay of features among classes. For example, in occlusion maps, occluding the leftmost plaque decreases diffuse-task confidence ( Fig. 7d , light blue region in diffuse-task map, red arrow), whereas the same prediction gains confidence (red region, yellow arrow) when the cored plaque's core is occluded. In the corresponding Grad-CAM activation maps for the diffuse task, however, features specific to the diffuse plaque are predominant. Together, these complementary maps visualize the features within images that motivate the CNN's plaque predictions.

Model-Based Whole-Slide Scores Correlate with Manual Semi-Quantitative Scores
We developed a preliminary neural-network-derived score of plaque burden to compare with manual semi-quantitative approaches such as CERAD at a global WSI level. For the CNN-based score, we calculated a count of predicted plaques across an entire WSI by segmenting its prediction heatmap (e.g., Fig. 5a ) and normalizing the result by the tissue area of each slide ( Supplementary Fig. 9 ). The CNN-based scores correlated strongly across the total dataset of 63 WSIs ( Supplementary Tables 1 and 2 ) for which we had independently-collected semi-quantitative plaque-type-specific CERAD-like scores ( Fig. 8 ). CNN-based scores for Aβ significantly differentiated WSIs by CERAD-like categories (e.g., "moderate" vs. "frequent"), especially for cored plaques ( Fig. 8a , second row). For instance, CNN-based WSI scores for the "none" versus "frequent" CERAD-like categories were separated by orders of magnitude. To account for potential model training bias, we collected a further set of 20 WSIs ( Supplementary Table 2 ) with corresponding CERAD scores that were blinded during analysis. Combined with the 10 separate hold-out WSIs from Phase III, this 30-WSI blinded hold-out set demonstrated strong correlation between the automated and manual scoring approaches, such that CNN-based scores significantly discriminated existing semi-quantitative categories ( Fig. 8b ).
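The score computation reduces to: threshold the prediction heatmap, count its connected segments, and normalize by tissue area. A minimal sketch follows; the flood-fill component counter, the 0.90 threshold, and the mm² normalization are illustrative stand-ins for the Methods' actual segmentation and normalization.

```python
import numpy as np

def count_segments(mask):
    """Count 4-connected components in a boolean mask via flood fill
    (stand-in for the heatmap segmentation step)."""
    mask = mask.copy()
    count = 0
    h, w = mask.shape
    for i in range(h):
        for j in range(w):
            if mask[i, j]:
                count += 1
                stack = [(i, j)]
                while stack:           # flood-fill this component
                    y, x = stack.pop()
                    if 0 <= y < h and 0 <= x < w and mask[y, x]:
                        mask[y, x] = False
                        stack += [(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)]
    return count

# Toy heatmap with two high-confidence predicted plaques.
heatmap = np.zeros((10, 10))
heatmap[1:3, 1:3] = 0.95
heatmap[6:8, 5:9] = 0.97
plaque_count = count_segments(heatmap >= 0.90)

tissue_area_mm2 = 250.0                      # illustrative tissue area
cnn_score = plaque_count / tissue_area_mm2   # plaques per unit tissue area
```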

Discussion
We report a scalable, quantitative, and interpretable approach to identify neuropathologies for three classes of Aβ deposits, motivated by the method's downstream application to statistically powerful correlative analyses and neuroanatomical localization of AD pathologies. In practice, such deep-phenotyping techniques will have limited utility if their underlying predictions cannot be interpreted, critiqued, and refined by expert neuropathologist supervision. Consequently, to establish the feasibility and limitations of this approach, we considered multiple challenges when adapting CNNs to WSIs of archival human brain samples 45 .
Addressing the first challenge, we developed an end-to-end pipeline to automate WSI processing and aid rapid image annotation ( Fig. 2a , Supplementary Fig. 3 ). This pipeline performs color normalization 46 followed by immunohistochemical (IHC) stain detection to create a preliminary library of candidate plaques at 20x magnification. The logic behind stain detection was two-part: stained objects must inhabit a delimited brown hue range, and objects must comprise coherent contiguous regions exceeding a minimum size. We then generated image tiles centered on each candidate ( Fig. 2b ). Thus 43 WSIs containing Aβ IHC-stained temporal gyri yielded nearly 500,000 raw candidate tiles at 20x resolution (details in Supplementary Tables 1-2 ), filtered to ~206,000 tiles containing candidate objects of sufficient size ( Methods ). The next step was manual annotation. Although web-based histopathological annotation tools exist 47,48 , we developed a simple platform using the cloud-based Amazon Web Services Elastic Beanstalk 35 infrastructure ( Supplementary Fig. 3 ) for study design flexibility and for the speed of its keystroke-based entry format.
For instance, subsequent studies may investigate a broader field of view for annotation context or introduce checks on intra-observer reliability by re-presenting tiles to annotators in different orientations. Given the scope of the annotation task, we incorporated several aspects of gamification theory 49,50 , such as annotator leveling, achievement badges, and progress-bar filling 51,52 , to acknowledge and motivate progress. Using this tool, we observed sustained annotation rates as fast as 1.44 seconds per tile, as measured by the database's timestamp function.
The second challenge was in determining the necessary training dataset size. Having manually annotated 66,030 candidate tiles from 33 WSIs in two annotation phases ( Table 1 ), plus 3,970 randomly selected IHC-negative tiles, we examined the CNN's ability to precisely discriminate plaque and CAA morphologies. For this analysis, we randomly split the tiles into train and validation sets, such that train and validation tiles never shared the same WSI source. Additionally, as a strict hold-out test set (Phase III), and to investigate the role of nearby neuroanatomic landmarks on prediction, we annotated larger contiguous tissue regions corresponding to 5-by-5 times our normal field size for 10 previously unseen WSIs ( Supplementary Fig. 7 ). Note this Phase III dataset differed from the Phase I+II train and validation datasets in that the latter's 70,000 labeled tiles were randomly selected, and thus there was no guarantee that tiles and their resulting ground-truth expert plaque annotations would be contiguous or comprehensively labeled for any local region of tissue. We trained multitask CNNs on all 61,370 training tiles, evaluated multiple CNN architectures and hyperparameter choices, and found that a relatively simple model design ( Fig. 3a ) achieved strong performance.
Given the substantial time investment, we asked whether similar performance could have been achieved with fewer training tiles. We evaluated this retrospectively, by progressively decreasing the training dataset size in two different ways ( Fig. 4c-d ). In the first study, we selected a progression of training data subsets randomly and repeated the training process five times per subset size ( Fig. 4c ). In the second study, we maintained the chronology of the project instead, and thus plot a natural history of the annotation process.
Intriguingly, these performance evaluations highlighted two annotation regimes ( Fig. 4d ); first, unbiased random-tile candidate labeling (Phase I), followed by the Phase II procedure, where cored-plaque and CAA candidates were purposefully enriched by bootstrapping from a partially-trained CNN model. As expected, increasing training example counts improved model performance. Less anticipated was that chronologically-early annotations appeared to be less effective for model training ( Fig. 4d ); considerations such as the neuropathologist's growing familiarity with the annotation tool and its visual field may be subjects for further study. From a practical perspective, the steepest performance gains were nonetheless achieved within the first 15 hours of expert labeling, suggesting a reduced dataset may be pragmatically sufficient.
Significantly, models trained using a comparatively small investment of a neuropathologist's time can assist with new cases and potentially reduce overall expert burden. Subsequent refinements to the model, particularly reinforcement feedback on incorrectly-classified examples encountered during the model's use (e.g., Fig. 3d ), might later be incorporated into the workflow with minimal friction.
The third challenge was human interpretability. We posited that visualizing the CNN model's predictions as comprehensive confidence maps from the WSI level down to a 20x field would aid interpretability by a trained neuropathologist, given the importance of local tissue and neuroanatomic context. On a neuroanatomic level, most predicted plaques are located within grey matter ( Fig. 5a , yellow-to-green regions, right three columns) with some sparse densities in white matter not appreciated from the raw slide ( Fig. 5a , left column).
Despite their primary localization within grey matter, studies have reported plaques within white matter 1,41,54 .
Furthermore, the maps predict cored plaques' propensity for deeper and lower cortical layers, consistent with their known neuroanatomic distribution 1,55 . We were likewise gratified to observe that individual cored plaques stand out from dense neighborhoods of diffuse plaques ( Fig. 5c , cored column) and that CAA predictions made by the model on a 20x (128 microns) tile-by-tile basis nevertheless localized predominantly to the leptomeninges, with some within cortical grey matter. There were caveats, however: for instance, clusters of diffuse plaques with staining "halos" were misclassified as CAAs ( Fig. 5b , CAA column). This was not entirely surprising, as the project focused on cored plaques, so the CAA dataset was comparatively small; a larger CAA dataset containing the full spectrum of its morphologies may be a useful subject of further projects. Indeed, CAAs can be delineated into various staging schemes, such as by their location within the media of the vessel and by vessel integrity 11 , which is important in diagnosing CAA-related hemorrhage.
Using Phase III's larger field-of-view (3840 microns) hold-out regions, the overlays of CNN prediction confidence maps onto ground-truth annotations highlighted cases and context of prediction success and disagreement ( Fig. 6b , Supplementary Fig. 8 ). Nonetheless, accurate predictions alone do not guarantee meaningful learning or that the model will be applicable to new scenarios or populations 56 . Plaque morphology can differ by neuroanatomic location: a CNN model developed from temporal gyri plaques may not be translatable to plaques in other anatomic areas, such as the striatum 1,57 . Although an explicit evaluation of all confounders is outside the scope of this work, the feature saliency and occlusion map studies ( Fig. 7 ) demonstrated that the models focus on image features relevant for neuropathology. Guided gradient-weighted class activation mapping (guided Grad-CAM) techniques near-exclusively highlighted the IHC-stained regions, in patterns characteristic of the pathologies ( Fig. 7 , white-on-black maps). Complementarily, feature occlusion studies illustrated that the central amyloid core is the most discerning feature for a cored plaque's correct identification, and that its occlusion transforms the CNN model's classification to diffuse plaque. Importantly, the crucial features emerging from these machine learning introspection techniques (dense compact amyloid centers for cored plaques, ill-defined amorphous amyloid deposits for diffuse plaques, and amyloid within the media of the cortical vessels for CAAs) all agree with key features used by experts 1,11,58,59 .
We finally evaluated whether CNN models could automatically quantify plaque burden on a whole-slide level in a way that would correlate with standard semi-quantitative methods for plaque assessment ( i.e. , CERAD neuritic plaque scores). As true neuritic plaques are not distinguishable using Aβ-selective IHC stains, we leveraged CERAD-like manual scores (none, sparse, moderate, and frequent) specific to each amyloid class. We found that a preliminary WSI-level CNN-based score we developed ( Methods ) correlated strongly with manual CERAD-like scores ( Fig. 8 ). CNN-based scores from WSIs in one CERAD-like category were significantly different from those in other categories (for cored plaques, p < 0.01). Beyond its overall correspondence with CERAD, the finer-grained CNN-based metric captured subtle variations of plaque burden within each CERAD category. This more detailed and sensitive measurement of plaque burden, after appropriate validation in further studies, may strengthen statistical power for clinicopathological correlations 60 . Automated scores of this nature might be applied across entire archives of stained tissue from diverse anatomic regions, or aid in studies focused on evaluating burden specific to certain neuroanatomic locales or other local landmarks.
Several caveats, however, merit mention. Foremost among them is the intentional restriction of this proof-of-concept study's scope to annotations made by a single expert neuropathologist on a single immunohistochemical stain within a single anatomic region. Differences in experience and annotation criteria will likely result in individual expert variation among "ground truth" labels. The goal and intent of this project was therefore to establish the potential to extend an individual neuropathologist's plaque-identification capabilities in the context of their normal workflow. Furthermore, all data used in this study were from a single brain bank, retrieved and digitized under the same conditions; more diverse datasets from multiple sources will yield more robust and reliable models. We noted also that when the same hold-out set (Phase III) was annotated via the web platform ( Fig. 2c ) versus entirely by hand, the labels at times differed ( Supplementary Fig. 10 ). Future work may build on these foundations to investigate cross-neuropathologist plaque labeling, differing stains, anatomic regions, or collection centers, as well as region-level scoring systems to quantify bulk Aβ deposit burdens.

Dataset
Case Cohort - All samples were retrieved from the archives of the University of California, Davis Alzheimer's Disease Center (UCD-ADC) Brain Bank. Archival samples analyzed in this study were 5 μm formalin-fixed, paraffin-embedded sections of the superior and middle temporal gyrus. The sections had first been pretreated with formic acid to rid samples of endogenous protein, then stained with an amyloid-β antibody (4G8, recognizing residues 17-24; BioLegend, formerly Covance). All slides were digitized using an Aperio AT2 scanner at up to 40x magnification.
Dataset Splitting - A total of 33 WSIs, corresponding to 33 separate decedent cases spanning non-AD to AD and possessing a variety of CERAD scores (see Supplementary Table 1 and Fig. 1), constituted the training set. These WSIs were split into 1536x1536 pixel tiles, each corresponding to a 768x768 micron region of tissue, for further analysis, yielding a total of 33,111 tiles for the training set.
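The tiling step can be sketched as follows. This minimal version operates on an in-memory pixel array (in practice a streaming WSI reader such as PyVips would be used to avoid loading the full slide), and the function name is illustrative:

```python
import numpy as np

def tile_image(wsi, tile_size=1536):
    """Split a WSI pixel array (H x W x 3) into non-overlapping
    tile_size x tile_size tiles; at 0.5 microns/pixel, 1536 px = 768 um.
    Edge remainders smaller than a full tile are dropped in this sketch."""
    h, w = wsi.shape[:2]
    tiles = []
    for y in range(0, h - tile_size + 1, tile_size):
        for x in range(0, w - tile_size + 1, tile_size):
            tiles.append(wsi[y:y + tile_size, x:x + tile_size])
    return tiles
```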
Segmentation - Image segmentation was performed using the open-source library OpenCV 63 . Immunohistochemically-stained entities, including CAAs and cored and diffuse plaques, appear in the brown hue region, so segmentation was performed in the hue-saturation-value (HSV) colorspace using a permissive colormask. Morphological opening and closing operations were performed to smooth the binary masks, and a standard "blob"-detection procedure was applied to isolate candidate objects. Each resulting connected component was center-cropped to a fixed size (256x256 pixels), corresponding to a region of 128x128 microns. This procedure produced nearly 500,000 images. Noisy background deposits were eliminated with a small size filter, leaving a total of 206,888 tiles.

Plaque-labeling Web Interface
To allow for the rapid and efficient annotation of the dataset, we developed a custom Python Flask web application that we deployed on Amazon Web Services Elastic Beanstalk 35 . The web-based interface allows for remote login by the expert labeler, and enables fast, multi-label annotation of images using individual keystrokes. In the interface, images corresponding to 128x128 micron regions were shown to the annotator. A bounding box in the image specified which specific plaque candidate was being labeled. Several elements of gamification, such as leveling, achievement badges (crown icon), and progress bar filling (green bar) are incorporated to reward and motivate annotation task progress. A timestamp function was implemented to record the number of images/hour annotated by the expert (BD). All labels were stored in a SQL database using the Amazon Relational Database Service.
Labeling of the image data proceeded in three phases: 1) In an initial phase, 55,000 images stemming from 3,811 unique tiles were labeled; 2) In the second phase, images containing the minor classes of interest (cored plaques and CAAs) were enriched by running the CNN model built from the first-phase dataset on the remaining 101,671 images. These were ranked by their predicted likelihood of containing cored plaques or CAAs. We then chose the top 11,029 images for labeling. The labeled data from Phase I and Phase II were combined as the entire dataset (Phase I+II) that we used for model training and evaluation. 3) In the third phase, two test sets were constructed with the same data but two distinct labeling methods. A 7680 x 7680 pixel (0.5 MPP) region was selected within each of the 10 hold-out test set WSIs by an expert neuropathologist as the area of interest. For the first test set, 10,873 plaque candidate tiles extracted from these 10 regions were labeled using the plaque-labeling web interface. For the second test set, the cored plaques and CAAs were directly marked by a neuropathologist on the selected region at a standard 10x (768 micron) visual field.

Model Development and Training
CNN Model Architecture and Training - All neural network models were trained with the open-source package PyTorch 64 on four NVIDIA GTX 1080 or Titan X graphics processing units. Our optimized model uses a simple convolutional architecture for image classification, consisting of convolutional layers (3x3 kernels, stride 1, padding 1) alternating with max-pooling layers (Fig. 3a), followed by two fully connected hidden layers (512 and 100 neurons), with rectified linear units as the nonlinear activation function. All neural network models were trained using backpropagation. Our optimized training procedure uses the Adam 65 optimizer with a multi-label soft margin loss function, weight decay (L2 penalty, 0.008), and dropout (probability 0.5 for the first two fully connected layers and probability 0.2 for all convolutional layers). Training proceeds in mini-batches of 64 images with real-time data augmentation, including random flips, rotations, zoom, shear, and color jitter.
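A PyTorch sketch of such an architecture and training setup follows. The convolutional channel widths are assumptions (the text specifies only the kernel sizes, pooling, dense-layer widths, and regularization), so this is illustrative rather than the exact published model:

```python
import torch
import torch.nn as nn

class PlaqueCNN(nn.Module):
    """Sketch of the multi-label plaque classifier described in the text.
    Channel widths are illustrative assumptions."""
    def __init__(self, n_classes=3, channels=(16, 32, 48, 64, 80, 96)):
        super().__init__()
        layers, in_ch = [], 3
        for out_ch in channels:
            layers += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
                nn.ReLU(inplace=True),
                nn.Dropout2d(p=0.2),          # dropout on conv layers
                nn.MaxPool2d(kernel_size=2),  # halves spatial size
            ]
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        # a 256 px input halved six times yields a 4x4 feature map
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels[-1] * 4 * 4, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(512, 100),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(100, n_classes),  # one logit per plaque class
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = PlaqueCNN()
criterion = nn.MultiLabelSoftMarginLoss()            # multi-label loss
optimizer = torch.optim.Adam(model.parameters(), weight_decay=0.008)
```

The multi-label soft margin loss applies an independent sigmoid per class, so one tile can simultaneously score high for, e.g., both cored and diffuse plaques.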

Prediction Confidence Heatmaps
A sliding window approach 39 is applied with the trained CNN model to generate confidence heatmaps.
At each step, the CNN model takes a 256x256 pixel region as input and generates prediction scores for cored plaques, diffuse plaques, and CAAs. By systematically sliding the input region across the entire image, the prediction scores are assembled into prediction confidence heatmaps. The color represents the CNN's prediction confidence for the presence of cored plaques, diffuse plaques, and CAAs in the corresponding region, with yellow indicating the highest confidence and purple the lowest.
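The sliding-window pass can be sketched as follows; the stride and the generic `predict` callable (standing in for the trained CNN) are illustrative assumptions:

```python
import numpy as np

def confidence_heatmap(image, predict, window=256, stride=128):
    """Slide a window x window region across `image` (H x W x 3) and
    record the model's per-class confidences (cored, diffuse, CAA).
    `predict` maps a window to a length-3 score vector."""
    h, w = image.shape[:2]
    rows = (h - window) // stride + 1
    cols = (w - window) // stride + 1
    heatmap = np.zeros((rows, cols, 3), dtype=np.float32)
    for i in range(rows):
        for j in range(cols):
            y, x = i * stride, j * stride
            heatmap[i, j] = predict(image[y:y + window, x:x + window])
    return heatmap
```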

Activation Maps
Guided gradient-weighted class activation mapping (guided Grad-CAM) 43,66 is performed to generate activation maps (also known as saliency maps) highlighting the features that positively influence the prediction of the class of interest. The saliency map is a pointwise multiplication of guided backpropagation and Grad-CAM. Guided backpropagation produces a pixel-space gradient map of the predicted class score with respect to the pixel intensities of the input image. Grad-CAM produces a coarser but more class-specific map by weighting the feature maps of the last convolutional layer with the partial derivatives of the predicted class score with respect to the neurons in that layer.

Feature Occlusion Studies
Feature occlusion studies 44 were performed to show the influence of occluding regions of the input image on the confidence score predicted by the CNN. The occlusion map is computed by replacing a 16x16 pixel region of the image with a pure white patch and generating a prediction on the occluded image. As the white patch is systematically slid across the image (stride = 1 pixel), the prediction score on each occluded image is recorded at the corresponding pixel of the occlusion map.
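A minimal sketch of the occlusion procedure, with a generic `predict` callable standing in for the CNN's scalar confidence score (the stride parameter is exposed for speed, though the study uses stride 1):

```python
import numpy as np

def occlusion_map(image, predict, patch=16, stride=1):
    """Replace a patch x patch region of `image` with pure white, record
    the model's confidence on the occluded copy, and repeat across the
    image. A drop in confidence marks the occluded region as salient."""
    h, w = image.shape[:2]
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    omap = np.zeros((rows, cols), dtype=np.float32)
    for i in range(rows):
        for j in range(cols):
            occluded = image.copy()
            y, x = i * stride, j * stride
            occluded[y:y + patch, x:x + patch] = 255  # pure white patch
            omap[i, j] = predict(occluded)
    return omap
```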

Tissue Segmentation
Tissue areas of WSIs were calculated using the open-source libraries PyVips and OpenCV. Tissue segmentation against the slide background was performed in the lightness-chroma-hue (LCH) colorspace using a specific colormask for each WSI. Morphological opening and closing operations were performed to smooth the binary masks, and tissue area was computed as the pixel sum of the refined masks.

Statistical Analyses
Statistical analyses were performed using the open-source library SciPy 67 . A two-sided, independent, two-sample t-test was used to test the null hypothesis that two independent samples have identical expected values; CNN quantification scores of WSIs from different CERAD categories were used for the test. Data are presented as box plots overlaid with dot plots. Box plots show the interquartile range (top and bottom of the box), the median (the band inside the box), and outliers (points beyond the whiskers); dot plots show individual data points. P ≥ 0.05 was considered not significant (ns); * p < 0.05, ** p < 0.01, *** p < 0.001, **** p < 0.0001.
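This test corresponds to SciPy's `ttest_ind`; the score arrays below are randomly generated stand-ins for the WSI-level CNN scores, not data from the study:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical WSI-level CNN-based scores for two CERAD-like categories
moderate = rng.normal(loc=0.4, scale=0.1, size=10)
frequent = rng.normal(loc=0.8, scale=0.1, size=10)

# Two-sided, independent, two-sample t-test of identical expected values
t_stat, p_value = stats.ttest_ind(moderate, frequent)
```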

Data Availability
We have made the CNN model code ( https://github.com/keiserlab/plaquebox-paper ) as well as the full raw WSI dataset and the annotated plaque-level dataset ( https://www.keiserlab.org/resources/ ) openly available. Tissue samples were provided by the University of California, Davis Alzheimer's Disease Center program. All human subject involvement was overseen and approved by the institutional review boards at the University of California, Davis. All data handling followed current laws, regulations, and IRB guidelines, such as sharing de-identified data that does not contain information used to establish the identity of individual deceased subjects. De-identified data contain no personal health information (PHI) such as names, social security numbers, addresses, or phone numbers. Data were shared under randomly generated pseudo-identification numbers.

Availability of data and material
The core datasets used and/or analyzed during the current study are openly available (see Data Availability ). Further materials are available from the corresponding author on reasonable request.

Competing interests
The authors declare that they have no competing interests related to this study. Dr. DeCarli is a consultant to Novartis. Dr. Keiser is a consultant to Daiichi Sankyo, unrelated to this project. Dr. Dugger has received previous funding from Daiichi Sankyo unrelated to this project.

Funding
This study was funded by an NIH grant (P30 AG010129; BND, CD, LWJ, and LB), a Paul G. Allen Family Foundation Distinguished Investigator Award (MJK), and the China Scholarship Council (ZT). These agencies had no role in any aspect of the study, including study design, data collection, analysis, or writing.

Author contributions
The studies were conceptualized, results analyzed, and the manuscript drafted by ZT, KVC, MJK, and      .

Figure caption: CNN models identify three amyloid plaque types in image tiles. a. The optimized CNN model architecture contained six convolutional layers and two dense layers, using exclusively 3x3 kernels and alternating max-pooling layers. b. Examples of correct CNN predictions. The ground truth "expert label" row indicates the pathologies that had been manually found within the tile image. The "predicted" row shows corresponding model confidences for the cored (yellow arrow), diffuse (red), and CAA (blue) classes (from left to right). Model predictions range from 0.00 to 1.00, where a higher score indicates higher predicted confidence by the CNN for that plaque class (e.g., the 1.00 corresponds to 100% model confidence that a cored plaque is present in the leftmost panel). c. Examples of CNN predictions that do not agree with the expert manual annotation. Incorrect model predictions are indicated by light orange backgrounds in the "predicted" column; green backgrounds correspond to correct predictions. Scale bar = 25 μm for all images.

Figure caption: Model interpretability studies using machine-learning introspection techniques. a. A cored plaque example (top row, yellow arrow). For the task of cored-plaque prediction, the activation map (by guided Grad-CAM; left, second row) and the feature occlusion map (right, second row) identify the amyloid core (yellow arrow) as the defining morphological feature. By contrast, the diffuse stained region (red arrow) only arises as a salient feature during the diffuse-plaque and CAA prediction tasks (third and fourth rows, respectively). b. Diffuse plaque example, where activation and feature occlusion maps focus on ill-defined amorphous amyloid contours for the diffuse-plaque classification task (third row). c. CAA example, where the CAA task's activation and feature occlusion maps (fourth row) highlight amyloid ring pixels within the media of the cortical vessel (blue arrow), while for the cored and diffuse tasks the small punctate IHC staining is considered salient (red arrow; second and third rows). d. Example containing both diffuse (red arrow) and cored (yellow arrow) plaques in the same tile, illustrating the difference between activation and feature occlusion maps. Scale bar = 25 μm.