Neuropathologist-level integrated classification of adult-type diffuse gliomas using deep learning from whole-slide pathological images

Current diagnosis of glioma types requires combining both histological features and molecular characteristics, which is an expensive and time-consuming procedure. Determining the tumor types directly from whole-slide images (WSIs) is of great value for glioma diagnosis. This study presents an integrated diagnosis model for automatic classification of diffuse gliomas from annotation-free standard WSIs. Our model is developed on a training cohort (n = 1362) and a validation cohort (n = 340), and tested on an internal testing cohort (n = 289) and two external cohorts (n = 305 and 328, respectively). The model can learn imaging features containing both pathological morphology and underlying biological clues to achieve the integrated diagnosis. Our model achieves high performance with area under receiver operator curve all above 0.90 in classifying major tumor types, in identifying tumor grades within type, and especially in distinguishing tumor genotypes with shared histological features. This integrated diagnosis model has the potential to be used in clinical scenarios for automated and unbiased classification of adult-type diffuse gliomas.

The validation cohort including 340 cases (male vs female: 195 vs 145, IDH mutation vs non-mutation: 121 vs 219, mean age: 50.81 years, range: 18 to 78 years) was used for hyperparameter optimization and for internally validating the performance of the finally selected optimal model.
(3) Internal test cohort (n = 289) This internal cohort including 289 cases (male vs female: 172 vs 117, IDH mutation vs non-mutation: 100 vs 189, mean age: 50.25 years, range: 18 to 77 years) was separately collected from 2020 and was used for further testing the model performance.
(4) External test cohort 1 (n = 305) The external testing cohort 1 recruited from HPPH included 305 cases (male vs female: 171 vs 134, IDH mutation vs non-mutation: 75 vs 230, mean age: 52.46 years, range: 20 to 77 years).This cohort was used to externally validate the performance of the classification model.
(5) External test cohort 2 (n = 328) The external testing cohort 2 recruited from XHCMU included 328 cases (male vs female: 186 vs 142, IDH mutation vs non-mutation: 136 vs 192, mean age: 50.83 years, range: 18 to 82 years).This cohort was also used to externally validate the performance of the classification model For the study cohorts, clinical parameters including gender, age, WHO grade, and subtype were collected.Sanger sequencing or FISH was used for detection of IDH mutation, TERT promoter mutations, 1p/19 co-deletion, CDKN2A/B homozygous deletion, EGFR amplification, and chromosome 7 gain/chromosome 10 loss, as described in Supplementary A2-A3.Based on both histological features and molecular markers, the classification of adult-type diffuse gliomas according to the 2021 WHO rule was made, as described in Supplementary A4.

Mutation analysis of IDH1/IDH2
Mutational hotspots of IDH1/IDH2 were evaluated by direct sequencing.Tissues from representative tumor area (the proportion of tumor cells＞20%) were scrapped off from dewaxed sections and treated with PCR reaction solution A 10μl (reaction mixture containing 1μl of cell lysate, 0.3mM of each dNTP, 2.5mM MgCl2, 0.3μM of each primer and 0.2U of KAPA HiFi HotStart DNA Polymerase (Kapa Biosystems Inc., Wilmington, USA)), Shrimp Alkaline Phosphatase (SAP) enzyme (NEB, Ipswich, MA, USA) 2μl and BigDye (BigDye™ Terminator v3.1 Cycle Sequencing Kit, Thermo Fisher Scientific, Waltham, MA, USA) 1μl for centrifugation at 1600g rpm for 10 sec.The crude cell lysate was centrifuged and supernatant was used for subsequent PCR analysis.The forward primer primers (IDH1-F:5'-CGGTCTTCAGAGAAGCCATT-3',IDH1-R:5'-CACATTATTGCCAACATGAC -3',IDH2-F:5'-AGCCCATCATCTGCAAAAAC-3',IDH2-R:5'-CTAGGCGAGGAGCTCCA GT-3') were used to amplify the region of mutational hotspots of IDH1/IDH2.①PCR was performed was initiated at 95°C for 5 min, followed by 40 cycles of 95°C for 20 sec, 57°C for 30 sec and 72°C for 1min, and a final extension of 72°C for 5 min and 10°C for 10 min.② 5μl PCR products were then mixed with 2μl SAP enzyme and reacted at 37°C for 40min and then at 80° C for 15min.③Then 18μl PCR reaction solution C(CWBIO, Beijing, Chima), 1μl products from ② step, and 1μl BigDye were mixed and reacted at 96°C for 1 min, followed by 30 cycles of 96°C for 10 sec, 50° C for 5 sec and 60° C for 2 min, and a final extension of 25°C for 1 min and 10°C for 10 min.Then 50μl natrium asceticism-ethanol mixture (3M NaAc: ethanol=1:15) were added and the mixture was centrifuged for 30min (12000 rpm, 4°C), with the supernatant being discarded.Then 70μl 75% ethanol were added and the mixture was centrifugated for 15min (12000 rpm, 4°C), with the supernatant being discarded.After complete volatilization of the ethanol at room temperature, 12μl Hi-Di™ Formamide (Thermo Fisher Scientific, Waltham, MA, USA) were added into the precipitate to dissolve the DNA.The dissolved products were sequenced on Applied Biosystems™ 3500DxGenetic Analyzer (Thermo Fisher Scientific, Waltham, MA, USA), and analyzed by Chromas software (Technelysium, South Brisbane, Australia).The sequencing results were compared with wild-type sequences of IDH1/IDH2 for analysis.

Mutation Analysis of TERT promoter
Tissue samples were prepared according to the "Mutation Analysis of IDH1/IDH2 protocol" protocol previous described.The crude cell lysate was centrifuged, and supernatant was used for subsequent PCR analysis.The forward primer TERT-F (5'-GTCCTGCCCCTTCACCTT-3') and reverse primer TERT-R (5'-CAGCGCTGCCTGAAACTC-3') were used to amplify a 163bp fragment spanning the two mutational hotspots [chr5, 1,295,228 (C228T) and 1,295,250 (C250T)] in TERT promoter region.①PCR was performed was initiated at 95°C for 5 min, followed by 40 cycles of 95°C for 20 sec, 57°C for 30 sec and 72°C for 1min, and a final extension of 72°C for 5 min and 10°C for 10 min.②5μl PCR products were then mixed with SAP enzyme and reacted at 37°C for 40 min and then at 80° C for 15 min.③Then 18μl PCR reaction solution C, 1μl products from ② step, and 1μl BigDye were mixed and reacted at 96°C for 1 min, followed by 30 cycles of 96°C for 10 sec, 50° C for 5 sec and 60° C for 2min, and a final extension of 25°C for 1 min and 10°C for 10 min.The following steps were performed according to the "Mutation Analysis of IDH1/IDH2 protocol" protocol previous described.The sequencing results were compared with wild-type sequences of TERT for analysis.

A5: The patch clustering algorithm
We aimed to select a subset of patches within a WSI to reduce the computational burden.
Specifically, we extracted many patches from a WSI and then partitioned these patches into several clusters according to their "phenotypes", or important cancerous features that were  The network inputs were small patches extracted from the whole slide image.The patch-level labels were used in all CNNs.For the CNN feature extractor, namely the all-patch classifier, all patches in the training cohort were used to train the network.There were nine cluster-based CNN classifiers, corresponding to the nine patch clusters.Each of them was trained using one patch cluster in the training cohort.The patch-level CNN was trained using the patches from the three selected clusters in the training cohort.All CNNs were trained from scratch while patches in the validation set were used for hyperparameter optimization and for optimal model selection.In the cross-validation process, the model was trained for a minimum of 50 epochs.Then, the loss on the validation set was computed in each epoch, and where i was the index of patient; p(x i ) was the true probability value of patient i, i.e., the label of input data; q(x i ) was the predicted probability value of patient i, i.e., the output of the network.The network was implemented using the PyTorch (version 1.4.0)library.The same training parameters were used across each fold in the cross-validation procedure.Finally, the patient-level classifier with the highest mean AUC over the six types (A2, A3, A4, O2, O3 and GBM) on the validation cohort was selected as the optimal model and its performance was further tested on the testing cohorts.For classifying the six types of A2, A3, A4, O2, O3, and GBM (task 1), most P values for the three MIL models were more than 0.05, indicating that the difference in AUCs between the clustering-based model and the MIL models was not significant on most datasets.However, in most tests the AUCs of our proposed model were numerically higher than that of the MIL models.Meanwhile, we found many P values less than 0.05 for the all-patch model, indicating that the AUCs between the proposed model and the all-patch model were significantly different in many tests.For classifying the three types of A, O, and GBM (task 2), P values in almost half of tests were less than 0.05, where P values were less than 0.05 for the all-patch model in all tests.

:
Detection of chromosome 1p/19q, CDKN2A, EGFR, and chromosome 7/10 status by Fluorescence in Situ Hybridization (FISH)Chromosome 1p/19q, CDKN2A (9p21) gene, EGFR amplification, and chromosome 7 gain/chromosome 10 loss status was examined by fluorescence in situ hybridization.4μm thick FFPE sections were baked at 65°C for 2-3h and deparaffinized in xylene for 10 minutes for 2 times.The sections were hydrated by 100％ethanol for 2 min, 85％ethanol for 2 min and 70％ethanol for 2 min orderly, and then immerged in deionized water for 3 min.The sections were processed with citrate repair solutionin (pH6.0) for 4 min in high pressure condition, and then rinsed in 2×SSC solution for 5 min for 2 times.The sections were immerged in protease K fluid(200µg/ml)and incubated for 2 min at 37°C, and then rinsed in 2×SSC solution for 5 min for 2 times.10μl probes (GP Medical Technologies, Beijing, China) mixture was added to thehybridization zone of the section, and the denaturation and hybridization process was carried out on the ThermoBrite®hybridization instrument (Leica Biosystems, Nussloch, Germany), with denaturation temperature at 83°C for 5 min and hybridization temperature at 42°C for 16h.Sections were immerged in 0.4×SSC plus 0.3% NP-40 cleaning solution (65±1°C) and vibrated for 3 sec.Sections were then retrieved 2 min later and put into 0.1% NP-40 plus 2×SSC cleaning solution at room temperature, vibrated for 3 sec and cleansed for 1 min.Then the sections were immerged in 70% ethanol for 3 min and dried avoiding light at room temperature.15μl DAPI redyeing agent was added into the hybridization zone of the section, and the section was placed avoiding light for 10 min.At last, the section was placed under the BX51TRF fluorescence microscope (Olympus, Tokyo, Japan) for analysis by expert pathologist (Dr.Wei-wei Wang).Hybridizing signals in at least 100 non-overlapping nuclei were counted.Chromosome 1p/19q status: The loci interrogated were 1p36.3 (RP11-62M23 labeled red)/1q25.3-q31.1 (RP11-162L13 labeled green) and 19q13.3(CTD-2571L23 labeled red)/19p12 (RP11-420K14 labeled green).A sample was considered 1p or 19q deleted according to the ratio of number of red signal to green signal.In 1p36 or 19q13, positive loss of heterozygosity (LOH) was determined when the ratio of number of red signal to green signal was less than 0.7.CDKN2A (9p21) gene status: CDKN2A (9p21) is labeled red and centromere of Chromosome 9 (CSP9) is labeled green.A sample was considered CDKN2A homozygous deficiency according to red signal lost in more than 25 nuclei, at least 100 non-overlapping nuclei were counted.The probes used for FISH were from LBP Medicine Science & Technology Co., Ltd.(Guangzhou, China).EGFR amplification: EGFR gene is labeled red, and centromere of Chromosome 7 (CSP7) is labeled green.A sample was considered EGFR amplification when the ratio of number of red signal to green signal was greater than 2.Chromosome 7 gain/chromosome 10 loss status: chromosome 7 and chromosome 10 are both labeled green.A sample was considered chromosome 7 gain and chromosome 10 loss when the number of chromosome 7 signal was greater than 2 and the number of chromosomeA4: 2021 WHO Classification of adult-type diffuse gliomas in our datasetsAccording to 2021 WHO CNS5, the classification of adult-type diffuse gliomas in our study was showed as following: IDH1 and IDH2 mutation were detected by Sanger sequencing firstly.For the cases having IDH1 or IDH2 mutation, ATRX expression were determined by immunohistochemistry (IHC)(1:1, ZA-0016, ZSGB-BIO, China), when the results of ATRX IHC was negative, the cases in the presence of typical necrosis and/or microvascular proliferation (MVP) were classified as astrocytoma, IDH-mutant, Grade 4 (A4);For the cases without necrosis or MVP, CDKN2A gene status was then detected by FISH, and the cases were also classified as astrocytoma, IDH-mutant, Grade 4 (A4) if CDKN2A gene was homozygous deficiency, otherwise the cases were classified as astrocytoma, IDH-mutant, Grade 2 (A2) which having nil or low mitosis (< 2 per case) or were classified as astrocytoma, IDH-mutant, Grade 3 (A3) which having high mitosis (>= 2 per case).When the results of ATRX IHC was positive, chromosome 1p/19q status were detected by FISH, if 1p/19q were co-deletion, the cases were classified as Oligodendrolioma, IDH-mutant, Grade 2/3, which were further classified as Oligodendrolioma, IDH-mutant, Grade 2 (O2) having nil or low mitosis (< 2 per case), or classified as Oligodendrolioma, IDH-mutant, Grade 3 (O3) having high mitosis (>= 2 per case); If 1p/19q were none co-deletion, the cases were classified as astrocytoma, IDH-mutant and the classification was done through the workflow above.Glioblastoma, IDH-wild type (GBM) has been diagnosed in the setting of IDH-wild type diffuse astrocytic glioma in adults if there existed typical necrosis and/or MVP, or TERT promoter mutations, or EGFR amplification, or Chromosome 7 gain/chromosome 10 loss (+ 7/− 10).
useful in subtype/grade discrimination.Considering the less representative power of features from original images, we extracted deep features using convolutional neural networks (CNNs) for patch clustering.Here a ResNet-50 network pretrained with patch-level labels on all patches in the training cohort was used as the deep feature extractor.Having this pretrained ResNet-50, a patch can be fed into it and 2048 features can be extracted from its average pooling layer.As this ResNet-50 was trained for patch-level tumor classification into six categories (A2, A3, A4, O2, O3, and GBM), the features extracted using this network were used as cancerous phenotypes relevant to tumor subtypes.Next, we randomly selected 100 patients in the training cohort, including 11 A2, 2 A3, 2 A4, 14 O2 3 O3, and 68 GBM patients, for training a deep feature-based K-means algorithm for patch clustering.From these 100 patients, 43653 patches were extracted and were fed into the pretrained ResNet-50 for deep feature extraction.From each of the 42216 patches, 2046 deep features were extracted.Using these 43653×2046=89314038 features, the candidate 42653 patches were used to develop a K-means clustering algorithm by partitioning these patches into K clusters.To improve the clustering stability, K-means++ approach with multiple initializations was used.The K-means clustering was performed by using a KMeans function in the scikit-learn python library, with the function parameters init = 'k-means++' and n_init = 10.Here the optimal cluster number K=9 was determined when the silhouette coefficient reached its maximum value, as shown in Figure3.We also calculated another metrics named Calinski-Harabasz index to assess the clustering quality.At K=9 the Calinski-Harabasz index also reached its highest point, providing additional support for this optimal cluster number.The patches belonged to different clusters were considered to have different discriminative powers as they may have distinguished imaging patterns relevant to tumor types/grades.A6: Deep learning network architecture and training schemeThere were three kinds of CNNs: the CNN used as deep feature extractor in patch clustering step, the CNNs used for cluster-based tumor classification in patch selecting step, and the CNN used for patch-level tumor classification.All the CNNs employed the same ResNet-50 architecture, as shown in the following Supplementary the model with the lowest average validation loss over 10 consecutive epochs was saved.If such a model was not found, the training continued up to a maximum of 150 epochs and the last model was saved.During training, an SGD optimizer was used with a learning rate of 0.001 and a batch size of 64.To accommodate the class imbalance problem caused by the different incidence rate across the glioma types, we used random rotation and shear approaches to augment A3, A4, and O3 classes in the training dataset (after training/validation dataset division).The sample number in each of the three classes have tripled after augmentation.The output of the network was the probability values of each glioma subtype.The loss function for this CNN training was the cross-entropy loss function as

Table 2
A summary of the comparison of model performance in terms of AUC.The top, second, third, and bottom rows for each type/grade indicate the performance on the internal validation cohort, internal testing cohort, external testing cohort 1 and 2, respectively.Five models were compared, including the proposed clustering-based model, the

Table 3
A summary of the P values obtained from Delong tests by comparing the AUCs between our proposed clustering-based model and each of the listed models.P value less than 0.05 indicated statistical significance.The top, second, third, and bottom rows for each type/grade indicate the P value calculated on the internal validation cohort, internal testing cohort, external testing cohort 1 and cohort 2, respectively.The AUC of the proposed clustering-based model was statistically compared with that of the listed four models by using the Delong test.The four listed models include the classical MIL model, the attention-based MIL (AMIL) model, the clustering-constraint-attention MIL (CLAM) model, and the all-patch model.Task 1: classifying the six categories of A2, A3, A4, O2, O3, and GBM.Task 2: classifying the three types of A, O, and GBM.