A deep learning model and human-machine fusion for prediction of EBV-associated gastric cancer from histopathology

Epstein–Barr virus-associated gastric cancer (EBVaGC) shows a robust response to immune checkpoint inhibitors. Therefore, a cost-efficient and accessible tool is needed for discriminating EBV status in patients with gastric cancer. Here we introduce a deep convolutional neural network called EBVNet and its fusion with pathologists for predicting EBVaGC from histopathology. The EBVNet yields an averaged area under the receiver operating characteristic curve (AUROC) of 0.969 from the internal cross-validation, an AUROC of 0.941 on an external dataset from multiple institutes and an AUROC of 0.895 on The Cancer Genome Atlas dataset. The human-machine fusion significantly improves the diagnostic performance of both the EBVNet and the pathologist. This finding suggests that our EBVNet could provide an innovative approach for the identification of EBVaGC and may help effectively select patients with gastric cancer for immunotherapy.

Gastric cancer (GC) is the fifth most common cancer globally and the fourth leading cause of cancer deaths worldwide 1. In 2020, there were over one million new cases of GC, with the highest incidence rates in Eastern Asia 1. According to The Cancer Genome Atlas (TCGA) Research Network, GCs are classified into four molecular subtypes: Epstein-Barr virus (EBV)-positive tumors, microsatellite instable (MSI) tumors, genomically stable tumors, and chromosomally instable tumors 2. EBV-positive GC, also known as EBV-associated GC (EBVaGC), comprises ~9% of all GC cases and is a distinct subset of gastric cancer 2 that may respond remarkably well to immune checkpoint inhibitors [3][4][5] and have a favorable prognosis 6,7.
EBV testing is routinely recommended for GC patients in order to identify this small group of responders to immunotherapy 8. The most common method for evaluating EBV status in tumor tissues is in situ hybridization (ISH) targeting EBV-encoded small RNAs (EBERs) in histopathologic samples 9. However, EBV testing by ISH is time-consuming and costly, and there is currently no alternative to universal EBV testing. Therefore, a more cost-efficient and accessible tool is needed for confirmatory EBV testing to assist in patient selection, thereby reducing unnecessary costs for patients with EBV-negative GC (EBVnGC).
Deep learning has been successfully used to identify cancer subtypes and molecular features on hematoxylin and eosin (H&E)-stained histopathological slides, and as such has the potential to serve as a promising cancer biomarker 10,11. Several studies have demonstrated that deep learning models can accurately predict the MSI status of colorectal cancer from H&E-stained digital whole slide images (WSIs), with an area under the receiver operating characteristic curve (AUROC) of 0.77-0.96 [12][13][14]. Moreover, deep learning models can predict the molecular subtype of muscle-invasive bladder cancer from H&E-stained slides 15 and the hormonal receptor status of breast cancer from histopathological images 16. Herein, we hypothesize that a deep learning model may facilitate EBVaGC prediction and refine the selection for confirmatory EBV testing.
An image-based deep learning model has the potential to improve visual diagnostic accuracy. In patients with EBVaGC, H&E-stained slides possess some morphological features that could be recognized by pathologists, including poorly differentiated adenocarcinoma and massive lymphocyte infiltration 17,18 . Pathologists triage patients for the confirmative EBV testing on the basis of these features. Besides these recognizable features, a deep learning model might extract more characteristics of EBVaGC that pathologists have not been aware of, consequently predicting EBV status more accurately.
Here we introduce an innovative deep learning model called EBVNet to predict EBV status among patients with GC using H&E-stained slides. More importantly, we further develop a simple yet effective and novel human-machine fusion strategy for the clinical and practical use of the deep learning model.

Results
Patient cohorts. Three cohorts were included in this study (Supplementary Fig. 1). The Internal-STAD was used as an internal dataset to develop the EBVNet, enrolling 203 H&E-stained WSIs from 145 patients with EBVaGC and 803 WSIs from 582 patients with EBVnGC in a single medical center. MultiCenter-STAD and TCGA-STAD served as two independent external validation datasets. MultiCenter-STAD comprised 417 WSIs from 417 patients, including 98 patients with EBVaGC and 319 patients with EBVnGC. TCGA-STAD contained 234 H&E-stained WSIs from 218 patients with EBVnGC and 24 WSIs from 21 patients with EBVaGC. The details of the three datasets are summarized in Supplementary Table 1.
Diagnostic performances of tumor detector. To fully automate the process of EBV status prediction, a tumor detector was developed based on the internal dataset and used to automatically detect the tumor region of gastric cancer slides on the external datasets. Only the automatically detected tumor regions were used for the prediction of EBV status by the EBVNet. We found that the tumor detector achieved a sensitivity of 0.964 and an AUROC of 0.862 on the MultiCenter-STAD, and a sensitivity of 0.945 and an AUROC of 0.848 on the TCGA-STAD (Supplementary Table 2).
Performance of EBVNet. ResNet50 was utilized as the default backbone for training and validating the EBVNet to predict the EBV status of gastric cancer slides (Supplementary Notes and Supplementary Table 3). The workflow of EBVNet is depicted in Fig. 1. On the Internal-STAD, the AUROC of the testing set on each fold ranged from 0.954 to 0.981 (Supplementary Table 4). Over all testing folds, the EBVNet obtained an AUROC of 0.969, a sensitivity of 0.857, a specificity of 0.903, and a negative predictive value (NPV) of 0.962.
Human-machine fusion. To investigate the application scenario of the deep learning model in clinical practice, we further developed a human-machine fusion strategy to integrate the model into the universal testing paradigm. As shown in Supplementary Table 7, the comparison of different human-machine fusion strategies indicated that our fusion strategy outperformed the others in most cases. Fusing the EBVNet with each pathologist, across varying levels of expertise, further improved the performance of both the EBVNet and the pathologist.
On the MultiCenter-STAD, the fused predictions of the EBVNet with Junior pathologist 1, Junior pathologist 2, Senior pathologist 1, Senior pathologist 2, Expert pathologist 1, and Expert pathologist 2 achieved an AUROC of 0.945 (95% CI: […]).

Fig. 1 The workflow of EBVNet for predicting EBV status with hematoxylin and eosin-stained WSIs. Each WSI was preprocessed and tessellated into non-overlapping tiles at ×10 magnification. After color normalization, tiles were resized to 224 × 224 pixels and then input to the tumor detector. Only tiles from regions recognized as tumor were fed to EBVNet to obtain tile-level probabilities for EBV status. The five well-trained individual classifiers were ensembled at their output layer to form the EBVNet; the average of the five individual classifiers' probability outputs was used as the prediction of the ensembled model. Tile-level probabilities were averaged to generate a slide-level probability of EBV status. EBV Epstein-Barr virus, WSI whole slide image.

Association between the histopathological features and the EBVaGC prediction. To probe the black-box nature of the deep learning model, we further built multivariate logistic regression models to evaluate the association between the histopathological features and the EBVNet's EBVaGC prediction. On the MultiCenter-STAD, the EBVaGC prediction was significantly correlated with medullary histology [odds ratio (OR), 58.73; P < 0.001], mucinous differentiation (OR, 0.30; P = 0.011), signet-ring cell differentiation (OR, 0.42; P = 0.010), and poor differentiation (OR, 5.17; P < 0.001). On the TCGA-STAD, the EBVaGC prediction was significantly associated with medullary histology (OR, 9.20; P = 0.006), papillary differentiation (OR, 0.17; P = 0.003) and vacuolar nucleus or recognizable nucleolus (OR, 3.86; P < 0.001) (Fig. 4 and Table 3). The numbers of morphological features in the different datasets are shown in Supplementary Table 8.
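As a rough illustration of how such odds ratios arise, the sketch below fits a minimal logistic regression by plain gradient descent on a synthetic binary feature (the study itself used standard multivariate logistic regression software; the data, learning rate, and epoch count here are illustrative assumptions):

```python
import math

def fit_logistic(X, y, lr=1.0, epochs=4000):
    """Minimal gradient-descent logistic regression: an intercept plus one
    coefficient per (binary) feature. Returns [intercept, coef_1, ...]."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)                       # w[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))    # predicted P(EBVaGC)
            err = p - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def odds_ratio(coefficient):
    """Odds ratio for a one-unit increase in a feature."""
    return math.exp(coefficient)
```

For a single binary covariate (e.g., "medullary histology present"), the fitted odds ratio matches the classical 2 × 2 table estimate; features enriched among EBVaGC-predicted cases yield OR > 1, depleted features OR < 1.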
Misdiagnosis from EBVNet. The misdiagnoses of EBVNet were further analyzed to better understand this deep learning model. On the MultiCenter-STAD, the EBVNet misdiagnosed 80 out of 417 slides, including 3 slides of EBVaGC and 77 slides of EBVnGC. Among the 3 misdiagnosed EBVaGC slides, 2 were misdiagnosed by all pathologists and the remaining slide was misdiagnosed by one pathologist. Among the 77 misdiagnosed EBVnGC slides, 7 were misdiagnosed by all pathologists, 30 were diagnosed correctly by all pathologists, and the remaining 40 were misdiagnosed by at least one pathologist. To analyze the morphological features of EBVNet's misdiagnosed cases, we compared the features of false-positive cases with those of true-negative cases and the features of false-negative cases with those of true-positive cases. Compared to true-negative cases, these 77 false-positive cases were more likely to occur in female patients (P = 0.017), to show medullary histology (P < 0.001), poor differentiation (P < 0.001), and vacuolar nucleus or recognizable nucleolus (P = 0.005), and to lack mucinous differentiation (P = 0.008), adenoid differentiation (P < 0.001), and papillary differentiation (P = 0.008) (Supplementary Fig. 3 and Supplementary Table 9).
On the TCGA-STAD, the EBVNet misdiagnosed 44 out of 258 slides, including 5 EBVaGC slides and 39 EBVnGC slides. Among the 5 misdiagnosed EBVaGC slides, 3 were misdiagnosed by all pathologists and the other 2 were misdiagnosed by at least 1 pathologist. Among the 39 misdiagnosed EBVnGC slides, 4 were misdiagnosed by all pathologists, 15 were diagnosed correctly by all pathologists, and the other 20 were misdiagnosed by at least 1 pathologist. Compared to true-negative cases, these 39 false-positive cases were more likely to occur in female patients (P = 0.002), to show medullary histology (P < 0.001), poor differentiation (P = 0.002), and vacuolar nucleus or recognizable nucleolus (P < 0.001), and to lack adenoid differentiation (P = 0.036) and papillary differentiation (P < 0.001) (Supplementary Table 10).

Discussion
In this study, we used three diverse datasets to confirm that the innovative deep learning model EBVNet could automatically predict EBV status among gastric cancer H&E-stained WSIs with high performance. Specifically, the EBVNet's diagnostic performance surpassed that of board-certified pathologists and this model could be generalized to heterogeneous clinical scenarios, including a variety of H&E-stained slides, different medical centers, and patient populations. More importantly, our study further indicated that a human-machine fusion could improve the EBVNet's performance in identifying EBVaGC, although a further prospective clinical trial would be needed for the confirmation of its validity. These findings suggest that the EBVNet can serve as an efficient approach to identify EBVaGC as well as a promising biomarker to select GC patients for immunotherapy.
To the best of our knowledge, our study is the first to report pathologists' performance in identifying EBV status from H&E-stained slides. Although EBV testing is routinely recommended for GC patients, many remain EBV-untested due to the high cost and limited accessibility of EBV-ISH. Therefore, only those patients with a high possibility of EBVaGC are selected for EBV testing based on pathologists' pre-assessments. Although the H&E slides of EBVaGC contain some discriminative features 17,19, pathologists, even those with more than ten years of specialized gastrointestinal experience, still had poor to moderate interobserver agreement and unsatisfactory diagnostic performance.
Also, this study is the first investigation comparing the performance of a deep learning model with that of pathologists regarding EBVaGC prediction. The AUROC of EBVNet was significantly better than that of all pathologists. Note that this method of calculating the AUROC has been used for dichotomous classification 12,20, although it might be unfair to pathologists considering that pathologists, in general, cannot give a specific prediction probability.

It is worth noting that, when developing a deep learning model on the Internal-STAD dataset, the data imbalance between EBVaGC (minority) and EBVnGC (majority) would cause the developed AI model to favor the majority class (EBVnGC) during inference in external validation or future application. Besides widely used data augmentation techniques for model training, more cases of EBVaGC may directly help the model learn the positive features of the minority class (EBVaGC) better. Thus, to develop an EBVNet that can better predict EBVaGC, we included all available slides from the patients with EBVaGC in the Internal-STAD. To analyze the impact of the proportion of EBVaGC on model performance, we further tested the diagnostic performance by randomly sampling slides with a ~9% proportion of EBVaGC on the MultiCenter-STAD dataset. We observed that the EBVNet achieved equivalent AUROCs on the subsets of MultiCenter-STAD and TCGA-STAD (with about 9% prevalence of EBVaGC). Taken together, our results suggest that the EBVNet performs very well and is little affected by the proportion of EBVaGC.
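The ~9% prevalence subsampling described above can be sketched as follows (the sampling routine, seed, and rounding are illustrative assumptions; the paper does not specify its sampling implementation):

```python
import random

def subsample_to_prevalence(pos, neg, prevalence=0.09, seed=0):
    """Keep all positive (EBVaGC) slides and randomly subsample negative
    (EBVnGC) slides so that positives make up ~`prevalence` of the returned
    set, mirroring the ~9% prevalence experiment described in the text."""
    rng = random.Random(seed)
    n_neg = round(len(pos) * (1 - prevalence) / prevalence)
    return pos + rng.sample(neg, min(n_neg, len(neg)))
```

For example, 9 EBVaGC slides combined with 91 sampled EBVnGC slides yields a 100-slide subset at 9% prevalence on which the AUROC can be re-evaluated.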
Deep learning models have often been regarded as black boxes 10,23, offering no transparency into how they work. To interpret the EBVNet, we first constructed a logistic regression model to find the features associated with the EBVNet's prediction. In addition to recognizable features, the EBVNet might be able to extract further characteristics of EBVaGC that have yet to be identified in previous histopathological studies 23. It is possible that further studies using larger datasets might reveal other morphological features significantly associated with EBVaGC. By analyzing EBVNet's misdiagnosed cases, we found that the false-positive cases possessed some morphological features of EBVaGC while the false-negative cases had some characteristics of EBVnGC. Most cases that the EBVNet misdiagnosed were also incorrectly predicted by at least one pathologist, indicating that these cases indeed possessed confounding features. Taken together, these findings imply that effective methods should be developed to overcome this issue and improve the diagnostic performance in future investigations.
Given the prediction uncertainties of both the EBVNet model and the pathologists, we developed and tested a simple yet effective and novel human-machine fusion strategy in these settings. To the best of our knowledge, this is the first study to adaptively fuse predictions from a deep learning model and a human expert based on their prediction uncertainties. To report the prediction confidence of pathologists, we applied the 5-scale self-confidence score method, which is less fine-grained but more clinically practical. The diagnostic performance of the human-machine fusion outperformed that of both the EBVNet and pathologists with varying levels of experience and expertise alone, suggesting that any pathologist could combine the EBVNet's prediction with his or her own diagnosis to obtain an overall expert-level diagnosis performance. In terms of clinical application, such a human-in-the-loop diagnosis system could be integrated into the current universal testing paradigm in two ways. The first one is to apply the EBVNet as a screening tool. When the prediction of EBVNet encounters a low confidence score, pathologists can help the model perform the prediction. The second way is to let pathologists do the screening based on the morphological characteristics, and EBVNet can assist the pathologists in making the decision when they are not confident enough. The two ways can be potentially applied in the current universal EBV testing paradigm but need further studies to obtain more evidence for the efficacy of the human-in-the-loop system.
While promising results have been obtained from our EBVNet, several limitations do exist in our study. First, the EBVNet was trained and validated retrospectively, and a rigorous, prospective clinical study is needed to obtain more robust evidence. Second, although the logistic regression model indicated that our EBVNet made biological sense, this method is still an indirect way to interpret the EBVNet. Going forward, more intuitive visualization methods should be attempted in order to interpret the black-box nature of the EBVNet. More importantly, the fusion of the EBVNet and a pathologist should be further evaluated in future studies to confirm the improved diagnostic performance. To further improve the performance of deep learning models for predicting EBV status in gastric cancer slides, besides human-machine fusion, the following aspects may be considered: the combination of the current imaging data with clinical data (tumor manifestation, serum EBV DNA, etc.) or multimodality features (such as radiomics features); the replacement of the network backbone with more recently developed ones (such as the Vision Transformer 24); ensemble models based on different network backbones; and a multi-scale deep learning model combining different magnifications of slides.

Methods
Study participants. This study was approved by the Institutional Review Board of Sun Yat-sen University Cancer Center. To develop EBVNet, we used three pathological image datasets, including the internal dataset from a single medical center (Internal-STAD), the external dataset from multiple medical centers (MultiCenter-STAD), and the well-known public dataset from The Cancer Genome Atlas (TCGA-STAD), to achieve broad patient representation and improve the generalizability of our findings.
Internal-STAD served as a training dataset comprising all available slides from the patients with EBVaGC and randomly chosen slides from the pool of all patients with EBVnGC in one medical center. For MultiCenter-STAD, GC patients with available EBV status were randomly included in this study. For TCGA-STAD, patients with known EBV status were obtained from the TCGA database. The inclusion criteria for this study were as follows: (1) patients with GC underwent primary gastrectomy at the […].

Determination of EBV status. The ground-truth EBV status for the Internal-STAD and MultiCenter-STAD datasets was determined using ISH targeting EBERs in histopathologic samples at the respective institutions (Supplementary Fig. 4), since EBERs are consistently expressed in all latent EBV infection types 26,27. The EBV status for the TCGA-STAD was defined by a previously published study through genetic sequencing 2. Similar sensitivity and reliability between EBV DNA detection and EBER-ISH were observed in a previous study, suggesting that EBER-ISH is interchangeable with genetic sequencing for identifying EBV status 28.
Tumor detector. To fully automate the analysis of gastric cancer WSIs, a tumor detector was developed mainly based on the internal dataset and then used to automatically find the tumor regions in each slide from the two external validation datasets. The detected tumor regions were then further analyzed by the EBVNet. The well-known convolutional neural network ResNet50 was used as the classifier backbone for the tumor detector. For tumor detection, 1006 GC slides of Internal-STAD manually outlined by pathologists were set as tumor tissues. In GC cases of the diffuse type, it is very difficult to define clear boundaries between adjacent normal mucosa and tumor. Thus, we randomly selected 145 additional gastric tissues free of tumors and set them as normal tissues. In this study, the size of each tile was 512 × 512 pixels at ×10 magnification. During training, the dataset was split into five folds, and the five-fold cross-validation strategy was adopted to train five individual tumor classifiers, each time using four folds to train the classifier and the remaining fold as an internal validation set to determine when to stop training (i.e., when the performance of the classifier no longer improved on the internal validation set). A stochastic gradient descent (SGD) optimizer with batch size 64 and weight decay 0.0005 was used to train each classifier for a maximum of 50 epochs. The learning rate started at 0.001 and followed a cosine annealing schedule. After the five individual tumor classifiers were trained, for any input tile from a new slide on the external datasets, the probability outputs of the five classifiers were averaged as the final output of the ensemble tumor detector. The input tile was classified as 'tumor' when the average probability was larger than 0.5.
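The ensemble averaging and 0.5 cutoff used by the tumor detector amount to the following minimal sketch (the probability values in the usage example are made up for illustration):

```python
def ensemble_tile_prediction(classifier_probs, threshold=0.5):
    """Average the tumor probabilities produced by the individual
    classifiers (five in the paper) for one tile, then apply the 0.5
    cutoff described in the text. Returns (average, label)."""
    avg = sum(classifier_probs) / len(classifier_probs)
    return avg, ("tumor" if avg > threshold else "normal")
```

For example, five classifier outputs [0.9, 0.8, 0.7, 0.6, 0.55] average to 0.71, so the tile is labeled 'tumor' and passed on to EBVNet.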
EBVNet development. To predict whether a patient belongs to the EBV subgroup or not, an ensemble binary classifier called EBVNet with any network backbone (e.g., ResNet50, VGGNet, EfficientNet) can be trained on the internal dataset Internal-STAD. During training, the dataset was divided into five folds at the slide level, and the five-fold cross-validation strategy was employed to first train five individual classifiers. In particular, for each individual classifier, four folds of data were used to train the classifier and the remaining one was used as an internal validation set to determine when to stop training. For the training of each classifier, each slide was regularly divided into multiple tiles (i.e., image patches) of 512 × 512 pixels, and only the tiles from the tumor regions and their labels (1 for 'EBV', 0 for 'non-EBV') were used as the inputs and the expected outputs of the classifier, respectively. A tile in a slide was considered to be from the tumor region when at least 50% of its pixels were within the pre-segmented tumor region in the slide. About 52,600 EBV tiles and 178,500 non-EBV tiles were obtained for the training of each individual classifier, and about 15,600 EBV tiles and 63,400 non-EBV tiles for internal validation. For each individual classifier, an SGD optimizer with batch size 64 and weight decay 0.0005 was used to train the model for a maximum of 150 epochs. The learning rate started at 0.001 and followed a cosine annealing schedule. Training was stopped when the classifier performance on the internal validation set did not improve over 5 consecutive epochs, or at the last (maximum) epoch. It was consistently observed that classifier training converged after roughly 100 epochs. This training process was repeated five times to generate five individual classifiers, each time with a different fold as the internal validation set.
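The cosine annealing schedule mentioned above can be written out explicitly; a minimal sketch, assuming the learning rate decays from 0.001 to a minimum of 0 over the maximum number of epochs (the paper only states the starting value and the schedule type):

```python
import math

def cosine_annealed_lr(epoch, max_epochs, lr_start=0.001, lr_min=0.0):
    """Cosine annealing: lr_start at epoch 0, decaying smoothly to
    lr_min at max_epochs (lr_min = 0 is an assumption here)."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * epoch / max_epochs))
    return lr_min + (lr_start - lr_min) * cos_term
```

For the 150-epoch EBVNet classifiers this gives 0.001 at epoch 0, 0.0005 at epoch 75, and ~0 at epoch 150; PyTorch's `CosineAnnealingLR` implements the same curve.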
The five well-trained individual classifiers were ensembled at the output layer to form the EBVNet, i.e., the average probability outputs of the five individual classifiers were used as the prediction of the ensembled model EBVNet. Since the ensemble EBVNet predicts EBV status for each tile rather than for each slide, to predict the EBV status of any slide, the EBVNet predictions over all tiles from the tumor regions in the slide were averaged as the final EBV prediction probability for the slide. The diagnostic performances of different model backbones (including VGGNet16, ResNet18, ResNet50, SE_ResNet50, DenseNet121, EfficientNet-B0 and EfficientNet-B1) on Internal-STAD were compared.

Fig. 4 Successful cases predicted by EBVNet. a-c Histological images (left column) of patients with EBVaGC in a-c were from Internal-STAD, MultiCenter-STAD, and TCGA-STAD, respectively. The heatmaps overlaid on these three WSIs (middle column) show that tumor tiles were mainly predicted as EBVaGC with a high score (reddish color). Tiles with a high score were mainly localized in areas of medullary histology, poor differentiation, and tumor with vacuolar nucleus or recognizable nucleolus (right column, tiles at ×10 magnification). d-f Histological images (left column) of patients with EBVnGC in d-f were from Internal-STAD, MultiCenter-STAD, and TCGA-STAD, respectively. The heatmaps overlaid on these three WSIs (middle column) show that tumor tiles were mainly predicted as EBVnGC with a low EBV score (bluish color). All results could be reproduced stably by EBVNet. Tiles with a low score were more likely localized in areas of adenoid differentiation, mucinous differentiation, and signet-ring cell differentiation (right column, tiles at ×10 magnification). EBV Epstein-Barr virus, EBVaGC Epstein-Barr virus-associated gastric cancer, EBVnGC Epstein-Barr virus-negative gastric cancer.
EBVNet evaluation. The EBVNet was internally evaluated on the Internal-STAD dataset and externally assessed on two external datasets, the MultiCenter-STAD and TCGA-STAD. For the external testing, the developed ensemble EBVNet classifier was used to predict the EBV probability at the slide level, and the predictions were compared with the ground-truth EBV status for each external dataset. For the internal evaluation, to faithfully simulate the external evaluation scenario, the five-fold cross-validation strategy was applied as follows. First, one fold of the internal dataset was held out as the simulated external test set, and the other four folds were further divided into five new subsets to train an ensemble AI model for the evaluation. Then, the ensemble model was evaluated on the held-out fold at the slide level. This process was repeated five times, each time with a different fold as the simulated test set. In this way, every slide in the internal dataset was used once for evaluation, and the EBV status predictions of all slides were finally compared with the corresponding ground-truth EBV status.
Only tiles from tumor regions were used to train and evaluate the model for both internal and external evaluations.
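Putting the two averaging steps together, the tile-to-slide aggregation used for these evaluations can be sketched as follows (the two-classifier, two-tile usage example is purely illustrative; the actual model ensembles five classifiers over all tumor tiles of a slide):

```python
def slide_probability(tile_classifier_probs):
    """Slide-level EBV probability: for each tumor tile, average the
    individual classifiers' outputs (the ensemble prediction for that
    tile), then average these tile-level probabilities over all tumor
    tiles in the slide."""
    tile_probs = [sum(p) / len(p) for p in tile_classifier_probs]
    return sum(tile_probs) / len(tile_probs)
```

For instance, two tiles with ensemble outputs [0.8, 0.6] and [0.4, 0.2] give tile-level probabilities 0.7 and 0.3, and hence a slide-level probability of 0.5.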
Pathologist reader study. To investigate whether the EBV status could be identified by pathologists from H&E-stained slides alone, six pathologists with different years of experience (two junior pathologists with less than 5 years of experience, two senior pathologists with about 10 years of experience, and two experts specialized in gastric cancer with up to 15 years of experience) were presented with slides from the MultiCenter-STAD and TCGA-STAD. Blinded to all clinical information and to the performance of EBVNet, the pathologists reviewed these slides and classified each case as EBVaGC or EBVnGC based on their expertise and experience.
To gain further insight into the associations between specific histopathological features and the EBVNet's predictions (i.e., EBVaGC or EBVnGC), a logistic regression model was built to assess the relationship between the histopathological features and the EBVNet's prediction. Based on previous studies 29, EBVaGC is associated with some morphological features, including poor differentiation and tertiary lymphoid structures. All these reported morphological features are positive features of EBVaGC; no negative features have been reported. Based on expert pathologists' experience with gastric cancer, EBVaGC is also linked to other positive morphological features such as medullary histology and vacuolar nucleus (most chromatin distributed at the nuclear periphery rather than the center) or recognizable nucleolus, and to negative features such as mucinous differentiation, adenoid differentiation (tumor cells arranged in glandular patterns), papillary differentiation, and signet-ring cell differentiation. Therefore, these eight histopathological features were all included in this study. Two expert pathologists worked together to determine whether each of the features was present in individual cases.
To further understand EBVNet, we assessed the association between morphological features and the misdiagnoses of EBVNet. The morphological features of false-negative cases were compared with those of true-positive cases, while the features of false-positive cases were compared with those of true-negative cases.
Human-machine fusion. EBVNet can not only be used to predict EBV status of patients in a standalone manner, but also be combined with pathologist predictions in a human-machine fusion manner. In this study, we proposed a simple yet novel adaptive fusion strategy to combine the predictions of the deep learning model EBVNet and pathologists with varying degrees of experience, mainly based on their prediction uncertainties. The details of the human-machine fusion are described below, with the overall fusion strategy introduced first, followed by the design of prediction uncertainty for both EBVNet and pathologists. It is worth noting that the human-machine fusion is predefined and not involved in the training of EBVNet.
Suppose an EBVNet has been well trained and prepared to collaborate with a pathologist to predict the EBV status of a patient. Based on the slide data of the patient, denote by P_m and P_h the two-dimensional output probability vectors from the EBVNet and the pathologist, respectively, and by u_m and u_h the prediction uncertainties from the EBVNet and the pathologist, respectively. Then, 1 − u_m and 1 − u_h can represent the prediction certainties (or confidence) of the EBVNet and the pathologist, respectively. Based on the EBV predictions and prediction certainties from both the EBVNet and the pathologist, the output of the human-machine fusion, i.e., the fused prediction P_f from the EBVNet and the pathologist, can be defined as

P_f = α P_m + (1 − α) P_h, with α = (1 − u_m) / [(1 − u_m) + (1 − u_h)].   (1)

Here α represents the relative importance of the prediction from the EBVNet for the final fused prediction P_f, and similarly 1 − α represents the relative importance of the prediction from the pathologist. Intuitively, when the EBVNet is more certain than the pathologist about the EBV prediction, the final prediction P_f will depend more on the model prediction P_m, and vice versa. From Eq. (1), the main challenge is to obtain the pathologist's prediction probability P_h and the two prediction uncertainties u_m and u_h.
For the uncertainty u_m of the EBVNet prediction, a deep learning model may be uncertain under two conditions: either when the new slide data are very different from all those used for model training (called knowledge uncertainty), or when the new slide data are similar to one or more slides from both the EBV and non-EBV classes used for model training (called data uncertainty). Fortunately, when the deep learning model is an ensemble of multiple individual models (as EBVNet is), both types of prediction uncertainty can be captured by the entropy of the ensemble model's probability output P_m 21,30. Therefore, the uncertainty u_m for the EBVNet prediction can be estimated by

u_m = −(p_m,1 log p_m,1 + p_m,2 log p_m,2),   (2)

where p_m,1 and p_m,2 are respectively the first and second components of the probability output P_m of the EBVNet. In order to obtain the pathologist's prediction probability P_h and the associated uncertainty u_h, we collected not only his or her diagnosis result ('EBV' or 'non-EBV'), but also his or her 5-scale self-confidence in the diagnosis, with '1' to '5' respectively representing 'surely non-EBV', 'likely non-EBV', 'unsure', 'likely EBV', and 'surely EBV'. The five self-confidence scales [1, 2, 3, 4, 5] were linearly transformed to the corresponding probabilities [0.2, 0.35, 0.5, 0.65, 0.8], where the smallest and largest probabilities were set to 0.2 and 0.8 (rather than 0 and 1) to account for potential over-confidence or noise in the reported self-confidence. With this transformed probability, a two-dimensional probability output P_h can easily be obtained from the pathologist's diagnostic result and his or her self-confidence report. For example, for the diagnosis result 'EBV' and self-confidence '5', P_h would be [0.8, 0.2], where the first component represents the probability of being 'EBV'; and for the diagnosis result 'non-EBV' and self-confidence '2', P_h would be [0.35, 0.65]. Once P_h is obtained, the associated prediction uncertainty u_h can be estimated by the entropy of P_h, similarly based on Eq. (2). Finally, after obtaining the pathologist's prediction P_h and prediction uncertainty u_h, together with the EBVNet prediction P_m and prediction uncertainty u_m, the human-machine fusion output can be obtained based on the designed fusion strategy (Eq. 1).
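A compact sketch of the full fusion pipeline, combining the entropy-based uncertainty of Eq. (2) with the 5-scale confidence mapping described above; the exact normalization used for the fusion weight α is our reading of the strategy and should be treated as an assumption:

```python
import math

# 5-scale self-confidence -> P(EBV), as described in the text:
# 1 'surely non-EBV' ... 5 'surely EBV'
CONF_TO_P_EBV = {1: 0.2, 2: 0.35, 3: 0.5, 4: 0.65, 5: 0.8}

def entropy(p):
    """Shannon entropy of a two-component probability vector (Eq. 2)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def pathologist_prob(confidence):
    """Two-dimensional probability [P(EBV), P(non-EBV)] from the 1-5
    self-confidence scale, e.g. scale 5 -> [0.8, 0.2]."""
    p_ebv = CONF_TO_P_EBV[confidence]
    return [p_ebv, 1.0 - p_ebv]

def fuse(p_model, confidence):
    """Uncertainty-weighted fusion of model and pathologist predictions.
    The weight alpha = (1 - u_m) / ((1 - u_m) + (1 - u_h)) normalizes the
    two certainties; this particular form is an assumption consistent
    with the description in the text. Binary entropy in nats is at most
    ln 2 < 1, so both certainties are strictly positive."""
    p_path = pathologist_prob(confidence)
    u_m, u_h = entropy(p_model), entropy(p_path)
    alpha = (1.0 - u_m) / ((1.0 - u_m) + (1.0 - u_h))
    return [alpha * pm + (1.0 - alpha) * ph for pm, ph in zip(p_model, p_path)]
```

For example, a confident model output [0.9, 0.1] fused with an 'unsure' pathologist (scale 3) yields a fused P(EBV) of about 0.77: the model's lower entropy gives it the larger weight.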
Besides the proposed human-machine fusion strategy, alternative fusion strategies (the 'Or' strategy, 'And' strategy, and '1-uncertainty' strategy) are described in the Supplementary Methods. The diagnostic performance of the different human-machine fusion strategies was compared on the two external datasets.
Statistical analysis. Using the ground-truth EBV status as the reference standard, the AUROCs of EBVNet were calculated from its prediction scores, and the AUROCs of pathologists were determined from their dichotomous classification into EBVaGC or EBVnGC 20 . Therefore, in ROC space, each pathologist corresponds to a point, whereas EBVNet provides a continuous curve 12 . The AUROC of each pathologist was calculated as the area under the two lines linking the pathologist's point to (0,0) and (1,1), respectively. AUROCs were compared with DeLong's test. The cutoff threshold on EBVNet's receiver operating characteristic curve was defined by Youden's J statistic 31 to dichotomize EBVNet's probabilities into binary predictions for calculating the sensitivity, specificity, and NPV. This threshold was predefined on the Internal-STAD dataset before the evaluation of the external datasets. Sensitivity and specificity were compared using the McNemar test. The baseline data of study participants from different datasets were compared by analysis of variance or the Chi-square test. The morphological features of misdiagnosed cases were compared with those of correctly diagnosed cases by the Chi-square test or t-test. For interobserver agreement, the Kappa value among pathologists of different levels was calculated with the Chi-square test. The associations between EBVNet predictions and morphological features were assessed with logistic regression models. The 95% CIs of the AUROCs were calculated by bootstrapping. Differences were considered significant when the P-value from a two-tailed test was less than 0.05. IBM SPSS Statistics (version 20.0) and MedCalc (version 15.2.2) were used for statistical analysis. Python (version 3.9.6) and the deep learning platform PyTorch (version 1.9) were used to build the model and analyze the data.
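As an illustrative sketch (not the authors' implementation), selecting a cutoff by Youden's J statistic amounts to scanning candidate thresholds and maximizing sensitivity + specificity − 1:

```python
def youden_cutoff(scores, labels):
    """Select the threshold maximizing Youden's J = sensitivity + specificity - 1.

    scores: model probabilities for the positive (EBV) class.
    labels: binary ground truth (1 = EBV-positive).
    """
    pos = sum(labels)
    neg = len(labels) - pos
    best_j, best_t = -1.0, None
    for t in sorted(set(scores)):
        # Counts at threshold t: predict positive when score >= t.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        j = tp / pos + tn / neg - 1.0
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j

# Perfectly separable toy scores: the cutoff lands between the classes.
print(youden_cutoff([0.1, 0.2, 0.6, 0.7, 0.9], [0, 0, 1, 1, 1]))  # (0.6, 1.0)
```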
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The TCGA diagnostic whole slides and corresponding labels are available from the NIH Genomic Data Commons (https://portal.gdc.cancer.gov/). Restrictions apply to the whole slide images and annotation data of Internal-STAD and MultiCenter-STAD, which were used with institutional permission via IRB approval for the current study, and are thus not publicly available owing to patient privacy obligations. All data supporting the findings of this study are available on request for non-commercial and academic purposes from the corresponding author M.C. (caimy@sysucc.org.cn) within 10 working days. Signing a data use agreement is not required. Processed data can be reproduced stably with the source code. Source data are provided as a zip file with this paper.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.