Main

Tissues are spatially organized ecosystems, where cells of diverse phenotypes, morphologies and molecular profiles coexist with non-cellular compounds and interact to maintain homeostasis1. Several tissue staining technologies are used to interrogate this intricate tissue architecture. Among these, hematoxylin and eosin (H&E) is the undisputed workhorse, routinely used to assess aberrations in tissue morphology linked to disease in histopathology workflows2. A notable example is cancer, where H&E staining can reveal abnormal cell proliferation, lymphovascular invasion and immune cell infiltration, among other features. Complementary to the morphological information available via H&E, immunohistochemistry (IHC)3 can detect and quantify the distribution and localization of specific markers within cell compartments and within their proper histological context, which is crucial for tumour subtyping, prognosis and personalized treatment selection. As tissue restaining in conventional IHC is limited, repeated serial sections stained with different antibodies are required for in-depth tumour profiling, a time-consuming and tissue-exhaustive process, prohibitive in cases of limited tissue availability. Additionally, serial IHC staining yields unaligned, non-multiplexed images occasionally of suboptimal quality due to artefacts, and tissue unavailability may lead to missing stainings (Fig. 1a). Recently, multiplexed imaging technologies4,5,6 have enabled the simultaneous quantification of dozens of markers on the same tissue, revolutionizing spatial biology7. Still, their high cost, cumbersome experimental process, tissue-destructive nature and need for specialized equipment severely limit clinical adoption.

Fig. 1: VirtualMultiplexer is a generative toolkit for synthesizing virtual multiplexed staining.

a, In a typical histopathology workflow, serial tissue sections from a tumour resection are stained with H&E and IHC to highlight tissue morphology and molecular expression of several markers of interest. This time-consuming and tissue-exhaustive process yields unpaired tissue slides that bear the technical risk of suboptimal quality in terms of missing stainings, tissue artefacts and unaligned tissues. b, To mitigate these issues, the VirtualMultiplexer uses generative AI to rapidly render, from a real input H&E image, consistent, reliable and pixel-wise aligned IHC stainings. c, As the generated images are now virtually multiplexed, they are further exploited to train early fusion graph transformers able to predict several clinically relevant endpoints. d, The VirtualMultiplexer was successfully transferred across image scales and patient cohorts and showed potential in being transferred to other tissue types, accelerating clinical applications and discovery.

Virtual staining—that is, artificially staining tissue images using generative artificial intelligence (AI)—has emerged as a promising cost-effective, accessible and rapid alternative that addresses the above limitations8,9. A virtual staining model is trained on two sets of images—a source and a target set—and learns the source-to-target appearance mapping10,11 so as to simulate the target staining on the source, ultimately producing at inference time a virtual target image. Initial virtual staining models were based on different flavours of generative adversarial networks (GANs) operating under a paired setting: that is, they depended on precisely aligned source and target images, which allowed them to directly optimize a pixel-wise loss between the virtual and real images12. Successful examples of paired models include translating label-free microscopy to H&E and specific stainings13,14,15,16, H&E to special stains17,18, H&E to IHC19,20 and IHC to multiplex immunofluorescence21. However, as tissue restaining is not routinely done, paired models depend on aligning tissue slices via image registration, a time-consuming and error-prone process, often infeasible in practice because of substantial discrepancies even between consecutive slices. Additionally, as tissue architecture changes substantially after the first few slices, retrospective addition of new markers is impossible. To circumvent these limitations, unpaired stain-to-stain (S2S) translation models have recently emerged, with early applications in translating from H&E to IHC22,23,24,25,26 and special staining27,28 and from cryosections to formalin-fixed paraffin-embedded (FFPE) sections29. The vast majority of unpaired models are inspired by CycleGAN30; they depend on an adversarial loss to capture the target style and a cycle consistency loss to preserve the source content. Some employ additional constraints: for example, domain-invariant content and domain-variant style22, perceptual embeddings24 or structural similarity25.

An important limitation of CycleGAN-based models is that cycle consistency assumes a bijective mapping between the source and target domains30, which does not hold for many S2S translation tasks. As a result, a persistent problem is staining unreliability, observed as incorrect mappings across domains: for example, positive signals from the source domain are mapped to negative signals from the target domain. To account for staining unreliability, recent works guide the translations via expert annotations: ref. 26 translates H&E to cytokeratin-stained IHC using expert annotations of positive and negative metastatic regions on the H&E images, and ref. 25 translates H&E to Ki67-stained IHC by leveraging cancer and normal region annotations in both H&E and IHC images. Although these approaches show promising results for these specific translation tasks, acquiring such annotations is impractical when translating to several IHC markers and infeasible even for experienced pathologists for specialized tasks (for example, identifying p53+ cells in H&E images). To circumvent the annotation challenge, ref. 31 recently introduced a semisupervised approach, which, however, again depends on image registration. Consequently, there is a great need for unpaired S2S translation models that preserve staining consistency without needing consecutive tissue sections, image registration or extensive annotations on the source domain.

Regardless of the underlying modelling assumptions, another important limitation of S2S translation methods concerns evaluation. As ground-truth and virtually generated images are not pixel-wise aligned, S2S translation quality is typically quantified at a high feature level using inception-based scores32. However, these scores do not guarantee accurate preservation of complex and biologically meaningful patterns9. To alleviate these concerns, some studies employ qualitative assessment through pathological examination of the virtual images22,24. Still, a persistent concern is the presence of hallucinations in virtual images33 that might otherwise appear realistic even to experienced pathologists. Ultimately, to ensure that virtual images not only visually appear realistic but also are useful from a clinical standpoint, using them as input to downstream models that predict clinical endpoints could provide an unbiased, convincing validation9.

Here, we propose the VirtualMultiplexer, a generative toolkit that translates H&E images to matching IHC images for a variety of markers (one IHC marker at a time) (Fig. 1a,b). The VirtualMultiplexer is inspired by contrastive unpaired translation (CUT)34, an appealing alternative to CycleGAN that achieves content preservation by maximizing the mutual information between target and source domains. Our toolkit does not necessitate pixel-wise aligned H&E and IHC images and, in contrast to existing approaches, requires minimal expert annotations only on the IHC domain. To ensure biological consistency, the VirtualMultiplexer introduces an architecture with multiscale constraints at the single-cell, cell-neighbourhood and whole-image level that closely mimics human expert evaluation. We trained the VirtualMultiplexer on a prostate cancer tissue microarray (TMA) containing unpaired H&E and IHC images for six clinically relevant nuclear, cytoplasmic and membrane-targeted markers. We evaluated the generated images using quantitative fidelity metrics, expert pathological assessment and visual Turing tests and assessed their clinical relevance by predicting clinical endpoints (Fig. 1c). We successfully transferred the model across tissue image scales and out-of-distribution patient cohorts and demonstrated its potential to transfer across tissue types (Fig. 1d). Our results suggest that the VirtualMultiplexer generates realistic, high-quality multiplexed IHC images that are indistinguishable from real ones, outperforming existing methods. Using the virtually multiplexed datasets improves the prediction of clinical endpoints not only in the training cohort but also in two independent prostate cancer patient cohorts and a pancreatic ductal adenocarcinoma (PDAC) cohort, with important implications in histopathology.

Results

VirtualMultiplexer is a virtually multiplexed staining toolkit

The VirtualMultiplexer is a generative toolkit for unpaired S2S translation, trained on unpaired real H&E (source) and IHC (target) images (Fig. 2; detailed description in Methods). During training, each image is split into patches that are fed into a generator network G that conditions on input H&E and IHC and learns to transfer the staining pattern, as captured by IHC, to the tissue morphology, as captured by H&E. The generated IHC patches are stitched together to create a virtual IHC image (Fig. 2a). We train an independent one-to-one VirtualMultiplexer model for each IHC marker at a time. To ensure staining reliability, we propose a multiscale approach, designed to accurately learn staining specificity at a single-cell level and content and style preservation at a cell-neighbourhood and whole-image level, which involves jointly optimizing three distinct loss functions (Fig. 2b). The neighbourhood loss (1) ensures that generated IHC patches are indistinguishable from real IHC patches and consists of an adversarial and a multilayer contrastive loss (Fig. 2b), adopted from CUT34. The adversarial loss \({{\mathcal{L}}}_{{\rm{adv}}}\) (1a) is a standard GAN loss35, where real and virtual IHC patches are used as input to patch discriminator D, which attempts to classify them as either real or virtual, eliminating style differences. The multilayer contrastive loss (1b) is based on a patch-level noise contrastive estimation (NCE) loss34 \({{\mathcal{L}}}_{{\rm{contrastive}}}\) that ensures that the content of corresponding real H&E and virtual IHC patches is preserved across multiple layers of Genc: that is, the encoder of the generator G. The VirtualMultiplexer introduces two losses: a global consistency loss and a local consistency loss (Fig. 2b). The global consistency loss (2) uses a feature extractor F and enforces content consistency between real H&E and virtual IHC images (\({{\mathcal{L}}}_{{\rm{content}}}\)) and style consistency between real and virtual IHC images (\({{\mathcal{L}}}_{{\rm{style}}}\)) across multiple layers of F. The local consistency loss (3) enables the model to capture a realistic appearance and staining pattern at the cellular level while alleviating the multi-subdomain mappings. This is achieved by leveraging prior knowledge on staining status via expert annotations and training two separate networks: a cell discriminator Dcell that eliminates differences in the style of real and virtual cells (\({{\mathcal{L}}}_{{\rm{cellDisc}}}\)) and a cell classifier Fcell that predicts the staining status and thus enforces staining consistency at a cell level (\({{\mathcal{L}}}_{{\rm{cellClass}}}\)).

Fig. 2: Overview of the VirtualMultiplexer architecture.

a, The VirtualMultiplexer consists of a generator G that takes as input real unpaired H&E and IHC images and is trained to perform S2S translation by mapping the staining distribution of IHC onto H&E while preserving tissue morphology, ultimately generating virtually multiplexed synthetic IHC images only from input H&E images. b, During training, the VirtualMultiplexer optimizes several losses that enforce consistent S2S translation at multiple scales, including (1) a neighbourhood consistency loss that ensures indistinguishable translations at a neighbourhood (patch) level, (2) a global consistency loss that ensures that the model accurately captures content and style constraints at a global tile level and (3) a local consistency loss that encodes biological priors on cell type classification and discriminator constraints at a cellular level.

Performance assessment of the VirtualMultiplexer

We trained the VirtualMultiplexer on a prostate cancer cohort from the European Multicenter Prostate Cancer Clinical and Translational Research Group (EMPaCT) TMA36,37,38 (Methods). The cohort contained unpaired H&E and IHC images from 210 patients with four cores per patient for six clinically relevant markers: androgen receptor (AR), NK3 Homeobox 1 (NKX3.1), CD44, CD146, p53 and ERG. The VirtualMultiplexer generated virtual IHC images that preserved the tissue morphology of the real H&E image and the staining pattern of the real IHC image (Fig. 3a–c; additional examples in Extended Data Fig. 1). We benchmarked the VirtualMultiplexer against four state-of-the-art unpaired S2S translation methods: CycleGAN30, CUT34, CUT with kernel instance normalization (KIN)39 and AI-FFPE29, using the Fréchet inception distance (FID), an established metric used to assess the quality of AI-generated images40 (Methods). The VirtualMultiplexer resulted in the lowest FID score across all markers (Fig. 3d), with an average value of 29.2 (±3), consistently lower than CycleGAN (49 ± 6), CUT (35.8 ± 4.5), CUT with KIN (37.8 ± 2.3) and AI-FFPE (35.9 ± 2.6). We also used the contrast-structure similarity score, a variant of the structural similarity score that computes contrast and structure preservation25, where again the VirtualMultiplexer surpassed all other models in performance (Supplementary Table 1). These results indicated that virtual images generated by the VirtualMultiplexer were closer to the real ones in terms of distribution than any of the competing methods.
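For readers who wish to reproduce this type of comparison, a minimal sketch of an FID computation between real and virtual IHC patches is shown below. It uses the torchmetrics implementation of FID as an assumed stand-in; the exact FID implementation and preprocessing used in this work are not specified here, so library choice, feature dimensionality and image format are illustrative assumptions.

```python
# Minimal FID sketch, assuming the torchmetrics implementation; preprocessing
# and feature dimensionality are illustrative, not the exact setup of this work.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def fid_score(real_patches: torch.Tensor, virtual_patches: torch.Tensor) -> float:
    """real_patches / virtual_patches: uint8 tensors of shape (N, 3, H, W)."""
    fid = FrechetInceptionDistance(feature=2048)  # Inception-v3 pool features
    fid.update(real_patches, real=True)
    fid.update(virtual_patches, real=False)
    return fid.compute().item()
```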

Fig. 3: Performance evaluation of the VirtualMultiplexer.

a, Example H&E core from the EMPaCT TMA. b, Real, unpaired IHC-stained cores for different antibody markers corresponding to the H&E core in a. c, Virtually stained IHC cores, now paired with the H&E core in a. d, Comparison of the VirtualMultiplexer with state-of-the-art S2S models. Barplots and error bars indicate the mean and standard deviation of the FID score from three independent runs of each model. The number of test samples used varies per marker and is reported in each subplot. e, Results of the visual Turing test, where circles indicate results of the guess of each one of the n = 4 experts, and barplots and error bars indicate the corresponding mean and standard deviation. f, Assessment of staining quality of the virtual and real stainings, performed on 50 real and 50 virtual images. RR, real as real; RV, real as virtual; VR, virtual as real; VV, virtual as virtual.

To further quantify the indistinguishability of real and virtual images, we conducted a visual Turing test: three experts in prostate histopathology and one board-certified pathologist were shown 100 randomly selected patches per marker, with 50 of them originating from real and 50 from virtual IHC images, and were asked to classify each patch as virtual or real. Our model was able to trick the experts, as they achieved a close-to-random average sensitivity of 52.1% and specificity of 54.1% across all markers (Fig. 3e). Last, we performed a staining quality assessment: we gave the pathologist 50 real and 50 virtual images per marker, revealing which were real and virtual; the pathologist performed a qualitative assessment of the staining, as judged by overall expression levels, background, staining pattern, cell type specificity and subcellular localization (Fig. 3f; detailed annotations in Supplementary Data 1). Across all markers, on average 70.7% of the virtual images reached an acceptable staining quality, as opposed to 78.3% of the real images. The results varied depending on the marker, with virtual NKX3.1 and CD146 images achieving the highest quality of 96%, surpassing even real images. Conversely, virtual AR images had the lowest score of 46%, with an additional 10% exhibiting accurate staining but high background, and the remaining 42% rejected mostly due to heterogeneous staining or falsely unstained cells. Background presented a challenge with CD44 and p53; the latter appeared to be further affected by border artefacts—that is, the presence of abnormally highly stained cells only in the core border—also occasionally present in real images. ERG achieved a higher staining quality in virtual than in real images, which both often faced background issues. We concluded that for most markers, the staining quality scores and the number of cores with staining artefacts were comparable in virtual versus real images.

Following these observations, we carefully examined whether virtual images capture accurate staining patterns. Overall, for all markers, we observed similar patterns, correct cell types and subcellular distributions (Extended Data Fig. 2a). Certain discrepancies were also found, such as systematic lack of recognition of CD146+ vascular structures (Extended Data Fig. 2b). Nonetheless, the more pathologically relevant patterns, crucial for diagnostic applicability, were correctly reconstructed. We also compared the staining intensity of positive and negative cells and observed high concordance between class-wise intensity distributions and separability for both real and virtual images, confirming that the virtual images faithfully capture the staining intensity for both cell classes (Extended Data Fig. 3). Finally, we performed an ablation study demonstrating the effects of different components of the VirtualMultiplexer loss (Extended Data Fig. 4). The mere imposition of the neighbourhood consistency (the primary objective in competing methods) leads to obvious staining unreliability: for example, swapping of staining patterns between positive and negative cells. Our global consistency clearly mitigates this, and our local consistency further optimizes the virtual staining at the cell level.

Transferring from TMAs to WSIs

To assess how well the model can be transferred across imaging scales, we fed the TMA-trained VirtualMultiplexer with five out-of-distribution H&E-stained prostate whole-slide images (WSIs) and generated virtual IHC images for NKX3.1, AR and CD146. We then stained for the same markers by IHC on the direct serial sections, thus generating ground-truth and directly comparable WSIs to visually validate the model predictions (Methods). For NKX3.1 (Fig. 4), the virtual images largely captured the staining appearance of the real ones, both in terms of specific glandular luminal cell identification (positive signal) (examples 1 and 2 in Fig. 4 and Extended Data Fig. 5) and accurate non-annotation of stromal or vascular structures (absence of signal) (example 3 in Fig. 4 and Extended Data Fig. 5). In a minority of cases, virtual images did not highlight the rarer NKX3.1+ cell populations that are not part of the epithelial gland but rather reside in the periglandular stroma (example 4 in Fig. 4 and Extended Data Fig. 5). For CD146 and AR, we observed intensity discrepancies between virtual and real images, more striking for CD146 where the overall signal intensity and background are higher in virtual versus real images (Fig. 4 and Extended Data Fig. 5). These discrepancies can be attributed to the fact that the training set TMA images have a different staining distribution than the WSIs. Although this might lead to false interpretation of marker expression levels at first inspection, when evaluating at higher magnification, the staining pattern in the matching real and virtual regions was effectively correct: for example, no glandular signal (example 5 in Fig. 4) and appropriate stromal localization of CD146 (examples 6 and 7 in Fig. 4) and nuclear localization of AR in luminal epithelial cells (example 5 in Extended Data Fig. 5). Lack of detection of vascular structures for CD146 was evident in both TMA cores and WSI (example 8 in Fig. 4).

Fig. 4: Transfer learning from TMA to WSIs of prostate cancer tissue.

Example of H&E (left), virtual IHC (middle) and real IHC (right) staining for NKX3.1 (top) and CD146 (bottom) of prostate cancer tissue WSIs. Blue-framed zoomed-in regions display accurate staining pattern. Red-framed zoomed-in regions display examples of virtual staining mispredictions.

The VirtualMultiplexer improves clinical predictions

We then assessed the utility of the generated stainings in augmenting the performance of AI models when predicting clinically relevant endpoints. Specifically, we encoded the real H&E, real IHC or virtual IHC images as tissue-graph representations and employed a graph transformer (GT)41 to map the representations to downstream class labels (Fig. 5a,b and Methods). We trained the GT model under three settings (Fig. 5c): (1) a unimodal setting, where independent GT models were trained for each H&E and IHC marker; (2) a multimodal late fusion setting, where the outputs of independent GT models were fused at the last embedding stage; and (3) a multimodal early fusion setting, where the patch features were combined early in the tissue graph and fed into the GT model. Whereas the unimodal setting resulted in a separate prediction per marker, both multimodal settings combined information across markers, resulting in a single prediction. In contrast to the late fusion multimodal setting, the early fusion case trains a single model that learns from the joint spatial distribution across all markers, mimicking a multiplexed imaging scenario. With the exception of the early fusion setting that is only feasible for virtual images, we tested all three settings with both real and virtual images as input, resulting in a total of five different combinations (Fig. 5d, legend).
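To make the distinction between the two multimodal strategies concrete, a minimal sketch is given below: early fusion concatenates per-marker patch features into a single node feature matrix for one GT, whereas late fusion averages the outputs of independent per-marker GTs. Function and variable names are illustrative and not part of the released implementation.

```python
# Illustrative contrast between early and late fusion; names are hypothetical.
import torch

def early_fusion_nodes(per_marker_feats: list[torch.Tensor]) -> torch.Tensor:
    # per_marker_feats: one (n_patches, d) tensor per staining (H&E + virtual IHCs),
    # all on the same patch grid because virtual stains are pixel-wise aligned to H&E.
    # A single GT is then trained on the fused tissue graph.
    return torch.cat(per_marker_feats, dim=1)  # (n_patches, d * n_markers)

def late_fusion_prediction(per_marker_logits: list[torch.Tensor]) -> torch.Tensor:
    # One GT per staining; their outputs are merged only at the final stage.
    return torch.stack(per_marker_logits, dim=0).mean(dim=0)
```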

Fig. 5: Prediction of clinically relevant downstream tasks with virtually multiplexed data.

a, Patch extraction and computation of patch features with a frozen ResNet-50 model (blue trapezoid). b, Overview of the GT model, implemented by first constructing a patch-level graph representation, followed by a transformer that processes the graph representation to predict clinically relevant endpoints. c, Training of GT models (green trapezoid) under three different settings, depending on the integration strategy. d, Prediction results of overall survival status (left: 0, alive/censored; 1, prostate-cancer-related death) and disease progression (right: 0, no recurrence; 1, recurrence). Barplot colours indicate one of the five combinations of training setting and input data used (see legend). For each combination, barplot heights and error bars indicate the mean and standard deviation of the weighted F1 score, as computed in the held-out test set from three independent runs with different initializations. The exact number of training samples used in each case is given on top of the barplots. aFor all multimodal models, the reported number refers to the union across all markers. MM-R-L, multimodal–real–late fusion; MM-V-E, multimodal–virtual–early fusion; MM-V-L, multimodal–virtual–late fusion.

We applied these settings to the EMPaCT dataset to predict patient overall survival status and disease progression (Fig. 5d and Methods). As small discrepancies in the number of real IHC images available were present due to missing stainings, we matched the number of virtual IHC images to the number of available real IHC images to ensure a fair comparison between real and virtual unimodal models (dark and light blue barplots in Fig. 5d, respectively). As H&E images were always available, the unimodal model trained on H&E had a slight advantage over all other models in terms of number of samples used. To compare all multimodal models, we again matched the number of virtual images to the available real data, and thus the last three bars in Fig. 5d are also directly comparable. We observed that the unimodal–virtual settings are on par with the unimodal–real for both tasks, with variations in prediction performance depending on the marker. When predicting overall survival status, two interesting exceptions concern CD146 and p53: for CD146, the unimodal–virtual setting outperformed the unimodal–real, in accordance with the previous observation that virtual CD146 images achieved a higher-quality score than real ones (Fig. 3f). The opposite is true for p53: virtual p53 images were of lower quality than real p53 images, and the corresponding unimodal–virtual setting achieved a lower performance than the unimodal–real one. However, these observations were not replicated for disease progression prediction, which appeared to be an overall harder task. In both tasks, all multimodal settings outperformed the unimodal ones, including the H&E, indicating the utility of combining information from complementary markers. Furthermore, the multimodal early fusion model trained with virtual images achieved the best weighted F1 score of 82.9% and 74.8% for overall survival status and disease progression, respectively. We also performed a marker-level interpretability analysis, pointing to markers of high importance, in line with the high and low weighted F1 scores of the unimodal models (Extended Data Fig. 6). Overall, our results establish the potential of virtual multiplexed images in augmenting the efficacy of AI models in the prediction of clinical endpoints.

Transferring across patient cohorts and cancer types

We then assessed the model’s ability to generalize to out-of-distribution data using two independent prostate cancer cohorts, SICAP42 and prostate cancer grade assessment (PANDA)43, containing H&E-stained needle biopsies with associated Gleason scores (Methods). We used the pretrained VirtualMultiplexer to generate IHC images for four markers relevant towards Gleason score prediction: NKX3.1, CD146, AR and ERG (Fig. 6a; additional examples in Extended Data Fig. 7). We observed that the virtual staining patterns of the IHC markers were overall correct and specific in terms of cell type and subcellular localization, with the only exception being the occasional aspecific AR signal in the extracellular matrix areas. Other inconsistencies included weak staining of interstitial tissue for CD146 and heterogeneous gland staining for ERG. We also observed some recurring issues as in the EMPaCT TMA (Fig. 3): background (for example, occasional stromal background in NKX3.1 and ERG) and border and tiling artefacts (for example, CD146). Subsequently, we trained GT models under the previous settings to predict Gleason grade (Fig. 6b,c, respectively). We observed that the predictive performance of the unimodal–virtual settings was close to or superior to the model using standalone H&E images for both datasets. Further improvement was attained by the multimodal–virtual settings, with the early fusion model achieving the highest weighted F1 score (SICAP, 61.4%; PANDA, 72.3%), which not only outperformed the H&E unimodal counterparts, but also WholeSIGHT44, the previous top-performing model on these datasets that achieved a weighted F1 score of 58.6% and 67.9% on SICAP and PANDA, respectively. Finally, as ground-truth region-level annotations of Gleason scores exist for both SICAP and PANDA, we performed a region-based interpretability analysis and observed that the salient tissue regions contributing to model predictions coincided with the ground-truth annotations (Extended Data Fig. 8).

Fig. 6: Transfer learning across scales, cohorts and cancer types.

a, Top, real H&E needle biopsy of the SICAP dataset. Bottom, matching virtual IHC stainings across four IHC markers, as generated from the EMPaCT-trained VirtualMultiplexer. b, Prediction results of Gleason grading for the SICAP test set in terms of weighted F1 score and confusion matrix. Note that the setting unimodal–real (dark blue barplot) only includes training the model on H&E, as no real IHC data are available here. c, Same as in b, but for the PANDA dataset. d, Virtual IHC staining of a PDAC TMA dataset with corresponding prediction of TNM staging. In b–d, barplots and error bars are as in Fig. 3 and confusion matrices correspond to the multimodal–virtual early fusion model.

Finally, we evaluated the generalization ability of the VirtualMultiplexer on other cancer types. We applied the EMPaCT-pretrained VirtualMultiplexer to a PDAC TMA and generated virtual IHC stainings for CD44, CD146 and p53 (Fig. 6d), three markers with expected expression in pancreatic tissue. The generated images appeared overall realistic, with no means of discriminating whether they were virtually or actually stained. We observed that the CD44 and CD146 staining pattern in the virtual images was allocated, as expected, to the extracellular matrix of the presented tissue spots, without major staining in the epithelial tissue part. For p53, we again observed overall proper staining allocation to the nuclei of epithelial cells with expected distribution, with no major staining of other compartments. To quantify the utility of the virtual stainings for downstream applications, we followed the same process as before to predict PDAC tumour, node and metastasis (TNM) stage, which again led to increased performance of models trained with virtually multiplexed data, confirming the performance advantage that virtual multiplexing offers to prediction models. We also applied the pretrained VirtualMultiplexer to generate virtual IHC images for CD44 and CD146 from colorectal45 and breast cancer46 H&E-stained WSIs from The Cancer Genome Atlas (TCGA) at www.cancer.gov/tcga. Although the lack of normal tissue limited our ability to evaluate the staining quality in the generated images, we again observed an overall realistic virtual staining (Extended Data Fig. 9).

Lastly, we performed a runtime estimation of our framework (Extended Data Fig. 10) and concluded that it leads to substantial time gains when compared to a typical IHC staining, greatly accelerating histopathology workflows.

Discussion

We proposed the VirtualMultiplexer, a generative model that translates H&E to several IHC markers using a multiscale architecture with biological priors that ensures biological consistency on a cellular, neighbourhood and global whole-image scale without requiring image registration or extensive annotations. The VirtualMultiplexer consistently outperformed state-of-the-art methods in image fidelity scores. Detailed evaluation suggested that the virtual IHC images were indistinguishable from real ones to the expert eye, with a staining quality on par with or even exceeding that of real images and occasional staining artefacts largely comparable for three of the six markers. A thorough ablation study demonstrated that our multiscale loss mitigates staining unreliability, as opposed to competing methods that solely use adversarial and contrastive objectives. We also found that the model generalized well to unseen datasets of different image scales without any fine-tuning.

Although our results demonstrate a clear potential, several limitations remain, to be addressed in future extensions. First, we occasionally observed elevated background, especially for markers with faint staining. More pronounced background was present when transferring to prostate cancer WSIs, which was expected considering that this dataset was generated in different institutions using different staining protocols. Second, the patch-wise processing occasionally induced tiling artefacts more pronounced at the core border, a well-known limitation of S2S translation approaches24,39,47,48. One possible underlying cause is that as the model has only seen tissue-full patches during training, when it receives as input a patch with little tissue, the losses ‘force’ it to stain with higher intensity to match the distribution of a full patch. Previous attempts to address the tiling artefact24,39 have been suggested to cause less efficient translations49. As in our case the tiling artefact is limited to edge cases, a straightforward solution is discarding a narrow border surrounding the tissue, as empirically done in actual IHC when border artefacts are present. Alternatively, more sophisticated extensions, such as the bidirectional feature-fusion GAN proposed by ref. 48 could be exploited. Third, discrepancies in staining specificity were occasionally observed (for example, failing to stain CD146+ vascular structures and glandular NKX3.1+ cells invading periglandular stroma), as these patterns were rarely observed in the training images and can be mitigated by ensuring the inclusion of adequate representative examples in the training set.

Importantly, despite their limitations, the generated images enabled the training of early fusion GT models, which consistently improved the prediction of clinical endpoints not only in the training dataset across two prediction tasks but also in both out-of-distribution prostate cancer cohorts and the PDAC TMA cohort. In our experiments, we ensured that the multimodal early fusion models did not have a numerical advantage over models trained with real data and also had a much smaller parameter space in comparison to late fusion ones, suggesting that improved performance is not a mere outcome of higher sample size or model complexity. A potential explanation of the observed improvement is that virtual images are not affected by artefacts occasionally found in real images, corroborated by the fact that for markers where virtual images were of higher quality than real, the corresponding unimodal–virtual models outperformed the unimodal–real ones and vice versa. Another explanation could be that as multimodal early fusion models could learn from the joint spatial distribution of several markers on the same tissue, they managed to pick up single-cell multimodal spatial relationships, mimicking data generated by advanced multiplexed technologies. This is further supported by the fact that in the early fusion case, a single GT model proved to have more learning capacity than the integration of several equivalently potent ones. However, the superior performance of models trained with virtual data could be unrelated to a potential higher quality of the generated images and could be a direct outcome of the fact that the VirtualMultiplexer potentially picks up the most consistent patterns and eliminates a lot of the noise and artefacts in the data, making the prediction task easier. This is further supported by other works that have reported competitive performance using models trained on other spatial features extracted from the tissue images50,51.

In conclusion, the current work establishes the potential of virtual multiplexed staining, with important implications towards AI-assisted histopathology. For example, the VirtualMultiplexer could be directly used for data inpainting—that is, filling missing regions in an image—or for sample imputation—that is, generating missing samples from scratch. As IHC marker panels are not standardized across labs, filling in the gaps via virtual multiplexing could harmonize datasets within or across research labs, particularly important in cases of limited sample availability52,53. This could lead to the generation of harmonized and comprehensive patient cohorts, further used for clinically relevant predictions. An equally important application of our work concerns prehistopathological experimental design: generating a large collection of IHC stains in silico and training AI models could support marker selection for actual experimentation, reducing costs and preserving precious tissue. To reach its full potential, future work will be needed to validate the VirtualMultiplexer in real-world settings. From a technical standpoint, virtually multiplexed stainings can augment existing datasets and enable the development of foundational models for IHC, paving the way for multimodal tissue characterization. Interestingly, virtual multiplexed staining can be exploited as biologically conditioned data augmentations to boost the development and predictive performance of foundational models in histopathology. Our preliminary results on PDAC and TCGA images indicate that our model has the potential to generalize to tissues of different origins. However, more thorough evaluations are needed to solidify these encouraging early results. Finally, as our method is stain-agnostic, straightforward adaptations for S2S translation across multiplexed imaging technologies could substantially reduce costs via antibody panel optimization. Our vision is that future extensions of our work could lead to an ever-growing and readily available dictionary of virtual stainers for IHC and beyond, surpassing in multiplexing even the most cutting-edge technologies and accelerating spatial biology.

Methods

VirtualMultiplexer architecture

The VirtualMultiplexer is a generative AI toolkit that performs unpaired H&E-to-IHC translation. An overview of the model’s architecture is shown in Fig. 2a. The VirtualMultiplexer is trained using two sets of images: source H&E images, denoted as \({X}_{{\rm{img}}}=\{x\in {\mathcal{X}}\}\), and target IHC images, denoted as \({Y}_{{\rm{img}}}=\{\;y\in {\mathcal{Y}}\}\). Ximg and Yimg originate from different sections of the same TMA core and thus belong to the same patient, but are pixel-wise unaligned and thus unpaired. We train an independent one-to-one VirtualMultiplexer model for each IHC marker at a time. To train the VirtualMultiplexer, we use patches Xp = {xp ∈ Ximg} and Yp = {yp ∈ Yimg} extracted from a pair of images Ximg and Yimg, respectively. The backbone of the VirtualMultiplexer is a GAN-based generator G, specifically a CUT34 model that consists of two sequential components: an encoder Genc and a decoder Gdec. Upon training, the generator takes as input a patch xp and generates a virtual patch \({y}_{\rm{p}}^{{\prime} }\): that is, \({y}_{\rm{p}}^{{\prime} }=G({x}_{\rm{p}})={G}_{{\rm{dec}}}({G}_{{\rm{enc}}}({x}_{\rm{p}}))\). The virtually generated patches are stitched together to produce a final virtual image \({Y}_{{\rm{img}}}^{{\prime} }=\{\;{y}^{{\prime} }\in {{\mathcal{Y}}}^{{\prime} }\}\). The VirtualMultiplexer is trained under the supervision of three levels of consistency objectives: local, neighbourhood and global consistency (Fig. 2b). The neighbourhood consistency enforces effective staining translation at a patch level, where a patch captures the neighbourhood of a cell. We introduce additional global and local consistency objectives, operating at an image level and cell level, respectively, to further constrain the unpaired S2S translation and alleviate the stain-specific inconsistencies.
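For illustration, a minimal sketch of the inference-time tile-translate-stitch flow described above is given below. Non-overlapping 256 × 256 tiles and an image evenly divisible by the tile size are simplifying assumptions of this sketch, not the exact patching scheme of the implementation.

```python
# Minimal sketch: tile the H&E image, translate each tile with the trained
# generator G and stitch the outputs back into a virtual IHC image.
# Non-overlapping tiles and exact divisibility are simplifying assumptions.
import torch

@torch.no_grad()
def translate_image(x_img: torch.Tensor, G: torch.nn.Module, p: int = 256) -> torch.Tensor:
    """x_img: (3, H, W) H&E image with H and W divisible by p; returns the virtual IHC image."""
    _, H, W = x_img.shape
    y_img = torch.zeros_like(x_img)
    for top in range(0, H, p):
        for left in range(0, W, p):
            x_p = x_img[:, top:top + p, left:left + p].unsqueeze(0)   # (1, 3, p, p)
            y_img[:, top:top + p, left:left + p] = G(x_p).squeeze(0)  # stitch in place
    return y_img
```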

Neighbourhood consistency

The neighbourhood objective is a combination of an adversarial loss and a patch-wise multilayer contrastive loss, implemented as previously described in CUT34 (Fig. 2b, panel 1). Briefly, the adversarial loss drives the model to eliminate style differences between real and virtual patches, and the multilayer contrastive loss guarantees content preservation at a patch level54. The adversarial loss is a standard GAN min–max loss35, where the discriminator D takes as input real IHC patches Yp and IHC patches \({Y}_{\rm{p}}^{{\prime} }\) virtually generated by generator G and attempts to classify them as either real or virtual (Fig. 2b, panel 1a). It is calculated as follows:

$${{\mathcal{L}}}_{{\rm{adv}}}(G,D,{X}_{\rm{p}},{Y}_{\rm{p}})={{\mathbb{E}}}_{{y}_{\rm{p}} \sim {Y}_{\rm{p}}}\log D(\;{y}_{\rm{p}})+{{\mathbb{E}}}_{{x}_{\rm{p}} \sim {X}_{\rm{p}}}\log (1-D(G({x}_{\rm{p}}))).$$
(1)
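A minimal sketch of how this adversarial term can be computed for a batch of patches is shown below. It follows the log-loss form of equation (1); note that the implementation details report a least-squares GAN variant for \({{\mathcal{L}}}_{{\rm{adv}}}\), so the exact formulation here is illustrative.

```python
# Sketch of the adversarial term in equation (1); the released model reportedly
# uses a least-squares GAN variant, so this log-loss form is for illustration.
import torch
import torch.nn.functional as F

def adversarial_losses(D, G, x_p, y_p):
    """x_p: H&E patches, y_p: real IHC patches, both of shape (B, 3, H, W)."""
    y_fake = G(x_p)
    real_logits = D(y_p)
    fake_logits = D(y_fake.detach())
    # Discriminator: push D(y_p) towards 'real' and D(G(x_p)) towards 'virtual'.
    d_loss = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) \
           + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    # Generator: fool the discriminator, i.e. push D(G(x_p)) towards 'real'.
    gen_logits = D(y_fake)
    g_loss = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
    return d_loss, g_loss
```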

The patch-wise multilayer contrastive loss follows an NCE concept as presented in refs. 54,55 and reused in refs. 29,34. Specifically, it aims to maximize the resemblance between input H&E patch xp ∈ Xp and corresponding virtually synthesized IHC patch \({y}_{\rm{p}}^{{\prime} }\in {Y}_{\rm{p}}^{{\prime} }\) (Fig. 2b, panel 1b). We first extract a query subpatch \({y}_{\rm{sp}}^{{\prime} }\) of size 64 × 64 from the target IHC domain patch \({y}_{\rm{p}}^{{\prime} }\) (purple square in Fig. 2b, panel 1b) and match it to the corresponding subpatch xsp: that is, a subpatch at the same spatial location as \({y}_{\rm{sp}}^{{\prime} }\) but from the H&E source domain patch xp (black square in Fig. 2b, panel 1b). Because both subpatches originate from the exact same tissue neighbourhood, we expect that xsp and \({y}_{\rm{sp}}^{{\prime} }\) form a positive pair. We also sample N subpatches \(\{{x}_{\rm{sp}}^{-}\}\) at different spatial locations from xp (red squares in Fig. 2b, panel 1b) and expect that they form dissimilar, negative pairs with \({y}_{\rm{sp}}^{{\prime} }\). In a standard contrastive learning scheme, we would map \({y}_{\rm{sp}}^{{\prime} }\), xsp and \(\{{x}_{\rm{sp}}^{-}\}\) to a d-dimensional embedding space \({{\mathbb{R}}}^{d}\) via Genc and project them to a unit sphere, resulting in v, v+ and \(\{{v}^{-}\}\in {{\mathbb{R}}}^{d}\), respectively, and then estimate the probability of a positive pair (v, v+) selected over negative pairs \((v,{v}_{n}^{-}),\forall n\in N\) as a cross-entropy loss with a temperature scaling parameter τ:

$${\mathcal{L}}(v,{v}^{+},{v}^{-})=-\log \left[\frac{\exp (v{v}^{+}/\tau )}{\exp (v{v}^{+}/\tau )+\mathop{\sum }\nolimits_{n = 1}^{N}\exp (v{v}_{n}^{-}/\tau )}\right]$$
(2)

Here, we use a variation of the loss in equation (2), specifically a patch-wise multilayer contrastive loss that extends \({\mathcal{L}}(v,{v}^{+},{v}^{-})\) by computing it for feature maps extracted from L layers of Genc29,34. This is achieved by passing the L feature maps of xp and \({y}_{\rm{p}}^{{\prime} }\) through a two-layer multilayer perceptron (MLP) Hl, resulting in a stack of features \({\{{z}_{l}\}}_{L}={\{{H}_{l}({G}_{{\rm{enc}}}^{l}({x}_{\rm{p}}))\}}_{L}\) and \({\{{z}_{l}^{{\prime} }\}}_{L}={\{{H}_{l}({G}_{{\rm{enc}}}^{l}(\;{y}_{\rm{p}}^{{\prime} }))\}}_{L}={\{{H}_{l}({G}_{{\rm{enc}}}^{l}(G({x}_{\rm{p}})))\}}_{L}\), l ∈ {1, 2, …, L}, respectively. We also iterate over each spatial location s ∈ {1, …, Sl}, and we leverage all Sl∖s subpatches as negatives, ultimately resulting in \({z}_{l,s}^{{\prime} }\), zl,s and \({z}_{l,{S}_{l}\backslash s}\) for the query, positive and negative subpatches, respectively (purple, black and red boxes in Fig. 2b, panel 1b). The final patch-wise multilayer contrastive loss is computed as

$${{\mathcal{L}}}_{{\rm{contrastive}}}(G,H,{X}_{\rm{p}})={{\mathbb{E}}}_{{x}_{\rm{p}} \sim {X}_{\rm{p}}}\mathop{\sum }\limits_{l=1}^{L}\mathop{\sum }\limits_{s=1}^{{S}_{l}}{\mathcal{L}}\left({z}_{l,s}^{{\prime} },{z}_{l,s},{z}_{l,{S}_{l}\backslash s}\right)$$
(3)
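A simplified single-layer sketch of this patch-wise NCE term is shown below: every spatial location of the virtual patch acts as a query, the matching location of the H&E patch is its positive, and all other locations of that layer serve as negatives. The temperature follows the value reported in the implementation details; everything else (shapes, normalization) is a simplifying assumption.

```python
# Simplified single-layer sketch of the patch-wise NCE loss in equations (2)-(3);
# the full loss sums this term over L encoder layers.
import torch
import torch.nn.functional as F

def patch_nce_loss(z_virtual: torch.Tensor, z_real: torch.Tensor, tau: float = 0.08) -> torch.Tensor:
    """z_virtual, z_real: (S, d) projected features of the S sub-patch locations of one layer,
    computed from the virtual IHC patch and the input H&E patch, respectively."""
    v = F.normalize(z_virtual, dim=1)       # queries, projected onto the unit sphere
    v_pos = F.normalize(z_real, dim=1)      # positives; off-diagonal entries act as negatives
    logits = v @ v_pos.t() / tau            # (S, S) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)  # diagonal entries are the positive pairs
    return F.cross_entropy(logits, targets)
```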

We also employ contrastive loss \({{\mathcal{L}}}_{{\rm{contrastive}}}(G,H,{Y}_{\rm{p}})\) on patches yp ∈ Yp, a domain-specific version of the identity loss56,57 that prevents the generator G from making unnecessary changes as proposed in ref. 34. Finally, the overall neighbourhood consistency objective is computed as a weighted sum of the adversarial loss (equation (1)) and the multilayer contrastive loss (equation (3)) with regularization hyperparameter λNCE:

$$\begin{array}{ll}{{\mathcal{L}}}_{{\rm{neighbourhood}}}={{\mathcal{L}}}_{{\rm{adv}}}(G,D,{X}_{\rm{p}},{Y}_{\rm{p}})+{\lambda }_{{\rm{NCE}}}\times \left({{\mathcal{L}}}_{{\rm{contrastive}}}(G,H,{X}_{\rm{p}})\right.\\\qquad\qquad\qquad\;\;\,\left.+{{\mathcal{L}}}_{{\rm{contrastive}}}(G,H,{Y}_{\rm{p}})\right)\end{array}$$
(4)

Global consistency

Inspired by seminal work in neural style transfer58, this objective consists of two loss functions: a content loss \({{\mathcal{L}}}_{{\rm{content}}}\) and a style loss \({{\mathcal{L}}}_{{\rm{style}}}\) that together enforce biological consistency in terms of both tissue composition and staining pattern at the image (tile) level (Fig. 2b, panel 2). Because the generated IHC images should be virtually paired to their corresponding input H&E image in terms of tissue composition, the content loss aims to penalize content differences between H&E and IHC images at a tile level. First, real patches Xp and synthesized patches \({Y}_{\rm{p}}^{{\prime} }\) are stitched to create images Ximg and \({Y}_{{\rm{img}}}^{{\prime} }\), respectively, and corresponding tiles of size 1,024 × 1,024 are extracted (boxes in Fig. 2b, panel 2), denoted as Xt = {xt ∈ Ximg} and \({Y}_{t}^{{\prime} }=\{\;{y}_{t}^{{\prime} }\in {Y}_{{\rm{img}}}^{{\prime} }\}\), respectively. Then the tiles are encoded by a pretrained feature extractor F, specifically VGG16 (ref. 59) pretrained on ImageNet60. The tile-level content loss at layer l of F is calculated as

$${{\mathcal{L}}}_{{\rm{content}}}^{l}\left(F,{X}_{\rm{p}},{Y}_{\rm{p}}^{{\prime} }\right)=\frac{\sum | | {F}^{l}({x}_{t})-{F}^{l}(\;{y}_{t}^{{\prime} })| {| }^{2}}{{h}^{l}\cdot {w}^{l}\cdot {c}^{l}}$$
(5)

where h, w and c are the height, width and channel dimensions of the feature map at the lth layer, respectively.
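A minimal sketch of this tile-level content term is given below, using a frozen torchvision VGG16 truncated at relu2_2 (the content layer reported in the implementation details); the truncation index and preprocessing are assumptions of this sketch.

```python
# Sketch of the content term in equation (5) with a frozen ImageNet VGG16;
# the slice index for relu2_2 and the preprocessing are illustrative assumptions.
import torch
from torchvision.models import vgg16, VGG16_Weights

vgg_content = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:9].eval()  # up to relu2_2
for param in vgg_content.parameters():
    param.requires_grad_(False)

def content_loss(x_tile: torch.Tensor, y_tile_virtual: torch.Tensor) -> torch.Tensor:
    """x_tile: H&E tiles, y_tile_virtual: virtual IHC tiles, both (B, 3, 1024, 1024), ImageNet-normalized."""
    f_x, f_y = vgg_content(x_tile), vgg_content(y_tile_virtual)
    return ((f_x - f_y) ** 2).sum() / f_x[0].numel()  # normalized by h * w * c as in equation (5)
```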

The style loss utilizes the synthesized image \({Y}_{{\rm{img}}}^{{\prime} }\) and the available real image Yimg to match the style or overall staining distribution between real and virtual IHC images. Because \({Y}_{{\rm{img}}}^{{\prime} }\) and Yimg do not have pixel-wise correspondence, large tiles \({Y}_{t}^{{\prime} }=\{\;{y}_{t}^{{\prime} }\in {Y}_{{\rm{img}}}^{{\prime} }\}\) and Yt = {yt ∈ Yimg} are extracted at random such that each tile incorporates a sufficient staining distribution. Next, \({Y}_{t}^{{\prime} }\) and Yt are processed by F to produce feature maps across multiple layers. The style loss is computed as

$${{\mathcal{L}}}_{{\rm{style}}}^{l}\left(F,{Y}_{\rm{p}},{Y}_{\rm{p}}^{{\prime} }\right)=\frac{\sum | | {\mathcal{G}}\circ {F}^{l}(\;{y}_{t})-{\mathcal{G}}\circ {F}^{l}(\;{y}_{t}^{{\prime} })| {| }^{2}}{| | {\mathcal{G}}\circ {F}^{l}(\;{y}_{t})| {| }^{2}+| | {\mathcal{G}}\circ {F}^{l}(\;{y}_{t}^{{\prime} })| {| }^{2}}$$
(6)

where \({\mathcal{G}}\) is the Gram matrix operator, which measures the correlations between the channels of a feature map and thus captures its style. The denominator is a normalization term that compensates for the under- or overstylization of the tiles in a batch61. The overall global consistency loss is computed as

$${{\mathcal{L}}}_{{\rm{global}}}={\lambda }_{{\rm{content}}}\times \mathop{\sum }\limits_{l}^{{L}_{{\rm{content}}}}{{\mathcal{L}}}_{{\rm{content}}}^{l}\left(F,{X}_{\rm{p}},{Y}_{\rm{p}}^{{\prime} }\right)+{\lambda }_{{\rm{style}}}\times \mathop{\sum }\limits_{l}^{{L}_{{\rm{style}}}}{{\mathcal{L}}}_{{\rm{style}}}^{l}\left(F,{Y}_{\rm{p}},{Y}_{\rm{p}}^{{\prime} }\right)$$
(7)

where Lcontent and Lstyle are the lists of the content and style layers of F, respectively, used to extract the feature matrices, and λcontent and λstyle are regularization hyperparameters for the respective loss terms.
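For completeness, a minimal sketch of the Gram-matrix style term of equation (6) for a single feature layer is shown below; the Gram normalization constant is an illustrative choice, as it cancels in the ratio.

```python
# Sketch of the style term in equation (6) for one feature layer of F.
import torch

def gram(feat: torch.Tensor) -> torch.Tensor:
    """feat: (B, C, H, W) feature map -> (B, C, C) Gram matrix of channel correlations."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_loss(feat_real: torch.Tensor, feat_virtual: torch.Tensor) -> torch.Tensor:
    """Feature maps of real (y_t) and virtual (y_t') IHC tiles at the same layer."""
    g_real, g_virtual = gram(feat_real), gram(feat_virtual)
    numerator = ((g_real - g_virtual) ** 2).sum()
    denominator = (g_real ** 2).sum() + (g_virtual ** 2).sum()  # normalization of equation (6)
    return numerator / denominator
```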

Local consistency

The local consistency objective aims to enforce biological consistency at a local cell level and consists of two loss terms: a cell discriminator loss (\({{\mathcal{L}}}_{{\rm{cellDisc}}}\)) and a cell classification loss (\({{\mathcal{L}}}_{{\rm{cellClass}}}\)) (Fig. 2b, panel 3). The cell discriminator loss is inspired by ref. 26 and uses the cell discriminator Dcell to identify whether a cell is real or virtual, in the same way that the patch discriminator of equation (1) attempts to classify patches as real or virtual. \({{\mathcal{L}}}_{{\rm{cellDisc}}}\) takes as input a real (Yp) and a virtual (\({Y}_{\rm{p}}^{{\prime} }\)) target patch and their corresponding cell masks (\({M}_{{Y}_{\rm{p}}}\) and \({M}_{{Y}_{\rm{p}}^{{\prime} }}\), respectively), which include bounding-box demarcation around the cells (Fig. 2b, panel 3). Dcell comprises a feature extractor followed by a RoIAlign layer62 and a final discriminator. The goal of Dcell is to output \({D}_{{\rm{cell}}}({Y}_{\rm{p}},{M}_{{Y}_{\rm{p}}})\to {1}\) and \({D}_{{\rm{cell}}}({Y}_{\rm{p}}^{{\prime} },{M}_{{X}_{\rm{p}}})\to {0}\), where 1 and 0 indicate real and virtual cells (indicated in black and purple, respectively, in Fig. 2b, panel 3). The cell discriminator loss is defined as

$$\begin{array}{l}{{\mathcal{L}}}_{{\rm{cellDisc}}}({D}_{{\rm{cell}}},{Y}_{\rm{p}},{Y}_{\rm{p}}^{{\prime} },{M}_{{Y}_{\rm{p}}},{M}_{{Y}_{\rm{p}}^{{\prime} }})=\frac{1}{2}\,{{\mathbb{E}}}_{{y}_{\rm{p}}\in {Y}_{\rm{p}}}{\left({D}_{{\rm{cell}}}(\;{y}_{\rm{p}},{M}_{{y}_{\rm{p}}})-1\right)}^{2}\\\qquad\qquad\qquad\qquad\qquad\qquad\qquad\quad+\frac{1}{2}\,{{\mathbb{E}}}_{{y}_{\rm{p}}^{{\prime} }\in {Y}_{\rm{p}}^{{\prime} }}{\left({D}_{{\rm{cell}}}\left({y}_{\rm{p}}^{{\prime} },{M}_{{y}_{\rm{p}}^{{\prime} }}\right)\right)}^{2}\end{array}$$
(8)

Although Dcell aims to enforce the generation of realistic-looking cells, it is agnostic to their marker expression, as it does not explicitly capture which cells have a positive or a negative staining status. To account for this, we introduce an additional loss via a classifier Fcell that is trained to explicitly predict the cell staining status. This is achieved with the help of cell labels \({C}_{{Y}_{\rm{p}}^{{\prime} }}\) and \({C}_{{Y}_{\rm{p}}}\): that is, binary variables depicting the positive or negative staining status of a cell (indicated as 1: yellow and 0: blue boxes in Fig. 2b, panel 3). The computation of cell masks and labels is described in detail in the section ‘Cell masking and labelling of IHC images’. The cell-level classification loss is a cross-entropy loss, calculated as

$$\begin{array}{l}{{\mathcal{L}}}_{{\rm{cellClass}}}\left({F}_{{\rm{cell}}},{Y}_{\rm{p}},{Y}_{\rm{p}}^{{\prime} },{M}_{{Y}_{\rm{p}}},{M}_{{Y}_{\rm{p}}^{{\prime} }},{C}_{{Y}_{\rm{p}}},{C}_{{Y}_{\rm{p}}^{{\prime} }}\right)\\=\frac{-1}{| {C}_{{y}_{\rm{p}}}| }\displaystyle\mathop{\sum }\limits_{{j=1}\atop {i\in \{0,1\}}}^{| {C}_{{y}_{\rm{p}}}| }\!{{\mathbb{1}}}_{{C}_{{y}_{\rm{p}}}^{\,j} = i}\times \log \left(p\left({M}_{{y}_{\rm{p}}}^{\;j}\times {y}_{\rm{p}}\right)\right)\\\quad+\frac{-1}{| {C}_{{y}_{\rm{p}}^{{\prime} }}| }\displaystyle\mathop{\sum }\limits_{{j=1}\atop {i\in \{0,1\}}}^{| {C}_{{y}_{\rm{p}}^{{\prime} }}| }\!{{\mathbb{1}}}_{{C}_{{y}_{\rm{p}}^{{\prime} }}^{\,j} = i}\times \log \left(p\left({M}_{{y}_{\rm{p}}^{{\prime} }}^{\;j}\times {y}_{\rm{p}}^{{\prime} }\right)\right)\end{array}$$
(9)

where \(| {C}_{{y}_{\rm{p}}}|\) and \(| {C}_{{y}_{\rm{p}}^{{\prime} }}|\) are the number of cells in yp and \({y}_{\rm{p}}^{{\prime} }\), respectively, \({{\mathbb{1}}}_{(.)}\) is the indicator function and p(.) denotes the cell-level probabilities predicted by Fcell.

The overall local consistency loss is computed as

$$\begin{array}{l}{{\mathcal{L}}}_{{\rm{local}}}\;=\;{\lambda }_{{\rm{cellDisc}}}\times {{\mathcal{L}}}_{{\rm{cellDisc}}}\left({D}_{{\rm{cell}}},{Y}_{\rm{p}},{Y}_{\rm{p}}^{{\prime} },{M}_{{Y}_{\rm{p}}},{M}_{{Y}_{\rm{p}}^{{\prime} }}\right)\\\qquad\qquad\;+\;{\lambda }_{{\rm{cellClass}}}\times {{\mathcal{L}}}_{{\rm{cellClass}}}\left({F}_{{\rm{cell}}},{Y}_{\rm{p}},{Y}_{\rm{p}}^{{\prime} },{M}_{{Y}_{\rm{p}}},{M}_{{Y}_{\rm{p}}^{{\prime} }},{C}_{{Y}_{\rm{p}}},{C}_{{Y}_{\rm{p}}^{{\prime} }}\right)\end{array}$$
(10)

where λcellDisc and λcellClass are the regularization hyperparameters for the cell discriminator and classification loss terms, respectively. Importantly, the local consistency loss can be easily generalized to any other cellular or tissue component (for example, nuclei, glands) that might be relevant to other S2S translation problems, provided that corresponding masks and labels are available.
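A minimal sketch of the cell-level classification term is shown below: cell crops are pooled from the IHC patch with RoIAlign (as in the Dcell/Fcell architecture) and classified as positively or negatively stained. The backbone, head and box format are illustrative placeholders rather than the exact modules used.

```python
# Sketch of the cell-level classification term in equation (9); backbone, head
# and box conventions are illustrative placeholders.
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def cell_class_loss(cell_backbone, cell_head, y_p, boxes, labels):
    """y_p: (1, 3, H, W) IHC patch; boxes: (N, 4) float cell boxes as (x1, y1, x2, y2)
    in patch coordinates; labels: (N,) long tensor of staining status (1 = positive, 0 = negative)."""
    feats = cell_backbone(y_p)                                   # (1, C, H', W') feature map
    scale = feats.shape[-1] / y_p.shape[-1]                      # inverse of the feature stride
    rois = torch.cat([torch.zeros(len(boxes), 1, device=boxes.device), boxes], dim=1)  # prepend batch index 0
    cell_feats = roi_align(feats, rois, output_size=(7, 7), spatial_scale=scale)
    logits = cell_head(cell_feats.flatten(1))                    # (N, 2) positive/negative logits
    return F.cross_entropy(logits, labels)
```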

The complete objective function for optimizing VirtualMultiplexer is given as

$${{\mathcal{L}}}_{{\rm{VirtualMultiplexer}}}={{\mathcal{L}}}_{{\rm{neighbourhood}}}+{{\mathcal{L}}}_{{\rm{global}}}+{{\mathcal{L}}}_{{\rm{local}}}$$
(11)

Cell masking and labelling of IHC images

As already discussed, the local consistency loss of equation (10) needs as input cell masks \({M}_{{X}_{\rm{p}}},{M}_{{Y}_{\rm{p}}}\) and cell labels \({C}_{{X}_{\rm{p}}},{C}_{{Y}_{\rm{p}}}\). However, acquiring these inputs manually for all patches across all antibodies is practically prohibitive, even for relatively small datasets. Automatic nuclei segmentation/detection using pretrained models (for example, HoVerNet63) is a standard task for H&E images, but no such model exists for IHC images. To circumvent this challenge, we use an attractive property of the VirtualMultiplexer: its ability to synthesize virtual images that are pixel-wise aligned in either direction between the source and target domain. Specifically, we train a separate instance of the VirtualMultiplexer that performs IHC → H&E translation. The VirtualMultiplexerIHC→H&E is trained using neighbourhood consistency and global consistency objectives, as previously described. Once trained, it is used to synthesize a virtual H&E image \({X}_{{\rm{img}}}^{{\prime} }\) from a real IHC image Yimg. At this point, we can leverage HoVerNet63 to detect cell nuclei on real and virtual H&E images (Ximg and \({X}_{{\rm{img}}}^{{\prime} }\)) and simply transfer the corresponding cell masks (\({M}_{{X}_{{\rm{img}}}}\) and \({M}_{{X}_{{\rm{img}}}^{{\prime} }}\)) to their pixel-wise aligned IHC counterparts (\({Y}_{{\rm{img}}}^{{\prime} }\) and Yimg, respectively) to acquire \({M}_{{Y}_{{\rm{img}}}^{{\prime} }}\) and \({M}_{{Y}_{{\rm{img}}}}\). This ‘trick’ eliminates the need to train individual cell detection models for each IHC antibody and fully automates the cell masking process in the IHC domain. To acquire cell labels \({C}_{{Y}_{{\rm{img}}}^{{\prime} }}\) and \({C}_{{Y}_{{\rm{img}}}}\), we use only region annotations in Yimg, where the experts partially annotated areas as positive or negative stainings in a few representative images. Because IHC stainings are specialized in delineating positive or negative staining status, the annotation was easy and fast and required approximately 2–3 minutes per image and per antibody marker. We also train cell detectors for the source and target domain: that is, \({D}_{{\rm{cell}}}^{{\rm{source}}}\) and \({D}_{{\rm{cell}}}^{{\rm{target}}}\), respectively. Provided with the annotations, \({D}_{{\rm{cell}}}^{{\rm{target}}}\) is trained as a CNN patch classifier. The classifier predictions on Yimg combined with \({M}_{{Y}_{\rm{p}}}\) result in \({C}_{{Y}_{p}}\). The above region predictions on Yimg are transferred onto \({X}_{{\rm{img}}}^{{\prime} }\). Afterwards, \({X}_{{\rm{img}}}^{{\prime} }\) and the transferred annotations are used to train \({D}_{{\rm{cell}}}^{{\rm{source}}}\) as a CNN patch classifier. The classifier predictions on Ximg combined with \({M}_{{X}_{\rm{p}}}\) result in \({C}_{{X}_{p}}\).
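The overall masking workflow can be summarized with the following high-level sketch; all function and model handles (for example, hovernet_detect) are hypothetical placeholders standing in for the trained translation models and the HoVerNet detector.

```python
# High-level sketch of the automated IHC cell-masking 'trick'; all handles are
# hypothetical placeholders, not actual APIs of the released code or HoVerNet.
def ihc_cell_masks(x_img, y_img, vm_he_to_ihc, vm_ihc_to_he, hovernet_detect):
    y_virtual = vm_he_to_ihc(x_img)            # virtual IHC, pixel-wise aligned to real H&E x_img
    x_virtual = vm_ihc_to_he(y_img)            # virtual H&E, pixel-wise aligned to real IHC y_img
    m_x = hovernet_detect(x_img)               # nuclei detected on the real H&E image
    m_x_virtual = hovernet_detect(x_virtual)   # nuclei detected on the virtual H&E image
    # Pixel-wise alignment lets the H&E-derived masks be reused directly on the IHC side.
    return {"virtual_ihc": (y_virtual, m_x), "real_ihc": (y_img, m_x_virtual)}
```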

Implementation and training details

The architectural choices of the VirtualMultiplexer were set as follows: G is a ResNet64 with nine residual blocks, D is a PatchGAN discriminator12, Dcell includes four stride-2 feature convolutions followed by a RoIAlign layer and a discrimination layer and Fcell includes four stride-2 feature convolutions and a two-layer MLP. We use Xavier weight initialization65, instance normalization66 and a batch size of one image. We use least square GAN loss67 for \({{\mathcal{L}}}_{{\rm{adv}}}\). The model hyperparameters for the loss terms of the VirtualMultiplexer are set as follows: λNCE is 1 with temperature τ equal to 0.08, λcontent ∈ {0.01, 0.1}, λstyle ∈ {5, 10}, λcellDisc ∈ {0.5, 1} and λcellClass ∈ {0.1, 0.5}. VirtualMultiplexer is optimized for 125 epochs using the Adam optimizer68 with momentum parameters β1 = 0.5 and β2 = 0.999. Different learning rates (lr) are employed for different consistency objectives: that is, for neighbourhood consistency, lrG and lrD are set to 0.0002; for global consistency, learning rate lrG is chosen from {0.0001, 0.0002}; and for local consistency, learning rates \({\text{lr}}_{{D}_{{\rm{cell}}}}\) and \({\text{lr}}_{{F}_{{\rm{cell}}}}\) are chosen from {0.00001, 0.0001, 0.0002}. Among other hyperparameters, the number of tiles extracted per image to compute \({{\mathcal{L}}}_{{\rm{content}}}\) and \({{\mathcal{L}}}_{{\rm{style}}}\) is set to eight; the content layer in F is relu2_2; the style layers are relu1_2, relu2_2, relu3_3, relu4_3; and the number of cells per patch to compute \({{\mathcal{L}}}_{{\rm{cellDisc}}}\) is set to eight.

GT architecture

The GT architecture, proposed by ref. 41, fuses a graph neural network and a vision transformer (ViT) to process histopathology images. The graph neural network operates on a graph-structured representation of a histopathology image, where the nodes and edges of the graph denote patches and interpatch spatial connectivity, and the nodes encode patch features extracted from a pretrained ResNet-50 network64. The graph representation undergoes graph convolutions to contextualize the node features within the local tissue neighbourhood. Specifically, the GT employs a graph convolution layer69 to learn contextualized node embeddings through propagating and aggregating neighbourhood node information. Subsequently, a ViT layer operates on the contextualized node features, leverages self-attention to weigh the importance of the nodes and aggregates the node information to render an image-level feature representation. Finally, an MLP maps the image-level features to a downstream image label. Note that histopathology images can have different spatial dimensions; therefore, their graph representations can have a varying number of nodes. Also, the number of nodes can be very high when operating on gigapixel-sized WSIs. These two factors can potentially hinder the integration of the graph convolution layer with the ViT layer. To address these challenges, GT introduces a mincut pooling layer70, which reduces the number of nodes to a fixed number of tokens while preserving the local neighbourhood information of the nodes.
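The flow of information through the GT can be summarized by the following condensed sketch (graph convolution, mincut-style pooling to a fixed number of tokens, transformer encoder and MLP head); the layer choices are simplified stand-ins for the components of ref. 41, not the official implementation.

```python
# Condensed sketch of the GT forward pass; layers are simplified stand-ins
# for the graph convolution, mincut pooling and ViT blocks of the actual model.
import torch
import torch.nn as nn

class GraphTransformerSketch(nn.Module):
    def __init__(self, in_dim=1024, hid_dim=32, n_tokens=100, n_classes=2):
        super().__init__()
        self.gcn = nn.Linear(in_dim, hid_dim)        # stands in for a graph convolution layer
        self.assign = nn.Linear(hid_dim, n_tokens)   # mincut-style soft cluster assignment
        block = nn.TransformerEncoderLayer(d_model=hid_dim, nhead=8, batch_first=True)
        self.vit = nn.TransformerEncoder(block, num_layers=3)
        self.head = nn.Linear(hid_dim, n_classes)    # MLP head mapping to the image label

    def forward(self, x, adj):
        """x: (N, in_dim) patch features; adj: (N, N) normalized adjacency matrix."""
        h = torch.relu(self.gcn(adj @ x))            # aggregate neighbourhood information
        s = torch.softmax(self.assign(h), dim=-1)    # (N, n_tokens) per-node cluster weights
        tokens = (s.t() @ h).unsqueeze(0)            # pool N nodes into a fixed set of tokens
        z = self.vit(tokens).mean(dim=1)             # image-level representation
        return self.head(z)                          # logits for the downstream endpoint
```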

Implementation and training details

The architecture of the GT follows the official implementation on GitHub (https://github.com/vkola-lab/tmi2022). Each input image was cropped into a bag of 256 × 256 non-overlapping patches at ×10 magnification, and patches with a non-tissue area greater than 10% were discarded as background. The patches were encoded using the ResNet-5064 model pretrained on the ImageNet dataset60. A graph representation was constructed from the patches using an eight-node connectivity pattern. The GT network consisted of one graph convolutional layer, and the ViT layer configurations were set as follows: number of ViT blocks = 3, MLP size = 128, embedding dimension of each patch = 32 and number of attention heads = 8. The model hyperparameters were set as follows: number of clusters in mincut pooling ∈ {50, 100}, Adam optimizer with initial learning rate ∈ {0.0001, 0.00001}, a cosine annealing scheme for learning rate scheduling and a mini-batch size of eight. The GT models were trained for 400 epochs with early stopping.
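As a concrete illustration of the eight-node connectivity pattern, the snippet below links each tissue patch to its eight immediate neighbours on the patch grid. The helper is our own sketch and not part of the reference preprocessing code.

```python
import numpy as np

def eight_connectivity_edges(coords: np.ndarray) -> np.ndarray:
    """coords: (n_patches, 2) integer grid positions. Returns a (2, n_edges) edge index."""
    index = {tuple(rc): i for i, rc in enumerate(map(tuple, coords))}
    edges = []
    for (r, c), i in index.items():
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == 0 and dc == 0:
                    continue
                j = index.get((r + dr, c + dc))
                if j is not None:                         # neighbour exists (tissue patch)
                    edges.append((i, j))
    return np.array(edges, dtype=np.int64).T

# Example: a 4 x 4 patch grid yields 84 directed edges under eight-connectivity.
coords = np.array([[r, c] for r in range(4) for c in range(4)])
edge_index = eight_connectivity_edges(coords)             # shape (2, 84)
```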

Datasets

The VirtualMultiplexer was trained using the EMPaCT TMA dataset; an independent subset of EMPaCT was used for internal testing. The VirtualMultiplexer was further evaluated in a zero-shot fashion—that is, without any retraining or fine-tuning—on three external prostate cancer datasets (prostate cancer WSIs, SICAP42 and PANDA43 needle biopsies), on an independent PDAC dataset (PDAC TMAs) and on TCGA data from breast and colorectal cancer. In all cases, independent GTs were trained and tested on each dataset using both real and virtually stained samples to address various downstream classification tasks. Details on all datasets used follow.

EMPaCT

The dataset contains TMAs from 210 primary prostate tissues as part of EMPaCT and the Institute of Tissue Pathology in Bern. The study followed the guidelines of the World Medical Association Declaration of Helsinki 1964, updated in October 2013, and was conducted after approval by the Ethics Committees of Bern (CEC ID2015-00128). For each patient, four cores were selected, with two of them representing a low Gleason pattern and the other two a high Gleason pattern. Consecutive slices from each core were stained with H&E and IHC using multiple antibodies against the nuclear markers NKX3.1 and AR, the tumour markers p53 and ERG, and the membrane markers CD44 and CD146. TMA FFPE sections of 4 μm were deparaffinized and used for heat-mediated antigen retrieval (citrate buffer, pH 6, Vector Labs; or Tris-HCl, pH 9). Sections were blocked for 10 min in 3% H2O2, followed by 30 min room temperature incubation in 1% bovine serum albumin in phosphate-buffered saline–0.1% Tween 20. The following antibodies were used: anti-AR (Dako Agilent, catalogue no. M3562, AR441, 1:100 dilution), anti-NKX3.1 (Athena Enzyme Systems, catalogue no. 314, lot 18025, 1:200), anti-p53 (Dako Agilent, catalogue no. M7001, DO-7, 1:800), anti-CD44 (Abcam, catalogue no. ab16728, 156-3C11, 1:2000), anti-ERG (Abcam, catalogue no. ab133264, EPR3864(2), 1:500) and anti-CD146 (Abcam, catalogue no. ab75769, EPR3208, 1:500). Images were acquired using a 3D Histech Panoramic Flash II 250 scanner at ×20 magnification (resolution 0.24 μm per pixel). The cores were annotated at patient level by expert uro-pathologists with binary labels for overall survival status (0, alive/censored; 1, prostate-cancer-related death) and disease progression status (0, no recurrence; 1, recurrence). Clinical follow-up was recorded on a per-patient basis, with a maximum follow-up time of 12 years. For both the survival and disease progression clinical endpoints, the available data were imbalanced in terms of class distributions. Access is possible upon request to the corresponding authors. The distribution of cores per clinical endpoint for the EMPaCT dataset is summarized in Supplementary Table 2.

Prostate cancer WSIs

Primary stage prostate cancer FFPE tissue sections (4 μm) were deparaffinized and used for heat-mediated antigen retrieval (citrate buffer, pH 6, Vector Labs). Sections were blocked for 10 min in 3% H2O2, followed by 30 min room temperature incubation in 1% bovine serum albumin in phosphate-buffered saline–0.1% Tween 20. The following primary antibodies were used: anti-CD146 (Abcam, catalogue no. ab75769, EPR3208, 1:500), anti-AR (Abcam, catalogue no. ab133273, EPR1535, 1:100) and anti-NKX3.1 (Cell Signaling, catalogue no. 83700T, D2Y1A, 1:200). Secondary anti-rabbit antibody Envision horseradish peroxidase (DAKO, Agilent Technologies, catalogue no. K400311-2, undiluted) was incubated for 30 min, and signal detection was done using 3-amino-9-ethylcarbazole substrate (DAKO, Agilent Technologies). Sections were counterstained with hematoxylin and mounted with aquatex. Images were acquired using a 3D Histech Panoramic Flash II 250 scanner at ×20 magnification (resolution 0.24 μm per pixel).

SICAP

The dataset contains 155 H&E-stained WSIs from needle biopsies taken from 95 patients, split into 18,783 patches of size 512 × 512 (ref. 42). The WSIs were reconstructed by stitching the patches. The WSIs were scanned at ×40 magnification with a Ventana iScan Coreo scanner and downsampled to ×10 magnification. The WSIs were annotated for Gleason grades by expert uro-pathologists at the Hospital Clínico of Valencia, Spain.

PANDA

The dataset includes 5,759 H&E-stained needle biopsies from 1,243 patients at the Radboud University Medical Center, Netherlands71 and 5,662 H&E-stained needle biopsies from 1,222 patients at various hospitals in Stockholm, Sweden72. The slides from Radboud were scanned with a 3D Histech Panoramic Flash II 250 scanner at ×20 magnification (resolution 0.24 μm per pixel) and were downsampled to ×10. The slides from Sweden were scanned with a Hamamatsu C9600-12 and an Aperio Scan Scope AT2 scanner at ×10 magnification with a pixel resolution of 0.45202 μm and 0.5032 μm, respectively. The Gleason grades of the biopsies were annotated by expert uro-pathologists and were released as part of the PANDA challenge43. We removed the noisy and inconspicuously labelled biopsies from the dataset, resulting in 4,564 and 4,988 biopsies from the Radboud and the Swedish cohorts, respectively (9,552 biopsies in total). The distribution of WSIs across Gleason grades for both SICAP and PANDA datasets is shown in Supplementary Table 3.

PDAC

The PDAC TMA contained cancer tissue from 117 PDAC cases (50 female, 67 male) resected in a curative setting at the Department of Visceral Surgery of Inselspital Bern and diagnosed at the Institute of Tissue Medicine and Pathology (ITMP) of the University of Bern between the years 2014 and 2020. The study followed the guidelines of the World Medical Association Declaration of Helsinki 1964, updated in October 2013, and was conducted after approval by the Ethics Committees of Bern (CEC ID2020-00498). All participants provided written general consent. The TMA contained three spots from each case (tumour front, tumour centre, tumour stroma), for a total of 351 tissue spots. Thirteen of these 117 cases were treated by neoadjuvant chemotherapy followed by surgical resection and adjuvant therapy, and the remaining 104 cases were resected curatively and received adjuvant therapy. All cases were comprehensively characterized clinico-pathologically, including TNM stage, as part of the master’s thesis of Jessica Lisa Rohrbach at ITMP, supervised by Martin Wartenberg. All cases were Union for International Cancer Control (UICC) tumour stage I, stage II or stage III cases on pathologic examination, according to the UICC TNM Classification of Malignant Tumours, 8th edition73; the TMA did not include UICC tumour stage IV cases. In all of our analyses, including the TNM prediction (Fig. 6d), we excluded the 13 neoadjuvant cases and considered only the 104 cases that received adjuvant therapy. The distribution of cores across the three TNM stages is reported in Supplementary Table 4.

TCGA

The dataset includes example H&E WSIs from breast cancer (BRCA) and colorectal cancer (CRC) from TCGA, available at the GDC data portal (https://portal.gdc.cancer.gov) as diagnostic slides under project IDs TCGA-BRCA and TCGA-CRC, respectively.

Data preprocessing

For all datasets used, we followed a tissue region detection and patch extraction preprocessing procedure. Specifically, the tissue region was segmented using the preprocessing tools in the HistoCartography library74. A binary tissue mask denoting the tissue and non-tissue regions was computed for each downsampled input image by iteratively applying Gaussian smoothing and Otsu thresholding until the mean of the non-tissue pixels was below a threshold. The estimated contours of the detected tissue and of the tissue cavities were then filtered by area to generate the final segmentation mask. Subsequently, non-overlapping patches of size 256 × 256 were extracted at ×10 magnification using the segmentation contours. The extracted H&E and IHC patches of the EMPaCT dataset were used for training and internal validation of the VirtualMultiplexer. For the unseen datasets (prostate cancer WSIs, SICAP, PANDA, PDAC, TCGA), the images were first stain-normalized to mitigate the staining appearance variability with respect to the EMPaCT TMAs, and then H&E patches were extracted. Specifically, for the SICAP, PANDA and PDAC datasets, we applied the Vahadane stain normalization method75, from the HistoCartography library74, to the entire images. We masked out the blank regions by applying a threshold in the Lab colour space and computed the stain-density maps using only the tissue regions. Afterwards, the target stain-density maps were combined with the reference colour appearance matrix to produce the normalized images, as proposed by the Vahadane method. Supplementary Fig. 1 presents a sample unnormalized WSI from the PANDA dataset and the corresponding stain-normalized WSI based on the reference EMPaCT TMA. For the prostate cancer and TCGA WSIs, we followed the same procedure but with stain-density maps extracted at a lower magnification (×2.5) for computational efficiency. Note that the VirtualMultiplexer is independent of the stain normalization method and can be trained using H&E images normalized by other advanced stain normalization algorithms: for example, deep learning-based methods76.
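For illustration, a single-pass version of the tissue-masking and patch-extraction steps could look as follows. The actual pipeline relies on the iterative HistoCartography procedure described above; the smoothing strength and the minimum tissue fraction used here are illustrative values, not the pipeline's settings.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.filters import gaussian, threshold_otsu

def tissue_mask(image_rgb: np.ndarray, sigma: float = 5.0) -> np.ndarray:
    """Binary tissue mask from one Gaussian smoothing + Otsu thresholding pass."""
    grey = rgb2gray(image_rgb)                  # background is bright in brightfield scans
    smoothed = gaussian(grey, sigma=sigma)
    thresh = threshold_otsu(smoothed)
    return smoothed < thresh                    # True where tissue (darker than background)

def extract_patches(image_rgb: np.ndarray, mask: np.ndarray, size: int = 256,
                    min_tissue: float = 0.5):
    """Yield non-overlapping patches whose tissue fraction exceeds `min_tissue`."""
    h, w = mask.shape
    for r in range(0, h - size + 1, size):
        for c in range(0, w - size + 1, size):
            if mask[r:r + size, c:c + size].mean() >= min_tissue:
                yield image_rgb[r:r + size, c:c + size]
```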

Method evaluation

Patch-level evaluation

We use the FID score77 to compare the distribution of the virtual IHC patches with the distribution of the real IHC patches, as shown in Fig. 3. The computation begins by projecting the virtual and the real IHC patches to an embedding space using the InceptionV3 (ref. 77) model, pretrained on ImageNet60. The extracted embeddings are used to estimate multivariate normal distributions \({\mathcal{N}}(\;{\mu }_{\rm{r}},{\Sigma }_{\rm{r}})\) for the real data and \({\mathcal{N}}(\;{\mu }_{\rm{v}},{\Sigma }_{\rm{v}})\) for the virtual data. Finally, the FID score is computed as

$$\text{FID}=\|{\mu }_{\rm{r}}-{\mu }_{\rm{v}}{\|}^{2}+{\mathrm{Tr}}\left({\Sigma }_{\rm{r}}+{\Sigma }_{\rm{v}}-2{({\Sigma }_{\rm{r}}{\Sigma }_{\rm{v}})}^{\frac{1}{2}}\right)$$
(12)

where μr and μv are the feature-wise means of the real and virtual patch embeddings, Σr and Σv are the corresponding covariance matrices, and Tr is the trace function. A lower FID score indicates a lower disparity between the two distributions and thereby a higher staining efficacy of the VirtualMultiplexer. To ensure reproducibility, we ran each model three times with three independent initializations and computed the mean and standard deviation for each model (barplot height and error bar in Fig. 3). We used a 70%:30% ratio to split the data into train and test sets, respectively. As a different number of IHC stainings was available per marker in the EMPaCT data, the exact number of cores used per marker is given in Supplementary Table 5.
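Given precomputed InceptionV3 embeddings, equation (12) can be evaluated directly. The snippet below is a straightforward implementation of the formula; taking the real part of the matrix square root is a standard numerical safeguard and not part of the equation itself.

```python
import numpy as np
from scipy import linalg

def fid_score(real_feats: np.ndarray, virtual_feats: np.ndarray) -> float:
    """FID between two sets of embeddings, each of shape (n_samples, n_features)."""
    mu_r, mu_v = real_feats.mean(axis=0), virtual_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_v = np.cov(virtual_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_v, disp=False)   # matrix square root
    covmean = covmean.real                                     # drop tiny imaginary parts
    diff = mu_r - mu_v
    return float(diff @ diff + np.trace(sigma_r + sigma_v - 2.0 * covmean))
```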

Image-level evaluation

We used a number of downstream classification tasks to assess the discriminative ability of the virtually stained IHC images on the EMPaCT, SICAP, PANDA and PDAC datasets. We further used these tasks to demonstrate the utility of virtually multiplexed staining in comparison to standalone real H&E, real IHC and virtual IHC staining. Specifically, given the aforementioned images, we constructed graph representations as described in the ‘GT architecture’ section. Subsequently, GTs41 were trained under unimodal and multimodal settings using both real and virtually stained images and evaluated on a held-out independent test dataset. The final classification scores were reported using a weighted F1 metric, where a higher score indicates better classification performance and thereby higher discriminative power of the utilized images. As before, we ran each model three times with three independent initializations and computed the mean and standard deviation for each model (barplot heights and error bars in Figs. 5 and 6). In all cases, we used a 60%:20%:20% ratio to split the data into train, validation and test sets, respectively. The exact numbers of train, validation and test samples used per task, marker and training setting in the EMPaCT dataset are given in Supplementary Table 6.
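As a minimal illustration of the reporting scheme, the weighted F1 score of each run and its mean and standard deviation across the three initializations can be computed as below; the labels and predictions are placeholders.

```python
import numpy as np
from sklearn.metrics import f1_score

# Placeholder test labels and predictions from three independently initialized runs.
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
runs = [np.array([0, 1, 1, 0, 1, 1, 1, 1]),
        np.array([0, 1, 0, 0, 1, 0, 1, 1]),
        np.array([1, 1, 1, 0, 1, 0, 1, 0])]

scores = [f1_score(y_true, y_pred, average="weighted") for y_pred in runs]
print(f"weighted F1: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```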

For the SICAP, PANDA and PDAC datasets, the exact numbers of samples used in the train, validation and test splits coincide for all unimodal and multimodal models of Fig. 6 and are reported in Supplementary Table 7.

Computational hardware and software

The image datasets were preprocessed on POWER9 central processing units and one NVIDIA Tesla A100 graphics processing unit using the HistoCartography library74. The deep learning models were trained on NVIDIA Tesla P100 graphics processing units using PyTorch (v.1.13.1) (ref. 78) and PyTorch Geometric (v.2.3.0) (ref. 79). The entire pipeline was implemented in Python (v.3.9.1).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.