Breast cancer histopathology image-based gene expression prediction using spatial transcriptomics data and deep learning

Tumour heterogeneity in breast cancer poses challenges in predicting outcome and response to therapy. Spatial transcriptomics technologies may address these challenges, as they provide a wealth of information about gene expression at the cell level, but they are expensive, hindering their use in large-scale clinical oncology studies. Predicting gene expression from hematoxylin and eosin stained histology images provides a more affordable alternative for such studies. Here we present BrST-Net, a deep learning framework for predicting gene expression from histopathology images using spatial transcriptomics data. Using this framework, we trained and evaluated four distinct state-of-the-art deep learning architectures, which include ResNet101, Inception-v3, EfficientNet (with six different variants), and vision transformer (with two different variants), all without utilizing pretrained weights for the prediction of 250 genes. To enhance the generalisation performance of the main network, we introduce an auxiliary network into the framework. Our methodology outperforms previous studies, with 237 genes identified with positive correlation, including 24 genes with a median correlation coefficient greater than 0.50. This is a notable improvement over previous studies, which could predict only 102 genes with positive correlation, with the highest correlation values ranging from 0.29 to 0.34.


Introduction
Breast cancer is the most prevalent malignancy in women and is a molecularly diverse disease.In 2018, 2.1 million women were diagnosed with breast cancer, one every 18 seconds, and 626,679 women died, a 3.1% yearly rise [1].Early breast cancer, where the disease is limited to the breast with or without the involvement of the axillary lymph nodes, has a good prognosis with a 5-year survival of close to 90%.However, advanced metastatic disease is often incurable with existing therapies, which aim to delay progression and treat symptoms.According to the World Health Organization (WHO), early identification of cancer significantly improves the likelihood of appropriate decision making and successful treatment [2].
Pathological analysis of a tissue biopsy removed from the breast tumour is acknowledged as the gold standard method for the diagnosis, subtype classification and staging.Heterogeneity is evident at the morphological and molecular levels of breast cancer, where divergent patterns of gene expression and interaction with host immune and stromal populations can vary widely within and between similar tumour types.A better understanding of spatial gene expression is a current area of very active cancer research to better capture spatial differences in gene expression which may more accurately reflect patient outcome and response to treatment, which are currently difficult to gauge.
Emerging spatial transcriptomics (ST) technologies enable profiling gene expression at a single-cell resolution while preserving spatial orientation and cellular composition within a tissue [3,4].ST is quickly becoming an extension of single-cell RNA sequencing (scRNAseq).To understand the complex transcriptional architecture of biological systems, it is necessary to determine how cells are arranged in space and how gene expression changes in different foci of a targeted tumour or tissue.ST methods based on next generation sequencing (NGS), such as 10× Genomics' Visium, Slide-Seq [5], Slide-Seq2 [6], and HDST [7], barcode whole transcriptomes but have restricted capture rates and resolutions greater than a single cell (around 100 µm for Visium and 10 µm for Slide-Seq).Moreover, NGS-based approaches offer unbiased profiling of large tissue sections without requiring a list of target genes, unlike image-based technologies such as in-situ hybridization (ISH) and in-situ sequencing (ISS) [8,9,10].In order to accurately and robustly analyse the data produced by ST technologies, specialised computational methods are required because of the data's intrinsic noise, high dimensionality, sparseness, and multimodality (containing histology pictures, count matrices, etc.).
Although ST provides a plethora of data, its generation is prohibitively costly, precluding its large-scale application.In addition, with commercial equipment like the 10× Genomics Visium system, significant expertise is required to create high-quality expression patterns for whole tissue samples [11].On the other hand, compared to ST, hematoxylin & eosin (H&E)-stained histology images are not only easier and less expensive to acquire, but they are also routinely used in clinical practice.Therefore, the recent development of predicting spatial gene expression data from routine histology images offers advantages in cost, speed, and availability of such data.These projections have the potential to provide virtual ST data, which will make it possible to investigate the regional differences in gene expression on a large scale.
While existing analysis pipelines mostly use gene expression profiles, not image pixel values, the integration of imaging data with gene expression data is a new area of research [12].Moreover, whole-slide images (WSIs) have been utilised to predict global gene expression patterns via HE2RNA [13], demonstrating their strong correlation with transcription.HE2RNA is trained to predict gene expression patterns from The Cancer Genome Atlas (TCGA) WSIs without specialist interpretation.Similarly, ST-Net uses ST data in conjunction with DenseNet to make predictions on the spatially varying gene expression of each spot over WSIs [14].Prediction of the spatially resolved transcriptome makes it possible to use images to look for biomarkers that change in various tumour location spots.
Although these techniques have performed well, they do have certain drawbacks.HE2RNA lacks the capacity to learn from ST data since it was designed for bulk RNA sequencing.Despite the fact that ST-Net is purpose-built for ST, its convolutional neural network (CNN) model is pretrained on the ImageNet dataset, which fails to take into account the observed individual spatial locations of spots.Expression of genes generally has local patterns, thus it is vital to consider spatial location when training networks to make accurate predictions.
To overcome these challenges, we have developed BrST-Net (short for Breast ST-Net), a framework based on deep learning that uses spatial transcriptomics data to predict gene expression directly from breast histopathology images (Fig. 1).This framework includes stain normalisation of tissue sections, filtering of gene and ST spots, and the generation of image patches depending on the locations of the spots.The image patches are then utilised to train CNN and transformer models, with the primary network devoted to predicting 250 highly expressed genes.We introduce an auxiliary network before the end of the main network to forecast the remaining genes, which ultimately enhances the generalisation performance of the main network by functioning as an alternative kind of regularisation [15].We have thoroughly trained and evaluated 10 recent cutting-edge deep learning models to assess their prediction capabilities on the ST dataset.Our suggested approach outperforms existing work: the framework using EfficientNet-b0 with an auxiliary network can predict 237 genes with positive correlation, whereas ST-Net can only predict 102 genes with positive correlation.ST-Net identified the top-5 predicted genes in their framework with a median correlation coefficient value of 0.34, 0.33, 0.31, 0.30, and 0.29, while our framework using EfficientNet-b0 with an auxiliary network can predict 24 genes with a median correlation coefficient greater than 0.50, which is a substantial performance boost.

Dataset description
We used the fifth edition of the publicly accessible spatial transcriptomics dataset [16] as also used by ST-Net [14].It comprises a total of 68 WSIs of H&E-stained frozen tissue sections, from 23 breast cancer patients having various subtypes of breast cancer, including luminal A, luminal B, triple-negative, human epidermal growth factor receptor 2 (HER2) luminal, and HER2 non-luminal.Each section is scanned at 20× magnification, with image size approximately 9,500 × 9,500 pixels in JPEG format, and contains around 450 spots, each having a diameter of 100 µm and approximately 3,000 gene expression counts.In total, the dataset comprises 30,612 spots, and includes spot coordinates, count matrices, and coordinate files.

Gene n n
Auxiliary predictions 2.2 Preprocessing of the dataset

Stain normalization
Histology samples stained with H&E often vary in colour between and within laboratories from one batch to the next.Stain normalisation is employed as a preprocessing step in computational pathology algorithms to regulate variances and has a significant impact on the outcome of automated image analysis methods [17].In our work, we employed StainTools [18], using the Vahadane technique to normalize the stain variance in the target image and the Luminosity Standardizer function to standardize the brightness, both of which facilitate consistent interpretation of the images [19].

Gene and spots filtering
To eliminate background noise from gene expression data per spot, we removed genes with a mean expression of zero across all samples, and selected spots with at least 1,000 total read counts for analysis.The resulting filtered genes were stored in a ZipFile format (.npz) comprising multiple arrays, including count, pixel, patient, and index arrays.This resulted in approximately 28,792 spot files from all samples, which were then utilized to generate corresponding image patches.After filtering, approximately 6,000 genes remained.The complete preprocessing pipeline is described in detail in [20].

Generating image patches
Since the original images are too large for direct input to a CNN, we extracted N patches from the WSIs centered on the ST spots, using their size and position.The patches were then flattened into an N × (3 × m × n) matrix, where m and n are the patch's width and height, respectively, and each patch has 3 colour channels.In our experiment, a value of m = n = 224 pixels was selected.To ensure that the training dataset only contained informative data, we excluded patches with more than 50% white pixels, as these are unlikely to contain relevant information and may introduce noise in the data.In total, 27,652 image patches were acquired for training.

Gene symbol conversion
The dataset provides a comprehensive list of genes, including their Ensembl identifiers (IDs), which are widely used as a standard for gene annotation and identification.The IDs were translated from the given format to gene symbols according to the HUGO Gene Nomenclature Committee (HGNC) database (genenames.org) and converted into a Pickle file, retaining just the Ensembl ID and its related symbol.Pickle is a Python library to store complex data as binary files.

Data augmentation
Histology image analysis is supposed to be orientationally invariant, just like a pathologist can analyse a microscopic image from any angle.Therefore, to facilitate the generalisability of the trained networks, we used PyTorch's "torchvision.transforms"module to randomly transform image patches during training.It performs random horizontal and vertical flipping, and random rotation of 90 degrees to increase the diversity of the training data.All patches were converted to a PyTorch tensor and normalised to zero mean and unit variance.

Gene expression prediction models
We experimented with 10 state-of-the-art CNN and transformer models as the basis for our BrST-Net framework (Fig. 1), including ResNet, InceptionNet, six variants of EfficientNet, and two vision transformer models, briefly described below.These main networks were trained to learn spatial features from histology images and predict gene expression levels.However, due to the high dimensionality of ST data, CNN architectures may be limited in their ability to predict gene expression levels accurately for the remaining genes.To address this limitation, we modified the main network by introducing an auxiliary network (AuxNet) just before the end of the network.The AuxNet is a simple feed-forward neural network with a single fully connected layer that predicts the remaining genes.It enhances the generalisation performance of the main network and helps prevent overfitting, providing an alternative form of regularization.

ResNet model
ResNet (Residual Network) [21] is a deep learning architecture widely used in computer vision tasks such as image classification, object detection, and semantic segmentation [22,23].It uses residual connections to improve information propagation and learning, expressed as F (x) = H(x) + x, where x is the input, H(x) is the learned residual mapping, and F (x) is the residual block output.This helps mitigate the vanishing gradient problem in deep neural networks.In this study, ResNet101 is used as the main network and fully trained on the ST dataset, making this the first study to use ResNet for gene expression prediction.

Inception model
Inception [24] is a deep convolutional neural network designed for image classification.It uses a combination of convolutional and pooling layers, along with auxiliary classifiers, to capture both local and global features in the input image.The final prediction is made by combining the outputs of the main and auxiliary classifiers, represented mathematically as y = W x + b, where x is the input feature vector, W is the weight matrix and b the bias term, and y is the output.Different versions exist, with Inception-v3 achieving state-of-the-art performance on benchmark datasets and now being widely used in real-world applications [25,26].In this study, we fully trained it on the ST dataset and evaluated its performance for gene expression prediction.

EfficientNet model
The conventional approach to increasing the learning capacity of CNN models involves tweaking the network's depth d and width w and the image resolution r.While this enhances accuracy, it often requires significant manual tuning and can result in suboptimal performance.This has been addressed in literature by examining various scaling strategies and proposing a systematic network architecture scaling method, leading to the development of EfficientNet [27].The scaling method employs a user-defined coefficient, ϕ, to scale up networks in a more systematic manner, according to the following equations: where α, β, and γ are calculated through automatic hyperparameter optimisation using grid search.The scaling of the model is regulated by ϕ, which controls the allocation of computational resources.In pathological image analysis, multiple studies have utilized the EfficientNet model and achieved superior accuracy [28,29,30,31,32].To our knowledge, however, our study is the first to use EfficientNet for gene expression prediction using the ST dataset.Specifically, we experimented with versions b0, b1, b2, b3, b4, and b5.

Vision transformer model
Recent editions of the ImageNet Large-Scale Visual Recognition Competition have witnessed the dominance of the vision transformer (ViT) over other state-of-the-art approaches.Inspired by the challenge of natural language processing (NLP) to deal with varying sentence lengths, ViT aims to deal with dependencies at various spatial distances, and partitions an image into a fixed number of patches.In the feature extraction stage of a ViT model, a 2D image x ∈ R X×Y ×C of size X × Y pixels and C channels, is transformed into a 1D sequence x p ∈ R N ×P 2 ×C of N patches of size P 2 , where N is equal to the input sequence length of the transformer encoder and is calculated as N = XY /P 2 .In the transformer, all patches are flattened and then projected linearly to D dimensions, referred to as patch embeddings.These patch embeddings are the input to the transformer encoder, preserving positional information [33].The transformer encoders consist of multiple multihead self-attention (MHSA) and multilayer perceptron (MLP) blocks.The MHSA layer in the transformer is capable of learning the attention for spots or image patches and involves a linear combination of several attention heads: where c denotes the number of heads, W 0 are the weights used to aggregate the attention heads, and Q, K, and V represent the query, key, and value, respectively.Each head is calculated as: where Here, QK T / √ d K refers to the attention map, whose pattern is N × N , and V represents the self-attention mechanism's value, where V = Q = K.The attention mechanisms produces an N × 1024 matrix as output.In our experiment, we evaluated the performance of the ViT-B16 and ViT-B32 models, where B16 and B32 refer to the partitioning of an image into 16 and 32 patches, respectively [33].

Loss function
To train a network, its weights are iteratively updated to minimise a given loss function.In our framework, we use the following loss: where L is the overall loss, L main is the loss of the main network, L aux is the loss of the auxiliary network, with both losses being calculated using cross-entropy, and λ is a hyperparameter used to balance the contribution of the two losses.We experimented with various values of λ and found that a value of 40 gives good performance.Stochastic gradient descent (SGD) optimization was used to minimise the loss.

Error measures
To quantify errors between predicted and actual values, we used the mean absolute error (MAE) and root mean squared error (RMSE): where n is the total number of observations, y i is the true value of observation i, and ŷi is its predicted value.Both error measures yield an average prediction error of a model, ranging from 0 to ∞, with smaller values indicating more accurate predictions.

Correlation measure
To assess the reliability of gene expression predictions from histopathology images, we used the Pearson correlation coefficient (Pcc) [34,35,36]: where a i and b i are the true and predicted genes, respectively, n is the total number of genes, ā is the mean of the a i , and b is the mean of the b i .The Pcc evaluates the linear relationship between two variables and assigns a score ranging from −1 to 1.A score of 1 means there is a perfect positive linear correlation, a score of −1 indicates a perfect negative linear relationship, and a score of 0 implies there is no linear relationship.In practice, a Pcc score between 0.5 to 1 indicates strong correlation, a score between 0.3 to 0.5 indicates medium correlation, and a score between 0.1 to 0.3 indicates weak correlation.
3 Experimental results

Implementation and setup
BrST-Net was implemented in PyTorch and Python 3.7.For training, the batch size was set to 32, the number of epochs to 200, the gene filter to 250, and the learning rate for SGD was set to 0.001.The models were trained on 2 GPUs with 24 CPU cores and 50 GB of RAM on the Gadi cluster of the National Computational Infrastructure in Australia.A 5-fold cross-validation was performed on 22 out of 23 patients to evaluate the performance of the trained models.The remaining patient was held out as a final test set to assess the generalisability of the models.

Quantitative results
To assess the efficacy of the different models for our framework, we fully trained and evaluated a total of 10 models on the ST dataset: ResNet101, Inception-v3, EfficientNet-b0, EfficientNet-b1, EfficientNet-b2, EfficientNet-b3, EfficientNet-b4, EfficientNet-b5, ViT-B-16 and ViT-B-32, with and without the use of AuxNet, and compared their performances in predicting the top 250 genes.From the results with AuxNet on the held-out test case (Table 1 and Fig. 2), we observe that the EfficientNet architecture is dominant, with EfficientNet-b0 performing most favourably, followed by EfficientNet-b4.The results without using AuxNet for the test case (Supplementary Table S1) clearly show the improvements brought by the proposed auxiliary network.Specifically, for each gene, the best median Pcc value is 0.00 1.00 2.00 3.00 4.00 5.00 6.00

ENet-b5
ViT-B16 higher with the use of AuxNet than without, though this higher number may be produced by a different main network model.Of the other models, ResNet101, Inception-v3, and ViT-B16 performed somewhat comparably (no consistent winner), and better than ViT-B32 and EfficientNet-b5, the latter of which generally performed worst.EfficientNet-b0 with AuxNet was able to predict 237 genes with positive correlation (Fig. 3), including 24 genes with median Pcc values greater than 0.5 (strong correlation), 123 genes with Pcc values between 0.3 and 0.5 (medium correlation), and 78 genes with Pcc values between 0.1 and 0.3 (weak correlation).
The average MAE and average RMSE of all models with and without AuxNet (Table 2) show that for the best-performing model according to the Pcc metric (Table 1) the errors on the testing data are smaller with the use of AuxNet than without.More broadly, Inception-v3, the lower-version EfficientNet models, and ViT-B16 all produce smaller errors with the use of AuxNet.Interestingly enough, the errors of these models on the training data (Table 2) as well as on the validation data (Supplementary Tables S2 and S3) can be larger with the use of AuxNet.This may be explained by the fact that AuxNet introduces additional outputs in the network, which can make the optimization process more difficult.
During training, the network learns to optimize the main objective while also optimizing the auxiliary objectives.Consequently, this may result in a more complex optimization problem, and the network may not converge as quickly or as well as without the use of AuxNet.However, the use of AuxNet can still enhance the generalisation performance   of the models on the unseen testing data, which is crucial in practical applications.Thus, the slight increase in training error is a reasonable trade-off for the improved generalisation performance on the testing data.

Visualisation of gene expression
To better understand the gene expression predictions, the top predicted genes with high Pcc values were visualised on the corresponding histology image (Fig. 4 and Supplementary Figs.S1 and S2).The visual comparison of the true and predicted expressions of gene B2M and their Pcc values for the test dataset (Fig. 4) shows a strong correlation between the two.The yellow portions correspond to regions where B2M is strongly expressed, while the blue portions correspond to regions where that gene is poorly expressed.We observe that the yellow portion of the tissue is highly correlated with the presence of the black tumour annotation, indicating that this gene is colocated with cancer cells in the image.This demonstrates the effectiveness of our proposed framework BrST-Net in predicting local gene expression from tissue images for a selected panel of genes.

Computational costs
To evaluate the computational cost of training our framework, we recorded the training times of all considered models in our experiments.From the results (Table 3) we observe that as EfficientNet increased in scale from b0 to b5, the training time gradually increased accordingly, as expected.ResNet101 and Inception-v3 were about as costly as the lower-scale versions of EfficientNet, while ViT-B16 was about as expensive as the higher-scale versions of EfficientNet, and ViT-B32 was the fastest network.For a few models, the use of AuxNet added comparatively little to the total cost, and for most models, it even slightly reduced the cost, demonstrating that AuxNet is a lightweight network that can improve accuracy while incurring minimal computational overhead.For all models, while training takes many hours, their application at test time takes only a few seconds per image, making them computationally very feasible in clinical practice.

Discussion
Spatial transcriptomics technologies can profile gene expression for complete transcriptomics at almost single-cell resolution, with spatial positions matched with H&E-stained histological images (WSIs).While WSIs are inexpensive, accessible, and commonly generated in clinics, generating ST data is very costly and complicated, and currently, only a limited number of research centres are capable of doing so.Therefore, predicting gene expression directly from WSIs is useful.
In this paper, we proposed BrST-Net, a framework for predicting gene expression using ST data.We fully trained and tested 10 different deep-learning models for our framework to predict 250 genes based on ST breast cancer data.Of all considered models, the combination of EfficientNet-b0 and our proposed AuxNet was able to predict 237 genes with positive median correlation, including 24 genes with strong correlation (Pcc value over 0.5), 123 genes with medium correlation (Pcc values between 0.3 and 0.5), and 78 genes with weak correlation (Pcc values between 0.1 and 0.3).EfficientNet-b4 and EfficientNet-b3 also performed relatively well, as did Inception-v3 and ViT-B16, while EfficientNet-b5 performed worse and ViT-B32 generally did poorly.The fact that, among the transformer models, ViT-B16 outperformed ViT-B32 in gene expression prediction, may be attributed to the smaller dimensionality of ViT-B16, which enhances its ability to capture complex gene expression relationships.These findings emphasize the significance of selecting an appropriate model architecture for the given task and dataset.
In comparison to the first method in this area, ST-Net [14], our BrST-Net framework improved the gene expression prediction performance quite substantially.We have completely trained the most recent state-of-the-art CNN models and transformers and compared their performances on the ST dataset, while ST-Net employed a DenseNet trained on natural images (such as cats, dogs, and flowers), which may not be optimal for histopathology.In our suggested approach, in addition to the core network, an auxiliary network is added to predict the remaining genes.The auxiliary loss helps reduce the fading gradients problem and stabilizes and regularises the training.Whereas ST-Net can predict only 102 genes with positive correlation, our framework was able to predict 237.ST-Net revealed the top-5 predicted genes with a median correlation coefficient value of (0.34, 0.33, 0.31, 0.30, and 0.29), and for smoothed data (0.49, 0.50, 0.50, 0.52, and 0.43).In contrast, with our framework using EfficientNet-b0 with the proposed AuxNet, we could predict 24 genes with a median correlation coefficient greater than 0.50, which represents a considerable increase in performance.
Notwithstanding the merits of BrST-Net, there is still much room for further study and development in this field.As is well known, deep learning methods require a large dataset to obtain good results.Thus we expect future studies using larger datasets to increase the prediction performance and robustness of the models.A richer dataset from a larger number of patients would also expand our framework's generalisability.Because each model has unique performance characteristics, combining two or more models can be beneficial.Also, it may be possible to do further gene expression filtering and model training for predicting target gene expression.There is a need to further improve the number of genes predicted with high correlation, which are of established relevance in cancer biology and treatment.In the meantime, BrST-Net could serve as an inexpensive and fast high-throughput screening tool for large numbers of patient samples to direct downstream definitive molecular analyses.

Figure 1 :
Figure 1: Overview of the proposed BrST-Net framework.(a) Original whole slide image.(b) Stain normalised image.(c) Mask image and filtered ST spot location information.(d) Extracted image patches using the spot location information.(e) Main network with an auxiliary network for gene expression prediction.(f) Prediction of an individual gene on the tissue sample, showing (i) test sample, (ii) ACTB gene prediction, (iii) ground-truth expression of gene ACTB, and (iv) gene ACTB overlayed on the tissue sample.(g) Performance evaluation.

Figure 2 :Figure 3 :
Figure 2: Stacked bar chart summarising the predictive performances of the different models.

Figure 4 :
Figure 4: Visualisation of gene B2M expression for patient BC23903 slide number C2.(a) Histopathology sample from the test data.(b) Binary label of tumour (black) and normal (white) areas, with gray parts indicating areas that are not purely tumor or normal.(c) Ground truth expression of gene B2M.(d) Predicted expression of gene B2M (Pcc = 0.6325).(e) Visualisation of predicted gene B2M expression overlayed on the tissue slice.

Figure S1 :
Figure S1: Visualisation of gene expression for patient BC23903 slide number C2.(a) Histopathology sample from the test data.(b) Binary label of tumour (black) and normal (white) areas, with gray parts indicating areas that are not purely tumor or normal.(c) Ground truth expression of corresponding genes.(d) Predicted expression of corresponding genes and their Pcc values (e) Visualisation of predicted gene expression overlayed on the tissue slice.

Figure S2 :
Figure S2: Visualisation of gene expression for patient BC23903 slide number C2.(a) Histopathology sample from the test data.(b) Binary label of tumour (black) and normal (white) areas, with gray parts indicating areas that are not purely tumor or normal.(c) Ground truth expression of corresponding genes.(d) Predicted expression of corresponding genes and their Pcc values (e) Visualisation of predicted gene expression overlayed on the tissue slice.

Table 1 :
Top 10predicted genes by our framework using different models with AuxNet.The numbers are median Pcc values on the held-out test patient ID BC23903.Here, to save space, "EfficientNet" is abbreviated to "ENet", and "Inception-v3" to "Incep-v3".Bold indicates best performance (highest Pcc) per gene and underlined indicates second-best.

Table 2 :
Average mean absolute error (aMAE) and average root mean squared error (aRMSE) of different models with and without AuxNet on the training and testing data.

Table 3 :
Computational cost of our framework using different models.