Abstract
Triple-negative breast cancers (TNBCs) are a subset of breast cancers that have remained difficult to treat. A proportion of TNBCs arising in non-carriers of BRCA pathogenic variants have genomic features that are similar to BRCA carriers and may also benefit from PARP inhibitor treatment. Using genomic data from 129 TNBC samples from the Malaysian Breast Cancer (MyBrCa) cohort, we developed a gene expression-based machine learning classifier for homologous recombination deficiency (HRD) in TNBCs. The classifier identified samples with HRD mutational signature at an AUROC of 0.93 in MyBrCa validation datasets and 0.84 in TCGA TNBCs. Additionally, the classifier strongly segregated HRD-associated genomic features in TNBCs from TCGA, METABRIC, and ICGC. Thus, our gene expression classifier may identify triple-negative breast cancer patients with homologous recombination deficiency, suggesting an alternative method to identify individuals who may benefit from treatment with PARP inhibitors or platinum chemotherapy.
Introduction
Triple negative breast cancer (TNBC) continues to be an area of unmet clinical need, as this aggressive subtype has fewer treatment options and higher mortality rates compared to other subtypes of breast cancer1,2,3. As a “catch-all” diagnosis for all breast tumours that test negative for both hormone receptors and HER2, TNBC is highly heterogeneous, and can be divided into several subtypes of its own4,5, each with potentially different treatment options. In this context, one potential biomarker that may be useful for subcategorizing TNBC patients for treatment is homologous recombination deficiency (HRD).
Cells with HRD have defects in their homologous recombination pathway, leading to a diminished ability to repair DNA damage. In breast cancer, HRD is an important biomarker for therapies that utilize DNA-damaging agents such as platinum chemotherapy and PARP inhibitors. Platinum chemotherapies such as cisplatin, oxaliplatin, and carboplatin induce crosslinking of DNA that inhibits DNA repair and synthesis6. PARP inhibitors, on the other hand, are a newer class of drugs that inhibit the action of poly (ADP-ribose) polymerase (PARP) proteins, leading to double-strand DNA breaks during cellular replication7. In tumours with HRD, these DNA-damaging agents cause an accumulation of mutations, leading to synthetic lethality and eventually cell death8. Tumours with deleterious genomic BRCA variants have defective homologous recombination repair pathways9, and PARP inhibitors have been approved for the treatment of TNBC in individuals with deleterious germline variants in BRCA1/2 (gBRCAm)10,11.
Besides deleterious germline BRCA variants, other molecular alterations in tumours may also lead to similar defects in the homologous recombination pathway. This “BRCAness” feature has been proposed as a way to broaden the patient population eligible for PARP inhibitors12, and various molecular aspects of “BRCAness” have been characterized13. Some breast tumours with the “BRCAness” feature may arise when expression of BRCA1 or BRCA2 is repressed by hypermethylation or somatic mutation14, or when the homologous recombination pathway is abrogated through mutations in other genes in the pathway (e.g. PALB2)15,16, and there are ongoing clinical studies that seek to expand the utility of PARP inhibitors in this context17. In addition, transcriptional signatures and genomic mutational signatures have been generated to identify tumours with HRD, with some association with PARP inhibitor sensitivity18,19. Clinical studies to examine the utility of these genomic signatures as predictive biomarkers have also been initiated20. In other cancer settings such as recurrent or high grade serous ovarian carcinoma, HRD assays such as the myChoice HRD assay and the Foundation Medicine T5 NGS assay have demonstrated some utility in guiding treatment with PARP inhibitors, and have received FDA approval as companion diagnostics21,22,23.
We have recently characterized the genomic and transcriptomic profiles of fresh frozen breast tumours from a large cohort of Malaysian patients of Chinese, Malay and Indian descent (the MyBrCa cohort)24. In order to study transcriptomic biomarkers for HRD in Asian TNBC, we first defined HRD status by clustering our TNBC samples using genomic features associated with HRD, followed by differential gene expression analyses comparing samples with high HRD to samples with low HRD. We identified a set of largely novel genes that were associated with HRD in our cohort, which we used to train a machine learning classifier to classify patient tumour samples as having high or low HRD. We validated the classifier using TNBC samples from the TCGA, ICGC, and METABRIC cohorts. We also validated the classifier on an alternative NanoString platform, using both FFPE and fresh frozen tumour samples. This classifier may have clinical utility as a non-gBRCAm biomarker to select for patients with high HRD who may benefit from treatment with platinum chemotherapy or PARP inhibitors.
Results
Study population
Our discovery and training dataset consisted of 129 TNBC samples from the MyBrCa cohort, for which whole-exome sequencing and RNA-seq data were derived from fresh frozen primary tumour tissue and matched blood samples. 94 of these samples have been published previously (Pan et al.24), while the remaining 35 were sequenced more recently and have not been formally described. Of the 94 previously published samples, eight samples had pathogenic germline BRCA1 variants, two samples had pathogenic germline BRCA2 variants, and two samples had pathogenic germline PALB2 variants25. The average age of the 129 TNBC patients was 52.9 years (±13.6), and the majority of the samples were Stage II or Stage III, poorly-differentiated invasive breast carcinomas of no special type (NST) (Table 1). Our validation cohorts comprised 87 TNBC tumours from The Cancer Genome Atlas (TCGA) breast cancer cohort26 (TCGA 2012), 306 TNBC tumour samples from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) breast cancer cohort27,28, and 73 TNBC tumour samples from the Nik-Zainal (2016) breast cancer cohort (NZ-560)29.
HRD in Asian TNBC
First, we examined the prevalence of HRD in the Asian TNBC setting by conducting an unsupervised clustering analysis of several genomic features associated with HRD in the MyBrCa TNBC samples. These features include commonly accepted features of HRD such as genomic loss of heterozygosity (LOH), telomeric allelic imbalance (TAI) and large-scale state transition (LST), short indels, copy number aberrations (CNAs) including gene amplification, gain, loss, and deletion, and the COSMIC mutational signature SBS3.
Hierarchical (Fig. 1A) and k-means (Supp. Fig. 2) clustering of these genomic features revealed two distinct clusters which we called HRD High and HRD Low based on the levels of these HRD-associated scores. Out of the 129 samples that were clustered, 113 samples were concordant between the two clustering algorithms, while 16 samples were discordant and dropped from subsequent analyses. Of the 113 concordant samples, 41 (36%) were categorized as HRD High and 72 (64%) were categorized as HRD Low.
Next, using differential gene expression analysis of RNA-sequencing data from the TNBC tumour samples, we identified a set of 217 genes that were differentially expressed between the two groups by manually curating the top upregulated and downregulated genes based on Benjamini-Hochberg adjusted p-values of less than 0.001, with a minimum log2 fold change of 2 in either direction (Fig. 1B, Supp. Table 2). This gene set of 217 genes was substantially different from previous HRD-associated gene sets derived in studies by Peng et al.30 and the I-SPY 2 consortium31,32, with very small numbers of overlapping genes (1 between MyBrCa and Peng, 15 between MyBrCa and I-SPY 2).
Over-representation analysis of the selected 217 differentially expressed genes using KEGG and Reactome pathway-based gene sets revealed an enrichment in a number of metabolic and signaling pathways (Supp. Table 3), but not pathways known to be associated with HRD. Similarly, analysis of over-represented gene ontology (GO) terms for the 217 genes showed an enrichment of extracellular matrix, plasma membrane, and hormone-related terms, but not HRD-related terms (Supp. Table 4), suggesting that the 217 genes were largely not previously known to be associated with HRD. On the other hand, gene set enrichment analysis (GSEA) of MSigDB Hallmark and KEGG pathways comparing TNBC tumour samples classified as HRD High to samples classified as HRD Low revealed an upregulation of cell cycle and DNA repair pathways in samples classified as HRD High, similar to previous studies13, suggesting that HRD pathways are indeed differentially expressed between the two groups when looking across the whole transcriptome (Supp. Tables 5-6). Interestingly, immune-related pathways appear to be downregulated in the HRD High group and upregulated in the HRD Low group as well (Supp. Tables 5-6).
Classification of tumour samples according to HRD status using the gene set
Using the set of 217 genes identified above, we trained a machine learning classifier, which we call HRD200, to classify the HRD status of any given tumour sample, using an adaptation of the composite classifier framework described in Sammut et al.33. Our adaptation of this framework incorporates 5-fold stratified shuffling to split the samples along a 70/30 ratio for model training and testing, respectively, followed by feature selection and 5-fold cross-validation (see Methods, Supp. Table 7). The composite classifier under this framework achieved a mean AUROC of 0.93 in both training and testing datasets, utilizing 122 of the 217 genes as features (Fig. 2A). Using a probability cut-off that maximized F1 score (0.5), the HRD200 ensemble classifier designated 38 out of the 41 Asian TNBC in the MyBrCa samples with high HRD scores correctly as “HRD High” and classified 71 out of 72 of the samples with low HRD scores correctly as “HRD Low”, for an F1 score of 0.97 and an accuracy of 96%. The classifier also categorized 5 out of the 6 samples in our cohort with known biallelic pathogenic germline BRCA1 variants as HRD High (Fig. 2B). In addition, the use of our specific 217 gene set with this classification approach was able to outperform similar classifiers using the gene sets by Peng and I-SPY 2, although the difference was small (Supp. Fig. 3).
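The relationship between the reported counts and the summary metrics can be checked with a few lines of arithmetic. The following is a minimal sketch; the function name is hypothetical, and because this section does not state which class or averaging convention underlies the reported F1 score, the sketch computes the per-class values for both HRD High and HRD Low.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Counts from the text: 38 of 41 HRD High and 71 of 72 HRD Low
# samples were classified correctly.
high = binary_metrics(tp=38, fp=1, fn=3, tn=71)  # HRD High as positive class
low = binary_metrics(tp=71, fp=3, fn=1, tn=38)   # HRD Low as positive class

for name, m in (("HRD High", high), ("HRD Low", low)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Accuracy is the same whichever class is treated as positive, whereas precision, recall, and F1 depend on that choice.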
Given that the classifier was trained with the 217 genes that were identified using the entire MyBrCa data set, there was likely some information leakage that may have led to overly optimistic results with the testing datasets. Thus, to validate our methodology, we also trained the classifier using other sets of 200 genes, using a sliding window approach to derive gene sets of increasing informativeness based on their differential expression. We found that more significantly differentially expressed genes produced classifiers with higher AUROCs and vice versa (Supp. Fig. 4), showing that the informativeness of the gene set used for the model directly influences the classifier’s predictive ability. Interestingly, we found that we could also use random sets of 200 genes to predict HRD status with a high AUROC using our methodology, suggesting that HRD may have a widespread effect on the expression of many genes across the transcriptome (Supp. Fig. 5A), and this was also supported by our differential expression analysis where there was a strong enrichment of low p-values across the transcriptome (Supp. Fig. 5B).
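The sliding window construction can be illustrated as follows. This is a sketch under stated assumptions: genes are ranked by adjusted p-value, the window size is 200 as in the text, and the step size (here equal to the window size, giving non-overlapping windows) is an assumption, since the step used in the study is not given in this section. The function name and the toy gene identifiers are hypothetical.

```python
def sliding_gene_windows(ranked_genes, window=200, step=200):
    """Yield (start_rank, genes) for consecutive windows over a gene list
    ranked from most to least significantly differentially expressed."""
    for start in range(0, len(ranked_genes) - window + 1, step):
        yield start, ranked_genes[start:start + window]


# Toy example: 1,000 hypothetical genes with made-up adjusted p-values
# that increase with the gene index.
adj_p = {f"g{i}": (i + 1) / 1000 for i in range(1000)}
ranked = sorted(adj_p, key=adj_p.get)  # most significant first

windows = list(sliding_gene_windows(ranked))
print(len(windows), windows[0][1][:3])
```

Each window would then be passed to the same training procedure, so that the resulting AUROCs can be compared against the rank of the window.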
Validation of the HRD200 classifier in other cohorts
Next, we examined the ability of our HRD200 classifier to predict HRD-associated genomic features in other cohorts. First, we used our classifier to predict the HRD status of 87 TNBC tumours in The Cancer Genome Atlas (TCGA) dataset (TCGA 2012) for which mutational signature data was available. Comparison of the samples classified as HRD High versus HRD Low revealed that the HRD High samples had significantly higher scores for all HRD-associated variables including LOH, TAI, LST, short indels, CNAs, and mutational signature SBS3 (Fig. 3A, Supp. Fig. 6), suggesting that our classifier was able to successfully segregate samples with HRD-associated features in the TCGA TNBC cohort. We also compared the classifier predictions to a “ground truth” HRD status for the samples derived using the same methodology as applied to the MyBrCa cohort (consensus hierarchical and k-means clustering of HRD-associated variables) and found that the classifier was able to predict HRD status in TCGA TNBCs with an AUROC of 0.84 (F1 = 0.70, precision = 0.63, recall = 0.86). Using the same 0.5 probability cutoff as the MyBrCa cohort analyses above, our classifier designated 54 of the 87 samples as HRD High and 33 as HRD Low.
We also tested our classifier on 73 TNBC samples from the Nik-Zainal (2016) WGS cohort and 306 TNBC samples from the METABRIC cohort, although in both cases we had to reduce the number of genes included in the classifier and retrain the classifier as expression data was not available for some genes (see Methods). In the Nik-Zainal cohort, the samples designated as HRD High by our retrained classifier had significantly higher scores for SBS3, rearrangement signatures 3 and 5, and HRD index scores (Supp. Fig. 7A). We also found that our retrained classifier could predict HRD classification by HRDetect at an AUROC of 0.71 (F1 = 0.57, precision = 0.72, recall = 0.67; Supp. Fig. 7B). Similarly, in the METABRIC cohort, the samples designated as HRD High by our retrained classifier had significantly higher scores for SBS3 and copy number aberrations (Supp. Fig. 7C), and we found that our retrained classifier could predict consensus hierarchical and k-means clustering of these variables at an AUROC of 0.80 (F1 = 0.81, precision = 0.69, recall = 0.59; Supp. Fig. 7D).
Validation of the HRD classifier using a NanoString platform
Next, we evaluated the robustness of the HRD200 classifier with regard to different methods of measuring gene expression, as well as to the use of formalin-fixed paraffin-embedded (FFPE) tissue rather than fresh frozen tissue. We did this because both RNA-seq data and fresh frozen tumour tissue are expensive and difficult to obtain as part of routine clinical practice.
To evaluate the performance of the HRD200 classifier with a different gene expression measurement method, we used data from the NanoString nCounter platform34, which uses direct digital detection of mRNA molecules to generate gene-level transcript counts. We were able to obtain gene expression data for 36 genes of our gene set from 61 fresh frozen tissue samples as well as 23 FFPE tissue samples from the same cohort of MyBrCa TNBC patients described above, using a custom NanoString nCounter CodeSet. These data were then inputted into a version of our HRD200 classifier that was optimized for the 36 genes, and the results were compared to the HRD classification from high-throughput sequencing data. Using the consensus clustering results as the ground truth, the NanoString-based classification had an AUROC of 0.95 for fresh frozen tissue and 0.78 for FFPE (Fig. 4A) for the samples with both consensus cluster and NanoString data (n = 55 and n = 19 for fresh frozen and FFPE samples, respectively). Additionally, the probabilities for any given sample being classified as HRD High from RNA-seq and NanoString data were highly correlated, with a Spearman’s correlation coefficient (ρ) of 0.94 when comparing RNA-seq and fresh frozen NanoString results, and a ρ of 0.77 when comparing RNA-seq and FFPE NanoString results (Fig. 4B). Pairwise comparisons of the NanoString and consensus clustering results revealed an overall accuracy of 0.91 and an F1 score of 0.88 for fresh frozen tissue, and an accuracy of 0.74 and an F1 score of 0.71 for FFPE tissue (Fig. 4C, Supp. Fig. 8), when using probability cutoffs that maximized F1 scores. Overall, this suggests that our HRD200 classifier was robust even when used with expression data from a different platform, and also when used with smaller subsets of genes compared to the original gene set.
Discussion
Here, we describe the development and validation of a gene expression classifier for homologous recombination deficiency in Asian TNBC. Using mutation load and other genomic features, we were able to sort our TNBC samples into two clusters - HRD high and HRD low, and we derived a gene set associated with HRD status using gene expression analyses. Using that gene set, we subsequently developed an ensemble machine learning model classifier (HRD200) that could discriminate between samples with high versus low HRD with good accuracy and AUROC from gene expression data. The classifier also had the ability to segregate samples according to genomic features associated with HRD in TNBC validation cohorts from TCGA, METABRIC, and Nik-Zainal (2016). Importantly, we also found a very high concordance rate in the classification results when using an alternative measure of gene expression (NanoString), as well as when using FFPE material instead of fresh frozen tissue, even with small subsets of the gene panel, suggesting that the HRD200 classifier could be robust for real-world clinical use situations.
In the TNBC setting, we found a large cluster of samples with high HRD-associated genomic features and mutational signatures. This suggests that there may be a significant number of Asian TNBC patients who may benefit from therapies that target the HRD pathway, such as platinum chemotherapy or PARP inhibitors. Thus, tools to detect HRD in tumour samples may have clinical utility as a non-gBRCAm biomarker to select for Asian breast cancer patients with high HRD who may benefit from treatment with platinum chemotherapy or PARP inhibitors.
The training and validation of the HRD classification tool described in this paper relies in part on measures of the mutational signature SBS3 in each tumour sample. The COSMIC mutational signature SBS3 is a well-studied cancer mutational signature that has been validated in orthogonal techniques35,36, and its HRD-driven etiology has been experimentally confirmed37. The use of HRD-associated mutational signatures to predict “BRCAness” and thus response to PARP inhibitors and platinum chemotherapy has received significant attention in recent years, with tools such as HRDetect19 and CHORD38 demonstrating that HRD-associated mutational signatures can be used to predict BRCA1/BRCA2-deficiency in breast tumours as well as survival of breast cancer patients39. The clinical utility of such tools in a breast cancer setting is being tested in ongoing clinical trials; however, one significant drawback is that these methods usually require whole-genome sequencing of tumour samples, which may be prohibitively expensive and time-consuming, particularly in low-resource settings. By using gene expression signatures, our method for predicting HRD in tumour samples may offer a useful alternative.
As mentioned above, other gene expression signatures related to HRD have also been previously described: Peng and colleagues (2014)30 described a set of 230 genes associated with homologous recombination DNA repair in a microarray analysis of a nonmalignant human mammary epithelial cell line that predicts clinical outcome in cancer patients, while a different 77-gene panel of “BRCA1ness” (derived from microarray and MLPA analyses of TNBC samples) was significantly associated with response to PARP inhibitor treatment in the I-SPY 2 breast cancer clinical trial31,32. Interestingly, there appears to be little overlap between the different gene panels: only one of the 217 genes in our gene set is included in Peng’s 230-gene set, and only 15 of the 217 genes in our study are included in the 77-gene “BRCA1ness” panel, with zero genes common across all three sets. The lack of overlap may reflect the large differences in methodology or study population. Nonetheless, when the same machine learning approach was used, all three gene sets were almost equally predictive for HRD status in our dataset, with very high predictive value across the board. This suggests that the machine learning methodology used may be more important than the specific gene set for researchers to be able to derive accurate predictions of HRD status from gene expression data, as long as the gene set used retains some association with HRD in the study population.
Taken together, we believe that the HRD200 classifier, implemented as a NanoString-based test, may have clinical utility as a non-BRCAm biomarker to select for patients with high HRD who may benefit from treatment with PARP inhibitors. Further development of the classifier is required to determine if HRD200 can correctly identify patients sensitive to PARP inhibitor therapy in real-world clinical settings.
Methods
Data description
Genomic sequencing data for this project was taken primarily from 94 TNBC samples included in the MyBrCa cohort tumour sequencing project. In brief, this included whole-exome sequencing (WES) and RNA-sequencing (RNA-seq) data collected from biobanked breast tumours of female patients from two hospitals, Subang Jaya Medical Centre in Subang Jaya, Malaysia, and Universiti Malaya Medical Centre in Kuala Lumpur, Malaysia, and analysed together with available clinical data. The cohort data and sequencing methods are described in full in Pan et al.24 and associated papers24,25,40. We also included an additional 35 TNBC samples that were not part of the original cohort description, for a total sample size of 129 MyBrCa TNBC samples. These samples were obtained and processed in largely the same way as the previous MyBrCa samples, with the only difference being the use of the Illumina NovaSeq 6000 as the sequencing platform instead of the Illumina HiSeq 4000. The sequencing coverage and quality statistics of WES and RNA-seq data for each new sample are summarized in Supp. Tables 1A and 1B, respectively. Additional validation data from TCGA and METABRIC TNBC samples were downloaded from the NCI Genomic Data Commons (GDC) Data Portal and the European Genome-phenome Archive, respectively.
Patient recruitment and sample collection were reviewed and approved by the Independent Ethics Committee, Ramsay Sime Darby Health Care (reference no: 201109.4 and 201208.1), as well as the Medical Ethics Committee of the University Malaya Medical Centre (reference no: 842.9). Written informed consent to participate in research was given by each individual patient.
Transcriptomic data processing
Raw RNA-seq reads were mapped to the hs37d5 human reference genome, and gene-level read counts were quantified using featureCounts (v. 1.2.31) with the Homo sapiens GRCh37.87 gene annotation.
Mutational analyses
To call SNVs, we used positions called by Mutect2 with the following filters: a minimum of 10 reads in tumour and 5 reads in normal samples, OxoG metric less than 0.8, variant allele frequency (VAF) of 0.075 or more, p-value for Fisher’s exact test on the strandedness of the reads of 0.05 or more, and SAF more than 0.75. For positions that are present in 5 samples or more, we removed two positions that were not in COSMIC and in single tandem repeats. We also removed variants that have VAF of at least 0.01 in gnomAD, and considered only variants that are supported by at least 4 alternate reads, with at least 2 reads per strand. For indels, we also required the positions to be called by Strelka2. Variants were annotated using Oncotator version 1.9.9.0.
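The per-variant thresholds listed above can be expressed as a single predicate. The sketch below is illustrative only: the `Variant` record and its field names are hypothetical and do not correspond to Mutect2 output fields, and the recurrence, COSMIC, tandem-repeat, and Strelka2 steps are omitted.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    # Hypothetical field names chosen for this illustration.
    tumour_depth: int
    normal_depth: int
    oxog: float
    vaf: float
    strand_fisher_p: float
    saf: float
    gnomad_af: float
    alt_reads_fwd: int
    alt_reads_rev: int


def passes_filters(v):
    """Apply the numeric SNV thresholds described in the text."""
    return (
        v.tumour_depth >= 10
        and v.normal_depth >= 5
        and v.oxog < 0.8
        and v.vaf >= 0.075
        and v.strand_fisher_p >= 0.05
        and v.saf > 0.75
        and v.gnomad_af < 0.01          # drop variants common in gnomAD
        and v.alt_reads_fwd + v.alt_reads_rev >= 4
        and min(v.alt_reads_fwd, v.alt_reads_rev) >= 2
    )


kept = Variant(40, 20, 0.1, 0.2, 0.5, 0.9, 0.0, 3, 2)
one_strand = Variant(40, 20, 0.1, 0.2, 0.5, 0.9, 0.0, 5, 0)
print(passes_filters(kept), passes_filters(one_strand))
```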
Determination of HRD status
Genomic features from WES and shallow whole-genome sequencing (sWGS) data were used in a clustering step to group the TNBC samples into 2 groups: HRD High and HRD Low. The genomic features used include telomeric allelic imbalance (TAI), loss of heterozygosity (LOH), large-scale state transitions (LST), copy number amplification, copy number gain, copy number loss, copy number deletion, indel counts, and COSMIC mutational signature SBS3 scores. TAI, LOH and LST scores were determined using the scarHRD R package (v. 0.1.1)41 on allele-specific copy number profiles derived by Sequenza (v. 2.2) from paired tumour-matched normal WES bam files. The prevalence of the HRD-associated single base-pair substitution (SBS) mutational signature 3 from COSMIC (SBS3) was determined using deconstructSigs (v.1.8.0), restricted to samples with at least 15 SNVs. Scores for copy number amplification, gain, loss, and deletion were obtained using the QDNASeq R package (v. 1.22) on sWGS bam files. Scores for each feature were normalized using z-scores before clustering, except for indel counts which were log-transformed, then all the scores were rescaled. K-means clustering and hierarchical clustering were performed using the Python packages “scikit-learn” (v. 1.2.1) and “scipy” (v. 1.12.0) respectively. Only samples that reached consensus between the two clustering algorithms were selected for further analysis, and the consensus clustering results were assigned as the HRD status of each sample.
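The consensus step can be sketched as follows. Several details here are assumptions rather than the study's settings: `log1p` is used for the indel transform, min-max scaling stands in for the unspecified rescaling, Ward linkage is used for the hierarchical clustering, and the HRD High cluster is taken to be the one with the higher mean rescaled score. The function names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans


def _is_high(X, labels):
    """Mark the cluster with the higher mean feature score as HRD High."""
    scores = X.mean(axis=1)
    ids = np.unique(labels)
    high_id = max(ids, key=lambda c: scores[labels == c].mean())
    return labels == high_id


def consensus_hrd(features, indel_col, seed=0):
    """Return (is_high, concordant) boolean arrays, one entry per sample."""
    X = np.asarray(features, dtype=float).copy()
    other = [j for j in range(X.shape[1]) if j != indel_col]
    # z-score every feature except the indel counts
    X[:, other] = (X[:, other] - X[:, other].mean(axis=0)) / X[:, other].std(axis=0)
    X[:, indel_col] = np.log1p(X[:, indel_col])  # assumed log transform
    # assumed rescaling: min-max to [0, 1] per feature
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

    km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X)
    hc = fcluster(linkage(X, method="ward"), t=2, criterion="maxclust")

    km_high, hc_high = _is_high(X, km), _is_high(X, hc)
    return km_high, km_high == hc_high


# Synthetic demonstration: 30 low-scoring and 20 high-scoring samples.
rng = np.random.default_rng(0)
low = np.column_stack([rng.normal(0, 1, (30, 3)), rng.poisson(5, 30)])
high = np.column_stack([rng.normal(6, 1, (20, 3)), rng.poisson(50, 20)])
is_high, concordant = consensus_hrd(np.vstack([low, high]), indel_col=3)
print(int(is_high.sum()), int(concordant.sum()))
```

Samples for which `concordant` is false would be dropped, mirroring the exclusion of discordant samples described above.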
Differential expression analyses
Gene-level count matrices were first filtered to remove very lowly expressed and non-expressed genes, and then normalised using the “Trimmed Mean of M-values” method implemented in the edgeR (v. 3.20.9) R package. The normalized count matrices were then transformed into log2 counts-per-million (CPM) values using the “cpm” function from the edgeR package in R. Differentially expressed genes were determined by empirical Bayes moderation of the standard errors towards a common value from a linear model fit of the transformed count matrices as implemented in the limma package, with the threshold for differential expression set as false discovery rate (FDR) < 0.001 and absolute log fold change > 0.2.
Pathway analysis
Over-representation analysis using KEGG and Reactome pathway-based sets as well as gene-ontology (GO) based sets was conducted using ConsensusPathDB (http://cpdb.molgen.mpg.de, accessed 21 April 2022) using the human database and ENSEMBL identifiers. For GO-based sets, the search was restricted to gene ontology level 2 and level 3 categories only.
Gene set enrichment analysis (GSEA) was conducted using the Broad Institute GSEA Java executable (v 4.2.3) with default options, using the MSigDB Hallmark gene sets as well as the KEGG gene sets.
Determination of germline BRCA mutation status
Carriers of pathogenic germline variants in BRCA1 and BRCA2 in the MyBrCa cohort were identified from targeted sequencing conducted as part of the BRIDGES study42. LOH and biallelic status of the germline variants were taken from Ng et al.25. Each carrier was independently confirmed with Sanger sequencing.
Classifier architecture
The machine learning framework was implemented in Python (v. 3.9.6) using the libraries “scikit-learn”, “scipy”, “numpy” (v. 1.26.4), and “pandas” (v. 1.5.3). The input dataset for the classifier consisted of RNA-seq gene expression data quantified as TMM-normalized log2 counts per million (CPM), along with the HRD classification of each sample.
Our classifier architecture consisted of a double loop system (Supp. Fig. 1). In the outer loop, the input data was split into training and testing sets following a 70/30 ratio using a one-fold stratified shuffle split repeated five times with different seeds, resulting in five sets of training and testing data that were passed into the inner loop. The inner loop combined two classifier pipelines for Support Vector Machine (SVM) and Random Forest (RF) algorithms, respectively, with the probability that a sample is HRD High being the average score of both pipelines. The inner loop pipeline architecture was adapted from Sammut et al. (2021)33 and has a feature selection step built into the classifier pipeline prior to the classification model, consisting of z-score scaling, k-best selection and collinearity removal. Within the inner loop, the hyperparameters were optimized using a five-fold randomized cross-validation (CV) search that maximizes the area under the receiver operating characteristic curve (AUROC). This randomized CV search tested 1000 random combinations sampled from the specified hyperparameter distributions. The optimization was repeated five times as part of the cross-validation step, and the final scores for the inner loop were the average scores of the five-fold CV. After training, the models were validated against their testing datasets to determine the AUROC for each set of data in the outer loop. Lastly, the AUROC scores from each repetition were averaged to get the final reported AUROC for the entire ensemble classifier. The final ensemble classifier is essentially composed of five sets of five SVM and RF models (25 models in total for each algorithm), and the scores generated by the ensemble classifier are the average scores across all five sets. The optimized hyperparameters and selected features for each model are reported in the supplementary material. This final ensemble classifier was used for further validation, referred to below as the “MyBrCa model”.
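The double loop described above can be approximated with standard scikit-learn components. The following is a simplified sketch rather than the published implementation: the `DropCollinear` transformer, its 0.9 correlation threshold, the hyperparameter distributions, and the function names are all assumptions made for illustration; the default number of random combinations is far smaller than the 1000 used in the study; and only the refitted best model per algorithm is kept for each seed, instead of the five cross-validation models per set.

```python
import numpy as np
from scipy.stats import loguniform, randint
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


class DropCollinear(BaseEstimator, TransformerMixin):
    """Greedily keep features whose |correlation| with kept ones is low."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold

    def fit(self, X, y=None):
        corr = np.atleast_2d(np.abs(np.corrcoef(X, rowvar=False)))
        keep = []
        for j in range(corr.shape[0]):
            if all(corr[j, k] < self.threshold for k in keep):
                keep.append(j)
        self.keep_ = np.array(keep)
        return self

    def transform(self, X):
        return X[:, self.keep_]


def build_search(model, params, n_iter, seed):
    """Pipeline: z-score scaling, k-best selection, collinearity removal, model."""
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("kbest", SelectKBest(f_classif)),
        ("decorrelate", DropCollinear()),
        ("model", model),
    ])
    # k must not exceed the number of input features (assumed >= 20 here)
    space = {"kbest__k": randint(5, 21), **params}
    return RandomizedSearchCV(pipe, space, n_iter=n_iter, cv=5,
                              scoring="roc_auc", random_state=seed)


def double_loop_auroc(X, y, seeds=(0, 1, 2, 3, 4), n_iter=20):
    """Outer loop: repeated 70/30 stratified splits. Inner loop: tuned SVM + RF."""
    aurocs, fitted = [], []
    for seed in seeds:
        splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.3,
                                          random_state=seed)
        train, test = next(splitter.split(X, y))
        searches = [
            build_search(SVC(probability=True, random_state=seed),
                         {"model__C": loguniform(1e-2, 1e2)}, n_iter, seed),
            build_search(RandomForestClassifier(random_state=seed),
                         {"model__n_estimators": randint(50, 201)}, n_iter, seed),
        ]
        for search in searches:
            search.fit(X[train], y[train])
        # Ensemble score: mean HRD High probability of the two pipelines
        proba = np.mean([s.predict_proba(X[test])[:, 1] for s in searches], axis=0)
        aurocs.append(roc_auc_score(y[test], proba))
        fitted.append(searches)
    return float(np.mean(aurocs)), fitted
```

Here `X` is expected to be a samples-by-genes NumPy array of normalized expression values and `y` a 0/1 array with 1 denoting HRD High. Because feature selection sits inside the pipeline, it is refitted within every cross-validation fold of the inner loop.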
Validation on other cohorts
The classifier was validated using gene expression data from TNBC samples from other cohorts, including TCGA, the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) cohort27,28, and the Nik-Zainal (2016)29 (NZ-560) cohort from the International Cancer Genome Consortium (ICGC). Because the individual cohort datasets did not always contain all the genes used in the model training, the models used in each validation were retrained on the MyBrCa data using the available genes for that cohort. The TCGA cohort RNA-seq data was downloaded from the GDC Data Portal and included all 217 genes used in the MyBrCa model. The METABRIC cohort, unlike our other cohorts, includes microarray data rather than RNA-seq data, and includes only 146 of the genes used in the MyBrCa model. Gene expression data for the METABRIC cohort was downloaded from the European Genome-phenome Archive. For the NZ-560 cohort, we used the log2 FPKM gene expression values from RNA-seq data that was reported in the original publication, but data was available for only 164 of the genes used in MyBrCa model. Gene expression values from each cohort were normalized using z-score scaling and quantile normalization separately for each cohort before classification. F1 score, precision, and recall values were calculated using the HRD200 probability threshold that maximized F1 score.
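Selecting the probability threshold that maximizes F1 score can be done by scanning the observed probabilities. This is a minimal sketch with a hypothetical function name; it treats HRD High as the positive class and uses each distinct predicted probability as a candidate cut-off, both of which are assumptions about how the threshold was chosen.

```python
def best_f1_threshold(y_true, proba):
    """Return (threshold, f1) maximizing F1 for the positive class.

    Samples with probability >= threshold are called positive. Ties in F1
    are resolved in favour of the lowest threshold.
    """
    best_t, best_f1 = None, -1.0
    for t in sorted(set(proba)):
        tp = sum(1 for yt, p in zip(y_true, proba) if yt == 1 and p >= t)
        fp = sum(1 for yt, p in zip(y_true, proba) if yt == 0 and p >= t)
        fn = sum(1 for yt, p in zip(y_true, proba) if yt == 1 and p < t)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy labels (1 = HRD High) and made-up classifier probabilities.
labels = [0, 0, 1, 1, 1]
probs = [0.10, 0.60, 0.55, 0.80, 0.90]
threshold, f1 = best_f1_threshold(labels, probs)
print(threshold, round(f1, 3))
```

A threshold chosen in this way on the same samples used for evaluation will tend to give optimistic F1, precision, and recall values.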
RNA extraction
RNA from tumour samples was extracted using the QIAGEN miRNeasy Mini Kit with a QIAcube, according to standard protocol. Total RNA was quantitated using a Nanodrop 2000 Spectrophotometer and RNA integrity was measured using an Agilent 2100 Bioanalyzer.
NanoString validation
For the NanoString validation, we used data from a custom CodeSet developed for the NanoString nCounter platform. This custom CodeSet included 35 genes from our gene set and 3 housekeeping genes used for data normalization. We obtained NanoString nCounter read counts for these genes from 61 fresh frozen samples and 23 FFPE samples from the MyBrCa TNBC cohort. Expression for this gene set was measured on an nCounter MAX Analysis System, and the raw data was processed and normalized using NanoString’s proprietary nSolver (v. 4.0) software before being exported as a normalized gene expression matrix text file for processing by the machine learning classifier, which was retrained using only the 35 genes included in the NanoString data. The NanoString gene expression values were normalized using z-score scaling and quantile normalization before classification. The fresh frozen and FFPE samples were normalized separately.
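Normalizing each batch separately can be sketched with NumPy as below. The order of operations (per-gene z-scores followed by quantile normalization across samples), the matrix orientation (rows as samples, columns as genes), and the function names are assumptions made for this illustration, as these details are not specified here.

```python
import numpy as np


def zscore_genes(X):
    """Scale each gene (column) to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def quantile_normalize(X):
    """Give every sample (row) the same distribution of values."""
    ranks = np.argsort(np.argsort(X, axis=1), axis=1)
    mean_sorted = np.sort(X, axis=1).mean(axis=0)  # reference distribution
    return mean_sorted[ranks]


def normalize_batch(X):
    return quantile_normalize(zscore_genes(np.asarray(X, dtype=float)))


# Made-up expression matrices with the sample and gene counts from the text.
rng = np.random.default_rng(1)
fresh_frozen = rng.lognormal(mean=5.0, sigma=1.0, size=(61, 35))
ffpe = rng.lognormal(mean=4.0, sigma=1.5, size=(23, 35))

# Each tissue type is normalized on its own, never pooled.
ff_norm = normalize_batch(fresh_frozen)
ffpe_norm = normalize_batch(ffpe)
print(ff_norm.shape, ffpe_norm.shape)
```

Keeping the batches separate prevents systematic differences between fresh frozen and FFPE material from shifting the scaled values of either group.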
Statistical analyses
All box-and-whisker plots in the figures are constructed with boxes indicating the 25th percentile, median, and 75th percentile, and with whiskers extending to the maximum and minimum values within 1.5 times the interquartile range from the edges of the box. Values beyond the whiskers are shown as outliers.
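For readers who wish to reproduce this convention, the following minimal sketch computes the box, whisker, and outlier values for a single group. It assumes linearly interpolated ("inclusive") quartiles, which may differ slightly from the defaults of a given plotting package; the function name and the toy scores are hypothetical.

```python
from statistics import quantiles


def box_stats(values):
    """Return quartiles, whisker ends, and outliers using the 1.5 x IQR rule."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations that lie inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }


# Toy scores (invented values, not study data).
scores = [0.10, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.95]
stats = box_stats(scores)
print(stats)
```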
Data availability
The WES and RNA-seq data generated in this study are available in the European Genome-phenome Archive (EGA) under accession number EGAS00001006518. Previously published data from Pan et al.24 are available in the EGA under accession number EGAS00001004518. Access to controlled patient data will require the approval of the Data Access Committee. Further information is available from the corresponding author upon request.
Code availability
The code used in the machine learning model training and validation is available on GitHub at https://github.com/ziching18/HRD200.
References
Yin, L., Duan, J. J., Bian, X. W. & Yu, S. C. Triple-negative breast cancer molecular subtyping and treatment progress. Breast Cancer Res. 22, 1–13 (2020).
Lebert, J. M., Lester, R., Powell, E., Seal, M. & McCarthy, J. Advances in the systemic treatment of triple-negative breast cancer. Curr. Oncol. 25, 142–150 (2018).
Dawson, S. J., Provenzano, E. & Caldas, C. Triple negative breast cancers: clinical and prognostic implications. Eur. J. Cancer 45, 27–40 (2009).
Lehmann, B. D. et al. Identification of human triple-negative breast cancer subtypes and preclinical models for selection of targeted therapies. J. Clin. Invest. 121, 2750–2767 (2011).
Burstein, M. D. et al. Comprehensive genomic analysis identifies novel subtypes and targets of triple-negative breast cancer. Clin. Cancer Res. 21, 1688–1698 (2015).
Rabik, C. A. & Dolan, M. E. Molecular mechanisms of resistance and toxicity associated with platinating agents. Cancer Treat. Rev. 33, 9 (2007).
Rouleau, M., Patel, A., Hendzel, M. J., Kaufmann, S. H. & Poirier, G. G. PARP inhibition: PARP1 and beyond. Nat. Rev. Cancer 10, 293 (2010).
O’Connor, M. J. Targeting the DNA damage response in cancer. Mol. Cell 60, 547–560 (2015).
Powell, S. N. & Kachnic, L. A. Roles of BRCA1 and BRCA2 in homologous recombination, DNA replication fidelity and the cellular response to ionizing radiation. Oncogene 22, 5784–5791 (2003).
Robson, M. et al. Olaparib for metastatic breast cancer in patients with a germline BRCA mutation. N. Engl. J. Med. 377, 523–533 (2017).
Litton, J. K. et al. Talazoparib in patients with advanced breast cancer and a germline BRCA mutation. N. Engl. J. Med. 379, 753–763 (2018).
Lord, C. J. & Ashworth, A. BRCAness revisited. Nat. Rev. Cancer 16, 110–120 (2016).
Severson, T. M. et al. BRCA1-like signature in triple negative breast cancer: Molecular and clinical characterization reveals subgroups with therapeutic potential. Mol. Oncol. 9, 1528–1538 (2015).
Nicolas, E., Bertucci, F., Sabatier, R. & Gonçalves, A. Targeting BRCA deficiency in breast cancer: what are the clinical evidences and the next perspectives? Cancers (Basel) 10, 506 (2018).
Wu, S. et al. Molecular mechanisms of PALB2 function and its role in breast cancer management. Front. Oncol. 10, 301 (2020).
Ahlskog, J. K., Larsen, B. D., Achanta, K. & Sørensen, C. S. ATM/ATR‐mediated phosphorylation of PALB2 promotes RAD51 function. EMBO Rep. 17, 671 (2016).
Gruber, J. J., Gross, W., McMillan, A., Ford, J. M. & Telli, M. L. A phase II clinical trial of talazoparib monotherapy for PALB2 mutation-associated advanced breast cancer. J. Clin. Oncol. 39, TPS1109 (2021). https://doi.org/10.1200/JCO.2021.39.15_suppl.TPS1109
Telli, M. L. et al. Homologous recombination deficiency (HRD) score predicts response to platinum-containing neoadjuvant chemotherapy in patients with triple negative breast cancer. Clin. Cancer Res. 22, 3764–3773 (2016).
Davies, H. et al. HRDetect is a predictor of BRCA1 and BRCA2 deficiency based on mutational signatures. Nat. Med. 23, 517–525 (2017).
Chopra, N. et al. Homologous recombination DNA repair deficiency and PARP inhibition activity in primary triple negative breast cancer. Nat. Commun. 11, 2662 (2020).
Haunschild, C. E. & Tewari, K. S. Bevacizumab use in the frontline, maintenance and recurrent settings for ovarian cancer. Future Oncol. 16, 225–246 (2020).
Ray-Coquard, I. et al. Olaparib plus bevacizumab as first-line maintenance in ovarian cancer. N. Engl. J. Med. 381, 2416–2428 (2019).
Banerjee, S. et al. Maintenance olaparib for patients with newly diagnosed advanced ovarian cancer and a BRCA mutation (SOLO1/GOG 3004): 5-year follow-up of a randomised, double-blind, placebo-controlled, phase 3 trial. Lancet Oncol. 22, 1721–1731 (2021).
Pan, J. W. et al. The molecular landscape of Asian breast cancers reveals clinically relevant population-specific differences. Nat. Commun. 11, 1–12 (2020).
Ng, P. S. et al. Characterisation of PALB2 tumours through whole-exome and whole-transcriptomic analyses. NPJ Breast Cancer 7, 46 (2021).
Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70 (2012).
Curtis, C. et al. The genomic and transcriptomic architecture of 2000 breast tumours reveals novel subgroups. Nature 486, 346–352 (2012).
Pereira, B. et al. The somatic mutation profiles of 2433 breast cancers refines their genomic and transcriptomic landscapes. Nat. Commun. 7, 11479 (2016).
Nik-Zainal, S. et al. Landscape of somatic mutations in 560 breast cancer whole-genome sequences. Nature 534, 47–54 (2016).
Peng, G. et al. Genome-wide transcriptome profiling of homologous recombination DNA repair. Nat. Commun. 5, 3361 (2014).
Wolf, D. M. et al. DNA repair deficiency biomarkers and the 70-gene ultra-high risk signature as predictors of veliparib/carboplatin response in the I-SPY 2 breast cancer trial. NPJ Breast Cancer 3, 31 (2017).
Severson, T. M. et al. The BRCA1ness signature is associated significantly with response to PARP inhibitor treatment versus control in the I-SPY 2 randomized neoadjuvant setting. Breast Cancer Res. 19, 99 (2017).
Sammut, S. J. et al. Multi-omic machine learning predictor of breast cancer therapy response. Nature 601, 623–629 (2021).
Kulkarni, M. M. Digital multiplexed gene expression analysis using the NanoString nCounter system. Curr. Protoc. Mol. Biol. Chapter 25, Unit25B (2011).
Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Nature 500, 415–421 (2013).
Polak, P. et al. A mutational signature reveals alterations underlying deficient homologous recombination repair in breast cancer. Nat. Genet. 1–15 (2017).
Zámborszky, J. et al. Loss of BRCA1 or BRCA2 markedly increases the rate of base substitution mutagenesis and has distinct effects on genomic deletions. Oncogene 36, 746–755 (2017).
Nguyen, L., Martens, J. W. M., Van Hoeck, A. & Cuppen, E. Pan-cancer landscape of homologous recombination deficiency. Nat. Commun. 11, 1–12 (2020).
Staaf, J. et al. Whole-genome sequencing of triple-negative breast cancers in a population-based clinical study. Nat. Med. 25, 1526–1533 (2019).
Pan, J. et al. Germline APOBEC3B deletion increases somatic hypermutation in Asian breast cancer that is associated with Her2 subtype, PIK3CA mutations and immune activation. Int. J. Cancer https://doi.org/10.1002/ijc.33463 (2021).
Sztupinszki, Z. et al. Migrating the SNP array-based homologous recombination deficiency measures to next generation sequencing data of breast cancer. npj Breast Cancer 4, 1–4 (2018).
Dorling, L. et al. Breast cancer risk genes—association analysis in more than 113,000 women. N. Engl. J. Med. 384, 428–439 (2021).
Acknowledgements
Cancer Research Malaysia receives charitable funding from the Scientex Foundation, Estée Lauder Companies, Vistage Malaysia, Yayasan PETRONAS, and Yayasan Sime Darby, which contributed to the funding of this study. Funding was also provided by a research grant from the Newton-Ungku Omar Fund (MRC Ref: MR/P012442/1) to SFC and SHT. OMR, CC, and SFC also receive funding from Cancer Research UK. All genomics work was undertaken by the Genomics Core Facility at the CRUK Cambridge Institute.
Author information
Contributions
J.W.P. led the data analysis, supervised experiments, and wrote the manuscript. Z.C.T., M.M.A.Z. and P.N.F. contributed to data analysis and generation of figures. P.S.N. contributed to data analysis, experimental design, sample collection, and carried out experiments. J.Y.T., S.N.H., T.I., L.Y.T., S.J., M.H.S., C.H.Y., P.R., L.M.L., and A.M.T. contributed to sample collection and processing and data collection, while O.M.R. and S.F.C. generated and processed sequencing data. P.R. and L.M.L. provided histopathology expertise, and collected clinical data together with M.H.S., T.I., S.J., L.Y.T., C.H.Y., and A.M.T. O.M.R., C.C., and S.F.C. also interpreted results and helped to draft the manuscript. C.C. and S.H.T. contributed to obtaining funding for the project. S.H.T. also designed experiments, drafted the manuscript, and provided overall project direction and supervision. The work reported in the paper has been performed by the authors, unless clearly specified in the text.
Ethics declarations
Competing interests
The authors declare that this research was funded by Cancer Research Malaysia, which also holds a patent pending related to the gene expression classifier described in this study. J.W.P., Z.C.T., P.S.N., M.M.A.Z., P.N.F., J.Y.T., S.N.H., J.L., and S.H.T. are current or former employees of Cancer Research Malaysia.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Pan, JW., Tan, ZC., Ng, PS. et al. Gene expression signature for predicting homologous recombination deficiency in triple-negative breast cancer. npj Breast Cancer 10, 60 (2024). https://doi.org/10.1038/s41523-024-00671-1