Molecular features and clinical implications of the heterogeneity in Chinese patients with HER2-low breast cancer

The molecular heterogeneity and distinct features of HER2-low breast cancers, particularly in the Chinese population, are not well understood, limiting its precise management in the era of antibody‒drug conjugates. To address this issue, we established a cohort of 434 Chinese patients with HER2-low breast cancer (433 female and one male) and integrated genomic, transcriptomic, proteomic, and metabolomic profiling data. In this cohort, HER2-low tumors are more distinguished from HER2-0 tumors in the hormone receptor–negative subgroup. Within HER2-low tumors, significant interpatient heterogeneity also exists in the hormone receptor–negative subgroup: basal-like tumors resemble HER2-0 disease, and non-basal-like HER2-low tumors mimic HER2-positive disease. These non-basal-like HER2-low tumors are enriched in the HER2-enriched subtype and the luminal androgen receptor subtype and feature PIK3CA mutation, FGFR4/PTK6/ERBB4 overexpression and lipid metabolism activation. Among hormone receptor–positive tumors, HER2-low tumors show less loss/deletion in 17q peaks than HER2-0 tumors. In this work, we reveal the heterogeneity of HER2-low breast cancers and emphasize the need for more precise stratification regarding hormone receptor status and molecular subtype.

(a) Multiomics level and number of features of included HER2-low patients.WES, whole exome sequencing; TMT: tandem mass tag.
In boxplots, the centerline represents the median, the box limits represent the upper and lower quartiles, the whiskers represent the 1.5x interquartile range, and the points represent individual samples.
Source data are provided as a Source Data file.Samples with all CNA, RNA and protein data were included.RTKs: receptor tyrosine kinases.
Source data are provided as a Source Data file.

HR status HER2 IHC score PAM50
Lipids

HR status
Positive Negative Source data are provided as a Source Data file.and HR-negative (b) subgroups.Significantly increasing or decreasing lipid-related gene sets were plotted in red dots and annotated.P values were computed by Spearman's rank correlation analysis and were adjusted for multiple testing using false discovery rate method.
(c) Pathway-based analysis of lipid and polar metabolite changes along with HER2 IHC scores in different HR subgroups.
Source data are provided as a Source Data file.(j-l) Bar plots comparing the copy number alterations (CNAs) of FGFR4 (j), PTK6 (k) and ERBB4 (l) between non-basallike tumors and basal-like tumors.P values were computed using the two-sided Fisher's exact test.
(m-o) Dot plot of the Spearman correlation between the expression of ERBB4 and FGFR4 (m), ERBB4 and PTK6 (n), and PTK6 and FGFR4 (o).P values were computed by Spearman correlation analysis.
(p-q) Gene Ontology (GO) enrichment analysis of genes (p) and proteins (q) that were differentially expressed between HER2-low non-basal-like and basal-like tumors.
In boxplots, the centerline represents the median, the box limits represent the upper and lower quartiles, the whiskers represent the 1.5x interquartile range, and the points represent individual samples.
Source data are provided as a Source Data file.(a-c) Boxplots comparing luminal-related genes 6 (b), endocrine sensitivity score 7 (c) and SETER/PR score 8 (d).In boxplots, the centerline represents the median, the box limits represent the upper and lower quartiles, the whiskers represent the 1.5x interquartile range, and the points represent individual samples.For luminal-related genes, the number (N) of HER2-0 and HER2-low is 61 and 355.For endocrine sensitivity score, the number (N) was 55 and 301.For SETER/PR score, the number (N) was 57 and 282.P values were computed using the two-sided Wilcoxon test.
(d) Somatic mutations of cancer-related genes (CAGs) among HER2 status subgroups.Upper panel: top 10 frequently mutated CAGs, lower panel: other differentially mutated CAGs between HER2 status subgroups.Genes that were differentially mutated (P<0.05) between HER2-positive and HER2-0 tumors compared with HER2-low tumors are in bold font.P values were computed using the two-sided Fisher's exact test.
(e-g) Comparison of the copy number of Amp peak 17q12 (e), Del peak 17q11.12(f) and Del peak 17q21.31(g).P values were computed using the two-sided Fisher's exact test.
(h-i) Forest plots showing the multivariable Cox regression analysis for distant metastasis-free survival (DMFS) of the status HR and the status of focal peaks 17q12 (h) and 17q11.2(i) in HR-positive HER2-low breast cancers.Number(N) of patients belonging to each category is indicated.Association of all variables with prognosis is analyzed using a two-sided Cox proportional hazard regression analysis.Error bars represent the 95% confidence interval of hazard ratio.
Source data are provided as a Source Data file.(a) Comparison of the expression level of genes located in 17q11.2 and 17q21.31 between loss/deletion patients and others.P values were computed using the two-sided Wilcoxon test and were adjusted for multiple testing using false discovery rate method.
(b-c) Dot plot showing the log2(fold change) of the comparison of genes located in 17q11.2(b) and 17q21.31(c) between tumors with or without loss/deletion in corresponding peaks at both the RNA and protein levels.
Source data are provided as a Source Data file.normal (PoN) samples to screen out expected germline variation and artifacts.This PoN panel was based on 699 normal blood samples, from which two VCF files were created for the sites identified as mutations by TNseq and TNscope.In addition, the location of the population germline resource containing the population allele frequencies obtained from gnomAD 16 was used to filter the raw TNseq results.
To obtain the final set of variant calls, we used a two-step approach, first removing any spurious variant calls arising as a consequence of sequencing artifacts and then making use of consensus mutations in at least two out of three callers to identify somatic mutations.
Second, additional filtering based on bam-readcount (https://github.com/genome/bam-readcount)was performed to reduce false-positive calls: 1) variant allele frequency (VAF) ≥ 5%; 2) sequencing depth in the region ≥ 8; and 3) sequence reads in support of the variant call ≥ 4. A catalog of cancer driver genes was assembled with: (1) the curated cancer gene list by OncoKB (October 2020) 17 ; (2) previously published and functionally validated oncogenic driver genes by Bailey,et al 18 ; (3) the compendium of mutational cancer driver genes from the Integrative OncoGenomics 19 ; (4) genes recorded as oncogene or tumor suppressor gene (TSG) by the Cancer Gene Census 20 .

Sample preparation and data generation
The OncoScan CNV Assay Kit (Affymetrix, Santa Clara, CA, USA) was used to perform genome-wide copy number analysis according to the manufacturer's recommendations.Briefly, a total of 80 ng of DNA from each tumor sample was processed.Molecular inversion probes (MIPs) were mixed with the sample DNA and annealed at 58°C overnight.The annealed DNA was divided into two equal parts and incubated with AT or GC gap-fill master mixes for ligation.Then, exonuclease treatment was performed to remove the unincorporated, noncircularized MIPs and the remaining genomic template.The circularized MIPs were linearized with a cleavage enzyme, and the two PCR amplifications were performed successively.The amplified products were digested with HaeIII and Exo enzymes, and the small fragments containing the specific single-nucleotide polymorphism (SNP) genotype were hybridized onto arrays.The arrays were washed and stained using a GeneChip Fluidics Station 450 (Affymetrix, Santa Clara, CA, USA) and scanned using a GeneChip Scanner 3000 7G (Affymetrix, Santa Clara, CA, USA).The fluorescence of clusters was measured to generate a DAT file.Cluster intensity values were automatically calculated using a built-in algorithm from DAT files using GeneChip Command Console software (Affymetrix, Santa Clara, CA, USA), and a CEL file was generated.

Analysis of SNP array data
An analysis of Affymetrix OncoScan CNV SNP probe assays was performed with Chromosome Analysis Suite (ChAS) v4.1 software (Thermo Fisher Scientific).A copy number reference model file was built by using a reference cohort of DNA from 23 randomly selected white blood cell samples from the mentioned patients and positive control samples from the OncoScan CNV Assay Kit.Probe-level output from the ChAS was analyzed using ASCAT (v2.4.3) 21to obtain segmented copy number calls, estimated tumor ploidy and estimated tumor purity results.The ASCAT segments were subsequently used to produce log2 ratios by dividing by the total copy number (nAraw + nBraw, with zero values set to 0.05).These segments were used as the input of GISTIC2.0 (v2.0.22) 22 to study the recurrence of gene-level CNVs in our sample set.GISTIC2.0 was run with the following parameters changed from the default settings (-ta 0.2 -td 0.2genegistic 1 -smallmem 1 -broad 1 -conf 0.95 -rx 0 -brlen 0.7 -cap 3.5 -armpeel 1 -js 100).Moreover, a group of adjacent normal tissues from 23 patients was used to filter recurrent germline/potential false-positive calls.Based on their segment output, the probes that suggested gain or loss in at least five patients were used with the help of Integrative Genomics Viewer to constitute a CNV file for removing recurrent germline/potential false-positive calls in GISTIC2.0.

Proteome analysis
Proteins were extracted and denatured from 1-2 mg of fresh frozen tissues in 30 μL lysis buffer (6 M urea, 2 M thiourea, 100 mM triethylammonium bicarbonate), followed by proteolytic digestion using Lys-C (Hualishi, Beijing, China) and trypsin (Hualishi, Beijing, China) assisted by pressure-cycling technology (PCT), as described previously 23,24 .TMTpro 16plex label reagents (Thermo Fisher Scientific, San Jose, USA) were used to label the digested peptides.A common pooled peptide sample was used as the reference control sample for each TMT batch.The TMT-labeled samples were cleaned with a C18 column (Waters Sep-Pak® Vac 1 cc C18 Cartridge) and fractionated using a Dionex UltiMate3000 HPLC system (Thermo Fisher Scientific, San Jose, USA) as described previously (Gao et al., 2020).Peptides were separated into 60 fractions, which were then consolidated into 30 fractions.The redissolved peptides were analyzed by liquid chromatography-tandem mass spectrometry (LC-MS/MS) with a nanoflow DIONEX UltiMate 3000 RSLCnano System (Thermo Scientific™, San Jose, USA) coupled with an Orbitrap Exploris 480 mass spectrometer (Thermo Scientific™, San Jose, USA), which was equipped with a FAIMS Pro™ (Thermo Scientific™, San Jose, USA) in data-dependent acquisition (DDA) mode.The peptide samples were analyzed using an LC gradient of 60 min.The other LC-MS parameters followed a previous publication (Gao et al., 2020).

Database search
The mass spectrometric (MS) data were analyzed by Proteome Discoverer (Version 2.4.1.15,Thermo Fisher) using the human protein database downloaded from UniProt (version 15/07/2020, 20368).Two trypsin missed cleavages were allowed.The minimal peptide length was set to 6 residues.Normalization was performed against the total peptide amount.Carbamidomethylation (+57.021Da) of cysteine was set as static modification, while oxidation (+15.995Da) of methionine and acetylation (+42.011Da) of peptides' N-termini were set as variable modifications.Lysine residues and peptide N-termini were tagged with TMTpro (+304.207Da).Precursor ion mass tolerance was set to 10 ppm, and fragment mass tolerance was to 0.02 Da.Both unique peptides and razor peptides were used for mapping the best associated master proteins.The master protein abundances were calculated by summation of their associated peptide groups.The false discovery rate (FDR) of peptide was set to 1% (strict) and 5% (relaxed).The other parameters followed the default setup.More details have been described previously 25 .

Normalization and quality control of proteomic data
In the primary matrix of proteome data, columns were tested samples and rows were detected proteins.Samples with outlier median intensity ratio [defined as >mean(intensity ratio)+2*sd(intensity ratio) or <mean(intensity ratio)-2*sd(intensity ratio)] were excluded by tumor samples and para-tumor samples respectively.Then, the matrix was first log2-transformed and then normalized by column-median.Batch effects were removed by the removeBatchEffect function in the R package limma 26 .Batch effects before and after batch effect removal were evaluated by principal component analysis (PCA).Proteins that were absent in over 30% of all samples were not included in further quality control analysis.Then samples were filtered by PCA, where sample out of the 90% confidence eclipses in PCA analysis (plotted with PC1 and PC2 using all included proteins) by tumor samples and para-tumor samples respectively.Finally, duplicated samples and genes were merge by using mean value.

Polar metabolomics detection 1) Sample quenching and extraction
Twenty-five milligrams of sample was weighed into an EP tube, and 500 μL of extraction solution (methanol:acetonitrile:water = 2:2:1) was added.Then, the samples were homogenized at 35 Hz for 4 min and sonicated for 5 min in an ice-water bath.The homogenization and sonication cycle was repeated 3 times.The samples were incubated for 1 h at -40°C and centrifuged at 12000 rpm for 15 min at 4°C 27 .The QC sample was prepared by mixing equal aliquots of the supernatants from all of the samples.

3) Mass spectrometry
A QE HFX mass spectrometer was used for its ability to acquire MS and MS/MS spectra in information-dependent acquisition (IDA) mode in the control of the acquisition software (Xcalibur, Thermo).In this mode, the acquisition software continuously evaluates the full-scan MS spectrum.The electrospray ionization (ESI) source conditions were set as follows: sheath gas flow rate of 30 Arb, Aux gas flow rate of 25 Arb, capillary temperature of 350°C, full MS resolution of 60000, MS/MS resolution of 7500, collision energy of 10/30/60 in normalized collision energy (NCE) mode, and spray voltage of 3.6 kV (positive) or -3.2 kV (negative).

4) Data processing, metabolite identification and data analysis
MS raw data files were converted to mzXML format by ProteoWizard software (version 3.0.19282)and processed by the R package XCMS (v3.2) for metabolomics.The data pretreatments include peak identification, peak alignment, peak extraction, retention time correction and peak integration.To make the metabolomics data reproducible, peaks with RSDs larger than 30% in the QC samples were filtered out.The remaining peaks were annotated by comparison to retention time and mass to charge ratio (m/z) indices in the library by using the R package CAMERA 28 .After that, we obtained a data matrix consisting of the retention time, m/z and peak intensities.The data matrix was further processed by removing the peaks with missing values (intensity = 0) in more than 50% of the samples.For each metabolite, the missing values were replaced with 50% of the lowest observed value of all detected samples 29,30 .The area of each peak was then normalized by isotopically labeled ISs for polar metabolomics 31 .To remove the unwanted intra-and interbatch analytical variations, each metabolite peak in all subject samples was normalized using the locally estimated scatterplot smoothing (LOESS) method based on QC samples 31 .In brief, a LOESS regression model was built based on the intensity drift of each metabolite in the QC samples and used to predict and correct the intensities of the same metabolite in subject samples 31 .In all, 11708 MS features were included for further annotation.
Then the MS/MS spectra were searched in an in-house database for polar metabolite annotation based on accurate mass (m/z, ± 5 ppm), retention time and spectral patterns.The MS/MS spectra matching score was calculated using dot-product algorithm, which take the fragments and intensities into consideration 32 .Metabolites with MS/MS matching score higher than 0.3 were included in our study.In all, 669 MS/MS features were annotated and included in our study.
In summary, only the peaks with MS/MS name and with MS/MS matching score higher than 0.3 were included for further analysis.

Lipidomic detection 1) Sample quenching and extraction
Twenty milligrams of sample was weighed into an EP tube.Two hundred microliters of water and 480 μL of extract solution (MTBE: MeOH= 5: 1) were added sequentially.After 30 s of vortexing, the samples were homogenized at 35 Hz for 4 min and sonicated for 5 min in an ice-water bath.The homogenization and sonication cycle was repeated 3 times.Then, the samples were incubated at -40°C for 1 h and centrifuged at 3000 rpm (RCF=900 (×g), R= 8.6 cm) for 15 min at 4°C.Three hundred microliters of supernatant was transferred to a fresh tube, and the QC sample was prepared by mixing equal aliquots of the supernatants from all of the samples and drying the mixture in a vacuum concentrator at 37°C.Then, the dried samples were reconstituted in 150 μL of 50% methanol in dichloromethane by sonication for 10 min in an ice-water bath.The constitution was then centrifuged at 13000 rpm (RCF=16200 (×g), R= 8.6 cm) for 15 min at 4°C, and 120 μL of supernatant was transferred to a fresh glass vial for LC/MS analysis.

3) Mass spectrometry
A QE mass spectrometer was used for its ability to acquire MS and MS/MS spectra in DDA mode in the control of the acquisition software (Xcalibur 4.0.27,Thermo).In this mode, the acquisition software continuously evaluates the fullscan MS spectrum.The ESI source conditions were set as follows: sheath gas flow rate of 30 Arb, Aux gas flow rate of 10 Arb, capillary temperature of 320°C (positive) and 300°C (negative), full MS resolution of 70000, MS/MS resolution of 17500, collision energy of 15/30/45 in NCE mode, and spray voltage of 5 kV (positive) or -4.5 kV (negative).

4) Data processing, metabolite identification and data analysis
MS raw data files were converted to mzXML format by ProteoWizard software (version 3.0.19282)and processed by LipidAnalyzer for lipidomics data.The data pretreatments include peak identification, peak alignment, peak extraction, retention time correction and peak integration.The filtering of reliable lipid peaks and the normalization of data were similar to that in polar metabolomics.Then, the LipidBlast database was applied for lipid annotation.The MS/MS spectra matching score was also calculated as described in the polar metabolomic analysis section.Generally, 1312 MS/MS peaks were identified for lipidomics.
In summary, only the peaks with MS/MS name and with MS/MS matching score higher than 0.3 were included for further analysis.
Figure S1.The omics overview of the Fudan University Shanghai Cancer Center (FUSCC) HER2-low cohort, related to Figure 1.
Figure S2.CNA-RNA-protein integrated analyses of frequently altered breast cancer genes in HER2-low breast cancers, related to Figure 1e.

Figure S3 .
Figure S3.Heatmap showing the molecular landscape of HER2-low breast cancers, related to Figure 1e.(a-c) Samples were stratified by hormone receptor (HR) status and HER2 IHC scores.RNA-seq (a), proteome (b) and metabolome (c) data are shown separately.
Figure S4.Difference in overall survival (OS) between HER2-low and HER2-0 patients, related to Figure 2. (a-c) Kaplan-Meier curves and risk tables showing OS of HER2-low and HER2-0 breast cancers compared by twosided log-rank test in the entire cohort (a), HR-positive subgroup (b) and HR-negative subgroup (c).Source data are provided as a Source Data file.

Figure S6 .
Figure S6.The trend of pathway activity and result of metabolite enrichment, related to Figure 3. (a-b) Dot plots showing the trend of REACTOME pathway activity changing with HER2 IHC scores in HR-positive (a)and HR-negative (b) subgroups.Significantly increasing or decreasing lipid-related gene sets were plotted in red dots and annotated.P values were computed by Spearman's rank correlation analysis and were adjusted for multiple testing using false discovery rate method.

Figure S9 .
Figure S9.The association between molecular characteristics and lipid metabolism in HR-negative HER2-low breast cancers, related to Figure 4. (a) Heatmap comparing the gene expression level of lipid metabolism between PIK3CA-mutated and PIK3CA-wild-type breast cancers.(b) Heatmap comparing the pathway score of lipid metabolism between PIK3CA-mutated and PIK3CA-wild-type breast cancers.Original p values derived from two-sided Wilcoxon test that is less than 0.05 were annotated.(c) Heatmap showing Spearman's correlation coefficient between the expression levels of FGFR4/PTK6/ERBB4 and lipid metabolism-related genes.(d)Heatmap showing Spearman's correlation coefficient between the expression levels of FGFR4/PTK6/ERBB4 and the pathway scores of lipid metabolism.Pathways that were significantly correlated with FGFR4/PTK6/ERBB4 levels (FDR from two-sided Spearman's correlation coefficient were less than 0.05) are marked with the exact FDR.Source data are provided as a Source Data file.

Table S1 . Baseline characteristics of included patients stratified by HER2 status.
NA) values in categorical variables were shown but not included in the statistical analysis.Statistical tests of continuous variables were performed using the two-sided Kruskal-Wallis rank sum test.Statistical tests of categorical variables were performed using the two-sided Fisher's exact test.P1: comparison among HER2-0, HER2-low and HER2-positive.P2: comparison between HER2-0 and HER2-low.P3: comparison between HER2-low and HER2positive.IDC: invasive ductal carcinoma; ILC: invasive lobular carcinoma; ISH: in situ hybridization; HR: hormone receptor; sTILs: stromal tumor infiltrating lymphocytes; iTILs: intratumoral tumor infiltrating lymphocytes; TMB: tumor