Comparison of TCGA and GENIE genomic datasets for the detection of clinically actionable alterations in breast cancer

Kaur, Pushpinder; Porras, Tania B.; Ring, Alexander; Carpten, John D.; Lang, Julie E.

doi:10.1038/s41598-018-37574-8

Published: 06 February 2019


Scientific Reports volume 9, Article number: 1482 (2019)


Abstract

Whole exome sequencing (WES), targeted gene panel sequencing and single nucleotide polymorphism (SNP) arrays are increasingly used for the identification of actionable alterations that are critical to cancer care. Here, we compared The Cancer Genome Atlas (TCGA) and the Genomics Evidence Neoplasia Information Exchange (GENIE) breast cancer genomic datasets (array and next generation sequencing (NGS) data) in detecting genomic alterations in clinically relevant genes. We performed an in silico analysis to determine the concordance in the frequencies of actionable mutations and copy number alterations/aberrations (CNAs) in the two most common breast cancer histologies, invasive lobular and invasive ductal carcinoma. We found that targeted sequencing identified a larger number of mutational hotspots and clinically significant amplifications that would have been missed by WES and SNP arrays in many actionable genes such as PIK3CA, EGFR, AKT3, FGFR1, ERBB2, ERBB3 and ESR1. The striking differences between the number of mutational hotspots and CNAs generated from these platforms highlight a number of factors that should be considered in the interpretation of array and NGS-based genomic data for precision medicine. Targeted panel sequencing was preferable to WES to define the full spectrum of somatic mutations present in a tumor.


Introduction

A comprehensive understanding of potentially actionable genomic aberrations in tumor samples is important in guiding precision medicine for clinical decision-making. With the development of next-generation sequencing (NGS) technologies, it is feasible to characterize the individual genomic landscape and to identify disease causal variation for diagnosis and therapy. The recent advances in cancer genomics using targeted enrichment sequencing have reliably identified clinically relevant genomic alterations present in solid tumors¹. However, the functional significance of these alterations is still unexplored and for most patients with metastatic breast cancer, there is a compelling need for selecting clinically relevant beneficial treatment strategies via the identification of genetic alterations driving tumorigenesis.

Large-scale efforts such as the Catalogue of Somatic Mutations in Cancer (COSMIC), The Cancer Genome Atlas (TCGA) and the American Association for Cancer Research (AACR) Genomics Evidence Neoplasia Information Exchange (GENIE) project were designed to help investigators better understand the impact of somatic mutations in cancer. However, the vast heterogeneity of lesions observed in mutations and copy number alterations (CNAs) varies for different genes and tumor histologies^2,3,4. Molecular profiling of somatic mutations is increasingly being used to help select new treatment regimens in metastatic disease, although as yet there is no proven survival advantage for this approach. This is a particular concern since the open-label, randomized, controlled SHIVA trial found that the use of molecularly targeted agents outside of their indications does not improve progression-free survival when compared to empirical treatment in heavily pre-treated metastatic patients⁵. Others have noted that genomics has not failed, it is just in its early stages of adoption, and that N-of-One designs are necessary to adopt personalized medicine since each tumor has such unique biology⁶. The United States Food and Drug Administration (FDA) has recently approved the NGS-based FoundationOne CDx test, which identifies actionable alterations in cancer-related genes and can guide treatment decisions. Likewise, a variety of commercial and academic laboratories engage in NGS, with results discussed at molecular tumor boards to determine whether the findings indicate a druggable treatment target^7,8,9. However, several technical issues need to be addressed before implementing NGS results into clinical practice. These include consideration of the downstream molecular analysis of: degraded DNA extracted from formalin-fixed, paraffin-embedded (FFPE) specimens, limited amounts of fresh tissue, the degree of stromal cellularity, and variation in the sequencing depth and capture efficiency.
These challenges limit the ability to identify clinically relevant aberrations present in cancer cell subpopulations^7,10,11. In addition, another challenge arising in the analysis of multiple datasets is to identify consistent and reproducible clinically actionable biomarkers from sequencing technologies across cohorts and laboratory platforms. A comprehensive understanding of the detection of genomic alterations in cancer requires an integrative network framework for the analysis of NGS data.

The objective of our study was to investigate which platform (array versus WES and targeted panel sequencing) was most sensitive in identifying clinically significant genomic alterations using the TCGA and GENIE datasets for non-metastatic breast cancer patients.

Results

Comparison of the clinicopathological features of TCGA and GENIE cohorts

The clinical characteristics including age, race, ethnicity, tumor grade and hormone receptor status were compared between TCGA and GENIE breast cancer invasive lobular carcinomas (ILC) and invasive ductal carcinomas (IDC) patients (Table 1). No significant differences were found for the mean age of patients for ILC (p = 0.66) and IDC (p = 0.66) patients in TCGA and GENIE datasets. Tumor grade and hormone receptor information were not available from the GENIE dataset.

Table 1 Clinicopathological features of the TCGA and GENIE cohorts.


Comparison of the number of mutational hotspots in actionable genes in breast cancer TCGA and GENIE datasets

Since WES, SNP arrays and targeted gene-panel approaches are routinely used to assess alterations in the coding regions of the genome, we sought to evaluate which of these technologies was more suitable for providing evidence of alterations in actionable targets. Overall, the results showed that there was inconsistency in the genomic alterations (including the percentages of mutational hotspots and CNAs) in the GENIE and TCGA datasets. We also compared the percentage of mutational hotspots between the TCGA and GENIE datasets after stratifying GENIE samples by PCR- and hybridization capture-based approach. The mutational profiles were inconsistent, with significant differences in the percentage of mutations and CNAs identified by WES, PCR and hybridization capture in the ILC and IDC cohorts (Fig. 1(a–c)). However, we identified consistency in the mutation frequencies across 40 clinically relevant genes, including frequent mutations in PIK3CA, TP53, MAP2K1, NF1 and GATA3 in both datasets (Fig. 1(d,e)), which is consistent with previous reports of an association between mutations in these genes and breast cancer¹². Figure 1(d,e) shows the data for all mutations (hotspots and non-hotspots). Hotspot mutations were annotated with the COSMIC database and non-hotspots were annotated with the Oncology Knowledge Base (OncoKB) and the Clinical Interpretation of Variants in Cancer (CIViC) databases. We applied Fisher’s exact test to compare the frequencies for all identified mutations. We observed significant differences between the two datasets in some actionable genes such as PIK3CA, ERBB2, TP53, RB1, BRCA2, ESR1, PGR, and ATM, with respect to the number of mutations. To further compare the identified somatic mutations from targeted gene panels to WES, we first assessed the distribution and prevalence of mutations in ILC and IDC samples.
The mutations in each gene identified as significant in the TCGA dataset were even more prevalent in mutational cluster regions in the GENIE dataset in the IDC subtype. The genes that had a higher number of mutations in the GENIE cohort as compared to the TCGA cohort were BRCA2 (57 versus 12, p = 0.035 for missense mutations), NOTCH1 (38 versus 5, p = 0.04 for missense mutations), and BRCA1 (36 versus 14, p = 0.02 for missense mutations). We also observed 20 mutations in the ESR1 gene in the IDC subtype in the GENIE dataset that were not identified in the same tumor subtype in TCGA. Among these, the 2 main mutations (D538G and E380Q) confer acquired resistance to aromatase inhibitors¹³. In both cohorts, missense mutations were more prevalent than truncating and inframe mutations in both ILC and IDC subtypes (Kruskal-Wallis test, p < 0.0001) (Fig. 1(d,e)). The frequencies, percentages and p-values for missense, truncating and inframe mutations in individual genes in ILC and IDC samples are shown in Supplementary Tables S1 and S2, respectively.

To measure the prevalence of only hotspot mutations in the TCGA and GENIE datasets, we calculated the number of samples in ILC and IDC subtypes that contain =1 and >2 hotspots analyzed by WES and the targeted sequencing approach (Supplementary Tables S3 and S4, Supplementary Fig. S1). We found a larger number of mutational hotspots in GENIE than in TCGA, which may be related to the deeper coverage of the targeted sequencing approach. However, we could not find any significant differences in the percentage of individual mutation hotspots between the two datasets. The TCGA cohort had matched normal controls, whereas the GENIE samples had none. We also searched public databases (COSMIC v87¹⁴, hotspots.org^15,16 and 3Dhotspots.org¹⁷) as references for evaluating whether the mutations identified through WES and targeted sequencing include any common polymorphisms. We observed that all the hotspots identified in TCGA and GENIE occur recurrently in the COSMIC database and that many of them are present in the Cancer Hotspots database, a resource for statistically significant mutations in cancer¹⁵. We found many novel hotspots in the targeted sequencing data that were missed by the WES approach (Supplementary Tables S3 and S4, Supplementary Fig. S1), which shows that higher read depth has the potential for higher detection sensitivity of low-level mutations^18,19. These results demonstrate that target enrichment with higher coverage depths¹, ranging from ~200x to 4000x, permits an in-depth characterization of the genomic landscape to identify rare and low-frequency variants that would have been missed by WES.

Comparison of copy number calls in actionable genes in breast cancer TCGA and GENIE datasets

The TCGA Pan-Cancer analysis and other studies have shown that CNAs are one of the hallmarks of genomic instability in many cancers and are also the dominant feature in breast cancer^20,21,22,23. A large-scale genomic dataset called the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) has performed an integrated analysis by combining gene copy number and expression to identify novel biological subgroups²². However, a comprehensive understanding of these alterations as putative predictive biomarkers in clinical practice should ultimately facilitate the interpretation of patient data for potential targeted therapy. Thus, the accurate and unbiased identification of recurring CNAs, which are potentially driver events, by using multiple data sets is important to identify the genomic regions of consistent aberration across multiple individuals. We next examined to what extent these two platforms were consistent in detecting actionable genomic CNAs at the sample and gene level. The Fisher’s exact test was used to evaluate the variability in the frequencies of CNA calls. We observed striking differences in the CNA landscape between these two datasets (Fig. 2). The frequencies, percentages and p-values for actionable CNAs in ILC and IDC samples are shown in Supplementary Tables S5 and S6, respectively. The frequency of copy number gain alterations in FGFR1 across ILC samples was 22-fold higher in the TCGA cohort as compared to the GENIE cohort (22% versus 1%, p < 0.0001). RPTOR also harbored frequent copy number gain alterations in 15% of TCGA cases compared to GENIE (0%). We also observed higher frequencies of patients having hemizygous deletions in the hormone receptors in ILC TCGA data set that were not observed in GENIE, including AR (8% versus 0%, p < 0.0001), ESR1 (23% versus 2%, p < 0.0001) and PGR (42% versus 0%, p < 0.0001) (Fig. 2(a)). 
The differences in the frequencies of copy number amplifications and deletions within actionable genes were also observed in the IDC subtype (Fig. 2(b)). The most frequent actionable alterations in the TCGA IDC dataset in comparison to GENIE were amplification in the regions of 15 genes (AKT3, ESR1, BARD1, BRCA1, PALB2, CD274, GATA3, NOTCH1, NOTCH4, MET, CDK4, CCND3, CCND2, CCNE1, CDK6, p < 0.0001) and deletion in 9 genes (PGR, ATM, BRCA2, BARD1, FGFR1, RB1, BRAF, KRAS, FBXW7, p < 0.05) (Fig. 2(b)). These results indicate that SNP array platforms can detect DNA copy number changes to a reasonable degree of accuracy. We next applied the two-stage linear step-up procedure of Benjamini, Krieger, and Yekutieli²⁴, setting the false-discovery rate (FDR, Q) to 5%, to determine the number of genes with a statistically significant difference in the proportion of samples with CNAs between the two datasets. Our comparative analysis for ILC revealed that 34/40 (85%) genes had significant variance in copy number gain, 3/40 (7%) genes in amplification and 34/40 (85%) in hemizygous deletion. Likewise, for IDC, we observed differences in 40/40 (100%) genes for gain, 28/40 (70%) for amplification, 40/40 (100%) genes for hemizygous deletion and 9/40 (22%) genes for homozygous deletion. Since chromosomal aberrations are known to be associated with cancer progression^25,26, we analyzed amplifications and deletions separately to assess which fraction of calls would have been missed by the SNP-based array and targeted sequencing approaches. We compared both datasets for the identification of significant regions of chromosomal amplification and deletion using the GISTIC algorithm on the segmented data. The most significant regions (q < 0.25) of copy number amplification in actionable genes were found in the GENIE dataset as compared to the TCGA dataset in the ILC (Fig. 3(a–c)) and IDC cohorts (Fig. 4(a–c)).
For deletions, we found common and distinct regions that were deleted in breast cancer-associated genes in both datasets in the ILC (Fig. 3(d–f)) and IDC cohorts (Fig. 4(d–f)). The results of this analysis showed that several potentially important copy number amplifications were capable of being better detected by hybridization capture than SNP-based arrays.
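As an illustration of the per-gene frequency comparison described above, the following is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table, written with the Python standard library only. The counts are hypothetical: 28 of 127 samples roughly matches the reported 22% FGFR1 gain frequency in TCGA ILC, but the GENIE denominator used for CNA calls is given only in Supplementary Table S7, so the 2 of 200 below is an assumption chosen to give about 1%. The study itself ran this test in GraphPad Prism, not with this code.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # The small relative tolerance guards against floating-point ties.
    total = sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7)
    )
    return min(1.0, total)


# Hypothetical counts (assumptions, not taken from the supplementary tables):
# rows are cohorts, columns are samples with / without an FGFR1 gain.
tcga_with, tcga_without = 28, 99      # 28 / 127, about 22%
genie_with, genie_without = 2, 198    # 2 / 200, about 1%

p_value = fisher_exact_two_sided(tcga_with, tcga_without,
                                 genie_with, genie_without)
print(f"two-sided p = {p_value:.3g}")
```

The "no more likely than observed" rule used here is one common definition of the two-sided p-value; other software may use a different convention and return slightly different values.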

Comparison of the number of mutational hotspots and copy number calls in actionable genes in NSCLC and colorectal cancer TCGA and GENIE datasets

We further evaluated whether these differences in CNAs were specific for breast cancer or due to tissue preservation methods or platform-specific artifacts. To address this question, we compared the TCGA WES and SNP array data generated from fresh frozen tissues in colorectal²⁷ and non-small cell lung cancer (NSCLC)²⁸ with the corresponding cancer type in the GENIE targeted panel data obtained from FFPE tissues. We found that there was inconsistency in the frequency distribution of CNAs in both of the data sets for those actionable genes from our list which are considered promising druggable targets for NSCLC, and colorectal cancer, such as KRAS, BRAF, EGFR, ATM, and PIK3CA. In NSCLC alone, we observed higher frequencies of CNAs in many actionable genes in TCGA than in GENIE, such as FGFR1 (9% versus 2%, p < 0.0001) and PIK3CA (18% versus 1%, p < 0.0001) for amplification and CDKN2A (13% versus 0%, p < 0.0001), CDKN2B (20% versus 4%, p < 0.0001) for deletions (Fig. 5(a)). In colorectal cancer, the genes that were significantly enriched for copy number gain in TCGA versus GENIE were BRCA2 (60% versus 23%, p < 0.0001), BRAF (48% versus 11%, p < 0.0001) and KRAS (22% versus 3%, p < 0.0001) (Fig. 5(b)).

In NSCLC and CRC, we observed no significant differences between the proportion of mutations in actionable genes identified through WES and the targeted sequencing approach. However, the total number of mutations (including missense, truncating and inframe) in the TP53 gene was greater in GENIE than in TCGA (1709 versus 791, p < 0.0001). A larger number of hotspots and non-hotspots was also detected in GENIE in the NSCLC dataset for genes such as EGFR (738 versus 122, p < 0.0001), NF1 (211 versus 131, p < 0.0001) and PIK3CA (237 versus 94, p = 0.038) in comparison to TCGA. Likewise, TP53 was highly enriched for mutations in the GENIE colorectal cancer data as compared to TCGA data (1629 versus 122, p = 0.0064). We also observed a higher number of mutational hotspots in 3 actionable genes, KRAS (1164 versus 219, p < 0.0001, q = 0.0005), EGFR (814 versus 103, p < 0.0001, q = 0.0005) and TP53 (1195 versus 499, p < 0.0001, q = 0.0005), in GENIE than in TCGA in NSCLC cases.

Discussion

This study represents an integrated comparison of whole exome, SNP-based array and targeted gene panel sequencing in terms of their ability to detect mutations and CNAs in potentially clinically actionable genes from two large breast cancer cohort studies. We observed that targeted sequencing is more effective in detecting CNAs than SNP-based arrays. Although targeted capture sequencing focused on hotspot regions and provided increased quality and reliability at a greater depth in comparison to whole genome sequencing (WGS)^29,30, it identified only smaller insertions and deletions while ignoring large duplications and deletions³¹. RNA sequencing data was not available from the GENIE dataset and thus it was difficult to determine whether the identified mutational hotspots and gene dosage are related to gene expression. The differences are attributable to the methodology used in both datasets, to the limited capture design of the targeted gene panel and to an unequal distribution of targeted sites across the genome, which would result in a large number of false positive and false negative calls. These results may be used as a better benchmark for future studies aimed at the identification of actionable alterations from the comparison of large-scale genomic data sets.

We observed that the percentage of tumors with CNAs was quite small in GENIE as compared to TCGA, making it difficult to determine the precise spectrum of actionable alterations. The low frequencies of CNAs in these FFPE samples may also be explained by low DNA input and degraded DNA, which complicate the identification of deleted regions. Schweiger et al.³² have shown that higher sequencing coverage is required for CNA analysis. Although GENIE also used higher sequencing coverage to detect CNAs, the frequencies of CNAs in breast cancer, NSCLC and CRC FFPE samples were low in comparison to TCGA fresh-frozen tissues. Studies have also shown that copy number analysis between fresh-frozen and FFPE samples varied to a certain degree, suggesting that the discrepancy in CNA frequencies can be due to tissue-preservation methods^33,34. Another important factor affecting CNA detection is the amount of input DNA, which is more than ten-fold higher for the array-based method than for sequencing. Thus, the choice of assay and tissue preservation method is important for accurately detecting mutations and CNAs to guide treatment decisions. The MSK-IMPACT tumor profiling assay may distinguish mismatch repair deficient (MMR-D) and proficient (MMR-P) tumors on the basis of mutational burden in colorectal cancer³⁵. The implementation of the results from these platforms in a clinical diagnostic environment requires immunohistochemistry (IHC) validation per multiple guidelines^36,37,38. Due to the large variation in detecting genomic alterations between different platforms, many studies have suggested that using multiple computational methods for the identification of genomic alterations reduces the chances of false positive results^39,40.
Recently, Shi et al.⁴¹ found that 69% of the mutations from a tumor-only WES pipeline were false positives and that, even with matched-normal DNA, only 36–78% were found consistently in replicate pairs. Because the TCGA cohort includes samples with and without matched normal controls, whereas GENIE samples have no matched normal controls, caution should be exercised when interpreting these genomic alterations. Torga and colleagues reported very low congruence in tumor-specific genetic alterations for patient-paired samples between the PlasmaSELECT and Guardant360 tests, which could lead to different treatment decisions⁴². These results showed that genetic sequencing assays are not always concordant even when the exact same samples are processed, likely due to inherent differences in assay platforms.

From a clinical point of view, our results are of high importance in terms of assessing CNAs from SNP-based array in clinical laboratories, with a particular focus on amplifications in CNAs that would have been missed by this approach. The differences in the CNAs frequency across different platforms would also affect the ability to identify the subtype-specific patterns of alterations (for example, TERT amplification in lung cancer squamous cell carcinoma⁴³) and the driver genes that have been mutated by genomic duplication and deletion. Our results highlight some of the issues associated with technical inconsistencies in using molecular profiling for clinical decision-making. NGS technologies continue to evolve with improvements in accuracy along with the rapid production of huge datasets and new methods for identification of recurrent CNAs in multiple samples. However, it is difficult to assess the relative strengths and limitations of different sequencing methods because of the lack of studies that comprehensively compare these technologies. Despite this, variations in the interpretation of copy number changes between the sequencing platforms may become a problem not only for researchers who need to select the method for a dataset of interest, but also a big challenge for clinicians: which platform (array versus NGS) might best detect the underlying genetic driver of the disease in patients? These differences pose a serious challenge when trying to apply these technologies in clinical trials due to the confounding results, which may further impact on treatment decisions for cancer patients. Although both the TCGA and GENIE genomic datasets have CLIA/CAP certifications, validation steps are needed for both the wet and dry bench workflow of NGS-based assays independently by the clinical laboratory before implementation. 
Furthermore, the platform selection should be based on cross-validating these technologies with more reliable methods such as fluorescence in situ hybridization (FISH) and real-time PCR. There is also a need for more specific guidelines to interpret the clinical significance of actionable CNAs detected by array and NGS technologies for improved “genomic-based” therapeutic approaches for cancer patients.

The major limitation of this study is that raw files are not available for the GENIE dataset. In addition, there was much variation in the underlying research strategy of these two datasets such as coverage of the sequencing platforms, different variant calling pipelines and different assays. Differences in the tools/algorithm used in the different steps along with the variant-calling pipelines may also impact the frequency of variants identified. Considering these constraints, we set out to make a comparison demonstrating the frequency of variants using only the processed data as that was available for both datasets through cBioportal.

In conclusion, our study provides an integrated comparison of array and NGS technologies in identifying clinically relevant genomic alterations in potentially actionable genes. We compared the DNA sequencing data between the TCGA and GENIE project to evaluate the concordance in the frequencies of mutations and significant patterns of CNAs in clinically relevant genes in two breast cancer subtypes. Our results showed that SNP array platform identified many candidate regions of CNAs in actionable genes. We found that targeted gene panel sequencing was more effective in detecting a larger number of mutational hotspots and clinically significant duplications and deletions that were missed by WES and SNP-based array. The results of our study may be used as a better benchmark for future studies aimed at the identification of actionable alterations from the comparison of large-scale genomic data sets.

Methods

Analysis of potentially breast cancer related genes

For both large-scale genomic datasets, we identified a panel of 49 potentially actionable targets in which biomarkers were linked with FDA-approved or investigational therapeutics in breast cancer studies listed on www.clinicaltrials.gov (Table 2). We analyzed the TCGA⁴⁴ and GENIE¹ datasets from primary invasive lobular carcinomas (ILC) and invasive ductal carcinomas (IDC) patients for 40 genes from our curated list as 9 genes were not available on the targeted gene panel. Genes were defined as clinically relevant or actionable based on therapeutic and/or diagnostic implications in cancer patients⁴⁵. Our gene panel is not Clinical Laboratory Improvement Amendments (CLIA)/College of American Pathologists (CAP) certified, but the majority of these 49 actionable targets are found in CLIA certified gene panels such as the Memorial Sloan Kettering-Integrated Mutation Profiling of Actionable Cancer Targets (MSK-IMPACT) (410 genes), OncoKB database⁴⁶ (476 cancer-associated genes targeted by FDA-approved drugs or standard therapeutic agents) and Foundation Medicine (315 clinically relevant genes). The intent of our gene panel was to focus on potentially actionable genes with relevance to breast cancer and to maintain a sufficiently focused list in order to permit a detailed comparison of the TCGA and GENIE results as they pertain to clinically relevant gene targets.

Table 2 List of potentially breast cancer related genes.


TCGA and GENIE data

We assessed the whole-exome DNA sequencing and Affymetrix SNP 6.0 array data for 127 ILC and 490 IDC from TCGA cohort and compared these with the third data release for GENIE targeted sequencing data for 248 ILC and 1724 IDC cases. The mutations and CNAs generated from Affymetrix array and NGS technologies were retrieved from cBioportal^47,48. Only GENIE samples that were screened using hybridization-based capture approach, as opposed to PCR-based approach, were analyzed for CNAs. The sample size of this subset of GENIE samples analyzed for CNAs is given in Supplementary Table S7. All patient samples were de-identified and encoded with TCGA and GENIE sample codes. We compared the array and NGS results from TCGA fresh frozen tissues and GENIE FFPE tissues to determine concordance between each platform. For the validation of both datasets, we also compared the TCGA WES and SNP array data generated from fresh frozen tissues in colorectal²⁷ and non-small cell lung cancer (NSCLC)²⁸ with the corresponding cancer type in the GENIE targeted panel data¹ obtained from FFPE tissues. We obtained the mutational and CNA events using cBioPortal for array data from TCGA NSCLC (n = 1144) and targeted gene panel sequencing data from GENIE (n = 3694). The mutational and CNA events for colorectal cancer were also obtained from cBioPortal for array data from TCGA colorectal (n = 226) and targeted gene panel sequencing data from GENIE (n = 2574).

Comparison of DNA mutations from WES and targeted gene panel sequencing data

For the identification of putative hotspots in clinically actionable genes, we downloaded the mutational hotspot data for TCGA and GENIE cohorts using cBioportal from the sequenced exomes of breast cancer patients (based on prespecified classifications or groups). The Fisher’s exact test was used to evaluate the variability in the frequencies of mutations for 40 actionable genes between both data sets for ILC and IDC subtypes. The Kruskal-Wallis test was applied to assess which mutation types are more prevalent in both breast cancer subtypes.
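To make the mutation-type comparison concrete, here is a minimal standard-library sketch of the Kruskal-Wallis test restricted to exactly three groups (missense, truncating, inframe). With three groups the H statistic is referred to a chi-square distribution with 2 degrees of freedom, whose survival function is simply exp(-H/2), so no external statistics package is needed. The per-gene counts are hypothetical placeholders rather than values from Supplementary Tables S1 and S2, and the function names are illustrative.

```python
from math import exp


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i..j, made 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_three_groups(g1, g2, g3):
    """Return (H, p) for three independent groups, with tie correction.

    The values must not all be identical, otherwise the tie correction
    divides by zero.
    """
    groups = [g1, g2, g3]
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)

    rank_term = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        rank_term += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * rank_term - 3 * (n + 1)

    # Tie correction: divide H by 1 - sum(t^3 - t) / (n^3 - n).
    tie_counts = {}
    for v in pooled:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    ties = sum(t ** 3 - t for t in tie_counts.values())
    h /= 1 - ties / (n ** 3 - n)

    p = exp(-h / 2)  # chi-square survival function for 2 degrees of freedom
    return h, p


# Hypothetical per-gene mutation counts for five genes (assumptions).
missense = [57, 38, 36, 20, 14]
truncating = [12, 5, 9, 3, 2]
inframe = [1, 0, 2, 0, 1]

h_stat, p_val = kruskal_wallis_three_groups(missense, truncating, inframe)
print(f"H = {h_stat:.2f}, p = {p_val:.4f}")
```

The chi-square approximation is rough for groups this small; it is used here only to keep the sketch short.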

Comparison of CNAs from SNP-based array and targeted gene panel sequencing data

To determine the copy number status of an individual gene in any given patient, we used copy number datasets within the cBioportal generated by Genomic Identification of Significant Targets in Cancer (GISTIC) algorithms²⁶. CNA was characterized by measured copy number (expressed as a log2 ratio), and by the extent of change in the genome. The CNA thresholds were determined according to the set of discrete copy number calls provided by GISTIC: deep loss/homozygous deletion (−2), shallow loss/hemizygous deletion (−1), low-level gain (1), and high-level amplification (2). The copy number data was not available from the patients analyzed by PCR method in GENIE data set. The Fisher’s exact test was used to determine whether the frequencies of CNAs are different in actionable genes between TCGA and GENIE datasets analyzed by the array and NGS-based technologies. The identification of significantly amplified and deleted regions among potentially actionable genes was done using the GISTIC algorithm. The data was aligned to genome build hg19. The algorithm was executed within the Broad Firehose infrastructure. The GISTIC analysis was conducted separately on the ILC and IDC subtypes in TCGA and GENIE breast cancer study.
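The discrete GISTIC calls listed above can be turned into per-gene alteration frequencies with a few lines of code. The sketch below assumes the calls for one gene are available as a plain list of integers, one per sample, and that a value of 0 denotes a copy-neutral sample (the usual convention for these files, but not stated in the text above). The function name and the example calls are illustrative.

```python
from collections import Counter

# Discrete copy number calls as described in the text; 0 (copy-neutral)
# is an assumption added for completeness.
GISTIC_LABELS = {
    -2: "homozygous deletion",
    -1: "hemizygous deletion",
    0: "copy-neutral",
    1: "gain",
    2: "amplification",
}


def cna_frequencies(calls):
    """Return the percentage of samples carrying each non-neutral call."""
    if not calls:
        raise ValueError("at least one sample is required")
    unknown = set(calls) - set(GISTIC_LABELS)
    if unknown:
        raise ValueError(f"unexpected call values: {sorted(unknown)}")
    counts = Counter(calls)
    return {
        label: 100.0 * counts.get(code, 0) / len(calls)
        for code, label in GISTIC_LABELS.items()
        if code != 0
    }


# Hypothetical calls for one gene across ten samples.
example_calls = [2, 1, 1, 0, 0, -1, 0, 0, 0, 0]
for label, percent in cna_frequencies(example_calls).items():
    print(f"{label}: {percent:.0f}%")
```

The resulting per-cohort counts are what a 2x2 frequency comparison between TCGA and GENIE would take as input.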

Statistical Analysis

Statistical analysis for comparing the mutations and CNAs was performed using GraphPad Prism version 7. The most prevalent mutations among missense, truncating and inframe mutations were calculated using the Kruskal-Wallis test. Fisher’s exact test was used to calculate the variability in the frequencies of hotspots and CNAs. The two-stage linear step-up procedure of Benjamini, Krieger and Yekutieli, with FDR (Q) set to 5%, was used to correct p-values for multiple testing.
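For readers unfamiliar with the two-stage linear step-up procedure, the following is a minimal sketch of its usual formulation: a first Benjamini-Hochberg pass at level Q/(1+Q) estimates the number of true null hypotheses, and a second pass uses that estimate to relax the threshold. It returns reject/accept decisions only and is not a reimplementation of GraphPad Prism, which also reports q-values. The p-values in the example are hypothetical.

```python
def bh_reject_count(sorted_p, q):
    """Number of rejections made by Benjamini-Hochberg step-up at level q.

    `sorted_p` must be in ascending order.
    """
    m = len(sorted_p)
    k = 0
    for i, p in enumerate(sorted_p, start=1):
        if p <= i * q / m:
            k = i  # keep the largest index that passes its threshold
    return k


def two_stage_step_up(pvalues, q=0.05):
    """Return a list of booleans (True = discovery), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    sorted_p = [pvalues[i] for i in order]

    # Stage 1: Benjamini-Hochberg at the reduced level q / (1 + q).
    q1 = q / (1 + q)
    r1 = bh_reject_count(sorted_p, q1)

    if r1 == 0 or r1 == m:
        k = r1  # reject nothing, or reject everything
    else:
        # Stage 2: estimate the number of true nulls and rerun the
        # step-up procedure at the correspondingly relaxed level.
        m0_hat = m - r1
        k = bh_reject_count(sorted_p, q1 * m / m0_hat)

    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


# Hypothetical per-gene p-values (assumptions for illustration).
example_p = [0.001, 0.008, 0.039, 0.041, 0.6]
print(two_stage_step_up(example_p, q=0.05))
```

On these example values the single-stage procedure at 5% would flag only the two smallest p-values, whereas the two-stage version flags four, which illustrates why the adaptive procedure is often preferred when many hypotheses are expected to be non-null.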

Ethics approval and consent to participate

This study was performed in strict accordance with the recommendations of data access guidelines of TCGA and AACR project GENIE datasets. We received administrative permission for downloading the restricted-access data for breast cancer patients from the TCGA Data Access Committee (Project # 10345).

Data Availability

The datasets analyzed in the current study are publicly available in cBioPortal and on the Sage Synapse platform.

References

AACR Project GENIE: Powering Precision Medicine through an International Consortium. Cancer discovery 7, 818–831, https://doi.org/10.1158/2159-8290.cd-17-0151 (2017).
Mamanova, L. et al. Target-enrichment strategies for next-generation sequencing. Nat Methods 7, 111–118, https://doi.org/10.1038/nmeth.1419 (2010).
Altmuller, J., Budde, B. S. & Nurnberg, P. Enrichment of target sequences for next-generation sequencing applications in research and diagnostics. Biol Chem 395, 231–237, https://doi.org/10.1515/hsz-2013-0199 (2014).
Choi, M. et al. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc Natl Acad Sci USA 106, 19096–19101, https://doi.org/10.1073/pnas.0910672106 (2009).
Le Tourneau, C. et al. Molecularly targeted therapy based on tumour molecular profiling versus conventional therapy for advanced cancer (SHIVA): a multicentre, open-label, proof-of-concept, randomised, controlled phase 2 trial. The Lancet. Oncology 16, 1324–1334, https://doi.org/10.1016/s1470-2045(15)00188-6 (2015).
Wheler, J. J. et al. Unique molecular signatures as a hallmark of patients with metastatic breast cancer: implications for current treatment paradigms. Oncotarget 5, 2349–2354, https://doi.org/10.18632/oncotarget.1946 (2014).
Frampton, G. M. et al. Development and validation of a clinical cancer genomic profiling test based on massively parallel DNA sequencing. Nat Biotechnol 31, 1023–1031, https://doi.org/10.1038/nbt.2696 (2013).
Drilon, A. et al. Broad, Hybrid Capture-Based Next-Generation Sequencing Identifies Actionable Genomic Alterations in Lung Adenocarcinomas Otherwise Negative for Such Alterations by Other Genomic Testing Approaches. Clinical cancer research: an official journal of the American Association for Cancer Research 21, 3631–3639, https://doi.org/10.1158/1078-0432.ccr-14-2683 (2015).
Villaflor, V. et al. Biopsy-free circulating tumor DNA assay identifies actionable mutations in lung cancer. Oncotarget 7, 66880–66891, https://doi.org/10.18632/oncotarget.11801 (2016).
Hadd, A. G. et al. Targeted, high-depth, next-generation sequencing of cancer genes in formalin-fixed, paraffin-embedded and fine-needle aspiration tumor specimens. J Mol Diagn 15, 234–247, https://doi.org/10.1016/j.jmoldx.2012.11.006 (2013).
Yau, C. et al. A statistical approach for detecting genomic aberrations in heterogeneous tumor samples from single nucleotide polymorphism genotyping data. Genome Biol 11, R92, https://doi.org/10.1186/gb-2010-11-9-r92 (2010).
Powell, E., Piwnica-Worms, D. & Piwnica-Worms, H. Contribution of p53 to metastasis. Cancer Discov 4, 405–414, https://doi.org/10.1158/2159-8290.cd-13-0136 (2014).
Jeselsohn, R., Buchwalter, G., De Angelis, C., Brown, M. & Schiff, R. ESR1 mutations-a mechanism for acquired endocrine resistance in breast cancer. Nat Rev Clin Oncol 12, 573–583, https://doi.org/10.1038/nrclinonc.2015.117 (2015).
Forbes, S. A. et al. The Catalogue of Somatic Mutations in Cancer (COSMIC). Current protocols in human genetics Chapter 10, Unit-10.11, https://doi.org/10.1002/0471142905.hg1011s57 (2008).
Chang, M. T. et al. Accelerating Discovery of Functional Mutant Alleles in Cancer. Cancer discovery 8, 174–183, https://doi.org/10.1158/2159-8290.cd-17-0321 https://www.cancerhotspots.org/ (2018).
Chang, M. T. et al. Identifying recurrent mutations in cancer reveals widespread lineage diversity and mutational specificity. Nature biotechnology 34, 155–163, https://doi.org/10.1038/nbt.3391 https://www.cancerhotspots.org/ (2016).
Gao, J. et al. 3D clusters of somatic mutations in cancer reveal numerous rare mutations as functional targets. Genome medicine 9, 4, https://doi.org/10.1186/s13073-016-0393-x https://www.3dhotspots.org/ (2017).
Alioto, T. S. et al. A comprehensive assessment of somatic mutation detection in cancer using whole-genome sequencing. Nature Communications 6, 10001, https://doi.org/10.1038/ncomms10001 (2015).
Cai, L., Yuan, W., Zhang, Z., He, L. & Chou, K.-C. In-depth comparison of somatic point mutation callers based on different tumor next-generation sequencing depth data. Scientific Reports 6, 36540, https://doi.org/10.1038/srep36540 (2016).
Bignell, G. R. et al. Signatures of mutation and selection in the cancer genome. Nature 463, 893–898, https://doi.org/10.1038/nature08768 (2010).
Ciriello, G. et al. Emerging landscape of oncogenic signatures across human cancers. Nat Genet 45, 1127–1133, https://doi.org/10.1038/ng.2762 (2013).
Curtis, C. et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature 486, 346–352, https://doi.org/10.1038/nature10983 (2012).
Siegel, M. B. et al. Integrated RNA and DNA sequencing reveals early drivers of metastatic breast cancer. The Journal of clinical investigation 128, 1371–1383, https://doi.org/10.1172/jci96153 (2018).
Benjamini, Y., Krieger, A. M. & Yekutieli, D. Adaptive linear step-up procedures that control the false discovery rate. Biometrika 93, 491–507, https://doi.org/10.1093/biomet/93.3.491 (2006).
Beroukhim, R. et al. The landscape of somatic copy-number alteration across human cancers. Nature 463, 899–905, https://doi.org/10.1038/nature08822 (2010).
Mermel, C. H. et al. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome biology 12, R41–R41, https://doi.org/10.1186/gb-2011-12-4-r41 (2011).
Comprehensive molecular characterization of human colon and rectal cancer. Nature 487, 330–337, https://doi.org/10.1038/nature11252 (2012).
Campbell, J. D. et al. Distinct patterns of somatic genome alterations in lung adenocarcinomas and squamous cell carcinomas. Nat Genet 48, 607–616, https://doi.org/10.1038/ng.3564 (2016).
Zhao, L. et al. Next-generation sequencing-based molecular diagnosis of 82 retinitis pigmentosa probands from Northern Ireland. Hum Genet 134, 217–230, https://doi.org/10.1007/s00439-014-1512-7 (2015).
Tajiguli, A. et al. Next-generation sequencing-based molecular diagnosis of 12 inherited retinal disease probands of Uyghur ethnicity. Sci Rep 6, 21384, https://doi.org/10.1038/srep21384 (2016).
Chen, Y. et al. SeqCNV: a novel method for identification of copy number variations in targeted next-generation sequencing data. BMC Bioinformatics 18, 147, https://doi.org/10.1186/s12859-017-1566-3 (2017).
Schweiger, M. R. et al. Genome-wide massively parallel sequencing of formaldehyde fixed-paraffin embedded (FFPE) tumor tissues for copy-number- and mutation-analysis. PLoS One 4, e5548, https://doi.org/10.1371/journal.pone.0005548 (2009).
Menon, R. et al. Exome enrichment and SOLiD sequencing of formalin fixed paraffin embedded (FFPE) prostate cancer tissue. Int J Mol Sci 13, 8933–8942, https://doi.org/10.3390/ijms13078933 (2012).
Robbe, P. et al. Clinical whole-genome sequencing from routine formalin-fixed, paraffin-embedded specimens: pilot study for the 100,000 Genomes Project. Genetics in medicine: official journal of the American College of Medical Genetics, https://doi.org/10.1038/gim.2017.241 (2018).
Stadler, Z. K. et al. Reliable Detection of Mismatch Repair Deficiency in Colorectal Cancers Using Mutational Load in Next-Generation Sequencing Panels. Journal of clinical oncology: official journal of the American Society of Clinical Oncology 34, 2141–2147, https://doi.org/10.1200/jco.2015.65.1067 (2016).
Teutsch, S. M. et al. The Evaluation of Genomic Applications in Practice and Prevention (EGAPP) Initiative: methods of the EGAPP Working Group. Genet Med 11, 3–14, https://doi.org/10.1097/GIM.0b013e318184137c (2009).
Ladabaum, U. et al. Strategies to identify the Lynch syndrome among patients with colorectal cancer: a cost-effectiveness analysis. Annals of internal medicine 155, 69–79, https://doi.org/10.7326/0003-4819-155-2-201107190-00002 (2011).
Giardiello, F. M. et al. Guidelines on genetic evaluation and management of Lynch syndrome: a consensus statement by the US Multi-Society Task Force on colorectal cancer. Gastroenterology 147, 502–526, https://doi.org/10.1053/j.gastro.2014.04.001 (2014).
Pinto, D. et al. Functional impact of global rare copy number variation in autism spectrum disorders. Nature 466, 368–372, https://doi.org/10.1038/nature09146 (2010).
Xu, B. et al. Strong association of de novo copy number mutations with sporadic schizophrenia. Nat Genet 40, 880–885, https://doi.org/10.1038/ng.162 (2008).
Shi, W. et al. Reliability of Whole-Exome Sequencing for Assessing Intratumor Genetic Heterogeneity. Cell Reports 25, 1446–1457, https://doi.org/10.1016/j.celrep.2018.10.046 (2018).
Torga, G. & Pienta, K. J. Patient-Paired Sample Congruence Between 2 Commercial Liquid Biopsy Tests. JAMA oncology, https://doi.org/10.1001/jamaoncol.2017.4027 (2017).
Pikor, L. A., Ramnarine, V. R., Lam, S. & Lam, W. L. Genetic alterations defining NSCLC subtypes and their therapeutic implications. Lung cancer (Amsterdam, Netherlands) 82, 179–189, https://doi.org/10.1016/j.lungcan.2013.07.025 (2013).
Ciriello, G. et al. Comprehensive Molecular Portraits of Invasive Lobular Breast Cancer. Cell 163, 506–519, https://doi.org/10.1016/j.cell.2015.09.033 (2015).
Van Allen, E. M. et al. Whole-exome sequencing and clinical interpretation of formalin-fixed, paraffin-embedded tumor samples to guide precision cancer medicine. Nature medicine 20, 682–688, https://doi.org/10.1038/nm.3559 (2014).
Chakravarty, D. et al. OncoKB: A Precision Oncology Knowledge Base. JCO Precision Oncology 1, 1–16, https://doi.org/10.1200/po.17.00011 http://oncokb.org/#/ (2017).
Gao, J. et al. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci Signal 6, pl1, https://doi.org/10.1126/scisignal.2004088 (2013).
Cerami, E. et al. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov 2, 401–404, https://doi.org/10.1158/2159-8290.cd-12-0095 (2012).

Acknowledgements

The authors wish to acknowledge the TCGA Research Network for sharing the TCGA breast cancer (BRCA) genomic datasets. The results presented here are in whole or part based upon data generated by The Cancer Genome Atlas managed by the NCI and NHGRI. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Cancer Institute or the National Institutes of Health. The authors would also like to acknowledge the AACR and its financial and material support in the development of the AACR Project GENIE registry, as well as members of the consortium for their commitment to data sharing. Interpretations are the responsibility of the study authors. The project was supported in part by award number P30CA014089 from the National Cancer Institute.

Author information

Authors and Affiliations

Department of Surgery, Keck School of Medicine, University of Southern California, Los Angeles, CA, 90033, United States
Pushpinder Kaur, Tania B. Porras, Alexander Ring & Julie E. Lang
University of Southern California, Norris Comprehensive Cancer Center, Los Angeles, CA, 90033, United States
Pushpinder Kaur, Tania B. Porras, Alexander Ring, John D. Carpten & Julie E. Lang
Department of Translational Genomics, University of Southern California, Norris Comprehensive Cancer Center, Los Angeles, CA, 90033, United States
John D. Carpten

Authors

Pushpinder Kaur
Tania B. Porras
Alexander Ring
John D. Carpten
Julie E. Lang

Contributions

Study Design: P.K., J.L. Data Acquisition: P.K., J.L. Data Analysis: P.K., J.L., A.R., T.P. Manuscript drafting: P.K., J.L., T.P., A.R., J.C. Critical revisions: P.K., J.L., T.P., A.R., J.C. Funding: J.L.

Corresponding author

Correspondence to Julie E. Lang.

Ethics declarations

Competing Interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Dataset 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Kaur, P., Porras, T.B., Ring, A. et al. Comparison of TCGA and GENIE genomic datasets for the detection of clinically actionable alterations in breast cancer. Sci Rep 9, 1482 (2019). https://doi.org/10.1038/s41598-018-37574-8

Received: 03 September 2018
Accepted: 10 December 2018
Published: 06 February 2019
DOI: https://doi.org/10.1038/s41598-018-37574-8
