Abstract
Circulating cell-free DNA (cfDNA) in the peripheral blood is a promising biomarker for cancer diagnosis and prognosis. Somatic mutations identified in cancers have been used to detect therapeutic targets for clinical transformation and individualize drug selection, while germline variants can predict a patient’s risk of developing cancer and drug sensitivity. However, no platform has been developed to analyze, calculate, integrate, and friendly visualize these pan-cancer cfDNA mutations deeply. In this work, we performed panel sequencing encompassing 1,115 cancer-related genes across 16,659 cancer patients, spanning 27 cancer types. We detected 496 germline variants in leukocytes and 11,232 somatic mutations in the cfDNA of all patients. CPGV (Cancer Peripheral blood Gene Variations), a database constructed from this dataset, is the first pan-cancer cfDNA database that encompasses somatic mutations, germline variants, and further comparative analyses of mutations across different cancer types. It bears great promise to serve as a valuable resource for cancer research.
Similar content being viewed by others
Background & Summary
Large-scale sequencing projects have dramatically enhanced our molecular understanding of cancers to the point where using genomic analysis to improve treatment outcomes seems promising1,2,3,4. A pan-cancer atlas provides a panoramic view of the oncogenic processes that contribute to human cancer, featuring hundreds of predisposing germline variants identified and linked to functional consequences, thus providing guidelines for variant classification and germline genetic testing in cancer5,6,7,8. Actionable genetic alterations in signaling pathways provide opportunities for targeted and combination therapies9,10,11. In addition, pan-cancer analysis will enable effective therapies in one cancer type to be extended to other types with similar genomic profiles12. Therefore, pan-cancer mutation resources are valuable for cancer research and medical care.
Cell-free DNA (cfDNA) is a degraded DNA fragment which is released into the plasma; therefore, it can provide a panoramic view of the tumor genome, overcome intratumor heterogeneity issues, and can be used for cancer screening, diagnosis, and monitoring13,14,15,16. With the development of sequencing technology, the sensitivity and accuracy of cfDNA genotyping have greatly improved. Due to its characteristics of non-invasive sampling and easy-to-obtain, cfDNA has become a potential molecule that has attracted much attention for clinical transformation17,18,19,20. In recent years, an increasing number of studies related to cfDNA have been published, which have provided great help in the treatment of tumors21,22,23,24,25. Subsequently, a comprehensive database of cfDNA fragmentation was created26. However, no database or platform has been developed to integrate germline variants and somatic mutation data from pan-cancer peripheral blood.
To address this gap, we developed a comprehensive resource, named CPGV, to integrate gene variations from pan-cancer peripheral blood samples. CPGV would serve as an invaluable resource for cancer research.
Methods
Database construction
First, we collected blood samples. Subsequently, cfDNA and genomic DNA (gDNA) were extracted and sequenced. Then, the raw data were uniformly filtered and analyzed. Finally, all data were assembled into the database system, and the web platform was implemented (Fig. 1).
Sample collection
Between September 2017 and March 2020, 16,659 patients with 27 cancer types were enrolled in this study (Fig. 2A). The ratio of males to females was approximately 3:2, with 42.3% (7,041) of the samples derived from females and 57.7% (9,618) from males (Fig. 2B). The age range was 1–96 years old, with an average age of 59 years old (Fig. 2C). Blood collections were performed twice for some patients at different time points (Supplementary Table S1), and we recruited 15,214 and 12,822 patients for somatic and germline research, respectively. All participants provided written informed consent before blood sampling, and guardians provided informed consent for any participants under 18 years of age. This study was reviewed and approved by the Ethics Committee of the Beijing Institute of Genomics, Chinese Academy of Sciences (Institutional Review Board No. 2021H016). The data has been filed with the Human Genetic Resource Administration of China and can be accessed under the data access agreement.
DNA extraction, sequencing, and data processing
Plasma and white blood cells (WBCs) were separated by centrifugation at 1,600 × g for 10 min. cfDNA was extracted from plasma using a MagMAX™ Cell-Free DNA Isolation Kit (Thermo Fisher Scientific, Waltham, MA, USA). gDNA was extracted from WBCs using a TIANamp Blood DNA Kit (TIANGEN, Beijing, China). The DNA quality was assessed using an Agilent 2100 Bioanalyzer (Agilent, USA)27. gDNA was cut into 150–200 base-pair (bp) fragments with a Covaris M220 Focused-ultrasonicator (Covaris, Massachusetts, USA), constructed using the KAPA Hyper Prep Kit (Kapa Biosystems, USA)28, hybridized to several in-house panels (Genecast, Wuxi, China), and sequenced on the Illumina NovaSeq 6000 according to the manufacturer’s instructions, producing paired-end reads with a length of 151 bp.
Preliminary sequencing data in the BCL format were converted to FASTQ files using bcl2fastq (v2.20.0), processed using Trimmomatic (v0.39) for adapter trimming and low-quality read filtering29, mapped to the reference genome (hg19) using BWA (0.7.17)30, sorted and marked duplicates using the Picard toolkit (version 2.1.0)31, and then realigned using the Genome Analysis Toolkit (GATK, version 3.7)32.
Somatic single nucleotide variation (SNV), germline single nucleotide polymorphism (SNP) and insertion-deletion (InDel) calling
GATK base quality score recalibration (BQSR) was first used to recalibrate base quality. For somatic SNVs and InDel calling, a panel of normals (PoN) containing germline and artifactual sites was created using GATK Mutect2 (4.1.2.0), and then Mutect2 was run to call variants in pairs of tumor and matched normal samples with the PoN. For germline SNP and InDel calling, GATK Haplotype Caller was used to call variants in normal samples in a joint calling mode. These variants were annotated with ANNOVAR and filtered using gnomAD for rare variants33. Rare variants were further filtered in the blacklist and healthy people, while nonsynonymous, stop-gain, stop-loss, splicing, frameshift/non-frameshift insertion, and deletion variants among the exonic and splicing regions were retained for later analysis. Pathogenicity classification for germline variants was predicted using CharGer following the American College of Medical Genetics guidelines34. Variants with a variant allele frequency (VAF) greater than 0.007 were retained for the final somatic mutation set.
Mutation signature analysis
Mutation signatures were determined by applying somatic rare variants in parsing 96 tri-nucleotide contexts to calculate the proportion of the Catalogue Of Somatic Mutations In Cancer (COSMIC) signatures using the R package (version 4.1.2) “deconstructSigs”35.
Calculation of tumor mutation burden (TMB)
TMB was determined by somatic mutations in the exonic and splicing regions with a VAF greater than 0.007. Alterations that were likely or known to be oncogenic drivers were excluded. TMB per megabase was calculated as the total number of mutations divided by the total bases of the target panel, with no less than 500x coverage.
Estimation of circulating tumor DNA (ctDNA) content fraction (CCF)
The CCF of plasma samples was estimated using a maximum likelihood model based on SNVs and copy number variants in the paired plasma and WBC samples, a method that can calculate CCF at lower ctDNA concentrations with high accuracy and stability36.
Calculation of gene-level variant number and carrier ratio for somatic and germline variants
Only the pathogenic and likely pathogenic germline variants predicted by CharGer were used to calculate the gene-level variant number and carrier ratio, and the dysfunctional variants among the exonic and splicing regions for somatic mutations were calculated. We counted the number of variants for each gene in the 27 cancer types. The carrier ratio of a variant is the percentage of patients with this variant per cancer type. The carrier ratio of a gene in a cancer is the proportion of individuals with any mutation in this gene among all patients with the same cancer type.
Statistical analysis
The relationship between carrier ratio and cancer type in our cohort was determined by a one-sided Fisher’s exact test, and a two-sided Fisher’s exact test compared these relationships in our cohort and TCGA cohort and the somatic and germline in our cohort. Multiple hypothesis testing was carried out to adjust the p value.
Data Records
VCF files recording all raw mutational data and the tumor classification of samples in this paper have been deposited in the Genome Variation Map37 (GVM) in the National Genomics Data Center38, China National Center for Bioinformation/Beijing Institute of Genomics, Chinese Academy of Sciences, under accession numbers GVM000186-GVM000195, GVM000197-GVM000207, GVM000209-GVM000227, and GVM000229-GVM000263 within the overarching project number PRJCA02746939 (https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA027469). The raw sequence data reported in this paper have been deposited in the Genome Sequence Archive40 (GSA) in the National Genomics Data Center, China National Center for Bioinformation/Beijing Institute of Genomics, Chinese Academy of Sciences (GSA-Human: HRA00454341) that are publicly accessible at https://ngdc.cncb.ac.cn/gsa-human/browse/HRA004543. At the same time, these data can also be browsed, searched and compared through our online database website.
Technical Validation
All samples adopt a unified data quality control and analysis process to ensure that different samples can be compared.
The data undergoes a series of rigorous filtering process, ensuring that only high-quality variants that meet all filtering criteria are included in the database. Low quality reads that meet the following conditions would be excluded: 1) Fisher Strand (FS) value exceeding 200.0; 2) Symmetric Odds Ratio (SOR) value greater than 3.0; 3) Mapping Quality (MQ) below 40.0; 4) MQRankSum value less than −12.5; 5) Quality by Depth (QD) value below 2.0; 6) QUAL value less than 30.0; 7) ReadPosRankSum value less than −20.0.
Genes harboring only 1 mutation site account for 21.6%, while those with 5 or fewer mutations make up 65.5% (Fig. 3A). Approximately 96.2% of the mutations occur within the exonic region (Fig. 3B), 94.6% involve SNP variations, and nonsynonymous SNV constitute 82.4% (Fig. 3C,D). Specific mutation patterns reveal a tendency for A to G (42.3%), C to T (48.2%), G to A (48.2%), and A to C (43.2%) (Fig. 3E). Insertions are predominantly single-base event, comprising 51.8% of all insertions (Fig. 3F). The majority of reads, accounting for 71.9%, have depths ranging from 100 to 500 (Fig. 3G). For a more detailed overview, the statistics of somatic mutants and germline mutants are shown in Supplementary Figure S1, S2, respectively.
The purpose of constructing this database is to integrate genetic variation data derived from peripheral blood samples of cancer patients, providing a wealth of mutated gene sites for research. Notably, only 18% of our SNVs overlap with those listed in COSMIC42, the foremost and most comprehensive resource of somatic mutations in human cancer, suggesting the significant value of our dataset. In addition, compared with other databases, such as VARAdb43 and OncoVar44, our database offers distinct advantages. These include the provision of germline mutation site information, the capability to compare across multiple cancer types, and the facility to contrast our data with other datasets (Table 1). By leveraging these unique features, researchers can gain a more comprehensive understanding of genetic variations in cancer and identify potential therapeutic targets.
Usage Notes
Both the processed mutation data stored in GVM and the raw data stored in GSA-Human are open to users for free. However, due to the sensitivity of human genetic data, users need to apply on the GVM platform or GSA-Human platform and fill in the data access agreement before downloading the data. The data at GVM is held under the same terms and conditions as the data at GSA-Human. The user’s application will be approved provided that the user agrees to the terms and conditions of the data access agreement and the purpose is not for commercial profit. In addition to the repository data, our dataset is also available for users to download and visualize at http://ngdc.cncb.ac.cn/cpgv/.
Users need to register an account first, log in to find the data they are willing to request by entering the access number in the search bar, then click the “Request” button, and follow the steps to make their Data Access Request. The data access agreement can be downloaded during the intermediate process. Additionally, users can refer to the “Guidance for Making Data Access Requests” under the document button in the navigation bar for further assistance on submitting their requests. Once the Data Access Committee (DAC) has approved the request, users will receive an email notification from GVM/GSA confirming the approval. Upon receiving this notification, users can log in to their account, click on “Request” in the navigation bar, and select “My requests” to check the status of their application. By clicking on “view” in GVM or “download” in GSA, they will be able to access the data download address.
Code availability
The code of the CPGV has been uploaded to GitHub: https://github.com/padapeng911/CPGV.
References
Tao, Z. et al. Characterizations of Cancer Gene Mutations in Chinese Metastatic Breast Cancer Patients. Front Oncol 10, 1023 (2020).
Cooper, L. A. D. et al. PanCancer insights from The Cancer Genome Atlas: the pathologist’s perspective. J Pathol 244, 512–524 (2018).
Brittain, H. K., Scott, R. & Thomas, E. The rise of the genome and personalised medicine. Clinical Medicine 17, 545–551 (2017).
Zehir, A. et al. Mutational Landscape of Metastatic Cancer Revealed from Prospective Clinical Sequencing of 10,000 Patients. Nat Med 23, 703–713 (2017).
Huang, K. L. et al. Pathogenic Germline Variants in 10,389 Adult Cancers. Cell 173, 355–370 (2018).
Li, Y. et al. A comprehensive genomic pan-cancer classification using The Cancer Genome Atlas gene expression data. BMC Genomics 18, 508 (2017).
Tomczak, K., Czerwińska, P. & Wiznerowicz, M. The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp Oncol (Pozn) 19, A68–A77 (2015).
Cline, M. S. et al. Exploring TCGA Pan-Cancer Data at the UCSC Cancer Genomics Browser. Sci Rep 3, 2652–2657 (2013).
Zhou, S. et al. Landscape of RAS Variations in 17,993 Pan-cancer Patients Identified by Next-generation Sequencing. Pathology & Oncology Research 26, 2835–2837 (2020).
Korkut, A. et al. A Pan-Cancer Analysis Reveals High-Frequency Genetic Alterations in Mediators of Signaling by the TGF-b Superfamily. Cell Systems 7, 422–437 (2018).
Sanchez-Vega, F. et al. Oncogenic Signaling Pathways in The Cancer Genome Atlas. Cell 173, 321–337 (2018).
Weinstein, J. N. et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet 45, 1113–1120 (2013).
Che, H. et al. Pan-Cancer Detection and Typing by Mining Patterns in Large Genome-Wide Cell-Free DNA Sequencing Datasets. Clinical Chemistry 68, 1164–1176 (2022).
Zhang, J. et al. 5-Hydroxymethylome in Circulating Cell-free DNA as A Potential Biomarker for Non-small-cell Lung Cancer. Genomics Proteomics Bioinformatics 16, 187–199 (2018).
Wan, J. C. M. et al. Liquid biopsies come of age: towards implementation of circulating tumour DNA. Nature Reviews Cancer 17, 223–238 (2017).
Diaz, L. A. Jr & Bardelli, A. Liquid Biopsies: Genotyping Circulating Tumor DNA. J Clin Oncol 32, 579–586 (2014).
Burgener, J. M. et al. Tumor-Naïve Multimodal Profiling of Circulating Tumor DNA in Head and Neck Squamous Cell Carcinoma. Clin Cancer Res 27, 4230–4244 (2021).
Han, X., Wang, J. & Sun, Y. Circulating Tumor DNA as Biomarkers for Cancer Detection. Genomics Proteomics Bioinformatics 15, 59–72 (2017).
Yeh, P. et al. Circulating tumour DNA reflects treatment response and clonal evolution in chronic lymphocytic leukaemia. Nat Commun 8, 14756 (2017).
Bettegowda, C. et al. Detection of Circulating Tumor DNA in Early- and Late-Stage Human Malignancies. Sci Transl Med 6, 224ra224 (2014).
Zhang, Y. et al. Pan-cancer circulating tumor DNA detection in over 10,000 Chinese patients. Nat Commun 12, 11 (2021).
Ng, C. K. Y. et al. Genetic profiling using plasma-derived cell-free DNA in therapy-naïve hepatocellular carcinoma patients: a pilot study. Annals of Oncology 29, 1286–1291 (2018).
Gremel, G. et al. Distinct subclonal tumour responses to therapy revealed by circulating cell-free DNA. Annals of Oncology 27, 1959–1965 (2016).
Sacher, A. G. et al. Prospective validation of rapid plasma genotyping as a sensitive and specific tool for guiding lung cancer care. JAMA Oncol 2, 1014–1022 (2016).
Murtaza, M. et al. Multifocal clonal evolution characterized using circulating tumour DNA in a case of metastatic breast cancer. Nat Commun 6, 8760 (2015).
Zheng, H., Zhu, M. S. & Liu, Y. FinaleDB: a browser and database of cell-free DNA fragmentation patterns. Bioinformatics 37, 2502–2503 (2021).
Jeong, T.-D. et al. Effects of Pre-analytical Variables on Cell-free DNA Extraction for Liquid Biopsy. Lab Med Online 9, 45–56 (2019).
Jiang, T. et al. Heterogeneity of neoantigen landscape between primary lesions and their matched metastases in lung cancer. Transl Lung Cancer Res 9, 246–256 (2020).
Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Zhou, Y. et al. Whole exome sequencing identifies novel candidate mutations in a Chinese family with left ventricular noncompaction. Molecular Medicine Reports 17, 7325–7330 (2018).
Auwera, G. A. et al. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics 43, 11.10.11–11.10.33 (2013).
Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research 38, e164 (2010).
Scott, A. D. et al. CharGer: clinical Characterization of Germline variants. Bioinformatics 35, 865–867 (2019).
Rosenthal, R., McGranahan, N., Herrero, J., Taylor, B. S. & Swanton, C. deconstructSigs: delineating mutational processes in single tumors distinguishes DNA repair deficiencies and patterns of carcinoma evolution. Genome Biology 17 (2016).
Jiang, T. et al. Utilization of circulating cell-free DNA profiling to guide first-line chemotherapy in advanced lung squamous cell carcinoma. Theranostics 11, 257–267 (2021).
Li, C. et al. Genome Variation Map: a worldwide collection of genome variations across multiple species. Nucleic Acids Research 49, D1186–D1191 (2021).
Xue, Y. et al. Database Resources of the National Genomics Data Center, China National Center for Bioinformation in 2022. Nucleic Acids Research 50, D27–D38 (2022).
Biological project library of China. https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA027469 (2024).
Chen, T. et al. The Genome Sequence Archive Family: Toward Explosive Data Growth and Diverse Data Types. Genomics, Proteomics & Bioinformatics 19, 578–583 (2021).
Raw sequence database of China. https://ngdc.cncb.ac.cn/gsa-human/browse/HRA004543 (2023).
Sondka, Z. et al. COSMIC: a curated database of somatic variants and clinical data for cancer. Nucleic Acids Research 52, D1210–D1217 (2024).
Pan, Q. et al. VARAdb: a comprehensive variation annotation database for human. Nucleic Acids Research 49, D1431–D1444 (2021).
Wang, T. et al. OncoVar: an integrated database and analysis platform for oncogenic driver variants in cancers. Nucleic Acids Research 49, D1289–D1301 (2021).
Acknowledgements
This work was supported by the National Key Research and Development Program of China (Grant No. 2022YFF1202101, 2022YFC2406803, and 2021YFF0703704) and the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDA16010602 and XDB38020100). We are grateful to all anonymous participants in this Genecast cohort and all participants of the CASPMI project, which was supported by a grant from the Key Program of the Chinese Academy of Sciences (KJZD-EW-L14). We would also like to thank the TCGA Project Consortium for making their data publicly available.
Author information
Authors and Affiliations
Contributions
Conception and design: Y. Liu, S. Zhang, J. Wang, G. Zheng, Z. Zhang, H. Qu, X. Fang. Acquisition of data (acquired and managed patients, provided facilities, etc.): J. Liu, Z. Sun, C. Sun, J. Xiao, C. Zeng. Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): S. Zhang, H. Sun, Y. Huang. Writing, review, and/or revision of the manuscript: Y. Liu, S. Zhang, H. Sun, M. Li, H. Qu, X. Fang. Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): Y. Liu, J. Wang, G. Zheng, Y. Yang. Study supervision: H. Qu, X. Fang.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Liu, Y., Liu, J., Zhang, S. et al. A panel sequencing dataset of peripheral blood gene variations in pan-cancer. Sci Data 11, 805 (2024). https://doi.org/10.1038/s41597-024-03620-6
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41597-024-03620-6