Patient-derived glioblastoma cell lines with conserved genome profiles of the original tissue

Glioblastoma (GBM) is the most lethal intracranial tumor. Sequencing technologies have supported personalized therapy for precise diagnosis and optimal treatment of GBM by revealing clinically actionable molecular characteristics. Although accumulating sequence data from brain tumors and matched normal tissues have facilitated a comprehensive understanding of genomic features of GBM, these in silico evaluations could gain more biological credibility when they are verified with in vitro and in vivo models. From this perspective, GBM cell lines with whole exome sequencing (WES) datasets of matched tumor tissues and normal blood are suitable biological platforms to not only investigate molecular markers of GBM but also validate the applicability of druggable targets. Here, we provide a complete WES dataset of 26 GBM patient-derived cell lines along with their matched tumor tissues and blood to demonstrate that cell lines can mostly recapitulate genomic profiles of original tumors such as mutational signatures and copy number alterations.


Background & Summary
Glioblastoma (GBM) is one of the most aggressive malignancies. Owing to its aggressiveness and molecular complexity, the prognosis of GBM patients has not improved as it has for patients with other types of cancer 1. With advanced sequencing technologies, large-scale sequencing of brain tumor tissues has improved our understanding of the genomic characteristics of GBM 2,3. Nevertheless, discrepancies between the sequencing results of tumor tissues and those of in vitro models have kept in silico molecular insights from pre-clinical applications such as high-throughput screening (HTS) 4. In an effort to close the gap between in silico analysis and actual biological models, the Broad Institute Cancer Cell Line Encyclopedia (CCLE) project has made significant progress in thoroughly analyzing omics data of widely used cancer cell lines 5. Sixty-six glioma cell lines were analyzed in the CCLE project, yet the molecular and physiological resemblance between parental tumor tissues and cell lines remains uncertain. Here, we provide a complete set of whole exome sequencing (WES) data from successfully established patient-derived cell lines along with their matched tissues and blood from 26 GBM patients. Although sequencing data from either GBM tissues or cell lines have been deposited in large quantities, only a few databases encompass both tissues and cell lines with matched normal samples. Our data indicate that GBM cell lines recapitulate representative pathogenic mutations of the original tumor and that germline mutations are exclusively present in matched blood DNA. All cell lines introduced in this study, including their genomic profiles, will be deposited in the Korean Cell Line Bank (http://cellbank.snu.ac.kr) at initial passages to be distributed to researchers worldwide.
www.nature.com/scientificdata

WES identified several mutations in the sequenced samples, including point mutations in putative oncogenes. We excluded benign mutations by referring to sequencing results from patient blood DNA and the ClinVar database 6. Mutations commonly observed in glioblastoma 3,7 were well represented in our cell lines. These include inactivating mutations in tumor suppressors such as TP53 and PTEN as well as activating alterations in PIK3CA. No representative driver mutation was detected in the SNU-3978 cell line (Fig. 1a). Mutations in the promoter region of TERT have been associated with increased telomerase activity and, eventually, malignant tumors including glioblastoma 8,9. Since promoter regions are mostly not covered by WES, we manually verified TERT promoter mutations among the 26 GBM cell lines using Sanger sequencing (Table 1). SNU-3978 harbored the C228T mutation in the TERT promoter, which might function as a driver mutation. The lollipop plot of TP53 indicates that the nonsense mutations marked by blue arrows were present exclusively in the tissue and cell line samples, which implies that they are potential driver mutations. The missense and nonsense mutations highlighted by red arrows were harbored by cell lines only (Supplementary Fig. 1a). For instance, SNU-4098 had a pathogenic mutation in TP53 (c.273 G > A, Trp91Ter) that was undetected in its parental tissue sample (Fig. 1a). This incongruity between the original tissues and cell lines might be caused by either the acquisition of random somatic mutations during passaging or the low cellularity of the tissue sample.
Mutational concordance within the coding regions between the original tumor tissues and cell lines indicated that the cell lines recapitulated the mutational traits of the matched tumor specimens well (median concordance fraction = 0.91, ranging from 0.84 to 0.94). While the fractions of cell line-specific mutations were similar across sets (median = 0.019, ranging from 0.015 to 0.025), the fractions of tissue-specific mutations were larger and more variable (median = 0.069, ranging from 0.037 to 0.13) (Fig. 1b). For instance, approximately 13% of total mutations in the SNU-4026 and SNU-4954 series were tissue-specific, nearly twice as many as in the other sets. Continuous passaging of cell lines acts as a selective force that reduces heterogeneous cell populations, which might decrease the concordance rate between the original tumor and matched cell lines. We confirmed that all samples were at passages 4-6, which excluded potential de novo loss of mutations during the establishment of SNU-4026 and SNU-4954 as the cause. Other experimental settings were equivalent, and we concluded that the difference is due to random effects during cell line establishment.
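The concordance fractions reported above can be illustrated with a minimal sketch. It assumes that each somatic call is identified by a (chromosome, position, reference, alternate) tuple and that fractions are taken over the union of the tissue and cell line call sets; the source does not state the exact denominator, and the function name and toy variants below are hypothetical.

```python
def concordance_fractions(tissue_variants, cell_line_variants):
    """Split the union of two call sets into shared and private fractions.

    Assumption: the denominator is the union of both call sets.
    """
    tissue = set(tissue_variants)
    cell_line = set(cell_line_variants)
    union = tissue | cell_line
    if not union:
        raise ValueError("both call sets are empty")
    total = len(union)
    return {
        "shared": len(tissue & cell_line) / total,
        "tissue_specific": len(tissue - cell_line) / total,
        "cell_line_specific": len(cell_line - tissue) / total,
    }


# Hypothetical toy calls keyed by (chrom, pos, ref, alt); not real data.
shared_calls = [("chr1", 100, "C", "T"), ("chr2", 200, "G", "A"), ("chr3", 300, "A", "G")]
tissue_calls = shared_calls + [("chr4", 400, "T", "C")]
cell_line_calls = shared_calls + [("chr5", 500, "C", "A")]

fractions = concordance_fractions(tissue_calls, cell_line_calls)
for name, value in fractions.items():
    print(f"{name}: {value:.2f}")
```

With a union denominator the three fractions sum to one by construction, which makes the per-pair values directly comparable across sample sets.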
Mutational signatures of the tissues and cell lines were compared as well (Fig. 1c). The predominant point mutation type across all samples was the C-to-T transition, including at CpG sites (Supplementary Fig. 1b), matching other glioblastoma sequencing cohorts 10. Since we applied different sequencing depths to the tissues and cell lines, we compared only the types of mutational signatures between tumor tissues and cell lines, which corresponded closely. The proportions of signatures 3 and 12 were decreased in most of the cell lines, which might imply that culture-derived selection favors specific mutational signatures.

We also compared the exome-wide CNVs of cell lines to those of matched tumor tissues. Cell lines displayed CNV patterns mostly analogous to those of the parental tumors. A few changes in CNV were observed, such as a gain at chromosome 12p and a loss at chromosome 14 in the SNU-3978 sample (Fig. 2a). Our samples displayed CNVs comparable to those of the larger TCGA-GBM cohort 10, maintaining the gain of chromosome 7 and loss of chromosome 10 (Fig. 2b,c). Inspection of the top regions identified by TCGA disclosed the presence of EGFR-amplified and CDKN2A- and PTEN-deleted cell lines, as well as a documented gain of the 19q region (Supplementary Fig. 2). Overall, these data validate that the GBM cell lines recapitulate the genomic characteristics of the primary tumor and most of the genomic diversity of glioblastoma.
We provide the aligned BAM files and the processed Variant Call Format (VCF) files for each sample; the VCF files contain the variants genotyped by the GATK HaplotypeCaller pipeline from the corresponding BAM file. These data can be a valuable resource for investigating genetic variants, genes and signaling pathways to identify novel factors related to this disease, and may provide novel information for the investigation of the Korean population or, more generally, for studies of genetic polymorphisms in human populations.

Methods
Establishment and maintenance of human GBM cell lines. This study was reviewed and approved by the institutional review board of the Seoul National University Hospital (IRB No. 1608-139-787), and written informed consent was obtained from all patients enrolled in this study for the use of samples and the establishment of cell lines. All data were handled anonymously. Surgical specimens and clinical information were obtained from 26 GBM patients who underwent surgery at Seoul National University Hospital. Baseline patient and tumor information is summarized in Table 1. The histological diagnosis was rendered using the WHO 2016 classification. Cell lines were established from histologically proven GBM. Solid tumors were finely minced with scissors and dispersed into small aggregates by pipetting. Appropriate amounts of fine neoplastic tissue fragments were seeded into 25 cm² flasks. Most of the tumor cells were initially cultured in Opti-MEM medium supplemented with 5% heat-inactivated fetal bovine serum (FBS) (O5). Cultures were maintained in RPMI 1640 supplemented with 10% heat-inactivated FBS (R10). Initial passages were performed when heavy tumor cell growth was observed, and subsequent passages were performed every 1 or 2 weeks. Adherent cells were recovered while growth was subconfluent by treatment with trypsin, dispersed by pipetting and used for the passages. If stromal cell growth was noted in the initial cultures, differential trypsinization was used to obtain a pure tumor cell population. Cultures were maintained in humidified incubators at 37 °C in an atmosphere of 5% CO2 and 95% air. All cell lines were confirmed to be free of mycoplasma contamination.

DNA purification. Genomic DNA (gDNA) samples were isolated from GBM tissue and blood using the DNeasy Blood and Tissue Kit (Qiagen, MD, USA) according to the manufacturer's recommendations for the spin-column protocol, using 30 mg of starting tissue material from each sample. In short, tissue samples were cut into small pieces, and then lysis buffer (provided with the kit) and proteinase K were added. Lysis reactions were carried out at 56 °C until complete lysis was obtained. DNeasy Mini spin columns (a kit component) were used for the isolation of gDNA from the lysate. Elution was carried out twice to a final volume of 100 μl per elution.

DNA fingerprinting. DNA fingerprinting was performed with the extracted gDNA. Quantified and diluted gDNA solution was added to a reaction mixture consisting of AmpFlSTR PCR reaction mix, Taq DNA polymerase, and the AmpFlSTR Identifiler primer set (Applied Biosystems, CA, USA). The target sequences were then amplified with a GeneAmp PCR System 9700 (Applied Biosystems) with the annealing temperature set to 59 °C. Then 0.05 μl of GeneScan-500 ROX standard and 9 μl of Hi-Di Formamide (Applied Biosystems) were added to 1 μl of PCR product of each cell line and denatured at 95 °C for 2 minutes. This mixture was then analyzed with a 3500xL Genetic Analyzer (Applied Biosystems).
Sanger sequencing. For sequencing of the hTERT promoter region, 1 μL of gDNA from each cell line was amplified in a 14 μL PCR mixture containing 1.5 μL of 10X PCR buffer with MgCl2, 0.5 μL of dNTP, 0.25 μL of forward primer, 0.25 μL of reverse primer, and 0.08 μL of Taq DNA polymerase (Intron Biotechnology, Kyung-gi, South Korea) using a GeneAmp PCR System 9700 (Applied Biosystems, CA, USA). PCR was run for 35 cycles of denaturation at 96 °C, annealing at 68 °C, and elongation at 72 °C. The primer sequences were as follows: hTERT-F, CTGGCGTCCCTGCACCCTGG; hTERT-R, ACGAACGTGGCCAGCGGCAG, with an estimated amplicon size of 470 bp. The PCR product was precipitated with a mixed solution of 5% sodium acetate buffer (Sigma-Aldrich, Cat# S7899) and 95% ethanol. The mixture was then set on ice for 10 minutes and centrifuged at 4 °C and 14000 rpm. The supernatant was discarded, and the product was rinsed with 70% ethanol and centrifuged again at 14000 rpm. The supernatant was discarded, and the product was dried using a vacuum concentrator (Eppendorf). Then 10 μL of distilled water was added to dissolve the precipitated sample. Once the product was fully dissolved in distilled water, cyclic PCR was carried out. Two separate mixtures, one for the forward and one for the reverse sequence, were prepared, each containing 5X sequencing buffer (Applied Biosystems), Big Dye (Applied Biosystems), forward or reverse primer, distilled water, and the product from the previous step. Cyclic PCR was carried out for 25 cycles of denaturation at 96 °C, annealing at 55 °C, and elongation at 60 °C. The cyclic PCR product was then precipitated with the mixed solution of 5% sodium acetate buffer and 95% ethanol, set on ice for 10 minutes, and centrifuged at 4 °C; the supernatant was carefully discarded, and the final product was dried using the vacuum concentrator. Then 10 μL of Hi-Di formamide (Applied Biosystems) was added to dissolve the dried product. This final product was transferred to a 96-well PCR plate and denatured at 95 °C for 2 minutes before being loaded on a 3500xL Genetic Analyzer (Applied Biosystems) for sequencing.
Quality and quantity check of DNA. For the generation of standard exome capture libraries, we used the Agilent SureSelect Target Enrichment protocol for Illumina paired-end sequencing libraries (ver. B.3, June 2015) together with 200 ng of input gDNA. In all cases, the SureSelect Human All Exon V6 probe set was used. DNA quantity and quality were measured by PicoGreen and NanoDrop. Fragmentation of 1 μg of genomic DNA was performed using adaptive focused acoustic technology (AFA; Covaris). The fragmented DNA was end-repaired, an 'A' was added to the 3′ end, and Agilent adapters were then ligated to the fragments. Once ligation had been assessed, the adapter-ligated product was PCR amplified. The final purified product was then quantified using qPCR according to the qPCR Quantification Protocol Guide and qualified using the Caliper LabChip High Sensitivity DNA assay (PerkinElmer). For exome capture, 250 ng of DNA library was mixed with hybridization buffers, blocking mixes, RNase block and 5 µl of SureSelect all exon capture library, according to the standard Agilent SureSelect Target Enrichment protocol. Hybridization to the capture baits was conducted at 65 °C for 24 hours on a PCR machine using the heated thermal cycler lid option at 105 °C. The captured DNA was then amplified. The final purified product was then quantified using qPCR according to the qPCR Quantification Protocol Guide and qualified using the TapeStation DNA ScreenTape (Agilent). The libraries were then sequenced on the HiSeq™ 2500 platform (Illumina, San Diego, USA).
Whole-exome sequencing. Whole-exome capture was performed on all samples with the SureSelect Human All Exon V5 Kit (Agilent Technologies, Tokyo, Japan). The captured targets were sequenced on a HiSeq 2500 (Illumina, San Diego, CA, USA) with the paired-end 100 bp read option for cell lines and blood samples and the 200 bp read option for tissue materials. Information on read depth is provided in Supplementary Data 2. The sequence data were processed through an in-house pipeline. Briefly, paired-end sequences were first mapped to the human genome, with UCSC assembly hg19 (original GRCh37 from NCBI, Feb. 2009) as the reference sequence, using BWA-MEM from the mapping program BWA (version 0.7.12), which generated a mapping result file in BAM format. Then, Picard-tools (ver. 1.130) was applied to remove PCR duplicates: MarkDuplicates.jar, which requires reads to be sorted, reduced reads that mapped to an identical start position to a single one. Using the Genome Analysis Toolkit (GATK), base quality score recalibration (BQSR) and local realignment around indels were performed. HaplotypeCaller of GATK (v3.4.0) was used for variant genotyping of each sample based on the previously generated BAM file, detecting SNP and short indel candidates. Somatic mutations were identified by providing the reference and the sequence alignment data of tumor tissues or cell lines to MuTect2 (included in GATK v3.8.0) with default parameters in tumor-normal mode. The variants were annotated with SnpEff v4.1g in VCF format and filtered with dbSNP version 142 and SNPs from the 1000 Genomes Project. Then, SnpEff was applied to filter against additional databases, including ESP6500, ClinVar, and dbNSFP 2.9.
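The order of the main steps above can be sketched as a set of command lines. This is only an illustration under stated assumptions, not the authors' in-house pipeline: all file names and jar paths are hypothetical, the single-jar Picard invocation is assumed, sorting, BQSR and indel realignment are omitted for brevity, and the function only builds the commands without running them.

```python
import shlex


def build_pipeline_commands(tumor, normal, ref="hg19.fa"):
    """Return argv lists for the main steps; nothing is executed here.

    File names, jar paths and the `ref` default are hypothetical placeholders.
    """
    gatk = ["java", "-jar", "GenomeAnalysisTK.jar"]  # GATK 3.x style invocation
    return {
        # bwa mem writes SAM to stdout; conversion to sorted BAM is omitted.
        "align": ["bwa", "mem", ref,
                  f"{tumor}_R1.fastq.gz", f"{tumor}_R2.fastq.gz"],
        # Duplicate removal on a coordinate-sorted BAM.
        "mark_duplicates": ["java", "-jar", "picard.jar", "MarkDuplicates",
                            f"INPUT={tumor}.sorted.bam",
                            f"OUTPUT={tumor}.dedup.bam",
                            f"METRICS_FILE={tumor}.dup_metrics.txt",
                            "REMOVE_DUPLICATES=true"],
        # Per-sample genotyping of SNP and short indel candidates.
        "haplotype_caller": gatk + ["-T", "HaplotypeCaller", "-R", ref,
                                    "-I", f"{tumor}.recal.bam",
                                    "-o", f"{tumor}.vcf"],
        # Somatic calling in tumor-normal mode with default parameters.
        "mutect2": gatk + ["-T", "MuTect2", "-R", ref,
                           "-I:tumor", f"{tumor}.recal.bam",
                           "-I:normal", f"{normal}.recal.bam",
                           "-o", f"{tumor}.somatic.vcf"],
    }


commands = build_pipeline_commands("SNU-3978T", "SNU-3978B")
for step, argv in commands.items():
    print(f"{step}: {shlex.join(argv)}")
```

Keeping each step as an argument list rather than a shell string avoids quoting problems if the commands are later passed to a process launcher.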
Mutational signatures were evaluated using the MutationalPatterns R package, release 3.6.1, to characterize distinct footprints in genomic context for all somatic SNVs and to evaluate mutational patterns of base substitutions in tumor tissues and matched cell lines.
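As a rough illustration of the base-substitution bookkeeping that underlies such an analysis, the sketch below collapses each SNV onto the pyrimidine reference base, giving the six conventional substitution classes. It is written in Python for illustration and is not the MutationalPatterns implementation; the function names and the toy SNV list are hypothetical.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
SUBSTITUTION_TYPES = ("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")


def substitution_type(ref, alt):
    """Return the substitution class with a pyrimidine (C or T) reference."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    if ref in ("A", "G"):
        # Purine reference: report the change on the complementary strand.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def substitution_spectrum(snvs):
    """Return the fraction of SNVs in each of the six classes."""
    counts = Counter(substitution_type(ref, alt) for ref, alt in snvs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no SNVs supplied")
    return {t: counts.get(t, 0) / total for t in SUBSTITUTION_TYPES}


# Hypothetical toy SNVs as (ref, alt) pairs; not real data.
toy_snvs = [("C", "T"), ("G", "A"), ("A", "G"), ("T", "G")]
print(substitution_spectrum(toy_snvs))
```

A full signature analysis additionally uses the flanking bases of each SNV (the trinucleotide context), which requires the reference genome and is left out here.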
We performed high-depth, short-read, paired-end WES on a fresh-frozen collection of GBM tissues and matched blood samples from 26 GBM patients. Here we describe the sample collection methods, the library preparation and sequencing method, the currently available data records, and technical validations for our dataset. A schematic overview of this study, including the bioinformatics workflow, is also presented (Fig. 3). The DNA samples were captured using the SureSelect Human All Exon V5 Kit. The captured targets were sequenced on a HiSeq 2500 with the paired-end 100 bp read option for cell lines and blood samples and the 200 bp read option for tissue materials in order to counterbalance the dissimilar tumor cellularity of tumor tissues and cell lines, which resulted in an average of 167 million paired-end reads for tissue samples and 83.5 million paired-end reads for blood and cell line samples. Reads were aligned to the hg19 human reference genome, and we obtained high coverage per base position in both tumor tissues and cell lines. On average, we identified 94,044 single nucleotide polymorphisms (SNPs) and 12,518 insertions/deletions (indels) per sample. Of the variants in this cohort, 96.1% were found in dbSNP v142. The average transition/transversion (Ts/Tv) ratio was 2.26 (Table 2). In the tumor tissue cohort, 88.6-99.0% (quartiles) of target regions had greater than 20-fold coverage and 77.9-98.0% had greater than 30-fold coverage. In the blood and cell line cohort, these values were 87.2-96% for 20-fold and 74.7-90.6% for 30-fold coverage, respectively (Fig. 4, Table 3). Post-alignment information is summarized in Table 4.
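For readers unfamiliar with the Ts/Tv statistic quoted above, a minimal sketch of its computation follows. Transitions are purine-to-purine or pyrimidine-to-pyrimidine changes and transversions are all other single-base changes; the function names and the toy SNV list are hypothetical and are not taken from the dataset.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for A<->G or C<->T changes, False for transversions."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    return ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)


def ts_tv_ratio(snvs):
    """Return transitions divided by transversions for (ref, alt) pairs."""
    transitions = sum(is_transition(ref, alt) for ref, alt in snvs)
    transversions = len(snvs) - transitions
    if transversions == 0:
        raise ValueError("no transversions; ratio is undefined")
    return transitions / transversions


# Hypothetical toy SNVs: three transitions and one transversion.
toy_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
print(ts_tv_ratio(toy_snvs))
```

Because each base has one possible transition and two possible transversions, purely random substitutions would give a ratio near 0.5, so a ratio well above that value is commonly read as a sign of genuine variant calls rather than random errors.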

Analysis of CNVs. For the detection of copy number variations (CNVs) and loss of heterozygosity (LOH) from exome sequencing data, we employed the ExomeCNV package in R. The final log ratio of depth of coverage was determined from the number of bases targeted by exome sequencing (targeted bases) and the number of bases actually sequenced (mapped). CNV calls were expressed as 1, 2, and 3, which indicated deletion, normal, and amplification, respectively.
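The idea of a depth-of-coverage log ratio and the 1/2/3 call encoding can be sketched as follows. This is a simplified illustration and not the ExomeCNV algorithm: the normalisation by total mapped depth and the fixed thresholds of -0.4 and 0.4 are assumptions chosen for the example, and the function names are hypothetical.

```python
import math


def depth_log_ratio(tumor_depth, normal_depth, tumor_total, normal_total):
    """log2 ratio of library-size-normalised tumor and normal depth for one exon."""
    if min(tumor_depth, normal_depth, tumor_total, normal_total) <= 0:
        raise ValueError("depths and totals must be positive")
    tumor_norm = tumor_depth / tumor_total
    normal_norm = normal_depth / normal_total
    return math.log2(tumor_norm / normal_norm)


def cnv_call(log_ratio, loss_threshold=-0.4, gain_threshold=0.4):
    """Encode a call as 1 (deletion), 2 (normal) or 3 (amplification).

    The thresholds are hypothetical; a real caller derives them statistically.
    """
    if log_ratio <= loss_threshold:
        return 1
    if log_ratio >= gain_threshold:
        return 3
    return 2


# Hypothetical exon depths with equal library sizes for tumor and normal.
for tumor_depth in (50, 100, 150):
    ratio = depth_log_ratio(tumor_depth, 100, 1_000_000, 1_000_000)
    print(tumor_depth, round(ratio, 3), cnv_call(ratio))
```

In a pure diploid sample a single-copy loss corresponds to log2(1/2) = -1 and a single-copy gain to log2(3/2), about 0.58; normal cell admixture in tissue samples pulls both values toward zero, which is one reason fixed thresholds are only a rough device.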

Data records
The raw FASTQ files are deposited in the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI) under accession number PRJNA896722 11. The GRCh37-aligned BAM files are deposited in the SRA under the same accession number (PRJNA896722). Data are publicly available at https://www.ncbi.nlm.nih.gov/bioproject/PRJNA896722. All cell lines introduced in this study, including their genomic characterization, will be deposited in the Korean Cell Line Bank (http://cellbank.snu.ac.kr) at initial passages to be distributed to researchers worldwide.

Table 4. Post-alignment reads information.

Fig. 5
The sequence read quality of SNU-3978T, a representative tissue sample, is summarized. The forward and reverse quality scores across all bases, depth distribution in target regions, cumulative depth distribution in target regions, and insert size of the exemplary tissue sample are shown.

Technical Validation
Quantitation of the purified DNA samples. The isolated DNA samples were quantified by PicoGreen and NanoDrop. DNA samples were diluted to 4 ng/μl with 1X Low TE Buffer. NanoDrop (Thermo Fisher Scientific) measurements were also performed to assess the quantity and quality of DNA; 260:280 and 260:230 ratios greater than 1.8 were accepted.

Fig. 7
The sequence read quality of SNU-3978, a representative cell line sample, is summarized. The forward and reverse quality scores across all bases, depth distribution in target regions, cumulative depth distribution in target regions, and insert size of the exemplary cell line sample are shown.

Fig. 6
The sequence read quality of SNU-3978B, a representative blood sample, is summarized. The forward and reverse quality scores across all bases, depth distribution in target regions, cumulative depth distribution in target regions, and insert size of the exemplary blood sample are shown.
Quality control of the sheared DNA samples. The quality of the sheared DNA samples (200 ng of each) was checked prior to downstream analysis using the Agilent Bioanalyser 2100 (Agilent Technologies) with the High Sensitivity DNA chip and reagent kit. The electropherogram of each sample showed a DNA fragment size peak at around 150 bp.
Quality check of the amplified samples. The Agilent Bioanalyser 2100 (Agilent Technologies) and the DNA 1000 Assay were used for quality and quantity control of the libraries after PCR. The sample fragment sizes were between 250 and 275 bp.
Quality control of raw reads and sample statistics. Illumina BCL files were converted to FASTQ files by the standard Illumina protocol to remove low-quality reads and adaptors. The forward and reverse quality scores across all bases, depth distribution in target regions, cumulative depth distribution in target regions, and insert size are summarized in Figs. 5-7 for tissue, blood and cell line samples, respectively. Information on mappable reads and on-target reads is summarized in Table 4.
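The cumulative depth distribution in target regions mentioned above can be illustrated with a minimal sketch. It assumes that per-base depths over the target regions are already available as a list of integers (for example, parsed from a depth report); the function names, the default thresholds and the toy depths are hypothetical.

```python
def fraction_covered(depths, min_depth):
    """Fraction of target bases covered at or above `min_depth`."""
    if not depths:
        raise ValueError("no target bases supplied")
    return sum(d >= min_depth for d in depths) / len(depths)


def cumulative_depth_distribution(depths, thresholds=(1, 10, 20, 30, 50)):
    """Map each depth threshold to the fraction of target bases reaching it."""
    return {t: fraction_covered(depths, t) for t in thresholds}


# Hypothetical per-base depths over a tiny target region; not real data.
toy_depths = [0, 5, 20, 25, 30, 100]
for threshold, fraction in cumulative_depth_distribution(toy_depths).items():
    print(f">= {threshold}x: {fraction:.2f}")
```

The 20-fold and 30-fold entries of such a table correspond to the kind of coverage summary that the figures and tables report for each sample.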