Transcriptomic profiling of 39 commonly-used neuroblastoma cell lines

Neuroblastoma cell lines are an important and cost-effective model used to study oncogenic drivers of the disease. While many of these cell lines have been previously characterized with SNP, methylation, and/or mRNA expression microarrays, there has not been an effort to comprehensively sequence these cell lines. Here, we present raw whole transcriptome data generated by RNA sequencing of 39 commonly-used neuroblastoma cell lines. These data can be used to perform differential expression analysis based on a genetic aberration or phenotype in neuroblastoma (e.g., MYCN amplification status, ALK mutation status, chromosome arm 1p, 11q and/or 17q status, sensitivity to pharmacologic perturbation). Additionally, we designed this experiment to enable structural variant and/or long-noncoding RNA analysis across these cell lines. Finally, as more DNase/ATAC and histone/transcription factor ChIP sequencing is performed in these cell lines, our RNA-Seq data will be an important complement to inform transcriptional targets as well as regulatory (enhancer or repressor) elements in neuroblastoma.


Background & Summary
An estimated 15,780 children were diagnosed with cancer in 2014 In the United States, and per year globally, this number is nearly 250,000 (ref. 1). Although the 5-year survival rate of pediatric cancers is 80%, many of the most commonly diagnosed childhood cancers: brain tumors, Wilms tumor, rhabdomyosarcoma, and high-risk neuroblastoma, have devastatingly low rates of survival 1 , demonstrating the continued need for research progress in these areas. Here, we focus on neuroblastoma, the most common extracranial solid tumor in children. This disease has an estimated incidence of 1 in 8,000 to 10,000 births 2 and a 5-year survival rate of >95% for children in the low and intermediate risk groups. However, children with high-risk disease have only a 40% likelihood of survival 2 . Culturing of neuroblastoma cell lines dates back to the 1940s (ref. 3), during which the sole purpose of culturing was for diagnosis. However, producing cell lines from neuroblastoma tumors quickly became routine (see review 4 ) and today, they are commonly-used, highly-characterized models used in laboratories across the world. Neuroblastoma cell lines nicely model a tumor's histopathology, gene expression, aneuploidy, and drug sensitivity, thus they are routinely used to investigate oncogenes or signaling pathways pharmacologically (drug screens, drug sensitivity/resistance) and/or genetically (siRNA, shRNA, CRISPR).
The genomics of neuroblastoma cell lines have been previously characterized using SNP 5 , methylation 5,6 , and/or mRNA expression microarrays [7][8][9] , however, there has not been an effort to profile a large panel of these cell lines with high-throughput sequencing techniques. The motivation behind this study was to comprehensively profile the mRNA and non-coding RNA transcriptome of commonly-used neuroblastoma cell line models with a major goal of using this information as a complement to the epigenomic data currently available and the many data in the process of being generated. Integration of RNA expression patterns with histone and/or transcription factor chromatin immunoprecipitation (ChIP) sequencing is necessary for inferring transcriptional regulatory events. Neuroblastomas can be classified into various groups based on genetic lesions, for example: MYCN copy number amplification, harboring an activating ALK mutation, harboring a chromosomal loss (e.g.,: 1p, 3p, 11q) or gain (17q), TERT rearrangements (for review of neuroblastoma genomics, see ref. 10). Utilizing a panel of cell lines which harbor a mixture of these characteristics enables differential expression analyses on the basis of a genetic lesion, mutation of interest, or expression of a gene of interest.
These data have reuse value to inform selection of cell lines for experimental investigation of putative neuroblastoma oncogenes and/or tumor suppressors. For example, choice of knock-down or overexpression studies require a priori knowledge of basal expression of the gene of interest for rational experimental design. These data allow the experimenter to quickly determine which cell lines are high, mid, or low expressers of a gene of interest without requiring tedious quantitative, real-time PCR analysis or western blotting of multiple cell lines prior to initiating a gene over-expression or knockdown experiment.
Here, we describe transcriptome-wide profiling of 39 neuroblastoma cell lines, the hTERTimmortalized retinal pigmented epithelial cell line, RPE-1, and pooled human fetal brain tissue. Careful and stringent technical design at each experimental stage has allowed generation of a high-quality RNA-Seq dataset which has tremendous reuse value for the neuroblastoma community. An overview of the study design is depicted in Fig. 1. Briefly, cell lines were thawed, grown, and collected at 60-80% confluency over a two-month period. Once all cell lines were pelleted, RNA extractions were performed, quality of RNA inspected, and RNA sequencing was performed. Raw FASTQ files were generated and are publicly-available for reuse (see Data Citation 1). Additionally, we provide a processed file of gene-level mRNA abundances for each sample. We anticipate this data being a valuable tool for the neuroblastoma research community as we continue investigation into oncogenomic mechanisms of this disease.

Cell lines and culturing
Cell line stocks were obtained from the Children's Oncology Group (COG) Cell Culture and Xenograft Repository at Texas Tech University Health Sciences Center (www.COGcell.org), the American Type Culture Collection (Manassas, VA), or the Children's Hospital of Philadelphia (CHOP) cell line bank. Several of the COG-derived cell lines were established direct-to-culture in parallel with a patient-derived xenograft model 11 that are being characterized separately (see Table 1 (available online only)). All cell culturing for this experiment was performed at CHOP. Each cell line was thawed for 2-3 min in a 37°C water bath, added to a 15 ml tube containing its respective growth medium, and pelleted by centrifugation at 300 × g for 3 min. The supernatant was discarded to remove the DMSO-containing freezing medium. Cells were re-suspended in 1 ml of growth medium and transferred into a T75 flask containing an additional 10 ml of growth medium. Once cells were~70-80% confluent, they were transferred to a 150 mm dish. At~70-80% confluency, cells were split into two 150 mm dishes and at 70-80% confluency, each dish of cells was pelleted, washed 3x with 1X PBS, and frozen at −80C until nucleic acids were extracted. See Table 1 (available online only) for a complete listing of cell lines, whether a matched patient-derived xenograft (PDX) exists, and their growth medium. Cell lines appended with 'nb' were grown in serum-free neurobasal medium. The following were purchased from rhFGF (fibroblast growth factor, Cat# PAG5071) and rhEGF (epidermal growth factor, Cat# PAG5021). Insulin/Transferrin/Selenium (ITS) premix culture supplement was purchased from Corning Life Sciences (Tewksbury, MA, Cat# 354351). Hyclone Fetal bovine serum was purchased from Fisher Scientific (Cat# SH30071.03) and the lot remained consistent across the different medium formulations throughout the duration of the experiment. Of note, SK-N-BE(2)-C is a subclone derived from the parental SK-N-BE(2) cell line 12 and SH-SY5Y was derived from the SH-SY subclone of the parental SK-N-SH cell line 13 .
Throughout the duration of the study, randomization was implemented to ensure unbiased data production. Cell lines were thawed in random order, nucleic acid extractions were performed randomly, and library preps and sequencing were performed randomly. Phenotypic characteristics of each cell line were also assessed as quality control during the cell growth stage. No unusual morphologies or growth rates were noted.

DNA extraction and STR profiling
From separate cell pellets, DNA was extracted using the DNeasy Blood & Tissue Kit (Cat# 69504, Qiagen, Valencia, CA). DNA was quantitated using the Nanodrop 1000 (Thermo Fisher Scientific) and Short Tandem Repeat (STR) profiling employed either the AmpFLSTR Identifiler PCR Amplification kit (Applied Biosystems, Foster City, CA) by the Children's Hospital of Philadelphia Nucleic Acids and Protein Core or the PowerPlex Fusion kit (Promega, Madison, WI) by Guardian Forensic Sciences (Abington, PA). All cell line STRs matched publicly-available references listed at http://strdb. cogcell.org/.

RNA extraction
Control human fetal brain total RNA (Cat# 636526, Lot#1605061A) was purchased from Clontech Laboratories (Mountain View, CA). This RNA was a pool of normal brain tissue from 21 spontaneously aborted male/female Caucasian fetuses of ages 26-40 weeks and was isolated using a modified guanidinium thiocyanate method 14 . For all cell lines, RNA was extracted using the miRNeasy Mini kit (Cat# 217004) from Qiagen (Valencia, CA) according to the manufacturer's protocol. RNA purity was assessed using the Nanodrop 2000 (Thermo Fisher Scientific) and quantitated with the Qubit 2.0 Fluorometer (Thermo Fisher Scientific). Quality and RNA integrity numbers (RINs) were assessed using the TapeStation 2200 (Agilent Technologies, Santa Clara, CA). Each cell line RNA sample had a RIN ≥ 8.7 and the RIN for the fetal brain RNA was 7.6, thus all RNA was of high quality.  15 was used to index the full hg19 genome fasta file from UCSC using the following parameters: $ STAR --runMode genomeGenerate --runThreadN 16 --genomeDir idx_dir --genomeFastaFiles ucsc.hg19.fa --sjdbGTFfile refSeq_hg19_2016-03-03.gtf --sjdbOverhang 100 The GTF file was downloaded using the genePredToGtf command from the kent utility (available for download at http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/): $ genePredToGtf hg19 knownGene knownGene.gtf Next, sequences were aligned and counts per gene were generated using the following parameters in two-pass mode: Alignment resulted in an average of 66 million uniquely-mapped reads per sample. STAR two-pass mode alignment was chosen as it has been shown to have 99% alignment accuracy and has nearly 20x faster processing speed compared with TopHat2 and similar processing speed as HISAT two-pass mode 16 .

Generation of FPKM
A custom R script was used to generate gene fragments per kilobase of exons per million reads (FPKM) from the count data produced from STAR. The Genomic Features Package version 1.22.13 (available for download at https://bioconductor.org/packages/release/bioc/html/GenomicFeatures.html) was used with R Version 3.2.2 (Fire Safety) to make the transcriptome database and figures were produced using ggplot2 version 2.1.0 (http://ggplot2.org/).

Differential expression analyses
Differential expression of genes based on MYCN amplification status was performed separately for cell lines and primary neuroblastoma tumor samples using the R package, DESeq2 (version 1.10.1) 17 . FASTQ files and MYCN status for patient tumors were obtained with consent through the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) Consortium (see Data Citation 2, https://ocg.cancer.gov/programs/target/data-matrix). Next, the differentially-expressed genes' log 2 -transformed mean expression and the log 2 fold-change were correlated between the cell lines and patient samples.

Code availability
R scripts for generation of FPKM and differential expression analyses are available for download at: https://github.com/marislab/NBL-cell-line-RNA-seq.

Data Records
All raw RNA-sequencing data (paired FASTQ files) as well as the processed FPKM matrix from this study have been deposited into the Gene Expression Omnibus (GEO) under Accession Number GSE89413 (see Data Citation 1). For associated specimen metadata, see Table 1 (available online only) and for www.nature.com/sdata/ SCIENTIFIC DATA | 4:170033 | DOI: 10.1038/sdata.2017.33 associated assay metadata, see Table 2 (available online only). Raw single nucleotide polymorphism (SNP) array IDAT files and processed Genome Studio files for 27 of the cell lines have been deposited into GEO under Accession Number GSE89968 (see Data Citation 3). Together, these data make up the GEO Super Series GSE89969.

Technical Validation
As a technical validation of our RNA-Seq data, we generated FPKM for all genes (See Methods and Data Citation 1) and compared MYCN FPKM with each cell line's known copy number amplification status across cell lines ( Fig. 2 and Table 3 (available online only)) . Of note, the tumor from which the NLF cell line was derived was MYCN copy number amplified by the fluorescence in situ hybridization, however, it is not amplified at the protein level 18 and therefore, as expected, has the lowest MYCN FPKM of all cell lines designated as MYCN amplified. All cell lines were concordant with known MYCN amplification status.
Next, for both cell lines and neuroblastoma patient data, we performed differential expression analyses based on MYCN genomic amplification status using the R package, DESeq2 (ref. 17). We correlated the DESeq2 base mean of the common differentially-expressed genes (N = 2,395) between cell lines and primary patient tumors, which were significantly correlated (Fig. 3a, Pearson's R = 0.824, t = 71.131, df = 2,393, 95% CI = 0.811-0.836, P o2.2 e-16). The fold changes of these genes were also significantly correlated between the cell lines and patient samples (Fig. 3b, Pearson's R = 0.73, t = 52.231, df = 2,393, 95% CI = 0.711-0.748, P o2.2 e-16), not only supporting the technical validity of our dataset, but also emphasizing the utility of these cell lines as a surrogate model for neuroblastoma.

Usage Notes
All raw FASTQ files and the associated FPKM matrix file can be downloaded from the Gene Expression Omnibus ( performed using RSEM 19 and/or transcript level analyses can be performed using kallisto 20 . Use of kallisto will also allow quantification of non-coding RNA abundances. Differential expression analyses may be performed using the common R packages, limma 21 or DESeq2 (ref. 17). Differentially expressed gene lists can be explored for enrichment in signaling pathways using Ingenuity Pathway Analysis (Qiagen, http://www.ingenuity.com/products/ipa) and/or gene ontologies using ToppGene 22 or the Gene Ontology Consortium tool 23 . Finally, these expression data can be integrated with epigenomics datasets (e.g.: ChIP-Seq, DNase-Seq/ATAC-Seq, Histone ChIP-Seq) to infer transcriptional regulation or repression.