A high-throughput molecular data resource for cutaneous neurofibromas

Neurofibromatosis type 1 (NF1) is a genetic disorder with a range of clinical manifestations such as widespread growth of benign tumours called neurofibromas, pain, learning disorders, bone deformities, vascular abnormalities and even malignant tumours. With the establishment of the Children’s Tumour Foundation biobank, neurofibroma samples can now be collected directly from patients to be analysed by the larger scientific community. This work describes a pilot study to characterize one class of neurofibroma, cutaneous neurofibromas, by molecularly profiling of ~40 cutaneous neurofibromas collected from 11 individual patients. Data collected from each tumour includes (1) SNP Arrays, (2) Whole genome sequencing (WGS) and (3) RNA-Sequencing. These data are now freely available for further analysis at http://www.synapse.org/cutaneousNF.


Background & Summary
Neurofibromatosis (NF) describes three forms of genetic conditions: NF1, NF2 and Schwannomatosis. People with NF may exhibit a wide range of clinical symptoms including neurocognitive deficits, brain tumours, bone abnormalities, and vascular disease. Effective treatments for people with NF are lacking due to the diversity of disease symptoms and underlying difficulty in treating any genetic disorder involving the loss of a tumour suppressor genes. NF1 is the most common of the NF syndromes with an incidence of approximately 1/2,500 worldwide 1 . While NF1 has been linked to loss of function in the NF1 gene-a known tumour suppressor-due to mutation or deletion, there is a high degree of phenotypic diversity in the patient population, making it difficult to predict disease progression and to treat it effectively 2 . NF1 patients are susceptible to growths of two types of neurofibromas: plexiform neurofibromas, that can cause pain in affected patients and can transform into deadly malignant peripheral nerve sheath tumours, and cutaneous neurofibromas.
Cutaneous neurofibromas (cNFs) are benign lesions that occur during adolescence in NF1 patients and increase throughout their lives. Cutaneous neurofibromas can be painful and disfiguring although they generally do not affect the overall physical health of the affected individual. These lesions generally arise upon mutation or loss of the second NF1 (ref. 3) allele. Although the microenvironment has been known to play a role in tumour growth 4,5 , little is known about the molecular aetiology of these tumours themselves. To address this knowledge gap in the cutaneous neurofibroma field, we sought to improve overall knowledge of cutaneous neurofibromas through high-throughput molecular characterization of a diverse set of samples across patients.
This resource represents a highly collaborative effort that spans multiple institutions (see Fig. 1). cNF samples and patient blood were collected in the clinic by Children's Tumour Foundation (CTF), who preserved and annotated the samples and sent them to the CTF Biobank. DNA from tumour and blood was profiled by (1) whole genome sequencing (WGS) and (2) Single nucleotide polymorphism (SNP) array and tumour RNA was profiled by (3) RNA-Sequencing. Data was then annotated and compiled into a single resource built on the Synapse 6 platform available at http://www.synapse.org/cutaneousNF (Data Citation 1).

Sample collection
Eligible patients between the ages of 18 and 65 who were scheduled to receive elective surgery to remove cutaneous neurofibromas were invited to donate surgical discards as well as blood and urine samples. Patients consented to the release of their tissue samples and any resultant data through a protocol approved by Western IRB (Puyallup, WA). Patient-reported medical, family and NF1 history was also collected. Sample information is available in the table in the data repository (syn5556216; Data Citation 1). Each tumour was sub-divided with each part either preserved in formalin, flash frozen in liquid nitrogen, or placed in RNA later solution within 45 min of removal from the patient. If there was not enough tissue for all three methods of preservation, priority was given to formalin, followed by frozen. Samples noted by the patient as recently growing were annotated as such, and each sample was labelled by the location of the body from which it was retrieved. Skin covering the tumours was not removed prior to preservation but was removed prior to sequencing. 30 cc of blood was also collected from each patient and stored at −80°C. PBMCs were isolated and stored at −150°C.

Whole genome sequencing
Genomic DNA (gDNA) was extracted from both frozen tumour and blood tissues using the DNeasy mini kit (Qiagen, Valencia, CA, USA) according to the manufacturer's protocol. The final elution was performed in 50 μl of RNase-free sterile distilled water. The concentration of the gDNA was estimated using the Qubit 2.0 Fluorometer (Invitrogen); samples with less than 500 ng were discarded. The integrity of the gDNA was assessed by running an aliquot of the gDNA on 0-8% agarose gel to confirm absence of RNA or protein bands as well as absence of a smear that would indicate degradation. One microgram gDNA was required for downstream whole genome library preparation applications. The samples were then sonicated on the Covaris LE220 (Covaris Inc, Woburn, MA, USA) to achieve an average target size of 400 bp. QC analysis of the post-sonicated material was done using Caliper LabChip GX (Perkin Elmer, Hopkinton, MA, USA).
Standard whole genome library prep was done using the NEBNext DNA Library Prep Reagent Set for Illumina (New England BioLabs Inc., Ipswich, MA, USA) as per manufacturer's recommended protocol. Library quality was assessed using the Qubit 2.0 Fluorometer, and the library concentration was estimated by utilizing a DNA 1,000 Chip on an Agilent 2,100 Bioanalyzer. Accurate quantification for sequencing applications was determined using the qPCR-based KAPA Biosystems Library Quantification Kit (Kapa Biosystems, Inc., Woburn, MA, USA). Each sample was then sequenced on an individual lane on an Illumina HiSeqX sequencer (Illumina, Inc., San Diego, CA, USA).
Reads were mapped to the genome and variants identified using the Dragen 7 program with default settings. The resulting VCF files (syn5522788; Data Citation 1) were then sorted and updated to remove errors and analysed using the GATK pipeline 8 (syn5522790; Data Citation 1).
Somatic mutations were called from BAM files exported by the Dragen alignment using the Java version of VarDict 9 in paired mode with an allele frequency threshold of 0.01. VarDict was able to recall germ line variants and also identify somatic variants that were present in the tumour samples but absent in the matched PBMC samples for each patient (syn6022465 and syn6022474; Data Citation 1).

Copy number analysis
Genomic DNA (gDNA) was extracted as described above, but required two hundred and fifty nanograms of gDNA for downstream SNP array applications.

RNA sequencing
Total RNA containing both mRNA as well as microRNA fractions was extracted from the tissues using the miRNeasy mini kit (Qiagen, Valencia, CA, USA) according to the manufacturer's protocol. The final elution was performed in 30 μl of RNase-free sterile distilled water. The concentration and integrity of the extracted total RNA were estimated using the Qubit 2.0 Fluorometer (Invitrogen) and Agilent    Post-processing of the sequencing reads from RNA-seq experiments for each sample was performed using HudsonAlpha's unique in-house RNA-seq data analysis pipeline (syn6035832; Data Citation 1). Briefly, quality control checks on raw sequence data for each sample were performed using FastQC (Babraham Bioinformatics, Cambridge, UK). Raw reads were mapped to the reference human genome hg19 using TopHat v2.0 (ref. 10) with the -p 4 and -r210 arguments. The alignment metrics of the mapped reads were estimated using SAMtools 11 (syn6022474; Data Citation 1).

Code availability
All data is currently stored in the synapse web portal, and is accessible using code in the Cutaneous NF Github repository at http://www.github.com/Sage-Bionetworks/dermalNF.

Data Records
Data collected for each patient and sample have been annotated in the Synapse Table located within the online repository. Individual patients are described in Table 1, and the data available for each are described in Table 2.

Technical Validation
Validation for each dataset was performed individually.

SNP array data
Omni Array data were processed using Illumina Genome Studio and exported for further analysis using R to cluster regions that exhibited copy number alterations in either the germline (PBMC) or tissue samples.
To detect outliers we first plotted the logR ratio values and B allele frequencies computed by Genome Studio (Figs 2a,b) to insure that they follow a similar distribution across all samples. We then used hierarchical clustering to identify any possible outliers in the data. Specifically we clustered the median value of each segment of copy number alteration computed by the DNAcopy R package 13 . The resulting clusters, depicted in Fig. 2c, show strong corresponding clustering by patient.

Whole genome sequencing
In addition to basic library quality described above, we used the bcftools package 14 to measure crosssample discordance and identify potential outliers. Between each pair of samples we measured the number of shared variants called and the discordance measured between the pair of samples. The resulting values are depicted in Fig. 3. Similar to the copy number data patient, samples clustered with one another, with the exception of one sample, derived from Patient 10 PBMC, which was dropped from further analysis.

RNA-Seq data
Quality control was performed on the RNA libraries prior to sequencing using the Agilent Bioanlyzer. Thirteen of the 44 samples were not sequenced as their RIN levels were less than 7. Final quality control of the samples was performed using the LabChip GX. Full QC and alignment statistics are shown in Table 3.

Usage Notes
All data are stored at the synapse web portal at http://www.synapse.org/cutaneousNF. Specific steps required to obtain access are described on the 'Accessing the data' wiki and involve: