Background & Summary

Neurofibromatosis (NF) describes three genetic conditions: NF1, NF2 and Schwannomatosis. People with NF may exhibit a wide range of clinical symptoms including neurocognitive deficits, brain tumours, bone abnormalities, and vascular disease. Effective treatments for people with NF are lacking due to the diversity of disease symptoms and the underlying difficulty of treating any genetic disorder involving the loss of a tumour suppressor gene.

NF1 is the most common of the NF syndromes with an incidence of approximately 1/2,500 worldwide1. While NF1 has been linked to loss of function in the NF1 gene (a known tumour suppressor) due to mutation or deletion, there is a high degree of phenotypic diversity in the patient population, making it difficult to predict disease progression and to treat it effectively2. NF1 patients are susceptible to two types of neurofibroma: plexiform neurofibromas, which can cause pain in affected patients and can transform into deadly malignant peripheral nerve sheath tumours, and cutaneous neurofibromas.

Cutaneous neurofibromas (cNFs) are benign lesions that occur during adolescence in NF1 patients and increase throughout their lives. Cutaneous neurofibromas can be painful and disfiguring although they generally do not affect the overall physical health of the affected individual. These lesions generally arise upon mutation or loss of the second NF1 (ref. 3) allele. Although the microenvironment has been known to play a role in tumour growth4,5, little is known about the molecular aetiology of these tumours themselves. To address this knowledge gap in the cutaneous neurofibroma field, we sought to improve overall knowledge of cutaneous neurofibromas through high-throughput molecular characterization of a diverse set of samples across patients.

This resource represents a highly collaborative effort that spans multiple institutions (see Fig. 1). cNF samples and patient blood were collected in the clinic by Children’s Tumour Foundation (CTF), who preserved and annotated the samples and sent them to the CTF Biobank. DNA from tumour and blood was profiled by (1) whole genome sequencing (WGS) and (2) single nucleotide polymorphism (SNP) array, and tumour RNA was profiled by (3) RNA-Sequencing. Data were then annotated and compiled into a single resource built on the Synapse6 platform available at http://www.synapse.org/cutaneousNF (Data Citation 1).

Figure 1: From sample to dataset.

A description of the collaborative effort that enabled the Cutaneous NF data repository and subsequent analysis of the samples.

Methods

Sample collection

Eligible patients between the ages of 18 and 65 who were scheduled to receive elective surgery to remove cutaneous neurofibromas were invited to donate surgical discards as well as blood and urine samples. Patients consented to the release of their tissue samples and any resultant data through a protocol approved by Western IRB (Puyallup, WA). Patient-reported medical, family and NF1 history was also collected. Sample information is available in the table in the data repository (syn5556216; Data Citation 1).

Each tumour was subdivided, and each part was preserved in formalin, flash frozen in liquid nitrogen, or placed in RNAlater solution within 45 min of removal from the patient. If there was not enough tissue for all three methods of preservation, priority was given to formalin, followed by frozen. Samples noted by the patient as recently growing were annotated as such, and each sample was labelled by the location of the body from which it was retrieved. Skin covering the tumours was not removed prior to preservation but was removed prior to sequencing.

30 cc of blood was also collected from each patient and stored at −80 °C. PBMCs were isolated and stored at −150 °C.

Whole genome sequencing

Genomic DNA (gDNA) was extracted from both frozen tumour and blood tissues using the DNeasy mini kit (Qiagen, Valencia, CA, USA) according to the manufacturer's protocol. The final elution was performed in 50 μl of RNase-free sterile distilled water. The concentration of the gDNA was estimated using the Qubit 2.0 Fluorometer (Invitrogen); samples with less than 500 ng were discarded. The integrity of the gDNA was assessed by running an aliquot of the gDNA on a 0.8% agarose gel to confirm absence of RNA or protein bands as well as absence of a smear that would indicate degradation. One microgram of gDNA was required for downstream whole genome library preparation applications. The samples were then sonicated on the Covaris LE220 (Covaris Inc, Woburn, MA, USA) to achieve an average target size of 400 bp. QC analysis of the post-sonicated material was done using Caliper LabChip GX (Perkin Elmer, Hopkinton, MA, USA).

Standard whole genome library prep was done using the NEBNext DNA Library Prep Reagent Set for Illumina (New England BioLabs Inc., Ipswich, MA, USA) as per manufacturer's recommended protocol. Library quality was assessed using the Qubit 2.0 Fluorometer, and the library concentration was estimated by utilizing a DNA 1000 Chip on an Agilent 2100 Bioanalyzer. Accurate quantification for sequencing applications was determined using the qPCR-based KAPA Biosystems Library Quantification Kit (Kapa Biosystems, Inc., Woburn, MA, USA). Each sample was then sequenced on an individual lane on an Illumina HiSeqX sequencer (Illumina, Inc., San Diego, CA, USA).

Reads were mapped to the genome and variants identified using the Dragen7 program with default settings. The resulting VCF files (syn5522788; Data Citation 1) were then sorted and updated to remove errors and analysed using the GATK pipeline8 (syn5522790; Data Citation 1).

Somatic mutations were called from BAM files exported by the Dragen alignment using the Java version of VarDict9 in paired mode with an allele frequency threshold of 0.01. VarDict was able to recall germline variants and also to identify somatic variants that were present in the tumour samples but absent in the matched PBMC samples for each patient (syn6022465 and syn6022474; Data Citation 1).
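The paired comparison described above can be illustrated with a minimal sketch. This is not VarDict's algorithm, which works from aligned reads and applies its own statistical tests; it only shows the idea of keeping variants that reach the 0.01 allele frequency threshold in the tumour and separating those also seen in the matched PBMC sample (germline) from those that are not (somatic). The `Variant` record, the function name and the example values are hypothetical.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    allele_freq: float  # fraction of reads supporting the alternate allele


AF_THRESHOLD = 0.01  # allele frequency threshold quoted in the text


def split_germline_somatic(tumour, normal, min_af=AF_THRESHOLD):
    """Label tumour variants as germline (also in normal) or somatic (tumour only)."""
    normal_sites = {
        (v.chrom, v.pos, v.ref, v.alt) for v in normal if v.allele_freq >= min_af
    }
    germline, somatic = [], []
    for v in tumour:
        if v.allele_freq < min_af:
            continue  # below the threshold, so not reported at all
        key = (v.chrom, v.pos, v.ref, v.alt)
        (germline if key in normal_sites else somatic).append(v)
    return germline, somatic


# Illustrative values only; they are not taken from the dataset.
tumour = [
    Variant("chr17", 100, "C", "T", 0.48),
    Variant("chr17", 200, "G", "A", 0.07),
    Variant("chr17", 300, "A", "G", 0.004),
]
normal = [Variant("chr17", 100, "C", "T", 0.51)]
germline, somatic = split_germline_somatic(tumour, normal)
```

In this toy input the first variant is shared with the PBMC sample, the second is tumour-only, and the third falls below the threshold and is dropped.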

Copy number analysis

Genomic DNA (gDNA) was extracted as described above; 250 ng of gDNA was required for downstream SNP array applications.

SNP array sample prep was done using the HumanOmni2.5-8 (Illumina, Inc., San Diego, CA, USA) following the manufacturer's recommended protocol, as described in:

http://support.illumina.com/content/dam/illumina-marketing/documents/products/workflows/workflow_infinium.pdf

http://support.illumina.com/content/dam/illumina-support/documents/documentation/chemistry_documentation/infinium_assays/infinium_lcg_assay/infinium-lcg-assay-guide-15023139-d.pdf

Samples were analysed by GenomeStudio and exported to text (syn5004874; Data Citation 1).

RNA sequencing

Total RNA containing both mRNA as well as microRNA fractions was extracted from the tissues using the miRNeasy mini kit (Qiagen, Valencia, CA, USA) according to the manufacturer's protocol. The final elution was performed in 30 μl of RNase-free sterile distilled water. The concentration and integrity of the extracted total RNA were estimated using the Qubit 2.0 Fluorometer (Invitrogen) and Agilent 2100 Bioanalyzer (Applied Biosystems, Carlsbad, CA, USA), respectively. Five hundred nanograms of total RNA and a RIN of 7.0 or higher were required for downstream RNA-seq applications. Poly-adenylated RNAs were isolated using NEBNext Magnetic Oligo d(T)25 Beads. The NEBNext mRNA Library Prep Reagent Set for Illumina (New England BioLabs Inc., Ipswich, MA, USA) was then used to prepare individually bar-coded next-generation sequencing expression libraries as per manufacturer's recommended protocol. Library quality was assessed using the Qubit 2.0 Fluorometer, and the library concentration was estimated by utilizing a DNA 1000 Chip on an Agilent 2100 Bioanalyzer. Accurate quantification for sequencing applications was determined using the qPCR-based KAPA Biosystems Library Quantification Kit (Kapa Biosystems, Inc., Woburn, MA, USA). Each library was diluted to a final concentration of 12.5 nM and pooled in an equimolar ratio prior to clustering. Paired-end sequencing (25 million, 50-bp, paired-end reads) was performed using a 200 Cycle TruSeq SBS HS v4 Kit on an Illumina HiSeq2500 sequencer (Illumina, Inc., San Diego, CA, USA).
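The dilution to 12.5 nM follows the usual C1·V1 = C2·V2 relationship. The sketch below works through that arithmetic under stated assumptions: the conversion from ng/μl to nM uses the common approximation of 660 g/mol per base pair of double-stranded DNA, and the stock concentration, fragment length and final volume are hypothetical values chosen for illustration, not figures from this study.

```python
def library_molarity_nm(conc_ng_per_ul: float, mean_fragment_bp: float) -> float:
    """Convert a dsDNA concentration to nM, assuming ~660 g/mol per base pair."""
    return conc_ng_per_ul * 1e6 / (660.0 * mean_fragment_bp)


def dilution_volumes(stock_nm: float, target_nm: float = 12.5, final_ul: float = 20.0):
    """Return (stock_ul, diluent_ul) that satisfy C1 * V1 = C2 * V2."""
    if stock_nm < target_nm:
        raise ValueError("stock is already below the target concentration")
    stock_ul = target_nm * final_ul / stock_nm
    return stock_ul, final_ul - stock_ul


# Hypothetical library: 6.6 ng/ul with a mean fragment size of 400 bp.
stock_nm = library_molarity_nm(conc_ng_per_ul=6.6, mean_fragment_bp=400)
stock_ul, diluent_ul = dilution_volumes(stock_nm)
```

Once every library is at the same molarity, pooling equal volumes gives an equimolar pool.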

Post-processing of the sequencing reads from RNA-seq experiments for each sample was performed using HudsonAlpha’s unique in-house RNA-seq data analysis pipeline (syn6035832; Data Citation 1). Briefly, quality control checks on raw sequence data for each sample were performed using FastQC (Babraham Bioinformatics, Cambridge, UK). Raw reads were mapped to the reference human genome hg19 using TopHat v2.0 (ref. 10) with the -p 4 and -r210 arguments. The alignment metrics of the mapped reads were estimated using SAMtools11 (syn6022474; Data Citation 1).

Reads were quantified by both Cufflinks v0.9.3 (ref. 10) (syn5492805; Data Citation 1) and FeatureCounts12 (syn5493036; Data Citation 1).

Code availability

All data are currently stored in the Synapse web portal, and are accessible using code in the Cutaneous NF GitHub repository at http://www.github.com/Sage-Bionetworks/dermalNF.

Data Records

Data collected for each patient and sample have been annotated in the Synapse Table located within the online repository. Individual patients are described in Table 1, and the data available for each are described in Table 2.

Table 1 Description of patients profiled in this study.
Table 2 Description of tumours profiled in this study and the available data for each tumour.

Technical Validation

Validation for each dataset was performed individually.

SNP array data

Omni Array data were processed using Illumina GenomeStudio and exported for further analysis using R to cluster regions that exhibited copy number alterations in either the germline (PBMC) or tissue samples.

To detect outliers, we first plotted the logR ratio values and B allele frequencies computed by GenomeStudio (Figs 2a,b) to ensure that they follow a similar distribution across all samples. We then used hierarchical clustering to identify any possible outliers in the data. Specifically, we clustered the median value of each segment of copy number alteration computed by the DNAcopy R package13. The resulting clusters, depicted in Fig. 2c, show strong corresponding clustering by patient.

Figure 2: Copy number quality control metrics.

Distribution of SNP array values across samples. Patient samples represented by different colours. Teal outline represents tumour samples while pink outline represents blood. (a) Represents B allele frequencies, (b) represents log R ratio values. (c) Clustering of segmented values, with rows below representing patient samples (colours) and tissue of origin (grey for tumour, black for blood).
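The clustering step can be sketched as follows. This is an illustration in Python rather than the R analysis used for the resource: segment boundaries are supplied by hand instead of being computed by DNAcopy, average linkage on Euclidean distance is an assumed choice (the linkage method is not stated in the text), and all sample names and values are hypothetical.

```python
from itertools import combinations
from math import dist
from statistics import median


def segment_medians(log_r, segments):
    """Median logR ratio within each (start, end) index range."""
    return [median(log_r[start:end]) for start, end in segments]


def average_linkage(profiles, n_clusters):
    """Naive agglomerative clustering; returns a list of sets of sample names."""
    clusters = [{name} for name in profiles]

    def linkage(a, b):
        pairs = [(x, y) for x in a for y in b]
        return sum(dist(profiles[x], profiles[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        # Merge the two clusters with the smallest average pairwise distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
    return clusters


# Hypothetical logR values over two hand-picked segments.
segments = [(0, 3), (3, 6)]
raw = {
    "patient1_tumour": [0.0, 0.1, -0.1, -0.5, -0.6, -0.4],
    "patient1_blood": [0.0, 0.05, -0.05, -0.5, -0.55, -0.45],
    "patient2_tumour": [0.3, 0.4, 0.2, 0.0, 0.1, -0.1],
    "patient2_blood": [0.3, 0.35, 0.25, 0.0, 0.05, -0.05],
}
profiles = {name: segment_medians(values, segments) for name, values in raw.items()}
clusters = average_linkage(profiles, n_clusters=2)
```

A sample that fails to join its own patient's cluster would be a candidate outlier.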

Whole genome sequencing

In addition to basic library quality described above, we used the bcftools package14 to measure cross-sample discordance and identify potential outliers. Between each pair of samples we measured the number of shared variants called and the discordance measured between the pair of samples. The resulting values are depicted in Fig. 3. Similar to the copy number data, patient samples clustered with one another, with the exception of one sample, derived from Patient 10 PBMC, which was dropped from further analysis.

Figure 3: Number of genomic variant sites shared between all pairs of samples.

Samples are labelled by patient and tissue type and indicate that all samples aside from the outlier (Patient 10 blood) cluster by patient.
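The pairwise comparison can be illustrated with a small sketch. It does not reproduce the bcftools computation; it simply counts, for each pair of samples, the variant sites called in both and the number of those sites where the genotypes disagree. The data structure, function name and genotype calls are hypothetical.

```python
from itertools import combinations


def pairwise_discordance(genotypes):
    """genotypes: {sample: {site: genotype}} -> {(s1, s2): (shared, discordant)}."""
    result = {}
    for s1, s2 in combinations(sorted(genotypes), 2):
        shared_sites = genotypes[s1].keys() & genotypes[s2].keys()
        discordant = sum(genotypes[s1][site] != genotypes[s2][site] for site in shared_sites)
        result[(s1, s2)] = (len(shared_sites), discordant)
    return result


# Illustrative calls only; they are not taken from the dataset.
calls = {
    "patientA_tumour": {("chr1", 100): "0/1", ("chr1", 200): "1/1", ("chr2", 50): "0/1"},
    "patientA_blood": {("chr1", 100): "0/1", ("chr1", 200): "1/1"},
    "patientB_blood": {("chr1", 100): "1/1", ("chr2", 50): "0/0"},
}
stats = pairwise_discordance(calls)
```

Samples from the same patient are expected to share many sites with low discordance, so a sample that is discordant with every other sample from its own patient stands out.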

RNA-Seq data

Quality control was performed on the RNA libraries prior to sequencing using the Agilent Bioanalyzer. Thirteen of the 44 samples were not sequenced as their RIN levels were less than 7. Final quality control of the samples was performed using the LabChip GX. Full QC and alignment statistics are shown in Table 3.

Table 3 High throughput sequencing read statistics in patient tumours and associated peripheral blood mononuclear cells (PBMCs).

After sequencing, RNA reads were aligned to hg19 using TopHat v2.0 (ref. 10) and quantified using two distinct quantification methods: Cufflinks v0.9.3 (ref. 10) and FeatureCounts12. The numbers of normalized read counts (>2) that mapped to all transcripts are depicted in Fig. 4a and FPKM values (>0.1) are depicted in Fig. 4b.

Figure 4: RNA-Seq count distribution.

Distribution of RNA-Seq read counts per gene for both (a) DESeq normalized per-gene counts and (b) FPKM calculated by Cufflinks. Reads are distributed similarly across samples after filtering for unexpressed genes (counts <2 or FPKM <0.1). Expression of NF1 is indicated by ‘x’ in each sample.
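For readers unfamiliar with the two measures, the sketch below shows the textbook FPKM definition and the expression filters quoted above (counts >2, FPKM >0.1). It is a simplification: Cufflinks estimates FPKM with a statistical model of transcript abundance rather than this direct formula, and the raw counts here stand in for DESeq-normalized counts. Gene names, lengths and counts are hypothetical.

```python
MIN_COUNT = 2    # count threshold quoted in the text
MIN_FPKM = 0.1   # FPKM threshold quoted in the text


def fpkm(fragments: float, gene_length_bp: int, total_mapped_fragments: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (gene_length_bp * total_mapped_fragments)


def expressed(values, threshold):
    """Keep only genes whose value is strictly above the threshold."""
    return {gene: value for gene, value in values.items() if value > threshold}


# Illustrative values only; they are not taken from the dataset.
counts = {"GENE_A": 500, "GENE_B": 1, "GENE_C": 40}
lengths = {"GENE_A": 2000, "GENE_B": 1000, "GENE_C": 8000}
total_fragments = 25_000_000

fpkms = {gene: fpkm(counts[gene], lengths[gene], total_fragments) for gene in counts}
kept_by_count = expressed(counts, MIN_COUNT)
kept_by_fpkm = expressed(fpkms, MIN_FPKM)
```

In this toy example `GENE_B` is removed by both filters, while `GENE_A` and `GENE_C` are retained.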

Usage Notes

All data are stored at the Synapse web portal at http://www.synapse.org/cutaneousNF. Specific steps required to obtain access are described on the ‘Accessing the data’ wiki and involve:

  1. Obtaining a Synapse account at http://www.synapse.org/register

  2. Requesting access on the wiki and sending a brief email to CTF to describe how the data will be used

Inherent conditions for use of any data on Synapse are described on the Synapse governance site and apply to use of this dataset as well.

Scripts annotated to retrieve data from this repository can be found at http://github.com/Sage-Bionetworks/dermalNF.
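As a rough illustration of programmatic access once approval has been granted, the sketch below uses the Synapse Python client (`synapseclient`). It is not taken from the repository above. The Synapse IDs are those quoted in this article, but the short descriptions are paraphrases, the `fetch` helper is hypothetical, and whether each ID resolves to a file, folder or table is not stated here, so what `syn.get` returns may differ by entity.

```python
# Synapse IDs quoted in the text; descriptions paraphrase the surrounding sentences.
SYNAPSE_IDS = {
    "syn5556216": "sample information table",
    "syn5522788": "WGS VCF files from Dragen",
    "syn5004874": "SNP array export from GenomeStudio",
    "syn5492805": "Cufflinks quantification",
    "syn5493036": "FeatureCounts quantification",
}


def fetch(entity_id: str, download_dir: str = "."):
    """Retrieve one entity; needs the synapseclient package and approved access."""
    import synapseclient  # imported lazily so the mapping above works without it

    syn = synapseclient.Synapse()
    syn.login()  # assumes Synapse credentials are already configured locally
    return syn.get(entity_id, downloadLocation=download_dir)
```

Calling `fetch("syn5493036")`, for example, would attempt to retrieve the FeatureCounts entity into the current directory.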

Additional Information

How to cite this article: Gosline, S. J. C. et al. A high-throughput molecular data resource for cutaneous neurofibromas. Sci. Data 4:170045 doi: 10.1038/sdata.2017.45 (2017).
