Data Descriptor | Open

A high-throughput molecular data resource for cutaneous neurofibromas

  • Scientific Data 4, Article number: 170045 (2017)
  • doi:10.1038/sdata.2017.45
  • Download Citation
Published online:


Neurofibromatosis type 1 (NF1) is a genetic disorder with a range of clinical manifestations such as widespread growth of benign tumours called neurofibromas, pain, learning disorders, bone deformities, vascular abnormalities and even malignant tumours. With the establishment of the Children’s Tumour Foundation biobank, neurofibroma samples can now be collected directly from patients to be analysed by the larger scientific community. This work describes a pilot study to characterize one class of neurofibroma, cutaneous neurofibromas, by molecularly profiling of ~40 cutaneous neurofibromas collected from 11 individual patients. Data collected from each tumour includes (1) SNP Arrays, (2) Whole genome sequencing (WGS) and (3) RNA-Sequencing. These data are now freely available for further analysis at

Design Type(s)
  • individual genetic characteristics comparison design
  • clinical history design
Measurement Type(s)
  • whole genome sequencing
  • copy number variation profiling
  • transcription profiling assay
Technology Type(s)
  • DNA sequencing
  • genotyping by SNP array
  • RNA sequencing
Factor Type(s)
    Sample Characteristic(s)
    • Homo sapiens
    • neurofibroma cell
    • peripheral blood mononuclear cell

    Background & Summary

    Neurofibromatosis (NF) describes three forms of genetic conditions: NF1, NF2 and Schwannomatosis. People with NF may exhibit a wide range of clinical symptoms including neurocognitive deficits, brain tumours, bone abnormalities, and vascular disease. Effective treatments for people with NF are lacking due to the diversity of disease symptoms and underlying difficulty in treating any genetic disorder involving the loss of a tumour suppressor genes.

    NF1 is the most common of the NF syndromes with an incidence of approximately 1/2,500 worldwide1. While NF1 has been linked to loss of function in the NF1 gene—a known tumour suppressor—due to mutation or deletion, there is a high degree of phenotypic diversity in the patient population, making it difficult to predict disease progression and to treat it effectively2. NF1 patients are susceptible to growths of two types of neurofibromas: plexiform neurofibromas, that can cause pain in affected patients and can transform into deadly malignant peripheral nerve sheath tumours, and cutaneous neurofibromas.

    Cutaneous neurofibromas (cNFs) are benign lesions that occur during adolescence in NF1 patients and increase throughout their lives. Cutaneous neurofibromas can be painful and disfiguring although they generally do not affect the overall physical health of the affected individual. These lesions generally arise upon mutation or loss of the second NF1 (ref. 3) allele. Although the microenvironment has been known to play a role in tumour growth4,5, little is known about the molecular aetiology of these tumours themselves. To address this knowledge gap in the cutaneous neurofibroma field, we sought to improve overall knowledge of cutaneous neurofibromas through high-throughput molecular characterization of a diverse set of samples across patients.

    This resource represents a highly collaborative effort that spans multiple institutions (see Fig. 1). cNF samples and patient blood were collected in the clinic by Children’s Tumour Foundation (CTF), who preserved and annotated the samples and sent them to the CTF Biobank. DNA from tumour and blood was profiled by (1) whole genome sequencing (WGS) and (2) Single nucleotide polymorphism (SNP) array and tumour RNA was profiled by (3) RNA-Sequencing. Data was then annotated and compiled into a single resource built on the Synapse6 platform available at (Data Citation 1: Synapse

    Figure 1: From sample to dataset.
    Figure 1

    A description of the collaborative effort that enabled the Cutaneous NF data repository and subsequent analysis of the samples.


    Sample collection

    Eligible patients between the ages of 18 and 65 who were scheduled to receive elective surgery to remove cutaneous neurofibromas were invited to donate surgical discards as well as blood and urine samples. Patients consented to the release of their tissue samples and any resultant data through a protocol approved by Western IRB (Puyallup, WA). Patient-reported medical, family and NF1 history was also collected. Sample information is available in the table in the data repository (syn5556216; Data Citation 1: Synapse

    Each tumour was sub-divided with each part either preserved in formalin, flash frozen in liquid nitrogen, or placed in RNA later solution within 45 min of removal from the patient. If there was not enough tissue for all three methods of preservation, priority was given to formalin, followed by frozen. Samples noted by the patient as recently growing were annotated as such, and each sample was labelled by the location of the body from which it was retrieved. Skin covering the tumours was not removed prior to preservation but was removed prior to sequencing.

    30 cc of blood was also collected from each patient and stored at −80 °C. PBMCs were isolated and stored at −150 °C.

    Whole genome sequencing

    Genomic DNA (gDNA) was extracted from both frozen tumour and blood tissues using the DNeasy mini kit (Qiagen, Valencia, CA, USA) according to the manufacturer's protocol. The final elution was performed in 50 μl of RNase-free sterile distilled water. The concentration of the gDNA was estimated using the Qubit 2.0 Fluorometer (Invitrogen); samples with less than 500 ng were discarded. The integrity of the gDNA was assessed by running an aliquot of the gDNA on 0–8% agarose gel to confirm absence of RNA or protein bands as well as absence of a smear that would indicate degradation. One microgram gDNA was required for downstream whole genome library preparation applications. The samples were then sonicated on the Covaris LE220 (Covaris Inc, Woburn, MA, USA) to achieve an average target size of 400 bp. QC analysis of the post-sonicated material was done using Caliper LabChip GX (Perkin Elmer, Hopkinton, MA, USA).

    Standard whole genome library prep was done using the NEBNext DNA Library Prep Reagent Set for Illumina (New England BioLabs Inc., Ipswich, MA, USA) as per manufacturer's recommended protocol. Library quality was assessed using the Qubit 2.0 Fluorometer, and the library concentration was estimated by utilizing a DNA 1,000 Chip on an Agilent 2,100 Bioanalyzer. Accurate quantification for sequencing applications was determined using the qPCR-based KAPA Biosystems Library Quantification Kit (Kapa Biosystems, Inc., Woburn, MA, USA). Each sample was then sequenced on an individual lane on an Illumina HiSeqX sequencer (Illumina, Inc., San Diego, CA, USA).

    Reads were mapped to the genome and variants identified using the Dragen7 program with default settings. The resulting VCF files (syn5522788; Data Citation 1: Synapse were then sorted and updated to remove errors and analysed using the GATK pipeline8 (syn5522790; Data Citation 1: Synapse

    Somatic mutations were called from BAM files exported by the Dragen alignment using the Java version of VarDict9 in paired mode with an allele frequency threshold of 0.01. VarDict was able to recall germ line variants and also identify somatic variants that were present in the tumour samples but absent in the matched PBMC samples for each patient (syn6022465 and syn6022474; Data Citation 1: Synapse

    Copy number analysis

    Genomic DNA (gDNA) was extracted as described above, but required two hundred and fifty nanograms of gDNA for downstream SNP array applications.

    SNP array sample prep was done using the HumanOmni2.5–8 (Illumina, Inc., San Diego, CA, USA) as per manufacturer's recommended protocol as described:

    Samples were analysed by GenomeStudio and exported to text (syn5004874; Data Citation 1: Synapse

    RNA sequencing

    Total RNA containing both mRNA as well as microRNA fractions was extracted from the tissues using the miRNeasy mini kit (Qiagen, Valencia, CA, USA) according to the manufacturer's protocol. The final elution was performed in 30 μl of RNase-free sterile distilled water. The concentration and integrity of the extracted total RNA were estimated using the Qubit 2.0 Fluorometer (Invitrogen) and Agilent 2,100 Bioanalyzer (Applied Biosystems, Carlsbad, CA, USA), respectively. Five hundred nanograms of total RNA and a RIN of 7.0 or higher were required for downstream RNA-seq applications. Poly-adenylated RNAs were isolated using NEBNext Magnetic Oligo d(T)25 Beads. The NEBNext mRNA Library Prep Reagent Set for Illumina (New England BioLabs Inc., Ipswich, MA, USA) was then used to prepare individually bar-coded next-generation sequencing expression libraries as per manufacturer's recommended protocol. Library quality was assessed using the Qubit 2.0 Fluorometer, and the library concentration was estimated by utilizing a DNA 1,000 Chip on an Agilent 2,100 Bioanalyzer. Accurate quantification for sequencing applications was determined using the qPCR-based KAPA Biosystems Library Quantification Kit (Kapa Biosystems, Inc., Woburn, MA, USA). Each library was diluted to a final concentration of 12.5 nM and pooled in an equimolar ratio prior to clustering. Paired-end sequencing (25 million, 50-bp, paired-end reads) was performed using a 200 Cycle TruSeq SBS HS v4 Kit on an Illumina HiSeq2500 sequencer (Illumina, Inc., San Diego, CA, USA).

    Post-processing of the sequencing reads from RNA-seq experiments for each sample was performed using HudsonAlpha’s unique in-house RNA-seq data analysis pipeline (syn6035832; Data Citation 1: Synapse Briefly, quality control checks on raw sequence data for each sample were performed using FastQC (Babraham Bioinformatics, Cambridge, UK). Raw reads were mapped to the reference human genome hg19 using TopHat v2.0 (ref. 10) with the -p 4 and -r210 arguments. The alignment metrics of the mapped reads were estimated using SAMtools11 (syn6022474; Data Citation 1: Synapse

    Reads were quantified by both Cufflinks v0.9.3 (ref. 10) (syn5492805; Data Citation 1: Synapse and FeatureCounts12 (syn5493036; Data Citation 1: Synapse

    Code availability

    All data is currently stored in the synapse web portal, and is accessible using code in the Cutaneous NF Github repository at

    Data Records

    Data collected for each patient and sample have been annotated in the Synapse Table located within the online repository. Individual patients are described in Table 1, and the data available for each are described in Table 2.

    Table 1: Description of patients profiled in this study.
    Table 2: Description of tumours profiled in this study and the available data for each tumour.

    Technical Validation

    Validation for each dataset was performed individually.

    SNP array data

    Omni Array data were processed using Illumina Genome Studio and exported for further analysis using R to cluster regions that exhibited copy number alterations in either the germline (PBMC) or tissue samples.

    To detect outliers we first plotted the logR ratio values and B allele frequencies computed by Genome Studio (Figs 2a,b) to insure that they follow a similar distribution across all samples. We then used hierarchical clustering to identify any possible outliers in the data. Specifically we clustered the median value of each segment of copy number alteration computed by the DNAcopy R package13. The resulting clusters, depicted in Fig. 2c, show strong corresponding clustering by patient.

    Figure 2: Copy number quality control metrics.
    Figure 2

    Distribution of SNP array values across samples. Patient samples represented by different colours. Teal outline represents tumour samples while pink outline represents blood. (a) Represents B allele frequencies, (b) represents log R ratio values. (c) Clustering of segmented values, with rows below representing patient samples (colours) and tissue of origin (grey for tumour, black for blood).

    Whole genome sequencing

    In addition to basic library quality described above, we used the bcftools package14 to measure cross-sample discordance and identify potential outliers. Between each pair of samples we measured the number of shared variants called and the discordance measured between the pair of samples. The resulting values are depicted in Fig. 3. Similar to the copy number data patient, samples clustered with one another, with the exception of one sample, derived from Patient 10 PBMC, which was dropped from further analysis.

    Figure 3: Number of genomic variant sites shared between all pairs of samples.
    Figure 3

    Samples are labelled by patient and tissue type and indicate that all samples aside from the outlier (Patient 10 blood) cluster by patient.

    RNA-Seq data

    Quality control was performed on the RNA libraries prior to sequencing using the Agilent Bioanlyzer. Thirteen of the 44 samples were not sequenced as their RIN levels were less than 7. Final quality control of the samples was performed using the LabChip GX. Full QC and alignment statistics are shown in Table 3.

    Table 3: High throughput sequencing read statistics in patient tumours and associated peripheral blood mononuclear cells (PBMCs).

    After sequencing, RNA reads were aligned to Hg19 using TopHat v2.0 (ref. 10) and quantified using two distinct quantification methods: Cufflinks v0.9.3 (ref. 10) and FeatureCounts12. The numbers of normalized read counts (>2) that mapped to all transcripts are depicted in Figures and FPKM values (>0.1) are depicted in Fig. 4b.

    Figure 4: RNA-Seq count distribution.
    Figure 4

    Distribution of RNA-Seq read counts per gene for both (a) DESeq normalized per-gene counts and (b) FPKM calculated by Cufflinks. Reads are distributed similarly across samples after filtering for unexpressed genes (<2 counts or FPKM of 0.1). Expression of NF1 is indicated by ‘x’ in each sample.

    Usage Notes

    All data are stored at the synapse web portal at Specific steps required to obtain access are described on the ‘Accessing the data’ wiki and involve:

    1. Obtaining a Synapse account at

    2. Requesting access on the wiki and sending a brief email to CTF to describe how the data will be used

    Inherent conditions for use of any data on Synapse are described on the Synapse governance site and apply to use of this dataset as well.

    Scripts annotated to retrieve data from this repository can be found at

    Additional Information

    How to cite this article: Gosline, S. J. C. et al. A high-throughput molecular data resource for cutaneous neurofibromas. Sci. Data 4:170045 doi: 10.1038/sdata.2017.45 (2017).

    Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


    1. 1.

      Epidemiology of neurofibromatosis type 1. Am. J. Med. Genet. 89, 1–6 (1999).

    2. 2.

      , & Neurofibromatosis type 1. J. Am. Acad. Dermatol. 61, 1–14 (2009).

    3. 3.

      et al. Confirmation of a Double-Hit Model for the NF1Gene in Benign Neurofibromas. Am. J. Hum. Genet. 61, 512–519 (1997).

    4. 4.

      et al. The Development of Cutaneous Neurofibromas. Am. J. Pathol. 178, 500–505 (2011).

    5. 5.

      , , & Cell of origin and microenvironment contribution for NF1-associated dermal neurofibromas. Cell Stem Cell 4, 453–463 (2009).

    6. 6.

      et al. Enabling transparent and collaborative computational analysis of 12 tumour types within The Cancer Genome Atlas. Nat. Genet. 45, 1121–1126 (2013).

    7. 7.

      et al. A 26-hour system of highly sensitive whole genome sequencing for emergency management of genetic diseases. Genome Med 7, 100 (2015).

    8. 8.

      et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).

    9. 9.

      et al. VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Res. 44, e108 (2016).

    10. 10.

      et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7, 562–578 (2012).

    11. 11.

      et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

    12. 12.

      , & featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923–930 (2014).

    13. 13.

      , , & Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5, 557–572 (2004).

    14. 14.

      A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (2011).

    Download references

    Data Citations

    1. 1.

      Gosline, S. J. C. Synapse (2016)


    Funding for the generation of these data sets was provided by the Children’s Tumour Foundation.

    Author information


    1. Computational Oncology, Sage Bionetworks, Seattle, Washington 98109, USA

      • Sara J.C. Gosline
      • , Thomas Yu
      • , Xindi Guo
      •  & Justin Guinney
    2. Plastic and Reconstructive Surgery, Mount Sinai Medical Center, New York, New York 10028, USA

      • Hubert Weinberg
    3. Children’s Tumour Foundation, New York, New York 10005, USA

      • Pamela Knight
      • , Salvatore La Rosa
      •  & Annette Bakker
    4. Hudson-Alpha, Huntsville, Alabama 35806, USA

      • Nripesh Prasad
      • , Angela Jones
      • , Shristi Shrestha
      • , Braden Boone
      •  & Shawn E. Levy


    1. Search for Sara J.C. Gosline in:

    2. Search for Hubert Weinberg in:

    3. Search for Pamela Knight in:

    4. Search for Thomas Yu in:

    5. Search for Xindi Guo in:

    6. Search for Nripesh Prasad in:

    7. Search for Angela Jones in:

    8. Search for Shristi Shrestha in:

    9. Search for Braden Boone in:

    10. Search for Shawn E. Levy in:

    11. Search for Salvatore La Rosa in:

    12. Search for Justin Guinney in:

    13. Search for Annette Bakker in:


    S.G. collected the data, analysed it, organized the data repository and wrote the manuscript. H.W. provided access to tissue, P.K. collected the tissue, made annotations, patient consent, and shipments. T.Y., X.G., N.P., A.J., S.S., B.B., S.L., S.F. carried out high throughput sequencing and arrays. S.L.R., J.G. and A.B. supervised the project. A.B. initiated the project and performed the first sample collections together with P.K.

    Competing interests

    The authors declare no competing financial interests.

    Corresponding author

    Correspondence to Annette Bakker.

    Creative Commons BYThis work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit Metadata associated with this Data Descriptor is available at and is released under the CC0 waiver to maximize reuse.