Sotos syndrome (SS) represents an important human model system for the study of epigenetic regulation; it is an overgrowth/intellectual disability syndrome caused by mutations in a histone methyltransferase, NSD1. As layered epigenetic modifications are often interdependent, we propose that pathogenic NSD1 mutations have a genome-wide impact on the most stable epigenetic mark, DNA methylation (DNAm). By interrogating DNAm in SS patients, we identify a genome-wide, highly significant NSD1+/−-specific signature that differentiates pathogenic NSD1 mutations from controls, benign NSD1 variants and the clinically overlapping Weaver syndrome. Validation studies of independent cohorts of SS and controls assigned 100% of these samples correctly. This highly specific and sensitive NSD1+/− signature encompasses genes that function in cellular morphogenesis and neuronal differentiation, reflecting cardinal features of the SS phenotype. The identification of SS-specific genome-wide DNAm alterations will facilitate both the elucidation of the molecular pathophysiology of SS and the development of improved diagnostic testing.
Recent advances in next-generation sequencing technologies have led to the discovery of the molecular basis of many overgrowth syndromes. Constitutional mutations in two different genes involved in regulating histone modifications, NSD1 and EZH2, have been shown to cause clinically overlapping overgrowth disorders, Sotos syndrome (SS) and Weaver syndrome, respectively.
Sotos syndrome (OMIM 117550) is an autosomal dominant condition with an estimated prevalence of 1:14,000 live births (Rahman, unpublished data). It is characterized by pre- and postnatal overgrowth, advanced bone age, distinctive facial gestalt and a variety of neurodevelopmental problems including intellectual disability (ref. 1). Mutations in NSD1 (nuclear receptor SET (su(var)3–9, enhancer-of-zeste, trithorax) domain containing protein-1) are found in 80–90% of patients with SS (refs 2, 3, 4). As mutations in other genes have not been reported in SS patients, the remaining 10–20% likely harbour undetected NSD1 mutations or represent clinical misclassification.
NSD1 is one of many genes that have been recently recognized to developmentally regulate the epigenome. NSD1 encodes a histone H3 lysine 36 methyltransferase, important for multiple aspects of normal embryonic development (ref. 5). NSD1 binds near various promoter elements to regulate transcription via interactions with H3K36 methylation and RNA polymerase II (ref. 6). In general, it has been proposed that during development, histone modifications are fairly transient regulatory marks that are replaced in the longer term by the more stable epigenetic mark DNAm (ref. 7). As a proof of principle of the interaction between loss of function in a histone modifier and DNAm status, we have previously demonstrated that mutations in a lysine-specific demethylase, KDM5C, generate specific genome-wide DNAm alterations (ref. 8). Here we analyse the DNA methylome in SS patients and define a specific genome-wide pattern of DNAm alterations associated with NSD1 loss-of-function mutations. We demonstrate how this profile may be used to derive a molecular NSD1+/−-specific signature with high sensitivity and specificity and validate this signature in independent cohorts of SS and controls. We show that the signature is able to distinguish pathogenic NSD1 mutations not only from control subjects, but also from benign sequence variants within NSD1 and from cases with the clinically overlapping Weaver syndrome caused by mutations in the histone methyltransferase EZH2 (Enhancer of Zeste, Drosophila, Homolog 2). Finally, we determine the functional significance of the observed genome-wide DNAm signature and demonstrate its potential utility in the diagnostic testing of NSD1 mutations of unknown significance.
Identification of the NSD1+/−-specific DNAm signature
To determine whether NSD1 loss-of-function mutations impact stable epigenetic marks such as DNAm at downstream genomic targets, we compared DNAm in peripheral blood from SS patients with known pathogenic NSD1 mutations (NSD1+/−; n=19) to controls (n=53) (Supplementary Data 1 and 2). The demographics for the discovery cohort were as follows: For SS, there were 11 males and eight females and the mean age±s.d. at sample collection was 10.1±9.6 years (range 0.6–40 years). The 53 control subjects included 24 males and 29 females; their mean age at the time of sample collection was 10.8±4.8 years (range 1–21 years). There was no statistically significant difference between SS and controls with regard to the age at which blood was sampled (Wilcoxon rank-sum test, P value=0.216) or sex (Chi-square (χ2) test, P value=0.345).
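The sex comparison above can be rechecked from the stated counts alone. The following is a minimal sketch, assuming an uncorrected Pearson chi-square test on the 2×2 table of sex by group (the text does not state whether a continuity correction was applied); the function name `chi_square_2x2` is illustrative and is not taken from the study's scripts.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Rows: SS, controls; columns: males, females (counts from the text).
sex_table = [[11, 8], [24, 29]]
stat, p_value = chi_square_2x2(sex_table)
print(f"chi2 = {stat:.3f}, P = {p_value:.3f}")
```

Under these assumptions the P value comes out close to the reported 0.345, which suggests the published test was run without a continuity correction.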
Genomic DNA was treated with sodium bisulfite and DNAm was assessed using the Illumina Infinium HumanMethylation450 BeadChip. After filtering out probes overlapping single-nucleotide polymorphisms (SNPs) and nonspecific probes, as previously described (ref. 9), we quantified DNAm at 424,586 CpG sites using beta values, which represent DNAm levels as a proportion (between 0 and 1). The significance of differential DNAm between SS and control samples was assessed at each CpG using a non-parametric Mann–Whitney U-test with a stringent Bonferroni correction for multiple testing.
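The per-CpG testing step can be sketched as follows. This is an illustration only, not the authors' R implementation: it assumes a two-sided Mann–Whitney U-test computed with the normal approximation and tie correction, and a Bonferroni threshold of 0.05 divided by the 424,586 tested sites. The site names and beta values are hypothetical.

```python
import math

def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney U-test P value (normal approximation, tie-corrected)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0          # average of ranks i+1 .. j
        t = j - i                            # size of this tie group
        tie_term += t ** 3 - t
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2.0))

def bonferroni_significant(p_values, alpha, n_tests):
    """Keep sites whose P value is below alpha / n_tests."""
    threshold = alpha / n_tests
    return {site: p for site, p in p_values.items() if p < threshold}

# Hypothetical beta values: 19 cases versus 53 controls at two CpG sites.
cases = [0.30 + 0.005 * i for i in range(19)]
controls = [0.70 + 0.002 * i for i in range(53)]
overlap_cases = [0.50 + 0.01 * (i % 7) for i in range(19)]
overlap_controls = [0.50 + 0.01 * (i % 7) for i in range(53)]

p_values = {
    "cg_separated": mann_whitney_p(cases, controls),
    "cg_overlapping": mann_whitney_p(overlap_cases, overlap_controls),
}
hits = bonferroni_significant(p_values, alpha=0.05, n_tests=424586)
print(sorted(hits))
```

With groups of this size, a site at which cases and controls are completely separated clears even the genome-wide Bonferroni threshold, whereas a site with overlapping distributions does not.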
We identified genome-wide changes in DNAm in SS compared with controls. Over 28,000 CpG sites survived stringent Bonferroni correction at q<0.05 (Supplementary Fig. 1), demonstrating a previously unrecognized effect of NSD1+/− on genome-wide DNAm. Next, we applied an additional filter to this highly statistically significant set of CpGs by selecting the subset of probes with >20% difference in average DNAm levels between SS and controls (Supplementary Fig. 1). We identified 7,085 CpG sites distributed across the genome that we refer to as the NSD1+/−-specific signature; 7,038 CpG sites (99.3%) demonstrated loss of DNAm (Supplementary Data 3), whereas only 47 CpG sites (0.7%) showed a gain of DNAm in SS compared with controls. Using unsupervised hierarchical clustering of the DNAm data for the NSD1+/−-specific signature, all NSD1+/− samples clustered as a distinct group separate from controls (Fig. 1). When tested against several potential confounding factors such as sex, age, batch (Supplementary Data 3) and cell-type composition (Supplementary Fig. 2, see Supplementary Methods for details), the NSD1+/−-specific signature retained its ability to discriminate SS from control samples.
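The effect-size filter and the loss/gain split can be illustrated with a short sketch. It assumes that the ">20% difference" refers to an absolute difference of more than 0.20 in mean beta value between groups; the site names and beta values are hypothetical, and only the counts 7,038 and 47 come from the text.

```python
from statistics import mean

def signature_sites(betas_cases, betas_controls, min_delta=0.20):
    """Map each site passing the effect-size filter to 'loss' or 'gain' of DNAm."""
    kept = {}
    for site in betas_cases:
        delta = mean(betas_cases[site]) - mean(betas_controls[site])
        if abs(delta) > min_delta:
            kept[site] = "loss" if delta < 0 else "gain"
    return kept

# Hypothetical beta values for three CpG sites.
betas_cases = {
    "cg_a": [0.30, 0.35, 0.32],
    "cg_b": [0.60, 0.62, 0.58],
    "cg_c": [0.85, 0.80, 0.90],
}
betas_controls = {
    "cg_a": [0.70, 0.72, 0.68],
    "cg_b": [0.70, 0.71, 0.69],
    "cg_c": [0.50, 0.55, 0.52],
}
kept = signature_sites(betas_cases, betas_controls)
print(kept)

# Proportions implied by the counts reported in the text.
n_loss, n_gain = 7038, 47
total = n_loss + n_gain
pct_loss = round(100 * n_loss / total, 1)
pct_gain = round(100 * n_gain / total, 1)
print(total, pct_loss, pct_gain)
```

The second part confirms that the reported counts are internally consistent: 7,038 and 47 sum to 7,085, and correspond to 99.3% and 0.7% of the signature.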
Validation of the NSD1+/−-specific signature
Using the DNAm beta values at each CpG within the NSD1+/−-specific signature, we developed a predictive model that classifies new subjects on the basis of their DNAm profile as either ‘SS’ or ‘not SS’, using the SS score described in the Methods. We evaluated the performance of the NSD1+/−-specific signature using an independent set of normal blood-derived DNAm data (n=1,056 subjects) extracted from the Gene Expression Omnibus (GEO) database (www.ncbi.nlm.nih.gov/geo/) (Supplementary Data 4). Each of the 1,056 GEO samples received a negative SS score and was classified as ‘not SS’ (Fig. 2), demonstrating 100% specificity of the classification model. These results highlight the robustness of the NSD1+/−-specific signature as it overcame many sources of variation (sex, age, batch, DNA isolation methods, cell-type composition) introduced by processing different cohorts in different laboratories around the world.
We then tested the sensitivity of the SS score to predict the pathogenicity of NSD1 mutations using a replication cohort of SS cases with known pathogenic NSD1 mutations (n=19) from Hong Kong (Supplementary Data 5). Each of these patients received a positive SS score (Fig. 2) demonstrating a sensitivity of 100%.
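The sensitivity and specificity figures follow directly from the sign of the SS score in each cohort. A minimal sketch, using hypothetical score values with the cohort sizes given in the text (19 replication cases, 1,056 GEO reference samples), and assuming that a score of exactly zero would be called 'not SS' (the text does not address this case):

```python
def sensitivity_specificity(scores, is_case):
    """Classify by the sign of the SS score and compare with known status."""
    tp = sum(1 for s, c in zip(scores, is_case) if c and s > 0)
    fn = sum(1 for s, c in zip(scores, is_case) if c and s <= 0)
    tn = sum(1 for s, c in zip(scores, is_case) if not c and s <= 0)
    fp = sum(1 for s, c in zip(scores, is_case) if not c and s > 0)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical scores: all cases positive, all reference samples negative.
case_scores = [0.10 + 0.01 * i for i in range(19)]
control_scores = [-0.20 + 0.0001 * i for i in range(1056)]

scores = case_scores + control_scores
is_case = [True] * len(case_scores) + [False] * len(control_scores)
sensitivity, specificity = sensitivity_specificity(scores, is_case)
print(sensitivity, specificity)
```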
As DNAm can be tissue and cell-type specific, we tested fibroblast-derived DNA from three SS patients with truncating mutations in NSD1 in comparison to four control fibroblast samples. Using hierarchical clustering and the NSD1+/−-specific signature derived from blood to assess the fibroblast DNAm data, the three SS fibroblast samples could be clearly distinguished from controls (Fig. 3). Despite the small size of the available fibroblast data set, these results demonstrate that the NSD1+/−-specific signature is robust even in the context of cell-type-specific DNAm profiles.
To further assess the specificity of the NSD1+/−-specific signature, we used it to analyse the DNAm profiles of eight patients with a clinical diagnosis of Weaver syndrome (OMIM 277590) and confirmed mutations in EZH2 (refs 10, 11; Supplementary Data 6). All Weaver syndrome patients with EZH2+/− mutations received a strongly negative SS score (between −0.151 and −0.105) and were therefore classified confidently as ‘not SS’ (Fig. 2). The fact that the NSD1+/−-specific signature allows the molecular distinction of two clinically overlapping overgrowth syndromes provides further evidence for the robust specificity of the NSD1+/−-specific signature.
Mutation variants of unknown significance
The interpretation of non-synonymous variants (variants of unknown significance or VOUS) represents a significant challenge in the clinical setting. Several different pathogenicity prediction algorithms have been developed to assist in the interpretation of VOUS; however, these tools often provide incongruent results (ref. 12).
To investigate the utility of the NSD1+/−-specific signature to functionally classify NSD1 VOUS, DNA samples from 16 individuals with missense mutations in NSD1 (6 from the discovery cohort and 10 from the validation cohort) were tested in a blinded fashion. As shown in Fig. 2, the NSD1+/−-specific signature allowed clear classification of VOUS as pathogenic or benign; 9/16 samples received positive SS scores clustering with the NSD1+/− pathogenic variants; these mutations were classified as pathogenic. The remaining seven samples received negative SS scores, clustering with the control group; these mutations were classified as benign variants.
To further evaluate the efficacy of the DNAm signature as a tool to classify NSD1 VOUS as benign or pathogenic, we compared the congruence of results from DNAm SS scores with expert review of the clinical phenotype. Two highly experienced clinical geneticists (RW and DC), who were blinded to the molecular data, reviewed clinical information and photos. Patients were classified into one of three phenotypic categories: (1) typical SS, (2) possible SS and (3) unlikely SS. For the discovery cohort, there was 100% concordance (6/6) for the clinical classification between reviewers. Four patients were categorized as typical SS, all of whom had the DNAm signature. Two patients were categorized as unlikely SS, neither of whom had the signature (Fig. 4, Table 1 and Supplementary Data 7). For the validation cohort, only 5 out of 10 patients had photographs and medical records available that met our criteria for review; the clinicians’ assessments were again 100% concordant. Two of the patients were categorized by both clinicians as unlikely SS; these patients did not have the methylation signature. Two of the patients were categorized by both clinicians as typical SS; these patients did have the methylation signature. One patient with the methylation signature was categorized by both clinicians as possible SS.
We also compared the interpretation of NSD1 VOUS (16 missense variants) using the DNAm signature versus five independent prediction algorithms, namely PolyPhen-2 (ref. 13; http://genetics.bwh.harvard.edu/pph2/), Mutation Assessor (ref. 14; http://mutationassessor.org), SIFT (ref. 15; http://sift-dna.org), Mutation Taster (ref. 16; http://www.mutationtaster.org) and PMut (ref. 17; http://mmb.irbbarcelona.org/PMut/). As shown in Table 1, the presence of the NSD1+/−-specific signature and positive SS scores were in agreement with the pathogenicity of the variants predicted by four out of five prediction algorithms. In contrast, negative SS scores were only in agreement with PolyPhen-2 and Mutation Assessor. Taken together, these data demonstrate that prediction of pathogenicity using the NSD1+/−-specific signature is more congruent with diagnostic classification by expert clinicians in comparison with the majority of the currently available prediction algorithms.
We propose that our specific genome-wide DNAm signature for pathogenic NSD1 mutations can be utilized as a novel epigenomic diagnostic tool that will facilitate the classification of VOUS in NSD1 as benign or pathogenic variants.
Functional significance of the SS classification signature
Finally, we investigated the DNAm classification signature of NSD1+/− for its potential to elucidate the molecular pathophysiology of SS. Analysis of the genomic locations of the CpG sites in the NSD1-specific signature showed that CpGs were over-represented in enhancers, DNase hypersensitive sites, reprogrammed differentially methylated regions (RDMR) and CpG island shores (defined as 0 to 2 kb upstream of a CpG island), and were under-represented in regions overlapping promoters and CpG islands (Fig. 5 and Supplementary Data 10). These findings correlate with one previous observation that NSD1 associates primarily within a region ∼1,200 bp upstream of the start site of gene targets such as the bone-morphogenetic protein BMP4 (ref. 6). However, our data suggest that NSD1 may have multiple functions depending on the genomic locus and the temporal/spatial location in development, specifically that NSD1 binds regulatory regions that are required for early embryonic development, as also reflected by the embryonic lethality at E10.5 of the Nsd1 null mouse (ref. 18).
To identify the biological processes and molecular functions most enriched within our data set, we analysed the 2,167 unique genes that overlapped the NSD1+/−-specific signature using DAVID (ref. 19; http://david.abcc.ncifcrf.gov). The results demonstrate enrichment for genes with roles in cellular morphogenesis and differentiation, as well as neuronal differentiation/axonogenesis and cell adhesion/cell signalling (Benjamini–Hochberg corrected q<0.05) (Fig. 6 and Supplementary Data 8). This enrichment in neural and cellular development pathways reflects the cardinal features of SS (that is, overgrowth and developmental delay) and validates the utility of the DNAm signature to elucidate the functional, biological and molecular impact of NSD1 pathogenic variants.
We also used GREAT (ref. 20; http://great.stanford.edu) to directly analyse the 7,085 CpGs within the classification signature, comparing them to the initial 424,586 CpGs in our data set in the context of broader functionally annotated regions (up to 1 Mbp extension from the nearest genes). Functional categories related to embryonic development, including neurodevelopment, were again predominant among the detected enrichment patterns (Supplementary Data 9).
We have identified a novel, robust DNAm signature specifically associated with pathogenic NSD1 gene mutations, which has the potential to be utilized as a functional molecular test to assess VOUS in NSD1. This study will set the stage for a new field of epigenetic diagnostic testing where scientists and clinicians will harness the power of the methylome to unravel the pathogenicity of VOUS. We expect that this approach will be particularly valuable with regard to genetic testing for a variety of overgrowth disorders, especially as mutations in additional epigenetic regulators, histone-lysine N-methyltransferase (SETD2) (ref. 21) and DNA methyltransferase 3 alpha (DNMT3A) (ref. 22), have recently been reported to cause distinct overgrowth syndromes.
Our data strongly suggest that there is important crosstalk between histone modifications and DNAm. Although there are currently limited data regarding the mechanisms by which this occurs, one study of NSD1 loss of function supports the concept of crosstalk between DNAm, histone modifications and gene expression. Specifically, Berdasco et al. (ref. 23) report changes in histone modifications and transcription at one genomic locus in tumours following epigenetic silencing of NSD1, as well as in one lymphoblast cell line from an SS individual. This study did not, however, examine genome-wide DNAm. To pursue our interest in the mechanisms by which germline NSD1 loss-of-function mutations impact the DNA methylome, we are currently establishing a developmental model system to study NSD1 in induced pluripotent stem (iPS) cells. We anticipate that these studies will further elucidate the mechanisms by which interactions between histone modifications and DNAm drive gene expression.
Finally, the genome-wide DNAm alterations identified by the NSD1+/−-specific signature represent novel, powerful and highly specific targets that can be used to elucidate the molecular pathophysiology of SS and to guide the development of future targeted therapies. This approach could be used in a broader context to study the downstream functional effects of loss-of-function mutations of regulators of the epigenome in human disease model systems.
DNA methylation analyses
We profiled a total of 114 samples from 114 unique individuals (57 with NSD1 mutations or single-nucleotide variants and 57 controls). Informed consent was obtained from parents of all the participants and assent was obtained from participants, as appropriate for age. The study was approved by the Research Ethics Board at the Hospital for Sick Children. Most DNA samples were derived from blood except for seven samples (three with NSD1 mutations and four controls) for which DNA was derived from skin fibroblasts. DNA samples were sodium bisulfite converted using the Qiagen EZ DNA Methylation kit (Qiagen, Valencia, CA), according to the manufacturer’s protocol. Modified genomic DNA was then processed and analysed on the Infinium HumanMethylation450 BeadChip from Illumina (Illumina 450K) according to the manufacturer’s protocol. The distribution of the samples on the arrays was randomized for both cases and controls but not for age and gender. Chi-square and Wilcoxon rank-sum tests were used to compare gender and age distributions, respectively, between SS and control discovery cohorts.
Normalization and quality controls
We used the GenomeStudio software from Illumina to process the raw intensity data (IDAT files) for all 114 samples. The control normalization and background subtraction included in GenomeStudio were used to generate DNAm profiles, or beta values, for each sample at every CpG site from the ∼485,000 CpG sites. All 114 samples passed the quality control measures and had over 485,000 CpG sites detected at a detection P value <0.01.
Probe cross reactivity and polymorphic sites
We excluded probes containing SNPs, that is, whenever the CpG sites were polymorphic at the cytosine or the guanine base. Infinium I probes were excluded if there was a SNP at the position where single-base extension occurs. Also excluded were CpG sites for which SNPs were located within 10 bases of the query site where single-base extension occurs. Sites with SNPs were also excluded if the allele frequency was at least 1% (19,418 sites (4.0%)) or at least 5% (10,825 sites (2.2%)); for more details, see Chen et al. (ref. 9). In addition, we excluded CpG sites if their probe sequences aligned to multiple positions with ≥90% identity (see Chen et al. (ref. 9) for additional details). After removing probes with missing values or detection P values >0.01 and nonspecific probes, the final data set contained 424,586 probes.
Differential DNA methylation analysis
To identify the differentially methylated CpG sites, we compared the DNAm distributions for Sotos cases versus controls at each CpG site. To account for the influence of the family relationships among three of the SS patients, we formed three separate testing trials, each time combining the 16 non-familial Sotos cases with only one family member. The resulting set of 17 SS patients was compared with 53 controls for each of the 424,586 available CpG sites, using a non-parametric Mann–Whitney U-test (implemented in R, scripts available upon request). A stringent Bonferroni correction for multiple testing was applied to the results in each trial. To ensure robust results, we retained only the CpG sites that were significant at the significance level α=0.05 in all three trials, that is, with any choice of the family representative among the Sotos patients. In total, 28,458 CpG sites satisfied this criterion. Finally, we applied an additional effect-size criterion requiring at least 20% difference in average DNAm between the Sotos and the control groups in each of the three trials. This filter was applied to ensure inclusion of the CpGs with the most biologically meaningful differences. It reduced the NSD1 signature to 7,085 CpG sites, which were then selected for further characterization including the development of an independent classification model for cohorts of controls and SS cases, as well as for specificity and sensitivity testing.
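The three-trial design can be sketched as follows. This is a simplified illustration under stated assumptions: only the effect-size criterion (an absolute mean beta difference of at least 0.20) is applied in each trial, standing in for the combined Bonferroni-corrected U-test and effect-size filter used in the study. Sample identifiers, site names and beta values are hypothetical.

```python
from statistics import mean

def build_trials(non_familial, family_members):
    """One case set per family member: all non-familial cases plus that member."""
    return [non_familial + [member] for member in family_members]

def passes_in_all_trials(site_betas, control_betas, trials, min_delta=0.20):
    """True if the case-control mean difference reaches min_delta in every trial."""
    control_mean = mean(control_betas)
    for cases in trials:
        case_mean = mean(site_betas[sample] for sample in cases)
        if abs(case_mean - control_mean) < min_delta:
            return False
    return True

# 16 unrelated cases plus three related cases, as described in the text.
non_familial = [f"SS{i:02d}" for i in range(1, 17)]
family = ["FAM_A", "FAM_B", "FAM_C"]
trials = build_trials(non_familial, family)

controls = [0.70] * 53  # hypothetical control beta values at each site
betas = {
    # Consistently hypomethylated, whichever family member is included.
    "cg_a": {**{s: 0.30 for s in non_familial},
             "FAM_A": 0.28, "FAM_B": 0.31, "FAM_C": 0.33},
    # Passes only when a strongly hypomethylated family member is chosen.
    "cg_b": {**{s: 0.51 for s in non_familial},
             "FAM_A": 0.10, "FAM_B": 0.12, "FAM_C": 0.60},
}

robust = [site for site in betas
          if passes_in_all_trials(betas[site], controls, trials)]
print(robust)
```

Requiring a site to pass in every trial means that the retained signature does not depend on which relative happens to represent the family.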
Sotos syndrome score and classification model
We developed a simple classification model on the basis of the NSD1+/−-specific signature in blood. At each of the 7,085 NSD1+/−-specific signature CpGs, a median DNAm level was computed across all 19 SS patients in the original discovery cohort. This resulted in a reference profile for the NSD1+/− Sotos DNAm levels over the NSD1+/−-specific signature CpGs, which was robust to outliers. Similarly, a robust median-DNAm reference profile for the 53 healthy control subjects was created. The classification of each new DNAm sample was based on extracting a vector Bsig of its DNAm values in the NSD1+/−-specific signature CpGs, and comparing Bsig to the two reference profiles computed above. A Sotos Syndrome Score was defined as:
SS score = r(Bsig, SS reference profile) − r(Bsig, control reference profile)
where r is the Pearson correlation coefficient. A simple classification model was developed based on scoring each new DNAm sample using the SS score: a blood sample with a positive SS score is more similar to the SS reference profile based on the NSD1+/−-specific signature CpGs, and is therefore classified as ‘SS’; whereas a sample with a negative SS score is more similar to the normal-blood reference profile, and is classified as ‘not SS’. The classification is implemented in R (scripts available upon request).
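The scoring scheme can be illustrated with a minimal sketch. It is not the authors' R implementation. It assumes that the SS score is the Pearson correlation of a sample with the SS reference profile minus its correlation with the control reference profile, which is consistent with the sign convention described above. The four-CpG profiles are hypothetical stand-ins for the 7,085-CpG signature.

```python
import math
from statistics import median

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def reference_profile(samples):
    """Per-CpG median across samples (each sample is a list of beta values)."""
    return [median(values) for values in zip(*samples)]

def ss_score(b_sig, ss_ref, control_ref):
    """Assumed form: r(Bsig, SS reference) - r(Bsig, control reference)."""
    return pearson_r(b_sig, ss_ref) - pearson_r(b_sig, control_ref)

def classify(b_sig, ss_ref, control_ref):
    return "SS" if ss_score(b_sig, ss_ref, control_ref) > 0 else "not SS"

# Hypothetical discovery data over four signature CpGs.
ss_samples = [[0.30, 0.35, 0.80, 0.25],
              [0.32, 0.30, 0.78, 0.28],
              [0.28, 0.33, 0.82, 0.22]]
control_samples = [[0.70, 0.65, 0.50, 0.60],
                   [0.72, 0.68, 0.52, 0.62],
                   [0.68, 0.66, 0.48, 0.58]]
ss_ref = reference_profile(ss_samples)
control_ref = reference_profile(control_samples)

new_ss_like = [0.31, 0.34, 0.79, 0.27]
new_control_like = [0.69, 0.67, 0.51, 0.61]
label_a = classify(new_ss_like, ss_ref, control_ref)
label_b = classify(new_control_like, ss_ref, control_ref)
print(label_a, label_b)
```

Using medians for the reference profiles keeps a single atypical sample from shifting the reference, and the correlation-based score depends on the pattern across CpGs rather than on absolute methylation levels.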
Accession codes: The DNA methylation data were deposited in GEO database under the accession number GSE74432.
How to cite this article: Choufani, S. et al. NSD1 mutations generate a genome-wide DNA methylation signature. Nat. Commun. 6:10207 doi: 10.1038/ncomms10207 (2015).
We thank the families for their participation in our research, and the physicians and genetic counsellors who recruited them. This research was supported by the Rare Disease Foundation grant to S.C. and the Canadian Institutes of Health Research (CIHR, MOP-126054).
Sotos patients with NSD1 loss-of-function mutations
Sample Information-Control Discovery Cohort
List of NSD1+/- significant CpG sites.
List of samples extracted from GEO for specificity analysis
Sotos patients with NSD1 (NM_022455.4,NCBI hg18) loss of function mutations-Replication study
Weaver patients with EZH2 (NM_004456.4,NCBI hg19) mutations/deletions
Overgrowth patients with NSD1 (NM_022455.4,NCBI hg18) missense variants
DAVID enrichment in GO Biological Process - FAT ontology for classification signature genes.
GREAT enrichment in GO Biological Process ontology for classification signature CpG set (hypergeometric test).
Distribution of the NSD1 specific DNA methylation signature based on annotated genomic features
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/