Angelman syndrome genotypes manifest varying degrees of clinical severity and developmental impairment

Angelman Syndrome (AS) is a severe neurodevelopmental disorder due to impaired expression of UBE3A in neurons. There are several genetic mechanisms that impair UBE3A expression, but they differ in how neighboring genes on chromosome 15 at 15q11–q13 are affected. There is evidence that different genetic subtypes present with different clinical severity, but a systematic quantitative investigation is lacking. Here we analyze natural history data on a large sample of individuals with AS (n = 250, 848 assessments), including clinical scales that quantify development of motor, cognitive, and language skills (Bayley Scales of Infant Development, Third Edition; Preschool Language Scale, Fourth Edition), adaptive behavior (Vineland Adaptive Behavioral Scales, Second Edition), and AS-specific symptoms (AS Clinical Severity Scale). We found that clinical severity, as captured by these scales, differs between genetic subtypes: individuals with UBE3A pathogenic variants and imprinting defects (IPD) are less affected than individuals with uniparental paternal disomy (UPD); of those with UBE3A pathogenic variants, individuals with truncating mutations are more impaired than those with missense mutations. Individuals with a deletion that encompasses UBE3A and other genes are most impaired, but in contrast to previous work, we found little evidence for an influence of deletion length (class I vs. II) on severity of manifestations. The results of this systematic analysis highlight the relevance of genomic regions beyond UBE3A as contributing factors in the AS phenotype, and provide important information for the development of new therapies for AS. More generally, this work exemplifies how increasing genetic irregularities are reflected in clinical severity.


Participants
Initially, 1007 datasets from 304 participants enrolled in the study were available for analysis (we refer to the data from one participant collected at one visit as a single dataset). Per study protocol, participants were seen approximately annually over eight years. The mean number of visits per participant was 2.9 (±1.9); deviation from the expected 9 visits per participant was due to later enrolment in the study, missing visits and dropouts (Supplementary Fig. 1A).
Datasets from participants who could not be assigned to one of six genetic subgroups (MutM, MutT, IPD, UPD, Del1, Del2) were excluded from the analysis (72 datasets from 36 participants). This included those with deletions of unspecified size, and those with incomplete testing that did not permit genotype assignment (e.g. abnormal DNA methylation, negative for deletion, but no further studies). Three patients with synonymous UBE3A mutations (i.e., mutations coding for the same amino acid) were excluded. Furthermore, we included only datasets with a participant age between 1 and 18 years, because there were few datasets available for analysis outside of this age range (46 datasets from 18 participants). We excluded individuals with atypical deletions since only eight datasets were available (five longer than Del1, three shorter than Del2). Final analyses were based on 848 datasets from 250 participants (127 females). Mean age at clinic visits was 82.4 ± 45.3 months (median 73.9, inter-quartile range 47.4–111.6, Supplementary Fig. 1B).
A molecular diagnosis was established for each participant using standard diagnostic testing for AS, involving fluorescence in situ hybridization (FISH), methylation assay analyses, chromosomal microarray, microsatellite marker analysis, or gene sequencing. Supplementary Table 2 provides an overview of the resulting diagnostic groupings of datasets entering the final analyses. 671 datasets were complete, i.e., data from all 19 scales were available. A table detailing the number of available assessments per scale can be found in the Supplementary Material (Supplementary Table 3). Since statistical modeling was carried out separately for each scale, we did not exclude any participants based on missing values.

Clinical Scales
We analyzed data from the Bayley Scales of Infant Development, Third Edition (BSID-III), the Vineland Adaptive Behavior Scales, Second Edition (VABS-2), the Preschool Language Scale, Fourth Edition (PLS-4) (all distributed by Pearson Education Inc., London, www.pearsonclinical.com), and the Clinical Severity Scale (CSS), a scoring tool developed specifically for the Angelman Syndrome Natural History Study (ASNHS). All assessments were carried out by trained personnel (physicians and licensed psychologists). The study protocol and tests performed were identical across all sites.
CSS is a severity scale created by the principal investigators at the main sites enrolling patients into the ASNHS and comprises a set of questions about symptoms typical for AS. The CSS has not been published previously and is reported here for the first time. It encompasses 11 assessments of severity: age of onset of epilepsy, current seizure frequency, number of seizure medications currently used, current somatic growth (weight), current head growth, age of independent sitting, age of independent ambulation, presence/severity of scoliosis, verbal communication capabilities, nonverbal communication capabilities, and mean developmental age. Each assessment is scored on a 4- to 6-point Likert scale. We analyzed the CSS total score (sum of scores from all items). A detailed description of the CSS can be found in the Supplementary Material (Supplementary Table 4).
BSID-III (1) is an interactive, play-based developmental assessment encompassing five subscales for fine motor, gross motor, receptive communication, expressive communication, and cognitive development. The BSID-III is normed for typically developing children up to the age of 42 months but is frequently employed to assess individuals with intellectual disability across a broader age span. It has been used to assess individuals with AS of all ages (2). For some patients, when the psychologist expected or observed ceiling effects in the BSID-III, the BSID-III was replaced by other scales (Supplementary Table 8).
VABS-2 (3,4) assesses adaptive behavior and is validated for a wide age range from birth to adulthood. It encompasses eleven subscales for expressive communication, receptive communication, written communication, fine and gross motor development, daily personal living, daily domestic, daily community living, social-interpersonal, social-play-leisure, and social coping abilities. The parent interview form was employed.
PLS-4 (5) is an interactive assessment of language skills in children below 7 years of age, encompassing two subscales for receptive and expressive communication. The PLS-4 and the early items of the expressive and receptive communication domains of the BSID-III share a high degree of overlap.
To facilitate interpretation, we assigned all subscales to functional domains. The assignment can be found in Supplementary Table 3.

Data analysis
All analyses were carried out within a linear mixed-effects model (LMM) framework using R (www.r-project.org) and the nlme library (6). An LMM, unlike a fixed-effects regression model, accounts for correlations between repeated measurements from the same individual, which makes it a powerful and flexible framework for the analysis of longitudinal data (7,8).
Since the scales analyzed were designed to capture child development, we expected all scores to be age-dependent. The scoring manuals provide different data transformations that account for this dependence, such as standard scores or developmental quotient tables. However, these tables are derived from typically developing (TD) individuals. Since our main interest was not a comparison between TD individuals and individuals with AS, but rather a description of the AS population including a comparison of different AS subgroups, we did not use TD-based age normalizations. TD-based age normalizations would have interfered with our analyses by introducing floor effects and would not have accounted for age-dependence in the AS population, given the fundamentally different developmental trajectories of individuals with AS and TD individuals (2,9). Therefore, we analyzed raw scores and accounted for age effects within our models, as described below. For the BSID-III scales, we additionally computed growth scale scores according to the manual (see Supplementary Material).
We fit an LMM to the raw scores of each subscale and the CSS sum score. We modeled random intercepts per participant (to account for repeated measurements) and per study site (random intercept for each of the six centers of the study, to capture possible experimenter-induced covariance between participants seen at the same site). As fixed effects, we specified a third-order mean-centered orthogonal polylogarithmic function of age (i.e., a third-order orthogonal polynomial of log2(age) − mean(log2(age))). We chose this parameterization as a trade-off between model complexity and flexibility (in particular, to enable the models to capture non-linear developmental trajectories, such as plateauing, across the broad age range analyzed) following visual inspection of the data. Visual inspection was assisted by summary curves using locally estimated scatterplot smoothing (LOESS, Cleveland and Devlin, 1988) (see Figure 1, Supplementary Figs. 2, 3). We chose orthogonal linear, quadratic, and cubic terms for the LMM, such that no collinearity was introduced to the model.
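As an illustration of the age parameterization described above, the following minimal sketch builds three mutually orthogonal columns from mean-centered log2(age) by Gram-Schmidt orthogonalization. It is not the authors' code (the analyses were run in R with nlme); the function name `orthogonal_log_age_basis` and the example ages are hypothetical, and the signs or scaling of the columns may differ from those produced by other software.

```python
import math

def orthogonal_log_age_basis(ages, degree=3):
    """Return `degree` unit-norm, mutually orthogonal columns built from
    mean-centered log2(age); each column is also orthogonal to a constant."""
    x = [math.log2(a) for a in ages]
    mean_x = sum(x) / len(x)
    xc = [v - mean_x for v in x]                  # mean-centered log2(age)

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    basis = [[1.0] * len(xc)]                     # constant (intercept) column
    for power in range(1, degree + 1):
        col = [v ** power for v in xc]            # raw polynomial term
        for b in basis:                           # remove projections on earlier columns
            coef = dot(col, b) / dot(b, b)
            col = [c - coef * bi for c, bi in zip(col, b)]
        norm = math.sqrt(dot(col, col))
        basis.append([c / norm for c in col])
    return basis[1:]                              # linear, quadratic, cubic terms

# Hypothetical ages in months (at least degree + 1 distinct values are needed)
linear, quadratic, cubic = orthogonal_log_age_basis([12, 20, 31, 47, 74, 90, 130, 200])
```

Because the three columns are orthogonal to each other and to the intercept, adding the quadratic or cubic term does not change the information carried by the lower-order terms, which is the collinearity property referred to in the text.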
First, we tested for differences between participants with (Del1, Del2) and without (MutT, MutM, IPD, UPD) deletions. For each scale, we compared a model using only age but no genotype information (M1) to a model with additional information about the presence or absence of a deletion and the interaction of the presence or absence of a deletion with age (i.e. with the polylogarithmic function of age, see above) (M2). Since differences between deletion and non-deletion participants have been found in several previous studies, we expected this model to fit the data significantly better than the model without genotype information for all scales. We then separated the dataset into deletion and non-deletion participants and further compared subgroups within them. We tested whether introducing diagnostic information concerning the class of deletion (Del1, Del2) and subtype of non-deletion (MutM, MutT, IPD, UPD) would significantly improve the models. All models were fit using the maximum likelihood (ML) method and were compared using likelihood ratio tests (LRT). The LRT compares the likelihood of the measured data, given a particular (full) model, with the likelihood for a nested (reduced) model. To assess whether the full model (containing one or several additional fixed or random effects compared with the reduced model) fits the data significantly better, the likelihood ratio statistic (twice the difference in log-likelihood between the two models) is subjected to a χ² test, since it asymptotically follows the χ² distribution under the null hypothesis (11).
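The likelihood ratio test described above can be sketched as follows, assuming the maximized log-likelihoods of the reduced and full models are already available. The function names and the log-likelihood values are hypothetical. The degrees of freedom equal the number of additional parameters in the full model; for M2 versus M1 this would be 4 if the deletion effect and its interaction with the three age terms are each a single coefficient, which is an inference from the model description rather than a reported figure.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, via the series for the
    regularized lower incomplete gamma function. Adequate for moderate x;
    extremely small p-values lose precision through cancellation."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= z / (a + n)
        total += term
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def likelihood_ratio_test(loglik_reduced, loglik_full, extra_params):
    """Return the LRT statistic and its asymptotic chi-square p-value."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    return stat, chi2_sf(stat, extra_params)

# Hypothetical log-likelihoods for an age-only model and a model with deletion status
stat, p_value = likelihood_ratio_test(-1520.0, -1510.0, extra_params=4)
```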
When the best model contained the full diagnostic information for the non-deletion group, we performed pair-wise post-hoc comparisons between genotypes. In principle, post-hoc comparisons in an LMM can be carried out by extracting t-values and degrees of freedom (DF) from the model, but DF and p-value estimation is controversial (12). Therefore, we carried out post-hoc comparisons using likelihood ratio test model comparisons, by refitting the models with full genotype information and comparing them to models without genotype information, filtering out one genotype at a time from the dataset. We adjusted the p-values obtained in these post-hoc comparisons using the Benjamini-Hochberg method (13). This method adjusts p-values such that the expected proportion of false positives among the results declared significant is at most the specified false discovery rate (FDR, e.g. 0.05).
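The Benjamini-Hochberg adjustment mentioned above can be sketched in a few lines. This is a generic implementation of the step-up procedure, not the authors' code; the function name and the six example p-values (standing in for pair-wise comparisons among four non-deletion genotypes) are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):                          # largest p-value first
        i = order[rank - 1]
        # scale by m / rank, then enforce monotonicity from the top down
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values from six pair-wise genotype comparisons
raw = [0.001, 0.012, 0.030, 0.045, 0.200, 0.600]
significant = [q <= 0.05 for q in benjamini_hochberg(raw)]
```

A comparison is declared significant when its adjusted p-value does not exceed the chosen FDR level.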
We used the coefficients of the "best model" for each scale (i.e. the level of genotype detail as found in the analyses reported in Supplementary Tables 5 and 6, and Table 1) to predict values at the sample mean ± standard deviation (std) of log-age (corresponding to 3.2, 5.8, and 10.7 years) to generate a summarizing visualization of genotype differences (reported in Fig. 3, Supplementary Fig. 4). Furthermore, to investigate possible structure in the inter-individual variability across scales, we performed a factor analysis. To this end, we z-transformed (i.e., subtracted the mean and divided by the standard deviation) the age-corrected data for each clinical scale. The z-transform was performed separately for individuals with and without deletions to ensure that the covariance structure was not driven by group differences between genotypes. We then subjected the normalized data to a factor analysis. Factors were computed using a maximum likelihood algorithm and oblimin rotation. Four factors were extracted, following the result of Horn's parallel analysis.
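The group-wise z-transform described above can be sketched as follows. The function name, group labels, and values are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not specify which estimator was used.

```python
from statistics import mean, stdev

def z_transform_within_groups(values, groups):
    """Standardize values separately within each group (e.g. deletion vs.
    non-deletion), so that group mean differences do not drive covariance."""
    out = [None] * len(values)
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        m = mean(values[i] for i in idx)
        s = stdev(values[i] for i in idx)         # sample SD (assumption)
        for i in idx:
            out[i] = (values[i] - m) / s
    return out

# Hypothetical age-corrected scores for one scale
scores = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
labels = ["deletion"] * 3 + ["non-deletion"] * 3
z = z_transform_within_groups(scores, labels)
```

After this step both groups have mean 0 and unit standard deviation on every scale, so any remaining covariance across scales reflects variation between individuals within a group.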
We quantified the stability of the scales using intra-class correlation coefficients (ICC) based on the "best model" (see above). Since visits are spaced 1 year or more apart, the ICC values can be considered an upper bound for test-retest reliability, which is normally derived from measurements performed at much shorter intervals.
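One common way to obtain an ICC from a random-intercept model is as the share of total variance attributable to stable differences between participants. The sketch below assumes that definition; the text does not state the exact formula used, and in particular how the site-level variance was handled is an assumption here (it is counted in the denominator only). The function name and variance values are hypothetical.

```python
def icc_from_variance_components(var_participant, var_residual, var_site=0.0):
    """ICC as between-participant variance over total variance.
    Site variance, if supplied, is added to the denominator (assumption)."""
    total = var_participant + var_site + var_residual
    return var_participant / total

# Hypothetical variance components extracted from a fitted model
icc = icc_from_variance_components(var_participant=6.0, var_residual=2.0)
```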