An integrative systems medicine approach to delineate complex genotype-phenotype associations in Autism Spectrum Disorder

Background The heterogeneous phenotype and complex genetic architecture of Autism Spectrum Disorder (ASD) has thus far limited our understanding of genotype-phenotype correlations, hindering early diagnosis and patient prognosis. Copy Number Variants (CNVs) targeting a diversity of genes have been implicated in ASD, however correlations with clinical patterns are unclear. Methods In this study, we developed a novel machine learning integrative approach that seeks to delineate associations between ASD clinical profiles and disrupted biological processes inferred from CNVs spanning brain-expressed genes. Results Clustering analysis of relevant clinical measures from 2446 ASD cases, retrieved from the Autism Genome Project (AGP) database, identified two distinct phenotypic subgroups, with a milder and a more severe phenotype. Patients in the two clusters differed significantly in verbal status, ADOS-defined severity, adaptive behaviour profiles and intellectual ability, with verbal status contributing the most for cluster stability and cohesion. In the clustered ASD cases, functional enrichment analysis of brain-expressed genes disrupted by rare CNVs identified 15 statistically significant biological processes. These biological processes included cell adhesion, nervous system development, cognition and protein polyubiquitination and were in line with previous ASD findings. Random Forest feature importance analysis showed a positive contribution of all disrupted biological processes to the classification of ASD cases in the identified clusters. A Naive Bayes classifier was generated to predict the ASD phenotype from the identified disrupted biological processes. For a subset of patients with higher Information Content scores calculated for the disrupted biological processes, the classifier achieved predictions with a high precision but low recall (Precision: 0.82, Recall: 0.39). Conclusions This study highlights the usefulness of machine learning approaches to reduce clinical heterogeneity by taking advantage of multidimensional clinical measures. Furthermore, it shows that phenotype-genotype correlations can be established in ASD, and that milder and more severe clinical presentations have distinct underlying biological mechanisms. However, precise predictions of the phenotype from genetic data were only achieved for the subset of patients with higher biological information content. These findings are therefore a first step towards the translation of genetic information into clinically useful applications, while emphasizing the need for larger datasets with complete clinical and biological information.

database, and categorized with the following thresholds: IQ>70 normal, 50<IQ<70 mild 138 intellectual disability, IQ<50 severe intellectual disability. 139 Clinical reports from the ASD patients were examined for missing values, and clinical features 140 with more than 70% information were retained for the analysis. To minimise missing value 141 imputation bias, individuals with missing values above this threshold for more than two clinical 142 features were also excluded. Completeness of each clinical feature is reported in Table S1 143 (Additional file 1). Missing values were imputed using the missForest [24] R package that 144 implements the Random Forest [25] algorithm, a decision tree-based supervised machine 145 learning method. Imputation error was assessed using the normalised global Proportion of 146 Falsely Classification (PFC), and the missing values imputation error was 0.12.

147
 Clustering analysis of ASD clinical data 148 To focus on core domains of ASD symptoms, verbal skills, disease severity, adaptive behavior 149 and intellectual levels, which strongly condition prognosis, were selected for further analysis. 150 Verbal status was obtained from the ADI-R, ASD severity scored from the ADOS, adaptive 151 functioning from the VABS, using its three subdomains, and a performance IQ category from the 152 IQ assessment contributed by participating sites to the AGP database. Other IQ domains had too 153 many missing values to be used. The Agglomerative Hierarchical Clustering (AHC) [26] method 154 was used to identify independent phenotypic subgroups from the selected clinical features. 155 Correlations between clinical features were assessed using the Pearson method, and features with 156 a correlation value of > 0.75 were considered correlated. The Gower [27] metric was used to was also used to define outliers in clinical data. A decrease in the Silhouette value of a cluster 166 after removing one feature indicates its importance in defining this cluster and vice versa. 167  Goodness of clustering assessment 168 A Silhouette method [29] was employed to estimate the goodness of the clustering results. The 169 Silhouette value for each individual shows how well the individual is clustered, and ranges from 170 -1 to 1, with individuals scoring below 0 considered as wrongly clustered. In addition, the 171 Silhouette value for each cluster was derived, and clusters with Silhouette value of > 0.25 were 172 considered as true clusters. Bootstrapping with 1000 iterations was used to measure the stability 173 of clusters, where a boot mean value above 0.85 corresponds to stable clusters. All clustering 174 analysis was performed in R environment, using Cluster [30] and FPC packages.

175
 Functional enrichment analysis 176 Genotyping and CNV calling methods for the AGP ASD subjects (N=2446) were previously 177 described [18]. CNVs called by any two algorithms (high confidence CNVs) and above 30kb in 178 size were retained for further analysis. To screen for rare CNVs (<1% in control population) 179 CNV frequencies in control populations were estimated using the genotypes from the studies by

183
To focus CNV selection on variants spanning brain-expressed genes, avoiding a priori 184 hypotheses from ASD candidate gene assumptions, an extensive list comprising 15585 brain-185 expressed genes was obtained from Parikshak et al. [34]. The brain-expressed gene list was 186 prepared from brain RNA-seq data, collected at thirteen different developmental stages, 187 including genes expressing during early brain developmental phase. The full criteria and 188 parameters used to define the brain-expressed gene list were previously described [34]. .

189
The g:Profiler [35] tool was employed to identify biological processes enriched for brain-  terms, and terms with a similarity score of > 0.7 were grouped.   Bayes, a stratified five-fold cross-validation approach was used, in which data was first split into 217 five equal subsets with equal class probabilities; a Naive Bayes model was trained on any four 218 subsets, and the remaining subset was used as the test set. This process was repeated five times 219 and each time a different subset was used as test set. For each repetition, the model performance 220 was estimated and mean values for precision, recall, specificity and F-score were reported. The Naive Bayes classifier was trained on patient's data by using the "more severe" cluster as the 222 positive class and the "less severe" cluster as the negative class.  Table 1). Elimination of the loosely clustered 238 individuals resulted in more stable and cohesive clusters, with high values for clusters stability 239 and reduced average distance between the two individuals in a cluster (Table 1). All clinical measures differed significantly between the two clusters, as shown in Table 2.

246
Cluster 1(Additional file 1: black circles in Figure S1) includes a higher number of individuals, 247 who generally exhibited a milder clinical phenotype, while Cluster 2 (Additional file 1: red 248 triangles in Figure S1) included a higher percentage of subjects with severe dysfunction. performance IQ in the range of severe intellectual disability.

259
Regarding the ADOS severity score, approximately 14% of the individuals in Cluster 1 were 260 assigned to the milder category of the ADOS severity score ("Non-spectrum" for ADOS, but 261 scoring positive for "Autism" in the ADI-R, and therefore classified in the AGP "Spectrum"  However, the percentage of males was higher in cluster 1, representing the milder phenotype, 279 consistent with general observations that male to female ratios are higher in datasets that 280 comprise more high-function ASD individuals. Phenotypic cluster and rare CNV data was complete for 1357 individuals with ASD, and 294 available for integration. Functional enrichment analysis of rare CNVs targeting brain-expressed 295 genes (N=2738) in 1357 patients identified 17 statistically significant biological processes 296 (Additional file 1: Table S3). g:Profiler did not recognize 187 genes from the input list.

297
The redundancy of GO terms in functional enrichment analysis, caused by overlapping 298 annotations in ancestors and descendent terms in the DAG structure of GO, was reduced by 299 grouping the terms that had a semantic similarity score higher than 0.7 (Additional file 1: Table   300 S3). The Revigo tool used to reduce redundancy did not recognise one biological process 301 (Plasma membrane bounded cell projection organization). After redundancy reduction, 16 302 biological processes remained (Table 3)

305
The most significant biological process identified in this dataset was Homophilic cell adhesion 306 via plasma membrane adhesion molecules, which includes 53 brain-expressed genes disrupted 307 by the selected CNVs. The ten most significant biological processes were related to cell adhesion 308 and cellular organization, and also included nervous system development and protein poliubiquitination (Table 3). Moreover, two significant biological processes were related to 310 behavior and cognition. The enriched biological processes and phenotypic cluster information for ASD cases were 316 combined in a matrix to assess the predictive value of the biological processes for categorization 317 in one of the two phenotypic clusters, broadly characterized by a milder and a more severe 318 phenotypic presentation. The 57 individuals containing both rare CNV and cluster information that did not present any enriched biological process were excluded, so further analysis comprised 320 1300 ASD patients. 321 Table 4 shows the ranking in importance of disrupted biological processes for categorization of 322 subjects into ASD phenotypic clusters, computed using the Random Forest importance score 323 function. The importance of each biological process was calculated using the mean decrease in accuracy,

345
The Naive Bayes classifier trained on data from 1300 patients did not perform well in predicting 346 the more dysfunctional clinical phenotype from disrupted biological processes (Table 5)   The discovery of diagnostic and prognostic biomarkers for ASD has the potential to improve the

375
In this study, we developed a novel integrative approach to predict ASD phenotypes from  Overall, the present approach is proof of concept that genotype-phenotype correlations can be 487 established in ASD, and that biological processes can predict multidimensional clinical 488 phenotypes. Importantly, it highlights the usefulness of machine learning approaches that take