A small number of abnormal brain connections predicts adult autism spectrum disorder

Although autism spectrum disorder (ASD) is a serious lifelong condition, its underlying neural mechanism remains unclear. Recently, neuroimaging-based classifiers for ASD and typically developed (TD) individuals were developed to identify the abnormality of functional connections (FCs). Due to over-fitting and interferential effects of varying measurement conditions and demographic distributions, no classifiers have been strictly validated for independent cohorts. Here we overcome these difficulties by developing a novel machine-learning algorithm that identifies a small number of FCs that separate ASD from TD. The classifier achieves high accuracy for a Japanese discovery cohort and demonstrates a remarkable degree of generalization for two independent validation cohorts in the USA and Japan. The developed ASD classifier does not distinguish individuals with major depressive disorder and attention-deficit hyperactivity disorder from their controls but moderately distinguishes patients with schizophrenia from their controls. The results leave open the viable possibility of exploring neuroimaging-based dimensions quantifying the multiple-disorder spectrum.

Panels (a) and (b) show the histograms of the permutation test (1,000 repetitions) for the JP LOOCV and the out-of-sample US accuracies, respectively. In panel (b), the binomial distribution is shown as a green curve. The vertical red lines indicate the accuracy of the ASD classifier trained and tested without permutation. Both the LOOCV and the out-of-sample (i.e. US) accuracies were significant at P = 0.001, as demonstrated by the two panels. We observe that for the out-of-sample case [i.e. panel (b)] the binomial distribution is consistent with the permuted distribution. As suggested by Noirhomme et al. (2014) (ref. 1), the decreased independence among samples in LOOCV widens the permuted distribution relative to the binomial one; however, "with an independent validation set, the binomial test is perfectly valid" (ref. 1).

[Figure labels: "Sites in Japan"; "Weighted linear summation"; "Number of individuals"]

[...] FCs exhibiting under-connectivity ($r_{ASD} < r_{TD}$), whereas the 7 FCs above the line are FCs exhibiting over-connectivity ($r_{ASD} > r_{TD}$). An individual FC is represented by a circle, with the radius of the circle scaled by the contribution index of the corresponding connection as defined by the difference in the mean correlation values multiplied by the weight assigned in SLR (inset). The vertical and horizontal lines for each connection show the 95% confidence intervals of correlations for the ASD and TD groups, respectively. See Table 1 for the property of each FC.

[...] $v_1^T x_1$, derived from demographic information and imaging conditions; white-open circles in the right column indicate the canonical variable $v_2^T x_2$, which is derived from the functional connectivity (FC). The numbers on the dotted lines connecting canonical variables represent the correlation coefficients between $v_1^T x_1$ and $v_2^T x_2$. The connections between the demographic labels and canonical variables $v_1^T x_1$ are represented with black lines. If there is only one link towards a canonical variable, the color of the canonical variable is also used for the link (e.g. the link connecting "Diagnosis" and the 1st canonical variable is red). On the right of the figure, FCs are visualized and encoded with the color of the respective canonical variable. If canonical variables have overlapping FCs, those are colored in gray. However, if the overlap involves the 1st canonical variable (i.e. red) a red square with a black edge is used. In this example we focus on the overlapping between the 1st and the 6th canonical variables, representing "Diagnosis," and "Gender" and "Open/Closed Eye Condition", respectively. FCs that are common to these two canonical variables are represented with a black square and connected with a colored line to the respective canonical variable.
The FCs identified by the SLR classifier on the whole Japanese dataset are represented with a white edge, filled with black if an overlap with the 6th canonical variable exists, and with red otherwise. The number of FCs associated with the 1st canonical variable was 745 and the number associated with the 6th canonical variable was 659, with an overlap of 141 FCs. Moreover, only 1 of the FCs selected by SLR overlapped with the 6th canonical variable. The $\lambda$ combinations in which a canonical variable has only one link, to "Diagnosis", accounted on average for 17.6 ± 5.0% of all combinations. Moreover, we observe that $\lambda$ combinations larger than ($\lambda_1 = 0.4$, $\lambda_2 = 0.4$) never comply with this constraint. On average, the number of FCs associated with a "Diagnosis" canonical variable was 925 ± 798.

For the sake of readability, we discuss the synthetic dataset with the same terminology used for the real dataset. For example, "diagnostic label" means "synthetic diagnostic label".
(a) Classification performance. Histograms depict the accuracy distribution, while the vertical dashed lines represent the mean accuracy of the two methods. Our proposed method, which uses $L_1$-SCCA for the feature selection, shows better classification performance (two-sample t-test, $P = 1.06 \times 10^{-52}$) than that of the standard elastic-net approach.
(b) Number of nuisance-related features (i.e. nuisance features) used to predict the diagnostic label. The figure shows how frequently a given number of nuisance features was selected by the two classification methods. The nuisance features were less frequently selected by our proposed method than by elastic net. These results indicate the effectiveness of $L_1$-SCCA for eliminating nuisance features.
(c) Instance of the $L_1$-SCCA procedure. Each subpanel represents the transformation matrices obtained by $L_1$-SCCA in a given nested fold.
Here we [...] White-open green circles depict the features with zero contribution to any demographic information. The color intensity of the lines is proportional to the connection strength (i.e. absolute value of the weight). We observe that the diagnostic canonical constraint of having one canonical variable assigned exclusively to the diagnostic label is met. Moreover, the canonical variables assigned to the nuisance variables always have the strongest connection with the nuisance features (i.e. 7-10). Fold 8 shows a missing connection between one of the clean features (i.e. 5) and the canonical variable assigned to the diagnostic label. However, the missing feature in Fold 8 is selected by other folds, highlighting the usefulness of the nested subsampling procedure.

Note that in site C, individuals with ASD were not recruited, thus sensitivity and DOR cannot be evaluated.

Supplementary Table 3 | Prediction of the measured domains of the two diagnostic instruments, Autism Diagnostic Observation Schedule (ADOS) and Autism Diagnostic Interview-Revised (ADI-R).
In each domain, the score of each individual was predicted by computing a linear weighted summation of a subset within the 16 FCs included in the classifier. The Pearson correlation coefficients (r) between the measured and predicted scores are shown. The statistical significance (P) is indicated both as uncorrected and as Bonferroni-corrected for multiple comparisons among the 8 domains.

Importance of the final 16 FCs throughout all FCs selected in the LOOCV.
The 16 FCs incorporated in our final classifier were selected by SLR using the whole Japanese dataset, starting from a subset of FCs that was previously reduced by nested feature selection. One might wonder whether the 16 FCs that were finally identified were also frequently selected, with a large weight, throughout the LOOCV procedure. This is an important question regarding the stability and robustness of the finally identified 16 FCs. To answer this question, we define the cumulative absolute weight for the k-th FC (k = 1, 2, ..., 9730) in the form

$$c_k = \sum_{i=1}^{N} \left| w_i^k \right|,$$

where $N = 181$ is the number of LOOCV folds (i.e. the number of subjects), and $w_i^k$ is the weight associated with the k-th FC during the i-th LOOCV fold.
A greater magnitude of $c_k$ indicates a larger contribution by the k-th FC to the classification into ASD and TD throughout the LOOCV. Supplementary Fig. 3 shows the magnitude distribution of the 42 nonzero instances of $c_k$. Sorting by their magnitudes, we found that the identified 16 FCs represent an important subset of the 42 FCs that were selected at least once during the LOOCV. Consequently, we conclude that the finally identified 16 FCs were stable and robust with respect to the 181 LOOCV subsets of individuals and can be regarded as trustworthy.
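The cumulative absolute weight can be illustrated with a minimal sketch. It assumes that each LOOCV fold yields a sparse mapping from FC index to SLR weight; the function name, the data layout, and the toy weights below are hypothetical and are not taken from the study.

```python
def cumulative_absolute_weights(fold_weights):
    """Sum |w_i^k| over folds i for every FC index k.

    fold_weights: list of dicts, one per LOOCV fold, mapping an FC index k
    to its weight in that fold. FCs not selected in a fold are absent,
    which is equivalent to a weight of zero.
    """
    totals = {}
    for weights in fold_weights:
        for k, w in weights.items():
            totals[k] = totals.get(k, 0.0) + abs(w)
    return totals


# Hypothetical toy example: 3 folds with sparse weights over a few FCs.
toy_folds = [
    {12: 0.8, 40: -0.5},
    {12: 0.7, 40: -0.6, 77: 0.1},
    {12: 0.9},
]
c = cumulative_absolute_weights(toy_folds)

# Rank FCs by decreasing cumulative absolute weight.
ranking = sorted(c, key=c.get, reverse=True)
print(ranking)
```

In this toy case an FC that is selected in every fold with a large weight ranks first, while an FC selected only once with a small weight ranks last, which is the sense in which the measure reflects stability across folds.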

Application of the ASD classifier to the extended ABIDE dataset including individuals with diverse profiles.
The goal of the present study was to establish a generalizable rs-fcMRI-based classifier by evaluating its performance using independent populations with well-defined profiles. An additional question is how the classifier works on individuals with other confounding factors, such as medication and comorbidity. To address this, we performed a supplementary analysis by applying the ASD classifier to the "extended" ABIDE dataset that incorporated individuals with diverse profiles. This dataset was formed by relaxing the selection criteria we adopted in our main analysis (see "Generalization to USA data" in Methods). Specifically, removing the conditions for the FIQ, comorbidity, and medication status, we additionally identified 19 individuals with ASD and 19 demographically matched TD individuals in the ABIDE data pool. We appended these individuals to the main dataset to form the extended ABIDE dataset, which consisted of a total of 63 individuals with ASD and 63 TD individuals. Repeating the same analysis for this extended dataset, we found: AUC = 0.74, accuracy = 71%, sensitivity = 75%, specificity = 68%, and DOR = 6.4. Importantly, there was no statistically significant difference in the classifier's sensitivity between the original ASD population (N = 44, 75%) and the appended ASD population (N = 19, 74%) (chi-square test, P = 0.91). This supplementary analysis therefore indicates that the influence of confounding factors such as FIQ, comorbidity, and medication status on the classification performance appeared to be minimal.
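As a consistency check on the reported figures, the diagnostic odds ratio (DOR) can be recomputed from the sensitivity and specificity. The sketch below uses the standard definition of the DOR and the rounded percentages quoted above; the function name is hypothetical.

```python
def diagnostic_odds_ratio(sensitivity, specificity):
    """DOR = [sens / (1 - sens)] * [spec / (1 - spec)].

    Both inputs must lie strictly between 0 and 1; otherwise the ratio is
    undefined (for example, when one of the two groups is missing at a site).
    """
    if not (0.0 < sensitivity < 1.0 and 0.0 < specificity < 1.0):
        raise ValueError("sensitivity and specificity must be in (0, 1)")
    return (sensitivity / (1.0 - sensitivity)) * (specificity / (1.0 - specificity))


# Rounded values reported for the extended ABIDE dataset.
dor = diagnostic_odds_ratio(0.75, 0.68)
print(round(dor, 1))
```

With the rounded inputs, the result agrees with the reported DOR of 6.4 to one decimal place.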

Supplementary Note 3
Comparison with the elastic net.
Our ASD/TD classifier attained accuracies of 85% for the Japanese dataset and 75% for the USA dataset. For comparison, we also applied to our datasets a state-of-the-art regularized (logistic) regression method called the elastic net (ref. 2), which was utilized in a previous study (ref. 3). [...] was AUC = 0.73 and accuracy = 61% with 173 finally selected FCs. The performance for the Japanese discovery cohort was comparable to that of our classifier, but the accuracy for the USA independent validation cohort was 14 percentage points lower (61% vs. 75%), showing much less generalization capability. These results support the usefulness of our feature extraction and classification approach in preventing interferential effects by nuisance variables (NVs), because the elastic net algorithm did not explicitly avoid features related to NVs.

Supplementary Note 4
Effectiveness of $L_1$-SCCA in avoiding nuisance variables.
Eliminating the unwanted effects of nuisance variables on FCs is indispensable for a study using multi-center imaging data. This is because, in the absence of a "gold standard" method for rs-fcMRI data acquisition, different sites adopt different scanning protocols and imaging instruments, which may exert significant effects on the measurement of FCs. In addition, the training of a reliable classifier requires a dataset with a large number of subjects at each site, which makes it difficult to equate all the demographic variables including diagnostic label, age, sex, etc., among multiple sites.
Under such conditions, the diagnostic label and other nuisance variables may be correlated.
Therefore, in order to achieve high generalization ability across multiple sites, it is essential to explicitly eliminate the unwanted effects of nuisance variables.
To illustrate this issue, we conducted a simplified simulation using synthetic data and visualized how $L_1$-SCCA performed. For the sake of readability, we discuss the synthetic dataset with the same terminology used for the real dataset. For example, "diagnostic label" means "synthetic diagnostic label". Moreover, to keep consistency with the Methods section, we define the matrix containing demographic variables as $X_1$ and the matrix containing the connectivity input as $X_2$.
We consider a 10-dimensional connectivity input with 100 samples (i.e. $X_2 \in \mathbb{R}^{100 \times 10}$) to depict how $L_1$-SCCA performs. Each sample was independently generated from an identical Gaussian distribution with zero mean and unit covariance. We then divided the 100 samples into 70 training samples and 30 test samples. Here, we assume that two elements of the 10-dimensional input are related to the diagnostic label, and four other elements are related to the nuisance variables. We used a weight vector of the form $w = [w_1, 0, 0, 0, w_5, 0, w_7, w_8, w_9, w_{10}]^T$ to synthesize the diagnostic label as $y = \mathrm{sign}(X_2 w)$. In the weight vector $w$, $w_1$ and $w_5$ correspond to the contribution of the two elements truly related to the diagnostic label (i.e. clean), and $w_7$, $w_8$, $w_9$ and $w_{10}$ correspond to the contribution of the four nuisance-related features (i.e. nuisance). When defining the elements of $w$, we refer to the percentage of contribution with respect to the sum of the weights [...] $x$ with $k = 9$ and $k = 10$, respectively. These four variables could correspond to age (continuous variable), sex (binary variable) and two site labels (binary variables). The simulation consisted of 1,000 repetitions. At each repetition, we applied the proposed method and elastic net to newly resampled $X_2$ and $w_{tr}$.
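The data-generating step of this simulation can be sketched as follows. The weight values are hypothetical placeholders, since the actual contribution percentages are not recoverable from the text here, and the construction of the four nuisance variables is omitted for the same reason. Only the sparsity pattern of $w$, the Gaussian inputs, the sign rule, and the 70/30 split follow the description above.

```python
import random


def make_synthetic(n_samples=100, n_features=10, weights=None, seed=0):
    """Draw X2 from a standard Gaussian and set y = sign(X2 w)."""
    rng = random.Random(seed)
    if weights is None:
        # Hypothetical values. Only the pattern of zeros follows the text:
        # features 1 and 5 are "clean", features 7-10 are nuisance-related.
        weights = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.5, 0.5, 0.5, 0.5]
    X2 = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
          for _ in range(n_samples)]
    y = [1 if sum(x * w for x, w in zip(row, weights)) >= 0 else -1
         for row in X2]
    return X2, y


X2, y = make_synthetic()

# 70 training samples and 30 test samples, as in the simulation.
X_train, y_train = X2[:70], y[:70]
X_test, y_test = X2[70:], y[70:]
```

Because the label depends on the nuisance-related features as well as the clean ones, a classifier that does not explicitly avoid them can pick them up, which is the situation the simulation is designed to probe.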
As a result, we found that the classification performance of our proposed method, which uses $L_1$-SCCA for the feature selection, was better (two-sample t-test, $P = 1.06 \times 10^{-52}$) than that of the standard elastic-net approach (see Supplementary Fig. 8A). We also compared how frequently the nuisance-related features were selected by the two algorithms for predicting diagnostic labels (see Supplementary Fig. 8B). We found that the nuisance-related features were less frequently selected by our proposed method than by elastic net. This result shows the effectiveness of $L_1$-SCCA in avoiding the influence of nuisance variables.
To show concretely how $L_1$-SCCA performed, we visualized the transformation matrix from demographic to canonical variables, and from connectivity inputs to canonical variables, as a graph for every nested fold in one repetition of the simulation (see Supplementary Fig. 8C). At each nested fold, we considered the ($\lambda_1$, $\lambda_2$) combination where the diagnostic canonical constraint was last met (i.e. the last iteration across $\lambda_1$, $\lambda_2$). The diagnostic canonical constraint, used in the feature selection procedure, requires that at least one canonical variable is assigned only to the diagnostic label (for details see the subsection of Methods entitled "L 1 -regularized sparse canonical correlation analysis used in inner loop feature selection").
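The diagnostic canonical constraint described above can be expressed as a simple check on the demographic-side transformation matrix. This is a minimal sketch under assumed conventions: rows index demographic variables, columns index canonical variables, and a link exists wherever the weight is nonzero. The function name, the row ordering, and the example matrices are hypothetical.

```python
def meets_diagnostic_constraint(V1, diag_row=0, tol=1e-12):
    """Return True if some canonical variable is linked only to the diagnosis.

    V1: list of rows; V1[i][j] is the weight of demographic variable i on
    canonical variable j. diag_row is the row of the diagnostic label.
    """
    n_rows, n_cols = len(V1), len(V1[0])
    for j in range(n_cols):
        linked = [i for i in range(n_rows) if abs(V1[i][j]) > tol]
        if linked == [diag_row]:
            return True
    return False


# Hypothetical matrices with rows ordered as: diagnosis, age, sex.
V1_ok = [[0.9, 0.0, 0.0],
         [0.0, 0.7, 0.2],
         [0.0, 0.0, 0.5]]   # column 0 is linked to the diagnosis only
V1_bad = [[0.9, 0.0],
          [0.3, 0.7],
          [0.0, 0.4]]       # column 0 mixes diagnosis and age

print(meets_diagnostic_constraint(V1_ok), meets_diagnostic_constraint(V1_bad))
```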
We observe from Supplementary Fig. 8C that the diagnostic canonical constraint was met.
Moreover, we found that the canonical variables assigned to the nuisance variables always have the strongest connection with the nuisance-related features. We also verified the usefulness of the nested subsampling procedure, in which the union of the features selected across nested folds is considered in order to obtain a stable and clean set of features. Specifically, Fold 8 in Supplementary Fig. 8C shows a missing connection between one of the clean features and the canonical variable assigned to the diagnostic label. If features were selected based only on Fold 8, the algorithm would have missed one clean feature, degrading the prediction. However, taking the union of features across folds overcomes this issue.
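The role of the union across nested folds can be illustrated with a short sketch. The per-fold selections below are hypothetical and only mimic the situation described for Fold 8, where one clean feature is missed in a single fold but recovered by the others.

```python
def union_of_selected(fold_selections):
    """Return the sorted union of feature indices selected in any fold."""
    selected = set()
    for features in fold_selections:
        selected |= set(features)
    return sorted(selected)


# Hypothetical selections of diagnosis-related features per nested fold.
selections = {
    "fold 7": [1, 5],
    "fold 8": [1],      # clean feature 5 is missed in this fold
    "fold 9": [1, 5],
}
final_features = union_of_selected(selections.values())
print(final_features)
```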

Supplementary Note 5
Generalization performance of the ASD classifier from the USA dataset to the Japanese dataset.
We trained the classifier using the USA dataset and then tested on the Japanese dataset. The results showed poorer classification performance (US LOOCV: 48%, Generalization to JP: 62%), and all the selected FCs were different from the 16 FCs that were extracted in our study (see also the Results section "Characteristics of the 16 identified FCs incorporated in the classifier"). This result is somewhat consistent with the classification performances described in Supplementary Table 4.
Indeed, the total number of samples and the number of samples per site seem to play a crucial role in deriving a biomarker with high accuracy. With this in mind, we observe that the total number of samples in the US ABIDE dataset is about half that of the Japanese dataset. Moreover, the number of samples per site is limited and highly variable in the ABIDE dataset compared with the Japanese dataset: on average, 12.6 ± 11.6 (USA dataset) vs. 60.3 ± 23.7 (Japanese dataset).

Supplementary Note 6
Relationship between demographic information and functional connectivity.
In this section, we exemplify how the $L_1$-SCCA procedure works to reduce the effect of nuisance variables, such as subject properties (e.g., age, sex), site properties, and scanning protocols (e.g., eyes open/closed). This procedure allowed us to utilize data with a great variety of demographic distributions and imaging conditions from multiple imaging sites for the construction of a classifier with good generalization capability across "foreign" sites. We begin by considering a simple and extreme artificial example to illustrate how $L_1$-SCCA can fulfill this role. Suppose that site X recruited almost exclusively ASD participants and only one TD participant and utilizes a closed-eye paradigm, whereas site Y recruited almost exclusively TD participants and only one ASD participant and utilizes an open-eye paradigm. In this case, it should be quite easy for any machine-learning algorithm to classify ASD and TD based on the FCs associated with the eyes open/closed condition, rather than the ASD/TD label. This is of course an undesirable situation and leads to very poor generalization across new imaging sites. However, when we use $L_1$-SCCA, at least one canonical variable is assigned to the eyes open/closed condition (i.e. a nuisance-related canonical variable), and at least one other canonical variable is assigned to the ASD/TD label. With the introduction of $L_1$ regularization, canonical variables compete for the FCs. This reduces the number of FCs common across canonical variables. More specifically, the FCs assigned to the nuisance-related canonical variables are penalized, and the classifier uses only FCs directly associated with the ASD/TD-related canonical variables. Thus, artifactual effects by canonical variables other than the ASD/TD label are reduced in classification. The same argument applies to any other unevenly distributed attribute, including psychotic drugs and sex.
In practice, an FC can be related to different demographic attributes simultaneously (Supplementary Fig. 7). However, as depicted in Supplementary Fig. 8C, the canonical variables assigned to the nuisance variables always have the strongest association with the nuisance-related FCs. Considering all these factors, we can safely assume that the $L_1$-SCCA procedure effectively suppresses crosstalk from nuisance variables.
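To make the extreme two-site example concrete, the sketch below quantifies how strongly the diagnostic label and the eye condition are confounded in such a design. The participant counts are hypothetical (the example only says "almost exclusively"), and the Pearson correlation between the two binary labels is used purely as an illustration; it is not part of the procedure described in this note.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Hypothetical counts. Site X: 30 ASD + 1 TD, eyes closed.
#                      Site Y: 30 TD + 1 ASD, eyes open.
diagnosis = [1] * 30 + [0] + [0] * 30 + [1]   # 1 = ASD, 0 = TD
eyes_closed = [1] * 31 + [0] * 31             # 1 = closed, 0 = open

r = pearson(diagnosis, eyes_closed)
print(round(r, 3))
```

With these counts the two labels are almost perfectly correlated, so any FC that tracks the eye condition would also appear to predict the diagnosis at these sites.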

Supplementary Note 7
Details about data standardization.
For $L_1$-SCCA, the standardization was conducted using only 8 out of 9 folds, and the testing pool for LOOCV was never used. Moreover, when evaluating the classification performance of SLR, standardization was performed with a leave-one-subject-out (LOSO) approach. Concretely, the standardization of the training set was done independently of the test data. The test data were then standardized using the mean and standard deviation (SD) derived from the independent dataset. In the LOSO standardization, all but one of the USA subjects were concatenated to the Japanese dataset in order to find the mean and SD. These parameters were subsequently used to standardize both the Japanese dataset (i.e. the training set) and the remaining USA subject (i.e. the test set, never used for standardization). It should be noted that even though a part of the USA dataset was used for standardizing the Japanese dataset, the actual learning was done using only the Japanese samples.
The LOSO approach is useful because it removes the bias caused by the different scanning conditions between the Japanese and USA datasets, leading to a better balance between specificity and sensitivity. Given that a slightly different mean and SD were used for standardization for each USA sample, the actual number of selected FCs is 15.96 ± 0.23 (99.6% overlap). In order to report information other than classification performance (e.g. the weights of the classifier, the number of FCs), the whole USA dataset was concatenated to the Japanese dataset for standardization and the classifier was retrained using only the Japanese dataset. This procedure led to the finally reported 16 FCs.
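The LOSO standardization can be sketched for a single FC as follows. This is an illustration under stated assumptions rather than the code used in the study: the function and variable names are hypothetical, only one feature is handled for brevity, and the population SD is used because the text does not specify the estimator.

```python
from statistics import mean, pstdev


def loso_standardize(train_values, test_values, held_out):
    """Standardize one FC with the held-out test subject excluded.

    train_values: values of the FC for the training (Japanese) subjects.
    test_values:  values of the FC for the test (USA) subjects.
    held_out:     index of the USA subject left out of the statistics.
    Returns the standardized training values and the standardized value
    of the held-out subject.
    """
    reference = list(train_values) + [
        v for i, v in enumerate(test_values) if i != held_out
    ]
    mu, sd = mean(reference), pstdev(reference)  # population SD: an assumption
    if sd == 0:
        raise ValueError("zero variance: cannot standardize")
    z_train = [(v - mu) / sd for v in train_values]
    z_test = (test_values[held_out] - mu) / sd
    return z_train, z_test


# Hypothetical correlation values for one FC.
jp = [0.1, 0.3, 0.5]
usa = [0.2, 0.4, 0.9]
z_train, z_test = loso_standardize(jp, usa, held_out=2)
print(z_train, z_test)
```

The held-out subject never contributes to the mean or SD, so changing its value leaves the standardized training data unchanged. Repeating the call for each USA subject yields slightly different parameters per subject, which is consistent with the small variation in the number of selected FCs reported above.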