Altered structural brain asymmetry in autism spectrum disorder in a study of 54 datasets

Altered structural brain asymmetry in autism spectrum disorder (ASD) has been reported. However, findings have been inconsistent, likely due to limited sample sizes. Here we investigated 1,774 individuals with ASD and 1,809 controls, from 54 independent data sets of the ENIGMA consortium. ASD was significantly associated with alterations of cortical thickness asymmetry in mostly medial frontal, orbitofrontal, cingulate and inferior temporal areas, and also with asymmetry of orbitofrontal surface area. These differences generally involved reduced asymmetry in individuals with ASD compared to controls. Furthermore, putamen volume asymmetry was significantly increased in ASD. The largest case-control effect size was Cohen’s d = −0.13, for asymmetry of superior frontal cortical thickness. Most effects did not depend on age, sex, IQ, severity or medication use. Altered lateralized neurodevelopment may therefore be a feature of ASD, affecting widespread brain regions with diverse functions. Large-scale analysis was necessary to quantify subtle alterations of brain structural asymmetry in ASD.

Reviewer #1 (Remarks to the Author):

In this study Postema et al. used a large multisite sample of MR images from people with and without ASD to assess effects of cortical asymmetry. Data from ~1800 individuals aged 2-65 with and without ASD were assembled for the analyses. FreeSurfer was used to obtain cortical thickness and surface area measures from anatomical regions in the Desikan atlas, which were entered into linear mixed models that accounted for site-related clustering. Significant asymmetry differences in ASD relative to the control sample were found for overall thickness and for several regional thickness measures, with asymmetry generally reduced in ASD, and for only one region for surface area. Of the subcortical regions examined, only the putamen showed a significant difference in volume asymmetry. Importantly, the authors report effect sizes, which were relatively small, and demonstrate that they have sufficient power to detect effects in this range but that most previous work in this area has been underpowered. This is an interesting and very clearly written paper. While the work is important, the authors acknowledge that the clinical utility of the findings may be limited given the small effect sizes. Nonetheless, doing this work to establish that benchmark is important and commendable. I have relatively minor comments towards an improved version.

1) Some citations are needed for the paragraph at the top of page 6, to support claims of e.g. larger average volume.

2) Line 69 - would add 'potentially' before 'due to'.

3) Given the large sample size, I was curious why age- and sex-specific effects or interactions were not modeled, but only followed up post hoc in areas showing a diagnostic effect, particularly given the wide age range?

4) Page 12 line 217 - confusing in the same section to use n1 and n2 twice to mean different things.

5) In the age-specific effects follow-up I was confused as to why age was coded as a binary variable? More generally, I wonder about the utility of including the full age range given the sparse sampling, particularly at older ages.

6) Figure 1: the color bar should be labeled, and the *s are a bit clumsy - it is hard to see what they are meant to refer to on the surface.

7) Line 357: why a 'putative' effect?

8) Age and IQ were examined separately - perhaps a missed opportunity to understand the effect in the rostral ACC?

9) Given the small effect sizes reported, I wondered about differences in sensitivity to small cortical features in the 1.5T vs. 3T acquired data. Maybe worth running the analyses on the subset of 3T-acquired data?
Reviewer #2 (Remarks to the Author): This is another interesting and important study from the ENIGMA consortium. The main strength of the paper is the large sample size and the use of a known and standardised pipeline. While the question is not particularly novel, the sample size allows a more definitive answer on the presence of brain asymmetry in autism.
There are some major issues to consider with the methodological approach used here.
The authors have removed outliers at various stages in their analyses. Complete removal of outliers is, in my experience, not typically encouraged by statisticians and where it does happen it would be more typical to present the outlier-removed results as a secondary/sensitivity analysis.
It is not clear what the authors did with regards to outlier removal and why. They wrote: "outliers were defined per group (cases/controls) and per dataset as those values above and below 1.5 times the interquartile range, and were excluded". It would be helpful if they clarified:

• Why this approach was necessary, in more detail. Why are outliers particularly a concern for this type of analysis? Did the authors consider a less extreme treatment of outliers, such as Winsorisation? The authors state how many datapoints were removed, but they do not give the important context of what fraction of the observations this was and whether this varies by site/caseness.

• What did they actually do? In this context the interquartile range must be taken relative to something (e.g. the mean, median, or upper/lower quartiles). The latter option would be Tukey's fences method (the basis for the whisker end points in a Tukey boxplot); this method is common, but not universal. This has implications for evaluating the technique. If it is Tukey's method, ~0.7% of normal data would be excluded erroneously. If it is the sample median +/- 1.5*IQR, then ~4% of perfectly normally distributed data would be excluded. Because the authors repeat the process 3 times (L, R and LI), the expected rate of outliers in perfectly normal data is 1 - (1 - Rate)^3, i.e. ~2% and ~11% respectively.

• Was outlier detection calculated for each combination of group and dataset, or once looking for outliers within each group (pooling datasets) and once looking for outliers within each dataset (pooling groups)?
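The reviewer's figures can be checked numerically. The following is an illustrative sketch (not part of the paper), assuming perfectly normally distributed data and using only the Python standard library:

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def exclusion_rate(fence):
    """Two-sided probability that a standard-normal value lies beyond +/- fence."""
    return 2.0 * (1.0 - normal_cdf(fence))

# For N(0,1): Q1/Q3 = -/+ 0.6745, so IQR = 1.349.
iqr = 2 * 0.6745

# Tukey's fences: Q3 + 1.5*IQR above (and symmetrically below Q1).
tukey = exclusion_rate(0.6745 + 1.5 * iqr)   # ~0.7%

# Median +/- 1.5*IQR is a tighter fence.
med = exclusion_rate(1.5 * iqr)              # ~4%

def repeated(rate, k=3):
    """Chance of being flagged at least once across k independent screens
    (here: L, R and the laterality index)."""
    return 1.0 - (1.0 - rate) ** k

print(f"Tukey: single pass {tukey:.3%}, three passes {repeated(tukey):.3%}")
print(f"Median fence: single pass {med:.3%}, three passes {repeated(med):.3%}")
```

Under these assumptions the single-pass rates come out at roughly 0.7% and 4.3%, and the three-pass rates at roughly 2% and 12%, in line with the reviewer's quoted figures.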
There are also some concerns about the appropriateness of the technique:

• The effectiveness of this method (sensitivity to outliers) varies systematically with sample size. Because outliers are considered within each site and without reference to the final model, this adds a difficult-to-quantify bias to the results.

• The outlier detection is conducted within site and caseness, but age and eTIV are not considered - some of the subjects are as young as 2 years old, so presumably they will be outliers for most of the assessed measures? Again, the effectiveness of the outlier identification will vary systematically with key covariates.

• The outlier detection is run repeatedly for each measure; a multivariate approach considering both L and R data could be run just once per measure and would capture unusual LI. Something like a bivariate confidence ellipse or Mahalanobis D2.

I would suggest reporting the results without excluding outliers as the primary analysis, but an alternative would be to use an outlier-robust model such as the robustlmm package, which fits huberised linear mixed effect models for potentially contaminated data. These models down-weight the influence of outliers in a data-driven manner.
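The suggested bivariate screen could look something like the sketch below (a hypothetical illustration with NumPy, not the authors' method). It flags (L, R) pairs by their squared Mahalanobis distance from the bivariate mean, compared against a chi-square threshold, which for 2 degrees of freedom has the closed form -2*ln(1-p):

```python
import numpy as np

def mahalanobis_outliers(left, right, p=0.975):
    """Flag (L, R) pairs whose squared Mahalanobis distance from the
    bivariate mean exceeds the chi-square (2 df) quantile at level p.
    A pair with an unusual L-R relationship (i.e. an extreme LI) is
    flagged even if L and R are each unremarkable on their own."""
    x = np.column_stack([left, right]).astype(float)
    mu = x.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(x, rowvar=False))
    d = x - mu
    d2 = np.einsum("ij,jk,ik->i", d, inv_cov, d)  # per-row d' S^-1 d
    threshold = -2.0 * np.log(1.0 - p)            # chi2(2 df) quantile
    return d2 > threshold
```

With p=0.975 the threshold is ~7.38, so about 2.5% of well-behaved bivariate-normal data would be flagged, in a way that respects the strong L-R correlation.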
Other minor issues:

The authors did not adjust for IQ in their analysis, and carried out a post hoc analysis in the patients only, providing a good justification for this. However, it would be helpful if they also performed the same post hoc analysis in the healthy controls, to explore the relationship between IQ and brain asymmetries in non-pathological conditions.

Terms such as "autistic individuals" should be avoided and replaced by "individuals with autism".
In the Introduction, it is unclear why only prevalence in the USA is reported.
Reviewer #3 (Remarks to the Author):

MS: NCOMMS-19-09029: Altered structural brain asymmetry in autism spectrum disorder: large-scale analysis via the ENIGMA consortium

General Comments: This large-scale collaborative brain structure study investigated structural brain asymmetry among 1774 subjects with autism spectrum disorders (ASD) and 1809 controls recruited from 54 sites (54 datasets), in terms of cortical thickness, surface area, and subcortical volume, using FreeSurfer analysis. The authors used "asymmetry indexes" (AIs), a widely used index in brain asymmetry studies, to estimate structural brain asymmetry. The authors identified significant group (ASD-control) differences in 11 structural AIs: the total hemispheric average thickness AI; eight regional cortical thickness AIs including frontal regions (superior frontal, rostral middle frontal, medial orbitofrontal), temporal regions (fusiform, inferior temporal, superior temporal), and cingulate regions (rostral anterior, isthmus cingulate); one cortical regional surface area AI (the medial orbitofrontal cortex); and one subcortical volume AI (the putamen). These results survived multiple testing correction. The absolute values of the effect sizes, presented as Cohen's d, of these 11 significant results, ranging from 0.09 to 0.16, are very low. In addition, there was a significant sex-by-diagnosis interaction for the rostral anterior cingulate thickness, but no age-by-diagnosis interaction for any of the above 11 AI main findings. Moreover, within the ASD group, IQ was positively associated with the rostral anterior cingulate thickness AI. The strength of this work is that it is the largest-scale brain structural study to examine structural brain asymmetry, combining data from 54 sites using the same analysis protocol, with sophisticated statistical analysis considering the random effect of site (lack of independence of observations within the same site) and multiple testing correction in the group difference analyses.
Nevertheless, I have several major and minor comments, as follows.

Specific Comments:

Abstract
1. The eight regional cortical thickness AIs included the major regions related to the Default-Mode Network (DMN) in the frontal, temporal, and cingulate regions. This could be included in the abstract to make a point of altered AI in the DMN in ASD.

Introduction
1. In addition to the prevalence of ASD in the US, as the ENIGMA consortium collected data from many countries, the ASD prevalence rates in European countries, as well as other countries, are recommended to be included.
2. The authors reviewed group differences in asymmetry in imaging studies regarding resting-state fMRI, DSI, and structural MRI. As sex- and age-by-diagnosis interactions would be tested in this study, and the influence of IQ and autistic severity on the AIs would be investigated too, a concise review of age, sex, IQ and ASD symptom severity is suggested.

Methods
1. The sample:
i. Diagnosis: how many subjects were diagnosed with ASD according to the DSM-IV diagnostic criteria? Among these ASD subjects, please report the number of subjects in each of the diagnostic subgroups (Autistic Disorder, Asperger's Syndrome, Childhood Disintegrative Disorder, and Pervasive Developmental Disorder - Not Otherwise Specified (PDD-NOS)). I am interested in knowing the number of subjects with Childhood Disintegrative Disorder. What instruments were used to make a diagnosis of ASD? Clinical diagnosis? ADI-R? ADOS? ASD diagnosis based on clinical assessment by board-certified psychiatrists or other recognized clinicians is acceptable. I am interested in knowing how it was ensured that the controls did not have a diagnosis of ASD.
ii. Age range: In such a big dataset, a wide-ranging age distribution is anticipated. Although the subjects with ages of 1.5 years were excluded from the analysis (how many subjects with such a young age?), ages of 2 to 4 were still included in the FreeSurfer analysis. Brain structure analysis of preschoolers (ages 2-6) using FreeSurfer is very challenging and may create invalid data. Since this work has the largest dataset, including only ASD and control subjects aged 6 and older is suggested.
iii. IQ: The IQ range is wide for both the ASD and control groups. How many are under 70 for the ASD and control groups? For controls with IQ under 70, was any diagnosis of developmental disorders assessed?
iv. Comorbidity: Only one-third of ASD subjects were assessed for comorbid conditions. Among them, less than 10% showed at least one comorbid condition (ADHD, OCD, depression, anxiety, and/or TS). The rate is quite low compared to clinical samples, or even epidemiological samples. I wonder whether ADHD and the other psychiatric disorders were excluded in the recruitment procedure. ADHD is quite common in individuals with ASD, and it is hard to add ADHD as one of the exclusion criteria for ASD imaging studies. What about psychiatric comorbid conditions in the controls?
2. Structural MRI scans, data acquisition, and analysis:
i. The data combined 54 image datasets acquired using different field strengths (1.5T or 3T) and different cross-site scanners (different sequences and parameters). I read their protocols and do not find how the research team dealt with different scanners and field strengths. This is a basic issue that needs to be solved before data analysis of such large-scale data from 54 sites.
ii. The authors could conduct some sensitivity analyses by randomly taking away some datasets and re-running the analyses to see whether the results are similar.
iii. Regarding quality control, more details of the procedures and criteria taken to deal with quality control are needed. Did they take care of the motion issue?
Please refer to the references, e.g., https://www.ncbi.nlm.nih.gov/pubmed/23707591

Renormalization of the imaging data before running FreeSurfer is needed, rather than combining the raw output data of FreeSurfer from each site. Please refer to the references, e.g.:
https://ieeexplore.ieee.org/abstract/document/4141193/metrics#metrics
https://www.sciencedirect.com/science/article/pii/S1053811907007604
iv. Page 10, the definition of outliers as the values above and below 1.5 times the interquartile range. Is there any reference for using this criterion for outliers? Since 54 datasets were included in the analysis, did the authors find which sites, scanners, or processing steps were associated with significantly higher rates of outliers? The authors have conducted the analyses with and without removing outliers. Were the results the same, regardless of removing outliers or not?
3. Statistical Analysis:
i. The authors provide a detailed description of the statistical models used in the data analyses. To address the lack of independence within the same site, the Linear Mixed Model was used in all the analyses. The detailed modeling equation may not be necessary if there is a limited word count. I think the readers should be familiar with the effect size presented by Cohen's d. The equation of Cohen's d may not be necessary.
However, the interpretation of the magnitude of the effect size should be included (small, medium, and large effect sizes as Cohen's d of 0.2 to 0.5, 0.5 to 0.8, and ≥ 0.8, respectively; Cohen, 1988). Cohen J (1988), Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
ii. Why did the authors only correlate IQ with the AI results in the ASD group? If the authors want to test these correlations, the analyses should be conducted for both groups separately.
iii. For ADOS correlations with AI measures, did the authors check for any significant difference in the ADOS scores across sites? How many subjects have ADOS assessments? If there were significant cross-site differences in ADOS scores, the R package used for normalization, which is based on ranking, may not be able to normalize the ADOS scores across so many sites. If there were no significant differences, the authors' efforts may be OK.
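For context, the textbook two-sample Cohen's d and the conventional Cohen (1988) magnitude labels can be sketched as follows (illustrative only; the paper's d values come from its mixed models, not necessarily this pooled-SD formula):

```python
import statistics as stats

def cohens_d(group1, group2):
    """Textbook two-sample Cohen's d with a pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    v1, v2 = stats.variance(group1), stats.variance(group2)
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (stats.mean(group1) - stats.mean(group2)) / pooled_sd

def label(d):
    """Cohen (1988) conventions: small from 0.2, medium from 0.5, large from 0.8."""
    d = abs(d)
    if d >= 0.8:
        return "large"
    if d >= 0.5:
        return "medium"
    if d >= 0.2:
        return "small"
    return "negligible"
```

By these conventions, the study's largest case-control effect (d = -0.13) falls below even the "small" threshold, which is the point of the reviewer's comment.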
Results:
1. Table 1 and Figure 1 are clear to the readers.
2. Page 19: There were no statistically significant sex- and age-by-diagnosis interactions, except the sex-by-diagnosis interaction for the rostral anterior cingulate thickness AI. I wonder whether the main effects of diagnosis remained significant after adding the interaction terms in the models.
3. Page 20: The authors need to mention that the correlation of the ADOS severity score is not corrected for multiple testing.

Discussion
1. The eight regional cortical thickness AIs included the major regions related to the Default-Mode Network (DMN) in the frontal, temporal, and cingulate regions. The findings are important regarding the role of altered brain asymmetry in the DMN in individuals with ASD. A review and discussion of the DMN in ASD are suggested.
2. The effect sizes are very low. Why did the authors estimate the power using a very low effect size? Given such a large sample, very low effect sizes may still generate significant results with a small p-value.
3. Despite the strength of the largest sample size on this topic, the heterogeneity of subjects with ASD (and controls, too), MRI field strengths, scanners, parameters, and sequences needs to be managed carefully using state-of-the-art methodology.
4. Page 22, 1st paragraph: The main results need to be discussed.
5. Page 22, 3rd paragraph: The finding regarding the fusiform cortex contradicts the previous study by Dougherty et al. (2016) and needs interpretation and discussion.
6. Page 23, Lines 442-443: At least one reference is needed for this sentence.
7. Page 23, Line 447: "…due to limited statistical power in the earlier studies, which may have resulted in false positive findings." It should read as "negative" findings.
8. Line 463: how should "lower-performing males" be interpreted? Please do not just repeat the result; interpretation and discussion of the findings are needed.
9. Line 464: the only finding regarding the ADOS correlation may be just by chance, because of the uncorrected p-value. Generally speaking, the AI results were not correlated with autistic symptoms.
10. Line 471: the comorbidity rates are low.
11. Line 475: Why did the authors not test the effect of medication use?
12. For neuropsychological and brain imaging studies, IQ needs to be controlled in the models.

Reviewer #1 (Remarks to the Author):
In this study Postema et al. used a large multisite sample of MR images from people with and without ASD to assess effects of cortical asymmetry. Data from ~1800 individuals aged 2-65 with and without ASD were assembled for the analyses. FreeSurfer was used to obtain cortical thickness and surface area measures from anatomical regions in the Desikan atlas and entered into linear mixed models that accounted for site-related clustering. Significant asymmetry differences in ASD relative to the control sample were found for thickness overall and for several regions for thickness, with asymmetry generally reduced in ASD, and only one region for surface area. In subcortical regions examined, only the putamen showed a significant difference in volume asymmetry. Importantly the authors report effect sizes which were relatively small, and demonstrate that they have sufficient power to detect effects in this range but that most previous work in this area has been underpowered. This is an interesting and very clearly written paper. While the work is important, the authors acknowledge that the clinical utility of the findings may be limited given the small effect sizes. Nonetheless doing this work to establish that benchmark is important and commendable.
Authors: Thank you very much for these supportive comments.

2) Line 69 - would add 'potentially' before 'due to'

Authors: Done
3) Given the large sample size, I was curious why age- and sex-specific effects or interactions were not modeled, but only followed up post hoc in areas showing a diagnostic effect, particularly given the wide age range?
Authors: We have now added this to the paper as a secondary analysis for all AIs, pages 15 (Methods section), 20 (Results section), and Tables S9-S14 (interaction term and stratification results for all AIs). Briefly, the addition of diag:age interaction to the primary model did not show any significant interaction effects after FDR correction. There was one significant diag:sex interaction, for the rostral anterior cingulate thickness AI. This was already remarked on in the paper, since this region showed an effect of diagnosis in the primary analysis (and had therefore already been subject to post-hoc analysis).
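The FDR correction referred to throughout the responses is, in its generic form, the Benjamini-Hochberg step-up procedure; the sketch below is a minimal illustration of that procedure, not the authors' actual implementation:

```python
def benjamini_hochberg(pvals, q=0.05):
    """Return booleans marking which p-values are rejected under
    Benjamini-Hochberg FDR control at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k with p_(k) <= (k/m) * q ...
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k_max = rank
    # ... and reject every hypothesis up to that rank.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject
```

Note the step-up character: a p-value may be rejected even though it exceeds its own per-rank threshold, as long as some larger p-value passes its threshold.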

4)
Page 12 line 217 - confusing in the same section to use n1 and n2 twice to mean different things

Authors: We have changed the second 'n1' and 'n2' to 'x1' and 'x2'.

5)
In the age-specific effects follow-up I was confused as to why age was coded as a binary variable? More generally, I wonder about the utility of including the full age range given the sparse sampling, particularly at older ages.
Authors: We have now coded age as a continuous variable in all models where it is included as a covariate, to be consistent with the primary analysis (page 15). We have also added a sensitivity analysis in which we excluded any individuals aged over 40 years; the pattern of significant results remains the same as in the primary analysis (pages 14 (Methods), 18 (Results), and Tables S5-S7).

6) Figure 1: color bar should be labeled and the *s are a bit clumsy - it's hard to actually see what they are meant to refer to on the surface.

8) Age and IQ were examined separately -perhaps a missed opportunity to understand the effect in the rostral ACC?
Authors: We have now noted (page 21) that none of the interaction terms age*IQ, sex*IQ or age*sex*IQ were significant (all P>0.05) when included in the case-only analysis model for this regional AI.

9) Given the small effect size reported, I wondered about differences in sensitivity to small cortical features in the 1.5T vs. 3T acquired data. Maybe worth running analyses on the subset of 3T acquired data?

Authors: We have now run the analyses on the subset of 3T-acquired data (Tables S5-S7). There were slight changes in significance, such that two of the diagnosis effects from the primary analysis (i.e., the inferior temporal and isthmus cingulate thickness AIs) were no longer significant after false discovery rate correction, but three other effects became significant (i.e., the superior temporal thickness AI, fusiform surface area AI, and caudate nucleus AI). However, as the sample drops from 1778 cases and 1829 controls in the primary analysis to 1467 cases and 1574 controls in the 3T-only analysis, slight changes in significance levels are expected, and do not necessarily indicate systematic differences between 3T and 1.5T data.

Reviewer #2 (Remarks to the Author):
This is another interesting and important study from the ENIGMA consortium. The main strength of the paper is the large sample size and the use of a known and standardised pipeline. While the question is not particularly novel, the sample size allows a more definitive answer on the presence of brain asymmetry in autism.
There are some major issues to consider with the methodological approach used here.
The authors have removed outliers at various stages in their analyses. Complete removal of outliers is, in my experience, not typically encouraged by statisticians and where it does happen it would be more typical to present the outlier-removed results as a secondary/sensitivity analysis.
It is not clear what the authors did with regards to outlier removal and why. They wrote: "outliers were defined per group (cases/controls) and per dataset as those values above and below 1.5 times the interquartile range, and were excluded". It would be helpful if they clarified:

• Why this approach was necessary, in more detail. Why are outliers particularly a concern for this type of analysis? Did the authors consider a less extreme treatment of outliers, such as Winsorisation? The authors state how many datapoints were removed, but they do not give the important context of what fraction of the observations this was and whether this varies by site/caseness.

• What did they actually do? In this context the interquartile range must be taken relative to something (e.g. the mean, median, or upper/lower quartiles). The latter option would be Tukey's fences method (the basis for the whisker end points in a Tukey boxplot); this method is common, but not universal. This has implications for evaluating the technique. If it is Tukey's method, ~0.7% of normal data would be excluded erroneously. If it is the sample median +/- 1.5*IQR, then ~4% of perfectly normally distributed data would be excluded. Because the authors repeat the process 3 times (L, R and LI), the expected rate of outliers in perfectly normal data is 1 - (1 - Rate)^3, i.e. ~2% and ~11% respectively.
• Was outlier detection calculated for each combination of group and dataset, or once looking for outliers within each group (pooling datasets) and once looking for outliers within each dataset (pooling groups)?
There are also some concerns about the appropriateness of the technique:

• The effectiveness of this method (sensitivity to outliers) varies systematically with sample size. Because outliers are considered within each site and without reference to the final model, this adds a difficult-to-quantify bias to the results.

• The outlier detection is conducted within site and caseness, but age and eTIV are not considered - some of the subjects are as young as 2 years old, so presumably they will be outliers for most of the assessed measures? Again, the effectiveness of the outlier identification will vary systematically with key covariates.

• The outlier detection is run repeatedly for each measure; a multivariate approach considering both L and R data could be run just once per measure and would capture unusual LI. Something like a bivariate confidence ellipse or Mahalanobis D2.

I would suggest reporting the results without excluding outliers as the primary analysis, but an alternative would be to use an outlier-robust model such as the robustlmm package, which fits huberised linear mixed effect models for potentially contaminated data. These models down-weight the influence of outliers in a data-driven manner.
Authors: Many thanks to the reviewer for this advice. We have now handled outliers as recommended by the reviewer, i.e. no outlier exclusion for the primary analysis, with the asymmetry indexes winsorised for a sensitivity analysis. We used a threshold of k=3 for winsorisation (i.e., the two highest/lowest values were brought back to the value of the 3rd highest/lowest value). This threshold was chosen after visual inspection of outlier patterns. Therefore no datapoints have been excluded in the revised version of the paper. The results remain largely as before - see changes to pages 16, 17 (Results) and the results tables (Table 1, Tables S2-S4). In addition, the significant effects from the new primary analysis are the same ones that are significant in the winsorised analysis, except that the effect of diagnosis on the medial orbitofrontal surface area AI did not survive multiple testing correction in the winsorised analysis (although the P value barely changed) (Table S6). Note that the asymmetry index (L-R)/(L+R) does not necessarily scale with L and R due to its denominator (now noted on page 11), so that larger or smaller brains are not necessarily more likely to be outliers for this asymmetry index. In any case, we have also added a sensitivity analysis excluding the subjects aged less than 6 years (in response to a comment from reviewer 3), and the results are largely unchanged by this (see below in response to reviewer 3, and Tables S5-S7). Additionally, excluding individuals over 40 years old did not change the pattern of results (see response to reviewer 1 above).
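The k=3 winsorisation described here (the two most extreme values in each tail pulled back to the 3rd-most-extreme value) and the asymmetry index itself can be sketched as follows; this is a hypothetical illustration of the rule as stated, not the authors' code:

```python
import numpy as np

def asymmetry_index(left, right):
    """AI = (L - R) / (L + R), computed per paired measure."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    return (left - right) / (left + right)

def winsorise_k(values, k=3):
    """Pull the k-1 most extreme values in each tail back to the k-th
    order statistic: with k=3, the two highest values become the
    third-highest, and likewise at the low end.
    Assumes len(values) >= 2*k - 1; order of elements is preserved."""
    a = np.asarray(values, dtype=float).copy()
    order = np.argsort(a)
    a[order[:k - 1]] = a[order[k - 1]]     # low tail
    a[order[-(k - 1):]] = a[order[-k]]     # high tail
    return a
```

Unlike exclusion, this keeps every subject in the model while capping the leverage of the most extreme AI values.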

Other minor issues:
The authors did not adjust for IQ in their analysis, and carried out a post hoc analysis in the patients only, providing a good justification for this. However, it would be helpful if they also performed the same post hoc analysis in the healthy controls, to explore the relationship between IQ and brain asymmetries in non-pathological conditions.

Authors: We have now added this post hoc analysis in controls too (pages 15, 21 and Table S15). In the patients, only the rostral ACC thickness asymmetry shows an association with IQ, whereas in the controls this asymmetry shows no association with IQ, but a different asymmetry shows an association with IQ (superior frontal thickness). This perhaps underlines the uncertain nature of effects arising from these exploratory, secondary analyses of the data, where FDR correction was not applied. Alternatively, brain-IQ associations may be different in cases and controls. Both possibilities are now acknowledged in the paper (Discussion, page 26).
Terms such as "autistic individuals" should be avoided and replaced by "individuals with autism".

Authors: We have changed to 'individuals with autism'.
In the Introduction, it is unclear why only prevalence in the USA is reported.

Reviewer #3 (Remarks to the Author):
MS: NCOMMS-19-09029: Altered structural brain asymmetry in autism spectrum disorder: large-scale analysis via the ENIGMA consortium

General Comments: This large-scale collaborative brain structure study investigated structural brain asymmetry among 1774 subjects with autism spectrum disorders (ASD) and 1809 controls recruited from 54 sites (54 datasets), in terms of cortical thickness, surface area, and subcortical volume, using FreeSurfer analysis. The authors used "asymmetry indexes" (AIs), a widely used index in brain asymmetry studies, to estimate structural brain asymmetry. The authors identified significant group (ASD-control) differences in 11 structural AIs: the total hemispheric average thickness AI; eight regional cortical thickness AIs including frontal regions (superior frontal, rostral middle frontal, medial orbitofrontal), temporal regions (fusiform, inferior temporal, superior temporal), and cingulate regions (rostral anterior, isthmus cingulate); one cortical regional surface area AI (the medial orbitofrontal cortex); and one subcortical volume AI (the putamen). These results survived multiple testing correction. The absolute values of the effect sizes, presented as Cohen's d, of these 11 significant results, ranging from 0.09 to 0.16, are very low. In addition, there was a significant sex-by-diagnosis interaction for the rostral anterior cingulate thickness, but no age-by-diagnosis interaction for any of the above 11 AI main findings. Moreover, within the ASD group, IQ was positively associated with the rostral anterior cingulate thickness AI. The strength of this work is that it is the largest-scale brain structural study to examine structural brain asymmetry, combining data from 54 sites using the same analysis protocol, with sophisticated statistical analysis considering the random effect of site (lack of independence of observations within the same site) and multiple testing correction in the group difference analyses. Nevertheless, I have several major and minor comments, as follows.

Introduction 1.
In addition to the prevalence of ASD in the US, as the ENIGMA consortium collected data from many countries, the ASD prevalence rates in European countries, as well as other countries, are recommended to be included. 2. The authors reviewed group differences in asymmetry in imaging studies regarding resting-state fMRI, DSI, and structural MRI. As sex- and age-by-diagnosis interactions would be tested in this study, and the influence of IQ and autistic severity on the AIs would be investigated too, a concise review of age, sex, IQ and ASD symptom severity is suggested.

Authors:
We have now briefly mentioned some further information on these aspects in the last paragraph of the Introduction, while being mindful of the overall length of the article (page 8). For this purpose, the following articles were cited (the last of these goes into some detail on these issues, and is very recent):

The subtyping of ASD proved to be very unreliable across sites, since even top expert centers adhered to different diagnostic algorithms. Reliability, however, was very good to excellent at the level of the overall ASD category. Also, previous data are very inconsistent as to specific neurobiological underpinnings of the older clinical subtypes of ASD. For these reasons, we have not collated information on DSM-IV subtyping across the datasets. We have now explained this (page 9) and cited the following paper: https://www.annualreviews.org/doi/full/10.1146/annurev-clinpsy-032814-112745

However, all of our datasets are in principle focused on Autistic Disorder, not any of the older subgroups. ADOS was available for 878 of the cases, which permitted analysis of severity scores in relation to brain asymmetry; this information and analysis are included in the manuscript (see further below). We do not have data from other instruments in sufficient numbers to contribute to the present study.
Inevitably there was not a homogenous assessment/recruitment process for controls across these many legacy datasets, but the overwhelming majority (see point 1.iii below) were typically developing/healthy, and no controls showed features that might have met a diagnosis of ASD. All of these aspects have now been made explicit in the Methods (pages 9,10).
ii. Age range: In such a big dataset, a wide-ranging age distribution is anticipated. Although subjects aged 1.5 years were excluded from the analysis (how many subjects were of such a young age?), ages 2 to 4 were still included in the FreeSurfer analysis. Brain structure analysis of preschoolers (ages 2-6) using FreeSurfer is very challenging and may produce invalid data. Since this work has the largest dataset, restricting inclusion to ASD and control subjects aged 6 and older is suggested.
Authors: In our primary analysis we have retained all subjects, but we have added a new sensitivity analysis confirming that the results are largely unchanged after removing all subjects aged below 6 years (113 cases and 64 controls were removed for this) (page 14 and Tables S5-S7). Briefly, when conducting our analysis on individuals aged 6 years and older, as compared to the total sample, one new association surpassed the multiple testing correction threshold (fusiform surface area AI), and one previously significant association dropped below the threshold (isthmus cingulate thickness AI), but again the actual P values changed only slightly. These subtle changes do not necessarily indicate that exclusion of younger ages improved the signal-to-noise ratio in the data. We have also cited a study (

Authors:
We have now stated in the paper (pages 9,10) that there were 85 subjects with IQ below 70, of whom 19 were controls. In these subjects, the exclusion of an ASD diagnosis was performed by a senior child psychiatrist. Eighteen of these controls came from the FSM dataset and were diagnosed with ID.
iv. Comorbidity: Only one-third of ASD subjects were assessed for comorbid conditions. Among them, less than 10% showed at least one comorbid condition (ADHD, OCD, depression, anxiety, and/or TS). This rate is quite low compared to clinical, or even epidemiological, samples. I wonder whether ADHD and the other psychiatric disorders were excluded in the recruitment procedure. ADHD is quite common in individuals with ASD, and it is hard to justify ADHD as an exclusion criterion for ASD imaging studies. What about psychiatric comorbid conditions in the controls?
Authors: We now acknowledge the relative lack of information on comorbid conditions as a limitation in the Discussion (page 26). As mentioned above, the 54 datasets were collected over the last 25 years in many countries, as separate studies, according to various recruitment and selection criteria. For example, until DSM-5 came into use, a comorbid diagnosis of ADHD and ASD was not even possible. Nonetheless, we expect that the 54 datasets capture the research-clinical ASD population well, even if comorbidities were not recorded in many of them.
There was not a homogenous assessment/recruitment process for controls across these datasets, but the overwhelming majority (see point 1.iii above) were typically developing/healthy at the time of MRI, and no controls showed features that might have met a diagnosis of ASD (now made clear in the Methods).
2. Structural MRI scans, data acquisition, and analysis:

i. The data combined 54 image datasets acquired using different field strengths (1.5 T or 3 T) and different scanners across sites (different sequences and parameters). I read their protocols and do not find how the research team dealt with the different scanners and field strengths. This is a basic issue that needs to be solved before analysis of such large-scale data from 54 sites.
ii. The authors can conduct some sensitivity analyses by randomly removing some datasets and re-running the analyses to see whether the results are similar.
Authors: We indeed found in a previous meta-analysis of 99 healthy control or general population datasets (Kong et al., PNAS 2018) that there was heterogeneity between datasets for the same regional asymmetry indexes as used in the present study. Accordingly, for our analyses in the present study of ASD, 'Dataset' was included as a random variable (i.e. each dataset received a random intercept, thus levelling all datasets when modelling the effects of diagnosis and other covariates) to account for differences between datasets, which can include demographic, clinical and technical heterogeneity. We have made clearer in the revised version that the analysis was based on a 'mixed-effects random-intercept model' (page 11).

The primary purpose of our study, based on 54 datasets that were originally collected as separate studies, was to assess the total combined evidence for effects over all of the available datasets, while explicitly allowing for heterogeneity between datasets through the use of random intercepts, and finally adding sensitivity and secondary analyses with respect to relevant variables. We now include five different sensitivity analyses: in 3T-only data (see response to reviewer 1, above); removing the youngest subjects (see further above); removing the oldest subjects (above); including a non-linear age adjustment (this was already in the paper); and analysis with winsorization of outlier datapoints (see above); as well as secondary analyses of age:diagnosis interactions, sex:diagnosis interactions, IQ, ADOS scores, and medication.

The heterogeneity mentioned by the reviewer is a feature of the current field. That is, as long as researchers publish many separate papers based on single datasets collected in particular ways, the field overall has the same problem. It can be considered a strength of our study that our 54 datasets capture this real-world heterogeneity, while the results represent the total combined evidence in these datasets.
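The mixed-effects random-intercept approach described here can be sketched as follows. This is a hypothetical illustration on simulated data using the statsmodels `MixedLM` formula API; the column names (`AI`, `dx`, `dataset`) are placeholders, not the authors' actual variable names or full covariate set:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for the real data: one row per subject, with a
# regional asymmetry index 'AI', diagnosis 'dx' (0 = control, 1 = ASD),
# covariates, and 'dataset' labelling the originating site.
rng = np.random.default_rng(42)
n = 600
df = pd.DataFrame({
    "dx": rng.integers(0, 2, n),
    "age": rng.uniform(2.0, 65.0, n),
    "sex": rng.integers(0, 2, n),
    "dataset": [f"site{i}" for i in rng.integers(0, 12, n)],
})
# Site-specific offsets mimic between-dataset heterogeneity; a small
# negative dx effect mimics reduced asymmetry in cases.
site_offset = df["dataset"].str[4:].astype(int) * 0.01
df["AI"] = 0.05 - 0.01 * df["dx"] + site_offset + rng.normal(0, 0.05, n)

# Each dataset gets its own random intercept, 'levelling' the sites,
# while the fixed effect of 'dx' estimates the overall case-control
# difference across all datasets combined.
model = smf.mixedlm("AI ~ dx + age + sex", data=df, groups=df["dataset"])
result = model.fit()
print(result.params["dx"])  # estimated diagnosis effect on the AI
```

The design choice is that site is treated as a nuisance source of variance rather than a fixed covariate, which scales naturally to 54 datasets and lets the fixed-effect estimate represent the combined evidence across all of them.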
We have expanded on these points in the Discussion (page 28).

iii. Regarding quality control, more details of the procedures and criteria used are needed. Did the authors take care of the motion issue? Please refer to references such as https://www.ncbi.nlm.nih.gov/pubmed/23707591. Renormalization of the imaging data before running FreeSurfer is needed, rather than combining the raw FreeSurfer output from each site. Please refer to, e.g., https://ieeexplore.ieee.org/abstract/document/4141193/metrics#metrics