A normative modelling approach reveals age-atypical cortical thickness in a subgroup of males with autism spectrum disorder

Understanding heterogeneity is an important goal on the path to precision medicine for autism spectrum disorders (ASD). We examined how cortical thickness (CT) in ASD can be parameterized as an individualized metric of atypicality relative to typically-developing (TD) age-related norms. Across a large sample (n = 870 per group) and wide age range (5–40 years), we applied normative modelling, resulting in individualized whole-brain maps of age-related CT atypicality in ASD and isolating a small subgroup with highly age-atypical CT. Age-normed CT scores also highlight on-average differentiation, and associations with behavioural symptomatology, that are separate from insights gleaned from traditional case-control approaches. This work showcases an individualized approach for understanding ASD heterogeneity that could potentially further prioritize work on a subset of individuals with cortical pathophysiology represented in age-related CT atypicality. Only a small subset of ASD individuals are actually highly atypical relative to age-norms, driving small on-average case-control differences.

1. I understand that authors have done QC steps, and put Euler index as a covariate in the model. Nonetheless, I still have concerns that the current results might be impacted by inter-individual variations in in-scanner motion, because individuals with extreme age-deviancy in atypical cortical thickness happened to have relatively poor image quality. This critical issue might be further alleviated by the following attempts: 1) test whether there is a correlation between Euler index and w-scores; 2) compare Euler index between autistic individuals with higher age-normed deviancy and those within the age norms of CT; 3) further exclude an additional 10% of participants with extreme Euler index from the remaining samples going through the QC step, then see whether the major pattern still exists. If additional analyses based on either 1) or 2) yield significant findings, or the most remarkable pattern is not preserved after the practice of 3), then the main conclusions and inferences of the present findings should be considered problematic insofar as the in-scanner motion is taken into account.
2. Following on from point 1: in paragraph 3 of the Discussion, the authors noted the limitation of a lack of phenotypic data of the subgroup with extreme age-deviancy, and seemed to explain that this may be related to heterogeneous mechanisms of neurogenesis and development. Nonetheless, it was explicitly described in the Results that the T1 data of individuals with extreme deviancy in age-atypical cortical thickness showed relatively poor image quality. I think the likely influence of in-scanner motion on this result should be explicitly acknowledged and discussed here.
3. The authors have explained the rationale of using LOESS estimation for constructing the normative model, which I understand. I have concerns about the model fitting issue despite the strengths of LOESS estimation. The authors may want to provide the information to evaluate the goodness of fit to endorse the selection of LOESS estimation.
Reviewer #2 (Remarks to the Author): The manuscript examines the possibility of using structural brain features (mainly cortical thickness but also cortical surface, volume as well as gyrification) in relation to age-related norms to try to better characterize ASD neural heterogeneity. I am on the fence regarding this article. My general impression is that the idea is interesting, in principle, but results appear unconvincing to me. I may be convinced otherwise and remain open to a rebuttal, with supporting material/analyses, from authors. Here are the main reasons for my ambivalence: 1) The authors report that the proportion of outliers across all brain regions is 7.6% instead of the expected 4.55%. So, one could view this as meaning that the autism spectrum disorder group has a greater variance in cortical thickness than typically developing controls. If this is the case, that would indeed be an interesting finding. However, it's not entirely clear to me that this is not simply driven by poor surface extraction due to subtle but systematic movement artifacts. See point 2.
2) It is not entirely clear to me that the Euler index does a good enough job at eliminating systematic subtle artifacts due to movement when comparing two groups with one known to perhaps move more than the other. This is even more concerning here because the peak in outlier 'prevalence' happens to be in children, and children are known to move more. Is it possible that FreeSurfer is particularly sensitive to movement artifacts in some regions more than in others? Can this have driven some of the results? To further mitigate problematic surface extraction effects, the authors included the Euler index as a covariate. While this is likely to help, it implicitly assumes a simple linear effect of the Euler index on cortical thickness. Even if the Euler index as a covariate had a simple linear association with thickness, it is not clear that other reconstruction issues would be dealt with. On the other hand, movement typically leads to artifactually thinner cortices rather than thicker ones, and here, after outlier removal and statistical thresholding, the autism outlier group appears to have some regions with a thicker cortex and no regions with a thinner cortex. This is partially reassuring and soothes my concern a little but not entirely. Now, I am not saying, of course, that no autism versus typical controls imaging studies should be conducted but that in the wake of so many false positive and unreproducible findings in the imaging literature, I feel that extra care is needed here. My QC concerns would be satisfied if authors implemented a more thorough QC by, perhaps, focusing on the regions that came out as statistically significant and by addressing a QC caveat in the discussion.
3) The analysis has been restricted, for the autism spectrum disorder (ASD) group, to those matched in IQ with typically developing individuals. This leads to a mean IQ of 106 in the ASD group which is higher than the general population mean and not at all representative of autism spectrum disorder where about 50% have IQs below 70 and only about 3% have IQs above 115. While I understand the need to match for IQ given frequently reported associations between IQ and cortical thickness, I fear that this restriction severely undermines the generalizability of findings to the general population of individuals with autism spectrum disorder. I suggest carefully and clearly addressing this in the discussion and perhaps even in changing the title to reflect this fact (perhaps referring to high level functioning individuals with ASD instead of simply autistic males).
Minor issues to consider: I suggest avoiding the use of the term "autistic" (used in the title) as this is no longer used in the DSM-5 and suggest being consistent throughout the manuscript by using autism spectrum disorder or ASD.
Page 5: Authors refer to a "w-score", stating that it is analogous to a z-score. To me, it is exactly a z-score from the equation they provided. Why not call it a regional z-score?
Table 1 has a group labeled as NT (I am assuming they are referring to the typically developing group, for which they use TD for the rest of the manuscript). I suggest TD throughout instead of using NT at times.
Page 8: authors state that 7.6% is much higher than the expected 4.55% but provide no statistical test supporting this statement.
Page 9: Authors write: "There are other interesting attributes about this subset of brain regions. With regard to age, these patients were almost always in the age range of 6-20, and were much less prevalent beyond age 20 (S5)". Shouldn't this be supplementary figure S6 instead of S5?
Regarding S5: Authors write, in their legend, that there are 14 subjects for which the ratio score exceeds 0.5 yet the plots are not about these 14 subjects (this might be confusing to the reader).
Reviewer #3 (Remarks to the Author): The authors present a technique for evaluating cortical thickness changes in individual ASD patients. The approach is in contrast to the current standard group-level cortical thickness analysis paradigm, which is limited to identifying differences between groups.
Specific comments:
1. In the introduction the authors note that biological sex is likely to modulate ASD-related neuroanatomical differences. What is the evidence for this claim?
2. Introduction: the authors discuss heterogeneous findings in previous ASD studies, however they don't discuss the possibility that there may be effectively no morphometric differences between ASD individuals and healthy controls. Some negative studies have been published.
3. The authors refer to brain regions passing FDR correction in the results section. I found this phrasing a little difficult to understand; FDR correction refers to a subset of statistical tests that are deemed statistically significant at a threshold modified for multiple comparisons. I assume the authors mean that some regions have statistically significant differences in cortical thickness between ASD and healthy controls? Consider clarifying. On a related point, since the journal has the results section before the methods it's unclear what sort of spatial scale the authors are referring to when they talk about brain regions: is it CT averaged over cortical parcellations or are they vertex-wise cortical thickness estimates? It might be useful to modify the results section text to make it easier for the reader.
4. The authors note limitations with the Euler index for quantifying image quality. Pardoe et al Neuroimage 2016 "Motion and morphometry in clinical and nonclinical populations" demonstrated that in-scanner head motion, estimated using fMRI scans in ABIDE subjects, was correlated with cortical thickness estimates. It would be helpful to assess average head motion in the participants with abnormal age-related cortical thickness trajectories to make sure that the results aren't driven by in-scanner head motion.
5. Further to this, once the participants with abnormal CT for their age have been identified, there is relatively little further investigation of other factors that may explain why these participants had abnormal CT. This would strengthen the manuscript. Similarly it would be helpful if the authors provided the specific study IDs of the participants they identified with abnormal CT in the supplementary material.
We first thank the reviewers for their thoughtful, encouraging, and constructive feedback. We fully agree that the paper would benefit from additional sensitivity analyses, as seems to also be the consensus across all three reviewers. Thus, in addition to providing a point-by-point reply to each point addressed we would like to first address the common remarks in relation to motion and data quality.
In the original manuscript we included the Euler index as a confounder in our models, yet the reviewers all raise the valid concern that residual data-quality issues may still be present in our downstream analysis. We indeed also made that remark in relation to the individuals with more extreme ratio scores, who also had somewhat poorer image quality in some cases. To assess and subsequently address these issues more formally we conducted a number of additional analyses.
1. We downloaded and processed the resting-state fMRI from all individuals present in ABIDE 1 and ABIDE 2 to obtain individual measures of in-scanner motion, specifically their mean framewise displacement. Framewise displacement was calculated for every EPI volume with the method described by Power and colleagues 1,2 and then the mean was extracted for every individual's scan session. Although fMRI and MPRAGE images were acquired during the same session, we cannot rule out that participants' movement varied between acquisition sequences. Nonetheless, it has been shown that framewise displacement is not only highly consistent across sessions but is also strongly associated with MPRAGE image quality, suggesting that this metric could also be used as a proxy of motion-induced artifacts on structural images 3.
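As an illustration of the motion summary described above, the following is a minimal Python sketch of framewise displacement in the style of Power and colleagues: the sum of absolute volume-to-volume changes in the six realignment parameters, with rotations converted to millimetres on a sphere. The function names, the 50 mm radius default, the parameter ordering, the convention that the first volume has a displacement of zero, and the toy values are assumptions of this sketch and are not taken from the authors' pipeline.

```python
from typing import List, Sequence


def framewise_displacement(params: Sequence[Sequence[float]],
                           radius_mm: float = 50.0) -> List[float]:
    """Framewise displacement per volume.

    params: one row per volume, [tx, ty, tz, rx, ry, rz], with translations
    in mm and rotations in radians (assumed ordering and units).
    """
    fd = [0.0]  # the first volume has no predecessor
    for prev, curr in zip(params, params[1:]):
        # Absolute change in the three translations (mm).
        trans = sum(abs(c - p) for p, c in zip(prev[:3], curr[:3]))
        # Absolute change in the three rotations, as arc length on the sphere.
        rot = sum(abs(c - p) * radius_mm for p, c in zip(prev[3:], curr[3:]))
        fd.append(trans + rot)
    return fd


def mean_fd(params: Sequence[Sequence[float]], radius_mm: float = 50.0) -> float:
    """Mean framewise displacement over one scan session."""
    fd = framewise_displacement(params, radius_mm)
    return sum(fd) / len(fd)


# Hypothetical realignment parameters for three volumes.
motion = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.2, 0.0, 0.002, 0.0, 0.0],
]
print(framewise_displacement(motion))
print(mean_fd(motion))
```

A per-subject value of this kind is what would then enter the models as a confound regressor or be used to rank individuals for exclusion.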
For 35 individuals no resting-state data were available and a further 15 individuals had imaging quality too poor to reliably assess in-scanner head motion. In the overlapping sample we assessed case-control differences in mean framewise displacement, the correlation between the computed ratio scores and the confound variables, and the regional correlation between w-scores and both mean framewise displacement and Euler indices.
As can be expected there were indeed case-control effects of head-motion in the resting-state data (A). There was no significant correlation between the Euler indices and the absolute and negative w-ratio scores, but there was a small significant correlation with the positive ratio scores (B). In addition, we indeed find small regional (mostly negative) correlations with the w-score (ranging from -0.18 to 0.14) (C) 4. We already included Euler indices as a confounder in any downstream analyses from here (e.g. one-sample test of w-scores, case-control differences in CT etc.). However, the reviewers are correct in noting that additional sensitivity analyses would help ensure our findings were not driven by any confounding variables and we thus followed this assessment by testing these effects more formally.
2. First we performed 5-fold cross validation on both mean framewise displacement as well as on the Euler index. In each case separately we systematically removed the top 5% of individuals (i.e. highest motion or highest Euler index) in steps up to 25% of removed data and recomputed the Cohen's d values across all regions for our initial case-control analysis. We then computed the spatial correlation across regions for each fold. We find that for both motion and Euler index the resulting Cohen's d maps were highly consistent (lowest r = 0.7 with 75% of individuals retained) and only decreased linearly with sample size (D). The new figure below shows the spatial correlation in the resulting Cohen's d maps. The upper triangle shows the validation for iterative exclusion of high motion individuals up to 25% (fold 5), the lower triangle shows the same for iterative exclusion of high Euler index individuals up to 25% (fold 5).

Panel D shows the spatial correlation between Cohen's d maps from analyses where subjects with either high motion (upper triangle) or high Euler indices (lower triangle) were iteratively excluded. The fold refers to the cohorts of exclusion, ranging from 1 = 5% excluded to 5 = 25% excluded.
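The exclusion analysis summarised in Panel D can be sketched roughly as follows. This is a simplified illustration on synthetic data, not the analysis code: it computes a regional Cohen's d map, removes the highest-motion individuals in 5% steps, and correlates each reduced-sample map with the full-sample map (whereas Panel D reports correlations between folds). The data layout, the helper names and the simulated effect sizes are all hypothetical.

```python
import random
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sqrt(sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y))
    return num / den


def d_map(subjects, n_regions):
    """Case-control Cohen's d for every region."""
    asd = [s for s in subjects if s["group"] == "ASD"]
    td = [s for s in subjects if s["group"] == "TD"]
    return [cohens_d([s["ct"][r] for s in asd], [s["ct"][r] for s in td])
            for r in range(n_regions)]


def drop_top(subjects, key, fraction):
    """Remove the given fraction of subjects with the highest value of `key`."""
    keep = len(subjects) - int(round(fraction * len(subjects)))
    return sorted(subjects, key=lambda s: s[key])[:keep]


# Synthetic cohort: 400 subjects, 20 regions, hypothetical regional effects.
rng = random.Random(0)
n_regions = 20
effects = [rng.uniform(-0.5, 0.5) for _ in range(n_regions)]
subjects = []
for i in range(400):
    group = "ASD" if i % 2 == 0 else "TD"
    shift = 1.0 if group == "ASD" else 0.0
    subjects.append({
        "group": group,
        "motion": rng.expovariate(5.0),  # stand-in for mean framewise displacement
        "ct": [2.5 + shift * e * 0.2 + rng.gauss(0, 0.2) for e in effects],
    })

reference = d_map(subjects, n_regions)
fold_correlations = []
for fold in range(1, 6):  # fold 1 = 5% excluded, fold 5 = 25% excluded
    reduced = drop_top(subjects, "motion", 0.05 * fold)
    fold_correlations.append(pearson(reference, d_map(reduced, n_regions)))
print(fold_correlations)
```

The same loop could be run with an Euler-index field in place of `motion` to mirror the lower triangle of the panel.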
3. Subsequently, we also re-ran all our analyses including motion as a confounder and again quantified the spatial correlation of the resulting maps with those from the original model that did not include motion. We find that for both case-control differences as well as the one-sample test these correlations were close to 1 (E-F).
4. Then, in order to quantify any residual effects of motion on the whole-brain ratio scores we ran correlation analyses on all three of the ratios with mean framewise displacement in our overlapping sample. Here we find that there are some small residual correlations between FD and ratio scores (all BF < 10, max r = 0.16) (G-I), but no clear evidence that the individuals identified as having an extreme ratio (>0.5) were disproportionately overlapping with individuals that also exhibit higher in-scanner head motion.
Panel G shows the small relation between the absolute w-score and mean framewise displacement (r = 0.15, p < .001). Panels H and I show the same for the positive (r = 0.09, p < .05, BF = 0.13) and negative (r = 0.16, p < .001, BF = -6.67) ratios respectively.
5. Although from these analyses it did not appear to be the case that the top motion individuals were also the individuals we classified as statistical outliers, we followed this up more formally. Specifically, we removed the top 5% of motion individuals and recomputed the ratio scores, assessed the residual correlation between motion and absolute ratios and recomputed the spatial prevalence (J-L). This resulted in two of the 14 original outlier subjects no longer being in the sample.
7. Finally, all analyses now include mean framewise displacement as a confound regressor. These comparisons against the original data are included in a new supplement (and given the resolution the whole figure has been made available online as a high-resolution figure). Including motion as a confound variable did reduce the effect enough in two regions for them to no longer pass FDR correction, though the overall spatial pattern was highly similar. Interestingly, the inclusion of motion had a much stronger impact on the conventional case-control analysis (i.e. where w-score outliers are still included) compared to the analysis excluding the small proportion of normative outliers. In the conventional model the number of regions passing FDR correction dropped from 38 to 27 with the inclusion of motion, though again the overall pattern was highly similar. Previous figure 2 has been updated accordingly, as has the results section describing the outcome of this analysis.
"Our first analysis examined conventional case-control differences using linear mixed effect modelling including site, sex, age, in-scanner head motion 4 and Euler index 5 [...]"
We also added an extra section to the results to briefly describe these sensitivity analyses.
"Sensitivity analyses on the effects of reconstruction quality using Euler index as well as residual effects of in-scanner head motion from the resting-state acquisition did not reveal a significant impact on thresholded case-control differences or w-score. Specifically, systematic exclusion of top motion and Euler subjects resulted in highly spatially consistent effect size maps (all r ≥ 0.7). Individuals identified as statistical outliers did not have disproportionally high motion or high Euler indices. For more details see methods and supplementary materials (Figure S3)."
We would again like to thank the reviewers for bringing this important factor to our attention. While our main results were unaffected by the inclusion of motion (e.g. spatial topology of effect, number of significant regions, spatial prevalence of atypicality and proportion of individuals considered statistical outliers), the conventional analysis clearly was affected, and the more conservative approach is indeed to include motion as a confound variable. Thus, the manuscript has been updated throughout and the suggested additional sensitivity analyses have been added to the supplementary materials (in addition to a separate section on motion on the accompanying GitHub repository). Please find a point-by-point reply on other points raised by the reviewers below.

Reviewer #1
Bethlehem and colleagues applied normative modeling to characterize an individualized metric of atypical cortical thickness in males with ASD, based on the ABIDE dataset. They identified that a large proportion of case-control differences in brain structures of ASD is driven by a subgroup of autistic individuals with highly age-atypical cortical thickness based on age-related norms. Overall, this is a well-conducted and well-written study with very interesting objectives and solid computation methodology. I've followed this work on biorxiv for a while, and am wondering why it has not been officially published in any peer-reviewed journal. I have a few more points, which hopefully may improve the soundness of the current findings.
1. I understand that authors have done QC steps, and put Euler index as a covariate in the model. Nonetheless, I still have concerns that the current results might be impacted by inter-individual variations in in-scanner motion, because individuals with extreme age-deviancy in atypical cortical thickness happened to have relatively poor image quality. This critical issue might be further alleviated by the following attempts: 1) test whether there is a correlation between Euler index and w-scores; 2) compare Euler index between autistic individuals with higher age-normed deviancy and those within the age norms of CT; 3) further exclude an additional 10% of participants with extreme Euler index from the remaining samples going through the QC step, then see whether the major pattern still exists. If additional analyses based on either 1) or 2) yield significant findings, or the most remarkable pattern is not preserved after the practice of 3), then the main conclusions and inferences of the present findings should be considered problematic insofar as the in-scanner motion is taken into account.
We thank the reviewer for their positive comments and clear path to addressing some of the open concerns. As per the suggestions we have now conducted extensive additional analyses on the potential influence of data quality, using cross-validation, analysis of residual correlation, and analysis based on exclusion of the more severely affected subjects. We are now even more confident that our original results were not affected by data quality but have nonetheless chosen to update all results with the more stringent inclusion of motion in addition to Euler index as a confounder. Rather than comparing the Euler index between two groups based on a particular cutoff, we show that there is no meaningful residual correlation between Euler index and the ratio score. We have updated the main manuscript to reflect these changes and included all sensitivity analyses in the supplementary materials. In addition, we have added to our discussion a more comprehensive acknowledgement of potential confounders.
2. Following on from point 1: in paragraph 3 of the Discussion, the authors noted the limitation of a lack of phenotypic data of the subgroup with extreme age-deviancy, and seemed to explain that this may be related to heterogeneous mechanisms of neurogenesis and development. Nonetheless, it was explicitly described in the Results that the T1 data of individuals with extreme deviancy in age-atypical cortical thickness showed relatively poor image quality. I think the likely influence of in-scanner motion on this result should be explicitly acknowledged and discussed here.
As per the previous point (and see also at the top of this reply), we have now included several motion sensitivity analyses and find that motion did not meaningfully affect our results. However, the reviewers did raise the point that caution is warranted when interpreting imaging results without including motion as an important confounder, and in addition to updating our analyses accordingly we also acknowledge the issue of motion more explicitly in the discussion.
3. The authors have explained the rationale of using LOESS estimation for constructing the normative model, which I understand. I have concerns about the model fitting issue despite the strengths of LOESS estimation. The authors may want to provide the information to evaluate the goodness of fit to endorse the selection of LOESS estimation.
As requested we now provide more information on how parameter optimisation for LOESS was conducted. Bootstrapping analysis of the w-score reliability was done around this optimisation and showed high between-subject and between-region consistency.
"We used a local polynomial regression fitting procedure (LOESS) 8,9, where the local width or smoothing kernel of the regression was determined by the model that provided the overall smallest sum of squared errors using hyperparameter optimisation across 5-100% of the full age range using Brent's method 10 as implemented in the R optim function from the stats package.
We also assessed consistency of our output using centile scoring and consistency of the normative model using extensive bootstrapping; both showed high outcome consistency (see
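To make the span-selection step in the quoted methods text more concrete, here is a minimal, self-contained Python sketch. It is not the authors' R implementation: it uses a simple local-linear smoother with tricube weights, a grid search over spans from 5% to 100% in place of Brent's method, and a leave-one-out squared error as the selection criterion. The leave-one-out scoring, the function names and the synthetic age and thickness values are assumptions made for illustration only.

```python
import math
import random


def loess_predict(x0, xs, ys, span, exclude=None):
    """Local linear fit at x0 from the nearest `span` fraction of points."""
    idx = [i for i in range(len(xs)) if i != exclude]
    k = max(3, math.ceil(span * len(idx)))
    nearest = sorted(idx, key=lambda i: abs(xs[i] - x0))[:k]
    # Bandwidth slightly beyond the farthest neighbour so all weights stay positive.
    h = 1.001 * max(abs(xs[i] - x0) for i in nearest) + 1e-12
    w = [(1 - (abs(xs[i] - x0) / h) ** 3) ** 3 for i in nearest]  # tricube weights
    sw = sum(w)
    mx = sum(wi * xs[i] for wi, i in zip(w, nearest)) / sw
    my = sum(wi * ys[i] for wi, i in zip(w, nearest)) / sw
    sxx = sum(wi * (xs[i] - mx) ** 2 for wi, i in zip(w, nearest))
    sxy = sum(wi * (xs[i] - mx) * (ys[i] - my) for wi, i in zip(w, nearest))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)


def loo_sse(xs, ys, span):
    """Sum of squared leave-one-out prediction errors for a given span."""
    return sum((ys[i] - loess_predict(xs[i], xs, ys, span, exclude=i)) ** 2
               for i in range(len(xs)))


# Synthetic "typically developing" sample: age in years, thickness in mm.
rng = random.Random(1)
ages = [rng.uniform(5, 40) for _ in range(120)]
thickness = [3.2 - 0.25 * math.log(a / 5) + rng.gauss(0, 0.1) for a in ages]

# Candidate spans covering 5% to 100% of the sample, in 5% steps.
spans = [s / 100 for s in range(5, 101, 5)]
errors = {s: loo_sse(ages, thickness, s) for s in spans}
best_span = min(errors, key=errors.get)
print(best_span)
```

A held-out error is used here because an in-sample sum of squared errors would tend to favour the narrowest span; whether the original optimisation guarded against this in the same way is not stated in the quoted text.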

Reviewer #2
The manuscript examines the possibility of using structural brain features (mainly cortical thickness but also cortical surface, volume as well as gyrification) in relation to age-related norms to try to better characterize ASD neural heterogeneity. I am on the fence regarding this article. My general impression is that the idea is interesting, in principle, but results appear unconvincing to me. I may be convinced otherwise and remain open to a rebuttal, with supporting material/analyses, from authors.
Here are the main reasons for my ambivalence: 1) The authors report that the proportion of outliers across all brain regions is 7.6% instead of the expected 4.55%. So, one could view this as meaning that the autism spectrum disorder group has a greater variance in cortical thickness than typically developing controls. If this is the case, that would indeed be an interesting finding. However, it's not entirely clear to me that this is not simply driven by poor surface extraction due to subtle but systematic movement artifacts. See point 2.
2) It is not entirely clear to me that the Euler index does a good enough job at eliminating systematic subtle artifacts due to movement when comparing two groups with one known to perhaps move more than the other. This is even more concerning here because the peak in outlier 'prevalence' happens to be in children, and children are known to move more. Is it possible that FreeSurfer is particularly sensitive to movement artifacts in some regions more than in others? Can this have driven some of the results? To further mitigate problematic surface extraction effects, the authors included the Euler index as a covariate. While this is likely to help, it implicitly assumes a simple linear effect of the Euler index on cortical thickness. Even if the Euler index as a covariate had a simple linear association with thickness, it is not clear that other reconstruction issues would be dealt with. On the other hand, movement typically leads to artifactually thinner cortices rather than thicker ones, and here, after outlier removal and statistical thresholding, the autism outlier group appears to have some regions with a thicker cortex and no regions with a thinner cortex. This is partially reassuring and soothes my concern a little but not entirely. Now, I am not saying, of course, that no autism versus typical controls imaging studies should be conducted but that in the wake of so many false positive and unreproducible findings in the imaging literature, I feel that extra care is needed here. My QC concerns would be satisfied if authors implemented a more thorough QC by, perhaps, focusing on the regions that came out as statistically significant and by addressing a QC caveat in the discussion.
We thank the reviewer for their comprehensive assessment of our work and fully agree that motion and its resulting artefact should really have been included in our analyses in the first place. The reviewer is likely correct in hypothesising that motion may affect FreeSurfer reconstruction in a spatially dependent manner. Since it is unclear what that spatial dependency might be, or how best to estimate it from a structural image reconstruction, we chose not to conduct an ROI analysis on this but rather to include motion throughout all regions and all analyses. Thus, we thoroughly assessed the effect and association of motion with our results and, while we find that overall it did not change the outcome of the analysis, we agree that the validity of our findings is considerably strengthened by these additional analyses and by including motion in our model. In addition, we now more forcefully acknowledge the potential of these confounders to influence morphological measurements.
"Finally, although in-scanner head motion is a well-known confounder in resting-state connectivity studies 1,2 it has recently been shown that the same motion may also affect structural image quality and surface reconstruction, especially in clinical cohorts 4. To address this issue in the present analysis we included mean framewise displacement in our models. In this line, Savalia et al. 3 recently showed that framewise displacement is a sensitive proxy of motion-related bias in structural images. We find that, while this severely impacted the conventional case control analysis (e.g. reducing the number of significant ROIs from 38 to 27), it did not impact the outlier thresholded analysis to the same extent. To further assess the sensitivity of motion on the present approach we include sensitivity analyses based on systematic removal of high motion subjects and find that the spatial topology of effects was strongly conserved. Given the impact on the conventional analysis approach we strongly encourage future studies to consider motion as an important confounder."
3) The analysis has been restricted, for the autism spectrum disorder (ASD) group, to those matched in IQ with typically developing individuals. This leads to a mean IQ of 106 in the ASD group which is higher than the general population mean and not at all representative of autism spectrum disorder where about 50% have IQs below 70 and only about 3% have IQs above 115. While I understand the need to match for IQ given frequently reported associations between IQ and cortical thickness, I fear that this restriction severely undermines the generalizability of findings to the general population of individuals with autism spectrum disorder. I suggest carefully and clearly addressing this in the discussion and perhaps even in changing the title to reflect this fact (perhaps referring to high level functioning individuals with ASD instead of simply autistic males).
We thank the reviewer for pointing out this limitation to the generalizability of our approach and have taken the reviewer's suggestion to list this as a caveat in our discussion. We chose not to change the title to include "high functioning ASD" as some recent literature has questioned whether higher IQ implies higher functioning (Tillmann et al. 2019). The exclusion of low IQ individuals and subsequent matching thus does not necessarily restrict the sample to what may normally be defined as high-functioning. We added the following caveat to our discussion to emphasize this.
"Fourth, our current sample was matched on IQ and as a result excluded individuals with low IQ scores (< 70). While higher IQ does not automatically imply higher overall functioning 48 it does limit the generalisability of our findings to individuals with normal to high IQ."
Minor issues to consider: I suggest avoiding the use of the term "autistic" (used in the title) as this is no longer used in the DSM-5 and suggest being consistent throughout the manuscript by using autism spectrum disorder or ASD.
This has been changed to ASD throughout the paper.
Page 5: Authors refer to a "w-score", stating that it is analogous to a z-score. To me, it is exactly a z-score from the equation they provided. Why not call it a regional z-score?
We agree that this is similar to a z-score yet have explicitly chosen not to refer to it as such, since z-scores are more commonly associated with normalised scores within a population. Thus, although the calculation is the same, we chose to use w-score to emphasise that the measure in the autism group is a z-score relative to the control group (not a z-score relative to the same group or to the combined group).
Table 1 has a group labeled as NT (I am assuming they are referring to the typically developing group, for which they use TD for the rest of the manuscript). I suggest TD throughout instead of using NT at times.

Page 6:
This table and text have been updated as per the reviewer's suggestion.
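For readers less familiar with the w-score discussed in the reply to the Page 5 comment, the following Python sketch illustrates the idea under stated assumptions. It computes a regional score as the deviation from the TD normative mean in units of the TD normative standard deviation, and summarises the regions beyond a threshold as ratio scores. The threshold of 2 is inferred from the expected 4.55% mentioned by the reviewer (the two-tailed normal mass beyond ±2); the function names, the ratio definition and all numeric values are hypothetical.

```python
from math import erf, sqrt


def w_score(ct, td_mean, td_sd):
    """Deviation of one individual's regional CT from the TD norm at their age."""
    return (ct - td_mean) / td_sd


def ratio_scores(w_scores, threshold=2.0):
    """Proportion of regions whose w-score lies beyond +/- threshold."""
    n = len(w_scores)
    positive = sum(w > threshold for w in w_scores) / n
    negative = sum(w < -threshold for w in w_scores) / n
    return {"absolute": positive + negative, "positive": positive, "negative": negative}


def expected_outlier_fraction(threshold=2.0):
    """Two-tailed standard normal mass beyond +/- threshold."""
    phi = 0.5 * (1 + erf(threshold / sqrt(2)))
    return 2 * (1 - phi)


# Hypothetical individual with four regions: observed CT, TD mean and TD SD at their age.
observed = [2.9, 2.4, 3.3, 2.0]
norm_mean = [2.5, 2.5, 2.8, 2.6]
norm_sd = [0.15, 0.2, 0.2, 0.25]

scores = [w_score(c, m, s) for c, m, s in zip(observed, norm_mean, norm_sd)]
print(scores)
print(ratio_scores(scores))
print(round(expected_outlier_fraction(), 4))
```

In this framing, an individual whose absolute ratio exceeds 0.5 would be one in whom more than half of all regions fall outside the TD age norm.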
Page 8: authors state that 7.6% is much higher than the expected 4.55% but provide to statistical test supporting this statement.
The chi-square test statistics are now included.
"This difference from an expected proportion of 5% in the present sample corresponds to a X 2 of 3.85 (with Yates continuity correction 11 ) that is significant at p < .05." Page 9: Authors write: "There are other interesting attributes about this subset of brain regions. With regard to age, these patients were almost always in the age range of 6-20, and were much less prevalent beyond age 20 (S5)". Shouldn't this be supplementary figure S6 instead of S5?
With the addition of the supplemental analyses described above, all figure indices have shifted; in the updated manuscript, all references have been double-checked.
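For readers who want to see how a Yates-corrected chi-square test of an observed proportion against an expected one (as in the Page 8 response) is computed, here is a minimal sketch. It assumes a one-sample goodness-of-fit test with one degree of freedom. The function name is hypothetical, and any counts passed to it are illustrative; the sample size behind the reported statistic is not given in this letter.

```python
import math

def yates_chi_square(observed_hits, n, expected_prop):
    """One-sample chi-square goodness-of-fit test (1 df) with Yates correction.

    observed_hits: number of cases in the category of interest
    n:             total number of cases
    expected_prop: proportion expected under the null (e.g. 0.05)
    """
    expected_hits = n * expected_prop
    expected_misses = n - expected_hits
    observed_misses = n - observed_hits
    # Yates correction: shrink each |observed - expected| by 0.5, floored at 0
    d_hits = max(abs(observed_hits - expected_hits) - 0.5, 0.0)
    d_misses = max(abs(observed_misses - expected_misses) - 0.5, 0.0)
    chi2 = d_hits ** 2 / expected_hits + d_misses ** 2 / expected_misses
    # Upper-tail probability of a chi-square variable with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

The continuity correction makes the test slightly more conservative, which matters most when the expected count in the rare category is small.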
Regarding S5: Authors write, in their legend, that there are 14 subjects for which the ratio score exceeds 0.5 yet the plots are not about these 14 subjects (this might be confusing to the reader).
The caption has been updated.

Reviewer #3
The authors present a technique for evaluating cortical thickness changes in individual ASD patients. The approach is in contrast to the current standard group-level cortical thickness analysis paradigm, which is limited to identifying differences between groups.
Specific comments: 1. In the introduction the authors note that biological sex is likely to modulate ASD-related neuroanatomical differences. What is the evidence for this claim?
Multiple studies have shown sex*diagnosis interaction effects where the case-control effects present in one sex are statistically quite different from the same contrast in the other sex (e.g., Lai). Furthermore, studies that stratify by sex and specifically examine CT find qualitative and quantitative distinctions between the sexes as well as sex-specific associations with symptom severity (Bedford et al., 2020). We have added a specific reference to the Bedford et al. paper to the statement about how biological sex modulates neuroanatomical differences with respect to CT.

2. Introduction: the authors discuss heterogeneous findings in previous ASD studies; however, they don't discuss the possibility that there may be effectively no morphometric differences between ASD individuals and healthy controls. Some negative studies have been published.
The reviewer raises a very interesting point that, admittedly, we had not acknowledged as explicitly as we should have. It is indeed possible that no morphometric differences exist, and in fact our paper highlights that much of the previous literature likely overestimated true group mean differences. By focusing on a more individualised assessment, we also find that a broadly atypical morphology is present in only a small subset of individuals and that these individuals drive most of the case-control difference. Thus it is likely fair to consider that, on average, no morphometric differences exist, yet that there may be a small subgroup of individuals within the autism group who do show some atypicality. We have rephrased our introduction and discussion to reflect this notion and included references on null findings.
"However, the vast neuroimaging literature is also inconsistent, with reports of hypo-or hyperconnectivity, cortical thinning versus increased grey or white matter, brain overgrowth, arrested growth, or even lack of morphological difference altogether etc. [12][13][14][15][16][17][18][19][20][21] , leaving stunted progress towards understanding mechanisms driving cortical pathophysiology in ASD and translating neuroimaging into clinical utility. " "Furthermore, conventional case-control analyses may obscure more subtle individual differences as they assume on-average group differences. This is especially important in light of previously reported null-findings 6 ." "Utilizing normative modelling as a way of identifying and removing CT-atypical outlier patients, we find here that most small case-control differences are driven by a small subgroup of patients with high CT-atypicality for their age, which indeed begs the question of the existence of onaverage atypical cortical morphology in autism 6 ." 3. The author refers to brain regions passing FDR correction in the results section. I found this phrasing a little difficult to understand; FDR correction refers to a subset of statistical tests that are deemed statistically significant at a threshold modified for multiple comparisons. I assume the authors mean that some regions have statistically significant differences in cortical thickness between ASD and healthy controls? Consider clarifying. In a related point since the journal has the results section before the methods it's unclear what sort of spatial scale the authors are referring to when they talk about brain regions -is it CT averaged over cortical parcellations or are they vertex-wise cortical thickness estimates? It might be useful to modify the results section text to make it easier for the reader.
We thank the reviewer for this helpful suggestion and have now restyled the manuscript to fit better with the ordering of NPG journals, which also allows us to further clarify that all analyses were done on cortical parcellations.
"All analyses were done on cortical thickness averaged within 308 cortical regions 22 ."

4. The authors note limitations with the Euler index for quantifying image quality. Pardoe et al., NeuroImage 2016, "Motion and morphometry in clinical and nonclinical populations", demonstrated that in-scanner head motion, estimated using fMRI scans in ABIDE subjects, was correlated with cortical thickness estimates. It would be helpful to assess average head motion in the participants with abnormal age-related cortical thickness trajectories to make sure that the results aren't driven by in-scanner head motion.
We thank the reviewer for this helpful reference. We have now included motion in our model and, in addition, have conducted several sensitivity analyses to assess the impact of motion on our results. While we find that motion did not significantly impact our findings, we now report all findings from the more conservative approach that includes motion, and we have updated the manuscript to acknowledge the issue of motion more explicitly.
"Finally, although in-scanner head motion is a well-known confounder in resting-state connectivity studies 1,2 it has recently been shown that the same motion may also affect structural image quality and surface reconstruction, especially in clinical cohorts 4 . To address this issue in the present analysis we included mean framewise displacement in our models. We find that, while this severely impacted the conventional case control analysis (e.g. reducing the number of significant ROI's from 38 to 27), it did not impact the outlier thresholded analysis to the same extent. To further assess the sensitivity of motion on the present approach we include sensitivity analyses based on systematic removal of high motion subjects and find that the spatial topology of effects was strongly conserved. Given the impact on the conventional analysis approach we strongly encourage future studies to consider motion as an important confounder." 5. Further to this, once the participants with abnormal CT for their age have been identified, there is relatively little further investigation of other factors that may explain why these participants had abnormal CT. This would strengthen the manuscript.
Similarly it would be helpful if the authors provided the specific study IDs of the participants they identified with abnormal CT in the supplementary material.
We fully agree that it would be interesting to see whether these individuals show any particular atypicality in other domains, but unfortunately little additional phenotypic information was available on the individuals identified as outliers. The full table of ratio scores has now been made available online, including the anonymised subject IDs.
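As an illustration of the motion sensitivity analysis described in the response to comment 4 (systematic removal of high-motion subjects, then checking that the spatial pattern of effects is conserved), here is a minimal sketch. It assumes that conservation is measured as the Pearson correlation between regional mean w-score maps before and after removal, which is one plausible reading and not necessarily the authors' exact procedure. The data layout and function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def regional_means(group):
    """Mean w-score per region across a group of subjects."""
    n_regions = len(group[0][1])
    return [sum(s[1][r] for s in group) / len(group) for r in range(n_regions)]

def motion_sensitivity(subjects, drop_fraction):
    """Correlate regional effect maps with and without high-motion subjects.

    subjects: list of (mean_framewise_displacement, [w-score per region])
              tuples (hypothetical layout).
    drop_fraction: share of the highest-motion subjects to remove.
    """
    ranked = sorted(subjects, key=lambda s: s[0])  # low to high motion
    n_keep = len(ranked) - int(len(ranked) * drop_fraction)
    full_map = regional_means(ranked)
    reduced_map = regional_means(ranked[:n_keep])
    return pearson(full_map, reduced_map)
```

Running this over a range of `drop_fraction` values gives a simple curve: correlations that stay close to 1 as more high-motion subjects are removed would suggest that the regional pattern is not driven by motion.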