Accurate sex prediction of cisgender and transgender individuals without brain size bias

The increasing use of machine learning approaches on neuroimaging data comes with the important concern of confounding variables, which might lead to biased predictions and in turn to spurious conclusions about the relationship between the features and the target. A prominent example is the brain size difference between women and men. This difference in total intracranial volume (TIV) can cause bias when employing machine learning approaches for the investigation of sex differences in brain morphology. A TIV-biased model will not capture qualitative sex differences in brain organization but rather learn to classify an individual's sex based on brain size differences, thus leading to spurious and misleading conclusions, for example when comparing brain morphology between cisgender and transgender individuals. In this study, TIV bias in sex classification models applied to cis- and transgender individuals was systematically investigated by controlling for TIV either through featurewise confound removal or by matching the training samples for TIV. Our results provide strong evidence that models not biased by TIV can classify the sex of both cis- and transgender individuals with high accuracy, highlighting the importance of appropriate modeling to avoid bias in automated decision making.

individuals with higher TIV as males and individuals with lower TIV as females, while making more mistakes for individuals with intermediate TIV.
The use of such a TIV-biased sex classifier is particularly problematic when analyzing data of individuals for whom local and global brain structural alterations have been reported, such as those with "gender incongruence," where a person's sex and gender identity differ 21. In the present paper, following the linguistic guidelines provided by the Professional Association of Transgender Health 22, the term "sex" is used to refer to the sex that a person was assigned at birth based on their anatomical sexual characteristics, whereas the term "gender (identity)" is used to denote the subjective identification of an individual as female, male, or one of the other gender identities, which might also be fluid or non-binary. While the coherence of sex and gender is termed cisgender for cisgender men and women (CM, CW), gender incongruent individuals are denoted as transgender men and women (TM, TW 21).
To date, it is not yet fully understood whether and to what extent local and global brain organization of transgender individuals is driven by factors matching their gender identity on top of those matching their sex. So far, studies contrasting groups of cisgender and transgender individuals have reported regional GMV differences in the putamen 23 and insula 16, as well as in surface areas and cortical and subcortical brain volumes 24. Additionally, transgender individuals undergoing cross-sex hormone treatment (CHT) were reported to show structural alterations in the hypothalamus and the third ventricle 25. Thus, there is some evidence indicating that transgender individuals display local brain volume differences 24,26–28. Extending the results of group studies contrasting cisgender and transgender individuals, sex classification approaches (building a classifier on cisgender individuals' data and then applying it to transgender individuals) have reported reduced sex classification accuracies for transgender compared to cisgender samples (76.2% vs. 82.6% 17; 61.5% vs. 93.2–94.9% 16). Higher rates of misclassification of sex in transgender as opposed to cisgender individuals have been taken to indicate that transgender brains might differ from those typical for their sex, implying an interaction between sex and gender at the neuroanatomical level 16,17,29. However, before such conclusions can be drawn, biases that can influence a sex classifier must be taken into account, particularly those related to TIV 18,19. It is crucial to be aware of the impact of local and global structural brain alterations that can lead to increases or decreases of TIV, resulting in the TIV of transgender individuals falling between the TIV of cisgender women and men 25. Consequently, the predictions of a TIV-biased classifier might erroneously be interpreted as evidence for transgender brain organization aligning with gender identity, as has been reported before 16,29.
Here, we investigate the impact of TIV bias by examining two approaches to control for confounding effects of TIV 10 in sex classification, in order to evaluate which approach is most suited to account for TIV bias in the present sex classification analysis. We compare two statistically different approaches of controlling for TIV bias with a baseline model that does not account for the influence of TIV. For the first approach, we built debiased models through featurewise confound control by removing confounding effects of TIV during training (Fig. 1, 20,30). In the second approach, we trained models on a stratified sample where women and men were matched for TIV. Model performance and TIV bias were assessed on hold-out samples of cisgender individuals to compare the performance of the biased to the debiased models. We hypothesized that a TIV-biased model should achieve high performance but also exhibit a biased output pattern. In contrast, a model not biased by TIV will likely exhibit a drop in classification accuracy. However, importantly, misclassifications of such a model should be largely independent of TIV. In the final step, the debiased models were applied to application samples comprising both cisgender and transgender individuals to examine whether models without a TIV bias provide any evidence for an interaction of sex and gender influences on structural brain organization, as previously suggested 17.

Results
Classifiers employing Support Vector Machine (SVM) models with a radial basis function (rbf) kernel were trained on whole-brain voxelwise GMV data of two large, non-overlapping cisgender samples to classify sex assigned at birth. In the first sample, women and men were matched for age (AM sample) to create a sample with a naturally occurring TIV distribution (Fig. S1 and Table S1). As a baseline, we trained the first model on this sample without any control for TIV bias (AM model), following the methodology of a previous study 16. We then compared the baseline model to other models, which integrated two different approaches for confound control, in order to assess which approach successfully removes TIV bias while accurately classifying sex. For the first approach, an ML model was also trained on the AM sample, but additionally controlled for TIV bias by featurewise confound removal (AM+cr model). The second approach comprised stratification for TIV by training the model on a sample of women and men who were matched for both age and TIV (ATM; see Fig. S1 and Table S1 for demographic details and TIV distribution of the samples). While the third model was trained on the ATM sample without additional TIV control (ATM model) to evaluate stratification in itself, the fourth model employed a combination of both approaches to assess whether the addition of featurewise confound removal might further improve results (ATM+cr model, Fig. 1).
Subsequently, all models were calibrated to ensure that the prediction probabilities of the models match the respective class label (Figs. S2 and S3, Supplementary Results, https://scikit-learn.org/stable/modules/calibration.html#calibration). To evaluate model performance on hold-out data, each sample (AM and ATM) was split into a training sample (80%) and a hold-out sample (20%). As the two approaches (featurewise confound removal and stratification by matching) might exhibit differences in model performance since they are based on different statistical processes 8, all four models were evaluated on both the AM and ATM hold-out samples. This allowed for a thorough understanding of model behavior and an evaluation of whether both approaches successfully remove TIV bias. Assessing model performance on the first sample (AM hold-out sample), which exhibits a naturally occurring TIV distribution among women and men, enables a realistic evaluation of the model's effectiveness in broader populations beyond those included in the present study. In turn, the ATM hold-out sample enables a more in-depth evaluation of model performance, as it displays no significant difference in TIV between women and men. Consequently, accurate model performance on the ATM hold-out sample indicates non-TIV-biased model behavior, as the model classifies a person's sex based on features other than TIV, providing a "confound-free accuracy" 31. Additionally, the models were tested on two independent application samples comprising transgender and cisgender individuals (sample A, sample B; see Fig. S1 and Table S1 for demographic details and TIV distribution of the samples).
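The 80%/20% split into training and hold-out samples can be illustrated with a small sketch. It uses only the Python standard library; the function name `stratified_holdout_split`, the fixed seed, and the choice to keep the sex ratio equal in both parts are assumptions made for illustration and are not taken from the study's pipeline.

```python
import random


def stratified_holdout_split(labels, holdout_fraction=0.2, seed=0):
    """Split subject indices into training and hold-out sets per class.

    Hypothetical helper: the split is done separately for each label so that
    both parts keep the same class proportions (an assumption of this sketch).
    """
    rng = random.Random(seed)
    train, holdout = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        n_holdout = round(len(idx) * holdout_fraction)
        holdout.extend(idx[:n_holdout])
        train.extend(idx[n_holdout:])
    return sorted(train), sorted(holdout)


# Toy sample with 10 women and 10 men.
labels = ["woman"] * 10 + ["man"] * 10
train_idx, holdout_idx = stratified_holdout_split(labels)
```

In a matched design, one might instead split whole matched pairs so that the matching survives in both parts; whether that was done here is not stated in this section.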
Evidence for TIV bias in the AM model.
The application of the AM model to the AM hold-out sample resulted in a high classification accuracy of 96.89% (Table 1, Table S2, and Fig. 2). Accordingly, the assigned probability of being classified as male (prediction probability) was higher for men than for women (Fig. 3a). The comparison of TIV distributions revealed that men who were classified congruently with their sex as male had a significantly higher TIV than incongruently classified men (Fig. 3b). Similarly, women classified incongruently with their sex as male on average had a higher TIV than congruently classified women, even though this difference was not significant (details in Table 2).
When applied to the ATM hold-out sample, the AM model resulted in a much lower classification accuracy of 79.19% (Tables 1 and S2), presumably because the model could not rely on TIV for classification in the ATM sample. Still, we observed a similar pattern as above, with men having a higher prediction probability than women (Fig. 3c), significantly higher TIV in sex congruently as opposed to incongruently classified men, and significantly lower TIV in sex congruently as opposed to incongruently classified women (Fig. 3d and Table 2). Altogether, across both hold-out samples, this model tended to classify subjects with higher TIV as male and those with lower TIV as female, clearly indicating a brain size bias inherent in this model.
Reducing TIV bias by confound removal.
Featurewise control for TIV in the AM+cr model resulted in decreased classification accuracies both for the AM (61.80%) and the ATM (72.98%; further details in Fig. 2, Table 1 and Table S2) hold-out samples. In comparison to the AM model with no TIV control (Fig. 3a), prediction probability displayed a much larger overlap between women and men (Fig. 3e, g). Further evaluation did not reveal any evidence for a TIV bias, i.e., neither did sex congruently classified men show higher TIV than incongruently classified men, nor did sex congruently classified women show lower TIV than incongruently classified women, in both the AM (Fig. 3f) and the ATM (Fig. 3h and Table 2) hold-out samples.

Table 1. Model performance of all models applied to the hold-out and application samples (* Balanced Accuracy).

Reducing bias by matching the training sample for TIV.
The application of the two models built using TIV-matched data without and with featurewise TIV control (ATM and ATM+cr model, respectively) to the AM hold-out sample resulted in similarly high classification accuracy (86.65% for the ATM, 85.71% for the ATM+cr model, details in Tables 1 and S2), falling between the accuracies achieved by the AM and the AM+cr model. Thus, for the ATM models, additional featurewise TIV control did not result in decreased model performance. This is further reflected in similar prediction probability distributions (Fig. 3i, m), which were higher for men than for women. Likewise, the TIV of sex congruently and incongruently classified individuals did not differ significantly from each other, both for women and for men (Fig. 3j, n and Table 2). Application of these models to the ATM hold-out sample (details in Tables 1 and S2) displayed better performance (92.55%) than for the AM hold-out sample. Furthermore, prediction probability distributions showed a comparable (Fig. 3k, o) but more pronounced pattern for the ATM hold-out sample. Again, when testing on the ATM hold-out sample, there was no difference between the TIV of sex congruently and incongruently classified individuals, both for the model without (Fig. 3l and Table 2) and with additional confound removal (Fig. 3p and Table 2).
Overall, the AM model achieved the highest classification accuracy, but evaluation of the model output identified clear evidence for a TIV bias of the model. Reducing TIV-related variance by featurewise confound removal in the AM+cr model resulted in a less biased model, which also displayed a pronounced decrease in model performance, especially for the AM hold-out sample. Neither model trained on the TIV-balanced sample (ATM, ATM+cr model) showed evidence of a TIV bias, while both still retained high classification performance and appropriate calibration curves (Figs. S2 and S3), indicating that, at least for the present classification problem, training on a matched sample is more appropriate than featurewise confound removal. Thus, in the following, we will focus on comparing the performance of the biased AM model and the nonbiased ATM model on cisgender and transgender individuals in the application samples (sample A, sample B). Results for the AM+cr and ATM+cr models are provided in the Supplementary Results and Fig. S4.
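To make the idea of a TIV-matched training sample concrete, the following is a minimal sketch of one possible matching procedure: greedy 1:1 nearest-neighbour matching with a caliper. The matching algorithm actually used for the ATM sample is not described in this section, so the function `match_on_tiv`, the caliper value, and the toy TIV values are all hypothetical; the sketch also ignores the additional matching for age.

```python
def match_on_tiv(women, men, caliper):
    """Greedily pair each woman with the closest unused man on TIV.

    women, men: lists of (subject_id, tiv) tuples.
    caliper: largest TIV difference accepted for a pair (assumed parameter).
    Subjects without a partner inside the caliper are left out.
    """
    available = sorted(men, key=lambda s: s[1])
    pairs = []
    for w_id, w_tiv in sorted(women, key=lambda s: s[1]):
        if not available:
            break
        best = min(available, key=lambda s: abs(s[1] - w_tiv))
        if abs(best[1] - w_tiv) <= caliper:
            pairs.append((w_id, best[0]))
            available.remove(best)
    return pairs


# Toy data: TIV in arbitrary volume units.
women = [("w1", 1300.0), ("w2", 1420.0), ("w3", 1510.0)]
men = [("m1", 1430.0), ("m2", 1500.0), ("m3", 1700.0), ("m4", 1310.0)]
pairs = match_on_tiv(women, men, caliper=20.0)
```

In this toy example the man with the largest TIV (`m3`) stays unmatched, which mirrors how matching restricts the training sample to the overlapping part of the two TIV distributions.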

Biased performance of the AM model for cisgender and transgender individuals.
The application of the TIV-biased AM model resulted in an overall high performance of 88.70% for sample A, with an accuracy of 81.63% for cisgender and 93.43% for transgender individuals (detailed measures in Tables 1 and S3). Likewise, for sample B, the model achieved a high overall accuracy of 93.10% (Tables 1 and S3), with an accuracy of 90.24% for cisgender individuals and 95.65% for transgender individuals. Matching the high accuracies, the prediction probability showed a sex congruent pattern with higher prediction probabilities for CM and TW (assigned male at birth) than for CW and TM (assigned female at birth) in both sample A (Fig. 4a, c) and sample B (Fig. 4e, g). A comparison of probability distributions of cis- and transgender individuals with the same sex revealed a trend for higher prediction probability for CW than for TM in sample A (t = 1.98, p = 0.0527, Cohen's d = 0.53), which was significant in sample B (t = 3.58, p < 0.001, Cohen's d = 1.01), matching the TIV distributions showing higher TIV for CW than TM (Fig. S1).
The comparison of prediction probabilities for CM versus TW was not significant in either sample (Sample A: t = −0.55, p = 0.5820, Cohen's d = −0.15; Sample B: t = 1.07, p = 0.2922, Cohen's d = 0.36), while the effect size indicated a trend of lower prediction probability for TW than CM. While TIV distributions for sex congruently and incongruently classified individuals did not differ significantly (Table 3), sex congruently classified CW and TM had a lower TIV than those classified in a sex incongruent manner. Sex congruently classified CM and TW had a higher TIV than those classified sex incongruently (Fig. 4b, d, f, h), indicating a similar bias of this model for both cisgender and transgender individuals.
Nonbiased ATM model: similar performances for cisgender and transgender individuals.
The application of the ATM model to sample A displayed a high overall sex classification accuracy of 91.30% (91.84% for cisgender and 90.01% for transgender individuals). This model also performed accurately on sample B with an overall accuracy of 93.10% (92.68% for cisgender and 93.48% for transgender individuals, details in Tables 1 and S3). In both samples, the ATM model yielded sex congruent prediction probabilities for all four groups (Fig. 4i, k, m, o). As opposed to the biased model, here, TM showed a trend of higher prediction probability than CW in sample B (CW vs. TM: Sample B: t = −1.27, p = 0.2093, Cohen's d = −0.36; Sample A: t = −0.47, p = 0.6425, Cohen's d = −0.12). This gender congruent trend was not observed for TW (CM vs. TW: Sample A: t = 0.31, p = 0.7577, Cohen's d = 0.08; Sample B: t = −2.02, p = 0.0510, Cohen's d = −0.68). The comparison of TIV distributions between sex congruently and incongruently classified individuals (Fig. 4j, l, n, p) did not reveal any significant differences (Table 3), neither for cisgender nor for transgender individuals, thus displaying no evidence for a TIV bias of this model.
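The group comparisons of prediction probabilities are reported as a t statistic together with Cohen's d. As a rough illustration of how these two quantities relate, the sketch below computes both from two lists of probabilities. It assumes a pooled-variance (Student) two-sample t-test and the pooled-SD definition of Cohen's d; which variant was used in the study is not stated in this section, and the function name and toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d_and_t(a, b):
    """Return (Cohen's d, t statistic) for two independent samples.

    Assumes equal variances: both values are based on the pooled variance.
    """
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    diff = mean(a) - mean(b)
    d = diff / sqrt(pooled_var)
    t = diff / sqrt(pooled_var * (1 / na + 1 / nb))
    return d, t


# Toy prediction probabilities for two groups of the same sex.
group_1 = [0.10, 0.20, 0.30, 0.40]
group_2 = [0.05, 0.10, 0.15, 0.20]
d, t = cohens_d_and_t(group_1, group_2)
```

Under these assumptions the two values differ only by a factor that depends on the group sizes, t = d · sqrt(n1·n2 / (n1 + n2)), which is why a sizeable d can coexist with a non-significant t in small groups.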

Discussion
In this work, we systematically compared two confound removal approaches, featurewise confound removal and sample stratification, with the aim of training accurate sex classification models without a TIV bias. In order to directly compare our findings to those of a previous study, we implemented an ML pipeline that has demonstrated high levels of sex classification accuracy 16. This pipeline consisted of principal component analysis (PCA) for dimensionality reduction, followed by an SVM model with rbf kernel for learning; the previous study, however, did not report any consideration of the confounding effects of TIV.
Consistent with previous results, the baseline AM model, which does not consider confounding effects of TIV, achieved near-perfect classification accuracy on the AM hold-out sample by accurately classifying men with high TIV as male and women with low TIV as female 11,12,16,17, but relied on TIV as a proxy for sex, indicating a pronounced TIV bias (Fig. 3b). The TIV bias was even more pronounced when the model was applied to the ATM hold-out sample, presumably because the AM model was more likely to make mistakes for men with relatively lower TIV and women with relatively higher TIV. The pronounced TIV bias observed here is especially interesting, since the GMV data had already been scaled for TIV during preprocessing. Thus, our results align with previous claims that while the absolute amount of tissue is corrected for individual TIV, such scaling does not fully remove TIV-related variance (32, http://www.neuro.uni-jena.de/cat12/CAT12-Manual.pdf).
For the AM+cr model, where a featurewise removal of TIV was performed on the AM data, the misclassifications of both women and men were not systematically related to TIV differences, indicating that this model was not biased by TIV. This suggests that the AM+cr model based its classifications on different information than the AM model did. Our results match the findings of previous studies 20,30,33,34, which reported a decrease in accuracy for sex classification models controlling for TIV in contrast to TIV-biased models. This decrease is likely related to the removal of TIV-related variance during featurewise confound removal, which might have decreased the overall amount of information available to the AM+cr model in contrast to the AM model 20,30,33,34. This observation is in line with the results of a previous study suggesting that TIV alone contains enough information to classify sex at a similar level of accuracy as TIV-uncorrected GMV 34. The fact that features in the AM sample can be assumed to contain more TIV-related variance than those in the ATM sample presumably explains why the drop in accuracy between the AM and the AM+cr model is less pronounced for the ATM hold-out sample than for the AM hold-out sample. Altogether, featurewise confound removal reduced TIV bias at the cost of classification accuracy. While a lack of bias in a model is desirable, so is high accuracy, suggesting that featurewise confound removal might not be the ideal approach to reduce TIV bias in structural sex classification.
In contrast to the models trained on the AM sample, both ATM-trained models resulted in high and unbiased model performance for the AM as well as the ATM hold-out samples. The slightly higher accuracy for the ATM hold-out sample is likely due to the ATM hold-out sample better matching the characteristics of the ATM training sample, in particular with respect to the TIV distribution, which is highly related to the target variable sex 30. The better performance of the ATM and ATM+cr model on the ATM hold-out sample also supports the relevance of stratifying training and hold-out samples with respect to relevant variables that may interact with the target 35,36.
The comparison of TIV of sex congruently and incongruently classified women and men did not indicate a TIV bias, which is in line with a study proposing that matching beforehand is a more efficient approach than featurewise confound removal in the statistical analysis 9. However, another study argued against the matching of data on the grounds that matching for specific characteristics creates a sample that is not representative of the whole population 20. While we agree that the ATM sample does not strictly represent the TIV distribution of the population, as it comprises men with relatively low and women with relatively high TIV, the ensuing models achieved high classification accuracies, even when applied to the AM hold-out sample, which reflects the natural TIV distribution. This indicates that the models themselves are not biased by training sample characteristics, especially the restricted TIV range. In fact, the models appear to correctly capture sex differences in a generalizable manner, as exemplified by their performance on the two hold-out samples. However, we would like to emphasize that the two confound removal approaches employed in the present study rely on different statistical operations, which are anticipated to result in different outcomes and model performances 8. Thus, high model performance of one approach does not imply that the other one behaves in a similar manner. For this reason, testing which approach is most suited for an individual ML problem is crucial. The present results demonstrated that matching women and men for TIV in the training sample provides an appropriate approach for creating unbiased and accurate sex classification models.
In contrast to previous studies 16,17, we observed similarly high classification accuracies for cis- and transgender individuals regardless of whether the models were debiased or not. This discrepancy may partly be explained by the fact that the TIV of the transgender individuals in the present samples matched the TIV of cisgender subjects of the same sex rather than aligning with gender identity (Fig. S1). Thus, even a biased classifier could accurately classify transgender individuals. However, in samples where the TIV values for transgender individuals indeed fall in between those of cisgender men and women, as reported previously 25, TIV-biased models would misclassify transgender individuals in accordance with their gender identity, which could explain prior findings 16. Future studies should apply TIV-debiased models to additional datasets to help disentangle the complex interaction of sex, gender and the brain. It would be particularly interesting to apply our debiased models, which are available to other researchers (https://github.com/juaml/sex_prediction_vbm), to those datasets for which a reduction of sex classification accuracy for transgender participants has previously been reported 16,29. Another explanation for the discrepancy between present and previous results 16,29 might be that our classifiers learnt fundamentally different models, e.g., employing different feature weights than those in previous studies, which in turn might be caused by differences in characteristics of the training samples and hence different parameters learnt during model optimization. Besides the differences due to different training samples, other factors affecting ML models and respective results might relate to differences in age distribution. Here, we not only balanced for sex but also employed an exact matching of men and women with regard to age, which might have reduced variance in comparison to the training samples of other studies 16,29, leading to differences in the fundamental model and results. In addition to age in the training sample, the age distribution of the application sample could also play a role, due to age-related GMV decline. Thus, older TW could be misclassified due to age-related GMV changes. The present models were trained on a diverse collection of samples, ensuring heterogeneity in several variables, such as age, scanning characteristics, and nationality. Likewise, as application samples we used two completely independent datasets comprising TW and TM. To our knowledge, previous studies have focused on test samples only comprising TW when applying a sex classifier trained on structural data of cisgender individuals to transgender individuals 16,29, limiting conclusions to TW rather than transgender individuals in general. Notably, one study employing data of both TW and TM did not report significantly lower classification accuracy for transgender data 17, which is in line with the present results. While we did not observe decreased sex classification accuracy for transgender individuals, this cannot be taken as proof of the absence of such structural brain differences, which might be revealed by the investigation of different sets of brain features or different analysis approaches.
Future studies can benefit from incorporating confound control approaches within interpretable ML pipelines that can provide insight into how many and which brain regions are most relevant for sex differences. Those insights can shed further light on which features are more common in men, women or both, thereby carrying implications for hypotheses such as the mosaic of the human brain 37, which exceeds the scope of the current study design. Methodologically sound studies, including both sex and gender aspects, are needed to improve our understanding of sex- and gender-related differences in behavior and prevalence rates of mental disorders and to advance the development of sex-specific treatments 38,39. Viewing patients through the lens of sex and gender is an essential step towards personalized care and individualized medicine 6,40. Therefore, to achieve the ultimate goal of neuroimaging-based precision medicine, the present study takes a first step towards exploring appropriate confound removal in ML-based sex classification 41. Although each ML analysis must consider confounds specific to the research question at hand, TIV is an important confound to consider in neuroimaging data in general, as also shown by others 9,18,33,34,42. In addition to its application in sex classification analyses, as demonstrated here, appropriate confound control should also be considered for other ML applications. We therefore recommend that researchers investigate which confound removal method is appropriate for their ML analysis.

Conclusion
Our findings demonstrate that stratification via TIV-matching effectively eliminates TIV bias while achieving high levels of classification accuracy in a sex classification analysis using structural brain imaging features. Contrary to previous results 16, our sex classification model demonstrated comparable levels of classification accuracy for both cisgender and transgender individuals. Our study emphasizes the importance of removing TIV bias appropriately in sex classification tasks to prevent incorrect interpretations. In general, confounding is a common issue in many ML-based modeling tasks, albeit with varying confounds and levels of confounding effects. Therefore, future studies utilizing ML approaches on brain imaging data should diligently examine for biases and implement appropriate confound control measures.

Data
Data pool for model training and evaluation.
To ensure a heterogeneous sample for training the classifiers, we combined data from 10 large cohorts into one data pool of structural magnetic resonance imaging (MRI) images from subjects differing in nationality, imaging parameters and age range. Supplementary Table S4 gives further details on the composition of the data pool, and details of the MRI data acquisition parameters can be found in the Supplementary Material. We only included subjects aged between 18 and 65 years with no indication of any psychiatric disorder, resulting in a total N of 5557 subjects. It is important to note that the majority of large datasets which have been employed for sex classification studies so far likely report sex based on "presented sex", i.e., the name and outer appearance of participants, or on self-reported sex, without explicitly collecting information on gender identity. We assume that among subjects not describing themselves as transgender, self-reported gender identity is equivalent to sex assigned at birth, while acknowledging that this match may be neither perfect nor binary.
Sixteen subjects whose TIV values differed by more than three standard deviations from the mean TIV of the data pool were excluded as outliers. Then, two non-overlapping samples were extracted from the data pool. In the first sample (AM), women and men were matched for age to control for age-related GMV decline 43–46. In the second sample (ATM), women and men were additionally matched for TIV. Possible differences between samples and sites in scanning acquisition were controlled for by including similar numbers of subjects from the different samples in the AM and ATM sample, respectively.
Where applicable, for featurewise TIV control, TIV-related variance was removed after dimensionality reduction by subtracting the fitted values of each feature in a cross-validation (CV)-consistent manner to avoid data leakage 20,30. Stratified tenfold CV was performed to assess generalization performance. The two hyperparameters, C (1 to 1e8, log-uniform) and gamma (1e-7 to 1, log-uniform), were tuned via Bayesian Hyperparameter Optimization with 250 iterations within a fivefold CV inner loop, following the analysis employed in a previous study 16. The best performing combination of hyperparameters from the Bayesian Hyperparameter Optimization was used to train the final model on the full sample (details depicted in Supplementary Material).
The four final models were used to obtain predictions for the AM and ATM hold-out samples and both application samples (Fig. 1). Before application of the models to the hold-out samples, we ensured that the models were calibrated (https://scikit-learn.org/stable/modules/calibration.html#calibration) by assessing the probabilities of classifying an individual into a respective class in relation to the actual labels of the individuals (Supplementary Figs. S2 and S3, Supplementary Results). These calibrations allow for checking whether the models gave accurate estimates of class probabilities and support probability predictions. To distinguish between the predicted and the actual label of an individual, we use the terms "male" and "female" for the labels predicted by an ML model, whereas "men" and "women" denote the actual (true) label of an individual.
To further explore model behaviour, we compared the TIV distributions of individuals classified in accordance with their sex and those who were not, by use of violin plots 54 and by Wilcoxon rank sum tests. Due to the number of comparisons conducted here, we chose a conservative significance level of α = 0.005, with effect sizes estimated accordingly 55. To examine whether models were confounded by total GMV, we first tested whether GMV differed between the sexes in the two samples. In the AM sample, similarly to TIV, the sexes exhibited significant differences in total GMV (two-sample t-test; t = −31.21, p < 0.001). However, matching for TIV in the ATM sample also resulted in a non-significant difference in total GMV (t = 0.85, p = 0.40), indicating that matching on TIV was also effective for GMV. We then compared the GMV distributions of individuals classified correctly in accordance with their sex and those who were misclassified (Tables S5 and S6), with the same conservative significance level as for TIV differences of α = 0.005. Further details can be found in the Supplementary Results and Tables S5 and S6. To assess potential differences between cis- and transgender individuals in prediction probabilities, we statistically compared the probabilities of CM and TW as well as CW and TM. A power analysis for these comparisons was conducted using G*Power to compute the sample size required for effect sizes as found in previous work, with an α-level of 0.05 and a power level of 0.8 29,56,57.
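The CV-consistent featurewise confound removal described above can be sketched as follows: a separate linear model of each feature on TIV is fitted on the training data only, and the fitted values are then subtracted from both the training and the test features. This is a minimal standard-library sketch under the assumption of a simple linear (intercept plus slope) confound model; the function names and toy values are hypothetical, and in the study's pipeline the features would be the components obtained after dimensionality reduction.

```python
from statistics import mean


def fit_confound_model(features, tiv):
    """Fit one least-squares line (intercept, slope) per feature column on TIV."""
    tiv_mean = mean(tiv)
    tiv_ss = sum((t - tiv_mean) ** 2 for t in tiv)
    coefs = []
    for column in zip(*features):
        col_mean = mean(column)
        slope = sum((t - tiv_mean) * (x - col_mean)
                    for t, x in zip(tiv, column)) / tiv_ss
        coefs.append((col_mean - slope * tiv_mean, slope))
    return coefs


def remove_confound(features, tiv, coefs):
    """Subtract the TIV-fitted value from every feature of every subject."""
    return [
        [x - (b0 + b1 * t) for x, (b0, b1) in zip(row, coefs)]
        for row, t in zip(features, tiv)
    ]


# Toy data: rows are subjects, columns are features.
train_X = [[1.0, 10.0], [2.0, 8.0], [3.0, 6.5], [4.0, 4.0]]
train_tiv = [1200.0, 1350.0, 1500.0, 1650.0]
test_X = [[2.5, 7.0]]
test_tiv = [1400.0]

# Fit on the training fold only, then apply the same coefficients to the
# test fold, so that no information leaks from test to training data.
coefs = fit_confound_model(train_X, train_tiv)
train_resid = remove_confound(train_X, train_tiv, coefs)
test_resid = remove_confound(test_X, test_tiv, coefs)
```

Within cross-validation, the fitting step would be repeated inside every training fold, which is what keeps the procedure free of data leakage.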
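As an illustration of what checking calibration involves, the following minimal sketch computes the points of a reliability curve: predicted probabilities are grouped into equal-width bins, and for each bin the mean predicted probability is compared with the observed fraction of men. The function `reliability_curve`, the number of bins, and the toy data are assumptions for illustration; the study itself refers to the scikit-learn calibration tools.

```python
def reliability_curve(y_true, prob_male, n_bins=5):
    """Return (mean predicted probability, observed fraction of men) per bin.

    y_true: 1 for men, 0 for women; prob_male: predicted probability of "male".
    Empty bins are skipped.
    """
    bins = [[] for _ in range(n_bins)]
    for label, p in zip(y_true, prob_male):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((label, p))
    curve = []
    for members in bins:
        if members:
            mean_p = sum(p for _, p in members) / len(members)
            frac_men = sum(label for label, _ in members) / len(members)
            curve.append((mean_p, frac_men))
    return curve


# Toy predictions for eight subjects.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
prob_male = [0.05, 0.15, 0.30, 0.35, 0.60, 0.65, 0.90, 0.95]
curve = reliability_curve(y_true, prob_male, n_bins=4)
```

For a well-calibrated model the two values in each pair are close to each other, so the curve lies near the diagonal.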
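The comparison of TIV between congruently and incongruently classified individuals can be sketched with a hand-written Wilcoxon rank sum test. The version below uses the normal approximation without tie or continuity correction, which is an assumption of this sketch and may differ from the statistical software used in the study; the function name and TIV values are hypothetical.

```python
from math import erfc, sqrt


def rank_sum_test(x, y):
    """Wilcoxon rank sum test (normal approximation, no tie correction).

    Returns (U statistic of x, z score, two-sided p value).
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the ranks i+1 .. j shared by ties
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p = erfc(abs(z) / sqrt(2))
    return u, z, p


# Toy TIV values of men classified as male vs. men classified as female.
tiv_congruent = [1500.0, 1550.0, 1600.0, 1650.0, 1700.0]
tiv_incongruent = [1300.0, 1350.0, 1400.0, 1450.0]
u, z, p = rank_sum_test(tiv_congruent, tiv_incongruent)
alpha = 0.005  # conservative significance level used in the text
significant = p < alpha
```

With such small toy groups the normal approximation is crude; the example is only meant to show the direction of the comparison, where a large positive z indicates higher TIV in the congruently classified group.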

Figure 1. Analysis pipeline. Workflow of the sex classification analysis.

Figure 2. Sex classification accuracy. Accuracy values of the four different models for the cross validation (CV) folds and applied to the AM and ATM hold-out sample.

Figure 4. Association between prediction probability and TIV for the AM and ATM models in the two application samples.The upper row (a-h) shows the prediction probability (a, c, e, g) and TIV distribution (b, d, f, h) of sex congruently and incongruently classified CM, CW, TM and TW in the AM model in sample A and B. The bottom row (i-p) shows the prediction probability (i, k, m, o) and TIV distribution (j, l, n, p) of sex congruently and incongruently classified CM, CW, TM and TW in the ATM model in sample A and B. (CW/f: CW classified as female; CW/m: CW classified as male; CM/m: CM classified as male; CM/f: CM classified as female; TM/f: TM classified as female; TM/m: TM classified as male; TW/m: TW classified as male; TW/f: TW classified as female).

Table 2. Wilcoxon rank sum tests of the hold-out samples. Comparison of individuals classified as female versus male (Wilcoxon rank sum tests) for the AM and ATM sample.