Autism spectrum disorder (ASD) is a set of neurodevelopmental disorders which are diagnosed based on behavioural abnormalities, such as deficiencies in social interaction and communication, and repetitive behaviour1. Although there is a great variability regarding the prevalence of ASD, it is evident that the number of individuals diagnosed with ASD has been considerably increasing in the last decades specially in countries with high socio-demographic indexes. These growing numbers are not explained solely by the use of newer advanced diagnostic methods, but also by the increase of risk factors for ASD2,3. To this date, there are no epidemiological studies investigating ASD prevalence in the Netherlands, however, the number of children with ASD in primary schools in 2018 was 14/1000, similar to prevalence data found in other European countries4,5. While worldwide prevalence largely varies, a 4:1 ratio male to female remains consistent across the globe6. Both genetic and environmental factors contribute to ASD development7, nonetheless, the exact underlying mechanisms that accompany this disorder are yet to be elucidated. Gastrointestinal disturbances such as diarrhea, constipation and abdominal pain are often present in individuals with ASD8. In addition, gastrointestinal problems are correlated to a higher degree of ASD severity9. Moreover, there are differences in the microbial communities colonizing the gut of ASD individuals when compared to control populations10, and some of these changes can alter multiple host’s functions hinting for plausible molecular pathways relevant in ASD11. Although abnormal gut microbiota composition has been repeatedly described in ASD, there is no consensus amongst the observed differences in bacterial abundances12. In spite of that, multiple lines of associative evidence indicate the importance of these, bidirectional, interactions between microbiota, gut and brain (also referred to as microbiota-gut-brain axis) in ASD and other neurodevelopmental disorders13.

Lifestyle, specifically diet, is a major contributing factor when studying human gut microbiome14,15. Individuals with ASD commonly lack a diverse diet compared to neurotypical individuals16. Siblings are frequently included as control subjects to better control for inter-individual variables like genetic background, household environment and dietary habits17. Several studies emphasize the importance and possible role of abnormal gut microbiota in ASD development18,19,20, and how the correction of the bacterial communities living in the gut may be an effective approach for ameliorating intestine- and brain-related problems in ASD21. For example, faecal transplantation in Phase-I clinical trial has shown to significantly improve ASD behavioural scores and ASD-associated gastrointestinal symptoms over a time frame of 2 years22.

Given the complexity of ASD pathophysiology and the lack of agreement on which gut microbes play an important role in the disorder, this study aims to identify a specific subset of bacteria that is (i) a signature for ASD classification, and (ii) reproducible among other populations. To these aims, we applied a machine learning-based algorithm, named recursive ensemble feature selection (REFS)23,24,25, in three available datasets from the analysis of gut microbiota composition in both ASD and control populations. One dataset including neurotypical siblings as controls was used for feature selection, while the other two datasets, which included unrelated age-matched children as controls, were used for validation26,27,28. Feature selection methods allow us to identify specific traits to predict certain conditions29. Thus, this study highlights the advantages of using a machine learning-based method to successfully predict ASD with the minimal number of features, in this case gut bacterial taxa. By analyzing data obtained from distinct cohorts, we suggest that our results are not dependent on other confounding factors such as lifestyle, dietary habits, and geographical region.



Datasets were selected based on (i) availability of raw 16S rRNA gene sequencing data using Illumina, (ii) detailed information regarding the subjects recruited in the study including age, sex, and diagnosis of ASD, among others, (iii) neurotypical siblings for feature selection, and age-matched neurotypical subjects for feature validation as controls, and (iv) subjects between 2 and 7 years old. Raw 16S rRNA gene amplicon sequencing data from the study David et al. was used for feature selection26. Paired-end reads 150 bp long from the V4 region of the 16S rRNA gene were sequenced in 60 subjects with autism spectrum disorder (ASD cases), and 57 siblings (controls). Two separate 16S rRNA gene sequencing datasets were used for validation, PRJNA589343 (single reads 250 bp long from V4 region in 77 ASD cases and 50 age-matched controls)27 and PRJNA578223 (paired-end reads 300 bp V3–V4 regions in 48 ASD cases and 48 age-matched controls)28. See Table 1 for the characteristics of the individuals included in the studies.

Table 1 Characteristics of the subjects included in this study.

Sequence filtering, chimera removal, and taxonomic assignment

Raw sequence reads from David et al.26 were processed using the software package DADA2 (version 1.8) under R 4.1.2 environment30. The first 10 nucleotides (nt) were trimmed from the forward and reverse reads following DADA2’s recommendation. In addition, reads with more than two expected errors were excluded from the analysis. Consecutively, the reads were independently dereplicated and denoised using DADA2’s default parameters. The resulting forward and reverse reads were merged with a minimum overlap of 20 bases. This resulted in 7160 amplicon sequence variants (ASVs) prior to the removal of chimera sequences. The removal of the chimera sequences resulted in 2040 ASVs, with only 5% of the total reads removed. Taxonomies were assigned to all ASVs using the IDTAXA method from the DECIPHER package31. A pretrained classifier based on the SILVA SSU rRNA database (version r138)31,32 was used with the IDTAXA method. See Table 2.

Table 2 Taxonomy annotation of the 26 selected ASVs using SILVA.

Raw sequence reads from PRJNA58934327 were processed following the aforementioned procedure. The set filtering parameters included no truncation and a maximum of two expected errors per read, and the lack of a merging step as this dataset consists of single-ended reads (only forward reads). This resulted in 2030 ASVs after removal of the chimera sequences. Raw sequence reads from PRJNA57822328 were processed in a similar manner with the following filtering parameters: truncate forward and reverse reads to 290 and 220 nt, respectively; and remove all reads with more than two expected errors. This resulted in 18,758 ASVs after removal of chimera sequences.

Feature selection and validation

We used REFS23,24,25, a method applied for discovering biomarkers, to determine which ASVs are appropriate for differentiating ASD cases from controls. The ensemble is composed by 8 classifiers from the sci-kit learn toolbox33: Stochastic Gradient Descent (SGD), Support Vector Machine classifier (SVC), gradient boosting, random forest, logistic regression, passive aggressive classifier, ridge classifier and bagging. It is known that working with a small number of samples can cause overfitting, to avoid this problem, REFS uses nested-cross validation in a tenfold cross-validation scheme, a proven solution to produce more accurate and unbiased results regarding the number of samples34. Prior to feature selection, the data (matrix containing ASVs’ counts) was normalized using scikit-learn’s Z-score algorithm33. Each cycle of REFS removed the 20% least important ASVs, until only one feature was left. To prevent method’s randomization from negatively affecting results, this process was concurrently run 30 times. For each run, performance metrics including averages and variances were calculated. The reduced features were selected based on the best performing cycle of the best run. To avoid bias selection of the ASVs, we performed a validation process similarly as it has been previously described in previous studies23,24,25. This process applies 5 different classifiers that do not belong to the ensemble from the sci-kit learn toolbox33: AdaBoost, Extra Trees, KNeighbors, Multi-Layer Perceptron (MLP), and Least Absolute Shrinkage and Selection Operator plus iterative process using Cross-Validation (LassoCV). The accuracy given by the average of the five classifiers in a nested tenfold cross-validation gives us an area under the curve (AUC) that estimates the power of a discriminant test, being more successful with values close to 1.0 which reflects an excellent accuracy35.

The selected top scoring ASVs, were then validated in PRJNA589343 and PRJNA578223 datasets27,28. After processing the 16s rRNA raw sequences using DADA2, top scoring ASVs were extracted from the resulting ASVs of the validation datasets. The reads’ length in both validation datasets were 250 bp and 300 bp, respectively. As the selected ASVs of the discovery dataset were shorter (150 bp), ASVs were counted in the validation datasets when an exact match was found. Abundance data of all matching ASVs was added together and treated as one during validation. The resulting filtered datasets were tested using a tenfold cross-validation with the 5 classifiers different from the classifiers used for the ensemble, and AUC of the receiver operating characteristic (ROC) curves were calculated. Whether the selected 26 ASVs were present or not in the validation datasets can be found in Table 2. For an overview of the methodology, see Fig. 1.

Figure 1
figure 1

Bioinformatic pipeline to select the optimal number of ASVs associated to ASD phenotype by applying REFS to a 16S rRNA gene sequencing dataset (discovery phase), and validation of the selected set of ASVs across different cohorts (valiation phase)26,27,28.

Differential abundance

To determine differential abundance, the reduced discovery dataset was normalized using scikit-learn’s StandardScaler scaling algorithm33. Then, differential abundances for the discovery dataset for each selected ASVs were plotted as a heat-map comparing ASD cases and controls. See Supplementary Figure 1. Likewise, normalized differential abundances were plotted for the identified ASVs found on each validation dataset. See Supplementary Figs. 2 and 3. Heat-maps were created using the python script


Feature selection and validation of case-control cohorts for ASD

For feature selection, we used a dataset of 16S rRNA gene sequences from 117 subjects (60 ASD cases and 57 siblings)26. After applying REFS, features were reduced from 2040 ASVs (processed 16rRNA raw sequences obtained with DADA2) to 26 ASVs as the optimal number of features to distinguish between ASD cases and controls (see Fig. 2A). AUC of ROC curves was used for evaluating the diagnostic accuracy of each classifier. Mean classification accuracy in a tenfold cross-validation method increases when using the set of 26 reduced features compared to the 2040 processed features (average AUC = 0.816 and average AUC = 0.41, respectively). See results in Table 3.

Figure 2
figure 2

(A) Optimal number of features, ASVs, for ASD classification applying REFS to the discovery dataset26. (B) ROC curve of the 26 selected features, ASVs, in the best performing classifier, Multi-Layer Perceptron (MLP), in REFS for the discovery dataset26.

Table 3 Accuracy of nested cross-validation (tenfold cross-validation) classifiers used for REFS on the discovery dataset26.

For feature validation, we used 16S rRNA gene sequencing data from two age-matched cohorts with a total of 223 samples (125 ASD cases and 98 controls)27,28. We evaluated the mean AUC of the ROC curve of the previously identified 26 features employing the validation datasets in a tenfold cross-validation testing five classifiers using REFS. Although not all 26 sequences were found in both validation datasets, 22 and 20 ASVs were found, respectively27,28 (see Table 2). In comparison to the classification accuracy mean of the discovery set (average AUC = 0.816), the scores of the validation sets resulted in good diagnostic accuracy (average AUC=0.748 and average AUC=0.74, respectively; see results in Table 4). Specifically looking at the classifier with the best diagnostic accuracy, MLP for the discovery dataset and Extra Trees for both validation datasets, we reported an AUC=0.90 for the discovery cohort, see Fig. 2B, and an AUC=0.84 for both validation datasets, see Fig. 3.

Table 4 Accuracy of a set of 26 features selected from the discovery dataset26 in nested cross-validation (tenfold cross-validation) classifiers in the validation sets27,28.
Figure 3
figure 3

(A,B) ROC curves of the 26 selected features, ASVs, in the best performing classifier, Extra Trees, in REFS for both validation datasets27,28, respectively.

Differential abundances of 26 specific bacterial taxa to distinguish between ASD cases and controls

To better understand the role of gut microbiota composition in ASD, we investigated the differential abundances and the taxonomy of the 26 identified features, the 26 ASVs, in the discovery dataset26. See Table 2 for the assigned taxonomies to the selected ASVs, and Supplementary Fig. 1 for differential abundances.

At phylum level, we identified 17 ASVs belonging to the phylum Firmicutes, 3 ASVs to Proteobacteria, 2 ASVs to Bacteroidota, 2 ASVs to Actinobacteria, and 2 ASVs were not assigned to any phyla. At a family level, 7 ASVs belong to Lachnospiraceae, 6 ASVs to Clostridiaceae, 3 ASVs to Enterobacteriaceae, 2 ASVs to Erysipelatoclostridiaceae, single ASVs were assigned to Bifidobacteriaceae, Prevotellaceae, Tannerellaceae, Oscillospiraceae, Coribacteriaceae and Peptoniphilaceae, and 2 ASVs were not assigned to any family group. At a genus level, 11 ASVs were not assigned to any genera, 3 ASVs belong to Clostridium, 2 ASVs to Lachnospira, and single ASVs to Bifidobacterium, Parabacteroides, Oscillospira, Sarcina, Anaerosporobacter, Collinsella, Butyricicoccus, Lachnospiraceae UCG-004, Erysipelatoclostridium and Murdochiella. Only one ASVs was assigned to species level, Eubacterium eligens.

Both genera belonging to Actinobacteria phylum, Bifidobacterium and Collinsella, were decreased in ASD cases compared to controls. On the contrary, the two bacterial taxa belonging to Bacteroidota phylum, Prevotellaceae and Parabacteroides, were increased in ASD cases when compared to controls. Among the 3 different ASVs belonging to the Proteobacteria phylum, all of them assigned to the Enterobacteriaceae family, 2 of them were increased in ASD cases while the other one was increased in controls. The latter ASV-associated bacterium, was not found in any ASD subject. Within the ASVs from the phylum Firmicutes, we reported lower abundances of Erysipelatoclostridiaceae, Murdochiella, Butyricicoccus, Clostridium, Lachnospiraceae UCG-004, and Eubacterium eligens. Within the same phylum, increased differential abundances of bacterial taxa represented by the selected ASVs included Lachnospiraceae, Clostridium, Sarcina, Anaerosporobacter, and Oscillospira. In addition, 3 ASVs assigned to Lachnospiraceae and Oscillospira were not present in any control subject.


Recently, the use of machine learning-based techniques is becoming more popular to study complex systems, and that has been the case for ASD. Artificial intelligence has been mostly applied to overcome limitations of traditional diagnostic methods37,38. If we also consider its potential use for microbiome studies in human health39, implementing machine learning to study the relationship between gut dysbiosis and ASD can give us valuable insights to understand the disorder, to better diagnoses, and to develop plausible therapies.

Microbiome analyses have multiple challenges that compromise the interpretation and reproducibility of the results across studies. Besides the importance of defining well-standardized methods to collect, store, process, and sequence the samples, it is important to agree on a consensual approach to analyze the data in order to draw correct conclusions40. Several publications have investigated the effect of confounding effects in microbiome outcomes in ASD, as well as discrepancies based on the method to process the sequencing data like clustering the data in operational taxonomic units (OTUs) rather than ASVs41,42,43,44. The present study, using the machine learning approach REFS, identified a specific set of bacterial taxa from ASV taxonomic annotation that is sufficient to optimally differentiate between ASD cases and controls in a published sibling-controlled dataset26. In addition, we demonstrated that this set of bacterial taxa can distinguish between ASD and control populations in two independent published datasets27,28, indicating the robustness of the method.

We analyzed 16S rRNA gene sequencing data using the aforementioned machine-learning approach based on integrative analysis allowing to study the compositional nature of the gut microbiome. A proper statistical practice is essential for the correct interpretation of the analysis of microbiome data as using traditional statistical methods often present assumptions and biases45,46. Looking into literature, approaches that integrate feature selection and cross-validation strategies provide a good predictive tool by discriminating between two phenotypes with the least number of relevant features47,48. REFS has been successfully employed previously in other medical research studies using microRNA expression and messenger RNA expression data49,50, indicating its large potential being applied to other kinds of biological data.

Besides reducing the relevant features for ASD classification from 2040 ASVs to 26 ASVs using one dataset, we also tested how important these ASVs are for ASD classification, thus their application as a predictive tool, by validating them in other two independent datasets. Overall, our results showed good prediction accuracy means. For the discovery dataset, the best-performing classifier reported an AUC value of 0.90 using the dataset from David et al.26. During the validation phase, the best-performing classifier showed an AUC value of 0.84 for both datasets Ding et al.27 and Zou et al.28 when the 26 selected ASVs were used for ASD status classification. According to Šimundić35, these results indicate a highly accurate predictive method for discriminating ASD phenotype using 16S rRNA gene sequenced data despite of the expected high variability of the included populations in the study as they have different geographical regions, and most likely different lifestyles. Using our proposed approach to further analyze existing microbiome data or new data from longitudinal-designed studies including ASD and control populations will be valuable to better understand, and corroborate, the role of these 26 selected bacterial taxa in ASD development and progression.

A recent study by Yap et al. did not find grounds for an associative link between the gut microbiota and ASD diagnosis, implying that differences in the gut microbiota composition of ASD subjects are solely a consequence of ASD-related feeding behaviour51. However, some research has been done regarding the potential use of gut microbiome data for ASD diagnosis making use of machine learning approaches. In three compelling meta-analyses, Wu et al., Chavira et al. and Pietrucci et al. processed gut microbiome data across different existing ASD studies concluding that gut dysbiosis is associated to ASD, nonetheless, other factors such as sex, age, geographical region and lifestyle cannot be ignored when studying the role of intestinal microbes in ASD52,53,54. All meta-analyses used a fivefold cross-validation approach for machine learning classification on taxonomic annotation data, but the number and type of classifiers differed between the studies. In Chavira et al.53, they examined how taxonomic resolution impact predictive accuracy concluding that the higher the taxonomic resolution is, the better the models’ performance is. Additionally, Pietrucci et al.54 looked at the importance of the control groups for ASD classification using three different classifiers. Their results reported better accuracy when unrelated control groups were used for ASD classification. It can be explained by the fact that siblings have more similar microbiomes than unrelated individuals as they share more alike environment, lifestyle and genetics55. While in these three studies a collective analysis was performed to study differential abundances of the intestinal microorganisms and the most relevant features for ASD classification, we applied feature selection in a sibling-control cohort and tested the selected features in two independent cohorts showing that other confounding factors are not interfering the reliability of the results. Moreover, the input data in our approach are ASVs rather than OTUs or taxonomic annotation allowing us to identify specific bacterial taxa with one-nucleotide difference56.

Many of the studies investigating the influence of the gut bacteria in ASD pathophysiology, targeted microbiota composition by analyzing 16S rRNA gene sequencing data, measuring relative abundances of the present bacteria, and then searching for correlations between each individual bacterial taxa with other ASD-related traits. A serious limitation of using conventional statistical strategies in this type of compositional and high-dimensional data can lead to erroneous outcomes57,58. A possible way to overcome this problem is the use of multivariate analysis rather than univariate analysis, thus our study uses a multivariate, multidataset approach. Here, we identified 26 ASVs that, as a community, can separate ASD phenotype from neurotypical controls.

Moreover, looking at the differential abundances between ASD cases and controls in the discovery dataset, we also observed noticeable changes between ASD and control cases. In line with several studies that reported lower faecal Bifidobacteria abundance in children with ASD compared to neurotypical children59,60,61,62,63, we also identified an ASV annotated as Bifidobacterium that was decreased in ASD cases compared to controls. One of the plausible pathways explaining the involvement of gut bacterial imbalance in ASD is metabolism. In a study, germ-free mice received human faecal microbiota transplantation using samples from either children with ASD or typically-developed children. The results showed alterations in tryptophan metabolism, and Bifidobacteria changes correlated with differently abundant metabolites from derivatives of tryptophan metabolisms64. Furthermore, dietary interventions either with Bifidobacterium strains or other bacterial strains and/or compounds that increase Bifidobacteria levels in the gut have shown to improve, not only intestinal outcomes, but also ASD severity65,66.

On the other hand, we observed increased bacterial taxa in ASD phenotype including Clostridia, Sarcina and Parabacteroides, among others, that have been also found to be increased in children with ASD10,59,62,67,68,69. Interestingly, one of the aforementioned meta-analysis of gut microbiome data in ASD, also discovered a relevant feature for ASD classification belonging to the Sarcina genus, being increased in ASD cases compared to control cases53. Although commensal Clostridia in the gut help to develop and maintain an intestinal homeostatic state and accordingly a balanced functioning of host’s biological processes, abnormal levels of some members of this genus have been widely linked to health problems including neurodevelopmental disorder susceptibility70. Molecular pathways by which Clostridia can influence ASD involve metabolic, immunological and physiological processes71,72. Additionally, antibiotic usage against these bacteria has shown beneficial effects in behavioural parameters of ASD studies73. However, there is no consensus among other studies indicating that the beneficial effects stopped once the antibiotic intervention was finalized, and furthermore pre- and post-natal antibiotic use has been linked to ASD development in several studies11,74. In our study we also observed a reduction of ASVs associated to Butyricicoccus and Eubacterium eligens known to exert beneficial effects by modulating immune response and producing health-promoting compounds as short-chain fatty acids75,76,77. Recently, in a propionic acid rat model for ASD it was shown that therapeutic interventions like Bifidobacterium longum or faecal transplantation ameliorated gut dysbiosis restoring Butyricicoccus levels78. Taking into account all these associative evidence, targeting the gut microbiome with dietary interventions might improve ASD-related complaints. However, we need to comprehend the biological meaning of bacterial changes in ASD to check the added value of specific bacterial taxa as new diagnostic tool, and the use of gut microbiome modulation for improving ASD-related complaints.

A limitation of the current study is the different ASV length after processing the raw 16S rRNA gene sequencing data for each dataset. Because of the selected ASVs are shorter than the ASVs from both validation datasets, we cannot make a clear association between the specific bacterial taxa of the discovery dataset and the validation datasets. Most of the 26 selected ASV were taxonomically assigned to family or genus level, nevertheless, the amount of bacterial species and strains comprised in a single family or genus is remarkably high79. When the 26 ASVs were matched with the validation ASVs, there were multiple cases that more than one validation ASV contained a selected ASV indicating that different but closely related bacteria share that specific sequence. In addition, confounding factors that are known to influence the gut microbiota should be further explored in the context of ASD. For instance, evidence show that sex differences potentially have an important role in ASD patophysiology and support the observed male-sex bias80,81, and specifically in analysis targeting the gut microbiota82.


Overall our results demonstrate a strong microbiome signature for the classification of ASD in three independent cohorts. Our approach identified 26 features, bacterial taxa, that distinguish ASD cases from control cases with high accuracy. The method that we propose overcomes problems of bias and overfitting results by selecting the smallest number of relevant features identifying ASD status using several classification algorithms that rank the features differently. The discovery of a robust set of bacterial taxa associated to ASD phenotype can potentially be used for diagnostic purposes, and it might provide new insights into plausible molecular mechanisms of the gut-brain axis in ASD. However, further studies should aim to understand the biological significance of these specific bacteria in ASD pathophysiology, and additional data such as metabolic function of the gut microbiome may be of great importance to pursue this line of investigation.