Ethnicity influences the gut microbiota of individuals sharing a geographical location: a cross-sectional study from a middle-income country

No studies have investigated the influence of ethnicity in a multi-ethnic middle-income country with a long-standing history of co-habitation. Stool samples from 214 Malaysian community members (46 Malay, 65 Chinese, 49 Indian, and 54 Jakun) were collected. The gut microbiota of the participants was investigated using 16S amplicon sequencing. Ethnicity exhibited the largest effect size across participants (PERMANOVA Pseudo-F = 4.24, R2 = 0.06, p = 0.001). Notably, the influence of ethnicity on the gut microbiota was retained even after controlling for all demographic, dietary factors and other covariates which were significantly associated with the gut microbiome (PERMANOVA Pseudo-F = 1.67, R2 = 0.02, p = 0.002). Our result suggested that lifestyle, dietary, and uncharacterized differences collectively drive the gut microbiota variation across ethnicity, making ethnicity a reliable proxy for both identified and unidentified lifestyle and dietary variation across ethnic groups from the same community.


Gut microbiota analysis. Each participant was provided with a Fisherbrand Commode Specimen Col-
lection System and a cooler box containing cooler packs to store their fecal samples. The fecal consistency was self-measured by participants assisted with a provided Bristol Stool Scale (BSS) chart. The fecal samples were collected and transported back to the laboratory the next day, where they were kept at − 50 °C until further analysis.
DNA extraction was conducted on the collected fecal samples from April through June 2019 using protocol Q of the International Human Microbiome Standards 13 . Briefly, fecal DNA extraction was conducted using the QIAamp DNA Stool Mini Kit in combination with a zirconium bead-beating method. The V3-V4 region of the 16S rRNA gene was then amplified using the primer pair 341F and 805R. The metagenomic library was then prepared according to the Illumina 16S Metagenomic Sequencing Library Preparation and amplified using the Illumina Miseq platform with 2 × 250 bp paired-end sequencing.
The raw sequence data yielded 16,723,707 reads, with a mean of 72,397 reads per sample. Primers and barcode sequences were then removed using the R package DADA2 14 , yielding 14,548,208 reads, with a mean of 62,979 reads per sample. After inferring amplicon sequence variants using DADA2, the abundance of each sequence variant was exported into Pathway Prediction by Phylogenetic Placement (PAPRICA) pipeline version 0.5 15 for phylogenetic placement. In brief, the sequences were aligned and placed into the near-full-length 16S rRNA gene reference tree. Phylogenetic groupings, or "edges, " was inferred based on the placement of the sequences in the reference tree. An edge can be referring to either the terminal node of the tree, or a path connecting two nodes. Unique numbers were assigned to each edge, and the number of consensus sequences in the same edge was recorded into the abundance table for further downstream analyses. Similarly, the abundance data of each edge was utilized for the prediction of metabolic pathway abundance by feeding them into the reference MetaCyc metabolic pathway database 16 . Statistical analyses. Count data were filtered to exclude edges and pathways with less than 1000 counts prior to the analyses. The α-diversity indices of Shannon diversity and Pielou's evenness were inferred using the phyloseq package version 1.30.0 17 . For ß-diversity analysis, the edge and pathway data containing the bacterial taxa abundance profiles was transformed using centered-log transformation using the function aldex.clr in the ALDEx2 R package version 1.22.0 18 , to account for the compositionality of high throughput sequencing data 19 . The transformed data was then ordinated using principal component analysis with Euclidean distance using the R package vegan version 2.5-6 20 . Afterward, the differences across factors were analyzed using Permutational Multivariate Analysis of Variance (PERMANOVA) implemented in the adonis function under the R package vegan version 2.5-6 20 . The Prevotella: Bacteroides ratio (P:B) and its 95% confidence interval was analyzed using linear mixed model by inputting the log-transformed P:B as the dependent variable with household as the random effect, to account for the household clustering of the participants, using the package lme4 version 1.1-21 21 . The presence of gut enterotypes were analyzed using the R package DirichletMultinomial version 1.28.0 22 . The differential abundance of taxa across factors was analyzed using the generalized linear model wrapped in the R package ALDEx2 version 1.22.0 18 , with p-value for multiple-groups comparison adjusted using the Benjamini-Hochberg method with a threshold for significance of 0.1.

Results
Population demographics. This study recruited participants from Segamat, a district located in southern Peninsular Malaysia. Fecal samples from 231 participants in 110 families were obtained. After excluding participants with missing data, 214 participants from 106 families were included for analyses. The participants were equally distributed across sex and ethnicity (Table 1). There was an equal occupational distribution across ethnicity, with most participants (n = 60) being homemakers. Every ethnic group also exhibited similar economic status, with most households in each ethnic group earning between RM1001-5000 monthly. However, a significantly higher proportion of Jakun and Chinese earned the least (RM401-700) and the most (> RM5000), respectively. Overall, the mean age of the participants was 44.25 ± 19.59, ranging from 10 to 83 years old. A higher proportion of Chinese and Jakun belonged to the older and younger age quartiles, respectively (Chi-Square test, p < 0.05, Supplementary  Table S2). These factors were tested against the Shannon diversity index, which measures the richness and proportion of different bacterial taxa, and Pielou's index, which measures the evenness in the distribution of the taxa abundances. Bristol stool scale (BSS) (Fig. 1a,e) and BMI (Fig. 1b,f) exhibited significant differences across both indices, showing a negative association with diversity (BSS scale 3 vs. scale 6; BMI healthy vs. obese, Tukey's HSD test, p < 0.05). Household income was significantly associated with diversity but had similar evenness (Fig. 1c,g). Participants in the lower-income group exhibited a higher diversity, although Tukey's HSD test found no significance after correcting for multiple testing. Meanwhile, diversity and evenness were similar across ethnicity (Fig. 1d,h, p > 0.05).
Ethnicity exhibited the strongest effect on the gut microbiota. PERMANOVA univariate analysis identified 16 factors that were significantly associated with the GM (p < 0.05). These factors can be categorized into either demographic, dietary behavior, hygiene practices, or health conditions (Supplementary Table S2, Fig. 2). Altogether, these factors accounted for 23.49% of the variation observed in the GM. Among the factors analysed by univariate analyses, ethnicity exerted the largest effect size (Fig. 3, PERMANOVA Pseudo-F = 4.24, R 2 = 0.06, p = 0.001). A multivariate analysis was then conducted, which revealed that ethnicity was still a sig-    Table S3).
Ethnicity-specific microbiota and their association with host demographics, lifestyle, diet, and hygiene. In a previous study, Arumugam and colleagues 23 suggested that the human microbiome can be broadly stratified into three enterotypes: Bacteroides, Prevotella, and Ruminococcus-dominant. Using Dirichlet Multinomial model 22 , three clusters were detected in our dataset, namely Prevotella-dominant, Bacteroides-dominant, and Bifidobacterium-dominant enterotypes (Fig. 4a, Supplementary Fig. S1). Interestingly, a higher proportion of the Prevotella-(type 1) and Bacteroides-(type 2) dominant enterotypes were exhibited by Jakun and Chinese, respectively. Both Malay and Indian were equally distributed in the first and second enterotype. Supporting this observation, a linear mixed model analysis on the ratio of Prevotella to Bacteroides (P:B) resulted in a gradient, with Jakun, Indian, Malay and Chinese exhibiting the largest to the lowest P:B ratio (Fig. 4b) (likelihood ratio test = 0.002).
To identify taxa which were associated with specific covariate indices (either demographic, dietary, health or hygiene), we conducted differential abundance analyses which were controlled for all significant covariates detected by the univariate PERMANOVA analysis (Fig. 2) except the index of interest (Fig. 5). Additionally, the demographic and dietary covariates' adjustment included other variables such as occupation and fruits intake (Supplementary Table S2) to account for dietary and demographic factors being potential major confounders to the gut microbiota 24,25 . Significant differences were identified for the Indian-Malay and Jakun-Indian ethnic pairings, where Malay exhibited an elevated abundance of two unclassified Clostridiales (edge 1741 FDR = 0.01; edge 1719, FDR = 0.06) associated with the hygiene index. Meanwhile, Jakun and Indian were differentiated by hygiene index, with the former associated with a higher prevalence of Klebsiella quasipneumoniae (edge 12,145, FDR = 0.03) in the former and Bifidobacterium longum (edge 5621, FDR = 0.05) in the latter. Additionally, Bifidobacterium catenulatum (edge 5721, FDR = 0.09) was also elevated in Indian, which was associated with health covariates. Interestingly, two edges of the Clostridiales order were found to be elevated in Malay compared to Indian even after adjusting for all these covariates (edge 1725, FDR = 0.03; edge 1741, FDR = 0.05).
The functional metabolic pathways of the microbiota were also compared (Fig. 6). Notably, functional pathway differences between Jakun and Malay was attributed to hygiene covariates, with Jakun exhibiting elevated pathways relating to pyruvate fermentation, protocatechuate degradation, NAD biosynthesis, methylglyoxal   www.nature.com/scientificreports/ degradation, and maltose degradation pathways. Meanwhile, Indians had a higher elevation of L-arginine and fatty acid biosynthesis pathways compared to Chinese and Malay, which was attributed to demographic and health index, respectively. Substantial variations across lifestyle parameters were observed across the four ethnic groups (Supplementary Table S2). Chinese consumed the most pork, while Jakun and Malay had higher ulam (a group of traditional vegetables commonly eaten as delicacies or side dish) consumption compared to Chinese and Indian. Across hygiene indices, Jakun had the least access to piped water for both drinking and bathing. Also, a higher proportion of Jakun utilized borehole toilets compared to the other ethnic groups, who overwhelmingly used flush toilets. Additionally, Indian exhibited the highest diabetes burden in the dataset.

Discussion
This study provides novel insights into the effect of ethnicity on the GM of a multi-ethnic community. In the absence of geographical variation, ethnicity exerted a significant influence on the GM of participants from Segamat. Notably, the significance of ethnicity was retained after adjusting for demographic, dietary and other significant covariates, albeit at a smaller effect size.
It is worth disclosing the natural clustering of each ethnic group in different sub-districts, resulting in a collinear relationship between ethnicity and sub-district in our dataset. Jabi had the heaviest density of Malay and was located approximately 50 km away from both Bekok and Chaah, which were neighboring sub-districts. Nevertheless, all sub-districts were under the same Segamat municipality and had similar environmental exposure and demographic distribution 12 . As such, all statistical adjustments conducted in this study excluded the sub-district covariates despite its significance.
The significance of ethnicity on the GM was likely driven by the unique lifestyle patterns exhibited by each ethnic group as captured through the administered survey. Notably, the higher proportion of Prevotella-dominant enterotypes in Jakun was likely driven by higher consumption frequency plant-based foods as reflected in the higher consumption of ulam in Jakun. Ulam itself is a traditional salad which comprises multiple types of traditional vegetables and is widely consumed by the local community either as a delicacy or a side dish 26,27 . Importantly, ulam has been reported to contain high antioxidant properties 26 .
Additionally, differential abundance analyses indicated that the gut-ethnic variation observed could be attributed to either health or hygiene indices. In our dataset, three health indices were significantly associated with the GM: bristol stool scale, diabetes, and medication. Clostridiales has been previously inversely correlated with type II diabetes 28 , which likely drove their reduced abundance in Indian, considering that Indians had a significantly higher diabetes burden, both nationally 29 and in our dataset. Meanwhile, the genus Bifidobacterium is frequently associated with the consumption of probiotics and fermented food 30 . It is therefore intriguing that two edges belonging to this genus were elevated in Indian due to health and hygiene indices, but not diet. As noted www.nature.com/scientificreports/ earlier, this could be due to the dietary variations which were not captured in this study. It is worth noting that Bifidobacterium is commonly reported to be elevated in South Asians, which includes Indians, both in infants and adults 5,31,32 . Regardless of whether this association is attributed to dietary or genetics, as hypothesized by others 5,32 , it is interesting to investigate the higher burden of diabetes in South Asian despite being consistently elevated in Bifidobacterium, which has been strongly associated with an alleviation of diabetes conditions 33 . Altogether, these outcomes suggested that "ethnicity" is a manifestation of multiple lifestyle factors, which might have a low effect singly, but collectively resulted in a unique GM profile that distinguished the ethnic groups in Segamat.
Interestingly, the influence of ethnicity was retained even after adjusting for all significant covariates, with differential abundance analysis observing the elevation of two Clostridiales edge in Malay compared to Indian (edge 1725, FDR = 0.03; edge 1741, FDR = 0.05). Previously, the influence of genetics on the GM has been hypothesized to be in the 1.9-8.1% range 34 . It remains to be seen whether part of this ethnic influence in our study was driven by genetic factors or they were merely a manifestation of lifestyle variables which were unaccounted for by this study.
Understanding predicted pathways could provide a better picture due to the functional roles of the GM 15,35 . Notably, the elevation of NAD biosynthesis pathway in Jakun could be related to their higher frequency of plantbased food consumption, where NAD biosynthesis has been associated with the presence of tryptophan 36 , a plant-derived amino acid 37 . The elevation of protocatechuate degradation could have arisen from the abundance of aromatic hydrocarbons in the Jakun's gut environment 38 . Aromatic hydrocarbon is commonly found in leafy vegetables 39 . In addition, elevations of pyruvate and maltose metabolism suggested the higher carbohydrate consumption among Jakun. Meanwhile, the elevation of fatty acid biosynthesis in Indian suggests the abundance of precursor metabolites for fatty acid synthesis in their gut environment, which might be indicative of a highfat dietary preference. This outcome could also be associated with the significantly higher BMI exhibited by the Indians in Segamat. It is worth noting that most of these functions were more closely associated with energy metabolism. It is possible that the non-exhaustive nature of our dietary survey likely resulted in the collinearity of these factors with non-dietary indices.
Urbanization has been associated with variations across α-diversity measures 1,2,40 . Higher intake of resistant starch and a plant-based diet is typical among rural populations compared to their urbanized counterpart, which typically had a diet high in protein and fat content. Even though Jakun resided in a more rural setting, in the fringes of the jungle, the similar economic status and availability of common foods precluded any major impact of urbanization in our samples. Davenport and colleagues 3 had observed the lack of difference in the Shannon index among the self-isolating Hutterites across different seasons, which influenced the consumption of fresh fruit and vegetable produce during the summer and winter seasons. The absence of such a difference indicates the presence of other confounding factors exerting effects on bacterial diversity, which we have yet to establish.
The analysis of a subset of healthy participants exhibited the largest effect size of ethnicity. Among those excluded were diabetic, hypertensive, and obese participants, as these factors have been associated with changes in the GM [41][42][43] . Apart from diabetes, there was an equal proportion of hypertensive and obese participants across ethnicity. Thus, it was likely that their inclusion in the overall dataset partially masked the ethnicity effect. Participants suffering from the same conditions were expected to exhibit a similar GM profile, consequently clouding the variations across ethnicity. Our result of ethnicity-specific healthy GM is consistent with previous studies 35,44 . The mapping of core "healthy GM" is essential to pave the way for personalized medicine, given the strong implication of GM in the etiology and prognosis of various diseases.
A comparison of the gut microbiome between Chinese and Indian students have previously been conducted in Singapore 45 . Prevotella and Lactobacillus were more abundant among Indians, which was attributed to higher consumption of whole wheat by Indian students, along with the dominance of animal fat and protein in the diet of the Chinese students. Like our study, they found that Indian students exhibited a higher Prevotella abundance compared to Chinese. It is unclear whether the Chinese and Indian students in their study were of national or foreign origin. A difference in their nation of origin would be expected to introduce variation in their gut microbiome. More crucially, the Singaporean study only employed 16 participants, which may not be an adequate representation of the community. As a side note, the comparison of gut profiles across studies is inevitably affected by differences in sample and data processing 13,46 .
This study is not free from limitations. Firstly, we did not obtain information on plant-based food intake other than ulam consumption. Although we detected the significance of ulam on the GM, it has to be noted that ulam may not be a proxy for overall plant-based food consumption frequency in all ethnic groups, especially considering the skewed preference for ulam across ethnicity in Malaysia 47 . Also, the use of week-long food recall questionnaire instead of a comprehensive food diary to record the participants' dietary habit prevents an in-depth GM-dietary association from being conducted. However, an in-depth food diary was deemed inappropriate due to concerns over participants' compliance. Additionally, the absence of a lifestyle food intake analysis presents a possibility that some participants did not practice their typical dietary habit during the week of sampling. Nevertheless, dietary variation across ethnicity in Malaysia has been reported before 48,49 , minimizing the possibility that the significant dietary differences across ethnicity observed in this study was due to irregularity in the dietary patterns of the participants during the sampling week. Furthermore, all the information used in this study, except for age, BP and BMI, were based on self-reported answers, which would have undoubtedly exposed our data through recall bias 50 .
In conclusion, the influence of ethnicity on the gut microbiota was detected from a community living in the same geographical region. This influence could be traced to the collective effect of multiple lifestyle factors exerting subtle yet distinct differences across ethnicity. Ethnicity, therefore, serves as a proxy for lifestyle and dietary variations across different ethnic groups living in the same community. Future studies on the GM should consider the impact of ethnicity to ensure valid interpretation of their study outcome.