Multi-kingdom gut microbiota analyses define COVID-19 severity and post-acute COVID-19 syndrome

Our knowledge of the role of the gut microbiome in acute coronavirus disease 2019 (COVID-19) and post-acute COVID-19 is rapidly increasing, whereas little is known regarding the contribution of multi-kingdom microbiota and host-microbial interactions to COVID-19 severity and consequences. Herein, we perform an integrated analysis using 296 fecal metagenomes, 79 fecal metabolomics, viral load in 1378 respiratory tract samples, and clinical features of 133 COVID-19 patients prospectively followed for up to 6 months. Metagenomic-based clustering identifies two robust ecological clusters (hereafter referred to as Clusters 1 and 2), of which Cluster 1 is significantly associated with severe COVID-19 and the development of post-acute COVID-19 syndrome. Significant differences between clusters could be explained by both multi-kingdom ecological drivers (bacteria, fungi, and viruses) and host factors with a good predictive value and an area under the curve (AUC) of 0.98. A model combining host and microbial factors could predict the duration of respiratory viral shedding with 82.1% accuracy (error ± 3 days). These results highlight the potential utility of host phenotype and multi-kingdom microbiota profiling as a prognostic tool for patients with COVID-19.

Here and elsewhere I find the color schemes confusing. It is fine that red is Cluster 1 throughout and blue is Cluster 2 throughout, but then it is confusing that bacteria are colored the same red, the mycobiome is colored the same blue, and the virome is colored the same green as the chi squared tests.  But now I think it is just a color scheme issue, as the legend says baseline (red like Cluster 1) and 6month follow-up (blue like Cluster 2). Also, reading the legend I see that C = Cluster 1 and D = Cluster 2. This should be labeled on the figure, just like with A + B.
Reviewer #2 (Remarks to the Author): Summary Liu et al. present in their manuscript an analysis of COVID-19 patients with and without post-acute COVID-19 syndrome based on their gut microbiome profiles, and additional demographic and clinical data. While the herein included data and the described analysis are interesting and valuable, not least in the light of the ongoing pandemic, the manuscript often lacks important information as well as a more thorough discussion. In particular, the code and data required to reproduce the study are missing, and many passages in the Results and Methods sections are confusing (e.g. inconsistency in sample numbers and insufficient description of the analysis). Most importantly, the authors have already published a very similar study, which is also being cited in the present manuscript (http://dx.doi.org/10.1136/gutjnl-2021-325989, reference 10). However, the newly obtained results described here are not being discussed in the context of the previously published study.

Major comments
Comment 01 The authors demonstrate interesting correlations of the taxa with COVID-19 severity. However, aside from the single mention of the use of multi-omics for weighted similarity networks, the use of multiomics is very limited. Therefore the title could be reassessed and revised. Importantly, given the availability of other COVID-19 datasets, it may be important to validate their in silico findings in a random publicly-available subset to ensure their robustness.

Comment 02
The post-acute COVID-19 syndrome (PACS) is defined as the presence of long-term complications observed after 4 weeks after the onset of symptoms (L338-340). This is a rather short time period and it is not clear from the manuscript whether these symptoms were reassessed at the 3 and 6 months time points to reevaluate their persistence.
The authors do not provide the information required to reproduce their study: there is no reference to a code repository to reproduce the described analysis; there is no raw data accession in the Data Availability statement and the one provided in the Methods section (i.e., PRJNA714459) is from another already published study; there is no sample metadata including the demographic information, laboratory measurements, PACS symptoms etc.; there is no processed data such as the microbiome profiles, built prediction models, and data used to generate the figures.

Comment 04
Sample numbers.
There seems to be some discrepancy in the number of samples and missing information on the sample size for some analyses. For example, the Abstract states 296 fecal metagenomes, however 293 is mentioned in L92. Furthermore, the authors describe the assessment of only 79 fecal samples for metabolomics (L94). It is unclear why or how this number was arrived upon. Additionally, it is not evident whether the profiles from 79 samples are representative of the 133 patients (or 296 fecal) samples collected. Similarly, the viral load was assessed only for 94 samples (upper and lower respiratory tract, and fecal samples, L373). There is also no information on the missing stool samples: there have to be 3*133=399 fecal samples in total so there are 103 to 106 missing samples. It is not stated whether the missing samples affect certain time points more than others.
A full list of the samples with the appropriate metadata should be provided for review. Any sample that was not used for each of the various assays should be clearly indicated with justification. In addition, statistical justification along with a stratification of the different patient/sample groups is required.

Comment 05
Found patient clusters and comparison to previous studies.
It is interesting to see two distinct clusters based on the similarity network approach. However, the description lacks some details. For example, it is not clear whether "pooling" in L101 is referring to the weighted similarity network fusion approach and whether "baseline" in L97 describes the time point "at admission".
The authors state that the found two clusters had similar demographic parameters except for age: Cluster 1 contained older patients than Cluster 2 (L122-123). Given that the risk of higher disease severity for COVID-19 increases with age, it cannot be ruled out that the found clusters represent different age groups and that the observed differences in microbiome profiles and symptoms might be the consequence of that. The authors should consider this option and investigate whether patients' age is correlated with their symptoms.
For the time point "at admission", there is also the possibility that the patients were admitted to the hospital at different stages of infection which might be also reflected in their microbiome and viral load. The authors should discuss this in the manuscript as it might also have contributed to the observed cluster formation.
Could there be a bias based on the geographic origins of the patients? It is also likely that the "severe patients" were housed in the same if not similar wards, compared to those with milder symptoms. Based on the Methods section, the possible biases induced due to housing conditions at the hospital (i.e. rooms) are unaccounted for. A reader would benefit from such information which can be contributing factors to microbiome profiles (see https://doi.org/10.1128/msphere.01007-21).
Finally, the authors state that the microbiome was stable over time within the clusters (L159-161). Furthermore, the majority of patients in Cluster 1 (84%) had PACS while only 44% were present in Cluster 2 (Fig. 3C, L166-168). However, in their other study, the authors also describe a recovery of the gut microbiome in patients without PACS at 6 months after disease resolution (http://dx.doi.org/10.1136/gutjnl-2021-325989, reference 10). Based on that, one would expect to see at least some temporal changes in Cluster 2 containing patients who did not develop PACS. Given this discrepancy, the authors should additionally compare the microbiome profiles (alpha-and beta-diversity) of patients with and without PACS and also across the different time points. In general, while the authors do cite their previous study in the present manuscript they do not compare the newly obtained results to previous observations made in COVID-19 patients, though both studies have certain similarities in their design.

Comment 06
Microbiome profiling and data integration.
The authors state that the microbiome profiling was done to generate datasets representing three biomes: viruses (using a custom assembly-based approach), bacteria and fungi (using MetaPhlAn3 and MiCoP). These data sets were then combined using weighted similarity network fusion (WSNF). There are multiple points which have to be clarified regarding this part of the analysis. 1) MetaPhlAn3 includes in its database information bacterial, archaeal, viral, and eukaryotic genomes. Similarly, MiCoP estimates abundance of viral and fungal organisms. It has to be clarified whether the authors used all obtained hits from these two tools or if they limited the output only to specific organism groups, e.g. bacteria and fungi. If all hits were retained then it should be stated how the authors deduplicated the results to avoid multiple hits to the same species from different tools and approaches.
2) The WSNF method assigns weights to the different biomes based on their richness: the number of species found in each biome. The number of observed species is highly affected by the used tools and their databases. Does the used approach account for that? Moreover, the authors should reference the respective publication describing the used WSNF approach.
3) The authors should state why they decided to pursue the assembly-based custom approach for virome profiling instead of using the output from MetaPhlAn3 and/or MiCoP.
4) It is not completely clear from the text whether the integration of the three biome datasets was done using all available samples or for each time point separately (i.e., at admission, 3 months, and 6 months).
Comment 07 L99/L457-458: The authors conveniently use a prevalence cutoff of 5% when filtering the dataset. However, it is unclear how they arrive at this number. Additionally it should be justified with appropriate references (see PMID 33510727). Also, it is not clear how the "average abundance of 1%" was used for filtering and why the authors chose the 1% cutoff. Moreover, the authors should state how many features were removed from the dataset using their filtering criteria and how many remained and were used for further analysis.
Comment 08 L137: The authors claim that blood urea levels were correlated with disease severity. However, based on the figures, a majority of the points are beyond the 2x standard deviation, suggesting a potential artifact due to outlier samples. This analysis has to be adjusted for outlier effects and assessed critically.
Comment 09 L205: The authors claim an 'ensemble-based' method for the network analyses. However, the Methods section only describes using the "Reboot" approach. It must be noted that Reboot is quite outdated (2012) and since then additional/better methods for assessing sparse compositional matrices have been introduced. This is evident from Schwager et al.'s work (PMID: 29140991). It may be prudent to use an alternative method to corroborate the findings described herein.
Comment 10 L218: The authors demonstrate an example with C. spiroforme. However, it is unclear why this taxon was chosen. Especially given the fact that R. gnavus was also part of the description. It may be more prudent to use R. gnavus as an example to state the case, or explain why it did not show the expected correlations with the severe cluster.

Comment 11
L226: It is surprising to observe that the "density" of the network is low, and yet the connected component value is 1. Could the authors explain the phenomenon driving this? In network theory, a highly connected component usually arises from a highly dense network, so this finding is counterintuitive.

Comment 12
For the patient stratification model, it seems that the clustering results obtained from the complete dataset were used as patient labels. If this is the case, then there might be an overfitting issue as the clusters and their optimal number were defined based on all available data. Therefore, the validation set used later does not completely represent truly unseen data as described by the authors (L472). A cleaner approach would be to perform the clustering using only the microbiome profiles of the training set. Expected labels for the validation set could be generated, for example, by a k-nearest neighbors approach to be able to compute the model accuracy. Another point is the feature ranking and selection procedure. Here it is not clear how exactly this was done (i.e., which data was used for feature selection, training and validation) and what the final model is. Additionally, the authors should not only look at accuracy but also consider other performance measures and their variance (in addition to the mean value).
Similarly, the paragraph about the RF model used to predict the time period of patients being SARS-CoV-2 positive lacks some details to understand the performed model training and validation procedure. The authors should try to provide a more concise and clear description of this analysis step.
Minor comments L204-205: It would be better to not introduce additional names or labels for Cluster 1 and Cluster 2.
L46: R.gnavus are commensal taxa found in many mammalian microbiomes. They are not pathogenic bacteria, unless in a particular context such as septic arthritis. Please refrain from using this misleading description L90-92: The text reads such that viral loads were checked in PBMCs. Please rephrase.
L111-114: Please show the 95% and 99% confidence intervals and provide the PERMANOVA analyses results as a supplementary table with special emphasis on the F statistic.
L130: Additional details regarding how the functional profiling was performed are needed.
L138: Additional details about how specific microbiome species were associated with elevated urea levels are needed.
L209-211: Please provide the correlation coefficient and p-value cutoffs in the results with respect to what is termed a "co-occurence" and what is not.
L344: It is unclear if all the patients were housed in the same hospital/quarantine facility. Please elaborate.
L363-364: Why do the authors mention the antibiotic treatment if having no antibiotics therapy 6 months before, during and 6 months after the SARS-CoV-2 infection was one of the inclusion criteria (L344-346)? L379: Were the whole blood samples collected at admission? L386-404: Which samples were analyzed here? All 293 stool samples? L432: The subsection title "Bioinformatic Viral Processing" is confusing. E.g., "Viral profiling of metagenomics data" or similar would be more appropriate.
L437: Were the contigs deduplicated? L448-450: How were the OTUs defined and generated from the reads mapped to contigs? Why was the normalization done by total number only and not also by the OTU sequence length? L464-465: Which software was used for patient clustering? What is the relationship between the eigengap method to choose the optimal number of clusters and the number of nearest neighbors? L467: While the features used for model training are described in this paragraph the authors do not explicitly say what the response variable is.
L470-473: Although the training and validation datasets are described, the 'test' dataset is missing in the description. Please clarify. L489-493: Is the result of the training procedure the "sparse model" or how exactly was that model generated after ranking the features? L503-504: Which variables are described here? Why were categorical variables represented as numeric values (numbers and/or percentages)? Was the microbiome data normalized or transformed?
L538: USPTO application number is missing.

Reviewer #1 (Remarks to the Author):
Qin Liu et al. present an analysis of many types of microbiome and host data that effectively predict patient COVID-19 outcomes. As far as I can tell, the study was done well and the analyses sound. I have no major complaints.
Response: We thank the reviewer for the positive comments. We present a revised version of the manuscript based on this reviewer's additional comments.
Intermediate concerns: 1. Overall, the grammar is not horrible, but has room for improvement, particularly in the discussion.
Here are some examples: Following the Reviewer's comment and advice we have now Line 39: odd phrasing to say "viral load of respiratory samples" Response: Thank you. We have rephrased this sentence to "viral load in 1,378 respiratory tract samples". (Line 40) Line 90: This sentence is confusing and seems like cytokines are part of viral RNA load quantification? Rephrase or perhaps break into two sentences.
Response: We have divided this sentence into two sentences. (Line 92-95) "We assessed viral RNA level in nasopharyngeal swabs and fecal samples using quantitative PCR with reverse transcription (RT-qPCR). We also assessed plasma cytokines and chemokines levels and leukocyte profiles in freshly isolated peripheral blood mononuclear cells (PBMCs) using flow cytometry. " Line 93: Rephrase. Fecal samples don't have time points --or at least that's not what was done. This sounds like one collected fecal sample was processed at three time points. Also, since not all patients had three fecal samples, I would recommend saying "up to three longitudinal time-points." Response: We have rephrased the sentence to " We also analyzed gut microbiome composition (bacteria, virus, fungi) in 296 serial faecal samples collected at up to three longitudinal time-points from admission to six months after virus clearance…" (Line 95-97) Line 133: change "function" to "functions" Response: We have corrected "function" to "functions". (Line 141) Line 259/260: As written this is hard to understand: "Our evaluation on model revealed that including clinical information in addition to gut microbiome..." Response: We have rephrased this sentence to "Evaluation of our model revealed that a combination of clinical information and gut microbiome data can achieve a substantial improvement in..." . (Line 268-269) Line 290: This is phrased oddly. Is it being suggested that Klebsiella is a (beneficial) symbiont? 4 covariates. So, the biomarkers identified by MaAsLin have been adjusted by potential confounders and/or covariate included in the analysis such as demographic and clinical features. However, there is a possibility that demographic and clinical features could also predict the clinical outcomes of the patients. Therefore, in the random forest model, we included demographic and clinical features together with microbial taxa as predictors in order to achieve optimal prediction. The additionally clinical data and sub-samples of the training dataset used in the random forest model may also lead to the slight differences between random forest and MaAsLin.
Line 181: forest not forrest Response: We have corrected this. (Line 199) Line 181: Why were 11 factors chosen? Did this perform better than the top 10, top 12, or other numbers? Response: We aimed to determine the highest possible accuracy with the least number of factors. We performed the test from the top 5 to top 20 and found that using the top 11 achieved the best performance based on this model. This information has been included in the results . "We next evaluated the sub model performance from the top 5 to the top 20 and found that using the top 11 achieved best performance based on this model." Line 197: What value says what taxa contributed most to the model? Is it possible to add that information? Looking at Supp Fig 6, there is no indication that fungi and viruses contributed more than bacteria. Perhaps that value would also explain why the three specific taxa are pointed out on line 198/199. Also, the last part of this sentence does not make sense. It's odd to say these taxa could be considered after they were just considered.
Response: Thank you for the comments. We have rephrased this sentence for clarity and the top microbiome taxa were from all three kingdoms. We have made this clearer in the results (Line 216-219): "The microbiome taxa that contributed most to the model to determine duration of viral shedding were from the three kingdom classes, including Adlercreutzia equolifaciens, Asaccharobacter celatus, Candida dubliniensis, Klebsiella phage vB KpnP SU50, and Rhizobium phage vB RglS P106B (Supplementary Figure 6).
Line 248: Remove "For example," Response: It has been removed. Thank you.
Line 349: One could make the case that most people have special eating habits. I'm surprised to see vegetarians excluded. What fraction of individuals were excluded due to dietary preferences?
Response: We thank the reviewer for this comment. In previous studies, subjects following a strict plant-based diet showed a decrease in gut microbiome diversity which was caused by an enrichment of microbes specifically involved in degrading plant carbohydrates but was not perceived as a negative health hallmark. In this work, only less than 3 individuals were excluded due to dietary preferences. Response: Micop [1] has been shown to identify more eukaryotes in human microbiome data than existing methods. We also found that MiCoP consistently identified a more diverse fungi than MethPhlAn3 across all samples. We have included this message and the reference to the method section (Line 449-450). Here and elsewhere I find the color schemes confusing. It is fine that red is Cluster 1 throughout and blue is Cluster 2 throughout, but then it is confusing that bacteria are colored the same red, the mycobiome is colored the same blue, and the virome is colored the same green as the chi squared tests.
Response: We have adjusted the color schemes for bacteria, fungi, viruses, which are different from Cluster 1 and Cluster 2.  6 4H. Needs more information than Positive Time. Not clear until read text or legend that referring to viral shedding.
Response: We have changed the font of axis title and taxa names are italicized. We have replaced "Positive Time" by "Viral shedding duration", which is consistent with the text in the manuscript. While the herein included data and the described analysis are interesting and valuable, not least in the light of the ongoing pandemic, the manuscript often lacks important information as well as a more thorough discussion. In particular, the code and data required to reproduce the study are missing, and many passages in the Results and Methods sections are confusing (e.g. inconsistency in sample numbers and insufficient description of the analysis). Most importantly, the authors have already published a very similar study, which is also being cited in the present manuscript (http://dx.doi.org/10.1136/gutjnl-2021-325989, reference 10). However, the newly obtained results described here are not being discussed in the context of the previously published study.
7 Response: We thank this reviewer for the valuable comments and we have addressed them in the point-by-point response below.

Comment 01
The authors demonstrate interesting correlations of the taxa with COVID-19 severity. However, aside from the single mention of the use of multi-omics for weighted similarity networks, the use of multiomics is very limited. Therefore the title could be reassessed and revised. Importantly, given the availability of other COVID-19 datasets, it may be important to validate their in silico findings in a random publicly-available subset to ensure their robustness. Response: Thank you for the suggestion. We have removed the word multi-omics and revised the title to "Multi-kingdom gut microbiota analyses define COVID-19 severity and post-acute COVID-19 syndrome" We agree with the reviewer that it is important to test the robustness of the findings in publicly available subsets. However, after attempting we found that important metadata and sequencing data of published studies are not available, thus we were unable to include them in our analysis. We have included this as a limitation in the discussion (Line 320).

Comment 02
The post-acute COVID-19 syndrome (PACS) is defined as the presence of long-term complications observed after 4 weeks after the onset of symptoms (L338-340). This is a rather short time period and it is not clear from the manuscript whether these symptoms were reassessed at the 3 and 6 months time points to reevaluate their persistence.
Response: Thank you for the comments. In this study, the persistence of post-COVID symptoms were reassessed at 3 and 6 months. This has been included in the paper in Line 355-356: "We assessed the persistence of the 30 most commonly reported post-COVID symptoms at 3 and 6 months…"

Comment 03
Reproducibility. The authors do not provide the information required to reproduce their study: there is no reference to a code repository to reproduce the described analysis; there is no raw data accession in the Data Availability statement and the one provided in the Methods section (i.e., PRJNA714459) is from another already published study; there is no sample metadata including the demographic information, laboratory measurements, PACS symptoms etc.; there is no processed data such as the microbiome profiles, built prediction models, and data used to generate the figures.
Response: Thank you for the comments. We have added the link to code availability and data availability in the paper.
Data availability: Sample information and raw sequences are available in the National Center for Biotechnology Information Sequence Read Archive under BioProject 8 ID: PRJNA876804 (Supplementary Table 9) and will be made public upon formal publication. We have added a statement of this in the revised manuscript.
Comment 04 Sample numbers. There seems to be some discrepancy in the number of samples and missing information on the sample size for some analyses. For example, the Abstract states 296 fecal metagenomes, however 293 is mentioned in L92. Furthermore, the authors describe the assessment of only 79 fecal samples for metabolomics (L94). It is unclear why or how this number was arrived upon. Additionally, it is not evident whether the profiles from 79 samples are representative of the 133 patients (or 296 fecal) samples collected. Similarly, the viral load was assessed only for 94 samples (upper and lower respiratory tract, and fecal samples, L373). There is also no information on the missing stool samples: there have to be 3*133=399 fecal samples in total so there are 103 to 106 missing samples. It is not stated whether the missing samples affect certain time points more than others.
A full list of the samples with the appropriate metadata should be provided for review. Any sample that was not used for each of the various assays should be clearly indicated with justification. In addition, statistical justification along with a stratification of the different patient/sample groups is required.
Response: We thank the reviewer for raising this important issue. We have provided the full list of samples for each analysis including metagenomic sequencing (at baseline, month 3 and month 6), blood cell count, plasma cytokines at baseline and medical records during hospitalization, PACS at month 3 and month 6. These data are included in Supplementary Table 9.

Comment 05
Found patient clusters and comparison to previous studies. It is interesting to see two distinct clusters based on the similarity network approach. However, the description lacks some details. For example, it is not clear whether "pooling" in L101 is referring to the weighted similarity network fusion approach and whether "baseline" in L97 describes the time point "at admission".
Response: Thank you for the comments. We used weighted similarity network fusion that takes into consideration the richness of the gut microbiome. For better clarity, we have modified the text to "By subjecting multi-biome data to a non-supervised similarity network fusion approach, fecal samples were divided into two distinct patient clusters based on their microbiota matrix." (Line 106) The baseline is the time point "at admission". We have modified the text in the paper "Gut multibiome (bacteria, fungi, virus) profile at admission was integrated…" (Line 102) The authors state that the found two clusters had similar demographic parameters except for age: Cluster 1 contained older patients than Cluster 2 (L122-123). Given that the risk of higher disease severity for COVID-19 increases with age, it cannot be ruled out that the found clusters represent different age groups and that the observed differences in microbiome profiles and symptoms might be the consequence of that. The authors should consider this option and investigate whether patients' age is correlated with their symptoms.
Response: Thank you for the valuable comments. We examined whether patients' age were correlated with the post-COVID symptoms and found that there were no significant differences in the age of patients with PACS at six months. We have added this in the revised paper in Line 170-171 9 "Although older age is recognized in Cluster 1, there were no significant differences in the age of patients with PACS at six months between two clusters." For the time point "at admission", there is also the possibility that the patients were admitted to the hospital at different stages of infection which might be also reflected in their microbiome and viral load. The authors should discuss this in the manuscript as it might also have contributed to the observed cluster formation.
Response: Thank you for the comments. During the early stage of the COVID-19 pandemic in Hong Kong, every confirmed case was admitted to the hospital on the same day or one day after positive SARS-COV-2. The time point "at admission" of the sample was the first stool sample collected after hospital admission so there is unlikely to be a big lag. Despite the timely admission to the hospital, there is still the possibility that patients were at different stages of infection. We have included this in the discussion. Line 305-307: "Besides, despite the timely hospital admission and sample collection, there is also the possibility that the patients were admitted at different stages of infection which might be reflected in their viral load and gut microbiome composition." Could there be a bias based on the geographic origins of the patients? It is also likely that the "severe patients" were housed in the same if not similar wards, compared to those with milder symptoms. Based on the Methods section, the possible biases induced due to housing conditions at the hospital (i.e. rooms) are unaccounted for. A reader would benefit from such information which can be contributing factors to microbiome profiles (see https://doi.org/10.1128/msphere.01007-21).
Response: Thank you for this point and highlighting the possible impact of geographic origins and housing conditions at the hospital.
Among the 133 patients, 110 patients were from Prince of Wales Hospital, 17 from United Christian Hospital, 6 from Yan Chai Hospital. Since most of the patients (110/130) were assigned to the same hospital, which is nearest to their geographic location in Hong Kong, the bias based on the geographic origins of patients should be limited in this study.
Based on the triage prioritisation in Hong Kong, only patients with critical COVID-19 could be provided ICU at admission, the rest of patients were housed in the same ward condition. Moreover, the baseline samples were collected on the day of hospitalization, limiting the bias based on the different wards. In this study, there were 12 critical patients who were admitted to ICU care and the rest of the subjects had the same housing condition, limiting the impact of confounding factors. In addition, there was no statistical difference of the ratio of critical patients in the two clusters.
Finally, the authors state that the microbiome was stable over time within the clusters (L159-161). Furthermore, the majority of patients in Cluster 1 (84%) had PACS while only 44% were present in Cluster 2 (Fig. 3C, L166-168). However, in their other study, the authors also describe a recovery of the gut microbiome in patients without PACS at 6 months after disease resolution (http://dx.doi.org/10.1136/gutjnl-2021-325989, reference 10). Based on that, one would expect to see at least some temporal changes in Cluster 2 containing patients who did not develop PACS. Given this discrepancy, the authors should additionally compare the microbiome profiles (alpha-and beta-diversity) of patients with and without PACS and also across the different time points. In general, while the authors do cite their previous study in the present manuscript, they do not compare the newly obtained results to previous observations made in COVID-19 patients, though both studies have certain similarities in their design.
Response: We thank the reviewer for this comment. We analysed the gut microbiome profiles (diversity) of patients without PACS at 6 months in Cluster 2 and across different time points. The results showed a stable trend from baseline to 6-month follow-up ( Supplementary Fig 5E and 5F) with or without PACS in patients.
The previous study solely focused on the bacterial microbiota, where gut microbiome in patients without PACS at 6 months after disease resolution, where patients exhibited a recovery gut microbiome composition. However, the key difference of the current analysis is the mult-kingdom microbiome analysis including include bacteria, fungi and viruses, which may explain the discrepancy between current and previous study.
We have included this in the paper Line 174-177: "We further assessed whether there were temporal changes in patients without PACS in Cluster 2. The multi-microbiome exhibited stable microbiome profiles from baseline to as long as 6 months follow-up, indicating the persistent impact of SARS-COV-2 infection on the gut multi-kingdom microbiome." 11 Comment 06 Microbiome profiling and data integration. The authors state that the microbiome profiling was done to generate datasets representing three biomes: viruses (using a custom assembly-based approach), bacteria and fungi (using MetaPhlAn3 and MiCoP). These data sets were then combined using weighted similarity network fusion (WSNF). There are multiple points which have to be clarified regarding this part of the analysis. 1) MetaPhlAn3 includes in its database information bacterial, archaeal, viral, and eukaryotic genomes. Similarly, MiCoP estimates abundance of viral and fungal organisms. It has to be clarified whether the authors used all obtained hits from these two tools or if they limited the output only to specific organism groups, e.g. bacteria and fungi. If all hits were retained then it should be stated how the authors deduplicated the results to avoid multiple hits to the same species from different tools and approaches.
Response: Micop [1] has been shown to identify more eukaryotes in human microbiome data than previous-used method. We also found that MiCoP consistently identified a more diverse fungi than MethPhlAn3 across all samples.
For fungi, MetaPhlAn3 identified 2 genera in the samples, Candida and Aspergillaceae, while MiCoP identified 6 genera, including the two identified by MetaPhlAn3. Additionally, while the Candida genus dominated the MetaPhlAn2 results with 96.3% abundance, genera identified by MiCoP were distributed in a more balanced manner.
2) The WSNF method assigns weights to the different biomes based on their richness: the number of species found in each biome. The number of observed species is highly affected by the used tools and their databases. Does the used approach account for that? Moreover, the authors should reference the respective publication describing the used WSNF approach.
Response: We thank the reviewer for this suggestion. Indeed, the richness will affect the similarity network infusion. In this study, the approach we used is to assume differential influences of each biome on the overall multi-biome based on both taxonomic composition and richness. Weighting was assigned according to the total number of observed species in a particular biome. We've cited the reference in the manuscript where we described the approach.
Line 102 "Gut multi-biome (bacteria, fungi, virus) profile at admission was integrated by an unsupervised weighted similarity network fusion (WSNF) approach 4 ." 3) The authors should state why they decided to pursue the assembly-based custom approach for virome profiling instead of using the output from MetaPhlAn3 and/or MiCoP.
Response: Identification of viral sequences in the process of viral metagenomic analysis is notoriously challenging due the lack of a universal viral marker as opposed to bacterial 16S rRNA for example. Thus, the reference-based read mapping is limited by a scarcity of annotated viral genomes. We used the optimized pipeline, capable of de novo extraction and retrieval of viral contigs from shotgun sequencing reads. Owning to the use of assembly and classification approach, we were able to retain only viral sequences for downstream analysis. We have included this in the paper (Line 454-458).
4) It is not completely clear from the text whether the integration of the three biome datasets was done using all available samples or for each time point separately (i.e., at admission, 3 months, and 6 months). Response: Thanks for this comment. The integration of three biome dataset was performed on the admission stool samples. We next examined and compared the gut microbiome composition and occurrence of post-acute COVID-19 syndrome at different time points after acute infection. We have provided the samples list for integration of three biome datasets clearly.

13
Comment 07 L99/L457-458: The authors conveniently use a prevalence cutoff of 5% when filtering the dataset. However, it is unclear how they arrive at this number. Additionally it should be justified with appropriate references (see PMID 33510727). Also, it is not clear how the "average abundance of 1%" was used for filtering and why the authors chose the 1% cutoff. Moreover, the authors should state how many features were removed from the dataset using their filtering criteria and how many remained and were used for further analysis.
Response: We thank the reviewer for this comment. The data filtering procedure is applied to remove interactions resulting from random noise at the expense of sensitivity to weak signals. The cut-off value is suggested by the developer of weighted similarity network fusion for microbiome. Because of the high-coverage sequencing in this study, it is possible to detect some very-rare species with very low prevalence. We have cited the appropriate reference to the method section and the total number of species retrieved and remained for analysis (Line 486). Comment 08 L137: The authors claim that blood urea levels were correlated with disease severity. However, based on the figures, a majority of the points are beyond the 2x standard deviation, suggesting a potential artifact due to outlier samples. This analysis has to be adjusted for outlier effects and assessed critically.
Response: We thank the reviewer for this comment. We have reassessed the correlation of blood urea levels and severity using Wilcoxon, which is a non-parametric test. These tests do not require a distributional assumption and are robust to the presence of outliers. Instead of transforming the data or discarding outliers, Wilcoxon Test is widely used to test whether there is difference in the values. We have updated the figure in supplementary Figure 1E.
Comment 09 L205: The authors claim an 'ensemble-based' method for the network analyses. However, the Methods section only describes using the "Reboot" approach. It must be noted that Reboot is quite outdated (2012) and since then additional/better methods for assessing sparse compositional matrices have been introduced. This is evident from Schwager et al.'s work (PMID: 29140991). It may be prudent to use an alternative method to corroborate the findings described herein.
Response: Thank you for the suggestion. We found sparse compositional matrices are more widely used. We have reassessed the network by using the widely used approach SparCC (Sparse Correlations for Compositional data). We have revised the content accordingly in results and methods (Line 222-247, Line 534-538) Comment 10 L218: The authors demonstrate an example with C. spiroforme. However, it is unclear why this taxon was chosen. Especially given the fact that R. gnavus was also part of the description. It may be more prudent to use R. gnavus as an example to state the case, or explain why it did not show the expected correlations with the severe cluster.
Response: We agree it is more prudent to use R. gnavus as an example. We have made modifications in manuscript and figure (Line 244-246, Supplementary Figure 7F) Comment 11 L226: It is surprising to observe that the "density" of the network is low, and yet the connected component value is 1. Could the authors explain the phenomenon driving this? In network theory, a highly connected component usually arises from a highly dense network, so this finding is counterintuitive.
Response: In this section, we did not show the whole microbiome species, which may not be particularly informative and readable. We only showed those species with top co-occurrence values. The cut-offs of correlation coefficient and p-value are |Coef| > 0.1; p<0.05, which led to the "low" "density" of the network. (Line 229)

Comment 12
For the patient stratification model, it seems that the clustering results obtained from the complete dataset were used as patient labels. If this is the case, then there might be an overfitting issue as the clusters and their optimal number were defined based on all available data. Therefore, the validation set used later does not completely represent truly unseen data as described by the authors (L472). A cleaner approach would be to perform the clustering using only the microbiome profiles of the training set. Expected labels for the validation set could be generated, for example, by a k-nearest neighbors approach to be able to compute the model accuracy.
Response: We thank the reviewer for this insightful comment. 1)In this section we used random forest model for stratification, which was different from the clustering approach. We would like to clarify the training dataset and testing dataset for model development. We have modified the description to avoid misunderstanding. "Four datasets from 133 patients including demographic, blood test, cytokines and multi-biome were used separately or in combination (ensemble) to train the model. Machine learning models were first trained on the training set (70%, n=93) with five-fold cross validation, and then were applied to the test set (30%, n=40) for validation." 2) We validate the clustering using only the microbiome of training set and five-fold cross-validation by two machine learning approach Support Vector Classifier (SVC),) and K-nearest neighbours algorithm. The AUROC yield above 0.9 indicating the robustness of clustering.
Another point is the feature ranking and selection procedure. Here it is not clear how exactly this was done (i.e., which data was used for feature selection, training and validation) and what the final model is. Additionally, the authors should not only look at accuracy but also consider other performance measures and their variance (in addition to the mean value).
The feature importance vector (mean decrease Gini index), including weights for every species predictive capacity, was collected. The Random Forest algorithm has built-in feature importance which can calculate the Gini importance (or mean decrease impurity) from the Random Forest structure. Mean decrease impurity" is defined as the total decrease in node impurity (weighted by the probability of reaching that node (which is approximated by the proportion of samples reaching that node)) averaged over all trees of the ensemble. The training dataset (70%) was used for feature selection.
We have clarified this in the paper in Line 504-507.
"The training dataset (70%) was used for feature selection. A trained forest produces a variable importance list based on mean decrease in Gini index. The feature importance vector (mean decrease Gini index), including weights for every species, demographic, blood test, or cytokines predictive capacity was collected. " The model was repeated ten times to obtain a distribution of random forests prediction evaluations The final model was chosen looking at the best overall AUC value and overall accuracy. We have clarified this in the paper (Line 503, Line 524-526) Similarly, the paragraph about the RF model used to predict the time period of patients being SARS-CoV-2 positive lacks some details to understand the performed model training and validation procedure. The authors should try to provide a more concise and clear description of this analysis step.
The dataset was divided into 70% training and 30% testing set. We have reorganized this paragraph to be clearer and more concise. (Line 519-520) Minor comments L204-205: It would be better to not introduce additional names or labels for Cluster 1 and Cluster 2.
Response: Thanks for the suggestion. We agree to keep the consistent label of Cluster 1 and Cluster 2 throughout the paper. We have removed the additional names of for Cluster 1 and Cluster 2.
L46: R.gnavus are commensal taxa found in many mammalian microbiomes. They are not pathogenic bacteria, unless in a particular context such as septic arthritis. Please refrain from using this misleading description Response: We have edited the text. "was characterized by enriched abundance of bacteria (Ruminococcus gnavus, Klebsiella quasipneumoniae)" Line 48 L90-92: The text reads such that viral loads were checked in PBMCs. Please rephrase.
Response: The text is rephrased. Line 90-93 We assessed viral RNA levels in nasopharyngeal swabs and fecal samples using quantitative PCR with reverse transcription (RT-qPCR). We also assessed plasma cytokines and chemokines levels and leukocyte profiles from freshly isolated peripheral blood mononuclear cells (PBMCs). L111-114: Please show the 95% and 99% confidence intervals and provide the PERMANOVA analyses results as a supplementary table with special emphasis on the F statistic.
Response: We have showed 99% confidence intervals and the PERMANOVA results are provide in Supplementary Table 2. L130: Additional details regarding how the functional profiling was performed are needed.
Response: Details regarding how the functional profiling was performed is added to the text Line 137-140 "For functional annotation, we applied the Human Microbiome Project Unified Metabolic Analysis Network 3 (HUMAnN3) pipeline that maps reads to functionally annotated organism genomes and uses a translated search to align unmapped reads to UniRef90 protein clusters".
L138: Additional details about how specific microbiome species were associated with elevated urea levels are needed.
Response: We have added and revised how specific microbiome species were associated with elevated urea levels in Line 147-154 "Next, we investigated how specific microbiome species were associated with elevated urea in severe COVID-19. The relative abundance of urea cycle pathway and K01940 in the urea cycle were significant higher in Cluster 1. Furthermore, we found a marked increase in K01940 (argininosuccinate synthase, the key enzyme in urea cycle pathway, Supplementary Figure 3B) in the severe cluster (Supplementary Figure 3C), which was predominantly driven by Klebsiella species such as Klebsiella quasipneumonia， Klebsiella pneumoniae and Klebsiella variicola (Supplementary Figure 3D) by comparing the subclasses pathway and microbial contributors (quantify gene presence and abundance in a species-stratified manner)." L209-211: Please provide the correlation coefficient and p-value cutoffs in the results with respect to what is termed a "co-occurence" and what is not.
Response: The cut-offs of correlation coefficient and p-value are |Coef| > 0.1; p<0.05. (Line 229) L344: It is unclear if all the patients were housed in the same hospital/quarantine facility. Please elaborate.
Response: Among the 133 patients, 110 patients were from Prince of Wales Hospital, 17 from United Christian Hospital, 6 from Yan Chai Hospital. We have added this information to the manuscript in Line 343-344.
L363-364: Why do the authors mention the antibiotic treatment if having no antibiotics therapy 6 months before, during and 6 months after the SARS-CoV-2 infection was one of the inclusion criteria (L344-346)?
Response: Since antibiotics would be a big confounder of gut microbiome analysis. We would like to exclude this impact on gut microbiome. In previous analyses, we found the impact of antibiotics could persistent for 1 month after hospitalization. In this work, only patients without antibiotic therapy before at least 6 months, during and 6 months after acute infection of SARS-CoV-2 were eligible for analyses.
L379: Were the whole blood samples collected at admission?
Response: Yes, the whole blood samples were collected at admission. L386-404: Which samples were analyzed here? All 293 stool samples?
Response: We analysed metabolomics of 79 faecal samples at admission.
We have clarified this in the paper in Line 406 "The quantification of fecal metabolites from 79 faecal samples at admission was performed by Metware Biotechnology Co., Ltd. (Wuhan, China)." L432: The subsection title "Bioinformatic Viral Processing" is confusing. E.g., "Viral profiling of metagenomics data" or similar would be more appropriate.
Response: We thank the reviewer for this suggestion. We have revised the subtitle to "Viral profiling of metagenomics data". L437: Were the contigs deduplicated?
Response: The redundant sequences of contigs were removed by using CD-hit-EST. We have added the information to the method section in Line 471-472.
L448-450: How were the OTUs defined and generated from the reads mapped to contigs? Why was the normalization done by total number only and not also by the OTU sequence length?
Response: We thank the reviewer for this comment. We also normalized the OTUs by the OTU sequence length. We have included this in the paper in Line 476. L464-465: Which software was used for patient clustering? What is the relationship between the eigengap method to choose the optimal number of clusters and the number of nearest neighbors?
Response: The clustering was conducted by weighted SNF (WSNF) using an online tool (https://integrative-microbiomics.ntu.edu.sg). The eigengap method and K nearest neighbors were two parallel approaches imbedded in WSNF for the determination of clusters. We have revised the sentence and added the appropriate citation. (Line 492-494) L467: While the features used for model training are described in this paragraph the authors do not explicitly say what the response variable is.