Genomic insights into the evolution of Echinochloa species as weed and orphan crop

As one of the great survivors of the plant kingdom, barnyard grasses (Echinochloa spp.) are the most noxious and common weeds in paddy ecosystems. Meanwhile, at least two Echinochloa species have been domesticated and cultivated as millets. In order to better understand the genomic forces driving the evolution of Echinochloa species toward weed and crop characteristics, we assemble genomes of three Echinochloa species (allohexaploid E. crus-galli and E. colona, and allotetraploid E. oryzicola) and re-sequence 737 accessions of barnyard grasses and millets from 16 rice-producing countries. Phylogenomic and comparative genomic analyses reveal the complex and reticulate evolution in the speciation of Echinochloa polyploids and provide evidence of constrained disease-related gene copy numbers in Echinochloa. A population-level investigation uncovers deep population differentiation for local adaptation, multiple target-site herbicide resistance mutations of barnyard grasses, and limited domestication of barnyard millets. Our results provide genomic insights into the dual roles of Echinochloa species as weeds and crops as well as essential resources for studying plant polyploidization, adaptation, precision weed control and millet improvements.

I have two large concerns. The first is the rigor that was applied when making when making the phylogenetic inferences and the conclusions about introgression versus ILS. I bring up several concerns and questions for that portion of the manuscript that should be considered. The second major concern is the selection of figures. They seem to serve to impress and overwhelm the reader rather then tell a consistent and convincing narrative. Careful production and selection of figures is recommend to streamline the narrative and tell a more compelling overarching story.
Line Edits Line 82: The use of "inevitably" implies a strong causal link, while the sentiment is appreciated perhaps herbicide tolerance in all use cases does not inevitably lead to herbicide resistance.
Line 92: "Evidently, barnyard grasses are excellent resources or systems for understanding plant stress". This abrupt transition could be made clearer that this. Is it-because of their success in the face various abiotic stresses?
Line 214: "significant gene losses." The word significant is being used without obvious statistical method, were losses determined using CAFÉ?
Line 220: Barabaschi et al. 2020 posits a fitness cost but makes no such calculation. Rather these authors infer it based on genomic comparisons. Similarly, Ye et al. 2020 do not seem to measure fitness cost but rather gene expansion. Recommend that authors cite Review of fitness cost by Brown and Rant 2013, or other similar sources used in Ye et al. 2020 to first establish the strong link between R genes and fitness cost.
Line 228 and 230: "significantly" is used again without obvious statistical method Lines 303-323: The authors set out to establish if discordance between predicted discordance times and ks estimates are driven by incomplete lineage sorting and/or introgression. The authors calculated ks divergence using a COGE like approach, if not COGE exactly. Phylogenetic methods, were based off of Orthofinder and RAxML. Coalesecent approach done with ASTRAL. While the authors work using QuIBL is respectable, their handling of the concatenated RAxML tree is unsatisfying. They report no alignment QC, partitioning or model selection. While Orthofinder is a nice general purpose program for generating Orthogroups, it is no substitute for handling the data rigorously. Questions abound.
1. Why not use nucleotide sequences with a codon aware aligner instead of the proteins. If the species are close enough to calculate reliable ks peaks, aligning at the nucleotide level is a tractable task. 2. Since whole genome sequences are being used, why not employ the same set of homologous genes identified via syntenty in genome guided orthology approach? This would be even more reliable than orthofinder since it incorporates genomic position, and basic data handling would allow for the use of amino acid or nucleotide sequences as authors saw fit.
In the case where Orthofinder is employed. I caution the authors that past experience with Orthofinder strongly suggests that the data will only benefit from a more rigorous handling of the sequences. This is because the alignments are frequently gappy and poorly aligned under alignment settings provided by most automated aligners and thus a QC step is required. The QC step and post processing should address the following.
1. How were gaps dealt with, the introduction of gaps into multiple sequence alignments can have a severe consequences. 2. Was GBlocks or a similar program employed to limit the amount of noise in the alignment? If so why was it not reported and what were the flags used to ensure reliable alignments were used in the concatenated sequence 3. Was partitionfinder2 or some suitable substitute used to ensure that the correct models were applied throughout the concatenated matrix? 4. What model was used for RAxML?
The same point about alignment QC applies to the Multiple Sequence Alignments (MSAs) used for IQTree. IQtree does an excellent job testing for best substitution models, however it does not manage gapped or poorly aligned sequenced output regularly produced by MAFFT with default settings. This is particularly problematic for sequences with large disordered domains and for orthogroups in which one or more sequences contains novel amino acid sequence insertions. These must be dealt with before MSAs are given to IQTree. Again, there is no obvious evidence that this was done. Poor alignments produce unreliable inferences.
The authors have not convinced me that they have handled the analysis rigorously enough to support the claims based on this analysis. In particular, it seems just as likely that they are measuring artifact introduced by the handling of data as they are measuring ILS or introgression. I admit that measuring ILS and introgression is a worthy undertaking, but has not been convinced by the current handling of the data.
Ln574: Please cite "Tranel, Patrick J., and Terry R. Wright. "Resistance of weeds to ALS-inhibiting herbicides: what have we learned?." Weed Science 50, no. 6 (2002): 700-712."or "Murphy, Brent P., and Patrick J. Tranel. "Target-site mutations conferring herbicide resistance." Plants 8, no. 10 (2019): 382." It is an essential piece of literature addressing TSR for ALS inhibitors Line 613: Please cite "Patterson, Eric L., Dean J. Pettinga, Karl Ravet, Paul Neve, and Todd A. Gaines. "Glyphosate resistance and EPSPS gene duplication: convergent evolution in multiple plant species." Journal of Heredity 109, no. 2 (2018): 117-125." It is a more focused review of the effect that CNVs have on glyphosate resistance. Lines: 668-675 "Significantly" is used twice without an obvious statistical method. The pipeline CNVings is briefly described, but no mention is made of statistical test is applied to determine differences, perhaps it was a T-test. Figure 1: Line 1106-1112: This circus plot can be deconvoluted. Many of these tracks are unnecessary (e.g. GC content) as they don't increase our understanding of the narrative and are for the most part uninteresting (especially crammed in a tiny figure). Only include the data relevant to your point(s). This will highlight the more relevant tracks (like Copia, Gypsy, etc).
Line 1113: Not sure this is interesting in a graphical form. Just extra complexity and we understand well from the text how complete the genome(s) are. Also, it may allow more room for panel D and/or E to be larger and more interpretable.   The article brings novel insights into an evolution and agronomic significance of high polyploid grasses. Although there are several genome assemblies of allopolyploid species already available, the comprehensive analysis presented here brings valuable general insights into allopolyploid evolution in general. The model system is fascinating and in depth investigated by an impressive collection of global diversity of the three focal species. The analyses are diverse and at many aspects very comprehensive (esp. genome assembly, investigation of range-wide diversity) while the analysis of selective footprints of local adaptation, herbicide resistance and domestication is rather limited to testing particular hypotheses, which are addressed by a diverse and somehow inconsistent approaches, that seem not to be always clearly justified. Although I do not see any major significant flaw, I have the following comments aiming to clarify and potentially improve some aspects of the study.
1) The phylogeny of Echinochloa species and subgenomes is crucial for further interpretations. I understand the reasons why there are no more genomes assembled and sequences but I wonder whether there are at least some sequence data for additional species available. Addition of such data (e.g. in a supplementary tree involving a subset of loci gathered also from current genomes) might better clarify the origin of the other subgenomes, esp. FH and EH that are currently isolated in the tree. Putting these newly assembled (sub)genomes into a broader context of Echinochloa evolution would help understanding their evolution.
2) Many recently assembled allopolyploid genomes demonstrate surprising stasis and lack of large genomic exchanges among the sub-genomes (e.g. Burns et al. 2021 Nat Ecol Evol and refs therein), however, it would be still worth checking and discussing if that is also the case for Echinochloa or whether they are in fact such homeologous recombinations present. Lack of homeologous exchange is an assumption of many downstream analyses but has not been discussed.
3) What is the breeding system of the species? In case of strong selfing, this shall be taken into account in the interpretations of diversity and differentiation from population resequenced data. E.g. could the decrease in nucleotide diversity among E.c. varieties (l. 650-653) be attributed to breeding system shift, possibly associated with domestication? 4) l 217-240 The finding of contraction of NB-ARC gene family is interesting, it is however unclear how much this case represents an outlier in gene contraction among different gene families in Echinochloa. Hence, it is hard to speculate on adaptive significance of such a reduction (see l. 219) and it is a bit unclear the motivation why an entire paragraph is devoted to discussion of this particular family. Is the NRC-ARC family contraction significantly higher than is a genome-wide average in Echinochloa? 5) Fig. 1E: It is totally unclear from where the time estimates come from. Are they based on some uniform substitution rate estimate in grasses? If so, such an assumption seem to be tricky given known variation in substitution rates, and surely presenting a single point estimate of divergence time, based on such an approach, may be very misleading. It will be beneficial either to date the splits based on calibrated phylogeny (includng confidence intervals for each node), or such numbers shall be excluded from the main figure to avoid potential misleading divergence times (these values seem not to be very important for further interpretation anyway). Related to that, a phylogeny based on multispecies coalescent reconstruction across multiple loci (i.e. Astral, FIg S13), i.e. an approach directly developed for phylogeny reconstructions, may provide more robust than the relatedness based on the distribution of nonsynonymous substitutions.
6) The motivation for using particular methods for detection selection is not always clear and raises concerns about the homogeneity of the approaches throughout the ms. Specifically, why results of flowering-time variation are based on a genome-wide divergence (Fst) outlier scan while e.g. the domestication process is investigated by allele frequency spectra-oriented approach targeting only a subset of (pre-selected?) loci? 7) In some candidate loci it is unclear how the variation with these loci looks in the other two homeologs within a hexaploid (e.g. Hd1 is reported only from subgenome AH but nothing is mentioned about subgenome BH and CH). Are they pseudogenes? Invariable but likely functional loci? Variable but not enough to be considered outliers? Undetected? Such a context would provide important hints about the role of allopolyploidsation and potential sub/neofunctionalisation in adaptation.
8) The entire section on herbicide resistance (l. 569-627) lacks a primary exploratory genome-wide approach and/or justification why the entire parts focuses only on a small subset of particular loci. As a result, is appears as somehow selective "cherry-picking" of certain loci without a justification how much variation is such loci is exceptional in the context of the entire genome. Thus, although this section provides some hints of potentially valuable loci for a follow-up it does not adhere to the major focus of the paper on genome-wide variation and evolutionary significance of genomic variation related to allopolyploidy. At current state this section provides rather some particular supplementary information targeted to a very specialised audience.
9) The Disucssion is very brief and descriptive. Although not being an expert in the filed of agricultural genomics, I doubt there is so scarce literature on the topics as it is presemted here (in total four citations in the entire Discussion section). E.g. considering additional well-researched cases of the evolution of agricultural weeds such as Amaranthus tuberculatus (Kreiner et al. 2019 PNAS, Kreiner et al. 2018 ARPB andrefs. therein) would be helpful. Also putting the result into a context of other emerging studies on genomic evolution in allopolyploids (e.g. Arabidopsis suecica, Brassica napus, Gossypium, ....) might increase the impact of the study. 10) Although the authors at multiple places highlight the significance of their study for follow-up genetic studies, list of candidate loci for adaptation towards the discussed factors (resistance, latitude, domestication...) are missing. It would be evry helpful providing complete lists of candidate loci found genome wide by Fst scans, GWAS approaches and SFS-based scans as such primary data would provide molecular biologist a key resource for addressing particular genes in this emerging model.

Minors
-I consider crucial the fact the studied group is ALLOpolyploid. I suggest adding this information already to the abstract. -l. 105-106 maybe "most" missing from "the MOST compact genome" -l. 253 "calculated the pairwise Ks" is a jargon. Please explain the principle/motivation behind this analysis.
-l. 321 based on what the uthors conclude that "introgression may have facilitated an increase in environmental adaptation"? This seems rather speculative given the presented Ks analysis. -l. 338 what is the ploidy of E. walteri that is used as an outgroup in the resequencing data? -l. 397 unclear, which measure the % of "admixture" are based on -l. 506 unclear what are the "outliers", this shall be also explained in the text not only in the Fig caption -l. 536 how many individuals were used for the GWAS? How many loci others than Hd1 were found to be significantly associated with FT?

Reviewer #1 (Remarks to the Author):
In this manuscript, Wu et al. created, improved and analyzed extensive genomic resources for barnyard grasses (Echinochloa spp.). Some Echinochloa species are devastating weeds in agriculture, while others are cultivated and serve as food sources. Analyzing the population structure of Echinochloa is thus important to understand the evolution of weediness and domestication in this genus. The authors created chromosome-scale genome assemblies of three polyploidy Echinochloa species and re-sequenced a diversity panel of 737 accessions. While several near complete Echinochloa assemblies have been generated previously, this manuscript presents the first chromosome-level assemblies of polyploidy species of this genus. While most of the findings presented here are not entirely novel (e.g., a loss of NLR diversity in Echinochloa has been reported previously), the extent of the genomic resources generated here and the scope of the analyses are impressive. In general, the manuscript is very clear and very well written. I do have a few comments/questions that the authors should address:

Response:
Thanks for the positive comments on our manuscript.
• The authors developed a neat approach named "diploid-assisted scaffolding (DipHiC)" to assign contigs to the various sub-genomes of polyploidy Echinochloa species. The authors use a sequence-based approach (k-mers) for validation. However, I miss a truly independent validation of this scaffolding method, for example by using a genetic map or chromosome painting. I assume that large non-homoeologous translocations (such as the translocation between chromosome 4A and 7B in hexaploid wheat) would be missed with this approach. I am aware that the construction of a genetic map or the establishment of chromosome painting is tedious and might not be within the scope of this manuscript, but the authors should clearly point out this limitation (if it is one) in the main text. I am also not sure if it is correct to refer to "chromosome-level" assemblies without an independent validation.

Response:
Thanks for your comments. K-mer based clustering of contigs is one of the approaches for clustering. In addition, we also considered contig allelism and inter-contig linkages, which could eliminate the faults caused by using only K-mer based clustering method. Nevertheless, additional independent evidence to validate the assembly accuracy is needed. The assembly of polyploid genome is still a challenge, due to high sequence similarities between subgenomes and homeologous exchanges, which requires more evidence (e.g. genetic maps, optical maps) to validate and improve the assembly accuracy and completeness. We have added some discussion on the concern.
For the term "chromosome-level", it is now commonly used in most genome papers, in which genomes were assembled by the next-/ third-generation sequencing and HiC data with no validation of genetic maps, such as the two Nature Communications • The gene annotation is only poorly described. Line 856 states that RNA-Seq data was used for gene annotation. Were RNA-Seq data produced for all three species and were the tissues and sequencing depth comparable? How was the filtering for high-confidence and low-confidence genes done? How many low-confidence genes are there?

Response:
Thanks for your suggestion. We have added more detailed description about gene annotation. The RNA-seq datasets for genome annotation have been added in Supplementary Table 18 and Method part. All gene structures predicted by ab initio predictions, homeologous gene evidence and transcriptomic support (RNA-seq), were integrated into consensus gene models by EVidenceModeler. Gene models were identified as those supported by homeologous genes or transcript evidence or at least two ab initio methods. High-confidence gene models were further filtered to remove short gene models (less than 50 amino acids) and those with homology to sequences in the Repbase (E value≤1e−5, identity≥30%, coverage≥25%). Minor comments: • Line 105: "have the most compact hexaploid genomes currently known in plants"?

Response:
We have corrected it. The genomes of two hexaploid Echinochloa species (E. crus-galli and E. colona) are among the most compact hexaploid genomes currently known in plants.

Response:
We have deleted it. It should be "consensus reads generated from Pacbio CCS".
• Line 176: How many of the core conserved Poales genes were present in single copy and multiple copies in the E. colona genome?

Response:
We have added the paper of Wheat Pangenome (Walkowiak et al., 2021, Nature).

Reviewer #2 (Remarks to the Author):
The authors provide two high quality barnyard grasses reference genomes and sequences of 737 individuals. Overall, this work is excellent.

Response:
Thanks for the very positive comments on our manuscript.
One of the novelties in this manuscript is that the authors proposed a new pipeline (DipHiC) to phasing and scaffolding of subgenomes in an allo-tetraploid species E. oryzicola and two allo-hexaploid species E. crus-galli and E. colona. DipHiC is able to separate subgenomes by untilizing the mapping depth of a diploid close relatives and mutually exclusive relationship between contigs from the target polyploid genome. Assessement using Hi-C contact heatmap (e.g. SFigure 4) and distribution patterns transposon elements (SFigure 6) showed that the genome assemblies are of high-quality. However, requring addtional information (for instance, a diploid assembly and 20-kb mate pair library) and a less robust pipeline may limit the widespread use of the strategy. In fact, separating the subgenomes of allo-polyploid species is less complicated than phasing auto-polyploid genomes expecially for those with high divergence between subgenomes. Evidences show that the repetitive DNA sequences are rapidly evolving in subgenomes, which have been used to distinguish subgenomes in a couple of allopolyploid speices, e.g. Miscanthus. In addition, an unsupervised partitioning method (polycracker) was proposed to achive this goal without any prior knowledge, and had been successully applied on two allopolyploid species, tobacco and wheat. Given the previous knowledge, I question the intention and necessity to develop such a complex pipeline to assemble these genomes.

Response:
Thanks for your comments and suggestions.
Firstly, we tested several regular HiC scaffolding software and they did not performed well in Echinochloa genomes, even for AllHiC, a method specially designed for polyploid genome scaffolding ( Supplementary Fig. 1a), which resulted from high similarities among three subgenomes.
In the case of tetraploid Miscanthus genome (Mitros, et al., 2020, Nature Communications), the scaffold-level assembly was integrated into chromosome-level by using large-insert mate library, HiC and genetic map firstly. The chromosomes were then assigned into subgenomes using a K-mer based method. It was a strategy of "chromosome building firstly and subgenome phasing secondly". In another Miscanthus genome project (Zhang et al., 2021, Nature Plants), the integration of 10X genomic data, HiC data, Bionano optical data and high-density genetic map scaffolded contigs into chromosomes, but only 91.03% sequences were anchored on chromosomes.
We did attempt to scaffold contigs into chromosomes directly, but failed. Third, the repetitive elements were relatively low, which would not provide enough subgenome-specific sequences as markers to phase subgenome. Another major concern is that the authors frequently mix results and discussion, making their conclusion unreliable sometimes. For instance, Lines 462-463, "Either sexual hybridization or asexual grafting can lead to chloroplast capture or horizontal transfer". Are they making conclusion or just citing a previous publication to support their conclusion? At least a reference needed here. In addition, Lines 409-428, the whole paragraph should be moved to discussion part rather than presented in Results.

Response:
Thanks for the suggestions. We have revised the main text and separated the results and discussion according to your suggestions.

Response:
We have deleted the word "significant".
In Line 230, the authors claimed that "NB-ARC genes were significantly enriched on chromosome 4", but did not provide any statistics test. The same issue should be applied whenever "significant/significantly" are mentioned, e.g. Lines 447, 525, 544 etc.

Response:
Thanks for your suggestion. We have added the statistic tests when we mentioned "significant".
In addition, SFigure 10 shoud be normalized using z-score.

Response:
We have revised it according to your suggestion (see the following figure).
Lines 287-294, the authors concluded that the discordance between gene and species trees in Echinochloa was more likely caused by ILS rather than introgression by presenting two evidences: 1) 1:1:1 ratio in the gene counts of the three trees and 2) QuIBL results. But they did not provide details for explanation. Why the 1:1:1 ratio could indicate ILS and what QuIBL results lead authors to make such conclusion?

Response:
Topology counting of a triplet with an outgroup is a commonly-used method to explore ILS and the 1:1:1 ratio of three topologies implied possible ILS around the Lines 840-850, how the accuracy was validated by specific-kmer?

Response:
Each genome of E. colona, E. oryzicola and E. crus-galli was split into 13-mers. The frequencies of K-mers were counted for each chromosome. Only the K-mers mainly distributed in one subgenome (counts proportion in this subgenome greater than 95% of total counts) were kept as subgenome-specific K-mers and potentially could be used to phase subgenome. For each subgenome in any Echinochloa polyploids, there existed some subgenome-specific K-mers.
There are many PERL scripts on github, but without enough description, pleaed add a rich README.md.

Response:
Thanks for your comments. We have updated the README files for scripts used in this study, particularly the pipelines used for genome scaffolding of three Echinochloa polyploids. See details at https://github.com/bioinplant/Echinochloa_genome.

Reviewer #3 (Remarks to the Author):
Synopsis: Researchers sequenced the genomes of three Echinocloa species: hexaploids E. crus-galli and E. colona and tetraploid E. oryzicola. Only one Echinocloa species has been previously sequenced, the diploid E. haploclada. The authors then sequenced hundreds of new Echinocloa plants from around the world to add to their panel of 737 Echinocloa collections. Using the new genomes and diversity panel the authors set out to investigate questions around Echinocloa genus" genome biology, evolution, phylogeny, domestication, gene transfer, and abiotic stress resistance especially herbicides. General Comments: This manuscript represents a staggering amount of work and explores the Echinocloa genomes in several ways. It is so much and so dense that this might be to its detriment. The shear number of different topics and questions being addressed means the central through line is easily lost. Maybe it would be better for the manuscript to be broken into a number of smaller sequential manuscripts that address each topic in turn and fully explores their individual hypotheses and nuances. I really enjoyed that the manuscript addresses several outstanding questions and hypotheses previously posited, while opening the door for new avenues of research and discovery. The writing is excellent and little needs to be edited. I have two large concerns. The first is the rigor that was applied when making when making the phylogenetic inferences and the conclusions about introgression versus ILS. I bring up several concerns and questions for that portion of the manuscript that should be considered. The second major concern is the selection of figures. They seem to serve to impress and overwhelm the reader rather then tell a consistent and convincing narrative. Careful production and selection of figures is recommend to streamline the narrative and tell a more compelling overarching story.

Response:
Thanks for your positive comments on our manuscript. We have carefully addressed the questions you raised (see the following response for your concrete question).
Line Edits Line 82: The use of "inevitably" implies a strong causal link, while the sentiment is appreciated perhaps herbicide tolerance in all use cases does not inevitably lead to herbicide resistance.

Response:
We have deleted this word.
Line 92: "Evidently, barnyard grasses are excellent resources or systems for understanding plant stress". This abrupt transition could be made clearer that this. Is it-because of their success in the face various abiotic stresses?

Response:
We have revised this sentence.
Line 214: "significant gene losses." The word significant is being used without obvious statistical method, were losses determined using CAFÉ?

Response:
We have revised this sentence by deleting "significant".

Response:
We have changed the citation.
Line 228 and 230: "significantly" is used again without obvious statistical method

Response:
We removed it in line 228 and added statistic test in line 230 Lines 303-323: The authors set out to establish if discordance between predicted discordance times and ks estimates are driven by incomplete lineage sorting and/or introgression. The authors calculated ks divergence using a COGE like approach, if not COGE exactly. Phylogenetic methods, were based off of Orthofinder and RAxML. Coalesecent approach done with ASTRAL. While the authors work using QuIBL is respectable, their handling of the concatenated RAxML tree is unsatisfying. They report no alignment QC, partitioning or model selection. While Orthofinder is a nice general purpose program for generating Orthogroups, it is no substitute for handling the data rigorously. Questions abound.

Response:
Thanks for your comment. We have updated the phylogenetic analyses under strict alignment QC and clarified the model selection clearly in Methods. Please also see responses for the following questions.
1. Why not use nucleotide sequences with a codon aware aligner instead of the proteins. If the species are close enough to calculate reliable ks peaks, aligning at the nucleotide level is a tractable task.

Response:
We used the CDS sequences of single-copy orthologs to construct concatenation trees. The phylogeny trees based on Raxml and iqtree+ASTRAL using PROT and CDS datasets are consistent.
Phylogenetic trees using RaxML and iqtree+ASTRAL for amino acid sequences (PROT) and nucleotide sequences (CDS) datasets 2. Since whole genome sequences are being used, why not employ the same set of homologous genes identified via syntenty in genome guided orthology approach? This would be even more reliable than orthofinder since it incorporates genomic position, and basic data handling would allow for the use of amino acid or nucleotide sequences as authors saw fit.

Response:
Thanks for your suggestions. It is an ideal approach for constructing species phylogeny based on syntenic orthologs. While it is still a challenge to do multi-genome alignment due to the widespread genome-wide structural variations (e.g. gene loss and duplication). Here we employed a frequently-used method to build the species phylogeny by using 3500+ single-copy orthologs. The gene number is sufficient to sort the phylogenetic relationship.
In the case where Orthofinder is employed. I caution the authors that past experience with Orthofinder strongly suggests that the data will only benefit from a more rigorous handling of the sequences. This is because the alignments are frequently gappy and poorly aligned under alignment settings provided by most automated aligners and thus a QC step is required. The QC step and post processing should address the following.
1. How were gaps dealt with, the introduction of gaps into multiple sequence alignments can have a severe consequences.

Was
GBlocks or a similar program employed to limit the amount of noise in the alignment? If so why was it not reported and what were the flags used to ensure reliable alignments were used in the concatenated sequence
3. Was partitionfinder2 or some suitable substitute used to ensure that the correct models were applied throughout the concatenated matrix?

Response:
We used jModelTest and ProtTest to test the best substitution model for nucleotide and amino acid sequences. And also the best models tested by ModelFinder implemented in IQ-TREE were also considered.

Response:
For amino acid (PROT) dataset, the substitution model was set as The same point about alignment QC applies to the Multiple Sequence Alignments (MSAs) used for IQTree. IQtree does an excellent job testing for best substitution models, however it does not manage gapped or poorly aligned sequenced output regularly produced by MAFFT with default settings. This is particularly problematic for sequences with large disordered domains and for orthogroups in which one or more sequences contains novel amino acid sequence insertions. These must be dealt with before MSAs are given to IQTree. Again, there is no obvious evidence that this was done. Poor alignments produce unreliable inferences.

Response:
Thanks for your comment. We have updated the phylogenetic analysis by IQ-TREE using GBlocks-trimmed alignments (-b4=5 -b5=h). The species phylogeny and triplet phylogeny results are consistent with previous results.
Species phylogenetic trees by IQ-TREE and ASTRAL using datasets of amino acid sequences (PROT, left) and nucleotide coding sequences (CDS, right) The authors have not convinced me that they have handled the analysis rigorously enough to support the claims based on this analysis. In particular, it seems just as likely that they are measuring artifact introduced by the handling of data as they are measuring ILS or introgression. I admit that measuring ILS and introgression is a worthy undertaking, but has not been convinced by the current handling of the data.

Response:
Thanks for your comments. Yes, it is not easy to distinguish introgression and ILS.
Here, we considered the radiation of Echinochloa at the first divergence node as an ILS 10.1093/sysbio/syaa099). In Echinochloa, QuIBL strongly indicated the main factor behind phylogenetic discordance among the Echinochloa subgenomes for radiation was ILS, rather than introgression.
Along with evolution, the footprints of ILS and introgression should be different.
Although recombination would split the introgressed blocks, genes with the same discordant topology caused by introgression would be expected to be located together.
In the evolution of Heliconius butterfly, large blocks showing the same topology were observed (Edelman et al., 2019, Science, 10.1126/science.aaw2090). It is a more focused review of the effect that CNVs have on glyphosate resistance.

Response:
We have added the literature.
Lines: 668-675 "Significantly" is used twice without an obvious statistical method. The pipeline CNVings is briefly described, but no mention is made of statistical test is applied to determine differences, perhaps it was a T-test.

Response:
Thanks for your comment. We have added the statistical methods for "significant". Only include the data relevant to your point(s). This will highlight the more relevant tracks (like Copia, Gypsy, etc).

Response:
Thanks for your suggestion. We have deleted the track about the GC content.
Line 1113: Not sure this is interesting in a graphical form. Just extra complexity and we understand well from the text how complete the genome(s) are. Also, it may allow more room for panel D and/or E to be larger and more interpretable.

Response:
We have removed this sub-figure.

Response:
Here we expanded all the branches aiming to show the population structure and phylogeny of Echinochloa species/subspecies more intuitively. Figure 3: Line 1161-1164: Same comment as before, what benefit do we get as a reader to seeing every branch and branch length? Simplify your figures where you can to tell a more concise and interpretable story.

Response:
Here we expanded all the branches to show the population structure and phylogeny of Echinochloa species/subspecies more concretely based on chloroplast genomes, and to show more clearly the phylogenetic positions of outliers as indicated by red asterisks.
Line?: The figure panel of the two clades (Below the ML phylogeny) is not described in the figure head.

Response:
Thanks for your comment. We have added the legend. Response: Figure 1D illustrated the NB-ARC number base on single reference genomes aiming to explore the dynamics of copy number among species, while Figure 5 discussed the copy number alteration by domestication (estimated by NGS resequencing data).

Reviewer #4 (Remarks to the Author):
The article brings novel insights into an evolution and agronomic significance of high polyploid grasses. Although there are several genome assemblies of allopolyploid species already available, the comprehensive analysis presented here brings valuable general insights into allopolyploid evolution in general. The model system is fascinating and in depth investigated by an impressive collection of global diversity of the three focal species. The analyses are diverse and at many aspects very comprehensive (esp. genome assembly, investigation of range-wide diversity) while the analysis of selective footprints of local adaptation, herbicide resistance and domestication is rather limited to testing particular hypotheses, which are addressed by a diverse and somehow inconsistent approaches, that seem not to be always clearly justified. Although I do not see any major significant flaw, I have the following comments aiming to clarify and potentially improve some aspects of the study.

Response:
Thanks for the very positive comments on our manuscript.
1) The phylogeny of Echinochloa species and subgenomes is crucial for further interpretations. I understand the reasons why there are no more genomes assembled and sequences but I wonder whether there are at least some sequence data for additional species available. Addition of such data (e.g. in a supplementary tree involving a subset of loci gathered also from current genomes) might better clarify the origin of the other subgenomes, esp. FH and EH that are currently isolated in the tree. Putting these newly assembled (sub)genomes into a broader context of Echinochloa evolution would help understanding their evolution.

Response:
Thanks 2) Many recently assembled allopolyploid genomes demonstrate surprising stasis and lack of large genomic exchanges among the sub-genomes (e.g. Burns et al. 2021 Nat Ecol Evol and refs therein), however, it would be still worth checking and discussing if that is also the case for Echinochloa or whether they are in fact such homeologous recombinations present. Lack of homeologous exchange is an assumption of many downstream analyses but has not been discussed.

Response:
Thanks for your comment.
For E. crus-galli, in the process of genome assembly, we have noticed the existence of HEs in its genome at the step of contig correction (Supplementary Note 4).
The parental genomes are available, which provide the chance to comprehensively profile the HEs in E. crus-galli. By mapping ancestral genomes, we identified 24 candidate HEs in E. crus-galli (see the following figure). These candidate HE events were supported by HiC interaction and some are supported by the mapping mate-paired reads, which were unlikely to be caused by mis-assembly (see the following figures).
HEs would interfere the phylogeny based on whole-genome genes and genes within HE regions should be removed before phylogeny construction. Actually, none of the previously identified 3,557 single-copy genes were located within the candidate HE regions. Large HEs were observed to be close with telomeres (chromosome ends) (e.g. HEs on BH01, CH01, BH05) where genes are poor. Thus, the HEs would not influence the conclusions of phylogenetic inference in our case.
HiC interaction at single-chromosome scale in E. crus-galli genome None of abnormal signals are observed around HE regions (e.g. terminal regions on chromosome BH01, CH01, BH05).
Due to the lack of direct ancestors of E. oryzicola and E. colona, we did not check the HEs in these two species.
3) What is the breeding system of the species? In case of strong selfing, this shall be taken into account in the interpretations of diversity and differentiation from population resequenced data. E.g. could the decrease in nucleotide diversity among E.c. varieties (l. 650-653) be attributed to breeding system shift, possibly associated with domestication?

Response:
Yes, the breeding system of Echinochloa is selfing. As we know and observe, no breeding system shift occurs among E. crus-galli varieties.

4) l 217-240
The finding of contraction of NB-ARC gene family is interesting, it is however unclear how much this case represents an outlier in gene contraction among different gene families in Echinochloa. Hence, it is hard to speculate on adaptive significance of such a reduction (see l. 219) and it is a bit unclear the motivation why an entire paragraph is devoted to discussion of this particular family. Is the NRC-ARC family contraction significantly higher than is a genome-wide average in Echinochloa?

Response:
Thanks for your comments. A focus on the NB-ARC gene family was not a random selection, rather it was based on our analysis results on the dynamics of family sizes of several genes with a role in abiotic and/or biotic stress response (see Supplementary Fig. 8 and the following Table) and that the fact of NB-ARC being crucial in plant"s fitness.
As you can see from the following Table, compared to abiotic response gene families, biotic response gene families in Echinochloa showed relatively smaller in size, compared to its neighboring species, such as Alloteropsis semialata and Setaria italica, particularly the NB-ARC family.
Comparison of abiotic and biotic gene family sizes 5) Fig. 1E: It is totally unclear from where the time estimates come from. Are they based on some uniform substitution rate estimate in grasses? If so, such an assumption seem to be tricky given known variation in substitution rates, and surely presenting a single point estimate of divergence time, based on such an approach, may be very misleading. It will be beneficial either to date the splits based on calibrated phylogeny (includng confidence intervals for each node), or such numbers shall be excluded from the main figure to avoid potential misleading divergence times (these values seem not to be very important for further interpretation anyway). Related to that, a phylogeny based on multispecies coalescent reconstruction across multiple loci (i.e. Astral, FIg S13), i.e. an approach directly developed for phylogeny reconstructions, may provide more robust than the relatedness based on the distribution of nonsynonymous substitutions.

Response:
Thanks for your suggestions. We have rebuilt this sub-fig by removing the divergence time and replacing Ks-based phylogenetic tree with multi-loci coalescent tree by ASTRAL.
6) The motivation for using particular methods for detection selection is not always clear and raises concerns about the homogeneity of the approaches throughout the ms. Specifically, why results of flowering-time variation are based on a genome-wide divergence (Fst) outlier scan while e.g. the domestication process is investigated by allele frequency spectra-oriented approach targeting only a subset of (pre-selected?) loci?

Response:
Here we focus on domestication-related genes that have been cloned in rice. The basic molecular biology studies are very limited for Echinochloa, thus in order to have an overall understanding to Echinochloa domestication, the cloned domestication-related genes in rice provide good resources to explore the genetic mechanism behind the barnyard millet domestication. The analyses of domestication-related genes in barnyard millet is straight-forward, and the haplotype profiling in Echinochloa is comparable against rice domestication. Genome-wide scan (e.g. Fst) to detect domestication regions is popular and useful, but the scan results usually contain a large proportion of false positive genes with unknown function and roles in domestication. Nevertheless, genome-wide scan of differentiation between wild and cultivated Echinochloa will provide valuable resources for following molecular studies and millet breeding and improvement. We have added genome-wide Fst scan and listed all the genes within the top 5% genomic regions with cloned rice genes annotation in Supplementary Tables 15-17 in this revise version.

7)
In some candidate loci it is unclear how the variation with these loci looks in the other two homeologs within a hexaploid (e.g. Hd1 is reported only from subgenome AH but nothing is mentioned about subgenome BH and CH). Are they pseudogenes? Invariable but likely functional loci? Variable but not enough to be considered outliers? Undetected? Such a context would provide important hints about the role of allopolyploidsation and potential sub/neofunctionalisation in adaptation.

Response:
Thanks for your suggestion. We have added the analyses on differentiation and selection of other two copies of Hd1 in subgenomes BH and CH (Supplementary Figure   31 Preferential differentiation and selection on Hd1 in E. crus-galli and E. oryzicola (a) genomic differentiation scan in 20-kb sliding windows around three orthologs of Hd1 (AH06.1150, BH06.1184 and CH06.1302) on chromosome AT06, BT06 and CH06. Red dots represent the window harboring Hd1 genes. Red and black dashed lines represent thresholds of top 1% and 5% Fst values, respectively. (b) Tajima"s D scan in 20-kb sliding windows around Hd1. Red dashed lines represent 95% significance thresholds. (c) nucleotide diversity π in 20-kb sliding windows around Hd1.
8) The entire section on herbicide resistance (l. 569-627) lacks a primary exploratory genome-wide approach and/or justification why the entire parts focuses only on a small subset of particular loci. As a result, is appears as somehow selective "cherry-picking" of certain loci without a justification how much variation is such loci is exceptional in the context of the entire genome. Thus, although this section provides some hints of potentially valuable loci for a follow-up it does not adhere to the major focus of the paper on genome-wide variation and evolutionary significance of genomic variation related to allopolyploidy. At current state this section provides rather some particular supplementary information targeted to a very specialised audience.

Response:
Thanks for your comments. As weeds, world-wide Echinochloa grass suffered from not only local natural selection, but also artificial stress by agricultural managements.
Human-directed recent selection pressure on weeds largely comes from the wide application of herbicides. You are right that the loci presented in the section were not picked up by a genome-wide scan as that requires functional evidence (can be generated by large scale phenotyping and GWAS analyses following by functional confirmation) of all possible loci associated with herbicide resistance, which is current unavailable. We thus focused on the genes with a confirmed role in herbicide resistance and surveyed their variations in the world-wide Echinochloa accessions to generate a mutation atlas of both target site (TSR) and non-target-site resistance (NTSR), as we believe that the variations contributed to herbicide resistance are very important for evolution of Echinochloa species and their adaptation to diverse or localized agrosystems.
While the trait of herbicide resistance per se may be not of interest for those working on polyploid and crop evolution, it"s a very important and demanding topic of agriculture as a whole. Thus, the results should be of general interest.
9) The Discussion is very brief and descriptive. Although not being an expert in the filed of agricultural genomics, I doubt there is so scarce literature on the topics as it is presemted here (in total four citations in the entire Discussion section). E.g. considering additional well-researched cases of the evolution of agricultural weeds such as Amaranthus tuberculatus (Kreiner et al. 2019PNAS, Kreiner et al. 2018 ARPB and refs. therein) would be helpful. Also putting the result into a context of other emerging studies on genomic evolution in allopolyploids (e.g. Arabidopsis suecica, Brassica napus, Gossypium, ....) might increase the impact of the study.

Response:
Thanks for your suggestions. We have overhauled the discussion section by discussing a bit deeper on the relevant topics.
10) Although the authors at multiple places highlight the significance of their study for follow-up genetic studies, list of candidate loci for adaptation towards the discussed factors (resistance, latitude, domestication...) are missing. It would be evry helpful providing complete lists of candidate loci found genome wide by Fst scans, GWAS approaches and SFS-based scans as such primary data would provide molecular biologist a key resource for addressing particular genes in this emerging model.

Response:
Thanks for your suggestion. We have listed all genome scan results and candidate genes in the supplementary materials in the revised version of the MS.

Minors
-I consider crucial the fact the studied group is ALLOpolyploid. I suggest adding this information already to the abstract.

Response:
We have added the information in the abstract.

Response:
We have revised it.
-l. 253 "calculated the pairwise Ks" is a jargon. Please explain the principle/motivation behind this analysis.

Response:
We have updated the wording.
-l. 321 based on what the authors conclude that "introgression may have facilitated an increase in environmental adaptation"? This seems rather speculative given the presented Ks analysis.