Comprehensive genomic resources related to domestication and crop improvement traits in Lima bean

Lima bean (Phaseolus lunatus L.), one of the five domesticated Phaseolus bean crops, shows a wide range of ecological adaptations along its distribution range from Mexico to Argentina. These adaptations make it a promising crop for improving food security under predicted scenarios of climate change in Latin America and elsewhere. In this work, we combine long and short read sequencing technologies with a dense genetic map from a biparental population to obtain the chromosome-level genome assembly for Lima bean. Annotation of 28,326 gene models show high diversity among 1917 genes with conserved domains related to disease resistance. Structural comparison across 22,180 orthologs with common bean reveals high genome synteny and five large intrachromosomal rearrangements. Population genomic analyses show that wild Lima bean is organized into six clusters with mostly non-overlapping distributions and that Mesomerican landraces can be further subdivided into three subclusters. RNA-seq data reveal 4275 differentially expressed genes, which can be related to pod dehiscence and seed development. We expect the resources presented here to serve as a solid basis to achieve a comprehensive view of the degree of convergent evolution of Phaseolus species under domestication and provide tools and information for breeding for climate change resiliency.

P3: It is inaccurate to state 10,497 SNP markers were used, as these are not independent "markers". It appears that the number of markers is 522. I am also curious how many of these 522 were observed in each line, versus being missing and imputed. This gives rise to my problems with the introgression statements above.
P4, Fig 1: The legend uses "Phaseolus lunatus" while the text uses "Lima bean". The two should use the same name for the organism. P4: "SNP density". Note this is the density of SNPs able to be called from GBS, which relies on uniquely mapped sequences; thus it is not a true measure of SNP density or sequence diversity. Figure 2: Phaseolus lunatus vs Lima bean again. Also legend on parts A and B is hard for me to see the colors representing the samples and impossible if printed out. I would prefer a more saturated, color-blind friendly color scheme, bigger legend and perhaps a different shading on the different samples. Figure 3: Here, for some reason, we go back to "Lima bean" not "Phaseolus lunatus". P11: "contained 49 accessions from northern Guatemala, Costa Rica, northern Colombia and few from the Peninsula of Yucatan". It would be good to have a number here rather than "few". It is not clear if the authors mean "few" (in this context, means little or none) or "a few" (a small number).
P14: "Different analysis alternatives" is unclear here and would usually refer to different statistical methods. What I think the authors mean is whether the ratio of gene expression is measured between the stages or accessions? P15: "Genes with predicted resistance to biotic stresses". Genes don't have resistance to biotic stresses, they confer resistance on plants.
P18, Materials and Methods: DNA extraction procedures, and the tissue and stage of plants used for extraction, seem to be missing.

Reviewer #3 (Remarks to the Author):
This revised manuscript is much approved and I appreciate the authors' efforts on additional data analyses that provide some biological and genetic insights.

Minor:
Last two paragraphs on Page 5: These paragraphs describing QTL mapping should be in a separate section. Fig. 2c: Please carefully check the inversions shown in this figure, and make sure that the inversions contain no errors introduced such as by insufficient information during the pseudomolecule construction process (e.g., some scaffolds could be non-oriented).
Change "it decays at about 500 kbp" to "it decays to background levels at about 500 kbp" to make it clearer. Also change "basal levels" to "background levels" in the previous sentence.
Comment not directly related to this version of the manuscript: GWAS (although it's not related to the current manuscript as the GWAS results have been removed but I think it should be helpful for future related work): I am sorry that the authors misunderstood my comments related to GWAS. I did not mean that GBS data cannot be used for GWAS. Instead, I support using GBS data for GWAS. However, in the previous version of this manuscript, the Manhattan plots showed that only one SNV standing out for most of the association loci (instead of a cluster of SNVs; please see Figs 3-5 of this paper for examples: https://www.nature.com/articles/s41588-018-0266-x). This indicates that the results are suspicious and could be unreliable. This could be due to the low-density of the marker data or inappropriate model used for GWAS. The classical Mixed Linear Model is widely used for GWAS but this does not necessarily mean that the model would fit your data. Trying different models and quality checking such as with Q-Q plots and lambda values are highly recommended.

R /
We are glad to know that the answers provided to the comments included in the last review were satisfactory and we thank the reviewer for all his/her suggestions including those present in the current review.
Regarding the potential effect of missing data, we understand that this has been an issue in some manuscripts based on GBS data. However, in our case we performed a conservative filtering to minimize this effect in our downstream analysis. More than 1 million reads were generated for each sample (average 6.8 million) and the high mapping rates (at least 59% and less than 80% for only 20 samples) ensured an adequate number of aligned reads. Although the raw genotype matrix of 116,030 SNPs has about 40% missing data, almost all results related to the diversity analysis were performed on a filtered dataset with 12,398 SNPs, which, among other filters, required at least 450 individuals genotyped for each SNP. The overall missing data rate of this dataset was 5.3%, genotype calls for each sample were present for at least 75% of the SNPs and this percentage is less than 90% for only 6 samples. Moreover, the missing data was not imputed to perform the structure analysis and for general population statistics.
About the low number of landraces showing signals of introgression, we found that about the 10% of the landrace accessions analyzed showed signals of introgression. This is maybe the expected pattern for a predominantly autogamous species as Lima bean. An important aspect to take into account is that in the cultivated field, gene flow from wild populations may be limited through selection that farmers make against hybrids that carry wild traits. Although wild-weedy-crop complexes in Lima bean have not been well-studied, the available data suggest that in some places of the Peninsula of Yucatan gene flow may be asymmetric, predominantly in the direction from crop to wild. Although it is always possible that a few small introgressions could be identified if the correct genotype calls would somehow be identified to fill the missing data, we do not anticipate a much larger number of introgressions. We improved the methods text (page 25) and added the supplementary data file 4 with the numbers mentioned in this answer.
The only analysis for which imputation was performed was the fineSTRUCTURE analysis because this tool can not handle missing data. Because we had to impute data, we decided to use a less filtered dataset including 28,753 SNPs genotyped in at least 300 accessions. As expected, this dataset has a larger missing data rate of 11%, which was the exact percentage of imputed data. As explained in the results, the results obtained with fineSTRUCTURE were highly consistent with the clustering performed by STRUCTURE and moreover it was consistent at a finer level with the geographical origin of the samples.
Unfortunately, the review is not precise in the exact sentences of the discussion regarding diversity could be more definite than the data included in the results. We performed the following statement by statement analysis:

Sentence Support with results
The analysis of this information revealed, at a greater detail, the genetic structure of wild and domesticated Lima beans across their large distribution range in the Americas, (on page 15, first paragraph of the discussion) Analysis performed by fineSTRUCTURE, compared to Chacon-Sanchez et al 2017.
Genes that confer resistance to biotic stresses on plants show large nucleotide diversity but good correspondence with common bean resistance genes based on orthology predictions and literature. (0n page 15, second paragraph of the discussion) Diversity analysis of genes related to biotic stresses ( Figure 2e) Accordingly, we found that Andean wild populations are much less diverse than the Mesoamerican ones. In wild common beans, a reduction of diversity was also documented in Andean populations but this time this reduction was attributed to the effect of rare longdistance dispersal events from Mesoamerican populations to the Andes. (on page 16, second paragraph) Supplementary Results of the introgression analysis ( Figure  3d). As detailed below, the 522 loci mentioned in the comment below correspond to the biparental population (not the diversity population) and hence are not related to the analysis of introgressions.
The higher number of samples and genetic markers analyzed in the present study also allowed us to resolve the genetic substructure of the MI gene pool and its relation to geography. Within the MI wild cluster, two subclusters were observed: northern-western Mexico and southern-western Mexico. Also, within the cluster of Mesoamerican MI landraces, three subclusters were detected: Mexico-Central America, South America and the Peninsula of Yucatan. (on page 16, third paragraph) Results of the fineSTRUCTURE analysis ( Figure 3C) The fact that landraces from Peninsula of Yucatan form a separate subcluster and that their genetic diversity is significantly much lower than other MI landraces, is in agreement with the hypothesis that after its domestication in central-western Mexico, Lima bean was a relatively late adoption of the Mayan communities of the Peninsula of Yucatan. (on page 16, third paragraph)

Supplementary table 5. Diversity statistics.
Average H E for this population was 0.03, the lowest of the entire dataset and lower than MI landraces from other locations (0.07). However, we toned down this statement using the word "relatively" instead of "significantly much".

Other minor comments:
Abstract: "the present resources" implies the resources are already available, I think the authors mean "the resources presented here".
R/ "the present resources" was changed to "the resources presented here" as suggested by the reviewer. Moreover, the abstract was reduced to 150 words to meet the editorial requirements.
P2: The P. vulgaris genome alone "produces loss of information". This isn't exactly true and the truth is more important. The use of the P. vulgaris genome prevents Lima researchers from gaining a true picture of genetic diversity, gene expression and causal polymorphism because of reference bias, and also may lead to loss of information when used for analysis in key regions where the two differ.
R/ This point was acknowledged in the document by saying that "relying on a P.
vulgaris genome alone may have consequences for downstream diversity analyses due to reference bias".
P3: It is inaccurate to state 10,497 SNP markers were used, as these are not independent "markers". It appears that the number of markers is 522. I am also curious how many of these 522 were observed in each line, versus being missing and imputed. This gives rise to my problems with the introgression statements above.
R/ We adjusted the text of the result to avoid calling the SNPs "markers". This statement refers to the biparental population (not the diversity population) and hence the 10,497 SNPs identified in this population do not have any relationship with diversity analysis and particularly with the introgression analysis. This population was sequenced at a relatively lower depth and then the overall missing data rate of 21%. These missing data were imputed to build the genetic map and identify QTL.
P4, Fig 1: The legend uses "Phaseolus lunatus" while the text uses "Lima bean". The two should use the same name for the organism.
R/ In the legend of figure 1, "Phaseolus lunatus" was changed to "Lima bean".
P4: "SNP density". Note this is the density of SNPs able to be called from GBS, which relies on uniquely mapped sequences; thus it is not a true measure of SNP density or sequence diversity.
R/ In this part of the text "SNP density" was changed to "observed number of SNPs" for more clarity.