Two divergent haplotypes from a highly heterozygous lychee genome suggest independent domestication events for early and late-maturing cultivars

Lychee is an exotic tropical fruit with a distinct flavor. The genome of cultivar ‘Feizixiao’ was assembled into 15 pseudochromosomes, totaling ~470 Mb. High heterozygosity (2.27%) resulted in two complete haplotypic assemblies. A total of 13,517 allelic genes (42.4%) were differentially expressed in diverse tissues. Analyses of 72 resequenced lychee accessions revealed two independent domestication events. The extremely early maturing cultivars preferentially aligned to one haplotype were domesticated from a wild population in Yunnan, whereas the late-maturing cultivars that mapped mostly to the second haplotype were domesticated independently from a wild population in Hainan. Early maturing cultivars were probably developed in Guangdong via hybridization between extremely early maturing cultivar and late-maturing cultivar individuals. Variable deletions of a 3.7 kb region encompassed by a pair of CONSTANS-like genes probably regulate fruit maturation differences among lychee cultivars. These genomic resources provide insights into the natural history of lychee domestication and will accelerate the improvement of lychee and related crops.

4 the two haploid genomes are from the accession. Other EMC accessions may have a similar genetic background as "Feizixiao," so they have comparable mapping coverage when using "Feizixiao" as the reference. For example, they may be descendants of "Feizixiao," or they are also hybrids of LMC and EEMC, as shown in Fig. 3A.
For the timing of hybridization, I am not sure why the HY haplotype and HH haplotype need to have diverged from the common ancestor of lychee and longan . In Line 226-228, the divergence between HNW and YNW has been estimated at 1.89 Mya (1.24 Mya in Line 623-624 though), so this is probably the earliest when the two haplotypes originated. There might be gene flow between the two populations, so the divergence between the two haplotypes could be inferred using the phylogenetic tree with single-copy orthologs as in Suppl. Fig. 15.
In the analyses of differential expression alleles, it is not entirely clear how the authors distinguished RNA-Seq reads from different alleles. Although a figure in the Methods suggests that only uniquely mapped reads with allele-specific SNPs are considered, mapping tools may tolerate mismatches that complicate the procedure, so it would be great if the approach could be discussed in somewhat more detail and caveats mentioned. Fig. 4D does not show the differences of SNP densities for EEAs and DEAs in various gene features (Line 420-423), but just SNP densities in various gene features. The authors state that indicating that DEAs (differentially expressed alleles) were under greater purifying selection pressure than were EEAs (evenly expressed alleles) (Fig. 4C) (lines 413-418). I'm not sure I understand. I would intuitively assume that purifying selection would keep expression of the alleles the same, unless the expression is different from the start (since the 'merging' of the two subgenomes) and this has to be maintained (is this what the authors mean?), while positive selection could be responsible for different alleles having different expression, like if you would compare duplicated genes with identical functions, and positive selection on one copy gives it a different function.
According to the authors (section "Flowering-related genes in lychee"), flowering time is the determinant of fruit maturation in lychee, and there is a tandem expansion of SVP genes in lychee and longan. Knocking out the SVP genes in peach would result in the evergrowing mutant in a certain environment. This may indeed suggest possibly adaptation -or at least a link -to climates, but I think this would naturally lead to another question of whether there is any copy number variation of the SVP genes among different cultivated and wild lychee accessions? Also, is there any copy number difference for the HH and HY haplotypes in the "Feizixiao" accession?
In the section on 'Cultivation history of lychee', it is not entirely clear to me what is meant with second (and higher-order) relationships? The authors write that 'estimates of the level of relationship' were obtained using the KING software and reflect the level of shared heterozygous or homozygous haplotype blocks. But what does 'level' mean? I could not find this in the manuscript.
Reviewer #2: Remarks to the Author: The authors sequenced the genome of lychee and re-sequenced 72 accessions which allowed the to obtain genome wide SNPs. They use resulting information to investigate demography, domestication and to look for evidence of selection and allele specific gene expression. Overall the study is intended as resource for breeding.
At some level the study is novel as it provides information for a tree crop that has not been studied extensively using genomics. However it was difficult for me to see the argument for general interest as the genome information is not really used to generate novel biological insights.
The pop gen analysis also needs help Line 194 : the simple ratio of numbers of non-syn and syn polyorphims is likely affected by the different samples sizes of the three datsets compared here. The authors should calculated pi_n/pi_s, which allows to compare datasets with different sample sizes.
Line 204 (and 212): Figure 2A is not a "phylogenetic tree". This is because meotic recombination affected the genealogical relationships between these accessions such that no single tree can explain the evolutionary history of the sample. The this tree as a graphical representation of genetic distances between the 72 resequenced accessions.
Line 227 : it would be good to have an independent estimate of this split time done using another method (either msmc2,fastsimcoal2,dadi,or Relate). This would confer more confidence in this important result.
Line 244: Could it be that those differences also arise because of difference in levels of selffertilization in natural populations? Figure 3A: how exactly are those kinship relationships calculated? l.279: what do they authors mean by "neutral" mutations? Synonymous polymorphisms? It would seem appropriate to use synonymous nucleotide diversity, which allows the comparisons of diversity between samples of different sample sizes.
L 316-318: Having more pronounced LD pattern in the cultivars versus the wild populations alone could simply result from repeated bottlenecks during domestication. l. 320:On "time of cultivation"? Has this been done using archeological records? l. 333: All arguments based on changes in proportion of "high-impact mutation" would benefit for rechecking for effect of different sample sizes. l. 387: This first sentence is not clear: what is meant by "adaptability" here? Is it used in the sense of Evolvability (Payne and Wagner2018).
l.395 15000 DEA out of how many genes, 30000? That is every second gene is a DEA? Is it still possible to do meaningful enrichment analysis with this large number of DEA? l. 416/417: using 0.1 as a cutoff for "strong purifying selection" is arbitrary -what is the justification. l. 433 I don't understand what evidence the authors have that DEA are time or tissue specific. l. 542: just having Ka/Ks > 1 is not enough evidence for stating that the gene was targeted by positive selection. The authors need to use additional analyses to provide significant evidence using selective sweep detection methods like Sweepfinder2 (http://degiorgiogroup.fau.edu/sf2.html). This applies to other parts of the manuscript. l. 550: What is the value of reporting non significant associations?
Reviewer #3: Remarks to the Author: Hu et al. report a genome assembly for the specialty fruit crop lychee and utilize resequencing and comparative genomics to uncover the domestication history and genomic basis for flowering time variation in lychee, respectively. The authors produced a fully phased, chromosome-scale assembly of the highly heterozygotic lychee genome and uncovered extensive allele specific expression. They provide evidence for two domestication events in lychee and identified homologs of flowering time genes in Arabidopsis that may be related to flowering time differences between early and late maturing cultivars. I read this paper with interest, but I have a few concerns that should be addressed before publication.
1. I have some concerns about the genome assembly and haplotype phasing. It is unclear based on the results and methods how one haplotype was extracted from the assembly and used for Hi-C anchoring. Based on the Hi-C contact matrix and BUSCO score, it seems like HaploMerger2 successfully binned the genome into two haplotypes, but each bin of contigs likely represents a chimera of the HY and HH haplotypes stitched together. Based on the raw PacBio assembly size, both the HY and HH haplotypes were assembled for 'Feizixiao', so couldn't these be used as a basis for haplotype phasing? This would give the authors a better sense of copy number variations, TE polymorphisms, large-scale rearrangements, and other differences between the HY and HH haplotypes in 'Feizixiao' compared to using Illumina data alone. This may also improve alignment of reads for the resequencing data and analysis of differentially expressed alleles. If the within genome heterozygosity is truly ~2.2%, the HiC data should be sufficient for phasing HY and HH from the original assembly, especially if the authors use ALLHiC (which was created by several coauthors of this manuscript). There are very few phased haplotype genome assemblies for plants and additional analyses using an improved assembly would strengthen the manuscript. Details on haplotype phasing and verification are vague. The authors state HapCUT2 was used for phasing and 'eventually, 15 pairs of homologous chromosomes were obtained'. If the original haploid assembly was a chimera, wouldn't these chimeric regions be carried over in the resulting HY and HH haplotypes? I am unsure why aligned Illumina data was used for phasing when both alleles were already assembled in the raw PacBio contigs.
2. The results on maturation time are interesting, but the finding that CONSTANS genes contribute to maturity time is not well supported. It is not surprising that the authors found no statistically significant GWAS peaks for flowering time given the low sample size (62 based on the methods) and significant population structure between the wild and independently domesticated cultivars. The authors identified one flowering time related gene from the top 20 GWAS peaks, but this is not unexpected since the genome contains so many flowering genes. The heterozygous deletion in one haplotype is interesting, but because both alleles are expressed, it's difficult to say if this plays a role in flowering time variation. Additional evidence would help strengthen the claim that this gene/gene pair is involved in maturation.
3. The manuscript is quite long overall, and some sections contain extraneous text that distracts from the most meaningful findings. For instance, the conservation of a VRN1-like gene cluster across eudicots is interesting but does not yield meaningful results in lychee. The number of VRN1-like genes is similar in lychee and other genomes and many of these genes are expressed across diverse tissues, so it is unclear what role this gene cluster may play in flowering time regulation compared to other species.
Minor: It is difficult to distinguish the wild from cultivated, maturity classes, and country of origin of the accessions in Figure 2A. I like the concept of using different fruit and leaf colors, but it is hard to interpret.
Lychee is a specialty fruit and I suspect many readers will be unfamiliar with the different cultivars of lychee and their distinguishing characteristics. It may be helpful to include pictures of representative cultivar groups for the wild, extremely early, early, and late-maturing groups in one of the figures.

Dear Editors and Reviewers,
Thank you for your time and effort for handling our manuscript. We are pleased that the Reviewers and the Editors believe that our work is of great interest and provide new, valid information to the plant genomics field. We are grateful for the invaluable comments and suggestions from the three Reviewers, which were very helpful for the revision and improvement of our manuscript. We have substantially revised our manuscript by conducting additional sequencing, redoing our haplotypic genome assembly, reanalyzing many datasets, and redrawing or reorganizing a few figures. Below we provide a brief summary of the revisions made to best address the Reviewer's concerns and questions.
The major revisions are as follows.
(1) To address the concern expressed by both the Reviewers and Editors on the haplotype phasing, we first performed additional genome sequencing using the 10X Genomics platform to better assess the accuracy of haplotype phasing and to improve the identification of haplotypic SNP blocks.
(2) For further improvement of haplotype genome quality, we adopted a completely new strategy to assemble haplotypic genomes. A new set of haplotypes with significantly better SNP haplotyping consistency was obtained and used for downstream data reanalyses, including new variant analyses.
(3) We were able to obtain more haplotypic genomic information, including structural variation, copy number variation, and large-scale rearrangements, by comparing the new haplotypic genomes. Using the new and improved haplotype assemblies, we redid most of the analyses related to the differential expression of allelic genes.
(4) For the population genetic analyses, we have redone many analyses or added additional analyses, as reviewer #2 suggested we solidify our results. In order to carry out the revisions we developed new software to calculate the requested n / s values (coined as PiNSiR, n s analysis in R). The software is available in github at https://github.com/jsalojar/PiNSiR. Additionally, we found that inbreeding has a strong impact on demographic analyses using coalescent approaches, such as pairwise sequentially Markovian coalescent (PSMC), Stairway plot, and SMC++ models, and we here demonstrate its effect for PSMC. We summarize the result in a new supplementary note where we simulate the effect of inbreeding with an R code developed for the purpose, and we provide a practical approach for compensating for it. as thoroughly as you can. Please note that we would expect to see all technical points fully addressed as a condition of further consideration of your manuscript. 2) Please delete or re-phrase the statement on the origin of the Feizixiao cultivar "to please his favorite concubine".
>>>Thank you for identifying the first key priority and important second point. We apologize for the confusion or inclarities in our original manuscript.

Response to the priority #1:
Reviewer #3 was concerned with the haplotype phasing strategy we previously used, and he/she recommended to use AllHiC for direct haplotype phasing and contig anchoring simultaneously. This was a pertinent suggestion, as we had indeed tested AllHiC when we started to assemble the initial genome, but unfortunately AllHiC did not provide a good result at the time. To better address the reviewer's concern, we (including the main developers of the AllHiC tool, Dr. Xingtan Zhang and Dr. Haibao Tang) repeated the work using the AllHiC strategy (as the reviewer #3 suggested), but in a much more thorough way. We still obtained poor results, which we concluded to mean that AllHiC was not appropriate for our data. In a best-case scenario, assembled contigs assigned to different homologous groups (HGs) can be separated into contig groups (in the partition step of AllHiC), showing good synteny to corresponding chromosomes. But in our reanalysis, as shown in the figure below, although assembled contigs can be well separated into 15 HGs, most contig groups in a HG (y axis) do not show a good 2-to-1 relationship to the corresponding chromosomes (x axis) of the reference genome. This is likely due to our HiC data not effectively distinguishing preassembled haplotypic contigs, as was suspected by the referees as well. Among the 15 chromosomes, only Chr7 and its HGs showed a good 1-2 relationship. Most of the other HGs represent partial or incomplete 2-to-1 correspondences to their related chromosomes, for instance, Chr1,8,9,11,14. For some HGs, only 1-to-1 relationships were detected, including Chr5, 6 10, 15, suggesting that the two haplotypes of these chromosomes could not be possibly distinguished by AllHiC either.
To ascertain why AllHiC did not work in our case, we carefully evaluated our results, and found that the main reason for the failure of AllHiC was because the assembly tool CANU likely introduced chimeric contigs in the first (error correction) step of contig assembly. CANU is one of the most popular and effective tools used for assembling contigs from raw PacBio long reads, and we still preferred to employ it given our experience with various assembly options available. The problem of chimeric assembly using CANU has also been demonstrated and reported in our original AllHiC paper (Zhang et al., 2019, "Assembly of allele-aware, chromosomal-scale autopolyploid genomes based on Hi-C data", Nature Plants). Once the chimeric contigs are introduced, it is almost impossible to resolve them in the following steps. Although the CANU-trio binning algorithm is a haplotype phasing method, it requires parental genome information, which was lacking in our case. Therefore, more data were obtained and a new phasing approach was employed, as described below.
To evaluate the accuracy of our haplotype phasing, we resequenced the 'Feizixiao' genome using the 10X Genomics platform (10X data) to obtain long-range linked-reads data with 100X coverage. Using the 10X data, we found that our original phasing accuracy of SNPs using HiC data was around 90%, which we considered to have been reasonable. For sequence blocks with SNP quantity >50 per block, the consistency between each haplotype and the 10X data is between 86-90% (as shown in the However, inspired by the reviewers' suggestions and a recent strategy our coauthor Dr. Xintan Zhang used in another study (Zhang et al., 2020, Genomes of the banyan tree and pollinator wasp provide insights into fig-wasp coevolution, Cell), we decided to adopt a similar approach for an all-new haplotype phasing. SNP blocks identified from HiC data and 10X data are used for direct phasing of raw PacBio long reads before the step of CANU assembly. After the phasing, reads separated into two groups were subsequently assembled by CANU independently, after which assembled contigs were scaffolded using RaGOO with the guidance of the reference genome, achieving 15 pairs of homologous chromosomes. The workflow is shown below, and in Supplementary figure 1 as well. Detailed steps are described in Methods.
In this way, we were able to obtain two new, high-quality haplotypic genome sequences. Although their completeness dropped slightly, the accuracy of haplotype phasing was improved considerably. We consider this tradeoff reasonable, since the strategy involved separating raw PacBio reads into two groups for haplotype CANU assembly, where the amount of PacBio reads used for each haplotype assembly was almost half of the total reads. Accuracies of the new haplotype assemblies compared to 10X data were increased to 93-96% for genomic blocks with >50 SNPs (shown in the

Response to the priority #2:
We have reworded the sentence to "In the ancient Tang Dynasty, roughly 1300-1100 years ago, the Emperor set up a courier service with fast horse relays to transport fresh lychee from southern China to the imperial court because of the prodigious flavor of the spoilable fruit." at Lines 76-79.

Reviewers' Comments:
Reviewer #1: Remarks to the Author: This paper describes a highly heterozygous lychee genome consisting of two clearly divergent haplotypes. Together with 72 resequenced cultivated and wild lychee accessions, the authors infer that the de novo assembled genome originated from a hybrid between two independently domesticated accessions with opposite features of the fruit-maturation period. Also, in the hybrid genome, alleles from different haplotypes may express differently in the same tissue. Some of the expression differences may be correlated with the early or late maturation period of lychee fruits. I have read this paper with great interest. In general, the paper is well written, and the data and results are clearly presented. However, there are still a few points that need to be clarified, in my opinion.
>>> Thank you; we greatly appreciate these positive comments.
In the genome (haplotype) assembly and annotation, it is not entirely clear to me how the two haploid genomes were annotated. According to the gene prediction results, the two haplotypes have exact same gene numbers on each chromosome as the reference genome (Suppl . Table 8). However, the two gene sets each have over 70% genes with indels and SNPs, over 90% of which would change the amino acid sequences (Line 364-372). Therefore, I was wondering whether there would be any non-sense SNPs/indels or structural variations between the two haploid genomes, because of the independent domestication history of the two haploid genomes as claimed in the paper. For example, later in the paper, there is a missing intergenic region, which may be attributed to the maturity time.
>>> Thank you for raising this question. We performed a genome-wide calculation. There are indeed non-sense SNPs/indels, which account for ~2.6% (8,292/319,125) of all the nonsynonymous SNPs, and these occur in ~9.2% (2934/31,896) of all annotated genes. A GO functional analysis revealed that those genes with non-sense SNPs/indels were significantly enriched in biological processes related to defense responses (see table below). We added this information to the revised manuscript at Lines 382-384.
Top 20 GO enriched terms ("Biological process") for genes with non-sense SNPs/indels Lines 359-361, the comparable mapping coverage to both HY and HH of some EMC accessions, would not suggest they are recent hybrids. Actually, one would expect a half-half read coverage for any self-mapping in a diploid genome, and this is exactly the case for the read mapping of "Feizixiao" because the two haploid genomes are from the accession. Other EMC accessions may have a similar genetic background as "Feizixiao," so they have comparable mapping coverage when using "Feizixiao" as the reference. For example, they may be descendants of "Feizixiao," or they are also hybrids of LMC and EEMC, as shown in Fig. 3A.
>>>Thank you for this concern. We agree that "a half-half read coverage would be expected for any self-mapping in a diploid genome". However, in our case, it was not self-mapping; instead we mapped the resequencing data from different lychee varieties (both wild and cultivated) to the two haplotypes of 'Feizixiao' independently. For 'Feizixiao' itself, half-half coverage indeed came from self-mapping, but for all other varieties, their biased coverage between HY and HH was not due to self-mapping, but instead because of the diverse genetic backgrounds studied.
We also agree that the comparable mapping coverage of other EMC varieties suggested that they may have similar genetic backgrounds as 'Feizixiao', i.e., that they might be descendants of 'Feizixiao' or hybrids of LMC and EEMC, which is also supported by our inferred cultivation history for lychee varieties (Figure 3). We have revised a few words to make this information clearer at Lines 363-366.
For the timing of hybridization, I am not sure why the HY haplotype and HH haplotype need to have diverged from the common ancestor of lychee and longan (Line 380-384). In Line 226-228, the divergence between HNW and YNW has been estimated at 1.89 Mya (1.24 Mya in Line 623-624 though), so this is probably the earliest when the two haplotypes originated. There might be gene flow between the two populations, so the divergence between the two haplotypes could be inferred using the phylogenetic tree with single-copy orthologs as in Suppl. Fig. 15.
>>> Regarding the divergence time, as suggested, we did consider and experiment with using single-copy orthologs. The divergence time estimate obtained in this manner was ~4.6 Mya (below), which is not rational for a within-species split. We therefore only feel comfortable reporting our estimates based on actual Hainan and Yunnan wild population genomic data, given biasing factors in the approach outlined above such as deep fossil calibration points (far older than any within-species split in Sapindaceae) and unknown (and family-by-family variable) natural selection influences on single-copy ortholog sequences. On the latter point, our genome-wide SNPs are expected to be mostly neutral given the low percentage of gene space per megabase of DNA in the lychee assembly.
In the analyses of differential expression alleles, it is not entirely clear how the authors distinguished RNA-Seq reads from different alleles. Although a figure in the Methods suggests that only uniquely mapped reads with allele-specific SNPs are considered, mapping tools may tolerate mismatches that complicate the procedure, so it would be great if the approach could be discussed in somewhat more detail and caveats mentioned.
>>>We understand the inclarity in our previous manuscript version. We used the STAR aligner to map RNAseq reads to the merged HY and HH haplotype genome. We agree that it is possible that the tolerance of mismatches of the mapping tool may complicate the distinction between different alleles. However, using STAR and selecting uniquely mapping reads may still be the best strategy to minimize the possibility of mismapping. During RNAseq read mapping, the mismatch mainly came from significant SNP positions, which means fewer mismatches but less SNP difference. Therefore, the unique mapping with less mismatches should ensure that the majority of mapped RNAseq reads comes from each corresponding haplotypic genome.
To appraise this, we performed quick evaluations for a few different RNAseq data. We first aligned RNAseq reads to merged haplotypic genome sequences and only allowed unique matches, which separated RNAseq reads into two groups, reads aligned to HH and reads to HY. Then SNP positions were calculated based on the uniquely mapped RNAseq reads. After that, RNAseq-based SNPs for each haplotype were compared to the SNPs detected by HaploCUT2 from HiC and PacBio data to calculate the SNP consistency. Overall, the consistency value was >95%, suggesting a very low level of mismapping. Therefore, our expression calculations for allelic genes based on unique mapping of RNAseq reads can be considered of high accuracy.
We added a few words on this to the Methods section. >>>Thank you for catching this deficiency. The old Fig. 4D fails to show the SNP differences between EEAs and DEAs. We replaced old Fig. 4D with an updated figure, in which the difference of SNP densities between EEAs and DEAs was compared among different gene feature regions. DEAs have higher SNP densities in most of these regions except exons and ~2kb after the 3'-UTRs.
The authors state that indicating that DEAs (differentially expressed alleles) were under greater purifying selection pressure than were EEAs (evenly expressed alleles) (Fig. 4C) (lines 425-427). I'm not sure I understand. I would intuitively assume that purifying selection would keep expression of the alleles the same, unless the expression is different from the start (since the 'merging' of the two subgenomes) and this has to be maintained (is this what the authors mean?), while positive selection could be responsible for different alleles having different expression, like if you would compare duplicated genes with identical functions, and positive selection on one copy gives it a different function.
>>> Thank you for pointing out this inclarity. Indeed, haplotypic expression was likely to be different when 'Feizixiao' formed from hybridization between long-distinct LMC/HNW and EEMC/YNW haplotypes, and this differential expression was likely maintained. In fact, this genomic plasticity might explain the heterosis observed for 'Feizixiao', which has much improved fruit traits compared to wild lychee.
According to the authors (section "Flowering-related genes in lychee"), flowering time is the determinant of fruit maturation in lychee, and there is a tandem expansion of SVP genes in lychee and longan. Knocking out the SVP genes in peach would result in the evergrowing mutant in a certain environment. This may indeed suggest possibly adaptation -or at least a link -to climates, but I think this would naturally lead to another question of whether there is any copy number variation of the SVP genes among different cultivated and wild lychee accessions? Also, is there any copy number difference for the HH and HY haplotypes in the "Feizixiao" accession? >>>Thank you, this is a good question. We double checked our original PacBio sequencing data of 'Feizixiao' and found that there is no copy number variation for SVP genes between the HH and HY haplotypes. Our 10X Genomics single-cell data also confirmed this result.
For other lychee variants (wild or cultivated lychee), we were not able to assess this because only Illumina short reads were sequenced for them, which were of too low coverage (average 13.72X) to reliably assemble the entire SVP cluster region.
In the section on 'Cultivation history of lychee', it is not entirely clear to me what is meant with second (and higher-order) relationships? The authors write that 'estimates of the level of relationship' were obtained using the KING software and reflect the level of shared heterozygous or homozygous haplotype blocks. But what does 'level' mean? I could not find this in the manuscript.
>>> We apologize, the nomenclature was an error made by a non-native English speaker. The order of relationship between individuals was estimated using KING-Robust algorithm (Manichaikul et al. 2010 ), which estimates a kinship coefficient that is claimed to be independent of sample composition or population structure. The proper expression is the "order" of relationship, whether it is monozygotic twins, 1st order, 2nd order or 3rd order, where the order is determined by the kinship value ranges recommended in the KING online manual. We have now corrected this in the text and added explanation of the kinship coefficient into Materials and Methods. Reviewer #2: Remarks to the Author:

References
The authors sequenced the genome of lychee and re-sequenced 72 accessions which allowed the to obtain genome wide SNPs. They use resulting information to investigate demography, domestication and to look for evidence of selection and allele specific gene expression. Overall the study is intended as resource for breeding.
At some level the study is novel as it provides information for a tree crop that has not been studied extensively using genomics. However it was difficult for me to see the argument for general interest as the genome information is not really used to generate novel biological insights.
>>> Thank you for your comment. Although lychee is not a globally popular fruit crop, it is one of the economically foremost fruit crops grown in eastern Asia, with >2000 years of recorded of cultivation history. It is the most agriculturally important crop in Sapindaceae, a huge family of flowering plants (including maple) that consists of 138 genera and ~2000 accepted species.
So far, in-depth genomics research on Sapindaceae species is sparse. Our thorough analyses of one lychee genome, diverse cultivars and wild populations will greatly broaden knowledge of the genomics of Sapindaceae species. Moreover, the highly heterozygous genome (2.27%) of 'Feizixiao' lychee enabled the assembly of two haplotypic genomes, which in fact represent two entirely separate wild lychee populations. We also demonstrated that extremely early and late-maturing cultivars were derived from these two populations via independent domestication events. Thus, we believe our work indeed provides valuable general insights into plant genomics.
The pop gen analysis also needs help Line 194 : the simple ratio of numbers of non-syn and syn polyorphims is likely affected by the different samples sizes of the three datsets compared here. The authors should calculated pi_n/pi_s, which allows to compare datasets with different sample sizes.
>>> Thank you for this suggestion; we have now carried out all the analyses using n, s and their ratio. To do so, we developed a pipeline for calculating them from whole genome sequencing data, since we found the existing software cumbersome and very slow for whole genome sequencing data, taking over one week for one chromosome and one set of individuals. The implementation of our pipeline is explained in the M&M and is made available in github (https://github.com/jsalojar/PiNSiR). We are using a combination of snpEff to predict functional impact of a position, ANGSD to calculate accurate estimates of site-wise n, s values under missing data, and we developed R scripts implementing parallel computation to calculate genome-wide averages. The pipeline also filters the genome for high-quality gene models. In its current state, assuming that the existing SnpEff and ANGSD files are present, the implementation calculates the n, s values within tens of minutes.
The revised analyses did not change the biological results. We can still see the effects of a cultivation bottleneck, but the tools help in assessing the effects more accurately, which is shown by the interesting results from non-synonymous diversity, where the cultivation bottleneck is clearly visible from purging of deleterious alleles.
Line 204 (and 212): Figure 2A is not a "phylogenetic tree". This is because meotic recombination affected the genealogical relationships between these accessions such that no single tree can explain the evolutionary history of the sample. The this tree as a graphical representation of genetic distances between the 72 resequenced accessions.
>>> Yes, this is correct. We have corrected the nomenclature and discuss genetic distances regarding the SNP tree instead.
Line 227 : it would be good to have an independent estimate of this split time done using another method (either msmc2, fastsimcoal2, dadi, or Relate). This would confer more confidence in this important result.
>>> Thank you for the recommendation. This suggestion made us look at the population histories in a very detailed manner. In order to increase the accuracy of our estimation we first estimated ancestral states for the nucleotides using longli, longan and rambutan sequencing data and ANGSD software. We then applied several demographic modeling approaches (pairwise sequentially Markovian coalescent, Stairway plot 2, and SMC++) to see if the results agree, as well as Fastsimcoal2 modeling suggested by the referee, using the estimated unfolded site frequency spectrum. In all cases the population divergence times showed surprisingly deep time scales on the order of millions of years. Upon close inspection of the Ne trajectories we identified a consistent "shift" towards modern times in the Yunnan population. Since the inbreeding coefficient was high among the Yunnan population, and heterozygosity is the main factor used to define coalescence, we speculated that the discrepancy was due to an extensive amount of inbreeding or selfing occurring in the Yunnan population. We simulated the effect of selfing -loss of heterozygosity due to Mendelian inheritance in the selfing individuals -using actual data from one individual from Hainan, and obtained population trajectories matching the effects seen in Yunnan lychee data (see the figure below extracted from our new supplementary note). Furthermore, since nucleotide diversity and mutation rate are linearly dependent, we corrected for the effect of inbreeding by adjusting the mutation rate of the individual. The adjustment matched the timing of the Ne trajectories, except that the overall effective population sizes were lower (corresponding to the drop of effective population size due to inbreeding).
With the actual data we noticed that adjusting the generation time of Yunnan to 3x the time in Hainan provided the best fit, corresponding to ~66% loss of heterozygosity due to inbreeding. This adjustment provided concordant trajectories for all demographic models. To avoid over-fitting, we carried out the estimation of inbreeding only with PSMC results and then observed the effect in Stairway plots and SMC++ results. See the plots below regarding differences for unadjusted (left) and adjusted (right) trajectories. The result and argumentation are summarized in our new supplementary note.
The SMC++ split time estimation does not allow different generation times since the 2d-site frequency spectrum is not properly estimated, so we therefore estimated the divergence time by overlaying the independently estimated trajectories. This gave us a divergence time of ~18,000 years (with considerable variation around the estimate due to uncertain generation times and mutation rates). This is much more realistic than the previously obtained date on the order of millions of years. Unfortunately, under this scenario, parameter estimation using Fastsimcoal simulations was not possible due to problems in estimating the 2-d site frequency spectrum. Regarding the models suggested by the referee, neither msmc2 nor Relate were applicable since they required phased genome data.
Line 244: Could it be that those differences also arise because of difference in levels of self-fertilization in natural populations?
>>> Yes. We assumed that selfing occurs because there's little opportunity for outbreeding due to low effective population size. It is of course also possible that the Yunnan populations have lower levels of self-incompatibility and therefore selfing is more common. We propose both scenarios in the revised manuscript at Lines 225-227. Figure 3A: how exactly are those kinship relationships calculated?
>>> The order of relationship between individuals was estimated using the KING-Robust algorithm (Manichaikul et al. 2010, "Robust relationship inference in the presence of population substructure"), which estimates a kinship coefficient that is claimed to be independent of sample composition or population structure by calculating an estimator of the kinship coefficient based on the difference between shared heterozygosity and shared homozygosity (see Eq. 9 in the original publication). We have added this explanation in the Methods part of the manuscript. >>> We focused on intergenic mutations to estimate the neutral mutations. Compared to synonymous mutations, which are present only in gene coding regions, intergenic mutations are much more abundant, do not depend on the quality of the predicted models, and therefore permit more accurate estimation of s.
L 316-318: Having more pronounced LD pattern in the cultivars versus the wild populations alone could simply result from repeated bottlenecks during domestication.
>>> Yes, this is absolutely true, and has been observed in other crop species as well. We have altered the text accordingly at Lines 317-318.
l. 320:On "time of cultivation"? Has this been done using archeological records?
>>> We apologize, there is no record of the timing of the cultivation event. This was not accurately expressed. The text has been corrected to say "we split the cultivars into different categories according to their relatedness and cultivar origins" l. 333: All arguments based on changes in proportion of "high-impact mutation" would benefit for re-checking for effect of different sample sizes.
>>> Thank you, we have now taken this into account by developing a novel pipeline for estimating n and s values from whole genome sequencing data, and share the pipeline through github. We have compared the estimated diversities to those computed for a large number of species by another group (Chen et al 2017) and the values as well as the n/s ratio agree with outcrossing plants in general. Reference: 1. I have some concerns about the genome assembly and haplotype phasing. It is unclear based on the results and methods how one haplotype was extracted from the assembly and used for Hi-C anchoring. Based on the Hi-C contact matrix and BUSCO score, it seems like HaploMerger2 successfully binned the genome into two haplotypes, but each bin of contigs likely represents a chimera of the HY and HH haplotypes stitched together. Based on the raw PacBio assembly size, both the HY and HH haplotypes were assembled for 'Feizixiao', so couldn't these be used as a basis for haplotype phasing? This would give the authors a better sense of copy number variations, TE polymorphisms, large-scale rearrangements, and other differences between the HY and HH haplotypes in 'Feizixiao' compared to using Illumina data alone. This may also improve alignment of reads for the resequencing data and analysis of differentially expressed alleles. If the within genome heterozygosity is truly ~2.2%, the HiC data should be sufficient for phasing HY and HH from the original assembly, especially if the authors use ALLHiC (which was created by several coauthors of this manuscript). There are very few phased haplotype genome assemblies for plants and additional analyses using an improved assembly would strengthen the manuscript. Details on haplotype phasing and verification are vague. The authors state HapCUT2 was used for phasing and 'eventually, 15 pairs of homologous chromosomes were obtained'. If the original haploid assembly was a chimera, wouldn't these chimeric regions be carried over in the resulting HY and HH haplotypes? I am unsure why aligned Illumina data was used for phasing when both alleles were already assembled in the raw PacBio contigs.
>>>Thank you for your comments and suggestions regarding the phasing of haplotype genomes. Indeed, as described above, we found assembly chimerism when using CANU and all PacBio reads for contig assembly. Therefore, we originally had phased haplotype specific SNPs identified from HiC data onto our original reference genome to obtain the two haplotypic genomes originally reported, but we lost a considerable amount of genomic information regarding structural variation, copy number variation, and large-scale rearrangements in the process.
In our revision, we tested ALLHiC as the reviewer suggested. Unfortunately it failed to work, as expected from our previous attempts. We then designed a new strategy to conduct the haplotype phasing. In brief, as described above, we first separated the raw PacBio long reads into two groups using haplotypic SNPs identified from HiC and 10X Genomics data, and then performed guided genome assembly separately for each of the reads group. In this way, two independently assembled haplotypic genomes were obtained. For detailed description of the process, please refer to our response to the editor's #1 priority above.
We also explored CNVs, SVs and rearrangements between the two newly assembled haplotype genomes. All the related results were provided in Supplemental table 20-22, and corresponding description was added in Line 368-377.
2. The results on maturation time are interesting, but the finding that CONSTANS genes contribute to maturity time is not well supported. It is not surprising that the authors found no statistically significant GWAS peaks for flowering time given the low sample size (62 based on the methods) and significant population structure between the wild and independently domesticated cultivars. The authors identified one flowering time related gene from the top 20 GWAS peaks, but this is not unexpected since the genome contains so many flowering genes. The heterozygous deletion in one haplotype is interesting, but because both alleles are expressed, it's difficult to say if this plays a role in flowering time variation. Additional evidence would help strengthen the claim that this gene/gene pair is involved in maturation.
>>> Thank you for this comment. We fully agree that further functional validation will be required to confirm the role of these CONSTANS-like genes in lychee flowering regulation. However, as a perennial woody fruit crop, it is currently intractable to perform functional characterization in lychee using a gene overexpression or knock-out/knock-down or similar strategy. So far, there is not even a stable transgenic system available, although we are working hard on developing one.
However, although >500 flowering-related genes were identified from the lychee genome, they were very rarely located close to the top 20 GWAS signals. In a ±15 kb spanning region, 27 annotated genes were found close to these 20 signals. Among them, only one gene (COL307) was a flowering-related gene. If we extended the spanning region to ±50 kb, COL307 was still one of the two flowering-related genes among all 109 associated genes. Two more flowering-related genes could be found if a window size of ±100 kb was chosen, including the COL305 gene. In short, four flowering-related genes are located in the ±100 kb regions spanning the top 20 GWAS signals. Among them, LITCHI023128 (TPS1_ARATH) is not a differentially expressed allelic gene. Although we cannot exclude the possibility that G2OX2 (LITCHI026618) is phenotypically important, we suggest that the two COL genes are clearly better candidate genes, especially given our finding of the association with the 3.7kb deletion. 3. The manuscript is quite long overall, and some sections contain extraneous text that distracts from the most meaningful findings. For instance, the conservation of a VRN1-like gene cluster across eudicots is interesting but does not yield meaningful results in lychee. The number of VRN1-like genes is similar in lychee and other genomes and many of these genes are expressed across diverse tissues, so it is unclear what role this gene cluster may play in flowering time regulation compared to other species.
>>>Thank you. While we do agree that the conservation of the VRN1-like gene cluster does not yield distinctive results for lychee per se, but this is an exceptional clade-specific finding which may have broad meaning to the plant community. First, VRN1 is a functionally essential gene in vernalization (temperature responses) as proven in Arabidopsis. Its highly conserved synteny among many eudicots implies that these syntenic VRN1 genes likely have similar essential (conserved) functions among eudicot plants. Second, as indicated previously, lychee is the most agriculturally important crop in Sapindaceae, a huge family of flowering plants (including maple) that consists of 138 genera and ~2000 accepted species. We have intended that our detailed study of the lychee genome would provide in-depth genomic relevance for Sapindaceae species in general. The specific expansion of the VRN1 gene cluster suggests that there are probably some Sapindaceae-specific traits regulated by this particular set of duplicated VRN1 genes. The diverse expression of these VRN1-like genes indeed implies more complexity in their function, which may be different from their orthologous counterpart, the VRN1 gene in Arabidopsis, and they may well have broader functions other than simply vernalization (temperature responses). Even if the main roles of this VRN1-like gene cluster are not fully associated with flowering, the process of tandem duplication and neo/subfunctionalization provide a novel and unique case to study the evolutionary characteristics of a functionally relevant gene family.
Minor: It is difficult to distinguish the wild from cultivated, maturity classes, and country of origin of the accessions in Figure 2A. I like the concept of using different fruit and leaf colors, but it is hard to interpret.
>>>Thank you, we have revised the figure by highlighting the different groups with different shade colors.
Lychee is a specialty fruit and I suspect many readers will be unfamiliar with the different cultivars of lychee and their distinguishing characteristics. It may be helpful to include pictures of representative cultivar groups for the wild, extremely early, early, and late-maturing groups in one of the figures.
>>>Thank you for this suggestion. We have added a few representative photos to show lychee fruit and its diversity in Fig. 1A and supplementary figure 1.

Decision Letter, first revision:
8th Apr 2021 Dear Professor Xia, Your Article, "Two divergent haplotypes from a highly heterozygous lychee genome point to independent domestication events for early and late-maturing cultivars" has now been seen by 3 referees. You will see from their comments below that while they find your work of interest, some important points are raised. We are interested in the possibility of publishing your study in Nature Genetics, but would like to consider your response to these concerns in the form of a revised manuscript before we make a final decision on publication.
As you will see from these comments, reviewer #1 has concerns regarding the number of genes in the reference genome and the two haplotypes. Reviewer #2 points out essential technical issues that need to be addressed. In addition, please try to improve the biological insights if possible.
We therefore invite you to revise your manuscript taking into account all reviewer and editor comments. Please highlight all changes in the manuscript text file. At this stage we will need you to upload a copy of the manuscript in MS Word .docx or similar editable format.
We are committed to providing a fair and constructive peer-review process. Do not hesitate to contact us if there are specific requests from the reviewers that you believe are technically impossible or unlikely to yield a meaningful outcome.
When revising your manuscript: *1) Include a "Response to referees" document detailing, point-by-point, how you addressed each referee comment. If no action was taken to address a point, you must provide a compelling argument. This response will be sent back to the referees along with the revised manuscript.
*2) If you have not done so already please begin to revise your manuscript so that it conforms to our Article format instructions, available <a href="http://www.nature.com/ng/authors/article_types/index.html">here</a>. Refer also to any guidelines provided in this letter.
*3) Include a revised version of any required Reporting Summary: https://www.nature.com/documents/nr-reporting-summary.pdf It will be available to referees (and, potentially, statisticians) to aid in their evaluation if the manuscript goes back for peer review. A revised checklist is essential for re-review of the paper.
Please be aware of our <a href="https://www.nature.com/nature-research/editorial-policies/imageintegrity">guidelines on digital image standards.</a> Decision Letter, second revision: Our ref: NG-A54825R1 25th Jun 2021 Dear Dr. Xia, Thank you for submitting your revised manuscript "Two divergent haplotypes from a highly heterozygous lychee genome point to independent domestication events for early and late-maturing cultivars" (NG-A54825R1). It has now been seen by the original referees. The reviewers find that the paper has improved in revision, and therefore we'll be happy in principle to publish it in Nature Genetics, pending minor revisions to satisfy the referees' final requests and to comply with our editorial and formatting guidelines.
If the current version of your manuscript is in a PDF format, please email us a copy of the file in an editable format (Microsoft Word or LaTex)--we can not proceed with PDFs at this stage.
We are now performing detailed checks on your paper and will send you a checklist detailing our editorial and formatting requirements in about a week. Please do not upload the final materials and make any revisions until you receive this additional information from us. I am delighted to say that your manuscript "Two divergent haplotypes from a highly heterozygous lychee genome suggest independent domestication events for early and late-maturing cultivars" has