Genome sequences of horticultural plants: past, present, and future

Horticultural plants play various and critical roles for humans by providing fruits, vegetables, materials for beverages, and herbal medicines and by acting as ornamentals. They have also shaped human art, culture, and environments and thereby have influenced the lifestyles of humans. With the advent of sequencing technologies, there has been a dramatic increase in the number of sequenced genomes of horticultural plant species in the past decade. The genomes of horticultural plants are highly diverse and complex, often with a high degree of heterozygosity and a high ploidy due to their long and complex history of evolution and domestication. Here we summarize the advances in the genome sequencing of horticultural plants, the reconstruction of pan-genomes, and the development of horticultural genome databases. We also discuss past, present, and future studies related to genome sequencing, data storage, data quality, data sharing, and data visualization to provide practical guidance for genomic studies of horticultural plants. Finally, we propose a horticultural plant genome project as well as the roadmap and technical details toward three goals of the project.


Introduction
Horticultural plants mostly comprise vegetable-producing, fruit-bearing, ornamental, and beverage-producing plants and herbal medicinal plants. These plants play important economic and social roles in human life and health by providing basic food needs, beautifying urban and rural landscapes, and improving personal esthetics. For example, the Food and Agriculture Organization of the United Nations reported that, while worldwide cereal food together is valued at 125 points (normalized value), vegetables and fruits together are valued at 137 points (http://faostat.fao.org). Horticultural plants also contribute to ecological balance, improving our biological environment by providing oxygen and moderating urban temperatures.
Horticultural plants are distributed across a wide taxonomic range, which includes a large number of flowering plants and a few early-divergent land plants. The sizes of their genomes vary greatly. For example, the vegetable garlic (Allium sativum) has a diploid genome (2n = 16) with an estimated genome size of >30 Gb 1 , and onion (Allium cepa) has a similar genome size 2 . In addition, most horticultural plants are domesticated, and their genome sequences have experienced strong artificial selection. For example, grape was found to have been cultivated (via viticulture) for >6000 years 3 ; citrus, >4000 years 4 . Moreover, some horticultural plants are intermediate between domesticated and wild plants, such as medicinal plants including ginseng (Panax ginseng), noto ginseng (Panax notoginseng), and Artemisia (Artemisia annua). Many domesticated horticultural plants have high levels of genetic diversity and heterozygosity, such as sunflower (10% of bases differ between homologous chromosomes) 5 , grape (7%) 6 , and potato (4.8%) 7 .

De novo sequencing of horticultural plant genomes
As of December 31, 2018, the genomes of 181 horticultural species have been sequenced (Table 1). These include 4 beverage, 47 fruit, 44 medicinal, 44 ornamental, and 42 vegetable plants (Fig. 1a). In terms of taxonomic distribution, these plants include 175 angiosperms, 2 gymnosperms, 3 lycophytes, and 1 moss (Fig. 1b). As shown in Fig. 1c, the number of sequenced genomes of horticultural plants completed each year has significantly increased from 1 in 2007 to 40 in 2018. Although most of the horticultural plants are angiosperms, the genome sequencing of non-angiosperm species has also demonstrated steady growth (Fig. 1c). Vegetables and fruits have been a focus of plant research in the past few years. However, only two vegetables and seven fruits had their genomes sequenced in 2018 (Fig. 1d). This is probably because many economically important vegetables and fruits were already sequenced prior to 2018.
Some angiosperms have a significant role in the economy 8 . The 181 horticultural plants with sequenced genomes are distributed in 30 of the 64 angiosperm orders. Among these 30 orders, 7 (Fabales, Rosales, Cucurbitales, Brassicales, Sapindales, Solanales, and Lamiales) have >10 species whose genomes have been sequenced (Fig. 1e), suggesting their vital importance to humans.
Determining the genomic basis of legume-rhizobium interactions would help not only to solve a classic fundamental problem in biology but also to improve nitrogen utilization in horticultural plants.
The Brassicaceae family is a medium-sized family with 4000 species, including many horticultural plant species. The Brassicaceae vegetable plants with sequenced genomes include Zhacai (Brassica juncea) 54 , cabbage (Brassica oleracea) 55 , napa cabbage (Brassica rapa) 56 , Capsella (Capsella bursa-pastoris and Capsella rubella) 57,58 , radish (Raphanus sativus) 59 , and field pennycress (Thlaspi
The Cucurbitaceae family includes >3700 species belonging to 134 genera (www.theplantlist.org). Within this family, the genome-decoded vegetable plants include silver-seed gourd (Cucurbita argyrosperma) 66 , winter squash (Cucurbita maxima) 67 , pumpkin (Cucurbita moschata) 67 , summer squash (Cucurbita pepo) 68 , bottle gourd (Lagenaria siceraria) 69 , and bitter melon (Momordica charantia) 70 . The genome-decoded fruit species include muskmelon (Cucumis melo) 71 and watermelon (Citrullus lanatus) 72 . The only genome-decoded medicinal plant is monk fruit (Siraitia grosvenorii) 73,74 . Via analysis of these available genome sequences, it was found that a tetraploid-inducing event occurred in the last common ancestor of the Cucurbitaceae species 75 . These genome sequences can also help to better understand the domestication history 76 and fruit development 77 . Increasing numbers of the wild relatives of these economically important crop species, as well as those of thousands of plant cultivars, will be sequenced in the near future, providing additional details and surprises.

Genome resequencing and the pan-genome of horticultural plants
A single reference genome sequence is not sufficient for identifying the best candidate genes for molecular breeding or for understanding the genomic background of a population, because genomic structural variations are prevalent. Compared to the construction of a reference genome, genome resequencing usually requires less sequencing coverage. It is feasible to obtain a high-quality resequenced genome via mapping to a reference genome. A pan-genome is the summary of the genomes of a species, obtained by comparing a large number of resequenced genomes of a species or, occasionally, a genus. A pan-genome can help to understand the size of the core genome (defined as the part conserved among the related genomes), the size of the pan-genome, and the amount and nature of variations within a species or a genus, which improves our understanding of the evolution of a species/genus, as well as of agronomic traits. Currently, a growing number of pan-genomes of horticultural plants have been constructed (Table 2).
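The definitions of core genome and pan-genome given above can be illustrated with simple set operations. The following is only a minimal sketch: the accession names and gene identifiers are hypothetical toy data, and real pan-genome pipelines work from assemblies or read mappings rather than from ready-made gene lists.

```python
from typing import Dict, Set, Tuple


def pan_and_core(gene_sets: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (pan_genome, core_genome) for a mapping of accession -> genes present."""
    sets = list(gene_sets.values())
    if not sets:
        return set(), set()
    pan = set().union(*sets)         # every gene seen in at least one accession
    core = set.intersection(*sets)   # genes present in all accessions
    return pan, core


# Hypothetical toy data: three accessions with partly overlapping gene content.
accessions = {
    "acc1": {"g1", "g2", "g3", "g4"},
    "acc2": {"g1", "g2", "g3", "g5"},
    "acc3": {"g1", "g2", "g6"},
}

pan, core = pan_and_core(accessions)
dispensable = pan - core  # genes missing from at least one accession

print(f"pan-genome size:   {len(pan)}")
print(f"core genome size:  {len(core)}")
print(f"dispensable genes: {sorted(dispensable)}")
```

In this toy case the pan-genome contains six genes, of which two form the core genome and four are variable.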
Soybean is an economically important vegetable crop; in addition to being a source of protein for humans, it is an important source of vegetable oil. Glycine soja is the closest wild relative to cultivated soybean (Glycine max). The G. soja pan-genome, which was released in 2014 and consisted of seven wild accessions 85 (Table 2), was the first horticultural pan-genome. This pan-genome revealed that, as more genomes were added, the number of shared genes decreased, whereas the total number of genes increased. In addition, this pan-genome confirmed that a single reference genome does not adequately represent the genomic and genetic diversity of a species. Because the reference genome of G. soja was not previously available, those researchers assembled all seven genomes with the de novo assembly method, but this method was not adopted by subsequent researchers. Assembly of the B. oleracea pan-genome 86 is another early trial in the genomic research of horticultural plants (Table 2). It is relatively small, created using nine morphologically diverse varieties (covering two cabbage, one broccoli, one brussels sprout, one kohlrabi, two cauliflowers, and one kale plant) and a wild relative, Brassica macrocarpa. Through the analyses of this pan-genome, we observed that 20% of genes are absent in some cultivar(s), and there are presence-absence variations (PAVs), including those related to major agronomic traits. This is a pioneering study that provided assembled pan-genome contigs, pan-genome annotations, and the GBrowse tool, available at http://brassicagenome.net.
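The trend reported for the G. soja pan-genome (fewer shared genes and more total genes as accessions are added) and the share of genes affected by PAVs can both be derived from per-accession gene lists. The sketch below assumes hypothetical toy data and a fixed order of accessions; published analyses typically average over many random orderings, which is not done here.

```python
from typing import List, Set, Tuple


def accumulation_curve(gene_sets: List[Set[str]]) -> List[Tuple[int, int, int]]:
    """For each prefix of the accession list, return (n_accessions, pan_size, core_size)."""
    curve = []
    pan: Set[str] = set()
    core: Set[str] = set()
    for i, genes in enumerate(gene_sets, start=1):
        pan |= genes                                     # pan-genome can only grow
        core = set(genes) if i == 1 else core & genes    # core can only shrink
        curve.append((i, len(pan), len(core)))
    return curve


def variable_fraction(gene_sets: List[Set[str]]) -> float:
    """Fraction of pan-genome genes that are absent from at least one accession."""
    pan = set().union(*gene_sets)
    core = set.intersection(*gene_sets)
    return (len(pan) - len(core)) / len(pan)


# Hypothetical toy data: gene content of three accessions.
toy_accessions = [
    {"a", "b", "c", "d", "e"},
    {"a", "b", "c", "d", "f"},
    {"a", "b", "c", "g"},
]

for n, pan_size, core_size in accumulation_curve(toy_accessions):
    print(f"{n} accession(s): pan = {pan_size}, core = {core_size}")

print(f"variable fraction: {variable_fraction(toy_accessions):.2f}")
```

With these toy sets the pan-genome grows from 5 to 7 genes while the core shrinks from 5 to 3, mirroring the direction of the trend described above.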
Pepper plants are important vegetable plants with distinct fruit morphologies. A pan-genome has been generated for the pepper genus Capsicum 87 . This pan-genome consists of 5 species and 383 cultivars, all of which have 12 chromosomes. In addition to enabling the comparison of PAVs among this large number of pepper cultivars, the pan-genome is useful for linking important agronomic traits to the corresponding genes. These valuable pan-genome data, together with JBrowse and other search tools, are available (www.pepperpan.org:8012).
Sunflower plants provide seed that can be used for cooking oil and serve as popular ornamentals. The sunflower pan-genome was created by sequencing 493 accessions, including cultivars, landraces, and wild relatives 5 . A total of 61,205 genes have been identified within the gene set of the sunflower pan-genome. With the aid of this pan-genome, the understanding of the evolutionary history of sunflower species has significantly improved, and genes linked to biotic stress resistance have been identified 5 . Although pan-genome data can be found in the sunflower genome database (www.sunflowergenome.org), no publicly accessible tool has been built to date (accessed March 31, 2019).
Reference genome sequences are necessary to identify genes and to understand evolutionary trajectory. However, a pan-genome can help to uncover additional details. For example, relying on the tomato genome sequence, researchers mapped only a limited number of genes and pathways controlling fruit ripening 28 . These flesh- and flavor-related genes are the best targets in breeding. Moreover, genome sequences allow comprehensive and systematic analyses of fruit biology. Furthermore, via the sequencing of a tomato population and analysis of its pan-genome consisting of 725 accessions, the genes selected during domestication and quality improvement were identified 88 . Thus a pan-genome not only improves our understanding of crop evolution but also is useful for the discovery of novel genes and for breeding.

Data storage and visualization
In addition to comprehensive plant-centric databases such as Phytozome (https://phytozome.jgi.doe.gov) and EnsemblPlants (http://plants.ensembl.org), 27 horticultural plant-specific genome databases have been constructed (Table 3). Among these, 22 provide data for downloading. Some databases are freely accessible to all users, while others provide only limited access to specific data or users. For example, the Genome Database for Rosaceae 89 requires user registration and a login to access the breeding data.
Visualization of genomic data of horticultural plants is challenging due to the heterogeneous nature of the different types of data. GBrowse 90 and JBrowse 91-93 are powerful tools that provide a visualization of various levels of genomic features. The availability of genomic analysis tools also varies greatly among databases. BLAST-related tools such as NCBI-BLAST 94 and viro-BLAST 95 are provided by some databases for homologous sequence searches and sequence comparisons. Gene query tools can help to obtain details of genes such as their sequence, annotation, and expression. HMMER 96 searches allow the inference and extraction of gene families from genomes in the database. Syntenic tools allow the identification and visualization of genome-wide syntenic relationships across genomes. The BioCyc tools (https://biocyc.org) allow users to navigate individual pathways or the whole metabolic map of a genome for functional analyses 97 .
The Genome Database for Rosaceae (GDR), which was developed by the main bioinformatics laboratory at Washington State University 89 99 , Coffee Interactome Data, and the SGN Ontology Browser are provided. The Breeders Toolbox was developed for breeders. The same team also developed a series of horticultural plant-themed databases, including the YamBase (https://yambase.org), CassavaBase (https://cassavabase.org), and MusaBase (https://musabase.org) databases. All these databases adhere to the release of genomic data before publication (the Toronto Agreement) 100 .
The Brassica Database (BRAD) 102 , a database of important Brassica species, covers the vegetable species Brassica rapa and B. oleracea, as well as the model plant Arabidopsis and close relatives in the Brassicaceae. In addition to its genomic data, the BRAD database hosts a curated list of genes involved with anthocyanins, resistance, auxin, flowering, and glucosinolates and a full list of gene families that are of considerable importance in Brassica research. BLAST and JBrowse tools were built for visualization of genomic data, and syntenic tools are useful for comparative analyses.
The Herbal Medicine Omics Database 103 includes genomic, transcriptomic, pathway, and metabolomic data for medicinal plants, although the medicinal properties of some plants are recognized only in some parts of the world. In this database, hundreds of medicinal plants are included. However, the database currently provides only the BLAST and GBrowse tools for the visualization of omics data. Other collected omic data can be downloaded but cannot be analyzed or visualized online.
There are other tool-specific databases that can be very useful for the visualization and online analyses of horticultural plant genome sequences. The Plant Genome Duplication Database (PGDD) 104

Discussions and future perspectives
The horticultural plant genome project
It is challenging to determine the exact number of species or cultivars that exist for horticultural plants. In terms of fruit-bearing plants, at least 91 species are economically important and produce fruit that are consumed (https://simple.wikipedia.org/wiki/List_of_fruits). More than 200 vegetable plants are consumed (https://simple.wikipedia.org/wiki/List_of_vegetables). The exact number of ornamentals is also unclear, as novel cultivars are produced each year. However, it has been estimated that there are >6000 ornamental cultivars (https://www.rhs.org.uk/plants/pdfs/agm-lists/agm-ornamentals-(1).pdf), and many cultivars are created and disappear each year. Up to December 2018, genome sequences had been decoded for only 181 species, accounting for only a small proportion of the total horticultural plant species. Hence, there is a strong need to sequence additional genomes for more horticultural plants that would be valuable for comparative genomics, to better understand their evolutionary history, and to possibly make genetic modifications to better utilize these plant species.
Here we propose a horticultural plant genome project (HPGP) with three goals (Fig. 2). The first goal of the HPGP is to generate reference genome sequences for all horticultural plants, after which pan-genomes and core collections would be generated as genetic banks for horticultural plants. Two recently developed genome assembly methods could be applied to decode high-ploidy 71 and highly heterozygous 106-108 horticultural genomes. The second goal is to identify the various genomic variations within a pan-genome. In addition, the mechanistic signatures leading to the variations would be explored. The third goal is to link the phenotypes to the genomic regions. Two methods would be applied: quantitative trait locus methods to correlate genomic variations with a quantitative trait and genome-wide association study methods to associate phenotypic variation with many genomic variations from different individuals 109,110 . The good news is that the Earth Genome Project and the 1000-Plant Genome Project will accelerate the genome sequencing process of horticultural plants.
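The idea behind the third goal, relating a quantitative trait to genomic variants across individuals, can be sketched as a marker-by-marker regression. This is an illustrative simplification, not the method of any cited study: the marker names, the 0/1/2 genotype coding, and the phenotype values are hypothetical, and real QTL or GWAS analyses additionally model population structure and kinship and correct for multiple testing.

```python
from math import sqrt
from typing import Dict, List, Tuple


def marker_association(genotypes: List[int], phenotypes: List[float]) -> Tuple[float, float]:
    """Least-squares slope and Pearson r of phenotype on genotype (allele count 0/1/2)."""
    n = len(genotypes)
    if n != len(phenotypes) or n < 3:
        raise ValueError("need matching genotype/phenotype lists with at least 3 individuals")
    mean_g = sum(genotypes) / n
    mean_p = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((p - mean_p) ** 2 for p in phenotypes)
    sxy = sum((g - mean_g) * (p - mean_p) for g, p in zip(genotypes, phenotypes))
    if sxx == 0 or syy == 0:
        return 0.0, 0.0  # monomorphic marker or constant phenotype: nothing to test
    return sxy / sxx, sxy / sqrt(sxx * syy)


def scan_markers(markers: Dict[str, List[int]], phenotypes: List[float]):
    """Test every marker and rank by absolute correlation, strongest first."""
    results = {name: marker_association(g, phenotypes) for name, g in markers.items()}
    return sorted(results.items(), key=lambda item: abs(item[1][1]), reverse=True)


# Hypothetical toy data: six individuals, two markers, one quantitative trait.
trait = [10.0, 12.0, 14.0, 10.5, 12.5, 13.5]
toy_markers = {
    "snp_a": [0, 1, 2, 0, 1, 2],  # allele dosage tracks the trait
    "snp_b": [0, 2, 1, 2, 0, 1],  # allele dosage unrelated to the trait
}

ranked = scan_markers(toy_markers, trait)
for name, (slope, r) in ranked:
    print(f"{name}: effect per allele = {slope:.2f}, r = {r:.2f}")
```

In the toy data the first marker ranks highest because each additional allele copy shifts the trait by a roughly constant amount, whereas the second marker shows no association.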
The timeline for obtaining the genome sequences of all horticultural plants at both draft and reference scales (goal one of the HPGP) would be short (within 3-5 years) because the cost of sequencing is dropping rapidly. However, collecting and sequencing populations would require worldwide collaboration and would take >10 years. The second goal is to analyze the genomic variations to identify the mechanistic signatures within a population, which is also time consuming and would be achieved gradually. The third goal is an advanced step that occurs after or concurrently with the second goal. Although these last two goals appear to be enormous challenges, we are confident in the ability to achieve most of these two goals in model horticultural plants such as the tomato, cucumber, and strawberry in the coming years.
Fig. 2 The proposed roadmap to the horticultural plant genome project (HPGP). The first goal of HPGP is to generate all reference genome sequences for horticultural plants, after which pan-genomes and core collections will be generated as a gene bank for horticultural plants. Two recently developed methods would be applied to decode the high-ploidy and highly heterozygous horticultural genomes. The second goal is to detect the various genomic variations within a pan-genome. In addition, the mechanistic signatures leading to the variations would be explored. The third goal is to link the phenotypes with the genomic regions. Two methods would be applied: the quantitative trait locus (QTL) method to correlate genomic variations with a quantitative trait and the genome-wide association study (GWAS) method to associate phenotypic variation with many genomic variations from different individuals
In addition, the quality of the assembly and annotation of existing reference genomes of horticultural plants needs to be further improved. Although a few tools such as BUSCO 111 and CEGMA 112 have been widely used to evaluate the quality of genome annotations, a good standard is still not available for the systematic evaluation of the quality of genome assemblies. As a result, the quality of the genome assemblies is very uneven and is sometimes related to the complexity or heterozygosity of the taxa. This situation is changing as sequencing platforms are being upgraded. For example, the first apple genome sequence was released in 2010 based on next-generation sequencing (NGS) technology 15 , and an improved version produced by NGS and PacBio technologies was released in 2016 113 . The third improved version of the apple genome, which was obtained using a combination of NGS, PacBio, and Bionano technologies, was released in 2017 114 . The fourth improved version was released in 2019, based on the utilization of NGS, PacBio, and Hi-C technologies 27 . In the future, the quality of the reference genome should reach certain minimal standards upon which the community can agree, similar to the proposal for bacteria and archaea 115 , thereby leading to more accurate pan-genome analyses and biotechnology.
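As an illustration of how assembly quality can be summarized numerically, the sketch below computes N50, a contiguity statistic that is commonly reported for genome assemblies. N50 is not named in the text above and is introduced here only as an example; it measures contiguity, not completeness or correctness, so it is not a substitute for the community standard called for here. The contig lengths are hypothetical.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length L such that contigs of length >= L together cover at least half the assembly."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:  # reached half of the total assembly length
            return length
    raise AssertionError("unreachable for non-empty input")


# Hypothetical contig lengths (in kb) for two assemblies of the same total size.
fragmented = [100, 80, 60, 40, 20]
contiguous = [250, 30, 20]

print(f"fragmented assembly N50: {n50(fragmented)} kb")
print(f"contiguous assembly N50: {n50(contiguous)} kb")
```

Both toy assemblies total 300 kb, but the second has a higher N50 because most of its sequence sits in one long contig.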
Storage and access of genomic data constitute another problem concerning horticultural biologists and bioinformatics scientists. For access to genome sequences and raw sequencing data, a number of public databases are usually the first choice of researchers because of their stability, low cost, and ease of access. The well-known public databases include the NCBI (https://ncbi.nlm.nih.gov), EMBL (www.embl.org), CNGB (www.cngb.org), BIGD (bigd.big.ac.cn), DDBJ (www.ddbj.nig.ac.jp), GigaDB (gigadb.org), Dryad (www.datadryad.org), and Phytozome (https://phytozome.jgi.doe.gov) databases. To share these data with worldwide researchers, we encourage the release of data before publication, as was suggested by the Toronto Agreement in 2009 100 .

The need for a horticultural plant-centric database
Unlike agricultural plants, horticultural plants share several distinctive features. For example, their growth requires controlled conditions with specific equipment or facilities; they generally need grafting and postharvest treatment and have a long juvenile phase; and they usually undergo asexual reproduction and have unique specialized metabolism. All of these features make it hard to study the corresponding traits in model plants or with regular tools. The integration of the various omic data and the development of novel tools for horticultural plants are therefore needed. Moreover, aside from the comprehensive plant databases and the 27 horticultural plant-specific databases mentioned above, there is still an increasing need to find and compare an increased amount of data for horticultural plants. In addition, horticultural biologists frequently need to work with breeders; thus a comprehensive horticultural database that meets the interests of both basic biologists and breeders is required. Such a database should cover as many horticultural plant genomes as possible and should provide an integrated set of bioinformatics tools. We believe that, in the future, such a comprehensive database of all horticultural plants will serve additional horticulture researchers and breeders.
Given the advancement of sequencing technologies and reduced costs, the genome sequencing data of horticultural plants are accumulating rapidly. The storage, analyses, and sharing of large collections of genome sequencing data are becoming even more laborious and time consuming. The integrative analysis of various omic data, such as genomic, transcriptomic, metabolomic, phenomic, and breeding data, has become a major challenge for many horticultural biologists and requires coordinated efforts of scientists from different fields. For data processing and visualization, we recommend using BioMart tools, which could be easily built into a database. For database construction, we suggest following the template of the Tripal series (www.tripal.info) 8 . Finally, we believe that, with a fostered collaboration of the horticultural community, the HPGP and subsequent knowledge and experiences will greatly benefit biology researchers and breeders.