The rubber tree genome reveals new insights into rubber production and species adaptation

The Para rubber tree (Hevea brasiliensis) is an economically important tropical tree species that produces natural rubber, an essential industrial raw material. Here we present a high-quality genome assembly of this species (1.37 Gb, scaffold N50 = 1.28 Mb) that covers 93.8% of the genome (1.47 Gb) and harbours 43,792 predicted protein-coding genes. A striking expansion of the REF/SRPP (rubber elongation factor/small rubber particle protein) gene family and its divergence into several laticifer-specific isoforms seem crucial for rubber biosynthesis. The REF/SRPP family has isoforms with sizes similar to or larger than SRPP1 (204 amino acids) in 17 other plants examined, but no isoforms with similar sizes to REF1 (138 amino acids), the predominant molecular variant. A pivotal point in Hevea evolution was the emergence of REF1, which is located on the surface of large rubber particles that account for 93% of rubber in the latex (despite constituting only 6% of total rubber particles, large and small). The stringent control of ethylene synthesis under active ethylene signalling and response in laticifers resolves a longstanding mystery of ethylene stimulation in rubber production. Our study, which includes the re-sequencing of five other Hevea cultivars and extensive RNA-seq data, provides a valuable resource for functional genomics and tools for breeding elite Hevea cultivars.

T he rubber tree (Hevea brasiliensis, hereafter referred to as Hevea) is a member of the spurge family (Euphorbiaceae), along with several other economically important species such as cassava (Manihot esculenta) and the castor oil plant (Ricinus communis). Natural rubber (cis-1, 4-polyisoprene) makes up about one-third of the volume of latex that is essentially cytoplasm of the articulated laticifers in Hevea. The latex is extracted by tapping the bark, a non-destructive method of harvesting that facilitates continual production. As an industrial commodity, natural rubber is an elastomer with physical and chemical properties that cannot be fully matched by synthetic rubber 1 . In contrast to synthetics, the production of natural rubber is sustainable and environment friendly 2 . The commercial cultivation of Hevea, a native to the Amazon Basin, began in 1896 on a plantation scale in Malaya (now Malaysia) and expanded to other Southeast Asian countries that lead in world natural rubber production today 3 . Decades of selective breeding have resulted in a gradual improvement in rubber productivity, from 650 kg ha -1 derived from unselected seedlings during the 1920s to 2,500 kg ha -1 yielded by elite cultivars by the 1990s 4 . Nevertheless, the field production achieved so far is still well below the theoretical yield of 7,000-12,000 kg ha -1 , as has been suggested for the rubber tree 5 . Meanwhile, conventional rubber breeding has been stagnating in the introduction of high-yield cultivars. The reasons include a narrow genetic basis for exploiting breeding potential and difficulty in introducing wild germplasms because of the genetic burden in removing unfavourable alleles 6 . The incorporation of marker-assisted selection and transgenic techniques offers promise to improve breeding efficiency for latex yield, and sequencing of the Hevea genome would uncover even more avenues leading to this end.
The first draft Hevea genome was released by a Malaysian team 7 that was participant to the recent boom in transcriptomic and proteomic studies of the species [8][9][10][11] . However, its low sequence coverage (∼13×) and a lack of large insert libraries (such as fosmid-or BAC-based clone libraries) have limited the success of genome assembly (a scaffold N50 size of 2,972 bp), precluding its application for furthering quality research in the field.
Here, we report a high-quality genome assembly of Hevea Reyan7-33-97, an elite cultivar widely planted in China 12,13 based on sequence data from both whole-genome shotgun (WGS) and pooled BAC clones. This assembly contains 7,453 scaffolds (N50 = 1.28 Mb), has a length of 1.37 Gb and covers ∼94% of the predicted genome size (1.46 Gb). Together with analysis of data from re-sequencing five other cultivars and comprehensive transcriptomic surveys, we aim to obtain new insights into the physiology of laticifers and molecular details of rubber biosynthesis, especially in relation to ethylene-stimulated rubber production.

Results
Genome assembly and annotation. We assembled the Hevea genome (cultivar Reyan7-33-97) based on 138 Gb (94× genome coverage) processed shotgun sequences (Illumina GA2 and  Hiseq 2000; Supplementary Table 1). We also generated pairedend reads for five other Hevea cultivars (PR107, Reyan8-79, RRIM600, Wenchang11 and Yunyan77-4) with high-coverage of 38× to 59× (Supplementary Fig. 1 and Supplementary Table 2). Based on a 17-mer sequence library, the estimated genome size of Reyan7-33-97 is 1. 46 Gb, whereas that of the other five cultivars ranges from 1.41 Gb to 1.55 Gb (Supplementary Fig. 2). To assist with scaffold construction, we generated an additional 55-Gb high-quality mate-pair data (insert sizes from 800 to 10,000 bp; 682× physical coverage; Supplementary Table 1 and Supplementary  Fig. 3), and obtained a preliminary assembly with a scaffold N50 value of 55 kb. We further improved the assembly by adding pooled BAC sequences (Supplementary Fig. 4 and Supplementary Note 1) from 47,616 BAC clones (mean insert size = 135 kb) (Supplementary Table 3 and Supplementary Fig. 5) that generated pseudo-mate-pair reads of 10 to 200 kb insert lengths to assist scaffolding ( Supplementary Fig. 6). This yielded a 1.37-Gb genome assembly that covers 93.8% of the estimated genome size, and contains 7,453 scaffolds (N50 = 1.28 Mb) and 84,285 contigs (N50 = 30.6 kb) ( Table 1).
To validate quality of the assembly, we first aligned all Hevea DNA, expressed sequence tag (EST) and protein sequences available in the public domain to show mapping rates of 91.5, 97.9 and 100%, respectively ( Supplementary Figs 7 and 8). Of the 1.11-Gb Hevea genome assembled by the Malaysian group 7 , only 25.2 Mb failed to be aligned to our assembly. The unalignable portions may reflect real sequence differences between cultivars, RRIM600 used by the Malaysian team and Reyan7-33-97 in the present study, as also reflected in the genome size difference ( Supplementary Fig. 2). Next, we aligned 18 BAC scaffolds with low repetitive sequences to our genome assembly, and yielded alignments from 91.69% to 99.74% ( Supplementary Fig. 9). Finally, the transcripts we assembled also showed excellent alignment to the genome; of 84,241 transcripts, 98.32% were mapped (transcript coverage > 80% and identity > 99%; Supplementary Table 4 and Supplementary Note 4).
To assist genome annotation and gene expression analysis, we combined three methods: ab initio, 84,241 unique transcripts (Supplementary Table 5 Table 7).
Repeat-driven genome expansion. We identified 71% of the genome length as repeats (Table 1 and Supplementary Table 8). Among the five species of Malpighiales (Hevea, Manihot, Ricinus, Populus and Linum), Hevea not only has the largest genome but also the greatest repeat content (Supplementary Table 9). Of the repeat sequences, two types of long-terminal repeat retrotransposons (LTR-RTs), Gypsy (691,209 in number) and Copia (114,165), are most abundant, and they total more than half (∼0.7 Gb; 50.9%) of the genome assembly, substantially higher than that of the other four Malpighiales species. Among the five Malpighiales species, Hevea harbours the largest set (35,079) of LTR-RTs (Supplementary Table 10), and both Gypsy and Copia show a peak substitution rate around 0.05, suggesting a burst of LTR-RT insertion at that point (of time) ( Supplementary  Fig. 12), which occurred five million years (Myr) ago as calculated (see Methods). Similar to Hevea, three other Malpighiales species (Manihot, Ricinus and Populus, with Linum being the exception) have more Gypsy than Copia, and share nearly identical substitution rates for the two elements ( Supplementary Fig. 12).
Genome duplication, collinearity and phylogeny. We explored gene-based collinearity and calculated synonymous substitution rate (K s ) as well as fourfold synonymous third-codon transversion (4DTv) rate among paralogues and orthologues. With the exception of Ricinus, all the other four Malpighiales species display an obvious two-peak pattern for the distributions of Ks/4DTv, with the left sharp peak representing the recent species-specific duplication and the right smooth one representing the eurosid duplication (Fig. 1b). The most ancient peak, representing γ duplication, is rather obscure and mingled with the right peak. A recent phylogenomic analyses inferred the clades Euphorbiaceae (Hevea, Manihot), Salicaceae (Populus) and Linaceae (Linum) to have emerged 89.9, 58.0 and 39.5 Myr ago, respectively 14 . Hence, Linum is the youngest of the five species, as also reflected by its lower K s and 4DTv peak values (Fig. 1b). If the diversification time is assumed to a maximum of 89.9 million years, according to the K s value of the second peak, the Hevea synonymous substitution constant is estimated as 7.5 × 10 -9 per site per year. Thus, the recent Hevea-specific duplication might have occurred in ∼15.3 Myr ago, nearly 10 million years earlier than the burst of Hevea LTR insertion. However, we are still not sure whether the two duplication events in Hevea are whole genome duplication because only 30.0% and 5.8% genes are covered by the first and second peaks, respectively. In comparison, the relative rates in Manihot, Populus and Linum are much higher; for example, 92.8% and 56.8% genes are attributable to the two peaks of Populus (Supplementary Table 12).
Plotting collinear regions of Hevea with itself and five other species reveals obvious conserved gene clusters ( Supplementary  Fig. 13). The absence of collinearity with Ricinus is unexpected, which might indicate a lack of species-specific duplication in Ricinus as revealed by its K s /4DTv distribution pattern (Fig. 1b). Among Hevea, Manihot and Populus ( Supplementary Fig. 14), the collinearity size of Hevea is significantly larger than that of To analyse the gene gain-loss event, we used 72 single-copy gene families to construct a maximum likelihood phylogeny between Hevea and 12 other species (Fig. 1d), representing a typical Rosid clade sequenced so far in Eudicots. Taxonomically, Hevea, Manihot, Ricinus, Populus and Linum belong to the Celastrales-Oxalidales-Malpighiales (COM). The phylogenetic placement of the COM clade remains controversial in angiosperms. Here, the COM clade was placed in Malvidae, consistent with a recent study using single-and multicopy nuclear loci 15 . Among the 13 species examined, Hevea and rice are representative of the most and fewest gene families, respectively. In most species, the gene family gain is less than gene loss; the only exceptions are in rice and soybean. The divergence time for the Euphorbiaceae family is estimated to be 57.7 Myr ago (Fig. 1d)    reported dating of 89.9 Myr ago 14 , possibly because of the limited number of Euphorbiaceae species used here.
Low genetic diversity regions or 'SNP deserts' often indicate selective sweeps 16 and relate to domestication in many major crop species 17,18 . All re-sequenced Hevea cultivars exhibit a unimodal pattern in the distribution of SNPs with a single low-SNP peak at <1 SNP per kilobase (Fig. 2b), contrasting with bimodal curves observed in rice and date palm, where the low-SNP peak reflects trait selection in cultivation 17,19 . The SNP deserts account for 42% of the Hevea genome (Supplementary Table 14), a value significantly higher than that reported in rice (8%) 17 or date palm (18%) 19 . Compared with the genome average, the SNP desert has a higher K a /K s (nonsynonymous/synonymous substitution) ratio (1.72 versus 1.27) (Supplementary Table 14). The low-SNP peak in various cultivars might be attributed to natural selection; artificial selection remains a possibility, although to a lesser extent due to a short domestication history of Hevea (since the late nineteenth century).
About half of SNP deserts (13,127 blocks, 261,150 kb) are conserved in all six cultivars (Supplementary Table 15), defined as shared or core SNP deserts representing most sequence signatures left mainly by natural selection after Hevea speciation. There are 3,820 genes (8.7% of the total genes) located in the core SNP desert (Supplementary Table 16) that are most enriched in respiratory electron transport chain and defence response functions (Fig. 2c). Of particular interest is the fact that the single most highly expressed gene in latex-by a wide margin-REF1 resides in the core SNP desert (Supplementary  Tables 16 and 20). In this regard, the sister gene to REF1, SRPP1, is noteworthy by its absence in the SNP desert. Meanwhile, both genes showed the least sequence diversity among the six Hevea cultivars examined, suggesting the importance of their conservation in this species (Supplementary Table 17).

Rubber biosynthesis and expansion of the REF/SRPP family.
In the laticifer network of Hevea, natural rubber is synthesized by a sequential condensation of isopentenyl diphosphates (IPPs) in cisconfiguration to the priming allylic molecules (initiators) 20 . By convention, genes involved in the synthesis of IPP to the final rubber polymer are termed as rubber biosynthesis genes 21 . From our genome assembly, 84 rubber biosynthesis genes from 20 gene families were identified (Supplementary Table 18), including 18 in the cytosolic mevalonate (MVA) pathway and 22 in the plastidic 2-C-methyl-D-erythritol-4-phosphate (MEP) pathway for IPP synthesis, 15 for initiator synthesis in the cytosol and 39 for 'rubber elongation' on rubber particles (Fig. 3a). Of the 24 MEP genes, only two DXS (1-deoxy-D-xylulose 5-phosphate synthase) genes (DXS7 and DXS10) show substantial and preferential expression in latex ( Fig. 3a and Supplementary Table 18). In contrast, at least one gene for each MVA pathway enzyme shows latex-biased abundant expression, supporting the proposition that it is the MVA rather than the MEP pathway that is involved in rubber biosynthesis 22,23 . Three gene families, REF/SRPP, CPT (cis-prenyltransferase) and DXS, have more than ten genes. Recently, a homologue of the human Nogo-B receptor was demonstrated to be essential for rubber biosynthesis in dandelion 24 . However, blast searching yielded no significant hits in our genome assembly or the public Hevea sequences, indicating a discrepancy in the mechanisms of rubber biosynthesis between these two species.  Table 19). These results suggest a possible link between expansion of the REF/SRPP family and the ability to produce rubber.
A close involvement of REF and SRPP in rubber synthesis has been proposed based on in vitro rubber biosynthesis assays 25,26 , and a positive correlation of REF expression with rubber yield 27 . Recent transgenic studies in Taraxacum brevicorniculatum 28,29 also reveal their essential role in rubber biosynthesis. REF and SRPP share a conserved REF motif that is also retained in several stress-related or lipid droplet-associated proteins in plant cells 25,30,31 . Based on the presence of a carboxy (C) terminus found in SRPP and its absence in REF 25,30 , and the similarities among protein sequences, the 18 Hevea REF/SRPP genes are named SRPP1 to 10 and REF1 to 8 (Supplementary Fig. 15). Of these, REF1 (138 aa) and SRPP1 (204 aa) are the two most abundant isoforms (Supplementary Table 18), and correspond, respectively, to the well characterized REF and SRPP 25,26 . In the other 17 plants examined, the REF/SRPP family has isoforms with sizes similar to or larger than SRPP1, but no isoforms with similar size to REF1. The uniqueness of REF1 is highlighted by an alignment of REF/SRPP proteins from different plants (Supplementary Fig. 16).
The 18 REF/SRPP genes exhibit distinct expression patterns in seven Hevea tissues (Fig. 3a). Four isoforms, REF1, SRPP1, REF3 and REF7, have striking RPKM values of >7,000 in latex, this being especially true of REF1 (38,999). In total they account for 96.8% of expression of the REF/SRPP gene family in latex (Supplementary Table 18). In addition, their expression in latex is over tenfold higher than in any other tissue. Considering the fact that all other tissues also contain laticifers and small amounts of latex, we think that the expression of these genes in the tissues may have arisen from latex itself, and REF1, SRPP1, REF3 and REF7 are thus essentially laticifer-specific genes. When compared with all the genes expressed in latex, these four REF/SRPP genes rank among the top, with REF1 ranked first, SRPP1 sixth, REF3 ninth and REF7 the twelfth (Supplementary Table 20).
Phylogenetically, the Hevea REF/SRPP genes are classified into two major groups, with 14 in group I and four in group II (Fig. 3b). Of the group II genes, three (SRPP4, 6 and 10) are not expressed or expressed at very low levels in latex, and the remaining SRPP7 shows somewhat constitutive low expression across different tissues (Fig. 3a), therefore precluding the active participation of group II genes in rubber biosynthesis. group I genes are further divided into two distinct clades, 13 in clade 1 and only SRPP2 in clade 2. SRPP2 exhibited moderate expression across diverse tissues, excluding its special function in laticifers. Interestingly, all latex-specific isoforms (REF1, SRPP1, REF3 and REF7) are clustered in clade 1. When the REF/SRPPs from Hevea and 11 other plants were investigated for their phylogenetic relationship, all the 13 clade 1 isoforms were clustered into an independent clade whereas the remainder were scattered together with the homologues from other plants into different clades (Supplementary Fig. 17). These results suggest that the clade 1 REF/SRPP genes might have evolved independently towards functioning in rubber biosynthesis.
The Hevea REF/SRPP family was further surveyed for their genomic location. Of particular note, 12 of the 13 clade 1 REF/SRPPs including the four laticifer-specific genes, are located in a single 205-kb Scaffold1222 whereas the remaining six are scattered into six different scaffolds (Fig. 3c and Supplementary Fig. 18). SRPP1  (1,000 replicates). The series of plots on the right show SNP density (SNP per kb) and frequencies. SNP frequency is calculated based on a 50-kb sliding window in 1-kb steps. c, Gene Ontology (GO) enrichment in biological process for genes located in core SNP desert using GOEAST. Note that two biological processes, defence response (DR) and respiratory electron transport chain (RETC), are most enriched. LBP, leucine biosynthetic process; ASCPT, ATP synthesis coupled proton transport; CMO, cellulose microfibril organization; CG, cell growth.     and the parallel precursor for REF1 appear to have evolved from the same ancestral SRPP gene, but the emergence of REF1 was a more recent event (Fig. 3b, upper part of clade 1).
Active ethylene signalling and response in laticifers. Since latex production is greatly increased by the ethylene-releasing agent ethephon, we identified 509 differentially expressed genes (DEGs) in latex that respond to ethylene treatment ( Fig. 4a and Supplementary Table 21). About one-quarter of the 342 annotated DEGs (>2-fold expressional change) are transcription factors (53) and protein kinases (31) (Supplementary Tables 22  and 23). Both gene groups are essential in ethylene signal transduction and response in higher plants 32,33 .
To understand ethylene-dependent mechanisms, we first identified the genes responsible for ethylene biosynthesis in Hevea and examined their expression patterns. The three enzymes that act sequentially to synthesize ethylene from methionine, that is S-adenosyl-L-methionine synthase (SAMS), 1-aminocyclopropane-1-carboxylic acid (ACC) synthase (ACS) and ACC oxidase (ACO) 34 have 8, 14 and 16 genes respectively in Hevea (Supplementary Table 24). However, the expression of these gene families, especially that of SAMS (>fivefold) and ACO (>10-fold) is much lower in latex than other tissues. Such low expression or a lack of expression has been noticed recently for nine ACO genes in Hevea latex 35 . These results, together with the low oxygen condition in latex 36 and the requirement of oxygen by ACO to produce ethylene 34 , point out a weak ethylene-synthesizing ability in laticifers. In addition, ethylene treatment had little effect on the expression of ACS, the rate-limiting enzyme of ethylene biosynthesis in latex (Supplementary Table 24). The primary ethylene signalling components, namely ETR, CTR1, EIN2 and EIN3/EIL1, were investigated for their expression in latex in response to ethylene treatment (Supplementary Table 25). Four out of the eight receptor genes (ETR) showed enhanced expression after ethylene treatment, indicating an ethylene response triggered in laticifers since increased expression has been observed for one or more ETRs in every known ethylene response 33 . Of particular interest, expressions of two EIN2s and one EIL1 were apparently boosted in latex after ethylene treatment. As the central transducer and master transcription factors in ethylene signalling, EIN2 and EIN3/EIL1 genes are mainly controlled by ethylene at a post-translational level 32,33 . Hitherto, none of these genes identified in other plants has been found to be transcriptionally regulated by ethylene 32 . The transcriptional enhancement of EIN2 and EIL1 in latex suggests an active ethylene signalling in laticifers.
Downstream of the primary ethylene signalling pathway, ethyleneresponsive element binding factors (ERFs) are implicated. A total of 225 Hevea AP2/ERF members, including 181 ERFs, 35 AP2s, 7 RAVs and 2 Soloists (Supplementary Table 26) were identified, representing the largest assemblage in this grouping as compared to other plants (Supplementary Table 27) 37 . Nearly one-tenth of the AP2/ERF members showed a more than twofold change in expression after ethylene treatment; 10 of them were upregulated and 11 downregulated (Supplementary Table 28). Interestingly, of the seven downregulated ERF genes, four belong to repressor subgroups (ERF-IIa and ERF-VIIIa) of ethylene response 38 . The depression of their expression is expected to be beneficial for activating downstream ethylene response.
Ethylene treatment triggers a number of stimulated downstream biochemical cascades in relation to latex production: sugar loading and its catabolism 9,36,39,40 , water uptake 41 , energy availability 42 , cytosolic alkalinization 42 , nitrogen assimilation 43 and defence responses 44 . Many genes implicated in these events were among the DEGs identified in this study (Supplementary Tables 22 and 23), but not the rubber biosynthesis genes, which is consistent with the previous proposition that ethylene has little direct effect on genes in rubber biosynthesis 45 . The defence response genes involved in producing and scavenging reactive oxygen species (ROS) were further analysed for their importance in the functioning of laticifers 20 . Three DEGs in ROS production were identified, among which an L-ascorbate oxidase (scaffold2147_8831) and an NADPH oxidase (scaf-fold0143_441741) appeared upregulated, whereas a lipoxygenase (scaffold4250_5149) was downregulated (Supplementary Tables 22  and 23). Of the six ROS-scavenging DEGs, five that include two thioredoxins (scaffolds0444_166127 and 0520_1896151), a glutaredoxin (scaffold4986_4172), a caleosin-related peroxygenase (scaffold1473_35058) and polyamine oxidase (scaffold0036_2884533) were upregulated, and one, a thioredoxin (scaffold0794_475278), was downregulated. Both the number of DEGs involved in ROS scavenging and their expression levels after ethylene treatment were much higher than those involved in ROS production. We illustrate in Fig. 4b the mechanisms underlying ethylene stimulation of latex production in Hevea.

Discussion
First, an effort to improve genome assembly is especially meaningful in plant genomes with high repeat content 46 . The Hevea genome contains 71% repetitive content and our BAC-pool sequencing allowed an over 20-fold increase in scaffold N50 length, which can be exploited in sequencing other repeat-rich large plant genomes. Second, expansion and divergence of the REF/SRPP family and its correlation with rubber biosynthesis leads to new insights into the physiology of rubber-producing laticifers. Third, an in-depth RNA-seq analysis assisted by the high-quality genome assembly has allowed us to gain new understanding of the mechanisms underlying ethylene stimulation of rubber production.
What sets the rubber tree apart from the numerous other rubberbearing plants 47 is its ability to produce prodigious amounts of rubber. In this regard, the Hevea genome provides a good resource to understand how the tree has managed to accomplish this uncommon feat. An appraisal of data from the genome points to the REF/SRPP gene family, which encompasses the most highly expressed genes in the latex. The divergence of this gene family into several laticifer-specific abundant isoforms suggests extraordinary selection pressure in play. Rubber contained in latex provides Hevea with a protective function against boring pests (such as beetles) with its coagulating ability to entrap them in the exuding latex which then self-seals the wound 48 . From the viewpoint of evolutionary advantage to the species, this is the clear benefit that the rubber tree receives. From the anthropological perspective, this adaptation has fortuitous consequences in that it provides mankind with the building blocks of a global industry.
A search for REF and SRPP in the Hevea SNP desert is instructive. REF heads the list in gene expression in the SNP desert, implying that it is an active gene that has descended from a recent, single selection of an ancestral sequence. On the other hand, the significance of SRPP in relation to the SNP desert lies in its absence. The phylogenetic dendrogram in Fig. 3b offers an explanation. An SRPP isoform is an ancestral protein in the REF/SRPP gene family. A mutation event occurred relatively recently whereby SRPP was truncated to give rise to REF. Deletion of the SRPP C terminus that resulted in REF isoforms appears to have occurred more than once in the rubber tree's evolutionary history, but the isoforms that prevail in modern Hevea are REF1 and SRPP1. The late appearance of REF also explains why, among the other 17 plants examined, the REF/SRPP family has isoform sizes similar to or larger than SRPP1, but none with similar size to REF1. The breakout event that produced REF is important to the modern Hevea's capacity for high rubber production. Hevea is unique in its high rubber production arguably because REF is unique to Hevea.
There are two classes of rubber particles in Hevea latex: the large particles (LRPs) with REF located on their surfaces and the small particles (SRPs) that are coated with SRPP 25,49 . SRPs are far superior in number, accounting for 94% of all rubber particles in the latex, whereas LRPs constitute only the remaining 6% of the rubber particles. However, it is precisely this 6% of rubber particles by number that makes up 93% of the rubber by volume in the latex 50 . REF1 might contribute to rubber biosynthesis by facilitating the biogenesis of LRPs or the 'growing' of SRPs to LRPs. Two pieces of evidence support this presumption: (1) the amount of REF1 protein in the whole latex has been found to be proportional to the rubber content 26 ; (2) REF1 gene expression was reported to correlate with yield levels of Hevea cultivars 27 . Taking a hypothetical stance, if REF had not evolved to facilitate the formation of LRPs, the typical tapped latex that has a dry rubber content (drc) of 33% would have less than 3% drc for the same number of SRPs per unit volume. In other words, the entire natural rubber industry is essentially founded on a very small proportion of rubber particles that are LRPs.
In Hevea planting, the ethylene generating compound, ethephon, is commonly applied to the bark of Hevea to lengthen the flow duration and to aid latex regeneration after tapping. Although bearing a weak capability for ethylene synthesis that is little affected by ethylene treatment too, the laticifers reveal an active ethylene signalling capability as evidenced by transcriptional accentuation of the ethylene signal receptor and transduction genes on ethylene stimulation. The low ethylene-synthesizing ability in laticifers, but an active ethylene signalling mechanism that is responsive to exogenous ethylene could shed new light on how ethylene triggers a significant increase in latex flow and rubber production.
The data from this study, together with other public resources, pave a way for whole genome association studies, germplasm improvement and genetic modification of Hevea to meet increasing global demand for natural rubber.
Shotgun library construction and sequencing. For genome sequencing, paired-end libraries (<600 bp) were constructed by the standard A-tailing Illumina protocol. Mate-paired libraries (800 bp to 10 kb) were constructed by a modified SOLiD mate-pair library preparation protocol 52 . The Reyan7-33-97 genome was sequenced by the IlluminaGA2 and Hiseq2000 systems, whereas the genomes of re-sequenced cultivars were sequenced solely on the Hiseq2000 platform. Raw reads were preprocessed by an in-house perlscript, and then proofread by the ErrorCorrection module in the SOAPdenovo package 53 . For RNA-seq, RNA samples were used to prepare libraries and sequenced using the Illumina Hiseq2000 system. Raw RNA-seq reads were processed to trim terminal low quality bases and adapter sequences via an in-house custom pipeline. The clean reads were then mapped to the Hevea genome using GSNAP 54 . Cufflinks 55 was used to compute the transcripts expression levels in reads per kilobase per million reads mapped (RPKM) and to identify the differentially expressed genes (DEGs) upon ethylene stimulation. Hierarchical clustering of DEGs was conducted using custom R scripts.
Genome assembly and validation. The pre-processed paired-end reads were assembled using SOAPdenovo (69-mer size), and mate-pair reads were used to construct scaffolds using SSPACE 56 . Scaffold gaps were closed with Gapfiller 57 and the cd-hit software was employed to filter chimeric scaffolds 58 . A BAC-pooling sequencing strategy was employed to improve the assembly further ( Supplementary  Fig. 4 and Supplementary Note 1). Public databases, BAC and transcript sequences were used to validate the genome assembly (Supplementary Note 4).
Genome annotation. Genome repeat sequences were annotated de novo by using RepeatMasker (http://www.repeatmasker.org). The protein-coding genes were predicted based on the repeat-masked genome through a combination of ab initio, conserved protein homologues and assembled transcripts. The non-coding RNAs were annotated by employing the INFERNAL software 59 to search against the Rfam database (Supplementary Note 5).
Estimation of whole-genome duplication. The all-to-all BLASTP program was used to identify the homolog pairs in Hevea proteins and the MCscan program 60 was used to search for collinearity blocks which were then filtered using the criterion of fewer than ten non-collinear genes between any two collinear genes. The same method was used to identify the syntenic blocks in L. usitatissimum, P. trichocarpa, R. communis and M. esculenta, or between Hevea and any of the four species. The collinearity region sizes for Hevea, M. esculenta and P. trichocarpa were estimated statistically by the Wilcox test. To estimate the duplication event, we calculated the synonymous K s and fourfold synonymous third-codon transversion position (4DTv) using KaKs_calculator 61 with the NG model and an in-house perl script, respectively.
Phylogenetic analysis. Twelve species, Medicago truncatula, Glycine max, Fragaria vesca, Vitis vinifera, Oryza sativa, L. usitatissimum, P. trichocarpa, R. communis, M. esculenta, Citrus sinensis, Carica papaya and Arabidopsis thaliana, were selected to construct a phylogenetic tree with Hevea using 72 single-copy gene families (Supplementary Note 7). These species represent the typical Rosids sequenced to date in the Eudicots. O. sativa was designated as an outgroup.
Genetic heterogeneity of Hevea cultivars. Paired-end sequences (35×) from each of the six Hevea cultivars were mapped to the Reyan7-33-97 reference genome using BWA (ref. 62). The mapping results were transformed into bam format and sorted with SAMtools 63 . SNP calling results were filtered against the following criteria: (1) each SNP supported by at least five non-redundant reads; (2) average mapping quality more than 40; and (3) two SNPs within 10 bp to be excluded. Based on the SNP calling results, SNP deserts (<1 SNP per kb) were identified in each sequenced cultivar using a perl script. Genes located in SNP deserts were subjected to functional annotation and GO enrichment analysis using GOEAST (http://omicslab.genetics.ac.cn/GOEAST/).
Genes associated with rubber biosynthesis, ethylene synthesis and signalling. The 20 gene families that have been reported as involved in rubber biosynthesis were identified from the genome (Supplementary Note 10). Ethylene synthesis and signalling genes that had been characterized in A. thaliana 64 were retrieved for their corresponding protein sequences from the Arabidopsis Information Resource (TAIR) (https://www. arabidopsis.org/ index.jsp). The retrieved A. thaliana proteins were processed with Interproscan 65 and Blastp searched against Hevea proteins. Hits sharing >30% amino acid identity and >50% amino acid alignment length with the A. thaliana homologues were further checked for pfam domain architecture. Those having the same pfam domains as their A. thaliana homologues were regarded as proteins involved in ethylene synthesis and signalling in Hevea (Supplementary Table 25).
The H. brasiliensis genome and RNA-seq sequences have been deposited in GenBank/DDBJ/EMBL under the accession codes of LVXX01000000 and SRP069104, respectively.