Cassava genome from a wild ancestor to cultivated varieties

Cassava is a major tropical food crop in the Euphorbiaceae family that has high carbohydrate production potential and adaptability to diverse environments. Here we present the draft genome sequences of a wild ancestor and a domesticated variety of cassava and comparative analyses with a partial inbred line. We identify 1,584 and 1,678 gene models specific to the wild and domesticated varieties, respectively, and discover high heterozygosity and millions of single-nucleotide variations. Our analyses reveal that genes involved in photosynthesis, starch accumulation and abiotic stresses have been positively selected, whereas those involved in cell wall biosynthesis and secondary metabolism, including cyanogenic glucoside formation, have been negatively selected in the cultivated varieties, reflecting the result of natural selection and domestication. Differences in microRNA genes and retrotransposon regulation could partly explain an increased carbon flux towards starch accumulation and reduced cyanogenic glucoside accumulation in domesticated cassava. These results may contribute to genetic improvement of cassava through better understanding of its biology.


Supplementary Figure 2 Genome size estimation with multi-kmer frequency distribution of W14
The main graph depicts the distribution of 17mer, 21mer, 25mer, and 29mer in the reads of short insert size libraries (200-500 bp) and the inset shows the volume of 25mer corrected by the kmer spectrum method. The total kmer number of 'k=25 corrected' is 9,644,794,319, and the volume peak is 13, so the genome size can be estimated in 742 Mb using the formula: (total kmer number) / (the volume peak).

Supplementary Figure 3 Data contribution to hybrid assembly
The bar-chart was depicted contribution of scaffolds accumulated length consist of different types of sequencing data. The line-points were depicted contribution of different types of sequencing data in for constructed the scaffolds number.    Mapman images indicate the expression differences of genes that are related to the secondary metabolism between Arg7 and W14 and between KU50 and W14. The log2 ratios of Arg7 vs. W14 and KU50 vs. W14 were used to draw the Mapman images. The color of blue means that the gene has a higher expression level in W14

Supplementary Figure 4 Sequencing libraries insert size span distribution
and that of red means that the gene has a higher expression level in cultivar (Arg7 or KU50).

Supplementary Figure 33 lower expression of genes for cell wall synthesis in
KU50, Arg7 than W14.
Mapman images indicate that the genes are related to the cell wall precursors between Arg7 and W14, and between KU50 and W14. The log2 ratios of Arg7 vs.
W14 and KU50 vs. W14 were used to draw the Mapman images. The color of blue means that the gene has a higher expression level in wild W14 and that of red means that the gene has a higher expression level in cultivar (Arg7 or KU50).

3) Supplementary Notes
Supplementary Note 1 Choice of cassava ancestor and cultivated variety for

WGS
Manihot genus includes about 98 species occurring in both northern South America  Table 1). KU50 is propagated by stem cuttings and seldom bears fruits, but it has a high tuber root yield, with3 to 10 kg/plant per year. KU50 has a high photosynthesis rate and its starch content in fresh tuber roots ranges from 28 to 32%. On the other hand, W14 usually produces a large number of fruits and is propagated only by seeds. However, the photosynthesis rate of W14 is lower than that of KU50. Its tuber root yield is only 0.5 -2.0 kg/plant per year and its starch content in fresh tuber roots is only 3 -5% (Supplementary Figure 1). Two other cultivated varieties, CAS36 and Arg7 were used in the experiment. CAS36, a self pollinated generation (S1.600) of sugary cassava, a natural mutant supplied by EMBRAPA Brazil with high sugar content (15-20%), substantial starch content (5%) and large tuber roots, was used for production of 20-fold coverage re-sequencing of genome. Arg7, a variety with elite agronomic traits bred in Argentina was used for transcription profiling and annotation with KU50 and W14. and GenoProfiler 12 using the following criteria: Fragments in the size range of 75 -1,000 bp were measured. For the data quality control, vector bands and clones failed in fingerprinting or lacking inserts were removed. In addition, samples with fewer than 25 or more than 200 fragments were excluded from the analysis. Fingerprints of cross-contaminated samples were detected using a module in the FPMiner and removed from the data set. The cross-contamination was defined as clones residing in neighboring wells in either 384-well format or 96-well format (quadrants) plates and sharing 30% or more of the mean numbers of fragments calculated using the formula: shared bands*2/(bands of clone 1 + bands of clone 2).

Construction of BAC libraries
Contig assembly For the AM560 physical map assembly, after fingerprint editing, 58,244 clone fingerprints representing 8 cassava genome equivalents, were suitable for contig assembly. These clone fingerprints were then used for an initial automated contig assembly using the FPC software 13 , with a tolerance of 5 (0.5 bp). The initial assembly was performed at a relatively high stringency (110 -45 ) to minimize contig assembly of clones from unrelated regions of the genome. The "DQer" function was used to dissemble contigs containing more than 15 % questionable (Q) clones. The "Singleton to End" and "End to End" functions were employed to merge contigs that are actually overlapped by successively decreasing the assembly stringency, i.e., of increasing the Sulston cutoff values. At last, the 10% largest contigs were subjected to manual editing such as examining and disjoining the contigs with CB map analysis.
At the end, the FPC assembly resulted in a total of 2,105 contigs and 5,054 singletons (Supplementary Table 2). For the W14 physical map assembly, similar assembly procedure and parameter settings were used, resulting in a total of 2,485 contigs and 2,909 singletons (Supplementary Table 2).
BAC end-sequencing The minimum tilling path (MTP) clones were picked from both the AM560 and W14 assemblies and end sequenced. The BAC end sequences (BESs) were generated using the Sanger sequencing approach (ABI 3730XL) and used to align whole-genome shotgun sequence contigs and scaffolds onto the physical map BAC contigs.

Supplementary Note 3 Genome size estimation
As shown in Supplementary Figure 2, the main graph depicts the distribution of 17mer, 21mer, 25mer, and 29mer in the reads of short insert size libraries (200-500 bp) and the inset shows the volume of 25mer corrected by the kmer spectrum method. The peak at low frequency and high volume represents random base errors and heterozygosity in the raw sequences. The high frequency and volume peaks at 63 may be due to the presence of Endophyte sequences. The total kmer number of 'k=25 corrected' is 9,644,794,319, and the volume peak is 13, so the genome size can be estimated in 742 Mb using the formula: (total kmer number)/(the volume peak 1 ).

Supplementary Note 4 Data description
The Manihot esculenta genome contains 18 chromosome pairs. We used a combined whole-genome shotgun sequencing (WGS) and BAC pooling (BP) strategy. Sequence  Table 1; Supplementary Table 3). We also generated a high-quality re-sequencing data set for sugary cassava CAS36, with a genome coverage of 21x (Supplementary Table 4).

Supplementary Note 5 Assembly strategy
Cassava draft genome sequences were assembled with multi-formed sequencing data using the following strategies, the following of which programs were parts of GNU software package and GATE v1.0 ( https://github.com/BENMFeng/GATE).
Data pre-processing Low-quality PCR duplications were discarded using the PERL script 'filterPCRdup.pl'; duplicated reads were identified by seed generated from Pollution removal Using BWA 21 v0.6.1 with parameters set as 'aln -l 31 -k 0' to align all reads to Bacterium or potential pollution during the whole library construction and sequencing processes, and discarded the reads that perfectly matched to the pollution sequences. We separated the nuclear and organelle genome sequence by alignment to pre-assembly of chloroplast and mitochondrial sequence, for the reason that an organelle genome has 1000-fold more copies than a nuclear genome in plant DNA library. Therefore, we could reduce the artificially induced complexity of the sequences for draft genome assembly.
Error corrections According to the kmer-frequency distribution compared to kmer species based on Lander-Waterman model with Fan's algorithm 14 , we eliminated or corrected the reads with low kmer frequency (kmer from 17 to 25) using a C++ program of 'ec' -"Error Correction" in some reads, but the same kmer sets in others with more higher (>2 times) frequency, of which most were due to random base calling errors.  Figure 4.

De novo assembly
After eliminated the polluted sequences of other known species of GeneBank using BLAST, GC content was used for estimation of assembled contigs sourced from the same segment. As shown in Supplementary Figure 5 (Supplementary Figure 6).

Supplementary Note 7 Integrated scaffolding of physical map and draft genome
To anchor the draft genome to physical map 23 Table 6).

Supplementary Note 9 Gene Prediction
We utilized five ab initio predictors to construct the entire predicted genes structure in silicoFor W14 and KU50, respectively,39,919 and 43,318 genes were predicted using AUGUSTUS 27 , 64,236 and 65,960 genes predicted using SNAP 28  and KU50, respectively (Supplementary Figure 9a, b). Then, we validated the de novo predicted gene models with annotated transcriptome (Supplementary Figure   9c). Compared to over 90% mapped reads, approximately 55.3 -66.3% de novo assembled transcripts could be aligned to predicted genes in W14 and KU50.
However, 75 -87.9% annotated transcripts with gene products could be aligned to predicted genes in W14 and KU50 genome. The unmappable transcripts were probably the non-protein coding RNA, most of which were sourced from lncRNAs.
We had independently investigated the IncRNAs in Supplementary Note 21. The unmappable annotated transcripts could be lost in un-annotated assembly or lacked in the assembly. These results indicated that the most of the gene structures were covered in the assembly and gene sets predicted from the draft genomes of W14 and KU50. Their functional annotation is shown in Supplementary Table 7.

Supplementary Note 11 Genome heterozygosity in cassava
To compare the diversity of the three cassava genomes, W14, KU50 and AM560-2, we aligned the raw reads to their draft genomes using BWA v0.6.1 19 (Supplementary Figure 14). where min is denoted as the minimal length as the length of two pairwise alignment

Comparison of gene models in Euphorbiaceae
ORFs; match is denoted as matched homology length of two pairwise alignment ORFs. In total, 34,154 gene CNVs were composed of 28,072 genes for W14, 31,310 genes for KU50, and 28,484 genes for AM560. Using an 80% similarity pairwise alignment, 6,567 genes were identified in single copy in the three genomes.
For PAV (present and absent variation) analysis, the PAV genes among the W14, KU50 and AM560 genomes were grasped directly. Considering de novo assembly and ab inito gene prediction false negatives, we filtered the PAV genes which still could be found in the draft genome with 100% coverage or 30% coverage and over 30x coverage in raw read mapping. The results showed that 1,584 genes were only present in W14 but absent in cultivars, while 1,678 genes were only present in cultivars but absent in their wild progenitor W14. GO annotation revealed that most of the PAV genes were ascribed into six biological processes, including binding, catalytic activity, metabolic process, cellular process, cell and cell part (Supplementary Figure 17), suggesting that these genes have been strictly selected during the process of long-time domestication. Copy Number Variation (CNV) annotation found that of the genes with significant difference in CNV between W14 and cultivars, 30 genes have high CN only in W14 and 80 genes have higher CN only in cultivars. All these genes are involved in catalytic activity, binding and transferase activity (Supplementary Figure 18).

Supplementary Note 15 Comparison of SNV/InDel and SV in one-to-one single copy genes
A total of 6,567 one-to-one single-copy genes among the three cassava genomes, W14,   Figs. 28a-b and 29), respectively. In storage root with typical developing root at middle stage (MTR), 1,103 and 2,160 genes were higher expressed in cultivars KU50 and Arg7, whereas 2,017 and 2,052 genes were higher expressed in wild W14 (Supplementary Figs. 28c-d and29). Of these genes, 406 and 1,690 genes higher co-expressed in leaf and storage root of cultivars, whereas 343 and 1,042 genes higher expressed in wild W14.
Gene Ontology (GO) enrichment analysis of the four gene groups revealed the enhanced or dwindled pathways in evolution. In storage root, the subcategories of GO class of 'cellular component', 'cell part' to 'cytoplasmic part' and 'organelle', and 'response to stimulus' with subcategories of 'response to abscisic acid stimulus', 'response to oxidative stress' and 'response to temperature stimulus', were enriched in cultivated species; but GO class of 'metabolic process', subnets related to 'cell wall polysaccharide biosynthesis process', 'lipid metabolic process' to 'fatty acid metabolic process', 'secondary metabolic process' and 'response to stimulus' with subnets of 'response to chemical stimulus' to 'response to water stress' and 'response to jasmonic acid stimulus' were enriched in wild species (Supplementary Figure   30A, C). Meanwhile, in functional leaf, GO subcategories of 'cellular metabolic process', photosynthesis, 'cell part' to 'chloroplast part' and 'photosystem', and 'response to stimulus' with subnets of 'response to heat', 'response to light stimulus', 'response to oxidative stress' and 'response to bacterium and fungus' were expanded in cultivated species; but GO terms of 'transporter activity' to 'potassium ion symporter activity', 'sugar/hydrogen symporter activity' and 'calcium transporting

Supplementary Note 21 Micro RNA and non-coding RNA annotation
We searched for novel miRNAs in the three cassava genomes, W14, KU50 and AM560 using the corresponding small RNA-seq datasets, respectively 53 . The method for Cassava miRNA identification has been documented previously 54 . Briefly, we first processed raw sequence reads by an in-house method that recursively searches for the longest substring of the adaptor appearing within a sequence read. Qualified reads, the ones carrying 3' sequencing adaptor and being longer than 17-nt, were then mapped to a genome using Bowtie (version 0.12.7). The loci with a sufficient number of reads mapped to were subject to miRNA identification with stringent criteria, including presence of a hairpin structure and a 21-nt RNA duplex with 3'-nt overhang. The newly identified miRNAs and known miRNAs in cassava were characterized in details in previous studies 53 . We further carried out a homology search by aligning the mature and hairpin sequences of the miRNAs to the three cassava genomes, respectively, using the local alignment tool BLAST. We set the p-value obtained from BLAST less than 1e-10 and manually examined the alignment to determine if a hit of BLAST was homologous to the input miRNA. We mapped the qualified reads from the corresponding genome datasets to the identified homologous sequences by Bowtie and counted the map-able reads. If no homologous sequences could be identified in a genome assembly, we then mapped the reads from the sequencing datasets of the same genome to the input miRNA sequence, with allowing two mismatches. We considered a miRNA not conserved in the genome assembly, if both criteria were met: 1) no homologous sequences identified and 2) no sufficient mappable reads (less than 10 normalized reads from the sample of the cultivar; reads were divided by the number of mappable reads in each sample to adjust for variation across samples. More details were presented in the previous work 53 . If there were reads mapped to the input miRNA sequences, we considered the miRNA conserved in the genome assembly, even if we were not able to identify the homologous sequences.
The other noncoding RNA genes were analyzed using existing tools. In particular, tRNAs were analyzed using tRNASCAN-SE 55