Sequencing and de novo assembly of a near complete indica rice genome

A high-quality reference genome is critical for understanding genome structure, genetic variation and evolution of an organism. Here we report the de novo assembly of an indica rice genome Shuhui498 (R498) through the integration of single-molecule sequencing and mapping data, genetic map and fosmid sequence tags. The 390.3 Mb assembly is estimated to cover more than 99% of the R498 genome and is more continuous than the current reference genomes of japonica rice Nipponbare (MSU7) and Arabidopsis thaliana (TAIR10). We annotate high-quality protein-coding genes in R498 and identify genetic variations between R498 and Nipponbare and presence/absence variations by comparing them to 17 draft genomes in cultivated rice and its closest wild relatives. Our results demonstrate how to de novo assemble a highly contiguous and near-complete plant genome through an integrative strategy. The R498 genome will serve as a reference for the discovery of genes and structural variations in rice.

for their starting and ending positions), and gray boxes represent gaps. None of the five gaps was completely covered by a single map. Therefore, their sizes could not be estimated.  R498 and Nip. (b) Comparison of genome maps to R498 on chromosome 6 shows the correctness of the sequence assembly in R498. The first break point of the inversion (between 12.76-12.81 Mb) in R498 is supported by a single genome map. The second break point of inversion (between 18.51-18.55 Mb) in R498 supported by another single genome map. (c) Three smaller inversions on chromosomes 7, 11 and 12. (d) Comparison of the R498 sequences of the three inversions in (c) to genome maps shows that their boundaries (red arrows) in R498 were assembled correctly. Figure 13: Dot plot showing alignment of the 93-11 sequences to R498. The x-axis is the R498 sequences (0-390.3 Mb). The numbers on x-axis are the chromosome numbers which are in the same order for y-axis . Note that there are many 93-11 unanchored sequences being aligned to each chromosome of R498, represented by red dotted lines as the one bounded by the light blue dashed box. The blue dotted line represents the same inversion on chromosome 6 between R498 and Nip.  Contig types: PBcR LS, PBcR assembly under low stringent conditions; PBcR HS, PBcR assembly under high stringent criteria; CANU, CANU assembly; Falcon, Falcon assembly; Fosmid contigs, assembled fosmid contigs from pooled fosmid clones; Centromere, indicating contigs containing centromeric satellites but not mapped onto the pseudomolecules; Unclassified, indicating the remaining contigs not mapped onto the pseudomolecules. CGL: Covered genome length by WGS contigs; ACL: total aligned contig length of a WGS assembly; OCN: overhang contig number, the number of contigs aligned to pseudomolecules with overhang length >1 kb; COL: sum of contig overhang length. If a contig is aligned to genome in multiple regions, it is designated as a chimeric contig, in which the aligned fragments except the longest one are treated as overhangs of the contig.  Centromere regions containing all the f ull units of rice centromere tandem repeats RCS2. All links between a pair of nodes on the overlap graph are created based on their overlaps to the opposite ends of fosmid contigs. However, a link can be viewed as one of the following subtypes based on the genetic map positions of the nodes it connects :

Supplementary
(1) direct link (negative edge length) between a pair of neighboring WGS contigs with sequence overlap between the WGS contigs; (2) indirect link (positive edge length) between two neighboring WGS contigs without sequence overlap between the WGS contigs ; (3) direct or indirect link between an anchored WGS contig and an unanchored WGS contig; (4) direct or indirect link between two unanchored WGS contigs. Clearly, the positional information on a genetic map has put another restriction which can be used to prioritize the node pairs to be merged, ie, the closer a node pair on a linkage group, the less chance of error can occur when connecting them. For example, a direct link is generally more reliable than an indirect link, and the links involving anchored contigs are generally more reliable than those not.
Compared with randomly connecting two nodes with a fosmid link between them, we can see that by utilizing the restrictions that fosmid contigs and genetic map put on the overlap graph, the success rate of connecting two truly adjacent WGS contigs will be increased significantly. We minimized the connection errors by observing the following rules: (1) non-redundancy rule, i.e. a WGS contig is used only once ; the used contig (and all its links from the used end to the same linkage group and all links to different linkage groups ) is removed to reduce conflicts; (2) global best-match-first rule, i.e., the best-scored link (in all linkage groups) is merged first; (3) delayed conflict-resolving rule, i.e., if a contig end overlaps with more than one WGS contigs with the same score, first anchored nodes, then nodes with direct links are merged first; otherwise the connection is delayed to next step; (4) decreasing and minimum overlap threshold rule, i.e., a sequence overlap cannot be shorter than 5 kb with a minimum alignment identity of 97% and three level of identity threshold (99%, 98% and 97%) are used in turn to merge the nodes satisfying each threshold. No nodes from the different linkage groups can be merged except the split part from chimeric contigs.
On the overlap graph constructed using PBcR HS contigs, we found that the average edge number between the nodes on the path of the final super-contigs was 29, while the average edge number for all other connected node pairs being merely 2. As a comparison, the average edge number between the nodes (contigs) adjacent on hybrid genome maps was 27, and the average edge number between other contigs on hybrid maps was also 2. Totally 81% of the gaps in hybrid maps can be connected by one fosmid contig. These results indicated that the fosmid contigs are mostly correctly assembled with a small number of random chimeric ones which did not affect the quality of the connecting regions.
Initially we used the PBcR LS contigs to build super-contigs since the genetic map was constructed using LS contigs. However, after several iterations when no more neighboring contigs (or super-contigs) could be connected, we found that the unfilled gaps between adjacent contigs (or super-contigs) were either located in the centromere regions or resulted from misassembled LS contigs around the gaps. For the latter case, we found that those gaps could be successfully filled after replacing the PBcR LS sequences around the gaps with the best matched HS contigs. Therefore, we redid the whole connection process by aligning the HS contigs onto the constructed LS super-contigs to group and order them (to form anchored and unanchored HS contig sets) and using the fosmid contigs to connect them. In the final HS super-contigs, no unfilled gaps were left due to misassembled contigs. The replacement of PBcR LS contigs with HS contigs also added a few Mb of sequences into the final super-contigs which included several more telomeres. These results suggested that although the LS assembly had higher N50 size but it contained many errors not existing in the HS assembly which was more accurate for the final assembly.

Supplementary Note 2: Quality assessment of the WGS assemblies and the connecting regions
Based on the best aligned blocks of the assembled WGS contigs to the pseudomolecules, we computed the following metrics to evaluate the quality of the WGS assemblies: sensitivity (length percentage of genome being covered by WGS contigs), redundancy ((total aligned length of WGS contigscovered genome length) / covered genome length) and error rate (total overhang length of >1 kb of aligned WGS contigs / total aligned length of WGS contigs) (Supplementary Table 5). The PBcR HS assembly had the highest sensitivity (99.75%), the highest redundancy (15.86%) and the lowest error rate (0.57%). The high sequence redundancy and low error rate in the PBcR assembly suggested that it generated multiple copies of sequences in repetitive (or heterozygous) regions ( Supplementary Fig. 5), which represented a general feature of overlap-layout-consensus (OLC) assemblers 1 , such as Celera in PBcR. On the other hand, string graph assemblers such as Falcon do not usually keep repetitive sequences at the end of assembled contigs 1 . These results suggested that the PBcR HS contigs were very suitable for building super-contigs under the guidance of a genetic map or genome map.
The PBcR HS contigs covered 98.34% of the connecting regions with identity >=98%, indicating that most of the connecting sequences were present in HS assembly, but in short contigs with N50 size of 24 kb. The Illumina short reads covered 98.02% of the connecting regions (with sequencing depth of 109x) with identity>=98%. As a comparison, the whole genome coverage and sequencing depth from Illumina short reads was 99.88% and 89x. The high sequencing depth in the connecting regions suggested that they were unlikely to be misassembled at base level. If misassembled, the low quality regions should have much lower sequencing depth or sequencing coverage (see Supplementary Fig. 7). Illumina platform cannot sequence many GC-biased regions. The total number of zero mapping regions by Illumina short reads were 4,751 (a total of 486,331 bp), with max length of 10,170 bp. Since these regions are shorter than the read length of SMRT sequences, they were very unlikely caused by misassemblies.