Twelve Platinum-Standard Reference Genome Sequences (PSRefSeq) that complete the full range of genetic diversity of Asian rice

As the human population grows from 7.8 billion to 10 billion over the next 30 years, breeders must do everything possible to create crops that are highly productive and nutritious while having a smaller environmental footprint. Rice will play a critical role in meeting this demand, and thus knowledge of the full repertoire of genetic diversity that exists in germplasm banks across the globe is required. To this end, we describe the generation, validation and preliminary analyses of transposable element and long-range structural variation content of 12 near-gap-free reference genome sequences (RefSeqs) from representatives of 12 of 15 subpopulations of cultivated rice. When combined with 4 existing RefSeqs that represent the 3 remaining rice subpopulations and the largest admixed population, this collection of 16 Platinum-Standard RefSeqs (PSRefSeq) can be used as a pan-genome template to map resequencing data and detect virtually all standing natural variation that exists in the pan-cultivated rice genome.

One key finding from a population structure analysis of this dataset showed that the 3,000 accessions can be subdivided into [...] (Wang et al. 2018) and show that the 3K-RG dataset can be further subdivided into a total of 15 subpopulations. We then present the generation of 12 new and near-gap-free high-quality PacBio long-read reference genomes from representative accessions of the 12 subpopulations of cultivated rice for which no high-quality reference genomes exist. All 12 genomes were assembled with more than 100x genome coverage PacBio long-read sequence data and then validated with Bionano optical maps (Udall and Dawe 2018). [...]

The group membership for each sample was defined by applying the threshold of 0.65 to admixture components. Samples with no admixture components exceeding 0.65 were classified as follows. If the sum of components for subpopulations within the major groups cA (circum-Aus), XI (Xian-indica), and GJ (Geng-japonica) was ≥ 0.65, the samples were classified as cA-adm (admixed within cA), XI-adm (within XI) or GJ-adm (within GJ), respectively, and the remaining samples were deemed 'fully' admixed.

The newly defined groups were mostly either aligned with the previous K=9 grouping, or refined those groups, and they were named accordingly (e.g. XI-1B1 and XI-1B2 are new subgroups within XI-1B).

The phenogram shown in Figure 1 was constructed with DARwin v6 (http://darwin.cirad.fr/, unweighted Neighbor-joining) using the identity by state (IBS) distance matrix from Plink on the 4.8M Filtered SNP set (available at https://snp-seek.irri.org/_download.zul). Colors were assigned to subpopulations based on K15 Admixture results. One entry, MH 63 (XI-adm), represents the admixed types among the XI group.
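The group-membership rule described in this section can be illustrated with a minimal sketch. It is not the authors' script: the function names, the convention that a subpopulation label begins with its major-group prefix followed by a hyphen, and the placeholder labels `GJ-x` and `cA-x` are assumptions made for illustration. The text says "exceeding 0.65" for single components and "≥ 0.65" for group sums; the sketch uses `>=` in both places, which is also an assumption.

```python
from typing import Dict, Optional

THRESHOLD = 0.65                   # threshold stated in the text
MAJOR_GROUPS = ("cA", "XI", "GJ")  # major groups named in the text


def major_group(subpop: str) -> Optional[str]:
    """Return the major-group prefix of a subpopulation label, if any.

    Assumes labels look like '<group>-<suffix>', e.g. 'XI-1B1'.
    """
    for group in MAJOR_GROUPS:
        if subpop == group or subpop.startswith(group + "-"):
            return group
    return None


def classify(components: Dict[str, float], threshold: float = THRESHOLD) -> str:
    """Assign a sample to a subpopulation, a '<group>-adm' class, or 'admixed'."""
    # Rule 1: a single admixture component reaches the threshold.
    top_name, top_value = max(components.items(), key=lambda kv: kv[1])
    if top_value >= threshold:
        return top_name

    # Rule 2: the components within one major group sum to the threshold.
    sums = {group: 0.0 for group in MAJOR_GROUPS}
    for name, value in components.items():
        group = major_group(name)
        if group is not None:
            sums[group] += value
    for group in MAJOR_GROUPS:
        if sums[group] >= threshold:
            return group + "-adm"

    # Rule 3: everything else is 'fully' admixed.
    return "admixed"
```

For example, a sample with components 0.4 (XI-1B1), 0.3 (XI-1B2) and 0.3 (a GJ subpopulation) has no single component at the threshold, but its XI components sum to 0.7, so this sketch would label it XI-adm.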

Sample selection, collection and nucleic acid preparation
To select accessions to represent the 12 subpopulations of Asian rice that lack high-quality reference genome assemblies, the following strategy was employed. The IBS closest to the centroid for which seed was available was chosen as the representative for [...] remaining 10 accessions (Table 1) were obtained from the International Rice Genebank, maintained by IRRI, Los Baños, Philippines. All seeds were sown in potting soil and grown under standard greenhouse conditions at UA, Tucson, USA for 6 weeks, at which point they were dark-treated for 48 hours to reduce starch accumulation. Approximately 20-50 grams of young leaf tissue was then harvested from each accession and immediately flash frozen in liquid nitrogen before being stored at -80°C prior to DNA extraction. High molecular weight genomic DNA was isolated using a modified CTAB procedure as previously described (Porebski et al. 1997). The quality of each extraction was checked by pulsed-field electrophoresis (CHEF) on 1% agarose gels for size and restriction enzyme digestibility, and quantified by Qubit fluorometry (Thermo Fisher Scientific, Waltham, MA).
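The selection strategy is only partly preserved in the text above, so the following is a hedged sketch of one plausible reading rather than the procedure actually used. Because only a pairwise IBS distance matrix is assumed to be available, the sketch swaps the centroid for a medoid-style criterion: among accessions with seed available, it picks the one with the smallest mean distance to the other members of its subpopulation. The function name, argument layout and this criterion are all assumptions.

```python
from math import inf
from typing import Optional, Sequence, Set


def pick_representative(
    names: Sequence[str],
    dist: Sequence[Sequence[float]],
    has_seed: Set[str],
) -> Optional[str]:
    """Pick the most central accession that has seed available.

    names    -- accession names for one subpopulation
    dist     -- symmetric pairwise distance matrix, same order as names
    has_seed -- names of accessions for which seed can be obtained
    """
    best_name: Optional[str] = None
    best_score = inf
    n = len(names)
    for i, name in enumerate(names):
        if name not in has_seed:
            continue  # central but unobtainable accessions are skipped
        others = [dist[i][j] for j in range(n) if j != i]
        score = sum(others) / len(others) if others else 0.0
        if score < best_score:  # ties keep the first accession encountered
            best_name, best_score = name, score
    return best_name
```

If the most central accession has no seed, the sketch falls back to the next most central one that does, which matches the "for which seed was available" condition in the text.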

Library construction and sequencing
Genomic DNA from all 12 accessions was sequenced using the PacBio single-molecule [...] (Table 2). According to the estimated genome size of the IRGSP RefSeq, the average PacBio sequence coverage for each accession varied from 103x (LIMA::IRGC 81487-1) to 149x (IR 64) (Table 2).

For Illumina short-read sequencing, HMW DNA from each accession was sheared to between 250 and 1000 bp, followed by library construction targeting 350 bp inserts following standard Illumina protocols (San Diego, CA, USA). Each library was 2 x 150 bp paired-end sequenced using an Illumina X-ten platform. Low-quality bases and paired reads with Illumina adaptor sequences were removed using Trimmomatic (Bolger et al. 2014). Quality control for each library data set was carried out with FastQC (Brown et al. 2017). Finally, between 36.52 Gb and 51.05 Gb of clean data from each accession was generated and used for genome size estimation (Table S1) [...] were also removed for each assembled contig. In addition, we gave contiguous contigs a higher priority than ones with gaps to be retained in each assembly. After manual checking, editing, and redundancy removal, the number of contigs in each assembly ranged from 26 (NATEL BORO::IRGC 34749-1) to 588 (LIU XU::IRGC 109232-1) (Table S3).

Step 3: The sequence quality of each contig was then improved by "sequence polishing": twice with PacBio long reads and once with Illumina short reads. Briefly, PacBio subreads were aligned to GPM edited contigs using the software blasr [...] genome assemblies was 18, with 8 assemblies containing fewer than 10 gaps (Table 3).

Step 5: To independently validate our assemblies, we generated and compared Bionano optical maps to each assembly. In total, 17 (Azucena) to 56 (LIU XU::IRGC 109232-1) Bionano optical maps were constructed for all 12 rice accessions, which yielded contig [...] chromosomes and/or chromosome arms of all 12 de novo assemblies were highly supported by these ultra-long optical maps. Although rare, a few discrepancies between the optical maps and genome assemblies can be seen and are likely due to small errors and chimeras that can be produced through both the optical mapping and sequence assembly pipelines (Udall and Dawe 2018).

Following these five steps, we were able to produce 12 near-gap-free [...] published maize genome assemblies (Figure S5) and therefore concluded that 13 of the 16 "conserved" genes in the BUSCO database are not present in cereals and should be excluded from our gene space analysis. Taking this into account, we recalculated the BUSCO gene space content for each of the 12 assemblies and found that 10 of 12 assemblies captured more than 98% of the BUSCO gene set (Table 3).
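The BUSCO recalculation described above can be sketched as follows, on the assumption that it amounts to dropping the excluded genes from the gene set before recomputing the percentage of genes found. The function name, the per-gene status dictionary and the status labels ("Complete", "Duplicated", "Fragmented", "Missing") are assumptions for illustration, and the gene identifiers in any usage are hypothetical; this is not the authors' code.

```python
from typing import Dict, Set

# Status labels assumed to count as "captured" in the gene space.
FOUND_STATUSES = ("Complete", "Duplicated")


def busco_completeness(status: Dict[str, str], excluded: Set[str]) -> float:
    """Percentage of retained BUSCO genes that were found in an assembly.

    status   -- maps each BUSCO gene ID to its status in the assembly
    excluded -- gene IDs to drop before computing the percentage
    """
    kept = {gene: s for gene, s in status.items() if gene not in excluded}
    if not kept:
        raise ValueError("no BUSCO genes left after exclusion")
    found = sum(1 for s in kept.values() if s in FOUND_STATUSES)
    return 100.0 * found / len(kept)
```

Because genes judged absent from cereals would otherwise be scored as missing in every assembly, removing them shrinks the denominator and raises the reported completeness without changing which genes were actually found.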

Transposable element (TE) prediction
To determine the pan-transposable element content of cultivated Asian rice we analyzed [...] Table 3. Bionano optical maps were generated and used to validate all 12 genome assemblies.

This paper is the first release of 12 PSRefSeqs, optical maps and all associated raw data for the accessions listed in Table 3.

Code Availability
The population re-analysis of the 3K-RG dataset and the 12 genome assemblies were obtained using several publicly available software packages. To allow researchers to precisely repeat any steps, the settings and the parameters used are provided below: