Novel sequences, structural variations and gene presence variations of Asian cultivated rice

Genomic diversity within a species genome is the genetic basis of its phenotypic diversity essential for its adaptation to environments. The big picture of the total genetic diversity within Asian cultivated rice has been uncovered since the sequencing of 3,000 rice genomes, including the SNP data publicly available in the SNP-Seek database. Here we report other aspects of the genetic diversity, including rice sequences assembled from over 3,000 accessions but absent in the Nipponbare reference genome, structural variations (SVs) and gene presence/absence variations (PAVs) in 453 accessions with sequencing depth over 20x. Using either SVs or gene PAVs, we were able to reconstruct the population structure of O. sativa, which was consistent with previous result based on SNPs. Moreover, we demonstrated the usefulness of the new data sets by successfully detecting the strong association of the “Green Revolution gene”, sd1, with plant height. Our data provide a more comprehensive view of the genetic diversity within rice, as well as additional genomic resources for research in rice breeding and plant biology.


Background & Summary
With the continued reduction of croplands, we are facing a great challenge of feeding the fast growing world population. Asian cultivated rice (Oryza sativa L.) is grown worldwide and is the staple food for half the world population. The rich resources of the global rice germplasms are expected to provide a sustainable solution in future rice improvement. However, the total genomic diversity within the primary gene pool of O. sativa remains poorly explored. In order to reveal the comprehensive genomic diversity, 'The 3,000 rice genomes project (3k RG)' 1,2 was initiated, in which 3,024 accessions representing the most diversity of the cultivated rice population were selected from 89 countries. The 3,024 rice genomes were sequenced using HiSeq 2000 platform with 500 bp DNA fragment libraries and a total of 17 trillion bases were generated. The sequencing data of the 3,024 rice accessions were freely available at GigaDB 1 (Data Citation 1), NCBI SRA (Data Citation 2), Amazon cloud (https://aws.amazon.com/cn/public-data-sets/ 3000-rice-genome/) and Aliyun cloud (http://ricecloud.org/). We have previously reported the single nucleotide polymorphisms (SNP) data derived from these sequencing data 3 , which is very useful for QTL mapping by genome-wide association studies.
Structural variations (SVs) and gene presence/absence variations (PAVs) represent additional dimensions of the total genetic diversity within a species and remain largely unknown in almost all eukaryotes. In this descriptor, we reported the SV data and gene PAV data of O. sativa, together with the novel sequences absent in the widely used Nipponbare reference genome IRGSP-1.0, as key results of the in-depth analyses of the sequencing data of the 3k RG 4 .
In this data descriptor, we present the rice SV data obtained by calling against the Nipponbare reference genome using novoBreak 5 because it had the lowest false positive rate when compared with results from several tools such as BreakDancer 6 and Delly 7 . This SV data set contains a total number of 93,683 SVs (deletions, inversions and duplications of >100 bps and o1 Mbps, and translocations), or an average of 12,178 SVs per genome detected in the 453 high-quality genomes with sequencing depths ≥20 × and mapping depths ≥15 × . We also present the PAV data sets of 48,098 full-length proteincoding genes (35,633 Nipponbare reference genes and 12,465 novel genes) and 23,876 gene families in the 453 rice accessions, which were obtained from a "map-to-pan" pipeline 8 . These data not only provide a more comprehensive understanding of the genomic diversity within O. sativa, but also provide additional genomic markers for genome-wide association studies of rice.
The sequencing depths of the remaining 3,010 genomes ranged from 4.2x to 63.8x. Clean reads of each accession were mapped to the Nipponbare reference genome (IRGSP-1.0) using "bwa mem" with default parameters 9 . The mapping depth was calculated as total mapped bases divided by the size of IRGSP-1.0 genome. Only the 453 rice accessions with sequencing depth ≥20 and mapping rate ≥15 were selected for reporting final SVs and PAVs.

Structural variations
novoBreak 5 (https://sourceforge.net/projects/novobreak/?source = navbar) was used for SV calling against the Nipponbare genome. The comparison of performances of three software, novoBreak 5 , BreakDancer 6 and Delly 7 are available in our related work 4 . novoBreak was first run on each of the 3,010 accessions independently with default parameters. We detected deletions, inversions and duplications with sizes between 100 bps and 1Mbps, and translocations (interchromosomal breakpoints). SVs which could be inferred by no less than 3 kmer/reads were further selected with the following filter conditions: 1) more than 4 supporting split reads; or 2) no less than 3 discordant read pairs for both breakpoint. To identify SVs in different samples, all SVs passed the filter condition in the 453 accessions were pooled together to justify coordinated variants. Any two adjacent SVs were identified as a single SV if their start and end positions varied no more than 1 kb, and their overlapping region is over half of the total size. The presence/absence matrix of SVs in each accession was built with these criteria. To minimize the false positive rates, SVs detected in less than 6 accessions or more than 80% of the 3,010 accessions were removed. The SVs in the 453 high-depth accessions were then extracted as a more confident subset (File 1, Data Citation 3).

Novel sequences
Novel sequences absent in the reference genome were derived from the de novo assemblies of the 3,010 rice accessions. To achieve high-quality assemblies, we utilized a method with iterative use of SOAPdenovo r240 10 to select the best Kmer parameter for each accession. In detail, we first determined an initial Kmer (K init ) for rice accession R based on a linear model K init = ROUND(a*Dep R +b), where Dep R is the sequencing depth of R, and a and b are the parameters of the linear model. This linear model was trained on 50 randomly selected rice accessions and the initial Kmer was set as 'K = 2*int (0.38* (sequencing depth) +10)+1' (R2 = 0.68). After K init was determined, we ran SOAPdenovo 3 times with stepped Kmers (Kmer low = Kmer init -2, Kmer mid = Kmer init and Kmer high = Kmer init +2) using command line "SOAPdenovo-63mer(or SOAPdenovo-127mer) all -s configure_file (average insertion length set as 460 in the configure file) -o output_directory -K Kmer -R -F -p 8"; Gapcloser tool within SOAPdenovo package was then used on the SOAPdenovo raw scaffolds with command "GapCloser -b configure_file (the same as the configure file used in running SOAPdenovo) -a soap_raw.scaf -o output_gc.scaf -t 8"; Gap closed scaffolds were then broken down to contigs by cutting scaffolds at sites with >10 consecutive Ns. The N50 values of the gap-closed contigs were calculated for Kmer low , Kmer mid and Kmer high respectively. If N50 of Kmer mid was the highest, assembly at Kmer mid was selected as the final result. If N50 of Kmer high was the highest, an additional run with Kmer = Kmer high +2 was needed and we compared N50 of Kmer mid , Kmer high and Kmer high +2; this iteration continued until N50 of the median Kmer became the highest. If N50 of Kmer low was the highest, Kmer went down and similar iteration was carried out to determine the best Kmer. In order to control the false positives, though GapCloser was involved in the assembly pipeline and provided larger N50, only raw contig assemblies from SOAPdenovo were used for detection of novel sequences.
Assemblies were assessed by comparing to the Nipponbare reference genome using QUAST v2.3 11 , and those unaligned contigs were then selected. All the unaligned contigs with length >500 bp from each rice accession were merged into a single sequence set. Next, CD-HIT v4.6.123 12 was used to remove redundant sequences at an identity cut-off of 90% with command "cd-hit-est -i input.fa -o output.fa -c 0.9 -T 16 -M 50000". This process was carried out for 3 times to ensure redundancies were removed. Next, various contaminants including archaea, bacteria, virus, fungi and metazoa were removed. In practice, the non-redundant sequences were aligned to the NT database (07/26/2014) with NCBI-blast v2.2.28+ with command "blastn -db nt.fa -task megablast -query input.fa -num_threads 16evalue 1e-5 -out o -outfmt \"6 qseqid sseqid qlen length qstart qend sstart send pident evalue\" -best_hit_overhang 0.25 -perc_identity 0.5 -max_target_seqs 10". Contigs whose best alignments (E-value as 1 st keyword and identity as 2 nd keyword) were not in Viridiplantae (annotation information downloaded from NCBI taxonomy database) were considered as contaminants and were filtered out. The remaining contigs formed the non-redundant novel sequence dataset (identity o0.9 in comparison to IRGSP genome and identityo 0.9 among the novel sequences themselves). In order to be browsed in a genome browser, the novel contigs were concatenated with 100 consecutive Ns as the delimiters. The reference sequences and the novel sequences were combined to form the rice pan-genome sequences (File 2, Data Citation 3).

Gene presence/absence variations
Both the Nipponbare reference genes and non-Nipponbare novel genes were included in the PAV analyses. The gene annotation of IRGSP-1.0 genome was downloaded from the Rice Annotation Project (RAP) 13 . In order to be consistent with gene annotation for non-IRGSP sequences, also to simplify the subsequent analyses, only those transcripts with the longest open reading frame (ORF) were selected as representatives. Protein-coding genes on the novel sequences were predicted using MAKER2 14 (a gene prediction pipeline combining ab initio predictions, expression evidence and protein homologies). In detail, low-complexity repeats were first masked. Two ab initio predictors, snap 15 and AUGUSTUS 16 , were utilized to predict gene models with their default parameters for rice. All rice ESTs were downloaded from Genbank (12/15/2014) and were aligned to the novel sequences with NCBI-blastn. All rice proteins were downloaded from NCBI (12/15/2014) and were aligned to the novel sequences with NCBI-blastx. To get more informative alignments, Exonerate 17 was used to realign each sequence identified by blast around splice sites. EVidenceModeler 18 was used to combine and refine the ab initio predictions with RNA and protein evidences. Incomplete gene models were further removed. Since we used a method based on representative sequences to build the novel sequences, there might be local similarities between novel sequences and the reference genome, as well as among novel sequences themselves. Therefore, there might also be genes predicted in the novel sequences similar to genes in the reference genome and also novel genes similar to each other. In order to remove these genes, we clustered all genes at global protein identity of 95%, and remove those novel genes who are not the representative of the group. The nonredundant novel genes together with the reference genes formed the rice pan-genome (File 3, Data Citation 3).
We utilized a "map-to-pan" strategy 8 to determine gene presence/absence based on the observation that more than 98% of the genome can be covered by short read mapping at sequencing depth ≥20. The presence/absence of each gene was determined by the read coverage of the gene body and its coding region. In practice, we first mapped reads of each accession to the pan-genome sequences with "bwa mem" and a gene was considered as present if it had 1) coding coverage >0.95 (covered bases in the ORF divided by the ORF length) and 2) gene body coverage >0.85 (the number of covered bases in the gene body divided by the gene length). The presence/absence information of each gene from the 453 high sequencing-depth accessions were then merged together as a 0/1 matrix (File 4, Data Citation 3).

Xian/Indica
Geng/Japonica circum-Aus circum-Basmati/Sadri admixed a b Gene family presence/absence variations Similar to a recent pan-genome study on soybean 19 , the genes were clustered to gene families with OrthoMCL v2.0.9 20 at protein similarity cut-off of 0.5. In detail, all coding sequences of genes on the pangenome sequences were first extracted and then translated into protein sequences. All protein sequences were compared by using all-by-all blastp (Evalue set as 1e-5). OrthoMCL was used to process the blastp output and cluster genes to gene families (File 5, Data Citation 3). Gene family presence/absence (File 6, Data Citation 3) was further determined with a straightforward rule based on gene presence/absence. If at least one gene of a gene family presented in the rice accession, this gene family was considered to be present in this rice accession.

Code availability
All codes for detection of the novel sequences and determination of gene PAVs were implemented in EUPAN package 8 available at http://cgm.sjtu.edu.cn/eupan/. Software and their used versions were described as above.

Data Records
The raw data of 'The 3,000 rice genome project' (3k RG) were deposited in GigaDB

Technical Validation
Based on the SNP data, the 3,010 rice accessions were clearly divided into Xian-Indica, Geng-Japonica, circum-Aus and circum-Basmati groups, together with some admixed accessions 4 . Here we reconstructed the phylogenetic relationship of the 453 rice accessions based on SVs and gene PAVs with "pars" tool of  the phylip package (http://evolution.genetics.washington.edu/phylip.html). As a result, both the SVs and the gene PAVs showed consistent phylogenies ( Fig. 2a and b), suggesting the validity of these data. In order to demonstrate the usefulness of these data, we carried out genome-wide association studies of plant height, an agronomically important phenotype, based on both the SVs and the gene PAVs. For 323 out of the 453 rice accessions, the plant height data were collected. SVs including deletions, duplications and inversions were utilized for the analysis. Efficient mixed model analyses (EMMA) by EMMAX software 21 were used for the association studies. In detail, BN kinship matrixes were calculated to measure the genetic similarity between accessions based on the genome-wide SV data and gene PAV data, respectively. As a result, GWAS based on both SVs (Fig. 3a) and gene PAVs (Fig. 3b) successfully detected the sd1 locus, the well-known "Green Revolution gene", whose loss of function is associated with the greatly reduced plant heights 22 . Moreover, we also detected several other candidate variations significantly associated with plant heights. This paper released two important types of variations beyond SNPs for rice. This is also the first release of gene presence/absence variations among hundreds of individuals for a higher eukaryote. We demonstrated the potential of these variations to explain phenotype variations.