Eight soybean reference genome resources from varying latitudes and agronomic traits

Comparative analysis of multiple reference genomes representing diverse genetic backgrounds is critical for understanding the role of key alleles important in domestication and genetic breeding of important crops such as soybean. To enrich the genetic resources for soybean, we describe the generation, technical assessment, and preliminary genomic variation analysis of eight de novo reference-grade soybean genome assemblies from wild and cultivated accessions. These resources represent soybeans cultured at different latitudes and exhibiting different agronomical traits. Of these eight soybeans, five are from new accessions that have not been sequenced before. We demonstrate the usage of these genomes to identify small and large genomic variations affecting known genes as well as screening for genic PAV regions for identifying candidates for further functional studies.


Background & Summary
Soybean is an important crop that is responsible for about 50% of the world's oilseed production (www.fao.org), and a source of high-quality protein for animal feeds. Cultivation of soybean has experienced specific selections for the last 4,000 years, yielding over 45,000 Glycine max (G. max) accessions 1 . The construction and subsequent targeted improvement of the reference genome assembly for G. max Williams 82 (W82), has greatly enhanced genome contiguity and has in turn promoted research on soybean 2,3 . However, recent genomic advances in many plant and animal species have shown that a single reference genome is insufficient to capture and represent the variations that exist within the population of the corresponding species [4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20] . Such is also true in soybeans. Resequencing and de novo assembly of wild and cultivated soybean accessions identified hundreds of genic presence-absence variations (PAVs) or SVs affecting agronomic genes compared to W82 genome 7,[21][22][23][24][25][26] . The recent construction of the soybean pan-genome also discovered over 120 K non-redundant SVs 27 . Thus, we endeavor to enhance the genomic resources available for furthering soybean functional research and breeding by constructing reference-grade genome assemblies from cultivars found in different latitudes and exhibiting different agronomic traits. In this data descriptor, we report the sequencing, genome assembly, annotation, and genomic variation resources generated for the assembly of 8 reference-grade soybean genomes.

Methods
Sample selection, collection, and nucleic acid extraction. Eight soybean accessions were selected for this project (Table 1). Aside from IGA1003, which is a wild soybean, the remaining seven were cultivated soybeans. These eight soybeans showed different phenotypic traits including flowering color, pubescence color, maturity time, seed shape, and hundred-grain weight (Fig. 1a, Supplementary Table 1). The eight soybeans were typically grown in different latitudes with IGA1007 and IGA1005 typically grown in southern China (around 22°N) and Huanghuai region (around 31°N), China, respectively (Fig. 1b). In terms of traits, we have included a high salinity tolerant accession (IGA1001) and a high yielding accession (IGA1004). We do note that IGA1003, IGA1005, and IGA1008 were sequenced and assembled previously. IGA1003 was sequenced and assembled using only Illumina sequencing data 7 . IGA1005 28 and IGA1008 3 had published genome assemblies but we have included these materials allowing for assessing genomic variations within a particular accession.
Seeds from each accession were grown in the greenhouse at Northeast Institute of Geography and Agroecology. Five seeds were planted with 14 hours of light and temperature controlled at 28 °C during the day time and 20 °C during the night time. Soil composition was 8:5:3 ratio of peat:vermiculite:perlite mixed with 5 g of phosphate. Plants were grown until the V2 growth phase and the leaves were harvested from the top.
Hi-C sample preparation, library construction and sequencing. About 1.5 g of young leaves were used for Hi-C library construction as described in previous reports with some modifications 53 . Briefly, leaf samples were cross-linked with 3% formaldehyde for 45 minutes in vacuum at 4 °C and stopped using 0.4 M glycine. Leaf pellets were then pulverized in liquid nitrogen followed by resuspension in the nuclei isolation buffer (NIB). The cross-linked nuclei were treated with 0.3% SDS and neutralized with 3% Triton X-100. The resulting DNA was digested with MboI (NEB) overnight at 37 °C and the reaction was stopped with heat inactivation at 65 °C. Restriction fragment ends were fixed with Klenow and labeled with biotinylated cytosine nucleotides using biotin-14-dCTP (TriLINK). Blunt-end ligation was carried out using T4 DNA ligase incubated at 16 °C overnight. After ligation, the cross-linking was reversed by proteinase K (Thermo) overnight at 65 °C. DNA purification was performed using DNeasy Plant Mini Kit (Qiagen) according to manufacturer's instructions. Purified DNA was sheared to a length of ~400 bp using Covaris M220 (Covaris). Hi-C ligated junctions were captured by Dynabeads MyOne Streptavidin C1 (Thermofisher) according to manufacturer's instructions. The Hi-C sequencing library was prepared using NEBNext Ultra II DNA library Prep Kit for Illumina (NEB) following manufacturer's instructions. Fragments between 400 and 600 bp were sequenced on the Illumina platform with PE150 format to generate 46-57 G of data ( De novo genome assembly. De novo assembly was performed with PacBio sequencing data using FALCON 62 setting length_cutoff = 3000-13000 and length_cutoff_pr = 3000 and CANU 63 setting correctedEr-rorRate = 0.039-0.04. We further improved the assembly by merging complementing contigs between FALCON and CANU using CANU assembly as the basis 64 . The final contigs were assembled into chromosomes with Hi-C data using LACHESIS v.c23474f 65 . The assembly was corrected with PacBio long reads using Arrow in SMRTLink 5.0 66 and Illumina short reads using Pilon v1.22 67   www.nature.com/scientificdata www.nature.com/scientificdata/ performed utilizing ab initio-, homology-, RNA-sequencing-, and Iso-seq-based methods. Augustus v.3.3 80 and Glimmer v.3.0.4 81 were used for ab initio gene prediction. For homology-based annotation, protein sequences from Glycine max Williams 82, Glycine soja, Arabidopsis thaliana, Arachis duranensis, and Cajanus cajan were obtained from NCBI and aligned to each of the 8 genomes using TBLASTN 82 . Exonerate was used to build gene structure based on the Blast results. For RNA-seq based gene prediction, short reads were mapped to their respective genome using TopHat v2.1.1 83 and the gene structure was predicted using Cufflinks v2.2.1 84 . For Iso-seq based gene prediction, reads were mapped to IGA1008 assembly using GMAP v2016-09-14 85 and TransDecoder v4.1.0 was used to filter for high quality gene models. Lastly, a consensus gene set was generated by integrating gene annotations from each source using MAKER 86 . With this process, we identified 57,286 to 58,392 protein coding genes (Table 4).

BUSCO evaluation of genome and annotations. BUSCO (Benchmarking Universal Single-Copy
Orthologs) evaluation was performed on the genome assembly and the gene annotations using BUSCO v.3.0.2 89 with embryophyta_odb9 data set.
Variation detection. Whole genome alignment based variation detection was performed by aligning each genome to IGA1008 using MUMer4 90 with parameters-mum -l 40 -c 90. The MUMer alignment was filtered using delta-filter with parameters --1. SNVs, small Indels were called using show-snps with parameters -rlTHC. Large SVs were called using custom scripts MumSV (https://github.com/jeff-sc-chu/MumSV) based on output from show-coords with parameters -THrcl. Alignment gaps associated with assembly gaps were filtered away. Presence-absence variations (PAVs) were assessed based on large insertions and deletions (>100 bp) using BLASTN against the reference genome. A deletion or insertion is considered a PAV if BLASTN alignment to the reference genome is <50% query coverage or <90% PID. Accession specific SNP, Indel, and SVs were identified by selecting variants not overlapping any variants of the same class in all other genomes. Using the above process, we identified 1.86-4.27 million SNVs, 0.44-0.92 million small indels, 11.75-25.33 thousand large indels (>100 bp), 706-3006 translocation events, and 200-413 inversion events (Table 5).   www.nature.com/scientificdata www.nature.com/scientificdata/

Title Description
Annotated peptide sequence for each soybean genome Peptide sequence of each annotated coding gene for each genome Annotated CDS sequences for each soybean genome Coding sequences of each annotated gene for each genome

Gene annotations (GFF) Gene annotation of each genome in GFF format
SNVs for each genome compared to IGA1008 SNV location detected by whole genome MUMer alignment of each genome using IGA1008 as a reference.
Indels between soybean genome assemblies and IGA1008 Indel location detected by whole genome MUMer alignment of each genome using IGA1008 as a reference.
PAVs of between soybean genomes compared with IGA1008 PAV location detected by whole genome MUMer alignment of each genome using IGA1008 as a reference.
Gene orthology correspondence Gene orthology correspondence between IGA1008 and Williams 82 v4 genome.  www.nature.com/scientificdata www.nature.com/scientificdata/ Technical Validation Genome assembly quality assessment. The final assemblies of the 8 genomes ranged between 986.1 Mb and 1001.3 Mb with more than 98.7% of the genome anchored to 20 chromosomes (Table 3). The genome assembly size is similar to its respective k-mer estimation (Table 3) and also comparable to soybean genomes, such as W82_v4 3 , ZH13 28 , and W05 26 , which were assembled using similar sequencing technologies. The contig N50 were between 1.4 Mb-6.1 Mb and the scaffold N50 were between 48.8 Mb-50.7 Mb (Table 3) [68][69][70][71][72][73][74][75] . The number of gaps ranged between 830 and 2922 (Table 3). Each unspanned gap in the assembly was arbitrarily set with 500 bp of Ns. As a comparison, the current soybean reference Williams 82 version 4 genome has a genome size of 978 Mb with contig N50 of 0.41 Mb, scaffold N50 of 20.44 Mb, and 8920 gaps 3 .
We further assessed the quality of our genome assemblies in a number of ways. First, the presence of centrimeric repeats CentGm-1/2 were found in all 20 chromosomes for all genome assemblies. We assessed the completion and accuracy of the assemblies using BUSCO 89 and re-mapping of Illumina short read data. We observed an average of 96.98% complete BUSCO alignment and an average 99.75% remapping rate. Next, a high accuracy genome assembly would expect a very low level of homozygous SNVs from the remapping analysis and as expected we see an average of 0.00037% homozygous SNVs and 0.0073% heterozygous SNVs indicating the low error rates in these assemblies (Table 3).
Genome comparison with soybean reference genome Williams 82. IGA1008 is a soybean cultivar derived from Williams 82. Between IGA1008 and the W82_v4 assemblies, we identified 0.38 million SNVs, 0.14 million small indels, 3,203 large indels (>100 bp), 255 translocation-like and 135 inversion-like events. These events do not border or cross assembly gaps and may represent genomic variations between the two W82 lines.
Gene set completeness assessment. The numbers of protein-coding genes annotated for these cultivars were highly similar, ranging between 57,286 and 58,392 genes. We used BUSCO to evaluate the completion of our gene annotation and found that over 96% of the 1440 genes were found completely (Table 4).
Structural variation assessment. IGV 92 was used to visually inspect many structural variations. We also examined known structural variations for their presence in the 8 genomes and our data confirmed the 40Kb Williams 82-specific insertion on Chromosome 15 ( Supplementary Fig. 1) 7,26 , the I-locus inversion event ( Supplementary Fig. 2) 26 and the large deletion (~15Kb) affecting the E3 gene ( Supplementary Fig. 3) 27 .

Usage Notes
identifying variations in known genes. The resources generated in this dataset allows one to search for genomic variations in genes of interest. For instance, we were able to identify nonsense mutations in the E2 gene as well as frameshift indels in J and FT1b (Supplementary Fig. 3). These resources, such as the indel found in FT1b, has not been characterized before and provide candidate alleles for further soybean research. PAV gene screening. As more soybean genomic resources become available, one can take a pan-genome approach to identify genomic variations that are unique or shared between different soybean accessions. As a demonstration, we included 5 additional soybean genomes (ZH13 28 , W05 26 , Lee 3 , PI483463 3 , W82 3 ) to identify genomic variations that are shared in a subset of the genomes (Fig. 2). These 5 genomes were chosen for their genome assembly quality, which were improved using sequencing technologies with longer reads. We identified 60 genes that were found in 3 wild soybeans but missing or truncated in the other 10 cultivated soybeans and 185 genes found missing or truncated in all 3 wild soybeans (Supplementary Table 3). Analyses such as this could generate additional resources to further soybean research and crop improvement.

Code availability
The versions and parameters of published software used in this study were described in the Methods. MumSV is a set of custom scripts to call large SVs based on MUMer alignments and it can be accessed at https://github.com/ jeff-sc-chu/MumSV.