The complex polyploid genome architecture of sugarcane

Sugarcane, the world’s most harvested crop by tonnage, has shaped global history, trade and geopolitics, and is currently responsible for 80% of sugar production worldwide1. While traditional sugarcane breeding methods have effectively generated cultivars adapted to new environments and pathogens, sugar yield improvements have recently plateaued2. The cessation of yield gains may be due to limited genetic diversity within breeding populations, long breeding cycles and the complexity of its genome, the latter preventing breeders from taking advantage of the recent explosion of whole-genome sequencing that has benefited many other crops. Thus, modern sugarcane hybrids are the last remaining major crop without a reference-quality genome. Here we take a major step towards advancing sugarcane biotechnology by generating a polyploid reference genome for R570, a typical modern cultivar derived from interspecific hybridization between the domesticated species (Saccharum officinarum) and the wild species (Saccharum spontaneum). In contrast to the existing single haplotype (‘monoploid’) representation of R570, our 8.7 billion base assembly contains a complete representation of unique DNA sequences across the approximately 12 chromosome copies in this polyploid genome. Using this highly contiguous genome assembly, we filled a previously unsized gap within an R570 physical genetic map to describe the likely causal genes underlying the single-copy Bru1 brown rust resistance locus. This polyploid genome assembly with fine-grain descriptions of genome architecture and molecular targets for biotechnology will help accelerate molecular and transgenic breeding and adaptation of sugarcane to future environmental conditions.


SUPPLEMENTAL FIGURES
Supplemental Figure 1 Sliding window genetic marker extraction example Page 3 Supplemental Figure 2 Transcript coverage plot for single flow-sorted chromosome Page 5 libraries Supplemental Figure 3 Simplex marker barplots for scaffold clustering Page 6 Supplemental Figure 4 Simplex marker chromosome clustering and initial Page 8 chromosome construction

Supplementary Data
The biology and pedigree of R570 poses significant technical challenges during genome assembly, the largest being the task of constructing biologically representative chromosomes for an interspecific hybrid crop with variable ploidy, heterozygosity and redundancy introduced through backcrossing and 2n+n chromosome transmission.Equally challenging was accurately representing a partially redundant genome with varying numbers of chromosomes that each do not always pair with the same homeologous copy (polysomic inheritance).While chromosome construction for the majority of plant genomes can be completed using an optical map or Hi-C scaffolding, the unique biology and breeding of R570 (discussed in the main manuscript) requires additional genomic resources (eg.simplex markers, single flow sorted chromosome libraries, Hi-C, optical map, Sorghum gene synteny, progenitor specific kmers) in order to separate, phase and construct homeologous chromosomes.Below is an outline of each genomic resource constructed and its function/rationale in the genome assembly process.Our intent with this document is to showcase the strengths and weaknesses of each genome resource/technology so that other researchers can avoid pitfalls and the assumption that a single bioinformatics pipeline/strategy will work for other highly complex genomes.The overall assembly pipeline required the generation and inspection of thousands of plots (each of which was re-generated when new joins/breaks/rearrangements were investigated) used to order and orient contigs into homeologous chromosomes.To help illustrate how the assembly was completed, scripts, commands and representative figures are provided to help guide others through each step of the process.

Supplemental Table 1-R570 Genome Resource
Overview.Listed below are each of the genomic sequencing technologies and techniques used to construct chromosomes for sugarcane R570, along with how they were generated and what their function when scaffolding contigs into chromosomes.Please note that this table includes only resources used for assembling the genome.Other resources (ie.RNAseq, IsoSeq) and analyses (eg.HiFi reads-inference of haplotype collapse) are described in the main text.

Simplex markers and the genetic map
The selfed offspring population (S1) of R570 represents a unique resource for contig phasing and chromosome construction.Segregating populations like S1, which are often used for quantitative trait locus mapping, can also enable phased chromosome construction in complex polyploids.Simplex, or single dose, markers are regions of the genome where a single variant distinguishes one locus among all homeologous chromosomes (octoploid genotype example: CTTTTTTT).Simplex markers, with dominant inheritance, segregate at a 3:1 ratio in selfed progeny regardless of ploidy number or chromosome pairing 1 , and therefore control for the potential confounding effects of similar sequences among homologous but non-recombining chromosomes.This is ideal for R570 considering its high ploidy and differential chromosome pairing affinity during meiosis.
To generate simplex markers for chromosome construction and phasing, we first constructed a genome assembly from Illumina-only data.Approximately 80X per haplotype genome coverage (~1.9 terabases) was assembled using HipMer 2 , (parameters: -k 101), generating 1,735,266 contigs (Total size: 5.03 Gb; Scaffold N50: 56.1 Kb).To generate genetic markers with anchored physical locations across the genome, the Illumina read libraries were aligned back to the genome assembly using BWA-MEM (version 0.7.5) 3 .Using custom Python scripts (bedFileFromFai.py;extractMarkersFromBam.py),genetic markers were extracted from the resulting bam alignment file by sliding a 80bp non-overlapping window across each alignment position and collapsing and counting all 80bp kmer present from reads which aligned with a primary mapping quality of 50 or greater (Supplementary Figure 1).This approach generated 55,875,929 putative genetic markers across 38,717,727 windows (hereby referred to as the Full Marker Set; or FMS)., followed by all possible kmers from Illumina reads that aligned fully across the window with mapping qualities greater than or equal to 50.If no reads had a mapping quality greater than 50 for a given window, then only the reference sequence was printed.Additionally, if there were no unique alignments across a given window, then no marker or reference sequence was generated (NO ALIGNMENTS).
To determine which markers were simplex and could be used to generate a genetic map and phase contigs, the FMS were genotyped in 96 selfed (S1) offspring from R570 to identify markers demonstrating the expected 3:1 segregation pattern of simplex markers in an S1 population.Each FMS marker was matched and counted within the re-sequenced population of S1 offspring, which were sequenced to a depth of 15X.As spurious matches can occur simply due to sequencing error, marker genotype counts greater than one but less than or equal to three were labeled as NA.The remaining markers were coded as zero for absent or one as present.Each marker's segregation pattern was tested using a binomial test (p < 0.1), finding 1,825,092 simplex markers.
To ensure a complete marker set for generating a genetic map, missing data within the simplex marker genotype matrix were imputed from 50 marker sliding windows, which reduced the amount of missing marker calls in the matrix from approximately 13.5% to 0.2%.Based on recombination frequency, the marker set was culled down to 10,732 non-identical (redundant) markers that could be used for genetic map construction.Linkage groups from the raw simplex markers were calculated in JoinMap v4.0 4 , retaining markers within the same linkage group that had a pairwise LOD score of ≥ 3 and a recombination fraction ≤ 0.05.These parameters produced 237 tightly-linked linkage groups that were unlikely to include spurious joins.Marker order was inferred within linkage groups via a four step pipeline: (1) missing data were k-nearest neighbor imputed using the median method in DMwR (v0.4.1) 5 ; (2) recombination fractions were estimated in R/qtl (v1.42-8) 6 ; (3) the resulting matrix was fed into tspMap 7 and markers were ordered using the concorde algorithm; (4) genetic map estimation was accomplished in R/qtl using the kosambi mapping function, an error probability of 0.01.With this scaffold of high-confidence marker orders, we then interpolated the positions of all simplex markers, which were then used to cluster contigs with shared linkage for chromosome construction.

Single flow-sorted single chromosome libraries
While the genetic map markers were the useful for determining phase in a complex, high ploidy system such as sugarcane, their presence and detection is skewed toward the more heterozygous, S. spontaneum portion of the genome (131 Mb of simplex markers; 45% derived from S. spontaneum portions of the genome; Fisher's Exact Test: 3.25x enrichment for simplex markers in S. spontaneum regions [p<0.0001]).To generate more genetic markers that distinguish among other homeologous chromosomes through non-simplex regions, flow cytometry was used to sort and sequence single R570 chromosomes (n=79; ~115X coverage) using methodology described in the main text.Each library corresponded to one physical chromosome 8,9 , and enabled coarse-grained clustering of contigs into homologous chromosome groups from unique marker placements.
To first distinguish which homologous chromosome was captured per sequencing library, reads were aligned to Sorghum bicolor (v3.1) 10 primary transcripts using BWA-MEM (version 0.7.5) 11 to calculate per transcript percent coverage (parameters: bedtools genomecov -bga -ibam) 12 .Coverage per transcript was calculated for each base with five or more reads aligned (intended to ignore differences in sequence divergence between R570 and S. bicolor and repeat sequences).Percent coverage per transcript was converted to a scatterplot and visualized (Supplementary Figure 2).This approach enabled quick determination of: 1) which homologous chromosome copy had been flow-sorted, captured and sequenced, 2) whether a recombinant chromosome was captured, or 3) if there was a problem with the library that would result in uninterpretable marker placements/patterns downstream.Each of the single flow-sorted sequenced chromosome libraries was visually inspected using the approach described above.Chromosome-specific markers for each homeolog were then generated by assembling each flow cytometry-sorted library into contigs using HipMer 2 (parameters: -k 101), then extracting 500bp markers within each assembly, spaced every 2000bp from contigs greater than 1 kb (script: extractUniqueMarkersFromFasta.py).This approach generated 500,092 chromosome specific markers from 79 Illumina libraries (SCL marker set).This marker set, along with the simplex markers were used to order and orient contigs into chromosomes (primary and secondary joins).

R570 Chromosome Construction Contig construction and scaffolding
The R570 contigs were assembled from approximately 38X coverage per haplotype of PacBio HiFi data using HiFiAsm (version 0.13-r308) with the resulting contigs polished with RACON (version 1.4.10) 3 .
Comparison to a previous R570 assembly (version 1.0-PacBio consensus long read; ~159X assembled with CANU 13 , polished with ARROW 14 ) showed a 20X contiguity improvement (484.3Kb and 10 Mb, respectively), so we proceeded with chromosome construction from HiFi contigs only.Initial scaffolding of the genome was completed using a Bionano Optical map generated by Corteva (methods described in main document text); however it became apparent that the identical regions of the genome, mainly contributed from S. officinarum, caused large segmental duplications in scaffold ordering, as negative gaps are allowed to occur.Additionally, with a large portion of the genome being identical among multiple homeologous chromosomes, an out-of-phase contig could easily be selected during optical map scaffolding.For this reason, the contig order within the optical scaffold agp file was used as an initial guide during chromosome construction but priority was given to alignment of genetic map and single sorted library markers (respectively), discussed below.
To cluster sequences derived from the same homeologous chromosome (primary joins), simplex markers were aligned to optical map scaffolds using parallel BLAT (pblat) (v2.5) 15 , retaining only single placement, perfect coverage alignments (parameters: -noHead -extendThroughN -minIdentity=100 -fastMap -minMatch=3 -tileSize=12 -minScore=80).As there was no straightforward way to ascertain the expected number of linkage groups that could be associated with contigs that vary in length, the most straightforward approach was to simply count the number of simplex markers (and their associated linkage group) that aligned to each scaffold, and barplot each for visual inspection.Sequences sharing the same linkage information were then manually clustered together to construct individual chromosomes.For example: optical scaffold_1000034 (Supplementary Figure 3) could only be ordered and oriented with other scaffolds with simplex markers associated with linkage groups 79, 201 and 4.
Supplementary Figure 3-Simplex marker barplots for sequence clustering.Simplex markers (with their genetic map associated linkage group [LG]) were aligned to each optical scaffold.Perfect alignments were counted and plotted to record which scaffolds were associated with each linkage group from the R570 genetic map.
To generate a putative join order to each chromosome contig cluster, S. bicolor (v3.1) primary peptide sequences were aligned to each cluster using pblat (version v2.5; parameters: -noHead -extendThroughN -tileSize=5 -minIdentity=60 -t=dnax -q=prot -minMatch=2).Contigs and scaffolds with shared linkage were then ordered and oriented relative to S. bicolor peptide order.Additionally, using this approach we were able to identify both falsely-joined regions and redundant sequences.We inspected large-scale overlapping contigs through a screen of gene alignment copy number relative to the S. bicolor outgroup.For example, take the case where two contigs ['A', 'B'] could be joined to generate a scaffold.When oriented relative to S.bicolor and joined, overlap between contigs A and B can be detected by aligning primary S. bicolor peptides and counting the number of high-quality alignments (greater than 85% identity and 80% coverage) that occur.To trim regions of overlap, alignment position and copy number was sorted relative to the shortest contig (in this example: contig A) used to make the join.Breaks were then made at the midpoint position where alignment copy number changed from one to two (example of sorted S. bicolor alignment copy number on contig A: 1111111:222222; : = breakpoint; example: Supplementary Figure 4).During the clustering process, there were also instances where S. bicolor alignment copy number found redundant sequences that were ignored when making chromosome joins.For example, consider two contigs ['X','Y'] could be joined to an existing scaffold.When contigs X and Y are considered together, all S. bicolor alignments counts on contig X are duplicated [example of sorted S. bicolor alignment copy number on contig X: 2222222], whereas only some are on contig Y [example of sorted S. bicolor alignment copy number on contig Y: 11111122222211111].In this instance, contig Y would be favored over contig X.All broken sequences and unused, redundant contigs were retained during these steps to ensure no potential coding regions were lost.We also considered situations where tandem gene duplicates were erroneously removed when determining breakpoints between two adjacent contigs based on alignment copy number, but these would be exceedingly rare.First, an ancestrally single-copy gene family would have had to be duplicated so that two copies exist in one R570 haplotype, while other R570 haplotypes and S. bicolor contain exactly one copy; as such, single-copy gene alignments would exist between some, but not all, R570 contigs and S. bicolor.This situation is already rare: 1.2% of all gene families are ≥2-copy in one R570 chromosome and 1-copy in all other haplotypes and S. bicolor).Second, the contig break must have occurred so that exactly one copy of the family is represented on each of two adjacent contigs.If such a rare event had occurred, NucFreq analysis 16 (described in the main manuscript) would have likely uncovered it unless the sequences were perfectly identical between the haplotypes.
Supplementary Figure 4-Simplex marker chromosome construction.Sorghum bicolor proteins are aligned to scaffolds that share simplex marker linkage groups to determine putative joins and breakpoint positions.Each scaffold is listed along with the linkage group with the majority of associated simplex markers.Alignment copy number is used to determine breakpoints between scaffolds, where alignment gene copy number is greater than one between joined, adjacent scaffolds.In this example, two scaffold breaks were made.Scaffold_1000311 (light blue) was not included in this constructed chromosome because its inclusion would only duplicate gene copy alignments (contig X example from previous paragraph).Baseline position for each peptide sequence was provided by relative order within the S.bicolor (v3.1) annotation.Secondary joins were made among linkage cluster joins by aligning SCL markers to all joined linkage clusters from the same homologous chromosome (eg.S. bicolor chromosome 2).Single flow-sorted markers enable joins among scaffolds and contigs where linkage group information was lacking, but could be linked by sequence alignment.Markers from each individual flow-sorted chromosome (script: extractUniqueMarkersFromFasta.py;modified slightly to retain non-unique sequences) were aligned using megablast 17 (parameters: -m 8 -p 95 -a 10 -W 100) to each joined linkage cluster and remaining scaffolds to search for additional joins based on visual inspection of high-density marker hits.Once additional joins were manually identified (Supplementary Figure 5), the sequences were ordered and oriented again based on S. bicolor protein alignments, as described above (Supplementary Figure 4).This process was completed in an iterative fashion, where joins were repeatedly made, broken and shuffled to generate the longest path of single copy S. bicolor protein alignments through each constructed chromosome.

Supplementary Figure 1 -
Example output of the sliding window marker extraction.Non-overlapping 80bp windows were extracted from the alignment bam file where Illumina reads (approximately 80X coverage) were aligned to HipMer scaffolds.The reference sequence per window was printed first (REF line)

Figure 2 -
Sorghum bicolor transcript coverage by single flow-sorted chromosome libraries from R570.Reads from individually flow-sorted and sequenced chromosomes were aligned to S.bicolor (v3.1) transcripts to calculate percent coverage to transcript base.The example in panel A) shows a single homologous chromosome for S.bicolor chromosome 2 (R570 chromosome 5).Panel B) shows a S. spontaneum chromosome (fusion between Sb08 and Sb09; R570-Chr6_9).