Effect of sequence depth and length in long-read assembly of the maize inbred NC358

Improvements in long-read data and scaffolding technologies have enabled rapid generation of reference-quality assemblies for complex genomes. Still, an assessment of critical sequence depth and read length is important for allocating limited resources. To this end, we have generated eight assemblies for the complex genome of the maize inbred line NC358 using PacBio datasets ranging from 20 to 75 × genomic depth and with N50 subread lengths of 11–21 kb. Assemblies with ≤30 × depth and N50 subread length of 11 kb are highly fragmented, with even low-copy genic regions showing degradation at 20 × depth. Distinct sequence-quality thresholds are observed for complete assembly of genes, transposable elements, and highly repetitive genomic features such as telomeres, heterochromatic knobs, and centromeres. In addition, we show high-quality optical maps can dramatically improve contiguity in even our most fragmented base assembly. This study provides a useful resource allocation reference to the community as long-read technologies continue to mature. Sequence depth and read length determine the quality of genome assembly. Here, the authors leverage a set of PacBio reads to develop guidelines for sequencing and assembly of complex plant genomes in order to allocate finite resources using maize as an example.

D uring the two decades following publication of the first larger eukaryotic genomes (i.e., Drosophila melanogaster 1 and Homo sapiens 2 ), considerable progress has been made in sequencing technology and assembly methods, improving our basic knowledge of genome complexity across the tree of life. For example, we now understand that genome composition (e.g., gene complement, the extent of intergenic space, and the landscape of transposable elements (TEs)) varies substantially at both the inter-and intra-specific levels.
Throughout the genomic era, continual improvements have been made to both the data underlying assemblies and assembly algorithms. The first reference genomes employed bac-by-bac approaches with long-read Sanger sequencing generated at great expense over several years 2,3 . Next-generation assemblies initially relied on short-read data due to cost and technological limitations. These data have often been used for assembly in a combination of paired-end and mate pair formats with local assembly of contigs relying on paired-end sequences, which are then scaffolded using mate pair information [4][5][6] . While these assemblies represented genes reasonably well, repetitive regions containing TEs and tandem repeats were either omitted or highly fragmented 7 . Newly developed long-read sequencing technologies now enable contiguous assembly of even the repetitive fraction of eukaryotic genomes 8 with, e.g., a complete telomere-to-telomere human X chromosome recently being assembled 9 . Highly contiguous long-read assemblies have now been published for a wide range of species [10][11][12][13] . Recent long-read assemblies in maize also show considerable improvement in both completeness and contiguity relative to previous efforts [13][14][15][16] , suggesting these data are particularly useful for plant species like maize with genomes that are large (2.3 Gb), complex (paleopolyploid and comprised primarily of TEs), and highly repetitive (extensive tandem sequence arrays in heterochromatic knobs and centromeres).
Assembly algorithms are also continually being benchmarked and improved to better utilize long-read data with substantial variability in performance of particular methods observed across species 10,[17][18][19] .
The cost of long-read sequence data can still be prohibitive for species with larger genomes, and the critical target for average read length and read depth remains unclear. A full assessment of the impacts of varying sequence read length and depth on the contiguity and completeness of assemblies in genic and repetitive regions is therefore essential for informed allocation of finite resources. Here we conduct a comprehensive assembly experiment using subsets of a high-depth, long-read (PacBio) dataset for the maize inbred line NC358 to evaluate critical inflection points of quality during the assembly of a complex, repeat-rich genome.

Results
Genome sequence subsampling and assembly statistics. We sequenced the NC358 genome to 75× depth (based on a~2. 27 Gb genome size 20 ) using the PacBio Sequel platform, which generated a subread N50 of 21.2 kb (Table 1, Supplementary  Table 1, and Supplementary Fig. 1). To identify an optimal assembly approach for this study, the complete data from NC358 were each assembled using Falcon 21 , Canu 22 , WTDBG2 23 , and hybrid approaches in which Falcon was used for error correction and Canu, Flye 24 , and Peregrine 25 were used for assembly (Supplementary Table 2). A subset of methods (Supplementary Table 2) were confirmed to be robust using data from the B73 v4 genome assembly (68× depth) 13 . All assembled contigs were superscaffolded with a de novo Bionano optical map ( Supplementary Fig. 2) and pseudomolecules were constructed based on maize GoldenGate genetic markers 26 and high-density maize pan-genome markers 27 (Methods). The Falcon-Canu hybrid assemblies of both the NC358 and B73 genomes showed consistently higher quality in terms of contig length, Bionano conflicts, Benchmarking Universal Single-Copy Orthologs (BUSCOs) 28 , and LTR Assembly Index (LAI) 8 (Supplementary Table 2); thus, this method was used for all subsequent assemblies performed on subsets of the data. NC358 subreads were downsampled from 75× to 60×, 50×, 40×, 30×, and 20×, while maintaining a 21 kb subread N50 and to 50× depth with a subread N50 of 11 and 16 kb. For these latter assemblies, we based read-length distributions on empirical datasets generated for the human HG002 29 and maize B73 v4 13 genome assemblies ( Supplementary Fig. 3). NC358 read subsets were error-corrected and assembled independently using the hybrid assembly approach described above (Methods and Supplementary Note 1). These processes were resourceintensive and were accelerated through cloud computing. The central processing unit (CPU) time required for both Falcon error correction and Canu assembly increased substantially as read depth increased, whereas the required maximum memory was fairly similar (Fig. 1h and Table 1).
Most assemblies had a total contig size covering >92% of the flow-cytometry estimated genome size of NC358 (2.27 Gb 20 ), with the notable exception of the 21k_20x assembly (70.4% covered; Table 1). Contig length metrics were positively correlated with both read length and sequence coverage (Fig. 1b), with the highest contig N50 (24.54 Mb) and the longest contig (79.68 Mb) observed in the 21k_75x and 21k_60x assembly, respectively (Table 1). A dramatic drop in contiguity was observed for both the lowest depth (21k_20×) and shortest sequence length (11k_50×) assemblies, where the number of contigs was 17×-32× more than the complete 21k_75x dataset (Table 1 and Fig. 1e).
For each assembly, superscaffolds were generated from the contigs using a common Bionano optical map. Even the most fragmented Falcon-Canu assembly could be scaffolded to high contiguity using this optical map due to the high density of labels in the map (Fig. 1a-c). The resulting assemblies all had scaffold N50s at~98 Mb (Table 1), highlighting the utility of a high-quality optical map when sequencing quality is suboptimal. In fact, chromosome 3 (~237 Mb) consisted of a single scaffold in five out of eight assemblies (Table 1). However, conflicts vs. the Bionano map were much higher in the assemblies with 20× coverage and a subread N50 of 11 kb (Table 1 and Fig. 1e), suggesting assembly error increased with lower coverage and shorter read length. Assemblies with shorter read length contained many more deletions relative to the optical map ( Fig. 1i; Supplementary Table 3), which may be due to the collapse of repetitive sequences. We did not observe a clear pattern between read length and deletion size (Fig. 1i). Assembly misjoins were reduced with both longer reads and higher coverage, as shown by the relative number of insertions (Fig. 1i and Supplementary Table 3).
For each of the assemblies, pseudomolecules were constructed using the GoldenGate and pan-genome genetic markers, which placed >99% of the total assembled bases into pseudomolecules (Supplementary Table 4 and Supplementary Fig. 4). The resulting NC358 pseudomolecules were highly syntenic across assemblies and to the B73 v4 genome ( Supplementary Fig. 5).
Evaluation of the gene assembly space. We evaluated the completeness of gene-rich regions in each of the assemblies using BUSCO 28 . The percentage of complete BUSCO genes increased from 68.0% to 96.3% from the 21k_20x to the 21k_75x assembly (Table 1, Fig. 1d, and Supplementary Table 5). Minimal improvement in BUSCO scores was achieved at depths higher than 30× (95.5% complete BUSCO genes), indicating this depth provides satisfactory gene-space assembly.
To further evaluate the assembly of genic regions, we annotated gene models in the 21k_20x and the 21k_75x assemblies (Methods) and obtained a total of 28,275 and 39,578 genes, respectively (Supplementary Table 6). More than 99% of the 21k_75x genes could be mapped to all but the 21k_20x assembly, to which only 90% were mapped (Supplementary Table 7). Exon and intron lengths of the annotated genes were similar across the assemblies (Supplementary Table 6). However, genes in the 21k_20x assembly contained 50 times more gaps than those in the 21k_75x assembly (Supplementary Table 7), suggesting higher coverage contributed to the contiguity of genic regions. Longer reads similarly improved the gene-space contiguity (Supplementary Table 7).
In addition, we sequenced RNA libraries from ten tissues with two biological replicates (Methods). On average, 80% of reads in these libraries could be uniquely mapped to the various NC358 assemblies (Fig. 1g). The 21k_20x assembly was a notable exception with only 63% of reads uniquely mapped ( Fig. 1g and Supplementary Fig. 6). We extracted the reads that did not map to the 21k_20x assembly and remapped them to the 21k_75x assembly, obtaining a unique mapping rate of 36% (Supplementary Table 8). These reads mapped to 3184 genes in the 21k_75x assembly (Supplementary Table 9). Of these genes, 20% are present in the 21k_20x assembly but had assembly errors that prevented the RNA sequencing (RNA-seq) reads from mapping, while the remaining 80% were within sequence gaps (Supplementary Table 9).
In addition to metrics of gene completeness, we also examined each assembly for its ability to capture two notable maize tandem gene arrays, Rp1-D 30 and zein 31 . The total length of these gene arrays was estimated at 536 kb and 62 kb in NC358, respectively, based on the optical map. Both the Rp1-D and zein loci were completely assembled in all, except for the 21k_20x assembly, where only 70% and 91% of the loci were assembled, respectively ( Fig. 2g and Supplementary Table 10).
Evaluation of the TE assembly space. The completeness of transposon-rich regions of the genome was assessed through the assembly index of LTR retrotransposons, called LAI 8 . A higher LAI score is indicative of a more complete assembly in TE-rich regions. The 21k_20x assembly had a substantially lower LAI score compared to other assemblies (LAI = 12.2; Table 1). As sequence depth increased a substantial improvement in LAI was observed, while the effect of sequence length on LAI was minimal (Fig. 1f). This is likely due to the fact that the length of LTR retrotransposons is ∼10 kb on average ( Supplementary Fig. 7), which could be spanned by even the 11 kb reads. The assemblies that were generated from ≥40× genomic depth achieved gold quality (LAI ≥ 20 (ref. 8 )) ( Table 1 and Fig. 1f), which was comparable to the B73 v4 genome and much higher than many previously published maize genome assemblies generated with short-read data ( Supplementary Fig. 8).
The insertion time of each LTR retrotransposon can be dated based on sequence divergence between terminal repeats 8 . We identified 36% fewer intact LTR retrotransposons in the highly fragmented 21k_20x assembly ( Supplementary Fig. 9) and significantly older LTR elements in the 11k_50x assembly (p < 10 −5 , one-way analysis of variance followed by Tukey's test), suggesting fragmentation of assemblies could bias conclusions of transposon studies. LTR retrotransposons shorter than 26 kb were assembled well across the assemblies (Supplementary Figs. 10 and 11). However, a substantial effect of longer reads and higher depth was observed in the assembly of LTR sequences longer than 26 kb (Fig. 2b). We examined the assemblies of the longest LTR sequence clusters using the Bionano optical map and found most assemblies contained no gaps and were virtually complete (Fig. 2c), with the notable exception that the 11k_50×, 16k_50×, and 21k_20x assemblies, which contained large gaps in one of the LTR clusters (Supplementary Table 11). We also inspected the bz locus 32 , which has highly nested transposon insertions and an estimated size of 303.5 kb in NC358. The bz locus was well assembled in all but the 21k_20x assembly, in which only 56.3% of the sequence was included (Supplementary Table 12). In summary, with ≥40× of sequence coverage, longread sequencing and assembly can traverse most transposon-rich genomic regions including relatively long LTR sequences, although with shorter reads (i.e., read N50 of 11-16 kb) this sequencing depth may not be sufficient.  Evaluation of the non-TE repeat assembly space. The assembly of non-TE tandem repeat space was also evaluated, including telomeres (7 bp repeats), subtelomeres (300-1300 bp repeats), CentC arrays (156 bp repeats), nucleolus organizer region (NOR, 11 kb repeats), and the two major knob repeats (mixture of 180 and 350 bp repeats) ( Fig. 2a and Supplementary Table 13). The effects of sequence read depth and sequence read length were far more pronounced across many of these tandemly duplicated portions of the genome (Fig. 2a).
Telomeres are characterized by 7 bp tandem repeats at the end of each chromosome. Our results showed a substantial increase in the assembled length of telomere sequence with the increase of both read length and sequence coverage ( Fig. 2d and Supplementary  Table 14). However, a precise estimate of telomere length was not possible with our optical map due to the lack of Bionano DLE-1 sites in these highly repetitive regions. Using the full dataset (21k_75×), only 10 of 20 telomere-subtelomere combined regions were assembled to >90% of the Bionano estimated size (Supplementary Table 15), suggesting even longer reads and higher coverage are required for the full assembly of these regions.
The centromere is one of the most repetitive regions of many species' genomes including maize. We characterized NC358 centromeres based on CentC arrays 33 , which are abundant in functional centromeric regions 34 . Even with the full dataset (21k_75×), only half of CentC arrays were assembled (Fig. 2e, Supplementary Fig. 12, and Supplementary Table 16). Hybrid scaffolded assemblies with sequence coverage ≥60× yielded a better approximation to the Bionano estimated size, even though these regions largely consisted of gaps (Fig. 2e). Although assembled sequences were not significantly increased, higher sequence depth resulted in better anchoring of sequences with the Bionano optical map. Only three centromeres, which contained a mixture of CentC arrays, transposons, and intergenic sequences, could be traversed by Bionano DLE-1 labeling due to having a comparatively higher content of low-copy sequence 34 . The size of the remaining centromeres was likely underestimated  Fig. 13) and further improvements in scaffolding technology are required to traversing these regions. The NOR is enriched with ribosomal DNA (rDNA) and spans approximately 9 Mb on chromosome 6 of NC358 (Supplementary  Table 17). Longer read length improved the assembly of this region, but substantial differences were not observed with coverage ≥30× (Fig. 2f). Approximately 72% of the NOR was included in the 21k_30x assembly and this improved by just 9% to 81% in the 21k_75x assembly (Supplementary Table 17 and Fig. 2f).
Finally, maize knobs are heterochromatic regions consisting of 180 bp (knob180) and 350 bp (TR-1) repeats 35 . We used the Bionano optical map to assess the assembly of two knobs that together spanned a total of 5 Mb. With longer reads and higher coverage, more knob sequences were assembled, with 6.5% of the two knobs present in the 21k_20x assembly and up to 65% in the 21k_75x assembly (Supplementary Table 18 and Fig. 2h).

Discussion
Recent innovations in long-read and scaffolding technology have made highly contiguous assembly possible across a wide range of species. We have documented how both the completeness and contiguity of assemblies improve with increasing depth and read length. The biological aims of an investigation must be considered when determining the level of investment in depth of sequence. With long-read sequencing, the low-copy gene space (including tandem gene arrays) can be well assembled with as low as 30× genomic coverage across a range of read lengths. Complete characterization of TEs in complex genomes such as maize will require a greater depth of sequence (~40×) and should employ library preparation protocols that maximize read-length N50. Finally, complete assembly of highly repetitive genomic features such as heterochromatic knobs, telomeres, and centromeres will require substantially more data 36 . In fact, complete assembly of these latter highly repetitive sequences will likely require innovations beyond current sequencing technology.

Methods
Sample preparation. Seeds for the maize NC358 inbred line were obtained from GRIN Global (seed stock ID Ames 27175), grown, and self-pollinated at Iowa State University in 2017. A total of 144 seedlings derived from a single selfed ear were grown in the greenhouse. Leaf tissues from the seedlings at the Vegetative 2 (V2) growth stage were sampled after a 48-hour dark treatment to reduce carbohydrates. A total of 35 g of tissue was collected and flash-frozen. Tissue was sent to the Arizona Genomics Institute (AGI) for high-molecular-weight DNA isolation using a CTAB protocol 37 .
Illumina and PacBio sequencing. Pacific BioSciences long-read data for NC358 were generated at AGI using the Sequel platform. Libraries were prepared using the manufacturer's suggested protocol (https://www.pacb.com/). The subreads that were generated covered the genome at an estimated 75-fold depth (75×) with a subread N50 of 21,166 bp. Reads from each SMRT cell were inspected and quality metrics were calculated using SequelQC 38 . After validating the PSR (polymerase to subread ratio) and ZOR (ZMW occupancy ratio) were satisfactory, all subreads were used for subsequent steps.
Paired-end Illumina data for NC358 were generated at the Georgia Genomics and Bioinformatics Core from the same DNA extraction as was used for the longread sequencing. Quality control of DNA was conducted using Qubit and Fragment Analyzer to determine the concentration and size distribution of the DNA. The library was constructed using the KAPA Hyper Prep Kit (catalog number KK8504). During library preparation, DNA was fragmented by acoustic shearing with Covaris before end repair and A-tailing. Barcoded adaptors were ligated to DNA fragments to form the final sequencing library. Libraries were purified and cleaned with solid phase reversible immobilization (SPRI) beads before being amplified with PCR. Final libraries underwent another bead cleanup before being evaluated by Qubit, quantitative PCR (qPCR) (KAPA Library Quantification Kit catalog number KK4854), and Fragment Analyzer. The final pool undergoing Illumina's Dilute and Denature Libraries protocol was diluted to 2.2 pM for loading onto the sequencer and then sequenced with 1% PhiX by volume. Libraries were sequenced on the NextSeq 500 instrument using PE150 cycles. The demultiplexing was done on Illumina's BaseSpace.
Downsampling full dataset. The 75× SMRT Sequel full data from maize NC358 was downsampled to 60×, 50×, 40×, 30×, and 20x data using seqtk (v1.2) (https://github.com/lh3/seqtk). Downsampling was performed as serial titration, in which each dataset was the superset of the next smaller dataset, and was sampled to have similar length distributions (Supplementary Fig. 3). The N50 of the downsampled datasets were almost identical to the N50 of the full 75× data (Table 1). There is a chance that the subsampling process could pick a disproportionate number of poor reads and result in a poor assembly. There is also a chance that the subsampling process could pick, on average, better quality reads and result in a higher quality assembly. However, evidence suggests it is more likely that the small sample size (i.e., 20×) could not provide enough sequence information to reconstruct the original genome, as has been observed in a study based on simulated and empirical long-read data to test the effect of different depths of Prokaryotic genome assemblies 19 .
Shifting read-length distribution of subreads. Two more NC358 datasets were downsampled and trimmed from the original 75× SMRT dataset to match the readlength distribution of the maize B73 data 13 and the human HG002 data 29 , which had read N50 lengths of~16 kb and~11 kb, respectively ( Supplementary Fig. 3). To do this, first, the read lengths of the maize B73 and human HG002 data were each sorted in descending order. For each read-length value, all subreads from NC358 that were longer than said value were randomly sampled without replacement and clipped to have matched read length. The unused clipped part of the read was put back in the pool for further use with short read length. This distribution-shifting approach was chosen to achieve a realistic distribution of read length rather than trimming all reads by fixed lengths. These datasets were labeled as 16k and 11k based on their N50 of subread data of 16,765, and 11,092, respectively.
RNA tissue sampling and sequencing. Samples from ten tissues throughout development were collected to generate expression evidence for gene annotation. Two biological replicates were collected for each tissue type and each replicate consisted of three individual plants. The tissues that were sampled were as follows: (1) primary root at 6 days after planting; (2) shoot and coleoptile at 6 days after planting; (3) base of the tenth leaf at the Vegetative 11 (V11) growth stage; (4) middle of the tenth leaf at the V11 growth stage; (5) tip of the tenth leaf at the V11 growth stage; (6) meiotic tassel at the Vegetative 18 (V18) growth stage; (7) immature ear at the V18 growth stage; (8) anthers at the Reproductive 1 (R1) growth stage; (9) endosperm at 16 days after pollination; and (10) embryo at 16 days after pollination. Tissue from developmental stage V11 and older were taken from field-grown plants, while all younger tissue samples were taken from greenhouse-grown plants. For the endosperm and embryo samples, tissue from 50 kernels per plant (150 total per biological replicate) were sampled. Greenhousegrown plants were planted in Metro-Mix300 (Sun Gro Horticulture) with no additional fertilizer and grown under greenhouse conditions (27°C/24°C day/ night and 16 h/8 h light/dark) at the University of Minnesota Plant Growth Facilities. Field-grown plants were planted at the Minnesota Agricultural Experiment Station located in Saint Paul, MN with 30 inch row spacing at~52,000 plants per hectare. RNA was extracted using the Qiagen RNeasy plant mini kit following the manufacturer's suggested protocol.
The quality of the total RNA was assessed by Bioanalyzer or Fragment analyzer to determine RNA concentration and integrity. The sample concentration was normalized in 25 µL of nuclease-free H 2 O before library preparation. Libraries were prepared using KAPA's stranded mRNA-seq kit with halved reaction volumes. During library preparations, mRNA was selected using oligo-dT beads, the RNA was fragmented, and cDNA was generated using random hexamer priming. Single or dual indices were ligated depending on the desired level of multiplexing. The number of cycles for library PCR was determined based on kit recommendations for the amount of total RNA used during library preparation. Libraries were quality control checked using Qubit or plate reader, depending on the number of samples in the batch for library concentration, and fragment analyzer for the size distribution of the library. The pooling of samples was based on qPCR. The pooled libraries were then checked by Qubit, Fragment Analyzer, and qPCR.
RNA libraries were prepared for sequencing on Illumina instruments using Illumina's Dilute and Denature protocol. Pooled libraries were diluted to 4 nM, then denatured using NaOH. The denatured library was further diluted to 2.2 pM and PhiX was added at 1% of the library volume. RNA pools were sequenced on a NextSeq 550 to generate 75 bp pair-end reads. On average, 24.5 million pair-end reads were generated per replicate per tissue type, for a total of 489 million reads across all samples. Data were demultiplexed and trimmed of adapter and barcode sequences on BaseSpace (Supplementary Fig. 14).
Bionano data generation. The DNA extraction was performed using the Bionano Prep™ Plant Tissue DNA Isolation Kit according to a modified version of the Plant Tissue DNA Isolation Base Protocol. Approximately 0.5 g leaf tissue was collected from young etiolated seedlings germinated in soil-free conditions and grown in the dark for ∼2 weeks after germination. Freshly cut leaves were treated with a 2% formaldehyde fixing solution and then washed, cut into small pieces and homogenized using a Qiagen TissueRuptor probe. Free nuclei were concentrated by centrifugation at 2000 × g, washed, isolated by gradient centrifugation, and embedded into a low-melting-point agarose plug. After proteinase K and RNase A treatments, the agarose plug was washed four times in Wash Buffer and five times in TE (Tris and EDTA) buffer. Finally, purified ultra-high-molecular weight nuclear DNA (uHMW nDNA) was recovered by melting the plug, digesting it with agarase and subjecting the resulting sample to drop dialysis against TE.
The Bionano Saphyr platform, in combination with the Direct Label and Stain (DLS) process, was used to generate chromosome-level sequence scaffolds and validate PacBio sequence contigs. Direct labeling was performed using the Direct Labeling and Staining Kit (Bionano Genomics Catalog 80005) according to the manufacturer's recommendations, with some modifications 39 . In total, 1 μg uHMW nDNA was incubated for 2:20 h at 37°C, followed by 20 min at 70°C in the presence of DLE-1 Enzyme, DL-Green, and DLE-1 Buffer. Following proteinase K digestion and cleanup of the unincorporated DL-Green label, the labeled DNA was combined with Flow Buffer, DL-Dithiothreitol (DTT), and incubated overnight at 4°C. DNA was quantified and stained by adding Bionano DNA Stain to a final concentration of 1 μL per 0.1 μg of final DNA. The labeled sample was loaded onto a Bionano chip flow cell and molecules separated, imaged and digitized in a Bionano Genomics Saphyr System and server according to the manufacturer's recommendations (https://bionanogenomics.com/support-page/saphyr-system/). Data visualization, processing, DLS map assembly, and hybrid scaffold construction were all performed using the Bionano Genomics software Access, Solve, and Tools. A filtered subset of 1,282,746 molecules (353,596 Mb total length) with a minimum size of 150 kb and a maximum size of 3 Mb were assembled without pre-assembly using the non-haplotype parameters with no CMPR cut and without extend-split.
Genome assembly. To determine the assembly approach to apply to each of the datasets, six different methods were tested on the complete dataset, including Falcon-only, Canu-only, WTDBG2-only, a Falcon-Canu hybrid approach, a Falcon-Peregrine hybrid approach (the longest 23× corrected reads were used), and a Falcon-Flye hybrid approach. We also downloaded PacBio sequencing data for the B73 v4 genome (68× subreads) for comparison of the different approaches.
The Falcon genome assemblies were performed using the falcon_kit pipeline v0.7 (ref. 21 ) with some modifications. TANmask and REPmask were not used due to their extensive masking for the maize genome. Error correction for subreads was performed on the longest 50× coverage, with the average read correction rate set to 75% (-e 0.75) and local alignments for at least 3000 bp (-l 3000). The usage of -l 3000 instead of -l 2500 was done because of the omitted repeat masking, which works better for highly repetitive genome species like maize. A minimum of two reads and a maximum of 200 reads were used for error corrections (--min_cov 2 --max_n_read 200). For sequence assembly, the exact matching k-mers between two reads was set to 24 bp (-k 24) with read correction rate as 95% (-e 0.95) and local alignments at least 1000 bp (-l 1000). The longest 20x coverage reads were used for assembly with a minimum coverage of two and maximum coverage of 80 (--min_cov 2 --max_cov 80). Full parameter sets are included in the Supplementary Note 1.
For Canu read correction and assembly, Canu v1.7 (ref. 22 ) was used. K-mers more frequent than 500 were not used to seed overlaps (ovlMerThreshold = 500). The genome size of 2,272,400,000 bp and 2,500,000,000 bp for NC358 and B73, respectively, were used in this study 20 . Other parameters were used as default. Due to a bug in the Canu v1.7 program, truncations of large contigs would occur during the consensus process (https://github.com/marbl/canu/releases/tag/v1.8). As the program was not expecting the superlong contigs that were being generated for our NC358 assemblies, we found a total of nine large contigs that suffered from consensus truncations. To fix these truncation gaps, consensus-free contigs were generated using Canu v1.7 (cnsConsensus = quick), then blastn was used to search for 5 kb boundaries of truncation gaps in consensus-free assemblies. Truncated sequences were retrieved and patched to the truncated contigs.
For the Falcon-Canu hybrid approach, the error correction was performed by Falcon, and the trimming and assembly were performed by Canu using the versions and parameters described above. All the assemblies were performed on the DNAnexus cloud platform. CPU core hour and maximum memory usage were recorded every 10 minutes for each Falcon error correction and Canu assembly job. For Falcon error correction of the 21k datasets, the CPU core hour (y) could be predicted by subread depth (m) with y ¼ 20603100000 þ ð3136:685 À 20603100000Þ= 1 þ m=1932:377 For Canu assembly of the 21k datasets, the CPU core hour (y) could be predicted by corrected read depth (n) with y ¼ 6438752000 þ ð1284:689 À 6438752000Þ=ð1 þ ðn=56334:74Þ^1:872455Þ These curves were fit using the https://mycurvefit.com/ website and plotted in R.
Three of the assembly approaches were tested using both maize NC358 and B73 reads. For both inbred lines, a similar assembly size was generated by the Falcononly, Canu-only, and Falcon-Canu hybrid approaches. However, the Falcon-Canu hybrid approach yielded the longest contig length (78.4 Mb and 19.7 Mb, respectively), the highest contig NG50 (23.0 Mb and 3.0 Mb, respectively), and the lowest number of assembly errors based on Bionano DLE-1 conflict (21 and 64, respectively; Supplementary Table 2). The gene-space completeness evaluated using BUSCOs v3.0.2 28 and the repeat space continuity evaluated using the LAI (vbeta3.2) 8 were similar between the Canu and the hybrid approach and higher than those assemblies that were created using the Falcon assembler (Supplementary Table 2). This was likely due to the consensus approach used at the end of the Canu program, which was missing in the Falcon program.
The remaining three approaches were tested using only NC358 reads. For assemblies generated by the Falcon-Flye hybrid, Falcon-Peregrine hybrid, and WTDBG2-only approaches, the assembled sizes were 16-51% larger than the estimated genome size with 28-202 times more contigs compared to the assembly generated by the Falcon-Canu hybrid approach (Supplementary Table 2). Other quality metrics of these assemblies, such as the longest contig, contig N50, Bionano DLE-1 conflict, and LAI were all inferior compared to those of the Falcon-Canu assembly (Supplementary Table 2). BUSCO of the Falcon-Peregrine assembly was higher than that of the Falcon-Canu raw assembly, but lower than that of the Pilon or Arrow-polished assembly. The correction-free WTDBG2 assembly was the most fragmented probably due to the lack of error correction that hindered the assembly of repetitive sequences, which is demonstrated in the low LAI value (LAI = 2.5).
Due to the consistently high quality of the assemblies generated from the Falcon-Canu hybrid approach, we used this approach to assemble each of the NC358 datasets with varying sequence depth and read length. Full parameter sets of all assembly approaches are included in the Supplementary Note 1.
Genome polishing. Two polishing approaches were tested on the 21k_75x assembly. The first was done using Arrow with PacBio subreads (75× coverage). Read mapping to the assembly was done using BLASR 40 with default parameters (--minMatch 12 --bestn 10 --minPctSimilarity 70.0 --refineConcordantAlignments). The Arrow tool in the SMRT Link (v5.1.0) software package was then applied to correct for sequencing errors with default parameters. A second approach for polishing was done using Pilon with Illumina pair-end reads (30.7× coverage). Read mapping to the assembly was done using Minimap2 (v2.16) 41 with the short read option (-ax sr). Pilon (v1.23-0) 42 was then applied to correct for sequencing errors including SNPs and small indels (--fix bases) on sites with a minimum depth of 10 and a minimum mapping quality of 30 (--mindepth 10 --minmq 30).
With both approaches, minimal differences were observed in the contiguity statistics (Supplementary Table 2) or the repeat content for the 21k_75x assembly ( Supplementary Fig. 15), and it is expected that this minimal impact would be observed across all of the NC358 assemblies. A more substantial difference in BUSCO scores was observed with both the Arrow-polished and the Pilon-polished 21k_75x assemblies (Supplementary Table 2). As the polishing had a substantial impact on this metric, the other NC358 assemblies were also polished using Pilon with the same parameter settings and similar improvement of BUSCO scores was observed (Table 1; Supplementary Table 4).
Generation of pseudomolecules. Hybrid scaffolds for the assemblies were generated with Bionano Direct Label and Stain data using Bionano Solve (v3.2.1_04122018). Overlaps of contigs within Bionano map space were resolved by placing 13 bp of Ns (13 N gaps) at the overlap site. In addition to arranging contigs into scaffolds, the hybrid scaffold was also used to detect misassembly and to assess the completeness of the assembled genome and repeat elements.
The pseudomolecules were constructed from the hybrid scaffolds using ALLMAPS (v0.8.12) 43 . Both pan-genome anchor markers 27 and GoldenGate markers 26 were used with equal weights for ordering and orientating the scaffolds. For pan-genome anchor markers, data were downloaded from the CyVerse Data Commons (http://datacommons.cyverse.org/browse/iplant/home/shared/panzea/ genotypes/GBS/v27/Lu_2015_NatCommun_panGenomeAnchors20150219.txt.gz) and a bed file with 50 bp upstream and downstream of the B73 v3 coordinates were generated. A text file with marker name and predicted distance was also constructed from the same file. The extracted markers were mapped to HiSat2 (v2.1.0) 44 indexed assemblies of NC358 by disabling splicing (--no-splicedalignment) and forcing global alignment (--end-to-end). Very high read and reference gap open and extension penalties (--rdg 10000,10000 and --rfg 10000,10000) were also used to ensure full-length mapping of marker sequence. The final alignment was then filtered for mapping quality of greater than 30 and tag XM:0 (unique mapping) to retain only high-quality uniquely mapped marker sequences. The mapped markers were merged with the predicted distance information to generate a CSV input file for ALLMAPS. Only scaffolds with more than 20 uniquely mapped markers, with a maximum of 100 markers per scaffold, were used for pseudomolecule construction.
The GoldenGate markers were downloaded from MaizeGDB (https://www. maizegdb.org/data_center/map?id=1203673). For the markers with coordinates, 50 bp flanking regions were extracted from the B73 v4 genome. For markers without coordinates, marker sequences were used as-is, and those missing both coordinates and sequences were discarded. Mapping of the markers was done similar to the method described above for the pan-genome anchor markers, with all uniquely mapped markers retained. The genetic distance information for these markers was converted to a CSV file before using it in ALLMAPS. ALLMAPS was run with default options, and the pseudomolecules were finalized after inspecting the marker placement plot and the scaffold directions. Synteny dotplots were generated using the scaffolds as well as pseudomolecule assemblies against the B73 genome by following the ISUgenomics Bioinformatics Workbook (https:// bioinformaticsworkbook.org/dataWrangling/genome-dotplots.html) 45 . Briefly, the repeats were masked using RepeatMasker (v4.0.9) 46 and the Maize TE Consortium (MTEC) curated library 3 . RepeatMasker was configured to use the NCBI engine (rmblastn) with a quick search option (-q) and GFF as a preferred output. The repeat-masked genomes were then aligned using Minimap2 (v2.2) 41 and set to break at 5% divergence (-x asm5). The paf files were filtered to eliminate alignments less than 1 kb and dotplots were generated using the R package dotPlotly (https://github.com/tpoorten/dotPlotly).
Gene annotation and RNA-seq mapping. The MAKER-P pipeline 47 was used to annotate protein-coding genes for Pilon-polished NC358 21k_20x and 21k_75x genome assemblies. The baseline evidence used in annotating the B73 v4 genome 13 was applied. Before gene annotation, the MTEC curated TE library 3 and Repeat-Masker was used to mask repetitive sequences. For gene prediction, we used Augustus 48 and FGENESH 49 (http://www.softberry.com/berry.phtml) with training sets based on maize and monocots, respectively. To identify genes that were missing in the 21k_20x assembly, total coding sequences (CDS) from the 21k_75x annotation was masked by total CDS from the 21k_20x annotation using RepeatMasker (-div 2 -cutoff 1000 -q -no_is -norna -nolow). The 21k_75x CDS that were masked less than 20% were determined missing in the 21k_20x annotation. These missing CDS were blast against the 21k_20x assembly and those that had less than 20% similarity were also determined to be missing in the 21k_20x assembly.
A total of 20 RNA-seq libraries were sequenced from NC358 tissue samples. Each library was sequenced to 21.9× ± 0.7× coverage with a mapping rate of 86.4% ± 1.0% to the B73 v4 using STAR (v2.5.2b) 50 (Supplementary Fig. 16 and Supplementary  Table 19). To benchmark the gene-space assembly, STAR (v2.5.2b) 50 was used to map the RNA-seq reads against the Pilon-polished NC358 assemblies. Unmapped reads from the 21k_20x assembly were extracted using SAMtools 51 and remapped to the 21k_75x assembly with STAR. Genes with read coverage ≥20% were extracted using BEDtools 52 , and blast against the 21k_20x assembly for the identification of full-length copies. The NC358 TE library (see next section for details on library generation) was used to identify TE fragments in genes with aligned reads (Supplementary Table 9). In addition, TEsorter (v1.1.4) 53 (https://github.com/ zhangrengang/TEsorter) was used to identify TE-related protein domains in genes with default parameters (Supplementary Table 9).
The repeat space contiguity was accessed using the LAI (vbeta3.2) 8 . To annotate LTR retrotransposons, LTR_retriever (v2.6) 58 was used to identify intact LTR retrotransposons and construct LTR libraries for each NC358 assembly with default parameters. To generate a high-quality LTR library for NC358, assemblyspecific LTR libraries were aggregated and masked by the MTEC curated LTR library using RepeatMasker (v4.0.7) 46 . Library sequences masked over 90% were removed and redundant sequences were also removed using utility scripts (cleanup_tandem.pl and cleanup_tandem.pl) from the EDTA package 59 . Nonredundant NC358-specific LTR sequences were added to the MTEC curated LTR library to form the final LTR library for NC358. The final library was then used to mask the 21k_75x assembly for the estimation of total LTR content. The total LTR content of 76.34% and LTR identity of 94.854% was used to estimate LAI values of all NC358 assemblies (-totLTR 76.34 -iden 94.854). The LAI of the other maize line genomes, including PH207 (GeneBank Accession: GCA_002237485.1) 60 , CML247 (GeneBank Accession: GCA_002682915.2) 27 , Mo17 61 , and GeneBank Accession: GCA_003185045.1 62 ), W22 (GeneBank Accession: GCA_001644905.2) 63 , and B73 v4 (GeneBank Accession: GCA_000005005.6) 13 were also evaluated for context. Effective assembly size, which is the length of the uniquely mappable sequences of an assembly, was estimated using unique 150-mers in each sequence assembly and quantified using Jellyfish (v2.0) 64 with default parameters.
Misassembly identification with optical maps. The Bionano optical mapping was used as an orthogonal method to identify misassemblies in genomes. Bionano de novo assembled optical maps were aligned to the sequence pseudomolecules to characterize structural inconsistencies using the structural variant calling pipeline of BionanoSolve 3.4. Default parameters were employed from the non-haplotype_noES_DLE file. Homozygous calls with a confidence of 0.1, a size of 500 bp, and non-overlaps with gap regions were regarded as insertions and deletions in sequence assemblies.
Assembly quality evaluation in repeat space. The coordinates of CentC arrays, knob180, TR-1 knobs, and NOR in the assemblies were identified by blasting CentC, knob180, TR-1 knob consensus sequences 34 , and the rDNA intergenic spacer (AF013103.1) against each assembly. An individual repeat array was defined as clusters of repetitive sequences that had less than 100 kb interspace between repeated elements. The level of repeats and gaps were then quantified in each defined repeat array. Respective sizes of each repeat array in the Bionano maps were estimated using the Bionano labels closest to the start and end coordinates in the assemblies.
To identify the telomere-subtelomere boundaries of the NC358 assemblies, seven maize subtelomere repeat sequences were downloaded from NCBI (EU253568.1, S46927.1, S46926.1, S46925.1, CL569186.1, AF020266.1, and AF020265.1) and used as queries to blast against the NC358 21k_75x assembly. Subtelomere boundaries were first identified at the start and end of chromosomes where blast hits were clustering then cross-checked with subtelomere-specific fluorescence in-situ hybridization (FISH) data 65 . The blast results were concordant with FISH results, showing the beginning of chromosomes 7, 8, 9, and 10 lack subtelomeres (Supplementary Table 15). Telomeres were defined as the distance between the boundary of subtelomeres to the end of pseudomolecules of the 21k_75x assembly, which were used as the basis for estimating the telomere size and count of the telomeric repeat sequences (5′-TTTAGGG-3′ and 5′-CCCTAAA-3′ in reverse complementation) in all other NC358 assemblies.
To identify the bz locus in the NC358 assemblies, the sequence of the maize W22 bz locus was first downloaded from NCBI (EU338354.1) 32 . The starting and ending 2 kb of the W22 bz locus were used to blast against the NC358 21k_75x assembly and the longest matches on chromosome 9 were used as the location of the bz locus in the NC358 21k_75x assembly. The obtained NC358 bz locus is 289,103 bp in length (chr9:11625031.11914133), which is 50 kb longer than that of the W22 bz locus (238,141 bp). Similarly, the 2 kb flanking sequences of the NC358 21k_75x bz locus were used to locate the bz locus coordinates in the other NC358 assemblies.
The zein sequence was downloaded from NCBI (AF031569.1) and the Rp1-D from MaizeGDB (AC152495.1_FG002). The same method as described for the bz locus was used to identify coordinates in the NC358 assemblies based on blast results using 2 kb flanking sequences.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.