A complete telomere-to-telomere assembly of the maize genome

A complete telomere-to-telomere (T2T) finished genome has been the long pursuit of genomic research. Through generating deep coverage ultralong Oxford Nanopore Technology (ONT) and PacBio HiFi reads, we report here a complete genome assembly of maize with each chromosome entirely traversed in a single contig. The 2,178.6 Mb T2T Mo17 genome with a base accuracy of over 99.99% unveiled the structural features of all repetitive regions of the genome. There were several super-long simple-sequence-repeat arrays having consecutive thymine–adenine–guanine (TAG) tri-nucleotide repeats up to 235 kb. The assembly of the entire nucleolar organizer region of the 26.8 Mb array with 2,974 45S rDNA copies revealed the enormously complex patterns of rDNA duplications and transposon insertions. Additionally, complete assemblies of all ten centromeres enabled us to precisely dissect the repeat compositions of both CentC-rich and CentC-poor centromeres. The complete Mo17 genome represents a major step forward in understanding the complexity of the highly recalcitrant repetitive regions of higher plant genomes.

sequencing errors in the ONT reads. As indicated by the presence of over 2 kb of telomeric repeats in the assembly, 18 chromosomal ends were fully assembled by both ONT and PacBio approaches. Incomplete chromosomal ends were found on the short arm of chr1 (chr1S) and chr2S (Supplementary Fig. 9). The telomeric region of chr1S was assembled by ONT data only. While for the end of chr2S, the telomeric region was found in only the PacBio Canu assembly, and therefore, the corresponding PacBio contig was integrated into the basal Mo17 assembly (Supplementary Fig. 9).
To avoid telomeric repeat sequences being incorrectly trimmed by the assembler, we further corrected telomeric regions using ONT reads. Thus, we obtained the assemblies of the ends of all 10 chromosomes, as confirmed by tiling ONT reads and coverage analysis (Fig. 1E, Extended Data Fig. 4).

Gap closure of 5 super-long TAG trinucleotide repeat arrays
Following gap-closing and correction by the PacBio assembly, there were only 6 gaps remaining in the basal Mo17 assembly, including 5 related to super-long TAG repeat arrays on chr1 (gap1 and gap2), chr2 (gap3 and gap5) and chr4 (gap_LCR6) and one related to the 45S rDNA array on chr6 (gap6). These six highly repetitive gaps were too long (all longer than 200 kb) to be spanned by current ultra-long ONT reads directly. According to previous reports 6,7 , we found there were three types of reads with sequencing errors of ONT technology after carefully examining reads falling into these gaps (Supplementary Fig. 6): (1) Symmetrical read: reads generated due to a given DNA fragment being mistakenly sequenced twice from opposite directions. (2) Fused read: reads containing sequences apparently fused from two or more genomic regions. (3) Microsatellites (also known as simple sequence repeats) miscalled reads: a stretch of microsatellites in reads were miscalled as other microsatellites, including many reads with long TAG repeats or telomeric repeats. Two or more types of these errors could happen in a single read. With a base error rate of approximately 15% 8,9 , these additional sequencing errors of current ONT technology were a major reason that gaps were not easily filled using a typical genome assembler.
Specific gap-filling approaches were developed for gaps related to super-long TAG repeats and 45 rDNAs, respectively. We tried to close the 5 super-long TAG repeat array related gaps by manual extension using the ultra-long ONT reads. Gap1 and gap5 were spanned by 6 and 16 tiling ONT reads, respectively (Fig. 1f, Extended   Data Fig. 5). The assembly showed that the TAG repeat array related to gap1 was 375 kb, among which 67.64% of sequences were TAG repeats (Supplementary Table 3).
Gap5 was much longer (1.56 Mb), including 890.9 kb of TAG repeat sequences (Supplementary Table 3). Gap2, gap3 and gap_LCR6 were also extended for about 700 kb, 100 kb, and 1 Mb, respectively. However, each of these still included one TAG repeat region that was not spanned by ONT reads, which were termed sub-gap2, sub-gap3, and sub-gap_LCR6, respectively. Further analyses indicated that TAG repeat sequences harbored in sub-gap2, sub-gap3, and sub-gap_LCR6 were all longer than 90 kb (Extended Data Fig. 5). We hypothesized that these remaining gaps might contain only TAG trinucleotide repeats, allowing few point mutations. Based on our earlier analyses that the TAG repeats could often be base-called as other microsatellites, we manually checked 4,129 reads with at least 5 kb of microsatellites at one end of the read (out of all 77× quality-passed ONT reads longer than 100 kb).
With the exception of 504 reads composed only of microsatellite repeats, the remaining 3,625 reads were mapped back to other parts of our Mo17 genome assembly, confirming that sub-gap2, sub-gap3, and sub-gap_LCR6 all contain only TAG repeat sequences. We then tried to determine their length using BioNano physical molecules of the Mo17 genome 4 . The length of sub-gap_LCR6 was approximately 129 kb according to one BioNano molecule that spanned it (Extended Data Fig. 5). Sub-gap3 could be spanned by a total of 7 reliable BioNano molecules.
Interestingly, they varied in their estimated length of sub-gap3, with length differences ranging up to 70 kb (Supplementary Fig. 10). These findings are consistent with previous reports showing the length of highly similar tandem repeats can vary greatly among individuals due to unequal crossover mediated expansions and contractions of repeat arrays 10 . We estimated the length of sub-gap3 to be 210 kb, the median of the 7 BioNano molecules (Extended Data Fig. 5, Supplementary Fig. 10). Next, we tried to determine the length of sub-gap2. Briefly, total length of all 6 genomic regions with consecutive TAG repeats longer than 90 kb was estimated first using 45.6× ONT reads longer than 150 kb, including one region in each of gap1, gap3, gap_LCR6, and two regions in gap5, as well as sub-gap2. The estimated total length of the 6 genomic regions (1,025 kb) was then subtracted by the lengths of the 5 regions with known sizes. Thus, the length of sub-gap2 was determined to be approximately 166 kb. In summary, gap2, gap3, and gap_LCR6 were closed, with lengths of 637.39 kb, 211.26 kb, and 1.13 Mb, respectively (Extended Data Fig. 5).
To validate the accuracy of our assembly for these 5 super-long TAG repeat array related gaps, we performed a fluorescence in situ hybridization (FISH) assay using pachytene stage meiocytes. A total of 6 identifiable TAG repeat signals were identified using such sequences as probes. According to their relative positions on chromosomes, these signals were found to correspond to the longest 6 TAG repeat arrays in the genome, including those in the 5 TAG repeat array related gaps and that in a TAG repeat array related LCR (LCR3) (Fig. 2b). In addition, the intensities of these FISH signals were generally consistent with the lengths of TAG repeats inferred from our assembly (Fig. 2b), thus confirming the accuracy of the assembly of these super-long TAG repeat arrays.

Gap closure of 45S rDNA array
We closed the 45S rDNA related gap using PacBio HiFi reads. The intergenic spacer region (IGS) sequence of 45S rDNA was utilized as the main anchor for closure as there was abundant genetic variation [11][12][13] . First, two internal 'islands' with opposite directions of 45S rDNAs were identified by PacBio HiFi and ONT reads. As we decided to extend along the transcriptional direction of 45S rDNAs, three locations served as starting points of extension, including two sides of an 'island' with rDNAs in which IGSs were adjacent to each other, and the centromere-proximal end of the gap in which the 45S rDNAs direction was toward the telomere (Extended Data Fig.   6). During the extension steps, PacBio HiFi reads were utilized because of their high base accuracy in order to distinguish genetic variation among different 45S rDNA copies. In addition, the N50 size of PacBio HiFi reads was about 15.9 kb, nearly two-fold longer than the 45S rDNA repeat unit, which made it possible to extend one copy of 45S rDNAs most of the time. Briefly, extensions were made using an in-house script as follows: the IGS sequence of the 45S rDNA to be extended was mapped to the PacBio reads harboring 45S rDNAs, and the 5 best hit reads were selected and mapped back to the corresponding 45S rDNA to be extended. The read with the best hit was further selected for extension. In some cases, manual checks and extension were needed when TE insertions occurred or no PacBio reads were found based on the thresholds set for this best-hit-based method. Overall, the 45S rDNA related gap was closed by PacBio reads with a total of 3,106 rounds of extension, with a total length of 26.8 Mb containing 2,974 copies of 45S rDNAs. The 1.2 Mb TE-enriched region in the centromere-proximal end of the array was the only region which could also be reliably assembled by ONT reads. The number and the order of 45S rDNAs in this TE-enriched region were highly consistent across assemblies based on PacBio and ONT reads (Fig. 1g). To further validate the accuracy of the assembly of the array, the number of 45S rDNA copies was then estimated by different approaches. The blast-based copy number estimation with ultra-long ONT and PacBio HiFi data showed that there were 2,662 and 2,849 copies of 45S rDNAs in the Mo17 genome, respectively (Fig. 2c). The k-mer-based estimation with Illumina data indicated that the number of 45S rDNAs in the Mo17 genome was 2,694 ( Fig. 2c). In addition, digital PCR based experimental estimations were also performed, demonstrating there were 3,364 ± 237 copies of 45S rDNAs (Fig. 2c). Overall, the total 45S rDNA copies estimated were highly consistent with the 45S rDNA array assembly, the only genomic region having 45S rDNAs in Mo17 genome.

PacBio reads
We remapped the ultra-long ONT reads and PacBio HiFi reads to the T2T Mo17 assembly and found uniform coverage across nearly all genomic regions, which confirmed the overall accuracy of the assembly (Extended Data Fig. 7). According to the alignments of ONT reads, except for the 20 telomeres that often show underrepresentation of reads and one subtelomeric region on chr2S with highly tandem repeats (Extended Data Fig. 4), we observed 15 LCRs and 10 HCRs with read depth lower than 100 and higher than 250 (genome-wide average: 180.7), respectively, which all corresponded to gaps, LCRs and HCRs mentioned for the basal Mo17 assembly (Supplementary Fig. 11). Based on PacBio HiFi reads, 107 additional local coverage-anomalous regions with coverage lower than 20 or higher then 105 (genome-wide average: 65) were identified. Notably, the assemblies of these regions, with an average size of 20.84 kb, were all confirmed by concordant PacBio assembly and tiling ONT reads.

Evaluation of the completeness of the final T2T Mo17 assembly using ONT reads
The completeness of the T2T assembly was validated. Of 7,751,268 quality-passed ONT reads longer than 10 kb, 0.47% originated from maize mitochondrion and chloroplast genomes and microbial DNA contamination as identified by mapping with organelle genomes and the NCBI NT database, and 6.21% were generated due to sequencing errors, including 3.60% fused reads and 1.54% symmetrical reads, as well as 1.07% reads of unknown mistaken origin. The remaining 93.32% reads could be properly mapped to a unique position of the T2T Mo17 assembly with a minimum query sequence coverage of 0.85 (Extended Data Fig. 8). We did not detect quality-passed ONT reads that originated from the Mo17 genome but failed to map to the final assembly.

Analysis of tandem repeats in the Mo17 genome
With a requirement of at least 5 consecutive copies, we identified 5.45 Mb of microsatellites (repeat unit: 1 to 9 nucleotides), 16 Table 4).

Genes annotated in the Mo17 genome
A total of 42,580 high-confidence, protein-coding genes were annotated in the Mo17 genome, of which 30,975 were supported by RNA-seq data with a threshold of at least 90% coverage for CDS (Supplementary Table 5). In addition, we found that 41,127 of predicted proteins were homologous to genes identified in the NAM founder lines of maize 14 Table 5) and that 55.82% of the remaining 1,453 genes were expressed in at least one of the 127 RNA-seq data sets we used (Supplementary   Table 14) with a threshold of a FPKM (fragments per kilobase of transcript per million mapped reads) value larger than 1. In total, 1,029 predicted genes were newly anchored in the Mo17 chromosomes, including 246 newly assembled genes ( Table 1) such as the RPN8 gene (Zm00014ba318810, Supplementary Fig. 15), which its homolog in Arabidopsis encodes a 26S proteasome subunit that specifies leaf adaxial identity 15 .

An extreme case of region with tandem gene duplications
One extreme case of region with tandem gene duplications was an approximate 800 kb newly assembled region between Zm00014ba065330 and Zm00014ba065630 on chr10, which contained a total of 29 genes and 6 putative pseudogenes. Of these 29 genes, 20 (mG1-1 to 1-20) were duplicated genes encoding threonine protein kinase, and 6 (mG4-1 to 4-6) were duplicated genes with unknown function (Extended Data Fig. 9). For the 6 pseudogenes, 2 of them (mPG2-1 and 2-2) were homologous with the 6 duplicated genes of unknown function, and another 4 (mPG1-1 to 1-4) were duplicated with each other (Extended Data Fig. 9). The corresponding region in the B73 genome (about 230 kb) was found to have 10 annotated genes. Among them, 3 genes (bG4-1 to 4-3) were homologous with the 20 duplicated threonine protein kinase genes, one gene (bG6) was homologous with the 6 duplicated genes of unknown function, and one gene (bG3) was homologous with the 4 duplicated pseudogenes in the Mo17 genome (Extended Data Fig. 9). Except for the genes mentioned above, the remaining 3 genes (mG2, 3, and 5) in the Mo17 genome and 5 genes (bG1, 2, 5, 7, and 8) in the B73 genome were not duplicated and no homology was found among them (Extended Data Fig. 9).

Variant distance between different satellite repeat copies
Totally, 122,760 intact knob180, 4,514 intact TR-1, and 44,747 intact CentC repeat copies were identified, respectively. Using the method reported previously 16 , we generated a position probability matrix (PPM) for each type of satellites, and calculated a variant distance to the PPM for each satellite repeat copy. Substantial sequence variation was observed for knob180, TR-1 and CentC repeats, with a mean variant distance of 24.7, 54.7, and 13.1, respectively (Extended Data Fig. 10a).
Interestingly, there were two types of knob180 repeats according to their variant distances. About 20% of knob180 repeats displayed with relatively higher variant distances and were significantly enriched on Knob-8L (Extended Data Fig. 10a and   b). These high varied knob180 repeats were not randomly distributed, but showed some extent of depletions across Knob-8L, with thousands (average about 4100, ranging from 1100 to 7900) of knob180 repeats between two adjacent depletions (Extended Data Fig. 10c). Satellite repeats with five or fewer pairwise variants were defined as higher-order repeat groups. A total of 3,695, 2,059, and 1,901 higher-order repeat groups, with an average of 13,155, 445, and 8,179 copies per group, were identified for knob180, TR-1 and CentC, respectively.

The correlation between gene number and non-CRM Gypsy abundance in centromeres
A total of 82 genes were identified in centromeres of the Mo17 genome. Most of these 82 genes were located in seven CentC-poor centromeres, including centromeres of chr2, chr3, chr4, chr5, chr6, chr8 and chr10 which harbored 13, 11, 4, 9, 15, 14 and 11 genes, respectively. By contrast, for the 3 CentC-rich centromeres, there was only 2 and 3 gene identified on centromeres of chr7 and chr9, respectively, and no gene was identified on centromeres of chr1 (Fig. 5c). Interestingly, more than half (ranging from 36.55% to 74.30%) of sequences of the 7 CentC-poor centromeres were occupied by non-CRM Gypsy, while the other 3 CentC-rich centromeres had only an average of 7.97% of non-CRM Gypsy (Fig. 5a). In contrast, no obvious difference in CRM abundance was observed between the two types of centromeres. In general, non-CRM Gypsy abundance was positively correlated (r = 0.722) with the number of genes in centromeres (Supplementary Fig. 17b), reflecting a potential role of

ONT and PacBio libraries constructing and sequencing
For each ONT common sequencing library, approximately 3-4 μg of gDNA with size about 20 kb was selected using the Pippin HT system (Sage Science, USA). Next, the end reparation of DNA fragments and A-ligation reaction were performed using NEBNext Ultra II End Repair/dA-tailing Kit (Cat# E7546). Then, the adapter was The raw data generated were processed using the CCS algorithm (v.4.0.4, https://github.com/pacificbiosciences/unanimity) with the parameter -minPasses3.

Closure of the TAG repeat array related gaps
The 5 TAG repeat array related gaps were manually closed based on the ultra-long ONT reads. To avoid the possible assembly errors around the boundaries of the gaps, the flanking 500 kb sequences of the two ends of each gap were removed from the basal Mo17 assembly, resulting in a trimmed Mo17 assembly. Then, the ONT data was iteratively mapped to the trimmed Mo17 assembly by Minimap2 17 with the parameters of '-x map -ont -r 10000'. For each round of mapping, the reads that could extend the gaps as far as possible were selected to extend the assembly of corresponding ends of gaps. The assembly extended by the reads was used for mapping in the next cycle. For each gap, the iteration of extension will be terminated when either the extension from two ends of the gap were overlapped, or no reliable reads could be found for extension from both ends. With this approach, gap1 and gap5 were closed.
Most regions of gap2, gap3, and gap_LCR6 were filled by ONT reads, but there was still a sub-gap in each of them that was not spanned by ONT reads. All these three sub-gaps were composed by TAG repeats longer than 90 kb. Next, we tried to determine if there were any other non-TAG sequences located inside of these three sub-gaps. Considering that the TAG repeats could be read as other microsatellites, we identified the ONT reads with the threshold that there was at least 5 kb microsatellite in one end of the read. A total of 4,129 quality-passed ONT reads longer than 100 kb were identified. Manual checking showed that except for 504 reads only composed by TAG repeats, all the remained 3,625 reads could be mapped back to our Mo17 assembly. This indicated that the three sub-gaps were all composed by TAG repeats only. Published BioNano molecules of Mo17 genome 4 were used to determine the lengths of these sub-gaps. BioNano molecules were mapped to the assembly by Solve kb region in Gap_LCR6. Two types of reads associated with these 6 regions were identified firstly. One type is the reads which could be mapped to the 6 regions. The second type is the reads which harbored with consecutive microsatellite repeats longer than 90 kb but could not be mapped to the 6 regions. Notably, in consideration of extra sequence errors for ONT reads with long TAG repeats and there was no other type of microsatellites longer than 90 kb in the genome, the reads which harbored with consecutive microsatellites longer than 90 kb, including microsatellites of non-TAG repeats, were also identified as the second type of reads. Then, we summed up the length of TAG repeats harbored for the two types of reads associated with these 6 regions, and normalized it by division by the average genome coverage of the data (45.6×), which resulted a total of 1,024.6 kb for the 6 regions. After subtracting the lengths of the other 5 TAG repetitive regions with known sizes, the length of sub-gap in gap2 was estimated about 165.6 kb. Altogether, gap2, gap3 and gap_LCR6 was closed finally.

Closure of the 45S rDNA array related gap
The 45S rDNA array related gap was closed based on PacBio HiFi reads. We checked the direction of 45S rDNA sequences harbored by PacBio reads and ONT reads, and found there were two 'islands' with opposite directions of 45S rDNAs inside the gap.
One island harbored with two rDNAs in which their intergenic spacer (IGS) regions were adjacent, whereas the other island harbored with two rDNAs in which their internal transcribed spacer (ITS) regions were adjacent. We performed the gap closure by extending along the transcriptional direction of 45S rDNAs. Therefore, three locations were served as starting points of extension, including the centromere-proximal end of the gap which 45S rDNAs direction was toward to the telomere, and the two sides of a 'island' with rDNAs which IGSs were adjacent with each other. Next, the extension was performed with following steps.
Step 1: the IGS sequences of the 45S rDNAs to be extended were mapped to the PacBio reads including at least one intact 45S rDNA copy and intact IGS sequences of 45S rDNA adjacent to the 25S rRNA end of the intact 45S rDNA copy. BLASTN 18 (v2.9.0) was used for mapping with the parameters: -task megablast -max_hsps 1. Only the alignments in which the IGS sequences were mapped to the IGSs of the intact 45S rDNAs with more than 99% identity were retained.
Step 2: For each of the three starting points, the best 5 hit reads were further selected based on the retained alignments in step 1, and were mapped back to the corresponding 45S rDNAs to be extended using BLASTN 18 (v2.9.0). The read with the best hit were selected for extension.
Step 1 and step 2 were iterative performed by in-house script. If there were two or more reads matched the thresholds of this best-hit-based method, one of which will be randomly selected for extending. In addition, manual check and extension were needed when TE insertion happened or no PacBio reads were found for extension based on the thresholds set for this best-hit-based method.

Copy number estimation of rDNAs
The copy number of 5S and 45S rDNAs in the genome was estimated by the blast-based method using both ONT ultra-long and PacBio HiFi data. The sequences of rDNA harbored on the data were identified by BLASTN 18

Determination of the origin of the unmapped ONT reads
Except for properly mapped reads, fused reads, and symmetrical reads, the remained ONT reads were mapped to the maize organelle genomes (NCBI Genebank access ID: assembly. Consequently, the completeness of the Mo17 assembly was estimated to be 99.92%. PacBio HiFi reads corresponding to the remained 0.08% k-mers were collected and then aligned to the final Mo17 assembly using Minimap2 17 with the parameters of '-x map-pb -r 1000 -N 50'. According to the alignments, the locations of these k-mers were determined. A k-mer originated from multiple locations and with frequency lower than 30 (the cut-off of solid k-mers) for each location was redefined as un-solid k-mer. For the k-mers with the frequency of at least one genomic location higher than 30, the read depths for corresponding genomic locations were checked. A k-mer was considered to be introduced by sequencing errors within reads and base errors harbored in the assembly if its corresponding read depth was normal (between 100 to 250, genome-wide average: 180.7).

Gene annotation
Both ab initio prediction and evidence-based prediction were used to predict the protein-coding genes in the Mo17 genome. Prior ab initio prediction, the repeat sequences were masked using RepeatMasker 23   ONT reads depth higher than 250 for the T2T assembly of Mo17 genome, which ultra-long ONT reads longer than 10 kb were used for analysis. HCR1 was related to subtelomeric repeats. HCR5, 6, 8, and 10 were related to TEs. HCR2, 3, 4, 7 and 9 were related to genomic regions homologous with maize mitochondrion genome ( Supplementary Fig. 7).
types of ONT reads with sequencing errors.