Chromosomal level genome assemblies of two Malus crabapple cultivars Flame and Royalty

Malus hybrid ‘Flame’ and Malus hybrid ‘Royalty’ are representative ornamental crabapples that are rich in flavonoids and serve as preferred materials for studying coloration mechanisms. We generated high-quality, chromosome-level, haplotype-resolved genomes for both cultivars: the two haplotypes of ‘Flame’ were 688.2 Mb and 675.7 Mb in size, and those of ‘Royalty’ were 674.1 Mb and 663.6 Mb. All four assemblies were anchored to 17 chromosomes and had a BUSCO completeness score of nearly 99.0%. A total of 47,833 and 47,307 protein-coding genes were annotated in the two haplotype genomes of ‘Flame’, and 46,305 and 46,920 in those of ‘Royalty’, respectively. The assembled high-quality genomes offer new resources for studying the origin and adaptive evolution of crabapples and the molecular basis of flavonoid and anthocyanin accumulation, facilitating molecular breeding of Malus plants.


Background & Summary
Malus hybrid 'Flame' ('Flame') and Malus hybrid 'Royalty' ('Royalty') are representative ornamental crabapples of the genus Malus in the rose family (Rosaceae). 'Flame' belongs to the ever-green leaf category, with green leaves and white flowers, while 'Royalty' belongs to the ever-red leaf category, with purple-red leaves, flowers and fruits, and the fruit is fetal red 1. Both 'Royalty' and 'Flame' are rich in flavonoids 2. In 'Royalty', 17, 17, 15 and 9 kinds of flavonoids were detected in the leaves, flowers, peel and flesh, respectively, and 15, 17, 11 and 9 kinds were detected in the same tissues of 'Flame'. In addition, Li et al. identified a putative transcription factor, MdMYB8, associated with flavonol biosynthesis through transcriptome analysis of 'Flame' fruit at five consecutive developmental stages 3. Flavonoids are an important class of natural organic compounds with a wide range of biological activities. Previous studies have shown that flavonoids exhibit strong antioxidant activity and possess various pharmacological functions, such as antibacterial, anti-inflammatory, anti-tumor, and anti-diabetic effects [4][5][6]. Therefore, 'Royalty' and 'Flame', as natural carriers for the synthesis and accumulation of flavonoids, have significant utilization value and strong development potential 7,8. Studying their genomes contributes to research on the pathways of flavonoid accumulation.
In addition, 'Royalty' and 'Flame' are preferred materials for studying plant coloration mechanisms because the colors of their tissues differ markedly. For example, the key anthocyanin regulator McMYB10 was identified in the leaves and petals of crabapple and related to anthocyanin accumulation in 'Royalty', a crabapple cultivar with red-colored leaves and flowers 9,10. Subsequently, investigation of leaf development in the two crabapples revealed the McMYB10 target gene McF3'H 10, the McDFR1 promoter 11, and the McMYB10-specific ubiquitin E3 ligases McCOP1-1 and McCOP1-2 12; in addition, the transcription factor McMYB12, which promotes the accumulation of proanthocyanidins, was discovered 13. Furthermore, the endogenous McCHS gene was shown to be a critical factor in petal coloration by comparing the flavonoid and anthocyanin contents of three typical crabapple cultivars with different petal colors 14. Thus, the genomic data obtained in this study lay the foundation for subsequent multi-omics investigations of the molecular mechanisms of anthocyanin synthesis, which is important both for understanding the coloration trait and for improving color breeding of these ornamental crabapples.
In this study, we present high-quality genomes for Malus hybrid 'Royalty' and Malus hybrid 'Flame' generated using PacBio, Illumina, and Hi-C technologies. K-mer analysis showed that the heterozygosity of 'Flame' was 2.89% and its genome size was ~691.2 Mb, while the heterozygosity of 'Royalty' was 1.78% and its genome size was ~685.4 Mb, confirming that both 'Royalty' and 'Flame' are highly heterozygous diploids (Fig. 1). The larger haplotype assembly of 'Flame' (hapA) was 688.2 Mb with a contig N50 of 31.6 Mb, and the other (hapB) was 675.7 Mb with a contig N50 of 35.6 Mb. The two haplotype genomes of 'Royalty' were 674.1 Mb with a contig N50 of 23.7 Mb (hapA) and 663.6 Mb with a contig N50 of 28.7 Mb (hapB) (Table 1). The assembled contigs were all further anchored to 17 pseudo-chromosomes, with an anchoring rate of 93.4% in 'Flame'-hapA, 96.4% in 'Flame'-hapB, 92.2% in 'Royalty'-hapA and 95.4% in 'Royalty'-hapB (Table 1, Fig. 2). In each of the two haplotype genomes of 'Flame', 5 chromosomes were assembled with single-ended telomeres, 11 with double-ended telomeres, and only 1 without telomeres. In 'Royalty'-hapA, 4 chromosomes were assembled with single-ended telomeres and the rest with double-ended telomeres, while in 'Royalty'-hapB, 8 chromosomes had single-ended telomeres and the other 9 had double-ended telomeres (Fig. 3). A total of 47,833 and 47,307 protein-coding genes were identified in the two haplotype genomes of 'Flame', respectively, and almost all of them were functionally annotated. All of the 46,305 and 46,920 protein-coding genes in the two haplotype genomes of 'Royalty' could be functionally annotated (Tables 2, 3). The final assemblies showed high gene completeness ('Royalty': 98.9% for hapA and 99.0% for hapB; 'Flame': 98.9% for hapA and 99.0% for hapB). The assembled high-quality genomes of Malus hybrid 'Royalty' and Malus hybrid 'Flame' should be a valuable resource for future conservation genomics studies and for investigations of flavonoid accumulation and anthocyanin synthesis.
For Hi-C, leaves were fixed in 1% (vol/vol) formaldehyde for library construction. The Hi-C library construction procedure, including cell lysis, chromatin digestion, proximity ligation, DNA recovery and subsequent DNA manipulations, was performed according to a previously described method 15. DpnII was used as the restriction enzyme for chromatin digestion. The Hi-C library was sequenced on the Illumina NovaSeq 6000 platform to generate 150 bp paired-end reads.
Genome survey and analysis.
A total of 36 Gb and 26 Gb of high-quality HiFi reads for 'Flame' and 'Royalty', respectively, were obtained on the PacBio Sequel II platform and used for genome size and ploidy analysis. Jellyfish (v2.2.10) 16 was used for k-mer counting of the reads from each genome. The reads were cut into 21-base sequences, the total number of 21-mers and the frequency of each 21-mer were counted, and the distribution of 21-mer frequencies was plotted. The resulting 21-mer frequency matrix was then used to estimate the haplotype genome size and heterozygosity of 'Flame' and 'Royalty', and to predict ploidy, using GenomeScope (v2.0) 17. The genome sizes of 'Flame' and 'Royalty' were estimated to be 691,264,141 bp and 685,420,647 bp, respectively, and the heterozygosity rates were estimated to be 2.89% and 1.78%, respectively. The k-mer analysis indicated that both 'Flame' and 'Royalty' are highly heterozygous diploids (Fig. 1).
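The 21-mer counting step can be illustrated with a minimal Python sketch. It is not the Jellyfish implementation and is only practical for toy inputs; the function names and the `canonical` option (merging each k-mer with its reverse complement) are assumptions introduced here for illustration. The sketch counts every k-mer in a set of reads and then builds the frequency histogram (multiplicity versus number of distinct k-mers) of the kind that a tool such as GenomeScope takes as input.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_kmers(reads, k=21, canonical=True):
    """Count k-mers across all reads, skipping windows with non-ACGT bases.

    With canonical=True (an assumption of this sketch), a k-mer and its
    reverse complement are counted under the same key.
    """
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if not set(kmer) <= set("ACGT"):
                continue
            if canonical:
                kmer = min(kmer, reverse_complement(kmer))
            counts[kmer] += 1
    return counts


def kmer_spectrum(counts):
    """Map each multiplicity to the number of distinct k-mers seen that often."""
    return dict(sorted(Counter(counts.values()).items()))


# Toy example with k=5 instead of 21 so the result is easy to inspect.
toy_counts = count_kmers(["ACGTACGTAC"], k=5, canonical=False)
toy_spectrum = kmer_spectrum(toy_counts)  # {1: 2, 2: 2}
```

In a real survey the histogram would be computed over all HiFi reads with k = 21, and the positions and heights of its peaks are what allow genome size and heterozygosity to be estimated.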

Genome assembly.
Contigs were assembled de novo from PacBio HiFi reads to generate a phased assembly graph, and Hi-C reads were then utilized to link unitigs that share mapped fragments, using hifiasm (v0.16.1) 18 with the parameters (--hom-cov 34 --n-weight 6 -s 0.45 -O 2). Following that, contigs were anchored to 34 chromosomes in total using Juicer 19 and 3D-DNA 20 (-m haploid -r 0) based on Hi-C interaction data ('Flame': 60 Gb, ~100×; 'Royalty': 80 Gb, ~133×) (Fig. 2). Subsequently, the assembled genomes were manually corrected with Juicebox 21, including correcting chromosome boundaries, rejoining misjoins, and addressing inversions and translocations, and the final genomes were generated using the agp2fa mode of RagTag 22 based on AGP files recording the contigs of each chromosome. The total lengths of the two chromosome-level haplotype-resolved genomes of 'Flame' were 642.9 Mb (hapA) with a contig N50 of 31.6 Mb and 651.8 Mb (hapB) with a contig N50 of 35.6 Mb, and those of 'Royalty' were 628.5 Mb (hapA) with a contig N50 of 23.7 Mb and 637.4 Mb (hapB) with a contig N50 of 28.7 Mb, with anchoring rates higher than 92% for all haploid genomes (Table 1).
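For readers less familiar with the assembly statistics reported here, the following Python sketch shows how a contig N50 and an anchoring rate are conventionally computed. The function names are hypothetical, and the anchoring-rate definition (anchored length divided by total assembly length) is an assumption of this sketch rather than a formula stated in the text; the example values are the 'Flame'-hapA lengths given above.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


def anchoring_rate(anchored_length, total_length):
    """Percentage of the assembly placed on pseudo-chromosomes
    (assumed definition: anchored length / total length)."""
    return 100.0 * anchored_length / total_length


# Toy contig set: total 20, and 6 + 5 = 11 covers at least half, so N50 = 5.
toy_n50 = n50([2, 3, 4, 5, 6])

# 'Flame'-hapA: 642.9 Mb anchored out of a 688.2 Mb assembly.
flame_hapa_rate = round(anchoring_rate(642.9, 688.2), 1)  # 93.4
```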
The telomere sequences were detected with TRF 23, and most of the chromosomes were assembled to the telomeres. For example, in 'Flame', 5 chromosomes were assembled with single-ended telomeres, 11 with double-ended telomeres, and only 1 without telomeres, while in 'Royalty'-hapA, 4 chromosomes were assembled with single-ended telomeres and 13 with double-ended telomeres, confirming the high integrity and continuity of the assembled genomes (Fig. 3).
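The classification of chromosomes into single-ended, double-ended and telomere-free categories can be sketched as follows. This is not the TRF-based procedure used in the study but a simplified stand-in: it assumes the common plant telomeric repeat TTTAGGG (CCCTAAA on the opposite strand), and the window size and minimum copy number are hypothetical thresholds chosen for illustration.

```python
import re

# Assumption: the Arabidopsis-type plant telomeric repeat and its reverse complement.
TELOMERE_MOTIFS = ("CCCTAAA", "TTTAGGG")


def has_telomere(window, min_copies=4):
    """True if the window contains at least min_copies tandem copies of either motif."""
    window = window.upper()
    return any(
        re.search(f"(?:{motif}){{{min_copies},}}", window) is not None
        for motif in TELOMERE_MOTIFS
    )


def classify_chromosome(seq, window=1000, min_copies=4):
    """Classify a chromosome by how many of its two ends carry telomeric repeats."""
    # Keep the two end windows from overlapping on short sequences.
    window = min(window, len(seq) // 2)
    ends_with_telomere = (
        has_telomere(seq[:window], min_copies)
        + has_telomere(seq[len(seq) - window:], min_copies)
    )
    return ("none", "single-ended", "double-ended")[ends_with_telomere]


# Toy chromosomes built from a filler sequence and synthetic telomere arrays.
body = "ACGT" * 50
both_ends = classify_chromosome("CCCTAAA" * 5 + body + "TTTAGGG" * 5)  # "double-ended"
one_end = classify_chromosome("CCCTAAA" * 5 + body)                    # "single-ended"
```

Counting the three labels over the 17 pseudo-chromosomes of a haplotype gives a per-haplotype summary of the kind reported above.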

Genome annotation.
Repeat sequences were annotated de novo by first constructing a repeat database with RepeatModeler (v1.0.11) 24 using the parameters (-database -pa 5). The constructed database was then imported into RepeatMasker (v4.1.2) 25 to identify transposons and low-complexity repeats in the DNA sequences, and TRF (v4.09) 23 was used to identify tandem repeats. Both the 'Royalty' and 'Flame' genomes were found to be highly repetitive. In 'Flame', 64.76% of the genome consisted of repetitive sequences, the largest class being LTR retrotransposons (about 37.74%). In 'Royalty', 64.59% of the genome consisted of repetitive sequences, and LTR retrotransposons were again the largest class (about 34.62%).
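Because annotations from different tools can overlap, a genome-wide repeat percentage is usually obtained by merging overlapping intervals before summing their lengths. The Python sketch below illustrates that calculation under stated assumptions: intervals are half-open (start, end) pairs on a single sequence, and the function names and toy coordinates are hypothetical, not taken from the study.

```python
def merged_length(intervals):
    """Total number of bases covered by half-open (start, end) intervals,
    counting overlapping regions only once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block before starting a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def repeat_percentage(intervals, sequence_length):
    """Percentage of a sequence covered by annotated repeats."""
    return 100.0 * merged_length(intervals) / sequence_length


# Toy example: two overlapping annotations and one separate annotation
# on a 100 bp sequence cover 15 + 10 = 25 bases.
toy_percentage = repeat_percentage([(0, 10), (5, 15), (20, 30)], 100)  # 25.0
```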
To annotate complete and accurate gene structures, a strategy combining transcriptome-based, protein homology-based, and ab initio prediction was employed 26. For transcript-based prediction, two sets of published transcriptome data (BioProject PRJNA546094 27 for 'Flame' and BioProject PRJNA546107 28 for 'Royalty') were mapped to the respective assembled genomes with HISAT2 (v2.2) 29. The mapped reads were assembled with StringTie (v1.3) 30, and the longest transcripts were retained as EST evidence. For protein-based homology, the protein sequences of the sequenced apple genomes of 'Golden Delicious' (Malus domestica cv. Golden Delicious), 'Hanfu' (Malus domestica cv. Hanfu), 'Gala' (Malus domestica cv. Gala), European wild apple (Malus sylvestris) and wild apple (Malus sieversii) were used for homology prediction with Exonerate 31. For ab initio prediction, the assemblies were hard-masked according to the repeat annotation, and Augustus (v3.4) 32 and BRAKER2 33 were then used to train a gene prediction model based on the transcripts. Protein-coding genes were then predicted using BRAKER2 with the trained model. Finally, the predictions generated by the above methods were integrated into the final annotation file using MAKER (v3.1) 34. Comparison of the protein-coding genes with databases of conserved single-copy orthologs using BUSCO (v4.1) 35 showed that the two haplotype genomes of 'Flame' contained about 99.0% and 98.8% of the conserved plant genes as complete copies, and those of 'Royalty' contained about 98.9% and 99.0%, respectively (Table 2).
Functional annotation of the annotated protein-coding genes followed a standard workflow: (i) Diamond (v2.0) 36 was run with an E-value threshold of 1e-4 against GenBank-NR 37, Swiss-Prot 38, TrEMBL 39 and the Arabidopsis protein database 40; (ii) InterProScan (v5.59) 41,42 was used to identify functional protein domains against the InterPro 42 database; (iii) the alignment results from the GenBank-NR database were combined with the functional domains identified by InterPro for Gene Ontology (GO) 43 annotation using the Blast2GO (v2.2) 44 program; (iv) the annotation results from Swiss-Prot, TrEMBL, and the Arabidopsis protein database were combined using the AHRD (v3.3) program; (v) the Kyoto Encyclopedia of Genes and Genomes (KEGG) 45 database was also consulted for KEGG functional annotation in Blast2GO (v2.2); (vi) transcription factors (TFs), transcriptional regulators (TRs) and protein kinases (PKs) were predicted for the protein-coding genes using the iTAK 46 software.
The final annotation results showed that the hapA and hapB genomes of 'Flame' contain 47,833 and 47,307 genes, respectively. For 'Royalty', 46,305 genes were annotated in the hapA genome and 46,920 in the hapB genome (Table 2). For 'Flame', comparison of the protein-coding genes with the GenBank-NR, Swiss-Prot, Arabidopsis protein, and TrEMBL databases annotated 94,621, 66,487, 76,886, and 91,313 genes, respectively. A total of 44,046 genes were matched to the GO database and 41,447 genes were linked with pathway annotations. In addition, 6.58% of genes were identified as TFs/TRs and 3.16% were labeled as PKs. For 'Royalty', 93,225, 66,215, 76,165 and 90,160 genes were matched with the GenBank-NR, Swiss-Prot, Arabidopsis protein, and TrEMBL databases, respectively. Additionally, 43,776 and 41,123 genes were annotated by GO and KEGG, respectively. The total number of predicted transcription factors and transcriptional regulators was similar to that of 'Flame', but 276 more protein kinases were identified than in 'Flame', accounting for 3.58% of total genes (Table 3).
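Step (i) of the functional annotation workflow can be sketched in Python as a post-filter on alignment output. The sketch assumes the standard 12-column BLAST-style tabular format (query, subject, identity, length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score) and keeps, for each query, the hit with the highest bit score among those passing the 1e-4 E-value threshold. The best-hit rule, the function name and the gene and protein identifiers are assumptions made for illustration, not details reported by the study.

```python
import csv
import io


def best_hits(tsv_text, max_evalue=1e-4):
    """Return {query: (subject, evalue, bitscore)} for the top-scoring hit
    per query among hits with evalue <= max_evalue."""
    best = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical hits: gene1 has two passing hits, gene2 only a weak one.
example = "\n".join(
    "\t".join(fields)
    for fields in [
        ("gene1", "protA", "90.0", "100", "10", "0", "1", "100", "1", "100", "1e-50", "200"),
        ("gene1", "protB", "95.0", "100", "5", "0", "1", "100", "1", "100", "1e-60", "250"),
        ("gene2", "protC", "30.0", "50", "35", "2", "1", "50", "1", "50", "1e-2", "30"),
    ]
)
annotated = best_hits(example)  # only gene1 is retained, with protB as its best hit
```

Counting the keys of the returned dictionary for each database gives per-database totals of annotated genes analogous to those summarized in Table 3.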

Fig. 2 Hi-C interaction analysis and circos map. (a) Hi-C interaction heatmap of 'Flame'. (b) The circos map of 'Flame'. (c) Hi-C interaction heatmap of 'Royalty'. (d) The circos map of 'Royalty'. For the circos map, the tracks from outside to inside are: (i) chromosome ID and length, (ii) density of protein-coding genes, (iii) density of LTR elements, (iv) GC content, (v) density of structural variations, (vi) paralog synteny relationships.

Table 1. Summary of Malus hybrid 'Flame' and Malus hybrid 'Royalty' genome assembly data.

Table 2. Overview of genome assembly and annotation.