Liriodendron genome sheds light on angiosperm phylogeny and species–pair differentiation

The genus Liriodendron belongs to the family Magnoliaceae, which resides within the magnoliids, an early diverging lineage of the Mesangiospermae. However, the phylogenetic relationship of magnoliids with eudicots and monocots has not been conclusively resolved and thus remains to be determined1–6. Liriodendron is a relict lineage from the Tertiary with two distinct species—one East Asian (L. chinense (Hemsley) Sargent) and one eastern North American (L. tulipifera Linn)—identified as a vicariad species pair. However, the genetic divergence and evolutionary trajectories of these species remain to be elucidated at the whole-genome level7. Here, we report the first de novo genome assembly of a plant in the Magnoliaceae, L. chinense. Phylogenetic analyses suggest that magnoliids are sister to the clade consisting of eudicots and monocots, with rapid diversification occurring in the common ancestor of these three lineages. Analyses of population genetic structure indicate that L. chinense has diverged into two lineages—the eastern and western groups—in China. While L. tulipifera in North America is genetically positioned between the two L. chinense groups, it is closer to the eastern group. This result is consistent with phenotypic observations that suggest that the eastern and western groups of China may have diverged long ago, possibly before the intercontinental differentiation between L. chinense and L. tulipifera. Genetic diversity analyses show that L. chinense has tenfold higher genetic diversity than L. tulipifera, suggesting that the complicated regions comprising east–west-orientated mountains and the Yangtze river basin (especially near 30° N latitude) in East Asia offered more successful refugia than the south–north-orientated mountain valleys in eastern North America during the Quaternary glacial period.

the contiguity of the Liriodendron genome assembly (Supplementary Table 3). All 23 sequence data have been deposited in the NCBI Sequence Read Archive under project 24 PRJNA418360. 25 26

Raw data processing in Illumina data 27
Low quality reads were filtered out and potential sequencing errors were removed or 28 corrected by k-mer frequency methodology. The following filtering criteria were 29 applied to reduce effects of sequencing errors on the assembly, thereby ensuring high 30 quality reads. 31

1) Reads with ambiguous bases (represented by the letter N) or poly-A structures. 32
2) Reads with ≥40% low-quality bases (base quality ≤7) in small insert size 33 libraries (170, 250, 500, and 800 bp). 34 3) Reads with adapter contamination: reads with ≥10 bp aligned to the adapter 35 sequence (≤3 bp mismatch allowed) were filtered out. 36 4) Small insert size reads in which read1 and read2 overlapped by ≥10 bp (10% 37 mismatch allowed). 38 5) PCR duplications (reads were considered duplicates when read1 and read2 of 39 the two paired-end reads were identical). 40 Low quality and duplicated reads were filtered out, 327.11 Gb of the L. chinense 41 genome was retained for the coming assembly (Supplementary Table 1 americana, the first two were sequenced in this study and the last one was available in 292 Ibarra-Laclette et al. 38 . To obtain as many genes as possible, we sequenced the mixed 293 tissues comprised of flowers, stems and leaves in both two Magnoliaceae plants and 294 the resting Lauraceae plant we selected was also sequenced based on mixed tissues 295 comprised of seeds, roots, stems, leaves, aerial buds and flowers 38 . The final numbers 296 of available protein sequences of these three magnoliids, Magnolia Grandiflora,297 Michelia alba and Persea americana,were 33,943,34,672 and 46,351,respectively. 298 First, we performed OG construction using OrthoFinder 37 . Then, we selected low-copy 299 OGs with the number of putative orthologues less than two in each species and putative 300 orthologues were found in at least four eudicots, four monocots, three magnnliids, one 301 basal angiosperm and one gymnosperm. 302 After that, a total of 1,163 low-copy OGs were separately aligned using Clustal Omega 303 v1.2.4 39 and all alignments were further trimmed by using TrimAl 1.2 40 . Next, we 304 constructed 1,163 single-gene trees by using RAxML v8.2.11 41 with the 305 PROTCATWAG mode. Finally, we rooted each OG tree using Gnetum montanum and 306 compared these single-gene trees with the species tree ( Supplementary Fig. 14) after 307 masking all magnoliids. Due to the limited number of informative sites in one gene, it 308 was hard to use a single-gene tree to resolve the relationship among the low-level 309 taxonomic hierarchies. Therefore, we selected the OGs with genes that, as they should, 310 formed a monophyletic gene clade within species of a monophyletic organismal group 311 (that is, eudicots and monocots) and the only one basal angiosperm, Amborella, was 312 sister to the clade of monocots and eudicots. After that, we unmasked all magnoliids 313 plants and excluded OGs with different magnoliids plants clustered with different 314 clades, that is eudicots, monocots and the clade of monocots and eudicots. In other 315 words, we only selected OGs with all magnoliids plants clustered with the same clade 316 (see examples in Supplementary Fig. 15), ultimately resulting in 502 low-copy OGs. 317 Finally, we classified these OGs according to which clade the magnoliids clustered with 318 into a sister group, ultimately resulting in three alternative topologies. 319 320

Phylogenetic signal quantification 321
We calculated phylogenetic signal as described in Sheng et al. 42 . Simply, we first 322 calculated the site-wise log-likelihood scores for the ML tree constrained to three 323 alternative topologies. Then, we calculated the difference in site-wise log-likelihood 324 scores (ΔSLS) between these three alternative topologies for every site. Next, by 325 summing the ΔSLS scores of all sites, we could obtain the difference in gene-wise log-326 likelihood scores (ΔGLS) between three alternative topological hypotheses. After that, 327 we could quantify the distribution of phylogenetic signal for these three alternative 328 phylogenetic topologies at the gene level, that is, we could count the number of genes 329 supporting for each alternative topology. Among the 506 low-copy OGs, 166 supported 330 the topology I, 167 supported the topology II and the final 169 OGs supported the 331 topology III with no statistically significant difference ( Supplementary Fig. 16). 332 In addition, we also excluded the OGs with ΔGLS values being outlier. The outlier 333 ΔGLS values were well defined 31 and we calculated the upper whisker and the lower 334 whisker and excluded the OGs with absolute ΔGLS values greater than the upper 335 whisker or smaller than the lower whisker, resulting in 481 low-copy OGs with 157 336 OGs supporting topology I, 159 OGs supporting topology II and the final 165 OGs 337 supporting topology III (Fig. 2b), showing an equal distribution of phylogenetic signal 338 for each topology at gene level. 339 340

Species tree estimation 341
We estimated the phylogenetic tree based on these 502-OG gene trees and 481-OG gene 342 trees using ASTRAL 5.6.1 43 ( Supplementary Fig. 17). In addition, we also extracted 343 and concatenated 78 genes from chloroplast genomes of 24 species for phylogenetic 344 analysis ( Supplementary Fig. 18 with the sample frequency set to 50, after a burn-in of 5,000,000 iterations. Parameters 360 of "finetune" were set at "0.004, 0.016, 0.01, 0.10, 0.58". Other parameters were set at 361 default values. 362 363

Gene family expansion and contraction 379
We used Café v4.0.1 47 , a random birth and death model proposed to study gene gain 380 and loss in gene families across a user-specified phylogenetic tree, to identify gene 381 families that had undergone expansion or contraction across the ML tree that was 382 constructed based on the 235-gene data set. Usually, a global parameter λ (lambda), 383 which describes both gene birth (λ) and death (µ, equal to -λ) rate across all branches 384 in the tree for all gene families is estimated using maximum likelihood. Then, a 385 conditional p-value is calculated for each gene family, and families with a conditional 386 p-value less than the threshold (0.05) will be considered as having an accelerated rate 387 of gain and loss. Also, branches responsible for a low overall p-value of significant 388 families will be identified. 389

SNP calling 407
Paired-end reads (100bp or 150bp) obtained from sequencing were mapped to the de 408 novo genome with BWA 50 . After the alignment, results in the SAM file format were 409 converted to bam format using SAMtools v1.3.1 51 . These bam files were sorted and 410 duplicated reads were marked by Picard pack tools. SNP detection was carried out by 411 the Genome Analysis Toolkit (GATK, version 3.2.2) 52 . As there is a low-quality 412 alignment around an indel region, two steps of realignment were implemented in GATK: 413 the RealignerTargetCreator package was used to identify regions which need 414 realignment in the first step. Then the IndelRealigner performed realignment of regions 415 found in the first step. SNP calling was performed with UnifiedGenotyper and Samtools 416 mpileup, then SelectVariants was used to combine the raw vcf files as dbSNP, which 417 was created by SAMtools and UnifiedGenotyper, filtering raw SNPs with "QD <20.0 418 or ReadPosRankSum <-8.0 or FS >10.0 or QUAL <meanqual". After that, base-quality 419 score recalibration was performed with BaseRecalibrator and the realigned bam file 420 was reduced by PrintReads and ReduceReads. In the next step CombineVariants was 421 used to combine the individual Gvcf files into a combind population of vcf files as a 422 dbSNP. Based on the dbSNP data and the BaseRecalibrator BAM files, GATK was used 423 to call raw SNPs and indels using parameters from UnifiedGenotyper. After obtaining 424 the raw result, VQSR, then VariantFiltration were used to filter some low-quality SNPs America. In the ML tree, DBS did not cluster into the east group of China and was 442 positioned the same as LY. Intriguingly, DBS is geographically close to LY. In general, 443 both NJ and ML trees clustered these Liriodendron individuals into three main 444 geographical groups. In addition, ped files were created as input for PLINK version 445 1.07 with parameters "--ped ped_file --recode12 --geno 0.5 --map output_map". Then 446 the program FRAPPE v1.1 56 was utilized to infer population structure and ancestry 447 information. The analysis was based on 13.3M SNP sites and we did not assume any 448 prior information about their ancestry. We ran 10,000 iterations and pre-defined the 449 number of clusters, K, from 2 to 5. ADMIXTURE v1.3.0 57 was used to find the best K 450 value based on a cross-validation test. We performed a PCA following the procedure as 451 reported. The eigenvector decomposition of the transformed genotype data was 452 performed using the R function eigen, and the significance of the eigenvectors was 453 determined with a Tracey-Widom test, implemented in the program twstats, provided 454 by EIGENSOFT 3.2 58 . 455 Nucleotide diversity (π) 59 and the Watterson estimator (θ w ) 60 were used to measure the 456 degree of variability within a population or species 61 . F st was used to measure the 457 population differentiation and genetic distance, based on genetic polymorphism data.

PSMC analysis 462
The PSMC model, originally applied to human genomes 62 , after which it was also 463 applied to plant genomes 63,64 , was used to study the effective population size (N e ) of the 464 two Liriodendron species over time. PSMC inferred the local time since the most recent 465 common ancestor on the basis of the local density of heterozygotes by use of a hidden 466 Markov model in which the observation is a single diploid sequence 62 . PSMC utilizes 467 sequence reads that are mapped to a reference genome to estimate historical fluctuations 468 in N e . To scale PSMC results to real time, we assumed 6 years per Liriodendron 469 generation (g) and a per-generation mutation rate (µ) of 7 × 10 -9 . PSMC was otherwise 470 conducted using default parameters. 471 For all L. chinense, the first bottleneck occurred about 0.9 million years ago (Fig. 4), Lake Period (30,000-40,000 years ago) or after retreat of the Quaternary glaciation 482 (after 20,000 years ago), indicating that L chinense might have migrated and been 483 restricted to its glacial refugia, widely scattered in eastern Asia. 484 For L. tulipifera, there was a sustained decrease of population since the Late Miocene 485 (Fig. 4). The population bottleneck occurred approximately 0.2 million years ago, 486 around the time of Penultimate Glaciation. Then, the population of L. tulipifera 487 experienced a period of explosive growth and achieved its peak during the warm 488 Greatest Lake Period (30,000-40,000 years ago), after which it stayed stable. with single-copy orthologs. Bioinformatics 31, 3210-3212 (2015). 518

genome.
Annotation Edit Distance (AED) provides a measurement for how well an annotation agrees with overlapping aligned ESTs, mRNA-seq and protein homology data. AED values range from 0 and 1, with 0 denoting perfect agreement of the annotation to aligned evidence, and 1 denoting no evidence support for the annotation. i.

Supplementary Figure 4. Assembly quality control by assembled pooled BACs.
We assembled 89 BAC sequences and mapped these BACs back to the genome assembly. Nine random alignments that indicate a low error rate are shown here. Most of the BAC sequences were covered and fewer gaps were observed in these BAC sequences than in the genome assembly.      The phylogenetic tree was constructed from 78 concatenated chloroplast gene sequences that were shared among 24 plant species using the ML method. Numbers associated with nodes are bootstrap values.  We found monocot-and dicot-specific gene families based on phylogenetic profiles in the Monocots PLAZA 3.0 database. We manually selected all the species that came from the target clade, i.e., monocots or dicots, for identifying clade-specific gene families with all species included and setting the gene number =0 within nontarget clade species. Finally, we separately obtained 93 monocot-and 114 dicot-specific gene families. The 20 inner tracks depict SNP frequency distributions for 1-Mb non-overlapping windows in the seven L. chinense that came from Western China, one L. chinense that came from Central China, six L. chinense that came from Eastern China, and six L. tuplifera that came from North America. The ML tree of all accessions constructed from whole-genome SNPs. Accessions coming from the same geographic areas are grouped together and colored corresponding to colors used in Figure 3. Varying the number of presumed ancestral populations (K) showed that 20 Liriodendron resequenced individuals were divided into two groups, L. chinense and L. tulipifera, when K = 2, and three distinct groups when K = 3 (b). (a) The relative positions of the lateral sinus located in the left half of the leaf were plotted. The X-axis represents the ratio of the vertical distance from the lateral sinus to the primary vein (x 1 ) to the vertical distance from the lateral lobe to the primary vein (x 2 ). The Y-axis represents the ratio of the vertical distance from the lateral sinus to the leaf blade base (y 1 ) to the vertical distance from the apical lobe to the leaf blade base (y 2 ). (b) The representative leaf shapes of three groups were plotted respectively. The X-axis represents the three Liriodendron groups supported by the SNP tree, PCA and structure analysis. Six, six, and seven individuals were separately included within these three group from left to right. The Y-axis represents inter-individual SNPs within three groups. The number of inter-individual SNP ranged from 4,224,002 to 5,710,354 with a mean value of 4,766,498 in the North America group, from 4,456,851 to 6,091,489 with a mean value of 5,485,145 in the Eastern China group, and from 6,793,165 to 8,407,025 with a mean value of 7,446,489 in the Western China group.