Oak genome reveals facets of long lifespan

Oaks are an important part of our natural and cultural heritage. Not only are they ubiquitous in our most common landscapes1 but they have also supplied human societies with invaluable services, including food and shelter, since prehistoric times2. With 450 species spread throughout Asia, Europe and America3, oaks constitute a critical global renewable resource. The longevity of oaks (several hundred years) probably underlies their emblematic cultural and historical importance. Such long-lived sessile organisms must persist in the face of a wide range of abiotic and biotic threats over their lifespans. We investigated the genomic features associated with such a long lifespan by sequencing, assembling and annotating the oak genome. We then used the growing number of whole-genome sequences for plants (including tree and herbaceous species) to investigate the parallel evolution of genomic characteristics potentially underpinning tree longevity. A further consequence of the long lifespan of trees is their accumulation of somatic mutations during mitotic divisions of stem cells present in the shoot apical meristems. Empirical4 and modelling5 approaches have shown that intra-organismal genetic heterogeneity can be selected for6 and provides direct fitness benefits in the arms race with short-lived pests and pathogens through a patchwork of intra-organismal phenotypes7. However, there is no clear proof that large-statured trees consist of a genetic mosaic of clonally distinct cell lineages within and between branches. Through this case study of oak, we demonstrate the accumulation and transmission of somatic mutations and the expansion of disease-resistance gene families in trees.

clone was selected by visualizing length polymorphisms between PCR products or by direct 170 sequencing of the products of PCR amplification for biallelic markers. 171

Sequencing and annotation of BAC inserts 172
In total, 34 BAC inserts were fully sequenced and annotated. Selected clones were cultured 173 individually on LB medium and DNA was isolated by the standard alkaline method 16 . DNA 174 inserts were sequenced in pools at 40x coverage, with the 454 mate-pair (5 kb) procedure. The 175 sequence reads were assembled with Newbler (version MapAsmResearch-04/19/2010-patch-176 08/17/2010). Additional sequencing was carried out with the Illumina MiSeq platform 177 (paired-end overlapping reads of 2 × 250 bp) at a coverage depth of 400x, and GapCloser (-178 V1.12-6) software was used to reduce the gaps between contigs. 179 Repeated elements were identified and classified in a two-step approach: i) Censor software 180 and Repbase V21.08 18 were initially used (see Plomion et al. 19 ), but detection was limited to 181 BAC insert regions displaying identity to sequences in Repbase; ii) searches for the remaining 182 repeated elements were performed with the library of consensus repetitive elements presented 183 in section 3.1. Structural and functional gene annotations were added to the BAC sequence, as 184 described in the approach presented in section 3.3, using: i) Eugene to integrate ab initio and 185 similarity-based gene finding programs 20 , and ii) FunAnnotPipe, an in-house bioinformatic 186 pipeline based principally on InterproScan 21 . The data were then manually curated with 187 BLAST tools from the NCBI website and NetGen2 22 for the confirmation of exon/intron 188 boundaries. Transcript evidence (ESTs and oak unigenes 23 were used to establish gene model 189 structures. We also used FGENESH 24 and Augustus 25 software to confirm or update Eugene 190 predictions, with Vitis vinifera or Theobroma cacao as the model. Some short-gene models 191 (encoding < 50 amino acids) were removed. Manually curated genes were then compared 192 with gene models predicted from the genome sequence. 193 generated graphical representations (see Supplementary Fig. 11, Supplementary Fig. 12, 291 Supplementary Fig. 13, Supplementary Fig. 27, Supplementary Fig. 28, Supplementary 292 Fig. 29) highlighting the advantages of long reads for differentiating between haplotypes. The 293 V1 release often merged the two haplotypes into a single scaffold for the three genomic 294 regions, whereas the V2 release contained a pair of scaffolds for each pair of BACs. 295 Furthermore, the V2 scaffolds showed fewer switches between haplotypes than the V1 release 296 (see Supplementary Fig. 11, Supplementary Fig. 12, Supplementary Fig. 13, 297 Supplementary Fig. 27, Supplementary Fig. 28, Supplementary Fig. 29). 298

Pseudomolecule construction 299
Scaffolding can extend the contiguity of a genome sequence assembly by orders of magnitude 300 relative to contigs, but the construction of a chromosome-scale genome requires either 301 physical or genetic maps to anchor the scaffolds. We used a genetic map for this purpose. A 302 composite genetic map was first established with LPmerge software 31 , bringing together 5,589 303 already mapped EST-SSR and SNP markers from eight individual linkage maps 17,32 304 (Supplementary Data Set 2 sheet #1), including one map for accession '3P' used to establish 305 the reference genome sequence. Gene model sequences for the 5,589 mapped loci were then 306 aligned with the 1,409 scaffold sequences, using BLAT 29 (> 95% identity). In total, 2,615 307 unique scaffold/marker relationships (Supplementary Data Set 2 sheet #2) were identified 308 and classified into four categories (Supplementary Table 28). Overall, the scaffold-309 anchoring strategy (taking into account 2,285 markers from the most reliable assigned 310 scaffolds, i.e. from categories 1, 2 and 3) delivered 612 (43%) anchored scaffolds, covering 311 624.8 Mb (77%) of the haplome. Additional scaffolds were then anchored onto the 12 oak 312 linkage groups, according to the synteny-driven strategy illustrated in Supplementary Fig. 2  Based on scaffold order and orientation on the 12 chromosomes, the oak genome browser was 334 populated with a "marker" track including an optimized set of markers tolerant of inversions 335 between physical and genetic positions within a maximum window of 5 cM. This track was 336 designed to project the position of any quantitative trait locus (QTL) from the eight individual 337 linkage maps onto the oak genome sequence, to facilitate subsequent biological interpretation 338 of their genetic bases. The track was created according to the procedure described in 339 Supplementary Fig. 30. We found that 2,127 of the 2,615 markers (retained for scaffold 340 anchoring) fitted the criteria presented in Supplementary Fig. 30 (referred to as set#1 in 341 Supplementary Data Set 2 sheet #1), and 1,943 were retained from the other set of 2,974 342 markers initially excluded from scaffold anchoring (set#2 in Supplementary Data Set 2 343 sheet #1). As a result, the "marker" track included 4,070 markers spanning the 12 linkage 344 groups (red horizontal lines in Supplementary Fig. 31). The alignment of each marker set 345 with the 12 chromosomes is shown in Supplementary Fig. 32. Overall, the rank correlation 346 between genetic and physical positions ranged from 0.991 to 0.999 (Supplementary Table  347 29). 348

Detection and annotation of transposable elements 350
As in other sequenced plant genomes, the class I retrotransposon fraction predominated (70% 351 of TE sequences), consisting of 53% LTRs (long terminal repeats: 26% Gypsy-like and 21% 352 Copia-like) and 16% non-LTR retrotransposons (mostly LINE). Class II DNA transposons 353 accounted for 15% of TE sequences, and 92% of the transposons in this fraction were TIRs 354 (terminal inverted repeats) (Supplementary Fig. 3, Supplementary Table 4). 355 Thirteen of the 1,750 consensus sequences (0.6% of the TE content) were further 356 characterized as Caulimoviridae sequences (see section 3.2). Supplementary Table 30 shows 357 a comparative analysis of TEs across the 16 species (including oak) used for the comparative 358 genomic analysis in section 4. We found no correlation between TE content and the 359 phylogeny of these species (based on NCBI Taxonomy Browser findings) (Supplementary 360 Fig. 33). 361

Identification and preliminary characterization of endogenous Caulimoviridae 362
Plant viruses can have a major impact on the populations and genomes of their hosts. 363 Paleovirology approaches can provide insight into virus-host associations by detecting 364 fragments of viral genomes integrated in host genomes 34 . Caulimoviridae is a major family of 365 plant viruses with deleterious effects on plant populations and crop production 35 . 366 Caulimoviridae do not need to integrate into the host genome during their replication cycle, 367 but such integration occurs randomly and repeatedly, resulting in the presence of significant 368 numbers of Caulimoviridae genome fragments in plant genomes 36,37 . We screened the oak 369 genome for the presence of genomic fragments from endogenous Caulimoviridae. Reverse 370 transcriptase (RT) is the best conserved domain of the Caulimoviridae family, so we began by 371 searching the oak genome for RT domains displaying the highest levels of identity to 372 homologs from known Caulimoviridae genera. Protein clustering (>80% identity) identified 373 eight groups including seven comprising several sequences corresponding to RT sequences 374 from Caulimoviridae. This viral family contains eight genera. Phylogenetic analysis revealed 375 that one of the RT cluster from endogenous oak Caulimoviridae belonged to the genera 376 Petuvirus, whereas the other seven belonged to the recently discovered Florendovirus 377 genera 37 (Supplementary Fig. 34). 378 We then performed targeted clustering (98% identity and 95% length) on the nucleotide 379 sequences corresponding to putative Caulimoviridae loci in the oak assembly and built 380 consensus sequences based on the multiple sequence alignment (MSA) for each cluster. We 381 then clustered the consensus sequences with the closest evolutionary relationships to 382 Caulimoviridae into seven families, each of which displayed at least 90% local identity. In 383 five families, the longest consensus sequence accounted for a complete, or almost complete 384 Caulimoviridae genome and was, thereafter, considered the representative sequence for each 385 family. Remarkably, we noticed that, while representative consensus sequences were built 386 from the MSA of only a few highly similar copies, we found cases in which consensus 387 sequences corresponding to truncated variants of the representative Caulimoviridae genomes 388 were generated from the MSA of hundreds of almost identical copies (Supplementary Fig.  389   35). We compared the representative sequences with the library of repetitive elements built by 390 TEdenovo and found that most were well represented in this library (see section 3.1). 391 Collectively, copies of consensus sequences from the TEdenovo library corresponding to 392 fragments of the Caulimoviridae genome accounted for 4.4 Mb of the REPET annotation, 0.6 393 % of the haplome, and were distributed evenly over the 12 chromosomes (Supplementary 394 Fig. 36). 395

TEs and genome dynamics 407
Estimation of the age of TE families from consensus sequences 3.4.1. 408 The consensus sequences used to annotate the TE copies in the oak genome represent 409 common ancestral structural variants (TE families) of TEs transposing in the oak genome in 410 the past 44 . Indeed, they were constructed from highly repeated genome segments (see section 411 3.1). We investigated the evolution of TEs in the oak genome, by plotting and comparing the 412 observed divergence (1-identity %) of TE copies from their respective consensus sequences, 413 to estimate their relative age 45 . We performed this analysis separately for different orders of 414 TEs (LTR and Non-LTR retrotransposons, TIRs and Helitron DNA transposons) and 415 superfamilies (LTR Gypsy and Copia superfamiles). Most TE copies (i.e. 62% representing 416 44% of the TE space) displayed more than 15% divergence from the corresponding consensus 417 sequence (Supplementary Fig. 38), whereas only 6.7% of TE copies (17% of the TE space) 418 displayed low levels of divergence (<5%) from their respective consensus. This result 419 suggests that all the TEs present in the oak genome are relatively ancient, contrasting with 420 findings for A. thaliana, in which 73% of TE copies (52% of the TE space) display more than 421 15% divergence and 10.5% of the copies (26% of the TE space) display less than 5% 422 divergence 45 . By contrast, the divergence of the LTR retroelement superfamilies Gypsy and 423 Copia suggests that TE activity continued until fairly recently for the elements of these 424 families. 425

Retrotransposition dynamics 3.4.2. 426
We first refined the annotation of LTR retrotransposons with the dedicated LTR Harvest 427 tool 46 , retaining the 5,904 complete elements that displayed more than 90% reciprocal overlap 428 with those from the general annotation of TEs with the REPET pipeline. We then classified 429 these elements into families by sequence clustering of their left LTRs with SiLiX, as 430 previously described 47 . We analyzed retrotranspositional history on a subset of 4,333 elements 431 from families of more than 200 elements. The insertion date of each element was calculated 432 from the sequence similarity between its left and right LTRs, as determined by LTR Harvest,433 as follows: date = ((1 -(% identity/100)) / 2.6 ) x 10 8 48 . We plotted the data as density 434 histograms representing the distribution of insertion dates within each family, together with a 435 curve representing local density estimates (Supplementary Fig. 39). We observed a general 436 asynchronism of retrotranspositional history, with families displaying one of three contrasting 437 patterns of activity history: ancient (e.g. fam #6 and #10), constant (e.g. fam #8) or recent 438 (e.g. fam #12, #14 and #29). This result suggests that the complexity of the oak genome 439 developed through repeated bursts of retrotransposition over the last five million years, with 440 no clear increase in such activity in the recent past. These findings differ from those for 441 annual plants of similar genome size, for which genome complexification has occurred more 442 recently, through concomitant bursts of transposition (over the last 1-2 million years 49 ). 443 3.4.3. 444 TEs are often associated with genome rearrangements. They have been found in breakpoint-445 containing windows in comparisons of A. thaliana and A. lyrata 38 . In maize and A. thaliana,446 the pericentromeric regions of the chromosomes are highly enriched in LTR retrotransposons 447 of the Gypsy superfamily, and maize also displays an accumulation of TEs from the Copia 448 superfamily in regions of euchromatin 50,51 . We investigated whether TEs, particularly those of 449 the Gypsy and Copia superfamilies, were evenly distributed throughout the oak genome. We 450 calculated the percentage of TEs and annotated genes in sliding windows (300 kb, with a 200 451 kb overlap). We found that TEs accumulated in gene-poor regions. We also identified a 452 region of chromosome #2 displaying strong TE accumulation, potentially corresponding to 453 the centromeric region (Supplementary Fig. 40). The Copia elements tended to accumulate 454 away from the potential centromere, both upstream and downstream. This pattern was 455 particularly marked for chromosome #2 (Supplementary Fig. 41). Below, we consider the 456 potential role of TEs located in close proximity to genes. 457 3.4.4. 458 We investigated the possible role of TEs in gene duplication and/or gene family expansion, by 459 comparing the genomic environment (in terms of proximity to TEs) of several categories of 460 genes according to their distance to the closest TE. The distance from each gene to the closest 461 TE was calculated with getDistance.py from the S-MART package 52 . Only distances up to 5 462 kb were considered. We assessed the dependence between the different classes of distances 463 and belonging to an expanded gene family, for oak genes. We found that genes from 464 expanded gene families (see section 4.1.2) were closer to TEs than other genes (Chi-squared 465 p-value < 2.2e -16 ; Supplementary Fig. 42 LTR-retrotransposons were identified. All potential candidates were validated by checking 483 that the LTR retrotransposon sequences were located on large contigs and not on isolated, 484 short sequences in genome assemblies, and that the high degree of sequence identity was 485 limited to the elements themselves and not in their flanking sequences, to eliminate possible 486 contamination during genome assemblies and annotation errors. Moreover, analysis with 487

Role of TEs in gene expansion and tandem duplication
Dotter software confirmed that all the horizontally transferred elements harbored both the 488 LTR and an internal sequence in the two species involved. We identified six HTT events 489 involving oak and grapevine (Vitis vinifera), with sequence identities of 90 to 94%. We found 490 one HTT event involving oak, grapevine and peach (Prunus persica). The HTT event between 491 grapevine and peach had already been identified (BO6) in the analysis by El Baidouri et al. 47 , 492 and 92% identity was found between the corresponding sequences from the two species. We 493 also identified one HTT event between oak and poplar (Populus trichocarpa), with 91% 494 identity. HTTs have been shown to occur frequently between flowering plants 47 , but our 495 findings for the oak genome provide the first evidence of multiple HTTs in a single species. 496 3.5. Gene prediction, functional annotation of protein-encoding genes and manual 497 curation 498 The total gene space of the 25,808 predicted proteins was 74 Mb in size, with a density of 499 0.32 genes/10 kb on average (Supplementary Table 31 Fig. 37). 507 Experts manually checked (using WebAppolo) the protein-coding sequence structures of 508 1,714 mRNAs. They validated 79% of the transcripts without the need for additional 509 modification, whereas the remaining 21% had to be corrected (Supplementary Table 12). 510 We then aligned the coding sequences of these 1,714 mRNAs, to validate 2,067 genes of the 511 Q. robur genome (diploid V2). Finally, 1,176 of these 2,067 genes were recovered in the Q. 512 robur haplome. In the following sections we provide information concerning some of the 513 gene families manually curated. 514 Aquaporin 3.5.1. 515 Forty genes encoding putative aquaporins were identified in the Q. robur haplome. 516 Aquaporins are intrinsic channel proteins found in all organisms. Their overall structure is 517 highly conserved, with six transmembrane helices connected by five loops, a tetrad of amino-518 acids (helix 2, helix 5 and loop E) forming an aromatic/arginine constriction region (Ar/R 519 filter), and two membrane embedded half-helices with an asparagine-proline-alanine signature 520 (NPA motif) 55,56 . Five conserved amino-acid residues discriminated aquaporins from other 521 major intrinsic proteins 57 . One Q. robur gene was invalidated due to the absence of a key 522 signature (Supplementary Table 33). 523 Q. robur was found to have aquaporins from the five subfamilies found in higher plants 524 (Supplementary Fig . 44), with 14 plasma membrane-intrinsic proteins (PIPs), nine tonoplast-525 intrinsic proteins (TIPs), eight nodulin-26 intrinsic proteins (NIPs), three small basic intrinsic 526 proteins (SIPs) and five unrecognized X intrinsic proteins (XIPs). Two subclasses of XIPs 527 were identified in Q. robur, with a particular mapping pattern for XIP2, suggestive of local 528 amplification on Qrob_H2.3_Sc0000154. Except for the XIPs, the composition of the Q. 529 robur aquaporin family was similar to those of Arabidopsis and maize 58,59 . However, the full-530 length TIP3 gene was missing from the Q. robur genome. In several species, TIP3s have been 531 reported to be specific to maturing and dry seeds 60 . Variations at key motifs and in gene 532 structure between the Q. robur aquaporin subclasses were consistent with published 533 findings 58,61 . The global rate of tandem duplication in this gene family was 37.5%, which 534 similar to the overall rate for the oak genome ( Fig. 45). The topology of the phylogenetic tree was similar to that 550 described for Arabidopsis 62 , with most of the subgroups conserved. However, like other 551 woody perennial plants, oak presented subgroups with more members than in herbaceous 552 annual plants such as Arabidopsis or rice (Supplementary Fig. 45). These expanded clusters 553 in woody plants include the so-called "woody preferential subgroups", which are completly 554 absent from the basal lineages of bryophytes and lycophytes and from the more recent 555 Brassicaceae and Monocot lineages 63 . We investigated the possible role of the MYB gene 556 family in tree habit specialization by classifying R2R3-MYB genes according to their 557 duplication and expansion profiles in woody perennials. The global rate of tandem duplication 558 in the R2R3-MYB family (32.4%) was slightly lower than the overall rate for the oak genome 559 (35.6%). However, the tandemly duplicated MYBs were remarkably enriched within the 560 woody-expanded subgroups (Supplementary Fig. 46, Fisher's exact test p-value < 0.0001). 561 A substantial enrichment of tandemly duplicated genes belonging to woody expanded 562 subgroups has been also observed in other woody plants, such as eucalyptus, poplar and 563 grapevine 64 . 564 The few genes from subgroups expanded in woody perennials that have been characterized 565 seem to regulate phenylpropanoid metabolism, mostly controlling flavonoid biosynthesis, 566 although, in some cases, they also directly or indirectly alter the content of lignin and other 567 soluble compounds, such as oligolignols or salicinoid phenolic glucosides 64-69 . During 568 evolution, tandemly duplicated genes have a greater likelihood of being retained if they are 569 involved in responses to environmental factors 70 . Unlike herbaceous annuals, which die after 570 reproduction, perennial plants, such as trees and shrubs, must survive many periods of 571 challenging stressful environmental conditions over their long lifespans. Woody perennial 572 plants may, therefore, contain more elaborate stress resistance mechanisms. The large number 573 of tandemly duplicated genes regulating the biosynthesis of flavonoids and other 574 phenylpropanoid-derived compounds, mostly known to be protective, may enable oak trees to 575 develop complex protective mechanisms and to adapt woody growth to environmental 576 conditions. It is also possible that the production of the some of the many phenolic 577 compounds accumulating in oak heartwood, such as ellagitannins, or the gallotannins found in 578 oak galls, are controlled by these genes. 579

580
The host plant supplies the mycorrhizal fungi with hexoses, which support the production of 581 the external fungal mycelium, a prerequisite for effective nutrient acquisition by hyphal 582 networks. Plant sugar transporters of the SWEET superfamily deliver sugars to microbes 71 , 583 and the microbe-specific modulation of SWEET gene expression may alter sugar efflux at the 584 site of colonization 72 . We therefore analyzed the phylogeny of the pedunculate oak SWEET 585 superfamily, and performed RNAseq analyses to determine whether the abundances of Q. 586 robur SWEET transcripts were altered by inoculation of the oak clone DF159 with the 587 ectomycorrhizal fungus (EMF) Piloderma croceum 73 , the mycorrhizal helper bacterium 588 (MHB) Streptomyces sp. AcH 505, and the causal agent of oak powdery mildew, Erysiphe 589 alphitoides 74 . Oak clone DF159 was micropropagated and rooted for gene expression analysis 590 as described by Herrmann et al. 75 , and cultivated in gamma-sterilized soil-based microcosms, 591 as described by Herrmann et al. 74 . Culture and inoculation conditions, RNA extraction, 592 sequencing and data processing for fungi and bacteria were as described by Tarkka   probably because they catalyze different reactions or have different protein partners. In 658 addition to these regulatory functions, TRX and GRX are also required for the regeneration of 659 some detoxification enzymes, particularly those requiring a reactive catalytic cysteine residue. 660 This residue oxidized upon reaction with the substrate, as in peroxiredoxins and methionine 661 sulfoxide reductases, must be recycled for the next turnover. We have also investigated the 662 The genes of the GRX family displayed the greatest variation in number in plant-specific 673 class III, with 9 to 24 genes in angiosperms, the oak genome having an average number of 674 these genes (N=14), 50% of them being tandem-duplicated. Hence, variations in this class can 675 be accounted for mostly by species-specific duplications. Similarly, the oak genome has a 676 number of genes from the TRX/TRX reductase family similar to that in other organisms, and 677 none of the subclasses are missing. The most striking characteristic is the presence of six 678 genes for NADPH-TRX reductase a/b type (NTRa/b), rather than the one or two in other 679 species, with four of the genes in oak identified as tandem duplicate genes. In line with this 680 observed expansion, the associated orthogroup (#2778) was found to be enriched in GO term 681 (GO:0004791) analysis. 682 Fourteen classes of GST genes were identified in the last phylogenetic analysis performed 683 with photosynthetic organisms 85 . Only 11 of these classes are present in angiosperms, the 684 other three classes being found in Physcomitrella patens. The total number of genes (88) in 685 oak is in the upper part of the range, as are those for P. trichocarpa, E. grandis and S. bicolor. 686 A detailed subclass analysis revealed that the difference between oak and other organisms 687 resulted principally from the presence of a larger number of GST Tau (GSTU) family 688 members. This finding was confirmed by the orthoMCL analysis (see section 4.1.2), with an 689 expansion observed for four clusters, including one with 19 GSTU genes (red branches in 690 Supplementary Fig. 48). From this very variable gene content (there is no GSTU in P. 691 patens, but 21 to 62 GSTU genes in the analyzed angiosperms) and the presented 692 phylogenetic tree (many sequences cluster by species), it seems clear that GSTU genes 693 evolved relatively recently and in a species-specific manner in plants. This situation differs 694 from that for most GRX and TRX classes, for which photosynthetic organisms usually have 695 the same number of each isoform (Supplementary  Supplementary Fig. 49), including seven belonging to clade V. 720 This clade contains the genes associated with powdery mildew susceptibility/resistance in 721 Arabidopsis thaliana 95 and some other species. The large number of MLO genes in oak, 722 particularly in clade V, is only surpassed by soybean, cotton and apple, all of which have 723 undergone recent whole-genome duplication events. Most of the MLO genes are located on 724 chromosomes #8 (5 genes, 4 of which belong to clade V), #10 (4 and 1) and #1 (3 and 2). We 725 also found seven incomplete genes with strong homology to MLO. As MLO genes are 726 susceptibility genes, these incomplete genes may confer resistance 90 . There are three 727 incomplete genes on chromosome #10, at the 5' and 3' ends of a complete clade V mlo gene. 728 If we consider all complete and partial genes, the overall rate of tandem duplication of MLO 729 genes is 46.2%, slightly higher than the overall rate for the oak genome (35.6%). 730 NB-LRR 3.5.6. 731 NLR-parser ( 96 , https://github.com/steuernb/NLR-Parser) was used to identify disease 732 resistance genes encoding nucleotide-binding leucine-rich repeat proteins (NB-LRRs or 733 NLRs) and related proteins from the oak genome (haplome). We identified an initial set of 734 1,431 genes. Based on the orthoMCL analysis, 81 proteins from NB-LRR-related classes (i.e. 735 orthogroup #1000, 1004, 1015, 1031, 1084, 1140, 1187, 1269, 1540, 1697, 2368, 2399, 4549, 736 5011, 7397, 14497  Beyond the large number of single domains retrieved (i.e. 14 CC, 151 TIR, 1 RPW8, 61 NB 753 and 85 LRR, Supplementary Table 8), the total complement of NB-LRR genes in the oak 754 genome is remarkable by comparison to those of other species. The list LRR genes is 755 probably incomplete, as this category is inherently very difficult to characterize due to the 756 highly variable number of LRRs and the abundance of other LRR-related proteins (e.g. LRR-757 RLK or LRR-RLP), a group that is also expanded in the oak genome. If we exclude single 758 domains, then, for the 1,091 genes, TIR-related NB-LRR proteins account for 43% of the 759 remaining disease resistance genes (335 of 779 genes). This ratio of TIR-to non-TIR-NB-760 LRRs close to 1 indicates that the disease resistance gene content of the oak genome is more 761 balanced than reported for other eudicots 102-105 . One group of non-TIR NB-LRRs, the 762 expanded set of RNLs (orthogroup #1140) may also reflect the evolutionary history of 763 pedunculate oak with the fungus Erysiphe alphitoides, responsible for oak powdery mildew. 764 No disease resistance gene for this disease has been identified and cloned or described in oak 765 trees, but this gene complement suggests considerable potential for resistance to these 766 pathogens and represents a valuable source of genetic information. 767 We investigated the expansions detected in the oak genome by the orthoMCL/CAFE analysis 768 in more detail, by retrieving protein sequences with an NB domain from classes that 769 displaying marked expansion relative to other plant species. We focused in particular on 770 orthogroup #1000 (labeled as #1 in Fig. 3d and 4b Fig. 6) 785 displayed two major specific expansions in oak that were well supported by bootstrap values. 786 Within these two clades, several small physical clusters containing more than three 787 contiguous genes were identified. These physical Arabidopsis thaliana (623) and Oryza sativa (1,147), using MAFFT 110 . Alignments were 806 cleaned with trimAl (gt 0.2, 111 ) and used to build an approximate maximum-likelihood 807 phylogenetic tree (Fastree 2.1.8,112 ). The quality of the alignments was systematically 808 manually checked around all sites on which a positive selection footprint was detected. If the 809 alignment was dubious (less than 4 sequences, presence of numerous gaps, or too divergent 810 sequences), the site was not considered. 811 With Arabidopsis and rice genes as references, this tree was used to classify the oak RLKs 812 into subfamilies, and into 20 subgroups (SG) for LRR-RLKs 113 . We identified 1,247 RLK 813 genes, corresponding to 4.83% of the gene repertoire, versus only 2.28% in Arabidopsis and 814 2.06% in rice (Supplementary Data Set 6). Two RLK subfamilies are clearly 815 overrepresented in oak: SD1 (0.88% of the oak gene repertoire, versus 0.11% in Arabidopsis 816 and 0.04% in rice) and LRR-RLK (1.69% of the oak gene repertoire, versus 0.83% in 817 Arabidopsis and 0.67% in rice). A comparison with the LRR-RLK repertoire of 31 other 818 angiosperm species 113 showed that two subgroups, SG-XIIa and SG-XIIb, displayed the 819 highest overall expansion rates relative to the estimated number of genes in the angiosperm 820 last common ancestor (102 copies in oak, expansion rate of 6.8 for SG-XIIa, and 50 copies in 821 oak, expansion rate of 10 for SG-XIIb). As for NBS-LRR genes, a large proportion of LRR-822 RLK expansions were caused by tandem duplications: 72% and 79% for SG-XIIa and SG-823 XIIb, respectively. In addition, LRR-RLKs from SG-XIIa, most of which belonged to 824 orthogroup #1006, displayed significant expansion in oak (labeled as #6 in Fig. 3d) and in 825 trees more generally (labeled as #5 in Fig. 4b), whereas LRR-RLKs from SG-XIIb, mostly 826 from orthogroup #1003, displayed significant expansion only in trees (labeled as #3 Fig 4b). 827 The few known genes in SG-XIIa include FLS2 (FLAGELLIN-SENSITIVE 2) and EFR (EF-828 TU RECEPTOR) in Arabidopsis, and Xa21 in rice. The SG-XIIb subgroup includes XIK1 829 (Xoo-induced kinase 1). All these receptors are involved in the response to bacterial which positive selection was detected were checked manually and we identified the domain to 851 which they belong, including the specific residue of the LRR when required. In the five gene 852 families identified as displaying significant expansion in oak, 24 groups of oak ultraparalog 853 genes containing up to 28 sequences were identified. Nineteen of these groups had a 854 significant strong signature of positive selection (11 corresponding to LRR-RLKs and 8 to 855 LRR-RLPs, Supplementary Data Set 9). The 11 LRR-RLK groups of ultraparalogs 856 belonged to four previously defined subgroups: SG-VIII-2 (1 group), SG-XI (1 group), SG-857 XIIa (5 groups) and SG-XIIb (4 groups). The two SG-XII subgroups were shown to have 858 undergone species-specific expansion events in a study of 31 angiosperm genomes 113 . Most of 859 the SG-XIIa genes described to date are involved in responses to biotic stresses. After manual 860 curation, 260 sites were confirmed to be targets of positive selection (175 in LRR-RLK, and 861 85 in LRR-RLP genes). We found that 78% (205) of the 260 sites were located in the LRR 862 domain (150 in LRRs of LRR-RLK genes and 55 in LRR-RLP genes). An investigation of the 863 precise location of the 150 sites within the LRR of LRR-RLK genes revealed that four amino 864 acids in particular (6, 8, 10 and 11), were more frequently targeted by positive selection (121 865 of the 150 sites, i.e. more than 80%, Supplementary Fig. 8). These variable amino acids lie 866 in the unconserved part of the LXXLXLXX -sheet/-turn structure typical of LRRs that is 867 involved in protein-protein interactions 120,121 . The residues targeted by positive selection were 868 solvent-exposed 122,123 . 869 Biosynthesis of hydrolysable tannins 3.5.8. 870 Oak tissues have a very high hydrolyzable tannin (HTs) or gallotannin content, and have been 871 one of the chief sources of HTs for leather tanning and dye manufacture for centuries. We 872 studied the oak genome, to find potential clues to the ability of oak to synthesize HTs, which 873 are esters of gallic acid with a polyol (typically β-D-glucose). Gallic acid is a derivative of the 874 shikimate pathway generated by the dehydrogenation of a 5-dehydroshikimate 875 intermediate 124 . The first committed step in HT biosynthesis is the formation of β-glucogallin 876 (1-O-galloyl-β-D-glucose), which is generated by the esterification of gallic acid and glucose 877 followed by transesterification to generate di-, tri-, tetra-, and pentagalloylglucose 878 Supplementary Fig. 50). Ellagitannins and gallotannins are derived from pentagalloylglucose 879 by the addition of further galloyl residues or oxidation 125 . The UDP-glucose:gallic acid 880 glucosyltransferase UGT84A13 has recently been identified as a candidate enzyme in the 881 biosynthesis of β-glucogallin in Q. robur 126 . However, the genes and enzymes involved in 882 further esterification steps to generate di-, tri-, tetra-, and pentagalloylglucose remain 883  Table 38). On the phylogenetic tree, Q. robur sequences 924 were distributed across the seven phylogenetic groups already described in Arabidopsis. 925

Supplementary Table 39 shows the number of laccases within each phylogenetic group for 926
Arabidopsis, poplar and pedunculate oak. Pedunculate oak was found to have about twice as 927 many laccase genes as Arabidopsis, and about half as many as poplar. Our results therefore 928 suggest that the laccase gene family has undergone expansion in oak, but to a lesser extent 929 than in poplar that however shows a recent whole-genome duplication. Groups #2 and #6 930 displayed a clear expansion of the laccase gene family in tree species relative to Arabidopsis, 931 with group #6 displaying stronger expansion in oak. Group #2 corresponds to laccase 932 homologs of ATLAC4, 11 and 17, essential for lignification in Arabidopsis. This biological 933 function is of primary importance for wood cell lignification in trees, so the patterns of 934 duplication and functionalization may differ between trees and herbaceous plants. Group #6 935 contains seven laccases, including AtLAC14. 936

Non-coding RNA prediction and annotation 937
The prediction of long non-coding RNAs was based on 13 RNAseq libraries (listed in 938  cmsearch program also predicted 263 pre-miRNA loci, two RNase MRP RNA genes, 31 SRP 984 RNA genes, 225 spliceosomal snRNA genes including 34 U1 snRNA genes, one U11 snRNA 985 gene, 55 U2 snRNA genes, one U12 snRNA gene, 33 U4 snRNA genes, 24 U5 snRNA genes, 986 64 U6 snRNA genes and 13 U6atac snRNA genes. We retained only one of the pre-miRNA 987 predictions made on both strands at same positions, resulting in the consideration of 204 pre-988 miRNA loci in the set of pre-miRNA gene predictions. We also predicted miRNA genes with 989 sRNA-PlAn (source code available as a workflow at https://forgemia.inra.fr/genotoul-990 bioinfo/ngspipelines/tree/master/workflows/srnaseq) on the 12 paired paired small RNAseq 991 datasets. sRNA-PlAn implements a model of miRNA biogenesis. Loci are built by 992 considering the regions of the genome to which reads produced by sRNA-seq experiments 993 map. Candidate loci are subjected to the miRNA prediction procedure, which considers the 994 expected pre-miRNA stem-loop structure, the size of the pre-miRNA sequence, the size of 995 pre-miRNA loops (bulges, internal loops, stem loop), the size of the most represented 996 sequence (20-24 nt), the alignment of this most represented sequence with the stem of the pre-997 miRNA and the expected expression profile of the pre-miRNA. A score is assigned to each 998 predicted pre-miRNA locus, taking into account the characteristics described above. Each 999 predicted pre-miRNA locus is then subjected to an annotation procedure in which it is aligned 1000 with miRBase 143 and RFAM 142 sequences with BLAST+ 144 , to differentiate between known 1001 ncRNA families and new candidate miRNA families. In total, 26,109 miRNA loci were 1002 predicted by sRNA-PlAn, from which 1,508 mature miRNA loci predicted by sRNA-PlAn 1003 with a high score were annotated as miRNAs on the basis of strong similarities (one error 1004 allowed) to sequences in the miRBase or RFAM databases (fasta files). We found that 145 of 1005 the related pre-miRNAs were specific to the RFAM cmsearch program and that 64 of these 1006 pre-miRNAs encoded members of the mir-69 gene family. Interestingly, 59 of the pre-1007 miRNA loci predicted by cmsearch contained one or both of the mature miRNAs predicted 1008 and annotated with sRNA-PlAn. Different tracks relating to ncRNA predictions/annotations 1009 were added to the genome browser, according to the software used. 1010

Supplementary
Finally, the 12 paired small RNAseq datasets as well as ncRNA predictions, lncRNA genes 1011 and TEs were used to assign expression evidence to the whole set of predicted noncoding 1012 regions. Reads obtained from small RNAseq datasets were mapped onto the genome with 1013 Bowtie2 145 , using default parameters and retaining only one alignment. Transcription 1014 evidence and count data were obtained with Featurecounts for each predicted non-coding 1015 RNA locus 146 . Predicted lncRNAs were used to confirm expression at predicted non-coding 1016 loci and to identify clusters of shorter ncRNA genes. SAMtools 137 and BEDtools 147 functions 1017 were used to manipulate alignments and to identify regions of overlap between the predicted 1018 lncRNA and ncRNA genes. We found that 212 of the predicted lncRNA genes overlapped 1019 annotated ncRNA loci (strand not considered The genetic diversity of oak (π) was 0.011 at synonymous sites (π 4 ), and 0.005 at non-1039 synonymous sites (π 0 ), with a mean π 0 /π 4 ratio of 0.44 (Supplementary Table 9). For 1,176 1040 manually curated genes, we recovered the π 0 /π4 ratio equals to 0.43 (π 0 = 0.00429 and π 4 = 1041 0.00990). Oak has a higher genetic diversity and π 0 /π 4 ratio (Fig. 2a) than the other woody 1042 perennial species studied by Chen et al. 148 . Further comparisons between "Expanded", 1043 "Contracted", and "Unchanged" gene families showed that π 0 estimates were significantly 1044 higher for expanded gene families in oak (0.007, p-value<2×10 -16 ) whereas π 4 values were 1045 similar for all types of gene families (0.012, Supplementary Fig. 54), resulting in a higher 1046 π 0 /π 4 ratio (0.56). Contracted family genes had a significantly lower π 0 /π 4 ratio (0.30) than 1047 unchanged families (0.32, p-value=5.2×10 -3 ). TDGs also had a higher π 0 /π 4 ratio (0.53), and 1048 an even higher π 0 /π 4 was found in the families expanded in oak (0.62). Similar estimates were 1049 obtained from the analysis of the pool-seq dataset, i.e. average π 0 /π4 of 0.50 (π 0 = 0.00538 1050 and π 4 = 0.0108), increasing to 0.59 for TDG, and 0.60 for expanded, 0.30 for contracted and 1051 0.32 for unchanged genes. Higher π 0 /π 4 values suggest a potential accumulation of deleterious 1052 mutations in expanded gene families relative to contracted or unchanged families. We 1053 calculated the frequency of mutations (as unnormalized pairwise differences) likely to cause 1054 protein malfunction (e.g. premature stop codons, start/stop codon changes). Genes from 1055 expanded families displayed significantly more potentially deleterious mutations (mean 1056 =0.23, p-value<2×10 -16 ) than those from contracted (0.09) or unchanged families (0.06). 1057

Detection of somatic mutations 1058
We compared the three libraries L1, L2 and L3 (i.e. 6 pairwise combinations, Supplementary 1059 Table 20) and detected 61 reliable somatic mutations. A total of 46 somatic mutations were 1060 completely absent from the poolseq dataset (40 SNPs), or and had MAFs below 0.5% (6 1061 SNPs), i.e. below the minimum threshold used to exclude sequencing errors in the poolseq 1062 dataset). Considering our high sequencing depth and the number of individual pooled (20 1063 genotypes), each allele is expected to be near 2.5%. As a consequence, low allele frequency 1064 variants are expected to be related to sequencing errors (estimated at 2.4%, see 1065 Supplementary Figure 25). Thus, to be conservative, we filtered out all candidate somatic 1066 mutations with an allele frequency above 0.005. As a result, 75% of the somatic mutations 1067 (46/61) could be considered to be detected exclusively in the "3P" accession (Supplementary 1068 Table 5). 1069 Noteworthy, one of the 40 somatic mutations detected exclusively in "3P" was found in a 1070 gene coding sequence: Sc0000066_1207928 in Qrob_T0204900.2 corresponded to a member 1071 of the large cytochrome P450 superfamily encoding a protein of the CYP4/CYP19/CYP26 1072 subfamilies with annotations relating to secondary metabolite biosynthesis, transport and 1073 catabolism, lipid transport and metabolism. This mutation was synonymous (ACC->ACT, 1074 corresponding to a threonine residue in the protein). 1075  Table 7). In total, 4860 orthogroups were 1089 common to all species. For the 25,808 oak proteins, 22,498 clustered into 11,813 orthogroups, 1090 479 of which were oak-specific and contained 1,737 oak proteins. There were also 3,310 1091 singleton proteins for oak. From the 36,844 orthogroups 524 and 72 were found to be 1092 expanded and contracted in oak, respectively (Supplementary Data Set 3 sheet #2 and #4). 1093

Identification of tandemly duplicated genes in oak 1099
Speciation and duplication events in the pedunculate oak genome were identified using the K s 1100 distribution of orthologous gene pairs between oak and peach (green bars in Supplementary 1101 Supplementary Fig. 20), respectively. The 1102 oak/peach ortholog K s distribution defines the position of the speciation event between these 1103 two species, with a single ancestral triplication event (γ) common to grape, peach, cocoa and 1104 oak and predating the speciation event. The burst of tandem duplicates highlighted by the 1105 purple K s peak occurred after oak/peach speciation and appears to be an oak-specific event. 1106

Fig. 20) and paralogs in oak (purple bars in
The dot plot representation of tandemly duplicated genes (TDGs) in oak is depicted in 1107 with 30-70% of species considered to be woody according to the strong prior) and diverse 1134 (defined here as containing >10 species), comprise 470 genera, 41 families and 12 orders. 1135 Paired comparisons of close relatives in these clades, ideally genera with both woody and 1136 herbaceous members, would make it possible to determine whether gene expansions in R-1137 gene families are correlated with evolutionary shifts in growth form. It is clear that certain 1138 clades are extraordinarily variable in terms of growth habit, but it seems unlikely that growth 1139 habit per se drives expansions and contractions in R-gene families. Instead, with their longer 1140 lifespans, woody species probably accumulate a greater pathogen load than herbaceous taxa. 1141 It would therefore appear reasonable to consider longevity as a driver of these functional gene 1142 shifts, and growth habit as a correlate of such differences in life history. 1143 We examined the full genome sequences currently or soon to be available for eudicots (as 1144 reported in (https://genomevolution.org/wiki/index.php/Sequenced_plant_genomes) as of 18 1145 December 2016), to identify future targets building on existing genomic resources. We had 1146 access to genome sequences for 46 herbaceous and 27 woody species from 18 orders 1147 (Supplementary Table 40 and Supplementary Fig. 55) Lamiales has three herbaceous and one woody (Fraxinus) species for which full genome 1158 sequences have been obtained, and Fabales has eight herbaceous and one woody (Cajanus) 1159 species with full genome sequences. Additional genome sequences for species from any of 1160 these clades (ideally within genera), considered together with the oak genome sequence, 1161 would improve our understanding of the evolution of genomic features favoring a long 1162 lifespan and woodiness in plants. 1163

Gene ontology (GO) enrichment analysis 1164
GO term enrichment in three categories of genes 5.4.1. 1165 We assigned a total of 3,433 GO terms (Supplementary Table 41 The 524 orthogroups expanded in oak comprise 5,910 genes (3 to 359 genes per orthogroup, 1233 with a mean of 11.3 genes per orthogroup, Supplementary Fig. 58). In total, 366 orthogroups 1234 were annotated with at least one GO term. The number of GO terms per gene family ranged 1235 from 1 to 17 (mean value, 2.89). We found that 4,217 of the 5,910 genes (71.4%) were 1236 annotated with at least one GO term (Supplementary Table 42     ( 2 ) Due to poor sequence quality, sequence analysis was performed on its allelic version Qrob_T0751510.2 (qrob_v2_scaffold_2295:14651-1620 following manual curation). ( * ) Sequence analysis was performed after the manual curation of intron/exon prediction. genome (haplome assembly) and selected embryophytes. In addition to oak sequences, other sequences were retrieved from genomic data 1990 available from thr Phytozome V11 portal by BLAST-p and tBLAST-n analyses using P. trichocarpa and A. thaliana sequences as references. 1991 The different classes in the grx, trx and gst families were defined according to 83-85 , respectively. 1992   haplomerger. In general, both haplotypes are well separated in the 2n assembly (blue and 2144 orange haplotypes). For each aligned block (pink polygons), we retained only the longest 2145 sequence (haplotype blue or orange) as recommended by the creators of haplomerger, to 2146 maximize gene content. Scaffolds merging the two haplotypes (as Scaffold F) in the 2n 2147 assembly were retained, without modifications, in the 1n assembly. 2148 2149 2150 Supplementary Fig. 11 Alignment of two allelic BACs (#50E24 and #177A20) against the 2151 V2 diploid assembly (scaffold #0721 and #0436) and the V2 haploid assembly (Sc00000330  (St:120,(Vv:117,(((Wa:101,((Md:53,(Pp:52,Fv:52) Fold enrichment GO term