The first identification of genomic loci in plants associated with resistance to galling insects: a case study in Eucalyptus L'Hér. (Myrtaceae)

Genomic loci related with resistance to gall-inducing insects have not been identified in any plants. Here, association mapping was used to identify molecular markers for resistance to the gall wasp Leptocybe invasa in two Eucalyptus species. A total of 86 simple sequence repeats (SSR) markers were screened out from 839 SSRs and used for association mapping in E. grandis. By applying the mixed linear model, seven markers were identified to be associated significantly (P ≤ 0.05) with the gall wasp resistance in E. grandis, including two validated with a correction of permutation test (P ≤ 0.008). The proportion of the variance in resistance explained by a significant marker ranged from 3.3% to 37.8%. Four out of the seven significant associations in E. grandis were verified and also validated (P ≤ 0.073 in a permutation test) in E. tereticornis, with the variation explained ranging from 24.3% to 48.5%. Favourable alleles with positive effect were also mined from the significant markers in both species. These results provide insight into the genetic control of gall wasp resistance in plants and have great potential for marker-assisted selection for resistance to L. invasa in the important tree genus Eucalyptus.

widely adopted for AM work with outcrossing plants, such as maize 27,28 and several forest trees 17,29 . However, this approach is inherently limited by the a priori choice of CGs which precludes the causal mutations located in nonidentified CGs 21 . Also, the trait variation explained by individual markers (usually single nucleotide polymorphism, SNP) is very low and rarely exceeds 5% 30 . In particular, it is impractical for those traits that no CGs have been discovered. On the other hand, genome-wide LD decay has been revealed to be substantially slower in outcrossing plants (e.g. approximately 3.7-5.7 kb with the largest LD up to 50 kb in E. grandis 31 ) than previous estimates with CGs. More recently, with the rapid development of genomic technologies and resources, genome-wide AM has been attempted using next generation sequencing based SNPs (e.g. maize 32 ), microarray-based SNPs [e.g. Picea glauca (Moench) Voss 33 ] and microsatellites [or simple sequence repeats, SSR; e.g. rice (Oryza sativa L.) 34 , Punica granatum L. 35 , Theobroma cacao L. 36 and Ipomoea batatas L. 37 ].
In this study, we used a select set of SSR markers to perform AM in E. grandis Hill ex Maiden for resistance to the gall wasp L. invasa and verified the associated SSRs in E. tereticornis Smith. SSRs have been the choice of markers for AM studies in many selfing 34,38,39 and outcrossing plants [35][36][37] . Both E. grandis and E. tereticornis are important species in terms of breeding and genomic efforts 40 , and E. grandis is the second tree genome (after Populus trichocarpa Torr. & Gray) to be sequenced 41 . Variation in L. invasa resistance has been observed in E. grandis 15 and also E. tereticornis 16 . Low population differentiation has been identified in E. grandis by isozyme markers (G ST = 0.12) 42 and SSRs (F ST = 0.037) 43 and also in E. tereticornis by SSRs (F ST = 0.012) 44 , suggesting a weak population structure that is ideal for AM analyses. Furthermore, though verification of association in additional population(s) is a valuable tool to demonstrate cross-population utility, only a few studies in plants have to date conducted it 17 . The objectives of this study were to (1) detect and verify the marker loci associated with resistance to L. invasa and (2) identify the favourable alleles for potential use in marker assisted selection in Eucalyptus. So far to our knowledge, this is the first report of mapping genomic loci associated with gall-inducing pest resistance in plants.

Materials and Methods
Plant materials. A total of 470 individual trees of E. grandis were sampled as a 'discovery' population from a provenance/progeny trial located at Zhaoqing City (112°27′E, 23°03′N), Guangdong Province, China. The trial was laid out following a randomized complete block design, with 32 replicates of single-tree (per family) plots at 2 × 3 m spacing. One to five trees (the first five replicates) were sampled from each of 158 open-pollinated (halfsib) families representing 16 natural seed sources (provenances) across the range of E. grandis in Australia 43 . A 'verification' population of 303 individual trees of E. tereticornis was sampled from a provenance/progeny trial located at Zhanjiang City (110°05′E, 21°16′N), Guangdong Province, China, which had been planted with four replicates of four-tree (per family) row plots in a randomized complete block design 16 . One to seven trees (the first two replicates) were sampled from each of 77 open-pollinated families from 11 natural provenances in Australia as described earlier 16 . Leaf samples were collected in July 2011 and March 2015 for E. grandis and E. tereticornis at ages of 15 and 31 months after planting, respectively. The leaves were stored at −80 °C prior to DNA extraction. Assessment of L. invasa infestation. Natural infestation of L. invasa was assessed for the E. grandis trial at age of 15 months when the gall incidence was evident. As different infestation indices were adopted in the literature 11,14,15 , we employed a five-grade criterion based on the number of galls visible on a whole tree as performed similarly by Goud et al. 14 , namely, grade 1 = 50 and more galls, grade 2 = 10-49 galls, grade 3 = 9 and less galls, grade 4 = multiple sprouts without galls and grade 5 = no symptom.
For the E. tereticornis trial, susceptibility by L. invasa was scored previously at nine months based on the percentage of galled leaves and twigs 16 . The scores were then approximated to the above five-grade criterion depending on infested leaf and twig numbers assuming a mean of two galls per leaf or twig. As the number of galls is strongly positively correlated with mean severity score (based on percentage infestation/100) and proportion of plants infested 12 , such an approximation would provide valid estimation of the infestation grades.
Phenotypic variation in L. invasa infestation was assessed using the statistical software SAS/STAT ® 8.1 (SAS Institute Inc., Cary, NC, USA). Analysis of variance (ANOVA) was conducted only for E. grandis (based on the relatively complete replicates 1−4) as that of E. tereticornis had been reported earlier for the whole trial 16 . Narrow-sense heritability h ( ) i 2 was calculated as: where r is the coefficient of relationship between the individuals within families (r = 0.40 for most open-pollinated families from natural stands of Eucalyptus 45 ), σ F 2 is the among-family variance within provenances, P 2 σ is the among-provenance variance, and E 2 σ is the residual error variance. Standard error of h i 2 was estimated using the delta method 46 .
DNA extraction and SSR marker assay. Genomic DNA was extracted from leaf samples (~300 mg per sample) using a modified cetyltrimethyl ammonium bromide (CTAB) method 47 . DNA concentration and quality were assessed using 1.2% agarose gel electrophoresis and a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific Inc., Waltham, MA, USA). A total of 839 genome-wide SSRs as used in Li et al. 48 were initially tested with one E. grandis DNA sample using routine polymerase chain reaction (PCR) amplification 49 . Those markers (561 SSRs, 66.9%) each resulting in a single clear band in agrose gel electrophoresis were subsequently screened against two sample pools of E. grandis, namely, resistant pool (four and four samples at grades 5 and 4, respectively) and susceptible pool (eight samples at grade 1). The markers (86 SSRs distributing across the 11 main scaffolds and a small scaffold of E. grandis genome; Supplementary Fig. S1) that exhibited at least 0.20 of allelic frequency difference between the resistant and susceptible pools were finally selected out for genotyping the 'discovery' population of E. grandis. The SSR genotyping method followed the fluorescein-12-dUTP based procedure as described earlier 49 .
In addition, 25 and 12 putatively neutral genomic SSRs (Supplementary Table S1) were used for population structure analysis in E. grandis and E. tereticornis, respectively. These SSRs were previously reported to neither Marker polymorphism, linkage disequilibrium (LD) and population structure. For E. grandis, number of alleles (N A ), observed heterozygosity (H O ), expected heterozygosity (H E ), allele size range (ASR) and polymorphic information content (PIC) per SSR marker were estimated with MSA software 50 . LD between the SSRs was evaluated using TASSEL 3.0 software 51 . The determination coefficient (r 2 ) was used to test the LD pattern with 100,000 permutations. STRUCTURE 2.3.4 software 52 was performed to cluster individuals into a number (K = 1−16 and 1−11 for E. grandis and E. tereticornis, respectively) of genetically homogeneous sub-populations based on an admixture model with correlated allele frequencies between provenances. For each K value, the Markov Chain Monte Carlo (MCMC) sampling was replicated with 10 runs 53 each following 100,000 burn-ins and 100,000 MCMC iterations. The optimal K value was determined with the highest ΔK method 54 in STRUCTURE HARVESTER 0.6 55 . The membership coefficient (Q) of each individual generated under the optimal K value was used to form the population structure Q matrix. Also, pair-wise kinship coefficients (K matrix) between individuals were estimated using SPAGeDi1-5a software 56 . The Q and K matrices were incorporated into the subsequent association analysis.
Association mapping and verification. A mixed linear model (MLM) was performed using TASSEL 3.0 software 51 [file type option 'Load polymorphism alignment (custom)'] for marker-trait association mapping in E. grandis and association verification in E. tereticornis. In order to avoid possible spurious associations, Q and K matrices generated above were incorporated as co-variates (Q + K method). The significant association probability was set at P ≤ 0.05. The R 2 value indicated the percentage of phenotypic variance explained by the marker identified. Only markers with allele frequencies of 5% or higher were included in association analysis. Also, the significant associations were further validated with a correction of permutation test (P ≤ 0.008 and 0.073 for E. grandis and E. tereticornis, respectively). The significant markers were function annotated by BlastX search of their original sequences against NCBI non-redundant protein database (https://blast.ncbi.nlm.nih.gov/Blast.cgi) with a cutoff E-value of 10 −5 .
Phenotypic allele effect was estimated in comparison to the average phenotypic value of 'null allele' (including the rare alleles with frequency less than 5%) 57 . An allele of positive effect was identified as favourable allele for L. invasa resistance. The general mean of positive or negative allelic effects was calculated as the average (positive or negative) allelic effect (AAE) of a marker, and its percentage taking account of the average 'null allele' phenotypic effect was also calculated 58 .

Results and Discussion
L. invasa resistance variation. The mean value of L. invasa resistance was slightly smaller in E. grandis than E. tereticornis (Table 1). In E. grandis, ANOVA indicated nonsignificant differences in L. invasa resistance among provenances and among families within provenances (Supplementary Table S2). The h i 2 estimate (0.10 ± 0.02; Table 1) was low, especially compared to that of E. tereticornis (0.52 ± 0.50) calculated from the whole trial 16 . However, the h i 2 for both E. grandis and E. tereticornis may be at similar magnitude considering the relatively high value of stand error shown in E. tereticornis 16 Table S3). The level of LD between the 86 SSRs inferred from E. grandis was generally low, with r 2 from 0 to 0.0878 (mean 0.0033) and only 87 (2.4%; P < 0.01) of the pairwise correlations showing significant LD (Fig. 2). Significant LD existed between linked and/or unlinked markers (Fig. 2).
Out-crossing plant species including eucalypts are expected to show a lower extent of LD compared to selfing plants 21 . The LD detected here in E. grandis is much lower than those estimated earlier in Eucalyptus. For example, Arumugasundaram et al. 59 reported r 2 values of 0−0.133 (mean 0.09) and 0−0.62 (mean 0.012) in 40 E. camaldulensis Dehnh. and 50 E. tereticornis trees, respectively, based on 62 SSRs, and Silva-Junior and Grattapaglia 31 reported average genome-wide r 2 of 0.131 in 48 E. grandis trees (two provenances) based on 21,351 SNPs. As population background can affect LD 39 , the lower LD level observed in this study could be mostly due to the larger size of population analysed (range-wide plant materials). Also, the extent of LD could vary with marker (genomic) loci 39 . Consequently, in light of a lower LD, a higher resolution of marker-trait associations can be expected.
Population structure analysis indicated that the optimal K value was determined to be two for the E. grandis 'discovery' population (Fig. 3a), which was in agreement with previous PCA analysis on the same population 43 . The 470 individuals were thus divided into two sub-populations (Fig. 3c). Similarly, the 303 individuals of the E. tereticornis 'verification' population were also divided into two sub-populations ( Fig. 3b and d). These results corroborate the previous division of two genetically distinct clusters of natural populations in E. grandis 43 and E. tereticornis 44 , indicating weak genetic structure among provenances for both species. Moreover, population structure can result in spurious marker-trait relations in subsequent association mapping 38 , and the appropriate identification of genetic structure, though weak in our cases, will help to eliminate false marker-trait associations.
Association mapping and verification. There were seven SSR markers associated in E. grandis with L.
invasa resistance at the P ≤ 0.05 significance level, of which two (EUCeSSR0755 and EUCeSSR479) were validated with a correction of permutation test (P ≤ 0.008; Fig. 4, Table 2 and Supplementary Table S4). The R 2 value of a significant marker ranged from 3.3% (EUCeSSR0930) to 37.8% (Embra333), with an average of 16.7%. The seven SSRs resided on scaffolds 2, 3, 6, 7, 8 and/or 5 of the E. grandis genome (Table 2). Further, four of the seven significant associations were verified and also validated in E. tereticornis (P ≤ 0.073 in permutation test;  38 and 20% for aluminum tolerance in rice 34 . The high  Little is known about the genomic loci associated with response to gall wasps in plant species, and the significant markers identified here would therefore provide insight into the genetic control of insect resistance in plants. In addition to the low h i 2 estimate, multiple significantly associated markers suggest the quantitative inheritance of gall wasp resistance in Eucalyptus. Of the seven significant loci detected in E. grandis, five (Embra333, EUCeSSR0930, Embra321, EUCeSSR479 and EUCeSSR683) are homologous to known genes or predicted proteins when their original sequences were BlastX searched against the NCBI non-redundant protein database. The locus Embra333 is functionally annotated as a C2H2 zinc finger protein (Cynara cardunculus var. scolymus L.; 9e−17 and 64% in E-value and similarity, respectively). C2H2 zinc finger proteins are one of the largest transcript factor families in plants and have been found to participate in diverse signal transduction pathways and developmental processes, including pathogen defense and stress responses 60 . EUCeSSR0930 has homology to M-phase inducer phosphatase 3 (Anthurium amnicola Dressler; 5e−22 and 70% in E-value and similarity, respectively), a protein that can be phosphorylated and activated by Cdkl/cyclin B and leads to entry into mitosis 61 . Embra321 is homologous to a gene annotated as predicted U-box domain-containing protein 51-like (E. grandis; 1e−8 and 85% in E-value and similarity, respectively). Though the physiological function of U-box domain remains unclear, plant U-box proteins have been implicated as regulators of fundamental cellular processes related to signal transduction, damage responses and programmed cell death as well as defense against biotic and abiotic stresses 62 . EUCeSSR479 is functionally related to membrane-anchored endo-1,4-beta-glucanases (Gossypium hirsutum L.; 8e−79 and 86% in E-value and similarity, respectively), which are involved in cellulose biosynthesis in plants 63 . Also, genes encoding endo-1,4-beta-glucanases have been found in bacteria, fungi, nematodes and insects. In the crown gall-forming bacterium Agrobacterium tumefaciens, cellulose fibers can be produced to adhere to plant cell walls during infection 64 . EUCeSSR683 is a predicted proline-rich receptor-like protein kinase PERK9 (E. grandis; 3e−22 and 98% in E-value and similarity, respectively), which is expressed widely in Arabidopsis thaliana L. Heynh. 65 and may act as a sensor/receptor in plants to monitor changes at cell walls during cell expansion or during exposure to abiotic/biotic stresses and then activate associated cellular responses 66 . However, the remaining two loci (EUCeSSR0755 and Embra345) were of unknown function. As EUCeSSR0755 is derived from an expressed sequence tag (ES589368), it may be a CG for physiological response to gall wasp infection. Nevertheless, causal gene(s) might be located in the LD region of a significant marker locus as neutral markers may represent artificial association caused by genetic hitchhiking 67 . grandis. SSRs and their original scaffolds are along the X-axis. P values were transformed as −log 10 P (Y-axis). *P ≤ 0.05; **P ≤ 0.01 with validation (P ≤ 0.008) in a correction of permutation test. Four markers (underlined) were verified to be significant in E. tereticornis. Few associations in plants have been verified in a different species though several studies have verified association results in an additional full-sib mapping population of the same species, such as yellow mosaic virus resistance associated SSRs in G. max 38 and wood-property associated SSRs in Populus tomentosa Carr. 68 . In our study, four of the seven significant markers in E. grandis were verified in E. tereticornis, suggesting the effectiveness of these marker-trait associations across species. However, three SSR markers remained non-significant in verification analysis with E. tereticornis. Several factors are possible to affect the verification results, including species, environment, population size and phenotyping. Related species may contain different loci affecting such a complex trait as insect resistance, which could have evolved independently in different populations and habitats 69 . Moreover, low LD in forest trees can give rise to inconsistent marker-trait associations among genotypes even within the same species 70 . Also, coupled with environment and population size, phenotyping technique plays an important role in finding accurate genotype-phenotype associations 25 . In the present study, phenotyping of gall wasp resistance was different between the 'discovery' and 'verification' species in terms of trial site, measurement age, season, population size and phenotyping method, which could be attributable, at least in part, to those non-verified associations in E. tereticornis. As the factors mentioned above are concerned, further efforts need to carry out multiple-site experiments deploying a large amount of clonally propagated genotypes.
Mining for favourable alleles and implications for practical breeding. The alleles with positive effects are considered as favourable alleles for L. invasa resistance. Table 3 shows the first allele with the largest (positive and negative) resistance effect of each of the significant SSR markers (see Supplementary Table S5 for the effects of all the alleles). In E. grandis, the allele Embra345-225 bp had the maximum positive effect (1.67, 83.3%), whereas EUCeSSR0755-274 bp had the maximum negative effect (−1.70, 39.2%; Table 3). In E. tereticornis, the alleles EUCeSSR683-167 bp and Embra333-250 bp had the maximum positive (1.38, 47.0%) and negative (−1.45, 49.2%) effects, respectively (Table 3). In particular, some of the alleles had the opposite phenotypic effect between the two eucalypt species. For EUCeSSR479 for example, the alleles 204, 231, 213 and 222 bp showed a negative effect in E. grandis (Supplementary Table S5) but a positive effect in E. tereticornis (Supplementary Table S6 This study reveals markers and favourable alleles that have potential in marker-assisted selection for resistance to the gall wasp L. invasa, at least in the important tree genus Eucalyptus. Firstly, the significant markers, especially those validated, could be used to track genomic loci of interest. While the SSRs associated in E. grandis provide candidates for verification purpose in other species, the associations verified in E. tereticornis may suggest robust genomic regions underlying gall wasp resistance. Secondly, the favourable alleles could be used as tags of selection for elite genotypes that have assembled the most desirable alleles over all the associated markers. This is extremely useful for species (e.g. many eucalypts) amenable to mass vegetative propagation as clonal cultivation offers the most effective approach to capture genetic gains 71 . Thirdly, based on the association results, elite parental combinations could be predicted to generate descendants with improved resistance to L. invasa. In this respect, the alleles robust in both E. grandis and E. tereticornis will have greater utility in a breeding programme.
In conclusion, this study presents genomic loci associated with gall-inducing insect resistance for the first time in plants and makes a valuable contribution to our understanding of the genetic basis underlying plant resistance to gall wasps. Seven SSR markers were associated with resistance to L. invasa in E. grandis, of which four associations were verified in E. tereticornis. These markers plus their favourable alleles can be used for marker-assisted selection for L. invasa resistance in Eucalyptus. Nevertheless, considering the quantitative nature of the insect resistance and the small proportion of genome sampled by the SSR loci, further association work should be undertaken to improve the genome coverage of markers by applying new technologies, such as next-generation sequencing 72 . In addition, as CG approaches have proven to be advantageous for breeding applications 17 , genome-wide screening of CGs followed by CG-based association mapping should be conducted for gall wasp resistance.