Introduction

Curly (Karelian) birch (Betula pendula Roth var. carelica (Mercklin) Hämet-Ahti) is a special variety of silver birch, which forms small populations scattered within the north-western part of B. pendula distribution: Finland, Russia (Karelia), Baltic countries, Belarus, Poland, southern Sweden and Norway1,2. The wood fibers of the Karelian birch are not directed strictly vertically, but at different angles, which leads to the formation of wood “curliness” underlying the special ‘curly’ phenotype. The figured ‘curly’ wood of the Karelian birch is used for manufacturing highly decorative furniture and architectural panels, and it is sometimes called ‘wooden marble from Finland’3. Because of its exotic nature, figured wood is paid by weight (3–5 € kg−1), rather than by volume like the wood of other trees4. Due to the high commercial value, interest in the propagation of curly birch on plantations has increased significantly since the 1980s. In Russia, the total area of forest plantations of Karelian birch by 1986 reached 5500 ha5. Unfortunately, in the 90 s, due to illegal logging, about 1.5 thousand curly birches were cut down on the territory of Karelia. In Finland, though, 6500 ha of curly birch stands have been established, and they are currently reaching their rotation age (37 years)3.

The major issue of curly birch propagation on plantations is that the first signs of 'curliness' among the trees appear at the age of 8–15 years6,7. Until that time, it is necessary to maintain a large plantation of trees with unpredictable yield, since any stands of trees artificially planted using Karelian birch seeds always have an admixture of ordinary (non-curly) birches. Proportions of ‘curly’ trees among progenies of open-pollinated Karelian birch vary from 2–3%8 to 25% (rarely 50%)9,10. 60–70% of curly-wooded offspring appear from controlled crosses of two parents with a ‘curly’ phenotype11.

Non-curly birches are usually removed from the stand at the age of about 10–13 years3 to provide genuine curly trees with sufficient space and light. For this cleaning, a non-destructive visual assessment of curliness is commonly used12. However, even at this age, tinning often encounters a problem, since detecting curliness by indirect morphological features of trees is rather difficult. Thus, understanding the genetic mechanism underlying the figured wood phenomenon of Karelian birch is becoming a truly challenging task, yet it has the potential to provide foresters with trait-specific molecular markers to help recognize curly birch genotypes at their earliest stages of development.

Throughout the long history of studying Karelian birch, various hypotheses have been suggested regarding the nature of the factors that determine curly wood12,13,14. The heritability of the trait, though, is acknowledged by most researchers. Recently, Kärkkäinen et al.11 have precisely evaluated the phenotypic segregation ratios in several progeny trials obtained from crosses of curly and non-curly birch parent trees. As a result, a simple Mendelian inheritance model for curliness was proposed: a monogenic trait with two alleles, the allele coding for curly phenotype is dominant over the allele coding for normal phenotype and the semi-dominant curly allele is lethal when homozygous11.

With advances in modern high throughput genotyping technologies and well-developed experimental workflow of the Genome-Wide Association Study (GWAS), it is possible to take another step forward in the study of the molecular genetic basis of the curly birch phenotype. In the present study, we used 37,045 SNP markers obtained by RADseq to genotype 192 trees with and without the ‘curliness’ phenotype derived from Karelian birch crosses, to reveal how many loci could be potentially involved in ‘curliness’ formation and to search for genetic variants associated with this trait.

Materials and methods

Plant material and phenotyping

In the study we investigated the full-sib seed progeny of the Karelian birch, growing on the Zaonezhskaya Forest-Seed Plantation, Medvezhyegorsk region (62° 15′ N 35° 02′ E) in the south-eastern part of Karelia (Russia). 192 trees from two full-sib families derived from two controlled crosses between Karelian birch parents №133 × №134 (88 trees) and №51 × №58a (104 trees) have been analyzed (Supplementary Table S1). The crosses were carried out in 1987 and 2006, respectively, direct and reciprocal crosses were performed. Phenotyping was carried out in June 2022. “Curliness” (curly wood, cw) of phenotyped trees was determined by a nondestructive testing approach: by stem growth form and bark color, but mostly by the presence and shape of bulges on the stem surface. Among these, spherical thickened, finely tuberculate and also without signs of “curliness” are visually distinguished (Fig. 1). According to the type of the surface of the stem, one can roughly estimate the level of the manifestation of wood “curliness”. The most intense patterned texture in wood, as a rule, is formed in Karelian birch trees with a finely tuberculate type of stem surface. The signs of “curliness” in a pronounced form appear on average in the 8–15th year of the plant’s life. It was also established that as the plants develop, after 30–40 years the reverse process of “smoothing” or “swollen” of the previously convex stem surface is observed due to an increase in the thickness of the bark (Fig. 1C). Although these features make it difficult to reliably identify Karelian birch without wood destruction, tree breeding programs make extensive use of this non-destructive approach to detect “curliness” based on the externally visible stem morphology conducted on 10-year-old and older birches5,11. The field studies including collection of plant material and phenotyping on birch plants presented in this paper comply with relevant institutional, national and international guidelines and legislation.

Figure 1
figure 1

Types of the surface of the stem of the Karelian birch: spherical thickened (A), finely tuberculate (B), “swollen” (C) and without signs of “curliness” (D). Types of stem surface (A), (B) and (C) were considered “curly” phenotype, (D)—as “non-curly” phenotype.

Genotyping procedures

Genotyping-by-sequencing of plant material was performed using double-digested RAD sequencing (RADseq) method15. The detailed protocol of RADseq library preparation using HindIII and NlaIII restriction enzymes was described previously16. In short, genomic DNA was isolated from frozen leaves using a modified CTAB method17. The integrity, quality and concentration of the purified DNA samples were determined by gel electrophoresis, using SPECTROstar Nano spectrophotometer (BMG LABTECH, Ortenberg, Germany) and Qubit 3.0 Fluorometer (Thermo Fisher Scientific, Waltham, MA, USA).

96 barcode adapters were used for indexing individual DNA samples. Thus, two RADseq libraries were prepared and sequenced separately to genotype 192 plants.

During library preparation the genomic DNA of each sample was digested with HindIII and an adapter with individual barcode was ligated to the sticky ends. Next, barcoded samples were mixed in one tube and purified with CleanMag DNA PCR magnetic beads (Evrogen, Moscow) to remove fragments shorter than 150 bp. The second restriction digestion was performed with NlaIII, the common adapter was ligated to the overhanging ends of NlaIII and purified again with magnetic beads.

The libraries were then enriched using PCR (98 °C for 30 s, 14 cycles of 98 °C for 10 s, 65 °C for 30 s, 72 °C for 15 s, and then 72 °C for 2 min) and purified using magnetic beads. The quality and quantity of two prepared amplicon libraries were determined using Qubit 3.0 Fluorometer and Agilent 2200 TapeStation.

Sequencing was performed using Illumina NovaSeq 6000 (Illumina, San Diego, CA, USA) in the paired-end mode, the length of reads was 150 bp. Raw sequence data is available on NCBI SRA under the project number PRJNA997794. Tree numbers, corresponding barcodes and tree phenotypes are summarized in Supplementary Table S1.

Read filtering, alignment, and SNP calling

The obtained reads were first distributed according to the sample barcodes in separate fastq files using the process_radtags function in Stacks software18. The quality of the reads and presence of Illumina adapters was assessed using FastQC software (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Reads were filtered using the Trimmomatic software19. Low-quality bases (Phred score < 18) in the reads, as well as reads that correspond to Illumina adapters and short reads (< 45 bp) were removed. The SNP calling procedure was performed using GATK v.4.4.0.0 software20 after aligning reads on the reference genome assembly21. Reads alignment was performed using a bowtie2 aligner with a very sensitive flag22. Birch genome21 sequence version 1.4c was used as a reference (https://genomevolution.org/coge/api/v1/genomes/35080/sequence). After alignment with bowtie2, reads were sorted within each bam file and next each bam file was indexed using Picard version 3.0 (https://broadinstitute.github.io/picard/). Next, for each bam file we generated individual vcf with the HaplotypeCaller function in GATK, then these files were merged using the CombineGVCFs function in GATK. Final SNP calling in combined vcf was performed with the GATK's GenotypeGVCFs function using the parameter Max Alternate Alleles = 2 to select only biallelic SNPs. GATK's Variant Filtration function was used to filter SNPs based on the following parameters: Minor Allele Frequency (removing SNPs with frequency > 0.05), Mapping Quality (removing SNPs with score < 40.0), Quality by Depth (removing SNPs with score/depth < 24.0), Fisher Strand Bias (removing SNPs with p-value > 60.0), Strand Odds Ratio (removing SNPs with odds ratio > 3.0), and Depth (removing SNPs with cumulative depth < 20.0 across samples). More detailed information on SNP calling and example scripts used in the present study are available at GitHub (https://github.com/RimGubaev/GATK_pipeline_customized). To assess the coverage of the SNPs we estimated the mean number of reads that cover each SNP.

Population structure assessment

In order to find the number of subpopulations within the studied birch cohort, ADMIXTURE software version 1.3 was used23. The potential number of subpopulations was set from 1 to 15. The principal component analysis was performed using default parameters in the PLINK software version 1.924. The number of principal components was set to 20. Principal components were estimated using the variance-standardized genetic relationship matrix. Linkage disequilibrium was evaluated using r2 which in turn was derived from the pairwise comparisons made in the PLINK software version 1.9. The r2 value was calculated for SNP pairs located within 1000 kb frames across the chromosomes. LOESS (locally estimated scatterplot smoothing) was used to estimate the regression between r2 and distance.

Association mapping, SNP annotation

To identify genotype–phenotype associations, the GAPIT version 3 software was used25. The mapping was performed using the SUPER (Settlement of MLM Under Progressively Exclusive Relationship) approach26. The first five principal components, as well as the kinship matrix calculated using methods described previously27 were added to the model to account for the population structure and kin relationships, respectively. In order to assume an association, the adjusted p-value (Bonferroni correction) of less than 0.00000135 (0.05/37,045) was used, where 37,045 is the number of tests (SNPs). Additionally, false discovery rate (FDR) correction was used according to the Benjamini-Hochberg (BH) procedure. The SNPs that demonstrated BH-corrected p-values less than 0.05 were assumed to be associated with the phenotype. To search for the candidate genes, the list of coding sequences was retrieved in the region on chromosome 10 identified by association mapping and restricted by SNPs S10_2623237 and S10_3848247 (± 30 kb). Original annotation in gff format for assembled genome was used to extract potential candidates within identified region on chromosome 10 (https://genomevolution.org/coge/api/v1/downloads/?gid=35080&filename=Betula_pendula_subsp._pendula_annos1-cds0-id_typename-nu1-upa1-add_chr0.gid35080.gff). The proportion of the explained phenotypic variance was estimated by performing an analysis of the variance of the linear model (lm() function) in the R base package.

SNP validation, molecular marker development

Sanger sequencing was applied to validate the high throughput SNP call performed by GATK using RADseq data. Primer pairs flanking the most significant SNPs were designed. (Supplementary Table S2). For the primer design B. pendula genome sequence version 1.4c21 was used as a reference.

For all the designed primers PCR reaction mix contained 1 × Taq-buffer (2.5 mM Mg2+); 200 µM dNTP; 2.5 U Taq DNA polymerase (Evrogen, Moscow); 0.4 µM of each primer; ~ 20 ng of template DNA, sterile distilled water in a final volume of 25 µl. PCR cycling conditions consisted of an initial denaturation step of 95 °C for 3 min, followed by 30 cycles of 95 °C for 30 s, 58 °C for 30 s, 72 °C for 45 s, and a final extension cycle at 72 °C for 5 min. PCR products were purified using magnetic particles, followed by termination PCR with BrilliantDye™ Terminator Cycle Sequencing Kit 3.1 using forward and reverse primer. Sanger sequencing was performed with Applied Biosystems 3500 Genetic Analyzer. The DNA sequences were aligned with UniproUGENE software28. The genetic marker BpCW1 described in this paper is the subject of a patent application no. 2023122806 filed by the authors and protected by the Federal Service for Intellectual Property of Russia.

Ethical approval and informed consent

All methods in the present research were performed in accordance with the relevant guidelines and regulations specified in Nature Portfolio journals’ Editorial policies. The present research did not involve human or animal participants.

Research involving plants, permission for collection of plant material

The research and field studies on plants presented in this paper comply with relevant institutional, national and international guidelines and legislation. Karelian birch (Betula pendula Roth var. carelica (Mercklin) Hämet-Ahti) is a protected birch variety in the Republic of Karelia, permission for the collection of the plant specimens was issued by the Ministry of Natural Resources and Environment of the Republic of Karelia.

Results

Phenotyping, genotyping and SNP calling

The phenotypic evaluation was performed for 192 trees from two studied crosses. In the first full-sib population consisting of 88 trees, 70 plants were classified as ‘curly’ and 18 demonstrated non-curly phenotype. For the second cross (104 trees) 59 and 45 trees demonstrated curly and non-curly phenotypes, respectively (Supplementary Table S1).

The procedure of RADseq library preparation using HindIII and NlaIII enzyme combination and CleanMag bead ‘lower cut’ protocol resulted in two final libraries with the amplicon sizes varied from 450 to 600 bp. The RADseq procedure yielded an average of 16,090,513 paired-end sequencing reads for each birch DNA sample. After the read filtering procedures, an average of 9,107,446 reads per plant were mapped to the birch genome. By analyzing alignment files, we identified 3,065,097 biallelic SNPs using the GATK SNP caller. As a result of filtering, 37,045 SNPs remained for further analysis of population structure and association mapping (Supplementary Table S3). The mean number of reads that cover each SNP followed a Poisson distribution with an average coverage of 13.8 reads per SNP.

Population structure analysis revealed clusterization and rapid LD decay

To assess the population structure of the birch cohort we estimated the potential number of K clusters (sub-populations) using ADMIXTURE software23. Next, we performed a principal component analysis implemented in PLINK software24 to estimate the proportion of genotype variance explained by principal components as well as to perform visualization (Fig. 2A–D). Finally, we assessed a linkage disequilibrium decay using PLINK (Fig. 2E).

Figure 2
figure 2

Population structure of the studied Karelian birch cohort. Panels A and B represent the results of the ADMIXTURE software. (A) Estimated cross-validation error value for possible clusters from 1 to 15. The drop in the cross-validation error indicates the optimal number of possible subpopulations. (B) ADMIXTURE bar plots reflecting the subpopulations at K = 2, 4, 7, 11 each bar corresponds to the birch accession, colors indicate the proportion of subpopulation admixtures in each accession. For K = 2 green color corresponds to the genetic admixtures from population 2, while yellow color corresponds to the genetic admixtures from population 1. (C) and (D) represent the result of the Principal Component Analysis. Colors correspond to the phenotype of the trees (cw—‘curly’, ncw—‘non-curly’) and the shape indicates the corresponding full-sib population according to the plan of the birch plantation. The numbers in brackets correspond to the proportion of genotype variance explained by the principal component. (E) Linkage disequilibrium (LD) decay in the studied cohort of birch trees. Each dot corresponds to the r2 value between a pair of SNPs. The red line represents the LOESS curve. The gray marker corresponds to the 95% confidence interval.

The most significant drop in cross-validation (CV) error was observed when binning the birch cohort into two clusters (K = 2) (Fig. 2A,B). This is in concordance with the information that the studied cohort was represented by two populations derived from two independent crosses. Subsequent binning at K = 4, 7, 11 divides each of the two studied populations into different groups (Fig. 2B). However, no correlation between the observed phenotype of the tree wood and clustering at K = 4, 7, 11 were identified. Thus, the observed clustering could be due to random genetic segregation. The first three principal components explained 50.74% of genotypic variance which is in concordance with the strong population clustering (Fig. 2C,D). The first principal component explained the variation between the two studied populations obtained from two different crosses, while the second and the third principal components explained variation within them. Notably ‘curly’ and ‘non-curly’ phenotypes were evenly distributed between and within two full-sibs' populations. Linkage disequilibrium (LD) decayed rapidly and was equal to 30.8 kb at r2 = 0.25 and 79.5 kb at r2 = 0.15 (Fig. 2E).

Association mapping revealed the target locus on chromosome 10

To perform association mapping, the SUPER model was applied accounting for population structure and relationships by introducing the first five principal components and kinship matrix. Out of the 37,045 SNPs tested, 3 SNPs (S10_3168885, S10_3465040, S10_3472479) were found to be significant after Bonferroni correction for multiple testing (Fig. 3).

Figure 3
figure 3

Results of the association mapping of curly wood trait. (A) Manhattan plot and respective, (B) QQ-plot showing SNP markers associated with the 'curly' phenotype of Karelian birch. Each dot corresponds to a single SNP. Blue dots correspond to SNPs from odd chromosomes, yellow dots correspond to SNPs from even chromosomes. (C) Manhattan plot for the region of interest on chromosome 10 restricted with SNPs S10_2623237 and S10_3848247. The red line corresponds to the Bonferroni-adjusted significance threshold. The blue line corresponds to the FDR-adjusted significance threshold. Linkage between the associated SNPs (D), numbers in squares correspond to the digits after the comma of the r2 value between a pair of SNPs.

Using a softer FDR threshold (α = 0.05) 5 additional SNPs (S10_2623237, S10_3103637, S10_3835235, S10_3848236, S10_3848247) surrounding the above-mentioned SNPs were identified to be significantly associated with the studied trait. 5 out of 8 SNPs were found within the coding sequences annotated for the B. pendula reference genome (Table 1). In addition, S10_3465040, detected outside of any coding sequence, was located in the 3'UTR region of the Bpev01.c0000.g0109 gene, at 27 bp downstream of the stop codon. The mean number of reads that cover each SNP associated with the ‘curly’ phenotype varied from 13.07 to 39.43 (Table 1).

Table 1 SNPs associated with the 'curly' phenotype in two full-sib populations obtained from crossing Karelian birches according to GWAS results.

At the next step we assessed the effect that the SNPs exerted on the phenotype. To do so the phenotype distribution was visualized (Fig. 4). For one of the most significant SNP (S10_3465040) the 'non-curly' phenotype was associated with 'G' allele in the homozygous state, while the presence of the alternative ‘A’ allele indicated the curly phenotype. Remarkably, this SNP showed the largest phenotypic variance explained (PVE = 56.28%). That makes the S10_3465040 a good candidate for marker-assisted breeding applied to the curly wood trait.

Figure 4
figure 4

Barplots reflecting the effects of detected SNPs on the segregation of trees according to the “curly” phenotype. Each bar corresponds to a phenotype distribution across one of the three possible genotypes. Each colored bar shows the proportion of the phenotype distribution (cw—curly wood, ncw—non-curly wood) across genotypes.

Since all identified SNPs were located at the 10th chromosome, we assumed the presence of a locus associated with the studied trait (Fig. 3C). Despite the quite large chromosome region that carries SNPs associated with the trait (more than 1200 kb), the marginal SNPs (S10_2623237 and S10_3848247) demonstrated moderate (r2 > 0.3) linkage (Fig. 3D). Hence, we considered this region as potentially carrying a candidate gene or genes associated with the curly wood phenotype.

Detailed analysis of the region restricted by SNP S10_2623237 (3'-end) and SNP S10_3848247 (5'-end) plus an additional 30 kb at both ends, yielded 104 coding sequences (Supplementary Table S4). Among the sequences encoding proteins with known function, there were some genes previously discussed as potential players in curly wood formation. Most of the genes have been annotated as encoding unknown proteins, however, some genes from the list were previously discussed as potential players in curly wood formation.

For example, Bpev01.c0000.g0045 (Bpe_Chr10_2774789-2776265) encoding protein from Flavin-containing_monooxygenase_family located approx. 150 kb upstream from the candidate SNP S10_2623237. This coding sequence was recently reported as the BpYucca5 gene involved in auxin biosynthesis29. The Yucca gene family is a key player in IAA (the main natural auxin form) homeostasis. Much of IAA in plants is synthesized from tryptophan with the involvement of Yucca flavin-dependent monooxygenases30. It has recently been proposed that the development of anomalous wood in Karelian birch may be associated with impaired auxin biosynthesis14,29. Another possibly relevant gene Bpev01.c0000.g0081 encoding Sm-like_snRNP_protein (Bpe_Chr10_3118250-3127056) is located in 14,613 bp downstream from SNP S10_3103637. Sm-like snRNP proteins are required for mRNA splicing, export, and degradation. It was shown that the homolog of the candidate gene Bpev01.c0000.g0081 in Arabidopsis (Sm-like protein (SAD1), AT5G48870) affects both ABA sensitivity and drought-induced ABA biosynthesis: ABA-deficiency was detected in sad1 mutant plants31. The authors proposed a critical role for the SAD1 gene in RNA metabolism: sad1 mutation affects the decay rate of mRNA for an early component(s) in ABA signaling in drought stress conditions.

SNP validation revealed the first molecular marker for B. pendula curly wood trait (BpCW1)

Validation by Sanger sequencing was performed for 4 out of 8 candidate SNPs: three of them were found to be significant by multiple testing, and one additional SNP (S10_3103637) was selected because it was the only SNP identified in an exon sequence (Table1). PCR fragments obtained with primers flanking each of the four most significant SNPs were re-sequenced for the randomly chosen 16–20 birch trees. The accuracy of genotyping was estimated as 81%, 100%, 86%, 75% for SNPs S10_3103637, S10_3168885, S10_3465040, S10_3472479, respectively (Supplementary Table S5).

The length of PCR products obtained for neighboring SNPs S10_3465040 and S10_3472479 varied significantly in size in the populations due to numerous InDels detected within the amplified fragments. In particular, in the sequence of the PCR fragment containing S10_3472479, at least two InDels of 101 bp and 6 bp were detected by Sanger sequencing. PCR analysis of the entire sample of 192 trees with the primers specific for S10_3472479 revealed at least three more length variants of the PCR products (Supplementary Fig. S6). None of the InDels adjacent to S10_3472479 were significantly linked to the “curly” phenotype. As a consequence, no correlation between the PCR fragment size and the “curly” phenotype was detected (Supplementary Fig. S6).

At least 3 InDels of 54 bp, 13 bp and 2 bp were detected within the PCR products amplified with primers flanking S10_3465040. Notably, the 54 bp deletion both in the homozygous and heterozygous state correlated perfectly with the "curly" phenotype in the analyzed populations. 54 bp deletion results in a 476 bp amplified fragment compared to the size of 530 bp intact PCR fragment, which makes it easily recognizable by electrophoresis (Fig. 5). Of the 190 trees tested for the presence of a 54 bp deletion with this PCR marker, 174 trees (92%) were identified correctly as ‘curly’ or ‘non-curly’ depending on the presence or absence of the 476 bp PCR-fragment (Supplementary Fig. S7). To our knowledge, this is the first molecular marker proposed for the recognition of Betula pendula Roth var. carelica with the curly wood phenotype (BpCW1).

Figure 5
figure 5

Results of the validation of the 54 bp deletion associated with SNP marker S10_3465040. (A) The result of PCR amplification of DNA fragments containing deletion associated with the curly wood phenotype. Samples 17, 24, 108, 112 and 25, 26, 101, 103 demonstrating curly wood (cw) phenotype carry deletion in homozygous (AA) and heterozygous state (AB), respectively. Samples 27, 29, 104, 106 do not carry deletion (BB) and demonstrate non-curly wood phenotype (ncw). (B) Phenotype (cw, ncw) distribution among genotypes associated with the deletion (AA, AB, BB) for the studied cohort of 192 trees.

Discussion

In this study we attempted to unravel the enigma of genetic factors underlying the phenomenon of curly wood in Karelian birch, which has been debated for over a century32. Although in 1922 it was already shown that the patterned texture of the wood is quite stably transmitted to progenies during seed propagation33, there were many adherents of the ‘infectious’ or ‘pathological’ hypotheses, which considered curly wood to be the result of exposure to biotic or abiotic stressors34,35,36. The idea finds support in some modern studies, for example, the influence of a low-temperature stress gradient is assumed, which determines the boundaries of the distribution of Karelian birch37. According to the reports, the suppression of xylem-phloem transport and a decrease in the rate of plant growth occur in parallel with an increasing deficit of heat, moisture, or mineral nutrition. Recently, an ecological and genetic hypothesis has been proposed, according to which the origin of the Karelian birch is associated both with the natural and climatic conditions of its growth and with genetic factors38,39.

Ruden40 was the first to postulate that the 'curly' trait is monogenic, the gene can be dominant and lethal when homozygous. Later it was proposed that the inheritance of curly wood is due to a series of multiple alleles in one locus or genes in different loci41. In 2011 a hypothesis about the epigenetic origin of curly wood of the Karelian birch was suggested42.

Since the aberrant 'curly' phenotype occurs regularly, albeit sporadically within the north-western part of the silver birch distribution area, and is inherited in offspring, the hypothesis that curly birch can be a single-locus mutant of B. pendula seems quite plausible. This hypothetical mutation apparently disrupts the normal development of the plant, affecting not only the general habitus but probably also the tree's longevity. As stated by some reports, the life cycle of the Karelian birch takes approximately 50–60 years35,39,43,44, while the life cycle of the silver birch takes 120–140 years. The impairment of xylogenesis observed in Karelian birch may be due to the loss of function of a particular causative gene. It may also play a role in whether this mutation, which affects the fitness of the tree, is present in a homo- or heterozygous state. Accordingly, Kärkkäinen et al.11 reported that their analysis of crosses studying the inheritance of curly birch phenotype indicates a Mendelian inheritance based on a single semi-dominant gene that is lethal when homozygous.

Consistent with this report, our study based on RADseq and association mapping revealed a single interval on chromosome 10 spanning ~ 1200 kb, where 8 SNPs were found significantly associated with the trait under study. The fact that one of the SNP markers (S10_3465040) was identified explaining up to 56% of phenotypic variation (Table 1) speaks in favor of the previously suggested monogenic (Mendelian) mode of inheritance for the curly wood trait. This particular SNP S10_3465040 and the 7 kb distant SNP S10_3472479 seem to fall into a region on the chromosome rich in small InDels. Three InDels were found only within 530 bp surrounding SNP S10_3465040, and even more within the 621 bp interval surrounding SNP S10_3472479.

Thus, in the region of interest, we observed a distinct structural characteristic: a high frequency of insertions and deletions of varying sizes. This structural dynamism might hint at the active presence of transposable elements (TEs) within this region. TEs, often termed “jumping genes”, have been recognized for their capacity to move within the genome, and their activation can be influenced by various factors. TEs are not just passive genomic elements; they can play pivotal roles in gene regulation, especially in response to environmental stresses45. Furthermore, TEs can influence alternative splicing patterns, potentially serving as alternative splicing start or acceptor sites46.

We found that the 54 bp InDel located at a distance of 160 bp from SNP S10_3465040 successfully predicted the “curly” phenotype for 92% of the 190 trees examined. It should be considered that the phenotyping of the “curly” trait was carried out by the method of non-destructive visual assessment of the habitus of the tree, without assessing the cut, so a certain percentage of erroneous phenotyping can be expected. Notably, of the 16 mislabeled trees, 15 were diagnosed as “curly” by the PCR marker, while visual assessment phenotyped these trees as “non-curly”. Only one tree was mislabeled the other way around. It should be considered that by the time of visual assessment (June, 2022), almost half of the phenotyped trees had just reached 16 years of age. Thus, in some of these “mislabeled” individuals the “curly” phenotype may not yet be visually manifested, but will appear at a later stage.

The identified 54 bp InDel does not fall into any coding sequence, although it is located in the 3'UTR region of the Bpev01.c0000.g0109 gene encoding an unknown protein. Blast analysis of the Bpev01.c0000.g0109 sequence revealed just a few hits—in the genomic assemblies of Quercus, Alnus and Fagus, which suggests that it may be specific to the Fagales order. Deletions in the 3'UTR region of this unknown gene can dramatically affect the fate of its mRNAs. The 3′-UTR regulatory regions are involved in polyadenylation, stability, transport and mRNA translation in plant cells47. When the effect of increasing the length of the 3'UTR on expression of luciferase mRNA was examined in transiently transfected carrot protoplasts, expression increased 18-fold when the length of the 3'-UTR was increased from 7 to 27 bases48.

Several studies have highlighted the profound effects of TE insertions, especially within the 3'UTR regions of genes. Such insertions can lead to translational repression, thereby influencing phenotypic outcomes. For instance, in rice, the Ghd2 gene, a regulator of flowering time, is impacted by such a mechanism49. This phenomenon isn’t exclusive to plants. In Drosophila, a TE insertion has been linked to enhanced survival rates in response to chemicals, due to its effect on the upregulation of the CG11699 gene50. Similarly, in mammals, TEs, particularly SINE repeats, have been shown to modulate gene promoter activity51. A study in humans demonstrated that the expression of an AcGFP reporter gene is significantly attenuated by inverted Alu repeats in the 3′UTR52. Given the proximity of this SNP to the 3'UTR and the known effects of TEs on gene regulation, it can be assumed that TEs may influence the curly wood phenotype of the Karelian birch.

Bpev01.c0000.g0109 can be considered as one of the possible candidate genes, however, linkage disequilibrium (LD) between the S10_3465040 and the putative causal locus located near the identified polymorphism can also be assumed. Rapid LD decay was reported for many arboreal species including pines53, aspens54 and willows55 due to the high level of genetic diversity (heterozygosity) of these woody species. In our study, though, the identified LD at r2 = 0.15 was higher (79.5 kb) compared to one revealed in the previous study (23 kb) performed on the 60 genetically diverse birch trees21. This could be explained by the fact that the association study was performed using a cohort of related individuals (two populations of full-sibs), which undergo only few recombination events. This may be the reason why the chromosomal interval at LG10, where significantly associated SNPs were detected, stretched to 1200 kb.

The detected interval on chromosome 10 was discussed previously as containing the sequence of the BpYucca5 gene involved in auxin biosynthesis which is critical for normal xylogenesis in Karelian birch29. The BpYucca5 gene is listed as one of the 104 coding sequences detected within the interval on chromosome 10. Among the potential candidates, there are also several genes encoding proteins that facilitate the processing, splicing, editing, stability and translation of mRNAs (Supplementary Table S4). Some of them are regulated by abiotic stress factors, such as Sm-like_snRNP_protein (Bpe_Chr10_3118250-3127056) which affects ABA biosynthesis under drought conditions.

Assuming that a causal polymorphism (in which TEs may be involved) can influence gene expression or mRNA stability depending on the environment, it may be explained why the Mendelian inheritance of this valuable trait has long been contested. Indeed, the formation of figured patterns in the wood of Karelian birch can begin at any age: cuts of trees are described in which patterned wood began to form at 8, 15, and even at 25 years. Moreover, there are examples of cross-cuts of trees, where the formation of a patterned wood texture is visible only in a certain sector of the trunk, perhaps, due to the special light conditions for the growth of the tree2.

The phenomenon, when the severity of the phenotype caused by the genotype can vary among affected individuals, is known in genetics as “expressivity”. Nowadays, both variable expressivity and incomplete penetrance are mainly investigated in studies focusing on monogenic human diseases. They are described to be caused by a range of factors, including variants in regulatory regions, epigenetics and environmental factors. The exact causes of expressivity are still not well understood, although they most likely lie in the molecular mechanisms governing genetic regulation56,57. Accordingly, in Karelian birch the dominant allele may underlie the cause of figured wood, however, the degree of xylogenesis anomalies and the stage of development at which they occur may depend on environmental conditions. That may explain the reason why the segregation ratio of curly and non-curly progenies in the crossing of Karelian birches varies from study to study2,58.

Our results suggest that the variation of the "curly" phenotype in Karelian birch is linked to the single interval on chromosome 10, where a high frequency of insertions and deletions of different sizes was observed. Unfortunately, the RADseq approach employed in our study doesn't provide a comprehensive resolution of this region. To determine the true causal polymorphism in this interval, a deeper resequencing effort using more genetically diverse plant material is required. The parents of the crosses in our study were not related to each other, but they originated from the same geographical region—the Republic of Karelia. Thus, the identified SNPs associated with the curly phenotype and the proposed InDel marker should be verified in an independent sample of birch trees segregated for “curliness”. This will provide information on whether the BpCW1 marker is useful to phenotype Karelian birch with other genetic backgrounds.

The discovered PCR marker BpCW1, though, is the first marker suggested for selection at an early stage of development of the Karelian birch seedlings, which later, upon reaching the age of 8–10 years, will most likely show the characteristic features of decorative (patterned) wood.

Identification of genetic determinants of curly wood phenotype and development of corresponding molecular markers can prevent excessive costs and efforts in the maintenance of Karelian birch plantations for industrial purposes and, possibly, reduce the anthropogenic pressure on the natural populations of these trees.