## Introduction

The genus Juglans, one of the most important genera of the Juglandaceae family, contains approximately 21 species, of which Persian walnut (Juglans regia L.) is the most economically important for nut production. Persian walnut is a long-lived, monoecious, open-pollinated and dichogamous tree, widely cultivated across temperate and subtropical regions1,2,3,4. Persian walnut trees are native to the mountainous regions of central Asia5,6,7, and today are distributed and grown commercially over a wide geographical range, including west-central Asia, southern Europe, North and South America, Australia and New Zealand1,8,9,10.

Iran and Afghanistan are thought to be among the main centers of origin and domestication of walnut5,11. In Iran, walnut has been planted widely for both nut and wood production, grows in extensive naturalized populations, and plays a key role in the country’s economy and culture2,3,11. Iranian walnut populations inhabit areas of widely varying precipitation, temperature, altitude and latitude12. Walnut trees have been seed-propagated in this area for thousands of years and dichogamy has promoted considerable genetic variation within existing natural or planted seedling populations13,14. Due to propagation primarily by seed, considerable phenotypic variability can be observed for different traits in the natural walnut populations of Iran2,13,14. Therefore, this valuable native walnut gene pool, located throughout the country and containing widely varying alleles, could be a valuable resource for the development of walnut breeding programs in Iran, either by direct selection or by use of superior genotypes in cross-breeding programs2,12,13,14.

Information regarding the amount of genetic divergence between geographically separated walnut populations in Iran can be used to enhance the effectiveness of breeding programs2,15,16. During recent decades the genetic diversity and structure of Persian walnut populations in Iran has been studied using both morphological and molecular markers, and some promising genotypes (e.g. for high yield, good kernel color and late leafing) have been selected and introduced2,3,11,14. However, there are wide areas populated by walnut trees in different parts of Iran that are still genetically unexplored15. These genetic materials are most likely adapted to the local environmental conditions and may have high levels of tolerance to biotic and abiotic stresses16,17. Hence, these native populations potentially could provide resources for sustainable high yields under the climate change scenario16. Therefore, assessing the genomic variation and differentiation of Iranian walnut genotypes will facilitate the effective use of these valuable genetic resources in future breeding programs, not only in Iran but also in other countries.

Patterns of walnut population genetic diversity and structures have been studied using several molecular marker systems, including randomly amplified polymorphic DNA (RAPDs)15, and simple sequence repeat (SSRs)18,19,20,21,22,23,24. However, there is no information in the literature on analysis of walnut genetic resources in Iran using high throughput genotyping platforms. Advances in next-generation sequencing (NGS) and the continuous decrease in cost have facilitated the discovery of whole genome single nucleotide polymorphisms (SNPs)25,26,27. SNP markers are the most abundant type of sequence variations. Distributed throughout the genome, they are the best candidates for performing advanced genetic analyses such as genome-wide association studies (GWAS) and genomic selection (GS)25,26,27,28. Ciarmiello et al.29 developed a simple technique for genetic characterization of walnut through exploiting SNPs in internal transcribed spacers (ITS). You et al.30 released a 6K Infinium SNP array based on ‘Chandler’ for use in walnut genetic improvement. A full walnut genome sequence (‘Chandler’ reference genome) has recently been published31. In addition, Marrano et al.25 released a high density Axiom J. regia 700K SNP array after high depth re-sequencing of 27 founders of the Walnut Improvement Program of University of California, Davis.

GWAS takes advantage of natural genetic variation, which is found extensively in many fruit species26,28. This approach makes it possible to simultaneously screen a large number of individuals for genomic variation, allowing identification of novel alleles underlying various traits26,28. However, few GWAS studies in nut tree crops have been reported so far, due to complex population structures that arose during a long domestication process26,28. Understanding the genetics of characters related to nut and kernel quality is crucial for walnut breeding programs worldwide and no study has identified regions (quantitative trait loci, QTLs) in the walnut genome associated with these traits. Therefore, it may be beneficial to use a diverse panel of walnuts of varied geographic origin in GWAS studies to identify useful new genetic variation for nut and kernel-related traits.

The walnut improvement program of the Horticulture Science Research Institute (HSRI) in Iran was launched in the early 1980’s with the aim of developing new cultivars. As a first step, in 1983 selected superior genotypes from the Iranian walnut gene pool were established in Karaj, Shahrood, Mashhad and Uremia. In 1994, seven genotypes from the Karaj collection were propagated by grafting and planted, along with eight French/Californian commercial cultivars, at Karaj. This resulted in the release of the first Iranian walnut cultivars, Jamal (Z63) and Damavad (Z30) in 2009–201032. Although the fourth phase of the walnut improvement program at HSRI based on traditional breeding continues, climate change and increasing environmental stresses (especially spring frost, drought and salinity) make it vital to initiate a new scion and rootstock breeding program using genomic-based approaches. Thus, we characterized a representative collection of the natural genetic and phenotypic variation present in Iran as first step towards future walnut sampling and the introduction of molecular breeding for Persian walnut in Iran.

The main objectives of the present study were to: (i) assess population structure, genomic variation and differentiation among Iranian walnut populations, (ii) evaluate the performance of the newly designed Axiom J. regia 700K SNP array on Iranian walnut genetic resources, and (iii) identify SNPs significantly associated with nut and kernel-related traits using GWAS. The present study is the first to directly assess the population diversity of Iranian walnut genotypes using high-density SNP markers. Although some of these materials have low agronomic value compared to commercial walnut cultivars, they may contain useful resilience alleles that have been lost in the modern genetic resources of walnut.

## Results

### Phenotypic variation and multivariate analysis

We measured 22 seed-related traits in a walnut collection comprising 95 genotypes, assembled from different parts of Iran (Table 1; Supplementary Table S1; Supplementary Fig. S1). These plant materials were collected from eight provinces ranging from the northwest of Iran (West Azerbaijan) with cold weather to the southeast of Iran (Kerman) with very arid regions. These were all old walnut trees from open pollinated seedlings grown in the valleys among mountainous areas. Twelve nuts were evaluated from each tree. As shown in Table 2, we observed a great phenotypic variation within our walnut collection. The ease of kernel removal from nuts, which varied on a scale from 1 to 7.67, with an average of 2.75, had the maximum coefficient of variation (46.88%), whereas nut width, which varied from 26.01 to 40.72 mm with an average of 32.24, had the lowest coefficient of variation (8.30%) (Table 2).

Correlation coefficients were used to determine the relationships between seed-related traits (Table 3). Significant positive correlations were observed between a size index and nut-related traits including nut length (r = 0.85**), nut width (r = 0.89**), nut thickness (r = 0.81**), and nut weight (r = 0.77**) (Table 3). A roundness index was negatively correlated with nut length (r = −0.80**), nut weight (r = −0.38**) and the size Index (r = −0.39**) (Table 3). Kernel weight was positively correlated with nut length (r = 0.45**), nut width (r = 0.46** nut thickness (r = 0.47**), nut weight (r = 0.83**), and size index (0.54**) (Table 3). Furthermore, kernel percentage was positively correlated with kernel fill and kernel plumpness, and negatively correlated with nut length, and nut width (Table 3).

Principal component analysis (PCA) was used to identify patterns of diversity among genotypes, based on phenotypic traits (Supplementary Figs S2 and S3). PCA showed that the first eight components explained 82.1% of the total variance among the 95 Iranian walnut genotypes (Supplementary Table S2). The first principal component accounted for 25.9% of the total phenotypic variation and showed the highest positive correlation with nut length (0.37), size index (0.35), nut weight (0.32), nut width (0.27), and nut thickness (0.22) (Supplementary Table S2; Fig. S3). The second principal component accounted for 15. 1% of the phenotypic diversity and positively correlated with kernel weight (0.35), nut thickness (0.28), nut weight (0.27), and nut width (0.26) as well as negatively depends on shell color (−0.34), kernel vein (−0.28), and shell texture (−0.26) (Supplementary Table S2; Fig. S3). The bi-plot segregated the genotypes into groups based mainly on their geographical origin (Supplementary Figs S2 and S3).

### Genotyping and data quality control

We genotyped the whole Iranian collection using the latest the Applied Biosystem Axiom J. regia 700K SNP array (Table 1; Supplementary Table S1; Fig. S1). By applying default thresholds (dish quality control – dQC < 0.82 and quality control call rate < 0.97), all of the 95 samples passed the quality standards. The average cluster call rate and average reproducibility were 99.75% and 99.95%, indicating high quality genotyping results.

The SNPs were categorized into the six default classes using Affymetrix Power Tools (APT): 1) Poly High Resolution (PHR), which comprises polymorphic SNPs with three high-resolution genotypic clusters; 2) No Minor Homozygote (NMH), SNPs with no samples of the minor homozygous genotypes; 3) Mono High Resolution (MHR), SNPs which are monomorphic across the genotypes studied; 4) Call Rate Below Threshold (CRBT), SNPs with genotype call rate below threshold (97%); 5) Off-Target Variant (OTV) polymorphisms, where the genotyping data with low-intensity cluster resulted from dissimilarity between the probe and the target sequences and; (6) Other, which includes SNPs with no clear cluster pattern of the genotypic data.

A summary of the distribution of all SNPs in the different categories is shown in Table 4. Overall, the vast majority of the SNPs (323,273; 53.03%) fell into the PHR class of polymorphisms, which represent the highest quality variants. These were filtered for missing rate (<20%) and minor allele frequency (MAF > 0.05), obtaining a final subset of 313,657 PHR SNPs that were used in the subsequent analyses.

### Population structure analysis

To study the structure of Persian walnut populations and the genetic relationship among samples, three different analyses were performed. Principal Component Analysis (PCA) and cluster analysis (CA) were used to assess the genetic distances among Iranian walnut genotypes by using a linkage disequilibrium (LD)-pruned SNP subset of 33,336 PHR polymorphisms. Figure 1a shows the first two principal components, which explained 8.05 and 7.33% of the total variation, respectively. PC1 clearly distinguishes Kerman individuals from the other provinces, while along PC2, genotypes from Fars and Yazd provinces clustered separately from Semnan (Fig. 1a). The cultivar Chandler, which was used as standard during the genotyping process, grouped with individuals from Ilam (Fig. 1a). Overall, we identified a clear separation of the Iranian walnut genotypes into four main genetic clusters centered in (1) Kerman, (2) Fars and Yazd, (3) Semnan, and (4) Ilam, Markazi, Hamedan, and West Azerbaijan provinces (Fig. 1a). By overlaying a map of Iran, we can observe a coincidence between these four walnut subpopulations and potential barriers to gene flow, such as mountains and deserts.

In contrast, the cluster analysis on a genome-wide Identical-By-State (IBS) matrix grouped the Iranian walnut genotypes into five groups (Supplementary Fig. S4): (1) Yazd, Fars and West Azerbaijan, (2) Kerman, (3) Ilam, Markazi and Hamedan, (4) Semnan and (5) Ilam and Chandler (Fig. 2b). As with PCA, the cluster analysis assigned all genotypes to their geographical regions (Supplementary Fig. S4), showing the Fars-Yazd and Semnan groups as the two most distant and the genotypes from Ilam and Markazi as genetically closest.

We then applied the model-based clustering approach implemented in fastSTRUCTURE software to determine the most likely number of genetic groups (K) within our Iranian walnut collection. According to the best choice algorithm function of fastSTRUCTURE, the most likely K ranged from 2 to 5 (Fig. 1c). At K2 the two main groups comprised (1) Kerman individuals (n = 41), (2) all other genotypes (Fig. 1c). At K3 the three major groups encompassed (1) Kerman individuals (n = 41), (2) Semnan individuals (n = 9), and (3) the genotypes from other provinces (Fig. 1c). At K4, in addition to the clusters identified at K3, we observed a fourth group including the cultivar Chandler and two individuals from Ilam (Fig. 1c). At K5, we identified a further substructure with the Kerman population dividing it into two groups, (1) thirty-eight Kerman individuals, and (2) three very old Kerman individuals (Fig. 1c). As with the cluster analysis, most of the Ilam population (n = 10) and all of the Markazi (n = 4) individuals were not clearly assigned to a defined group by fastSTRUCTURE, showing admixture with the Kerman, Semnan and Fars groups (Fig. 1c, K6, Supplementary Table S3). According to the marginal likelihood graph, the most likely number of subgroups in our Iranian walnut diversity panel was K3 (Supplementary Fig. S5). The three- methods of population structure analysis each identified different degrees of substructure, but overall we can conclude that the Iranian walnut population panel comprises mainly four genetic clusters (Fig. 1a–c; Supplementary Figs S4, 5).

### Genetic variation and differentiation

The average values of observed heterozygosity (Ho) and expected heterozygosity (He) were 0.34 and 0.38, respectively but the level of genetic diversity varied among the Iranian walnut populations (Table 5).

The lowest value of allelic richness (Ar) was found in the West Azerbaijan population (1.44), while the Ilam and Kerman populations both showed the highest value of Ar (1.62) (Table 5). The observed heterozygosity (Ho) ranged from 0.30 in the Markazi population to 0.34 in the Ilam and West Azerbaijan populations (0.33) (Table 5). The expected heterozygosity (Nei’s gene diversity, He) varied from 0.26 (West Azerbaijan) to 0.34 (Kerman). The lowest (0.31) and highest (0.35) unbiased expected heterozygosity values (UHE) were identified in the Semnan and Ilam populations respectively (Table 5). The fixation index (FIS) ranged from −0.29 in the West Azerbaijan population to 0.03 in Kerman and Fars populations, but on average, we observed low values of FIS in all Iranian walnut populations, indicating a deficit of homozygotes (Table 5).

To study the amount of genetic differentiation among the Iranian walnut populations, we evaluated genetic differentiation parameters (including GST, DJost and FST) for each pairwise comparison between the Iranian walnut populations (Table 6 and Supplementary Tables S4, 5).

The global FST, FIT and FIS values (0.07, 0.11 and 0.04 respectively) among our Iranian walnut populations indicate moderate genetic differentiation (Supplementary Table S4). A clear differentiation was observed between the Semnan population and those of Yazd, Fars, West Azerbaijan and Kerman provinces (Table 6). The same differentiation was also observed when estimating the DJost (Supplementary Table S5).These findings are in agreement with the above analysis of population structure. Comparison to the Iranian map (Fig. 1b), indicates that all of the genomic results for our population correlate with geographical and ecological location.

### Parental analysis

We investigated the level of relatedness within our walnut collection, including ‘Chandler’ as a commercial walnut cultivar. A pair of Kerman genotypes had a proportion of IBD alleles (PI-HAT) value > 99% and were therefore considered to be genetically identical. The first and second-degree relationships, the most informative, are shown in Table 7. In first-degree relationships, the probability of parent-offspring pairs sharing zero, one, and two IBD alleles (Z0, Z1, and Z2) is expected to be closer to 0, 1 and 0 respectively, while for second-degree pairs, Z0, Z1, and Z2 are expected to be roughly 0.5, 0.5 and 0 respectively. Nine pairs of individuals showed PI_HAT equal to or higher than 0.5, and were then considered to be first-degree relatives (Table 7). Eight pairs showing PI-HAT ≈ 0.25–0.35 and relatively high Z0 and Z1 (≈0.5) values were considered second-degree pairs (Table 7). Interestingly, we identified a second-degree relationship between two individuals from Ilam population and the cv. ‘Chandler’. Overall, we identified first and second-degree relationships only between individuals of the same province.

### Association mapping for nut-related traits

Due to the great phenotypic variability observed in our collection for nut-related traits, we ran GWAS to dissect, for the first time in walnut, the genetic basis of these traits. GWAS was performed using as phenotypic entrance, both the average of 12 kernels per tree and the individual best linear unbiased predictors (BLUPs), as well as the first five PCs of the phenotypic data. We applied the SUPER and FarmCPU algorithms, accounting for both population structure and kinship. A total of 55 loci on 11 chromosomes were significantly associated with six nut and kernel-related traits and the first five PCs of the phenotypic data (Fig. 2; Supplementary Table S6). Three SNPs on chromosome 7 were identified for nut weight. Six SNPs on chromosome 3 were identified for kernel percentage. For both nut width and nut thickness a common SNP on chromosome 1 was identified. Six SNPs, two on chromosome 10 and one each on chromosomes 3, 4, 7, and 13, were identified for shape index. For roundness index, nine SNPs were identified - three on chromosome 10 and one each on chromosomes 1, 3, 4, 6, 8, and 13. We also found 19 SNPs significantly associated with the first five PCs of the phenotypic data. The allelic effect and minor allele frequency (MAF) for these ranged from −5.72 to 12.24 and 0.05 to 0.50 respectively. No significant associations were found for other traits but at a suggestive threshold we found association for all of the traits studied except kernel shrivel, kernel veins, and packing tissue thickness.

Considering the suggestive threshold, we identified 56 loci associated with 13 nut and kernel related traits (Supplementary Table S6). Seven SNPs, six on chromosome 7, and one on chromosome 13 were identified for shell texture. For shell seal, six were on chromosome 13, and one on chromosome 14. Six SNPs on chromosomes 8 (n = 3) and 13 (n = 3) were detected for kernel filled trait. We found 5 SNPs each associated with kernel weight, kernel plumpness, and ease of kernel removal from nuts. Four SNPs each were associated with nut length, size index, shell color, and kernel color and located across chromosomes 1, 2, 4, 6, 7, 10, 11, 12, 13, and 16. Two SNP on chromosome 4 were detected for nut shapes and one SNP each on chromosomes 3 and 4 were identified for shell strength. A single SNP on chromosome 13 was identified for shell thickness. The allelic effect and MAF for these suggestive SNPs ranged from −1.94 to 3.04 and 0.07 to 0.49 respectively.

The 55 significant (41 unique sequence) and 56 suggestive SNPs associated with nut and kernel-related traits were annotated using BLASTx queries (Supplementary Table S6). We found that the most significant SNPs associated with both shape index and round index, fell in genes coding for the WRKY transcription factor 70 isoform X1 (E value = 1E-12; (Fig. 2) and WD repeat-containing protein 3 isoform X2 (E value = 7E-09). Also, the SNP on chromosome 13 associated to kernel fill is located within a gene encoding for the cyclic DOF factor 2-like (E value = 1E-08). Other genes identified as involved in the studied traits included were the WD repeat-containing protein RUP2, the acidic endochitinase-like, the LRR receptor-like serine/threonine-protein kinase GSO1, the classical arabinogalactan protein 9-like, the cysteine-rich receptor-like protein kinase 25, and the NAC domain-containing protein 43 genes (Supplementary Table S6). However, no genes were identified for some marker-trait associations (“None” in the Supplementary Table S6).

## Discussion

Iran has a long history of walnut production and is the third leading country in walnut production worldwide (445,829 in-shell tons33). Most walnut trees grown in Iran originated from seed, and exhibit considerable diversity in yield, quality, and resistance to abiotic and biotic stresses2,11,12. Environmental stresses and climate change are reducing walnut yield17. To face these challenges, native genotypes with interesting phenotypic traits need to be explored and preserved for future development of improved scion and rootstock varieties16. In this regard, as a pre-selection step, we evaluated various Iranian walnut populations in their native habitats for traits including climatic adaptations, precocity, yield, nut quality and resistance to biotic and abiotic stresses. Based on the profiles of all these traits, we selected the most interesting and variable 95 genotypes for the assessment of walnut genomic variation as a first step towards future walnut sampling and the introduction of molecular breeding for Persian walnut in Iran. These individuals originated from eight Iranian provinces (Table 1), rich in native walnut trees adapted to local conditions2,15,16,23. Many of the sampled trees are estimated to be at least 100, and some up to 500, years old. In particular, this collection of walnut trees exhibited especially high levels of phenotypic variation for nut-related traits, which were high correlated to each other, as reported in previous work17,34,35,36,37,38.

The study of genomic variation and genetic differentiation in domesticated or natural populations is important for understanding patterns of local adaptation and dissecting the genetic basis of traits of interest. Access to the new Axiom J. regia 700K SNP array25 allowed us to explore in-depth the genome-wide allelic variation within our Iranian walnut genetic resources.

Compared to previous surveys of Iranian walnut23,36,37,38, we characterized a gene pool covering most of the Iranian walnut distribution at higher genetic resolution. Although the latest Axiom J. regia 700K SNP array was designed using the deep re-sequencing of 27 California walnut accessions, the conversion rate for Iranian samples (53.03%) was similar to that observed for the California material25, confirming this array’s value in assessing the population structure and genomic variation of Iranian walnut populations.

Using the genetic profiles of high-quality SNPs evenly distributed across the genome, we identified four distinct genetic groups in our walnut collection (Fig. 1a,b). Such population structure can be explained by the geographical proveniences of our genotypes as well as the various climate conditions to which they adapted. For instance, the walnut genotypes from Kerman clearly separated from the others (Fig. 1a,b). The Kerman population includes many individuals located at high altitude and a set of three very old trees that formed a separate group at K5. In addition, most of the individuals from Ilam and Markazi were admixed with the genotypes from Kerman, Semnan, and Fars. This level of genetic similarity could be the result of human-mediated exchanges among these provinces. Overall, our population structure results are in line with previous genetic analysis of Persian walnut7,23,39,40.

The average heterozygosity (Ho = 0.34) was similar to that of UC Davis WIP accessions (Ho = 0.3) examined by Marrano et al.25 using the same SNP array. In contrast, our Iranian collection was more heterozygous than six J. regia populations from Kerman province studied by Vahdati et al.16 (0.23), but less heterozygous than other Persian walnut germplasm described in previous genetic surveys23,40,41. This discrepancy can be explained by the use in this study of SNP markers, which are bi-allelic and therefore detect polymorphisms differently from the multi-allelic SSRs42. In the Kerman, Fars, and Ilam populations, He was slightly higher than Ho, probably due to the Wahlund effect or inbreeding. On the other hand, the values of Ho and He in the West Azerbaijan population suggest low inbreeding and large genetic variation. However, these findings could also be due to the small sample size of our walnut panel. Collecting additional walnut material from Iran will be essential in providing further support for our conclusions.

As in Marrano et al.25, overall most of the Iranian populations exhibited little inbreeding and considerable genetic variation, as expected due to the dichogamous nature of walnut which promotes outcrossing. The high inbreeding coefficient found by Vahdati et al.16 suggests significant heterozygote deficiency, likely due to crossing between closely related individuals. We observed greater genetic variation than Vahdati et al.16, probably because we assessed more diverse populations from eight provinces and using genome-wide markers. In addition, although walnut is a mixed-mating tree species, negative FIS values are expected in adult trees because breeding is usually purged at an early age, thus age of the tree and sampling is an important indicator in determining FIS values41. Absence of inbreeding may also be due to an “isolate breaking effect” that occurs when previously isolated populations interbreed43,44.

FST values for most wind-pollinated tree species tend to be lower than 0.1041, indicating that more than 90% of the neutral genetic variation is maintained within populations. Our global FST value of 0.07, indicates that 93% of the genetic diversity of our walnut collection occurs within the seven Iranian populations (Supplementary Table S2). Our results agree with previous studies reporting overall FST values for J. regia populations in Europe, Africa and Asia40, China10,45 and six from Kerman province16. These findings possibly reflect the human-mediated dispersal of J. regia in space and time, resulting in a reassembly and homogenization of walnut genetic diversity.

The pairwise FST analysis suggests the Semnan population is the most genetically differentiated from the other provinces. Semnan province is far from the other studied provinces so neither pollen nor natural seed dispersal from them is feasible. Thus, it will be interesting to explore more in-depth the Semnan population for crossing in future Iranian walnut breeding. The Ilam-West Azerbaijan, and Ilam-Markazi populations showed lower pairwise FST values, suggesting gene flow among these populations. Due to their geographical location, there is a possibility of wind-driven cross-pollination, seed dispersal by animal movement, and seed movement of superior genotypes between orchards by walnut growers.

To date, there is no published parentage analysis of Iranian walnut populations. Although walnut can be propagated by several methods46, propagation of valuable trees from Iranian landraces or natural populations has been almost exclusively by seed, via humans or birds. However, we found a clonal relationship between one pair of very old individuals from Kerman province (Gugher) that were genetically identical and likely propagated by natural layering. Based on our observation at the site, natural layering occurred when a branch of the older tree (more than 500 years old) touched the ground, producing adventitious roots and eventually a new tree. Vahdati and Khalighi47 have previously reported vegetative propagation by layering in walnut. This is the first report of vegetative propagation in an Iranian walnut population.

Parentage analysis identified seventeen pairwise relationships among the Iranian genotypes, nine classified as first and eight as second-degree relationships. All of these were between individuals within given populations, probably due to open pollination or seed exchange between local walnut growers. Interestingly, we identified a second-degree relationship between the cultivar Chandler and two individuals from the Ilam population, possibly indicating one of the ancestors of ‘Chandler’ could have originated from Iran. Our results are consistent with Tulecke48 that stated the parents of Chandler might have originated from Iran. These parentage results, together with the genomic variation and differentiation analysis, are of interest both for clarifying the relationships between walnut accessions in view of GWAS analysis, and for accurate planning of future breeding programs in Iran.

Due to the limited number of individuals per population, our conclusions about genetic differentiation among populations have to be considered preliminary. The results of this study can be improved by increasing the sample size from the populations that we had a limited number of samples of them in our survey (Hamadan, West Azerbaijan, Markazi, and Yazd), and from the Semnan population as the most genetically differentiated from the other populations. In addition, in future studies, sampling of young and old trees from the same locations might reveal if local regeneration methods tend to preserve local genetic diversity or obscure it by the importation of new genetic types. Additional collections from other Iranian provinces and further genomic analysis will provide additional evidences to our results.

As further proof of the value of our walnut collection, we performed association mapping for nut and kernel-related traits, identifying marker-trait associations for 19 of the 22 traits studied. Our GWAS results revealed 55 significant SNPs (41 of unique sequence) associated with the variation of six traits, and 56 suggestive SNPs associated with 13 traits. None of the markers identified for the studied traits were previously mapped. In some cases, the same SNP was associated with different traits, which could be explained by the high correlation observed among them, or pleiotropic effects. Limited information exists in walnut regarding the genetic based of nut and kernel-related traits. The annotation of the significant and suggestive SNPs revealed genetic mechanisms for the studied traits also identified for other plant species. In particular, it has been already reported that the genes encoding for the WRKY transcription factor, the LRR receptor-like serine/threonine-protein kinase GSO1, the NAC domain-containing protein 43, and cyclic DOF factor 2-like are likely involved in embryo development49 and, therefore, the determination of seed size50, as well as in the development of the epidermal surface in embryos and cotyledons51, seed size/weight52, and seed maturation53. The SNPs identified in this study, if appropriately validated, could be used as potential markers for marker-assisted breeding in walnut.

## Conclusion

Genome-wide markers offer new opportunities for better understanding genomic variation and architecture of horticulturally important traits in walnut. We used the Axiom J. regia 700K SNP array to characterize Iranian walnut genotypes and to verify that the genetic variation available in our panel was suitable to perform an informative association mapping study. We observed a conversion rate similar to that obtained using the same SNP array in a Californian walnut collection. This indicates the Axiom J. regia 700K SNP array is a robust and valid genomic tool for further exploring the genetic variation and differentiation of walnut worldwide.

Population structure analysis of this Iranian walnut collection showed four main groups. Total differentiation among the populations was moderate, reflecting the occurrence of cross-hybridization events between native populations. Pairwise FST analysis found the Semnan population to be the most genetically differentiated and further in-depth examination of it should be prioritized in view of its value for future breeding programs. Overall, we observed consistency among the different genetic analyses employed and results were in accord with geographical and ecological information. Our findings demonstrate that large genetic variation still exists within Iranian walnut populations located in one of the main centers of origin and domestication of Persian walnut. Also, the potential of our population for future GWAS studies was confirmed through the results of association mapping for nut and kernel-related traits. The information generated in this study will be useful for better understanding the genetic basis of adaptation in walnut and identifying resilience alleles to be used by future breeding programs in addressing the challenges of climate change. Our invaluable collection of walnut genotypes adapted to diverse climates and altitudes across Iran were maintained at the Nut Crops collection orchard of Aburaihan Campus, University of Tehran. All the seven populations investigated in present study, along with additional material collected from other parts of Iran, have been established in a common garden to investigate in the future the genetic architecture of local adaptation and the correlation among genotypes and both environmental variables and drought-related traits using GWAS approaches.

## Materials and Methods

### Sample collection and phenotypic measurements

In the first step, pre-selection of Iranian walnut genotypes based on phenotypic records from the Iranian Ministry of Agriculture and local growers was performed with the aim of selecting superior genotypes to be used in future scion and rootstock breeding programs. We selected 95 genotypes to characterize at both phenotypic and genomic levels. In particular, the selected 95 individuals inhabit disjunctive mountainous areas, and are old walnut trees from open pollinated seedlings (50- to 500-year-old) with trunk diameters greater than 50 cm. They represent local populations (seedlings) that were randomly planted by humans or birds (mostly crows), and grow across wide areas in different parts (valleys or mountains) of Iran. Seeds of the studied genotypes collected from walnut populations located in eight main walnut producers provinces in Iran. The walnut trees investigated in the present research were separated from each other by 366–1768 km (approximately 890 km on average). We sampled four locations from Kerman (Baft-Gugher, Rabor, Rabor-Hanza and Bardsir), two each from Fars (Eqlid and Bavanat) and Ilam (Ilam and Eyvan) and one each from Semnan, Yazd, Markazi, West Azerbaijan and Hamadan (Table 1; Supplementary Table S1; Fig. S1). We planned to collect a minimum of ten samples per location; however, the number of samples collected per region was not equal because of different plant density found in each regions. These native genotypes are considered diverse on a regional scale since each region has gradually selected individuals adapted to environmental, horticultural, cultural and traditional features of the location. Therefore, the 95 selected genotypes likely represent a large part of the full genetic diversity found in Iranian walnut populations. A summary of their profiles for nut and kernel related traits are shown in Table 2.

Leaf tissue and open-pollinated seeds (at least 60 nuts per mother tree) were sampled. Twelve nuts of each selected genotype were used to evaluate 22 fruit-related traits, based on International Plant Genetic Resources Institute (IPGRI) or BI descriptors54. These included nut length (NuLe), nut width (NuWi), nut thickness (NuTh), nut weight (NuWe), kernel percentage (KePe), shape index (ShIn), size index (SiIn), round index (RoIn), nut shape (NuSh), shell thickness (SheTh), shell color (SheCo), shell texture (SheTe), shell seal (SheSe), shell strength (SheSt), packing tissue thickness (PaTiTh), kernel weight (KeWe), kernel color (KeCo), kernel plumpness (KePl), kernel shrivel (KeSh), kernel vein (KeVe), kernel fill (KeFi), and ease of kernel removal from nuts (EKeNu). Statistical analyses including descriptive statistics and normality testing on data and their residuals were performed in Minitab 18 statistical software (Minitab, Inc., State College, PA, USA). Multivariate statistical analyses, including principal component analysis (PCA), and correlation analysis were conducted using R55. Pearson and Spearman’s rank correlation coefficients were used to determine the relationships between two continuous variables and two continuous or continuous-ordinal variables respectively.

### Plant materials and DNA extraction

This study examined 95 adult walnut trees grown locally in diverse parts of Iran with various climates. The plant material was collected from eight Iranian provinces: Kerman, Fars, Ilam, Semnan, Yad, Markazi, West Azerbaijan and Hamedan. A detailed list is presented in Table 1. During the summer of 2017 mature fresh leaves were collected, immediately frozen in liquid nitrogen, and lyophilized. Geographical information for each tree was recorded along with detailed climate and population data (Table 1, Fig. 1b).

Total genomic DNA was extracted from 40 mg of dry leaves using the E-Z 96 Plant DNA Kit (Omega Bio-tek; Norcross, GA) according to the manufacturer’s instructions. DNA was quantified using Qubit dsDNA High Sensitivity (HS) Assay Kits (InVitrogen, Life Technologies). The quality of DNA samples was checked by agarose gel electrophoresis.

### Genotyping with the Axiom J. regia 700K SNPs array

DNA samples were adjusted to the recommended concentration of 15 ng/µL in 50 µL aliquots and sent to Affymetrix (now part of Thermo Fisher Scientific, Santa Clara, CA; www.affymetrix.com) for genotyping using the new Axiom J. regia 700K SNPs array25 on the Affymetrix GenTitan platform. Genomic DNA of the cultivar Chandler was used as a control.

### SNP allele calling and data analysis

SNP allele calling was performed by the Bioinformatics Core of Affymetrix as described in the Axiom Genotyping Solution Data Analysis Guide (http://www.bea.ki.se/documents/axiom_genotyping_solution_analysis_guide.pdf). The samples with a dQC value ≥ 0.82 and a QC call rate ≥ 97% were considered for further analysis. SNPs were then classified into six major classes: PHR, NMH, OTV, MHR, CRBT and Other.

### Analysis of population structure

PHR SNPs filtered for missing rate (>20%) and MAF (<5%), and LD-pruned (plink commands: indep-pairwise 50 5 0.25) were used to perform the population structure analysis. Two approaches were used: (i) Principal Component Analysis (PCA) and (ii) fastSTRUCTURE analysis. The PCA analysis was performed using the R package ‘SNPRelate’56. The PCA plot was constructed using the R package ggplot2. A Bayesian clustering approach using fastSTRUCTURE software v1.057 was then applied. A number of clusters (K values) ranging from 2 to 10 were tested, with ten replicates each using the default convergence criterion and priors. The most likely K value was chosen by plotting the marginal likelihood of the data, and with the best choice function implemented in fastSTRUCTURE. The results of all replicates for each K cluster were summarized using CLUMPAK (http://clumpak.tau.ac.il/)58.

To confirm further the subgroups identified by the above analysis, individual dissimilarities for each pair of individuals were calculated and used for hierarchical cluster analysis by the R package ‘SNPRelate’56.

### Genomic diversity and differentiation among Iranian walnut populations

The genomic diversity of populations was estimated using the ‘diveRsity’ package59, which is available in R55. Mean number of alleles per locus (A), mean observed heterozygosity (Ho), expected heterozygosity (He), unbiased expected heterozygosity (UHe), allelic richness per population (Ar), and inbreeding coefficient (FIS) were calculated for SNPs with a missing rate < 20% and MAF > 5% across the different sub-populations of Iranian walnut accessions. Populations with only one individual were excluded from the analysis.

Two important measurements of within-population genomic variation at a marker locus are the expected heterozygosity (He or gene diversity) and the observed heterozygosity60. These are computed as:

$${H}_{e}={\sum }_{i=1}^{k}{\sum }_{j=i+1}^{k}2{p}_{i}{p}_{j}=1-{\sum }_{i=1}^{k}{p}_{i}^{2}\,{\rm{for}}\,{\rm{k}}\,{\rm{alleles}}$$
(1)

Where k is the number of distinct alleles at a locus, and pi (i = 1, 2,… k) is the frequency of allele i in the population.

And

$${H}_{o}={H}_{e}(1-{\rm{F}})$$
(2)

Where F ranges from 0 (no inbreeding) to 1 (completely inbred population)

Genetic differentiation statistics between subpopulations for each locus and across all loci were computed using the R package ‘diveRsity’. These included: GST = Hedrick’s standardized “differentiation” per locus61, DJost = Jost’s true allelic differentiation per locus62, and the three unbiased estimators of Wright’s F-statistics63 within-population inbreeding coefficient (FIS), total-population inbreeding coefficient F (FIT), and the among-population genetic differentiation coefficient θ(FST). The index FST was computed as:

$${{\rm{F}}}_{{\rm{ST}}}({\rm{\theta }})=\frac{{\sigma }_{w}}{{\sigma }_{a}+{\sigma }_{b}+{\sigma }_{w}}$$
(3)

where σw is the variance of the allele frequencies between populations, σb is the variance of the allele frequencies between individuals within populations, and σa is the variance of the allele frequencies between gametes within individuals.

PGDSpider version 2.1.1.564 was used to convert PLINK files (PED and MAP) to Genepop format as data input format for ‘diveRsity’ R package.

### Relatedness Analysis

The PLINK 1.9 software65 was employed on each pair of the Iranian walnut genotypes to infer relatedness for all pairwise comparisons among the 95 Iranian walnut accessions. Pairwise IBD analysis was used to explore the first-degree and second-degree relationships among individuals as the proportion of the SNPs at which there were zero, one, or two shared IBD alleles represented by Z0, Z1, and Z2 respectively. Relatedness was then measured using the PLINK PI_HAT parameter, which indicates the proportion of SNPs in IBD between individual pairs. Pairs of accessions with PI_HAT value > 95% were considered to be genetically identical. We considered indvidual pairs to be first and second-degree relatives if they had PI_HAT values ≥ 0.47 and 0.22 respectively.

### Association mapping for nut-related traits

Association mapping was performed for 22 seed-related traits using the average performance of each genotype, BLUPs and the first five PCs of the phenotypic data. GWAS was carried out by applying three models: MLMM66, SUPER67 and Fixed and random model Circulating Probability Unification (FarmCPU)68 method, as implemented in GAPIT69. Population structure and familial relatedness were taken into account in all models. We determined suggestive thresholds to correct the p-value for multiple testing using the approach described by Gao et al.70. The significant and suggestive P values were 5e−07 and 1.006e−05 respectively. Manhattan plots were constructed accordingly using the GAPIT. For each trait, the significant SNPs were compared and annotated using the walnut gene annotation v1.0 (taxid: 51240)31.