Date: 2021-11-01

SNP calling relative to the AL8/78 reference genome

Following whole-genome shotgun sequencing, we called SNPs across the panel relative to the Ae. tauschii AL8/78 reference genome assembly. The 306 Ae. tauschii samples were aligned to the AL8/78 reference genome14 using HISAT2 with default parameters52. All alignment BAM files were sorted and duplicates were removed using SAMtools (v.1.9 ‘view’, ‘sort’ and ‘rmdup’ sub-commands). All BAM files were fed into the variant-calling pipeline using BCFtools (-q 20 -a DP,DV | call -mv -f GQ), parallelized with ‘-r $region’ over 4-Mb windows for a total of 1,010 intervals (regions). The raw variant files were filtered or recalled using a published AWK script based on the ratio of non-reference read depth (DV) to total read depth (DP) with default parameters (https://bitbucket.org/ipk_dg_public/vcf_filtering/src/master/), except for the minPresent parameter, for which we used two settings (minPresent = 0.8 and minPresent = 0.1). The minPresent = 0.8 dataset was used for redundancy analysis. Both datasets were used for genome-wide association study (GWAS) analysis. The resulting matrix (104 million SNPs for minPresent = 0.1, concatenated using BCFtools v.1.11) was uploaded to Zenodo.
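The windowed parallelization can be illustrated with a minimal sketch that tiles each chromosome into 4-Mb region strings of the form accepted by the ‘-r’ option. The function name, the chromosome names and the lengths below are hypothetical, and the exact windowing used in the original pipeline is not specified in the text.

```python
def make_regions(chrom_lengths, window=4_000_000):
    """Tile each chromosome into 1-based, inclusive 'chrom:start-end' windows."""
    regions = []
    for chrom, length in chrom_lengths.items():
        start = 1
        while start <= length:
            end = min(start + window - 1, length)  # last window may be shorter
            regions.append(f"{chrom}:{start}-{end}")
            start = end + 1
    return regions


# Hypothetical chromosome lengths, for illustration only.
toy_lengths = {"chrA": 9_500_000, "chrB": 4_000_000}
for region in make_regions(toy_lengths):
    print(region)
```

Each region string could then be passed to one variant-calling job, and the per-region outputs concatenated afterwards.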

Quality control for redundancy and residual heterogeneity

A total of 100,900 (100 every 4-Mb window) SNPs were randomly chosen to compute pairwise identity by state among all samples for a total of 46,665 comparisons using custom R and AWK scripts (https://github.com/wheatgenetics/owwc). For every sample pair, a percent identity greater than 99.5% was deemed redundant based on the histogram distribution of all identity by state values (Extended Data Fig. 1c). This analysis confirmed the results of the KASP analysis conducted on the L2 accessions (Extended Data Fig. 1b and Supplementary Note).
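As a rough illustration of the redundancy check, the sketch below computes a simple pairwise percent identity and flags pairs above the 99.5% cutoff. It assumes that identity by state is the share of jointly called sites with identical genotype calls; the published R and AWK scripts may define it differently, and the sample names and genotype codes are invented.

```python
from itertools import combinations


def identity_by_state(a, b):
    """Percent of sites with identical calls, skipping sites missing (None) in either sample."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return float("nan")
    same = sum(1 for x, y in shared if x == y)
    return 100.0 * same / len(shared)


def redundant_pairs(genotypes, threshold=99.5):
    """Return sample pairs whose percent identity exceeds the threshold."""
    pairs = []
    for s1, s2 in combinations(sorted(genotypes), 2):
        if identity_by_state(genotypes[s1], genotypes[s2]) > threshold:
            pairs.append((s1, s2))
    return pairs


# 306 samples give 306 * 305 / 2 unordered pairs, matching the 46,665 comparisons.
n_samples = 306
n_pairs = n_samples * (n_samples - 1) // 2

# Toy genotypes (0 = reference homozygote, 2 = alternative homozygote), illustration only.
toy = {"s1": [0, 0, 2, 2], "s2": [0, 0, 2, 2], "s3": [0, 2, 2, None]}
print(n_pairs, redundant_pairs(toy))
```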

For each accession (except TOWWC0193, which is related to the reference genome AL8/78), the fraction of heterozygous SNPs in the total number of biallelic SNPs was computed. Based on the distribution of these values (Extended Data Fig. 1d and Supplementary Table 3), 0.1 was deemed to indicate a low degree of residual heterogeneity. BW_26042, with a value of 0.17, was found to be the only outlier exceeding this threshold.
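The heterozygosity measure can be sketched as follows, assuming that the denominator is the number of called biallelic sites for the accession and that genotypes are given as VCF-style GT strings. Both points are assumptions made for illustration, as is the function name.

```python
def het_fraction(genotypes):
    """Fraction of heterozygous calls among called diploid genotypes at biallelic sites."""
    called = 0
    het = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if len(alleles) != 2 or "." in alleles:
            continue  # skip missing or non-diploid calls
        called += 1
        if alleles[0] != alleles[1]:
            het += 1
    return het / called if called else float("nan")


# Toy genotype calls for one accession, illustration only.
toy_calls = ["0/0", "0/1", "1/1", "./.", "1/1", "0/1"]
value = het_fraction(toy_calls)
print(value, "exceeds 0.1" if value > 0.1 else "within 0.1")
```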

Based on these quality control analyses, a non-redundant and genetically stable set of 242 accessions was retained for further analysis. The redundant pairs, along with the different similarity scores, are given in Supplementary Table 4, and the set of 242 non-redundant accessions is provided in Supplementary Table 5.

De novo assembly from whole-genome shotgun short-read data

The primary sequence data of non-redundant accessions were trimmed using Trimmomatic v.0.238 and de novo assembled with the MEGAHIT v.1.1.3 assembler using default parameters53. The output of the assembler for each accession was a FASTA file containing all the contig sequences. The assemblies are available from Zenodo.

Genome assembly of Ae. tauschii accession TOWWC0112

TOWWC0112 (line BW_01111) was assembled by combining paired-end and mate-pair sequencing reads using TRITEX54, an open-source computational workflow. A PCR-free 250-bp paired-end library with an insert size range of 400–500 bp was sequenced to a coverage of ~70. Mate-pair libraries MP3 and MP6, with insert size ranges of 2–4 kb and 5–7 kb, respectively, were sequenced to a coverage of ~20. The assembly generated had an N50 of 196 kb (Supplementary Table 7). The assembly is available from the electronic Data Archive Library (e!DAL).

Genome assembly of Ae. tauschii accession TOWWC0106

Accession TOWWC0106 (line BW_01105) was sequenced on a PacBio Sequel II platform (Pacific Biosciences) with single-molecule, real-time chemistry and on the Illumina platform. For single-molecule, real-time library preparation, ~7 μg of high-quality genomic DNA was fragmented to a 20-kb target size and assessed on an Agilent 2100 Bioanalyzer55. The sheared DNA was end repaired, ligated to blunt-end adaptors and size selected. The libraries were sequenced by Berry Genomics. A standard Illumina protocol was followed to make libraries for PCR-free paired-end genome sequencing with ~1 μg of genomic DNA that was fragmented and size selected (350 bp) by agarose gel electrophoresis. The size-selected DNA fragments were end blunted, provided with an A-base overhang and then ligated to sequencing adapters. A total of 251.8 Gb of high-quality 150-bp paired-end PCR-free reads was generated on the NovaSeq sequencing platform.

A set of 11.35 million PacBio long reads (289.6 Gb), representing ~66-fold genome coverage, was assembled using the CANU pipeline with default parameters56. The assembled contigs were polished with the 251.8 Gb of PCR-free reads using Pilon with default parameters57. The resulting assembly had an N50 of 1.5 Mb (Supplementary Table 7). The assembly is available from e!DAL.

Phenotyping the Ae. tauschii diversity panel and synthetic hexaploid wheat lines

Wheat stem rust

The wheat stem rust phenotypes with P. graminis f. sp. tritici isolate 04KEN156/04, race TTKSK, and isolate 75ND717C, race QTHJC, were obtained from Arora et al.22. As part of this study, we also phenotyped the same Ae. tauschii lines with isolate UK-01 (race TKTTF)58 (Supplementary Table 8) using the same procedures as described in ref. 59. UK-01 was obtained from Limagrain.

Trichomes

For counting trichomes and measuring flowering time in Ae. tauschii, 50 L1 accessions and 150 L2 accessions were pregerminated at ~4 °C in Petri dishes on wet filter paper for 2 d in the dark. They were transferred to room temperature (~20 °C) and daylight for 4 d. Three seedlings of each genotype were transplanted on 22 January 2019 into 96-cell trays filled with a mixture of peat and sand and then grown under natural vernalization in a glasshouse with no additional light source or heating at the John Innes Centre, Norwich, UK. Trichome phenotyping was conducted 1 month later. Close-up photographs of the second leaf from seedlings at the three-leaf stage were taken and visualized in ImageJ, and trichomes were counted along one side of a 20-mm leaf margin in the mid-leaf region. Measurements were taken from three biological replicates (Supplementary Table 8).

Flowering time, biological replicate 1

Three seedlings used for trichome phenotyping (see above) were transferred on 25 March into individual 2 l pots filled with cereal mix soil60. Flowering time was recorded when the first five spikes were three-fourths emerged from the flag leaf sheath, equivalent to a 55 on the Zadoks growth scale61 (Supplementary Table 8).

Flowering time, biological replicates 2 and 3

A total of 147 Ae. tauschii L2 accessions were grown in the winters of 2018/2019 and 2019/2020 in the greenhouse at the Department of Agrobiotechnology, University of Natural Resources and Life Sciences, Vienna, Austria. Seeds of each accession were sown in multitrays in a mixture of heat-sterilized compost and sand and stratified for 1 week before germination at 4 °C with a 12 h day/12 h night light regimen. Thereafter, the seeds were germinated at 22 °C and at the one-leaf stage vernalized for 11 weeks. Five seedlings per accession were transplanted to 4 l pots (18 cm in diameter, 21 cm in height) filled with a mixture of heat-sterilized compost, peat, sand and rock flour. In the winter of 2018/2019, one pot (= one replicate) per accession was planted, whereas in 2019/2020, two pots (= two replicates) were planted. The pots were randomly arranged in the greenhouse and maintained at a temperature of 14/10 °C day/night with a 12 h photoperiod for the first 40 d. At spike emergence, the temperature was increased to 22/18 °C day/night with a 16 h photoperiod at 15,000 lx. At least ten spikes per pot were evaluated for beginning of anthesis, taken as 60 on the Zadoks growth scale61, resulting in a minimum of 30 assessed spikes per accession. Flowering time was recorded every second day.

The flowering date was analyzed using a linear mixed model, which considered subsampling of individual spikes within each pot as follows:

$$\mathcal{Y}_{ijkl} = \mu + g_i + e_j + ge_{ij} + r_{jk} + p_{ijk} + \varepsilon_{ijkl}$$

Here, \(\mathcal{Y}_{ijkl}\) denotes the flowering date observation of the individual spikes, \(\mu\) is the grand mean and \(g_i\) is the genetic effect of the ith accession. The environment effect, \(e_j\), is defined as the effect of the jth year, and the genotype-by-environment interaction is described by \(ge_{ij}\). \(r_{jk}\) is the effect of the kth replication within the jth year, \(p_{ijk}\) is the effect of the ith pot within the kth replication and jth year and \(\varepsilon_{ijkl}\) is the residual term. Analysis was performed with R v.3.5.1 (ref. 62) using the package sommer63 with all effects considered as random except \(g_i\), which was modeled as a fixed effect to obtain the best linear unbiased estimates (Supplementary Table 8).

Spikelets per spike

For Ae. tauschii spikelet phenotyping, 151 accessions from L2 were vernalized at a constant temperature of 4 °C for 8 weeks in a growth chamber (Conviron). After vernalization, the accessions were transplanted to 3.8 l pots in potting mix (peat moss and vermiculite) and placed in a temperature-controlled Conviron growth chamber with diurnal temperatures gradually changing from 12 °C at 02:00 to 17 °C at 14:00 with a 16 h photoperiod and 80% relative humidity. To represent biological replication, each accession was grown in two pots, and each pot contained two plants. At the transplanting stage, 10 g of a slow-release N-P-K fertilizer was added to each pot. At physiological maturity, 5–15 main stem/tiller spikes per replication (that is, per pot) were collected, and the number of immature as well as mature spikelets were counted. Any obvious weak heads from late-growing tillers were not included. Least square means for each replication were used for k-mer-based association genetic analysis (Supplementary Table 8).

Powdery mildew

Resistance to B. graminis f. sp. tritici was assessed with Bgt96224, a highly avirulent isolate from Switzerland64, using inoculation procedures previously described65. Disease levels were assessed 7–9 d after inoculation as one of five classes of host reactions: resistance (R; 0–10% of leaf area covered), intermediate resistance (IR; 10–25% of leaf area covered), intermediate (I; 25–50% of leaf area covered), intermediate susceptible (IS; 50–75% of leaf area covered) and susceptible (S; >75% of leaf area covered) (Supplementary Table 8).

Wheat curl mite

A total of 210 Ae. tauschii accessions, 102 from L1 and 108 from L2 (Supplementary Table 8), were screened for their response against wheat curl mite. Aceria tosichella (Keifer) biotype 1 colonies (courtesy of M. Smith, Department of Entomology, Kansas State University) were mass reared under controlled conditions at 24 °C in a 14 h light/10 h dark cycle using the susceptible wheat cv. Jagger. The biotype 1 colony was previously reported as avirulent toward all Cmc resistance genes38,66,67,68. A single colony consisted of an individual pot with ~50 seedlings, and 20 colonies were grown to have sufficient mite inoculum to conduct the phenotyping. Colonies were placed inside 45 cm × 45 cm × 75 cm mite-proof cages covered with a 36-µm mesh screen (ELKO Filtering Co.) to avoid contamination until being used to infest the Ae. tauschii accessions. Accessions from L1 and L2 were evaluated in independent experiments. Six plants per accession were individually grown in 5 cm × 5 cm × 5 cm pots under controlled conditions at 24 °C in a 14 h light/10 h dark cycle. Pots were arranged randomly in an incomplete block design where the block was the tray fitting 32 pots (8 rows and 4 columns). A single pot with the susceptible check cv. Jagger was included in each tray. Accessions were infested at the two-leaf stage, with mite colonies collected from infested pieces of leaves from the susceptible plants and spread as straw over the pots. Plants were evaluated individually 10–14 d after infestation. Wheat curl mite damage was assessed as curled or trapped leaves using a visual scale from 0 to 4, with 0 indicating no symptoms and 1 to 4 indicating increasing levels of curliness or trapped leaves (Extended Data Fig. 7a).

The adjusted mean or best linear unbiased estimator for each accession was calculated with the ‘lme4’ R package69 using the following linear regression model:

$$y_{ijkl} = \mu + G_i + T_j + R_{k(j)} + C_{l(j)} + e_{ijkl}$$

Here, \(y_{ijkl}\) is the phenotypic value, \(\mu\) is the overall mean, \(G_i\) is the fixed effect of the ith accession (genotype), \(T_j\) is the random effect of the jth tray, assumed independent and identically distributed (iid) as \(T_j \sim N(0,\sigma_T^2)\), \(R_{k(j)}\) is the random effect of the kth row nested within the jth tray, assumed iid as \(R_{k(j)} \sim N(0,\sigma_R^2)\), \(C_{l(j)}\) is the random effect of the lth column nested within the jth tray, assumed iid as \(C_{l(j)} \sim N(0,\sigma_C^2)\), and \(e_{ijkl}\) is the residual error, assumed iid as \(e_{ijkl} \sim N(0,\sigma_e^2)\).

k-mer presence/absence matrix

k-mers (k = 51) were counted in trimmed raw data per accession using Jellyfish70 (version 2.2.6 or above). k-mers with a count below two in an accession were discarded immediately. k-mer counts from all accessions were integrated to create a presence/absence matrix with one row per k-mer and one column per accession. The entries were reduced to 1 (presence) and 0 (absence). k-mers occurring in fewer than two accessions or in all but one accession were removed during the construction of the matrix. Programs to process the data were implemented in Python and are published at https://github.com/wheatgenetics/owwc. The k-mer matrix is available from e!DAL.
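The construction of the presence/absence matrix can be sketched in a few lines, assuming that per-accession k-mer counts have already been loaded into dictionaries. The function name and the toy 3-mers are hypothetical, and the published implementation will differ in how it handles billions of 51-mers. The filter follows the wording above literally; k-mers present in every accession are not mentioned in the text and are kept in this sketch.

```python
def presence_absence(counts_by_accession, min_count=2):
    """Build {kmer: [0/1 per accession]} from per-accession k-mer counts."""
    accessions = sorted(counts_by_accession)
    n = len(accessions)
    rows = {}
    for col, acc in enumerate(accessions):
        for kmer, count in counts_by_accession[acc].items():
            if count >= min_count:  # discard k-mers seen fewer than two times
                rows.setdefault(kmer, [0] * n)[col] = 1
    # Drop k-mers present in fewer than two accessions or in all but one accession.
    matrix = {k: row for k, row in rows.items() if sum(row) >= 2 and sum(row) != n - 1}
    return accessions, matrix


# Toy counts with 3-mers standing in for 51-mers, illustration only.
toy_counts = {
    "a": {"AAA": 5, "CCC": 2, "GGG": 1, "TTT": 3},
    "b": {"AAA": 2, "CCC": 1, "TTT": 4},
    "c": {"AAA": 3, "TTT": 2, "GGG": 2},
    "d": {"AAA": 2, "CCC": 3},
}
columns, matrix = presence_absence(toy_counts)
print(columns, matrix)
```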

Phylogenetic tree construction

A random set of 100,000 k-mers was extracted from the k-mer matrix to build an unweighted pair group method with arithmetic mean (UPGMA) tree with 100 bootstraps using the Bio.Phylo module from the Biopython v.1.77 (http://biopython.org) package. Further, a Python script was used to generate an iTOL-compatible (https://itol.embl.de/) tree for rendering and annotation. The Python script and the random set of 100,000 k-mers used for generating the tree are available at https://github.com/wheatgenetics/owwc.

Bayesian cluster analysis using STRUCTURE

Bayesian clustering implemented in STRUCTURE19 version 2.3.4 was used to investigate the number of distinct lineages of Ae. tauschii. To control the bias due to the highly unbalanced proportion of the three groups20 in the non-redundant sequenced accessions (119 accessions of L2, 118 accessions of L1 and 5 accessions of putative L3), 10 accessions each of L1 and L2 were randomly selected for each STRUCTURE run along with the 5 accessions of the putative L3 and the control L1–L2 RIL. The random selection of 10 accessions each of L1 and L2 was performed 11 times without replacement, thus covering a total of 110 accessions each of L1 and L2 over 11 STRUCTURE runs (Supplementary Table 6). STRUCTURE simulations were run using a random set of 100,000 k-mers with a burn-in length of 100,000 iterations followed by 150,000 Markov chain Monte Carlo iterations for five replicates each of K ranging from 1 to 6. STRUCTURE output was uploaded to Structure Harvester (http://taylor0.biology.ucla.edu/structureHarvester; Web v.0.6.94 July 2014; Plot vA.1 November 2012; Core vA.2 July 2014)71 to generate a ΔK plot for each run. For each STRUCTURE run, a clear peak was observed at K = 3 in the ΔK plot, suggesting that there are three distinct lineages of Ae. tauschii19,71. STRUCTURE results were processed and plotted using CLUMPAK72,73 (http://clumpak.tau.ac.il/; beta version accessed on 11 May 2021) to maintain the label collinearity for multiple replicates of each K.
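The balanced subsampling scheme can be sketched as below: one shuffle of a lineage's accessions followed by slicing into 11 disjoint batches of 10, so that no accession is drawn twice. The accession identifiers and the seed are hypothetical, and how the original random draws were made is not described in the text.

```python
import random


def draw_batches(accessions, batch_size=10, n_batches=11, seed=0):
    """Split a shuffled accession list into disjoint batches (sampling without replacement)."""
    pool = list(accessions)
    if len(pool) < batch_size * n_batches:
        raise ValueError("not enough accessions for the requested batches")
    random.Random(seed).shuffle(pool)
    return [pool[i * batch_size:(i + 1) * batch_size] for i in range(n_batches)]


# Hypothetical identifiers for the 118 non-redundant L1 accessions.
l1_ids = [f"L1_{i:03d}" for i in range(118)]
batches = draw_batches(l1_ids)
print(len(batches), len({acc for batch in batches for acc in batch}))
```

Each batch would then be combined with the five putative L3 accessions and the control L1–L2 RIL for one STRUCTURE run.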

Determination of genome-wide fixation index

Genome-wide pairwise fixation index (\(F_{ST}\)) between the three Ae. tauschii lineages was computed using VCFtools74 v.0.1.15 with the parameters ‘--fst-window-size’ and ‘--fst-window-step’ set to 1,000,000 and 100,000, respectively.

Admixture analysis of the wheat D subgenome

To assign segments of the wheat D subgenome to Ae. tauschii lineages for each of the 11 chromosome-scale wheat assemblies21, we considered only those k-mers as usable that were present at a single locus in the D subgenome. Furthermore, out of these k-mers, for nine modern cultivars, only those k-mers were considered usable that were also present in the short-read sequences from 28 hexaploid wheat landraces17. For the assembled wheat genomes, each chromosome of the D subgenome was divided into 100-kb non-overlapping segments. A 100-kb segment was assigned to Ae. tauschii if at least 20% of 100,000 k-mers within that segment were usable as well as present in at least one non-redundant Ae. tauschii accession. A segment assigned to Ae. tauschii was further assigned to one of the three lineages (L1, L2 and L3) if the count of usable k-mers specific to that lineage exceeded the count of those specific to the other lineages by at least 0.01% of 100,000 k-mers. Scripts to determine the counts of lineage-specific and total Ae. tauschii k-mers per 100-kb segment are published at https://github.com/wheatgenetics/owwc, and the output files obtained for 11 wheat assemblies were collated in an Excel file that is available from Zenodo.
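The two-step assignment rule can be written as a short decision function. This is a sketch under one stated assumption: the lineage margin is interpreted as the leading lineage exceeding the runner-up by at least 0.01% of 100,000 k-mers (that is, 10 k-mers). The text could also be read differently, and the function name and return labels are hypothetical.

```python
KMERS_PER_SEGMENT = 100_000
MIN_TAUSCHII = KMERS_PER_SEGMENT * 20 // 100   # 20% of 100,000 = 20,000 k-mers
LINEAGE_MARGIN = KMERS_PER_SEGMENT // 10_000   # 0.01% of 100,000 = 10 k-mers


def assign_segment(tauschii_kmers, lineage_specific):
    """Classify one 100-kb segment.

    tauschii_kmers: usable k-mers found in at least one Ae. tauschii accession.
    lineage_specific: usable lineage-specific k-mer counts, e.g. {"L1": 5, "L2": 40, "L3": 0}.
    Returns None (not assigned to Ae. tauschii), a lineage name, or "ambiguous".
    """
    if tauschii_kmers < MIN_TAUSCHII:
        return None
    ranked = sorted(lineage_specific.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_n), (_, second_n) = ranked[0], ranked[1]
    if best_n - second_n >= LINEAGE_MARGIN:
        return best
    return "ambiguous"


# Toy counts, illustration only.
print(assign_segment(30_000, {"L1": 50, "L2": 500, "L3": 20}))
```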

Anchoring of a de novo assembly to a reference genome

The contigs of a de novo assembly were ordered along a chromosome-level reference genome using minimap2 (ref. 75) (version 2.14 or above), and the genomic coordinates of their longest hits were assigned.
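A minimal sketch of the anchoring step is given below, assuming that minimap2 output is in PAF format and that the "longest hit" is the record with the largest alignment block length (PAF column 11). That interpretation, the function name and the toy records are assumptions made for illustration.

```python
def longest_hits(paf_lines):
    """Map each contig to (target, target_start, alignment_length) of its longest PAF hit."""
    best = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        contig, target = fields[0], fields[5]
        target_start = int(fields[7])   # 0-based start on the reference
        aln_len = int(fields[10])       # alignment block length
        if contig not in best or aln_len > best[contig][2]:
            best[contig] = (target, target_start, aln_len)
    return best


# Toy PAF records, illustration only.
toy_paf = [
    "ctg1\t1000\t0\t400\t+\tchr1D\t500000\t1000\t1400\t380\t400\t60",
    "ctg1\t1000\t0\t900\t+\tchr2D\t600000\t20000\t20900\t880\t900\t60",
    "ctg2\t500\t10\t300\t-\tchr1D\t500000\t70000\t70290\t280\t290\t30",
]
for contig, hit in sorted(longest_hits(toy_paf).items()):
    print(contig, hit)
```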

Correlation prefiltering

For each of the assembly k-mers (including those present at multiple loci), if also present in the precalculated presence/absence matrix, Pearson’s correlation between the vector of that k-mer’s presence/absence and the vector of the phenotype scores was calculated. Only those k-mers for which the absolute value of correlation obtained was higher than a threshold (0.2 by default) were retained to reduce the computational burden of association mapping using linear regression.
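The prefiltering step amounts to computing a Pearson correlation per k-mer and keeping those above the cutoff in absolute value. The sketch below uses plain Python for clarity; the function names and toy data are hypothetical, and a production implementation would use vectorized arithmetic.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def prefilter(matrix, phenotype, threshold=0.2):
    """Keep k-mers whose |correlation| with the phenotype exceeds the threshold."""
    kept = {}
    for kmer, row in matrix.items():
        r = pearson(row, phenotype)
        if abs(r) > threshold:
            kept[kmer] = r
    return kept


# Toy presence/absence rows and phenotype scores, illustration only.
toy_matrix = {"k1": [0, 0, 1, 1], "k2": [1, 1, 0, 0], "k3": [1, 0, 1, 0]}
toy_phenotype = [0, 0, 1, 1]
print(prefilter(toy_matrix, toy_phenotype))
```

Both positively and negatively correlated k-mers pass the filter, which is what allows a reference that lacks a resistance gene to still produce a signal.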

Linear regression model accounting for population structure

To each filtered k-mer from the previous step, a P value was assigned using linear regression with a number of leading PCA dimensions as covariates to control for the population structure. PCA was computed using the aforementioned set of 100,000 k-mers. The exact number of leading PCA dimensions was chosen heuristically. Too high a number might overcorrect for population structure, while too few might undercorrect. In the context of this study, three dimensions were found to represent a good trade-off.

Approximate Bonferroni threshold computation

For each phenotype in this study, the total number of k-mers used in association mapping varied between 3,000,000,000 and 5,000,000,000. In general, if the k-mer size is 51, a SNP or any other structural variant would give rise to at least 51 k-mer variants. Therefore, the total number of tested k-mer variants should be divided by 51 to get the effective number of variants to adjust the P value threshold for multiple testing. Assuming a P value threshold of 0.05, a Bonferroni-adjusted –log P value threshold between 9.1 and 9.3 was obtained for each phenotype. The more stringent cutoff of 9.3 was chosen throughout this study.
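The arithmetic behind this threshold can be checked with a short calculation, assuming a base-10 logarithm (which reproduces the reported range). The function name is hypothetical.

```python
from math import log10


def bonferroni_neg_log10(n_kmers, k=51, alpha=0.05):
    """-log10 of the Bonferroni-adjusted P value threshold for n_kmers / k effective tests."""
    effective_tests = n_kmers / k
    return -log10(alpha / effective_tests)


for n_kmers in (3_000_000_000, 5_000_000_000):
    print(n_kmers, round(bonferroni_neg_log10(n_kmers), 1))
```

For 3 billion and 5 billion k-mers this gives approximately 9.1 and 9.3, respectively.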

Generating association mapping plots

Association mapping plots were generated using Python. For a chromosome-level reference assembly, each integer on the x axis corresponds to a 10-kb genomic block starting from that position. For an anchored assembly, each integer on the x axis represents the scaffold that is anchored starting from that position. Dots on the plot represent the –log P values of the filtered k-mers within each block. Dot size is proportional to the number of k-mers with the specific –log P value. The plotting script is published at https://github.com/wheatgenetics/owwc.
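The binning that underlies such a plot for a chromosome-level reference can be sketched as follows. Positions are assumed to be 0-based base-pair coordinates, and the function name and input format are hypothetical; the published plotting script may organize the data differently.

```python
from collections import Counter


def bin_kmers(hits, block_size=10_000):
    """Count k-mers per (10-kb block index, score) pair; the count sets the dot size."""
    counts = Counter()
    for position, score in hits:
        counts[(position // block_size, score)] += 1
    return counts


# Toy (position, score) pairs for filtered k-mers, illustration only.
toy_hits = [(5, 7.2), (9_999, 7.2), (10_000, 7.2), (25_000, 9.5)]
for (block, score), n in sorted(bin_kmers(toy_hits).items()):
    print(block, score, n)
```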

Optimization of k-mer GWAS in Ae. tauschii

We used previously generated stem rust phenotype data for P. graminis f. sp. tritici isolate 04KEN156/04, race TTKSK, on 142 Ae. tauschii L2 accessions22. Mapping k-mers with an association score of >6 to the Ae. tauschii reference genome AL8/78 gave rise to significant peaks for the positive controls Sr45 and Sr46 (Extended Data Fig. 4a). The peaks contain k-mers that are negatively correlated with resistance (shown as red dots) because the AL8/78 reference accession does not contain Sr45 and Sr46. To identify the true Sr45 and Sr46 haplotypes, accession TOWWC0112 (which contains Sr45 and Sr46)22 was assembled from tenfold whole-genome shotgun data using MEGAHIT (N50 = 1.1 kb) and used in association mapping. However, noise masked the positive signals from Sr45 and Sr46 when the short scaffolds were distributed randomly along the x axis (Extended Data Fig. 4b). Anchoring the scaffolds to the AL8/78 reference genome considerably improved the plot and produced positive signals for Sr45 and Sr46 (blue peaks; Extended Data Fig. 4c). An improved assembly (N50 = 196 kb), generated with mate-pair libraries and again anchored to AL8/78, further reduced the background noise (Extended Data Fig. 4d).

Performing k-mer GWAS in Ae. tauschii with reduced coverage

The trimmed sequence data of each non-redundant accession was randomly subsampled to reduce the coverage to 7.5-fold, 5-fold, 3-fold and 1-fold. For each coverage point, the k-mer GWAS pipeline was applied, and k-mers with an association score of >6 were mapped to the Ae. tauschii reference genome AL8/78 (Extended Data Fig. 5).
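Random subsampling to a target coverage can be sketched as keeping each read pair with probability equal to the ratio of target to current coverage. The tool used for subsampling is not named in the text, so the functions, the seed and the tenfold starting coverage below are illustrative assumptions only.

```python
import random


def subsample_fraction(current_coverage, target_coverage):
    """Fraction of read pairs to keep; 1.0 if the data are already at or below target."""
    if target_coverage >= current_coverage:
        return 1.0
    return target_coverage / current_coverage


def subsample_pairs(read_pairs, fraction, seed=0):
    """Keep each read pair independently with the given probability."""
    rng = random.Random(seed)
    return [pair for pair in read_pairs if rng.random() < fraction]


# Hypothetical accession sequenced to tenfold coverage, illustration only.
for target in (7.5, 5, 3, 1):
    print(target, subsample_fraction(10, target))
```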

Computing genome-wide LD

The Ae. tauschii AL8/78 reference genome was partitioned into five segments (R1, R2a, C, R2b and R3; Extended Data Fig. 8) based on the distribution of the recombination rate, where the boundaries between these regions were imputed using the boundaries established for the Chinese Spring RefSeqv1.0 D subgenome51. PopLDdecay76 v.3.41 with the parameter ‘-MaxDist’ set to 5 Mb was used to determine the LD decay in these regions for both L1 and L2. For L2, the value of mean \(r^2\) in the telomeric regions R1 and R3 dropped below 0.1 at genomic distances of 291 kb and 476 kb, respectively, while for L1, the corresponding genomic distances were 661 kb and 561 kb, respectively.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.