Diversity analysis of 80,000 wheat accessions reveals consequences and opportunities of selection footprints

Undomesticated wild species, crop wild relatives, and landraces represent sources of variation for wheat improvement to address challenges from climate change and the growing human population. Here, we study 56,342 domesticated hexaploid, 18,946 domesticated tetraploid and 3,903 crop wild relatives in a massive-scale genotyping and diversity analysis. Using DArTseq™ technology, we identify more than 300,000 high-quality SNPs and SilicoDArT markers and align them to three reference maps: the IWGSC RefSeq v1.0 genome assembly, the durum wheat genome assembly (cv. Svevo), and the DArT genetic map. On average, 72% of the markers are uniquely placed on these maps and 50% are linked to genes. The analysis reveals landraces with unexplored diversity and genetic footprints defined by regions under selection. This provides fertile ground to develop wheat varieties of the future by exploring specific gene or chromosome regions and identifying germplasm conserving allelic diversity missing in current breeding programs.


Supplementary Figure 3. Distribution of SNP marker density in tetraploids.
Distribution of 45,376 markers on the 21 bread wheat chromosomes, divided into 10-Mb windows of the IWGSC RefSeq assembly v1.0. Green shows the number of genes, blue the number of DArTseq markers, and orange the number of markers that overlap one transcript.
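The 10-Mb windowing used for the marker-density figures can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `window_counts` and the marker positions are hypothetical, and positions are assumed to be base-pair coordinates on a named chromosome.

```python
from collections import Counter

WINDOW_BP = 10_000_000  # 10-Mb window, as stated in the caption


def window_counts(positions, window_bp=WINDOW_BP):
    """Count items per (chromosome, window index).

    positions: iterable of (chromosome, base-pair position) tuples.
    The window index is the integer division of the position by the window size.
    """
    counts = Counter()
    for chrom, pos in positions:
        counts[(chrom, pos // window_bp)] += 1
    return counts


# Hypothetical marker positions, for illustration only.
markers = [
    ("1A", 1_250_000),
    ("1A", 9_999_999),
    ("1A", 10_000_000),
    ("2B", 35_400_000),
]

density = window_counts(markers)
for (chrom, idx), n in sorted(density.items()):
    start_mb = idx * WINDOW_BP // 1_000_000
    print(f"{chrom}\t{start_mb}-{start_mb + 10} Mb\t{n}")
```

The same function could be applied separately to gene start positions and to markers overlapping a transcript to obtain the three tracks described in the caption.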

Supplementary Figure 4. Distribution of SNP marker density in CWR.
Distribution of 55,739 markers on the 21 bread wheat chromosomes, divided into 10-Mb windows of the IWGSC RefSeq assembly v1.0. Green shows the number of genes, blue the number of DArTseq markers, and orange the number of markers that overlap one transcript.

a) The markers were distributed in five alignment categories: aligned uniquely (markers mapped to only one location in the genome), mapped to multiple places (markers mapped to two or more locations in the genome), mapped discordantly (each allele of the marker aligned to a different location in the genome), mapped non-reciprocally, and no alignment. The most represented category is uniquely aligned markers, which account for 66,607 markers (77%) for the hexaploids, 30,806 markers (67.9%) and 31,181 markers (68.7%) for the tetraploids with the RefSeq v1.0 and Svevo genomes respectively, and 28,054 markers (50.3%) for the CWR. b) Distribution of markers mapping within genes, at <5 kb (upstream, downstream) and in intergenic regions for the hexaploid, tetraploid and CWR groups.

Supplementary Figure 6. Distribution of DArTseq markers on the DArT consensus map.
In blue we present the DArT genetic map (v4) including 105,122 markers distributed across the 21 chromosomes with an average of 5,006 markers per chromosome. We mapped a total of 44,501, 24,185 and 18,738 SNP markers representing 52.03%, 53.29% and 33.61% of the total numbers of SNP markers of the hexaploid (red), tetraploid (green) and wild relative (WR; pink) data sets respectively. Common markers shared among more than one of the three groups are shown in orange.

Supplementary Figure 7. MRD between landraces and Elites/Synthetics in hexaploid.
a) The Modified Rogers' Distance (MRD) between each of the 22,698 landraces and the allelic frequencies of the group of 11,792 elite breeding lines shows three distinct groups: 4,742 landraces (20.9%) genetically very close to elite breeding lines, with a genetic distance from 0.240 to less than 0.260; 17,311 landraces (76.3%) from 0.260 to less than 0.280; and 645 (2.84%) greater than 0.280, which represent the outliers identified as tetraploid. b) A similar analysis comparing landraces with the group of synthetic accessions revealed 1,621 landraces (7.14%) genetically very close to synthetics, with MRD from 0.240 to less than 0.260; 19,284 landraces (84.95%) with MRD from 0.260 to less than 0.280; and 1,793 (7.90%) with MRD greater than 0.280. In both cases, two angles of the same image are shown.

Images from different angles of the MRD between each of the 10,801 landraces and the allelic frequencies of the group of 4,048 elite breeding lines show four distinct groups: 837 landraces (7.8%) genetically very close to the vast majority of the elite breeding lines, with a distance of <0.20; 6,029 landraces (55.8%) with a distance between 0.20 and 0.30, of which 29% of the accessions are from Turkey and 10% from Iran; 3,554 (33%) with a distance between 0.30 and 0.35, of which 92% are from Ethiopia; and only 381 (3.5%) with a distance of more than 0.35 from the elite lines, 42% of them from Turkey.
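As an illustration of the distance used in these captions, the sketch below computes the Modified Rogers' Distance between one accession and the allele frequencies of a reference group. It assumes biallelic markers, for which the distance reduces to the square root of the mean squared difference in allele frequency, and it assumes an accession is coded as 0, 0.5 or 1 per marker. The function names, the coding, the handling of missing values and the example data are assumptions made for illustration; the bin boundaries are taken from the hexaploid caption.

```python
import math


def modified_rogers_distance(p, q):
    """MRD for biallelic markers: sqrt(mean((p_i - q_i)^2)).

    p, q: sequences of frequencies of the same allele at each marker.
    Markers where either value is None (missing) are skipped.
    """
    squared = [(a - b) ** 2 for a, b in zip(p, q) if a is not None and b is not None]
    if not squared:
        raise ValueError("no markers with data in both inputs")
    return math.sqrt(sum(squared) / len(squared))


def mrd_bin(distance):
    """Assign the distance to the ranges quoted in the hexaploid caption."""
    if distance < 0.260:
        return "<0.260"
    if distance < 0.280:
        return "0.260 to <0.280"
    return ">=0.280"


# Hypothetical data: one landrace genotype and elite-group allele frequencies.
landrace = [1.0, 0.0, 0.5, 1.0, None]
elite_freq = [0.8, 0.1, 0.5, 0.6, 0.3]

d = modified_rogers_distance(landrace, elite_freq)
print(f"MRD = {d:.4f} ({mrd_bin(d)})")
```

Repeating the calculation for every landrace against the same group frequencies gives the distribution of distances that the captions summarize by range.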

Supplementary Figure 24. ADMIXTURE analysis in tetraploid.
ADMIXTURE ancestry coefficients of a subset of tetraploid samples at K=5, K=6 and K=7.

Supplementary Figure 25. Representation of the distribution of 7 groups based on cluster analysis.
The size of the boxes is proportional to the number of accessions. The fixation index (Fst) values are shown on the right side, and the expected heterozygosity (He) of each group is shown inside the boxes. In the last boxes (group 7), each group is identified with a brief description and the number corresponding to its Curlywhirly group.

12% of the samples, in red, have potential misclassification in their passport data and contain <10% of D genome markers (outliers); c) tetraploid samples in purple, and 4.37% with potential misclassification in their passport information, containing 10 to 20% (green) and >20% (blue) of D genome markers (outliers).

Distribution of accessions in the 8 domesticated tetraploid species included in the analysis.
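The two statistics named in the Supplementary Figure 25 caption, expected heterozygosity (He) and the fixation index (Fst), can be illustrated with a small sketch. It assumes biallelic markers, where He at a locus is 2p(1-p), and uses a simple Nei-style estimate, Fst = (HT - HS) / HT, with equal weight for each group. The source does not state which Fst estimator was used, so this choice, the function names and the example frequencies are assumptions.

```python
def expected_heterozygosity(freqs):
    """Mean expected heterozygosity over biallelic loci: mean of 2p(1-p)."""
    return sum(2 * p * (1 - p) for p in freqs) / len(freqs)


def nei_fst(pop_freqs):
    """Nei-style Fst = (HT - HS) / HT across groups with equal weights.

    pop_freqs: list of per-group allele-frequency lists (same loci, same order).
    HS is the mean within-group He; HT is He at the mean allele frequencies.
    """
    n_pops = len(pop_freqs)
    n_loci = len(pop_freqs[0])
    hs = sum(expected_heterozygosity(f) for f in pop_freqs) / n_pops
    mean_freqs = [sum(f[i] for f in pop_freqs) / n_pops for i in range(n_loci)]
    ht = expected_heterozygosity(mean_freqs)
    return (ht - hs) / ht if ht > 0 else 0.0


# Hypothetical allele frequencies for two groups at three loci.
group_a = [0.9, 0.2, 0.5]
group_b = [0.1, 0.3, 0.5]

print(f"He(A) = {expected_heterozygosity(group_a):.3f}")
print(f"He(B) = {expected_heterozygosity(group_b):.3f}")
print(f"Fst   = {nei_fst([group_a, group_b]):.3f}")
```

Published analyses usually rely on sample-size-corrected estimators, so values from this sketch are only indicative.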