Structural variants and presence/absence polymorphisms are common in plant genomes, yet they are routinely overlooked in genome-wide association studies (GWAS). Here, we expand the type of genetic variants detected in GWAS to include major deletions, insertions and rearrangements. We first use raw sequencing data directly to derive short sequences, k-mers, that mark a broad range of polymorphisms independently of a reference genome. We then link k-mers associated with phenotypes to specific genomic regions. Using this approach, we reanalyzed 2,000 traits in Arabidopsis thaliana, tomato and maize populations. Associations identified with k-mers recapitulate those found with SNPs, but with stronger statistical support. Importantly, we discovered new associations with structural variants and with regions missing from reference genomes. Our results demonstrate the power of performing GWAS before linking sequence reads to specific genomic regions, which allows the detection of a wider range of genetic variants responsible for phenotypic variation.
Code is available at https://github.com/voichek/kmersGWAS.
1. Saxena, R. K., Edwards, D. & Varshney, R. K. Structural variations in plant genomes. Brief. Funct. Genomics 13, 296–307 (2014).
2. Iqbal, Z., Caccamo, M., Turner, I., Flicek, P. & McVean, G. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat. Genet. 44, 226–232 (2012).
3. Salzberg, S. L., Pertea, M., Fahrner, J. A. & Sobreira, N. DIAMUND: direct comparison of genomes to detect mutations. Hum. Mutat. 35, 283–288 (2014).
4. Zielezinski, A. et al. Benchmarking of alignment-free sequence comparison methods. Genome Biol. 20, 144 (2019).
5. Lees, J. A. et al. Sequence element enrichment analysis to determine the genetic basis of bacterial phenotypes. Nat. Commun. 7, 12797 (2016).
6. Sheppard, S. K. et al. Genome-wide association study identifies vitamin B5 biosynthesis as a host specificity factor in Campylobacter. Proc. Natl Acad. Sci. USA 110, 11923–11927 (2013).
7. Lees, J. A. et al. Genome-wide identification of lineage and locus specific variation associated with pneumococcal carriage duration. eLife 6, e26255 (2017).
8. Rahman, A., Hallgrímsdóttir, I., Eisen, M. & Pachter, L. Association mapping from sequencing reads using k-mers. eLife 7, e32920 (2018).
9. Gordon, S. P. et al. Extensive gene content variation in the Brachypodium distachyon pan-genome correlates with population structure. Nat. Commun. 8, 2184 (2017).
10. Sun, S. et al. Extensive intraspecific gene order and gene structural variations between Mo17 and other maize genomes. Nat. Genet. 50, 1289–1295 (2018).
11. Minio, A., Massonnet, M., Figueroa-Balderas, R., Castro, A. & Cantu, D. Diploid genome assembly of the wine grape Carménère. G3 9, 1331–1337 (2019).
12. Arora, S. et al. Resistance gene cloning from a wild crop relative by sequence capture and association genetics. Nat. Biotechnol. 37, 139–143 (2019).
13. 1001 Genomes Consortium. 1,135 genomes reveal the global pattern of polymorphism in Arabidopsis thaliana. Cell 166, 481–491 (2016).
14. Zhou, X. & Stephens, M. Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 44, 821–824 (2012).
15. Abney, M. Permutation testing in the presence of polygenic variation. Genet. Epidemiol. 39, 249–258 (2015).
16. Svishcheva, G. R., Axenovich, T. I., Belonogova, N. M., van Duijn, C. M. & Aulchenko, Y. S. Rapid variance components-based method for whole-genome association analysis. Nat. Genet. 44, 1166–1170 (2012).
17. Loh, P.-R. et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 47, 284–290 (2015).
18. Li, X. et al. Exploiting natural variation of secondary metabolism identifies a gene controlling the glycosylation diversity of dihydroxybenzoic acids in Arabidopsis thaliana. Genetics 198, 1267–1276 (2014).
19. Vetter, M., Karasov, T. L. & Bergelson, J. Differentiation between MAMP-triggered defenses in Arabidopsis thaliana. PLoS Genet. 12, e1006068 (2016).
20. Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA 6, 11 (2015).
21. Morrison, G. D. & Linder, C. R. Association mapping of germination traits in Arabidopsis thaliana under light and nutrient treatments: searching for G×E effects. G3 (Bethesda) 4, 1465–1478 (2014).
22. Zapata, L. et al. Chromosome-level assembly of Arabidopsis thaliana Ler reveals the extent of translocation and inversion polymorphisms. Proc. Natl Acad. Sci. USA 113, E4052–E4060 (2016).
23. Bryant, F. M., Hughes, D., Hassani-Pak, K. & Eastmond, P. J. Basic LEUCINE ZIPPER TRANSCRIPTION FACTOR67 transactivates DELAY OF GERMINATION1 to establish primary seed dormancy in Arabidopsis. Plant Cell 31, 1276–1288 (2019).
24. Schnable, P. S. et al. The B73 maize genome: complexity, diversity and dynamics. Science 326, 1112–1115 (2009).
25. Gore, M. A. et al. A first-generation haplotype map of maize. Science 326, 1115–1117 (2009).
26. Springer, N. M. et al. The maize W22 genome provides a foundation for functional genomics and transposon biology. Nat. Genet. 50, 1282–1288 (2018).
27. Zhao, W. et al. Panzea: a database and resource for molecular and functional diversity in the maize genome. Nucleic Acids Res. 34, D752–D757 (2006).
28. Bukowski, R. et al. Construction of the third-generation Zea mays haplotype map. Gigascience 7, 1–12 (2018).
29. Tomato Genome Consortium. The tomato genome sequence provides insights into fleshy fruit evolution. Nature 485, 635–641 (2012).
30. Lin, T. et al. Genomic analyses provide insights into the history of tomato breeding. Nat. Genet. 46, 1220–1226 (2014).
31. Tieman, D. et al. A chemical genetic roadmap to improved tomato flavor. Science 355, 391–394 (2017).
32. Zhu, G. et al. Rewiring of the fruit metabolome in tomato breeding. Cell 172, 249–261 (2018).
33. Tikunov, Y. M. et al. Non-smoky glycosyltransferase1 prevents the release of smoky aroma from tomato fruit. Plant Cell 25, 3067–3078 (2013).
34. Sohn, J.-I. & Nam, J.-W. The present and future of de novo whole-genome assembly. Brief. Bioinform. 19, 23–40 (2018).
35. Pascoe, B. et al. Enhanced biofilm formation and multi-host transmission evolve from divergent genetic backgrounds in Campylobacter jejuni. Environ. Microbiol. 17, 4779–4789 (2015).
36. Schneeberger, K. et al. Simultaneous alignment of short reads against multiple genomes. Genome Biol. 10, R98 (2009).
37. Paten, B., Novak, A. M., Eizenga, J. M. & Garrison, E. Genome graphs and the evolution of genome inference. Genome Res. 27, 665–676 (2017).
38. Seren, Ü. et al. AraPheno: a public database for Arabidopsis thaliana phenotypes. Nucleic Acids Res. 45, D1054–D1059 (2017).
39. Fordyce, R. F. et al. Digital imaging combined with genome-wide association mapping links loci to plant–pathogen interaction traits. Plant Physiol. 178, 1406–1422 (2018).
40. Chan, E. K. F., Rowe, H. C., Hansen, B. G. & Kliebenstein, D. J. The complex genetic architecture of the metabolome. PLoS Genet. 6, e1001198 (2010).
41. Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
42. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
43. Cheng, C.-Y. et al. Araport11: a complete reannotation of the Arabidopsis thaliana reference genome. Plant J. 89, 789–804 (2017).
44. Portwood, J. L. 2nd et al. MaizeGDB 2018: the maize multi-genome genetics and genomics database. Nucleic Acids Res. 47, D1146–D1154 (2019).
45. Kokot, M., Dlugosz, M. & Deorowicz, S. KMC 3: counting and manipulating k-mer statistics. Bioinformatics 33, 2759–2761 (2017).
46. Kang, H. M. et al. Efficient control of population structure in model organism association mapping. Genetics 178, 1709–1723 (2008).
47. Atwell, S. et al. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature 465, 627–631 (2010).
48. Devlin, B. & Risch, N. A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics 29, 311–322 (1995).
49. Togninalli, M. et al. The AraGWAS Catalog: a curated and standardized Arabidopsis thaliana GWAS catalog. Nucleic Acids Res. 46, D1150–D1156 (2018).
50. Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477 (2012).
51. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
52. Robinson, J. T. et al. Integrative Genomics Viewer. Nat. Biotechnol. 29, 24–26 (2011).
53. Ristova, D., Giovannetti, M., Metesch, K. & Busch, W. Natural genetic variation shapes root system responses to phytohormones in Arabidopsis. Plant J. 96, 468–481 (2018).
54. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
We thank the many colleagues who have shared A. thaliana phenotypic information with us. We thank in particular G. Zhu and S. Huang for help with tomato genotypic and phenotypic information and C. Romay, R. Bukowski and E. Buckler for help with maize genotypes and phenotypes. We thank K. Swarts, F. Rabanal and I. Soifer for fruitful discussions. This work was supported by the DFG ERA-CAPS 1001 Genomes Plus and the Max Planck Society.
The authors declare no competing interests.
Examples of how k-mers tag well-characterized structural variants (ref. 22) between the Col-0 reference genome and the fully assembled Ler genome. The two genomes were used to count 31-bp k-mers, and all k-mers unique to one genome and appearing only once in it were plotted in the indicated regions. The positions of the translocation (a), the inversion (b) and the insertions/deletions (c, d) are indicated by vertical lines and red shading. k-mers unique to Col-0 are plotted in red in the upper panels, and k-mers unique to Ler in blue in the lower panels. The five positions tagged by k-mers inside the translocation shown in a are either SNPs or 1-bp indels.
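The selection of k-mers that are unique to one genome and single-copy within it can be illustrated with a minimal sketch. It assumes forward-strand counting only and uses a random toy sequence with one SNP instead of the Col-0 and Ler assemblies; the function names are hypothetical and this is not the authors' implementation.

```python
import random
from collections import Counter

K = 31  # k-mer length used in the caption


def kmer_counts(seq, k=K):
    """Count every k-mer on the forward strand of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def unique_single_copy(genome_a, genome_b, k=K):
    """k-mers that occur exactly once in genome_a and never in genome_b."""
    counts_a = kmer_counts(genome_a, k)
    counts_b = kmer_counts(genome_b, k)
    return {km for km, n in counts_a.items() if n == 1 and km not in counts_b}


# Toy data (assumption): a random 500-bp "reference" and a copy with one SNP.
rng = random.Random(0)
ref = "".join(rng.choice("ACGT") for _ in range(500))
pos = 250
alt_base = "A" if ref[pos] != "A" else "C"
alt = ref[:pos] + alt_base + ref[pos + 1:]

ref_only = unique_single_copy(ref, alt)
alt_only = unique_single_copy(alt, ref)
print(len(ref_only), len(alt_only))
```

In this toy case every k-mer overlapping the substituted base differs between the two sequences, so an isolated SNP away from the sequence ends is tagged by k-mers in both genomes.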
Extended Data Fig. 2 Genome-wide evaluation of k-mer potential to detect SVs in well-characterized genomes.
a, For every translocation or inversion previously identified (ref. 22) between the Col-0 reference genome and the Ler genome, we evaluated whether it is tagged by 31-bp k-mers. Each translocation or inversion affects four edges between the translocated fragment and the neighbouring genomic regions (bottom panel). For every previously identified translocation or inversion, we counted the number of edges (0–4) tagged by k-mers unique to one genome. Only 1.1% of these SVs were not tagged by any k-mer unique to one genome (upper panel). b, For every edge tagged by k-mers, as described in a, we plot the number of k-mers unique to one genome that tagged it. The histogram is enriched for edges covered by the maximal number of k-mers, 31. c, Evaluation of the potential of k-mers to tag long insertions/deletions between the well-characterized genomes of Col-0 and Ler (ref. 22). In the genome with the apparent deletion, only the junction between the two fragments is tagged by unique k-mers, whereas in the genome with the apparent insertion the entire insert is tagged (bottom panel). Only 0.4% of the previously characterized long insertions/deletions are not tagged by unique k-mers.
a, Creating the k-mer presence/absence table: each accession's genomic DNA sequencing reads are cut into k-mers (ref. 45), and k-mers appearing fewer than two or three times in a sequencing library are filtered out. k-mers are further filtered to retain only those that are present in at least 5 accessions and that are found in both forward and reverse-complement form in at least 20% of the accessions in which they appear. All k-mer lists are combined into a k-mer presence/absence table. b, Genome-wide association on the full k-mer table using SNP-based software: the k-mer table is converted into PLINK binary format, which is used as input for SNP-based association-mapping software (refs. 14, 42). c, GWA optimized for k-mers: k-mer presence/absence patterns are first associated with the phenotype and its permutations using a linear mixed model (LMM) to account for population structure (refs. 16, 17). This first step computes a score that approximates the exact model. The best k-mers from this first step (for example, 100,000 k-mers) are passed to the second step, in which an exact P value is calculated (ref. 14) for both the phenotype and its permutations. A permutation-based threshold is calculated, and all k-mers passing this threshold are checked for their rank in the scoring from the first step. If not all k-mer hits are in the top 50% of the initial scoring, the entire process is rerun from the beginning, passing more k-mers from the first to the second step. This last test is designed to confirm that the approximation in the first step does not remove truly associated k-mers.
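The filters in panel a can be sketched in a few lines. This is a minimal illustration only: it assumes reads are plain strings held in memory, represents each k-mer by the lexicographically smaller of its two strand forms, and uses hypothetical function names; the published pipeline relies on a dedicated k-mer counter and is available from the repository cited in this article.

```python
from collections import Counter, defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_library(reads, k, min_count):
    """Map each retained canonical k-mer to True if both strand forms were seen."""
    counts = Counter()
    forms = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            canon = min(kmer, revcomp(kmer))
            counts[canon] += 1
            forms[canon].add(kmer == canon)  # True: canonical form, False: the other strand
    return {km: len(forms[km]) == 2 for km, n in counts.items() if n >= min_count}


def build_table(libraries, k=31, min_count=2, min_accessions=5, min_both_frac=0.2):
    """Combine per-accession k-mer lists into a filtered presence/absence table."""
    per_acc = {name: count_library(reads, k, min_count)
               for name, reads in libraries.items()}
    names = sorted(per_acc)
    present_in = defaultdict(list)
    for name in names:
        for km in per_acc[name]:
            present_in[km].append(name)
    table = {}
    for km, accs in present_in.items():
        if len(accs) < min_accessions:
            continue  # too rare across accessions
        both = sum(per_acc[a][km] for a in accs)
        if both / len(accs) < min_both_frac:
            continue  # seen on both strands in too few accessions
        table[km] = [int(km in per_acc[n]) for n in names]
    return names, table


# Toy libraries (assumption): k = 5 and six accessions sharing one 8-bp read.
s = "AAACCCGT"
libraries = {
    "acc1": [s, revcomp(s), "TTTTTGG", "TTTTTGG"],
    "acc2": [s, revcomp(s)],
    "acc3": [s, s],
    "acc4": [s, s],
    "acc5": [s, s],
    "acc6": [s],  # every k-mer seen once, so dropped by min_count
}
names, table = build_table(libraries, k=5)
for km in sorted(table):
    print(km, table[km])
```

Storing one canonical form per k-mer keeps the two strands of the same locus in a single row, and the strand filter then only needs one flag per accession.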
Histogram of k-mer allele counts: for every N = 1, ..., 1,008, the number of k-mers that appear in exactly N accessions is plotted.
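An allele-count histogram of this kind follows directly from a presence/absence table. The sketch below assumes the table is a mapping from each k-mer to a list of 0/1 values, one per accession; this layout, the function name and the toy data are assumptions made for illustration.

```python
from collections import Counter


def allele_count_histogram(table, n_accessions):
    """hist[N] = number of k-mers present in exactly N accessions."""
    per_kmer = Counter(sum(row) for row in table.values())
    return [per_kmer.get(n, 0) for n in range(n_accessions + 1)]


# Toy table (assumption): four k-mers scored in four accessions.
toy_table = {
    "AAACC": [1, 1, 0, 0],
    "ACCCG": [1, 1, 1, 1],
    "CCAAA": [0, 1, 1, 0],
    "GGATC": [1, 0, 0, 0],
}
hist = allele_count_histogram(toy_table, 4)
print(hist)
```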
a, LD between SNPs associated with flowering time. Dashed lines represent the four variant types, as in Fig. 1c. b, LD between k-mers associated with flowering time. Dashed lines represent the four variant types, as in Fig. 1c. c, Same as Fig. 1d, with only SNPs presented. d, Same as Fig. 1d, with only k-mers presented, also showing k-mers below the threshold. e, Manhattan plot of SNP and k-mer associations with flowering time at 10 °C, as in Fig. 1d, for k-mers of length 25 bp.
Extended Data Fig. 6 Comparison of SNP- and k-mer-GWAS on phenotypes from 104 studies on A. thaliana accessions.
a, Histogram of the number of identified k-mers vs. the number of identified SNPs (in log2) for A. thaliana phenotypes. Only the 458 phenotypes with both variant types identified were used. b, Histogram of the difference between the k-mer and SNP thresholds for all A. thaliana phenotypes. Thresholds were -log10 transformed.
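The thresholds compared in panel b are described in these captions as permutation-based. The sketch below shows one way such a threshold could be derived and compared on the -log10 scale. It assumes a 5% family-wise error rate and takes the threshold from the smallest P values obtained across permuted phenotypes; the alpha value, function names and toy numbers are assumptions, not values stated in this excerpt.

```python
import math


def permutation_threshold(perm_pvalues, alpha=0.05):
    """P-value threshold derived from permuted phenotypes.

    perm_pvalues holds one list of P values per permuted phenotype.
    The threshold is the floor(alpha * n)-th smallest of the n
    per-permutation minima (at least the 1st).
    """
    minima = sorted(min(pvals) for pvals in perm_pvalues)
    rank = max(math.floor(alpha * len(minima)), 1)
    return minima[rank - 1]


def neg_log10(p):
    """Transform a P value to the -log10 scale used in the figures."""
    return -math.log10(p)


# Toy numbers (assumptions): 100 permutations for each variant type.
kmer_perms = [[0.5, 10.0 ** -(i % 10 + 1)] for i in range(100)]
snp_perms = [[0.5, 10.0 ** -(i % 8 + 1)] for i in range(100)]

kmer_thr = neg_log10(permutation_threshold(kmer_perms))
snp_thr = neg_log10(permutation_threshold(snp_perms))
print(kmer_thr - snp_thr)
```

A positive difference means the k-mer threshold is stricter than the SNP threshold on the -log10 scale.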
a, Results from GWAS on measurements of lesions caused by the Botrytis cinerea UKRazz strain (ref. 39). An example of k-mers capturing a short variant better than SNPs: 19 k-mers and no SNPs were identified, with all k-mers in complete LD (top row). Sequence reads containing the k-mers mapped to chromosome 3, with a deletion of a single T nucleotide in a stretch of eight Ts at position 72,017. Manual calls (middle) and calls from the 1001 Genomes (1001G) project (bottom) are shown. In the 1001G calls, 57 of 61 accessions have missing values. b, Haplotypes around SNPs associated with xyloside concentration are not correlated with this trait. All SNPs at positions 870,000 to 874,000 on chromosome 5 were hierarchically clustered (left panel; white marks missing values). The two identified SNPs are marked by arrows, and a close-up of their state is shown (middle panel). Phenotypic values are colored according to the two SNPs: TG, blue; TT, red; CT, green (right panel). c–e, Manhattan plots for xyloside percentage (c), seedling growth inhibition by a flg22 variant (d) and germination in darkness under low-nutrient conditions (e). f, Germination phenotype plotted for accessions with the top associated k-mer present or absent. Boxes cover the 25th to 75th percentiles, medians are marked by horizontal lines and whiskers cover the full range of values.
a, Histogram of k-mer allele counts for maize accessions. b, Histogram of the difference between the threshold values of SNPs and k-mers for maize phenotypes. c, Histogram of the top SNP P value divided by the k-mer-defined threshold (in -log10) for maize phenotypes, plotted for phenotypes with only SNPs identified (upper panel) or with both SNPs and k-mers identified (lower panel). d, Histogram of the number of identified k-mers vs. the number of identified SNPs for maize phenotypes. e, Histogram of the difference between the top (-log10) P values of the two methods for maize phenotypes identified by both methods, plotted as in Fig. 2g. f, Manhattan plot of associations with ear weight (environment 07A). Associated k-mers could not be located in the reference genome and are thus not shown.
a, Histogram of k-mer allele counts for tomato accessions. b, Histogram of the difference between the threshold values of SNPs and k-mers for tomato phenotypes. c, Histogram of the top SNP P value divided by the k-mer-defined threshold (in -log10) for tomato phenotypes, plotted for phenotypes with only SNPs identified (upper panel) or with both SNPs and k-mers identified (lower panel). d, Histogram of the difference between the top (-log10) P values of the two methods for tomato phenotypes. e, Histogram of the number of identified k-mers vs. the number of identified SNPs for tomato phenotypes.
Identification of pairs of tomato accessions for which relatedness as measured with k-mers is much lower than relatedness as measured with SNPs. For every pair among the 246 accessions, a black square is plotted if the difference in relatedness between SNPs and k-mers is larger than 0.15. Accessions are ordered by the number of black squares in their row/column. Red lines mark the 21 accessions with the most black squares, that is, those for which the k-mer/SNP difference in relatedness is larger than 0.15 for the most pairs.
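The flagging and ordering described in this caption can be sketched as follows. The sketch assumes two symmetric relatedness matrices given as nested lists, takes the difference as SNP-based minus k-mer-based relatedness (in line with the first sentence of the caption) and sorts accessions in descending order of flagged pairs; the sort direction, the function names and the toy matrices are assumptions.

```python
def discordant_pairs(kin_snp, kin_kmer, cutoff=0.15):
    """flags[i][j] is True when SNP relatedness exceeds k-mer relatedness by more than cutoff."""
    n = len(kin_snp)
    return [[i != j and (kin_snp[i][j] - kin_kmer[i][j]) > cutoff
             for j in range(n)]
            for i in range(n)]


def rank_accessions(flags):
    """Order accession indices by their number of flagged pairs, highest first."""
    counts = [sum(row) for row in flags]
    order = sorted(range(len(flags)), key=lambda i: counts[i], reverse=True)
    return order, counts


# Toy relatedness matrices (assumptions) for four accessions.
kin_snp = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.5],
    [0.1, 0.2, 0.5, 1.0],
]
kin_kmer = [
    [1.0, 0.6, 0.2, 0.1],
    [0.6, 1.0, 0.05, 0.2],
    [0.2, 0.05, 1.0, 0.45],
    [0.1, 0.2, 0.45, 1.0],
]
flags = discordant_pairs(kin_snp, kin_kmer)
order, counts = rank_accessions(flags)
print(order, counts)
```

Each True entry in flags corresponds to one black square in the plot, and the head of the ordering plays the role of the accessions marked by red lines.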
About this article
Cite this article
Voichek, Y., Weigel, D. Identifying genetic variants underlying phenotypic variation in plants without complete genomes. Nat Genet 52, 534–540 (2020). https://doi.org/10.1038/s41588-020-0612-7