Identifying genetic variants underlying phenotypic variation in plants without complete genomes


Structural variants and presence/absence polymorphisms are common in plant genomes, yet they are routinely overlooked in genome-wide association studies (GWAS). Here, we expand the type of genetic variants detected in GWAS to include major deletions, insertions and rearrangements. We first use raw sequencing data directly to derive short sequences, k-mers, that mark a broad range of polymorphisms independently of a reference genome. We then link k-mers associated with phenotypes to specific genomic regions. Using this approach, we reanalyzed 2,000 traits in Arabidopsis thaliana, tomato and maize populations. Associations identified with k-mers recapitulate those found with SNPs, but with stronger statistical support. Importantly, we discovered new associations with structural variants and with regions missing from reference genomes. Our results demonstrate the power of performing GWAS before linking sequence reads to specific genomic regions, which allows the detection of a wider range of genetic variants responsible for phenotypic variation.

Fig. 1: Flowering time associations in A. thaliana.
Fig. 2: SNP- and k-mer-based GWAS on 1,582 A. thaliana phenotypes.
Fig. 3: Specific cases of k-mer superiority.
Fig. 4: SNP- and k-mer-based GWAS in maize.
Fig. 5: SNP- and k-mer-based GWAS in tomato.
Fig. 6: Kinship matrix estimates with k-mers.

Data availability

A list of all phenotypes and top SNPs or k-mers passing their corresponding thresholds can be found at

The authors declare that all other data supporting the findings of this study are available within the Supplementary Information files.

Code availability

Code is available at


References

1. Saxena, R. K., Edwards, D. & Varshney, R. K. Structural variations in plant genomes. Brief. Funct. Genomics 13, 296–307 (2014).
2. Iqbal, Z., Caccamo, M., Turner, I., Flicek, P. & McVean, G. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat. Genet. 44, 226–232 (2012).
3. Salzberg, S. L., Pertea, M., Fahrner, J. A. & Sobreira, N. DIAMUND: direct comparison of genomes to detect mutations. Hum. Mutat. 35, 283–288 (2014).
4. Zielezinski, A. et al. Benchmarking of alignment-free sequence comparison methods. Genome Biol. 20, 144 (2019).
5. Lees, J. A. et al. Sequence element enrichment analysis to determine the genetic basis of bacterial phenotypes. Nat. Commun. 7, 12797 (2016).
6. Sheppard, S. K. et al. Genome-wide association study identifies vitamin B5 biosynthesis as a host specificity factor in Campylobacter. Proc. Natl Acad. Sci. USA 110, 11923–11927 (2013).
7. Lees, J. A. et al. Genome-wide identification of lineage and locus specific variation associated with pneumococcal carriage duration. eLife 6, e26255 (2017).
8. Rahman, A., Hallgrímsdóttir, I., Eisen, M. & Pachter, L. Association mapping from sequencing reads using k-mers. eLife 7, e32920 (2018).
9. Gordon, S. P. et al. Extensive gene content variation in the Brachypodium distachyon pan-genome correlates with population structure. Nat. Commun. 8, 2184 (2017).
10. Sun, S. et al. Extensive intraspecific gene order and gene structural variations between Mo17 and other maize genomes. Nat. Genet. 50, 1289–1295 (2018).
11. Minio, A., Massonnet, M., Figueroa-Balderas, R., Castro, A. & Cantu, D. Diploid genome assembly of the wine grape Carménère. G3 9, 1331–1337 (2019).
12. Arora, S. et al. Resistance gene cloning from a wild crop relative by sequence capture and association genetics. Nat. Biotechnol. 37, 139–143 (2019).
13. 1001 Genomes Consortium. 1,135 genomes reveal the global pattern of polymorphism in Arabidopsis thaliana. Cell 166, 481–491 (2016).
14. Zhou, X. & Stephens, M. Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 44, 821–824 (2012).
15. Abney, M. Permutation testing in the presence of polygenic variation. Genet. Epidemiol. 39, 249–258 (2015).
16. Svishcheva, G. R., Axenovich, T. I., Belonogova, N. M., van Duijn, C. M. & Aulchenko, Y. S. Rapid variance components-based method for whole-genome association analysis. Nat. Genet. 44, 1166–1170 (2012).
17. Loh, P.-R. et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 47, 284–290 (2015).
18. Li, X. et al. Exploiting natural variation of secondary metabolism identifies a gene controlling the glycosylation diversity of dihydroxybenzoic acids in Arabidopsis thaliana. Genetics 198, 1267–1276 (2014).
19. Vetter, M., Karasov, T. L. & Bergelson, J. Differentiation between MAMP-triggered defenses in Arabidopsis thaliana. PLoS Genet. 12, e1006068 (2016).
20. Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA 6, 11 (2015).
21. Morrison, G. D. & Linder, C. R. Association mapping of germination traits in Arabidopsis thaliana under light and nutrient treatments: searching for G×E effects. G3 (Bethesda) 4, 1465–1478 (2014).
22. Zapata, L. et al. Chromosome-level assembly of Arabidopsis thaliana Ler reveals the extent of translocation and inversion polymorphisms. Proc. Natl Acad. Sci. USA 113, E4052–E4060 (2016).
23. Bryant, F. M., Hughes, D., Hassani-Pak, K. & Eastmond, P. J. Basic LEUCINE ZIPPER TRANSCRIPTION FACTOR67 transactivates DELAY OF GERMINATION1 to establish primary seed dormancy in Arabidopsis. Plant Cell 31, 1276–1288 (2019).
24. Schnable, P. S. et al. The B73 maize genome: complexity, diversity and dynamics. Science 326, 1112–1115 (2009).
25. Gore, M. A. et al. A first-generation haplotype map of maize. Science 326, 1115–1117 (2009).
26. Springer, N. M. et al. The maize W22 genome provides a foundation for functional genomics and transposon biology. Nat. Genet. 50, 1282–1288 (2018).
27. Zhao, W. et al. Panzea: a database and resource for molecular and functional diversity in the maize genome. Nucleic Acids Res. 34, D752–D757 (2006).
28. Bukowski, R. et al. Construction of the third-generation Zea mays haplotype map. Gigascience 7, 1–12 (2018).
29. Tomato Genome Consortium. The tomato genome sequence provides insights into fleshy fruit evolution. Nature 485, 635–641 (2012).
30. Lin, T. et al. Genomic analyses provide insights into the history of tomato breeding. Nat. Genet. 46, 1220–1226 (2014).
31. Tieman, D. et al. A chemical genetic roadmap to improved tomato flavor. Science 355, 391–394 (2017).
32. Zhu, G. et al. Rewiring of the fruit metabolome in tomato breeding. Cell 172, 249–261 (2018).
33. Tikunov, Y. M. et al. Non-smoky glycosyltransferase1 prevents the release of smoky aroma from tomato fruit. Plant Cell 25, 3067–3078 (2013).
34. Sohn, J.-I. & Nam, J.-W. The present and future of de novo whole-genome assembly. Brief. Bioinform. 19, 23–40 (2018).
35. Pascoe, B. et al. Enhanced biofilm formation and multi-host transmission evolve from divergent genetic backgrounds in Campylobacter jejuni. Environ. Microbiol. 17, 4779–4789 (2015).
36. Schneeberger, K. et al. Simultaneous alignment of short reads against multiple genomes. Genome Biol. 10, R98 (2009).
37. Paten, B., Novak, A. M., Eizenga, J. M. & Garrison, E. Genome graphs and the evolution of genome inference. Genome Res. 27, 665–676 (2017).
38. Seren, Ü. et al. AraPheno: a public database for Arabidopsis thaliana phenotypes. Nucleic Acids Res. 45, D1054–D1059 (2017).
39. Fordyce, R. F. et al. Digital imaging combined with genome-wide association mapping links loci to plant–pathogen interaction traits. Plant Physiol. 178, 1406–1422 (2018).
40. Chan, E. K. F., Rowe, H. C., Hansen, B. G. & Kliebenstein, D. J. The complex genetic architecture of the metabolome. PLoS Genet. 6, e1001198 (2010).
41. Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
42. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
43. Cheng, C.-Y. et al. Araport11: a complete reannotation of the Arabidopsis thaliana reference genome. Plant J. 89, 789–804 (2017).
44. Portwood, J. L. 2nd et al. MaizeGDB 2018: the maize multi-genome genetics and genomics database. Nucleic Acids Res. 47, D1146–D1154 (2019).
45. Kokot, M., Dlugosz, M. & Deorowicz, S. KMC 3: counting and manipulating k-mer statistics. Bioinformatics 33, 2759–2761 (2017).
46. Kang, H. M. et al. Efficient control of population structure in model organism association mapping. Genetics 178, 1709–1723 (2008).
47. Atwell, S. et al. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature 465, 627–631 (2010).
48. Devlin, B. & Risch, N. A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics 29, 311–322 (1995).
49. Togninalli, M. et al. The AraGWAS Catalog: a curated and standardized Arabidopsis thaliana GWAS catalog. Nucleic Acids Res. 46, D1150–D1156 (2018).
50. Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477 (2012).
51. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
52. Robinson, J. T. et al. Integrative Genomics Viewer. Nat. Biotechnol. 29, 24–26 (2011).
53. Ristova, D., Giovannetti, M., Metesch, K. & Busch, W. Natural genetic variation shapes root system responses to phytohormones in Arabidopsis. Plant J. 96, 468–481 (2018).
54. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).


We thank the many colleagues who have shared A. thaliana phenotypic information with us. We thank in particular G. Zhu and S. Huang for help with tomato genotypic and phenotypic information and C. Romay, R. Bukowski and E. Buckler for help with maize genotypes and phenotypes. We thank K. Swarts, F. Rabanal and I. Soifer for fruitful discussions. This work was supported by the DFG ERA-CAPS 1001 Genomes Plus and the Max Planck Society.

Author information




Y.V. and D.W. designed the study and wrote the paper. Y.V. conducted the analysis.

Corresponding author

Correspondence to Detlef Weigel.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Examples of well-characterized structural variants tagged by k-mers.

Examples of how k-mers tag well-characterized structural variants (ref. 22) between the Col-0 reference genome and the fully assembled Ler genome. 31-bp k-mers were counted in the two genomes, and all k-mers that were unique to one genome and appeared only once in it were plotted in the indicated regions. The positions of the translocation (a), inversion (b) and insertions/deletions (c,d) are indicated by vertical lines and red shading. k-mers unique to Col-0 are plotted in red in the upper panels, and k-mers unique to Ler in blue in the lower panels. The five positions tagged by k-mers inside the translocation shown in a are either SNPs or 1-bp indels.
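The selection of k-mers described in this legend can be illustrated with a minimal Python sketch. It assumes forward-strand k-mers only and uses hypothetical toy sequences with k = 5 rather than 31; the function names are illustrative and are not taken from the authors' code.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every k-mer (forward strand only) in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def unique_single_copy_kmers(genome, other, k=31):
    """Return (start, k-mer) pairs that occur exactly once in `genome`
    and never in `other`."""
    own = kmer_counts(genome, k)
    foreign = kmer_counts(other, k)
    hits = []
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        if own[kmer] == 1 and kmer not in foreign:
            hits.append((i, kmer))
    return hits


# Hypothetical toy genomes that differ by a single SNP at index 10.
ref = "ACGTTGCAAGCTTAGGCATCG"
alt = "ACGTTGCAAGATTAGGCATCG"
ref_hits = unique_single_copy_kmers(ref, alt, k=5)
alt_hits = unique_single_copy_kmers(alt, ref, k=5)
print([start for start, _ in ref_hits])  # start positions of ref-specific k-mers
print([kmer for _, kmer in alt_hits])    # alt-specific k-mers
```

In this toy case each allele of the SNP is tagged by the five k-mers that overlap it, which illustrates why a single-base difference can be marked by up to k k-mers.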

Extended Data Fig. 2 Genome-wide evaluation of k-mer potential to detect SVs in well-characterized genomes.

a, For every translocation or inversion previously identified (ref. 22) between the Col-0 reference genome and the Ler genome, we evaluated whether it is tagged by 31-bp k-mers. Each translocation or inversion affects four edges between the translocated fragment and the neighbouring genomic regions (bottom panel). For every previously identified translocation or inversion, we counted the number of edges (0–4) tagged by k-mers unique to one genome. Only 1.1% of these SVs were not tagged by any k-mer unique to one genome (upper panel). b, For every edge tagged by k-mers, as described in a, the number of k-mers unique to one genome that tag it is plotted. The histogram is enriched for edges covered by the maximal number of k-mers, 31. c, Evaluation of the potential of k-mers to tag long insertions/deletions between the well-characterized genomes of Col-0 and Ler (ref. 22). Whereas in the genome with the apparent deletion only the junction between the two fragments is tagged by unique k-mers, in the genome with the apparent insertion the entire insert is tagged (bottom panel). Only 0.4% of the previously characterized long insertions/deletions are not tagged by unique k-mers.
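The asymmetry described in panel c can be reproduced on toy data. The sketch below is an illustration only, assuming forward-strand k-mers, k = 5 and made-up sequences; `tagged_starts` is a hypothetical helper, not part of the published pipeline.

```python
def kmer_set(seq, k):
    """All distinct k-mers (forward strand only) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def tagged_starts(genome, other, k):
    """Start positions in `genome` of k-mers that are absent from `other`."""
    other_kmers = kmer_set(other, k)
    return [i for i in range(len(genome) - k + 1)
            if genome[i:i + k] not in other_kmers]


# Hypothetical flanks and insert; one genome carries the insert, the other does not.
left, insert, right = "ACGTTGCAAG", "ATGACCTA", "CTTAGGCATCG"
with_insertion = left + insert + right
with_deletion = left + right
k = 5

insertion_tags = tagged_starts(with_insertion, with_deletion, k)
deletion_tags = tagged_starts(with_deletion, with_insertion, k)
print(len(insertion_tags), len(deletion_tags))
```

With these sequences the genome carrying the insert has len(insert) + k - 1 = 12 tagging k-mers covering the whole insert, whereas the genome with the deletion has only the k - 1 = 4 k-mers that span the junction. In real genomes the counts can be lower when flanking bases or repeats make some of these k-mers non-unique.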

Extended Data Fig. 3 Pipeline for k-mer-based GWAS.

a, Creating the k-mer presence/absence table: each accession’s genomic DNA sequencing reads are cut into k-mers (ref. 45), and k-mers appearing fewer than two or three times in a sequencing library are filtered out. k-mers are further filtered to retain only those that are present in at least 5 accessions and that are found in both forward and reverse-complement form in at least 20% of the accessions in which they appear. All k-mer lists are combined into a k-mer presence/absence table. b, Genome-wide association on the full k-mer table using SNP-based software: the k-mer table is converted into PLINK binary format, which is used as input for SNP-based association mapping software (refs. 14,42). c, GWA optimized for k-mers: k-mer presence/absence patterns are first associated with the phenotype and its permutations using an LMM to account for population structure (refs. 16,17). This first step calculates an approximate score of the exact model. The best k-mers from this first step (for example, 100,000 k-mers) are passed to the second step, in which an exact p-value is calculated (ref. 14) for both the phenotype and its permutations. A permutation-based threshold is calculated, and all k-mers passing this threshold are checked for their rank in the scoring from the first step. If not all k-mer hits are in the top 50% of the initial scoring, the entire process is rerun from the beginning, passing more k-mers from the first to the second step. This last test is designed to confirm that the approximation in the first step does not remove truly associated k-mers.
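The filtering in panel a can be sketched in a few lines of Python. This is a simplified illustration, not the published implementation: the legend cites KMC 3 (ref. 45) for k-mer counting, whereas the sketch counts in pure Python, represents each k-mer by the lexicographically smaller of its two orientations (an assumption), and uses hypothetical parameter names.

```python
from collections import Counter, defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")
BASES = set("ACGT")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def library_kmers(reads, k, min_count):
    """Map canonical k-mer -> True if it was seen in both orientations.

    Only k-mers counted at least `min_count` times in the library are kept.
    """
    counts = Counter()
    orientations = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if not set(kmer) <= BASES:  # skip k-mers containing N etc.
                continue
            canonical = min(kmer, revcomp(kmer))
            counts[canonical] += 1
            orientations[canonical].add(kmer == canonical)
    return {km: len(orientations[km]) == 2
            for km, c in counts.items() if c >= min_count}


def presence_absence_table(libraries, k=31, min_count=2,
                           min_accessions=5, min_both_strand_frac=0.2):
    """Build {k-mer: [0/1 per accession]} from {accession: reads}."""
    carriers = defaultdict(list)
    for name, reads in libraries.items():
        for km, both in library_kmers(reads, k, min_count).items():
            carriers[km].append((name, both))
    names = sorted(libraries)
    table = {}
    for km, hits in carriers.items():
        if len(hits) < min_accessions:
            continue  # too rare across accessions
        if sum(both for _, both in hits) / len(hits) < min_both_strand_frac:
            continue  # rarely seen on both strands
        present = {name for name, _ in hits}
        table[km] = [int(name in present) for name in names]
    return names, table


# Hypothetical toy libraries with relaxed thresholds.
toy = {"a": ["ACG", "CGT"], "b": ["ACG", "ACG"], "c": ["ACG"], "d": ["TTT", "TTT"]}
names, table = presence_absence_table(toy, k=3, min_count=2,
                                      min_accessions=2, min_both_strand_frac=0.5)
print(names, table)
```

In the toy run, accession c is recorded as lacking the k-mer because it was seen only once in that library, and the k-mer found only in accession d is dropped by the minimum-accession filter.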

Extended Data Fig. 4 Allele counts for A. thaliana 1001G k-mers.

Histogram of k-mer allele counts: for every N = 1..1008, the number of k-mers that appear in exactly N accessions is plotted.
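As a small illustration of this definition, the allele-count histogram can be derived from a presence/absence table as follows. The table layout (a dict mapping each k-mer to a 0/1 list over accessions) and the function name are assumptions made for this sketch.

```python
from collections import Counter


def allele_count_histogram(table, n_accessions):
    """Return a list whose entry N-1 is the number of k-mers present in
    exactly N accessions, for N = 1..n_accessions."""
    counts = Counter(sum(row) for row in table.values())
    return [counts.get(n, 0) for n in range(1, n_accessions + 1)]


# Hypothetical presence/absence table over three accessions.
toy_table = {
    "AAA": [1, 0, 0],
    "CCC": [1, 1, 0],
    "GGG": [0, 1, 0],
    "TTT": [1, 1, 1],
}
histogram = allele_count_histogram(toy_table, n_accessions=3)
print(histogram)
```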

Extended Data Fig. 5 Flowering time-genotype associations in A. thaliana identified with k-mers.

a, LD between SNPs associated with flowering time. Dashed lines represent the four variant types, as in Fig. 1c. b, LD between k-mers associated with flowering time. Dashed lines represent the four variant types, as in Fig. 1c. c, Same as Fig. 1d, with only SNPs shown. d, Same as Fig. 1d, with only k-mers shown, including k-mers below the threshold. e, Manhattan plot of SNP and k-mer associations with flowering time at 10 °C, as in Fig. 1d, for k-mers of length 25 bp.

Extended Data Fig. 6 Comparison of SNP- and k-mer-GWAS on phenotypes from 104 studies on A. thaliana accessions.

a, Histogram of the number of identified k-mers vs. identified SNPs (in log2) for A. thaliana phenotypes. Only the 458 phenotypes with both variant types identified were used. b, Histogram of the differences between k-mer and SNP thresholds for all A. thaliana phenotypes. Thresholds were -log10 transformed.
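To make the quantity in panel b concrete, the sketch below derives a permutation-based threshold and compares two thresholds on the -log10 scale. The rule used here (the alpha quantile of the best p-value obtained in each phenotype permutation) is an assumption made for illustration, since the legends do not spell out the rule, and so is the sign convention of k-mers minus SNPs; all function names are hypothetical.

```python
import math


def permutation_threshold(best_pvalues, alpha=0.05):
    """Threshold such that a fraction `alpha` of permutations has a best
    p-value at or below it (assumed family-wise rule)."""
    ranked = sorted(best_pvalues)
    n_below = int(alpha * len(ranked) + 1e-9)  # guard against float rounding
    return ranked[max(n_below - 1, 0)]


def neg_log10(p):
    """-log10 transform of a p-value."""
    return -math.log10(p)


def threshold_difference(kmer_best_pvalues, snp_best_pvalues, alpha=0.05):
    """Difference of -log10 thresholds, k-mers minus SNPs."""
    return (neg_log10(permutation_threshold(kmer_best_pvalues, alpha))
            - neg_log10(permutation_threshold(snp_best_pvalues, alpha)))


# Hypothetical best p-values from 20 permutations for each variant type.
kmer_perm = [10.0 ** -e for e in range(1, 21)]
snp_perm = [10.0 ** -e for e in range(1, 11)] * 2
print(threshold_difference(kmer_perm, snp_perm, alpha=0.1))
```

A positive difference means that the k-mer threshold is the stricter of the two on the -log10 scale.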

Extended Data Fig. 7 Specific case studies in which k-mers are superior to SNPs.

a, Results from GWAS on measurements of lesions caused by the Botrytis cinerea UKRazz strain (ref. 39). An example of k-mers better capturing a short variant: 19 k-mers and no SNPs were identified, with all k-mers in complete LD (top row). Sequence reads containing the k-mers mapped to chromosome 3, with a deletion of a single T nucleotide in a stretch of eight Ts at position 72,017. Manual (middle) and 1001G project (bottom) calls are shown. In the 1001G calls, 57 of 61 accessions have missing values. b, Haplotypes around SNPs associated with xyloside concentration are not correlated with this trait. All SNPs at positions 870,000 to 874,000 on chromosome 5 were hierarchically clustered (left panel; white marks missing values). The two identified SNPs are marked by arrows, and a close-up of their state is shown (middle panel). Phenotypic values are colored according to the two SNPs: TG blue, TT red and CT green (right panel). c-e, Manhattan plots for xyloside percentage (c), seedling growth inhibition by a flg22 variant (d) and germination in darkness in low-nutrient conditions (e). f, Germination phenotype plotted for accessions with the top associated k-mer present or absent. Boxes cover the 25th to 75th percentiles, medians are marked by horizontal lines, and whiskers cover the full range of values.

Extended Data Fig. 8 Comparison of SNP- and k-mer-based GWAS in maize.

a, Histogram of k-mer allele counts for maize accessions. b, Histogram of the difference between threshold values of SNPs and k-mers for maize phenotypes. c, Histogram of the top SNP P-value divided by the k-mer-defined threshold (in -log10) for maize phenotypes, plotted for phenotypes with only SNPs identified (upper panel) or with both SNPs and k-mers identified (lower panel). d, Histogram of the number of identified k-mers vs. identified SNPs for maize phenotypes. e, Histogram of the difference between the top (-log10) p-values of the two methods for maize phenotypes with associations identified by both methods, plotted as in Fig. 2g. f, Manhattan plot of associations with ear weight (environment 07A). Associated k-mers could not be located in the reference genome and are thus not shown.

Extended Data Fig. 9 Comparison of SNP- and k-mer-based GWAS in tomato.

a, Histogram of k-mer allele counts for tomato accessions. b, Histogram of the difference between threshold values of SNPs and k-mers for tomato phenotypes. c, Histogram of the top SNP P-value divided by the k-mer-defined threshold (in -log10) for tomato phenotypes, plotted for phenotypes with only SNPs identified (upper panel) or with both SNPs and k-mers identified (lower panel). d, Histogram of the difference between the top (-log10) p-values of the two methods for tomato phenotypes. e, Histogram of the number of identified k-mers vs. identified SNPs for tomato phenotypes.

Extended Data Fig. 10 Kinship matrix calculation based on k-mers for tomato accessions.

Identification of pairs of tomato accessions for which relatedness as measured with k-mers is much lower than relatedness as measured with SNPs. For every pair among the 246 accessions, a black square is plotted if the difference in relatedness between SNPs and k-mers is larger than 0.15. Accessions are ordered by the number of black squares in their row/column. Red lines mark the 21 accessions with the most black squares, that is, those for which the k-mer/SNP difference in relatedness is larger than 0.15 for the most pairs.
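The comparison in this legend amounts to thresholding the element-wise difference between two kinship matrices. The following Python sketch assumes that both matrices are already available as nested lists in the same accession order, takes the difference as SNP relatedness minus k-mer relatedness, and sorts accessions from most to fewest flagged pairs; the function name and the sort direction are assumptions.

```python
def discordant_pairs(kinship_snp, kinship_kmer, cutoff=0.15):
    """Flag pairs whose SNP-based relatedness exceeds the k-mer-based
    relatedness by more than `cutoff`.

    Returns (flags, counts, order): a boolean matrix, the number of
    flagged pairs per accession, and accession indices sorted by that
    number in descending order.
    """
    n = len(kinship_snp)
    flags = [[i != j and kinship_snp[i][j] - kinship_kmer[i][j] > cutoff
              for j in range(n)] for i in range(n)]
    counts = [sum(row) for row in flags]
    order = sorted(range(n), key=lambda i: counts[i], reverse=True)
    return flags, counts, order


# Hypothetical relatedness values for three accessions.
snp = [[1.0, 0.9, 0.5],
       [0.9, 1.0, 0.6],
       [0.5, 0.6, 1.0]]
kmer = [[1.0, 0.6, 0.45],
        [0.6, 1.0, 0.3],
        [0.45, 0.3, 1.0]]
flags, counts, order = discordant_pairs(snp, kmer)
print(counts, order)
```

Here the second accession is involved in both discordant pairs and is therefore listed first.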

Supplementary information

Supplementary Information

Supplementary Figs. 1 and 2, Supplementary Note and Supplementary Tables 2 and 3

Reporting Summary

Supplementary Table 1

Summarized genome-wide association analysis results for phenotypes used in this study.

About this article

Cite this article

Voichek, Y., Weigel, D. Identifying genetic variants underlying phenotypic variation in plants without complete genomes. Nat Genet 52, 534–540 (2020).
