The genetic basis of traits can be understood by comparing the DNA of varieties of the same species. The genomes of many varieties of a model plant organism have now been sequenced, and the results are revelatory. See Article p.419
Charles Darwin wrote1 of the “endless forms most beautiful” of species that have arisen from natural selection. But his words also apply to the genetic variation within species such as the highly adaptable plant Arabidopsis thaliana (Fig. 1). The first analyses of the sequences of multiple genomes of A. thaliana2,3,4, including one on page 419 of this issue by Gan et al.4, have now been published. These studies provide a foundation for identifying the factors that shape genome change, and for mapping genome-sequence variation among a wide range of A. thaliana varieties that represents the plant's diversity. They should also facilitate the association of phenotypes (the observable characteristics of an organism) with genotypes (inherited genetic information) — most importantly in crop plants.
The genome sequences of most organisms are represented by examples taken from a single individual of each species, the choice of which has often been haphazardly forced on researchers by technological and financial limitations. New sequencing technologies, such as the Illumina methods5 used in the latest studies2,3,4, have removed the need for such arbitrary choices and provided exciting opportunities to explore sequence diversity within species. There is a huge range of organisms that can be studied, but it can be argued that two classes of genome deserve high priority for diversity studies: human genomes, because of the enduring interest in our origins and diseases; and plant genomes, because of humans' complete dependence on them for food.
Arabidopsis has an evanescent and opportunistic life cycle, much to the chagrin of gardeners. The plant colonized extensive regions of Eurasia and North Africa after the most recent ice age, spreading from refuges in the Iberian Peninsula and central Asia6. Subsequent European colonization then allowed it to spread worldwide. Arabidopsis is a widely used experimental plant because it is compact and easily grown, has a rapid life cycle and self-pollinates. Its compact genome was the first plant genome to be sequenced7, providing an accurate foundation for the current studies2,3,4.
Preliminary analyses8 of Arabidopsis genomes showed that plants from different geographical locations exhibit many commonly held genetic variations, consistent with the comparatively recent spread of the plant from a few locations and with frequent mixing of populations. The preliminary work also revealed that genome-wide association studies (GWAS) hold exceptional promise for identifying sequence variations that affect a wide range of plant phenotypes, many of which could be useful in agricultural crops.
As they report in Nature Genetics, Cao et al.2 have now sequenced the genomes of 80 strains of Arabidopsis that represent the genetic diversity of the plant across its extensive geographical range. By sequencing short pieces (or reads) of DNA and mapping them to a reference Arabidopsis genome, the authors identified single nucleotide polymorphisms (SNPs) — sequence variations between strains that involve single nucleotides. DNA sections that couldn't be mapped to the reference genome in this way were assembled de novo and then checked to see whether they could be anchored to the reference through the alignment of base pairs. This allowed more-extensive sequence variation to be identified.
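The read-mapping step described above can be illustrated with a toy sketch. It assumes the reads have already been aligned without gaps to known start positions on the reference; the function name `call_snps` and the thresholds `min_depth` and `min_fraction` are hypothetical choices for illustration, not parameters reported by Cao et al.

```python
from collections import Counter, defaultdict

def call_snps(reference, aligned_reads, min_depth=3, min_fraction=0.8):
    """Return {position: (ref_base, alt_base)} for well-supported mismatches.

    aligned_reads is an iterable of (start, sequence) pairs, where start is the
    0-based position on the reference at which the ungapped read begins.
    """
    # Build a pileup: for every reference position, count the bases seen in reads.
    pileup = defaultdict(Counter)
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1

    snps = {}
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        base, n = counts.most_common(1)[0]
        # Call a SNP only if enough reads cover the site and most of them
        # agree on a base that differs from the reference (assumed thresholds).
        if depth >= min_depth and base != reference[pos] and n / depth >= min_fraction:
            snps[pos] = (reference[pos], base)
    return snps

# Invented example data: four reads that all carry T at position 4, where the reference has A.
reference = "ACGTACGTAC"
reads = [(0, "ACGTTC"), (2, "GTTCGT"), (3, "TTCGTA"), (4, "TCGTAC")]
snps = call_snps(reference, reads)
print(snps)
```

Production pipelines also weigh base and mapping quality and handle insertions and deletions, all of which this sketch ignores.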
In a related study published in the Proceedings of the National Academy of Sciences, Schneeberger et al.3 sequenced four varieties of Arabidopsis using a 'sub-assembly' approach. This involved clustering short reads into groups that correspond to certain regions of a reference genome7, assembling the reads into continuous sections (blocks), and then assembling the blocks into larger and larger sections until the whole genome was constructed. Gan et al.4 also used a sub-assembly approach to sequence 18 Arabidopsis varieties.
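One step of this idea, assembling the reads assigned to a single region into continuous blocks, can be sketched as follows. This is a simplified greedy overlap merge written for illustration only; it is not the assembler used by Schneeberger et al. or Gan et al., and the names `overlap`, `assemble_blocks` and `min_len` are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def assemble_blocks(reads_with_pos, min_len=3):
    """Merge the reads of one region into contiguous blocks.

    reads_with_pos is a list of (approximate_start, sequence) pairs. Reads are
    visited in order of their approximate position on the reference; each read
    either extends the current block or, if it does not overlap, starts a new one.
    """
    blocks = []
    for _, read in sorted(reads_with_pos):
        if blocks:
            k = overlap(blocks[-1], read, min_len)
            if k:
                blocks[-1] += read[k:]  # append only the non-overlapping tail
                continue
        blocks.append(read)
    return blocks

# Invented example: three overlapping reads and one isolated read.
region_reads = [(6, "GGATTC"), (0, "ACGTAC"), (3, "TACGGA"), (20, "TTTTTT")]
blocks = assemble_blocks(region_reads)
print(blocks)
```

In a full sub-assembly workflow the resulting blocks would themselves be merged into progressively larger sections; that stage is omitted here.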
The advantage of sub-assembly approaches is that, as far as possible, different genomes are assembled independently. This changes the focus of comparative genomics: instead of comparing one or many genomes with a single reference, many genomes are compared with each other. Sub-assembly approaches also capture a spectrum of sequence variation broader than changes involving just a few nucleotides, thereby allowing a fundamental re-evaluation of sequence variation within a species. Indeed, this is one goal of the ambitious 1001 Genomes Project9, of which the three latest papers2,3,4 are part. The project has already sequenced 471 Arabidopsis varieties, and has a further 706 in its pipeline.
The new studies2,3,4 identified extensive genome-sequence changes between varieties. For example, SNPs and copy-number variants (differences in the number of duplications of one or more sections in a genome) are frequent. Another finding is that a significant proportion of Arabidopsis variation involves chemical changes to methylated cytosine bases. Cytosines are often methylated during epigenetic modifications to DNA, which alter gene expression without affecting DNA sequence, and the latest data2,3,4 suggest that such epigenetic changes have the potential to cause mutations. Furthermore, large numbers of genes in different Arabidopsis varieties contain premature stop codons (short sequences that signal the termination of translation), which would probably adversely affect the functions of proteins encoded by those genes. Interestingly, many of these genes also contain compensatory changes that would be expected to restore protein function. The most dramatic sequence variations were detected mainly in the studies that used sub-assembly approaches3,4, indicating that these approaches should be adopted in the future to maximize detection of a wide range of genomic variation in Arabidopsis and in crop plants.
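The notion of a premature stop codon can be made concrete with a short sketch. It assumes a coding sequence that starts in frame and uses the three stop codons of the standard genetic code; the function name `first_premature_stop` and the example sequences are invented for illustration and do not come from the studies.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code

def first_premature_stop(cds):
    """Return the 0-based index of the first in-frame stop codon that occurs
    before the final codon of the coding sequence, or None if there is none."""
    cds = cds.upper()
    n_codons = len(cds) // 3
    # The final codon is expected to be a stop, so it is excluded from the scan.
    for i in range(n_codons - 1):
        if cds[3 * i:3 * i + 3] in STOP_CODONS:
            return i
    return None

# Invented example: a single C-to-T change turns the codon CAA into the stop codon TAA.
reference_cds = "ATGGCTCAAGGTTAA"  # ATG GCT CAA GGT TAA
variant_cds = "ATGGCTTAAGGTTAA"    # ATG GCT TAA GGT TAA

print(first_premature_stop(reference_cds))
print(first_premature_stop(variant_cds))
```

A check of this kind says nothing about the compensatory changes mentioned above, which require comparing the whole gene model between varieties.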
The greatest variability within Arabidopsis was found in genes involved in defence and responses to the environment2,3,4; the same is true of sequence variation between species and between taxonomic families. Gan et al.4 assessed differences in gene expression between 18 naturally occurring varieties, and found that extensive genetic variation was concentrated within a short section (100 base pairs) that controls the expression of an adjacent gene. This accounted for much of the variation in expression between strains. The differences in expression affected other genes, mainly those involved in responses to pathogens and those encoding a family of transcription factors that control flowering.
The broad spectrum of genomic change identified in the three landmark studies2,3,4 can be used to associate phenotypes (including 'quantitative' phenotypes that underlie complex traits) with sequence variation. With this in mind, Gan et al.4 sequenced the 18 diverse varieties of Arabidopsis that have been intercrossed to create a structured population10 used for mapping complex traits to DNA sequence variation. By contrast, Cao and co-workers' genomic data2, along with other data from the Arabidopsis 1001 Genomes Project, can be used to relate phenotypic variation to the underlying genotypic variation observed in GWAS.
Comparative genome studies have already provided many useful results. A pioneering study8 that examined 107 phenotypes of Arabidopsis in 96 diverse, genotyped lines found associations between several adaptive phenotypes and sequence variations. Similar approaches have been used in a study of 517 local varieties of rice to identify genetic variation associated with 14 useful agronomic traits11. Another genome-wide study has shown that adaptation of certain strains of Arabidopsis to high salt conditions is associated with sequence variation in a gene that encodes a sodium-transporter protein12. GWAS in general are also showing exceptional promise for identifying causal sequence variation in complex emergent traits in plants, such as crop yield and quality13.
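The core idea of an association scan can be sketched in a few lines: for each variant, compare the phenotype of lines carrying one allele with that of lines carrying the other. The example below uses Welch's t-statistic as a stand-in score; this is an assumption made for illustration, not the statistical model used in the cited studies, and real GWAS additionally correct for population structure and multiple testing. The names `welch_t` and `scan_snps` and all data values are invented.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t-statistic for two samples; None if the standard error is zero."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    if se == 0:
        return None
    return (mean(group_a) - mean(group_b)) / se

def scan_snps(genotypes, phenotype):
    """Score each SNP by how strongly its two alleles separate the phenotype.

    genotypes maps a SNP name to a list of 0/1 alleles, one per inbred line;
    phenotype is a list of measurements in the same line order.
    """
    scores = {}
    for snp, alleles in genotypes.items():
        ref = [p for a, p in zip(alleles, phenotype) if a == 0]
        alt = [p for a, p in zip(alleles, phenotype) if a == 1]
        if len(ref) < 2 or len(alt) < 2:
            continue  # too few lines carry one allele to estimate a variance
        t = welch_t(alt, ref)
        if t is not None:
            scores[snp] = t
    return scores

# Invented data: eight lines, one phenotype, three SNPs.
phenotype = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
genotypes = {
    "snp_a": [0, 0, 0, 0, 1, 1, 1, 1],  # allele tracks the phenotype
    "snp_b": [0, 1, 0, 1, 0, 1, 0, 1],  # allele unrelated to the phenotype
    "snp_c": [0, 0, 0, 0, 0, 0, 0, 1],  # alternative allele too rare to test
}
scores = scan_snps(genotypes, phenotype)
best = max(scores, key=lambda s: abs(scores[s]))
print(best)
```

Here `snp_a` receives by far the largest score, whereas `snp_c` is skipped because only one line carries its alternative allele.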
The application of high-throughput DNA sequencing and genome-capture technology will inevitably lead to the large-scale re-sequencing of the genomes of crop species, in much the same way that the tiny, relatively simple Arabidopsis genome has been re-sequenced in the three new studies2,3,4. These technologies will revolutionize plant breeding by enabling a wide variety of phenotypic variations to be mined for their associated sequence variations, which can then be used to select breeding lines13. This will substantially reduce the time taken to create varieties of crop plants that are adapted to cope with changes in growing conditions or new pathogens, and/or to improve crop yield.
Darwin, C. On the Origin of Species by Means of Natural Selection (Murray, 1859).
Cao, J. et al. Nature Genet. doi:10.1038/ng.911 (2011).
Schneeberger, K. et al. Proc. Natl Acad. Sci. USA 108, 10249–10254 (2011).
Gan, X. et al. Nature 477, 419–423 (2011).
Bentley, D. R. et al. Nature 456, 53–59 (2008).
Sharbel, T. F., Haubold, B. & Mitchell-Olds, T. Mol. Ecol. 9, 2109–2118 (2000).
The Arabidopsis Genome Initiative. Nature 408, 796–815 (2000).
Atwell, S. et al. Nature 465, 627–631 (2010).
Kover, P. X. et al. PLoS Genet. 5, e1000551 (2009).
Huang, X. et al. Nature Genet. 42, 961–967 (2010).
Baxter, I. et al. PLoS Genet. 6, e1001193 (2010).
Hamblin, M. T., Buckler, E. S. & Jannink, J.-L. Trends Genet. 27, 98–106 (2011).