Multiple reference genomes and transcriptomes for Arabidopsis thaliana

Gan, Xiangchao; Stegle, Oliver; Behr, Jonas; Steffen, Joshua G.; Drewe, Philipp; Hildebrand, Katie L.; Lyngsoe, Rune; Schultheiss, Sebastian J.; Osborne, Edward J.; Sreedharan, Vipin T.; Kahles, André; Bohnert, Regina; Jean, Géraldine; Derwent, Paul; Kersey, Paul; Belfield, Eric J.; Harberd, Nicholas P.; Kemen, Eric; Toomajian, Christopher; Kover, Paula X.; Clark, Richard M.; Rätsch, Gunnar; Mott, Richard

doi:10.1038/nature10414

Article
Open access
Published: 28 August 2011

Nature volume 477, pages 419–423 (2011)

Abstract

Genetic differences between Arabidopsis thaliana accessions underlie the plant’s extensive phenotypic variation, and until now these have been interpreted largely in the context of the annotated reference accession Col-0. Here we report the sequencing, assembly and annotation of the genomes of 18 natural A. thaliana accessions, and their transcriptomes. When assessed on the basis of the reference annotation, one-third of protein-coding genes are predicted to be disrupted in at least one accession. However, re-annotation of each genome revealed that alternative gene models often restore coding potential. Gene expression in seedlings differed for nearly half of expressed genes and was frequently associated with cis variants within 5 kilobases, as were intron retention alternative splicing events. Sequence and expression variation is most pronounced in genes that respond to the biotic environment. Our data further promote evolutionary and functional studies in A. thaliana, especially the MAGIC genetic reference population descended from these accessions.

Main

Interpreting the consequences of genetic variation has typically relied on a reference sequence, relative to which genes and variants are annotated. However, this may cause bias, because genes may be inactive in the reference but expressed in the population¹, suggesting that sequencing and re-annotating individual genomes is necessary. Advances in sequencing² make this tractable for Arabidopsis thaliana^3,4,5, whose natural accessions (strains) are typically homozygous. Relative to the 119-megabase (Mb) high-quality reference sequence from Col-0 (ref. 6), diverse accessions harbour a single nucleotide polymorphism (SNP) about every 200 base pairs (bp) (ref. 3), and indel variation is pervasive^3,7,8. Characterizing this variation is crucial for dissecting the genetic architecture of traits by quantitative trait locus mapping in recombinant inbred lines (see, for example, ref. 9) or genome-wide association in natural accessions¹⁰.

Here we have sequenced and accurately assembled the single-copy genomes of 18 accessions that, with Col-0, are the parents of more than 700 Multiparent Advanced Generation Inter-Cross (MAGIC) lines⁹, similar to the maize Nested Association Mapping (NAM)¹¹ population and the murine Collaborative Cross¹². These accessions comprise a geographically and phenotypically diverse sample across the species⁹. Using the genomes, seedling transcriptomes and computational gene predictions we have characterized the ancestry, polymorphism, gene content and expression profile of the accessions. We show that the functional consequences of polymorphisms are often difficult to interpret in the absence of gene re-annotation and full sequence data. The assembled genomes also contribute to the A. thaliana 1001 Genomes Project^3,4,5,13.

Genome sequencing, assembly and variants

We assembled the 18 genomes so that single-copy loci would be contiguous, with less than one assembly error per gene, and therefore suitable for annotation. Accessions were sequenced with Illumina paired-end reads² (Supplementary Table 1), generally with two libraries with 200-bp and 400-bp inserts and reads of 36 and 51 bp, respectively, to between 27-fold and 60-fold coverage. Each genome was assembled by using five cycles of iterative read mapping¹⁴ combined with de novo assembly¹⁵ (Supplementary Information sections 2 and 3, and Supplementary Tables 1 and 2). We aligned reads to the final assemblies to detect polymorphic regions⁸ lacking read coverage (2.1–3.7 Mb per accession; Supplementary Table 3 and Supplementary Fig. 2). At unique loci, polymorphic regions probably reflect complex polymorphisms^3,8. The average N50 length (the contig size such that 50% of the entire assembly is contained in contigs equal to or longer than this value) of contiguous read coverage between polymorphic regions was 80.8 kb (Supplementary Table 4).
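
The N50 definition given above can be made concrete with a short calculation. The following is a minimal sketch, not the authors' code; the function name `n50` and the example lengths are hypothetical and chosen only for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or segment) lengths.

    N50 is the length L such that segments of length >= L together
    contain at least half of the total assembled sequence.
    """
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical segment lengths in kb (not data from the study).
example = [80, 70, 50, 40, 30, 20]
print(n50(example))  # 70: the two longest segments hold 150 of 290 kb
```

Because the statistic depends only on the sorted lengths, the same function applies equally to contigs and to the stretches of contiguous read coverage between polymorphic regions described here.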

To report complex alleles consistently, we defined all variants against the multiple alignment consensus of Col-0 and the assembled genomes. For each accession there were 497,668–789,187 single-base differences from Col-0, and about 45,000 ambiguous nucleotides (Supplementary Table 5). The latter may reflect heterozygosity (particularly in Po-0; Supplementary Figs 5–7) or copy-number variants, and they were largely in transposable elements and repeats covering 21.9% of the genome (Supplementary Information section 5.1, and Supplementary Figs 8 and 9). Of 3.07 million SNPs, 45.2% were private to single accessions.
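
To illustrate what "private to single accessions" means operationally, the sketch below counts SNPs whose non-reference allele is carried by exactly one accession. It assumes a simple 0/1 genotype matrix (1 = differs from Col-0) with no missing or ambiguous calls; the function name and the toy matrix are hypothetical.

```python
def private_fraction(genotypes):
    """Fraction of SNPs whose non-reference allele occurs in exactly one accession.

    genotypes: list of sites, each a list with one 0/1 entry per accession
    (0 = reference allele, 1 = non-reference allele). Missing or ambiguous
    calls are assumed to have been filtered out beforehand.
    """
    if not genotypes:
        raise ValueError("need at least one SNP")
    private = sum(1 for site in genotypes if sum(site) == 1)
    return private / len(genotypes)


# Toy matrix: 4 SNPs scored in 4 accessions (hypothetical data).
sites = [
    [1, 0, 0, 0],  # private
    [1, 1, 0, 0],  # shared by two accessions
    [0, 0, 0, 1],  # private
    [1, 1, 1, 0],  # shared by three accessions
]
print(private_fraction(sites))  # 0.5
```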

We identified 1.20 million indels, and 104,090 imbalanced substitutions, in which a sequence in Col-0 was replaced by a different sequence (Supplementary Tables 3 and 7). Although 57.5% of indels or imbalanced substitutions were shorter than 6 bp, 1.9% were longer than 100 bp, and overall 14.9 Mb of Col-0 sequence was absent in one or more accessions (Fig. 1a and Supplementary Fig. 8). The assemblies were about 1.6% and about 4.3% shorter than the reference (including and excluding polymorphic regions, respectively), probably reflecting limitations in detecting long insertions. Although sequence differences were enriched in transposable-element and intergenic regions, about 17% of bases deleted in one or more accessions were annotated as genic in Col-0 (Fig. 1a and Supplementary Fig. 8). The density of sequence differences is greater than between classical inbred strains of mice¹⁶, but less than between lines of maize¹⁷.

Figure 1: **Assembly and variation of 18 genomes of *A. thaliana*.**

Both iterative and de novo assembly improved accuracy, with the latter being most effective at divergent loci (Fig. 1b, Supplementary Table 2 and Supplementary Fig. 10). As assessed with about 1.2 Mb of genomic dideoxy data^3,18,19 (Supplementary Information section 4), the substitution error rate was about 1 per 10 kb in single-copy regions, and about tenfold higher in transposable-element-rich regions. Further, RNA-seq reads covered about 100,000 SNPs per accession with 99.72% concordance (Supplementary Table 5), and junction sequences for 66 of 68 (97%) long indels and imbalanced substitutions were confirmed by PCR and dideoxy sequencing (Supplementary Table 8). The substitution error rate for our assemblies was comparable to that reported for four other A. thaliana genome assemblies⁴.

Genome-wide patterns of ancestry

The ancestral relationships of the accessions vary genome-wide. We computed phylogenies²⁰ across 1.25 million biallelic, non-private SNPs (Supplementary Information section 6). The ancestry of each pair of accessions within a phylogeny was quantified by using the genome-wide distribution of the minimum clade size of the subphylogeny containing the pair (Fig. 1c). Despite their wide geographical origins, with the exception of Po-0 and Oy-0, all pairs have distributions similar to that of an unstructured sample. The probability of recent co-ancestry is slightly higher than expected for a few pairs of accessions, with extended haplotype sharing at a minority of loci (Supplementary Figs 11–15), perhaps reflecting selective sweeps⁷. Both linkage disequilibrium and correlation between neighbouring phylogenies decrease by 50% within 5 kb (Fig. 1d and Supplementary Fig. 16). Variation among the 18 accessions is similar to a diverse global A. thaliana sample^7,8 in nucleotide diversity (Supplementary Figs 11–15), correlation with genomic features (Supplementary Tables 9–12) and structural variants (Supplementary Fig. 17).
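
As a point of reference for the linkage disequilibrium statistic mentioned above, the sketch below computes the standard r² measure between two biallelic SNPs. It assumes homozygous accessions, so that each accession contributes one haplotype coded 0 or 1; the function name and inputs are hypothetical, and the estimator actually used is described in the Supplementary Information.

```python
def r_squared(snp_a, snp_b):
    """Standard r^2 linkage disequilibrium between two biallelic SNPs.

    snp_a, snp_b: equal-length lists of 0/1 alleles, one entry per
    (homozygous) accession.
    """
    n = len(snp_a)
    if n == 0 or n != len(snp_b):
        raise ValueError("SNP vectors must be non-empty and of equal length")
    p_a = sum(snp_a) / n
    p_b = sum(snp_b) / n
    p_ab = sum(1 for x, y in zip(snp_a, snp_b) if x == 1 and y == 1) / n
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    d = p_ab - p_a * p_b  # coefficient of disequilibrium
    return d * d / denominator


# Perfectly correlated and completely uncorrelated toy SNP pairs.
print(r_squared([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0
print(r_squared([0, 1, 0, 1], [0, 0, 1, 1]))  # 0.0
```

Averaging r² over SNP pairs binned by physical distance would give a decay curve of the kind summarized in Fig. 1d.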

Gene annotation and transcript diversity

A naive projection of the coordinates of the 27,206 nuclear protein-coding genes from Col-0 (TAIR10 annotation) onto the 18 genomes predicted that 93.4% of proteins were changed in at least one accession, with 32% of the total being affected by genic deletions, premature termination codons, or other disruptions (Supplementary Table 13). This large tally of disrupted genes implies that reference annotations cannot be transferred reliably; in fact, re-annotation reveals compensating changes, ensuring that many genes encode apparently functional proteins (Fig. 2a). Thus, in 96.2% of the 8,757 genes affected, the naive annotations were replaced by an alternative gene model in at least one accession (Fig. 2b and Supplementary Fig. 18). We predicted new splice sites in 64% of the 2,572 genes with splice site disruptions (in 696 cases the new sites were within 30 bp of the original ones; see, for example, Fig. 2a). Finally, there was evidence of alternative splicing in 2,106 genes (Supplementary Information sections 10.10–10.13).

Figure 2: **Transcript and protein variation.**

For genome annotation and expression analyses (for example Figs 2–4), we generated 78-bp RNA-seq reads from two biological replicates of seedling mRNA (about 9.5 million mapped reads per accession, including Col-0; Supplementary Information section 9, and Supplementary Table 14). We integrated read alignments²¹ with sequence-based gene predictions²² by using mGene.ngs (Supplementary Information sections 9–10.3, and Supplementary Fig. 19). On average, 24,681 coding genes were predicted for each accession. Comparison of Col-0 de novo predictions with TAIR10 annotations (Supplementary Table 16) showed that these predictions are more accurate (transcript F-score 65.2%) than using the genome sequence (mGene²², 59.6%) or RNA-seq alignments alone (Cufflinks²³, 37.5%; Supplementary Table 17). Finally, we consolidated the de novo annotations by incorporating TAIR10 annotations where applicable (Supplementary Information section 10.4, and Supplementary Fig. 20); novel transcript structures for a known TAIR10 gene were only accepted if each newly predicted intron was confirmed by RNA-seq alignments, or if the reference gene model was severely disrupted.
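
The transcript F-score used in the comparison above is conventionally the harmonic mean of sensitivity and precision. The sketch below shows that calculation under the assumption that a predicted transcript counts as correct only if its exon structure matches a reference transcript exactly; the matching rule, function name and toy coordinates are assumptions for illustration rather than the evaluation procedure of the study.

```python
def transcript_f_score(predicted, reference):
    """Harmonic mean of precision and sensitivity at the transcript level.

    predicted, reference: iterables of transcripts, each represented as a
    hashable exon structure, e.g. a tuple of (start, end) pairs. A
    prediction is a true positive only on an exact match (an assumption).
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    sensitivity = true_positives / len(reference)
    return 2 * precision * sensitivity / (precision + sensitivity)


# Hypothetical exon structures: 2 of 4 predictions match 2 of 3 references.
reference = [((1, 100), (200, 300)), ((500, 900),), ((1000, 1100), (1200, 1300))]
predicted = [((1, 100), (200, 300)), ((500, 900),), ((1000, 1150), (1200, 1300)), ((2000, 2100),)]
print(round(transcript_f_score(predicted, reference), 4))  # 0.5714
```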

Figure 3: **Quantitative variation of coding gene expression.**

Figure 4: **Protein diversity and gene expression vary by gene category or family.**

We found, on average, 42,338 transcripts per accession (excluding Col-0), of which 5.5% (2,316) were novel (Table 1 and Supplementary Table 18). In each accession there were, on average, 319 novel genes (or gene fragments) supported by RNA-seq (Table 1); 717 novel genes were found in total, 496 whose sequence was present in Col-0 but not annotated, and 221 absent from the Col-0 genome but present in the de novo assemblies of the accessions. We found protein or expressed sequence tag matches for 74.9% of the new genes, primarily from A. thaliana, A. lyrata or other Brassicaceae species (Supplementary Information sections 10.8 and 10.9).

Table 1 Summary of gene predictions

For accession Can-0, we generated additional independent higher-coverage RNA-seq data from seedling, root and floral bud, which we used to confirm 83.3% of re-annotated introns (read alignment over splice junction) and 59.9% of transcripts (confirmation of every intron, or read coverage of 50% of the transcript for single-exon transcripts; Table 1). We also obtained additional RNA-seq data for Col-0 and found similar confirmation rates for the reference annotation (Supplementary Table 19). Moreover, for Can-0 we confirmed 72.1% and 84.2% of novel introns and transcripts, respectively. Many novel introns stemmed from splice disruptions that tended to be weakly expressed, so RNA-seq evidence was scarcer (Supplementary Fig. 22). Finally, more than 75% of novel alternative splicing events were supported by RNA-seq (Supplementary Information section 10.5).

Proteome diversity

To understand the effect of genetic diversity on proteins, it is insufficient to study isolated DNA polymorphisms in the context of the reference annotation. We therefore defined the distance between two amino-acid (AA) sequences by the fraction of amino-acid residues that did not align identically in their global alignment. For example, for FRIGIDA, between Col-0 and Sf-2, a premature stop codon leads to an AA distance of 49% (Fig. 2b). In 77% of proteins, the mean AA distance between all accessions was less than 3% (Fig. 2c). However, on average, 747 proteins per accession had a distance larger than 50% to any TAIR10 protein, with markedly greater variation for pseudogenes. As expected, variation between A. thaliana and its congener A. lyrata²⁴ exceeds that observed among A. thaliana accessions (Fig. 2c and Supplementary Fig. 23). Disruptions to splice sites and translation start and stop codons typically caused less severe effects than premature stop codons or frame shifts (Fig. 2c) when compensating splice sites created alternative in-frame splicing (for example Fig. 2a and Supplementary Fig. 24).
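
The AA distance defined above can be sketched with a basic global (Needleman-Wunsch) alignment. The scoring scheme (match +1, mismatch −1, linear gap −1) and the choice of alignment length as the denominator are assumptions made for this illustration; the text does not specify them, and a real analysis would normally use a substitution matrix and affine gap penalties.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment with a simple linear gap penalty."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def aa_distance(a, b):
    """Fraction of alignment columns that are not identical residues."""
    x, y = global_alignment(a, b)
    if not x:
        return 0.0
    identical = sum(1 for p, q in zip(x, y) if p == q and p != "-")
    return 1 - identical / len(x)


# A toy protein truncated to half its length (hypothetical sequences).
print(aa_distance("MKTAYIAK", "MKTA"))  # 0.5
```

Under these assumptions a protein truncated near its midpoint by a premature stop codon has a distance of about 50% to the full-length form, which is in line with the FRIGIDA example given above.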

Next, we identified protein isoforms across accessions (Fig. 2b, right; distinct isoforms differ by at least roughly 2% AA distance; Supplementary Information section 10.7). For 80% of protein coding genes the most frequent isoform was very common (frequency at least 15 out of 19), whereas isoforms for pseudogenes usually occurred at lower frequency. Moreover, isoforms for large disruptions were rare (frequency 3 or less) for 37% of affected genes (Fig. 2d). This was most pronounced for premature stops and frameshifts, where purifying selection is expected to be strongest.

As expected^3,7, disease resistance genes of the coiled-coil and Toll interleukin 1 receptor subfamilies of the Nucleotide-Binding Leucine Rich Repeat (NB-LRR) gene family were predicted to encode the most variable proteins (Fig. 4a and Supplementary Fig. 26). F-box and defensin-like genes implicated in diverse processes including defence^25,26 were also highly variable. In contrast, housekeeping genes showed little variation.

Variation in seedling gene expression

Median expression heritability of protein-coding genes was 39%, similar to that of novel genes (36%) and pseudogenes (38%), and more than for non-coding RNAs (30%) (Supplementary Fig. 27). In total, 75% (20,550) of protein-coding genes (and 21% of non-coding RNAs and 21% of pseudogenes) were expressed in at least one accession (false discovery rate (FDR) 5%), and 46% (9,360) of expressed protein-coding genes were differentially expressed between at least one pair of accessions²⁷ (Fig. 3a; FDR 5%, Supplementary Information section 11). Of these, 19% (1,750) had more than tenfold expression changes, and 1.5% (142) more than 100-fold (Fig. 3b). For about 60% of genes, at least five accessions contributed to expression variation (Fig. 4d; Supplementary Information section 11.8).
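
For orientation, one textbook way to estimate expression heritability from replicated measurements is a one-way analysis of variance across accessions. The sketch below implements that generic estimator for a balanced design; it is an assumption for illustration and not necessarily the estimator used in the study (see Supplementary Information section 11). All names and values are hypothetical.

```python
def broad_sense_heritability(expression):
    """Estimate H^2 = var_G / (var_G + var_E) from a balanced one-way ANOVA.

    expression: list of per-accession lists, each holding the same number
    (>= 2) of replicate expression values for a single gene.
    """
    k = len(expression)
    if k < 2:
        raise ValueError("need at least two accessions")
    r = len(expression[0])
    if r < 2 or any(len(group) != r for group in expression):
        raise ValueError("need a balanced design with >= 2 replicates")
    means = [sum(group) / r for group in expression]
    grand_mean = sum(means) / k
    ms_between = r * sum((m - grand_mean) ** 2 for m in means) / (k - 1)
    ms_within = sum((x - m) ** 2
                    for group, m in zip(expression, means)
                    for x in group) / (k * (r - 1))
    # Genetic variance component; negative estimates are truncated at zero.
    var_genetic = max((ms_between - ms_within) / r, 0.0)
    total = var_genetic + ms_within
    return var_genetic / total if total > 0 else 0.0


# Replicates agree perfectly within accessions: all variance is genetic.
print(broad_sense_heritability([[1, 1], [3, 3], [5, 5]]))  # 1.0
# Accession means are identical: no genetic variance.
print(broad_sense_heritability([[1, 3], [3, 1]]))  # 0.0
```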

Although the small sample size (19) precludes genome-wide association scans to identify trans expression quantitative trait loci (eQTLs), we identified potential cis-acting nucleotide variants, copy-number variants and gene structural variants (for example large indels and gene structure changes) associated with expression for 9% (836) of differentially expressed genes (FDR 5%; Supplementary Information section 12.2; we assessed gene-copy-number variation as in Supplementary Information section 12.4). Much of this variation was highly heritable (Fig. 3a). Consistent with identifying likely causal variants, 85% and 93% of associated SNPs and single-nucleotide indels for cis-eQTLs were within 5 and 10 kb of the gene, respectively, and were strikingly concentrated in the 100-bp promoter region and 5′ genic sequences (Fig. 3c, d). This was also true for heritable intron retention events, in which most cis associations were within the intron or less than 1 kb distant (Supplementary Fig. 32). Our results corroborate the general findings^28,29,30,31 of extensive cis regulation of gene expression in A. thaliana. Neither environmental variation nor population structure markedly affected expression variation (Supplementary Information section 13). Copy-number and structural variants were associated with expression in 3% (240) of differentially expressed genes, including 45% (64 out of 142) of genes with more than 100-fold differences (Fig. 3b), consistent with array studies²⁹.
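
The proportion of associated variants lying within a given distance of a gene, as reported above for 5 kb and 10 kb, can be tallied as in the following sketch. The coordinates are hypothetical, and distance is measured here to the nearest gene boundary (zero inside the gene), which is an assumption about the convention.

```python
def distance_to_gene(position, gene_start, gene_end):
    """Distance in bp from a variant to the nearest gene boundary (0 if inside)."""
    if gene_start > gene_end:
        raise ValueError("gene_start must not exceed gene_end")
    if position < gene_start:
        return gene_start - position
    if position > gene_end:
        return position - gene_end
    return 0


def fraction_within(positions, gene_start, gene_end, window):
    """Fraction of variant positions lying within `window` bp of the gene."""
    if not positions:
        raise ValueError("need at least one variant position")
    near = sum(1 for p in positions
               if distance_to_gene(p, gene_start, gene_end) <= window)
    return near / len(positions)


# Hypothetical gene spanning 10,000-12,000 bp and five associated variants.
variants = [4000, 9500, 11000, 16000, 25000]
print(fraction_within(variants, 10000, 12000, 5000))   # 0.6
print(fraction_within(variants, 10000, 12000, 10000))  # 0.8
```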

Differential gene expression varied by gene ontology (GO) and gene family (Fig. 4b–d, Supplementary Table 24 and Supplementary Figs 39–41). Seventeen of the 18 GO classifications that were enriched for differential expression (P < 10⁻³) concerned response to the biotic environment, including pathogen defence and the production of glucosinolates³² to deter herbivores (Supplementary Table 24). These include NB-LRR genes (echoing protein variation), of which 74% were differentially expressed at up to 400-fold change, and for which many accessions typically contributed to differential expression (Fig. 4b–d). Patterns for housekeeping genes (such as ribosomal proteins, eukaryotic initiation factors or kinesins) were markedly different: although many were differentially expressed, fold changes were generally small, with variation more often being limited to a few accessions (Fig. 4b–d). Differentially expressed genes generally had much higher nucleotide diversity at synonymous sites relative to other expressed genes, a pattern also observed but less extreme at non-synonymous sites (Supplementary Table 25). This suggests that differences in expression level were not due solely to reduced selective constraint.

The type II MADS box transcription factor family³³ showed striking expression polymorphisms (Fig. 4b–d), including for the FLOWERING LOCUS C (FLC)³⁴ and MADS AFFECTING FLOWERING (MAF) genes³⁵. FLC, a floral inhibitor expressed highly in accessions that require prolonged cold (vernalization) to flower³⁶, varied more than 400-fold (Supplementary Fig. 42). F-box and defensin-like genes were exceptional in that expression was restricted in a minority of genes (41% and 12%, respectively; Fig. 4b), perhaps reflecting tissue-specific or environment-specific expression^25,37. Our data suggest that high turnover for some F-box families in the A. thaliana lineage⁷ extends to gene expression as well.

Conclusion

Our study goes beyond cataloguing polymorphisms^7,17 to provide genome sequences for a moderately sized population sample (see also refs 4, 16). In doing so, we were able to annotate each genome largely independently of the Col-0 reference. We found that disruptive polymorphisms were frequently compensated for, thereby conserving coding potential and highlighting the limitation of inferring consequences of polymorphisms in the absence of complete sequence data.

Our assemblies are accurate and largely complete in single-copy regions, although additional work will be needed to assemble the roughly 20% of the genome comprising repeats and transposable elements. Disentangling copy variation, long insertions and other genomic rearrangements remains a challenge. The methods we developed are of immediate relevance to the broader A. thaliana 1001 Genomes Project⁵ and to other organisms, and highlight the importance of RNA-seq data for annotation.

Finally, despite using only 19 accessions, we fine-mapped cis-eQTLs to small genomic regions (less than 10 kb), suggesting that analogous genome-wide scans in the more than 700 derived MAGIC lines could have single-gene mapping resolution for some loci. Our findings indicate that the MAGIC lines, for which population structure is largely mitigated⁹, will be an important and complementary resource to genome-wide association studies in A. thaliana populations¹⁰.

Methods Summary

We used the same seed stocks for Col-0 and the 18 accessions Bur-0, Can-0, Ct-1, Edi-0, Hi-0, Kn-0, Ler-0, Mt-0, No-0, Po-0, Oy-0, Rsch-4, Sf-2, Tsu-0, Wil-2, Ws-0, Wu-0 and Zu-0 that originated the MAGIC lines. DNA and RNA sequencing was performed with standard (DNA) or modified (RNA-seq) Illumina protocols. All methods are described fully in Supplementary Methods; software is available from the authors on request.

Accession codes

Primary accessions

Gene Expression Omnibus

GSE30814

Data deposits

DNA sequencing data are deposited in the European Nucleotide Archive (http://www.ebi.ac.uk/ena/) under accession number ERP000565. RNA sequencing data are deposited in the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/) under accession number GSE30814. Data are also available at http://mus.well.ox.ac.uk/19genomes. Genome annotations are viewable at http://fml.mpg.de/gbrowse-19g.

References

1. Johanson, U. et al. Molecular analysis of FRIGIDA, a major determinant of natural variation in Arabidopsis flowering time. Science 290, 344–347 (2000)
2. Bentley, D. R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008)
3. Ossowski, S. et al. Sequencing of natural strains of Arabidopsis thaliana with short reads. Genome Res. 18, 2024–2033 (2008)
4. Schneeberger, K. et al. Reference-guided assembly of four diverse Arabidopsis thaliana genomes. Proc. Natl Acad. Sci. USA 108, 10249–10254 (2011)
5. Weigel, D. & Mott, R. The 1001 genomes project for Arabidopsis thaliana. Genome Biol. 10, 107, doi:10.1186/gb-2009-10-5-107 (2009)
6. The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815 (2000)
7. Clark, R. M. et al. Common sequence polymorphisms shaping genetic diversity in Arabidopsis thaliana. Science 317, 338–342 (2007)
8. Zeller, G. et al. Detecting polymorphic regions in Arabidopsis thaliana with resequencing microarrays. Genome Res. 18, 918–929 (2008)
9. Kover, P. X. et al. A multiparent advanced generation inter-cross to fine-map quantitative traits in Arabidopsis thaliana. PLoS Genet. 5, e1000551 (2009)
10. Atwell, S. et al. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature 465, 627–631 (2010)
11. McMullen, M. D. et al. Genetic properties of the maize nested association mapping population. Science 325, 737–740 (2009)
12. Durrant, C. et al. Collaborative cross mice and their power to map host susceptibility to Aspergillus fumigatus infection. Genome Res. 21, 1239–1248 (2011)
13. Cao, J. et al. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nature Genet. doi:10.1038/ng.911 (28 August 2011)
14. Lunter, G. & Goodson, M. Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Res. 21, 936–939 (2011)
15. Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20, 265–272 (2010)
16. Keane, T. M. et al. Mouse genomic variation and its effect on phenotypes and gene regulation. Nature doi:10.1038/nature10413 (in the press)
17. Gore, M. A. et al. A first-generation haplotype map of maize. Science 326, 1115–1117 (2009)
18. Lai, A. G., Denton-Giles, M., Mueller-Roeber, B., Schippers, J. H. & Dijkwel, P. P. Positional information resolves structural variations and uncovers an evolutionarily divergent genetic locus in accessions of Arabidopsis thaliana. Genome Biol. Evol. advance online publication, doi:10.1093/gbe/evr038 (27 May 2011)
19. Nordborg, M. et al. The pattern of polymorphism in Arabidopsis thaliana. PLoS Biol. 3, e196, doi:10.1371/journal.pbio.0030196 (2005)
20. Song, Y. S. & Hein, J. Constructing minimal ancestral recombination graphs. J. Comput. Biol. 12, 147–169 (2005)
21. Jean, G., Kahles, A., Sreedharan, V. T., De Bona, F. & Ratsch, G. in Current Protocols in Bioinformatics Ch. 11, Unit 11.6 (Wiley, 2010)
22. Schweikert, G. et al. mGene: accurate SVM-based gene finding with an application to nematode genomes. Genome Res. 19, 2133–2143 (2009)
23. Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnol. 28, 511–515 (2010)
24. Hu, T. T. et al. The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. Nature Genet. 43, 476–481 (2011)
25. Silverstein, K. A., Graham, M. A., Paape, T. D. & VandenBosch, K. A. Genome organization of more than 300 defensin-like genes in Arabidopsis. Plant Physiol. 138, 600–610 (2005)
26. Gagne, J. M., Downes, B. P., Shiu, S. H., Durski, A. M. & Vierstra, R. D. The F-box subunit of the SCF E3 complex is encoded by a diverse superfamily of genes in Arabidopsis. Proc. Natl Acad. Sci. USA 99, 11519–11524 (2002)
27. Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106, doi:10.1186/gb-2010-11-10-r106 (2010)
28. Keurentjes, J. J. et al. Regulatory network construction in Arabidopsis by using genome-wide gene expression quantitative trait loci. Proc. Natl Acad. Sci. USA 104, 1708–1713 (2007)
29. Plantegenet, S. et al. Comprehensive analysis of Arabidopsis expression level polymorphisms with simple inheritance. Mol. Syst. Biol. 5, 242, doi:10.1038/msb.2008.79 (2009)
30. West, M. A. et al. Global eQTL mapping reveals the complex genetic architecture of transcript-level variation in Arabidopsis. Genetics 175, 1441–1450 (2007)
31. Zhang, X., Cal, A. J. & Borevitz, J. O. Genetic architecture of regulatory variation in Arabidopsis thaliana. Genome Res. 21, 725–733 (2011)
32. Howe, G. A. & Jander, G. Plant immunity to insect herbivores. Annu. Rev. Plant Biol. 59, 41–66 (2008)
33. Kaufmann, K., Melzer, R. & Theissen, G. MIKC-type MADS-domain proteins: structural modularity, protein interactions and network evolution in land plants. Gene 347, 183–198 (2005)
34. Sheldon, C. C. et al. The FLF MADS box gene: a repressor of flowering in Arabidopsis regulated by vernalization and methylation. Plant Cell 11, 445–458 (1999)
35. Ratcliffe, O. J., Kumimoto, R. W., Wong, B. J. & Riechmann, J. L. Analysis of the Arabidopsis MADS AFFECTING FLOWERING gene family: MAF2 prevents vernalization by short periods of cold. Plant Cell 15, 1159–1169 (2003)
36. Lempe, J. et al. Diversity of flowering responses in wild Arabidopsis thaliana strains. PLoS Genet. 1, 109–118 (2005)
37. Schmid, M. et al. A gene expression map of Arabidopsis thaliana development. Nature Genet. 37, 501–506 (2005)

Acknowledgements

We thank C. Jiang and C. Brown for technical help; P. Dijkwel, K. Schneeberger and D. Weigel for providing sequence data in advance of publication; and J. Flint, M. Tsiantis, E. Feil, L. Hurst, A. Wachter and D. Weigel for comments on the manuscript. Funding was provided by the Biotechnology and Biological Sciences Research Council (BBSRC) BB/F022697/1 (to R.M. and P.X.K.), BB/D016029/2 (to P.X.K. and R.M.), BB/F020759/1 (to N.P.H.) and BB/F019793/1 to P.K. (PI E. Birney), by core funding of the Max Planck Society (to G.R.), the German Research Foundation RA1894/1-1 and RA1894/2-1 (to G.R.), Volkswagen Stiftung (to O.S.), the National Science Foundation (NSF) 0929262 (to R.C. and C.T.) and 0820985 (to R.C., Principal Investigator L. Sieburth), and award no. KUK-I1-002-03 (to N.P.H.), made by King Abdullah University of Science and Technology (KAUST). D. Buck, I. Ragoussis and colleagues in the Wellcome Trust Centre for Human Genetics Genomics Core performed most of the genomic sequencing, supported by the Wellcome Trust Core grant 090532/Z/09/Z. Contribution no. 11-388-J from the Kansas Agricultural Experiment Station.

Author information

Xiangchao Gan, Oliver Stegle, Jonas Behr, Joshua G. Steffen and Philipp Drewe: These authors contributed equally to this work.

Authors and Affiliations

Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
Xiangchao Gan & Richard Mott
Max Planck Institute for Intelligent Systems and Max Planck Institute for Developmental Biology, Spemannstraße 38, 72076 Tübingen, Germany
Oliver Stegle
Friedrich Miescher Laboratory, Max Planck Society, Spemannstraße 39, 72076 Tübingen, Germany
Jonas Behr, Philipp Drewe, Sebastian J. Schultheiss, Vipin T. Sreedharan, André Kahles, Regina Bohnert, Géraldine Jean & Gunnar Rätsch
Department of Biology, University of Utah, Salt Lake City, 84112-5330, Utah, USA
Joshua G. Steffen, Edward J. Osborne & Richard M. Clark
Department of Plant Pathology, Kansas State University, Manhattan, 66506-5502, Kansas, USA
Katie L. Hildebrand & Christopher Toomajian
Department of Statistics, University of Oxford, South Parks Road, Oxford OX1 3TG, UK
Rune Lyngsoe
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Paul Derwent & Paul Kersey
Department of Plant Sciences, University of Oxford, South Parks Road, Oxford OX1 3RB, UK
Eric J. Belfield & Nicholas P. Harberd
The Sainsbury Laboratory, Norwich NR4 7UH, UK
Eric Kemen
Department of Biology and Biochemistry, University of Bath, Bath BA2 7AY, UK
Paula X. Kover

Contributions

R.M., R.C., G.R., P.X.K. and C.T. conceived and led the project. R.M. and X.G. developed and performed genome assembly and analysis. R.C., J.S. and E.O. sequenced the transcriptomes and performed analyses. R.L. performed the ancestral recombination graph analysis. C.T. and K.L.H. performed population genetics analyses. S.J.S., G.J., G.R. and A.K. developed and applied RNA-seq alignment. J.B., G.R., S.J.S., A.K. and R.B. annotated the genomes and analysed their proteins. P.D., G.R. and O.S. performed differential expression analyses. O.S. and P.D. performed genetic association analysis of expression traits. V.T.S. processed data for display in GBrowse. N.P.H., E.B. and X.G. performed experimental confirmations. P.K. and P.D. constructed the Ensembl version of the variation data. E.K. performed genomic sequencing. R.M., R.C., G.R., P.X.K., C.T., X.G. and O.S. wrote the paper with input from all authors.

Corresponding authors

Correspondence to Christopher Toomajian, Paula X. Kover, Richard M. Clark, Gunnar Rätsch or Richard Mott.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Information

This file contains Supplementary Text, Supplementary References, Supplementary Tables 1–15 and Supplementary Figures 1–42 with legends. (PDF 6751 kb)

Rights and permissions

This article is distributed under the terms of the Creative Commons Attribution-Non-Commercial-Share Alike licence (http://creativecommons.org/licenses/by-nc-sa/3.0/), which permits distribution and reproduction in any medium, provided the original author and source are credited. This licence does not permit commercial exploitation, and derivative works must be licensed under the same or similar licence.

About this article

Cite this article

Gan, X., Stegle, O., Behr, J. et al. Multiple reference genomes and transcriptomes for Arabidopsis thaliana. Nature 477, 419–423 (2011). https://doi.org/10.1038/nature10414

Received: 09 June 2011
Accepted: 05 August 2011
Published: 28 August 2011
Issue Date: 22 September 2011
DOI: https://doi.org/10.1038/nature10414
