An atlas of over 90,000 conserved noncoding sequences provides insight into crucifer regulatory regions

Haudry, Annabelle; Platts, Adrian E; Vello, Emilio; Hoen, Douglas R; Leclercq, Mickael; Williamson, Robert J; Forczek, Ewa; Joly-Lopez, Zoé; Steffen, Joshua G; Hazzouri, Khaled M; Dewar, Ken; Stinchcombe, John R; Schoen, Daniel J; Wang, Xiaowu; Schmutz, Jeremy; Town, Christopher D; Edger, Patrick P; Pires, J Chris; Schumaker, Karen S; Jarvis, David E; Mandáková, Terezie; Lysak, Martin A; van den Bergh, Erik; Schranz, M Eric; Harrison, Paul M; Moses, Alan M; Bureau, Thomas E; Wright, Stephen I; Blanchette, Mathieu

doi:10.1038/ng.2684

Article
Open access
Published: 30 June 2013

Nature Genetics volume 45, pages 891–898 (2013)

Abstract

Despite the central importance of noncoding DNA to gene regulation and evolution, understanding of the extent of selection on plant noncoding DNA remains limited compared to that of other organisms. Here we report sequencing of genomes from three Brassicaceae species (Leavenworthia alabamica, Sisymbrium irio and Aethionema arabicum) and their joint analysis with six previously sequenced crucifer genomes. Conservation across orthologous bases suggests that at least 17% of the Arabidopsis thaliana genome is under selection, with nearly one-quarter of the sequence under selection lying outside of coding regions. Much of this sequence can be localized to approximately 90,000 conserved noncoding sequences (CNSs) that show evidence of transcriptional and post-transcriptional regulation. Population genomics analyses of two crucifer species, A. thaliana and Capsella grandiflora, confirm that most of the identified CNSs are evolving under medium to strong purifying selection. Overall, these CNSs highlight both similarities and several key differences between the regulatory DNA of plants and other species.

Main

A central challenge in functional and evolutionary genomics has been to determine the parts of a genome that are under selective constraint. Whereas protein-coding regions are relatively straightforward to identify, other functional elements such as transcriptional and post-transcriptional regulatory regions may be short and lacking in clear sequence signatures that would allow them to be detected in a single genome. Comparative genomic analyses across a group of closely related species provide a powerful approach to identify functional noncoding regions¹. Over evolutionary time, non-functional sequences are expected to diverge faster than sequences under selective constraint. Patterns of sequence conservation may therefore be used to detect the footprints of functional noncoding elements. It is now widely accepted that the most powerful approach to phylogenetic footprinting is one based on a large number of species that have substantial aggregate divergence yet remain sufficiently closely related that the loss or displacement of functional elements is rare^2,3.

Comparative genomic studies have led to the identification of thousands of CNSs in, among others, vertebrates^4,5,6, fruit flies⁷ and yeast⁸. These CNSs are thought to be involved in diverse regulatory functions, including transcription initiation and transcript processing (for example, splicing or mRNA localization), as well as being implicated in complex patterning, such as embryonic development^9,10,11,12,13. Plant CNSs have previously been identified on a genome-wide scale on the basis of the comparison of few or distant genomes (for example, maize versus rice^14,15,16, Brachypodium distachyon versus rice¹⁷, A. thaliana versus Brassica oleracea^18,19 and sets of diverse angiosperms²⁰). This approach limits either the specificity provided by large divergence times or the sensitivity provided by the comparison of more closely related species^10,11,21. Comparisons of paralogous noncoding regions flanking duplicated genes have also provided key insights into functional noncoding elements^16,22, but intraspecies duplicated CNSs may often experience relaxed selective constraints.

The Brassicaceae are an ideal family for the identification of CNSs owing to their relatively small genome sizes, robust phylogeny²³ and wealth of genomic data. So far, the genomes of six crucifer species have been partially or completely sequenced, including those of (i) the model plant A. thaliana²⁴; (ii) Arabidopsis lyrata, a congener of A. thaliana with a more ancestral karyotype and genome size²⁵; (iii) Capsella rubella, which falls in the sister group to the genus Arabidopsis²⁶; (iv) Brassica rapa (Chinese cabbage²⁷), one of the several closely related Brassica crop species in the tribe Brassiceae that share a recent genome triplication event (Br-α)²⁸; (v) Eutrema salsugineum (previously Thellungiella halophila) of the tribe Eutremeae, an extremophile adapted to saline habitats²⁹; and (vi) Schrenkiella parvula (previously Thellungiella parvula)³⁰, another extremophile of uncertain tribal placement.

To complement the set of previously published Brassicaceae genomes, we have sequenced the genomes of three additional species selected, on the basis of previously published phylogenetic analyses^31,32, to provide a broad diversity of lineages within the family. We took advantage of these nine closely related genome sequences to identify and characterize over 90,000 CNSs. The extent of selection acting on them was determined using a combination of comparative and population genomics data. Several lines of computational and experimental evidence point to a large proportion of CNSs having a role in transcriptional or post-transcriptional regulation. A full catalog of CNSs and their associated annotations in A. thaliana and A. lyrata is available via a genome browser.

Results

Genome sequencing, assembly and annotation

To supplement six publicly available crucifer genomes, we sequenced and partially assembled the genomes of three further crucifers (Table 1 and Online Methods): (i) Leavenworthia alabamica (lineage 1 in the tribe Cardamineae), a model plant species with recently lost self-incompatibility in some populations; (ii) Sisymbrium irio (lineage 2 in the tribe Sisymbrieae), a self-compatible annual closely related to the Brassica genus but lacking the derived whole-genome triplication; and (iii) Aethionema arabicum (tribe Aethionemeae), a self-compatible, early branching sister group to the remainder of the core Brassicaceae³². All species share the ancient whole-genome duplication that occurred at the base of the family (At-α; ref. 33).

Table 1 Assembly statistics and gene content

Assemblies of these three genomes included orthologs for the majority of A. thaliana protein-coding genes (68–83%, on a par with those found in more completely assembled genomes) and 98–99% of the ultraconserved core eukaryotic genes³⁴ (Table 1), suggesting that the coverage of non-repetitive DNA was high. Furthermore, the scaffold size (N50 of 70–135 kb) was suitable for the identification of orthologous regions through synteny.
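The scaffold N50 quoted above is the length of the scaffold at which the cumulative length of scaffolds, sorted from longest to shortest, first reaches half of the total assembly length. A minimal sketch of that calculation follows; the function name and the example lengths are illustrative and are not taken from Table 1.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first scaffold that pushes the running sum to half the total is the N50.
        if running >= half_total:
            return length
    raise ValueError("no scaffold lengths given")


# Hypothetical scaffold lengths in kb (not real assembly data).
example_scaffolds_kb = [200, 135, 90, 70, 40, 25, 10]
print(n50(example_scaffolds_kb))  # prints 135
```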

Protein-coding genes and transposable elements (TEs) were predicted across all nine genomes (Table 1, Online Methods and Supplementary Table 1). The number of annotated genes varied between species from 23,167 genes in A. arabicum to 41,174 in B. rapa. This variation was expected given the rediploidization process following the At-α duplication and several whole-genome amplifications after At-α (for example, the Brassica triplication²⁸). TEs comprised the majority of the variation in genome size observed across the crucifers, varying in content from 13–15% in the smallest genomes (A. thaliana and S. parvula) to ∼50% in E. salsugineum.

Multiple-genome comparison

Nine-way genome alignments were generated for the crucifer genomes, using either A. lyrata or A. thaliana as the reference (Online Methods). A region of a non-reference genome was only allowed to align to a single region of the reference genome, although many regions from non-reference genomes were allowed to map to the same reference genome region (Fig. 1a). This ensured that lone paralogous genes resulting from the At-α duplication did not contaminate the alignments, while allowing more recent duplication events, such as the whole-genome triplication in B. rapa^27,35, to be represented. Local pairwise alignment blocks were filtered to only retain those that belonged to long sets of collinear blocks (chains). The vast majority of the A. lyrata genome belonged to a single such chain in each of the other species (Fig. 1a and Supplementary Table 2), leaving little doubt about correct orthology. This pattern of alignment also provided strong support for the notion that most of the gene loss after At-α duplication was substantially completed before the divergence of species within the Brassicaceae family^36,37.

**Figure 1: Genome alignments and whole-genome triplications.**

However, two species showed a strong departure from this essentially diploid organization. As expected, most A. lyrata regions were spanned by three long alignment chains in the B. rapa genome, owing to the established Br-α triplication event (Fig. 1a,b). Notably, the L. alabamica genome showed the same clear signs of an independent whole-genome triplication, referred to hereafter as La-α. Comparative chromosome painting of the L. alabamica genome (Fig. 1c–f) independently supported the establishment of hexaploidy after At-α, with the patterning of A. thaliana BAC probes supporting the retention of some chromosomal regions in two copies and others in three copies.

A phylogenetic tree (Fig. 2) derived from 1,048,889 fourfold-degenerate sites and using Carica papaya³⁸ as an outgroup was consistent with previously published phylogenies^31,39. The total branch length of the tree (∼1.5 substitutions per site) was similar to that for a set of nine diverse mammals (∼1.3 substitutions per site)⁴⁰ used in the identification of conserved noncoding regions. The divergence between L. alabamica paralogs (∼0.3 substitutions per site) was similar to that between the paralogs of B. rapa (∼0.35 substitutions per site), formed ∼24 (18–28) million years ago^41,42 by the Br-α triplication. Although the small difference between these two estimates need not imply a difference in the age of the events, because of the possibility of variation in neutral substitution rates⁴³, their similarity likely indicates a similar era for these independent hexaploidization events.
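Fourfold-degenerate sites are third codon positions at which any of the four nucleotides encodes the same amino acid. The sketch below shows how such positions can be located in a single in-frame coding sequence using the standard genetic code. The function names are illustrative, and a real pipeline would additionally require the first two codon positions to be conserved across all aligned species, which is not modeled here.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def fourfold_prefixes():
    """Two-base codon prefixes whose third position is fourfold degenerate."""
    return {
        a + b
        for a, b in product(BASES, repeat=2)
        if len({CODON_TABLE[a + b + c] for c in BASES}) == 1
    }


def fourfold_site_indices(cds):
    """0-based indices of fourfold-degenerate third positions in an in-frame CDS."""
    prefixes = fourfold_prefixes()
    cds = cds.upper()
    return [i + 2 for i in range(0, len(cds) - 2, 3) if cds[i:i + 2] in prefixes]


# Codons: ATG GCT TTT GGA; the third positions of GCT and GGA are fourfold degenerate.
print(fourfold_site_indices("ATGGCTTTTGGA"))  # prints [5, 11]
```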

**Figure 2: A phylogenetic tree obtained using a set of 1,048,889 fourfold-degenerate sites in PhyML (general time-reversible (GTR) substitution model)⁷³.**

Selection on noncoding sites in the crucifer genomes

Comparison of multiple closely related genomes allows the fraction of the genome that is constrained by selection to be estimated⁴⁴. PhyloP⁴⁵ was used to measure interspecies conservation of each nucleotide of the A. thaliana and A. lyrata genomes, independent of the flanking nucleotides. Because of the insufficient level of divergence between the nine species considered, these scores could not unambiguously distinguish individual constrained sites from neutral ones. However, the proportion of sites under selection in the whole genome or in any given subset of sites could be estimated by comparing the distribution of PhyloP scores across the genome to that at fourfold-degenerate sites, which are largely unconstrained⁴⁶ (Online Methods).
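The exact estimation procedure is given in the Online Methods, which are not reproduced here. As an illustration of why a comparison of two score distributions can bound the constrained fraction, consider a generic mixture argument: if the genome-wide scores are a mixture of neutral sites (fraction 1 − p) and selected sites (fraction p), then for any threshold t the excess of genome-wide sites scoring at least t over the neutral expectation cannot exceed p. The sketch below, with hypothetical function and variable names, takes the largest such excess as a conservative lower bound. It is an assumption-laden illustration, not the authors' implementation.

```python
import bisect


def constrained_fraction_lower_bound(all_scores, neutral_scores):
    """Largest excess of P_all(score >= t) over P_neutral(score >= t) across thresholds t."""
    all_sorted = sorted(all_scores)
    neutral_sorted = sorted(neutral_scores)
    best = 0.0
    for t in sorted(set(all_sorted) | set(neutral_sorted)):
        # Fraction of sites with a conservation score of at least t in each set.
        frac_all = (len(all_sorted) - bisect.bisect_left(all_sorted, t)) / len(all_sorted)
        frac_neutral = (
            len(neutral_sorted) - bisect.bisect_left(neutral_sorted, t)
        ) / len(neutral_sorted)
        best = max(best, frac_all - frac_neutral)
    return best


# Toy scores (not real PhyloP output): four neutral-like sites plus two highly conserved ones.
neutral = [-1.0, 0.0, 0.0, 1.0]
genome = neutral + [3.0, 3.0]
print(constrained_fraction_lower_bound(genome, neutral))
```

In the toy example two of the six genome-wide sites are conserved, and the bound recovers one-third because the conserved scores do not overlap the neutral ones. With overlapping distributions the bound falls below the true fraction, which is consistent with the "at least" phrasing used for such estimates.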

At least 17.7% of the assembled A. thaliana genome sequence (21.1 Mb) seemed to be evolving under constraint (Fig. 3a), with close to a quarter of this sequence (4.5 Mb) located outside of protein-coding regions. In the larger TE-rich A. lyrata genome, very slightly more sites seemed to be under selection (22.2 Mb), corresponding to a much smaller fraction of the genome (11.3%). Consequently, the major cause of the difference in genome size between A. lyrata and A. thaliana is probably not the loss of functional sites but rather the loss of effectively unconstrained regions in A. thaliana, likely coupled with higher recent TE activity in A. lyrata²⁵.

**Figure 3: Estimation of the fraction of sites under selection in the *A. thaliana* genome.**

In both A. thaliana and A. lyrata, the constrained noncoding sites were divided roughly evenly between transcript-associated sites (introns and UTRs) and intergenic sites (Fig. 3a and Supplementary Fig. 1). The proportion of sites under selection was particularly high in 5′ and 3′ UTRs (17% and 13%, respectively) as well as in intronic regions flanking exons (15% within 30 bp of splice sites) but was much lower in the center of introns (Fig. 3b). Contrary to what is observed in mammals⁴⁷ and Drosophila melanogaster⁴⁸, intronic bases located within 500 bp of the transcription start site (TSS) did not seem to be under significantly stronger selective pressure than other intronic bases.

Outside protein-coding transcripts, the proportion of sites under selection decreased with the distance from the TSS. Nonetheless, as observed in Drosophila⁴⁹, more than half of intergenic sites showing signs of selection were >1 kb away from the closest annotated TSS. Notably, and in contrast to findings in mammals⁵⁰, evidence of constraint was lowest immediately downstream of 3′ UTRs, suggesting that regulatory elements are rarely located in those regions.

At least 90,000 conserved noncoding sequences

The PhyloP analyses yielded estimates of the number of individual sites under selection in specific portions of the A. thaliana genome, but they did not pinpoint the location of those sites. Constrained sites can only be reliably identified if they cluster with other such sites, forming CNSs. CNSs are genomic regions that show a reduced mutation rate over contiguous or near-contiguous sets of noncoding bases. A set of 92,421 (90,104) CNSs was identified in the A. lyrata (A. thaliana) genome (Supplementary Fig. 2) using a pipeline based on PhastCons⁵. Evidence of selection was also clearly provided by the relative rarity of insertions and deletions within CNSs (Supplementary Fig. 3). Previously published CNSs obtained from pairwise genome comparisons were typically identified in this screen, but five- to tenfold more conserved regions were also identified, owing to the sensitivity afforded by the use of multiple genomes (Supplementary Note).

Each CNS was annotated in accordance with its position relative to genes in the A. lyrata genome. CNSs without evidence of expression in whole A. thaliana or A. lyrata plants (Online Methods) were classified as proximal upstream or downstream (<500 bp upstream of the TSS or downstream of the transcription end site (TES), respectively), distal (>500 bp away from any gene's TSS or TES) or ambiguous. Genic CNSs were subdivided on the basis of their location in 5′ UTRs, introns and 3′ UTRs. We also identified 820 CNSs with substantial evidence of short RNA expression in A. lyrata (Online Methods) and labeled these as putative small noncoding RNA CNSs (smRNA CNSs). Conserved regions with evidence of small RNA expression in other plants but not in A. lyrata were set apart as a class of potential smRNA CNSs (Supplementary Table 3).
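The positional classes for non-genic CNSs can be sketched as a simple distance rule. The code below is a hypothetical illustration: the `Gene` record, the function name and, in particular, the rule that a CNS matching more than one class is "ambiguous" are assumptions, since the text does not spell out how ambiguity was decided. It assumes the CNS overlaps no gene.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    tss: int     # transcription start site coordinate
    tes: int     # transcription end site coordinate
    strand: str  # "+" or "-"


def classify_intergenic_cns(cns_start, cns_end, genes, window=500):
    """Label a CNS that overlaps no gene; coordinates are inclusive, cns_start <= cns_end."""
    labels = set()
    for gene in genes:
        if gene.strand == "+":
            upstream_gap = gene.tss - cns_end      # positive when the CNS lies before the TSS
            downstream_gap = cns_start - gene.tes  # positive when the CNS lies after the TES
        else:
            upstream_gap = cns_start - gene.tss
            downstream_gap = gene.tes - cns_end
        if 0 < upstream_gap < window:
            labels.add("proximal upstream")
        if 0 < downstream_gap < window:
            labels.add("proximal downstream")
    if not labels:
        return "distal"
    if len(labels) > 1:
        return "ambiguous"  # assumed rule: close to the TSS of one gene and the TES of another
    return labels.pop()


genes = [Gene(1000, 3000, "+"), Gene(3600, 5000, "+")]
print(classify_intergenic_cns(800, 850, genes))    # within 500 bp upstream of the first TSS
print(classify_intergenic_cns(3200, 3300, genes))  # between the two genes, close to both
```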

CNS density in different types of genomic regions closely followed that of the inferred sites under selection (Supplementary Fig. 4). Crucifer CNSs were typically short (median length of 36 bp and slightly shorter in introns and UTRs) and had a GC content similar to that of the noncoding portion of the genome (25–40%; Supplementary Fig. 5). smRNA CNSs, most of which corresponded with known noncoding RNA genes, formed a relatively distinct group, showing higher conservation and GC content and a markedly bimodal size distribution, mostly caused by large numbers of microRNA (miRNA) and tRNA genes.

Evidence of purifying selection on CNSs at the population level

To independently assess evidence for purifying selection acting on CNSs, we analyzed the distribution of sequence diversity in CNSs within the populations of two Brassicaceae species: a recently sequenced set of 80 A. thaliana genomes⁵¹ and a set of 13 outbred individuals (26 haplotypes) of C. grandiflora (S.I.W., unpublished data), a close relative of C. rubella. Evidence for recent purifying selection acting on CNSs was found in the minor allele frequency (MAF) spectra of both populations. Both species showed an excess of rare variants in the CNS bases (Fig. 4) and reduced levels of population diversity compared to fourfold-degenerate sites, as measured by nucleotide diversity π (ref. 52) and Watterson's estimator θ_W (ref. 53) (Supplementary Fig. 6). However, purifying selection on CNSs was not generally as strong as in the highly constrained zero-fold degenerate sites. Similar observations were made for deletion polymorphisms at the population level (Supplementary Fig. 7). Because analyses of MAF spectra only examined segregating variation and ignored the level of polymorphism, these results provide independent validation of the action of purifying selection and limit the possibility of low divergence in CNSs arising from mutation cold spots^54,55.
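For readers less familiar with these statistics, the sketch below computes per-site nucleotide diversity π (the mean proportion of pairwise differences), Watterson's θ_W (the number of segregating sites scaled by the harmonic number a_n) and a folded minor-allele count spectrum from alignment columns. The function names and the toy data are illustrative, and missing data and multi-allelic sites in the spectrum are ignored for simplicity.

```python
from collections import Counter


def pi_and_theta_w(columns):
    """Per-site pi and Watterson's theta from columns of n sampled haplotypes."""
    n = len(columns[0])
    a_n = sum(1.0 / i for i in range(1, n))  # harmonic number used by Watterson's estimator
    pairs = n * (n - 1) / 2
    segregating = 0
    diff_sum = 0.0
    for col in columns:
        counts = Counter(col)
        if len(counts) > 1:
            segregating += 1
        identical_pairs = sum(c * (c - 1) / 2 for c in counts.values())
        diff_sum += (pairs - identical_pairs) / pairs
    length = len(columns)
    return diff_sum / length, segregating / a_n / length


def folded_spectrum(columns):
    """Count biallelic sites by minor allele count."""
    spectrum = Counter()
    for col in columns:
        counts = Counter(col)
        if len(counts) == 2:
            spectrum[min(counts.values())] += 1
    return dict(spectrum)


# Toy data: four sites sampled in four haplotypes (not real polymorphism data).
sites = ["AAAA", "AAAT", "AATT", "CCCC"]
print(pi_and_theta_w(sites))
print(folded_spectrum(sites))
```

An excess of low minor allele counts in CNSs relative to fourfold-degenerate sites would appear as a spectrum shifted toward the first bin.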

**Figure 4: Evidence of selection on CNSs in populations.**

Distribution of CNSs in Brassicaceae and other plants

The fraction of A. lyrata CNSs for which homologs could be detected in other plant genomes was determined on the basis of sequence similarity (Fig. 5). Whereas most Brassicaceae genomes contained homologs for more than 75% of these CNSs, the early branching A. arabicum genome had homologs for only 38%. The two other Brassicaceae with a reduced number of identifiable homologs were those that had undergone whole-genome triplication events, B. rapa and L. alabamica, suggesting increased rates of CNS loss after each triplication. The proportion of A. lyrata CNSs with detectable homologs outside Brassicaceae was relatively low, ranging from 0.8% in the phylogenetically distant Oryza sativa to 3.4% in the more recent neighbor C. papaya. CNSs that seemed to predate Brassicaceae divergence were 75-fold enriched for small noncoding RNAs.

**Figure 5: The majority of *A. lyrata* CNSs are shared with most other Brassicaceae, but few are conserved outside that clade, with the exception of those corresponding to smRNAs.**

Loss of genes and CNSs after whole-genome triplications

The presence of two whole-genome triplications (Br-α and La-α) offered a further opportunity to study the fate of genes and CNSs after independent genome amplifications in closely related species⁵⁶. Alignment chain data indicated that 38% of A. lyrata genes had a unique ortholog in L. alabamica, whereas 25% and 7% had two and three copies, respectively. Despite similar estimates for the timing of Br-α and La-α, B. rapa seems to have lost gene copies faster, having kept two (three) copies of only 18% (3%) of A. lyrata genes. Notably, genes kept in three copies in B. rapa were five times more likely to also be kept in three copies in L. alabamica than would be expected if losses had occurred randomly. As previously observed, those genes were enriched for dosage-sensitive pathways⁵⁷ (for example, the response to environmental cadmium ions, false discovery rate (FDR) = 5 × 10⁻⁹) or stoichiometry-dependent protein complexes⁵⁸ (for example, the ribosome, FDR = 2.6 × 10⁻⁶).

Similarly, CNSs that were preserved in three copies in B. rapa were, depending on their type, two to four times more likely to be preserved in three copies in L. alabamica (Supplementary Fig. 8). Intronic CNSs showed the highest degree of convergent retention, and 5′ UTR CNSs showed the weakest retention. Notably, 136 of the 428 CNSs kept in 3 copies each in B. rapa and L. alabamica were also kept in 2 copies after the At-α duplication, 89% more than would be expected by chance.
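Statements of the form "n times more likely than expected by chance" compare an observed count of co-retained elements with the count expected if retention in the two species were independent. A minimal sketch of that ratio follows; the function name and the counts are hypothetical, and the null model used in the paper may differ in detail.

```python
def coretention_fold(n_total, n_kept_a, n_kept_b, n_kept_both):
    """Observed co-retention divided by the expectation under independent loss."""
    expected_both = n_kept_a * n_kept_b / n_total
    return n_kept_both / expected_both


# Hypothetical counts: 1,000 elements, 100 kept in species A, 50 in species B, 25 in both.
# Independence predicts 100 * 50 / 1000 = 5 shared elements, so the fold change is 5.
print(coretention_fold(1000, 100, 50, 25))  # prints 5.0
```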

Many CNSs are transcriptional regulatory elements

To test for a function for intergenic CNSs in transcriptional regulation, the regions bound by 13 transcription factors in A. thaliana chromatin immunoprecipitation chip (ChIP-chip) and sequencing (ChIP-seq) data⁵⁹ were combined, and the extent of their overlap with CNSs was assessed (Supplementary Table 4). Seventeen percent of bases in distal CNSs were found to overlap a region bound by at least one transcription factor, representing a 13-fold enrichment. Reciprocally, CNSs covered more than 35% of bases in distal bound regions. Similar overlaps were observed for proximal and downstream CNSs, although at slightly lower levels. However, the situation was quite different for intronic and 3′ UTR CNSs, where only 3–4% of bases overlapped a bound region. Additional evidence points to a role in post-transcriptional regulation for many of these intronic and 3′ UTR CNSs (Supplementary Fig. 9 and Supplementary Note).
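Base-level overlap figures of this kind can be computed by merging each interval set and intersecting the two. The sketch below is one plausible definition of the enrichment (fraction of CNS bases that are bound, divided by the fraction of the background region that is bound); the function names, half-open coordinates and background definition are assumptions rather than the authors' stated procedure.

```python
def merge(intervals):
    """Merge overlapping half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered_bases(intervals):
    return sum(end - start for start, end in merge(intervals))


def overlap_bases(a, b):
    """Number of bases covered by both interval sets."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def overlap_enrichment(cns, bound, region_length):
    """Return (fraction of CNS bases bound, fold enrichment over background)."""
    frac_cns_bound = overlap_bases(cns, bound) / covered_bases(cns)
    frac_background = covered_bases(bound) / region_length
    return frac_cns_bound, frac_cns_bound / frac_background


# Toy intervals on a 100-bp region (not real ChIP data).
print(overlap_enrichment([(0, 10), (20, 30)], [(5, 25)], 100))
```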

DNase I footprinting provides a more global perspective on protein binding to the genome. Overall, 44% of CNSs overlapped a recently published set of DNase I–hypersensitive sites (HSSs)⁶⁰, four times more than expected by chance. HSS-overlapping CNSs were mostly found in intergenic regions, with 51% (65%) of distal (proximal) CNSs overlapping an HSS but with only 9% of intronic CNSs and 23% of 3′ UTR CNSs showing overlap. Taken together, these results again suggest that CNSs in intronic regions are generally less likely than intergenic CNSs to mediate protein-DNA interactions. SNP density and MAF spectra clearly point to the portions of HSSs that overlap CNSs as the most strongly constrained subregions, whereas HSSs that had no overlap with CNSs showed reduced evidence of selective pressure (Supplementary Fig. 10).

We identified a set of 971 5-kb regions that were enriched for intergenic CNSs (coverage above 15%). These CNS-rich regions tended to be located next to genes involved in responses to hormonal stimuli, the regulation of transcription and organ development (Supplementary Table 5), matching previous reports in plants^14,22 and vertebrates⁴. Intergenic regions surrounding such genes, as well as their introns, were sometimes 25–50% covered by CNSs. In these cases, it is likely that intronic CNSs have a role in transcriptional rather than post-transcriptional regulation.

Two CNS-rich regions surround the MIR166A and MIR166B loci, which harbor some of the rare CNSs for which homologs can be found outside Brassicaceae (Supplementary Fig. 11). These miRNA genes, which are conserved across eudicot and monocot plants, target the stability of members of the homeobox family of transcription factors⁶¹ and are crucial to the establishment of biological axes in stem, root and leaf^62,63. Both MIR166A and MIR166B were surrounded by a cluster of intergenic CNSs, including six with clear homologs beyond the Brassicaceae as well as conservation between the two A. thaliana loci. Other miRNA genes that were associated with clusters of CNSs are listed in Supplementary Table 6.

CNSs are enriched for specific sequence motifs

By highlighting genomic regions of likely regulatory function, CNSs facilitated the identification of regulatory motifs for transcription factors or RNA-binding proteins. Each type of CNS was found to be enriched for particular motifs, on the basis of a z-score calculation that compared the number of occurrences of motifs 6–8 nt in length in CNSs to permuted versions of the same motifs⁷ (Fig. 6a and Supplementary Table 7). Many of the motifs identified were associated with the binding preferences of ubiquitous transcription factors (G-box, E-box, W-box, EIRE, GT-1 and TATA-binding elements) and were enriched in all types of intergenic CNSs, suggesting that these CNSs may have similar regulatory roles. For example, the abscisic acid–linked G-box (CACGTG) and the calcium signaling–sensitive E-box (CATGTG) motifs, when grouped together under the CAYRTGTC motif (with Y representing C or T and R representing A or G), were four- to sevenfold enriched only in intergenic CNSs. The motif enrichment analysis was repeated on subsets of CNSs associated with genes with similar functions, as determined on the basis of GO-Slim⁶⁴, identifying a large number of known and new putative regulatory motifs (Supplementary Table 8 and Supplementary Note).
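The general idea of scoring a motif against permuted versions of itself can be sketched as follows. This is an illustrative reading of the approach, not the published pipeline: the function names are hypothetical, only a small subset of IUPAC codes is handled, strand and background composition are ignored, and the full set of distinct permutations is used as the control, which is practical only for short motifs.

```python
import re
from itertools import permutations
from statistics import mean, pstdev

# Subset of IUPAC codes needed for motifs such as CAYRTG.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}


def count_motif(motif, sequences):
    """Count occurrences of a motif, including overlapping ones."""
    pattern = re.compile("(?=" + "".join(IUPAC[ch] for ch in motif) + ")")
    return sum(len(pattern.findall(seq)) for seq in sequences)


def motif_zscore(motif, sequences):
    """z-score of the motif count relative to counts of its distinct permutations."""
    observed = count_motif(motif, sequences)
    controls = {"".join(p) for p in permutations(motif)} - {motif}
    control_counts = [count_motif(m, sequences) for m in controls]
    spread = pstdev(control_counts)
    if spread == 0:
        return float("nan")  # no variation among controls, so the z-score is undefined
    return (observed - mean(control_counts)) / spread


# Toy sequences (not real CNSs) containing three G-box occurrences.
toy_cns = ["CACGTGCACGTG", "TTCACGTGAA"]
print(count_motif("CACGTG", toy_cns))  # prints 3
print(motif_zscore("CACGTG", toy_cns))
```

Using permutations as controls preserves the base composition of the motif, so a high z-score reflects the specific order of bases rather than GC content alone.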

**Figure 6: CNSs are enriched for sequence motifs with evidence of constraint.**

Likely reflecting the binding contexts relative to the TSS needed for a functional site, diversity estimated in the population of 80 A. thaliana lines was 7-fold lower at G-box sequences in CNSs compared to G-box sequences located outside CNSs. SNP density within G-box sequences in CNSs showed constraint in A. thaliana and C. grandiflora populations (Fig. 6b) and in interspecies conservation profiles (Fig. 6c,d). Notably, positions immediately flanking most G-box sequences showed a reduced level of conservation compared to overall CNSs, possibly highlighting spatial constraints on the placement of other regulatory elements.

Several motifs of unknown function were strongly enriched in 5′ UTRs but not elsewhere, with some exhibiting strong strand bias, hinting at a role in post-transcriptional regulation. Others were found in both proximal and 5′ UTR CNSs (TATA box and GA track), suggesting a role in transcription initiation.

Discussion

Although the annotation of protein-coding and some small noncoding RNA genes in A. thaliana has become increasingly complete, until now, no high-resolution map of regulatory regions existed. Here we report the first genome-wide high-resolution atlas of noncoding regions under selection in crucifers. Because the detection of CNSs is based on the comparison of a large number of closely related species, the sensitivity of this map is higher than that in previous studies based on pairs or sets of more distant species^20,65 or on intragenomic comparisons²², resulting in a five- to tenfold increase in the number of constrained regions identified.

Our analysis shows that at least 5% of noncoding sites in A. thaliana seem to have been evolving under some form of purifying selection in Brassicaceae. Our estimate of the proportion of the A. thaliana genome that is constrained, combined with experimental estimates of the mutation rate⁶⁶ (7 × 10⁻⁹ substitutions per site per generation), yields a lower bound on the deleterious mutation rate of at least 0.15 bases per individual per generation, which is comparable to the conservative estimate of 0.1 from a large mutation accumulation experiment⁶⁷.
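The arithmetic behind this lower bound can be checked directly from figures given in the text: the 21.1 Mb of constrained A. thaliana sequence reported in the Results multiplied by the per-site, per-generation rate quoted above. The variable names are illustrative, and the calculation assumes that every change at a constrained site is deleterious and that the constrained sequence is counted once.

```python
constrained_bp = 21.1e6   # constrained A. thaliana sequence reported in the Results (21.1 Mb)
rate_per_site = 7e-9      # per-site, per-generation rate quoted in the text

deleterious_per_generation = constrained_bp * rate_per_site
print(round(deleterious_per_generation, 2))  # prints 0.15
```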

The number of genomes analyzed in this study and their divergence relative to each other are comparable to those used in similar comparative genomics studies of mammals^4,68 and fruit flies⁷. It is consequently possible to contrast the properties of their CNSs. The regulatory complexity of a genome can be approximated by the number of bases in CNSs, normalized by gene number. In A. thaliana and A. lyrata, this regulatory complexity amounts to 160 bp per gene, slightly more than in yeast (∼110 bp per gene) but substantially less than in animals (worms, ∼600 bp per gene; fruit flies, ∼2,500 bp per gene; mammals, ∼5,000 bp per gene)⁵. This finding suggests that noncoding regulatory mechanisms in plants are intermediate in complexity between those of yeast and worms and is consistent with the hypothesis that plants obtain regulatory complexity via gene or entire-genome duplication rather than from noncoding regulation⁶⁹. Alternatively, this low CNS-to-gene ratio might be caused by a high rate of turnover or streamlining of regulatory regions, possibly linked to frequent whole-genome duplications, resulting in many such CNSs going undetected by our approach.

The most highly conserved noncoding sequences identified were on average ∼70 bp in length with ∼0.15 substitutions per site, which is much lower than the 100% conservation over 200 bp used to define mammalian ultraconserved elements⁴. Nonetheless, the set of 2,012 most highly conserved noncoding sequences (with, at most, 0.5 substitutions per site over at least 50 bp; Supplementary Table 9) stands out from the rest of the CNSs in a manner that is reminiscent of ultraconserved elements, clustering around genes involved in the regulation of embryonic and post-embryonic development. Because the most recent common ancestor of plants and vertebrates was unicellular and, hence, likely lacked developmental patterning, this finding suggests that similar patterns of CNS control may have evolved independently in the two kingdoms—a noteworthy example of convergent evolution of genomic organization.

Crucifer CNSs differ from animal CNSs in the way that they are associated with their putative target genes: distal CNSs are comparatively less frequent in Brassicaceae than in animals, and first introns are generally depleted of CNSs, unlike in mammals where CNSs distribute roughly symmetrically around TSSs, and first introns are enriched for regulatory elements⁷⁰. These differences in CNS distribution may reflect some of the differences in intron-exon structures between plants and animals. In vertebrates, the first intron may be relatively extended, whereas, in plants, it tends to be shorter with alternative splicing more frequently driving intron retention⁷¹, thereby potentially limiting the first intron as a site for regulatory CNSs.

Together with findings from other ongoing investigations, the resources introduced here will help to establish the properties of the constrained portion of the noncoding genome of the crucifers, in much the same way as similar projects have in other eukaryotic species. Combined with the application of systematic genome-wide experimental assays⁷², this atlas of noncoding selection in the Brassicaceae will further open the door to the detailed characterization of the cis regulome of these species.

Methods

Sequencing and assembly.

The genomes of L. alabamica, A. arabicum and S. irio were assembled from Illumina paired-end reads (2 × 105 nt with a nominal 64-nt gap on the Genome Analyzer IIx platform; 55–110× coverage) and mate-pair reads (2 × 105 nt with 5-kb and 10-kb insert sizes on the Genome Analyzer IIx and HiSeq 2000 platforms). Libraries and reads were generated in accordance with Illumina protocols, with special attention paid to gentle shearing of mate-pair circular DNA. Reads were trimmed for quality (3′ trimming starting at the first position with Q < 32) and assembled with the Ray assembler⁷⁸ using a K-mer size of 31 to 41 (optimized per genome). Mate-pair reads were filtered for duplicates using a Bloom filter (pybloomfaster; see URLs) and for potential false mates on the edges of contigs, resulting in approximately 12× (5×) coverage for 5-kb (10-kb) inserts for each species. The filtered mate pairs were then used for scaffolding with SOAPdenovo⁷⁹, with a K-mer size of 61 and gap filling enabled.
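The 3′ trimming rule described above (discard everything from the first position with Q < 32 onward) can be sketched in a few lines. The function name and the example read are hypothetical, qualities are assumed to be already decoded to integer Phred scores, and the trimming tool actually used by the authors is not named in the text.

```python
def trim_3prime(sequence, qualities, min_q=32):
    """Truncate a read at the first base whose Phred quality is below min_q."""
    for i, q in enumerate(qualities):
        if q < min_q:
            return sequence[:i], qualities[:i]
    return sequence, qualities


# Hypothetical read: the fourth base (Q = 30) falls below the threshold.
print(trim_3prime("ACGTAC", [40, 38, 35, 30, 40, 40]))  # prints ('ACG', [40, 38, 35])
```

Note that bases after the first low-quality position are removed even if their own quality is high, which is what "trimming starting at the first position" implies.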

The genomes of B. rapa (ssp. Chiifu-401-42; pre-publication data from Wang et al.²⁷), S. parvula (PRJNA63667), A. lyrata (ssp. lyrata; PRJNA41137), A. thaliana (Col-0; TAIR9/TAIR10, PRJNA10719), E. salsugineum (PRJNA80723) and C. rubella (PRJNA13878) were obtained either directly from their assemblers or from data published at the time of the release of the genome.

Genome completeness was assessed relative to the total and expected assembly length, the count of A. thaliana orthologs and the percentage of complete, highly conserved eukaryotic genes (Table 1). The A. arabicum genome was further validated against a set of physical mapping data (Keygene) that showed near-perfect concordance between assembled scaffolds and BAC contigs. The S. irio genome was compared against BAC and BAC-end data in GenBank. Although sequences were mostly concordant, the BAC data were from a tetraploid species and were consequently not expected to match completely. The L. alabamica genome was examined in several TE-rich extended loci (for example, the S locus) that had been assembled using long-read fosmid sequences⁸⁰ and showed near-perfect concordance.

CCP analysis in L. alabamica.

Preparation of chromosome spreads from young anthers and BAC painting probes, as well as multicolor FISH, followed the protocols described by Lysak and Mandáková⁸¹. In total, 237 chromosome-specific BACs from A. thaliana (∼23 Mb) were used as painting probes. The following A. thaliana BAC contigs were applied to identify 8 ancestral genomic blocks²³ on L. alabamica chromosomes: block A (31 BACs: T25K16–T29M8; 6.7 Mb), block O (24 BACs: F6N15–T1J1; 2.5 Mb), block P (13 BACs: T3H13–T22B4; 1.3 Mb), block Q (32 BACs: T20O7–T8M17; 2.6 Mb), block R (33 BACs: F7J8–T6G21; 7.4 Mb), block V (22 BACs: MBD2–K23F3; 2.4 Mb), block W (56 BACs: K21P3–MMN10; 4.3 Mb) and block X (26 BACs: MUP24–K9I9; 2.5 Mb).

Genome annotation.

All nine genomes were annotated for genic regions using Maker⁸² in conjunction with FGENESH and FGENESH+ (ref. 83), Augustus⁸⁴, SNAP⁸⁵ and BLAT⁸⁶ for transcript mapping. Repetitive regions and TEs were annotated with RepeatMasker using repeat models determined on a per-species basis obtained from RepeatModeler (Supplementary Table 2).

In addition to annotation of the A. lyrata and A. thaliana genomes, we combined sequenced mRNA from whole A. lyrata plants (Illumina, strand-specific RNA sequencing, 2 × 80 nt; R. Clark, personal communication; PRJNA207497), archived mRNA⁸⁷ (NCBI Sequence Read Archive (SRA) SRR019209, SRR019183 and SRR064165) and small RNA sequence data from both SOLiD and Illumina platforms (SRR040402, SRR072809, SRR034856 and SRR051926). Reads were aligned with Novoalign (Novocraft) at high alignment stringency and with SpliceMap⁸⁸ for exon-spanning reads. Expression tracks were then lifted over between the two reference genomes to aid annotation.

Whole-genome alignments.

Each genome was soft masked and aligned to A. lyrata and E. salsugineum (primary and secondary reference genomes, respectively) using lastz⁸⁹. Alignments were then chained⁹⁰, and collinear alignment blocks separated by gaps of <100 kb were assembled. We filtered for orthologous chains by retaining chains in decreasing order of score that did not substantially overlap previously selected chains in the non-reference genome (chains could overlap in the reference genome). This filter effectively separates orthologs from α paralogs, while allowing more recent whole-genome duplications to be properly represented. In the case of the two genomes with whole-genome triplications (B. rapa and L. alabamica), genomic regions were subdivided into three groups of chains, where each group contained non-overlapping chains. We obtained a nine-way multiple alignment using the Multiz⁹¹ progressive alignment program, following phylogenetic order, using A. lyrata as the reference for lineage 1 and E. salsugineum as the reference for lineage 2. For the purpose of measuring sequence conservation of a region, the most conserved of the paralogs in B. rapa and L. alabamica was retained.
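The greedy chain filter described above can be sketched as follows. This is an illustration only, not the code used in this study: the Chain fields, the function names and the 10% overlap cutoff are assumptions (the text says only that chains must not "substantially" overlap), and coordinates are taken to be half-open intervals on the non-reference genome.

```python
from typing import NamedTuple


class Chain(NamedTuple):
    score: float
    q_name: str   # scaffold in the non-reference genome
    q_start: int  # half-open interval [q_start, q_end)
    q_end: int


def overlap_fraction(chain, kept):
    """Fraction of `chain` covered by already-kept chains in the non-reference genome.

    Overlaps are summed pairwise, so the value can slightly overestimate
    coverage where kept chains overlap one another.
    """
    covered = 0
    for other in kept:
        if other.q_name == chain.q_name:
            covered += max(0, min(chain.q_end, other.q_end)
                           - max(chain.q_start, other.q_start))
    return covered / (chain.q_end - chain.q_start)


def filter_orthologous(chains, max_overlap=0.1):
    """Visit chains in decreasing score order; keep those that do not
    substantially overlap previously kept chains (max_overlap is hypothetical)."""
    kept = []
    for chain in sorted(chains, key=lambda c: c.score, reverse=True):
        if overlap_fraction(chain, kept) <= max_overlap:
            kept.append(chain)
    return kept
```

Only non-reference coordinates are checked, which mirrors the statement that chains may overlap in the reference genome.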

Determination of the fraction of sites under selection.

Our approach to estimating the fraction of sites under selection is based on that of Waterston et al.⁹², adapted to use site-specific conservation scores. PhyloP⁴⁵ was used to measure position-specific conservation levels on the basis of the nine-way alignment and the JGI gene annotation of A. lyrata, using a model of freely evolving sites obtained from fourfold-degenerate sites of the same alignment. The fraction of sites under selection in a given set of sites R was obtained as follows. Let S be a subset of the nine species considered, and let R_S be the subset of sites from R that have nucleotides in the species of S and gaps in the other species. PhyloP scores were discretized into 1,000 bins. For each S, let f_NS(x) be the distribution of discretized PhyloP scores obtained from fourfold-degenerate sites after replacing nucleotides in species outside S by gaps. Let f_RS(x) be the observed distribution of PhyloP scores in R_S. We express f_RS(x) as a mixture of f_NS(x) and f_FS(x), the unknown distribution of scores for sites under selection in R_S. Specifically, we estimated α_RS so that f_RS(x) = α_RS f_FS(x) + (1 − α_RS) f_NS(x). Let F_NS(x) and F_RS(x) be the cumulative distributions of f_NS(x) and f_RS(x), respectively. Let x* be the value for which F_NS(x)/F_RS(x) is maximized (excluding values of x for which either of the two cumulative distributions has a value less than 0.1), and let r_max = F_NS(x*)/F_RS(x*). We obtain α_RS = Σ_{x = x*...1,000} [f_RS(x) − f_NS(x)/r_max]. Finally, the fraction under selection for region R is determined as Σ_S α_RS |R_S|/|R|. Note that, because not all fourfold-degenerate sites are truly unconstrained, our estimate of the fraction under selection in R is a lower bound.
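The estimator above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the distributions are passed as lists of bin frequencies that each sum to 1, cumulative values include the current bin, and the sum defining α_RS is read as covering both terms, that is, f_RS(x) − f_NS(x)/r_max for every bin from x* upward.

```python
from itertools import accumulate


def alpha_for_subset(f_neutral, f_region, min_cum=0.1):
    """Estimate alpha_RS from binned score distributions for one species subset S.

    f_neutral -- f_NS: bin frequencies at fourfold-degenerate sites
    f_region  -- f_RS: bin frequencies in the region of interest
    Bins are ordered from lowest to highest PhyloP score.
    """
    cum_neutral = list(accumulate(f_neutral))
    cum_region = list(accumulate(f_region))

    # Find x* maximizing F_NS(x) / F_RS(x), skipping bins where either
    # cumulative distribution is still below min_cum.
    x_star, r_max = None, 0.0
    for x, (cn, cr) in enumerate(zip(cum_neutral, cum_region)):
        if cn < min_cum or cr < min_cum:
            continue
        ratio = cn / cr
        if ratio > r_max:
            x_star, r_max = x, ratio
    if x_star is None:
        raise ValueError("no bin passes the cumulative-frequency filter")

    # Excess of the observed distribution over the rescaled neutral one.
    return sum(fr - fn / r_max
               for fn, fr in zip(f_neutral[x_star:], f_region[x_star:]))


def fraction_under_selection(per_subset):
    """Combine (alpha_RS, |R_S|) pairs into sum_S alpha_RS * |R_S| / |R|."""
    total_sites = sum(size for _, size in per_subset)
    return sum(alpha * size for alpha, size in per_subset) / total_sites
```

On a toy input in which 30% of sites are drawn from a distribution that does not overlap the neutral one, the estimate returns 0.3, as expected for a cleanly separable mixture. When the two distributions overlap, some selected sites are absorbed into the rescaled neutral component, which is consistent with the estimate being a lower bound.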

CNS identification.

CNSs were identified as regions located beyond annotated coding sequences in the A. lyrata reference genome that showed a high PhastCons⁵ score (>0.82) over an extended length (>7 nt) and did not include a region of more than 12 nt with a low PhastCons score (<0.55). To facilitate the comparison of incompletely sequenced genomes, insertions, deletions and missing orthologous sequences were not penalized, and CNSs were not required to be present in all nine species. The parameters were refined relative to an 800,000-base sequence generated from concatenated fourfold-degenerate sites, within which a CNS FDR of <1% was required. Independent verifications based on evolutionary signatures of coding sequences using RNAcode⁹³, the absence of splice sites⁸⁸ and a uniform density of stop codons suggested that very few CNSs correspond to unannotated protein-coding exons.
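One possible reading of these thresholds is sketched below. The text does not spell out how high-scoring and low-scoring stretches are combined, so the logic here is an assumption made for illustration: stretches scoring above 0.82 for more than 7 nt act as seeds, and neighboring seeds are merged unless the gap between them contains more than 12 consecutive positions scoring below 0.55. Function names are hypothetical, and the exclusion of coding sequence is not modeled.

```python
def runs(flags):
    """Yield half-open (start, end) intervals of consecutive True values."""
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            yield start, i
            start = None
    if start is not None:
        yield start, len(flags)


def call_cns(scores, high=0.82, min_len=7, low=0.55, max_low_run=12):
    """Return candidate CNS intervals from per-base PhastCons scores.

    Assumed rule (not confirmed by the source): seeds are runs with
    score > high that are longer than min_len; adjacent seeds are merged
    when the gap between them has no run of more than max_low_run
    positions with score < low.
    """
    seeds = [(s, e) for s, e in runs([x > high for x in scores])
             if e - s > min_len]
    merged = []
    for s, e in seeds:
        if merged:
            gap = scores[merged[-1][1]:s]
            longest_low = max((b - a for a, b in runs([x < low for x in gap])),
                              default=0)
            if longest_low <= max_low_run:
                merged[-1] = (merged[-1][0], e)
                continue
        merged.append((s, e))
    return merged
```

Other readings are possible, for example trimming merged intervals at their flanks or scoring gaps differently, and would give different boundaries.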

Candidate CNSs were assigned a location category, and only those in the small noncoding and UTR classes were allowed to overlap expressed regions. CNSs that formed extensions of coding sequences (due to the PhastCons smoothing algorithm) or that overlapped other evidence of a potential coding role were rejected. Because UTR annotation, particularly the transition between coding sequences and UTRs, is not error free, we expect a slightly higher false positive rate for the UTR CNSs.

Motif enrichment.

The significance of the enrichment of motifs of 5 to 11 nt in length in CNSs was determined by a z score comparing the frequency of a motif's occurrence in all CNSs to the distribution of occurrences of all permutations of the motif sequence in CNSs. This approach was selected to account for substantial base bias at single- and multi-base levels between CNSs and surrounding promoter regions. Enriched motifs were then clustered with those at a minimal edit distance and combined into IUPAC and PWM representations. Motifs were characterized using the PLACE, TAIR and JASPAR databases. Enrichment of motifs in upstream, intronic, UTR and downstream CNSs relative to ontology groups was assessed using the GO-Slim ontology annotation of the immediately proximate gene.
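The permutation-based z score can be sketched as below. This is a simplified illustration, not the authors' code: function names are hypothetical, only the given strand is scanned, overlapping occurrences are counted, and the motif itself is included among its permutations. Enumerating permutations with itertools is practical only for short motifs; an 11-nt motif would need a direct enumeration of distinct letter arrangements.

```python
from itertools import permutations
from statistics import mean, pstdev


def count_occurrences(motif, seqs):
    """Count occurrences of motif across sequences, allowing overlaps."""
    total = 0
    for seq in seqs:
        i = seq.find(motif)
        while i != -1:
            total += 1
            i = seq.find(motif, i + 1)
    return total


def permutation_z(motif, seqs):
    """z score of the motif count against counts of all its distinct permutations.

    Permutations share the motif's base composition, so the comparison
    controls for single-base bias in the sequences.
    """
    distinct = {"".join(p) for p in permutations(motif)}
    counts = [count_occurrences(p, seqs) for p in distinct]
    spread = pstdev(counts)
    if spread == 0:
        return 0.0  # all permutations equally frequent: no enrichment signal
    return (count_occurrences(motif, seqs) - mean(counts)) / spread
```

For instance, permutation_z("ACGT", cns_sequences) would be positive when ACGT occurs more often in the hypothetical list cns_sequences than a typical rearrangement of the same four bases.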

Population genomics analyses.

Alignments for the genomes of 80 Eurasian A. thaliana plants⁵¹ against the TAIR9/TAIR10 reference were obtained from the 1001 Genomes Project (see URLs). For measures of diversity over specific regions, those locations with base calls in all 80 samples were used, whereas, for more general comparisons with comparative genomics data, calls in at least 40 samples were required. Diversity estimates from the genomes of a population of 13 Greek C. grandiflora plants (26 haplotypes) were generated by aligning Illumina paired-end data to the genome of its close relative C. rubella using the STAMPY-GATK pipeline⁹⁴. Again, 26 base calls were required for diversity estimates, and regions with extremes of sequence depth or low quality were excluded. Pipelines for population genomics analyses were developed using Perl and Python languages and Bio++ libraries⁹⁵.
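The complete-call requirement can be illustrated with a short sketch. The specific diversity statistic is an assumption here: average pairwise nucleotide differences per site is used as one common choice, and the function names, the "N" missing-data symbol and the per-site string representation are hypothetical.

```python
def site_pairwise_diversity(alleles):
    """Proportion of sample pairs that differ at one site."""
    n = len(alleles)
    counts = {}
    for allele in alleles:
        counts[allele] = counts.get(allele, 0) + 1
    total_pairs = n * (n - 1) // 2
    identical_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    return (total_pairs - identical_pairs) / total_pairs


def region_diversity(sites, n_required, missing="N"):
    """Mean per-site diversity over sites with a base call in every sample.

    sites      -- one string per site, one character per sampled haplotype
    n_required -- number of calls a site must have (e.g. 80 or 26 in the text)
    Returns None when no site passes the filter.
    """
    usable = [s for s in sites if len(s) == n_required and missing not in s]
    if not usable:
        return None
    return sum(site_pairwise_diversity(s) for s in usable) / len(usable)
```

Sites with any missing call are dropped rather than rescaled, which keeps the sample size identical across all sites contributing to a regional estimate.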

URLs.

RepeatMasker, http://www.repeatmasker.org/; RepeatModeler, http://www.repeatmasker.org/RepeatModeler.html; pybloomfaster, https://github.com/brentp/pybloomfaster; 1001 Genomes Project, http://1001genomes.org/data/MPI/MPICao2010/releases/.

Accession codes.

A. arabicum genome, PRJNA202984; S. irio genome, PRJNA202979; L. alabamica genome, PRJNA202983. All sequences, genome annotations, pairwise and multiple alignments, conservation scores and CNSs are available for visualization and download on a local installation of the UCSC Genome Browser at http://mustang.biol.mcgill.ca:8885.

References

Duret, L. & Bucher, P. Searching for regulatory elements in human noncoding sequences. Curr. Opin. Struct. Biol. 7, 399–406 (1997).
Boffelli, D. et al. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 299, 1391–1394 (2003).
Hong, R.L., Hamaguchi, L., Busch, M.A. & Weigel, D. Regulatory elements of the floral homeotic gene AGAMOUS identified by phylogenetic footprinting and shadowing. Plant Cell 15, 1296–1309 (2003).
Bejerano, G. et al. Ultraconserved elements in the human genome. Science 304, 1321–1325 (2004).
Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005).
Margulies, E.H. et al. Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. Genome Res. 17, 760–774 (2007).
Stark, A. et al. Discovery of functional elements in 12 Drosophila genomes using evolutionary signatures. Nature 450, 219–232 (2007).
Kellis, M., Patterson, N., Endrizzi, M., Birren, B. & Lander, E.S. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423, 241–254 (2003).
Adrian, J. et al. cis-Regulatory elements and chromatin state coordinately control temporal and spatial expression of FLOWERING LOCUS T in Arabidopsis. Plant Cell 22, 1425–1440 (2010).
Lyons, E. & Freeling, M. How to usefully compare homologous plant genes and chromosomes as DNA sequences. Plant J. 53, 661–673 (2008).
Freeling, M. & Subramaniam, S. Conserved noncoding sequences (CNSs) in higher plants. Curr. Opin. Plant Biol. 12, 126–132 (2009).
Zou, C. et al. Cis-regulatory code of stress-responsive transcription in Arabidopsis thaliana. Proc. Natl. Acad. Sci. USA 108, 14992–14997 (2011).
Hsieh, T.F. et al. Regulation of imprinted gene expression in Arabidopsis endosperm. Proc. Natl. Acad. Sci. USA 108, 1755–1762 (2011).
Inada, D.C. et al. Conserved noncoding sequences in the grasses. Genome Res. 13, 2030–2041 (2003).
Guo, H. & Moose, S.P. Conserved noncoding sequences among cultivated cereal genomes identify candidate regulatory sequence elements and patterns of promoter evolution. Plant Cell 15, 1143–1158 (2003).
Kaplinsky, N.J., Braun, D.M., Penterman, J., Goff, S.A. & Freeling, M. Utility and distribution of conserved noncoding sequences in the grasses. Proc. Natl. Acad. Sci. USA 99, 6147–6151 (2002).
Bossolini, E., Wicker, T., Knobel, P.A. & Keller, B. Comparison of orthologous loci from small grass genomes Brachypodium and rice: implications for wheat genomics and grass genome annotation. Plant J. 49, 704–717 (2007).
Colinas, J., Birnbaum, K. & Benfey, P.N. Using cauliflower to find conserved non-coding regions in Arabidopsis. Plant Physiol. 129, 451–454 (2002).
Haberer, G. et al. Large-scale cis-element detection by analysis of correlated expression and sequence conservation between Arabidopsis and Brassica oleracea. Plant Physiol. 142, 1589–1602 (2006).
Hupalo, D. & Kern, A.D. Conservation and functional element discovery in 20 angiosperm plant genomes. Mol. Biol. Evol. published online; 10.1093/molbev/mst082 (27 May 2013).
Reineke, A.R., Bornberg-Bauer, E. & Gu, J. Evolutionary divergence and limits of conserved non-coding sequence detection in plant genomes. Nucleic Acids Res. 39, 6029–6043 (2011).
Thomas, B.C., Rapaka, L., Lyons, E., Pedersen, B. & Freeling, M. Arabidopsis intragenomic conserved noncoding sequence. Proc. Natl. Acad. Sci. USA 104, 3348–3353 (2007).
Schranz, M.E., Lysak, M.A. & Mitchell-Olds, T. The ABC's of comparative genomics in the Brassicaceae: building blocks of crucifer genomes. Trends Plant Sci. 11, 535–542 (2006).
Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815 (2000).
Hu, T.T. et al. The Arabidopsis lyrata genome sequence and the basis of rapid genome size change. Nat. Genet. 43, 476–481 (2011).
Slotte, T. et al. The Capsella rubella genome and the genomic consequences of rapid mating system evolution. Nat. Genet. 45, 831–835 (2013).
Wang, X. et al. The genome of the mesopolyploid crop species Brassica rapa. Nat. Genet. 43, 1035–1039 (2011).
Cheng, F. et al. Deciphering the diploid ancestral genome of the mesohexaploid Brassica rapa. Plant Cell published online; 10.1105/tpc.113.110486 (7 May 2013).
Yang, R. et al. The reference genome of the halophytic plant Eutrema salsugineum. Front. Plant Sci. 4, 46 (2013).
Dassanayake, M. et al. The genome of the extremophile crucifer Thellungiella parvula. Nat. Genet. 43, 913–918 (2011).
Schranz, M.E., Song, B.H., Windsor, A.J. & Mitchell-Olds, T. Comparative genomics in the Brassicaceae: a family-wide perspective. Curr. Opin. Plant Biol. 10, 168–175 (2007).
Couvreur, T.L. et al. Molecular phylogenetics, temporal diversification, and principles of evolution in the mustard family (Brassicaceae). Mol. Biol. Evol. 27, 55–71 (2010).
Bowers, J.E., Chapman, B.A., Rong, J. & Paterson, A.H. Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature 422, 433–438 (2003).
Parra, G., Bradnam, K., Ning, Z., Keane, T. & Korf, I. Assessing the gene space in draft genomes. Nucleic Acids Res. 37, 289–297 (2009).
Lysak, M.A., Koch, M.A., Pecinka, A. & Schubert, I. Chromosome triplication found across the tribe Brassiceae. Genome Res. 15, 516–525 (2005).
Schnable, J.C., Wang, X., Pires, J.C. & Freeling, M. Escape from preferential retention following repeated whole genome duplications in plants. Front. Plant Sci. 3, 94 (2012).
Edger, P.P. & Pires, J.C. Gene and genome duplications: the impact of dosage-sensitivity on the fate of nuclear genes. Chromosome Res. 17, 699–717 (2009).
Ming, R. et al. The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus). Nature 452, 991–996 (2008).
Bailey, C.D. et al. Toward a global phylogeny of the Brassicaceae. Mol. Biol. Evol. 23, 2142–2160 (2006).
Thomas, J.W. et al. Comparative analyses of multi-species sequences from targeted genomic regions. Nature 424, 788–793 (2003).
Yang, Y.W., Lai, K.N., Tai, P.Y. & Li, W.H. Rates of nucleotide substitution in angiosperm mitochondrial DNA sequences and dates of divergence between Brassica and other angiosperm lineages. J. Mol. Evol. 48, 597–604 (1999).
Town, C.D. et al. Comparative genomics of Brassica oleracea and Arabidopsis thaliana reveal gene loss, fragmentation, and dispersal after polyploidy. Plant Cell 18, 1348–1359 (2006).
Yang, L. & Gaut, B.S. Factors that contribute to variation in evolutionary rate among Arabidopsis genes. Mol. Biol. Evol. 28, 2359–2369 (2011).
Ponting, C.P. & Hardison, R.C. What fraction of the human genome is functional? Genome Res. 21, 1769–1776 (2011).
Pollard, K.S., Hubisz, M.J., Rosenbloom, K.R. & Siepel, A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 20, 110–121 (2010).
Margulies, E.H. & Birney, E. Approaches to comparative sequence analysis: towards a functional view of vertebrate genomes. Nat. Rev. Genet. 9, 303–313 (2008).
Sorek, R. & Ast, G. Intronic sequences flanking alternatively spliced exons are conserved between human and mouse. Genome Res. 13, 1631–1637 (2003).
Halligan, D.L., Eyre-Walker, A., Andolfatto, P. & Keightley, P.D. Patterns of evolutionary constraints in intronic and intergenic DNA of Drosophila. Genome Res. 14, 273–279 (2004).
Halligan, D.L. & Keightley, P.D. Ubiquitous selective constraints in the Drosophila genome revealed by a genome-wide interspecies comparison. Genome Res. 16, 875–884 (2006).
Blanchette, M. et al. Genome-wide computational prediction of transcriptional regulatory modules reveals new insights into human gene expression. Genome Res. 16, 656–668 (2006).
Cao, J. et al. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat. Genet. 43, 956–963 (2011).
Nei, M. & Li, W.H. Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc. Natl. Acad. Sci. USA 76, 5269–5273 (1979).
Watterson, G.A. On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7, 256–276 (1975).
Katzman, S. et al. Human genome ultraconserved elements are ultraselected. Science 317, 915 (2007).
Casillas, S., Barbadilla, A. & Bergman, C.M. Purifying selection maintains highly conserved noncoding sequences in Drosophila. Mol. Biol. Evol. 24, 2222–2234 (2007).
Feldman, M. & Levy, A.A. Allopolyploidy—a shaping force in the evolution of wheat genomes. Cytogenet. Genome Res. 109, 250–258 (2005).
Thomas, B.C., Pedersen, B. & Freeling, M. Following tetraploidy in an Arabidopsis ancestor, genes were removed preferentially from one homeolog leaving clusters enriched in dose-sensitive genes. Genome Res. 16, 934–946 (2006).
Luo, F., Liu, J. & Li, J. Discovering conditional co-regulated protein complexes by integrating diverse data sources. BMC Syst. Biol. 4 (suppl. 2), S4 (2010).
Muiño, J.M., Hoogstraat, M., van Ham, R.C. & van Dijk, A.D. PRI-CAT: a web-tool for the analysis, storage and visualization of plant ChIP-seq experiments. Nucleic Acids Res. 39, W524–W527 (2011).
Zhang, W., Zhang, T., Wu, Y. & Jiang, J. Genome-wide identification of regulatory DNA elements and protein-binding footprints using signatures of open chromatin in Arabidopsis. Plant Cell 24, 2719–2731 (2012).
Kim, J. et al. microRNA-directed cleavage of ATHB15 mRNA regulates vascular development in Arabidopsis inflorescence stems. Plant J. 42, 84–94 (2005).
Nogueira, F.T. et al. Regulation of small RNA accumulation in the maize shoot apex. PLoS Genet. 5, e1000320 (2009).
Nogueira, F.T., Madi, S., Chitwood, D.H., Juarez, M.T. & Timmermans, M.C. Two small regulatory RNAs establish opposing fates of a developmental axis. Genes Dev. 21, 750–755 (2007).
Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29 (2000).
Lyons, E. et al. Finding and comparing syntenic regions among Arabidopsis and the outgroups papaya, poplar, and grape: CoGe with rosids. Plant Physiol. 148, 1772–1781 (2008).
Ossowski, S. et al. The rate and molecular spectrum of spontaneous mutations in Arabidopsis thaliana. Science 327, 92–94 (2010).
Schultz, S.T., Lynch, M. & Willis, J.H. Spontaneous deleterious mutation in Arabidopsis thaliana. Proc. Natl. Acad. Sci. USA 96, 11393–11398 (1999).
Ahituv, N., Prabhakar, S., Poulin, F., Rubin, E.M. & Couronne, O. Mapping cis-regulatory domains in the human genome using multi-species conservation of synteny. Hum. Mol. Genet. 14, 3057–3063 (2005).
Lockton, S. & Gaut, B.S. Plant conserved non-coding sequences and paralogue evolution. Trends Genet. 21, 60–65 (2005).
Margulies, E.H., Blanchette, M., Haussler, D. & Green, E.D. Identification and characterization of multi-species conserved sequences. Genome Res. 13, 2507–2518 (2003).
Hong, X., Scofield, D.G. & Lynch, M. Intron size, abundance, and distribution within untranslated regions of genes. Mol. Biol. Evol. 23, 2392–2404 (2006).
ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Guindon, S., Lethiec, F., Duroux, P. & Gascuel, O. PHYML Online—a web server for fast maximum likelihood–based phylogenetic inference. Nucleic Acids Res. 33, W557–W559 (2005).
Higo, K., Ugawa, Y., Iwamoto, M. & Korenaga, T. Plant cis-acting regulatory DNA elements (PLACE) database: 1999. Nucleic Acids Res. 27, 297–300 (1999).
Lamesch, P. et al. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res. 40, D1202–D1210 (2012).
Johnston, J.S. et al. Evolution of genome size in Brassicaceae. Ann. Bot. (Lond.) 95, 229–235 (2005).
Lysak, M.A., Koch, M.A., Beaulieu, J.M., Meister, A. & Leitch, I.J. The dynamic ups and downs of genome size evolution in Brassicaceae. Mol. Biol. Evol. 26, 85–98 (2009).
Boisvert, S., Laviolette, F. & Corbeil, J. Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. J. Comput. Biol. 17, 1519–1533 (2010).
Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20, 265–272 (2010).
Chantha, S.C., Herman, A.C., Platts, A.E., Vekemans, X. & Schoen, D.J. Secondary evolution of a self-incompatibility locus in the brassicaceae genus leavenworthia. PLoS Biol. 11, e1001560 (2013).
Lysak, M.A. & Mandáková, T. Analysis of plant meiotic chromosomes by chromosome painting. Methods Mol. Biol. 990, 13–24 (2013).
Cantarel, B.L. et al. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 18, 188–196 (2008).
Salamov, A.A. & Solovyev, V.V. Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10, 516–522 (2000).
Stanke, M., Diekhans, M., Baertsch, R. & Haussler, D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637–644 (2008).
Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004).
Kent, W.J. BLAT—the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002).
Gan, X. et al. Multiple reference genomes and transcriptomes for Arabidopsis thaliana. Nature 477, 419–423 (2011).
Au, K.F., Jiang, H., Lin, L., Xing, Y. & Wong, W.H. Detection of splice junctions from paired-end RNA-seq data by SpliceMap. Nucleic Acids Res. 38, 4570–4578 (2010).
Harris, R.S. Improved Pairwise Alignment of Genomic DNA. PhD thesis, Penn. State Univ. (2007).
Kent, W.J., Baertsch, R., Hinrichs, A., Miller, W. & Haussler, D. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc. Natl. Acad. Sci. USA 100, 11484–11489 (2003).
Blanchette, M. et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14, 708–715 (2004).
Waterston, R.H. et al. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562 (2002).
Washietl, S. et al. RNAcode: robust discrimination of coding and noncoding regions in comparative sequence data. RNA 17, 578–594 (2011).
Lunter, G. & Goodson, M. Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Res. 21, 936–939 (2011).
Dutheil, J. et al. Bio++: a set of C++ libraries for sequence analysis, phylogenetics, molecular evolution and population genetics. BMC Bioinformatics 7, 188 (2006).

Acknowledgements

We would like to thank the US Department of Energy Joint Genome Institute (for the C. rubella genome sequence, produced under a Community Sequencing Program (CSP) proposal submitted by D. Weigel and colleagues, and the E. salsugineum genome sequence, produced under a CSP proposal submitted by K. Schumaker, R. Wing and T. Mitchell-Olds) and R. Clark (for A. lyrata mRNA sequencing). We also thank S.-C. Chantha for assistance with genome sequencing in L. alabamica, S. Joly for suggestions on the genomic DNA isolation protocol and D. Scofield for helpful discussions about intron-exon structure. We thank M. Freeling, D. Weigel, E. Harmsen and I. Lacroix for comments on the manuscript. This project was funded by a Genome Canada/Génome Québec grant to T.E.B., S.I.W., M.B., J.S., A.M.M., D.J.S. and P.M.H. In addition, T.M. and M.A.L. were supported by the European Regional Development Fund (CZ.1.05/1.1.00/02.0068) and by the Czech Science Foundation (excellence cluster P501/12/G090). J.G.S. was supported by National Science Foundation (NSF) award 0929262. M.E.S. and E.v.d.B. were supported by a Vidi grant from the Netherlands Organisation for Scientific Research.

Author information

Annabelle Haudry and Adrian E Platts: These authors contributed equally to this work.

Authors and Affiliations

Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, Ontario, Canada
Annabelle Haudry, Robert J Williamson, Khaled M Hazzouri, John R Stinchcombe, Alan M Moses & Stephen I Wright
Université Lyon 1, Centre National de la Recherche Scientifique (CNRS), Unité Mixte de Recherche (UMR) 5558, Laboratoire de Biométrie et Biologie Evolutive, Villeurbanne, France
Annabelle Haudry
School of Computer Science, McGill University, Montreal, Quebec, Canada
Adrian E Platts, Emilio Vello, Mickael Leclercq & Mathieu Blanchette
McGill Centre for Bioinformatics, McGill University, Montreal, Quebec, Canada
Adrian E Platts, Emilio Vello, Mickael Leclercq & Mathieu Blanchette
Department of Biology, McGill University, Montreal, Quebec, Canada
Douglas R Hoen, Ewa Forczek, Zoé Joly-Lopez, Daniel J Schoen, Paul M Harrison & Thomas E Bureau
Nature Sciences Department, Colby-Sawyer College, New London, New Hampshire, USA
Joshua G Steffen
Department of Human Genetics, McGill University, Montreal, Quebec, Canada
Ken Dewar
Institute of Vegetables and Flowers (IVF), Chinese Academy of Agricultural Sciences (CAAS), Beijing, China
Xiaowu Wang
US Department of Energy Joint Genome Institute, Walnut Creek, California, USA
Jeremy Schmutz
HudsonAlpha Institute of Biotechnology, Huntsville, Alabama, USA
Jeremy Schmutz
J. Craig Venter Institute, Rockville, Maryland, USA
Christopher D Town
Division of Biological Sciences, University of Missouri, Columbia, Missouri, USA
Patrick P Edger & J Chris Pires
The School of Plant Sciences, University of Arizona, Tucson, Arizona, USA
Karen S Schumaker & David E Jarvis
Plant Cytogenomics, Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czech Republic
Terezie Mandáková & Martin A Lysak
Biosystematics Group, Plant Sciences, Wageningen University, Wageningen, The Netherlands
Erik van den Bergh & M Eric Schranz
Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, Ontario, Canada
Stephen I Wright

Contributions

The study was conceived by M.B., S.I.W., A.M.M., T.E.B., D.J.S., P.M.H. and J.R.S. Computational experiments were designed by A.H., A.E.P., A.M.M., S.I.W., T.E.B. and M.B. E.F., Z.J.-L., J.C.P., M.E.S., D.J.S. and T.E.B. obtained material for genome sequencing of L. alabamica, S. irio and A. arabicum. A.E.P., K.D. and T.E.B. sequenced the DNA, and A.E.P. assembled the genomes, using additional data provided by C.D.T., P.P.E., M.E.S., E.v.d.B. and J.C.P. Additional RNA sequencing data were obtained from J.G.S.; B. rapa genome sequence data were provided by X.W.; and E. salsugineum genome data were provided by J.S., D.E.J. and K.S.S. T.M. and M.A.L. performed the multicolor FISH study on L. alabamica. P.M.H. and A.E.P. performed the gene annotation, D.R.H. and T.E.B. annotated TEs, and M.L. identified structural RNAs. Multiple-genome alignments and the identification and analysis of CNSs were performed by A.E.P., A.H., E.V. and M.B. Population genetics analyses were performed by A.H., A.E.P., R.J.W., K.M.H., A.M.M. and S.I.W. The manuscript was written primarily by A.H., A.E.P., S.I.W. and M.B., with input from all coauthors.

Corresponding authors

Correspondence to Alan M Moses, Thomas E Bureau, Stephen I Wright or Mathieu Blanchette.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–11, Supplementary Tables 1–9 and Supplementary Note (PDF 1237 kb)

Rights and permissions

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/.

About this article

Cite this article

Haudry, A., Platts, A., Vello, E. et al. An atlas of over 90,000 conserved noncoding sequences provides insight into crucifer regulatory regions. Nat Genet 45, 891–898 (2013). https://doi.org/10.1038/ng.2684

Received: 13 October 2012
Accepted: 04 June 2013
Published: 30 June 2013
Issue Date: August 2013
DOI: https://doi.org/10.1038/ng.2684

This article is cited by

Large-scale gene expression alterations introduced by structural variation drive morphotype diversification in Brassica oleracea
- Xing Li
- Yong Wang
- Feng Cheng
Nature Genetics (2024)

A high-quality chromosome-level Eutrema salsugineum genome, an extremophile plant model
- Meng Xiao
- Guoqian Hao
- Quanjun Hu
BMC Genomics (2023)

Maternal dominance contributes to subgenome differentiation in allopolyploid fishes
- Min-Rui-Xuan Xu
- Zhen-Yang Liao
- Hua-Hao Zhang
Nature Communications (2023)

Genome evolution and diversity of wild and cultivated potatoes
- Dié Tang
- Yuxin Jia
- Sanwen Huang
Nature (2022)

Brassinosteroid-induced gene repression requires specific and tight promoter binding of BIL1/BZR1 via DNA shape readout
- Shohei Nosaki
- Nobutaka Mitsuda
- Takuya Miyakawa
Nature Plants (2022)
