Introduction

Neighborhood conservation of gene arrangement has been found in various bacteria1 and eukaryotic organisms2,3,4,5,6,7,8 in studies of specific species or groups of genes. Gene direction is important for gene arrangement and function. Since random arrangement of a large number of genes along the chromosomes can theoretically generate a multiplicity of gene direction orders, a statistical test of gene direction randomness is required. To the best of our knowledge, however, there are no literature reports on algorithms suitable for testing gene direction randomness, likely because of the lack of readily available algorithms for testing whether a series of two numbers or two letters (e.g., 1 for forward, 2 for backward) is random. Research is needed to develop a statistical algorithm to test gene direction randomness and to analyze many genomes for general information on gene direction distribution.

Genes with similar function or coordinated expression seem to be clustered in sequenced genomes3. Furthermore, the order of transcriptionally and functionally linked genes was found to be conserved in some eukaryotes, in a study using various analysis methods, including protein sequence BLAST searches, gene ontology assignments and phylogenetic tree reconstruction4. It has been proposed that the range over which DNA neighborhood optimizes biochemical interactions might therefore be defined by DNA topology1. Recently, the notion that expression neighborhoods are a feature of eukaryotic genome organization necessary for correct gene expression was publicly challenged because a targeted separation of one well-defined gene expression neighborhood in the Drosophila genome did not significantly alter gene expression9. Since gene direction order is an important aspect of gene order and architecture, an analysis of gene direction in a large number of genomes may provide insights into whether gene neighborhoods are random or are likely the result of selection and inheritance.

In this study, we developed a statistical approach to test the significance of gene direction order. Since an intergenic region can have four possible configurations, that is, FF, BB, FB and BF, where F denotes forward gene direction and B backward gene direction (i.e., on the complementary strand), the probability of occurrence of these four types of intergenic regions should be approximately equal if gene order on the annotated DNA sequence of a chromosome is random. The chi-square test approach can test the randomness of these four configurations. We tested the randomness of the direction of annotated genes on chromosomes (GenBank full version files; see Tables S1–S7 for sequence ID list) of all or nearly all complete and annotated genomes of bacteria, archaeans, protists, fungi, plants and animals available in NCBI GenBank (http://www.ncbi.nlm.nih.gov/) and present the findings below.
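The interval classification and the four-category test described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the direction string "FFFBBF" is made-up input, and the expected count for each configuration is assumed to be one quarter of all intervals, following the equal-probability assumption stated in the text. The p-value uses the closed-form survival function of the chi-square distribution with 3 degrees of freedom.

```python
import math
from collections import Counter

CONFIGS = ("FF", "BB", "FB", "BF")

def interval_counts(directions):
    """Count the four interval types between adjacent genes.

    `directions` is a string such as "FFBF", one symbol per gene
    in chromosomal order (F = forward, B = backward/complement).
    """
    pairs = Counter(a + b for a, b in zip(directions, directions[1:]))
    return {c: pairs.get(c, 0) for c in CONFIGS}

def chi_square_equal(counts):
    """Chi-square statistic against equal expected counts (assumption)."""
    observed = [counts[c] for c in CONFIGS]
    total = sum(observed)
    if total == 0:
        raise ValueError("at least two genes are needed to form an interval")
    expected = total / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

def p_value_df3(chi2):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return (math.erfc(math.sqrt(chi2 / 2))
            + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2))

# Hypothetical six-gene chromosome, for illustration only.
counts = interval_counts("FFFBBF")
chi2 = chi_square_equal(counts)
print(counts, round(chi2, 3), round(p_value_df3(chi2), 3))
```

With real chromosomes the direction string would contain hundreds or thousands of symbols; a six-gene example is far too short for the chi-square approximation to be reliable and serves only to show the bookkeeping.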

Results

Gene direction was not statistically random in any of the 63 archaean ( Supplementary Table S1 ), 631 bacterial ( Supplementary Table S2 ), 9 protist-protozoal species ( Supplementary Table S3 ) and a total of 1,127 genomes analyzed ( Table 1 ). Archaea and bacteria have only chromosomes with non-random gene direction, whereas a majority of the fungal, chlorophyte protist, plant and animal species have chromosomes with both random and non-random gene direction ( Table 1 ).

Table 1 Summary of gene direction arrangement, inferred from gene interval distribution, at the species and chromosomal (chr) levels

All of the analyzed genomes of archaea, bacteria, protista and protozoa have a greater number of same-direction gene pairs than opposite-direction pairs; in other words, neighboring genes in these genomes mostly share the same direction. The same/opposite gene direction ratios of chromosomes are approximately 2.74, 2.00 and 46.20, on average, among the bacteria, archaeans and protozoa, respectively ( Table 2 ). In the protozoan species, 75% of the intervals have genes in the same direction, either on the forward strand or the complementary strand, whereas, interestingly, the majority of the genes in the Chlorophyta and fungal species are in the opposite direction ( Table 2 ).

Table 2 Gene direction arrangement, inferred from gene interval distribution, in different kingdoms

The largest string of same-direction genes, consisting of 391 genes, was found on the complementary strand of Leishmania infantum chromosome 31 (NC_009415); there were only two genes in the forward direction ( Supplementary Table S3 ). The second largest string, comprising 371 genes, was found on the forward strand on chromosome 26 (NC_007267) of Leishmania major strain Friedlin ( Supplementary Table S3 ).
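A string of same-direction genes such as those reported above corresponds to the longest run of identical symbols in a chromosome's direction sequence. The sketch below shows one way to find it; the function name and the input string are hypothetical, and the source does not state how the authors located these runs.

```python
from itertools import groupby

def longest_same_direction_run(directions):
    """Return (direction, length) of the longest run of identical symbols.

    `directions` is a string of F/B symbols in chromosomal order.
    If two runs tie, the first one encountered is returned.
    """
    best = ("", 0)
    for symbol, group in groupby(directions):
        length = sum(1 for _ in group)
        if length > best[1]:
            best = (symbol, length)
    return best

# Hypothetical direction sequence, for illustration only.
print(longest_same_direction_run("FFBBBF"))
```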

An extreme case of opposite-direction genes was found in Ostreococcus lucimarinus CCE9901 and Micromonas sp. RCC299, two species of Chlorophyta (protists), an early-diverging photosynthetic class within the green plant lineage ( Supplementary Table S4 ). The majority (68%) of gene pairs in these two species exhibit opposite direction, either FB or BF ( Table 1 ). Each of the 21 O. lucimarinus chromosomes and the 17 Micromonas sp. RCC299 chromosomes had fewer same-direction gene pairs than opposite-direction pairs. This stands in contrast with the situation for Leishmania species.

Four species of fungi (Debaryomyces hansenii, Encephalitozoon cuniculi, Saccharomyces cerevisiae and Encephalitozoon intestinalis) have only chromosomes with randomly distributed gene direction ( Supplementary Table S5 ). Another 17 species have fewer same-direction gene pairs than opposite-direction pairs; this is also the case for the average of these 21 fungal species ( Table 1 ).

All the fungal chromosomes, except chromosome VII of Ashbya gossypii (ATCC 10895; NC_005788.4), have more opposite-direction gene neighbors ( Supplementary Table S5 ). It is interesting that Ashbya gossypii has the smallest eukaryote genome and was used as a tool for mapping the ancient Saccharomyces cerevisiae genome10. Since fungi are less primitive than bacteria, in which same-direction gene pairs dominate, this may indicate that fungi are further along in the progression towards opposite-direction dominance.

In plants, Arabidopsis (Arabidopsis thaliana), rice (Oryza sativa ssp. Japonica), poplar (Populus trichocarpa) and sorghum (Sorghum bicolor) have significantly more same-direction gene pairs than opposite-direction ones. However, a diploid yellow-flowered alfalfa (Medicago truncatula) was found to have fewer same-direction genes than opposite-direction ones with a (FF+BB)/(FB+BF) ratio, also called the same/opposite ratio, of 0.98 ( Supplementary Table S6 ). Overall, for these plants, the (FF+BB)/(FB+BF) ratio per chromosome is 1.15, which means there are more same-direction than opposite-direction genes ( Table 2 ).

For animal species, on average, there are statistically more same-direction gene pairs than opposite-direction pairs, but the difference is quite slim with a (FF+BB)/(FB+BF) ratio of 1.07 ( Table 2 ). Among the animal genomes analyzed, the genomes of Caenorhabditis elegans (nematode) and Drosophila melanogaster (fruit fly) have been completely sequenced and annotated. Each of the five C. elegans chromosomes has a greater proportion of same-direction gene pairs and the same/opposite ratio is 1.15 on average ( Table 2 ). In D. melanogaster, chromosome 2R gave a similar result in terms of same/opposite direction, but all the other chromosomes showed significantly fewer same-direction than opposite-direction genes ( Supplementary Table S7 ).

The kingdoms showed clear-cut differences in terms of the same/opposite ratio (FF+BB)/(FB+BF) on chromosomes. More same direction gene pairs than opposite direction ones at the chromosomal level occurred in all 78 archaean chromosomes (genomes), 898 bacterial chromosomes (genomes), in 17.68% of the fungal chromosomes, 86.79% of the plant chromosomes, 85.33% of protozoan chromosomes and 52.11% of the animal chromosomes ( Table 1 ). None of the 38 protista-Chlorophyta chromosomes showed this dominance ( Table 1 ).
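The two summary quantities used in this comparison, the same/opposite ratio (FF+BB)/(FB+BF) of a chromosome and the percentage of chromosomes with more same-direction than opposite-direction pairs, can be computed as sketched below. The function names and all interval counts are hypothetical and serve only to illustrate the arithmetic.

```python
def same_opposite_ratio(counts):
    """(FF+BB)/(FB+BF) for one chromosome; infinite if no opposite pairs."""
    same = counts["FF"] + counts["BB"]
    opposite = counts["FB"] + counts["BF"]
    return same / opposite if opposite else float("inf")

def percent_same_dominant(chromosomes):
    """Percentage of chromosomes with more same- than opposite-direction pairs."""
    dominant = sum(
        1 for c in chromosomes
        if c["FF"] + c["BB"] > c["FB"] + c["BF"]
    )
    return 100.0 * dominant / len(chromosomes)

# Hypothetical interval counts for four chromosomes.
chromosomes = [
    {"FF": 30, "BB": 25, "FB": 28, "BF": 22},
    {"FF": 10, "BB": 10, "FB": 15, "BF": 15},
    {"FF": 40, "BB": 40, "FB": 20, "BF": 20},
    {"FF": 5, "BB": 5, "FB": 5, "BF": 5},
]
print([round(same_opposite_ratio(c), 2) for c in chromosomes])
print(percent_same_dominant(chromosomes))
```

A chromosome with equal numbers of same- and opposite-direction pairs (ratio exactly 1) is not counted as same-direction dominant in this sketch.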

Overall, 99% of the species (741 out of 747) have at least one chromosome on which gene direction is not random ( Table 1 ). However, it is worth noting that some species (i.e., 4 fungi) are characterized by a random order of gene direction in the annotated sequences of every chromosome in their genomes ( Table 1 ). In some species, such as alfalfa (Medicago truncatula) ( Supplementary Table S6 ), zebra finch (Taeniopygia guttata), chimpanzee (Pan troglodytes) and humans ( Supplementary Table S7 ), most chromosomes exhibit random gene direction in terms of gene pair configurations.

Discussion

In this study, we examined gene direction randomness at the whole chromosome level; therefore, we cannot rule out that regional non-random islands exist on the random gene-direction chromosomes. Similarly, chromosomes with non-random gene direction can be expected to have regions with random gene direction.

Most plants and some animals have more same-direction gene pairs than opposite-direction gene pairs in their genomes. In view of the fact that some lower kingdoms such as the Fungi and the Protista (protozoa) have already lost same-direction gene dominance, the maintenance of the statistical dominance of same-direction genes in these lower and higher organisms (i.e., fungi, protozoa, plants and some animals) must be attributed to functional advantages. This may correspond to the evolutionary conservation of non-randomness of gene neighborhoods which was reported previously6. The data suggest that the tendency in animals is towards randomness of gene direction at the chromosome level. There must be an unknown mechanism in these animal genomes to ensure animal fitness after members of previously defined gene blocks get physically split off. This might explain the observation that neighborhood continuity is not required for correct testis gene expression in Drosophila9.

The non-random, same-direction dominance of genes likely originated from the last common ancestor of the living organisms analyzed in this study. This hypothesis is supported by the non-random, same-direction dominance of genes found in the most primitive species (63 archaean species, 631 bacterial species), an evolutionary middle-level species (Ashbya gossypii, a progenitor fungus) and most higher organisms (plants and some animals). This non-randomness was likely strengthened in archaea, bacteria and some protist species, notably in Mycoplasma suis (same/opposite = 6.95) and Leishmania infantum (391 genes in tandem), but weakened in many other species, including fungi, chlorophytes and some plants and animals.

Extra attention should be given to the interpretation of statistical randomness and its biological meaning. This is because the statistical test is based on the annotated DNA sequence, which is given in GenBank in the format of a single strand (from 5′ to 3′), while its gene location and direction information represents both DNA strands. The gene direction annotation on the DNA sequence is similar to combining all the signs from both sides (parallel but with opposite traffic flows) of a highway. The same-direction dominance of genes in this study has the same meaning as “non-randomness of gene direction” in the literature. The opposite-direction dominance detected in this study is statistically non-random on the annotated genome sequences but is opposite to the conventional meaning of non-random gene direction; it is actually nearly equivalent to the extreme case of conventional randomness. Our interpretation is that the opposite-direction dominance is likely created by nearly random use of both DNA strands.

The model we propose here can be used to explain how gene direction evolved from same-direction dominance to opposite-direction dominance in some species, such as Chlorophyta and fungal species. Same-direction dominance was likely needed in the earliest life forms to maximize the use of the limited DNA/RNA sequences. As genome size increased, some species and some gene regions developed random gene direction because the opportunity for gene mutation and diversity could complement species' functional needs. The nearly equal or random use of both strands created new advantages for certain species, and therefore selection for opposite-direction dominance occurred in some species, allowing both strands to have an approximately equal distribution of genes. The chromosomes whose annotated sequences have statistically random gene directions are likely at interim stages on the way toward opposite-direction dominance. Unknown in trans mechanisms must exist which ensure that functionally related genes work together effectively in these species after the same-direction dominance is lost. Such mechanisms play a greater role in animals than in plants. Although neighborhood continuity is clearly needed to a certain degree, we predict that there will be a trend toward less same-direction dominance and greater opposite-direction dominance in higher organisms, particularly animals, in the future.

In brief, the results of this analysis of the completely sequenced genomes suggest the following. The same-direction dominance of genes is likely derived from the common ancestor of these living organisms. This dominance is further strengthened in archaeans, bacteria and some protozoa-protists, but weakened in fungi, Chlorophyta and some plants and animals, likely owing to the increase in genome size and the opportunity to use both strands. Gene direction experienced a V-shaped evolution. One branch runs from moderately non-random to extremely non-random; in this branch, genes are mainly located on one DNA strand. The other branch runs from moderately non-random to mainly random; in this second branch, gene locations evolved from one DNA strand to a nearly random distribution between both strands. There is an evolutionary shift in gene direction, from predominantly same-direction to opposite-direction or to an approximately equal number of same- and opposite-direction pairs, during the evolution of more complex species. Functional neighborhood continuity will likely be conserved to a certain degree, but the future trend is likely to be toward increasing opposite-direction dominance and decreasing same-direction dominance in most animals. This study expands current knowledge of the genomes of living organisms and may increase understanding of gene regulation in existing species as well as provide useful insights for designing synthetic genomes.

Methods

Most genomes were downloaded from the NCBI GenBank FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/). Plant genomes were individually searched and downloaded from http://www.ncbi.nlm.nih.gov/nuccore/ and the genome browser website http://www.ncbi.nlm.nih.gov/genome/browse/. For the protist Micromonas sp. RCC299 genome, there were two series of IDs; only the series with names beginning with CP was used in the final analysis because the other series, beginning with NC_, consisted of unpublished versions and gave analysis output identical to that of the CP series. For archaea, bacteria, fungi and protists, only completed genomes were used. Plant genomes were analyzed if they had complete genomes or pseudomolecules available. The genomes of humans and the majority of the widely studied animals, including rat, mouse, chimpanzee, monkey and dog, were not complete but had large scaffolds. Therefore, scaffolds of animal chromosomes larger than 0.5 Mb were also analyzed as long as they had a clear indication of chromosome number and the sequences were unique. The GenBank files (GBK, GB, or GBS) of the chromosomes/scaffolds were used if they had clear annotation of gene and coding region locations. The direction and location of the genes on each chromosome were recorded. The types of gene intervals (i.e., intergenic regions) were determined by directly neighboring genes and classified as FF, BB, FB and BF, where F is for forward and B is for backward or complementary strand. A chi-square test was employed to test whether the four types of intervals were random and to test whether the numbers of same-direction gene pairs (FF and BB) and opposite-direction gene pairs (FB and BF) were statistically equal. The tallies also included the total number of species analyzed, the species with only random gene direction chromosomes, the species having both random and non-random gene direction chromosomes, the same/opposite ratio of genes on the chromosomes and the percentage of chromosomes that had more same- than opposite-direction genes.
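The direction-extraction step and the same- versus opposite-direction test described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes that directions are read from `gene` features in the GenBank flat-file feature table (the source does not say whether gene or CDS features were used), that a location beginning with `complement(` marks a backward gene, and that genes are ordered by their first listed coordinate. The function names and the sample records are hypothetical. The p-value uses the closed-form survival function of the chi-square distribution with 1 degree of freedom.

```python
import math
import re

# A feature-table line: five spaces, the feature key, then the location.
GENE_LINE = re.compile(r"^ {5}gene\s+(\S+)")

def gene_directions(genbank_text):
    """Return a string of F/B symbols, one per gene, in chromosomal order."""
    genes = []
    for line in genbank_text.splitlines():
        match = GENE_LINE.match(line)
        if not match:
            continue
        location = match.group(1)
        start = re.search(r"\d+", location)
        if start is None:
            continue
        # Assumption: complement(...) means the gene lies on the other strand.
        direction = "B" if location.startswith("complement(") else "F"
        genes.append((int(start.group()), direction))
    genes.sort()
    return "".join(direction for _, direction in genes)

def same_vs_opposite_test(directions):
    """Chi-square test (1 df) of same- vs opposite-direction pair counts."""
    if len(directions) < 2:
        raise ValueError("at least two genes are needed to form an interval")
    same = sum(a == b for a, b in zip(directions, directions[1:]))
    opposite = len(directions) - 1 - same
    expected = (same + opposite) / 2  # equal counts under the null hypothesis
    chi2 = ((same - expected) ** 2 + (opposite - expected) ** 2) / expected
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return same, opposite, chi2, p_value

# Hypothetical feature-table excerpt, for illustration only.
pad = " " * 5 + "gene" + " " * 12
sample = "\n".join([
    pad + "190..255",
    pad + "complement(337..2799)",
    pad + "complement(2801..3733)",
    pad + "3734..5020",
])
directions = gene_directions(sample)
print(directions, same_vs_opposite_test(directions))
```

A dedicated parser such as Biopython's `SeqIO` would handle multi-line locations and other edge cases more robustly; the regular expression here is kept deliberately simple to show the idea.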