Introduction

The extraordinary advances in speed and throughput of sequencing technologies in the past decade have generated an unprecedented wealth of complete or near complete genome sequences and have also allowed the emergence of the technology of metagenomics or random community genomics, which aims at sequencing DNA from environmental microbial communities without culturing or isolating individual microbes. Today thousands of fully sequenced genomes and over 7,000 metagenomes have been deposited in public repositories, e.g., GenBank1, Genomes Online Database (GOLD)2, the SEED database3 and the Metagenomics RAST (MG-RAST) server4. To be annotated and analyzed, metagenome sequences are compared to genes, proteins, protein domains, protein families and genomes in known databases. It was shown a few years ago that approximately 19 hours were needed to analyze one megabase of DNA sequence (if linear compute complexity is assumed) and each data set required about one month of computing time (Unpublished data, Edwards, R. A., 2008). However, MG-RAST and other public services handle the analysis by using large compute clusters dedicated to sequence searching. Because of the deluge of sequence data, new efficient tools and methods are required for analyzing and comparing sequences and for prioritizing the sequences to be analyzed when comprehensive analysis is not feasible.

One approach to prioritizing the analysis of unknown genomic or metagenomic sequences is examining the information content of known genes, proteins and genomes to explore possible patterns or trends that might help in predicting informative sequences, i.e., those sequences likely to encode proteins or to provide new rather than redundant knowledge about the sample to which they belong. In the cell, information flows from DNA to amino acid sequences, as DNA is transcribed into RNA, which is then translated into the amino acid sequences of proteins. Depending on the different combinations of bases in the deoxyribonucleotides of the DNA sequence, different amino acids are added to the nascent, growing polypeptide chain. Complex proteins consist of different combinations of amino acids and therefore are encoded by various combinations of the four sequence bases. Homopolymeric tracts like AAAAAAAAC or TTTTTTTCCCCC can code for only one or a few different amino acids and therefore encode proteins with amino acid repeats. We therefore hypothesize that such tracts are much less likely to encode functional proteins than DNA containing equimolar mixtures of bases (e.g., AGCTAGCTAGCT).

Statistical approaches derived from information theory can quantify the amount of information in a DNA sequence. Several investigators have examined different aspects of information content of genomes, including Shannon's uncertainty5,6,7,8,9 and symmetry10,11. For example, Chang and coworkers calculated Shannon's uncertainty index for all the complete prokaryotic and eukaryotic genomes available in 2005. They found that Shannon's information in complete genomes is greater than that in matching random sequences and they described a coarse-grain model for genome growth and evolution that allows a genome to diverge at any stage during its growth6,7.

Shannon's uncertainty12,13,14 was originally designed for encoding and decoding data transmitted and received through a digital communication system. Since sequence data can also be represented as a system where DNA is transformed into amino acids, this theory can be used to calculate the amount of information or uncertainty of a sequence. For each sequence, the uncertainty measurement per base pair generates a score from 0 to 2n (i.e., 2 × n), where n is the word length. The greater the uncertainty, the more even the distribution of words. For example, the sequence AAAA can only be read using two-letter words as AA, regardless of the register, and has little uncertainty. In contrast, the sequence ACGT can be read as AC, CG and GT, depending on the register, and has more inherent uncertainty and information.
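The AAAA versus ACGT comparison can be reproduced with a minimal Python sketch. This is an illustration rather than the implementation used in this study: the function name `shannon_uncertainty` is hypothetical, and counting words in every register (overlapping windows) with a base-2 logarithm is an assumption.

```python
from collections import Counter
from math import log2


def shannon_uncertainty(seq: str, n: int = 1) -> float:
    """Shannon's uncertainty (in bits) of the length-n words in seq.

    Assumption: words are counted in every register (overlapping windows).
    """
    words = [seq[i:i + n] for i in range(len(seq) - n + 1)]
    if not words:
        return 0.0  # sequence shorter than the word length
    counts = Counter(words)
    total = len(words)
    # p * log2(1/p) summed over the observed words
    return sum((c / total) * log2(total / c) for c in counts.values())


print(shannon_uncertainty("AAAA", 2))  # only the word AA is observed
print(shannon_uncertainty("ACGT", 2))  # AC, CG and GT are each observed once
```

The first call yields zero uncertainty because a single word accounts for all observations, whereas the second yields a positive value because three distinct words are equally frequent.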

Here, the information content was examined for complete bacterial and phage genomes and the analysis was extended to the calculation of Shannon's uncertainty for each sequence within metagenomic libraries. The effects of word size, genome length and GC% on Shannon's uncertainty have also been examined. We demonstrate that the information content of sequences from metagenomes correlates with the number of similar sequences that are found by comparison to databases of known sequences. Using this approach may speed up the processing time for analyzing metagenomic data and allow prioritization of computational resources.

Results

Shannon's uncertainty in complete bacterial and phage genomes

Shannon's index was calculated for 600 complete phage genomes and 94 complete bacterial genomes (listed in Supplementary Table S1) using word lengths ranging from 1 to 12 nucleotides (nt). Shannon's indices of phage and bacterial genomes were similar up to word length 7 nt (Figure 1), implying an even distribution of all possible sequence words in phage and bacterial genomes. From word length 8 to 12 nt, the rate of increase of Shannon's index is higher in bacterial genomes than in phage genomes. Moreover, for word lengths greater than 10 nt, Shannon's index can differentiate bacteria and phages (Figure 1).

Figure 1

Shannon's indices of 600 complete phage genomes and 94 complete bacterial genomes.

Blue crosses represent phage genomes and red circles represent bacterial genomes. As the word length increases, Shannon's index is more discriminatory between phage and bacterial genomes.

Factors influencing differences in Shannon's uncertainty between complete bacterial and phage genomes

The difference between Shannon's indices of phage and bacterial genomes for word length greater than 8 nt suggested that either word length, genome size or a combination of both might influence this uncertainty value.

Shannon's index vs. word length

Word length is reportedly an important factor influencing the value of Shannon's index6,7. A high Shannon's index (close to the maximum possible index, i.e., for word length n, the maximum index will be 2n) depends on the presence of all possible combinations of words in the genome. Consequently, the longer the genome, the higher the probability of having different word variations. For a given word length of n, there are $4^n$ possible word combinations for DNA sequences. The length of most phage genomes (585 out of 600) ranges from $4^7$ bp to less than $4^9$ bp (Figure 2). Therefore, for word sizes greater than 8 nt, many words will be represented only zero or one times, which will result in a lower Shannon's index for most of these genomes. In contrast, the average length of the 94 bacterial genomes used in this analysis is about 3 million bp (between $2 \times 4^{10}$ and $4^{11}$ bp). Therefore, bacterial genomes have a higher Shannon's index than phage genomes using word lengths smaller than 12 nt. For word lengths 11 nt or 12 nt, Shannon's index can distinguish phage and bacterial genomes (Figure 1), although this is likely because phage genomes are too short to generate sufficiently high Shannon's indices for words of this size.

Figure 2

Length distribution of 600 complete phages.

Shannon's index vs. genome length

Shannon's indices for all phage genomes have been plotted against their lengths. For word length 12 nt, Shannon's index highly correlates with the logarithm of the genome length (Figure 3). For word length 9 nt, there is still a significant correlation; however, for shorter word lengths, no significant correlation was observed between genome length and Shannon's index. For shorter word lengths, most of the genomes contain almost all combinations of words, so there is no strong correlation between Shannon's index and genome length. In contrast, for longer words, the bigger genomes have more combinations of words than the smaller genomes, so Shannon's index correlates with genome length.

Figure 3

Shannon's index vs. length for 600 complete phage genomes using word length 9 and 12.

Calculations with irrelevant word lengths may give the wrong impression and create false differences between genomes (Figure 1). To compare genomes based on Shannon's index, the word length (n) should be chosen in a way that allows the possibility of having all combinations of words ($4^n$) in all the genomes. Therefore, for a given genome of length L, the possible word length (n) to calculate Shannon's index should satisfy Equation 1:

$$n \le \log_4 L \qquad (1)$$
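As a small illustration of this constraint, the Python sketch below returns the largest word length for which a genome of a given length could contain every possible word. It assumes that Equation 1 amounts to requiring $4^n \le L$, which is inferred from the surrounding text; the function name `max_word_length` and the example lengths are hypothetical.

```python
def max_word_length(genome_length: int) -> int:
    """Largest n with 4**n <= genome_length (0 if the genome is shorter than 4 bp).

    Assumption: Equation 1 is read as n <= log4(L). Integer arithmetic is used
    to avoid floating-point rounding at exact powers of four.
    """
    n = 0
    while 4 ** (n + 1) <= genome_length:
        n += 1
    return n


# Illustrative lengths only: two phage-sized and one bacterium-sized genome.
for length in (16_384, 40_000, 3_000_000):
    print(length, max_word_length(length))
```

Under this reading, a genome near the lower end of the phage length range supports word lengths up to 7 nt, while a genome of about 3 million bp supports word lengths up to 10 nt.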

Shannon's index vs. GC%

For most phage genomes, the maximum word length that should be used to calculate Shannon's index (Equation 1) is 7 nt. When word lengths from 1 to 7 nt were used to calculate Shannon's index, GC-rich and GC-poor genomes were found to have lower Shannon's index since these genomes tend to have less diverse word combinations than genomes with 50% GC content (Figure 4a). The strong relationship between Shannon's index and |GC% − 0.5| for word length 1 to 5 nt suggests that Shannon's index is strongly influenced by the GC composition of the DNA sequence (Figure 4b). For word lengths above 6 nt, the relationship is not strongly supported. Different sequences may have the same GC%, but Shannon's index depends on the distribution of the different word combinations. Therefore two different sequences having the same GC% may have different Shannon's indices and the probability of this happening increases with the word length. Thus, as word length is increased, the correlation between Shannon's index and GC content becomes weaker (Figure 4b).
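The dependence on GC content at word length 1 can be illustrated with a short Python sketch. It is not derived from the genomes analyzed here: it assumes that A and T are equally frequent and that G and C are equally frequent, so that the four base frequencies follow from the GC fraction alone. The function name `single_base_uncertainty` is hypothetical.

```python
from math import log2


def single_base_uncertainty(gc: float) -> float:
    """Uncertainty (in bits) at word length 1 for a given GC fraction.

    Assumption: freq(G) = freq(C) = gc / 2 and freq(A) = freq(T) = (1 - gc) / 2.
    """
    freqs = [gc / 2, gc / 2, (1 - gc) / 2, (1 - gc) / 2]
    return sum(p * log2(1 / p) for p in freqs if p > 0)


for gc in (0.3, 0.4, 0.5, 0.6, 0.7):
    h = single_base_uncertainty(gc)
    print(f"GC={gc:.1f}  |GC-0.5|={abs(gc - 0.5):.1f}  H={h:.3f}")
```

Under these assumptions the uncertainty peaks at 50% GC and decreases symmetrically as |GC% − 0.5| grows, which matches the direction of the relationship described for short word lengths.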

Figure 4

(a) Shannon's index vs. GC% for 600 complete phage genomes using word length 1 nt to 7 nt. (b) The relationship between Shannon's index and |GC% −0.5| for 600 complete phage genomes using word length 1 nt to 7 nt.

Shannon's uncertainty in metagenomes

Shannon's uncertainty was calculated for different metagenomic data sets. The maximum uncertainty corresponds to a sequence in which every word is equally frequent (e.g., A, G, C and T for word length one). The majority of reads in a metagenome have an uncertainty greater than 1.8 per nt (Figure 5a), suggesting an even distribution of bases in the reads, although the relative information content of the reads varies by sample.

Figure 5

(a) Cumulative comparison of the uncertainty (for word length 1) in DNA sequences in metagenome samples. Eight samples representative of the 24 used in this study are shown here: Soudan Mine Black Stuff (pink34), Line Islands Kingman reef phage (light green35), Line Islands Tabuaeran phage (light blue35), Marine phages from the Gulf of Mexico (blue29), Marine samples supplemented with DMSP (magenta36), Line Islands Palmyra phage (dark green35), Line Islands Christmas Reef phage (red35), Marine samples supplemented with vanillate (green36). The uncertainty is greater than 1.7 for 85% to 90% of the sequences in all samples. (b) Comparison of Shannon's uncertainty and the observed similarity to known sequences. Shannon's uncertainty (H) was calculated for word length one and is compared with similarity to the SEED non-redundant protein database. Samples are coloured the same as in Fig. 5a. Word lengths up to 11 letters were also used to calculate H and all cases gave the same results (data not shown).

To investigate whether information content correlates with functional content, we compared the frequencies with which each sequence matched an entry in the known databases. The similarities between the metagenomic sequences and the SEED non-redundant protein database had been pre-calculated using BLASTX15,16 as a part of the annotation and analysis procedure performed by the MG-RAST server4. For a set of reads with a given uncertainty, the fraction of reads that were similar to sequences in the SEED non-redundant database was extracted from these pre-calculated similarities (Figure 5b). A read with more information (higher uncertainty) was more likely to be similar to sequences in the database than a read with less information. Different metagenomes varied in the fraction of reads that are similar to known sequences, but this likely reflects the sampling limitations that have thus far limited the breadth of the known sequences17.
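The comparison described above can be sketched in Python as a simple binning step. This is an illustration under stated assumptions, not the procedure run on the MG-RAST data: the input format (pairs of uncertainty and a flag for a database match), the bin width, the function name `hit_fraction_by_uncertainty` and the toy values are all hypothetical.

```python
from collections import defaultdict


def hit_fraction_by_uncertainty(reads, bin_width=0.1):
    """Map each uncertainty bin (lower edge) to the fraction of reads with a match.

    reads: iterable of (uncertainty, has_hit) pairs, a hypothetical input format.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for uncertainty, has_hit in reads:
        # Round before truncating so values on a bin edge are not pushed down
        # by floating-point error.
        bin_index = int(round(uncertainty / bin_width, 9))
        totals[bin_index] += 1
        hits[bin_index] += bool(has_hit)
    return {round(b * bin_width, 10): hits[b] / totals[b] for b in sorted(totals)}


# Toy values invented for illustration only.
toy_reads = [(1.25, False), (1.28, False), (1.85, True),
             (1.88, False), (1.95, True), (1.97, True)]
print(hit_fraction_by_uncertainty(toy_reads))
```

In practice the uncertainty of each read would be computed from its sequence and the match flag would come from the pre-calculated similarity results; reads could then be ordered by bin to decide which to search first.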

Discussion

Since the publication of the first complete genome sequences, genome composition has been appealing to mathematicians, statisticians and computer scientists. Base distribution statistics, skews and biases18,19,20,21,22, sequence symmetries10,11 and information content5,6,7,8 have all been examined in the hope of deciphering hidden codes within the genomes11 and better understanding genome growth and evolution7,23,24,25,26,27.

Among the mathematical methods used, Shannon's uncertainty has previously been considered as a genome analysis strategy5,6,7,8. In the work of Chang and colleagues6,7, Shannon's uncertainty was calculated for complete prokaryotic and eukaryotic genomes available at that time and it was found that genomes belonged to a universality class that could be mathematically represented by a simple formula, yet Plasmodium genomes stood out as an intriguing exception, still unexplained6,7. Additionally, the variation of Shannon's index with sequence word length and genome length was examined6. Here, our findings confirmed and advanced that study by establishing the relationship between word size and genome length for calculating Shannon's index.

We also found that at certain word lengths, Shannon's index can be used to differentiate phage and bacterial sequences. Although this differentiation is sensitive to genome length, with some modification, this observation can help find phage genes embedded in bacterial genomes. As an application, we calculated Shannon's index for a group of DNA sequences using a word size of 12 nt (four consecutive amino acids) and we were able to use this group of sequences to detect prophages in bacterial genomes28.

Finally, our findings show that the information content of metagenomic sequences varies from sample to sample, but about 85% of those sequences have high levels of uncertainty, suggesting that they are composed of approximately equal numbers of each of the four bases (Figure 5). In addition, the information content in metagenomic sequences was found to correlate with the likelihood that the sequence would be similar to a previously characterized sequence (in the non-redundant database). This suggests that the large numbers of metagenomic sequences could be rapidly sorted based on their information content to prioritize similarity searches and other common computations. It is to be noted, however, that those metagenomic sequences have to be preprocessed and cleared of potential repeats or homopolymeric runs, sometimes introduced by sequencing methods (e.g., the introduction of runs of nucleotides during high-throughput sequencing)29. For this purpose, tools such as PRINSEQ29 or MG-RAST4 can be used prior to sequence analysis of metagenomic data sets. Moreover, the correlation between information content and similarity may provide a rapid mechanism to screen for either false positive matches (sequences that match the database but should not) or false negative matches (sequences that have no match in the database but should). Of course, the extremely large numbers of sequences with high uncertainty but no similarity in the databases might be influenced by the lack of sampling in the known databases30.

Methods

Retrieval of genomic and metagenomic data sets

All genomes used in this analysis were retrieved from the SEED database and servers3 (http://servers.theseed.org), where they have been consistently annotated and classified into subsystems31,32 in the RAST server33 (http://rast.nmpdr.org). Likewise, metagenomic sequence data sets were retrieved from the MG-RAST server4 (http://metagenomics.theseed.org).

For the calculation and analysis of Shannon's uncertainty, a subset of 24 metagenomes was selected from the previously studied SCUMS data set34, most of which were created by pyrosequencing. The metagenomes were chosen to represent the range of data sets available from sequences sampled in simple and well-characterized environments to more complex environments with multiple species present. The raw data were used without assembly and the samples included in the data set cover both viral and microbial metagenomes, sampled from such diverse biomes as mines, marine environment, soils and animals34,35,36,37. The shortest sequence in the data set was 31 bp and the longest was 362 bp.

Calculation of Shannon's uncertainty

Shannon's uncertainty was calculated using Equation 2 (ref. 14),

$$H = -\sum_{i} p_i \log_2 p_i \qquad (2)$$

where $p_i$ is the frequency of the i-th word in a sequence. For example, for word length one, $p_i$ is calculated from the frequencies of the nucleotides {A, G, C, T}. If each word is equally frequent, $p_i = 0.25$. In general, for all words of length n being equally likely, $p_i$ is $1/4^n$.
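As a worked check of the 0 to 2n range quoted for the index, the two extreme cases can be written out. This assumes that the uncertainty is the standard Shannon entropy with a base-2 logarithm, summed over the $4^n$ possible words of length n.

```latex
% All 4^n words equally likely (maximum uncertainty):
H_{\max} = -\sum_{i=1}^{4^n} \frac{1}{4^n} \log_2 \frac{1}{4^n}
         = \log_2 4^n
         = 2n

% A single word accounts for every observation (minimum uncertainty):
H_{\min} = -1 \cdot \log_2 1 = 0
```

For word length one this gives a maximum of 2, reached when A, G, C and T each have a frequency of 0.25.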