Introduction

Metagenomics is the study of the genomic content of a sample of organisms, obtained by sequencing from a common habitat or an environmental sample of microbes. Advances in the throughput and cost-efficiency of sequencing technology are fueling rapid growth in the number and scope of metagenomics studies, resulting in a deluge of sequences. Taxonomic analysis of such data sets has shown that only a small number of prominent taxa appear in most data sets, while the majority appear to be present only in small numbers, in what has become known as the rare biosphere (Sogin et al., 2006).

There is a great need for new methods for analyzing and comparing multiple metagenomic data sets, using appropriate ecological and statistical models. Specifically, what is needed is a tool that combines the visualization of relationships with a metric of distance in a single package, includes appropriate ecological indices, and does not require the metagenomic data to be fitted to a rooted, dendrogram-like evolutionary relationship. The two main software engineering requirements are rapid computational analysis of very large data sets and ease of use for researchers.

In this paper, we suggest a novel approach that combines the use of taxonomic analysis, ecological indices and non-hierarchical clustering to provide a network representation of the relationships between different metagenome data sets. The approach proceeds as follows:

First, a taxonomic profile is computed for each data set. Second, a matrix of pairwise distances is determined using one of several possible ecological indices (Legendre and Legendre, 1998). Finally, the distances are represented using an appropriate visualization technique. For reasons outlined below, we suggest using the non-hierarchical clustering technique neighbor-net (Bryant and Moulton, 2004).

In more detail, the first step is to produce a taxonomic profile for each given metagenomic data set. For DNA reads collected in a shotgun sequencing approach, one possibility is to use the MEGAN program (by Daniel H Huson and Stephan C Schuster, with contributions from Alexander F Auch, Daniel C Richter, Suparna Mitra and Ji Qi; Algorithms for Bioinformatics, Tuebingen University, Germany) (Huson et al., 2007), which performs a taxonomic analysis of a metagenomic data set based on a BLASTX (Altschul et al., 1990) comparison of the data set against an appropriate reference database such as NCBI-nr (Wheeler et al., 2008). MEGAN creates taxonomic profiles at different ranks of the NCBI taxonomy, and counts how many reads are assigned to each taxon at the specified rank. The current reference databases are still largely based on ‘model organisms’ and were not specifically designed as reference databases for metagenomics; thus, BLAST-based analyses will be affected by the availability of good reference genomes in the database. However, the approach described in this paper is not tied to BLAST and such databases, as we show below in a study comparing 16S ribosomal RNA data.

The next step is to compute a matrix of pairwise distances from the taxonomic profiles using a suitable ecological measure. After reviewing 27 different ecological measures (listed in Legendre and Legendre, 1998), we chose six to use in this study. The simplest and most common metric measure is the ‘Euclidean distance’ (Equation 1), which is computed using Pythagoras’ formula. The distance (D) between two metagenome samples (X, Y) can be calculated using

$$D_{\text{Euclidean}}(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (1)$$

where $x_i$ and $y_i$ are the read counts for the ith taxon of the respective metagenomic samples X and Y, and n is the number of taxa. The Euclidean distance is dominated by the highly abundant taxa and its value can increase indefinitely with the number of descriptors. The Kulczynski (Equation 2) (Odum, 1950) and Bray–Curtis (Equation 3) (Bray and Curtis, 1957) distances are slightly more sophisticated measures given by

and
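As a point of reference, Equations 2 and 3 can be written in the forms usually given for these two coefficients (for example, in Legendre and Legendre, 1998). The notation below is an assumption: $x_i$ and $y_i$ are the read counts for the ith taxon, n is the number of taxa, and $W$ is an auxiliary symbol introduced here for the summed minimum counts.

```latex
\[ W = \sum_{i=1}^{n} \min(x_i, y_i) \]

\[ D_{\text{Kulczynski}}(X, Y) = 1 - \frac{1}{2}\left( \frac{W}{\sum_{i=1}^{n} x_i} + \frac{W}{\sum_{i=1}^{n} y_i} \right) \qquad (2) \]

\[ D_{\text{Bray--Curtis}}(X, Y) = 1 - \frac{2W}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i} \qquad (3) \]
```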

The χ2 (Equation 4) (Lebart et al., 1979) and Hellinger (Equation 5) (Rao, 1995) distances are two probabilistic measures that calculate the distance among sites using species abundances. They are calculated as,

and
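For reference, Equations 4 and 5 can be written in the forms usually given for these two coefficients (for example, in Legendre and Legendre, 1998). The notation is an assumption: $x_+$ and $y_+$ denote the total read counts of samples X and Y, $c_i$ is the total count of taxon i over the data sets being compared, and $c_+$ is the grand total. Whether $c_i$ is summed over the pair or over all data sets in the comparison is not stated in the text.

```latex
\[ x_+ = \sum_{i=1}^{n} x_i, \qquad y_+ = \sum_{i=1}^{n} y_i, \qquad c_+ = \sum_{i=1}^{n} c_i \]

\[ D_{\chi^2}(X, Y) = \sqrt{\sum_{i=1}^{n} \frac{c_+}{c_i} \left( \frac{x_i}{x_+} - \frac{y_i}{y_+} \right)^2} \qquad (4) \]

\[ D_{\text{Hellinger}}(X, Y) = \sqrt{\sum_{i=1}^{n} \left( \sqrt{\frac{x_i}{x_+}} - \sqrt{\frac{y_i}{y_+}} \right)^2} \qquad (5) \]
```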

While the ‘Bray–Curtis’ (Bray and Curtis, 1957) and ‘Kulczynski’ (Odum, 1950) measures also focus on the most abundant taxa, the ‘χ2’ (Lebart et al., 1979) and ‘Hellinger’ (Rao, 1995) distances are based on differences in the proportions of taxa between the two data sets and thus provide better representations of the taxon composition. Goodall's similarity index (Goodall, 1964, 1966) is a non-parametric measure specifically designed for determining the pairwise similarity between observations of composite multivariate data sets.
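The five distance measures discussed above can be sketched in a few lines of Python. This is a minimal illustration based on the textbook forms of these coefficients (Legendre and Legendre, 1998), not the MEGAN implementation; the function names are hypothetical, both samples are assumed to contain at least one read, and the χ2 weights are computed from the two samples being compared only, which is an assumption.

```python
import numpy as np


def _as_counts(x, y):
    """Convert two taxon-count profiles of equal length to float arrays."""
    return np.asarray(x, dtype=float), np.asarray(y, dtype=float)


def euclidean(x, y):
    x, y = _as_counts(x, y)
    return float(np.sqrt(np.sum((x - y) ** 2)))


def kulczynski(x, y):
    x, y = _as_counts(x, y)
    w = np.minimum(x, y).sum()  # summed minimum counts
    return float(1.0 - 0.5 * (w / x.sum() + w / y.sum()))


def bray_curtis(x, y):
    x, y = _as_counts(x, y)
    w = np.minimum(x, y).sum()
    return float(1.0 - 2.0 * w / (x.sum() + y.sum()))


def hellinger(x, y):
    x, y = _as_counts(x, y)
    diff = np.sqrt(x / x.sum()) - np.sqrt(y / y.sum())
    return float(np.sqrt(np.sum(diff ** 2)))


def chi_square(x, y):
    x, y = _as_counts(x, y)
    col = x + y                      # per-taxon totals over the pair (assumption)
    keep = col > 0                   # taxa absent from both samples carry no weight
    weights = col.sum() / col[keep]
    diff = x[keep] / x.sum() - y[keep] / y.sum()
    return float(np.sqrt(np.sum(weights * diff ** 2)))


if __name__ == "__main__":
    sample_x = [3, 0, 1]  # toy read counts for three taxa
    sample_y = [1, 2, 1]
    for measure in (euclidean, kulczynski, bray_curtis, hellinger, chi_square):
        print(measure.__name__, measure(sample_x, sample_y))
```

The Euclidean, Kulczynski and Bray–Curtis functions operate on raw counts, whereas the Hellinger and χ2 functions first convert each profile to proportions, which mirrors the distinction drawn in the text.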

The computation of Goodall's index involves a number of steps. First, a so-called ‘partial similarity measure’ is calculated for each species between each pair of data sets. Then, for each pair of data sets, one computes the proportion of partial similarity values belonging to species i that are larger than or equal to the partial similarity of the pair of data sets being considered. These proportions ($p_i$) are combined for the n species by computing their product, $\Pi = \prod_{i=1}^{n} p_i$. Finally, the similarity (S) between two data sets (X, Y) is obtained as the proportion of the products ($\Pi$) that are larger than or equal to the product of the pair of data sets considered ($\Pi_{\text{pair}}$). The equation is given by

$$S(X, Y) = \frac{\text{number of pairs of data sets with } \Pi \geq \Pi_{\text{pair}}}{\text{total number of pairs of data sets}}$$

See Goodall (1964, 1966) and Legendre and Legendre (1998) for further details.
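The steps of Goodall's index described above can be sketched as follows. This is one possible reading of the procedure, not the MEGAN code: the function name is hypothetical, and the partial similarity is assumed to be a Gower-style value (one minus the absolute count difference divided by the range of that taxon over all data sets), which the text does not specify. Products are accumulated as sums of logarithms to avoid numerical underflow when many taxa are involved.

```python
import itertools

import numpy as np


def goodall_similarity(profiles):
    """Return a symmetric similarity matrix for (n_samples x n_taxa) counts."""
    x = np.asarray(profiles, dtype=float)
    n_samples = x.shape[0]
    pairs = list(itertools.combinations(range(n_samples), 2))
    first = [a for a, _ in pairs]
    second = [b for _, b in pairs]

    # Step 1: partial similarity per taxon for every pair of data sets
    # (assumption: Gower-style, scaled by the range of each taxon).
    value_range = x.max(axis=0) - x.min(axis=0)
    value_range[value_range == 0] = 1.0  # constant taxa give similarity 1
    partial = 1.0 - np.abs(x[first] - x[second]) / value_range

    # Step 2: for pair k and taxon i, the proportion of all pairs whose
    # partial similarity for taxon i is >= that of pair k.
    p = (partial[None, :, :] >= partial[:, None, :]).mean(axis=1)

    # Step 3: product of the proportions over taxa, kept on a log scale.
    log_product = np.log(p).sum(axis=1)

    # Step 4: similarity of pair k = proportion of pairs whose product is
    # >= the product of pair k.
    s_pairs = (log_product[None, :] >= log_product[:, None]).mean(axis=1)

    similarity = np.eye(n_samples)
    for (a, b), value in zip(pairs, s_pairs):
        similarity[a, b] = similarity[b, a] = value
    return similarity


if __name__ == "__main__":
    toy = [[10, 0, 5], [10, 0, 5], [0, 20, 1]]  # three toy data sets
    print(1.0 - goodall_similarity(toy))        # distances as 1 - similarity
```

Because every proportion is computed relative to all pairs of data sets, the value for a given pair depends on which other data sets are included in the comparison. Converting the similarity to a distance as one minus the similarity is also an assumption made for this sketch.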

By definition, Goodall's index gives more weight to differences between rare taxa than the other indices, and should therefore be particularly suitable for comparing microbial metagenomes (Sogin et al., 2006).

There are two popular ways of representing distance matrices graphically. The first, widely applied in ecological studies, is to use a principal component analysis (PCA) or non-metric multidimensional scaling (NMDS) to obtain a two-dimensional layout. The second, widely used in evolutionary studies, is to use rooted trees computed by a hierarchical clustering method (Rusch et al., 2007). The advantage of a tree representation is that it explicitly provides clusters of closely related data sets. However, metagenomes are not expected to evolve along a tree; rather, numerous environmental factors may affect data set composition, resulting in distances that reflect incompatible signals. Although ordination methods do not suffer from this problem, they do not explicitly link data points into clusters and provide no metric against which to determine the distance between data sets. Hence, we suggest using the neighbor-net method to compute an unrooted phylogenetic network that enjoys the advantages of both methods (Bryant and Moulton, 2004). Such networks are not restricted to being a tree and are able to show incompatible clusters.

In this study, we apply the approach outlined above to marine metagenomes from three types of studies: a mesocosm experiment (Gilbert et al., 2008), a spatially structured data set (the Global Ocean Survey) (Rusch et al., 2007) and a time series (Gilbert et al., 2009). Our study suggests that the approach is robust, as it produces networks that are very similar across all ranks of the NCBI taxonomy and, to a lesser extent, across different ecological indices. We further establish that the use of Goodall's index provides the best results, given that microbial communities tend to be rich in rare genes and rare taxa (Sogin et al., 2006). Thus, Goodall's index may be most suitable for analyses that involve rare taxa, whereas the χ2 and Hellinger distances can be considered when rare taxa have only a small role.

Materials and methods

All metagenomes and metatranscriptomes were aligned against the NCBI-NR database using the BLASTX tool (Altschul et al., 1990). The results were imported into MEGAN (Huson et al., 2007), using the ‘Import from BLAST’ option. To obtain taxonomic profiles, MEGAN uses the lowest common ancestor algorithm that assigns each read to the lowest common ancestor of the set of taxa that it hits in the NR database. A MEGAN project file contains all reads and all significant BLAST matches in a binary and incrementally compressed format, which is around 30% of the size of the original input files. We then performed multiple comparisons using various ecological indices and constructed networks using the neighbor-net algorithm (Bryant and Moulton, 2004), as implemented in version 4 of MEGAN.
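The lowest common ancestor assignment described above can be illustrated with a toy example. This is a minimal sketch, not MEGAN's implementation: the taxonomy is a small hypothetical parent map rather than the NCBI taxonomy, the function names are invented for illustration, and any filtering of BLAST hits that the real program applies before the assignment is ignored.

```python
def lineage(taxon, parent):
    """Return the path from a taxon up to the root, starting with the taxon."""
    path = [taxon]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def lowest_common_ancestor(taxa, parent):
    """Return the deepest node that lies on the lineage of every hit taxon."""
    taxa = list(taxa)
    common = set(lineage(taxa[0], parent))
    for taxon in taxa[1:]:
        common &= set(lineage(taxon, parent))
    # Walk up from the first taxon; the first shared node is the deepest one.
    for node in lineage(taxa[0], parent):
        if node in common:
            return node
    raise ValueError("taxa do not share a root")


if __name__ == "__main__":
    # Hypothetical miniature taxonomy: child -> parent.
    parent = {
        "root": None,
        "Bacteria": "root",
        "Proteobacteria": "Bacteria",
        "Alphaproteobacteria": "Proteobacteria",
        "Gammaproteobacteria": "Proteobacteria",
        "Cyanobacteria": "Bacteria",
    }
    hits = ["Alphaproteobacteria", "Gammaproteobacteria"]
    print(lowest_common_ancestor(hits, parent))  # prints "Proteobacteria"
```

A read whose hits span distant groups is therefore assigned to a high-level node, whereas a read with hits to a single taxon is assigned to that taxon itself.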

In the first study, we compared eight Plymouth Marine Laboratory (PML)-Bergen data sets consisting of four metagenomes (DNA) and four metatranscriptomes (complementary DNA (cDNA)), and named these eight samples as follows: (1) Time1-Bag1-DNA, (2) Time1-Bag6-DNA, (3) Time2-Bag1-DNA, (4) Time2-Bag6-DNA, (5) Bag1-13May-cDNA, (6) Bag1-19May-cDNA, (7) Bag6-13May-cDNA and (8) Bag6-19May-cDNA (please refer to Gilbert et al., 2008 for details of nomenclature). All data sets were randomly re-sampled to the smallest data set size to allow inter-comparison (for example, Gilbert et al., 2009). After opening all the data sets in MEGAN, the ‘compare’ menu item was used to generate a new document that contains a comparison of all data sets. We compared the taxonomical profiles (as MEGAN files) of these eight data sets. Then, multiple comparisons of the data sets were performed using six different ecological distance measures (Euclidean, Kulczynski (Odum, 1950), Bray–Curtis (Bray and Curtis, 1957), Hellinger (Rao, 1995), χ2 (Lebart et al., 1979) and Goodall's index (Goodall, 1964, 1966)) at each of seven taxonomic ranks (‘kingdom’, ‘phylum’, ‘class’, ‘order’, ‘family’, ‘genus’ and ‘species’) to create a total of 42 networks (Supplementary Figures S1.1, S1.2 and S1.3). The distances were processed by the neighbor-net algorithm (Bryant and Moulton, 2004) to obtain a collection of unrooted phylogenetic networks.
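The random re-sampling to the smallest data set size mentioned above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: it draws reads without replacement from per-taxon count profiles, whereas the original re-sampling may have been applied to the reads themselves, and the function name and seed parameter are hypothetical.

```python
import numpy as np


def resample_to_smallest(profiles, seed=0):
    """Subsample every profile (rows of taxon counts) to the smallest depth."""
    counts = np.asarray(profiles, dtype=np.int64)
    depth = int(counts.sum(axis=1).min())  # size of the smallest data set
    rng = np.random.default_rng(seed)
    # Drawing `depth` reads without replacement from a profile follows a
    # multivariate hypergeometric distribution over the taxa.
    return np.vstack(
        [rng.multivariate_hypergeometric(row, depth) for row in counts]
    )


if __name__ == "__main__":
    toy = [[500, 300, 200], [50, 30, 20], [900, 50, 50]]  # toy read counts
    print(resample_to_smallest(toy))
```

After re-sampling, every profile has the same total number of reads, so that differences between profiles are not driven by differences in sequencing effort.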

In a second study, we used one random sub-sample of the Sargasso Sea data (Venter et al., 2004) and one sub-sample from the Sorcerer II Global Ocean Sampling expedition data (GOS) (Rusch et al., 2007) and the data and setup from the PML-Bergen study, to visualize the comparison of multiple marine metagenomes from different environments processed using different sampling and sequencing strategies. All 10 data sets were randomly re-sampled to the smallest data set size to allow inter-comparison of taxonomic abundances (for example, Gilbert et al., 2009). As in the first study, we performed a multiple comparison of the 10 data sets using four of the distances (Goodall's index, Euclidean distance, Hellinger distance and χ2 distance) at each of seven taxonomic ranks to create 28 additional networks (Supplementary Figures S2.1, S2.2). Networks obtained using the Kulczynski and Bray–Curtis distances looked very similar to the networks obtained using Euclidean distance in the previous study (Supplementary Figure S1), so we dropped the Kulczynski and Bray–Curtis distances from subsequent experiments.

In addition, multiple comparisons were performed using four of the indices considering only bacterial taxa at six taxonomic ranks, resulting in a further 24 networks (Supplementary Figure S3.1). For the Goodall's index and Euclidean distance, the numbers of sequences identified as bacterial were randomly normalized to standardize the apparent sequencing effort.

In a third study, we investigated the effect of excluding rare taxa from the taxonomical profiles. In this study, we analyzed the data at the class rank of the NCBI taxonomy. We duplicated the six metagenomes (four Bergen metagenomes, one Sargasso Sea sample and one GOS sample from the previous study) and excluded all taxa that have an arbitrarily selected abundance of <0.025% of the total community abundance from each data set. We then compared these six truncated metagenomic data sets using all six indices, resulting in six networks at the level of class taxa (Supplementary Figure S4).
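The exclusion of rare taxa described above amounts to a simple threshold on relative abundance. A minimal sketch follows, with a hypothetical function name; setting the excluded counts to zero, rather than removing the taxa from the profile, is an assumption made so that truncated and original profiles remain directly comparable.

```python
import numpy as np


def exclude_rare_taxa(counts, min_fraction=0.00025):
    """Zero out taxa whose share of the sample is below min_fraction.

    The default of 0.00025 corresponds to the 0.025% cut-off in the text.
    """
    counts = np.asarray(counts, dtype=float)
    fraction = counts / counts.sum()
    return np.where(fraction < min_fraction, 0.0, counts)


if __name__ == "__main__":
    profile = [9995, 3, 2, 0]          # toy read counts, 10 000 reads in total
    print(exclude_rare_taxa(profile))  # the taxon with 2 reads (0.02%) is dropped
```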

In a fourth study, we analyzed all 41 samples of the spatially structured GOS data, which we downloaded from the CAMERA website (Seshadri et al., 2007). As in the previous three studies, all 41 data sets were randomly re-sampled to the smallest data set size to allow inter-comparison of taxonomic abundances. All data sets were aligned against the NCBI-NR database using BLAST and the results were imported into MEGAN. As for the Bergen samples, we computed taxonomic profiles as MEGAN files for all 41 GOS data sets. We performed the comparison using Goodall's index at the class rank (Supplementary Figure S5). First, we compared all the sites together (Supplementary Figure S5B) and then only the coastal and open ocean sites (Supplementary Figure S5C) to illustrate biogeographic clustering based on the assumption that the coastal sites may harbor a more diverse microbiota than the open ocean sites.

In a final study, we analyzed the correlation between 12 ‘16S ribosomal RNA V6 tag-pyrosequencing’ data sets spanning 12 months of 2007 at a continually monitored sampling site, L4, in the Western English Channel (Gilbert et al., 2009). As before, these 12 samples were randomly re-sampled to identical sequencing depth to allow inter-comparison. As most operational taxonomic units (OTUs) are not present in all samples considered, we prepared an OTU abundance matrix by entering a zero wherever an OTU had no representatives in a sample.
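Building such a zero-filled OTU abundance matrix can be sketched as follows. The input layout (one dictionary of OTU counts per sample), the sample and OTU labels, and the function name are hypothetical and chosen only for illustration.

```python
def build_otu_matrix(samples):
    """Return (sample names, OTU names, count matrix) with zeros for absences.

    `samples` maps a sample name to a dict of {OTU name: read count}.
    """
    names = list(samples)
    # The columns are the union of all OTUs seen in any sample.
    otus = sorted({otu for counts in samples.values() for otu in counts})
    matrix = [[samples[name].get(otu, 0) for otu in otus] for name in names]
    return names, otus, matrix


if __name__ == "__main__":
    toy = {
        "Jan": {"OTU_1": 12, "OTU_2": 3},
        "Feb": {"OTU_1": 7, "OTU_3": 1},
    }
    names, otus, matrix = build_otu_matrix(toy)
    print(otus)    # ['OTU_1', 'OTU_2', 'OTU_3']
    print(matrix)  # [[12, 3, 0], [7, 0, 1]]
```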

We compared samples taken from the marine community over several months using Goodall's index in combination with neighbor-net based on all unique OTUs (Supplementary Figure S6.A), then excluding OTUs found on only one occasion (Supplementary Figure S6.B), and finally considering only the OTUs found every time (Supplementary Figure S6.C). In addition, we prepared PCA and NMDS plots using the same OTU data for OTUs present on two or more occasions. For the PCA analysis, we used the raw data and for the NMDS calculation we used a computed Bray–Curtis matrix (Supplementary Figure S7; for a more detailed method please refer to Gilbert et al., 2009).
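The three OTU selections described above (all OTUs, OTUs found on more than one occasion, and OTUs found every time) differ only in the minimum number of samples in which an OTU must occur. A minimal sketch, assuming a samples-by-OTUs count matrix and a hypothetical function name:

```python
import numpy as np


def filter_by_occurrence(matrix, min_samples):
    """Keep the OTU columns that are present in at least min_samples rows."""
    matrix = np.asarray(matrix)
    occurrences = (matrix > 0).sum(axis=0)  # number of samples containing each OTU
    return matrix[:, occurrences >= min_samples]


if __name__ == "__main__":
    # Toy matrix: three monthly samples (rows) by four OTUs (columns).
    toy = np.array([[5, 0, 1, 2],
                    [3, 0, 0, 1],
                    [4, 7, 0, 0]])
    all_otus = filter_by_occurrence(toy, 1)               # every observed OTU
    repeated = filter_by_occurrence(toy, 2)               # drop single-occasion OTUs
    always_present = filter_by_occurrence(toy, len(toy))  # found every time
    print(all_otus.shape, repeated.shape, always_present.shape)
```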

Results and discussion

Study 1: comparison of eight marine samples from an ocean acidification study

For the PML-Bergen analysis, all six selected ecological indices produce almost identical placements of the eight samples within a neighbor network, with only minor differences in the distances between samples (see Figure 1 and Supplementary Figure S1). The placement of these PML-Bergen samples conforms to reported biological and experimental relationships (Gilbert et al., 2008), with the metagenomes being well separated from the metatranscriptomes, and the samples from the peak of the induced phytoplankton bloom (time 1 or 13 May) being more separated from the samples taken after the collapse of the phytoplankton bloom (time 2 or 19 May) than the samples within each group are from one another. Interestingly, for the time 2 (19 May) metagenomes, the opposite is true, with the differences between these being greater than their differences from the time 1 metagenomes. This is indicative of the extremely different ecology of the mesocosm samples that existed after the collapse of the bloom. This was brought about by the experimental methodology used, in which, immediately after the collapse of the bloom, Bag1 was re-bubbled with CO2 and Bag6 was re-bubbled with air. This significantly altered the community composition and hence forced these samples apart (for more information refer to Gilbert et al., 2008).

Figure 1

Network obtained using Goodall's index showing the comparison of eight PML-Bergen samples (four metagenomes and four metatranscriptomes) considering all nodes at the class rank of the NCBI taxonomy.

Study 2: comparison of multiple marine metagenomic samples from different studies

To confirm that the Bergen-PML network was robust to the inclusion of additional samples, we added two additional marine metagenomes as ‘decoys’. The first was a subset of reads taken from the pooled Sargasso Sea study (Venter et al., 2004) and the second was a subset of the larger GOS (Rusch et al., 2007). To allow an accurate comparison, a random subset of 96 201 sequences (the size of the smallest mesocosm data set (Gilbert et al., 2008)) was extracted from each study. After computing networks with four indices (Figure 2; Supplementary Figure S2), we confirmed that the eight PML-Bergen samples remain in their original groupings and that the two decoys are placed at a distance from them. Interestingly, there are clear differences between the networks: in those based on the Euclidean distance, the decoys are much more distantly related to the PML-Bergen samples than in those based on Goodall's index (Figure 2; Supplementary Figure S2). We hypothesize that this is due to the biases induced by the vast rare biosphere and the way each index handles low-abundance sequences. The networks based on the Hellinger and χ2 distances (Supplementary Figure S2) are also similar. The GOS sample appears to cluster more closely to the PML-Bergen samples than the Sargasso Sea sample does, as the GOS sample (a random sub-sample of all GOS samples) is heavily enriched from coastal study sites, whereas the Sargasso Sea is an oligotrophic open ocean (Venter et al., 2004).

Figure 2

Network obtained using Goodall's index showing the comparison of 10 marine samples (randomly re-sampled Sargasso Sea and GOS samples together with the eight PML-Bergen samples) considering all nodes at the class rank of the NCBI taxonomy.

Study 3: multiple metagenome/metatranscriptome comparisons considering only bacterial nodes

When only bacterial taxa are considered, the Sargasso Sea data set appears to be more similar to the other data sets than it does when all taxa are considered. This is because the Sargasso Sea sample contains a much smaller number of eukaryotic reads compared with the other data sets. This reflects the similar water sampling procedures (for example, filter size) for the GOS (Rusch et al., 2007) and mesocosm (Gilbert et al., 2008) data sets, resulting in organisms of a similar size range being analyzed, whereas the Sargasso Sea study used a different sampling procedure (Venter et al., 2004), which excluded micro-eukaryotes. In this study, the networks computed using Goodall's index (Figure 3; Supplementary Figure S3) and Hellinger distance (Supplementary Figure S3) maintain a very similar layout over all ranks of the NCBI taxonomy for the 10 metagenome data sets, whereas the networks using Euclidean distance (Supplementary Figure S3) and χ2 distance (Supplementary Figure S3) show more variability. Strikingly, unlike in the first and second studies, the PML-Bergen metagenomes tend to group together by time, with the time 1 (13 May) samples being more similar to each other than to the time 2 (19 May) samples, and vice versa. This suggests that the post-bloom bubbling treatment of these bags had a greater effect on the eukaryotic and archaeal communities than on the bacterial communities. This is possibly a result of the bubbling-induced lysis of eukaryotic cells.

Figure 3

Network obtained using Goodall's index showing the comparison of 10 marine samples (randomly re-sampled Sargasso Sea and GOS samples and the eight PML-Bergen samples) considering only bacterial nodes at the class rank of the NCBI taxonomy.

Study 4: the effect of rare taxa

To study the effect of rare taxa on such analyses, we excluded all taxa having an abundance of <0.025% from each of the six metagenomes examined above (now excluding the four metatranscriptomes). The resulting truncated data sets were then compared with the original full data sets. We observe that the placement of the original metagenomes remains the same in all the networks computed. The networks based on the Euclidean, Kulczynski and Bray–Curtis distances are unable to distinguish between the original and truncated metagenomes, placing them at identical locations in the network (Supplementary Figure S4; left column). Networks obtained using the χ2 and Hellinger distances place the truncated samples close to the original metagenomes, but on separate branches (Supplementary Figure S4; right column). Only the network based on Goodall's index was able to represent the correct branching within the data sets (Figure 4; Supplementary Figure S4). Interestingly, we observed that the distances between the original and the truncated data sets are roughly proportional to the percentages of community change.

Figure 4

Comparison of six marine metagenomes (randomly re-sampled Sargasso Sea and GOS samples together with the four PML-Bergen metagenomes) with six truncated copies from which all rare taxa were excluded, analyzed at the class rank of the NCBI taxonomy. The displayed network is obtained using Goodall's index.

Study 5: comparison of the 41 GOS data sets

We applied our approach to the geospatially structured GOS data (Rusch et al., 2007) and computed two networks using Goodall's index, one considering all 41 sites and the second considering only the open ocean and coastal sites (Supplementary Figure S5). Both networks show a star-like structure, reflecting a high level of diversity in the data. Spatially related samples tend to cluster together, with the open ocean samples showing apparently fewer sample-specific taxa than the coastal ones.

Study 6: comparison of 16S ribosomal RNA time series data from Western English Channel

To show the use of our method on 16S ribosomal RNA tag-pyrosequencing data sets, we applied it to the OTUs obtained from a continually monitored sampling site in the Western English Channel spanning February–December 2007 (Gilbert et al., 2009). A comparison based on all 12 393 OTUs from this time-series data set using Goodall's index leads to a highly unresolved network (Supplementary Figure S6.A), which reflects the high abundance of rare taxa in the data across monthly samples. A more informative network can be obtained by excluding the OTUs found on only one occasion (considering 2666 OTUs, 22%) from the analysis (Supplementary Figure S6.B). A network based only on those OTUs present in all data (71 OTUs, 0.5%) shows similar clusters, but a proportion of the distance information is lost as a result (Supplementary Figure S6.C). The network visually captures both the relationships between the samples and the seasonality of the data set, which was previously described, less adequately, using traditional NMDS methods (Gilbert et al., 2009). This analysis highlights the robust nature of Goodall's index in marker-based metagenomic studies, as well as the importance of identifying rare taxa in these data sets.

Finally, to establish the benefits of using this network representation, we prepared PCA and NMDS plots based only on those OTUs present at more than one time point (Supplementary Figure S7). Unlike the NMDS plot, the network representation (Supplementary Figure S6.B) provides a clear visualization of the distances between the different data sets, and unlike the PCA analysis it suggests possible sample groupings. An obvious direct benefit is that the network representations provide a mix of the visual sensitivity of NMDS and PCA with the quantitative nature of classical dendrograms.

Availability

A program for computing ecological indices from taxonomical profiles (called MEG2DIST) is available as open source from the website http://www-ab.informatik.uni-tuebingen.de/software/megan/meg2dist.

The code is completely integrated into version 4 of MEGAN, which is available from the website: http://www-ab.informatik.uni-tuebingen.de/software/megan.