Deciphering functional redundancy in the human microbiome

Although the taxonomic composition of the human microbiome varies tremendously across individuals, its gene composition or functional capacity is highly conserved — implying an ecological property known as functional redundancy. Such functional redundancy has been hypothesized to underlie the stability and resilience of the human microbiome, but this hypothesis has never been quantitatively tested. The origin of functional redundancy is still elusive. Here, we investigate the basis for functional redundancy in the human microbiome by analyzing its genomic content network — a bipartite graph that links microbes to the genes in their genomes. We find that this network exhibits several topological features that favor high functional redundancy. Furthermore, we develop a simple genome evolution model to generate genomic content network, finding that moderate selection pressure and high horizontal gene transfer rate are necessary to generate genomic content networks with key topological features that favor high functional redundancy. Finally, we analyze data from two published studies of fecal microbiota transplantation (FMT), finding that high functional redundancy of the recipient’s pre-FMT microbiota raises barriers to donor microbiota engraftment. This work elucidates the potential ecological and evolutionary processes that create and maintain functional redundancy in the human microbiome and contribute to its resilience.


Diversity and redundancy measures based on Hill numbers
Various taxonomic and functional diversity measurements have been defined based on Hill number to characterize the diversity of ecological systems. In this section, we first briefly introduce those Hill number based taxonomic and functional diversity measures. In particular, we point out some limitations in the existing definitions of Hill number based functional diversity. We propose alternative definitions that overcome those limitations. Finally, we introduce Hill number based functional redundancy based on the existing Hill number based taxonomic diversity measures and our modified Hill number based functional diversity measures. Note that the proposed measures are not just valid for the analysis of microbiome data. In principle, they can also be used to analyze more general ecological data.

Taxonomic diversity
Consider a sample of N species with the relative abundance profile given by a vector p = [p 1 , · · · , p N ]. Here the term "species" is used in the general context of ecology, which doesn't necessarily represent the lowest major taxonomic rank. Hill introduced the concept of effective number of species [1]. The basic idea is that the taxonomic diversity (of order q) of a given sample with relative abundance profile p = [p 1 , · · · , p N ] is the same as that of an idealized sample of D equally abundant species with relative abundance profilẽ p = [1/D, · · · , 1/D]. In other words, This offers a parametric class of taxonomic diversity measures defined as follows [1]: for q = 1. (S2) and A few special cases: • q = 0, TD 0 = N is nothing but the species richness; • q = 1, TD 1 is the exponential of the Shannon entropy (a.k.a. Shannon index ); • q = 2, TD 2 = 1/ N i=1 p 2 i is the inverse Simpson index (a.k.a. Simpson diversity).
Note that the Gini-Simpson index (GSI) used in the main text is related to TD 2 as follows:

Functional diversity
Consider a sample of N species with the relative abundance profile p = [p 1 , · · · , p N ], and pair-wise functional distance matrix ∆ = (d ij ) ∈ R N ×N with d ii = 0 for all i = 1, · · · , N , and d ij = d ji ≥ 0 for all i = j. Recently, Chiu et al. extended the notion of Hill numbers to incorporate species pairwise functional distances and introduced the so-called functional Hill number [2]. In this framework, they derived the functional Hill number as follows: denotes the average pair-wise distance weighted by their abundances. Note that when different species are equally distinct with a constant pairwise distance, Q * must be equal to this constant. The functional Hill number D can be interpreted as "the effective number of equally abundant and (functionally) equally distinct species" with a constant distance Q * for all species pairs (i, j) with i = j. We notice that there are two drawbacks in this framework: (i) The functional Hill number D can not be explicitly solved from Eq.S5; (2) When all different species in the assemblage are equally distinct (i.e., d ij = Q * for all i = j), the functional Hill number D should reduce to the traditional or taxonomic Hill number. But according to (Eq.S5) and Eq.S6, we have Apparently, D dose not reduce to TD q in (Eq.S2) or (Eq.S3). In order to overcome those two drawbacks, we introduce a new pair-wise functional distance matrix . . . . . . . . . . . . . . .
where d ij denotes the original functional distance between species-i and j, with d ij = d ji ≥ 0 (i = j), and is the average functional distance between species-i and all the other species. Note that, when different species are equally distinct with a constant pairwise distance, λ is equal to this constant. Equipped with the modified distance matrix ∆ , we can now derive the functional Hill number as follows: Note that Q shares almost the exact form as Rao's quadratic entropy Q := N i=1 N j=1 d ij p i p j , a classic functional diversity measure that characterizes the mean distance between any two randomly chosen taxa in the sample [3]. From (Eq.S10), we can derive a parametric class of functional Hill numbers: and The functional Hill number of order q, i.e., D q (Q ), can be interpreted as "the effective number of equally abundant and (functionally) equally distinct species" with a constant distance Q for all species pairs. Thus, if D q (Q ) = v, it just means that the functional diversity (of order q) of a given real sample with relative abundance profile p = [p 1 , · · · , p N ] and pair-wise functional distance matrix ∆ is the same as that of an idealized sample containing v equally abundant species (i.e.,p = [1/v, · · · , 1/v]), and a v × v distance matrix with constant pair-wise functional distance Q , i.e., Hence, the functional diversity of a real sample can be defined as the column (or row) sum of the above idealized v × v constant matrix, yielding and A few special cases: • If all the different species in the sample are equally distinct, i.e., d ij = Q for all species pairs (i, j), i = j; and d ii = λ i = Q , then the functional Hill number D q (Q ) reduces to the ordinary Hill number TD q , and FD q (Q ) = TD q · Q .
• If all the species in the sample are equally aundant, i.e., p i = 1/N for i = 1, · · · , N , then for any distance matrix ∆ , we have

Functional redundancy
In the literature of ecology, the functional redundancy (FR) of a sample is often considered to be the part of the taxonomic diversity (TD) that can not be explained by the functional diversity (FD) [4,5,6]. Hence FR is typically defined to be the difference between TD and FD: FR := TD − FD.
• When the species are completely different in their functions, i.e., for all pairs i = j, d ij = 1; and d ii = 0, then Q = N i=1 N j=1,j =i p i p j = N i=1 p i (1 − p i ) = GSI, hence FR = 0. This makes perfect sense, because all the different species are functionally different, hence the functional redundancy of this sample (community) is zero.
• When all the species are functionally identical: d ij = 0 for i, j = 1, · · · , N , then Q = 0, and FR = GSI. In this case, the functional redundancy of this sample (community) is maximized and equal to the Gini-Simpson index of taxa diversity.
• When all the species are functionally identical: d ij = 0 for i, j = 1, · · · , N , and in addition the number of species N is very large and the species are equally abundant p i = 1/N → 0, then FR = GSI → 1.
Now we consider the Hill number based taxonomic diversity TD q and functional diversity FD q (Q ), and define a parametric class of functional redundancy: For q = 1, For q = 1, Consider two extreme cases: (1) Zero redundancy: d ij = 1 for all species pair i = j. Then, we have d ij = 1 for all i and j, and Q = 1. For q = 1 Therefore, for d ij = 1 we have FR q (Q ) = TD q − FD q (Q ) = 0.
(2) Complete redundancy: d ij = 0 for all species pair i = j, then d ij = 0, Q = 0, FD q (Q ) = D q (Q ) · Q = 0, and FR q (Q ) = TD q is maximized.  Fig. 2a, this reference GCN is visualized at the order level for taxa nodes and the KEGG super-pathway level for functional nodes. The genome of each order is obtained by averaging the genomes of all the species belonging to that order. The bar height of each order corresponds to average genome size of the species belonging to that order. The thickness of a link connecting an order and a KEGG super-pathway is proportional to the number of KOs that belong to that super-pathway, as well as the genomes of species in that order. The network properties of this reference GCN are shown in main text Fig. 2b-e, where each taxon node represents a species, and each functional node represents a KO.

Body site-specific GCN
For each metagenomic sample, its taxonomic profile at the strain level is inferred by using MetaPhlAn2 [8]. For the strains identified in each sample, we first look up the HMP reference genomes in the IMG/M-HMP data mart. If all the strains are included in this data mart, we can directly construct the GCN for this sample. If some of the strains in the sample are not included in the IMG/M-HMP data mart, we will refer to the IMG/M data mart in the IMG data base. The IMG/M data mart includes more than 80,000 genomes for bacteria, archaea, eukarya, viruses, and so on. For our study, the annotated genomes of all the strains can be found and downloaded from the IMG/M data mart. In this way, the GCN for each sample can be constructed. By consider the pool of strains that appear in all the samples from a certain body site, we can naturally construct a body site-specific GCN. This is equivalent to merging all the sample-specific GCNs for samples from the same body site to a large network. Consequently, in our calculations ( Fig. 3a-b in the main text), for each body site, we just randomized the body site-specific GCN, rather than those sample-specific GCNs (which are just the subgraphs of the body site-specific GCN).

Taxonomic profiling of SMS data
There are two different approaches for taxonomic profiling of SMS data. The first one relies on comparisons to reference genomes, and the second one is de novo identification and assembly of genomes without using reference genomes. In this work, to check if our main results are sensitive to the taxonomic profiling approaches, we used a representative pipeline of each approach to perform taxonomic profiling of SMS data. Here we talk about the first approach. The second approach will be described in Sec.2.2. MetaPhlAn is a computational tool for profiling the composition of microbial communities from SMS data. It uses clade-specific marker genes to unambiguously assign reads to microbial clades. The first version of MetaPhlAn (MetaPhlAn1 [9]) relies on unique cladespecific marker genes identified from ∼3,000 reference genomes, allowing species-level resolution for bacterial and archaeal organisms. The enhanced or second version of MetaPhlAn (MetaPhlAn2 [8]) relies on ∼1M unique clade-specific marker genes identified from ∼17,000 reference genomes (∼13,500 bacterial and archaeal, ∼3,500 viral, and ∼110 eukaryotic). It complements the original species-level profiling method with a system for profiling the composition of microbial communities (Bacteria, Archaea, Eukaryotes and Viruses) from SMS data with strain-level resolution. In our study, we used MetaPhlAn2 with its default settings. Especially, the sensitivity argument for the mapping process is set to "very-sensitive". More information can be found at http://huttenhower.sph.harvard.edu/metaphlan2. Note that the samples with less than 5 strains with known genomes were excluded from downstream analysis.

Without using reference genomes 2.2.1 Construction of GCN
Recently, Nielsen et al. developed a powerful method to interpret metagenomic data at the level of individual genomes without relying on reference genomes [10]. In this method, individual genomes are assembled based on clustering of co-abundant genes. First, multiple metagenomic samples (e.g., the gut microbiome samples of hundreds of people) are sequenced, and reads are assembled into genes. For each sample, the abundance of each gene is quantified by the coverage of metagenomic reads. Then, those groups of genes that co-vary in abundance, i.e., show similar abundance across all the metagenomic samples, are identified. A key idea of this method is that those co-abundant gene groups (CAGs) are inferred as belonging to the same genome of a biological entity such as species. Hence, a natural by-product of this method is the GCN. Both sample-specific and body site-specific GCNs can be naturally constructed from those CAGs.

De novo taxonomic profiling based on co-abundance gene groups
Segregating a metagenome into groups of genes that have similar abundance cross a collection of samples allows us to identify microbial genomes without using reference sequences [10].
First, a nonredundant metagenomic gene catalog was constructed. To achieve that, one can employ the MOCAT pipeline [11] to assemble the raw sequencing reads from the MetaHIT metagenomic samples into scaftigs (i.e., continuous sequences within scaffolds). In particular, one can use FastX (Http://http://hannonlab.cshl.edu/fastx toolkit/index.html) to filter raw sequencing reads with a quality cutoff of 20, and discarded reads shorter than 30 bp. Then SOAPdenovo (version 1.05) [12] can be used to assemble high-quality reads into scaftigs. One can then use MetaGeneMark [13] to predict genes on those scaftigs longer than 500 bp, and then use BLAT [14] to cluster the predicted genes from all samples. Those genes with identity greater than 95% and covering more than 90% of the shorter genes will be clustered together. Finally, one can discard cluster representatives shorter than 100 bp, resulting in a set of nonredundant genes. One can further discard those genes that were likely originated from human, animals or plants from this set of nonredudant genes. The remaining genes form the reference gene catalog.
Second, the abundances of the reference genes can be quantified. One can use the screen function in MOCAT [11] to map high-quality reads to the reference gene catalog. One can subsequently filter those mapped reads using a 30-bp length and 95% identity cutoff. The soap.coverage script [15] is then used to calculate the gene-length normalized base counts. For those samples with 11M or more sequence reads, one can randomly draw 11M sequence reads without replacement. Those randomly drawn reads are mapped to the gene catalog, and the number of reads is counted to form a downsized depth or abundance matrix. One can use the 11M downsized depth matrix to estimate the abundance of co-variance gene groups (CAGs).
Finally, co-abundance clustering is performed. This is achieved by canopy clustering, a very simple and fast method for grouping objects into clusters [16]. Among those not-yetclustered genes, one can pick a seed gene and segregate those genes whose abundance profiles are within a fixed distance from that of the seed gene into the seed canopy. The distance can be chosen as Person correlation coefficient (PCC) > 0.9 and Spearman's rank correlation coefficient > 0.6. This process is performed iteratively to obtain many canopies. For each canopy, its median abundance profile is calculated. Any two canopies are close enough, i.e., their median abundance profiles are within a distance of 0.97 PCC from each other are merged. The resulting canopies will be further tested based on the following criteria: (1) it has two or less genes; (2) any three samples constitute 90% or more of the total canopy abundance signal; (3) the median abundance profile is detected in less than four samples; (4) one sample makes up 90% of the total signal. Only those canopies that pass all these criteria are called CAGs. And those CAGs that have more than 700 genes are referred to as metagenomic species (MGS) or just species.

Null models 3.1 Randomizing the GCN
To identify the structural features of the GCN that detemine the functional redundancy and functional diversity of microbial communities, we randomized the GCN using four different randomization schemes, yielding four different null-GCN models: • (Null-GCN-1) Complete randomization. We keep the number of taxa (N ) and number of genes (M ) unchanged, but otherwise completely rewire the links between taxa and genes. The randomized GCN displays Poisson-like degree distributions for both taxonand gene-nodes.
• (Null-GCN-2) Taxon-degree preserving randomization. We keep N , M , and the degree of each taxon node unchanged, but selects randomly the gene nodes that link to each taxon. The randomized GCN will have exactly the same taxon degree distribution as the original GCN, but a Poisson-like gene degree distribution.
• (Null-GCN-3) Gene-degree preserving randomization. Here we keep N , M , and the degree of each gene unchanged, but selects randomly the taxa that link to each gene. The randomized GCN will have exactly the same gene degree distribution as the original GCN, and a Poisson-like taxon degree distribution.
• (Null-GCN-4) Gene-degree and taxon-degree preserving randomization. Here we keep N , M , and the degree of each gene and the degree of each taxon unchanged, but randomly rewire the links between taxon nodes and gene nodes. This is achieved as follows. In each step, we randomly pick two links (i, a) and (j, b), where the taxon nodes i = j and the gene nodes a = b. Then we rewire those two links by switching their taxon nodes (or gene nodes), yielding two new links (i, b) and (j, a). In case gene-a (gene-b) already exists in the genome of taxon-j (taxon-i), we simply increase its copy number by one. We repeat this process until every link has been rewired for at least once. The gene degree distribution and the taxon degree distribution of the randomized GCN will be exactly the same as those of the original GCN.
Note that the weight of each link in the original GCN is the gene copy number, which is a non-negative integer. In all the randomization processes, we treat the weighted link as multiple links of unit weight.

Randomizing the microbial composition
To identify how the microbial composition contributes to the functional redundancy and functional diversity, we randomized taxonomic abundance profile using three different randomization schemes, yielding three different null-composition models: • (Null-composition-1) Microbial assemblage randomization. For each sample, we randomly choosing the same number of strains from the strain pool but keep the strain abundance profile unchanged to generate a randomized profile. Specifically, for the abundance table of each body site, we randomly permute the table elements along each column (each sample) to generate a randomized abundance table.
• (Null-composition-2) Strain abundance randomization I. For each sample, we keep the microbial assemblage or collection unchanged, and randomize their abundance. Specifically, in this scheme, we randomize the abundance table of each body site through random permutation of non-zero abundance along each column (each sample) across different strains.
• (Null-composition-3) Strain abundance randomization II. For each sample, we keep the microbial assemblage or collection unchanged, and randomize their abundance. Specifically, in this scheme, we randomize the abundance table of each body site through random permutation of non-zero abundance along each row (each strain) across different samples.
Note that for Null-composition-2 and Null-composition-3, the randomized abundance tables are not normalized even though the original abundance table are normalized. Therefore, after the randomizations, we normalize them again before the calculations of FR, FD, and TD.

Genome evolution model 4.1 Model description
To reproduce the key topological features of the real GCN (e.g., highly nested structure, fat-tailed gene degree distribution, etc), we developed a simple genome evolution model, which consists of the following steps: Step-1 At time t = 0, we set the initial GCN as a random bipartite graph with n 0 species and m 0 genes. A species is randomly connected to a gene with probability p 0 . In our simulation, we chose n 0 = 500, m 0 = 200, and p 0 = 0.8.
Step-2 At each time step t > 0, a species i is chosen with probability: is a tuning parameter, and k i is the degree (i.e., the genome size) of species i. Then the genome of species i will be updated based on one of the following three events: Event-I: With probability q HGT , horizontal gene transfer (HGT) occurs. A donor species is randomly selected. Then one of the donor species' genes a that is not in the genome of the recipient species is transferred to the recipient species' genome.
In the GCN, a link is added between this gene node and the recipient species node.
Event-II: With probability q gg , gene gain occurs. A new gene is added to the genome of species i. In the GCN, we connect this new gene node to the species node i.

Event-III:
With probability q gl , gene loss occurs. A gene a in the genome of species i is randomly selected to be lost. In the GCN, a link is deleted between this gene node and the species node i.
Step-3 Repeat Step-2 until the system reaches a desired time step. In our simulation, we typically let the GCN evolve for 5 × 10 5 steps.
Note that the three events in Step-2 are not fully independent, because q HGT +q gg +q gl = 1. Moreover, for any h > 0 species with larger genome sizes are more likely to be chosen, implying that selection pressure is implicitly considered in our model. The case h = 0 corresponds to the neutral model, where all the species have equal probability to be chosen to update their genomes.

Simulation results
We have systematically tested that our simple genome evolution model can evolve into a relatively stable state (Supplementary Figure 11), where the constructed GCN displays all the desired topological features as observed in the real GCN, i.e., highly nested structure (NODF ∼ 0.7), fat-tailed gene degree distribution, Poisson-like species degree distribution, and unimodal functional distance distribution. Those key topological features can also be reproduced by other parameter setting, e.g., with different gene gain rate q gg (Supplementary Figure 12). Finally, we point out that if n = 0 or n is too large, both the incidence matrix of GCN and the functional distance distribution P (d ij ) will be quite different from that observed in the real GCN. This implies that moderate selection pressure is needed to reproduce key topological features of the GCN (Supplementary Figure 13), and consequently favor high functional redundancy. The box plots of nFR q were shown for the real GCN (black box), as well as the randomized GCNs (colored boxes) using four different randomization schemes: complete randomization (blue); species-degree preserving randomization (red); KO-degree preserving randomization (green); species-and KO-degree preserving randomization (yellow). (Lower panels), The box plots of nFR q were shown for the real microbial composition (black box), as well as the randomized microbial compositions (colored boxes) using three different randomization schemes: Randomized microbial assemblage generated by randomly choosing the same number of species from the species pool but keeping the species abundance profile unchanged (blue); Randomized microbial abundance profiles through random permutation of non-zero abundance for each sample across different species (red); Randomized microbial abundance profiles through random permutation of non-zero abundance for each species across different samples ( -1, B-1, C-1) The incidence matrix of the GCN, where the presence (or absence) of a link is colored in yellow (or blue), respectively. We organized this matrix using the Nestedness Temperature Calculator (NTC) to emphasize its nested structure [26]. Note that NTC is computationally intensive. Here, for visualization purpose, in ( Microbiome samples from the MetaHIT [10,19] (for gut, n = 177 samples), as well as HMP [20,21,22] (for six different body sites: gut, n = 549 samples; anterior nares n = 87 samples; buccal mucosa n = 368 samples; tongue dorsum, n = 418 samples; retroauricular crease, RC, n = 36 samples; posterior fornix n = 52 samples), were analyzed. (A-B), The box plots of normalized function redundancy are shown for the real GCN (black box), as well as the randomized GCNs (colored boxes) using four different randomization schemes: complete randomization (Null-GCN-1); species-degree preserving randomization (Null-GCN-2); COG-degree preserving randomization (Null-GCN-3); species-and COG-degree preserving randomization (Null-GCN-4). (C-D), The normalized function redundancy is calculated for the real microbial composition (black box), as well as the randomized microbial compositions (colored boxes) using three different randomization schemes: Randomized microbial assemblage generated by randomly choosing the same number of species from the species pool but keeping the species abundance profile unchanged (Null-composition-1); Randomized microbial abundance profiles through random permutation of non-zero abundance for each sample across different species (Null-composition-2); Randomized microbial abundance profiles through random permutation of non-zero abundance for each species across different samples . Boxes indicate the interquartile range between the first and third quartiles with the central mark inside each box indicating the median. Whiskers extend to the lowest and highest values within 1.5 times the interquartile range. Statistical analysis was performed using a two-sided Wilcoxon signed rank test. Significance levels: FDR-corrected p-value<0.05 (*), <0.01 (**), <0.001 (***), <0.0001 (****); >0.05 (NS, non-significant). See Source data for the exact FDR-corrected p-values. Supplementary Figure 5: Structural properties of the body site-specific genomic content networks (GCNs) constructed from two metagenomic datasets: MetaHIT [10,19] and HMP [20,21,22]. A, MetaHIT, gut (stool); B, HMP, gut (stool); C, HMP, airway (anterior nares); D, HMP, oral (buccal mucosa); E, HMP, skin (retroauricular crease); F, HMP, vaginal (posterior fornix). (column-1), The incidence matrix of the GCN shown at the species-KO level, where the presence (or absence) of a link between a species and a KO is colored in yellow (or blue), respectively. We organized this matrix using the Nestedness Temperature Calculator to emphasize its nested structure.
(column-2), The unweighted species degree distribution. (column-3), The unweighted KO degree distribution. (column-4), The nestedness based on the NODF measure of the real GCN (gray bar), as well as the randomized GCNs (colored bars) using four different GCN randomization schemes: I, complete randomization (blue); II, Species-degree preserving randomization (red); III, KO-degree preserving randomization (green); IV, Species-and KO-degree preserving randomization (yellow). For each randomization scheme, 100 realizations are generated, and the standard deviation is smaller than the symbol size (pentagram). See SI Sec. 3 for details. (column-5), The distribution of functional distances (d ij ) between different species of the real GCN (black lines), compared with the randomized GCNs (colored lines) using the same GCN randomization schemes as in Column-4. We generated 100 realizations for each randomization scheme, and the bin size is 0.02. For each bodysite, the housekeeping KOs are considered as the KOs that are present in a large fraction of the existing genomes. Here we used three fraction thresholds: 99%, 95%, and 90% to define housekeeping KOs. Denote the functional redundancy calculated from the real GCN as FR, and the functional redundancy calculated from the real GCN with housekeeping KOs removed as FR woKO . Then the contribution of housekeeping KOs to the functional redundancy of a microbiome sample, denotes as FR KO , can be quantified as FR KO = 1 − FR woKO /FR. Apparently, if we relax the fraction threshold from 99% to 90%, FR KO will increase. But even with the threshold 90% (i.e., those housekeeping KOs appear in the genomes of 90% species), FR KO is still generally smaller than 30%. This indicates that the housekeeping KOs only contribute a small part of the functional redundancy observed in human microbiome samples. Boxes indicate the interquartile range between the first and third quartiles with the central mark inside each box indicating the median. Whiskers extend to the lowest and highest values within 1.5 times the interquartile range.   (colomn-1), The incidence matrix of the GCN shown at the species-KO level, where the presence (or absence) of a link between a species and a KO is colored in yellow (or blue), respectively. We organized this matrix using the Nestedness Temperature Calculator to emphasize its nested structure [26]. The nestedness based on the NODF measure [27] of the real GCN (gray bar), as well as the randomized GCNs (colored bars) using four different GCN randomization schemes: I, complete randomization (blue); II, species-degree preserving randomization (red); III, KOdegree preserving randomization (green); IV, KO-degree preserving randomization (yellow). For each randomization scheme, 100 realizations are generated, and the standard deviation is smaller than the symbol size (pentagram). (F), The distribution of functional distances (d ij ) between different species of the real GCN (black lines), compared with the randomized GCNs (colored lines). We generated 100 realizations for each randomization scheme, and the bin size is 0.02. See SI Sec. 3 for details of GCN randomizations. See SI Sec. 2.2 for details of the GCN construction through de novo taxonomic profiling.