Introduction

The role of many genes remains unknown. Even in well-investigated model organisms, a quarter or more of the genes have poorly characterized function. With the advance of genome sequencing techniques, the vast amounts of accumulated data provide an opportunity to infer gene function using computational methods. While such methods occur in many varieties, one widespread approach is to examine the composition of gene neighborhoods that occur across genomes. Then, following the principle of guilt-by-association, a function of a gene is inferred by transferring it from its neighbors in the genome1,2. Gene neighborhoods that are conserved across multiple genomes provide additional confidence in each inference, often yielding highly accurate predictive modes of gene function3,4,5,6,7. One important biological mechanism that underlies similarity of gene function in neighboring genes is that they are often regulated by common factors and therefore co-expressed (reviewed in8,9). The prime example is the prokaryotic operon, where a single mRNA harboring multiple protein-coding regions is transcribed from a promoter, ensuring that expression of such functionally related proteins is coordinated. However, this concept extends more broadly – neighboring genes that are not part of the same operon are also often co-regulated and share function. Moreover, in eukaryotic organisms, which generally lack operons, gene regulation is also organized regionally and this pattern can be conserved across evolutionary time10,11,12,13,14. Consistently, gene function is non-randomly distributed also across eukaryotic chromosomes15,16, even though the neighborhood signal is overall more subtle than in prokaryotes, important exceptions notwithstanding (reviewed in17).

The current computational methods that use conserved gene neighborhoods to predict gene function rely on the principle that similar functions cluster together on the chromosomes. While there is abundant evidence that this is the case, we were intrigued by known individual examples of conserved clustering of genes with apparently unrelated functions. For instance, a metabolic gene (FAD synthase) was reported to hitchhike with clusters of protein translation-related genes, and RNA modification/degradation genes were found to hitchhike with signal transduction genes18,19. We asked if these examples represent a broader trend, which could be tapped into to better predict gene/protein function and phenotypes from genomic sequence. More specifically, we searched for pairs of gene functions that are highly dissimilar, according to the structure of the Gene Ontology, yet that systematically cluster in genomic neighborhoods. If found, such clustering patterns would be able to predict gene function by drawing on information which is not accessible to previous automated approaches, which propagate a particular gene function across genomic neighborhoods. If this type of clustering of diverse functions were widespread i.e. if it affected many genes, it would provide a basis for a general methodology to infer gene function that relaxes the requirement for homogeneity of function across neighborhoods.

Here, we examined the functional composition of genomic neighborhoods across 1669 prokaryotic, 49 fungal and 80 metazoan genomes (for more details see Supplementary Material 1, Section S1), finding that indeed it is a very common occurrence that certain pairs of unrelated functions cluster together. Almost all gene functions are significantly enriched in the genomic neighborhood of one or more semantically distant gene functions. Drawing on this signal to infer gene function results in 3.5-fold increase in predictive power (estimated by the information accretion criterion) over a naïve guilt-by-association approach, and an 1.4-fold increase over a state-of-the-art network approach. In addition to predicting gene function, accounting for the complementarity patterns in genomic neighborhoods enhances prediction of phenotype from copy number-neutral structural variation in genomes of individuals. Our work highlights a widespread pattern in how gene function is organized across chromosomal domains. This has implications for understanding genome evolution and brings practical benefits for methods to predict gene function and phenotype from the genome sequence.

Results

Widespread clustering of diverse gene functions across chromosomes

It is known that functionally related genes reside in the same genomic neighborhood more often than expected at chance3,4,5,6,7,15,16,17. We employed a simple framework to systematically quantify the extent of such co-occurrence in neighborhoods across many genomes, separately for individual gene functions. In brief, we used COG (clusters of orthologous groups) and NOG (non-supervised orthologous groups) gene families, to each of which we assigned a set of gene functions, herein represented by Gene Ontology (GO) terms. Of note, here we refer to function in a general sense, encompassing all three sub-ontologies of the GO: biological process (BP), molecular function (MF) and cellular component (CC). These assignments of function to COG/NOG gene families were based on the known functions of the constituent genes of each COG/NOG (henceforth jointly referred to as COG) (Methods). The mapping of COG gene families to GO terms allowed us to determine whether a certain GO term x is enriched in the genomic neighborhood of any gene assigned to this same function x, normalized to the prevalence of x outside such neighborhoods, examined across 1669 prokaryotic genomes. This yields an odds ratio (ORx), which describes the effect size of the enrichment of each function GO x in a neighborhood of size k of other genes having the same GO x. Given that operons tend to be short20, we present the data for prokaryotes with k = 2 (we also provide analyses covering larger neighborhood sizes in Supplementary document 1, Figure S27, relevant in the light of work suggesting that bacterial functional neighborhoods may extend up to 20 kb21,22). The neighborhood is defined here as two genes to each side of a central gene, corresponding to a total size of five genes–one central and four flanking (irrespective of orientation). We were able to examine a total of 1048 GO terms that occured in at least 5 COG gene families. Out of these GO terms, 81.5% had neighborhoods significantly enriched in that same GO term (ORx > 1 at FDR < 20%; Z-test for significance of log odds ratio; Methods) and in 78.8% was the enrichment significant and also higher than twofold (ORx ≥ 2). In the usually larger eukaryotic genomes, we considered neighborhoods of k = 5 genes to each side of a central gene, and examined the 2617 GO terms (fungi) and 2336 terms (metazoa) that occurred in at least 3 gene families (for more detailed analyses of different neighbourhood size see Supplementary document 1, Figure S27). Across eukaryotes, 35.1% of the analyzed GO terms are significantly enriched in their own neighborhoods across 49 fungal genomes, and 99.1% of GO terms across 80 metazoan genomes. These enrichments and significance calls are upheld by comparing them to enrichments computed on randomized data (see Methods and Supplementary material 1, Table S3). The above data are consistent with the known clustering of genes with similar functions across genomes15,23,24 and demonstrate that our approach can be used to detect function enrichment in genomic neighborhoods.

Next, we applied the same method to exhaustively test for enrichment of pairs of diverse GO functions in genomic neighborhoods (see Fig. 1). In particular, we measured the enrichment (as odds ratio, ORxy) of the genes annotated with a GO term y that are near genes with GO term x, again in a neighborhood of k = 2 genes to each side thereof. Indeed we found that 2.9 × 105 of 1.1 × 106 total examined pairs of GO terms in prokaryotes are significantly enriched (FDR < 20%, Z-test on log ORxy). This means that all of the tested GO terms were significantly enriched in the neighborhood of at least one non-self GO term (the significance calls are largely supported in a randomization test, see Methods and Supplementary material 1, Table S3). Mirroring the trend in prokaryotes, 1.0 out of 6.8 million, and 1.6 out of 5.4 million considered GO term pairs were significantly enriched in fungi and in metazoa, respectively.

Figure 1
figure 1

Enrichment of diverse gene functions is widespread in genomic neighborhoods. (a) Distribution of neighborhood enrichment scores (log odds ratio, log OR) for all pairs of GO functions on original and randomized genomes of prokaryotes, fungi and metazoa. See also Supplementary 1, Figures S8, S11, S12 and Supplementary 1, Table S3. Pairs with OR = 0 are not shown on graphs (see Methods; these pairs result in artefactually high or low log OR values after continuity correction). Information about the statistical significance of difference in distribution shape between observed and randomized distribution is expressed by the Kolmogorov-Smirnov D statistic and the corresponding p-value. (b) Number of GO terms that are semantically distant, but significantly enriched in genomic neighborhoods (FDR ≤ 10%) of each GO term, summarized in histograms for prokaryotes, fungi and metazoa. GO term pairs with Resnik similarity <1 (for prokaryotes) and RS < 2 (for eukaryotes) from the ‘biological process’ GO sub-ontology are tallied in the figure.

By design, GO has many terms which describe very similar concepts and it is conceivable that a gene family might be assigned to either a GO term x or a similar GO term y due to noise in the annotation process. This has the potential to inflate the number of non-self GO terms we observe mutually enriched in neighborhoods. We have therefore filtered out related GO term pairs by using a measure of semantic similarity (Resnik similarity, RS, see Methods), which is defined using the structure of the Gene Ontology graph. Of note, because semantic similarity is only defined within each GO sub-ontology but not across sub-ontologies, we present data for the ‘biological process’ GO terms. By conservatively requiring RS < 1 (prokaryotes), we effectively restrict to GO term pairs in the different branches of the ‘biological process’ GO graph, removing 60% of the pairs in our prokaryotic data (details in Methods). Even in this remaining set of semantically distant pairs of functions, there exist approximately 50 thousand (out of 180 thousand examined) significantly enriched pairs of GO terms in genomic neighborhoods. For eukaryotes we required RS < 2 to call dissimilar GO terms (see Methods for justification). Consistently with data in prokaryotes, 0.3 million out of 1.8 million and 0.4 million out of 1.3 million semantically distant BP GO term pairs were significantly co-occuring in genomic neighborhoods. These high proportions mean that almost all gene functions are significantly enriched in the neighborhood of at least one other semantically distant gene function (100% of BP GO terms in prokaroytes, at FDR = 10%; 96.4% in Fungi, 100% in Metazoa).

The effect sizes of such enrichments may be substantial: 72.1% of the observed log OR scores in prokaryotes are below 1st or above 99th percentile of log OR scores (corresponding to OR < 0.761 and OR >1.36 respectively) in randomized genomes (Supplementary material 1, Table S3). The distributions of enrichments in neighborhoods of four example gene functions are shown in Fig. 2, illustrating how there exist pairs of dissimilar functions that have neighborhood enrichments comparable to or higher than the enrichment of a gene function in its own neighborhood (histograms including statistical significance calls are shown in Supplementary 1, Figure S7). Finally, we examined separately a group of free-living microbes and a group of microbes that are pathogenic in mammals (labels from ProTraits phenotype database25, see Methods). Both sets of microbes yielded significantly different ORxy than obtained on randomized data for various pairs of gene categories (p < 2.2·10−16) and the substantially higher spread from randomized data was observed in both sets of microbes (Supplementary document 1, Figure S45). Therefore the enrichment of dissimilar gene functions in neighborhoods does not appear specific to certain ecological preferences but instead represents a general trend.

Figure 2
figure 2

Semantically distant GO terms can be as strongly enriched in gene neighborhoods as the semantically close GO terms. Four example GO terms of the ‘Biological process’ ontology are shown. Histograms show numbers of GO terms at a certain log odds ratio (log OR) of the enrichment in gene neighborhood (for prokaryotic genomes). The GO terms in neighborhoods of a central GO function are broken down into three groups: the “CLPar” group (the central function itself plus all its parent functions in the GO graph), “CLMed” group (functions with Resnik semantic similarity >2 with the central function) and “Dist” group (Resnik ≤2 with the central function). Instances of GO terms in the dissimilar “Dist” group and in the non-self “CLMed” group can be observed that have enrichments as high or higher than the self-enrichments (the “CLPar” functions, arrows on the plot).

Examples of semantically dissimilar gene functions enriched in genomic neighborhoods

For instance, there is a trend in gene neighborhood organization where the GO term “carbohydrate metabolism” is 2.12-fold enriched in the neighborhood of the semantically unrelated GO term “carbohydrate transport” (of note, the magnitude of this enrichment is similar to that of “carbohydrate metabolism” in its own neighborhood – 2.53-fold). This co-occurrence is reminiscent of the textbook example of the lac operon in Escherichia coli, where the lacY gene encodes lactose permease, which shuttles lactose into the cell, while the neighboring lacZ gene encodes the enzyme β-galactosidase, which breaks down lactose into monosaccharides. Our analysis suggests that the proximity of genes encoding carbohydrate transporters and carbohydrate metabolizing enzymes is a systematic trend. This can provide evidence for inferring the presence of hypothesized transporter genes located next to known metabolic enzyme genes, and vice versa.

Other examples include significant enrichments of “DNA topological change” (ORxy = 1.42) and also of “nucleoside bisphosphate biosynthesis” (ORxy = 1.41) in the neighborhood of gene families annotated with “response to DNA damage stimulus”. This suggests some manner of coordinated regulation between different DNA repair-related activities. Furthermore, “lipid biosynthesis” genes are often in the neighborhood of “localization within membrane” genes (2.43) and “carbohydrate biosynthesis” is in the neighborhood of “cell envelope organization” (2.10). These associations suggest coordinated activation of processes that generate building blocks of cellular structures, and subsequent processes that incorporate the building blocks into these structures. We highlight other examples of enriched GO term pairs for prokaryotic neighborhoods in Table 1 (see also Supplementary document 1, Table S2), while the exhaustive list is in Supplementary material 5.

Table 1 Examples of dissimilar gene function pairs enriched in genomic neighborhoods.

While all GO term pairs listed above either have low semantic similarity or are from different GO sub-ontologies, there are cases where GO terms, distant in the GO graph, may sometimes overlap in the set of genes assigned to them24. This has the potential to inflate the observed enrichments, which might be due to the same genes artefactually creating two apparently distinct functional neighborhoods. However after quantifying the overlap of genes assigned to the pairs of GO terms using Jaccard index, we find this not to be a common issue (Table 1; Supplementary Material 5). Furthermore, a randomization of gene positions supports that the observed enrichment of extreme OR values between distinct GO terms holds true for the semantically very close GO term pairs as well as for the semantically distant pairs (panels ‘CLPar’ and ‘Dist’ in Fig. S8). Overall, our data suggests that clustering of dissimilar gene functions – at least as defined by the GO – is very common in genomic neighborhoods, occurring to a comparable extent as the well-known clustering of similar gene functions in genomes.

It is well-known that functionally similar terms found in gene neighborhoods are often co-expressed. We asked if this extends also to the functionally dissimilar terms that cluster in gene neighborhoods. To address this, we stratified GO term pairs into bins by semantic similarity, and compared the neighborhood-cooccurring versus non-cooccurring pairs by similarity of expression profiles in E. coli (Methods). Expectedly, we observed that high functional similarity (irrespective of gene neighborhoods) implies higher co-expression than low functional similarity (Supplementary document 1, Table S10) and also that the highly similar function pairs were more strongly co-expressed when they are clustered in neighborhoods (3.2-fold difference in average Spearman correlation). Interestingly, we observed that also the lowly similar function pairs were more strongly co-expressed when clustered in neighborhoods than when they were not clustered (3.0-fold difference), even though the overall co-expression strength for lowly similar function pairs is not striking (Supplementary document 1, Table S10). This supports the notion that at least some of the diverse functions co-localized in gene neighborhoods are also co-expressed.

A general method to infer gene function based on neighborhood patterns

Having demonstrated that genomic neighborhoods are significantly enriched in certain combinations of diverse gene functions, we asked if such neighborhood patterns can be used to establish a general method that predicts gene function. To evaluate this, we compared two established methods that propagate a gene function to neighboring genes, with a novel classifier that can draw on neighborhood co-occurrence of diverse gene functions to predict GO terms for COG gene families. The first method is a simple k-nearest neighbors (kNN) classifier that transfers known functions to a COG gene family from the k neighboring gene families (with smallest average logarithmized distances across many genomes; Methods). The second classifier can transfer gene functions to neighbors additionally via indirect links: a gene network is constructed from neighborhoods and the gene function assignments diffuse across the links (using the GeneMania method26, herein referred to as the Gaussian Field Propagation (GFP) classifier). In addition to these known approaches, we introduced a third, novel classifier that can draw on both the enrichment of similar functions in neighborhoods and additionally the enrichment of semantically distant gene functions. For example, this method should be able to infer that a gene family deals with “carbohydrate metabolism” based on its neighbors being annotated with “carbohydrate metabolism” and additionally based on its neighbors dealing with “carbohydrate transport”, a semantically distant function in the GO graph. To this end, we employed a Random Forest classifier on a data set where examples are COG gene families, features are the normalized counts of every GO term in the neighborhood of that COG (across all genomes), and the target variable is the set of known GO term labels of the COG gene family; see Methods. In other words, from such a representation, the function of a gene can in principle be inferred from the presence of any function – or a combination thereof – in the genomic neighborhood of the gene, as long as such a pattern occurs commonly enough to be recognized by the algorithm. We named this approach the “neighborhood function profile” (NFP) classifier.

We evaluated the accuracy of all methods in a cross-validation test (using the out-of-bag error statistic provided by Random Forest; see Methods). The average area under precision-recall curves (AUPRC) for GO terms in the prokaryotic dataset is 0.153 for the 10-NN classifier and, expectedly, a much improved 0.199 score for the network-based (GFP) classifier. Both methods operate by transferring an annotation across gene neighborhoods, while the latter also uses indirect links to improve accuracy. However the NFP-based classifier, which can draw on diverse neighborhood patterns substantially improves over this, with a 0.266 average AUPRC in prokaryotes, a 34% increase (p < 10−10 for the increase over the next best method, one-tailed Wilcoxon test on AUPRC scores across GO categories; distributions of scores in Fig. 3, see also Supplementary 1, Figs. S16S18). Similarly so, in the two groups of eukaryotic organisms, the AUPRC scores were significantly improved using the novel NFP method: for Fungi, the 0.0460 (for the 10-NN) increased to 0.0545 (for the NFP; p < 10−15) and for Metazoa, the increase is 0.0228 (for 10-NN) to 0.0267 (for NFP, p < 10−15; Fig. 3). The network diffusion approach applied to gene neighborhood data in eukaryotes did not on average bring benefits over the simpler 10-NN (Supplementary 1, Figs. S19S24), therefore the latter is used as a baseline.

Figure 3
figure 3

The gene function profile of genomic neighborhoods enables a more accurate methodology to infer gene function. The distribution of area under the precision-recall curve (AUPRC) scores, measured in cross-validation, for all examined gene functions (GO terms) is represented in prokaryotes (top), fungi (middle) and metazoa (bottom). The methods compared are the nearest neighbor (NN) classifiers (1-NN, 3-NN, 10-NN), a network-based approach (Gaussian Field Label Propagation, GFP) and finally the novel Neighbourhood Function Profile (NFP) method. See Supplementary 1, Figures S17, S20 and S23 for the area under ROC curve (AUC) scores. P-values are from a one-tailed Wilcoxon signed-rank test.

High predictive power of the Neighborhood Function Profile (NFP) classifier

A more accurate classifier would be expected to provide a higher number of confident predictions. We quantified this increase provided by the predictive models based on NFP. In particular, we tallied the number of predictions (COG-GO term associations) made at a precision threshold of 50% (equivalent to 50% FDR) and additionally at a more stringent 80% (20% FDR) for the three classifiers; Table 2. This reveals remarkable, several-fold increases in the amount of predictions afforded by the NFP in prokaryotes, fungi and metazoa, when considering the more general gene functions (information content (IC) between 2 and 4). In the highly specific gene functions (IC > 4), which are usually of higher interest, there is a substantial increase in new annotations provided by NFP for prokaryotes and fungi, and a more modest gain in metazoa. We further examined the diversity of predictions: in particular, we asked if the new predictions afforded by NFP are largely added to the gene families that were already assigned predictions by the previous methods, or if NFP predictions instead cover new gene families. Our data suggested that the latter is the case (Supplementary 1, Tables S5S7), because the number of COG gene families receiving at least one prediction is higher in NFP compared to the baseline classifiers.

Table 2 Number of predictions (associations between a gene function and a COG gene family) obtained using different gene function prediction methods based on neighborhoods.

Our gene neighborhood classifiers, as implemented, provide function predictions at the level of COG gene families (exhaustive list given in Supplementary material 2,3.). We also provide data showing how this reflects in the number of genes receiving predictions in certain model organisms (Supplementary document 1, Tables S8 and S9). For instance, in Escherichia coli, at precision of 50%, the number of novel predictions (gene-function pairs) is 7572 for the network approach, and increases to 10559 provided by the novel NFP classifier; similarly so in the pathogen Staphylococcus aureus, increasing from 4314 to 5386 by use of NFP; all counts given for gene functions with IC > 4. Eukaryotes, consistent with more modest AUPRC scores (see above), provide overall fewer predictions, but increases from use of NFP are quite evident. The number of novel annotations at Pr = 50% substantially increased from 89 (10-NN) to 552 (NFP) for Saccharomyces cerevisiae and 87 to 282 for Schizosaccharomyces pombe, in the fungal predictor (Supplementary 1, Table S8). For metazoans, the increases were striking for the general gene functions (IC 2–4), with more than twofold higher number of predictions afforded by NFP at precision 50% for mouse, human or Drosophila melanogaster, compared to the next best method. The NFP gains were modest for highly specific gene functions with IC > 4 in Metazoa (Supplementary 1, Table S8). One possible explanation was the lower coverage with known functions (average 18 GO terms per gene in Metazoa versus 26 in Fungi in the databases we used; see Methods). This might prevent NFP from discovering complex association patterns between gene functions in neighborhoods, while the simpler kNN classifier is less affected.

We examined another measure of the utility of the predictive models, based on the information accretion (IA) criterion27. In brief, IA weighs the predictions such as to give higher scores to higher information content (rarer) GO terms; see Methods. By this method, in prokaryotes, kNN predicts 2.69 bits/gene novel information, the GFP network approach 6.89 bits/gene, while the NFP increases this to 9.44 bits/gene; all given at precision = 50% (data at 50% and 80% thresholds are in Table 3, for visualization of proportions see Supplementary 1, Figs. S30 and S31). Therefore, we estimate that NFP brings a 37% increase in coverage with predicted gene functions over a state-of-the-art genomic neighborhood method. This result is mirrored in two groups of eukaryotes we tested: in fungi, the increase was 0.70 (kNN), 1.37 (network) and 2.00 (NFP) bits/gene, while in Metazoa it was 0.70 (kNN), 0.99 (network) increasing to 1.25 bits/gene in the NFP method (Table 3, Supplementary 1, Figs. S32S35). In other words, in eukaryotes, the novel NFP method increases the predictive power by 46% (Fungi) and 26% (Metazoa) over a state-of-the art network approach for propagating gene function across neighborhoods.

Table 3 Amount of predicted information on gene function, measured using the ‘information accretion’ methodology and expressed as bits per gene.

NFP accuracy is augmented by semantically distant functions

The above data indicate that a classifier based on the NFP – an exhaustive description of the composition of gene functions in a genomic neighborhood – provides high accuracy and yields many additional function predictions. Next, we asked if this increase in accuracy of the NFP is due to the semantically distant GO terms in gene neighborhoods. To this end, we examined 11 example gene functions, which have other semantically dissimilar functions enriched in their neighborhoods (by our ORxy measure, see above). As expected, the NFP classifier strongly outperforms the baseline kNN and the GFP classifiers for these functions (Supplementary 1, Fig. S28). Then, we created partial NFPs, which contained only the semantically distant functions (without ancestors), compared to the function being predicted (“Dist/Par”, having RS < 2). We contrasted this to the partial NFPs that contain only the function being predicted itself and its ancestors that are semantically close (“CLPar”, parents in the GO graph having RS ≥ 4 plus the GO function itself). Random Forest classifiers were trained on the two kinds of partial NFPs and cross-validation accuracy compared (Fig. 4). For 9 out of 11 functions, a higher accuracy was obtained from neighborhood features describing semantically distant functions (“Dist/Par”, mean AUPRC = 0.047 across gene functions) than from features describing the target function and its close parents (“CLPar”, mean AUPRC = 0.035). The latter is an implementation of the guilt-by-association principle using the same classifier and the same data representation; for 8 out of 11 the increase was significant (p < 0.0002; Wilcoxon one-sided paired test). For all 11 functions, using the full set of neighborhood features, including the close, intermediate-distance, and distant functions, significantly outperforms the default model using only close features. This suggests a benefit to predictive accuracy also from including the intermediate-distance functions. This analysis demonstrates that even when using the same statistical methodology (Random Forest) and same type of data representation (NFP), the presence of semantically distant functions in genomic neighborhoods is often highly predictive of gene function. For more details on the experimental setup see Supplementary document 1, Section S2.

Figure 4
figure 4

Semantically distant functions in gene neighborhoods are important for accurate inference of gene function. Bars show accuracy (as AUPRC score, measured in crossvalidation) for predicting the eleven representative gene functions, using various types of neighborhood function profiles (NFP) that are listed in the legend. The “Full profile” are the full NFP of the ‘biological process’ GO graph, while the “CL/CLPar”, “Med/Par” and “Dist/Par” represent the partial NFP consisting only of close, medium-distance and distant functions, respectively (the “/Par” denotes that parent GO terms of the target functions were removed). The “CLPar” partial profiles contain only the selected function and its semantically close parents, meaning that “CLPar” is an implementation of the standard approaches that transfer functions across neighborhoods. In many cases, the close (but non-self), medium-distance and distant functions are more predictive than CLPar, and the complete profile is the most predictive. Serving as a control, the removal of the significantly enriched functions (labeled as “/Enr” in the legend) from the partial NFP strongly reduces accuracy, either for the close functions (CL), the medium-distance (Med) or the distant functions (Dist). Bars are average AUPRC scores of 200 runs of cross-validation of the Random Forest classifier, whereas error bars show standard deviation across the 200 runs.

Generalizing this principle, we find that integrating all three GO sub-ontologies in a common predictor can provide further increases to accuracy (see Supplementary Document 1, section S3.10).

Validation on external data sets

Our gene neighborhood-based NFP predictive models used cross-validation to obtain overall estimates of accuracy, and additionally to estimate the FDRs for thousands of individual predictions made in prokaryotes and eukaryotes (Supplementary material 2 and 3). To further support these estimates, we analyzed an external dataset of gene functions derived from the Critical Assessment of Function Prediction (CAFA 2) challenge data28 (https://biofunctionprediction.org/cafa/) (Supplementary 1, Section S3.9). Indeed, also on the external validation set, the NFP approach (average AUPRC: 0.207) again outperforms the 10-NN method (AUPRC: 0.130); the difference is significant at p < 2.2 × 10−16 by Wilcoxon test, one-tailed). The average AUPRC of 0.207 on the external validation set is broadly consistent with the average AUPRC of 0.266 on crossvalidation. The AUPRC scores for the individual GO categories were significantly correlated between crossvalidation and the external set (Supplementary 1, Fig. S36).

External validation of eukaryotic predictive models also shows higher accuracy for NFP (average AUPRC: 0.065 in Fungi, 0.025 in Metazoa), compared to the (next best) 10-NN approach (0.055 in Fungi, 0.020 in Metazoa); the differences are significant at p < 5 × 10−16 in Fungi and Metazoa. These scores for the NFP predictions on the external set are largely similar to those originally obtained in crossvalidation for the two groups (external 0.065 and 0.025 for Fungi and Metazoa, versus crossvalidation 0.0545 and 0.0267 respectively; Supplementary 1, Fig. S37).

In summary, external validation supports that NFP outperforms previous approaches that propagate gene functions across neighborhoods, and additionally provides credibility to the cross-validation estimates of accuracy. The set of predictions that we supply as Supplementary material 2,3 may be used to prioritize further validation work. The FDR score provided for each prediction allows making informed decisions on prioritizing the predictions to validate.

Gene neighborhood composition can predict phenotype of individuals

Predicting phenotype from the genome sequence of an individual is a central goal in modern genetics. While single-nucleotide variants and indels are commonly considered in such analyses29,30, structural variants also have considerable potential to affect gene regulation and may therefore bear on the phenotype. Encouraged by the high accuracy of the NFP classifiers in predicting gene function based on gene neighborhood composition, we therefore asked if a related method could be used to infer phenotype from gene order observed in individuals in a population. We focused on prokaryotes, for two reasons. First, the NFP classifiers were more accurate for prokaryotes, which is likely at least in part due to a much larger set of sequenced taxa currently available. Second, copy number-neutral structural variants are known to be abundant even between closely related microbial strains and affect the major part of the genome therein. Moreover, our recent work has shown that across prokaryotic species, many phenotypes are strongly statistically associated with certain gene neighborhoods25. This motivated us to examine to what extent this holds true also for individuals (strains) of one species and to what extent are the associations with neighborhoods predictive. We have therefore examined a previous data set of 696 naturally-occurring E. coli strains that have been systematically experimentally tested for 151 phenotypes (Methods), such as the ability to metabolize certain substrates or the resistance to a variety of toxins and antibiotics30. In the original work, occurrence of deleterious variants, such as nonsense variants or frameshifting indels in certain genes, was associated with specific phenotypic outcomes.

Here, as a baseline, we use conditional scores (CS) of Galardini et al., which are an estimate of gene disruption in a particular strain, combined with the phenotypes that are known to result from loss-of-function mutations for each gene30. Upon computing the AUC and AUPRC predictive performance measures for each phenotypic trait (here encoded as a binary outcome; see Methods) based on the CS, we obtained the median AUC of 0.672 (0.591–0.736; Q1–Q3) across the 151 phenotypes (Fig. 5a). Using the CS as input to a Random Forest algorithm yields slightly better performance with a median AUC of 0.679 (0.473–0.796; Q1–Q3), however the difference in AUC score distribution is is not significant (p = 0.276, one-sided Wilcoxon signed-rank test); see Supplementary document 1, Section S3.11.

Figure 5
figure 5

Predicting phenotypes of individuals from the effects of structural variants on the composition of gene neighborhoods. (a) Distribution of predictive models’ AUC scores (top-left) and AUPRC scores (top-right) across 151 Escherichia coli phenotypes, estimated in crossvalidation. The baseline classifier predicts phenotype from the scores based on gene disruption by small variants. The PCA-NFP classifier predicts from neighborhood function profiles, which are a representation of how structural variants affect genomic neighborhoods. The Ensemble classifier is a combination of both sources of data (see Supplementary 1, Section S3.11). (b) The cross-validation receiver operating characteristic (ROC) curv of a baseline method based on small genetic variants and gene content (green) and the ensemble method (blue) that also includes copy number neutral structural variants, shown for two example phenotypes. Additional examples are in Supplementary 1, Fig. S43.

Next, we created a NFP dataset from this genomic data, where examples are E. coli strains, while features are frequencies of each GO term in the neighbourhood of each COG (i.e. functional neighbourhoods computed for each COG occurring in the genome of an E. coli strain). The resulting dataset is sparse and contains a very large number (nCOG × nGO) of features. A principal components (PC) analysis was therefore applied to reduce this data set to 228 PCs that provide a compact representation of the gene function composition of gene neighborhoods across many gene families, and which were used to train a Random Forest classifier. This yielded NFP models with broadly improved accuracy in predicting phenotype, resulting in out-of-bag AUC scores of 0.715 (0.480–0.815; median, Q1–Q3) across the phenotypes. In specific, 42 out of 151 phenotypes had a significant increase in accuracy (FDR = 20%, DeLong test) over the baseline classifier that draws on deleterious point mutations and indels and gene presence/absence. In contrast, only 9 of 151 phenotypes had significantly reduced accuracy (FDR = 20%, DeLong test) in the NFP over the baseline. Overall this suggests that composition of gene neighborhoods is substantially associated with phenotype.

The baseline classifier draws on deleterious mutations and on alterations in gene content, but not on the copy-number neutral structural variation, i.e. that which does not result in net gene gain or loss, but instead manifests in changed gene order. Broadly, the effects of the deleterious point mutations/indels on the one hand and the copy-number neutral structural variants on the other hand are expected to be qualitatively different – the former commonly abolish or modify protein function, while the latter commonly affect gene regulation. We therefore hypothesized that the two types of variants need to be considered jointly to predict phenotype of individuals more accurately. This was tested by constructing an ensemble classifier (see Supplementary Document 1, Section S3.11, and Supplementary 1, Figs. S41, S42 for details) that results in further significant increases (AUC 0.761; Q1–Q3: 0.673–0.848) over the baseline and also over the NFP classifier (pbaseline = 1.0·10−11,pNFP = 1.6·10−9, one-sided Wilcoxon signed-rank test). There was a significant increase in accuracy (FDR = 20%, DeLong test) over the baseline model that draws only on deleterious gene variants when predicting 62 (out of 151) phenotypes, while only 10 phenotypes showed a significant decrease in accuracy over the baseline, signalling an increase in predictive ability for many different phenotypes.

We highlight some examples. The phenotype “Trimethoprim.A22”, which describes growth inhibition by a combination of two antibiotics, can be predicted by the baseline method (drawing on deleterious mutations) such that, at a precision of 0.5, only 6% of the strains exhibiting the growth phenotype are recovered by the model (recall = 0.06; estimates from precision-recall curves in crossvalidation; Fig. 5b). In contrast, the ensemble method which combines the mutations and the structural variants can recover 28% of the strains exhibiting the growth phenotype (recall = 0.28) at the same precision, which is approx. a four-fold increase. Furthermore, the phenotype “Doxycycline.Pyocyanin” which describes sensitivity to a combined treatment by an antibiotic and a reactive oxygen species-generating toxin, does not yield any predictions (recall = 0) at a precision threshold of 50% when drawing on mutations only, but recovers 27% of the strains known to exhibit growth phenotypes (Fig. 5b) when considering also the gene neighborhoods encoded via their gene function profile. This data for other phenotypes is listed in Supplementary Material 4. This demonstrates that structural variation in the genome of individuals can be used to predict many phenotypes by drawing on the NFP representation of gene ordering along the chromosomes.

Discussion

Our work characterizes the distribution of gene function across genomic neighborhoods in hundreds of genomes. We detected the well-known phenomenon where genes with similar function cluster together in eukaryotic and prokaryotic genomes. However, the same analysis revealed that another type of genomic pattern is very common – the clustering of certain pairs of gene functions that appear unrelated, measured either by their proximity in the Gene Ontology graph (via the Resnik similarity) or by the overlap in genes assigned to the functions (via the Jaccard coefficient). The prevalence of such clustering is very high: while 92.6% of all examined gene functions in prokaryotes are significantly enriched in their own neighborhood or in the neighborhood of a related function, 100% of all functions are so in a neighborhood of at least one unrelated function (at a stringent threshold of FDR = 1%; see Methods). Similarly so, also in the eukaryotic clades we have examined, a higher number of gene functions have an unrelated function significantly enriched in their neighborhood than have a related function enriched (Methods for details). In other words, while organization following the similarity principle is certainly prevalent in genome neighborhoods, other patterns appear similarly or more widespread. Given that the effect sizes of these between-function neighborhood enrichments are often similar to the within-function enrichment, it is conceivable that this widespread pattern is too a result of selective forces shaping genome organization, although the underlying evolutionary mechanisms remain to be elucidated in future work. Irrespective of the mechanism that created it, this pattern is sufficiently strong that it can yield accurate predictive models to infer gene function and also the phenotype of individuals.

If indeed the clustering between unrelated gene functions brings a selective benefit to the organism, one possible interpretation is that the clustered pairs of functions play complementary roles in the physiology of the organism. This raises the possibility that such a pair of complementary functions, whose genes are commonly inter-linked by functional associations (here, inferred from genomic neighborhoods), might be instead better merged into a single large functional group, which would more closely reflect biological reality. In other words, are these pairs of complementary function seen by evolution simply as a single function? An argument against this is that such pairs of gene functions are not commonly annotated to the same gene families (low Jaccard index of example GO terms in Table 1; exhaustive list in Supplementary material 5), even if they do commonly occur in the neighborhood of each other. More generally, all significantly enriched pairs of GO terms in neighborhoods tend to overall have very low gene overlap, with 92% of these pairs having Jaccard index < 0.1 in genes assigned to them (Supplementary 1, Fig. S15).

Turning to the example of the E. coli lac operon and the corresponding functions “carbohydrate transport” and “carbohydrate metabolism”: it is evident that these molecular functions must be distinct due to a different molecular basis for transmembrane transport and for enzymatic cleavage. It is also clear from genomic data that the functions are distinct because they occur independently i.e. they do not co-occur in the same gene families (0 co-occurrences out of 3475 examined prokaryotic COGs; 2.63 co-occurrences of the two functions expected at random, given 199/3475 COGs annotated with “carbohydrate transport” and 46/3475 annotated with “carbohydrate metabolism”). Generalizing this pattern in the lac operon, it is plausible that many other similar neighborhoods exist that incorporate both transport and metabolism of compounds, reaping benefit from co-regulated expression of such complementary gene functions. The NFP approach for predicting gene function and phenotype is able to leverage such systematic co-occurrences, and propagate such dissimilar but co-occurring gene functions across neighborhoods in a systematic manner.

Such pairs of putatively complementary gene functions might be thought of as child functions of a single hypothetical parent function, which currently does not exist in the Gene Ontology graph. (If it did exist, then the Resnik semantic similarity statistic would mark this pair of functions as closely related.) This suggests a possible manner of enhancing of the current GO graph based on this association data, which would involve creating additional parent nodes that bridge the semantically distant, but biologically related GO terms. Of note, there were previous suggestions to derive alternate GO graphs by drawing on co-occurrence of function annotations in the same genes24, which is distinct from what we report here. The complementary functions we propose do not co-occur in the same genes, but are instead associated with each other – in the current analysis, via conserved gene neighborhoods, but it is conceivable that functional interactions inferred from other large-scale data might yield similar results. Past work31 has proposed that some GO terms may be considered ‘classes’, whose member genes are densely interconnected by functional associations inferred from large-scale data, and other GO terms are ‘categories’, whose members are not linked by functional associations, and which therefore represent artificial concepts. Here, we see widespread evidence for a third type of pattern in relation to the GO graph of gene functions, wherein pairs of distant GO terms are linked by numerous functional associations bridging the two GO terms. This suggests that such pairs (and possibly larger groups) of GO terms that are significantly interlinked provide a biologically meaningful manner for organizing the catalogue of gene function, with practical implications for automated inference of gene function and phenotype from the genome sequence.

Methods

Methodology overview

In this work, we assess a novel gene neighbourhood representation, called “Neighbourhood function profile” (NFP), for gene function prediction in 1669 prokaryotic organisms, 49 fungal and 80 metazoan organisms. To predict gene functions, we used Clusters of Orthologous Groups32 (COGs and NOGs) gene families, derived from Eggnog database33 (version 4.0 for prokaryotic and 5.1 for eukaryotic organisms), henceforth collectively referred to as COGs. We have assigned functions from Gene Ontology34 to gene families (COGs) as those occurring in at least 50% genes assigned to a given COG. The resulting datasets contain: a) 3475 COGs (entities) and 1048 GO functions (targets) obtained from the prokaryotic genomes, b) 15969 COGs and 2617 GO functions obtained from the fungal genomes, c) 9187 COGs and 2336 GO functions obtained from the metazoan genomes. We used only complete genomes for prokaryotic organisms whereas the Fungi and Metazoa genomes consisted of complete genomes (chromosomal data) and the non-complete genomes (contigs, nonchromosomal data).

Methodology description

For a given set of genes G and a set of COGs Ω, we define a mapping γ:GP(Ω) which assigns each gene to one or more corresponding COGs. Similarly, for a set of gene functions ∑, mapping δ:Ω → P(∑) mapps COGs to a corresponding set of gene functions contained within GO ontology.

For a given set of organisms Θ, a set of genes G, a selected gene giΘl a number of neighboring positions from either side of the gene k , a set of COGs Ω and a set of GO functions ∑, a functional gene neighbourhood of gi in the organism Θl is defined as a count of all GO functions occurring in COGs assigned to genes in its k-neighbourhood. Formally: \(Nbh({g}_{i},{\varTheta }_{l},\,g{o}_{k})={\sum }_{s=1}^{k}({1}_{g{o}_{k}}(\,{\cup }_{cog\in \gamma ({g}_{(i+s)modn})}\delta (cog)))+({1}_{g{o}_{k}}(\,{\cup }_{cog\in \gamma ({g}_{(i-s)modn})}\) \(\delta (cog)))\,)\), for prokaryotic organisms and \(Nbh({g}_{i},{\varTheta }_{l},\,g{o}_{k})={\sum }_{s=1}^{k}({1}_{g{o}_{k}}(\,{\cup }_{cog\in \gamma ({g}_{(i-s)})}\,\delta (cog\,)))\,+({1}_{g{o}_{k}}({\cup }_{cog\in \gamma ({g}_{(i-s)})}\)\(\delta (cog)))\,)\), for eukaryotic organisms where gs are genes contained in a k-neighbourhood of gene gi and n denotes the total size of a genome. Thus, the result of gene neighbourhood computation for all go functions is a tuple of size |∑| containing corresponding occurrence frequencies of all GO functions in the k-neighbourhood of gi. In our data analyses, we use COGs as entities, thus each COG is associated with a vector containing |∑| elements, corresponding to occurrence of GO functions in its k-neighbourhood, derived from neighbourhoods of corresponding genes. The neighbourhood frequencies computed for a gene gi are added to the frequency vector of all COGs such that cogs γ(gi). Thus, \(NbhO(co{g}_{i},{\varTheta }_{l},g{o}_{k})={\sum }_{\{{g}_{k}\in {\varTheta }_{l}|co{g}_{i}\in \gamma ({g}_{k})\}}\,Nbh({g}_{k},{\varTheta }_{l},g{o}_{k})\) and the final features are computed as \(NbhD(co{g}_{i},\varTheta \,,g{o}_{k})={\sum }_{{\varTheta }_{l}\in \varTheta }\,NbhO(co{g}_{i},{\varTheta }_{l}\,,g{o}_{k})/|\varTheta g{o}_{k}|.\,\varTheta g{o}_{k}\) denotes a set of all organisms containing at least one COG with function gok. Note that genes that are not assigned to any COG do not add to function frequency count in the neighbourhoods. Functional neighbourhoods of each COG contain frequencies for functions with a wide range of semantic similarity to the GO categories assigned to this COG.

We trained a Random Forest of Predictive Clustering Trees35 model on this features to predict gene functions. Performance of this methodology is compared to the biological (1-NN) and Gaussian Field Label Propagation model36, trained on the average of logarithmic distances of pairs of COGs in different organisms. For a pair of genes \({g}_{i}=({c}_{{x}_{i}},{c}_{{y}_{i}})\) and \({g}_{i}=({c}_{{x}_{j}},{c}_{{y}_{j}})\), where \({c}_{{x}_{i}},{c}_{{y}_{i}},{c}_{{x}_{j}},{c}_{{y}_{j}}\) are coordinates of corresponding genes in a genome. The logarithmic distance of two genes contained in a prokaryotic organism is computed as:

$$d({g}_{i,}{g}_{j})=lo{g}_{2}(min(|{c}_{{x}_{i}-}{c}_{{y}_{j}}|,|{c}_{{y}_{i}}-n-{c}_{{x}_{j}}|)+\varepsilon ),{\rm{if}}\,{c}_{{x}_{i}} > {c}_{{y}_{j}}$$
$$d({g}_{i,}{g}_{j})=lo{g}_{2}(min(|{c}_{{x}_{j}-}{c}_{{y}_{i}}|,|{c}_{{y}_{j}}-n-{c}_{{x}_{i}}|)+\varepsilon ),{\rm{if}}\,{c}_{{x}_{j}} > {c}_{{y}_{i}}$$

If two genes are overlapping, we define their distance to equal small constant ε. In our work, we use ε = 10−10. Distances in eukaryotic organisms are computed as:

$$d({g}_{i,}{g}_{j})=lo{g}_{2}(|{c}_{{x}_{i}-}{c}_{{y}_{j}}|+\varepsilon ),{\rm{if}}\,{c}_{{x}_{i}} > {c}_{{y}_{j}}$$
$$d({g}_{i,}{g}_{j})=lo{g}_{2}(|{c}_{{x}_{j}-}{c}_{{y}_{i}}|+\varepsilon ),{\rm{if}}\,{c}_{{x}_{j}} > {c}_{{y}_{i}}$$

The average logarithm distance of pair of COGs is computed as: \(d(co{g}_{i},co{g}_{j},\varTheta \,)={\sum }_{{\varTheta }_{j}\in \varTheta }({\sum }_{({g}_{k},{g}_{l})\in {\varTheta }_{j}|co{g}_{i}\in \gamma ({g}_{k}),co{g}_{j}\in \gamma ({g}_{l})}\)\(d({g}_{k},{g}_{l}))/(|\{({g}_{k},{g}_{l})\in {\varTheta }_{j}|co{g}_{i}\in \gamma ({g}_{k}),co{g}_{j}\in \gamma ({g}_{l})\}|)\)

The Random Forest and the k-NN algorithm perform hierarchical multi-label classification whereas the Gaussian Field Label Propagation model performs single class classification. Methods that perform hierarchical multi-label classification simultaneously predict multiple gene functions to each COG contained in the dataset utilizing information about the hierarchical structure of the GO ontology. Gaussian Field Label Propagation needs to be run for each gene function separately, which makes the approach significantly more time consuming.

An overview of the approach is provided in Fig. 6.

Figure 6
figure 6

Overview of the neighborhood function profile (NFP) methodology to predict gene function. Location-based approaches are trained on pairwise COG/NOG distances of corresponding genes contained within genome of different prokaryotic and eukaryotic organisms. The obtained distances are used to create a similarity table to train the k-NN model and the association network to train the Gaussian Field Label Propagation approach. Functional neighbourhoods are used to create a normalized frequency matrix which is used to train the Random Forest of Predictive Clustering trees model. “COG” in the Figure is used to denote both COG and NOG. Target Hi denotes the sub-hierarchy of GO terms associated with COGi (sub-hierarchy contains information about the GO functions assigned to a COG and the parent-child relations between these GO functions).

Method evaluation and relevance

All methods were evaluated in a cross-validation setting using the 3475 prokaryotic COG gene families and 1048 corresponding GO terms, 15741 fungal COGs and 2617 corresponding GO terms, and 9185 metazoan COGs and 2336 corresponding GO terms. The k-NN and GFP were evaluated using leave-one-out cross-validation and the NFP using the out-of-bag estimates using a forest of 600 trees. The out-of-bag estimate is used in NFP, since it significantly reduces validation time of a random forest model and provides comparable estimate of error to cross-validation.

Since assessing the importance of enriched functions in genomic neighborhoods for gene function prediction was a central goal of our work, we evaluated our approach (that utilizes this information) by comparing it with other state-of-the-art approaches that use information about gene proximity, but lack the information about the enriched semantically distant functions. Increasing predictive performance of genomic neighborhood-based methods may significantly advance our knowledge about the many genes of unknown function, particularly since these methodologies can be easily incorporated into an ensemble model with other genomic predictors, which - due to complementarity of predictions of different models - yields superior performance37.

Association testing of GO functions in neighborhoods

To measure the strength of association between different pairs of GO functions from our data, we first computed the contingency tables that contain the following components: a) COG contains GOx and Neighborhood contains GOy, b) COG contains GOx and Neighbourhood does not contain GOy, c) COG does not contain GOx and Neighbourhood contains GOy, d) COG does not contain GOx and Neighbourhood does not contain GOy.

From these tables, we compute the Odds ratio OR = \(\frac{a/c}{b/d}\) and ultimately the Log Odds ratio log2(OR). In addition to computing ORs and log odds ratios and testing its statistical significance using the Fisher exact test (for ORs) and z-test for log2(OR) >0, we also provide empirical evidence of strength of association. This is done by computing the number (percentage) of significantly enriched pairs of functions computed on the original dataset with higher or at least two-times higher (2x or more) log2(OR) than the corresponding pair computed on the randomized dataset (gene locations are permuted in the genome).

For a GO term with frequency p(GO), the information content (IC) of a GO is defined as −log2(p(GO)). The Resnik Similarity (RS) of a pair of GO terms GOx,GOy is defined as RS(GOx,GOy) = IC(comAnc), where comAnc denotes the most informative common ancestor (the one with the largest IC). We use RS < 2 as a criteria for selecting pairs of distant GO functions for which we compute enrichments in Eukaryotic organisms, the reason for this choice is that GO frequencies on eukaryotic organisms are empirically computed from the data, thus upper level (children of the root) of GO ontology have Information Content higher than 1.

Association testing of GO functions in neighborhoods

In the Discussion section of this text, for some function GOwe say that GOy is a related function if: RS(GOx,GOy) ≥6 for GOxGOy contained in the same namespace of the GO ontology, or if  J(GOx,GOy) ≥0.6 for GOxGOy not contained in the same namespace of GO. We observed 6615 related pairs of functions with FDR < 1% in prokaryotic dataset. For some function GOx, we say that GOy is “unrelated function” if RS(GOx,GOy) <1, for GOxGOy contained in the same namespace of the GO ontology, or if  J(GOx,GOy) < 0.05 for GOxGOy not contained in the same namespace of GO. We observed 204202 such pairs with FDR < 1% in prokaryotic dataset. These statistics are (47.5% − 8853 pairs, 97% − 878850 pairs) for Fungi and (99.9% − 36122 pairs, 100% − 1256151 pairs) for Metazoa.

The 11 gene functions analyzed individually in the manuscript are all taken from the same namespace of GO ontology, to prevent considering potentially synonymous functions from different namespaces as semantically distant.

For a given mapping ξ:∑ → P(Ω), that maps a GO function to a set of COGs which contain this function, we use \(J(G{O}_{x},G{O}_{y})\,=\,\frac{|\zeta (G{O}_{x})\,{\cap }^{}\,\zeta (G{O}_{y})|}{|\zeta (G{O}_{x}\,{\cup }^{}\,G{O}_{y})|}\) to measure the level of circularity of pairs of functions (especially these from different namespaces of GO ontology).

The information accretion of a function GOx was computed as ia(GOx)= −log2(P(GOx|Par(GOx))), where Par denotes a set of parent nodes of GOx. This implies that if some model predicts GO function that does not have high probability of occurrence, given the parent nodes, it gets significantly larger accretion score.

The average number of annotations per annotated gene

To assess the potential reason for a difference in predictive power of all models between Fungi and Metazoa dataset, we computed the average number of annotations per annotated gene. This information is important since it has direct impact on the stability and the amount of information contained in gene functional neighbourhoods.

Dataset for predicting phenotypes

The NFP dataset used to predict phenotypes in different strains of E. coli was constructed so that E. coli strains are examples (entities), whereas features are COG/NOG-GO function pairs (frequency of occurrence of each GO in the neighbourhood of each COG/NOG contained in each E. coli strain). Thus, the whole table depicted in Fig. 6 is contained in each row of our NFP dataset for phenotype prediction. These features (sparse in nature) were used to create 228 principal components using PCA.

Target variables in analyses of phenotypes

The Phenotypic dataset contains 151 target variables (phenotypes) that denote if a fitness defect has been detected (value 1) after application of specific combination of drugs or not (value 0).

Dividing bacteria into two subgroups by ecology

To divide the prokaryotic organisms contained in our dataset into free-living and those associated to mammalian host (pathogenic in mammals), we used the ProTraits database25. We selected these prokaryotic strains having the integrated score above 90%. Taxonomy IDs obtained from these strains were used to obtain species-level taxonomy id, which is used to assign all strains of this particular species to the required subset. LOR distributions were computed using genomes of these subsets of bacteria.

Assessing correlation between functional enrichments and gene co-expressions

In order to determine if the enrichment phenomenon of semantically distant functions can be potentially explained by gene co-expression, we computed the pairwise gene co-expressions on the E. coli bacteria (using the gene expression data obtained from the Colombos database38). Next, we computed the average correlation coefficient of all pairs of genes associated to the GO function pairs that are semantically distant (Resnik < 2) but significantly enriched (LOR > 2 used) and compared it to the average correlation coefficient of all pairs of genes associated to the GO function pairs that are semantically similar (divided by semantic similarity in the intervals [2,4 > , [4,6 >,> = 6) and that are significantly enriched (LOR > 2). As a control, we use the average gene correlation for genes associated to the pairs of GO functions (divided by semantic similarity as above) that are not significantly enriched and have LOR values in the [−0.5,0.5] interval. The obtained results show that the average correlation coefficient obtained consistently increases with the increase of semantic similarity in both groups. It is evident that the average correlation coefficients have higher values for genes associated to the enriched GO pairs, however the number of available data points (4) is too small to prove statistical significance of the difference in correlation between these groups. Overall, we did not detect correlation between enrichments of semantically distant pairs of GO functions and the average co-expression of genes associated to these functions.

Software

All figures were created in the R Studio development environment and plotted using the R language (versions 3.1.1 and 3.1.2), and R packages SemDist 1.3.0, ggrepel 0.5, gplots 2.17.0, seriation 1.2.0, viridis 0.3.4, PRROC 1.1, pROC 1.8. Specific programs to perform various analyses, create features, analyse the results and prepare data for plots were created in Java (JDK versions 7 and 8) and Java libraries Commons_io 2.4, Common_lang3 3.5, Commons_math3 3.3, jarchivelib 0.7.1, javatuples 1.2, weka 3.7.12, FastRandomForest 0.99, and jsc.