Patterns of diverse gene functions in genomic neighborhoods predict gene function and phenotype

Genes with similar roles in the cell cluster on chromosomes, thus benefiting from coordinated regulation. This allows gene function to be inferred by transferring annotations from genomic neighbors, following the guilt-by-association principle. We performed a systematic search for co-occurrence of >1000 gene functions in genomic neighborhoods across 1669 prokaryotic, 49 fungal and 80 metazoan genomes, revealing prevalent patterns that cannot be explained by clustering of functionally similar genes. It is a very common occurrence that pairs of dissimilar gene functions – corresponding to semantically distant Gene Ontology terms – are significantly co-located on chromosomes. These neighborhood associations are often as conserved across genomes as the known associations between similar functions, suggesting selective benefits from clustering of certain diverse functions, which may conceivably play complementary roles in the cell. We propose a simple encoding of chromosomal gene order, the neighborhood function profiles (NFP), which draws on diverse gene clustering patterns to predict gene function and phenotype. NFPs yield a 26–46% increase in predictive power over state-of-the-art approaches that propagate function across neighborhoods, thus providing hundreds of novel, high-confidence gene function inferences per genome. Furthermore, we demonstrate that copy number-neutral structural variation that shapes gene function distribution across chromosomes can predict phenotype of individuals from their genome sequence.

The role of many genes remains unknown. Even in well-investigated model organisms, a quarter or more of the genes have poorly characterized function. With the advance of genome sequencing techniques, the vast amounts of accumulated data provide an opportunity to infer gene function using computational methods. While such methods occur in many varieties, one widespread approach is to examine the composition of gene neighborhoods that occur across genomes. Then, following the principle of guilt-by-association, a function of a gene is inferred by transferring it from its neighbors in the genome 1,2 . Gene neighborhoods that are conserved across multiple genomes provide additional confidence in each inference, often yielding highly accurate predictive modes of gene function [3][4][5][6][7] . One important biological mechanism that underlies similarity of gene function in neighboring genes is that they are often regulated by common factors and therefore co-expressed (reviewed in 8,9 ). The prime example is the prokaryotic operon, where a single mRNA harboring multiple protein-coding regions is transcribed from a promoter, ensuring that expression of such functionally related proteins is coordinated. However, this concept extends more broadly -neighboring genes that are not part of the same operon are also often co-regulated and share function. Moreover, in eukaryotic organisms, which generally lack operons, gene regulation is also organized regionally and this pattern can be conserved across evolutionary time [10][11][12][13][14] . Consistently, gene function is non-randomly distributed also across eukaryotic chromosomes 15,16 , even though the neighborhood signal is overall more subtle than in prokaryotes, important exceptions notwithstanding (reviewed in 17 ).
The current computational methods that use conserved gene neighborhoods to predict gene function rely on the principle that similar functions cluster together on the chromosomes. While there is abundant evidence that this is the case, we were intrigued by known individual examples of conserved clustering of genes with apparently unrelated functions. For instance, a metabolic gene (FAD synthase) was reported to hitchhike with clusters of protein translation-related genes, and RNA modification/degradation genes were found to hitchhike with signal genomic neighborhoods. For eukaryotes we required RS < 2 to call dissimilar GO terms (see Methods for justification). Consistently with data in prokaryotes, 0.3 million out of 1.8 million and 0.4 million out of 1.3 million semantically distant BP GO term pairs were significantly co-occuring in genomic neighborhoods. These high proportions mean that almost all gene functions are significantly enriched in the neighborhood of at least one other semantically distant gene function (100% of BP GO terms in prokaroytes, at FDR = 10%; 96.4% in Fungi, 100% in Metazoa).
The effect sizes of such enrichments may be substantial: 72.1% of the observed log OR scores in prokaryotes are below 1st or above 99th percentile of log OR scores (corresponding to OR < 0.761 and OR >1.36 respectively)  Table S3. Pairs with OR = 0 are not shown on graphs (see Methods; these pairs result in artefactually high or low log OR values after continuity correction). Information about the statistical significance of difference in distribution shape between observed and randomized distribution is expressed by the Kolmogorov-Smirnov D statistic and the corresponding p-value. (b) Number of GO terms that are semantically distant, but significantly enriched in genomic neighborhoods (FDR ≤ 10%) of each GO term, summarized in histograms for prokaryotes, fungi and metazoa. GO term pairs with Resnik similarity <1 (for prokaryotes) and RS < 2 (for eukaryotes) from the 'biological process' GO sub-ontology are tallied in the figure.
in randomized genomes (Supplementary material 1, Table S3). The distributions of enrichments in neighborhoods of four example gene functions are shown in Fig. 2, illustrating how there exist pairs of dissimilar functions that have neighborhood enrichments comparable to or higher than the enrichment of a gene function in its own neighborhood (histograms including statistical significance calls are shown in Supplementary 1, Figure S7). Finally, we examined separately a group of free-living microbes and a group of microbes that are pathogenic in mammals (labels from ProTraits phenotype database 25 , see Methods). Both sets of microbes yielded significantly different OR xy than obtained on randomized data for various pairs of gene categories (p < 2.2·10 −16 ) and the substantially higher spread from randomized data was observed in both sets of microbes (Supplementary document 1, Figure S45). Therefore the enrichment of dissimilar gene functions in neighborhoods does not appear specific to certain ecological preferences but instead represents a general trend.

Examples of semantically dissimilar gene functions enriched in genomic neighborhoods. For
instance, there is a trend in gene neighborhood organization where the GO term "carbohydrate metabolism" is 2.12-fold enriched in the neighborhood of the semantically unrelated GO term "carbohydrate transport" (of note, the magnitude of this enrichment is similar to that of "carbohydrate metabolism" in its own neighborhood -2.53-fold). This co-occurrence is reminiscent of the textbook example of the lac operon in Escherichia coli, where the lacY gene encodes lactose permease, which shuttles lactose into the cell, while the neighboring lacZ gene encodes the enzyme β-galactosidase, which breaks down lactose into monosaccharides. Our analysis suggests that the proximity of genes encoding carbohydrate transporters and carbohydrate metabolizing enzymes is a systematic trend. This can provide evidence for inferring the presence of hypothesized transporter genes located next to known metabolic enzyme genes, and vice versa.
Other examples include significant enrichments of "DNA topological change" (OR xy = 1.42) and also of "nucleoside bisphosphate biosynthesis" (OR xy = 1.41) in the neighborhood of gene families annotated with "response to DNA damage stimulus". This suggests some manner of coordinated regulation between different DNA repair-related activities. Furthermore, "lipid biosynthesis" genes are often in the neighborhood of "localization within membrane" genes (2.43) and "carbohydrate biosynthesis" is in the neighborhood of "cell envelope organization" (2.10). These associations suggest coordinated activation of processes that generate building blocks of cellular structures, and subsequent processes that incorporate the building blocks into these structures. We highlight other examples of enriched GO term pairs for prokaryotic neighborhoods in Table 1 (see also Supplementary  document 1, Table S2), while the exhaustive list is in Supplementary material 5.
While all GO term pairs listed above either have low semantic similarity or are from different GO sub-ontologies, there are cases where GO terms, distant in the GO graph, may sometimes overlap in the set of genes assigned to them 24 . This has the potential to inflate the observed enrichments, which might be due to the same genes artefactually creating two apparently distinct functional neighborhoods. However after quantifying Semantically distant GO terms can be as strongly enriched in gene neighborhoods as the semantically close GO terms. Four example GO terms of the 'Biological process' ontology are shown. Histograms show numbers of GO terms at a certain log odds ratio (log OR) of the enrichment in gene neighborhood (for prokaryotic genomes). The GO terms in neighborhoods of a central GO function are broken down into three groups: the "CLPar" group (the central function itself plus all its parent functions in the GO graph), "CLMed" group (functions with Resnik semantic similarity >2 with the central function) and "Dist" group (Resnik ≤2 with the central function). Instances of GO terms in the dissimilar "Dist" group and in the non-self "CLMed" group can be observed that have enrichments as high or higher than the self-enrichments (the "CLPar" functions, arrows on the plot).
the overlap of genes assigned to the pairs of GO terms using Jaccard index, we find this not to be a common issue (Table 1; Supplementary Material 5). Furthermore, a randomization of gene positions supports that the observed enrichment of extreme OR values between distinct GO terms holds true for the semantically very close GO term pairs as well as for the semantically distant pairs (panels 'CLPar' and 'Dist' in Fig. S8). Overall, our data suggests that clustering of dissimilar gene functions -at least as defined by the GO -is very common in genomic neighborhoods, occurring to a comparable extent as the well-known clustering of similar gene functions in genomes.
It is well-known that functionally similar terms found in gene neighborhoods are often co-expressed. We asked if this extends also to the functionally dissimilar terms that cluster in gene neighborhoods. To address this, we stratified GO term pairs into bins by semantic similarity, and compared the neighborhood-cooccurring versus non-cooccurring pairs by similarity of expression profiles in E. coli (Methods). Expectedly, we observed that high functional similarity (irrespective of gene neighborhoods) implies higher co-expression than low functional similarity (Supplementary document 1, Table S10) and also that the highly similar function pairs were more strongly co-expressed when they are clustered in neighborhoods (3.2-fold difference in average Spearman correlation). Interestingly, we observed that also the lowly similar function pairs were more strongly co-expressed when clustered in neighborhoods than when they were not clustered (3.0-fold difference), even though the overall co-expression strength for lowly similar function pairs is not striking (Supplementary document 1, Table S10). This supports the notion that at least some of the diverse functions co-localized in gene neighborhoods are also co-expressed.
A general method to infer gene function based on neighborhood patterns. Having demonstrated that genomic neighborhoods are significantly enriched in certain combinations of diverse gene functions, we asked if such neighborhood patterns can be used to establish a general method that predicts gene function. To evaluate this, we compared two established methods that propagate a gene function to neighboring genes, with a novel classifier that can draw on neighborhood co-occurrence of diverse gene functions to predict GO terms for COG gene families. The first method is a simple k-nearest neighbors (kNN) classifier that transfers known functions to a COG gene family from the k neighboring gene families (with smallest average logarithmized distances across many genomes; Methods). The second classifier can transfer gene functions to neighbors additionally via indirect links: a gene network is constructed from neighborhoods and the gene function assignments diffuse across the links (using the GeneMania method 26 , herein referred to as the Gaussian Field Propagation (GFP) classifier). In addition to these known approaches, we introduced a third, novel classifier that can draw on both the enrichment of similar functions in neighborhoods and additionally the enrichment of semantically distant gene functions. For example, this method should be able to infer that a gene family deals with "carbohydrate metabolism" based on its neighbors being annotated with "carbohydrate metabolism" and additionally based on its neighbors dealing with "carbohydrate transport", a semantically distant function in the GO graph. To this GO   Table 1. Examples of dissimilar gene function pairs enriched in genomic neighborhoods. Data shown for "biological process" GO graph of the prokaryotic genomes. GO x -GO x denotes enrichment of a GO term s in its own neighborhood, while GO x -GO y denotes enrichment of the other GO term y in the neighborhood of GO term x. Enrichments are given as log odds ratio (log OR) +/− corresponding 95% confidence interval and the p-value for the association (by Z-test on log OR, one-tailed). "Resnik" denotes Resnik semantic similarity; this is an unbounded score where any value < 2 corresponds to very distant terms that reside in separate branches of the GO graph (meaning the closest common ancestor has information content <2). "J" is the Jaccard index that quantifies co-occurrence of gene functions in COG gene families from our prokaryotic data set, and can vary from 0 to 1. (2019) 9:19537 | https://doi.org/10.1038/s41598-019-55984-0 www.nature.com/scientificreports www.nature.com/scientificreports/ end, we employed a Random Forest classifier on a data set where examples are COG gene families, features are the normalized counts of every GO term in the neighborhood of that COG (across all genomes), and the target variable is the set of known GO term labels of the COG gene family; see Methods. In other words, from such a representation, the function of a gene can in principle be inferred from the presence of any function -or a combination thereof -in the genomic neighborhood of the gene, as long as such a pattern occurs commonly enough to be recognized by the algorithm. We named this approach the "neighborhood function profile" (NFP) classifier.
We evaluated the accuracy of all methods in a cross-validation test (using the out-of-bag error statistic provided by Random Forest; see Methods). The average area under precision-recall curves (AUPRC) for GO terms in the prokaryotic dataset is 0.153 for the 10-NN classifier and, expectedly, a much improved 0.199 score for the network-based (GFP) classifier. Both methods operate by transferring an annotation across gene neighborhoods, while the latter also uses indirect links to improve accuracy. However the NFP-based classifier, which can draw on diverse neighborhood patterns substantially improves over this, with a 0.266 average AUPRC in prokaryotes, a 34% increase (p < 10 −10 for the increase over the next best method, one-tailed Wilcoxon test on AUPRC scores across GO categories; distributions of scores in Fig. 3, see also Supplementary 1, Figs. S16-S18). Similarly so, in the two groups of eukaryotic organisms, the AUPRC scores were significantly improved using the novel NFP method: for Fungi, the 0.0460 (for the 10-NN) increased to 0.0545 (for the NFP; p < 10 −15 ) and for Metazoa, the increase is 0.0228 (for 10-NN) to 0.0267 (for NFP, p < 10 −15 ; Fig. 3). The network diffusion approach applied to gene neighborhood data in eukaryotes did not on average bring benefits over the simpler 10-NN (Supplementary 1, Figs. S19-S24), therefore the latter is used as a baseline.
High predictive power of the Neighborhood Function Profile (NFP) classifier. A more accurate classifier would be expected to provide a higher number of confident predictions. We quantified this increase provided by the predictive models based on NFP. In particular, we tallied the number of predictions (COG-GO term associations) made at a precision threshold of 50% (equivalent to 50% FDR) and additionally at a more stringent 80% (20% FDR) for the three classifiers; Table 2. This reveals remarkable, several-fold increases in the amount of predictions afforded by the NFP in prokaryotes, fungi and metazoa, when considering the more general gene functions (information content (IC) between 2 and 4). In the highly specific gene functions (IC > 4), which are usually of higher interest, there is a substantial increase in new annotations provided by NFP for prokaryotes and fungi, and a more modest gain in metazoa. We further examined the diversity of predictions: in particular, we www.nature.com/scientificreports www.nature.com/scientificreports/ asked if the new predictions afforded by NFP are largely added to the gene families that were already assigned predictions by the previous methods, or if NFP predictions instead cover new gene families. Our data suggested that the latter is the case (Supplementary 1, Tables S5-S7), because the number of COG gene families receiving at least one prediction is higher in NFP compared to the baseline classifiers.
Our gene neighborhood classifiers, as implemented, provide function predictions at the level of COG gene families (exhaustive list given in Supplementary material 2,3.). We also provide data showing how this reflects in the number of genes receiving predictions in certain model organisms (Supplementary document 1, Tables S8 and  S9). For instance, in Escherichia coli, at precision of 50%, the number of novel predictions (gene-function pairs) is 7572 for the network approach, and increases to 10559 provided by the novel NFP classifier; similarly so in the pathogen Staphylococcus aureus, increasing from 4314 to 5386 by use of NFP; all counts given for gene functions with IC > 4. Eukaryotes, consistent with more modest AUPRC scores (see above), provide overall fewer predictions, but increases from use of NFP are quite evident. The number of novel annotations at Pr = 50% substantially increased from 89 (10-NN) to 552 (NFP) for Saccharomyces cerevisiae and 87 to 282 for Schizosaccharomyces pombe, in the fungal predictor (Supplementary 1, Table S8). For metazoans, the increases were striking for the general gene functions (IC 2-4), with more than twofold higher number of predictions afforded by NFP at precision 50% for mouse, human or Drosophila melanogaster, compared to the next best method. The NFP gains were modest for highly specific gene functions with IC > 4 in Metazoa (Supplementary 1, Table S8). One possible explanation was the lower coverage with known functions (average 18 GO terms per gene in Metazoa versus 26 in Fungi in the databases we used; see Methods). This might prevent NFP from discovering complex association patterns between gene functions in neighborhoods, while the simpler kNN classifier is less affected.
We examined another measure of the utility of the predictive models, based on the information accretion (IA) criterion 27 . In brief, IA weighs the predictions such as to give higher scores to higher information content (rarer) GO terms; see Methods. By this method, in prokaryotes, kNN predicts 2.69 bits/gene novel information, the GFP network approach 6.89 bits/gene, while the NFP increases this to 9.44 bits/gene; all given at precision = 50% (data at 50% and 80% thresholds are in Table 3, for visualization of proportions see Supplementary 1, Figs. S30 and S31). Therefore, we estimate that NFP brings a 37% increase in coverage with predicted gene functions over a state-of-the-art genomic neighborhood method. This result is mirrored in two groups of eukaryotes we tested: in fungi, the increase was 0.70 (kNN), 1.37 (network) and 2.00 (NFP) bits/gene, while in Metazoa it was 0.70 (kNN), 0.99 (network) increasing to 1.25 bits/gene in the NFP method (Table 3, Supplementary 1, Figs. S32-S35). In other words, in eukaryotes, the novel NFP method increases the predictive power by 46% (Fungi) and 26% (Metazoa) over a state-of-the art network approach for propagating gene function across neighborhoods. nfp accuracy is augmented by semantically distant functions. The above data indicate that a classifier based on the NFP -an exhaustive description of the composition of gene functions in a genomic neighborhood -provides high accuracy and yields many additional function predictions. Next, we asked if this increase in accuracy of the NFP is due to the semantically distant GO terms in gene neighborhoods. To this end, we examined 11 example gene functions, which have other semantically dissimilar functions enriched in their neighborhoods (by our OR xy measure, see above). As expected, the NFP classifier strongly outperforms the baseline kNN and the GFP classifiers for these functions (Supplementary 1, Fig. S28). Then, we created partial NFPs, which contained only the semantically distant functions (without ancestors), compared to the function being predicted ("Dist/Par", having RS < 2). We contrasted this to the partial NFPs that contain only the function being predicted itself and its ancestors that are semantically close ("CLPar", parents in the GO graph having RS ≥ 4 plus the GO function itself). Random Forest classifiers were trained on the two kinds of partial NFPs and cross-validation accuracy www.nature.com/scientificreports www.nature.com/scientificreports/ compared (Fig. 4). For 9 out of 11 functions, a higher accuracy was obtained from neighborhood features describing semantically distant functions ("Dist/Par", mean AUPRC = 0.047 across gene functions) than from features describing the target function and its close parents ("CLPar", mean AUPRC = 0.035). The latter is an implementation of the guilt-by-association principle using the same classifier and the same data representation; for 8 out of 11 the increase was significant (p < 0.0002; Wilcoxon one-sided paired test). For all 11 functions, using the full set of neighborhood features, including the close, intermediate-distance, and distant functions, significantly outperforms the default model using only close features. This suggests a benefit to predictive accuracy also from  Table 3. Amount of predicted information on gene function, measured using the 'information accretion' methodology and expressed as bits per gene. 10-NN, ten nearest neighbors; GFP, Gaussian Field Label Propagation (network-based approach); NFP, neighborhood function profile. The "Full profile" are the full NFP of the 'biological process' GO graph, while the "CL/CLPar", "Med/ Par" and "Dist/Par" represent the partial NFP consisting only of close, medium-distance and distant functions, respectively (the "/Par" denotes that parent GO terms of the target functions were removed). The "CLPar" partial profiles contain only the selected function and its semantically close parents, meaning that "CLPar" is an implementation of the standard approaches that transfer functions across neighborhoods. In many cases, the close (but non-self), medium-distance and distant functions are more predictive than CLPar, and the complete profile is the most predictive. Serving as a control, the removal of the significantly enriched functions (labeled as "/Enr" in the legend) from the partial NFP strongly reduces accuracy, either for the close functions (CL), the medium-distance (Med) or the distant functions (Dist). Bars are average AUPRC scores of 200 runs of crossvalidation of the Random Forest classifier, whereas error bars show standard deviation across the 200 runs.
including the intermediate-distance functions. This analysis demonstrates that even when using the same statistical methodology (Random Forest) and same type of data representation (NFP), the presence of semantically distant functions in genomic neighborhoods is often highly predictive of gene function. For more details on the experimental setup see Supplementary document 1, Section S2. Generalizing this principle, we find that integrating all three GO sub-ontologies in a common predictor can provide further increases to accuracy (see Supplementary Document 1, section S3.10).

Validation on external data sets. Our gene neighborhood-based NFP predictive models used
cross-validation to obtain overall estimates of accuracy, and additionally to estimate the FDRs for thousands of individual predictions made in prokaryotes and eukaryotes (Supplementary material 2 and 3). To further support these estimates, we analyzed an external dataset of gene functions derived from the Critical Assessment of Function Prediction (CAFA 2) challenge data 28 (https://biofunctionprediction.org/cafa/) (Supplementary 1, Section S3.9). Indeed, also on the external validation set, the NFP approach (average AUPRC: 0.207) again outperforms the 10-NN method (AUPRC: 0.130); the difference is significant at p < 2.2 × 10 −16 by Wilcoxon test, one-tailed). The average AUPRC of 0.207 on the external validation set is broadly consistent with the average AUPRC of 0.266 on crossvalidation. The AUPRC scores for the individual GO categories were significantly correlated between crossvalidation and the external set (Supplementary 1, Fig. S36).
External validation of eukaryotic predictive models also shows higher accuracy for NFP (average AUPRC: 0.065 in Fungi, 0.025 in Metazoa), compared to the (next best) 10-NN approach (0.055 in Fungi, 0.020 in Metazoa); the differences are significant at p < 5 × 10 −16 in Fungi and Metazoa. These scores for the NFP predictions on the external set are largely similar to those originally obtained in crossvalidation for the two groups (external 0.065 and 0.025 for Fungi and Metazoa, versus crossvalidation 0.0545 and 0.0267 respectively; Supplementary 1, Fig. S37).
In summary, external validation supports that NFP outperforms previous approaches that propagate gene functions across neighborhoods, and additionally provides credibility to the cross-validation estimates of accuracy. The set of predictions that we supply as Supplementary material 2,3 may be used to prioritize further validation work. The FDR score provided for each prediction allows making informed decisions on prioritizing the predictions to validate.

Gene neighborhood composition can predict phenotype of individuals. Predicting phenotype
from the genome sequence of an individual is a central goal in modern genetics. While single-nucleotide variants and indels are commonly considered in such analyses 29,30 , structural variants also have considerable potential to affect gene regulation and may therefore bear on the phenotype. Encouraged by the high accuracy of the NFP classifiers in predicting gene function based on gene neighborhood composition, we therefore asked if a related method could be used to infer phenotype from gene order observed in individuals in a population. We focused on prokaryotes, for two reasons. First, the NFP classifiers were more accurate for prokaryotes, which is likely at least in part due to a much larger set of sequenced taxa currently available. Second, copy number-neutral structural variants are known to be abundant even between closely related microbial strains and affect the major part of the genome therein. Moreover, our recent work has shown that across prokaryotic species, many phenotypes are strongly statistically associated with certain gene neighborhoods 25 . This motivated us to examine to what extent this holds true also for individuals (strains) of one species and to what extent are the associations with neighborhoods predictive. We have therefore examined a previous data set of 696 naturally-occurring E. coli strains that have been systematically experimentally tested for 151 phenotypes (Methods), such as the ability to metabolize certain substrates or the resistance to a variety of toxins and antibiotics 30 . In the original work, occurrence of deleterious variants, such as nonsense variants or frameshifting indels in certain genes, was associated with specific phenotypic outcomes.
Here, as a baseline, we use conditional scores (CS) of Galardini et al., which are an estimate of gene disruption in a particular strain, combined with the phenotypes that are known to result from loss-of-function mutations for each gene 30 . Upon computing the AUC and AUPRC predictive performance measures for each phenotypic trait (here encoded as a binary outcome; see Methods) based on the CS, we obtained the median AUC of 0.672 (0.591-0.736; Q1-Q3) across the 151 phenotypes (Fig. 5a). Using the CS as input to a Random Forest algorithm yields slightly better performance with a median AUC of 0.679 (0.473-0.796; Q1-Q3), however the difference in AUC score distribution is is not significant (p = 0.276, one-sided Wilcoxon signed-rank test); see Supplementary document 1, Section S3.11.
Next, we created a NFP dataset from this genomic data, where examples are E. coli strains, while features are frequencies of each GO term in the neighbourhood of each COG (i.e. functional neighbourhoods computed for each COG occurring in the genome of an E. coli strain). The resulting dataset is sparse and contains a very large number (n COG × n GO ) of features. A principal components (PC) analysis was therefore applied to reduce this data set to 228 PCs that provide a compact representation of the gene function composition of gene neighborhoods across many gene families, and which were used to train a Random Forest classifier. This yielded NFP models with broadly improved accuracy in predicting phenotype, resulting in out-of-bag AUC scores of 0.715 (0.480-0.815; median, Q1-Q3) across the phenotypes. In specific, 42 out of 151 phenotypes had a significant increase in accuracy (FDR = 20%, DeLong test) over the baseline classifier that draws on deleterious point mutations and indels and gene presence/absence. In contrast, only 9 of 151 phenotypes had significantly reduced accuracy (FDR = 20%, DeLong test) in the NFP over the baseline. Overall this suggests that composition of gene neighborhoods is substantially associated with phenotype.
The baseline classifier draws on deleterious mutations and on alterations in gene content, but not on the copy-number neutral structural variation, i.e. that which does not result in net gene gain or loss, but instead manifests in changed gene order. Broadly, the effects of the deleterious point mutations/indels on the one hand (2019) 9:19537 | https://doi.org/10.1038/s41598-019-55984-0 www.nature.com/scientificreports www.nature.com/scientificreports/ and the copy-number neutral structural variants on the other hand are expected to be qualitatively differentthe former commonly abolish or modify protein function, while the latter commonly affect gene regulation. We therefore hypothesized that the two types of variants need to be considered jointly to predict phenotype of individuals more accurately. This was tested by constructing an ensemble classifier (see Supplementary Document 1, Section S3.11, and Supplementary 1, Figs. S41, S42 for details) that results in further significant increases (AUC 0.761; Q1-Q3: 0.673-0.848) over the baseline and also over the NFP classifier (p baseline = 1.0·10 −11 ,p NFP = 1.6·10 −9 , one-sided Wilcoxon signed-rank test). There was a significant increase in accuracy (FDR = 20%, DeLong test) over the baseline model that draws only on deleterious gene variants when predicting 62 (out of 151) phenotypes, while only 10 phenotypes showed a significant decrease in accuracy over the baseline, signalling an increase in predictive ability for many different phenotypes.
We highlight some examples. The phenotype "Trimethoprim.A22", which describes growth inhibition by a combination of two antibiotics, can be predicted by the baseline method (drawing on deleterious mutations) such that, at a precision of 0.5, only 6% of the strains exhibiting the growth phenotype are recovered by the model (recall = 0.06; estimates from precision-recall curves in crossvalidation; Fig. 5b). In contrast, the ensemble method which combines the mutations and the structural variants can recover 28% of the strains exhibiting the growth phenotype (recall = 0.28) at the same precision, which is approx. a four-fold increase. Furthermore, the phenotype "Doxycycline.Pyocyanin" which describes sensitivity to a combined treatment by an antibiotic and a reactive oxygen species-generating toxin, does not yield any predictions (recall = 0) at a precision threshold of 50% when drawing on mutations only, but recovers 27% of the strains known to exhibit growth phenotypes (Fig. 5b) when considering also the gene neighborhoods encoded via their gene function profile. This data for other phenotypes is listed in Supplementary Material 4. This demonstrates that structural variation in the genome of individuals can be used to predict many phenotypes by drawing on the NFP representation of gene ordering along the chromosomes.

Discussion
Our work characterizes the distribution of gene function across genomic neighborhoods in hundreds of genomes. We detected the well-known phenomenon where genes with similar function cluster together in eukaryotic and prokaryotic genomes. However, the same analysis revealed that another type of genomic pattern is very common -the clustering of certain pairs of gene functions that appear unrelated, measured either by their proximity in the Gene Ontology graph (via the Resnik similarity) or by the overlap in genes assigned to the functions (via the  Supplementary 1, Fig. S43. Jaccard coefficient). The prevalence of such clustering is very high: while 92.6% of all examined gene functions in prokaryotes are significantly enriched in their own neighborhood or in the neighborhood of a related function, 100% of all functions are so in a neighborhood of at least one unrelated function (at a stringent threshold of FDR = 1%; see Methods). Similarly so, also in the eukaryotic clades we have examined, a higher number of gene functions have an unrelated function significantly enriched in their neighborhood than have a related function enriched (Methods for details). In other words, while organization following the similarity principle is certainly prevalent in genome neighborhoods, other patterns appear similarly or more widespread. Given that the effect sizes of these between-function neighborhood enrichments are often similar to the within-function enrichment, it is conceivable that this widespread pattern is too a result of selective forces shaping genome organization, although the underlying evolutionary mechanisms remain to be elucidated in future work. Irrespective of the mechanism that created it, this pattern is sufficiently strong that it can yield accurate predictive models to infer gene function and also the phenotype of individuals.
If indeed the clustering between unrelated gene functions brings a selective benefit to the organism, one possible interpretation is that the clustered pairs of functions play complementary roles in the physiology of the organism. This raises the possibility that such a pair of complementary functions, whose genes are commonly inter-linked by functional associations (here, inferred from genomic neighborhoods), might be instead better merged into a single large functional group, which would more closely reflect biological reality. In other words, are these pairs of complementary function seen by evolution simply as a single function? An argument against this is that such pairs of gene functions are not commonly annotated to the same gene families (low Jaccard index of example GO terms in Table 1; exhaustive list in Supplementary material 5), even if they do commonly occur in the neighborhood of each other. More generally, all significantly enriched pairs of GO terms in neighborhoods tend to overall have very low gene overlap, with 92% of these pairs having Jaccard index < 0.1 in genes assigned to them (Supplementary 1, Fig. S15).
Turning to the example of the E. coli lac operon and the corresponding functions "carbohydrate transport" and "carbohydrate metabolism": it is evident that these molecular functions must be distinct due to a different molecular basis for transmembrane transport and for enzymatic cleavage. It is also clear from genomic data that the functions are distinct because they occur independently i.e. they do not co-occur in the same gene families (0 co-occurrences out of 3475 examined prokaryotic COGs; 2.63 co-occurrences of the two functions expected at random, given 199/3475 COGs annotated with "carbohydrate transport" and 46/3475 annotated with "carbohydrate metabolism"). Generalizing this pattern in the lac operon, it is plausible that many other similar neighborhoods exist that incorporate both transport and metabolism of compounds, reaping benefit from co-regulated expression of such complementary gene functions. The NFP approach for predicting gene function and phenotype is able to leverage such systematic co-occurrences, and propagate such dissimilar but co-occurring gene functions across neighborhoods in a systematic manner.
Such pairs of putatively complementary gene functions might be thought of as child functions of a single hypothetical parent function, which currently does not exist in the Gene Ontology graph. (If it did exist, then the Resnik semantic similarity statistic would mark this pair of functions as closely related.) This suggests a possible manner of enhancing of the current GO graph based on this association data, which would involve creating additional parent nodes that bridge the semantically distant, but biologically related GO terms. Of note, there were previous suggestions to derive alternate GO graphs by drawing on co-occurrence of function annotations in the same genes 24 , which is distinct from what we report here. The complementary functions we propose do not co-occur in the same genes, but are instead associated with each other -in the current analysis, via conserved gene neighborhoods, but it is conceivable that functional interactions inferred from other large-scale data might yield similar results. Past work 31 has proposed that some GO terms may be considered 'classes' , whose member genes are densely interconnected by functional associations inferred from large-scale data, and other GO terms are 'categories' , whose members are not linked by functional associations, and which therefore represent artificial concepts. Here, we see widespread evidence for a third type of pattern in relation to the GO graph of gene functions, wherein pairs of distant GO terms are linked by numerous functional associations bridging the two GO terms. This suggests that such pairs (and possibly larger groups) of GO terms that are significantly interlinked provide a biologically meaningful manner for organizing the catalogue of gene function, with practical implications for automated inference of gene function and phenotype from the genome sequence.

Methods
Methodology overview. In this work, we assess a novel gene neighbourhood representation, called "Neighbourhood function profile" (NFP), for gene function prediction in 1669 prokaryotic organisms, 49 fungal and 80 metazoan organisms. To predict gene functions, we used Clusters of Orthologous Groups 32 (COGs and NOGs) gene families, derived from Eggnog database 33 (version 4.0 for prokaryotic and 5.1 for eukaryotic organisms), henceforth collectively referred to as COGs. We have assigned functions from Gene Ontology 34 to gene families (COGs) as those occurring in at least 50% genes assigned to a given COG. The resulting datasets contain: a) 3475 COGs (entities) and 1048 GO functions (targets) obtained from the prokaryotic genomes, b) 15969 COGs and 2617 GO functions obtained from the fungal genomes, c) 9187 COGs and 2336 GO functions obtained from the metazoan genomes. We used only complete genomes for prokaryotic organisms whereas the Fungi and Metazoa genomes consisted of complete genomes (chromosomal data) and the non-complete genomes (contigs, nonchromosomal data). www.nature.com/scientificreports www.nature.com/scientificreports/ For a given set of organisms Θ, a set of genes G, a selected gene g i ∈ Θ l a number of neighboring positions from either side of the gene k ∈ ℵ, a set of COGs Ω and a set of GO functions ∑, a functional gene neighbourhood of g i in the organism Θ l is defined as a count of all GO functions occurring in COGs assigned to genes in its k-neighbourhood. Formally:

Methodology description.
, for prokaryotic organisms and , for eukaryotic organisms where g s are genes contained in a k-neighbourhood of gene g i and n denotes the total size of a genome. Thus, the result of gene neighbourhood computation for all go functions is a tuple of size |∑| containing corresponding occurrence frequencies of all GO functions in the k-neighbourhood of g i . In our data analyses, we use COGs as entities, thus each COG is associated with a vector containing |∑| elements, corresponding to occurrence of GO functions in its k-neighbourhood, derived from neighbourhoods of corresponding genes. The neighbourhood frequencies computed for a gene g i are added to the frequency vector of all COGs such that cog s ∈ γ(g i ). Thus, denotes a set of all organisms containing at least one COG with function go k . Note that genes that are not assigned to any COG do not add to function frequency count in the neighbourhoods. Functional neighbourhoods of each COG contain frequencies for functions with a wide range of semantic similarity to the GO categories assigned to this COG.
We www.nature.com/scientificreports www.nature.com/scientificreports/ with higher or at least two-times higher (2x or more) log 2 (OR) than the corresponding pair computed on the randomized dataset (gene locations are permuted in the genome).
For a GO term with frequency p(GO), the information content (IC) of a GO is defined as −log 2 (p(GO)). The Resnik Similarity (RS) of a pair of GO terms GO x ,GO y is defined as RS(GO x ,GO y ) = IC(comAnc), where comAnc denotes the most informative common ancestor (the one with the largest IC). We use RS < 2 as a criteria for selecting pairs of distant GO functions for which we compute enrichments in Eukaryotic organisms, the reason for this choice is that GO frequencies on eukaryotic organisms are empirically computed from the data, thus upper level (children of the root) of GO ontology have Information Content higher than 1.

Association testing of Go functions in neighborhoods.
In the Discussion section of this text, for some function GO x we say that GO y is a related function if: RS(GO x ,GO y ) ≥6 for GO x , GO y contained in the same namespace of the GO ontology, or if J(GO x ,GO y ) ≥0. 6 for GO x , GO y not contained in the same namespace of GO. We observed 6615 related pairs of functions with FDR < 1% in prokaryotic dataset. For some function GO x , we say that GO y is "unrelated function" if RS(GO x ,GO y ) <1, for GO x , GO y contained in the same namespace of the GO ontology, or if J(GO x ,GO y ) < 0.05 for GO x , GO y not contained in the same namespace of GO. We observed 204202 such pairs with FDR < 1% in prokaryotic dataset. These statistics are (47.5% − 8853 pairs, 97% − 878850 pairs) for Fungi and (99.9% − 36122 pairs, 100% − 1256151 pairs) for Metazoa.
The 11 gene functions analyzed individually in the manuscript are all taken from the same namespace of GO ontology, to prevent considering potentially synonymous functions from different namespaces as semantically distant.
For a given mapping ξ:∑ → P(Ω), that maps a GO function to a set of COGs which contain this function, we use J GO GO ( , ) x y to measure the level of circularity of pairs of functions (especially these from different namespaces of GO ontology). The information accretion of a function GO x was computed as ia(GO x )= −log 2 (P(GO x |Par(GO x ))), where Par denotes a set of parent nodes of GO x . This implies that if some model predicts GO function that does not have high probability of occurrence, given the parent nodes, it gets significantly larger accretion score. www.nature.com/scientificreports www.nature.com/scientificreports/ the average number of annotations per annotated gene. To assess the potential reason for a difference in predictive power of all models between Fungi and Metazoa dataset, we computed the average number of annotations per annotated gene. This information is important since it has direct impact on the stability and the amount of information contained in gene functional neighbourhoods.
Dataset for predicting phenotypes. The NFP dataset used to predict phenotypes in different strains of E.
coli was constructed so that E. coli strains are examples (entities), whereas features are COG/NOG-GO function pairs (frequency of occurrence of each GO in the neighbourhood of each COG/NOG contained in each E. coli strain). Thus, the whole table depicted in Fig. 6 is contained in each row of our NFP dataset for phenotype prediction. These features (sparse in nature) were used to create 228 principal components using PCA. target variables in analyses of phenotypes. The Phenotypic dataset contains 151 target variables (phenotypes) that denote if a fitness defect has been detected (value 1) after application of specific combination of drugs or not (value 0).
Dividing bacteria into two subgroups by ecology. To divide the prokaryotic organisms contained in our dataset into free-living and those associated to mammalian host (pathogenic in mammals), we used the ProTraits database 25 . We selected these prokaryotic strains having the integrated score above 90%. Taxonomy IDs obtained from these strains were used to obtain species-level taxonomy id, which is used to assign all strains of this particular species to the required subset. LOR distributions were computed using genomes of these subsets of bacteria.
Assessing correlation between functional enrichments and gene co-expressions. In order to determine if the enrichment phenomenon of semantically distant functions can be potentially explained by gene co-expression, we computed the pairwise gene co-expressions on the E. coli bacteria (using the gene expression data obtained from the Colombos database 38 ). Next, we computed the average correlation coefficient of all pairs of genes associated to the GO function pairs that are semantically distant (Resnik < 2) but significantly enriched (LOR > 2 used) and compared it to the average correlation coefficient of all pairs of genes associated to the GO function pairs that are semantically similar (divided by semantic similarity in the intervals [2,4 > , [4,6 >,> = 6) and that are significantly enriched (LOR > 2). As a control, we use the average gene correlation for genes associated to the pairs of GO functions (divided by semantic similarity as above) that are not significantly enriched and have LOR values in the [−0.5,0.5] interval. The obtained results show that the average correlation coefficient obtained consistently increases with the increase of semantic similarity in both groups. It is evident that the average correlation coefficients have higher values for genes associated to the enriched GO pairs, however the number of available data points (4) is too small to prove statistical significance of the difference in correlation between these groups. Overall, we did not detect correlation between enrichments of semantically distant pairs of GO functions and the average co-expression of genes associated to these functions.

Data availability
Data is available via the supplementary material of this publication or upon request from authors.