Main

Evolutionarily related genes (homologs) across different species are often divided into gene pairs that originated through speciation events (orthologs) and gene pairs that originated through duplication events (paralogs)1. This distinction is useful in a broad range of contexts, including phylogenetic tree inference, genome annotation, comparative genomics and gene function prediction2,3,4. Accordingly, dozens of methods5 and resources6,7,8 for orthology inference have been developed.

Because the true evolutionary history of genes is typically unknown, assessing the performance of these orthology inference methods is not straightforward. Several indirect approaches have been proposed. Based on the notion that orthologs tend to be functionally more similar than paralogs (a notion now referred to as the ortholog conjecture9,10,11,12), Hulsen et al.13 used several measures of functional conservation (coexpression levels, protein–protein interactions and protein domain conservation) to benchmark orthology inference methods. Chen et al.14 proposed an unsupervised learning approach based on consensus among different orthology methods. Altenhoff and Dessimoz15 introduced a phylogenetic benchmark measuring the concordance between gene trees reconstructed from putative orthologs and undisputed species trees. More recently, several 'gold standard' reference sets, either manually curated16,17 or derived from trusted resources18, have been used as benchmarks. Finally, Dalquen et al.19 used simulated genomes to assess orthology inference in the presence of varying amounts of duplication, lateral gene transfer and sequencing artifacts.

This wide array of benchmarking approaches poses considerable challenges to developers and users of orthology methods. Conceptually, the choice of an appropriate benchmark strongly depends on the application at hand. Practically, most methods are not available as stand-alone programs and thus cannot easily be compared on a common set of data. Likewise, some benchmarks rely on complex pipelines that may be difficult to implement. If public results are available as part of a publication or a resource, inconsistent genome releases or identifiers severely complicate comparisons. Some methods or benchmarks can also be computationally costly to run. As a result, users cannot easily identify appropriate tools, and methodological progress is hampered.

Here, we report on a community effort to standardize and facilitate orthology benchmarking. For this effort, we established a shared reference data set and developed a web-based service for automatic orthology benchmarking (http://orthology.benchmarkservice.org). We then used these resources to run a community experiment to assess 15 well-established orthology inference methods and resources on a wide array of phylogenetic and functional benchmarks. By providing a way to automatically include new methods and disseminate results publicly, we hope to maintain an up-to-date and comprehensive assessment of state-of-the-art orthology tools.

Results

Here, we provide an overview of the benchmark service and orthology inference methods and then present benchmarking results in three categories: species discordance tests, reference gene trees and functional tests. The benchmark service alone required the evaluation of 70,390,701 orthologous relationships and the inference of 233,000 phylogenetic trees.

Benchmark service

To automate ortholog benchmarking on a broad range of tests (detailed below), we developed a publicly accessible web service (Fig. 1). Using this workflow, an orthology method developer first infers orthologs using the Quest for Orthologs (QfO) reference proteome data set. Orthology inference methods vary in the kind of output they provide—e.g., labeled gene trees and orthologous groups—but it is usually possible to reduce these to orthologous pairs, which thus constitute a natural 'common denominator' for benchmarking. The benchmark service accepts these pairwise orthologs predictions in OrthoXML20 or tab-delimited format. As the OrthoXML format also supports InParanoid-style clusters and hierarchical orthologous groups, the service can automatically convert these to pairwise relationships.
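The reduction of groups to pairs can be illustrated with a minimal sketch. It assumes a flat orthologous group held in memory as a list of (species, gene) tuples, which is a hypothetical representation and not the OrthoXML schema, and it assumes that every cross-species pair within such a group counts as an orthologous pair. Hierarchical groups and InParanoid-style clusters require a more involved conversion that is not shown here.

```python
from itertools import combinations

def group_to_pairs(group):
    """Expand one flat orthologous group into unordered cross-species gene pairs.

    `group` is a list of (species, gene_id) tuples; this layout is a
    hypothetical stand-in for parsed input, not the real OrthoXML schema.
    """
    pairs = set()
    for (sp_a, gene_a), (sp_b, gene_b) in combinations(group, 2):
        if sp_a != sp_b:  # genes from the same species are not orthologs
            pairs.add(frozenset((gene_a, gene_b)))
    return pairs

# Illustrative (made-up) group: one human gene and two mouse genes.
example_group = [("HUMAN", "h1"), ("MOUSE", "m1"), ("MOUSE", "m2")]
example_pairs = group_to_pairs(example_group)
```

Storing each pair as a frozenset makes the relationship symmetric, so (h1, m1) and (m1, h1) are counted once.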

Figure 1: The Orthology Benchmark service facilitates assessment and comparison of orthology inference methods.

Orthology method developers run their methods on a reference proteome set and submit the inferred orthologs to the service. The predictions are subjected to a battery of phylogenetic and functional tests, and the results are returned to the method developer, who can choose to disclose them publicly.

Next, the service ensures that only predictions among valid reference proteomes are provided (with scoring implicitly assuming that the uploaded inferences are complete). Benchmarks are then selected and run in parallel; some may take up to several hours. Finally, statistical analyses determine the method's performance on each benchmark data set. Where possible, performance is measured in terms of precision (i.e., positive predictive value: the proportion of ortholog predictions that are correct) and recall (i.e., sensitivity, or true positive rate: the proportion of actual orthologs that are correctly predicted). Raw data and results are stored and provided to the submitter, who can choose to make the results publicly available. In order to achieve transparency and encourage improvements, we have released source code under an open source license (Mozilla Public License Version 2.0) at https://github.com/qfo/benchmark-webservice (also Supplementary Software).
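As a small illustration of the two measures defined above, the following sketch computes precision and recall from two sets of unordered gene pairs. The function names and the toy identifiers are hypothetical; the benchmark service's own implementation may differ.

```python
def precision_recall(predicted, reference):
    """Return (precision, recall) for two sets of unordered gene pairs."""
    true_positives = len(predicted & reference)
    # precision: proportion of predicted pairs that are correct
    precision = true_positives / len(predicted) if predicted else 0.0
    # recall: proportion of reference pairs that were predicted
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall

def as_pairs(tuples):
    """Convert (gene, gene) tuples into order-independent pairs."""
    return {frozenset(t) for t in tuples}

# Made-up example: 3 predictions, 4 true orthologous pairs, 2 in common.
predicted = as_pairs([("a1", "b1"), ("a2", "b2"), ("a3", "b9")])
reference = as_pairs([("a1", "b1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b5")])
precision, recall = precision_recall(predicted, reference)
```

In this toy case two of the three predictions are correct (precision 2/3) and two of the four reference pairs are recovered (recall 1/2). Reference pairs that are absent from a submission lower recall, which is why incomplete submissions are penalized.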

Methods investigated

We investigated a broad array of well-established methods, including three tree-based methods: Ensembl Compara21, PANTHER 8.0 (ref. 22) and PhylomeDB23; seven graph-based methods (i.e., based on pairwise comparisons): Best Reciprocal Hits24, Reciprocal Smallest Distance (RSD)25, EggNOG26, Hieranoid27, InParanoid28, OMA29 and OrthoInspector30; and a meta-method incorporating both tree- and graph-based methods, MetaPhOrs31. For some methods, multiple variants are included in the analysis (Online Methods). Each method inferred orthologs on the 754,149 protein sequences from 66 reference genomes except for MetaPhOrs, which inferred orthologs on all but three prokaryotes (Online Methods).

Generalized species tree discordance test

Orthology was first defined in the context of species tree inference, which requires genes related through speciation1. The species tree discordance test exploits this relationship by assessing the accuracy of orthologs in terms of the accuracy of the species tree that can be reconstructed from them15. The original protocol was limited to species tree 'comb' topology (a specific type of tree in which all bifurcations occur along a single path) and a small number of taxa (up to six). Here we overcome these two limitations by generalizing the orthology sampling procedure to any tree topology and employing larger reference trees from the SwissTree initiative. Furthermore, to minimize the possibility of gene–species tree discordance due to incomplete lineage sorting, we avoided sampling orthologs among species separated by branches shorter than 10 million years (myr) (Online Methods and Supplementary Fig. 1).

We observed different trade-offs between average discordance (Robinson–Foulds32 distance, as a proxy for the false discovery rate, the complement of precision) and the number of trees that can be sampled (proxy for recall) across all methods (Fig. 2). An ideal method would be placed in the lower right corner of Figure 2. When considering eukaryotes, results with the highest precision and lowest recall were obtained with OMA groups. At the other extreme, PANTHER 8.0 (all) tended to yield the highest recall and lowest precision results. Among the more balanced methods, no method consistently obtained a better balance than the other methods across all data sets, but OrthoInspector, InParanoid and PANTHER (LDO only) performed well overall. In terms of broad categories, there is no obvious systematic difference in performance between tree-based (Ensembl, PANTHER and PhylomeDB) and graph-based methods (the rest) or between methods relying on a species tree (Ensembl, PANTHER, PhylomeDB, OMA GETHOGs, Hieranoid and EggNOG) and methods that do not. The latter point is perhaps unexpected, as one could expect knowledge of the species tree to provide an 'unfair' advantage in this particular benchmark. If there is any such effect, our results indicate that it is small.
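For readers unfamiliar with the Robinson–Foulds distance used as the discordance measure, the following sketch computes its unnormalized form for two unrooted trees on the same leaf set. Trees are written as nested tuples of leaf names, which is an illustrative encoding chosen here; any normalization or averaging applied by the benchmark is not reproduced.

```python
def leaf_set(tree):
    """Return all leaf names below a nested-tuple tree node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def splits(tree):
    """Non-trivial bipartitions of an unrooted tree given as nested tuples."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # fixed leaf used to pick one side of each split
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        # store the side of the split that does not contain the anchor leaf
        side = all_leaves - clade if anchor in clade else clade
        if 1 < len(side) < len(all_leaves) - 1:  # skip trivial splits
            found.add(side)
        return clade

    visit(tree)
    return found

def robinson_foulds(tree_a, tree_b):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))

# Two five-leaf toy trees that disagree on the placement of B and C.
tree_a = ((("A", "B"), "C"), ("D", "E"))
tree_b = ((("A", "C"), "B"), ("D", "E"))
rf_distance = robinson_foulds(tree_a, tree_b)
```

The two toy trees share the split separating D and E from the rest but each has one split the other lacks, giving a distance of 2; identical trees give 0.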

Figure 2: The Generalized Species Tree Discordance test assesses the congruence of inferred orthologs with a trusted reference tree.

Benchmarking results are shown for eukaryotes. A trade-off between precision (measured in terms of tree error in the y-axis) and recall (measured in terms of completed tree samples in the x-axis; Online Methods) can be observed. Only high-confidence branches of the reference tree (L90, Online Methods), at least 10 myr long, are considered. Error bars indicate 95% confidence intervals and the line indicates the 'Pareto frontier'.


These trends persisted when we measured recall in terms of the number of inferred orthologs (Supplementary Fig. 2) or when we focused on other clades (Supplementary Figs. 3, 4, 5). Among vertebrates, the results were largely consistent, but we noted minor differences in the ranking of individual methods, with InParanoid Core yielding the highest precision and MetaPhOrs the highest recall (Supplementary Fig. 3). We also benchmarked the methods for their ability to recover ortholog relationships among 'universal' genes by applying the species discordance test on a tree spanning across archaea, bacteria and eukaryotes. Once again, there were slight variations in the precise ranking of methods, but the overall trends were very similar to what was observed for eukaryotes only (Supplementary Fig. 5). Finally, if we included (high-confidence) short branches as well, the average concordance of reconstructed trees substantially decreased, both because short branches tend to be harder to infer and because of potential incomplete lineage sorting around them; however, the relative position of the methods remained practically unchanged, which was a further indication of the robustness of the benchmark (Supplementary Fig. 6).

Reference gene trees

The second series of orthology benchmarks employs evolutionary relationships of gene pairs derived from annotated high-quality gene trees. Such reference trees are inferred through a careful combination of computational inference and expert curation: results obtained at each step of the tree inference pipeline (homolog identification, alignment, tree inference and gene–species tree reconciliation) are individually inspected, poor-quality sequences are excluded from the analysis and results are typically assessed using multiple models. This manual oversight is expected to yield gene phylogenies with high statistical support and topological consistency.

Concordance of orthology predictions was assessed with two sets of trees. The first was SwissTree16,33, a small collection of large, high-confidence gene family phylogenies with different types of challenges for orthology prediction and species from all domains. The second, TreeFam-A34, consisted of a larger set of metazoan gene trees and thus covered a taxonomically restricted but wider range of protein families. Results obtained from the two benchmarks were quite similar (Fig. 3). On these benchmarks, virtually no trade-off between precision and recall appeared to be necessary. The best-performing methods were the ones that adopted a balanced precision–recall strategy, with MetaPhOrs doing particularly well. Methods with a more skewed precision–recall strategy (in particular, stringent OMA groups and permissive PANTHER (all)) fared poorly in comparison. This may be due in part to the nature of the reference gene tree data set, which focuses on gene families with a tractable evolutionary history. On ambiguous phylogenies, mistakes would become unavoidable and a skewed strategy could become preferable, depending on the application.

Figure 3: Benchmark results using sets of reference gene trees.

Evolutionary gene relationships are predicted for the QfO reference proteomes by 15 different methods. From the results, pairs of orthologous relationships are determined for each method and compared to those obtained from the reference gene trees of (a) SwissTree and (b) TreeFam-A. Error bars indicate 95% confidence intervals.


Functional benchmarks

The third series of benchmarks evaluated orthology in terms of functional similarity. Although orthology is an evolutionary and not a functional relationship, we chose to include functional benchmarks for two reasons. First, for similar levels of sequence divergence, orthologs have been shown to be moderately (but significantly) more conserved than paralogs in terms of Gene Ontology (GO) annotation similarity11. For a given evolutionary distance, more accurate orthology inference is thus likely to be correlated with more functionally similar gene pairs. Second, many users are interested in using orthologs to identify functionally conserved genes; for this purpose, functional benchmarks are directly relevant.

We assessed functional similarity based on experimentally backed annotations from the UniProt–Gene Ontology Annotation (GOA) database35 and Enzyme Commission (EC) numbers from the ENZYME database36. Though the two benchmarks consider different aspects of gene function, the results were largely consistent. In both cases, orthology inference methods showed a clear trade-off between precision (measured as the average Schlicker semantic similarity37 of functional annotations associated with orthologs) and recall (measured as the number of ortholog relationships predicted; Fig. 4). The only exception was with the EC number benchmark, where MetaPhOrs falls beneath the 'Pareto frontier' (the frontier defined by the methods that are not outperformed by any other method in both precision and recall). However, MetaPhOrs is also the only method with missing taxa, and the three missing taxa contain a substantial number of genes with EC annotations (827 in total). This lack of EC annotations has a negative effect on the recall.
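The definition of the Pareto frontier given above translates directly into a short routine. The sketch below uses made-up method names and scores purely for illustration; it treats a method as outperformed only if some other method is strictly better in both precision and recall.

```python
def pareto_frontier(methods):
    """Return names of methods not outperformed in both precision and recall.

    `methods` maps a method name to a (recall, precision) tuple.
    """
    frontier = []
    for name, (recall, precision) in methods.items():
        dominated = any(
            other_recall > recall and other_precision > precision
            for other, (other_recall, other_precision) in methods.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)

# Hypothetical scores: (number of predicted pairs, average similarity).
toy_scores = {
    "strict": (100, 0.9),
    "balanced": (500, 0.7),
    "permissive": (900, 0.5),
    "weak": (400, 0.6),
}
toy_frontier = pareto_frontier(toy_scores)
```

Here the hypothetical method "weak" is beaten on both axes by "balanced" and therefore falls beneath the frontier, while the other three each represent a different, non-dominated trade-off.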

Figure 4: Benchmarks of functional similarity between inferred orthologous gene pairs.

Two different types of functional annotations are used: (a) experimentally supported GO annotations and (b) Enzyme Commission (EC) numbers. Error bars indicate 95% confidence intervals.


Discussion

The Orthology Benchmark service overcomes many of the practical complications previously associated with orthology benchmarking. It enables systematic comparison of a new method with state-of-the-art approaches on a wide range of benchmarks. It replaces current practice, which typically includes fewer methods, fewer tests and less empirical data.

Previous benchmarking efforts required painstaking and error-prone mapping of proteins between different sources, releases and choices of alternative splicing variants. In contrast, by relying on a common set of data for all methods, the benchmark service ensures that the results obtained by different methods are directly comparable. The only caveat is that, since proteomes vary in quality and analytical difficulty, the results on the benchmark data set may not entirely reflect the quality of the orthology assignments otherwise provided by each resource. The choice of species included in the QfO reference proteomes (Online Methods) requires a compromise between (i) increasing the number of proteomes to make the benchmark set more representative of current resources and (ii) keeping the number of proteomes low to facilitate and encourage new submissions to the benchmark.

Submissions performed on a subset of the proteomes are discouraged, as all missing predictions are counted as false negatives. This provides an incentive for submitters to analyze the entire reference proteome data set. We considered alternative ways of handling submissions on partial data, but these approaches had major flaws. For example, one alternative was to extrapolate scores obtained on the subset of proteomes considered in a particular submission to all data. However, this approach could introduce a bias in the analyses (e.g., some methods only predict orthologs for 'easy' pairs of proteomes). Another alternative was to restrict comparisons to the intersection of proteomes analyzed by all methods. However, this approach results in an excessive waste of information, as the intersection can only decrease with each additional method.

Overall, results obtained across multiple phylogenetic and functional tests corroborated previous observations that the main difference among the established orthology inference methods lies in the trade-off they produce in terms of precision and recall13,15,17. However, this trade-off was not present in the reference gene tree test, perhaps because sequences with ambiguous location are typically excluded from these hand-curated trees. On these reference trees, the meta-method MetaPhOrs performed particularly well. The analysis also confirmed that the widely used reciprocal best hit approach has a relatively high precision but a relatively low recall38,39. Other methods fill different niches, with OMA groups and PANTHER (all) often lying at the two extremes of the precision–recall trade-off. Among the more balanced approaches, InParanoid, Hieranoid and OrthoInspector showed solid performance in most benchmarks.

The decision of whether to favor a skewed or a balanced approach to the precision–recall trade-off strongly depends on the application. For instance, hypothesis-generating analyses may favor a high recall, while phylogenomic species tree inference typically requires high precision. Because of this, we refrained from computing a combined score, which would necessarily entail a statement of preference with respect to this trade-off.

To be deemed competitive, a method should ideally reach or exceed the Pareto frontier in at least a subset of the benchmarks. If it does not, the benchmark service may help uncover bugs or deeper flaws. Analogous to unit testing in software engineering, benchmarking can also provide quality control for new releases of established resources. In the course of the present community benchmarking effort, over a hundred sets of predictions were submitted to the service. Many submitters did not make their results publicly available, presumably after discovering poor outcomes in some of the benchmarks. This demonstrates the effectiveness of the benchmark service for quality control.

The bane of benchmarking is circularity. Despite our best efforts, not all circularity could be avoided. Some methods used knowledge of the species tree in their inference; however, this potentially unfair advantage produced a negligible difference in performance for these methods. More generally, many methods were trained or fine-tuned using some of the benchmarks considered here. For instance, parameters of the meta-method MetaPhOrs were in part trained using TreeFam-A31. Similarly, the latest versions of InParanoid28 and PhylomeDB23 used the benchmark service for parameter fine-tuning. As for the functional benchmarks, although GO annotations derived from sequence comparisons were excluded, experiments are often guided by sequence similarity to proteins with known function. Thus, even when restricting analyses to experimentally backed GO annotations, we cannot avoid circularity entirely. However, because the benchmarks are collectively underpinned by a large amount of data from a broad range of species (tens of thousands of trees and hundreds of thousands of pairs of functional annotations), the risk of overfitting seems low, and this potential risk will be monitored by the QfO benchmarking working group. New benchmarks may be introduced over time to detect and discourage overfitting.

Presently, the benchmark service uses orthologous gene pairs as 'common denominators' among all the methods. However, many resources provide richer outputs—such as reconciled gene trees or hierarchical orthologous groups—and may indeed be optimized for these. The performance on pairwise data is thus not entirely representative of what the data offer. In the future, however, the benchmark service could be extended to evaluate these richer, more specific orthology formats as well. Similarly, the benchmark service could also be extended to take into account confidence scores or posterior probabilities, which are particularly relevant to likelihood-based orthology inference methods40,41.

Methods

Quest for orthologs reference proteomes and species tree.

The QfO consortium has defined a consensus data set of proteomes and common file formats6,7 to be used by diverse orthology inference methods, allowing for standardized benchmarks and aiding integration of multiple ortholog sources. The QfO Reference Proteomes data sets were created as a collection of data providing a representative protein for each gene in the genome of selected species. Such data sets have been generated annually from the UniProt Knowledgebase (UniProtKB) database42 for the past five years. To this end, a gene-centric pipeline has been developed and enhanced over these years by UniProt. The QfO Reference Proteomes are a manually compiled subset of the UniProt reference proteomes, comprising well-annotated model organisms and organisms of interest for biomedical research and phylogeny, with the intention to provide broad coverage of the tree of life. These complete, nonredundant reference proteomes are publicly available at ftp://ftp.ebi.ac.uk/pub/databases/reference_proteomes/QfO. The data sets are provided either in SeqXML20 format or as a collection of FASTA files.

The benchmarking effort reported here uses the reference proteomes data set released in 2011, which comprises 754,149 nonredundant protein sequences from 66 species (40 eukaryotes and 26 bacteria–archaea).

The reference species tree used in this study was produced by the QfO species tree working group, which surveyed the literature to establish a well-supported tree topology for the 66 species43 (Supplementary Fig. 1). The internal nodes of this reference species tree have assigned confidence levels based on the agreement among the resources surveyed (L90: congruent, significant branch support; L70: congruent; L50: one alternative species tree topology; L30: default level; L10: two or more alternative species tree topologies have been reported; for more detail, see Boeckmann et al.43). The latest version of the tree can be retrieved from http://swisstree.vital-it.ch/species_tree. To minimize the chance of including cases of incomplete lineage sorting in the species tree discordance benchmark, we estimated the evolutionary times of all internal branches using the timetree resource44 and collapsed branches that were shorter than 10 myr.

Orthology databases and methods.

EggNOG26 (http://eggnogdb.embl.de) is a database of Orthologous Groups (OGs) and functional annotation covering prokaryotic and eukaryotic species. Since version 4.1, the EggNOG method is also capable of producing fine-grained (for example, pairwise) orthology predictions based on the automated analysis of phylogenetic trees. For this study, the complete set of 66 reference proteomes was independently analyzed using the EggNOG pipeline, which involved 1) joining proteins into inparalogous groups from closely related species and 2) de novo reconstruction of 38,513 OGs by clustering the obtained inparalogous groups based on triangles of their reciprocal best hits45. Phylogenetic analysis and automated tree interpretation for each OG were subsequently performed using the workflow described in PhylomeDB23 as implemented in the ETE Toolkit v2.3 (ref. 46). The phylogenetic approach used included testing three aligners (MAFFT47 v6.861b, Muscle48 v3.8.31 and Clustal Omega49 v1.2.1) and five evolutionary models (LG, WAG, JTT, VT and MtREV); applying alignment consensus and soft trimming techniques (M-Coffee50 v10, trimAl51 v1.3); and using maximum likelihood tree inference (PhyML52 v3). This workflow is labeled as eggnog41 when using the ETE-build command and was applied on a per-OG basis. Pairwise orthology predictions were derived from each tree using the species overlap algorithm53 after rooting trees to midpoint. The predictions were submitted to the benchmark service in July 2015.
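The species overlap idea used to derive pairwise predictions from a rooted gene tree can be sketched as follows. This is a simplified illustration, not the ETE Toolkit implementation: it assumes a binary tree written as nested 2-tuples, leaf names of the hypothetical form "SPECIES|gene", and a strict rule in which any shared species between the two child subtrees marks a duplication node. Midpoint rooting is not shown.

```python
from itertools import product

def species_of(leaf):
    """Extract the species code from a leaf named 'SPECIES|gene' (assumed format)."""
    return leaf.split("|")[0]

def species_overlap_orthologs(tree):
    """Return (leaves, ortholog_pairs) for a rooted binary nested-tuple tree."""
    if isinstance(tree, str):
        return [tree], set()
    left, right = tree
    left_leaves, left_pairs = species_overlap_orthologs(left)
    right_leaves, right_pairs = species_overlap_orthologs(right)
    pairs = left_pairs | right_pairs
    left_species = {species_of(leaf) for leaf in left_leaves}
    right_species = {species_of(leaf) for leaf in right_leaves}
    if not (left_species & right_species):
        # no shared species: treat the node as a speciation, so every
        # gene on the left is orthologous to every gene on the right
        pairs |= {frozenset(p) for p in product(left_leaves, right_leaves)}
    # otherwise the node is treated as a duplication and adds no pairs
    return left_leaves + right_leaves, pairs

# Toy tree: a duplication before the human-mouse split, with a yeast outgroup.
toy_tree = (
    (("HUMAN|a1", "MOUSE|a1"), ("HUMAN|a2", "MOUSE|a2")),
    "YEAST|a",
)
toy_leaves, toy_orthologs = species_overlap_orthologs(toy_tree)
```

In the toy tree, HUMAN|a1 and MOUSE|a2 are separated by the duplication node and are therefore not reported as orthologs, whereas the yeast gene is orthologous to all four human and mouse genes.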

Ensembl Compara21 uses a gene–species tree reconciliation pipeline. The predictions were run using the code released in version 81 of Ensembl (July 2015). However, Treebest (the software used to build phylogenetic trees) had to be adapted to accept alignments of protein sequences. Treebest makes a consensus out of trees built with various phylogenetic methods, some of which required nucleotide sequences, which were not provided in the QfO data set. The list of maximum-likelihood models and distance methods (used for neighbor joining) was thus updated to: WAG, JTT and Dayhoff instead of WAG and HKY (for maximum likelihood), and JTT, Kimura and mixed amino acid models instead of dN, dS and mixed nucleotide models (for neighbor joining). The predictions were submitted to the benchmark service in June 2015. An older submission based on version 66 of the Ensembl code (June 2011) is also present on the benchmark service.

Hieranoid27 performs pairwise orthology analysis using InParanoid at each node in a guide (species) tree as it progresses from its leaves to the root. This concept reduces the total runtime complexity from a quadratic to a linear function of the number of species. We ran Hieranoid 2.0. Hieranoid outputs ortholog groups structured as species trees with orthologs at all levels, hence there can be many outparalogs within an ortholog group. The trees were therefore parsed to extract ortholog pairs only at the last common ancestor of two species, for all species pairs. The predictions were submitted to the benchmark service in April 2015.
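The runtime argument can be made concrete with a small calculation. The sketch below counts pairwise analyses under two simplifying assumptions that are not stated in the text: an all-against-all strategy runs one analysis per species pair, and the guide tree is fully binary, so that it has one internal node per analysis.

```python
def all_vs_all_runs(n_species):
    """Number of species pairs, i.e. analyses in an all-against-all scheme."""
    return n_species * (n_species - 1) // 2

def guide_tree_runs(n_species):
    """Number of internal nodes in a rooted, fully binary guide tree."""
    return n_species - 1

# Counts for the 66 reference proteomes used in this study.
quadratic_count = all_vs_all_runs(66)
linear_count = guide_tree_runs(66)
```

For the 66 reference proteomes this gives 2,145 species pairs versus 65 guide-tree nodes, although the cost of each individual analysis is not the same in the two schemes.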

InParanoid28 is a graph-based algorithm that aims to generate orthologous groups that include all inparalogs but no outparalogs between species pairs. Version 4.1 of the algorithm was run with default parameters. Two variants were tested in this study: the regular InParanoid output containing all predicted pairs of orthologs (labeled InParanoid in the plots) and a high-confidence set including only orthologs with InParanoid's maximum confidence score of 1.0 (labeled InParanoid (core)). The predictions were submitted to the benchmark service in June 2011.

MetaPhOrs31 (Meta Phylogeny-based Orthologs) is a repository of orthologs and paralogs that were computed using phylogenetic trees available in several databases or computed from graph-based orthologous groups. For each orthology–paralogy prediction, MetaPhOrs (http://orthology.phylomedb.org/) provides two reliability scores: Evidence Level (the number of repositories from which the prediction is retrieved) and Consistency Score (the overall agreement of the source databases about the prediction). MetaPhOrs does not include predictions for the three reference genomes Streptomyces coelicolor, Thermotoga maritima and Pyrococcus kodakaraensis (strain KOD1). The predictions were submitted to the benchmark service in February 2013.

OMA29 (Pairs, Groups, HOGs) is a publicly available resource (http://omabrowser.org/) that provides orthology predictions among thousands of proteomes from all domains of life. OMA uses evolutionary distance estimates from Smith–Waterman alignments to infer orthologs. A distinct feature among graph-based methods is the witness of nonorthology step in its pipeline, where cases of differential gene losses get detected. OMA provides three different groupings of orthologs: (i) OMA Pairs, the raw pairwise ortholog relationships, which form a gene-centric view that lists all the orthologs for a given gene; (ii) OMA Groups, a very stringent type of grouping where all member proteins are orthologous to one another within a group. OMA Groups have been designed mainly for species tree inference purposes, as gene trees built from them should be congruent with the species tree; and (iii) OMA HOGs, hierarchical orthologous groups that we constructed using the GETHOGs algorithm54. These are nested groups that contain genes that descend from a single common ancestral gene within a given taxonomic range. The predictions were submitted to the benchmark service in June 2011 (OMA pairs and groups) and in March 2013 (OMA HOGs).

OrthoInspector30 is a database of precomputed orthology and inparalogy relationships and a stand-alone package allowing large-scale predictions of orthology between thousands of proteomes (http://lbgi.fr/orthoinspector/). The resource has recently undergone a major new release, with improved speed and visualisation tools, but the inference algorithm is unchanged from the initial graph-based method described in Linard et al.55. The predictions were submitted to the benchmark service in June 2011.

PANTHER 8.0 (ref. 22) is based on version 8.0 of the PANTHER database (http://pantherdb.org), released in 2012 (the current version is 10.0, released in 2015). Family membership of each sequence is based on HMM scoring to the PANTHER 'library' of HMMs (at both the family and subfamily levels). Sequences were aligned with MAFFT56 and the resulting alignment was used to construct phylogenetic trees with the GIGA program57. GIGA (version 1.1 was used for PANTHER version 8.0) uses a species tree to guide tree construction, and all nodes in the tree are labeled as speciation or gene duplication events; these labeled nodes are used to infer orthologs (pairs of genes with a speciation event as their common ancestor). PANTHER predicts two types of orthologs: least-diverged orthologs (LDO) and other orthologs (O). LDO pairs can be simplistically thought of as 'the same gene' in two different species. Formally, the two genes created by each gene duplication event in the tree are treated asymmetrically: the least diverged duplicate (the one with the shortest branch immediately following the duplication) remains in the same LDO group as its ancestor, while the other duplicate founds a new LDO group. The benchmarking was performed on either LDO only, or all orthologs (including both LDO and O). The predictions were submitted to the benchmark service in February 2013.

PhylomeDB23 (http://phylomedb.org/) is a publicly available repository of phylomes, i.e., the complete collection of phylogenies for all genes of a given species in a predefined evolutionary context. PhylomeDB is unique among other repositories in that it follows an approach that is both gene centric and genome wide. PhylomeDB uses its phylogenetic trees to infer orthology and paralogy relationships. For the Quest for Orthologs project, 42 phylomes were reconstructed using different combinations of the 66 species in the benchmark. A total of 458,108 phylogenetic trees were generated, which were later combined to provide orthology predictions for all proteins included in the benchmark. Briefly, each tree was scanned and only the partition of up to 30 sequences, including the seed protein, was kept. Then, evolutionary relationships were computed for those protein sequences based on a species overlap approach. Redundant predictions across the 42 phylomes were unified using the Consistency Score (CS) as implemented in MetaPhOrs (see above). Only those predictions having a Consistency Score greater than or equal to 0.5 across the whole data set were called orthologs. The predictions were submitted to the benchmark service in June 2013.

RBH24 (reciprocal best hit) is a classic method that identifies, for every pair of species, the pairs of genes with mutually highest alignment score. Here, we use reciprocal blastp hits as orthologs, with an E-value threshold of 1e–2, and we keep all hits scoring ≥99% of the highest score. The predictions were submitted to the benchmark service in January 2016.
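The selection rule can be sketched as below. This is an illustrative reading and not the code used for the submission: the tuple layout of the hits, the function name, and the interpretation of the E-value setting as an upper cutoff (hits with larger E-values are discarded) are assumptions.

```python
def reciprocal_best_hits(hits_ab, hits_ba, max_evalue=1e-2, tolerance=0.99):
    """Return (gene_a, gene_b) pairs that are near-best hits in both directions.

    hits_ab: iterable of (query_in_A, subject_in_B, score, evalue)
    hits_ba: iterable of (query_in_B, subject_in_A, score, evalue)
    """
    def near_best(hits):
        by_query = {}
        for query, subject, score, evalue in hits:
            if evalue <= max_evalue:
                by_query.setdefault(query, []).append((subject, score))
        kept = {}
        for query, scored in by_query.items():
            top = max(score for _, score in scored)
            # Keep every subject scoring at least `tolerance` of the top score.
            kept[query] = {s for s, score in scored if score >= tolerance * top}
        return kept

    best_ab = near_best(hits_ab)
    best_ba = near_best(hits_ba)
    return {
        (a, b)
        for a, subjects in best_ab.items()
        for b in subjects
        if a in best_ba.get(b, ())
    }
```

Because of the 99% tolerance, one gene can have several reciprocal partners, so the output may contain one-to-many relations.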

RSD25 infers orthology relationships by finding pairs of genes whose nearest gene, by distance computed using PAML, is the other gene in the pair. Candidate genes are also filtered using BLAST E-value and multiple-sequence alignment divergence thresholds. This method is implemented in the database RoundUp58, a large-scale orthology database developed by the Wall Lab. The database is no longer maintained, but the source code is still available at https://github.com/todddeluca/reciprocal_smallest_distance/. To identify orthologs, we ran the algorithm with divergence and E-value cutoffs of 0.8 and 1e–5, respectively. The predictions were submitted to the benchmark service in February 2012.

Benchmarks.

Generalized species tree discordance. The idea behind the species tree discordance test is simple. Two genes are orthologous if they started diverging through a speciation event. Therefore, if we sample putative orthologous genes such that all resulting genes are related through speciation events, the resulting tree should be congruent with the species tree. Previously, we presented a sampling strategy for fully imbalanced tree topologies15. Here, we extend this idea to arbitrary reference trees, including those with soft polytomies (unresolved nodes).

The following procedure is repeated a large number of times. We start with a random gene in a random genome. We then attempt to sample a maximal path along the tree by selecting an orthologous gene in the 'next' species in the tour from the list of reported orthologs (Supplementary Fig. 7a). If there are multiple possibilities in the choice of the 'next' species due to soft polytomies, or in the choice of the orthologous counterparts due to one-to-many or many-to-many orthology, a choice is made at random. If there is no predicted ortholog at any step along the path, the sample is deemed unsuccessful. Alternatively, if at least one orthologous counterpart is predicted at each step, this results in a set of n sequences. Assuming that i) the reference tree is correct, ii) the retrieved orthologs are all correct and iii) all within-species variation is fixed (i.e., no incomplete lineage sorting), it is easy to prove that the unrooted evolutionary tree relating these sequences should only contain speciation nodes and should therefore be congruent with the reference species tree.
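One round of this sampling might look like the sketch below. Several simplifications are assumptions made here: the tour is passed in as an already chosen, ordered list of species (deriving it from the reference tree and resolving soft polytomies is not shown), each step picks an ortholog of the most recently sampled gene, and all names are hypothetical.

```python
import random


def sample_ortholog_path(tour, genes_by_species, orthologs, rng=random):
    """Sample one gene per species along `tour`, or return None on failure.

    tour: ordered list of species names
    genes_by_species: dict mapping species -> set of gene ids
    orthologs: dict mapping gene id -> set of predicted orthologous gene ids
    rng: object with a `choice` method (the random module or a random.Random)
    """
    current = rng.choice(sorted(genes_by_species[tour[0]]))
    sampled = [current]
    for species in tour[1:]:
        candidates = sorted(orthologs.get(current, set()) & genes_by_species[species])
        if not candidates:
            return None  # no predicted ortholog at this step: unsuccessful sample
        current = rng.choice(candidates)  # random pick among co-orthologs
        sampled.append(current)
    return sampled
```

Repeating the call many times and counting the successful samples would give the number of trees that later serves as a proxy for recall.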

Proof: The n sequences of the circular tour are obtained by starting from a random sequence and retrieving n − 1 pairs of orthologs. By construction, these n − 1 pairs of orthologs belong to pairs of species that have distinct last common ancestors and thus coalesce in different speciation nodes in the phylogenetic tree of these sequences. Therefore, that tree contains at least n − 1 distinct speciation nodes. However, the rooted, fully resolved evolutionary tree of n species has exactly n − 1 internal nodes. Thus, all the internal nodes of the gene tree are speciation nodes. Since we assume that there is no incomplete lineage sorting, as long as the input orthologs are correct, the tree relating these sequences should be congruent with the species tree.

A least-squares distance tree is reconstructed for each set of putative orthologous sequences. After aligning the sequences with MAFFT47, maximum likelihood distances and their variances (using the inverse Fisher information) are estimated using the EstimatePam() function in the Darwin programming environment59 for each pair of sequences. Next, the gene tree is estimated using Darwin's MinSquareTree() function, which is a fast implementation of the weighted least-squares trees60 constrained to non-negative branch lengths61. We have previously shown that orthology benchmarking results obtained with such distance trees are consistent with more computationally demanding maximum likelihood trees15. The Robinson–Foulds32 distance between this gene tree and the reference tree measures the false discovery rate, while the total number of trees is used as a proxy for recall. Due to the stochastic nature of the algorithm, repeated runs of the benchmark may lead to slightly (albeit nonsignificantly) different results.
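For readers unfamiliar with the Robinson–Foulds distance, the sketch below computes it for two unrooted trees on the same leaf set by counting the non-trivial bipartitions found in one tree but not the other. It is a generic illustration, not the Darwin-based benchmark code: trees are written as nested tuples of leaf names, the function names are hypothetical, and any special handling of soft polytomies in the reference tree is not modeled.

```python
def clades(tree):
    """Return (leaf set, leaf sets below each internal node) for a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaf_set, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaf_set |= child_leaves
        found.extend(child_found)
    found.append(leaf_set)
    return leaf_set, found


def splits(tree):
    """Return the non-trivial bipartitions, each stored as the side lacking an anchor leaf."""
    all_leaves, found = clades(tree)
    anchor = min(all_leaves)
    result = set()
    for clade in found:
        side = clade if anchor not in clade else all_leaves - clade
        # A split is non-trivial only if both sides hold at least two leaves.
        if 1 < len(side) < len(all_leaves) - 1:
            result.add(side)
    return result


def robinson_foulds(tree_a, tree_b):
    """Size of the symmetric difference between the two split sets."""
    return len(splits(tree_a) ^ splits(tree_b))
```

Storing each split by the side that excludes a fixed anchor leaf makes the comparison independent of where either tree happens to be rooted.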

Reference gene trees. Reference gene trees labeled with speciation and duplication events were downloaded from SwissTree on March 23, 2015 (http://swisstree.vital-it.ch/) and Treefam-A version 7 (http://www.treefam.org/). As sequences analyzed in these two resources can differ from those of the QfO reference proteomes, sequences were mapped based on gene identifiers or sequence identifiers. After mapping, for each family the n(n − 1)/2 induced pairwise evolutionary relationships were extracted and compared with the orthologous predictions from each orthology prediction method as follows. Let $G = \{g_i\}$ be the set of all genes in the reference gene tree and $RO = \{(g_i, g_j) \mid g_i \in G,\ g_j \in G,\ g_i \neq g_j,\ \mathrm{label}(g_i, g_j) = \text{speciation}\}$ the set of true orthologs according to the reference tree. Likewise, let $RP$ be the set of nonorthologous relations in that family and $P = \{(g_i, g_j)\}$ be the set of all predictions made by the orthology method. With $PF = \{(g_i, g_j) \mid (g_i, g_j) \in P \wedge g_i \in G \wedge g_j \in G\}$, we denote the set of orthologs where both members are part of the reference gene family. Now, the true/false positives/negatives are simply $TP = PF \cap RO$, $FP = PF \cap RP$, $FN = RO \setminus PF$ and $TN = RP \setminus PF$. From these values we can compute positive predictive values (PPV) and true positive rate (TPR): $PPV = |TP|/(|TP| + |FP|)$, $TPR = |TP|/(|TP| + |FN|)$.
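The set definitions above translate almost directly into code. The following sketch scores a single family. The function name, the representation of unordered pairs as frozensets and the choice to return None for an undefined rate are assumptions made for illustration.

```python
from itertools import combinations


def score_family(family_genes, reference_orthologs, predictions):
    """Compute TP/FP/FN/TN counts, PPV and TPR for one reference gene family.

    family_genes: set of gene ids in the reference tree (G)
    reference_orthologs: set of frozenset pairs labeled speciation (RO)
    predictions: set of frozenset pairs predicted by the method (P)
    """
    all_pairs = {frozenset(pair) for pair in combinations(sorted(family_genes), 2)}
    ro = set(reference_orthologs)
    rp = all_pairs - ro                                   # nonorthologous relations
    pf = {pair for pair in predictions if pair <= family_genes}
    tp, fp, fn, tn = pf & ro, pf & rp, ro - pf, rp - pf
    ppv = len(tp) / (len(tp) + len(fp)) if (tp or fp) else None
    tpr = len(tp) / (len(tp) + len(fn)) if (tp or fn) else None
    return {"TP": len(tp), "FP": len(fp), "FN": len(fn), "TN": len(tn),
            "PPV": ppv, "TPR": tpr}
```

Predictions involving a gene outside the family are dropped when building PF, so they count neither as true nor as false positives.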

We can further estimate the uncertainties of these rates by treating them as binomially distributed random variables, for example, $\sigma^2(PPV) = PPV(1 − PPV)/(|TP| + |FP|)$. Finally, we combine all the families by building averages of the rates. As an example, for the positive predictive value this results in,
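One plausible form of this average is given below, assuming an unweighted mean over the set of families. The symbol $F$ for the set of families, the subscript $f$ and the independence assumption behind the variance of the mean are introduced here for illustration and are not taken from the source.

```latex
\overline{PPV} = \frac{1}{|F|} \sum_{f \in F} PPV_f ,
\qquad
\sigma^2\!\left(\overline{PPV}\right) = \frac{1}{|F|^2} \sum_{f \in F} \sigma^2(PPV_f)
```

Here $PPV_f$ and $\sigma^2(PPV_f)$ are the per-family rate and its binomial variance as defined above.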

Functional tests. We downloaded the Gene Ontology annotations62 for all the genes in the reference genomes from the November 2014 release of UniProt-GOA35 and excluded any annotation with a 'NOT' qualifier from this set. For the analysis shown here, we only use annotations with experimental evidence codes (EXP, IPI, IDA, IMP, IGI and IEP). Likewise, we collected the hierarchical EC number assignments of the ENZYME database36 maintained by Swiss-Prot. The computation of the functional similarities between gene pairs is done in the same way for both types of data, using the approach of Schlicker et al.37: the semantic similarity between annotations sim(i,j) is measured using Lin's metric63; between any two genes, the most similar pairs of annotations are identified and averaged, i.e.,

where pi is the set of function annotations associated with protein i.
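The gene-level averaging can be sketched as a best-match average: each annotation of one gene is paired with its most similar annotation in the other gene, and the resulting maxima are averaged over both directions. The exact combining rule used in the benchmark is an assumption here, as are the function name and the `term_sim` callable, which stands in for Lin's metric on a pair of annotations.

```python
def best_match_average(terms_i, terms_j, term_sim):
    """Average, over both genes, the best similarity found for each annotation.

    terms_i, terms_j: collections of annotation ids for the two genes
    term_sim: callable (term_a, term_b) -> similarity in [0, 1]
    Returns None if either gene has no annotation.
    """
    if not terms_i or not terms_j:
        return None
    best_i = [max(term_sim(a, b) for b in terms_j) for a in terms_i]
    best_j = [max(term_sim(a, b) for a in terms_i) for b in terms_j]
    return (sum(best_i) + sum(best_j)) / (len(best_i) + len(best_j))
```

Taking the best match per annotation, rather than averaging all pairwise similarities, avoids penalizing genes that carry several unrelated but individually well-conserved annotations.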

Code availability.

The source code is available under an open source license (Mozilla Public License Version 2.0) at https://github.com/qfo/benchmark-webservice.