Total Ortholog Median Matrix as an alternative unsupervised approach for phylogenomics based on evolutionary distance between protein coding genes

The increasing amount of available genomic data has enabled the development of phylogenomic analytical tools. Current methods compile information from single-gene phylogenies, based either on topologies or on multiple sequence alignments. Generally, phylogenomic analyses select gene families or genomic regions to construct phylogenomic trees. Here, we present an alternative approach for phylogenomics, named TOMM (Total Ortholog Median Matrix), which constructs a representative phylogram from the amino acid distances of all pairwise orthologous protein sequences among the selected species within a group of organisms. The procedure is divided into two main steps: (1) ortholog detection and (2) creation of a matrix with the median amino acid distance of all pairwise orthologous sequences. We tested this approach in three different groups of organisms: Kinetoplastida protozoa, hematophagous Diptera vectors and Primates. The approach was robust and effective in reconstructing the phylogenetic relationships of the three groups. Moreover, novel branch topologies were obtained, providing insights into the phylogenetic relationships among some taxa.
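
Step (2) can be illustrated with a minimal Python sketch. It assumes that step (1) has already produced, for every species pair, a list of amino acid distances between the detected orthologs. The function name, the input layout and the toy values are hypothetical and are not taken from the TOMM implementation.

```python
from statistics import median


def median_distance_matrix(pair_distances):
    """Build a symmetric matrix of median ortholog distances.

    pair_distances maps a (species_a, species_b) tuple to the list of
    amino acid distances of all ortholog pairs found for those species.
    """
    species = sorted({name for pair in pair_distances for name in pair})
    # Start from zeros so that the diagonal (self-distance) is 0.0.
    matrix = {a: {b: 0.0 for b in species} for a in species}
    for (a, b), distances in pair_distances.items():
        value = median(distances)
        matrix[a][b] = value
        matrix[b][a] = value  # the distance is symmetric
    return species, matrix


# Hypothetical toy input: made-up distances, not values from the study.
toy = {
    ("sp1", "sp2"): [0.10, 0.30, 0.20],
    ("sp1", "sp3"): [0.50, 0.70],
    ("sp2", "sp3"): [0.40, 0.60, 0.80, 1.00],
}
species, matrix = median_distance_matrix(toy)
for name in species:
    print(name, [round(matrix[name][other], 2) for other in species])
```

Each cell summarises all orthologs of a species pair with a single value, so the resulting matrix can be passed directly to a distance-based clustering or tree-building method.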


S7: Bar graph showing the total number of orthologs identified by the RSD and OrthoMCL algorithms for the 78 pairwise species combinations used in the analysis, based on 13 species with sequences retrieved from the TriTryp database, as indicated in Table 1 ("Proteins sequence source" column). Intersections (shared orthologs) and unique orthologs were calculated with gene ID lists as input using the Venn diagram tool (http://bioinformatics.psb.ugent.be/webtools/Venn/). File name: Supplemental_Figures_S1_S2_S3_S4_S5_S6_S7.pdf

Supplemental Table S1: Kinetoplastida pairwise matrices
Excel spreadsheet containing the resulting tables of pairwise ortholog data (pairwise matrices). Sheet "AA distance": amino acid distance obtained from the median value calculated by the RSD algorithm. Sheet "Number-50": total number of orthologs identified by the RSD algorithm with settings of 0.001 for the BLAST e-value of acceptance and 0.8 for the minimum ratio of the smaller sequence to the larger one.

Supplemental Table S2: Hematophagous Diptera pairwise matrices
Excel spreadsheet containing the resulting tables of pairwise ortholog data (pairwise matrices). Sheet "AA distance": amino acid distance obtained from the median value calculated by the RSD algorithm. Sheet "Number-50": total number of orthologs identified by the RSD algorithm with settings of 0.001 for the BLAST e-value of acceptance and 0.8 for the minimum ratio of the smaller sequence to the larger one.

Supplemental Table S3: Primates pairwise matrices
Excel spreadsheet containing the resulting tables of pairwise ortholog data (pairwise matrices). Sheet "AA distance": amino acid distance obtained from the median value calculated by the RSD algorithm. Sheet "Number-50": total number of orthologs identified by the RSD algorithm with settings of 0.1 for the BLAST e-value of acceptance and 0.8 for the minimum ratio of the smaller sequence to the larger one.

Supplemental Table S4: Comparison between orthology detection methods, RSD and OrthoMCL, using 13 species with protein sequences retrieved from the TriTryp database.
Intersections (shared orthologs) and unique orthologs were calculated with gene ID lists from each method as input using the Venn diagram tool (http://bioinformatics.psb.ugent.be/webtools/Venn/).
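
The shared and unique counts reported here correspond to plain set operations on the two gene ID lists. The Python sketch below shows an equivalent calculation. It is not the web tool itself, and the function name and gene IDs are hypothetical placeholders.

```python
def compare_ortholog_sets(rsd_ids, orthomcl_ids):
    """Split two gene ID lists into shared and method-specific sets."""
    rsd, mcl = set(rsd_ids), set(orthomcl_ids)
    return {
        "shared": rsd & mcl,         # orthologs found by both methods
        "rsd_only": rsd - mcl,       # orthologs found only by RSD
        "orthomcl_only": mcl - rsd,  # orthologs found only by OrthoMCL
    }


# Hypothetical gene IDs, used only to exercise the function.
result = compare_ortholog_sets(
    ["geneA", "geneB", "geneC"],
    ["geneB", "geneC", "geneD", "geneE"],
)
for group, ids in result.items():
    print(group, len(ids), sorted(ids))
```

Duplicated IDs in either list are counted once, because the lists are converted to sets before comparison.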

Supplemental File 1: R scripts
Scripts using the hclust function and the pvclust and ape R packages to build phylograms from amino acid distance matrices. File name: Supplemental_File1.pdf
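
For readers without access to the R scripts, the general idea of turning a distance matrix into a tree can be sketched in Python with average-linkage clustering (UPGMA). This is an illustrative substitute, not a translation of the supplied scripts: the linkage method is an assumption, no bootstrap support is computed, and the labels and distances are made up.

```python
def upgma_newick(labels, dist):
    """Average-linkage (UPGMA) clustering returned as a Newick string.

    dist[a][b] holds the distance between labels a and b.
    """
    # Active clusters: id -> (newick text, number of leaves, node height).
    clusters = {i: (label, 1, 0.0) for i, label in enumerate(labels)}
    d = {(i, j): dist[labels[i]][labels[j]]
         for i in range(len(labels)) for j in range(i + 1, len(labels))}
    next_id = len(labels)
    while len(clusters) > 1:
        i, j = min(d, key=d.get)  # closest pair of active clusters
        height = d[(i, j)] / 2
        tree_i, n_i, h_i = clusters.pop(i)
        tree_j, n_j, h_j = clusters.pop(j)
        merged = f"({tree_i}:{height - h_i:.3f},{tree_j}:{height - h_j:.3f})"
        # Size-weighted average distance from the new cluster to the others.
        new_d = {}
        for k in clusters:
            d_ik = d[(min(i, k), max(i, k))]
            d_jk = d[(min(j, k), max(j, k))]
            new_d[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        d = {pair: v for pair, v in d.items()
             if i not in pair and j not in pair}
        d.update(new_d)
        clusters[next_id] = (merged, n_i + n_j, height)
        next_id += 1
    return next(iter(clusters.values()))[0] + ";"


# Hypothetical three-taxon matrix, not data from the study.
toy_dist = {
    "A": {"A": 0.0, "B": 0.2, "C": 0.6},
    "B": {"A": 0.2, "B": 0.0, "C": 0.6},
    "C": {"A": 0.6, "B": 0.6, "C": 0.0},
}
tree = upgma_newick(["A", "B", "C"], toy_dist)
print(tree)
```

The Newick string can be opened in any standard tree viewer.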

Supplemental File 2: RSD resulting files (gene IDs) in compressed folders
The compressed folders named "RSD-Primates", "RSD-Flies" and "RSD-kinetoplastids" contain text files with the gene IDs resulting from RSD searches. Gene IDs for each species pair are tabulated in two columns, where the first column lists IDs from the first species and the second column lists IDs from the second species. The names of the paired species are abbreviated in the txt file name, each using the first three letters of the genus plus the first three letters of the species, with the two abbreviations separated by a hyphen; e.g., the AOTNAN-CEBCAP-0.txt file presents gene IDs for Aotus nancymaae in the first column and gene IDs for Cebus capucinus in the second column. The abbreviations for all organisms are supplied in the Excel files Supplemental Tables S1, S2 and S3, under the sheet "Abbreviation".
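
The naming scheme and file layout described above can be handled with a short Python sketch. It assumes that the two columns are separated by whitespace and that every abbreviation has exactly six uppercase letters. The suffix after the second hyphen is left uninterpreted because its meaning is not stated here. All function names and the example gene IDs are hypothetical.

```python
import re

# Two six-letter codes, then an uninterpreted suffix, then ".txt".
FILE_NAME = re.compile(r"^([A-Z]{6})-([A-Z]{6})-.*\.txt$")


def abbreviate(binomial):
    """Return the six-letter code: three letters of genus plus species."""
    genus, species = binomial.split()[:2]
    return (genus[:3] + species[:3]).upper()


def parse_pair_name(file_name):
    """Extract the two species codes from an RSD result file name."""
    match = FILE_NAME.match(file_name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name}")
    return match.group(1), match.group(2)


def read_pairs(lines):
    """Read (first species ID, second species ID) pairs from text lines."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or incomplete lines
            pairs.append((fields[0], fields[1]))
    return pairs


first, second = parse_pair_name("AOTNAN-CEBCAP-0.txt")
print(first == abbreviate("Aotus nancymaae"))
print(second == abbreviate("Cebus capucinus"))
# Hypothetical file content with placeholder gene IDs.
print(read_pairs(["id_a1\tid_b1", "", "id_a2\tid_b2"]))
```

In practice, `read_pairs` would receive an open file object for one of the txt files in place of the list of strings.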