State-of-the-art phylogenomic pipelines require many steps, which can be both time consuming and error prone (Fig. 1a). With Read2Tree, we directly process raw sequencing reads and reconstruct sequence alignments for conventional tree inference methods (Fig. 1b and Supplementary Fig. 1). We start by aligning raw reads to nucleotide sequences derived from the genome-wide reference orthologous groups (OGs; we used Mafft20 as default) (Fig. 1b, 1). Within each OG, we reconstruct protein sequences from reads aligned to reference sequences (Fig. 1b, 2). Importantly, these sequences in reference OGs are not restricted to single-copy marker genes, such as the mitochondrial cytochrome c oxidase I gene or BUSCO genes21; they also include multiple paralogous genes as well as nonuniversal genes. This is achieved by leveraging OGs computed from 2,500 diverse genomes analyzed in the Orthologous Matrix (OMA) resource developed in our laboratory22,23. Next, we retain the best reference-guided reconstructed sequence, using the number of reconstructed nucleotide bases as criterion (Fig. 1b, 3 and Supplementary Fig. 2). Subsequently, the selected consensus is added to the OG’s multiple sequence alignment (MSA) (Fig. 1b, 4). Finally, putative OG selection and tree inference can proceed using conventional methods (we use IQTREE24 by default; Fig. 1b, 5). For greater detail on the individual steps, see Methods.
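The selection step (Fig. 1b, 3) can be illustrated with a minimal sketch. It assumes that each OG yields one reconstructed nucleotide sequence per reference, and that gaps and ambiguous characters do not count as reconstructed bases. The function names, the set of ignored characters and the toy sequences are hypothetical and are not taken from the Read2Tree code base.

```python
from typing import Dict

# Assumption: these characters mark positions without a usable base call.
GAP_OR_UNKNOWN = set("-NnXx?")


def count_reconstructed_bases(seq: str) -> int:
    """Count positions that carry an actual nucleotide call."""
    return sum(1 for base in seq if base not in GAP_OR_UNKNOWN)


def select_best_reconstruction(candidates: Dict[str, str]) -> str:
    """Return the reference ID whose guided reconstruction has the most called bases.

    Ties are resolved in favor of the first candidate in insertion order.
    """
    if not candidates:
        raise ValueError("no candidate reconstructions for this OG")
    return max(candidates, key=lambda ref_id: count_reconstructed_bases(candidates[ref_id]))


# Toy example: three reference-guided reconstructions for one OG.
candidates = {
    "ref_A": "ATGNNNNNNGCA---",
    "ref_B": "ATGCCANNNGCATTA",
    "ref_C": "NNNNNNNNNGCA---",
}
best = select_best_reconstruction(candidates)
print(best)
```

In this toy case the reconstruction guided by ref_B is kept, because it has the largest number of called bases.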

This way, Read2Tree is able to report key information across putative OGs in a fraction of the time required by conventional comparative genomic pipelines, because it bypasses genome assembly, annotation, homology and orthology inference. Furthermore, because each sample is processed independently, Read2Tree can process the input genomes in parallel, and scales linearly with respect to the number of input genomes.

Impact of coverage and distance to reference on accuracy

We tested Read2Tree on a wide array of conditions, with two kinds of sequence (DNA versus RNA), three target species (Arabidopsis thaliana, Saccharomyces cerevisiae and Mus musculus), three types of sequencing technology (Illumina, PacBio and Oxford Nanopore Technologies (ONT)), six levels of sequencing coverage (ranging from 0.2× to 20×) and six different sets of reference species (increasingly distant from the targets spanning over 1 billion years of evolution) (Fig. 2a). For sequence reconstruction accuracy (Fig. 2b), we measured both the correctness of the reconstructed sequences (‘precision’) and the completeness of the reconstructed sequences (‘recall’). For tree reconstruction accuracy (Fig. 2c and Supplementary Fig. 6), we compare the reconstructed tree with the known species phylogeny and report both the precision and the recall of the reconstructed trees, in terms of the branches with at least 90% support.
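The tree-level precision and recall described above can be sketched as follows. This is an illustration under stated assumptions, not the benchmark code: trees are represented as sets of bipartitions (splits), inferred splits with support below the threshold are treated as collapsed, and an empty set of retained splits is given a precision of 1.0 by convention. All names and the toy trees are hypothetical.

```python
from typing import Dict, FrozenSet, Set, Tuple

Split = FrozenSet[str]


def canonical_split(side: Set[str], all_taxa: Set[str]) -> Split:
    """Represent a bipartition by the side that excludes the smallest taxon name."""
    anchor = min(all_taxa)
    side = set(side)
    return frozenset(all_taxa - side) if anchor in side else frozenset(side)


def split_precision_recall(
    inferred: Dict[Split, float],
    reference: Set[Split],
    min_support: float = 90.0,
) -> Tuple[float, float]:
    """Precision and recall of inferred splits that reach the support threshold."""
    kept = {split for split, support in inferred.items() if support >= min_support}
    correct = kept & reference
    precision = len(correct) / len(kept) if kept else 1.0  # convention for empty sets
    recall = len(correct) / len(reference) if reference else 1.0
    return precision, recall


taxa = {"A", "B", "C", "D", "E", "F"}
reference = {
    canonical_split({"A", "B"}, taxa),
    canonical_split({"A", "B", "C"}, taxa),
    canonical_split({"E", "F"}, taxa),
}
inferred = {
    canonical_split({"A", "B"}, taxa): 100.0,      # correct, well supported
    canonical_split({"A", "B", "C"}, taxa): 60.0,  # correct, but collapsed (<90)
    canonical_split({"D", "E"}, taxa): 95.0,       # well supported, but wrong
}
precision, recall = split_precision_recall(inferred, reference)
print(precision, recall)
```

Collapsing weakly supported branches can only remove splits, so it tends to trade recall for precision, which matches the way the two measures are reported separately here.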

Fig. 2: Benchmark of Read2Tree using three different datasets, six different coverage levels and three sequencing technologies. a, Phylogenetic trees of reference datasets. In dark purple (bottom) are the species used for mapping. The colors represent species removal to assess the dependency on closest neighbors in the reference datasets. Timepoints were obtained from timetree.org59. b, Read2Tree sequences are more similar (percentage identity) and more complete with increasing coverage and decreasing distance to a more closely related species. The best sequence identity is obtained for Illumina data. The colors convey the increasing evolutionary distance to the closest reference species (ref.). c, The precision and recall of trees reconstructed using Read2Tree after collapsing branches below 90% support.

In general, Read2Tree was able to maintain a high precision in terms of sequence reconstruction (Fig. 2b) and tree reconstruction (Fig. 2c) across all datasets, with varying levels of recall depending on the dataset difficulty. First, we assessed the effect of coverage ranging from 0.2× to 20× of the individual datasets. We observed that lowering the sequencing coverage had little impact on precision, and mainly lowered recall: in most configurations, Read2Tree could maintain 90–95% precision at the sequence level even with coverages as low as 0.2× (Fig. 2b). The best low-coverage results were obtained on transcriptomic short-read data in mice, where precision reached 98.5% at 0.2× coverage. To assess the versatility of Read2Tree, we benchmarked it across DNA and RNA datasets. The choice of assay did not have a large impact in general, but transcriptomic RNA results (in the mouse dataset) are marginally less impacted by differences in average coverage, perhaps due to the large coverage variance from uneven gene expression levels in these data (Fig. 2b,c). Next, we assessed whether Read2Tree is capable of utilizing the range of current sequencing technologies. For this, we applied it across traditional short reads, Oxford Nanopore and PacBio long reads. To enable this, Read2Tree has slightly different mapping strategies built in for long versus short reads (Methods). As Fig. 2b,c shows, Read2Tree maintained a high accuracy across each sequencing technology, but we observed the highest accuracy with traditional short reads. We have not assessed more recent sequencing technologies such as PacBio HiFi or Illumina infinity that might change this result.

Finally, we assessed the robustness of Read2Tree with respect to the evolutionary distance between the sample at hand and the closest relative in the reference set. This is often critical, as the closest assembled relative might be unknown or unavailable25. Thus, we tested Read2Tree across a wide range of evolutionary distances ranging from 7 million years ago to over 1.1 billion years ago. While these are certainly extreme scenarios, overall Read2Tree was able to cope with them successfully. Figure 2b,c shows that the choice of reference set mainly impacted recall, with closer reference genomes leading to more reconstructed positions. Remarkably, Read2Tree was able to maintain high accuracy even in the datasets with very distant references, for example, when processing mouse RNA sequencing (RNA-seq) data without any vertebrate genome in the reference set.

We also tested Read2Tree on simulated data, for coverages between 0.1× and 10× and distance to the closest reference varying between 2 and 150 point accepted mutation (PAM) units—where 100 PAM corresponds to one substitution per site on average. The reconstructed trees were perfect in all but the most extreme scenarios (PAM >120 or coverage <0.5×; Supplementary Fig. 7).

Given the extensive benchmarks across species, coverage, sequencing technology, assay (DNA and RNA) and simulated data, we observe that Read2Tree is indeed a highly versatile and accurate tool to reconstruct phylogeny directly from raw reads.

Faster and often more accurate than assembly-based trees

Next, we compared the performance of Read2Tree with that of conventional assembly pipelines. For this, we generated de novo assemblies and protein predictions across the same datasets as in the previous section, using Canu26 for PacBio and ONT data and Megahit27 together with SoapDeNovo28 for the Illumina reads (Methods). The conventional assemblies were processed using OMA standalone, including the same exported reference genomes, as OMA standalone was previously shown to identify the most accurate phylogenetic marker genes29. For the inclusion of orthologous markers in the concatenated alignment used for tree inference, we required a commonly set minimum threshold of 80% taxon presence. As above, we varied the closest remaining species in the dataset by removing species along the reference tree (Fig. 2a). With different coverages and reference sets, we obtained 42 data points per species. For each of these data points, we performed the orthology inference separately and recorded its computation time. The proportion of sequences placed into the respective OGs showed high levels of variation (Supplementary Fig. 8a). For each assembly and variation of proteomes, we computed the topological distance between the tree resulting from the assembly approach or from Read2Tree and the trees obtained using high-quality genome assemblies for A. thaliana and S. cerevisiae.
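The 80% taxon-presence criterion can be made concrete with a short sketch. It assumes that each OG is summarized by the set of taxa contributing a sequence to it, and that presence is computed relative to all taxa in the analysis. The function name, the data layout and the toy values are hypothetical.

```python
from typing import Dict, List, Set


def filter_ogs_by_taxon_presence(
    og_members: Dict[str, Set[str]],
    all_taxa: Set[str],
    min_presence: float = 0.8,
) -> List[str]:
    """Keep OGs in which at least `min_presence` of all taxa have a sequence."""
    if not all_taxa:
        raise ValueError("all_taxa must not be empty")
    kept = []
    for og_id, taxa in og_members.items():
        presence = len(taxa & all_taxa) / len(all_taxa)
        if presence >= min_presence:
            kept.append(og_id)
    return sorted(kept)


all_taxa = {"sp1", "sp2", "sp3", "sp4", "sp5"}
og_members = {
    "OG1": {"sp1", "sp2", "sp3", "sp4", "sp5"},  # 100% presence
    "OG2": {"sp1", "sp2", "sp3", "sp4"},         # 80% presence, kept
    "OG3": {"sp1", "sp2", "sp3"},                # 60% presence, dropped
}
selected = filter_ogs_by_taxon_presence(og_members, all_taxa)
print(selected)
```

Only the selected OGs would then contribute columns to the concatenated alignment used for tree inference.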

Figure 3 shows the overall results, highlighting the performance of Read2Tree. Perhaps unsurprisingly, we observed that coverage levels had a profound impact on the performance of assembly-based approaches, rendering them incapable of dealing with coverages below 5–10×. Thus, for these datasets, we report only Read2Tree results.

Fig. 3: A comparison of Read2Tree with a regular pipeline with assembly, orthology prediction and MSA computation. a, A comparison of trees using the difference between the reference tree and either the tree of Read2Tree or the tree coming from the assembly approach. For dark blue, we had only Read2Tree trees as assemblies for these low coverages are not possible to obtain. Below zero (in dark or light blue), Read2Tree is more accurate, while above zero (in red), the assembly approach is more accurate and gray indicates no difference between the methodologies. b, A comparison of wall time needed from reads to availability of concatenated MSA showing the dependencies of available closest remaining reference and coverage.

Where both approaches can be compared, the only cases where the conventional de novo assembly approach outperformed Read2Tree were those with high coverage and a very distant (>500 Mya) closest reference species (Fig. 3a, upper right region of each graph). In all other scenarios, Read2Tree matched or outperformed the conventional approach in accuracy. Specifically, on the yeast dataset at a higher coverage level, both assembly and Read2Tree performed well overall: we never observed more than two different branches between the obtained and reference trees. With at least 10× coverage and distant reference species, the conventional assembly approach outperformed Read2Tree (Fig. 3a and Supplementary Fig. 4).

By contrast, on the more complex A. thaliana and M. musculus datasets, Read2Tree outperformed the assembly approach—with fewer differences to the reference (up to two different branches for Read2Tree, versus up to four for the conventional approach). On the ONT data—characterized by longer reads but higher error rate—Read2Tree outperformed the conventional approach on both datasets.

Finally, in terms of compute time, Read2Tree was generally much faster than the conventional approach, up to 100 times faster on the larger genomes (Fig. 3b and Supplementary Fig. 8b).

Altogether, these results indicate that Read2Tree is faster in all conditions, and produces reliable trees in low-coverage datasets and other datasets where the conventional approach fails entirely (long-read transcriptomics). At higher coverage levels, the trees inferred by Read2Tree rival in quality those obtained from assembled reference species with a full pipeline, particularly when applied to more complex genomes, and unless the closest reference species is very distant (>500 million years).

We also compared Read2Tree with Mash, a fast k-mer-based approach30 commonly used on bacterial genomes. While the alignment-free approach of Mash was much faster than even Read2Tree, the resulting trees were much less accurate than either Read2Tree or the assembly-based approach (Supplementary Fig. 5). This illustrates why alignment-free approaches such as Mash, while very useful for fast approximations, are typically not suitable to reconstruct high-quality phylogenetic trees.
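To show what an alignment-free distance looks like, here is a minimal sketch of a Mash-style distance. It uses the published Mash formula D = -(1/k) ln(2j/(1+j)), where j is the Jaccard index of the two k-mer sets. As a simplifying assumption, it computes j on exact k-mer sets rather than on MinHash sketches, and it ignores reverse complements, so it illustrates the idea and does not reimplement the tool. Function names and sequences are hypothetical.

```python
import math
from typing import Set


def kmer_set(seq: str, k: int) -> Set[str]:
    """All overlapping k-mers of a sequence (no reverse-complement handling)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index of two k-mer sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mash_distance(j: float, k: int) -> float:
    """Mash distance from a Jaccard index j and k-mer size k."""
    if j <= 0.0:
        return 1.0  # no shared k-mers: maximal distance
    return max(0.0, -math.log(2.0 * j / (1.0 + j)) / k)


k = 4
seq_a = "ACGTACGTTGCA"
seq_b = "ACGTACGATGCA"  # one substitution relative to seq_a
d_same = mash_distance(jaccard(kmer_set(seq_a, k), kmer_set(seq_a, k)), k)
d_diff = mash_distance(jaccard(kmer_set(seq_a, k), kmer_set(seq_b, k)), k)
print(d_same, d_diff)
```

A single substitution already changes up to k overlapping k-mers, so the distance summarizes whole-sequence similarity without resolving which positions are homologous. That is the information an alignment provides and a k-mer comparison does not.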

Accurate reconstruction of a 435 species yeast tree of life

To assess a potential large-scale application for Read2Tree, we applied it to reconstruct a large yeast phylogeny from raw reads. Thanks to Read2Tree’s ability to process low-coverage datasets, we could extend our analysis to all Illumina single- and paired-end, ONT, PacBio and 454 sequencing read datasets available for budding yeast in the NCBI Sequence Read Archive (SRA) database (November 2018, 404 species) and 31 reference species obtained from the OMA database (release 2018, 3,063 OGs). Using an automated approach for retrieval and mapping, we were able to obtain direct sequences for 404 species (Supplementary File 1). Read2Tree could process these datasets in around a month of computation (adding each species sequentially and performing the mapping on 31 central processing units (CPUs), one CPU per reference, in parallel), owing to its ‘embarrassingly parallel’ architecture, with every sample being processed independently up to phylogenetic inference (10× Illumina: ~20 min using four threads).

A large proportion of these datasets were recently used to construct a phylogeny across 332 budding yeast species31. This included a dataset of 220 new assemblies and their annotations31. This large effort provided a delineation of the yeast tree of life into 12 main clades and highlighted the influence of horizontal gene transfer in the evolution of yeast species31. Due to the complexity of state-of-the-art pipelines, it also consumed millions of CPU hours and years of work. Furthermore, the conventional assembly-based approach could not include low-coverage samples in its analysis. We were able to extend this work with Read2Tree, using a fraction of the resources.

Using Read2Tree, we were able to compute and produce this large phylogeny across 435 samples (including 31 species as reference). Some of the samples failed because their coverage was too low (around 3.1×, assuming a 12-Mbp average genome size). Nevertheless, using Read2Tree we were able to include multiple samples even at coverage levels below 5×, for which over 2,500 sequences were placed in OGs (Supplementary Fig. 14). Read2Tree was able to reconstruct the phylogeny and also reported the phylogeny-relevant genes assembled per sample, which overall showed similar GC levels to the reference data (Supplementary Fig. 15). This was also exemplified by the fact that we did not observe a correlation between the number of sequences placed into OGs per species and their individual coverage (Supplementary Fig. 14, correlation 0.2).
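The coverage values quoted above follow from a simple ratio of sequenced bases to genome size. The sketch below assumes the 12-Mbp average genome size mentioned in the text and ignores read quality and mappability. The function name and the example read set are hypothetical.

```python
from typing import Iterable


def estimated_coverage(read_lengths: Iterable[int], genome_size_bp: int = 12_000_000) -> float:
    """Approximate fold coverage as total sequenced bases divided by genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return sum(read_lengths) / genome_size_bp


# Toy example: 400,000 reads of 150 bp against an assumed 12-Mbp genome.
coverage = estimated_coverage([150] * 400_000)
print(f"{coverage:.1f}x")
```

With these toy numbers, 60 Mbp of sequence over a 12-Mbp genome gives 5× coverage.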

Considering the subset of species in common, our results were highly congruent with those of Shen et al.31 (Fig. 4 and Supplementary Figs. 12 and 13): both trees exhibited similar distances to the NCBI taxonomy tree, with 297 splits in ours versus 291 splits in Shen et al. In direct comparison, the Shen et al. and Read2Tree trees were more similar to one another, with only 128 different splits (20% difference of the branches), than either was to the NCBI taxonomy. After collapsing branches with a support below 90, the difference in the number of splits between the conservative NCBI tree and ours was 29 splits, and 25 splits between ours and Shen et al. Twenty-four of these splits were in common between Read2Tree and Shen et al. To get more insight into the nature of these differences, we assessed the agreement with the NCBI taxonomy for two different levels of resolution: family and genus. At the coarser family level, Read2Tree was more consistent with the NCBI taxonomy for six families, while Shen et al. was more consistent in one family (Supplementary Fig. 10). At the finer genus level, Read2Tree was more consistent with the NCBI taxonomy for four genera, versus ten for Shen et al. (Supplementary Fig. 11).
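Counting the splits that differ between two trees on the same leaf set can be sketched as a symmetric set difference (a Robinson-Foulds-style count). The normalization used below, dividing by the total number of splits in both trees, is an assumption made for illustration; the text does not state how the reported percentage was computed. Names and toy trees are hypothetical, and both trees are assumed to encode each split by the same side of the bipartition.

```python
from typing import FrozenSet, Set

Split = FrozenSet[str]


def differing_splits(splits_a: Set[Split], splits_b: Set[Split]) -> int:
    """Number of splits present in exactly one of the two trees."""
    return len(splits_a ^ splits_b)


def normalized_split_difference(splits_a: Set[Split], splits_b: Set[Split]) -> float:
    """Differing splits as a fraction of all splits in both trees (assumed normalization)."""
    total = len(splits_a) + len(splits_b)
    return differing_splits(splits_a, splits_b) / total if total else 0.0


# Toy trees on taxa A-F, each split written as the clade on one fixed side.
tree_a = {frozenset({"A", "B"}), frozenset({"A", "B", "C"}), frozenset({"E", "F"})}
tree_b = {frozenset({"A", "B"}), frozenset({"A", "B", "D"}), frozenset({"E", "F"})}

n_diff = differing_splits(tree_a, tree_b)
fraction = normalized_split_difference(tree_a, tree_b)
print(n_diff, round(fraction, 3))
```

Here the two toy trees disagree only on the placement of C and D, which yields two differing splits out of six.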

Fig. 4: High consistency between Read2Tree and a state-of-the-art phylogenetic pipeline31. The top row shows the full trees and the alignment matrix used to compute the tree as outer circles. The red dots indicate nodes with a bootstrap below 100. The species Naumovozyma dairenensis, previously misclassified32,33, is highlighted in red. The bottom row shows trees trimmed to an overlapping leaf set.

Nevertheless, certain differences remain between Read2Tree and the NCBI taxonomy. While resolving most such instances would constitute entire follow-up studies in their own right, we were able to explain one apparent disagreement: Naumovozyma dairenensis is placed in the CUG-Ser1 classification, while according to the NCBI taxonomy, it should be an ascomycetous yeast in the Saccharomyces sensu lato group within the family Saccharomycetaceae. However, this is a case of erroneous metadata reported in the literature32,33.

Given this phylogeny, we can now easily update and extend it using Read2Tree in a matter of minutes as additional sequences are generated. This enables a deep dive into the comparative genomics of yeast and further exploration of the differences between strains and their impact on life, food production and so on. This is also easily reproducible for other organisms, as Read2Tree is capable of spanning large evolutionary distances with respect to the reference tree.

Read2Tree for zoonotic surveillance and human epidemiology

To further illustrate the versatility of Read2Tree, we used it to reconstruct a phylogeny encompassing various coronaviruses from the OMA coronavirus database, as well as 215 raw coronavirus sequencing samples deposited to the SRA. Besides the putative SARS-CoV-2 sequence, we also included two samples from bat (SRR11085797 (ref. 34) and SRR11085736 (ref. 35)) and one from mink36 (SRX9605666).

The reconstructed phylogeny was in complete agreement with the lineage classification obtained from the UniProt reference proteomes. In particular, the tree recovered not only the main coronavirus genera (Alpha-, Beta-, Gamma- and Deltacoronavirus) but also all subgenera with complete consistency (Fig. 5).

Fig. 5: Read2Tree correctly classifies the recent SARS-CoV-2 sequences and recapitulates the evolution of the individual variants. All genera (gray boxes in the overall tree) and subgenera (colored labels) are correctly delineated. The inset focuses on the part of the tree with 215 SARS-CoV-2 samples, and variants of concern (colored labels) cluster consistently on the tree, indicating that Read2Tree can be used to categorize the samples.

The first bat sample corresponds to the reads of RaTG13, which is the closest relative of SARS-CoV-2 identified yet34. Indeed, in our tree it falls right outside the SARS-CoV-2 clade. The other bat sample could also be confirmed as an Alphacoronavirus, subgenus Rhinacovirus35. Likewise, we could confirm the classification of the mink sample, identified as an Alphacoronavirus, subgenus Minacovirus by the authors36.

The position of the SARS-CoV-2 sequences within the coronavirus tree of life is also consistent with our prior knowledge on them. The reference genome, the Wuhan-Hu-1 sequence reported in early January 2020 (ref. 37), is at the base of the subtree. The only three sequences that branch out before it are SRR11092056-8—which were obtained from patients with severe pneumonia at the beginning of the pandemic34. Finally, we note that the variants of concern included in the analyses appear clearly as distinct clades on the tree.

To empirically test the scalability of our method, we also used Read2Tree to process 10,283 SARS-CoV-2 samples. The reconstructed tree clustered the sequences according to the Centers for Disease Control variants of concern classification, providing further evidence that the tool can be used to quickly and reliably classify SARS-CoV-2 variants (Supplementary Fig. 17). The same observation held for additional controls: running Read2Tree using coding-gene markers only (Supplementary Fig. 16), and using FastTree38 as the tree inference method (Supplementary Fig. 18).

Overall, this application of Read2Tree to diverse coronavirus sequences illustrates the ability of the tool to deal both with the considerable phylogenetic breadth of this family of viruses39 and the depth required to classify individual SARS-CoV-2 variants of concern. This makes Read2Tree suitable for both zoonotic surveillance and human epidemiology40.