Inferring phylogenies from pandemic-scale genome datasets

Reconstructing phylogenetic trees from large collections of genome sequences is a computationally challenging task. We developed MAPLE, a method for performing phylogenetic inference on large numbers of closely related genomes, which might be useful when studying the evolution and spread of SARS-CoV-2 and of infectious pathogens in future pandemics.

The problem
Genome sequence data provide important insights into pathogen transmission and evolution, and phylogenetic methods are fundamental to the analysis of these data. For example, these methods enable the identification and tracking of pathogen variants 1, the tracking of infectious disease spread across and within countries 2, and the identification of mutations that are key to the pathogen's spread 3. As sequencing technologies advance and are more widely adopted, genome data from infectious pathogens are becoming more important and ubiquitous in the analyses that guide the public health response to disease outbreaks. The COVID-19 pandemic provides a good example, as several million SARS-CoV-2 genomes from around the world are now available for analysis. However, algorithmic limitations of existing state-of-the-art phylogenetic inference methods mean that, at most, only a few thousand genome sequences can be analysed at a time, severely limiting the applicability of current methods to very large genomic datasets 4.

The solution
Our aim was to develop algorithms tailored for the inference of phylogenetic trees of many closely related genomes. More specifically, we wanted to improve the efficiency of probabilistic phylogenetic methods in this scenario, which is relevant to the COVID-19 pandemic and will likely also be relevant to large-scale infectious disease outbreaks in the future. We developed mathematical approximations that are both computationally convenient and accurate, assuming that the analysed genomes are closely related. This assumption also allowed us to concisely represent the genomes of the samples and their unsampled ancestors, greatly reducing the computational demand of probabilistic phylogenetics.
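To make the idea of a concise representation concrete, the following is a minimal sketch of one way closely related genomes could be stored: each genome is recorded only as its differences from a shared reference sequence. This is an illustration under our own assumptions, not MAPLE's actual data structure; the function names `encode_diffs` and `decode_diffs` and the toy sequences are hypothetical.

```python
from typing import Dict


def encode_diffs(reference: str, genome: str) -> Dict[int, str]:
    """Return {0-based position: base} for sites where genome differs from reference.

    Assumes both sequences are already aligned to the same length.
    """
    if len(reference) != len(genome):
        raise ValueError("sequences must be aligned to the same length")
    return {i: b for i, (a, b) in enumerate(zip(reference, genome)) if a != b}


def decode_diffs(reference: str, diffs: Dict[int, str]) -> str:
    """Rebuild the full genome from the reference and its recorded differences."""
    bases = list(reference)
    for pos, base in diffs.items():
        bases[pos] = base
    return "".join(bases)


# Toy data (hypothetical): a 20-base reference and two closely related samples.
reference = "ACGTACGTACGTACGTACGT"
sample_a = "ACGTACGTACTTACGTACGT"  # differs from the reference at position 10
sample_b = "ACGTACGAACGTACGTACGC"  # differs at positions 7 and 19

diffs_a = encode_diffs(reference, sample_a)
diffs_b = encode_diffs(reference, sample_b)

# Storage cost scales with the number of differences, not with genome length.
print(len(diffs_a), len(diffs_b))
```

When genomes are closely related, the number of stored entries per genome stays small relative to the genome length. As divergence grows, the difference lists lengthen and the saving shrinks, which is consistent with the method being aimed at closely related genomes.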
We have now developed maximum likelihood phylogenetic inference software based on these principles and algorithmic ideas, called maximum parsimonious likelihood estimation (MAPLE). Using real and simulated SARS-CoV-2 genome data, we show that our software can infer phylogenetic trees more rapidly and from much larger collections of genomes than pre-existing maximum likelihood methods (Fig. 1). Owing to its reduced computational demand, MAPLE can also perform more extensive phylogenetic tree searches, resulting in more accurate inferred phylogenetic trees. MAPLE therefore enables the application of accurate probabilistic phylogenetic methods to genomic epidemiology datasets that are at least one to two orders of magnitude larger than was previously possible.

Future directions
Probabilistic phylogenetic frameworks allow seamless integration of different forms of data (such as geographic and temporal information) within phylogenetic analyses, and the use of sequence evolution models to realistically describe and reconstruct the complex interplay of different features of genome evolution, such as selection and mutational forces. We are currently working on extending the mathematical models in MAPLE to include biological complexities such as variation in mutation rates and selective pressure along the genome. We are also working on modelling sequence errors, such as assembly errors or contamination, which, if not accounted for, can adversely affect phylogenetic and downstream analyses. Our methods will now allow us to scale up popular phylogenetic and phylogeny-based genome analysis tools to larger collections of genomes and therefore to perform more informative analyses. However, it is important to consider that our approach assumes that the analysed genomes are closely related, and that higher genomic divergence negatively affects both the computational demand and accuracy of MAPLE. Therefore, while our methods are useful in the context of genomic epidemiology where dense sampling over a short timescale is possible, they are less useful when comparing, for example, the genomes of different species.
The methods we have developed not only enable large-scale maximum likelihood phylogenetic inference of many closely related genomes but can also be used in the context of Bayesian phylogenetic inference 5, potentially leading to similar reductions in computational demand. In this respect, we plan to use our methods within existing software such as BEAST 5, which is commonly used for genome data analyses that are based on probabilistic phylogenetics, such as phylogeography 2 and phylodynamics 3.