Phylogenetic Trees Made Easy: A How-to Manual for Molecular Biologists

Barry G. Hall
Sinauer: 2001. 192 pp. $24.95, £18.99 (pbk)

A little more than 10 years ago, when I was entering the field of phylogenetics myself, the construction of evolutionary trees was often a daunting task. In those still-early days of personal computers, few programs for making trees were available, and it took two or three days of fiddling with rulers, Chinese ink and typewriters to produce a publishable tree of fewer than 100 sequences. A lot has changed since then: about 200 different software tools are now available to construct and draw trees. Nevertheless, we had to wait a long time for a book like this. Whereas many other books and book chapters have focused on the theoretical aspects of particular methodologies or techniques, Barry Hall's is the first, to my knowledge, that actually describes, step by step, how to build a tree.

Undoubtedly, there was a need for such a 'tutorial' book for non-expert users. Tree construction is a complicated business, and for the inexperienced the inference of phylogenies can be very frustrating, and the choice of programs and options bewildering. Tree construction was, for a long time, mainly the business of biologists who studied the evolutionary relationships between organisms. In the era of comparative genomics, evolutionary analysis is becoming important in many new fields of research. Phylogenies are used to deduce the relationships (originating from both speciation and duplication) between genes in large gene families across different species. Phylogenomics is the discipline that tries to improve functional predictions for uncharacterized genes by overlaying the known functions of genes onto an evolutionary tree containing all homologues. The functions of uncharacterized genes are then predicted from their phylogenetic position relative to characterized genes. More and more, phylogenetic principles are also used to interpret, and date, gene and genome duplications.

Basically, there are three main methodologies for inferring phylogenetic trees: maximum parsimony, pairwise distance methods and maximum likelihood. Maximum parsimony reconstructs the ancestral character states (nucleotides or amino acids) at the branching points of the tree, and 'chooses' the topology requiring the fewest changes to explain the sequences at the tips of the tree. Distance methods compute, for all pairs of sequences, their genetic distance: the fraction of sites that differ between two sequences, corrected for multiple substitutions. A tree is then inferred by considering the relationships among these distance values. Maximum-likelihood methods are statistical methods that try to find the tree topology that maximizes the probability of observing the data, which are again the sequences at the tips of the tree. As with distance methods, likelihood methods are based on an explicit model of evolution; here, however, the model is used to compute the probability, or likelihood, of having an ancestral character state at a branching point in a given phylogeny.
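To make the first two approaches concrete, here is a minimal sketch in Python; it is not taken from Hall's book or from any of the packages he uses. It assumes the Jukes-Cantor model as the correction for multiple substitutions (only one of many possible models) and Fitch's counting rule for parsimony. The four-sequence alignment, the taxon names and the nested-tuple representation of trees are invented for illustration.

```python
import math

def p_distance(a, b):
    """Fraction of aligned sites that differ between two sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jukes_cantor(p):
    """Correct an observed distance p for multiple substitutions.

    Assumption: the Jukes-Cantor model (equal base frequencies and rates).
    """
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def fitch(tree, column):
    """Return (possible ancestral states, changes) for one alignment column.

    `tree` is a leaf name or a pair of subtrees; `column` maps leaf -> state.
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    (left_states, left_changes) = fitch(tree[0], column)
    (right_states, right_changes) = fitch(tree[1], column)
    shared = left_states & right_states
    if shared:
        return shared, left_changes + right_changes
    # No state in common: one extra change is needed on this branch pair.
    return left_states | right_states, left_changes + right_changes + 1

def parsimony_length(tree, alignment):
    """Total number of changes a topology requires over all columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_sites)
    )

# Hypothetical toy alignment, invented for this example.
alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAT",
    "C": "ATGTACCTAT",
    "D": "ATGTACCTAC",
}

# The three possible groupings of four taxa.
topologies = {
    "AB|CD": (("A", "B"), ("C", "D")),
    "AC|BD": (("A", "C"), ("B", "D")),
    "AD|BC": (("A", "D"), ("B", "C")),
}

lengths = {name: parsimony_length(t, alignment) for name, t in topologies.items()}
best = min(lengths, key=lengths.get)

if __name__ == "__main__":
    print("corrected distance A-B:", jukes_cantor(p_distance(alignment["A"], alignment["B"])))
    print("parsimony lengths:", lengths, "-> preferred:", best)
```

For real data sets the number of possible topologies grows far too quickly to enumerate them as is done here, which is why dedicated programs rely on heuristic searches.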

Recently, as a fourth approach, Bayesian inference has been applied to phylogenetic tree construction; this differs from maximum likelihood in that it depends on the 'posterior probability'. Posterior probabilities of trees are based on the joint probabilities of the tree, branch lengths and the model of substitution, and are approximated by sampling from the entire posterior-probability distribution (the so-called 'tree space', or collection of all possible tree topologies). Bayesian phylogenetic inference is indeed rapidly gaining popularity, partly because of the possibility of including prior information, such as the single origin of a group of sequences, and is obviously Hall's preferred method.
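The idea of approximating posterior probabilities by sampling can be sketched with a deliberately tiny example. The three log-likelihood values and the uniform prior below are hypothetical numbers, not results from any data set, and branch lengths and substitution-model parameters are ignored. The sampler is a plain Metropolis scheme with a uniform, symmetric proposal, which is an assumption of this sketch rather than a description of how any particular program works.

```python
import math
import random

def exact_posterior(log_likelihoods, priors):
    """Posterior by Bayes' rule: likelihood times prior, then normalised."""
    log_post = {t: log_likelihoods[t] + math.log(priors[t]) for t in log_likelihoods}
    top = max(log_post.values())  # subtract the maximum to avoid underflow
    weights = {t: math.exp(v - top) for t, v in log_post.items()}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}

def sampled_posterior(log_likelihoods, priors, steps, seed=0):
    """Approximate the same posterior by Metropolis sampling over topologies."""
    rng = random.Random(seed)
    trees = list(log_likelihoods)

    def score(t):
        return log_likelihoods[t] + math.log(priors[t])

    current = trees[0]
    visits = dict.fromkeys(trees, 0)
    for _ in range(steps):
        proposal = rng.choice(trees)  # symmetric proposal
        accept_prob = math.exp(min(0.0, score(proposal) - score(current)))
        if rng.random() < accept_prob:
            current = proposal
        visits[current] += 1
    return {t: n / steps for t, n in visits.items()}

# Hypothetical values, invented for this example.
log_likelihoods = {"AB|CD": -100.0, "AC|BD": -101.0, "AD|BC": -102.0}
priors = {"AB|CD": 1 / 3, "AC|BD": 1 / 3, "AD|BC": 1 / 3}

exact = exact_posterior(log_likelihoods, priors)
approx = sampled_posterior(log_likelihoods, priors, steps=20000)

if __name__ == "__main__":
    for tree in log_likelihoods:
        print(tree, round(exact[tree], 3), round(approx[tree], 3))
```

With only three candidate trees the exact posterior is trivial to compute, so the sampler is redundant here; it matters when the tree space is too large to enumerate, and prior information can be expressed simply by changing the values in `priors`.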

Hall explains how to build a tree from an alignment of sequences using these four different methodologies. In a stepwise manner, a publishable tree is created by making use of the most popular software packages available, such as PAUP* and TREE-PUZZLE.

The book unmistakably derives its strength from its clear, software-based examples (for which the data sets are available on the Internet), and this is probably also its major disadvantage. Although PAUP* is available for PCs, and the Windows version runs as a 'GUI' application, most of its options, unlike those of the Macintosh version, are command-line driven. Readers who buy the book but have a PC will probably find this frustrating, because step-by-step procedures that are presented as menu options and selections will be hard to carry out in a command-line-driven environment. The book would have benefited from including the corresponding commands for the PC version. On the positive side, Hall discusses in detail aspects that may seem trivial, but often are not, such as the rooting and presentation of trees. Furthermore, a large part of the book is devoted to sequence alignment, and the author rightly emphasizes the importance of a reliable alignment in tree building.

For many biologists, nothing is more exciting than seeing in a phylogenetic tree the sequences obtained after weeks or months of hard labour in the lab. This guidebook enables students and researchers anxious to start building trees to complete the job in a few minutes, and this will undoubtedly be highly appreciated. Unlike the pioneers, phylogenetic novices today have direct access to the appropriate computer hardware, software, and now even a tutorial manual that can initiate them into this important field of biology.