Standardized phylogenetic and molecular evolutionary analysis applied to species across the microbial tree of life

There is growing interest in reconstructing phylogenies from the large number of genome sequencing projects that target related viral, bacterial or eukaryotic organisms. To facilitate the construction of standardized and robust phylogenies for disparate types of projects, we have developed a complete bioinformatic workflow, with a web-based component, to perform phylogenetic and molecular evolutionary (PhaME) analysis from sequencing reads, draft assemblies or completed genomes of closely related organisms. Furthermore, the ability to incorporate raw data, including some metagenomic samples containing a target organism (e.g. from clinical samples with suspected infectious agents), shows promise for the rapid phylogenetic characterization of organisms within complex samples without the need for prior assembly.


Supplementary Methods
1. PhaME: Under the hood
PhaME's input consists of a set of genomes in FASTA and/or FASTQ format, with corresponding annotation files in GFF3 format if downstream analyses will include coding regions. A detailed step-by-step explanation of how PhaME analyzes genomes is provided below.

1.1 Selecting Reference genome
Since PhaME is a reference genome-based tool in which all input genomes and metagenomes are aligned against a reference, the first step of PhaME's analysis is selecting a reference genome. Given a set of genomes (in the folder listed under the "refdir" parameter of the control file), the reference genome can be selected using one of three options:
Option 1: a random genome is selected from the provided set of genomes.
Option 2: a specific genome is selected from the set via input from the user.
Option 3: the MinHash distance is calculated between all genomes provided (complete genomes, draft genomes, and raw reads) to determine which reference genome has the shortest average distance to all of the other genomes.
MinHash distances are calculated using the implementation in BBMap 1.
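Option 3 can be illustrated with a minimal sketch. It assumes the pairwise MinHash distances have already been computed (by BBMap in PhaME) and collected into a dictionary; the function name pick_reference, the genome names, and the distance values are hypothetical and are not taken from PhaME's code.

```python
def pick_reference(candidates, distance):
    """Return the candidate with the shortest average distance to all other genomes.

    candidates: genomes eligible to serve as the reference.
    distance:   dict mapping frozenset({a, b}) to the distance between a and b.
    """
    # Every genome that appears in the distance table (references and reads).
    genomes = sorted({g for pair in distance for g in pair})

    def average_distance(genome):
        others = [h for h in genomes if h != genome]
        return sum(distance[frozenset((genome, h))] for h in others) / len(others)

    # Sorting the candidates makes ties resolve deterministically.
    return min(sorted(candidates), key=average_distance)


# Hypothetical distances between three reference genomes and one read set.
DIST = {
    frozenset(("genomeA", "genomeB")): 0.02,
    frozenset(("genomeA", "genomeC")): 0.05,
    frozenset(("genomeA", "reads1")): 0.04,
    frozenset(("genomeB", "genomeC")): 0.03,
    frozenset(("genomeB", "reads1")): 0.02,
    frozenset(("genomeC", "reads1")): 0.06,
}

best = pick_reference(["genomeA", "genomeB", "genomeC"], DIST)
print(best)
```

Only genomes in the candidate list can be chosen, but every input, including read sets, contributes to the average, which mirrors the description of option 3 above.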
1.2 Self-nucmerization to remove repeats from reference genomes
The genome alignment portion of PhaME is built on the tool nucmer 2 for alignments of genomes in FASTA format. Each included genome is first aligned against itself using nucmer, a step called self-nucmerization, and the aligned regions, which are treated as repeats, are then removed from the genome for downstream analyses. The following nucmer command is used for the self-nucmerization step:
$ nucmer --maxmatch --nosimplify --prefix=seq_seq ref_genomeA.fasta ref_genomeA.fasta
The option --maxmatch, which reports all matches, is used to ensure that all possible alignments are reported for maximal removal of repeats.
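The repeat-removal idea can be sketched as interval arithmetic. This is an illustration only: how PhaME parses the nucmer output is not described here, so the sketch assumes the repeat coordinates are already available as 1-based, inclusive (start, end) pairs with the trivial full-length self-match excluded. The function names and coordinates are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or directly adjacent (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def non_repeat_regions(genome_length, repeats):
    """Return the regions of a genome that remain after repeats are removed."""
    kept = []
    cursor = 1  # first position not yet assigned to a kept or repeat region
    for start, end in merge_intervals(repeats):
        if start > cursor:
            kept.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if cursor <= genome_length:
        kept.append((cursor, genome_length))
    return kept


# Hypothetical repeat coordinates on a 1000 bp genome.
repeats = [(100, 200), (150, 250), (900, 1000)]
print(merge_intervals(repeats))
print(non_repeat_regions(1000, repeats))
```

Merging first keeps the second step simple, since each merged repeat can then be skipped in a single pass over the genome.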

Genome Alignments
All genomes in FASTA format are aligned against the reference genome (see section 1.1) using the following command:
$ nucmer --maxmatch refgenome.fasta genome.fasta
All other nucmer options are kept at their defaults. Some of the important ones are listed below:
-b|breaklen     Set the distance an alignment extension will attempt to extend poor scoring regions before giving up (default 200)
-c|mincluster   Sets the minimum length of a cluster of matches (default 65)
-D|diagdiff     Set the maximum diagonal difference between two adjacent anchors in a cluster (default 5)
-d|diagfactor   Set the maximum diagonal difference between two adjacent anchors in a cluster as a differential fraction of the gap length (default 0.12)
--[no]extend    Toggle the cluster extension step (default --extend)
-g|maxgap       Set the maximum gap between two adjacent matches in a cluster (default 90)
-l|minmatch     Set the minimum length of a single match (default 20)
Also, any Ns in the genomes will not be included in the alignment.
Note: If an analysis requires running multiple iterations of PhaME on the same set of data or on a subset of it, the alignments do not need to be repeated each time. PhaME provides an option to keep all possible pairwise alignments of the genomes in "refdir" for future analyses. All the steps described in this section remain the same, except that an all vs. all alignment is performed instead of alignment against a single reference.
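The difference between the two alignment modes can be sketched by listing the nucmer invocations each would need. The sketch only builds argument lists and runs nothing. Whether PhaME aligns each unordered pair once or in both orientations is not stated here, so the use of unordered pairs is an assumption, and the function and file names are hypothetical.

```python
from itertools import combinations


def reference_vs_all(reference, genomes):
    """One nucmer command per non-reference genome."""
    return [["nucmer", "--maxmatch", reference, g] for g in genomes if g != reference]


def all_vs_all(genomes):
    """One nucmer command per unordered pair of genomes (assumption)."""
    return [["nucmer", "--maxmatch", a, b] for a, b in combinations(genomes, 2)]


# Hypothetical contents of the "refdir" folder.
genomes = ["genomeA.fasta", "genomeB.fasta", "genomeC.fasta", "genomeD.fasta"]

ref_cmds = reference_vs_all("genomeA.fasta", genomes)
pair_cmds = all_vs_all(genomes)
print(len(ref_cmds), len(pair_cmds))
```

Under this assumption, n genomes require n - 1 alignments against a single reference but n(n - 1)/2 alignments in the all vs. all mode, which is why keeping the results pays off only when further PhaME runs on the same data are expected.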

Mapping of raw reads to the reference genome
Currently, PhaME only processes short, raw reads from Illumina. If raw reads, single or paired end, are included in the analyses, they are mapped to the reference genome using either bowtie2 or BWA, depending on the user's input. First, the reference genome is indexed. Depending on the mapping tool selected, one of the following commands is executed:
$ bowtie2-build refgenome refgenome
or
$ bwa index refgenome
The raw reads are then mapped to the reference genome using one of the following commands, depending on the mapping tool selected and whether the reads are single or paired.
For bowtie2 and single end reads:
$ bowtie2 -a -x $refgenome -U read -S single.sam