Introduction

Comparing nucleotide sequences from different organisms helps us understand evolution. Applications range from reconstructing the earliest branches on the Tree of Life to mapping the routes and timing of human expansion out of Africa1,2,3. Standard approaches evaluate homologous nucleotide or amino acid positions across a sequence alignment to infer the probable order of divergences and display results in a tree diagram of evolutionary history4,5. Phylogenetic methods generally emphasize branching order, i.e., the sequence of events along each branch, and place less emphasis on timing across divisions. As a result, coincident divergences involving multiple boughs may be overlooked. Specific methods designed to detect clustering have been applied to species delimitation and viral evolution6,7,8,9. This relatively limited focus to date likely reflects the commonly held view that higher taxa are arbitrary demarcations of the taxonomic hierarchy rather than indicators of evolutionary processes10,11.

Matrix heat maps help visualize clustering in complex datasets and can compress hundreds of thousands of data points into single-page displays12,13. Applications range from evaluating social networks to identifying diagnostic gene expression profiles in tumors and brain scan patterns associated with schizophrenia14,15,16,17,18. Matrix rows and columns are sorted, typically by hierarchical clustering, and the rearranged matrix is colorized as a heat map. Clusters of correlated inputs show up as “hot blocks” along the diagonal. Matrices may be asymmetric, e.g., a gene expression profile with genes sorted along one axis and cell types along the other, or symmetric, with identical inputs along both axes (e.g.19,20).

A symmetric matrix heat map approach to comparative nucleotide sequence analysis using indicator vector correlations was recently described21,22. Indicator vectors are digital transformations of nucleotide sequences in vector space; correlations are roughly inversely proportional to p-distances. Unlike simple p-distance methods, scaling of correlations is relative rather than absolute and vectors can represent multiple sequences. Indicator vector analysis generates a Klee diagram, a colorized heat map of the correlation matrix. Taxonomy-ordered Klee diagrams may offer new insights into evolution22,23,24,25,26. However, to date this approach has only been applied to mitochondrial COI barcode sequences and is limited by the need for an accurate taxonomic list, which is not readily available for most groups. Here we describe TreeParser, a web-based software tool that sorts a nucleotide sequence alignment according to a phylogenetic tree generated from the same dataset, facilitating an otherwise time-consuming step in this analytic pipeline. To assess potential utility, we apply TreeParser-indicator vector analysis to mitochondrial and nuclear gene datasets and examine clustering in the resulting Klee diagrams.

Results

TreeParser, Klee performance

TreeParser run times on the web were less than 5 s for alignments with 5,000 or fewer sequences. Larger files containing 7,500 and 10,000 sequences were sorted in 14 s and 26 s, respectively. TreeParser outputs closely followed template trees. Differences reflected topology-equivalent branch rotations and alternate ordering of identical sequences (Supplementary Figs. S1, 2)27. Klee diagrams required approximately two to three minutes on a desktop machine.

Astraptes fulgerator COI barcodes

The skipper butterfly A. fulgerator from northwestern Costa Rica is proposed to represent ten cryptic species based on differences in caterpillar morphology, food plants and COI barcodes28. The putative species, which have modest sequence differences (average nearest neighbor distance, 1.76% K2P; range 0.32%–5.41%), formed discrete blocks of high correlation along the diagonal in the TreeParser-ordered Klee diagram (Fig. 1). Exceptions were INGCUP and HIHAMP, which differ by 1–2 nucleotides and were not clearly demarcated. Whether or not these constitute valid species has been questioned29,30.

Figure 1

Species-level clusters in butterflies and birds.

At left, skipper butterfly Astraptes fulgerator COI barcode Klee diagram generated from TreeParser ordered alignment (n = 420) with correlation scale at right of diagram. Sequence clusters appear as blocks of high correlation along the diagonal and correspond to the 10 provisional species (1. INGCUP, 2. HIHAMP, 3. FABOV, 4. BYTTNER, 5. YESENN, 6. LONCHO, 7. LOHAMP, 8. SENNOV, 9. CELT, 10. TRIGO). Block sizes reflect number of sequences per species (n = 3–88). At right, Setophaga warblers COI barcode Klee generated from TreeParser-ordered alignment (n = 276; 3–32 per species). Blocks along the diagonal correspond to species; species with shared blocks are marked with an asterisk (1. petechia, 2. striata, 3. pensylvanica, 4. nigrescens, 5. graciae, 6. discolor, 7. virens, 8. occidentalis,* 9. townsendi,* 10. magnolia, 11. tigrina, 12. castanea, 13. dominica, 14. palmarum, 15. citrina, 16. americana,* 17. pitiayumi,* 18. cerulea, 19. pinus, 20. kirtlandii, 21. fusca, 22. coronata, 23. caerulescens, 24. ruticilla).

Setophaga warbler COI barcodes

The Setophaga wood warblers are one of the youngest groups of songbirds, an “explosive radiation” of largely North American species that diversified in the past 5–10 million years31. A Klee diagram of the TreeParser-ordered alignment, which included 24 of the 25 Setophaga species in North America, displayed distinct blocks of high correlation corresponding to species (Fig. 1). Expected exceptions were two species pairs known to share barcodes either due to ongoing hybridization (S. townsendi/occidentalis) or recent divergence (S. americana/pitiayumi). It has been proposed that the latter pair represent a single species32.

Tyrannid flycatchers and allies recombination activating gene 1 (RAG-1)

This published dataset includes representatives of nearly all (93%) Tyrannides genera33. Individual species are represented by single sequences. A Klee diagram of the TreeParser-ordered FASTA file displayed discrete blocks of higher correlation along the diagonal that corresponded to the revised Tyrannides phylogeny, including four of five families and several subfamilies (Fig. 2a). Some groups were “split” or “lumped” in the RAG-1 Klee. For example, two Tyrannidae subfamilies appeared as a single block and family Tityridae was split into independent blocks corresponding to subfamily divisions (Fig. 2b).

Figure 2

Higher-level clusters in suborder Tyrannides (tyrannid flycatchers and allies) nuclear gene RAG-1.

a) Klee diagram generated from TreeParser-ordered alignment (n = 180) is shown. Sequence clusters visible as blocks of high correlation along the diagonal correspond to taxonomic groups listed at bottom. b) Klee detail showing Tityridae and subfamilies.

Comparison of avian RAG-1, COI

The avian RAG-1 Klee showed strongly demarcated blocks reflecting major phylogenetic divisions of birds (Fig. 3)34,35. Short mitochondrial sequences such as COI barcodes are generally considered to lack sufficient information for evolutionary analysis above the species level36,37,38. It was therefore notable that much of the RAG-1 Klee structure was mirrored in the COI diagram, although the discontinuities were less marked (Fig. 3).

Figure 3

Comparison of higher-order avian taxonomic clusters in RAG-1, COI.

TreeParser-ordered Klee diagrams for bird species with both RAG-1 and COI barcode sequences are shown (n = 704). To facilitate comparison, RAG-1 Klee was rotated to more closely match arrangement of species in COI diagram. Major taxonomic divisions are labeled at top. Large and small black brackets indicate positions of Tyrannides (cf. Fig. 2a) and Parulidae wood warblers including Setophaga spp. (cf. Fig. 1), respectively. White bracket at lower right of each diagram indicates position of the multi-family New World songbird radiation informally referred to as “nine-primaried oscines”45,46.

Butterfly elongation factor 1α (EF-1), COI

These published datasets included sequences from 89 species representing five of seven recognized butterfly families and include 15 subfamilies and 52 genera39. Clusters corresponding to recognized taxonomic divisions were evident in both the EF-1 and COI Klees, including family Lycaenidae and subfamilies within Nymphalidae and Papilionidae (Fig. 4). In the Klee diagram generated from concatenated EF-1 and COI sequences, three additional families emerged as discrete blocks.

Figure 4

Butterfly family and subfamily clusters in mitochondrial COI and nuclear EF-1.

TreeParser-ordered Klee diagrams representing five of seven butterfly families are shown (n = 89 species). Each Klee follows the NJ tree for that dataset; EF-1 and EF-1 + COI Klees are rotated to more closely match the order in COI Klee. Bar at top indicates positions of families in each diagram and selected clusters representing Nymphalidae subfamilies are marked. Correlation scale is at right and taxonomic groups are listed at bottom.

Discussion

Heat map analysis requires an organized matrix. In this study, phylogeny-ordered alignments enabled Klee heat map visualization of evolutionary sequence clusters. To generate Klee diagrams, we previously sorted sequence alignments by hand according to a taxonomic list or a phylogenetic tree. This was not optimal even for small datasets, as errors were unpredictable and hard to identify and correct. For large datasets, manual reordering was simply not feasible, and a computational approach was needed. To enable automated sorting we developed the TreeParser software described in this paper. The results demonstrate that TreeParser sorts a nucleotide sequence FASTA file according to a phylogenetic tree generated from the same data. The stand-alone web version accepts standard format files and requires no additional software. In this report, the MEGA NJ algorithm was used to produce template trees40. Any phylogenetic software that generates a standard format Newick tree file41 could be utilized by converting the Newick file to text format in MEGA before uploading to TreeParser. However, it is likely optimal to use a distance-based method such as NJ to create the template, given that indicator vector correlations are most closely related to Hamming or p-distances21. Thus distance-based NJ ordering is expected to closely follow indicator vector correlations. The repeated finding of coherent clusters in NJ-ordered Klee diagrams supports this approach.

An alternative to TreeParser is available in SeaView sequence analysis software42, which includes a utility that reorders a FASTA file according to a phylogenetic tree. For persons familiar with SeaView, this may be an attractive option. Advantages to TreeParser are that it is designed to work with the widely-used MEGA software and the stand-alone web version requires no additional software installation.

We applied the TreeParser-indicator vector-Klee pipeline to mitochondrial and nuclear genes from invertebrate and vertebrate species. In each case there were strong congruences between clusters and taxonomic groups. The skipper butterfly A. fulgerator COI Klee displayed eight of the ten putative species as distinct blocks (Fig. 1), a visual representation of the typically shallow evolutionary histories within animal species as compared to greater distances among even close relatives43,44,45. A large set of closely-related Setophaga warbler species formed similarly distinct blocks in the COI Klee (Fig. 1). In tyrannid flycatchers, the nuclear RAG-1 Klee discontinuities corresponded to recently revised family and subfamily groups (Fig. 2), providing a condensed snapshot of higher-level phylogeny33. With a broader set of avian species, a RAG-1 Klee vividly displayed major taxonomic divisions of birds (Fig. 3). A COI Klee generated from the same set of species demonstrated congruent blocks of high correlation, although less strongly demarcated. Applied to butterfly COI and nuclear EF-1 sequence alignments, Klee diagrams revealed families and subfamilies as distinct blocks (Fig. 4).

In addition to congruences, differences between Klee clusters and named taxonomic divisions suggest possible areas that could benefit from further attention (e.g. Fig. 2b). Indicator vector-Klee analysis may point to groups meriting formal taxonomic names, such as the New World passerine radiation of “nine-primaried oscines”46,47, which appeared as a densely correlated block in both RAG-1 and COI Klees (Fig. 3).

Several limitations to this analytic approach were encountered. The initial version of TreeParser had difficulty finding unique IDs in some files, reflecting the diversity of sequence headers. To circumvent this problem we modified the program and web portal, adding an option of using the entire sequence header as an identifier. Regarding indicator vector analysis, alignments with large gaps or numbers of missing characters produced distorted Klee diagrams. This was addressed by filtering alignments for full-length sequences and setting indicator vector bp parameters to exclude regions with missing data. It should be noted that all datasets in this study were protein coding regions. It may be of interest to test this approach on alignments of introns, ribosomal genes, or other non-coding sequences that contain gaps.

More generally, although not relevant to the above examples, we encountered limitations to analyzing large files at multiple steps in the pipeline: alignment, tree generation and indicator vector-Klee analysis. The computing challenges to generating alignments and phylogenetic trees for large datasets are well known (e.g.48). Using higher capacity hardware we have been able to generate phylogeny-informative Klee diagrams for alignments as large as 11,000 sequences (Supplementary Fig. S3). It should be noted that TreeParser sorted this relatively large dataset on our standard server without difficulty.

Although it is possible to construct an accurate evolutionary branching diagram for just a few taxa, clustering is likely evident only if many closely related organisms are analyzed. DNA barcode libraries are an attractive resource given the breadth of taxonomic coverage. Drawbacks are reliance on a single gene and the paucity of phylogenetic signal in short mitochondrial DNA sequences36,37,38,49. In this study, higher-level COI clusters were concordant with those in nuclear or combined gene analysis and with established taxonomy (Figs. 2–4; see also22,24). These results suggest DNA barcode Klee analysis could help establish a taxonomic framework, which even if incomplete, could be useful particularly for groups less well known than butterflies or birds. It should be straightforward to test if these findings are generally applicable by analyzing other animal groups with large datasets of mitochondrial and nuclear genes in GenBank or Barcode of Life Datasystems (BOLD)50. Unlike animals, green plants (Viridiplantae) do not show strong intraspecific clustering in organellar genes including the standard plant barcode loci, rbcL and matK51,52. Given this apparent dichotomy, it would be of interest to apply TreeParser-indicator vector-Klee analysis to examine higher-level structure in land plants.

The present findings support the re-emerging view that clustering is a widespread evolutionary pattern not limited to species-level differences. For example, Barraclough and colleagues recently proposed that higher-level diversity is comprised of “evolutionary significant units worthy of scientific study” and put forth a mechanism by which such units could arise53. However, to date there is no broadly applicable method other than expert opinion to define clusters above species level and thus a lack of objective data for model testing. Our results demonstrate that indicator vector-Klee heat map analysis delineates higher-level structure in nucleotide sequence alignments. Analyzing additional datasets as outlined above will help determine the generality of clustering and the utility of this approach in investigating underlying mechanisms.

In summary, TreeParser-indicator vector-Klee software visualizes evolutionary clusters in nucleotide sequence datasets. This approach provides a condensed snapshot of a sequence alignment and should help investigate the structure of higher-level diversity which is not well understood.

Methods

Datasets

DNA barcode sequences were downloaded from BOLD project “EPAF Astraptes fulgerator complex”28,50. Sequences were aligned with MUSCLE in MEGA and trimmed to include 648 base pairs (bp) corresponding to nucleotides 52–699 of the mouse mitochondrial genome40,54. Those representing the ten putative species and containing at least 600 bp (positions 42 to 642) were selected for further analysis (n = 420). The sequence alignment and a MEGA-generated Kimura-2-parameter (K2P) neighbor-joining (NJ) tree file in text format were uploaded to TreeParser, producing an output FASTA file that followed the order of terminals in the tree. A Klee diagram was generated by indicator vector analysis with parameters n = 1 sequence/vector and bp window = 42–642.

Setophaga warbler COI DNA barcode sequences were downloaded from GenBank using search terms “setophaga[organism] AND BARCODE[keyword]”, aligned in MEGA and trimmed to the COI barcode region as described above. Sequences containing at least positions 100–600 were selected for further analysis (n = 276). A K2P NJ tree text file and FASTA alignment were uploaded to TreeParser and the re-ordered alignment was used to generate a Klee diagram, with parameters n = 1 sequence/vector and bp window = 100–600.
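The window-based selection described for the barcode datasets above (positions 42–642 or 100–600) can be illustrated with a short Python sketch. This is an illustration only and not the procedure used in MEGA for this study; the function names, the 1-based inclusive window convention and the set of characters treated as missing ("-", "N", "?") are assumptions made for the example.

```python
# Illustrative sketch: keep aligned sequences that have a called base at
# every position of a bp window. Not the authors' actual filtering code.

MISSING = set("-N?")  # assumption: gap and ambiguity symbols treated as missing


def covers_window(seq, start, end):
    """Return True if seq has a called base at each 1-based position start..end."""
    window = seq[start - 1:end].upper()
    if len(window) != end - start + 1:
        return False  # sequence is shorter than the window
    return not any(base in MISSING for base in window)


def filter_by_window(records, start, end):
    """records is a list of (header, aligned_sequence) tuples."""
    return [(h, s) for h, s in records if covers_window(s, start, end)]


if __name__ == "__main__":
    toy = [("a", "ACGTACGT"), ("b", "AC--ACGT"), ("c", "ACGT")]
    print(filter_by_window(toy, 3, 6))
```

In this toy example only sequence "a" is retained, because "b" has gaps inside the window and "c" ends before it.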

RAG-1 sequences from suborder Tyrannides (tyrant flycatchers, cotingas, manakins and their allies)33 were downloaded from GenBank PopSet and aligned in MEGA using MUSCLE (n = 180). The alignment contained 1,183 variable and 1,689 conserved positions. To facilitate desktop indicator vector analysis, conserved positions were deleted using the MEGA export function. The condensed alignment was reordered with TreeParser according to a K2P NJ text file as described above. A Klee diagram was generated with parameters n = 1 sequence/vector and bp window = 1–1183.
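Deleting conserved positions, as done above with the MEGA export function, amounts to keeping only alignment columns in which at least two different characters occur. A minimal Python sketch of that idea follows. It is not MEGA's implementation; in particular, the assumption here is that any differing character, including a gap, makes a column variable, which may differ from how MEGA classifies sites.

```python
# Illustrative sketch: condense an alignment to its variable columns.
# Assumption: a column is "conserved" only if every sequence shows the same
# character there (case-insensitive); gaps count as characters.


def drop_conserved_columns(seqs):
    """Return the sequences with all invariant alignment columns removed."""
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    keep = [i for i in range(length)
            if len({s[i].upper() for s in seqs}) > 1]
    return ["".join(s[i] for i in keep) for s in seqs]


if __name__ == "__main__":
    print(drop_conserved_columns(["ACGT", "ACCT", "AGGT"]))
```

Because the order of the retained columns is unchanged, the condensed alignment can be passed to tree building and reordering in the same way as the full alignment.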

To compare clustering in avian RAG-1 and COI, all avian RAG-1 sequences in GenBank (search terms “aves[organism] AND (rag-1[gene name] OR rag1[gene name])”) were downloaded and aligned in MEGA using MUSCLE. These were filtered to exclude short sequences, multiple sequences per species, conserved positions as described above and positions with gaps in more than 90% of sequences. The resulting alignment contained 595 bp. Sequences from those species also represented in a published avian COI BARCODE dataset55 were selected for further analysis, as were the corresponding COI BARCODEs (n = 704). K2P NJ tree files produced in MEGA and their respective alignments were uploaded to TreeParser. Klee diagrams were generated from rearranged FASTA files with parameters n = 1 sequence/vector and bp window = 1–595 (RAG-1) or 100–600 (COI).

To examine higher-level patterns in butterfly genes, datasets of EF-1 (1066 bp) and COI (1101 bp) sequences (n = 89 sequences, 1 per species)39 were downloaded from GenBank PopSet, aligned in MEGA and used to generate TreeParser-ordered Klee diagrams. For combined analysis, a FASTA file of concatenated EF-1 and COI sequences was condensed by removing invariant positions as described above (final size 864 bp). A Klee diagram was generated from the TreeParser re-ordered alignment with bp window = 1–864.

TreeParser software

TreeParser is designed to work with FASTA files downloaded from GenBank or BOLD and with phylogenetic tree text files generated by MEGA. Programming language PHP version 5 was chosen for web compatibility and ease of use. The software and step-by-step instructions on running TreeParser and generating Klee diagrams are posted on the web at http://phe.rockefeller.edu/barcode/klee.php. The web version, hosted on a Linux server running Apache at the address above, requires no additional software. The source code, designed to be downloaded and run locally, is available at http://phe.rockefeller.edu/barcode/klee_sourcecode/tree_parser.tar.gz.

TreeParser accepts two files: a FASTA-formatted alignment of nucleotide sequences and a text format tree file generated from the alignment using MEGA.

Once files are uploaded, TreeParser performs the following algorithm:

  1. Search the tree and FASTA files for the unique ID of each nucleotide sequence (represented as a regular expression).
  2. Obtain two lists, using the unique ID of the particular sequence as the index of each fragment:
    a. Tree list: An ordered list of all sequences in the template tree text file.
    b. FASTA list: A list of nucleotide sequences constructed by splitting up the FASTA file into blocks. Each block consists of a unique header and its sequence.
  3. Loop through the Tree list and search the FASTA list for each entry.
  4. Construct an array consisting of reordered FASTA blocks corresponding to the order of the Tree file list.
  5. Check the new array for any missing values from either the Tree or FASTA list.
  6. Write the new array to an output FASTA file.
  7. Generate a secondary log file.

The output FASTA file is identical in content to the original, but reordered in accordance with the template tree. This file can then be passed directly into indicator vector software to construct a Klee diagram. The output log file records the number of matched sequences and lists any missing values from the FASTA or Tree lists.
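The steps above can be sketched in a few lines of Python. This is an illustration of the listed algorithm, not the PHP source code of TreeParser. The function names, the dictionary used as a log, and the example ID pattern are hypothetical, and the sketch assumes that sequence IDs appear in the tree text file in the same top-to-bottom order as the terminals of the tree. Unlike the actual output, this sketch joins wrapped sequence lines into a single line per record.

```python
# Illustrative sketch of tree-ordered FASTA sorting (not the TreeParser source).
import re


def parse_fasta(text):
    """Split FASTA text into (header, sequence) blocks."""
    blocks = []
    for chunk in text.split(">")[1:]:
        lines = chunk.strip().splitlines()
        header = lines[0].strip()
        sequence = "".join(line.strip() for line in lines[1:])
        blocks.append((header, sequence))
    return blocks


def reorder_fasta(tree_text, fasta_text, id_pattern):
    """Reorder FASTA blocks to follow the order of IDs found in the tree text."""
    id_re = re.compile(id_pattern)

    # Tree list: IDs in the order they occur in the tree text file.
    tree_ids = [m.group(0) for m in id_re.finditer(tree_text)]

    # FASTA list: blocks indexed by the ID found in each header.
    # Headers in which no ID is found are skipped in this sketch.
    fasta_by_id = {}
    for header, sequence in parse_fasta(fasta_text):
        match = id_re.search(header)
        if match:
            fasta_by_id[match.group(0)] = (header, sequence)

    # Loop through the Tree list, collecting matching FASTA blocks.
    ordered, missing_from_fasta = [], []
    for seq_id in tree_ids:
        if seq_id in fasta_by_id:
            ordered.append(fasta_by_id[seq_id])
        else:
            missing_from_fasta.append(seq_id)

    # Check for FASTA entries that never appeared in the tree.
    in_tree = set(tree_ids)
    missing_from_tree = [i for i in fasta_by_id if i not in in_tree]

    output = "".join(f">{h}\n{s}\n" for h, s in ordered)
    log = {
        "matched": len(ordered),
        "missing_from_fasta": missing_from_fasta,
        "missing_from_tree": missing_from_tree,
    }
    return output, log


if __name__ == "__main__":
    tree = "+--- AB000003\n|  +- AB000001\n+--+- AB000002\n"
    fasta = ">AB000001 sp1\nACGT\n>AB000002 sp2\nACGG\n>AB000003 sp3\nTTTT\n"
    # Hypothetical ID pattern: two capital letters followed by six digits.
    reordered, log = reorder_fasta(tree, fasta, r"[A-Z]{2}\d{6}")
    print(reordered)
    print(log)
```

Indexing the FASTA blocks by ID in a dictionary avoids a repeated linear search of the FASTA list for every tree entry, which matters for alignments with thousands of sequences.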

Indicator vector analysis

This was performed as described21 using updated software available at http://phe.rockefeller.edu/barcode/klee_sourcecode/Indicator_Vector_Klee_v1.tar.gz.
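For readers unfamiliar with the general idea, the following Python sketch shows how a tree-ordered alignment can be turned into a symmetric similarity matrix of the kind displayed as a heat map. It is a deliberately simplified stand-in and not the published indicator vector algorithm21 or the MATLAB software cited above. Each sequence is encoded as a one-hot vector (four entries per position, an assumption for this example) and pairs are compared by normalized dot product. For sequences without gaps or ambiguous bases, this toy similarity equals one minus the p-distance.

```python
# Simplified illustration only: one-hot encoding plus normalized dot product.
# This is NOT the published indicator vector method.

BASES = "ACGT"


def one_hot(seq):
    """Encode a nucleotide sequence as a flat 0/1 vector, four entries per site."""
    vec = []
    for base in seq.upper():
        # Gaps and ambiguous bases become all-zero entries.
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def similarity_matrix(seqs):
    """Return a symmetric matrix of normalized dot products between sequences."""
    vecs = [one_hot(s) for s in seqs]
    norms = [sum(x * x for x in v) ** 0.5 for v in vecs]
    n = len(vecs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] and norms[j]:
                dot = sum(a * b for a, b in zip(vecs[i], vecs[j]))
                matrix[i][j] = dot / (norms[i] * norms[j])
    return matrix


if __name__ == "__main__":
    for row in similarity_matrix(["ACGT", "ACGA", "TTTT"]):
        print(row)
```

When the rows and columns follow the order of terminals in the template tree, groups of similar sequences appear as blocks of high values along the diagonal of such a matrix.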

Computer hardware

MEGA (nucleotide sequence alignment, neighbor-joining) and MATLAB 2009a (indicator vector-Klee) analyses were performed on Mac Mini desktop (Mac OSX 10.7.4, 2.5 GHz Intel Core i5 processor, 8 GB RAM).