Progressive Cactus is a multiple-genome aligner for the thousand-genome era

New genome assemblies have been arriving at a rapidly increasing pace, thanks to decreases in sequencing costs and improvements in third-generation sequencing technologies1–3. For example, the number of vertebrate genome assemblies currently in the NCBI (National Center for Biotechnology Information) database4 increased by more than 50% to 1,485 assemblies in the year from July 2018 to July 2019. In addition to this influx of assemblies from different species, new human de novo assemblies5 are being produced, which enable the analysis of not only small polymorphisms, but also complex, large-scale structural differences between human individuals and haplotypes. This coming era and its unprecedented amount of data offer the opportunity to uncover many insights into genome evolution but also present challenges in how to adapt current analysis methods to meet the increased scale. Cactus6, a reference-free multiple genome alignment program, has been shown to be highly accurate, but the existing implementation scales poorly with increasing numbers of genomes, and struggles in regions of highly duplicated sequences. Here we describe progressive extensions to Cactus to create Progressive Cactus, which enables the reference-free alignment of tens to thousands of large vertebrate genomes while maintaining high alignment quality. We describe results from an alignment of more than 600 amniote genomes, which is to our knowledge the largest multiple vertebrate genome alignment created so far.


F1 score
Supplementary Figure 2: Species-by-species breakdown of similarity between the alignments with guidetrees based on Jarvis and Prum.Similarity for every cell of the matrix is based on F1 score for pairs of aligned bases found to be shared or unshared between the two alignments.

Set
In A In B

In both
Supplementary Figure 5: Comparison of aligned pairs between human-dog and human-mouse aligned pairs within the alignments of high-and low-quality assemblies, as well as the 600-way, to those using the respective chains and nets.The Cactus alignments are filtered using the mafDuplicateFilter tool, which chooses the single closest matching sequence from each species in each alignment block to the consensus sequence of the block.This allows a fair comparison against chains and nets, which are single-copy (and therefore have no duplicates).The first alignment mentioned is referred to as A, the second is referred to as B.  11: Distribution of Jaccard similarities between each pair of aligners for the mRNA and coding regions of each human transcript mapped to the chicken genome (see Methods).Where one or both aligners produces multiple mappings per transcript we pick the pair of mappings (one from each mapper) with highest overlap.If one or both methods method didn't produce an alignment, a Jaccard index of 0.0 is assigned.Where one or both aligners produces multiple mappings per gene we pick the pair of mappings (one from each mapper) with highest overlap.If one or both methods didn't produce an alignment, a Jaccard index of 0.0 is assigned.Here gene coordinates are defined by the longest single mRNA or CDS per gene.(c) The cactus graph constructed from the sequence graph using the Cactus construction procedure (see [1] for details).The subsequence 'a d f' common to all the input strings is represented by a simple cycle, termed a chain.The remaining substrings 'b', 'c', and 'e' are each in trivial chains represented using self loops.

Alignment relationships
Supplementary Figure 15: A visualization of the best-hit filtering method.Here, each node of the directed graph indicates a single base, and edges represent pairwise alignment relationships (the color of the node indicates the species the base belongs to, and higher thickness of edges represents higher scores of the pairwise alignments).Since Progressive Cactus's alignment columns represent the transitive closure of the input pairwise alignment relationships, the final alignment relationships will be represented by connected components within this graph.Taking the single best hit (so that this graph contains at most one outgoing edge per base) results in the correct separation between copies if orthologous copies have higher score, but some lineage-specific duplications require secondary, non-best-hit alignments to bring together orthologs from different species.Supplementary Figure 16: Coverage of the human genome from alignments with and without removing recoverable chains after the CAF process.While the coverage is increased overall across all genomes when removing recoverable chains, the increase is relatively larger in more distant species.10: Similarity of mapping protein-coding transcripts and genes between human and chicken using the four alignment and mapping methods (see Methods).The metric is the mean Jaccard index computed at the base-level of individual transcripts.Where one or both aligners produces multiple mappings per transcript we pick the pair of mappings (one from each mapper) with highest overlap.If one or both methods method didn't produce an alignment, a Jaccard index of 0.0 is assigned.

Supplementary Figure 1 :
Guide trees used in the guide-tree influence analysis.s _ p la t y r h y n c h o s A n t r o s t o m u s _ c a r o li n e n s is A p a lo d e r m a _ v it t a t u m A p t e n o d y t e s _ fo r u s _ b r a c h y r h y n c h o s C u c u lu s _ c a n o r u s E g r e t t a _ g a r z e t t a E u r y p y g a _ h e li a s F a lc o _ p e r e g r c r o c o r a x _ c a r b o P h o e n ic o p t e r u s _ r u b e r P ic o id e s _ p u b e s c e n s P o d ic e p s _ c r is Jaccard similarity of region between aligner pair Number of human genes mapped to chickenRegion type CDS mRNA Supplementary Figure12: Distribution of Jaccard similarities between each pair of aligners for the mRNA and coding regions of each human of each human gene mapped to the chicken genome (see Methods).

Supplementary Figure 13 :
Sequence and cactus graph example.(A) A biedged sequence graph constructed from the strings 'a b d e f', 'a c d e f' and 'a b d f', where here homology is indicated by common alphabet characters.In Progressive Cactus the sequence edges (black lines) represent alignment blocks.The adjacency edges (grey lines) indicate the sequence relationships.(B) Each input string is encoded as a restricted form of walk in the sequence graph; the path for the string 'a c d f' is highlighted by dotted edges.

Supplementary Figure 14 :
Iteratively building a sequence graph by progressive glueing operations.(A) Two sequence edges labeled with their respective strings.(B) A local alignment glues together the 'a' copies.(C) The glueing of 'd'.(D) The glueing of 'f' creates the final set of alignment blocks.
Comparison between Jarvis (left) and Prum (right) topologies (branch lengths not to scale), with branches above clades not shared between the two topologies highlighted in red.
Supplementary Figure4: Fraction of aligned pairs found only in the alignment of high-quality assemblies, only in the alignment of low-quality assemblies, or in both.Only human, mouse, rat, and dog pairs are shown since these are the only species represented by the same assemblies in both alignments.
Aligned pairs shared within specific regions (defined on the human reference) between several pairs of alignments.For Cactus alignments, duplicates have been removed for better comparison against the single-copy net alignments.The smaller alignment used for comparison is the alignment of highquality assemblies mentioned in earlier sections.A: The fraction of various types of human regions mappable to each ancestor within each alignment.B: The total size of each ancestral assembly within each alignment.
Supplementary Figure9: Comparison of ancestors at the same position in the tree in a large (242-species Zoonomia alignment, labeled as "200M") and small (11-species, labeled as "High-quality assemblies") alignment of mammalian genomes.Supplementary Figure10:Comparison of ancestors at the same position in the tree in a large (363species, labeled as "363-way") and smaller (48-species, labeled as "48-way") alignment of bird genomes.A: The fraction of various types of chicken regions mappable to each ancestor within each alignment.B: The total size of each ancestral assembly within each alignment.

Table 9 :
Comparison of each of the LASTZ and Cactus alignments to the union of the BLATX and TBLASTX translated alignments.This analysis implicitly uses the union of the translated alignments as a proxy to a truth set to compare the non-translated methods.Human coding sequences from GENCODE V34 are used, picking the coding sequence from the longest transcript per gene to define a minimally overlapping set.To account for human bases which map to multiple bases in chicken (which occurs frequently for the translated alignment methods that include very distant, fragmented, paralogous alignments, but much less often for the non-translated methods), when per CDS there is either or both multiple translated alignments or multiple non-translated alignments, we pick the pair of mappings (one translated, one from the non-translated method) with highest pairwise Jaccard similarity.Base level counts are then reported for this pairing of the CDS.We report numbers in terms of genes and individual human coding bases.The source row is the total number of human genes/coding bases.The missing counts are the number of human genes/coding bases that were not mapped by either the untranslated method (Cactus or LASTZ ) or the one of the translated alignment methods.The unmapped are cases where there are translated alignments and no untranslated alignment.The mapped hit are cases where the untranslated alignment is the same as a translated alignment.With mapped miss, the untranslated alignments do not overlap any of the translated alignments.The mapped only are counts of genes/bases that are only aligned by the untranslated aligner.
Supplementary Table