Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • Article
  • Published:

Deep distributed computing to reconstruct extremely large lineage trees

Abstract

Phylogeny estimation (the reconstruction of evolutionary trees) has recently been applied to CRISPR-based cell lineage tracing, allowing the developmental history of an individual tissue or organism to be inferred from a large number of mutated sequences in somatic cells. However, current computational methods are not able to construct phylogenetic trees from extremely large numbers of input sequences. Here, we present a deep distributed computing framework to comprehensively trace accurate large lineages (FRACTAL) that substantially enhances the scalability of current lineage estimation software tools. FRACTAL first reconstructs only an upstream lineage of the input sequences and recursively iterates the same produce for its downstream lineages using independent computing nodes. We demonstrate the utility of FRACTAL by reconstructing lineages from >235 million simulated sequences and from >16 million cells from a simulated experiment with a CRISPR system that accumulates mutations during cell proliferation. We also successfully applied FRACTAL to evolutionary tree reconstructions and to an experiment using error-prone PCR (EP-PCR) for large-scale sequence diversification.

This is a preview of subscription content, access via your institution

Access options

Buy this article

Prices may be subject to local taxes which are calculated during checkout

Fig. 1
Fig. 2: Lineage reconstruction of over 235 million sequences.
Fig. 3: Reconstruction of evolutionary and pseudo-evolutionary trees.
Fig. 4: FRACTAL for large-scale cell lineage tracing.
Fig. 5: Reconstruction of a sequence diversification process produced by EP-PCR.

Similar content being viewed by others

Data availability

All of the simulated sequences and their lineages generated by PRESUME are available via the URLs given in Supplementary Table 2. The high-throughput sequencing data produced in this work are available at the NCBI Sequence Read Archive (PRJNA675984). P values obtained in this study are listed in Supplementary Data 5.

Code availability

The source codes and manuals of FRACTAL and PRESUME are available at GitHub (https://github.com/yachielab/FRACTAL and https://github.com/yachielab/PRESUME, respectively). FRACTAL and PRESUME can also be modified and executed on a small scale on web browsers through the Code Ocean, a cloud-based code sharing platform (https://doi.org/10.24433/CO.2433997.v1 and https://doi.org/10.24433/CO.3922773.v1, respectively). Note that the distributed computing mode is disabled in the Code Ocean because the Univa Grid Engine (UGE) is not supported in the Code Ocean. The other codes used in this study are also all available at GitHub (for the list, see Supplementary Table 5).

References

  1. Zou, Q., Wan, S., Zeng, X. & Ma, Z. S. Reconstructing evolutionary trees in parallel for massive sequences. BMC Syst. Biol. 11, 100 (2017).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  2. Mora, C., Tittensor, D. P., Adl, S., Simpson, A. G. & Worm, B. How many species are there on Earth and in the ocean? PLoS Biol. 9, e1001127 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  3. Cong, L. et al. Multiplex genome engineering using CRISPR/Cas systems. Science 339, 819–823 (2013).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  4. Mali, P. et al. RNA-guided human genome engineering via Cas9. Science 339, 823–826 (2013).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  5. Kalhor, R. et al. Developmental barcoding of whole mouse via homing CRISPR. Science 361, eaat9804 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Chan, M. M. et al. Molecular recording of mammalian embryogenesis. Nature 570, 77–82 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Bowling, S. et al. An engineered CRISPR–Cas9 mouse line for simultaneous readout of lineage histories and gene expression profiles in single cells. Cell 181, 1410–1422 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Salvador-Martinez, I., Grillo, M., Averof, M. & Telford, M. J. Is it possible to reconstruct an accurate cell lineage using CRISPR recorders? eLife 8, e40292 (2019).

    Article  PubMed  PubMed Central  Google Scholar 

  9. McKenna, A. et al. Whole-organism lineage tracing by combinatorial and cumulative genome editing. Science 353, aaf7907 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. Raj, B. et al. Simultaneous single-cell profiling of lineages and cell types in the vertebrate brain. Nat. Biotechnol. 36, 442–450 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  11. Spanjaard, B. et al. Simultaneous lineage tracing and cell-type identification using CRISPR–Cas9-induced genetic scars. Nat. Biotechnol. 36, 469–473 (2018).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Alemany, A., Florescu, M., Baron, C. S., Peterson-Maduro, J. & van Oudenaarden, A. Whole-organism clone tracing using single-cell sequencing. Nature 556, 108–112 (2018).

    Article  CAS  PubMed  Google Scholar 

  13. Quinn, J. J. et al. Single-cell lineages reveal the rates, routes, and drivers of metastasis in cancer xenografts. Science 371, eabc1944 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  14. Simeonov, K. P. et al. Single-cell lineage and transcriptome reconstruction of metastatic cancer reveals selection of aggressive hybrid EMT states. Cancer Cell 39, 1150–1162.e9 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  15. Cao, J. et al. Comprehensive single-cell transcriptional profiling of a multicellular organism. Science 357, 661–667 (2017).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  16. Cao, J. et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature 566, 496–502 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  17. Cao, J. et al. A human cell atlas of fetal gene expression. Science 370, eaba7721 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  18. Sender, R., Fuchs, S. & Milo, R. Revised estimates for the number of human and bacteria cells in the body. PLoS Biol. 14, e1002533 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  19. Barbera, P. et al. EPA-ng: massively parallel evolutionary placement of genetic sequences. Syst. Biol. 68, 365–369 (2019).

    Article  PubMed  Google Scholar 

  20. Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313 (2014).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  21. Simonsen, M., Mailund, T. & Pedersen, C. N. S. in International Workshop on Algorithms in Bioinformatics 113–122 (Springer, 2008).

  22. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2—approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. Robinson, D. F. & Foulds, L. R. Comparison of phylogenetic trees. Math. Biosci. 53, 131–147 (1981).

    Article  Google Scholar 

  24. Yarza, P. et al. The All-Species Living Tree project: a 16S rRNA-based phylogenetic tree of all sequenced type strains. Syst. Appl. Microbiol. 31, 241–250 (2008).

    Article  CAS  PubMed  Google Scholar 

  25. Parks, D. H. et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat. Biotechnol. 36, 996–1004 (2018).

    Article  CAS  PubMed  Google Scholar 

  26. Frieda, K. L. et al. Synthetic recording and in situ readout of lineage information in single cells. Nature 541, 107–111 (2017).

    Article  CAS  PubMed  Google Scholar 

  27. Jones, M. G. et al. Inference of single-cell phylogenies from lineage tracing data using Cassiopeia. Genome Biol. 21, 92 (2020).

    Article  PubMed  PubMed Central  Google Scholar 

  28. Nishida, K. et al. Targeted nucleotide editing using hybrid prokaryotic and vertebrate adaptive immune systems. Science 353, aaf8729 (2016).

    Article  CAS  PubMed  Google Scholar 

  29. Hwang, B. et al. Lineage tracing using a Cas9-deaminase barcoding system targeting endogenous L1 elements. Nat. Commun. 10, 1234 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  30. Grünewald, J. et al. A dual-deaminase CRISPR base editor enables concurrent adenine and cytosine editing. Nat. Biotechnol. 38, 861–864 (2020).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  31. Zhang, X. et al. Dual base editor catalyzes both cytosine and adenine base conversions in human cells. Nat. Biotechnol. 38, 856–860 (2020).

    Article  CAS  PubMed  Google Scholar 

  32. Sakata, R. C. et al. Base editors for simultaneous introduction of C-to-T and A-to-G mutations. Nat. Biotechnol. 38, 865–869 (2020).

    Article  CAS  PubMed  Google Scholar 

  33. Du, Z., Santella, A., He, F., Tiongson, M. & Bao, Z. De novo inference of systems-level mechanistic models of development from live-imaging-based phenotype analysis. Cell 156, 359–372 (2014).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  34. Ciccarelli, F. D. et al. Toward automatic reconstruction of a highly resolved tree of life. Science 311, 1283–1287 (2006).

    Article  CAS  PubMed  Google Scholar 

  35. Brown, C. T. et al. Unusual biology across a group comprising more than 15% of domain Bacteria. Nature 523, 208–211 (2015).

    Article  CAS  PubMed  Google Scholar 

  36. Poe, S. & Swofford, D. L. Taxon sampling revisited. Nature 398, 299–300 (1999).

    Article  CAS  PubMed  Google Scholar 

  37. Chow, K. K. et al. Imaging cell lineage with a synthetic digital recording system. Science 372, eabb3099 (2021).

  38. Klein, A. M. et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201 (2015).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  39. Macosko, E. Z. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  40. Yu, M. K. et al. DDOT: a Swiss army knife for investigating data-driven biological ontologies. Cell Syst. 8, 267–273 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  41. Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  42. Capella-Gutiérrez, S., Silla-Martínez, J. M. & Gabaldón, T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972–1973 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  43. Shen, W., Le, S., Li, Y. & Hu, F. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS ONE 11, e0163962 (2016).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  44. Schliep, K. P. phangorn: phylogenetic analysis in R. Bioinformatics 27, 592–593 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  45. Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).

    Article  CAS  PubMed  Google Scholar 

  46. Madeira, F. et al. The EMBL-EBI search and sequence analysis tools APIs in 2019. Nucleic Acids Res. 47, W636–W641 (2019).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  47. Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  48. Baum, B. R. PHYLIP: phylogeny inference package. Version 3.2. Quarterly Review of Biology 64, 539–541 (1989).

    Article  Google Scholar 

  49. Zhao, L., Liu, Z., Levy, S. F. & Wu, S. Bartender: a fast and accurate clustering algorithm to count barcode reads. Bioinformatics 34, 739–747 (2018).

    Article  CAS  PubMed  Google Scholar 

  50. Satopaa, V., Albrecht, J., Irwin, D. & Raghavan, B. in 2011 31st International Conference on Distributed Computing Systems Workshops 166–171 (IEEE, 2011).

  51. Levenshtein, V. I. in Soviet Physics Doklady, Vol. 10 707–710 (Doklady Akademii Nauk SSSR, 1966).

  52. Brunner, E. & Munzel, U. The nonparametric Behrens–Fisher problem: asymptotic theory and a small-sample approximation. Biom. J. 42, 17–25 (2000).

    Article  Google Scholar 

Download references

Acknowledgements

We thank members of the Yachie lab for valuable discussions and critical assessment of the work, especially A. Adel, S. King and S. Okawa for reviewing the manuscript. We also thank C. de Hoog for providing comments on the manuscript. Deep sequencing was performed with the support of H. Aburatani. This study was supported by the Canada Research Chair program (by the Canadian Institutes for Health Research), the Japan Science and Technology Agency (JST) PRESTO program (10814), the Japan Agency for Medical Research and Development (AMED) PRIME program (20gm6110007), the Shimadzu Science and Technology Foundation, the Naito Foundation, the Nakajima Foundation and the Asahi Glass Foundation (all to N.Y.). Y.K., S.I., H.M. and N.M. were supported by Japan Society for the Promotion of Science (JSPS) Research Fellowships. High-performance computing experiments were performed using the SHIROKANE Supercomputer at the University of Tokyo Human Genome Center or the NIG Supercomputer System at the National Institute of Genetics.

Author information

Authors and Affiliations

Authors

Contributions

N.K., H.M. and N.Y. conceived the high-level concept of FRACTAL. N.K., W.I. and N.Y. designed the study. N.K. implemented FRACTAL. K.W. and N.K. implemented PRESUME. N.K. led the analyses. N.M. and N.Y. designed the high-content cell lineage recording model. Y.K. and K.W. supported the analyses of the simulated cell lineages. S.I. and M.T. performed the EP-PCR experiments. Y.K. supported the analysis of the EP-PCR experiments. N.K., K.O., D.P. and T.I. performed data visualization using HiView. N.K., Y.K., S.I. and N.Y. wrote the manuscript.

Corresponding author

Correspondence to Nozomu Yachie.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 PRESUME.

a, Schematic diagram of PRESUME. b, Various datasets of 32,768 sequences generated by PRESUME with different mutational parameters μ and α. The topological parameter σ was fixed to 0 (perfectly balanced sequence diversification). The diversity of nucleotide letters across different sequence positions is represented by a bit score distribution and by a sequence logo for the first 15 nt of the generated sequences. c, Partial diagrams of representative trees generated by PRESUME with different topological parameters σ.

Extended Data Fig. 2 Runtime simulation of FRACTAL.

a, Conceptual diagram of the runtime simulation of FRACTAL. Under a distributed computing environment with a fixed number of available nodes d, each FRACTAL job of input sequence size n starts if there is one or more available free computing node; otherwise, it is stalled until a node is released from one of the ongoing jobs. Each FRACTAL job process is modeled to occupy one computing node for a runtime f(n) and produce two new child jobs for the next job cycles each with an input sequence size of n/2. When the input sequence size is under a certain threshold (nt), the job terminates after occupying one node for a runtime f(n) assuming the terminal lineage reconstruction process. b, Two implementation models of FRACTAL. In FRACTAL, each job cycle contains three steps that require high computing loads: sequence subsampling from the entire input sequences, phylogenetic placement, and sorting of sequences mapped on each of the sample lineage clades for the next job cycles. In model A, none of the three steps is parallelized where a runtime of each cycle is f(n) for the input sequence size of n. In model B, all of the three steps are perfectly parallelized where a runtime of each cycle is linearly reduced by the number of available computing nodes. The denominator of the formula represents the number of available computing nodes, assuming that the smaller the input sequences for a FRACTAL iteration cycle, the higher the likelihood of computing nodes being occupied by the other job processes at the same period. c, Linear regression f(n) using runtime log data of independent FRACTAL iteration cycles any of whose major three steps was not parallelized for the lineage reconstruction of the 235 million sequences.

Extended Data Fig. 3 Robustness of evolutionary lineage reconstruction with sequence data noise.

The scatter plots represent accuracies to reconstruct the SILVA 16S rRNA lineage from the input sequence datasets including different fractions of noise sequences. The noise sequences were generated by shuffling nucleotide positions of randomly selected sequences by keeping their alignment gap positions in the multiple sequence alignment result. The lineage dendrograms represent the ones reconstructed from the sequence dataset with 20% noise using FRACTALized RapidNJ, RAxML and FastTree.

Extended Data Fig. 4 Simulating proliferation of cells with the scalable CRISPR-barcode system.

a, The scGESTALT lineage tracing datasets of zebrafish brain development. BBI scores were obtained for branches that had more than three leaves associated to each of the two descending edges and a total of more than ten leaves associated to both of the two descending edges. b, c, Estimation of the insertion and deletion event probabilities per barcode array per generation in the scGESTALT dataset. The relative insertion and deletion event probabilities across each barcode array position and the probability distributions of insertion and deletion sizes were modeled using those observed in the scGESTALT dataset. The probabilities of the insertion and deletion events per generation for the production of 4,000 cells were fitted to the average total insertion and deletion lengths per barcode array observed in the scGESTALT dataset, respectively. d, Median fractions of base substitution per nucleotide position per generation observed for different secondary scaling factors for base editing in the simulation of producing 4,000 cells. The blue shading represents the 5 to 95 percentile range. e, f, Tree recovery scores for RapidNJ and FRACTALized RapidNJ in reconstructing the lineages of 4,000 cells simulated based on different secondary scaling factors for base editing.

Extended Data Fig. 5 Reconstruction of the BET002 sequence diversification process produced by EP-PCR.

a, Distribution in number of mutations observed in the second-generation sequences (CTNNB1 and BET002). b-g, BET002. b, Distribution of the second-generation sequences across the parental sample wells. The second-generation sequences were assigned based on their best matched parental first-generation sequences. c, The lineage tree of the mutated sequences reconstructed by FRACTAL. The sequences that did not have unique best matched sequences in the expected parental wells were filtered out after the lineage reconstruction. The dendrogram only represents the upstream lineage of the largest clades each composed of less than 15,000 sequences. Number of sequences, proportions of their source sample wells, and entropy of the well proportions are represented for each of the clades. d, A zoom-in diagram of the lineage highlighted by yellow in c. e, Distribution of entropies for the clades each with 1,000 or more sequences. The statistical difference between the entropy distribution and the null distribution given by random sequence-well assignment was tested by two-sided Brunner-Munzel test. f, Reconstructed lineage of the sequences in the clade indicated by the arrow in d. g, Proportions of the parental sequences identified in the control PCR wells and normalized distances of the second-generation sequences in the reconstructed lineage for pairs in the same wells and pairs, one of each is from a different well. Orange dots represent significant differences between the intra- and inter-well distributions (two-sided Brunner-Munzel test with Bonferroni correction; adjusted P-value < 0.05).

Extended Data Fig. 6 Evolutionary lineage reconstruction without preliminary MSA.

a, Sample tree reconstruction and phylogenetic placement of FRACTAL with no preliminary multiple sequence alignment (MSA). A given number of sequences are first randomly subsampled from the input sequences (Step 1). The subsampled sequences are aligned with a common root sequence by MSA using MAFFT (Step 2) and a sample tree is reconstructed by a software tool of choice (Step 3). Each of the remaining input sequences are then independently added to the MSA result by ‘plus-one’ alignment using HMMER (Step 4i) and placed on the sample tree (Step 4ii). b, Accuracies of reconstructing various sizes of clades in the reference lineage of 1,000,000 sequences generated by RNASim. c, Accuracies and coverages of reconstructing the entire lineage of 1,000,000 unaligned sequences by FRACTALized RapidNJ, RAxML and FastTree with 100 computing nodes (five trials). d, Time series for the numbers of computing jobs and waiting jobs observed in the reconstruction of the simulated lineage of RNA evolution using 100 computing nodes for FRACTALization.

Extended Data Fig. 7 Lineages of CTNNB1 and BET002 datasets before filtering out the potential artifact sequences.

a, b, Reconstructed lineages. The dendrogram only represents the upstream lineage of the largest clades each composed of less than 15,000 sequences. Number of sequences, proportions of their source sample wells, and entropy of the well proportions are represented for each of the clades. a, CTNNB1. b, BET002. c, d, Distribution of entropies for the clades with 1,000 or more sequences. c, CTNNB1. d, BET002. The statistical differences between the entropy distributions and the null distributions given by random sequence-well assignment were tested by two-sided Brunner-Munzel test. e, f, Unique read counts of the second-generation sequences uniquely best-matched to single parental sequences in the expected and unexpected wells and those best-matched to multiple parental sequences. The ones uniquely best-matched to single parental sequences are color-coded according to the parental wells. The second-generation sequences best-matched to single parental sequences of unexpected wells can be assumed to be cross-contaminants derived during the second EP-PCR and the following steps. The second-generation sequences redundantly best-matched to multiple parental sequences can be assumed to have parental sequences that were either cross-contaminated before the second EP-PCR or were conferred an insufficient number of mutations. e, CTNNB1. f, BET002.

Supplementary information

Supplementary Information

Supplementary Notes 1–4 and Figs. 1–11.

Reporting Summary.

Supplementary Table 1

Supplementary Tables 1–5.

Supplementary Data 1

Supplementary Data 1–5.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Konno, N., Kijima, Y., Watano, K. et al. Deep distributed computing to reconstruct extremely large lineage trees. Nat Biotechnol 40, 566–575 (2022). https://doi.org/10.1038/s41587-021-01111-2

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1038/s41587-021-01111-2

This article is cited by

Search

Quick links

Nature Briefing: Translational Research

Sign up for the Nature Briefing: Translational Research newsletter — top stories in biotechnology, drug discovery and pharma.

Get what matters in translational research, free to your inbox weekly. Sign up for Nature Briefing: Translational Research