Deep distributed computing to reconstruct extremely large lineage trees

Konno, Naoki; Kijima, Yusuke; Watano, Keito; Ishiguro, Soh; Ono, Keiichiro; Tanaka, Mamoru; Mori, Hideto; Masuyama, Nanami; Pratt, Dexter; Ideker, Trey; Iwasaki, Wataru; Yachie, Nozomu

doi:10.1038/s41587-021-01111-2

Article
Published: 06 January 2022

Deep distributed computing to reconstruct extremely large lineage trees

Naoki Konno^1,2,3,
Yusuke Kijima^1,4,5^na1,
Keito Watano^1,2,3^na1,
Soh Ishiguro^1,5^na1,
Keiichiro Ono⁶,
Mamoru Tanaka¹,
Hideto Mori^1,7,8,
Nanami Masuyama^1,5,7,8,
Dexter Pratt ORCID: orcid.org/0000-0002-1471-9513⁶,
Trey Ideker ORCID: orcid.org/0000-0002-1708-8454^6,9,
Wataru Iwasaki ORCID: orcid.org/0000-0002-9169-9245^2,3 &
…
Nozomu Yachie ORCID: orcid.org/0000-0003-1582-6027^1,5,7,8

Nature Biotechnology volume 40, pages 566–575 (2022)Cite this article

11k Accesses
8 Citations
154 Altmetric
Metrics details

Subjects

Abstract

Phylogeny estimation (the reconstruction of evolutionary trees) has recently been applied to CRISPR-based cell lineage tracing, allowing the developmental history of an individual tissue or organism to be inferred from a large number of mutated sequences in somatic cells. However, current computational methods are not able to construct phylogenetic trees from extremely large numbers of input sequences. Here, we present a deep distributed computing framework to comprehensively trace accurate large lineages (FRACTAL) that substantially enhances the scalability of current lineage estimation software tools. FRACTAL first reconstructs only an upstream lineage of the input sequences and recursively iterates the same produce for its downstream lineages using independent computing nodes. We demonstrate the utility of FRACTAL by reconstructing lineages from >235 million simulated sequences and from >16 million cells from a simulated experiment with a CRISPR system that accumulates mutations during cell proliferation. We also successfully applied FRACTAL to evolutionary tree reconstructions and to an experiment using error-prone PCR (EP-PCR) for large-scale sequence diversification.

Access through your institution

Buy or subscribe

This is a preview of subscription content, access via your institution

Access options

Access through your institution

Buy this article

Purchase on Springer Link
Instant access to full article PDF

Buy now

Prices may be subject to local taxes which are calculated during checkout

**Fig. 2: Lineage reconstruction of over 235 million sequences.**

**Fig. 3: Reconstruction of evolutionary and pseudo-evolutionary trees.**

**Fig. 4: FRACTAL for large-scale cell lineage tracing.**

**Fig. 5: Reconstruction of a sequence diversification process produced by EP-PCR.**

Reconstructing phylogenetic trees from genome-wide somatic mutations in clonal samples

Article 23 February 2024

Single-cell lineage tracing by integrating CRISPR-Cas9 mutations with transcriptomic data

Article Open access 16 June 2020

LinRace: cell division history reconstruction of single cells using paired lineage barcode and gene expression data

Article Open access 16 December 2023

Data availability

All of the simulated sequences and their lineages generated by PRESUME are available via the URLs given in Supplementary Table 2. The high-throughput sequencing data produced in this work are available at the NCBI Sequence Read Archive (PRJNA675984). P values obtained in this study are listed in Supplementary Data 5.

Code availability

The source codes and manuals of FRACTAL and PRESUME are available at GitHub (https://github.com/yachielab/FRACTAL and https://github.com/yachielab/PRESUME, respectively). FRACTAL and PRESUME can also be modified and executed on a small scale on web browsers through the Code Ocean, a cloud-based code sharing platform (https://doi.org/10.24433/CO.2433997.v1 and https://doi.org/10.24433/CO.3922773.v1, respectively). Note that the distributed computing mode is disabled in the Code Ocean because the Univa Grid Engine (UGE) is not supported in the Code Ocean. The other codes used in this study are also all available at GitHub (for the list, see Supplementary Table 5).

References

Zou, Q., Wan, S., Zeng, X. & Ma, Z. S. Reconstructing evolutionary trees in parallel for massive sequences. BMC Syst. Biol. 11, 100 (2017).
Article CAS PubMed PubMed Central Google Scholar
Mora, C., Tittensor, D. P., Adl, S., Simpson, A. G. & Worm, B. How many species are there on Earth and in the ocean? PLoS Biol. 9, e1001127 (2011).
Article CAS PubMed PubMed Central Google Scholar
Cong, L. et al. Multiplex genome engineering using CRISPR/Cas systems. Science 339, 819–823 (2013).
Article CAS PubMed PubMed Central Google Scholar
Mali, P. et al. RNA-guided human genome engineering via Cas9. Science 339, 823–826 (2013).
Article CAS PubMed PubMed Central Google Scholar
Kalhor, R. et al. Developmental barcoding of whole mouse via homing CRISPR. Science 361, eaat9804 (2018).
Article CAS PubMed PubMed Central Google Scholar
Chan, M. M. et al. Molecular recording of mammalian embryogenesis. Nature 570, 77–82 (2019).
Article CAS PubMed PubMed Central Google Scholar
Bowling, S. et al. An engineered CRISPR–Cas9 mouse line for simultaneous readout of lineage histories and gene expression profiles in single cells. Cell 181, 1410–1422 (2020).
Article CAS PubMed PubMed Central Google Scholar
Salvador-Martinez, I., Grillo, M., Averof, M. & Telford, M. J. Is it possible to reconstruct an accurate cell lineage using CRISPR recorders? eLife 8, e40292 (2019).
Article PubMed PubMed Central Google Scholar
McKenna, A. et al. Whole-organism lineage tracing by combinatorial and cumulative genome editing. Science 353, aaf7907 (2016).
Article CAS PubMed PubMed Central Google Scholar
Raj, B. et al. Simultaneous single-cell profiling of lineages and cell types in the vertebrate brain. Nat. Biotechnol. 36, 442–450 (2018).
Article CAS PubMed PubMed Central Google Scholar
Spanjaard, B. et al. Simultaneous lineage tracing and cell-type identification using CRISPR–Cas9-induced genetic scars. Nat. Biotechnol. 36, 469–473 (2018).
Article CAS PubMed PubMed Central Google Scholar
Alemany, A., Florescu, M., Baron, C. S., Peterson-Maduro, J. & van Oudenaarden, A. Whole-organism clone tracing using single-cell sequencing. Nature 556, 108–112 (2018).
Article CAS PubMed Google Scholar
Quinn, J. J. et al. Single-cell lineages reveal the rates, routes, and drivers of metastasis in cancer xenografts. Science 371, eabc1944 (2021).
Article CAS PubMed PubMed Central Google Scholar
Simeonov, K. P. et al. Single-cell lineage and transcriptome reconstruction of metastatic cancer reveals selection of aggressive hybrid EMT states. Cancer Cell 39, 1150–1162.e9 (2021).
Article CAS PubMed PubMed Central Google Scholar
Cao, J. et al. Comprehensive single-cell transcriptional profiling of a multicellular organism. Science 357, 661–667 (2017).
Article CAS PubMed PubMed Central Google Scholar
Cao, J. et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature 566, 496–502 (2019).
Article CAS PubMed PubMed Central Google Scholar
Cao, J. et al. A human cell atlas of fetal gene expression. Science 370, eaba7721 (2020).
Article CAS PubMed PubMed Central Google Scholar
Sender, R., Fuchs, S. & Milo, R. Revised estimates for the number of human and bacteria cells in the body. PLoS Biol. 14, e1002533 (2016).
Article CAS PubMed PubMed Central Google Scholar
Barbera, P. et al. EPA-ng: massively parallel evolutionary placement of genetic sequences. Syst. Biol. 68, 365–369 (2019).
Article PubMed Google Scholar
Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313 (2014).
Article CAS PubMed PubMed Central Google Scholar
Simonsen, M., Mailund, T. & Pedersen, C. N. S. in International Workshop on Algorithms in Bioinformatics 113–122 (Springer, 2008).
Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2—approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).
Article CAS PubMed PubMed Central Google Scholar
Robinson, D. F. & Foulds, L. R. Comparison of phylogenetic trees. Math. Biosci. 53, 131–147 (1981).
Article Google Scholar
Yarza, P. et al. The All-Species Living Tree project: a 16S rRNA-based phylogenetic tree of all sequenced type strains. Syst. Appl. Microbiol. 31, 241–250 (2008).
Article CAS PubMed Google Scholar
Parks, D. H. et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat. Biotechnol. 36, 996–1004 (2018).
Article CAS PubMed Google Scholar
Frieda, K. L. et al. Synthetic recording and in situ readout of lineage information in single cells. Nature 541, 107–111 (2017).
Article CAS PubMed Google Scholar
Jones, M. G. et al. Inference of single-cell phylogenies from lineage tracing data using Cassiopeia. Genome Biol. 21, 92 (2020).
Article PubMed PubMed Central Google Scholar
Nishida, K. et al. Targeted nucleotide editing using hybrid prokaryotic and vertebrate adaptive immune systems. Science 353, aaf8729 (2016).
Article CAS PubMed Google Scholar
Hwang, B. et al. Lineage tracing using a Cas9-deaminase barcoding system targeting endogenous L1 elements. Nat. Commun. 10, 1234 (2019).
Article CAS PubMed PubMed Central Google Scholar
Grünewald, J. et al. A dual-deaminase CRISPR base editor enables concurrent adenine and cytosine editing. Nat. Biotechnol. 38, 861–864 (2020).
Article CAS PubMed PubMed Central Google Scholar
Zhang, X. et al. Dual base editor catalyzes both cytosine and adenine base conversions in human cells. Nat. Biotechnol. 38, 856–860 (2020).
Article CAS PubMed Google Scholar
Sakata, R. C. et al. Base editors for simultaneous introduction of C-to-T and A-to-G mutations. Nat. Biotechnol. 38, 865–869 (2020).
Article CAS PubMed Google Scholar
Du, Z., Santella, A., He, F., Tiongson, M. & Bao, Z. De novo inference of systems-level mechanistic models of development from live-imaging-based phenotype analysis. Cell 156, 359–372 (2014).
Article CAS PubMed PubMed Central Google Scholar
Ciccarelli, F. D. et al. Toward automatic reconstruction of a highly resolved tree of life. Science 311, 1283–1287 (2006).
Article CAS PubMed Google Scholar
Brown, C. T. et al. Unusual biology across a group comprising more than 15% of domain Bacteria. Nature 523, 208–211 (2015).
Article CAS PubMed Google Scholar
Poe, S. & Swofford, D. L. Taxon sampling revisited. Nature 398, 299–300 (1999).
Article CAS PubMed Google Scholar
Chow, K. K. et al. Imaging cell lineage with a synthetic digital recording system. Science 372, eabb3099 (2021).
Klein, A. M. et al. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell 161, 1187–1201 (2015).
Article CAS PubMed PubMed Central Google Scholar
Macosko, E. Z. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell 161, 1202–1214 (2015).
Article CAS PubMed PubMed Central Google Scholar
Yu, M. K. et al. DDOT: a Swiss army knife for investigating data-driven biological ontologies. Cell Syst. 8, 267–273 (2019).
Article CAS PubMed PubMed Central Google Scholar
Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).
Article CAS PubMed PubMed Central Google Scholar
Capella-Gutiérrez, S., Silla-Martínez, J. M. & Gabaldón, T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972–1973 (2009).
Article CAS PubMed PubMed Central Google Scholar
Shen, W., Le, S., Li, Y. & Hu, F. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS ONE 11, e0163962 (2016).
Article CAS PubMed PubMed Central Google Scholar
Schliep, K. P. phangorn: phylogenetic analysis in R. Bioinformatics 27, 592–593 (2010).
Article CAS PubMed PubMed Central Google Scholar
Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
Article CAS PubMed Google Scholar
Madeira, F. et al. The EMBL-EBI search and sequence analysis tools APIs in 2019. Nucleic Acids Res. 47, W636–W641 (2019).
Article CAS PubMed PubMed Central Google Scholar
Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).
Article CAS PubMed PubMed Central Google Scholar
Baum, B. R. PHYLIP: phylogeny inference package. Version 3.2. Quarterly Review of Biology 64, 539–541 (1989).
Article Google Scholar
Zhao, L., Liu, Z., Levy, S. F. & Wu, S. Bartender: a fast and accurate clustering algorithm to count barcode reads. Bioinformatics 34, 739–747 (2018).
Article CAS PubMed Google Scholar
Satopaa, V., Albrecht, J., Irwin, D. & Raghavan, B. in 2011 31st International Conference on Distributed Computing Systems Workshops 166–171 (IEEE, 2011).
Levenshtein, V. I. in Soviet Physics Doklady, Vol. 10 707–710 (Doklady Akademii Nauk SSSR, 1966).
Brunner, E. & Munzel, U. The nonparametric Behrens–Fisher problem: asymptotic theory and a small-sample approximation. Biom. J. 42, 17–25 (2000).
Article Google Scholar

Download references

Acknowledgements

We thank members of the Yachie lab for valuable discussions and critical assessment of the work, especially A. Adel, S. King and S. Okawa for reviewing the manuscript. We also thank C. de Hoog for providing comments on the manuscript. Deep sequencing was performed with the support of H. Aburatani. This study was supported by the Canada Research Chair program (by the Canadian Institutes for Health Research), the Japan Science and Technology Agency (JST) PRESTO program (10814), the Japan Agency for Medical Research and Development (AMED) PRIME program (20gm6110007), the Shimadzu Science and Technology Foundation, the Naito Foundation, the Nakajima Foundation and the Asahi Glass Foundation (all to N.Y.). Y.K., S.I., H.M. and N.M. were supported by Japan Society for the Promotion of Science (JSPS) Research Fellowships. High-performance computing experiments were performed using the SHIROKANE Supercomputer at the University of Tokyo Human Genome Center or the NIG Supercomputer System at the National Institute of Genetics.

Author information

These authors contributed equally: Yusuke Kijima, Keito Watano, Soh Ishiguro.

Authors and Affiliations

Synthetic Biology Division, Research Center for Advanced Science and Technology, The University of Tokyo, Tokyo, Japan
Naoki Konno, Yusuke Kijima, Keito Watano, Soh Ishiguro, Mamoru Tanaka, Hideto Mori, Nanami Masuyama & Nozomu Yachie
Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Tokyo, Japan
Naoki Konno, Keito Watano & Wataru Iwasaki
Department of Integrated Biosciences, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Japan
Naoki Konno, Keito Watano & Wataru Iwasaki
Department of Aquatic Bioscience, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo, Japan
Yusuke Kijima
School of Biomedical Engineering, Faculty of Applied Science and Faculty of Medicine, The University of British Columbia, Vancouver, British Columbia, Canada
Yusuke Kijima, Soh Ishiguro, Nanami Masuyama & Nozomu Yachie
Department of Medicine, University of California, San Diego, La Jolla, CA, USA
Keiichiro Ono, Dexter Pratt & Trey Ideker
Institute for Advanced Biosciences, Keio University, Tsuruoka, Japan
Hideto Mori, Nanami Masuyama & Nozomu Yachie
Graduate School of Media and Governance, Keio University, Fujisawa, Japan
Hideto Mori, Nanami Masuyama & Nozomu Yachie
Departments of Bioengineering and Computer Science, University of California San Diego, La Jolla, CA, USA
Trey Ideker

Authors

Naoki Konno
View author publications
You can also search for this author in PubMed Google Scholar
Yusuke Kijima
View author publications
You can also search for this author in PubMed Google Scholar
Keito Watano
View author publications
You can also search for this author in PubMed Google Scholar
Soh Ishiguro
View author publications
You can also search for this author in PubMed Google Scholar
Keiichiro Ono
View author publications
You can also search for this author in PubMed Google Scholar
Mamoru Tanaka
View author publications
You can also search for this author in PubMed Google Scholar
Hideto Mori
View author publications
You can also search for this author in PubMed Google Scholar
Nanami Masuyama
View author publications
You can also search for this author in PubMed Google Scholar
Dexter Pratt
View author publications
You can also search for this author in PubMed Google Scholar
Trey Ideker
View author publications
You can also search for this author in PubMed Google Scholar
Wataru Iwasaki
View author publications
You can also search for this author in PubMed Google Scholar
Nozomu Yachie
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

N.K., H.M. and N.Y. conceived the high-level concept of FRACTAL. N.K., W.I. and N.Y. designed the study. N.K. implemented FRACTAL. K.W. and N.K. implemented PRESUME. N.K. led the analyses. N.M. and N.Y. designed the high-content cell lineage recording model. Y.K. and K.W. supported the analyses of the simulated cell lineages. S.I. and M.T. performed the EP-PCR experiments. Y.K. supported the analysis of the EP-PCR experiments. N.K., K.O., D.P. and T.I. performed data visualization using HiView. N.K., Y.K., S.I. and N.Y. wrote the manuscript.

Corresponding author

Correspondence to Nozomu Yachie.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 PRESUME.

a, Schematic diagram of PRESUME. b, Various datasets of 32,768 sequences generated by PRESUME with different mutational parameters μ and α. The topological parameter σ was fixed to 0 (perfectly balanced sequence diversification). The diversity of nucleotide letters across different sequence positions is represented by a bit score distribution and by a sequence logo for the first 15 nt of the generated sequences. c, Partial diagrams of representative trees generated by PRESUME with different topological parameters σ.

Extended Data Fig. 2 Runtime simulation of FRACTAL.

a, Conceptual diagram of the runtime simulation of FRACTAL. Under a distributed computing environment with a fixed number of available nodes d, each FRACTAL job of input sequence size n starts if there is one or more available free computing node; otherwise, it is stalled until a node is released from one of the ongoing jobs. Each FRACTAL job process is modeled to occupy one computing node for a runtime f(n) and produce two new child jobs for the next job cycles each with an input sequence size of ⌊n/2⌋. When the input sequence size is under a certain threshold (n ≤ t), the job terminates after occupying one node for a runtime f(n) assuming the terminal lineage reconstruction process. b, Two implementation models of FRACTAL. In FRACTAL, each job cycle contains three steps that require high computing loads: sequence subsampling from the entire input sequences, phylogenetic placement, and sorting of sequences mapped on each of the sample lineage clades for the next job cycles. In model A, none of the three steps is parallelized where a runtime of each cycle is f(n) for the input sequence size of n. In model B, all of the three steps are perfectly parallelized where a runtime of each cycle is linearly reduced by the number of available computing nodes. The denominator of the formula represents the number of available computing nodes, assuming that the smaller the input sequences for a FRACTAL iteration cycle, the higher the likelihood of computing nodes being occupied by the other job processes at the same period. c, Linear regression f(n) using runtime log data of independent FRACTAL iteration cycles any of whose major three steps was not parallelized for the lineage reconstruction of the 235 million sequences.

Extended Data Fig. 3 Robustness of evolutionary lineage reconstruction with sequence data noise.

The scatter plots represent accuracies to reconstruct the SILVA 16S rRNA lineage from the input sequence datasets including different fractions of noise sequences. The noise sequences were generated by shuffling nucleotide positions of randomly selected sequences by keeping their alignment gap positions in the multiple sequence alignment result. The lineage dendrograms represent the ones reconstructed from the sequence dataset with 20% noise using FRACTALized RapidNJ, RAxML and FastTree.

Extended Data Fig. 4 Simulating proliferation of cells with the scalable CRISPR-barcode system.

a, The scGESTALT lineage tracing datasets of zebrafish brain development. BBI scores were obtained for branches that had more than three leaves associated to each of the two descending edges and a total of more than ten leaves associated to both of the two descending edges. b, c, Estimation of the insertion and deletion event probabilities per barcode array per generation in the scGESTALT dataset. The relative insertion and deletion event probabilities across each barcode array position and the probability distributions of insertion and deletion sizes were modeled using those observed in the scGESTALT dataset. The probabilities of the insertion and deletion events per generation for the production of 4,000 cells were fitted to the average total insertion and deletion lengths per barcode array observed in the scGESTALT dataset, respectively. d, Median fractions of base substitution per nucleotide position per generation observed for different secondary scaling factors for base editing in the simulation of producing 4,000 cells. The blue shading represents the 5 to 95 percentile range. e, f, Tree recovery scores for RapidNJ and FRACTALized RapidNJ in reconstructing the lineages of 4,000 cells simulated based on different secondary scaling factors for base editing.

Extended Data Fig. 5 Reconstruction of the BET002 sequence diversification process produced by EP-PCR.

a, Distribution in number of mutations observed in the second-generation sequences (CTNNB1 and BET002). b-g, BET002. b, Distribution of the second-generation sequences across the parental sample wells. The second-generation sequences were assigned based on their best matched parental first-generation sequences. c, The lineage tree of the mutated sequences reconstructed by FRACTAL. The sequences that did not have unique best matched sequences in the expected parental wells were filtered out after the lineage reconstruction. The dendrogram only represents the upstream lineage of the largest clades each composed of less than 15,000 sequences. Number of sequences, proportions of their source sample wells, and entropy of the well proportions are represented for each of the clades. d, A zoom-in diagram of the lineage highlighted by yellow in c. e, Distribution of entropies for the clades each with 1,000 or more sequences. The statistical difference between the entropy distribution and the null distribution given by random sequence-well assignment was tested by two-sided Brunner-Munzel test. f, Reconstructed lineage of the sequences in the clade indicated by the arrow in d. g, Proportions of the parental sequences identified in the control PCR wells and normalized distances of the second-generation sequences in the reconstructed lineage for pairs in the same wells and pairs, one of each is from a different well. Orange dots represent significant differences between the intra- and inter-well distributions (two-sided Brunner-Munzel test with Bonferroni correction; adjusted P-value < 0.05).

Extended Data Fig. 6 Evolutionary lineage reconstruction without preliminary MSA.

a, Sample tree reconstruction and phylogenetic placement of FRACTAL with no preliminary multiple sequence alignment (MSA). A given number of sequences are first randomly subsampled from the input sequences (Step 1). The subsampled sequences are aligned with a common root sequence by MSA using MAFFT (Step 2) and a sample tree is reconstructed by a software tool of choice (Step 3). Each of the remaining input sequences are then independently added to the MSA result by ‘plus-one’ alignment using HMMER (Step 4i) and placed on the sample tree (Step 4ii). b, Accuracies of reconstructing various sizes of clades in the reference lineage of 1,000,000 sequences generated by RNASim. c, Accuracies and coverages of reconstructing the entire lineage of 1,000,000 unaligned sequences by FRACTALized RapidNJ, RAxML and FastTree with 100 computing nodes (five trials). d, Time series for the numbers of computing jobs and waiting jobs observed in the reconstruction of the simulated lineage of RNA evolution using 100 computing nodes for FRACTALization.

Extended Data Fig. 7 Lineages of CTNNB1 and BET002 datasets before filtering out the potential artifact sequences.

a, b, Reconstructed lineages. The dendrogram only represents the upstream lineage of the largest clades each composed of less than 15,000 sequences. Number of sequences, proportions of their source sample wells, and entropy of the well proportions are represented for each of the clades. a, CTNNB1. b, BET002. c, d, Distribution of entropies for the clades with 1,000 or more sequences. c, CTNNB1. d, BET002. The statistical differences between the entropy distributions and the null distributions given by random sequence-well assignment were tested by two-sided Brunner-Munzel test. e, f, Unique read counts of the second-generation sequences uniquely best-matched to single parental sequences in the expected and unexpected wells and those best-matched to multiple parental sequences. The ones uniquely best-matched to single parental sequences are color-coded according to the parental wells. The second-generation sequences best-matched to single parental sequences of unexpected wells can be assumed to be cross-contaminants derived during the second EP-PCR and the following steps. The second-generation sequences redundantly best-matched to multiple parental sequences can be assumed to have parental sequences that were either cross-contaminated before the second EP-PCR or were conferred an insufficient number of mutations. e, CTNNB1. f, BET002.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Konno, N., Kijima, Y., Watano, K. et al. Deep distributed computing to reconstruct extremely large lineage trees. Nat Biotechnol 40, 566–575 (2022). https://doi.org/10.1038/s41587-021-01111-2

Download citation

Received: 04 April 2020
Accepted: 01 October 2021
Published: 06 January 2022
Issue Date: April 2022
DOI: https://doi.org/10.1038/s41587-021-01111-2

This article is cited by

A statistical method for quantifying progenitor cells reveals incipient cell fate commitments
- Shanjun Deng
- Han Gong
- Xionglei He
Nature Methods (2024)
Maximum likelihood pandemic-scale phylogenetics
- Nicola De Maio
- Prabhav Kalaghatgi
- Nick Goldman
Nature Genetics (2023)
Enabling technology and core theory of synthetic biology
- Xian-En Zhang
- Chenli Liu
- Tong Si
Science China Life Sciences (2023)