Genealogy at the genome scale

Two teams develop methods for genealogy inference using large genomic datasets.

Genealogies describing ancestral relationships among individuals harbor key information about demographic history and evolution. Although trees are widely used to summarize genealogies, recombination shuffles genetic material between chromosomes and generates discordant trees among loci. Encoding the genealogies of a whole genome therefore requires networks of huge size and complexity, which poses a major hurdle for statistical inference from large-scale genomic data.

Several computational methods exist to tackle this challenge, but their daunting computational burden limits the size of the datasets they can handle. Using different computational strategies, two teams from Oxford University independently developed genealogy-inference methods that can analyze large genomic datasets.

Simon Myers and colleagues adopted a two-step approach (called ‘Relate’) that infers the two elements of a genealogy separately: tree topologies, using position-specific distance matrices, and branch lengths, using a Markov chain Monte Carlo (MCMC) algorithm. From the output, quantities of population-genetic and evolutionary significance can be estimated, such as effective population sizes, mutation rates and gene flow through time. The team applied Relate to the 1000 Genomes Project dataset and also developed a test for detecting positive selection. “By building its statistics from the genealogies themselves, this makes the test more powerful and sensitive,” said Leo Speidel, a member of the team, and Myers. “Despite a lack of strong selection at particular loci, we found directional selection acting on complex traits to be extremely common in humans. The complex, differing patterns among groups and therefore through time were a surprise to us, building on existing findings.”
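The first of Relate's two steps, turning a distance matrix into a tree topology, can be illustrated with a small sketch. Relate's own tree-building algorithm is described in the paper and is not reproduced here; the code below swaps in generic average-linkage clustering (UPGMA), and the function name `build_topology`, the labels and the distances are all hypothetical choices made for illustration.

```python
def build_topology(dist, labels):
    """Greedy average-linkage clustering (UPGMA), used here only as a
    stand-in for a distance-based tree builder; it is NOT Relate's algorithm.
    Returns the topology as nested tuples of labels."""
    n = len(labels)
    # Each cluster id maps to (subtree, number of leaves in that subtree).
    clusters = {i: (labels[i], 1) for i in range(n)}
    # Pairwise distances, keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Distance from the merged cluster to every remaining cluster is the
        # size-weighted average of the two old distances.
        merged = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            merged[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # Drop every distance that involved the two merged clusters.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(merged)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Hypothetical distances among four haplotypes at one genomic position.
labels = ["A", "B", "C", "D"]
dist = [
    [0, 2, 6, 6],
    [2, 0, 6, 6],
    [6, 6, 0, 4],
    [6, 6, 4, 0],
]
print(build_topology(dist, labels))  # (('A', 'B'), ('C', 'D'))
```

Because the distance matrices are position-specific, a procedure of this kind would be repeated along the genome, which is how different loci can end up with different topologies.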

Jerome Kelleher and colleagues developed another inference method (‘tsinfer’) based on an efficient data structure called a tree sequence, which exploits the correlation structure between adjacent trees. By inferring ancestral haplotypes and then the copying paths for both the ancestral and the sample haplotypes, tsinfer achieves accuracy comparable to that of state-of-the-art competitors with extraordinary efficiency, as shown by tests using data from the 1000 Genomes Project, the Simons Genome Diversity Project and the UK Biobank.
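How a tree sequence can exploit the similarity of adjacent trees is easiest to see in a toy example. The sketch below is a simplified illustration, not the tskit or tsinfer API: the `Edge` record, the `tree_at` helper, the node numbers and the breakpoint at position 5 are all assumptions. Each parent-child edge is stored once, together with the genomic interval over which it holds, so edges shared by neighboring trees are not duplicated.

```python
from collections import namedtuple

# An edge states that, over the half-open genomic interval [left, right),
# node `child` descends directly from node `parent`.
Edge = namedtuple("Edge", ["left", "right", "parent", "child"])

# Hypothetical toy data: samples 0-3, ancestors 4-7, genome length 10.
# A single recombination at position 5 moves sample 2 to a new branch.
EDGES = [
    Edge(0, 10, 4, 0),  # present in both trees, stored once
    Edge(0, 10, 4, 1),  # present in both trees, stored once
    Edge(0, 5, 5, 2),   # left tree only
    Edge(0, 5, 5, 3),
    Edge(0, 5, 6, 4),
    Edge(0, 5, 6, 5),
    Edge(5, 10, 7, 2),  # right tree only
    Edge(5, 10, 7, 4),
    Edge(5, 10, 6, 7),
    Edge(5, 10, 6, 3),
]


def tree_at(edges, position):
    """Return the local tree at `position` as a child -> parent mapping."""
    return {e.child: e.parent for e in edges if e.left <= position < e.right}


left_tree = tree_at(EDGES, 2)
right_tree = tree_at(EDGES, 7)

# Children that keep the same parent on both sides of the breakpoint.
unchanged = sorted(c for c in left_tree if right_tree.get(c) == left_tree[c])

print(left_tree)
print(right_tree)
print("children with unchanged parent:", unchanged)
print("edges stored:", len(EDGES), "vs", len(left_tree) + len(right_tree),
      "if each tree were stored separately")
```

With only two trees the saving is small, but the same bookkeeping applies along an entire chromosome, where each successive tree typically differs from its neighbor by only a few edges.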

In addition to genealogy inference, “tree sequences, which encode biological history in such fine detail, are also extremely computationally efficient and could help address the spiraling costs of storing and processing DNA sequence data,” says Kelleher. The key idea is that the genetic variation in a population is the result of ancestral mutational events. Storing where mutations occurred in the genealogies therefore gives a full picture of the genetic variation while being much more efficient than storing each sample’s genotype. Along these lines, the team estimates that using tree sequences to store genetic variation can be four orders of magnitude more efficient than the most widely used current format. “To me, this synergy between biology and computing is beautiful and compelling,” says Kelleher.
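The storage idea can be made concrete with a minimal sketch. It assumes a single tree, at most one mutation per site and no back mutation; the tree, the site names and the helper functions are hypothetical and do not reflect the actual tree sequence file format. Each site records only the node above which its mutation arose, and every sample's genotype is recovered by checking whether that node lies on the sample's path to the root.

```python
# One local tree as a child -> parent mapping (hypothetical toy data).
PARENT = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6}
SAMPLES = [0, 1, 2, 3]

# Per site, store a single number: the node on whose branch the mutation arose.
# Assumption: one mutation per site and no back mutation.
MUTATION_NODE = {"site_a": 4, "site_b": 2, "site_c": 5}


def ancestors(node):
    """Yield `node` and then each of its ancestors up to the root."""
    while node is not None:
        yield node
        node = PARENT.get(node)


def genotypes(site):
    """Return 0/1 genotypes for all samples at `site`.

    A sample carries the derived allele (1) exactly when the mutation's node
    is the sample itself or one of its ancestors."""
    mutated = MUTATION_NODE[site]
    return [int(mutated in ancestors(sample)) for sample in SAMPLES]


for site in MUTATION_NODE:
    print(site, genotypes(site))
# site_a [1, 1, 0, 0]
# site_b [0, 0, 1, 0]
# site_c [0, 0, 1, 1]
```

In this sketch each site costs one stored value regardless of the number of samples, whereas a genotype matrix costs one value per sample per site; the tree itself is a shared overhead that is paid once for all sites it covers.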

Regarding next steps, both teams are working to improve their tools in several ways. “We focused initially on achieving scale, and are now working to relax some assumptions and make tsinfer more general and robust,” says Kelleher. “We are continuing to develop tskit (the tree sequence toolkit) and hope that over time an ecosystem of inference methods and analysis tools will grow around this shared, open-source infrastructure,” he adds. Speidel and Myers hope their work might motivate others to think about leveraging tree-based approaches to answer evolutionary questions, noting, “We also see great potential in combining the innovations of Relate and tsinfer.”

Research papers

  1. Speidel, L. et al. A method for genome-wide genealogy estimation for thousands of samples. Nat. Genet. 51, 1321–1329 (2019).


  2. Kelleher, J. et al. Inferring whole-genome histories in large population datasets. Nat. Genet. 51, 1330–1338 (2019).


Author information



Corresponding author

Correspondence to Lin Tang.


Cite this article

Tang, L. Genealogy at the genome scale. Nat Methods 16, 1077 (2019).
