Accurate detection of identity-by-descent segments in human ancient DNA

Long DNA segments shared between two individuals, known as identity-by-descent (IBD), reveal recent genealogical connections. Here we introduce ancIBD, a method for identifying IBD segments in ancient human DNA (aDNA) using a hidden Markov model and imputed genotype probabilities. We demonstrate that ancIBD accurately identifies IBD segments >8 cM for aDNA data with an average depth of >0.25× for whole-genome sequencing or >1× for 1240k single nucleotide polymorphism capture data. Applying ancIBD to 4,248 ancient Eurasian individuals, we identify relatives up to the sixth degree and genealogical connections between archaeological groups. Notably, we reveal long IBD sharing between Corded Ware and Yamnaya groups, indicating that the Yamnaya herders of the Pontic-Caspian Steppe and the Steppe-related ancestry in various European Corded Ware groups share substantial co-ancestry within only a few hundred years. These results show that detecting IBD segments can generate powerful insights into the growing aDNA record, both on a small scale relevant to life stories and on a large scale relevant to major cultural-historical events.

Some pairs of individuals share long, nearly identical genomic segments, so-called IBD segments, that must be co-inherited from a recent common ancestor because recombination during each meiosis leads to the rapid break-up of these segments. Consequently, long IBD segments provide an ideal signal to probe recent genealogical connections and have been used as a distinctive signal for a range of downstream applications such as identifying biological relatives or inferring recent demography [1][2][3]. Several existing methods identify IBD segments for single nucleotide polymorphism (SNP) array or whole-genome sequence data [4][5][6] but they require confident diploid genotype calls. These are not achievable for most human aDNA data because of too low genomic coverage (<5× average coverage per site) and comparably high error rates due to degraded and short DNA molecules. So far only a few exceptional applications of IBD to comparably high-quality aDNA have been published 7,8. First efforts to identify IBD on the basis of imputed data have been fruitful [9][10][11][12] but those require higher coverage not routinely available for aDNA. Importantly, they do not include a systematic evaluation of the IBD calling pipelines, a critical task given that IBD detection accuracy is expected to decay for short segments and low-coverage data. Practical downstream applications, such as demographic modelling, require information about power, length biases and false positive rates either to account directly for these error processes or to identify thresholds of data quality.
Here, we present and systematically evaluate ancIBD, a method to detect IBD segments in human aDNA data. In brief, ancIBD starts from phased genotype likelihoods imputed by GLIMPSE 13, which are then screened using a hidden Markov model (HMM) to infer IBD blocks (Fig. 1). We then identified default parameters that optimize performance on so-called 1240k capture data. This set of ~1.1 million autosomal SNPs is targeted by in-solution enrichment experiments that have produced more than 70% of genome-wide human aDNA datasets to date [14][15][16]. Our tests show that ancIBD robustly identifies IBD longer than 8 cM in aDNA data: for SNP capture with at least 1× average coverage depth (calculated on SNP target) and for whole-genome sequencing (WGS) as low as 0.25× average genomic coverage.

Identifying IBD with ancIBD
Our method consists of two computational steps (Fig. 1b). In a preprocessing step, the aDNA data are first computationally imputed and phased using a modern reference haplotype panel. In the main step, we apply a custom HMM to identify IBD segments.
For the preprocessing, we use imputation software that has been shown to work well for low-coverage data, GLIMPSE 13, which we apply to aligned sequence data (in .bam format) to impute genotype likelihoods at the 1240k sites, using haplotypes in the 1000 Genome Project as the reference panel 17. Our full imputation pipeline is described in Supplementary Note 3. Previous evaluation of imputing aDNA data this way showed that imputed common variants, which are highly informative about IBD sharing, are of good quality down to mean coverage depth as low as 0.5-1.0× (refs. 18,19).
The details of the main ancIBD HMM are described in Methods. Briefly, the HMM is based on a total of five hidden states, where one state models non-IBD and four states the possible ways of IBD sharing between two phased genomes (Fig. 1a). The emission probabilities are based on the imputed posterior genotype probability and phasing. The standard forward-backward algorithm 20 yields the posterior probability of being in one of the four IBD states, which is postprocessed to obtain the final IBD segment calls.

Evaluating ancIBD
We performed two sets of experiments to evaluate the quality of IBD calls of ancIBD at various sequencing depths. First, we copied IBD segments of known length into pairs of genomes (Methods). Second, we downsampled high-coverage empirical aDNA data.
Performance on copied-in IBD segments. When applying ancIBD to the simulated data with copied-in IBD (simulation procedures are described in Supplementary Note 2 and visualized in Extended Data Fig. 1), we observed that the inferred IBD segments remain accurate and that their length distribution peaks around the true value for WGS data down to about 0.25× coverage and for 1240k capture data down to 1× coverage at 1240k sites (Fig. 2). We found that ancIBD on average overestimates the length of IBD segments but at the recommended coverage cutoff the length errors remain within ~1 cM (Extended Data Tables 1 and 2).
Fig. 2 | Performance of ancIBD on simulated IBD segments. a, Power and segment length errors. We copied in IBD segments of lengths 4, 8, 12, 16 and 20 cM into synthetic diploid samples. We simulated shotgun-like and 1240k-like data (Supplementary Note 2) and visualize false positives, power and length bias for 2×, 1×, 0.5× and 0.25× coverage (rows). For each parameter set and IBD length, we simulated 500 replicates of pairs of chromosome 3, each pair with a single, randomly placed, copied-in IBD segment. The power (or recall) of detecting IBD segments of each simulated length is indicated in the text next to the corresponding grey vertical bar. Results for other coverages are shown in Supplementary Fig. 4. b, False positive rate. We downsampled high-quality empirical aDNA data without IBD segments (Supplementary Table 6) to establish false positive rates of IBD segments for various coverage and IBD lengths (Supplementary Note 7). The y axis shows the mean number of false positive IBD segments per pair of chromosome 3 in each length bin (bin width 0.25 cM). To contextualize these false positive rates, we also depict expected IBD sharing assuming various constant population sizes (dotted lines, calculated as described in ref. 58). If the false positive rate is on a similar order of magnitude or larger than expected for a population of that effective population size (Ne), individual IBD calls of that length for that coverage and demographic scenario are likely to be false positives.
Performance on downsampled aDNA data. To assess performance on downsampled empirical aDNA data, we used four high-coverage genomes of ancient individuals, all ~5,000 years old and associated with the Southern Siberian Afanasievo culture (Supplementary Note 5) 21. When comparing the IBD calls in the downsampled data to the IBD calls of the original high-coverage data, we found that WGS substantially outperforms 1240k data of the same coverage. For long IBD segments (>10 cM) that are particularly informative when detecting relatives, ancIBD achieves high precision and recall (>90%) for all coverages tested here (WGS data 0.1× to 5×; 1240k data 0.5× to 2×). For intermediate range segments (8-10 cM), ancIBD maintains reasonable recall (~80%) at all coverages while having less than 80% precision at 0.5× for 1240k data. Overall, ancIBD yields accurate IBD calling (~90% or higher precision) at >0.25× WGS data and >1× 1240k data (Extended Data Fig. 2).
Comparing to other methods. Several recent publications have applied software designed to detect IBD in high-quality present-day data to imputed aDNA data (for example, using GLIMPSE) 9,10. We compared the performance of ancIBD to such methods, using the downsampled empirical aDNA data described above.
Software tools to call IBD can be classified into two categories: those that require prior phasing and those that use unphased data as input. The former search for long, identical haplotypes, while the latter primarily use, directly or implicitly, the signal of 'opposing homozygotes' (two samples being homozygous for different alleles), which are lacking in IBD segments.
In preliminary tests, we found that methods that require phasing information have very low power to detect IBD in imputed aDNA data, potentially because of high switch error rates in imputed ancient genomes 19, which are an order of magnitude higher than what is attainable for phasing Biobank-scale modern data 22.
Therefore, we focus our detailed comparison on two methods that do not require phasing information, IBIS 23 and IBDseq 24. IBIS detects IBD segments by screening for genomic regions with few opposing homozygotes. Our results on downsampled aDNA data show that this method mostly maintains higher precision at the expense of a lower recall, particularly at lower coverages. Despite keeping precision at >90% for segments >8 cM, IBIS recall drops to ~50% for ~1× 1240k data (Extended Data Fig. 2).
IBDseq was designed for WGS data. It works by computing likelihood ratios of IBD and non-IBD states for each marker and then identifies IBD segments by searching for regions with high cumulative scores. Our results on downsampled empirical aDNA data indicate that precision and recall of IBDseq drop substantially at lower coverages, achieving <50% precision for ~1× 1240k data, a coverage regime typical for most aDNA samples (Supplementary Figs. 16 and 17).
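The 'opposing homozygotes' signal used by unphased methods can be illustrated with a minimal sketch. It is not the IBIS or IBDseq implementation: the genotype coding (alternative-allele counts 0, 1, 2, with None for missing calls), the function names and the toy genetic map are all assumptions made for illustration. The sketch lists sites where two samples are homozygous for different alleles and reports the longest map interval free of such sites, which is the kind of region an unphased method would consider as candidate IBD.

```python
from typing import List, Optional, Sequence


def opposing_homozygotes(g1: Sequence[Optional[int]],
                         g2: Sequence[Optional[int]]) -> List[int]:
    """Indices of SNPs where the two samples are homozygous for different alleles.

    Genotypes are alternative-allele counts (0, 1 or 2); None marks a missing
    call. This coding is an assumption of the sketch.
    """
    return [i for i, (a, b) in enumerate(zip(g1, g2))
            if a is not None and b is not None and {a, b} == {0, 2}]


def longest_free_run_cm(map_cm: Sequence[float], opp_idx: Sequence[int]) -> float:
    """Longest genetic-map interval (in cM) that contains no opposing homozygote."""
    bounds = [map_cm[0]] + [map_cm[i] for i in opp_idx] + [map_cm[-1]]
    return max(right - left for left, right in zip(bounds, bounds[1:]))


# Toy data: six SNPs with hypothetical map positions in cM.
map_cm = [0.0, 1.0, 2.5, 4.0, 9.0, 12.0]
g1 = [0, 2, 1, 0, 2, 1]
g2 = [2, 2, 0, 0, 0, 1]

opp = opposing_homozygotes(g1, g2)
print(opp)                               # sites 0 and 4 are opposing homozygotes
print(longest_free_run_cm(map_cm, opp))  # longest interval without such sites
```

In imputed low-coverage data, genotype errors create spurious opposing homozygotes inside true IBD segments, which is one plausible reading of why recall drops for such methods at low coverage.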

Detecting close and distant relatives with ancIBD
To showcase the utility of IBD segments to detect biological relatives, we applied ancIBD to a set of 4,248 published ancient Eurasian individuals. Sample quality filtering and downstream bioinformatic processing are described in Methods. When plotting the total sum and the total count of IBD segments longer than 12 cM, we find that the pattern of IBD sharing (Fig. 3a) closely mirrors simulated IBD sharing between various degrees of relatives (using the software ped-sim 25) (Fig. 3b). A first-degree relative cluster becomes apparent, with a parent-offspring cluster (where the whole genome is in IBD) and a full-sibling cluster. The parent-offspring cluster in the simulated IBD dataset consists of one point, as expected because parent and offspring share each of the 22 chromosomes fully IBD. In the inferred IBD dataset, the apparent parent-offspring cluster is spread out more widely, including also pairs with more than 22 IBD segments. The reason is that very long IBD segments are sporadically broken up by artificial gaps and, if these gaps are too big, they are not merged by the default gap merging of ancIBD. Overall this effect remains modest and in the parent-offspring cluster the total number of inferred IBD segments is in most cases only slightly elevated beyond the expected 22.
Fig. 3 | [...] The plot visualizes both the count (y axis) as well as the summed length (x axis) of all IBD >12 cM long. For comparison, we colour-code pairs on the basis of relatedness estimates from pairwise mismatch rates (PMR) that can detect up to third-degree relatives (Supplementary Note 9). We also annotate new relatives found by ancIBD, indicated by at least three very long IBD segments (>20 cM) typical of up to sixth-degree relatives. b, Simulated IBD among pairs of relatives. For each relative class, we simulated 100 replicates using the software ped-sim 25, as described in Supplementary Note 8. As in a, we depict the summed length and the count of all IBD at least 12 cM long. c, Inferred IBD among four ancient English Neolithic individuals, who lived about 5,700 years ago and were entombed at Hazleton North long cairn. A full pedigree was previously reconstructed using first- and second-degree relatives inferred using pairwise SNP matching rates 26. We depict all IBD at least 12 cM long. The four individuals were genotyped using 1240k aDNA capture (I12438, 3.7× average coverage on target; I12440, 2.1×; I13896, 1.1×; I12439, 6.7×).
Further, we observe two clear second-degree relative clusters that correspond to biological grandparent-grandchild and aunt/uncle-niece/nephew relationships. Half-siblings are expected to form a gradient between these two clusters, with their average position depending on whether the shared parent is maternal (on average more but shorter shared segments) or paternal (fewer but longer shared segments) 25.
In the simulated data, IBD clusters for third-degree and more distant relatives increasingly overlap (Fig. 3b) and the empirical IBD distribution follows this gradient (Fig. 3a). Owing to this biological variation in genetic relatedness, it is not possible to uniquely assign individuals to specific relative clusters beyond third-degree relatives even if the exact IBD is known. However, these pairs with multiple long shared segments still unambiguously indicate very recent biological relatedness. Most biological relatives up to the sixth degree will share two or more long IBD segments 25. For instance, we identified two long IBD segments in a sixth-degree relative from Neolithic Britain (Fig. 3c), a relationship that was previously reconstructed from a pedigree of first-degree and second-degree relatives identified using average pairwise genotype mismatch rates 26. In most human populations, pairs of biologically unrelated (that is, related at most by tenth degree) individuals share only sporadically single IBD segments [27][28][29]. Thus, the sharing of many long IBD segments provides a distinct signal for identifying close genealogical relationships that we can detect with ancIBD.
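The two per-pair summary statistics used throughout this section, the summed length and the count of IBD segments longer than 12 cM, are simple to compute once segment calls are available. The following is a minimal sketch, assuming segment lengths are given as a plain list in cM; the function name, the returned dictionary keys and the example lengths are hypothetical and are not part of the ancIBD package. The 20 cM threshold mirrors the 'very long' segments mentioned in the text.

```python
from typing import Dict, Sequence


def summarise_pair(segment_lengths_cm: Sequence[float],
                   min_len_cm: float = 12.0,
                   long_len_cm: float = 20.0) -> Dict[str, float]:
    """Summed length and count of IBD segments above a length threshold.

    Also counts 'very long' segments, which are the ones most indicative
    of close relatedness in the text.
    """
    kept = [length for length in segment_lengths_cm if length > min_len_cm]
    return {
        "sum_cm": sum(kept),
        "count": len(kept),
        "n_long": sum(1 for length in kept if length > long_len_cm),
    }


# Hypothetical segment lengths (cM) for one pair of individuals.
pair_summary = summarise_pair([8.5, 12.4, 25.0, 31.2, 14.0])
print(pair_summary)
```

Plotting `sum_cm` against `count` for every pair gives the kind of scatter described for the empirical and simulated data, where distinct degrees of relatedness occupy different regions.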

Recent links among Eneolithic and Bronze Age groups
Because recombination acts as a rapid clock (the probability of an IBD segment of length l cM persisting for t generations declines quickly as exp(−t × l/50)), the rate of sporadic sharing of IBD segments probes genealogical connections between groups of individuals only a few hundred years deep, for example, for modern Europeans 2. To showcase how detecting IBD segments with ancIBD can reveal such connections between ancient individuals, we applied our method to a set of previously published ancient West Eurasian aDNA data dating to the Late Eneolithic and Early Bronze Age (Supplementary Table 3). This period, from 3,000 to 2,000 bce, was characterized by major gene flow events, where 'Steppe-related' ancestry had a substantial genetic impact throughout Europe (for example, refs. 30,31), leading to widespread genetic admixtures and population turnover as far west as Britain 32 and Iberia 33. Applying ancIBD to the relevant published aDNA record of 304 ancient Western Eurasians organized into 24 archaeological groups (Supplementary Table 3), we find several intriguing links. Many of those connections were previously proposed and suggested by admixture tests; however, the sharing of long IBD segments now provides definitive evidence for recent co-ancestry and biological interactions, tethering groups together closely in time.
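To give a feel for how quickly this recombination clock runs, the decay exp(−t × l/50) stated above can be evaluated directly. The sketch below uses the formula exactly as given in the text, with l in cM and t in generations; the function name and the chosen example values are illustrative assumptions.

```python
import math


def ibd_persistence(length_cm: float, generations: float) -> float:
    """Probability that an IBD segment of the given length persists,
    using exp(-t * l / 50) as stated in the text."""
    return math.exp(-generations * length_cm / 50.0)


# Compare a long (20 cM) and a shorter (8 cM) segment at several time depths.
for t in (5, 10, 25):
    print(t, round(ibd_persistence(20.0, t), 5), round(ibd_persistence(8.0, t), 5))
```

Under this formula, a 20 cM segment persists over 25 generations with probability exp(−10), several orders of magnitude below the value for 5 generations, which is why long segments constrain shared ancestry to the recent past.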
We found that several nomadic Steppe groups associated with the Yamnaya culture that date to around 3,000 bce share comparably large amounts of IBD with each other (Fig. 4). These late Eneolithic to Early Bronze Age pastoral nomads, who inhabited the Western Eurasian Pontic-Caspian Steppe, often buried their dead in tumuli (Kurgans), were among the first people to use wagons and are suggested to have had a key role in the early spread of Indo-European languages 34. Notably, the Yamnaya IBD cluster also includes individuals associated with the contemporaneous Afanasievo culture thousands of kilometres east, an Eneolithic archaeological culture near the Central Asian Altai mountains. This signal of IBD sharing confirms the previous archaeological hypothesis that Afanasievo and Yamnaya are closely linked despite the vast geographic distance from Eastern Europe to Central Asia 34. A genetic link has already been evident from genomic similarity and Y haplogroups 31,35; however, the time depth of this connection remained unclear. We now identify IBD signals across all length scales, including several shared IBD segments even longer than 20 cM (Extended Data Fig. 3). Such long IBD links must be recent as recombination ends an IBD segment ~20 cM long on average every five meioses. This long IBD sharing signal, at the same level as between various Yamnaya groups (Fig. 4), therefore clearly indicates that ancient individuals from Afanasievo contexts descend from people who migrated at most a few generations earlier across vast distances of the Eurasian Steppe.
Increased individual mobility in Eneolithic and Early Bronze Age Eurasian Steppe groups is also reflected in a pair of individuals associated with the Afanasievo culture that were buried 1,410 km apart, one in present-day Central Mongolia and one in Southern Russia, who share several long IBD segments (Fig. 5a,c). We identified four IBD segments 20-40 cM long, a distinctive signal of close biological relatedness typical of about fifth-degree relatives (Fig. 5c,d). Previous work showed that both individuals have a genetic profile typical for Afanasievo individuals and here this close biological link demonstrates that at least one individual in the chain of relatives between them must have travelled several hundreds of kilometres in their lifetime.
Fig. 5 | [...] 3), we detected a pair of biological relatives whose remains were buried 1,410 km apart, one in central Mongolia and one in Southern Russia. The two individuals were previously published in two different publications 35,59. Both individuals are archaeologically associated with the Afanasievo culture and genetically cluster with other Afanasievo individuals 35,59. b, Posterior of non-IBD state on chromosome 12, which has the longest inferred IBD segment (39.1 cM long, indicated as a dark blue bar). We also plot opposing homozygotes (upper grey dots), whose absence is a necessary signal of IBD. Only SNPs where both markers have an imputed genotype probability >0.99 are plotted. c, Plot of all inferred IBD segments longer than 12 cM. d, Histogram of inferred IBD segment lengths, as well as theoretical expectations for various types of relatives (calculated using formulas described in ref. 29). Panels b-d were all created using default plotting functions bundled into the ancIBD software package.
Moreover, there are several intriguing observations regarding individuals associated with the Corded Ware culture, an important archaeological culture that appears across a vast area of Eastern, Central and Northern Europe between 3,000 and 2,400 bce. Previous aDNA research showed Corded Ware groups to be the first people of these regions to carry high amounts of a distinct ancestry found in Eurasian Steppe pastoralists such as the Yamnaya, admixed with previous Final Neolithic farmer cultures 30,31,36,37. Using IBD, we find that individuals from diverse Corded Ware cultural groups, including from Sweden (associated with the Battle Axe culture), Russia (Fatyanovo) and East/Central Europe share high amounts of long IBD with each other and also have IBD sharing up to 20 cM with various Yamnaya groups (Fig. 4 and Extended Data Fig. 3a,b,c). We find a distinctive IBD signal with the so-called Globular Amphora culture, in particular from Poland and Ukraine, who were Copper Age (Eneolithic) farmers around 3,000 bce not yet carrying Steppe-like ancestry 38,39. This IBD link to Globular Amphora appears for all Corded Ware groups in our analysis, including from as far away as Scandinavia and Russia (Fig. 4), which indicates that individuals related to Globular Amphora contexts from Eastern Europe must have had a major demographic impact early on in the genetic admixtures giving rise to various Corded Ware groups.

Discussion
We have introduced ancIBD, a method to detect IBD segments optimized for aDNA data. The algorithm follows a long line of work using probabilistic HMMs to screen for IBD segments [40][41][42][43][44]. When compared to other methods to detect IBD (IBIS 23, IBDseq 24, Germline 4, Germline2 43 and hapIBD 6), ancIBD maintains a balanced performance between precision and recall in the low-coverage regime typical for aDNA data. A recent method, KIN 45, fits transitions between IBD states to identify relatives up to the third degree but does not identify sporadic IBD segments, which are typical of more distant relatives or are useful for demographic inference.
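As a concrete illustration of HMM-based IBD screening, the sketch below runs a generic forward-backward pass over a five-state chain with one non-IBD state and four IBD states, matching the state layout described under 'Identifying IBD with ancIBD'. It is not the ancIBD implementation: the transition rates (ENTER, LEAVE, SWITCH), the initial distribution and the toy per-site emission likelihoods are all hypothetical values chosen only so that the example runs, whereas the actual emission model is based on imputed genotype probabilities and phasing (Methods).

```python
def forward_backward(emissions, trans, init):
    """Per-site posterior state probabilities for a discrete HMM.

    emissions[t][s]: likelihood of the data at site t given hidden state s
    trans[i][j]:     probability of moving from state i to j between sites
    init[s]:         prior probability of state s at the first site
    """
    n_sites, n_states = len(emissions), len(init)
    # Forward pass, rescaled at every site to avoid numerical underflow.
    first = [init[s] * emissions[0][s] for s in range(n_states)]
    fwd = [[p / sum(first) for p in first]]
    for t in range(1, n_sites):
        cur = [emissions[t][j] * sum(fwd[t - 1][i] * trans[i][j] for i in range(n_states))
               for j in range(n_states)]
        fwd.append([c / sum(cur) for c in cur])
    # Backward pass, rescaled in the same way.
    bwd = [[1.0] * n_states for _ in range(n_sites)]
    for t in range(n_sites - 2, -1, -1):
        cur = [sum(trans[i][j] * emissions[t + 1][j] * bwd[t + 1][j] for j in range(n_states))
               for i in range(n_states)]
        bwd[t] = [c / sum(cur) for c in cur]
    # Combine both passes and normalise per site.
    post = []
    for t in range(n_sites):
        raw = [fwd[t][s] * bwd[t][s] for s in range(n_states)]
        post.append([r / sum(raw) for r in raw])
    return post


# Hypothetical transition rates: state 0 is non-IBD, states 1-4 are IBD states.
N_STATES = 5
ENTER, LEAVE, SWITCH = 0.0025, 0.01, 0.01
trans = []
for i in range(N_STATES):
    if i == 0:
        row = [1.0 - 4 * ENTER] + [ENTER] * 4
    else:
        row = [SWITCH] * N_STATES
        row[0] = LEAVE
        row[i] = 1.0 - LEAVE - 3 * SWITCH
    trans.append(row)
init = [0.96, 0.01, 0.01, 0.01, 0.01]

# Toy emissions: 20 sites favouring non-IBD, 40 favouring IBD state 1, 20 non-IBD.
non_ibd_site = [1.0, 0.5, 0.5, 0.5, 0.5]
ibd_site = [0.5, 1.0, 0.5, 0.5, 0.5]
emissions = [non_ibd_site] * 20 + [ibd_site] * 40 + [non_ibd_site] * 20

posterior = forward_backward(emissions, trans, init)
posterior_ibd = [1.0 - p[0] for p in posterior]  # probability of any IBD state
print(round(posterior_ibd[0], 4), round(posterior_ibd[40], 4))
```

The posterior of being in any IBD state is one minus the posterior of the non-IBD state. Thresholding this track and merging nearby stretches is the sort of postprocessing that turns posteriors into segment calls, although the specific thresholds used by ancIBD are not reproduced here.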
We optimized the default parameters of ancIBD towards performance on imputed 1240k variants, an SNP set widely used in human aDNA. We also recommend downsampling imputed WGS data to this SNP set because using all common 1000 Genome SNPs only marginally improves performance (Supplementary Note 6). Our benchmarks have demonstrated that ancIBD robustly detects IBD longer than 8 cM, for WGS data down to 0.25× and 1240k data down to 1× average coverage depth on 1240k SNPs. That WGS data perform better than 1240k data at the same coverage depth on target SNPs is not surprising because WGS data cover the entire genome while 1240k capture data are depleted for off-target data. Because imputation at 1240k sites uses all SNPs in the 1000 Genome dataset, providing more off-target data leads to substantially improved imputation quality. We found that WGS data can be imputed at roughly three times lower coverage equally as well as 1240k data (Supplementary Fig. 5), consistent with findings from ref. 19. This observation is relevant for choosing aDNA data generation strategies where IBD segment calling is of interest.
We showcased two main applications for identifying long IBD segments within human aDNA. First, ancIBD reveals biological relatives up to the sixth degree as such pairs distinctively share multiple long IBD segments 25. Allele sharing-based methods commonly used in aDNA studies 46,47 are generally limited to detecting relatives only up to the third degree because they average over the genome and do not identify signals due to only a few shared IBD segments that make up only a small part of the genome. However, they can be applied to substantially lower coverage than ancIBD. Similarly, KIN 45 can be applied to lower coverage than ancIBD but is also limited to detecting relatives up to the third degree.
Second, identifying IBD segments with intermediate coverage aDNA data unlocks a powerful way to investigate fine-scale genealogical connections of past human populations. Sharing of long haplotypes establishes bounds on the number of generations separating pairs of individuals, which adds information beyond average single-locus correlation statistics that have been the workhorse of aDNA studies to date. To showcase this potential, we have used ancIBD to generate evidence for the origins of the people culturally associated with the Corded Ware culture. Corded Ware groups of Eastern, Central and Northern Europe were identified to be among the first cultures affected by large-scale gene flows starting 3,000 bce which spread a distinct ancestry found in pastoralists of the Pontic-Caspian Steppes across Europe [30][31][32]. Our analysis of long IBD segments reveals that the quarter of Corded Ware Complex ancestry associated with earlier European farmers can be pinpointed to people associated with the Globular Amphora culture of Eastern Europe, who carry no Steppe-like ancestry yet, while the remaining three-quarters must share recent co-ancestry with Yamnaya Steppe pastoralists in the late third millennium bce. This direct evidence that most Corded Ware ancestry must have genealogical links to people associated with Yamnaya culture spanning on the order of at most a few hundred years is inconsistent with the hypothesis that the Steppe-like ancestry in the Corded Ware primarily reflects an origin in as-of-now unsampled cultures genetically similar to the Yamnaya but related to them only a millennium earlier.
Several extensions could improve ancIBD. SNP density in both the 1240k and the 1000 Genome SNP sets varies substantially along the genome 29. We have found that the false positive rate negatively correlates with SNP density (Supplementary Fig. 9) and designed a filter to mask genomic regions with high false positive rates of long IBD (Supplementary Fig. 9). Focusing exclusively on regions of high SNP density could enable one to call IBD with shorter lengths. We also note that we have imputed ancient data using a modern reference haplotype panel, which yields decreasing imputation and phasing performance the older the sample 19,48. Future efforts to include high-quality ancient genomes into reference haplotype panels or to use modern reference panels substantially larger than 1000 Genomes will probably improve the quality of imputed ancient genomes and thus also boost the performance of ancIBD. We note that ancIBD takes imputed data as input, thus future improvements of imputation software or reference panels can be easily integrated by updating the preprocessing step.
Our algorithm infers the presence of at least one shared IBD segment between two diploid individuals but in practice both pairs or even three or all four haplotypes can be shared. Here, we deliberately kept the model simple to improve robustness and runtime. Importantly, we believe that detecting the presence of one IBD segment alone suffices for most practical applications. Double IBD sharing, often termed IBD2, occurs mostly in full siblings, who on average share half of their genome length in a single IBD and one additional quarter in a double IBD. In this case, the sum of IBD length alone distinguishes full siblings from parent-offspring pairs (who distinctively have their whole genome in IBD) and from second-degree relatives (separate clusters in Extended Data Fig. 4). Beyond full siblings, having overlapping IBD segments on different haplotype pairs only rarely occurs in practice 49. Only in special cases, such as distinguishing double first cousins from other second-degree relatives, identifying double IBD can be useful. In that case, we recommend directly screening for identical imputed genotypes in IBD segments.
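The expected genome fractions behind this argument can be written down from the standard IBD-state probabilities (k0, k1, k2) of outbred pedigrees, that is, the probabilities of sharing zero, one or two haplotypes IBD at a locus. The sketch below is only a worked illustration: the dictionary layout and function name are assumptions, and the values are textbook pedigree expectations rather than output of ancIBD.

```python
# Expected probabilities of sharing 0, 1 or 2 haplotypes IBD at a locus
# (standard values for outbred pedigrees).
IBD_STATE_PROBS = {
    "parent-offspring": (0.0, 1.0, 0.0),
    "full siblings": (0.25, 0.5, 0.25),
    "second-degree (e.g. grandparent-grandchild)": (0.5, 0.5, 0.0),
    "double first cousins": (9 / 16, 6 / 16, 1 / 16),
}


def fraction_in_any_ibd(k):
    """Expected genome fraction covered by at least one shared haplotype,
    which is what a single-IBD caller reports."""
    _k0, k1, k2 = k
    return k1 + k2


for relationship, k in IBD_STATE_PROBS.items():
    print(f"{relationship}: any IBD = {fraction_in_any_ibd(k):.4f}, IBD2 = {k[2]:.4f}")
```

Full siblings are expected to have three-quarters of the genome in at least one IBD state, compared with the whole genome for parent-offspring pairs and one-half for standard second-degree relatives, so the summed IBD length separates these classes without calling IBD2. Double first cousins, by contrast, differ from other second-degree relatives mainly through their small IBD2 fraction, which is the special case the text mentions.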
One promising extension is calling IBD segments on X chromosomes. Genetic males have only one copy of it, while females have two, which causes sex-specific inheritance and recombination patterns (for example, males must have inherited their X chromosomes from their mothers). Therefore, IBD sharing on the X chromosome can provide information about sex-specific relatedness and demography 50. Our work here focused on the autosomes that make up most of the human genome; however, one can in principle apply ancIBD to imputed female X chromosomes. To call IBD on the X in pairs involving males, one could adapt the state space of ancIBD in a technically straightforward way. Another potential application of IBD segments is to improve the dating of ancient samples by using recombination clocks to tether samples in time. Future work to refine carbon-14 dating, a method widely used for determining the age of human remains, can build upon existing Bayesian methods to incorporate external information into such dates [51][52][53].
Detecting IBD segments in modern DNA has yielded fine-scale insights into the recent demography of present-day populations, allowing researchers to infer population size dynamics 54,55, genealogical connections between various groups of people 2,43,56 and the geographic scale of individual mobility 3,55. In principle, such analysis can also be applied to aDNA. It is particularly encouraging that the number of sample pairs that can be screened for IBD segments grows quadratically with the sample size, while the number of ancient genomes used in aDNA studies itself is currently quickly growing 57. This rapid scaling will provide aDNA researchers with a powerful way to address demographic questions about the human past. We believe that the method to detect IBD in aDNA presented here marks only a first step towards creating the next generation of demographic inference tools, resulting in unprecedented insights into the human past.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2023 https://doi.org/10.1038/s41588-023-01582-w

[...] loading time scales only linearly with the number of samples. This strategy, implemented in ancIBD, leads to hugely improved runtime per pair of samples in cases where many samples are loaded into memory and screened for pairwise IBD (Extended Data Fig. 5). We observed that for batches of size 50 samples and when screening all 50 × 49/2 = 1,225 pairs for IBD, the average runtime of ancIBD per imputed pair for all 22 chromosomes reduces to ~0.75 s. The asymptotic limit per sample pair, which is the runtime of the HMM and postprocessing, is about 0.35 s on our architecture.
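The amortization argument can be made concrete with a simple cost model, sketched below. The model is an assumption made for illustration and is not taken from the ancIBD code: total runtime is taken to be a per-sample loading cost times the number of samples plus a fixed HMM and postprocessing cost per pair. Under this assumed model, the two figures reported above (~0.75 s per pair at batch size 50 and an asymptote of about 0.35 s) imply a loading cost per sample, which is therefore a derived and hypothetical quantity rather than a measured one.

```python
def n_pairs(n_samples: int) -> int:
    """Number of unordered sample pairs."""
    return n_samples * (n_samples - 1) // 2


def runtime_per_pair(n_samples: int, load_s_per_sample: float, hmm_s_per_pair: float) -> float:
    """Average runtime per pair under the assumed model:
    linear loading cost plus a constant per-pair HMM cost."""
    total = load_s_per_sample * n_samples + hmm_s_per_pair * n_pairs(n_samples)
    return total / n_pairs(n_samples)


# Loading cost per sample implied by the reported 0.75 s (batch of 50)
# and the 0.35 s asymptote, under the assumed model only.
implied_load = (0.75 - 0.35) * n_pairs(50) / 50

for batch in (10, 50, 400):
    print(batch, n_pairs(batch), round(runtime_per_pair(batch, implied_load, 0.35), 3))
```

Because the loading term is divided by a pair count that grows quadratically, the per-pair average approaches the per-pair HMM cost as batches grow, which is the behaviour described in the text.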

Empirical data analysis
We applied ancIBD to a large set of previously published aDNA data of ancient Eurasians (using the bioinformatic processing described in the AADR dataset 57). After filtering to all individuals with geographic coordinates in Eurasia dating within the last 45,000 years and sufficient genomic coverage for robust IBD calling, we obtained a final set of 4,248 unique ancient individuals (Supplementary Table 1). As the coverage cutoff, we required at least 70% of the 1240k SNPs on chromosome 3 to have max(GP) (defined as the maximum among the three posterior genotype probabilities of 0/0, 0/1 and 1/1) exceeding 0.99. This metric was chosen because it can be easily calculated on imputed data for various data types. It corresponds to the coverage cutoff for ancIBD described above, as the relationship between coverage and this metric is monotonic (Supplementary Fig. 19). Our imputation pipeline is described in detail in Supplementary Note 3. We then screened each of the 9,020,628 pairs of ancient genomes with ancIBD. To optimize runtime, we grouped the genomes into batches of 400 and then ran all possible pairs between two batches after loading the two batches into memory (this approach is implemented in the ancIBD software package). For each pair with detected IBD, we collected IBD statistics into a summary table (see Supplementary Table 2 for pairs of published individuals).
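The max(GP) coverage criterion described above is straightforward to compute from imputed genotype probabilities. The following is a minimal sketch, assuming the three posterior probabilities for each SNP on chromosome 3 are available as tuples; the function name and the input layout are hypothetical and do not reflect the ancIBD package interface.

```python
from typing import Sequence, Tuple


def passes_coverage_filter(genotype_probs: Sequence[Tuple[float, float, float]],
                           gp_cutoff: float = 0.99,
                           min_fraction: float = 0.70) -> bool:
    """True if enough SNPs have a confidently imputed genotype.

    genotype_probs holds one (P(0/0), P(0/1), P(1/1)) tuple per SNP.
    A SNP counts as confident when its largest probability exceeds gp_cutoff.
    """
    if not genotype_probs:
        return False
    confident = sum(1 for gp in genotype_probs if max(gp) > gp_cutoff)
    return confident / len(genotype_probs) >= min_fraction


# Toy example: 8 confidently imputed SNPs and 2 uncertain ones.
confident_snp = (0.995, 0.004, 0.001)
uncertain_snp = (0.60, 0.35, 0.05)
sample_gp = [confident_snp] * 8 + [uncertain_snp] * 2
print(passes_coverage_filter(sample_gp))
```

A practical advantage of such a metric, as the text notes, is that it is computed on the imputed output itself and so applies equally to WGS and capture data.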

Statistics and reproducibility
For empirical aDNA data analysis presented in this work, we used 4,248 published samples originating from Eurasia dated within the last 45,000 years and passing the coverage requirement. No statistical method was used to predetermine the sample size. All simulation experiments depending on probabilistic random draws were performed with many independent replicates to analyse statistical uncertainty.

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

We visualize our steps to simulate IBD segment data (see detailed description in Supplementary Note 2). Starting from TSI (Tuscany) high-quality reference haplotypes in the 1000 Genomes panel (A), we created haplotype mosaics (B), as any long IBD segment is removed from those. We then copied over IBD segments of the target length (C). We grouped two mosaic haplotypes to obtain diploid individuals, but to simplify visualization here we do not depict the second haplotype per individual. (D) To create data typical for imputed low-coverage aDNA, we matched each genotype to a random matching genotype in a panel of aDNA diploid genotypes called from high-coverage aDNA (either 1240k or WGS aDNA data). We then downsampled the high-coverage aDNA panel to the target coverage, imputed genotype probabilities and copied those back to each match.

Extended Data Fig. 2 | Precision and recall of ancIBD and IBIS at various length bins and coverages. We applied both methods with their default settings to genotype data imputed after downsampling to various coverages. For each coverage, we report the average precision and recall of each length bin across 50 independent replicates. The error bar represents ± SE of the estimated precision and recall. Each row represents a length bin and each column represents one input data type (either WGS data or 1240k data). Note that the y axis ranges are different for different rows.

Extended Data Fig. 4 | Downsampling of Hazelton pedigree samples. We downsampled all individuals from a previously published English Neolithic pedigree 26 with coverage of at least 1× to both 1× and 0.75×. For each coverage, we downsampled 10 times, each with different random seeds, to create 10 replicates.

Therefore, not all dots are independent pairs of relatives; they may be the same pair downsampled with different random seeds. The relationship annotations are obtained from Supp.

To benchmark runtimes, we applied ancIBD on empirical ancient DNA data in .hdf5 format imputed at 1240k sites. We used the imputed hdf5 file from the Eurasian application (Fig. 3), choosing samples and pairs at random. Left: For each sample pair, all autosomes are screened for IBD. In one experiment, all pairs of samples were run independently, leading to a linear dependency on pair number, as expected. In a second experiment, all samples were loaded into memory and then each sample pair was screened for IBD. The apparent sub-linear behaviour arises because the time to load n samples grows more slowly than the time to screen the n(n − 1)/2 sample pairs. Right: We depict the runtimes normalized per sample pair when screening all pairs of sample batches of various sizes for IBD. We visualize the loading time (the time it takes to load the hdf5 genotype data into memory), the preprocessing time (including preparing the transition and emission matrix), as well as the runtime of screening for IBD, which includes the forward-backward algorithm as well as postprocessing. Because the loading time scales linearly with batch size while the number of sample pairs scales quadratically, the contribution of loading to the per-pair cost shrinks with batch size, and we observe substantially reduced runtimes per pair.

Extended Data Table 1 | Inferred segment length in simulated WGS-like data
For each of the simulated IBD lengths (4 cM, 8 cM, 12 cM, 16 cM, 20 cM) with WGS-like data quality at various coverages, the table shows the inferred segment length averaged over 500 independent replicates.

Extended Data Table 2 | Inferred segment length in simulated 1240k-like data
For each of the simulated IBD lengths (4 cM, 8 cM, 12 cM, 16 cM, 20 cM) with 1240k-like data quality at various coverages, the table shows the inferred segment length averaged over 500 independent replicates.

Fig. 3 | Inferring biological relatives in the aDNA record using long IBD inferred with ancIBD.
a, Inferred IBD among pairs of 4,248 ancient Eurasian individuals. The plot visualizes both the count (y axis) as well as the summed length (x axis) of all IBD >12 cM long. For comparison, we colour-code pairs on the basis of relatedness estimates from pairwise mismatch rates (PMR) that can detect up to third-degree relatives (Supplementary Note 9). We also annotate new relatives found by ancIBD, indicated by at least three very long IBD segments (>20 cM) typical of up to sixth-degree relatives. b, Simulated IBD among pairs of relatives. For each relative class, we simulated 100 replicates using the software
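The per-pair statistics shown in panel a can be sketched as below. This is an illustrative helper, not part of ancIBD: the function name and the returned keys are hypothetical, and the input is assumed to be a plain list of inferred IBD segment lengths in cM for one pair of individuals. The >12 cM and >20 cM thresholds and the "at least three" criterion are taken from the caption.

```python
def summarize_pair_ibd(segment_lengths_cm):
    """Summarize one pair's IBD segments (lengths in cM).

    Returns the count and summed length of segments >12 cM (the two
    plot axes) and flags pairs with at least three segments >20 cM.
    """
    over_12 = [length for length in segment_lengths_cm if length > 12.0]
    n_over_20 = sum(1 for length in segment_lengths_cm if length > 20.0)
    return {
        "n_ibd_gt12": len(over_12),        # y axis: segment count
        "sum_ibd_gt12": sum(over_12),      # x axis: summed length in cM
        "n_ibd_gt20": n_over_20,
        "candidate_relative": n_over_20 >= 3,
    }


# Toy pair with made-up segment lengths (not real data).
example = summarize_pair_ibd([9.0, 14.5, 22.0, 31.0, 25.5])
print(example)
```

The flag marks only candidates; assigning a degree of relatedness would additionally require comparison with simulated relatives, as in panel b.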

Fig. 5 | A geographically distant pair of ancient biological relatives detected with ancIBD.
a, When screening ancient Eurasian individuals for IBD segments (Fig. 3), we detected a pair of biological relatives whose remains were buried 1,410 km apart, one in central Mongolia and one in Southern Russia. The two individuals were previously published in two different publications 35,59. Both individuals are archaeologically associated with the Afanasievo culture and genetically cluster with other Afanasievo individuals 35,59. b, Posterior of non-IBD state on chromosome 12, which has the longest inferred IBD segment (39.1 cM

Extended Data Fig. 1 | Pipeline to simulate IBD segment data. Panel titles: Create typical aDNA data from diploid genotypes; B: Create mosaic haplotypes; C: Copy over IBD blocks.

Extended Data Fig. 3 | IBD sharing matrix of various Eneolithic & Bronze Age West Eurasian groups for four IBD length scales.
As in Fig. 4, but for shared IBD 8–12 cM, 12–16 cM, 16–20 cM and >20 cM long. We used ancIBD to infer IBD segments between all pairs of groups and visualize the fraction of pairs that share at least one IBD for each pair of populations and for the four different IBD length bins.
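One entry of such a sharing matrix could be computed as in the following sketch. It is not taken from the ancIBD package: the function name, the dictionary-based data layout and the toy individuals are hypothetical, and treating the length bins as half-open intervals (lower bound included, upper bound excluded) is an assumption, since the caption does not specify how bin boundaries are handled.

```python
from itertools import combinations, product
from math import inf

# The four length bins from the caption, in cM (half-open: lo <= L < hi).
LENGTH_BINS_CM = [(8.0, 12.0), (12.0, 16.0), (16.0, 20.0), (20.0, inf)]


def sharing_fraction(ibd_by_pair, groups, name_a, name_b, lo, hi):
    """Fraction of individual pairs between two groups that share at
    least one IBD segment with length in [lo, hi).

    `ibd_by_pair` maps frozenset({id1, id2}) to a list of segment
    lengths in cM; `groups` maps a group name to its individual IDs.
    """
    if name_a == name_b:
        pairs = [frozenset(p) for p in combinations(groups[name_a], 2)]
    else:
        pairs = [frozenset(p) for p in product(groups[name_a], groups[name_b])]
    if not pairs:
        return float("nan")
    hits = sum(
        1 for pair in pairs
        if any(lo <= length < hi for length in ibd_by_pair.get(pair, ()))
    )
    return hits / len(pairs)


# Toy data with made-up individuals and segment lengths.
groups = {"A": ["a1", "a2"], "B": ["b1", "b2"]}
ibd_by_pair = {
    frozenset(("a1", "b1")): [9.5, 21.0],
    frozenset(("a2", "b2")): [13.0],
}
for lo, hi in LENGTH_BINS_CM:
    print((lo, hi), sharing_fraction(ibd_by_pair, groups, "A", "B", lo, hi))
```

Pairs without any detected segment are simply absent from `ibd_by_pair` and count towards the denominator only.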