## Introduction

Estimates suggest that around 10% of the world’s population are offspring of closely related parents, mostly in the north and sub-Saharan Africa, the Middle East, and west, central, and south Asia1. In these regions, there is frequently co-occurring endogamy (marriage within clans)2. Exploring the impact of consanguinity, endogamy, and population structure on genetic variation within these populations is important for quantifying their relative effects on risks of genetic diseases. Furthermore, a proper understanding of the demographic history of a population is critical for designing robust and effective medical genetic analyses, as has been recently highlighted in work showing the impact of population structure on polygenic scores3,4. Here, we investigate population structure, history, and consanguinity patterns in individuals with Pakistani ancestry living in the United Kingdom (UK).

British Pakistanis are one of the largest and most socioeconomically disadvantaged ethnic minorities in the UK, with a population size of 1.17 million in the 2011 census5,6. The most substantial wave of migration from Pakistan to the UK occurred in the 1950s/1960s, after the partition of British India7,8. British Pakistanis have rates of type 2 diabetes and heart disease that are two to four times higher than the white British population9,10,11, as well as an increased risk of congenital anomalies due to the prevalence of consanguinity12. These factors, combined with the drive to increase the number of genetic studies on people with non-European ancestry13, have spurred the creation of several cohort studies whose aims include exploring the environmental and genetic contributions to various phenotypes in Pakistani-ancestry individuals and the impact of homozygous gene knockouts14. These include Genes & Health15 and Born in Bradford (BiB)16 (in the UK), and the Pakistan Risk of Myocardial Infarction Study17.

Most modern South Asians are a mixture of two ancestral populations: the Ancestral Northern Indian (ANI) and Ancestral Southern Indian (ASI) components18,19,20. North-west Indians and Pakistanis have a greater proportion of the ANI component18,19. Several major studies of human genetic diversity, such as the Human Genome Diversity Project (HGDP)21, 1000 Genomes22, and GenomeAsia23, have highlighted differences between multiple ethnic groups within Pakistan, as well as the elevated levels of homozygosity due to consanguinity24. These previous studies have focused mostly on population structure on a macro-scale, with relatively limited sample sizes per population, and they lacked information on finer-scale groupings within each of these populations.

Most of the ethnic and tribal groups in Pakistan follow a patrilineal kinship system based on the biraderi (brotherhood) that shares its historical roots with the better-studied Indian caste system. The biraderi system is a means of attributing social status and providing mutual social support25. Some biraderi groups such as Rajput and Jatt have been present on the Indian subcontinent for the past 2000–3000 years26,27,28. Others originated in the early Medieval period, such as the Gujjars whose identity is traced back to the Gurjara kingdom in present-day Rajasthan around 570 C.E.29. Mass conversion to Islam during the pre-Mughal and Mughal era introduced multiple new biraderi groups, including Syed, Qureshi, Malik and Sheikh30. Historical records suggest that endogamous practices started during the Gupta Empire, strengthened during the Mughal Empire31, and became even stricter during the colonial times of the nineteenth century, as the social classification system was reinforced by the British to solidify their political authority, enable rationalised taxation, and establish rules about property31,32. Overall, however, there are limited historical records about when the biraderi groups emerged or the extent to which endogamy was practiced over the centuries. Very little is known about the effect of this historical social structure on present-day genetics, since previous work has been limited to small studies of a few microsatellite markers28,33,34.

Here, we analyse genotype array data and exome-sequence data from thousands of Pakistani-ancestry individuals sampled in Britain as part of the BiB project. BiB is a birth cohort set up to investigate the social, environmental, and genetic causes of poor health and educational outcomes of children born in Bradford, a city with high levels of socio-economic deprivation in the north of England16. Around half of the individuals in this cohort have Pakistani ancestry, coming primarily from Azad Kashmir (Mirpur) and northern Punjab (see map in Supplementary Fig. 1), in proportions similar to the British Pakistanis as a whole8,35. BiB has rich self-reported information on the Pakistani mothers’ biraderi groups, places of birth, and parental relatedness. To our knowledge, this is the largest sample of Pakistani-ancestry individuals analysed to date to explore population genetic questions.

## Results

### Samples and dataset

We assembled a large dataset of 7180 individuals with Pakistani ancestry (Supplementary Data 1 and 2) from the BiB project, of which 5669 had been genotyped on the Illumina CoreExome array and 1511 on the Illumina Global Screening Array (GSA); 2484 of these also had exome-sequence data. Identifying related individuals is challenging in a population with high consanguinity and endogamy, so we tried several algorithms (see ‘Methods’, Supplementary Fig. 2). Conservatively, we used the algorithm that gave the highest estimates of kinship, PropIBD from KING36, to identify and remove putative relatives (third-degree or closer). Most analyses in this paper are based on genotype data from 2200 unrelated mothers (CoreExome array; 251,853 single-nucleotide polymorphisms (SNPs) with minor allele frequency (MAF) >1% post-quality control). Some analyses additionally used genotype data from 1616 unrelated children (CoreExome array) or 228 unrelated fathers (GSA array), and exome-sequence data from 2484 mothers (see below). Supplementary Data 18 indicates that samples were used in which analyses.

After cleaning questionnaire responses about biraderi membership, we determined that 56 distinct groups had been reported. Most of these are recognised biraderi groups, but some individuals identified themselves with tribal or regional groups (e.g. Pathan or Kashmiri), or clans within a biraderi (e.g. Choudhry) (Supplementary Data 3). Presuming that individuals have reported the labels that best represent their group identity within the context of the Bradford Pakistani community, we henceforth refer to these collectively as self-reported ‘subgroups’ rather than ‘biraderi’.

### Population structure

We began by investigating the genetic relationships between the Bradford Pakistanis and other worldwide populations using publicly available datasets, considering autosomal, Y chromosome and mitochondrial markers (Supplementary Note 1, Supplementary Figs. 3, 4 and 19 and Supplementary Data 4, 5, 6, 7 and 15). Comparisons with both modern and ancient genomes suggested that the different Bradford subgroups all come from a common ancestral population with a large ANI component (Supplementary Note 1).

We then examined the fine-scale population structure within the Bradford Pakistanis. Principal component analysis (PCA) of the samples revealed clear structure, with the first three PCs each explaining ~4% of the variation and driven, respectively, by the separation of the Jatt and Choudhry subgroups, the Pathan, and the Bains and a subset of Rajput individuals (Fig. 1). The fact that the Choudhry (dark orange colour) and Jatt (teal colour) subgroups cluster together is consistent with the fact that ‘Choudhry’ is an honorary title in Punjab and Kashmir used most commonly by the Jatts, who are one of the largest ethnic groups in Pakistan and north-west India28. ADMIXTURE analysis of the Bradford Pakistani samples indicates that the subgroups that were distinguishable on the PCA have different proportions of genetic components (Fig. 1c). The Rajput group contains two distinct subgroups that we term henceforth Rajput-A and Rajput-B; the Rajput-B group, which has a higher fraction (>40%) of the ‘red’ ancestry component, appears more similar to Bains than to Rajput-A, and these individuals cluster with Bains on the PCA (Fig. 1b). We found no evidence that the Rajput-A versus Rajput-B distinction is driven by geographical origin within Pakistan. It may well be that Rajput-B and the different subgroups of Rajput-A represent different sub-clans of Rajputs, of which there are hundreds across South Asia37,38. An alternative explanation is that some people self-identifying as Rajputs actually have diverse ancestries since individuals from other subgroups chose to identify with this group to benefit from land allocations during British colonial times25,39 or to increase their social status after migration to the UK40. The observation that Bains and the Rajput-B group appear to form a homogeneous genetic cluster is consistent with historical evidence that Bains is one of the Rajput families41.

We next applied Uniform Manifold Approximation and Projection (UMAP) to the first 20 PCs (Supplementary Fig. 5). This allowed us to define the Pathan and the Bains/Rajput-B subgroups more cleanly than on the PCA, but failed to distinguish additional groups. We found no significant correlation (two-sided Mantel test p value: 0.94) between genetic distance (measured by UMAP1 and UMAP2 vectors) and geographic distance (measured by geographic coordinates of the individual’s or her parents’ self-reported village of origin in Pakistan).

We then applied fineSTRUCTURE42, a Bayesian clustering algorithm, to a matrix of haplotype-sharing patterns. The inferred hierarchical clustering tree based on 1520 individuals from the 16 major subgroups (Fig. 2) identified three clusters containing the majority of the self-reported Pathan (Cluster 8), Bains and Rajput-B (Cluster 9), and Jatt and Choudhry (Cluster 10) individuals. Subsets of Bains and Rajput-B (Cluster 6) and Jatt and Choudhry (Cluster 11) clustered with individuals from other groups, although Cluster 11 had lower support than the other clusters. The Bains and Rajput-B in Cluster 6 showed a smaller proportion of the red component in the ADMIXTURE plot (Fig. 1c) than those in Cluster 9 (two-sided Wilcoxon’s rank-sum test p value = 1 × 10−12). For other self-reported groups, the majority of individuals from the group fell in a single fineSTRUCTURE cluster: the Kashmiri (Cluster 1), Syed and Awaan (Cluster 2), Arain (Cluster 4), Gujjar (Cluster 5), and Qasabi (Cluster 7). Some of the self-reported groups appear to be quite heterogeneous, with individuals from these groups distributed across different inferred clusters; Rajput-A is a notable example. Genetic differentiation was very low (FST < 0.001) between the Awaan and Syed in Cluster 2, the Bains and Rajput-B in Cluster 9, and the Jatt and Choudhry in Cluster 10 (Supplementary Fig. 6 and Supplementary Data 8). We henceforth pooled together individuals in these very similar subgroups who fell within those dominant clusters (Awaan/Syed, Bains/Rajput-B and Jatt/Choudhry), and restricted the Arain, Gujjar, Kashmiri, Pathan and Qasabi subgroups to those individuals who fell within the dominant fineSTRUCTURE cluster for that group. These homogeneous subgroups were recapitulated when downsampling each cluster in Fig. 2a to 60% of its original size and rerunning the fineSTRUCTURE clustering (Supplementary Note 1 and Supplementary Fig. 20), confirming that cryptic relatedness between samples was unlikely to explain the structure we observed.

The biraderi system is culturally characterised by patrilineal kinship ties, so we examined GSA data on 228 unrelated Pakistani fathers from BiB to test whether males from the same subgroups did indeed carry the same or similar Y haplogroups (Supplementary Data 5), using 1056 Y chromosomal SNPs. Although 79% of the individuals carried the IJ haplogroup, and the sample size is relatively small, we did not find a clear delineation of some groups observed in the analysis of the autosomal data. Even males within the most distinct groups (e.g. Jatt/Choudhry, Bains/Rajput-B, Pathan) did not tend to cluster together (Supplementary Fig. 7). This is consistent with previous findings in Punjabi Rajputs34 and in Jatts28, and may suggest at least three possible explanations that are not mutually exclusive: the founders of each biraderi may have included males with several different haplogroups, the age of the biraderi groups may not be old enough to lead to strong Y-haplotype stratification, and/or historically the patrilineality of the biraderi system may not have been very strict.

### Demographic history

We next estimated population divergence times between the homogeneous subgroups defined in the previous section, using the NeON package43 that considers patterns of linkage disequilibrium (LD) decay. We found that all groups diverged from one another within the last ~70 generations (Fig. 3a, Supplementary Data 9 and Supplementary Data 10), consistent with the low genetic differentiation between the subgroups (Supplementary Fig. 6). When we removed individuals with high autozygosity, the divergence time estimates tended to be shifted towards more recent times, but the relative patterns remained similar across groups (Supplementary Fig. 8 and Supplementary Data 10). The estimates in Fig. 3a suggest that the history of these subgroups cannot be considered as a series of clean splits between ancestral populations; rather, it appears that several groups began to differentiate from one another around the same time, with some degree of admixture persisting between the groups after their initial divergence. It is important to note that admixture is not explicitly modelled when estimating split times, thus depending on the extent of the gene flow in each subgroup, estimates might be more or less biased. Nonetheless, we can see that Bains/Rajput-B (Cluster 9 in Fig. 2) show the greater divergence time from the other groups, followed by Qasabi (Cluster 7) and Pathan (Cluster 8) (Supplementary Data 9). Greater divergence may be interpreted as either older divergence time between groups or greater genetic drift caused by a small population size. However, simulations under different demographic scenarios showed that NeON infers divergence times robustly in the last 200 generations even in the presence of decreasing Ne (Supplementary Note 1 and Supplementary Fig. 21). Within clusters, Bains and Rajput-B (Cluster 9) have a divergence time estimate close to zero, as do Jatt and Choudhry in (Cluster 10; Supplementary Data 11), consistent with their clustering together on fineSTRUCTURE and having very low FST (FST < 0.001).

Population relationships and gene flow inferred by Treemix44 suggested that the Bains/Rajput-B group experience the strongest genetic drift, followed by Qasabi and Pathan (Supplementary Fig. 9a–c). The tree topology without migration edges explained most of the variance (97.4%), but adding one or two migration edges increased the variance explained to >99%, detecting gene flow events from Pathan into Kashmiri (one migration edge), and from Bains/Rajput-B and Jatt/Choudhry into Kashmiri (two migration edges) (jackknife analysis p values <0.001). f3-statistics45 also showed that the Kashmiri underwent significant admixture events with other subgroups (Supplementary Fig. 9d and Supplementary Data 12). Although about half of the Bradford Pakistanis come from Azad Kashmir, people who refer to themselves as ‘Kashmiri’ are believed to originate specifically from the Kashmir Valley46. Our finding that the Kashmiri have gene flow from other groups is consistent with historical evidence that the Kashmiri from the valley belonged to different biraderi groups and tribes. Some migrated to Azad Kashmir and Punjab during the nineteenth century, where they began identifying themselves as Kashmiri ahead of their biraderi identity, tending to marry within this group41,47,48. This likely explains the relative genetic homogeneity of the Kashmiri in our sample (80% of individuals fall in Cluster 1 on Fig. 2) and the observation that the most recent estimated divergence time between Kashmiri and other groups is ~10 generations ago (Fig. 3a).

We next applied IBDNe49 to infer recent effective population sizes (Ne) through time across the different subgroups. All subgroups, and indeed, all Pakistanis combined (Supplementary Fig. 10d), were inferred to have relatively low Ne over the last 50 generations compared to white British individuals (Fig. 3b), likely reflecting the endogamy of the biraderi system. Bains/Rajput-B (Cluster 9), Jatt/Choudhry (Cluster 10) and Pathan (Cluster 8) showed a similar trend: a strong reduction in Ne around 10–20 generations ago followed by a recovery. The other subgroups, except for Arain, showed a progressive decrease in Ne in the last 15 generations, which may reflect the increase in endogamy that occurred during the Mughal Empire and British colonial times31. Broad conclusions were unchanged in various sensitivity analyses (Supplementary Note 1 and Supplementary Figs. 10 and 11). It is important to note that the high Ne estimates in recent generations might be due to phasing errors and should be thus interpreted with caution. Furthermore, gene flow between the subgroups could also have an impact on the inferred Ne trajectory.

### Endogamy and consanguinity

To quantify the extent of this historical endogamy, we calculated identity-by-descent (IBD) scores50 in the Pakistani subgroups. These scores represent the average total length of IBD shared between any two individuals from the same subgroup after excluding relatives, considering segments between 5 and 30 cM. Most of the Pakistani subgroups have substantially higher IBD scores than other worldwide populations from the 1000 Genomes Project, including the Finns (Fig. 4a), and similar to those reported previously for some isolated Indian groups50. The Bains/Rajput-B group had the highest IBD score, followed by Qasabi and Jatt/Choudhry. Similar results were obtained in various sensitivity analyses (Supplementary Note 1 and Supplementary Fig. 12).

Fifty-seven per cent of the BiB Pakistani mothers reported that their parents were related, and 63% reported being related to their child’s father (Supplementary Data 13 and 14). As expected, a much higher fraction of the genome was homozygous (FROH) in the Pakistani mothers than the White British (mean = 0.048 versus 0.0004, two-sided t test p < 1 × 10−15). Amongst the BiB children, those whose parents were both born in the UK had significantly lower FROH than those whose parents were both born in Pakistan, and both groups had significantly lower FROH than those who had one parent born in Pakistan and the other in the UK (Supplementary Fig. 13a). This fits with some prior evidence that cousin unions may be preferred for trans-national marriages51 among first- and second-generation British Pakistanis, for whom they can be a means of ensuring cultural continuity and socio-economic support, and of allowing siblings separated by migration to reconnect through the marriages of their children

We developed an algorithm to infer parental relatedness based on the frequency and length of ROHs in an individual (see ‘Methods’ and Supplementary Note 1 for details), similar to previously published approaches52,53. Briefly, we simulated different types of consanguineous couples to derive empirical distributions of the frequency of ROHs of different lengths as well as the length of the ten longest ROHs per individual. We then used a neural network trained on the simulated data to predict parental relatedness in our cohort. The model was 92.0% accurate when considering the three classes of the parental relationship shown in Fig. 4b–d. Our results suggest that many families have been practising cousin marriage for multiple generations (Supplementary Fig. 14a, b). When comparing the self-reported parental relatedness to the inferred parental relatedness, we find that 81% of BiB Pakistani mothers who reported having parents who are first cousins are inferred to have parents who are first cousins or closer (Fig. 4b). However, the self-reporting seemed to be less reliable for the individuals who said their parents were first cousins once removed or second cousins (z test for population proportions: p = 0.0048 for first cousins versus first cousins once removed, p < 1 × 10−15 for first cousins versus second cousins) (Fig. 4b). Overall, consanguinity was under-reported in the mothers and slightly over-reported in the children though less so than in mothers, with 78% of the mothers inferred to have parents who are second cousins or closer compared to the 57% reported, versus 58% inferred and 63% reported for the children (z test for population proportions: p < 1 × 10−15 for mothers, p = 0.0008 for children, p < 1 × 10−15 for the difference in reporting between mothers and children) (Supplementary Fig. 14).

Rates of consanguinity markedly differed between subgroups (Fig. 4c). Most notably the Qasabi subgroup had 67% of individuals with parents who are unrelated (more distant than second cousins) while nearly all Bains/Rajput-B individuals had parents who were inferred to be second cousins or closer. Even when considering individuals inferred to be offspring of unrelated parents, FROH differed significantly between the subgroups (Fig. 4d). Notably, groups that had higher Ne in recent generations including Arain, Awaan/Syed and Kashmiri (Fig. 3b) had overall lower FROH, suggesting weaker endogamy in these groups. To examine the contribution of endogamy and consanguinity to variation in FROH within our cohort, we fit a joint model on inferred consanguinity and subgroup defined using fineSTRUCTURE. Inferred consanguinity explained 62% of the variance in FROH (two-way ANOVA p < 1 × 10−15), subgroup explained 5% (two-way ANOVA p < 1 × 10−15), and the interaction term was not significant (two-way ANOVA p = 0.17). The pattern of differences in FROH across subgroups was similar between individuals who were born in Pakistan and those born in the UK (Supplementary Fig. 13b), implying that each subgroup’s relative frequency of consanguineous unions has not changed substantially after migration.

To validate our inference of the degree of parental relatedness for the Bradford Pakistanis using an orthogonal method, we applied recently developed theory54, which predicts the length distribution of ROH segments given historical consanguinity rates and a historical Ne trajectory. We focused on mothers from the two largest homogeneous subgroups: the Pathan from Cluster 8 (N = 213) and the Jatt/Choudhry from Cluster 10 (N = 299) (Fig. 2). Our neural net predicted an average kinship between the mothers’ parents of 0.035 for Pathan and 0.057 for Jatt/Choudhry (see ‘Methods’), versus values of 0.016 and 0.028 estimated based on self-reported parental relationships. Our neural net estimator did not explicitly take into account the possibility that some ROH segments may descend from a more remote common ancestor even in the absence of any close relationship between the parents. However, it did focus only on ROHs >10 cM, which Supplementary Fig. 15 illustrates are minimally impacted by Ne, except when it is very small (Ne < 5k). To verify that the Ne trajectory was not affecting our inferences from the neural net, we examined the expected ‘ROH footprint’54 given the Ne estimates from IBDNe (Fig. 3b) (see ‘Methods’), and given consanguinity rates derived either from self-reported information or based on our neural net. Figure 5 shows that the observed ROH (and IBD) footprints are similar to those expected when using the kinship values inferred from our neural net-based method rather than the self-reported information, confirming the robustness of our inference. Similar results were obtained with ROHs called by PLINK, whereas GARLIC seems to call more short ROHs than the model predicts (Supplementary Fig. 16), consistent with the higher false-positive rate for short ROHs reported in the original GARLIC paper55.

Our results suggest that, even in the absence of close consanguinity, increased homozygosity due to endogamy is likely to be contributing to recessive disease burden56 and the elevated frequency of rare homozygous knockouts14,57 in this population. To investigate the relative impact of endogamy versus consanguinity on recessive disease risk, we used exome-sequence data from 2484 Bradford Pakistani mothers, in which we ascertained pathogenic/likely pathogenic (P/LP) variants in autosomal recessive developmental disorder genes. We then simulated intra-biraderi (endogamous) and inter-biraderi (exogamous) couples, and unions between pairs of individuals whose IBD distribution matches that of self-reported first cousins within the dataset (see ‘Methods’). We then scored each couple as being ‘at risk’ of having an affected child if both individuals were carriers of a P/LP variant in the same gene, similar to the approach in ref. 58. The results (Fig. 6) indicate that intra-biraderi unions incur significantly higher risk than inter-biraderi unions (particularly for the Bains and Jatts; one-sided permutation tests p = 2 × 10−4 and p < 1 × 10−4 respectively), but first cousin unions incur more than ten-fold higher risk than intra-biraderi unions.

## Discussion

We have carried out the first large-scale investigation of the population structure and demographic history of Pakistanis. We found genetic structure in the cohort reflecting the influences of the biraderi social stratification system, with some subgroups forming identifiable and homogeneous clusters (Figs. 1 and 2). Our analyses suggest that these subgroups come from a common ancestral population, but diverged from one another within the last 70 generations (1500–2000 years) (Fig. 3a). This is consistent with an earlier finding that the transition from intermarriage to strict endogamy on the Indian subcontinent started from about 70 generations ago20, concurrent with or immediately after the drafting of the ancient Law Code of Manu that described a ranked social stratification system32.

Historically, intra-biraderi marriages and consanguinity were practised to solidify socio-economic bonds59,60. Intra-biraderi marriages continue to be very common in the Bradford Pakistani community, constituting >90% of marriages in the Gujjar, Pathan, Jatt, and Bains, ≥80% in the Syed, Qasabi, Rajput, Awaan, and Choudhry, and ~60–63% in Qureshi, Malik, and Sheikh in the current BiB participants60, according to the self-reported questionnaire data. As one might expect, the subgroups we inferred to be the most genetically homogeneous and with the highest IBD sharing (Fig. 3a) are generally those with the highest self-reported rates of intra-biraderi marriage. Endogamous practices became stronger, particularly amongst the elite classes, during the Mughal Empire (mid-1500s to mid-1800s)31, and then amongst all classes under the British Raj (mid-1800s to 1947) when the laws of land ownership changed31,32. This could be an explanation for the decrease in effective population size seen in all subgroups starting 15–20 generations ago (~375–580 years ago if the average generation time were 25–29 years) (Fig. 3b). However, there is considerable uncertainty in the average generation time; according to historical records, it was not uncommon at some periods in history and in some regions of the world for women to marry and have children as early as 12–18 years old, including in South Asia61,62. Hence, we should be cautious about attributing these changes in Ne to demographic changes or historical events at particular time points.

To assess the rates of consanguinity in the cohort, we developed a neural net-based algorithm to infer the degree of parental relatedness based on ROH patterns in individuals. We found that rates of consanguinity were under-reported in the mothers and varied significantly between biraderi groups, highlighting differential social practice within the British Pakistani population. Consanguinity rates seemed to be reported somewhat more accurately for the children (Supplementary Fig. 14), possibly because the mothers were more sure of their genetic relationship with their husbands than of that between their parents. We found evidence of multiple generations of recent consanguinity in many families (Supplementary Fig. 14), which contribute to many individuals having higher FROH than expected given the self-reported parental relationships. Our data also suggest that endogamy has contributed to differences in FROH between biraderi groups (Fig. 4d). These results could potentially inform prior expectations about recessive disease risk for British Pakistanis in a clinical genetics setting56. It should be noted that there could be misclassifications with our neural net-based method due to systematic bias in the arbitrary set of modelled consanguineous relationships. Additionally, there is significant variability in the ROH distribution for a given consanguineous relationship, further complicating accurate classification due to overlapping ROH distributions.

To validate our findings from this neural net-based method, we leveraged new coalescent-based theory54 (and ‘Methods’) to predict ROH levels due to recent consanguinity, while, for the first time, properly accounting for ROH attributed to historical endogamy. We found that for both the Jatt/Choudhry and Pathan subgroups, a naive estimate of spousal kinship based on self-reported parental relatedness underestimates the proportion of the genome in ROH segments, whereas an estimate of kinship based on the inference from our neural net gives an expectation very similar the observed data. This approach can in principle also detect historical changes in the consanguinity rates; however, we found little power to distinguish different levels of historical consanguinity rates (Supplementary Fig. 17).

We showed through simulations based on exome-sequence data that intra-birarderi unions increase the risk of having offspring with rare recessive disorders, but less so than first cousin unions (Fig. 6). The elevated risk was particularly pronounced in the groups that showed the strongest degree of bottleneck in the IBD score analysis. We emphasise that this analysis assumes full penetrance, and that the variants included are the only ones that cause recessive disease in this population, and hence underestimates absolute risk, but this seems less of a problem if we consider the estimates of relative risk. However, we note that our analysis ignored missense and inframe variants that are pathogenic but have not been reported with two stars in ClinVar. Such variants are likely to be rarer and more recent and thus contribute more to risk for intra-biraderi than inter-biraderi marriages. For this reason, our estimates of the relative risk may be considered lower bounds. Future work should consider larger sample sizes, more exhaustively identify all potentially pathogenic recessive variants and evaluate the risks of endogamy versus consanguinity in cohorts of rare disease patients rather than in these simulations.

Our findings suggest that clinicians should consider recording parents’ biraderi groups as well as close relatedness in genetic consultations. This will be particularly useful as research becomes more focused on clinical sequencing datasets such as that held by Genomics England. Recording biraderi information would enable further research into the prevalence of different diseases in different biraderi groups, the impacts of endogamy and the possible presence of disease-causing founder mutations. The results from such research will be important to inform and design targeted genomic health services for Pakistani-ancestry populations. However, great care needs to be taken to ensure this research and any application of it is carried out in a culturally sensitive way. (See Broader Impact section below.)

This study has several limitations. Although we had samples from 56 distinct groups, many of them had a small sample size that would not allow reliable group-level inference about demographic history or founder events. The majority of our sample comprised individuals with Pathan, Punjabi or Kashmiri ancestry residing in the UK; additional data from the UK and Pakistan, including from other ethnic groups such as the Baloch and Sindhi, would allow us to explore how generalisable our findings are. Furthermore, our study was based primarily on SNP-genotype data, which, although highly valuable for describing population structure, did not allow the finer-scale demographic inference that would be possible with a large sample of whole-genome sequence data. For example, whole-genome sequence data on a larger sample size of males would allow us to leverage more Y chromosomal markers to further explore whether biraderi membership has indeed been passed down patrilineally in recent times. Finally, it is challenging to accurately estimate relatedness in populations with high consanguinity and endogamy. Multiple lines of evidence suggest that our approach has correctly removed individuals who are third-degree relatives or closer (Supplementary Figs. 2c, e and 4e) and that our identification of homogeneous clusters was not affected by the accidental inclusion of relatives (Supplementary Fig. 20). However, we cannot exclude the possibility that we have been over-cautious in removing putative relatives. This may mean we have over-estimated recent Ne and under-estimated FST between groups, leading to under-estimation of the divergence time. We also note that the admixture between the groups has likely affected our estimates of historical Ne and divergence time.

We have presented the largest investigation to date of fine-scale population structure and history in Pakistanis. We have shown how the biraderi social stratification system has played a significant role in shaping the population structure in Pakistani communities over the last 70 generations. Endogamous practices have led to greatly elevated IBD sharing as well as increased homozygosity, which is likely to have implications for disease risk on top of the high rates of consanguinity. Larger sample sizes and data on disease diagnoses (as opposed to proxy estimates of risk) will be needed to quantify the risk of recessive disorders for offspring of intra- versus inter-biraderi unions accurately, noting that our results imply that this differs between biraderi groups, and that certain disorders may be enriched in particular biraderi groups as a result of founder effects. Furthermore, future studies of disease genetics in Pakistanis should consider our findings in order to best control for stratification due to recent population structure in genome-wide association studies3 and potentially to increase power by exploiting the IBD within some groups as a proxy for rare variant sharing63.

There is an urgent need to expand genetic research to populations with non-European ancestry so that more people will benefit in the future from precision medicine. An understanding of population structure is an important prerequisite for robust medical genetic studies and can be used to inform study design to ensure validity and maximise power. This study explores the genetic structure of a population with Pakistani ancestry, which has a high burden of particular diseases, but has been, until recently, largely neglected in genetic research. Our findings can be used to inform future medical genetic studies of Pakistani-ancestry individuals in the UK and around the world. For example, our work motivates further exploration of how best to control for recent, fine-scale population structure when conducting genome-wide association studies in this population and of whether polygenic risk scores perform differently in different Pakistani sub-populations.

Our work touches on some sensitive topics, such as how people’s genetic ancestry may differ from their cultural identity, and how the degree of parental relatedness inferred from people’s genetic data may differ from their own beliefs or understandings. Our findings of the elevated risks of rare recessive disorders associated with intra-biraderi (endogamous) marriage have the potential to be mis-interpreted and lead to stigmatisation if not appropriately communicated. Nevertheless, we hope that this risk is outweighed by the benefits that may be reaped from future medical genetics research in the Pakistani community that builds on our findings. At each stage of this research, we have consulted with representatives from the Bradford Pakistani community and, with their input, we have prepared a set of key questions and answers to accompany the paper (Supplementary Note 2), to try to present our results and their implications clearly and sensitively to a lay audience.

Further research would be required in population-based cohorts containing rare disease patients to calculate absolute risks arising from consanguineous and intra-biraderi marriages. Estimating absolute risks would be more clinically useful than the relative risks we have presented here. We note that previous work from the BiB study showed that advanced maternal age confers a similar level of increased risk of congenital anomalies to that associated with consanguinity12. Our findings motivate a search for recessive disease-causing founder variants in particular biraderi groups, the discovery of which would enable carrier and/or prenatal testing for variants causing severe disorders. A culturally competent approach and sensitive communication strategy will be required to carry out such research and to translate these results into health education and clinical practice, in order to maximise patient benefits and to ensure all groups are well informed and empowered to make informed decisions regarding their marriage choices64. A joint clinical and social science approach alongside a community engagement programme would help to build trust within the British Pakistani population, improve access to genomic medicine services and provide a means to reduce health disparities. Similar approaches to other preventive health interventions like obesity65 and smoking cessation66 were acceptable for delivery among British Pakistanis.

## Methods

### Dataset

The data come from the Born in Bradford (BiB) study. We complied with all relevant ethical regulations for work with human participants. Ethical approval for the data collection was granted by Bradford Research Ethics Committee (ref. 07/H1302/112), and informed consent was obtained from participants. All women (of any ethnicity) giving birth at the Bradford Royal Infirmary in 2007–2010 were invited to participate in the study and ~80% of those eligible accepted16. We had genetic data from 7180 individuals of Pakistani ancestry and 6818 of White British ancestry (Supplementary Data 1 and 2) from the BiB cohort. These were predominantly from mothers and their children, with some fathers also included. The samples were genotyped using two chips: (1) the Infinium CoreExome-24 v.1.1 BeadChip (~550 K SNPs), and (2) the Infinium GSA-24 v.1 (~640 K SNPs). We also analysed whole-exome sequencing (WES) data for 2484 Pakistani individuals who were also genotyped. Recruited mothers were asked to fill in a self-reported questionnaire, including, for the Pakistani mothers, questions about biraderi groups, places of birth, and parental relatedness (consanguinity).

### Quality control of genotype chip data

Data quality control was performed using PLINK v.1.90b467. We required <1% missingness per SNP and <10% missingness per individual. Duplicated variants were removed. We removed 110 samples with sex discrepancies, 123 genetic duplicates and 465 individuals who were not genetically related to someone who should have been a first-degree relative, and five individuals who had an inferred first-degree relative who should not have been related. Genetic ancestry was assigned with PCA using the self-reported information on ethnicity, and an additional 52 samples were filtered out because their declared ethnicity was different from their genetic ancestry inferred using EIGENSOFT 7.2.168,69. SNPs with Hardy–Weinberg equilibrium p value <1 × 10−6 were removed considering Pakistani (3348 CoreExome variants, 1435 GSA variants) and White British (305 CoreExome variants, 224 GSA variants) separately. These filters resulted in a dataset of 476,816 autosomal SNPs (246 mitochondrial DNA (mtDNA) SNPs) and 14,624 individuals on the Core Exome chip and 598,326 SNPs (583,667 autosomal, 1056 Y chromosome) and 4398 samples on the GSA. For this study, we focused primarily on individuals of Pakistani ancestry, but in some figures, the White British individuals were used for comparison. Results presented are from the CoreExome chip data, unless otherwise stated, since this gave a larger sample size, and taking the intersection of the two chips would have left insufficient SNPs for most analyses.

We performed the demographic inferences on mothers genotyped on the CoreExome chip using common autosomal SNPs (251,853 SNPs with MAF > 0.01). For the PCA and ADMIXTURE analyses, we filtered out SNPs in high LD (r2 > 0.5). Fathers from the GSA dataset were used for the Y-chromosome analysis. This resulted in a dataset of 3081 Pakistani and 2873 white British mothers and 2588 Pakistani children on the CoreExome chip and 601 mothers and 235 Pakistani fathers on the GSA.

### Inference and removal of relatives

A critical step in population genetic analysis is identifying and removing close relatives. In endogamous populations, distinguishing recent familial relatedness from population structure is challenging because both show genetic similarity through allele sharing70. We inferred relatedness coefficients with KING 2.2.4, which has been reported to be robust to population structure36, unlike PLINK’s $$\hat{\varPi }$$ estimator. We compared the original kinship estimate from KING (called ‘KING-robust’ in ref. 36) to a new estimator in KING, PropIBD, which integrates IBD segment information to infer sample relationships for third- and fourth-degree relatives more accurately (http://people.virginia.edu/wc9c/KING/manual.html). Comparisons of these different estimators on the mothers genotyped on the CoreExome showed that PropIBD identified more third-degree and closer relatives than the other methods (Supplementary Fig. 2a, b): KING ProbIBD removed 881 samples, KING-kinship removed 741 samples and PLINK’s $$\hat{\varPi }$$ removed 821 samples.

To try to determine which of these estimators was more accurate, we compared the genetic estimates of kinship to self-reported relationships for 196 Pakistani mother–father pairs for whom the relationship had been declared by the mother on the questionnaire. For these, since mothers and fathers had been genotyped on different arrays, we took the intersection of CoreExome chip and GSA (134,218 SNPs with MAF > 0.01, with SNP missingness <1% in the merged set) and ran KING on this (Supplementary Fig. 2c, d). The results suggest that KING-kinship may be underestimating true kinship for third-degree relatives: 21/75 (28%) self-declared first cousins were called more distant than third-degree relatives by KING-kinship, versus only 4/75 (5%) with KING-PropIBD. PropIBD called 38/75 (51%) self-declared first cousins as second-degree relatives. This seems to be mostly driven by inbreeding in the previous generation, since when we restrict to couples for whom both the mother’s and father’s parents were reported to be unrelated, KING-kinship and KING-PropIBD gave very concordant estimates (Supplementary Fig. 2d).

To be conservative in our population genetic analyses, we decided to use the relatedness estimates from KING-PropIBD to exclude relatives. We removed one sample from each pair of individuals with third-degree relatedness or above (i.e. PropIBD>0.0884). We retained 2200 Pakistani and 2520 White British mothers and 1616 Pakistani children on the Core Exome chip and 544 Pakistani mothers and 228 Pakistani fathers on the GSA.

As further validation of KING-PropIBD and to assess whether the estimates were robust to the endogamy in the cohort, we simulated individuals of various degrees of relatedness using the real-phased genotype data from individuals who self-declared as coming from the same biraderi and who fell in the same fineSTRUCTURE cluster (Fig. 2a). Specifically, we took the data phased with Eagle v.2.4.1 and used the recombination map provided with Eagle to generate gametes and combined them along a given pedigree using custom R code. We generated 500 pairs of individuals with each of four relationships: siblings, first cousins, second cousins, and unrelated. Reassuringly, KING-PropIBD inferred the correct or a closer relationship in the vast majority of cases; all sibling pairs and 93% of first cousin pairs were inferred to have the true simulated relationship or closer (Supplementary Fig. 2e).

### Population structure

For comparison with other modern worldwide populations, we merged our dataset with the HGDP71 and 1000 Genomes Project Phase 322. For comparison with modern and ancient individuals, we combined our samples with a dataset of published genotypes from modern and ancient individuals (https://reich.hms.harvard.edu/datasets)18. In both cases, we used GRCh37.

PCA was performed on the pruned datasets using EIGENSOFT 7.2.168,69. For the worldwide dataset, the eigenvectors were computed using the non-BiB datasets and the individuals from BiB were projected onto them (Supplementary Fig. 3a, b). For Supplementary Fig. 3c, to ensure that the BiB Pakistanis were not dominating the structure due to their large sample size, we computed the PCA using HGDP Pakistani populations and just 25 BiB Pakistanis, then projected the remaining BiB Pakistanis onto them.

ADMIXTURE v.1.372 was run on the pruned datasets and the cross-validation error was calculated for identifying the best K value, which was found to be 4 for Fig. 1c and 8 for Supplementary Fig. 4a. We separated Rajput-A and Rajput-B according to the proportion of the red component in the ADMIXTURE analysis, defining Rajput-B individuals as those for which the red component made up >40% of their ancestry. Rajput-A individuals had a maximum red component proportion of 18%.

Genetic affinity with other ancient and modern worldwide populations was tested by computing f-statistics using ADMIXTOOLS v.6.0. We computed both f3- and f4-statistics. Outgroup f3-statistics were computed with the phylogeny f3(Bradford subgroups, ancient samples; Mbuti), where the ancient samples were representative individuals of different archaeological periods and locations in South and Central Asia18. We selected ancient samples with the highest number of SNPs overlapping with our cohort. We also computed outgroup f3-statistics using present-day populations with the phylogeny f3(Bradford Pakistanis, X; Mbuti), where X represents another worldwide population (Supplementary Data 3). We considered both all Bradford Pakistanis together and the self-reported subgroups separately. We computed f3-statistics with the phylogeny f3(X,Y; Bradford subgroup), where X and Y are other Bradford subgroups (Supplementary Fig. 9d). Standard errors were obtained using blocks of 500 SNPs. f4-statistics were computed using qpDstat with f4mode:YES and with the phylogeny f4(W, X; Y, Chimpanzee), where W represents the biraderi groups reporting Arabic ancestry (Qureshi, Sheikh or Syed), X are all the other Bradford subgroups and Y represents Middle East populations45 (Supplementary Data 5). Positive values indicate gene flow between W and Y.

We used qpgraph in the ADMIXTOOLS v.6.0 package to confirm that the Bradford subgroups could be modelled as a mixture of ANI and ASI ancestral components as seen in ref. 18. For this analysis, we considered the homogeneous subgroups defined using fineSTRUCTURE (Fig. 2a) to avoid biases in the proportion estimates. We used the same qpgraph parameters and populations used in ref. 18. We assessed the fit of the demographic model for each of the Bradford subgroups separately, and estimated the proportion of the ANI and ASI components in each subgroup. Differences in the empirical and theoretical f-statistics with Z-score ≥3 were considered statistically significant.

UMAP was computed from the top 20 PCs with default parameters in the BiB dataset using the uwot R package73. To compare genetic to geographic distance, we first assigned 547 mothers to a village of origin in Pakistan using either the mother’s own self-reported village of origin or that of her parents if she was born in the UK and reported that her parents were born in the same Pakistani village. We then determined the latitude and longitude of the villages in order to define geographic distances between them. Genetic (measured as UMAP1 and UMAP2 vectors) and geographic distances were computed as Euclidean distances using the dist R function with method = Euclidean. Correlation between genetic and geographic distance was estimated using a two-sided Mantel test implemented in the Ade4 R package (mantel.rtest function)74. The number of permutations used for the Mantel test was 9999.

We ran fineSTRUCTURE v.4.0.142 to infer fine-scale population structure in the BiB Pakistani mothers. The haplotypes were phased using SHAPEIT v.2.1275 using the 1000 Genomes Phase 3 dataset as a reference panel. We generated the co-ancestry matrix using ChromoPainter and used it to run fineSTRUCTURE with 1,000,000 burn-in steps and 1,000,000 iterations. A PCA on the co-ancestry matrix using the prcomp R function confirmed the structure in the tree (Fig. 2). We ran the algorithm twice on the dataset with the major subgroups to make sure the tree structure was robust (data not shown). We also ran it on the dataset including all subgroups to confirm that the clusters were consistent.

Given the heterogeneity of the genetic clusters inferred in our cohort (Fig. 2), we defined self-reported groups as homogeneous if at least 60% of the individuals from each self-reported subgroup fell in the same genetic cluster inferred by fineSTRUCTURE. We then estimated pairwise FST (calculated with the Weir and Cockerham’s method using the program 4P76,77) between homogeneous self-reported groups within a fineSTRUCTURE cluster, and combined two groups if they had FST < 0.001 (5th percentile of the empirical distribution for all pairs of groups shown in Supplementary Fig. 6b). For demographic analyses, we included only the homogeneous self-reported groups that had at least 20 individuals. These were: Awaan+Syed from Cluster 2, Bains+Rajput-B from Cluster 9, Jatt+Choudhry from Cluster 10, Arain from Cluster 4, Gujjar from Cluster 5, Kashmiri from Cluster 1, Pathan from Cluster 8 and Qasabi from Cluster 7. Sample sizes for these groups are shown in Supplementary Data 8.

Chromosome Y haplogroups were defined with yhaplo78 and mtDNA haplogroups with Haplogrep279. A median-joining haplotype network for the Y-chromosome data from BiB Pakistani fathers was constructed with PopART v.1.780.

### Divergence time estimation

We calculated the time of divergence between the homogeneous sub-populations and genetic clusters using the approach described in McEvoy et al.81 and implemented in the NeON R package43. The 95% confidence intervals for the times of divergence were calculated using a jackknife procedure, leaving out one chromosome each time.

### IBD segment calling

We conducted several analyses using IBD segments between the Pakistani individuals, called using IBDseq v.r120682 and/or GERMLINE 1.5.383. IBDseq requires unphased data, whereas GERMLINE phased haplotypes. IBDseq ran with default parameters. For GERMLINE, we phased the data with Eagle v.2.4.184 using 1000 Genomes Project Phase 3 genetic maps. IBD segments were defined using GERMLINE with the parameters -bits 75 -err_hom 0 -err_het 0 -min_m 3 -h_extend, and the HaploScore algorithm85 was used to remove false-positive IBD tracts (genotype error = 0.0075, switch error = 0.003, mean overlap = 0.8). We did some additional filtering of IBD calls as described in Supplementary Note 1.

### Estimating historical effective population size

We estimated historical changes in effective population size (Ne) for both Pakistani subgroups and White British individuals using IBDNe v.19Sep19.26849. As recommended, we ran IBDseq82 to identify IBD segments with default parameters. We then excluded IBD chunks overlapping problematic regions as explained above, although in practice we found this had a minimal effect on the results. We ran IBDNe with the default parameters, except that we set a lower limit of 5 cM for the IBD segments considered. We also ran IBDNe using the IBD calls from GERMLINE and found that this gave systematically higher Ne estimates than IBDseq, although the trajectories were very similar, particularly for the larger groups (Supplementary Note 1 and Supplementary Fig. 10). We ran IBDNe with different sets of samples and filtering to confirm the robustness of our estimates (Supplementary Note 1 and Supplementary Figs. 10 and 11).

### Calculating IBD scores

We computed the IBD score developed in Nakatsuka et al.50 from GERMLINE IBD calls to quantify the extent of founder events in each population using IBD segments. In addition to the aforementioned removal of individuals who were third-degree relatives or closer based on the KING-PropIBD metric (PropIBD > 0.0884), we also removed one individual from each pair of samples sharing an IBD chunk >40 cM. We calculated IBD scores as the total length of IBD segments between 5 and 30 cM detected between individuals in the same subgroup divided by the total number of possible pairs, i.e. $$[({{2n}\atop{2}})-n]$$, where n is the number of samples in that subgroup. We then standardised each IBD score by the score for the Finnish individuals from 1000 Genomes Phase 350. Standard errors for IBD scores were calculated using a weighted block jackknife for each chromosome, and 95% confidence intervals were defined as the IBD score ± 1.96 × standard error. We also computed IBD scores using the same set of filters used in Nakatsuka et al. (IBD segment filter >30 cM to identify relatives, then counting IBD segments between 3 and 20 cM; Supplementary Fig. 18) and found that the results were similar (Supplementary Fig. 12a). We found similar results with different sensitivity analyses (Supplementary Note 1 and Supplementary Fig. 12). We also computed the IBD scores using IBD calls from IBDseq using the same strategy used for GERMLINE IBD calls (Supplementary Note 1 and Supplementary Fig. 12e).

### Inferring gene flow between Pakistani subgroups

Gene flow was inferred using the Treemix v.1.13 approach44 and f3-statistics19,45. We ran the Treemix analysis with default parameters, without the -root option. We added migration edges (-m) until we reached a total variance explained of 99.5%. The f3-statistics were calculated in the form of f3(target, source 1, source 2) using ADMIXTOOLS v6.0, and provide evidence that the target population is derived from an admixture of populations related to sources 1 and 2. We tested all possible combinations of target and source populations in our dataset. Standard errors were obtained using blocks of 500 SNPs. Tests with a Z-score < −3 were considered significant.

### ROH calling

We used three different ROH callers: bcftools/roh 1.13, GARLIC v1.6.0a, and PLINK. The results in Figs. 4 and 5 and Supplementary Figs. 13, 14, 15 and 17 are based on bcftools/roh calls, and in Supplementary Fig. 16 we compare all three callers.

With bcftools/roh, we used the -G 30 flag. We noticed an excess of apparently artefactual ROHs spanning long gaps between SNPs around centromeres, so we removed ROHs <10 Mb overlapping centromeres. In practice, this made very little difference to the ROH footprint (Supplementary Fig. 16). For GARLIC55,86, we set the number of resamples for estimating allele frequencies (--resamples) to 20 and we assumed a genotyping error rate (--error) of 0.001. We used the --auto-winsize flag to guess the best window size based on the SNP density. For PLINK, we followed previous publications87,88 and used the following parameters: --homozyg-window-snp 50 --homozyg-snp 50 --homozyg-kb 1500 --homozyg-gap 1000 --homozyg-density 50 --homozyg-window-missing 5 --homozyg-window-het 1.

### Inferring degree of parental relatedness from ROH patterns

In order to infer the degree of parental relatedness of individuals for whom parental genotype information was unavailable, we simulated pedigrees with various types of consanguineous unions and used a neural network classifier to infer parental relatedness from the patterns of homozygosity in the offspring, similar to approaches used in ref. 52,53. We used the data phased with Eagle v2.4.1. Using the subset of individuals inferred to be unrelated from the original dataset, we used the recombination map provided with Eagle to generate gametes and combined them along a given pedigree that involved a particular type of consanguineous union. We considered ten different classes of consanguineous union: sibling, avuncular unions lasting for up to three generations, first cousin unions lasting up to three generations, first cousin once removed, second cousin and unrelated.

For each type of consanguineous parental relationship, we simulated 500 pedigrees using custom R code to obtain empirical distributions for the length of ROHs over 10 cM in individuals whose parents had that relationship. We called ROHs on simulated and real individuals using bcftools/roh as described above. We used 15 statistics for the purposes of classification using the neural net: the total length of the ten longest ROHs (in cM), and the frequency of ROHs ranging from 10 to 150 cM binned into 14 intervals of 10 cM. Using these statistics, we trained a neural net classifier implemented in the R package nnet to assign individuals to a given consanguineous pedigree. When testing the algorithm, we trained the model on 400 individuals from each class and predicted the assignment of 100 individuals from each class of the ten different parental relationships. We tested the algorithm by randomly sampling individuals ten times, each time simulating an equal number of individuals from each of the ten model classes. Each of the ten times we assigned an individual to a parental relatedness category based on the category for which they got the most classifications across the ten trials. The accuracy was determined as the fraction of individuals classified correctly by the neural net.

We conducted a z test for population proportions to compare the proportion of individuals inferred to have parents with a particular relationship with the proportion who self-reported that relationship. Similarly, we used a z test for population proportions to assess whether the difference in the fraction who were inferred to be versus self-reported as consanguineous differed between mothers and children.

Further details regarding the accuracy and robustness of the method are available in Supplementary Note 1. Code is available at https://github.com/malawsky/consanguinity_simulation.

### Analysis of FROH

FROH was determined for each individual by summing the lengths of the (autosomal) ROHs in base pairs and dividing by the length of the autosomal genome. The figures given in the ‘Results’ section come from bcftools/roh calls, but very similar results were obtained with PLINK and GARLIC (data not shown). We calculated FROH in the BiB mothers and children from the bcftools/roh calls and stratified these estimates according to the parents’ birthplace (both parents born in the UK, one parent born in the UK and one in Pakistan, and both parents born in Pakistan), by the degree of parental relatedness inferred as described in the previous section and by subgroup.

### Analysis of ROH and IBD distributions

We applied theory from refs. 54,89 to determine the expected length distribution and genomic footprint of ROHs and IBD segments given a particular historical Ne trajectory and rate of consanguinity. In the model developed in the above papers, there are $${N}_{{{\rm{e}}}}$$ pairs of parents, who, in each generation, have probability $${r}_{1}$$ of being (full) siblings, $${r}_{2}$$ of being (full) first cousins, $${r}_{3}$$ of being second cousins, etc., up to a certain degree $$n$$. A key parameter of the model is the average kinship between parents,

$$k={\mathop{\sum}\limits_{i=1}^{n}}{\frac{{r}_{i}}{{4}^{i}}},$$
(1)

which is the probability of a random chromosome in each parent coalescing due to consanguinity. It is assumed that the parents are related only through a single path of degree $$1,...,n$$, or are otherwise unrelated.

In ref. 54, a Markov chain was derived for the evolution of the state of two chromosomes. The possible states are as follows: the chromosomes are in unrelated individuals, in one of each parent, in the same individual as two homologous chromosomes, or in the same individual as a single chromosome (coalescence). We then considered $${t}_{{{\rm{between}}}}$$, the time to the most recent common ancestor (TMRCA, or coalescence time) of two chromosomes sampled from two unrelated individuals, and $${t}_{{{\rm{within}}}}$$, the TMRCA of the two chromosomes of an individual. In refs. 54,89, the mean and variance of $${t}_{{{\rm{between}}}}$$ and $${t}_{{{{\rm{within}}}}}$$ were computed using a first-step analysis. An approximation of the distribution was derived using a separation-of-time-scales analysis, The results showed that $${t}_{{{\rm{between}}}}$$ is approximately exponential with rate $$1/[4N_{{{\rm{e}}}}(1-3k)]$$. The factor of 4 is because there are $${N}_{{{\rm{e}}}}$$ pairs of parents, thus $$2{N}_{{{\rm{e}}}}$$ individuals, and $${4N}_{{{\rm{e}}}}$$ chromosomes. $${t}_{{{\rm{within}}}}$$ has probability $$k/(1-3k)$$ of being O(1), i.e. representing a rapid coalescence within the family due to consanguineous unions, and is otherwise approximately exponential as $${t}_{{{\rm{between}}}}$$.

Here, we assume that $${t}_{{{\rm{between}}}}$$ is distributed as in the standard coalescent, but with a population size trajectory scaled according to the theory as $${N}_{{{\rm{e}}}}(t)(1-3k)$$. Thus, in discrete time, it has the distribution

$$P\left({t}_{{{{\rm{between}}}}}=t\right)=\frac{1}{4{N}_{{{\rm{e}}}}\left(t\right)\left(1-3k\right)}{\mathop{\prod}\limits_{r=1}^{t-1}}\left(1-\frac{1}{4{N}_{{{\rm{e}}}}\left(\tau \right)\left(1-3k\right)}\right).$$
(2)

For $${t}_{{{{\rm{within}}}}}$$, as the approximate distribution groups together all rapid coalescence events, we used a composite approach. For $$t\le 50$$, the distribution was computed numerically by running the exact Markov chain for 100,000 iterations. For $$t \, > \, 50$$, the distribution was set to

$$P({t}_{{{{\rm{within}}}}}=t \, > \, 50)\propto \frac{1}{4{N}_{{{\rm{e}}}}(t)(1-3k)}{\mathop{\prod}\limits_{r=1}^{t-1}}\left(1-\frac{1}{4{N}_{{{\rm{e}}}}(\tau )(1-3k)}\right)$$
(3)

(as for $${t}_{{{{\rm{between}}}}}$$), where the coefficient of proportion was set such that the entire distribution was normalised. Running the exact model for only $$t \, < \,10$$ or $$t \, < \,40$$ gave almost indistinguishable results.

We defined the genomic ‘footprint’ of IBD and ROH segments as the proportion of the genome (in genetic map units) found in segments (of each type, respectively) of length between $$[{l}_{1},{l}_{2}]$$90. Given the distributions $$P({t}_{{{{\rm{between}}}}})$$ and $$P({t}_{{{{\rm{within}}}}})$$, we computed the footprint as follows. In ref. 91, Ringbauer et al. (Eq. (4) therein) showed that, in a chromosome of length $$L$$, the mean number of segments with TMRCA $$t$$ of length in the small interval $$[l,l+\Delta l]$$ (in Morgan) is:

$$E[{n}_{{{\rm{seg}}}}(l;L,t)] = 4t{e}^{-2{tl}}(1+t(L-l))\cdot \Delta l$$
(4)

The mean chromosome length covered by segments (with TMRCA $$t$$) of length in $$[{l}_{1},{l}_{2}]$$, denoted $$c({l}_{1},{l}_{2}{;L},t)$$, is:

$$E\left[c\left({l}_{1},{l}_{2}{{{\rm{;}}}}L,t\right)\right]={\int_{l_{1}}^{l_{2}}}E\left[{n}_{{{\rm{seg}}}}\left(l{{{\rm{;}}}}t\right)\right] {ldl}={e}^{-2t{l}_{1}}\left(L+2{{Ltl}}_{1}-2t{{l}_{1}}^{2}\right)-{e}^{-2t{l}_{2}}\left(L+2{{Ltl}}_{2}-2t{{l}_{2}}^{2}\right).$$
(5)

The mean length covered by segments of all TMRCAs is obtained by numerically summing over all TMRCAs, using their distributions ($${t}_{{{\rm{between}}}}$$ for IBD segments, who are between unrelated individuals, and $${t}_{{{\rm{within}}}}$$ for ROH segments):

$$E[{c}_{{{\rm{IBD}}}}({l}_{1},{l}_{2}{{{\rm{;}}}}L)]={\mathop{\sum}\limits_{t=1}^{{\infty }}}E[c({l}_{1},{l}_{2}{{{\rm{;}}}}L,t)]P({t}_{{{\rm{between}}}}=t),$$
(6)
$$E\left[{c}_{{{\rm{ROH}}}}\left({l}_{1},{l}_{2}{{{\rm{;}}}}L\right)\right]={\mathop{\sum}\limits_{t=1}^{{{\infty }}}}E\left[c\left({l}_{1},{l}_{2}{{{\rm{;}}}}L,t\right)\right]P\left({t}_{{{\rm{within}}}}=t\right).$$
(7)

Finally, the footprint is obtained by summing over all chromosomes and dividing by the total genome length,

$$E[f({l}_{1},{l}_{2})]={\mathop{\sum}\limits_{i=1}^{22}}E[c({l}_{1},{l}_{2}{{{\rm{;}}}}{L}_{i})]/\left({\mathop{\sum}\limits_{i=1}^{22}}{L}_{i}\right),$$
(8)

where $${L}_{i}$$ is the length of chromosome $$i$$, in Morgans. The code implementing this model can be accessed at https://github.com/scarmi/ibd_roh.

We set the values for $$r=({r}_{1},{r}_{2},{r}_{3},{r}_{4}{,r}_{5})$$ in two different ways:

1. 1.

Using the self-reported parental relatedness: we set $${r}_{1}=0$$ and $${r}_{5}=0$$ and estimated $${r}_{2}$$ and $${r}_{3}$$ using the fraction of mothers who declared that their parents were first cousins/first cousins once removed or second cousins, respectively. We set $${r}_{4}$$ to be the proportion of mothers who declared that their parents were related but that the relationship type was unknown or was described as ‘other blood’ or ‘other marriage’. This gave an estimate for $${{{\boldsymbol{r}}}}{{{\boldsymbol{=}}}}{{{\boldsymbol{(}}}}{{{{\boldsymbol{r}}}}}_{{{{\boldsymbol{1}}}}}{{{\boldsymbol{,}}}}{{{{\boldsymbol{r}}}}}_{{{{\boldsymbol{2}}}}}{{{\boldsymbol{,}}}}{{{{\boldsymbol{r}}}}}_{{{{\boldsymbol{3}}}}}{{{\boldsymbol{,}}}}{{{{\boldsymbol{r}}}}}_{{{{\boldsymbol{4}}}}}{{{\boldsymbol{,}}}}{{{{\boldsymbol{r}}}}}_{{{{\boldsymbol{5}}}}}{{{\boldsymbol{)}}}}$$ of (0, 0.235, 0.052, 0.155,0) for Pathan in cluster 8 and of (0, 0.425, 0.057, 0.151,0) for Jatt/Choudhry in cluster 10, giving naive kinship estimates 0.016 and 0.028, respectively.

2. 2.

Using the parental relatedness inferred with our neural net method: we calculated the parameters using the number of individuals inferred to be offspring of the following relationships, as follows:

$${r}_{1} = \frac{{\#}\,{{{\rm{sibling}}}}}{{\#}\,{{{\rm{individuals}}}}}$$
(9)
$$\,{r}_{2}=\,\frac{\#\,{{{\rm{first}}}}\,{{{\rm{cousin}}}}\,(1{{{\rm{st}}}}\,{{{\rm{generation}}}})\,+\,\#\,{{{\rm{first}}}}\,{{{\rm{cousin}}}}\,(2{{{\rm{nd}}}}\,{{{\rm{generation}}}})+\,\#\,{{{\rm{first}}}}\,{{{\rm{cousin}}}}\,(3{{{\rm{rd}}}}\,{{{\rm{generation}}}})\,}{\#\,{{{\rm{individuals}}}}}$$
(10)
$$\, {r}_{3}=\\ \,\frac{\#\,{{{\rm{first}}}}\,{{{\rm{cousins}}}}\,{{{\rm{once}}}}\,{{{\rm{removed}}}}+2\times \#\,{{{\rm{first}}}}\,{{{\rm{cousins}}}}\,(2{{{\rm{nd}}}}\,{{{\rm{generation}}}})+2\times \#\,{{{\rm{first}}}}\,{{{\rm{cousins}}}}\,(3{{{\rm{rd}}}}\,{{{\rm{generation}}}})+\,\#\,{{{\rm{second}}}}\,{{{\rm{cousins}}}}}{\#\,{{{\rm{individuals}}}}}$$
(11)
$${r}_{4}=\frac{3\times \#\,{{{\rm{first}}}}\,{{{\rm{cousin}}}}\,(3{{{\rm{rd}}}}\,{{{\rm{generation}}}})}{\#\,{{{\rm{individuals}}}}}$$
(12)

This is because, e.g. someone who is the offspring of first cousins and whose parents and grandparents were also offspring of first cousins is also the offspring of second cousins by two routes, and third cousins by three routes. Note that it is thus possible to have the sum of the components of r being >1. The model still holds as long as the kinship satisfies $$k \, < \, 1/4$$. For Jatt, we obtained the following estimates:

$${r}_{1}=0.02341137$$
$${r}_{2}=0.08361204+0.14053512+0.28093645 = 0.5050836$$
$${r}_{3}=0.13043478+2\times 0.14053512+2\times 0.28093645+0.23411371 = 1.207492$$
$${r}_{4} = 3\times 0.28093645 = 0.8428094$$

For Pathan, we obtained the following estimates:

$${r}_{1}=0.007389671$$
$${r}_{2}=0.111455399+0.079812207+0.126150235 = 0.3174178$$
$${r}_{3}=0.111455399+2\times 0.079812207+2\times 0.126150235+0.286384977 = 0.8097653$$
$${r}_{4} = 3\times 0.126150235 = 0.37845069$$

To investigate the effect of a sudden change in consanguinity rates at a certain point in the past, we tried reducing the kinship value to 1 × 10−4 at the time the distribution switched to the approximate model ($$t=50$$). However, this proved to have only very subtle effects on the expected ROH and IBD segment footprint (Supplementary Fig. 17), so we could not evaluate this possibility with the observed data.

For the Ne trajectory, we used a constant value of Ne for Supplementary Fig. 15. For the other analyses, we used trajectory estimated with IBDNe for the last 50 generations (using the point estimate for Fig. 5 and the bounds of the 95% confidence interval for Supplementary Fig. 16), followed by a constant value for $$t \, > \,50$$ (the IBDNe estimate at $$t=50$$). [We divided the IBDNe population sizes by 2 before plugging them into the model, as the model assumes $${N}_{{{\rm{e}}}}$$ is the number of reproducing pairs.] We note that by using IBD data alone, IBDNe estimates $${N}_{{{\rm{e}}}}(t)(1-3k)$$ rather than $${N}_{{{\rm{e}}}}(t)$$. Thus, given the IBDNe inferred trajectory $${\hat{N}}_{{{\rm{e}}}}(t)$$, we set

$$4{N}_{{{\rm{e}}}}(t)(1-3k)={\hat{N}}_{{{\rm{e}}}}(t)$$
(13)

in the equations above.

To compute the ROH footprint in the real data, we averaged the total lengths of ROH segments (within each length interval) over all individuals in the group, and divided by the number of individuals. For the IBD footprint, we averaged the total lengths of IBD segments over all pairs of individuals in the group, and divided by $$2n(2n-2)/2$$ (where $$n$$ is the number of individuals), which is the number of chromosome pairs in different individuals.

We restricted the real data analysis to ROHs and IBD segments greater than 5 cM, since we suspected that segments shorter than this could not be called reliably with the CoreExome chip data, and less than 30 cM, since there were few ROHs longer than this so the average footprint became very noisy. We used 1 cM segment length intervals for both ROH and IBD, where each data point was plotted at the beginning of each interval. The empirical IBD footprint was plotted using the IBDseq calls filtered as described above, since these calls were used as input for IBDNe.

### Quality control of exome-sequence data

WES data were generated and mapped using BWA-MEM92 to the GRCh37+decoy reference genome used by the 1000 Genomes project (ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz). Variant calling was carried out with GATK HaplotypeCaller93 restricting to the Agilent V5 exome bait regions ± a 100-bp window on either end as described in Narasimhan et al.14. Two thousand three hundred and eleven samples were generated as part of this previous publication and had a median on-target coverage of 38×, and 473 additional samples were sequenced later to a median coverage of 50×. This gave a dataset of 2784 individuals, of which 2484 were Pakistani. We excluded variants that failed these criteria, based on standard GATK annotations:

• SNPs: QD < 2, FS > 60, MQ < 40, MQRankSum < −12.5, ReadPosRankSum < −8 or Hardy–Weinberg equilibrium p value <1 × 10−6.

• Indels: QD < 2, FS > 200, ReadPosRankSum < −20 or Hardy–Weinberg Equilibrium p value <1 × 10−6.

We set genotypes to missing if they had genotype quality <20, allelic depth binomial p value <0.001 or depth <7, and then removed variants with more than 20% missingness. This resulted in a dataset of 1,931,299 variants (1,895,447 SNPs and 35,852 indels). The variants were annotated with the Variant Effect Predictor (version 95) using the LOFTEE plugin.

### Simulations to assess relative risk incurred by endogamy and consanguinity on recessive disease

We began by identifying P/LP variants in autosomal recessive developmental disorder genes from the Developmental Disorder Gene-to-Phenotype (DDG2P) list. We focused on protein-truncating variants annotated as ‘high confidence’ by LOFTEE that fell in a gene with a loss-of-function mechanism according to DDG2P (915 genotypes at 561 variants in 298 genes), and missense/inframe variants that were classed as pathogenic in ClinVar with two stars and above (264 genotypes at 88 variants in 62 genes).

We then created all possible pairs of exome-sequenced individuals in the dataset, and scored them as ‘at risk’ if both individuals in the pair carried a P/LP variant in the same gene58,94. For a given pair of individuals i and j, we write $${s}_{i,j}=1$$ if they are at risk, and $${s}_{i,j}=0$$ otherwise. We compared the average $${s}_{i,j}$$ (i.e. the fraction of at-risk couples) between couples that involved two unrelated (i.e. more distant than third-degree) individuals from the same biraderi (intra-biraderi) versus two unrelated individuals from different biraderi groups (one from an index biraderi and the other from any other biraderi). For this analysis, we used the self-reported biraderi information to define the groups, rather than the results from the fineSTRUCTURE analysis, since we wanted to mimic the choices that individuals presumably make when selecting a partner, which are based on self-reported identity. We restricted to groups with at least 50 individuals. We calculated the variance of the fraction of at-risk couples as follows. For couples involving two individuals i and j from the same biraderi of size n, we calculated:

$${{{\rm{Var}}}}\left(\frac{{\sum }_{i\ne j}{s}_{i,j}}{\frac{n(n-1)}{2}}\right)=\frac{\frac{n(n-1)}{2}{{{\rm{Var}}}}({s}_{i,j})+n(n-1)(n-2){{{\rm{Cov}}}}({s}_{i,j},{s}_{i,k})}{{\left(\frac{n(n-1)}{2}\right)}^{2}}$$
(14)

This equation assumed that $${{{\rm{Cov}}}}({s}_{i,j},{s}_{k,l})=0$$, i.e. no correlation in the risk status of distinct couples. The terms $${{{\rm{Var}}}}({s}_{i,j})$$ and $${{{\rm{Cov}}}}({s}_{i,j},{s}_{i,k})\,$$in the above equation were computed empirically from the data. The variance term was computed as the sample variance of all possible values of $${s}_{i,j}$$. The covariance term was computed as the sample covariance between all pairs of terms $${s}_{i,j}$$ and $${s}_{i,k}$$ that share an index (in other words, all pairs of the risk status of one individual with two other potential mates). For couples involving two individuals, one of whom (i) comes from a given biraderi of size n1, and the other of whom (j) comes from a different biraderi (total sample size n2), we similarly calculated:

$${{{\rm{Var}}}}\left({\frac{{\mathop{\sum}\nolimits_{i=1}^{n_{1}}}{\mathop{\sum}\nolimits_{j=1}^{n_{2}}}s_{i,j}}{{n_{1}}{n_{2}}}}\right)= \,\frac{1}{({n_{1}n_{2}})^2}{{{\rm{Var}}}}\left({{\mathop{\sum}\limits_{i=1}^{n_{1}}}{\mathop{\sum}\limits_{j=1}^{n_{2}}}s_{i,j}}\right)\\ = \, \frac{{n_{1}}{n_{2}}{{{\rm{Var}}}}(s_{i,j})+{{n_{1}}{n_{2}}}({n_{1}}+{n_{2}}-2){{{\rm{Cov}}}}(s_{i,j},s_{i,k}) }{({n_{1}}{n_{2}})^2}$$
(15)

To obtain an empirical p value for the (positive) difference between the average $${s}_{i,j}$$ for intra- versus inter-biraderi couples, we permuted the biraderi labels 10,000 times, re-estimated the mean $${s}_{i,j}$$ for each permutation, and calculated the fraction of permutations for which the difference $${{{\rm{mean}}}}(s_{i,j,{{{\rm{intra}}}}}) - {{{\rm{mean}}}}(s_{i,j,{{{\rm{inter}}}}})$$ was greater than or equal to the observed value.

To estimate risk for first cousin unions, we randomly sampled 150 pairs of individuals 1000 times, matching their kinship (KING-PropIBD) distribution to the observed kinship distribution for BiB Pakistani husband–wife pairs who self-declared as being first cousins (as seen in Supplementary Fig. 2c), in eight bins of unit size 0.05. For example, the proportion of self-reported first cousins with PropIBD between 0.1 and 0.15 was 20/75 (26.67%), so we sampled 40 individuals amongst those 150 randomly chosen pairs to have PropIBD in this range. Each individual was only included in one pair in a given iteration. The variance for the simulated first cousin pairs was calculated over these 1000 sets of bootstrapped couples.

This analysis makes several assumptions. The most important is that the variants included in the analysis are the only ones that could cause autosomal recessive disorders in this population. This is clearly not true, because we are only considering variants for which there is strong evidence that they are likely to be pathogenic (e.g. excluding novel missense variants). This means that we are underestimating the absolute risk of recessive disorders in this population. However, since these biases should affect all biraderi groups equally, it is likely that our estimates of relative risk for intra- versus inter-biraderi unions are valid. Our analysis also assumes that these variants are fully penetrant when present in a homozygous or compound heterozygous state, and that the random pairing of unrelated individuals accurately recapitulates patterns of marital choices in this community.

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.