Abstract
Clinical classification is essential for estimating disease prevalence but is difficult, often requiring complex investigations. The widespread availability of population level genetic data makes novel genetic stratification techniques a highly attractive alternative. We propose a generalizable mathematical framework for determining disease prevalence within a cohort using genetic risk scores. We compare and evaluate methods based on the means of genetic risk scores’ distributions; the Earth Mover’s Distance between distributions; a linear combination of kernel density estimates of distributions; and an Excess method. We demonstrate the performance of genetic stratification to produce robust prevalence estimates. Specifically, we show that robust estimates of prevalence are still possible even with rarer diseases, smaller cohort sizes and less discriminative genetic risk scores, highlighting the general utility of these approaches. Genetic stratification techniques offer exciting new research tools, enabling unbiased insights into disease prevalence and clinical characteristics unhampered by clinical classification criteria.
Introduction
The development and refinement of polygenic analysis techniques is greatly increasing our understanding of many diseases. Using polygenic risk has allowed insights into disease etiology and, through Mendelian randomization, evaluation of causality^{1}. Clinically, capturing polygenic susceptibility through genetic risk scores (GRS) can be used to identify individuals at the highest risk of a disease^{2,3,4}. This paper concentrates on an innovative use of polygenic risk: genetically estimating disease prevalence (the proportion of individuals with and without a disease) within a cohort. Currently, estimating disease prevalence is difficult as it requires robust clinical classification. Disease-specific investigations are rarely available in population-level data, and inaccuracies associated with self-reported diagnosis are well recognized^{5,6}. Given the increasing availability of population-level genetic data, novel polygenic estimates of disease prevalence are an extremely attractive alternative.
The basis of genetically determining disease prevalence is fundamentally that the distribution of a specific disease GRS within a cohort will reflect the mixture of GRS of those with the disease (cases) and those without (noncases). This mixture GRS distribution will lie between the reference groups of cases and noncases and will reflect the relative proportion of cases to noncases (Fig. 1a). The location of the mixture cohort's GRS distribution in comparison to the GRS distributions of known cases and noncases allows the respective proportion of each group to be determined, providing a genetic estimate of disease prevalence. Furthermore, the genetically calculated proportion of a disease within a cohort offers the additional benefit that associated clinical features of the genetically defined disease group can be determined. It is worth emphasising that in almost all polygenic risk situations, even those at the highest genetic risk are unlikely to develop the relevant disease, and therefore this concept does not remain valid at an individual level. Nonetheless, at a group level the average GRS will be higher in a cohort with disease versus those without.
In this paper we assess the performance and utility of polygenic stratification as a tool for determining disease prevalence. Through simulated scenarios and real-world data, we evaluate different mathematical techniques for determining disease prevalence based on the GRS distribution within a cohort. The generalizability and robustness of genetic stratification were investigated through a systematic evaluation of the cohort characteristics required for estimates to remain robust; specifically, the impact on performance of disease prevalence, mixture cohort size and the strength of genetic predisposition for a disease. Finally, in order to highlight the utility of the proposed framework, we apply our methodologies in the context of identifying the prevalence of undiagnosed coeliac disease within a cohort adhering to a gluten-free diet.
Genetic stratification summary
We present three methods developed to estimate the proportions of cases and noncases in an unknown mixture cohort using GRS distributions and compare them with the published Excess approach^{7}. The methods' performance characteristics are evaluated over clinically relevant parameter ranges using GRS for type 1 diabetes (T1D), type 2 diabetes (T2D) and coeliac disease, as well as synthetic data. Clinical sample sets were taken from the following cohorts: T1D (n = 1,963) and T2D (n = 1,924) from the Wellcome Trust Case Control Consortium (WTCCC)^{8}; coeliac disease reference cases (n = 12,018) from a combination of European studies^{9}; with noncases (controls) and mixture (gluten-free diet) cohorts from UK Biobank (n = 12,000 and n = 12,757, respectively)^{10}.
To compare the methods under different conditions, the T1DGRS data were split in half to form reference cohorts and an independent holdout set for generating parameterised mixture cohorts (Figs. 2 and 3). In these analyses, mixtures were constructed by sampling with replacement, enabling larger mixture sizes to be used than the size of the holdout sets from which they were derived^{11,12}. Synthetic data sets were also constructed from Gaussian distributions of equal standard deviations (set to 1) but different means (see Table 1 and Fig. 4). For the reference cohort of cases, \({{{{{{{{\rm{R}}}}}}}}}_{{{{{{{{\rm{C}}}}}}}}}\), the mean of the generating distribution was always 0 while the mean for the noncases cohort, \({{{{{{{{\rm{R}}}}}}}}}_{{{{{{{{\rm{N}}}}}}}}}\), was systematically varied in order to investigate the effect of differences in discriminability signified by the area under the curve (AUC) of the GRS distribution. For further details of the data sets, see “Methods”.
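The relationship between discriminability and mean separation used here can be made concrete: for two unit-variance Gaussians, AUC = Φ(Δμ/√2), so the mean separation needed to hit a target AUC is Δμ = √2·Φ⁻¹(AUC). The sketch below illustrates this synthetic construction; the function name and defaults are ours, not code from the study.

```python
import numpy as np
from scipy.stats import norm

def synthetic_references(auc, n=10000, rng=None):
    """Generate synthetic case/non-case GRS cohorts with a target AUC.

    For two unit-variance Gaussians, AUC = Phi(d / sqrt(2)) where d is the
    difference in means, so d = sqrt(2) * Phi^{-1}(AUC).
    """
    rng = np.random.default_rng(rng)
    d = np.sqrt(2.0) * norm.ppf(auc)  # mean separation for the target AUC
    r_c = rng.normal(0.0, 1.0, n)     # cases: generating mean fixed at 0
    r_n = rng.normal(d, 1.0, n)       # non-cases: mean varied with AUC
    return r_c, r_n
```

Sampling mixtures with replacement from such references then allows mixture sizes larger than the holdout sets, as described above.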
In each method, two cohorts consisting of the GRS of individuals with and without a particular polygenic disease were taken as references, denoted \({{{{{{{{\rm{R}}}}}}}}}_{{{{{{{{\rm{C}}}}}}}}}\) (the reference cohort of cases) and \({{{{{{{{\rm{R}}}}}}}}}_{{{{{{{{\rm{N}}}}}}}}}\) (the reference cohort of noncases). The proportions of individuals from these reference cohorts (denoted \({p}_{{{{{{{{\rm{C}}}}}}}}}\) and \({p}_{{{{{{{{\rm{N}}}}}}}}}\) respectively) who comprise an unknown mixture cohort (\({\widetilde{{{{{\rm{M}}}}}}}\)) were estimated based on the properties of the reference cohorts. When only one proportion is mentioned, this is \({p}_{{{{{{{{\rm{C}}}}}}}}}\) (i.e. relative to the reference cohort of cases, \({{{{{{{{\rm{R}}}}}}}}}_{{{{{{{{\rm{C}}}}}}}}}\)), unless otherwise stated. The cohort characteristics used are dependent upon the particular method employed as illustrated in Fig. 1 and are detailed below.
Throughout this paper, we assume that the unknown mixture cohort is composed solely of samples that come from the two reference cohorts (blue and red dots in Fig. 1). In practice, this means that \(p_{\mathrm{N}}\) (prevalence of noncases) and \(p_{\mathrm{C}}\) (prevalence of cases) sum to one, \(p_{\mathrm{N}}+p_{\mathrm{C}}=1\), and accordingly the proportion of noncases was calculated as \(p_{\mathrm{N}}=1-p_{\mathrm{C}}\). Furthermore, the presented Earth Mover's Distance (EMD) and Kernel Density Estimation (KDE) methods make it possible to check whether this assumption is satisfied. We revisit details of such checks in the discussion and supplementary information.
Finally, our methods are all based on the assumption that between the reference and mixture cohorts, cases and noncases are genetically equivalent. This assumption must hold true for estimates to be valid and becomes less certain if the mixture cohort is derived from a different population than those used for reference. For this reason, we recommend these methods should be used to estimate disease prevalence within a subset of a population where reference cases and noncases can be derived from the same population, for example, UK Biobank. This does not completely exclude using reference cohorts derived from different datasets, particularly where robust disease cases may be difficult to define^{7}, but in this context, extreme caution should be exercised prior to applying the methods and in interpreting the generated estimates. Using reference cohorts from a different population from the mixture should only be undertaken following close examination of the selection criteria and demographics of the reference and mixture cohorts to ensure equivalence. This is of particular importance when studying different geographical populations, where allele frequencies are known to vary^{13,14}. Accordingly, in this manuscript all analyses are restricted to white Europeans, the population from which the reference GRS distributions were derived. Where possible, the GRS of the reference noncases (controls) and cases should be compared with the GRS of known noncases and cases within the same population the mixture has been taken from. This could be done, for example, by means of a statistical test appropriate for the assessment of the observed GRS distributions. An example of the importance of this, and how it can be detected, is demonstrated by the T2DGRS for a reference T2D population from the WTCCC^{8}. The WTCCC cohort was largely selected based on a positive family history of T2D or early disease onset and is therefore enriched for T2D risk variants.
As shown in Supplementary Fig. 1, the distribution of T2DGRS of unselected T2D cases from population data in UK Biobank is significantly lower than the T2DGRS in the WTCCC T2D reference. The T2DGRS in UK Biobank population T2D cases only becomes equivalent to the WTCCC when the same case selection criteria are mirrored. If this WTCCC cohort were used as the reference T2D population when evaluating the prevalence of T2D in a UK Biobank cohort, it would have reduced the accuracy of estimates, since it does not constitute a representative T2D cohort.
The Excess method
This estimates the proportion from the number of excess disease cases above the mixture cohort’s median score compared to the equal numbers expected in a pure control cohort (Fig. 1b). We illustrate the method as introduced in ref. ^{7}.
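One plausible formalisation of this description (our reading of the idea, not the authors' published code) counts mixture members above the noncase reference median: a pure noncase cohort would contribute equal numbers on either side of it, so the surplus above is attributed to cases, assuming cases tend to have higher scores.

```python
import numpy as np

def excess_estimate(mixture, r_n):
    """Excess method (illustrative formalisation of ref. 7).

    In a pure non-case cohort, equal numbers fall above and below the
    non-case reference median; surplus individuals above the median are
    attributed to cases (assumes cases have higher GRS on average).
    """
    median_n = np.median(r_n)
    n_above = np.sum(mixture > median_n)
    n_below = np.sum(mixture <= median_n)
    return float(n_above - n_below) / len(mixture)
```

Because some true cases fall below the reference median, this estimator systematically undershoots, consistent with the bias discussed later in the paper.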
The Means method
This compares the mean GRS of the mixture cohort to the means of the two reference cohorts and estimates the mixture proportion according to the normalised difference between the two (Fig. 1c).
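As described, this reduces to linear interpolation between the two reference means. A minimal sketch (clipping to [0, 1] is our addition, one simple way to keep estimates within the valid range):

```python
import numpy as np

def means_estimate(mixture, r_c, r_n):
    """Means method: place the mixture mean on the line between the two
    reference means and read off the case proportion,
        p_C = (mean(M) - mean(R_N)) / (mean(R_C) - mean(R_N)).
    """
    p_c = (np.mean(mixture) - np.mean(r_n)) / (np.mean(r_c) - np.mean(r_n))
    return float(np.clip(p_c, 0.0, 1.0))  # sampling noise can stray outside [0, 1]
```

This estimator requires the reference means to differ; when they are very similar the denominator vanishes, which is why the EMD and KDE methods are preferred in that situation (Supplementary Fig. 7).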
The Earth Mover’s Distance (EMD) method
This uses the weighted cost of transforming the mixture distribution into each reference distribution (more formally, the integral of the absolute difference between the cumulative distribution functions, i.e., the area between the curves). This method allows \(p_{\mathrm{N}}\) and \(p_{\mathrm{C}}\) to be computed independently (Fig. 1d) and so provides a way to validate the assumption that the mixture is composed solely of samples from the two reference cohorts, \({\hat{p}}_{\mathrm{N}}+{\hat{p}}_{\mathrm{C}}=1\); if the sum is significantly different from 1, then the assumption is not satisfied. In this study, we use the mean of the two estimates \(p_{\mathrm{C}}^{\mathrm{EMD}}\) and \(1-p_{\mathrm{N}}^{\mathrm{EMD}}\) as the estimate \({\hat{p}}_{\mathrm{C}}\).
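Under the mixture assumption the mixture CDF is exactly the convex combination \(F_{\mathrm{M}} = p_{\mathrm{C}}F_{\mathrm{C}} + p_{\mathrm{N}}F_{\mathrm{N}}\), so the EMD from the mixture to each reference scales linearly with the opposite proportion. A sketch using SciPy's one-dimensional Wasserstein distance (the normalisation by the inter-reference distance is our illustration of the principle):

```python
import numpy as np
from scipy.stats import wasserstein_distance  # 1-D EMD

def emd_estimates(mixture, r_c, r_n):
    """EMD method: since F_M - F_N = p_C * (F_C - F_N) pointwise, the EMD
    from the mixture to each reference is proportional to the opposite
    proportion. Returns independent estimates (p_C_hat, p_N_hat); their sum
    deviating from 1 flags a violated mixture assumption.
    """
    d_refs = wasserstein_distance(r_c, r_n)  # distance between the references
    p_c_hat = wasserstein_distance(mixture, r_n) / d_refs
    p_n_hat = wasserstein_distance(mixture, r_c) / d_refs
    return float(p_c_hat), float(p_n_hat)
```

With finite samples the two estimates carry independent noise, which is why the text averages \(p_{\mathrm{C}}^{\mathrm{EMD}}\) and \(1-p_{\mathrm{N}}^{\mathrm{EMD}}\).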
The Kernel Density Estimation (KDE) method
This method fits a smoothed template to each reference distribution (by convolving each sample with a Gaussian kernel) and builds a model of the mixture as a weighted sum of these two templates. The method then adjusts the proportion of these templates with the Levenberg–Marquardt (damped least squares) algorithm until the sum optimally fits the mixture distribution (Fig. 1e), noting that the algorithm could find one of the potentially several local minima. In other words, the method finds (one of) the linear (convex) combination(s) of the reference distributions that best fits the mixture distribution.
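An illustrative sketch of this fitting procedure, using SciPy's Gaussian KDE and Levenberg–Marquardt least squares; the grid resolution and default bandwidths are our choices and not necessarily those used in the study:

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.optimize import least_squares

def kde_estimate(mixture, r_c, r_n):
    """KDE method (sketch): model the mixture density as a weighted sum of
    smoothed reference templates and fit the weight by damped least squares.
    """
    kde_c, kde_n = gaussian_kde(r_c), gaussian_kde(r_n)
    kde_m = gaussian_kde(mixture)
    grid = np.linspace(min(mixture.min(), r_c.min(), r_n.min()),
                       max(mixture.max(), r_c.max(), r_n.max()), 200)
    target = kde_m(grid)  # smoothed mixture distribution to be matched

    def residuals(w):
        # weighted sum of the two templates minus the observed mixture
        return w[0] * kde_c(grid) + (1.0 - w[0]) * kde_n(grid) - target

    fit = least_squares(residuals, x0=[0.5], method='lm')  # Levenberg-Marquardt
    return float(np.clip(fit.x[0], 0.0, 1.0))
```

As noted above, the optimiser may settle in a local minimum, so in practice the fit can be restarted from several initial weights.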
Results
Performance of genetic stratification
We start by using the T1D GRS (AUC = 0.88^{2}) to evaluate the performance of all four methods on artificially constructed (synthetic) mixtures. The mixtures are generated by sampling with replacement from half of the reference data (holdout subset), to ensure the reference and mixture cohorts are independent and identically distributed (for details see “Methods”). Figure 2 demonstrates that genetic stratification allows robust estimates of disease prevalence (proportion of cases to noncases) around known values. The accuracy (defined as deviation from the true proportion) and precision (defined as confidence interval width) of estimates depend on the following variables: the proportion of cases and noncases within the mixture, the mixture size and the discriminative ability of the GRS. For each method, we describe how each of these variables affects the accuracy and/or precision of prevalence estimates.
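The mixture construction and uncertainty quantification described above can be sketched as follows; this is a simplified illustration of the sampling-with-replacement and percentile-bootstrap steps, and the study's actual Monte Carlo procedure may differ in detail:

```python
import numpy as np

def build_mixture(cases, noncases, p_c, size, rng):
    """Sample a parameterised mixture with replacement from holdout data,
    so the mixture can be larger than the holdout sets it is drawn from."""
    n_c = rng.binomial(size, p_c)  # number of cases in this mixture draw
    return np.concatenate([rng.choice(cases, n_c, replace=True),
                           rng.choice(noncases, size - n_c, replace=True)])

def bootstrap_ci(mixture, estimator, n_boot=1000, alpha=0.05, rng=None):
    """Percentile bootstrap confidence interval for any prevalence
    estimator applied to the mixture (a sketch of the resampling step)."""
    rng = np.random.default_rng(rng)
    stats = [estimator(rng.choice(mixture, len(mixture), replace=True))
             for _ in range(n_boot)]
    return np.quantile(stats, [alpha / 2.0, 1.0 - alpha / 2.0])
```

Repeating `build_mixture` over a grid of proportions and sizes, and wrapping each estimate with `bootstrap_ci`, reproduces the kind of sweep summarised in Figs. 2 and 3.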
What is the impact of the proportion of cases to controls in the mixture cohort?
In all methods except the Excess, away from extremes of proportion, varying the proportion of cases to controls has no impact on the accuracy or precision of prevalence estimates (Fig. 2). Using heat maps, we illustrate the combined effect of gradually changing both proportion and mixture size on the accuracy of estimates (Fig. 3 and Supplementary Fig. 2). At extremes of proportion, accuracy reduces significantly, with estimates tending to underestimate at high proportions and overestimate at low proportions. Increasing sample size reduces the extent to which proportions are classed as extremes, thereby improving accuracy and precision for estimating the prevalence of rarer diseases. This is demonstrated in Fig. 2: a mixture size of 500 gives imprecise estimates around a 10% disease prevalence (proportion 0.1), with confidence intervals that include zero. Increasing the mixture size to 5,000 significantly improves the precision around the same 10% prevalence, allowing a meaningful estimate of disease prevalence.
What is the impact of the size of the mixture cohort?
With all but the smallest cohort sizes, prevalence estimates remain valid. Not surprisingly, increasing cohort size leads to an improvement in the precision of estimates (Figs. 2 and 3), because larger cohorts represent the characteristics of the reference distributions more accurately. Where larger mixture cohort sizes are not possible, Fig. 3 clearly demonstrates that for all methods except the Excess, accurate, albeit less precise, estimates of disease prevalence can still be achieved with lower case numbers. Figure 2 shows that using a T1DGRS and a mixture of just 500 cases can still provide accurate and clinically informative estimates around a disease prevalence of 40%, e.g., determining the prevalence of T1D in diabetes cases rapidly requiring insulin (clinical PPV of ≈50% for identifying T1D^{15}).
How predictive does a GRS need to be?
Accuracy and precision of estimates for all four methods reduce when using a less discriminative GRS. However, excluding the Excess method, robust estimates of proportion are possible even when using a GRS with an AUC around 0.6 or above. This is demonstrated in Fig. 4, where we create artificial GRS with the area under the ROC curve (AUC) varying from completely non-discriminative (AUC = 0.5) to fully discriminative (AUC = 1). Reducing GRS AUC leads to widening of confidence intervals around evaluated disease prevalences of 10% and 25%. The reduction in precision can be entirely mitigated by increasing the mixture cohort size. This is emphasised by Table 1, which shows the minimum mixture size required to give an estimated precision of 0.1 (\(\mathrm{CI}_{\mathrm{U}}-\mathrm{CI}_{\mathrm{L}}\)) around a prevalence of 0.1 with increasing AUC. For instance, using the EMD method, a mixture size of 25,500 (2,550 cases and 22,950 noncases) and an AUC of 0.6 allows robust precision around a 10% disease prevalence. A real-world clinical example is shown in Supplementary Fig. 5, accurately estimating the proportion of T2D cases in participants with self-reported glaucoma in UK Biobank using a less discriminative GRS (T2DGRS AUC 0.65, calculated in this study).
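For reference, the discriminative ability quoted for a GRS is its empirical AUC, which can be computed directly from case and noncase scores via the rank-based Mann–Whitney statistic: the probability that a randomly chosen case scores higher than a randomly chosen noncase. This helper is our illustration, not code from the study.

```python
import numpy as np
from scipy.stats import rankdata

def grs_auc(cases, noncases):
    """Empirical AUC of a GRS: P(random case scores above random non-case),
    with ties counted as 1/2, computed from the Mann-Whitney U statistic."""
    scores = np.concatenate([cases, noncases])
    ranks = rankdata(scores)                       # average ranks handle ties
    n_c, n_n = len(cases), len(noncases)
    u = ranks[:n_c].sum() - n_c * (n_c + 1) / 2.0  # U statistic for the cases
    return u / (n_c * n_n)
```

An AUC of 0.5 corresponds to no discrimination and 1.0 to perfect separation, matching the axis of Fig. 4.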
What is the relative performance of the different methods?
We find that the Means, KDE and EMD methods perform well in estimating prevalence. Their accuracy and precision are largely comparable and all outperform the Excess method. The Excess method demonstrates reduced performance and exhibits strong bias (difference between the estimated prevalence \({\hat{p}}_{{{{{{{{\rm{C}}}}}}}}}\) and the median of the bootstrap values \({p}_{{{{{\rm{C}}}}}}^{\prime}\)) typically underestimating the true prevalence. Figure 3 shows that regardless of the mixture size, the Excess method is practically unusable for any but the highest AUC. The relatively comparable performance of the Excess method in Fig. 2 is a consequence of the high AUC of the T1DGRS (0.88^{2}) and the strong asymmetry of the reference distributions.
Clinical example estimating prevalence of coeliac disease
Finally, we illustrate a worked example asking how much undiagnosed coeliac disease is present within a population adhering to a gluten-free diet (Fig. 5), using a coeliac disease GRS (CDGRS). This is important because, whilst people observe a gluten-free diet for a number of reasons, it is possible that, without obtaining a formal diagnosis, people with undiagnosed coeliac disease eliminated gluten from their diet by trial and error to alleviate abdominal symptoms. For each method we: (1) compute an estimate of prevalence; (2) use modelled mixtures and bootstrapping to calculate its confidence intervals. All methodologies provide estimates of the proportion of individuals with coeliac disease, with their 95% CIs shown in square brackets: Excess = 15.0% [13.4%, 17.7%]; Means = 15.1% [13.5%, 16.6%]; EMD = 15.1% [13.5%, 16.7%]; KDE = 13.2% [11.6%, 14.7%]. In this same UK Biobank population^{10} adhering to a gluten-free diet, 13.9% of individuals were known coeliac cases (self-reported or ICD-10 code; see “Methods” for details). Our results suggest an absence of undiagnosed coeliac disease among individuals adhering to a gluten-free diet who are not known to have the condition.
Discussion
We present analysis of a novel approach to disease classification based on genetic predisposition. We demonstrate that genetic stratification produces robust prevalence estimates even in the context of rarer diseases, smaller cohort sizes and less discriminative GRS, highlighting the general utility of the proposed approaches. This was demonstrated through head-to-head evaluation of four methods, including the original Excess methodology published by Thomas et al.^{7}. The presented examples illustrate the performance and utility of these methods across a range of different scenarios, highlighting the improved accuracy of the new approaches over the original Excess method. We supplemented the estimation methods by combining Monte Carlo^{11} sampling and bootstrap^{12} methods to quantify uncertainty around the estimate and compute realistic confidence intervals.
Distribution of GRS can be used to estimate disease prevalence within a cohort
Our results show that robust estimates of prevalence are possible using differences in distributions of GRS between cohorts of cases and noncases. Our methods build on the previously published genetic stratification by Thomas et al.^{7}. This novel concept is important, as when coupled to the everincreasing availability of populationlevel genetic datasets, it allows fresh insights into disease epidemiology without requiring extensive investigations or unreliable selfreported diagnosis^{5,6}. The permanence associated with genetic risk makes these methods potentially very powerful tools for clinical researchers and enables accurate evaluation where cases are difficult to differentiate clinically.
Rare diseases and small mixture cohorts can be evaluated
Accurate estimates were possible with mixture cohorts containing as few as 500 individuals, and away from extremes of proportion, disease prevalence had little impact. Precision around estimates improved with increasing cohort size. Larger mixture cohorts, readily achievable in modern-day population datasets (UK Biobank has genotyped ≈500,000 individuals^{10}), almost entirely mitigated the reduced precision observed when using a less discriminative GRS (lower AUC). When disease prevalence is extremely low, robust estimates can still be achieved through mixture enrichment. This enrichment will inevitably come at the cost of smaller mixture sizes, but because proportions are moved away from extremes, accuracy is still improved in this situation.
Estimates remain robust in diseases with less discriminative GRS
Whilst accuracy and precision are higher when utilising a more discriminative GRS, we show that clinically meaningful estimates can still be obtained using a GRS with an AUC as low as 0.6. While in theory our methods could be used in diseases with minimal genetic predisposition (AUC < 0.6), our analysis suggests extreme caution in these scenarios, as extremely large mixture sizes would be required to generate any clinically meaningful confidence around estimates. The performance of the Means, EMD and KDE methods is very good in the case of normal GRS distributions with equal standard deviations, e.g., diseases with polygenic risk arising from a large number of causal variants, each with tiny effects, such as T2D. In diseases where certain variants predominate, e.g., HLA in autoimmune disease, the GRS will be skewed to account for this, e.g., T1D. In this instance, the EMD and KDE methods will be more accurate, as they are able to utilise the unequal skewness (or other properties such as standard deviations or kurtosis) even when the means of the reference distributions are close, see Supplementary Fig. 7. In diseases where one variant has the predominant effect on genetic risk, e.g., HLA-DQ in coeliac disease, it might be possible to estimate prevalence using just this variant. However, previous work has shown that a GRS including the predominant variant as well as smaller-effect variants has better discriminative ability than the predominant variant alone^{9}.
Different methods have different advantages
In most settings, the best approaches are the Means, EMD and KDE methods. The overall performance of these three methods is comparable across different parameters (mixture size, mixture proportional makeup and GRS AUC). At extreme proportions, the KDE method exhibits the smallest bias. A key advantage of the Means method is that it is very straightforward to apply, allowing rapid evaluation of disease prevalence within a cohort. Alternatively, the EMD and KDE methods have the benefit of being able to estimate the prevalence in cases where the Means method cannot be used, e.g., if the reference cohorts have very similar means (Supplementary Fig. 7). Finally, the KDE and EMD methods can be used to test the assumption that the mixture is only composed of two cohorts (Supplementary Note 2).
As noted in the original article by Thomas et al.^{7}, the Excess method inherently underestimates the proportion of cases because typically both reference cohorts have values below the median value of \({\mathrm{R}}_{\mathrm{N}}\). Taking distinct approaches, the new methods eliminate this inaccuracy and, even with decreasing genetic discrimination, remain interpretable, reflecting their improved generalizability. We note that the Excess method could be modified to improve its accuracy (e.g., by choosing another quantile rather than the median), but these changes would require case-by-case fine-tuning and would at best achieve equivalence to the proposed alternative methods.
Utility of using polygenic approaches to estimate prevalence within a group
Prevalence
We highlight the clinical utility of the presented concept with a clinical question around the prevalence of potentially undiagnosed coeliac disease within a cohort adhering to a gluten-free diet. This question would be unanswerable using the traditional clinical approach of endoscopy to confirm coeliac disease, as findings are often normal once a gluten-free diet is being observed^{16}. We showed that the prevalence of coeliac disease determined genetically and reported clinically were comparable, suggesting that there is no undiagnosed coeliac disease within this gluten-free cohort. Whilst this finding is not unexpected, it could not be robustly shown before, and it highlights the general applicability of the proposed framework to quantitatively answer novel and difficult-to-answer questions.
Defining clinical characteristics of a genetically defined subgroup
A further advantage of the proposed methodologies over traditional clinical classification arises from the fact that clinical characteristics are not used to define cases. It is therefore possible to estimate both binary and continuous clinical characteristics of the genetically defined disease group within the mixture cohort. Using BMI as an example:

$$\bar{x}_{\mathrm{C}}^{\mathrm{BMI}}=\frac{\bar{x}_{\mathrm{M}}^{\mathrm{BMI}}-{\hat{p}}_{\mathrm{N}}\,\bar{x}_{\mathrm{N}}^{\mathrm{BMI}}}{{\hat{p}}_{\mathrm{C}}}$$

where \({\hat{p}}_{\mathrm{N}}\) and \({\hat{p}}_{\mathrm{C}}\) represent the estimated proportions and \(\bar{x}_{\mathrm{N}}^{\mathrm{BMI}}\), \(\bar{x}_{\mathrm{M}}^{\mathrm{BMI}}\) and \(\bar{x}_{\mathrm{C}}^{\mathrm{BMI}}\) represent the mean BMI of the noncases, mixture and cases (disease) groups, respectively. This follows from the mixture mean being the proportion-weighted sum of the group means. This approach was used in ref.^{7} to show that rates of diabetic ketoacidosis were the same in T1D diagnosed above and below 30 years of age. We note that all the same limitations of the Means method apply. The EMD and KDE methods could allow reconstruction of the full distribution of the clinical characteristic; however, evaluation of this approach is beyond the scope of this study.
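A direct translation of this rearrangement into code (illustrative; the function and variable names are ours):

```python
def characteristic_of_cases(x_bar_m, x_bar_n, p_c_hat):
    """Recover the mean clinical characteristic (e.g. BMI) of the
    genetically defined case group from cohort-level means, using
        x_C = (x_M - p_N * x_N) / p_C,  with p_N = 1 - p_C.
    Follows from the mixture mean being the proportion-weighted sum
    of the group means; unreliable when p_C_hat is near zero.
    """
    p_n_hat = 1.0 - p_c_hat
    return (x_bar_m - p_n_hat * x_bar_n) / p_c_hat
```

For instance, if the mixture mean BMI is 27.75, the noncase reference mean is 27 and the estimated case proportion is 0.25, the implied case mean BMI is 30.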
Testing of proposed clinical discriminators
Another utility of these genetic discrimination techniques is to test the performance of clinical classification criteria and allow more precise stratification of a population. Whilst the increasing availability of population datasets generated from routinely collected data allows large-scale population analysis, robust classification can become more difficult, leading to bias which is difficult to quantify^{6}. Treating the clinically defined cohort as a mixture would allow rapid estimation of the correctly and incorrectly classified proportions within the cohort, thus allowing for bias adjustment and optimisation of classification criteria.
Cautions
The use of genetic data in the context of genetic stratification means certain assumptions must hold true for the estimates to be valid. The same assumptions required for Mendelian randomisation^{1,17} should be met here. Key to the accuracy of estimates is the equivalence assumption, which states that cases and noncases in the mixture reflect their respective reference cohorts. The importance of meeting this assumption, and the implications if it is not met, are highlighted by our example of a raised T2DGRS for an enriched reference T2D population from the WTCCC^{8}. For this reason, to help ensure equivalence is maintained, we recommend these methods are used in subsets of a cohort, allowing reference cases and noncases to be derived from the same dataset. If these methods are to be used with reference cohorts from different datasets, as was done previously^{7}, the equivalence assumption should be rigorously tested prior to analysis. This must initially involve a detailed assessment of the selection criteria for the mixture and reference cohorts and the available literature, followed by comparison of the GRS between definite noncases and cases from within the mixture and their respective references.
Careful GRS comparison between the reference cohorts and definite cases and noncases from the mixture will also help mitigate any potential impact of unrecognized genotype–phenotype interactions, which may arise when selecting subgroups. This is highlighted by Supplementary Fig. 6, which shows a reduction in the performance of estimates with higher disease prevalence owing to a subtle difference in GRS between type 2 diabetes cases with and without microalbuminuria. We recommend careful investigation for overlapping genetic associations and pleiotropy using standard Mendelian randomisation approaches^{17,18}. Genotype–phenotype interaction is also relevant when considering the criteria used to originally select cases and controls in the genome-wide association study (GWAS) from which a GRS is derived, as cases may have been enriched to improve variant discovery. However, this will have minimal impact on the methods' estimates provided that genetic equivalence has been maintained between reference and mixture cases and controls. Clearly, this would not be the case if the enriched GWAS population were used as the case reference population, as highlighted by our type 2 diabetes example above.
Finally, all our methodologies assume that the mixture consists of only the two genetic reference cohorts, such that \(p_{\mathrm{C}}+p_{\mathrm{N}}=1\). Both the EMD and KDE methods provide a way to check whether this mixture assumption is satisfied. In the case of the EMD method, we can use the independent estimates \({\hat{p}}_{\mathrm{N}}\) and \({\hat{p}}_{\mathrm{C}}\) to check how much their sum deviates from 1. For the KDE method, the validation can be based on the residuals of the least-squares fitting procedure. To check whether the deviation from \(p_{\mathrm{C}}+p_{\mathrm{N}}=1\) is significant, we again suggest the use of the bootstrap methodology. We present some details and an example of such checks in the supplementary information; however, a detailed analysis of this aspect of the proposed methodology is beyond the scope of this paper.
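The EMD variant of this check can be sketched by bootstrapping the sum of the two independently estimated proportions; this is our illustration of the suggested procedure, with function name and defaults of our choosing:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def mixture_assumption_check(mixture, r_c, r_n, n_boot=500, rng=None):
    """Bootstrap a 95% interval for p_C_hat + p_N_hat (EMD estimates).

    If the interval clearly excludes 1, the assumption that the mixture
    contains only the two reference components is suspect.
    """
    rng = np.random.default_rng(rng)
    d_refs = wasserstein_distance(r_c, r_n)
    sums = []
    for _ in range(n_boot):
        m = rng.choice(mixture, len(mixture), replace=True)
        p_c = wasserstein_distance(m, r_n) / d_refs
        p_n = wasserstein_distance(m, r_c) / d_refs
        sums.append(p_c + p_n)
    return np.quantile(sums, [0.025, 0.975])
```

Note that finite-sample EMD estimates carry a small upward bias, so in practice the interval should be judged against a null distribution built from modelled two-component mixtures rather than against 1 exactly.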
In summary, we propose novel approaches that use population distributions of GRS to estimate disease prevalence. We show that the proposed Means, EMD and KDE approaches improve upon the existing Excess method, performing similarly across different mixture cohorts, with robust estimates possible even when using GRS with reduced discriminative ability. Utilising these concepts will allow researchers to gain novel unbiased insights into polygenic disease prevalence and clinical characteristics, unhampered by clinical classification criteria.
Methods
Participants
Type 1 diabetes cases: Cases (n = 1,963) were taken from the WTCCC^{8}. The WTCCC T1D patients all received a clinical diagnosis of T1D at <17 years of age and were treated with insulin from the time of diagnosis.
Type 2 diabetes cases: Cases (n = 1,924) were taken from the WTCCC^{8}. The WTCCC T2D patients all received a clinical diagnosis of T2D.
Clinical examples

(1) Coeliac Disease
Coeliac disease reference cases: Cases (n = 12,018) consisted of those from a combination of European studies, diagnosed according to standard clinical criteria, including compatible serology and small intestinal biopsy^{19}.
Coeliac noncases: Noncases (n = 12,000) were randomly selected from those within UK Biobank (total n = 366,326) defined as unrelated individuals of white European descent without a diagnosis of coeliac disease and not reporting a gluten-free diet.
Gluten-free diet: Gluten-free cases (n = 12,757) were taken from unrelated individuals of white European descent in UK Biobank reporting adherence to a gluten-free diet.
Reported coeliac cases in biobank: Coeliac disease cases (n = 1,772) were defined based on selfreported questionnaire answers and/or an ICD10 record from hospital episode statistics data.

(2) Microalbuminuria
Type 2 diabetes reference cases: Cases (n = 13,268) were defined as non-insulin-treated participants of white European descent, either self-reporting diabetes or with an HbA1c ≥ 48 mmol mol^{−1} at recruitment to UK Biobank, without microalbuminuria.
Type 2 diabetes non-cases: Non-cases (n = 10,000) were randomly selected from all participants (n = 339,385) of white European descent without microalbuminuria, not self-reporting diabetes and with an HbA1c < 48 mmol mol^{−1} at recruitment to UK Biobank.
Microalbuminuria cases: Cases (n = 17,868) were taken from unrelated individuals of white European descent in the UK Biobank. We used the albumin-creatinine ratio (ACR) calculated from the baseline assessment. In UK Biobank, a continuous measure of ACR was derived using urinary measures of albumin and creatinine. Microalbuminuria was defined based on international cut-offs: ≥ 2.5 mg mmol^{−1} in males and ≥ 3.5 mg mmol^{−1} in females. Any self-reported insulin-treated diabetes cases were excluded, as we were evaluating the proportion of type 2 diabetes cases.
Microalbuminuria cases with type 2 diabetes: Cases (n = 2,509) were defined as white European participants with type 2 diabetes and microalbuminuria as defined by the aforementioned criteria.

(3) Glaucoma
Type 2 diabetes reference cases: Cases (n = 15,128) were defined as non-insulin-treated participants of white European descent, either self-reporting diabetes or with an HbA1c ≥ 48 mmol mol^{−1} at recruitment to UK Biobank, without either self-reported glaucoma or a glaucoma code in hospital episode statistics data.
Type 2 diabetes non-cases: Non-cases (n = 10,000) were randomly selected from all participants (n = 345,534) of white European descent without glaucoma, not self-reporting diabetes and with an HbA1c < 48 mmol mol^{−1} at recruitment to UK Biobank.
Glaucoma cases: Cases (n = 9,857) were taken from unrelated individuals of white European descent in the UK Biobank self-reporting glaucoma or with a glaucoma code in hospital episode statistics data. Any self-reported insulin-treated diabetes cases were excluded.
Glaucoma cases with type 2 diabetes: Cases (n = 650) were defined as white European participants with type 2 diabetes as defined by the aforementioned criteria and self-reported glaucoma or a glaucoma code in hospital episode statistics data.
Calculating GRS
T1DGRS: The T1DGRS was generated using published variants known to be associated with risk of T1D. We generated a 30-SNP T1DGRS from variants present in the WTCCC cohort. We followed the method described by Oram et al.^{2}, using tag variants rs2187668 and rs7454108 to determine HLA DR haplotype and ascertain the HLA-haplotype component of each individual’s score^{20}. This was added to the score of the remaining variants, generated by summing the effective allele dosage of each variant multiplied by the natural log (ln) of its odds ratio.
T2DGRS: The T2DGRS was generated using published variants known to be associated with risk of T2D^{21}. We generated a 77-SNP T2DGRS in both the WTCCC cohort and UK Biobank, consisting of variants present in both data sets and with high imputation quality (R^2 > 0.4). The AUC (0.65) for discriminating T1D and T2D was calculated within the study, as this 77-SNP GRS was created specifically to allow comparison between the WTCCC cohort and UK Biobank. The score was generated by summing the effective allele dosage of each variant multiplied by the natural log (ln) of its odds ratio.
CDGRS: The 46-SNP coeliac GRS was generated using published variants known to be associated with risk of coeliac disease^{19,22,23}. The log-additive CDGRS was generated using the natural log of the corresponding odds ratio as the weight for each variant. For each included genotype at the DQ locus, the odds ratio was derived from a case-control dataset^{19}. For each non-HLA locus, odds ratios from the existing literature were used, and each weight was multiplied by the individual’s risk allele dosage^{9,19}.
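The log-additive scoring described above can be sketched as follows. This is a minimal illustration with made-up dosages and odds ratios, not the published 30-, 77- or 46-SNP variant weights:

```python
import numpy as np

def genetic_risk_score(dosages, odds_ratios):
    """Log-additive GRS: sum of effect-allele dosages weighted by ln(OR).

    dosages: array of shape (n_individuals, n_variants), values in [0, 2]
    odds_ratios: per-variant odds ratios (illustrative, not published weights)
    """
    weights = np.log(np.asarray(odds_ratios, dtype=float))
    return np.asarray(dosages, dtype=float) @ weights

# Two hypothetical individuals genotyped at three hypothetical variants
dosages = np.array([[2, 1, 0],
                    [1, 1, 2]])
ors = np.array([1.5, 2.0, 0.8])
scores = genetic_risk_score(dosages, ors)
```

In practice the HLA-haplotype component would be computed separately (from the tag-variant-derived genotype) and added to this sum, as described for the T1DGRS and CDGRS.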
Excess method
Following on from previous work^{7}, the Excess method calculates the reference proportions in a mixture cohort from the difference in expected numbers on either side of the reference cohort’s median. The reference median in question was taken to be the one closest to the mixture cohort’s median. The proportion was then calculated as: \({\hat{p}}_{\mathrm{C}}=\left|\frac{\#\{x > m\}-\#\{x\le m\}}{n}\right|\), where \(m\) is the median of the reference cohort, \(n\) is the size of the mixture cohort and \(x\) is an individual participant in the mixture cohort; hence \(\#\{x > m\}\) represents the number of participants above the median and \(\#\{x\le m\}\) the number at or below the median.
Means method
The mean GRS was computed for each of the two reference cohorts and the mixture cohort. The proportions of the two reference cohorts were then calculated from the normalised difference between the mixture cohort’s mean (\({\mu}_{\widetilde{\mathrm{M}}}\)) and the means of the two reference cohorts (\({\mu}_{\mathrm{R_C}}\) and \({\mu}_{\mathrm{R_N}}\)): \({\hat{p}}_{\mathrm{C}}=\frac{{\mu}_{\widetilde{\mathrm{M}}}-{\mu}_{\mathrm{R_N}}}{{\mu}_{\mathrm{R_C}}-{\mu}_{\mathrm{R_N}}}\). If the mean of the mixture cohort is larger (or smaller) than both reference means, the estimate is defined as 1 (or 0), depending on the closest reference mean.
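The Means method reduces to a one-line normalised difference; clipping to [0, 1] implements the boundary rule for mixture means falling outside the reference means. A minimal sketch:

```python
import numpy as np

def means_estimate(mixture, ref_cases, ref_noncases):
    """Means method sketch: normalised distance of the mixture mean from the
    non-case reference mean, clipped to [0, 1] to handle means outside the
    reference range."""
    mu_m = np.mean(mixture)
    mu_c, mu_n = np.mean(ref_cases), np.mean(ref_noncases)
    p = (mu_m - mu_n) / (mu_c - mu_n)
    return float(np.clip(p, 0.0, 1.0))

# Toy data: reference means 1.0 and 0.0; mixture mean 0.3 gives p_hat = 0.3
p_hat = means_estimate([0.3, 0.3], [1.0, 1.0], [0.0, 0.0])
```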
Earth Mover’s Distance (EMD) method
Intuitively, the Earth Mover’s Distance (EMD) is the minimal cost of the work required to transform one ‘pile of earth’ into another, with each ‘pile of earth’ representing a probability distribution. Mathematically, the EMD is a Wasserstein distance and has been widely used in computer and data sciences^{24,25}. For univariate probability distributions, the EMD has the following closed-form formula^{26}: \({\mathrm{EMD}}({\mathrm{PDF}}_{\mathrm{C}},{\mathrm{PDF}}_{\mathrm{N}})=\int_{Z}\left|{\mathrm{CDF}}_{\mathrm{C}}(z)-{\mathrm{CDF}}_{\mathrm{N}}(z)\right|\,{\mathrm{d}}z.\)
Here, \({\mathrm{PDF}}_{\mathrm{C}}\) and \({\mathrm{PDF}}_{\mathrm{N}}\) are two probability density functions with support in the set \(Z\), and \({\mathrm{CDF}}_{\mathrm{C}}\) and \({\mathrm{CDF}}_{\mathrm{N}}\) are their respective cumulative distribution functions.
To compute the EMD, we first find the experimental CDFs of GRS for each of the two reference cohorts and the mixture cohort. These CDFs are then interpolated at the same points for each distribution, the points being the centres of the bins obtained when applying the Freedman–Diaconis rule^{27} to the combined reference cohorts (such that \(h=2\,{\mathrm{IQR}}/{n}^{1/3}\)). As the support set, we take the interval bounded by the minimum and maximum GRS values across all three cohorts. The proportions were then calculated as: \({\hat{p}}_{x}^{\mathrm{EMD}}=1-\frac{{\mathrm{EMD}}(\widetilde{\mathrm{M}},{\mathrm{R}}_{x})}{{\mathrm{EMD}}({\mathrm{R}}_{\mathrm{C}},{\mathrm{R}}_{\mathrm{N}})},\)
where \(x\) is either \(\mathrm{C}\) or \(\mathrm{N}\). Since the two estimates are independent, the deviation of their sum from one, \(\left|{p}_{\mathrm{C}}^{\mathrm{EMD}}+{p}_{\mathrm{N}}^{\mathrm{EMD}}-1\right|\), can be used to test the assumption that \({p}_{\mathrm{C}}+{p}_{\mathrm{N}}=1\); the dispersion of the deviation can be computed during bootstrapping and compared with the value observed in the analysed cohort. However, under the assumption that \({p}_{\mathrm{C}}+{p}_{\mathrm{N}}=1\), we adapted the method by taking the average of the estimated proportions: \({\hat{p}}_{\mathrm{C}}=\frac{1}{2}\left({\hat{p}}_{\mathrm{C}}^{\mathrm{EMD}}+1-{\hat{p}}_{\mathrm{N}}^{\mathrm{EMD}}\right).\)
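A compact sketch of this estimator, substituting SciPy’s exact empirical 1-D Wasserstein distance for the paper’s interpolated-CDF computation (that substitution, and the toy data, are assumptions of this sketch):

```python
import numpy as np
from scipy.stats import wasserstein_distance

def emd_estimate(mixture, ref_cases, ref_noncases):
    """EMD method sketch: two independent proportion estimates from the
    normalised distances to each reference, averaged under p_C + p_N = 1."""
    d_cn = wasserstein_distance(ref_cases, ref_noncases)
    p_c = 1.0 - wasserstein_distance(mixture, ref_cases) / d_cn
    p_n = 1.0 - wasserstein_distance(mixture, ref_noncases) / d_cn
    # Average of the two estimates, assuming the proportions sum to one
    return 0.5 * (p_c + (1.0 - p_n))

# Degenerate toy data: cases at 1, non-cases at 0, mixture 30% cases
mix = np.concatenate([np.ones(30), np.zeros(70)])
p_hat = emd_estimate(mix, np.ones(100), np.zeros(100))  # close to 0.3
```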
Kernel Density Estimation (KDE) method
Individual GRS were convolved with Gaussian kernels, with the bandwidth set to the bin size obtained when applying the Freedman–Diaconis rule^{27} in the same way as for the EMD method. This forms two reference distribution templates and a mixture template, \({\mathrm{KDE}}_{\mathrm{C}}\), \({\mathrm{KDE}}_{\mathrm{N}}\) and \({\mathrm{KDE}}_{\mathrm{M}}\), one for each dataset. A mixture model was then defined as the weighted sum of the two reference templates (with both weights initialised to 1). This model was fitted to the mixture template (\({\mathrm{KDE}}_{\mathrm{M}}\)) with the Levenberg–Marquardt (least-squares) algorithm^{28}, allowing the weights (\({w}_{\mathrm{C}}\) and \({w}_{\mathrm{N}}\)) to vary. The proportions were then calculated as: \({\hat{p}}_{\mathrm{C}}=\frac{{w}_{\mathrm{C}}}{{w}_{\mathrm{C}}+{w}_{\mathrm{N}}}\). Admissible values of the weights were limited to the [0, 1] interval.
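The template-fitting idea can be sketched as below, using `scipy.stats.gaussian_kde` with its default bandwidth and a bounded trust-region least-squares fit in place of the original Freedman–Diaconis bandwidth and Levenberg–Marquardt routine (both substitutions are assumptions of this sketch):

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.optimize import least_squares

def kde_estimate(mixture, ref_cases, ref_noncases, n_grid=200):
    """KDE method sketch: fit a weighted sum of reference KDE templates
    to the mixture KDE and normalise the fitted weights."""
    data = np.concatenate([mixture, ref_cases, ref_noncases])
    grid = np.linspace(data.min(), data.max(), n_grid)
    kde_c = gaussian_kde(ref_cases)(grid)
    kde_n = gaussian_kde(ref_noncases)(grid)
    kde_m = gaussian_kde(mixture)(grid)

    def residuals(w):
        # Weighted sum of reference templates minus the mixture template
        return w[0] * kde_c + w[1] * kde_n - kde_m

    # Non-negative weights; bounds require SciPy's 'trf' solver rather than 'lm'
    fit = least_squares(residuals, x0=[1.0, 1.0], bounds=(0.0, np.inf))
    w_c, w_n = fit.x
    return w_c / (w_c + w_n)

rng = np.random.default_rng(42)
ref_c = rng.normal(3.0, 1.0, 1000)
ref_n = rng.normal(0.0, 1.0, 1000)
mix = np.concatenate([rng.normal(3.0, 1.0, 300), rng.normal(0.0, 1.0, 700)])
p_hat = kde_estimate(mix, ref_c, ref_n)  # close to the true proportion, 0.3
```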
Simulated mixtures
To simulate a range of real-world scenarios, we constructed artificial mixture cohorts by randomly sampling GRS with replacement from the reference cohorts of cases (\({\mathrm{R}}_{\mathrm{C}}\)) and non-cases (\({\mathrm{R}}_{\mathrm{N}}\)) in specified proportions \({p}_{\mathrm{C}}\) and total mixture sizes \(n\). To construct the mixtures, we used the WTCCC^{8} T1D (n = 1,963) and T2D (n = 1,924) data. We used half of the available samples as reference cohorts (the first n = 982 and n = 962 points, respectively); the other half (the last n = 981 and n = 962, respectively) formed a hold-out set used to construct the mixtures. To obtain any required mixture size, we sampled with replacement from the hold-out data.
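Constructing one such artificial mixture amounts to two resampling calls; a minimal sketch (with dummy hold-out arrays in place of the WTCCC GRS):

```python
import numpy as np

def build_mixture(holdout_cases, holdout_noncases, p_c, n, rng):
    """Build an artificial mixture of size n with case proportion p_c by
    sampling with replacement from hold-out GRS."""
    n_c = int(round(p_c * n))
    cases = rng.choice(holdout_cases, size=n_c, replace=True)
    noncases = rng.choice(holdout_noncases, size=n - n_c, replace=True)
    return np.concatenate([cases, noncases])

rng = np.random.default_rng(0)
# Dummy hold-out scores stand in for the last 981 T1D and 962 T2D GRS values
mix = build_mixture(np.ones(981), np.zeros(962), p_c=0.25, n=1000, rng=rng)
```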
For the heatmaps, Fig. 3 and Supplementary Figs. 2–4, the proportion and cohort size were systematically varied, with \({p}_{{{{{{{{\rm{C}}}}}}}}}\) ranging from 0 to 1 in 0.01 (1%) steps while \(n\) ranged from 100 to 2,500 in steps of 100 samples. All four methods were applied to each combination of these parameters. At each point in the parameter space, we estimated the prevalence (\({\hat{p}}_{{{{{{{{\rm{C}}}}}}}}}\)) and its confidence interval and then compared it with the model proportion (\({p}_{{{{{{{{\rm{C}}}}}}}}}\)) used to generate them.
Figure 3 (top row) illustrates how the randomness of the simulated mixture cohort affects the variability of each method’s estimates. This variability reflects the randomness that is inherently present in the mixture cohort. Supplementary Fig. 2 shows how this variability decreases for more discriminative GRS, while Supplementary Note 1 and Supplementary Figs. 3–4 compare the performance of the methods once the randomness of the composition of the mixture cohort is eliminated.
For Supplementary Figs. 5–6, we used the GRS of the T2D cases and non-cases in the mixture cohort, \(\widetilde{\mathrm{M}}\), to construct (by random sampling with replacement) 21 artificial mixture distributions (n = 2,500 each) with the prevalence of T2D varying from 0 to 100% in 5% steps. To estimate the proportions in the constructed mixture cohorts, we used the reference cohorts specified in the Clinical examples section above.
Synthetic GRS data
To generate the synthetic GRS in Fig. 4 and Supplementary Fig. 7, we used pseudo-random number generators. As references, we used two samples (n = 2,000 each) from normal distributions with mean 0 and standard deviation 1, N(0, 1); the means and standard deviations of the drawn reference samples were \(\mu =0.002,\sigma =0.999\) and \(\mu =0.008,\sigma =1.001\). The reference samples were generated only once. To change the AUC between the reference samples, we added a constant to one of them, shifting its mean. The mixtures were generated using different pseudo-random number generators for each proportion (\({p}_{\mathrm{C}}\)) and AUC value. For example, to generate a mixture with n = 5,000, \({p}_{\mathrm{C}}=0.1\) and AUC = 0.7, we: (1) draw 500 samples from N(0, 1); and (2) draw 4,500 samples from N(0, 1) and add 0.74 to them. The mixture and reference samples were generated separately.
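The shift used above follows from a standard identity: for two unit-variance normals separated by \(\delta\), AUC \(=\Phi(\delta/\sqrt{2})\), so \(\delta=\sqrt{2}\,\Phi^{-1}(\mathrm{AUC})\). A short sketch reproducing the 0.74 offset for AUC = 0.7:

```python
import numpy as np
from scipy.stats import norm

def auc_offset(auc):
    """Mean shift between two unit-variance normal distributions that yields
    the target AUC: AUC = Phi(delta / sqrt(2)) => delta = sqrt(2) * Phi^-1(AUC)."""
    return np.sqrt(2.0) * norm.ppf(auc)

delta = auc_offset(0.7)  # approximately 0.74, the offset used in the text
```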
Varying mixture size
To investigate the dependence of the CI width on the mixture size (Fig. 4c) and to find the minimum mixture size required for a CI width < 0.1 (Table 1), we used the same synthetic GRS distributions with \({p}_{\mathrm{C}}\) = 0.1 as described above. For the Excess, Means and EMD methods, we varied mixture sizes between 100 and 10,000 (30,000 for AUC < 0.7) in steps of 100 points. Since the KDE method is more computationally expensive, we tested mixture sizes between 100 and 6,500 in steps of 100 points, and between 7,000 and 10,000 (40,000 for AUC = 0.6; 30,000 for AUC = 0.65) in steps of 500 points. For each mixture size considered, we repeated the estimation of the CIs 100 times. We disregarded estimates whose CIs did not include \({p}_{\mathrm{C}}=0.1\). As the minimum mixture size, we took the median of the mixture sizes (over the 100 runs) at which we first observed a CI width < 0.1.
Calculating confidence intervals
To estimate confidence intervals and any systematic bias of the methods, we used Monte Carlo^{11} and bootstrap methods^{12,29}. We combined the two approaches to capture the variability of the estimate resulting from both the mixture size and the features of the reference distributions.
First, we stochastically modelled the process of generating the mixture. To do so, we generated \({N}_{\mathrm{M}}\) new mixtures by sampling with replacement from the reference cohorts. Each modelled mixture had the same size as the original cohort and the composition given by the initial estimate \({\hat{p}}_{\mathrm{C}}\) obtained from the original mixture. For example, if the original cohort had 1,000 values and the estimate was \({\hat{p}}_{\mathrm{C}}=0.3\), then each modelled mixture would contain 300 values sampled with replacement from the cases reference sample (\({\mathrm{R}}_{\mathrm{C}}\)) and 700 values from the non-cases reference sample (\({\mathrm{R}}_{\mathrm{N}}\)). Next, we resampled each of the \({N}_{\mathrm{M}}\) new mixtures, generating \({N}_{\mathrm{B}}\) bootstrap samples; see also Supplementary Fig. 8.
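The nested Monte Carlo/bootstrap scheme can be sketched as a double loop collecting \(N_{\mathrm{M}}\cdot N_{\mathrm{B}}\) re-estimates (the estimator is passed in as a function; the toy call below uses the sample mean purely for illustration):

```python
import numpy as np

def bootstrap_estimates(ref_c, ref_n, p_hat, n, n_m, n_b, estimator, rng):
    """Monte Carlo + bootstrap sketch: build N_M modelled mixtures at the
    initial estimate p_hat, then N_B bootstrap resamples of each, collecting
    N_M * N_B re-estimates p'_C."""
    n_c = int(round(p_hat * n))
    estimates = []
    for _ in range(n_m):
        # Monte Carlo step: modelled mixture with composition p_hat
        modelled = np.concatenate([
            rng.choice(ref_c, size=n_c, replace=True),
            rng.choice(ref_n, size=n - n_c, replace=True),
        ])
        for _ in range(n_b):
            # Bootstrap step: resample the modelled mixture
            boot = rng.choice(modelled, size=n, replace=True)
            estimates.append(estimator(boot))
    return np.asarray(estimates)

rng = np.random.default_rng(1)
ests = bootstrap_estimates(np.ones(50), np.zeros(50), p_hat=0.3, n=10,
                           n_m=2, n_b=3, estimator=np.mean, rng=rng)
```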
Following chapters 2 and 5 of ref. ^{12}, we used all \({N}_{\mathrm{M}}\cdot {N}_{\mathrm{B}}\) cohorts to compute the bias and confidence intervals of the estimate. The systematic median bias of a method is defined as the difference between the median value, \({\mathrm{med}}(\{\{{p}_{\mathrm{C}}^{\prime}\}_{\mathrm{B}}\}_{\mathrm{M}})\), of the \({N}_{\mathrm{M}}\cdot {N}_{\mathrm{B}}\) bootstrapped estimates \({p}_{\mathrm{C}}^{\prime}\) and the estimate \({\hat{p}}_{\mathrm{C}}\): \({\mathrm{bias}}={\mathrm{med}}(\{\{{p}_{\mathrm{C}}^{\prime}\}_{\mathrm{B}}\}_{\mathrm{M}})-{\hat{p}}_{\mathrm{C}}.\)
We used bias-corrected and accelerated bootstrap confidence intervals (BCa CI), computed as described in ref. ^{30}. Bootstrap confidence intervals assume that the spread of the distribution of the bootstrap estimates \({p}_{\mathrm{C}}^{\prime}\) can be used to estimate the CI. The BCa CI takes into account the median bias and skewness (acceleration) of the distribution of the bootstrap estimates \({p}_{\mathrm{C}}^{\prime}\) and allows the calculation of corrected quantiles representing a chosen confidence level \(\alpha\).
Throughout this section, \(\Phi\) is the standard normal (\(\mu =0,\sigma =1\)) CDF, \({\Phi }^{-1}\) is its inverse, and \({\mathscr{T}}_{n}^{-1}\) is the inverse CDF of a Student’s t-distribution with \(n\) degrees of freedom.
The computation takes the following steps:

1. Estimate the median bias correction factor \({z}_{0}\):
$${z}_{0}={\Phi }^{-1}\left(\frac{\#\left({p}_{\mathrm{C}}^{\prime}\le {\hat{p}}_{\mathrm{C}}\right)}{{N}_{\mathrm{M}}\cdot {N}_{\mathrm{B}}}\right).\qquad(4)$$
2. Estimate the acceleration correction factor \(\hat{a}\):
$$\hat{a}=\frac{1}{6}\frac{{\sum }_{i=1}^{n}{U}_{i}^{3}}{{\left({\sum }_{i=1}^{n}{U}_{i}^{2}\right)}^{3/2}},\qquad(5)$$
where the \({U}_{i}\) values are calculated using the jackknife influence function:
$${U}_{i}=(n-1)\left({\hat{p}}_{\mathrm{C}}-{\hat{p}}_{i}\right),\qquad(6)$$
where \({\hat{p}}_{i}\) is an estimate based on the reduced mixture sample \({\widetilde{\mathrm{M}}}_{i}=\left({GRS}_{1},{GRS}_{2},\ldots ,{GRS}_{i-1},{GRS}_{i+1},\ldots ,{GRS}_{n}\right)\) with score \(i\) removed.

3. To counteract the narrowness bias, additionally expand the confidence level^{29}:
$${\alpha }^{\prime}=\Phi \left({\mathscr{T}}_{n-1}^{-1}\left(\alpha \right)\sqrt{n/(n-1)}\right).\qquad(7)$$
4. Use the bias and acceleration factors to compute the BCa confidence levels:
$${\alpha }_{\mathrm{BCa}}\left(\alpha \right)=\Phi \left({z}_{0}+\frac{{z}_{0}+{\Phi }^{-1}\left({\alpha }^{\prime}\right)}{1-\hat{a}\cdot \left({z}_{0}+{\Phi }^{-1}\left({\alpha }^{\prime}\right)\right)}\right).\qquad(8)$$
5. Take the \({\alpha }_{\mathrm{BCa}}(\alpha /2)\) quantile of the \({p}_{\mathrm{C}}^{\prime}\) samples to obtain the lower confidence limit \({CI}_{\mathrm{L}}\), and the \({\alpha }_{\mathrm{BCa}}(1-\alpha /2)\) quantile to obtain the upper confidence limit \({CI}_{\mathrm{U}}\). If the median bias is very strong, the BCa CI is undefined. For example, if \({\hat{p}}_{\mathrm{C}}\) lies outside the range of the distribution of the bootstrap estimates \({p}_{\mathrm{C}}^{\prime}\), then \({z}_{0}\) is infinite and both limits of the CI are equal to the maximum or minimum value of the \({p}_{\mathrm{C}}^{\prime}\) samples.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
UK Biobank data can be obtained after completing an online application; see details at http://www.ukbiobank.ac.uk/usingtheresource/. Wellcome Trust Case Control Consortium genotype data can be obtained by application to the Wellcome Trust Case Control Consortium Data Access Committee; the procedure is described in more detail at https://www.wtccc.org.uk/info/access_to_data_samples.html.
Code availability
The Distribution Proportion Estimation software (v1.0.0) used to analyse the data was developed and tested in Python 3.8.2 and Matlab release 2020b (which includes the other algorithms mentioned in the manuscript). The software implementing these methods is archived at https://doi.org/10.5281/zenodo.5512651, and the open-source code is available under version control at https://github.com/bdevans/DPE.
References
Smith, G. D. & Ebrahim, S. ‘Mendelian randomization’: can genetic epidemiology contribute to understanding environmental determinants of disease? Int J. Epidemiol. 32, 1–22 (2003).
Oram, R. A. et al. A type 1 diabetes genetic risk score can aid discrimination between type 1 and type 2 diabetes in young adults. Diabetes Care 39, 337–344 (2015).
Ntalla, I. et al. Genetic risk score for coronary disease identifies predispositions to cardiovascular and noncardiovascular diseases. J. Am. Coll. Cardiol. 73, 2932–2942 (2019).
Gao, X. R., Huang, H. & Kim, H. Polygenic risk score is associated with intraocular pressure and improves glaucoma prediction in the UK Biobank cohort. Transl. Vis. Sci. Technol. 8, 10 (2019).
St Clair, P. et al. Using selfreports or claims to assess disease prevalence: it’s complicated. Med. Care 55, 782–788 (2017).
Manuel, D. G., Rosella, L. C. & Stukel, T. A. Importance of accurately identifying disease in studies using electronic health records. BMJ 341, c4226 (2010).
Thomas, N. J. et al. Frequency and phenotype of type 1 diabetes in the first six decades of life: a crosssectional, genetically stratified survival analysis from UK Biobank. Lancet Diabetes Endocrinol. 6, 122–129 (2018).
Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678 (2007).
Sharp, S. A. et al. A single nucleotide polymorphism genetic risk score to aid diagnosis of coeliac disease: a pilot study in clinical care. Aliment Pharm. Ther. 52, 1165–1173 (2020).
Allen, N. E. et al. UK biobank data: come and get it. Sci. Transl. Med. 6, 224ed4 (2014).
Rosenblad, A. & Manly, B. F. J. Randomization, bootstrap and Monte Carlo methods in biology, third edition. Computational Stat. 24, 371–372 (2009).
Davison, A. C. & Hinkley, D. V. Bootstrap Methods and their Application. Cambridge Series in Statistical and Probabilistic Mathematics (Cambridge University Press, 1997).
Kerminen, S. et al. Geographic variation and bias in the polygenic scores of complex diseases and traits in Finland. Am. J. Hum. Genet. 104, 1169–1181 (2019).
Martin, A. R. et al. Human demographic history impacts genetic risk prediction across diverse populations. Am. J. Hum. Genet. 100, 635–649 (2017).
Thomas, N. J. et al. Type 1 diabetes defined by severe insulin deficiency occurs after 30 years of age and is commonly treated as type 2 diabetes. Diabetologia 62, 1167–1172 (2019).
Lebwohl, B., Sanders, D. S. & Green, P. H. R. Coeliac disease. Lancet 391, 70–81 (2018).
Davies, N. M., Holmes, M. V. & Davey Smith, G. Reading Mendelian randomisation studies: a guide, glossary, and checklist for clinicians. BMJ 362, k601 (2018).
Bowden, J., Davey Smith, G. & Burgess, S. Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression. Int J. Epidemiol. 44, 512–525 (2015).
Trynka, G. et al. Dense genotyping identifies and localizes multiple common and rare variant association signals in celiac disease. Nat. Genet. 43, 1193–1201 (2011).
Barker, J. M. et al. Two single nucleotide polymorphisms identify the highestrisk diabetes HLA genotype: potential for rapid screening. Diabetes 57, 3152–3155 (2008).
Udler, M. S. et al. Genetic risk scores for diabetes diagnosis and precision medicine. Endocr. Rev. 40, 1500–1520 (2019).
Mitchell, R. T. et al. Coeliac screening in a Scottish cohort of children with type 1 diabetes mellitus: is DQ typing the way forward? Arch. Dis. Child 101, 230–233 (2016).
GutierrezAchury, J. et al. Fine mapping in the MHC region accounts for 18% additional genetic risk for celiac disease. Nat. Genet. 47, 577–578 (2015).
Levina, E. & Bickel, P. The Earth Mover’s distance is the Mallows distance: some insights from statistics. In Proc. Eighth IEEE International Conference on Computer Vision (ICCV 2001) (2001).
Muskulus, M. & VerduynLunel, S. Wasserstein distances in the analysis of time series and dynamical systems. Phys. D: Nonlinear Phenom. 240, 45–58 (2011).
Cohen, S. & Guibas, L. The Earth Mover’s Distance: Lower Bounds and Invariance under Translation (Stanford University, 1997).
Freedman, D. & Diaconis, P. On the histogram as a density estimator: L2 theory. Z. Wahrscheinlichkeitstheorie verw. Gebiete 57, 453–476 (1981).
Gill, P. E., Murray, W. & Wright, M. H. Practical Optimization, pp. 136–137 (Academic Press, 1981).
Hesterberg, T. C. What teachers should know about the bootstrap: resampling in the undergraduate statistics curriculum. Am. Statistician 69, 371–386 (2015).
DiCiccio, T. J. & Efron, B. Bootstrap confidence intervals. Stat. Sci. 11, 189–228 (1996).
Acknowledgements
This research has in part been conducted using the UK Biobank Resource. The authors would like to acknowledge the use of the University of Exeter HighPerformance Computing (HPC) facility in carrying out this work. We are grateful to Jack Bowden for his comments on the manuscript.
Funding
B.D.E. and P.S. acknowledge that this work was generously supported by the Wellcome Trust Institutional Strategic Support Awards (WT204909MA and 204909/Z/16/Z respectively). K.T.A. gratefully acknowledges the financial support of the EPSRC via grants EP/N014391/1 and EP/T017856/1. N.J.T. is funded by an NIHR Academic Clinical Fellowship and undertook the research as part of a Wellcome Trust funded secondment within the translational research exchange at Exeter University (WT204909MA and 204909/Z/16/Z respectively). S.A.S. is supported by a Diabetes UK PhD studentship (17/0005757). M.N.W. is supported by the Wellcome Trust Institutional Support Fund (WT097835MF). R.A.O. is funded by a Diabetes UK Harry Keen Fellowship (16/0005529). S.E.J. is funded by an MRC grant. A.T.H. is supported by the NIHR Exeter Clinical Research Facility and a Wellcome Senior Investigator award and an NIHR Senior Investigator award. The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health.
Author information
Authors and Affiliations
Contributions
Manuscript writing: B.D.E., N.J.T., P.S., K.T.A. Method development: B.D.E., P.S., N.J.T., A.T.H., R.J.O., K.T.A. Data acquisition and coding: N.J.T., S.S., R.K., S.J., M.N.W. Simulation implementation, running and analysis: B.D.E., P.S. Discussion of results and manuscript editing: All authors. Project coordination: K.T.A., N.J.T.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Evans, B.D., Słowiński, P., Hattersley, A.T. et al. Estimating disease prevalence in large datasets using genetic risk scores. Nat Commun 12, 6441 (2021). https://doi.org/10.1038/s41467021265017