The ability to predict complex human phenotypes, including common disease risks, from DNA alone, is an important advance in genomics and biological science1,2. Sibling comparisons are a powerful method with which to validate genomic prediction in humans. Siblings (i.e., children who share the same mother and father) have typically experienced similar environments while growing up: family social status, exposure to toxins, diet, climate, etc. all tend to be similar3,4. Furthermore, siblings are concordant for ancestry and display negligible differences in population structure.

If a girl grows up to be taller than her sister, with whom she spent the first 18 years of her life, it seems likely at least some of the height difference is due to genetic differences. How much of phenotype difference can we predict from DNA alone? If one of the sisters develops breast cancer later in life, how much of the risk was due to genetic variants that she does not share with her asymptomatic sister? These are fundamental questions in human biology, which we address (at least to some extent) in this paper.

There are real clinical applications of this predictive capability. In Ref.5, this point is elaborated in the case of breast cancer. It is shown that the distribution of affected individuals is shifted in PRS score relative to the control population. An immediate result of this is that the probability that an individual in this population will be diagnosed with Breast Cancer at some point in their life increases with higher PRS. For individuals who are, e.g., in the top few percentiles in PRS, the probability of developing breast cancer is roughly 1 in 3, making them high risk by American Cancer Society guidelines. According to these guidelines, women with such PRS scores might be offered mammograms starting a decade earlier than women with average risk, as is standard of care for women with a BRCA risk variant. This example shows how PRS can have practical utility despite a modest AUC value of only 0.6 or so. There are roughly an order of magnitude more high risk women due to aggregate polygenic effects than due to BRCA variants. These women can now be identified through inexpensive array genotyping (to obtain their SNP values).

Other polygenic predictors—e.g., for Heart Attack, Diabetes, Hypothyroidism, etc.—may also have analogous clinical utility. Of course, the predictive performance of the final risk model can be improved significantly if other covariates (age, population structure, blood pressure, BMI or other biomarkers) are included in the analysis6. (Note that is not our focus here—we concentrate specifically on DNA-based prediction of risk with the knowledge that other factors can be included in a straightforward way).

Future work should investigate the cost-benefit characteristics of inexpensive population-level genotyping. In Ref.5, a very simplified version of this kind of analysis suggests that the benefits from increasing Breast Cancer screening based on PRS stratification alone might pay for the cost of genotyping the entire female population through cost savings from early detection. Of course, such a significant conclusion requires much more detailed analysis and other researchers have pushed for a similar approach7. In our view, the potential for early detection alone provides strong utilitarian motivation for our research, and future research, on the construction of PRS for a broad variety of disease conditions.

We study two types of predictors: Polygenic Risk Scores (PRS), which estimate the genetic risk of developing a specific common disease condition, and Polygenic Scores (PGS) which predict a quantitative trait such as adult height or bone density. Previous studies examined polygenic prediction on individuals without regard to family status (e.g.6,8). Our main objective here is to show that most of the predictive power remains even in the context of siblings. We compare predictive power in pairs of unrelated individuals to that in pairs of siblings. (We will sometimes abbreviate sibling by sib for brevity).

Predictors trained on a large population of non-sibling individuals (see “Methods and data” section below) could potentially utilize correlations in the SNP data that arise from environment effects, but are not related to direct genetic causation. Two examples are given below.

  1. 1.

    If environmental conditions in a specific region, such as, e.g., Northern England, affect disease risk, the predictor trained on UK data might assign nonzero effect sizes to SNPs associated with ancestries found in that region—i.e., the predictor learns to use population structure correlated to environmental conditions. These specific SNPs are correlated to disease risk for environmental reasons, but might not have any connection to genetic mechanisms related to the disease. They likely have little power to differentiate between siblings, who experienced similar family conditions and have have identical ancestry.

  2. 2.

    It is also possible that some SNP variants affect nurture (the way that parents raise their children). These SNPs could affect the child phenotype via an environmental mechanism under parental control, not a biochemical pathway within the child. This is sometimes referred to as a genetic nurture effect9,10,11,12,13. Note, siblings raised together would both be affected by parental genetic nurture variants, so these effects are weakened in family designs.

Sibling comparisons reduce the impact of factors such as those described above. We expect some reduction in power when predictors trained in a population of non-sibling individuals are tested among sibs. Sibling validation likely yields a better estimate of truly causal genetic effects. A more complicated measure of familial relatedness might lead to even better results14, but we restrict our analyses here to siblings.

For almost all of the predictors studied here, both PRS and PGS, significant power remains even in the sibling tests. Almost all of the predictors we study seem to capture some real genetic effects that cause siblings to differ from each other as adults.

Methods and data

The main dataset we use for training is the 2018 release of the UK Biobank15,16. The goal of this work is to study the effectiveness of polygenic predictors using siblings in the UK Biobank. In previous work, predictors were trained exclusively on genetically British individuals (as identified by principal component analysis17), however it has been shown that predictors trained on populations filtered by self-reported ethnicity perform equivalently5. We expect the predictor performance between siblings may be diminished to some extent compared to the general population because of shared environments, shared genetics, genetic nurture, and other confounding factors. For all traits (case/control and quantitative), predictors are trained, validated and tested on individuals who self-report as some form of “white ancestry”—i.e., British, Irish, or other white (note this terminology is from UK Biobank data tables). From this group, all individuals for whom there is at least one sibling match are set aside for use in the sibling test set. This is described in Supplementary Appendix C. In each training run, a small fraction of non-sibs is withheld from the training set for validation and model selection, and the set of sibling pairs is used as a final test set.

We construct linear models of genetic predisposition for a variety of disease conditions that were presented in Ref.5 and linear models of several quantitative human phenotypes, some of which can be found in Ref.8. The disease condition phenotype data describes a binary case-control status which is defined either by self-report or from a clinical diagnosis.

Polygenic predictors are constructed using compressed sensing18,19,20,21. It has been demonstrated that SNP matrices of human genome matrices are good compressed sensors: L1 performance guarantee theorems hold and phase transition behavior is observed.

We focus specifically on L1-trained predictors because we understand their training and performance characteristics well. There are many other methods used in the creation of polygenic scores. While we make no claims concerning those other predictors, we suspect that they would perform similarly in between-sibling validation tests such as those performed here. We do also examine two predictors (for Breast Cancer and Coronary Artery Disease) which were published in Khera et al.6. These are indicated as such in the figures and one can compare with L1-trained predictors on similar phenotypes.

For each disease condition, we compute a set of additive effects \(\vec {\beta }\) (each component is the effect size for a specific SNP) which minimizes the LASSO objective function:

$$\begin{aligned} {\mathcal {O}}(\lambda ,\vec {\beta }) = \frac{1}{2} || \vec {y} - X \vec {\beta }||^2 + n\lambda || \vec {\beta } ||_1, \end{aligned}$$

where n is the number of samples, \(|| \ldots ||\) means L2 norm (square root of sum of squares), \(|| \ldots ||_1\) is the L1 norm (sum of absolute values) and the term \(\lambda ||\vec {\beta }||_1\) is a penalization which enforces sparsity of \(\vec {\beta }\). The value of the phenotype variable y for case or control status is simply 1 or 0 (respectively). For quantitative phenotypes, y values are z-scored using population means and standard deviations.

The optimization is performed over a set of 50 k SNPs which are selected by rank ordering the p-values obtained from single marker regression of the phenotype against the SNPs. The details of this are described in Supplementary Appendix G.

Predictors are trained using the implementation of the LASSO algorithm from the Scikit-learn Python package22. Specifically, the lassopath algorithm is called on standardized inputs as it generates the full lasso path. For disease status, we typically use five non-overlapping sets of cases and controls held back from the training set for the purposes of in-sample cross-validation. For each value of \(\lambda \), there is a particular predictor which is then applied to the cross-validation set, where the polygenic score is defined as (i labels the individual and j labels the SNP)

$$\begin{aligned} PGS_i \,\, \mathrm{or} \,\, PRS_i = \sum _{j=1}^p X_{ij} \beta _j. \end{aligned}$$

To select a specific value of the penalization \(\lambda \) which defines our final predictor (for final evaluation on out-of-sample testing sets), we choose the \(\lambda \) that maximizes the performance metric in each cross validation set thereby creating five different predictors. For case-control phenotypes, the performance metric is AUC, and for quantitative phenotypes, it is the correlation between predicted and actual trait value. This is explained in more detail in Supplementary Appendices C and G.

Other significant covariates, such as age, sex, principal components from population structure, etc. could be included in the model and would serve to enhance the predictive power of these predictors. However, we are primarily interested in genetic predictive power alone. In Ref.5, age/sex is included as a covariate and it is shown that model prediction improves when these are included. In Ref.8, it was shown that the prediction variance accounted for in the top principal components (population structure) for complex traits in the UK Biobank is negligible. This is the reason why we do not include them in PRS/PGS construction. The UKB white population displays very little population structure—this is elaborated on in Supplementary Appendix D. However, the concern that principal component and age differences could explain some of the discriminatory power is explored in Supplementary Appendix D where we compare sibling pairs to randomized pairs which are chosen to have a similar principal component and age difference structure as the sibling set.

The training computations were performed using the super-computing cluster in the Michigan State University High Performance Computing Center.

Sibling differences in case/control phenotypes

For each trait, 1,000 randomly selected (non-sibling) individuals are set aside (not used in the training) from the non-sibling training set, but are used for cross-validation and model selection. For case-control phenotypes, there are 500 cases and 500 controls making up the 1,000. (For Breast, Prostate, and Testicular Cancer the corresponding numbers are 100 and 100, due to smaller datasets.) This process is repeated 5 times to generate a set of 5 predictors so that statistical fluctuations associated with the training process (mean and variance) can be estimated. We do not report the performance metrics on the validation sets as they are quantitatively similar to that of the final test set—see5 for an example of this.

For all traits, we make use of L1 penalized regression as described in Refs.5,8. Previous work has shown this to be an effective method of generating polygenic predictors5,8. The typical outputs of a LASSO run are the regularization parameters and a vector of SNP weights—this is discussed at length in Refs.5,6,8 where we use the scikit-learn package instead of a custom implementation22. We include results from publicly available predictors for Breast Cancer and Coronary Artery Disease from Khera et al.6—scoring from these predictors is described in Supplementary Appendix B.2.

The first quantity which is calculated for a predictor is an overall performance metric: for case/control phenotypes this corresponds to AUC; for quantitative phenotypes we focus on the correlation coefficient between predicted and actual phenotypes. The test set is composed of all individuals who are within a sibling pair in the UKB—the performance metric on this test set matches previous results from the literature5,8 and sets the baseline of comparison.

Note, case and control PRS distributions were shown in previous work5 to be shifted in mean. From these shifted distributions one can estimate the likelihood of case status for an individual with a particular PRS score. (That is, the fraction of individuals in a certain PRS bin who are cases vs controls.) In Fig. 1 we show an example of such a PRS distribution for both the entire sibling testing set and the restricted affected sibling pair (ASP) cohort. The ASP cohort consists of individuals with a sibling that is a case, and its PRS distribution is somewhat different from that of the general population. Please see Ref.5 for a more in depth discussion of the PRS distributions and “Population risk sorting: relative risk reduction” for more analysis of the ASP cohort.

Figure 1
figure 1

The left and right panels show case and control distributions in PRS for the entire cohort of sibling pairs and the Affected Sibling Pair (ASP) cohort respectively. Phenotype is Hypertension. This plot was made using pyplot v3.2.1 under license

Sibling call rates: case|control

A first test of polygenic scores in the affected sibling context can be made by simply computing the frequency at which the higher PRS sibling corresponds to the affected individual. We restrict the test set to all sibling pairs with one affected sibling and one unaffected sibling—i.e., we exclude sibling pairs where both are cases or controls. Within this set, we compute the fraction of the time in which the sibling with higher PRS is the case. The results are given in Table 1. As a baseline comparison, we also compute the fraction called correctly using an equal number of non-sibling case–control pairs randomly drawn from the total sibling set.

In Supplementary Appendix E we perform a similar analysis for trios of siblings.

Table 1 Polygenic predictors tested on sibling pairs.

Case identification for high risk sibling

Here we consider sibling pairs with one affected (case) and one control. Further, we focus on the subset of pairs in which one sibling has a high PRS score and the other a PRS score in the normal range (i.e., less than + 1 SD above average). In other words, exactly one of the sibs is a high risk outlier and we wish to know how often it is the outlier that is a case.

The previous analysis focused on the identification of the case in a sibling pair by selecting the larger polygenic score even if the difference was very small. While our polygenic scores are themselves additive models, individual risk (for example odds ratios as calculated in Refs.5,6 from validation data) can increase or decrease non-linearly as a function of PRS in the tails of the distribution—i.e., for outliers in PRS. For most individuals, in the middle of the distribution, the risk behavior is approximately linear, and the change in risk per standard deviation of change in PRS is not large. Because of this we do not expect strong prediction results when comparing two individuals in the normal PRS range. In this analysis, summarized in Table 2, one sibling is labeled high risk and the other sibling is normal risk as defined by PRS. In all cases, normal risk is defined as in being in the 84th percentile or below (< + 1 SD in PRS), while we vary the threshold used to define high risk (> + 1.5 SD, + 2.0 SD, + 2.5 SD, etc.).

As we restrict to sibling pairs with a larger risk differential, the predictions of which sibling is the case become more accurate (albeit still noisy). In other words: given that one of two siblings is affected, when one sibling is normal risk in PRS but the other sibling is in the top few percentile of risk—the larger PRS sibling will be increasingly likely to be the affected sibling as the difference in PRS becomes larger.

Table 2 Predictors tested on sibling pairs with a single case, where one sibling is high risk (\(+\;1.5\), \(+\;2\), \(+\;2.5\) SD or 93rd, 97.5th, 99th percentile) and the other is normal risk (\(<\;+\;1\) SD or < 85th percentile).

We repeat this calculation for a set of pairs in which no individual is paired with his or her sibling. This is done using the sibling population by randomizing the pairings. We generate random pairs of non-sibling individuals with exactly one case per pair. Further, we consider the subset of pairs in which one member of the pair is normal risk (PRS < + 1 SD), while the other is high risk. We then compute the probability that the high risk individual is the affected individual. Results are given in Table 3.

Table 3 Predictors tested on non-sibling (random) pairs w/ a single case where one is high risk (\(+\;1.5\), \(+\;2\), \(+\;2.5\) SD above or 93rd, 97.5th, 99th percentile) and the other is normal risk (\(<\;+\;1\) Standard Deviation or < 85th percentile).

Comparing Tables 2 and 3 suggests higher prediction accuracy for non-sibling pairs of individuals. The difference in accuracy is slightly inflated by the fact that the normal risk individuals in the related (sib) pairs tend to cluster closer to the + 1 SD PRS upper limit than those in the non-sibling pairs. This is because, conditional on having a high-risk sibling, the distribution of PRS scores is shifted to larger than average values. Nevertheless, we see that the success fractions are not very different between the two tables, and almost always overlap within one standard deviation uncertainty.

These results suggest that polygenic prediction works almost as well between siblings as in unrelated individuals.

In Fig. 2, we repeat the analysis from the tables using a continuously varying threshold (in z-score) to define the high risk set of individuals. As the threshold z-score increases the fraction of cases called correctly also increases. We display the results for Affected Sibling Pairs (ASP) as well as non-sibling pairs of individuals where each pairing consists of one case and one control. There is some reduction in accuracy for sibling pairs versus non-sibling pairs, as expected.

The error estimates in the figures and tables are generated as follows. We display the larger of two contributions to the uncertainty in determining the fraction called correctly (vertical axis): one results from the standard deviation among the five predictors we generate for each trait. The other results from sampling error (i.e., having only a finite number of pairs in which to estimate the fraction called correctly). The second source of error is a Clopper–Pearson interval with a \(68\%\) confidence value.

Figure 2
figure 2

Predictors tested on random (non-sibling) pairs and affected sibling pairs with a single case. One individual is high risk (with z-score given on the horizontal axis) and the other is normal risk (PRS < + 1 SD). The error estimates are explained in the text. This plot was made using pyplot v3.2.1 under license

Figure 2 is given specifically as an example—similar plots are generated for all conditions. These are shown in Supplementary Appendix H.

Population risk sorting: relative risk reduction

Polygenic scores can be used to identify subsets of the population who are at high or low risk for a given condition. This information can be used to better allocate resources for, e.g., screening or prevention. In Ref.5, it is proposed that polygenic scores could lead to more effective detection and intervention for a wide variety of health conditions (e.g., breast cancer). The early detection of disease conditions could then lead to a net cost reduction and better health outcomes. Throughout this section we are specifically focused on genetic risks, but these results could be incorporated into more complete risk models or potential clinical applications as found in Refs.23,24,25,26.

In this section, we investigate how the number (or fraction) of affected individuals varies as we exclude high and low risk individuals from the group. The fraction of affected individuals can be considered an estimator for the probability that a randomly selected individual will develop the condition, conditional on either having 1. PRS below some upper limit (left panel in figures—a low risk population defined by PRS) or 2. PRS above some lower limit (right panel in figures—a high risk population defined by PRS).

The figures here display the fraction of individuals affected when restricted to PRS score either above or below a specific value. The upper panels in Figs. 3, 4, 5 and 6 display the results for randomly selected individuals from the general population. The orange line in both panels represents the disease prevalence in the entire testing set (general population).

Figure 3
figure 3

Exclusion of individuals above (left panel) and below (right panel) a z-score threshold (horizontal axis) with resulting group prevalence shown on the vertical axis. The left panel shows risk reduction in a low PRS population, the right panel shows risk enhancement in a high PRS population. Top figures are results in the general population, bottom figures are the Affected Sibling Pair (ASP) population (i.e., variation of risk with PRS among individuals with an affected sib). Phenotype is Type 2 Diabetes. This plot was made using pyplot v3.2.1 under license

Figure 4
figure 4

Exclusion of individuals above (left panel) and below (right panel) a z-score threshold (horizontal axis) with resulting group prevalence shown on the vertical axis. The left panel shows risk reduction in a low PRS population, the right panel shows risk enhancement in a high PRS population. Top figures are results in the general population, bottom figures are the Affected Sibling Pair (ASP) population (i.e., variation of risk with PRS among individuals with an affected sib). Phenotype is Breast Cancer. This plot was made using pyplot v3.2.1 under license

Figure 5
figure 5

Exclusion of individuals above (left panel) and below (right panel) a z-score threshold (horizontal axis) with resulting group prevalence shown on the vertical axis. The left panel shows risk reduction in a low PRS population, the right panel shows risk enhancement in a high PRS population. Top figures are results in the general population, bottom figures are the Affected Sibling Pair (ASP) population (i.e., variation of risk with PRS among individuals with an affected sib). Phenotype is Hypertension. This plot was made using pyplot v3.2.1 under license

Figure 6
figure 6

Exclusion of individuals above (left panel) and below (right panel) a z-score threshold (horizontal axis) with resulting group prevalence shown on the vertical axis. The left panel shows risk reduction in a low PRS population, the right panel shows risk enhancement in a high PRS population. Top figures are results in the general population, bottom figures are the Affected Sibling Pair (ASP) population (i.e., variation of risk with PRS among individuals with an affected sib). Phenotype is Heart Attack.This plot was made using pyplot v3.2.1 under license

These plots are meant to be illustrative. Similar plots are shown for each of the disease conditions we study in Supplementary Appendix I.

We examine the behavior of PRS in the context of a known family history by repeating the previous calculation on a restricted Affected Sibling Pair (ASP) testing set. In the lower panels of Figs. 3, 4, 5 and 6 we compute the same disease prevalence as in the upper panels, but for individuals with an affected sibling. That is, all cases and all controls used in the calculation have an affected sibling; the existence of this affected sibling defines the population analyzed as one with higher than normal risk. The values in the lower panels of Figs. 3, 4, 5 and 6 reflect an overall higher fraction of affected individuals than in the entire data set. It seems plausible that this increased risk is due to the family history of the individuals. However, the results show that low PRS individuals have reduced risk relative to others with a similar family history. Given two individuals A and B, where A has an affected sibling A’ and B has an affected sibling B’, the graphs show that between A and B, the one with higher PRS has a higher probability of having the condition. The green line in both panels represents the disease prevalence in the entire testing set—the population of individuals with an affected sibling.

For some of the disease conditions with small rate of incidence, we did not have enough data to directly estimate risk as a function of PRS for sibs in the ASP population—i.e., there are not enough sib pairs in which both are cases. However, we typically did have enough data to estimate mean and standard deviation in PRS for affected and unaffected individuals conditional on each individual having an affected sib. (Less data is required to estimate a mean and SD than to map out an entire curve bin by bin.) Assuming that the distribution of cases and controls are both approximately Gaussian in PRS (something we verified to be true for conditions for which we have more data), this allows us to compute the implied risk as a function of PRS. We include this predicted risk as a function of PRS (see green curves) on all prevalence plots involving the ASP populations. The results are shown in the corresponding figures in Supplementary Appendix I.

Figure 3 is meant to be illustrative and similar plots for all conditions are given in Supplementary Appendix I. We include the predicted prevalence as a function of score—the predicted prevalence is calculated assuming that cases and controls are normally distributed (a mixed Gaussian distribution). The means, standard deviations and total numbers of cases and controls are the only (six) parameters needed for the predicted curve—these are calculated directly from the data. See Ref.5 for a more in depth discussion.

Identification within affected sibling pairs (ASP)

To assess the degree to which discriminatory power is altered within affected families for case/control phenotypes, we calculate the AUC amongst the full testing set (i.e., a proxy for the general population; for convenience we used all individuals with a sibling) and amongst a cohort of affected sibling pairs (ASP; all cases or controls must have a sibling who is also a case). The ASP cohort is constructed by restricting the testing set to all sibling pairs as follows: controls have at least one sibling which is a case; cases must also have at least one other sibling which is a case—i.e., in this new test set, all cases and controls have at least one affected sibling. The difference in the AUC between the entire population and the affected sibling testing sets are given in Table 4.

Table 4 Polygenic predictors tested on sibling pairs.

Table 4 shows that, as expected, prediction accuracy is typically higher in groups of non-sibling individuals. However, the decrease in AUC when working with high risk families (i.e., where at least one sib is affected) is typically modest.

Sibling differences in quantitative traits

Performance difference: siblings vs non-sibling pairs

We now turn to prediction of quantitative phenotypes. To evaluate performance one typically computes the correlation between predicted and actual phenotypes: \(\rho (PGS,y)\) where PGS and y are the predicted phenotype from polygenic score and the measured phenotype respectively.

In comparing between-sibling performance to performance in the general (non-sibling) population, it is useful to consider pairwise differences in actual phenotype and predicted phenotype (polygenic score): \(\Delta y\) and \(\Delta PGS\). For example \(\Delta y\) could be the (z-scored) difference in height between the two in the pair, and \(\Delta PGS\) the (z-scored) difference in predicted heights (or PGS score).

We compute the correlation between phenotype and score difference, \(\rho (\Delta PGS,\Delta y)\), for pairs of siblings and for pairs of non-sibling individuals. The results are given in Table 5. Figure 7 provides a specific example—results are shown for all traits considered in Supplementary Appendix J.

Table 5 Polygenic predictors tested on sibling pairs and non-sibling (random) pairs.
Figure 7
figure 7

Difference in phenotype (vertical axis) and difference in polygenic score (horizontal axis) for pairs of individuals. Red dots are sibling pairs and blue dots are random (non-sibling) pairs. This plot was made using pyplot v3.2.1 under license

Educational attainment (EA) shows an especially large between-sibling attenuation in performance relative to the other predictors. This has been noticed in other studies27. The results suggest that at least some of the observed power in polygenic prediction of EA among non-sibling individuals comes from effects such as subtle population stratification (perhaps correlated to environmental conditions or family socio-economic status)12, genetic nurture13, or other environmental-genetic correlations9,10,11. Interestingly, the decrease in power seems to be not as large for the phenotype Fluid Intelligence (measured in UKB using a brief 12 item cognitive test).

Rank order accuracy: siblings vs non-sibling pairs

We can further compare between-sibling effectiveness of quantitative trait predictors by estimating the probability of predicting rank order—e.g., which sib is taller—using PGS.

First, how often does the higher PGS sibling have the larger value of the actual phenotype? We restrict the analysis to only those pairs of siblings whose phenotypes are known and then compute the fraction of the time in which rank order by PGS agrees with rank order in phenotype. The results are listed in Table 6.

Similar results for trios of siblings are presented in Supplementary Appendix E.

Table 6 Rank ordering by polygenic score.

Rank order accuracy as a function of phenotype difference

In the previous calculation many of the failures to correctly predict rank order result from the two individuals in the pair having very close values of the phenotype. To further investigate, we consider accuracy of rank order prediction as a function of actual phenotype difference in the pair. As expected, probability of correct rank ordering increases with actual difference in phenotype.

PGS from sets of five trained predictors are z-scored based on the testing population. The identification of pairs with phenotypic difference larger than x (value shown on horizontal axis of Fig. 8) is based upon the average score value across the five predictors. This selects the sub-cohort with large phenotypic difference. Then the fraction called correct is calculated for each of the five polygenic scores. This fraction, for 0.5, 1, and 1.5 standard deviation difference in phenotype, can be found in Table 7. The quoted error is computed as the larger of the standard deviation resulting from the five different predictors, and the statistical sampling error (Clopper-Pearson interval) in estimating the probability p in a binomial distribution. (See earlier discussion in “Case identification for high risk sibling”) To clarify: the first error contribution is intrinsic to the construction of the predictor (different training runs create slightly different predictors), the second error contribution always arises when estimating the (success) probability p from a finite sample of N datum.

Table 7 Predictors tested on sibling pairs where a phenotype difference is larger than some value (\(+\;0.5\), \(+\;1.0\), \(+\;1.5\) Standard Deviations difference; the adjusted phenotypes are described in the Supplementary Appendix).

We repeat this calculation for non-related individuals, by simply randomizing the pairings so that individuals are no longer paired with their siblings. We then perform the same operations: select pairs where the phenotype difference is larger than a certain value and then compute the fraction of pairs where the high PGS individual has a larger value. This is illustrated in Table 8.

Table 8 Predictors tested on non-sibling pairs where a phenotype difference is larger than some value (\(+\;0.5\), \(+\;1.0\), \(+\;1.5\) Standard Deviations difference; the adjusted phenotypes are described in the Supplementary Appendix).

The comparison between non-sibling pairs and sibling pairs is shown in Fig. 8, where we display the fraction identified correctly for sibling pairs and for randomly paired individuals, allowing the threshold phenotype difference to vary continuously. The difference between the blue and orange lines represents the difference between predictive power amongst non-sibling and related individuals.

Figure 8
figure 8

Probability of PGS correctly identifying the individual with larger phenotype value (vertical axis). Horizontal axis shows absolute difference in phenotypes. The blue line is for sibling pairs, the orange line is for randomized (non-sibling) pairs. This plot was made using pyplot v3.2.1 under license

Figure 8 is given specifically as an example—similar plots are generated for all continuous traits which are discussed in this paper. These are shown in Supplementary Appendix K. The loss of power in polygenic predictors is expected, but these calculations illustrate the central point that polygenic predictors can still reliably improve the identification of individuals (or rank ordering) when large phenotypic differentials exist.


Siblings have typically experienced similar environments during childhood, and exhibit negligible population stratification relative to each other. The ability to predict differences in disease risk or complex trait values between siblings provides an important validation of polygenic predictors. We compared validation results obtained using non-sibling subjects to those obtained among siblings, and found that most of the predictive power persists in between-sibling designs.

In the case of disease risk we tested the extent to which higher polygenic risk score (PRS) identifies the affected sibling, and also estimated Relative Risk Reduction as a function of risk score threshold. For quantitative traits we studied between-sibling differences in trait values as a function of predicted differences, and compared to performance in non-sibling pairs.

One exception is the Educational Attainment (EA) predictor, which exhibits a very strong reduction in power when applied to sibs. This is not entirely unexpected as effects like the violation of the equal-environment hypothesis may be found for EA4, and EA can depend on complicated correlations between environment and genes28. Interestingly, the corresponding reduction for the Fluid Intelligence predictor is much less than for EA. This is discussed in more detail below.

Our focus was not primarily on the absolute level of prediction, but rather on the comparison between results in non-sibling pairs versus sibling pairs. Improved absolute levels of prediction can be obtained by taking into account covariates (e.g., age, specific biomarkers, other correlates), as done in Refs.6,8. For most predictors the observed reduction in power tends to be modest. The largest decline in power is observed for the quantitative trait Educational Attainment (but see Fluid Intelligence in contrast). The results discussed above suggest that almost all of the predictors studied capture some real, direct, genetic effects. These effects survive between-sibling validity testing, and attenuation of predictive power tends to be modest (most of the power remains in the sibling tests). Our results for height, body mass index (BMI), EA, and Fluid Intelligence are similar to recent results found in Ref.29, utilizing data from the Twins Early Development Study (TEDS). As far as we know, this paper is the first to analyze a variety of disease risks using between-sibling designs.

We emphasize that predictors trained on even larger datasets will likely have significantly stronger performance than the ones analyzed here5,8. As we elaborated in earlier work, where many of these predictors were first investigated, their main practical utility at the moment is in the identification of outliers who may be at exceptionally high (or low) risk for a specific disease condition. The results here confirm that high risk score outliers are indeed at elevated risk, even compared to their (normal range score) siblings.

The main limitation to progress is sample size—number of genotyped individuals available for analysis. As larger datasets become available, the accuracy and robustness of these results can only improve. Stronger results could be obtained using future datasets, with larger families and larger numbers of families. However, with the UK Biobank we were mostly limited to sibling pairs—ideally, a similar analysis could be done with full families (parents and children).

The sibling results presented in this paper, together with the many out of sample validations of polygenic scores that continue to appear in the literature, suggest that genomic prediction in humans is a robust and important advance that will lead to improvements in translational medicine as well as deep insights into human genetics.

As shown in earlier work5,8, we expect the predictors to improve substantially as more data become available for training. This is conditioned on genotyping that captures a sufficient part of the predictive regions of the genome30. It seems clear that with the possible exception of the height phenotype (for which we start to see diminishing returns; most of the common SNP heritability is captured already in the predictor), training is limited by sample size (specifically for risk predictors: number of genotyped cases) and not by algorithm performance or computational resources.