Sibling validation of polygenic risk scores and complex trait prediction

We test 26 polygenic predictors using tens of thousands of genetic siblings from the UK Biobank (UKB), for whom we have SNP genotypes, health status, and phenotype information in late adulthood. Siblings have typically experienced similar environments during childhood, and exhibit negligible population stratification relative to each other. Therefore, the ability to predict differences in disease risk or complex trait values between siblings is a strong test of genomic prediction in humans. We compare validation results obtained using non-sibling subjects to those obtained among siblings and find that typically most of the predictive power persists in between-sibling designs. In the case of disease risk we test the extent to which higher polygenic risk score (PRS) identifies the affected sibling, and also compute Relative Risk Reduction as a function of risk score threshold. For quantitative traits we examine between-sibling differences in trait values as a function of predicted differences, and compare to performance in non-sibling pairs. Example results: Given 1 sibling with normal-range PRS score (< 84 percentile, < + 1 SD) and 1 sibling with high PRS score (top few percentiles, i.e. > + 2 SD), the predictors identify the affected sibling about 70–90% of the time across a variety of disease conditions, including Breast Cancer, Heart Attack, Diabetes, etc. 55–65% of the time the higher PRS sibling is the case. For quantitative traits such as height, the predictor correctly identifies the taller sibling roughly 80 percent of the time when the (male) height difference is 2 inches or more.


Supplementary Information A Genotype Quality Control
The main dataset used in this work is the 2018 release of the UK Biobank (the 2018 version corrected some issues with imputation, included sex chromosomes, etc). In 2018, the UK Biobank (UKB) re-released the dataset representing approximately 500,000 individuals genotyped on two Affymetrix platforms -approximately 50,000 samples on the UKB BiLEVE Axiom array and the remainder on the UKB Biobank Axiom array. The genotype information was collected for 488,377 individuals for 805,426 SNPs which were then subsequently imputed to a much larger number of SNPs. In all predictor training, we restricted our analysis to self-report White individuals (as defined using self-reported ethnic background as surveyed by UK Biobank) [1] -namely participants with the code 1, 1001, 1002 or 1003 in the Ethnic Background field.
In our analyses, we do not use the imputed SNP set for training, but rather build all predictors from the directly genotyped markers. Our previous work has indicated that polygenic scores constructed from imputed data perform slightly less well. The imputed SNP set is used only for testing predictors generated by other authors (Khera et al. [2]). Further, we restrict our analysis to the autosomal chromosomes on every trait as we have found that inclusion of the sex chromosomes does not provide a significant performance enhancement for the traits studied here. From the called SNPs, further quality control was performed using PLINK version 1.9 [3]. After combining individual chromosomes to a single BED file, SNPs and samples which had missing call rates exceeding 3% were excluded and SNPs with minor allele frequency below 0.1% were also removed so to avoid rare variants. This resulted in 663,533 SNPs and 487,048 people -filtering this set by self-reported ancestry results in 459,063 remaining individuals.

B.1 UK Biobank Constructed Predictors
Here we report some of the performance characteristics and SNP content of the predictors used in this work. The majority of the predictors were developed by the authors and previously reported in [4][5][6]. Several predictors are included in this paper which have not been published elsewhere but were developed in a completely analogous manner -specifically body fat percentage, body mass index, fluid intelligence and platelet count. Two predictors (Breast Cancer and Coronary Artery Disease) which are evaluated were developed and published externally in [2]. In table S1 we show the trait type (binary or quantitative), the AUC or correlation coefficient in the testing set and the number of active SNPs in the predicture. Quantities listed are the average over 5 predictors and the quantity in parenthesis is the standard deviation.

Condition
Trait

B.2 External Predictors
We use Breast Cancer and Coronary Artery Disease predictors that were published in [2]. To apply these predictors, we use the imputed genomes directly from the UK Biobank without any additional QC. We again restrict to the self-report white sibling cohort for all testing. For each individual, we scan through each chromosome and generate a score using the SNP weight column from the predictor file for every SNP which is present in the imputed chromsome. Each individual chromosome score is combined and then the final result is z-scored based on the controls in the cohort (so that the score is centered near zero with unit variance).
For Coronary Artery Disease, this set gave 40,108 individuals and for Breast Cancer, after restricting to only females, gave 23,205 individuals.

C Sibling Identification, Training and Testing Sets
The UK Biobank performed an initial relatedness analysis on all participants, but familial relationships among UK Biobank participants were not directly recorded. The analysis was done by the UK Biobank, calculating kinship coefficients and IBS0 using the KING software [1]; the details can be found in the original UK Biobank paper [1]. The UK Biobank provides a total of 107,162 related pairs of individuals with kinship coefficients and IBS0.
To identify sibling pairs, we implement the same filters which the UK Biobank used: parentoffspring and sibling relationships have an expected relatedness coefficient of 0.25 and parentoffspring relationship are distinguished by those with IBS0 < 0.0012. We select all pairs with IBS0 > 0.0012 and kinship coefficient > 0.176, yielding a set of 22,667 pairs of participantsthis agrees with table S3 of the UK Biobank quality control supplementary information [1]. An in-depth analysis of the relationships can be found in the UK Biobank supplementary material. After restricting this group to individuals who survived the genotype quality control and also self-reported white ancestry, 40,030 individuals remained in this set and formed 21,671 pairs of siblings.
These sibling pairs are assumed to be first-degree siblings -each pair is assumed to have the same mother and father. However, amongst this set there exist situations where a participant is listed as sibling with two others, but the two are not listed as a sibling with each other. For example, if 3 individuals are labeled A, B, C then the following situation exists: A/B form a sibling pair, A/C form a sibling pair, but the pair B/C is not in the list of pairs. In this example, we do NOT include the B/C pair in our calculations with sibling pairs -we maintain the pairing which matches with the UK Biobank. However, when forming trios we group A,B,C together as a trio.
Due to the small number of groups larger than 3, we only use pairs and triplets in our analysis. After excluding individuals who do not self-report as white or pass quality control, 982 trios remain.

D Principal Component and Age Dependence
We note that including age as a covariate may enahance predictive power, given that age is the most important risk factor for many chronic diseases. Not accounting for age differences would be expected to attenuate our findings. In reference [5], many of the phenotypes studied here have been analyzed in models which utilize SNPs alone, sex/age alone and SNPs/sex/age together. The purpose of this work is not to build the most accurate model of risk which includes all major covariates. Rather, it is to show specifically that predictions may be made by genotype alone and that the prediction power persists in siblings.
Similarly, including principal components as covariates could enhance the predictive capability. In reference [7], it was found that principal components in the white UKB population explain a negligible component of variance for several complex traits. However, one could hypothesize that prediction power is enhanced for non-sibling pairs due to residual stratification. Similarly, one could hypothesize that a disparity in age gaps amongst siblings versus in non-sibling pairs could affect prediction.
The principal component structure amongst the sibling set does not vary greatly. This is illustrated in Figure S1. In the upper left panel, we show the values for the first two principal components in the UK Biobank (the principal component values are computed by UKB and the details can be found in reference [1]). We display the principal component structure for 1) the entire UK Biobank, 2) all self-reported Caucasian individuals and 3) the sibling set we use as a final testing set. Note that the sibling set is fairly homogeneous in the first few principal components.
Here we investigate the sensitivity of prediction to age and principal component value by creating pairs of random non-sibs chosen to have principal component or age structure comparable to that of sibling pairs. The sibling pairs are first randomized so that the sibling relationship is lost. We then keep each of the random pairs which have difference in the first four principal components that are similar to the difference in the first four principal components of the sibling pairs. The criterion for keeping the random pairs is that the new pairings principal component differences must be 2 standard deviations from the mean principal component difference of the sibling pairs. The leftover set of random pairs which does satisfy this criterion is then randomized again and the selection process repeats for a total of 10 times. A similar procedure is carried out for year of birth -we select random pairs which have age differences that are within two standard deviations of the mean difference of age differences amongst siblings. The principal component and year of birth differences for the sibling pairs, random pairs and random pairs that are "sibling similar" are illustrated in figures S1 S2.
After this set is identified, the fraction of time which the correct individual is chosen based on PRS is calculated. This is done for case/control traits which have a fairly large number of cases (more than 1000); specifically Hypothyroidism, High Cholesterol, Asthma, Hypertension and Type 2 Diabetes. These results are illustrated in

Case-control phenotypes
After inferring familial units, it was found that there are very few larger families in the UK Biobank -only trios are present in large enough quantity to warrant investigation. We performed risk-reduction analysis similar to that for the sibling pairs, in which we consider only trios with one affected sibling and then compute the fraction of the time in which the highest or lowest PGS sibling is a case. We note that the expected result under random selection would be 1/3. In all cases, the highest (lowest) PGS sibling is more (less) likely to be the affected sibling than random chance would produce.  Table S3: Polygenic predictors tested on sibling trios. The first column is the number of sibling trios with one affected and two unaffected siblings. The next three columns are the probabilities (and standard deviation over 5 predictors) that the higher PRS / lowest PRS / randomly chosen sibling are the case.

Quantitative phenotypes
After identifying family structure in the UKB, we find that only trios are present in large enough quantity to warrant discussion. We performed a similar analysis to that for the sibling pairs, in which we consider all trios with measured phenotypes and then compute the fraction of the time in which the highest or lowest PGS sibling has the highest / lowest phenotype value. We note that the expected result under random selection would be 1/3. We also compute the fraction of trios for which the correct order is assigned; the expected result under random selection would be 1/6 for this question. In all cases, the highest (lowest) PGS sibling is more likely to have the largest (smallest) phenotypic value. Also, the ordering of siblings is more likely to be correct than random chance would produce.  Table S4: Polygenic predictors tested on sibling trios and random trios. The first column gives the number of trios, columns 2-4 give the probabilities that the largest / smallest phenotype value, and order, are correctly identified using PGS. Quantities in parenthesis are standard deviations. Note that all quantities exceed the expected values of 1/3 (0.33) for selection and 1/6 (0.167) for rank ordering.

Case/Control Phenotypes
In this section, we describe how cases and controls are identified. We extract the following conditions and use as a case/control phenotype: Atrial Fibrillation, Asthma, Basal Cell Carcinoma, Breast Cancer, Coronary Artery Disease, Gallstones, Glaucoma, Gout, Heart Attack, High Cholesterol, Hypothyroidism, Hypertension, Malignant Melanoma, Prostate Cancer, Testicular Cancer, Type 1 Diabetes and Type 2 Diabetes. All conditions were identified using the fields "Non cancer illness code (self-reported)", "Cancer code (self-reported)" and "Diagnoses primary ICD10" or "Diagnoses secondary ICD10". All individuals who are not labelled as a case for a specific condition are then labelled as a control.
We used the field "Non-Cancer Illness Code (self-reported)" to identify cases and controls for the following: Atrial Fibrillation, Asthma, Gallstones, Glaucoma, Gout, Heart Attack, High Cholesterol, Hypothyroidism, Hypertension. Specifically, cases were identified by selecting individuals with the codes in "Non cancer illness code (self-reported)" To select Type 1 Diabetes cases, we identify individuals based on a doctor's diagnosis using the fields "Diagnoses primary ICD10" or "Diagnoses secondary ICD10". Specifically, any individual with ICD10 code E10.0-E10.9 (Insulin-dependent diabetes mellitus) in the Main Diagnosis or Secondary Diagnosis field is labelled as a case. To select Type 2 Diabetes cases in UKB, we identify individuals based on a doctor's diagnosis using the fields Diagnoses primary ICD10 or Diagnoses secondary ICD10. Specifically, any individual with ICD10 code E11.0-E11.9 (Non-insulin-dependent diabetes mellitus) in the Main Diagnosis or Secondary Diagnosis field is labelled as a case. To select for Coronary Artery Disease, we select based on ICD10 criterion identified in [2], specifically we select any individual with any of the following ICD10 codes (corresponding to various diagnoses of angina, myocardial infarctions or ischaemic heart disease): I20.0, I20.1, I20.8, I20. After identifying cases and controls in the whole UKB population, we restricted our training set to self-reported white and our testing set to self-reported white but within a sibling pair. For sex-specific traits (breast cancer / prostate cancer), we restricted our analysis to individuals who are of the appropriate sex. From the training set, we then select a random 500 cases and 500 controls to use for validation and model selection -with the exception of breast, prostate and testicular cancers where we choose 100/100 for validation due to the smaller training set. The number of cases and controls identified in this manner for training / validation are listed in  For all traits, we calculate the mean and standard deviation for Genetically British males / females and then z-score the phenotypes appropriately. After z-scoring based on sex, we fit a trendline to everyone born between 1938 and 1968, this is then subtracted from the z-scored values. This is done to correct for any population level changes over time. Educational Attainment was converted into a quantitative trait using the ISCED scale, similar to that used by the SSGAC collaboration [8]. Specifically, the codes in field "Qualifications" were converted to to years of education via the following mapping: (1,2,3

G Model Training Algorithm
In all calculations, we use the implementation of LASSO regression (Least Absolute Shrinkage and Selection Operator) found in Scikit Learn for Python 3 [9]. Given the large datasets, we use the lassopath algorithm which outputs the set of λ and β along the regularization pathway so we can choose λ on our own validation sets. For completeness, we outline the general optimization procedure executed to obtain the SNP weights.
First, we perform single marker regression using the PLINK software [3]. From this, the top 50k SNPs by p-value are selected and this subset of SNPs is used in the LASSO run. The BED matrix is loaded into memory using the Pandas-Plink package [10]. The data is standardized and lassopath is run with nstep = 200 and eps = 0.04 (eps = 0.01 for continuous traits).
Given a set of samples i = 1, 2, ..., n with a set of p SNPs, the phenotype y i and state of the j th SNP, X ij , are observed. X ij is an n x p matrix which contains the number of copies of the minor allele and any missing values are replaced with the SNP average. L 1 penalized regression, LASSO, seeks to minimize the objective function where | v| 1 = n i |v i | is the L 1 norm, || v|| = n i v 2 i is the L 2 norm and λ is a adjustable hyperparameter. The solution is given in terms of the soft-thresholding function as S(z, γ) = sgn(z)max(|z| − γ, 0) The penalty term affects which elements of β have non-zero entries. The value of λ is first chosen to be the maximum value such that all β i are zero, and it is then decreased, allowing more nonzero components in the predictor. For each value of λ, β * (λ n ) is obtained using the previous values of β * (λ n−1 ) (warm start) and coordinate descent. The Donoho-Tanner phase transition [11] describes how much data is required to recover the true nonzero components of the linear model and suggests, e.g., for SNP heritability h2 ∼ 0.5 that we expect to recover the true signal with s SNPs when the number of samples is n ∼ 30s − 100s (see [12,13]). For a more complete description of the algorithm, consult the documentation available at https://scikit-learn.org/stable/modules/generated/ sklearn.linear_model.lasso_path.html.

H High vs Normal risk call rates.
Here we focus pairs in which one member is a case and the other is a control. We further restrict to the subset of pairs where one sib has a high PRS score and the other a PRS score in the normal range (i.e., less than +1 SD above average). In other words, exactly one of the pair is a high risk outlier, and we wish to know how often it is the outlier that is affected.
As we restrict to sibling pairs with a larger risk differential, the predictions for which sibling is the case become more accurate (albeit noisy). In other words: given that one of two siblings is affected, when one sibling is normal risk in PRS but the other sibling is in the top few percentile of risk -the larger PRS sibling will be increasingly likely to be the affected sibling as the difference in PRS becomes larger.
We repeat this calculation for non-related individuals. We generate random pairs of nonsibling individuals with exactly one case per pair. Further, we consider the subset of pairs in which one member of the pair is normal risk (PRS < +1 SD), while the other is high risk. We then compute the probability that the high risk individual is the affected individual. Results are given in table ?? and in Figures S3, S4, S5, S6, S7.
The error estimates in the figures are generated as follows. We display the larger of two contributions to the uncertainty in determining the fraction called correctly (vertical axis): one results from the SD among the 5 predictors we generate for each trait, the other results from sampling error (i.e., having only a finite number of pairs in which to compute the fraction called correctly). This is known as the Clopper-Pearson interval; in the case that the probability p of calling the case correctly is not too close to 0 or 1, and N pairs used, the one standard deviation sampling uncertainty is given by roughly p(1 − p)/N . Figure S3: Predictors tested on random (non-sibling) pairs and affected sibling pairs with a single case. One individual is high risk (with z-score given on the horizontal axis) and the other is normal risk (PRS < +1 SD). The error estimates are explained in the text. This plot was made using pyplot v3.2.1 under license https://matplotlib.org/3.2.1/users/license.html Figure S4: Predictors tested on random (non-sibling) pairs and affected sibling pairs with a single case. One individual is high risk (with z-score given on the horizontal axis) and the other is normal risk (PRS < +1 SD). The error estimates are explained in the text. (the label "Khera" distinguishes the LASSO generated predictor vs that from ref [2].) This plot was made using pyplot v3.2.1 under license https://matplotlib.org/3.2.1/users/license.html Figure S5: Predictors tested on random (non-sibling) pairs and affected sibling pairs with a single case. One individual is high risk (with z-score given on the horizontal axis) and the other is normal risk (PRS < +1 SD). The error estimates are explained in the text. This plot was made using pyplot v3.2.1 under license https://matplotlib.org/3.2.1/users/license.html Figure S6: Predictors tested on random (non-sibling) pairs and affected sibling pairs with a single case. One individual is high risk (with z-score given on the horizontal axis) and the other is normal risk (PRS < +1 SD). The error estimates are explained in the text. This plot was made using pyplot v3.2.1 under license https://matplotlib.org/3.2.1/users/license.html Figure S7: Predictors tested on random (non-sibling) pairs and affected sibling pairs with a single case. One individual is high risk (with z-score given on the horizontal axis) and the other is normal risk (PRS < +1 SD

I Population prevalence changes
In this appendix, we present the results on how population prevalence varies as we exclude high and low risk individuals from the group. The population prevalence can be interpreted as the probability that a randomly selected individual will develop the condition, conditional on either having 1. PRS below some upper limit (left panel in figures) or 2. PRS above some lower limit (right panel in figures). These are shown in Figures S8, S9, S10, S11, S12, S13, S14, S15, S16, S17, S18, S19, S20, S21, S22, S23, S24, S25, S26.
We demonstrate the utility of PRS in the context of a known family history by repeating this calculation on a restricted ASP testing set. We compute the same disease prevalence but using only individuals with an affected sibling. In this test set, all cases and all controls have an affected sibling (see earlier discussion). The values therefore reflect an overall higher risk due to the family history of the individuals.
Here we illustrate for each disease studied, how the population prevalence changes by excluding individuals with PRS above / below a threshold given on the horizontal axis.    We consider accuracy of rank order prediction as a function of actual phenotype difference in sets of pairs. The identification of pairs with phenotypic difference larger than x (value shown on horizontal axis of figures) is based upon the average score value across the five predictors. This selects the sub-cohort with large phenotypic difference. Then the fraction called correct is done for each of the 5 polygenic scores. The quoted error is then computed as the larger of the standard deviation resulting from the 5 different predictors, and the statistical sampling error in estimating the probability p in a binomial distribution. This is done for sibling pairs and randomly paired individuals. This is shown in Figures S29,S30. Figure S29: Probability of PGS correctly identifying the individual with larger phenotype value (vertical axis). Horizontal axis shows absolute difference in phenotypes. The blue line is for sibling pairs, the orange line is for randomized (non-sibling) pairs. This plot was made using pyplot v3.2.1 under license https://matplotlib.org/3.2.1/users/license.html Figure S30: Probability of PGS correctly identifying the individual with larger phenotype value (vertical axis). Horizontal axis shows absolute difference in phenotypes. The blue line is for sibling pairs, the orange line is for randomized (non-sibling) pairs. This plot was made using pyplot v3.2.1 under license https://matplotlib.org/3.2.1/users/license.html