Genetic architecture of complex traits and disease risk predictors

Genomic prediction of complex human traits (e.g., height, cognitive ability, bone density) and disease risks (e.g., breast cancer, diabetes, heart disease, atrial fibrillation) has advanced considerably in recent years. Using data from the UK Biobank, predictors have been constructed using penalized algorithms that favor sparsity, i.e., that use as few genetic variants as possible. We analyze the specific genetic variants (SNPs) utilized in these predictors, whose number ranges from dozens to as many as thirty thousand. We find that the fraction of SNPs in or near genic regions varies widely by phenotype. For the majority of disease conditions studied, a large amount of the variance is accounted for by SNPs outside of coding regions. The state of these SNPs cannot be determined from exome-sequencing data. This suggests that exome data alone will miss much of the heritability for these traits; i.e., existing polygenic risk scores (PRS) cannot be computed from exome data alone. We also study the fraction of SNPs and of variance that is in common between pairs of predictors. The DNA regions used in the disease risk predictors constructed so far seem to be largely disjoint (with a few interesting exceptions), suggesting that individual genetic disease risks are largely uncorrelated. It seems possible in theory for an individual to be a low-risk outlier in all conditions simultaneously.
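As a rough illustration of what "in or near genic regions" means operationally, the sketch below pads each gene interval by k kilobase pairs on both ends and counts the share of SNP positions that fall inside a padded interval. It is a minimal single-chromosome toy, not the pipeline used for the paper: the function names, the coordinates, and the choice to merge overlapping padded intervals are all assumptions made for illustration.

```python
from bisect import bisect_right

def merge_padded_intervals(genes, pad_bp):
    """Pad each (start, end) gene interval by pad_bp on both ends, then merge overlaps."""
    padded = sorted((max(0, start - pad_bp), end + pad_bp) for start, end in genes)
    merged = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def genic_fraction(snp_positions, genes, k_kb):
    """Fraction of SNP positions lying within a gene boundary expanded by k_kb kilobases."""
    merged = merge_padded_intervals(genes, k_kb * 1000)
    starts = [start for start, _ in merged]
    hits = 0
    for pos in snp_positions:
        # Rightmost interval whose start is <= pos, if any.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos <= merged[i][1]:
            hits += 1
    return hits / len(snp_positions)

# Hypothetical coordinates on one chromosome (not real GENCODE data).
toy_genes = [(100_000, 150_000), (400_000, 420_000)]
toy_snps = [90_000, 120_000, 200_000, 430_000]

frac_k0 = genic_fraction(toy_snps, toy_genes, 0)
frac_k30 = genic_fraction(toy_snps, toy_genes, 30)
print(frac_k0, frac_k30)
```

Running the same function on a random SNP set of equal size gives the kind of baseline against which a predictor's genic fraction can be compared.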

: Plots of the number of random SNPs located within genic regions, expressed as a percentage of the total number of SNPs in a randomly-selected set containing the same number of SNPs as the activated set in the predictor for the indicated disease condition/trait, against expansion of GENCODE Release 19 gene boundaries by k kilo base pairs. SNPs were randomly selected from the 800,000+ variants measured by the UK Biobank Axiom Array.

Supplementary Figure S8: The fifteen largest total values of variance accounted for by predictor SNPs located on a single gene (in terms of the percentage of total variance accounted for by all predictor SNPs) for the type-2 diabetes predictor. Each vertical bar is colored violet with a depth of shade proportional to the height of the bar. Here, 'genic' SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Supplementary Figure S9: The fifteen largest total values of variance accounted for by predictor SNPs located on a single gene (in terms of the percentage of total variance accounted for by all predictor SNPs) for the diastolic blood pressure predictor. Each vertical bar is colored violet with a depth of shade proportional to the height of the bar. Here, 'genic' SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Figure S14: The fifteen largest total values of variance accounted for by predictor SNPs located on a single gene (in terms of the percentage of total variance accounted for by all predictor SNPs) for the heart attack predictor. Each vertical bar is colored violet with a depth of shade proportional to the height of the bar. Here, 'genic' SNPs are contained within the GENCODE Release 19 gene boundaries plus 30 kilo base pairs at both ends.

Tables

Tables that list the

care records, and admission to hospital with a primary diagnosis of asthma.
With regard to age, sex and ancestry, the sequenced individuals are representative of the overall UK Biobank cohort. The sample set has 194 parent-offspring pairs, including 26 mother-father-child trios, 613 full-sibling pairs, 1 monozygotic twin pair and 195 second-degree genetically determined relationships.

Appendix B: Supplementary
Exomes were captured using a version of the IDT xGen Exome Research Panel v1.0. Multiplexed samples were sequenced with dual-indexed 75 x 75 bp paired-end reads on the Illumina NovaSeq 6000 platform using S2 flow cells. The specific genomic regions targeted for sequencing covered 39 megabases of the human genome, corresponding to 19,396 genes. In addition, the regions measuring 100 bp and located directly upstream and downstream of each target region were also sequenced.
A total of 4,735,722 variants located in targeted regions were identified. With adjacent (non-targeted) 100 bp regions included in the tally, a total of 9,693,526 indel and single nucleotide variants (SNVs) were observed. While only the target regions are required to meet all sequencing quality standards such as unique read coverage, variants in both target and adjacent regions were subjected to the same variant quality control metrics. Approximately 14% of coding variants identified via whole-exome sequencing were observed in the imputed sequence of 49,797 participants with both whole-exome sequencing and imputed data. Conversely, 22.6% of the coding variants in the imputed data were not observed in the whole-exome sequencing data.
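The two percentages above use different denominators (coding variants found by whole-exome sequencing versus coding variants in the imputed data), which is why they are not complements of each other. The toy sketch below makes that asymmetry explicit; the variant labels and the helper name are hypothetical and do not reproduce the reported figures.

```python
def fraction_shared(reference, other):
    """Fraction of variants in `reference` that also appear in `other`."""
    reference, other = set(reference), set(other)
    return len(reference & other) / len(reference)

# Hypothetical variant identifiers, for illustration only.
wes_variants = {"v1", "v2", "v3", "v4", "v5"}
imputed_variants = {"v1", "v6"}

# Share of exome-sequenced variants that are also present in the imputed data.
wes_seen_in_imputed = fraction_shared(wes_variants, imputed_variants)

# Share of imputed variants that are absent from the exome-sequenced data.
imputed_missing_from_wes = 1 - fraction_shared(imputed_variants, wes_variants)

print(wes_seen_in_imputed, imputed_missing_from_wes)
```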

About Our Predictors
Predictors were derived as follows: For every phenotype considered, a small subset of the genetically British grouping of UK Biobank participants was set aside for validation, and the predictors were then trained on a large remaining subset; see Supplementary Tables S57 and S58 for the exact training and validation set sizes. Cross-validation was performed several times, and one predictor was randomly selected as the representative predictor for that particular phenotype. Each predictor consists of a set of SNP IDs, weights, effect alleles, and a value of the penalization parameter λ. True out-of-sample testing and adjacent ancestry testing were performed for many of these predictors in [4,6]. More information can be found in the Methods section of the main text, and in [4,6].
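To make the structure of a predictor concrete, the following sketch applies a set of SNP IDs, effect alleles, and weights to one individual's genotypes by summing weight times effect-allele count. It is a minimal illustration under stated assumptions, not the authors' code: the SNP IDs, alleles, and weights are invented, genotypes are represented as two-letter strings, and SNPs missing from the genotype data are simply skipped.

```python
def effect_allele_count(genotype, effect_allele):
    """Number of copies (0, 1, or 2) of the effect allele in a two-allele genotype."""
    return sum(1 for allele in genotype if allele == effect_allele)

def polygenic_score(predictor, genotypes):
    """Sum of weight * effect-allele count over the predictor's SNPs.

    predictor: {snp_id: (effect_allele, weight)}
    genotypes: {snp_id: two-character genotype string such as "AG"}
    """
    score = 0.0
    for snp_id, (effect_allele, weight) in predictor.items():
        genotype = genotypes.get(snp_id)
        if genotype is None:
            # Assumption: an unmeasured SNP contributes nothing to the score.
            continue
        score += weight * effect_allele_count(genotype, effect_allele)
    return score

# Hypothetical predictor and genotypes; none of these values come from the paper.
toy_predictor = {
    "rs_example_1": ("A", 0.5),
    "rs_example_2": ("G", -0.25),
    "rs_example_3": ("T", 0.1),
}
toy_genotypes = {"rs_example_1": "AA", "rs_example_2": "AG"}

toy_score = polygenic_score(toy_predictor, toy_genotypes)
print(toy_score)  # 0.5 * 2 + (-0.25) * 1 = 0.75
```

Skipping unmeasured SNPs is the simplest possible choice here; it also shows why a predictor whose SNPs lie mostly outside exome-targeted regions cannot be evaluated from exome data alone.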
As described in the Methods section of the main text, the L1-penalized LASSO regression is not expected to output weights such that X · β is of order 1. However, we can still record what the total value of variance,

Figure S62: Genetic correlation estimates for some of the phenotypes examined in this paper. Results are quoted from [22]. Notable overlaps are: heart attack and high cholesterol at 0.6, hypertension and high cholesterol at 0.57, hypertension and heart attack at 0.51, systolic blood pressure and hypertension at 0.79, hypertension and diastolic blood pressure at 0.78, systolic blood pressure and diastolic blood pressure at 0.67, and diastolic blood pressure and pulse rate at 0.29.

Supplementary Figure S63: Log10(p-value) corresponding to the genetic correlation results cited in Supplementary Figure S62. Values of −100 correspond to p-values that are 0 to within floating precision. All values are quoted from [22].