Abstract
We developed an integrative transcriptomic, evolutionary, and causal inference framework for a deep region-level analysis, which integrates several published approaches and a new summary-statistics-based methodology. To illustrate the framework, we applied it to understanding the host genetics of COVID-19 severity. We identified putative causal genes, including SLC6A20, CXCR6, CCR9, and CCR5 in the locus on 3p21.31, quantifying their effect on mediating expression and on severe COVID-19. We confirmed that individuals who carry the introgressed archaic segment in the locus have a substantially higher risk of developing the severe disease phenotype, estimating its contribution to expression-mediated heritability using a new summary-statistics-based approach we developed here. Through a large-scale phenome-wide scan for the genes in the locus, several potential complications, including inflammatory, immunity, olfactory, and gustatory traits, were identified. Notably, the introgressed segment showed a much higher concentration of expression-mediated causal effect on severity (0.9–11.5 times) than the entire locus, explaining, on average, 15.7% of the causal effect. The region-level framework (implemented in publicly available software, SEGMENTSCAN) has important implications for the elucidation of molecular mechanisms of disease and the rational design of potentially novel therapeutics.
Introduction
A novel coronavirus, Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), has caused a global pandemic^{1}, with millions of individuals infected and over one million lives claimed worldwide. The severity of coronavirus disease 2019 (COVID-19) shows substantial inter-individual variability^{2}, highlighting the pressing question of the major molecular and epidemiological determinants of disease presentation. The features of the host genome that increase the risk of severe COVID-19 constitute a critical public health question^{3}, with important implications for our molecular understanding of a lethal disease and for the development of effective therapeutic strategies. Several recent sufficiently powered studies reproduced the genome-wide association study (GWAS) signal on 3p21.31^{4,5,6}, which had been linked to the risk of respiratory failure and critical illness in COVID-19 cases^{3,7}. A subsequent study found that a 49.4 kb segment (chr3: 45,859,651–45,909,024, hg19) within the locus, which harbors the sentinel GWAS variant, is inherited from Neanderthals^{8}. Despite these striking results, the causal gene or genes in the locus and the phenotypic consequences of the introgressed segment are largely unknown.
The discovery of a locus associated with severe COVID-19 underscores certain fundamental and interrelated methodological issues. Key aspects of broad methodological interest for a region- or locus-level analysis of a putatively complex disease include elucidation of (a) genome function, which may be investigated through causal inference on intermediate molecular traits; (b) evolutionary history, which may stratify the genomic data according to modeled (e.g., introgression status or archaic alleles) and unmodeled sequences; and (c) phenome-scale consequence, which may underlie the adverse outcomes of the disease or indicate comorbidities. Integrating several widely used approaches and a newly developed summary-statistics-based method, we provide a framework that integrates these key elements into a region-level analysis, leveraging the largest collection of human transcriptomes^{9,10,11}, to gain insights into the disease’s etiology and expressivity.
This work has other broad methodological implications for studies of the genetic and molecular basis of complex traits. It presents an unbiased approach to estimating the heritability of gene expression attributable to a genomic segment (e.g., a regulatory element, a region undergoing selection, or a trait-associated locus) within a region, highlighting sources of bias for existing approaches. A segment-anchored analysis enables high-resolution quantification of its effect on genes within the region under study. This work also develops a summary-statistics-based approach to investigating, with improved causal resolution, the phenotypic consequences of a genomic region, proposing a new metric of the proportion of expression-mediated causal effect explained. For illustration, we apply our framework to the specific case of the COVID-19 severity-associated locus (3p21.31) with the inherited archaic segment, but we emphasize the framework’s generalizability and cross-study relevance (Fig. 1).
Results
An overview of the framework
In this work, we developed a framework for a region-level analysis of a complex trait that performs causal inference on an intermediate molecular trait, incorporates the evolutionary history of a modeled DNA sequence segment to clarify the trait’s expressivity, and evaluates a region’s broad phenotypic consequences on the human phenome (Fig. 1). We provide a software implementation, SEGMENTSCAN, of the framework. Here, a “segment” may be a regulatory element, a stretch of DNA under positive selection, or an archaic introgressed haplotype, within a possibly larger region of interest. Leveraging the joint-tissue imputation (JTI) methodology^{9}, the segment-based gene expression heritability is estimated using the “reduced model”, which includes as features the variants in the segment. A region-level gene prioritization is then performed by applying the “full model”, that is, the model trained on all local genetic variants, to GWAS summary statistics for a trait for maximal statistical power. Mendelian randomization approaches, for example, MR-JTI (an approach that estimates the gene effect size on the trait by also modeling the heterogeneity due to horizontal pleiotropy and unobserved confounding^{9}), are used to increase causal support for the prioritized genes. For the putatively causal genes, since a segment (here, an introgression) may reflect the presence of an admixture (here, an ancient one) determining the local ancestry, with molecular or phenotypic consequences^{12}, genetically determined expression scores (GDE-scores)^{11} are generated and compared for an ‘archaic’ genetic profile and the corresponding profile in modern human populations. In addition, the proportion of expression-mediated causal effect explained by the segment is quantified using a newly developed summary-statistics-based approach (“Methods”).
To comprehensively identify the phenotypic consequences of the segment, phenome-wide scans using large-scale biobanks are conducted for the genes for which the segment shows significant evidence of a regulatory effect. The goal of the phenome-wide scan is to identify potential complications or comorbidities. To demonstrate the framework, we applied it to a COVID-19 severity-related region on 3p21.31.
Impact of segment on gene expression
We sought to quantify the impact of the introgressed segment (chr3: 45,859,651–45,909,024, hg19) on gene expression. For genes in the locus, we implemented JTI, a more powerful gene expression prediction approach than PrediXcan^{9,10}, leveraging variants in the segment as features (“Methods”), using the 49 GTEx tissues^{13}. The cross-validation performance provides an estimator of the segment-based heritability of expression that is more robust to model misspecification than the standard genome-based restricted maximum likelihood (GREML) approach (“Methods”), which assumes a polygenic architecture. In this study, for heritability, we consider only the proportion of gene expression variance explained by cis regulation (Fig. 1).
Based on the estimate from JTI, the segment explained up to 24.6%, i.e., for FYCO1, of the variance in gene expression in the locus (Fig. 2a). Interestingly, the protein FYCO1 was recently shown to physically interact with SARS-CoV-2’s NSP13, a helicase-triphosphatase, in a study^{14} of the protein interaction map between SARS-CoV-2 and human proteins. Notably, the JTI prediction quality of the segment-based reduced model was significantly higher than the corresponding PrediXcan (which does not leverage borrowing of information across tissues) quality (P < 2.2e−16, Wilcoxon signed-rank test, Supplementary Table 1), with JTI improving the estimate of heritability by leveraging tissue similarity of gene expression and of its genetic regulation.
We asked whether the introgressed segment was more informative for gene expression regulation than a randomly chosen segment of the same length (i.e., a segment with length equal to that of the introgressed one but with start position at a random position on the same chromosome). We, therefore, built a prediction model for each gene using variants in each of 100 randomly selected segments for comparison (“Methods”). We found that the prediction performance (the square of the Pearson’s correlation r between the predicted expression and the observed expression) of the introgressed segment was significantly higher than the median of the prediction performance from the randomly chosen segments (P = 0.038, Wilcoxon matched-pairs signed-rank test, two-sided, Supplementary Fig. 1).
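The random-segment comparison above can be sketched as follows. This is a minimal illustration, not the SEGMENTSCAN implementation: the function names, the random seed, and the hg19 chromosome 3 length constant are assumptions introduced here, and the model-fitting step that would produce predicted expression is omitted.

```python
import random

# Introgressed segment coordinates (hg19), as given in the text.
SEGMENT_START, SEGMENT_END = 45_859_651, 45_909_024
# Assumed length of chromosome 3 in hg19 (not stated in the text).
CHR3_LENGTH_HG19 = 198_022_430


def random_segments(chrom_length, seg_length, n, rng):
    """Draw n (start, end) windows of fixed length with random start positions."""
    segments = []
    for _ in range(n):
        start = rng.randint(1, chrom_length - seg_length + 1)
        segments.append((start, start + seg_length - 1))
    return segments


def pearson_r2(predicted, observed):
    """Square of Pearson's correlation between predicted and observed expression."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        return 0.0  # a constant vector carries no predictive information
    return cov * cov / (var_p * var_o)


seg_length = SEGMENT_END - SEGMENT_START + 1
null_segments = random_segments(CHR3_LENGTH_HG19, seg_length, 100, random.Random(0))
```

Each null segment would then be used to train a model for every gene, and the resulting r² values compared with those of the introgressed segment by a paired test.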
Gene expression heritability due to full model and reduced model
We sought to characterize the regulatory impact of the segment on local gene expression relative to the full cis-region (i.e., within 1 Mb on either side of the gene). We compared the estimate of gene expression heritability (derived from cross-validation prediction performance) from the reduced model and the full model. Two clusters of genes could be identified with the reduced model: one with a substantial reduction in performance (to near zero) and a second that lies “along the diagonal” with little performance loss (Fig. 2b). The latter set includes genes with a substantial fraction of the expression variance explained by the segment while the former includes genes that may derive much of their expression variance from outside the segment (Supplementary Table 2). Notably, the genes with heritability “concentrated” in the segment tended to be physically closer to the segment (Fig. 2c).
We investigated the impact of the segment length on the ability to maintain good prediction accuracy with the reduced model compared to the full model. This analysis also allowed us to determine the extent to which the quality of the segment calling (i.e., the accuracy of the boundary of the segment) may influence the robustness of the conclusions that can be drawn. We tested segments that include variants within 100 and 500 kb of the actual introgressed segment. As expected, the prediction performance decreased as the segment narrowed from the full cis-region to the actual introgressed segment. (See results in lung and whole blood in Supplementary Fig. 2a and 2b, respectively. The distributions of the prediction performance (r^{2}) across all the available tissues are shown in Supplementary Fig. 2c.) The genetic variants in the segment account for only a small fraction (up to 0.035, Supplementary Fig. 2d) of the SNPs in the entire cis-region; nevertheless, performance degradation was observed for only a limited number of the genes in the region (consistent with Fig. 2b, c), indicating a disproportionately stronger regulatory role for the introgressed segment on local gene expression relative to the non-introgressed region.
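The nested windows used in this sensitivity analysis can be illustrated with a short sketch. The variant positions below are hypothetical and the helper name is an assumption; only the segment coordinates and the 100 kb and 500 kb flanks come from the text.

```python
# Introgressed segment (hg19), as given in the text.
SEGMENT = (45_859_651, 45_909_024)


def variants_in_window(positions, segment, flank_bp=0):
    """Keep variant positions inside the segment extended by flank_bp on each side."""
    start, end = segment
    return [p for p in positions if start - flank_bp <= p <= end + flank_bp]


# Hypothetical variant positions in a gene's cis-region (illustrative only).
cis_positions = [45_700_000, 45_800_000, 45_860_000, 45_900_000, 45_950_000, 46_300_000]

# Feature sets for the reduced model at each window size.
nested = {
    flank: variants_in_window(cis_positions, SEGMENT, flank)
    for flank in (0, 100_000, 500_000)
}
# Share of cis-region SNPs that fall inside the segment itself.
fraction_in_segment = len(nested[0]) / len(cis_positions)
```

Training one model per window and comparing the cross-validated r² values would reproduce the kind of comparison described above.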
Region-level association test using GWAS summary statistics
Leveraging the COVID-19 Host Genetics Initiative (COVID-19 HGI) round 6 GWAS meta-analyses^{6} for COVID-19 hospitalization (phenotype code: B2) and severity (phenotype code: A2), we performed summary-statistic-based JTI association analyses to identify hospitalization- and severity-associated genes in the 3p21.31 region. The sub-study information can be found in Supplementary Table 3. To maximize the power, we utilized the full model to perform the association analyses. Among the imputable genes (genes with good prediction quality; “Methods”) near the introgressed segment, we found 27 genes significantly (Benjamini–Hochberg P_{FDR} < 0.05) associated with the risk of hospitalization either in lung or in whole blood. SLC6A20, CXCR6, and CCR9 were top-ranked associations (Fig. 3a). The same genes were found to be associated with COVID-19 severity (Fig. 3b). The full set of JTI association results is summarized in Supplementary Tables 4 and 5.
Causal inference via summary-statistics-based Mendelian randomization
To further prioritize causal gene effects on COVID-19 severity, we applied our MR-JTI methodology^{9}. MR-JTI is a two-sample Mendelian randomization approach for causal inference. Here the “exposure” is gene expression, and the “outcome” is COVID-19 severity or COVID-19 hospitalization. Summary association results for the exposure and outcome were obtained from GTEx v8 and COVID-19 HGI, respectively. Given the strong possibility of the presence of invalid instrumental variables (IVs) in the region, MR-JTI models the heterogeneity of IVs and provides a more accurate estimate of causality (see Supplementary Fig. 3 and “Methods” for comparison with the conventional inverse-variance weighted [IVW] method). In this context, the heterogeneity of IVs may be due to horizontal pleiotropic effects and unobserved confounding factors. MR-JTI was performed on genes (in lung and whole blood) with significant signals (P_{FDR} < 0.05) from the JTI association analysis of the COVID-19 HGI GWAS summary statistics. Six genes (namely, SLC6A20, CCR9, CXCR6, CCR2, CCR5, and CCR5AS) were significant from the MR-JTI analysis after Bonferroni correction (Fig. 4a), indicating causal support for these genes on COVID-19 hospitalization. Similarly, MR-JTI showed causal support for FYCO1 in lung on COVID-19 severity (Fig. 4b). Mendelian randomization results from MR-Egger and the weighted-median estimator were also generated using the same source data as MR-JTI (see Fig. 4, Supplementary Tables 6 and 7).
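For readers unfamiliar with the Benjamini–Hochberg adjustment used for the P_{FDR} threshold, a minimal sketch of the standard step-up procedure is given below. The function name and the example P-values are illustrative assumptions, not values from this study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical gene-level p-values (illustrative only).
example_p = [0.01, 0.04, 0.03, 0.005]
example_fdr = benjamini_hochberg(example_p)
significant = [p_adj < 0.05 for p_adj in example_fdr]
```

A gene is declared significant when its adjusted value falls below 0.05, matching the threshold quoted above.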
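As a point of reference for the conventional IVW baseline mentioned above, a minimal fixed-effect sketch follows. It is not MR-JTI, which additionally models instrument heterogeneity; the function name and the toy summary statistics are assumptions made for illustration.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted causal estimate and its standard error.

    Each instrument contributes a ratio beta_outcome / beta_exposure, weighted by
    beta_exposure**2 / se_outcome**2.
    """
    numerator = sum(
        bx * by / se ** 2
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    denominator = sum(
        bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)
    )
    return numerator / denominator, math.sqrt(1.0 / denominator)


# Toy instruments: the outcome effect is exactly twice the expression effect.
bx = [0.1, 0.2, 0.3]
by = [0.2, 0.4, 0.6]
se = [0.05, 0.05, 0.05]
estimate, std_err = ivw_estimate(bx, by, se)
```

Because IVW assumes every instrument is valid, a single pleiotropic variant can bias the estimate, which is the motivation for heterogeneity-aware methods in a region such as this one.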
Quantifying proportion of expression-mediated causal effect
For the MR-JTI significant genes, we further asked to what extent the gene causal effect (on the trait) was driven by the introgressed segment, quantifying the proportion of expression-mediated causal effect explained, π_{c}. The statistic π_{c} is a ratio of estimated expression-mediated causal effects, which is calculated using a new summary-statistics-based approach (“Methods”).
We evaluated the methodological implications of our approach. Local heritability (\(\widehat{r_{\mathrm{local}}^2}\)) estimation is dependent on the LD matrix (“Methods”), which is typically estimated from the sample dataset (either in-sample or a reference panel) with finite sample size. Minimizing the distance (mean squared error) of the sample LD matrix (\(\widehat{C}\)) to the true LD matrix (C) is one way of optimizing the estimate of heritability. Use of a non-optimal LD matrix can substantially inflate the estimate of heritability. Towards this end, we obtained the unique optimal LD matrix estimator π(C) from projecting the true matrix to the “observable field” (Fig. 1 and “Methods”). Using simulations at various assumed levels of local heritability and informed by empirical genomic data (“Methods”), we confirmed that the local heritability estimated from the projected LD matrix π(C) is less biased than that estimated from the sample (e.g., external-panel-based) LD matrix \(\widehat{C}\) (Supplementary Fig. 4).
The inflation in the heritability estimate may also result from a genome-wide (global) approach such as LD Score regression^{15} (under a polygenic architecture). Comparison of the LD scores (calculated for a variant as the respective row sum of the LD matrix) between the original (unadjusted) LD matrix and the projected (optimal) LD matrix revealed overestimation of heritability (range: 0.4–17.1%, mean: 4.9%, Supplementary Fig. 5) with the use of the original LD matrix. Taken together, these results show that the projection matrix approach is broadly applicable, including for unbiased genome-wide heritability estimation.
We applied the optimized local heritability estimation to the seven potentially causal genes (in lung or whole blood; see Fig. 4) for either COVID-19 hospitalization or severity. On average, the segment explained 15.7% of the expression-mediated causal effect among the seven genes (Supplementary Tables 8 and 9). Notably, the concentration of expression-mediated heritability (“Methods”) was much higher (0.9–11.5 times) for the segment than the entire cis-region (Supplementary Tables 8 and 9).
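The row-sum definition of the LD score can be made concrete with a small sketch. The two matrices below are made-up numbers standing in for an unadjusted and an adjusted LD matrix; they are not the output of the projection described in “Methods”, the entries are assumed to be squared correlations, and the helper names are hypothetical.

```python
def ld_scores(r2_matrix):
    """LD score of each variant, computed as the row sum of the LD matrix."""
    return [sum(row) for row in r2_matrix]


def relative_difference(original_scores, adjusted_scores):
    """Per-variant relative excess of the original LD score over the adjusted one."""
    return [(o - a) / a for o, a in zip(original_scores, adjusted_scores)]


# Toy 2-variant LD matrices (illustrative values only).
original_ld = [[1.0, 0.5], [0.5, 1.0]]
adjusted_ld = [[1.0, 0.25], [0.25, 1.0]]

original_scores = ld_scores(original_ld)
adjusted_scores = ld_scores(adjusted_ld)
excess = relative_difference(original_scores, adjusted_scores)
```

How a per-variant difference in LD score translates into a difference in estimated heritability depends on the regression model and is not reproduced here.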
Regulatory divergence due to the segment
Paabo et al. showed that individuals with the introgressed segment are more likely to develop severe COVID-19^{8}. However, the mechanism and the effector genes are unknown. To identify potentially mediating genes, we generated GDE-scores in five modern human populations (1000 Genomes project phase 3) and an approximately 122,000-year-old Altai Neanderthal sample (“Methods”), using the JTI-trained models. We emphasize that the GDE-score for a gene is not a substitute for an extinct hominin’s level of gene expression (which cannot be directly accessed), but the score allows us to stratify the genetically determined effect of a DNA sequence according to the sequence’s evolutionary history, similar to local-ancestry-based stratification of gene expression^{12}. The GDE-score for the “archaic” genetic profile provides a way to evaluate the gene expression determined by the introgressed segment in modern human populations as a function of the distance to the archaic profile. For a given gene, its JTI model was trained on genetic variants that fall naturally into categories based on their evolutionary histories, but with the effects of archaic-ancestry-specific variants remaining unmodeled^{16} (Supplementary Fig. 6). We emphasize that differences in the GDE-score reflect differences in genetic regulatory effects rather than a difference in overall expression^{11}. The analysis of the difference in the GDE-scores between the archaic profile and modern human populations was performed for the Mendelian randomization-significant genes that had passed Bonferroni correction (from MR-JTI, MR-Egger, or the weighted median estimator). To generate the distribution of GDE-scores in modern humans for comparison with the archaic profile, we included only genes with at least two JTI model predictor SNPs available in the archaic genome.
Among the putative causal genes for either COVID-19 hospitalization or severity, the archaic sequence-based GDE-score for CCR5 in lung was extreme relative to modern human populations (Fig. 5a). The cross-population similarity of the GDE-score distributions for the gene in these tissues in modern humans makes the significant regulatory divergence for the archaic genomic sequence striking. Since lower expression of CCR5 increased the risk of severe COVID-19, as estimated from the Mendelian randomization analyses, the significant difference in GDE-score indicates that carriers of the introgressed segment would have increased predisposition to severe COVID-19. A similar pattern was observed for CXCR6 in lung, indicating that carriers of the introgressed segment have increased risk of severe COVID-19. However, in lung, CCR9 and CCR5AS showed similar GDE-score profiles across modern human populations and in a carrier of the archaic genomic sequence (Fig. 5c, d).
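The comparison of an archaic GDE-score with the modern distribution can be sketched as below, assuming the score is a weighted sum of allele dosages under a trained linear model. The SNP identifiers, weights, dosages, and helper names are hypothetical; restricting the modern scores to SNPs callable in the archaic genome is also an assumption made here so that the scores are comparable.

```python
def gde_score(weights, dosages, min_snps=2):
    """Weighted sum of allele dosages over model SNPs present in the genome.

    Returns None when fewer than min_snps predictor SNPs are available,
    mirroring the two-SNP inclusion rule described in the text.
    """
    shared = [snp for snp in weights if snp in dosages]
    if len(shared) < min_snps:
        return None
    return sum(weights[snp] * dosages[snp] for snp in shared)


def fraction_at_or_below(value, reference_scores):
    """Empirical position of a score within a reference distribution."""
    return sum(score <= value for score in reference_scores) / len(reference_scores)


# Hypothetical model weights and dosages (illustrative only).
weights = {"snp_a": 0.5, "snp_b": -0.25, "snp_c": 0.125}
archaic = {"snp_a": 0, "snp_b": 2}  # snp_c is not available in the archaic genome
modern = [{"snp_a": 2, "snp_b": 0}, {"snp_a": 1, "snp_b": 1}, {"snp_a": 1, "snp_b": 0}]

archaic_score = gde_score(weights, archaic)
# Score modern genomes on the same SNPs that are available in the archaic genome.
shared_weights = {snp: weights[snp] for snp in archaic}
modern_scores = [gde_score(shared_weights, d) for d in modern]
archaic_position = fraction_at_or_below(archaic_score, modern_scores)
```

An archaic score near either tail of the modern distribution corresponds to the kind of regulatory divergence described for CCR5 and CXCR6 in lung.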
Phenomic scan to identify complication etiologies and comorbidities
To evaluate the broad phenotypic consequences of the introgressed segment, we performed region-level analyses of the list of genes that are well imputed by the segment (Supplementary Table 1).
Blood cell traits are used to diagnose or monitor an infection. Considering the enrichment of immune response and chemokine-related genes in this region, we computed the gene-level JTI associations of the genes in the locus with 27 blood cell traits (Supplementary Table 10), using the GWAS summary statistics from the UK Biobank samples (see “Methods”). The severity-related genes showed significant associations with multiple blood cell traits (Fig. 6a). Notably, both CXCR6 (P = 6.5e−41, lung) and SLC6A20 (P = 1.4e−13, spleen; P = 7.1e−11, lung) were found to be significantly associated with monocyte percentage. Strong associations between the genes of the CCR family (CCR1, CCR2, CCR3, CCR5, and CCR9) within this locus and monocyte percentage, monocyte count, and basophil percentage were detected in multiple tissues, including whole blood and lung (Fig. 6a and Supplementary Table 11). Moreover, CCR1, CCR3, and CCR5 were found to be associated with platelet distribution width in multiple tissues, including fibroblasts, subcutaneous adipose, tibial artery, and esophagus mucosa, with P-values ranging from 4.8e−06 to 1.0e−02 (all passing the FDR correction, Supplementary Table 11). Taken together, the substantial associations between the genetically determined expression and inflammation, immune response, and coagulation-related blood cell biomarkers lend further support to the role of this locus in predisposition to COVID-19 severity.
We then asked to what extent genetically determined gene expression in the locus predisposes individuals to develop certain complications and adverse outcomes. Leveraging the medical phenome in the UKB, we performed a region-level phenome-scale scan across neurological, respiratory, circulatory, and endocrine/metabolic disorders (253 binary traits in total, Supplementary Table 12), limiting the analysis to the genes imputable by the segment. However, given the limited effective sample size (range: 204–251,681, Supplementary Table 12) and the large number of association tests, we emphasize that these promising results on potential complications (Fig. 6b and Supplementary Table 13) will require systematic replication in much larger datasets. Transient cerebral ischemia, myocardial infarction, and essential hypertension were found to be nominally associated with the genes in this region. Decreased genetically determined XCR1 in esophagus mucosa was found to be nominally associated with increased risk for disturbances of sensation of smell and taste (P = 5.1e−05), although the significance did not survive multiple testing correction. Notably, decreased genetically determined XCR1 in esophagus mucosa was also associated with a higher risk for severe COVID-19, indicating a potential pleiotropic effect of the gene. Taken together, these associations, which are examples among others with the same level of significance, suggest that dysregulation of genes in this locus may result in adverse outcomes and potential complications of severe COVID-19 (Supplementary Table 13).
Discussion
Here we develop an integrative framework for the locus-specific analysis of genome function, evolutionary history, and phenome-scale impact. We build on our JTI (with its improved performance over conventional transcriptome-wide association studies) and causal inference (to account for the presence of horizontal pleiotropy or unmeasured confounding effect) methodology^{9}. The framework inherently comes with a segment-based gene expression heritability estimation approach where a segment may be a regulatory element, a region under positive selection, or a trait-associated locus. Furthermore, the framework develops a new summary-statistics-based approach to estimate a metric, namely, the proportion of expression-mediated causal effect explained, that can be used to quantify causal mechanisms in a genomic region for a general complex disease or trait. Focusing on the introgressed segment as an application, we estimated the segment-based heritability of gene expression in the larger locus, performing a comparison of the full model and the reduced model. We prioritized genes associated with COVID-19 severity using the region-wide association test followed by several Mendelian randomization approaches (including MR-JTI). Potential complications, which implicate key biological processes underlying the infection phenotype, were identified by a phenome-wide scan for the genes regulated by the introgressed segment.
The genetic architecture of gene expression is characterized as sparse, with a small number of variants with disproportionately large effect (relative to that expected from a polygenic model). We used the prediction performance (\(\widehat{r_{g,s}^2}\)) (for the gene g, of the test segment s, Eq. 7), which is derived from a cross-validated (additive and sparse) model of gene expression, as an estimate of the segment-specific heritability. In our application to the introgressed segment within the 3p21.31 locus associated with severe COVID-19, although the segment spans only 49.4 kb, the genetic variants in the segment were found to explain a substantial proportion of gene expression for the neighboring genes, indicating a strong regulatory role for the segment.
An extension of PrediXcan, JTI borrows information across tissues and substantially improves gene expression prediction performance^{9}. The increased power of JTI may enhance drug target discovery and improve drug repurposing efforts. By estimating the heterogeneity due to horizontal pleiotropy and unobserved confounding, MR-JTI further prioritized several genes near the introgressed segment in the associated locus as potentially causal. Importantly, we provide strong support for the regulatory role of the introgressed segment for the putatively causal genes.
We previously trained prediction models using only the (GTEx) individuals with no Neanderthal ancestry in a gene’s regulatory region and applied the models to (GTEx) individuals with Neanderthal ancestry^{11}. Only a small reduction in prediction accuracy for the individuals with Neanderthal ancestry was observed relative to the models built without filtering by archaic ancestry^{11}. Comparing the GDE-score of an archaic profile with the distribution in modern human populations, we found supportive evidence that the Neanderthal alleles conferred a greater predisposition to severe COVID-19. For carriers of the archaic segment, the higher risk of severe COVID-19 was driven mainly by the genetic regulation of the expression of CCR5 and CXCR6 in lung.
The region-level analysis prioritized SLC6A20, CXCR6, and the CCR family (CCR5 and CCR9). Functional interaction between SIT1 (the protein encoded by SLC6A20) and ACE2 has been reported by Vuille-dit-Bille and colleagues^{17}. Exploited by SARS-CoV-2 (and a SARS-CoV-2-like virus), ACE2 is a co-receptor important for viral intracellular entry into the lung and brain^{18,19,20}. The chemokine receptor coding gene, CXCR6, plays a key role in NK cell-mediated memory of haptens and viruses^{21}. CCR5 encodes a protein that belongs to the beta chemokine receptor family of integral membrane proteins^{22}. A recent study showed that an anti-CCR5 humanized monoclonal antibody restored CD8 counts in COVID patients, indicating CCR5 as a therapeutic target for COVID-19^{23}. The chemokine receptor CCR9 plays an important role in regulating the development and migration of T lymphocytes^{24}. By utilizing CRISPR/Cas9-mediated genomic deletion, Yao et al. identified CCR9 as a potential target gene of the 3p21.31 locus for COVID-19 severity^{25}.
The region-level analysis of blood cell traits further supports the connection between these genes and inflammatory traits. In addition, biomarkers for coagulation-related traits were found to be associated with the genetically determined expression of several genes in the CCR family, which show substantial genetic control by the segment. Notably, the relevance of fibroblasts^{26} and subcutaneous adipose tissue^{27}, where the association signals were observed, for coagulation-related traits finds support in previous studies. Leveraging disease phenotypes in the UK Biobank, we identified potential comorbidities and complications for the region. Notably, decreased genetically determined XCR1 in esophagus mucosa was found to be associated with increased risk for both severe COVID-19 and “disturbances of sensation of smell and taste”, which had been reported as comorbidities in 41.0% and 38.2% of cases, respectively, in a previous study^{28}. The protein encoded by XCR1 is a chemokine receptor for XCL1 and XCL2 (lymphotactin-1 and -2). XCR1 has been studied mostly in dendritic cell-based cancer immunotherapy^{29}, while its role in olfactory and gustatory dysfunction is unknown. Clearly, a larger sample size and more comprehensive replication (in additional external datasets) will be required for more definitive conclusions due to the multiple comparison burden. Nevertheless, these gene-level associations can be the basis for interrogating the downstream consequences of severe COVID-19 on the broader human disease phenome and, potentially, for designing effective therapeutic strategies.
Here, we treated the gene as the basic unit for causal inference (treating its expression as the “exposure” within a Mendelian randomization framework), which is to be contrasted with fine-mapping of causal variants. To date, only limited fine-mapping of causal variants has been performed for COVID-19 severity^{3,30,31}. Compared with variant-level fine-mapping, the gene-level causal inference has some desirable features, including (1) the relevance of the gene (and ease of use) as a target for drug development and repurposing; (2) increased statistical power for causal inference from leveraging multiple instrumental variables; and (3) greater portability across ethnic groups^{32}. Our approach also differs from colocalization, which tests for shared causal variants for expression and the phenotype. In the Mendelian randomization framework (MR-JTI), for a gene to be causal for a phenotype, having shared causal variant effects is not enough. Clearly, the gene-level analysis does not capture coding mechanisms and other non-expression-mediated causal effects. However, we provide a framework for estimating the expression-mediated causal effect using summary statistics for downstream functional studies.
This study has several caveats and limitations. Firstly, without modeling low-frequency genetic variants, the regulatory effect of the introgressed segment may be underestimated. Low-MAF variants are not very informative given the current sample sizes of available reference datasets. Secondly, although the latest GTEx dataset is a broad collection of tissues and cell types, the causal cell type(s) may be missing, or only partially represented, in the available tissues and cell types. Thus, the “tissues” in this study denote a proxy for the causal tissue(s) or cell type(s). Finally, we are unable to model archaic-ancestry-specific regulatory effects, i.e., both the non-introgressed, archaic-ancestry-derived alleles and the ancestral alleles now fixed on the modern human lineage. However, our interest here is not in predicting the transcriptome of an archaic genome (which is not available), but the effect of an introgressed segment in modern human populations.
In summary, we developed an integrative, genetics-anchored framework for a deep region-level analysis of a complex trait, which performs causal inference on an intermediate molecular trait, incorporates the evolutionary history of modeled DNA variation, and evaluates the phenome-scale impact of the implicated locus. Applying the framework to the COVID-19 severity-associated locus with an archaic introgressed segment, we provided causal support for multiple genes and identified several genetically supported adverse outcomes.
Methods
Estimating the segment-based heritability of gene expression
We estimated the heritability of gene expression due to a genomic segment, using a sparsity-regularization and cross-validation-based methodology. This approach, as we show below, is more robust to model misspecification than the widely used mixed model^{33} and is suitable for gene expression.
Gene expression model building
Suppose g_{1}, g_{2}, …, g_{n} are n tissue-gene pairs of expression measurements for a given gene. We aim to find a near-optimal set of variants in the segment with effect size vector \(\hat{\beta}\), the ‘JTI model’^{9}, assuming additivity of effect:
The n × p matrix \([x_1,\,x_2,\,\ldots,\,x_n]^T\) is the feature matrix (of genetic variants). The w_{i} is the weight, generated from hyperparameter tuning, on the i-th observation from the tissue similarity matrix. The JTI model thus leverages the similarity in transcriptional regulation profile. JTI can be extended to leverage a d-dimensional similarity vector (d ≥ 1) by incorporating several layers of epigenomic datasets, as we previously described^{9}. The L_{1} penalty in the objective function enforces sparsity (consistent with the genetic architecture of gene expression) while the L_{2} penalty promotes a grouping effect. Here α encodes the relative weight of the two penalties; we assumed α = 0.50. Given a test tissue, when tissue-sample pairs from a different tissue are assigned weight 0 while those in the test tissue are assigned weight 1 in the loss function in Eq. 1, the resulting special instance of the optimization problem generates the single-tissue ‘PrediXcan model’.
Cross-validation
The vector g of gene expression (say of dimension n) in each tissue can be decomposed as:
where g_{*}, s_{*}, and ε_{*} are the gene expression level, the genetic component, and the residual, respectively, in the training or test set (denoted here as *). For simplicity of presentation and without loss of generality, we left out the fixed effects (covariates). Assuming \(\varepsilon \sim {{{\mathcal{N}}}}(0,{{\Gamma }})\) has a Gaussian distribution, the variancecovariance matrix var(g) can be written as^{34}:
where cov(β) is the symmetric covariance matrix of the effect size vector and X is the n × p genotype (feature) matrix. By independence of the training and test sets, \({{\Gamma }} = \left[ {\begin{array}{*{20}{l}} {{{\Gamma }}_{{{{\mathrm{train}}}},\,{{{\mathrm{train}}}}}} \hfill & 0 \hfill \\ 0 \hfill & {{{\Gamma }}_{{{{\mathrm{test}}}},{{{\mathrm{test}}}}}} \hfill \end{array}} \right]\), where each submatrix \({{\Gamma }}_{ \ast , \ast }\) is symmetric.
Sampling dependence
Here we seek a theoretical formulation of the sampling dependence of the cross-validation framework. In K-fold cross-validation, the dataset is partitioned into K non-overlapping subsets (say, of the same size n/K). Let Test_{k} and Train_{k} (that is, the dataset with the elements of Test_{k} removed) be the kth test set and training set, respectively. For each \(i \in Test_k\), we consider the “error” or residual \(\varepsilon_i\), defined as the difference between the gene expression level and the estimated genetic component for i, obtained from the model trained on Train_{k}. The average residual \(\varepsilon = \frac{1}{n}\sum_{i = 1}^n \varepsilon_i\) has variance given by the following expression:
where \(\sigma^2\) is the average variance of the residuals for test samples (where the average is calculated over the training sets on which the residuals depend), \(\delta_{\mathrm{within}}^2\) is the within-fold covariance for these test samples (which may be nonzero because of the shared training set), and \(\delta_{\mathrm{between}}^2\) is the between-fold covariance (which may be nonzero because each \(Test_k\) is a subset of \(Train_l\) when \(l \ne k\)). We note that
Unbiased estimator of heritability of gene expression
The expression for the variance-covariance matrix (Eq. 3) recalls the usual decomposition of variance in the standard mixed model for heritability estimation^{33}. One key difference is that the mixed model fits the genetic effects \(u \in \mathbb{R}^p\) as random effects:
Here \(\vec 1 \in \mathbb{R}^{n}\) is a vector of ones. The variance components \(\sigma_u^2\) and \(\sigma^2\) are estimated using an algorithm such as restricted maximum likelihood, and the heritability estimate \(\widehat{h_{MM}^2}\) is then given by the ratio \(\frac{p\widehat{\sigma_u^2}}{p\widehat{\sigma_u^2} + \widehat{\sigma^2}}\). The so-called Best Linear Unbiased Prediction (BLUP) derived from the mixed model is related to ridge regression^{35,36}, a common regularization approach: maximizing the posterior \(P(u \mid g)\) under a Gaussian prior is equivalent to minimizing the ridge objective function with ridge hyperparameter \(\lambda = \sigma^2/\sigma_u^2\). Thus, mixed model parameter estimation (and hence heritability estimation under the mixed model approach) can be viewed as a type of regularization. In contrast to regular ridge hyperparameter estimation, which requires a training and a validation dataset, mixed model parameter estimation is done in a single dataset.
For gene expression, we take a different approach, which relies first on regularization (Eq. 1) and then on cross-validation, both of which should reduce overfitting. Let \(g_{\mathrm{test},0}\) be the gene expression level for one random observation from the test set. The performance of the model is given by:
Here, the estimated genetic component \(\widehat{s_{\mathrm{test},0}}\) comes from applying the solution to the optimization problem given by Eq. 1 to the test subject. This coefficient of determination is an unbiased estimate of the proportion of explained variation. The regularization and cross-validation approach is the core of the JTI prediction methodology, from which, therefore, an estimator of heritability of gene expression can be defined.
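The correspondence between BLUP and ridge regression can be checked numerically on a toy example. The sketch below is illustrative only: it assumes a single variant (p = 1), two observations, no intercept, and hypothetical variance components. The function names are hypothetical and are not part of any published software.

```python
def ridge_effect(x, g, lam):
    """Ridge estimate for a single predictor: (x'x + lambda)^(-1) x'g."""
    xtx = sum(v * v for v in x)
    xtg = sum(v * w for v, w in zip(x, g))
    return xtg / (xtx + lam)


def blup_effect(x, g, var_u, var_e):
    """BLUP of u for n = 2: var_u * x' V^(-1) g, with V = var_u * x x' + var_e * I."""
    v11 = var_u * x[0] * x[0] + var_e
    v12 = var_u * x[0] * x[1]
    v22 = var_u * x[1] * x[1] + var_e
    det = v11 * v22 - v12 * v12
    # Solve V z = g using the closed-form inverse of a 2 x 2 matrix.
    z0 = (v22 * g[0] - v12 * g[1]) / det
    z1 = (-v12 * g[0] + v11 * g[1]) / det
    return var_u * (x[0] * z0 + x[1] * z1)


def mixed_model_h2(var_u, var_e, p):
    """Heritability implied by the variance components: p*var_u / (p*var_u + var_e)."""
    return p * var_u / (p * var_u + var_e)


# Hypothetical toy data and variance components (not from the study).
x = [1.0, -0.5]
g = [0.8, -0.1]
var_u, var_e = 0.5, 1.0

u_ridge = ridge_effect(x, g, lam=var_e / var_u)  # lambda = sigma^2 / sigma_u^2
u_blup = blup_effect(x, g, var_u, var_e)
h2_mm = mixed_model_h2(var_u, var_e, p=1)
print(u_ridge, u_blup, h2_mm)
```

In this toy setting the two effect estimates agree up to floating-point error, which is the sense in which mixed model estimation acts as a form of ridge regularization.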
We also estimated the concentration of heritability, using the statistic:
where p_{*} is the number of variants in the model (reduced or full; see section “Training the full model and reduced model of gene expression” below). The statistic measures the per-SNP heritability from the reduced model as a fraction of the per-SNP heritability from the full model.
Estimation of region-level (local) trait heritability using summary statistics
Using the theory of quadratic forms, we previously derived a summary-statistics-based estimator of region-level trait heritability (while accounting for linkage disequilibrium [LD]; see Equations A6 and A7 in the appendix of Gamazon et al.^{37}). The estimator and its variance are given by:
This estimator is defined for a locus or region L, is approximately unbiased when the in-sample LD is close to the true LD, and can be extended, via independent LD blocks, to estimate the genome-wide SNP heritability. Here p is the number of SNPs, \(\hat \beta\) is the p × 1 vector of estimated effect sizes (on the GWAS trait or on gene expression, depending on context), and C is the p × p SNP correlation matrix. The condition \(n \ge p\) is necessary for C to be invertible, that is, to have full rank. This approach was extended by Shi et al.^{38} in Heritability Estimator from Summary Statistics (HESS), and then by Hou et al.^{39} to biobank-scale data, with a model of genotypes in a locus as random variables and a technique to account for rank deficiency in the LD matrix (e.g., as may arise from SNPs in perfect LD). HESS replaces, in Eq. 9, \(\mathbf{C}^{-1}\) by the Moore-Penrose pseudoinverse and replaces p by \(q = \mathrm{rank}(\mathbf{C})\), that is, the maximal number of linearly independent columns or the “effective number” of SNPs. Shi et al. “regularized” the external reference LD matrix to account for noise in the matrix, using principal components. Here, we extend our earlier work and that of Shi et al. with a theoretical and empirical investigation into a major source of bias for the estimate of heritability.
First, for illustration, we consider two SNPs that are in LD with correlation \(r = \rho\), so that the assumed LD matrix is \(\left[ \begin{array}{cc} 1 & \rho \\ \rho & 1 \end{array} \right]\). The inverse of the matrix is, therefore, \(\frac{1}{1 - \rho^2}\left[ \begin{array}{cc} 1 & -\rho \\ -\rho & 1 \end{array} \right]\). Let \(\hat \beta^T = [\widehat{\beta_1}\ \widehat{\beta_2}]\) be the vector of estimated variant effect sizes (from GWAS). Then the estimate of heritability (Eq. 9) can be written as:
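As a minimal illustration of the concentration statistic, the sketch below computes the ratio of the per-SNP heritability in the reduced model to that in the full model, following the verbal definition above. The function name and the input values are hypothetical and are not taken from the analysis.

```python
def heritability_concentration(h2_reduced, p_reduced, h2_full, p_full):
    """Per-SNP heritability of the reduced model as a fraction of that of the full model."""
    if p_reduced <= 0 or p_full <= 0:
        raise ValueError("variant counts must be positive")
    if h2_full <= 0:
        raise ValueError("full-model heritability must be positive")
    per_snp_reduced = h2_reduced / p_reduced
    per_snp_full = h2_full / p_full
    return per_snp_reduced / per_snp_full


# Hypothetical values: a 10-variant segment model versus a 40-variant cis model.
psi = heritability_concentration(h2_reduced=0.05, p_reduced=10, h2_full=0.10, p_full=40)
print(psi)
```

A value above 1 indicates that the segment carries more heritability per variant than the cis-region as a whole.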
Here, we note that:
which shows the change in the estimate caused by a perturbation in LD. A special instance is that of independent SNPs (\(\rho = 0\)), so that the LD matrix is the identity matrix. In this case, as \(n \to \infty\), the heritability estimate approaches \(\widehat{\beta_1}^2 + \widehat{\beta_2}^2\). Another special case is that of SNPs in perfect LD (\(\rho = 1\)), so that the LD matrix is non-invertible (that is, has determinant \(1 - \rho^2 = 0\)). In this case, the Moore-Penrose pseudoinverse is \(\left[ \begin{array}{cc} \frac{1}{4} & \frac{1}{4} \\ \frac{1}{4} & \frac{1}{4} \end{array} \right]\) and the estimate of heritability reduces to:
As \(n \to \infty\), this estimate approaches \(\frac{1}{4}\widehat{\beta_1}^2 + \frac{1}{4}\widehat{\beta_2}^2 + \frac{1}{2}\widehat{\beta_1}\widehat{\beta_2}\), which is the square of the weighted sum of the variant effect sizes (each of weight 1/2). Since the SNPs are in perfect LD, the estimated effect sizes should be equal: \(\widehat{\beta_1} = \widehat{\beta_2} = \hat \beta\), and any difference in the estimates may be due to genotyping error.
Now, let us consider the general case of p variants in the region. The use of an external LD panel (which is typically smaller in sample size than a GWAS) usually leads to a lower rank of the LD matrix and thus produces an underestimation of the variance (Eq. 10 with lower p). However, a larger GWAS sample size leads to improved (i.e., lower) standard error (Eq. 10 with higher n). The ground-truth heritability \(r_L^2 = \beta^T\mathbf{C}\beta\) (where \(\mathbf{C} = [C_{ij}]\) is the LD matrix) is a quadratic form with (scalar-by-matrix) derivative with respect to C given by the following p × p matrix (assuming a genetic architecture where the effect size β is not a function of the LD matrix C):
\[ \nabla_{\mathbf{C}}\, r_L^2 = \beta\beta^T \qquad (14) \]
We emphasize that the genetic architecture in which β is independent of C is assumed and necessary in Eq. 14. Thus, the change in the heritability due to a perturbation in LD is a function (a monomial of degree 2 for each entry in the p × p matrix) of the effect sizes in the region. A similar conclusion holds for the relationship between the estimator and the estimated effect sizes, assuming the LD estimate \(\hat{\mathbf{C}}\) from an external reference panel. The (i, j)th term of the derivative matrix \(\nabla_{\hat{\mathbf{C}}}\, r_L^2(\hat{\mathbf{C}})\) with respect to \(\hat{\mathbf{C}}\) equals \(\widehat{\beta_i}\widehat{\beta_j}\left( \frac{n}{n - \mathrm{rank}(\hat{\mathbf{C}})} \right)\), which quantifies the change in heritability relative to a change in (external-panel-based) LD between the ith and jth variants. Thus, the change in the estimate of heritability (viewed as a function of the external panel LD estimate \(\hat{\mathbf{C}}\), which in turn can be viewed as a perturbation of the in-sample LD \(C_{ij}\)) relative to the change in the in-sample LD \(C_{ij}\) is:
where tr is the trace operator. This observation argues for the importance of making available not just the GWAS summary statistics, i.e., the \(\widehat{\beta_i}\), but also the in-sample LD data, i.e., \(C_{ij}\). We calculated the empirical distribution of \(\widehat{\beta_i}\widehat{\beta_j}\) and performed simulations on the impact of the external panel (i.e., using the statistic \(\left( \frac{n}{n - \mathrm{rank}(\hat{\mathbf{C}})} \right)\mathrm{tr}\left( \frac{\partial \hat{\mathbf{C}}}{\partial C_{ij}} \right)\)) on the heritability estimate. For an LD-matched reference panel, the product monomials \(\widehat{\beta_i}\widehat{\beta_j}\) have a major influence on the behavior of the estimate.
Note that in Eq. 9, the inverse of the true (unobserved) LD matrix C or the inverse of the external panel LD estimate \(\hat{\mathbf{C}}\) is required. Thus, assuming the inverses exist, we obtain an expression for the difference between \(\hat{\mathbf{C}}^{-1}\) and \(\mathbf{C}^{-1}\) in terms of the difference (noise) matrix \(\Delta = \hat{\mathbf{C}} - \mathbf{C}\):
\[ \hat{\mathbf{C}}^{-1} - \mathbf{C}^{-1} = -\left( \mathbf{I} + \mathbf{C}^{-1}\Delta \right)^{-1}\mathbf{C}^{-1}\Delta\,\mathbf{C}^{-1} \qquad (16) \]
Therefore, the term on the right determines the difference in the estimate of heritability between the use of the external LD panel and the use of the true LD information. We note that this term is a general expression that includes special cases, such as that treated in Shi et al., in which the noise Δ is addressed through the truncated singular value decomposition (SVD) to obtain an improved estimator \(\widehat{\mathbf{C}_{\mathbf{SVD}}}\). In particular, the difference \(\Delta_{\mathbf{SVD}} = \widehat{\mathbf{C}_{\mathbf{SVD}}} - \mathbf{C}\) may still bias the estimate of heritability, with the residual bias given by \(\left( \frac{n}{n - \mathrm{rank}(\widehat{\mathbf{C}_{\mathbf{SVD}}})} \right)\hat \beta^T\left( -\left( \mathbf{I} + \mathbf{C}^{-1}\Delta_{\mathbf{SVD}} \right)^{-1}\mathbf{C}^{-1}\Delta_{\mathbf{SVD}}\mathbf{C}^{-1} \right)\hat \beta\).
Here we describe how to obtain the projected matrix \(\pi(\mathbf{C})\), which has the property that the difference matrix \(\Delta_{\mathbf{Projected}} = \pi(\mathbf{C}) - \mathbf{C}\) is “minimal” in the sense of minimizing the expected quadratic loss:
where \(\mathbf{C}_\ast\) is a linear combination of the identity matrix \(\mathbf{I}\) and \(\hat{\mathbf{C}}\), the observed (in-sample or reference) LD matrix (Fig. 1). Define \(\pi(\mathbf{C})\) as the Ledoit-Wolf estimator, expressed as a linear combination of \(\hat{\mathbf{C}}\) and \(\mathbf{I}\) as follows:
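The two limiting cases can be reproduced numerically. The sketch below is a minimal illustration under a stated assumption: the estimator is taken to have the form (n·β̂ᵀC⁻¹β̂ − p)/(n − p), with the Moore-Penrose pseudoinverse and q = rank(C) substituted in the rank-deficient case. This form is consistent with the limits described above but is an assumption of the sketch, and the function names and effect sizes are hypothetical.

```python
def quad_form_two_snps(b1, b2, rho):
    """Return (beta' C^+ beta, rank(C)) for the 2 x 2 LD matrix [[1, rho], [rho, 1]]."""
    if abs(rho) > 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    if abs(rho) == 1.0:
        # Perfect LD: C is singular, and its Moore-Penrose pseudoinverse
        # is (1/4) * [[1, rho], [rho, 1]].
        return 0.25 * (b1 * b1 + b2 * b2 + 2.0 * rho * b1 * b2), 1
    det = 1.0 - rho * rho
    return (b1 * b1 - 2.0 * rho * b1 * b2 + b2 * b2) / det, 2


def local_h2_two_snps(b1, b2, rho, n):
    """Assumed estimator form: (n * quadratic form - rank) / (n - rank)."""
    q_form, rank = quad_form_two_snps(b1, b2, rho)
    if n <= rank:
        raise ValueError("sample size must exceed the rank of the LD matrix")
    return (n * q_form - rank) / (n - rank)


# Hypothetical effect sizes; a very large n approximates the limit n -> infinity.
b1 = b2 = 0.1
n_large = 10 ** 9
h2_independent = local_h2_two_snps(b1, b2, 0.0, n_large)  # approaches b1^2 + b2^2
h2_perfect_ld = local_h2_two_snps(b1, b2, 1.0, n_large)   # approaches ((b1 + b2) / 2)^2
print(h2_independent, h2_perfect_ld)
```

With equal effect sizes, treating two perfectly correlated SNPs as independent would double the estimate, which illustrates why the rank correction matters.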
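The expression for the difference between the two inverses can be verified numerically. The sketch below is illustrative only: it uses a hypothetical 2 × 2 “true” LD matrix, a hypothetical reference-panel estimate, and hypothetical effect sizes, and it checks that the difference of the inverses equals −(I + C⁻¹Δ)⁻¹C⁻¹ΔC⁻¹, the matrix that appears in the residual bias term above.

```python
def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul2(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def add2(x, y):
    return [[x[i][j] + y[i][j] for j in range(2)] for i in range(2)]


def sub2(x, y):
    return [[x[i][j] - y[i][j] for j in range(2)] for i in range(2)]


def quad_form(beta, m):
    return sum(beta[i] * m[i][j] * beta[j] for i in range(2) for j in range(2))


# Hypothetical inputs (not from the study).
C = [[1.0, 0.6], [0.6, 1.0]]      # "true" LD
C_hat = [[1.0, 0.5], [0.5, 1.0]]  # external-panel LD estimate
beta_hat = [0.10, 0.05]
identity = [[1.0, 0.0], [0.0, 1.0]]

delta = sub2(C_hat, C)
C_inv = inv2(C)

lhs = sub2(inv2(C_hat), C_inv)
inner = inv2(add2(identity, matmul2(C_inv, delta)))
product = matmul2(matmul2(matmul2(inner, C_inv), delta), C_inv)
rhs = [[-product[i][j] for j in range(2)] for i in range(2)]

# Resulting shift in the quadratic form beta' C^(-1) beta caused by the LD noise.
shift = quad_form(beta_hat, lhs)
print(lhs, rhs, shift)
```

The shift in the quadratic form is what propagates, after scaling by n/(n − rank), into the heritability estimate.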
where
Here, \(\langle \cdot , \cdot \rangle\) and \(\| \cdot \|\) refer to the Frobenius inner product and norm, respectively, and \(X_k\) is the \(p \times 1\) genotype vector for the kth subject. Equations 17 and 18 have a Bayesian-geometric interpretation. \(\pi(\mathbf{C})\) reflects the combination of prior information and sample information. The prior information states that the unobserved (true) covariance C is on the sphere with center at \(m\mathbf{I}\) and radius a. The sample information states that C is on a second sphere with center at \(\hat{\mathbf{C}}\) and radius b. The combination of the two indicates that C is in the intersection of the two spheres, i.e., a circle with center at \(\pi(\mathbf{C})\).
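For concreteness, the following sketch computes a Ledoit-Wolf-type shrinkage of a sample LD matrix toward a scaled identity. It is a hedged illustration rather than the SEGMENTSCAN implementation: it assumes the standard Ledoit-Wolf quantities (m, d², b², a² = d² − b²) with the Frobenius norm scaled by 1/p, and it uses a small simulated genotype matrix with hypothetical allele frequencies. All function names are hypothetical.

```python
import math
import random


def standardize_columns(rows):
    """Center each column and scale it to unit (population) variance."""
    n, p = len(rows), len(rows[0])
    out = [[0.0] * p for _ in range(n)]
    for j in range(p):
        col = [r[j] for r in rows]
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        if sd == 0.0:
            raise ValueError("monomorphic variant")
        for k in range(n):
            out[k][j] = (col[k] - mean) / sd
    return out


def scaled_frobenius_sq(a):
    """Squared Frobenius norm divided by the dimension p."""
    return sum(v * v for row in a for v in row) / len(a)


def ledoit_wolf_ld(x):
    """Return (shrunk LD matrix, weight placed on the target m * I)."""
    n, p = len(x), len(x[0])
    s = [[sum(r[i] * r[j] for r in x) / n for j in range(p)] for i in range(p)]
    m = sum(s[i][i] for i in range(p)) / p
    d2 = scaled_frobenius_sq(
        [[s[i][j] - (m if i == j else 0.0) for j in range(p)] for i in range(p)]
    )
    b_bar2 = sum(
        scaled_frobenius_sq([[r[i] * r[j] - s[i][j] for j in range(p)] for i in range(p)])
        for r in x
    ) / n ** 2
    b2 = min(b_bar2, d2)
    w = b2 / d2 if d2 > 0.0 else 0.0  # b^2 / d^2; the weight on s is a^2 / d^2 = 1 - w
    shrunk = [
        [(1.0 - w) * s[i][j] + (w * m if i == j else 0.0) for j in range(p)]
        for i in range(p)
    ]
    return shrunk, w


rng = random.Random(0)


def draw_haplotype():
    """Three biallelic sites; the second copies the first with probability 0.8."""
    a1 = 1 if rng.random() < 0.3 else 0
    a2 = a1 if rng.random() < 0.8 else (1 if rng.random() < 0.3 else 0)
    a3 = 1 if rng.random() < 0.4 else 0
    return (a1, a2, a3)


genotypes = []
for _ in range(200):
    hap_a, hap_b = draw_haplotype(), draw_haplotype()
    genotypes.append([hap_a[j] + hap_b[j] for j in range(3)])

ld_projected, weight = ledoit_wolf_ld(standardize_columns(genotypes))
print(weight, ld_projected)
```

Because the genotypes are standardized, the diagonal of the shrunk matrix remains 1 and only the off-diagonal LD entries are pulled toward zero, by the factor 1 − w.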
Comparison of local heritability estimated from the observed LD matrix and from the projected LD matrix
We performed simulations (n = 500) to investigate the impact of using an external reference panel on the estimate of local heritability. We leveraged the 1000 Genomes EUR dataset for realistic simulations. For each simulation, we generated individual-level genotype data^{40} for 50,000 subjects on 50 kb segments, with LD structure informed by empirically derived segments, which were randomly drawn from chromosome 22. We assumed various levels of local heritability (\(h_{\mathrm{local}}^2 =\) 0.01, 0.02, and 0.03). For each value of heritability, we generated the phenotype: \(Y = \beta G + \varepsilon\). Here, G denotes the genotype in dosage (scaled to standard normal distribution) of a randomly sampled causal variant; \(\beta = \sqrt{\frac{h_{\mathrm{local}}^2 \times \mathrm{var}(Y)}{\mathrm{var}(G)}}\) is the effect size of the causal variant; \(\varepsilon\) denotes the residual term randomly drawn from a normal distribution \(\varepsilon \sim \mathcal{N}(0,\sigma^2)\), where \(\sigma^2 = \mathrm{var}(Y) - \mathrm{var}(\beta G)\) and \(Y \sim \mathcal{N}(0,1)\). The marginal effect size for each of the variants on the segment was estimated. We randomly sampled 500 subjects to be used as an “external reference panel” and, in addition, calculated the observed LD matrix \(\hat{\mathbf{C}}\) and the projected LD matrix \(\pi(\mathbf{C})\). The local heritability was then estimated (Eq. 9) using each LD matrix for comparison.
Summary-statistics-based estimation of the proportion of expression-mediated causal effect explained
To estimate the extent to which the gene causal effect is driven by the segment of interest, we developed a summary-statistics-based approach using the projected LD matrix. We define a new metric \(\pi_c\) to estimate the proportion of expression-mediated causal effect explained by a genomic segment using summary statistics. (To illustrate the approach, we evaluated the causal role of the introgressed segment in severe COVID-19. However, the approach can be applied more generally to GWAS summary statistics data.) Let \(\hat \alpha\) be the MR-JTI estimate of the gene causal effect on the trait, which is obtained by solving an optimization problem (of predicting a variant’s GWAS effect size by its regulatory effect on the gene and its contribution to heterogeneity) (see below; Eq. 29). We consider the GWAS marginal effect size vectors, \(\widehat{\theta_{\mathrm{full}}}\) and \(\widehat{\theta_{\mathrm{reduced}}}\), and corresponding eQTL effect size vectors, \(\widehat{\beta_{\mathrm{full}}}\) and \(\widehat{\beta_{\mathrm{reduced}}}\), for the full model and reduced model, respectively, and the projected matrices \(\mathbf{C}_{\mathrm{full}}^\ast\) and \(\mathbf{C}_{\mathrm{reduced}}^\ast\) of the SNP correlation matrices \(\mathbf{C}_{\mathrm{full}}\) and \(\mathbf{C}_{\mathrm{reduced}}\) for the full model and reduced model, respectively. We have the following decomposition of the GWAS marginal effect size into an expression-mediated causal effect and an “indirect” component (Fig. 1):
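The phenotype-generation step can be sketched as follows. This is a simplified, single-variant illustration with hypothetical settings (allele frequency 0.3, 20,000 subjects, a fixed random seed), not the simulation pipeline used in the study; it centers the dosage and fixes var(Y) = 1, so that the residual variance is 1 − h²_local.

```python
import math
import random


def simulate_phenotype(dosages, h2_local, rng):
    """Generate Y = beta * G + eps with var(Y) = 1 and var(beta * G) = h2_local."""
    n = len(dosages)
    mean_g = sum(dosages) / n
    var_g = sum((g - mean_g) ** 2 for g in dosages) / n
    if var_g == 0.0:
        raise ValueError("causal variant is monomorphic")
    var_y = 1.0
    beta = math.sqrt(h2_local * var_y / var_g)
    sigma = math.sqrt(var_y - beta ** 2 * var_g)  # residual standard deviation
    return [beta * (g - mean_g) + rng.gauss(0.0, sigma) for g in dosages]


def squared_correlation(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov * cov / (var_a * var_b)


rng = random.Random(1)
maf = 0.3          # hypothetical allele frequency of the causal variant
h2_local = 0.03
# Dosage = number of copies of the effect allele on two independent haplotypes.
dosages = [(rng.random() < maf) + (rng.random() < maf) for _ in range(20000)]

phenotype = simulate_phenotype(dosages, h2_local, rng)
r2 = squared_correlation(dosages, phenotype)
print(r2)
```

The squared correlation between dosage and phenotype should be close to the specified local heritability, up to sampling error.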
where * denotes the full or reduced model. Then we define \(\pi _c\) as follows:
The metric \(\pi_c\), a ratio of estimated expression-mediated causal effects, is obtained by replacing the GWAS effect size vector \(\widehat{\theta_\ast}\) by the effect size vector \(\hat \alpha \widehat{\beta_\ast}\), which quantifies the effect on the trait mediated by the gene expression. Correspondingly, one can estimate the concentration of expression-mediated heritability, ψ_{e} (see above for definition of ψ). The difference vector:
is an overall estimate of the ‘indirect’ effect, including heterogeneity, confounding, and other non-expression-mediated effects.
Training the full model and reduced model of gene expression
We generated a “reduced model” (trained using only the subset of variants in the segment of interest) and the “full model” (trained using all variants in the cis-region, 1 Mb on both sides of the gene body). As an application, for the reduced model, we included only the introgressed variants in the Neanderthal-inherited 49.4 Kb segment, and then estimated the expression variance \(h_{g,\mathrm{reduced}}^2\) explained by the model:
as the square of the correlation between the predicted expression \(\widehat {g_{{{{\mathrm{reduced}}}},{{{\mathrm{test}}}}}}\) and observed expression \(g_{{{{\mathrm{test}}}}}\) in a test set. This reduced model facilitates comparison with the original full model.
For the actual implementation, we leveraged whole-genome sequence data and gene expression data from the GTEx v8 data release^{13}. The sample size ranges from 70 to 706 across 49 tissues from a total of 838 donors. We used the residual of the normalized expression level^{13} after adjusting for covariates: gender, platform, first five principal components (PCs), and PEER factors for each tissue. The reduced model and the full model were trained using JTI^{9} to improve the prediction performance (the square of the Pearson’s correlation r between the predicted expression and the observed expression) by borrowing information across tissues. The training of the full model was as previously described^{9}. Briefly, JTI estimates the gene expression profile similarity and the regulatory profile similarity (here, generated from the DNase I hypersensitivity [DHS] sites in the promoter region) for each tissue-tissue pair. The two similarity measures were combined using hyperparameters, which were tuned using five-fold cross-validation. For the reduced model, the similarity of the regulatory profile was estimated using the DHS peaks in the introgressed segment^{41,42}. Genes with a good prediction quality from five-fold cross-validation (r > 0.1 and P < 0.05 for the correlation between the observed and the predicted expression) are called imputable genes (iGenes). Common genetic variants (minor allele frequency ≥ 0.05) were used for training the full and reduced models. Models trained by PrediXcan and by JTI, and similarly the reduced model and the full model, were systematically compared for prediction quality.
We also compared the prediction performance (r^{2}) of a randomly chosen segment with that of the actual introgressed segment. For each gene located within 1 Mb of the introgressed segment (in both directions), we built a prediction model for each of 100 randomly selected segments (of the same length as the introgressed segment) within the cis-region (i.e., within 1 Mb of the gene), using the genetic variants in the segment. The median of the prediction performance (r^{2}) across the 100 models was calculated for each gene as the random-segment-based prediction performance.
We investigated the extent to which maintenance of good prediction accuracy with the reduced model (relative to the full model) depended on the segment length. We tested two segment lengths (i.e., 100 and 500 kb extensions on both sides of the actual segment), to compare the performance of the reduced model from the dilated segment with that of the full model from the complete cis-region.
GWAS summary-statistics-based JTI of COVID-19 hospitalization and severity
To identify the genes associated with COVID-19 severity, we applied JTI to the summary statistics from the COVID-19 HGI GWAS meta-analyses, round 6^{6}. For the GWAS meta-analysis of COVID-19 hospitalization, 24,274 hospitalized cases and 2,061,529 population controls were included. The GWAS meta-analysis of severity included 8,779 very severe respiratory confirmed cases and 1,001,875 population controls. Details of each substudy can be found in Supplementary Table 3.
Causal gene mapping using Mendelian randomization
Based on the JTI results, we further performed Mendelian randomization to map causal genes around the introgressed segment. Here we applied our MR-JTI^{9} approach, which, through modeling the heterogeneity (from horizontal pleiotropy and unobserved confounding factors) of instrumental variables (IVs), provides a nearly unbiased estimate of the gene causal effect α on the trait. To confirm this, we performed simulations (n = 500), comparing MR-JTI’s estimate of the causal effect with the conventional inverse-variance weighted (IVW) method’s estimate. We randomly sampled 100 genes with at least one eQTL (estimated from 670 whole blood GTEx v8 samples). The gene expression level (X) was simulated using empirical eQTL effect sizes (β). The variance of the residual component (\(\sigma_X^2\)) was also informed by empirical data. The trait (Y) was simulated by assuming that the gene expression level was causal for the trait at various levels of effect size α (ranging from 0 to 0.5). To investigate the impact of heterogeneity on the causal effect estimate from MR-JTI (\(\widehat{\alpha_{\mathrm{JTI}}}\)) and IVW (\(\widehat{\alpha_{\mathrm{IVW}}}\)), we assumed that 20% of the instrumental variables were not valid, with the horizontal-pleiotropy effect (Z) twice as large as the mediation effect. For each simulation, the genotype data (G) were generated for 50,000 samples based on empirical genotype data (GTEx v8)^{13,40}.
MR-JTI solves the following optimization problem:
to estimate the gene causal effect (\(\hat \alpha\)), the contribution (\(\widehat{\delta_j}\)) of the jth instrument to the heterogeneity, and the effect (\(\hat \omega\)) of the LD score l_{j}. Here, \(\hat \theta\) is the GWAS effect size vector. MR-JTI is a two-sample Mendelian randomization approach. For implementation, the GTEx v8 eQTL dataset^{13} (\(\widehat{\beta_j}\)) and the GWAS summary statistics (\(\widehat{\theta_j}\)) were used as input. The LD score was estimated from GTEx v8 (the same dataset as used for eQTL estimation). For additional support, we also applied MR-Egger and the weighted median estimator to estimate the causal effect for each gene using the R package ‘MendelianRandomization’. Following the Mendelian randomization guidelines^{43}, we removed palindromic IVs and clumped IVs using PLINK 1.9 (--clump-p1 0.05 --clump-r2 0.1) based on the p-value of the association test between an IV and gene expression level. Additional correlation among the IVs was removed by incorporating the IV-IV correlation matrix in the ‘MendelianRandomization’ implementation.
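The effect of invalid instruments on the IVW comparator can be illustrated with a noise-free, summary-level toy example. The sketch below does not implement MR-JTI; it only computes the standard fixed-effect IVW estimate and shows how it is biased when 20% of the instruments carry a horizontal-pleiotropy effect twice as large as the mediated effect. The eQTL effect sizes, standard errors, and the choice of invalid instruments are hypothetical.

```python
def ivw_estimate(beta, theta, se):
    """Fixed-effect inverse-variance weighted estimate (regression through the origin)."""
    if not (len(beta) == len(theta) == len(se)):
        raise ValueError("inputs must have the same length")
    num = sum(b * t / (s * s) for b, t, s in zip(beta, theta, se))
    den = sum(b * b / (s * s) for b, s in zip(beta, se))
    if den == 0.0:
        raise ValueError("instruments have no effect on the exposure")
    return num / den


alpha_true = 0.3
# Hypothetical eQTL effect sizes for ten instruments and GWAS standard errors.
beta = [0.2, -0.1, 0.4, 0.3, -0.25, 0.15, 0.35, -0.3, 0.1, 0.2]
se = [0.05] * len(beta)
invalid = {0, 1}  # 20% of the instruments violate the exclusion restriction

# Valid instruments: the GWAS effect is fully mediated by expression.
theta_valid = [alpha_true * b for b in beta]
# Invalid instruments: add a pleiotropic effect twice the mediated effect.
theta_pleio = [
    alpha_true * b + (2.0 * alpha_true * b if j in invalid else 0.0)
    for j, b in enumerate(beta)
]

alpha_ivw_valid = ivw_estimate(beta, theta_valid, se)
alpha_ivw_pleio = ivw_estimate(beta, theta_pleio, se)
print(alpha_ivw_valid, alpha_ivw_pleio)
```

With all instruments valid, IVW recovers the true effect; with directional pleiotropy on a subset of instruments, the estimate is inflated, which motivates modeling instrument-specific heterogeneity.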
Genetically determined expression score in modern human populations and an archaic genome
We define the GDEscore of a subject for a gene using the gene’s JTI model^{9}. The GDEscore provides a metric to quantify “regulatory divergence” between modern human genomes and an archaic genome, which can be used to investigate phenotypic divergence among hominin lineages^{11} or among individuals according to introgression status. We note that the GDEscore should not be viewed as an extinct hominin’s level of gene expression, which is not directly accessible. The GDEscore does not reflect fixed differences or substitutions, but models only polymorphisms that arose in the common ancestors of modern humans and the archaic genome as well as modern-human-specific polymorphisms at which the archaic genome is homozygous for the ancestral alleles^{16}. Differences in a gene’s GDEscore quantify differences in genetic regulatory effects for these modeled variants. As an application, we estimated the phenotypic consequence of the introgressed segment for putatively causal genes from the Mendelian randomization analyses.
As a reference panel of modern human populations, individual-level genotype data were downloaded from the 1000 Genomes project (phase 3)^{44}. The distributions of estimated genetically determined expression in five populations, including African Ancestry (AFR), American Ancestry (AMR), East Asian Ancestry (EAS), European Ancestry (EUR), and South Asian Ancestry (SAS), were generated. The high-quality archaic genome from a Neanderthal individual found in the Altai Mountains was used to estimate the archaic genome GDEscore^{11,45}.
Identifying the phenomic consequences of a genomic segment
To determine the health consequences of the target genes of the segment, we conducted phenome-wide association studies (PheWAS)^{46,47,48}. We selected genes based on the prediction performance of the reduced model, as these genes show substantial genetic control by the segment in at least one tissue, but we used the full model to evaluate their phenotypic consequences in PheWAS, as the full model should have improved power for the association test.
We performed JTI association analyses on blood cell traits, using the GWAS summary statistics from the UK Biobank samples. The GWAS summary statistics were downloaded from the Neale Lab (www.nealelab.is/ukbiobank). The sample size for the 27 blood cell traits ranges from 344,728 to 350,470. The links for the resource, including the summary statistics and the original distributions of all blood cell traits, can be found in Supplementary Table 10. Age, age^{2}, sex, age*sex, sex*age^{2}, and the first 20 PCs were included as covariates in the GWAS.
To identify potential complications of severe COVID-19, we performed a JTI-based phenome scan across four trait categories, specifically neurological, respiratory, circulatory, and endocrine/metabolic disorders, based on the UKB GWAS results. The GWAS summary statistics had been generated by the Lee lab^{49}, using SAIGE (Scalable and Accurate Implementation of GEneralized mixed model), which provides accurate P values even when case-control ratios are extremely unbalanced^{50}. In total, 253 binary traits (belonging to the four categories) with at least 50 cases were included. The first four genotype-based principal components, gender, and birth year were included as non-genetic covariates. The Phecode hierarchical system (https://phewascatalog.org/)^{51,52} comes with case groups (typically diseases and complications), each with a corresponding control group. The sample size for each trait can be found in Supplementary Table 12.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
The COVID-19 severity GWAS summary statistics are publicly accessible. The JTI prediction models are available at https://zenodo.org/record/3842289.
Code availability
The code for SEGMENTSCAN is available on GitHub (https://github.com/gamazonlab/DeepRegionalAnalysis).
References
Organization, W. H. Coronavirus disease 2019 (COVID19): situation report, 72. https://apps.who.int/iris/handle/10665/331685 (2020).
Hu, Y. et al. Prevalence and severity of corona virus disease 2019 (COVID19): A systematic review and metaanalysis. J. Clin. Virol. 127, 104371 (2020).
Ellinghaus, D. et al. Genomewide association study of severe Covid19 with respiratory failure. N. Engl. J. Med. 383, 1522–1534 (2020).
Initiative, C.H. G. The COVID19 Host Genetics Initiative, a global initiative to elucidate the role of host genetic factors in susceptibility and severity of the SARSCoV2 virus pandemic. Eur. J. Human Genet. https://doi.org/10.1038/s4143102006366 (2020).
PairoCastineira, E. et al. Genetic mechanisms of critical illness in Covid19. Nature 591, 92–98 (2021).
Initiative, C.H. G. Mapping the human genetic architecture of COVID19. Nature 600, 472–477 (2021).
PairoCastineira, E. et al. Genetic mechanisms of critical illness in Covid19. medRxiv https://doi.org/10.1038/s4158602003065y (2020).
Zeberg, H. & Pääbo, S. The major genetic risk factor for severe COVID19 is inherited from Neanderthals. Nature 587, 610–612 (2020).
Zhou, D. et al. A unified framework for jointtissue transcriptomewide association and Mendelian randomization analysis. Nat. Genet. https://doi.org/10.1038/s4158802007062 (2020).
Gamazon, E. R. et al. A genebased association method for mapping traits using reference transcriptome data. Nat. Genet. 47, 1091 (2015).
Colbran, L. L. et al. Inferred divergent gene regulation in archaic hominins reveals potential phenotypic differences. Nat. Ecol. Evol. 3, 1598–1606 (2019).
Zhong, Y., Perera, M. A. & Gamazon, E. R. On using local ancestry to characterize the genetic architecture of human traits: Genetic regulation of gene expression in multiethnic or admixed populations. Am. J. Hum. Genet. 104, 1097–1115 (2019).
Consortium, G. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
Gordon, D. E. et al. A SARSCoV2 protein interaction map reveals targets for drug repurposing. Nature 583, 459–468 (2020).
Finucane, H. K. et al. Partitioning heritability by functional annotation using genomewide association summary statistics. Nat. Genet. 47, 1228 (2015).
Yan, S. M. & McCoy, R. C. Functional divergence among hominins. Nat. Ecol. Evol. 3, 1507–1508 (2019).
VuilleditBille, R. N. et al. Human intestine luminal ACE2 and amino acid transporter expression increased by ACEinhibitors. Amino Acids 47, 693–705 (2015).
Hoffmann, M. et al. SARSCoV2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor. Cell 181, 271–280 (2020).
Berry, J. D. et al. Development and characterisation of neutralising monoclonal antibody to the SARScoronavirus. J. Virol. Methods 120, 87–96 (2004).
Kuba, K. et al. A crucial role of angiotensin converting enzyme 2 (ACE2) in SARS coronavirus–induced lung injury. Nat. Med. 11, 875–879 (2005).
Paust, S. et al. Critical role for the chemokine receptor CXCR6 in NK cell-mediated antigen-specific memory of haptens and viruses. Nat. Immunol. 11, 1127–1135 (2010).
Samson, M., Labbe, O., Mollereau, C., Vassart, G. & Parmentier, M. Molecular cloning and functional expression of a new human CC-chemokine receptor gene. Biochemistry 35, 3362–3367 (1996).
Patterson, B. K. et al. CCR5 inhibition in critical COVID-19 patients decreases inflammatory cytokines, increases CD8 T-cells, and decreases SARS-CoV-2 RNA in plasma by day 14. Int. J. Infect. Dis. 103, 25–32 (2021).
Uehara, S., Grinberg, A., Farber, J. M. & Love, P. E. A role for CCR9 in T lymphocyte development and migration. J. Immunol. 168, 2811–2819 (2002).
Yao, Y. et al. Genome and epigenome editing identify CCR9 and SLC6A20 as target genes at the 3p21.31 locus associated with severe COVID-19. Signal Transduct. Target. Ther. 6, 1–3 (2021).
Braunstein, P., Cuenoud, H., Joris, I. & Majno, G. Platelets, fibroblasts, and inflammation: Tissue reactions to platelets injected subcutaneously. Am. J. Pathol. 99, 53 (1980).
Matsubara, Y., Murata, M. & Ikeda, Y. Platelets and Megakaryocytes 249–258 (Springer, 2012).
Agyeman, A. A., Chin, K. L., Landersdorfer, C. B., Liew, D. & Ofori-Asenso, R. Mayo Clinic Proceedings 1621–1631 (Elsevier, 2020).
Audsley, K. M., McDonnell, A. M. & Waithman, J. Cross-presenting XCR1+ dendritic cells as targets for cancer immunotherapy. Cells 9, 565 (2020).
Wohlers, I., Calonga-Solís, V., Jobst, J.-N. & Busch, H. COVID-19 genetic risk and Neanderthals: A case study highlighting the importance of scrutinizing diversity. Preprint at bioRxiv https://doi.org/10.1101/2020.11.02.365551 (2020).
Wang, A. et al. Single-cell multiomic profiling of human lungs reveals cell-type-specific and age-dynamic control of SARS-CoV-2 host genes. eLife 9, e62522 (2020).
Liang, Y. et al. Polygenic transcriptome risk scores (PTRS) can improve portability of polygenic risk scores across ancestries. Genome Biol. 23, 1–18 (2022).
Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42, 565–569 (2010).
Gamazon, E. R. & Park, D. S. SNP-based heritability estimation: Measurement noise, population stratification, and stability. Preprint at bioRxiv https://doi.org/10.1101/040055 (2016).
Meuwissen, T. H., Hayes, B. J. & Goddard, M. E. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819–1829 (2001).
de Los Campos, G., Vazquez, A. I., Fernando, R., Klimentidis, Y. C. & Sorensen, D. Prediction of complex human traits using the genomic best linear unbiased predictor. PLoS Genet. 9, e1003608 (2013).
Gamazon, E. R., Cox, N. J. & Davis, L. K. Structural architecture of SNP effects on complex traits. Am. J. Hum. Genet. 95, 477–489 (2014).
Shi, H., Kichaev, G. & Pasaniuc, B. Contrasting the genetic architecture of 30 complex traits from summary association data. Am. J. Hum. Genet. 99, 139–153 (2016).
Hou, K. et al. Accurate estimation of SNP-heritability from biobank-scale data irrespective of genetic architecture. Nat. Genet. 51, 1244–1251 (2019).
Dimitromanolakis, A., Xu, J., Krol, A. & Briollais, L. sim1000G: A user-friendly genetic variant simulator in R for unrelated individuals and family-based designs. BMC Bioinform. 20, 1–9 (2019).
ENCODE Project Consortium. The ENCODE (ENCyclopedia of DNA elements) project. Science 306, 636–640 (2004).
Kundaje, A. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317 (2015).
Burgess, S. et al. Guidelines for performing Mendelian randomization investigations. Wellcome Open Res. 4, 186 (2019).
1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Prüfer, K. et al. The complete genome sequence of a Neanderthal from the Altai Mountains. Nature 505, 43–49 (2014).
Unlu, G. et al. GRIK5 genetically regulated expression associated with eye and vascular phenomes: Discovery through Iteration among Biobanks, Electronic Health Records, and Zebrafish. Am. J. Hum. Genet. 104, 503–519 (2019).
Unlu, G. et al. Phenome-based approach identifies RIC1-linked Mendelian syndrome through zebrafish models, biobank associations, and clinical studies. Nat. Med. 26, 98–109 (2020).
Denny, J. C. et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat. Biotechnol. 31, 1102–1110 (2013).
Zhou, W. et al. Scalable generalized linear mixed model for region-based association tests in large biobanks and cohorts. Nat. Genet. 52, 634–639 (2020).
Zhou, W. et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat. Genet. 50, 1335–1341 (2018).
Denny, J. C. et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat. Biotechnol. 31, 1102–1111 (2013).
Wu, P. et al. Mapping ICD-10 and ICD-10-CM codes to phecodes: workflow development and initial evaluation. JMIR Med. Inform. 7, e14325 (2019).
Acknowledgements
E.R.G. is grateful to the President and Fellows of Clare Hall, University of Cambridge for providing a stimulating intellectual home and for the generous support. We thank the COVID-19 Host Genetics Initiative for making the GWAS summary statistics publicly available for immediate use for the benefit of the wider biomedical community to advance discovery. We thank members of the Gamazon Lab for helpful discussions. This research is supported by the National Institutes of Health (NIH) Genomic Innovator Award R35HG010718, NIH/NHGRI R01HG011138, NIH/NIA AG068026, and NIH/NIGMS R01GM140287.
Author information
Contributions
E.R.G. and D.Z. designed the study, wrote the manuscript, and revised it critically for its intellectual content. E.R.G. and D.Z. approved the completed version of the manuscript. D.Z. performed the analyses. E.R.G. supervised and acquired funding for the study.
Ethics declarations
Competing interests
E.R.G. receives an honorarium from the journal Circulation Research of the American Heart Association, as a member of the Editorial Board. D.Z. declares no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zhou, D., Gamazon, E.R. Integrative transcriptomic, evolutionary, and causal inference framework for region-level analysis: Application to COVID-19. npj Genom. Med. 7, 24 (2022). https://doi.org/10.1038/s41525-022-00296-y
DOI: https://doi.org/10.1038/s41525-022-00296-y