MiXcan: a framework for cell-type-aware transcriptome-wide association studies with an application to breast cancer

Human bulk tissue samples comprise multiple cell types with diverse roles in disease etiology. Conventional transcriptome-wide association study approaches predict genetically regulated gene expression at the tissue level, without considering cell-type heterogeneity, and test associations of predicted tissue-level expression with disease. Here we develop MiXcan, a cell-type-aware transcriptome-wide association study approach that predicts cell-type-level expression, identifies disease-associated genes via combination of cell-type-level association signals for multiple cell types, and provides insight into the disease-critical cell type. As a proof of concept, we conducted cell-type-aware analyses of breast cancer in 58,648 women and identified 12 transcriptome-wide significant genes using MiXcan compared with only eight genes using conventional approaches. Importantly, MiXcan identified genes with distinct associations in mammary epithelial versus stromal cells, including three new breast cancer susceptibility genes. These findings demonstrate that cell-type-aware transcriptome-wide analyses can reveal new insights into the genetic and cellular etiology of breast cancer and other diseases.

Transcriptome-wide association studies (TWAS) aim to identify genes that are associated with disease through their genetically regulated gene expression (GReX) levels 1,2 . Conventional TWAS approaches such as PrediXcan 1 predict tissue-level GReX using models trained on transcriptomic and genomic data from bulk tissue samples, and test associations between the predicted tissue-level GReX and disease. By reducing the multiple testing burden from millions of variants to thousands of genes, TWAS can improve the power of genome-wide association studies (GWAS) while providing biological insights into the genes and regulatory mechanisms underlying disease. However, conventional TWAS approaches do not account for cell-type heterogeneity of bulk tissue samples, which can reduce the accuracy of GReX prediction models and obscure disease associations, particularly when the most mechanistically relevant cell type for the disease is a minor cell type in the tissue 3 .
Breast carcinoma is a common and highly heritable cancer that arises from epithelial cells, which line the ducts and lobules that produce milk during lactation 4,5 . Human mammary tissue has highly variable cell composition. Visualized on mammography, breast composition can range from extremely dense (light), reflecting a high proportion of fibroglandular tissue, to almost entirely fatty (dark), reflecting a high proportion of adipose tissue 6,7 . Whereas higher mammographic density is associated with increased risk of breast cancer, a higher amount of nondense fatty tissue is associated with decreased risk, indicating disparate roles of the different cellular components of mammary tissue in carcinogenesis [8][9][10] . Breast cancer susceptibility loci identified by prior GWAS [11][12][13] and TWAS [14][15][16] approaches that do not account for cell-type heterogeneity explain only a fraction of the familial relative risk. Disentangling the distinct effects of gene expression in mammary epithelial cells from other cell types through cell-type-aware analysis could lead to new gene discoveries and biological insights.
To our knowledge, no statistical methods currently exist for conducting cell-type-aware TWAS using GWAS data. Single-cell sorting and transcriptome profiling are costly, and large reference panels with both single-cell transcriptomic and genomic data are not yet widely available for training robust GReX prediction models. Recent studies of bulk tissue transcriptomic data have used computational estimates of cell-type enrichment, which are correlated with their proportions, to evaluate cell-type-specific effects. The Genotype-Tissue Expression (GTEx 17 ) consortium estimated cell-type enrichment scores in bulk tissue samples using xCell 18 and tested for interactions between genotype and xCell scores in linear regression models of gene expression to identify interaction expression quantitative trait loci (ieQTL) 19 . The breast was among the human tissues with the most ieQTLs, specifically involving mammary epithelial cells and adipocytes 19 , highlighting the potential for new methods that harness cell-type-specific genetic regulation of expression to improve the power of breast cancer TWAS. Methods that integrate bulk tissue data with single-cell reference profiles to estimate cell-type-level gene expression have also been proposed to study cell-type-specific disease associations 20,21 . However, these methods all require transcriptomic data from the diseaserelevant tissue and cannot be applied to existing GWAS datasets to perform TWAS in large populations.
Here we present MiXcan, a new statistical framework for conducting cell-type-aware TWAS using GWAS data. MiXcan builds celltype-level GReX prediction models through decomposition of bulk tissue data, identifies disease-associated genes via combination of signals from cell-type-level association analyses of multiple cell types, and provides insight into the cell type responsible for the disease association. We show that MiXcan improves the tissue-level GReX prediction accuracy compared with conventional approaches in an independent bulk-tissue validation set, and reliably predicts epithelial cell GReX in a single-nucleus RNA sequencing (snRNAseq) dataset. Simulation studies show that MiXcan controls the type I error, and provides higher power than conventional TWAS approaches when disease associations are driven by a minor cell type (e.g. mammary epithelial cells) rather than the predominant cell type in a tissue, or have opposite directions in different cell types. We apply MiXcan to conduct the first cell-type-aware TWAS of breast cancer risk in 31,716 cases and 26,932 controls, and report three new susceptibility genes (ZNF703, TMEM245, and PSG4) with evidence of distinct associations in mammary epithelial versus stromal cells that were not detected by prior TWAS nor GWAS. These findings provide a proof a concept that cell-type-aware TWAS can reveal new insights into the genetic and cellular etiology of breast cancer and other diseases.

MiXcan framework
We developed the MiXcan framework for conducting cell-typeaware TWAS (Fig. 1). To build GReX prediction models, MiXcan requires specification of the cell type of interest and a prior estimate of its proportion in bulk tissue training samples with transcriptomic and genomic data. The cell type of interest for a given disease may be selected based on prior biologic knowledge, and its proportion estimated from the transcriptomic data using existing deconvolution methods and reference panels [21][22][23] . For cell types without large reference panels or direct proportion estimates, a cell-type enrichment score can be estimated from the bulk tissue transcriptomic data using xCell 18 . MiXcan can utilize xCell or other enrichment scores as a prior to estimate the cell-type proportion (see "Methods"). MiXcan then decomposes the bulk tissue gene expression level into its cell-type levels and uses joint penalized regression to model the association of genetic variants (SNPs) with gene expression for each cell type. The regression coefficients (SNP weights) are compared to determine whether the GReX prediction models for each gene are cell-type-specific (different weights in different cell types) or nonspecific (sames weights across cell types). Simulation studies (below) show that MiXcan prediction models are robust to misspecification of the cell-type proportion, which can result from inaccurate estimates [24][25][26] .
To conduct cell-type-aware TWAS, MiXcan uses the predicted GReX to test the following composite null and alternative hypotheses: H 0 : There is no association between the predicted GReX and the disease in any cell type. H A : There is an association between the predicted GReX and disease in at least one cell type.
Genes with cell-type-specific GReX prediction models are first associated with disease within each cell type, and then the signals are combined across cell types using the Cauchy-based p-value combination method 27 . Genes with nonspecific GReX prediction models are tested for their association with disease in one step. Significant associations are identified using an appropriate threshold to control the family-wise error rate (FWER) or false discovery rate (FDR). For significant genes with cell-type-specific GReX models, the cell-type-level results are compared to provide further insight into the cell type(s) likely to be responsible for the disease association.
While the MiXcan framework is general, its performance depends on the cell types under consideration and the available training data. As the number of cell types increases, the number of parameters increases and the accuracy of the model decreases. At present, given the limited sample sizes of transcriptomic and genomic datasets available for most human tissues through public repositories such as GTEx, it is practical to consider only two categories of cells using MiXcan and to focus on the cell type of greatest interest for the disease under investigation versus the other cell types. As breast carcinoma is known to arise from epithelial cells, we developed MiXcan epithelial and stromal (nonepithelial) cell models using bulk mammary tissue transcriptomic and genomic data available for 125 European ancestry (EA) women in GTEx v8.

Prediction performance
The accuracy of MiXcan and PrediXcan GReX prediction models trained using GTEx v8 data for 125 mammary tissue samples from EA women was initially evaluated in an independent dataset of 103 tumor-adjacent normal mammary tissue samples from EA women in The Cancer Genome Atlas (TCGA). MiXcan estimates of the epithelial cell proportions were highly correlated with the xCell 18 epithelial cell enrichment scores (used as a prior), with Pearson correlations (r) of 0.90 and 0.89 in normal mammary tissue samples from EA women in GTEx (N = 125) and TCGA (N = 103), respectively (Supplementary Fig. 1). However, MiXcan estimates of the epithelial cell proportion were more highly correlated with the expression levels of 126 genes included in the xCell epithelial cell gene signature (median r of 0.54 in GTEx and 0.60 in TCGA samples) than were the xCell enrichment scores themselves (median r of 0.36 in GTEx and 0.39 in TCGA samples) indicating that MiXcan can improve cell proportion estimation from its prior (Supplementary Fig. 1).
MiXcan estimated cell-type-specific prediction models for 5473 (84.7%) and nonspecific prediction models for 988 (15.3%) of 6461 genes that had mammary tissue-level prediction models available in PredictDB 28 (Fig. 2). The tissue-level GReX was computed using MiXcan estimates of the cell-type proportion and predicted cell-type-level GReX values. The median correlation of predicted GReX and measured mammary tissue expression levels for the 6461 genes in the TCGA validation set was significantly higher for MiXcan compared with Pre-diXcan (median r of 0.41 vs. 0.10; p value < 2.2 × 10 −16 ) models trained using the same dataset of 125 GTEx EA women. The prediction accuracy for the 5473 genes with cell-type-specific models in MiXcan was significantly better than PrediXcan (median r of 0.43 vs. 0.12; p < 2.2 × 10 −16 ), whereas the prediction accuracy for the remaining 988 genes with nonspecific models in MiXcan was the same as PrediXcan (median r of 0.08 vs. 0.08; p value=1). These results indicate that allowing for cell-type-level GReX prediction models increases the prediction accuracy for genes with evidence of cell-specific genetic regulation, and does not decrease the prediction accuracy for other genes compared with standard approaches for predicting tissuelevel GReX.
To examine potential sources of the gain in prediction accuracy, three additional approaches were compared with MiXcan and Pre-diXcan ( Supplementary Fig. 2). The median correlation of predicted GReX with measured mammary tissue-level expression for all 6461 genes in the TCGA validation set was slightly higher for PredictDB (r = 0.12) elastic-net models trained using 337 GTEx EA men and women on the entire genome compared with PrediXcan (r = 0.10) trained using 125 EA women indicating modest gains from the inclusion of 212 EA men in the training dataset. Accounting for cell composition using penalized regression models including interactions of SNPs with the xCell epithelial cell score (xCell Interaction; r = 0.20) or MiXcan cell proportion (MiXcan 0 ; r = 0.38) led to substantial gains in prediction accuracy. Symmetric estimation of celltype-level prediction models employed in MiXcan (r = 0.41) further improved performance compared with standard interaction models that employ asymmetric penalization for the two cell types. Importantly, whereas standard interaction models require estimates of celltype composition, which often are unavailable for the tissue of interest in GWAS of human diseases, MiXcan prediction models can be applied directly to GWAS genotype data to perform cell-typeaware TWAS.
Finally, to evaluate the prediction accuracy of MiXcan at the celltype level, we compared the predicted epithelial cell GReX with measured mammary epithelial cell snRNAseq data available for three GTEx women of European, Asian, and African ancestry 29 . We found that genes (n=100) predicted to have the largest GReX differences based on the SNP genotypes in each pair of women also had significantly different measured snRNAseq levels in their mammary epithelial cells (p value range: 0.01 to 3.3 × 10 −7 ), as expected (Fig. 3). The observed snRNAseq differences were significant despite the potentially poorer prediction accuracy of MiXcan models in women of Asian and African ancestry who were not represented in the training dataset. These snRNAseq results support the robustness of the cell-type level GReX predictions obtained using the MiXcan approach.

Simulation studies
Type I error and power. To evaluate type I error and power of MiXcan association tests, datasets were simulated (see "Methods") under a broad range of realistic settings for the associations of genetic variants with gene expression (SNP-Exp) and gene expression with disease (Exp-Disease). MiXcan predicted GReX with higher accuracy than PrediXcan in the presence of cell-type heterogeneity of SNP-Exp associations, while maintaining comparable accuracy in the absence of cell-specific effects ( Supplementary Fig. 3), consistent with results in the independent TCGA validation dataset (Fig. 2).
The type I error was well controlled for MiXcan and PrediXcan under all simulated data scenarios (Fig. 4 col. 1). When SNP-Exp associations were homogeneous in the two cell types (Fig. 4a), the power was similar for MiXcan and PrediXcan whether the Exp-Disease associations were homogeneous or heterogeneous across cell types. Differences in the mean gene expression level between the two cell types that were not determined by SNP-Exp associations (Fig. 4a, b) did not impact the power of MiXcan and PrediXcan indicating robustness to differential expression that is not regulated by genetic variants.
When SNP-Exp associations were heterogeneous in the two cell types ( Fig. 4b-d), the relative power of MiXcan and PrediXcan depended on the mechanisms of the Exp-Disease and SNP-Exp associations. PrediXcan was generally more powerful than MiXcan when the Exp-Disease association was either homogeneous across cell types ( Fig. 4 col. 2) or present only in the major cell type (Fig. 4 col. 3). However, MiXcan was generally more powerful than PrediXcan when the Exp-Disease association was present only in the minor cell type (Fig. 4 col. 4) or had opposite directions in the two cell types (Fig. 4 col. 5).
As the strength of the SNP-Exp association increased in the same cell type as the Exp-Disease association, the power increased for both PrediXcan and MiXcan (Fig. 4c col. 4; Fig. 4d col. 3). However, as the strength of the SNP-Exp association increased in a different cell type from the Exp-Disease association, the power decreased for PrediXcan but not MiXcan (Fig. 4c col. 3; Fig. 4d col. 4). When the Exp-Disease association had opposite directions in the two cell types, the power was U-shaped for PrediXcan but increased for MiXcan as the strength of the SNP-Exp association increased in either cell type ( Fig. 4c-d col. 5). Similar patterns were observed when type I error and power were evaluated in relation to the expression heritability in the two cell types instead of the strength of SNP-Exp associations ( Supplementary Fig. 4). These patterns show that different association signals in the two cell types can cancel each other out in PrediXcan, which averages their effects, but are aggregated across cell types in MiXcan thereby preserving power to detect associations due to the minor cell type or that differ across cell types.
In addition to providing valid tissue-level association tests, MiXcan provides information for each cell type separately. Simulation studies showed that the type I error was well controlled for MiXcan cell-type-level tests when no Exp-Disease association was present in any cell type ( Supplementary Fig. 5 col. 1). When SNP-Exp associations were homogeneous across cell types ( Supplementary Fig. 5a), MiXcan generally estimated nonspecific GReX prediction models which yield the same disease-association test results for all cell types. Thus, cell-type-level inferences can only be made in the heterogeneous SNP-Exp setting ( Supplementary Fig. 5b-d), when MiXcan estimates celltype-specific GReX prediction models. When the Exp-Disease association was present in both cell types in the same or opposite directions ( Supplementary Fig. 5 cols. 2 & 5), the power of the cell-type-level tests was similar when the SNP-Exp associations had similar magnitude (regardless of direction) and increased as the magnitude of the SNP-Exp association increased. When the Exp-Disease association was present in only one cell type ( Supplementary Fig. 5 cols. 3-4), the power was always highest in this cell type, but the association signal was shared to some degree with the uninvolved cell type. This correlation of the cell-type-level results arises from the joint estimation of the SNP weights for the cell-type-level GReX prediction models in MiXcan. Therefore, we recommend using the combined p-value for all cell types to make inferences regarding whether GReX is significantly associated with disease in any cell type in the tissue, and the cell-typelevel results to compare the evidence that different cell types are involved for significant genes.
Finally, we evaluated the impact of the sample size of the training dataset on the type I error and power of MiXcan association tests in simulation studies ( Supplementary Fig. 6). As the training dataset increased from 100 to 300 samples, the power of association studies with 3000 samples increased while the type I error remained well controlled. Prediction models trained using only 100-150 samples provided reasonable power for gene identification.
Performance under model misspecification. MiXcan decomposes bulk tissue expression levels into two components using an estimate of the cell-type proportion b π for the cell of interest. First, we evaluated the performance of MiXcan under misspecification of b π (Fig. 5a). In simulation studies using a broad range of biased and noisy estimates of b π, the type I error was consistently well controlled. The power of MiXcan also was generally maintained when b π was misspecified, and compared favorably with PrediXcan when the Exp-Disease association was in the minor cell type or had opposite directions in the two cell types. Second, we evaluated the performance of MiXcan when a latent third cell type was present that had different SNP-Exp associations from the other cell types (Fig. 5b). We simulated a tissue with three cell types comprising 40%, 50% and 10% of the tissue, respectively, and assumed that MiXcan decomposed the tissue into cell type 1 versus a mixture of cell types 2 and 3. The type I error was consistently well controlled in the presence of a latent third cell type, and the power of MiXcan remained higher than PrediXcan when the Exp-Disease association was in cell type 1 (corresponding to the minor cell type in correctly specified models) or in opposite directions in cell types 1 vs. 2 and 3. The latent third cell type reduced the power of both PrediXcan and MiXcan when the Exp-Disease association was present in the most common cell type or homogeneous across all cell types. Third, we evaluated the impact of Exp-Disease associations in a latent third cell type on study power (Fig. 5c). Similar to the performance under correctly specified models, PrediXcan was more powerful mostly when Exp-Disease associations exist in cell type 2 (the most common cell type), and MiXcan was more powerful mostly when Exp-Disease associations exist in cell type 1 (the minor cell type of interest) or in opposite directions in cell types 1 and 2.
Cell-type-aware TWAS of breast cancer As a proof of concept, we applied MiXcan to conduct the first cell-typeaware TWAS of breast cancer in a publicly available dataset of 58,648 EA women (31,716 cases and 26,932 controls) from the DRIVE GWAS (Discovery, Biology, and Risk of Inherited Variants in Breast Cancer) who were genotyped using the OncoArray 30 . Transcriptome-wide significance was determined using the Bonferroni-corrected threshold of 7.7 × 10 −6 to account for the 6461 genes tested, and suggestive associations were determined using the Benjamini-Hochberg false discovery rate (FDR) of 0.10. MiXcan identified 12 significant genes (p value < 7.7 × 10 −6 ) ( Table 1, Fig. 6) and 82 suggestive genes (FDR < 0.10) (Supplementary Data 1) whose predicted GReX in Fig. 4 | Simulation studies to evaluate the type I error and power of MiXcan and PrediXcan to detect associations of GReX with disease at the tissue level. Bulk tissue samples (N = 300) for training GReX prediction models and independent studies (N = 3000) for testing disease associations were simulated under a range of realistic data scenarios. Gene expression levels were modeled by u = b 0 + b 1 x + e u in the minor cell type, v = b 2 x + e v in the major cell type, and y = πu + (1 − π)v at the tissue level, where π denotes the minor cell-type proportion, b 0 denotes the mean difference of the gene expression levels in the two cell types, and b 1 and b 2 denote the weights for the association of SNPs X with gene expression levels in the minor and major cell types, respectively. The disease D was modeled by logit P(D = 1) = η 0 + η 1 u + η 2 v where η 1 and η 2 denote the associations of the gene expression levels with disease in the two cell types, respectively. (a) Homogeneous SNP-Exp associations (b 1 = b 2 ) in the two cell types, varying the mean difference in gene expression levels between the two cell types (b 0 ). Heterogeneous SNP-Exp associations (b 1 ≠ b 2 ) in the two cell types, varying the: (b) mean difference in gene expression levels between the two cell types (b 0 ); (c) magnitude of the SNP-Exp association in the minor cell type (b 1 ); and (d) magnitude of the SNP-Exp association in the major cell type (b 2 ). Source data are provided as a Source Data file. mammary tissue were associated with breast cancer risk. In comparison, PrediXcan (trained using the same 125 EA female mammary tissue samples as MiXcan) identified only 8 significant genes (p value < 7.7 × 10 −6 ), and 31 suggestive genes (FDR < 0.10). The inflation factor obtained using the BACON v1.26 Bayesian method to estimate the empirical null distribution was 1.01 for MiXcan and 1.03 for PrediXcan, indicating well-controlled type I error 31 .
Four significant genes (CH17-437K3.1, SLC4A7, L3MBTL3, and RCCD1) were identified by both MiXcan and PrediXcan (Table 1). MiXcan estimated nonspecific prediction models for these genes, yielding the same results as PrediXcan. All four genes were near (<500 kb) breast cancer SNPs previously identified by GWAS 11 , and two genes (L3MBTL3 and RCCD1) also were reported by prior breast cancer TWAS (Supplementary Data 1) 14,15,32 . Follow-up analyses in a larger sample of 228,951 EA women (122,977 cases and 105,974 controls) in the Breast Cancer Association Consortium (BCAC) and DRIVE studies using GWAS summary statistics 11 and S-PrediXcan 33 models for mammary tissue confirmed that the tissue-level GReX for all four genes were significantly (p value < 7.7 × 10 −6 ) associated with breast cancer risk.
Eight genes were identified by MiXcan but not PrediXcan (Table 1). MiXcan estimated cell-type-specific GReX prediction models for all eight of these genes (Supplementary Data 2). Six of these genes (MRPS30, SETD9, ADGRV1, ZNF703, PRR33, and PSG4) showed different directions of association with breast cancer in epithelial vs. stromal (nonepithelial) cells. In MiXcan the signals from the two cell types were aggregated, whereas in PrediXcan they canceled each other out reducing the tissue-level signal. Notably, for ADGRV1 and PRR33, no SNPs in the training dataset were predictive of tissue-level GReX because of the mixture of the different cell-type effects, and consequently the PrediXcan association analysis could not be performed. Two genes (TMEM245 and CDYL2) showed the same directions of association, with stronger effects in epithelial vs. stromal cells.
These results indicate that MiXcan may be more powerful than Pre-diXcan in the presence of cell-type heterogeneity of GReX and when the disease association is present in a minor cell type, e.g. mammary epithelial cells, rather than the predominant cell type in the tissue.
Importantly, MiXcan uniquely identified three novel breast cancer susceptibility genes (ZNF703, TMEM245, and PSG4) that were not previously implicated by breast cancer GWAS 11-13 nor TWAS [14][15][16]32 (Table 1). For ZNF703, GReX was associated with increased breast cancer risk in stromal cells (p value=2.4 × 10 −9 ) and decreased risk in epithelial cells (p value=2.7 × 10 −7 ). Because adipocytes are the predominant stromal cell type in mammary tissue, we also performed follow-up analyses using the S-PrediXcan subcutaneous fat model and discovered a highly significant (p value=2.1 × 10 −20 ) association of ZNF703 with increased breast cancer risk in the four-fold larger BCAC/DRIVE dataset, consistent with the MiXcan stromal cell results in the DRIVE data only. For TMEM245 and PSG4 the signal was stronger in epithelial cells, which are a minor cell type in mammary tissue and may explain why their tissuelevel GReX was not significantly associated with breast cancer risk. MiXcan also identified two breast cancer genes (ADGRV1 and CDYL2) at previously reported GWAS loci 11 that had different associations in epithelial and stromal cells and were not detected in prior TWAS.
Associations of the mammary tissue-level GReX with breast cancer risk were found for all four genes in S-PrediXcan analyses of the BCAC/ DRIVE data, as expected. MiXcan estimated cell-type-specific GReX prediction models for these four genes (Supplementary Data 2), and different associations with breast cancer risk in epithelial and stromal cells that did not reach statistical significance in part because of the larger number of model parameters compared with PrediXcan. However, the cell-type-specific MiXcan results for STXBP4 suggest that stromal cells (estimated effect=0.18; p value = 2.1 × 10 −4 ) may play a more important role than epithelial cells (estimated effect = -0.06; p value=0.41) in driving the positive association of tissue-level GReX with breast cancer risk. Finally, we compared the TWAS results using MiXcan and Pre-diXcan with publicly available PredictDB 28 elastic-net models trained using GTEx mammary tissue data for 212 EA men in addition to 125 EA women ( Supplementary Fig. 7). There was substantial overlap of the genes detected by PrediXcan and PredictDB as expected, although PredictDB detected a larger number of genes perhaps because the larger training dataset enabled more accurate prediction models for genes that have similar expression patterns in male and female mammary tissue. However, the etiology and pathobiology of male breast cancer is distinct from female breast cancer 34 . Thus, prior TWAS of breast cancer in EA women 15,32 also trained PrediXcan models using mammary tissue samples (n=67) from EA women only, as we did here using a larger number of EA women from GTEx. Cell-type-specific TWAS using MiXcan identified five genes (ADGRV1, ZNF703, TMEM245, CDYL2, and PSG4), including three novel breast cancer susceptibility genes, that were not identified by either PrediXcan or PredictDB.

Discussion
MiXcan is a new statistical framework for conducting cell-type-aware TWAS using GWAS data. In contrast to standard TWAS methods, MiXcan builds cell-type-level prediction models for the genetically regulated component of gene expression and performs association tests taking into consideration the signals from multiple cell types. We have shown that MiXcan improves the prediction accuracy of GReX at both the tissue and cellular levels in independent validation datasets, and improves the power to detect disease associations that are driven by a minor cell type or are heterogeneous between cell types compared with standard approaches. We applied MiXcan to perform the first cell-type-aware TWAS of breast cancer risk and identified three new susceptibility genes (ZNF703, TMEM245, and PSG4) with evidence of distinct associations in mammary epithelial versus stromal cells that were not detected by prior GWAS [11][12][13] nor TWAS [14][15][16]32 . These findings provide a proof of concept that cell-type-aware TWAS can reveal novel insights into the genetic and cellular etiology of human diseases.
Several recent studies have explored context-specific TWAS [35][36][37] . Li et al. proposed a tissue-specificity-aware TWAS framework that uses prior knowledge of trait-related tissue types for accurate detection of single-tissue and cross-tissue TWAS 35 . Feng et al. proposed to derive cross-tissue expression features using sparse canonical correlation analysis, and then combine expression-outcome associations across single-and cross-tissue features for powerful detection 36 . Thompson et al. proposed CONTENT to go one step further and model both shared and tissue-specific components of gene expression in bulk multi-tissue data for model construction 37 . This approach can also be used for modeling shared and cell-type-specific components in singlecell RNAseq data 37 . These recently developed TWAS methods model associations at the same resolution of the data, such as modeling tissue-level associations for bulk profiling data and cell-type-level associations for single-cell data, and thus do not provide higherresolution (single cell) understanding of disease using lowerresolution (bulk tissue) data as does MiXcan.
Recent studies have also explored methods for performing celltype-level association analyses when the tissue-level data are available for all study participants 19,[38][39][40][41][42] . Luo et al. evaluated cell-type-specific associations between DNA methylation and traits 40 , but this method did not involve prediction models and methylation data are required for all subjects. Liu et al. built tissue-level GReX prediction models, inferred cell types from the predicted GReX, and looked for associations of the inferred cell-type proportions with disease rather than constructing a TWAS framework for identifying genes 42 . In this study, MiXcan enables cell-type-aware TWAS in large populations using existing GWAS datasets that do not have transcriptomic and cell composition data from the disease-relevant tissue. MiXcan evaluates the composite null hypothesis that there is no association between the GReX in any cell type with the disease, which tolerates decomposition uncertainty to provide robust cell-type-aware analysis using bulk tissue samples. By carefully modeling cell-type-level expression, MiXcan is more powerful than PrediXcan when disease associations are driven by a minor cell type or have opposite directions in different cell types. However, when the association of GReX with disease is similar in all cell types or driven by the major cell type, then conventional TWAS approaches using more parsimonious tissue-level GReX prediction models that assume cell-type homogeneity can be more powerful. Thus, these two TWAS approaches are complementary, and additional cell-type-aware analyses are especially valuable for diseases where celltype heterogeneity and a minor cell of origin are hypothesized, as for breast carcinoma and many other human diseases.
To construct cell-type-level GReX prediction models, MiXcan uses a scaled xCell 18 cell-type enrichment score in the training data as prior information. While estimates from other approaches 20,21,23,[43][44][45] can also be used as priors, xCell is among the most widely used. Building upon the priors, MiXcan fits mixture models for the expression levels of the epithelial cell signature genes in the training data to improve estimation of the cell-type proportion. By incorporating better estimates of the cell-type proportion, and penalizing all cell types equally, MiXcan improves the accuracy of the GReX prediction models, as well as the power and type I error of the downstream association tests. Compared with standard interaction models that include interaction effects between cell-type enrichment scores and genetic variants 19 , an important advantage of MiXcan is that the predicted values correspond to the cell-type-specific GReX, which are directly interpretable and biologically meaningful. Moreover, standard interaction models require cell composition information, which is often unavailable for the tissue of interest, whereas MiXcan prediction models can be applied directly to GWAS data without transcriptomic or cell composition data to interrogate genetic associations within the diseaserelevant tissue and cell type context. MiXcan has several limitations. First, although the MiXcan framework is generalizable, the model building procedure requires hypotheses regarding the disease-related cell types and tissue. The prediction models for this cell-type-aware TWAS of breast cancer were developed specifically for human mammary tissue with a focus on distinguishing epithelial and stromal (nonepithelial) cells, which have distinct roles in breast carcinogenesis [8][9][10] . Second, the current MiXcan framework decomposes a tissue into two cell-type components only. The MiXcan framework can be extended naturally to model more than two cell types. However, given limited sample sizes of the available training datasets, the model performance becomes quite variable with additional cell types due to the rapidly increasing dimension of the parameter space and complexity of the numerical optimization. As larger training sets with bulk tissue transcriptomic and genomic data become available, a careful evaluation of its analytical performance can be performed for additional less common cell types. Third, paired single-cell RNAseq and genomic datasets are presently very limited, and the validation of MiXcan epithelial cell predictions was performed for a small set of genes using epithelial cell snRNAseq data available for only three women in GTEx. Human single-cell transcriptome profiling efforts 46,47 are currently underway, and will enable further evaluation of the performance of MiXcan in larger datasets. Future studies can also investigate integrating single-cell transcriptome profiles into MiXcan, for example to improve estimation of cell composition in bulk tissue samples 23,45 or to provide an initial estimate of SNP weights, which could be used to tune separate penalty terms for different SNPs in adaptive elastic-net models 48 .
Human mammary tissue has variable cell composition and numerous eQTLs with distinct effects in epithelial cells and adipocytes, which are a major stromal cell type in the breast 19 . Cell-typeaware TWAS using MiXcan mammary tissue models applied to publicly available GWAS data identified three new breast cancer susceptibility genes that were associated with disease risk through their GReX in normal mammary epithelial or stromal cells. ZNF703 (zinc finger protein 703) is an oncogene that is commonly amplified in luminal B breast tumors, and has been shown to regulate genes involved in proliferation, invasion, and an altered balance of progenitor stem cells [49][50][51] . To our knowledge, common germline variants in ZNF703 have not previously been implicated in breast cancer risk. Our finding that genetic upregulation of ZNF703 in normal mammary stromal cells (predominantly adipocytes) was associated with increased breast cancer risk in 58,648 women was confirmed by a highly significant (p-value=2.1 × 10 −20 ) association of tissue-level GReX predicted using S-PrediXcan subcutaneous fat models in 228,951 EA women, which has not previously been reported to our knowledge. Notably, the mammary tissue-level results for ZNF703 did not reach transcriptome-wide significance, underscoring the importance of accounting for cell-type heterogeneity to elucidate disease etiology. TMEM245 (transmembrane protein 245) is the host gene for microRNA 32, which has been shown to promote proliferation and suppress apoptosis of breast cancer cells 52 . Relatively little is known about PSG4 (pregnancy specific beta-1-glycoprotein 4), a member of the carcinoembryonic antigen gene family that may play a role in regulation of the innate immune system 53 . Future studies are needed to provide experimental validation of the breast cancer genes identified in this study and better understand the cellular mechanisms underlying the associations.
In conclusion, the MiXcan framework enables cell-type-aware TWAS using prediction models that allow for differences across cell types in the disease-relevant tissue. MiXcan mammary tissue models are available at https://github.com/songxiaoyu/MiXcan 54 and can be applied to GWAS genotype data to identify genes associated with complex traits through their GReX in epithelial or stromal cells. MiXcan software is also freely available to facilitate training prediction models for other tissues and cell types, and conducting cell-type-aware TWAS. MiXcan prediction models had excellent performance in independent validation datasets, and identified new breast cancer susceptibility genes in the first cell-type-aware TWAS of breast cancer. These findings provide a proof of concept that cell-type-aware TWAS are feasible using existing bulk tissue training datasets and GWAS data, and can lead to the discovery of new disease genes and cellular mechanisms. Future research is needed to develop MiXcan models for other human tissues and cell types, extend the MiXcan framework to GWAS summary statistics, and explore alternative modeling and inference strategies 2,55,56 . These research directions will enable the broad application of cell-type-aware TWAS to improve our understanding of the genetic and cellular mechanisms underlying human diseases.

MiXcan framework
We first summarize PrediXcan 1 , an established tissue-level TWAS framework, and then present the MiXcan cell-type-aware TWAS framework. Let y i denote the measured expression level of a gene in the bulk tissue sample i ∈ (1, . . . , N), x i denote the genetic variants (e.g. SNPs) used in model training, and z i denote the non-genetic covariates (e.g. age).
PrediXcan tissue-level TWAS framework. PrediXcan uses a linear additive model to characterize the gene expression level: where e i~N (0, σ 2 ) is the model error, x T i b is the genetically determined component of gene expression, and z T i c is the non-genetically determined component. This model can be estimated with the elastic-net method 57 , which maximizes the following penalized loglikelihood function: where PðÁÞ = α|| Á || 2 l 2 + ð1 À αÞ|| Á || l 1 is the elastic-net penalty function with the mixing parameter α ∈ (0, 1). In PrediXcan, α is set at 0.5 and λ is selected via 10-fold cross validation.(CV) 58 The estimated SNP weights b b can be used to predict the GReX by b y j =x T j b b wherex j denotes the SNP genotypes of the GWAS subject j ∈ (1, . . . , M). Then, the association of b y j with the phenotype (e.g. disease status) d j can be evaluated using a generalized linear model gðd j Þ = η 0 + b y j η 1 , where g(. ) is a link function. The null and alternative hypotheses, H 0 : η 1 = 0 vs. H A : η 1 ≠ 0, test whether the GReX at the tissue level is associated with the phenotype.
MiXcan cell-type-level GReX prediction model. MiXcan extends upon PrediXcan to enable cell-type-aware TWAS. In this section, we build the prediction models for the cell-type-level GReX. In the next Article https://doi.org/10.1038/s41467-023-35888-4 section, we develop strategies for applying these prediction models in disease-association studies.
Human bulk tissue samples from solid tissue (e.g. mammary tissue) comprise a mixture of cells of different types. Let π i and 1 − π i be the proportions of the cell type of interest (e.g. epithelial cells) and all other cell types (e.g. stromal cells) in the i th tissue sample, respectively. We assume the observed bulk tissue expression level of a given gene y i is a linear combination of expression levels in the cell types. Introducing two latent variables u i and v i to denote the unobserved average gene expression levels in the epithelial and stromal cells, respectively, we have: We model both u i and v i using the linear additive models: where e ui ∼ Nð0, σ 2 u Þ and e vi , ∼ Nð0,σ 2 v Þ are the model errors. In Eq. (4), the intercepts a u and a v and genetic parameters b u and b v differ between cell types, allowing for different mean expression levels and cell-type-specific effects of genetic variants on gene expression. A shared parameter c is used for the non-genetic component z i to simplify the model as the non-genetic variables are not necessarily used in downstream analyses.
The proportion of epithelial cells π = {π 1 , . . . , π N } is a feature of the tissue samples, which can be jointly estimated using multiple genes, Þ are features of each gene for investigation. Therefore, we present a step-wise procedure to first estimate π using multiple epithelial signature genes, and then estimate the cell-type-level effects Θ for each gene.
Estimation of π. The notation in the sections above focuses on individual genes, and here we introduce additional notation to describe the joint modeling of multiple genes. Let W N × G = fw g i g be the observed expression of G epithelial signature genes in N tissue samples; and S N × G = fs g i g and T N × G = ft g i g be the unobserved gene expression levels in epithelial and stromal cells, respectively. Similarly as in Eq. (3), we have: w g i = π i s g i + ð1 À π i Þt g i , g = 1, :::, G, i = 1, :::, N: Leveraging primarily the mean differences of signature genes in epithelial and stromal cells for estimating π, we model the marginal distributions of individual genes and omit the complex gene-gene correlations for computationally efficient estimation, as supported by our previous work 59 . Specifically, we assume: Across all G genes, the parameters include Γ = ffΓ g gg In parallel, we also take advantage of a prior cell-type proportion estimate h i based on existing tools (i.e. rescaled xCell 18 enrichment scores). We link the prior estimates h i to the true π i using a Beta distribution such that h i~B eta(π i δ, (1 − π i )δ) for some positive parameter δ. We have E(h i ) = π i and var(h i ) = π i (1 − π i )/(δ + 1) such that h i is an unbiased estimator of the true π i with variation.
We then join these two models for parameter estimation and solve the following maximization problem: ðπ,Γ ,δÞ = arg max Γ,π,δ where P G g = 1 lðΓ , π i |w g i Þ and l(π i , δ|h i ) are the log-likelihood of the observed gene expression profile and cell proportion estimate of the i th sample, respectively. This optimization problem is solved using an Expectation-Maximization (EM) algorithm similar to that in Petralia et al. 59 To enhance the robustness of the estimation, we implement a bagging strategy to estimate the parameters with randomly selected bootstrap samples, and aggregate multiple estimates by calculating a tail truncated mean. This bagging strategy further stabilizes the estimates, and may also be used to investigate the consistency of π estimation.
Estimation ofb u andb v . Given b π, we next estimate b u and b v in Eq. (4). Since u i and v i are unobserved, we integrate Eqs. (3) and (4) to have: This equation can be rearranged as: where ϵ i = b π i e ui + ð1 À b π i Þe vi . A simple strategy for estimating b u and b v is to apply elastic-net regression to Eq. (8). Specifically, the elastic-net regularization is put on (b v , b u − b v , c)-the dependence of expression levels on genetic variants and covariates-but not on (a v , a u − a v ) -the mean expression levels in the two cell types. We refer to this strategy as MiXcan 0 .
One issue with MiXcan 0 is that the two cell components in the mixture model are not treated in a symmetric manner. In other words, the penalization on b u and b v differs: b v is shrunk towards zero, while b u is shrunk towards b v . This asymmetric penalization results in different models if the order of the two components is switched. To address this issue, we introduce b c i = b π i À 0:5 and rewrite Eq. (8) as: When fitting elastic-net regression to Eq. (9), we include penalties on , ðb u À b v Þ,cÞ, which impose the same degree of regularization on b u and b v : the penalty on b u + b v 2 regularizes the overall sparsity of the genetic effects, while the penalty on b u − b v encourages similarities between the two components. We refer to this strategy as MiXcan.
Note, when fitting elastic-net regressions in MiXcan 0 and MiXcan, we do not consider the varying variances of ϵ i asπ i takes different values. This is because the residual variance structure has limited impact on the coefficient estimates, especially for regularized regression. In a trade-off between extensive computational costs (allowing varying residual variances) and minimal sacrifice of estimation accuracy (assuming constant variance), we chose the latter and take advantage of the fast implementation of elastic-net regression in the glmnet package.
Model Aggregation. In the prediction models, the term b b u À b b v is of particular importance: a non-zero value suggests that the dependence structure between genetic variants and expression levels is celltype-specific. Therefore, it is critical to know the selection robustness We employ a procedure similar to stability selection 60 for its evaluation. Specifically, for models that select non-zero b b u À b b v , we generate B bootstrap samples (e.g. B = 200), perform ordinary least square analysis on the pre-selected variables, and record d diff ðbÞ v for b = 1, . . . , B. Only when the 95% confidence interval (CI) for d diff b excludes 0 do we employ cell-type-specific prediction models (inferred using the complete data set). Otherwise, nonspecific models that have the same prediction weights for the two cell types will be used, as in Eq. (2) of PrediXcan. Association analysis with cell-type-level prediction models. The model building procedure in MiXcan selects cell-type-specific prediction models for some genes and nonspecific models for other genes. Cell-type-specific prediction models estimate different SNP weights in the two cell types ( b b u ≠ b b v ) resulting in different predicted GReX from the same genotype datax j (j ∈ (1, ..., M)), such thatỹ uj =x These cell-type-specific GReX levels cannot be combined into tissue levels in GWAS datasets that lack cell-type proportion estimates, requiring a novel statistical framework for association analysis. One natural idea is to infer cell-type-specific associations by directly associating the phenotype d j withỹ uj andỹ vj , either separately, such that gðd j Þ = η 0 + η uỹuj and gðd j Þ = η0 0 + η vỹvj , or jointly, such that gðd j Þ = η 0 + η uỹuj + η vỹvj . Asỹ uj andỹ vj are predicted from the same genotype data with prediction weights jointly learned using the bulk tissue data in MiXcan, they may capture leaked information from each other. As a result, if an association exists in one cell type, analysis in the other cell type may also capture this association, resulting in an inflated type I error for inferring associations in each cell type. To avoid this inflation, we propose a composite hypothesis test to test whether association exists in any cell types in the tissue: H 0 : There is no association in any cell type in the tissue vs. H A : There is an association in at least one cell type in the tissue.
This composite null is robust against information leakage, as the leaked values under the null are not associated with the phenotype. To perform the test, we first associate d j withỹ uj andỹ vj , separately if the ðỹ uj ,ỹ vj Þ are highly correlated (e.g. r = ±1), or jointly otherwise. Then, we propose to aggregate the resulting p values p u and p v forỹ uj andỹ vj using Cauchy combination 27 . The Cauchy combination provides valid test for correlated p-values, and in this setting the test statistic can be written as: where pi is the mathematical constant approximately equal to 3.14159. The combined p-value for the tissue is approximated by: The p tissue tests whether association exists in any cell type in the tissue. Unlike PrediXcan that tests associations averaged across all cell types, the p tissue in MiXcan accumulates signals from different cell types. Note that p u and p v are building blocks of T Cauchy and the resulting p tissue is between p u and p v . After the tissue-level hypothesis test, p u and p v can provide information on the cell type(s) driving the association. For example, p u << p v indicates that a significant p tissue is primarily driven by epithelial cells.
Some genes have nonspecific models with the same estimated SNP weights and predicted GReX in the two cell types ( . While association analyses can follow the same strategy described above for cell-type-specific models, it is equivalent to performing a single tissuelevel association analysis as in PrediXcan. Finally, in transcriptomewide studies, p tissue from genes with cell-type-specific prediction models and p-values from genes with nonspecific models can be jointly used to adjust for multiple testing, and infer transcriptome-wide significant discoveries. Build MiXcan prediction models using GTEx mammary tissue data MiXcan gene expression prediction models were developed using the GTEx v8 genotype and gene expression data for mammary tissue samples from 125 European ancestry women (dbGaP accession number phs000424.v8.p2 <https://www.ncbi.nlm.nih.gov/projects/gap/ cgi-bin/study.cgi?study_id=phs000424.v8.p2>). PredictDB (http:// predictdb.org) provides tissue-level expression prediction models trained on 337 men and women of European ancestry with mammary tissue data available in GTEx v8. PredictDB included mammary tissue elastic-net models for a total of 6461 genes that were well predicted by genetic variants (176,983 SNPs). An additional 1,715 SNPs were included in PredictDB mammary tissue MASHR models 61 . For the purpose of comparison to PrediXcan as a proof of concept, we developed celltype-level prediction models for these 6461 genes using all 178,698 SNPs in the PredictDB mammary tissue database. MiXcan cell-typelevel prediction models were developed for mammary epithelial cells, the cell of origin for breast carcinoma, and stromal (nonepithelial) cells.
π estimation. The epithelial cell proportion π i was estimated using 126 epithelial cell signature genes 18 available in the training dataset of 125 GTEx female mammary tissue samples. We first computed xCell gene set enrichment scores for epithelial cells and 63 other cell types using the curated set of cell signature genes for each cell type 18 and the bulk tissue transcriptomic data for each sample. We then re-scaled the xCell epithelial cell enrichment score to range from 0 to 1 for use as a prior estimate of the cell-type proportion in MiXcan. The π i estimation was performed using 100 bootstrap samples (80% random draw with replacement). The final estimate was computed by excluding the most extreme 5% of bootstrap estimates in each of the two tails and averaging the remaining estimates.
Prediction model. Using b π, we modeled the cell-type-level expression levels for each of the 6461 genes using MiXcan with tuning parameter λ selected by tenfold CV. We adjusted for covariates that were used in GTEx eQTL analyses including age, platform, PCR, genomic principal components (PC) 1-5, and PEER factors 1-15 62 If the 95% CI excluded 0, we used celltype-specific models with parameters estimated using the full data; otherwise, we used nonspecific models that were the same as Pre-diXcan. The average computation time for training models for 1000 genes using 125 samples on a single CPU core was 11 min (standard deviation, 2.8 minutes).

Evaluate MiXcan prediction accuracy in independent TCGA data
We evaluated the prediction performance of MiXcan in an independent dataset of 103 European ancestry female breast cancer patients with adjacent normal tissue samples from TCGA 63,64 . To minimize the study effect, we re-processed the TCGA gene expression data using methods analogous to those used to process the GTEx expression data (https://gtexportal.org/home/ documentationPage). Briefly, we required genes to have Transcripts Per Kilobase Million (TPM) >0.1 in at least 20% of samples, and at least six reads in at least 20% of samples, resulting in a set of 25,702 out of 25,849 total genes that met these quality control (QC) requirements. Expression data were then normalized using the trimmed mean of M values method (TMM) 65 as implemented in the R package edgeR v3.16.5 66 , and the results were quantilenormalized to a standard normal distribution with mean=0 and variance = 1. Comparison of these normalized gene expression levels showed no systematic differences between the GTEx and TCGA data (Supplementary Fig. 7). To process the genotype data, we removed all indels, monomorphisms, and ambiguous pairs (e.g. A/T, C/G). SNPs with >5% missing genotypes or Hardy-Weinberg equilibrium (HWE) test p value < 1e-05 were also removed. The remaining SNPs were aligned to build 37 coordinates, and imputation was performed on the TOPMed imputation server 67 . A total of 97% (54,663 out of 56,531) and 97% (52,031 out 53,876) SNPs used in MiXcan and PrediXcan prediction models, respectively, were available for analysis.
We estimated the epithelial cell proportion in the TCGA samples as described above, and used this estimate to combine the predicted cell-type-level GReX from epithelial and stromal components into the tissue level. To evaluate predication accuracy, we computed the Pearson correlation between the predicted and observed tissue-level gene expression, and compared the results with the predicted tissue-level expression using Pre-diXcan. The observed bulk tissue expression levels showed significantly higher correlation with the tissue-level GReX predicted by MiXcan compared with PrediXcan. To investigate the sources of the improved performance, we compared five approaches for predicting tissue-level GReX: • Existing PredictDB elastic-net models (PredictDB) • PrediXcan trained on the same dataset as MiXcan (PrediXcan) • Prediction model including interactions between SNPs and the xCell enrichment score (xCell interaction) • Prediction model including interactions between SNPs and the MiXcan cell-type proportion (MiXcan 0 ) • Cell-type-level prediction models with symmetric penalization (MiXcan).
These comparisons evaluated incorporating cell-type composition, use of the MiXcan cell-type proportion estimate, and symmetric penalization of the two cell types in prediction models, as well as use of a larger training set including both men and women in PredictDB. It is worth noting that "MiXcan 0 " and "xCell interaction" models are not applicable to GWAS datasets that lack cell-type composition information for the tissue of interest, and are included here only for the purpose of understanding the sources of improved prediction performance for MiXcan.

Evaluate MiXcan epithelial cell prediction accuracy in snRNAseq data
We evaluated the performance of MiXcan epithelial cell prediction models using snRNAseq data for normal mammary epithelial cells and paired genomic data available for three women of European, Asian and African ancestry from GTEx v9 29 . Details of the snRNAseq data generation and processing were provided in 29 . The preprocessed log count expression profiles for a total of 5990, 2324 and 1456 nuclei, including 2292 (38%), 2180 (94%) and 1327 (91%) epithelial cell nuclei, from the European, Asian and African ancestry woman, respectively, were downloaded from https://gtexportal.org/home/datasets. Mammary epithelial cell snRNAseq data were available in all three women for 4751 genes with MiXcan prediction models. To enable comparisons between women, the snRNAseq levels were averaged for each gene and quantile normalized to a standard normal distribution within each woman to reduce the impact of noise, skewness and outliers.
To evaluate prediction accuracy, MiXcan epithelial cell prediction models were applied to the genotype data for each woman, and the between-woman difference in the predicted GReX for each gene was computed to identify the two sets of 100 genes predicted to have the largest positive or negative differences in mammary epithelial cell expression in each pair of women. The Wilcoxon signed-rank test was used to test whether the observed snRNAseq differences for genes predicted to have the largest GReX differences between women were significantly different from zero, as expected.

Simulation studies
To evaluate the type I error and power of MiXcan association tests, we performed extensive simulation studies under a broad range of realistic models for the associations of genetic variants with gene expression (SNP-Exp) and gene expression with disease (Exp-Disease). Mimicking real data, in each simulation, we generated a training dataset for building the GReX prediction models, and a GWAS dataset for testing the associations of GReX with disease. Without loss of generality, non-genetic covariates were excluded from simulations to allow direct evaluation of the predicted GReX.
For the training dataset, we simulated 300 bulk tissue samples with observed SNP genotypes and tissue-level gene expression. We assumed each tissue i ∈ (1, . . . , 300) was a mixture of two cell types and that the minor cell type (cell type 1) comprised an average of 40% of the tissue, with proportion π i~B eta(α = 2, β = 3). We further simulated the genotypes of 50 neighboring SNPs x i = {x 1i , . . . , x 50i } using the genome simulator R package sim1000G, with its default reference genome region (chromosome 4) and minor allele frequency (MAF) range 0.05-0.50. For the expression levels in the two cell types u i and v i , we considered a linear additive model such that u i = b 0 + b 1 x i + e ui and v i = b 2 x i + e vi where e ui , e vi~N (0, 1). The parameter b 0 determined the mean expression difference in the two cell types under x i = 0, and b 1 and b 2 determined the association patterns between the SNPs X and gene expression level Y in the minor and major cell types, respectively. Then, the tissue-level gene expression was a weighted average of the expression levels in the two cell types: We considered two SNP-Exp settings: Homogeneous SNP-Exp Association (b 1 = b 2 ) and Heterogeneous SNP-Exp Association (b 1 ≠ b 2 ). Under the Homogeneous SNP-Exp setting, we randomly selected one genetic variant p to be associated with expression levels in the two cell types and let b 1p = b 2p = 1 or -1 with equal chance, corresponding to a median heritability of 0.27 and interquartile range (IQR) of 0.10. We varied b 0 from -2 to 2 to evaluate the impact of the intercept (mean expression difference in two cell types under x i = 0) on TWAS. Under the Heterogeneous SNP-Exp setting, we randomly selected two SNPs p 1 ≠ p 2 ∈ (1, . . . , 50) with SNPs p 1 and p 2 associated with expression levels in the minor and major cell types, respectively, corresponding to the same heritability in both cell types (median 0.27; IQR 0.10). Similar to the Homogeneous SNP-Exp setting, we first evaluated the impact of the intercept by varying b 0 from -2 to 2 while fixing b 1p 1 at 1 or -1 and b 2p 2 at 1 or -1 with equal chance. Second, we varied the magnitude of b 1p from 0 to 2 (median heritability 0 to 0.59) with equal chance of a positive or negative sign, while fixing b 0 at 1 and b 2p = ± 1 to understand the impact of the SNP-Exp association strength in the minor cell type. Third, we varied the magnitude of b 2p from 0 to 2 (allowing a random sign with equal chance), while fixing b 0 at 1 and b 1p = ± 1 to understand the impact of the SNP-Exp association strength in the major cell type. Finally, to evaluate the impact of the training data sample size, we assessed sample sizes ranging from 100 to 300, while fixing b 0 at 1 and the magnitude of non-zero components of b 1 , b 2 at 1.
For the GWAS dataset, we simulated a case-control study of 3000 participants with observed genotypes and disease status. We assumed the unobserved cell-type composition and cell-type-level gene expression in this dataset followed the same distributions as in the training dataset. Disease risk was simulated using a logistic model, such that logitP(d j = 1) = η 0 + η 1 u j + η 2 v j for j ∈ (1, . . . , 3000). The intercept η 0 was set to reflect a 1:1 ratio of cases and controls. We considered five different settings for η 1 , η 2 to capture the dynamic relationship between gene expression levels in the two cell types and disease risk: 1. No Exp-Disease Association: η 1 = η 2 = 0, i.e. disease is not associated with the gene expression in either cell type. We compared the prediction accuracy, type I error, and power of MiXcan with PrediXcan, which ignores cell-type heterogeneity in 500 Monte Carlo simulations.
MiXcan performance under misspecified models. MiXcan decomposes tissues into two cell-type components. To evaluate the robustness of MiXcan to misspecification of the cell-type proportion, where a noisy estimate ofπ is used instead of the true π, we simulated: (a)π = 0:8π to shift the mean from 0.4 to 0.32 and scale from (0-1) to (0-0.8); (b)π = 0:7π + 0:2 to shift the mean from 0.4 to 0.48 and scale from (0-1) to (0.2-1); (c)π i ∼ Betað50π i ,50ð1 À π i ÞÞ to reduce the correlation with the true value corðπ i ,π i Þ to 0.9; and (d) π i ∼ Betað5:5π i ,5:5ð1 À π i ÞÞ to further reduce corðπ i ,π i Þ to 0.6. We compared the performance of MiXcan using the misspecifiedπ with MiXcan using the true π and PrediXcan.
We also evaluated the robustness of MiXcan to the presence of a latent third cell type. We simulated a tissue with three cell types that have different SNP-Exp or Exp-Disease associations, and evaluated the performance of MiXcan by decomposing the tissue into cell type 1 vs. a mixture of cell types 2 and 3. Specifically, we simulated 300 bulk tissue samples comprised of three cell types with proportions π 1i = 40%, π 2i = 50% and π 3i = 10%. As in the simulations above, gene expression levels in three cell types were linearly dependent on 50 neighboring SNPs as determined by b 1 , b 2 and b 3 , and logit transformed disease risk was linearly dependent on the expression levels in the three cell types as determined by η 1 , η 2 and η 3 . To evaluate the impact of a latent third cell type contributing to the SNP-Exp association, we compared the performance of MiXcan and PrediXcan for b 2 = b 3 vs. b 2 ≠ b 3 assuming η 2 = η 3 under the Heterogeneous SNP-Exp Association (b 1 ≠ b 2 ) setting. In detail, we randomly selected three different SNPs p 1 , p 2 , p 3 ∈ (1, . . . , 50) to be associated with expression levels in the three cell types, respectively, and fixed the magnitude of these nonzero b (b 1p 1 , b 2p 2 , b 2p 3 ) at 1, with equal chance of a positive or negative sign. We evaluated type I error and power in simulated GWAS datasets (N = 3000) under five Exp-Disease patterns as described for η 1 , η 2 above. As we observed that type I error was well controlled under various SNP-Exp associations in a latent third cell type, we next evaluated the impact of a latent third cell type contributing to the Exp-Disease association on the study power of MiXcan. In this simulation, we fixed b 3 = b 2 but simulated η 3 ≠ η 2 with η 3 values ranging from -0.2 to 0.2.
Apply MiXcan to perform cell-type-aware TWAS of breast cancer Cell-type-aware TWAS of breast cancer. We performed cell-typeaware TWAS of breast cancer risk using GWAS data from the Discovery, Biology, and Risk of Inherited Variants in Breast Cancer (DRIVE) study. Genotype data for 60,014 women (32,438 cases and 27,576 controls) assayed using the Oncoarray 30 , which includes >500,000 variants and provides excellent coverage of most common variants, were downloaded from dbGaP (phs001265.v1.p1 <https://www.ncbi.nlm.nih.gov/ projects/gap/cgi-bin/study.cgi?study_id=phs001265.v1.p1>). After imputation (as described above for TCGA data), 95% (53,528 out of 56,531) and 95% (51,049 out of 53,876) SNPs used in MiXcan and Pre-diXcan prediction models, respectively, were available for analysis.
Principle component analysis (PCA) was performed using 20,629 SNPs, after excluding SNPs with a missing rate above 0.01% and selecting SNPs in approximate linkage equilibrium using PLINK v1.90 (indep-pairwise option with window size=50kb, step size=5, r 2 threshold=0.05) 68 . EIGENSOFT v6.1.4 was used to compute PCs with the fast mode option enabled, which implements the FastPCA approximation 69 . The first PC separated individuals of African (e.g. from Nigeria, Uganda and Cameroon) vs. European (e.g. from Australia) ancestry. In total, 58,648 women (31,716 cases and 26,932 controls) of European ancestry determined by PCs were included in TWAS analyses.
MiXcan, PrediXcan and PredictDB elastic-net mammary tissue models were applied to the individual-level genotype data to perform cell-type-aware or tissue-level TWAS, as described above. All three models were adjusted for the same covariates, including age, country of origin and the top 10 PCs 15 .
Evaluation of TWAS findings. We evaluated significant TWAS genes identified by MiXcan and PrediXcan in a substantially larger study of 228,951 European ancestry women (122,977 cases and 105,974 controls) from the combined DRIVE and Breast Cancer Association Consortium (BCAC) GWAS meta-analysis of breast cancer 11 . The summary statistics for the "Combined Oncoarray, iCOGS GWAS meta analysis" were downloaded from https://bcac.ccge.medschl.cam.ac.uk/ bcacdata/oncoarray/oncoarray-and-combined-summary-result/gwassummary-results-breast-cancer-risk-2017/. Associations of predicted tissue-level GReX with breast cancer risk were evaluated using S-PrediXcan 33 with PredictDB 28 elastic-net models derived from GTEx v8 mammary tissue data for 337 men and women of European ancestry. We also determined whether TWAS genes identified by MiXcan and PrediXcan were located within 500 kb of 214 previously reported genome-wide significant breast cancer susceptibility loci [11][12][13] .

MiXcan software
We developed a computationally efficient R package MiXcan to facilitate estimation of cell-type-level GReX prediction models in the two cell components of bulk tissue data, and to perform cell-type-aware TWAS. The MiXcan R package, and pre-trained models for the epithelial and stromal (non-epithelial) cell components of mammary tissue derived from 125 European ancestry women in GTEx v8 are freely available at https://github.com/songxiaoyu/MiXcan 54 .

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.