Rare genetic variants are abundant in humans and are expected to contribute to individual disease risk1,2,3,4. While genetic association studies have successfully identified common genetic variants associated with susceptibility, these studies are not practical for identifying rare variants1,5. Efforts to distinguish pathogenic variants from benign rare variants have leveraged the genetic code to identify deleterious protein-coding alleles1,6,7, but no analogous code exists for non-coding variants. Therefore, ascertaining which rare variants have phenotypic effects remains a major challenge. Rare non-coding variants have been associated with extreme gene expression in studies using single tissues8,9,10,11, but their effects across tissues are unknown. Here we identify gene expression outliers, or individuals showing extreme expression levels for a particular gene, across 44 human tissues by using combined analyses of whole genomes and multi-tissue RNA-sequencing data from the Genotype-Tissue Expression (GTEx) project v6p release12. We find that 58% of underexpression and 28% of overexpression outliers have nearby conserved rare variants compared to 8% of non-outliers. Additionally, we developed RIVER (RNA-informed variant effect on regulation), a Bayesian statistical model that incorporates expression data to predict a regulatory effect for rare variants with higher accuracy than models using genomic annotations alone. Overall, we demonstrate that rare variants contribute to large gene expression changes across tissues and provide an integrative method for interpretation of rare variants in individual genomes.
Our analysis focused on individuals with extremely high or extremely low expression of a particular gene compared with the population, using the GTEx v6p release data, which include RNA-sequencing data for 449 individuals and 44 tissues. We refer to these individuals as gene expression outliers. The GTEx data enable the identification of both single-tissue and multi-tissue expression outliers (Fig. 1a), with the latter defined by consistent extreme expression across many tissues (see Methods). To account for broad environmental and technical confounders, we removed hidden factors estimated by PEER (probabilistic estimation of expression residuals)13 from each tissue before outlier discovery (Extended Data Figs 1, 2 and Supplementary Tables 1, 2).
We identified a single-tissue expression outlier for ≥99% of expressed genes in each tissue and a multi-tissue outlier for 4,919 out of 18,380 genes that were tested (27%). Each individual was a single-tissue outlier for a median of 83 genes per tissue and a multi-tissue outlier for a median of 10 genes. Single-tissue outliers that were found in one tissue replicated in other tissues at rates of up to 33%, with higher rates among related tissues (Fig. 1b and Extended Data Fig. 3). The replication rate for multi-tissue outliers was much higher and increased with the number of tissues used for discovery (Fig. 1c).
We investigated the influence of rare genetic variation on extreme expression levels, focusing on the individuals of European ancestry with whole-genome sequencing data (1,144 multi-tissue outliers). Multi-tissue outliers were strongly enriched for nearby rare variants. The enrichment was most pronounced for structural variants, as previously described14, and greater for short insertions and deletions (indels) than for single-nucleotide variants (SNVs) (Fig. 2a and Extended Data Fig. 4). Because most rare variants occur as heterozygotes, expression outliers driven by rare variants in cis should exhibit allele-specific expression (ASE). Both single-tissue and multi-tissue outliers were significantly enriched for ASE compared to non-outliers (see Methods; two-sided Wilcoxon rank-sum tests, each nominal P < 2.2 × 10^−16; Fig. 2c). For underexpression outliers with exonic rare variants, the rare allele was generally underexpressed with respect to the common allele and conversely so for overexpression outliers, consistent with the rare variant causing the effect (two-sided Wilcoxon rank-sum tests, each nominal P < 4.0 × 10^−8; Extended Data Fig. 5a). The enrichment for rare variants and ASE was stronger for multi-tissue outliers than for single-tissue outliers (Fig. 2b, c and Extended Data Fig. 6a), especially at higher Z-score thresholds.
To characterize the properties of rare variants that correlated with large changes in gene expression, we assessed the enrichment of different classes of variants in outliers compared to non-outliers (Supplementary Table 3a). Outliers were enriched, in order of significance, for structural variants, variants near splice sites, variants introducing frameshifts, variants at start or stop codons, variants near the transcription start site and variants in conserved regions (Fig. 3a). Variants in coding regions contributed disproportionately to outlier expression; enrichments weakened for all variant types (SNVs, indels and structural variants) when excluding exonic regions (Extended Data Fig. 6b). Additionally, 90% of stop-gain and frameshift variants were predicted to trigger nonsense-mediated decay in outliers (see Methods), suggesting a biological mechanism for these cases.
We also tested the relationship between outlier gene expression and functional annotations. Multi-tissue outliers were strongly enriched for variants in promoter or CpG-rich regions and had variants with higher conservation15,16,17,18 and CADD (combined annotation-dependent depletion)19 scores than non-outliers. We observed weaker enrichment in enhancers and transcription-factor-binding sites (Fig. 3b and Extended Data Fig. 7). Combining all classes of variation, other than non-conserved, non-coding, rare variants (excluded as less likely candidates for causal effects), we observed that 58% of underexpression and 28% of overexpression outliers had rare variants near the relevant gene, compared to 8% for non-outliers (Fig. 3c). Overexpression outliers were more common overall, potentially because detection of underexpression outliers for very low expression genes is inherently limited (Extended Data Fig. 5b). Overexpression outliers were also less enriched for functionally annotated rare variants (Extended Data Fig. 5c). Some variant classes had strong directionality concordant with their expected impact: duplications caused overexpression, whereas deletions, start- and stop-codon variants and frameshifts coincided with underexpression (Fig. 3d). We also observed strong ASE for outliers carrying all classes of variants, except non-conserved variants (Fig. 3e).
We hypothesized that functional, large-effect rare variants have been under recent selective pressure. As expected, we found that rare promoter variants of outliers were significantly less frequent in the UK10K cohort of 3,781 individuals (ref. 3) than rare promoter variants of non-outliers for the same genes (two-sided Wilcoxon rank-sum test, P = 0.0060; Fig. 4a). Additionally, genes intolerant to loss-of-function and missense mutations were depleted of both multi-tissue outliers and multi-tissue expression quantitative trait loci (eQTLs; Fisher’s exact test, all P < 2 × 10^−15; Fig. 4b and Extended Data Fig. 8a). We observed a similar depletion in two curated disease gene lists (genes involved in heritable cardiovascular disease, and genes in the guidelines of the American College of Medical Genetics and Genomics for incidental findings; ref. 20) but not in broader gene lists (Fig. 4c and Extended Data Fig. 8b, c). Genes with a multi-tissue outlier were more likely to have a multi-tissue eQTL (two-sided Wilcoxon rank-sum test, P < 2.2 × 10^−16; Extended Data Fig. 8d, e), suggesting that rare and common regulatory variation influence similar genes. However, we found evidence that genes with outliers were more constrained than genes with multi-tissue eQTLs, because genes with outliers had less missense and loss-of-function variation (Tukey’s range test, missense Z-score P = 0.0070, probability of loss-of-function intolerance score P = 0.032; Fig. 4b and Extended Data Fig. 8a). This suggests that outlier expression analysis can yield unique insights into constraints on gene regulation.
Next, we sought to prioritize rare variants in each individual genome by their predicted impact on gene expression. We developed RIVER (RNA-informed variant effect on regulation), a Bayesian statistical model that jointly analyses genome and transcriptome data from the same individual to estimate the probability that a variant has regulatory impact (https://bioconductor.org/packages/release/bioc/html/RIVER.html, see Methods). RIVER uses a generative model that assumes that genomic annotations (Supplementary Table 3b) determine the prior probability that a variant is a functional regulatory variant, in terms of influence on gene expression, which in turn affects whether nearby genes are likely to display outlier levels of expression (Fig. 5a). RIVER does not require a labelled set of functional/non-functional variants; rather it derives its power from identifying expression patterns that coincide with predictive genomic annotations.
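The structure described above (annotations set a prior on functional status, and functional status in turn influences outlier expression) implies a standard Bayesian update. The following is a sketch of that update, not the exact RIVER parameterization: the symbols G (genomic annotations), FR (latent indicator that the variant is a functional regulatory variant) and E (observed outlier status) are labels introduced here for illustration, and the sketch assumes that E depends on G only through FR.

```latex
P(\mathrm{FR}=1 \mid G, E)
  = \frac{P(E \mid \mathrm{FR}=1)\, P(\mathrm{FR}=1 \mid G)}
         {\sum_{f \in \{0,1\}} P(E \mid \mathrm{FR}=f)\, P(\mathrm{FR}=f \mid G)}
```

Under this reading, a variant receives a high score only when its annotations make a regulatory effect plausible a priori and the expression data of the same individual are more likely under FR = 1 than under FR = 0.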
We trained RIVER on the GTEx v6p cohort, and evaluated the model on held-out pairs of individuals who shared the same rare variants. We then computed the RIVER score (the posterior probability of having a functional regulatory variant) for one individual, using both expression and genomic data, and assessed the accuracy with respect to the expression levels of the second individual that had been held out (see Methods). Incorporating expression data significantly improved prediction compared with a model that uses genomic annotations alone (area under the curve (AUC) of 0.64 and 0.54, respectively, P = 3.5 × 10^−4; Fig. 5b and Extended Data Fig. 9a, b), and RIVER learned, unsupervised, to prioritize variants supported by both genomic annotations and extreme expression levels across tissues (Fig. 5c and Extended Data Fig. 9c). ASE was also enriched among the top RIVER hits compared with the genomic annotation model (Extended Data Fig. 9d). Finally, even after accounting for the most informative genomic annotations or summary scores, personal expression data were highly informative of rare variant effects (average log odds ratio, 2.76; Extended Data Fig. 9e, f).
RIVER can be used to predict regulatory effects on gene expression of disease-associated variants and aid in prioritization of rare variants in disease studies. To investigate this potential, we evaluated 27 pathogenic variants from ClinVar (ref. 21) present in 21 GTEx donors (Fig. 5c and Extended Data Fig. 10a). Overall, pathogenic variants had RIVER scores that were higher than background variants (two-sided Wilcoxon rank-sum test, P = 3.3 × 10^−9; Extended Data Fig. 10b–d), and the six that were probably regulatory variants (those not annotated as missense or as an indel within a coding region) scored in the 99.9th percentile. Several cases, which we evaluated in detail, illustrated that rare disease-causing variants can have a regulatory impact evident from RNA-sequencing data, even from healthy individuals that have those variants (in whom the variants are often heterozygous; Extended Data Fig. 10e, f). Note that RIVER trained on healthy cohorts, such as GTEx, can then be directly applied to new cohorts that include disease samples.
To experimentally validate a subset of the variants that were identified through outlier analysis, we used CRISPR–Cas9-mediated genome editing22,23. In K562 cells, we tested six SNVs and matched controls in transcribed regions of genes with an outlier (see Methods and Extended Data Fig. 11a, b), and compared the allelic ratios between mRNA and genomic DNA (gDNA), which was used as an internal control. All variants that were tested were SNVs in underexpression outliers and were therefore expected to decrease expression. Two variants were excluded owing to low total read counts in both cDNA and gDNA. The four remaining SNVs in outliers all showed lower proportions of the alternate (installed) allele in the cDNA compared to the gDNA, confirming that these variants decreased expression (Extended Data Fig. 11c).
In summary, by combining data across multiple tissues, we curated a set of gene expression outliers that replicated at higher rates and showed stronger enrichment of rare variants than those from any single tissue. We found that rare structural variants, frameshift indels, coding variants and variants near the transcription start site were most likely to have large effects on expression. However, our ability to characterize the genetic basis of multi-tissue outliers remains incomplete. Outliers without an underlying rare variant in our analysis may be due to variants in more distal regions or in annotations we did not consider, or may be attributable to residual technical or environmental effects.
Although variant interpretation remains challenging, RIVER demonstrates the value of incorporating personal gene expression data to examine the consequences of rare variants that may be uncertain based on the sequence alone. Our results suggest that a general approach can be applied to studies that supplement genome sequencing with other molecular phenotypes, such as methylation24,25,26 and histone modification27,28. We anticipate that such integrative approaches will be essential for effective interpretation of genome-wide genetic variation on a personalized level.
All human subjects were deceased donors. Informed consent was obtained for all donors via next-of-kin consent to permit the collection and banking of de-identified tissue samples for scientific research. The research protocol was reviewed by Chesapeake Research Review Inc., Roswell Park Cancer Institute’s Office of Research Subject Protection, and the institutional review board of the University of Pennsylvania. We used the RNA-seq, allele-specific expression, and whole-genome sequencing (WGS) data from the v6p release of the GTEx project. The generation of these data is described in the supplementary information of ref. 12.
Correction for technical confounders
We restricted our expression analyses to the 449 individuals and 44 tissues for which sex and the top three genotype principal components, which capture major population stratification, were available. For each tissue, we log2-transformed all expression values (log2(RPKM + 2)), where RPKM is the number of reads per kilobase of transcript per million mapped reads. We then standardized the expression of each gene to prevent shrinkage of outlier expression values caused by quantile normalization. To remove unmeasured batch effects and other confounders, for each tissue separately, we estimated hidden factors using PEER (ref. 13) on the transformed expression values. In each tissue, we defined expressed genes and corrected for the same number of PEER factors as in the GTEx eQTL analyses (see supplementary information of ref. 12). We regressed out the PEER factors, the top three genotype principal components and sex (where appropriate) from the transformed expression data for each tissue using the following linear model:

$$Y_g = \mu_g + \sum_{n} \alpha_{g,n} P_n + \sum_{k=1}^{3} \beta_{g,k} G_k + \gamma_g S + \varepsilon_g,$$

where $Y_g$ is the transformed expression of a given gene $g$, $\mu_g$ is the mean expression level for the gene, $P_n$ is the $n$th PEER factor, $G_1$, $G_2$, $G_3$ are the top three genotype principal components, $S$ is the sex covariate, and $\alpha_{g,n}$, $\beta_{g,k}$ and $\gamma_g$ are the corresponding regression coefficients. We assumed the residual vector $\varepsilon_g$ follows the multivariate normal distribution $\varepsilon_g \sim N(0, \sigma^2 I)$. Finally, we standardized the expression residuals $\varepsilon_g$ for each gene, which yielded Z-scores.
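The correction steps for one gene in one tissue can be sketched as follows. This is a minimal illustration using only the Python standard library, not the pipeline used in the study: the function names are hypothetical, the covariate rows stand in for PEER factors, genotype principal components and sex, and ordinary least squares via the normal equations is an assumed fitting method.

```python
import math
from statistics import mean, stdev


def standardize(values):
    """Center to mean 0 and scale to unit sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def residual_z_scores(rpkm, covariates):
    """Z-scores for one gene after regressing out per-sample covariates.

    rpkm: expression values across samples.
    covariates: one row per sample (hypothetical stand-ins for PEER factors,
    genotype principal components and sex).
    """
    # log2(RPKM + 2) transform followed by per-gene standardization
    y = standardize([math.log2(v + 2) for v in rpkm])
    # Design matrix with an intercept column
    x = [[1.0] + list(row) for row in covariates]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    beta = solve(xtx, xty)
    resid = [yi - sum(b * xi for b, xi in zip(beta, r)) for r, yi in zip(x, y)]
    # Standardized residuals are the Z-scores used for outlier calling
    return standardize(resid)
```

For example, `residual_z_scores([1.0, 3.5, 2.2, 8.0, 4.1, 30.0], [[0.1], [0.4], [0.2], [0.9], [0.5], [0.3]])` returns six Z-scores with mean 0 and unit variance that are uncorrelated with the single covariate.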
To better understand the effect of PEER correction on the removal of technical and biological confounders, we compared the PEER factors in each tissue separately to pre-collected sample and subject covariates. We considered the subset of covariates with >50 observations in at least 31 tissues, where we first selected covariates with more than one unique entry in each tissue. For categorical covariates, we only considered categories with more than 20 observations. For each PEER factor and each covariate, we fit a linear model with the PEER factor as the response and the covariate as the predictor. From this model, we computed the proportion of that PEER factor’s variance explained by the covariate as the adjusted $R^2$:

$$R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1},$$

where $p$ and $n$ are the number of parameters and samples, respectively, and

$$R^2 = 1 - \frac{\mathrm{SSR}}{\mathrm{SST}}.$$

SST and SSR refer to the total and residual sums of squares, respectively.
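As an illustration of this calculation for a single quantitative covariate, the sketch below fits a simple linear regression and returns the adjusted R² under the common definition 1 − (1 − R²)(n − 1)/(n − p − 1). Treating p as the number of slope parameters (here p = 1, intercept not counted) is an assumption of this sketch, and the function name is hypothetical.

```python
from statistics import mean


def adjusted_r2(response, predictor):
    """Adjusted R^2 of a simple linear regression of response on predictor."""
    n = len(response)
    p = 1  # assumption: one slope parameter, intercept not counted
    xbar, ybar = mean(predictor), mean(response)
    sxx = sum((x - xbar) ** 2 for x in predictor)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(predictor, response))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    # Total and residual sums of squares
    sst = sum((y - ybar) ** 2 for y in response)
    ssr = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(predictor, response))
    r2 = 1 - ssr / sst
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)
```

With a PEER factor of `[1, 3, 2, 4]` and a covariate of `[1, 2, 3, 4]`, R² is 0.64 and the adjusted value is 0.46.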
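To quantify the degree to which each covariate was captured by the combination of all PEER factors, genotype principal components and sex (where appropriate) for each tissue, we considered the expression component regressed out from the uncorrected data:

$$W_g = Y_g - \mu_g - \varepsilon_g,$$

that is, the part of the transformed expression attributed to the PEER factors, genotype principal components and sex.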
For each covariate, we then fit a linear model with $W_g$ as the response and the covariate as the predictor. We assessed the proportion of the variance of $W_g$ explained by each covariate by computing the adjusted $R^2$ for the covariate across all genes. We used the formula above, but summed across all genes to compute SST and SSR.
To assess the impact of PEER correction on rare variant enrichment, we also tried removing either the top five PEER factors for each tissue or no PEER factors. We then performed multi-tissue outlier calling and tested the enrichment of rare and common variants in the two partially corrected datasets (see ‘Enrichment of rare and common variants near outlier genes’).
Single-tissue and multi-tissue outlier discovery
Single-tissue and multi-tissue outlier calling was restricted to autosomal lincRNA and protein-coding genes. For each tissue, an individual was called a single-tissue outlier for a particular gene if that individual had the largest absolute Z-score and the absolute value was at least 2. For each gene, the individual with the most extreme median Z-score taken across tissues was identified as a multi-tissue outlier for that gene provided the absolute median Z-score was at least 2. Therefore, each gene had at most one single-tissue outlier per tissue and one multi-tissue outlier. Under this definition an individual could be an outlier for multiple genes. In addition, we only tested for multi-tissue outliers among individuals with expression measurements for the gene in at least five tissues. To reduce cases where non-genetic factors may cause widespread extreme expression, we removed eight individuals that were multi-tissue outliers for 50 or more genes from all downstream analyses, including before single-tissue outlier discovery. Removing these individuals with extreme expression across many genes improved our rare variant enrichments, but the precise threshold mattered less (Extended Data Fig. 2g). We chose the threshold of 50 to strike a balance between removing extreme individuals while not excluding a large proportion of our cohort.
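The calling rules for one gene can be sketched as follows. The thresholds (|Z| ≥ 2, at least five tissues) come from the text; the function names and the dictionary layout of the input are hypothetical, and the removal of individuals with 50 or more outlier genes is not shown.

```python
from statistics import median


def single_tissue_outlier(z_by_individual, threshold=2.0):
    """Return the individual with the largest |Z| for one gene in one tissue,
    or None if that value is below the threshold.

    z_by_individual: {individual: Z-score}
    """
    ind = max(z_by_individual, key=lambda i: abs(z_by_individual[i]))
    return ind if abs(z_by_individual[ind]) >= threshold else None


def multi_tissue_outlier(z_by_individual_tissue, threshold=2.0, min_tissues=5):
    """Return the individual with the most extreme median Z-score across tissues,
    or None if no eligible individual reaches the threshold.

    z_by_individual_tissue: {individual: {tissue: Z-score}}
    """
    # Only individuals measured in enough tissues are tested
    medians = {
        ind: median(z.values())
        for ind, z in z_by_individual_tissue.items()
        if len(z) >= min_tissues
    }
    if not medians:
        return None
    ind = max(medians, key=lambda i: abs(medians[i]))
    return ind if abs(medians[ind]) >= threshold else None
```

Because only the single most extreme individual is returned, each gene yields at most one outlier per tissue and at most one multi-tissue outlier, as in the definition above.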
Replication of expression outliers
We calculated the proportion of single-tissue outliers discovered in one tissue that had |Z-score| ≥ 2 with the same direction of effect for the same gene in the replication tissue. Since certain groups of tissues were sampled in a specific subset of individuals, we evaluated the extent to which replication was influenced by the size and the overlap of the discovery and replication sets. We repeated the replication analysis with the discovery and replication in exactly 70 overlapping individuals for each pair of tissues with enough samples and compared the replication patterns to those obtained by using all individuals. To estimate the extent to which individual overlap biased replication estimates, for each pair of tissues with sufficient samples, we defined three disjoint groups of individuals: 70 individuals with data for both tissues, 69 distinct individuals with data in the first tissue, and 69 distinct individuals with data in the second tissue. We discovered outliers in the first tissue using the shared set of individuals then tested for replication using the same individuals in the second tissue. Then, for each gene, we added the identified outlier to the distinct set of individuals and tested the replication again in the second tissue. We repeated the process running the discovery in the second tissue and the replication in the first one. We compared the replication rates when using the same or different individuals for the discovery and replication.
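The basic replication rate between a discovery tissue and a replication tissue can be sketched as below. The data layout and function name are hypothetical, and skipping outliers that lack a measurement in the replication tissue is an assumption of this sketch rather than a rule stated in the text.

```python
def replication_rate(discovery, replication_z, threshold=2.0):
    """Fraction of discovery outliers with |Z| >= threshold and the same sign
    in the replication tissue.

    discovery: {gene: (individual, Z-score in the discovery tissue)}
    replication_z: {(gene, individual): Z-score in the replication tissue}
    """
    tested = replicated = 0
    for gene, (individual, z_disc) in discovery.items():
        z_rep = replication_z.get((gene, individual))
        if z_rep is None:
            continue  # assumption: no measurement in the replication tissue, so not tested
        tested += 1
        same_direction = (z_rep > 0) == (z_disc > 0)
        if abs(z_rep) >= threshold and same_direction:
            replicated += 1
    return replicated / tested if tested else float("nan")
```

The same function applies to the subsampling experiments: only the sets of individuals that feed the discovery and replication inputs change.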
We assessed the confidence of our multi-tissue outliers using cross-validation. We separated the tissue expression data randomly into two groups: a discovery set of 34 tissues and a replication set of 10 tissues. For t = 10, 15, 20, 25, and 30, we randomly sampled t tissues from the discovery set and performed outlier calling as described above. Owing to incomplete tissue sampling, the number of tissues supporting each outlier is at least five but less than t. We computed the replication rate as the proportion of outliers in the discovery set with |median Z-score| ≥ 1 or 2 in the replication set. We set no restriction on the number of tissues required for testing in the replication set. To calculate the expected replication rate, we randomly selected individuals in the discovery set with at least five tissues that expressed the gene and computed the replication rate. We repeated this process 10 times for each discovery set size.
Quality control of genotypes and rare variant definition
We restricted our rare variant analyses to individuals of European descent, as they constituted the largest homogenous population within our dataset. We considered only autosomal variants that passed all filters in the VCF (those marked as PASS in the Filter column). Minor allele frequencies (MAFs) within the GTEx data were calculated from the 123 individuals of European ancestry with WGS data (average coverage 30×). The MAF was the minimum of the reference and the alternate allele frequency where the allele frequencies of all alternate alleles were summed together. Rare variants were defined as having MAF ≤ 0.01 in GTEx, and for SNVs and indels we also required MAF ≤ 0.01 in the European population of the 1000 Genomes Project Phase 3 data30. To ensure that population structure among the individuals of European descent was unlikely to confound our results, we verified that the allele frequency distribution of rare variants included in our analysis (within 10 kb of a protein-coding or lincRNA gene, see below) was similar for the five European populations in the 1000 Genomes Project (Extended Data Fig. 4d).
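The allele frequency rule can be sketched as follows, with hypothetical function names and allele counts. Passing `None` for the 1000 Genomes frequency represents variant classes, such as structural variants, for which only the GTEx frequency was required.

```python
def minor_allele_frequency(ref_count, alt_counts):
    """MAF as the minimum of the reference frequency and the summed
    frequency of all alternate alleles."""
    total = ref_count + sum(alt_counts)
    alt_freq = sum(alt_counts) / total
    return min(alt_freq, 1 - alt_freq)


def is_rare(maf_gtex, maf_1kg_eur=None, cutoff=0.01):
    """Rare if MAF <= cutoff in GTEx and, when supplied, also in the
    1000 Genomes European population."""
    if maf_gtex > cutoff:
        return False
    return maf_1kg_eur is None or maf_1kg_eur <= cutoff
```

With 123 individuals (246 chromosomes), a variant seen on two chromosomes has a MAF of about 0.008 and passes the GTEx cutoff, whereas a variant seen on three chromosomes (about 0.012) does not.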
Enrichment of rare and common variants near outlier genes
We assessed the enrichment of rare SNVs, indels and structural variants near outlier genes. Proximity was defined as within 10 kb of the transcription start site for most analyses. For Fig. 3 and Extended Data Figs 5, 7, 8, we included all variants within 10 kb of the gene, including the gene body, to also capture coding variants. In Fig. 3 and Extended Data Figs 5, 8, we extended the window to 200 kb for enhancers and structural variants. For each gene with an outlier, we chose the remaining set of individuals tested for outliers at the same gene as non-outlier controls. We only considered genes that had both an outlier and at least one control. We stratified variants of each class into four minor allele frequency bins (0–1%, 1–5%, 5–10%, 10–25%) to compare the relative enrichments of rare and common variants. We also assessed the enrichment of SNVs at different Z-score cutoffs. Enrichment was defined as the ratio of the proportion of outliers with a variant whose frequency lies within the range to the corresponding proportion for non-outliers. This enrichment analysis is equivalent to the relative risk of having a nearby rare variant given outlier status. We used the asymptotic distribution of the log relative risk to obtain 95% Wald confidence intervals. Within our set of European individuals, we observed some individuals with minor admixture that had relatively more rare variants than the rest (Extended Data Fig. 1b). We confirmed that inclusion of these admixed individuals did not substantially affect our results (Extended Data Fig. 1c). We also calculated rare variant enrichments when restricting to variants outside protein-coding and lincRNA exons in the Gencode v.19 annotation (extending internal exons by 5 bp to capture canonical splice regions).
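The enrichment statistic and its confidence interval can be written compactly. The sketch below uses the standard large-sample standard error of the log relative risk; the function name and the counts in the example are hypothetical.

```python
import math


def relative_risk(out_with, out_total, non_with, non_total, z=1.96):
    """Relative risk of carrying a variant given outlier status,
    with a Wald confidence interval computed on the log scale."""
    p_out = out_with / out_total
    p_non = non_with / non_total
    rr = p_out / p_non
    # Asymptotic standard error of log(RR)
    se = math.sqrt(1 / out_with - 1 / out_total + 1 / non_with - 1 / non_total)
    lower = math.exp(math.log(rr) - z * se)
    upper = math.exp(math.log(rr) + z * se)
    return rr, lower, upper
```

For instance, if 30 of 100 outliers and 100 of 1,000 non-outliers carried a rare variant, `relative_risk(30, 100, 100, 1000)` would give an enrichment of 3.0 with an interval that is symmetric around it on the log scale.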
To measure the informativeness of variant annotations, we used logistic regression to model outlier status as a function of the feature of interest; this yielded log odds ratios with 95% Wald confidence intervals. Note that for the feature enrichment analysis in Fig. 3b and Extended Data Fig. 7, we required that outliers and their gene-matched non-outlier controls have at least one rare variant near the gene. We standardized all features, including binary features, to facilitate comparison between features of different scale. We also calculated the proportion of overexpression outliers, underexpression outliers and non-outliers with a rare variant near the gene (within 10 kb for SNVs and indels and 200 kb for structural variants). To each outlier instance, we assigned at most one of the 12 rare variant classes that we considered (Supplementary Table 3a). If an outlier had rare variants from multiple classes near the relevant genes, we selected the class that was most significantly enriched among outliers.
Annotation of variants
We obtained structural variant annotations from ref. 14 and computed features for rare SNVs and indels using three primary data sources: Roadmap Epigenomics31, CADD v.1.2 (ref. 19) and VEP v.80 (ref. 32). Promoter and enhancer annotation tracks were obtained from the Roadmap Epigenomics Project (http://www.broadinstitute.org/~meuleman/reg2map/HoneyBadger2_release/). We mapped 28 unique tissues in the GTEx project to 19 tissue groups in the Roadmap Project. Using these annotations, for each individual, we assessed whether each SNV or indel overlapped a promoter or enhancer region in at least one of the 19 Roadmap tissue groups. Features, including conservation15,16,17,18, transcription factor binding and deleteriousness, were extracted from the full annotation tracks of the CADD v.1.2 release (downloaded 15 May 2015; http://cadd.gs.washington.edu/download). Finally, we obtained protein-coding and transcription-related annotations from VEP and LOFTEE. This information was provided in the GTEx v6p VCF file (described in ref. 12). Stop-gain and frameshift variants annotated as high-confidence loss-of-function variants by LOFTEE were assumed to trigger nonsense-mediated decay. We generated gene-level features described in Supplementary Table 3.
Allele-specific expression (ASE)
We only considered sites with at least 30 total reads and at least five reads supporting each of the reference and alternate alleles. To minimize the effect of mapping bias, we filtered out sites that showed mapping bias in simulations33, that were in low mappability regions (ftp://hgdownload.cse.ucsc.edu/gbdb/hg19/bbi/wgEncodeCrgMapabilityAlign50mer.bw) or that were rare variants or within 1 kb of a rare variant in the given individual (the variants were extracted from the GTEx exome-sequencing data described in ref. 12). The first two filters were provided in the GTEx ASE data release. The third filter was applied to eliminate potential mapping artefacts that mimic genetic effects from rare variants. We measured ASE at each testable site as the absolute deviation of the reference-allele ratio from 0.5. For each gene, all testable sites in all tissues were included. We compared ASE in single-tissue and multi-tissue outliers at different Z-score thresholds to non-outliers using two-sided Wilcoxon rank-sum tests. To obtain a matched background, we only included a gene in the comparison when ASE data existed for both the outlier individual and at least one non-outlier. In the case of single-tissue outliers, we also required the tissue to match between the outlier and the non-outlier. All individuals that were neither multi-tissue outliers for the given gene nor single-tissue outliers for the gene in the corresponding tissue were included as non-outliers.
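The read-depth filter and the ASE measure can be sketched as below. The function names are hypothetical, and the mapping-bias, mappability and rare-variant proximity filters are not shown.

```python
def is_testable(ref_reads, alt_reads, min_total=30, min_per_allele=5):
    """A site is testable with at least 30 total reads and at least
    five reads supporting each allele."""
    enough_total = ref_reads + alt_reads >= min_total
    both_alleles = min(ref_reads, alt_reads) >= min_per_allele
    return enough_total and both_alleles


def ase(ref_reads, alt_reads):
    """Absolute deviation of the reference-allele ratio from 0.5."""
    return abs(ref_reads / (ref_reads + alt_reads) - 0.5)
```

A site with 30 reference and 10 alternate reads is testable and has an ASE value of 0.25; a balanced site scores 0.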
In cases where outliers had rare coding variants in the gene, if the rare variants were causing the extreme expression in cis, we expected to see ASE at the rare variant matching the direction of the effect. For underexpression outliers, we expected the (rare) minor allele to be underexpressed compared to the major allele. For overexpression outliers, we expected the minor allele to be overexpressed. To test this, we used the same filters as above, but looked exclusively at rare variants (instead of excluding them). We measured ASE as the minor-allele ratio: the number of reads supporting the minor allele over the total number of reads.
We also used ASE to evaluate the performance of both the genomic annotation model and RIVER (see below) by testing the association between allelic imbalance and model predictions using Fisher’s exact test. Here, we defined allelic imbalance as the top 10% of the median absolute deviation, across tissues, of the reference-allele ratio from 0.5.
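One possible reading of this definition is sketched below: for each individual–gene pair, take the median across tissues of the absolute deviation of the reference-allele ratio from 0.5, then flag the pairs in the top 10% of that distribution. The data layout, the function name and the rounding rule for the cutoff are assumptions of this sketch.

```python
from statistics import median


def allelic_imbalance_calls(ref_ratios, top_fraction=0.10):
    """Flag instances whose median absolute deviation from 0.5 is in the top fraction.

    ref_ratios: {(individual, gene): [reference-allele ratio in each tissue]}
    """
    deviation = {
        key: median(abs(r - 0.5) for r in ratios)
        for key, ratios in ref_ratios.items()
    }
    ranked = sorted(deviation, key=deviation.get, reverse=True)
    # Assumption: keep at least one instance and round the cutoff down
    n_top = max(1, int(len(ranked) * top_fraction))
    flagged = set(ranked[:n_top])
    return {key: key in flagged for key in deviation}
```

The resulting binary calls can then be cross-tabulated against binarized model predictions to form the 2 × 2 table for the association test.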
Allele frequency measurements in UK10K
UK10K (ref. 3) VCF files of whole-genome cohorts were downloaded from https://www.ebi.ac.uk. We merged the Avon Longitudinal Study of Parents and Children (ALSPAC) EGAS00001000090 and the Department of Twin Research and Genetic Epidemiology (TWINSUK) EGAS00001000108 datasets for a total of 3,781 individuals. We counted the occurrence of all rare GTEx SNVs in Roadmap Epigenomics-annotated promoter regions among the UK10K samples. GTEx variants absent from the UK10K cohorts were assigned a count of 0.
Definition of multi-tissue eGenes
We defined multi-tissue eGenes using two approaches. For the tissue-by-tissue approach, we obtained lists of significant eGenes (q value ≤ 0.05) for each of the 44 tissues from the GTEx v6p release. The second approach used cis-eQTLs with shared effects across tissues estimated by the RE2 model of the Meta-Tissue software34, as described in ref. 12. We chose, for each gene, the variant with the lowest nominal P value from the RE2 model. We then determined the number of tissues in which this variant-gene pair showed a cis-eQTL effect (m value ≥ 0.9 (ref. 34)). For each of the 18,380 genes tested for multi-tissue outliers, we calculated the number of tissues in which the gene appeared as a significant eGene (tissue-by-tissue approach) or had a shared eQTL effect (Meta-Tissue approach). To show that the enrichment of outlier genes as multi-tissue eGenes was not confounded by gene expression level, using the Meta-Tissue results, we stratified genes tested for multi-tissue outliers into RPKM deciles and repeated the comparison between genes with and without a multi-tissue outlier. When comparing the enrichment for eGenes among constrained and disease gene lists, we classified the top n Meta-Tissue eGenes (ranked by nominal P value from the RE2 model) as multi-tissue eGenes and considered the remaining genes as background. We selected n to match the number of multi-tissue outliers in the comparison.
Evolutionary constraint of genes with multi-tissue outliers
We obtained gene-level estimates of evolutionary constraint from the Exome Aggregation Consortium35 (http://exac.broadinstitute.org/, ExAC release v.0.3). We intersected the 17,351 autosomal lincRNA and protein-coding genes with constraint data from ExAC with the 18,380 genes tested for multi-tissue outliers from GTEx, yielding 14,379 genes for further analysis (3,897 and 10,482 genes with and without a multi-tissue outlier, respectively). We examined three functional constraint scores from the ExAC database: synonymous Z-score, missense Z-score and probability of loss-of-function intolerance (pLI). Synonymous- and missense-intolerant genes were defined as those with corresponding Z-scores above the 90th percentile. We defined loss-of-function intolerant genes as those with a pLI score above 0.9, following the guidelines provided by ExAC. We calculated odds ratios and 95% confidence intervals for the enrichment of genes with multi-tissue outliers in these lists using a Fisher’s exact test. We repeated this analysis for three other gene sets: 19,182 multi-tissue eGenes from GTEx v6p defined using Meta-Tissue, 9,480 reported GWAS genes from the NHGRI-EBI catalogue36 (http://www.ebi.ac.uk/gwas, accessed 30 November 2015) and 3,576 OMIM genes (http://omim.org/, accessed 26 May 2016).
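For reference, a two-sided Fisher's exact test on a 2 × 2 table (for example, genes cross-classified by outlier status and membership in a constrained gene list) can be sketched with the standard library alone. The function name is hypothetical, the odds ratio returned is the simple sample estimate ad/bc rather than a conditional maximum-likelihood estimate, and confidence intervals are not computed here.

```python
from math import comb


def fisher_exact(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns (sample odds ratio, two-sided P value).
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell given fixed margins
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables at least as unlikely as the observed one
    p_value = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(p_value, 1.0)
```

An odds ratio below 1 for the top-left cell corresponds to depletion of genes with multi-tissue outliers in the list, as reported for loss-of-function-intolerant genes.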
We tested for a difference in the mean constraint for genes with multi-tissue outliers and genes with multi-tissue eQTLs using ANOVA. For each constraint score in ExAC, we treated the score for each gene as the response and the status of the gene as having a multi-tissue outlier and/or a multi-tissue eQTL as a categorical predictor with four classes. After fitting the model, we performed a Tukey’s range test to determine whether there was a significant difference in the mean constraint between genes with a multi-tissue outlier but no multi-tissue eQTL and genes with a multi-tissue eQTL but no multi-tissue outlier.
Overlap of genes with multi-tissue outliers and disease genes
We examined the enrichment of genes with multi-tissue outliers in eight disease gene lists: the GWAS catalogue and OMIM (described above), as well as ClinVar (6,279 genes; http://www.ncbi.nlm.nih.gov/clinvar/), OrphaNet (3,451 genes; http://www.orpha.net/), ACMG20 (58 genes; http://www.ncbi.nlm.nih.gov/clinvar/docs/acmg/), Developmental Disorders Genotype-to-Phenotype37 (DDG2P; 1,693 genes; http://www.ebi.ac.uk/gene2phenotype/), and two curated gene lists of 86 cardiovascular disease genes and 55 cancer genes (described below). We computed odds ratios and 95% confidence intervals using a Fisher’s exact test to compare each disease gene list to the genes with multi-tissue outliers and repeated the comparison for genes with multi-tissue eQTLs.
Heritable cancer predisposition and heritable cardiovascular disease gene lists were curated by local experts in clinical and laboratory-based genetics in the two respective areas (Stanford Medicine Clinical Genomics Service, Stanford Cancer Center’s Cancer Genetics Clinic and Stanford Center for Inherited Cardiovascular Disease). Genes were included if both the clinical and laboratory-based teams agreed there was sufficient published evidence to support using variants in these genes in clinical decision making.
For each of the eight disease gene lists above and for genes with multi-tissue outliers or multi-tissue eQTLs, we computed the number of variants (SNVs and indels within 10 kb and structural variants within 200 kb of the gene, including the gene body) at each gene in the 123 individuals of European ancestry with WGS data. For each gene list and for each MAF bin (0–1%, 1–5%, 5–10%, 10–25%), we compared the mean number of variants near genes in the list to the mean number near all other annotated autosomal protein-coding and lincRNA genes using a two-sided t-test.
The RIVER integrative model for predicting regulatory effects of rare variants
RIVER (RNA-informed variant effect on regulation) is a hierarchical Bayesian model that predicts the regulatory effects of rare variants by integrating gene expression with genomic annotations. The RIVER model consists of three layers: a set of nodes G = (G_1, …, G_P) in the topmost layer representing P observed genomic annotations over all rare variants near a particular gene; a latent binary variable F in the middle layer representing the unobserved functional regulatory status of the rare variants; and one binary node E in the final layer representing expression outlier status of the nearby gene. We model each conditional probability distribution as follows:
$$F \mid G \sim \mathrm{Bernoulli}(\psi), \qquad \psi = \mathrm{logit}^{-1}(\beta^{\top} G),$$
$$E \mid F \sim \mathrm{Categorical}(\theta_F),$$
$$\beta_j \sim \mathcal{N}(0, 1/\lambda), \qquad j = 1, \ldots, P,$$
$$\theta_F \sim \mathrm{Beta}(C, C),$$
with parameters β and θ and hyper-parameters λ and C.
Because F is unobserved, the RIVER log-likelihood objective over instances n = 1, …, N is non-convex. We therefore optimize model parameters using Expectation–Maximization38 (EM) as follows:
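In the E-step, we compute the posterior probabilities (ω_n^(i)) of the latent variables F_n given current parameters and observed data. For example, at the ith iteration, the posterior probability of F_n = 1 for the nth instance is
$$\omega_n^{(i)} = P(F_n = 1 \mid G_n, E_n, \beta^{(i)}, \theta^{(i)}) = \frac{P(F_n = 1 \mid G_n, \beta^{(i)})\, P(E_n \mid F_n = 1, \theta^{(i)})}{\sum_{f \in \{0,1\}} P(F_n = f \mid G_n, \beta^{(i)})\, P(E_n \mid F_n = f, \theta^{(i)})}.$$
In the M-step, at the ith iteration, given the current estimates ω^(i), the parameters (β^(i+1)*) are estimated as
$$\beta^{(i+1)*} = \arg\max_{\beta} \sum_{n=1}^{N} \left[ \omega_n^{(i)} \log P(F_n = 1 \mid G_n, \beta) + \left(1 - \omega_n^{(i)}\right) \log P(F_n = 0 \mid G_n, \beta) \right] - \frac{\lambda}{2} \lVert \beta \rVert_2^2,$$
where λ is an L2 penalty hyper-parameter derived from the Gaussian prior on β.
The parameter θ is updated as
$$\theta_{s,t}^{(i+1)} = \frac{\sum_{n=1}^{N} I(E_n = t)\, \omega_n^{(i)}(s) + C}{\sum_{t' \in \{0,1\}} \left[ \sum_{n=1}^{N} I(E_n = t')\, \omega_n^{(i)}(s) + C \right]},$$
where ω_n^(i)(s) is the posterior probability of F_n = s (so that ω_n^(i)(1) = ω_n^(i)), I is an indicator operator, t is the binary value of expression E_n, s ranges over the possible binary values of F_n, and C is a pseudo count derived from the Beta prior on θ. The E and M steps are applied iteratively until convergence.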
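The EM procedure above can be sketched in a few functions. This is a toy illustration under stated assumptions, not the RIVER package: the function names are hypothetical, the β update uses plain gradient ascent with an assumed learning rate, and θ is normalized over the two values of E after adding the pseudo count.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def e_step(G, E, beta, theta):
    """Posterior P(F_n = 1 | G_n, E_n) for each instance; theta[s][t] = P(E = t | F = s)."""
    post = []
    for g, e in zip(G, E):
        p1 = sigmoid(sum(b * x for b, x in zip(beta, g)))
        num = p1 * theta[1][e]
        post.append(num / (num + (1.0 - p1) * theta[0][e]))
    return post


def m_step_beta(G, post, beta, lam, lr=0.5, n_iter=50):
    """Gradient ascent on the expected log-likelihood minus (lam / 2) * ||beta||^2."""
    beta = list(beta)
    n = len(G)
    for _ in range(n_iter):
        grad = [-lam * b for b in beta]
        for g, w in zip(G, post):
            resid = w - sigmoid(sum(b * x for b, x in zip(beta, g)))
            for j, x in enumerate(g):
                grad[j] += resid * x
        # Step size scaled by n (an assumption, chosen for stability).
        beta = [b + (lr / n) * gj for b, gj in zip(beta, grad)]
    return beta


def m_step_theta(E, post, C):
    """Soft counts of (F = s, E = t) plus pseudo count C, normalized over t."""
    counts = [[C, C], [C, C]]
    for e, w in zip(E, post):
        counts[1][e] += w
        counts[0][e] += 1.0 - w
    return [[c / sum(row) for c in row] for row in counts]


def fit_river_sketch(G, E, beta, theta, lam=0.01, C=50, n_em=25, tol=1e-6):
    """Alternate E and M steps until beta changes by less than tol."""
    for _ in range(n_em):
        post = e_step(G, E, beta, theta)
        new_beta = m_step_beta(G, post, beta, lam)
        theta = m_step_theta(E, post, C)
        delta = max(abs(a - b) for a, b in zip(new_beta, beta))
        beta = new_beta
        if delta < tol:
            break
    return beta, theta, e_step(G, E, beta, theta)
```

Each row of G is one (individual, gene) instance of standardized features, optionally with a leading 1.0 for an intercept, and E holds the matching 0/1 outlier labels. The returned posteriors play the role of the RIVER score in this sketch.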
RIVER application to the GTEx cohort
As input, RIVER requires a set of genomic features G and a set of corresponding expression outlier observations E, each over instances of individual and gene pairs. Using the variant annotations described above, we generated site-level genomic features for the 116 European individuals with GTEx WGS data that had fewer than 50 multi-tissue outliers. We then collapsed these features for all rare SNVs within 10 kb of each transcription start site to generate the gene-level features that are described in Supplementary Table 3b. This produced a matrix of genomic features G of size (116 individuals × 1,736 genes) × (112 genomic features), where we standardized features before use. For the values of E, we defined any individual with |median Z-score| ≥ 1.5 as an outlier if expression was observed in at least five tissues; the remaining individuals were labelled as non-outliers for the gene. We used this more lenient threshold in order to obtain a sufficiently large set of outliers for robust training and testing. In total, we extracted 48,575 instances where an individual had at least one rare variant within 10 kb of the transcription start site of a gene.
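The outlier labelling rule used for E can be sketched as follows, assuming per-tissue Z-scores for one gene are already available. The function name and the input layout (a dictionary of Z-score lists, with None marking tissues without observed expression) are hypothetical.

```python
from statistics import median


def call_outliers(z_by_individual, threshold=1.5, min_tissues=5):
    """Label each individual as outlier (1) or non-outlier (0) for one gene.

    z_by_individual: hypothetical {individual: [Z-score per tissue, or None]}.
    An individual is an outlier if expression was observed in at least
    min_tissues tissues and |median Z-score| >= threshold.
    """
    status = {}
    for individual, z_scores in z_by_individual.items():
        observed = [z for z in z_scores if z is not None]
        is_outlier = (
            len(observed) >= min_tissues and abs(median(observed)) >= threshold
        )
        status[individual] = int(is_outlier)
    return status
```

Individuals with expression observed in fewer than five tissues fall into the non-outlier class here, following the statement that all remaining individuals were labelled as non-outliers.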
To train and evaluate RIVER on the GTEx cohort, we used the 3,766 instances of individual and gene pairs where two individuals had the same rare SNVs near a particular gene. We held out those instances and trained RIVER parameters with the remaining instances. RIVER requires two hyper-parameters λ and C. To select λ, we first applied an L2-regularized multivariate logistic regression with features G and response variable E, selecting λ with the minimum squared error via tenfold cross-validation (we selected λ = 0.01). We selected C = 50, informed simply by the total number of training instances available, as validation data were not available for extensive cross-validation. Initial parameters for EM were set to θ = (P(E = 0 | F = 0), P(E = 1 | F = 0), P(E = 0 | F = 1), P(E = 1 | F = 1)) = (0.99, 0.01, 0.3, 0.7) and β from the multivariate logistic regression above, although different initializations did not significantly change the final parameters (Extended Data Fig. 9b).
The 3,766 held-out pairs of instances were used to create a labelled evaluation set. For one of the two individuals from each pair, we estimated the posterior probability of a functional rare variant P(F | G, E, β, θ). The outlier status of the second individual, whose data were not observed either during training or prediction, was then treated as a ‘label’ of the true status of functional effect F. Using this labelled set, we compared the RIVER score to the posterior P(F | G, β) estimated from the plain L2-regularized multivariate logistic regression model with genomic annotations alone. We produced receiver operating characteristic curves and computed areas under the curve (AUCs) for both models, testing for significant differences using DeLong’s method29. This analysis relied on outlier status reflecting the consequences of rare variants. Indeed, pairs of individuals who shared rare variants tended to have highly similar outlier status even after regressing out effects of common variants (Kendall’s τ rank correlation, P < 2.2 × 10⁻¹⁶). We repeated this evaluation, varying the median Z-score threshold used to define outliers, and we also compared RIVER to individual features that were strongly enriched among outliers as well as PolyPhen39 and SIFT40.
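The paired evaluation can be illustrated with a small AUC calculation, where the scores come from the first individual of each pair and the labels are the outlier status of the second. This sketch uses the rank-based (Mann-Whitney) definition of the AUC and does not implement DeLong's test. The function name is hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve for 0/1 labels.

    Computed as the probability that a randomly chosen positive instance
    scores higher than a randomly chosen negative one (ties count as 0.5).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

Calling the function once with RIVER posteriors and once with the annotation-only posteriors, against the same labels, gives the two AUCs being compared.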
Supervised model integrating expression and genomic annotation
To assess the information gained by incorporating gene expression data in the prediction of functional rare variants, we applied a simplified supervised approach to a limited dataset. We used the instances where two individuals had the same rare SNVs to create a labelled training set where the outlier status of the second individual was used as the response variable. We then trained a logistic regression model with only two features: (1) the outlier status of the first individual and (2) a single genomic feature value, such as CADD or deleterious annotation of genetic variants using neural networks (DANN). We estimated parameters from the entire set of rare-variant-matched pairs using logistic regression to determine the log odds ratio and corresponding P value of expression status as a predictor. While this approach was not amenable to training a full predictive model over all genomic annotations jointly given the limited number of instances, it provided a consistent estimate of the log odds ratio of outlier status. We tested five genomic predictors: CADD19, DANN41, transcription-factor-binding site annotations, PhyloP scores15 and one aggregated feature: the posterior probability from a multivariate logistic regression model learned with all genomic annotations.
RIVER assessment of pathogenic ClinVar variants
We downloaded variants from the ClinVar database21 (accessed 04 May 2015) and searched for these disease variants within the set of rare variants segregating in the GTEx cohort. Any disease variant reported as pathogenic, likely pathogenic or a risk factor for disease was considered pathogenic. We further categorized the pathogenic variants as likely regulatory if they were annotated as splice-site, synonymous or nonsense variants, whereas missense variants were considered unlikely to have a regulatory effect. To explore RIVER scores for those pathogenic variants, all instances were used for training RIVER. We then computed a posterior probability P(F | G, E, β, θ) for each instance coinciding with a pathogenic ClinVar variant.
Stability of estimated parameters with different parameter initializations
We tried several different initialization parameters for β and θ to explore how this affected the estimated parameters. We initialized a noisy β by adding K% Gaussian noise relative to the mean of β with θ fixed (for K = 10, 20, 50, 100, 200, 400, 800). For θ, we fixed P(E = 1 | F = 0) and P(E = 0 | F = 0) as 0.01 and 0.99, respectively, and initialized (P(E = 0 | F = 1), P(E = 1 | F = 1)) as (0.1, 0.9), (0.4, 0.6) and (0.45, 0.55) instead of (0.3, 0.7) with β fixed. For each parameter initialization, we computed Spearman rank correlations between parameters from RIVER using the original initialization and the alternative initializations. We also investigated how many instances within the top 10% of posterior probabilities from RIVER under the original settings were replicated in the top 10% of posterior probabilities under the alternative initializations (replication accuracy in Extended Data Fig. 9b).
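The two stability measures can be sketched as follows. The function names are hypothetical, and the handling of ties (average ranks for Spearman, arbitrary order at the top-10% boundary) is an assumption, since the text does not describe it.

```python
def top_fraction(scores, frac=0.10):
    """Indices of the highest-scoring fraction of instances (at least one)."""
    k = max(1, int(len(scores) * frac))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def replication_accuracy(original, alternative, frac=0.10):
    """Share of the original top set that is also in the alternative top set."""
    top_original = top_fraction(original, frac)
    return len(top_original & top_fraction(alternative, frac)) / len(top_original)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

In this sketch, `spearman` would be applied to the fitted parameter vectors from two initializations, and `replication_accuracy` to the two vectors of posterior probabilities.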
Validation of large-effect rare variants using CRISPR–Cas9 genome editing
To select rare, coding SNVs for validation by CRISPR–Cas9 editing, we first restricted to the (gene, individual, variant) tuples identified in multi-tissue outliers without a rare structural variant or a rare indel within 200 kb or 10 kb of the gene, respectively. We considered the 116 rare SNVs with a coding consequence for the corresponding gene as annotated by VEP32; coding annotations included stop gained, stop lost, splice acceptor variant, splice donor variant, start lost, missense variant, splice region variant, stop retained variant, synonymous variant, coding sequence variant and 5′/3′ UTR variant. Using RNA-seq data from ENCODE, we further restricted our variant list to the 59 SNVs occurring in genes with an average FPKM (fragments per kilobase per million reads) of at least 10 in K562 cells (ENCODE experiment accession numbers ENCSR000AEL and ENCSR000AEN)42. Finally, we filtered for rare, coding SNVs in (gene, individual) pairs with |median Z-score| > 4 and a RIVER score above the 99.5th percentile. These filters yielded a final set of 13 rare SNVs from which we chose the six exonic SNVs for testing.
As controls, we selected SNVs present within the same cDNA amplicon region as the corresponding outlier SNV (see details on targeted sequencing below). We first searched for coding SNVs present within these regions in the GTEx cohort that did not occur in the outlier individual. If no SNV could be found satisfying these criteria, we expanded our search for SNVs using the ExAC database (ExAC release v.0.3)35. If multiple possible control variants existed for an outlier SNV, we ranked the controls by CADD score19 and prioritized synonymous variants.
Sequences of single-guide RNAs (sgRNAs) used in the study are listed in Extended Data Fig. 11b. For each variant, a sgRNA and two donor oligonucleotides (with the reference and alternative alleles) were designed such that the PAM was located as close to the variant as possible. The donors were 99 bp long centred on the variant being installed. The variants were installed into K562 cells as previously described22,23. The K562 cells were those generated previously23 and were regularly tested for mycoplasma infection. sgRNAs were expressed in the pGH020 (Addgene plasmid 85405) expression vector. For each donor oligonucleotide, K562 cells constitutively expressing a Cas9–BFP fusion protein were electroporated with 3 μg of sgRNA plasmid DNA and 1 μl of 100 μM donor oligonucleotide using the T-016 program on a Lonza Nucleofector 2b. After electroporation, cells were allowed to recover for five days. Cells electroporated with the reference and alternative allele donor oligonucleotides were mixed in a 1:1 ratio and grown together for three more days to control for differences in culturing conditions. We included cells electroporated with the reference allele to ensure that any changes in expression we observed were not due to the editing process itself. Because the editing efficiency is not 100% and varies between loci, we expected fewer than half the cells to carry the alternative allele and for this proportion to vary by locus. One to two million cells were collected for RNA and genomic DNA extraction.
Genomic DNA (gDNA) was extracted using the QiaAmp DNA mini kit (Qiagen). Total RNA was extracted using QiaShredder and RNeasy Mini kit (Qiagen). Subsequently, 6 μg of RNA was converted into cDNA using AMV reverse transcriptase (Promega). cDNA was purified and concentrated with the PCR Purification Kit (Qiagen). PCR primers were designed to generate 300–400-bp amplicons including the variant in either the gDNA or cDNA locus. For both gDNA and cDNA samples, 400 ng of DNA was amplified in triplicate (technical replicates) using Phusion High-Fidelity polymerase (Fisher) and the amplicon was purified on a 1% TAE agarose gel. The amplicons were then prepared for sequencing using the Nextera XT kit (Illumina) and sequenced together on a NextSeq 500.
Reads were trimmed with cutadapt43 (v.1.13) and aligned using bwa44 (v.0.7.12-r1039) allowing no mismatches (bwa aln -n 0), which excluded any reads with indels created during editing. We used custom reference sequences, one each for the reference and alternate alleles of the targeted cDNA and gDNA amplicon regions. Allele counts at the target locus were computed for each sample using samtools pileup as implemented in the R package Rsamtools45 (v.1.22.0). Only reads with a minimum mapping quality of 20 were considered. Two of the tested loci amplified poorly in preparation for sequencing, and they had extremely low mapping rates and total read counts over the target locus (median read count across replicates <400 compared to 281,000 and 397,000 for gDNA and cDNA, respectively, for the remaining loci). As such, we removed these two loci from further analysis. Finally, to assess the effect of each variant on expression, we tested for a significant difference between the cDNA and gDNA alternate allele proportions with a two-sided t-test. We corrected for multiple testing using the Bonferroni procedure.
RIVER is available at https://bioconductor.org/packages/release/bioc/html/RIVER.html. Additionally, the code for running analyses and producing the figures throughout this manuscript is available separately (https://github.com/joed3/GTExV6PRareVariation).
The GTEx v6p release genotype and allele-specific expression data are available from dbGaP (study accession phs000424.v6.p1; http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v6.p1). Expression data from the v6p release and eQTL results are available from the GTEx portal (http://gtexportal.org).
We thank members of the MacArthur laboratory and the Laboratory, Data Analysis, and Coordinating Center (LDACC) for performing the quality control of the whole genome sequencing data, D. Conrad for help with the structural variant calls, D. A. Knowles for code review, J. T. Leek and C. D. Brown for feedback on the manuscript, and the artists of the graphics that we modified in Fig. 1 (https://pixabay.com/en/man-silhouette-stand-straight-308387/ and http://www.allvectors.com/human-organs/). The Genotype-Tissue Expression (GTEx) project was supported by the Common Fund of the Office of the Director of the National Institutes of Health (NIH). Additional funds were provided by the National Cancer Institute; National Human Genome Research Institute (NHGRI); National Heart, Lung, and Blood Institute; National Institute on Drug Abuse; National Institute of Mental Health; and National Institute of Neurological Disorders and Stroke. Donors were enrolled at the Biospecimen Source Sites funded by Leidos Biomedical, Inc. (Leidos) subcontracts to the National Disease Research Interchange (10XS170) and Roswell Park Cancer Institute (10XS171). The LDACC was funded through a contract (HHSN268201000029C) to The Broad Institute. Biorepository operations were funded through a Leidos subcontract to the Van Andel Institute (10ST1035). Additional data repository and project management were provided by Leidos (HHSN261200800001E). The Brain Bank was supported by a supplement to University of Miami grant DA006227. We are grateful for support from a Hewlett-Packard Stanford Graduate Fellowship (E.K.T.), a doctoral scholarship from the Natural Science and Engineering Council of Canada (E.K.T.), a Lucille P. Markey Biomedical Research Stanford Graduate Fellowship (J.R.D.), the Stanford Genome Training Program (SGTP; NHGRI T32HG000044) (J.R.D., Z.Z.), the National Science Foundation GRFP (DGE-114747) (Z.Z.), the Joseph C. Pistritto Research Fellowship (F.N.D.), NIH training grant T32 GM007057 (B.J.S.), a Mr and Mrs Spencer T. Olin Fellowship for Women in Graduate Study (A.J.S.), the Searle Scholars Program (A.B.), NIH grants 1R01MH109905-01 (A.B.), R01MH101814 (NIH Common Fund; GTEx Program) (A.B. and S.B.M.), R01HG008150 (NHGRI; Non-Coding Variants Program) (A.B., S.B.M.), and NHGRI grants U01HG007436 and U01HG009080 (S.B.M.).
Extended data figures
This table contains the adjusted R² values between the top PEER factors estimated for each tissue and the sample covariates available for that tissue. Data for each tissue are presented in separate sheets named according to the tissue ID. Example adjusted R² data for sample covariates for skeletal muscle (tissue ID: Muscle_Skeletal) are presented in Extended Data Figure 1a (left).
This table contains the adjusted R² values between the top PEER factors estimated for each tissue and the subject covariates available for that tissue. Data for each tissue are presented in separate sheets named according to the tissue ID. Example adjusted R² data for subject covariates for skeletal muscle (tissue ID: Muscle_Skeletal) are presented in Extended Data Figure 1a (right).
This table contains descriptions of the variant features for the enrichment analyses and training of RIVER. Sheet ‘A-Selected features’ describes the disjoint variant classes whose enrichments among outliers are presented in Figure 3 and Extended Data Figure 5c. Sheet ‘B-RIVER features’ describes the gene-level features used for training RIVER. The enrichments of these features among outliers are presented in Extended Data Figure 7.