The impact of rare variation on gene expression across tissues

Li, Xin; Kim, Yungil; Tsang, Emily K.; Davis, Joe R.; Damani, Farhan N.; Chiang, Colby; Hess, Gaelen T.; Zappala, Zachary; Strober, Benjamin J.; Scott, Alexandra J.; Li, Amy; Ganna, Andrea; Bassik, Michael C.; Merker, Jason D.; Hall, Ira M.; Battle, Alexis; Montgomery, Stephen B.

doi:10.1038/nature24267

Letter
Open access
Published: 12 October 2017

Xin Li, Yungil Kim, Emily K. Tsang, Joe R. Davis, Farhan N. Damani, Colby Chiang, Gaelen T. Hess, Zachary Zappala, Benjamin J. Strober, Alexandra J. Scott, Amy Li, Andrea Ganna, Michael C. Bassik, Jason D. Merker, GTEx Consortium, Ira M. Hall, Alexis Battle & Stephen B. Montgomery

Nature volume 550, pages 239–243 (2017)

Abstract

Rare genetic variants are abundant in humans and are expected to contribute to individual disease risk^1,2,3,4. While genetic association studies have successfully identified common genetic variants associated with susceptibility, these studies are not practical for identifying rare variants^1,5. Efforts to distinguish pathogenic variants from benign rare variants have leveraged the genetic code to identify deleterious protein-coding alleles^1,6,7, but no analogous code exists for non-coding variants. Therefore, ascertaining which rare variants have phenotypic effects remains a major challenge. Rare non-coding variants have been associated with extreme gene expression in studies using single tissues^8,9,10,11, but their effects across tissues are unknown. Here we identify gene expression outliers, or individuals showing extreme expression levels for a particular gene, across 44 human tissues by using combined analyses of whole genomes and multi-tissue RNA-sequencing data from the Genotype-Tissue Expression (GTEx) project v6p release¹². We find that 58% of underexpression and 28% of overexpression outliers have nearby conserved rare variants compared to 8% of non-outliers. Additionally, we developed RIVER (RNA-informed variant effect on regulation), a Bayesian statistical model that incorporates expression data to predict a regulatory effect for rare variants with higher accuracy than models using genomic annotations alone. Overall, we demonstrate that rare variants contribute to large gene expression changes across tissues and provide an integrative method for interpretation of rare variants in individual genomes.

Main

Our analysis focused on individuals with extremely high or extremely low expression of a particular gene compared with the population, using the GTEx v6p release data, which include RNA-sequencing data for 449 individuals and 44 tissues. We refer to these individuals as gene expression outliers. The GTEx data enable the identification of both single-tissue and multi-tissue expression outliers (Fig. 1a), with the latter defined by consistent extreme expression across many tissues (see Methods). To account for broad environmental and technical confounders, we removed hidden factors estimated by PEER (probabilistic estimation of expression residuals)¹³ from each tissue before outlier discovery (Extended Data Figs 1, 2 and Supplementary Tables 1, 2).

**Figure 1: Gene expression outliers and sharing between tissues.**

We identified a single-tissue expression outlier for ≥99% of expressed genes in each tissue and a multi-tissue outlier for 4,919 out of 18,380 genes that were tested (27%). Each individual was a single-tissue outlier for a median of 83 genes per tissue and a multi-tissue outlier for a median of 10 genes. Single-tissue outliers that were found in one tissue replicated in other tissues at rates of up to 33%, with higher rates among related tissues (Fig. 1b and Extended Data Fig. 3). The replication rate for multi-tissue outliers was much higher and increased with the number of tissues used for discovery (Fig. 1c).

We investigated the influence of rare genetic variation on extreme expression levels, focusing on the individuals of European ancestry with whole-genome sequencing data (1,144 multi-tissue outliers). Multi-tissue outliers were strongly enriched for nearby rare variants. The enrichment was most pronounced for structural variants, as previously described¹⁴, and greater for short insertions and deletions (indels) than for single-nucleotide variants (SNVs) (Fig. 2a and Extended Data Fig. 4). Because most rare variants occur as heterozygotes, expression outliers driven by rare variants in cis should exhibit allele-specific expression (ASE). Both single-tissue and multi-tissue outliers were significantly enriched for ASE compared to non-outliers (see Methods; two-sided Wilcoxon rank-sum tests, each nominal P < 2.2 × 10⁻¹⁶; Fig. 2c). For underexpression outliers with exonic rare variants, the rare allele was generally underexpressed with respect to the common allele and conversely so for overexpression outliers, consistent with the rare variant causing the effect (two-sided Wilcoxon rank-sum tests, each nominal P < 4.0 × 10⁻⁸; Extended Data Fig. 5a). The enrichment for rare variants and ASE was stronger for multi-tissue outliers than for single-tissue outliers (Fig. 2b, c and Extended Data Fig. 6a), especially at higher Z-score thresholds.

**Figure 2: Enrichment of rare variants and ASE in outliers.**

To characterize the properties of rare variants that correlated with large changes in gene expression, we assessed the enrichment of different classes of variants in outliers compared to non-outliers (Supplementary Table 3a). Outliers were enriched, in order of significance, for structural variants, variants near splice sites, variants introducing frameshifts, variants at start or stop codons, variants near the transcription start site and variants in conserved regions (Fig. 3a). Variants in coding regions contributed disproportionately to outlier expression; enrichments weakened for all variant types (SNVs, indels and structural variants) when excluding exonic regions (Extended Data Fig. 6b). Additionally, 90% of stop-gain and frameshift variants in outliers were predicted to trigger nonsense-mediated decay (see Methods), suggesting a biological mechanism for these cases.

**Figure 3: Stratification of multi-tissue outliers by rare variant classes.**

We also tested the relationship between outlier gene expression and functional annotations. Multi-tissue outliers were strongly enriched for variants in promoter or CpG-rich regions and had variants with higher conservation^15,16,17,18 and CADD (combined annotation-dependent depletion)¹⁹ scores than non-outliers. We observed weaker enrichment in enhancers and transcription-factor-binding sites (Fig. 3b and Extended Data Fig. 7). Combining all classes of variation, other than non-conserved, non-coding, rare variants (excluded as less likely candidates for causal effects), we observed that 58% of underexpression and 28% of overexpression outliers had rare variants near the relevant gene, compared to 8% for non-outliers (Fig. 3c). Overexpression outliers were more common overall, potentially because detection of underexpression outliers for very low expression genes is inherently limited (Extended Data Fig. 5b). Overexpression outliers were also less enriched for functionally annotated rare variants (Extended Data Fig. 5c). Some variant classes had strong directionality concordant with their expected impact: duplications caused overexpression, whereas deletions, start- and stop-codon variants and frameshifts coincided with underexpression (Fig. 3d). We also observed strong ASE for outliers carrying all classes of variants, except non-conserved variants (Fig. 3e).

We hypothesized that functional, large-effect rare variants have been under recent selective pressure. As expected, we found that rare promoter variants of outliers were significantly less frequent in the UK10K cohort of 3,781 individuals³ than rare promoter variants of non-outliers for the same genes (two-sided Wilcoxon rank-sum test, P = 0.0060; Fig. 4a). Additionally, genes intolerant to loss-of-function and missense mutations were depleted of both multi-tissue outliers and multi-tissue expression quantitative trait loci (eQTLs; Fisher’s exact test, all P < 2 × 10⁻¹⁵; Fig. 4b and Extended Data Fig. 8a). We observed a similar depletion in two curated disease gene lists—genes involved in heritable cardiovascular disease and genes in the guidelines of the American College of Medical Genetics and Genomics for incidental findings²⁰—but not in broader gene lists (Fig. 4c and Extended Data Fig. 8b, c). Genes with a multi-tissue outlier were more likely to have a multi-tissue eQTL (two-sided Wilcoxon rank-sum test, P < 2.2 × 10⁻¹⁶; Extended Data Fig. 8d, e), suggesting that rare and common regulatory variation influence similar genes. However, we found evidence that genes with outliers were more constrained than genes with multi-tissue eQTLs, because genes with outliers had less missense and loss-of-function variation (Tukey’s range test, missense Z-score P = 0.0070, probability of loss-of-function intolerance score P = 0.032; Fig. 4b and Extended Data Fig. 8a). This suggests that outlier expression analysis can yield unique insights into constraints on gene regulation.

**Figure 4: Evolutionary constraint of genes with multi-tissue outliers.**

Next, we sought to prioritize rare variants in each individual genome by their predicted impact on gene expression. We developed RIVER (RNA-informed variant effect on regulation), a Bayesian statistical model that jointly analyses genome and transcriptome data from the same individual to estimate the probability that a variant has regulatory impact (https://bioconductor.org/packages/release/bioc/html/RIVER.html, see Methods). RIVER uses a generative model that assumes that genomic annotations (Supplementary Table 3b) determine the prior probability that a variant is a functional regulatory variant, in terms of influence on gene expression, which in turn affects whether nearby genes are likely to display outlier levels of expression (Fig. 5a). RIVER does not require a labelled set of functional/non-functional variants; rather it derives its power from identifying expression patterns that coincide with predictive genomic annotations.
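As a rough illustration of the generative structure described above, and not the RIVER implementation itself, the following sketch applies Bayes' rule to a two-state latent variable: annotations set a logistic prior on the variant being functional, and the observed outlier status updates that prior. The function name `posterior_functional`, the feature weights and the outlier probabilities are all hypothetical values chosen for illustration.

```python
import math

def posterior_functional(annotations, weights, bias, p_outlier_given_fr, is_outlier):
    """Posterior P(FR = 1 | annotations, expression) for one variant-gene pair.

    annotations, weights : equal-length sequences of floats (hypothetical features)
    p_outlier_given_fr   : (P(outlier | FR = 0), P(outlier | FR = 1)), assumed values
    is_outlier           : observed outlier status of the nearby gene
    """
    # Prior: genomic annotations set the probability that the variant is functional.
    score = bias + sum(w * a for w, a in zip(weights, annotations))
    prior = 1.0 / (1.0 + math.exp(-score))

    # Likelihood of the observed expression status under each latent state.
    p0, p1 = p_outlier_given_fr
    like0 = p0 if is_outlier else 1.0 - p0
    like1 = p1 if is_outlier else 1.0 - p1

    # Bayes' rule over the two latent states.
    return like1 * prior / (like1 * prior + like0 * (1.0 - prior))
```

With an uninformative prior of 0.5, observing outlier expression raises the posterior when outliers are assumed more likely under a functional variant, and observing normal expression lowers it. This is the sense in which expression data add information beyond annotations alone.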

**Figure 5: Performance of RIVER for prioritizing functional regulatory variants.**

We trained RIVER on the GTEx v6p cohort, and evaluated the model on held-out pairs of individuals who shared the same rare variants. We then computed the RIVER score (the posterior probability of having a functional regulatory variant) for one individual, using both expression and genomic data, and assessed the accuracy with respect to the expression levels of the second individual that had been held out (see Methods). Incorporating expression data significantly improved prediction compared with a model that uses genomic annotations alone (area under the curve (AUC) of 0.64 and 0.54, respectively, P = 3.5 × 10⁻⁴; Fig. 5b and Extended Data Fig. 9a, b), and RIVER learned, unsupervised, to prioritize variants supported by both genomic annotations and extreme expression levels across tissues (Fig. 5c and Extended Data Fig. 9c). ASE was also enriched among the top RIVER hits compared with the genomic annotation model (Extended Data Fig. 9d). Finally, even after accounting for the most informative genomic annotations or summary scores, personal expression data were highly informative of rare variant effects (average log odds ratio, 2.76; Extended Data Fig. 9e, f).

RIVER can be used to predict regulatory effects on gene expression of disease-associated variants and aid in prioritization of rare variants in disease studies. To investigate this potential, we evaluated 27 pathogenic variants from ClinVar²¹ present in 21 GTEx donors (Fig. 5c and Extended Data Fig. 10a). Overall, pathogenic variants had RIVER scores that were higher than background variants (two-sided Wilcoxon rank-sum test, P = 3.3 × 10⁻⁹; Extended Data Fig. 10b–d), and the six that were probably regulatory variants (those not annotated as missense or as an indel within a coding region) scored in the 99.9th percentile. Several cases, which we evaluated in detail, illustrated that rare disease-causing variants can have a regulatory impact evident from RNA-sequencing data, even from healthy individuals that have those variants (in whom the variants are often heterozygous; Extended Data Fig. 10e, f). Note that RIVER trained on healthy cohorts, such as GTEx, can then be directly applied to new cohorts that include disease samples.

To experimentally validate a subset of the variants that were identified through outlier analysis, we used CRISPR–Cas9-mediated genome editing^22,23. In K562 cells, we tested six SNVs and matched controls in transcribed regions of genes with an outlier (see Methods and Extended Data Fig. 11a, b), and compared the allelic ratios between mRNA and genomic DNA (gDNA), which was used as an internal control. All variants that were tested were SNVs in underexpression outliers and were therefore expected to decrease expression. Two variants were excluded owing to low cDNA and gDNA total read counts. The four remaining SNVs in outliers all showed lower proportions of the alternate (installed) allele in the cDNA compared to the gDNA, confirming that these variants decreased expression (Extended Data Fig. 11c).

In summary, by combining data across multiple tissues, we curated a set of gene expression outliers that replicated at higher rates and showed stronger enrichment of rare variants than those from any single tissue. We found that rare structural variants, frameshift indels, coding variants and variants near the transcription start site were most likely to have large effects on expression. However, our ability to characterize the genetic basis of multi-tissue outliers remains incomplete. Outliers without an underlying rare variant in our analysis may be due to variants in more distal regions or in annotations we did not consider, or may be attributable to residual technical or environmental effects.

Although variant interpretation remains challenging, RIVER demonstrates the value of incorporating personal gene expression data to examine the consequences of rare variants that may be uncertain based on the sequence alone. Our results suggest that a general approach can be applied to studies that supplement genome sequencing with other molecular phenotypes, such as methylation^24,25,26 and histone modification^27,28. We anticipate that such integrative approaches will be essential for effective interpretation of genome-wide genetic variation on a personalized level.

Methods

Study population

All human subjects were deceased donors. Informed consent was obtained for all donors via next-of-kin consent to permit the collection and banking of de-identified tissue samples for scientific research. The research protocol was reviewed by Chesapeake Research Review Inc., Roswell Park Cancer Institute’s Office of Research Subject Protection, and the institutional review board of the University of Pennsylvania. We used the RNA-seq, allele-specific expression, and whole-genome sequencing (WGS) data from the v6p release of the GTEx project. The generation of these data is described in the supplementary information of ref. 12.

Correction for technical confounders

We restricted our expression analyses to the 449 individuals and 44 tissues for which sex and the top three genotype principal components, which capture major population stratification, were available. For each tissue, we log₂-transformed all expression values (log₂(RPKM + 2)), where RPKM is the number of reads per kilobase of transcript per million mapped reads. We then standardized the expression of each gene to prevent shrinkage of outlier expression values caused by quantile normalization. To remove unmeasured batch effects and other confounders, for each tissue separately, we estimated hidden factors using PEER¹³ on the transformed expression values. In each tissue, we defined expressed genes and corrected for the same number of PEER factors as in the GTEx eQTL analyses (see supplementary information of ref. 12). We regressed out the PEER factors, the top three genotype principal components and sex (where appropriate) from the transformed expression data for each tissue using the following linear model:
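A form of this model consistent with the definitions that follow is given here. The coefficient symbols α, β and γ and the number of PEER factors N are notation assumed for illustration; they are not defined in the text.

```latex
Y_g = \mu_g + \sum_{n=1}^{N} \alpha_{g,n} P_n + \sum_{k=1}^{3} \beta_{g,k} G_k + \gamma_g S + \varepsilon_g
```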

where Y_g is the transformed expression of a given gene g, μ_g is the mean expression level for the gene, P_n is the nth PEER factor, G₁, G₂, G₃ are the top three genotype principal components, and S is the sex covariate. We assumed the residual vector ε_g follows the multivariate normal distribution ε_g ~ N(0, σ²I). Finally, we standardized the expression residuals ε_g for each gene, which yielded Z-scores.
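The correction steps described above can be sketched as follows. This is a minimal illustration, assuming the PEER factors have already been estimated elsewhere and are supplied as columns of a covariate matrix together with the genotype principal components and sex. The function name `corrected_z_scores` is hypothetical, and genes with zero variance are assumed to have been filtered out beforehand.

```python
import numpy as np

def corrected_z_scores(rpkm, covariates):
    """Return per-gene Z-scores after regressing out covariates.

    rpkm       : (samples, genes) array of RPKM values for one tissue
    covariates : (samples, k) array of PEER factors, genotype PCs and sex
    """
    # Log-transform, then standardize each gene across samples.
    y = np.log2(rpkm + 2.0)
    y = (y - y.mean(axis=0)) / y.std(axis=0, ddof=1)

    # Ordinary least squares with an intercept column for the gene mean.
    design = np.column_stack([np.ones(len(y)), covariates])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef

    # Standardize the residuals of each gene to obtain Z-scores.
    return (residuals - residuals.mean(axis=0)) / residuals.std(axis=0, ddof=1)
```

Fitting all genes in one least-squares call works because every gene shares the same design matrix within a tissue.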

To better understand the effect of PEER correction on the removal of technical and biological confounders, we compared the PEER factors in each tissue separately to pre-collected sample and subject covariates. We considered the subset of covariates with >50 observations in at least 31 tissues, where we first selected covariates with more than one unique entry in each tissue. For categorical covariates, we only considered categories with more than 20 observations. For each PEER factor and each covariate, we fit a linear model with the PEER factor as the response and the covariate as the predictor. From this model, we computed the proportion of that PEER factor’s variance explained by the covariate as the adjusted R²:

where p and n are the number of parameters and samples, respectively, and

SS_T and SS_R refer to the total and residual sums of squares, respectively.
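In conventional notation, the two quantities referred to above take the following standard forms. The adjusted form shown assumes that p counts predictors excluding the intercept, which the text does not specify.

```latex
R^2_{\mathrm{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1},
\qquad
R^2 = 1 - \frac{SS_R}{SS_T}
```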

To quantify the degree to which each covariate was captured by the combination of all PEER factors, genotype principal components and sex (where appropriate) for each tissue, we considered the expression component regressed out from the uncorrected data:

For each covariate, we then fit a linear model with W_g as the response and the covariate as the predictor. We assessed the proportion of the variance of W_g explained by each covariate by computing the adjusted R² for the covariate across all genes. We used the formula above, but summed across all genes to compute SS_T and SS_R.

To assess the impact of PEER correction on rare variant enrichment, we also tried removing either the top five PEER factors for each tissue or no PEER factors. We then performed multi-tissue outlier calling and tested the enrichment of rare and common variants in the two partially corrected datasets (see ‘Enrichment of rare and common variants near outlier genes’).

Single-tissue and multi-tissue outlier discovery

Single-tissue and multi-tissue outlier calling was restricted to autosomal lincRNA and protein-coding genes. For each tissue, an individual was called a single-tissue outlier for a particular gene if that individual had the largest absolute Z-score and the absolute value was at least 2. For each gene, the individual with the most extreme median Z-score taken across tissues was identified as a multi-tissue outlier for that gene provided the absolute median Z-score was at least 2. Therefore, each gene had at most one single-tissue outlier per tissue and one multi-tissue outlier. Under this definition an individual could be an outlier for multiple genes. In addition, we only tested for multi-tissue outliers among individuals with expression measurements for the gene in at least five tissues. To reduce cases where non-genetic factors may cause widespread extreme expression, we removed eight individuals that were multi-tissue outliers for 50 or more genes from all downstream analyses, including before single-tissue outlier discovery. Removing these individuals with extreme expression across many genes improved our rare variant enrichments, but the precise threshold mattered less (Extended Data Fig. 2g). We chose the threshold of 50 to balance removing extreme individuals against excluding a large proportion of our cohort.
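The two outlier definitions above can be sketched for a single gene as follows. The nested-dictionary layout and the function names are assumptions made for illustration, and the removal of individuals who are outliers for 50 or more genes is not shown.

```python
from statistics import median

def single_tissue_outlier(z_by_individual, threshold=2.0):
    """Return (individual, z) with the largest |Z| in one tissue, or None.

    z_by_individual : non-empty {individual: Z-score} for one gene in one tissue
    """
    ind, z = max(z_by_individual.items(), key=lambda kv: abs(kv[1]))
    return (ind, z) if abs(z) >= threshold else None

def multi_tissue_outlier(z_by_tissue, threshold=2.0, min_tissues=5):
    """Return (individual, median Z) with the most extreme median, or None.

    z_by_tissue : {tissue: {individual: Z-score}} for one gene
    """
    # Collect each individual's Z-scores across the tissues where they were measured.
    per_individual = {}
    for scores in z_by_tissue.values():
        for ind, z in scores.items():
            per_individual.setdefault(ind, []).append(z)

    # Only individuals measured in enough tissues are eligible.
    medians = {ind: median(zs) for ind, zs in per_individual.items()
               if len(zs) >= min_tissues}
    if not medians:
        return None
    ind, m = max(medians.items(), key=lambda kv: abs(kv[1]))
    return (ind, m) if abs(m) >= threshold else None
```

Because only the single most extreme individual is considered, each gene yields at most one outlier per tissue and at most one multi-tissue outlier, matching the definition in the text.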

Replication of expression outliers

We calculated the proportion of single-tissue outliers discovered in one tissue that had |Z-score| ≥ 2 with the same direction of effect for the same gene in the replication tissue. Since certain groups of tissues were sampled in a specific subset of individuals, we evaluated the extent to which replication was influenced by the size and the overlap of the discovery and replication sets. We repeated the replication analysis with the discovery and replication in exactly 70 overlapping individuals for each pair of tissues with enough samples and compared the replication patterns to those obtained by using all individuals. To estimate the extent to which individual overlap biased replication estimates, for each pair of tissues with sufficient samples, we defined three disjoint groups of individuals: 70 individuals with data for both tissues, 69 distinct individuals with data in the first tissue, and 69 distinct individuals with data in the second tissue. We discovered outliers in the first tissue using the shared set of individuals then tested for replication using the same individuals in the second tissue. Then, for each gene, we added the identified outlier to the distinct set of individuals and tested the replication again in the second tissue. We repeated the process running the discovery in the second tissue and the replication in the first one. We compared the replication rates when using the same or different individuals for the discovery and replication.
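The basic replication rate in the first sentence above can be sketched as follows. The data layout and the function name `replication_rate` are hypothetical, and the sketch assumes that outliers with no measurement in the replication tissue are dropped from the denominator, which the text does not state explicitly.

```python
def replication_rate(discovered, replication_z, threshold=2.0):
    """Fraction of discovered outliers that replicate in a second tissue.

    discovered    : {gene: (individual, Z-score)} from the discovery tissue
    replication_z : {(gene, individual): Z-score} in the replication tissue
    """
    tested = replicated = 0
    for gene, (ind, z_disc) in discovered.items():
        z_rep = replication_z.get((gene, ind))
        if z_rep is None:
            # Assumption: pairs not measured in the replication tissue are skipped.
            continue
        tested += 1
        # Replication requires an extreme Z-score with the same direction of effect.
        if abs(z_rep) >= threshold and (z_rep > 0) == (z_disc > 0):
            replicated += 1
    return replicated / tested if tested else float("nan")
```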

We assessed the confidence of our multi-tissue outliers using cross-validation. We separated the tissue expression data randomly into two groups: a discovery set of 34 tissues and a replication set of 10 tissues. For t = 10, 15, 20, 25, and 30, we randomly sampled t tissues from the discovery set and performed outlier calling as described above. Owing to incomplete tissue sampling, the number of tissues supporting each outlier is at least five but less than t. We computed the replication rate as the proportion of outliers in the discovery set with |median Z-score| ≥ 1 or 2 in the replication set. We set no restriction on the number of tissues required for testing in the replication set. To calculate the expected replication rate, we randomly selected individuals in the discovery set with at least five tissues that expressed the gene and computed the replication rate. We repeated this process 10 times for each discovery set size.

Quality control of genotypes and rare variant definition

We restricted our rare variant analyses to individuals of European descent, as they constituted the largest homogeneous population within our dataset. We considered only autosomal variants that passed all filters in the VCF (those marked as PASS in the Filter column). Minor allele frequencies (MAFs) within the GTEx data were calculated from the 123 individuals of European ancestry with WGS data (average coverage 30×). The MAF was the minimum of the reference and the alternate allele frequency where the allele frequencies of all alternate alleles were summed together. Rare variants were defined as having MAF ≤ 0.01 in GTEx, and for SNVs and indels we also required MAF ≤ 0.01 in the European population of the 1000 Genomes Project Phase 3 data³⁰. To ensure that population structure among the individuals of European descent was unlikely to confound our results, we verified that the allele frequency distribution of rare variants included in our analysis (within 10 kb of a protein-coding or lincRNA gene, see below) was similar for the five European populations in the 1000 Genomes Project (Extended Data Fig. 4d).
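The MAF and rarity definitions above can be sketched as follows. The function names are hypothetical, and passing `None` for the 1000 Genomes frequency is an assumed convention for variant types, such as structural variants, where only the GTEx threshold applies.

```python
def minor_allele_frequency(ref_count, alt_counts):
    """MAF as the minimum of the reference and summed alternate allele frequencies.

    ref_count  : number of chromosomes carrying the reference allele
    alt_counts : list of chromosome counts, one per alternate allele
    """
    alt_total = sum(alt_counts)
    total = ref_count + alt_total
    if total == 0:
        raise ValueError("no called alleles at this site")
    return min(ref_count, alt_total) / total

def is_rare(maf_gtex, maf_1kg_eur=None, cutoff=0.01):
    """Apply the GTEx threshold and, when available, the 1000 Genomes European one."""
    if maf_gtex > cutoff:
        return False
    return maf_1kg_eur is None or maf_1kg_eur <= cutoff
```

With 123 individuals there are 246 chromosomes at a fully called autosomal site, so an allele seen on one or two chromosomes satisfies MAF ≤ 0.01, whereas three copies (3/246 ≈ 0.012) do not.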

Enrichment of rare and common variants near outlier genes

We assessed the enrichment of rare SNVs, indels and structural variants near outlier genes. Proximity was defined as within 10 kb of the transcription start site for most analyses. For Fig. 3 and Extended Data Figs 5, 7, 8, we included all variants within 10 kb of the gene, including the gene body, to also capture coding variants. In Fig. 3 and Extended Data Figs 5, 8, we extended the window to 200 kb for enhancers and structural variants. For each gene with an outlier, we chose the remaining set of individuals tested for outliers at the same gene as non-outlier controls. We only considered genes that had both an outlier and at least one control. We stratified variants of each class into four minor allele frequency bins (0–1%, 1–5%, 5–10%, 10–25%) to compare the relative enrichments of rare and common variants. We also assessed the enrichment of SNVs at different Z-score cutoffs. Enrichment was defined as the ratio of the proportion of outliers with a variant whose frequency lies within the range to the corresponding proportion for non-outliers. This enrichment analysis is equivalent to the relative risk of having a nearby rare variant given outlier status. We used the asymptotic distribution of the log relative risk to obtain 95% Wald confidence intervals. Within our set of European individuals, we observed some individuals with minor admixture that had relatively more rare variants than the rest (Extended Data Fig. 1b). We confirmed that inclusion of these admixed individuals did not substantially affect our results (Extended Data Fig. 1c). We also calculated rare variant enrichments when restricting to variants outside protein-coding and lincRNA exons in the Gencode v.19 annotation (extending internal exons by 5 bp to capture canonical splice regions).
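The enrichment statistic above is a relative risk with a Wald interval on the log scale, which can be sketched as follows. The function name and argument layout are hypothetical, and the standard large-sample standard error of the log relative risk is used. All four counts are assumed to be non-zero.

```python
import math

def relative_risk(out_with, out_total, ctrl_with, ctrl_total, z=1.96):
    """Relative risk of carrying a variant for outliers versus non-outliers.

    Returns (rr, lower, upper) with a 95% Wald interval by default.
    """
    rr = (out_with / out_total) / (ctrl_with / ctrl_total)
    # Large-sample standard error of log(RR).
    se = math.sqrt(1.0 / out_with - 1.0 / out_total
                   + 1.0 / ctrl_with - 1.0 / ctrl_total)
    lower = math.exp(math.log(rr) - z * se)
    upper = math.exp(math.log(rr) + z * se)
    return rr, lower, upper
```

An interval whose lower bound exceeds 1 indicates enrichment among outliers at the chosen confidence level.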

To measure the informativeness of variant annotations, we used logistic regression to model outlier status as a function of the feature of interest; this yielded log odds ratios with 95% Wald confidence intervals. Note that for the feature enrichment analysis in Fig. 3b and Extended Data Fig. 7, we required that outliers and their gene-matched non-outlier controls have at least one rare variant near the gene. We standardized all features, including binary features, to facilitate comparison between features of different scale. We also calculated the proportion of overexpression outliers, underexpression outliers and non-outliers with a rare variant near the gene (within 10 kb for SNVs and indels and 200 kb for structural variants). To each outlier instance, we assigned at most one of the 12 rare variant classes that we considered (Supplementary Table 3a). If an outlier had rare variants from multiple classes near the relevant genes, we selected the class that was most significantly enriched among outliers.

Annotation of variants

We obtained structural variant annotations from ref. 14 and computed features for rare SNVs and indels using three primary data sources: Roadmap Epigenomics³¹, CADD v.1.2 (ref. 19) and VEP v.80 (ref. 32). Promoter and enhancer annotation tracks were obtained from the Roadmap Epigenomics Project (http://www.broadinstitute.org/~meuleman/reg2map/HoneyBadger2_release/). We mapped 28 unique tissues in the GTEx project to 19 tissue groups in the Roadmap Project. Using these annotations, for each individual, we assessed whether each SNV or indel overlapped a promoter or enhancer region in at least one of the 19 Roadmap tissue groups. Features, including conservation^15,16,17,18, transcription factor binding and deleteriousness, were extracted from the full annotation tracks of the CADD v.1.2 release (downloaded 15 May 2015; http://cadd.gs.washington.edu/download). Finally, we obtained protein-coding and transcription-related annotations from VEP and LOFTEE. This information was provided in the GTEx v6p VCF file (described in ref. 12). Stop-gain and frameshift variants annotated as high-confidence loss-of-function variants by LOFTEE were assumed to trigger nonsense-mediated decay. We generated gene-level features described in Supplementary Table 3.

Allele-specific expression (ASE)

We only considered sites with at least 30 total reads and at least five reads supporting each of the reference and alternate alleles. To minimize the effect of mapping bias, we filtered out sites that showed mapping bias in simulations³³, that were in low mappability regions (ftp://hgdownload.cse.ucsc.edu/gbdb/hg19/bbi/wgEncodeCrgMapabilityAlign50mer.bw) or that were rare variants or within 1 kb of a rare variant in the given individual (the variants were extracted from the GTEx exome-sequencing data described in ref. 12). The first two filters were provided in the GTEx ASE data release. The third filter was applied to eliminate potential mapping artefacts that mimic genetic effects from rare variants. We measured ASE at each testable site as the absolute deviation of the reference-allele ratio from 0.5. For each gene, all testable sites in all tissues were included. We compared ASE in single-tissue and multi-tissue outliers at different Z-score thresholds to non-outliers using two-sided Wilcoxon rank-sum tests. To obtain a matched background, we only included a gene in the comparison when ASE data existed for both the outlier individual and at least one non-outlier. In the case of single-tissue outliers, we also required the tissue to match between the outlier and the non-outlier. All individuals that were neither multi-tissue outliers for the given gene nor single-tissue outliers for the gene in the corresponding tissue were included as non-outliers.
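The read-depth filter and the ASE measure above can be sketched as follows. The function name `ase_deviation` is hypothetical, and the mapping-bias, mappability and rare-variant proximity filters are assumed to have been applied upstream.

```python
def ase_deviation(ref_reads, alt_reads, min_total=30, min_per_allele=5):
    """Absolute deviation of the reference-allele ratio from 0.5.

    Returns None when the site fails the read-depth requirements.
    """
    total = ref_reads + alt_reads
    if total < min_total or min(ref_reads, alt_reads) < min_per_allele:
        return None
    return abs(ref_reads / total - 0.5)
```

The measure ranges from 0 for perfectly balanced expression to just under 0.5, because the per-allele minimum excludes fully monoallelic sites.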

In cases where outliers had rare coding variants in the gene, if the rare variants were causing the extreme expression in cis, we expected to see ASE at the rare variant matching the direction of the effect. For underexpression outliers, we expected the (rare) minor allele to be underexpressed compared to the major allele. For overexpression outliers, we expected the minor allele to be overexpressed. To test this, we used the same filters as above, but looked exclusively at rare variants (instead of excluding them). We measured ASE as the minor-allele ratio: the number of reads supporting the minor allele over the total number of reads.

We also used ASE to evaluate the performance of both the genomic annotation model and RIVER (see below) by testing the association between allelic imbalance and model predictions using Fisher’s exact test. Here, we defined allelic imbalance as the top 10% of the median absolute deviation, across tissues, of the reference-allele ratio from 0.5.

Allele frequency measurements in UK10K

UK10K³ VCF files of whole-genome cohorts were downloaded from https://www.ebi.ac.uk. We merged the Avon Longitudinal Study of Parents and Children (ALSPAC) EGAS00001000090 and the Department of Twin Research and Genetic Epidemiology (TWINSUK) EGAS00001000108 datasets for a total of 3,781 individuals. We counted the occurrence of all rare GTEx SNVs in Roadmap Epigenomics-annotated promoter regions among the UK10K samples. GTEx variants absent from the UK10K cohorts were assigned a count of 0.

Definition of multi-tissue eGenes

We defined multi-tissue eGenes using two approaches. For the tissue-by-tissue approach, we obtained lists of significant eGenes (q value ≤ 0.05) for each of the 44 tissues from the GTEx v6p release. The second approach used cis-eQTLs with shared effects across tissues estimated by the RE2 model of the Meta-Tissue software³⁴, as described in ref. 12. We chose, for each gene, the variant with the lowest nominal P value from the RE2 model. We then determined the number of tissues in which this variant-gene pair showed a cis-eQTL effect (m value ≥ 0.9 (ref. 34)). For each of the 18,380 genes tested for multi-tissue outliers, we calculated the number of tissues in which the gene appeared as a significant eGene (tissue-by-tissue approach) or had a shared eQTL effect (Meta-Tissue approach). To show that the enrichment of outlier genes as multi-tissue eGenes was not confounded by gene expression level, using the Meta-Tissue results, we stratified genes tested for multi-tissue outliers into RPKM deciles and repeated the comparison between genes with and without a multi-tissue outlier. When comparing the enrichment for eGenes among constrained and disease gene lists, we classified the top n Meta-Tissue eGenes (ranked by nominal P value from the RE2 model) as multi-tissue eGenes and considered the remaining genes as background. We selected n to match the number of multi-tissue outliers in the comparison.

Evolutionary constraint of genes with multi-tissue outliers

We obtained gene-level estimates of evolutionary constraint from the Exome Aggregation Consortium³⁵ (http://exac.broadinstitute.org/, ExAC release v.0.3). We intersected the 17,351 autosomal lincRNA and protein-coding genes with constraint data from ExAC with the 18,380 genes tested for multi-tissue outliers from GTEx, yielding 14,379 genes for further analysis (3,897 and 10,482 genes with and without a multi-tissue outlier, respectively). We examined three functional constraint scores from the ExAC database: synonymous Z-score, missense Z-score and probability of loss-of-function intolerance (pLI). Synonymous- and missense-intolerant genes were defined as those with corresponding Z-scores above the 90th percentile. We defined loss-of-function intolerant genes as those with a pLI score above 0.9, following the guidelines provided by ExAC. We calculated odds ratios and 95% confidence intervals for the enrichment of genes with multi-tissue outliers in these lists using a Fisher’s exact test. We repeated this analysis for three other gene sets: 19,182 multi-tissue eGenes from GTEx v6p defined using Meta-Tissue, 9,480 reported GWAS genes from the NHGRI-EBI catalogue³⁶ (http://www.ebi.ac.uk/gwas, accessed 30 November 2015) and 3,576 OMIM genes (http://omim.org/, accessed 26 May 2016).
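For readers who want to see the enrichment calculation spelled out, here is a small self-contained sketch of a two-sided Fisher's exact test on a 2 × 2 table, together with a sample odds ratio. The interval shown is the Woolf (log-scale normal) approximation, which is an assumption made here for simplicity and may differ from the interval the authors computed; the counts are made up.

```python
from math import comb, exp, log, sqrt

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    # Sum over all tables that are no more probable than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Sample odds ratio with an approximate (Woolf) 95% interval; needs non-zero cells."""
    or_hat = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return or_hat, exp(log(or_hat) - z * se), exp(log(or_hat) + z * se)

# Toy table: rows = gene has a multi-tissue outlier (yes/no),
# columns = gene is in the constrained list (yes/no).
p_value = fisher_exact_two_sided(3, 1, 1, 3)
or_hat, ci_low, ci_high = odds_ratio_with_ci(3, 1, 1, 3)
```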

We tested for a difference in the mean constraint for genes with multi-tissue outliers and genes with multi-tissue eQTLs using ANOVA. For each constraint score in ExAC, we treated the score for each gene as the response and the status of the gene as having a multi-tissue outlier and/or a multi-tissue eQTL as a categorical predictor with four classes. After fitting the model, we performed a Tukey’s range test to determine whether there was a significant difference in the mean constraint between genes with a multi-tissue outlier but no multi-tissue eQTL and genes with a multi-tissue eQTL but no multi-tissue outlier.

Overlap of genes with multi-tissue outliers and disease genes

We examined the enrichment of genes with multi-tissue outliers in eight disease gene lists: the GWAS catalogue and OMIM (described above), as well as ClinVar (6,279 genes; http://www.ncbi.nlm.nih.gov/clinvar/), OrphaNet (3,451 genes; http://www.orpha.net/), ACMG²⁰ (58 genes; http://www.ncbi.nlm.nih.gov/clinvar/docs/acmg/), Developmental Disorders Genotype-to-Phenotype³⁷ (DDG2P; 1,693 genes; http://www.ebi.ac.uk/gene2phenotype/), and two curated gene lists of 86 cardiovascular disease genes and 55 cancer genes (described below). We computed odds ratios and 95% confidence intervals using a Fisher’s exact test to compare each disease gene list to the genes with multi-tissue outliers and repeated the comparison for genes with multi-tissue eQTLs.

Heritable cancer predisposition and heritable cardiovascular disease gene lists were curated by local experts in clinical and laboratory-based genetics in the two respective areas (Stanford Medicine Clinical Genomics Service, Stanford Cancer Center’s Cancer Genetics Clinic and Stanford Center for Inherited Cardiovascular Disease). Genes were included if both the clinical and laboratory-based teams agreed there was sufficient published evidence to support using variants in these genes in clinical decision making.

For each of the eight disease gene lists above and for genes with multi-tissue outliers or multi-tissue eQTLs, we computed the number of variants (SNVs and indels within 10 kb and structural variants within 200 kb of the gene, including the gene body) at each gene in the 123 individuals of European ancestry with WGS data. For each gene list and for each MAF bin (0–1%, 1–5%, 5–10%, 10–25%), we compared the mean number of variants near genes in the list to the mean number near all other annotated autosomal protein-coding and lincRNA genes using a two-sided t-test.

The RIVER integrative model for predicting regulatory effects of rare variants

RIVER (RNA-informed variant effect on regulation) is a hierarchical Bayesian model that predicts the regulatory effects of rare variants by integrating gene expression with genomic annotations. The RIVER model consists of three layers: a set of nodes G = G₁,..., G_P in the topmost layer representing P observed genomic annotations over all rare variants near a particular gene; a latent binary variable F in the middle layer representing the unobserved functional regulatory status of the rare variants; and one binary node E in the final layer representing expression outlier status of the nearby gene. We model each conditional probability distribution as follows:

with parameters β and θ and hyper-parameters λ and C.

Because F is unobserved, the RIVER log-likelihood objective over instances n = 1, …, N is non-convex. We therefore optimize model parameters using Expectation–Maximization³⁸ (EM) as follows:

In the E-step, we compute the posterior probabilities (ω_n⁽ⁱ⁾) of the latent variables F_n given current parameters and observed data. For example, at the ith iteration, the posterior probability of F_n = 1 for the nth instance is

In the M-step, at the ith iteration, given the current estimates ω⁽ⁱ⁾, the parameters (β^(i + 1)*) are estimated as

where λ is an L2 penalty hyper-parameter derived from the Gaussian prior on β.

The parameter θ gets updated as:

where I is an indicator operator, t is the binary value of expression E_n, s is the possible binary values of F_n, and C is a pseudo count derived from the Beta prior on θ. The E and M steps are applied iteratively until convergence.
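The display equations referred to in this subsection are not reproduced in this copy of the text. The block below is a reconstruction from the surrounding prose only, and its exact form should be treated as an assumption: a logistic model for F given G with a zero-mean Gaussian prior on β (which yields the L2 penalty λ), a conditional table θ for E given F with a symmetric Beta prior (which yields the pseudo count C), and index conventions such as ω_{n,s} chosen here for illustration.

```latex
% Assumed model (reconstructed from the prose, not copied from the paper).
% theta_{s,.} is taken to have a symmetric Beta prior with pseudo count C.
\begin{align*}
P(F_n = 1 \mid G_n, \beta) &= \frac{1}{1 + \exp(-\beta^{\top} G_n)}, \\
P(E_n = t \mid F_n = s, \theta) &= \theta_{s,t}, \qquad s, t \in \{0, 1\}, \\
\beta_p &\sim \mathcal{N}\!\left(0, \lambda^{-1}\right), \qquad p = 1, \dots, P.
\end{align*}

% E-step at iteration i
\[
\omega_{n,1}^{(i)}
= \frac{P(F_n = 1 \mid G_n, \beta^{(i)})\, P(E_n \mid F_n = 1, \theta^{(i)})}
       {\sum_{s \in \{0,1\}} P(F_n = s \mid G_n, \beta^{(i)})\, P(E_n \mid F_n = s, \theta^{(i)})},
\qquad
\omega_{n,0}^{(i)} = 1 - \omega_{n,1}^{(i)}.
\]

% M-step for beta
\[
\beta^{(i+1)*} = \arg\max_{\beta}
\sum_{n=1}^{N} \sum_{s \in \{0,1\}} \omega_{n,s}^{(i)} \log P(F_n = s \mid G_n, \beta)
- \frac{\lambda}{2} \lVert \beta \rVert_2^2 .
\]

% M-step for theta
\[
\theta_{s,t}^{(i+1)} =
\frac{\sum_{n=1}^{N} I(E_n = t)\, \omega_{n,s}^{(i)} + C}
     {\sum_{t' \in \{0,1\}} \left( \sum_{n=1}^{N} I(E_n = t')\, \omega_{n,s}^{(i)} + C \right)} .
\]
```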

RIVER application to the GTEx cohort

As input, RIVER requires a set of genomic features G and a set of corresponding expression outlier observations E, each over instances of individual and gene pairs. Using the variant annotations described above, we generated site-level genomic features for the 116 European individuals with GTEx WGS data that had fewer than 50 multi-tissue outliers. We then collapsed these features for all rare SNVs within 10 kb of each transcription start site to generate the gene-level features that are described in Supplementary Table 3b. This produced a matrix of genomic features G of size (116 individuals × 1,736 genes) × (112 genomic features), where we standardized features before use. For the values of E, we defined any individual with |median Z-score| ≥ 1.5 as an outlier if expression was observed in at least five tissues; the remaining individuals were labelled as non-outliers for the gene. We used this more lenient threshold in order to obtain a sufficiently large set of outliers for robust training and testing. In total, we extracted 48,575 instances where an individual had at least one rare variant within 10 kb of the transcription start site of a gene.
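The outlier labelling rule for E described above (|median Z-score| ≥ 1.5 with expression observed in at least five tissues, non-outlier otherwise) can be written as a short function. The function name and the use of `None` for unobserved tissues are illustrative choices, not the authors' implementation.

```python
from statistics import median

def outlier_status(z_by_tissue, threshold=1.5, min_tissues=5):
    """Return 1 for an outlier, 0 otherwise, for one (individual, gene) pair.

    z_by_tissue: per-tissue expression Z-scores, with None where the gene
    was not observed in that tissue for this individual.
    """
    observed = [z for z in z_by_tissue if z is not None]
    if len(observed) < min_tissues:
        return 0  # too few tissues: labelled as a non-outlier
    return int(abs(median(observed)) >= threshold)

toy_over = outlier_status([2.0, 1.8, 1.6, 1.5, 2.2])
toy_under = outlier_status([-2.0, -1.7, -1.6, -3.0, -1.5, None])
toy_sparse = outlier_status([3.0, 3.0, 3.0, 3.0])
toy_typical = outlier_status([0.1, 0.2, -0.3, 1.6, 0.0])
```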

To train and evaluate RIVER on the GTEx cohort, we used the 3,766 instances of individual and gene pairs where two individuals had the same rare SNVs near a particular gene. We held out those instances and trained RIVER parameters with the remaining instances. RIVER requires two hyper-parameters λ and C. To select λ, we first applied an L2-regularized multivariate logistic regression with features G and response variable E, selecting λ with the minimum squared error via tenfold cross-validation (we selected λ = 0.01). We selected C = 50, informed simply by the total number of training instances available, as validation data were not available for extensive cross-validation. Initial parameters for EM were set to θ = (P(E = 0 | F = 0), P(E = 1 | F = 0), P(E = 0 | F = 1), P(E = 1 | F = 1)) = (0.99, 0.01, 0.3, 0.7) and β from the multivariate logistic regression above, although different initializations did not significantly change the final parameters (Extended Data Fig. 9b).
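As a rough illustration of the training loop with the settings quoted above (λ = 0.01, C = 50, initial θ = (0.99, 0.01, 0.3, 0.7)), the following sketch runs EM on toy data. Several parts are assumptions made for this example rather than details confirmed by the text: the logistic form of P(F | G, β), the plain gradient-ascent M-step for β, the zero initial β (the paper initializes β from logistic regression), the fixed number of iterations in place of a convergence check, and the simulated inputs. The RIVER package linked under Code availability is the authoritative implementation.

```python
import math
import random

def prior_f1(g, beta):
    """P(F = 1 | G = g, beta) under the assumed logistic model."""
    return 1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, g))))

def e_step(G, E, beta, theta):
    """Posterior P(F_n = 1 | G_n, E_n, beta, theta) for each instance."""
    post = []
    for g, e in zip(G, E):
        p1 = prior_f1(g, beta)
        num = p1 * theta[1][e]                  # F = 1 branch
        den = num + (1.0 - p1) * theta[0][e]    # plus the F = 0 branch
        post.append(num / den)
    return post

def m_step_beta(G, omega, beta, lam, steps=50, lr=0.5):
    """Gradient ascent on the omega-weighted log-likelihood minus (lam / 2) * ||beta||^2."""
    n = len(G)
    beta = list(beta)
    for _ in range(steps):
        grad = [-lam * b for b in beta]
        for g, w in zip(G, omega):
            r = w - prior_f1(g, beta)
            for j, x in enumerate(g):
                grad[j] += r * x
        beta = [b + lr * gj / n for b, gj in zip(beta, grad)]
    return beta

def m_step_theta(E, omega, C):
    """theta[s][t] = P(E = t | F = s) from expected counts plus pseudo count C."""
    theta = []
    for s in (0, 1):
        counts = [C, C]
        for e, w in zip(E, omega):
            counts[e] += w if s == 1 else 1.0 - w
        total = counts[0] + counts[1]
        theta.append([c / total for c in counts])
    return theta

def fit(G, E, lam=0.01, C=50.0, em_iters=30):
    beta = [0.0] * len(G[0])                 # assumption: zero start
    theta = [[0.99, 0.01], [0.3, 0.7]]       # rows: F = 0, F = 1; columns: E = 0, E = 1
    for _ in range(em_iters):
        omega = e_step(G, E, beta, theta)
        beta = m_step_beta(G, omega, beta, lam)
        theta = m_step_theta(E, omega, C)
    return beta, theta, e_step(G, E, beta, theta)

# Toy data: 150 instances, 3 standardized features, outliers enriched when feature 0 is high.
random.seed(0)
G = [[random.gauss(0, 1) for _ in range(3)] for _ in range(150)]
E = [1 if random.random() < (0.5 if g[0] > 1.0 else 0.03) else 0 for g in G]
beta, theta, posterior = fit(G, E)
```

The final call to `e_step` returns the quantity the text calls the RIVER score, P(F | G, E, β, θ), for each instance.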

The 3,766 held-out pairs of instances were used to create a labelled evaluation set. For one of the two individuals from each pair, we estimated the posterior probability of a functional rare variant P(F | G, E, β, θ). The outlier status of the second individual, whose data were not observed either during training or prediction, was then treated as a ‘label’ of the true status of functional effect F. Using this labelled set, we compared the RIVER score to the posterior P(F | G, β) estimated from the plain L2-regularized multivariate logistic regression model with genomic annotations alone. We produced receiver operating characteristic curves and computed areas under the curve (AUCs) for both models, testing for significant differences using DeLong’s method²⁹. This analysis relied on outlier status reflecting the consequences of rare variants. Indeed, pairs of individuals who shared rare variants tended to have highly similar outlier status even after regressing out effects of common variants (Kendall’s τ rank correlation, P < 2.2 × 10⁻¹⁶). We repeated this evaluation, varying the median Z-score threshold used to define outliers, and we also compared RIVER to individual features that were strongly enriched among outliers as well as PolyPhen³⁹ and SIFT⁴⁰.
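The paired evaluation can be illustrated with a rank-based AUC: the score comes from the first individual of each pair and the label is the outlier status of the second. This sketch computes the AUC only; it does not implement DeLong's test for comparing two correlated curves, and the scores and labels are toy values.

```python
def auc(scores, labels):
    """Area under the ROC curve as the probability that a positive outranks a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Each (positive, negative) pair counts 1 if ranked correctly, 0.5 if tied.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Each tuple: (score for individual 1, outlier status of individual 2).
toy_pairs = [(0.9, 1), (0.8, 0), (0.3, 1), (0.2, 0)]
toy_scores = [s for s, _ in toy_pairs]
toy_labels = [y for _, y in toy_pairs]
toy_auc = auc(toy_scores, toy_labels)
```

Running the same function on scores from the annotation-only model gives the second AUC for the comparison.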

Supervised model integrating expression and genomic annotation

To assess the information gained by incorporating gene expression data in the prediction of functional rare variants, we applied a simplified supervised approach to a limited dataset. We used the instances where two individuals had the same rare SNVs to create a labelled training set where the outlier status of the second individual was used as the response variable. We then trained a logistic regression model with only two features: (1) the outlier status of the first individual and (2) a single genomic feature value, such as CADD or deleterious annotation of genetic variants using neural networks (DANN). We estimated parameters from the entire set of rare-variant-matched pairs using logistic regression to determine the log odds ratio and corresponding P value of expression status as a predictor. While this approach was not amenable to training a full predictive model over all genomic annotations jointly given the limited number of instances, it provided a consistent estimate of the log odds ratio of outlier status. We tested five genomic predictors: CADD¹⁹, DANN⁴¹, transcription-factor-binding site annotations, PhyloP scores¹⁵ and one aggregated feature: the posterior probability from a multivariate logistic regression model learned with all genomic annotations.

RIVER assessment of pathogenic ClinVar variants

We downloaded variants from the ClinVar database²¹ (accessed 04 May 2015) and searched for these disease variants within the set of rare variants segregating in the GTEx cohort. Any disease variant reported as pathogenic, likely pathogenic or a risk factor for disease was considered pathogenic. We further categorized the pathogenic variants as likely regulatory if they were annotated as splice-site variants, synonymous or nonsense, whereas missense variants were considered unlikely to have a regulatory effect. To explore RIVER scores for those pathogenic variants, all instances were used for training RIVER. We then computed a posterior probability P(F | G, E, β, θ) for each instance coinciding with a pathogenic ClinVar variant.

Stability of estimated parameters with different parameter initializations

We tried several different initialization parameters for β and θ to explore how this affected the estimated parameters. We initialized a noisy β by adding K% Gaussian noise compared to the mean of β with fixed θ (for K = 10, 20, 50, 100, 200, 400, 800). For θ, we fixed P(E = 1 | F = 0) and P(E = 0 | F = 0) as 0.01 and 0.99, respectively, and initialized (P(E = 1 | F = 1), P(E = 0 | F = 1)) as (0.1, 0.9), (0.4, 0.6) and (0.45, 0.55) instead of (0.3, 0.7) with β fixed. For each parameter initialization, we computed Spearman rank correlations between parameters from RIVER using the original initialization and the alternative initializations. We also investigated how many instances within the top 10% of posterior probabilities from RIVER under the original settings were replicated in the top 10% of posterior probabilities under the alternative initializations (replication accuracy in Extended Data Fig. 9b).
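The replication accuracy described above can be sketched as the fraction of top-10% instances under the original initialization that are also in the top 10% under an alternative one. The function names and the toy posteriors are hypothetical, and ties are broken by input order, which is an arbitrary choice made here.

```python
def top_set(posteriors, fraction=0.10):
    """Indices of the instances with the highest posterior probabilities."""
    k = max(1, round(fraction * len(posteriors)))
    order = sorted(range(len(posteriors)), key=lambda i: posteriors[i], reverse=True)
    return set(order[:k])

def replication_accuracy(original, alternative, fraction=0.10):
    """Share of the original top set that is recovered in the alternative top set."""
    a = top_set(original, fraction)
    b = top_set(alternative, fraction)
    return len(a & b) / len(a)

# Toy posteriors for 20 instances; the alternative run demotes the top instance.
toy_original = [i / 20 for i in range(20)]
toy_alternative = list(toy_original)
toy_alternative[19] = 0.0
toy_replication = replication_accuracy(toy_original, toy_alternative)
```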

Validation of large-effect rare variants using CRISPR–Cas9 genome editing

To select rare, coding SNVs for validation by CRISPR–Cas9 editing, we first restricted to the (gene, individual, variant) tuples identified in multi-tissue outliers without a rare structural variant or a rare indel within 200 kb or 10 kb of the gene, respectively. We considered the 116 rare SNVs with a coding consequence for the corresponding gene as annotated by VEP³²; coding annotations included stop gained, stop lost, splice acceptor variant, splice donor variant, start lost, missense variant, splice region variant, stop retained variant, synonymous variant, coding sequence variant and 5′/3′ UTR variant. Using RNA-seq data from ENCODE, we further restricted our variant list to the 59 SNVs occurring in genes with an average FPKM (fragments per kilobase per million reads) of at least 10 in K562 cells (ENCODE experiment accession numbers ENCSR000AEL and ENCSR000AEN)⁴². Finally, we filtered for rare, coding SNVs in (gene, individual) pairs with |median Z-score| > 4 and a RIVER score above the 99.5th percentile. These filters yielded a final set of 13 rare SNVs from which we chose the six exonic SNVs for testing.

As controls, we selected SNVs present within the same cDNA amplicon region as the corresponding outlier SNV (see details on targeted sequencing below). We first searched for coding SNVs present within these regions in the GTEx cohort that did not occur in the outlier individual. If no SNV could be found satisfying these criteria, we expanded our search for SNVs using the ExAC database (ExAC release v.0.3)³⁵. If multiple possible control variants existed for an outlier SNV, we ranked the controls by CADD score¹⁹ and prioritized synonymous variants.

Sequences of single-guide RNAs (sgRNAs) used in the study are listed in Extended Data Fig. 11b. For each variant, a sgRNA and two donor oligonucleotides (with the reference and alternative alleles) were designed such that the PAM was located as close to the variant as possible. The donors were 99 bp long centred on the variant being installed. The variants were installed into K562 cells as previously described^22,23. The K562 cells were those generated previously²³ and were regularly tested for mycoplasma infection. sgRNAs were expressed in the pGH020 (Addgene plasmid 85405) expression vector. For each donor oligonucleotide, K562 cells constitutively expressing a Cas9–BFP fusion protein were electroporated with 3 μg of sgRNA plasmid DNA and 1 μl of 100 μM donor oligonucleotide using the T-016 program on a Lonza Nucleofector 2b. After electroporation, cells were allowed to recover for five days. Cells electroporated with the reference and alternative allele donor oligonucleotides were mixed in a 1:1 ratio and grown together for three more days to control for differences in culturing conditions. We included cells electroporated with the reference allele to ensure that any changes in expression we observed were not due to the editing process itself. Because the editing efficiency is not 100% and varies between loci, we expected fewer than half the cells to carry the alternative allele and for this proportion to vary by locus. One to two million cells were collected for RNA and genomic DNA extraction.

Genomic DNA (gDNA) was extracted using the QiaAmp DNA mini kit (Qiagen). Total RNA was extracted using QiaShredder and RNeasy Mini kit (Qiagen). Subsequently, 6 μg of RNA was converted into cDNA using AMV reverse transcriptase (Promega). cDNA was purified and concentrated with the PCR Purification Kit (Qiagen). PCR primers were designed to generate 300–400-bp amplicons including the variant in either the gDNA or cDNA locus. For both gDNA and cDNA samples, 400 ng of DNA was amplified in triplicate (technical replicates) using Phusion High-Fidelity polymerase (Fisher) and the amplicon was purified on a 1% TAE agarose gel. The amplicons were then prepared for sequencing using the Nextera XT kit (Illumina) and sequenced together on a NextSeq 500.

Reads were trimmed with cutadapt⁴³ (v.1.13) and aligned using bwa⁴⁴ (v.0.7.12-r1039) allowing no mismatches (bwa aln -n 0), which excluded any reads with indels created during editing. We used custom reference sequences, one each for the reference and alternate alleles of the targeted cDNA and gDNA amplicon regions. Allele counts at the target locus were computed for each sample using samtools pileup as implemented in the R package Rsamtools⁴⁵ (v.1.22.0). Only reads with a minimum mapping quality of 20 were considered. Two of the tested loci amplified poorly in preparation for sequencing, and they had extremely low mapping rates and total read counts over the target locus (median read count across replicates <400 compared to 281,000 and 397,000 for gDNA and cDNA, respectively, for the remaining loci). As such, we removed these two loci from further analysis. Finally, to assess the effect of each variant on expression, we tested for a significant difference between the cDNA and gDNA alternate allele proportions with a two-sided t-test. We corrected for multiple testing using the Bonferroni procedure.
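The final comparison can be illustrated with two small helpers: the alternate allele proportion per replicate and a Bonferroni adjustment across loci. The t-test itself is left out of this sketch (any standard two-sample routine could supply the P values), and all counts and P values below are invented for illustration.

```python
from statistics import mean

def alt_proportion(ref_count, alt_count):
    """Fraction of reads at the target locus that carry the alternate allele."""
    return alt_count / (ref_count + alt_count)

def bonferroni(p_values):
    """Bonferroni-adjusted P values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Toy (ref, alt) read counts for three technical replicates of one locus.
cdna = [alt_proportion(r, a) for r, a in [(900, 100), (880, 120), (910, 90)]]
gdna = [alt_proportion(r, a) for r, a in [(700, 300), (720, 280), (690, 310)]]
shift = mean(cdna) - mean(gdna)   # negative: alternate allele under-represented in cDNA

# Toy per-locus P values, adjusted for the number of loci tested.
adjusted = bonferroni([0.01, 0.2, 0.5, 0.004])
```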

Code availability

RIVER is available at https://bioconductor.org/packages/release/bioc/html/RIVER.html. Additionally, the code for running analyses and producing the figures throughout this manuscript is available separately (https://github.com/joed3/GTExV6PRareVariation).

Data availability

The GTEx v6p release genotype and allele-specific expression data are available from dbGaP (study accession phs000424.v6.p1; http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v6.p1). Expression data from the v6p release and eQTL results are available from the GTEx portal (http://gtexportal.org).

References

1. Tennessen, J. A. et al. Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science 337, 64–69 (2012)
2. Nelson, M. R. et al. An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science 337, 100–104 (2012)
3. The UK10K Consortium. The UK10K project identifies rare variants in health and disease. Nature 526, 82–90 (2015)
4. Keinan, A. & Clark, A. G. Recent explosive human population growth has resulted in an excess of rare genetic variants. Science 336, 740–743 (2012)
5. Uricchio, L. H., Zaitlen, N. A., Ye, C. J., Witte, J. S. & Hernandez, R. D. Selection and explosive growth alter genetic architecture and hamper the detection of causal rare variants. Genome Res. 26, 863–873 (2016)
6. MacArthur, D. G. et al. A systematic survey of loss-of-function variants in human protein-coding genes. Science 335, 823–828 (2012)
7. Narasimhan, V. M. et al. Health and population effects of rare gene knockouts in adult humans with related parents. Science 352, 474–477 (2016)
8. Montgomery, S. B., Lappalainen, T., Gutierrez-Arcelus, M. & Dermitzakis, E. T. Rare and common regulatory variation in population-scale sequenced human genomes. PLoS Genet. 7, e1002144 (2011)
9. Zhao, J. et al. A burden of rare variants associated with extremes of gene expression in human peripheral blood. Am. J. Hum. Genet. 98, 299–309 (2016)
10. Zeng, Y. et al. Aberrant gene expression in humans. PLoS Genet. 11, e1004942 (2015)
11. Li, X. et al. Transcriptome sequencing of a large human family identifies the impact of rare noncoding variants. Am. J. Hum. Genet. 95, 245–256 (2014)
12. The GTEx Consortium. Genetic effects on gene expression across human tissues. https://doi.org/10.1038/nature24277 (2017)
13. Stegle, O., Parts, L., Piipari, M., Winn, J. & Durbin, R. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nat. Protoc. 7, 500–507 (2012)
14. Chiang, C. et al. The impact of structural variation on human gene expression. Nat. Genet. 49, 692–699 (2017)
15. Pollard, K. S., Hubisz, M. J., Rosenbloom, K. R. & Siepel, A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 20, 110–121 (2010)
16. Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15, 1034–1050 (2005)
17. Cooper, G. M. et al. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 15, 901–913 (2005)
18. Arbiza, L. et al. Genome-wide inference of natural selection on human transcription factor binding sites. Nat. Genet. 45, 723–729 (2013)
19. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014)
20. Green, R. C. et al. ACMG recommendations for reporting of incidental findings in clinical exome and genome sequencing. Genet. Med. 15, 565–574 (2013)
21. Landrum, M. J. et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res. 44, D862–D868 (2016)
22. Hendel, A. et al. Chemically modified guide RNAs enhance CRISPR–Cas genome editing in human primary cells. Nat. Biotechnol. 33, 985–989 (2015)
23. Hess, G. T. et al. Directed evolution using dCas9-targeted somatic hypermutation in mammalian cells. Nat. Methods 13, 1036–1042 (2016)
24. Grundberg, E. et al. Global analysis of DNA methylation variation in adipose tissue from twins reveals links to disease-associated variants in distal regulatory elements. Am. J. Hum. Genet. 93, 876–890 (2013)
25. Gamazon, E. R. et al. Enrichment of cis-regulatory gene expression SNPs and methylation quantitative trait loci among bipolar disorder susceptibility variants. Mol. Psychiatry 18, 340–346 (2013)
26. Bell, J. T. et al. DNA methylation patterns associate with genetic and gene expression variation in HapMap cell lines. Genome Biol. 12, R10 (2011)
27. Waszak, S. M. et al. Population variation and genetic control of modular chromatin architecture in humans. Cell 162, 1039–1050 (2015)
28. Grubert, F. et al. Genetic control of chromatin states in humans involves local and distal chromosomal interactions. Cell 162, 1051–1065 (2015)
29. DeLong, E. R., DeLong, D. M. & Clarke-Pearson, D. L. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 44, 837–845 (1988)
30. 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015)
31. The Roadmap Epigenomics Consortium et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015)
32. McLaren, W. et al. The Ensembl variant effect predictor. Genome Biol. 17, 122 (2016)
33. Panousis, N. I., Gutierrez-Arcelus, M., Dermitzakis, E. T. & Lappalainen, T. Allelic mapping bias in RNA-sequencing is not a major confounder in eQTL studies. Genome Biol. 15, 467 (2014)
34. Sul, J. H., Han, B., Ye, C., Choi, T. & Eskin, E. Effectively identifying eQTLs from multiple tissues by combining mixed model and meta-analytic approaches. PLoS Genet. 9, e1003491 (2013)
35. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016)
36. Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 42, D1001–D1006 (2014)
37. Deciphering Developmental Disorders Study. Large-scale discovery of novel genetic causes of developmental disorders. Nature 519, 223–228 (2015)
38. Dempster, A. P., Laird, N. M. & Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. 39, 1–38 (1977)
39. Adzhubei, I. A. et al. A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249 (2010)
40. Ng, P. C. & Henikoff, S. Predicting deleterious amino acid substitutions. Genome Res. 11, 863–874 (2001)
41. Quang, D., Chen, Y. & Xie, X. DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics 31, 761–763 (2015)
42. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012)
43. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17, 10–12 (2011)
44. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009)
45. Morgan, M., Pagès, H., Obenchain, V. & Hayden, N. Rsamtools: Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import. R package v.1.28.0 http://bioconductor.org/packages/release/bioc/html/Rsamtools.html (2017)
46. Dror, Y. & Freedman, M. H. Shwachman–Diamond syndrome. Br. J. Haematol. 118, 701–713 (2002)
47. Austin, K. M. et al. Mitotic spindle destabilization and genomic instability in Shwachman–Diamond syndrome. J. Clin. Invest. 118, 1511–1518 (2008)
48. Schmidt, A. et al. Severely altered guanidino compound levels, disturbed body weight homeostasis and impaired fertility in a mouse model of guanidinoacetate N-methyltransferase (GAMT) deficiency. Hum. Mol. Genet. 13, 905–921 (2004)


Acknowledgements

We thank members of the MacArthur laboratory and the Laboratory, Data Analysis, and Coordinating Center (LDACC) for performing the quality control of the whole genome sequencing data, D. Conrad for help with the structural variant calls, D. A. Knowles for code review, J. T. Leek and C. D. Brown for feedback on the manuscript, and the artists of the graphics that we modified in Fig. 1 (https://pixabay.com/en/man-silhouette-stand-straight-308387/ and http://www.allvectors.com/human-organs/). The Genotype-Tissue Expression (GTEx) project was supported by the Common Fund of the Office of the Director of the National Institutes of Health (NIH). Additional funds were provided by the National Cancer Institute; National Human Genome Research Institute (NHGRI); National Heart, Lung, and Blood Institute; National Institute on Drug Abuse; National Institute of Mental Health; and National Institute of Neurological Disorders and Stroke. Donors were enrolled at the Biospecimen Source Sites funded by Leidos Biomedical, Inc. (Leidos) subcontracts to the National Disease Research Interchange (10XS170) and Roswell Park Cancer Institute (10XS171). The LDACC was funded through a contract (HHSN268201000029C) to The Broad Institute. Biorepository operations were funded through a Leidos subcontract to the Van Andel Institute (10ST1035). Additional data repository and project management were provided by Leidos (HHSN261200800001E). The Brain Bank was supported by a supplement to University of Miami grant DA006227. We are grateful for support from a Hewlett-Packard Stanford Graduate Fellowship (E.K.T.), a doctoral scholarship from the Natural Science and Engineering Council of Canada (E.K.T.), a Lucille P. Markey Biomedical Research Stanford Graduate Fellowship (J.R.D.), the Stanford Genome Training Program (SGTP; NHGRI T32HG000044) (J.R.D., Z.Z.), the National Science Foundation GRFP (DGE-114747) (Z.Z.), the Joseph C. Pistritto Research Fellowship (F.N.D.), NIH training grant T32 GM007057 (B.J.S.), a Mr and Mrs Spencer T. Olin Fellowship for Women in Graduate Study (A.J.S.), the Searle Scholars Program (A.B.), NIH grants 1R01MH109905-01 (A.B.), R01MH101814 (NIH Common Fund; GTEx Program) (A.B. and S.B.M.), R01HG008150 (NHGRI; Non-Coding Variants Program) (A.B., S.B.M.), and NHGRI grants U01HG007436 and U01HG009080 (S.B.M.).

Author information

François Aguet, Kristin G. Ardlie, Beryl B. Cummings, Ellen T. Gelfand, Gad Getz, Kane Hadley, Robert E. Handsaker, Katherine H. Huang, Seva Kashin, Konrad J. Karczewski, Monkol Lek, Xiao Li, Daniel G. MacArthur, Jared L. Nedzel, Duyen T. Nguyen, Michael S. Noble, Ayellet V. Segrè, Casandra A. Trowbridge, Taru Tukiainen, Nathan S. Abell, Brunilda Balliu, Ruth Barshir, Omer Basha, Alexis Battle, Gireesh K. Bogu, Andrew Brown, Christopher D. Brown, Stephane E. Castel, Lin S. Chen, Colby Chiang, Donald F. Conrad, Nancy J. Cox, Farhan N. Damani, Joe R. Davis, Olivier Delaneau, Emmanouil T. Dermitzakis, Barbara E. Engelhardt, Eleazar Eskin, Pedro G. Ferreira, Laure Frésard, Eric R. Gamazon, Diego Garrido-Martín, Ariel D.H. Gewirtz, Genna Gliner, Michael J. Gloudemans, Roderic Guigo, Ira M. Hall, Buhm Han, Yuan He, Farhad Hormozdiari, Cedric Howald, Hae Kyung Im, Brian Jo, Eun Yong Kang, Yungil Kim, Sarah Kim-Hellmuth, Tuuli Lappalainen, Gen Li, Xin Li, Boxiang Liu, Serghei Mangul, Mark I. McCarthy, Ian C. McDowell, Pejman Mohammadi, Jean Monlong, Stephen B. Montgomery, Manuel Muñoz-Aguirre, Anne W. Ndungu, Dan L. Nicolae, Andrew B. Nobel, Meritxell Oliva, Halit Ongen, John J. Palowitch, Nikolaos Panousis, Panagiotis Papasaikas, YoSon Park, Princy Parsana, Anthony J. Payne, Christine B. Peterson, Jie Quan, Ferran Reverter, Chiara Sabatti, Ashis Saha, Michael Sammeth, Alexandra J. Scott, Andrey A. Shabalin, Reza Sodaei, Matthew Stephens, Barbara E. Stranger, Benjamin J. Strober, Jae Hoon Sul, Emily K. Tsang, Sarah Urbut, Martijn van de Bunt, Gao Wang, Xiaoquan Wen, Fred A. Wright, Hualin S. Xi, Esti Yeger-Lotem, Zachary Zappala, Judith B. Zaugg, Yi-Hui Zhou, Joshua M. Akey, Daniel Bates, Joanne Chan, Lin S. Chen, Melina Claussnitzer, Kathryn Demanelis, Morgan Diegel, Jennifer A. Doherty, Andrew P. Feinberg, Marian S. Fernando, Jessica Halow, Kasper D. Hansen, Eric Haugen, Peter F. Hickey, Lei Hou, Farzana Jasmine, Ruiqi Jian, Lihua Jiang, Audra Johnson, Rajinder Kaul, Manolis Kellis, Muhammad G. Kibriya, Kristen Lee, Jin Billy Li, Qin Li, Xiao Li, Jessica Lin, Shin Lin, Sandra Linder, Caroline Linke, Yaping Liu, Matthew T. Maurano, Benoit Molinie, Stephen B. Montgomery, Jemma Nelson, Fidencio J. Neri, Meritxell Oliva, Yongjin Park, Brandon L. Pierce, Nicola J. Rinaldi, Lindsay F. Rizzardi, Richard Sandstrom, Andrew Skol, Kevin S. Smith, Michael P. Snyder, John Stamatoyannopoulos, Barbara E. Stranger, Hua Tang, Emily K. Tsang, Li Wang, Meng Wang, Nicholas Van Wittenberghe, Fan Wu, Rui Zhang, Concepcion R. Nierras, Philip A. Branton, Latarsha J. Carithers, Ping Guan, Helen M. Moore, Abhi Rao, Jimmie B. Vaught, Sarah E. Gould, Nicole C. Lockart, Casey Martin, Jeffery P. Struewing, Simona Volpi, Anjene M. Addington, Susan E. Koester, A. Roger Little, Lori E. Brigham, Richard Hasz, Marcus Hunter, Christopher Johns, Mark Johnson, Gene Kopen, William F. Leinweber, John T. Lonsdale, Alisa McDonald, Bernadette Mestichelli, Kevin Myer, Brian Roe, Michael Salvatore, Saboor Shad, Jeffrey A. Thomas, Gary Walters, Michael Washington, Joseph Wheeler, Jason Bridge, Barbara A. Foster, Bryan M. Gillard, Ellen Karasik, Rachna Kumar, Mark Miklos, Michael T. Moser, Scott D. Jewell, Robert G. Montroy, Daniel C. Rohrer, Dana R. Valley, David A. Davis, Deborah C. Mash, Anita H. Undale, Anna M. Smith, David E. Tabor, Nancy V. Roche, Jeffrey A. McLean, Negin Vatanian, Karna L. Robinson, Leslie Sobin, Mary E. Barcus, Kimberly M. Valentino, Liqun Qi, Steven Hunter, Pushpa Hariharan, Shilpi Singh, Ki Sung Um, Takunda Matose, Maria M. Tomaszewski, Laura K. Barker, Maghboeba Mosavel, Laura A. Siminoff, Heather M. Traino, Paul Flicek, Thomas Juettemann, Magali Ruffier, Dan Sheppard, Kieron Taylor, Stephen J. Trevanion, Daniel R. Zerbino, Brian Craft, Mary Goldman, Maximilian Haeussler, W. James Kent, Christopher M. Lee, Benedict Paten, Kate R. Rosenbloom, John Vivian and Jingchun Zhu: Lists of participants and their affiliations appear in the online version of the paper.
Xin Li, Yungil Kim, Emily K. Tsang and Joe R. Davis: These authors contributed equally to this work.
Alexis Battle and Stephen B. Montgomery: These authors jointly supervised this work.

Authors and Affiliations

Department of Pathology, Stanford University, Stanford, 94305, California, USA
Xin Li, Emily K. Tsang, Joe R. Davis, Zachary Zappala, Jason D. Merker & Stephen B. Montgomery
Department of Computer Science, Johns Hopkins University, Baltimore, 21218, Maryland, USA
Yungil Kim, Farhan N. Damani & Alexis Battle
Biomedical Informatics Program, Stanford University, Stanford, 94305, California, USA
Emily K. Tsang
Department of Genetics, Stanford University, Stanford, 94305, California, USA
Joe R. Davis, Gaelen T. Hess, Zachary Zappala, Amy Li, Michael C. Bassik & Stephen B. Montgomery
McDonnell Genome Institute, Washington University School of Medicine, St Louis, 63108, Missouri, USA
Colby Chiang, Alexandra J. Scott & Ira M. Hall
Department of Biomedical Engineering, Johns Hopkins University, Baltimore, 21218, Maryland, USA
Benjamin J. Strober
Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, 02114, Massachusetts, USA
Andrea Ganna
Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, 02142, Massachusetts, USA
Andrea Ganna
Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, 02142, Massachusetts, USA
Andrea Ganna
Department of Medicine, Washington University School of Medicine, St Louis, 63110, Missouri, USA
Ira M. Hall
Department of Genetics, Washington University School of Medicine, St Louis, 63110, Missouri, USA
Ira M. Hall
The Broad Institute of Massachusetts Institute of Technology and Harvard University, Cambridge, 02142, Massachusetts, USA
François Aguet, Kristin G. Ardlie, Beryl B. Cummings, Ellen T. Gelfand, Gad Getz, Kane Hadley, Robert E. Handsaker, Katherine H. Huang, Seva Kashin, Konrad J. Karczewski, Monkol Lek, Xiao Li, Daniel G. MacArthur, Jared L. Nedzel, Duyen T. Nguyen, Michael S. Noble, Ayellet V. Segrè, Casandra A. Trowbridge, Taru Tukiainen, Melina Claussnitzer, Lei Hou, Manolis Kellis, Yaping Liu, Benoit Molinie, Yongjin Park, Nicola J. Rinaldi, Li Wang & Nicholas Van Wittenberghe
Analytic and Translational Genetics Unit, Massachusetts General Hospital, Boston, 02114, Massachusetts, USA
Beryl B. Cummings, Konrad J. Karczewski, Monkol Lek, Daniel G. MacArthur & Taru Tukiainen
Massachusetts General Hospital Cancer Center and Department of Pathology, Massachusetts General Hospital, Boston, 02114, Massachusetts, USA
Gad Getz
Department of Genetics, Harvard Medical School, Boston, 02114, Massachusetts, USA
Robert E. Handsaker & Seva Kashin
Department of Genetics, Stanford University, Stanford, 94305, California, USA
Nathan S. Abell, Joe R. Davis, Laure Frésard, Michael J. Gloudemans, Boxiang Liu, Stephen B. Montgomery, Zachary Zappala, Joanne Chan, Ruiqi Jian, Lihua Jiang, Jin Billy Li, Qin Li, Xiao Li, Jessica Lin, Shin Lin, Sandra Linder, Kevin S. Smith, Michael P. Snyder, Hua Tang, Meng Wang & Rui Zhang
Department of Pathology, Stanford University, Stanford, 94305, California, USA
Nathan S. Abell, Brunilda Balliu, Joe R. Davis, Laure Frésard, Michael J. Gloudemans, Xin Li, Boxiang Liu, Stephen B. Montgomery, Emily K. Tsang, Zachary Zappala, Sandra Linder & Kevin S. Smith
Department of Clinical Biochemistry and Pharmacology, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva, 84105, Israel
Ruth Barshir, Omer Basha & Esti Yeger-Lotem
Department of Computer Science, Johns Hopkins University, Baltimore, 21218, Maryland, USA
Alexis Battle, Farhan N. Damani, Yungil Kim, Princy Parsana & Ashis Saha
Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Dr Aiguader 88, 08003, Barcelona, Spain
Gireesh K. Bogu, Diego Garrido-Martín, Roderic Guigo, Jean Monlong, Manuel Muñoz-Aguirre, Panagiotis Papasaikas, Ferran Reverter & Reza Sodaei
Universitat Pompeu Fabra (UPF), 08002, Barcelona, Spain
Gireesh K. Bogu, Diego Garrido-Martín, Roderic Guigo, Jean Monlong, Manuel Muñoz-Aguirre, Panagiotis Papasaikas, Ferran Reverter & Reza Sodaei
Department of Genetic Medicine and Development, University of Geneva Medical School, Geneva, 1211, Switzerland
Andrew Brown, Olivier Delaneau, Emmanouil T. Dermitzakis, Cedric Howald, Halit Ongen & Nikolaos Panousis
Institute for Genetics and Genomics in Geneva (iG3), University of Geneva, Geneva, 1211, Switzerland
Andrew Brown, Olivier Delaneau, Emmanouil T. Dermitzakis, Cedric Howald, Halit Ongen & Nikolaos Panousis
Swiss Institute of Bioinformatics, Geneva, 1211, Switzerland
Andrew Brown, Olivier Delaneau, Emmanouil T. Dermitzakis, Cedric Howald, Halit Ongen & Nikolaos Panousis
Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, 19104, Pennsylvania, USA
Christopher D. Brown & YoSon Park
New York Genome Center, New York, 10013, New York, USA
Stephane E. Castel, Sarah Kim-Hellmuth, Tuuli Lappalainen & Pejman Mohammadi
Department of Systems Biology, Columbia University Medical Center, New York, 10032, New York, USA
Stephane E. Castel, Sarah Kim-Hellmuth, Tuuli Lappalainen & Pejman Mohammadi
Department of Public Health Sciences, The University of Chicago, Chicago, 60637, Illinois, USA
Lin S. Chen, Kathryn Demanelis, Farzana Jasmine, Muhammad G. Kibriya & Brandon L. Pierce
McDonnell Genome Institute, Washington University School of Medicine, St. Louis, 63108, Missouri, USA
Colby Chiang, Ira M. Hall & Alexandra J. Scott
Department of Genetics, Washington University School of Medicine, St. Louis, 63108, Missouri, USA
Donald F. Conrad & Ira M. Hall
Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, 63108, Missouri, USA
Donald F. Conrad
Division of Genetic Medicine, Department of Medicine, Vanderbilt University Medical Center, Nashville, 37232, Tennessee, USA
Nancy J. Cox & Eric R. Gamazon
Department of Computer Science, Center for Statistics and Machine Learning, Princeton University, Princeton, 08540, New Jersey, USA
Barbara E. Engelhardt
Department of Computer Science, University of California, Los Angeles, 90095, California, USA
Eleazar Eskin, Farhad Hormozdiari, Eun Yong Kang & Serghei Mangul
Department of Human Genetics, University of California, Los Angeles, 90095, California, USA
Eleazar Eskin
Instituto de Investigação e Inovação em Saúde (i3S), Universidade do Porto, Porto, 4200-135, Portugal
Pedro G. Ferreira
Institute of Molecular Pathology and Immunology (IPATIMUP), University of Porto, Porto, 4200-625, Portugal
Pedro G. Ferreira
Department of Clinical Epidemiology, Biostatistics and Bioinformatics, Academic Medical Center, University of Amsterdam, Amsterdam, 1105, AZ, The Netherlands
Eric R. Gamazon
Department of Psychiatry, Academic Medical Center, University of Amsterdam, Amsterdam, 1105, AZ, The Netherlands
Eric R. Gamazon
Lewis Sigler Institute, Princeton University, Princeton, 08540, New Jersey, USA
Ariel D.H. Gewirtz, Brian Jo & Joshua M. Akey
Department of Operations Research and Financial Engineering, Princeton University, Princeton, 08540, New Jersey, USA
Genna Gliner
Biomedical Informatics Program, Stanford University, Stanford, 94305, California, USA
Michael J. Gloudemans & Emily K. Tsang
Institut Hospital del Mar d’Investigacions Mèdiques (IMIM), Barcelona, 08003, Spain
Roderic Guigo
Department of Medicine, Washington University School of Medicine, St. Louis, 63108, Missouri, USA
Ira M. Hall
Department of Convergence Medicine, University of Ulsan College of Medicine, Asan Medical Center, Seoul, 138-736, South Korea
Buhm Han
Department of Biomedical Engineering, Johns Hopkins University, Baltimore, 21218, Maryland, USA
Yuan He, Benjamin J. Strober & Andrew P. Feinberg
Department of Medicine, Section of Genetic Medicine, The University of Chicago, Chicago, 60637, Illinois, USA
Hae Kyung Im, Dan L. Nicolae, Meritxell Oliva, Barbara E. Stranger, Marian S. Fernando, Caroline Linke, Andrew Skol & Fan Wu
Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, 10032, New York, USA
Gen Li
Department of Biology, Stanford University, Stanford, 94305, California, USA
Boxiang Liu
Nuffield Department of Medicine, Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK
Mark I. McCarthy, Anne W. Ndungu, Anthony J. Payne & Martijn van de Bunt
Oxford Centre for Diabetes, Endocrinology and Metabolism, University of Oxford, Churchill Hospital, Oxford, OX3 7LE, UK
Mark I. McCarthy & Martijn van de Bunt
Oxford NIHR Biomedical Research Centre, Churchill Hospital, Oxford, OX3 7LJ, UK
Mark I. McCarthy
Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, 27708, North Carolina, USA
Ian C. McDowell
Human Genetics Department, McGill University, Montreal, H3A 0G1, Quebec, Canada
Jean Monlong
Departament d’Estadística i Investigació Operativa, Universitat Politècnica de Catalunya, Barcelona, 08034, Spain
Manuel Muñoz-Aguirre
Department of Statistics, The University of Chicago, Chicago, 60637, Illinois, USA
Dan L. Nicolae & Matthew Stephens
Department of Human Genetics, The University of Chicago, Chicago, 60637, Illinois, USA
Dan L. Nicolae, Matthew Stephens, Sarah Urbut & Gao Wang
Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, 27599, North Carolina, USA
Andrew B. Nobel & John J. Palowitch
Department of Biostatistics, University of North Carolina, Chapel Hill, 27599, North Carolina, USA
Andrew B. Nobel
Institute for Genomics and Systems Biology, The University of Chicago, Chicago, 60637, Illinois, USA
Meritxell Oliva, Barbara E. Stranger, Marian S. Fernando, Caroline Linke, Andrew Skol & Fan Wu
Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, 77030, Texas, USA
Christine B. Peterson
Computational Sciences, Pfizer Inc, Cambridge, 02139, Massachusetts, USA
Jie Quan & Hualin S. Xi
Universitat de Barcelona, Barcelona, 08028, Spain
Ferran Reverter
Department of Biomedical Data Science, Stanford University, Stanford, 94305, California, USA
Chiara Sabatti
Department of Statistics, Stanford University, Stanford, 94305, California, USA
Chiara Sabatti
Institute of Biophysics Carlos Chagas Filho (IBCCF), Federal University of Rio de Janeiro (UFRJ), Rio de Janeiro, 21941902, Brazil
Michael Sammeth
Department of Psychiatry, University of Utah, Salt Lake City, 84108, Utah, USA
Andrey A. Shabalin
Center for Data Intensive Science, The University of Chicago, Chicago, 60637, Illinois, USA
Barbara E. Stranger & Andrew Skol
Department of Psychiatry and Biobehavioral Sciences, University of California, Los Angeles, 90095, California, USA
Jae Hoon Sul
Department of Biostatistics, University of Michigan, Ann Arbor, 48109, Michigan, USA
Xiaoquan Wen
Bioinformatics Research Center and Departments of Statistics and Biological Sciences, North Carolina State University, Raleigh, 27695, North Carolina, USA
Fred A. Wright & Yi-Hui Zhou
National Institute for Biotechnology in the Negev, Beer-Sheva, 84105, Israel
Esti Yeger-Lotem
European Molecular Biology Laboratory, Heidelberg, 69117, Germany
Judith B. Zaugg
Department of Ecology and Evolutionary Biology, Princeton University, Princeton, 08540, New Jersey, USA
Joshua M. Akey
Altius Institute for Biomedical Sciences, Seattle, 98121, Washington, USA
Daniel Bates, Morgan Diegel, Jessica Halow, Eric Haugen, Audra Johnson, Rajinder Kaul, Kristen Lee, Jemma Nelson, Fidencio J. Neri, Richard Sandstrom & John Stamatoyannopoulos
Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, 02215, Massachusetts, USA
Melina Claussnitzer
University of Hohenheim, Stuttgart, 70599, Germany
Melina Claussnitzer
Department of Population Health Sciences, Huntsman Cancer Institute, University of Utah, Salt Lake City, 84112, Utah, USA
Jennifer A. Doherty
Center for Epigenetics, Johns Hopkins University School of Medicine, Baltimore, 21205, Maryland, USA
Andrew P. Feinberg, Kasper D. Hansen & Lindsay F. Rizzardi
Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, 21205, Maryland, USA
Andrew P. Feinberg
Department of Mental Health, Johns Hopkins University School of Public Health, Baltimore, 21205, Maryland, USA
Andrew P. Feinberg
McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, 21205, Maryland, USA
Kasper D. Hansen
Department of Biostatistics, Johns Hopkins University, Baltimore, 21205, Maryland, USA
Kasper D. Hansen & Peter F. Hickey
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, 02139, Massachusetts, USA
Lei Hou, Manolis Kellis, Yaping Liu, Yongjin Park & Nicola J. Rinaldi
Department of Medicine, University of Washington, Seattle, 98195, Washington, USA
Jessica Lin & John Stamatoyannopoulos
Division of Cardiology, University of Washington, Seattle, 98195, Washington, USA
Shin Lin
Institute for Systems Genetics, New York University Langone Medical Center, New York, 10016, New York, USA
Matthew T. Maurano
Department of Genome Sciences, University of Washington, Seattle, 98195, Washington, USA
John Stamatoyannopoulos
Division of Program Coordination, Office of Strategic Coordination, Planning and Strategic Initiatives, Office of the Director, NIH, Rockville, 20852, Maryland, USA
Concepcion R. Nierras
Division of Cancer Treatment and Diagnosis, Biorepositories and Biospecimen Research Branch, National Cancer Institute, Bethesda, 20892, Maryland, USA
Philip A. Branton, Latarsha J. Carithers, Ping Guan, Helen M. Moore, Abhi Rao & Jimmie B. Vaught
National Institute of Dental and Craniofacial Research, Bethesda, 20892, Maryland, USA
Latarsha J. Carithers
Division of Genomic Medicine, National Human Genome Research Institute, Rockville, 20852, Maryland, USA
Sarah E. Gould, Nicole C. Lockart, Casey Martin, Jeffery P. Struewing & Simona Volpi
Division of Neuroscience and Basic Behavioral Science, National Institute of Mental Health, NIH, Bethesda, 20892, Maryland, USA
Anjene M. Addington & Susan E. Koester
Division of Neuroscience and Behavior, National Institute on Drug Abuse, NIH, Bethesda, 20892, Maryland, USA
A. Roger Little
Washington Regional Transplant Community, Falls Church, 22003, Virginia, USA
Lori E. Brigham
Gift of Life Donor Program, Philadelphia, 19103, Pennsylvania, USA
Richard Hasz
LifeGift, Houston, 77055, Texas, USA
Marcus Hunter, Kevin Myer & Brian Roe
Center for Organ Recovery and Education, Pittsburgh, 15238, Pennsylvania, USA
Christopher Johns & Joseph Wheeler
LifeNet Health, Virginia Beach, Virginia, 23453, USA
Mark Johnson, Gary Walters & Michael Washington
National Disease Research Interchange, Philadelphia, 19103, Pennsylvania, USA
Gene Kopen, William F. Leinweber, John T. Lonsdale, Alisa McDonald, Bernadette Mestichelli, Michael Salvatore, Saboor Shad & Jeffrey A. Thomas
Unyts, Buffalo, 14203, New York, USA
Jason Bridge & Mark Miklos
Pharmacology and Therapeutics, Roswell Park Cancer Institute, Buffalo, New York, 14263, USA
Barbara A. Foster, Bryan M. Gillard, Ellen Karasik, Rachna Kumar & Michael T. Moser
Van Andel Research Institute, Grand Rapids, Michigan, 49503, USA
Scott D. Jewell, Robert G. Montroy, Daniel C. Rohrer & Dana R. Valley
Brain Endowment Bank, Miller School of Medicine, University of Miami, Miami, 33136, Florida, USA
David A. Davis & Deborah C. Mash
National Institute of Allergy and Infectious Diseases, NIH, Rockville, 20852, Maryland, USA
Anita H. Undale
Biospecimen Research Group, Clinical Research Directorate, Leidos Biomedical Research, Inc., Rockville, 20852, Maryland, USA
Anna M. Smith, David E. Tabor, Nancy V. Roche, Jeffrey A. McLean, Negin Vatanian, Karna L. Robinson, Leslie Sobin, Kimberly M. Valentino, Liqun Qi, Steven Hunter, Pushpa Hariharan, Shilpi Singh, Ki Sung Um, Takunda Matose & Maria M. Tomaszewski
Leidos Biomedical Research, Inc., Frederick, 21701, Maryland, USA
Mary E. Barcus
Temple University, Philadelphia, 19122, Pennsylvania, USA
Laura K. Barker, Laura A. Siminoff & Heather M. Traino
Department of Health Behavior and Policy, School of Medicine, Virginia Commonwealth University, Richmond, 23298, Virginia, USA
Maghboeba Mosavel
European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, CB10 1SD, UK
Paul Flicek, Thomas Juettemann, Magali Ruffier, Dan Sheppard, Kieron Taylor, Stephen J. Trevanion & Daniel R. Zerbino
UCSC Genomics Institute, University of California Santa Cruz, Santa Cruz, 95064, California, USA
Brian Craft, Mary Goldman, Maximilian Haeussler, W. James Kent, Christopher M. Lee, Benedict Paten, Kate R. Rosenbloom, John Vivian & Jingchun Zhu

Authors

Xin Li
Yungil Kim
Emily K. Tsang
Joe R. Davis
Farhan N. Damani
Colby Chiang
Gaelen T. Hess
Zachary Zappala
Benjamin J. Strober
Alexandra J. Scott
Amy Li
Andrea Ganna
Michael C. Bassik
Jason D. Merker
Ira M. Hall
Alexis Battle
Stephen B. Montgomery

Consortia

GTEx Consortium

Laboratory, Data Analysis & Coordinating Center (LDACC)—Analysis Working Group
- François Aguet
- Kristin G. Ardlie
- Beryl B. Cummings
- Ellen T. Gelfand
- Gad Getz
- Kane Hadley
- Robert E. Handsaker
- Katherine H. Huang
- Seva Kashin
- Konrad J. Karczewski
- Monkol Lek
- Xiao Li
- Daniel G. MacArthur
- Jared L. Nedzel
- Duyen T. Nguyen
- Michael S. Noble
- Ayellet V. Segrè
- Casandra A. Trowbridge
- Taru Tukiainen
Statistical Methods groups—Analysis Working Group
- Nathan S. Abell
- Brunilda Balliu
- Ruth Barshir
- Omer Basha
- Alexis Battle
- Gireesh K. Bogu
- Andrew Brown
- Christopher D. Brown
- Stephane E. Castel
- Lin S. Chen
- Colby Chiang
- Donald F. Conrad
- Nancy J. Cox
- Farhan N. Damani
- Joe R. Davis
- Olivier Delaneau
- Emmanouil T. Dermitzakis
- Barbara E. Engelhardt
- Eleazar Eskin
- Pedro G. Ferreira
- Laure Frésard
- Eric R. Gamazon
- Diego Garrido-Martín
- Ariel D.H. Gewirtz
- Genna Gliner
- Michael J. Gloudemans
- Roderic Guigo
- Ira M. Hall
- Buhm Han
- Yuan He
- Farhad Hormozdiari
- Cedric Howald
- Hae Kyung Im
- Brian Jo
- Eun Yong Kang
- Yungil Kim
- Sarah Kim-Hellmuth
- Tuuli Lappalainen
- Gen Li
- Xin Li
- Boxiang Liu
- Serghei Mangul
- Mark I. McCarthy
- Ian C. McDowell
- Pejman Mohammadi
- Jean Monlong
- Stephen B. Montgomery
- Manuel Muñoz-Aguirre
- Anne W. Ndungu
- Dan L. Nicolae
- Andrew B. Nobel
- Meritxell Oliva
- Halit Ongen
- John J. Palowitch
- Nikolaos Panousis
- Panagiotis Papasaikas
- YoSon Park
- Princy Parsana
- Anthony J. Payne
- Christine B. Peterson
- Jie Quan
- Ferran Reverter
- Chiara Sabatti
- Ashis Saha
- Michael Sammeth
- Alexandra J. Scott
- Andrey A. Shabalin
- Reza Sodaei
- Matthew Stephens
- Barbara E. Stranger
- Benjamin J. Strober
- Jae Hoon Sul
- Emily K. Tsang
- Sarah Urbut
- Martijn van de Bunt
- Gao Wang
- Xiaoquan Wen
- Fred A. Wright
- Hualin S. Xi
- Esti Yeger-Lotem
- Zachary Zappala
- Judith B. Zaugg
- Yi-Hui Zhou
Enhancing GTEx (eGTEx) groups
- Joshua M. Akey
- Daniel Bates
- Joanne Chan
- Lin S. Chen
- Melina Claussnitzer
- Kathryn Demanelis
- Morgan Diegel
- Jennifer A. Doherty
- Andrew P. Feinberg
- Marian S. Fernando
- Jessica Halow
- Kasper D. Hansen
- Eric Haugen
- Peter F. Hickey
- Lei Hou
- Farzana Jasmine
- Ruiqi Jian
- Lihua Jiang
- Audra Johnson
- Rajinder Kaul
- Manolis Kellis
- Muhammad G. Kibriya
- Kristen Lee
- Jin Billy Li
- Qin Li
- Xiao Li
- Jessica Lin
- Shin Lin
- Sandra Linder
- Caroline Linke
- Yaping Liu
- Matthew T. Maurano
- Benoit Molinie
- Stephen B. Montgomery
- Jemma Nelson
- Fidencio J. Neri
- Meritxell Oliva
- Yongjin Park
- Brandon L. Pierce
- Nicola J. Rinaldi
- Lindsay F. Rizzardi
- Richard Sandstrom
- Andrew Skol
- Kevin S. Smith
- Michael P. Snyder
- John Stamatoyannopoulos
- Barbara E. Stranger
- Hua Tang
- Emily K. Tsang
- Li Wang
- Meng Wang
- Nicholas Van Wittenberghe
- Fan Wu
- Rui Zhang
NIH Common Fund
- Concepcion R. Nierras
NIH/NCI
- Philip A. Branton
- Latarsha J. Carithers
- Ping Guan
- Helen M. Moore
- Abhi Rao
- Jimmie B. Vaught
NIH/NHGRI
- Sarah E. Gould
- Nicole C. Lockart
- Casey Martin
- Jeffery P. Struewing
- Simona Volpi
NIH/NIMH
- Anjene M. Addington
- Susan E. Koester
NIH/NIDA
- A. Roger Little
Biospecimen Collection Source Site—NDRI
- Lori E. Brigham
- Richard Hasz
- Marcus Hunter
- Christopher Johns
- Mark Johnson
- Gene Kopen
- William F. Leinweber
- John T. Lonsdale
- Alisa McDonald
- Bernadette Mestichelli
- Kevin Myer
- Brian Roe
- Michael Salvatore
- Saboor Shad
- Jeffrey A. Thomas
- Gary Walters
- Michael Washington
- Joseph Wheeler
Biospecimen Collection Source Site—RPCI
- Jason Bridge
- Barbara A. Foster
- Bryan M. Gillard
- Ellen Karasik
- Rachna Kumar
- Mark Miklos
- Michael T. Moser
Biospecimen Core Resource—VARI
- Scott D. Jewell
- Robert G. Montroy
- Daniel C. Rohrer
- Dana R. Valley
Brain Bank Repository—University of Miami Brain Endowment Bank
- David A. Davis
- Deborah C. Mash
Leidos Biomedical—Project Management
- Anita H. Undale
- Anna M. Smith
- David E. Tabor
- Nancy V. Roche
- Jeffrey A. McLean
- Negin Vatanian
- Karna L. Robinson
- Leslie Sobin
- Mary E. Barcus
- Kimberly M. Valentino
- Liqun Qi
- Steven Hunter
- Pushpa Hariharan
- Shilpi Singh
- Ki Sung Um
- Takunda Matose
- Maria M. Tomaszewski
ELSI Study
- Laura K. Barker
- Maghboeba Mosavel
- Laura A. Siminoff
- Heather M. Traino
Genome Browser Data Integration & Visualization—EBI
- Paul Flicek
- Thomas Juettemann
- Magali Ruffier
- Dan Sheppard
- Kieron Taylor
- Stephen J. Trevanion
- Daniel R. Zerbino
Genome Browser Data Integration & Visualization—UCSC Genomics Institute, University of California Santa Cruz
- Brian Craft
- Mary Goldman
- Maximilian Haeussler
- W. James Kent
- Christopher M. Lee
- Benedict Paten
- Kate R. Rosenbloom
- John Vivian
- Jingchun Zhu

Contributions

X.L., Y.K., E.K.T., J.R.D., A.B. and S.B.M. designed the study, performed analyses and wrote the manuscript. Y.K., F.N.D. and A.B. developed RIVER. G.T.H., A.L. and M.C.B. designed and executed the validation using CRISPR–Cas9. C.C., A.J.S. and I.M.H. provided the set of structural variants. J.D.M. provided the lists of curated cancer and cardiovascular disease genes. Z.Z., B.J.S. and A.G. contributed to analysis and feedback.

Corresponding authors

Correspondence to Alexis Battle or Stephen B. Montgomery.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Additional information

Reviewer Information: Nature thanks E. Birney, A. Clark and Y. Gilad for their contribution to the peer review of this work.

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Figure 1 PEER correction.

a, Adjusted R² between top 15 PEER factors and top 20 sample (left) and subject (right) covariates in an example tissue, skeletal muscle. Covariates were ranked by the average adjusted R² across all PEER factors and hierarchically clustered. The corresponding data for all tissues are provided in Supplementary Tables 1, 2. b, Adjusted R² between the total expression component removed by PEER in each tissue and the top 20 sample (left) and subject (right) covariates. The covariates were ranked by the average adjusted R² across all tissues, and both axes were hierarchically clustered. White denotes missing values, and tissues are coloured as in Fig. 1. PEER factors captured slightly different covariates across tissues, with a noticeable difference between the brain and other tissues. c, Rare variant enrichments as in Fig. 2a for different levels of PEER correction. The fully corrected data show substantially stronger rare variant enrichments than the two partially corrected datasets.

Extended Data Figure 2 Distribution of the number of genes with a multi-tissue outlier.

a, Distribution of the number of genes for which each individual was a multi-tissue outlier. Each individual was an outlier for a median of 10 genes. Individuals with 50 or more outliers are coloured in grey and were excluded from downstream analyses. b–f, Distribution of the number of genes for which individuals, stratified by common covariates, were multi-tissue outliers. For race and sex, we compared the distributions using an unsigned Wilcoxon rank-sum test, whereas we used Spearman’s ρ to test for association with the remaining covariates. Only age (Spearman’s ρ = 0.10, P = 0.033) and ischaemic time (Spearman’s ρ = 0.18, P = 0.00022) were nominally associated with the number of outlier genes per individual. The association with age fails to achieve significance after correcting for multiple testing using the Bonferroni method. Note that in b we only tested for a significant difference in the distribution of the number of outlier genes between white and black individuals, because there were too few individuals in the other groups. g, Enrichments as shown in Fig. 2a either including all individuals, or excluding individuals that are outliers for 50 or more genes (matches Fig. 2a) or 30 or more genes.
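
The Bonferroni statement in this caption can be checked with a short calculation. The sketch below is illustrative only: the caption does not state how many tests were corrected for, so it assumes five tests (one per covariate panel, b–f) at a family-wise level of 0.05, and the function name is hypothetical.

```python
def passes_bonferroni(p_value, n_tests=5, alpha=0.05):
    """Return True if p_value is below the Bonferroni-adjusted threshold.

    n_tests=5 is an assumption (one test per panel b-f); the caption
    does not report the number of tests used for the correction.
    """
    threshold = alpha / n_tests  # 0.05 / 5 = 0.01 under the assumption
    return p_value < threshold


# P values reported in the caption
age_p = 0.033
ischaemic_time_p = 0.00022

print("age significant:", passes_bonferroni(age_p))
print("ischaemic time significant:", passes_bonferroni(ischaemic_time_p))
```

Under this assumption the adjusted threshold is 0.01, so the age association (P = 0.033) no longer passes while the ischaemic time association (P = 0.00022) does, in line with the caption.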

Extended Data Figure 3 Single-tissue outlier replication.

a, Correlation between the replication proportions (see Methods) obtained from all samples and from a subset of 70 overlapping individuals per tissue pair (Pearson’s correlation, P < 2.2 × 10⁻¹⁶). When restricting to 70 individuals, the replication rates decreased more for discovery tissues with larger sample sizes in the full dataset, indicating that replication rates were underestimated for tissues with small sample sizes. b, Correlation between replication in the 70 individuals used for discovery and replication assessed in a set of 70 individuals that included the outlier individual and 69 individuals excluded from the discovery set (Pearson’s correlation, P < 2.2 × 10⁻¹⁶). Replication was higher when computed in the discovery individuals rather than in a distinct set of individuals. c, Single-tissue outlier replication using all individuals, as in Fig. 1b, but data are only shown for pairs with at least 70 overlapping individuals. Tissue pairs with insufficient overlap are in grey. d, For each pair of tissues with sufficient samples, outlier discovery and replication using 70 individuals sampled in both tissues. The replication values decreased compared with replication performed in all individuals (c), particularly for tissues with large sample sizes in the complete dataset. However, the pattern of replication, with more similar tissues having higher replication rates, is maintained. e, For each tissue, the proportion of (individual, gene) outlier pairs where the individual was also a multi-tissue outlier for the gene. This proportion was positively correlated with the tissue sample size (P = 1.4 × 10⁻¹⁰). Points are coloured by tissue as in Fig. 1.

Extended Data Figure 4 Number of rare variants per individual and population structure.

a, The distribution of the number of rare variants of each type for individuals of European descent (reported as white). Certain individuals had many more rare variants than the population median (vertical black line). b, Principal component analysis of all individuals. Individuals are plotted according to their first two genotype principal components (PCs) and coloured by their reported ancestry. White individuals with WGS data, included in a, are coloured in a lighter shade of blue and those with 60,000 or more rare variants are circled in black. The individuals with an excess of rare variants probably had African or Asian admixture. c, Enrichments as in Fig. 2a and excluding individuals with >60,000 rare variants (circled in b), which did not substantially affect the enrichment patterns. d, European population allele frequency distributions in the 1000 Genomes Project of rare SNVs and indels used in our analysis. The rare variants included in our analysis were constrained to have MAF ≤ 0.01 in the 1000 Genomes European super population, but they were also relatively rare in each of the individual European populations.

Extended Data Figure 5 Comparison of overexpression and underexpression outliers.

a, ASE at rare exonic variants. ASE is shown as the ratio of the number of reads supporting the minor allele to the total number of reads at the site. If the rare variant is driving the extreme expression, we expect this ratio to be below 0.5 for underexpression outliers and above 0.5 for overexpression outliers. Rare coding variants were enriched for ASE in the direction of the extreme expression effect (two-sided Wilcoxon rank-sum tests, each nominal P < 4.0 × 10⁻⁸). b, Expression level distribution of all genes and genes with overexpression or underexpression outliers. Expression is shown as the log₂ of the median (RPKM + 2), where the median was first taken across individuals in each tissue then across expressed tissues for each gene. For genes with low expression, even an RPKM of 0 may not yield a Z-score ≤ −2. Indeed, underexpression outliers were depleted among lowly expressed genes whereas the opposite was true of overexpression outliers (two-sided Wilcoxon rank-sum test comparing to all genes, P < 2.2 × 10⁻¹⁶ for both overexpression and underexpression). c, Feature enrichments (as in Fig. 3b) shown separately for overexpression and underexpression outliers.
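
The ASE ratio described in panel a can be illustrated with a minimal sketch. The function names and read counts below are hypothetical and are not taken from the study's pipeline; the code only restates the definition given in the caption (minor-allele reads divided by total reads, compared against 0.5).

```python
def ase_ratio(minor_allele_reads, total_reads):
    """Ratio of reads supporting the minor (rare) allele to all reads at the site."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= minor_allele_reads <= total_reads:
        raise ValueError("minor_allele_reads must lie between 0 and total_reads")
    return minor_allele_reads / total_reads


def matches_outlier_direction(ratio, direction):
    """Check whether the ASE ratio points the same way as the expression outlier.

    direction is "under" (ratio expected below 0.5) or "over" (above 0.5).
    """
    if direction == "under":
        return ratio < 0.5
    if direction == "over":
        return ratio > 0.5
    raise ValueError("direction must be 'under' or 'over'")


# Hypothetical site: 12 of 60 reads carry the rare allele
ratio = ase_ratio(12, 60)
print(ratio, matches_outlier_direction(ratio, "under"))
```

A ratio well below 0.5 at a heterozygous site, as in this made-up example, would be consistent with the rare allele reducing expression.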

Extended Data Figure 6 Extended rare variant enrichments.

a, For each tissue, rare SNV enrichment in single-tissue outliers compared with non-outliers at the same genes for increasing Z-score thresholds. Enrichments were calculated as in Fig. 2. The rare variant enrichments varied between tissues, although the overall pattern mirrored that of multi-tissue outliers when combining all the tissues (Fig. 2b). The high variance in the enrichments underscores the noise in single-tissue outlier discovery. b, As in Fig. 2a, enrichment for SNVs, indels and structural variants in outliers compared with the same genes in non-outliers, either including all rare variants or only those outside protein-coding or lincRNA exons in Gencode v.19. The enrichment of rare variants was weaker, but still significant, for all variant types when excluding exonic regions.

Extended Data Figure 7 Enrichment of an extended list of functional genomic annotations.

Log odds ratios and 95% Wald confidence intervals from logistic regression models of outlier status as a function of each genomic feature. Features were calculated among rare SNVs within 10 kb of the gene. When more than one feature corresponded to the same genomic annotation (for example, the number or the presence of rare variants in a splice region; Supplementary Table 3b), the feature with the highest enrichment is shown. Lighter shading indicates a non-significant log odds ratio (nominal P > 0.05).
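
As a rough illustration of a log odds ratio with a 95% Wald confidence interval, the sketch below treats the special case of a single binary feature, where the logistic regression estimate reduces to the log odds ratio of a 2 × 2 table. The counts and the function name are hypothetical; the models in the paper may use continuous or scaled features, which this sketch does not cover.

```python
import math


def log_odds_ratio_wald(a, b, c, d, z=1.959964):
    """Log odds ratio and Wald interval from a 2 x 2 table.

    a: outliers with the feature      b: outliers without it
    c: non-outliers with the feature  d: non-outliers without it
    z: standard normal quantile (1.959964 gives a 95% interval)
    """
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, (log_or - z * se, log_or + z * se)


# Hypothetical counts, for illustration only.
log_or, (lower, upper) = log_odds_ratio_wald(a=30, b=70, c=10, d=90)
print(f"log OR = {log_or:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

An interval that excludes 0 corresponds to a nominally significant enrichment at the 0.05 level, matching the shading convention in the legend.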

Extended Data Figure 8 Evolutionary constraint and regulatory control of multi-tissue outlier genes.

a, Odds ratio of being intolerant to synonymous and missense variants for genes with multi-tissue eQTLs (eGenes), genes with multi-tissue outliers, OMIM and GWAS genes (see Methods). As expected, GWAS and OMIM genes showed no enrichment or depletion for synonymous variation intolerant genes. Genes with multi-tissue outliers and eGenes showed slight depletion for these genes. Genes with multi-tissue outliers and eGenes were strongly depleted for genes intolerant to missense variation compared with OMIM and GWAS genes. b, Comparison of the depletion of disease genes among genes with a multi-tissue outlier and eGenes. Similar to Fig. 4c, bars represent 95% confidence intervals from Fisher’s exact test. c, For each of ten gene lists, the difference in the mean number of variants near genes in the list compared with the mean for all other annotated genes. Results are stratified by minor allele frequency, and bars indicate the 95% confidence interval for the difference from a two-sided t-test. Disease genes had more variants than control genes in general, and the difference was particularly striking for rare variants. This suggests that the depletion of outliers and eQTLs for certain groups of disease genes is not due to less rare variation near these genes. Instead, we hypothesize that the variation around these genes in our healthy cohort is less likely to have large regulatory effects. d, Distribution of the number of tissues with an eQTL for genes with and without outliers. Genes with multi-tissue outliers had eQTLs in more tissues than genes without. This suggests that they are more susceptible to shared regulatory control. This result held for both multi-tissue eQTL definitions (see Methods; Meta-Tissue: 23 versus 3 tissues, Wilcoxon rank-sum test P < 2.2 × 10⁻¹⁶; tissue-by-tissue: 7 versus 3 tissues, P < 2.2 × 10⁻¹⁶). 
e, This eGene enrichment was robust across different mean expression levels across tissues (two-sided Wilcoxon rank-sum tests, Bonferroni-adjusted P < 1 × 10⁻¹¹).

Extended Data Figure 9 RIVER performance.

a, Comparison between the predictive power of RIVER and that of the genomic annotation model, as in Fig. 5a, across different Z-score thresholds for outlier calling. Increasing the Z-score threshold improved AUC values, but reduced the number of outlier examples, which led to noisy receiver operating characteristic curves. b, Stability analysis of estimated parameters with different parameter initializations (see Methods). c, Correlations, using Kendall’s τ, between the fraction of tissues with |Z-score| ≥ 2 and the test probabilities from the genomic annotation model (left) and RIVER (right). We calculated test posterior probabilities using tenfold cross-validation and only considered individual and gene pairs with a fraction of tissues with |Z-score| ≥ 2 that was significantly different from 0.05 (one-sided binomial exact test, Benjamini–Hochberg adjusted P < 0.05). d, P values from a one-sided Fisher’s exact test measuring the association between allelic imbalance (see Methods) and the posterior probability of a functional rare variant according to the genomic annotation model and RIVER. The posterior probabilities from RIVER were more strongly associated with allelic imbalance across all four thresholds tested. e, Assessment of the advantage of incorporating gene expression with genomic annotations for predicting outlier status using simplified supervised models (see Methods). All models showed consistent improvement of the log odds ratio of outlier status when incorporating expression. f, Performance of models with 12 individual genomic features compared with the genomic annotation model and RIVER. Some models with single genomic features provided slightly better AUCs compared with the genomic annotation model, but they were not statistically different. On the other hand, RIVER predicted the effects of rare variants significantly better than each of the models that included a single feature.
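
The filtering step in panel c (a one-sided binomial exact test against 0.05, followed by Benjamini–Hochberg adjustment) could be sketched as follows. This is a minimal example under stated assumptions: the alternative is taken to be "greater than 0.05", the tissue counts are invented, and the function names are not from the paper.

```python
from math import comb


def binom_upper_tail(k, n, p=0.05):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical (individual, gene) pairs: (tissues with |Z| >= 2, tissues measured).
pairs = [(8, 20), (1, 20), (3, 15)]
raw = [binom_upper_tail(k, n) for k, n in pairs]
adj = benjamini_hochberg(raw)
kept = [pair for pair, q in zip(pairs, adj) if q < 0.05]
print(kept)
```

Only the pairs retained by this filter would then enter the Kendall's τ comparison between the fraction of outlier tissues and the model probabilities.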

Extended Data Figure 10 Evaluation of known pathogenic variants using RIVER.

a, The 27 GTEx rare SNVs reported as disease variants in ClinVar. b–d, Relative frequency of the |median Z-score| (b), posterior probabilities from the genomic annotation model (c) and posterior probabilities from RIVER (d) for all (individual, gene) pairs (grey) and 27 pairs with pathogenic variants from ClinVar (orange). P values were computed using two-sided Wilcoxon rank-sum tests. We note that rare indels and structural variants were not found nearby the genes in the individuals carrying these pathogenic variants. e, f, The Z-score and RPKM distributions for SBDS (e) and GAMT (f) were compared with the values from four individuals carrying regulatory pathogenic variation (red asterisks and triangles). The median Z-score and RPKM values across tissues are shown at the top of each plot (black circle). Tissues are coloured as in Fig. 1 and sorted in decreasing order of the difference between the average Z-score of individual(s) with a regulatory pathogenic variant and the median Z-score for the tissue. Three individuals carrying a total of two unique rare variants are shown for SBDS. Both variants are associated with the recessive Shwachman–Diamond syndrome, which causes systemic symptoms that include pancreatic, neurological and haematologic abnormalities⁴⁶ and can disrupt fibroblast function⁴⁷. The individuals, being heterozygous for these variants, lacked the disease phenotype. Nonetheless, we saw extreme underexpression of SBDS across almost all tissues in these individuals, including brain tissues, fibroblasts and pancreas. One individual had a rare variant for GAMT associated with cerebral creatine deficiency syndrome 2, shown to cause neurological deficiencies and also lead to low body fat⁴⁸. The individual had the most extreme underexpression in (subcutaneous) adipose tissue.

Extended Data Figure 11 Validation of large-effect rare variants using CRISPR–Cas9 genome editing.

a, SNVs in outliers and controls assayed for expression effects using CRISPR–Cas9 genome editing. For common SNVs in controls (MAF >1% in the GTEx cohort), the range of median Z-scores and RIVER scores are given for all individuals with the minor allele. Missing values indicate that the variant was absent from our cohort. b, sgRNAs for four SNVs found in outliers and four control SNVs in the same genes. c, Alternate (installed) gDNA and cDNA allele proportions for four rare, coding SNVs in outliers (left) and four matched control SNVs (right). Each gDNA and cDNA sample was sequenced in triplicate (technical replicates). Asterisks denote the Bonferroni-adjusted significance level from a two-sided t-test of the difference between the gDNA and cDNA alternate allele proportions: ·P < 0.05, *P < 0.01, **P < 0.001. Although one control SNV showed a significant difference in the alternate allele proportion between cDNA and gDNA, it displayed an increase rather than a decrease in expression.

Supplementary information

Reporting Summary (PDF 68 kb)

Supplementary Table 1

This table contains the adjusted R² values between the top PEER factors estimated for each tissue and the sample covariates available for that tissue. Data for each tissue are presented in separate sheets named according to the tissue ID. Example adjusted R² data for sample covariates for skeletal muscle (tissue ID: Muscle_Skeletal) are presented in Extended Data Figure 1a (left). (XLS 554 kb)

Supplementary Table 2

This table contains the adjusted R² values between the top PEER factors estimated for each tissue and the subject covariates available for that tissue. Data for each tissue are presented in separate sheets named according to the tissue ID. Example adjusted R² data for subject covariates for skeletal muscle (tissue ID: Muscle_Skeletal) are presented in Extended Data Figure 1a (right). (XLS 1458 kb)

Supplementary Table 3

This table contains descriptions of the variant features for the enrichment analyses and training of RIVER. Sheet ‘A-Selected features’ describes the disjoint variant classes whose enrichments among outliers are presented in Figure 3 and Extended Data Figure 5c. Sheet ‘B-RIVER features’ describes the gene-level features used for training RIVER. The enrichments of these features among outliers are presented in Extended Data Figure 7. (XLSX 31 kb)

Rights and permissions

This work is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) licence. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons licence, users will need to obtain permission from the licence holder to reproduce the material. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Li, X., Kim, Y., Tsang, E. et al. The impact of rare variation on gene expression across tissues. Nature 550, 239–243 (2017). https://doi.org/10.1038/nature24267

Received: 08 September 2016
Accepted: 13 September 2017
Published: 12 October 2017
Issue Date: 12 October 2017
DOI: https://doi.org/10.1038/nature24267
