Sequence variation between 462 human individuals fine-tunes functional sites of RNA processing

Ferreira, Pedro G.; Oti, Martin; Barann, Matthias; Wieland, Thomas; Ezquina, Suzana; Friedländer, Marc R.; Rivas, Manuel A.; Esteve-Codina, Anna; Rosenstiel, Philip; Strom, Tim M; Lappalainen, Tuuli; Guigó, Roderic; Sammeth, Michael

doi:10.1038/srep32406

Published: 12 September 2016

Pedro G. Ferreira^1,2,3,4,
Martin Oti⁵,
Matthias Barann⁶,
Thomas Wieland⁷,
Suzana Ezquina⁸,
Marc R. Friedländer⁹,
Manuel A. Rivas¹⁰,
Anna Esteve-Codina^11,12,
The GEUVADIS Consortium,
Philip Rosenstiel⁶,
Tim M Strom^7,13,
Tuuli Lappalainen^2,14,15,
Roderic Guigó^1,16 &
…
Michael Sammeth^1,5,17

Scientific Reports volume 6, Article number: 32406 (2016)

Abstract

Recent advances in the cost-efficiency of sequencing technologies enabled the combined DNA- and RNA-sequencing of human individuals at the population-scale, making genome-wide investigations of the inter-individual genetic impact on gene expression viable. Employing mRNA-sequencing data from the Geuvadis Project and genome sequencing data from the 1000 Genomes Project we show that the computational analysis of DNA sequences around splice sites and poly-A signals is able to explain several observations in the phenotype data. In contrast to widespread assessments of statistically significant associations between DNA polymorphisms and quantitative traits, we developed a computational tool to pinpoint the molecular mechanisms by which genetic markers drive variation in RNA-processing, cataloguing and classifying alleles that change the affinity of core RNA elements to their recognizing factors. The in silico models we employ further suggest RNA editing can moonlight as a splicing-modulator, albeit less frequently than genomic sequence diversity. Beyond existing annotations, we demonstrate that the ultra-high resolution of RNA-Seq combined from 462 individuals also provides evidence for thousands of bona fide novel elements of RNA processing—alternative splice sites, introns and cleavage sites—which are often rare and lowly expressed but in other characteristics similar to their annotated counterparts.

Introduction

In eukaryotes–especially in mammals–functional mRNAs depend crucially on the correct processing of transcribed sequences, governed by (alternative) splicing and 3′ end formation¹. At the molecular level these reactions rely on the recognition of the corresponding core RNA elements by different factors involved in transcript processing, i.e., components of the splicing machinery (e.g., U1 and U2) that target the splice site sequences in order to remove introns² and polyadenylation signals that correspondingly bind to the Cleavage/Polyadenylation Specificity Factor (CPSF) for initiating the 3′ formation^3,4. In addition to these central elements, modern molecular biology has demonstrated several scenarios of more complex splicing reactions that regulate the correct abundance of alternative gene products, involving accessory proteins, non-coding RNAs and also epigenetic factors. However, these mechanisms follow very cell-type and gene-specific rules that are not applicable in the general case^5,6,7,8,9.

The genomic sequence varies from individual to individual, and published case studies already show that genetic markers can affect the control of RNA processing^10,11,12. Particularly in human, the causal DNA variants of several diseases have been demonstrated to tamper with the control of splicing^13,14,15. Traditionally, best practices for carrying out systematic studies on splicing mechanisms involve specifically designed mutagenesis experiments in minigenes^16,17, which, despite their evident usefulness, are restricted to a single locus and mutation in each experiment¹⁸. Predominantly hampered by the lack of genome-wide genotype and phenotype data across a sufficient number of individuals, mechanistic investigations of differences in RNA-processing throughout populations have so far been limited to small numbers of genes and individuals^19,20,21,22,23. However, the advent of high-throughput sequencing technologies also heralded a new generation of population-scale projects that analyse combined DNA and RNA sequencing across multiple individuals. Such studies generally focus on identifying which genetic elements are statistically associated with a certain phenotype, usually defined as a quantitative trait locus (QTL) resolved at gene-, transcript- or exon-level, rather than building hypotheses about how these phenotypic changes are mechanistically projected from the DNA to the RNA molecules^24,25,26,27,28,29,30,31.

In our present work, we employ data from the Geuvadis Project that provides deep RNA-sequencing in lymphoblastoid cell lines (LCLs) collected from 462 individuals of five populations genotyped in the 1000 Genomes Project³². The Geuvadis RNA-Seq experiments are described extensively in Lappalainen et al.³³ with a detailed analysis of the technical variation in ‘t Hoen et al.³⁴. Our main study³³ already used this data set to map and to characterize regulatory variation, showing by expression QTL (eQTL) analysis that genetic control of gene expression and transcript processing appears largely independent. Here, we drill into the molecular mechanisms of RNA modifications that are modulated by genetic polymorphisms in the sequence motifs of annotated splice donor and splice acceptor sites at the 5′ and 3′ ends of introns³⁵, as well as in poly-A signals affecting the 3′ formation of transcripts^36,37. Beyond genotypes, our study also extends to the effect of additional sequence variants in functional elements that are likely due to RNA editing mediated by the adenosine deaminase acting on RNA (ADAR) enzyme, as observed by divergences of the RNA-Seq reads from the corresponding DNA sequencing data. Combining the resolution of sequencing transcriptomes from hundreds of individuals in a population-scale project, we also pinpoint rare and therefore often not annotated transcriptional elements, i.e. splice sites, introns and cleavage sites. Altogether, our studies describe a comprehensive classification and comparison of the different ways in which RNA processing can be affected by these sources of sequence variation and serve as a reference for forthcoming mechanistic studies on RNA regulation by minority alleles.

Results

Genomic variants in splice sites can affect the splicing potential positively or negatively

In order to investigate the molecular mechanisms that cause splicing variation between populations, we focused on variants that directly affect the affinity of annotated splice sites, considering an informative sequence of 9nt for splice donors including the GT dinucleotide and 27nt for splice acceptors that include the AG dinucleotide and additionally the typical area of the preceding polypyrimidine tract (see Methods). When superimposing the 1000 Genomes DNA polymorphisms³² onto the Gencode version 12 reference transcriptome³⁸, we find that 10.7% (51,342 out of 477,880) of the annotated splice sites harbor one (92% of the 51,342) or multiple sequence polymorphisms in the core splice site motif (up to seven polymorphic positions per splice site, Supplementary Fig. 1a). Splice sites exhibit a depletion of indels (2.2% vs. 3.6% indels overall, p-value = 0.017). Also, allele frequencies of indels in splice sites are shifted to lower values (median frequency = 0.039 vs. 0.049 for indels not affecting splice sites, p-value = 0.11, Mann-Whitney-Wilcoxon (MWW) test), likely due to purifying selection against large genomic perturbations in functional elements³², albeit coding sequences with <0.5% indels exhibit even higher depletions. Furthermore, the frequency of single nucleotide polymorphisms (SNPs) occurring at certain positions of the splice site sequence is negatively correlated with the information content of the consensus motif, and the dinucleotides involved in the splicing reaction are mostly free of sequence polymorphisms (Fig. 1a,b).
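The splice-site windows described above can be illustrated with a minimal sketch. It uses the intervals given in the Methods ([−2; 7] for donors, −24 to +3 for acceptors, with no position 0). The function names, the 0-based forward-strand coordinates and the toy sequence are assumptions made for this illustration and are not taken from the Scorer tool.

```python
def donor_window(seq: str, intron_start: int) -> str:
    """9-nt donor window: 2 exonic bases followed by the first 7 intronic bases.

    `intron_start` is the 0-based index of the first intronic base (the G of GT).
    """
    lo, hi = intron_start - 2, intron_start + 7
    if lo < 0 or hi > len(seq):
        raise ValueError("donor window exceeds sequence bounds")
    return seq[lo:hi]


def acceptor_window(seq: str, intron_end: int) -> str:
    """27-nt acceptor window: last 24 intronic bases followed by 3 exonic bases.

    `intron_end` is the 0-based index of the last intronic base (the G of AG).
    """
    lo, hi = intron_end - 23, intron_end + 4
    if lo < 0 or hi > len(seq):
        raise ValueError("acceptor window exceeds sequence bounds")
    return seq[lo:hi]


# Toy forward-strand example (hypothetical sequence, not real data).
exon1 = "ACGCAG"
intron = "GTAAGT" + "C" * 20 + "TTTTCAG"
exon2 = "GATC"
seq = exon1 + intron + exon2
intron_start = len(exon1)
intron_end = len(exon1) + len(intron) - 1

donor = donor_window(seq, intron_start)
acceptor = acceptor_window(seq, intron_end)
```

In this layout the GT dinucleotide sits at indices 2 and 3 of the donor window and the AG dinucleotide at indices 22 and 23 of the acceptor window; minus-strand sites would additionally need reverse complementing.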

Following earlier reports that genetic polymorphisms can directly affect splicing^39,40, we computed splicing scores traditionally used in gene finding for evaluating the affinity of an RNA sequence to the splicing machinery in a systematic manner (Methods). Gene finders usually score potential splice sites in order to predict gene structures; we, however, created a high-throughput tool for studying the effects of sequence variation in splice sites by employing these scoring schemes in an introspective manner, i.e. a posteriori given a set of splice sites. In technical terms, our “Scorer” tool avoids the computation of a majority of hypothetical splice sites in a genome and the associated overhead of filtering these predictions with respect to a given set of genes, and it additionally accepts a list of specific sequence variants defined relative to the corresponding reference genome. For scoring splice sites, we employed the Hidden Markov Model (HMM) scoring matrices provided by the gene predictor GeneID⁴¹, which we further evaluate in the following with respect to their capabilities of introspectively evaluating splice site affinities based on the Geuvadis dataset.

Supplementary Fig. 1b shows that the implemented HMM model predicts different scores for donor and for acceptor sites, however, the scores computed for alternative splice sites and exons are lower than those for sites that are constitutively spliced, confirming earlier observations that modification of splicing can be driven by less efficient binding of splicing factors to the RNA sequence⁴². Turning to our Geuvadis phenotype data, we reassuringly observe several examples where the RNA-Seq splice-junction coverage supports our predictions of variant effects in the expected manner. In order to analyse how the predicted splice score of variants correlates with our RNA-seq data, we first studied the correlation between changes in the HMM score, measured as the difference between the score computed for the GRCh37 reference genome splice site sequence and the corresponding sequence with the annotated genomic variants and percent-spliced-in (PSI) scores⁴³ of alternatively included exons (0.2 < PSI < 0.8 in >80% of the individuals). We found that exons with variants that lower the computed splice site score (“negative” effects in Fig. 1c) exhibit low inclusion levels even in individuals carrying the reference allele (median PSI score 0.37), whereas variants with “positive” effects target preferentially the flanks of exons that are already relatively highly included employing the reference allele (median PSI score 0.76). The exon inclusion level then further gradually increases/decreases in individuals accumulating more variants with positive and respectively negative effects in their splice sites (Fig. 1c). In a nutshell, our analyses demonstrate that between individuals the usage of splice sites and of entire exons can be negatively as well as positively controlled by genetic variants.
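The selection of alternatively included exons (0.2 < PSI < 0.8 in >80% of the individuals) can be sketched as a simple filter. This is only an illustration: the function name is hypothetical, PSI values are assumed to be precomputed per individual, and how individuals without a PSI estimate were handled is not stated in the text, so they are simply assumed to be absent from the input.

```python
from typing import Sequence


def is_alternatively_included(psi_values: Sequence[float],
                              low: float = 0.2,
                              high: float = 0.8,
                              min_fraction: float = 0.8) -> bool:
    """Return True if more than `min_fraction` of individuals have low < PSI < high."""
    if not psi_values:
        return False  # assumption: exons without any PSI estimate are excluded
    in_range = sum(1 for psi in psi_values if low < psi < high)
    return in_range / len(psi_values) > min_fraction


# Toy example: 9 of 10 individuals show intermediate inclusion.
mostly_intermediate = is_alternatively_included([0.5] * 9 + [0.9])
# Toy example: exactly 80% intermediate does not exceed the ">80%" cutoff.
borderline = is_alternatively_included([0.5] * 8 + [0.9] * 2)
```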

To further evaluate and to classify the predicted HMM score changes, we compared them to corresponding predictions based on Position Weight Matrices (PWMs) from the complementary splice site discovery database SpliceRack⁴⁴, providing the reference and variant splice site sequences collected by our Scorer tool. We analyzed different thresholds on the computed HMM score differences below which we do not consider a change in the score between the reference and the alternative allele of a splice site as biologically meaningful. We then classified sequence alterations for which we predict positive score changes above the chosen threshold as “enhancing” variants and correspondingly negative score changes exceeding the threshold as “weakening” variants. Sequence polymorphisms that lead to score deviations less than the selected threshold are considered as “neutral” variants. When comparing for each threshold the classification by the HMM model to corresponding PWM-based calculations (Fig. 1d summarizes the systematic study shown in Supplementary Fig. 2), we observe clear enrichment of shared predictions in all three categories (i.e., “weakening”, “neutral” and “enhancing” variants) at all thresholds, peaking at a threshold of 1.5 for donor and 1.0 for acceptor sites (p-value < 3e-323 at all thresholds, chi-squared test, Fig. 1e). The high agreement between both independent scoring schemes suggests that splice site scores are primarily a function of the analyzed sequence rather than the model employed to compute the score.

Splice site disrupting variants are rare in the genome and in the gene pool

In contrast to PWM estimates, the HMM model also pinpoints sequences with consecutive bases that have not been observed in the training set of splice sites used to establish the model (Methods). We therefore extend our classification to “activating” and “disrupting” variants for comparisons where the reference or alternative allele exhibit such splice site-absent sequences. Such variants include previously described SNPs that trigger alternative splice site usage between individuals by switching on/off cryptic splice sites. In these cases, homozygous individuals exhibit exclusively the use of one or the other exon boundary, whereas heterozygous individuals provide evidence of both splice sites being used (Fig. 2a–c).

Figure 2d summarizes the distribution of the different variant classes considered across all splice sites and individuals in the Geuvadis dataset and shows that the major part of SNPs in splice sites indeed fine-tunes the splicing activity, with a notably higher fraction of splicing weakening than enhancing variants (~17% vs. 4%). Disrupting variants (~10.5%) are less frequent, and only an exceptional minority (<0.5%) of SNPs in Gencode splice sites is activating. The differences in the relative proportion of disrupting vs. activating variants, and similarly also of weakening vs. enhancing variants, are presumably caused by a bias for functional alleles in the GRCh37 reference^45,46. Since our classification of the variant effect depends by definition on the allele included in the human reference genome, the Geuvadis data suggests that in total ~22% of splice sites with genetic variants are modified in their splicing potential, about half of them severely by entirely disrupting the splicing activity, compared to a dominating subset of ~68% variants without predicted effects.

Our classification of genetic variants on splice sites is based on the effect of the non-reference allele, which corresponds to the derived allele when assuming that the reference genome represents the ancestral state. However, this is a priori not always the case. We therefore measured for variants in each variant class the global derived allele frequency (DAF), i.e., the frequency of the non-ancestral allele (Methods). Figure 2e shows that splicing-disrupting and also -weakening sequence polymorphisms are significantly more enriched (p-value ~ 2e-3 and 2e-4, Kolmogorov-Smirnov (KS) test) in low derived allele frequencies as compared to neutral variants. Enhancing variants on the contrary are shifted towards higher DAFs (p-value ~ 2e-4, KS test) and activating variants differ substantially in their global DAF distribution from all other splice site variant classes: 72% of activating SNPs exhibit DAFs >0.1 (p-value ~ 9e-5 compared to the distribution of neutral variants, KS test). Our results imply that activating variants are common variants for which the reference assembly of the human genome actually describes a low-frequency derived allele that disrupts the splice site.

To further estimate the degree to which the Geuvadis experiment can complement current knowledge about transcript annotation in LCLs, we superimposed split-mappings to the exon-intron structures of Gencode v12 to rescue putative novel introns (PNIs) that describe non-annotated exon-exon junctions (Methods). We found >64 million reads supporting ~2/3 of the annotated introns (222,862 out of 337,247 introns) and additionally ~14.7 million split-mappings that provide evidence for ~1.1 million PNIs. Although the overall size distribution of PNIs follows largely the one of introns annotated in the Gencode reference, a mixture of two lognormal distributions caused by distinct groups of short (~100nt) and long (~1,600nt) introns⁴⁷, there are outliers of extremely short and long PNIs (Supplementary Fig. 3a). Most PNIs are predominantly observed in few individuals (Supplementary Fig. 3b) and also covered poorly by split-mappings in comparison to introns annotated in the Gencode reference (Supplementary Fig. 3c). However, PNIs also reflect many RNA-biology attributes similar to their annotated counterparts (Supplementary Fig. 4), the majority of PNIs (~74%) locate within annotated transcripts (i.e., “internal” events) and ~82% of them also employ at least one annotated splice site (Table 1a).

Table 1 Alternative splicing implied by putative novel introns (PNIs).

PNIs involving non-annotated (i.e., novel) splice sites and those that extend the transcript boundaries beyond the Gencode annotation (Table 1b) are also well supported by complementary RNA-Seq data from the Encode project⁴⁸, especially at higher thresholds of individual- and population-support (Table 2a). Like annotated splice sites (Fig. 1a), novel splice sites show evidence for genetic control of their splicing functionality, although at expectedly lower read support levels (Supplementary Fig. 3d). When clustering genetic variation caused by 19,528 variants in novel GT/AG splice sites from PNIs confirmed by >150 individuals according to the effects on splicing, we find amongst the variant groups a ranking similar to the one of splice sites annotated in the Gencode reference, but with highly significant shifts towards fewer neutral (p-value ~ e-30, Fisher Exact test) and weakening (p-value ~ e-29) variants, and more enhancing (p-value ~ e-125) and activating (p-value ~ e-65) variants (Fig. 2f). In the context of our previous observations on the bias of the human reference genome in favor of more functional elements, these differences can be explained by non-annotated PNIs showing a reduced bias for functional reference alleles (Fig. 2d vs. 2f). However, we also observe an increase in the relative proportion of disrupting variants (p-value ~ e-14), which could reflect that disrupted splice junctions are underrepresented in the Gencode annotation because of their generally lower expression levels²⁶.

Table 2 Mutual confirmation of novel transcriptional elements in Geuvadis and Encode RNA-Seq data.

RNA editing as a splice site modulator

Next, we employed our methodology to analyze Gencode splice sites for the impact of potential RNA editing events catalyzed by the ADAR enzyme complex (Methods), which produces A-to-I conversions that are represented by A-to-G transitions in the RNA-Seq data⁴⁹. Reassuringly, our approach calls substantially fewer splice sites with putative RNA editing polymorphisms than with genetic polymorphisms (<0.01% vs. 10.7%). Only two of the 39 editing events we predict to occur in the region of annotated splice sites are contained in the complementary RADAR-2 database⁵⁰; however, this database includes data from studies that intentionally select against editing events in annotated splice sites^51,52,53. In contrast to genetic variants (Fig. 2d), more than twice the proportion of edited nucleotides (~68% vs. 32%) disrupt their harboring splice site, which can be expected from mechanistic restrictions when considering the possible sequence alterations of ADAR editing in the canonical dinucleotides of annotated sites (Fig. 3a). Consequently, we observe 28 A-to-G transitions that disrupt the AG acceptor dinucleotide, whereas the only activation event we predict for ADAR editing occurs by conversion of a donor AT dinucleotide, usually employed in a very limited set of introns spliced by the minor spliceosome⁵⁴.

Our data in Supplementary Fig. 5a further suggests that RNA editing targets significantly shorter introns (median 607nt vs. 1,881nt in constitutive introns) and particularly RNA editing events that disrupt splicing activities are limited to very short introns (median 522.5nt vs. 972nt in the other introns with edited sites, p-value ~1.1e-09, MWW test). Supplementary Table 1 also summarizes that, according to the Gencode reference transcriptome, most of the splice sites (28 of 41 sites) that are affected by RNA editing are alternatively spliced, which interestingly leads predominantly to retaining the entire intron (in 18 of 28 introns with edited sites). Indeed, we also observe in the Geuvadis dataset substantial amounts of reads from introns flanked by sites with predicted editing events (Supplementary Fig. 5b), in agreement with recent reports concluding that the ADAR complex can sterically block the splicing machinery from accessing the RNA substrate⁵⁵.

Unlike the binary state of variants encoded by the genome, RNA editing constitutes a more gradual trait that has been reported to vary across individuals, transcript sequences and gene expression levels⁵⁶. Interestingly, we also find in the Geuvadis data that the editing efficiency in splicing disrupting events anti-correlates with the splicing efficiency, as introns flanked by disrupted sites that are exhaustively edited (>0.9 of non-reference bases) exhibit higher intron read coverages and therefore more retained introns (Fig. 3b). We do not observe this difference for non-disruptive editing events (Supplementary Fig. 5c). These results support complementary observations of splicing⁵⁷ and also RNA editing⁵⁸ being co-transcriptionally competing processes (Fig. 3c). Our findings suggest that both molecular processes are often temporally coordinated, as also reported by complementary evidence^55,59 and that RNA editing can guide splice site choice in particular genes and species^60,61,62,63.

Genetic diversity in polyadenylation signals

Beyond splicing, we also investigated the impact of inter-individual DNA variability on polyadenylation. To obtain 3′ end information we predicted 52,349 putative cleavage sites (PCSs) from read mappings that align partly with the genomic sequence and exhibit poly-A tails (Methods). The number of PCSs found with higher read support levels decreases rapidly (Supplementary Fig. 6a), but independently of the expression rate of the underlying transcript (Supplementary Fig. 6b). In our further analyses we focus on the conservative subset of 21,102 PCSs supported by ≥2 reads, which are still twice as many as identified in previous studies^28,64. These PCSs exhibit a high degree of overlap with annotated 3′ UTRs (71.4%), especially within a distance of 50 nt from 3′ transcript ends annotated in Gencode (66%) and they are highly supported by complementary RNA-Seq data from the Encode Project (Table 1b).

Scanning the genomic sequence around these PCSs (Methods), we identified for 96.3% of them sequences that agree with earlier described poly-A motifs and the nucleotide distribution of their consensus also matches earlier reports³⁶. For those that coincide with polyA-signals provided by the Gencode annotation, we additionally analyzed the degree up to which genetic variation affects the composition of the poly-A motif. Most poly-A motifs are exempt of SNPs, but Fig. 4a shows 235 events of SNPs that are reproducing known poly-A signals and therefore overall maintain the consensus profile (“altered motifs”, left panel in Fig. 4a) in contrast to 214 polymorphisms that produce sequences unknown to function as poly-A signals that distort the consensus and therefore likely disrupt the affinity of the site to the CPSF (“degraded” motifs, right panel in Fig. 4a). Interestingly, we observe that poly-A motifs that are degraded by genetic variation locate marginally but significantly further away from the PCSs (Fig. 4b), indicating a different relevance of the CPSF for 3′-end formation. Summing up, we collected the Geuvadis RNA-Seq evidence for splice sites, introns and cleavage sites that are not annotated in the Gencode v12 reference and we exhaustively characterized the implications of genetic variation also in these novel elements.

Discussion

In this study we employed the genetic diversity annotated for 462 individuals from the 1000 Genomes project to compose a genome-wide catalogue of genetic polymorphisms in annotated splice sites and to estimate their potential effects on splicing based on the sequence changes in splice site motifs. In this light we consider the landscape of inter-individual variants described by the large-scale Geuvadis experiment as a natural source of mutagenesis experiments from which we deduce rules for the regulation of splicing. Due to their important functional role, splice sites are generally depleted for genetic polymorphisms, and our results suggest an even higher level of selective constraints in the splice site dinucleotides than in the adjacent exon sequences. Employing HMM scoring models established in gene finding, we implemented a tool that scores the splicing potential of splice sites and their variants. We evaluate the computed score against an alternative scoring model based on PWMs, and we compare the results produced by either method to establish a rationale for classifying the changes observed in splicing scores into five classes (i.e., disrupting, weakening, neutral, enhancing and activating variants). From a computational point of view, we contribute to forthcoming studies along the same lines by making our programs to compute splicing scores for reference and variant sites publicly available.

Based on these score predictions, the mechanistic impact of genetic variation on splice sites is often of subtle nature, for instance modulating the inclusion level of alternative exons, but can also be rather severe. We describe variants that activate or disrupt entirely the splicing activity, providing examples from the Geuvadis Project where SNPs switch intron splicing allele-specifically on or off. Although RNA-editing can also affect splicing, we find that ADAR-edited splice sites are comparatively rare, however, with a higher degree of disrupting variants caused by A-to-G substitutions in the canonical AG dinucleotide of the acceptor site. Our analyses suggest that RNA-editing targets mainly short introns of evolutionary rather old genes, most of the edited sites are already known to be alternatively used and many are related to intron retention. The Geuvadis dataset shows a substantial amount of intronic reads in introns with edited sites, as expected in the proposed model under which the ADAR complex makes the RNA molecule inaccessible to the splicing machinery and in concordance with the computed splice site scores the RNA-Seq coverage is even higher in introns with splice sites that are predicted to disrupt splicing activity. We also find that the RNA-Seq read coverage of introns with splice sites disrupted by RNA-editing increases when editing levels rise close to the complete substitution of the genomic base, whereas this is not observed in introns with edited sites that are still predicted to be functional. Altogether, the computational models we apply to combined DNA- and RNA-sequencing at a population scale support multiple aspects of RNA editing postulated by previous observations in limited gene sets.

Allele frequencies from the 1000 Genomes project show that most of the genetic variation affecting splicing stems from rare alleles in the population, but we discover also a small set of common polymorphisms that actually describe a functional splice site in contrast to a splicing-defective reference sequence, which shows that relying exclusively on the reference genome in gene annotation and polymorphism effect estimation may be problematic in specific cases. In fact, the combined sequencing depth of hundreds of samples and billions of reads provides us with the power to detect thousands of transcribed elements that are not annotated in the Gencode reference annotation, including novel introns (PNIs) and cleavage sites (PCSs). The majority of these previously undetected elements are also discovered in complementary RNA-Seq data from the Encode project and exhibit attributes similar to the biology of their annotated counterparts. Many of them occur only in few individuals, which may be the reason why they are absent from existing annotations, but they may still be important determinants of personal transcriptomes by contributing to the genetic makeup of each individual.

Employing these novel elements predicted from the phenotype data, we show that PNIs exhibit a higher proportion of activating as well as disrupting variants, indicating that the absence of their splicing can be tolerated more often. These conclusions are in agreement with our observations of comparatively low splicing and population frequencies for PNIs. We also find that genetic polymorphisms potentially disrupt poly-A signals, especially in cases where the CPSF recognition site localizes slightly further away from the PCS. Our results are certainly limited because the RNA-Seq data in the Geuvadis experiment have been obtained from a single cell type per individual, namely lymphoblastoid cell lines, and we expect that our observations will be extended in the future as more population-scale tissue data become available. However, our study demonstrates a hitherto less explored potential for mechanistic studies on the inter-individual variability and population diversity in RNA-processing that can be derived by combined RNA- and DNA-sequencing.

Methods

Supplementary Fig. 7 shows an overview of all resources employed and the analyses carried out for this work, which are detailed in the following.

Computing splicing scores

Following traditional approaches in gene finding⁴¹, we employ computational splice site models that comprise an informative sequence of 9nt for splice donors (interval [−2; 7]) and 27nt for splice acceptors (from −24 to +3, including additionally the typical area of the upstream polypyrimidine tract⁶⁵). We first apply these models to the splice sites annotated in the GENCODE version 12 reference transcriptome and subsequently also to novel introns (PNIs, see below) as well as predicted RNA-editing in splice sites (see below). To estimate splicing efficiency of polymorphisms, the splice site sequence composition is represented by a second order Markov Model^66,67. Under this model, sequences with a higher degree of similarity to the consensus bind more tightly to the corresponding factors of the splicing machinery^68,69 and therefore are more frequently observed as authentic splice sites^70,71. We then compute the log-odds “splicing score” and compare the scores of sequences derived from splice site variants with the score of the corresponding splice site reference sequence in the human genome assembly GRCh37. Our scoring algorithm is implemented in the Scorer tool of the Astalavista framework available at http://scorer.sammeth.net, which we employed using the command:

astalavista -t scorer -i gencode_v12.gtf -c GRCh37_sequences_folder --gid geneid.human.070123.param --vcf population_variants.vcf -f population_variant_scores.vcl

where geneid.human.070123.param is the GeneID parameters file for the human genome, downloaded from ftp://genome.crg.es/pub/software/geneid/human.070123.param.
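The log-odds scoring under a second order Markov model can be illustrated with a minimal sketch. The data layout, the function name, the uniform background probability and the natural logarithm are assumptions for illustration; they do not reproduce the GeneID parameter file format or the Scorer implementation. A sequence containing a transition that is missing from the toy model returns no score, mirroring the idea of sequences never observed in the training set.

```python
import math
from typing import Dict, Optional, Tuple

# Hypothetical toy layout: (position, preceding context, base) -> probability.
# This is NOT the GeneID parameter file format.
SiteModel = Dict[Tuple[int, str, str], float]


def log_odds_score(seq: str,
                   signal: SiteModel,
                   order: int = 2,
                   background: float = 0.25) -> Optional[float]:
    """Sum position-specific log-odds of the signal model vs. a uniform background.

    Returns None if any transition is absent from (or has zero probability in)
    the signal model.
    """
    score = 0.0
    for i, base in enumerate(seq):
        context = seq[max(0, i - order):i]  # up to `order` preceding bases
        p_signal = signal.get((i, context, base), 0.0)
        if p_signal == 0.0:
            return None  # transition never observed in the (toy) training set
        score += math.log(p_signal / background)
    return score


# Toy 3-nt model (made-up probabilities, for illustration only).
toy_model: SiteModel = {(0, "", "A"): 0.5, (1, "A", "G"): 0.5, (2, "AG", "T"): 1.0}
reference_score = log_odds_score("AGT", toy_model)  # log(2) + log(2) + log(4)
variant_score = log_odds_score("AGC", toy_model)    # None: "C" after "AG" is unseen
```

When both sequences can be scored, the score change of a variant is the variant score minus the reference score.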

Comparison of HMM scores with PWM scores

Hidden Markov Model (HMM) scores were calculated with our Astalavista Scorer tool as described above. Position Weight Matrix (PWM) scores were calculated by running the FIMO⁷² motif scanning tool with default parameters on the splice site DNA sequences retrieved with the Astalavista Scorer tool, using PWMs from the SpliceRack database⁴⁴. The motif score assigned by FIMO was used as the PWM score. For both approaches, score differences Δ_HMM and Δ_PWM were calculated by subtracting the reference sequence (RS) score from the variant sequence (VS) score, with negative score differences suggesting splice site “weakening” variants while positive differences imply splice site “enhancing” variants. As the PWM scores exhibited a trimodal distribution separated by minima at ~+/−6, we classified all score differences between −6 and +6 as “neutral” variants (Supplementary Fig. S2a). We subsequently varied the “neutral” threshold for the HMM score differences between 0 and +/−2.5 and we determined the degree of classification agreement as enrichment between the two scoring schemes using the chi-square test from the R statistical program⁷³. The enrichment is measured as the standardized residuals of the chi-square test, i.e., an enrichment of x means that the observed frequency of coincidences is x times the standard deviation away from the expected frequency of coincidences between both models.
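The enrichment measure described above can be sketched in a few lines. The sketch assumes that "standardized residuals" refers to the quantity R's chisq.test reports as stdres, i.e. (observed − expected) / sqrt(expected · (1 − row share) · (1 − column share)); the contingency counts are invented for illustration and are not results from this study.

```python
import math

def standardized_residuals(table):
    """Standardized residuals of a contingency table (rows x columns of counts)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    residuals = []
    for i, row in enumerate(table):
        out = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            variance = expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            out.append((observed - expected) / math.sqrt(variance))
        residuals.append(out)
    return residuals

# Hypothetical counts: rows = HMM class, columns = PWM class,
# each ordered as weakening, neutral, enhancing.
agreement = [
    [40, 10, 2],
    [12, 300, 15],
    [3, 11, 35],
]
for label, row in zip(("weakening", "neutral", "enhancing"), standardized_residuals(agreement)):
    print(label, [round(x, 2) for x in row])
```

Large positive values on the diagonal would indicate that both scoring schemes assign the same class more often than expected by chance.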

Classification of sequence variants in splice sites

SNPs that increase/decrease the splicing score of a reference splice site sequence above/below the previously determined threshold (|τ| = 1.5 for donors and |τ| = 1.0 for acceptors) are classified as “enhancing”/“weakening” variants. In the cases where either the GRCh37 genome or the splice site variant reproduces a sequence that is absent from the training set of our model, we assume that the sequence does not represent a functional splice site and consider the corresponding variants as “activating”/“disrupting” the splice site potential. All other sequence variations that do not change the splicing score more than |τ| are “neutral” polymorphisms. We employed the global derived allele frequencies (DAFs) computed for the non-reference alleles by the 1000 Genomes Project.
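The classification rules above can be summarized in a short sketch. The function name, the use of None for a sequence absent from the training set, and the "unscored" label for the case where neither sequence is scorable are assumptions of this sketch; only the thresholds and the five variant classes come from the text.

```python
# Thresholds |tau| as stated in the text.
THRESHOLDS = {"donor": 1.5, "acceptor": 1.0}

def classify_variant(site_type, ref_score, var_score):
    """Classify a splice site variant from reference and variant splicing scores.

    A score of None stands for a sequence that is absent from the training
    set of the model (assumed representation).
    """
    if ref_score is None and var_score is None:
        return "unscored"      # case not covered by the text; label is an assumption
    if ref_score is None:
        return "activating"    # only the variant yields a potential splice site
    if var_score is None:
        return "disrupting"    # only the reference yields a potential splice site
    tau = THRESHOLDS[site_type]
    delta = var_score - ref_score
    if delta > tau:
        return "enhancing"
    if delta < -tau:
        return "weakening"
    return "neutral"

print(classify_variant("donor", 5.0, 7.0))
print(classify_variant("acceptor", 5.0, 3.5))
print(classify_variant("acceptor", None, 4.0))
```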

Prediction of RNA editing in splice sites

We employed the samtools (version 0.1.18) mpileup tool in combination with the bundled vcfutils.pl script⁷⁴ to call sequence polymorphisms from RNA-Seq reads with the following command:

samtools mpileup -C0 -m3 -F0.0002 -E -d999999 -q20 -DSuf hg19.fa -b inputBams | bcftools view -cgv - | vcfutils.pl varFilter -Q25 -d3 -D4999500 -a2 -w10 -W10 -1 0.0001 -2 1e-400 -3 0 -4 0.0001 -p > variants.vcf

From the Geuvadis RNA-Seq mappings (“inputBams”), this pipeline produces a list of variants (“variants.vcf”). We employ the mpileup standard parameters for disabling the adjustment of mapQ (-C0) and for the minimum fraction of gapped reads (-F0.002), but allow a higher per-BAM depth (-d999999) to account for the unequal read coverage in genes with different expression levels, and require a higher mapQ (-q20) for mappings to be considered during calling. The corresponding parameters (-D4999500 and -Q25) were also adjusted in the vcfutils.pl filtering script, where we additionally increased the stringency so that polymorphisms located within 10nt of a gapped position are excluded (-w10 and -W10). Subsequently, we merged the calls from 421 individuals with non-imputed genotypes in the Phase2 dataset of the 1000 Genomes Project³², removing polymorphisms with a median coverage of <10 at called sites, with <10 samples showing the called non-reference base, or with a variant quality of <100 assigned by SAMtools. We thus obtained 8,479 predicted polymorphisms, of which 7,770 (91.6%) correspond to 1000 Genomes genotype variants employed by the Geuvadis Project:

http://www.ebi.ac.uk/arrayexpress/experiments/E-GEUV-1/files/genotypes/

Considering the transcription directionality of each respective gene, 39 of the remaining 709 non-genomic polymorphisms correspond to A-to-G variants that modify in total 41 introns annotated by the Gencode reference (Supplementary Table 1).
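Taking transcription directionality into account means that a T-to-C change on the forward genomic strand appears as A-to-G in a transcript of a minus-strand gene. A minimal sketch of that check follows; the function names are hypothetical and not part of the pipeline described here.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def transcript_change(ref, alt, strand):
    """Express a genomic single-base change in the orientation of the transcript."""
    if strand == "-":
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"

def is_a_to_g(ref, alt, strand):
    """True if the change reads as A-to-G on the transcribed strand."""
    return transcript_change(ref, alt, strand) == "A>G"

print(is_a_to_g("A", "G", "+"))  # plus-strand gene, A>G in genomic coordinates
print(is_a_to_g("T", "C", "-"))  # minus-strand gene, T>C in genomic coordinates
```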

Prediction of putative novel introns (PNIs)

We rescue PNIs from split-mapped RNA-Seq reads that indicate non-annotated alternative 5′ or 3′ splice sites within a distance of up to 30 nt from an annotated exon boundary, considering only properly paired mappings with a mapping quality of at least 150, an edit distance ≤6 and an insert-size of ≤1,000,000 nt. We then superimpose the PNIs onto the exon-intron structures of the Gencode v12 annotation and employ our earlier described definition to classify the patterns of alternative splicing events implied by these novel introns⁷⁵.
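The mapping criteria listed above can be expressed as a single predicate. The record layout and all names below are hypothetical; only the numeric cut-offs are taken from the text, and a real implementation would derive these fields from the alignment records.

```python
from dataclasses import dataclass

@dataclass
class SplitMapping:
    """Hypothetical summary of one split-mapped read pair."""
    properly_paired: bool
    mapping_quality: int
    edit_distance: int
    insert_size: int
    novel_site: int             # genomic position of the non-annotated splice site
    nearest_exon_boundary: int  # closest annotated exon boundary

def supports_pni(m, max_dist=30, min_mapq=150, max_edits=6, max_insert=1_000_000):
    """Apply the proximity, pairing, quality, edit distance and insert-size filters."""
    distance = abs(m.novel_site - m.nearest_exon_boundary)
    return (
        m.properly_paired
        and m.mapping_quality >= min_mapq
        and m.edit_distance <= max_edits
        and abs(m.insert_size) <= max_insert
        and 0 < distance <= max_dist  # distance 0 would be the annotated site itself
    )

good = SplitMapping(True, 180, 2, 350, 100_012, 100_000)
far = SplitMapping(True, 180, 2, 350, 100_045, 100_000)
print(supports_pni(good), supports_pni(far))
```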

Prediction of putative cleavage sites (PCSs)

To identify putative cleavage sites, we employ unmapped reads containing a poly-A tail (or a poly-T head) that pinpoint the cleavage site in poly-adenylated mRNAs. After trimming the reads for these subsequences, filtering them by a minimum informative length (>25nt after trimming) and removing low complexity reads (i.e., read sequences with an [A] and [T] content ≥80%), we obtain ~24 million reads of which 685,351 map uniquely to the genome and indicate 52,349 putative cleavage sites (PCSs). This can be summarized by the following commands, using the trimest tool⁷⁶:

samtools view -f 4 $BAMFILE | awk '{if($10 !~ /\./ && (($10 ~ /AAAA$/) || ($10 ~ /^TTTT/))){cnt++; print ">"cnt"\n"$10}}' | trimest -filter -minlength=5 -fiveprime Y -mismatches=1 | perl FastaToTbl.pl | awk -f selByLenAndContent.awk | perl TblToFasta.pl > $OUTFILE

selByLenAndContent.awk: { len = length($2); cntA = cntT = 0; for (i = 1; i <= len; i++) { c = substr($2, i, 1); if (c == "A") cntA++; if (c == "T") cntT++ } if (len > 25 && (cntA + cntT) / len < 0.8) print }

This pipeline receives a BAM file (BAMFILE) as input and produces a file of poly-A reads that have been trimmed and selected. The scripts FastaToTbl and TblToFasta convert between FASTA and tabular format. We consider a PCS predicted from the Geuvadis RNA-Seq data to be confirmed if we can extract a corresponding PCS from the Encode dataset that intersects in the genomic region to which the non poly-adenylated parts of supporting reads align. This analysis can be summarized by the following command using BedTools⁷⁷:

windowBed -a gencode.polyA.sites.bed -b ./geuvadis.polyA.bed -w 50 -c | awk '{if($7>0)print}'
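The read selection logic of the pipeline can be approximated in a few lines. This sketch is a simplification and not a replacement for trimest: it uses exact regular expressions (no mismatches are tolerated), and the function names and example reads are hypothetical. The cut-offs (poly-A or poly-T run of at least 5 bases, more than 25 nt after trimming, A+T content below 80%) follow the commands and text above.

```python
import re

def trim_poly_a(read, min_run=5):
    """Strip a poly-A tail or a poly-T head; return None if the read has neither."""
    tail = re.search(r"A{%d,}$" % min_run, read)
    if tail:
        return read[:tail.start()]
    head = re.match(r"T{%d,}" % min_run, read)
    if head:
        return read[head.end():]
    return None

def keep_read(trimmed, min_len=25, max_at=0.8):
    """Keep reads longer than min_len whose A+T fraction is below max_at."""
    if trimmed is None or len(trimmed) <= min_len:
        return False
    at_fraction = (trimmed.count("A") + trimmed.count("T")) / len(trimmed)
    return at_fraction < max_at

reads = [
    "ACGT" * 8 + "A" * 10,   # informative read with poly-A tail
    "T" * 8 + "ACGT" * 8,    # informative read with poly-T head
    "T" * 6 + "AT" * 20,     # low complexity after trimming
    "ACGTACGT" + "A" * 6,    # too short after trimming
]
selected = [t for t in map(trim_poly_a, reads) if keep_read(t)]
print(len(selected), "of", len(reads), "reads selected")
```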

Finding poly-A signals

In order to identify poly-A motifs for previously identified PCSs, we use a recursive approach similar to an earlier proposed method³⁷. We employ 13 hexamer motifs that have been identified as potential binding sites of the CPSF^36,37, i.e. AATAAA, ATTAAA, TATAAA, AGTAAA, AAGAAA, AATATA, AATACA, CATAAA, GATAAA, AATGAA, TTTAAA, ACTAAA, AATAGA. This list of hexamers is ranked by the frequency with which each motif is observed, with AATAAA being the most and AATAGA the least frequent poly-A motif in the human transcriptome. We then scan the DNA sequences of 50 nt around the previously predicted PCSs in a top-down approach, starting with the most frequently occurring hexamer; if a corresponding hexamer sequence is found, we record its position, otherwise we continue scanning with the next most frequent motif until all of the 13 known poly-A motifs have been tested.
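The top-down scan can be sketched as a loop over the ranked hexamer list. The function name and the example windows are hypothetical, and reporting the first (leftmost) occurrence of a motif within the window is an assumption of this sketch, since the text does not state which occurrence is recorded when a motif appears more than once.

```python
# Hexamers in the order given in the text (most to least frequent).
POLYA_MOTIFS = [
    "AATAAA", "ATTAAA", "TATAAA", "AGTAAA", "AAGAAA", "AATATA", "AATACA",
    "CATAAA", "GATAAA", "AATGAA", "TTTAAA", "ACTAAA", "AATAGA",
]

def find_polya_signal(window):
    """Return (motif, position, rank) of the highest-ranked hexamer in the window, or None."""
    for rank, motif in enumerate(POLYA_MOTIFS, start=1):
        position = window.find(motif)  # assumption: leftmost occurrence is recorded
        if position != -1:
            return motif, position, rank
    return None

print(find_polya_signal("GGGATTAAACCCAATAGAGG"))
print(find_polya_signal("CCATTAAAGGAATAAACC"))
print(find_polya_signal("CCCCGGGG"))
```

In the second example window both ATTAAA and AATAAA are present, and the scan reports AATAAA because it is tested first.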

Additional Information

How to cite this article: Ferreira, P. G. et al. Sequence variation between 462 human individuals fine-tunes functional sites of RNA processing. Sci. Rep. 6, 32406; doi: 10.1038/srep32406 (2016).

References

1. Black, D. L. Mechanisms of alternative pre-messenger RNA splicing. Annu. Rev. Biochem. 72, 291–336 (2003).
2. Black, D. L., Chabot, B. & Steitz, J. A. U2 as well as U1 small nuclear ribonucleoproteins are involved in premessenger RNA splicing. Cell 42, 737–750 (1985).
3. Wahle, E. & Kühn, U. The mechanism of 3′ cleavage and polyadenylation of eukaryotic pre-mRNA. Prog. Nucleic Acid Res. Mol. Biol. 57, 41–71 (1997).
4. Colgan, D. F. & Manley, J. L. Mechanism and regulation of mRNA polyadenylation. Genes Dev. 11, 2755–2766 (1997).
5. Curado, J., Iannone, C., Tilgner, H., Valcárcel, J. & Guigó, R. Promoter-like epigenetic signatures in exons displaying cell type-specific splicing. Genome Biol. 16, 236 (2015).
6. Derrien, T., Guigó, R. & Johnson, R. The Long Non-Coding RNAs: A New (P)layer in the ‘Dark Matter’. Front. Genet. 2, 107 (2011).
7. Wilusz, J. E., Sunwoo, H. & Spector, D. L. Long noncoding RNAs: functional surprises from the RNA world. Genes Dev. 23, 1494–1504 (2009).
8. Tilgner, H. et al. Nucleosome positioning as a determinant of exon recognition. Nat. Struct. Mol. Biol. 16, 996–1001 (2009).
9. Papasaikas, P., Tejedor, J. R., Vigevani, L. & Valcárcel, J. Functional splicing network reveals extensive regulatory potential of the core spliceosomal machinery. Mol. Cell 57, 7–22 (2015).
10. Krawczak, M., Reiss, J. & Cooper, D. N. The mutational spectrum of single base-pair substitutions in mRNA splice junctions of human genes: causes and consequences. Hum. Genet. 90, 41–54 (1992).
11. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).
12. Xiong, H. Y. et al. The human splicing code reveals new insights into the genetic determinants of disease. Science 347, 1254806 (2015).
13. Garcia-Blanco, M. A., Baraniak, A. P. & Lasda, E. L. Alternative splicing in disease and therapy. Nat. Biotechnol. 22, 535–546 (2004).
14. Faustino, N. A. & Cooper, T. A. Pre-mRNA splicing and human disease. Genes Dev. 17, 419–437 (2003).
15. Singh, R. K. & Cooper, T. A. Pre-mRNA splicing in disease and therapeutics. Trends Mol. Med. 18, 472–482 (2012).
16. Acedo, A. et al. Comprehensive splicing functional analysis of DNA variants of the BRCA2 gene by hybrid minigenes. Breast Cancer Res. 14, R87 (2012).
17. Rahman, M. A. et al. HnRNP L and hnRNP LL antagonistically modulate PTB-mediated splicing suppression of CHRNA1 pre-mRNA. Sci. Rep. 3, 2931 (2013).
18. Vibe-Pedersen, K., Kornblihtt, A. R. & Baralle, F. E. Expression of a human alpha-globin/fibronectin gene hybrid generates two mRNAs by alternative splicing. EMBO J. 3, 2511–2516 (1984).
19. Kwan, T. et al. Heritability of alternative splicing in the human genome. Genome Res. 17, 1210–1218 (2007).
20. Zhang, X., Zou, F. & Wang, W. Efficient Algorithms for Genome-wide Association Study. ACM Trans. Knowl. Discov. Data 3, 19:1–19:28 (2009).
21. Fraser, H. B. & Xie, X. Common polymorphic transcript variation in human disease. Genome Res. 19, 567–575 (2009).
22. Kwan, T. et al. Tissue effect on genetic control of transcript isoform variation. PLoS Genet. 5, e1000608 (2009).
23. Lu, Z.-X., Jiang, P. & Xing, Y. Genetic variation of pre-mRNA alternative splicing in human populations. Wiley Interdiscip. Rev. RNA 3, 581–592 (2012).
24. Monlong, J., Calvo, M., Ferreira, P. G. & Guigó, R. Identification of genetic variants associated with alternative splicing using sQTLseekeR. Nat. Commun. 5, 4698 (2014).
25. Ongen, H. & Dermitzakis, E. T. Alternative Splicing QTLs in European and African Populations. Am. J. Hum. Genet. 97, 567–575 (2015).
26. Rivas, M. A. et al. Human genomics. Effect of predicted protein-truncating genetic variants on the human transcriptome. Science 348, 666–669 (2015).
27. Montgomery, S. B. et al. Transcriptome genetics using second generation sequencing in a Caucasian population. Nature 464, 773–777 (2010).
28. Pickrell, J. K. et al. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature 464, 768–772 (2010).
29. Stranger, B. E. et al. Population genomics of human gene expression. Nat. Genet. 39, 1217–1224 (2007).
30. Cheung, V. G. et al. Mapping determinants of human gene expression by regional and genome-wide association. Nature 437, 1365–1369 (2005).
31. Dimas, A. S. et al. Common regulatory variation impacts gene expression in a cell type-dependent manner. Science 325, 1246–1250 (2009).
32. Abecasis, G. R. et al. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012).
33. Lappalainen, T. et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511 (2013).
34. ’t Hoen, P. A. C. et al. Reproducibility of high-throughput mRNA and small RNA sequencing across laboratories. Nat. Biotechnol. 31, 1015–1022 (2013).
35. Zhang, X. H.-F., Leslie, C. S. & Chasin, L. A. Dichotomous splicing signals in exon flanks. Genome Res. 15, 768–779 (2005).
36. Beaudoing, E., Freier, S., Wyatt, J. R., Claverie, J. M. & Gautheret, D. Patterns of variant polyadenylation signal usage in human genes. Genome Res. 10, 1001–1010 (2000).
37. Tian, B., Hu, J., Zhang, H. & Lutz, C. S. A large-scale analysis of mRNA polyadenylation of human and mouse genes. Nucleic Acids Res. 33, 201–212 (2005).
38. Harrow, J. et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774 (2012).
39. Graveley, B. R. The haplo-spliceo-transcriptome: common variations in alternative splicing in the human population. Trends Genet. 24, 5–7 (2008).
40. Zhang, W. et al. Identification of common genetic variants that account for transcript isoform variation between human populations. Hum. Genet. 125, 81–93 (2009).
41. Guigó, R., Knudsen, S., Drake, N. & Smith, T. Prediction of gene structure. J. Mol. Biol. 226, 141–157 (1992).
42. Ast, G. How did alternative splicing evolve? Nat. Rev. Genet. 5, 773–782 (2004).
43. Wang, E. T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476 (2008).
44. Sheth, N. et al. Comprehensive splice-site analysis using comparative genomics. Nucleic Acids Res. 34, 3955–3967 (2006).
45. Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
46. Olivier, M. et al. A high-resolution radiation hybrid map of the human genome draft sequence. Science 291, 1298–1302 (2001).
47. Lim, L. P. & Burge, C. B. A computational analysis of sequence features involved in recognition of short introns. Proc. Natl. Acad. Sci. USA 98, 11193–11198 (2001).
48. Djebali, S. et al. Landscape of transcription in human cells. Nature 489, 101–108 (2012).
49. Nishikura, K. Functions and regulation of RNA editing by ADAR deaminases. Annu. Rev. Biochem. 79, 321–349 (2010).
50. Ramaswami, G. & Li, J. B. RADAR: a rigorously annotated database of A-to-I RNA editing. Nucleic Acids Res. 42, D109–D113 (2014).
51. Kleinman, C. L., Adoue, V. & Majewski, J. RNA editing of protein sequences: a rare event in human transcriptomes. RNA 18, 1586–1596 (2012).
52. Ramaswami, G. et al. Identifying RNA editing sites using RNA sequencing data alone. Nat. Methods 10, 128–132 (2013).
53. Ramaswami, G. et al. Accurate identification of human Alu and non-Alu RNA editing sites. Nat. Methods 9, 579–581 (2012).
54. Wu, Q. & Krainer, A. R. AT-AC pre-mRNA splicing mechanisms and conservation of minor introns in voltage-gated ion channel genes. Mol. Cell. Biol. 19, 3225–3236 (1999).
55. Licht, K., Kapoor, U., Mayrhofer, E. & Jantsch, M. F. Adenosine to Inosine editing frequency controlled by splicing efficiency. Nucleic Acids Res. 10.1093/nar/gkw325 (2016).
56. Fumagalli, D. et al. Principles Governing A-to-I RNA Editing in the Breast Cancer Transcriptome. Cell Rep. 13, 277–289 (2015).
57. Tilgner, H. et al. Deep sequencing of subcellular RNA fractions shows splicing to be predominantly co-transcriptional in the human genome but inefficient for lncRNAs. Genome Res. 22, 1616–1625 (2012).
58. Rodriguez, J., Menet, J. S. & Rosbash, M. Nascent-seq indicates widespread cotranscriptional RNA editing in Drosophila. Mol. Cell 47, 27–37 (2012).
59. Laurencikiene, J., Källman, A. M., Fong, N., Bentley, D. L. & Ohman, M. RNA editing and alternative splicing: the importance of co-transcriptional coordination. EMBO Rep. 7, 303–307 (2006).
60. Rueter, S. M., Dawson, T. R. & Emeson, R. B. Regulation of alternative splicing by RNA editing. Nature 399, 75–80 (1999).
61. Jin, Y. et al. RNA editing and alternative splicing of the insect nAChR subunit alpha6 transcript: evolutionary conservation, divergence and regulation. BMC Evol. Biol. 7, 98 (2007).
62. Jones, A. K. et al. Splice-variant-and stage-specific RNA editing of the Drosophila GABA receptor modulates agonist potency. J. Neurosci. 29, 4287–4292 (2009).
63. Grohmann, M. et al. Alternative splicing and extensive RNA editing of human TPH2 transcripts. PLoS One 5, e8956 (2010).
64. Fu, Y. et al. Differential genome-wide profiling of tandem 3′ UTRs among human breast cancer and normal cells by high-throughput sequencing. Genome Res. 21, 741–747 (2011).
65. Coolidge, C. J., Seely, R. J. & Patton, J. G. Functional analysis of the polypyrimidine tract in pre-mRNA splicing. Nucleic Acids Res. 25, 888–896 (1997).
66. Blanco, E., Parra, G. & Guigó, R. Using geneid to identify genes. Curr. Protoc. Bioinformatics Chapter 4, Unit 4.3 (2007).
67. Hull, J. et al. Identification of common genetic variation that modulates alternative splicing. PLoS Genet. 3, e99 (2007).
68. Nelson, K. K. & Green, M. R. Mechanism for cryptic splice site activation during pre-mRNA splicing. Proc. Natl. Acad. Sci. USA 87, 6253–6257 (1990).
69. Zamore, P. D., Patton, J. G. & Green, M. R. Cloning and domain structure of the mammalian splicing factor U2AF. Nature 355, 609–614 (1992).
70. Ohshima, Y. & Gotoh, Y. Signals for the selection of a splice site in pre-mRNA. Computer analysis of splice junction sequences and like sequences. J. Mol. Biol. 195, 247–259 (1987).
71. Brunak, S., Engelbrecht, J. & Knudsen, S. Prediction of human mRNA donor and acceptor sites from the DNA sequence. J. Mol. Biol. 220, 49–65 (1991).
72. Grant, C. E., Bailey, T. L. & Noble, W. S. FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018 (2011).
73. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. 2013 (2014).
74. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
75. Sammeth, M., Foissac, S. & Guigó, R. A General Definition and Nomenclature for Alternative Splicing Events. PLoS Comput. Biol. 4, e1000147 (2008).
76. Rice, P., Longden, I. & Bleasby, A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 16, 276–277 (2000).
77. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).


Acknowledgements

The research leading to these results has received funding from the European Commission 7th Framework Program, Project N. 261123 (GEUVADIS). PGF received funding from POPH - QREN Type 4.2, European Social Fund and Portuguese Ministry of Science and Technology (MCTES), Contrato Programa no âmbito do Programa Investigador FCT, 2014, IF/01127/2014. MO received funding from the National Counsel of Technological and Scientific Development (CNPq) grant 310132/2015-0, and MS received funding from the Research Support Foundation of the State of Rio de Janeiro (FAPERJ) E_06/2015 and from CNPq grant 401626/2015-6.

Author information

Authors and Affiliations

Bioinformatics and Genomics, Center for Genomic Regulation (CRG), Barcelona, 08003, Catalonia, Spain
Pedro G. Ferreira, Xavier Estivill, Roderic Guigó & Michael Sammeth
Department of Genetic Medicine and Development, University of Geneva Medical School, Geneva, 1211, Switzerland
Pedro G. Ferreira, Emmanouil Dermitzakis, Stylianos Antonarakis & Tuuli Lappalainen
Instituto de Investigação e Inovação em Saúde, (i3S) Universidade do Porto, Porto, 4200-625, Portugal
Pedro G. Ferreira
Institute of Molecular Pathology and Immunology (IPATIMUP), University of Porto, Porto, 4200-625, Portugal
Pedro G. Ferreira
Institute of Biophysics Carlos Chagas Filho (IBCCF), Federal University of Rio de Janeiro (UFRJ), Rio de Janeiro, 21941-902, Brazil
Martin Oti & Michael Sammeth
Institute of Clinical Molecular Biology, Christians-Albrechts-Universität zu Kiel, Kiel, 24105, Germany
Matthias Barann, Stefan Schreiber & Philip Rosenstiel
Institute of Human Genetics, Helmholtz Center Munich, Neuherberg, 85764, Germany
Thomas Wieland, Thomas Meitinger & Tim M Strom
Center for Human Genome and Stem-cell research (HUG-CELL), University of São Paulo (USP), São Paulo, 05508090, Brazil
Suzana Ezquina
Science for Life Laboratory, Stockholm University, Box 1031, Solna, 17121, Sweden
Marc R. Friedländer
Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, United Kingdom
Manuel A. Rivas
Centre Nacional d’Anàlisi Genòmica, Barcelona, 08028, Catalonia, Spain
Anna Esteve-Codina & Ivo Gut
Center for Research in Agricultural Genomics (CRAG), Autonome University of Barcelona, Bellaterra, 08193, Catalonia, Spain
Anna Esteve-Codina
Institute of Human Genetics, Technische Universität München, Munich, 81675, Germany
Tim M Strom
Institute for Genetics and Genomics in Geneva (iGE3), University of Geneva, Geneva, 1211, Switzerland
Tuuli Lappalainen
Swiss Institute of Bioinformatics, Geneva, 1211, Switzerland
Tuuli Lappalainen
Pompeu Fabra University (UPF), Barcelona, 08003, Catalonia, Spain
Roderic Guigó
National Center of Scientific Computing (LNCC), Petrópolis, 2233-6000, Rio de Janeiro, Brazil
Michael Sammeth
Wellcome Trust Sanger Institute, Hinxton Cambridge CB10 1SA, UK
Aarno Palotie
Centre National de la Recherche Génomique, 91030 Evry, France
Jean François Deleuze
Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany
Ralf Sudbrak & Hans Lerach
Uppsala University, Box 256 751 05 Uppsala, Sweden
Ann-Christine Syvänen & Ulf Gyllensten
Radboud University Nijmegen Medical Centre, 6500 HB Nijmegen, the Netherlands
Han Brunner & Joris Veltman
Leiden University Medical Center, 2333 ZA Leiden, the Netherlands
Peter A.C. ’t Hoen & Gert Jan van Ommen
Universidad de Santiago de Compostela, E-15706 Santiago de Compostela, Spain
Angel Carracedo
European Bioinformatics Institute, EMBL-EBI, Hinxton Cambridge CB10 1SD, UK
Alvis Brazma & Paul Flicek
Institut National de la Santé et de la Recherche Médicale, 75013 Paris Country, France
Anne Cambon-Thomsen
Life Technologies, 64293 Darmstadt, Germany
Jonathan Mangion
Illumina Cambridge Limited, Fulbourn Cambridge CB21 5XE, UK
David Bentley
Johns Hopkins University School of Medicine, Baltimore MD 21205, USA
Ada Hamosh

Authors

Pedro G. Ferreira
Martin Oti
Matthias Barann
Thomas Wieland
Suzana Ezquina
Marc R. Friedländer
Manuel A. Rivas
Anna Esteve-Codina
Philip Rosenstiel
Tim M Strom
Tuuli Lappalainen
Roderic Guigó
Michael Sammeth

Consortia

The GEUVADIS Consortium

Xavier Estivill, Roderic Guigó, Emmanouil Dermitzakis, Stylianos Antonarakis, Thomas Meitinger, Tim M Strom, Aarno Palotie, Jean François Deleuze, Ralf Sudbrak, Hans Lerach, Ivo Gut, Ann-Christine Syvänen, Ulf Gyllensten, Stefan Schreiber, Philip Rosenstiel, Han Brunner, Joris Veltman, Peter A.C. ’t Hoen, Gert Jan van Ommen, Angel Carracedo, Alvis Brazma, Paul Flicek, Anne Cambon-Thomsen, Jonathan Mangion, David Bentley & Ada Hamosh

Contributions

The GEUVADIS Consortium produced the raw RNA-seq data and the mapping data, and defined the final dataset after quality control analysis. P.G.F., M.O., P.R., T.M.S. and M.S. designed the research. P.G.F., M.O., M.B., T.W., S.E., A.E.C. and M.S. conducted the analyses. P.G.F., M.O., M.F., M.R., T.L., R.G. and M.S. wrote the paper.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Electronic supplementary material

Supplementary Information

Rights and permissions

This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/


About this article

Cite this article

Ferreira, P., Oti, M., Barann, M. et al. Sequence variation between 462 human individuals fine-tunes functional sites of RNA processing. Sci Rep 6, 32406 (2016). https://doi.org/10.1038/srep32406


Received: 12 April 2016
Accepted: 03 August 2016
Published: 12 September 2016
DOI: https://doi.org/10.1038/srep32406

