Discovery of Cancer Driver Long Noncoding RNAs across 1112 Tumour Genomes: New Candidates and Distinguishing Features

Long noncoding RNAs (lncRNAs) represent a vast unexplored genetic space that may hold missing drivers of tumourigenesis, but few such “driver lncRNAs” are known. Until now, they have been discovered through changes in expression, leading to problems in distinguishing between causative roles and passenger effects. We here present a different approach for driver lncRNA discovery using mutational patterns in tumour DNA. Our pipeline, ExInAtor, identifies genes with excess load of somatic single nucleotide variants (SNVs) across panels of tumour genomes. Heterogeneity in mutational signatures between cancer types and individuals is accounted for using a simple local trinucleotide background model, which yields high precision and low computational demands. We use ExInAtor to predict drivers from the GENCODE annotation across 1112 entire genomes from 23 cancer types. Using a stratified approach, we identify 15 high-confidence candidates: 9 novel and 6 known cancer-related genes, including MALAT1, NEAT1 and SAMMSON. Both known and novel driver lncRNAs are distinguished by elevated gene length, evolutionary conservation and expression. We have presented a first catalogue of mutated lncRNA genes driving cancer, which will grow and improve with the application of ExInAtor to future tumour genome projects.

Scientific RepoRts | 7:41544 | DOI: 10.1038/srep41544 recruits the repressive PRC2 chromatin regulatory complex to hundreds of genes 11 . Cancer-related lncRNA have features of functional genes, including sequence conservation, orthologues in other mammals, chromatin marks and regulated subcellular localisation 4 . Moreover they display typical characteristics of cancer drivers, including influence on cellular phenotypes of proliferation and apoptosis, and in clinical features such as patient survival and altered expression across tumour collections 3,8,11 .
The absence of whole-genome maps of somatic mutations has meant that searches for new cancer-related lncRNAs have relied on conventional transcriptomic approaches that reveal changes in their expression levels that accompany cancer. However such approaches are not capable of distinguishing passenger and driver effects, nor do they identify mutations in the mature lncRNA sequence that may drive tumourigenesis independent of upstream regulatory changes 8,12,13 . Two recent studies clearly demonstrate that somatic mutations, in these cases amplifications of entire loci, can drive tumour formation 14,15 . Nevertheless, we remain largely ignorant of the role that mutations in lncRNA genes play during the early stages of tumourigenesis.
The statistical analysis of somatic mutation patterns is a powerful means of identifying genes that drive early tumour formation. A number of methods have been developed to search for candidate driver genes whose open reading frames display non-random mutational patterns consistent with positive selection on the encoded protein. In essence, all methods search for statistical enrichment in some measure of mutational impact, compared to a background model that accounts as far as possible for biases inherent in mutational processes. For example, OncodriveFM 16 employs predicted functional impact of mutations on encoded proteins, as inferred by a variety of methods, and using an empirical local background model. On the other hand, MutSigCV 17 identifies genes with elevated mutational rates, incorporating a variety of known mutational covariates in order to estimate an accurate background model drawn from silent sites amongst selected neighbouring protein-coding sequences. Finally, ActiveDriver 18 searches for genes with excess mutations falling in signaling sites, protein domains and regulatory motifs. While these approaches have discovered dozens of new cancer genes, their use of features specific to protein-coding genes to infer mutational biases, makes them inapplicable to lncRNA.
To date, the majority of driver discovery projects have been carried out using exome sequencing -the targeted capture and sequencing of approximately 2% of the genome encoding protein 17 . While successful for discovering protein-coding driver genes, exome sequencing ignores mutations occurring in the multitude of noncoding regulatory elements known to exist in the human genome 19 . Very recently, drops in the cost of sequencing have made plausible the sequencing of collections of entire tumour genomes 1 . Mutation maps from these genomes make it possible, for the first time, to search for non-coding driver elements.
In the present study, we describe and characterise a tool, called ExInAtor, for the discovery of driver lncRNA genes. ExInAtor identifies genes with excess of exonic mutations, compared to the expected local neutral rate estimated from intronic and surrounding sequences. We present a comprehensive prediction of candidate lncRNAs across 1104 genomes from 23 cancer types. These candidates have a series of features consistent with their being genuine drivers.

Results
A method for discovering driver genes from cancer genomes. Our aim was to develop a method to identify tumour driver long noncoding RNAs (lncRNAs) using short nucleotide variant (SNV) mutations from cancer genome sequencing projects. We define SNVs, from now on, as somatic substitutions or indels of length 1 nt. Only these mutations (representing the vast majority in this study, 97.7%) are used, due to the nature of ExInAtor's statistical model (see Materials and Methods). The majority of GENCODE lncRNA annotations are spliced (21,523/23,898 = 90.0% of transcripts), and we assume throughout that their functional sequence resides in exonic regions that are incorporated into the mature transcript 20 . Intronic sequence is removed during splicing and hence is not directly relevant to their function. Consequently, we hypothesised that driver lncRNAs will display an excess of somatic mutations in exons compared to the local background mutational rate, estimated by their introns and flanking genomic regions -henceforth referred to as "background regions". This approach is conservative, given that background regions are likely to include functional regulatory elements that may themselves carry driver mutations.
We implemented this approach in a computational pipeline called ExInAtor ( Fig. 1 and Supplementary Fig. S1). ExInAtor requires two principal inputs: an annotation of lncRNA genes and a catalogue of tumour mutations. At its heart, ExInAtor employs a parametric statistical test to identify genes that present a significantly elevated exonic mutation rate compared to local background regions. The latter are comprised of intronic and flanking genomic sequence. We took care to account for a key confounding factor: the unique mixture of mutational signatures that characterises every individual tumour, and every tumour type 21 . Such signatures can be described as a probability for every nucleotide to mutate to every other, conditioned on the identity of flanking positions -summarised in a matrix of 96 trinucleotide substitution frequencies 21 . In other words, mutation rates are dependent on nucleotide composition. The mutational signature must be taken into account when comparing mutational loads of exons to surrounding regions, because they tend to have marked differences in nucleotide compositionboth for protein-coding genes and lncRNAs 22 .
ExInAtor employs a subsampling approach to balance the trinucleotide content of exons and background regions, thereby accounting for mutational signatures (Fig. 1A, Supplementary Fig. S1). Exonic regions of each gene are defined as the projection of all exons from the union of its transcripts. Next, the background region is defined as all non-exonic nucleotides within the gene, in addition to upstream and downstream windows of defined length. Within these exonic and background regions, the frequencies of trinucleotides are calculated. Then, nucleotides are randomly sampled (without replacement) from the background region, until the maximum possible amount of sequence with identical trinucleotide composition has been collected. Now, the number of SNVs overlapping exons, M, and those overlapping remaining background nucleotides, m, are compared using We prepared a carefully-filtered lncRNA annotation, to avoid several potential sources of false positive predictions. We were particularly concerned by two potential confounding factors: first, misinterpretation of mutations that may affect protein-coding regions overlapping the same DNA as lncRNA exons; and second, the presence of mis-classified protein-coding transcripts among the GENCODE annotation 4 . Thus, we removed genes of uncertain protein-coding potential, as judged by computational protein-coding potential classifiers (see Materials and Methods). We also removed any lncRNA genes, such as cis-antisense and intronic lncRNAs, that overlap annotated protein-coding genes. In this way we narrowed the set of GENCODE v19 lncRNA genes from 13,870 to 5,887 intergenic, confidently-noncoding lncRNAs (Table 1). To this set we added back 27 cancer-related, GENCODE v19 lncRNAs from the literature (see below).
One advantage of ExInAtor is its indifference to genes' biotype. This arises from its lack of reliance on measures of functional impact 16 , meaning that it can equally be used on lncRNAs or protein-coding genes. Indeed, similar approaches have been used to discover coding driver genes in the past 23 . We took advantage of this to assess its ability to discover known protein-coding driver genes from the Cancer Gene Census 24 amongst the GENCODE annotation. This provided us with a useful independent validation of ExInAtor's precision, of particular value given the low number of known driver lncRNAs at present.

Datasets of somatic mutations in cancer genomes.
To search for lncRNA driver genes, we took advantage of the two largest available sources of cancer genome mutations: one collected by the Cancer Genome Project at the Sanger Institute, hereafter named "Alexandrov" 21 , and the other from The Cancer Genome Atlas (TCGA) 1 (Table 2). These data were aggressively filtered to remove potential artefacts arising from germline mutations (see Materials and Methods). The Alexandrov dataset comprises 9 cancers with between 15 and 119 individuals and 10,436 and 2,796,863 mutations each. The TCGA dataset consists of 14 cancers with between 15 and 96 individuals and 21,113 to 4,680,653 mutations each. Of note is the large spread in sample sizes and mutation rates across tumour types. Taking all cancers together, we observed an excess of mutations in lncRNAs  compared to protein-coding genes, and in background over exons, suggesting a general selective pressure against disruptive mutations in both gene classes ( Supplementary Fig. S2).
The landscape of driver lncRNAs across 23 tumour types. To comprehensively discover candidate lncRNA drivers, ExInAtor was run on the 23 tumour types described above. We adopted some analysis strategies to account for the relatively shallow nature of the data and our consequently weak statistical power to find driver genes. First, in order to discover both cancer-specific and ubiquitous driver genes, ExInAtor was run on each dataset in distinct configurations: (1) grouping samples by tumour type ("Tumour Specific"), (2) pooling together the entire set of tumours within each of the two projects ("Pancancer") and (3) pooling data across both projects ("Superpancancer"). Second, we used sample stratification to boost sensitivity. This approach is commonly used when statistical power is reduced by multiple hypothesis testing 25,26 . LncRNA genes were divided into two groups of different sizes, and each was treated independently during multiple hypothesis correction. This reduces the burden on resulting false discovery rate estimates. As a reference set, we curated 45 experimentally-validated cancer-related lncRNAs from the scientific literature, henceforth "Cancer-Related LncRNAs" (CRLs) (Supplementary File S1). All CRL genes belong to GENCODE v19 annotation. Remaining filtered lncRNAs are referred to as "Non-CRL" (Supplementary File S2). Summary statistics of the gene sets used are shown in Table 1.
At a Q value (false discovery rate) cutoff of 0.1, we discovered a total of 15 lncRNAs (6 and 9 from CRL and non-CRL, respectively) ( Fig. 2A) (Supplementary Files S3 and S4) and 24 protein-coding genes (Supplementary File S5). Relaxing the cutoff to Q < 0.2, we discover 10 and 27 CRL and non-CRL lncRNAs, respectively. Henceforth we refer to these as driver genes, and a Q-value threshold of 0.1 is assumed unless stated otherwise. ExInAtor predicted a total of five lncRNA driver genes in Alexandrov tumours, nine in TCGA and two in Superpancancer (one of them already detected in Pancancer TCGA). The greatest numbers of drivers predicted in individual tumours were three apiece in Breast and Kidney Chromophore (Fig. 2D).
Several findings suggest that false positive prediction rates are low. Reported P values closely follow the expected null distribution for the majority of genes (a full set of Quantile-quantile (QQ) plots can be found in Supplementary Fig. S3). Furthermore, while a number of tumour types display a small number of putative driver lncRNAs that strongly deviate from the null expectation (exemplified by Breast cancer sample in Fig. 2B), other samples yield no candidates at all (eg Liver cancer, Fig. 2C). In general, inspection of QQ plots shows a tendency for deflation of P values ( Supplementary Fig. S3). To further test false discovery rates, we reran these analyses on tumour data that had been randomised using two different methods (see Materials and Methods for details). ExInAtor predicted no lncRNA drivers in either dataset (grey dots in Fig. 2B and C and Supplementary Fig. S3).  In order to rule out the possibility that our candidates are false candidates due to effects of elevated local mutation rates, or gene lengths, we performed two additional simulations. For these analysis we selected the Breast collection, as it has the highest number of non-CRL predictions (3). In the first simulation, we shuffled the lncRNA genes 100 times within the whole genome, not altering exon-intron structure, and keeping real mutation positions. The real number of candidates is greater than 97% of simulations, finding an average of 0.52 candidates per iteration which implies a false discovery rate of ~0.17, consistent with our cutoff of FDR 0.1 ( Supplementary Fig. S4). In the second simulation, we examined whether predicted targets could be false positives due to locally elevated mutational rates. We constructed an annotation of artificial "intronic" genes, where the exons were replaced by equally-sized, randomly-selected fragments of introns from the same gene. These genes were used as the input for ExInAtor and run (with same settings as for real data) on Breast samples. Zero candidate drivers were predicted ( Supplementary Fig. S5).
Together these data point to a rather conservative statistical model, with low false positive predictions, which may discard some bona fide drivers. A comprehensive set of predictions across all analyses can be found in Supplementary File S6.
ExInAtor identifies known and novel lncRNA driver genes. ExInAtor's sensitivity is demonstrated by its identification of altogether six CRL genes. These are: MALAT1, NEAT1, PCA3, BCAR4, lncRNA-ATB (CTD− 2314B22.3) and the recently-discovered melanoma driver SAMMSON (RP11− 460N16.1) ( Table 3). The latter was detected in stomach adenocarcinoma, and we found that it is also present in stomach RNAseq ( Supplementary Fig. S6). The majority of candidates were found in tumour-specific analysis (Fig. 3A). Nevertheless, two CRL lncRNAs, NEAT1 and MALAT1, were identified in Pancancer analysis, consistent with a general role in tumourigenesis: both are long, unspliced and nuclear-retained lncRNAs with demonstrated roles across a range of cancer types 9 . It's worth mentioning that neither of these genes was identified in the positional shuffling of Breast data described above, suggesting that these genes are not false positives due to their length. As shown in Fig. 3B, the NEAT1 exon region experiences an elevated mutation rate across cancers, when compared to its flanking background regions. NEAT1 was identified in a recent study of liver cancer genomes, and as the authors pointed out, it cannot be ruled out that it is identified through increased local mutation rate 27 .
One important potential source of false positive signal in this study could be elevated mutational rates in DNA regulatory elements, such as enhancers, which happen to overlap the exon of a lncRNA annotation. Such cases would be expected to produce driver lncRNAs, where all mutations are concentrated in a single exon. This would be indistinguishable from bona fide driver lncRNAs that have an important functional domain located in a single exon. To investigate this further, we inspected the exon-level mutational density of all candidate lncRNAs ( Supplementary Fig. S7). Intriguingly, we find at least two cases where mutations are elevated across multiple exons, but not intervening introns ( Fig. 3C and D). Altogether of 13 multi-exonic candidate lncRNAs, five have mutations in more than one exon. This supports the interpretation that, for these cases at least, mutations cause gain-or loss-of-function in mature lncRNA transcripts, and not through disruption of a DNA-encoded element.
Amongst the novel candidate driver genes were a number of intriguing cases with various characteristics of functionality, none of which have been described in the scientific literature. In Fig. 3F we highlight one case from the wider Q < 0.2 candidate set (see Supplementary File S6), RP11-820L6.1, whose promoter is characterised by canonical histone modifications, obvious evolutionary conservation and the recruitment of transcription factors. Most notably, the master tumour suppressor transcription factor and regulator of several cancer lncRNAs, P53, is bound within the first intron 28 .
We further sought to establish the degree of overlap between ExInAtor-predicted driver genes and candidates predicted by transcriptomic analyses. Two previous studies to identify cancer-related lncRNAs have searched for differentially-expressed transcripts in cancer transcriptome data from microarrays and RNA sequencing 8,12 .  Fig. S8). It should be noted that all these genes belong to the CRL set. MiTranscriptome and Du share 11 genes (P < = 0.0001, Chi-square with Yates' correction test). This surprising discordance of driver gene prediction between studies, in addition to their lack of overall intersection with the published CRL set, suggests that (1) these large-scale predictions have considerable false negative rates, and (2) that available catalogues of cancer-related lncRNAs, represented by the CRL set, are incomplete.
We searched for independent evidence of cancer roles for ExInAtor-predicted candidates. Importantly, we separately considered (1) the entire set of candidates, including known CRL genes, and (2), the novel ExInAtor candidates alone. This ensures that findings are not biased by the inclusion of experimentally-verified CRL drivers amongst candidate gene sets. We first tested the frequency with which candidates are affected by copy number variants (CNVs) across matched cancers 29 . We found that all candidates, and novel candidates alone, both display a trend to have elevated rates of copy number variation (Fig. 3E). We also investigated whether candidates are more proximal to germline cancer mutations 29 . Once more, we observe a trend for candidates to be more likely to   be proximally located to such mutations than expected by chance. Although the small numbers involved do not generally reach statistical significance, these findings are additional evidence that ExInAtor predictions, either including or excluding known cancer-related lncRNAs, are involved in tumour progression.
ExInAtor identifies known protein-coding cancer genes. Although ExInAtor was designed with lncRNAs in mind, it makes no use of functional impact predictions and hence is agnostic to the protein-coding potential of the genes it analyses. We took advantage of this versatility to further test ExInAtor's precision, by comparing predictions to the gold-standard catalogue of the Cancer Gene Census (CGC) 24 . CGC is a manually-curated and regularly-updated annotation of genes whose somatic mutations have been associated with cancer. CGC genes represent a subset of 545 genes (Supplementary File S9) (2.7%) of the entire GENCODE set of 20,314 studied here (Supplementary File S10) ( Table 1). We ran ExInAtor using protein-coding gene annotations, without stratification. Altogether, a total of 24 protein-coding drivers were identified at a false discovery rate cutoff of Q < 0.1. Of these, 9 (38%) are CGC genes (indicated in red, Fig. 4A). This represents enrichment of 14-fold over random expectation (P < = 0.0001, Chi-square with Yates' correction test). The most significantly enriched gene in this analysis is TP53, the most frequently mutated across cancers and identified in previous exome sequencing projects 30 . TP53 exons display an obvious and consistent enrichment of somatic mutations in both datasets, clustered in exons 4 and 7-11 (Fig. 4D). This TP53 signal is observed in both Pancancer and multiple individual cancer types.
Several of the 15 non-CGC genes identified have evidence for cancer roles: ANKRD18A in lung cancer 31 , DDX3X and PBRM1 in various cancers 32 , HPSE2 in thyroid carcinoma 33 , MYO5B in gastric cancer 34 . These findings suggest that ExInAtor precision may be better than implied by the analysis of CGC genes alone.
We examined the performance of ExInAtor, in terms of the percent of predicted genes that belong to CGC, at a series of Q value thresholds (Fig. 4B) (Supplementary File S11). Shown are separate analyses for all cancer types (expressed as mean prediction per cancer), and various pancancer combinations. These show that, although the number of predicted genes are low, they tend to have far higher rate than that 2.7% expected by random chance, even at a Q value threshold of 0.1.
In summary, ExInAtor performs well in identifying known cancer related genes at high precision from a protein-coding training set ~10 times larger than CRL lncRNAs.

ExInAtor is competitive with tools designed for protein-coding genes. Next we compared
ExInAtor to a series of well-known pipelines for identification of protein-coding drivers: MutSig 17 , OncodriveFM 16 and OncodriveClust 35 . In side-by-side analyses on identical Alexandrov Pancancer data, we found that ExInAtor has low sensitivity (ie makes few predictions), but has excellent precision. In fact, its predictions contain a higher percentage of CGC genes than the other methods ( Fig. 4C and Supplementary Fig. S9). For example, at a cutoff of Q < 0.1, ExInAtor predicts 3 genes (of which 2 are known drivers), compared to 4 known drivers out of 39 for MutSig, 11 known drivers out of 104 for OncoDriveClust and 59 known drivers out of 589 for OncodriveFM (Fig. 4C). Furthermore, comparing the top 30 candidates detected at several cutoffs ( Supplementary Fig. S10), the majority of genes detected by ExInAtor are also detected by at least one other method.
We also compared the four programs' performance on real and simulated Pancancer data, displayed as Q-Q plots in Supplementary Fig. S11. Again, ExInAtor performs relatively well: its predictions on true data mirror the expected distribution quite well, and true P values are smaller than for simulated data. ExInAtor predictions appear to be conservative, having a tendency for moderately deflated P values. In contrast, other methods tend to perform worse, being either strongly deflated (MutSig), inflated (OncodriveFM) or predicting less in true than randomised data (OncodriveClust). In summary, despite not employing any information from functional impact of mutations on protein-coding sequence to inform its predictions, ExInAtor is surprisingly competitive with existing methods in the identification of coding driver genes. In particular, its predictions have low sensitivity (possibly many false negatives) but high precision (a high fraction of true positives). This lends weight to the accuracy of ExInAtor's lncRNA predictions.
LncRNAs are predicted as drivers at higher rates compared to coding genes. We were interested in the overall rates of prediction of lncRNAs and protein-coding genes, as well as their apparent tumour-specificity. Known driver genes are highly variable with respect to their tumour-type specificity. TP53 mutations are found across a wide range of cancers, while other drivers are only mutated in single tumour types 30,32 . In this analysis, we detected no lncRNAs in more than one tumour (Supplementary Fig. S12). In contrast, two coding genes were discovered in two independent cancer types, while TP53 was identified in no less than 9. Interestingly, a higher fraction of lncRNAs was predicted as driver genes than protein coding: 0.25% and 0.11%, respectively. These figures are likely to be strongly influenced by both the low sensitivity of ExInAtor discussed above and by the sparse data. In future, many more genes are likely to be identified in multiple cancers when deeper data is available. Nevertheless these findings suggest that lncRNA are mutated in cancer at a rate similar to, or higher than protein-coding genes.
Novel and known driver lncRNAs share distinctive features of functionality. Returning to the driver lncRNAs identified by ExInAtor, we next asked whether any features distinguish these from other lncR-NAs. Previous studies of lncRNA have used features such as evolutionary conservation and expression as proxies for functionality 36,37 . Furthermore, previous research on protein-coding cancer genes showed that their genes and their processed transcripts tend to be longer than average 38 .
We compiled a series of features and, for each one, asked to what extent it differs between the CRL genes and all other lncRNAs. The full set of results, plotted by magnitude of difference and statistical significance, are Scientific RepoRts | 7:41544 | DOI: 10.1038/srep41544 shown in Fig. 5A. It is clear that CRL genes are distinguished by a diverse range of features. They are transcribed from longer genes, and have longer mature transcripts ("exonic length"). They are under stronger evolutionary constraint: their promoters and exons are more evolutionarily conserved across a range of evolutionary distances. Their steady state RNA levels are higher and more variable across human tissues. Finally, they are also more likely The precision of ExInAtor predictions was estimated as the percent of predicted driver genes that also belong to CGC. Bars are coloured by the Q-value cutoff used, and the fraction of all known genes belonging to CGC is shown in blue as a reference. "Mean" displays the average overlap across all individual cancer types. The numbers above each bar indicate the total number of predicted driver genes at that cutoff. For example, in "Superpancer", a total of three candidates are identified at a cutoff of 0.1, of which two (66%) belong to CGC. (C) Comparison of the performance of ExInAtor to other methods for protein-coding driver gene discovery, using the Alexandrov Pancancer dataset. Plot description as for Panel B.  to be proximal to a binding site of the P53 tumour suppressor. In contrast, there is no difference in genic or exonic GC content between CRLs and other genes.
Having established a series of cancer lncRNA-specific features, we asked whether these features are also present in ExInAtor candidate genes. We were particularly interested in whether novel candidates (ie non-CRL) share these characteristics, since this would represent an independent test for the value of ExInAtor predictions. Therefore we compared the features of three gene groups: CRL lncRNAs, all ExInAtor candidate genes, and novel ExInAtor candidates alone. These groups were compared to the null set of genes, represented by the entire set of remaining Gencode lincRNAs ("All other genes").
In Fig. 5B are shown the results across seven selected features. ExInAtor candidates, in common with CRL, have longer genes and transcripts than lincRNAs in general (P = 4E-8, P = 6E-4, respectively, Wilcoxon test). Surprisingly, and in contrast to CRL genes, ExInAtor candidates have significantly lower GC content (P = 7E-3), and higher repetitive sequence content (P = 0.03). Finally, for features of evolutionary conservation of both promoter and exon, in addition to steady-state RNA levels, we find that novel candidates display a similar trend as CRL genes, although these do not reach statistical significance (P > 0.05). In summary, and pending future replication with larger gene sets, it appears that novel ExInAtor predicted cancer genes share a number of distinguishing features with known cancer lncRNAs, consistent with being bona fide driver genes.

Discussion
Here we have presented ExInAtor, to our knowledge the first method specifically designed to identify cancer driver lncRNAs from tumour genome cohorts. ExInAtor aims to address the unique opportunity of comprehensively discovering cancer driver lncRNAs within and across tumour types using mutation data generated by projects such as TCGA and ICGC.
ExInAtor is the first approach dedicated to identifying driver lncRNAs. Due to their explicit use of protein-coding information, most existing tools are inappropriate for this task [16][17][18] . The most similar existing tool is probably InVEx 23 , which uses within-sample empirical distributions to identify protein-coding drivers. However it differs in several key ways from ExInAtor: InVEx utilises protein-based mutational impact scores to judge impact of mutated positions, while ExInAtor uses simple mutational burden; InVEx relies on randomisation, while ExInAtor utilizes the hypergeometric distribution for increased processing speed; and ExInAtor uses a larger, more flexible background region encompassing both flanking and intronic regions (InVEx utilises introns and UTRs only). In particular, the requirement for mutational impact scores would appear to make InVEx unsuited to analysis of the type presented here.
We have presented the results of scans across the two most substantial tumour genome sequencing cohorts presently available, the Alexandrov and TCGA datasets, altogether comprising more than 1000 genomes from 23 cancer types. In addition to successfully retrieving at nine known protein-coding drivers (38% of total predictions) and six published cancer-related lncRNAs (40% of predictions), we identify for the first time a total of nine novel lncRNA driver genes at low false positive rates (0.1 FDR). These novel candidates share with known cancer lncRNAs a series of features including evolutionary conservation, normal tissue expression and gene length. They also tend to be proximal to germline cancer SNPs and have increased probability of lying in CNV regions, lending weight to their association with tumourigenesis. Together these observations support the idea that ExInAtor predicts bona fide driver lncRNAs. The true test of these predictions must await experimental validation in cell lines and animal models, where the tumorigenic effect of the observed tumour SNPs can be tested in a controlled way.
The distinguishing features of cancer-related lncRNAs are reminiscent of similar findings for protein coding genes 38 . Evolutionary conservation and high steady-state RNA levels are generally interpreted in this context as evidence for functionality of lncRNAs 36,37 . The significance of other features is less clear, and we should be careful to consider possible non-biological factors. In the case of gene length, it is likely that ExInAtor has greater statistical power for longer genes, possibly explaining the significantly elevated lengths of known and novel candidates. Furthermore, it is likely that the annotated length of lncRNAs is correlated with their expression, since higher expressed genes have more supporting ESTs and cDNAs, and hence are more complete.
Other observations were unexpected: the exons of novel candidate drivers have elevated repetitive content and reduced GC content. Furthermore, and in contrast to the above, these features are not shared with known CRL driver genes. It is unclear whether this reflects technical artefacts of the analysis, or a genuine biological insight. We can think of no bias in ExInAtor, or the cancer mutation datasets, that may favour gene models with these properties, although it is entirely feasible. On the other hand, transposable elements have been linked to both cancer 39,40 and lncRNA functionality 41 . It is attractive to hypothesise that repeat-rich lncRNAs play roles in tumourigenesis and are preferentially mutated during this process. Further study will be required to establish the significance of these findings.
Previous driver gene predictions have been prone to false positive predictions arising from heterogeneous mutational rates, most notably those correlating with low gene expression level and late replication timing 17 . Although the use of local background mutation rates in ExInAtor would be expected to mitigate this, we considered whether such processes could have introduced error in this study. Both replication timing (earlier than average, Supplementary Fig. S14) and expression (higher than average, Fig. 5B) of candidate driver lncRNAs are opposite to that expected if they were false positives. Furthermore, the high precision of protein-coding predictions with respect to the Cancer Gene Census, the favourable performance compared to well-established methods, and the lack of predictions in locally-shuffled data (that presumably maintains replication-dependent mutational biases) are also inconsistent with such an effect. Finally, of the 24 ExInAtor protein-coding candidates, none are amongst the problematic false positive genes mentioned in the Lawrence et al. study 17 . Nevertheless it will be important to remain vigilant for these and other sources of error in future lncRNA driver discovery efforts.
At present, our understanding of how lncRNA function is encoded in sequence motifs and structures is limited 20 . Consequently, advanced approaches for scoring the functional effect of mutations, such as those used for protein Scientific RepoRts | 7:41544 | DOI: 10.1038/srep41544 sequences, are unavailable. We have here hypothesised that driver lncRNAs should display an elevated mutational burden in their exons, under the assumption that function is mediated by the spliced transcript. Mutations in these regions might be presumed to affect function through disruption or creation of microRNA 42 or protein binding sites 43 , or else through alterations to the RNA folding 44,45 . Future improvements to ExInAtor may include information on RNA structures, protein binding sites, post-transcriptional processing and evolutionary conservation to weight mutations based on their likely impact on lncRNA function. Furthermore, more sensitive statistical methods employing information on mutation clustering and cancer-specific mutational signatures will likely improve predictions.
It remains unclear how many lncRNA drivers remain to be discovered, and which have tumour-specific or pan-cancer activity. We expect that future studies will yield many more candidate lncRNAs than produced here: although the datasets we have used represent a large proportion of all presently available tumour genomes, future projects -including PCAWG -will be larger and produce mutation calls of better quality 1,46 . It is also likely that ExInAtor's sensitivity can be improved in future: analyses of protein-coding and lncRNA true positive datasets clearly showed that, while ExInAtor makes predictions with high precision (high true positive rate), it likely also calls many false negatives. This lack of sensitivity arises from the fact that ExInAtor only uses mutational burden to call drivers, and ignores measures of functional impact. The benefit is ExInAtor's agnosticism as to gene biotype, but it carries obvious drawbacks in sensitivity. For these reasons, it would be unwise at this point to comment on the relative proportions of tumour-specific and pancancer lncRNAs. Nevertheless, evidence consistently points to lncRNA have a more tissue-specific expression compared to mRNAs 4 , making it plausible that they should have a higher rate of tumour-specificity compared to protein-coding genes.
The increasing scale of cancer genome projects will place a growing emphasis on computational efficiency. One of the benefits of ExInAtor is its ability to handle data with complex trinucleotide biases uses a simple subsampling algorithm, and without any functional impact predictions. This simplicity has the unintended benefit that ExInAtor is capable of identifying protein-coding drivers with precision comparable to the best methods. Another outcome is that ExInAtor makes very low computational demands: analyses for this paper were executed on a workstation running Intel Core i7 processors. 25 minutes were required to analyse protein coding genes in Superpancancer (the largest dataset tested here) using a single core and 2,050 MB of RAM. It required just three minutes to analyse Pilocytic astrocytoma with six cores and 648 MB of RAM. Together, these features make ExInAtor suited to future, large-scale cancer genome sequencing projects.

Materials and Methods
Gene annotation and filtering. The GENCODE v19 lncRNA catalogue was downloaded in GTF format from (www.gencodegenes.org) 4,47 , and comprises 13,870 genes. A number of filtering steps were applied to this list. First, only intergenic genes (having no transcripts overlapping protein-coding genes on the opposite strand, or within 10 kb at their closest point on the same strand) were retained (6,308). Second, any lncRNA gene with transcripts of uncertain protein-coding potential were removed, leaving 5,887 genes (see below for details). Third, we included several cancer-related lncRNAs from the scientific literature, resulting in a final set of 5,914 lncRNA genes (Table 1 and Supplementary Files S1 and S2). Note that literature genes may violate the two filters above, but must have a GENCODE identifier. This set of filtered lncRNAs is used throughout.
The protein-coding gene catalogue was also obtained in GTF format from GENCODE v19 47 . From this annotation, all genes with biotype "protein-coding" were selected, resulting in 20,314 genes and 145,518 transcripts. Finally, all transcripts not having biotype "protein-coding" were removed, reducing the transcripts to 81,702. (Table 1 and Supplementary File S10).
Somatic mutation data curation. Whole-genome cancer somatic mutations were obtained in BED format from two sources: 10 cancers described in Alexandrov et al. 21 , and 14 cancers from TCGA 1 . In addition, we included an additional dataset of 100 stomach adenocarcinoma (STAD) with the Alexandrov dataset 48 , resulting in an original set of 22,877,059 mutations. Only single nucleotide somatic mutations and indels of length 1 were retained, hereafter referred to as "mutations". AML and ALL cancers from the Alexandrov dataset were removed due to their low number of genomes and mutations. Statistics on the remaining cancers can be found in Table 2. Both mutation datasets were prefiltered in order to remove possible misannotated germline SNPs. First, any mutations identical to an entry in dbSNP 146 "common" (> 1% frequency) were removed, leaving 22,128,594 mutations (96.7%). Second, any recurrent mutations, having the same nucleotide change observed in the same location more than once, were collapsed and treated as a single event, resulting in a final set of 20,837,263 mutations (91.1%). Other more stringent SNP filtering strategies resulted in a loss in precision and sensitivity (Supplementary File S12). We also experimented with removing hypermutated tumour samples, but once more this resulted in reduced performance (Supplementary File S12). A similar effect was observed after masking repetitive regions (Supplementary File S12).
Assessing the protein-coding potential of lncRNA. All GENCODE v19 lncRNA transcripts were tested for protein-coding potential with CPAT 49 at default settings. Any gene having one or more transcripts predicted to be protein-coding (coding potential > = 0.364) was removed from further analysis.
ExInAtor design. ExInAtor requires eight mandatory inputs: (1) a gene annotation in GTF format containing information on genes and exons to analyse (transcript information is ignored), (2) a catalogue of mutations in BED format, (3) the number of individual genomes or samples represented by the BED file, (4) the output folder destination, (5) a file with two columns showing the name of each chromosome and its nucleotide length, (6) a gene annotation in GTF format containing information on genes and exons of the whole genome (transcript information is ignored), (7) FASTA file of the whole genome and (8) a file containing all the possible trinucleotides. Optional inputs are: (1) a minimum number of exonic and/or (2) background mutations that each gene must have to be analysed, (3) the number of CPU cores to use in the analysis and (4) the extension length of the background region that includes all introns.
The ExInAtor workflow can be divided into the following steps: exon and background definition, mutations mapping, sub-sampling of background region, gene filtering by mutation counts and statistical analysis ( Fig. 1 and Supplementary Fig. S1).
Exon/Background definition. The full set of exons from all transcripts belonging to a gene are merged. The remaining genic space is then defined as background, which is extended to both sides of the gene according to the window length parameter. In the present study, this value was set at 10 kb throughout. Regions overlapping exons from any other gene are removed from this background region. The coordinates of non-overlapping exons and background regions are saved in BED format (Fig. 1A).
Mutations mapping. Mutations are mapped to exons and background regions, then counted. We collapsed the recurrent, identical mutations into one single mutation. However, if two or more distinct mutations fall in the same position they are counted separately.
Sub-sampling of background region. The trinucleotide content of the region is calculated. Then, an identical number of the same trinucleotides are randomly sampled from the background region. Their accompanying mutations are recorded. This is performed sequentially, without replacement, until it is impossible to continue. At every iteration, the sampled positions are added to a new background region, along with their associated mutations. In this way, a new background region of maximal size and identical trinucleotide composition to the exonic region is assembled for every gene (Supplementary Fig. S1).
Gene filtering by mutation counts. Mutation data are sporadic and of low density, potentially resulting in inflated P values. To avoid this, ExInAtor accepts a user-defined minimum number of exonic and background mutations, below which lncRNAs will not be considered. These cutoffs may be defined by the user, with the default filter (used in the present study) discarding genes with less than 1 exonic mutations or 1 background mutations.
Statistical analysis. Statistical enrichment of exonic mutations is determined using the hypergeometric test (Fig. 1B). The following contingency table is compiled for each gene, with the total exonic and background lengths, N and n respectively: This is the starting point for calculations of statistical significance of enrichment of exonic mutations using the hypergeometric distribution, which describes the probability of obtaining a given number of successes in a given number of draws without replacement from a finite population of a specific size. It is important to note that the positions corresponding to each genome are counted independently, meaning that the total gene length N is defined as gene length multiplied by the number of genomes. n is treated similarly. Statistical significance is estimated for a gene to have that many or more exonic mutations, then are corrected for multiple hypothesis testing using the Benjamini-Hochberg procedure, which controls the False Discovery Rate (FDR), here indicated by "Q".
ExInAtor returns the input gene list with mutation counts and associated exonic enrichment Q-values. The latest ExInAtor version is freely available for download here: https://github.com/alanzos/ExInAtor/.
Creation of a simulated mutation dataset. Two distinct methods were used to create trinucleotideaware simulations of tumour mutations. In the first method ("Fixed window reassignment"), the genome was divided into fixed partitions of 50 kb. Mutations were randomly assigned to another genomic location with the same reference trinucleotide and surrounding nucleotides for substitutions and indels, respectively. In the second method ("Sliding window reassignment"), a 50 kb window is centred on each individual mutation. The mutation is then reassigned to another position with identical reference trinucleotide within its window. These simulations, while maintaining approximately the same number of single nucleotide substitutions and indels of the original Alexandrov dataset as well as the same mutation trinucleotide signature, constitute neutral datasets that are not expected to be enriched in cancer related lncRNA.

Visual inspection and validation of candidates' mutations.
To verify the quality of the mutation calling, we visually validated 12 single somatic mutations from 4 candidates. First, we downloaded a SAM file of the surrounding regions of each mutation (+ /− 2 kb) with the BAM Slicer tool from CGHUB (https://cghub.ucsc.edu/); then we opened those files with IGV to check the reads supporting the mutations ( Supplementary Fig. S13) 50 .
Curation of Cancer Related LncRNA (CRL) set. A literature search was performed to identify lncRNAs with validated roles in cancer. Principal sources were Pubmed, lncRNAdb 51 and lncRNADisease 9 . Two criteria were applied for inclusion. First, lncRNAs must have a GENCODE identifier. Second lncRNA must have a causative role in cancer or cancer-relevant phenotypes, as judged by in vitro or in vivo experiments, somatic mutations or germline causative variants. Differential gene expression alone was not sufficient evidence for inclusion. The full set of CRL lncRNAs can be found in Supplementary File S1.
Comparison of cancer features. We obtained the list of lncRNA genes proximal to cancer-related germline SNPs from Table S5 of ref. 29. The enrichment of indicated lncRNA genesets with respect to these genes were assessed by contingency table analysis using Fisher's Exact text. For the analysis of CNVs, the set of regions were obtained from Table S3 of the same paper, and statistical enrichments were calculated similarly. Only data from cancers corresponding to those in the present study were considered.
Comparison of lncRNA features. To assess which features may distinguish cancer lncRNAs, we collected different genomic and expression data for all the genes and divided them into the four groups of interest: (1) non-CRL non-candidates (non-CRL gene list excluding ExInAtor discoveries at a Q ≤ 0.2, "All other lincRNAs"). (2) CRL genes.
For each feature we compared all groups to non-CRL non-candidates. Statistical tests were performed using R. Features were compiled from the following sources: Gene Sequence. Gene sequence features were calculated based on Gencode v19 annotations. Exonic regions of each gene were defined as the projection of all exons from the union of its transcripts. Promoter regions of each gene were defined as a window of + /− 100 nucleotides from the reference transcription start site (TSS).
Conservation. PhastCons scores from vertebrate and primate species alignments and PhastCons Elements from vertebrate, mammals and primate species alignments were downloaded from UCSC Genome Browser. Two separate analyses were performed, using either base-level scores, or conserved element regions. We separately computed the average exonic base-level conservation score of each gene for primates and vertebrates PhastCons scores. We merged conserved elements annotations from primate, mammal and vertebrate species alignments and intersected these regions with promoters and exonic regions. We then computed the percent of nucleotides (from promoters or exonic regions) covered by conserved elements for each gene.
Repeat Elements. We downloaded the 2013 version of RepeatMasker human genomic repetitive element annotations and converted it to BED format. These annotations were intersected with exonic regions of lncRNAs. For each gene we calculated the percent of exonic nucleotides overlapping repetitive elements.
Tissue Expression Analysis. We extracted tissue expression values for 16 human tissues from Human Body Map (HBM) RNAseq data, downloaded from ArrayExpress under accession number E-MTAB-513. These data were used to quantify Gencode v19 genes using the GRAPE pipeline 52 . Considering only genes that are expressed (RPKM > 0) at least in one tissue we described the mean, the maximum and the variance of RPKM expression values across tissues. The percent of expressed genes for a given group represents the total number of genes that are expressed at least in one tissue compared to the total number of genes of the given group.
P53 analysis. We obtained ChIP data for p53 binding sites from ref. 28. Binding maps from the two available timepoints were merged. We attempted to assess a possible link between cancer driver lncRNAs and p53 binding site regions in two different ways. We first analysed whether the position of CRL genes in the genome tend to be closer to p53 binding site regions compared to non-CRL genes. To this aim, we calculated the nucleotide distance from the promoter of the gene (defined as explained before) to the closest p53 binding site region for all CRL and non-CRL genes. As an alternative, we compared the probability of finding a p53 binding site close to a TSS for CRL and non-CRL genes: for each we counted how many genes out of the total contain at least one predicted p53 binding site region within a window of 100 kb, centred on the TSS.
Replication timing. Replication timing analysis was performed using data for the whole human genome produced by massively parallel sequencing of nascent BrdU-labeled replicating DNA in HeLa cells 53 . S50 ratios, defined as the fraction of the S phase at which 50% of the DNA is replicated in a defined genomic region, for 100 kb genomic regions along the human genome were downloaded. These regions were intersected with ENCODE v18 long-noncoding RNA annotations, and the average of S50 values of the nucleotides of each gene was calculated.