lncRNAKB: A comprehensive knowledgebase of long non-coding RNAs

We have assembled a comprehensive long non-coding RNA knowledgebase (lncRNAKB) of 77,199 annotated human lncRNAs (224,286 transcripts) by methodically integrating widely used lncRNAs resources. To facilitate functional characterization of lncRNAs, we employed Genotype-Tissue Expression (GTEx) project to provide tissue-specific gene expression profiles of lncRNAs in 31 solid organ tissues. Additional information includes network analysis to identify co-expressed gene modules to potentially delineate lncRNA function. Tissue-specificity, phylogenetic conservation scores and coding potential for lncRNAs are included. Finally, using whole genome sequencing data from GTEx, expression quantitative trait loci (cis-eQTL) regulated lncRNAs were calculated in all tissues. lncRNAKB is available at http://www.lncrnakb.org.

some conditions [4]. Among the non-protein coding transcripts, the long non-coding RNAs (lncRNAs) are a class of transcripts whose length range from 200 base pairs (bp) to 100 kilobases (kb) (approximately 10 kb on average) [5]. Currently, the number of estimated lncRNAs annotations in humans range from 20,000 to 100,000 [6].
While the field of lncRNA is nascent, enough evidence exist to suggest that they play a critical role in many biological processes including transcriptional and post-transcriptional regulation, epigenetic regulation, organ or tissue development, cell differentiation and apoptosis, cell cycle control, cellular transport, metabolic processes and chromosome dynamics [7,8]. A consensus has emerged on other features of lncRNAs such as lack of interspecies conservation [9][10][11][12], low level of expression [7], and higher degree of tissue-specific expression compared to mRNAs [10,13]. The little sequence conservation among lncRNAs is confined to short, 5′-biased patches of conserved sequences nested in exons [9]. Some lncRNAs undergo translation, though only a minority of such translation events results in stable and functional peptides [14,15].
Recently, several publicly available resources dedicated to annotation of lncRNAs in humans and other species have been developed [6,16,16,17]. Most of these databases are available through web-based searchable interfaces and provide downloadable annotation files in Gene Transfer Format (GTF) or Gene Feature Format (GFF) [18][19][20]. Frequently used resources of lncRNAs annotation include GENCODE [21,22], CHESS [23], LNCipedia [24,25], NONCODE [26], FANTOM [27], MiTranscriptome [28] and BIGTranscriptome [3]. These resources annotated lncRNAs by two approaches: manual or automatic [6]. Manual annotation involves human annotators curating gene and transcript models based on RNA and protein experimental evidence and defined sets of rules [22]. Automatic annotation uses bioinformatics methods such as StringTie [29] and Cufflinks [30] to reconstruct gene and transcript models based on billions of short RNA-sequence (RNA-seq) reads [28]. A few of these databases have attempted to integrate annotations from multiple sources such as expression, methylation, variation, conservation and functional annotation of lncRNAs in humans. The information on lncRNAs provided in these databases is, however, not comprehensive and at times inadequate, making it difficult to understand their molecular and cellular functions.
To address the limitations and shortcomings of the existing lncRNA databases, we have developed lncRNAKB which systematically combines the frequently used lncRNAs annotation resources mentioned above using a cumulative stepwise intersection method. This method of integration compares the annotations thoroughly, discarding redundant/ambiguous lncRNAs records and also accounts for the large overlap between the lncRNAs annotation databases to avoid redundancy. The lncRNAKB also provides a comprehensive downloadable, searchable and viewable (via the UCSC Genome Browser) [31] GTF annotation file of human PCGs and a large number of lncRNAs (n=77,199) that can be used by researchers to quantify RNA-seq data generated in lab or downloaded from public domains for lncRNA discovery. To improve lncRNAs' functional characterization, we have implemented an up-to-date analysis pipeline to analyze the tissue-specific RNA-Seq data available through Genotype Tissue Expression (GTEx) project [32] and have created a tissue-specific expression body map of human lncRNAs. Using the gene expression information, we calculated tissue-specificity scores across all the genes in the lncRNAKB. Additionally, we calculated expression quantitative trait Loci (eQTL)-regulated lncRNA genes using the GTEx expression and GTEx whole genome sequencing (WGS) genotype data and created a tissue-specific eQTL body map of human lncRNAs. Additional features of the lncRNAKB include providing information regarding classification of lncRNA based on their positional information and calculating their coding potential using FlExible Extraction of LncRNAs (FEELnc) [33] algorithm. Furthermore, the knowledgebase also provides eQTL results and exon-level conservation scores derived from an alignment of 30 vertebrate species [31] for all PCGs and lncRNAs. Finally, to predict lncRNA functions, we used Weighted Gene Co-expression Network Analysis (WGCNA) [34] method to analyze lncRNAs-mRNAs co-expression patterns in a tissue-specific manner. The co-expression modules were subjected to pathway enrichment analysis to identify functional pathways associated with lncRNAs thereby creating a tissue-specific body map of functionally annotated lncRNAs. All pathway results files can be browsed or are available for download. Moreover, for each tissue we have selected 25 notable pathways and created a dynamic network figure on the website to view the strength of connections between the highest-ranking mRNAs and lncRNAs. All these features in combination with annotation of 77,199 lncRNAs presented in the lncRNAKB provides the most up-to-date and comprehensive functional information about lncRNAs.

Data sources and collection
To identify widely used lncRNAs databases we performed literature search of the PubMed database through February 28 th , 2019 with the following keyword algorithm: (lncrna or long noncoding or long non-coding rna or noncoding) and (annotation or function or database).
A total of 13,412 articles were returned filtered by human species and published within the past five years sorted by the best match criteria. The titles, abstracts, keywords, and full text were manually reviewed to identify publications that reported lncRNAs annotations, databases and function. The references of these articles were also searched to identify other articles that were potentially missed by the initial PubMed search. After this review, six lncRNA resources were selected for step-wise integration to create lncRNAKB. The six resources are: CHESS (version 2.1), LNCipedia (v5.2), NONCODE (v5.0), FANTOM (5.0.v3), MiTranscriptome (v2) and BIGTranscriptome (v1).

Data integration
The gene transfer format (GTF) or gene feature format (GFF) from all six annotation databases (links in Table 1) were downloaded. To streamline the data integration step, all the GTF or GFF annotations were parsed to the same format using the following steps: (i) where required, the coordinates of annotation were updated using the UCSC liftOver tool [31] from hg19 to hg38 (latest genome build), and (ii) for each chromosome, the gene and transcript records were split into individual files named by chromosome, strand, start and stop base pair locations. Each gene block file contained the transcripts information and the transcript block file contained the exons information. In cases where the transcripts or exons records lacked genes information, a gene entry was manually created using the gene ids in the transcripts or exons records and combined with the base pair locations of the first exon (as gene start), of the last exon (as gene stop), and transcript strand to represent the gene strand. All the redundant records between annotation files were removed in this process.
Using CHESS as the reference annotation database (containing both protein-coding and lncRNAs genes) we used a cumulative stepwise intersection method to merge it with the rest of the five lncRNAs annotation databases in the following order: (i) FANTOM, (ii) LNCipedia, (iii) NONCODE, (iv) MiTranscriptome and (v) BIGTranscriptome at the genes and transcripts levels. This order of intersection was arbitrarily chosen. Figure 1 illustrates the cumulative stepwise intersection method for two databases as an example, D1 (CHESS) in blue and D2 (FANTOM-lncRNAs only) in green. For each gene entry in D1 (top blue panel), we kept genes from D2 (green panel) that had full overlap and were also within D1 gene boundary. The resulting intersection is shown in orange. D2 gene that had partial overlap with D1 gene (marked in red X) were discarded as we did not want to re-define boundaries of genes in the reference annotation database.
For genes that intersected, the transcript records (shown as smaller bar connected by line to represent exons and introns, respectively) from D1 and D2 were compared. Similarly, to the gene intersection, transcript entries whose start and stop were within the gene boundaries were included. Several transcripts (marked with a red X) that fell outside the gene boundary and were probably incorrectly assigned to genes were removed in this process. In addition, if a transcript in D2 had partial or no overlap with transcripts in D1, we incorporated that transcript (marked with red checks) including all the exons to the gene record accordingly. For genes with no overlap in D1, we added all the transcripts and corresponding exons to the merged annotation as a lncRNA entry (marked in red checks).

Evaluation of coding potential of lncRNAs
FlExible Extraction of LncRNAs (FEELnc) [33] was used to classify/annotate and calculate the coding potential of all the gene entries in the lncRNAKB. FEELnc annotates lncRNAs based on a machine learning method, Random Forest (RF) [35], trained with general features such as multi k-mer frequencies, RNA sequence length and open reading frames (ORFs) size. It is comprised of three modules: (i) filter, (ii) coding potential, and (iii) classifier. The filter module flags and removes transcripts overlapping (in sense) exons of the reference annotation, specifically the protein-coding exons. We used the GENCODEv29 [22] GTF file as the reference annotation to get an estimate of the number of transcripts from lncRNAKB overlapping with "protein_coding" transcripts. We arbitrarily set the minimal fraction out of the candidate lncRNAs size to be considered for overlap to be excluded as 0.75 (> 75% overlap) to retain many lncRNAs transcripts. Transcripts < 200 base pairs (bp) long were filtered out and monoexonic transcripts were retained. We then used the filtered GTF annotation output file from the filter module and calculated a coding potential score (CPS) for each transcript using the coding potential module. Due to the lack of a gold standard/known human lncRNAs data set for training, we used the "intergenic" mode in the module. This approach extracts random intergenic sequences of length L from the genome of interest to model species-specific noncoding sequences as the non-coding training set. We used the human reference genome FASTA file (hg38) and the GENCODE GTF file as the reference annotation. To get the best training set of known mRNAs, we used "transcript_biotype=protein_coding" and "transcript_status=KNOWN" for the RF model. We used the default values for the k-mer sizes, number of trees and ORF type. To determine an optimal CPS cut-off, FEELnc automatically extracts the CPS that maximizes both sensitivity and specificity based on a 10-fold crossvalidation. The CPS was between 0 and 1 where 0 indicates a non-coding RNA and a score close to 1 a mRNA. And finally, to classify potential lncRNAs with respect to the localization and the direction of transcription of nearby mRNAs (or other non-coding RNAs) transcripts as shown in Supplementary Figure 1, we used the classifier module. We used the final set of lncRNAs transcripts output from the coding potential module and classified them using the GENCODEv29 GTF file as the reference annotation. A sliding window size around each lncRNA was used to check for possible overlap with nearest reference transcripts. We used a minimum and maximum window size of 10 kilobase (kb) and 100kb respectively. The classification method reported all interactions within the defined window and established a best partner transcript using certain rules.

Conservation Analysis
Conservation of exons between protein-coding genes and lncRNAs in the lncRNAKB annotation database was analyzed using the bigWigAverageOverBed [36] and the cons30way (hg38) track [37] both downloaded from the UCSC genome browser. This track shows multiple alignments of 30 vertebrate species and measurements of evolutionary conservation using two methods (phastCons and phyloP [38]) from the PHAST package [39] for all thirty species. The multiple alignments were generated using multiz [40] and other tools in the UCSC/Penn State Bioinformatics comparative genomics alignment pipeline. An exon-level BED file was created using the lncRNAKB GTF annotation file separately for protein-coding genes and lncRNAs. We merged overlapping exons within transcripts to avoid counting conservation scores of overlapping base pairs more than once. For each exon, the bigWigAverageOverBed function calculates the average conservation score across all base pairs. Using boxplots, we visualized and compared the average conservation score differences between lncRNAs and protein-coding exons.

Expression profiling
To evaluate gene expression, RNA-seq data available through the Genotype Tissue Expression (GTEx) project [32] was utilized. The raw paired-end RNA-seq data (FASTQ files -GTEx Release v7) from the dbGap portal (study_id=phs000424.v7.p2) of 31 solid organ human normal tissues were downloaded. For each solid tissue, quality control of paired-end reads were assessed using FastQC tools [41], adapter sequences and low-quality bases were trimmed using Trimmomatic [42] and aligned to the human reference genome (H. sapiens, GRCh38) using the HISAT2 [43]. Using the uniquely aligned reads to the human genome, gene-level expression (raw read counts) were generated with the featureCounts software [44] using the lncRNAKB GTF annotation. After visualizing the distribution of uniquely mapped paired-end reads assigned to genes across all the GTEx samples, samples with < 10 6 reads assigned to genes were excluded.
In addition, there were samples with data that we could not map or download. We normalized the raw read counts to Transcripts Per Kilobase Million (TPM) [45]. To explore gene expression similarity between tissues and across GTEx samples as well as summarize lncRNAs tissuespecific expression we performed a principal component analysis (PCA) using the prcomp package in R [46,47]. We used the normalized TPM expression values, transformed by taking , across all lncRNAs (n = 77,199) and tissues (n = 31) (no filters applied).

Tissue-specificity scores
In addition to gene expression quantitation, we calculated two tissue-specificity metrics (Tau and Preferential Expression Measure (PEM)) [48,49] using the normalized TPM expression values across all genes and tissues (no filter applied). Tau summarizes in a single number whether a gene is tissue-specific or ubiquitously expressed across all tissues. PEM shows for each tissue separately how specific the gene is to that tissue. The PEM scores the expression of a gene in a given tissue in relation to its average expression across all other genes and tissues. The average gene expression across all replicates by tissue was used to compute Tau and PEM.
Genes that were not expressed in at least one tissue were removed from the analysis.

eQTL analysis
For each solid tissue, we implemented a two-step filtering approach, which is similar to the steps adapted by GTEx [32]. Briefly, the genes were first filtered based on TPM to include genes with TPM > 0.50 in at least 20% of the samples. Next, the genes were filtered based on raw counts to include genes with counts > 2 in at least 20% of the samples. The edgeR [56] and limma-voom [57] [58] package in R [59] were used to process the filtered read counts into log2 counts per million (log2CPM) that were normalized using trimmed mean of M-values (TMM) [60]. The expression files were then sorted by gene start and stop, compressed with BGZIP and indexed with TABIX [61]. Only tissues with > 80 samples were included in the cis-eQTL analysis. For eQTL analysis, the first five principal components (PCs) (see Genotype file processing) and sex information was included as a covariate. Within each tissue, cis-eQTLs were identified by linear regression, as implemented in FastQTLv2.0 (threaded option) [62], adjusting for the five PCs and sex. We restricted our search to variants within 1 megabase (Mb) of the transcription start site (TSS) of each gene. In addition, the adaptive permutations option in FastQTL between 1000 and 10000 permutations. Once we obtained the permutation p-values for all the genes, we accounted for multiple testing to determine the significant cis-eQTLs. We used the Benjamini and Hochberg correction method [63] to calculate the false discovery rate (FDR) in R statistical programming language (R) [59]. For each tissue, all cis-eQTL results were visualized using a Manhattan plot created using the qqman package in R [64].

Functional characterization of lncRNAs using a network-based approach
Using the filtered log2CPM and TMM normalized gene expression data (see construction and content of lncRNAKB: Expression profiling and eQTL analysis), we used the weighted gene co-expression network analysis (WGCNA) approach [34] as implemented in the Co-Expression Modules identification Tool (CEMiTool) package in R [65] to identify modules of lncRNA-mRNA clusters that are co-expressed and therefore likely work in concert to carry out various biological functions. For this, the gene expression data was filtered by log2CPM > 2 in at least 50% of the samples to avoid random correlations between low-expressing genes. The default CEMiTool parameters were used with the following exceptions: (i) Pearson method was used for calculating the correlation coefficients, (ii) the network type used was unsigned, (iii) no filter was used for the expression data, (iv) applied Variance Stabilizing Transformation (VST) and the correlation threshold for merging similar modules were set to 0.90. All the co-expressed modules were subjected to over-representation analysis (ORA) by module based on the hypergeometric test [66]. We used Gene Ontology (GO) pathways [67][68][69] to check for overrepresentation of genes and determined the most significant module functions based on pathways FDR q-value 0.05 [70]. The background set used for the pathway enrichment analysis was genes represented across all GO pathways. To visualize the interactions between the genes in each co-expression module, we selected 25 notable pathways for each tissue. The module adjacency matrices for each of these pathways were filtered based on correlations 0.20 across all genes in each pathway. A JSON file (one per pathway) was created to produce interactive networks using Cytoscape v3.6.0 JavaScript modules [71]. The network files and the module adjacency/correlation matrix files are available for downloading at lncRNAKB.

Architecture of the database
The 3-tier server architecture model containing data, logic and presentation tiers has been implemented as shown in Figure 2. Functional characterization of lncRNAs using a network-based approach). Figure 3 illustrates the results of the lncRNAKB that are discussed in detail in each section below. Table 1 summarizes the number of lncRNAs features (genes n = 77,199, transcripts n = 224,286 and exons n = 611,340) in the lncRNAKB GTF annotation file.

Downloadable, searchable and viewable lncRNA database
Based on the PubMed search and literature review, six databases were chosen to systematically integrate all the lncRNA entries with the goal of providing a comprehensive annotation of lncRNAs (see construction and content of lncRNAKB). Table 2   However, from Figure 4, we observed that the number of non-overlapping genes added from CHESS is 9,595, which indicates that we added non-coding transcripts from overlapping lncRNAs in other annotation databases to the protein-coding genes. 5,295 genes overlapped between all six sources. Supplementary Table 1a and 1b shows the number of transcripts and the sources of annotation databases at gene level for protein-coding genes between CHESS and lncRNAKB, respectively. Comparing these two tables showed that the number of transcript entries for the protein coding genes in lncRNAKB was much higher than that in CHESS

Evaluation and comparison of lncRNA and mRNA conservation scores
In addition to the evaluating the coding potential, the conservation of exonic sequences of the lncRNAs and mRNAs was determined (see construction and content of lncRNAKB) and compared. Figure 9 shows the two box plot distributions of exon sequence conservation scores comparing protein-coding and lncRNAs in the lncRNAKB annotation database. Overall, it shows that exons of the protein-coding genes have higher mean sequence conservation scores compared to exons of the lncRNAs. Additionally, the conservation scores for all the exons will be available for review and download at the respective gene page on the website.

Tissue-specific expression profiling of lncRNAs
To further understand the function of lncRNAs, their gene expression profile across tissues was determined. For this purpose, RNA seq data from 31 solid organ tissues was accessed from GTEx, processed by RNA-seq analysis pipeline, and the gene expression evaluated using the lncRNAKB GTF file (see construction and content of lncRNAKB). Supplementary Table 3 shows the number of RNA-seq samples we analyzed across 31 solid organ human normal tissues from GTEx (n=9,425). Supplementary Table 4 shows the summary statistics of alignment (total number of paired-end reads, total number of uniquely aligned paired-end reads, unique and overall alignment rate) across all samples analyzed by tissue. Supplementary Table 5 shows the summary statistics of quantification (total gene count, total number of uniquely aligned pairedend reads used for quantification, total number of uniquely aligned paired-end reads assigned to genes and proportion of successfully assigned paired-end reads to genes) across all RNA-seq samples analyzed by tissue. Supplementary Figure 2 shows the distribution of uniquely aligned paired-end reads assigned to genes across all samples. Bars highlighted in red show the numbers of samples with < 10 6 reads assigned to genes (n=351) that were excluded from further analysis.
In the website, users can visualize the normalized gene expression levels (TPM) across 31 solid organ human normal tissues by searching for any gene. Figure 5 shows an example box plot distribution of gene NPPB (natriuretic peptide B) for visualization. NPPB had a Tau (overall) and PEM score (top five highest positive tissue-specificity score in the heart tissue) of 1 and 1.49 respectively (see: Evaluating tissue-specificity of lncRNAs). It functions as a cardiac hormone and plays a key role in cardiac homeostasis [72]. A high concentration of this protein in the bloodstream is indicative of heart failure. Even though NPPB is categorized as a PCG, it has three transcript isoforms that are characterized as lncRNAs. Users can download the boxplot of any gene of interest and download genome-wide gene expression matrices (raw counts and TPM) in text format across all 31 solid organ human normal tissues in the website.

Evaluating tissue-specificity of lncRNAs
Using the gene expression results described in the section above, the tissue specificity score of all the lncRNAs was calculated. Two different metrics, Tau and Preferential Expression Measure (PEM), were calculated that both indicates the tissue specificity of the lncRNAs (see construction and content of lncRNAKB). Figure 6 shows the density distribution of tissuespecificity metrics Tau and PEM across protein-coding genes (PCGs) and lncRNAs in the lncRNAKB annotation database as a comparison. The tissue-specificity scores vary from 0 to 1, where 0 means broadly expressed, and 1 is specific. Figure 6A displays average Tau score across all tissues and Figure 6B displays the maximum and normalized specificity value of PEM among all tissues. Overall, Figure 6 shows that lncRNAs have higher tissue-specificity compared to PCGs. For each gene, we provide a distribution of PEM scores across all tissues and all tissuespecific figures and score files are available for viewing and downloading individually on the lncRNAKB website.
The tissue specificity of lncRNAs could also be determined by performing an unsupervised principle component analysis (PCA) (see construction and content of lncRNAKB).
For this analysis, the log transformed TPM lncRNAs expression data across all tissues was used.
Each tissue showed a characteristic transcriptional signature, as revealed by PCA of lncRNA expression. The separation was evident between nonsolid (blood) and solid tissues and, within solid tissues, brain and testis are the most distinct (Fig 7). This finding is an additional confirmation that lncRNAs are tissue-specific.

eQTL analysis of lncRNAs
To add to our understanding of lncRNAs gene expression information, we utilized the gene expression data (described in previous sections) in combination with the whole genome sequencing (WGS) data available at GTEx consortium to identify variants in the genome that can alter gene expression (see construction and content of lncRNAKB). This analysis resulted in identification of a number of variants that significantly alter lncRNA gene expression in tissuespecific manner. Table 4

Functional characterization of lncRNAs using a network-based approach
To further our understanding of lncRNAs, we also undertook WGCNA, a network-based approach that relies on calculating correlation and identifying clusters/modules of correlated between genes (both protein-coding and lncRNAs) (see construction and content of lncRNAKB).
Since correlated genes are predicted to play similar functions in the cells, the pathway enrichment analysis of the correlated clusters/modules can help characterize the functions of lncRNAs in the correlated module. Supplementary Table 6 Table 9 shows the results of all GO pathways enriched in the heart tissue by module. There were several significant pathways identified (q-value <= 0.05) with many of these involved in heart related biological processes. Figure 10 highlights the network figure created using Cytoscape for module M2 identified in the heart tissue. This module is involved in heart-specific processes such as heart growth, development and contraction. The network has 148 genes (34 protein-coding and 106 lncRNAs) after filtering the adjacency matrix with correlations < 0.20 and "heart development" specific pathways/genes. The orange triangles and green circles/nodes represent lncRNAs and mRNAs respectively. The thickness of the edges highlights the correlation between nodes. The relatively strong connections of several lncRNAs to PCGs in this network suggests these could be potentially involved in the same heart development specific biological processes. On the lncRNAKB website, for each tissue, users can view and download the WGCNA results (similar to Supplementary

Discussion and future directions
There is a large volume of transcriptomics data publicly available and currently being produced at an unprecedented rate. Novel transcripts assembly using RNA-seq data is a method that generates thousands of new transcripts that need to be characterized. Several of these novel transcripts have been categorized as lncRNAs. While these data support the presence of lncRNAs in cell and tissue specific manner, bigger questions surrounding the purpose and functionality of lncRNAs in human biology remain. Several lncRNA databases have tried to address some of these questions, however, there is clearly a need to integrate the lncRNA annotation between databases to create a non-redundant list of well-annotated lncRNA entries that could be utilized by biologists to pursue research in this area.
To address this need, we have created the lncRNAKB, a well-structured research tool that delivers valuable data on human lncRNAs, which can be used for data exploration and hypothesis building purpose. Briefly, the lncRNAKB is the end-product of systematic step-wise integration of six widely used lncRNA databases that resulted in a total non-redundant 99,717 genes entries that were accompanied with 530,947 transcripts and 3,513,069 exon entries. All the annotated lncRNAs can be browsed at http://www.lncrnakb.org. This web-resource also provides a comprehensive list of information that researchers can access for every lncRNA entry. This includes viewing and downloading coding potential, conservation score, tissue specific expression information, and tissue specificity score for any gene of interest from the website. In addition to the gene-level information, we have also created a lncRNA body map where we have utilized the gene expression information and created a tissue specific gene expression and network pages that can be browsed by researchers and the information downloaded as per their research needs. This information includes gene expression count and TPM matrix for all the tissues, eQTL results and lncRNA-mRNA co-expression clusters/modules and the pathway enrichment results of respective modules. Put-together, the above-described features presented in the lncRNAKB web-resource will provide a comprehensive set of information that could be used by biologist interested in pursuing research in lncRNAs in their tissue or biological process of interest.

CONCLUSION
We have created the long non-coding RNA knowledgebase (lncRNAKB) by methodically integrating widely used lncRNAs resources. We systematically and carefully combined these resources, created a meticulously crafted non-redundant resource and provide valuable functional information on lncRNAs. The present release of lncRNAKB describes a substantial advance in annotations of the largest number of unique lncRNAs (n=77,199). In addition, we employed the Genotype-Tissue Expression (GTEx) project to provide tissuespecific expression profiles and tissue-specificity scores for these lncRNAs in 31 solid organ human tissues. We also performed Weighted Gene Co-expression Network Analysis (WGCNA) to identify co-expressed lncRNA-mRNA that were then subjected to pathway enrichment analysis to identify meaningful biological processes that lncRNAs could be potentially involved in, providing potential understanding on lncRNAs function. We created dynamic Cytoscape networks for exploration and visualization of each pathway. The lncRNAKB also incorporates coding potential and phylogenetic conservation. Furthermore, using whole genome sequencing data of 652 subjects from GTEx, we calculated expression quantitative trait loci (cis-eQTL) regulated lncRNAs in all tissues. This will provide a strong foundation for integrative traits association analysis by linking a variety of trait related GWAS and our eQTL results. All these features are provided in a user-friendly web interface, available at http://www.lncrnakb.org.    . The xaxis represents the 31 solid organ human normal tissues from GTEx and y-axis is the TPM expression. NPPB was ranked among the top five heart-specific genes Figure 6: Distribution of tissue-specificity scores with data for RNA-seq of 31 solid organ human normal tissues from GTEx across protein-coding genes (PCGs) and lncRNAs in the lncRNAKB as a comparison. The tissue-specificity scores varies from 0 to 1, where 0 means broadly expressed, and 1 is specific. Graph created with density function from R, which computes kernel density estimates A: Average Tau score across all tissues B: Maximum and normalized specificity value of PEM among all tissues    : Cytoscape network for lncRNA-mRNA co-expression Module 2 (M2) in the heart identified using WGCNA. The network was filtered for heart development genes (n=148) and correlations > 0.20. Orange triangles and green circles/nodes represent lncRNAs and PCGs respectively. The density of gray lines/edges represents the strength of the connection between genes Table 1: Summary of lncRNAs annotation databases that have been integrated into the lncRNAKB and the types of data included in each resource * lncRNAs only Table 2: Results of the cumulative stepwise intersection method across the six lncRNAs annotation databases compared to the reference (CHESS) at the gene level