Revealing Missing Human Protein Isoforms Based on Ab Initio Prediction, RNA-seq and Proteomics

Biological and biomedical research relies on comprehensive understanding of protein-coding transcripts. However, the total number of human proteins is still unknown due to the prevalence of alternative splicing. In this paper, we detected 31,566 novel transcripts with coding potential by filtering our ab initio predictions with 50 RNA-seq datasets from diverse tissues/cell lines. PCR followed by MiSeq sequencing showed that at least 84.1% of these predicted novel splice sites could be validated. In contrast to known transcripts, the expression of these novel transcripts were highly tissue-specific. Based on these novel transcripts, at least 36 novel proteins were detected from shotgun proteomics data of 41 breast samples. We also showed L1 retrotransposons have a more significant impact on the origin of new transcripts/genes than previously thought. Furthermore, we found that alternative splicing is extraordinarily widespread for genes involved in specific biological functions like protein binding, nucleoside binding, neuron projection, membrane organization and cell adhesion. In the end, the total number of human transcripts with protein-coding potential was estimated to be at least 204,950.


Introduction
In this Supplementary material we give technical details regarding the detection, validation and characterization of novel human transcripts. First, we introduce ALTSCAN system in detail. Next, we give additional information for validation and consequent analysis described in the manuscript. Finally, we showed how the data mentioned in this paper would be accessed.

ALTSCAN
ALTSCAN (ALTernative splicing SCANner) is developed to find all possible exon-intron structure for coding genes. ALTSCAN incorporates transcriptional, translational, and splicing signals, together with length distributions of exons, introns and intergenic regions to predict multiple transcripts for each gene. When calculating the probability of a transcript, an extended Viterbi algorithm was utilized. Here we first presented details of ALTSCAN algorithms ( Figure S1), and then showed the processes of training and prediction. Figure S1. ALTSCAN algorithm. ALTSCAN utilizes an extended Viterbi algorithm. Viterbi algorithm, which is a variation of dynamic programming, finds the most probable path in HMM/GHMM. A "trellis" is used to keep all the values in dynamic programming. A shows the "trellis" used in a traditional Viterbi algorithm. B shows the "trellis" used in ALTSCAN. A cell keeps an array of values instead of a single value, as shown in the enlarged view. The maximum top N values will be calculated from arrays of values and top N pointers will be kept. Finally, the most probable N paths will be reconstructed by tracing back. 3

Algorithm
Generalized Hidden Markov Model (GHMM, also called Hidden Semi-Markov Model) has been utilized in gene structure prediction since 1997 1

. Compared with
Hidden Markov Model (HMM), whose states follow geometric length distribution, GHMM supports general length distributions for its states. This is necessary for human gene structure prediction because of the non-geometric length distribution of its exons. On the other hand, the GHMM involves a more complex algorithm, requiring a recursion which searches back at each position over all possible previous positions. If the length of the sequence is L and the state number is N, then the computing complexity will be O(N 2 L 2 ). Obviously, the algorithm must be simplified due to the large size of L to make this approach practical. Burge proposed a simplified method 1 based on several assumptions: 1) States are grouped into 2 classes, coding states and non-coding states; 2) Coding states have durations less than D, and will be still treated using general length distribution; 3) Non-coding states follows geometric length distributions and sequence generating models which are "factorable", for instance, the emission probability E for type i to generate sequence s can be described as , where a, b and c are sequence positions and s x,y represents the sequence segment of sequence s from position x to y.
For HMM/GHMM model, the optimization problem is efficiently solved by the Viterbi algorithm. We first describe the problem here. For sequence , the model will give a "label" from the "label" space Ω to each position of the sequence. The label sequence can be described as , where . The task is to find the "label" sequence to maximize the joint probability Pr(q, s). Before describing the optimization algorithm, we'll introduce some preliminaries: (1) States of the model , P -, A -, Utr5 -, Utr3 -, I p0 -, I p1 -, I p2 -}.
The states are described using forms as Xy a s : X represents the types (E/exon, N/intergenic region, P/promoter, I/intron, A/polyA).
a represents the "phase" (p0/phase 0, p1/phase 1, p2/phase 2). Phase of coding state is defined as the number of base(s) at the head of the exon which is integrated with base(s) from adjacent upstream exon to form a "codon" when translated to protein. Phase of non-coding state is defined as the same phase as the upstream exon's.
s represents the strands, plus strand or minus strand. All the states have a strand mark except the state N (intergenic region).
( On the other hand, the emission probabilities of non-coding states, Ψ , are modeled using a homogeneous fifth-order Markov matrix. As described in the simplified method above, since state j ( Ψ ) follows a geometric length distribution, the probability of state j to generate the sequence segment differs from that of generating by a factor, , independent of a, which is the basis that we can treat non-coding states as in standard HMM in Viterbi algorithm.
(5) Initial probabilities We force sequence start from non-coding states, therefore the initial probabilities of coding states are set as a non-zero value MIN_VALUE. The emission probability of non-coding state j is represented as Ψ . 6 Here, we first introduce our algorithm to predict the best gene structures (Top 1 Viterbi algorithm) based on this simplified method, and then introduce the extended algorithm (Top N Viterbi algorithm) with ability to predict the top N gene structures. Finally, tracing back from the maximum probability from , , we can obtain an optimal gene structure. 7

2) ALTSCAN Top N Viterbi algorithm
In order to find the top N probable gene structures, we need to keep the best N probabilities at each cell of the computation matrix. It is to say, the computation matrix is enlarged to and variables are defined as the joint probability of the t th optimal parse of the subsequence which ends in state n at position m. These variables can be calculated recursively as follows: Initialization: Induction: Termination: Finally, tracing back from the maximum N probability from , , we can obtain top N optimal gene structures.

Training and prediction
Refseq annotation was cleaned by removing genes with in frame stop codons with

Evaluation
Refseq genes and GENCODE genes were combined as KNOWN dataset, which was used as a gold standard. Transcripts were merged to genes if they overlapped and located on the same strand. KNOWN  protein-coding gene structures, so we didn't evaluate its specificity.
As described above, ALTSCAN used an extended multi-layer Viterbi algorithm. Here we showed this extended method indeed helped to find more transcripts. To see how much we were benefited from the predicted suboptimal structures, the KNOWN dataset was used as a gold standard. As shown in Figure S2, with a traditional Viterbi

Deficiency
ALTSCAN was used to find transcripts as many as possible, while it had several deficiencies. First, we applied ALTSCAN to candidate gene regions instead of the whole genome/chromosomes. Although we considered known protein-coding gene regions, predicted gene regions and regions only having EST supports, we might still miss potential protein-coding gene regions. Second, N=100 or even N=250 might not be deep enough. As shown in Figure S2, there were still 14 transcripts found at each depth from 150 to 250, and this number didn't seem to go down. In addition, ALTSCAN didn't take rare splice signal, like GC-AG, into consideration. These deficiencies hindered ALTSCAN's exhausting protein-coding gene structures.

Dataset construction
In order to construct a comprehensive transcript dataset, we first collected transcripts from the GENCODE and RefSeq database. Then GENCODE and RefSeq transcripts 11 were combined and filtered. Transcripts sharing the same coding regions, having internal stop codons or short introns (<20bp) were removed. After filtering, the refined dataset was named KNOWN dataset, which represented all currently known transcripts. Then ALTSCAN dataset and KNOWN dataset were merged with redundant ones removed to form a MIXTURE dataset.

RNA-seq data information
We collected 50 RNA-seq runs from the Illumina Human BodyMap2 project and ENCODE project. They were utilized to validate transcripts in the MIXTURE dataset (Table S1). Different runs of a biological sample were merged together as a single dataset. Finally, we obtained 26 datasets, which could be classified into to 3 groups: GROUP I, data sequenced from a single tissue from the Illumina Human BodyMap2 project; GROUP II, data sequenced from a mixture of 16 tissues from Illumina Human BodyMap2 project; and GROUP III, data from ENCODE project.

Validation strategies
Quality control of RNA-seq data were automatically processed using the NGSQC program with default parameters 14 . Coding sequences from MIXTURE transcripts were extracted with 100nts upstream start codons and 100nts downstream stop codons.
These coding fragments formed the mature transcript dataset. High quality reads were mapped to mature transcript dataset using Bowtie 15 with the option "-l 28 -n 2 -a --best".
We created a pipeline to validate candidate transcripts. For each transcript, we checked the coverage of all bases (full transcript coverage) and the coverage of all splice junctions (junction coverage). A base was covered if there was at least one read mapped to the position. We focused on validation of the splice sites of these transcripts. This wasn't to say transcription was not important. To balance this, we give the rule that all the bases of a transcript should be covered at least once, which was quite loose, but effectively showed the transcription. A splicing junction site was covered if and only if at least M read(s) covered both sides of the adjacent exons with no less than L nt(s). Therefore, it was possible that a transcript was fully covered but not all the junctions were covered and vice versa. We developed multiple levels of validation criteria. For all transcripts (both single exon transcripts and multi-exon transcripts), full transcript coverage should be 100% in order to be called validated.
For multi-exon transcripts the junction coverage should be 100% as well. In this step, two strategies (the standard strategy and the stringent strategy) were used to check the junction coverage. In standard strategy, M was set to be 1 and L was set to be 10. In the stringent strategy, values of M and L depended on the RNA-seq data groups. We described how this was done in following parameter selection part.
In addition, additional rules were used to validate novel transcripts (in ALTSCAN but not in KNOWN dataset) in three levels, i.e. VHC (validation with high confidence) VMC (validation with median confidence) and VLC (validation with low confidence).
Multi-exon transcripts validated in stringent strategy with one or more novel internal splice junction sites (NIJ filter, see following and Figure S3 for details) were described as VHC transcripts; multi-exon transcripts validated in standard strategy with one or more novel internal splice junction sites were described as VMC transcripts; and multi-exon transcripts validated in standard strategy without novel internal splice junction sites, together with validated single-exon transcripts were described as VLC transcripts.
As shown in Figure S3, there were two situations that validated transcripts had no novel internal splice junctions. The first situation was an alternative translational start sites and the second situation was an intron retention in the KNOWN transcripts.
These transcripts were also put into the VLC datasets. The remained transcripts with novel internal junctions would be put into the VHC transcripts.

Parameter selection
Here we showed how we selected parameter M and L for stringent strategy. The sequencing depth and read length may impact on the selection of M and L. Therefore different datasets might need different parameters. Consider that different datasets from the same data group had the same sequencing lengths and the data size was quite 16 close, and to be comparable between validations from different datasets, we decided using the same parameters for datasets from a same group. validation with GROUP Ⅲ datasets ( Figure S4).

Validation landscape
We validated MIXTURE transcripts with the pipeline and parameters described above.
The numbers of validated transcripts from each dataset were shown in Table S2. By comparing validation of KNOWN and novel transcripts (Table S2), we showed that novel transcripts have much higher levels of tissue-specific expression than KNOWN transcripts do (Table S3).

PCR validation
We randomly designed primers flanking novel splice sites for 88 VMC transcripts using Primer3. We ensured that primers were uniquely (at least 2 base mismatches for suboptimal alignments) mapped to human transcriptome or genome (hg19). We also ensured the product sizes were from 250 bps to 500 bps. Primer sequences were shown in Table S4. 88 primers flanking novel splice sites and 8 primers flanking known splice sites from house-keeping genes were selected (Supplemental Table S4).  The sample was analysed on the Agilent Bioanalyser to ensure that primers were successfully removed from the sample, and diluted to a final concentration of 2nM.

Real-time PCR experiment
The sample was then denatured by incubating with 0.2 N NaOH for 5 minutes at room temperature. The denatured sample was further diluted to 10pM in pre-chilled HT1 (Hybridization Buffer provided in the MiSeq reagent kit) and spiked with 1-2% Illumina PhiX control (Catalog # FC-110-3001) before loading on the MiSeq.  Figure S5 A and B). In addition, 154 VMC transcripts from 128 loci (including 40 transcripts from 32 loci) overlapped partially with L1 elements ( Figure S5 C and D).
Only 15 transcripts from 10 genes didn't overlap with L1 elements. Details were shown in Figure S6.  We introduced 11,549 confident transcripts (after pseudo-transcripts were removed from the 11,722 VHC transcripts) and the majority came from known genes as we described above. Therefore we found more alternative transcripts for known genes. In addition, there are still 5,166 (29.9%) multi-exon genes with only one transcript in KNOWN dataset, and we found an alternative isoform for 488 of these 5,166 transcripts. We also checked the novel splice events (see Detections of AS events part for details) and found that cassette exons, alternative translation stop sites and alternative translation start sites contributed most (Table S6). Other events also involve a group of (at least 2) splice sites (see Supplemental materials for details).

Detection of AS events
Alternative splicing events are classified into seven categories: (1)  (1) Exon skipping (cassette exons) A skipped exon (cassette exon) is defined as an exon that can exist or skip in the mature mRNAs. To detect this event, we compare every two transcript from a gene.
Only when a target exon exists in one transcript and disappear in another, and both the upstream and downstream exons are exactly the same, we call it a cassette exon. As shown in Figure S7A, there are 5 transcripts in this gene. When comparing Transcript 1 with Transcript 2, the 2nd exon in Transcript 1 is captured. When Transcript 1 is compared with Transcript 5, the 2nd exon in Transcript 1 will not be captured, because the upstream exon is different in the acceptor site.
(2) Mutually exclusive exons 32 Exclusive exons are defined as a group of adjacent exons that only exact one can exist in one mature mRNA. First, we compare every two transcripts from a gene. As shown in Figure S7A, only the situation of Transcript 2 and 3 is captured. The 2nd exon from Transcript 2 and the 2nd exon from Transcript 3 are qualified as exclusive exon candidates. As in exon skipping detection, the upstream and downstream exons must be exactly the same. Then we'll go over all transcripts again, to see if these 2 exons satisfy the definition. For example, when checking Transcript 1, these 2 exons exist at the same time. Therefore, these candidates are abandoned.
(3) Alternative 5' or 3' splice sites (alternative donor or acceptor sites) Two exons in each transcript must have a same splice site and a pair of different splice sites, then the different ones are defined as alternative donor or acceptor sites depending on their location. As shown in Figure S7A, when comparing Transcript 1 and Transcript 2, we capture a pair of acceptor sites, and when comparing Transcript 1 and Transcript 3, we capture a pair of donor sites. But when comparing Transcript 1 and Transcript 4, we capture nothing, because the downstream exons are not the same.
In a similar way, when comparing Transcript 1 and Transcript 5, we capture nothing.
(4) Intron retention Intron retention is defined as an intron that can appear in the mature mRNA. As shown in Figure S7B, when comparing Transcript 1 and Transcript 5, we capture one case at exon 2 of transcript 1. Of course, the donor site of upstream exon and acceptor site of downstream exon must be exactly the same as those of the exon in another transcript.
(5) Alternative translational start or stop site It is just as its literal meaning, if two transcripts have different start or stop codons, these events will be reported. Figure S7. Illustrations in AS event detection. A and B showed two genes (groups of transcripts) to help explain how we detected AS events.

GO analysis
A background function distribution is important for GO analysis. We checked the distribution of GO functions in KNOWN datasets and KNOWN+VHC or KNOWN+VMC datasets. Coding sequences of transcripts were first translated into protein sequences, then the protein sequences were aligned to the GO database using the NCBI blastp 2.26+. Only those alignments with E value ≤ 1e-5, identity≥30% and coverage of the query sequence ≥25% would be considered as matches. Sequences were first aligned to human proteins in GO database, and sequences failed to align to human proteins were then aligned to proteins of other species.
The comparison of distributions of GO functions were shown in Figure S8.

Enrichment analysis
We checked if AS was enriched in some specific pathways or functions using KEGG 21 and GO annotation. KNOWN genes and ALTSCAN VMC genes were merged and then ranked by their transcript number. Enrichment analysis for the top quarter genes with the higheast number of transcripts (transcript number >5) were carried out with DAVID 22 . No significant enrichment were found in KEGG pathways. Enriched GO functions (adjusted p value with benjamini-hochberg method < 0.05) were shown in Table S7.