Identification of novel transcripts and peptides in developing murine lens

We previously investigated the transcriptome and proteome profiles of the murine ocular lens at six developmental time points including two embryonic (E15 and E18) and four postnatal time points (P0, P3, P6, and P9). Here, we extend our analyses to identify novel transcripts and peptides in developing mouse lens. We identified a total of 9,707 novel transcripts and 325 novel fusion genes in developing mouse lens. Additionally, we identified 13,281 novel alternative splicing (AS) events in mouse lens including 6,990 exon skipping (ES), 2,447 alternative 3′ splice site (A3SS), 1,900 alternative 5′ splice site (A5SS), 1,771 mutually exclusive exons (MXE), and 173 intron retention (IR). Finally, we integrated our OMIC (Transcriptome and Proteome) datasets identifying 20 novel peptides in mouse lens. All 20 peptides were validated through matching MS/MS spectra of synthetic peptides. To the best of our knowledge, this is the first report integrating OMIC datasets to identify novel peptides in developing murine lens.


Results
Here, we extend our analysis of the RNA-Seq data using multiple bioinformatics tools to identify novel transcripts, fusion genes, and alternative splicing (AS) events in developing mouse lens (Fig. 1). Additionally, we integrate our OMIC (Transcriptome and Proteome) datasets to identify novel peptides in mouse lens and subsequently validate them through matching MS/MS spectra of synthetic peptides (Fig. 1).
First, the raw reads were mapped to the Mus musculus genome, resulting in >93% alignment. Next, the aligned reads were examined for PCR duplicates; ~19% of total reads were identified as PCR duplicates and removed. The remaining mapped reads were processed using StringTie to assemble the RNA-Seq alignments into potential transcripts, and the expression of each transcript was measured and normalized as transcripts per million (TPM). Our analysis identified both annotated and novel transcripts in the mouse lens transcriptome. We divided the novel transcripts further into two categories: first, transcripts that map entirely to unannotated regions of the mouse genome and second, transcripts that align partially to both annotated and unannotated regions of the mouse genome.
We identified a total of 21,265 annotated transcripts expressed in at least one of the six developmental time points (Table 1 & Supplementary Table 1). In addition, we identified 9,707 novel transcripts, present in at least one of the six developmental stages, that map entirely to unannotated regions of the mouse genome (Table 1 & Supplementary Table 2). We also identified 14,113 transcripts that align to both annotated and unannotated regions of the mouse genome (Table 1 & Supplementary Table 3), termed hereafter semi-novel transcripts.
We further investigated our mRNA sequencing data to identify fusion genes expressed in the mouse lens transcriptome. The analysis identified 325 novel fusion genes including 195, 161, 275, 218, 280, and 223 fusion genes in the mouse lens at E15, E18, P0, P3, P6, and P9, respectively (Table 1 & Supplementary Table 4). Gene ontology (GO)-based functional and mammalian phenotype enrichment analysis of the novel fusion genes revealed unique (q-value ≤ 0.01) molecular function, biological process, cellular component, and mammalian phenotype terms (Supplementary Tables 5-6).

Fig. 1. Briefly, HISAT2, a splice aligner tool, was used for the alignment of mouse lens RNA-Seq reads (FASTQ) to the mouse genome, followed by transcript annotation and expression quantification using the StringTie algorithm. In parallel, the aligned RNA-Seq data were further processed using the JAFFA and rMATS algorithms to detect fusion genes and alternative splicing events, respectively, expressed in mouse lens. The novel transcripts (≥1.0 TPM) were analyzed using a proteogenomics approach to identify novel peptides. The novel transcripts were translated into potential open reading frames (ORFs) to generate a reference database. The mouse lens proteome data (MS/MS spectra) were searched against this reference database to identify novel peptides. Finally, the novel peptides were validated through matching MS/MS spectra of synthetic peptides.

Next, we examined our RNA-Seq dataset using the rMATS pipeline (≤0.01 FDR) to identify novel AS events across the six developmental time points in mouse lens. In total, we identified five types of AS events, namely exon skipping (ES), alternative 3′ splice site (A3SS), alternative 5′ splice site (A5SS), mutually exclusive exons (MXE), and intron retention (IR), in developing mouse lens (Table 2). The analysis identified 6,990 novel ES splicing events (≤0.01 FDR) across the six developmental time points in mouse lens (Supplementary Table 7). Of these, 2,023 events were detected in a single developmental time point and 4,967 events in ≥2 time points (Supplementary Table 7).
We identified 2,447 novel A3SS splicing events (≤0.01 FDR) including 809 events detected in a single developmental time point and 1,638 events in ≥2 time points (Supplementary Table 8). Likewise, we identified 1,900 novel A5SS splicing events including 719 events detected in a single developmental time point and 1,181 events in ≥2 time points (Supplementary Table 9). Furthermore, we identified 1,771 novel MXE splicing events (≤0.01 FDR) including 387 events detected in a single developmental time point and 1,384 events in ≥2 time points (Supplementary Table 10). Lastly, our analysis identified 173 IR splicing events (≤0.01 FDR) in mouse lens (Supplementary Table 11).
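The event counts reported above can be cross-checked with simple arithmetic. The following is a minimal sketch that uses only the numbers stated in the text; the variable and function names are illustrative and are not part of the original analysis.

```python
# Counts of novel AS events by type, as reported in the text.
as_event_counts = {
    "ES": 6990,
    "A3SS": 2447,
    "A5SS": 1900,
    "MXE": 1771,
    "IR": 173,
}

# The two sub-counts reported for each of four event types
# (no such split is reported for IR events).
splits = {
    "ES": (2023, 4967),
    "A3SS": (809, 1638),
    "A5SS": (719, 1181),
    "MXE": (387, 1384),
}

# Sum over all five event types.
total_events = sum(as_event_counts.values())


def split_is_consistent(event_type):
    """Return True if the two sub-counts add up to the reported total."""
    first, second = splits[event_type]
    return first + second == as_event_counts[event_type]
```

Summing the five categories gives 13,281 events, and each pair of sub-counts adds up to the total reported for its event type.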
Our RNA-Seq datasets are critical in identifying novel transcripts; however, the biological significance of these events is incomplete without knowing the corresponding changes at the protein level. We recently investigated the proteome profile of developing mouse lens through mass spectrometry-based protein sequencing 24 . We integrated our OMIC datasets to identify novel peptides in mouse lens. As mentioned above, we identified a total of 9,707 novel transcripts, which were translated into three open reading frames (ORFs) to identify all theoretical peptides encoded by the novel transcripts. This theoretical peptide dataset was interrogated against the mouse lens proteome to identify peptides originating from the sequences of the novel transcripts. The analysis identified 55 peptides in the mouse lens proteome based on TMT spectra. All 55 candidate peptides were screened against the mouse non-redundant (nr) protein database (NCBI), and peptides with ≥2 amino acid mismatches and an XCorr score ≥2.5 were considered novel. These criteria identified a total of 20 novel peptides that were retained for further analysis.
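The translation step can be illustrated as follows. This is a minimal sketch, assuming translation of the three forward reading frames with the standard genetic code and treating every stop-to-stop segment as a candidate ORF; the function names and the minimum-length parameter are hypothetical, and the actual software used to build the reference database is not specified here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_frame(seq, frame):
    """Translate a nucleotide sequence starting at offset `frame` (0, 1, or 2)."""
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    # Codons with ambiguous bases are translated as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_orfs(seq, min_len=6):
    """Return (frame, peptide) pairs for stop-free segments of at least min_len residues."""
    orfs = []
    for frame in range(3):
        protein = translate_frame(seq, frame)
        for segment in protein.split("*"):
            if len(segment) >= min_len:
                orfs.append((frame, segment))
    return orfs
```

For example, `translate_frame("ATGGCCTAA", 0)` yields `"MA*"`. Collecting the segments from all novel transcripts would give a theoretical peptide database of the kind searched against the MS/MS spectra.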
All 20 novel peptides along with three control peptides were synthesized commercially, and the respective spectra of these synthetic peptides were generated using the Orbitrap Fusion Lumos Tribrid Mass Spectrometer. The MS/MS fragmentation patterns of the synthetic peptides were manually compared with the MS/MS spectra generated from the proteomic analysis of mouse lens. The control peptides yielded spectra consistent with the MS/MS fragmentation patterns originating from the mouse lens proteome (Supplementary Figs 1-3). The MS/MS fragmentation patterns of all 20 synthetic peptides (representing the 20 novel peptides) exhibited spectra consistent with the MS/MS fragmentation patterns originating from the mouse lens proteome dataset (

Discussion
We previously investigated the mouse lens transcriptome at two embryonic (E15 and E18) and four postnatal (P0, P3, P6, and P9) time points using next-generation RNA sequencing, which identified a total of 14,465 genes along with 12 different classes of non-coding RNAs (ncRNAs) in mouse lens 22,23 . More recently, we completed a comprehensive proteome profile of mouse lens at the same six developmental time points, identifying 5,404 proteins 24 . A brief overview of these published datasets is provided in Table 4. Here, we extend our analyses to identify novel transcripts and novel peptides in developing mouse lens.
Our analysis revealed 9,707 novel transcripts across the six developmental time points (Table 1 and Supplementary Table 2). Of these, ~25% are multi-exonic and ~40% are expressed in all six developmental time points (Supplementary Table 2). In addition to novel transcripts, we identified a total of 14,113 semi-novel transcripts in the lens transcriptome (Table 1 and Supplementary Table 3). In contrast to the novel transcripts, a majority (~75%) of which are single-exon, most of the semi-novel transcripts (>85%) have a multi-exonic structure, i.e. ≥2 exons (Supplementary Tables 2 and 3).
We identified 325 novel fusion genes in developing mouse lens (Table 1). Among these, we identified multiple fusion transcripts for CryαA, CryαB, CryβA1, CryβA2, CryβA4, CryβB1, CryβB2, CryβB3, CryγA, CryγC, CryγD, CryγE, and CryγS (Supplementary Table 4). Additionally, we identified fusion transcripts for Bfsp1, Bfsp2, Tdrd7, Mip, Lim2, Pax6, and Dnase2β in mouse lens (Supplementary Table 4). Multiple studies have reported fusion genes in normal human and mouse tissues and, more importantly, in tumorigenesis 25-28 . We did not find any reports of fusion genes in the ocular lens in the literature; to the best of our knowledge, this is therefore the first such report.
Multiple transcriptome-based studies have revealed extensive AS in normal human and mouse ocular tissues 29-35 . We identified 13,281 novel AS events in developing mouse lens, with a predominant contribution from ES, A3SS, A5SS, and MXE events, whereas only a small number of IR splicing events were identified (Table 2). ES events contribute to the diversity of the transcriptome and, consistent with this notion, we identified a total of 6,990 ES events in developing mouse lens. Interestingly, most of the ES events were identified at embryonic time points (Table 2).
Recently, Srivastava and colleagues reported the identification of 1,241 AS events in developing mouse lens 21 , examining RNA-Seq datasets recently published by our group 22 . We identified a total of 13,281 AS events in developing mouse lens analyzing the same RNA-Seq datasets. The difference between the number of AS events published by Srivastava and colleagues and the number identified in the current study may be attributed to different cut-off values. Srivastava and colleagues used a PSI (Percent Spliced In) cut-off value for significant (<1% FDR) AS events 21 , whereas we used a cut-off value of ≤0.01 FDR for the identification of AS events in developing mouse lens.

Finally, we adopted a proteogenomics approach to identify expression of novel transcripts in the mouse lens proteome and subsequently validated these novel peptides through MS/MS spectra of corresponding synthetic peptides. In silico analyses identified a total of 55 candidate novel peptides in the mouse lens proteome. Of these, only 20 peptides passed the more stringent criteria, i.e. ≥2 amino acid mismatches and an XCorr score ≥2.5. These 20 novel peptides were validated through MS/MS spectra of corresponding synthetic peptides. We are currently investigating the identity of the proteins harboring these novel peptides and examining their biological significance in lens morphogenesis.
In conclusion, we present a comprehensive profile of the developing mouse lens through the identification of novel transcripts, novel fusion genes, and novel AS events. Additionally, we integrate our OMIC datasets to identify novel peptides in developing mouse lens.

Materials and Methods
mRNA sequencing data. The Illumina paired-end mRNA sequencing data (GEO series accession number GSE69221) was used for the downstream bioinformatics analysis. The paired-end raw reads were processed to remove the adapter sequences using SeqPrep (https://github.com/jstjohn/SeqPrep). The quality of the pre-processed reads was evaluated with FastQC (www.bioinformatics.babraham.ac.uk/projects/fastqc) and low-quality reads were removed prior to the downstream analysis.

Mapping and transcript annotation of mRNA sequencing data. HISAT2 (Hierarchical Indexing for Spliced Alignment of Transcripts), a spliced alignment tool (Ver. 2.1.0-beta), was used to map pre-processed reads to the mouse genome (NCBI37/mm9) with default parameter settings 36 . The BAM output files were generated for each sample, and PCR duplicates were marked and removed from the BAM files using Picard software (Ver. 2.8.3; https://github.com/broadinstitute/picard). Subsequently, the StringTie algorithm (Ver. 1.3.3b) was used with default parameter settings to assemble RNA-Seq alignments into annotated and novel transcripts and to estimate their respective expression levels 37 . The expression of these transcripts was normalized as transcripts per million (TPM), and the numbers of known and novel transcripts were estimated from the output GTF file generated by StringTie with an expression threshold of ≥1.0 TPM.
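For reference, the TPM normalization and the ≥1.0 TPM expression threshold can be sketched as below. This follows the textbook definition of TPM (length-normalized read rates rescaled to sum to one million) and is not StringTie's internal implementation; the function names and the example inputs are hypothetical.

```python
def tpm(read_counts, lengths_bp):
    """Compute TPM from parallel lists of read counts and transcript lengths (bp)."""
    if len(read_counts) != len(lengths_bp):
        raise ValueError("read_counts and lengths_bp must have the same length")
    # Reads per kilobase of transcript.
    rates = [count / (length / 1000.0) for count, length in zip(read_counts, lengths_bp)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    # Rescale so that the values sum to one million.
    return [rate / total * 1e6 for rate in rates]


def expressed(tpm_values, threshold=1.0):
    """Flag transcripts meeting the expression threshold (>= 1.0 TPM in the text)."""
    return [value >= threshold for value in tpm_values]
```

Because TPM values sum to one million within each sample, a fixed threshold such as ≥1.0 TPM refers to the same relative abundance in every sample.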
Identification of AS events. AS events were identified using rMATS software (Ver. 3.2.5) 38 . The mRNA-Seq alignment files (BAM files) generated by HISAT2 were used as input for the rMATS analysis. The Mus musculus RefSeq gene annotations (GRCm38/mm10) were used as a reference with default parameter settings. Finally, rMATS was used to calculate the p-value and false discovery rate (FDR) for AS events among different developmental time points.
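To illustrate how an FDR cut-off of ≤0.01 is applied to a set of p-values, a minimal sketch follows. It assumes the Benjamini-Hochberg adjustment, which is an assumption made here for illustration because the text does not state which FDR procedure was applied; the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Under this assumption, an event would be retained when its adjusted value is ≤0.01.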
Identification of fusion genes. JAFFA (Ver. 1.08), a multi-step pipeline, was used in hybrid mode to detect fusion genes in the mRNA sequencing data 39 . The Mus musculus reference genome (GRCm38/mm10), [...]

Gene ontologies functional enrichment analysis. Functional annotation analysis of mouse lens genes was performed using Visual Annotation Display (VLAD; Ver. 1.6.0), a web-based tool from the Mouse Genome Informatics (MGI) 40 . The VLAD tool performs statistical analysis to test the enrichment of gene ontology (GO) terms based on their annotations to gene function and mammalian phenotype 40 . A complete set of mouse genes was used as a reference annotation dataset, and ontological terms annotated with the evidence code ND (no biological data) were excluded from the enrichment analysis. The statistically significant enriched terms were sorted based on their corrected p-value (≤0.01) calculated using multiple testing and positive false discovery rate for each term.

[...] shown with a similar fragmentation pattern observed from the corresponding synthetic peptide. Note: the term "peptide identified from sample" refers to the MS/MS spectra identified in mouse lens proteome 24 , and the synthetic peptide refers to MS/MS spectra of the peptide synthesized by JPT Peptide Technologies (Berlin, Germany).

[...] (Ver. 2.1; Thermo Scientific, Bremen, Germany). The parameters for Proteome Discoverer included trypsin as the proteolytic enzyme with a maximum of two missed cleavages, acetylation of protein N-termini and oxidation of methionine as variable modifications, carbamidomethylation of cysteine as a fixed modification, a minimum peptide length of 6 amino acids, and mass tolerances of 10 ppm and 0.02 Da for precursor and fragment ions, respectively. The matched spectra or candidate peptides were filtered using the Percolator algorithm within the Proteome Discoverer suite using a false discovery rate (FDR < 0.01) at the protein and peptide levels.
Finally, all peptides identified above (by interrogating the MS/MS spectra from the mouse lens proteome against the reference database) were screened against the mouse nr protein database (NCBI) to identify novel peptides not present in that database. Peptides with ≥2 amino acid mismatches relative to the mouse nr protein database and an XCorr score ≥2.5 were considered novel and retained for further analysis.
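The novelty criteria can be expressed as a simple filter. The sketch below is illustrative only: it assumes an ungapped comparison of each peptide against every same-length window of the reference proteins, which is a simplification and not necessarily how the nr database screen was performed, and all function and parameter names are hypothetical.

```python
def min_mismatches(peptide, reference_proteins):
    """Smallest number of mismatches between the peptide and any same-length window."""
    n = len(peptide)
    best = n  # Worst case: every residue differs, or no window is long enough.
    for protein in reference_proteins:
        for start in range(len(protein) - n + 1):
            window = protein[start:start + n]
            mismatches = sum(a != b for a, b in zip(peptide, window))
            best = min(best, mismatches)
    return best


def is_novel(peptide, xcorr, reference_proteins, min_mismatch=2, min_xcorr=2.5):
    """Apply both criteria from the text: >=2 mismatches and XCorr >= 2.5."""
    if xcorr < min_xcorr:
        return False
    return min_mismatches(peptide, reference_proteins) >= min_mismatch
```

With a toy reference of `["MAKFGPLLK"]`, the peptide `AKWWPL` (two mismatches to the window `AKFGPL`) with an XCorr of 3.0 passes, whereas `AKFWPL` (one mismatch) does not.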

Validation of novel peptides through LC-MS/MS analysis of synthetic peptides. All synthetic peptides were purchased from JPT Peptide Technologies (Berlin, Germany). The synthetic peptides were pooled and labeled with 1-plex TMT reagents according to the manufacturer's instructions (Thermo Fisher Scientific). The labeling reaction was performed for one hour at room temperature and then quenched with 100 mM Tris-HCl (pH 8.0). The labeled peptides were desalted with C18 Sep-Pak (Waters Corporation, Milford, MA), dried, and resuspended in 0.1% formic acid. Peptides were analyzed on an Orbitrap Fusion Lumos Tribrid Mass Spectrometer coupled with the Easy-nLC 1200 nano-flow liquid chromatography system (Thermo Fisher Scientific) using parameters similar to those used for the mouse lens proteome profiling 24 .