Structure prediction of novel isoforms from uveal melanoma by AlphaFold

Alternative splicing is an important mechanism that enhances protein functional diversity. To date, our understanding of alternative splicing variants has been based on mRNA transcript data, but due to the difficulty in predicting protein structures, protein tertiary structures have been largely unexplored. However, with the release of AlphaFold, which predicts three-dimensional models of proteins, this challenge is rapidly being overcome. Here, we present a dataset of 315 predicted structures of abnormal isoforms in 18 uveal melanoma patients based on second- and third-generation transcriptome-sequencing data. This information comprises a high-quality set of structural data on recurrent aberrant isoforms that can be used in multiple types of studies, from those aimed at revealing potential therapeutic targets to those aimed at recognizing of cancer neoantigens at the atomic level.


Background & Summary
Alternative splicing (AS) can influence transcriptome and proteome diversity, as evidence shows that approximately 95% of genes with multiple exons produce multiple isoforms 1,2 .Therefore, it is not surprising that the gene isoforms play important roles in many biological processes, such as processes related to development, pluripotency and apoptosis [3][4][5] .Aberrant isoforms have been implicated in multiple human tumors, including uveal melanoma (UM), showing extensive changes via alternative splicing and the expression of critical gene isoforms [6][7][8] .Specific splicing isoforms are important for the initiation, metastasis and drug resistance of cancer, and some AS events have been shown to be significantly related to patient survival [9][10][11] .Although the role of a few splicing isoforms in cancer has been studied, 3D protein structure prediction on a scale that covers the transcriptome and can be used for evaluating biological functionality remains unexplored.
The suitability of short-read mRNA sequencing (short-read RNA-seq) in the discovery of AS events is limited because of the mapping uncertainty of short read lengths or assembly problems 12 .Long-read mRNA sequencing (long-read RNA-seq) shows advantages over short-read RNA-seq in isoform detection because long reads directly cover the entire transcript without the need of reconstruction, which is needed for short reads [13][14][15] .However, because of the high sequencing error rate (~15%) of raw long-read RNA-seq data, it is still challenging to determine the precise splicing sites with only long-read RNA-seq data 16 .Hybrid sequencing, combining long-read RNA-seq reads with high quality short-read RNA-seq reads and taking advantage of both platforms, improves the identification of AS events and gene isoforms 17,18 .Although great efforts have been made to study the alternative splicing mechanisms and functions of different isoforms, our knowledge of the 3D structure of splicing isoforms is very limited 19,20 .This lack of information means that for a large majority of spliced isoforms, no documented structures have been deposited in the Protein Data Bank (PDB), causing a large knowledge gap; hence, accurate prediction of protein structure is one of the most challenging goals in biology 21,22 .
As structures carry vital information about how different isoforms with a certain degree of sequence homology perform different functions, it is necessary to investigate the 3D structure of abnormal isoforms to explore their functions 23 .The most recent achievement in related technology, AlphaFold, a deep-learning-based approach, has been proven to be highly successful in predicting the 3D structures of proteins based on their amino acid sequences 24 .This is a significant advance that might have a profound impact on the study of protein dysfunction and the discovery of new polypeptides with potential medical applications 25 .
In this study, we provide an information resource based on the predicted structures of 315 novel isoforms obtained by long-read RNA-seq and short-read RNA-seq of transcriptome data from 18 UM patients.To better understand the structural differences of abnormal isoforms and the potential effects, we compared the structural differences between 295 abnormal-gene-encoded isoforms and their normal gene-encoded protein counterparts.We also identified 13 potential AS-derived neoantigens in 10 abnormal isoforms with altered amino acid sequences.These data constitute particularly valuable information on aberrant isoform structures that intersects with that on abnormal isoforms in other datasets, which can be used for an investigation into the roles of these isoforms in multiple cancer types.This study also offers new insights into the structure-based prediction of neoantigens and potential drug targets.

Methods patient samples.
A total of 20 patients with primary UM who visited Shanghai Ninth People's Hospital between 2018 and 2021 were selected for sampling.The detail information of the 20 UM patients is listed in Table 1.The process of sample collection adhered to the tenets of the Declaration of Helsinki and was approved by the Ethics Committee of Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine (SH9H-2021-T82-1).All patient samples were donated freely, with written informed consent and with the full cooperation of each patient.We have confirmed that in our ethics statement, patients consent to the disclosure of genomic data.The critical exclusion criteria include previous treatment of chemotherapy or radiotherapy.Sterilized instruments collected the samples, and immediately underwent snap-frozen in liquid nitrogen and stored at a temperature below −80 °C.illumina sequencing.Total RNA was isolated using Trizol Reagent (Invitrogen Life Technologies), then the concentration, quality and integrity of RNA were determined by NanoDrop spectrophotometer (Thermo Scientific).Three micrograms of RNA were used as input material for the RNA sample preparations.Sequencing libraries were generated which was then sequenced on Illumina NovaSeq.6000 platform by Shanghai Personal Biotechnology Cp. Ltd.

Nanopore sequencing.
Total RNA was isolated using the Trizol Reagent (Invitrogen Life Technologies), and the concentration, quality and integrity were determined by NanoDrop spectrophotometer (Thermo Scientific).A total of 1ug RNA was prepared for cDNA libraries using the cDNA-PCR Sequencing Kit (SQK-PCS109) according to the instructions of Nanopore Technologies (ONT).Defined PCR adaptors were directly added to both ends of the first-strand cDNA by reverse transcriptase.After 14-cycle of PCR by LongAmp Tag (NEB), the PCR products were subjected to ONT adaptor ligation using T4 DNA ligase (NEB).Agencourt XP beads were used for DNA purification according to ONT protocol.The final cDNA libraries were added to FLO-MIN109 Hybrid-sequencing strategy.Hybrid error correction, a simple and cost-effective approach involved with high quality short-read RNA-seq data, was used to improve the quality of long reads (Fig. 1).Here we used LoRDEC to correct the errors of full-length (FL) sequences 17 .LoRDEC is a new and efficient hybrid correction algorithm based on De Bruijn Graphs (DBG) of short reads.Achieving a comparable accuracy, LoRDEC runs six times faster and requires 93% less memory than PacBioToCA and LSC.LoRDEC first reads the short reads, builds their DBG of order k and then corrects each long read one after the other independently.
pinfish pipeline.The corrected FL reads for each sample were aligned to hg38 using minimap2 with the command 'minimap2 -ax splice' 26 .Spliced_bam2gff was used to convert sorted BAM files with spliced alignments (from minimap2) into GFF2 format.With sorted GFF2 file as input, based on the median of exon boundaries from all transcripts in the cluster, cluster_gff clusters reads with similar exon/intron structures into a rough consensus set of clusters.Then, by mapping all reads to the median length of read within each cluster generated by cluster_gff, polish_clusters creates an error corrected read and polishes it using racon 27 .Finally, taking polished and consistent transcripts as input, collapse_partials filters transcripts which are likely caused by 5′ end degradation and collapses input transcripts into a polished and collapsed transcripts set of each UM case (Fig. 1).
Fig. 1 The overall workflow of this study.
Flair pipeline.Flair_align aligns FL reads of each sample to hg38 using minimap2 and converts the SAM output to BED format.Flair_correct corrects mis-aligned splice sites with genome annotations and short-read splice junctions generated by STAR 28 .Finally, flair_collapse defines high-confidence transcript sets from corrected long reads 14 (Fig. 1).

Novel isoform detection.
We filtered out transcripts supported by less than two FL reads.With the cuffcompare tool in the Cufflinks package 29 , we compared the high-confidence isoforms output by flair and pinfish pipelines with the "RefSeq" gene annotation file, respectively.Cuffcompare explores the structure of each isoform, and matches reference transcripts that agrees on the coordinates and orders of all their exons, as well as strand.Isoforms set were further classified into eight groups based on their exon structures (splicing junctions) after the cuffcompare process.Isoform labeled by " = " and "j" tags in the output ".tracking" file was considered as an annotated and unannotated (novel) isoform, respectively.We got a median of 8,989 annotated and a median of 9,150 novel isoform candidates based on pinfish pipeline (Table 2), with a median of 11,117 annotated and a median of 12,366 novel isoform candidates from flair pipeline (Table 2).Finally, we defined novel isoforms as those were identified as novel isoform candidates in both flair and pinfish pipelines of each case.In order to further define high-confidence and recurrent novel isoform set for further analysis, we only remained 315 novel isoforms which have been identified in more than 10 UM cases (Fig. 1).We then performed GO enrichment analysis and found regulation of translation initiation, elongation and termination related pathways were significantly enriched in these isoform related genes (Fig. 2).
alphafold structure prediction.For novel isoforms, we first extracted corresponding DNA sequences from the human reference genome (hg38) using "samtools faidx" based on the coordinates and orders of all their exons, as well as the strand 30 .The tool of ORFfinder (https://www.ncbi.nlm.nih.gov/orffinder/) was subsequently employed to search for open reading frames (ORFs) 31 .The 3D structure was predicted using AlphaFold-Multimer version 2.2.0 using Shanghai Jiao Tong University's supercomputing resources (more specifically, one NVIDIA Volta V100 GPUs with 32GB graphics processing unit (GPU) memory) 24 .The version and parameters of AlphaFold-Multimer databases used were outlined as below:  --db_preset = full_dbs --output_dir = output --fasta_paths = input.fasta"TM-score calculation for structural comparison.Typical structure files were downloaded from Uniprot (https://www.uniprot.org)according to the priority order of EM, NMR, X-ray and alphafold predicted sources of structures.We used TM-score (https://zhanggroup.org/TM-score) to compare the predictive results with typical structures 32 .Protein pairs with a TM-score > 0.5 are mostly in the same fold while those with a TM-score < 0.5 are mainly not in the same fold, and some of them with a TM-score < 0.17 just have random structural similarity 33 .
We then check the distribution of comparison scores of novel isoforms based on the gene ontology enrichment results above (Fig. 3).As only 32 pairs (32/295) of both ratios of aligned length to novel protein and aligned length to canonical protein are bigger than 0.9, which suggested that most of novel isoforms have random or low structures similarities with their canonical proteins.TM-score results of each paired comparison were deposited at figshare 34 .

Neoantigens prediction.
For predicting AS-derived neoantigens, we first used a custom script to extract the unannotated splicing site information from all novel isoforms and defined such splicing junctions as neojunctions.
Based on neojunction loci, we obtained all polypeptides generated by a neojunction.We used a custom script to extract all possible 9-amino acid sequences from these polypeptides.NetMHCPan was used to perform MHC-I binding affinity prediction.NetMHCpan methods inform if a sequence is a strong MHC binder if the % Rank is below the specified threshold (0.5%), and define the peptide as a weak binder if the % Rank is above the threshold of the strong binders but below the specified threshold (2%) 35 .Peptides with strong binding or weak binding affinity were defined as neoantigens (Fig. 4).Finally, we got 13 potential AS-derived neoantigens from 10 highly recurrent abnormal isoforms (present in at least 10 UM cases) due to changes in amino acid sequences (Table 3).

Data Records
The datasets presented here have been stored at GEO under GSE206464 36 .Our cohort includes a total of 20 cases, of which 20 cases have short-read RNA-seq data and 18 cases have long-read RNA-seq data.This study is based on the 18 cases with both short-read and long-read RNA-seq data.AlphaFold-Multimer predicted structures files are accessible at figshare 37 .Each folder, whose name consists of gene name and isoform ID, corresponds to each isoform structure files (for example, "AAMDC_TCONS_00022447", which means isoform "TCONS_00022447" of gene AAMDC).Each folder contains multiple format text files which represent the predicted structures information.Among all predicted structures files, "ranked_0.pdb"file has the highest confidence.
Fig.  We observed an average of 7.05 million FL reads (accounting for 87.49% of total long reads) with an average read quality of 10.78 (ranging from 9.9 to 12.8), confirming the high confidence in the quality of sequencing data (Table 4).

Fig. 3 Fig. 4
Fig. 3 Structure comparison of novel isoforms with typical proteins.

Table 1 .
The detail information of UM patients.flow cells and were sequenced on the PromethION platform.GUPPY (version 3.2.6)was used for basecalling to convert the fast5 format data to fastq format.

Table 2 .
Statistics of annotated and novel isoforms among 18 UM cases.

2
Gene ontology enrichment of genes with novel isoforms.
technical Validationsequencing quality of long-read RNA-seq data.Pychopper package (https://github.com/nanoporetech/pychopper)wasusedto identify, orient and rescue FL cDNA reads.The number of total long reads ranged from 4,788,440 to 14,048,314 in which the FL reads were between 4,112,595 and 12,828,344 (Table4).

Table 3 .
Sequences, binding scores and levels of neoantigens.

Table 4 .
Sequencing statistics of 18 UM cases.