Nanopore sequencing reveals TACC2 locus complexity and diversity of isoforms transcribed from an intronic promoter

Gene expression is controlled at the transcriptional and post-transcriptional levels. The TACC2 gene was known to be associated with tumors but the control of its expression is unclear. We have reported that activity of the intronic promoter p10 of TACC2 in primary lesion of endometrial cancer is indicative of lymph node metastasis among a low-risk patient group. Here, we analyze the intronic promoter derived isoforms in JHUEM-1 endometrial cancer cells, and primary tissues of endometrial cancers and normal endometrium. Full-length cDNA amplicons are produced by long-range PCR and subjected to nanopore sequencing followed by computational error correction. We identify 16 stable, 4 variable, and 9 rare exons including 3 novel exons validated independently. All variable and rare exons reside N-terminally of the TACC domain and contribute to isoform variety. We found 240 isoforms as high-confidence, supported by more than 20 reads. The large number of isoforms produced from one minor promoter indicates the post-transcriptional complexity coupled with transcription at the TACC2 locus in cancer and normal cells.


Material and methods
All experiments on human subjects were conducted in accordance with the declaration of Helsinki.
JHUEM-1 cell line culture and RNA preparation. The endometrial cancer cell line JHUEM-1 was provided by the RIKEN BRC through the National Bio-Resource Project of the MEXT, Japan. JHUEM-1 cells were cultured in Dulbecco's modified Eagle's medium /HamF12 (Sigma-Aldrich Corp., St. Louis, MO, USA) supplemented with 15% fetal bovine serum and 1% penicillin-streptomycin solution. The cells were plated in 6-cm dishes at 10 5 per dish and incubated for 6 days.
The culture medium was removed by aspiration, the cells were washed with phosphate-buffered saline (PBS), which was removed by aspiration and replaced with PBS containing 0.2% trypsin. Cells were transferred to an RNase-free tube, medium was added, and centrifuged at 300×g for 5 min. The supernatant was completely removed by aspiration. Total RNA was extracted with an RNeasy Mini Kit (Qiagen, Valencia, CA, USA) according to the manufacturer's protocol. Total RNA was diluted to 250 ng/μl based on Nanodrop measurements.
Patients, sample collection, and RNA preparation. Endometrial cancer tissues and normal endometrium tissues from patients with benign gynecological tumor were obtained from the patients recruited from the Department of Obstetrics and Gynecology, Juntendo University Hospital, Tokyo, Japan, with a written informed consent by following a protocol approved by the ethical review board of Juntendo University Faculty of Medicine. For classification of clinical cases, the patient's information such as histologic subtype, differentiation grade, the International Federation of Gynecology and Obstetrics (FIGO) stage in 2008 and TNM classification were used.
Cancer tissues were cut into 5 mm cubes after hysterectomy, frozen immediately in liquid nitrogen, and stored at − 80 °C. Frozen samples were cut into 30 mg (~ 3 mm cubes) and RNA was extracted with an RNeasy Mini Kit. Total RNA was eluted with 50 µl of RNase-free water. Normal endometrium tissues were cut into 2 mm cubes after hysterectomy, and immediately placed into a PAXgene Tissue FIX Container (Qiagen) in liquid-I (Tissue Fix) at room temperature; after 3 h, they were transferred into liquid-II (Stabilizer) and stored at − 80 °C. RNA was extracted according to the Qiagen protocol for the PAXgene Tissue Container. Total RNA was eluted with 28 µl of RNase-free water. Total RNA was diluted to 250 ng/μl based on Nanodrop measurements.
Preparation of sequencing target amplicons. cDNA was prepared from total RNA of JHUEM-1 cells and surgical specimens using a PrimeScript II 1st strand cDNA Synthesis Kit (Takara Bio Inc., Kusatsu, Japan). Briefly, samples were denatured at 65 °C for 5 min in the presence of Oligo dT Primer and reverse transcribed at 42 °C for 60 min. The enzyme was inactivated at 70 °C for 15 min, and the samples were immediately chilled at 4 °C. The cDNA was purified with AMPure XP beads (Beckman Coulter, Brea CA, USA), eluted in 50 µl of RNA-free water and 1 µl of the 50 µl was used to prepare TACC2 p10 amplicons by PCR with LA-taq DNA polymerase and GC Buffer. TACC2 p10 forward primer (CCA GTT GCT GAA GGG CAG AA) and TACC2 p10 reverse primer Ex22 (TTG CCT CGA ACC TGA GCA ATC) were designed from the transcription start site (p10) revealed in a previous study 6  Computational processing of MinION read data. We first aligned the reads with the reference genome GRCh37 by using minimap2 21 with the splice alignment mode option (− x splice). The exon blocks were generated by merging overlapping exons of the primary alignments, and the primary alignments with the same set of exon blocks were grouped. Sequencing errors were corrected with Canu v1.7 22 by specifying nanopore read (-nanopore-raw) and with additional options corOutCoverage = 999, corMinCoverage = 0, and stopOnRead-Quality = false. The corrected reads were subjected to a second round of alignment and grouping. The resulting groups in the second round were expected to agree with individual isoforms more precisely than the ones generated in the first round, as the genome alignments were based on the corrected reads within each of the loosely defined groups in the first round. Within each of the relatively accurate groups generated in the second round, the sequence error correction was conducted again on the original raw reads to avoid bias potentially introduced in the first round owing to incorrect grouping. The script of this error correction step is available as https:// github. com/ hkawa ji/ ggec. The corrected reads were aligned with minimap2 with splice alignment mode (− x splice) and k-mer size 12 (− k 12), and only the primary alignments overlapping with the exons targeted by the amplification primers were kept. Individual exons of the alignments were merged when they overlapped, resulting in 29 non-overlapping merged blocks. The most frequent boundary of the overlapping exons in each block was chosen as the representative exon boundary. Variations of exon boundaries were taken when more than 5% of exon-containing reads were found. Only the alignments matching the representative exon or the identified exon variants were used in isoform counts (Table S4).
Validation PCR and electrophoresis of target amplicon from JHUEM-1 RNA. cDNA was prepared as described above for the JHUEM-1 cell line, diluted tenfold with RNase-free water, and used as a template for PCR in an ABI 7500 Fast Real-time PCR System (Thermo Fisher Scientific) according to the manufacturer's instructions. The final concentrations in PCR mixture were as follows: 1× Prime STAR Buffer (Mg 2+ plus), 200 µM dNTP mixture, 0.025 U/µl of PrimeSTAR HS DNA polymerase, 0.52 × SYBR Green (Thermo Fisher Scientific), 0.2 µM forward primer, 0.2 µM reverse primer, and 25 ng/µl cDNA. Thermal cycling was performed for 40 cycles (98 °C for 10 s, 55 °C for 15 s, and 72 °C 60 s). Custom primers for detecting the new exons (Ex2-3, Ex3-4, Ex7-8) were designed by the Primer3 web tool (http:// bioin fo. ut. ee/ prime r3-0. 4.0/) and were synthesized by Thermo Fisher Scientific; they are listed in Table S4.  www.nature.com/scientificreports/ To confirm the amplification with the PCR primers, control synthetic DNA fragments were used (Eurofins Genomics Co., Ltd., Brussels, Belgium). One negative control fragment contained only known TACC2 exons without novel exons, and three positive control fragments contained novel exons (Fig. S5). The design of the forward primer for rex01 (Ex1-2) and the reverse primer for rex05 (Ex3) has been already published 6 , and these primers were used to amplify the synthetic DNA fragments to assure their quality (Table S5). The new exon custom primers were verified using these control synthetic DNA fragments. The PCR fragments were electrophoresed in 3% NuSieve GTG agarose gel with Tris-borate-EDTA (TBE) buffer for 45 min; 100-bp DNA ladder (Takara Bio, Otsu, Japan) was used as a molecular weight marker. The gel was stained with ethidium bromide to visualize the amplicons. Images were captured using Gel Doc XR plus (Bio-Rad, Hercules, CA, USA).

Results
The intronic p10 promoter of TACC2 is highly active in an endometrial cancer cell line, JHUEM-1. Using RNA-seq data from the Cancer Cell Line Encyclopedia (CCLE) 23,24 , we explored 25 endometrial cancer cell lines with the active p10 promoter of the TACC2 gene (Table S1) and found that some of them have an RNA-seq signal in the first exon adjacent to the p10 promoter. The expression was most active in JHUEM-1 cells (Fig. S1). The promoter activity was also supported by an unamplified CAGE profile 25 in the FANTOM5 project 2 (Fig. 1A,B). We chose the JHUEM-1 cell line for the subsequent analysis. A raw reads read groups CAGE Peak (p10@TACC2) Novel Partially known www.nature.com/scientificreports/ Amplicon sequencing of RNA isoforms transcribed from the intronic promoter. To obtain the full-length amplicons of TACC2 isoforms, we cultured JHUEM-1 cells, extracted RNA, and performed longrange PCR with a primer pair amplifying transcripts from the first exon immediately downstream of p10 to the penultimate exon (Fig. 1C). Using nanopore sequencing, we determined the entire structures of the TACC2 RNA isoforms transcribed from the p10 promoter. We found two distinct sizes of amplicons (Fig. S2A), suggesting a subset of the isoforms with the long 4th exon 13 . We performed long-range PCR in triplicate, ligated an adaptor with a barcode sequence unique to each replicate, pooled the replicates, and subjected them to nanopore sequencing. We obtained 869,847 reads in total; their size distribution was consistent with the electrophoresis data (Table S2; Fig. S2).
Sequence error correction and exon identification. Despite the advantage of read length in nanopore sequencing, its drawback is its high error rate (about 10% 19,20 ), which makes it challenging to accurately determine exon boundaries through their alignment with the genome. Previous studies using ONT sequencing of long-range PCR amplicons relied on 2D pass reads 26,27 , but we could not use their computational pipelines because our sequencing was based on 1D. Error correction approaches are based on two strategies 28 . The "hybrid" strategy relies on accurate short reads 28 , where additional data has to be obtained separately and correction prioritizes sequences shared among abundant isoforms. This could skew the data toward major isoforms and impede identification of the minor ones. "Self-correction" relies on the other reads obtained from the same experiment 29 , and its direct application to RNA isoforms may still result in skewing the data toward abundant isoforms. We developed an alternative strategy, called "reference-guided self-correction". The longread sequences are grouped on the basis of loosely defined exon structures and are subjected to self-correction Table 1. Representative exon boundaries identified by long-read sequencing. Genomic coordinates (start, end) on chromosome 10 are based on GRCh37 human genome assembly (hg19). Exon names are prefixed by "rex" (e.g., rex01), with exon order in a previous study 6 indicated in parentheses (e.g., Ex2 indicates the second exon, and Ex2-3 indicates an intron between the second and third exons). When the number of nucleotides in an exon is a multiple of 3, it is referred as symmetric and could be skipped without a change in the reading frame. www.nature.com/scientificreports/ ( Figs. 2A,B, S3). As the correction strategy requires the reads to be aligned with the genome and to belong to a group with multiple members, only a subset (365,177, ~ 40%) of the reads were corrected. We found a fraction of reads derived from other gnomic regions due to non-specific hybridization of the PCR primers, and ones which ends do not necessarily matched to the priming site probably due to incompleteness of error correction. We selected 177,369 (~ 20%) aligned reads that cover the genomic regions corresponding to the primers used for amplification (Table S2). Of them, ~ 4% entirely matched to combinations of 29 representative exon boundaries and their variations defined below. By aligning the error-corrected reads with the reference genome and examining the aligned blocks, we identified 29 regions and their representative boundaries with the highest frequency per region (Table 1; Figs. 2C, S4). Hereafter, we refer to them using the prefix "rex" for "representative exon regions", i. e. rex01 to rex29. The outer boundaries of the first and last regions were based on the known gene models, as their internal sequences were the targets of amplification with the PCR primers. Of the 29 regions, 24 exactly matched (at the base-pair level) the exons of the gene models in RefSeq 30 , Ensembl 31 , and a previous report 6 . This confirmation of the known exons indicated the accuracy of our results. We also found three regions, rex04, rex06, and rex12, that did not overlap any known exons, which we considered as candidate novel exons. The remaining two regions were partially overlapping exons of a pseudogene annotated in Ensembl and a transcript model of an EST (expressed sequence tag)-based prediction (SIB Gene in the UCSC Genome Browser Database 32 , https:// ccg. epfl. ch/ tromer/). We also found length variations in 13 regions (Table 2); in all of them, either the 5′ or 3′ boundaries were consistent with the representative boundaries. The frequencies of exons and isoforms are summarized in Tables S3 and S4.
Validation of the three novel exons in an endometrial cancer cell line. We examined whether the three novel regions represent authentic exons or technical artifacts arising from sequencing errors. In an independent experiment, we conducted PCR analysis with primers specific to the sequences of the novel regions (Fig. 3). The designed primers targeted the most frequent exon-exon junctions (Figs. 4A, S5, and Table S5). cDNAs prepared from JHUEM-1 cells, and synthetic positive and negative controls were subjected to PCR (Table S6). A single band of the expected size was found in the JHUEM-1 cell sample and in the positive but not negative control for each amplicon (Fig. 4B). These results confirmed the three regions as authentic exons.
Long-read sequencing of RNA isoforms in primary tissues. Next, we analyzed 16 cases of endometrial cancer and 5 normal endometria (Table S7), as transcripts from the intronic promoter p10 of TACC2 were detectable in normal endometria at a similar level to lower cases of primary tumors. We used the same approach as above, but performed two PCR rounds to obtain enough material for sequencing because the RNA amount was limited. We also found two distinct sizes of amplicons (Fig. S6), but the bands corresponding to Table 2. Variations of exon boundaries. Patterns of variations are written as "L value 1:R value 2", where "L" refers to the 5′-end and "R" indicates changes at the 3′-end. A positive value indicates exon extension, and a negative one indicates exon truncation. For example, L-4412:R0 indicates that the 5′-end is truncated by 4412 bp and the 3′-end is unchanged. www.nature.com/scientificreports/ the larger isoforms were sometimes faint. This is likely due to less efficient amplification of long molecules with the additional round of PCR. We obtained two pools of full-length cDNA amplicons; each of them consisted of 12 samples with unique sequence adaptors (Table S7). Sequencing of the amplicons produced almost ~ 5.5 million reads in total (~ 231 thousand reads on average per sample). We corrected sequencing errors as above and obtained ~ 1.7 million corrected reads (~ 72 thousand on average per sample; ~ 30%; Table S7). The ratio was lower than for JHUEM-1 cells (~ 40%) but the number of corrected reads was larger than in the individual JHUEM-1 replicates. Approximately 179 thousand reads (~ 8 thousand on average per sample; ~ 3%) matched exactly the exons defined above. Although we obtained the full-length sequences of TACC2 from the primary tissues, quantification of isoform abundance was limited because of the additional round of long-range PCR.

Name Length (bp) Symmetric Pattern Length change (BP) Symmetric
Diversity of isoform structures. Our full-length sequence data contained 27 profiles, comprising the triplicates for the cell line and 21 primary tissues, including three duplicates for primary tumors. Approximately 200 thousand full-length error-corrected sequences (8 thousand sequences per profile on average; Tables S2 and  S7) were obtained, with each genome alignment exactly matching the 29 representative exons or their 21 variations. We confirmed 9 of the 10 published isoforms 6 ; 6 isoforms were within the top 10 by the number of reads. The six ones were supported with more than 6000 reads, 2 within the top 50 with more than 500 reads, and one was at the 240th place with 20 reads (Tables S4 and S8. The only missing isoform was expected to have a 60-bp exon, where the exon itself (rex03) was found in our data set with less than 1% frequency (Table 1).
We next assessed the reproducibility of isoform frequencies based on the read counts of technical replicates. In JHUEM-1 cells (Fig. S7), technical replicates rep1 and rep2 were more consistent with each other (Spearman correlation coefficient (r s ) = 0.88) than with rep3 (r s = 0.78). On the other hand, in primary tumors, correlation was poor (r s < 0.5 in the examples shown in Fig. S8). Principle component analysis of isoform frequencies (Fig. S9) did not show clear segregation of sample types, while one could assume cell or tissue specific patterns in an analogous way to tissue-specific alternative splicing 3 . The accuracy of isoform quantification was good for the cell line but limited for primary tissues, probably because of an additional amplification step of long-range PCR, and it is not practical to compare isoform frequencies among different samples by using the present data.
Therefore, we focused on qualitative features rather than quantification. The overall frequencies of isoforms, exons, and exon-exon junctions, and isoform structures are shown in Fig. 5. The isoforms were ranked by frequency, as it provides supporting evidence of the isoform presence. The most frequent one corresponded to only www.nature.com/scientificreports/ 10% of the reads, suggesting that there are no dominant isoforms; 26 isoforms accounted for 80% of the reads and 101 isoforms accounted for 95%. Notably, the top isoforms show high frequencies in the profiled samples, while the quantification accuracy is limited. To cover all known and newly discovered exons, isoforms down to the 157th isoform were required (Fig. 5). This is consistent with the observation mentioned above that the 240th isoform matched a known isoform. The exon frequencies indicated the presence of three exon classes: 16 stable exons present in nearly all isoforms, 4 variable exons present in more than 10% of the reads, and 9 rare exons present in as few as 0.01-10% of the reads. The three novel exons belonged to the class of rare exons. The functionally characterized TACC domain is located in the C-terminal part of the protein, and all exons encoding this domain were stably included. We found that the variations arise from the N-terminal part, which is similar to that in TACC2 isoforms transcribed from the canonical upstream promoters. The top 12 isoforms arose only from the combinations of variable exons, while a variant of the 6th residential exon lacking 12 bp from its start position appeared repeatedly in the subsequent isoforms (Fig. S10).

Discussion
Here we used nanopore sequencing to extensively profile TACC2 isoforms derived from the intronic promoter p10. We chose nanopore sequencing because of low installation costs, simple operation, and use in other studies in which 2D sequencing was performed 26,27 (although we used 1D sequencing). Because this approach has a higher error rate than 2D sequencing or circular consensus sequencing of PacBio (an improved version of SMRT sequencing), we developed a strategy called "reference-guided self-correction" to correct errors in each isoformlevel group. This strategy substantially increased base accuracy and reduced the number of available reads to only 3% to 4% of the original reads. Unique molecular identifier (UMI), random oligonucleotide sequences uniquely attached to individual PCR templates, is successfully applied to increase sequence accuracy in a recent study 33 . It performs error correction within the reads obtained from the same PCR template while our strategy performs the correction within the same isoform. The two strategies approaching the same problem with different levels can be used complementary. UMI-based approach could be applied to a larger potion of the reads as it does not require the reads to be mapped on the genome, and remaining reads with insufficient error-correction at molecular level could be corrected at isoform level. Further, our strategy of isoform-level error correction would be useful for PCR-free applications, such as RNA direct sequencing.
We confirmed two major sizes of TACC2 isoforms, ~ 4 kb and 10 kb (Fig. S2), all previously reported exons 6 ( Fig. 2; Table 1), and 9 of 10 reported isoforms 6 (Fig. 5). Precise matches of the identified exon boundaries to the reported ones emphasize the accuracy of our results at the base-pair level. The absence of the 16th exon in one of the isoforms derived from the most upstream promoter was also consistently found in the previous study 6 and our data. The 15th exon was absent in 10 isoforms in the previous study 6 , and its frequency was lower than 1% in our data set (referred to as rex23; Fig. 5, Table S3). The isoform missing from our data includes a rare exon  www.nature.com/scientificreports/ (rex03, < 1% frequency). It is likely to be absent in the samples we studied, or to be present below the detection limit. We further uncovered three novel exons, which was validated with PCR independently to the sequencing. Overall, we identified 29 exons, and all of them are supported by previous reports, existing databases, and our own confirmatory experiments. Our data demonstrate strong preference of exon usage: 16 stable, 4 variable, and 9 rare exons. The longest exon (rex08) is variable; it contributed to both bands on the gel electrophoresis of full-length cDNAs. Exons rex09, rex10, and rex11 are also variable. If the four variable exons were chosen randomly, 16 (= 2 4 ) isoforms could be present with high frequencies, but only 12 isoforms were present with top frequencies, indicating unrevealed regulation of exon selection. These four consecutive exons span almost 50 kbp on the genome (Fig. 2), and repressive regulation at the post-transcriptional level rather than co-transcriptional regulation seems more likely.
Rare exons also contributed to isoform diversity. Eight of 9 (88%) rare exons and all 4 variable exons were symmetric (i.e., their nucleotide numbers were multiples of 3), whereas only 2 of 16 (12%) stable exons were symmetric ( Table 2). This is consistent with the reported preference of alternative exons 34 . Notably, the regions encoded by the variable and rare exons were located N-terminally of the functionally characterized and crossspecies conserved region, TACC domain 35 . As they do not disrupt the coding frame, the C-terminal domain is preserved in most of the isoforms.
With the 29 identified exons and their 21 variations, we found 1,464 isoforms, including those observed only once, and 240 isoforms were supported by at least 20 reads. The 240th isoform has been independently identified by cDNA cloning with capillary sequencing 6 . We consider those observed at least 20 times as high-confidence isoforms. If we set the threshold at 3 times per isoform, we found 694 isoforms. Our data unveiled 240 isoforms with high confidence and additional 454 with minimum evidence. Isoform diversity in conjunction with the frequency of exons and exon-exon junctions is shown in Fig. 5.
Diverse isoforms were identified for BRCA1 27 and CACNA1C 26 . In the case of BRCA1 having ~ 24 exons, 32 isoforms were identified, including 18 novel ones found in the control lymphoblastoid cells but not in breast cancer tissues. These include aberrant isoforms that are usually degraded by nonsense-mediated RNA decay, upon inhibition of the decay by culturing cells with cycloheximide. The latter study 26 identified 241 isoforms and 38 novel exons in the CACNA1C gene, where known isoforms consist of ~ 50 exons. The large number of exons facilitates the great diversity of isoforms. In the TACC2 gene, we found 240 isoforms with high-confidence and 694 isoforms with minimum evidence, based on 29 exons including 3 novel ones. Our data demonstrates that isoform diversity can be increased even with a limited number of exons.
Not all the data on the association of TACC2 expression with tumors are consistent. Chen et al. 9 reported it as a tumor suppressor in breast cancer, but Cheng et al. 12 reported its oncogenic effect in the same type of cancer. Takayama et al. 10,11 suggested its positive effect on tumor growth in prostate cancer. While we did not find any relationships between isoform abundance and cell types, probably because of technical limitations of this study, the discordant reports could be caused by the complexity of gene structure and isoform diversity. Concordant results of tumor associations could be obtained through careful discrimination of these isoforms.