Full-length transcriptome sequencing from multiple tissues of duck, Anas platyrhynchos

Duck (Anas platyrhynchos), one of the most economically important waterfowl, is an ideal model for studying the immune protection mechanism of birds. An incomplete duck reference genome and very limited availability of full-length cDNAs has hindered the identification of alternatively spliced transcripts and slowed down many basic studies in ducks. We applied PacBio Iso-Seq technologies to multiple tissues from duck for use in transcriptome sequencing. We obtained 199,993 full-length transcripts and comprehensively annotated these transcripts. 23,755 lncRNAs were predicted from all identified transcripts and 35,031 alternative splicing events, which divided into 5 models, were accurately predicted from 3,346 genes. Our data constitute a large increase in the known number of both lncRNA, and alternatively spliced transcripts of duck and plays an important role in improving current genome annotation. In addition, the data will be extremely useful for functional studies in other birds.

Full-length sequencing and analysis pipeline. We combined all raw data and performed initial data processing according to the Iso-seq standard pipeline (Fig. 1). The Circular consensus sequence (CCS) was generated from initial data using the SMRTlink (version 5.1) software 16 . The CCS was classified into full-length and non-full length reads according to the 5′and 3′adapters and the poly(A) tail. Reads containing both the 5′ and 3′ primers and having a poly(A) tail signal preceding the 3′ primer were considered to be full-length reads. Iterative Clustering for Error Correction (ICE) was used to find transcript clusters based on the pairwise alignment and reiterative assignment of full-length reads. The cluster consensus reads were polished with non-full length reads to obtain high-quality isoforms using Arrow software(https://downloads.pacbcloud.com/public/ software/installers/smrtlink_5.0.1.9585.zip). The RNA-Seq data from 16 tissues of duck 23 generated by our lab was used to correct nucleotide mismatches in consensus reads with the software LoRDEC 24 . Any redundancy in corrected consensus reads was removed by CD-Hit-Est 25 to obtain final transcripts for the subsequent analysis. To estimate the completeness of our multiple tissue transcriptomic sequencing, we used a benchmarking universal single-copy orthologs (BUSCO) assessment 26 . We used ortholog sets from Aves lineages to examine transcriptome completion. We analyzed the completeness of datasets in processing steps, both corrected, polished consensus data and non-redundant transcript data.
Functional annotation of PacBio isoforms. The obtained full-length transcripts were annotated by conducting a local BLASTx 27 search against the protein databases, namely the Nr protein database at GenBank (http://www.ncbi.nlm.nih.gov), UniProtKB (http://www.expasy.ch/sprot, version:2019-8-14) and KOG. We determined the best match between each transcript and a known sequence based on the bit score. The results with a bit score below 50 were discarded and the highest bit score was considered as the best match. To classify the functions of transcripts based on molecular function, biological process and cellular component features, GO annotation was performed using Metascape 28 , while KEGG orthology and pathway annotations were obtained by using KAAS (KEGG Automatic Annotation Server) 29 . ANGLE 30 was used to determine the open reading frame (ORF) of each full-length cDNA sequence. We used high confidence duck protein sequences (ftp://ftp.ensembl. org/pub/release-95/fasta/anas_platyrhynchos/cds/) for ANGLE training and then ran the ANGLE prediction for given sequences. www.nature.com/scientificdata www.nature.com/scientificdata/ In addition to protein-coding RNAs, long non-coding RNAs constitute a major component of the transcriptome. In order to improve the accuracy of prediction of lncRNA, we used CPC (Coding Potential Calculator) 31 , PLEK (the predictor of long non-coding RNAs and messenger RNAs based on an improved k-mer scheme) 32 , Pfam-scan 33 and CNCI (Coding-Non-Coding-Index) 34 to predict the coding potential of transcripts after CD-Hit-Est, respectively. First, PLEK and CNCI were used to predict the coding potential according to the sequence characteristics of transcripts. The sequence of transcripts was compared with the known protein database by BLAST using CPC and searched by homology with Pfam-A and Pfam-B databases, their coding potential being predicted more accurately after comparing with the databases. The transcripts found by all programs were considered candidate lncRNA. Then, candidate lncRNA whose ORF length was longer than 300 bp and also had meaningful blast homology (BLASTX) when searched against the bird protein databases, were then removed. We determined the remaining non protein-coding transcripts as high confidence lncRNAs.
identification of As modes. The full-length transcripts were mapped to the reference genome CAU_ duck1.0 using GMAP 35 . The alignment file was filtered for 90% alignment coverage and 90% alignment identity and corresponding GFF files generated using cDNA_Cupcake 16 . SUPPA2 36 generates the AS and transcript events from an annotation file (GFF/GTF format). It then generates two files: ioe format for local AS events, and ioi format for transcripts. The ioe file provides for each AS event in a gene and the transcripts that describe either form of the event. The ioi file provides for each transcript in a gene, the set of all transcripts from that gene from which the transcript relative abundance is calculated. The AS event generated by SUPPA2 contained five different types: Alternative 5′/3′ splice-site (A5/A3), Skipping exon (SE), Alternative first/last Exons (AF/AL), Mutually exclusive exons (MX) and Retained intron (RI).

Fig. 1
The standard Iso-Seq pipeline for raw data processing. Raw sequence reads from a Pacbio RSII sequencer were processed using SMRTlink. The full-length reads and non-full-length reads were clustered into consensus transcripts using Arrow. All polished reads were corrected with Illumina short-read data using LoRDEC. All sequence data that removed redundant sequences using CD-Hit-Est were carried on to further analysis.

technical Validation
Quality control of sequencing analysis. From 77 Gb raw data, we produced 41.62 Gb subreads, which was classified into 702,788 non-chimeric circular consensus (CCS) reads. CCS reads comprised 563,320 fulllength reads with an average read length of 3,338 bp. The 313,565 high-quality consensus isoforms and low-quality consensus isoforms were corrected with RNA-Seq data using LoRDEC. 199,993 corrected full-length isoforms were used for further analysis after accounting for redundancy ( Table 2). We used Aves lineages (ortholog sets) to examine transcript completion ( Table 3). As expected, the percentage of complete BUSCO genes is over 80% in full-length transcripts, both before and after removing redundancy. After the redundant sequences were removed, the complete duplicated sequence decreased by 12.4% and the number of complete single copy genes increased by 10.4%, indicating that the integrity of the full-length transcripts was not compromised by removal of the redundant sequences. Significantly reduced, non-redundant full-length transcript data sets showed high integrity for subsequent analysis. annotation quality control. We annotated full-length transcripts with multiple reference databases for further study of gene function. First, the majority of transcripts (185,435; 92.72%) have similar sequences in Nt. Matches to other databases were as follows: 127,780 (63.89%) to Nr, 102,539 (51.27%) to UniProtKB and 53,570 (26.79%) transcripts aligned to the pfam database using BLASTx.
All transcripts were subject to functional annotation and classification. About half of the full-length transcripts were annotated by KEGG, GO and KOG databases. In general, 187,139 (93.57%) transcripts were found in at least one database and 20.81% of the transcripts were found in all databases ( Table 4). The metascape website first obtained GO annotations from Gene Ontology (http://geneontology.org/, 2019-07-01) 57   www.nature.com/scientificdata www.nature.com/scientificdata/ were assigned to each isoform based on the corresponding homologs in UniProtKB database. A total of 97,823 (48.91%) transcripts were annotated to multiple GO classification terms. In the "biological process" category, the majority of the transcripts were represented by 'cellular process' (63,996), 'biological regulation' (55,142) and 'single-organism process' (54,805) terms. On the other hand, 'cell' (86,603) was the most represented item in the "cellular component" category, while 'binding' (62,070) was the most common term in the "molecular function" category ( Fig. 2). Further analysis of the KEGG annotations revealed that most transcripts were enriched in signal transduction (13,791), endocrine system (6,935), immune system (5,791), cellular community-eukaryotes (5,744) and transport and catabolism (5,645). With KOG analysis, 82,456 (41.23%) transcripts were annotated and classified into 26 KOG categories. The largest cluster was "Signal transduction mechanisms (T)", indicating that most of the function represented by these transcripts are for the basic mechanisms controlling cell growth, proliferation, metabolism, and many other processes. The next largest cluster was 'the general function prediction only (R)' , followed by 'Posttranslational modification, protein turnover, chaperones (O)' , 'Cytoskeleton (Z)' and 'Transcription (K)' .
We obtained 34,364 candidate lncRNAs determined by the coding ability of the predicted sequence. In order to improve the accuracy of predicted lncRNA, sequences with ORF > 300 bp and which aligned against the avian protein databases were excluded, leaving 23,755 remaining sequences. The average gene expression of predicted lncRNAs is much lower than that of protein-coding RNAs (Fig. 3). In addition, the number of exons in lncRNAs is also significantly less than that of protein-coding RNAs. 71.72% of the predicted lncRNAs have only a single exon and only 11.36% of lncRNAs have more than two exons (Fig. 3).

Quality control of aS events.
More than 99% of full-length transcripts were mapped to the reference genome, and 18,328 gene models predicted (Table 5). We identified 35,031 AS events from 3,346 gene models. RI predominated, accounting for 61.86% of alternative transcripts. Except for AL (9.62%) and MX (8.13%), other AS types, such as RI (61.86%), SE (53.44%), A3 (50.30%), A5 (44.98%) and AF (29.63%), are more common in alternative splicing events (Fig. 4). Most genes exhibited only one model of AS, with only 70 genes showing every AS type (Fig. 5). We found that the number of AS events within genes is correlated with the number of exons, indicating that the complexity and diversity of transcription is enhanced by AS as exons increase.
The data provided in this study form the first report of a full-length transcriptomic resource for ducks, which includes predicted lncRNA and AS events identified by Iso-seq technology. These findings will be invaluable for improving genome annotation, examining AS evolution, and conducting functional studies in ducks.

Code availability
Most of the data analysis was completed by software running on the Linux system, and the version and parameters of main software tools are described below.
(1) SMRTlink: version 5.1, parameters: no_polish TRUE, max_drop_fraction 0.8, min_zscore −9999.0, min_ length 50, min_predicted_accuracy 0.8, max_length 15000, min_passes 2.    www.nature.com/scientificdata www.nature.com/scientificdata/ ca ta ly tic a ct iv ity m o le cu la r fu n ct io n re g u la to r tr a n sp o rt e r a ct iv ity si g n a l tr a n sd u ce r a ct iv ity m o le cu la r tr a n sd u ce r a ct iv ity st ru ct u ra l m o le cu le a ct iv ity n u cl e ic a ci d b in d in g tr a n sc ri p tio n fa ct o r a ct iv ity tr a n sc ri p tio n fa ct o r a ct iv ity , p ro te in b in d in g tr a n sl a tio n re g u la to r a ct iv ity a n tio xi d a n t a ct iv ity ch e m o re p e lle n t a ct iv ity p ro te in ta g e le ct ro n ca rr ie r a ct iv ity ch e m o a tt ra ct a n t a ct iv ity m e ta llo     Fig. 4 The total number of AS events in detected genes and transcripts by SUPPA2 analysis. A3, alternative 3′ splice site; SE, skipped exon; A5, alternative 5′ splice site; AF, alternative first exon; MX, mutually exclusive exon; AL, alternative last exon; RI, retained intron.