De novo transcriptome assembly for the spiny mouse (Acomys cahirinus)

Spiny mice of the genus Acomys display several unique physiological traits, including menstruation and scar-free wound healing; characteristics that are exceedingly rare in mammals, and of considerable interest to the scientific community. These unique attributes, and the potential for spiny mice to accurately model human diseases, are driving increased use of this genus in biomedical research, however little genetic information is accessible for this species. This project aimed to generate a draft transcriptome for the Common spiny mouse (Acomys cahirinus). Illumina sequencing of RNA from 15 organ types (male and female) produced 451 million, 150 bp paired-end reads (92.4Gbp). An extensive survey of de novo transcriptome assembly approaches using Trinity, SOAPdenovo-Trans, and Oases at multiple kmer lengths was conducted, producing 50 single-kmer assemblies from this dataset. Non-redundant transcripts from all assemblies were merged into a meta-assembly using the EvidentialGene tr2aacds pipeline, producing the largest gene catalogue to date for Acomys cahirinus. This study provides the first detailed characterization of the spiny mouse transcriptome. It validates use of the EvidentialGene tr2aacds pipeline in mammals to augment conventional de novo assembly approaches, and provides a valuable scientific resource for further investigation into the unique physiological characteristics inherent in the genus Acomys.


Background
The Common or Cairo spiny mouse ( Acomys cahirinus ) is a small rodent species endemic to the semiarid deserts of Africa and the Middle East (Wilson & Reeder, 2005).Used in research to model human disease, spiny mice exhibit physiological characteristics not typically found in rodents: they exhibit a precocial pattern of development (Brunjes, 1990; Dickinson et al., 2005), atypical synthesis of hormones such as cortisol and dehydroepiandosterone (Lamers, 1986; Quinn et al., 2013; 2015), and a menstrual cycle (Bellofiore et al., 2016).These traits are common to humans and other higher order primates, but rare in other mammals.For example, menstruation has been identified in only six nonprimate species, none of which are rodents (Emera et al., 2012).The discovery of humanlike physiological characteristics in a rodent is highly valuable for those in the scientific community looking to model human conditions.The true value of the common spiny mouse as an animal model has yet to be fully realized, as many fundamental aspects of their biology are still to be explored; for instance, there is little genetic information available for this species.
Publically available genetic information for the spiny mouse consists of the mitochondrial genome (Hadid et al., 2014), and two RNA sequencing (RNASeq) datasets: PRJNA184055 (Fushan et al., 2015), and PRJNA292021 (Gawriluk et al., 2016).These nextgeneration sequencing (NGS) datasets were created with specific aims: to establish incipient sympatric speciation as a mode of natural selection in mammals inhabiting divergent microclimates (Hadid et al., 2014), to examine the molecular basis for natural variation in mammalian lifespan (Fushan et al., 2015), and to characterize and investigate another characteristic unique to Acomys : scarfree wound healing and skin regeneration (Gawriluk et al., 2016).De novo assembly of NGS reads was conducted for each specific organ/tissue sequenced in these projects, however the accuracy and completeness of the resulting assemblies were not explicitly described.Accurate identification of differentially expressed genes is dependent on accurate read mapping (Garber et al., 2011).The use of a highquality speciesspecific reference sequence containing transcripts from multiple organ types promotes accurate mapping, thereby facilitating accurate gene expression analysis.
RNASeq provides an unprecedented opportunity for costeffective, largescale genetic analysis in nonmodel organisms for which a genome sequence is unavailable.De novo assembly of millions/billions of RNASeq reads into a reference transcriptome can provide a valuable scientific resource, with applications in phylogenetic analysis (Robertson & Cornman, 2014), novel gene identification (Maudhoo et al., 2015), RNA editing (Athanasiadis et al., 2004) and alternative splicing investigation (Du et al., 2015), qPCR primer design (Zieliński et al., 2014), development and refinement of bioinformatics software (Bens et al., 2016), augmenting proteomic research (Francischetti et al., 2013), and investigation of gene expression profiles underlying complex physiological traits (Carneiro et al., 2014, Shimoyama et al., 2016).Surprisingly, there is no universal protocol for the assembly process.Despite the availability of protocols for de novo transcriptome assembly using popular software packages (eg.Haas et al., 2013), and practical guidelines such as the comprehensive Oyster River protocol (MacManes, 2016; http://oysterriverprotocol.rtfd.org/ ),the assembly pipeline commonly requires considerable optimisation to obtain high quality, meaningful results from each dataset (Robertson et al., 2010; SurgetGroba & MontoyaBurgos, 2010).
Here, we describe the application of a variety of de novo transcriptome assembly methods, utilizing both single and multikmer approaches, with unique transcripts from multiple assemblies combined to address our primary aim: the generation of a comprehensive de novo transcriptome assembly for the common spiny mouse ( Acomys cahirinus ).

Sample preparation and sequencing
Tissues were collected from male and female adult spiny mice, and placenta from three pregnant females.Total RNA was extracted using Trizol reagent from skin, lung, liver, small intestine, kidney, adrenal gland, brain, thymus, spleen, diaphragm, heart, skeletal muscle, testis, ovary, and placenta.Quantitation and quality control was performed using Agilent Bioanalyzer and Qubit, with all samples returning RIN scores >7.0.The Illumina TruSeq Stranded Total RNA Library Prep kit (Illumina, Hayward, CA) was used to generate indexed cDNA libraries for each RNA sample.
Filtering of poor quality reads was conducted using Trimmomatic v0.30 (Bolger et al., 2013) with settings 'LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 AVGQUAL:20 MINLEN:35'.Up to three nucleotides with quality scores lower than 20 were trimmed from the 3' and 5' read ends.Reads with an average quality score lower than 20, and reads with a total length of fewer than 35 nucleotides were removed.
Probabilistic error correction was performed using SEECER (HaiSon et al., 2013) with default parameters, remaining highquality reads were again examined with FastQC.Both errorcorrected and nonerrorcorrected reads were subjected to de novo assembly.
Assembly statistics were computed using the TrinityStats.plfrom the Trinity package, and log files produced by SOAPdenovoTrans and Velvet/Oases.

Collating nonredundant transcripts from multiple assemblies
The tr2aacds pipeline from the EvidentialGene package (Gilbert, 2013: http://arthropods.eugenes.org/EvidentialGene/about/EvidentialGene_trassembly)was used to identify and collate nonredundant transcripts from each individual transcriptome assembly.The tr2aacds pipeline predicts amino acid sequences and transcript coding sequences, removes transcript redundancy based on coding potential, removes sequence fragments, clusters highly similar sequences together into loci, and classifies nonredundant transcripts as 'primary' or 'alternative'.
Transcripts that scored poorly were removed, with remaining 'primary' and 'alternative' transcripts from each singlekmer assembly collated into a final 'clustered assembly' (CA).

Assessing accuracy and completeness of each assembly
Accuracy and completeness was assessed in all assemblies (singlekmer and CA) using BUSCO v1.1b1 (Simão et al., 2015) to establish the presence or absence of universal single copy orthologs common to vertebrates and eukaryotes.Accuracy was assessed by the proportion of original sequence reads mapped ('backmapping') to each assembly using Bowtie v0.12.9 (Langmead et al., 2009) with settings: 'q phred33quals n 2 e 99999999 l 25 I 1 X 1000 p 12 a m 200 chunkmbs 256'.Independent RNASeq reads were obtained from the NCBI sequence read archive (SRA): datasets SRR636836, SRR636837, and SRR636838 were obtained from project PRJNA184055, and datasets SRR2146799 SRR2146807 from project PRJNA292021.These reads were generated from liver (Fushan et al., 2015) and skin ( Gawriluk et al., 2016) and neither organ was subjected to treatment i.e. these reads correspond to 'control' groups in their corresponding experiments.The independent RNASeq reads were aligned using Bowtie, with settings as specified above.The proportion of mapped reads was calculated using samtools flagstat with default parameters (Li et al., 2009).Structural integrity was examined using TransRate v1.0.2 (SmithUnna et al., 2016) with default settings, in which Salmon v0.6.0 (Patro et al., 2015) and SNAPaligner v0.15 (Zaharia et al., 2011) were implemented.Redundancy in assembled transcripts was assessed by the proportion of highly similar contiguous sequences (contigs), clustered using CDHITEST v4.6.5 (Li & Godzik, 2006; Fu et al., 2012; Huang et al., 2010) with settings 'c 0.95 n 8 p 1 g 1 M 200000 T 8 d 40'.Further clustering at 90%, 95%, and 100% similarity was conducted on a representative singlekmer assembly to assess contig redundancy.
Annotations were loaded into an SQL database.GO terms linked to the SwissProt entry of the best BLAST hit were used for ontology annotation.GO functional classifications and plotting was performed by WEGO (http://wego.genomics.org.cn)(Ye et al., 2006).

Raw data
Raw reads are available from the NCBI SRA under project number PRJNA342864.
In total, 451 million read pairs were generated, with yield, proportion aligned, error rate, intensity, and GC content provided in Table 1.Further summary statistics for data yield, percent passfilter (%PF), raw cluster percentage per lane, and quality score summary are provided in Supplementary Table S1.Filtering of poor quality read pairs (Q<30) removed 32% of the original 451 million read pairs, with 305 million highquality paired reads used for assembly.FastQC results for raw, filtered and in silico normalized data is provided in Supplementary Figure S1.

Transcriptome assembly
In total, 49 singlekmer transcriptome assemblies were produced from 305 million paired reads, with and without digital normalization and probabilistic error correction, as described in Figure 1 .Detailed metrics for all transcriptome assemblies are provided in Supplementary Table S2.Digital normalization reduced the size of the dataset by >90%, however assemblies constructed using normalized data contained fewer BUSCO reference orthologues (Figure 2), had poorer backmapping rates (Figure 3), poorer mapping of independent reads (Figure 4), and poorer TransRate scores (Figure 5), compared to assemblies generated from nonnormalized data.
Read errors identified using probabilistic error correction program SEECER were comprised of 14,821,705 substitutions, 1,760,162 deletions, and 1,614,908 insertions, affecting 6% of reads in total.Error correction provided a modest improvement to BUSCO score and mapping of independent reads, however it also resulted in slightly poorer backmapping rate and transrate score, compared to assemblies generated from noncorrected reads (Figures 2 5).A recent analysis of RNASeq read error correction supports the use of SEECER for enhancing transcriptome assembly quality (MacManes, 2015), however this result suggests the potential benefits of read error correction may not be realized in all de novo assembly pipelines.
Trinity produced the most accurate and complete singlekmer assembly comprised of 2,011,167 contigs.Clustering highly similar contigs resulted in a small reduction in the total number of contigs (Supplementary Figure S2), with the majority of clustered contigs found to be gene isoforms (Supplementary Table S3, Table S4, Table S5).
Clustering highly similar contigs increased the proportion of single copy BUSCOs detected, however it decreased the proportion of duplicate BUSCOs and increased the number of fragmented BUSCOs (Supplementary Figure S3).
Nonredundant transcripts from all assemblies were merged into a single clustered assembly (CA) using the EvidentialGene tr2aacds pipeline (Supplementary Table S4).This was effective in increasing the proportion of complete BUSCOs found, and reducing the number of fragmented and missing BUSCOs (Figure 2).The BUSCO values obtained are consistent with the most complete reference transcriptomes from other vertebrate and eukaryote taxa (Simão et al., 2015: Supplementary Online Material).

Annotation
The most accurate and complete singlekmer assembly was produced by Trinity from the 'nonnormalized' dataset.It contained 2,011,167 contigs (1,881,373 'genes' as defined by Trinity), representing a 995 Mb transcriptome.In total, 177,496 transcripts showing significant homology to the Uniprot/Swissprot database, with 59,912 (33.75%) of these transcripts containing an open reading frame, and 7,563 signalling proteins and 1,119 protein families were identified.The output from the annotation pipeline is provided in Supplementary Table S5.In comparison, the Of the transcripts returning BLASTx hits, 44,706 were associated with one or more Gene Ontology (GO) terms, with 394,288 GO categories represented.GO terms attributed to the greatest number of genes were metal ion binding (GO:0046872; n=13850), RNAdirected DNA polymerase activity (GO:0003964; n=11357), DNA recombination (GO:0006310; n=10728), cytoplasm (GO:0005737; n=9111), nucleus (GO:0005634; n=8648), integral component of membrane (GO:0016021; n=7829), endonuclease activity (GO:0004519; n=7619), plasma membrane (GO:0005886; n=5740), DNA binding (GO:0003677; n=3718), and membrane (GO:0016020; n=3676).GO terms were well distributed between the categories of biological process, cellular component and molecular function (Supplementary Figure S4) .

Validation
Further validation of transcriptome quality was performed by downloading and mapping RNASeq reads from NCBI projects: PRJNA184055 and PRJNA292021.A relatively large proportion of reads from these datasets mapped to the transcriptome assemblies generated, with the highest proportion from both datasets mapping to the CA (Figure 3B, Figure 4).Differences in mapping rates between these two independent datasets are likely due to read length (Fushan et al. PRJNA184055: 50 bp pairedend reads; Gawriluk et al.PRJNA292021: 100 bp pairedend reads), and depth of sequencing.Reads used to generate each of the 50 transcriptome assemblies (150bp pairedend reads) were also mapped to each assembly (backmapping), with very few unmapped reads further validating the completeness of the CA (Figure 3) .The use of '4th generation' longread sequencing to improve assembly structure would likely increase the proper pairing of reads, and may reduce the number of unmapped reads further, however the relatively low proportion of unmapped reads is encouraging for broad applications as a reference sequence.
TransRate scores were highest for assemblies generated from nonnormalized data, with higher scores typically corresponding to larger kmer sizes (Figure 5).Despite performing well in other measures of transcriptome assembly integrity and completeness, TransRate scores were lower than expected for highquality transcriptomes (≥0.2), with assemblies typically scoring between 0.1 and 0.2.
Possible causes for lowerthanexpected TransRate scores are the large proportion of read errors identified by SEECER, and the inclusion of poor quality reads from the original dataset.TransRate scoring is contingent on accurate backmapping of reads to transcripts, as it evaluates assemblies based on whether each base has been called correctly, whether bases are truly part of transcripts, whether contigs are derived from a single transcript, and whether contigs are structurally complete and correct (SmithUnna et al., 2016).Poor read quality impedes read mapping, which can negatively impact the TransRate score.Aligning errorcorrected reads may increase the TransRate score compared to uncorrected reads, however this was not examined here as the format of reads corrected using SEECER is incompatible with TransRate.

Conclusion
We have generated a comprehensive de novo transcriptome assembly for the spiny mouse ( Acomys cahirinus ) using the combined output of three de novo transcriptome assemblers: Trinity, SOAPdenovoTrans, and Velvet/Oases.The EvidentialGene tr2aacds pipeline was effective in identifying and collating unique transcripts from 50 unique assemblies, producing a 451 Mb transcriptome assembly comprised of 888,080 transcripts.To our knowledge, this is the first reported instance of the tr2aacds pipeline augmenting transcriptome assembly in a mammal.
The resulting transcriptome assembly was extensively validated, and shown to be accurate and relatively complete.This transcriptome represents the largest gene catalogue to date for the spiny mouse, providing an important resource for further investigation, and facilitating access to scientific techniques such as sgRNA design for CRISPR/Cas9 gene editing.This reference sequence is now being used to further investigate physiological traits unique to this species, and for establishing the spiny mouse as a useful experimental and clinical model to advance biomedical science.Additional file 1: Table S1

Figures and Tables
tr2aacds generated CA was comprised of 888,080 transcripts, representing a 451 Mb transcriptome, with 189,925 transcripts showed significant homology to the Uniprot/Swissprot database, and with 109,222 (57.5%) containing an open reading frame.

Figure 1 .
Figure 1.Flow chart of transcriptome assembly pipeline.*SEECER probabilistic error correction conducted on these datasets.