Background & Summary

Dispersal is a key process in ecology and evolutionary biology, as it contributes to shaping the spatial patterns of biodiversity and their variation over space and time1,2,3. Recently, growing evidence has been shown the importance of the role of animal personality in affecting dispersal processes4,5,6. In particular, some personality traits have been linked to a higher – or lower – individual attitude to disperse and/or to survive in recently colonized environments. Therefore, behavioral polymorphisms in dispersal-related traits can affect the spatial patterns of intraspecific genetic variation and can play an active role in driving eco-evolutionary pathways. Recent advances in the study of the heritability of animal personality indicate the substantial contribution of additive genetic variance to behavioral trait variation7. However, we still miss a thorough understanding of the genetic underpinnings of behavioral polymorphisms related to dispersal.

Amphibians provide an intriguing opportunity to study dispersal ecology. By moving from aquatic to terrestrial habitats, amphibians face with substantial niche shifts at metamorphosis, which is linked to ontogenetic changes in morphology, physiology, and behavior8. Such a dramatic change in habitat features also requires substantial changes in dispersal-related traits. Noteworthy, conditions experienced during the aquatic larval stage (e.g., water temperature, predation rates, and conspecific density) can shape post-metamorphic phenotypes and thus influence dispersal process after metamorphosis9. This results in complex carryover effects on dispersal-related traits that add complexity to identifying patterns and mechanisms of dispersal compared to taxa with simple life cycles10.

In this study, we aim to contribute to the understanding of the genetic basis of behavioral polymorphisms related to dispersal in amphibians, by assembling the de novo transcriptome of the larval stage of the fire salamander Salamandra salamandra, a species showing marked polymorphism in dispersal-related personality traits. The fire salamander is a stream-breeding amphibian widespread in the western Palearctic region, characterized by rather low dispersal attitudes11. A recent study on the inter-individual behavioral variation in the Italian population of the fire salamander, showed marked polymorphisms in dispersal-related behavioral profiles of larvae and juveniles12. In particular, two distinct profiles have been identified within populations: a less active and less exploratory behavioral profile, and a more active and exploratory behavioral profile. Interestingly, this polymorphism has been associated with a marked differentiation between two co-occurring mitochondrial DNA lineages, not mirrored at the level of the nuclear genome12.

Here, we focused transcriptome analyses on tissues extracted from the cephalic region, as the brain is a target tissue for investigating the genetic background of behavior13,14. In fact, there is evidence showing that brain gene expression patterns can reflect behavioral state in response to environmental stimuli15,16. Thus, transcriptome analyses of the brain can contribute to reveal the genetic architecture of animal personalities17,18,19. The transcriptome presented here has been validated and annotated, in order to provide a reference for further analysis. Furthermore, because of its large size, the S. salamandra genome is not still available, and thus this transcriptome will join the other transcriptome data for this species20,21,22 to provide a valuable genomic resource for further ecological and behavioral studies.


Experimental design

We collected salamander larvae from a population in Central Italy showing behavioral polymorphism12 (Picentini Mountains: 40°48′ N, 14°53′ E). Details about sampling, housing and behavioral essays are described in Chiocchio et al.12. We selected 10 larvae of fire salamander representative of two distinct behavioral profiles, i.e., slow, less active and less exploratory behavior vs fast, more active and more exploratory behavior (thereafter referred as “slow” and “fast”, respectively; see Table 1). For each individual, the cephalic region was dissected and immediately stored in RNAprotect Tissue Reagent (Qiagen) until RNA extraction. All procedures followed the relevant guidelines and regulations for welfare and were approved by the Italian Ministry of Environment (permit number: 0008275.20-04-2018), the Institute for Environmental Protection and Research (#23501, 23-03-2018) and “Regione Campania” (#0203190, 27-03-2018). Permission to temporarily house amphibians at the University facilities was granted by the local health and veterinary authority (A.S.L. Tarquinia, license 050VT427).

Table 1 Summary of the 10 libraries deposited in the ENA (European Nucleotide Archive, Study Accession Id PRJEB51202), in terms of number of raw and trimmed reads per sample.

Dataset generation

RNA extractions were performed using the RNeasy Plus Kit (Quiagen) on approximately 60 mg of tissue, according to the manufacturer’ instructions. RNA quality and concentration were assessed by means of either a spectrophotometer and a Bioanalyzer (Agilent Cary60 UV-vis and Agilent 2100, respectively - Agilent Technologies). From each individual, we were able to extract 7.2 to 22.3 ug of total RNA. RNA integrity numbers (i.e., RIN) ranged from 8.5 to 9.

Library preparation and RNA sequencing were performed by NOVOGENE (UK) COMPANY LIMITED using the Illumina NovaSeq platform. Library construction was carried out using the NEBNext® Ultra ™ RNA Library Prep Kit for Illumina®, following manufacturer instructions. Briefly, after quality control, the mRNA present in the total RNA sample was isolated with magnetic beads of oligos d(T)25 (i.e., polyA-tailed mRNA enrichment). Subsequently, mRNA was randomly fragmented and cDNA synthesis proceeded by random hexamers and the reverse transcriptase enzyme. Once the synthesis of the first chain was finished, the second chain was synthesized by means of the Nick translation method, with the addition of dNTPs, RNase H, polymerase I of E. coli. The resulting products went through purification, repair, A-tailing and adapter ligation. Fragments of the appropriate size were then enriched by PCR, the indexed P5 and P7 primers were introduced, and the final products were purified. The Illumina NovaSeq 6000 sequencing system was used to sequence the libraries, through a paired-end 150 bp (PE150) strategy. We obtained on average 52.7 million reads for each library. The sequencing data are available at the NCBI Sequence Read Archive (see Table 1).

Pre-assembly processing stage

Data from the brains of larvae were derived from ten independent samples and processed for bulk transcriptome sequencing. The workflow of the bioinformatic pipelines is shown in Fig. 1. All the described bioinformatics analyses were performed on the high-performance computing systems provided by ELIXIR-IT HPC@CINECA23,24,25.

Fig. 1
figure 1

Workflow of the bioinformatic pipeline, from raw data to annotated scripts, for the de novo transcriptome assembly of S. salamandra. Each step was progressively numbered.

We obtained 1,153,432,918 pairs of reads. The quality of the raw reads was assessed with FastQC 0.11.5 ( The quality results were aggregated across all samples into a single report with MultiQC26 v. 1.9. Raw reads were then analyzed through a quality trimming step with Trimmomatic27 v. 0.39 (options SLIDINGWINDOW: 4: 15, MINLEN: 36, HEADCROP: 13), in order to remove low quality bases and adapter sequences. Unpaired reads were discarded. After the cleaning step and removal of low-quality reads, a total amount of 1,112,352,766 clean reads were maintained for building the de novo transcriptome assembly (i.e., 96% of raw reads, Table 1).

De novo transcriptome assembly

Because there is no reference genome for S. salamandra, we performed a de novo transcriptome assembly following the workflow of the bioinformatic pipeline described in Fig. 1. In order to construct an optimized de novo transcriptome, avoiding chimeric transcripts and improving the reliability of the final assembly, we adopted the strategy to launch two de novo assembly tools both based on the building of de bruijn graphs, particularly suitable for eukaryotic organisms. The two assemblers were Trinity28, v. 2.11.0, and SPAdes29, v. 3.11.1, used in rnaSPAdes mode. rnaSPAdes was then applied with the default parameters (k-mer size equal to 73) to provide the assembly result. rnaSPAdes run generated as output a total of 1,094,271 transcripts (Table 2). On the other hand, also Trinity was launched applying the default parameters (kmer size equal to 25) to provide the assembly result. At this stage, a total of 1,207,872 transcripts were generated as output of Trinity run (Table 2).

Table 2 Statistics on rnaSPAdes and Trinity output and the result after CD-HIT evaluated with the two assembly validators.

After the assembly step, the two output results were merged using Trans-AbySS30 v. 2.0.1, with the merging function.

Assessing assembly quality

Two validation steps were applied to the assembly results: one after step 5, to evaluate the preliminary assembly, and one after step 8, to assess the quality of the final, non-redundant, assembly output. Two different tools were used for this task: TransRate31, v. 1.0.3, and DETONATE32, v. 1.11. These tools generate several metrics that serve as a guide to evaluate error sources in the assembly process and provide evidence about the quality of the assembled transcriptome. In Table 2 we reported the assessment analyses of a) the assembly output of Trinity, b) the assembly output of rnaSPAdes, c) the final assembly output, i.e., the merged assembly with removed redundancies.

The quality of the final assembly (output from step 8) was further evaluated through the assessment procedure implemented in BUSCO33 (Benchmarking Universal Single-Copy Orthologs) v. 5.4.4. It provides a quantitative measure of transcriptome quality and completeness, based on evolutionarily informed expectations of gene content from the near-universal, ultra-conserved proteins databases. We analyzed the gene content by launching BUSCO, on four databases of ortholog genes: CVG (Core Vertebrate Genes), Tetrapoda, Vertebrata and Eukariota databases. In Table 3 we reported transcriptome completeness in BUSCO. Moreover, Fig. 2 shows completed, fragmented and missing genes mapped from the four databases. It is worth noting that we found a high percentage of completed genes on Tetrapoda and Vertebrata databases, confirming the good quality of our assembly.

Table 3 The BUSCO (v. 5) validation, though the gVolante web server49 was applied to four databases: Tetrapoda, Vertebrata, CVG (Core Vertebrate Genes) and Eukariota.
Fig. 2
figure 2

BUSCO assessment results.

Another kind of quality assessment evaluation was introduced by performing mapping the trimmed raw data against the obtained final de novo transcriptome assembly. The software tool used for mapping was HISAT234 v. 2.1.0, one of the fastest and most widely used open-source gene alignment resources. The mapping results are shown in Fig. 3. As can be noticed, the results are always higher than 94%, confirming the good quality of the assembly.

Fig. 3
figure 3

For each sample we have in blue the representation of total paired-reads, in orange the total paired-reads after the adapter removal and quality trimming and in azure we have the trimmed paired-reads mapped mapped-back against the Salamandra salamandra assembled de novo transcriptome.

Generation of the full-length transcriptomes

After the validation step (“Transcriptome Quality Check” in Fig. 1), the merged assembly was the input for CD-HIT-est35 program, v. 4.8.1, a hierarchical clustering tool used to avoid redundant transcripts and fragmented assemblies common in the process of the de novo assembly, providing unique genes. CD-HIT-est was run with default parameters, corresponding to a similarity of 95%. Subsequently, a second validation step (“Unigenes Quality Checks” in Fig. 1) was launched on the CD-HIT-est output file. The results are shown in Table 2. Then, the CD-HIT-est output file was run on TransDecoder36,37 v5.7.0, the current standard tool that identifies long open read frames (ORFs) in assembled transcripts, using the default settings. TransDecoder by default performs ORFs prediction on both strands of assembled transcripts regardless of the sequenced library, i.e., without considering the specific sequencing library used to generate the transcriptomic data. It also ranks ORFs based on their completeness, by looks for any length of AA codons upstream of a start codon (M) without a stop codon, to determine if the 5′ end is incomplete. We adopted the “Longest ORF” rule and selected the longest 5′ AUG (relative to the in-frame stop codon) as the translation start site.

Transcriptome annotation

We employed different annotations for all further analysis. Contigs were aligned with DIAMOND38 on NCBI nr, Swiss-Prot and TrEMBL to retrieve corresponding best annotations. An annotation matrix was then generated by selecting the best hit for each database. We applied DIAMOND-fast setting DIAMOND blastx -t 48 -k 250 -min-score 40, and DIAMOND-sensitive, setting DIAMOND blastx -t 48 -k 250 -sensitive -min-score 40. Results from the analysis in DIAMOND are resumed in Table 4. Overview of data files and data sets produced in this study, with information on data repository and accession numbers, are summarized in Table 5. A Venn diagram was created to show the redundancy of the annotations in different databases; the diagrams were constructed for both DIAMOND blastx and DIAMOND blastp (Fig. 4) and showed 153,048 (blastx) and 95,942 (blastp) shared unigenes, i.e., annotated from the three databases. In a further step, contigs were processed with InterProScan39, to predict protein signatures. The InterPro database40 integrates predictive models or ‘signatures’ representing protein domains, families, and functional sites from multiple, diverse source databases: Gene3D, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY and TIGRFAMs. We scanned the Interpro database using InterProScan (the software package that allows sequences to be scanned against InterPro’s member database signatures) we got 56179 unigenes annotated among which 9850 were GO-annotated and 2311 KEGG-annotated.

Table 4 Summary of annotations on different databases.
Table 5 Overview of data files and data sets produced in this study, with information on data repository.
Fig. 4
figure 4

Venn diagrams for the number of contigs annotated with DIAMOND (blastx and blastp functions) against the three databases: NCBI nr, Swiss-Prot, TREMBL.

Comparison with Pleurodeles waltl open reading frames

We used the reference genome and the corresponding genome annotation of Pleurodeles waltl (the P. waltl’s gene model)41 to (1) extract the transcriptome of P. waltl, (2) predict the corresponding ORFs, (3) map the predicted ORFs of S. salamandra versus the predicted ORFs of P. waltl and vice versa. The point 3 was implemented to assess the similarity rate between our de novo transcriptome of S. Salamandra and the transcriptome of P. waltl.

As first step we downloaded the genome and annotation files of P. waltl42, subsequently we used the program GffRead43, an open-source program to manipulate GFF and GTF format files, to extract the transcriptome of P. waltl. As input files for GffRead, we used the annotation file, aPleWal.anno.v2.20220926.gff3, and the assembled genome, aPleWall.pri.20220803.fasta.gz, which is 20 GB in size. As outcome of the GffRead execution, we obtained the transcriptome of P. waltl (the file P_Waltl_transcripts.fa was uploaded on figshare44). By TransDecoder we predicted the ORFs of P. waltl and, used them to create a blast database. We ran the DIAMOND program (blastx function)45 to compare the ORFs of S. salamandra (query sequences in multi-fasta format) with the indexed blast database of ORF sequences of P. waltl. Of the 441,339 CDS-ORF sequences in the S. salamandra, only 290,095 mapped the ORFs of P. waltl, representing 65.7% of the total sequences of the S. salamandra. Similarly, the CDS-ORFs of P. waltl (1180470) were mapped against the ORFs of S. Salamandra showing similarity for 792563 of them (67.1%). The blastx output files of both blastx runs, named respectively salamandra_unigenes_vs_pleurodeles_orf_blastx.tsv and pleurodeles_vs_salamandra_unigenes_orf_blastx.tsv, were also uploaded on figshare44. The protocol above described provides the comparison of ORFs of S. salamandra and P. waltl and shows the level of genomic similarity between the two species.

Data Records

All the raw data generated in this project were deposited in the European Nucleotide Archive (ENA) database under study identification number PRJEB5120246. The de novo transcriptome assembly resource is available on both the ENA archive, HBZU01000000047, and the SRA archive on NCBI HBZU000000000.148. Datasets containing all files produced in this transcriptome assembly and annotation pipeline (Trinity and rnaSPAdes transcriptome assemblies, unigenes, and functional annotation files) were also deposited on figshare archive44 (links to pipeline outcomes are listed in Table 5).

Technical Validation

Quality of the raw reads and assembly validation

The overall data quality was assessed using FastQC for all samples before and after trimming. Among the FastQC results, the mean quality scores at each base position were higher than 35 (see “Image file 1” in Table 5). Validation of the transcriptome assembly was performed using three validator tools: BUSCO, DETONATE, and TransRate. The results from DETONATE and TransRate validation steps are shown in Table 2, which includes the scores obtained from the execution of the two analysis tools. BUSCO analysis was performed on four databases: Tetrapoda, Vertebrata, CVG, and Eukariota. The details of BUSCO are listed in Table 3, and some of them are plotted, like a histogram, in Fig. 2. A further validation assessment was performed by mapping the trimmed reads against the de novo assembled transcriptome of S. salamandra. The HISAT2 results showed an even higher percentage of 94% (Fig. 3), confirming the very good quality of the assembly. The final transcriptome (unigenes) obtained after CD-HIT-est included a total of 1,146,571 transcripts and an N50 of 1529 bp, with a value greater than 94% completeness for BUSCO evaluation in each queried database.

Quality control of annotation

The transcriptome was functionally annotated by performing DIAMOND and InterProScan. By selecting the best hit for each database, the annotation matrix generated with DIAMOND has led to 153,048 and 95,942 contigs, and a total of 7,547 transcripts were annotated in at least one database.

InterProScan is a tool that combines different protein signature recognition methods of the InterPro member databases into one resource. It provides as result the corresponding InterPro accession numbers and, among other accession IDs, the GO and KEGG annotation.