It’s okay to be green: Draft genome of the North American bullfrog (Rana [Lithobates] catesbeiana)

Frogs play important ecological roles as sentinels, insect control and food sources. Several species are important model organisms for scientific research to study embryogenesis, development, immune function, and endocrine signaling. The globally-distributed Ranidae (true frogs) are the largest frog family, and have substantial evolutionary distance from the model laboratory Xenopus frog species. Consequently, the extensive Xenopus genomic resources are of limited utility for Ranids and related frog species. More widely applicable amphibian genomic data is urgently needed as more than two-thirds of known species are currently threatened or are undergoing population declines. Herein, we report on the first genome sequence of a Ranid species, an adult male North American bullfrog (Rana [Lithobates] catesbeiana). We assembled high-depth Illumina reads (66-fold coverage), into a 5.8 Gbp (NG50 = 57.7 kbp) draft genome using ABySS v1.9.0. The assembly was scaffolded with LINKS and RAILS using pseudo-long-reads from targeted denovo assembler Kollector and Illumina Synthetic Long-Reads, as well as reads from long fragment (MPET) libraries. We predicted over 22,000 protein-coding genes using the MAKER2 pipeline and identified the genomic loci of 6,227 candidate long noncoding RNAs (IncRNAs) from a composite reference bullfrog transcriptome. Mitochondrial sequence analysis supported Lithobates as a subgenus of Rana. RNA-Seq experiments identified ~6,000 thyroid hormone– responsive transcripts in the back skin of premetamorphic tadpoles; the majority of which regulate DNA/RNA processing. Moreover, 1/6th of differentially-expressed transcripts were putative lncRNAs. Our draft bullfrog genome will serve as a useful resource for the amphibian research community.

4 is such that Xenopus genomic and transcriptomic resources are inadequate for studying Ranid species (Helbing 2012). The genome of a more-closely related frog, the Tibetan Plateau frog (Nanorana parkeri), has been recently released (Sun et al. 2015), though this species is also substantially separated from Ranids by approximately 89 million years (timetree.org).
Using some of the latest sequencing and bioinformatics technologies, we have sequenced, assembled, and annotated an initial draft sequence of the ~5.8 billion nucleotide North American bullfrog genome (NG50 length 57,718 bp). We predicted 52,751 transcripts from 42,387 genes, of which 22,204 had supporting biological evidence, and were deemed high confidence. We anticipate that this much-needed resource, which we make public alongside comprehensive transcriptome assembly data, will directly and immediately impact genetic, epigenetic, and transcriptomic bullfrog studies. On a wider scale, it will enable developmental biology research ranging from amphibians to mammals, provide direly needed insights to curb rapidly declining Ranid populations, and further our understanding of frog evolution.

Results
The draft assembly of the R. catesbeiana genome consists of 5.8 Gbp of resolved sequence (Table 1). The majority of raw reads from the PET libraries were successfully merged (63-75%), yielding longer pseudo-reads (mean +/-SD, 446 +/-107 bp). The success of the pre-assembly read merging allowed us to use a value of the assembly k parameter greater than our shortest PET read length, increasing our ability to resolve short repetitive sequences. Genome scaffolding with orthogonal data, which included the reference transcriptome and scaffolds assembled at a lower k value, greatly improved the contiguity of the resulting assembly (Table   1). We assessed the improvement to the assembly after each round of scaffolding using the NG50 length metric and the number of complete and partial core eukaryotic genes (CEGs) using CEGMA, which reports a proxy metric for assembly completeness in the genic space (Parra et al. 2009). Using the Synthetic Long-Reads (SLR) and the Kollector (Kucuk, et al., in press) targeted gene assembly (TGA) tool, RAILS (Warren 2016) merged over 56 thousand scaffolds; this permitted the recovery of an additional four partial CEGs, and raised the contiguity of the assembly to approximately 30 kbp (Supplemental Table S1). The most dramatic improvements to assembly contiguity and resolved CEGs were obtained using LINKS (Warren et al. 2015a) and the MPET reads (NG50 increase of ~16 kbp and 10 additional complete CEGs; Supplemental Table S1), followed by the combined Kollector TGA and the lower-k whole genome assembly (~8 kbp improvement to NG50 and 9 additional complete CEGs; Supplemental Table S1).  S1). Of this high confidence set, 15,122 predicted proteins encoded by 12,671 genes could be assigned a functional annotation based on significant similarity to a SwissProt entry.
Furthermore, 680 proteins from 590 genes were identified as particularly robust predictions by GeneValidator (Dragan et al. 2016) (score ≥ 90). This 'golden' set includes several members of the Homeobox (HOX), Forkhead box (FOX), and Sry-related HMG box (SOX) gene families, which are transcription factors involved in developmental regulation (Kamachi and Kondoh 2013;Mallo and Alonso 2013;Schmidt et al. 2013). Immune-related genes, including interleukins 8 and 10, interferon gamma, and Toll-like receptors 3 and 4 were also confidently annotated.

lncRNA
The discovery and analysis of lncRNAs represents a new frontier in molecular genetics, and is of major relevance to the biology behind the functional and largely unexplored component of the transcriptome. The low degree of lncRNA primary sequence conservation between organisms, and the lack of selective pressure to maintain ORF integrity or codon usage complicates traditional similarity based discovery methods (Quinn and Chang 2016). The subtractive approach to lncRNA detection that we employed identified 6,227 candidate lncRNA sequences.

Differential expression
Characterization of the regulatory factors that mediate thyroid hormone (TH) dependent initiation of tissue specific gene expression programs during metamorphosis have been extensively studied in X. laevis. However, this species experiences markedly different environmental conditions in its natural habitat than many Ranids do, and these experiments employed supraphysiological levels of TH (Buckbinder et al. 1992;Wang et al. 1993). Our present analysis of the TH-induced metamorphic gene expression program in the back skin detected nearly 6,000 genes significantly (p < 0.05) differentially expressed upon T3 (the TH 3,3',5-triiodo-L-thyronine) treatment ( Fig. 1), including those found previously through targeted quantitative polymerase chain-reaction (qPCR) experiments (Supplemental Table S3). The most prominent "biological process" gene ontologies associated with the Swiss-Prot derived functional annotations are related to RNA/DNA processing, signal transduction (including hormone signaling), and functions related to cell growth and division (Supplemental Fig. S2). A selection of new transcripts related to RNA/DNA processing were evaluated using qPCR and found to show similar relative abundance as observed with the RNA-Seq data (Supplemental  Table S3).
The effect of T3 treatment was not limited to the predicted protein coding genes, as expression of 1,085 candidate lncRNAs was also significantly affected. A selection of lncRNA transcripts was evaluated using qPCR (Supplemental Fig. S4).

Non-iridovirus sequence in bullfrog
Significant similarity to a frog virus 3 (FV3) protein was noted for one of the gene annotation.
FV3 is the type species of the Ranavirus genus, so we explored the possibility that a FV3-like viral genome had wholly or partially integrated into the bullfrog specimen we sequenced. The whole-genome and transcript alignments showed limited similarity between a single bullfrog scaffold and two FV3 sequences that encode hypothetical proteins, FV3gorf5R and FV3gorf98R (not shown). The region of similarity was identified as part of a conserved US22 domain (not shown).

Phylogenetic analysis of amphibian mitochondrial gene and genomes
Frog taxonomy is subject of debate (Pauly et al. 2009). To address the controversy, we performed a number of phylogenetic experiments comparing selected amphibian mitochondrion (Mt) genomes and Mt genes at the nucleotide level (Fig. 2). As expected, we observe clear separation of salamanders and toads (genus Bufo) from other species as outgroups ( Fig. 2A; Supplemental Figs. S5-S8). We color-coded the Lithobates and Rana in yellow and blue, respectively, as re-classified by Frost et al. (2006),

Comparative genomics
Estimates of the genomic sequence identity between of N. parkeri, X. tropicalis and R.
catesbeiana were performed. We did this using Bloom filters (Bloom 1970), probabilistic data structures with bit sets for each genome's k-mers (collection of all subsequences of length k). In a previous study, this method was shown to provide concordant estimates of the genome sequence divergence of known model organisms (human and apes), and was applied to conifer genomes (Warren et al. 2015b).
This method is designed to compare the k-mer spectra of any two genomes by computing the kmer set bit intersections of their respective Bloom filters. It is assumed that differences between the genomes are independently distributed. We point out that this method does not factor size differences in the genomes, nor structural rearrangements; instead it reports on the commonality over short sequence stretches. The reliability of the method also comes into question for very divergent genomes, as common k-mers are rarer, and precise values of sequence identity are not expected. As such, divergence figures are likely an underestimate of their true separation.

Discussion
Amphibians are the only group where most of its members exhibit a life cycle that includes distinct independent aquatic larval and terrestrial juvenile/adult phases. The transition between the larval and juvenile phases requires substantial or complete remodeling of the organism (metamorphosis) in anticipation of a terrestrial lifestyle. Thus, this places amphibians in a unique position for the assessment of toxicological effects in both aquatic and terrestrial environments. Our sequence divergence analysis with the recently released Xenopus genome of a diploid species, X. tropicalis (version 9.0) highlights the reason why Pipids are generally unsuitable genome references for Ranid species due to their considerable evolutionary divergence. Our present work is consistent with earlier estimates of divergence dating over ~200 MYA. The initial assembly of another Xenopus species' genome -that of the allotetraploid X. laevis inbred 'J' strain -has just been published (Session et al. 2016). The haploid genomes of these species are substantially smaller (1.7 and 3.1 Gbp, respectively) than the typical Ranid genome (Gregory 2017). Despite the difference in relative sizes, the number of high confidence predicted protein-encoding genes in R. catesbeiana (~22,000) is comparable (24,022 in X. laevis; ~20,000 in X. tropicalis) (Hellsten et al. 2010;Session et al. 2016). However, the sequence divergence at the nucleic acid level confirms the empirical challenges of using Xenopus genomes as scaffolds for RNA-Seq experiments in Ranids. Even the recently published N. parkeri genome (Sun et al. 2015), is substantially separated from R. catesbeiana by nearly 90 million years of evolutionary time (timetree.org).
The taxonomic classification of Ranid species is contentious and highly debated amongst scholars (Pauly et al., 2009). This stems largely from the suggested reclassification of the genus Rana in favor of Lithobates a few years ago (Frost et al. 2006 The molecular mechanisms of amphibian metamorphosis have been predominantly studied using X. laevis and X. tropicalis, likely in no small part because of their amenability to captive breeding, which ensures a ready supply of research specimens (Parker et al. 1947). However, Xenopus larvae are typically much smaller than those of R. catesbeiana, with the consequence that each individual animal yields a smaller quantity of tissue for analysis. Indeed, R.
catesbeiana tadpoles are large enough that techniques such as the cultured tail fin (C-fin) assay are possible, where multiple tissue biopsies are collected from an individual animal and cultured ex vivo in a variety of hormone or chemical conditions (Hinther et al. 2010;Hinther et al. 2011;Hinther et al. 2012;Hammond et al. 2013). With this assay design each animal is effectively The bullfrog genome is larger than those of N. parkeri, X. laevis, and X. tropicalis, in concordance with earlier predictions for many Ranid genomes (Mazin 1980;Buisine et al. 2015;Sun et al. 2015;Session et al. 2016). Unlike X. laevis, this does not appear to be due to an allotetraploidization event in the Ranid progenitor species (Wang et al. 2000;Zhu and Wang 2006  and proteomics experiments. We anticipate that this resource will be valuable for conservation efforts such as identifying host/pathogen interactions and to identify environmental impacts of climate change and pollution on the development and reproduction of Ranid species worldwide.  Combined, this approach accounted for 66-fold sequence coverage of the approximately 6 Gbp bullfrog genome (Table 2).

Read merging
PET read pairs were merged sequentially using the ABySS-mergepairs tool (Birol et al. 2013) and Konnector (version 1.9.0) . Bloom filters were constructed from all reads using the ABySS-Bloom utility (Warren et al. 2015b), and every tenth value of k between 75 and 245 bp, inclusive. Reads from potentially mixed clusters on the sequencing flow cells (determined by the Illumina chastity flag) were discarded, and the remaining reads were trimmed to the first base above a quality threshold (Q=3 on the phred scale) prior to merging.

Assembly process
ABySS (version 1.9.0) was used to reconstruct the R. catesbeiana genome (Simpson et al. 2009). For the initial sequence assembly, three sets of reads were used: (i) merged reads described above from paired-end Illumina HiSeq 150 bp, 250 bp, and MiSeq 500 bp libraries, (ii) unmerged reads from these same libraries, and (iii) synthetic long-reads. The unmerged HiSeq and MiSeq PET reads were also used for paired linking information to generate contigs. Finally, the MPET reads were used to bridge over regions of repetitive sequence to form scaffolds (see Table 2 for summary statistics of the sequencing data).
A certain fraction of the unresolved bases within these scaffolds were recovered using Sealer version 1.9.0 (Paulino et al. 2015). Sealer uses a Bloom filter representation of a de Bruijn graph constructed using k-mers derived from the genomic reads to find a path between the sequences flanking scaffold gaps, and fill in the consensus sequence. In comparison to the fixed k-mer length of the whole genome assembly method, it uses a range of k-mer lengths to navigate repeat and low coverage areas within the graph. The Bloom filters that were used during the read merging phase were reused, and default values were used for all parameters except for "--flank-length=260" and "--max-paths=10".
In RAILS, long sequences are aligned against a draft assembly (BWA-MEM V0.7.13-r1126 (Li 2013) -a -t16), and the alignments are parsed and inspected, tracking the position and orientation of each in assembly draft sequences, satisfying minimum alignment requirements (at minimum 250 anchoring bases with 99% sequence identity or more used in this study).
A final round of automated gap closing in the assembled draft was performed by Sealer.
Completeness of the assembly was evaluated by comparison to a set of ultra-conserved core eukaryotic genes (Parra et al. 2009).

Protein coding gene prediction
The MAKER2 genome annotation pipeline (version 2.31.8) was used to predict genes in the draft R. catesbeiana genome (Holt and Yandell 2011). This framework included RepeatMasker (Smit et al. 2013) to mask repetitive sequence elements based on the core RepBase repeat library (Jurka et al. 2005). Augustus (Stanke et al. 2006), SNAP (Korf 2004) and GeneMark We refined MAKER2's predicted gene list further by identifying a high confidence set, better suited for downstream biological analyses. Three criteria were considered: 1) the predicted transcripts must have at least one splice site, and all putative splice sites must be confirmed by an alignment to external transcript evidence; 2) the coding DNA sequence (CDS) of each transcript must have a BLASTn alignment to a BART contig with at least 95% identity along 99% of its length; or 3) the protein sequence encoded by the CDS must have a BLASTp (Camacho et al. 2009) alignment to a human or amphibian Swiss-Prot protein sequence with at least 50% identity along 90% of its length (Supplemental Fig. S1).

Functional annotation
The high confidence set of transcripts was annotated according to the best BLASTp alignment of each putative encoded protein to the Swiss-Prot database (UniProt 2015), when available.
There were two levels of confidence for the annotations, 1) the most robust were identified using GeneValidator, which compares protein coding gene predictions with similar database proteins (Dragan et al. 2016), where those having a score of 90 or greater were definitively identified as the Swiss-Prot sequence they aligned to; and 2) all other annotated transcripts were considered to encode "hypothetical" proteins similar to their best Swiss-Prot hit, provided that they aligned with at least 25% identity along 50% of their length.

Resource for the Transcriptome; BART)
Transcriptome assemblies were generated from 32 R. catesbeiana tadpole samples (representing 5 tissues under several different chemical and temperature exposure conditions) using Trans-ABySS (Robertson et al. 2010) (see Supplemental Table S5). The transcripts from each independent assembly were aligned using the BLAST-like Alignment Tool (Kent 2002)

lncRNA prediction
To complement the protein coding gene predictions, a computational pipeline was developed to identify putative lncRNAs in the R. catesbeiana composite reference transcriptome BART. As there is a paucity of conserved sequence features that may positively identify lncRNA transcripts, we instead took a subtractive approach, and omitted transcripts that were predicted to have coding potential or had sequence similarity to known protein encoding transcripts, as has been advocated in previous studies (Tan et al. 2013;Etebari et al. 2015). See Supplemental Methods for additional details.
We then used CD-HIT-EST (Fu et al. 2012) (v4.6.6, -c 0.99) to identify and remove contigs with significantly redundant sequence content. The remaining transcripts were then interrogated for the presence of a poly(A) tail and one of 16 polyadenylation signal hexamer motifs (see Supplemental Table S6). The contigs were aligned to the genome assembly using GMAP (v2016-05-01, -f samse, -t 20) (Wu and Watanabe 2005), and instances where there was a 3' sequence mismatch due to a run of As, or a 5' mismatch due to a run of Ts (in cases where the strand specific sequencing failed, and a RNA molecule complementary to the actual transcript was sequenced) prompted a search for the presence of a hexamer motif within 50 bp upstream (relative to the direction of coding) of the putative transcript cleavage site. Contigs containing a poly(A) tail and a hexamer motif were selected for further analysis. We are aware that not all lncRNA are polyadenylated. The poly(A) tail filter was put in place to reduce the proportion of spurious transcripts, retained introns and assembly artifacts.
Candidate lncRNA transcripts were aligned to the draft genome with GMAP (version 2015-12-31, -f 2, -n 2, --suboptimal-score=0, --min-trimmed-coverage=0.9, --min-identity=0.9) (Wu and Watanabe 2005), and those that had at least 90% of their sequence identified across no more than two separate genomic scaffolds with 90% sequence identity were retained. Alignments where the exon arrangement was not collinear with the original contig sequence were omitted.
Further evidence of conservation of lncRNA candidates among amphibian species was obtained using a comprehensive amphibian transcriptome shotgun assembly database, as described in the Supplemental Methods and Supplemental Table S7.

Differential gene expression analysis
As an example of the utility of the draft genome assembly and high confidence gene predictions, RNA-Seq reads from six premetamorphic R. catesbeiana tadpoles exposed to 10 pmol/g body weight T3 or dilute sodium hydroxide vehicle control for 48 h were used to characterize the T3induced gene expression program in the back skin (Supplemental Methods; Supplemental

Gene ontology (GO) and pathway analysis
Due to the particularly extensive biological information available for human proteins, a second round of BLASTp alignments were performed between the high confidence set of predicted proteins and the Swiss-Prot human proteins, using the same alignment thresholds noted above.

Iridovirus integration experiment
The FV3 ranavirus genome and associated protein coding DNA sequences were downloaded from NCBI (GenBank Accession NC_005946.1), and aligned to the assembly with GMAP version 2015-12-31 (Wu and Watanabe 2005) using default settings.

Mitochondrial genome assembly and finishing
The Mitochondrial (Mt) genome sequence was identified integrally in our whole genome assembly. Rounds of scaffolding effectively brought unincorporated Mt contigs to the edge of the scaffold, and after inspection were removed by breaking the redundant scaffolds at N's.
Multiple sequence alignments were done between our sequence, two NCBI references originating from China (GenBank accessions NC02296 and KF049927.1) and one from Japan (AB761267), using MUSCLE (Edgar 2004) from the MEGA phylogenetic analysis package (Kumar et al. 2016) using default values. These analyses indicated that the Mt sequence reported herein is most similar to the Japanese sequence, but with some discrepancies in two repeat regions, namely the absence of a 161 bp sequence at coordinate 15,270, and a 12-bp insertion at coordinate 9,214 relative to AB761267. We resolved these misassemblies using the correct Japanese reference sequence for these regions as candidates for a targeted de novo assembly of Illumina paired-end 250 bp reads with TASR (Warren and Holt 2011); v1.7 with -w

Phylogenetic analyses
Complete mitochondrial genome sequences of selected salamanders and frogs (Supplemental Table S8) were compared using the MEGA phylogenetic package (Kumar et al. 2016). In another set of phylogenies, we also compared the mitochondrial genes cyb and 12S and 16S rRNA rnr1 and rnr2 of selected amphibian species (Supplemental Table S9). For these analyses, we first generated multiple sequence alignments of the genome or gene sequences described above using either MUSCLE (Edgar 2004) or clustalw (Chenna et al. 2003) (v1.83 with gap opening and extension penalty of 15 and 6.66 for both the pairwise and multiple alignment stages, DNA weight matrix IUB with transition weight of 0.5), and used the resulting pairwise alignments as input for MEGA7 (Kumar et al. 2016). The evolutionary history was inferred by using the Maximum Likelihood method based on the Tamura-Nei model (Tamura and Nei 1993), where initial trees for the heuristic search are obtained by applying Neighbor-Join and BioNJ algorithms to a matrix of pairwise distances estimated using the Maximum Composite Likelihood approach, and then selecting the topology with superior log likelihood value.

Data Access
The whole genome sequence data and final assembly of the North American bullfrog genome with annotated MAKER2 gene predictions is available at NCBI-Genbank under accession (LIAG00000000, BioProject PRJNA285814).