The North American bullfrog draft genome provides insight into hormonal regulation of long noncoding RNA

Frogs play important ecological roles, and several species are important model organisms for scientific research. The globally distributed Ranidae (true frogs) are the largest frog family, and have substantial evolutionary distance from the model laboratory Xenopus frog species. Unfortunately, there are currently no genomic resources for the former, important group of amphibians. More widely applicable amphibian genomic data is urgently needed as more than two-thirds of known species are currently threatened or are undergoing population declines. We report a 5.8 Gbp (NG50 = 69 kbp) genome assembly of a representative North American bullfrog (Rana [Lithobates] catesbeiana). The genome contains over 22,000 predicted protein-coding genes and 6,223 candidate long noncoding RNAs (lncRNAs). RNA-Seq experiments show thyroid hormone causes widespread transcriptional change among protein-coding and putative lncRNA genes. This initial bullfrog draft genome will serve as a key resource with broad utility including amphibian research, developmental biology, and environmental research.


Supplementary Figures
Supplementary Figure 1. Selection criteria for the high confidence gene set. Transcripts were included in the high confidence set if they satisfied one or more of the following criteria: 1) the gene contained at least one splice site, and all splice sites were confirmed by an alignment to external transcript evidence (splice sites supported); 2) the CDS had a BLASTn alignment to a BART contig with at least 95% identity along 99% of its length (CDS hit); 3) the protein sequence encoded by the CDS had a BLASTp alignment to a human or amphibian Swiss-Prot protein sequence with at least 50% identity along 90% of its length (Protein hit).

Supplementary Figure 2. Enriched GO terms associated with genes differentially expressed in R. catesbeiana back
skin following exposure to 10 nM T3 for 48 h. RNA-Seq reads were aligned to the genome with STAR, read counts per high-confidence transcript determines with HTSeq, and differential expression in the T3 treated group relative to the vehicle control determined using DESeq2 , where significance was at the 0.05 level.

Supplementary Figure 3. qPCR analysis of select transcripts encoding proteins involved
in RNA/DNA processing in the back skin. Premetamorphic tadpoles (n = 3 per treatment) were injected with 10 pmol/g body weight of T3 or dilute sodium hydroxide solvent (C) and the back skin collected after 48 h for RNA isolation and qPCR analysis. The median fold abundance of transcripts encoding U1 small nuclear ribonucleoprotein A (snrpa), ribosomal RNA processing protein 8 (rrp8), and histone-lysine-N-methyltransferase (suv91) relative to the control is shown.
Whiskers indicate the median absolute deviation, and the open circles denote the fold difference of individual animals. All transcripts were significantly different (Mann-Whitney U test, p < 0.05).

Supplementary Figure 4. qPCR analysis of select lncRNA transcripts in the back skin.
Premetamorphic tadpoles (n = 6 per treatment) were injected with 10 pmol/g body weight of T3 or dilute sodium hydroxide solvent (C) and the back skin collected after 48 h for RNA isolation and qPCR analysis. The median fold abundance of transcripts of candidate lncRNAs, ncr7 and   Table 9 for additional information including the species code. Figure 9. Workflow for detection of putative lncRNA transcripts. BART contigs with low protein coding potential were excluded, as were redundant sequences.

Supplementary
Polyadenylated transcript sequences were selected, and any residual sequences that may have encoded a peptide sequence with similarity to any database sequences were eliminated to arrive at the set of putative lncRNA sequences.

Supplementary Tables
Supplementary Table 1

Targeted gene assembly with Kollector
Kollector is an alignment-free targeted de novo assembly pipeline that uses thousands of transcript sequences concurrently to inform the localized assembly of corresponding gene loci 1 .
Kollector scans whole genome shotgun sequencing data to recruit reads that have sequence similarity to input transcripts or previously recruited reads, which are then assembled with ABySS. This greedy approach to read collection enables resolution of intronic regions for the assembly of complete genes.
To provide long-distance information for scaffolding, we used Kollector to reconstruct the gene loci of the transcripts contained in the BART reference transcriptome. The BART transcripts were randomly divided into 80 bins of approximately 10,000 transcripts each, and Kollector ran on each bin in parallel (-j 12 -s 0.9 -r 128 -k 128). To evaluate success of the targeted gene assemblies (TGA), the input transcripts were aligned to the Kollector-assembled sequences with BLASTn 2 , and those transcripts that aligned with 90% sequence identity and 90% query coverage were considered to have had their corresponding gene successfully reconstructed.
Transcripts that did not meet these criteria were re-binned and re-tried in the next iteration with parameters tuned for higher sensitivity. This is achieved by lowering the r parameter (number of nucleotide matches required for recruiting a read) and the value of k used in the assembly step.

Protein coding gene prediction
Prediction of protein coding genes was performed using the MAKER genome annotation pipeline 3 (version 2.31.8). This framework included RepeatMasker 4 to mask repetitive sequence elements based on the core RepBase repeat library 5 . Augustus 6 , SNAP 7 and GeneMark 8 were also run within the MAKER2 pipeline to produce ab initio gene predictions. BLASTx 2 , BLASTn 2 , and exonerate 9 alignments of human and amphibian Swiss-Prot protein sequences 10 (retrieved 16 February 2016) and BART were combined with the gene predictions to yield the gene models. MAKER2 was first applied to an early version of the bullfrog genome assembly, and the 1000 best gene models by eAED score were used for retraining SNAP 11 .

Gene ontology and pathway analysis
Due to the particularly extensive biological information available for human proteins, a second round of BLASTp alignments were performed between the high confidence set of predicted proteins and the Swiss-Prot human proteins, using the same alignment thresholds noted above.

Assembly versioning
The bullfrog genome project produced three main assemblies to date, predominantly differentiated by the incorporation of additional sequencing reads (version 2) and the utilization of progressively more sequence data for scaffolding (version 2 and 3

TH experiment
We

qPCR analysis of transcript abundance
Transcript abundance of select transcripts encoding proteins involved in RNA/DNA processing and lncRNAs was determined using methods and conditions published previously 15 . The primer sequences, annealing temperatures, and amplicon sizes are shown in Supplementary Table 4.

lncRNA detection
The workflow used to detect candidate lncRNAs is summarized in Supplementary Figure 9.
First, open reading frames (ORFs) were predicted using TransDecoder v3.0.0 (transdecoder.github.io) with the default parameters, and contigs with complete or partial predicted ORFs were excluded. We also performed 3-frame in silico translations of the contigs to evaluate the validity of any potential encoded peptides via comparison to the Pfam curated database of peptide motifs 16 using HMMScan v3.1b2 from the HMMER package 17 . Furthermore, we did a six-frame translation of our nucleotide sequences, and queried them against Uniref90 18 and NCBI's RefSeq databases using the BLASTx program from NCBI's BLAST+ (v2.4.0) software package 2 . We discarded all contigs that returned a hit to any sequence in these databases at e-value < 10 -5 . We constructed a comprehensive amphibian transcriptome shotgun assembly database (CATSA) by downloading and combining nucleotide sequences for 16 amphibian species (Supplementary Table 4) from the NCBI Genbank Transcriptome Shotgun Assembly Sequence (TSA) database 19 . We interrogated our putative lncRNA contig set against this CATSA database for homologs that could add confidence to our set. We also did a similarity search against lncRNA sequences present in lncRNADB 20 and LNCipedia 21 , which are databases of previously reported lncRNAs.