Gekko japonicus genome reveals evolution of adhesive toe pads and tail regeneration

Reptiles are the most morphologically and physiologically diverse tetrapods, and have undergone 300 million years of adaptive evolution. Within the reptilian tetrapods, geckos possess several interesting features, including the ability to regenerate autotomized tails and to climb on smooth surfaces. Here we sequence the genome of Gekko japonicus (Schlegel's Japanese Gecko) and investigate genetic elements related to its physiology. We obtain a draft G. japonicus genome sequence of 2.55 Gb and annotated 22,487 genes. Comparative genomic analysis reveals specific gene family expansions or reductions that are associated with the formation of adhesive setae, nocturnal vision and tail regeneration, as well as the diversification of olfactory sensation. The obtained genomic data provide robust genetic evidence of adaptive evolution in reptiles.


Supplementary Tables
Supplementary Table 1 Sizes and numbers of contigs and scaffolds are presented. The data corresponding to fragments shorter than 100 bp are not included. Ten fosmid clones were used as reference data, and the assembled genome sequence was aligned with them (BLASTn, E-value threshold was 1e-5) to assess the coverage rate. The alignments were contiguous and of high quality, and the average coverage rate of the fosmid clones was approxmately 99.65%. The EST of Gekko japonicus was downloaded from NCBI. The Illumina RNA-Seq data were assembled into unigenes using Trinity, and redundant data were removed using TGICL. The EST and the unigenes built by Trinity were aligned on the assembled scaffolds using BLAT with an identity cutoff of 90%. The RepBase TEs and TE proteins were both based on RepBase, but they were annotated using RepeatMasker and RepeatProteinMask, respectively. De novo data were derived from the sequence library based on RepeatModeler and then annotated using RepeatMasker. The combined TEs represent the non-redundant results from the three methods. The Gekko japonicus genome contains more TE repetitive sequences than the other species listed. P represents the evidence from de novo forecasted data. C represents the evidence from cDNA/EST forecasted data. H stands for the evidence from homologous forecasted data. "Single" indicates only one piece of data was used as evidence. "More" indicates that multiple types of data were used as evidence. Percentage of overlap represents the ratio of overlap in CDS areas between these predictions and the final gene set.

Supplementary
26 The results indicate that numerous ncRNAs are involved in regulating the expression of target genes the creation of mRNA precursors, such as miRNA and snRNA.

Supplementary
27 The majority of the genes could be functionally annotated based on combination of different databases. Only 4.92% of genes could not be annotated using current databases. The analysis revealed that the beta-keratin family in G. japonicus contained far more mumbers than those in other species, and the majority of the family members (48 beta-keratin genes) were clustered in only one scaffold (scaffold426).

Supplementary Table 20. Physical and biochemical characteristics of beta-keratins from G. japonicus.
Many of the listed proteins include the S-core boxes, are rich in Cys and Gly, and possess higher isoelectric point (PI) and lower molecular weights (Wt). The above data reveal that A. carolinensis, adiurnal reptile with tetrachromatic color vision has a greater quantity of light-sensitive opsins than G. japonicus, a nocturnal reptile.

Supplementary Table 24. Qualities of genome assemblies in seven species.
The summarized assembly quality parameters could be used to evaluate the risk of comparative analysis of olfactory receptor genes from these species.

Genome sequencing, size evaluation and quality estimation
Genomic DNA was extracted from the blood of an adult male Gekko japonicus sampled in Jiangsu, China. The DNA was fragmented and different insert sizes by were isolated electrophoresis. PCR amplification was performed following the addition of adapters, and the products were clustered to form mate-pair libraries (insert-size ≥ 2Kb). A total of 12 libraries including for sizes of 170bp, 500bp, 800bp, 2Kb, 5Kb, 10Kb, 20Kb and 40Kb were constructed. This facilitated a high level of genome assembly to create a N50 length scaffold. The genomic DNA was sequenced using an Illumina HiSeq™ 2000 platform (20 lanes, 330.90 Gb, 131.35X). Reads were discarded when they contained one of the following: 1) greater than 2% Ns or with poly(A) structure; 2) greater than 40% low quality bases (quality value < 8) for small insert-size libraries and 60% for large insert size libraries; 3) adapter pollution; 4) overlaps between read 1 and subsequent reads (at least 10 bp overlap, with < 10% mismatch); 5) duplicated reads within fastq files. The clean data comprised 233.49 Gbp, which covered the genome by 94.65-fold, and the physical coverage was approximately 1693.27-fold. Approximately 0.14% of the small-insert-size clean data were further corrected according to high frequency and low frequency K-mers defined by K-mer distributions (frequency ≤ 11 was considered low-frequency for the17-mers), and 0.47% of the reads and 1.95% of the bases were deleted. Genome size, heterozygosis and repeat content were estimated using K-mer analysis. Reads comprising 87.14 Gbp of clean data (34.52X) were split into K-mers, and K-mer depth was computed to obtain depth distribution 1 . The peak K-mer depth is about 29, which was determined to be the K-mer expectation depth. Genome size was estimated as K-mer num/Peak depth, which results 2.52 Gb with 660033 scaffolds.
SOAPdenovo (v2.04) 2 and SSPACE 3 software were used for genome to assembly. The clean data corresponding to the short-insert-size library were split into 29-mers, used to construct a de Bruijn graph, and connected to obtain contigs. Scaffolds were constructed by realigning all usable reads onto the contig sequences, calculating the amount of shared paired-end relationships between each pair of contigs, and linking the contigs into scaffolds using SOAPdenovo followed with SSPACE. The read pairs were finally retrieved, and one was mapped to the unique contig. Local assembly was performed to fill the gaps with the other read by pair end relation. The contig N50 and scaffold N50 following the final assembly were 21.07 Kb and 684.96 Kb respectively. All reads were aligned to the assembled scaffolds using SOAPaligner2 4 , and the depth of each base was calculated using soap.coverage program with an average assembly depth of approximately 80-89X. After graphing the GC depth, the GC content was found to be approximately 45%.
To assess the accuracy of the assembly, 10 fosmids were sequenced and aligned to the assembled scaffolds 5 . Plasmid DNA was extracted and sheared into fragments approximately 1~3Kb in length using Gene Machines. The DNA fragments were purified and end-repaired to blunt, followed by ligation into the PUC118-Vector and transformation into Escherichia coli by electroporation. The transformants were plated on X-gal and IPTG LB plates with ampicillin overnight. Large-scale Sanger sequencing was performed on an ABI 3730. When the sequence coverage reached to about 6 folds, the assembly process was initiated, and more Sanger reads were sequenced to get the complete map. The 10 fosmid clones were therefore used as reference data, and they were used to map the assembled genome sequence (BLASTn, E-value threshold of 1e-5) to check the coverage rate. The alignments were contiguous and of high quality, and the average coverage rate of the fosmid clones was about 99.65%.
The completeness of the euchromatic portion of the assembly was assessed using RNA-Seq data. The EST of Gekko japonicus was downloaded from NCBI. The Illumina RNA-Seq reads data were assembled into unigenes by Trinity 3 , and TGICL 6 was used to remove redundancy. The EST and the unigenes built by Trinity were aligned on the assembled scaffolds using BLAT with an identity cutoff of 90%.

Transcriptome sequencing
Transcriptome analysis included two experiments. The first was a multiple tissues assay to produce a G. japonicus gene set. Brain, spinal cord, liver, and ovary were pooled for RNA sequencing. The second was the creation of an expression profile of the tail at different time point after amputation. For sampling, 50 individuals were randomly divided into 5 groups with half males and half females in each group by the random number table method. A caudectomy was conducted by inserting a nylon slipknot at the site of the sixth tail segment and pulling gently, which mimicked the conditions of natural autotomy. The endmost tissues of the tail were sampled at a length of 0.5 cm on days 0, 1, 3, 7 and 14, and 10 individuals (half males and half females) were prepared for each time point. The samples were then submitted to RNA sequencing. RNA sequencing was conducted as follows. Oligo(dT) beads were used to enrich poly(A) mRNA after total RNA was collected from the above samples.
Fragmentation buffer was added to cut the mRNA into short fragments, which were used as templates. Random hexamer primers were used to synthesize first-strand cDNA. Second-strand cDNA was synthesized using a mixture of buffer, dNTPs, RNase H and DNA polymerase I. Short fragments were purified with QiaQuick PCR extraction kits and resolved with EB buffer for end repair and addition of poly(A). Next, the short fragments were connected using sequencing adapters. For amplification with PCR, suitable fragments were selected as templates based on agarose gel electrophoresis. Finally, the libraries were sequenced on a Illumina HiSeq™ 2000.
Dirty reads were discarded from the raw data when one of the following conditions were met: 1) contain adapter sequences, 2) ambiguous nucleotides comprised more than 5%, 3) low quality reads comprised more than 30% and 4) quality less than 10. Clean reads were mapped to the reference genomes and gene sequences using SOAP (v.2.21). Mismatches of no more than 5 bases were allowed in the alignment. The proportion of clean reads mapped back to the genome and genes, provided an overall assessment of sequencing quality.

Segment duplication
LASTZ (T: 2, C:2, H:2000, Y:3400, L:6000, K:2200) was used to identify the segment duplication blocks in Gekko japonicus. There were 39,863 duplicate blocks longer than 1Kb in the genome, covering about 53 Mb of the genome in total. The longest block is 217,810 bp. The majority of the blocks are between 1 Kb and 5 Kb in length. The identity is greater than 98%. gene) integration of de novo and homologous gene models, and RNA-seq data were applied to refine the gene set. De novo prediction was performed based on the repeat-masked genome. Two programs AUGUSTUS (v.2.5.5) 10 and GENSCAN 11 were used for the prediction. Homolog-based prediction was performed by mapping protein sequences (downloaded from NCBI) of related representative species (Anolis carolinensis, Gallus gallus, Homo sapiens, Meleagris gallopavo and Xenopus tropicalis) to the genome using TblastN (E-value cutoff 1e-5). Alignment and identification of accurate spliced alignments was accomplished using GeneWise (wise2-2-0) 12 . The EST of Gekko japonicus (download from NCBI) was aligned against the assembly genome using BLAT (identity >= 0.95, coverage >= 0.90) to generate spliced alignments, and PASA was used to filter the overlaps, link the spliced alignments and predict the possible gene model. The data were integrated by GLEAN to produce a consensus gene set. Also, transcriptomes data from multiple tissues were aligned to genome using tophat and assembled using cufflinks to improve the accuracy and completeness of the predicted gene set 13,14 . Finally, a total of 22,487 genes were obtained, with an average exons number per gene of 7.90, an average gene length of 25,676 bp, and an average CDS length of 2,426 bp. These values were similar between the closely related species. Gene functions were annotated based on the best matched hits to SwissProt and TrEMBL databases (Release15.10) using BLASTp (E-value<=1e-5). Gene motifs and domains were identified by InterProScan (v.4.7) against protein databases. All genes were aligned against KEGG databases proteins (E-value<=1e-5), and the pathwaysthat the genes might be involved in were derived based on the matching genes in the KEGG database.

Phylogenetic analysis
A phylogenetic tree was constructed using single-copy orthologous genes from  15 , and concatenated into a super sequence. PhyML was used to construct the phylogenetic tree under the GTR and invgamma model functions 16 . The same sequence set was applied to estimate the periods of species divergence using the program PAML MCMCTREE under the correlated molecular clock function in the approximate likelihood calculation method 17 . The correlated molecular clock and REV substitution model were selected to perform estimation. The MCMC process of PAML mcmctree was run to sample 100,000 times with a sample frequency of 50 after a burn-in of 5,000,000 iterations. Fossil calibrates were derived from www.fossilrecord.net. For calibration, divergence periods with paleontological evidence were used to date the nodes of the tree between two species 18  anatinus, D. rerio, M. musculus, and G. japonicus) was performed with an E-value less than 1e-7. Then, HSP segments were concatenated between the same pair of proteins with Solar, and similarities were evaluated by Bit-score. Gene families were determined by hcluster_sg (v.0.5.0) clustering 1 . The 22,487 genes clustered to 9,356 families and 2,726 genes remained unclustered.

Expansion and contraction analysis of gene families
Gene family expansion and contraction were identified using CAFÉ 20 , which employed a random birth and death model to study gene gain and loss in gene families across a user-specified phylogeny. The global parameter λ, which described both the gene birth (λ) and death (μ = -λ) rate across all branches in the tree for all gene families, was estimated using maximum likelihood. A conditional p-value was calculated for each gene family, and families with conditional p-values under the threshold (0.0001) were considered to have accelerated rates of gain or loss. We identified branches responsible for low overall p-values of significant families. A total of 2694 contracted families involving 4602 genes were identified, and 1247 expanded families with 5574 genes were also identified.

Positive selection genes
Ka/ks ratios were calculated for all single copy orthologs of Gekko japonicus, Anolis carolinensis, Alligator sinensis, Python molurus bivittatus, Chelonia mydas, and Pelodiscus sinensis. Orthologous genes were first aligned by PRANK (v.100802) 21 , an alignment tool used for studies of molecular evolution 22 . Gblocks (v.0.91b) 23,24 was used to remove ambiguously aligned blocks within PRANK alignments. The 'codeml' in the PAML package 17 with the free-ratio model was employed to estimate Ka, Ks, and Ka/Ks ratios on different branches. The difference in mean Ka/Ks ratio for single-copy genes between Gekko japonicus and each of the other species were compared with paired Wilcoxon rank sum tests 25 . After filtering out the false positive genes, we obtained a final total of 155 positive selection genes in Gekko japonicus.

Analysis of beta-keratins in setae
Beta-keratin proteins encoded in the Anolis carolinensis genome were used as a reference 26 , and aligned with the genome sequence of Gekko japonicus and Alligator sinensis by tBLASTn (1e-5). Then, the genome sequence was analyzed with Genewise to predict candidate gene models, which were syntenic to the matched proteins, and gained the homolog genes. A conserved region among the 71 beta-keratin proteins of Gekko japonicus (TCISQCPPSSVFIQPPPFCVTVPGPIMSCADEPCAVECTTPCAP SY) was identified from the Muscle alignment and used for predictions with PHD software (http://npsa-pbil.ibcp.fr/). Select amino acid sequences are characteristic of beta-keratins 27 including S-core box (SEVTIQPPPCTVVVPGPVLA), previously termed the Ge-cprp-9 core box, and the A Box (AEVLIQPPPSVVTLPGPILS), previously termed the Ge-gprp-6 core box. Both of these motifs exist in the digital pads of G. gecko 28 . S-core box containing proteins in particular have an extremely important role in setae present on digital pads of Gekko gecko 27 .
The expansion time of beta-keratins in setae was evaluated as follows. Reptilian and avian beta-keratins share a common ancestor 29 . We constructed a phylogenetic tree of beta-keratin proteins from different species and calculated the divergence of each branch. The molecular clock of protein evolution is roughly constant according to neutral theory 30,31 , therefore, a divergence value is proportional to the timescale of the corresponding genetic event. As such, we could calculate the timescales of unknown genetic events based on the dates of another known events. However, the actual evolution rate of a protein might fluctuate for various known and unknown reasons. Therefore, this method will not produce a reliable result unless appropriately calibrated time points are available. The timescale of avian scale and claw keratins divergence 32 and the timescale of feather keratins expansion in Neornithes 33 were used to calibrate An ultra-geometric test was then used to identify significantly enriched GO terms in the target gene list compared to the genomic background. To accomplish this, the following formula was used: where N represents the total number of genes with GO annotation, n represents the number of target genes in N, M represents the total number of genes that are annotated to certain GO terms, and m represents the number of target genes in M. The calculated p-value was submitted to Bonferroni correction with a threshold-corrected p-value ≤ 0.05. GO terms fulfilling this condition were considered to be significantly enriched 38 .

Gene expression
Gene coverage was defined as the percentage of a gene covered by reads. This value was equivalent to the ratio of the number of bases in a gene covered by unique mapping reads to the number of total bases in that gene. Gene expression was calculated using the RPKM method (Reads Per Kb per Million reads) 39 as follows: where C represents the number of reads that uniquely aligned to gene A, N represents the total number of reads that uniquely aligned to all genes, and L represents the base number in the CDS of gene A. Using the RPKM method eliminated the influence of different gene lengths and sequencing discrepancies on gene expression calculations. Therefore, the above calculations were used to directly compare differences in gene expression among samples.