Comparative analysis of corrected tiger genome provides clues to their neuronal evolution

The availability of completed and draft genome assemblies of tiger, leopard, and other felids provides an opportunity to gain comparative insights on their unique evolutionary adaptations. However, genome-wide comparative analyses are very sensitive to errors in genome sequences and thus require accurate genomic assemblies for reliable evolutionary insights. In this study, while analyzing the tiger genome, we found almost one million erroneous substitutions in the coding and non-coding region of the genome affecting 4,472 genes, hence, biasing the current understanding of tiger evolution. Moreover, these errors produced several misleading observations in previous studies. Thus, to gain insights into the tiger evolution, we corrected the erroneous bases in the genome assembly and gene set of tiger, which was also validated by resequencing of a Bengal tiger genome and transcriptome. A comprehensive evolutionary analysis was performed using 10,920 orthologs from nine mammalian species including the corrected gene sets of tiger and leopard, and using five different methods at three hierarchical levels i.e. felids, Panthera, and tiger. The unique genetic changes in tiger revealed that the genes showing the signatures of adaptation in tiger were enriched in development and neuronal functioning. Specifically, the genes belonging to Notch signalling pathway, which is among the most conserved pathways involved in embryonic and neuronal development, were found to be significantly diverged in tiger in comparison to the other mammals. Our findings suggest the role of adaptive evolution in neuronal functions and development processes, which correlates well with the presence of exceptional traits such as sensory perception, strong neuro-muscular coordination, and hypercarnivorous behavior in tiger.


Sample collection, DNA isolation, and sequencing of the Bengal tiger genome
immediately brought to the laboratory at 4 °C and genomic DNA was extracted using DNeasy Blood  in a room temperature centrifuge and the supernatant was discarded. The pellet was gently resuspended in 1 ml of RBC Lysis Buffer and transferred to a 1.5 ml microcentrifuge tube and incubated at room temperature for 5 minutes. The cells were pelleted for 2 minutes by centrifuging 87 at room temperature at 3000 rpm. The supernatant was discarded, and the pellet was resuspended 88 in 1 ml of sterile DPBS. The cells were again pelleted at room temperature at 3,000 rpm, and the 89 supernatant was discarded. 1200 µl of TRIzol solution was added to each tube. 0.2 ml of chloroform 90 was added, and the tube was vortexed for 15 seconds. The sample was then centrifuged at 13,000 91 rpm for 10 minutes at 4°C. The upper phase was removed and transferred to a clean microcentrifuge 92 tube. To the remaining upper phase, an equal volume of cold isopropanol was added, and inverted 93 to mix. The sample was placed in a -20°C freezer to precipitate. Sample was then centrifuged at 94 13,000 rpm for 10 minutes at 4°C. The supernatant was carefully discarded, and the pellet was 95 rinsed with 0.5 ml of ice-cold 75% ethanol. The sample was centrifuged at 13,000 rpm for 10 96 minutes at 4°C. The supernatant was discarded, and the pellet was allowed to dry for 5 to 10 97 minutes to remove any remaining ethanol. The RNA pellet was dissolved by adding 20 µl of RNAse-98 free water. The transcriptomic libraries were prepared from the total RNA using the SMARTer 99 universal low input RNA kit and TruSeq RNA sample prep kit v2 using the manufacturer's 100 instructions, and 100 bp paired end sequencing was performed on the Illumina HiSeq platform. genome assembly using bwa mem (v0.7.12) 17 using default parameters. The alignment file was sorted and split scaffold-wise using Samtools (v1.4) 18 .

113
The per-nucleotide metrics for each scaffold was calculated using bam-readcount tool 114 (github/genome/bam-readcount) using minimum mapping quality 25, minimum base quality 25, and 115 maximum depth 400. The base positions with less than 10x coverage or more than 200x coverage, or 116 percent indel> 10% were filtered out to remove the low coverage, potentially repetitive, and indel 117 regions, respectively, and the remaining bases were analyzed further. At a given position, if the 118 representation of the reference base was less than one-fifth of the most frequent base, then the 119 reference base was replaced by the most frequent base at that position. The adoption of this 120 stringent criteria ensured that only those positions were corrected where a sufficient coverage of 121 the most frequent base relative to the reference base was available to justify the replacement of the 122 reference base. The above criteria were optimized after several iterations and visualization of 123 randomly selected regions with the mapped reads in the IGV software 19 . The corrected genome 124 assembly was used to construct the corrected gene set using the gene structure information 125 available at Ensembl.

Protein and nucleotide alignment
The one-to-one orthologs were filtered for the presence of premature stop codons (non-sense 136 mutations). The gene phylogeny of each ortholog was inferred from the species phylogeny and was 137 subjected to protein alignment using SATé-II 21 , which implemented PRANK for alignment, Muscle 138 for merging the alignment, and RAxML for tree estimation. The protein-based nucleotide alignment 139 was carried out using 'tranalign' tool in EMBOSS package 22 . The variation in ω ratio between lineages on individual genes was calculated using the branch model 143 in CodeML from the PAML software package (v4.9a) 23 . The codons with any ambiguity site were 144 removed from the analyses. The genes that qualified likelihood ratio test using a conservative 5% 145 false-discovery-rate criterion against the null model (One ratio) were considered for further analysis.

146
Also, the genes with dN/dS values >3 were not used for further analysis 24,25 . The genes having a

163
Higher nucleotide divergence

164
The maximum likelihood phylogenetic tree for each gene using its CDS alignments was constructed 165 using PhyML package v3.1 28 . The root-to-tip branch length distances were calculated for each 166 species using the 'adephylo' package in R 29,30 . The genes with a significantly higher root-to-tip 167 branch length for lineage leading to tiger compared to all other lineages were considered to show 168 higher nucleotide divergence in tiger.  network-based pathway enrichment analysis was carried out based on the methodology 177 implemented in EnrichNet 33 to identify the network interconnectivity score (XD-score) and classical 178 overlap-based enrichment score (Fisher's exact test adj. using Benjamini-Hochberg) using KEGG as the reference database. The significance threshold was calculated by performing a linear regression of network interconnectivity score (XD-score) and enrichment score (Fisher's q-value). The pathways above the significance threshold were considered as enriched.

184
Comparative genomic analysis was performed to gain insights into the evolution of tiger with several 185 other mammalian species. Tiger is a prominent member of the Panthera genus, which is a fast 186 evolving group that has undergone recent radiation with rapid functional diversification [34][35][36]  Therefore, to understand the adaptive evolution of tiger lineage, we first corrected the available reference assembly of the tiger genome, which was further validated by sequencing the genome and 205 transcriptome of a Bengal tiger from India. A total of 175,680,850 paired-end reads (67 GB) and 206 32,252,904 reads (6 GB) were generated for the genome and transcriptome, respectively 207 (Supplementary Table S1-S2). Among the five extant species in the Panthera genus, the genome 208 assemblies are publicly available for only two species, tiger and leopard 1,39 . Thus, in addition to 209 tiger, we also generated the corrected genome assembly of leopard using the strategy shown in

271
indicates that in terms of the broader biological processes, the results from the evolutionary analysis 272 using the corrected genome assemblies corroborate with the previous study on felids 5 .

273
We observed several felid-specific amino acid substitutions in the AgRP gene expressed in AGRP 274 neurons, which is involved in regulating the feeding behavior in animals [43][44][45] . The injection of AgRP 275 peptides into the brain in rats was found to induce voracious eating behavior even in well-fed mice. have significant functional impact predicted using SIFT, and thus, could be associated with the voracious feeding behavior shown by felids 48,49 .   Table S8).

307
Insights into the evolution of tiger using genes with multiple signs of adaptation

308
The genes with multiple signs of adaptation (MSA) were identified as the genes that showed three or 309 more signs of adaptive evolution out of the five methods used for the adaptive evolution analysis (A.    Table S9).

344
Taken together, it points towards the differential evolution of neuronal functioning and 345 developmental processes genes in tiger.

346
The highly evolved Notch signalling pathway in tigers

347
The pathway enrichment analysis performed using the fisher's exact test and network enrichment 348 method revealed the Notch signalling pathway to be the most significantly enriched pathway 349 (Supplementary Table S10). The regression of XD-score, which is a measure of network after applying these tests, only the Notch signalling pathway was above the significance threshold In the Notch signalling pathway, 11 genes showed adaptive evolution in tiger among which, CTBP1 355 gene showed all the five signs of adaptation. The 11 genes include the notch receptor (NOTCH3), 356 ligand (DLL3), intracellular and extracellular regulators (DVL3, NUMB, LFNG, ADAM17), transcription 357 factor (RBPJL), and its regulators (CREBBP, NCOR2, CTBP1). From the protein-protein interaction 358 data obtained from STRING database 51 , it was apparent that these 11 genes can interact with all the 359 genes and regulators of the Notch pathway (Figure 4b and 4c). Taken together, it is apparent that

399
After correction, most of the bases corrected in the coding genome of tiger were identical to the indicates that the divergence time of tiger calculated using the genetic differences in the previous studies could suffer from an over-estimation because of these erroneous substitutions 8