The molecular evolution of genes previously associated with large sizes reveals possible pathways to cetacean gigantism

Cetaceans are a group of aquatic mammals with the largest body sizes among living animals, including giant representatives such as blue and fin whales. To understand the genetic bases of gigantism in cetaceans, we performed molecular evolutionary analyses on five genes (GHSR, IGF2, IGFBP2, IGFBP7, and EGF) from the growth hormone/insulin-like growth factor axis, and four genes (ZFAT, EGF, LCORL, and PLAG1) previously described as related to the size of species evolutionarily close to cetaceans, such as pigs, cows, and sheep. Our dataset comprised 19 species of cetaceans, seven of which are classified as giants because they exceed 10 m in length. Our results revealed signs of positive selection in genes from the growth hormone/insulin-like growth factor axis and also in those related to body increase in cetacean-related species. In addition, pseudogenization of the EGF gene was detected in the lineage of toothless cetaceans, Mysticeti. Our results suggest the action of positive selection on gigantism in genes that act both in body augmentation and in mitigating its consequences, such as cancer suppression when involved in processes such as division, migration, and cell development control.


Results
Among the nine selected genes to perform the molecular evolution analyses, we found that in the EGF gene, stop codons resulted in the interruption of the reading frame only in the Mysticeti cetacean lineage. To our knowledge, this is the first time that EGF pseudogenization has been reported for mysticetes. As this work focuses on the coding regions, only the results for the other eight genes are described below.
Selection analyses. Branch model. Branch analyses were performed using species trees. First, we labeled the ancestral branch that led to the giant cetaceans, i.e., the stem mysticete lineage and the branch of the sperm whale (Physeter catodon), since this is the only giant species of odontocete group, exceeding 10 m in body size (Fig. 1). Then we compare the rates between giant and non-giant cetaceans, labeling all species with body size larger than 10 m as one group.
For the first labeled scheme, codeML free-ratio test fitted our data significantly better than the one-ratio model for all genes. However, the two-ratio model, used to estimate whether giant cetaceans have a different omega value compared to the other species, fitted better for PLAG1. In the second scheme, no statistically significant differences were found using codeML.
For RELAX analysis, the IGFBP2 gene was found to be under intensified selection (K > 1) using the first labeled scheme with the ancestral branch from mysticete lineage and sperm whale (Fig. 2), while no evidence of intensified selection was found using the second scheme, in which all giant cetaceans were labeled as a one group.
Branch-site models. To estimate selective pressures acting in specific sites on the giant cetacean lineages, we performed branch-site models using BUSTED, aBSREL, and codeML. BUSTED indicated the occurrence of positive selection in at least one site in at least one branch for the GHSR gene (p-value ≤ 0.05), aBSREL resulted in episodic positive selection in Eschrichtius robustus lineage for 0.34% of sites on the same gene, and CodeML found significant positive selection for site 211 for GHSR. CodeML also found positive selection for sites 134 and 353 for the NCAPG gene and the site 278 for IGFBP7. No significant positive selection was detected for the IGF2, LCORL, PLAG1, and ZFAT genes.
Site models. To find sites under positive selection within giant cetaceans, we used SLAC, MEME, and FUBAR (Table 1). SLAC resulted in some codons with ω values greater than 1 but no statistically significant signatures of positive selection. MEME found episodic selection/diversifying selection for sites 243 and 249 for GHSR, and 241, 466, and 885 in the NCAPG gene. FUBAR detected thirteen sites under episodic selection/diversifying selection for NCAPG (348, 373, 464, 466, 630, 885, 906, 908, 925, 940, 953, 969, and 991), and the 211 codon was identified as being under selection for GHSR gene.
We found evidence for positive selection in the power to be at the middle of alpha-helix physicochemical property using TreeSAAP, with global z-scores > 3.09 (p < 0.001) for the GHSR gene, and in the power to be at the C-terminal physicochemical property for the IGFBP7 gene (Fig. 3).

Discussion
This study presents the molecular evolution of genes related to body size in mammals, focusing on giant cetaceans. We found molecular signs of selection in the GHSR, IGFBP7, PLAG1, and NCAPG genes from the nine genes included in our dataset, with results converging with different algorithms. Furthermore, in the EGF gene, the presence of stop codons resulted in the interruption of the reading frame in all Mysticeti cetacean species, which is unique to this group.
The presence of stop codons in the epidermal growth factor (EGF) gene is a pseudogenization indicator. The stop codons start at position 948 (exon 3) in the alignment, and remain present until the last exon in different and multiple sites for all species of the Mysticeti group. Some stop codons likely share the same locus between www.nature.com/scientificreports/ particular mysticete species, but because the reading frame is very corrupted, the alignment is unreliable only in the group of toothless cetaceans, and we cannot state the site of the first mutation in stem mysticete lineage that led to the EGF pseudogenization in this group. To our knowledge, the inactivation of EGF in mysticetes is reported for the first time in this paper. The inactivation of protein-coding genes has already been associated with several traits of cetaceans 32 , such as vision 33 , loss of taste receptors 34 , hair loss 35 , and teeth in Mysticeti 36 , among others. From an evolutionary perspective and considering the occupation of the aquatic environment, the loss of function of these genes can be understood as part of the adaptation process and not only because of the relaxation of selection 37 . EGF binds to the epidermal growth factor receptor (EGFR), which then dimerizes or forms ErbB-2, ErbB-3, or ErbB-4 homologs, increasing the intracellular activity of tyrosine kinase, activating effects such as cell proliferation, apoptosis and angiogenesis, embryonic growth, and tissue regeneration 38 . In addition, EGF has been associated with the development and eruption of teeth, being found within the dental follicle, in the alveolar bone related to ameloblasts (the cells that form tooth enamel), and during the pre-functional stage of tooth eruption in rats, animals in which EGF injections in neonates significantly stimulated the eruption of the incisor teeth [39][40][41] . Teeth loss occurred in the common ancestor of all extant mysticete cetaceans, and ontogenetic evidence suggests that teeth develop rudimentarily in fetuses in this group; however, they are later aborted and reabsorbed before enamel formation 42,43 . Thus, evidence of pseudogenization of the EGF gene only in Mysticeti is likely related to the loss of teeth and the appearance of baleen. The baleen, for this group, was an evolutionary innovation that allowed whales to exploit a new foraging niche: filtration, which was previously identified as a probable trigger for gigantism 26,44 . Furthermore, the loss of EGF functionality and its role in important components of homeostasis, such as the kidneys, would be compensated for by the role of other genes in the EGF family and other pathways, an overlap of functions that has been reported in rats that did not have active EGF and yet had a normal and healthy phenotype 44 .
Cetacean body size seems to respond to intense selective pressures imposed by the aquatic environment. Factors such as thermoregulation, feeding ecology, and space availability shaped the gigantic body proportions www.nature.com/scientificreports/ of these animals [14][15][16] . Furthermore, migratory behavior, like the one performed by the blue whale that transits in polar waters, affects body size following Bergmann's rule, which states that animals living in colder climates are generally larger than those living in warmer regions 45 . The combination of these external factors can now be studied at the molecular level due to advances in genomic technologies. For example, a recent study reported signs of positive selection in body size genes in cetaceans: in small species, the genes under selection related to short size were ACAN, OBSL1, and GRB10; in the large cetaceans, the selection was identified in the CBS, EIF2AK3 and PLOD1 genes, all related to the large size 46 . Together, their results and the results from this study aid to our understanding of the evolutionary panorama of large body evolution, which is a complex feature that affects many genetic pathways.
Our results identified the GHSR with evidence for positive selection in the physicochemical property Power to be at the middle of alpha-helix, with global z-scores > 3.09 (p < 0.001) using TreeSAAP. BUSTED found evidence of positive selection for the GHSR gene (p-value ≤ 0.05) and aBSREL in the sperm whale for 0.32% of sites. Also, codons 243 and 259 were identified as being under positive selection by MEME, and 211 were identified by the codeML branch-site model (98%) and FUBAR (0.93 p.p). In this site, the sperm whale (Physeter catodon) was the only one to present glycine (G). This nonpolar amino acid that is compatible with hydrophilic and hydrophobic www.nature.com/scientificreports/ environments, while all other animals had threonine (T), which is polar and highly soluble in water. This modification is important because glycine is considered a "helix breaker" once it disrupts the regularity of the α helical backbone conformation since it lacks a β carbon, which is associated with more conformational freedom than other residues 47,48 . As mentioned before, the results reinforce that large phenotypes may evolve by different paths. It is worth noting that changes in specific species, such as the different site in the sperm whale-the only odontocete classified as a giant, may be related to characteristics of that species, involving the large body size or other characteristics affected by the gene. GHSR is an endogenous ligand that can stimulate, through its ghrelin ligand, the release of growth hormone through the pituitary gland and thus increase appetite, regulate body weight, energy metabolism, and fat accumulation 49 . In addition, it is associated with the secretion of gastric acid, control of cell proliferation, apoptosis, lactation, and cardiovascular pressure [50][51][52] . This gene has been linked to increased body size in cattle, sheep, and pigs 30,[53][54][55] . In most cases, the increase in body size in these animals results from changes in a few sites, but these changes were not found in giant cetaceans. Nevertheless, this may indicate that minor changes in this gene can result in phenotypic modifications. IGFBP7 is a 27 kD protein and a member of the IGFBP superfamily, responsible for the viability of insulinlike growth factors (IGFs)-molecules involved in promoting cell growth and division 56 . Evidence suggests that IGFBP7 acts as an oncosuppressor gene in prostate, breast, lung, and colorectal cancer due to its regulatory action related to cell proliferation, cell adhesion, cell senescence, and angiogenesis 57,58 . This repressor activity may arise with the interruption of the cell cycle in the G1 phase, induction of senescence, and an increase in the level of cell death through apoptotic cells 59 . In addition, it has already been observed that the higher the body mass index, the greater the expression of IGFBP7. This is possibly associated with the fact that obesity is an agent related to senescence, and IGFBP7 is secreted by senescent cells. This relationship that may indicate a compensation mechanism for organisms that reach high body mass 60 . In our analyses, we found evidence for positive selection in the physicochemical property power to be at the C-terminal with global z-scores > 3.09 (p < 0.001) using Tree-SAAP. Furthermore, the branch-site implemented in codeML inferred that site 278 is under positive selection. In this site, the blue whale (Balaenoptera musculus) shows a loss of codons and no expressed amino acids. In contrast, the gray whale (Eschrichtius robustus) presents the nonpolar and hydrophobic methionine (M), while the other animals had glutamate (E), a polar amino acid. In summary, it seems that IGFBP7 is related to two main characteristics of giant cetaceans: increase in body size and suppression of cancer. Some cancer suppressor genes have already been reported to be under positive selection for cetaceans 61 , and IGFBP7 is likely to be one more.
NCAPG (Non-SMC Condensin I Complex Subunit G) is a gene strongly associated with increased body size and weight gain. It has been reported to be linked to birth weight, withers height, feeding efficiency, and pubertal growth in bovine species [62][63][64] . Besides cattle, this gene has been linked to growth in horses, donkeys (Equus asinus), pigs, humans, and chickens [65][66][67][68][69][70][71] . NCAPG also presented many sites evolving under positive selection, thirteen in the FUBAR program (348, 373, 464, 466, 630, 885, 906, 908, 925, 940, 953, 969, and 991), three in www.nature.com/scientificreports/ MEME (241, 466, and 885) and two in the codeML branch-site model (134 and 353) with some of them recovered by different methods, such as the 466, and 885 identified by FUBAR and MEME. Accordingly, this gene is probably the one that may be most directly involved in cetacean gigantism from our dataset, acting directly on two important characteristics-growth and weight gain. In the same direction, PLAG1 (Pleomorphic Adenoma Gene 1) is a gene associated with growth in cattle 72 , pigs 73 , and sheep 74 , mainly in traits such as height, knuckle, biceps, and shank 75 . This gene encodes a zinc finger protein family, a nuclear protein transcription regulator 76 , playing an important role in the transcriptional regulation of growth factors such as IGF2, which is related to embryo growth and cell survival 77 . This is a candidate gene for future analysis with new parameters since it is related to growth in several animals, and mutations have been described as promoting changes in height 78 . In our analyses, PLAG1 was the only gene with evidence of positive selection by the codeML branch model.
Collectively, our results indicate four genes likely to be involved in increasing body size in giant cetaceans. Some of these genes, such as GHSR and IGFBP7, may also be responsible for mitigating the possible consequences of extreme size, as they control important aspects of the cell cycle. Hypothetically, being a giant has severe consequences, such as increased chances of developing cancer, in addition, cetaceans are long-lived animals, which is also related to this disease. Giant cetacean species (larger than 10 m) included in this study live longer than 30 years, with the humpback whale (Megaptera novaeangliae) reaching 50 years, and the blue whale (Balaenoptera musculus) and the fin whale (Balaenoptera physalus) reaching up to 90 years, while the bowhead whale (Balaena mysticetus) is the longest lived mammal known, with a lifespan of 200 years 79 . Despite these triggers, giant animals are less likely to develop cancer than small animals, a logical contradiction called Peto's Paradox that suggests the existence of a mitigation mechanism 7 . Thus, genes that act on body growth and control of the negative aspects mentioned above through cell control, cell division, and tumor suppression could be targets of natural selection, allowing these animals to become giants, live longer, and have great body mass.
It is worth remembering that body size is a complex characteristic that involves many factors and molecular pathways. Throughout cetacean evolutionary history, different lineages had different selective pressures on different genes that could result in the same large body phenotype. This could be the case for the sperm whale, an odontocete as large as a mysticete. Interestingly, genes previously reported as associated with large sizes in artiodactyls, such as LCORL and ZFAT, apparently do not show the same effect in cetaceans, highlighting how large sizes may arise from different pathways from different genes in different lineages.

Conclusion
In summary, here we investigated the molecular evolution of genes possibly related to increased body size in giant cetaceans. We found evidence for positive selection at the coding level for sites in the GHSR, IGFBP7, PLAG1, and NCAPG genes. Besides that, we found evidence of pseudogenization of the EGF gene in the Mysticeti lineage, an event likely related to teeth loss in these cetaceans, which could be connected with the emergence of the baleen plate filter system. In conclusion, our study provides new perspectives on the evolution of cetacean gigantism, reinforcing the selective pressures of the aquatic environment, the various possibilities of action of natural selection on different genes that have similar functions depending on specific characteristics for each species, and indicating that pseudogenization is also an adaptive process for this group.

Material and methods
Sample data. We focused on the genes GHSR, IGF2, IGFBP2, IGFBP7, and EGF from the growth hormone/insulin-like growth factor axis and the genes NCAPG, LCORL, PLAG1, and ZFAT that are associated with increased body size in artiodactyls. The cetacean group was composed of 12 odontocetes (Lagenorhynchus obliquidens, Neophocaena asiaeorientalis, Delphinapterus leucas, Tursiops truncatus, Orcinus orca, Monodon monoceros, Globicephala melas, Lipotes vexillifer, Physeter catodon, Phocoena sinus, Sotalia fluviatilis, and Sotalia guianensis) and seven mysticetes (Balaenoptera acutorostrata, Balaena mysticetus, Eschrichtius robustus, Eubalaena japonica, Megaptera novaeangliae, Balaenoptera physalus, and Balaenoptera musculus), totaling 19 species. The coding sequences for the species Balaena mysticetus came from the public platform The Bowhead Whale Genome Resource. In addition, Sotalia fluviatilis and Sotalia guianensis were sequenced by our laboratory (data not published). All other coding sequences were retrieved from GenBank and the accession numbers can be found in the Table 1. The sequences were retrieved according to their availability in the databases and quality. Thus, the dataset is not the same for all genes, but at least one giant cetacean is present for all of them. The sequences were aligned using the MUSCLE algorithm 80 and we used Geneious software v. 7.1.3 81 to remove fragmented sequences that were larger or smaller than expected.

Molecular evolutionary analyses.
To estimate the role of natural selection in our focus genes we estimated the value of ω (dN/dS), which is the ratio of the rate of non-synonymous substitutions (dN) to the rate of synonymous substitutions (dS), where ω < 1 indicates purifying selection, ω = 1 suggests neutral evolution, and ω > 1 indicates positive selection 83,84 . Different approaches were applied: the branch model that identifies how ω varies through the branches of the phylogeny, site-models that detect variations of ω in distinct sites, and the branch-site model that integrates both approaches [83][84][85] . The branch model and branch-site models were performed for a dataset with the species of interest and other mammals labeling the ancestral branches that resulted in these giant animals. In contrast, the site-models were performed only with the cetacean species classified as  87 .
Branch models. To check whether the value of ω for giant cetaceans was different compared to other animals of the phylogenetic tree we used a branch model available at codeML within the PAML package 85 that allows the variation of ω in the branches of the phylogeny. First, we used the one-ratio model that estimates a single value of ω for all branches. Then, the free-ratio model was applied, calculating ω for each branch. Finally, we used the two-ratio model, where we inferred a value of ω for giant cetaceans and another for the rest of the phylogeny. In this case, the interest group was identified as a foreground branch, while the algorithm treated the other unmarked ones as background branches 88 . We tested two scenarios, first labeling the ancestral branches that led to the stem mysticete lineage and the branch of the sperm whale (Physeter catodon), and second labeling all cetaceans' species with body size larger than 10 m classified as giants as one group to compare within non-giants cetaceans. The same configuration was used in RELAX, a method to test whether the selection was relaxed (K < 1) or intensified (K > 1) on a portion of branches specified a priori in the phylogeny 89 .
Site models. For site model analyses, the dataset comprised only the sequences of giant cetaceans. FUBAR software (Fast Unconstrained Bayesian AppRoximation) was used to identify sites that may have experienced generalized diversification or purifying selection by estimating the ratios between dN and dS substitution rates for each site where posterior probability (PP) of ω > 1 is greater than 95% 90 . SLAC (Single-Likelihood Ancestor Counting) was also used. This algorithm combines maximum-likelihood (ML) and counting approaches to calculate the ratios between dN and dS rates by site given a codon alignment 91 . Finally, MEME (Mixed-Effects Model of Evolution) looked for evidence of episodic or diversifying selection at individual sites allowing ω to change from site to site and branch to branch 92 . TreeSAAP v.3.2 93 relies on the MM01 model implemented in baseML from the PAML package 85 using phylogeny to reconstruct the most likely ancestral states for the gene sequences, detecting selection at the amino acid level. The software assigns weight values to the non-synonymous codon changes, for which overall physicochemical effects are assessed using a model with 31 physicochemical amino acid properties, with these changes ranging from 1 (conservative) to 8 (radical change). Positive selection is checked through a z-score to calculate deviation from neutral evolution. A highly significant z-score (z > 3.09, p < 0.01) indicates more non-synonymous substitutions than assumed under the neutral model, and only amino acid changes with a score between 6 and 8 and with a positive z-score < 0.001 were considered 94 .
Branch-site models. The branch-site model was used to identify whether some sites were subjected to the action of positive selection in the group of giant cetaceans. For this analysis in codeML, the interest group (i.e., all giant cetaceans) was labeled in the phylogeny as foreground branches, where sites with ω > 1 are allowed, and the rest of the tree was labeled as background branches, where sites with ω > 1 are not allowed. Model A was then used against the null model. www.nature.com/scientificreports/