Gene duplications and gene loss in the epidermal differentiation complex during the evolutionary land-to-water transition of cetaceans

Major protein components of the mammalian skin barrier are encoded by genes clustered in the Epidermal Differentiation Complex (EDC). The skin of cetaceans, i.e. whales, porpoises and dolphins, differs histologically from that of terrestrial mammals. However, the genetic regulation of their epidermal barrier is incompletely understood. Here, we investigated the EDC of cetaceans by comparative genomics. We found that important epidermal cornification proteins, such as loricrin and involucrin, are conserved and that subtypes of small proline-rich proteins (SPRRs) are even expanded in number in cetaceans. By contrast, keratinocyte proline rich protein (KPRP), skin-specific protein 32 (XP32) and late cornified envelope (LCE) genes, with the notable exception of LCE7A, have been lost in cetaceans. Genes encoding proline rich 9 (PRR9) and late cornified envelope like proline rich 1 (LELP1) have degenerated in subgroups of cetaceans. These data suggest that the evolution of an aquatic lifestyle was accompanied by amplification of SPRR genes and loss of specific other epidermal differentiation genes in the phylogenetic lineage leading to cetaceans.

*, CDS (coding sequence) is shown for intact genes. For genes that carry inactivating mutations (labeled with "m"), nucleotide positions indicate the region of sequence similarity that was identified by tBLASTn search using orthologous proteins of other species as queries. The symbols < and > indicate that the ends of the coding sequence are not present on the scaffold.
**, RNA-seq peaks in the "Genomic regions, transcripts and products" view at www.ncbi.nlm.nih.gov (accessed on 14 July 2020).
***, GenBank gene prediction at the locus of the gene predicted in the present study.
n.a., not applicable.
The list includes PGLYRP, SEDC, SFTP genes and the S100A genes that flank PGLYRP3 or the SFTP region, but no other S100As.
Genes on the scaffold with accession number NW_006726465.1 were preceded by a gap and by genes not homologous to typical EDC genes, indicating either an assembly error or a gene rearrangement.
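
The table notes distinguish intact genes from genes carrying inactivating mutations. The source does not spell out the criteria used, so the following is only a minimal sketch of one common check, assuming that a premature in-frame stop codon counts as an inactivating mutation. The function name `first_premature_stop` and the example sequences are hypothetical and are not taken from the study.

```python
# Stop codons of the standard genetic code (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_premature_stop(cds):
    """Return the 0-based codon index of the first in-frame stop codon
    that occurs before the final codon, or None if there is none.

    Assumption: `cds` starts in frame at the first base; a trailing
    incomplete codon is ignored.
    """
    seq = cds.upper()
    n_codons = len(seq) // 3
    # The last complete codon is skipped because a stop is expected there.
    for i in range(n_codons - 1):
        codon = seq[3 * i:3 * i + 3]
        if codon in STOP_CODONS:
            return i
    return None


# Hypothetical toy sequences, not real EDC gene sequences.
intact_like = "ATGAAACCCTGA"        # ATG AAA CCC TGA
disrupted_like = "ATGAAATAACCCTGA"  # ATG AAA TAA CCC TGA

print(first_premature_stop(intact_like))
print(first_premature_stop(disrupted_like))
```

Frameshifts, lost start codons and disrupted splice sites would need separate checks; this sketch covers only premature stop codons.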

Same CDS prediction in GenBank?
GenBank Gene ID***

**, RNA-seq peaks in the "Genomic regions, transcripts and products" view at www.ncbi.nlm.nih.gov (accessed on 4 May 2021).
***, GenBank gene prediction at the gene locus predicted in the present study.
Notes: The list includes PGLYRP, SEDC, SFTP genes and the S100A genes that flank PGLYRP3 or the SFTP region, but no other S100As.
*, CDS (coding sequence) is shown for intact genes. Positions of genes carrying inactivating mutations (labeled with "m") indicate the region identified by tBLASTn search using orthologous proteins of other species as queries.

Other EDC genes of cattle were found in GenBank with annotations based on assembly accession GCF_002263795.1.
*, RNA-seq peaks in the "Genomic regions, transcripts and products" view at www.ncbi.nlm.nih.gov (accessed on 14 July 2020).
**, GenBank gene prediction at the locus of the gene predicted in the present study.
CDS, coding sequence; lncRNA, long non-coding RNA.

Amino acid sequences of dolphin SFTPs. (C) Amino acid sequences of proteins encoded by other EDC genes of the dolphin. To show the peculiar amino acid compositions of SEDCs and SFTPs and their importance for protein cross-linking, the following amino acid residues are highlighted: lysine (K) and glutamine (Q) as potential sites of transglutamination; cysteine residues (C) as potential sites of disulfide bonds; glycine (G), proline (P) and serine (S) as highly abundant residues not directly involved in cross-linking. When available, the GenBank accession number is shown after the protein name. "XXX" indicates a stretch of unknown amino acid residues, corresponding to a gap in the gene sequence. Only the S100A proteins whose genes flank PGLYRP3 and FLG are included here. SEDC, simple epidermal differentiation complex gene; SPRR, small proline rich protein; SFTP, S100 fused-type protein; Tt, Tursiops truncatus; Ba, Balaenoptera acutorostrata scammoni.

Figure S4. Amino acid sequences of proteins encoded by EDC genes of cattle. Amino acid sequences of proteins encoded by cattle SEDC genes that were predicted in this study. GenBank annotations were used for all other cattle EDC proteins, which are not shown here. To indicate the peculiar amino acid compositions of SEDCs and SFTPs and their importance for protein cross-linking, the following amino acid residues are highlighted: lysine (K) and glutamine (Q) are potential sites of transglutamination; cysteine residues (C) are potential sites of disulfide bonds; glycine (G), proline (P) and serine (S) are residues highly abundant in some SEDC proteins. SPRR, small proline rich protein; LCE, late cornified envelope; Bt, Bos taurus.

Figure S5. Sequence repeats of involucrin (IVL) proteins in cattle and cetaceans.
Amino acid sequences were aligned to highlight the internal sequence repeats. IVL sequences of bottlenose dolphin (Tursiops truncatus), vaquita/porpoise (Phocoena sinus) and minke whale (Balaenoptera acutorostrata scammoni) are encoded by genes listed in Supplementary Tables S2-S4. IVL of cattle (Bos taurus) has the GenBank accession number XP_005203889.1. Amino acid residues cysteine (C), proline (P), glutamine (Q), glutamic acid (E) and lysine (K) are highlighted. Sequences were aligned with the MultAlin program (Corpet 1988) and manually adjusted. Dashes were inserted to optimize alignment of sequence repeats. Note that the intra-species similarity of IVL1 and IVL2 is more pronounced than the similarities of IVL1 and IVL2 sequences of different species, suggesting independent IVL gene duplications in the lineages leading to dolphin and vaquita.
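
The figure legends highlight specific residues (C, P, Q, E and K for IVL) to show the unusual amino acid compositions of these proteins. As a minimal sketch of how such a composition could be quantified, the code below computes the fraction of each highlighted residue in a sequence. The function name `residue_fractions` and the toy sequence are hypothetical; the toy sequence is not a real IVL sequence, and no values from the study are reproduced here.

```python
from collections import Counter

# Residues highlighted in the Figure S5 legend.
HIGHLIGHTED = "CPQEK"


def residue_fractions(sequence, residues=HIGHLIGHTED):
    """Return a dict mapping each residue of interest to its fraction
    of the sequence length.

    Alignment dashes and unknown residues ("X") are excluded from the
    length, so aligned sequences with gaps can be passed in directly.
    """
    cleaned = [aa for aa in sequence.upper() if aa.isalpha() and aa != "X"]
    if not cleaned:
        return {aa: 0.0 for aa in residues}
    counts = Counter(cleaned)
    return {aa: counts[aa] / len(cleaned) for aa in residues}


# Hypothetical toy sequence with one alignment dash.
toy = "QEKPQ-CKEQ"
for aa, fraction in residue_fractions(toy).items():
    print(aa, round(fraction, 3))
```

For the SEDC and SFTP legends, the same function could be called with `residues="KQCGPS"` to cover the residues highlighted there.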