Palaeoproteomic analysis of Pleistocene cave hyenas from east Asia

The spotted hyena (Crocuta crocuta) is the only extant species of the genus Crocuta, which once occupied a much wider range during the Pliocene and Pleistocene. However, its origin and evolutionary history is somewhat contentious due to discordances between morphological, nuclear, and mitochondrial data. Due to the limited molecular data from east Asian Crocuta, also known as cave hyena, and the difficulty of extracting ancient DNA from this area, here we present proteomic analysis of cave hyenas from three locations in northern China. This marks the first proteomic data generated from cave hyenas, adding new molecular data to the east Asian populations. Phylogenetic analysis based on these protein sequences reveals two different groups of cave hyenas in east Asia, one of which could not be distinguished from modern spotted hyenas from northern Africa, tentatively the result of previously suggested gene flow between these lineages. With developments of instrumentation and analytical methods, proteomics holds promising potential for molecular phylogenetic reconstructions of ancient fauna previously thought to be unreachable using ancient DNA.

When restricted to the Hyaenidae taxa, 13 variable sites were present along the type I collagen sequences, among which one was specific for striped hyena, two for brown hyena, and seven for striped and brown hyena. One crucial variable site was detected at position 1449 (I>V, COL1A2, hereafter the position number was derived from the dataset alignments), which contains two types of amino acids. Isoleucine (I) was displayed in striped hyena, brown hyena and the Namibian individual (southern Africa), which may be an ancestral type. Valine (V) was detected in the other two modern spotted hyena individuals from Somalia (northeast Africa) and Ghana (northwest Africa), which we regard as the derived type. For the three cave hyena sequences, this variable site was all called as Valine (V) with a coverage of 183 (HZD), 83 (SYZ), and 69 (LXD) PSMs respectively, which is consistent with the modern spotted hyena sequences from northern Africa. Examples of the PSMs covering this region are showed in Fig. S1.
For the cave hyena sequence from the SYZ sample, a specific amino acid substitution was detected (H>D, COL1A2 at position 2049), with a coverage of 14 PSMs. One of the MS/MS product ion spectra for peptide TGDPGSVGPAGVR (2047-2059) from the SYZ sample is displayed in Fig. S2a, where the amino acid substitution (D) was identified by both y and b ions. However, the deamidation of asparagine (N) could result in the inference of aspartic acid (D). The average deamidation frequency for all glutamine/asparagine positions (n = 31) in COL1A2 from the SYZ sample was as high as 0.94, and the average deamidation frequency for asparagine (N) positions (n = 18) only was higher (0.97). We could not exclude the possibility that the amino acid call at position 2049 is a full deamidated asparagine (N). To confirm the original call of this position, we searched the carnivoran type I collagen database, where this specific position was a variable site with two types of amino acids, i.e. histidine (H) and glutamine (Q). When restricted to the represented Feliformia taxa, it was a conserved site with the same amino acid call, i.e. histidine (H). Neither asparagine (N) and aspartic acid (D) were observed at this position, but a previous study has reported a H>D substitution in COL1A2 from a Pleistocene Equus sp. sequence 3 . In addition, the short peptide "TGDP" could be found in most of the represented carnivoran taxa at another region (1480-1483) while "TGNP" could not be detected. Thus, we consider this amino acid substitution as D rather than N.
For the cave hyena sequences from the HZD and LXD samples, another specific amino acid substitution was detected (E>S, COL1A1 at position 523), with a coverage of 9 (HZD) and 33 (LXD) PSMs respectively. Examples of the MS/MS product ion spectra for peptide GFPGSRGVQGPPGPAGPR (519-536) from the HZD and LXD samples are displayed in Fig. S2b and c. Besides type I collagen, two amino acid substitutions were detected in the HZD sample (K>R, CHAD at position 2561, and F>I, OMD at position 3175), both with a coverage of only 2 PSMs, which may require confirmation when more sequences become available for the carnivoran taxa in the future. S18-20). The spectra were exported from PEAKS X (https://www.bioinfor.com/peaks-studio/). The arrows and the y/b numbers were added to the Roepstorff nomenclature diagrams using Adobe Photoshop CC 2015 (https://www.adobe.com/cn/products/photoshop.html). Figure S2. Examples of MS/MS product ion spectra covering the specific amino acid substitutions. a) MS/MS product ion spectrum covering the amino acid substitution (H>D, COL1A2 at position 2049) from the SYZ sample. b) and c) MS/MS product ion spectra covering the amino acid substitution (E>S, COL1A1 at position 523) from the HZD (b) and LXD samples (c). Both deamidation (Q) and hydroxylation (P) were detected and presented as lowercase letters. Arrows point out the positions of the variable sites. The ion match tables for the spectra are displayed below (Figs. S21-23). The spectra were exported from PEAKS X (https://www.bioinfor.com/peaks-studio/). The arrows and the y/b numbers were added to the Roepstorff nomenclature diagrams using Adobe Photoshop CC 2015 (https://www.adobe.com/cn/products/photoshop.html).

Reanalysis of previously published MS/MS raw data of a modern spotted hyena
The type I collagen sequence for modern spotted hyena has once been reported in reference 4 , where the MS/MS datafile was searched against a mammalian collagen database without any Hyaenidae sequence included. In that database, the closest relative of hyenas was from the family Felidae, which also belongs to the suborder Feliformia. In contrast, the local database in our study contained sequences from three modern spotted hyenas (Crocuta crocuta) from Africa, one striped hyena (Hyaena hyaena) and one brown hyena (Parahyena brunnea), which were all generated from high coverage genomes 5,6 . The raw data from reference 4 was downloaded and re-analyzed here with our new database and the same parameters described in the methods. The newly generated sequence has a greater coverage than previously reported (92.7% vs. 88.9%), and there are nine amino acid differences between the two Crocuta sequences, which is a substantial number considering the highly conserved nature of type I collagen. For seven of the nine amino acid differences, our newly developed sequence presented the same amino acid call as the Hyaenidae reference sequences while the sequence previously reported gave the same amino acid call as the Felidae reference sequences. Thus, most of the amino acid differences would be attributed to the improved database, whose effect has been discussed before 7,8 . The newly developed sequence for the modern sample is considered more reliable than that from reference 4 . For this sample, no reliable amino acid substitution was detected compared to the sequences of modern spotted hyenas from northern Africa. However, the geographic location of this sample is unknown. Therefore, its sequence was excluded from the phylogenetic datasets.

Phylogenetic reconstructions based on parsimony and Bayesian analyses (carried out in MrBayes) (dataset 1 and 2)
Parsimony and Bayesian analyses (carried out in MrBayes) result in largely concordant phylogenetic trees. Especially for the Hyaenidae clade, both datasets generate the same topology from two different methods (Figs. S3 and S4). The lower bootstrap support (BS) under parsimony analysis is possibly due to too few parsimony informative sites of the datasets, while Bayesian analysis makes full use of the data and results in higher posterior probabilities (PP). The support difference of two phylogenetic methods has been reported in a previous publication 7 .  The phylogenetic tree generated from dataset 1 agrees well with the commonly accepted high level Feliformia classification based on morphological and genetic data (Fig. S3) [9][10][11] . The sequences from family Hyaenidae form a monophyletic group which is sister to all the other Feliformia species from family Felidae. Within the Felidae clade, the subfamily Pantherinae (big cats) is clearly differentiated from subfamily Felinae (small cats). As to the Hyaenidae clade, Hyaena and Parahyaena form a group, which is sister to all the modern spotted and cave hyenas. The phylogenetic tree based on dataset 2 has the same topology in the Hyaenidae clade as that of dataset 1, but with higher support values (BS and PP), perhaps due to more amino acid residues included in dataset 2 (Fig. S4).

Additional phylogenetic analyses
To address the importance of certain variable site (position 1449, COL1A2), additional phylogenetic analyses were performed based on a modified dataset from dataset 2. With all the Hyaenidae sequences called as X at position 1449, the results of parsimony and Bayesian analyses are displayed in Fig. S5 and S6. In the Crocuta clade, both the African modern spotted hyenas and the east Asian cave hyenas could still be separated into two groups respectively. But the Namibian individual did not form a basal lineage anymore. In Fig. S6, the SYZ sample grouped together with the northern African spotted hyenas, but with a lower support (PP 0.31), which was sister to the Namibian individual (PP 0.42). The divergence time between the SYZ and the northern African individuals was estimated to be 1.01 Ma, which agreed well with that estimated from the original dataset 2. Compared with the time-scaled maximum clade credibility tree generated from the original dataset 2 (Fig. 4), here the placement of the Namibian individual was different. The different results between the original and modified dataset 2 indicated that the variable site (position 1449, COL1A2) has significant importance to distinguish different clades within the genus Crocuta based on ancient protein evidence, and to further support the gene flow between the Northern African spotted hyenas and east Asian cave hyenas. However, it is noteworthy that the frequency distribution of different amino acid types (I, V, possibly other types detected afterwards at position 1449) within Hyaenidae populations remains unclear yet due to limited number of whole genomes available so far. With more samples analyzed in the future, the phylogenetic relationships between modern spotted hyenas and east Asian cave hyenas could be confirmed or modified.  In an attempt to determine the relationships between our samples and other Eurasian cave hyenas, we incorporated the translated protein sequences from 11 low coverage genomes (1.6 -6.3x) of 4 cave hyenas and 7 modern spotted hyenas 6 . The phylogenies based on the new dataset are displayed in Figs. S7 and S8. Even though the maximum clade credibility tree (Fig. S8, obtained from BEAST analysis) shows that the European cave hyenas form the basal lineages to other Crocuta individuals, the supports are low (just around 0.50 or lower). Actually, most of the clades in Fig. S8 have a posterior probability of below 0.5. Combining with the 50% majority rule consensus trees obtained from parsimony and Bayesian (carried out in Mrbayes) analyses (Fig. S7), the relationships within most of the cave hyenas and modern spotted hyenas could not be resolved. What is more, the node age of the Crocuta clade is estimated to be much older than that estimated from nuclear genome data. The inconclusive phylogeny and the older age could be caused by the addition of substitutions brought about by sequencing error or DNA damage, which is more prone to be exposed in low coverage genomes. To obtain clearer and more accurate evolutionary results of cave hyenas, more reliable data should be chosen for the phylogenetic analysis.  All the details of phylogenetic analyses carried out in BEAST and Mrbayes were provided in XML and NEX files, which were also uploaded as supplementary materials (XML Files S1-3 and NEX Files S1-4).
The spectra generated from prescreening analyses and the ion match tables for the MS/MS spectra used in Figs. S1 and S2 Figures S9-14 show the FTIR spectra of the cave hyena samples. Figures S15-17 show the MALDI-TOF spectra of three cave hyena (sub)samples. The spectra of the other three fossil subsamples (LXD-1D, LXD-2B, and LXD-2D) were not displayed because of their poor signal. The peptide markers (P1-G) 4,12 were annotated beside their related mass. Figures S18-23 show the ion match tables for the example MS/MS spectra used in Figs. S1 and S2.

Legends for supplementary tables and other supplementary Files
Table S1. The identified proteins in the SYZ sample. Table S2. The identified proteins in the HZD sample. Table S3. The identified proteins in the LXD sample. Table S4. Comparative datasets used for phylogenetic reconstructions, referencing database sources (with accession numbers) or associated genome/proteome publications.
Fasta File S1. Comparative dataset 1 alignment used for phylogenetic reconstructions, where COL1 A1 and COL1A2 were concatenated, with COL1A1 ranging from position 1 to position 1057, and COL1A2 from 1058 to 2098. XML File S1. The details of phylogenetic analysis carried out in BEAST using a concatenated alignment of 10 proteins from 8 extant and extinct taxa.
XML File S2. The details of phylogenetic analysis carried out in BEAST using a concatenated alignment of 10 proteins from 8 extant and extinct taxa. All the Hyaenidae sequences were called as X at position 1449.
XML File S3. The details of phylogenetic analysis carried out in BEAST using a concatenated alignment of 10 proteins from 19 extant and extinct taxa.
NEX File S1. The details of phylogenetic analysis carried out in Mrbayes using proteomic dataset 1, with Ailuropoda melanoleuca (giant panda) as an outgroup.
NEX File S2. The details of phylogenetic analysis carried out in Mrbayes using proteomic dataset 2, with Felis catus (domestic cat) as an outgroup.
NEX File S3. The details of phylogenetic analysis carried out in Mrbayes using a modified dataset 2, with Felis catus (domestic cat) as an outgroup. All the Hyaenidae sequences were called as X at position 1449.
NEX File S4. The details of phylogenetic analysis carried out in Mrbayes using a concatenated alignment of 10 proteins from 20 extant and extinct taxa, with Felis catus (domestic cat) as an outgroup.
Extended Data File S1. The raw data of all the MALDI-TOF MS spectra.