Palaeoproteomics confirm earliest domesticated sheep in southern Africa ca. 2000 BP

We used palaeoproteomics and peptide mass fingerprinting to obtain secure species identifications of key specimens of early domesticated fauna from South Africa, dating to ca. 2000 BP. It can be difficult to distinguish fragmentary remains of early domesticates (sheep) from similar-sized local wild bovids (grey duiker, grey rhebok, springbok—southern Africa lacks wild sheep) based on morphology alone. Our analysis revealed a Zooarchaeology by Mass Spectrometry (ZooMS) marker (m/z 1532) present in wild bovids and we demonstrate through LC–MS/MS that it is capable of discriminating between wild bovids and caprine domesticates. We confirm that the Spoegrivier specimen dated to 2105 ± 65 BP is indeed a sheep. This is the earliest directly dated evidence of domesticated animals in southern Africa. As well as the traditional method of analysing bone fragments, we show the utility of minimally destructive sampling methods such as PVC eraser and polishing films for successful ZooMS identification. We also show that collagen extracted more than 25 years ago for the purpose of radiocarbon dating can yield successful ZooMS identification. Our study demonstrates the importance of developing appropriate regional frameworks of comparison for future research using ZooMS as a method of biomolecular species identification.


Protein Extraction
Archaeological and modern bone fragments were first treated with 0.6 M hydrochloric acid (HCl) for several days at 4 ℃ to demineralise the sample before solubilisation of the protein fraction. After demineralisation, the supernatant was removed and the sample was rinsed: 150 µl of 50 mM ammonium bicarbonate (Ambic) was added, and the sample was vortexed and then centrifuged for 1 minute. This rinse was repeated twice. For the final rinse, the pH of the supernatant was checked to verify that the HCl had been successfully removed. Archaeological and modern bone fragments were then solubilised by incubation with 100 µl of 50 mM Ambic at 65 ℃ for 1 hour. After solubilisation, the extracted protein concentration was determined using a BCA (bicinchoninic acid) assay according to the manufacturer's instructions. Briefly, a 5-point standard series was made by serially diluting bovine serum albumin (BSA) to cover a concentration range of 62.5-1000 µg/ml. Each sample was measured in duplicate, with 10 µl of sample incubated with 200 µl of BCA 'working reagent' at 37 ℃ before absorbance at 560 nm was measured using a plate reader. Where possible, a volume of sample extract containing 20 µg of protein was transferred into a new Eppendorf Protein LoBind tube and the volume adjusted to a minimum of 50 µl with 50 mM Ambic, along with 1 µl of 0.4 µg/µl sequencing grade trypsin (Promega). Trypsin digestion was performed at 37 ℃ overnight with mild shaking (500 rpm). After digestion, the samples were centrifuged at 10,000 g for 1 minute and then acidified to < pH 2 using 5% (v/v) trifluoroacetic acid (TFA). Peptide clean-up was performed using C18 reverse phase resin ZipTips according to the manufacturer's instructions. Peptides were eluted using 50 µl of 50% acetonitrile (ACN) 0.1% TFA solution.
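The BCA quantification and the 20 µg aliquot calculation described above can be sketched as follows. This is a minimal illustration only: it assumes a linear standard curve and a two-fold dilution series running from 1000 down to 62.5 µg/ml, and the absorbance readings and function names are invented for the example rather than taken from this study.

```python
# Sketch of a BCA standard curve and aliquot calculation.
# All absorbance values below are hypothetical, not measured data.

def serial_dilution(top, points, factor=2.0):
    """Concentrations of a dilution series starting at `top` (ug/ml)."""
    return [top / factor ** i for i in range(points)]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def concentration_ug_per_ml(replicate_abs, slope, intercept):
    """Read a concentration off the curve from the mean of the replicates."""
    mean_abs = sum(replicate_abs) / len(replicate_abs)
    return (mean_abs - intercept) / slope


def volume_for_mass_ul(target_ug, conc_ug_per_ml):
    """Volume (ul) of extract that contains `target_ug` of protein."""
    return target_ug / conc_ug_per_ml * 1000.0


# Five-point BSA series: 1000, 500, 250, 125 and 62.5 ug/ml.
standards = serial_dilution(1000.0, 5)
standard_abs = [1.10, 0.60, 0.35, 0.225, 0.1625]  # hypothetical readings
slope, intercept = fit_line(standards, standard_abs)

# One sample measured in duplicate (hypothetical readings).
conc = concentration_ug_per_ml([0.49, 0.51], slope, intercept)
volume = volume_for_mass_ul(20.0, conc)
print(f"{conc:.1f} ug/ml, {volume:.1f} ul needed for 20 ug")
```

With these made-up readings the sample comes out at about 400 µg/ml, so 20 µg corresponds to about 50 µl of extract. Real BCA curves are often slightly non-linear, in which case a polynomial or four-parameter fit may be more appropriate than the straight line assumed here.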
A subset of the archaeological and modern bone fragments was also sampled using two minimally destructive techniques: PVC eraser 1,2 and polishing films 3. The PVC eraser method involves rubbing the bone with a small fragment of eraser; through triboelectric friction, collagen is transferred from the bone to the eraser. The resultant eraser crumbs are collected into a LoBind Eppendorf tube. The polishing film method involves rubbing the bone with a small piece of polishing film with a gritted surface. The films come in a range of grit sizes and materials (1-30 µm particles); here we used 15 µm alumina film and 3 µm diamond film, held in tweezers. Friction of the film against the bone surface causes abrasion, and microscopic bone particles are transferred to the film, which is then placed whole into an Eppendorf tube. For both the PVC eraser and polishing film methods, 75 µl of Ambic was added directly to the eraser rubbings or film pieces in the tube, together with 1 µl of 0.4 µg/µl sequencing grade trypsin (Promega). Trypsin digestion was performed at 37 ℃ for four hours. After digestion, the samples were centrifuged at 10,000 g for 1 minute and then acidified to < pH 2 using 5% (v/v) TFA. Peptide clean-up was performed using C18 reverse phase resin ZipTips according to the manufacturer's instructions. Peptides were eluted using 50 µl of 50% ACN 0.1% TFA solution.

Matrix-assisted Laser Desorption/Ionization Time-of-Flight Mass Spectrometry data acquisition and analysis
Peptide eluates from all extraction methods (solution, PVC eraser, polishing film) were co-crystallised with α-cyano-4-hydroxycinnamic acid (Sigma Aldrich) matrix solution (50% ACN / 0.1% TFA, vol/vol) at a ratio of 1:1 (1 µl : 1 µl). Mass spectrometry was performed using a Bruker Ultraflex III (Bruker Daltonics) matrix-assisted laser desorption/ionization time-of-flight mass spectrometer (MALDI-TOF-MS) run in reflector mode with laser acquisition set to 1200 and acquired over an m/z range of 800-3200. The generated spectral output was converted to TXT and analysed using the open-source software mMass v.5.5.0 4. The triplicate raw files for each sample were merged and then peak-picked with a signal-to-noise (S/N) threshold of 4. MALDI-TOF-MS was performed at the Centre for Excellence in Proteomics at the University of York, United Kingdom.
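To make the merge and peak-picking step concrete, the sketch below averages three replicate spectra and keeps local maxima with S/N of at least 4. It is not the mMass algorithm: the noise estimate (median absolute deviation), the baseline (median intensity), the function names, and the toy intensities around m/z 1532 are all assumptions chosen for illustration.

```python
from statistics import median


def merge_spectra(replicates):
    """Average replicate intensity lists that share one m/z axis."""
    n = len(replicates)
    return [sum(values) / n for values in zip(*replicates)]


def noise_level(intensities):
    """Crude noise estimate: median absolute deviation from the median."""
    centre = median(intensities)
    return median([abs(v - centre) for v in intensities])


def pick_peaks(mz, intensities, snr_threshold=4.0):
    """Return (m/z, intensity) for local maxima above the S/N threshold."""
    baseline = median(intensities)
    noise = noise_level(intensities)
    if noise == 0:
        raise ValueError("noise estimate is zero; cannot compute S/N")
    peaks = []
    for i in range(1, len(mz) - 1):
        is_local_max = (intensities[i] > intensities[i - 1]
                        and intensities[i] >= intensities[i + 1])
        snr = (intensities[i] - baseline) / noise
        if is_local_max and snr >= snr_threshold:
            peaks.append((mz[i], intensities[i]))
    return peaks


# Toy m/z axis from 1530.0 to 1534.0 in 0.5 steps (hypothetical data).
mz = [1530.0 + 0.5 * i for i in range(9)]
replicates = [
    [10, 12, 9, 11, 95, 10, 12, 9, 11],
    [11, 9, 12, 10, 105, 12, 9, 11, 10],
    [9, 12, 9, 12, 100, 11, 9, 10, 12],
]
merged = merge_spectra(replicates)
print(pick_peaks(mz, merged))
```

In this toy example only the signal at m/z 1532.0 passes the threshold; the small ripples in the background are local maxima but fall well below S/N 4.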

Liquid Chromatography Tandem Mass Spectrometry data acquisition and analysis
LC-MS/MS Acquisition
The reference samples A (springbok), D (grey rhebok), G (Namaqua Afrikaner sheep), and M (grey duiker), as well as collagen from the earliest dated archaeological sample (P4859C) from Spoegrivier, were further analysed by LC-MS/MS. Samples were processed as for ZooMS, but an additional parallel digestion was performed with elastase instead of trypsin. The tryptic and elastase ZipTip eluates were combined before being evaporated to dryness using a vacuum concentrator (Eppendorf, Hamburg, Germany), and transferred to the Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, for LC-MS/MS analysis on an EASY-nLC 1200 (Proxeon, Odense, Denmark) coupled to a Q-Exactive HF-X (Thermo Scientific, Bremen, Germany). The volume required for approximately 2 µg of protein per sample was placed in separate wells on a new 96-well plate and topped up to 30 µl using 40% ACN and 0.1% formic acid (FA). The samples were then vacuum centrifuged and resuspended in 10 µl of 0.1% TFA, 5% ACN, and 5 µl of each sample was analysed by LC-MS/MS on a 77 min gradient. The LC-MS/MS parameters were the same as previously used for palaeoproteomic samples 5,6; in short, MS1: 120 k resolution, maximum injection time (IT) 25 ms, scan target 3E6; MS2: 60 k resolution, top 10 mode, maximum IT 118 ms, minimum scan target 3E3, normalised collision energy of 28, dynamic exclusion 20 s, and isolation window of 1.2 m/z.

LC-MS/MS Data Analysis
To validate the presence of the suspected ZooMS markers, the Thermo RAW files were searched using the software MaxQuant (v.1.6.3.4) 7. The database was prepared using previously published type 1 collagen sequences from several species including Ovis aries, Capra hircus, Bos taurus, Mus musculus, and Homo sapiens. MaxQuant settings were as follows. Digestion mode was set to semi-specific for trypsin, to account for possible additional hydrolytic cleavages occurring during diagenesis. Variable modifications were: oxidation (M), acetyl (Protein N-term), deamidation (NQ), Gln → pyro-Glu, Glu → pyro-Glu, and hydroxyproline. Carbamidomethyl (C) was set as a fixed modification. The remaining settings were left at the program defaults, apart from the minimum score for unmodified and modified peptide searches, which were both set to 60. Proteins were considered confidently identified if at least two razor+unique peptides covering distinct areas of the sequence were recovered. MS/MS spectra were assessed manually for confident identification. In addition, the samples were searched against the MaxQuant contaminant database, which identifies proteins that may be present due to sample handling and laboratory analysis. Any protein not considered authentic (e.g. keratins from skin, the laboratory standard BSA) was excluded from further analysis. Deamidation was assessed using publicly available code 6, with the contaminant proteins filtered out.
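The acceptance rule for protein identifications (not a contaminant, at least two razor+unique peptides, and peptides covering distinct areas of the sequence) can be expressed as a small filter. The sketch below is illustrative only: the `ProteinGroup` record, its field names, and the example entries are hypothetical and do not reproduce MaxQuant's output format, and "distinct areas" is interpreted here as at least two non-overlapping peptide spans, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ProteinGroup:
    """Hypothetical summary of one protein identification."""
    name: str
    razor_unique_peptides: int
    is_contaminant: bool
    peptide_spans: list  # (start, end) residue positions, inclusive


def covers_distinct_areas(spans):
    """True if at least two peptide spans do not overlap each other."""
    ordered = sorted(spans)
    for i, (_, end_first) in enumerate(ordered):
        for start_second, _ in ordered[i + 1:]:
            if start_second > end_first:
                return True
    return False


def is_confident(group, min_peptides=2):
    """Apply the three acceptance criteria described in the text."""
    return (not group.is_contaminant
            and group.razor_unique_peptides >= min_peptides
            and covers_distinct_areas(group.peptide_spans))


# Hypothetical search results.
groups = [
    ProteinGroup("COL1A1", 2, False, [(10, 25), (40, 60)]),
    ProteinGroup("Keratin", 5, True, [(1, 15), (30, 45)]),
    ProteinGroup("Overlapping", 2, False, [(5, 20), (12, 30)]),
    ProteinGroup("SinglePeptide", 1, False, [(100, 115)]),
]
accepted = [g.name for g in groups if is_confident(g)]
print(accepted)
```

Manual inspection of the MS/MS spectra, as described above, remains a separate step that a filter of this kind does not replace.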
In addition, the archaeological sample Spoegrivier P4859C was searched in MaxQuant against the sheep proteome database from UniProt (downloaded 20/3/20) in order to identify other proteins besides COL1. This was done once with the same MaxQuant settings as above and then again with semi-specific digestion for elastase instead of trypsin, and the results were combined (Tables S3 and S4). Peptides from the recovered proteins were classed as species-specific based on searches using the NCBI BLASTp tool (https://blast.ncbi.nlm.nih.gov/Blast.cgi) against all publicly available protein sequences.

Construction of springbok collagen sequence
Springbok is the species most likely to be confused with sheep in this context, especially at Spoegrivier. Therefore, the collagen 1 (COL1) sequence for springbok was derived from the LC-MS/MS springbok sample. The raw file created from modern springbok bone (A) was also searched with PEAKS (v.7) 8 against a database comprising COL1 sequences (COL1A1 and COL1A2 were combined with a single K residue separating them) of the following species: Ovis aries, Capra hircus, Bos taurus, Sus scrofa, Mus musculus, and Homo sapiens. The searches (de novo, PEAKS DB, PEAKS PTM, and SPIDER) were performed with a peptide mass tolerance of ±10 ppm and a fragment mass tolerance of ±0.05 Da. No enzyme was specified. Searches allowed the variable modifications deamidation (NQ), oxidation (M), Gln → pyro-Glu, Glu → pyro-Glu, and hydroxyproline. Any single amino acid polymorphism (SAP) with convincing evidence (more than half of the residues located at that position with well annotated spectra) detected through the searches was compiled into a new COL1 sequence for springbok. Where a SAP was ambiguous, two versions of the sequence were made.
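As a worked illustration of the search tolerances and the database layout, the sketch below checks a precursor mass against a ±10 ppm window, a fragment mass against a ±0.05 Da window, and joins two chains with a single K. The function names, the example masses, and the short stand-in sequences are hypothetical; this is not how PEAKS implements its matching internally.

```python
def ppm_error(observed, theoretical):
    """Relative mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def within_ppm(observed, theoretical, tol_ppm=10.0):
    """Precursor check: is the error inside the ppm tolerance?"""
    return abs(ppm_error(observed, theoretical)) <= tol_ppm


def within_da(observed, theoretical, tol_da=0.05):
    """Fragment check: is the absolute error inside the Da tolerance?"""
    return abs(observed - theoretical) <= tol_da


def join_chains(col1a1, col1a2, linker="K"):
    """Concatenate the two COL1 chains with one linker residue between."""
    return col1a1 + linker + col1a2


# At a mass of 1532, 10 ppm corresponds to roughly 0.015 Da.
print(round(ppm_error(1532.010, 1532.000), 2))  # about 6.5 ppm: accepted
print(within_ppm(1532.020, 1532.000))           # about 13 ppm: rejected
print(within_da(175.159, 175.119))              # 0.04 Da off: accepted

# Short made-up fragments standing in for the full chain sequences.
print(join_chains("GPM", "GLR"))
```

A ppm tolerance scales with mass, so the absolute window is narrower for small peptides than for large ones, whereas the fragment tolerance in Da is fixed.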
These SAPs were then authenticated by searching the sample of modern springbok bone (A) with MaxQuant (v.1.6.3.4) using the same database as before, but with the predicted springbok sequences added. Digestion mode was set to semi-specific for trypsin. Variable modifications were: oxidation (M), deamidation (NQ), Gln → pyro-Glu, Glu → pyro-Glu, and hydroxyproline. Carbamidomethyl (C) was set as a fixed modification. The remaining settings were left at the program defaults, apart from the minimum score for unmodified and modified peptide searches, which were both set to 60 to limit the number of poor quality spectra included in the analysis. The search was then repeated with semi-specific digestion for elastase, and those results were combined with the results of the trypsin search. These trypsin and elastase searches were repeated and the springbok sequence(s) updated according to well annotated matches from the other sequences in the database, until a single predicted springbok sequence remained that incorporated all confidently detected SAPs.
Each peptide assigned to springbok COL1 was then manually examined for well annotated spectra, with the consideration that for most peptides every third residue would be a glycine (G), as that is the structure of the vast majority of the collagen sequence 9. If part of the sequence was in doubt due to poor annotation and/or missing ions, these residues were changed to an X. This included the consideration of post-translational modifications (PTMs) that could affect the interpretation of the sequence, mostly the inspection of deamidated residues (where an aspartic acid could be a deamidated asparagine and a glutamic acid a deamidated glutamine). If all spectra of a peptide contained the deamidated form of a residue, it was marked with an X. Additionally, a hydroxylated proline (common in collagen) beside an alanine has the same combined mass as an unmodified proline beside a serine. In this case the true sequence can only be known if these residues are detected in separate fragment ions. If this was not possible, the two residues were replaced with Xs.
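The mass ambiguities behind these X assignments can be checked with simple arithmetic. The sketch below uses standard monoisotopic residue masses (rounded to five decimals) to show that hydroxyproline plus alanine matches proline plus serine, and that deamidated asparagine and glutamine match aspartic and glutamic acid; it also includes a rough check of the glycine-every-third-residue pattern. The function names and the short test sequences are illustrative assumptions.

```python
# Monoisotopic residue masses in Da, rounded to five decimals.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "E": 129.04259,
}
HYDROXYLATION = 15.99491  # one added oxygen
DEAMIDATION = 0.98402     # NH2 replaced by OH


def same_mass(a, b, tol=1e-4):
    """Treat two masses as indistinguishable within a small tolerance."""
    return abs(a - b) <= tol


def best_glycine_fraction(sequence):
    """Highest fraction of G at every third residue over the 3 frames."""
    best = 0.0
    for frame in range(3):
        positions = sequence[frame::3]
        if positions:
            best = max(best, positions.count("G") / len(positions))
    return best


hyp_ala = MONO["P"] + HYDROXYLATION + MONO["A"]
pro_ser = MONO["P"] + MONO["S"]
print(round(hyp_ala, 4), round(pro_ser, 4))             # both about 184.0848

print(same_mass(MONO["N"] + DEAMIDATION, MONO["D"]))    # deamidated N vs D
print(same_mass(MONO["Q"] + DEAMIDATION, MONO["E"]))    # deamidated Q vs E

print(best_glycine_fraction("GPAGPSGER"))  # regular Gly-X-Y repeat
print(best_glycine_fraction("APAGPSGER"))  # first glycine replaced
```

Because the residue pairs have identical elemental compositions, no mass tolerance can separate them; only fragment ions that split the pair resolve the order, which is why unresolved positions are recorded as X.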
The peptides from confidently matched spectra (including those with X residues but otherwise well annotated) were then aligned using MAFFT's online service (v.7) 10. A consensus sequence was then made from these peptides, with each amino acid residue considered correctly assigned only if it could be confidently identified in at least two overlapping peptides. All residues that did not reach this standard were marked with an X in the final sequence. In addition, since leucine and isoleucine cannot be distinguished with this protocol (they are isomers with the same mass), all isoleucines (I) from the original sequence have been presented as leucines (L) for consistency 11. The peptides used for reconstruction and the resulting predicted and final springbok sequences (separated into COL1A1 and COL1A2) are provided below (and Table S2).
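The consensus rule can be sketched in a few lines. The example below takes peptides that are already aligned (gaps written as `-`, as in a typical alignment output), rewrites I as L, and keeps a residue only where at least two peptides agree on it; everything else becomes X. The function names and the toy peptides are hypothetical, and resolving disagreements by taking the most frequent residue is an assumption rather than the procedure used in this study.

```python
from collections import Counter


def normalise(peptide):
    """Upper-case the peptide and present isoleucine as leucine."""
    return peptide.upper().replace("I", "L")


def consensus(aligned_peptides, min_support=2):
    """Column-wise consensus; unsupported positions are marked X."""
    rows = [normalise(p) for p in aligned_peptides]
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("aligned peptides must all have the same length")
    result = []
    for col in range(width):
        # Ignore gaps and residues already flagged as uncertain.
        calls = Counter(row[col] for row in rows if row[col] not in "-X")
        if calls:
            residue, support = calls.most_common(1)[0]
            result.append(residue if support >= min_support else "X")
        else:
            result.append("X")
    return "".join(result)


# Three made-up overlapping peptides, padded with gaps to equal length.
aligned = [
    "GPAGPIGER---",
    "---GPLGERGAP",
    "------GERGXP",
]
print(consensus(aligned))
```

In this toy case the first three residues are seen in only one peptide and the residue at position 11 is supported once (the other call is already an X), so all four positions are reported as X in the consensus.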
Final searches (semi-specific for both trypsin and elastase) of springbok bone A and the other modern reference bones (D, G, M), as well as the archaeological sample Spoegrivier P4859C, were performed with the final predicted springbok sequence database, both to compare these samples and to further validate the presence of the suspected ZooMS marker of interest.
2. ZooMS spectra images from 4 reference samples and Spoegrivier P4859C.
Table S1: ZooMS m/z markers for each method tested.
It is necessary to screen ZooMS spectra produced from minimally destructive sampling techniques (eraser and film methods) for common contaminants such as human keratin (from skin, nails or hair) deriving from handling during or after excavation, and the plastic residue of the eraser and films themselves. These contaminant peaks have been removed from the analysed spectra, following 1,2,3.