Eyasi Plateau Paleontological Expedition, Laetoli, Tanzania, fossil specimen database 1998–2005

The Eyasi Plateau Paleontological Expedition (EPPE) Laetoli specimen database contains 13716 records of plant and animal fossils (ca. 28248 specimens) collected by EPPE field teams working at Laetoli, Tanzania between 1998 and 2005. This dataset is a digital version of the original hard-copy specimen catalog, and it documents the discovery, stratigraphic provenience and taxonomic diversity of Plio-Pleistocene fauna and flora in northern Tanzania between 4.4 Ma and >200 ka. Laetoli is renowned for the discovery of important hominin fossils, including the lectotype for Australopithecus afarensis, one of our early hominin ancestors, the first record of Paranthropus aethiopicus outside Kenya-Ethiopia, and an early record of our own species Homo sapiens. This database is one of the few publicly available palaeoanthropological fossil datasets and serves as an example for expanding open access to primary fossil occurrence data in palaeoanthropology. The taxonomic identifications appearing in this dataset are the original field identifications and are provisional. Any taxonomic analysis employing this dataset should refer to updated taxonomic identifications published by specialists.

The database represents an important addition to the resources currently available for researchers investigating human evolution and vertebrate palaeontology in Africa. Field data of this kind, which provide crucial documentation about the nature and history of fossil collections, are rarely available to other researchers, and in the past essential contextual data about historical collections have been lost. For example, in 1938-1939 Ludwig and Margrethe Kohl-Larsen made one of the most important collections of fossil vertebrates from Laetoli [12][13][14], which is housed in the Museum für Naturkunde in Berlin and the University of Tübingen. However, no comprehensive documentation of the collecting localities and stratigraphic provenance of the specimens was made at the time of discovery (or at least none that survives to the present day), so crucial information about the context is largely unknowable, which greatly lessens the value and significance of the Kohl-Larsen collections. Making the Laetoli field data available ensures that future researchers have access to the history and contextual information relating to the discovery of individual specimens. As such, the database becomes an important historical resource. The database also provides important information, especially when combined with corresponding data from other palaeontological sites in Africa, which can be used for analyses of palaeoecology, palaeobiogeography, taphonomy, biochronology, and macroevolutionary patterns of speciation and extinction. Finally, the database can be used in broader-scale analyses of the impact of regional and global climate change on biotas during the Pliocene.

Methods
The data were collected and processed following the steps outlined in Fig. 3, from field collection and hard-copy documentation, to digitization, alignment/import, cleaning/harmonization, and metadata annotation.
Field collection and documentation. The original fossils were recovered from the Eyasi Plateau, an uplifted fault block on the northwest margin of Lake Eyasi, located in the Ngorongoro Conservation Area, Tanzania (3.25°S, 35.10°E). Most of the fossils were recovered from the Laetoli area, but smaller collections have been recovered from Kakesio and Esere-Noiti. The project area covers approximately 400 square km. Fossils were primarily recovered from the surface of exposed outcrops after they had eroded out of the sediments. Partially exposed fossils in situ were excavated. No systematic screening for microinvertebrates and microvertebrates was undertaken, although dry screening methods were employed to recover associated remains and at localities where fossil hominins were recovered. The collection protocol stipulated collecting all vertebrate fossils that were anatomically identifiable (with the exception of rib fragments and limb bone shaft fragments that did not retain at least a portion of one articular surface). Bone fragments that were not anatomically identifiable but preserved traces of taphonomic interest (such as carnivore bite marks, rodent gnawing, cut marks, insect damage, and root etching) were also collected. Isolated fragments of tortoise shells and ostrich eggshell, terrestrial gastropods, insects and insect traces, and macrobotanical fossils were not collected systematically, but representative specimens were collected at each locality as reference specimens. Collecting events occurred at 60 designated localities and sublocalities within specific stratigraphic units in those localities 15. Fossils were cataloged the same day they were discovered and field numbers were inscribed on the fossils with permanent ink.
Collection details (date of collection, locality, stratigraphic unit, anatomical element, taxonomic identification, other remarks) for each fossil were written onto collection cards that remain with the fossil and details were also written into a hard-copy collection catalog. Preliminary taxonomic identifications included in the catalog are based on expertise and literature sources. All specimens were accessioned into the collections of the National Museums of Tanzania (NMT), Dar es Salaam.

Digitization.
Original records of the collected materials were written into a hard-copy catalog that is kept at the National Museums of Tanzania (NMT), Dar es Salaam, Tanzania. Entries from the paper catalog were digitized into seven spreadsheet files, one for each year with a field campaign (1998, 1999, 2000, 2001, 2003, 2004 and 2005). In all there were eight field campaigns, with two separate campaigns in 2000 (one in January-February and another in August). Data from the spreadsheets were imported into the Paleo Core data repository (http://paleocore.org), aligned to standard fields and harmonized to established vocabularies and formats as described below. In total 13720 records were read from the spreadsheets. Of these, 10 were deleted as duplicates and 6 records were added as the result of splitting bulk records, resulting in the final count of 13716 data records (Table 1).
Alignment and import. Data from the digitized spreadsheets were mapped to a set of verbatim fields that record the original values from the spreadsheets in the Paleo Core database. The data in the verbatim fields were then processed and used to populate the cleaned fields in the database (Table 2). Two (2) of the spreadsheet columns, "Tray" and "Published", contained no data and were dropped.
Cleaning and harmonization. For each of the 17 columns imported into Paleo Core the data were cleaned and, where appropriate, harmonized to a data encoding scheme and structured vocabulary. This process is described in detail below for each field. Alignment and harmonization were automated in Python, so that every step in the process is documented in code and reproducible from the original Excel files.

Data Records
Data files. This dataset comprises a single data file in comma delimited format (.csv). The first row is a header of column names matching standard terms. The dataset is available for download from the Paleo Core data repository at: https://paleocore.org/projects/eppe/ and the figshare data repository 16 at: https://doi.org/10.6084/m9.figshare.8847935.v2.
Field definitions. The data fields (columns) included in this dataset are of two types: verbatim fields and cleaned fields. The verbatim fields contain the uncleaned, digitized data copied from the spreadsheets. They provide a record of the digitized version of the paper specimen catalog. The cleaned fields contain data cleaned and harmonized from the verbatim data and mapped to one of the standards listed in Table 3. Details about the processing are described in the sections describing each field, and in the Python code for the import script.
All field names are presented in snake_case (i.e. in lower-case and words are joined by underscores) to standardize their presentation and promote readability.
Verbatim fields. The spreadsheets contained 19 columns of data. Seventeen (17) columns were copied, unmodified, into corresponding "verbatim" fields in the final dataset (Table 2). Dates in the spreadsheets were a mixture of string values and integer date formats, which were parsed accordingly and converted to a standardized format in the database; all other fields were copied, unmodified, as text. Two (2) of the spreadsheet columns, "Tray" and "Published", contained no data and were dropped.
Cleaned fields. In addition to the verbatim fields the dataset includes the fields listed in Online-only Table 1. The data from the verbatim fields were cleaned, validated and used to populate these fields. Unless otherwise indicated, each field is defined according to the Darwin Core standard 17. One field, verbatim_element, does not conform well to an existing Darwin Core term and was aligned to the term PartOfOrganism from the ABCD-EFG data standard. Two of the taxonomic fields, verbatim_phylum_subphylum and verbatim_tribe, also are not represented by Darwin Core terms. The column for verbatim_phylum_subphylum was divided into dwc:phylum and subphylum; the latter is not standard but its definition is similar to that of other taxonomic fields in Darwin Core.

Similarly, verbatim_tribe was cleaned and transferred to tribe. The tribe field is particularly significant in the analysis of Plio-Pleistocene bovid faunas where many fossils cannot be identified to genus level but can be identified to ecologically informative tribe level designations. A complete listing of all the cleaned fields and their definitions is provided in Online-only Table 1. Additional comments about each field are provided in the subsequent sections.
Definition -dwc:catalogNumber. Catalog Number is the primary key (pk) for the EPPE Laetoli Database and as such is unique for all records, but it is not guaranteed to be globally unique outside this dataset. Catalog numbers match the values written on the fossil specimens and take the form shown in Table 4, where EP stands for Eyasi Plateau, <item_number> indicates a three- or four-digit unique (to this dataset) integer, [a-z] indicates an optional, lower-case, lettered part of a specimen (Example 2) and yy indicates a two-digit year. Item numbers less than 100 have leading zeros to the hundreds place, as shown in Example 1, though values may expand to the thousands as indicated in Example 2. There is always a single space between EP and the item number. Leading zeros were retained for consistency with published specimen numbers.
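The catalog number format described above lends itself to a regular-expression check. The following is a minimal sketch, not the expression used in the import script: the pattern is inferred from the prose description, and the "/" separator before the year is an assumption based on catalog numbers such as EP 120/98 quoted elsewhere in this paper. The names CATALOG_NUMBER_RE, is_valid_catalog_number and find_duplicates are hypothetical.

```python
import re

# Hypothetical pattern inferred from the text: "EP", one space, a three- or
# four-digit item number, an optional lower-case part letter, "/" and a
# two-digit year. This is NOT the authors' regular expression.
CATALOG_NUMBER_RE = re.compile(r"EP \d{3,4}[a-z]?/\d{2}")


def is_valid_catalog_number(value: str) -> bool:
    """Return True if the whole string matches the assumed format."""
    return CATALOG_NUMBER_RE.fullmatch(value) is not None


def find_duplicates(values):
    """Return the sorted catalog numbers that occur more than once."""
    seen, duplicates = set(), set()
    for value in values:
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    return sorted(duplicates)
```

Under these assumptions, a value such as EP 120A + B/98 fails the format check, while EP 120/98 passes; the uniqueness check is a separate pass over the whole column.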
Definition -dwc:institutionCode. All values for this field are the string, 'NMT' for the National Museums of Tanzania.
Collection code. Field name -collection_code. Definition -dwc:collectionCode. All values for this field are the string 'EP' for Eyasi Plateau. This collection code distinguishes these collections from others made at Laetoli, e.g. earlier Leakey collections under code 'LAET' or 'LIT'.
Definition -dwc:occurrenceRemarks. This field is copied from verbatim_comments. No additional processing was applied to the verbatim data.
Event date. Field name -event_date. Definition -dwc:eventDate. The values for event date are derived from verbatim_date_discovered. In the spreadsheet files these were stored either as numeric dates, where the value is the number of days (or fractions thereof) since a designated start date stored in the workbook, or as strings in the format dd/mm/yy, where dd is the day, mm the month and yy the year, each as two digits. Dates were converted to Python date objects using the xlrd Python library, with the conversion chosen according to the data type of the spreadsheet cell (date type vs string type). Each record was validated to confirm that its date fell between 1998 and 2005 inclusive.
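The two date encodings and the year-range check can be illustrated with a short sketch. It uses only the Python standard library as a stand-in for xlrd, so it is not the import script itself: the function name parse_event_date is hypothetical, and the epoch table assumes the two conventional Excel date systems (1900-based, valid for dates after February 1900, and 1904-based).

```python
from datetime import date, datetime, timedelta

# Assumed epochs for the two Excel date systems (datemode 0 and 1).
EXCEL_EPOCHS = {0: datetime(1899, 12, 30), 1: datetime(1904, 1, 1)}


def parse_event_date(cell_value, datemode=0):
    """Convert a numeric serial date or a dd/mm/yy string to a date."""
    if isinstance(cell_value, (int, float)):
        # Numeric cell: days (possibly fractional) since the workbook epoch.
        moment = EXCEL_EPOCHS[datemode] + timedelta(days=float(cell_value))
        parsed = moment.date()
    else:
        # String cell: two-digit day, month and year separated by slashes.
        parsed = datetime.strptime(cell_value.strip(), "%d/%m/%y").date()
    # Validation step described in the text: 1998 to 2005 inclusive.
    if not 1998 <= parsed.year <= 2005:
        raise ValueError(f"date outside 1998-2005: {parsed.isoformat()}")
    return parsed
```

Raising an error on out-of-range years is one possible design; flagging the record for manual review would serve the same purpose.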

Basis of record. Field name -basis_of_record.
Definition -dwc:basisOfRecord. This is a Darwin Core term used to indicate the type of data record. For this dataset all records have the value 'FossilSpecimen', which is from the recommended Darwin Core type vocabulary.

Part of organism. Field name -part_of_organism.
Definition -abc:PartOfOrganism. The part_of_organism field records a free-text description of the fossil elements preserved for each specimen. As there is no Darwin Core term to handle free-text anatomical element descriptions, this field definition is drawn from the Access to Biological Collections Data standard (ABCD) Extended for Geology (EFG). Values in the description field were pared down to 2293 unique descriptions from the 3976 unique entries in verbatim_element. The reduction was accomplished by standardizing or expanding abbreviations for anatomical elements, parts and sides.
Definition -dwc:organismQuantity. The number of identified fossil specimens (NISP) included with each record. The values in this field derive from the item counts appearing in parentheses in verbatim_element, e.g. "Proximal Ulnae (3)". The anatomical description is preserved in part_of_organism and the number of specimens appearing in parentheses is recorded in organism_quantity as an integer value, without parentheses.
Definition -dwc:organismQuantityType. This field is set to 'NISP' for all records, indicating the number of identified specimens as the quantity expressed in Organism Quantity.
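Separating the parenthesized item count from the anatomical description can be sketched as follows. This is an illustrative example only: the function name split_element_count is hypothetical, and returning None when no count is present is an assumption, since the text does not state what value is recorded in that case.

```python
import re

# A trailing integer in parentheses, e.g. "Proximal Ulnae (3)".
COUNT_RE = re.compile(r"^(?P<description>.*?)\s*\((?P<count>\d+)\)\s*$")


def split_element_count(verbatim_element):
    """Split a verbatim element string into (description, count)."""
    text = verbatim_element.strip()
    match = COUNT_RE.match(text)
    if match is None:
        # Assumption: no parenthesized count means no quantity is inferred.
        return text, None
    return match["description"], int(match["count"])
```

For the example given in the text, split_element_count("Proximal Ulnae (3)") yields the description "Proximal Ulnae" and the integer 3.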
Country. Field name -country.
Definition -dwc:country. All values for this field are the string 'Tanzania'.
Locality. Field name -locality. Definition -dwc:locality. A locality is the place within the Laetoli project area from which a specimen was recovered. Harrison and Kweka 15 list 60 fossil localities at Laetoli, which they group into three major areas: Laetoli, Kakesio, and Noiti-Esere (Fig. 1). Three localities (Ndoroto, Olaltanaudo, Oleisusu) fall outside these areas. The 219 unique entries in verbatim_locality were cleaned to match one of the 65 locality terms in the Laetoli locality vocabulary. The additional 5 entries in the vocabulary correspond to conflated values such as 'Kakesio 1-6', which indicate that the fossils came from one of the 6 Kakesio localities but it is unclear which one. Locality names follow one of the formats shown in Table 5.
Locality place and number are always separated by a single space, and ranges are indicated by a single en dash with no spaces around it. All entries were stripped of leading and trailing whitespace.
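The formatting rules for locality names (single space between place and number, en dash for ranges, no surrounding whitespace) can be expressed as a small normalization step followed by a vocabulary lookup. The sketch below is illustrative and is not the harmonization code used for the dataset: the function names are hypothetical, the handling of '#' follows the 'Kakesio #1' example given in this paper, and LOCALITY_VOCABULARY holds only two terms where the real vocabulary has 65.

```python
import re

EN_DASH = "\u2013"

# Illustrative two-term subset of the 65-term locality vocabulary.
LOCALITY_VOCABULARY = {"Kakesio 1", "Kakesio 1" + EN_DASH + "6"}


def normalize_locality(verbatim):
    """Apply the formatting rules described in the text to one entry."""
    text = verbatim.strip()                # drop leading/trailing whitespace
    text = text.replace("#", "")           # 'Kakesio #1' -> 'Kakesio 1'
    text = re.sub(r"\s+", " ", text)       # single spaces only
    # A hyphen or en dash between digits becomes an unspaced en dash.
    text = re.sub(r"(?<=\d)\s*[-\u2013]\s*(?=\d)", EN_DASH, text)
    return text


def is_known_locality(verbatim):
    """Check the normalized value against the (subset) vocabulary."""
    return normalize_locality(verbatim) in LOCALITY_VOCABULARY
```

Entries that still fail the lookup after normalization would need a manual mapping to a vocabulary term.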
Bed. Field name -bed.
Definition -dwc:bed. Each fossil derives from a stratigraphic unit or range of units, which comprise the geological context for the fossil. Most collections were surface finds and their provenience is inferred from the surrounding sediments, adhering matrix, and preservation. Geological samples from volcanic tuffs were submitted for radiometric analysis to determine geochronological age controls for the site 3. The original 196 unique entries in verbatim_horizon were pared down to 34 unique entries in the cleaned bed field and the values correspond to the units shown in Fig. 2.
Definitions -dwc:minimumChronometricAge, dwc:maximumChronometricAge. The age fields are drawn from the Chronometric Age extension to Darwin Core. The minimum and maximum are based on the absolute age estimates for each unit as illustrated in Fig. 2. The dates provided are the best estimates (central tendency) for the minimum and maximum age respectively. Most ages were determined using 40Ar/39Ar radiometric dating, with Bayesian interpolation for beds between dated tuffs 2,18. Dates for the Upper Ngaloba Beds are based on amino acid racemization while dates for the Lower Ngaloba Beds are based on biochronology. Each fossil was ascribed dates based on the bed or interval of beds from which it was recovered.
All dates for minimum and maximum chronometric age are provided in Ma (megaannum, i.e. millions of years).

Chronometric age uncertainty in years. Field name -chronometric_age_uncertainty_in_years.
Definition -dwc:chronometricAgeUncertaintyInYears. While the max and min ages are provided in Ma, the uncertainty is provided in years to conform with the standard. In cases where uncertainty differed between minimum and maximum ages, or where uncertainty is known only for one of the values, the greatest uncertainty is reported. For example, fossils found in the Laetolil Beds, Upper Unit, Between Tuffs 7-8, are bracketed between a maximum age of 3.66 Ma and a minimum age of 3.631 ± 0.018 Ma. The maximum age is interpolated and the minimum age is based on argon-argon analysis and has a standard error of 0.018 Ma. The reported uncertainty for this example is thus 18000 years.
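The rule for the reported uncertainty (take the greatest known uncertainty and convert it from Ma to years) amounts to a one-line calculation, sketched here with a hypothetical function name. Unknown uncertainties are represented as None, which is an assumption of this example.

```python
def uncertainty_in_years(*uncertainties_ma):
    """Return the greatest known uncertainty, converted from Ma to years."""
    known = [u for u in uncertainties_ma if u is not None]
    if not known:
        return None  # no uncertainty is known for either bound
    # 1 Ma = 1,000,000 years; round() removes floating-point noise.
    return round(max(known) * 1_000_000)
```

For the worked example above, the interpolated maximum age has no stated uncertainty and the minimum age has a standard error of 0.018 Ma, so uncertainty_in_years(None, 0.018) gives 18000.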
Definitions -dwc:kingdom, dwc:phylum, dwc:class, dwc:order, dwc:family, dwc:genus, dwc:specificEpithet. The taxonomic fields, kingdom, phylum, class etc., record preliminary identifications for each specimen at the designated taxonomic rank and are derived from the verbatim taxonomic fields. Spelling and typographical errors were corrected, but no attempt was made to update taxonomic assignments. Values from the verbatim_phylum_subphylum field were split into the cleaned fields phylum and subphylum accordingly. There are no standard terms in Darwin Core or ABCD-EFG for the taxon ranks subphylum and tribe. These ranks are included because of their significance in paleoanthropological faunal analysis. Their definitions are as for the other taxon fields.
Definition -dwc:scientificName. This field records the full name at the lowest taxonomic level available for the specimen. It is derived from the cleaned taxonomic fields (unlike the taxonomic fields, which derive directly from the verbatim taxonomic fields). For genus and above it provides a single word and for species the complete species name (binomen). An important note about this field is that it does not include designations of uncertainty. For example, Australopithecus cf. afarensis will have scientific_name Australopithecus afarensis and identification_qualifier cf. afarensis, as per the recommended best practice under Darwin Core. Users are encouraged to consult the Darwin Core documentation for dwc:scientificName and dwc:identificationQualifier for definitions and details.
Definition -dwc:identificationQualifier. This field is derived from the verbatim taxonomic fields. The verbatim taxonomic fields were searched for instances of '?', 'cf.', 'aff', 'indet.', 'sp.' and 'nov.' and, where found, the taxonomic field and identification_qualifier field were updated accordingly. The import script includes detailed notes on the regular expression searches and parsing used to update this field. Generally, the identification qualifiers are taken to follow common convention for open nomenclatures sensu Sigovini et al. 19. Users are referred to the Darwin Core documentation on dwc:scientificName and dwc:identificationQualifier to understand their definition and use.
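The separation of a qualifier from a name can be sketched for the one case spelled out in the text, Australopithecus cf. afarensis. This example is a simplified illustration and not the import script's parsing: the function name split_qualifier is hypothetical, only the 'cf.', 'aff' and '?' qualifiers placed before a species epithet are handled, and the treatment of 'indet.', 'sp.' and 'nov.' is deliberately left out because the text does not specify it.

```python
import re

# Genus, then a qualifier, then a species epithet (simplified assumption).
QUALIFIED_NAME_RE = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)\s+(?P<qual>cf\.|aff\.?|\?)\s*(?P<epithet>[a-z]+)$"
)


def split_qualifier(verbatim_name):
    """Return (scientific_name, identification_qualifier)."""
    text = verbatim_name.strip()
    match = QUALIFIED_NAME_RE.match(text)
    if match is None:
        return text, None  # no qualifier recognized by this sketch
    scientific_name = f"{match['genus']} {match['epithet']}"
    qualifier = f"{match['qual']} {match['epithet']}"
    return scientific_name, qualifier
```

With this sketch, "Australopithecus cf. afarensis" is split into the scientific name "Australopithecus afarensis" and the qualifier "cf. afarensis", matching the example in the text.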
Definition -dwc:taxonRank. This field was updated by analysing the taxon fields to identify the most precise rank for which there is a value. All ranks are presented in lower case.
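Finding the most precise rank with a value is a simple scan from the lowest rank upward. In the sketch below the function name taxon_rank is hypothetical, the record is assumed to be a dictionary keyed by the cleaned field names, and the field name specific_epithet (reported as rank 'species') is an assumption based on the Darwin Core term listed above.

```python
# (field name, rank label) pairs ordered from broadest to most precise.
RANK_FIELDS = [
    ("kingdom", "kingdom"),
    ("phylum", "phylum"),
    ("subphylum", "subphylum"),
    ("class", "class"),
    ("order", "order"),
    ("family", "family"),
    ("tribe", "tribe"),
    ("genus", "genus"),
    ("specific_epithet", "species"),  # assumed field name
]


def taxon_rank(record):
    """Return the lower-case label of the most precise populated rank."""
    for field, label in reversed(RANK_FIELDS):
        value = record.get(field)
        if value and value.strip():
            return label
    return None  # no taxonomic field has a value
```

A bovid identified only to tribe, with empty genus and species fields, would therefore receive the rank 'tribe'.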
Definition -dwc:taxonRemarks. The values in this field were updated for cases where taxonomic changes were made such that taxonomic fields differ from verbatim taxon fields.
Definition -Remarks describing potential or known problems with a record. This field derives from verbatim_problems and provides free-text remarks about possible or known problems associated with a data record. Twenty records have entries for verbatim_problems. Two of these, which referred to duplicate catalog numbers, were removed. The remaining 18 problem remarks indicate potential missing or problematic fossil specimens. There is no standard term in Darwin Core or ABCD-EFG that matches this field. It could be accommodated in remarks, but we preferred to keep problem remarks separate to facilitate search.

Technical Validation
Automated validation was conducted for catalog_number, event_date, locality, and the taxonomic fields.
Catalog number. All catalog_number strings were validated for consistency of formatting and uniqueness. Entries in the catalog_number field were copied from verbatim_specimen_number and validated against a Python regular expression.
Splits. Five items for bulk collections contained multiple or mixed taxa and had to be split into parts. The splits added 6 new items to the catalog (Table 6).
Corrections. Six (6) specimens raised validation errors (Table 7). Three (3) of these specimens had an entry in catalog_number that did not match the formatting regular expression. The first, with verbatim specimen number EP 120A + B/98 is an associated set of left and right upper teeth and skull fragments. These were published 20 as parts A and B but exact designation for all elements was not given so the catalog number is merged in the official version of the catalog to simply EP 120/98. The other formatting errors are typos in the year suffix, which were corrected. Another three (3) were duplicate catalog numbers resulting from digitization errors. These were corrected as shown in Table 7.
Deletions. Ten (10) records were deleted from the dataset. Seven (7) records were duplicate entries with identical data (Table 8). Two (2) pairs of duplicate records were the result of emended taxonomic identifications (Table 9, items 1-2). The emended data were retained, the earlier version deleted, and a note of the change was added to the taxonomic remarks field. One (1) duplicate stems from an incorrect entry, which was deleted.
Locality. Locality values were validated against the Laetoli locality vocabulary listed in Online-only Table 2. The Verbatim Locality column indicates the corresponding entries in the verbatim_locality field that were harmonized to a common locality name. For example 'Kakesio #1' and 'Kakesio #1' were both harmonized to the value of 'Kakesio 1'. Some verbatim entries are identical except for trailing whitespaces.
Taxonomic fields. Entries in the taxonomic fields were validated for spelling and taxonomic placement (in the appropriate hierarchy) against the iDigBio taxonomic backbone at genus level and above. Taxonomic validation was used to verify taxa, their spellings and their appropriate ranks. Three genera were not found in the iDigBio taxonomic backbone; these were confirmed against the original print catalog and retained.

Usage Notes
These data represent the primary digital version of the specimen catalog and are suitable for analyses on differences between localities and comparisons with other fossil catalogs from Laetoli and other sites.
The taxonomic identifications and item descriptions are provisional and should be used with care. Taxonomic identifications are provided primarily to assist in data discovery and to assist taxonomic specialists and systematists in identifying specimens in their area of expertise or that might be relevant for subsequent analysis. Quantitative analyses making use of taxonomic information at higher levels (e.g. tribe and above) may be appropriate but lower-level, fine-grained taxonomic analysis using these data is probably not appropriate without consulting published analyses of the fossils conducted by specialists 5.
Similarly, the descriptions of anatomical elements preserved are meant as a guide to data discovery and further analysis. Quantitative analysis of element preservation or taphonomic processes based on these data is not recommended without further reference to published analyses of the specimens or confirmation with the physical specimens.
The digital version of this dataset is public and distributed under a Creative Commons CC-0 license 16. The physical fossil specimens are under the jurisdiction of the National Museums of Tanzania and the Antiquities Division of the Ministry of Natural Resources and Tourism.