Background & Summary

Extracting, transforming, and distributing natural resources and finished goods between individuals and groups have always been important aspects of technological, economic, and social behaviors in human societies1,2,3,4. Such material aspects of cultures can be inferred through provenance studies, which reconstruct the movements of materials and artefacts across space. For this purpose, archaeologists have, for more than 40 years, regularly used petrographic and geochemical analyses to characterise the geological provenance of raw materials and stone artefacts and to reconstruct patterns of exchange based on hard evidence5,6,7. Geochemical techniques have proven to be the most efficient and reliable way to fingerprint raw material sources and artefacts, thereby providing reproducible and comparable results8,9,10. Furthermore, geochemical data are quantitative and can therefore be examined with statistical methods11,12 or by using, for example, well-known principles of petrogenesis and mantle source evolution.

Owing to improvements in analytical techniques and the increasing use of geochemical sourcing, the production and publication of archaeological compositional data have grown exponentially. It is now recognized that using large source data compilations can lead to more efficient and cost-effective research planning7,10,13. Sharing source data compilations facilitates assigning unambiguous provenance to artefacts because it enables a better understanding of the geochemical variability of sources throughout a given study region and also reveals potential geochemical differences between sources14, especially for artefacts found in either very homogeneous or complex petrogenetic contexts15,16,17. Furthermore, access to large geochemical datasets of archaeological artefacts will lead to more robust and larger-scale modelling of prehistoric exchange systems18,19,20. However, the current lack of an appropriate global data management platform makes it difficult to access and reference relevant archaeological datasets and often leads to duplication of individual efforts.

In this data descriptor, we introduce the Pofatu Database, a curated and open-access database of geochemical data on archaeological materials and sources, supported by comprehensive contextual information about individual samples and artefacts, including their archaeological provenance, and by a thorough description of analytical procedures. The goals of the database are (i) to provide easy access to published compositional data of archaeological sources and artefacts, (ii) to assemble contextual archaeological information for each individual sample, (iii) to facilitate reuse of existing data and encourage appropriate crediting of original data sources, and (iv) to ensure reproducibility and comparability by documenting instrumental details, analytical procedures, and the reference materials used for calibration or quality control. We provide compositional data as well as contextual metadata for 7759 individual samples, with a current focus on archaeological sites across the Pacific Islands (Fig. 1). Our vision is an inclusive and collaborative data resource that establishes an operational framework for data sharing in archaeometry, progressively incorporates more datasets, and grows into a more global project similar to other online repositories for geological materials already available through a wide geoinformatics network21,22,23,24. Furthermore, by using a common non-proprietary file format (CSV) and an open-source system for storage and version control (Git and a GitHub repository), the Pofatu Database provides an analysis-friendly environment that enables transparency and built-in reproducibility of analytical tasks25.

Fig. 1

Locations of samples already released in the Pofatu Database.

Methods

The data can be accessed and downloaded from the Zenodo archive (https://doi.org/10.5281/zenodo.3670127) and browsed in the Pofatu web application (https://pofatu.clld.org/). The database was designed to contain geochemical compositional data and extensive contextual metadata (sample identification, archaeological provenance, analytical methods, and related bibliographical references), which we compiled to ensure further reuse and reinterpretation of previous provenance analyses (Fig. 2).

Fig. 2

Structure of the Pofatu Database.

The compositional data contain all analytical values for major oxide and trace element compositions, radiogenic and stable isotope ratios, and geochronology. Sample metadata include a unique identifier and a description of sample condition and preparation. Archaeological metadata provide information on the geographical, cultural, and stratigraphic context of the parent artefacts (name, category, and attributes), the collection origin (collector, date and nature of field research, storage location), and a description of the site and stratigraphic context (name, code, context, stratigraphic position). The reference metadata list all bibliographical sources of the data and metadata information26–173. Methodological metadata ensure control of data quality and include information about sample preparation and the analytical procedure (technique, laboratory, analyst), as well as the accuracy and reproducibility of published analyses (errors, precision, standard values, correction procedures).
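For orientation, the relationship between these metadata groups and a single sample record can be pictured with the following minimal sketch; the field names and values are purely illustrative and do not reflect the actual Pofatu schema.

```python
# Purely illustrative sketch of how the metadata groups described above
# relate to a single sample record; field names are NOT the actual schema.
record = {
    "sample_id": "EXAMPLE-001",                               # unique sample identifier
    "composition": {"SiO2": 48.2, "MgO": 7.1, "Zr": 151.0},   # oxides (wt%), trace elements (ppm)
    "archaeology": {"artefact": "adze", "site": "Example site", "context": "surface"},
    "method": {"technique": "ICP-MS", "laboratory": "Example lab", "analyst": "A. Person"},
    "source": "author2020",                                   # BibTeX citation key of the original publication
}
```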

Data acquisition

All data and metadata in the Pofatu Database, and included in this data descriptor release, are linked to published resources. Geochemical datasets are extracted from peer-reviewed material, while contextual metadata include information gathered from peer-reviewed articles, monographs, book chapters, and publicly available institutional reports. Original sources are referenced in the repository and available as a BibTeX database file, suitable for importing into reference management software. Each geochemical dataset is associated with a unique method identifier, defined from the set of methodological metadata available for a specific set of values.
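As a minimal sketch of working with that bibliography programmatically, the BibTeX file can be parsed with a library such as bibtexparser; the file name used below is an assumption, not necessarily the name used in the repository.

```python
# Minimal sketch: parsing the Pofatu BibTeX source file with bibtexparser
# (v1 API). The file name "sources.bib" is an assumption; use the path of
# the BibTeX file actually shipped with the repository.
import bibtexparser

with open("sources.bib", encoding="utf-8") as bibfile:
    bib_db = bibtexparser.load(bibfile)

# Each entry carries the citation key that links datasets to their sources.
for entry in bib_db.entries[:5]:
    print(entry["ID"], entry.get("title", ""))
```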

The process of data acquisition includes:

Data submission: Data and metadata are gathered and stored in normalized tables linked by foreign keys. These interrelated tables each contain sets of information on (i) Data source, (ii) Sample and archaeological provenance, (iii) Compositional data, (iv) Primary analytical and method-specific metadata. The Pofatu Database is curated and updated on a regular basis. New datasets and complementary information on previously documented datasets can be submitted using the Data Submission Template and Guidelines available online (https://pofatu.clld.org/about).

Data validation: The content of each table is entered manually, but several fields are constrained by ontologies implemented as built-in form validation in the submission template. Data are also validated using functionality implemented in the Python package pypofatu, which imposes suitable constraints on values such as geographic coordinates (see the sketch following this list).

Data output: The manually curated “raw” data undergoes an automated processing workflow (implemented in the Python package pypofatu) to create output formats ready for distribution.
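The following is an illustrative sketch of the kind of constraint mentioned in the validation step above, such as checking geographic coordinates; it is not the actual pypofatu implementation.

```python
# Illustrative sketch of a coordinate constraint, similar in spirit to the
# validation described above; NOT the actual pypofatu code.
def validate_coordinates(latitude: float, longitude: float) -> None:
    """Raise ValueError if the coordinates fall outside valid ranges."""
    if not -90 <= latitude <= 90:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180 <= longitude <= 180:
        raise ValueError(f"longitude out of range: {longitude}")

validate_coordinates(-17.5, -149.8)  # a plausible Pacific Island location passes
```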

For long-term accessibility, the data is converted to a set of interrelated CSV files, described by metadata encoded as JSON-LD (cf. https://www.w3.org/TR/json-ld/, accessed January 30, 2020), following the World Wide Web Consortium (W3C) recommendations174,175. Because the compiled data consist exclusively of line-based text files (in CSV format), they are well suited for long-term access: they place minimal requirements on processing software and allow a transparent history of changes with the version control software Git (cf. https://git-scm.com/, accessed January 30, 2020).
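As a minimal sketch, any of the released CSV tables can be read with standard tooling; the file name and column name below are assumptions, and the JSON-LD metadata should be consulted for the actual table and column names.

```python
# Minimal sketch of reading one released CSV table with Python's standard
# library. "samples.csv" and the column "sample_name" are assumed names;
# check the JSON-LD metadata for the actual tables and columns.
import csv

with open("samples.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row["sample_name"])
```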

Data Records

A release of the Pofatu Database is available from the Zenodo archive176. Details of the parameters and measurements reported in the database are summarized in Online-only Table 1. Unique identifiers for samples, artefacts and analytical methods were created for each data record, and used as primary and foreign keys to define relationships between tables.
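Because records are linked through these primary and foreign keys, tables can be joined directly; the sketch below uses pandas with assumed file and column names ("samples.csv", "measurements.csv", "sample_id") purely for illustration.

```python
# Sketch of joining two released tables on a shared sample identifier.
# File names and the key column "sample_id" are assumptions; the actual
# names are documented in the dataset's metadata.
import pandas as pd

samples = pd.read_csv("samples.csv")
measurements = pd.read_csv("measurements.csv")

# Attach sample-level context to each measurement via the foreign key.
merged = measurements.merge(samples, on="sample_id", how="left")
print(merged.head())
```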

Technical Validation

Quality control of data and editorial procedures include:

Data review: Database contributors who submit a new dataset are asked to act as editors of that specific dataset and to review it for missing or inaccurate data. The content of new datasets is systematically cross-checked against the content of the original sources and against potentially related content. Authors are contacted when information is missing or when clarifications are needed.

Duplicate detection: Since Pofatu assigns semantic, unique identifiers to the objects in the database, and links data from additional tables using these keys (following the recommendations by Wilson and colleagues177), data consistency can be checked automatically, e.g. detecting multiple conflicting measurements of the same parameter in the same analysis, or conflicting sample metadata.
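An illustrative sketch of such a consistency check is given below (it is not the pypofatu code): measurements are grouped by analysis and parameter, and groups with more than one distinct value are flagged as conflicts. Table and column names are assumptions.

```python
# Illustrative conflict check, in the spirit of the duplicate detection
# described above; NOT the pypofatu implementation. Column names are assumed.
import pandas as pd

measurements = pd.read_csv("measurements.csv")
conflicts = (
    measurements.groupby(["analysis_id", "parameter"])["value"]
    .nunique()
    .reset_index(name="n_distinct_values")
)
# Rows with more than one distinct value indicate conflicting measurements.
print(conflicts[conflicts["n_distinct_values"] > 1])
```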

User feedback: Data and metadata issues can be reported to pofatu@shh.mpg.de. Editors will be contacted if an issue with one of their datasets is reported.

Usage Notes

The Pofatu Database provides an analysis-friendly environment178 that enables transparency and built-in reproducibility of analytical tasks, which can be carried out with freely available software or web browsers25.

Since the metadata provided with the CSV-formatted data files describe data types as well as the relations between the tables making up the dataset, the data is automatically loaded into an SQLite database (cf. https://sqlite.org/appfileformat.html, accessed January 29, 2020) for the convenience of users. This SQLite database is contained in a single file that can be queried with a high-level query language; its content is readily accessible, it is cross-platform and performant, and it can be used from multiple programming languages.
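As a minimal sketch, the SQLite file can be opened with Python's built-in sqlite3 module; the database file name used below is an assumption, and listing the tables first avoids guessing at the schema.

```python
# Minimal sketch of opening the SQLite export with Python's built-in
# sqlite3 module. The file name "pofatu.sqlite" is an assumption.
import sqlite3

con = sqlite3.connect("pofatu.sqlite")
# List the available tables instead of guessing the schema.
for (name,) in con.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
    print(name)
con.close()
```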

The Python package pypofatu, used for curating the dataset, also provides functionality (via a built-in SQLite driver) to access and query the data from Python programs through the pypofatu API, and facilitates running SQL queries against the SQLite database.

Complex queries can be created in various ways and with different computing environments:

  • using the SQL command line

  • using SQL browsers such as SQLite manager or SQLite reader

  • using R, with SQL code in a notebook or packages such as sqldf or dplyr179,180

  • using the Datasette tool181
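For example, a slightly more complex aggregation can be expressed as a single SQL query and run from any of the environments listed above; the sketch below runs it from Python, with assumed table and column names ("samples", "site_name").

```python
# Example aggregation query run against the SQLite export from Python;
# the same SQL works from the command line, an SQL browser, R, or Datasette.
# Table and column names are assumptions; adapt them to the actual schema.
import sqlite3

QUERY = """
    SELECT site_name, COUNT(*) AS n_samples
    FROM samples
    GROUP BY site_name
    ORDER BY n_samples DESC
"""

con = sqlite3.connect("pofatu.sqlite")
for site, n in con.execute(QUERY):
    print(site, n)
con.close()
```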

Data usage instructions are provided in the GitHub repository where the dataset is curated (cf. https://github.com/pofatu/pofatu-data, accessed February 6, 2020). A “cookbook” collects shareable pieces of code and how-to instructions for querying the relational database (cf. https://github.com/pofatu/pofatu-data/blob/master/doc/cookbook.md, accessed February 6, 2020), and users are invited to contribute the “recipes” they used for “cooking” with Pofatu.