Pofatu, a curated and open-access database for geochemical sourcing of archaeological materials

Compositional analyses have long been used to determine the geological sources of artefacts. Geochemical “fingerprinting” of artefacts and sources is the most effective way to reconstruct strategies of raw material and artefact procurement, exchange or interaction systems, and mobility patterns during prehistory. The efficacy and popularity of geochemical sourcing have led to many projects using various analytical techniques to produce independent datasets. In order to facilitate access to this growing body of data and to promote comparability and reproducibility in provenance studies, we designed Pofatu, the first online and open-access database to present geochemical compositions and contextual information for archaeological sources and artefacts in a form that can be readily accessed by the scientific community. This relational database currently contains 7759 individual samples from archaeological sites and geological sources across the Pacific Islands. Each sample is comprehensively documented and includes elemental and isotopic compositions, detailed archaeological provenance, and supporting analytical metadata, such as sampling processes, analytical procedures, and quality control.


Background & Summary
Extracting, transforming, and distributing natural resources and finished goods between individuals and groups has always been an important aspect of technological, economic, and social behaviors in human societies [1][2][3][4]. Such material aspects of cultures can be inferred with the help of provenance studies, by reconstructing the movements of materials and artefacts across space. For this purpose, archaeologists have used petrographic and geochemical analyses for more than 40 years to characterise the geological provenance of raw materials and stone artefacts and to reconstruct patterns of exchange based on hard evidence [5][6][7]. Geochemical techniques have proven to be the most efficient and reliable way to fingerprint raw material sources and artefacts, thereby providing reproducible and comparable results [8][9][10]. Furthermore, geochemical data are quantitative and can therefore be examined with statistical methods 11,12 or by using, for example, well-known principles of petrogenesis and mantle source evolution.
Due to the improvement of analytical techniques and the increasing use of geochemical sourcing, the production and publication of archaeological compositional data have grown exponentially. It is now recognized that using large source data compilations can lead to more efficient and cost-effective research planning 7,10,13. Sharing source data compilations facilitates assigning unambiguous provenance to artefacts because it enables a better understanding of geochemical variability of sources throughout a given study region and also shows potential geochemical differences between sources 14, especially for artefacts found in either very homogeneous or complex

Methods
The data can be accessed and downloaded from the Zenodo archive (https://doi.org/10.5281/zenodo.3670127) and browsed in the Pofatu web application (https://pofatu.clld.org/). The database was designed to contain geochemical compositional data and extensive contextual metadata (sample identification, archaeological provenance, analytical methods, and related bibliographical references), which we compiled to ensure further reuse and reinterpretation of previous provenance analyses (Fig. 2).
The compositional data contain all analytical values for major oxide and trace element compositions, radiogenic and stable isotope ratios, and geochronology. Sample metadata include unique identifiers and a description of sample condition and preparation. Archaeological metadata provide information on the geographical, cultural and stratigraphic context of the parent artefacts (name, category and attributes), the collection origin (collector, date and nature of field research, storage location), and a description of the site and stratigraphic context (name, code, context, stratigraphic position). The reference metadata list all bibliographical sources of the data and metadata information 26-173. Methodological metadata support control of data quality and include information about the preparation of samples, the analytical procedure (technique, laboratory, analyst), and the accuracy and reproducibility of published analyses (errors, precision, standard values, correction procedures).

Data acquisition.
All data and metadata in the Pofatu Database and included in this data descriptor release are linked with published resources. Geochemical datasets are extracted from peer-reviewed material, while contextual metadata include information gathered from peer-reviewed articles, monographs, book chapters, and publicly available institutional reports. Original sources are coded in the repository and available as a BibTeX database file, suitable for importing into reference management software. Geochemical datasets are associated with a method identifier, which is unique and defined based on the set of available methodological metadata for a specific set of values.
The process of data acquisition includes:
Data submission: Data and metadata are gathered and stored in normalized tables linked by foreign keys. These interrelated tables each contain sets of information on (i) Data source, (ii) Sample and archaeological provenance, (iii) Compositional data, (iv) Primary analytical and method-specific metadata. The Pofatu Database is curated and updated regularly. New datasets and complementary information on previously documented datasets can be submitted using the Data Submission Template and Guidelines available online (https://pofatu.clld.org/about).
Data validation: The content of each table is handled manually, but several fields are constrained by ontologies, which are built into the submission template as form validation. Data is also validated using functionality implemented in the Python package pypofatu, which imposes suitable constraints on data such as geographic coordinates.
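The kind of constraint mentioned for geographic coordinates can be illustrated with a minimal sketch. It does not reproduce the pypofatu implementation: the `Location` class and the `validate_rows` helper are hypothetical names, and the only assumption is that coordinates are given in decimal degrees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Hypothetical record for a sample's find spot (not the pypofatu model)."""

    latitude: float
    longitude: float

    def __post_init__(self):
        # Reject values outside the valid ranges for decimal degrees.
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError(f"latitude out of range: {self.latitude}")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError(f"longitude out of range: {self.longitude}")


def validate_rows(rows):
    """Return (valid locations, error messages) for (lat, lon) pairs."""
    valid, errors = [], []
    for number, (lat, lon) in enumerate(rows, start=1):
        try:
            valid.append(Location(float(lat), float(lon)))
        except (TypeError, ValueError) as exc:
            # Keep the row number so a curator can locate the problem.
            errors.append(f"row {number}: {exc}")
    return valid, errors
```

Collecting errors instead of stopping at the first one suits manual curation, since a whole submission can be reviewed in one pass.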
Data output: The manually curated "raw" data undergoes an automated processing workflow (implemented in the Python package pypofatu) to create output formats ready for distribution.
For long-term accessibility, the data is converted to a set of interrelated CSV files, described by metadata encoded as JSON-LD (cf. https://www.w3.org/TR/json-ld/, accessed January 30, 2020), following the World Wide Web Consortium (W3C) recommendations 174,175. Because the compiled data is exclusively made of line-based text files (in CSV format), it is well-suited for long-term access since it has the lowest requirements on processing software, and provides for a transparent history of changes with the version control software Git (cf. https://git-scm.com/, accessed January 30, 2020).
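How a JSON metadata description can drive the reading of a CSV file may be sketched as follows. The example assumes a description in the style of the W3C metadata vocabulary for tabular data; the file name `samples.csv`, the column names, and the `read_table` helper are hypothetical and are not taken from the Pofatu release.

```python
import csv
import io
import json

# Hypothetical metadata in the style of the W3C tabular-metadata vocabulary.
METADATA = json.loads("""
{
  "@context": "http://www.w3.org/ns/csvw",
  "tables": [
    {
      "url": "samples.csv",
      "tableSchema": {
        "primaryKey": "ID",
        "columns": [
          {"name": "ID", "datatype": "string"},
          {"name": "latitude", "datatype": "decimal"},
          {"name": "longitude", "datatype": "decimal"}
        ]
      }
    }
  ]
}
""")

# Hypothetical file content; a real release would be read from disk.
SAMPLES_CSV = "ID,latitude,longitude\nS1,-17.53,-149.83\nS2,-9.78,-139.03\n"

# Simplification: decimals are mapped to float rather than decimal.Decimal.
CONVERTERS = {"string": str, "decimal": float, "integer": int}


def read_table(metadata, url, text):
    """Parse CSV text, converting each column as declared in the metadata."""
    table = next(t for t in metadata["tables"] if t["url"] == url)
    types = {
        column["name"]: CONVERTERS[column.get("datatype", "string")]
        for column in table["tableSchema"]["columns"]
    }
    return [
        {name: convert(raw[name]) for name, convert in types.items()}
        for raw in csv.DictReader(io.StringIO(text))
    ]
```

Because the type information lives in the metadata and not in the CSV files, the same plain-text data can be loaded into typed structures by any tool that reads the description.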

Data Records
A release of the Pofatu Database is available from the Zenodo archive 176. Details of the parameters and measurements reported in the database are summarized in Online-only Table 1. Unique identifiers for samples, artefacts and analytical methods were created for each data record, and used as primary and foreign keys to define relationships between tables.

Technical Validation
Quality control of data and editorial procedures include:
Data review: Database contributors who submit a new dataset are asked to be the editor of that specific dataset and to engage in a review of potential missing or inaccurate data. The content of new datasets is systematically cross-checked with the content of original sources and with potentially related content. Authors are contacted when information is missing or when clarifications are needed.
Duplicate detection: Since Pofatu assigns semantic, unique identifiers to the objects in the database, and links data from additional tables using these keys (following the recommendations by Wilson and colleagues 177), data consistency can be checked automatically, for example by detecting multiple conflicting measurements of the same parameter in the same analysis, or conflicting sample metadata.
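A check of this kind can be sketched in a few lines. The sketch below is illustrative only and is not the procedure used by the database editors; it assumes that each measurement is available as an (analysis identifier, parameter, value) triple, and `find_conflicts` is a hypothetical helper name.

```python
from collections import defaultdict


def find_conflicts(measurements):
    """Report (analysis, parameter) keys that carry more than one distinct value.

    `measurements` is an iterable of (analysis_id, parameter, value) triples.
    Exact repeats of the same value are treated as harmless duplicates.
    """
    seen = defaultdict(set)
    for analysis_id, parameter, value in measurements:
        seen[(analysis_id, parameter)].add(value)
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}
```

Keying the lookup on the identifiers alone is what makes the check automatic: no fuzzy matching of sample names or site labels is required.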
User feedback: Data and metadata issues can be reported to pofatu@shh.mpg.de. Editors will be contacted if an issue with one of their datasets is reported.

Usage Notes
The Pofatu Database provides an analysis-friendly environment 178 that enables transparency and built-in reproducibility of analytical tasks, which can be carried out with freely available software or web browsers 25.
Since the metadata provided with the CSV-formatted data files describes data types as well as relations between the tables making up the dataset, the data is automatically loaded into an SQLite database (cf. https://sqlite.org/appfileformat.html, accessed January 29, 2020) for the convenience of users. This SQLite database is contained in a single file that can be queried with a high-level query language, has accessible content, is cross-platform and performant, and can be used with multiple programming languages.
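The step from typed, related CSV tables to an SQLite database can be illustrated with Python's standard library. This is a sketch under stated assumptions and not the pypofatu loader: the table names `samples` and `measurements`, their columns, and the sample rows are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content standing in for two related files of a release.
SAMPLES_CSV = "ID,site\nS1,Site A\nS2,Site B\n"
MEASUREMENTS_CSV = (
    "sample_id,parameter,value\nS1,SiO2,48.2\nS1,TiO2,2.9\nS2,SiO2,50.1\n"
)


def build_database():
    """Load both tables into an in-memory SQLite database with a foreign key."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
    conn.execute("CREATE TABLE samples (ID TEXT PRIMARY KEY, site TEXT)")
    conn.execute(
        "CREATE TABLE measurements ("
        " sample_id TEXT NOT NULL REFERENCES samples(ID),"
        " parameter TEXT NOT NULL,"
        " value REAL NOT NULL)"
    )
    samples = [
        (row["ID"], row["site"])
        for row in csv.DictReader(io.StringIO(SAMPLES_CSV))
    ]
    conn.executemany("INSERT INTO samples VALUES (?, ?)", samples)
    measurements = [
        (row["sample_id"], row["parameter"], float(row["value"]))
        for row in csv.DictReader(io.StringIO(MEASUREMENTS_CSV))
    ]
    conn.executemany("INSERT INTO measurements VALUES (?, ?, ?)", measurements)
    conn.commit()
    return conn
```

With foreign key enforcement switched on, a measurement that points to an unknown sample is rejected at insert time, so the relations declared in the metadata are checked by the database itself.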
The Python package pypofatu used for curating the dataset also provides functionality (built-in SQLite driver) that enables access and queries of the data with Python programs or the pypofatu API, and facilitates running SQL queries against the SQLite database.
Complex queries can be created in various ways and with different computing environments:
• using the SQL command line
• using SQL browsers such as SQLite manager or SQLite reader
• using R, with SQL code in a notebook or packages such as sqldf or dplyr 179,180
• using the Datasette tool 181
Data usage instructions are provided in the GitHub repository where the dataset is curated (cf. https://github.com/pofatu/pofatu-data, accessed February 6, 2020). A "cookbook" collects shareable pieces of code and how-to instructions to query the relational database (cf. https://github.com/pofatu/pofatu-data/blob/master/doc/cookbook.md, accessed February 6, 2020), and users are invited to contribute with the "recipes" they used for "cooking" with Pofatu.
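As an illustration of the kind of query such a recipe might contain, the following sketch joins sample metadata with measurements and aggregates by island, using only Python's built-in sqlite3 module. The schema, the column names, and the values are hypothetical and do not reflect the actual layout of the Pofatu SQLite file; against a real release, the table and column names would first be looked up in the data usage instructions.

```python
import sqlite3

# Hypothetical two-table schema with made-up values, held in memory.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE samples (ID TEXT PRIMARY KEY, category TEXT, island TEXT);
CREATE TABLE measurements (
    sample_id TEXT REFERENCES samples(ID),
    parameter TEXT,
    value REAL
);
INSERT INTO samples VALUES
    ('S1', 'SOURCE', 'Island A'),
    ('S2', 'ARTEFACT', 'Island A'),
    ('S3', 'ARTEFACT', 'Island B');
INSERT INTO measurements VALUES
    ('S1', 'SiO2', 48.0),
    ('S2', 'SiO2', 50.0),
    ('S3', 'SiO2', 46.0),
    ('S3', 'TiO2', 3.0);
""")

# Join measurements to their samples, then count and average per island.
QUERY = """
SELECT s.island, COUNT(*) AS n, AVG(m.value) AS mean_value
FROM measurements AS m
JOIN samples AS s ON s.ID = m.sample_id
WHERE m.parameter = ?
GROUP BY s.island
ORDER BY s.island
"""


def mean_by_island(connection, parameter):
    """Return (island, number of values, mean value) rows for one parameter."""
    return connection.execute(QUERY, (parameter,)).fetchall()
```

Passing the parameter name as a bound value (the `?` placeholder) keeps the query text fixed, which makes a recipe easier to share and reuse for other elements or oxides.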

Code availability
The pypofatu Python package is open-source software, maintained on GitHub and distributed via the Python Package Index (https://pypi.org/project/pypofatu), with released versions archived with Zenodo 182. The two output formats listed above are created and stored as part of the GitHub repository where the dataset is curated (https://github.com/pofatu/pofatu-data/releases/tag/v1.0.0), and each release of the dataset is also archived on Zenodo 176. Additionally, the dataset is loaded into a clld 183 web application, providing an online, browsable user interface for "window-shopping" before downloading and using the dataset locally.
Released versions of the Pofatu dataset meet the requirements for FAIR data as laid out by Wilkinson and colleagues 177. The data is findable thanks to Zenodo's integration in the research data landscape on the web, and the metadata we provide. It is accessible via the DOI assigned by Zenodo.