Machine learning-ready remote sensing data for Maya archaeology

In our study, we set out to collect a multimodal annotated dataset for remote sensing of Maya archaeology that is suitable for deep learning. The dataset covers the area around Chactún, one of the largest ancient Maya urban centres in the central Yucatán Peninsula. It includes five types of data records: raster visualisations and a canopy height model derived from airborne laser scanning (ALS) data, Sentinel-1 and Sentinel-2 satellite data, and manual data annotations. The manual annotations (used as binary masks) represent the exact locations and boundaries of three types of ancient Maya structures (class labels: buildings, platforms, and aguadas, i.e. artificial reservoirs) within the study area. The dataset is ready for use with machine learning, including convolutional neural networks (CNNs), for object recognition, object localisation (detection), and semantic segmentation. We provide this dataset to help other research teams develop their own computer vision models for investigations of Maya archaeology or improve existing ones.


Background & Summary
Airborne laser scanning (ALS) surveys have proved crucial for the advancement of knowledge of archaeological "site" distribution, particularly in the forested regions of the ancient Maya [1][2][3], as they have greatly accelerated and expanded traditional archaeological landscape surveys. The research use of ALS in landscape archaeology typically involves the identification, localisation, recording and investigation of natural and cultural features in a variety of, usually interrelated, contexts, including but not limited to the mapping and analysis of settlement, urbanism, agricultural production and water management [4][5][6][7][8][9][10][11].
Archaeologists typically inspect ALS data in the form of raster visualisations, which enhance the perception of surface features [12][13][14]. Human visual analysis and digitisation is time-consuming, and the examination of hundreds of square kilometres can take months, depending on the level of detail, the number of structures, and the recording method [20][21][22][23]. The volume of data makes it difficult to annotate entire datasets, especially if not only the locations of objects but also their shape is to be indicated. The subjectivity of human visual inspection and digitisation and the variability between human interpreters is also an issue 24. There is therefore a pressing need for computer vision methods that can find archaeological objects and delineate their boundaries automatically 25,26. Among the various machine learning approaches, deep convolutional neural networks (CNNs) are the current state of the art for computer vision, but they usually require a large number of already labelled samples for training 27. This makes labelled datasets crucial for developing and testing such methods.
In one of our previous studies, we have already demonstrated that CNNs can classify ancient Maya archaeological objects from DEM visualisations, achieving up to 95% accuracy 28. However, classification models cannot replace manual inspection and labelling, for which semantic segmentation is required. Semantic segmentation is readily applied in remote sensing (a review is given in 27), and even more so in medical imaging, where CNNs often outperform experts [29][30][31][32][33][34].
The original intention for collecting the ALS data in the area around Chactún, one of the largest ancient Maya urban centres known so far in the central lowlands of the Yucatan Peninsula, was to better understand the water management, agriculture, settlement dynamics and socio-political organisation of the ancient Maya living in this area 11,35 .
We generated a labelled dataset for the analysis of ancient Maya archaeology comprising more than 10,000 objects, divided into three classes: buildings, platforms, and aguadas (artificial water basins). We used polygonised outlines of objects to create binary raster masks. The associated multimodal dataset contains data from three remote sensing sources:
• 0.5 m resolution ALS data visualisations 28,
• Sentinel-1 synthetic aperture radar satellite data, and
• Sentinel-2 optical satellite data.
The Sentinel-1 and 2 Earth observation missions are part of the European Union Copernicus Programme. An overview of the experimental workflow used to generate and analyse the data is presented in Fig. 1.
Machine learning has already been applied to archaeological prospection with remote sensing data [37][38][39][40][41]; currently, however, there are only a few CNN-based semantic segmentation studies conducted with ALS data 26,[42][43][44][45][46], and even fewer instance segmentation models published in this particular field 47. Easily accessible, archaeologically labelled datasets suitable for machine learning are therefore extremely rare. We believe that sharing a large labelled dataset that allows semantic segmentation, because it is based on polygonised objects rather than centroids, points, or simple bounding boxes, has great reuse value. This dataset is also unique because it is multimodal and, to date, the only one in the Maya region. Such a rich dataset allows related research groups to develop or improve their own segmentation models. This has already led to improvements in recognition rates, as the dataset was used in the Discover the mysteries of the Maya machine learning competition 48,49. The teams that took part in this machine learning challenge achieved a segmentation performance of more than 0.83 for the intersection over union (IoU, also known as the Jaccard index) when learning from ALS data. However, most teams did not include satellite data in their final model: deep learning from ALS visualisations alone produced better results with much less machine learning engineering effort. By providing a multimodal dataset for wider reuse, we hope other teams can develop new models based on architectures that better harness the information in the satellite data.
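The IoU (Jaccard index) used to score the challenge can be computed directly from binary masks. The sketch below is a generic implementation of the metric, not the challenge's official scoring code:

```python
import numpy as np

def iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection over union (Jaccard index) of two binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred, target).sum() / union)

pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
print(iou(pred, target))  # intersection 2, union 4 -> 0.5
```
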

Methods
Site description. The area around Chactún (Fig. 2) is karstic and therefore lacks perennial water sources and permanent streams. Low hills typically rise up to 30 m above the surrounding seasonal wetlands (bajos). The climate in the Maya lowlands is tropical and isothermal 50,51, and within the elevated interior region rainfall is highly seasonal and spatially variable. Typically, 90% of precipitation occurs during the rainy season 9. The entire study area is covered by natural, unmanaged, tropical semi-deciduous forest and bushes, rarely exceeding 20 m in height. The forest can be classified as primary forest, as there has been no agricultural or grazing activity for a millennium. Before the establishment of the Calakmul Biosphere Reserve in 1989, selective logging for valuable timber and chicle collection were the main economic activities.

Fig. 1 An overview of the experimental workflow used to generate and analyse the data.
Šprajc and his team discovered the urban core of Chactún, composed of three concentrations of monumental architecture, in 2013 52. Temple-pyramids, massive palace-like buildings and two ball courts surround its several plazas. A large rectangular water reservoir lies immediately to the west of the main groups of structures. Ceramics collected from the ground surface, the architectural characteristics and dated monuments indicate that the centre began to flourish in the Preclassic period (c. 1000 BC-250 AD) and reached its climax during the Late Classic (c. 600-1000 AD), playing an important role in the regional political hierarchy 52,53. South of Chactún are Lagunita and Tamchén, both prominent urban centres. Numerous smaller building clusters are scattered on the hills around them 54,55.

Survey information. The whole dataset includes five different types of data records:
• airborne laser scanning (ALS) raster visualisations,
• a canopy height model derived from ALS data,
• Sentinel-1 synthetic aperture radar satellite data,
• Sentinel-2 optical satellite data, and
• data annotations.
Airborne laser scanning data. The main part of the dataset consists of visualised airborne laser scanning data collected with the Titan system by the National Centre for Airborne Laser Mapping (NCALM) at the end (peak) of the dry season in May 2016. Mission planning, data acquisition and data processing were carried out with clear archaeological objectives in mind 56,57. The density of the final point cloud and the quality of the derived elevation model with a 0.5 m spatial resolution (Table 1) proved to be excellent for the detection and interpretation of archaeological features with very clearly defined minute elevation differences.
The technical quality control of the data included verification of the scanning density, the absolute horizontal accuracy (better than 20 cm), the absolute vertical accuracy (better than 15 cm), and the thematic accuracy of the produced elevation model.
Ground points were classified using Terrascan software (version 016.013), which uses an adaptive triangular irregular network densification algorithm 58. The algorithm settings were optimised to remove only the vegetation cover, leaving the remains of past human activities as intact as possible (Table 2). Ground points therefore include remains of buildings, walls, terraces, mounds, chultuns (cisterns), sacbeob (raised paved roads), and drainage channels (Fig. 3). Rare areas without ground returns include aguadas with water. Many landscape features, such as ditches and low field walls, were essentially invisible in the field due to dense vegetation and would most likely have been missed by conventional surface mapping. As revealed by ground-truthing, the elevation model contains very few data collection and processing artefacts (commission and omission errors). Instances of omission errors are limited to smaller objects, such as altars, while commission errors mostly include larger tree trunks. In some places, parts of buildings are misshapen, for example walls being classified as vegetation because a tree is growing from a chamber on top of a pyramidal building.

Fig. 2 Location of the study area with delineated area of the airborne laser scanning mission. Data records were extracted from its southern part (coloured in red); adapted from 28.
We used raster ALS data visualisations to support the human visual interpretation of objects and as a three-band dataset for the data records. ALS raster data visualisations are computed derivatives of a digital elevation model that provide information about the landscape. They can have a purely presentational value or can relate to physical quantities 59. In effect, they facilitate 'reading and exploring' the landscape in search of meaningful information. The primary visualisation was the combined visualisation for archaeological topography (VAT) 13. It blends two combinations of four distinct visualisations: analytical hillshading, slope, positive openness and sky-view factor (SVF) 12, computed with settings for normal and flat terrain (Table 3). The individual visualisation methods are complementary, depicting small topographic variations in different ways, and the combined image preserves their positive characteristics. Before blending, the visualisations are normalised and a custom histogram stretch is applied. Having a single combined visualisation to consider has advantages, including better representation of structures in a wider range of terrain types, conservation of disk space, and faster display. VAT was created in the Relief Visualization Toolbox (RVT; https://github.com/EarthObservation/RVT_py). The calculation takes about half a minute per km2 for data at 0.5 m resolution on a normal office laptop. VAT has already been used for pedestrian surveys in a range of environments, from the semi-deciduous tropical forests of Mexico to the largely open heather and scrubland of western Scotland and the karstic, rugged terrain of the Mediterranean. The visualisation does not introduce artefacts and shows small-scale features well, regardless of their orientation and shape. However, it is not completely orientation independent, as hillshading is used as the base layer. It aids human visual interpretation, but uses a directional light source and is therefore not suitable for data augmentation techniques such as rotation and flipping. Therefore, only sky-view factor, positive openness, and slope, which are all direction independent, were used to create the three-band (RGB) data records. These also ranked highest in our study of the performance of different visualisations and visualisation combinations for the classification task with CNNs 28. Local dominance 12 served as an additional aid for the human visual interpretation of the outer boundaries of aguadas, which are usually very faintly raised above the surrounding flat terrain.
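Assembling the direction-independent three-band record can be sketched as follows, assuming slope is derived from the DEM by central differences and each band is linearly stretched to [0, 1]. The SVF and openness inputs and the stretch limits below are illustrative placeholders, not the values produced by RVT for the dataset:

```python
import numpy as np

def slope_deg(dem: np.ndarray, cell: float = 0.5) -> np.ndarray:
    """Slope in degrees from a DEM via central differences (cell size in metres)."""
    dz_dy, dz_dx = np.gradient(dem, cell)
    return np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))

def normalise(band: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Linear stretch of a band to [0, 1] with clipping."""
    return np.clip((band - lo) / (hi - lo), 0.0, 1.0)

# Toy inputs: a 4x4 ramp DEM (0.5 m rise per 0.5 m cell -> 45 degree slope)
# and constant placeholder SVF / positive openness bands.
dem = np.outer(np.arange(4), np.ones(4)) * 0.5
svf = np.full_like(dem, 0.9)
openness = np.full_like(dem, 80.0)

# Stack into a three-band (RGB) record; stretch limits are illustrative.
rgb = np.stack([
    normalise(svf, 0.65, 1.0),
    normalise(openness, 60.0, 95.0),
    normalise(slope_deg(dem), 0.0, 45.0),
])
print(rgb.shape)  # (3, 4, 4)
```
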
In addition, we provide a canopy height model (CHM) with a resolution of 1 m, computed with a spike-free algorithm 60 implemented in the LAStools las2dem tool (version 230330).The processing parameters are listed in Table 4. Based on visual inspection, we removed all points that are more than 30 m above the ground (or a building), as they represent noise rather than a true measurement of tree heights.
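The noise removal described above (dropping returns more than 30 m above the ground or a building) can be illustrated as a simple height threshold on the point cloud. This is a hedged stand-in for the actual LAStools processing, with `filter_spikes` as a hypothetical helper:

```python
import numpy as np

MAX_CANOPY_HEIGHT = 30.0  # metres; taller returns are treated as noise

def filter_spikes(z: np.ndarray, ground_z: np.ndarray,
                  max_height: float = MAX_CANOPY_HEIGHT) -> np.ndarray:
    """Drop returns more than max_height above the local ground elevation."""
    keep = (z - ground_z) <= max_height
    return z[keep]

z = np.array([100.0, 135.0, 105.0])        # return elevations
ground = np.array([100.0, 100.0, 100.0])   # local ground elevations
print(filter_spikes(z, ground))  # the 135 m return (35 m above ground) is removed
```
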
The raw ALS data was provided by ZRC SAZU as part of a collaboration between the authors of this paper. As Chactún, Lagunita and Tamchén have only recently been (re)discovered and are remote and difficult to access, their exact location is not known to the general public. The density of ancient Maya anthropogenic structures and terrain modifications in this area is astonishing, reaching the level of major urban centres like Tikal, but it is still almost completely unexplored archaeologically. To prevent looting, the locations of the urban cores and the numerous smaller settlement groups are restricted to researchers. The full ALS data is therefore not publicly available. However, investigators who wish to use it for a specific application should approach ZRC SAZU directly by contacting the corresponding author, describing the topic and goals of their project.

For the Sentinel-1 data, we used acquisitions from both satellites in the Interferometric Wide (IW) swath mode, as this is the primary acquisition mode over land with the largest data archive. We used the Level-1 Ground Range Detected (GRD) product with dual polarisation (Vertical Transmit, Vertical Receive (VV) and Vertical Transmit, Horizontal Receive (VH)) for both ascending (ASC) and descending (DES) orbits. The backscatter coefficient used was Sigma0. The values of the backscatter coefficient were converted from linear power to decibels (dB), fitted to the interval [−30, 5] dB, and normalised to the range [0, 1].
We used SAR data for the years 2017-2020, with 114 ascending orbit (ASC) images and 205 descending orbit (DES) images, collected from Sentinel Hub 61. A total of 319 Sentinel-1 images were acquired over the study area, each containing data for VV and VH polarisation at 10 m spatial resolution. We calculated the following temporal statistics for each pixel: mean, median, standard deviation, coefficient of variation, and 5th and 95th percentiles. We calculated the statistics for each year within the observation period and for the entire period. The data were stored as a multiband raster (120 bands) in the Tagged Image File Format (TIFF) (Fig. 5 and Table 8). All processing was done with our own code and the Python packages eolearn (version 1.4.2) and sentinelhub (version 3.9.1).
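The Sigma0 scaling described above (linear power to dB, clipped to [−30, 5] dB, normalised to [0, 1]) and the per-pixel temporal statistics can be sketched as follows. This illustrates the stated transformations, not the eolearn pipeline actually used:

```python
import numpy as np

def sigma0_to_unit(power: np.ndarray) -> np.ndarray:
    """Convert Sigma0 from linear power to dB, clip to [-30, 5] dB, scale to [0, 1]."""
    db = 10.0 * np.log10(power)
    db = np.clip(db, -30.0, 5.0)
    return (db + 30.0) / 35.0

def temporal_stats(stack: np.ndarray) -> dict:
    """Per-pixel temporal statistics over a (time, height, width) stack."""
    mean = stack.mean(axis=0)
    std = stack.std(axis=0)
    return {
        "mean": mean,
        "median": np.median(stack, axis=0),
        "std": std,
        "cv": std / mean,  # coefficient of variation
        "p5": np.percentile(stack, 5, axis=0),
        "p95": np.percentile(stack, 95, axis=0),
    }

print(sigma0_to_unit(np.array([1.0])))  # 0 dB -> (0 + 30) / 35
```
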
Due to the complete absence of modern anthropogenic objects and measured permanent scatterers in our study area, it was impossible to verify the positional accuracy of the SAR data. However, according to the Sentinel-1 Annual Performance Report 62, the geolocation accuracy of the IW swath mode products without geometric corrections is −3.5 m in range and 2.1 m in azimuth. The absolute localisation error is therefore well below the mission requirement of 10 m at 95% confidence.

Fig. 3 Examples of buildings (a-g,i,j), platforms (a-j), and aguadas (k-o) that have been annotated and are included in the dataset. Other man-made structures, such as walls, channels, terraces, rock piles, etc., were not annotated (p-t). All panels have the same scale and cover about a quarter of a single data record area (Fig. 5). The visualisation is for illustrative purposes only; it combines a coloured simplified local relief model with a combined visualisation for archaeological topography.

Sentinel-2 optical satellite data. The Sentinel-2 optical satellite mission began with the launch of
Sentinel-2A in June 2015, followed by Sentinel-2B in March 2017. Both satellites carry a single multi-spectral instrument (MSI) with 13 spectral channels in the visible/near-infrared (VNIR) and shortwave infrared (SWIR) spectral ranges. The spatial resolution is 10 m for four bands, 20 m for six bands and 60 m for three bands (Table 5).
The geographical and climatic characteristics of the study area are manifested in a high proportion of cloudy optical satellite images. Of the 658 Level-2A images acquired during 2017-2020 61, 78 have cloud cover of less than 5%; however, small convective clouds or haze are present in most of them. We calculated a cloud mask with a 10 m resolution for each acquisition date and performed a manual visual inspection of the set to finally select the 17 images without cloud cover over the study area. We resampled all bands to 10 m resolution using the nearest neighbour resampling method. The dataset therefore comprises 12 spectral bands (excluding band 10) and a corresponding cloud mask, computed using s2cloudless 63 (available at https://github.com/sentinel-hub/sentinel2-cloud-detector) adjusted to 10 m resolution, for 17 dates (Fig. 4) (221 bands in total), saved in the TIFF format (Fig. 5 and Table 8). We excluded spectral band 10 (also known as the cirrus band) because it does not contain surface information; it is used for atmospheric corrections and is therefore not included in the atmospherically corrected Level-2A product. All processing was done with our own code and the Python packages eolearn (version 1.4.2) and sentinelhub (version 3.9.1).

Table 5. The spatial and spectral resolution of Sentinel-2 MSI bands 66.
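Nearest-neighbour resampling of a 20 m band to the 10 m grid amounts to duplicating each pixel by an integer factor. The toy example below illustrates the principle; the actual resampling was performed within the sentinelhub/eolearn processing chain:

```python
import numpy as np

def resample_nearest(band: np.ndarray, factor: int) -> np.ndarray:
    """Upsample a coarse band to a finer grid by nearest neighbour (integer factor)."""
    return np.repeat(np.repeat(band, factor, axis=0), factor, axis=1)

b_20m = np.array([[0.1, 0.2],
                  [0.3, 0.4]])          # toy 20 m band, 2x2 pixels
b_10m = resample_nearest(b_20m, 2)      # each 20 m pixel becomes a 2x2 block
print(b_10m.shape)  # (4, 4)
```
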
According to the Sentinel-2 Annual Performance Report 64 , the absolute geometric accuracy of Sentinel-2 L2A data is better than 6 m, multi-temporal co-registration of the same or different satellites in the same repeat orbit is better than 5 m at 95% confidence, and multi-temporal co-registration in different repeat orbits is better than 5 m.

Annotations. Many machine learning and deep learning studies in archaeological prospection use ALS visualisations with simple annotations that do not delineate the exact boundaries of objects. Such studies mostly use points and simple bounding boxes as annotations, which makes them primarily suitable for tasks dealing with object classification or object localisation (detection), rather than semantic segmentation. For segmentation purposes, the exact boundaries of an object are a prerequisite. To create a dataset suitable for supervised segmentation, we delineated polygons for archaeological objects. Data annotation was done by a single person. The manual work took roughly eight full working months and resulted in 9303 buildings and 2110 platforms annotated in the southern part (130 km2), and 95 aguadas annotated in the whole study area (220 km2) (Table 6). Platforms are apparently artificial, flat surfaces that stand out from the surrounding terrain and support other structures, or most likely had this function even if no buildings are currently visible. Buildings include various types of raised structures, such as temple-pyramids, palace-like buildings, ball courts, single- or multi-room houses and residential complexes. Aguadas are mostly clay-lined depressions capable of holding water through the dry season (see also 35).
The perimeter polygon around a building or platform was drawn where the interpreter could define the boundary between artificial (modified) and natural terrain on VAT.Wherever possible, a single polygon represents a single instance, a single structure.However, because buildings are often closely connected architecturally or due to collapsed material, a precise boundary between structures is sometimes difficult to determine.As a result, a single polygon often contains more than one building.Polygons of buildings and platforms regularly overlap, but there are also many examples of platforms without buildings and of buildings that are not on a platform.
We used local dominance (LD) visualisation to complement the VAT for annotating aguadas and larger water reservoirs.LD is particularly suitable for depicting their slightly raised embankments.
All polygons of the same type (building, platform, or aguada) were saved as separate vector layers. The annotations were revised and curated by an expert archaeologist with deep local knowledge of the area. In very ambiguous cases, where it was difficult to determine whether a formation is natural or anthropogenic, e.g. whether an object is a small, eroded platform or naturally levelled terrain, consensus was reached within a panel of experts. We discussed the issues individually or agreed on a new rule if there were many similar examples.
The modified landscape contains many other types of anthropogenic structures, such as terraces, quarries, walls, sacbeob (raised paved roads), chultunob (underground storage chambers), channels, rock piles, etc. To save time, we did not annotate these initially, as we were primarily interested in the number and cumulative volume of buildings and the volume of available water, in order to estimate the number of people living in the area and the labour required to construct the structures in question. Because the study did not prioritise agricultural aspects, areas with channels in bajos and areas of terraces were delineated later, and these data are not included in the data records. Other smaller and less pronounced anthropogenic structures, such as short walls, are also numerous and often eroded, making them harder to distinguish from natural formations. With the unlabelled data now made available, future researchers can annotate these existing records for further projects dealing with ancient Maya land use.

Binary masks.
Vector polygons representing each object class were rasterised to create a binary segmentation mask for that class. The rasterisation was done using the Geospatial Data Abstraction Library (GDAL). The masks were converted to TIFF files and serve as labelled data for training, validation, and testing.
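The rasterisation itself was done with GDAL; the pure-NumPy sketch below only illustrates the underlying idea of burning a polygon into a binary mask by testing pixel centres with an even-odd ray-casting rule, and is not the production code:

```python
import numpy as np

def point_in_polygon(x: float, y: float, poly) -> bool:
    """Even-odd ray-casting test; poly is a list of (x, y) vertices."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def rasterise(poly, width: int, height: int, cell: float = 0.5) -> np.ndarray:
    """Burn one polygon (in map units) into a binary mask sampled at pixel centres."""
    mask = np.zeros((height, width), dtype=np.uint8)
    for row in range(height):
        for col in range(width):
            cx, cy = (col + 0.5) * cell, (row + 0.5) * cell
            if point_in_polygon(cx, cy, poly):
                mask[row, col] = 1
    return mask

# A 1 m x 1 m square burned into a 4x4 mask at 0.5 m resolution.
square = [(0.5, 0.5), (1.5, 0.5), (1.5, 1.5), (0.5, 1.5)]
print(rasterise(square, 4, 4).sum())  # 4 pixel centres fall inside
```
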
Tiling. The fully annotated area of about 130 km2 was split into tiles of 240 × 240 metres. Each data record consists of the following tiles, each with multiple layers except for the single-layer CHM (Fig. 5):
• three binary segmentation masks (one each for the building, platform and aguada class),
• the ALS visualisations tile with three layers,
• the canopy height model,
• the Sentinel-1 image tile with multiple layers, and
• the Sentinel-2 image tile with multiple layers.
In accordance with the original DEM resolution of 0.5 m, visualisation and binary mask tiles each have a size of 480 × 480 pixels, the CHM has a size of 240 × 240 pixels, while the Sentinel-1 and 2 data have 24 × 24 pixels. Neighbouring tiles do not touch or overlap but are separated by a buffer of 20 metres. The geographic location of the tiles was chosen to match the 10 m grid of the Sentinel-2 data.
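The tiling scheme (240 m tiles separated by 20 m buffers, snapped to the 10 m Sentinel-2 grid) can be sketched for one axis as follows; `tile_grid` is a hypothetical helper, not part of the published processing code:

```python
def tile_grid(extent_m: int, tile: int = 240, buffer: int = 20, grid: int = 10):
    """Tile origins (in metres) along one axis: 240 m tiles separated by a
    20 m buffer, with both dimensions divisible by the 10 m Sentinel-2 grid."""
    assert tile % grid == 0 and buffer % grid == 0
    step = tile + buffer  # 260 m between consecutive tile origins
    return list(range(0, extent_m - tile + 1, step))

origins = tile_grid(1000)
print(origins)  # [0, 260, 520]

# Per-source pixel counts for one 240 m tile at each resolution:
print(240 / 0.5, 240 / 1, 240 / 10)  # 480.0 240.0 24.0
```
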

Data records
Tiled binary masks, ALS visualisations, Sentinel-1, and Sentinel-2 satellite data are archived in the Figshare online repository 65. The dataset contains 2094 data records with an object in at least one of the segmentation masks. A randomly selected set of 1765 data records (tiles 0-1764) was initially published for the participants of the Discover the mysteries of the Maya online challenge that we organised in the framework of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2021) 48,49, while the remaining 329 (tiles 1765-2093) were withheld for testing the submitted deep learning models (Fig. 6). The file format for all tiles is uncompressed TIFF. Geolocation data was intentionally omitted to avoid revealing the exact location of the archaeological remains. Tiles were randomly numbered to prevent reconstruction of the entire study area. An example of a data record with details of each tile can be found in Fig. 5 and in Tables 7, 8. The filename structure for each data record is tile_<consecutive-number>_<data-source>.tif, where the data source can specify a mask, ALS visualisations (lidar), CHM, or Sentinel data (S1 or S2). The sequential number is a unique identifier of a data record; all files with the same sequential number represent the same geographical area, but differ in the number of pixels (480 × 480, 240 × 240 or 24 × 24 pixels) and bit depth (8-bit integer or 32-bit float) (Table 7).
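A small helper for splitting the documented filename pattern into its parts can look like this (illustrative only; the exact data-source tokens are listed in Table 7):

```python
def parse_record_filename(name: str):
    """Split tile_<consecutive-number>_<data-source>.tif into its parts."""
    assert name.endswith(".tif")
    stem = name[: -len(".tif")]
    prefix, number, source = stem.split("_", 2)
    assert prefix == "tile"
    return int(number), source

print(parse_record_filename("tile_1469_lidar.tif"))  # (1469, 'lidar')
```

All files sharing the returned number describe the same geographical area, so grouping by that number reassembles one complete data record.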
Each Sentinel-2 tile consists of 221 bands (17 dates × 13 bands), ordered by acquisition date (from newest to oldest); each date contributes its spectral bands followed by the associated cloud mask (Fig. 5 and Table 8).
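Assuming each date's 12 spectral bands are followed by its cloud mask, as described above, the band index for a given date and layer can be computed as follows (a sketch; Table 8 gives the authoritative layout):

```python
N_DATES = 17
LAYERS_PER_DATE = 13  # 12 spectral bands followed by the cloud mask

def band_index(date_idx: int, layer_idx: int) -> int:
    """0-based band index within the 221-band Sentinel-2 tile.
    date_idx: 0 = newest acquisition; layer_idx: 0-11 spectral, 12 = cloud mask."""
    assert 0 <= date_idx < N_DATES
    assert 0 <= layer_idx < LAYERS_PER_DATE
    return date_idx * LAYERS_PER_DATE + layer_idx

print(band_index(0, 0))    # 0: first spectral band of the newest date
print(band_index(16, 12))  # 220: cloud mask of the oldest date
```
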
A single data record is 2.35 MB in size, while the total size by type is as follows: 1,449 MB for masks, 1,449 MB for ALS visualisations, 484 MB for the CHM, 581 MB for Sentinel-1, and 1,070 MB for Sentinel-2 (a total of 5,033 MB). The repository stores ZIP-compressed data, compiled by type (masks, lidar, CHM, S1, S2). The total size of the compressed files is 2,214 MB.

Technical Validation
Because of the wide variety of data used to create the data records, we have described the process for obtaining the best possible input for each of the sources in the relevant section of the Methods chapter. The object boundaries resulting from the interpretation of the ALS data serve as ground truth. Based on a 10 m buffer around the ground tracks of the 2017 and 2018 field verification campaigns, we verified 33.3% of aguadas, 22.4% of buildings, and 24.2% of platforms in the field (Fig. 6). Given the extreme difficulty of fieldwork in this remote and densely vegetated area, these are very high numbers. We did not record the errors systematically and cannot give exact frequencies for overlooked objects. However, the quality of our ALS data and the nature of the archaeological structures surveyed suggest that the number of structures we may have overlooked or mislabelled is likely to be very small. The experience of archaeologists working in the Neotropics shows that interpretations derived from ALS data are very reliable, and that field verification over larger areas can be less consistent, as conditions make efficient investigation impossible.

Fig. 4 The selected Sentinel-2 acquisition dates over the study area with close-up views.

Fig. 5 A single data record (e.g. 1469) contains (a) binary masks of structures, (b) a tile with ALS data visualisations, (c) a tile with a canopy height model, (d) a tile with annual statistics for Sentinel-1 Sigma0 backscatter coefficients, and (e) a tile with Sentinel-2 bands for 17 cloudless scenes.

Table 6. The number of annotations.

Fig. 6 Learning (train and validation) and test tiles with the ground tracks of two field verification campaigns.

Table 1. ALS data acquisition parameters of the region around Chactún, Calakmul Biosphere Reserve, Campeche, Mexico.

Sentinel-1 synthetic aperture radar satellite data. The Sentinel-1 satellite constellation provides C-band synthetic aperture radar (SAR) data. The first satellite, Sentinel-1A, was launched in April 2014, followed by Sentinel-1B in April 2016. The latter was decommissioned after data collection ceased due to a power failure on 23 December 2021. The dual constellation had a repeat frequency of 6 days and a revisit frequency (combining ascending and descending orbits) of 3 days at the equator; a single satellite has a revisit frequency of 6 days at the equator.

Table 3. Raster visualisations used to create the combined visualisation for archaeological topography that we used to identify objects. It blends four raster visualisations computed with settings for general and flat terrain. A colour-coded simplified local relief model was added to improve the visibility of structures (Fig. 3), and local dominance was used for delineating the outer edges of aguadas. Sky-view factor, positive openness, and slope, computed with settings for general terrain, are used as raster bands in the data records.

Table 4. Spike-free canopy height model processing parameters.

Table 8. The structure of Sentinel-1 and Sentinel-2 data tiles.

Table 7. Properties of a single data record (e.g. 1469).