Background & Summary

Conserving the biodiversity of island biotas is a global concern because of the unique set of species living on islands1. Most native island floras and faunas face great challenges2,3. The most worrisome are intensifying land-use change, which destroys habitats, and the increasing number of introduced, and potentially invasive, species that act as predators, parasites, or competitors of the native ones4,5. In this context, highly specialised and small-range endemic species are particularly vulnerable6. Although long-term arthropod datasets can strongly support the understanding of the vulnerability of endemic ecosystems, population trends of plants and vertebrates, including extinctions, are still better documented7,8 than those of arthropods. Yet, for instance, alien insect species were reported to threaten endemic diversity on the Galapagos Islands9 and a negative correlation was found between invasive ants and the species richness of other insect taxa on islands of the Great Barrier Reef10. Most of these studies, however, rely on short-term datasets or qualitative historical data.

Indeed, despite their key role in ecosystem functioning and the high conservation importance of endemic species, arthropod diversity in most archipelagos is substantially understudied, with studies spanning long periods being rare11.

The BALA (‘Biodiversity of Arthropods of the Laurisilva of the Azores’) dataset aims to address this gap in knowledge. It contains a unique and comprehensive set of long-term quantitative arthropod data, collected in three sampling campaigns covering 25 years and eight Azorean islands (Fig. 1), using consistent and standardised sampling protocols.

Fig. 1

Map of the Azores archipelago with the location of sampling transects. Top left shows the position of the three island groups (Western group – red, Central group – green, and Eastern group – blue). The three groups are shown separately on the left side and the individual islands with all sampling points (black dots) and core sampling points (red circles) are shown on the right. Abbreviations: FLO – Flores, COR – Corvo, FAI – Faial, PIC – Pico, SJG – São Jorge, GRA – Graciosa, TER – Terceira, SMG – São Miguel, and SMR – Santa Maria. Note that no pristine area remains on Graciosa or Corvo and that Corvo was not sampled. Scale bars show 20 km on the group maps and 10 km on the maps of individual islands.

The core dataset is the result of 4,929 sampling events on 31 transects distributed across 15 fragments of native Azorean humid forests on 7 islands (excluding Corvo and Graciosa). Sampling was conducted between 1997 and 2022 and split into three multi-year sampling campaigns, BALA 1 (between 1997 and 2004), BALA 2 (2010 and 2011), and BALA 3 (between 2019 and 2022). The core dataset contains only transects in pristine native vegetation that have been repeatedly sampled during this period, but the final dataset also includes data from several other projects that used the same methodology to sample invertebrates across the archipelago. The final dataset thus encompasses data from 124 transects in 27 fragments on eight islands (Table 1). Most fragments are pristine native forests but, with the extended dataset, early-succession vegetation on lava flows and secondary forests (on Graciosa island, where no native habitat is left) are also included.

Table 1 Numbers of transects sampled per year in each fragment.

The core dataset includes samples collected in sub-montane and montane forests. The sub-montane forests are dominated by Laurus azorica (Azorean Laurel forests), which in the past probably covered more than two thirds of the islands, from 300 m to 600 m a.s.l.2. The high-elevation sites between 600 m and 1000 m a.s.l. are Juniperus-Ilex forests and Juniperus woodlands12. All these forests are characterised by a dense cover of bryophytes that are present on all substrates.

Altogether, the collection amasses a total of 266,675 specimens belonging to 543 arthropod morphospecies (123,446 individuals and 404 morphospecies in the core dataset), of which 354 (65.2%) are identified at the species level. Of these species, 77 (105,919 individuals) are endemic to the Azores, 97 (97,562 individuals) are native but not endemic, and 146 (32,267 individuals) are exotics (Figs. 2, 3).

Fig. 2

Voronoi maps showing the proportions of species richness of invertebrates caught in different higher taxa on each island (FAI = Faial, FLO = Flores, GRA = Graciosa, PIC = Pico, SJG = São Jorge, SMG = São Miguel, SMR = Santa Maria, TER = Terceira) and in the entire dataset (All data). Main colours represent arthropod classes, whilst different hues of these colours indicate families. Indigenous (endemic and native but not endemic), introduced species, and those of unknown origin are marked with patterns of triangles, squares, and crosses, respectively.

Fig. 3

Voronoi maps showing the proportions of total log abundances of arthropods collected in different higher taxa on each island (FAI = Faial, FLO = Flores, GRA = Graciosa, PIC = Pico, SJG = São Jorge, SMG = São Miguel, SMR = Santa Maria, TER = Terceira) and in the entire dataset (All data). Main colours represent arthropod classes, whilst different hues of these colours indicate families. Indigenous (endemic and native but not endemic), introduced species, and those of unknown origin are marked with patterns of triangles, squares, and crosses, respectively.


Study area and biogeographical aspects

The Azorean archipelago is situated in the North Atlantic Ocean, comprising nine volcanic islands and occupying a surface area of 2,346 km² (Fig. 1). All islands have a wet oceanic, mild to warm, subtropical climate, with moderate variation in daily and annual temperatures13.

Biogeographically, the Azores belong to Macaronesia, with its native vegetation being characterised by several types of hyper-humid forests, including the laurel forests (Laurisilva)12. However, the islands have been inhabited since the mid-15th century and the native forest, originally covering almost the entire surface of the islands, has been increasingly converted to agricultural cropland, pastures, settlements, and, during the 20th century, tree plantations dominated by Cryptomeria japonica (Thunb. ex L.f.) D.Don. Even the remaining fragments are dominated by Juniperus forests and shrublands different from those of Madeira and the Canary Islands12. By 1988, when Natural Forest Reserves were established on seven of the nine Azorean islands (see below), only 2.5% of the original native forests remained in isolated fragments14,15,16.

In our study, eight islands of the Azorean archipelago (from west to east) were sampled: Flores, Faial, Pico, São Jorge, Graciosa, Terceira, São Miguel, and Santa Maria. The surveyed native vegetation included (according to ref. 12): Laurus Submontane Forests (Laurus-Ilex forests; 300–600 m), which host 14 woody plant species on average, with a tree height ranging from 6 to 8 m, and are dominated by Laurus azorica; Juniperus-Ilex Montane Forests (600–900 m), which host an average of 18 plant species, with a tree height ranging from 3 to 5 m, and are dominated by Juniperus brevifolia and Ilex azorica, although Laurus azorica is still very frequent; and Juniperus Montane Woodlands between 700 m and 1000 m, which host an average of 15 plant species, with a tree height ranging from 2 to 4 m, and are dominated by Juniperus brevifolia. In the special case of the lower-elevation island of Santa Maria, the native forest is a mix of Picconia-Morella lowland forest and Laurus Submontane Forest.

No pristine area remains on Graciosa, and thus only secondary forests could be sampled on this island using the sampling protocol introduced below. Corvo island has no pristine vegetation either, and its remoteness makes sampling logistically challenging, so it was not sampled.

Projects’ details

The dataset stems from a number of projects using the same sampling protocol for collecting arthropods in the Azorean archipelago. The Biodiversity of Arthropods from the Laurisilva of the Azores (BALA) project was established with a vision to initiate a comprehensive survey of the Azorean invertebrate fauna, with a particular interest in endemic species17. The main initial goals were to: (i) contribute to resolving the Linnaean Shortfall18 by surveying new habitats (tree canopy) and formally describing new species (taxonomic effort); and (ii) gather comprehensive and sustained ecological data and meticulously assess the spatial and temporal dynamics of species distributions and abundance across various scales. The initiative also aims to (iii) address the gaps highlighted by the Wallacean and Prestonian shortfalls19; and, by providing a thorough understanding of ecosystem dynamics and facilitating informed conservation strategies, (iv) inform the Azorean Government about the quality status of the Azorean native forest.

Later, with the two consecutive sampling campaigns over the next 20 years, it became the longest-running monitoring effort of changes in the Azorean biota. Its relatively large scale in both space and time also makes the BALA dataset suitable for testing macroecological and biogeographical hypotheses (e.g. refs. 15,20), for clarifying the potential occurrence of an “insect decline” in the Azores5, and for identifying the spatial and temporal invasion patterns of exotic arthropod species. The project is coordinated by the Azorean Biodiversity Group (cE3c), based at the University of the Azores in Angra do Heroismo, Terceira.

The core sampling, organised into three consecutive phases, sampled the same 30 sites repeatedly. However, the core database includes 31 sites (Table 2) because one of the sites surveyed during BALA 1 on Faial island (FAI-NFCF-T-11) had to be replaced with a nearby location (FAI-NFCF-TB26) due to an invasion by Rubus ulmifolius Schott that made the installation of BALA 2 and BALA 3 pitfall traps impossible.

Table 2 The 31 core BALA sites. Island abbreviations: FAI = Faial, FLO = Flores, GRA = Graciosa, PIC = Pico, SJG = São Jorge, SMG = São Miguel, SMR = Santa Maria, TER = Terceira.

In the initial sampling round, which took place from 1997 to 2004 (referred to as BALA 1), a total of 100 sites within 18 forest fragments were surveyed as a result of a collective effort from several projects. During the subsequent rounds, spanning from 2010 to 2012 (BALA 2; ref. 21) and 2021 to 2022 (BALA 3), only the 30 core sites, out of the original 100 sites, located within 15 fragments were resampled. Although no pristine native vegetation is found on the small island of Graciosa, a single expedition was conducted in June 2005 to survey 11 sites on this island (Project BALA Graciosa). The Geotermia1 project, also using the BALA protocol, was a monitoring survey of native forest plots on Terceira Island, aiming to evaluate the impact of the construction of a geothermal power plant near Galhardo Natural Forest Fragment. These surveys were conducted in 2002 and 2007 in 16 native forest sites. Data from two PhD projects are also included in the dataset. In the first, by Clara Gaspar (2003-2004), 44 additional BALA 1 plots were set up, and in the second, by Silvia Calvo Aranda, the same sites were resampled during BALA 2 (2011) (Table 1).

Sampling protocol

Each site was sampled for soil fauna along a 150 m transect, in which 30 pitfall traps, each with a 5 cm opening diameter, were placed at 5 m intervals. Every second trap was filled with ethylene glycol and the remaining 15 traps with Turquin’s solution (10 g chloral hydrate, 5 ml formalin, 5 ml acetic acid, added to 1 L of dark beer)22. Pitfall traps were collected after two weeks (14 nights) of continuous operation. Additionally, to capture arthropods residing in the canopy, ten samples were taken from each of the three most common native tree species using a beating technique, primarily focusing on the endemic Juniperus brevifolia (Seub.) Antoine (Cupressaceae), Erica azorica Hochst. ex Seub. (Ericaceae), Ilex azorica Gand. (Aquifoliaceae), Laurus azorica (Seub.) Franco (Lauraceae), and Vaccinium cylindraceum Sm. (Ericaceae). Trees were selected randomly within a 5 m distance from the pitfall trap line and were beaten five times at a height of ca. 1.5–2 m. Sampling campaigns took place between July and September, when arthropods are the most active, and canopy samples were always collected in dry and warm weather conditions. For examples of studies applying the BALA protocol, see refs. 14,17,23.
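The pitfall layout described above can be written out as a small Python sketch, which may help when checking trap-level records against the protocol. The trap count, spacing, and alternation of preservatives come from the text; the class and function names, the position of the first trap at 0 m, and the choice of which preservative comes first are assumptions made only for illustration.

```python
from dataclasses import dataclass

TRAPS_PER_TRANSECT = 30  # stated in the protocol
TRAP_SPACING_M = 5       # stated in the protocol


@dataclass(frozen=True)
class Trap:
    number: int        # 1-based position along the transect
    distance_m: int    # distance from the first trap (assumed to sit at 0 m)
    preservative: str  # "ethylene glycol" or "Turquin"


def transect_layout(first_preservative="ethylene glycol"):
    """Return the 30 traps of one transect with alternating preservatives.

    Which preservative is used in the first trap is an assumption;
    the protocol only states that the two solutions alternate.
    """
    if first_preservative == "ethylene glycol":
        other = "Turquin"
    else:
        other = "ethylene glycol"
    traps = []
    for i in range(TRAPS_PER_TRANSECT):
        preservative = first_preservative if i % 2 == 0 else other
        traps.append(Trap(i + 1, i * TRAP_SPACING_M, preservative))
    return traps


layout = transect_layout()
```

Under these assumptions, the first and last traps are 145 m apart, which fits within the 150 m transect.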

Taxonomic scope, biogeographic origin and conservation categorization

All arthropod taxa, with the exclusion of mites (Acari) and Collembola, were collected. Specimens were sorted to morphospecies as a first step and later identified to the lowest taxonomic level possible. Due to the difficulty of their identification, Diptera and Hymenoptera (excluding Formicidae) were not sorted to morphospecies but kept as bulk samples. Species identification was based either on historical Azorean entomological literature, mostly for the identification of endemic species (see list of references in Borges & Vieira24), or on the expertise of several taxonomists who collaborated with us in publications during the last 20 years (see e.g. ref. 17). Species nomenclature follows the latest checklist of Azorean arthropods25 and, with the exception of a few cases, it aligns with the GBIF Backbone Taxonomy. All identified species were categorised according to their biogeographic origin as 1) endemic to the Azores; 2) native but not endemic species, which occur naturally in the islands; and 3) introduced species whose presence is thought to be the result of (intentional or unintentional) human introduction14,26. The conservation status of each endemic species was assessed following the guidelines of the International Union for Conservation of Nature (IUCN) and an IUCN category was assigned to each.

Sample and voucher archives

Upon collection, samples were labelled and initially stored in 96% ethanol. Identified specimens were either stored in 96% ethanol or mounted on insect pins. All samples were archived and voucher specimens preserved in the Dalberto Teixeira Pombo Insect Collection (Collection Code: DTP; collectionID: 1366b359-8936-4e40-be36-1f1e1eb6d2b0), situated at the University of the Azores in Angra do Heroismo, Terceira Island, Portugal. Pin-mounted specimens are kept at room temperature, with controlled humidity and safe from museum pests in well-sealed insect boxes. Specimens preserved in alcohol are kept in fridges at 4 °C or in a temperature-controlled room at 13 °C.

Data processing

Unique identifiers were assigned to each voucher morphospecies and event IDs, used by the GBIF, were assigned as identifiers to samples.

Morphospecies identifiers act as references and ensure both that voucher specimens can be traced back to their source and that new identifications can be added or existing ones updated. By using the event ID, all metadata about the sampling event, such as when, where, and by whom the sample was taken, can be retrieved. This also facilitates any corrections or updates to the database, as changes can be linked directly to specific events through their event IDs. Morphospecies occurrences and the abundances of juveniles, adults, males and females (when these could be identified) were linked to each event ID.

The dataset was cleaned and converted to all provided formats using the dplyr27, reshape228, RMySQL29, and VoronoiPlus30 packages in R31. To make the Voronoi maps accessible to those with colour vision deficiency, their colours were chosen with the help of colorblindr32. All related code is deployed with the dataset and is also available on GitHub.

Data Records

Available formats

All data are available on figshare33 and are also published on the GBIF website34, fully compliant with the Darwin Core (DwC) standard, under a CC-BY 4.0 Licence.

The DwC is a standardised format developed to facilitate the sharing and integration of biological and biodiversity data across different platforms and databases, and provides a stable framework comprised of terms and vocabulary. This framework ensures that data from different sources remains interoperable and easily comprehensible, irrespective of its origin. The DwC may also encompass data related to sampling time, habitats, sampling methods, and other facets of biodiversity research.

To facilitate the use of the data by those not familiar with the DwC format, we also provide one large data table containing all data as semicolon-separated values. Similarly, for those who wish to import data into GIS applications, we provide a geographically referenced relational database in MySQL format.
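As a minimal sketch of how the semicolon-separated table might be read, the Python example below parses a tiny in-memory sample and sums abundances per taxon. The column names (eventID, scientificName, individualCount) are Darwin Core terms used here as assumptions, and the rows are invented placeholders; the headers of the actual file should be checked before reuse.

```python
import csv
import io
from collections import Counter

# Invented placeholder rows; the real file and its headers may differ.
sample = io.StringIO(
    "eventID;scientificName;individualCount\n"
    "EV1;Species A;3\n"
    "EV1;Species B;1\n"
    "EV2;Species A;2\n"
)


def abundance_per_taxon(handle):
    """Sum individual counts per taxon from a semicolon-separated table."""
    totals = Counter()
    # delimiter=";" is the key setting for semicolon-separated values.
    for row in csv.DictReader(handle, delimiter=";"):
        totals[row["scientificName"]] += int(row["individualCount"])
    return totals


totals = abundance_per_taxon(sample)
```

To read a file on disk, the same function can be passed a handle opened with `open(path, newline="", encoding="utf-8")`, assuming the file is UTF-8 encoded.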

Database structure

The data consist of three separate data types: 1) data related to the sampling event, such as the sampling transect and its basic characteristics, date, and the sampling method; 2) morphospecies-related data, such as the species identity (if known), higher taxonomy (i.e. genus, family, order, class), the biogeographic status, and the IUCN category; and 3) an occurrence dataset, including the number of individuals captured, separated into adults and juveniles and, when possible, into males and females.

DwC-formatted data consist of two tables. The event table contains information on the sampling event, such as sampling method, date, and site information, whereas the occurrence table focuses on the organisms collected and lists their taxonomic identity, biogeographic origin, and the abundance of the arthropods collected per developmental stage (adults, juveniles) and sex, if it was determinable. The two tables are linked through the ‘eventID’ field, which is unique in the event table. These two tables can be accessed via GBIF (resource: bala_arthropods)34.
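The link between the two DwC tables can be illustrated with a short Python sketch that attaches event metadata to each occurrence through eventID. The field names are Darwin Core terms assumed to be present, the records are invented placeholders, and the function name is hypothetical.

```python
# Invented placeholder records; real tables contain many more fields.
events = [
    {"eventID": "EV1", "eventDate": "1999-07-15",
     "samplingProtocol": "pitfall trap"},
    {"eventID": "EV2", "eventDate": "2010-08-02",
     "samplingProtocol": "canopy beating"},
]
occurrences = [
    {"eventID": "EV1", "scientificName": "Species A", "individualCount": 3},
    {"eventID": "EV1", "scientificName": "Species B", "individualCount": 1},
    {"eventID": "EV2", "scientificName": "Species A", "individualCount": 2},
]


def join_on_event_id(events, occurrences):
    """Attach the event metadata to every occurrence via eventID."""
    by_id = {event["eventID"]: event for event in events}
    if len(by_id) != len(events):
        raise ValueError("eventID must be unique in the event table")
    joined = []
    for occ in occurrences:
        event = by_id.get(occ["eventID"])
        if event is None:
            raise KeyError(f"occurrence refers to unknown event {occ['eventID']}")
        # Occurrence fields are merged on top of the event fields.
        joined.append({**event, **occ})
    return joined


joined = join_on_event_id(events, occurrences)
```

The uniqueness check mirrors the statement that eventID is unique in the event table; an occurrence pointing to a missing event raises an error rather than being dropped silently.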

The MySQL database contains three separate tables: (1) the species list enriched with taxonomic information, biogeographic origin, and conservation status; (2) a sample site list, containing the WGS84 coordinates of the sampling sites, both as text and as a geometry field in MySQL; and (3) the occurrence table listing all collected specimens, the method used for collection, life stage, and (where possible) sex. The species and site tables are linked to the occurrence table through unique identifiers, the morphospecies ID and site ID, respectively. The database is provided in an .sql format, which can be imported directly into any MySQL database.
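The three-table structure can be mimicked in a few lines of Python. This sketch uses the standard-library sqlite3 module as a stand-in for MySQL, so it has no geometry field, and all table names, column names, coordinates, and rows are hypothetical; the released .sql file defines the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in, not the MySQL release
conn.executescript("""
CREATE TABLE species (
    morphospecies_id TEXT PRIMARY KEY,
    scientific_name  TEXT,
    origin           TEXT,
    iucn_category    TEXT
);
CREATE TABLE site (
    site_id   TEXT PRIMARY KEY,
    latitude  REAL,  -- WGS84; the MySQL version also has a geometry field
    longitude REAL
);
CREATE TABLE occurrence (
    morphospecies_id TEXT REFERENCES species(morphospecies_id),
    site_id          TEXT REFERENCES site(site_id),
    method           TEXT,
    life_stage       TEXT,
    sex              TEXT,
    abundance        INTEGER
);
""")
# Invented placeholder rows.
conn.executemany("INSERT INTO species VALUES (?, ?, ?, ?)", [
    ("MS1", "Species A", "endemic", "VU"),
    ("MS2", "Species B", "introduced", None),
])
conn.executemany("INSERT INTO site VALUES (?, ?, ?)", [
    ("S1", 38.7, -27.2),
    ("S2", 38.5, -28.3),
])
conn.executemany("INSERT INTO occurrence VALUES (?, ?, ?, ?, ?, ?)", [
    ("MS1", "S1", "pitfall", "adult", None, 4),
    ("MS2", "S1", "pitfall", "adult", None, 2),
    ("MS1", "S2", "beating", "juvenile", None, 1),
    ("MS1", "S1", "beating", "adult", "male", 3),
])
# Total abundance per site and biogeographic origin, joining all three tables.
rows = conn.execute("""
    SELECT si.site_id, sp.origin, SUM(o.abundance)
    FROM occurrence AS o
    JOIN species AS sp ON sp.morphospecies_id = o.morphospecies_id
    JOIN site AS si ON si.site_id = o.site_id
    GROUP BY si.site_id, sp.origin
    ORDER BY si.site_id, sp.origin
""").fetchall()
```

The query shows why the occurrence table carries both identifiers: a single pass over occurrences can be summarised by any attribute of the species or site tables.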

Technical Validation

All samples were individually labelled and stored. Archive samples are still available for further data checks and analyses. Most species identifications were conducted by one of the authors (PAV Borges), but species-level identifications of some problematic specimens were made by expert taxonomists of the corresponding taxon.

Data were meticulously checked to avoid transcription errors and several tests were run to identify outliers in the data. For instance, the unique values of each categorical variable were listed and carefully checked for misspellings, duplicates or similar errors. Numerical variables and dates were plotted, and their interquartile range was examined and tested for outlying values. When necessary, these values were corrected or removed. Most of the data have already been used in a number of analyses, for instance refs. 21,35,36,37.
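The two kinds of checks described above can be sketched in Python as follows. The original checks were run in R and their exact thresholds are not given here, so the 1.5 × IQR fence (Tukey's rule), the quartile method (the default of `statistics.quantiles`), the function names, and the example values are all assumptions.

```python
from collections import Counter
from statistics import quantiles


def category_counts(values):
    """List each unique value of a categorical variable with its frequency,
    so that misspellings and near-duplicates stand out."""
    return Counter(values)


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles.

    k = 1.5 is Tukey's conventional fence, assumed here for illustration.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Invented example data: one misspelt island code and one suspicious count.
island_codes = ["TER", "TER", "PIC", "PIc", "FLO"]
abundances = [2, 3, 1, 4, 2, 3, 250]

code_summary = category_counts(island_codes)
suspicious = iqr_outliers(abundances)
```

Flagged values are only candidates for review; a very large count may be a genuine mass capture rather than a transcription error, so each one still needs to be checked against the archived sample.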

All species names were automatically checked against the GBIF Backbone Taxonomy using the R function provided by Pozsgai et al.38.

Usage Notes

Our database has great potential for analysing macroecological, biogeographical as well as species- and community-level patterns, particularly those focusing on insular systems. It can be especially valuable when combined with other island datasets. Indeed, although there are several long-term invertebrate datasets from Europe and the U.S.A.39, only a handful are available for island biotas (e.g. ref. 40). In addition to its primary objective of inventorying Azorean arthropods, the BALA database also allows comparison of diversity on multiple scales, at various taxonomic levels, and among a variety of species groups (e.g. exotic and endemic species). Moreover, due to its long temporal span, the dataset can also contribute to studies of biodiversity change over time, namely those investigating invasion dynamics and the effects of climate change, and hence can support conservation planning.