GalliForm, a database of Galliformes occurrence records from the Indo-Malay and Palaearctic, 1800–2008

Historical as well as current species distribution data are needed to track changes in biodiversity. Species distribution data are found in a variety of sources, each of which has its own distinct bias toward certain taxa, time periods or places. We present GalliForm, a database that comprises 186687 galliform occurrence records linked to 118907 localities in Europe and Asia. Records were derived from museums, peer-reviewed and grey literature, unpublished field notes, diaries and correspondence, banding records, atlas records and online birding trip reports. We describe data collection processes, georeferencing methods and quality-control procedures. This database has underpinned several peer-reviewed studies, investigating spatial and temporal bias in biodiversity data, species’ geographic range changes and local extirpation patterns. In our rapidly changing world, an understanding of long-term change in species’ distributions is key to predicting future impacts of threatening processes such as land use change, over-exploitation of species and climate change. This database, its historical aspect in particular, provides a valuable source of information for further studies in macroecology and biodiversity conservation.


Background & Summary
Gathering primary biodiversity data is necessary to improve our knowledge of the ecology and conservation status of species. International commitments such as the Convention on Biological Diversity 1 call for a halt to biodiversity loss and therefore require data to measure biodiversity change. Recent trends in changes in population sizes or geographical ranges can be used to track progress toward biodiversity targets but longer-term trends are needed if we are to put the status of present-day biota into a proper historical context 2,3 . Similarly, if we are to understand the impacts of climate and land use change on species distributions, historical data are required. Ideally, this biodiversity information must be comprehensive, covering common species as well as threatened, and areas of lower biodiversity as well as hotspots.
Our knowledge of species' distributions is extremely coarse compared to most other environmental variables 4 . Analyses of species' geographical ranges often rely on predictions of where a species might occur. Predictions might be gleaned from expert opinion (e.g. https://birdsoftheworld.org/bow/home) (and in some instances may be influenced by historical data), the extent of suitable habitat 5 , gridded survey data 6 or point occurrences 7 . Prominent conservation datasets such as the Living Planet Index 8 and IUCN's species distribution maps (https:// www.iucnredlist.org/resources/spatial-data-download) are regularly used to assess rates of biodiversity loss but these data sources do not extend back beyond around 1970. Longer-term trends can reveal major shifts in abundance and composition of biological communities, information that should be considered when setting conservation targets 9 .
While aggregated population trends or extent of occurrence maps are useful conservation tools, primary data allow us to investigate biodiversity loss in far greater detail. For example, if species' ranges are punctuated with local extinction events we might overlook or underestimate species' declines because we lack the precision to measure them 10 . Additionally, data summaries may be at coarser resolutions than the original data or missing attributes attached to the original record. Freely available primary data allow new questions to be investigated, for which data summaries might not be suitable.
The avian order Galliformes has relatively high quality historical distribution data. This is in part due to their economic and cultural value and their attraction for collectors and ornithologists 11 . Almost all species are non-migratory, making delimitation of their current and historical ranges more tractable. In recent times they have received much conservation attention through being one of the most threatened avian orders -over 25% of species are threatened (www.iucnredlist.org) and many local extinctions have been reported 12 (http://datazone. birdlife.org/home). Galliformes are subject to a variety of threats including habitat loss, hunting, and agricultural intensification and disturbance (http://datazone.birdlife.org/home). The order exhibits a wide range of ecological characteristics and life history traits, and occurs in a diversity of habitats, meaning that the Galliformes lend themselves well to macroecological studies 13 .
Here we present GalliForm 14 , a database of 186687 occurrence records covering the 130 species of the avian order Galliformes that occur in the Palaearctic and Indo-Malay biogeographic realms (see Fig. 1 for spatial distribution of records). Records cover the period 1648 to 2008 although 95% of records date from 1877 onwards. Records increase markedly though time (Fig. 2). Records were collected from museums, peer-reviewed and grey literature, bird atlases, banding records and birding trip report websites (see 15 for spatial biases within sources). Where possible, data were informally refereed by local experts who, if necessary, supplemented the data with their personal records. Each data source was found to have a distinct set of spatial, temporal and taxonomic biases 15 . Combining biodiversity data from a variety of primary sources helps to minimise data bias.
The GalliForm dataset 14 is an extremely valuable resource for ecological and conservation studies. Occurrence data underpin species distribution modelling but geographic ranges are changing rapidly due to the diverse impacts caused by human activities. Historical occurrence data, coupled with climate and land-use data, may improve our understanding of populations' responses to climate change, land-use change and hunting. The species occurrence data described here have been used to assess the completeness of geographic range size estimates 16 , to investigate patterns of range collapse with respect to distance to range edge 17 and to assess species extirpations outside Protected Areas 12 . Nine publications 10,12,[15][16][17][18][19][20][21] have so far arisen from this database but many avenues remain to be explored.

Methods
These methods are an expanded version of those in our related work, Boakes et al. 15 .
The database was compiled over the period 2005-2008. Data collection equates to around 1500 person-days and data were gathered by a team of 21 people. Between them, team members were fluent in English, French, German, Mandarin, Russian, Spanish and Swedish. These languages were extremely helpful in transcribing www.nature.com/scientificdata www.nature.com/scientificdata/ museum specimen labels and in translating publications. However, the majority of publications were in English and we acknowledge that the database will be biased toward records published in English-language publications.
Our study focuses on the 130 galliform species that occur within the Palaearctic and Indo-Malay biogeographic realms 22 (see Online-only Table 1). We have additionally included records of the Imperial Pheasant (Lophura imperialis) although it is now recognised that this is a hybrid and not a species. The geographic range of two of the species in the database, the Red Grouse (Lagopus lagopus) and the Rock Ptarmigan (Lagopus muta), extends to North America. North American data was often included in the information which museums sent us and in these instances we entered those records into the database since we thought they might be of use to researchers studying these species. However, it should be noted that we did not search exhaustively for records of these species in North America, we have merely included those that we came across.
We attempted to gather all species distribution data that could be accessed from five different sources; museum collections, literature records, banding (ringing) data, ornithological atlases and birdwatchers' trip report websites. For each data source, exhaustive and systematic search strategies were adopted.

Museum collections.
Using web-based searches and Roselaar 23 , 377 natural history collections were identified. We found contact details for 338 of these collections and requested by email or letter a list of the Galliformes in their holdings along with collection localities and dates. Non-respondents were recontacted. 135 museums were able to share data with us (see Online-only Table 2). Museum records were obtained through publicly available online databases e.g. ORNIS, electronic or paper catalogues sent to us by the museums or by visiting the museums and transcribing data directly from specimens or card catalogues. Almost half of the museums we contacted did not respond despite at least one follow-up enquiry, and there was substantial variation in the amount and format of data contributed by those that did reply. Altogether, over 50% of the records came from just six museums (Natural History Museum, London; Zoological Institute of the Russian Academy of Sciences, St Petersburg; Zoological Museum of Lomonosov Moscow State University; Field Museum of Natural History, Chicago; American Museum of Natural History, New York; National Museum of Natural History, Leiden), a single museum (the Natural History Museum, London) contributing nearly 20% of the museum records that could be georeferenced and dated 15 . Following databasing and/or georeferencing, records were returned to larger collections and to those who had requested the data.
Literature. Data from the literature were added to those previously collected by McGowan 24 . Entire series of key English-language international and regional ornithological journals such as Ibis, Bird Conservation www.nature.com/scientificdata www.nature.com/scientificdata/ International, Journal of the Bombay Natural History Society, and Kukila were scanned for relevant information, availability allowing. We began at the library of the Zoological Society of London and followed up missing journal issues at the BirdLife International library, Cambridge UK; the British Library, London, UK; the Edward Grey Institute, University of Oxford, UK. Relevant Chinese literature was also scanned. Additionally, data were obtained from regional reports, personal diaries, letters, newsletters etc stored in the archives of BirdLife International, Cambridge, UK; the World Pheasant Association, Newcastle, UK; the Edward Grey Institute, University of Oxford, UK. Several of the species/regional experts we consulted also contributed their personal records which were recorded in the database as 'personal communications' . As far as it were possible, records were classed as primary or secondary data within the 'dynamicProperties' field of GalliForm 14 . It is important to note that some primary records or museum specimens will be duplicated within the database in the secondary data.
Banding records. Eighty-three ornithological banding groups were identified using web-based searches and were contacted via email. Thirty of these groups replied and only seven were able to provide us with data (see Table 1). The majority of galliform species tend not to be banded due to their large body sizes and spurs. Additionally, many of the banding groups kept their records on paper and were not able to send them to us. Nevertheless, we were able to access and georeference 15,152 banding records.
Ornithological atlases. We digitised location data from 20 ornithological atlases (see Table 2). Data from several other atlases were not used since the range of dates for the records was wider than 20 years. trip report website data. We used the two trip report websites that were popular with birders during the data recording period (2005-2008), www.travellingbirder.com and www.birdtours.co.uk. At that time, eBird (probably the most relevant current online source today) did not cover the majority of the countries within our study region, and our intention with the deposition of this dataset is to focus on pre-eBird data that are more difficult and time consuming to access. We extracted data from all trip reports of birdwatching visits to European, Asian and North African countries. Care was taken to enter reports that featured on both websites once only.
Criteria for data inclusion. To be included in the database, records had to meet the following criteria: 1. The record identified the species of the bird concerned. 2. The record contained either a verbal description of the locality at which the bird concerned was observed or the co-ordinates at which the bird was observed.
Records of captive birds were excluded. Records relating to non-native occurrences were included but were flagged in the 'establishmentMeans' field as "introduced". Data entry. GalliForm 14 was originally compiled in the programme Microsoft Access 2003. To maximise uniformity in data entry, all data recorders were given thorough and consistent training and each was provided with a set of database guidelines. An Access Database form was created to standardise data entry and to enable multiple members of the team to collect data simultaneously.
Each entry in GalliForm 14 corresponds to a single record of a single species recorded in a specific location. The data fields of GalliForm 14 are described in Online-only Table 3. The taxonomy used has been updated to be consistent with the BirdLife International 2019 taxonomy (datazone.birdlife.org). All information was entered exactly as it was described in the data source, with as much information extracted as possible. Multiple records from different sources which recorded the same information were still included in the interest of completeness. The only exception to this is the trip report data in which we did not enter identical records which occurred on both the Travelling Birder and Bird Tours websites.
The source of the data, i.e. literature, museum, atlas, ringing or website trip report is recorded in the 'dynam-icProperties' field under the code "dataSource". For literature data, (where known) the nature of the record, i.e. primary or secondary, is recorded under the code "datatype".
Taxonomy has of course changed considerably over time. To allow for this we recorded the taxonomy as it was described in the data source in the 'originalNameUsage' field. The current taxonomy was then selected from a look-up table. If at the time of data entry, the data compiler was unsure which species the synonym referred to, the species was tagged as "unknown" and the species was designated at a later date following further research on the synonym. www.nature.com/scientificdata www.nature.com/scientificdata/ Identical localities can also be described in multiple ways. We recorded the locality as it was given in the data source in the 'verbatimLocality' field. If the 'verbatimLocality' clearly tallied with a locality already within the database, the record was linked to that locality in order to increase georeferencing efficiency.
It was rare for a source to record absence of evidence, i.e. a survey for a species at a particular locality which failed to find that species. However, in the few cases where we did come across such records, the locality and date of the survey were recorded and "absent" was recorded in the 'occurrenceStatus' field.
Each record refers to an independent observation. For museum and ringing records, this means a single individual. For literature, atlas or trip report records this may refer to a group of birds observed in one particular locality, on one particular day. If given, the number of total individuals is recorded in the 'individualCount' field. The number of males and females is recorded in the 'sex' field and the number of juveniles and adults in the 'life-Stage' field. If the 'lifeStage' field is blank, it is reasonable to assume the individual(s) is an adult.
Occasionally, additional information about the observation might be included in the data source, for example the habitat the bird was observed in or whether the bird was common or rare in that locality. These data are recorded in the 'habitat' and 'organismQuantity' fields, respectively. Any additional information which did not fit within the structure of the database was recorded in the 'occurrenceRemarks' field, along with any notes found on museum labels.
For the purposes of data deposition, the database was converted to a tab-delimited CSV file with all fields following Darwin Core format. A full summary of these fields is given in Online-only Table 3.
Georeferencing. Locality descriptions were converted to geographic co-ordinates using a wide range of atlases and gazetteers, co-ordinates generally only being assigned if accurate to one degree (although in the majority of cases the locations were accurate to within 30 minutes, Table 3). We would initially search for a locality

Atlas
Year Editors  www.nature.com/scientificdata www.nature.com/scientificdata/ within the gazetteers available to us at the time. If the locality was not listed within those gazetteers we would search for the locality using atlases. Since this fieldwork was conducted, MaNIS standards have become widely used for studies of this kind, but these weren't fully developed at the time of data collection 25 . Named places, e.g. towns or counties, were georeferenced using their geographic centre and georeferencing uncertainty measured from the centre to the edge of the named place. Often localities were given simply as the name of a river, mountain or Protected Area. In these instances we used the midpoint of the river between source and mouth (uncertainty measured as distance from midpoint to source/mouth), the summit of the mountain (uncertainty measured as distance from summit to approximate mountain foot) and the rough centre of the Protected Area (uncertainty measured as distance from centre to Protected Area edge). If a particular locality description matched two or more places their midpoint was taken (uncertainty measured as distance from midpoint to place). Offsets from localities (e.g. "50 km N of Kuala Lumpur"; "8 miles along the road from Sheffield to Chesterfield") were measured using a digital atlas (uncertainty was approximated at the georeferencer's discretion in these instances, usually between 3 and 10 arc-minutes, depending on the vagueness of the offset.) For georeferencing done 'in house' , the gazeteer/atlas used was recorded.
When possible, localities we could not georeference ourselves were sent to regional experts. 92% of our localities are georeferenced to an accuracy of 30 minutes, corresponding to 82% of occurrence records (see Table 3).
We had less success at georeferencing museum records than literature records 15 , due in part to difficulties in reading hand-writing on specimen labels. Older records were also harder to georeference, presumably due to changes in place names over time, and to some early ornithologists failing to document the collection locality. As might be expected, localities from countries that do not use the Roman alphabet were also harder to georeference.
Some records were excluded from the database based on their locality: records which we thought were trading localities, notably Malacca in Malaysia and Leadenhall Market in the UK; records from captive specimens, e.g. zoological gardens.
Dating. 49% of records are dated to within an accuracy of one year. Where possible, we assigned date ranges to undated records. For example, if the name of the collector was given on a museum specimen and we knew when that collector was active in that region, we assigned a date range covering that period. There remain undated records which could perhaps be dated in this way. Undated literature records were designated as occurring before their publication date. We were able to date 89% of records to within 10 years.

Data Records
A relational database structure was created in Microsoft Access to organise and store the species occurrence records with their spatial dependencies and data sources and to keep track of synonyms. For the purposes of publication, this database was converted to a tab-delimited CSV file that followed the Darwin Core format.
We provide a dataset for Galliformes occurrences within the Palaearctic and Indo-Malay realms at species level. These data, obtained and curated as explained above, are available from the Global Biodiversity Information Facility (https://doi.org/10.15468/9825yw). Online-only Table 3 lists and describes the fields of GalliForm 14 .
The following figures and tables summarise the dataset. Figure 1 shows the spatial distribution of records; Fig. 2 shows the accumulation of records through time; Fig. 3 shows the spatial distribution of the number of records, species richness and the most recent year of record; Fig. 4 shows the completeness of selected data fields. Table 1 lists the ringing groups which were able to share data with us; Table 2 lists the atlases that we digitised; Table 3 details the completeness of records which are georeferenced and/or dated to within 1 year. Online-only Table 1 details the number of records per species and the time span these records cover; Online-only Table 2 lists the museums which were able to share data with us; Online-only Table 3 describes the Field Names of GalliForm 14 .

technical Validation
Georeferenced data were subject to the following checks: 1. That each data point was in the country that its locality described. 2. That each data point was within reasonable distance of the species' known historical range. 3. That each data point that identifiably came from a protected area listed in the World Database of Protected Areas (https://www.protectedplanet.net/) was indeed within that protected area.
Finally, data were sent to experts on regions/species for informal 'refereeing' to highlight dubious or missing data. We were able to referee approximately one third of the records in this way.  www.nature.com/scientificdata www.nature.com/scientificdata/

Usage Notes
The dataset described here can be used to investigate the spatial and temporal patterns of Galliformes distributions at multiple scales and resolutions. The dataset was first used to examine bias in different sources of biodiversity data 15 . It has also been used to investigate predictors of range change 18 , to examine the effects of missing data on estimates of biodiversity metrics 10 , to assess the completeness of geographic range estimates 16 , to investigate the position of local extinctions with respect to species' range edges 17 , to explore the optimisation of Protected www.nature.com/scientificdata www.nature.com/scientificdata/ Area networks 20 , to examine the local extirpation of species outside Protected Areas 12 and to model the potential distributions of highly threatened species 19,21 . There remains much scope for this database to inform further biodiversity or conservation related studies, for example, investigations of geographic range change or predictors of extinction risk.
The data presented here do need to be interpreted carefully with respect to data bias and to missing data. Biodiversity data may be biased in a variety of ways, for example geographically, towards particular ecosystems or towards more charismatic species e.g. 26,27 . Additionally, these data biases may change over time. Although our database is based on a systematic and thorough search of all the data available to us from all regions covered, the data are still likely to be biased because there will have been intrinsic biases in the available data sources. For example, in this database, central India is under-represented in terms of recent research locales and it is hard to disentangle whether this is due to a lower number of ecologists focussing their studies there or if it is a justified skew as a result of biodiversity loss in this area. More recent records also show a bias toward threatened species and Protected Areas 15 . There are very few records of species absence although of course absence may be inferred if there are many records of other species in a particular locality. For a more detailed discussion of bias and missing data see Boakes et al. 15 and Boakes et al. 10 .  Table 3.