A database of freshwater fish species of the Amazon Basin

The Amazon Basin is an unquestionable biodiversity hotspot, containing the highest freshwater biodiversity on Earth and facing a recent increase in anthropogenic threats. Current knowledge of the spatial distribution of freshwater fish species in this basin is greatly deficient, preventing a comprehensive understanding of this hyper-diverse ecosystem as a whole. Filling this gap was the priority of a transnational collaborative project, the AmazonFish project (https://www.amazon-fish.com/). Relying on the outputs of this project, we provide the most complete fish species distribution records covering the whole Amazon drainage. The database, which includes 2,406 validated native freshwater fish species and 232,936 georeferenced records, results from an extensive survey of species distributions drawing on 590 different sources (e.g. published articles, grey literature, online biodiversity databases and scientific collections from museums and universities worldwide) and on field expeditions conducted during the project. This database, delivered at both the georeferenced locality grain (21,500 localities) and the sub-drainage grain (144 units), represents a highly valuable source of information for further studies on freshwater fish biodiversity, biogeography and conservation.


Background & Summary
The Amazon Basin covers more than 6,000,000 km², produces about 20% of the world's freshwater discharge 1-3 and contains the highest freshwater richness on Earth 4. This is especially true for Amazonian fishes, which represent ~15% of all freshwater fish species described worldwide 5,6. The processes that generated this highly diverse fish fauna are incompletely understood. However, low rates of species extinction over several million years, due to the diversity of aquatic habitats and the stability of favourable climatic conditions, are most probably involved 7,8. Compared to other large riverine ecosystems on Earth, the Amazon Basin and its fish fauna are still in a relatively good state of conservation 9,10. Nevertheless, the recent expansion of infrastructure and economic activities is likely to endanger this fish fauna in the near future through a substantial increase in threats such as habitat fragmentation and river flow modification by dams, deforestation, roads, mining, urban and/or agricultural pollution, species introduction and overfishing 11. Climate change will probably exacerbate these threats, further amplifying changes in the structure and function of fish communities 11,12.
Our knowledge of fish species occurrence and spatial distribution within the Amazon Basin is far from complete. Numerous new species are described each year 13,14 and some large areas in several portions of the basin remain unexplored 15,16. This was among the key motivations of the transnational collaborative project AmazonFish (https://www.amazon-fish.com/), which aimed to compile the most complete and up-to-date information currently available on freshwater fish species distribution for the entire Amazon drainage basin and to initiate scientific collecting expeditions in under-sampled areas to fill the gaps. This database is thus the result of mobilizing information available from various sources (published articles, grey literature, field expedition reports, online biodiversity databases and scientific collections from museums and universities worldwide) and from field expeditions organized during the project. This compilation, covering a time span of almost two hundred years (1834-2019), currently comprises 2,406 valid native freshwater fish species recorded from 590 different sources, representing 235,064 occurrence records (232,936 georeferenced and 2,128 non-georeferenced) and 21,500 sampled localities (hereafter called sampling sites). Two parallel compilation efforts on the distribution of freshwater fish species in the Amazon Basin have been recently released 17,18. The field guide book from van der Sleen and Albert 17 delivers a general view of the current knowledge of fish ecology and distribution maps at the genus level only. The compilation from Dagosta [...] information 19. Here, we complement and refine these previous initiatives by providing species-level distributions in a database format combining available information at both the sampling site and sub-drainage (144 units) grains.
By compiling the knowledge on the spatial distribution of freshwater fishes and addressing the taxonomic and sampling gaps, the AmazonFish database should become a valuable and long-lasting source of information for ecological and conservation studies. The database is currently being used to analyse fish diversity patterns at the Amazon Basin scale 19, to evaluate the potential effects of climate change 20 and fragmentation 21 on this biodiversity, and to define diversity hotspots for basin-wide conservation priorities 22. Besides improving our fundamental knowledge of the patterns and processes involved in the generation of Neotropical freshwater fish diversity, the information provided can also help develop regional conservation programs and contribute to large-scale transnational ecosystem management initiatives.
Species occurrences are delivered here at two spatial grains: sampling sites (with precise geographic coordinates) and sub-drainages (144 units). The database is organised in two sub-datasets and one shapefile. The first dataset contains the species list by sub-drainage with the taxonomic FishBase reference name (Family, Genus, Referent species valid name and Author), the species status ('native' or 'exotic') and the occurrence species status ('valid', 'to be verified', 'marine'; see Technical Validation for more details). The second dataset contains the geographic coordinates for the georeferenced records, the information source of each record, and the original name of the species cited in the source ('synonym', 'typing error'). Finally, the shapefile delineates all the sub-drainages, along with the corresponding geographic information (e.g. main river name, main country, geographic coordinates and surface area of the sub-drainage). The database is not complete; regular updates are planned to include new occurrence records from literature, collections and new field expeditions designed to cover sampling gaps, together with the distribution of newly described species and nomenclatural changes.

[...] All these partners brought to the project, besides the Neotropical fish taxonomic expertise needed to produce a high-quality database, existing fish databases from their own collections and expeditions, and a large networking capacity that was essential for identifying and involving other data providers.
In order to build the AmazonFish database, an inventory of the possible data sources was conducted at the beginning of the project in early 2016 and data from a wide range of sources were compiled and standardized in a single dataset.
The information used includes five source types:
A. Information extracted from the literature (published articles, books, grey literature)
B. Data from online biodiversity databases (i.e. GBIF and others)
C. Data from museum and university collections
D. Data held or compiled by the project partners (e.g. country level)
E. New data from sampling campaigns organized within the framework of the project

An inventory of all the literature sources (published articles, books, technical reports) existing for the Amazon Basin yielded more than 800 different documents, which were subsequently analysed and of which 459 provided valuable data on fish species distribution not redundant with any official collection. A large amount of data was extracted from the most used and frequently updated online biodiversity databases (see details in Table 1). These repositories release biological data under a Creative Commons licence in which the user agrees to acknowledge the data sources. Data from museum and university collections not available through these online facilities were obtained by contacting the curators or researchers in charge and integrating them as official project collaborators (curators and researchers mainly from Brazil, Ecuador and Bolivia). The project partners (Colombia and Peru) compiled data at the country level. For Colombia 23,24, the data were previously published through the GBIF network. For Peru, the AmazonFish project has supported the digitization of the national freshwater fish collections 25,26, which is still ongoing (51% of the records have been digitized so far). Finally, supplementary occurrence data were obtained during five sampling campaigns in Brazil, Colombia and Peru targeting under-sampled areas identified during the project.
Species, taxonomy and status. All occurrences not identified to species level were discarded (i.e. occurrences giving only genus names, commonly abbreviated to sp.; species affinis, commonly abbreviated to sp. aff., aff. or affin.; or species confer, abbreviated to cf.). All species scientific names are reported in the database as they appear in each information source and were carefully checked for typing errors and misspellings. Because taxonomy is a 'moving target', species names were standardized and linked to an internationally accepted name and associated taxonomic information in order to resolve synonymies and provide accepted names. All species names were first searched in FishBase through the 'rfishbase' package 27 in the R environment 28, which returns the valid species names. For species names absent from FishBase, a manual search was carried out in Eschmeyer's Catalog of Fishes (http://researcharchive.calacademy.org/research/ichthyology/catalog/fishcatmain.asp). This last step allowed us to find valid names and recently described species not yet included in FishBase. The final standardized species list contains 3,366 valid species names, avoiding biases due to synonyms and uncertain identifications (see 'Technical Validation'). We also integrated all remaining species names, i.e. those not listed in either of the two scientific catalogues, as 'unknown name at present' (294 species names).
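The species-level filter described above can be illustrated with a minimal Python sketch. It is not the project's code (the name standardization itself was done with 'rfishbase' in R); the regular expression and the function name `is_species_level` are assumptions introduced here for illustration, built only from the abbreviations listed in the text.

```python
import re

# Open-nomenclature qualifiers named in the text (sp., aff., affin., cf.).
# The pattern is an illustrative assumption, not the project's actual rule.
_QUALIFIER = re.compile(r"\b(?:sp|aff|affin|cf)\b", re.IGNORECASE)


def is_species_level(name: str) -> bool:
    """Return True if `name` has at least genus + epithet and no qualifier."""
    tokens = name.split()
    if len(tokens) < 2:
        return False  # genus-only record
    return _QUALIFIER.search(name) is None


names = [
    "Hoplias malabaricus",
    "Hoplias sp.",
    "Hoplias aff. malabaricus",
    "Hoplias cf. malabaricus",
    "Hoplias",
]
kept = [n for n in names if is_species_level(n)]
```

In this toy list only the plain binomial is retained; all other entries would be discarded before name standardization.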
A species status ('native' or 'exotic') and an occurrence species status ('valid', 'to be verified' or 'marine') were assigned to each species. The species status distinguishes 'native' from 'exotic' species (i.e. non-native species introduced in the Amazon Basin) 5, and the occurrence species status is divided into three categories: (1) 'valid' (species known to belong to the Amazon Basin); (2) 'to be verified' (species whose presence in the Amazon Basin is not certain because of possible misidentification or localisation errors); and (3) 'marine' (species whose primary habitat is not freshwater, based on information available in FishBase or Eschmeyer's Catalog of Fishes).
At this time, the database contains 2,406 'native' and 'valid' freshwater fish species, 837 'to be verified' species, 105 'marine' , 18 'exotic' and 294 'unknown' species. The species considered as 'native' and 'valid' , i.e. freshwater species belonging to the Amazon Basin, were the only species considered in all species numbers reported below.
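As a quick consistency check, the four named categories reported above add up to the 3,366 standardized names, with the 294 'unknown' names counted separately. The Python sketch below performs that arithmetic and shows how rows flagged both 'native' and 'valid' could be selected; the dictionary keys `species_status` and `occurrence_status` are hypothetical column names, not those of the released files.

```python
# Counts as reported in the text; 'unknown' names are tracked separately.
reported = {
    "native_valid": 2406,
    "to_be_verified": 837,
    "marine": 105,
    "exotic": 18,
}
unknown_names = 294

total_standardized = sum(reported.values())


def select_native_valid(rows):
    """Keep rows flagged both 'native' and 'valid' (hypothetical keys)."""
    return [
        r for r in rows
        if r["species_status"] == "native" and r["occurrence_status"] == "valid"
    ]


rows = [
    {"species": "A", "species_status": "native", "occurrence_status": "valid"},
    {"species": "B", "species_status": "native", "occurrence_status": "to be verified"},
    {"species": "C", "species_status": "exotic", "occurrence_status": "valid"},
]
selected = select_native_valid(rows)
```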
Sub-drainages delineation. The Amazon Basin was defined here as the area of land where precipitation collects and drains off into a common outlet. This excludes de facto the Tocantins basin and the Guiana coastal streams (see Fig. 1), but constitutes for freshwater fishes an ideal grain for conducting biogeographical and/or macroecological studies 29.
The hydrological sub-drainage units within the Amazon Basin were delineated using the HydroBASINS framework, a subset of the HydroSHEDS database 30. Levels 5 and 6 were combined under an area constraint of >20,000 km², with the exception of sub-drainages located on the river mainstem, where delineation was based on the distance between two main tributaries entering the mainstem. This yielded a total of 144 sub-drainages covering the entire Amazon system (Fig. 1).

Data Records
The database 31 provides a comprehensive overview of the current knowledge of fish species diversity and distribution in the Amazon Basin, with 21,500 sites (Fig. 1), 232,936 georeferenced occurrence records and 2,128 non-georeferenced records from 590 different sources combining literature, scientific collections, sampling campaigns and partners' datasets. Some of the online biodiversity repositories (Table 1) contained redundant records because they often refer to the same collections. In such cases, only one occurrence record was retained.
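The removal of redundant repository records can be sketched as a first-seen de-duplication. The text does not state which fields were used to recognise duplicates, so the key below (institution code, catalogue number, species) and all field names are assumptions made purely for illustration.

```python
def deduplicate(records, key_fields=("institution_code", "catalog_number", "species")):
    """Keep the first record seen for each key; later duplicates are dropped.

    `key_fields` is a hypothetical choice of identifying fields.
    """
    seen = set()
    kept = []
    for rec in records:
        key = tuple(rec.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


records = [
    {"repository": "R1", "institution_code": "X", "catalog_number": "1", "species": "A"},
    {"repository": "R2", "institution_code": "X", "catalog_number": "1", "species": "A"},
    {"repository": "R2", "institution_code": "Y", "catalog_number": "7", "species": "B"},
]
unique = deduplicate(records)
```

Here the second record duplicates the first through a different repository, so only two records remain.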
The main sources of the database are online biodiversity databases (56% of the occurrences), followed by locally hosted data from the scientific partners (Peru and Colombia), museums and universities from Brazil, Bolivia and Ecuador (38%), literature data (5% of the records) and data obtained during sampling campaigns by partners from Colombia, Peru and Brazil (1%). This represents 93 different collections from Scientific Institutions, 31 Partners references, 459 literature references and five AmazonFish expeditions.
The database includes information for 56 families, 514 genera and 2,406 native valid freshwater species, virtually half of the circa 4,760 total number of species known for the whole Neotropical biogeographic region 5,6 . Among these 2,406 species, 1,402 are found exclusively in the Amazon Basin (i.e. species appearing nowhere else on Earth; Amazonian endemic species) based on the global species distribution provided by Tedesco et al. 5 .

[Table 1. Column headers recovered: Biodiversity Repository; Online Biodiversity Repository complete name; Number of institutions. Table body not recovered.]

The lowland Amazon and its two main tributaries, the Negro and Madeira Rivers, have the highest number of sites and occurrences and the highest diversity (Table 2), whereas less information is available for some small tributaries. At the sub-drainage grain, the density of sites varies considerably in space (Online-only Table 1 and Fig. 2). For instance, the Curuçá sub-drainage, which belongs to the Javari River, currently lacks information about its ichthyofauna. The 'Updates and limitations' section below presents a more detailed overview of the spatial data gaps.
The whole dataset is organised in three sub-sets 31 : a table of the species list by sub-drainage ('GeneralDistribution'), a table of occurrence records with sources ('CompleteDatabase'), and a shapefile of the 144 sub-drainages ('SubDrainageShapefile').
The first sub-set ('GeneralDistribution') contains the species list by sub-drainage with the taxonomic reference name (Family, Genus, Referent species valid scientific name and Author), the species status ('native' or 'exotic') and the occurrence species status ('valid' , 'to be verified' , 'marine'). The corresponding table has nine columns (see Table 3).
The second sub-set ('CompleteDatabase') provides the geographic coordinates for the georeferenced sampling sites and the information source of each record. It is complemented with the original name of the species cited in the source ('synonym' or 'typing error') and with those species having the status 'unknown name at present'. The detailed sources contain the source type of the data ('Literature', 'Online Biodiversity Database', 'Partners Datasets' and 'AmazonFish Expedition'), the Biodiversity Repository source for the Online Biodiversity Database, the Scientific Institution Code and complete name, the GBIF Citation and DOI, the complete literature reference and the citation reference of the Partner dataset. Finally, the non-georeferenced occurrences are separated into three categories: 'sub-drainage information' (species occurrence information at the sub-drainage grain), 'approximated coordinates' (species occurrence information at river or reach scales) and 'geographic error' (the geographical coordinates of a site do not correspond to the geographical location given in the source). The corresponding table has 22 columns (see Table 3).
The two table sub-sets ('GeneralDistribution' and 'CompleteDatabase') are in CSV format (UTF-8 encoding, comma separator) and the shapefile sub-set ('SubDrainageShapefile') is in ArcGIS SHP format 31. The shapefile can be linked to the species occurrence tables through the unique sub-drainage code or name in order to visualize and analyse species distributions using any suitable software (e.g. R or QGIS, http://qgis.osgeo.org). The sampling coordinates and the shapefile use the World Geodetic System 1984 (WGS84) datum and geographic coordinate system. The CSV files can be loaded by most statistical software, spreadsheets or database management systems. The current version of the database

Technical Validation
Taxonomic and status validation. Each species name found in a given information source was compared against the lists of valid and synonym species names from FishBase and Eschmeyer's Catalog of Fishes to check the validity of the identifications provided by the source. This taxonomic validation identified 1,332 synonyms, 781 typing errors and 294 unknown species names (names not listed in either of the two scientific catalogues). The original scientific names of the species are reported in the expanded table of the database ('CompleteDatabase'), where users can extract sub-species, synonyms or unknown species names.
After validating the taxonomic names, we further verified how certain the presence in the Amazon Basin was for each taxonomically valid species recorded in our database. This careful review was an essential step in the elaboration of the database and resulted in assigning a status to each species. The species status is based on the information provided by the data source, expert opinion from the AmazonFish partners and information about the general distribution of the species available in FishBase or Eschmeyer's Catalog of Fishes. When the presence of a taxon was inconsistent with its known distribution, the species was classified as 'to be verified'. A recently published database on the global distribution of freshwater fish species 5 was also consulted to verify the overall distribution of each species and its exotic status, and to identify species endemic to the Amazon Basin.
As a result, the database provides not only information on the validity of each species, but also on species occurrences and names that need further attention ('to be verified' and 'unknown name at present'). This gives database users the opportunity to draw on their own expertise and knowledge to validate or not the accuracy of the original source, species name and distribution (ideally, giving feedback to the AmazonFish project, https://www.amazon-fish.com/).
Species distribution validation. The geographic coordinates of the sites were compared to the location name of the sub-basin given in the source. In case of mismatch, the coordinates were removed from the database and the information was kept only at the sub-basin grain and referenced as 'geographic error'.
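The coordinate check against the stated sub-basin can be sketched as follows. This is a hedged illustration only: `locate_sub_basin` stands in for whatever spatial lookup maps a coordinate to a sub-basin (for example, a point-in-polygon query against the shapefile), and the record keys `lat`, `lon`, `sub_basin` and `georef_status` are hypothetical.

```python
def validate_site(record, locate_sub_basin):
    """Return a copy of `record`; drop coordinates that contradict the stated sub-basin.

    `locate_sub_basin(lat, lon)` is a caller-supplied (hypothetical) lookup
    returning the name of the sub-basin containing the point.
    """
    out = dict(record)
    found = locate_sub_basin(record["lat"], record["lon"])
    if found != record["sub_basin"]:
        # Mismatch: keep the record at the sub-basin grain only.
        out["lat"] = None
        out["lon"] = None
        out["georef_status"] = "geographic error"
    return out


def toy_lookup(lat, lon):
    # Toy rule standing in for a real spatial query.
    return "Xingu" if lon > -55 else "Madeira"


ok = validate_site({"lat": -3.0, "lon": -52.0, "sub_basin": "Xingu"}, toy_lookup)
bad = validate_site({"lat": -3.0, "lon": -60.0, "sub_basin": "Xingu"}, toy_lookup)
```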
The geographic accuracy of the species distribution (for 'native' and 'valid' species) inside the Amazon Basin was checked using a basic geographic analysis. A convex hull envelope was delineated for each species based on its occurrence points, resulting in a list of sub-drainages potentially occupied by a given species. This list was then compared to the list of sub-drainages where the species had at least one record. From this comparison, circa 200 species showed some inconsistent distributions (outlying occurrences). All these occurrences were consequently carefully checked and further validated or excluded (see 'ExcludedOccurrences' file 31).

[Table 2, fragment recovered from the extraction. Each row gives a sub-basin followed by seven numeric columns; of the column headers, only 'Number of endemic species' survives. The Tapajós row is truncated.]
Amazonas     100,687   1,463   16,365   51   342   971   11
Jari          58,207      80      471   41   160   227    9
Xingu        511,169   1,701   13,215   50   314   821   73
Paru          39,289      10       28   13    20    22    1
Curuá-una     31,116     123    1,025   38   119   195    2
Curuá         25,291      42      189   25    58    80    -
Tapajós      492,
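The convex-hull comparison described in the species distribution validation can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the authors' procedure: coordinates are treated as planar, each sub-drainage is reduced to a single centroid rather than a polygon, and the function names are introduced here. The hull uses the standard monotone-chain construction.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Monotone-chain convex hull, vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside_hull(hull, q):
    """True if q lies inside or on a counter-clockwise convex polygon."""
    n = len(hull)
    if n < 3:
        return False
    return all(cross(hull[i], hull[(i + 1) % n], q) >= 0 for i in range(n))


def unrecorded_within_hull(occurrences, centroids):
    """Sub-drainages whose centroid falls in the species hull but that have no record.

    occurrences: list of (x, y, sub_drainage_code); centroids: {code: (x, y)}.
    Using centroids instead of polygons is a simplifying assumption.
    """
    hull = convex_hull([(x, y) for x, y, _ in occurrences])
    recorded = {code for _, _, code in occurrences}
    return sorted(
        code for code, c in centroids.items()
        if code not in recorded and inside_hull(hull, c)
    )


occurrences = [(0, 0, "A"), (4, 0, "B"), (4, 4, "C"), (0, 4, "D")]
centroids = {"A": (0.5, 0.5), "B": (3.5, 0.5), "C": (3.5, 3.5),
             "D": (0.5, 3.5), "E": (2, 2), "F": (10, 10)}
gaps = unrecorded_within_hull(occurrences, centroids)
```

In this toy case, sub-drainage E lies inside the hull without any record, which is the kind of mismatch that would prompt a closer look at the surrounding occurrences; F lies outside the hull and is ignored.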
Updates and limitations. The database is neither complete nor definitive, and we aim to maintain its high quality through regular updates, ideally at bi-annual steps, depending on human and financial resources. More than 100 new fish species were described between 2017 and 2019, which makes this update effort crucial for improving our knowledge of the distribution of freshwater fish within the Amazon Basin. The technical and taxonomic validation procedures described above will be applied to any new information included in the database. Three main factors will be considered in future updates: (1) new or previously unavailable data sources with species lists or records; (2) occurrences of newly described species; and (3) nomenclatural changes in the taxonomic classification.
Although the main rivers of the Amazon Basin appear well surveyed, gaps remain in various parts of the basin (Fig. 2). These gaps are mainly located in zones that are difficult to access due to the topography and/or located in protected areas or indigenous lands. Identifying never-sampled (to our knowledge) or under-sampled sub-drainages is a first step to guide increased sampling efforts in these areas. The AmazonFish project has already initiated this process, by supporting the digitization of the national freshwater fish