Background & Summary

The Amazon Basin covers more than 6,000,000 km2, produces about 20% of the world’s freshwater discharge1,2,3 and contains the highest freshwater richness on Earth4. This is especially true for Amazonian fishes that represent ~15% of all freshwater fish species described worldwide5,6. The processes having generated this highly diverse fish fauna are incompletely understood. However, low rates of species extinction over several millions of years due to the diversity in aquatic habitats and the stability in favourable climatic conditions are most probably involved7,8. Compared to other large riverine ecosystems on Earth, the Amazon Basin and its fish fauna are still in a relatively good state of conservation9,10. Nevertheless, recent expansion of infrastructures and economic activities are likely to endanger this fish fauna in the near future due to the substantial increase in threats such as habitat fragmentation and river flow modification by dams, deforestation, roads, mining, urban and/or agricultural pollutions, species introduction and overfishing11. Climate change will probably exacerbate these threats further amplifying changes in the structure and function of fish communities11,12.

Our knowledge on fish species occurrence and spatial distribution within the Amazon Basin is far from complete. Numerous new species are described each year13,14 and some large areas are still unknown in several portions of the basin15,16. This was among the key motivations of the transnational collaborative project AmazonFish (https://www.amazon-fish.com/) that aimed to compile the most complete and up-to-date information currently available on freshwater fish species distribution for the entire Amazon drainage basin and to initiate scientific collecting expeditions in under-sampled areas to fill the gaps. This database is thus the result of mobilizing information available from various sources (published articles, grey literature, field expedition reports, online biodiversity databases and scientific collections from museums and universities worldwide) and field expeditions organized during the project. This compilation, covering a time span of almost two hundred years (1834–2019), currently comprises 2,406 valid native freshwater fish species recorded from 590 different sources representing more than 235,064 occurrence records (232,936 georeferenced and 2,128 non-georeferenced) and 21,500 sampled localities (hereafter called sampling sites). Two parallel compilation efforts on the distribution of freshwater fish species in the Amazon Basin have been recently released17,18. The field guide book from van der Sleen and Albert17 delivers a general view of the current knowledge of fish ecology and distribution maps at the genus level only. The compilation from Dagosta & De Pinna18 provides species lists for 30 Amazonian sub-drainages but suffers from a lack of information19. Here, we complement and refine these previous initiatives by providing species-level distributions on a database format combining available information at both sampling site and sub-drainage grains (144 units).

By compiling the knowledge on the spatial distribution of freshwater fishes and addressing the taxonomic and sampling gaps, the Amazon Fish database should become a valuable and long-lasting source of information for ecological and conservation studies. The database is currently being used to analyse fish diversity patterns at the Amazon Basin scale19, to evaluate the potential effect of climate change20 and fragmentation21 on this biodiversity and to define diversity hotspots for the whole basin conservation priorities22. Besides improving our fundamental knowledge of the patterns and processes involved in the generation of Neotropical freshwater fish diversity, the information provided can also help developing regional conservation programs and contributing to largescale transnational ecosystem management initiatives.

Species occurrences are delivered here at two spatial grains, sampling sites (with precise geographic coordinates) and sub-drainage (144 units) grains. The database is organised in two sub-datasets and one shapefile. The first dataset contains the species list by sub-drainage with the taxonomic FishBase reference name (Family, Genus, Referent species valid name and Author), the species status (‘native’ or ‘exotic’) and the occurrence species status (‘valid’, ‘to be verified’, ‘marine’; see Technical validation for more details). The second dataset contains the geographic coordinates for the georeferenced records, the information source of each record, and the original name of the species cited in the source (‘synonym’, ‘typing error’). Finally, the shapefile delineates all the sub-drainages, along with the corresponding geographic information (e.g. main river name, main country, geographic coordinates and surface area of the sub-drainage). The database is obviously not complete, regular updates are planned in the future to include new occurrence records from literature, collections and new field expeditions planned to cover sampling gaps, together with the distribution of newly described species and nomenclatural changes.

Methods

Information sources

The database results from the transnational collaborative project AmazonFish (ERANetLAC/DCC-0210) whose purpose was to identify and compile all known information sources available on freshwater fish species occurrences for the entire Amazon drainage basin. The original project included researchers from (1) the French Institute for Development (IRD) in France, (2) the Pontificia Universidad Javeriana (PUJ-UNESIS) in Colombia, (3) the Museo de Historia Natural de la Universidad Nacional Mayor de San Marcos (MUSM) in Peru and (4) the Royal Belgian Institute of Natural Sciences in Belgium. The project also benefited from official collaborations with researchers from Brazil (Instituto Nacional de Pesquisas da Amazônia INPA; Universidade Federal de São Paulo UNIFESP; Universidade Federal de Rondônia UNIR; Universidade Federal do Pará UFPA; Universidade Federal do Oeste do Pará UFOPA; Universidade Federal de Mato Grosso UFMT), Colombia (Instituto Alexander von Humboldt IAvH, Universidad Nacional de Colombia UN ICN-MHN, Universidad del Tolima UT-CZUT, Instituto Amazónico de Investigaciones Científicas SINCHI-CIACOL, Instituto para la Investigación y la Preservación del Patrimonio Cultural y Natural del Valle del Cauca INCIVA, Universidad Católica de Oriente UCO), Ecuador (Museo Ecuatoriano de Ciencias Naturales MECN-DP, Instituto Nacional De Biodiversidad INABIO), Bolivia (Universidad Mayor de San Simon UMSS-ULRA, Colección Boliviana de Fauna MNHN–IE UMSA, Universidad Autónoma del Beni CIRA) and Switzerland (Museum d’Histoire Naturelle de Genève, MHNG). All these partners brought into the project, besides their Neotropical fish taxonomic expertise needed to produce a high-quality database, existing fish databases from their own collections and expeditions, and a large networking capacity that was essential for identifying and involving other data providers.

In order to build the AmazonFish database, an inventory of the possible data sources was conducted at the beginning of the project in early 2016 and data from a wide range of sources were compiled and standardized in a single dataset.

The information used includes five source types:

  1. A.

    Information extracted from the literature (published articles, books, grey literature)

  2. B.

    Data from online biodiversity databases (i.e. GBIF and others)

  3. C.

    Data from museums and universities collections

  4. D.

    Data held or compiled by the project partners (e.g. country level)

  5. E.

    New data from sampling campaigns organized within the framework of the project

An inventory of all the literature sources (published articles, books, technical reports) existent for the Amazon Basin led to more than 800 different documents that were subsequently analysed, from which 459 provided valuable data on fish species distribution, not redundant with any official collection. An important amount of data was extracted from the most used and frequently updated online biodiversity databases (see details in Table 1). These repositories release biological data under a Creative Commons licence in which the user agrees to acknowledge the data sources. Data from museums and universities collections not available through these online facilities were obtained by contacting the curators or researchers in charge and integrating them as official project collaborators (curators and researchers mainly from Brazil, Ecuador and Bolivia). The project partners (Colombia and Peru) compiled data at the country level. For Colombia23,24, the data were previously published through the GBIF network. For Peru, the AmazonFish project has supported the numeric digitalization of the national freshwater fish collections25,26, which is still an ongoing work (51% of the records have been digitalized so far). Finally, supplementary occurrence data were obtained during five sampling campaigns in Brazil, Colombia and Peru and targeting under-sampled areas identified during the project.

Table 1 Online Biodiversity Repository Sources with the complete name, the number of occurrences and institutions, the last consulted date and the internet link.

Species, taxonomy and status

All occurrences not identified to species level were discarded (i.e. occurrences giving only genus names commonly abbreviated to sp., species affinis commonly abbreviated to: sp. aff., aff., or affin. or species confer abbreviated to cf.). All species scientific names are reported in the database as appearing in each information source and were carefully checked for typing errors and misspellings. Because taxonomy is a ‘moving target’, species names were standardized and linked to an internationally accepted standardized name and associated taxonomic information in order to find synonymies and provide accepted names. All species names were first searched in FishBase through the ‘rfishbase’ package27 from the R environment28 allowing to easily obtain the valid species names. For species names absent from FishBase, a manual search was applied in the Eschmeyer’s Catalog of Fishes (http://researcharchive.calacademy.org/research/ichthyology/catalog/fishcatmain.asp). This last step allowed finding valid names and recently described species not yet included in FishBase. The final standardized species list contains 3,366 valid species names avoiding biases due to synonyms and uncertain identifications (see ‘Technical Validation’). We also integrated all remaining species names, i.e. not listed in any of the two scientific catalogues, as ‘unknown name at present’ (294 species names).

A species status (‘native’ or ‘exotic’) and an occurrence species status (‘valid’, ‘to be verified’ or ‘marine’) were assigned to each species. The species status distinguishes ‘native’ from ‘exotic’ species (i.e. non-native species introduced in the Amazon Basin)5 and the occurrence species status is divided in three criteria: (1) ‘valid’ (species known to belong to the Amazon Basin); (2) ‘to be verified’ (species whose presence in the Amazon Basin is not certain because of possible mis-identification or localisation errors); and (3) ‘marine’ (species whose primary habitat is not freshwater, based on information available in FishBase or Eschmeyer’s Catalog of Fishes).

At this time, the database contains 2,406 ‘native’ and ‘valid’ freshwater fish species, 837 ‘to be verified’ species, 105 ‘marine’, 18 ‘exotic’ and 294 ‘unknown’ species. The species considered as ‘native’ and ‘valid’, i.e. freshwater species belonging to the Amazon Basin, were the only species considered in all species numbers reported below.

Sub-drainages delineation

The Amazon Basin was defined here as the area of land where precipitation collects and drains off into a common outlet. This excludes de facto the Tocantins basin and Guiana coastal streams (see Fig. 1), but constitutes for freshwater fishes an ideal grain for conducting biogeographical and/or macroecological studies29.

Fig. 1
figure 1

(a) Distribution of sampling sites recorded in the AmazonFish database and (b) delimitation and codes of the 144 sub-drainages units (see corresponding names in Online-only Table 1), based on a modified version of HydroBASINS (see methods). The major tributaries of the Amazon Basin are represented in different colours and their names are added in bold.

The hydrological sub-drainage units within the Amazon Basin were delineated using the HydroBASINS framework, a subset of the HydroSHEDS database30. The levels 5 and 6 were combined with a constraint area of >20,000 km2, at the exception of sub-drainages located in the river mainstem where delineation was based on the distance between two main tributaries entering the mainstem. This led to obtain a total of 144 sub-drainages covering the entire Amazon system (Fig. 1).

Data Records

The database31 provides a comprehensive overview of the current knowledge of the fish species diversity and distribution in the Amazon Basin, with 21,500 sites (Fig. 1), 232,936 georeferenced occurrence records and 2,128 non-georeferenced records from 590 different sources combining literature, scientific collections, sampling campaigns and partner’s datasets. Some of the online biodiversity repositories (Table 1) showed some redundancies because often referring to the same collections. In this specific case, only one occurrence record was retained.

The main sources of the database are online biodiversity databases (56% of the occurrences), followed by locally hosted data from the scientific partners (Peru and Colombia), museums and universities from Brazil, Bolivia and Ecuador (38%), literature data (5% of the records) and data obtained during sampling campaigns by partners from Colombia, Peru and Brazil (1%). This represents 93 different collections from Scientific Institutions, 31 Partners references, 459 literature references and five AmazonFish expeditions.

The database includes information for 56 families, 514 genera and 2,406 native valid freshwater species, virtually half of the circa 4,760 total number of species known for the whole Neotropical biogeographic region5,6. Among these 2,406 species, 1,402 are found exclusively in the Amazon Basin (i.e. species appearing nowhere else on Earth; Amazonian endemic species) based on the global species distribution provided by Tedesco et al.5.

The lowland Amazon and its two main tributaries, the Negro and Madeira Rivers regroup the highest number of sites, occurrences and the highest diversity (Table 2), whereas less information is available for some small tributaries. At the sub-drainage grain, the density of sites presents an important spatial variability (Online-only Table 1 and Fig. 2). For instance, the Curuçá sub-drainage belonging to the Javari River, currently lacks information about its ichthyofauna. The ‘updates and limitations’ section below presents a more detailed overview of the spatial data gaps.

Table 2 Summary table with the total number of sites, occurrences and total number of families, genera, species and endemic species (i.e. Amazon endemic species present only in the sub-drainage) for each Amazon major tributary (with their surface area in km2).
Fig. 2
figure 2

(a) Number of records, (b) density of sites (number of sites divided by the sub-drainage area and areas without information using the HydroBASINS Level7 spatial grain unit30), (c) total number of species and (d) number of endemic species (species present only in the Amazon Basin and only in the sub-drainage) for the 144 sub-drainage units.

The whole dataset is organised in three sub-sets31: a table of the species list by sub-drainage (‘GeneralDistribution’), a table of occurrence records with sources (‘CompleteDatabase’), and a shapefile of the 144 sub-drainages (‘SubDrainageShapefile’).

The first sub-set (‘GeneralDistribution’) contains the species list by sub-drainage with the taxonomic reference name (Family, Genus, Referent species valid scientific name and Author), the species status (‘native’ or ‘exotic’) and the occurrence species status (‘valid’, ‘to be verified’, ‘marine’). The corresponding table has nine columns (see Table 3).

Table 3 Detailed legend of the information given in each column of the 2 datasets (9 columns for the GeneralDistribution file and 22 columns for the CompleteDatabase file).

The second sub-set (‘CompleteDatabase’) provides the geographic coordinates for the georeferenced sampling sites and the information source of each record. It is complemented with the original name of the species cited in the source (‘synonym’ or ‘typing error’) and those species with status ‘unknown name at present’. The detailed sources contain the source type of the data (‘Literature’, ‘Online Biodiversity Database’, ‘Partners Datasets’ and ‘AmazonFish Expedition’), the Biodiversity Repository source for the Online Biodiversity Database, the Scientific Institution Code and complete name, the GBIF Citation and DOI, the complete literature reference and the citation reference of the Partner dataset. Finally, the non-georeferenced occurrences are separated in three categories, ‘sub-drainage information’ (species occurrence information at the sub-drainage grain), ‘approximated coordinates’ (species occurrence information at river or reach scales) and ‘geographic error’ (the geographical coordinates of a site do not correspond to the geographical location given in the source). The corresponding table has 22 columns (see Table 3).

The third sub-set (‘SubDrainageShapefile’), corresponds to shapefile delineating all the sub-drainages and their corresponding geographic information and is organized in eight columns: (1) the major Amazon tributary name (MTRIB_NM), (2) the unique major Amazon tributary code (MTRIB_CD), (3) the unique sub-drainage name (SBD_NM), (4) the unique sub-drainage code (SBD_CD), (5) the surface area of the sub-drainage (AREA, in km2), (6) the main country where it belongs (COUNTRY), (7,8) the centroid longitude and latitude coordinates of the sub-drainage (CENT_X and CENT_Y).

The two table sub-sets (‘GeneralDistribution’ and ‘CompleteDatabase’) are in CSV format (columns separated by commas) and the shapefile sub-set (‘SubDrainageShapefile’) in ArcGis SHP format31. Both formats can be linked to the species occurrence table using the unique sub-drainage code or name to visualize and analyse species distribution using any adapted software (e.g. R or QGIS, http://qgis.osgeo.org). The sampling coordinates and shapefile are in the World Geodetic System 1984 (WGS84) datum and geographic coordinate system. The files of the database are in ‘CSV’ format (UTF-8 encoding, comma separator) and can be uploaded by most statistical software, spreadsheets or any other database management systems. The current version of the database can be retrieved from Figshare31, the AmazonFish website (https://www.amazon-fish.com/) and the Freshwater Biodiversity data portal (https://data.freshwaterbiodiversity.eu/).

Technical Validation

Taxonomic and status validation

Each species name found in a given information source was confronted to the valid and synonym species names lists from FishBase and Eschmeyer’s Catalog of Fishes to ensure the identifications validity provided by the information source. This taxonomic validation identified 1,332 synonyms, 781 typing errors and 294 unknown species names (names not listed in any of the two scientific catalogues). The original scientific names of the species are reported in the expanded table of the database (‘CompleteDatabase’), where users can extract sub-species, synonyms or unknown species names.

After having validated the taxonomic names, we further verified the presence certainty in the Amazon Basin of all the taxonomically valid species recorded in our database. This careful review was an essential step in the elaboration of the database and resulted in assigning a status to each species. The species status is based on the information provided by the data source, expert opinion from the AmazonFish partners and information about the species general distribution available in FishBase or Eschmeyer’s Catalog of Fishes catalogues. When the presence of a taxon was inconsistent with its actual known distribution, the species was classified as ‘to be verified’. A recently published database on the global distribution of freshwater fish species5 was also consulted to verify the overall distribution of each species, their exotic status and to identify species endemic to the Amazon Basin.

As a result, the database provides not only information on the validity of each species, but also on species occurrences and names that need further attention (‘to be verified’ and ‘unknown name at present’). This gives the opportunity for database users to refer to their own expertise and knowledge to validate or not the accuracy of the original source, species name and distribution (ideally, giving feedback to the AmazonFish project, https://www.amazon-fish.com/).

Species distribution validation

The geographic coordinates of the sites were compared to the location name of the sub-basin given in the source. In case of mismatch, the coordinates were removed from the database and the information was kept only at the sub-basin grain and referenced as ‘geographic error’.

The geographic accuracy of the species distribution (for ‘native’ and ‘valid’ species) inside the Amazon Basin was checked using a basic geographic analysis. A convex hull envelop was delineated for each species based on its occurrence points, resulting in a list of sub-drainages potentially occupied by a given species. This list was then compared to the list of sub-drainages where the species had at least one record. From this comparison, circa 200 species showed some inconsistent distributions (outlying occurrences). All these occurrences were consequently carefully checked and further validated or excluded (see ‘ExcludedOccurrences’ file31).

Updates and limitations

The database is obviously not complete and definitive, and we aim to keep the high-quality level of the database with regular updates, ideally with bi-annual steps, depending on human and financial resources. More than 100 new fish species were described between 2017 and 2019, which makes this update effort crucial in order to improve our knowledge about the distribution of freshwater fish within the Amazon Basin. The technical and taxonomic validation procedures described above will be applied to any new information included in the database. Three main factors will be considered in future updates: (1) new or previously non-available data sources with species lists or records; (2) occurrences of newly described species; and (3) nomenclature changes in the taxonomic classification.

If the main rivers of the Amazon Basin appear well surveyed, some gaps do exist, however in various parts of the basin (Fig. 2). These gaps are mainly located in zones either difficult to access due to the topography and/or located in protected areas (indigenous lands or protected areas). Identifying never-sampled (to our knowledge) or under-sampled sub-drainages is a first step to guide increasing sampling efforts in these areas. The AmazonFish project has already initiated this process, by supporting the numeric digitalization of the national freshwater fish collections from Peru25,26 and by initiating sampling campaigns in detected gaps in Colombia, Peru and Brazil. All these spatial gaps in the database will also be prioritized in future updates through literature and web-based sources checking. Researchers holding fish distribution data from any of the current gaps or under-sampled areas (Fig. 2) and that wish to share these data are welcome to join the project. This information will be included with the complete source, after validation, in the next update of the database.