One millennium of historical freshwater fish occurrence data for Portuguese rivers and streams

The insights that historical evidence of human presence and man-made documents provide are unique. For example, historical data may be critical to adequately understand the ecological requirements of species. However, historical information about the distribution of freshwater species remains largely a knowledge gap. In this Data Descriptor, we present the Portuguese Historical Fish Database (PHish-DB), a compilation of 2214 records (557 at the basin scale, 184 at the sub-basin scale and 1473 at the segment scale) resulting from a survey of 194 historical documents. The database was developed using a three-scale approach that maximises the inclusion of information by allowing different degrees of spatial acuity. The PHish database contains records of 25 taxonomical groups and covers a time span of one millennium, from the 11th to the 20th century. This database has already proven useful for two scientific studies, and its further use will contribute to correctly assessing the full range of conditions tolerated by species, establishing adequate benchmark conditions, and/or improving existing knowledge of species distribution limits.


Background & Summary
Collecting historical data on species diversity and occurrence from periods that predate the most impactful human activities (e.g., damming, the Industrial Revolution, modern fishing, and river channelisation) may improve knowledge about species ecology. However, historical documents have limitations that need to be understood to avoid incorrect interpretations 1 and/or extrapolations 2. There is cultural filtering that affects not only the spatial and temporal availability, completeness, and reliability of documentary records but also their quantity and quality 2. Most historical records rely on questionnaires/interviews, hence if there are erroneous answers, the inventories or scientific surveys will present incorrect data 3. Nevertheless, the utility of historical insight should not be underestimated, and using historical data for ecological studies is valid 2. Using an interdisciplinary approach 4, combining information from several independent spatial and temporal sources 1,2,4,5, crosschecking lines of evidence with independent datasets 6 and blending different methods 2 can help mitigate the limitations and lead to more accurate knowledge about past ecosystem conditions.
For some organisms and specific purposes, historical data might be essential to model the potential species distribution (e.g., Lassalle and Rochard 7, Clavero and Hermoso 8) because current distributions are often highly constrained by anthropogenic pressures that alter the natural realised ecological niche. A typical example is the case of diadromous fish species, whose inland progression has been gradually constrained by the presence of artificial barriers 8,9. Consequently, current occurrence data will only cover a restricted part of the full range of conditions tolerated by species. To create the Portuguese Historical Fish Database (PHish-DB) we surveyed 194 historical documents, resulting in 2214 records from 30 basins, 280 sub-basins and 490 segments. Data collection started in 2007 and was performed by researchers in history and ecology. Despite some underrepresentation of coastal areas, the spatial distribution of the historical records is homogeneous throughout the country and covers all the major river basins (Fig. 1). Three international river basins stand out (Douro, Minho and Guadiana) with a high number of records (Fig. 1a). The sub-basins with the highest number of records are those of the River Tâmega (Douro) and the River Zêzere (Tagus) (Fig. 1b). The spatial acuity of the records depended on the information present in the historical source. Thus, we opted for a three-scale approach to maximise the collected information. This has resulted in 557 records limited to the basin scale, 184 reaching the sub-basin scale, and 1473 records identified down to the most accurate spatial scale, the river segment. The PHish database covers a time span of one millennium, from the 11th to the 20th century, with a larger number of records for the second half of the millennium and particularly for the 18th and 19th centuries (Fig. 1c).
The interpretation of historical data can be very subjective 2, and matching ancient fish common names with current taxonomy was challenging. To minimise uncertainty in the taxonomical classification of a fish record, a conservative approach was followed to establish the adequate taxonomical groups. Of the 25 group names defined from the records gathered, three stood out: Petromyzontidae, Chondrostoma sp. and Salmo trutta (Fig. 1d).
The information present in the database has been partially used in the work of Segurado, et al. 10, and also incorporated in a relevant European project, the European Fish Index-Plus (EFI+) (http://efi-plus.boku.ac.at/). This database can also be useful to: improve existing scientific knowledge in the Iberian context (e.g., Clavero and Hermoso 8, Clavero, et al. 11); expand scientific knowledge in the European context via an Iberian occurrence scenario of a species with a broad European distribution (e.g., Filipe, et al. 12); and support research where historical interactions between human activities and riverine fish communities and populations are relevant.

Methods
These methods are expanded and updated versions of descriptions in our related work Segurado, et al. 10 .

Historical sources
The present compilation of historical records of riverine fish distribution was based on geographical dictionaries and other published information for Portugal, dated between the 11th and early 20th centuries. Portugal is the most south-western part of the European continent, representing 15% of the Iberian Peninsula. There are four major international rivers (Douro, Guadiana, Minho, and Tagus) and numerous other relevant national rivers. The available historical information on fish populations for this period was almost exclusively based on qualitative data of species occurrence. Available sources dated before the 16th century included charters, inquiries, donations, and monastic chronicles. From the 16th century onwards, more thorough recordings of the patrimony of the Portuguese kingdom were available, with the emergence of chorographies, historical-geographical memos, parish inquiries and dictionaries that recorded the Portuguese landscape historically and geographically. In addition to these sources, information from private libraries was also included. A total of 194 documents were consulted (Table 1 (available online only)). These historical sources contain information ranging from aspects of the Portuguese physical territory to records about the natural resources of rivers and the cultural context of fisheries exploitation. Most of these data were compiled in the context of the EU projects EFI+ (http://efi-plus.boku.ac.at/) and DURERO (Douro River Basin: Water Resources, Water Accounts and Target Sustainability Indices; http://138.100.137.130/durero_project_2014/), with the main purpose of providing data on reference conditions to compute biotic indicators based on diadromous species. Many regions of Europe have been shaped for centuries by human activities, leading to an absence of natural reference conditions for many water body types 13. In such cases, establishing reference conditions depends on the availability of historical sources of information on species occurrence 14.
This is especially relevant in the context of the Water Framework Directive of the European Union (WFD) 15 , which involves the assessment of the ecological quality of water bodies using the reference condition approach, in which quality classes are defined according to deviations from benchmark conditions.

Taxonomical precision
Taxonomic acuity is critical to provide the best possible taxonomic insights from historical records and to derive reliable databases to be used as sources of information to test scientific or management hypotheses. However, this condition is rather challenging to attain when looking at large spatial scales. Indeed, in historical texts, the norm is to use local common names, mostly because many of the records predate the scientific description of the species. Therefore, the first step to produce this database was to establish a reference list by collecting and compiling ancient and current common names with their correspondence to scientific nomenclature. The second step was to attribute a valid species to each record, with an extra challenge when distinct common names are attributed to the same species in different regions. However, the most challenging issues arise when several species share common names in certain regions or when very similar and even congeneric species are sympatric. Despite these caveats, because of the known present distribution of species, the reduced sympatry of similar species and the fact that most shared common names belong to similar allopatric species, in most cases it is possible to attribute a valid scientific identity to a record reliably. Whenever this attribution was impossible or uncertain, the genus, family or order was attributed to the record, instead of the specific epithet. This was the case for the genera Alosa, Luciobarbus, Lampetra, Salmo and Squalius, for the families Petromyzontidae and Mugilidae, and for the order Pleuronectiformes. For the nases, it was decided to use an older genus name (Chondrostoma, valid in Europe but with no current taxonomic validity in the Iberian Peninsula) that currently represents seven species in this database.
This older genus aggregates three recently described genera (Achondrostoma, Iberochondrostoma and Pseudochondrostoma) 16 that are sympatric at the basin level, with at least two of these genera coexisting in each basin with historical records. For Pleuronectiformes, the decision was made because it was unclear whether the species record corresponded to a freshwater species (Platichthys flesus) or to a marine fish, of which there are several species. A conservative approach was followed and whenever a possibility for misinterpretation existed, the species were aggregated to the corresponding upper taxonomic level under the column "Group Name" (the information we recommend using, as it carries no uncertainty). Whenever there were plausible reasons to believe that the record belonged to a given species, the full scientific binomial nomenclature was attributed to the column "Sub-group Name". This has some associated uncertainty as the decision was made by expert judgement based on the available information. Whenever no plausibility existed, the higher taxonomic group (genus or family) was maintained without the attribution of a "Sub-group Name". If there was a possibility of confusion between species that did not fit the higher taxonomical groups defined, NA was attributed to "Group Name". If plausible, an educated guess of a species or a taxonomical group was entered in the "Sub-group Name", based on the interpretation of the historical text extract. All the species and species groups considered are detailed in Table 2. To add value to the database, whenever available, information about the phenology and conservation status (national and international) was included.
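The two-column rule described above can be summarised as a small decision function. The following is only a minimal sketch: the function name, the lookup table and its group labels are hypothetical and cover a handful of taxa for illustration (they are not Table 2), and the "plausible" argument stands in for the expert judgement that the authors applied manually.

```python
from typing import Dict, Iterable, Optional

# Illustrative subset only (assumption): maps a candidate species to the
# conservative group it would fall under. Not the content of Table 2.
SPECIES_TO_GROUP: Dict[str, str] = {
    "Alosa alosa": "Alosa",
    "Alosa fallax": "Alosa",
    "Petromyzon marinus": "Petromyzontidae",
}


def classify(candidates: Iterable[str], plausible: Optional[str] = None) -> Dict[str, Optional[str]]:
    """Return the 'Group Name' and 'Sub-group Name' for one historical record.

    candidates: species a common name could refer to in that region.
    plausible: expert's educated guess, or None when no plausibility exists.
    """
    groups = {SPECIES_TO_GROUP.get(name) for name in candidates}
    # Only when every candidate falls in one known group is that group certain.
    group = next(iter(groups)) if len(groups) == 1 else None
    return {
        "Group Name": group if group is not None else "NA",
        "Sub-group Name": plausible,
    }
```

For instance, a common name that could mean either shad species yields the group "Alosa" with an empty sub-group, whereas candidates that do not share a defined group yield "NA" and keep only the expert's guess, if any.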

Georeferencing
To create a spatial representation of the historical data we used the Catchment Characterisation and Modelling-River and Catchment database v2.1 (CCM2) (http://data.europa.eu/89h/fe1878e8-7541-4c66-8453-afdae7469221). Advantages of this pan-European database are its hierarchical structure and its representation of a fully integrated system of rivers and drainage catchments 17. Using three spatial scales (basin, sub-basin and segment) allowed storing historical records with distinct spatial accuracy. Even though finer scales are more informative, historical data at a coarser scale remain relevant. For the basin scale, we used the identification code that CCM2 gives to each basin (WSO_ID) to link a historical record to this scale level. The same procedure was followed at the segment scale, using the ID code that CCM2 assigns to each segment (WSO1_ID). Since CCM2 does not have any identification or spatial representation of the sub-basins within each sea outlet basin, we used free software, the River Network Toolkit (RivTool), to create this information. This software (available at www.rivtoolkit.com) uses integrated data about river networks and landscape/environmental datasets to produce new or aggregated data via calculations that consider the directional, hierarchical network nature of rivers. The set of natural sub-basins of all sea outlet basins of the study area was created using the "sub-basin ID" function of RivTool. The descriptions found in the historical sources varied greatly in their geographical precision. Most presence records referred to a given river or stream within a restricted region, usually described as being near a given village, township or city. When the geographic location was available, the record was georeferenced in a Geographical Information System (GIS) using CCM2. These were the most spatially precise records, at the segment scale, where a segment corresponds to a river reach between two consecutive tributaries.
In some cases, region or town names were obsolete and further investigation was needed to clarify the current location and/or designation associated with the former name. Nevertheless, for some historical records, the former names had no known relation to current designations or were not precise enough to be linked to a river segment. In those cases, the record was attributed to a higher spatial scale (sub-basin or basin). Data entries that could only be related to a watercourse that is a major river or stream flowing to the Atlantic Ocean, coastal lagoons or estuaries were considered low-precision records and spatially defined at the basin scale. When the watercourse was identified as a tributary (i.e., a smaller river or stream not flowing to the ocean, a coastal lagoon or an estuary), the precision was considered higher and the record was spatially defined at the sub-basin scale.
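The assignment of a record to one of the three spatial scales can be sketched as a short decision rule. This is an illustration under stated assumptions, not the authors' procedure, which relied on manual GIS work: the function and its three boolean inputs are hypothetical, and the discard case reflects the handling of unidentifiable locations mentioned under Technical Validation.

```python
from enum import Enum
from typing import Optional


class Scale(Enum):
    BASIN = "basin"
    SUB_BASIN = "sub-basin"
    SEGMENT = "segment"


def assign_scale(river_identified: bool,
                 locality_resolved: bool,
                 flows_to_sea: bool) -> Optional[Scale]:
    """Pick the finest spatial scale a historical record can support.

    river_identified: the watercourse named in the source could be identified.
    locality_resolved: the place name could be linked to a river segment.
    flows_to_sea: the watercourse drains to the Atlantic Ocean, a coastal
        lagoon or an estuary (as opposed to being a tributary).
    Returns None when the record cannot be placed and would be discarded.
    """
    if not river_identified:
        return None
    if locality_resolved:
        return Scale.SEGMENT
    # Without a usable locality, a tributary still narrows the record to a
    # sub-basin, whereas a main river only narrows it to its basin.
    return Scale.BASIN if flows_to_sea else Scale.SUB_BASIN
```

A record mentioning only a tributary by name would therefore be stored at the sub-basin scale, while the same record with a resolvable village name would reach the segment scale.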

Data Records
A relational database structure (Fig. 2) was created in Microsoft Access® (available in the .accdb file format) to adequately organise and store the historical data collected with their spatial and temporal dependencies, and also to maintain their link to the historical sources (Table 1 (available online only)). The PHish database is publicly available at the Open Science Framework (Data Citation 1) and at the University of Lisbon, School of Agriculture repository, http://www.isa.ulisboa.pt/proj/PHish/. The database contains six tables: three related to spatial organisation, "Basins", "Sub-basins" and "Segments"; one with taxonomical identification, "Taxonomical Groups"; one establishing the details of historical sources, "Historical Documents"; and finally, one aggregating historical record information with the respective spatial, taxonomical and historical source information, "Historical Records" (Table 3 (available online only)). The latter table is the core of the relational database structure: it relates to all other tables (Fig. 2) and is where the resulting historical data are stored (Table 3 (available online only)).
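To make the six-table layout concrete, the sketch below rebuilds a comparable structure in SQLite rather than Microsoft Access. Only the table names, the CCM2 codes WSO_ID and WSO1_ID, and the "Group Name" and "Sub-group Name" fields come from the text; every other column name, the placement of fields within tables, and the demo rows are assumptions made for illustration and do not reproduce the actual .accdb schema.

```python
import sqlite3

# Hypothetical schema: table names follow the text, most columns are assumed.
SCHEMA = """
CREATE TABLE "Basins" (WSO_ID INTEGER PRIMARY KEY, basin_name TEXT);
CREATE TABLE "Sub-basins" (
    sub_basin_id INTEGER PRIMARY KEY,
    WSO_ID INTEGER NOT NULL REFERENCES "Basins"(WSO_ID));
CREATE TABLE "Segments" (
    WSO1_ID INTEGER PRIMARY KEY,
    sub_basin_id INTEGER NOT NULL REFERENCES "Sub-basins"(sub_basin_id));
CREATE TABLE "Taxonomical Groups" (
    taxon_id INTEGER PRIMARY KEY, "Group Name" TEXT, "Sub-group Name" TEXT);
CREATE TABLE "Historical Documents" (
    document_id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
CREATE TABLE "Historical Records" (
    record_id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES "Historical Documents"(document_id),
    taxon_id INTEGER NOT NULL REFERENCES "Taxonomical Groups"(taxon_id),
    WSO_ID INTEGER NOT NULL REFERENCES "Basins"(WSO_ID),
    sub_basin_id INTEGER REFERENCES "Sub-basins"(sub_basin_id),
    WSO1_ID INTEGER REFERENCES "Segments"(WSO1_ID));
"""


def build_demo() -> sqlite3.Connection:
    """Create an in-memory database with placeholder rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.execute('INSERT INTO "Basins" VALUES (1, ?)', ("placeholder basin",))
    conn.execute('INSERT INTO "Sub-basins" VALUES (10, 1)')
    conn.execute('INSERT INTO "Segments" VALUES (100, 10)')
    conn.execute('INSERT INTO "Taxonomical Groups" VALUES (1, ?, NULL)', ("Salmo trutta",))
    conn.execute('INSERT INTO "Historical Documents" VALUES (1, ?, 1758)', ("placeholder source",))
    # One record per spatial scale: basin only, sub-basin, and segment.
    conn.executemany(
        'INSERT INTO "Historical Records" VALUES (?, 1, 1, 1, ?, ?)',
        [(1, None, None), (2, 10, None), (3, 10, 100)],
    )
    return conn


def records_per_scale(conn: sqlite3.Connection) -> dict:
    """Count records by the finest spatial scale each one reaches."""
    query = """
        SELECT CASE
                 WHEN WSO1_ID IS NOT NULL THEN 'segment'
                 WHEN sub_basin_id IS NOT NULL THEN 'sub-basin'
                 ELSE 'basin'
               END AS scale,
               COUNT(*)
        FROM "Historical Records"
        GROUP BY scale
    """
    return dict(conn.execute(query).fetchall())
```

Keeping the coarser identifiers on every record and leaving the finer ones empty is one way to store records of mixed spatial precision in a single core table; the published database may organise this differently.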

Technical Validation
Interpretation of historical data can be very subjective and historical science is mostly inductive 2. To increase objectivity and guarantee a correct assessment of historical information it is necessary to perform a critical evaluation of sources 3, while comparing and combining multiple and independent sources and methods 18. Special care was taken to verify whether authors were merely replicating information from other sources, which would lead to duplicated results in the database. This was done not only while researchers were reading and surveying the historical documents and sources, but also by analysing, comparing and searching within the final set of historical records for similarities. For example, combined similarities in taxonomical groups and spatial references, or similarities in paragraphs, sentences or parts of sentences, were normally an indication that the author was just citing text from another document without acknowledging it explicitly. Whenever there was reasonable doubt about the originality of the information present in the historical source, or of the historical record, only the oldest one was included in the database.
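The duplicate screening described above was carried out by the researchers themselves; the sketch below only illustrates how a comparable check could be automated. It assumes a hypothetical record layout (id, year, group, place, text) and an arbitrary similarity threshold, and uses Python's standard difflib to compare text extracts of records that share taxonomical group and spatial reference.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Tuple


def likely_copies(records: List[Dict], threshold: float = 0.8) -> List[Tuple[int, int]]:
    """Flag pairs of records that may be unacknowledged copies.

    Each record is a dict with the (assumed) keys 'id', 'year', 'group',
    'place' and 'text'. Returns (kept_id, dropped_id) pairs, where the
    older record is the one kept.
    """
    flagged = []
    for a, b in combinations(records, 2):
        # Only compare records that agree on taxon and spatial reference.
        if a["group"] != b["group"] or a["place"] != b["place"]:
            continue
        ratio = SequenceMatcher(None, a["text"].lower(), b["text"].lower()).ratio()
        if ratio >= threshold:
            older, newer = sorted((a, b), key=lambda r: r["year"])
            flagged.append((older["id"], newer["id"]))
    return flagged
```

Flagged pairs would still need expert review before removal, since similar wording can also arise from independent observations of the same place.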
Despite the numerous hurdles, taxonomical identification of ancient species common names followed a conservative approach that guaranteed no uncertainty for the "Group Name" field. Concerning the "Sub-group Name" field, the integration of information between spatial location and taxonomical identification of a record, the reliable considerations based on literature and the knowledge of experienced ecologists assured low levels of uncertainty. Moreover, when there was reasonable doubt or lack of plausibility, no consideration was made.
Spatial information on record locations was primarily assessed in a GIS environment using three Portuguese official water management and administrative sources: 1) Rivers map (Shapefile from www.hidrografico.pt); 2) Administrative regions and municipalities map (Shapefile from www.dgterritorio.pt); and 3) Online orthophotomaps (WMS link from www.igeo.pt). When the city/council name or region in a record was not easily connected to the information available in maps, numerous municipality and parish websites were consulted, along with other websites from relevant local or regional associations, to help identify the more site-specific or outdated spatial references. The connection with the CCM2 database was performed only after this thorough process. Records for places or historical locations that were not geographically identifiable were conservatively handled, either by discarding them or by including them in upper spatial scales (sub-basin or basin scale) when the river name was objectively identified.

Usage Notes
Just like every database of historical records, the PHish database is neither definitive nor complete. All reasonable and possible updates will be made, although they depend on resources and future funding. Methods will be maintained to avoid usage biases and/or interpretation issues. Future surveys for historical data should focus on rectifying spatial and temporal data heterogeneity. Obtaining information for older periods (before the 16th century) and focusing on the coastal areas of Portugal should bridge the spatial and temporal knowledge disparity.
Despite our best efforts, and even considering future updates to this database, true species occurrence will inevitably be broader than what historians and chroniclers may have reported. Cultural filtering 2, accidental or intentional destruction of documents, doubtful sources of the historians/chroniclers and bias towards certain species 3 affect the spatial and temporal availability, completeness, and reliability of documentary records 2. In England, copyright law was limited to a special group of people until the 18th century; indeed, the availability of documents was still censored and tied to the rights of printers and publishers rather than to the property of authors 19. The concept of an author's intellectual property over their work only proliferated during the 19th century, particularly in culturally developed countries such as France and Great Britain 20. This means that at least until the 18th century, and in Portugal most likely until the 19th century, authors could quote other works without acknowledging them. Thus, data duplication is a possibility within this database, though we consider its probability very low given our cautious approach to this conundrum. Also, users must be aware that the PHish database is a presence-only compilation of historical records. Without careful systematic sampling, absence data are inevitably prone to a high degree of uncertainty, and to the best of our knowledge, no modern-day systematic survey of fish assemblages across the country was undertaken in Portugal until the end of the 20th century.
The lack of data for coastal areas, and more specifically for the southern coastal regions of Portugal, may result from several particular circumstances. Smaller basins are composed of smaller rivers and inevitably have fewer human settlements. Adding to this, southern Iberian rivers are Mediterranean-type freshwater ecosystems strongly shaped by autumn/winter flooding and summer drought events 21. This seasonal instability has implications for the structure of freshwater communities 22, meaning that these rivers probably tend to support less attractive fishing areas and species. The database is also temporally unbalanced, i.e., although it covers a vast time scale it does not represent a consistent time series, owing to the lack or reduced number of records for the first half of the millennium. Future updates to this database will probably not fully overcome this, as it is also the result of temporal filtering 3 of historical sources. Another relevant issue is the heterogeneity of the established taxonomical groups in the field recommended to users (Group Name). The conservative approach followed here avoids uncertainty in this classification, translating into correct and objective taxonomical information. However, some taxonomical groups (e.g., "Mugilidae" or "Chondrostoma") may be limiting when the objective is to perform species environmental niche modelling.
All the issues, biases and imbalances mentioned are normal for historical databases, and none of them hampers the usage of the database. To our knowledge, this database is the first public compilation of historical distribution data based on freshwater fish species in Portuguese rivers and South-western Europe. Nonetheless, users should keep in mind all these features and caveats whenever making any considerations or extrapolations based on this database. The PHish database is geographically limited since it is restricted to inland Portugal. However, the current database contains valuable information by covering data for the Portuguese areas, including sea outlets, of four Iberian international rivers and most major Portuguese watercourses. Moreover, because Portuguese data were not compiled and made available until now, researchers have been using only Spanish records to study fish species distribution for the whole Iberian Peninsula, minimising its importance as a meaningful biogeographical entity 23. By using only Spanish data, authors accept uncertain premises (e.g., Clavero and Villero 24) and/or extrapolate when predicting for the entire Iberian Peninsula (e.g., Clavero and Hermoso 8). Hence, this database will fill an important gap in current knowledge and contribute to the development of new studies covering the whole Iberian Peninsula without being hindered by political borders.