Background & Summary

The Qinghai-Tibetan Plateau (QTP), which is renowned as the “Roof of the World,” “Third Pole,” and “Water Tower of Asia,” is the world’s largest high-elevation plateau with an average elevation of over 4,500 m and covering a 2,500,000 km2 area1,2. This region possesses about 46,000 glaciers and develops the major large rivers of Asia, such as the Yellow, Yangtze, Lancang-Mekong, Nu-Salween, Dulong-Irrawaddy, Yarlung Tsangpo-Brahmaputra, Ganges, and Indus rivers, providing water to over 20% of the global population1,2. The QTP uplift since the early Cenozoic has dramatically altered the Earth’s environment (e.g., separates the westerlies and forms Indian and Asian monsoons) and biodiversity distribution by engaging complex geologic, atmospheric, and hydrologic processes3,4,5,6. There are three major biodiversity hotspots with high species richness and many rare and endemic species surrounding the QTP7. These biota parallelly evolved with the environmental change caused by the QTP uplift and thus are a valuable source for uncovering the geological history, paleoclimatic environment, and mechanisms of functional adaptation to environmental change8,9,10,11. In addition, because the QTP is among the most sensitive areas to recent climate change, these organisms are also ideal for studying relevant biological responses, providing a scientific basis for predicting and mitigating the effects of climate change12. Among them, schizothoracine fish (Cyprinidae: Schizothoracinae) are the most representative taxon in aquatic ecosystems13,14,15,16.

The Schizothoracinae fish are widely distributed in rivers and lakes in the QTP and its surrounding areas17. They are the only natural group of Cyprinidae fish adapted to the extreme environmental conditions of the QTP. Currently, a total of over 100 Schizothoracinae species or subspecies belonging to 12 genera have been recorded17,18. These species greatly support regional fish diversity and wild fisheries and are important to maintaining the structure and function of relevant ecosystems17. Phylogenetically, these species were diverged from the primitive Barbinae fish through accumulating genetic and morphological traits adapted to environmental changes in response to QTP uplift14. Unlike terrestrial organisms that can rapidly disperse over long distances and in multiple directions on land, freshwater fish are strictly constrained in drainage systems, restricting their gene flow, and thus promoting local diversification and speciation. Therefore, genetic differences between different species and populations of schizothoracine fish have been suggested as suitable biological evidence for inferring the geologic history of the QTP uplifts and large river formations14,19,20. For example, the fossils of Schizothoracinae fish have been used to estimate the paleo-elevation of the QTP, and the results indicate that there have been large spatial and temporal differences in the uplift since the Oligocene8,21. Based on the degrees of morphological specialization (e.g., scales, pharyngeal teeth, and barbs) and distribution of modern Schizothoracinae fish, Cao and colleagues argued that the three evolutionary stages of them are closely related to the uplift processes of the QTP14. Accordingly, the Schizothoracinae fish can be grouped into three grades, including the primitive grade, the specialized grade, and the highly specialized grade. A molecular phylogeny of 24 Schizothoracinae species estimated that the average altitude of the QTP in the late Miocene should be between 2,750 and 3,750 m, providing a different perspective than sediment records20. In addition, studies have also shown that these fish are sensitive in response to recent climate change through changing growth and reproductive phenology16,22. The Schizothoracinae is one of the most threatened subfamilies in China, with 55% of species under threat23.

Species occurrences and functional trait information are fundamental to understanding biodiversity distribution patterns, predicting biological responses to environmental change, and promoting relevant conservation and management. This is because the functional trait composition and diversity of a community can reflect the characteristics and changes in the environment24,25. However, a dataset including such information for Schizothoracinae fish remains unavailable. Currently, their occurrence records and functional trait information are scattered in a wide range of sources (e.g., books, journal articles, master theses, doctoral dissertations, and online databases). The relevant knowledge held by most researchers and managers is outdated and mostly comes from surveys and published literature from the last century14,18,26. In addition, most of these data sources were written in Chinese, which poses a language barrier to interested non-Chinese researchers27. Although there are some large-scale databases (e.g., FishBase [https://www.fishbase.se/], Eschmeyer’s Catalog of Fishes [https://www.calacademy.org/scientists/projects/eschmeyers-catalog-of-fishes], and Global Biodiversity Information Facility [GBIF, https://www.gbif.org/]) related to freshwater fishes, Schizothoracinae fish are not targeted for consideration, and the included species and functional trait data are far from delicate and comprehensive. For example, a total of 78 Schizothoracinae species or subspecies were included in the most comprehensive global database of freshwater fish species occurrence at the drainage basin scale, without precise geographic coordinates or sampling time information28. The global database CESTES for metacommunity ecology, which integrates species, traits, environment, and space, does not include freshwater fish in Asia29. A Schizothoracinae-targeted database compiled the transcriptome data of 14 endemic species, but without precise sampling locations or functional trait information30. Therefore, it is urgently necessary to build a dataset containing species occurrences and functional traits of Schizothoracinae fish, given that the QTP has experienced more profound climate change and increasing anthropogenic disturbances31,32.

In this study, we introduce the SchiSOFT33 (Schizothoracinae fish Species Occurrences and Functional Traits) dataset, which compiled and curated data from our long-term survey records, possible online databases (e.g., FishBase and GBIF), and systematically searched literature (Fig. 1). The literature covers both that written in Chinese and English, and the publication date spans from 1842 to 2022. Details such as sampling locations, geographic coordinates, sampling dates, and functional traits (e.g., maximum body length, scale coverage, and pharyngeal teeth) were gathered, collated, and verified. The SchiSOFT33 presents the most comprehensive dataset of Schizothoracinae fish, including all 125 species or subspecies from the 12 genera, 7,333 occurrence records, and 3,204 records of 32 functional traits. Sampling records spanned over 180 years (1840s–2020 s). This dataset enables researchers and managers to quickly acquire specific information (e.g., distribution range, functional traits) about Schizothoracinae fish through querying corresponding fields such as scientific names, genus names. Thus, it can promote research, conservation, and management of Schizothoracinae fish diversity and resources and further ensure the goods and services they provide for both natural ecosystems and human society. It is also accessible to the public and can be used for educational activities, contributing to a deeper public understanding and awareness of the conservation of Schizothoracinae fish.

Fig. 1
figure 1

The workflow to compile the SchiSOFT dataset on species occurrences and functional traits of Schizothoracinae fish.

Methods

Information sources

The occurrence records and functional traits were primarily extracted from the following four sources: (1) published literature (e.g., journal articles, master theses, doctoral dissertations, and conference papers); (2) books (e.g., key books and ichthyographies); (3) online databases (e.g., FishBase, GBIF and National Specimen Information Infrastructure [NSII, http://www.nsii.org.cn/]); (4) field surveys over decades conducted by the authors’ research groups. We conducted a systematic literature search in multiple databases, such as the Web of Science (WoS, https://www.webofscience.com/), Scopus (https://www.scopus.com/), and the Chinese National Knowledge Infrastructure (CNKI, https://www.cnki.net/). We also searched for books and online databases (e.g., FishBase and GBIF). The search was initially conducted in October 2020 and updated in October 2022.

Our search queries were based on the names of the target fish (e.g., scientific genus names and common names). Data from the WoS and Scopus, based on titles, abstracts, and keywords, were searched using the following English search phrase: (Schizothoracinae OR schizothoracine OR Schizopygopsinae OR Aspiorhynchus OR Chuanchia OR Diptychus OR Herzensteinia OR Gymnocypris OR Oxygymnocypris OR Platypharodon OR Schizothorax OR Schizopygopsis OR Schizocypris OR Racoma OR Schizothoraichthys OR Ptychobarbus OR Oreinus OR schizothoracin OR snowtrout OR marinka OR “naked carp”). The Chinese search phrase was generally the same as the English version, searched from the CNKI, Wanfang Database (https://www.wanfangdata.com.cn/), and Weipu Database (https://qikan.cqvip.com/). After removing 8,415 duplicates through fuzzy title matching using the restricted Damerau-Levenshtein distance similarity34, a total of 18,886 references were retained. We then screened the titles, keywords, and abstracts of the documents returned by the search and excluded records with explicit reasons as follows: (1) reviews without sampling data; (2) river basin scale or regional aquatic field surveys with no records of schizothoracine fish occurrence data; (3) research articles that do not include field sampling or only used environment DNA methods; (4) studies with schizothoracine fish but were not identified to species level; (5) captive-bred schizothoracine fish without field sampling information. Primary research articles mentioned in review papers that potentially contain relevant data were also included to complement our reference pool (Fig. 1). The searching, screening, and filtering strictly followed the workflow of PRISMA35 (Preferred Reporting Items for Systematic Reviews and Meta-Analysis). Finally, we obtained data from 706 pieces of published literature, 28 books in English and Chinese, and seven online databases, including FishBase, NSII, GBIF, the European Nucleotide Archive (ENA, https://www.ebi.ac.uk/ena/), the National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov/), iNaturalist (https://www.inaturalist.org/), IUCN Red List of Threatened Species (IUCN, https://www.iucnredlist.org/). Schizothoracinae species occurrences and photos for functional trait measurement collected during field surveys conducted by our research groups spanned fourteen years33 (i.e., 2008–2021). These data are clearly noted in the spreadsheet at figshare33.

Data extraction

We extracted occurrence information for each species, including scientific names, georeferenced locations, and sampling times (Table 1), from the text, tables, figures, and supporting information from all the sources. To extract data from maps and other types of figures, we used the WebPlotDigitizer36 (Version 4.4). The occurrence records were cleaned to remove outliers, for example, those records with high spatial uncertainty, using the R package ‘CoordinateCleaner37. Occurrence coordinates were recorded in decimal degrees (see section below ‘Technical Validation’).

Table 1 Descriptions of the fields used in the SchiSOFT dataset.

The functional traits of schizothoracine fish are essential for understanding their evolutionary adaptation and responses to historical and modern environmental changes in the QTP. Our dataset encompasses 32 functional traits, which can be grouped into five categories: multi-functional (5 traits; e.g., maximum body length), trophic (16 traits; e.g., feeding habits and oral gape position), locomotion (6 traits; e.g., body elongation), life history (2 traits; e.g., fecundity), and habitat utilization (3 traits; e.g., habitat substrate) (Table 2). Maximum body length and maximum body weight data were mainly taken from FishBase and supplemented from books. Twelve commonly used ratio traits (continuous data; e.g., relative eye size and caudal fin aspect ratio) in evaluating the morphological diversity of freshwater fish26,38,39,40,41 were measured from specimen photos or images (i.e., scientific drawings of fish lateral views) with the assistance of ImageJ software (http://rsb.info.nih.gov/ij/index.html). The rest of the trophic traits, life history traits, and habitat utilization traits, which are mostly categorical, were extracted from text descriptions in books (e.g., Fauna Sinica17, The Fishes of the Qinghai-Xizang Plateau26, and The Fishes of the Hengduan Mountains Region42) and taxonomic research articles.

Table 2 Descriptions of the functional traits in the SchiSOFT dataset.

Species, taxonomy, and status

The scientific names of all Schizothoracinae fish included in the dataset have all been thoroughly checked for typing errors and misspellings. To avoid including invalid species and synonyms, we verified the validity of each species according to FishBase, using R package ‘rfishbase43. For the species or subspecies that was not matched, it would be searched again in Eschmeyer’s Catalog of Fishes. For the subspecies that were designated as subspecies in Fauna Sinica17, but were not identified in FishBase and Eschmeyer’s Catalog of Fishes, they were listed as valid subspecies in our dataset. The final standardized species list has 125 valid species (i.e., 98 species and 27 subspecies) (see section below ‘Technical Validation’).

Data Records

Our final dataset33 has been deposited at figshare and can be downloaded from https://doi.org/10.6084/m9.figshare.24638538.v1. It includes a total of 7,333 occurrence records, and the total number of functional trait records is 3,204 (Fig. 2, Table 2). Among them, 3,876, 844, and 673 occurrence records were from 706 publications, 28 books, and seven online databases, respectively. The remaining 1,940 records were from field surveys conducted by our research groups over fourteen years (2008–2021). And the functional trait mean data completeness was 80.1%. Among them, 1,424 records were extracted from images, 109 records were extracted from online databases, and 1,671 records were obtained from text descriptions of published documents.

Fig. 2
figure 2

Occurrence points of the Schizothoracinae fish by specialized grades at the global scale in the SchiSOFT dataset. The lines in blue show the main river around the Qinghai-Tibetan Plateau (QTP), and the lines in red show the boundary of the Pan-Tibetan Highlands (PTH). The 20 regions are referred to as Level 3 in HydroBASINS (https://www.hydrosheds.org/products/hydrobasins). 1: Tarim, 2: Turfan, 3: Dzungaria, 4: Hexi, 5: Qaidan, 6: Yellow River, 7: Yangtze River, 8: Xi Jiang, 9: Lancang-Mekong, 10: Salween-Irrawaddy, 11: Hoh xil, 12: Inner Tibetan Plateau, 13: Brahmaputra-Ganges, 14: Indus, 15: Helmand-Sistan, 16: Kavir and Lut Deserts, 17: Amu Darya WEST, 18: Amu Darya EAST, 19: Syr Darya, 20: Balkash-Alakul.

Occurrences and functional traits were recorded according to uniform standards. The dataset33 was organized into three CSV-format files. (A), “SchiSOFT_species_checklist_&_image_sources.csv”, includes genus name, specialized grade, scientific name, taxonomic status, IUCN Red List extinction risk, image type, source type, references in English, and URL. (B), “SchiSOFT_occurrence_records.csv”, includes the genus name, scientific name, the original species name in the sources, taxonomic status, the sampled date, and remarks of the sampled date, decimal latitude and latitude, the source language, source type, references in English, references in Chinese, DOI or ISBN code. If there was no sampled date recorded in the sources, we recorded the received date, accepted date, and published date of the source document as a substitute, if available. The citations of Chinese publications had also been translated into English, recorded in ‘referenceInChinese’ and ‘referenceInEnglish’, respectively. (C), “SchiSOFT_functional_trait_records.csv”, includes the scientific name and 32 functional traits records (Table 2).

(A), “SchiSOFT_species_checklist_&_image_sources.csv”, contains the checklist of 125 species or subspecies of schizothoracine fish, including fields: ‘genus’, ‘specializedGrade’, ‘scientificName’, ‘taxonomicStatus’, ‘IUCNcategory’, ‘imageType’, ‘sourceType’, ‘referenceInEnglish’, and ‘URL’ (Table 1).

(B), “SchiSOFT_occurrence_records.csv”, contains the species list and geographic coordinates, including 15 fields: ‘occurrenceID’, ‘genus’, ‘scientificName’, ‘originalNameInSources’, ‘taxonomicStatus’, ‘sampledDate’, ‘longitudeX’, ‘latitudeY’, ‘remarksOfSampledDate’, ‘sourceLanguage’, ‘sourceType’, ‘referenceInEnglish’, ‘referenceInChinese’, ‘URL’, and ‘DOIorISBN’ (Table 1).

(C), “SchiSOFT_functional_trait_records.csv”, contains functional trait data of Schizothoracinae fish, including 34 fields: ‘genus’, ‘scientificName’, ‘lateralLine’, ‘scaleCoverage’, ‘maximumBodyLength’, ‘maximumBodyWeight’, ‘verticalEyePosition’, ‘dorsalRay’, ‘feedingHabits’, ‘lowerJawHorny’, ‘lowerTipPapillne’, ‘mouthPosition’, ‘mouthShape’, ‘pharyngealTeethFormation’, ‘postlabialGroove’, ‘eyeSize’, ‘intestineLength’, ‘maxillaBarbLength’, ‘maxillaryLength’, ‘oralGapePosition’, ‘rectalBarbLength’, ‘barbPairs’, ‘pharyngealTeethRows’, ‘bodyElongation’, ‘bodyLateralShape’, ‘caudalFinAspectRatio’, ‘caudalPeduncleThrottling’, ‘pectoralFinPosition’, ‘pectoralFinSize’, ‘startSpawningSeason’, ‘fecundity’, ‘substrate’, ‘habitatFlow’, and ‘waterbody’ (Table 2).

Technical Validation

Taxonomic and status validation

Each original species name was compared to the list of valid species names in FishBase, Eschmeyer’s Catalog of Fishes, Fauna Sinica17, The Fishes of the Qinghai-Xizang Plateau26 or Xinjiang Ichthyology44 to ensure the identification validity provided by the information source. In the column ‘originalNameInSources’, the original scientific names in sources were recorded fully for checks and verifications. The R package ‘rfishbase43 was used to do a batch search and matching for species names. For the species or subspecies that was not matched, it would be searched again in Eschmeyer’s Catalog of Fishes. There were 125 valid species names (98 species and 27 subspecies), including the subspecies designated as subspecies in Fauna Sinica17 but identified as species in FishBase or Eschmeyer’s Catalog of Fishes. In addition, our dataset removed misidentified species.

Species distribution validation

Sampling points with accurate latitude and longitude in all sources were recorded directly in decimal degrees. To extract data from the sampling maps, we used the WebPlotDigitizer36 (Version 4.4). The coordinates of occurrence records with exact sampling point descriptions were located using Google Earth (https://earth.google.com/). For occurrence data recorded at a coarse spatial resolution (e.g., villages, towns, and even counties with relatively extensive coverage), their coordinates were determined by combining location names and the sampled rivers or streams. The native distribution range of the species and transportation accessibility were also used as supporting information. Generally, these points were within a 10-kilometer radius from the area centre. In cases where the area was too large with a complex river network, we discarded those occurrence records directly. Unclear sampling ranges described in the text were eliminated. For example, only river basins or sub-basins were described without information on administrative boundaries; sub-basins were reported with administrative boundaries placed at the provincial or city level. All sampling points were fixed to the river network based on the Hydrography90m45 using the ‘NEAR’ function in the software ArcGIS (Version 10.4). The occurrence records were cleaned to remove outliers and records with high spatial uncertainty using the R package ‘CoordinateCleaner37, the cleared geographic coordinates had been rechecked manually.

Then, the geographic coordinates of the occurrence points were checked and validated with the river name or the administrative district that was described in the sources, such as the county, town, and village names. In the event of a mismatch, the coordinates were removed from the dataset after double-checking. Records from the online databases, books, master theses, doctoral dissertations, and articles may share field sampling; in this case, duplicated occurrence records were removed. Finally, the distribution basins of 125 species or subspecies were checked and reviewed to eliminate non-natural distributions induced by religious release activities or artificial enrichment releases. For the endangered species or threatened species in our dataset33, we have kept geographic coordinates rounded to 0.1 degree of the latitude and longitude46. The IUCN Red List status of these species includes Critically Endangered (CR), Endangered (EN), Extinct in the Wild (EW), and Extinct (EX). Detailed data of these species can be supplied to researchers on request. All the occurrence records were given an ID number, and the ‘occurrenceID’ is unique and can be checked in “SchiSOFT_occurrence_records.csv”33.

Functional traits data validation

For the functional traits, which were taken from books and published literature, each species of schizothoracine fish gathered from two or more sources as much as possible to avoid incorrect data. We compared and checked the data for the same species recorded from text descriptions in different sources to see if there were differences or deviations. Where differences or deviations existed, we added as many sources as possible about the species, and the same or similar descriptions in most of the literature were adopted. If the recorded data from the documents was an interval value, the median value was recorded in our dataset33. The maximum body length and maximum body weight data were mainly taken from FishBase and supplemented by books and published literature. In the case where there was only total length data without body length data, we used the ratio of body length to total length extracted from the images to calculate the body length.

The most ratio traits were measured from specimen images (i.e., photos or scientific drawings of fish lateral views). Scientific drawings were primarily sourced from Fauna Sinica17, The Fishes of the Qinghai-Xizang Plateau26 and Xinjiang Ichthyology44, and specimen photos were mostly downloaded from FishBase, supplemented by journal articles and museum photos. When several images with a lateral view were available for a species, measurements were taken on the one with the best quality. Photos of museum specimens were also used only if they could provide a morphological representation of the fish species. The quality of the photos did not allow for the measurement of all morphological traits of all species due to improper body positioning and specimen distortion. All those doubtful measurements were scrapped and recorded as ‘NA’. The sources of photos and scientific drawings were recorded in our dataset33, “SchiSOFT_species_checklist_&_image_sources.csv”.

Usage Notes

Based on published literature, books, online databases, and field surveys, we collected a full species and image sources list, occurrence data, and detailed functional traits data for Schizothoracinae fish. The dataset33 is obviously not complete and conclusive, and we aim to support the dataset with regular updates, ideally with biannual or triennial steps, depending on the available resources. Three main factors will be considered in future updates: (1) new or previously unavailable data sources (e.g., investigated reports of Sichuan and Qinghai Provinces) with species lists or records for additional drainage basins or drainage basins already present in the dataset; (2) the distribution of newly described species; and (3) nomenclature changes in the taxonomic classification. This collection not only offers high-resolution occurrence data but also presents intricate details regarding the functional traits of Schizothoracinae fish.

The dataset limitations are manifested in: (1) the distribution of Schizothoracinae fish in the high-elevation, low-oxygen areas of the QTP and its surroundings, as acquiring specimens is challenging due to accessibility limits and sampling bias; (2) Schizothoracinae-related documents written in languages other than English and Chinese are not included in this dataset, and the grey documents are also not included.

Our dataset serves multiple purposes, making it invaluable for various scientific inquiries. First and foremost, it provides a solid foundation for historical biogeographical research centered on Schizothoracinae fish, particularly in the context of the QTP uplift. Furthermore, the occurrence data can be harnessed to predict shifts in species distribution by employing ecological niche models or species distribution models to improve the future protected area management paired with high-resolution geographic or climatic data. The functional traits data play an important role in delving even deeper into the distribution dynamics of Schizothoracinae and the underlying factors contributing to these variations.