A global dataset on species occurrences and functional traits of Schizothoracinae fish

The Schizothoracinae fish are a natural group of cyprinids widely distributed in rivers and lakes of the Qinghai-Tibetan Plateau (QTP) and adjacent regions. These fish evolved in parallel with the QTP uplift and are thus important for uncovering geological history, the paleoclimatic environment, and the mechanisms of functional adaptation to environmental change. However, a dataset including species occurrences and functional traits, which are essential for resolving the above issues and guiding relevant conservation, remains unavailable. To fill this gap, we systematically compiled a comprehensive dataset on species occurrences and functional traits of Schizothoracinae fish from our long-term field sampling and various other sources (e.g., publications and online databases). The dataset includes 7,333 occurrence records and 3,204 records of 32 functional traits covering all the genera and species of Schizothoracinae fish (i.e., 12 genera and 125 species or subspecies). Sampling records span over 180 years. This dataset will serve as a valuable resource for future research on the evolution, historical biogeography, responses to environmental change, and conservation of the Schizothoracinae fish.

changes in response to QTP uplift 14. Unlike terrestrial organisms, which can rapidly disperse over long distances and in multiple directions on land, freshwater fish are strictly confined to drainage systems, which restricts their gene flow and thus promotes local diversification and speciation. Therefore, genetic differences between species and populations of schizothoracine fish have been suggested as suitable biological evidence for inferring the geologic history of the QTP uplift and the formation of large rivers 14,19,20. For example, fossils of Schizothoracinae fish have been used to estimate the paleo-elevation of the QTP, and the results indicate large spatial and temporal differences in the uplift since the Oligocene 8,21. Based on the degree of morphological specialization (e.g., scales, pharyngeal teeth, and barbels) and the distribution of modern Schizothoracinae fish, Cao and colleagues argued that their three evolutionary stages are closely related to the uplift processes of the QTP 14. Accordingly, the Schizothoracinae fish can be grouped into three grades: the primitive grade, the specialized grade, and the highly specialized grade. A molecular phylogeny of 24 Schizothoracinae species estimated that the average altitude of the QTP in the late Miocene was between 2,750 and 3,750 m, providing a different perspective from that of sediment records 20. In addition, studies have shown that these fish respond sensitively to recent climate change by shifting their growth and reproductive phenology 16,22. The Schizothoracinae is one of the most threatened subfamilies in China, with 55% of species under threat 23.
Species occurrences and functional trait information are fundamental to understanding biodiversity distribution patterns, predicting biological responses to environmental change, and promoting relevant conservation and management. This is because the functional trait composition and diversity of a community can reflect the characteristics of, and changes in, the environment 24,25. However, a dataset including such information for Schizothoracinae fish remains unavailable. Currently, their occurrence records and functional trait information are scattered across a wide range of sources (e.g., books, journal articles, master's theses, doctoral dissertations, and online databases). The relevant knowledge held by most researchers and managers is outdated and mostly comes from surveys and literature published in the last century 14,18,26. In addition, most of these data sources were written in Chinese, which poses a language barrier to interested non-Chinese researchers 27. Although there are some large-scale databases related to freshwater fishes (e.g., FishBase [https://www.fishbase.se/], Eschmeyer's Catalog of Fishes [https://www.calacademy.org/scientists/projects/eschmeyers-catalog-of-fishes], and Global Biodiversity Information Facility [GBIF, https://www.gbif.org/]), none of them specifically targets Schizothoracinae fish, and the included species and functional trait data are far from detailed or comprehensive. For example, a total of 78 Schizothoracinae species or subspecies were included in the most comprehensive global database of freshwater fish species occurrence at the drainage basin scale, but without precise geographic coordinates or sampling time information 28. The global database CESTES for metacommunity ecology, which integrates species, traits, environment, and space, does not include freshwater fish in Asia 29. A Schizothoracinae-targeted database compiled the transcriptome data of 14 endemic species, but without precise sampling locations or functional trait information 30. Therefore, it is urgently necessary to build a dataset containing species occurrences and functional traits of Schizothoracinae fish, given that the QTP has experienced more profound climate change and increasing anthropogenic disturbances 31,32.
In this study, we introduce the SchiSOFT 33 (Schizothoracinae fish Species Occurrences and Functional Traits) dataset, which compiles and curates data from our long-term survey records, available online databases (e.g., FishBase and GBIF), and systematically searched literature (Fig. 1). The literature includes works written in both Chinese and English, with publication dates spanning 1842 to 2022. Details such as sampling locations, geographic coordinates, sampling dates, and functional traits (e.g., maximum body length, scale coverage, and pharyngeal teeth) were gathered, collated, and verified. SchiSOFT 33 is the most comprehensive dataset of Schizothoracinae fish, including all 125 species or subspecies from the 12 genera, 7,333 occurrence records, and 3,204 records of 32 functional traits. Sampling records span over 180 years (1840s-2020s). This dataset enables researchers and managers to quickly acquire specific information (e.g., distribution range, functional traits) about Schizothoracinae fish by querying the corresponding fields, such as scientific name or genus name. Thus, it can promote research, conservation, and management of Schizothoracinae fish diversity and resources and further ensure the goods and services they provide for both natural ecosystems and human society. It is also accessible to the public and can be used for educational activities, contributing to a deeper public understanding and awareness of the conservation of Schizothoracinae fish.
Our search queries were based on the names of the target fish (e.g., scientific genus names and common names). WoS and Scopus were searched by title, abstract, and keywords using the following English search phrase: (Schizothoracinae OR schizothoracine OR Schizopygopsinae OR Aspiorhynchus OR Chuanchia OR Diptychus OR Herzensteinia OR Gymnocypris OR Oxygymnocypris OR Platypharodon OR Schizothorax OR Schizopygopsis OR Schizocypris OR Racoma OR Schizothoraichthys OR Ptychobarbus OR Oreinus OR schizothoracin OR snowtrout OR marinka OR "naked carp"). The Chinese search phrase was generally the same as the English version and was used to search the CNKI, Wanfang Database (https://www.wanfangdata.com.cn/), and Weipu Database (https://qikan.cqvip.com/). After removing 8,415 duplicates through fuzzy title matching using the restricted Damerau-Levenshtein distance similarity 34, a total of 18,886 references were retained. We then screened the titles, keywords, and abstracts of the documents returned by the search and excluded records for the following explicit reasons: (1) reviews without sampling data; (2) river basin scale or regional aquatic field surveys with no records of schizothoracine fish occurrence data; (3) research articles that did not include field sampling or only used environmental DNA methods; (4) studies in which schizothoracine fish were not identified to species level; (5) studies of captive-bred schizothoracine fish without field sampling information. Primary research articles mentioned in review papers that potentially contained relevant data were also included to complement our reference pool (Fig. 1). The searching, screening, and filtering strictly followed the workflow of PRISMA 35 (Preferred Reporting Items for Systematic Reviews and Meta-Analyses). Finally, we obtained data from 706 pieces of published literature, 28 books in English and Chinese, and seven online databases: FishBase, NSII, GBIF, the European Nucleotide Archive (ENA, https://www.ebi.ac.uk/ena/), the National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov/), iNaturalist (https://www.inaturalist.org/), and the IUCN Red List of Threatened Species (IUCN, https://www.iucnredlist.org/). Schizothoracinae species occurrences and photos for functional trait measurement were also collected during field surveys conducted by our research groups over fourteen years 33 (i.e., 2008-2021). These data are clearly noted in the spreadsheet at figshare 33.
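The fuzzy title matching used for duplicate removal above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the normalisation by the longer title length, the 0.9 threshold, and the function names `osa_distance`, `similarity`, and `deduplicate` are assumptions introduced here. The restricted Damerau-Levenshtein distance (also called optimal string alignment) counts insertions, deletions, substitutions, and transpositions of adjacent characters, with no substring edited more than once.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; normalising by the longer title is an assumption."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - osa_distance(a, b) / longest


def deduplicate(titles, threshold=0.9):
    """Keep the first title of each group of near-identical titles.

    The threshold is hypothetical; the source does not state the cut-off used.
    """
    kept = []
    for title in titles:
        if all(similarity(title, k) < threshold for k in kept):
            kept.append(title)
    return kept
```

The pairwise comparison is quadratic in the number of titles, so for a reference pool of tens of thousands of records a blocking step (e.g., comparing only titles from the same publication year) would likely be needed in practice.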

Data extraction.
We extracted occurrence information for each species, including scientific names, georeferenced locations, and sampling times (Table 1), from the text, tables, figures, and supporting information of all the sources. To extract data from maps and other types of figures, we used WebPlotDigitizer 36 (Version 4.4). The occurrence records were cleaned to remove outliers, for example, records with high spatial uncertainty, using the R package 'CoordinateCleaner' 37. Occurrence coordinates were recorded in decimal degrees (see section below 'Technical Validation').
The functional traits of schizothoracine fish are essential for understanding their evolutionary adaptation and responses to historical and modern environmental changes in the QTP. Our dataset encompasses 32 functional traits, which can be grouped into five categories: multi-functional (5 traits; e.g., maximum body length), trophic (16 traits; e.g., feeding habits and oral gape position), locomotion (6 traits; e.g., body elongation), life history (2 traits; e.g., fecundity), and habitat utilization (3 traits; e.g., habitat substrate) (Table 2). Maximum body length and maximum body weight data were mainly taken from FishBase and supplemented from books. Twelve ratio traits (continuous data; e.g., relative eye size and caudal fin aspect ratio) commonly used in evaluating the morphological diversity of freshwater fish 26,38-41 were measured from specimen photos or images (i.e., scientific drawings of fish lateral views) with the assistance of ImageJ software (http://rsb.info.nih.gov/ij/index.html). The remaining trophic, life history, and habitat utilization traits, which are mostly categorical, were extracted from text descriptions in books (e.g., Fauna Sinica 17, The Fishes of the Qinghai-Xizang Plateau 26, and The Fishes of the Hengduan Mountains Region 42) and taxonomic research articles.

Species, taxonomy, and status.
The scientific names of all Schizothoracinae fish included in the dataset were thoroughly checked for typing errors and misspellings. To avoid including invalid species and synonyms, we verified the validity of each species against FishBase using the R package 'rfishbase' 43. Names of species or subspecies that could not be matched were then searched in Eschmeyer's Catalog of Fishes. Subspecies that were designated as such in Fauna Sinica 17 but were not identified in FishBase or Eschmeyer's Catalog of Fishes were listed as valid subspecies in our dataset. The final standardized species list has 125 valid species or subspecies (i.e., 98 species and 27 subspecies) (see section below 'Technical Validation').

Data records
Our final dataset 33 has been deposited at figshare and can be downloaded from https://doi.org/10.6084/m9.figshare.24638538.v1. It includes a total of 7,333 occurrence records and 3,204 functional trait records (Fig. 2, Table 2). Of the occurrence records, 3,876, 844, and 673 were from 706 publications, 28 books, and seven online databases, respectively. The remaining 1,940 records were from field surveys conducted by our research groups over fourteen years (2008-2021). The mean data completeness of the functional traits was 80.1%. Of the functional trait records, 1,424 were extracted from images, 109 were extracted from online databases, and 1,671 were obtained from text descriptions in published documents.
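The record counts above are internally consistent, which the short check below illustrates. It assumes that "mean data completeness" means the share of filled cells in a 125-taxon by 32-trait table; the source does not define the term, but this reading reproduces the reported 80.1%.

```python
# Counts reported in the text, grouped by origin.
occurrences_by_source = {
    "publications": 3876,
    "books": 844,
    "online databases": 673,
    "field surveys": 1940,
}
trait_records_by_origin = {
    "images": 1424,
    "online databases": 109,
    "text descriptions": 1671,
}
n_taxa, n_traits = 125, 32  # species or subspecies, functional traits

total_occurrences = sum(occurrences_by_source.values())
total_trait_records = sum(trait_records_by_origin.values())

# Assumed definition: filled cells divided by all taxon-by-trait cells.
completeness_percent = 100 * total_trait_records / (n_taxa * n_traits)

print(total_occurrences, total_trait_records, round(completeness_percent, 1))
```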
Occurrences and functional traits were recorded according to uniform standards. The dataset 33 was organized into three CSV-format files.
(A) "SchiSOFT_species_checklist_&_image_sources.csv" contains the checklist of 125 species or subspecies of schizothoracine fish and includes the genus name, specialized grade, scientific name, taxonomic status, IUCN Red List extinction risk, image type, source type, references in English, and URL, in the fields 'genus', 'specializedGrade', 'scientificName', 'taxonomicStatus', 'IUCNcategory', 'imageType', 'sourceType', 'referenceInEnglish', and 'URL' (Table 1).
(B) "SchiSOFT_occurrence_records.csv" includes the genus name, scientific name, the original species name in the sources, taxonomic status, the sampled date and remarks on the sampled date, decimal latitude and longitude, the source language, source type, references in English, references in Chinese, and DOI or ISBN code. If no sampled date was recorded in a source, we recorded the received date, accepted date, or published date of the source document as a substitute, if available. Citations of Chinese publications were also translated into English; the original and translated citations are recorded in 'referenceInChinese' and 'referenceInEnglish', respectively.
(C) "SchiSOFT_functional_trait_records.csv" includes the scientific name and the records of the 32 functional traits (Table 2).
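Querying the files by a field such as genus name can be sketched with Python's standard `csv` module. The sketch below runs on a small inline sample rather than the real file: the rows and coordinates are illustrative placeholders, not real records, and the column names 'decimalLatitude' and 'decimalLongitude' are assumptions, since the text names the latitude and longitude fields only descriptively.

```python
import csv
import io

# Illustrative placeholder rows, NOT real SchiSOFT records.
# 'decimalLatitude' and 'decimalLongitude' are assumed column names.
SAMPLE = """occurrenceID,genus,scientificName,sampledDate,decimalLatitude,decimalLongitude,sourceType
1,Gymnocypris,Gymnocypris przewalskii,2015,36.9,100.2,field surveys
2,Schizothorax,Schizothorax prenanti,1998,30.1,102.9,books
3,Gymnocypris,Gymnocypris eckloni,NA,35.0,98.5,published literature
"""


def load_records(handle):
    """Read a CSV file object into a list of dicts keyed by the header row."""
    return list(csv.DictReader(handle))


def filter_by(records, field, value):
    """Return the records whose field equals value, ignoring case and padding."""
    wanted = value.strip().lower()
    return [r for r in records if r[field].strip().lower() == wanted]


records = load_records(io.StringIO(SAMPLE))
gymnocypris = filter_by(records, "genus", "Gymnocypris")
for row in gymnocypris:
    print(row["scientificName"], row["decimalLatitude"], row["decimalLongitude"])
```

To run the same query on the downloaded file, the `io.StringIO(SAMPLE)` argument would be replaced by an open file handle, for example `open("SchiSOFT_occurrence_records.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is likewise an assumption.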
Technical Validation

Taxonomic and status validation.
Each original species name was compared to the list of valid species names in FishBase, Eschmeyer's Catalog of Fishes, Fauna Sinica 17, The Fishes of the Qinghai-Xizang Plateau 26, or Xinjiang Ichthyology 44 to verify the identification provided by the information source. In the column 'originalNameInSources', the original scientific names in the sources were recorded in full for checking and verification. The R package 'rfishbase' 43 was used for batch searching and matching of species names. Names of species or subspecies that could not be matched were then searched in Eschmeyer's Catalog of Fishes. There were 125 valid species names (98 species and 27 subspecies), including the subspecies designated as subspecies in Fauna Sinica 17 but identified as species in FishBase or Eschmeyer's Catalog of Fishes. In addition, misidentified species were removed from our dataset.
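The order of name checks described above (FishBase first, then Eschmeyer's Catalog of Fishes, then Fauna Sinica) can be sketched as a lookup with fallbacks. This is a schematic under stated assumptions: the authors used the R package 'rfishbase' for the first step, whereas the checklists here are toy sets of placeholder names ("Aus bus" and similar) that stand in for the real reference lists and say nothing about their actual contents. The function name `resolve_name` is hypothetical.

```python
# Toy stand-ins for the reference lists, in the order they are consulted.
# The placeholder names are NOT real taxa and NOT real database contents.
CHECKLISTS = [
    ("FishBase", {"Aus bus", "Aus cus"}),
    ("Eschmeyer's Catalog of Fishes", {"Aus dus"}),
    ("Fauna Sinica", {"Aus bus eus"}),
]


def resolve_name(name, checklists=CHECKLISTS):
    """Return (cleaned name, first checklist that contains it, or None)."""
    cleaned = " ".join(name.split())  # collapse stray whitespace
    for source, valid_names in checklists:
        if cleaned in valid_names:
            return cleaned, source
    return cleaned, None  # unmatched: flag for manual review


for raw in ["Aus  bus", "Aus dus", "Aus bus eus", "Aus fus"]:
    print(resolve_name(raw))
```

Names that come back with `None` correspond to records that would need manual checking, such as possible misspellings or misidentifications.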

Species distribution validation.
Sampling points with accurate latitude and longitude in the sources were recorded directly in decimal degrees. To extract data from the sampling maps, we used WebPlotDigitizer 36 (Version 4.4). The coordinates of occurrence records with exact sampling point descriptions were located using Google Earth (https://earth.google.com/). For occurrence data recorded at a coarse spatial resolution (e.g., villages, towns, and even counties with relatively extensive coverage), coordinates were determined by combining location names with the sampled rivers or streams. The native distribution range of the species and transportation accessibility were also used as supporting information. Generally, these points were within a 10-kilometer radius of the area centre. Where the area was too large and had a complex river network, we discarded the occurrence records. Records with unclear sampling ranges were also eliminated, for example, where only river basins or sub-basins were described without information on administrative boundaries, or where sub-basins were reported with administrative boundaries only at the provincial or city level. All sampling points were snapped to the river network of Hydrography90m 45 using the 'NEAR' function in ArcGIS (Version 10.4). The occurrence records were cleaned to remove outliers and records with high spatial uncertainty using the R package 'CoordinateCleaner' 37, and the cleaned geographic coordinates were then rechecked manually.
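The 10-kilometer criterion mentioned above can be checked numerically with a great-circle distance. The sketch below is an illustration only: the authors located points manually in Google Earth, and the haversine formula on a spherical Earth, the mean radius value, and the function names are assumptions introduced here.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def within_radius(point, centre, radius_km=10.0):
    """True if point (lat, lon) lies within radius_km of centre (lat, lon)."""
    return haversine_km(point[0], point[1], centre[0], centre[1]) <= radius_km


# Hypothetical town centre and two candidate sampling points.
centre = (30.0, 100.0)
print(within_radius((30.05, 100.0), centre))  # about 5.6 km away
print(within_radius((30.2, 100.0), centre))   # about 22 km away
```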
Then, the geographic coordinates of the occurrence points were checked and validated against the river name or the administrative district described in the sources, such as the county, town, and village names. In the event of a mismatch, the coordinates were removed from the dataset after double-checking. Records from the online databases, books, master's theses, doctoral dissertations, and articles may derive from the same field sampling; in such cases, duplicated occurrence records were removed. Finally, the distribution basins of the 125 species or subspecies were checked and reviewed to eliminate non-natural distributions caused by religious release activities or artificial stock enhancement releases. For the endangered or threatened species in our dataset 33, we provide geographic coordinates rounded to 0.1 degree of latitude and longitude 46. The IUCN Red List status of these species includes Critically Endangered (CR), Endangered (EN), Extinct in the Wild (EW), and Extinct (EX). Detailed data for these species can be supplied to researchers on request. All the occurrence records were given an ID number; the 'occurrenceID' is unique and can be checked in "SchiSOFT_occurrence_records.csv" 33.
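The coordinate generalisation for threatened species described above can be sketched as follows. The field name 'IUCNcategory' comes from the dataset description; 'decimalLatitude' and 'decimalLongitude' are assumed column names, the example record values are placeholders, and Python's built-in `round` may treat exact ties differently from whatever tool the authors used.

```python
# IUCN categories whose coordinates are generalised, as listed in the text.
SENSITIVE_CATEGORIES = {"CR", "EN", "EW", "EX"}


def generalise_coordinates(record, decimals=1):
    """Return a copy of record with coordinates rounded for sensitive species."""
    out = dict(record)
    if record["IUCNcategory"] in SENSITIVE_CATEGORIES:
        out["decimalLatitude"] = round(record["decimalLatitude"], decimals)
        out["decimalLongitude"] = round(record["decimalLongitude"], decimals)
    return out


# Placeholder records, not real data.
endangered = {"IUCNcategory": "EN", "decimalLatitude": 36.8764, "decimalLongitude": 100.2331}
least_concern = {"IUCNcategory": "LC", "decimalLatitude": 30.1234, "decimalLongitude": 102.9876}

print(generalise_coordinates(endangered))
print(generalise_coordinates(least_concern))
```

At these latitudes, 0.1 degree corresponds to roughly 9 to 11 km, so rounded points remain useful at the basin scale while obscuring exact sampling sites.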
Functional traits data validation.
For the functional traits taken from books and published literature, data for each species of schizothoracine fish were gathered from two or more sources wherever possible to avoid incorrect data. We compared the data recorded for the same species in text descriptions from different sources to check for differences or deviations. Where differences or deviations existed, we added as many sources as possible for the species and adopted the description that was the same or similar in most of the literature. If the data recorded in a document was an interval value, the median value was recorded in our dataset 33. The maximum body length and maximum body weight data were mainly taken from FishBase and supplemented by books and published literature. Where only total length data were available without body length data, we used the ratio of body length to total length extracted from the images to calculate the body length.
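The two numerical conventions above, recording the middle of a reported interval and converting total length to body length, amount to simple arithmetic. The sketch below uses hypothetical function names and made-up example values; it reads "median value" of an interval as its midpoint, which is the only value computable from two endpoints.

```python
def interval_midpoint(low, high):
    """Midpoint of a reported range, e.g. a fecundity or length interval."""
    if low > high:
        raise ValueError("low must not exceed high")
    return (low + high) / 2


def body_length_from_total(total_length, body_to_total_ratio):
    """Estimate body length from total length using an image-derived ratio."""
    if not 0 < body_to_total_ratio <= 1:
        raise ValueError("ratio of body length to total length must be in (0, 1]")
    return total_length * body_to_total_ratio


# Made-up example values: a 20-30 cm range and a ratio of 0.85 measured on an image.
print(interval_midpoint(20, 30))
print(body_length_from_total(40.0, 0.85))
```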
Most ratio traits were measured from specimen images (i.e., photos or scientific drawings of fish lateral views). Scientific drawings were primarily sourced from Fauna Sinica 17, The Fishes of the Qinghai-Xizang Plateau 26, and Xinjiang Ichthyology 44, and specimen photos were mostly downloaded from FishBase, supplemented by journal articles and museum photos. When several images with a lateral view were available for a species, measurements were taken on the one with the best quality. Photos of museum specimens were used only if they provided a reliable morphological representation of the fish species. The quality of the photos did not allow all morphological traits to be measured for all species, owing to improper body positioning and specimen distortion. All doubtful measurements were discarded and recorded as 'NA'. The sources of photos and scientific drawings were recorded in our dataset 33, in "SchiSOFT_species_checklist_&_image_sources.csv".

Usage Notes
Based on published literature, books, online databases, and field surveys, we compiled a full list of species and image sources, occurrence data, and detailed functional trait data for Schizothoracinae fish. The dataset 33 is not yet complete or conclusive, and we aim to support it with regular updates, ideally at biennial or triennial intervals, depending on the available resources. Three main factors will be considered in future updates: (1) new or previously unavailable data sources (e.g., survey reports of Sichuan and Qinghai Provinces) with species lists or records for additional drainage basins or drainage basins already present in the dataset; (2) the distribution of newly described species; and (3) nomenclature changes in the taxonomic classification. This collection not only offers high-resolution occurrence data but also presents detailed information on the functional traits of Schizothoracinae fish.
The dataset has two main limitations: (1) occurrence records are subject to sampling bias, because Schizothoracinae fish inhabit the high-elevation, low-oxygen areas of the QTP and its surroundings, where limited accessibility makes acquiring specimens challenging; (2) Schizothoracinae-related documents written in languages other than English and Chinese are not included in this dataset, nor is grey literature.
Our dataset serves multiple purposes and can support a range of scientific inquiries. First, it provides a solid foundation for historical biogeographical research centered on Schizothoracinae fish, particularly in the context of the QTP uplift. Furthermore, the occurrence data can be paired with high-resolution geographic or climatic data in ecological niche models or species distribution models to predict shifts in species distribution and thereby improve future protected area management. The functional trait data can help to explain the distribution dynamics of Schizothoracinae and the underlying factors contributing to these variations.

Fig. 1 The workflow to compile the SchiSOFT dataset on species occurrences and functional traits of Schizothoracinae fish.

Table 1. Descriptions of the fields used in the SchiSOFT dataset.
IUCNcategory: the species IUCN Red List extinction risk, including LC (Least Concern), DD (Data Deficient), VU (Vulnerable), NT (Near Threatened), EN (Endangered), CR (Critically Endangered), EX (Extinct), and 'NA' if not available.
date, if there was no sampling date recorded in the sources, we recorded the earliest available date of the sources as a substitute, such as the received date of the article, etc. 'NA' indicated the actual sampling date in the field 'sampledDate'.
sourceLanguage: language of sources, including English and Chinese, 'NA' if sourced from field surveys.
sourceType: the source of the record, including published literature (journal articles, master theses, doctoral dissertations, and conference papers), books, online databases, and field surveys.
referencesInEnglish: the citation of references published in English, and Chinese references had been translated into English. 'NA' if sourced from field surveys.
referencesInChinese: the citation of references published in Chinese, 'NA' if sourced from field surveys or published in English.
DOIorISBN: the DOI code of published literature and the ISBN code of books, 'NA' if not available or sourced from field surveys.
imageType: the image type of the Schizothoracinae fish, including photo and scientific drawing, 'NA' if not available.
URL: the URL for the exact online database information.

Table 2. Descriptions of the functional traits in the SchiSOFT dataset. 'Categorical' indicated the categorical data type, 'Numerical.d' indicated the numerically discrete data type, and 'Numerical.c' indicated the numerically continuous data type.
habitatFlow (Categorical): preferred waterflow velocity of the species, including 'slow flow', 'rapid flow', and 'NA'; derived from text descriptions.
waterbody (Categorical): preferred waterbody type of the species, including 'lake', 'river', 'lake and river', and 'NA'; derived from text descriptions.