A georeferenced rRNA amplicon database of aquatic microbiomes from South America

The biogeography of bacterial communities is a key topic in microbial ecology. For continental waters, most studies have been carried out in the northern hemisphere, leaving a gap in our knowledge of microbial diversity patterns on a global scale. South America harbours approximately one third of the world’s total freshwater resources and is one of these understudied regions. To fill this gap, we compiled 16S rRNA amplicon sequencing data of microbial communities from South American continental water ecosystems and present the first such database, µSudAqua[db]. The database contains over 866 georeferenced samples from 9 different ecoregions, together with contextual environmental information. For its integration and validation, we constructed a curated database (µSudAqua[db.sp]) using samples sequenced on the Illumina MiSeq platform with commonly used prokaryote universal primers. This curated subset comprises ~60% of the total georeferenced samples in µSudAqua[db]. The compilation was carried out within the scope of the µSudAqua collaborative network and represents one of the most complete databases of continental water microbial communities from South America.


Methods
Data compilation. The µSudAqua[db] database was constructed with samples from published papers and new data generated in the scope of this work (Table 1). We only considered studies meeting the following criteria: 1) samples were obtained from continental water systems of South America; 2) the whole bacterial community was studied using high-throughput amplicon sequencing of the 16S rRNA gene; 3) the 16S rRNA gene was amplified using universal primers (i.e. studies using group-specific primers or functional genes were not included); 4) sequencing data were publicly available or provided by the authors of the study upon request; 5) the samples could be georeferenced.
Sample metadata were collected from the published papers or provided by the authors of the current work. The altitude was automatically extracted from the sampling location using the QGIS geographic information system software (https://qgis.org/). Each sample was assigned to an environmental type (e.g. shallow and deep lakes, rivers, streams, reservoirs, swamps) and an ecoregion (section Ecoregions description). In addition, the georeferenced location and the procedures adopted for sampling and sequencing were fully recovered. The complete list of recovered metadata and its description is presented in Table 2. The sample information used to build the database is available as plain text (TSV format) in the Zenodo repository 22 .
The µSudAqua[db] was used as a seed to construct the curated database, µSudAqua[db.sp], which contains the subset of samples that were sequenced 1) with Illumina MiSeq technology and 2) with the commonly used set of primers proposed by Herlemann & collaborators 21 .

Fig. 1 Global distribution of amplicon sequencing samples from continental water systems using HTS. The information was acquired from the MGnify resource (https://www.ebi.ac.uk/metagenomics/) by searching for non-marine aquatic samples obtained by amplicon or metabarcoding experimental types. The geographical coordinates were retrieved for 4,691 samples from a total of 7,832 using the metadata available for each sample.
The microbial communities were obtained with different filtration strategies. In some environments, water samples were pre-filtered to exclude larger particles, or to split the microbial community into free-living and particle-attached fractions (Table 3). Even though different DNA extraction methods were used (Table 3), the V3-V4 regions of the 16S rRNA gene were amplified using the same set of bacterial universal primers, 341F (5′-CCTACGGGNGGCWGCAG-3′) and 805R (5′-GACTACHVGGGTATCTAATCC-3′) 21 . Samples of each project were indexed with the Nextera XT v2 kit and sequenced using Illumina MiSeq technology at different sequencing facilities. The samples were obtained mostly from surface waters (0-50 cm) of continental systems with different limnological characteristics and different spatial and temporal coverage across six ecoregions (Table 3).
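The degenerate positions in the two primers (N, W, H and V) follow the IUPAC nucleotide ambiguity codes. As an illustration only, the following minimal Python sketch expands each primer into a regular expression and checks whether a read begins with it. The function names are hypothetical and this is not the trimming procedure used in this work, which relied on Cutadapt.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

# Primer sequences as given in the text (5' to 3').
FWD_341F = "CCTACGGGNGGCWGCAG"
REV_805R = "GACTACHVGGGTATCTAATCC"


def primer_to_regex(primer):
    """Compile a degenerate primer into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def starts_with_primer(read, primer):
    """Return True if the read begins with a sequence matching the primer."""
    return primer_to_regex(primer).match(read.upper()) is not None


# Hypothetical reads, made up for illustration.
print(starts_with_primer("CCTACGGGAGGCAGCAGTTT", FWD_341F))
print(starts_with_primer("CCTACGGGAGGCCGCAGTTT", FWD_341F))
```

The first read matches because N accepts any base and W accepts A or T; the second does not, because it carries a C at the W position.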
Amplicon sequences from the µSudAqua[db.sp] were processed using DADA2 v1.10.0 23 , after primer trimming with Cutadapt v1.18 24 . Each sequencing project was analyzed separately with the same filtering parameters, as recommended by Callahan & collaborators 23 . To define the filtering parameters, the quality of the samples was explored using the functions fastx_eestat and fastx_info from USEARCH v10.0.240 25 . Filtering was then performed using the filterAndTrim function from DADA2 with the following values: maxEE = c(2,2) and truncLen = c(250,220). Only samples with more than 10,000 reads were analyzed.
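To make the two filtering values concrete: the expected number of errors of a read is the sum of the per-base error probabilities, 10^(-Q/10), over its Phred quality scores. The Python sketch below is a simplified illustration of that criterion, not the DADA2 implementation. It assumes that reads are truncated to a fixed length, that shorter reads are discarded, and that a pair is kept only when both reads pass; the function names are hypothetical.

```python
def expected_errors(quals):
    """Sum of per-base error probabilities for a list of Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)


def read_passes(quals, trunc_len, max_ee):
    """Truncate to trunc_len, then require expected errors <= max_ee.

    Reads shorter than trunc_len are rejected (assumed behaviour).
    """
    if len(quals) < trunc_len:
        return False
    return expected_errors(quals[:trunc_len]) <= max_ee


def pair_passes(fwd_quals, rev_quals, trunc_len=(250, 220), max_ee=(2, 2)):
    """Keep a read pair only if both the forward and reverse reads pass."""
    return (read_passes(fwd_quals, trunc_len[0], max_ee[0])
            and read_passes(rev_quals, trunc_len[1], max_ee[1]))


# Hypothetical quality profiles, made up for illustration.
good_fwd = [30] * 300   # Q30 everywhere: 250 * 0.001 = 0.25 expected errors
good_rev = [30] * 300
poor_rev = [10] * 300   # Q10 everywhere: 220 * 0.1 = 22 expected errors

print(pair_passes(good_fwd, good_rev))
print(pair_passes(good_fwd, poor_rev))
```

With the default values above, which mirror maxEE = c(2,2) and truncLen = c(250,220), the first pair is retained and the second is discarded because the reverse read exceeds two expected errors.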
To increase sensitivity to rare variants and avoid chimeras and sequencing errors, we used the "pool" option of the dada function. Chimeric sequences were excluded with the function removeBimeraDenovo after the different projects had been merged with the function mergeSequenceTables. The taxonomic classification was performed using BLAST v2.5 26 with the blastn algorithm (e-value = 0.0001) and the SILVA database (SSU Ref 132 NR 99 27 ) as a reference. The amplicon sequence variants (ASVs) were classified into 7 different taxonomic groups. The contribution of each group was calculated as its relative abundance with respect to the total number of reads, and the richness was defined as the total number of ASVs. The scripts used for DADA2 and sample description are available on GitHub (https://github.com/microsudaqua/usudaquadb).
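The two summary measures defined above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the script from the project repository: the ASV identifiers, the group labels and the function name are hypothetical, and richness is computed per sample as the number of ASVs with at least one read.

```python
from collections import defaultdict


def summarize_sample(asv_counts, asv_group):
    """Return (relative abundance per group, richness) for one sample.

    asv_counts: dict mapping ASV id -> number of reads in the sample
    asv_group:  dict mapping ASV id -> taxonomic group label
    """
    total = sum(asv_counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    # Only ASVs actually observed in the sample contribute.
    present = {asv: n for asv, n in asv_counts.items() if n > 0}
    reads_by_group = defaultdict(int)
    for asv, n in present.items():
        reads_by_group[asv_group[asv]] += n
    rel_abundance = {g: n / total for g, n in reads_by_group.items()}
    richness = len(present)
    return rel_abundance, richness


# Hypothetical counts and group labels, made up for illustration.
counts = {"ASV1": 60, "ASV2": 30, "ASV3": 10, "ASV4": 0}
groups = {"ASV1": "GroupA", "ASV2": "GroupA", "ASV3": "GroupB", "ASV4": "GroupB"}

rel, rich = summarize_sample(counts, groups)
print(rel, rich)
```

In this made-up sample, GroupA accounts for 90 of 100 reads and GroupB for 10, and three ASVs are present.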

Ecoregions description.
To define the ecoregions, we adopted the level II classification proposed by Griffith & collaborators 28 for Central and South America and the Caribbean. The characteristics of each ecoregion and subregion are briefly described below.

Central Andes 18.1 Central High Andes, Chile
The Central High Andes ecoregion extends from southern Peru, through Chile and Bolivia, to northern Argentina (5.18°-38.44° S, 78.17°-70.24° W). The landscape is typically mountainous, with snow-capped peaks, plateaus and valleys 29 . The ecoregion occupies an area of 140,960 km² and lies within the altitudinal range between 3,200 and 6,600 m 15 . Its climate varies from temperate to cold, with an annual average temperature between below zero and 15 °C. This region is dry, with precipitation between 250 and 500 mm per year 29,30 . It is considered as a transitional zone between the wet puna to the north and west, and the dry puna to the south. This ecoregion has several high-elevation wetlands comprising both fresh and saline lakes, salt flats, temporary endorheic basins, as well as permanent rivers and streams fed by snowmelt. They regulate water flow by retaining water during the wet season and releasing it during the dry season. The salt flats, or salares, represent remnants of extensive paleolakes 29 .

Southern Andes 19.2 Valdivian Forest Hill and Mountains
The Valdivian Temperate Forests ecoregion is in the southern cone of South America (33.02°-46.91° S, 70.55°-74.51° W). It covers a narrow continental strip between the western slope of the Andes and the Pacific Ocean (area: 248,100 km²). The climate is cool temperate (mean annual temperature is 8.7 °C), with a predominance of westerly winds and an annual precipitation of 1,500 mm 31 . The ecoregion is characterized by a profuse hydrographic system including large and deep lakes (mainly of glacial origin) 32,33 and small and shallow lakes 34,35 . The main rivers, fed by these Andean waters, run across the plateau steppe and flow out to the Atlantic Ocean, but other rivers cross the Andes and flow towards the Pacific Ocean. Deep lakes (Zmax > 100 m) have a warm monomictic thermal behavior 36 . In contrast, small and shallow lakes (Zmax ~12 m) are dimictic or polymictic 34 . These lakes have very low nutrient concentrations (ranging from ultra-oligotrophic to oligotrophic status) and low dissolved organic carbon concentrations, and high transparency to different wavelengths, which would imply high exposure to ultraviolet radiation 37-40 .

Country
Country the observation belongs to

Ecoregion
Ecoregion Name

EnvironmentName
Type of habitat the sample was taken from

SystemType
Type of system the sample was taken from

Lat
Geographic Latitude in decimal degree

Long
Geographic Longitude in decimal degree

Altitude
Altitude of sampling location in meters above sea level [m.a.s.l]

Sample depth in meters [m]

CollectionDate
Date of the sampling event

SizeFraction
Size fraction (µm) upper and lower threshold

FilterPreservation
Solution in which the filter was preserved

StorageTemperature
Temperature at which the sample was stored [°C]

ExtractionMethod
Method used for the nucleic acid extraction

MechDisruptionMethod
Method used to disrupt the cells

StorageDuration
Duration for which the sample was stored

SeqPlatformName
Next-generation sequencing platform on which the reads were generated

SeqPlatformModel
Model of the next-generation sequencing platform on which the reads were generated

LibraryLayout
Whether single-end or paired-end reads were generated

LibraryStrategy
Sequencing technique implemented for the library

LibrarySource
Type of source material that is being sequenced.

LibrarySelection
Method used to select and/or enrich the material being sequenced

The Amazon river basin is the largest in the world, comprising an area of over 6 million km² and extending from 5°N to 17°S and from 79°W to 46°W. Basin sources are mostly located in the northern region of Brazil; the river system starts in the Andes mountains of Peru and ends in the Atlantic Ocean on the Brazilian coast 41 . The climate in the basin is in general hot and humid, with a mean annual temperature between 24 and 28 °C 42 . The average annual precipitation is ~2,200 mm, ranging from ~3,000 mm in the west to ~1,700 mm over the southeast of the basin 43 . The Amazon basin comprises numerous large rivers, tributaries, and large extensions of floodplains with thousands of lakes and associated wetlands linked to each other 44 . These systems vary from permanent to periodically flooded depending on the hydrological cycle, namely the flood pulse 38 . This flood pulse has a profound effect on the productivity, transport of elements and biotic interactions within these ecosystems 41,45 .

Eastern Highlands 21.2 Cerrado
The Cerrado is the second largest ecoregion in South America. It comprises the Brazilian central region (2.05°-23.77° S, 45.29°-54.37° W) and covers an area of over 2 million km² 46 . It is a savannah domain, characterized by a tropical climate (mean annual temperature: 22-27 °C), with dry winters and rainy summers 46 . Annual precipitation typically ranges from 1,200 to 1,800 mm, and the soil is usually acidic and nutrient-poor 47 . The altitude of the Cerrado varies little and reaches its maximum only in the central highlands, where important springs originate and contribute to the three largest water basins in South America (Amazon, São Francisco and Del Plata-Paraná/Paraguay) 48 . There are very few natural lakes in this region, and most water bodies are either dammed shallow lakes or large hydroelectric reservoirs. As reservoirs are mainly found near cities, the nutrient inputs, pH and trophic state can vary 49,50 .

Atlantic Forests
The Atlantic Forests region is mainly located in Brazil, spanning the Atlantic coast and extending inland to Argentina and Paraguay (distributed from 5.00° S to 28.00° S and from 35.14° to 53.56° W 51 ). This ecoregion is a wide, tropical (mean annual temperature ~23 °C), humid biome known mainly for its long line of coastal rainforest 51 . The coast is humid all year round, with an annual precipitation typically ranging from 1,800 to 3,600 mm. This ecoregion is characterized by different formations such as deciduous and semi-deciduous continental forests, bogs and mangroves, and grasslands 52 . The landscape can be flat, and lentic environments in the countryside are either human-made dammed creeks used for cattle ranching and crop irrigation or large hydroelectric reservoirs. Along the Brazilian Atlantic coast, lentic ecosystems are shallow lakes dug into the mountainside, or squeezed into the narrow strip between the mountain chain and the ocean 53 . There are also some herbaceous/shrubby sand-dune ecosystems, called Restinga, that form perennial or temporary coastal shallow lagoons 54 , which encompass wide environmental gradients (e.g. trophic state, humic substances, salinity) that greatly influence aquatic biodiversity 55 .

Gran Chaco 22.2 Humid Chaco
Lakes and rivers from the Paraná floodplain system. The Paraná River is the second largest river of South America, with a mean annual discharge of ~17,000 m³ s⁻¹ and a drainage area of 2.6 × 10⁶ km². The headwaters are fully developed in Brazil, and the river travels 3,800 km along a main north to south direction through tropical to temperate latitudes up to its mouth in the Río de la Plata Estuary, with mean annual temperatures of ~12.5 °C 56 . The middle stretch of the river begins downstream from the confluence with the Paraguay River (Argentina). The climate is humid subtropical, with annual precipitation between 900 and 1,000 mm. At this stretch, the river is characterized by a well-defined main channel and a large floodplain about 20 to 40 km wide, located on its right margin. Thousands of permanent shallow lakes and temporary environments occupy the floodplain, which is flooded and drained by a well-developed and relatively stable fluvial network 57 . The system dynamics are subject to hydro-sedimentological pulses that occur with different magnitudes and constitute the main driving factor of the limnological features and the biota 38,58 , particularly the microbial communities 59-61 .

Pampas 23.1 Uruguayan savanna, Uruguay
The Uruguayan savanna ecoregion comprises an area of 355,605 km², which includes the whole country of Uruguay (30°-34° S, 53°-58° W) and extends mostly towards the southern part of Brazil and to a small section of Argentina 62 . The climate of this region is temperate, without a dry season and with hot summers 63 . The mean annual temperature ranges between 16 and 20 °C. The mean annual rainfall lies between 1,100 and 1,400 mm and is highly variable between years. This ecoregion encompasses the outlet of the Río de la Plata basin, where a dense fluvial network, along with a series of coastal lagoons and numerous artificial lakes, can be found. Rivers and streams are characterized by small slopes and rapid filling and draining 64 . Coastal lagoons, formed by marine regressions and transgressions in the Holocene, are located on the Atlantic coast 65 , and their size and age increase towards the East. They are characterized by large gradients in salinity, light penetration and nutrient concentrations, and their hydrological cycle strongly determines the composition and activity of the bacterial communities 66,67 .

Southern Flat Pampas
The Pampa ecoregion extends westward across central Argentina (30.37°-38.98° S, 57.60°-62.31° W), from the Atlantic coast to the Andean foothills 32 . It is an extensive plain area (398,966 km²), except for the two, almost parallel, hill systems that cross the area in a NW-SE orientation (Sierras de Tandilia and Sierras de Ventania). The climate of this region is temperate and humid, with mean annual temperatures varying from 14 to 20 °C. The precipitation is concentrated during spring and summer months, and decreases from NE to SW (from 1,000 to 400 mm) 38 . The ecoregion is dominated by a large number of fluvial-aeolic shallow lakes and low order rivers and streams that mostly belong to the Salado-Vallimanca basins 32 . Particularly, lakes are characterized by rounded contours and pan-shaped profiles. They are typically shallow, polymictic, eutrophic to hypertrophic, with highly variable water renewal time and salinity. Most of the surrounding land is devoted to agricultural practices 36 . This economic development directly affected shallow lakes, promoting shifts in many of them from clear regimes, characterized by the presence of submerged vegetation, to algal-dominated turbid states 68 .

Monte-Patagonian 24.2 Patagonian Tablelands
The Patagonian tablelands ecoregion (defined as "Patagonian plateau" by Quirós & Drago 32 ) is a complex landscape of about 600,000 km², located in Argentina (33.68°-54.52° S, 68.75°-66.35° W). It is delimited by the Colorado River to the North, the Atlantic Ocean to the East, the Andes to the West and parallel 54° to the South 69 . It is characterized by the extreme conditions of a cold and dry climate, with average maximum temperatures of 2.9 and 14.0 °C in winter and summer, respectively, and minimum temperatures that can fall below −19.0 °C in winter. The mean annual precipitation is ~300 mm. This ecoregion encompasses different types of water bodies, including reservoirs, permanent natural lakes and temporary ponds. Most water bodies are shallow lakes, typically ranging from mesotrophic to eutrophic. Because of the climate conditions, small shallow lakes (i.e. less than 30 km²) usually remain frozen from early autumn until late spring; during the ice-free period, however, frequent strong winds continuously mix the water column, thus preventing the formation of stable thermoclines 70-72 .

The µSudAqua[db] (Fig. 2) contains samples sequenced using 454, Ion Torrent and Illumina technologies, and targeting different hypervariable regions of the 16S rRNA gene. The raw sample files are freely available in the European Nucleotide Archive (ENA) database 73 . They can be downloaded using the Run Accession Number from the metadata file provided in the Zenodo repository 22 .

Technical Validation
The technical validation was performed using the µSudAqua[db.sp], which comprises the samples that were sequenced with Illumina MiSeq technology and targeted the V3-V4 regions of the 16S rRNA gene.
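Selecting such a subset from a tab-separated metadata table can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used to build the curated database: SeqPlatformModel is a field listed in Table 2, whereas the SampleID and TargetRegion columns, the example values and the function name are hypothetical.

```python
import csv
import io


def select_curated(rows, model="Illumina MiSeq", region="V3-V4"):
    """Return the rows whose platform model and target region both match."""
    return [
        r for r in rows
        if r["SeqPlatformModel"].strip().lower() == model.lower()
        and r["TargetRegion"].strip().upper() == region.upper()
    ]


# Tiny in-memory stand-in for the metadata TSV (all values are made up).
EXAMPLE_TSV = (
    "SampleID\tSeqPlatformModel\tTargetRegion\n"
    "S1\tIllumina MiSeq\tV3-V4\n"
    "S2\t454 GS FLX\tV1-V3\n"
    "S3\tIllumina MiSeq\tV4\n"
)

rows = list(csv.DictReader(io.StringIO(EXAMPLE_TSV), delimiter="\t"))
curated = select_curated(rows)
print([r["SampleID"] for r in curated])
```

Only the first made-up sample satisfies both conditions; the other two differ in platform or in target region.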

Usage Notes
The links to download the raw fastq data from µSudAqua[db] and µSudAqua[db.sp] are in the metadata file accessible in Zenodo 22 . In addition, other files associated with the µSudAqua[db.sp] are available in the same repository: the ASV table (number of reads in each sample), the taxonomy, the nucleotide sequences in fasta format and the ASV table filtered to contain only Bacteria. Importantly, the database will grow as new samples and sequencing projects from the µSudAqua network appear. This information will be uploaded to the repository and the tables will be updated in future versions of the database. A literature review and an open call for new data submissions will be carried out once a year, and the database will be updated after data quality checking, processing and integration.
The µSudAqua[db] and µSudAqua[db.sp] databases are the first to integrate information on microbial diversity from continental systems of South America, an important region that has been overlooked compared with other regions and environments worldwide. These databases will open new avenues for studies on the temporal patterns and spatial distributions of microbial communities among the different ecoregions of South America. In addition, integrating the curated data into meta-analyses of microbial communities from different ecosystems (comparing South America with well-studied regions of the world) will be particularly important for exploring novel microbial diversity, making it possible to reveal regions with unknown organisms and functions, as well as hotspots of microbial biodiversity.