An approach using ddRADseq and machine learning for understanding speciation in Antarctic Antarctophilinidae gastropods

Sampling impediments and paucity of suitable material for molecular analyses have precluded the study of speciation and radiation of deep-sea species in Antarctica. We analyzed barcodes together with genome-wide single nucleotide polymorphisms obtained from double digestion restriction site-associated DNA sequencing (ddRADseq) for species in the family Antarctophilinidae. We also reevaluated the fossil record associated with this taxon to provide further insights into the origin of the group. Novel approaches to identify distinctive genetic lineages, including unsupervised machine learning variational autoencoder plots, were used to establish species hypothesis frameworks. In this sense, three undescribed species and a complex of cryptic species were identified, suggesting allopatric speciation connected to geographic or bathymetric isolation. We further observed that the shallow waters around the Scotia Arc and on the continental shelf in the Weddell Sea present high endemism and diversity. In contrast, likely due to the glacial pressure during the Cenozoic, a deep-sea group with fewer species emerged expanding over great areas in the South-Atlantic Antarctic Ridge. Our study agrees on how diachronic paleoclimatic and current environmental factors shaped Antarctic communities both at the shallow and deep-sea levels, promoting Antarctica as the center of origin for numerous taxa such as gastropod mollusks.


Results
Phylogenetic reconstruction. From a total of 142 extracted samples for molecular analyses, only 61 were successfully COI-barcoded and 40 were used in ddRADseq experiments (Table 1, Fig. 1A,B). The sequenced fragment of the COI gene included ca. 658 bp. All newly sequenced samples belong to Antarctophilinidae; the closely related cephalaspideans Alacuppa sp. and Philinorbis sp. were obtained from GenBank and used as outgroups. The maximum likelihood topology (Fig. 1B) showed the sister group to all Antarctophiline species was Waegelea antarctica (E. A. Smith, 1902), with samples ranging across the Drake Passage, Ross Sea, Scotia Arc, and Weddell Sea (65-500 m depth). Also, six clades corresponding to Antarctophiline species were identified, including A. easmithi (E Weddell Sea, 170-460 m depth) not present in the ddRADseq datasets, and depicted in grey in Fig. 1B, and a complex of species with affinity to A. alata.
The initial ddRADseq Matrix 1 including both antarctophilinid genera (i.e. Antarctophiline and Waegelea) included 40 individuals, 3893 loci, and 38.5% missing data (not depicted; see "Methods" section). The outgroup included five individuals of W. antarctica, a species with a circumpolar distribution. Matrix 2 was built to increase the number of loci and resolution of the tree, including only the 35 individuals in the genus Antarctophiline. The final Matrix 2 dataset contained 5411 loci and 41.6% missing data (see summary statistics in Supplementary Table S1 and plotted matrix visualization in Fig. S1). A particularly long branch was found in sample P70 (Fig. 1A), most likely due to cross-contamination and/or high levels of missing data (see the section below). The subset to specifically address the A. gibba/A. alata species complex includes 27 individuals (Matrix 3), 5411 loci, and 41.6% missing data. Although several clustering thresholds were tested, only a 75% threshold of similarity produced an output to construct a matrix. This is due to the high genetic divergences among species, and thus, the high amount of singleton reads that makes the clustering within and across samples computationally too demanding. The ML tree of Matrix 2 was rooted with A. amundseni Moles, Avila & Malaquias, 2019 plus the abyssal Antarctophiline sp. 1 (Fig. 1A) as the sister group to the rest of Antarctophiline species. This was also found in the tree of Matrix 1 (Fig. S2), for which an identical topology and maximum bootstrap support (BS = 100) and posterior probability (PP = 1) values were recovered for most nodes, but sample P70 due to potential contamination.
Distinctiveness in genetic lineages. For Matrix 2, STRU CTU RE (Evanno method 32 ) favored an optimal K = 6 ( Fig. 2), recovering all six a priori barcoded Antarctophiline species as distinct clusters, including the eury-  Clustering with the VAE output of the full dataset resulted in 8-10 clusters (Fig. 3A). However, examining the VAE plot, it is apparent that samples P70 and P49 were misplaced and likely confounded an accurate representation of the data, for example, showing two widely divergent Antarctophiline sp. 3 clusters. As previously mentioned, P70 likely contains contamination and was removed from subsequent analyses. Additionally, the placement of P49 with Antarctophiline sp. 3 (specifically with P70) instead of the other A. gibba (P50) is perhaps driven by a combination of admixture ( Fig. 1A) and high levels of missing data (up to 80%, see Fig. S1) in this sample, and was also removed. VAE output of the dataset with these two samples removed (Fig. 3B) recovers a single cluster for all Antarctophiline sp. 3 samples, more in line with the results from COI-barcoding and ddRADseq analyses. "Partition around medoids" (PAM) and hierarchical clustering on this VAE output recover seven clusters (Fig. 3C), while the gap statistic favors 10 clusters, splitting A. alata, A. cf. alata, and A. amundseni into two clusters each. The gap statistic is thus probably over-splitting these taxa as overlapping VAE standard deviations indicate one cluster for A. amundseni and at most two clusters for A. cf. alata (Fig. 3B). Given the concordance between genetic clustering across multiple approaches, we favor seven species, as shown in the VAE clustering (Fig. 3B) results that match those in the phylogeny (Fig. 1) and the sum of the STRU CTU RE analyses (Fig. 2). Diagnosis Shell ovate-subquadrate, slightly flattened dorsoventrally; aperture wide; apex obtuse, slightly sunken; outer lip convex; posterior edge of outer lip obtuse, not protruding beyond apex; columellar wall concave; growth lines visible.  34 suggested its similarity to the genus Philine. Indeed, we believe the overall shell morphology matches the recently erected family Antarctophilinidae which epitomizes most Philinoidea diversity known from Antarctic waters.

Discussion
This study, grounded on a large collecting effort in remote and abyssal areas in the SO aided by phylogenetic analyses using ddRADseq-derived SNP data, increases our understanding of Antarctic benthic gastropod species distributions and diversity. Our barcode and STRU CTU RE analyses served as a starting point for discerning among genetic lineages 35 . Since incongruences in the number of distinct genetic groups recovered in both Matrix 2 and 3 were found using STRU CTU RE, we used novel unsupervised machine learning methods 36 . VAEs proved to be powerful and resolutive when using genomic data for recovering congruent genetic lineages across methods, some of which correspond to species hypothesis. Our results corroborated the latest systematic assessment by Moles et al. 29 on species diversity and support the likelihood of further cryptic diversity in Antarctophilinidae. Out of the eight delimited species three are considered undescribed, the abyssal Antarctophiline sp. 1, the shallow-water species Antarctophiline sp. 2 from Bransfield Strait, and Antarctophiline sp. 3 from the South Sandwich Islands. Regarding A. cf. alata and, to a certain extent, Antarctophiline sp. 3, these are considered to be at the early stages of speciation (potential cryptic species). Taxonomic descriptions of the undescribed species, as well as the validity of certain synonymized taxa within the A. alata species complex (Philine gouldi Doello-Jurado, 1918, and P. amoena Thiele, 1925), will follow in a separate manuscript. Ecological restrictions to bottom-dwelling habitats may be a driver for the morphological convergence in species of Philinoidea 29,30,37 .
Here, genomic data have enhanced the phylogenetic resolution of Antarctophilinidae species obtained through Sanger analysis, particularly for less diverged species, an approach that has been previously applied only to a handful of Antarctic samples (e.g., Ref. 38 ).
A total of seven species of Antarctophiline are found in Antarctic shallow waters, the five species analyzed here: A. alata, A. easmithi, A. gibba, Antarctophiline sp. 2, and Antarctophiline sp. 3 from the vicinities of the Scotia Arc and western Antarctic Peninsula plus the species from the Ross Sea A. apertissima (E. A. Smith, 1902) and A. falklandica (Powell, 1951) (also from the Falkland Islands 39,40      www.nature.com/scientificreports/ Additionally, high species richness is expected due to the elevated productivity of these shallow waters during warm seasons 41 , a pivotal influence controlling Antarctic benthic diversity 47 . In fact, A. gibba is endemic to South Georgia, an island considered a hotspot for gastropod diversity, with more than 50 endemic species recorded 48 . Nonetheless, the high degree of single species found at each collecting site underlines the difficulty of gathering data on Antarctic ecosystems, and thus, conclusions should be made with caution, especially concerning deepsea samples 49 . Overall, our study provides evidence for high diversity in a group of species previously considered to be rather low, which may be the result of fluctuating paleoclimatic history and current habitat heterogeneity. Antarctica has long been considered a center of radiation of marine benthic taxa [19][20][21] , and heterobranch gastropods in particular 24,25,27 . Here, the single fossil record from the Oligocene-Early Miocene at King George Island is attributed to Antarctophilinidae (Fig. 4), proposing that these snails were present in shallow waters of the South Shetland Islands at least 20 Mya. Although limited water transport is hypothesized during the Eocene 50-34 Mya, and probably until the mid-Miocene 15 Mya 28,50 , Trans-Antarctic migrations through a Ross-Weddell seaway through the Amundsen Sea could partially explain the disjunct distributions of several shallow-water gastropods 41,51 . In our study, molecular data for W. antarctica supports this disjunct distribution and, the specimens found in Peter I Island-an intermediate locality in the Bellingshausen Sea-further reinforce this hypothesis (reexamined material from Aldea & Troncoso 52 ). This has been suggested for other marine benthic taxa 50,53 , for which a circumpolar distribution seems unlikely, but instead, a disjunct distribution has been documented. Our compiling evidence suggests that the Scotia Arc and the Weddell Sea have played a pivotal role in the evolution and radiation of many molluscan species from the Miocene forward. However, until comprehensive sampling has been carried out-something difficult to accomplish in Antarctica-our understanding of species distributions remains somewhat limited.
The onset of the ACC led to the isolation of the Antarctic continent and subsequent cool down with the likely extinction of shallow-water faunas 17 . During interglacial periods of shelf ice retreat, the unpopulated shelf could have been re-colonized by fauna from the slope 54 or shelters on the continental shelf 14,42,43 . Species dispersal and gene flow at subtidal and shelf depths have been increasingly studied in SO areas with enough evidence of contrasting patterns related to the disparity in species life histories 55,56 , usually challenging the concept of well-connected, circumpolar distributions 57-59 but see Moore et al. 60 . Extensive geographical distributions were found in species such as A. alata (including A. cf. alata) and W. antarctica, which occur at 10-500 m depth over the Eastern Weddell Sea, the South Shetland Islands, and the Southern South Sandwich Islands, even extending towards the more remote Bouvet Island. The Weddell Gyre is a clockwise current known for connecting shallow shelf waters along the Weddell Sea coast to the South Shetland Islands (through the tip of the Antarctic Peninsula), and towards the South Sandwich Islands, which might ultimately reach Bouvet Island 61,62 . Dispersal through this hydrographic jet could explain the current distribution among the studied species (see Fig. 5a). The Scotia Arc faunal gateway, helped by oceanic eddies, might ultimately be responsible for the distribution found across the Sub-Antarctic islands of South Sandwich and South Georgia 63,64 . Unfortunately, scarce is information on the life history of Antarctophilinidae. The studied species A. gibba lays egg masses at the superficial waters of South Georgia, these containing thousands of large eggs lacking a planktonic phase 65 . Just recently, the larval development of W. antarctica was studied in detail showing early juvenile stages hatching with a relict larval velum, being able to drift away in water currents 66 . We hypothesize this is a drifting mode that may explain our current knowledge of distribution in this family (see Ref. 67 ). www.nature.com/scientificreports/ Cenozoic glacial-interglacial cycles may have also represented the environmental force that shaped the evolutionary trend toward eurybathy in many Antarctic benthic invertebrates 14,54,68 . During periods of extension of the continental ice sheet, an Antarctophiline lineage from the shelf may have been forced to go into deep slope refugia. In this sense, the sister group to the shallow water Antarctophiline clade is a deep-sea group composed of the recently described A. amundseni and an undescribed abyssal species, both displaying a bathymetrically separated distribution but with widespread geographical distributions. At the upper bathyal zone, which includes the Antarctic slope, we found populations of A. amundseni across the Eastern Weddell Sea and the Bransfield Strait. The Antarctic Slope Current circulating in a counterclockwise direction could have been responsible for such distributions 69 . Far below, very distant populations of Antarctophiline sp. 1 at 2900-4500 m depth were recorded west of the Antarctic Peninsula, over the Scotia Ridge, and towards the northern and eastern regions of Bouvet Island. The distribution expands through the South-Atlantic Antarctic Ridge (Fig. 5b), strikingly covering a linear distance of more than 3900 km. The eastern jet of the ABW (i.e., Weddell Sea Bottom Water) feeds the Atlantic abyssal waters all over the North Atlantic as part of the Global Thermohaline Circulation 70-72 and indeed reflects the distribution patterns reported here. An expected longer life cycle in the deep sea 73 may have driven these species into alternative modes of reproduction and dispersal, but the absence of detailed ecological information for Antarctic Philinoidea precludes any conclusions. Alternatively, geological events may explain the current distribution of disjunct species of sea snails across ocean basins 74 .
During the past two decades, an increasing number of molecular studies on different taxa have challenged the three central paradigms of Antarctic benthic lineages 55 , i.e., isolation 64,75 , circumpolarity 15,76 , and eurybathic distributions 16,77 . Our evidence, based on genomic data and novel machine learning approaches, also challenges these long-standing concepts of Antarctic benthic species. Habitat segregation either through shelf refugia or www.nature.com/scientificreports/ current ecosystem heterogeneity at Antarctic shelf depths may have favored species flocks 59 . Contrarily to the widespread longitudinal distribution of some species 78 , bathymetrically separated distributions are a common phenomenon found in Antarctophilinidae 29 , with higher species diversity and endemism found at shelf depths. Nonetheless, we have to bear in mind the potential biases of sampling efforts across depths, with the lower part of the slope and abyss, seldom explored compared to shallower depths 49 . Shallow-water and slope species are thought to have colonized abyssal depths during the late Mesozoic and Cenozoic epochs 19 . The resulting sinking of cold, saline water adjacent to the Antarctic continent and its subsequent movement northwards at abyssal depths has resulted in colonization from the Antarctic for many invertebrate families and genera. Deep-sea communities seem to harbor less species-level diversity, probably because of more homogeneity in their habitats, compared to the shallow water environments 5 . Strikingly, Antarctica has acted as a center of origin and radiation of certain benthic taxa 19,20 , including Antarctophilinidae mollusks.  (Table 1). Distribution maps for the 142 specimens collected, color-coded by species for both continental shelf (Fig. 5a) and slope (Fig. 5b) plains were designed in Arc-GIS 10.3 (Esri, Redlands, CA). Samples were mostly collected by dredging and trawling, but some were collected manually during SCUBA. When possible, specimens were photographed alive on board and preserved in either 70% or 95% EtOH for molecular purposes. Once back in the laboratory, all specimens were photographed dorsally and ventrally using a Keyence VHX-6000 Digital Microscope system at the Museum of Comparative Zoology (MCZ) before dissection. Additionally, material deposited at the Institute of Paleobiology, Polish Academy of Sciences, of the single philinoid fossil, precisely from Antarctica and originally attributed to Scaphander (Scaphandridae) 33 , was morphologically reassessed. Amplifications were cleaned with incubation of 1 µL ExoSAP-IT (Affymetrix, Santa Clara, CA) and sequencing reactions were performed in 10-µL reactions using BigDye ver. 1 chain-termination chemistry on an ABI3730xl (Applied Biosystems Inc., Foster City, CA). Sequences were edited and aligned with MUSCLE 80 , as implemented in Geneious v. 11.0.3 81 . All sequences were submitted to GenBank (see Table 1 for accession numbers).

Methods
Successful DNA extractions were then quantified using a Qubit 2.0 Fluorometer (Invitrogen, Carlsbad, CA), and 100-700 ng of genomic DNA for each sample was used for double-digest restriction site-associated DNA (ddRAD). Libraries were prepared following Peterson et al. 82 protocol with some modifications, using the enzymes EcoRI-HF and BfaI (New England Biolabs, Ipswich, MA) for digestion. Ca. 50-200 ng of fragmented DNA from each individual was later ligated using the customized P1 and P2 adapters with internal barcodes. Between 15-25 individual samples were then pooled together and size-selected to a range of 350-550 bp using a Blue Pippin (Sage Science Inc., Beverly, MA). Each size-selected pool was amplified through PCR and an Illumina P5 barcode was added using a Phusion High-Fidelity PCR Kit (New England Biolabs). PCRs were conducted with an initial denaturation for 30 s at 98

Matrix construction.
Barcode demultiplexing, quality control, within-sample clustering, and between-pair variant calling were carried out using ipyrad v. 0.7.28 83,84 . Settings and data processing steps are default commonly used with ddRAD data. Only reads with unambiguous barcodes and Phred Q scores ≥ 33 were retained, and loci with more than five undetermined bases were additionally discarded. Maxdepth was set to 10,000 to avoid uneven sequencing coverage of amplified fragments and highly similar genomic repetitive regions. Thereafter, all individuals from the different libraries were analyzed together, first clustering per locus with a clustering threshold for de novo assembly of 75% (i.e., level of sequence similarity at which two sequences are identified as being homologous, and thus cluster together); additionally, clustering thresholds at 80, 85, and 90% were attempted. Adapters were trimmed and reads < 35 bp discarded. To avoid over-inflation of estimated heterozygosity, we required a minimum of six reads for each cluster during consensus base-calling and up to four shared polymorphic sites per called locus. For finding the consensus loci and clustering across samples we used a minimum number of samples per locus of 50% of total species and a maximum number of SNPs per locus of 20%, only allowing for a 0.05% of ambiguous positions per consensus locus, setting the maximum number of indels to eight (thus filtering out poor final alignments), and a 50% of polymorphic sites per locus. Full data-  87 with the GTR + I + G model 88 . We ran four independent Markov chain Monte Carlo (MCMC) chains for 2 million generations, sampling every 1000 generations and discarding 10% of the trees as burn-in for each MCMC run before convergence. Convergence was achieved when the potential scale reduction factor (PSRF) was close to 1.0 for all parameters. Trees were visualized in FigTree v. 1.4.4 89 and edited in Adobe Illustrator CC 2018. Genetic structure and optimal clustering were analyzed in STRU CTU RE v. 2.3.4 90 using matrices with unlinked SNPs for the Antarctophiline total dataset (Matrix 2) and the A. gibba and A. alata species complex (Matrix 3). SNP matrices were run for 1 million generations using an admixture model and 100,000 burn-in on K values ranging from 2 to 10 for Matrix 2 and 2-5 for Matrix 3, with eight replicates each. An optimal K value was calculated through the Evanno method 32 in the Structure Harvester Web v. 0.6.94 91 (http:// taylo r0. biolo gy. ucla. edu/ struc tureH arves ter/). CLUMPAK 92 (http:// clump ak. tau. ac. il/) was used for graphical visualization that was later edited in Adobe Illustrator CC 2018.
To further visualize data and perform clustering on samples we use a VAE 93 for dimensionality reduction of SNP data. VAEs are a type of machine learning algorithm rooted in Bayesian statistics that relies on neural networks and unsupervised learning to learn a reduced-dimension representation (latent space) of high dimensionality data. This approach allows for easy visualization of the mean and standard deviation of each sample in latent space. The use of VAEs in species delimitation and clustering with genetic data was recently demonstrated by Derkarabetian et al. 36 . The STRU CTU RE formatted file was converted to "one-hot encoding" and the VAE was run using the "sp_deli" script (https:// github. com/ sokry pton/ sp_ deli) from Derkarabetian et al. 36 . An analysis was run on the full Antarctophiline dataset (Matrix 2). However, given potential contamination of sample P70 and issues with sample P49 (see "Results" section), a second analysis was run with these two samples removed. For both datasets, the VAE was run five times and the analysis with the lowest loss (a measure of the difference between input and reconstructed SNPs) was considered optimal. A single estimate of loss was calculated for each analysis by discarding the first 50% of generations as burn-in and calculating the average loss across the second 50% of generations. This average measure of loss for each analysis is akin to the likelihood estimate and burn-in associated with Bayesian phylogenetic analyses. Clustering was performed on the VAE output using only the mean of each sample; the two-dimensional representation (i.e., mean of each sample in latent space) was used as input for multiple clustering methods implemented in R 94 : "partition around medoids" (PAM) clustering using the cluster R package 95 , with the optimal K having the highest average silhouette width 96 ; PAM clustering with the optimal K inferred via the gap statistic with k-means clustering using the factoextra R package 97 ; and hierarchical clustering with the mclust R package 98 .