Adding loci improves phylogeographic resolution in red mangroves despite increased missing data: comparing microsatellites and RAD-Seq and investigating loci filtering

Hodel, Richard G. J.; Chen, Shichao; Payton, Adam C.; McDaniel, Stuart F.; Soltis, Pamela; Soltis, Douglas E.

doi:10.1038/s41598-017-16810-7

Download PDF

Article
Open access
Published: 14 December 2017

Adding loci improves phylogeographic resolution in red mangroves despite increased missing data: comparing microsatellites and RAD-Seq and investigating loci filtering

Scientific Reports volume 7, Article number: 17598 (2017) Cite this article

5642 Accesses
90 Citations
19 Altmetric
Metrics details

Subjects

Abstract

The widespread adoption of RAD-Seq data in phylogeography means genealogical relationships previously evaluated using relatively few genetic markers can now be addressed with thousands of loci. One challenge, however, is that RAD-Seq generates complete genotypes for only a small subset of loci or individuals. Simulations indicate that loci with missing data can produce biased estimates of key population genetic parameters, although the influence of such biases in empirical studies is not well understood. Here we compare microsatellite data (8 loci) and RAD-Seq data (six datasets ranging from 239 to 25,198 loci) from red mangroves (Rhizophora mangle) in Florida to evaluate how different levels of data filtering influence phylogeographic inferences. For all datasets, we calculated population genetic statistics and evaluated population structure, and for RAD-Seq datasets, we additionally examined population structure using coalescence. We found higher F _ST using microsatellites, but that RAD-Seq-based estimates approached those based on microsatellites as more loci with more missing data were included. Analyses of RAD-Seq datasets resolved the classic Gulf-Atlantic coastal phylogeographic break, which was not significant in the microsatellite analyses. Applying multiple levels of filtering to RAD-Seq datasets can provide a more complete picture of potential biases in the data and elucidate subtle phylogeographic patterns.

Impact of different numbers of microsatellite markers on population genetic results using SLAF-seq data for Rhododendron species

Article Open access 21 April 2021

Opportunities and challenges of macrogenetic studies

Article 18 August 2021

Inferring genetic structure when there is little: population genetics versus genomics of the threatened bat Miniopterus schreibersii across Europe

Article Open access 27 January 2023

Introduction

Choice of molecular markers remains a critically important consideration when designing a phylogeographic, phylogenetic, or population genetic study, as researchers must optimize the amount of informative genetic data they can obtain for a fixed and typically modest cost. In phylogeographic studies, theoretical considerations impact decisions regarding whether to include more individuals or more loci. Microsatellites (or simple sequence repeats, SSRs) have been one of the workhorses of phylogeographic studies for over two decades—their high variability made them popular for distinguishing between closely related conspecific or congeneric individuals^1,2,3. Microsatellite markers are now being gradually replaced by RAD-Seq data for phylogeographic inference⁴.

There are advantages and disadvantages to using microsatellites in phylogeographic studies^2,3,5. Microsatellites are a known quantity; hundreds of thousands of studies that use SSRs are in the literature—primers are already available for many groups. In addition, many user-friendly software packages are available for all aspects of microsatellite analysis, from loci development to population genetic inference³. If primers are already developed for the taxa of interest, microsatellites can be inexpensive to implement. Additionally, if initial results necessitate adding a few additional individuals and/or loci, project costs will increase linearly with microsatellites. However, there are caveats to using SSRs. Perhaps most importantly, a limited number of loci (usually < 25) can feasibly be employed in a typical microsatellite study. Also, the mutational properties of SSRs are unusually high and almost certainly do not reflect those of the genome as a whole. Thus, the property that makes microsatellites excellent for distinguishing different individuals may inflate statistics such as F _ST and heterozygosity relative to the rest of the genome. Furthermore, microsatellites can be just as expensive to implement as newer high-throughput sequencing (HTS) techniques if there are no existing genetic resources (e.g., no primers already developed, or no available transcriptomic or genomic resources)⁶.

The use of RAD-Seq data has increased greatly over the past decade, largely because thousands of loci can be generated simultaneously for hundreds of individuals for a fixed, known cost⁷. RAD-Seq uses restriction enzymes (REs) to create a reduced representation library of the genome; single-nucleotide polymorphisms (SNPs) in regions of DNA between restriction sites are used to distinguish between individuals⁸. Barcoding to allow efficient multiplexing during sequencing keeps costs down, which can be as little as $40 per individual for thousands of loci, assuming judicious sharing of reagents, and a well-designed plan for multiplexing individuals^9,10,11. Microsatellite genotyping has a similar cost per individual, assuming primers are not developed, but many fewer loci are obtained³. SNPs have several advantages over microsatellites, as they are less likely to exhibit homoplasy than SSRs¹².

Despite advantages, there are also several caveats to using RAD-Seq. Unless there is a reference genome, loci obtained using RAD-Seq are anonymous, and some loci may be non-neutral⁷. Additionally, biases may be introduced at several stages in a RAD-Seq protocol: (1) digestion with REs samples a non-random portion of the genome due to biases in base composition; this is potentially worse if methylation sensitive enzymes are used; (2) polymorphisms in restriction sites that can lead to segregating presence/absence polymorphisms that are very difficult to detect without very deep sequencing and negating the cost-savings of using RAD-Seq in the first place^7,13; (3) preferential PCR amplification of some loci during library construction necessarily reduces coverage of other loci¹³; (4) sequencing errors and/or low sequencing depth leads to incorrect genotype calling⁷; and (5) false loci are constructed due to the misassembly of paralogous reads^14,15. Many potential problems are resolved by multiple PCR steps to even out loci coverage and by improvements in software when processing loci, but concerns remain that RE-based reduced representation methods do not capture a representative snapshot of the genome¹⁶. One other concern with RAD-Seq loci is that manual data curation is impossible, and errors may go undetected even by the most careful researchers^14,17,18. Finally, the biggest potential problem when using RAD-Seq is that low coverage and high proportions of missing data can make it difficult to infer heterozygotes accurately.

Previous studies have compared results from SNPs and SSRs, revealing that microsatellites provide much more information—up to an order of magnitude more—on a per-marker basis than SNPs^19,20. However, SNP studies typically use several orders of magnitude more markers than an average SSR study. Evidence has shown that the large number of loci in SNP studies can effectively allow for more powerful inferences, even though the information at each locus is less than that in microsatellite markers²¹. Because of the low number of loci used in SSR studies, the standard practice is to aim to minimize missing data. However, the nature of current library preparation and sequencing means that higher percentages of missing data are an unavoidable part of RAD-Seq studies. Simulation studies have shown that the large amounts of missing data in RAD-Seq studies can inflate F _ST estimates due to allelic dropout^13,18. As more loci were included in these simulations, F _ST appeared to increase because many loci had genotype data for only one or a few individuals. In many such loci F _ST = 1 because by chance the few individuals sampled were homogeneous within populations but different between populations, leading to high average F _ST. Heterozygosity can be similarly inflated if the more frequent allele is likely to be absent (e.g., because mutations in the restriction site, which lead to allelic dropout, are often in ancestral alleles that occur at a high frequency)¹⁸. Arnold, et al.¹³ confirmed results from Gautier, et al.¹⁸ and also concluded that other summary statistics, including Θ and π, could be inaccurately estimated from loci with missing data. In spite of these problems, more recent simulation studies have indicated that missing data in RAD-Seq studies may not lead to incorrect inference, and in fact including loci with missing data can be advantageous for identifying shallow divergences²².

Convention in phylogeographic studies often is to require 75 or 80% of individuals to have data for a given locus—otherwise that locus is discarded from the analyses (e.g., refs^{23,24,25,26,27,28}). Presumably, requiring a locus to be present in a certain number of individuals will eliminate loci with high missing data that may be the cause of misestimated parameters^13,18. However, the choice of a cutoff is arbitrary and is typically not justified in phylogeographic studies— the number of SNPs is virtually always reported as a single fixed value (e.g., “we identified a total of 4,234 SNPs,” Jackson, et al.²⁴). In reality, the various parameter values that determine how many loci are constructed and retained in SNP alignment methods means that there is a range of loci that could conceivably be included in a study^27,29.

To date, no phylogeographic study has investigated the effect of varying amounts of missing data in an empirical RAD-Seq dataset, even those explicitly comparing RAD-Seq-generated SNPs and microsatellites^30,31. To remedy this knowledge gap, we investigate the phylogeography of red mangroves (Rhizophora mangle L., Rhizophoraceae) in Florida, using both an existing microsatellite dataset³², and new RAD-Seq SNP datasets that vary in number of loci and the percentage of missing data. We filtered RAD-Seq loci to generate a dataset that would approximate the number of loci and amount of missing data typically used in RAD-Seq phylogeography studies, and we also generated datasets with more or less stringent filtering to test the effects of increasing or decreasing the number of loci and percentage of missing data. Specifically, we address the following questions: (1) In RAD-Seq datasets, how are phylogeographic inferences affected by the number of loci used? (2) In RAD-Seq datasets, how are phylogeographic inferences affected by the percentage of missing data? (3) What are the important differences in performance between microsatellites and RAD-Seq data in population genetic and phylogeographic inference? (4) Do RAD-Seq data reveal any novel phylogeographic inferences not already recovered by microsatellites for red mangroves in Florida?

To address these questions, we used 96 red mangrove (Rhizophora mangle) individuals collected from 12 sampling locations on the coasts of Florida (Table 1, Fig. 1). Red mangroves are salt-tolerant trees that occur in coastal estuarine environments throughout the neotropics, experiencing high temperatures, frequent inundation, saline conditions, and periodic wave action associated with the coastal environment³³. Red mangroves provide a variety of ecosystem services, including filtering water, providing habitat to animals, stabilizing shorelines, and protecting coastal environments from frequent wave action and occasional storm surges. Thus, red mangroves are important conservation targets—for which phylogeographic data can improve conservation strategies—making red mangroves a valuable study system.

Table 1 The twelve sampling locations (each containing eight individuals), their codes, GPS coordinates, and the percentage of loci that have missing data for each sampling location before any filtering.

Full size table

Further analysis of phylogeographic patterns in red mangroves and other species occurring in the Florida peninsula is also warranted. Although previous studies of many coastal and marine taxa revealed a phylogeographic discontinuity at or near the southern tip of Florida^34,35, recent work on red mangroves using microsatellites failed to identify such a pattern^32,36. Different types of molecular markers could reveal new phylogeographic insights, due to broader sampling of the genome, and provide a predictive framework for understanding how genetic variation in this iconic species will respond to climate change. Finally, red mangroves are an ideal system for comparing the performance of alternative genetic markers, given previous analyses of microsatellite loci^32,36,37 and the size of the genome (approximately 1 Gb; Hodel unpublished data, based on flow cytometry observations), enabling a rigorous test of the RAD-Seq method. Genome size is a necessary consideration with RAD-Seq; as genome size increases, the number of loci shared among many individuals for a given sequencing depth decreases.

Results

Datasets

Seven datasets, ranging from 8 loci (SSR_8) to 25,198 loci (RAD_25198; Table 2), were used to investigate in depth how variation in number of loci and percent missing data impacted phylogeographic inferences. These were selected from all possible datasets, for which basic statistics were calculated (Supplementary Figure 1). The name of each of the seven datasets contains information about locus type (RAD or SSR) and number of loci in the dataset. The smallest RAD dataset contained 239 loci (RAD_239), and the percentage of missing data for RAD datasets ranged from 11.7% to 78.1% (Table 3). The dataset RAD_1180, which required a locus to be present in 75 of 96 individuals (78.1%), most closely mimicked the amount of loci filtering typically used in a phylogeographic study. Therefore, in our analyses, we used this as a baseline dataset against which to compare other RAD datasets. Across sampling locations, the proportion of missing data was relatively uniform (Table 1); percentage of missing loci in the data matrix for a given sampling location ranged from 65.8% to 88.9%.

Table 2 The seven data sets used in this study; RAD-Seq data sets were generated by filtering loci from largest data set (RAD_25198). For all data sets (six RAD and one microsatellite), the total number of loci used is indicated.

Full size table

Table 3 Relevant population genetic statistics for each of the seven data sets used in this study. For each column, warmer colors indicate lower values and cooler colors show higher values. Immediately to the right of each of the four columns (F _ST, F _IS, H _O, H _E) is the 95% confidence interval for each statistic.

Full size table

Population genetic analyses

Measures of heterozygosity were not significantly different between the microsatellite dataset and the RAD datasets; average H _O was 0.431 for the microsatellite dataset and 0.392 in RAD_1180, with a range from 0.354 to 0.477 across all RAD datasets (Table 3). Average H _E was 0.388 for the microsatellite dataset and 0.307 for RAD_1180 and ranged from 0.300 to 0.340 for all RAD-Seq datasets (Table 3). Average F _ST for microsatellites was 0.124, which was significantly greater than average F _ST for only one of the RAD datasets—the smallest (RAD_239; Table 3). Within the RAD datasets, average H _O was significantly greater in RAD_25198 than all others, and it was significantly lower in RAD_6255 than in all others; H _O did not predictably increase or decrease as the number of loci increased (Table 3). Additionally, within RAD datasets, average H _E was significantly greater in RAD_25198 than all others. F _ST ranged from 0.046 to 0.108 among the RAD datasets (Table 3). There was no significant difference in F _ST in the three smallest RAD datasets, but the three largest RAD datasets all had increased F _ST relative to the smaller datasets (Table 3). The dataset with the largest value of F _ST was RAD_6255 (Table 3). Average F _IS using microsatellites was not significantly different than F _IS calculated using RAD datasets; within RAD datasets, F _IS generally increased as more loci were added, although RAD_25198 had the lowest value of F _IS (Table 3). Many of the population genetic statistics were disproportionately affected by loci with very low or very high values of F _ST, F _IS, or heterozygosity (Fig. 2). The effect of extreme loci was particularly evident in the larger datasets (RAD_6255 and RAD_25198), in which there were large numbers of loci with extreme values (e.g., F _ST of 1.0; Fig. 2).

Pairwise F _ST

The values of pairwise F _ST for each sampling location relative to other sampling locations were remarkably consistent across datasets (Table 4). For most sampling locations pairwise F _ST estimated by SSRs was approximately twice as large as RAD dataset estimates. For every dataset, pairwise F _ST between Seahorse Key and all other sampling locations was the highest. For every RAD dataset, Islamorada had the lowest value for pairwise F _ST, but the SSR dataset identified West Palm Beach as the sampling location with the lowest estimate of pairwise F _ST. Even as the amount of missing data increased, the pairwise F _ST estimates remained consistent; RAD_25198 produced similar values to smaller RAD datasets (Table 4).

Table 4 Pairwise F _ST for each sampling location (i.e., one sampling location versus all others) for each of the seven datasets. Within each data set, lower (warmer colors) and higher (cooler colors) values of F _ST are shown using color-coding.

Full size table

F _IS by sampling location

Cape Canaveral was the location with the highest F _IS using the microsatellite data (SSR_8), followed by Key Largo and Seahorse Key (Table 5). Meanwhile, for all RAD datasets, Seahorse Key had one of the lowest (i.e., most negative) F _IS values among all populations. Within the RAD datasets, the number of loci and/or amount of missing data affected F _IS. For example, in Key Largo, the largest dataset yielded a value of 0.015, while the smallest dataset had a value of −0.194. This was not a large absolute change in F _IS, but the interpretation of this statistic changed based on whether it is positive or negative (higher values indicate a greater level of inbreeding). In general, within RAD datasets, F _IS increased as loci were added, although this trend was not universal, especially in the largest RAD dataset. For instance, in Bahia Honda Key, F _IS was lowest in the largest dataset RAD_25198 (25,198 loci, 78.1% missing data). Conversely, in Islamorada, F _IS was lowest in the smallest dataset (RAD_239, 11.7% missing data).

Table 5 The variation in average inbreeding coefficient (F _IS) among data sets and populations. Within each data set, lower (warmer colors) and higher (cooler colors) values of F _IS are shown using color-coding. The average value of F _IS across all data sets for each population is shown in the last column of the table.

Full size table

Heterozygosity by sampling location

Observed heterozygosity for each sampling location ranged from 0.320 (Seahorse Key) to 0.451 (Hollywood) when averaged across all datasets (Table 6). For most datasets, Seahorse Key was the sampling location with the lowest H _O, although notably RAD_25198 identified six other sampling locations with lower H _O than Seahorse Key (Table 6). Similarly, most datasets reported Hollywood as the sampling location with the highest H _O, but SSR_8 found Convoy Point and Islamorada had higher H _O than Hollywood, and RAD_25198 identified five other sampling locations with greater H _O, with Key Largo having the highest H _O (Table 6). For most sampling locations, measures of H _O, when ranked relative to other sampling locations, remained similar across all RAD datasets except RAD_25198. Interestingly, the values of H _O ranked relative to other sampling locations were more similar between SSR_8 and the five smallest RAD datasets (RAD_239-RAD_6255) than any of the five smallest RAD datasets were to RAD_25198 (Table 6).

Table 6 The variation in observed heterozygosity (H _O) among data sets and populations. Within each data set, lower (warmer colors) and higher (cooler colors) values of H _O are shown using color-coding. The average value of H _O across all data sets for each population is shown on the bottom row of the table.

Full size table

PCA and SVDQuartets

The PCA analysis revealed that microsatellite data did not identify clear groupings of individuals based on sampling location or other geographical divisions (Fig. 3). Similarly, RAD_239 did not differentiate the samples into discrete clusters. However, RAD_1180, RAD_2317, RAD_3831, RAD_6255, and RAD_25198 all divided the samples into two groups with minimal overlap in the PCA visualization: one group was Gulf Coast samples, and the other group was Atlantic Coast samples (Fig. 3). Closer inspection of the PCAs revealed that most of the Cape Canaveral individuals formed a discrete cluster intermediate between the two other clusters (Gulf and Atlantic). Most RAD datasets had sufficient resolution to place Cape Canaveral between the Gulf and Atlantic clusters, but the use of a small number of loci (i.e., RAD_239) was unable to show this relationship. Furthermore, the two largest datasets, RAD_6255 and RAD_25198, showed Cape Canaveral individuals clustering more closely to the Atlantic than the Gulf cluster.

The 50% majority-rule consensus bootstrap trees generated with SVDQuartets showed substantial variation between datasets when inferring genealogical relationships between individuals and/or sampling locations (Fig. 4). In many cases, dataset RAD_239 did not identify genealogical relationships that were recovered with other datasets with more loci. However, certain key relationships among individuals were consistently shown in multiple datasets with thousands of loci. In every dataset except RAD_239 (i.e., every dataset with at least 1180 loci), all Seahorse Key (ShKFl) samples formed a clade (Fig. 4). In four datasets (RAD_2317, RAD_3831, RAD_6255, RAD_25198), all Gulf Coast (NPRFl, ShKFl, TCBFl) samples (except one individual: NPRFl_R8), together with all Cape Canaveral (CpCFl) samples, formed a clade that is sister to all remaining Atlantic Coast samples plus NPRFl_R8. Interestingly, this relationship was not recovered in RAD_1180, the dataset with ‘ideal’ filtering of loci—but all datasets with more loci (and therefore more missing data) did recover the relationship. Each RAD dataset had nodes with varying levels of bootstrap support (Fig. 4). Datasets with fewer loci showed few nodes with bootstrap support >70%; RAD_239 had three such nodes. More loci resulted in more nodes with bootstrap support >70%, up to a point: RAD_1180 had six highly supported nodes, RAD_2317 had eight, and RAD_3831 had the most with nine. Then the number of highly supported nodes slightly declined as more loci were added: RAD_6255 and RAD_25198 each had eight nodes with bootstrap support >70% (Fig. 4).

Sampling loci

Analyzing differently sized samples of loci from RAD_25198 and SSR_8 provided several crucial insights. A microsatellite dataset with seven loci sampled from SSR_8 performed better in estimating F _ST than a dataset with six loci, although in each case, all 100 sampled replicates fell within the 95% confidence interval of F _ST for the complete SSR_8 dataset (Fig. 5). For all RAD datasets, the value of F _ST estimated using only originally filtered data is different from all 100 permuted values of F _ST calculated from an equivalent number of loci sampled from the largest dataset (RAD_25198). For almost all datasets, F _ST based on sampled loci was less than F _ST using original loci, except for one dataset (RAD_6255), F _ST based on the sampled loci was greater. Strikingly, in none of the datasets did the confidence intervals from the sampled loci overlap with the confidence intervals of the estimated F _ST values from the original data (Fig. 5). The percentage of missing data in the largest dataset clearly had an immense impact. Even when very few loci (e.g., 239 loci) were sampled from the largest dataset, the distribution of F _ST values clustered around the estimated F _ST using all 25,198 loci (Fig. 5), indicating that missing data, not number of loci, affected the differences in measured F _ST.

Discussion

Insights regarding choice of loci

Our results indicate that filtering loci using the standard cutoff (i.e., 75–80% of individuals must possess data for a given locus for that locus to be retained) should not be the gold standard in RAD-Seq studies—it is possible to retain many more loci without inflated statistics^{23,24,25,26,27,28}. F _ST increased as missing data increased, as predicted by simulation studies, but the relationship is more nuanced than previously assumed. F _ST increases as the percentage of missing data increases—up to a point—and then decreases from RAD_6255 to RAD_25198, as the percentage of missing data nearly doubles, from 41.3% to 78.1% (Table 3). When more loci are included, the distribution of F _ST across the genome is more completely sampled. However, adding loci with more missing data may cause analyses to miss low-frequency alleles in the loci with extensive missing data, which would add error to average estimates of F _ST. Sampling analyses confirmed that F _ST generally increased as missing data increased (Fig. 5). Heterozygosity was less affected by missing data, as there was little or no change in either observed or expected heterozygosity when the percentage of missing data ranged between 11.7% (RAD_239) and 41.3% (RAD_6255). Only the largest dataset (RAD_25198, with 78.1% missing data) showed significantly higher heterozygosity than other datasets. Some simulation studies reported that missing data could inflate F _ST, and would likely inflate estimates of heterozygosity, leading to calls for removing loci with incomplete sampling¹³. However, more recent simulation studies showed that removing loci with higher mutation rates, which are more likely to have missing data, negatively impacted phylogenetic analyses²². Our study shows the importance of thoroughly exploring how loci are filtered in empirical systems. Extreme amounts of missing data yield higher estimates of F _ST and heterozygosity and lower estimates of F _IS (Table 3). A large number of loci in RAD_25198 had very high values of certain statistics (e.g., hundreds of loci with F _ST > 0.975 and thousands of loci with H _O > 0.975), which severely impacted average estimates of these statistics (Fig. 2). Notably, not all datasets have these extreme loci—dataset RAD_3831, which only requires 52.1% of individuals to have data for a given locus, and had 29.4% missing data, did not suffer from extreme loci, despite liberal filtering.

Although missing data caused some statistics to increase, it did not dramatically affect our conclusions. For many analyses, using datasets other than RAD_1180, especially RAD_2317 and RAD_3831, did not change the interpretation of the results. Regardless of which of the three datasets was used, F _ST was relatively low—between 0.057 and 0.066. Importantly, nearly doubling the amount of missing data from 17.1% (RAD_1180, the ‘gold standard’) to 29.4% (RAD_3831) resulted in a very small increase in F _ST and no significant change in other statistics (F _IS, H _O, H _E; Table 3). Furthermore, using very few loci (RAD_239) did not significantly change any of the statistics estimated using RAD_1180 (Table 3). Our data indicate that the often-used cutoff of 75–80% individuals with locus data is arbitrary, and different cutoffs should be considered and evaluated on a case-by-case basis to ensure an appropriate number of loci are used. The results suggest that in many cases, only minimal filtering of loci is needed, and many more loci can be retained than typically are. Researchers who wish to maximize the number of loci in their study could likely use very low cutoffs (e.g., require a locus to have data for >10% of individuals).

Typically, microsatellite datasets have lower F _ST values relative to SNPs due to a larger number of alleles, although simulation studies have shown evidence that F _ST can be elevated up to an order of magnitude in microsatellite datasets due to factors such as correlated allele frequencies³⁸. Average F _ST ranged from 0.046 to 0.124 across all datasets—so there is not high differentiation detected in any dataset (Table 3). When using any RAD dataset except RAD_239, F _ST calculated using RAD loci was statistically indistinguishable from the microsatellite dataset (Table 3). In theory, a three-to-four-fold change in F _ST could alter biological conclusions—possibly with deleterious results (e.g., identifying populations or management units that would be prioritized for conservation)—but no matter how the loci were filtered, there was a relatively small range of estimated F _ST. RAD-Seq studies where larger values of F _ST were detected could exhibit larger absolute changes in F _ST when using different loci filtering cutoffs (e.g., refs^39,40).

Similarly, the interpretation of F _IS and H _O could impact how data are considered in a biological context (e.g., identifying locations at risk for inbreeding depression). Positive values of F _IS and/or low values of H _O often indicate inbreeding, which means sampling locations are more vulnerable than other sampling locations. As noted earlier, SSR_8 identified five sampling locations with positive F _IS values (Table 5). Only one of these sampling locations (Key Largo) also had positive F _IS values in any of the RAD datasets. Clearly, marker choice (microsatellite versus RAD-Seq) affected the assessment of which populations are more vulnerable based on F _IS values. Agreement between these two markers types would facilitate identifying sampling locations vulnerable to inbreeding depression. However, it is understandable that different markers would lead to different results, as mutation rate can affect estimation of F _IS—in theory, as the mutation rate increases, F _IS should decrease. The data showed opposite result though, as F _IS was higher in the SSR_8 dataset for most sampling locations (Table 5). This is likely because estimates of H _O and H _E have larger variance when few loci are used, as in the SSR_8 dataset (Table 3). The relative values of H _O and H _E can dramatically affect the interpretation of F_IS, especially when H _O and H _E are similar (e.g., using the equation F _IS = (H _E − H _O)/H _E would yield F _IS of 0.25 when H _O = 0.4 and H _E = 0.5, but F _IS = −0.25 if H _O and H _E are reversed). Within RAD datasets, estimates of F _IS for each sampling location were fairly consistent and F _IS increased as missing data increased, but this trend was not universal (Table 3). Identifying vulnerable sampling locations based on H _O revealed that RAD_25198 led to different conclusions than most other datasets. Across datasets, measures of H _O within each sampling location were consistent relative to other sampling locations, except for RAD_25198 (Table 6). Missing data impacted this analysis; a large number of loci in RAD_25198 had either very high or very low H _O, possibly leading to the pattern of H _O in RAD_25198 that contrasted with patterns in virtually every other dataset (Table 6, Fig. 2).

The PCA results show that as the number of loci increases, the definition of clusters improves, plateauing with RAD_2317 or RAD_3831. The clustering is similar in all RAD datasets with 1,180 or more loci, with Cape Canaveral individuals falling between the Gulf and Atlantic clusters. As more loci are added, the Cape Canaveral samples appear to be closer to the Atlantic cluster, especially in datasets RAD_6255 and RAD_25198 (Fig. 3). Taking into account the SVDQuartets results clarifies the clustering—all Cape Canaveral samples form a clade with all Gulf samples except one. However, this relationship is only present in datasets with 2,317 loci or greater—the putatively ‘gold standard’ dataset RAD_1180 does not show this relationship.

Phylogeographic patterns in red mangroves

Based on previous studies using microsatellite data^32,36, the relationship of Cape Canaveral samples to other sampling locations, as found here with both PCA and SVDQuartet analyses of RAD-Seq data, was surprising—previous studies did not find that any of the individuals in the Cape Canaveral population clustered with any of the Gulf samples. These new data could indicate an Atlantic-Gulf phylogeographic discontinuity, and that Cape Canaveral is an anomaly due to a lack of phylogeographic resolution, recent population founding, or human-mediated transplantation. The intermediate placement of Cape Canaveral in many of the PCAs suggests that it may actually cluster with the Atlantic samples, especially when considering datasets RAD_6255 and RAD_25198, indicating a phylogeographic break (Fig. 3). However, the SVDQuartets results place Cape Canaveral in a clade with the vast majority of Gulf samples, although this relationship is not highly supported in any datasets (i.e., bootstrap support is not >70% for this clade in any dataset) (Fig. 4). Assuming that Cape Canaveral is more closely related to Gulf samples, the age of the divergence between the two clades (Atlantic, Gulf + CpCFl) comes into question. Northern Florida represents the northern limit of the range of red mangroves³³. Typically, populations of these trees in northern Florida are periodically extirpated due to freezing events, and these areas are re-colonized. The lower values of H _O in northern populations (CpCFl, MlbFl, ShKFl, NPRFl, TCBFl) relative to southern populations indicate that these populations were likely founded more recently from a small number of propagules. The Cape Canaveral population was likely founded by individuals from the Gulf Coast, suggesting that the divergence between the two clades (Atlantic, Gulf + CpCFl) is very recent.

Previous research indicates that gene flow is greater from the Gulf Coast to the Atlantic Coast in red mangroves; there may be ongoing gene flow from the Gulf to Cape Canaveral³². Alternatively, alleles from the Gulf Coast could have migrated into an existing Cape Canaveral population and proliferated due to other processes (e.g., drift). Another explanation for the sister relationship between the Gulf samples and Cape Canaveral is human-mediated transplantation of propagules or seedlings from the Gulf Coast to Cape Canaveral. However, all available publications and information from land managers who replied to requests for information confirm that any restoration that required importation of propagules used either local propagules or seedlings from the southern Atlantic Coast (ref.⁴¹, personal communication with Rangers from Cape Canaveral National Seashore). Another possible explanation for this result is that red mangrove propagules were accidentally transported from the Gulf Coast of Florida to Cape Canaveral during construction of the Kennedy Space Center in the 1960s. Construction of the Space Center was a massive project. It is noteworthy that nearly 100,000 tons of steel was transported from the Gulf Coast to Cape Canaveral in numerous trucks; the transport of a few mangrove propagules during this process could easily have established a Gulf genotype in the Cape Canaveral area⁴². We conclude that, in contrast to microsatellites, RAD datasets recover a relationship between the Gulf Coast and the Atlantic Coast (excluding Cape Canaveral) that supports the presence of a maritime discontinuity in red mangroves. However, as red mangroves can disperse long distances, a population or populations that recently established in Cape Canaveral likely had a founder or founders that were predominantly of Gulf Coast origin. The fact that previous studies using SSRs did not elucidate this relationship is not surprising—both the PCA analysis and SVDQuartets analysis indicate that 1180 loci were barely sufficient to infer the placement of Cape Canaveral—datasets with many more loci were needed (Figs 3 and 4). The large number of loci required to resolve such relationships highlights why liberal filtering of RAD-Seq loci is advisable.

Conclusions

We cannot overemphasize the importance of thoroughly exploring RAD-Seq datasets when performing phylogeographic analyses—it is too easy to jump to conclusions when only using one arbitrary cutoff to filter loci. Our empirical data confirm that estimates of F _ST and/or heterozygosity may become inflated as missing data increase. However, this does not happen as quickly as implied in simulation studies as loci with missing data are added—liberal filtering of loci retains loci valuable for phylogeographic or phylogenetic inference, without inflating population genetic statistics. Thus, regardless of the cutoff value used to filter loci, researchers should investigate several other cutoffs with both increased and decreased amounts of missing data to appreciate fully the impact of missing data on parameters in their study. We found no evidence that the 75% or 80% cutoff commonly employed was optimal. In many analyses, other datasets with cutoffs ranging from 31.3% to 67.7% performed just as well as or better than RAD_1180. Many RAD-Seq studies aim to multiplex as many individuals as possible in a HTS run; our results show that retaining loci with more missing data is feasible and advantageous in empirical studies, and that researchers can include more samples in a single sequencing run. Our study confirmed that microsatellites were a valuable tool for inexpensively estimating population genetic statistics, such as F _ST, F _IS, and heterozygosity. However, this study revealed that the thousands of additional loci from across the genome provided by RAD-Seq increased phylogeographic resolution. We found that red mangroves likely have a phylogeographic discontinuity at the southern tip of Florida that was not detected in previous studies using SSRs^32,36 and that a single population from the Atlantic coast of Florida arose via recent colonization by propagules (either natural or human-mediated) from the Gulf coast.

Methods and Materials

Sample collection, DNA isolation

We collected leaf tissue from plants of R. mangle from 12 locations in Florida (Fig. 1). At each location, we collected one leaf from 10–20 individuals that were spaced at least 15 m apart to minimize collecting closely related individuals. For each sampling location, we randomly selected 8 individuals to use in genetic analyses. GPS coordinates for each sampling location were recorded (Table 1). Each sampled leaf was placed in a labeled bag with silica gel and stored for 1–12 months; we then extracted DNA from the dried leaf tissue using a standard CTAB protocol⁴³.

Microsatellite amplification and analysis

We PCR-amplified eight nuclear microsatellite loci for R. mangle (RM 11, 19, 21, 36, 38, 41, 46, 47)³⁷. An M13 protocol⁴⁴ was used to label amplicons with four fluorescent dyes (6-FAM, NED, PET, VIC). The PCR (25-μL reactions) contained: 5X buffer (5 μL), 2.5 mM MgCl₂ (2 μL), 2.5 mM dNTP (0.5 μL), 0.12 μM forward primer with M13 label attached (1.25 μL), 4.5 μM reverse primer (1.25 μL), 4.5 μM fluorescent dye (2.5 μL), H₂O (10 μL), Taq polymerase (0.5 μL), and 50 ng template DNA (2 μL). We carried out PCR in a Biometra T3 Thermocycler (Whatman Biometra, Goettingen, Germany) using the following cycles: initial denaturing at 94 °C for 3 minutes; 35 cycles of 94 °C (45 seconds), 52 °C (45 seconds), 72 °C (75 seconds); final elongation at 72 °C for 20 minutes. We used the Applied Biosystems 3730 DNA Analyzer (Applied Biosystems, Foster City, United States) at the University of Florida Interdisciplinary Center for Biotechnology Research to detect the fluorescent peaks. We determined microsatellite peaks in Geneious 6.5 (http://www.geneious.com/) using the GeneScan 600 size standard ladder for calibration⁴⁵.

RAD-Seq library preparation and data processing

We followed the double-digest RAD-Seq protocol developed by Peterson, et al.⁴⁶. For each sample, we constructed 96 DNA libraries by digesting approximately 200 ng genomic DNA with EcoRI and MseI. We then ligated Illumina adapters and unique 8–10-nucleotide barcodes to the DNA fragments. The DNA libraries were PCR-amplified in two separate reactions and pooled to minimize early PCR bias. We size selected 250–450-bp fragments using gel electrophoresis and sequenced the DNA fragments using the 1 × 100-bp setting on the Illumina HiSeq. 2500 platform. Raw sequence data were deposited in the NCBI Sequence Read Archive (accession numbers pending). We processed the raw Illumina reads using the FAST-X toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) to filter sequences; we required 95% of bases to be above a quality score of 30 to retain a read. We then converted the sequences from FASTQ to FASTA, demultiplexed the reads, sorted them by barcodes, and trimmed the sequences by removing the final 2 bases to ensure that we were using only high-quality sequence data. We assembled the sequences into loci using the STACKS 1.24 pipeline⁴⁷ with the following parameter settings: -n 3 -m 3 -M 2 (parameters were optimized following Mastretta-Yanes, et al.²⁹); all other parameters were left as the default. We selected seven datasets (one microsatellite and six RAD-Seq) and used a variety of analyses to compare the results produced by each dataset (Table 2). We used the ‘populations’ program in STACKS to produce an unfiltered dataset of RAD-Seq loci using the ‘write single SNP’ command and requiring a minor allele frequency >0.05. We then removed human, fungal, and microbial contamination from the loci and filtered loci by representation across individuals using an R script to create five smaller datasets (Data_aquisition.R; this script and all other scripts are available at https://github.com/richiehodel/red_mangrove_RAD_SSR). Filtered datasets were required to have locus data for a certain number of individuals for the given locus to be retained in the analysis; the number of individuals could range from 1–96 (Supplemental Fig. 1). The datasets were chosen such that they encompassed a wide range of loci and missing data.

Population genetic analyses

We used an R script (Basic_stats.R) and the R package ‘hierfstat’⁴⁸ to calculate average F _ST, the inbreeding coefficient F _IS, H _O, and H _E for each of the seven datasets. To investigate how the number of loci affected comparisons of population genetic statistics among populations, we calculated pairwise F _ST (one sampling location versus all others combined) for each sampling location for each dataset using GenoDive⁴⁹ and an R script (Pairwise_Fst.R). Additionally, we calculated F _IS and H _O for each sampling location for each dataset to determine how measures that often inform conservation practices might be affected by the number of loci and amount of missing data. We measured how missing data were partitioned across sampling locations to verify that there were not any sampling locations with unusually high or low amounts of missing data (Table 1). Additionally, we investigated how several population genetic statistics were distributed across loci in each of the datasets (Stat_Distribution.R; Fig. 2).

Principal components and SVDQuartets

We used a principal component analysis (PCA) implemented in the R package ‘SNPRelate’⁵⁰ to identify clusters of individuals in the RAD data with an R script (VCF_PCA.R) and GenoDive to run a PCA on the microsatellite data. After visualizing the initial results, we tested several ways of grouping sampling locations together based on geography. We used SVDQuartets⁵¹ to determine genealogical relationships among individuals. This program selects the optimal topology for a quartet of taxa, and, after sampling millions of quartets, infers a phylogeny for all individuals based on choosing the quartets with the best scores and assembling them into a phylogenetic tree. We used an R script (Nexus_creation.R) to convert the output from the ‘populations’ program in STACKS into nexus files that could be read for the SVDQuartets analysis. For each RAD dataset, we evaluated all possible quartets and selected trees under the multispecies coalescent using QFM (Quartet Fiduccia Mattheyses) quartet assembly⁵². We used non-parametric bootstrapping (100 replicates for each dataset) to assess confidence in inferred genealogical relationships between individuals. The R script Tree_formatting.R was used to visualize and annotate the 50% majority-rule trees from SVDQuartets using the R packages ‘ape’⁵³ and ‘ggtree’⁵⁴.

Sampling loci

To test whether the number of loci or percentage of missing data for the loci used is the more important factor impacting measures of fixation, population differentiation, and heterozygosity, we randomly sampled from RAD_25198 (the RAD-Seq dataset comprising 25,198 loci) the equivalent number of loci contained in RAD_239, RAD_1180, RAD_2317, RAD_3831, and RAD_6255, respectively, and used these five sets of sampled loci in analyses. We used an R script (Subsample.R) to randomly sample loci without replacement from RAD_25198 and repeated the sampling 100 times for each dataset. We compared measures of F _ST calculated using the original datasets with results calculated using the sampled loci from RAD_25198 (Fig. 5). We used bootstrapping to calculate 95% confidence intervals around F _ST for the original datasets and for the sets of loci sampled from RAD_25198 (Fig. 5).

Data availability

The datasets generated during the current study are available in the NCBI Genbank repository, https://www.ncbi.nlm.nih.gov/bioproject/PRJNA397667 (accession numbers SRR5918296-SRR5918355).

References

Guichoux, E. et al. Current trends in microsatellite genotyping. Mol. Ecol. Resour. 11, 591–611 (2011).
Article CAS PubMed Google Scholar
Kalia, R. K., Rai, M. K., Kalia, S., Singh, R. & Dhawan, A. K. Microsatellite markers: an overview of the recent progress in plants. Euphytica 177, 309–334 (2010).
Article Google Scholar
Hodel, R. G. J. et al. The report of my death was an exaggeration: A review for researchers using microsatellites in the 21st century. Appl. Plant Sci. 4, (2016).
Seeb, J. E. et al. Single-nucleotide polymorphism (SNP) discovery and applications of SNP genotyping in nonmodel organisms. Mol. Ecol. Resour. 11, 1–8 (2011).
Article PubMed Google Scholar
Gardner, M. G., Fitch, A. J., Bertozzi, T. & Lowe, A. J. Rise of the machines–recommendations for ecologists when using next generation sequencing for microsatellite development. Mol. Ecol. Resour. 11, 1093–101 (2011).
Article PubMed Google Scholar
Hodel, R. G. J. et al. A new resource for the development of SSR markers: Millions of loci from a thousand plant transcriptomes. Appl. Plant Sci. 4, (2016).
Andrews, K. R., Good, J. M., Miller, M. R., Luikart, G. & Hohenlohe, P. A. Harnessing the power of RADseq for ecological and evolutionary genomics. Nat. Rev. Genet. 17, 81–92 (2016).
Article CAS PubMed PubMed Central Google Scholar
Baird, N. A. et al. Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS One 3, e3376 (2008).
Article ADS PubMed PubMed Central Google Scholar
Smith, A. M. et al. Highly-multiplexed barcode sequencing: an efficient method for parallel analysis of pooled samples. Nucleic Acids Res. 38, e142–e142 (2010).
Article ADS PubMed PubMed Central Google Scholar
Andolfatto, P. et al. Multiplexed shotgun genotyping for rapid and efficient genetic mapping. Genome Res. 21, 610–617 (2011).
Article CAS PubMed PubMed Central Google Scholar
Sonah, H. et al. An improved genotyping by sequencing (GBS) approach offering increased versatility and efficiency of SNP discovery and genotyping. PLoS One 8, e54603 (2013).
Article ADS CAS PubMed PubMed Central Google Scholar
Rafalski, A. Applications of single nucleotide polymorphisms in crop genetics. Curr. Opin. Plant Biol. 5, 94–100 (2002).
Article CAS PubMed Google Scholar
Arnold, B., Corbett-Detig, R. B., Hartl, D. & Bomblies, K. RADseq underestimates diversity and introduces genealogical biases due to nonrandom haplotype sampling. Mol. Ecol. 22, 3179–3190 (2013).
Article CAS PubMed Google Scholar
Etter, P. D., Bassham, S., Hohenlohe, P. A., Johnson, E. A. & Cresko, W. A. SNP discovery and genotyping for evolutionary genetics using RAD sequencing in Methods in Molecular Biology 157–178 (Clifton, 2012).
Xu, P. et al. Population genomic analyses from low-coverage RAD-Seq data: a case study on the non-model cucurbit bottle gourd. Plant J. 77, 430–442 (2014).
Article CAS PubMed Google Scholar
Lowry, D. B. et al. Breaking RAD: an evaluation of the utility of restriction site-associated DNA sequencing for genome scans of adaptation. Mol. Ecol. Resour. 17, 142–152 (2017).
Article CAS PubMed Google Scholar
Davey, J. W. et al. Special features of RAD sequencing data: implications for genotyping. Mol. Ecol. 22, 3151–64 (2013).
Article CAS PubMed Google Scholar
Gautier, M. et al. The effect of RAD allele dropout on the estimation of genetic variation within and between populations. Mol. Ecol. 22, 3165–78 (2013).
Article CAS PubMed Google Scholar
Liu, N., Chen, L., Wang, S., Oh, C. & Zhao, H. Comparison of single-nucleotide polymorphisms and microsatellites in inference of population structure. BMC Genet. 6(Suppl 1), S26 (2005).
Article PubMed PubMed Central Google Scholar
Coates, B. S. et al. Comparative performance of single nucleotide polymorphism and microsatellite markers for population genetic analysis. J. Hered. 100, 556–564 (2009).
Article CAS PubMed Google Scholar
Schopen, G. C. B., Bovenhuis, H., Visker, M. H. P. W. & van Arendonk, J. A. M. Comparison of information content for microsatellites and SNPs in poultry and cattle. Anim. Genet. 39, 451–453 (2008).
Article CAS PubMed Google Scholar
Huang, H. & Knowles, L. L. Unforeseen consequences of excluding missing data from Next-Generation Sequences: Simulation study of RAD sequences. Syst. Biol. 65, 357–65 (2016).
Article PubMed Google Scholar
Catchen, J. et al. The population structure and recent colonization history of Oregon threespine stickleback determined using restriction-site associated DNA-sequencing. Mol. Ecol. 22, 2864–2883 (2013).
Article CAS PubMed PubMed Central Google Scholar
Jackson, A. M. et al. Population structure and phylogeography in Nassau grouper (Epinephelus striatus), a mass-aggregating marine fish. PLoS One 9, e97508 (2014).
Article ADS PubMed PubMed Central Google Scholar
Bernardi, G., Azzurro, E., Golani, D. & Miller, M. R. Genomic signatures of rapid adaptive evolution in the bluespotted cornetfish, a Mediterranean Lessepsian invader. Mol. Ecol. 25, 3384–3396 (2016).
Article CAS PubMed Google Scholar
Blanco-Bercial, L. & Bucklin, A. New view of population genetics of zooplankton: RAD-seq analysis reveals population structure of the North Atlantic planktonic copepod Centropages typicus. Mol. Ecol. 25, 1566–1580 (2016).
Article CAS PubMed Google Scholar
Rodríguez-Ezpeleta, N. et al. Population structure of Atlantic mackerel inferred from RAD-seq-derived SNP markers: effects of sequence clustering parameters and hierarchical SNP selection. Mol. Ecol. Resour. 16, 991–1001 (2016).
Article PubMed Google Scholar
Van Wyngaarden, M. et al. Identifying patterns of dispersal, connectivity and selection in the sea scallop, Placopecten magellanicus, using RADseq-derived SNPs. Evol. Appl. 10, 102–117 (2017).
Article PubMed Google Scholar
Mastretta-Yanes, A. et al. Restriction site-associated DNA sequencing, genotyping error estimation and de novo assembly optimization for population genetic inference. Mol. Ecol. Resour. 15, 28–41 (2015).
Article CAS PubMed Google Scholar
Bradbury, I. R. et al. Transatlantic secondary contact in Atlantic Salmon, comparing microsatellites, a single nucleotide polymorphism array and restriction-site associated DNA sequencing for the resolution of complex spatial structure. Mol. Ecol. 24, 5130–5144 (2015).
Article CAS PubMed Google Scholar
Jeffries, D. L. et al. Comparing RADseq and microsatellites to infer complex phylogeographic patterns, an empirical perspective in the Crucian carp, Carassius carassius, L. Mol. Ecol. 25, 2997–3018 (2016).
Article PubMed Google Scholar
Hodel, R. G. J., Cortez, M. B., de, S., Soltis, P. S. & Soltis, D. E. Comparative phylogeography of black mangroves (Avicennia germinans) and red mangroves (Rhizophora mangle) in Florida: Testing the maritime discontinuity in coastal plants. Am. J. Bot. 103, 730–739 (2016).
Article CAS PubMed Google Scholar
Tomlinson, P. B. The Botany of Mangroves (Cambridge University Press, 2016).
Avise, J. C. Phylogeography: The History and Formation of Species (Harvard University Press, 2000).
Soltis, D., Morris, A., McLachlan, J. S., Manos, P. S. & Soltis, P. S. Comparative phylogeography of unglaciated eastern North America. Mol. Ecol. 15, 4261–4293 (2006).
Article PubMed Google Scholar
Kennedy, J. P. et al. Contrasting genetic effects of red mangrove (Rhizophora mangle L.) range expansion along West and East Florida. J. Biogeogr. 44, 335–347 (2017).
Article Google Scholar
Fu, R., Dey, D. K. & Holsinger, K. E. Bayesian models for the analysis of genetic structure when populations are correlated. Bioinformatics 21, 1516–1529 (2005).
Article CAS PubMed Google Scholar
Rosero-Galindo, C., Gaitan-Solis, E., Cárdenas-Henao, H., Tohme, J. & Toro-Perea, N. Polymorphic microsatellites in a mangrove species, Rhizophora mangle L.(Rhizophoraceae). Mol. Ecol. Notes 2, 281–283 (2002).
Article CAS Google Scholar
Kang, J., Ma, X. & He, S. Population genetics analysis of the Nujiang catfish Creteuchiloglanis macropterus through a genome-wide single nucleotide polymorphisms resource generated by RAD-Seq. Sci. Rep. 7, 2813 (2017).
Article ADS PubMed PubMed Central Google Scholar
Manthey, J. D., Geiger, M. & Moyle, R. G. Relationships of morphological groups in the northern flicker superspecies complex (Colaptes auratus & C. chrysoides). Syst. Biodivers. 15, 183–191 (2017).
Article Google Scholar
Johnson, L.K. & Herren, L.W. Re-establishment of fringing mangrove habitat in the Indian River Lagoon 19–22 (Florida Department of Environmental Protection, 2008).
NASA Public Affairs. The Kennedy Space Center Story (Graphic House, 1991).
Doyle, J. J. & Doyle, J. L. A rapid DNA isolation procedure for small quantities of fresh leaf tissue. Phytochem. Bull. 19, 11–15 (1987).
Google Scholar
Schuelke, M. An economic method for the fluorescent labeling of PCR fragments. Nat. Biotechnol. 18, 233–234 (2000).
Article CAS PubMed Google Scholar
Kearse, M. et al. Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics 28, 1647–9 (2012).
Article PubMed PubMed Central Google Scholar
Peterson, B. K., Weber, J. N., Kay, E. H., Fisher, H. S. & Hoekstra, H. E. Double Digest RADseq: An inexpensive method for de novo SNP discovery and genotyping in model and non-model species. PLoS One 7, (2012).
Catchen, J., Hohenlohe, P. A., Bassham, S., Amores, A. & Cresko, W. A. Stacks: an analysis tool set for population genomics. Mol. Ecol. 22, 3124–40 (2013).
Article PubMed PubMed Central Google Scholar
Goudet, J. hierfstat, a package for R to compute and test hierarchical F-statistics. Mol. Ecol. Notes 5, 184–186 (2005).
Article Google Scholar
Meirmans, P. G. & Van Tienderen, P. H. Genotype and Genodive: two programs for the analysis of genetic diversity of asexual organisms. Mol. Ecol. Notes 4, 792–794 (2004).
Article Google Scholar
Zheng, X. et al. A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics 28, 3326–3328 (2012).
Article CAS PubMed PubMed Central Google Scholar
Chifman, J. & Kubatko, L. Quartet inference from SNP data under the coalescent model. Bioinformatics 30, 3317–3324 (2014).
Article CAS PubMed PubMed Central Google Scholar
Reaz, R. et al. Accurate phylogenetic tree reconstruction from quartets: A heuristic approach. PLoS One 9, e104008 (2014).
Article ADS PubMed PubMed Central Google Scholar
Paradis, E., Claude, J. & Strimmer, K. APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20, (289–290 (2004).
Google Scholar
Yu, G., Smith, D. K., Zhu, H., Guan, Y. & Lam, T. T. Y. ggtree: an R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution 8, 28–36 (2017).
Article Google Scholar

Download references

Acknowledgements

We thank the following funding sources: Botanical Society of America, Florida SeaGrant, Sigma Xi, and the University of Florida Department of Biology. Publication of this article was funded in part by the University of Florida Open Access Publishing Fund.

Author information

Authors and Affiliations

Department of Biology University of Florida, Gainesville, FL, 32611, USA
Richard G. J. Hodel, Adam C. Payton, Stuart F. McDaniel & Douglas E. Soltis
Florida Museum of Natural History University of Florida, Gainesville, FL, 32611, USA
Richard G. J. Hodel, Shichao Chen, Stuart F. McDaniel, Pamela Soltis & Douglas E. Soltis
College of Life Sciences and Technology Tongji University, Shanghai, 200092, China
Shichao Chen
The Genetics Institute University of Florida, Gainesville, FL, 32610, USA
Stuart F. McDaniel, Pamela Soltis & Douglas E. Soltis

Authors

Richard G. J. Hodel
View author publications
You can also search for this author in PubMed Google Scholar
Shichao Chen
View author publications
You can also search for this author in PubMed Google Scholar
Adam C. Payton
View author publications
You can also search for this author in PubMed Google Scholar
Stuart F. McDaniel
View author publications
You can also search for this author in PubMed Google Scholar
Pamela Soltis
View author publications
You can also search for this author in PubMed Google Scholar
Douglas E. Soltis
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

R.G.J.H., S.C. and A.C.P. collected the data, R.G.J.H., S.F.M., P.S.S. and D.E.S. contributed financially to the data collection, all authors designed analyses, R.G.J.H. wrote the manuscript, and all authors reviewed and edited the manuscript.

Corresponding author

Correspondence to Richard G. J. Hodel.

Ethics declarations

Competing Interests

The authors declare that they have no competing interests.

Additional information

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Supplementary Figure 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Hodel, R.G.J., Chen, S., Payton, A.C. et al. Adding loci improves phylogeographic resolution in red mangroves despite increased missing data: comparing microsatellites and RAD-Seq and investigating loci filtering. Sci Rep 7, 17598 (2017). https://doi.org/10.1038/s41598-017-16810-7

Download citation

Received: 14 July 2017
Accepted: 15 November 2017
Published: 14 December 2017
DOI: https://doi.org/10.1038/s41598-017-16810-7

This article is cited by

Fine-scale adaptive divergence and population genetic structure of Aedes aegypti in Metropolitan Manila, Philippines
- Atikah Fitria Muharromah
- Thaddeus M. Carvajal
- Kozo Watanabe
Parasites & Vectors (2024)
Inconsistent estimates of hybridization frequency in newts revealed by SNPs and microsatellites
- Aurélien Miralles
- Jean Secondi
- Pierre-André Crochet
Conservation Genetics (2024)
Genetic diversity analysis of Euterpe edulis based on different molecular markers
- Francine Alves Nogueira de Almeida
- Jônatas Gomes Santos
- Marcia Flores da Silva Ferreira
Tree Genetics & Genomes (2024)
Genetic erosion in a tropical tree species demonstrates the need to conserve wide-ranging germplasm amid extreme habitat fragmentation
- A. Phang
- M.A. Niissalo
- G.S. Khew
Biodiversity and Conservation (2024)
Using genomic data to estimate population structure of Gopher Tortoise (Gopherus polyphemus) populations in Southern Alabama
- Alexander R. Krohn
- Brian Folt
- Jeffrey M. Goessling
Conservation Genetics (2024)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Subjects

Abstract

Similar content being viewed by others

Introduction

Results

Datasets

Population genetic analyses

Pairwise F ST

F IS by sampling location

Heterozygosity by sampling location

PCA and SVDQuartets

Sampling loci

Discussion

Insights regarding choice of loci

Phylogeographic patterns in red mangroves

Conclusions

Methods and Materials

Sample collection, DNA isolation

Microsatellite amplification and analysis

RAD-Seq library preparation and data processing

Population genetic analyses

Principal components and SVDQuartets

Sampling loci

Data availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing Interests

Additional information

Electronic supplementary material

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

Comments

Search

Quick links

Pairwise F _ST

F _IS by sampling location