Adding loci improves phylogeographic resolution in red mangroves despite increased missing data: comparing microsatellites and RAD-Seq and investigating loci filtering

The widespread adoption of RAD-Seq data in phylogeography means genealogical relationships previously evaluated using relatively few genetic markers can now be addressed with thousands of loci. One challenge, however, is that RAD-Seq generates complete genotypes for only a small subset of loci or individuals. Simulations indicate that loci with missing data can produce biased estimates of key population genetic parameters, although the influence of such biases in empirical studies is not well understood. Here we compare microsatellite data (8 loci) and RAD-Seq data (six datasets ranging from 239 to 25,198 loci) from red mangroves (Rhizophora mangle) in Florida to evaluate how different levels of data filtering influence phylogeographic inferences. For all datasets, we calculated population genetic statistics and evaluated population structure, and for RAD-Seq datasets, we additionally examined population structure using coalescence. We found higher FST using microsatellites, but that RAD-Seq-based estimates approached those based on microsatellites as more loci with more missing data were included. Analyses of RAD-Seq datasets resolved the classic Gulf-Atlantic coastal phylogeographic break, which was not significant in the microsatellite analyses. Applying multiple levels of filtering to RAD-Seq datasets can provide a more complete picture of potential biases in the data and elucidate subtle phylogeographic patterns.

there are caveats to using SSRs. Perhaps most importantly, a limited number of loci (usually < 25) can feasibly be employed in a typical microsatellite study. Also, the mutational properties of SSRs are unusually high and almost certainly do not reflect those of the genome as a whole. Thus, the property that makes microsatellites excellent for distinguishing different individuals may inflate statistics such as F ST and heterozygosity relative to the rest of the genome. Furthermore, microsatellites can be just as expensive to implement as newer high-throughput sequencing (HTS) techniques if there are no existing genetic resources (e.g., no primers already developed, or no available transcriptomic or genomic resources) 6 .
The use of RAD-Seq data has increased greatly over the past decade, largely because thousands of loci can be generated simultaneously for hundreds of individuals for a fixed, known cost 7 . RAD-Seq uses restriction enzymes (REs) to create a reduced representation library of the genome; single-nucleotide polymorphisms (SNPs) in regions of DNA between restriction sites are used to distinguish between individuals 8 . Barcoding to allow efficient multiplexing during sequencing keeps costs down, which can be as little as $40 per individual for thousands of loci, assuming judicious sharing of reagents, and a well-designed plan for multiplexing individuals [9][10][11] . Microsatellite genotyping has a similar cost per individual, assuming primers are not developed, but many fewer loci are obtained 3 . SNPs have several advantages over microsatellites, as they are less likely to exhibit homoplasy than SSRs 12 .
Despite advantages, there are also several caveats to using RAD-Seq. Unless there is a reference genome, loci obtained using RAD-Seq are anonymous, and some loci may be non-neutral 7 . Additionally, biases may be introduced at several stages in a RAD-Seq protocol: (1) digestion with REs samples a non-random portion of the genome due to biases in base composition; this is potentially worse if methylation sensitive enzymes are used; (2) polymorphisms in restriction sites that can lead to segregating presence/absence polymorphisms that are very difficult to detect without very deep sequencing and negating the cost-savings of using RAD-Seq in the first place 7,13 ; (3) preferential PCR amplification of some loci during library construction necessarily reduces coverage of other loci 13 ; (4) sequencing errors and/or low sequencing depth leads to incorrect genotype calling 7 ; and (5) false loci are constructed due to the misassembly of paralogous reads 14,15 . Many potential problems are resolved by multiple PCR steps to even out loci coverage and by improvements in software when processing loci, but concerns remain that RE-based reduced representation methods do not capture a representative snapshot of the genome 16 . One other concern with RAD-Seq loci is that manual data curation is impossible, and errors may go undetected even by the most careful researchers 14,17,18 . Finally, the biggest potential problem when using RAD-Seq is that low coverage and high proportions of missing data can make it difficult to infer heterozygotes accurately.
Previous studies have compared results from SNPs and SSRs, revealing that microsatellites provide much more information-up to an order of magnitude more-on a per-marker basis than SNPs 19,20 . However, SNP studies typically use several orders of magnitude more markers than an average SSR study. Evidence has shown that the large number of loci in SNP studies can effectively allow for more powerful inferences, even though the information at each locus is less than that in microsatellite markers 21 . Because of the low number of loci used in SSR studies, the standard practice is to aim to minimize missing data. However, the nature of current library preparation and sequencing means that higher percentages of missing data are an unavoidable part of RAD-Seq studies. Simulation studies have shown that the large amounts of missing data in RAD-Seq studies can inflate F ST estimates due to allelic dropout 13,18 . As more loci were included in these simulations, F ST appeared to increase because many loci had genotype data for only one or a few individuals. In many such loci F ST = 1 because by chance the few individuals sampled were homogeneous within populations but different between populations, leading to high average F ST . Heterozygosity can be similarly inflated if the more frequent allele is likely to be absent (e.g., because mutations in the restriction site, which lead to allelic dropout, are often in ancestral alleles that occur at a high frequency) 18 . Arnold, et al. 13 confirmed results from Gautier, et al. 18 and also concluded that other summary statistics, including Θ and π, could be inaccurately estimated from loci with missing data. In spite of these problems, more recent simulation studies have indicated that missing data in RAD-Seq studies may not lead to incorrect inference, and in fact including loci with missing data can be advantageous for identifying shallow divergences 22 .
Convention in phylogeographic studies often is to require 75 or 80% of individuals to have data for a given locus-otherwise that locus is discarded from the analyses (e.g., refs [23][24][25][26][27][28]. Presumably, requiring a locus to be present in a certain number of individuals will eliminate loci with high missing data that may be the cause of misestimated parameters 13,18 . However, the choice of a cutoff is arbitrary and is typically not justified in phylogeographic studies-the number of SNPs is virtually always reported as a single fixed value (e.g., "we identified a total of 4,234 SNPs, " Jackson, et al. 24 ). In reality, the various parameter values that determine how many loci are constructed and retained in SNP alignment methods means that there is a range of loci that could conceivably be included in a study 27,29 .
To date, no phylogeographic study has investigated the effect of varying amounts of missing data in an empirical RAD-Seq dataset, even those explicitly comparing RAD-Seq-generated SNPs and microsatellites 30,31 . To remedy this knowledge gap, we investigate the phylogeography of red mangroves (Rhizophora mangle L., Rhizophoraceae) in Florida, using both an existing microsatellite dataset 32 , and new RAD-Seq SNP datasets that vary in number of loci and the percentage of missing data. We filtered RAD-Seq loci to generate a dataset that would approximate the number of loci and amount of missing data typically used in RAD-Seq phylogeography studies, and we also generated datasets with more or less stringent filtering to test the effects of increasing or decreasing the number of loci and percentage of missing data. Specifically, we address the following questions: (1) In RAD-Seq datasets, how are phylogeographic inferences affected by the number of loci used? (2) In RAD-Seq datasets, how are phylogeographic inferences affected by the percentage of missing data? (3) What are the important differences in performance between microsatellites and RAD-Seq data in population genetic and phylogeographic inference? (4) Do RAD-Seq data reveal any novel phylogeographic inferences not already recovered by microsatellites for red mangroves in Florida?
To address these questions, we used 96 red mangrove (Rhizophora mangle) individuals collected from 12 sampling locations on the coasts of Florida (Table 1, Fig. 1). Red mangroves are salt-tolerant trees that occur in coastal estuarine environments throughout the neotropics, experiencing high temperatures, frequent inundation, saline conditions, and periodic wave action associated with the coastal environment 33 . Red mangroves provide a variety of ecosystem services, including filtering water, providing habitat to animals, stabilizing shorelines, and protecting coastal environments from frequent wave action and occasional storm surges. Thus, red mangroves are important conservation targets-for which phylogeographic data can improve conservation strategies-making red mangroves a valuable study system.   Further analysis of phylogeographic patterns in red mangroves and other species occurring in the Florida peninsula is also warranted. Although previous studies of many coastal and marine taxa revealed a phylogeographic discontinuity at or near the southern tip of Florida 34,35 , recent work on red mangroves using microsatellites failed to identify such a pattern 32,36 . Different types of molecular markers could reveal new phylogeographic insights, due to broader sampling of the genome, and provide a predictive framework for understanding how genetic variation in this iconic species will respond to climate change. Finally, red mangroves are an ideal system for comparing the performance of alternative genetic markers, given previous analyses of microsatellite loci 32,36,37 and the size of the genome (approximately 1 Gb; Hodel unpublished data, based on flow cytometry observations), enabling a rigorous test of the RAD-Seq method. Genome size is a necessary consideration with RAD-Seq; as genome size increases, the number of loci shared among many individuals for a given sequencing depth decreases.

Results
Datasets. Seven datasets, ranging from 8 loci (SSR_8) to 25,198 loci (RAD_25198; Table 2), were used to investigate in depth how variation in number of loci and percent missing data impacted phylogeographic inferences. These were selected from all possible datasets, for which basic statistics were calculated (Supplementary Figure 1). The name of each of the seven datasets contains information about locus type (RAD or SSR) and number of loci in the dataset. The smallest RAD dataset contained 239 loci (RAD_239), and the percentage of missing data for RAD datasets ranged from 11.7% to 78.1% ( Table 3). The dataset RAD_1180, which required a locus to be present in 75 of 96 individuals (78.1%), most closely mimicked the amount of loci filtering typically used in a phylogeographic study. Therefore, in our analyses, we used this as a baseline dataset against which to compare other RAD datasets. Across sampling locations, the proportion of missing data was relatively uniform ( Table 1); percentage of missing loci in the data matrix for a given sampling location ranged from 65.8% to 88.9%.

Population genetic analyses.
Measures of heterozygosity were not significantly different between the microsatellite dataset and the RAD datasets; average H O was 0.431 for the microsatellite dataset and 0.392 in RAD_1180, with a range from 0.354 to 0.477 across all RAD datasets (Table 3). Average H E was 0.388 for the microsatellite dataset and 0.307 for RAD_1180 and ranged from 0.300 to 0.340 for all RAD-Seq datasets (Table 3). Average F ST for microsatellites was 0.124, which was significantly greater than average F ST for only one of the RAD datasets-the smallest (RAD_239; Table 3). Within the RAD datasets, average H O was significantly greater in RAD_25198 than all others, and it was significantly lower in RAD_6255 than in all others; H O did not predictably increase or decrease as the number of loci increased (Table 3). Additionally, within RAD datasets, average H E was significantly greater in RAD_25198 than all others. F ST ranged from 0.046 to 0.108 among the RAD datasets (Table 3). There was no significant difference in F ST in the three smallest RAD datasets, but the three largest RAD datasets all had increased F ST relative to the smaller datasets ( Table 3). The dataset with the largest value of F ST was RAD_6255 (Table 3). Average F IS using microsatellites was not significantly different than F IS calculated using RAD datasets; within RAD datasets, F IS generally increased as more loci were added, although RAD_25198 had the lowest value of F IS ( Table 3). Many of the population genetic statistics were disproportionately affected by loci with very low or very high values of F ST , F IS , or heterozygosity (Fig. 2). The effect of extreme loci was particularly evident in the larger datasets (RAD_6255 and RAD_25198), in which there were large numbers of loci with extreme values (e.g., F ST of 1.0; Fig. 2).
Pairwise F ST . The values of pairwise F ST for each sampling location relative to other sampling locations were remarkably consistent across datasets (Table 4). For most sampling locations pairwise F ST estimated by SSRs was approximately twice as large as RAD dataset estimates. For every dataset, pairwise F ST between Seahorse Key and all other sampling locations was the highest. For every RAD dataset, Islamorada had the lowest value for pairwise F ST , but the SSR dataset identified West Palm Beach as the sampling location with the lowest estimate of pairwise F ST . Even as the amount of missing data increased, the pairwise F ST estimates remained consistent; RAD_25198 produced similar values to smaller RAD datasets (Table 4). F IS by sampling location. Cape Canaveral was the location with the highest F IS using the microsatellite data (SSR_8), followed by Key Largo and Seahorse Key (Table 5). Meanwhile, for all RAD datasets, Seahorse Key had one of the lowest (i.e., most negative) F IS values among all populations. Within the RAD datasets, the number of loci and/or amount of missing data affected F IS . For example, in Key Largo, the largest dataset yielded a value  (Table 6).

PCA and SVDQuartets.
The PCA analysis revealed that microsatellite data did not identify clear groupings of individuals based on sampling location or other geographical divisions (Fig. 3). Similarly, RAD_239 did not differentiate the samples into discrete clusters. However, RAD_1180, RAD_2317, RAD_3831, RAD_6255, and RAD_25198 all divided the samples into two groups with minimal overlap in the PCA visualization: one group was Gulf Coast samples, and the other group was Atlantic Coast samples (Fig. 3). Closer inspection of the PCAs revealed that most of the Cape Canaveral individuals formed a discrete cluster intermediate between the two other clusters (Gulf and Atlantic). Most RAD datasets had sufficient resolution to place Cape Canaveral between the Gulf and Atlantic clusters, but the use of a small number of loci (i.e., RAD_239) was unable to show this relationship. Furthermore, the two largest datasets, RAD_6255 and RAD_25198, showed Cape Canaveral individuals clustering more closely to the Atlantic than the Gulf cluster. The 50% majority-rule consensus bootstrap trees generated with SVDQuartets showed substantial variation between datasets when inferring genealogical relationships between individuals and/or sampling locations (Fig. 4). In many cases, dataset RAD_239 did not identify genealogical relationships that were recovered with other datasets with more loci. However, certain key relationships among individuals were consistently shown in multiple datasets with thousands of loci. In every dataset except RAD_239 (i.e., every dataset with at least 1180 loci), all Seahorse Key (ShKFl) samples formed a clade (Fig. 4). In four datasets (RAD_2317, RAD_3831, RAD_6255, RAD_25198), all Gulf Coast (NPRFl, ShKFl, TCBFl) samples (except one individual: NPRFl_R8), together with all Cape Canaveral (CpCFl) samples, formed a clade that is sister to all remaining Atlantic Coast samples plus NPRFl_R8. Interestingly, this relationship was not recovered in RAD_1180, the dataset with 'ideal' filtering of loci-but all datasets with more loci (and therefore more missing data) did recover the relationship. Each RAD dataset had nodes with varying levels of bootstrap support (Fig. 4). Datasets with fewer loci showed few nodes with bootstrap support >70%; RAD_239 had three such nodes. More loci resulted in more nodes with bootstrap support >70%, up to a point: RAD_1180 had six highly supported nodes, RAD_2317 had eight, and RAD_3831 had the most with nine. Then the number of highly supported nodes slightly declined as more loci were added: RAD_6255 and RAD_25198 each had eight nodes with bootstrap support >70% (Fig. 4).

Sampling loci.
Analyzing differently sized samples of loci from RAD_25198 and SSR_8 provided several crucial insights. A microsatellite dataset with seven loci sampled from SSR_8 performed better in estimating F ST than a dataset with six loci, although in each case, all 100 sampled replicates fell within the 95% confidence interval of F ST for the complete SSR_8 dataset (Fig. 5). For all RAD datasets, the value of F ST estimated using only originally filtered data is different from all 100 permuted values of F ST calculated from an equivalent number of loci sampled from the largest dataset (RAD_25198). For almost all datasets, F ST based on sampled loci was less than F ST using original loci, except for one dataset (RAD_6255), F ST based on the sampled loci was greater. Strikingly, in none of the datasets did the confidence intervals from the sampled loci overlap with the confidence intervals of the estimated F ST values from the original data (Fig. 5). The percentage of missing data in the largest dataset clearly had an immense impact. Even when very few loci (e.g., 239 loci) were sampled from the largest dataset, the

Discussion
Insights regarding choice of loci. Our results indicate that filtering loci using the standard cutoff (i.e., 75-80% of individuals must possess data for a given locus for that locus to be retained) should not be the gold standard in RAD-Seq studies-it is possible to retain many more loci without inflated statistics [23][24][25][26][27][28] . F ST increased as missing data increased, as predicted by simulation studies, but the relationship is more nuanced than previously assumed. F ST increases as the percentage of missing data increases-up to a point-and then decreases from RAD_6255 to RAD_25198, as the percentage of missing data nearly doubles, from 41.3% to 78.1% (Table 3). When more loci are included, the distribution of F ST across the genome is more completely sampled. However, adding loci with more missing data may cause analyses to miss low-frequency alleles in the loci with extensive missing data, which would add error to average estimates of F ST . Sampling analyses confirmed that F ST generally  increased as missing data increased (Fig. 5). Heterozygosity was less affected by missing data, as there was little or no change in either observed or expected heterozygosity when the percentage of missing data ranged between 11.7% (RAD_239) and 41.3% (RAD_6255). Only the largest dataset (RAD_25198, with 78.1% missing data) showed significantly higher heterozygosity than other datasets. Some simulation studies reported that missing data could inflate F ST , and would likely inflate estimates of heterozygosity, leading to calls for removing loci with incomplete sampling 13 . However, more recent simulation studies showed that removing loci with higher mutation rates, which are more likely to have missing data, negatively impacted phylogenetic analyses 22 . Our study shows the importance of thoroughly exploring how loci are filtered in empirical systems. Extreme amounts of missing data yield higher estimates of F ST and heterozygosity and lower estimates of F IS (Table 3). A large number of loci in RAD_25198 had very high values of certain statistics (e.g., hundreds of loci with F ST > 0.975 and thousands of loci with H O > 0.975), which severely impacted average estimates of these statistics (Fig. 2). Notably, not all datasets have these extreme loci-dataset RAD_3831, which only requires 52.1% of individuals to have data for a given locus, and had 29.4% missing data, did not suffer from extreme loci, despite liberal filtering. Although missing data caused some statistics to increase, it did not dramatically affect our conclusions. For many analyses, using datasets other than RAD_1180, especially RAD_2317 and RAD_3831, did not change the interpretation of the results. Regardless of which of the three datasets was used, F ST was relatively low-between 0.057 and 0.066. Importantly, nearly doubling the amount of missing data from 17.1% (RAD_1180, the 'gold standard') to 29.4% (RAD_3831) resulted in a very small increase in F ST and no significant change in other statistics (F IS , H O , H E ; Table 3). Furthermore, using very few loci (RAD_239) did not significantly change any of the statistics estimated using RAD_1180 (Table 3). Our data indicate that the often-used cutoff of 75-80% individuals with locus data is arbitrary, and different cutoffs should be considered and evaluated on a case-by-case basis to ensure an appropriate number of loci are used. The results suggest that in many cases, only minimal filtering of loci is needed, and many more loci can be retained than typically are. Researchers who wish to maximize the number of loci in their study could likely use very low cutoffs (e.g., require a locus to have data for >10% of individuals).
Typically, microsatellite datasets have lower F ST values relative to SNPs due to a larger number of alleles, although simulation studies have shown evidence that F ST can be elevated up to an order of magnitude in microsatellite datasets due to factors such as correlated allele frequencies 38 . Average F ST ranged from 0.046 to 0.124 across all datasets-so there is not high differentiation detected in any dataset (Table 3). When using any RAD dataset except RAD_239, F ST calculated using RAD loci was statistically indistinguishable from the microsatellite dataset (Table 3). In theory, a three-to-four-fold change in F ST could alter biological conclusions-possibly with deleterious results (e.g., identifying populations or management units that would be prioritized for conservation)-but no matter how the loci were filtered, there was a relatively small range of estimated F ST . RAD-Seq studies where larger values of F ST were detected could exhibit larger absolute changes in F ST when using different loci filtering cutoffs (e.g., refs 39,40 ).
Similarly, the interpretation of F IS and H O could impact how data are considered in a biological context (e.g., identifying locations at risk for inbreeding depression). Positive values of F IS and/or low values of H O often indicate inbreeding, which means sampling locations are more vulnerable than other sampling locations. As noted earlier, SSR_8 identified five sampling locations with positive F IS values (Table 5). Only one of these sampling locations (Key Largo) also had positive F IS values in any of the RAD datasets. Clearly, marker choice (microsatellite versus RAD-Seq) affected the assessment of which populations are more vulnerable based on F IS values. Agreement between these two markers types would facilitate identifying sampling locations vulnerable to inbreeding depression. However, it is understandable that different markers would lead to different results, as  Fig. 2).
The PCA results show that as the number of loci increases, the definition of clusters improves, plateauing with RAD_2317 or RAD_3831. The clustering is similar in all RAD datasets with 1,180 or more loci, with Cape Canaveral individuals falling between the Gulf and Atlantic clusters. As more loci are added, the Cape Canaveral samples appear to be closer to the Atlantic cluster, especially in datasets RAD_6255 and RAD_25198 (Fig. 3). Taking into account the SVDQuartets results clarifies the clustering-all Cape Canaveral samples form a clade with all Gulf samples except one. However, this relationship is only present in datasets with 2,317 loci or greaterthe putatively 'gold standard' dataset RAD_1180 does not show this relationship.
Phylogeographic patterns in red mangroves. Based on previous studies using microsatellite data 32,36 , the relationship of Cape Canaveral samples to other sampling locations, as found here with both PCA and SVDQuartet analyses of RAD-Seq data, was surprising-previous studies did not find that any of the individuals in the Cape Canaveral population clustered with any of the Gulf samples. These new data could indicate an Atlantic-Gulf phylogeographic discontinuity, and that Cape Canaveral is an anomaly due to a lack of phylogeographic resolution, recent population founding, or human-mediated transplantation. The intermediate placement of Cape Canaveral in many of the PCAs suggests that it may actually cluster with the Atlantic samples, especially when considering datasets RAD_6255 and RAD_25198, indicating a phylogeographic break (Fig. 3). However, the SVDQuartets results place Cape Canaveral in a clade with the vast majority of Gulf samples, although this relationship is not highly supported in any datasets (i.e., bootstrap support is not >70% for this clade in any dataset) (Fig. 4). Assuming that Cape Canaveral is more closely related to Gulf samples, the age of the divergence between the two clades (Atlantic, Gulf + CpCFl) comes into question. Northern Florida represents the northern limit of the range of red mangroves 33 . Typically, populations of these trees in northern Florida are periodically extirpated due to freezing events, and these areas are re-colonized. The lower values of H O in northern populations (CpCFl, MlbFl, ShKFl, NPRFl, TCBFl) relative to southern populations indicate that these populations were likely founded more recently from a small number of propagules. The Cape Canaveral population was likely founded by individuals from the Gulf Coast, suggesting that the divergence between the two clades (Atlantic, Gulf + CpCFl) is very recent.
Previous research indicates that gene flow is greater from the Gulf Coast to the Atlantic Coast in red mangroves; there may be ongoing gene flow from the Gulf to Cape Canaveral 32 . Alternatively, alleles from the Gulf Coast could have migrated into an existing Cape Canaveral population and proliferated due to other processes (e.g., drift). Another explanation for the sister relationship between the Gulf samples and Cape Canaveral is Table 5. The variation in average inbreeding coefficient (F IS ) among data sets and populations. Within each data set, lower (warmer colors) and higher (cooler colors) values of F IS are shown using color-coding. The average value of F IS across all data sets for each population is shown in the last column of the human-mediated transplantation of propagules or seedlings from the Gulf Coast to Cape Canaveral. However, all available publications and information from land managers who replied to requests for information confirm that any restoration that required importation of propagules used either local propagules or seedlings from the southern Atlantic Coast (ref. 41 , personal communication with Rangers from Cape Canaveral National Seashore). Another possible explanation for this result is that red mangrove propagules were accidentally transported from the Gulf Coast of Florida to Cape Canaveral during construction of the Kennedy Space Center in the 1960s. Construction of the Space Center was a massive project. It is noteworthy that nearly 100,000 tons of steel was transported from the Gulf Coast to Cape Canaveral in numerous trucks; the transport of a few mangrove propagules during this process could easily have established a Gulf genotype in the Cape Canaveral area 42 . We conclude that, in contrast to microsatellites, RAD datasets recover a relationship between the Gulf Coast and the Atlantic Coast (excluding Cape Canaveral) that supports the presence of a maritime discontinuity in red mangroves. However, as red mangroves can disperse long distances, a population or populations that recently established in Cape Canaveral likely had a founder or founders that were predominantly of Gulf Coast origin. The fact that previous studies using SSRs did not elucidate this relationship is not surprising-both the PCA analysis and SVDQuartets analysis indicate that 1180 loci were barely sufficient to infer the placement of Cape Canaveraldatasets with many more loci were needed (Figs 3 and 4). The large number of loci required to resolve such relationships highlights why liberal filtering of RAD-Seq loci is advisable.

Conclusions
We cannot overemphasize the importance of thoroughly exploring RAD-Seq datasets when performing phylogeographic analyses-it is too easy to jump to conclusions when only using one arbitrary cutoff to filter loci. Our empirical data confirm that estimates of F ST and/or heterozygosity may become inflated as missing data increase. However, this does not happen as quickly as implied in simulation studies as loci with missing data are addedliberal filtering of loci retains loci valuable for phylogeographic or phylogenetic inference, without inflating population genetic statistics. Thus, regardless of the cutoff value used to filter loci, researchers should investigate several other cutoffs with both increased and decreased amounts of missing data to appreciate fully the impact of missing data on parameters in their study. We found no evidence that the 75% or 80% cutoff commonly employed was optimal. In many analyses, other datasets with cutoffs ranging from 31.3% to 67.7% performed just as well as or better than RAD_1180. Many RAD-Seq studies aim to multiplex as many individuals as possible in a HTS run; our results show that retaining loci with more missing data is feasible and advantageous in empirical studies, and that researchers can include more samples in a single sequencing run. Our study confirmed that microsatellites were a valuable tool for inexpensively estimating population genetic statistics, such as F ST , F IS , and heterozygosity. However, this study revealed that the thousands of additional loci from across the genome provided by RAD-Seq increased phylogeographic resolution. We found that red mangroves likely have a phylogeographic discontinuity at the southern tip of Florida that was not detected in previous studies using SSRs 32,36 and that a single population from the Atlantic coast of Florida arose via recent colonization by propagules (either natural or human-mediated) from the Gulf coast.

Methods and Materials
Sample collection, DNA isolation. We collected leaf tissue from plants of R. mangle from 12 locations in Florida ( Fig. 1). At each location, we collected one leaf from 10-20 individuals that were spaced at least 15 m apart to minimize collecting closely related individuals. For each sampling location, we randomly selected 8 individuals to use in genetic analyses. GPS coordinates for each sampling location were recorded (Table 1). Each sampled leaf was placed in a labeled bag with silica gel and stored for 1-12 months; we then extracted DNA from the dried leaf tissue using a standard CTAB protocol 43 .
Microsatellite amplification and analysis. We PCR-amplified eight nuclear microsatellite loci for R. RAD-Seq library preparation and data processing. We followed the double-digest RAD-Seq protocol developed by Peterson, et al. 46 . For each sample, we constructed 96 DNA libraries by digesting approximately 200 ng genomic DNA with EcoRI and MseI. We then ligated Illumina adapters and unique 8-10-nucleotide barcodes to the DNA fragments. The DNA libraries were PCR-amplified in two separate reactions and pooled to minimize early PCR bias. We size selected 250-450-bp fragments using gel electrophoresis and sequenced the DNA fragments using the 1 × 100-bp setting on the Illumina HiSeq. 2500 platform. Raw sequence data were deposited in the NCBI Sequence Read Archive (accession numbers pending). We processed the raw Illumina reads using the FAST-X toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) to filter sequences; we required 95% of bases to be above a quality score of 30 to retain a read. We then converted the sequences from FASTQ to FASTA, demultiplexed the reads, sorted them by barcodes, and trimmed the sequences by removing the final 2 bases to ensure that we were using only high-quality sequence data. We assembled the sequences into loci using the STACKS 1.24 pipeline 47 with the following parameter settings: -n 3 -m 3 -M 2 (parameters were optimized following Mastretta-Yanes, et al. 29 ); all other parameters were left as the default. We selected seven datasets (one microsatellite and six RAD-Seq) and used a variety of analyses to compare the results produced by each dataset (Table 2). We used the 'populations' program in STACKS to produce an unfiltered dataset of RAD-Seq loci using the 'write single SNP' command and requiring a minor allele frequency >0.05. We then removed human, fungal, and microbial contamination from the loci and filtered loci by representation across individuals using an R script to create five smaller datasets (Data_aquisition.R; this script and all other scripts are available at https://github.com/richiehodel/ red_mangrove_RAD_SSR). Filtered datasets were required to have locus data for a certain number of individuals for the given locus to be retained in the analysis; the number of individuals could range from 1-96 (Supplemental Fig. 1). The datasets were chosen such that they encompassed a wide range of loci and missing data.

Population genetic analyses.
We used an R script (Basic_stats.R) and the R package 'hierfstat' 48 to calculate average F ST , the inbreeding coefficient F IS , H O , and H E for each of the seven datasets. To investigate how the number of loci affected comparisons of population genetic statistics among populations, we calculated pairwise F ST (one sampling location versus all others combined) for each sampling location for each dataset using GenoDive 49 and an R script (Pairwise_Fst.R). Additionally, we calculated F IS and H O for each sampling location for each dataset to determine how measures that often inform conservation practices might be affected by the number of loci and amount of missing data. We measured how missing data were partitioned across sampling locations to verify that there were not any sampling locations with unusually high or low amounts of missing data (Table 1). Additionally, we investigated how several population genetic statistics were distributed across loci in each of the datasets (Stat_Distribution.R; Fig. 2).
Principal components and SVDQuartets. We used a principal component analysis (PCA) implemented in the R package 'SNPRelate' 50 to identify clusters of individuals in the RAD data with an R script (VCF_PCA.R) and GenoDive to run a PCA on the microsatellite data. After visualizing the initial results, we tested several ways of grouping sampling locations together based on geography. We used SVDQuartets 51 to determine genealogical relationships among individuals. This program selects the optimal topology for a quartet of taxa, and, after sampling millions of quartets, infers a phylogeny for all individuals based on choosing the quartets with the best scores and assembling them into a phylogenetic tree. We used an R script (Nexus_creation.R) to convert the output from the 'populations' program in STACKS into nexus files that could be read for the SVDQuartets analysis. For each RAD dataset, we evaluated all possible quartets and selected trees under the multispecies coalescent using QFM (Quartet Fiduccia Mattheyses) quartet assembly 52 . We used non-parametric bootstrapping (100 replicates for each dataset) to assess confidence in inferred genealogical relationships between individuals. The R script Tree_formatting.R was used to visualize and annotate the 50% majority-rule trees from SVDQuartets using the R packages 'ape' 53 and 'ggtree' 54 .

Sampling loci.
To test whether the number of loci or percentage of missing data for the loci used is the more important factor impacting measures of fixation, population differentiation, and heterozygosity, we randomly sampled from RAD_25198 (the RAD-Seq dataset comprising 25,198 loci) the equivalent number of loci contained in RAD_239, RAD_1180, RAD_2317, RAD_3831, and RAD_6255, respectively, and used these five sets of sampled loci in analyses. We used an R script (Subsample.R) to randomly sample loci without replacement from RAD_25198 and repeated the sampling 100 times for each dataset. We compared measures of F ST calculated using the original datasets with results calculated using the sampled loci from RAD_25198 (Fig. 5). We used bootstrapping to calculate 95% confidence intervals around F ST for the original datasets and for the sets of loci sampled from RAD_25198 (Fig. 5).