Knowledge status and sampling strategies to maximize cost-benefit ratio of studies in landscape genomics of wild plants

To avoid local extinction due to the changes in their natural ecosystems, introduced by anthropogenic activities, species undergo local adaptation. Landscape genomics approach, through genome–environment association studies, has helped evaluate the local adaptation in natural populations. Landscape genomics, is still a developing discipline, requiring refinement of guidelines in sampling design, especially for studies conducted in the backdrop of stark socioeconomic realities of the rainforest ecologies, which are global biodiversity hotspots. In this study we aimed to devise strategies to improve the cost-benefit ratio of landscape genomics studies by surveying sampling designs and genome sequencing strategies used in existing studies. We conducted meta-analyses to evaluate the importance of sampling designs, in terms of (i) number of populations sampled, (ii) number of individuals sampled per population, (iii) total number of individuals sampled, and (iv) number of SNPs used in different studies, in discerning the molecular mechanisms underlying local adaptation of wild plant species. Using the linear mixed effects model, we demonstrated that the total number of individuals sampled and the number of SNPs used, significantly influenced the detection of loci underlying the local adaptation. Thus, based on our findings, in order to optimize the cost-benefit ratio of landscape genomics studies, we suggest focusing on increasing the total number of individuals sampled and using a targeted (e.g. sequencing capture) Pool-Seq approach and/or a random (e.g. RAD-Seq) Pool-Seq approach to detect SNPs and identify SNPs under selection for a given environmental cline. We also found that the existing molecular evidences are inadequate in predicting the local adaptations to climate change in tropical forest ecosystems.


Methodology
To perform this work, all research articles, published to date (at the time of writing, September 2018), which evaluated local adaptations in populations of wild plants were pooled. The papers were analyzed for (i) number of populations sampled, (ii) number of individuals sampled per population, (iii) total number of individuals sampled, and (iv) number of SNPs used. Care was taken to ensure that all data were obtained in empirical studies assessing local adaptation in wild plant populations. The Scopus (https://www.scopus.com) and Google Scholar (http://scholar.google.com.br/) databases were searched for title, abstract, and articles using the following keywords: environmental variables and SNPs, landscape genetics and SNPs, spatial analysis and SNPs, landscape genomics and SNPs, population genomics and SNPs, adaptive genetic variation, and local adaptation and SNPs. Search results were filtered to remove papers and reviews from clinical, biomedical, veterinary, and immunology areas. Then, a second filter was applied to eliminate the articles containing the following words: animal, fish, ecology of freshwater, marine biology, entomology, and zoology. Finally, papers using simulations, or involving exotic and crop species in plantations, or using greenhouse experiments were removed and the remaining 35 empirical studies on in situ wild plants were selected for the present work. Using this data set and ArcGIS (10.2), a map was created depicting the distribution of the localities where researches were performed, in the different terrestrial biomes (Fig. 1). A double check was done at this stage to ensure if all the 35 papers chosen for this study, had evaluated in situ local adaptation in wild plant populations, using SNPs markers. The following information from the papers: (1) authors names; (2) publication year, (3) country of study and geographical coordinates; (4) botanical family studied; (5) species; (6) total number of individuals; (7) number of populations; (8) number of SNPs used; (9) number of SNPs under selection; and (10) method used to generate the SNPs data, was compiled.
For this study, we considered as SNPs under selection, the SNPs from the pooled data that had a significant association with a geoclimatic variable, such as temperature, precipitation, latitude, longitude, elevation, evapotranspiration, and drought. They were used as a proxy to detect the potential for natural selection in wild plant populations. The number of SNPs under selection could be influenced by the methods employed to analyze the correlation between allelic frequency and environmental variables 38 , therefore most articles in the literature make inferences from such studies, using conservative criteria 34,35,40 . It is known that each method has different types of limitations and caveats (e.g., vary in the rates of false positives or of false negatives), which can cause errors in the estimates and consequently in the estimate of the number of SNPs under selection 38 . For this reason, we chose to be conservative, using only the results obtained either simultaneously by at least two methods of analysis or strictly controlled for rates of false positives or false negatives (e.g., studies that controlled the population structure in the analyses). Thus, although limitations of the different methodologies used were not fully circumvented, we believe that the results reported in our mini-review, portrays the findings of the literature, in a way seeking to minimize, as much as possible, the biases inherent to the different analysis.
The methods used to generate the SNPs data were subdivided into Random, Random Pool-Seq, Targeted, and Targeted-Pool-Seq categories. The Random category included studies that used random regions of DNA and individualized sequencing for the preparation of libraries. The Random-Pool-Seq category is distinct from Random, in that the data is sequenced with the Pool-Seq technique (Pool-Seq, an equimolar amount of DNA is taken from each individual from a population and pooled for sequencing). In the Targeted category, we included studies that used only specific gene regions or expressed sequence tags with individualized sequencing. The Targeted Pool-Seq category differed from Targeted, such that the data is sequenced by Pool-Seq technique mentioned above.
Influence of sample size on number of SNPs under selection was inferred by determining the average of the individual number of samples from each population which was arrived at by dividing the total number of individuals and the number of populations in each study. Additionally, the percentage of SNPs under selection was calculated to estimate the number of genes under natural selection in the genome. Then, a descriptive statistical analysis (minimum, maximum, mean, and standard) was performed with all the variables in R program (http:// www.r-project.org/).
To evaluate the influence of each variable on the number of SNPs under selection, the linear mixed effects model of the lme4 package of R program 41 was used. The variables in this study, such as the number of populations, number of individuals, average number of individuals sampled per population, and number of SNPs used were investigated to determine their influence on the number of SNPs under selection and, consequently, the detection of natural selection in plant species. As this type of model requires the fulfillment of the assumptions of normality and homoscedasticity of the residues, we performed the scaling of each variable separately. We used the rank function in R program to reduce the amplitude of our data, meet the assumptions, and obtain the goodness of fit of the statistical models. For each variable, an independent model was created by considering each variable as the fixed effect and the others as the random effect, such that: where, the number of SNPs under selection is the response variable; number of SNPs used is the fixed-effect variable; the terms in parentheses are the random effect variables and the number 1 indicates that the intercept is random between the observations of each variable.
The linear mixed effects model in the lme4 package does not provide the results as the coefficient of determination (r²), due to theoretical problems or difficulty of implementation. For this reason, we used the marginal r² to calculate the amount of variance explained by the fixed-effect variables in each model 42 . Finally, we used the function lm in R to establish the simple linear regression analysis of the number of SNPs used in the function of the total number of individuals. To make the graphs with the significant results of all the analyses, we used the ggplot2 package in R 43 .

Results
Based on the 35 selected empirical articles, we analyzed 44 observations involving 36 different species for the number of SNPs under selection (Table 1). These studies were conducted mainly in Europe and North America (Fig. 1A), in ten plant families, dominance by Pinaceae (53% of the species) (Fig. 1B), distributed mainly in biomes of temperate broadleaf and mixed forest, temperate grasslands and Mediterranean forests (Fig. 1C). The total number of individuals ranged from 22 to 2,574, and the number of SNPs used varied between 33 and 2,091,957. However, the number of SNPs under selection (SNPs with significant association with some geoclimatic variable) varied from 2 to 2,522 with the percentage of SNPs under selection ranging from 0.02 to 78.14 ( Table 1).
In the analysis using the linear mixed-effects models, we verified the effect of each variable (related to sample size and SNP number) individually on the number of SNPs under selection, whereas the others were used as random effect, as mentioned in the methodology section. On entering, the number of populations, as fixed effect variable, the model showed no significant relation with the number of SNPs under selection (marginal r² = 0.011, p = 0.38). A similar result was obtained when the mean number of individuals was used as the fixed effect variable (Marginal r² = 0.041, p = 0.16). However, when the number of SNPs was used as the fixed effect variable, a positive and significant relationship was observed with the number of SNP under selection ( Fig. 2A). When the total number of individuals was used as a fixed effect variable in the model, a negative and significant relation was observed in relation to the number of SNPs under selection (Fig. 2B). Subsequently, it was verified that the total number of individuals had a negative and significant relation with the number of SNPs used (Fig. 2C).

Discussion
The data obtained in the systematic review of studies in wild plant populations showed great heterogeneity in the number of individuals, populations, and quantity of SNPs. Our meta-analyses revealed that the total number of individuals and number of SNPs used are of fundamental importance in detecting signs of natural selection in wild plant populations. We also found an enormous knowledge gap in neotropics, underscoring the fact that predicting the response of the tropical forests to geoclimatic changes based on available data is not possible. Most of the studies had been conducted in the biomes of temperate and mixed forests of the European and North American habitats. However, we had to use them to apply the results in the recommendation for other environments, in order to help in decentralizing studies that are performed almost exclusively in the North hemisphere. Thus, it becomes evident the need to increase the possibilities of develop general methodologies guidelines that support decision making about sampling based on what we have available in the literature. Therefore, although our results were mostly based on species from temperate environment, we believe that they can be generalized to global conclusions. For this reason, our discussion focused on tropical regions, as they are recognized as a biodiversity hotspot and having really few studies with landscape genomics.
We found few empirical studies evaluating SNPs under natural selection, with only 36 plants species under analysis. From those studies, we registered ten families, where 53% of the species belonged to the Pinaceae family. This result evidences the poor understanding of local adaptation processes for different species. Considering that tropical forests have approximately 11,371 species of trees 44 , we have a limited vision to predict how species may respond to anthropogenic changes like climate change, fragmentation, and forest loss 45,46 . In addition, ~81% of studies were conducted in the European continent or in North America. Although much of the plant diversity is located in tropical regions 44,47 , the ability of these species to adapt in situ in response to environmental changes has not been studied 48 . We believe that this lack of knowledge for the tropics is a reflection of the socioeconomic conditions of this region as well as the lack of specialists in the area since population genomics is relatively recent and the first studies in the tropics are just emerging (see Table 1). Considering that tropical regions could be severely affected by climate change 4 , our results have shown the great need for studies that seek to understand how geoclimatic variables influence local adaptation. Through these studies, it will be possible to predict how tropical species will respond to climate change and assist in their conservation and management 29,32,35,45 .
With our systematic review of empirical studies, we have evidenced that the number of populations and mean number of individuals in a population do not influence the ability to detect natural selection. This emphasizes the importance of sampling, at the best, the whole extent of the area that is being influenced by the environmental variable of interest, rather than focusing on the number of populations sampled or on the mean number of individuals per population. If many populations, from similar environments, are sampled, the probability of detecting local adaptation does not increase 39 . On the other hand, even if few populations are sampled in contrasting environmental conditions, it would increase the power to detect local adaptation 39 . Thus, it will be more effective to cover the environmental heterogeneity, avoiding efforts both in the expansion of the number of populations and in the average number of individuals per population.
The linear mixed-effect model showed that the number of SNPs used explained 59% of the variation in the number of SNPs under selection and therefore can positively influence the detection of natural selection signals. This could occur because mostly a low percentage of genes in a genome are under natural selection (Table 1); therefore, increasing the genome sampling, also increases the chance of identifying SNPs under selection. Interestingly, a general pattern observed was that studies using Random, Targeted Pool-Seq, and Random Pool-Seq methodologies used more SNPs and had a greater number of SNPs under selection. Therefore, it would be beneficial if future works used one of these methodologies, especially Random Pool-Seq and Targeted Pool-Seq, to optimize the cost-benefit ratio. In this context, studies that aim to evaluate the indicators of natural selection in populations of wild plants could use the Targeted Pool-Seq approach, in cases where genomic information of the species is available (e.g., genome size and gene annotation). By using targeted sequencing in DNA pools, cost benefit ratio of such studies would be increased. In contrast, for the species with little or no genomic information available and for laboratories that are working on landscape genomics for the first time, Random Pool-Seq is the best strategy to increase the chances of detecting SNPs under selection. An interesting alternative www.nature.com/scientificreports www.nature.com/scientificreports/ to make landscape genomics financially feasible would be the use of grouped sequencing of individuals by techniques like RADseq that do not require any prior knowledge of the genome 49 . While it is still advantageous to use the Pool-Seq technique which can increase accuracy in allele frequency estimates, compared to individualized sequencing, making it an excellent tool to be used in landscape genomic studies [50][51][52] . Moreover, it also allows to reconcile the sampling of a large number of individuals needed in population genomics, with thousands or millions of SNPs, to evaluate adaptation in natural populations 12,23,53,54 . It is important to note that the Pool-Seq technique has some limitations, for example, not providing information for each individual separately 52 . Therefore, the use of this technique is only suitable for studies aiming for population inferences without the need of information about individuals.
By fixing the variable in the linear mixed effects model as total number of individuals, we found that the total sample size was inversely related to the number of SNPs under selection and to the number of SNPs used. Although significant, these correlations explain only 10% of the variation in the number of SNPs under selection and 14% in the number of SNPs used. However, much of the variation in the number of SNPs under selection (59%) could be explained by fixing the variable as the number of SNPs used. We therefore, believe that the significantly negative correlation between the total number of individuals and the number of SNPs under selection is a reflection of the inverse relationship observed between the total number of individuals and the number of SNPs used. Thus, the negative relation seen between the total number of individuals and the number of SNPs under selection is an artifact generated by the lower capacity detection of natural selection when the number of SNPs used decreases. The negative relationship observed between the number of individuals sampled and the number of SNPs used could be due to incomplete genomic information because of the costs associated with sequencing of whole genomes 53 . Among species for which some genomic information is available, size consideration of the genome becomes important; because the larger genomes may need more number of SNPs to be sampled to give a full perspective, thus impacting the number of individuals that can be analyzed given the budgetary constraints. Also, investing a huge portion of the budget for sampling in the field to increase the sample size, would affect the laboratory stages of the study by limiting resource allocation. Thus, the negative relation found in this study, can be related to the costs of individual sequencing of each sample. In population genomics studies, the conflicting demands between the total number of individuals and the number of SNPs used is circumvented by using the Pool-Seq technique 53 (Table 1). This technique has also been used in plant species with incomplete or total absence of genomic information 52 . This indicates that the same strategy can be used in the landscape genomics studies of species in tropical regions, which are major biodiversity hotspots of the world 44,47 with little or no genomic information available.

Guidance for Study of Local Adaptation in Wild Plant Species
Our study involving systematic review of empirical studies and meta-analysis of data and results therein, about wild plant populations, allows us to put forth the recommendations that in future landscape genomics studies, the experimental design should focus on increasing the total number of individuals sampled along the environmental heterogeneity under analysis 39 . In addition, we also suggest that researchers seek to evaluate the adaptation in wild plant populations using the Pool-Seq technique to increase the number of SNPs and accuracy in allele frequency estimates 52 . This technique has been used in population genomics as a powerful tool under different study scenarios, including model species 54 , allopolyploid species 55 as well as for species with little or no genomic information available 56 . Following these guidelines, it will be possible to extrapolate findings from studies that