Modeling spatially and temporally complex range dynamics when detection is imperfect

Species distributions are determined by the interaction of multiple biotic and abiotic factors, which produces complex spatial and temporal patterns of occurrence. As habitats and climate change due to anthropogenic activities, there is a need to develop species distribution models that can quantify these complex range dynamics. In this paper, we develop a dynamic occupancy model that uses a spatial generalized additive model to estimate non-linear spatial variation in occupancy not accounted for by environmental covariates. The model is flexible and can accommodate data from a range of sampling designs that provide information about both occupancy and detection probability. Output from the model can be used to create distribution maps and to estimate indices of temporal range dynamics. We demonstrate the utility of this approach by modeling long-term range dynamics of 10 eastern North American birds using data from the North American Breeding Bird Survey. We anticipate this framework will be particularly useful for modeling species’ distributions over large spatial scales and for quantifying range dynamics over long temporal scales.

package unmarked 15 . These programs allow users to fit several variations of the standard static or dynamic occupancy models 9 using generalized linear models (GLMs) to estimate covariate effects on occupancy and detection. Although this GLM-based approach can account for covariates, imperfect detection, and temporal auto-correlation in occupancy probability, these programs are restricted in their ability to model non-linear residual spatial variation that often characterizes species' distributions. In most cases, GLM-based models assume that spatial variation can be modeled as a linear or quadratic function of latitude and longitude or environmental covariates 16 . For many species, however, a small number of covariates cannot reasonably model all sources of variation and therefore GLM-based occupancy models are unlikely to provide accurate estimates of spatial variation in occupancy.
One alternative to the GLM approach is to estimate non-linear spatial variation in occupancy probability using generalized additive models (GAMs). An extension of GLMs, GAMs estimate the relationship between a response variable and a smoothed non-parametric function of covariates 17 . Because the shape of the smoothed functions is determined by the data rather than a parametric function, GAMs can estimate complex, nonlinear spatial patterns that are not accounted for by covariates 18 . Like GLMs, GAMs use link functions to accommodate response variables with normal or non-normal error distributions (e.g., binomial, Poisson), making it conceptually simple to extend occupancy models to estimate non-linear effects of covariates 19 . Although GAMs have been used to estimate species distributions in a number of contexts 8,20 , this approach has not been widely used in occupancy-based SDMs that account for both temporal dynamics and imperfect detection.
We propose a new model of space-time occupancy that uses a basis-function formulation and allows the basis coefficients, and thus the spatial patterns, to evolve dynamically over time. This formulation is consistent with other dimension-reduction approaches for space-time modeling 21 . The model is flexible and can accommodate data from a range of sampling designs that provide information about both occupancy and detection. Output from the model can be used to create distribution maps and to estimate intuitive indices of range shifts (range center, range limits). We demonstrate the utility of this approach by modeling long-term (1972-2015) range dynamics of 10 eastern North American birds using data from the North American Breeding Bird Survey. For this application, we also use Gibbs variable selection 22 to identify species-specific relationships between climate and occupancy. We anticipate this framework will be particularly useful for modeling species' distributions over large spatial scales and for quantifying range dynamics over long temporal scales because of the improved fit to complex species distributions.

Methods
Model description. We assume that j = 1, 2, … J temporally or spatially replicated presence/absence surveys are conducted in t = 1, 2, …, T primary periods at i = 1, 2, …, N sampling locations. Further, we assume that the true (but latent) occupancy state of each site, denoted z i,t , is closed within each primary period but can change across primary periods. During each survey, the observed occupancy state of the focal species, denoted h i,j,t , is recorded (0 = species not observed, 1 = species observed). Our primary aim is to model temporal and spatial variation in the probability of occupancy ψ i,t = Pr(z i,t = 1) while accounting for imperfect detection. Below, we describe a Bayesian state-space formulation of this model that uses smoothing splines to model complex spatial and temporal auto-correlation in occupancy probability.
State model. In each primary period t, occupancy state is modeled as a Bernoulli random variable with probability ψ i,t : , , where f t (lat i , lon i ) is a spatial smoothing function, β is a vector of slope coefficients, and X i,t is a matrix containing covariate values for route i in period t. The smooth function f t is composed of basis functions g k and their corresponding regressions coefficients ν k,t : Different smooth functions can be chosen based on the structure of the data 17,18 , providing a flexible and efficient means to model complex spatial variation in occupancy probability. The basis dimension K should be large enough to approximate the smooth function (i.e., avoid over-smoothing), though the exact choice of K is not critical because the degree of smoothing is determined primarily by a smoothing penalty term λ which penalizes against over-fitting 17 . In a Bayesian context, this penalization can be incorporated by specifying multivariate normal priors for the ν k coefficients, with the precision matrix proportional to λ. Larger values of λ produce more constrained priors and thus more similar (i.e., more smooth) estimates of the ν k coefficients 17 .
In combination with time-varying covariates X i,t , allowing the basis function coefficients to vary across primary periods allows occupancy probability at each site to change over time. When t = 1, the ν k,1 coefficients are estimated using the Bayesian penalization approach described above. To account for temporal auto-correlation in occupancy, the basis function coefficients in periods t > 1 were modeled as temporally-correlated random effects: where σ 2 is the variance among primary periods.
, , Covariates thought to influence the detection process (observer bias, weather, etc.) can be incorporated into this structure using a logistic link function on p t (see below).
Distribution maps and indices of range dynamics. When latitude, longitude, and annual covariate values are available at unsampled locations within a species' range, posterior distributions of the predicted occupancy probability at those locations can be easily estimated from the posterior samples of the fitted model 10,12 . Posterior estimates of range-wide occupancy probability can be used to visualize changes in species' distributions and to quantify indices of range dynamics 10 . Although many such indices are possible, we describe four that may be particularly relevant to quantifying range shifts. First, the mean occupancy probability of all map cells provides an index of changes in the proportion of area occupied 10 . Second, the mean breeding latitude, estimated as the sum of the cell latitudes weighted by their occupancy probabilities and divided by the total occupancy probability across all cells, can be used to quantify shifts in the center of species range 10 . Finally, annual indices of the northern/southern range limits can be estimated by sorting the map cells by latitude and then using a smoothing spline function to predict the latitude below/above which 99.9% of the total occupancy probability is located. Although not an absolute measure of the northern-and southern-most latitudes at which a species was found, this index provides a time series of relative change in the northern and southern range limits, which can be used to determine whether distributions have expanded or contracted over time. We selected these species because they exhibit a wide range of spatial and temporal complexity in range dynamics. Data for this analysis came from the North American Breeding Bird Survey (BBS), a large-scale citizen science program consisting of over 5500 roadside survey routes of which approximately 3000 are surveyed each May or June by trained volunteers 23 . The BBS was initiated in 1966, though only a small number of routes were surveyed during the early years. For this reason, we chose to use BBS data collected from 1972 to 2015 24 . Trained observers conduct 3-minute point counts at 50 regularly spaced stops along each approximately 39.4 km-long route. See 23 for more details regarding the BBS survey protocol. Prior to analysis, we converted the raw counts to stop-level presence/absence data.

Application.
To model spatial/climate relationships across the edge of each species' occupied range, we subset all BBS routes with at least one detection of the focal species over the study period (i.e., routes occupied in at least one year). Next, we created a 2°-buffered convex hull around the occupied routes and included all routes within the buffered region.
Climate data was obtained from the University of East Anglia Climate Research Unit (CRU) 25 . The CRU data contains global estimates of monthly surface climate variables for 0.5° grid cells 25 . Following Clement et al. (2016), we converted the monthly temperature and precipitation estimates from the CRU data set into five 'bioclim' variables that have low correlation and are effective for modeling species ranges 1 . Specifically, for each grid cell we calculated the mean temperature, mean diurnal temperature range, mean temperature of the wettest quarter, annual precipitation, and precipitation of the warmest quarter for the 12 months preceding each BBS survey (i.e., June-May). All estimates were obtained using the 'biovars' function in the 'dismo' package 26 in program R 27 . Prior to analysis, each variable was scaled to mean = 0 and standard deviation = 1 and we extracted the annual climate values for the grid cell containing the first stop of each BBS route.
Because BBS routes are surveyed a single time each year, conventional methods for estimating detection probability from temporally replicated surveys are not possible. Instead, we adapted an occupancy model that uses the correlation between adjacent spatial replicates to estimate detection probability 28,29 . In this model, occurrence is estimated at two scales: (1) the route-level (i.e., z i,t ), and (2) the stop-level (denoted y j,i,t |z i,t ). Hereafter we follow Clement et al. (2016) and refer to presence at the route-level as "occupancy" and presence at the stop-level as "availability". Digital records of the raw 50-stop BBS data are only available from 1997-present. However, 10-stop summaries (sum of counts from stops 1-10, 11-20, 21-30, 31-40, and 41-50) are available for the entire BBS period 24 . Initial testing indicated that estimates of the route-level occupancy did not differ when the model was fit using the full 50-stop data or the 10-stop summaries (i.e., 5 replicates per route). Therefore, we chose to use the 10-stop data so that inferences could be made over the entire BBS time series.
State model. Spatial and annual variation in route-level occupancy probability was modeled using a modified version of Eq. 1: where ω is a vector of binary indicator variables determining whether each climate predictor is included in the model and X i,t is a matrix containing the annual climate values in year t at route i. Estimation of ω is described below. To capture non-linear relationships between climate and occupancy, the matrix X contained both the linear and quadratic terms for each of the 5 climate covariates. For the spatial smooth, we used a thin-plate (2019) 9:12805 | https://doi.org/10.1038/s41598-019-48851-5 www.nature.com/scientificreports www.nature.com/scientificreports/ regression spline of the latitude and longitude of each BBS route. For all species, we chose k = 60, which initial tests indicated was large enough to approximate the smooth function for all species considered.
At the stop level, availability is modeled as a first-order Markov process with parameters: • θ i,j,t = Pr(availability at stop j|route i occupied and stop j − 1 unavailable) • θ' i,j,t = Pr(availability at stop j|route i occupied and stop j − 1 available) As noted by 28, the first stop on each BBS routes has no predecessor and thus availability at stop 1 cannot be modeled using θ or θ'. Instead, we directly estimated the probability π t that the first stop is available in year t: For the remaining stops, availability was modeled as: Observation model. Numerous factors could influence detection probability in BBS surveys. Observer experience and variation among observers are both known to influence detection of many species 30 . Weather conditions, particularly wind speed, may also influence detectability. For our analysis, we included wind speed scores recorded at the start of each BBS count using the Beaufort wind scale 23 . Between 2009 and 2015, a small number of BBS routes (n = 106) were surveyed using a modified protocol (RPID = 501) that incorporated time and distance information 31 . Compared to the standard BBS protocol (RPID = 101), this time-distance protocol resulted in on average 10% fewer observations per survey (Sauer et al.) 32 . To account for these effects, we modeled detection conditional on stop-level availability and route-level occupancy as: , , , , where α 0 is an intercept term, WIND i,t is the wind score, I i,t is a binary dummy variable indicating whether year t was an observer's first year of service, κ i,t is a dummy variable indicating the survey protocol used (0 = standard BBS survey, 1 = time-distance protocol), and η i,t is a random observer effect.
Model selection. Given the large number of climate predictors in our model and the lack of a priori hypotheses about which predictors should influence the distribution of each species, each climate variable m was multiplied by a binary latent variable ω m which determined whether the variable was included in the linear predictor. The posterior probability Pr(ω m = 1) is then a measure of the relative importance of variable m 22,33,34 . For the linear effect of each climate variable, we assumed mutually independent Bernoulli priors: For the quadratic terms, we enforced marginality by setting: Modeling fitting and indices of range shifts. We fit the models in JAGS 35 called from R using the jagsUI package 36 . As described above, we specified multivariate normal priors for the GAM smooth coefficients, Bernoulli priors for the indicator variables, and vague normal priors for the β coefficients. For all other parameters, we specified appropriate vague priors. See Data S1 for model code and specification details. We assessed goodness-of-fit using posterior predictive checks (PPC). Because conventional PPC metrics are inappropriate for binary occupancy data 19 , we used each posterior estimate from the fitted model to simulate the expected number of routes with each of the 32 possible detection histories in each year. We used a Freeman-Tukey statistic 37 to measure the discrepancy between the observed/simulated and predicted detection history frequencies and we report the Bayesian P-value from these tests.
For each species, we created annual distribution maps and range shift indices using the climate covariate values, latitude, and longitude for each 0.5° raster cell within the same buffered convex hull used to subset BBS routes. Posterior distributions of the predicted annual occupancy probability in each cell were estimated using (2019) 9:12805 | https://doi.org/10.1038/s41598-019-48851-5 www.nature.com/scientificreports www.nature.com/scientificreports/ the posterior samples for each model parameter. From these distributions we then estimated posterior distributions for the four indices described above (proportion of area occupied, mean breeding latitude, and northern/ southern range limits).

Results
Posterior predictive checks did not evince lack of fit for any species (Red-bellied Woodpecker p = 0. 35 26), indicating that the spatial GAM was able to model both complex and relatively simple distributions. For example, species like Fish Crow and Swainson's Warbler have complex spatial distributions that do not exhibit simple linear or quadratic relationships with latitude and/or longitude. Occupancy probabilities for the Fish Crow were high along the Atlantic and Gulf coasts as well as in the southern Mississippi River valley, resulting in a U-shaped distribution with no clear range center (Fig. 1A). The distribution of Swainson's Warbler was also complex, with three distinct areas of high occupancy and low occupancy in between (Fig. 1B). In contrast, the distributions of some species, including Red-bellied Woodpecker and Carolina Chickadee, were less complex, with a large central area of high occupancy with declining occupancy along the periphery of the range (Fig. 1C,D).
The model was also able to quantify temporal changes in occupancy probability over the 43 year time period of the BBS data, revealing similarities and differences in regional dynamics in several species. For example, Louisiana Waterthrush, Kentucky Warblers, Wood Thrush, and Golden-winged Warblers all experienced large declines in occupancy probability in the eastern portions of their range, especially in the northeastern United States and Appalachia, but were relatively stable or increasing in the mid-western United States (Fig. 2). These species differed, however, in occupancy trends in the southeastern United States, with Louisiana Waterthrush and Kentucky Warblers showing modest increases in occupancy probability and Wood Thrush and Golden-winged Warblers declining in occupancy probability.
Indices of range shifts from our model indicate that some species have undergone distributional shifts over the past four decades. For example, the northern range limit of Blue-gray Gnatcatchers shifted northward by 1.2° latitude and the mean breeding latitude shifted northward by 1° (Fig. 3A,B). Interestingly, this species has shown small (0.2°) southward expansion at the southern edge of its range and as a result, the proportion of area occupied has increased over time (Fig. 3C). The indices were also able to capture transient dynamics in distributional shifts. Northern populations of Carolina Wren, for example, experienced large declines in occupancy probability in the late 1970's, resulting in a contraction of the northern range limit and mean breeding latitude by 1° latitude (Fig. 3E,F). This contraction was temporary, however, with these populations subsequently experiencing a sustained northward expansion extending 1.7° beyond their initial northern range limit (Fig. 3D). www.nature.com/scientificreports www.nature.com/scientificreports/

Discussion
The distributions of most species are characterized by complex and dynamic variation in occurrence. Species distribution modeling seeks to relate this variation to environmental covariates and extrapolate these relationships to unsampled sites and times. Because habitats and species-habitat relationships change across both space and time, conventional GLM-based models rarely capture the inherent complexity of species distributions, especially when inferences are made across large spatial or long temporal scales. Here, we demonstrate a novel occupancy-based SDM that combines environmental predictors with a spatial GAM to model covariate relationships and complex, non-linear spatial variation in occupancy probability while accounting for imperfect detection. Application of this model to 10 North American bird species demonstrates the utility and flexibility of our approach for species' distribution modeling. Unlike parametric models that use low-order polynomials to capture spatial variation in occupancy, the spatial GAM derives non-parametric occupancy patterns during estimation while avoiding overfitting through the incorporation of a smoothing penalty term. This balancing of complexity and smoothing ensures that the model allows for, but does not impose, complex spatial variation in occupancy. This formulation provides an efficient and flexible method to model large-scale spatial variation in occupancy probability, for example allowing us to model both the complex distributions of Fish Crow and Swainson's Warbler and the relativity more simple distributions of Carolina Chickadees and Red-bellied Woodpeckers using a common model structure.
Modeling the GAM coefficients as temporally-correlated random effects also allowed us to explicitly model changes in occupancy probability over time. In our analysis of BBS data, the model uncovered interesting similarities and differences in the range dynamics of several species that inhabit forest habitats in the eastern United States, including Louisiana Waterthrush, Wood Thrush, Kentucky Warblers, and Golden-winged Warblers. In the case of Wood Thrush and Golden-winged Warblers, this regional variation in occupancy dynamics is consistent with differences in demographic rates among the regions 38,39 , suggesting that our occupancy model was able to capture spatial variation in population dynamics.
The indices of latitudinal range dynamics from our model also documented range expansions of Blue-gray Gnatcatchers and Carolina Wrens, demonstrating the utility of these metrics for quantifying range shifts over long temporal scales. The ability to document shifts at range margins while accounting for imperfect detection is particularly important given that these locations are likely to experience the largest changes in occupancy but also have the lowest detection probability. Application of this framework to a larger pool of species could indicate whether range shifts provide a consistent fingerprint of climate change. It may also be possible to quantify range shifts of entire groups of species by creating composite versions of our indices, which would be particularly useful for testing hypotheses about which traits promote or impede the ability of species to respond to habitat and climate change.
Our model differs from the conventional dynamic occupancy model in that we did not directly estimate change in occupancy as the result of extinction/colonization processes. Under the extinction/colonization formulation, occupancy probability at a given location i will converge on the stable-state occupancy distribution defined by ψ =