Semiparametric model selection for identification of environmental covariates related to adult groundfish catches and weights

Ecologists and fisheries managers are interested in monitoring economically important marine fish species and using this data to inform management strategies. Determining environmental factors that best predict changes in these populations, particularly under rapid climate change, are a priority. I illustrate the application of the least squares-based spline estimation and group LASSO (LSSGLASSO) procedure for selection of coefficient functions in single index varying coefficient models (SIVCMs) on an ecological data set that includes spatiotemporal environmental covariates suspected to play a role in the catches and weights of six groundfish species. Temporal trends in variable selection were apparent, though the selection of variables was largely unrelated to common North Pacific climate indices. These results indicate that the strength of an environmental variable’s effect on a groundfish population may change over time, and not necessarily in-step with known low-frequency patterns of ocean-climate variability commonly attributable to large-scale regime shifts in the North Pacific. My application of the LSSGLASSO procedure for SIVCMs to deep water species using environmental data from various sources illustrates how variable selection with a flexible model structure can produce informative inference for remote and hard-to-reach animal populations.

www.nature.com/scientificreports/ concentrated on deeper-dwelling species 18,19 . While focus on the effect of various climate modes has dominated ecological research on fishes in the North Pacific region, relationships between the marine environment and atmospheric trends are nuanced and may involve complex lagged effects, particularly for deepwater populations 20 .
Organisms that inhabit the deep ocean are also problematic to study, as they are not adapted to surface-level conditions and prove difficult to sample and keep alive, making experiments in laboratory conditions impossible or prohibitively expensive. Determining which specific environmental variables contribute to fluctuations in the populations of these species from observational data would represent major progress in discerning the impact of climate variability on marine ecosystem health and how those changes affect the economy and food security. A model structure able to accommodate a suite of environmental variables that vary spatiotemporally would be necessary to examine effects of many environmental covariates on deepwater marine populations simultaneously. Consider the single index varying coefficient model (SIVCM) of the form where y i is the response, Z = (z 1i , z 2i , . . . , z ti ) T and X = (x 0i , x 1i , . . . , x pi ) T with x 0i = 1 are predictor variables, θ 0 is a vector of unknown coefficients representing the single-index direction, G(·) = (g(·) 0 , . . . , g(·) p ) T are nonparametric coefficient functions, and ε are the random errors 21 . By setting Z to longitude, latitude, and time triplets, the SIVCM is a convenient structure for incorporating spatiotemporal effects for multiple environmental predictors in X. Spatiotemporal variation among values is incorporated through three-dimensional functions based on spline smoothing [22][23][24][25][26][27] , accounting for spatial and temporal autocorrelation and improving prediction and inference 28,29 . Similar structures have been widely used in the related spatially-varying coefficient models 30,31 and temporally-varying coefficient models 32 , including applications in forestry 33 , ecology 34 , and economics 35,36 . Spline-based methods, which are often used to estimate G(·) in SIVCMs 37,38 , are more robust for spatially correlated data and do not require the spatial variation to be specified by a functional form 39,40 . For such models, selection of important predictor variables in X is typically of interest. Forward selection, backwards elimination, and stepwise selection methods are unstable for models with many predictors and even with advancements to the algorithms, these methods are considered sub-optimal for variable selection, particularly for high-dimensional models 41 . Penalty-based regression procedures, such as ridge regression and least absolute shrinkage and selection operator (LASSO) estimation, penalize large regression coefficients to reduce overfitting. LASSO additionally performs variable selection by penalizing small regression coefficients to zero, effectively removing these coefficients from the model 42 . LASSO works particularly well for models with many predictors because it shrinks large coefficients to zero rather than minimizing them, and it is computationally efficient 43 . Group LASSO incorporates information about groupings of variables into the penalty function, which is particularly important for categorical predictor variables 44 . While selection for varying coefficient models (VCMs), a lower-order relative of the SIVCM, have built on both the smoothly clipped absolute deviation (SCAD) and LASSO approaches [45][46][47][48] , selection procedures of the single-index direction coefficients or the functions in SIVCMs have primarily used SCAD penalties [49][50][51] . SCAD procedures are unbiased, but they are sensitive to initial estimation and parameter tuning 48 . LASSO procedures are typically simpler to implement than SCAD, and group LASSO correctly selects important variables for VCMs where the number of dimensions far exceeds the number of observations 52 . In this analysis, I used a combination of least squared-based spline estimation and group LASSO (LSSGLASSO) proposed by Sun et al. 53 to select coefficient functions and estimate the index parameters in a SIVCM of spatiotemporally-varying environmental covariates potentially contributing to changes in groundfish populations in the North Pacific Ocean. With this application, I aimed to establish relevant environmental factors that influence populations of focal groundfish species in this region.

Methods
Annual surveys of several groundfish species are taken at established locations in the waters along the coast of Alaska by the Alaska Fisheries Science Center (AFSC), a division of the National Oceanic and Atmospheric Administration (NOAA). Catch per unit effort (CPUE), also referred to as catch rate, and mean weight in kilograms of six groundfish species determined at each location for each survey year were obtained for years 1979-2013 54,55 . The six groundfish species of focus in this analysis were Pacific cod (Gadus macrocephalus), Pacific halibut (Hippoglossus stenolepis), sablefish (Anoplopoma fimbria), rougheye rockfish (Sebastes aleutianus), shortraker rockfish (Sebastes borealis), and shortspine thornyhead (Sebastolobus alascanus). Air temperature in degrees Celsius (ATMP), sea level pressure in hPa (PRES), wind speed in meters per second averaged over eight-minute periods (WSPD), sea surface temperature in degrees Celsius (WTMP), and the average height in meters of the highest one-third of all waves in 20-minute sampling periods (WVHT) measured daily from buoys in the Gulf of Alaska were obtained from the National Data Buoy Center and summarized by monthly means 56 . Temperature in degrees Celsius measured at the sea floor (hereafter bottom temperature) was obtained from the AFSC Resource Assessment and Conservation Engineering (RACE) Division's bottom trawl surveys 57 . Zooplankton biomass volume given in number per cubic meter (hereafter plankton) were obtained from the NOAA's Coastal and Oceanic Plankton Ecology, Production, and Observation Database 58 . Alkalinity (Alk), chlorophyll (Chl), nitrate (NO3), dissolved oxygen (Oxy), phosphate (Phos), and silicate (Sil) concentrations at depths of 75, 400, and 900 meters were obtained from the NOAA's World Ocean Database 59 .
Since environmental data can include hundreds of predictor variables with potential multi-collinearity, a practical method for reducing variables before performing group LASSO is necessary for the LSSGLASSO procedure to be computationally feasible. Therefore, backward variable elimination of explanatory environmental variables using computed variance inflation factors (VIF) was performed. The VIF for explanatory variable i was calculated as 1/(1 − R 2 i ) where R 2 i was the correlation coefficient of the linear model with variable i regressed against all other explanatory variables. VIFs for all explanatory variables were computed, and the variable with www.nature.com/scientificreports/ largest VIF that exceeded a value of 5 was removed 60,61 . The process was repeated after each removed variable, until no variables had a VIF > 5 . None of the environmental variables considered for inclusion in the SIVCM had a VIF > 5 . The max VIF for any explanatory variables included in the final model was 2.64. Measurements of environmental variables were not recorded at the exact same locations across all years for which groundfish surveys were performed, thus spatiotemporal interpolation via inverse distance weighting was used to obtain monthly environmental measures for 1979-2013 at the precise locations where MESA surveys were conducted 62,63 . For all environmental variables, a seasonal amplitude was then calculated for each survey year for all locations. For physical variables ATMP, PRES, WSPD, WTMP, WVHT and bottom temperature, seasonal amplitude was defined as the mean of June, July, and August averages minus the mean of December, January, and February averages. This was owing to the fact that temperatures maximize in the summer and minimize in winter and winter storms enable strong mixing of ocean nutrients 64,65 . Seasonal amplitudes for chemical and biological variables including plankton, Chl, Alk, NO3, Oxy, Phos, Sal, and Sil were calculated as the mean of August, September, and October averages minus the mean of March, April, and May averages. Zooplankton biomass increases in May after winter vertical mixing of deepwater nutrients and benefits from spring phytoplankton blooms but has not yet been depleted by grazing from summer-migrating pelagic fishes and cephalopods 66,67 . Therefore nutrients are generally maximal in the spring and minimal in fall after depletion by phytoplankton, whereas zooplankton and chlorophyll would be maximal in fall after nutrient consumption and growth over the summer in the North Pacific region [68][69][70] .
The AFSC groundfish surveys sample only adults 55 , and groundfish recruitment is related to the effects of physical and biological variables on early life stages 71 . Therefore, all environmental predictors were lagged based on the sexual maturity of the focal groundfish. Sablefish and Pacific cod female maturity is reached around 5 years old 72,73 . Average age of Pacific halibut females that have reached sexual maturity is 12 years, while for males reaching maturity the average age is 8 years 74 . Rockfish sexual maturity is typically attained from 3 to 7 years of age 75 . Therefore, lags of 5 years were used for environmental variables in models with sablefish, rockfish, or Pacific cod CPUEs and weights as responses. Lags of 10 years were used for environmental variables in models with responses of Pacific halibut CPUE and weight. These lags were also supported by the findings of Sun et al. 53 .
A SIVCM of the form given in (1) was fit, where y i was the CPUE or mean weight of a groundfish at each location for each year, X = ( x 0i , . . . , x 25i ) T with x 0i = 1 and x 1i , . . . , x 25i were the lagged seasonal amplitudes of environmental variables described previously, and Z = (z 1i , z 2i , z 3i ) T were the longitude, latitude, and year for each observation. Selection of important variables was performed per year and over all years using the LSSGL-ASSO procedure 53 . The choice of knots for the B-spline approximation of the coefficient functions and the optimization of the tuning parameter in the LSSGLASSO procesure are covered in detail in Sun et al. 53 The optimal tuning parameter was chosen using a BIC-type selection criterion, while the number of knots was fixed at eight. The number of knots can be chosen using a selection criterion such as GCV or BIC, but since basis expansion was only used for selection of the coefficient functions and every function was re-estimated using local linear regression, the choice of the number of knots does not affect the results 53 . The performance of the group LASSO procedure was tested against procedures where the group LASSO penalty was replaced by either a group LASSO with ridge penalty or a group SCAD with ridge penalty, and there were no appreciable differences between the three selection procedures. Heat maps were used to visualize selection of variables for each groundfish species. After selection, the SIVCM for all years of sablefish CPUE was refit with scaled y, X, and Z using only variables selected by LSSGLASSO. The selected coefficient functions for important predictors of sablefish CPUE were plotted to provide a detailed example of interpretation for the SIVCM functions.
To further explore if regional climate conditions influenced the selection of variables each year, I considered potential relationships between the selection of a variable over time and three climate indices that influence environmental systems in the North Pacific region. The Pacific Decadal Oscillation (PDO) monthly index, multivariate El Niño/Southern Oscillation bi-monthly index (MEI), and North Pacific Gyre Oscillation (NPGO) monthly index were obtained for 1979-2013 [76][77][78] . Seasonal amplitude for climate indices was defined for each year as the mean of June, July, and August minus the mean of December, January, and February, since climate indices typically describe physical environmental conditions that contribute to temperature and pressure changes affecting ocean nutrient mixing 79,80 . Logistic regression was then used to fit the model g(E(y i )) = β 0 + β k x ki for years i = 1985, . . . , 2013 for each variable within a groundfish response, where g(·) was a log-link function; β 0 was the intercept term; β k were linear coefficients, k ∈ {PDO, MEI, NPGO} ; x ki were seasonal amplitudes of index k lagged by 5 years when considering the responses of all groundfish except those of Pacific halibut, which were lagged 10 years; and y i were binary indicators of whether a variable was or was not selected for each year. A Chi-square test was used to compare the models to an intercept-only model. P-values from the Chi-square test were adjusted within groundfish response to control for the false discovery rate (FDR) of multiple testing using the Benjamini-Hochberg procedure 81 .
Finally, it was necessary to determine whether PDO, MEI, or NPGO were good predictors of groundfish catches and weights and if the effect of a climate index on the selection of variables was related to that index's direct relationship to the groundfish populations. I fit generalized additive models (GAMs) of the form with lagged seasonal amplitudes of PDO, MEI, and NPGO as three additive smooth predictors and groundfish CPUEs or weights as responses y i 26 . Relationships of climate to groundfish responses cannot be assumed to be linear [82][83][84] . Therefore, the GAM structure was preferred over a generalized linear model, because it permits nonlinear relationships between predictors and the expected response. P-values were estimated for each smooth predictor in each model and plotted in a heatmap for efficient visualization.
A wide array of environmental variables were selected for rougheye rouckfish CPUE consistently and almost exclusively in years 1991, 2001, 2011, and 2013, whereas 1990 was the only year when more than one variable was selected for rougheye rockfish weight (Fig. 4). The most often selected variables related to rougheye rockfish CPUE were WSPD, Oxy 900m, Sil 900m, plankton, Chl 400m, and Oxy 400m, while Chl 400m, Sil 900m, and Oxy 900m were important when all years were considered together. In the case of shortraker rockfish, most environmental variables were selected as important for both CPUE and weight in 2000 and 2007 (Fig. 5, purple). WSPD, Chl 400m, Oxy 400m, and Oxy 900m were most often selected over time for shortraker rockfish CPUE, while 900m depths of NO3, Sil, Phos, and Oxy, 75m depths of Alk, Chl, NO3, Oxy, and Sal, and bottom temperature www.nature.com/scientificreports/ were most often selected as important to shortraker rockfish weight over all 29 years (Fig. 5). ATMP, WVHT, and Phos 75m were shared as selected variables for shortspine thornyhead CPUE and weight, and many environmental variables contributed to shortspine thornyhead weight in 1986-1987, 1998, 2000, and 2004 (Fig. 6). ATMP was selected most often for shortspine thornyhead CPUE, while WVHT, Chl 400m, and Phos 900m were consistently selected as important to shortspine thornyhead weight over the 29-year period. Only Pacific cod CPUE, Pacific halibut CPUE, sablefish CPUE and weight, and rougheye rockfish CPUE had variables selected as important predictors when performing selection on data including all available years (Figs. 1,  2, 3, 4). Alk 900m was a common selected variable for Pacific cod and sablefish CPUEs for all years included in the selection procedure, while WTMP was selected for both sablefish and Pacific halibut CPUEs for all years.
I plotted the selected coefficient functions of WTMP, Alk 400m, Alk 900m, Sil 75m, and Sil 900m for sablefish CPUE using all years of data to provide a detailed example of interpretation of the SIVCM coefficient functions (Fig. 7). Parameters representing the single-index direction estimated by the LSSGLASSO procedure were θ T = (0.9346, −0.3380, −0.1109) . The relationship between sablefish CPUE and space-time is highly nonlinear, as seen in the intercept function ( g 0 ). For a given location and year, we can express the single-index direction as θ T Z = 0.9346z 1 − 0.3380z 2 − 0.1109z 3 , where z 1 is longitude, z 2 is latitude, and z 3 is the year. For θ T Z to increase, either longitude increases (moving eastward) or latitude decreases (moving southward). As time moves forward, i.e. year increases, θ T Z decreases. Fixing two variables in Z while moving the third allows us to consider how each important variable in X affects the response as θ T Z changes. For example, for a fixed latitude and year, increasing longitude leads to increasing θ T Z , which was associated with increasing, decreasing, and increasing effect of WTMP on sablefish CPUE ( g 1 in Fig. 7). So for low and high (east and west) longitude values within the range of the data, increasing WTMP was associated with increasing sablefish CPUE for fixed latitude and year. For medium longitude values, increasing WTMP was accompanied by decreasing sablefish CPUE, ceteris paribus. This same nonlinear relationship was also observed for Sil 900m on sablefish CPUE (Fig. 7, g 5 ). Alternatively for www.nature.com/scientificreports/ a fixed longitude and year, decreasing latitude (moving southward) precipitates an increase in θ T Z , which was associated with an increasing and then decreasing effect of Alk 400m on sablefish CPUE ( g 2 in Fig. 7). Hence for southern locations with fixed longitude and year, increasing Alk 400m was associated with decreasing sablefish CPUE. As another option for interpretation, increasing year while fixing longitude and latitude leads to a decrease in θ T Z . Therefore for a fixed longitude-latitude pair, increasing years was associated with increasing, decreasing, then increasing effects of Alk 900m on sablefish CPUE (Fig. 7, g 3 ). Thus in early and late years (e.g. 1985-1988 and 2010-2013) for a fixed location, increasing Alk 900m was related to increasing sablefish CPUE. Lagged seasonal amplitude of NPGO was related to the selection or non-selection of PRES, WSPD, BOT TEMP, Alk 75m, Alk 400m, Chl 75m, Chl 900m, NO3 75m, NO3 900m, Oxy 75m, Oxy 900m, Phos 900m, Sal 75m, Sal 400m, Sil 75m, and Sil 900m for shortraker rockfish weight (Fig. 8). Seasonal amplitudes of PDO, MEI, and NPGO climate indices in the North Pacific were not related to the selection of any environmental variables for any other groundfish responses.
Lagged seasonal amplitudes of PDO, MEI, and NPGO were significant predictors of Pacific cod CPUE, while PDO and MEI were not significant predictors for any of the other groundfish responses (Fig. 9). Lagged seasonal amplitudes of PDO, MEI, and NPGO provided a moderate fit for Pacific cod CPUE (deviance explained= 0.581 ). Lagged NPGO was also a good predictor of rougheye rockfish CPUE and provided a good fit to the data (deviance explained= 0.641).

Discussion
Deepwater species such as the groundfish examined in this study are difficult to monitor regularly and study extensively, so determining environmental variables that most accurately predict population health of these species is critical. In this study, I illustrated LASSO-type variable selection on the SIVCM capable of including spatiotemporal covariates for modeling large-scale population data, identifying the most significant factors affecting catch and mean weights of groundfish, and determining the functional relationships of these important predictors to the response for inference. Though the environmental variables considered in this application were collected in situ, the SIVCM with group LASSO-type selection would be ideal for spatially or spatiotemporally measured remotely sensed data, such as environmental satellite data. www.nature.com/scientificreports/ CPUE and mean weights of groundfish contribute to calculations of relative population numbers and weights which are used to inform management and policy decisions 85 . The use of relevant environmental covariates influencing groundfish catch and weight can inform spatiotemporal modeling of groundfish distributions and aid prediction of distribution shifts due to climate change 86 . This knowledge is particularly urgent for predicting the effects of climate change and devising adaptive strategies for vulnerable populations such as the three rockfish species included in this study, stocks of which are highly sensitive to climate change in the eastern Bering Sea 87 .
Increased growth rates in juvenile sablefish with warmer water temperatures have been observed in laboratory settings 88,89 and off the coast of Oregon, where early growth was related to recruitment 90 . Increased sablefish recruitment may be reliant on greater stratification developed by warmer surface water temperatures 91 , while warmer deep waters during the sablefish egg stage combined with colder surface waters during the pelagic larvae stage have been related to higher recruitment 92 . The importance of sea surface temperature to sablefish CPUE across all years established in this analysis is therefore in agreement with previous experimental and observational studies and stresses the importance of surface water temperatures on the success of juvenile sablefish 90 . Sogard 90 presumed faster sablefish growth occurred due to high overall productivity reflected in the PDO index, however in this analysis there was no direct relationship between PDO and sablefish CPUE or mean weight, nor was PDO found to be related to to the selection of any environmental variables for sablefish CPUE or weight. However, the selection of silicate as important for sablefish CPUE may point to a relationship of primary production on sablefish, as silicate is an important regulator of phytoplankton growth in Alaskan marine waters 93 . The selection of air temperature and wind speed as important across all years for sablefish weight is consistent with the relationships of transport and of July winds during juvenile years to recruitment 92,94 and suggests the importance of air and surface water movements on early-stage sablefish.
Chlorophyll and sea surface temperature were important across all years for Pacific halibut CPUE, along with air temperature, dissolved oxygen, and wave height. These findings are corroborated by previous observations of Pacific halibut. Temperature influences growth on Pacific halibut juveniles, which exhibit compensatory growth www.nature.com/scientificreports/ in response to periods of temperature-limited growth 95,96 . Bioenergetics model simulations and diet data suggest that adult growth may also be limited by temperature and prey quality or availability 97 . Pacific halibut weight data were no longer collected after 1994 by the AFSC surveys, so a consistent, long term signal in the data to determine important variables related to weight of Pacific halibut was unlikely. However, growth in early years contributes to year-class abundance for Pacific halibut 95 . The findings of chlorophyll, air temperature, and wave height as important to Pacific halibut CPUE point to sensitivity of Pacific halibut to bottom-up forces in the North Pacific Ocean 98 . Chlorophyll is an important indicator of changes in the carbon transport system and plankton biomass vital for energy transfer to higher tropic levels 69,99 , while air temperature and wave height influence the duration of wind-mixing events and change the depth of the mixed-layer, affecting primary and secondary production 7,100-102 . A relationship between dissolved oxygen and Pacific halibut catch has also been established previously, with Pacific halibut likely adjusting to changes in oxygen environments through lateral rather than vertical movements 103 . Variables selected as important to specific groundfish in this work that have not previously been explored encourage further investigation. For example, only deep water alkalinity contributed to Pacific cod CPUE across all years of available data, but Pacific cod CPUE was also related to PDO, MEI, and NPGO indices. While water temperatures have been related to interannual depth changes, range shifts, and population declines of adult Pacific cod [104][105][106] , juvenile Pacific cod may respond to drivers other than temperature 106 . Alkalinity, as part of the ocean's carbonate system, moderates seasonal pH variability 107 . Further, increased variability in the interannual carbonate system in Pacific Ocean is also attributed to variability in the PDO and ENSO 108 . Therefore the selection of alkalinity as important to Pacific cod catch may correspond to several climate modes strongly dictated by the carbonate system and closely correlated to variability in alkalinity. Then again, the importance of alkalinity may instead represent a sensitivity of early stage Pacific cod to ocean acidification, as increased atmospheric carbon dioxide changes alkalinity levels 109,110 .   www.nature.com/scientificreports/ Another example of previously unexplored environmental variables found in this study to be related to groundfish are the important variables for rougheye rockfish CPUE across all years, which were exclusively from deeper waters (>300 m). Since rockfish are viviparous, exposure to near-surface environment during development is eliminated. Rougheye rockfish CPUE was best predicted by NPGO seasonal amplitudes, while chlorophyll, dissolved oxygen, and silicate were all important for rougheye rockfish CPUE over all years. NPGO describes variations in salinity, nitrate, phosphate, silicate, oxygen, and chlorophyll, capturing important interannual and decadal biological patterns 4 . This suggests that the selection of variables affecting rougheye rockfish catch was the result of an existing, strong effect of nutrient cycling patterns captured by the NPGO index on rougheye rockfish populations.
Selection of important variables related to groundfish catch and weight using LSSGLASSO for SIVCMs provided important confirmation of previous findings and new avenues of research for large populations of groundfish in the North Pacific over an extended time period during which several large-scale regime shifts have been  www.nature.com/scientificreports/ reported 1,6,111,112 . While large-scale climate indices are often favored when studying populations affected by regime shifts of the northern Pacific marine system, the specific pathways of their effects on a population or community are often difficult to disentangle. Most groundfish responses had a large array of environmental variables selected for specific years, indicating that the strength of environmental effects on groundfish may vary over time. Selection of environmental variables appeared to be unrelated to seasonal amplitudes of the PDO, MEI, and NPGO climate indices for most groundfish. Large-scale climate patterns are associated with groundfish responses 1,2 , however my results highlight the insufficiency of climate modes alone to accurately describe variability found in many commercially valuable marine populations on decadal scales 10 . These results indicate the existence of several pathways of climate and other environmental effects on groundfish responses. Ecological variability in these marine systems are therefore likely driven by gradual spatial and temporal climate variability that reduces resilience, with abrupt changes in climate modes precipitating permanent changes in vulnerable systems 8 .