Internet search and medicaid prescription drug data as predictors of opioid emergency department visits

The primary contributors to the opioid crisis continue to rapidly evolve both geographically and temporally, hampering the ability to halt the growing epidemic. To address this issue, we evaluated whether integration of near real-time social/behavioral (i.e., Google Trends) and traditional health care (i.e., Medicaid prescription drug utilization) data might predict geographic and longitudinal trends in opioid-related Emergency Department (ED) visits. From January 2005 through December 2015, we collected quarterly State Drug Utilization Data; opioid-related internet search terms/phrases; and opioid-related ED visit data. Modeling was conducted using least absolute shrinkage and selection operator (LASSO) regression prediction. Models combining Google and Medicaid variables were a better fit and more accurate (R2 values from 0.913 to 0.960, across states) than models using either data source alone. The combined model predicted sharp and state-specific changes in ED visits during the post 2013 transition from heroin to fentanyl. Models integrating internet search and drug utilization data might inform policy efforts about regional medical treatment preferences and needs.


INTRODUCTION
Opioid misuse currently kills 130 Americans per day, making it a top public health concern in the United States 1 . Rates of opioidrelated morbidity and mortality continue to increase, requiring new tools and approaches to prevent overdose. For example, there have been consistent year-over-year increases in predictors of mortality, such as the number of opioid-involved emergency department visits 2,3 , and 911 calls requiring the use of naloxone or multiple naloxone administrations 4 . The epidemic is also rapidly evolving: opioid analgesics were the primary cause of overdose until 2010, but heroin (subsequently) and fentanyl (currently) have been the primary drivers of recent opioid mortality rates 5 .
Local community health care providers, first responders, and public safety systems, which are particularly impacted by the crisis, are desperate for higher quality and more real-time data to monitor the problem and intervene in a timely manner. Timely information transmission is also essential for emergency medical service providers to be better prepared for a patient's arrival and to prevent mortality 6,7 . However, there are a number of problems with current data, including lack of access to real-time surveillance at the local level 8 ; 1-year lag times in the release of data on mortality, opioid prescribing, and emergency department (ED) visits 9 ; and lack of data on reasons for regional and temporal differences in risk behaviors and mortality 10 . Taken together, new data sources and tools are needed to better monitor and predict opioid-related outcomes to save lives.
Integrating online social/behavioral data, obtained in near realtime, may help overcome some of the issues associated with traditional public health data. For example, internet search data from Google have already been found to be associated with and/ or predictive of a number of health-related outcomes, including HIV 11 , heroin-related emergency department visits 12 , suicide 13 , cardiovascular disease 14 , and syphilis 15 . Importantly, internet search data can typically be broken down by Designated Market Areas (DMA), allowing analyses to inform regional differences in risk behaviors and mortality. However, there are limitations of previous studies using internet search data to predict health outcomes. For example, previous studies using internet search data to predict opioid-related outcomes (e.g., emergency department visits for heroin) have used internet search data as the only predictor. Compared to internet searches for opioids, which are indirectly linked to opioid-related outcomes, more directly linked (medical data) sources, such as prescription drug data, might be more accurate predictors of opioid outcomes. In addition, previous work focusing on using internet search data to predict opioidrelated outcomes focused on a small number of cities, limiting the generalization of this research. The prior research was also only studied up through 2011 12 , limiting the ability to learn whether the models would be able to predict the sharp increase in opioid overdoses after 2011 that resulted from the increased use of fentanyl.
There are also obvious limitations with internet search data: it is unclear whether individuals who are searching online for opioidrelated information would act on those searches, search data by itself may not provide enough information to inform interventions, and selectivity bias influences predictive ability. Nonetheless, studies on internet search data have found high correlations between searches and public health outcomes at the local level. It may be possible to harness their contribution and overcome their limitations by combining them with clinical data, which has not yet been done.
In this study, we seek to assess whether combining internet search and prescription drug data generate improved predictions of emergency department (ED) visits, including predicting geographic and longitudinal trends in opioid-related ED visits. We attempt to predict opioid-related ED visits because they occur more frequently (provide more data) than opioid-related deaths, are strongly associated with opioid-related deaths 2,16 , and because the effective distribution of naloxone has led to a reduction in the number of opioid related deaths while rates of ED visits remain high 17 . We report on the results of the models as well as potential interpretations of the qualitative results (i.e., the specific Google searches and drugs most commonly prescribed) across geographic regions.

Statistical results
Based on R 2 and RMSE criteria, the most accurate and best-fitting models predicting one-quarters or two-quarters-ahead for every state combined both the Google and Medicaid variables in the same model ( Fig. 1 and Table 1). We can see in Fig. 1 of four different states, that the model was able to predict opioid-related ED visits with high accuracy based on Google search and drug use data with one-quarter-ahead (solid) and two-quarter-ahead (dashed) models. The values for the R 2 from these models (which range from 0.913 to 0.960 across states in the one-quarter-ahead prediction) are consistently higher than models using either set of data alone, while the RMSE values (which range from 13.48 to 361.96) are consistently lower. Although models using Google data or Medicaid data alone performed reasonably well, the onequarter-ahead prediction models with Medicaid data alone performed better (in terms of higher R 2 and lower RMSE) than the same prediction model using just Google data for all states, except two (Georgia and Minnesota). The same was true for the two-quarter-ahead prediction models, although the two states where the Google data prediction models outperformed the Medicaid data were different (Indiana and Massachusetts). Neither outperformed the model using both types of data, particularly in terms of RMSE. Moreover, our combined model was able to predict sharp changes in opioid-related ED visits tied to changes in the primary drivers of the opioid crisis from heroin to fentanyl post 2013. The ability to accurately predict these shifts represents a major benefit of combining these data.
For robustness, we also compared the performance of the LASSO-based prediction model for each state with a mixed model and pooling analysis model (one combined model for all states). Random effects in the mixed model accounted for individual state Fig. 1 Results. One-quarter-ahead (solid) and two-quarter-ahead (dashed) predictions of the ED visits for four typical states based on Google search and drug use data.

DISCUSSION
Findings underscore the need for public health agencies to integrate novel and diverse data sources and methods (e.g., combining near real-time internet search data along with traditional health care data) into their monitoring and surveillance efforts. Agencies need as much information as possible to prepare for the impact of the constantly evolving opioid epidemic. Models that incorporate social and medical data together may better prepare hospitals and health systems for the changing needs of the opioid crisis. For example, similar models could be developed to identify geographic areas likely to experience increases and/or rapid changes in the need for treatment services in response to fentanyl cases. They could also provide insights into interventions among areas with particularly high rates of HIV-related stigma or unrecognized HIV infection that might be linked to substance use 18,19 . Health departments could use these forecasts to improve linkages between hospital emergency departments and treatment providers with unused capacity, such as buprenorphine waivered doctors who are treating fewer patients than allowed by their waiver.
There are a number of more specific implications of this research. First, contrary to possible intuition, Medicaid data does not appear to be a definitively better predictor of opioid-related visits than internet search data. In fact, the model incorporating both internet search and Medicaid data together demonstrated the best performance. This is important information to motivate researchers to explore the use and integration of internet search and other social data sources into modeling efforts. The results also suggest, that, for complicated issues such as the opioid crisis, models combining diverse sources of data might be better at predicting health outcomes compared to just one source of data. Second, the proposed model was able to predict longitudinal changes in opioid-related ED visits, even in years such as 2013 where traditional health econometric models have typically not performed well. This again suggests the importance and potential of integrating social/behavioral data (e.g., internet search) along with traditional medical (e.g., prescription drug) data in epidemiological efforts. Finally, this work suggests that public health agencies should explore integrating these novel data sources and modeling methods into their opioid-related surveillance efforts.
A further advantage of this modeling approach is that it allowed us to flexibly identify different key predictors of ED visits by geographic area, and confirmed that key predictors differ across states and over time. For example, while suboxone was a commonly used search term across most states, methadone was a more common and important search term in Minnesota and Indiana. This further highlights the need to make use of modeling tools and data that accurately reflect the local experience 10 .
This study has limitations, primarily related to the data sources. We were limited by the ability to collect observational rather than individual-level data; to acquire quarterly data rather than more frequently updated data; and by only being able to include 11 states at the aggregate state level, rather than a larger number of states with the ability for finer-grained within state analysis.
Even with these limitations, findings from this study clearly demonstrate that the integration of social/behavioral and medical information is more powerful for predicting geographic and temporal changes in opioid-related ED visits than either source of information alone. Although an increasing number of studies have used social media and/or internet search data to predict public health outcomes, an ongoing criticism of such studies is that they are an indirect and possibly biased source of behavioral health information and hence will have little predictive value compared to more directly linked medical data. This analysis suggests that this may not always be the case, as our models using internet search variables alone did reasonably well predicting ED visits and even outperformed models using only medical data in a small number of states. However, the predictive power of the models combining both data sources is clearly better.
Overall, results suggest that the integration of social/behavioral data, which are often available in near real-time, combined with traditional public health data, may improve surveillance efforts compared to current methods using traditional public health data alone. Although too early to directly implement into interventions, these types of data and approaches might be further studied and used to uncover the regional trends in preferences and/or interest in different types of medication assisted therapy to help geographically target educational campaigns and interventions to regions most in need and accepting of that treatment.

Data sources and methods
From January 1, 2005 to Dec 31, 2015, we collected quarterly Google Trends data for 22 commonly used opioid-related internet search terms and phrases for all states from 2005 through 2015. The terms include opioid medications (e.g., fentanyl and hydrocodone), opioid recreational drugs (e.g., heroin), and general searches about opioids and overdose risk. The Google Trends index provides the normalized search frequency based on the relative search volume of searched keywords at a specific time. The full list of the keywords, adapted from a previous study using opioidrelated search terms 12 , is presented in Table 3. Although the search term variables/data are slightly different than the earlier study because they include additional years of data, they were picked because that study had already shown the relationship between opioid-related search terms. We sought to reuse the terms that had already been found associated with heroin; however, we also included a small number of additional terms by using the google trends tool to find search terms related to those initial terms. The Google data retrieval was done on Oct 10, 2018. We chose the period of data for study as it is one where there were substantial shifts in the drivers of opioid-related mortality (e.g., from prescription drug use to heroin to fentanyl) that we want our model to capture and because the data were publicly available.
We obtained state-level quarterly drug utilization data during the same time period from the State Drug Utilization Data (SDUD), provided by Medicaid.gov 20 . These data represent medical prescriptions filled on an outpatient-basis and paid for by state Medicaid agencies. We included the eleven states with complete data for each year (California, Florida, Georgia, Indiana, Maryland, Minnesota, Missouri, New Jersey, New York, and Tennessee, Wisconsin) in the final analysis. For each of these 11 states, for each quarter, we identified the 100 prescribed drugs (based on the National Drug Code (NDC)) most correlated with relative Google search

Data analysis
The study was designed to determine the best fitting model incorporating internet search and/or drug utilization data as predictors of opioid-related ED visits. To model the number of ED visits, which are count data, we used the negative binomial generalized linear model (nbGLM), a widely adopted statistical model for count data that has been frequently used in public health prediction models 11,13,[22][23][24] . We adopted the Least Absolute Shrinkage and Selection Operator (LASSO) approach 25 to identify the subset of predictors that have the best predictive power among the list of search keywords, and 100 most frequently used drugs for each state. We validate the LASSO models by performing a retrospective out-of-sample prediction experiment, in which we use one set of data (historical data) to train the parameters of the models, and then use the trained model to predict the ED visits in another set of data (future event). More specifically, we validate the models' efficacy in performing one-quarter-ahead and two-quarter-ahead prediction tasks. We also compared the performance of the LASSO-based prediction model for each state with a mixed model and pooling analysis model (one combined model for all states). Comparative results suggest that the LASSO prediction model on individual states outperforms the mixed model and pooling analysis model (supplementary materials). In addition, the LASSO method is effective in the minimization of prediction errors that are common in statistical models to optimally select the search items that are predictive with high accuracy. The accuracy of LASSO method and low sensitivity to parameters are the result of its advantages to include shrinkage of coefficients. This approach is used to reduce variance and minimizes bias to ensure the validation of the predictions with an out-of-sample 10-fold cross validation approach. The prediction performance of LASSO is provided in Table 1, based on R 2 and RMSE evaluation criteria. The most accurate and best-fitting models predicting one-quarters or two-quarters-ahead for every state combined both the Google and Medicaid variables in the same model ( Fig. 1). Further, to verify the sensitivity of the prediction model based on the search terms chosen, we conducted both forward and backward stepwise regression (Supplementary Table 5 and Supplementary Table 6). The superior performance of models integrating both internet search and Medicaid data highlights the importance of combining the two datasets for better performance. It suggests that internet searches for opioids is associated with actual opioid use/outcomes. Details of the statistical models/approach are presented in the supplementary materials.
To evaluate the accuracy of the proposed models and identify the most predictive search terms for each state, we considered the commonly used R 2 (R 2 ) and Root Mean Square Error (RMSE) statistics in an out of-sample 10-fold cross validation approach. R 2 measures the extent the variance of predictors explains the variance of the response. A larger R 2 value indicates a model with greater explanatory power. RMSE is the standard deviation of the prediction errors. A smaller RMSE indicates a more accurate model. To address changing trends in opioid-related online search terms, the search terms in the model were updated each year. We performed experiments for two ED visit prediction scenarios: (i) one-quarter-ahead prediction, and (ii) two-quarter-ahead prediction. We compared the prediction performance for the two scenarios in each state. The proposed model performed well for both scenarios, with the prediction accuracy for one-quarter-ahead prediction being higher than for two-quarter-ahead prediction.

Reporting summary
Further information on experimental design is available in the Nature Research Reporting Summary linked to this paper.

DATA AVAILABILITY
The data used in this analysis are publicly available online through Medicaid and HCUP. Google data may be available upon request, pending confirmation from Google.