Introduction

Recreation and tourism are important components of many national and local economies and they contribute in innumerable ways to quality of life, sense of place, social connection, physical wellbeing, learning and other intangibles. Information on patterns of recreation and tourism and the factors that influence behavior in these realms is typically collected using site-specific surveys or interviews. The recent emergence of social media creates exciting alternative possibilities to assess how people use and respond to nature and other cues for recreation and tourism. One problem, however, is that although social media generate “big data”, it is often unclear how to tease meaning and useful information from them. Here we assess the relationship between the locations of photographs from the image-sharing website flickr and empirically derived visitation rates at sites around the world. This is the first study to ground-truth the use of data from social media to predict visitation rates, freeing researchers from time- and labor-intensive surveys and revolutionizing the use of social media to understand where people recreate.

A key reason for studying patterns of recreation or tourism is the economic significance of this industry. The total contribution of travel and tourism to the world's gross domestic product (GDP) in 2011 was approximately $6 trillion USD (9% of GDP), with expected growth to $10 trillion USD by 2022 [1]. On more regional scales, recreation and tourism represent a significant fraction of many local economies. The economy of the Caribbean region, for example, is dominated by tourism, with 15% of GDP from tourism and 17% of the available workforce employed in the tourism sector [2]. In the US, English et al. [3] classify 156–338 nonmetropolitan counties as “tourism dependent”, meaning 10% of income and 15% of jobs in these counties result from tourism. Of course, economic impacts are only one way of measuring the importance of recreation and tourism. These activities are critical contributors to diverse aspects of human wellbeing [4]. For example, outdoor recreation is a spiritual experience for many people [5,6,7,8] and social interactions in nature contribute to building a sense of place [9,10,11,12].

A major and growing portion of the global penchant for travel and recreation is “nature-based”, involving interactions with or appreciation of the natural environment [13]. For these types of activities, characteristics of the environment influence people's decisions about where, when and how to recreate. SCUBA divers, for example, select destinations based on the water clarity, water temperature and diversity of marine life [14,15]. Bird-watchers are drawn to the best places to see target species [16], which inevitably are places where natural systems support populations of desirable birds [17].

Several previous studies have succeeded in quantifying the degree to which recreation depends on environmental attributes such as species richness [18], the diversity of habitats [19,18], precipitation [20] and temperature [21], as well as on other attributes such as infrastructure and cultural attractions [22,23]. However, because people's motivations for how and where to recreate vary from place to place and depend on many factors including their origin and destination, each study must assemble new empirical data. Existing empirical data are often too coarse to illuminate the relationship between visitation and salient features of a particular location and represent point samples as opposed to landscape data. Empirical surveys are also expensive to conduct and are limited in coverage.

To avoid having to conduct costly and time-consuming surveys, researchers have long sought automated methods for collecting empirical data on visitation rates to multiple locations. Shoval and Isaacson [24] recorded the movement patterns of several tourists in Israel who volunteered to carry global positioning system (GPS) locators, an approach since repeated in other locations [25]. Pettersson and Zillinger [26] supplemented survey and GPS data with aerial images to estimate the number of people attending a sporting event in Sweden. While these techniques provide rich datasets, there remains a need for methods of estimating visitation rates across multiple spatio-temporal scales, in natural and built environments and in both developed and developing countries. We explore a new approach, using the density of existing geolocated photographs posted to the online photo-sharing website flickr as local data that reveal where people recreate.

Results

Since its launch in 2004, over 71 M flickr users have uploaded over six B photographs to the image-sharing website. Approximately 197 M of these photographs are geotagged, meaning the image has been assigned specific coordinates of latitude and longitude (Fig. 1). Since 2010, 40–50 M geotagged photographs have been uploaded annually. Most geotagged photographs on land are from locations in Europe (40%), North America (39%) and Asia (13%). Sixteen percent of all geotagged photographs are from marine and coastal environments. Most images are taken in the United States, United Kingdom and France. On a per-area basis, Vatican City, Macau and Gibraltar have the greatest number of geolocated pictures, while Antarctica and Chad have the lowest density of images.

Figure 1

Locations of the approximately 197 M geotagged photographs uploaded to flickr from 2005–2012.

Figure created using the maps package for R.

Visitation rates

Crowd-sourced information from flickr photographs corresponds well with empirical information about where people go (Fig. 2). Unsurprisingly, only a portion of visitors to any site take geotagged photographs and share them on flickr. This is observable in Figures 2–4 as the distance of each point from the grey 1:1 line representing equal densities of visitors and photographs. Although there is appreciable variation across all sites, there is a reliable statistical relationship between the number of people counted and the flickr-generated estimate of user-days (Figs. 2–4). The relationship between the empirical estimates of mean annual visitor user-days ($\overline{EUD}$) and those derived from photographs ($\overline{PUD}$) is best described by a power function $Y = Y_0 X^{\beta}$, where $Y$ is $\overline{EUD}$, $X$ is $\overline{PUD}$, $\beta = 0.698$ and $Y_0 = 13.7$ ($F_{1,831} = 388.34$, $p < 0.001$, $R^2 = 0.386$; Table 1). Absolute visitation rates are less variable within datasets comprising sites from smaller geographies or similar types of use (Fig. 2).
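As a worked illustration of the reported scaling relationship, the sketch below evaluates the power function with the exponent (0.698) and intercept (13.7) quoted above. The function name `predict_eud` is hypothetical, and the sketch ignores the site-level adjustments for attraction type and income, so its output should be read as a rough overall trend rather than a site-specific prediction.

```python
Y0 = 13.7     # intercept of the power function, as reported in the text
BETA = 0.698  # scaling exponent, as reported in the text


def predict_eud(pud, y0=Y0, beta=BETA):
    """Return the trend-line estimate of empirical user-days for a given PUD.

    `pud` is the mean annual photo-user-days at a site. Negative values
    are rejected because they have no meaning as a count.
    """
    if pud < 0:
        raise ValueError("pud must be non-negative")
    return y0 * pud ** beta


if __name__ == "__main__":
    for pud in (1, 10, 100, 1000):
        print(pud, round(predict_eud(pud), 1))
```

Because the exponent is below one, a tenfold increase in photo-user-days corresponds to roughly a fivefold (10^0.698) increase in the trend-line estimate of visitors.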

Table 1 Results of an ANCOVA testing the effects of the covariate photo-user-days ($\overline{PUD}$) and the factors attraction type (A) and destination income (I) on empirically estimated visitation rates ($\overline{EUD}$)
Figure 2

Average user-days per year based on photographs (x-axis) vs. empirical data (y-axis) at 836 sites worldwide.

Panels display points from the individual datasets listed in Supplementary Table S1 atop all data (+). Grey line is 1:1.

Figure 3

Average photograph-based and empirical estimates of user-days per year.

Sites are colored according to whether they are primarily a cultural (green circles [n = 498]) or natural (orange diamonds [n = 338]) attraction. Black line depicts the overall trend across all sites. Grey line is 1:1.

Figure 4

Average photograph-based and empirical estimates of user-days per year.

Sites are colored according to whether they are located in a country with a low (green circles [n = 35]) or high (orange diamonds [n = 801]) Gross National Income. Black line depicts the overall trend across all sites. Grey line is 1:1.

Type of attraction (A) and income-level of the country (I) do not change the observed relationship between empirical and photograph user-day estimates, but they do explain some of the variation across sites, both in total visitation rates and in the proportion of users who choose to take and upload photographs to flickr. Neither A nor I alters the slope of the scaling relationship, according to the results of the ANCOVA. Interactions between the covariate and the two factors A and I were all non-significant (results not shown). Therefore, we refit the model by removing interactions with $\overline{PUD}$ and assuming parallel slopes. All results presented here are for the simplified model. Predictors A and I, however, do change the intercept, $Y_0$ (Figs. 3 and 4). Across the 836 sites in this study, cultural attractions have higher visitation rates than natural ones ($F_{1,831} = 29.31$, $p < 0.001$) and sites in high-income nations have more annual visitors than sites in low-income countries ($F_{1,831} = 9.61$, $p = 0.002$). Also, A and I interact, such that the higher visitation to cultural attractions occurs primarily at sites in low-income countries ($F_{1,831} = 101.05$, $p < 0.001$) while visitation rates in high-income nations are indistinguishable.
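The parallel-slopes idea described above can be sketched as a least-squares fit with one shared slope and a separate intercept for each group of sites. This is a minimal illustration, not the authors' analysis code: it assumes the fit is done on log10-transformed values (consistent with a power function, but not stated explicitly in the text), the name `fit_parallel_slopes` is hypothetical, and the sample data are synthetic.

```python
import math
from collections import defaultdict


def fit_parallel_slopes(records):
    """Fit log10(eud) = log10(y0_group) + beta * log10(pud).

    `records` is an iterable of (group, pud, eud) with positive pud and eud.
    A group can be any hashable label, for example an (A, I) pair.
    Returns (beta, {group: y0}) where beta is shared by all groups.
    """
    by_group = defaultdict(list)
    for group, pud, eud in records:
        by_group[group].append((math.log10(pud), math.log10(eud)))

    sxx = 0.0
    sxy = 0.0
    means = {}
    for group, points in by_group.items():
        mean_x = sum(x for x, _ in points) / len(points)
        mean_y = sum(y for _, y in points) / len(points)
        means[group] = (mean_x, mean_y)
        # Pool the within-group sums of squares and cross-products.
        sxx += sum((x - mean_x) ** 2 for x, _ in points)
        sxy += sum((x - mean_x) * (y - mean_y) for x, y in points)

    if sxx == 0:
        raise ValueError("pud values must vary within at least one group")

    beta = sxy / sxx
    intercepts = {g: 10 ** (my - beta * mx) for g, (mx, my) in means.items()}
    return beta, intercepts


# Synthetic example: two groups with the same slope but different heights.
synthetic = [("cultural", p, 20 * p ** 0.7) for p in (1, 10, 100)]
synthetic += [("natural", p, 10 * p ** 0.7) for p in (1, 10, 100)]
beta, intercepts = fit_parallel_slopes(synthetic)
```

Pooling the within-group sums of squares is what forces a single slope, while each group keeps its own height, mirroring the finding that A and I shift the intercept but not the slope.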

The lowest $\overline{EUD}$ values were for Khaptad and Dhorpatan National Parks in Nepal. The next least visited site was Hamilton Grange National Memorial, in New York City, with an average of 64 user-days each year from 2008–2010. During this period, the facility was closed while the structure was moved to a new location two blocks away. The two most highly visited sites according to $\overline{EUD}$ were the Magic Kingdom Theme Park in Florida (over 17 M user-days) and Disneyland in California (over 15 M user-days). The natural attraction with the highest annual visitation was Yellowstone National Park in the United States. Twenty-five of the 836 study sites had zero photos during the years for which we have EUD estimates.

Visitor origins

Data from photographs uploaded to flickr also serve as a good indicator of the country of origin for travelers (Fig. 5). Originating countries of incoming visitors surveyed at border crossings (EOC) are related to the home countries reported in the profiles of flickr users who took photos within the same focal country (POC), measured as proportions of incoming people ($\beta = 0.735$, $F_{1,102} = 169.12$, $p < 0.001$, $R^2 = 0.620$). In our analysis, the flickr users traveling to five destination countries (D) originate from 55 nations around the world. These nations represent a wide range of population sizes and income levels. A nonsignificant POC · D interaction term in the ANCOVA shows that the scaling relationship between EOC and POC does not differ across the five destination countries considered here. Furthermore, the intercept of the relationship between EOC and POC is also constant across all levels of D. Thus, we removed the D and POC · D terms from the model and present the results of a simplified ANCOVA in Table 2. In summary, the home location reported by flickr users who take and upload pictures in a particular country can predict the originating countries of travelers to that nation. Furthermore, the relationship between photo- and empirically-derived visitation rates is equal across the five destination countries examined.

Table 2 Results of an ANCOVA testing the relationship between the proportion of visitors from each origin country to five destination countries, based on photographs (the covariate $\overline{POC}$) and empirical estimates ($\overline{EOC}$)
Figure 5

The average proportion and originating country of travelers who arrived in five destination countries each year, according to stated home locations of flickr users who took at least one photograph within the country (x-axis) and immigration data (y-axis).

Names of outlying originating countries are abbreviated. Datasets are distinguished by colors and symbols and described in Supplementary Table S2. Black line depicts the overall trend across all sites. Grey line is 1:1.

Visitation over time

Visitation rates estimated from flickr images match expectations over time at sites selected as examples of cultural (Zuccotti Park and Black Rock Desert) and natural (Vermont) attractions. There is a single spike in PUD at Zuccotti Park that corresponds to the duration of the encampment there (Fig. 6a). Increasing numbers of photos begin to appear on flickr when the protests start on September 17, 2011 and drop abruptly when the park is closed on November 15, 2011. In the Black Rock Desert, there is an annual spike in the numbers of photos taken that corresponds to the three-week period surrounding the Burning Man festival each year (Fig. 6b). Similarly, there is a period of higher PUD in southern Vermont each October (Fig. 6c) during the prime month for viewing the fall foliage.

Figure 6

Total photo-user-days each week from 2006–2012 in (a) Zuccotti Park, the site of the Occupy protest, (b) Black Rock Desert, the site of the Burning Man festival and (c) southern Vermont, an area popular for viewing autumn foliage. Grey shading indicates time periods from the start of the protest until people were barred from the park (a), three weeks spanning the annual week-long festival (b), or the month of October each year (c).

Discussion

A lack of useful information about where people go during their leisure time has hindered progress toward understanding what draws and repels people to and from various recreation sites around the world. Here, we show that crowd-sourced information can offer new perspectives on this old problem, revolutionizing the way we study people and understand their choices. We hypothesized that pictures could indicate visitors and furthermore, that photographs uploaded to an image-sharing website could record people's choices and provide useful data worldwide. Our comparison of visitation data collected from 836 sites in 31 countries with data generated from geotagged photographs uploaded to flickr shows that the crowd-sourced data are indeed a suitable proxy for the more traditional time- and labor-intensive empirical estimates. This represents a significant advancement, as this new proxy measure of visitation can be applied almost anywhere: in developed and developing countries, data-poor and data-rich locations, urban areas and wilderness. Wherever people are taking and uploading pictures we can use that information to indicate their visit and learn from it.

There is considerable variation among sites in the concordance between empirical and photograph-based visitation rates. However, if used carefully, this relationship could inform future studies of marginal or absolute changes in visitation rates. Many questions can be answered through scenario-based assessments of relative changes in visitation across alternative management regimes. Managers could use geotagged photographs to explore marginal changes in visitation rates with changes in ecosystem health, site access, or tourism infrastructure under alternative future scenarios, irrespective of the variability in the relationship between $\overline{EUD}$ and $\overline{PUD}$. Studies requiring absolute visitation rates should note that the slope of the relationship between empirical and photograph-based visitation rates is consistent across income levels (I) and attraction types (A); only the height of the function varies. In other words, globally, visitation rates derived from field data and images are consistently scaled with a slope of 0.70, but the absolute visitation rate varies with local socioeconomic conditions and attributes of the site. The precision of predictions will hinge on the similarity of the study sites in both geography and the types of attractions. Absolute visitation rates are less variable across sites within nations or from a single destination type, such as a state park or an art gallery (Fig. 2). Wherever some local visitation data are available, the height (and potentially the slope) of the relationship between $\overline{EUD}$ and $\overline{PUD}$ could be adjusted to suit the research questions and particular study region. We encourage researchers and practitioners to seek additional socioeconomic factors beyond income and attraction type that explain local variability in the absolute visitation rate. Despite the variability in $\overline{EUD}$ and $\overline{PUD}$, this study presents promising new evidence that visitation rates can be quantified, both in relative and absolute terms, using geotagged photographs and a few easily-measured variables of a site.

Home countries given by users on flickr correspond with the home countries of travelers recorded at immigration entry points, making crowd-sourced data not only useful for estimating visitation rates, but also for understanding where visitors originate. Because the time and money that people spend traveling indicate how much they value the destination, these data on the origin and destination of recreators are enormously beneficial for economic methods for valuing recreation sites. One preferred approach for quantifying value is to use a “travel cost model”, which uses the cost of travel to estimate people's willingness to pay to recreate at particular sites [27]. Travel cost studies are often criticized for not accounting for people who visit multiple sites on a single trip away from home. Crowd-sourced visitation data can potentially address this issue since users often upload images throughout their journey.

The ability to estimate visitation rates without survey data allows for models that can anticipate changes in visitation in response to changes in ecosystems, relative to other types of change (in built features, social capital, etc.). Random utility models are one example of an economic technique for quantifying the marginal benefits of natural environments and other attributes. Typically, telephone surveys are conducted asking respondents where they live, which recreational sites they visit and why. These individuals' choices about which sites to visit reflect their preferences for certain characteristics of sites and the tradeoff between the costs (e.g., travel) and benefits (e.g., presence of wildlife) of the trip. Here, we show that the same data can be gathered using the locations of photographs and spatial data on the presence of features such as swimming beaches, cultural events, or other attractions. Enticing evidence that this approach is suitable for understanding people's choices is demonstrated by the match of flickr photos to known temporal aggregations of people in Zuccotti Park, Black Rock Desert and southern Vermont (Fig. 6). We offer these as initial examples and hope to spark further use of this approach to understand what draws and repels people to and from particular places.

Of course, this method is imperfect. There may be biases in who is taking digital photographs and uploading them to social media sites. Different recreational activities may be more or less suited to taking photographs. Surfers, for example, while likely possessing cameras and internet access, may prefer not to take photographs while surfing. Also, the perceived value of a trip may influence whether an individual takes or shares photographs, resulting in a bias against images from visitors who travel shorter distances from home. We observe, for example, that tourists visiting Nepal from neighboring Sri Lanka and India upload fewer photographs to flickr than predicted based on the overall trend (Fig. 5). Similarly, local visitors may be less inspired to take or share photographs of commonly-visited sites. While we find strong correlations between the crowd-sourced information and empirical data at attractions, such as national parks, we do not look at correlations between crowd-sourced information and visitation to more mundane locations, like shopping centers, that might be popular sites for recreation by local people. Further work is needed to explore the utility of this approach at locations that are not major attractions or landmarks. Other social media such as geotagged tweets might serve as more effective proxies for some types of recreation, particularly in urban areas.

New technologies and digital social media have begun making vast amounts of geolocated data available for a wide range of creative purposes, including art, targeted advertising, crime prevention and scientific research. Some authors are rightfully raising concerns about the appropriate and ethical use of these data and the potential for apophenia: to see patterns in “big data” where none actually exist [28]. In response to their calls for more critical assessments of digital data, this study vets a novel method for using geotagged photographs from flickr to provide sources of information for understanding where people go. We conclude that crowd-sourced information can not only break the log-jam of expensive empirical data requirements for predicting and valuing how changes in the landscape alter recreation and tourism, but also can provide revolutionary information for understanding questions about where people recreate, in ways unimaginable before the existence of the internet and social media.

Methods

While data from social media are fascinating, a critical question that will determine their utility for understanding where people go is: how well do they reflect on-the-ground visitor surveys and records? If we can establish reliable statistical relationships between image- and field-based records then we will have a powerful new tool for tracking how people interact with nature during recreation. To address this question, we first compare photograph- and field-based estimates of visitation rates at recreational sites around the world. Then, we assess how demographics of flickr users, specifically their home country, compare to survey responses by tourists entering through immigration checkpoints into five nations.

Visitation rates

We assembled data from nine independent empirical datasets that quantified visitation to 836 sites in 31 countries around the world (Supplementary Table S1). These empirical datasets represent a wide range of attractions, from amusement parks to national parks and from historic battlefields to art galleries. In each study, total empirical annual visitor user-days (EUD) were counted at a defined recreation site such as a park or museum. User-days are defined as one person spending a portion of one day within a site. The smallest and largest sites in the dataset are both in the USA, ranging from the 80 m² Thaddeus Kosciuszko US National Memorial in Pennsylvania to the Gates of the Arctic National Park and Preserve in Alaska (30,448 km²). All nine datasets were available publicly on the World Wide Web. For a study to be included in our analyses, we required measurements of total annual user-days at no fewer than nine sites for two years between 2005 and 2011.

To create an alternative measure of visitation rate at the same sites, we used metadata associated with photos uploaded to the online photo-sharing website flickr. We used flickr's public API to download metadata for photographs that were taken within the bounds of each site. Along with the location of pictures, we also stored the photographer's identification number and the date that the image was taken. We used these two additional variables to convert the metadata, often including numerous photographs from the same person on the same day, into total annual user-days over the same time period as the EUD data were collected at each site. Thus, for photographs, user-days are defined as the total number of days, across all users, that each person took at least one photograph within each site (PUD).
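The conversion from photo metadata to photo-user-days can be sketched as follows. This is a minimal illustration under stated assumptions: the input is taken to be a list of (photographer ID, date taken) pairs that have already been filtered to one site, the function names are hypothetical, and the step that downloads metadata from flickr is not shown.

```python
from collections import defaultdict
from datetime import date


def photo_user_days(photos):
    """Count photo-user-days (PUD) per calendar year for one site.

    `photos` is an iterable of (user_id, date_taken) pairs. Several photos
    by the same user on the same day count as a single user-day.
    """
    unique_user_days = set(photos)
    pud_by_year = defaultdict(int)
    for _user_id, taken in unique_user_days:
        pud_by_year[taken.year] += 1
    return dict(sorted(pud_by_year.items()))


def mean_annual_pud(photos, years):
    """Average PUD over `years`; years with no photos count as zero."""
    pud_by_year = photo_user_days(photos)
    return sum(pud_by_year.get(year, 0) for year in years) / len(years)


# Hypothetical metadata records: (photographer ID, date taken).
sample = [
    ("u1", date(2009, 7, 4)),
    ("u1", date(2009, 7, 4)),  # same user, same day: counted once
    ("u1", date(2009, 7, 5)),
    ("u2", date(2009, 7, 4)),
    ("u2", date(2010, 1, 1)),
]
print(photo_user_days(sample))               # {2009: 3, 2010: 1}
print(mean_annual_pud(sample, [2009, 2010])) # 2.0
```

Using a set of (user, day) pairs is what collapses repeated photographs into one user-day, which is the key difference between PUD and a raw photo count.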

To account for potential variation in the scaling relationship between EUD and PUD, we recorded two additional attributes for each of the 836 study sites. First, each site was characterized as either a cultural or natural attraction (dummy variable A). Cultural sites are those visited primarily for recreation in the built environment and socializing with other people (e.g., amusement parks) or learning about human culture (e.g., art galleries, historic battlefields) irrespective of the natural setting. Natural attractions, on the other hand, are sites where people go primarily to appreciate the flora, fauna, or natural setting (e.g., state beaches, recreation areas, botanical gardens). In reality, the two categories are not exclusive, so we selected the category capturing the primary reason that most people visit each site. Second, we characterized each site as either high- or low-income (dummy variable I) based on the nation's per-capita gross national income (GNI) published by the World Bank. We divided the World Bank's four income levels into two categories, low and high, with low-income countries having a per-capita GNI below $4,035 USD in 2011.
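The two-level income factor can be expressed as a simple threshold rule. The sketch below is illustrative only: the 0/1 coding and the names are assumptions, and only the $4,035 USD cut-off comes from the text.

```python
LOW_INCOME_GNI_CEILING_USD = 4035  # per-capita GNI cut-off given in the text


def income_dummy(gni_per_capita_usd):
    """Return 0 for a low-income country and 1 for a high-income country.

    The 0/1 coding is an assumption made for illustration; the text only
    states that countries below the cut-off are treated as low-income.
    """
    return 0 if gni_per_capita_usd < LOW_INCOME_GNI_CEILING_USD else 1
```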

We used statistical methods to assess the correspondence between EUD and PUD estimates. For all analyses, we aggregated repeated annual measures at each site into a single value of average annual EUD and PUD ($\overline{EUD}$ and $\overline{PUD}$) across all years that empirical data were collected. We fit a general linear model (GLM) in which $\overline{EUD}$ was regressed on $\overline{PUD}$, A (cultural vs. natural), I (low vs. high) and all interactive effects. We used an analysis of covariance (ANCOVA) to test two main hypotheses: 1) $\overline{EUD}$ is related to $\overline{PUD}$ and 2) the slope of the relationship between the response and covariate is equal across levels of factors A and I. We evaluated hypothesis (2) based on the $\overline{PUD} \cdot A$, $\overline{PUD} \cdot I$ and $\overline{PUD} \cdot A \cdot I$ interaction terms in the ANCOVA.
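To make the link between a linear model and the reported power function concrete, the sketch below fits a straight line to log10-transformed values and converts the result back to an exponent and intercept. It is a simplified stand-in for the full ANCOVA: it omits the factors A and I, assumes a log10 transformation (which the text does not state explicitly), uses a hypothetical function name, and runs on synthetic data. Sites with zero PUD cannot be log-transformed and are not handled here.

```python
import math


def fit_power_law(pud, eud):
    """Least-squares fit of log10(eud) = log10(y0) + beta * log10(pud).

    `pud` and `eud` are equal-length sequences of positive numbers.
    Returns (y0, beta, r_squared) on the log-log scale.
    """
    xs = [math.log10(value) for value in pud]
    ys = [math.log10(value) for value in eud]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    log_y0 = mean_y - beta * mean_x
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_resid = sum((y - (log_y0 + beta * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - ss_resid / ss_total
    return 10 ** log_y0, beta, r_squared


# Synthetic, noise-free data generated from the reported relationship.
pud_values = [1, 10, 100, 1000]
eud_values = [13.7 * p ** 0.698 for p in pud_values]
y0, beta, r_squared = fit_power_law(pud_values, eud_values)
```

With noise-free input the fit recovers the generating exponent and intercept; real visitation data would scatter around the line, as the reported R² indicates.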

Visitor origins

We collected a second empirical dataset of origin countries of tourists arriving in five destination countries (see Supplementary Table S2). Empirical estimates of the proportion of visitors from each country (EOC) derive from passenger surveys conducted at entry points such as airports and overland border-crossings. We gathered values from public sources, found online, that reported the EOC per origin country for at least one calendar year between 2005 and 2010. For comparison, we assembled a dataset of the origination locations of users who uploaded photographs to flickr taken within the destination country during the same time that the empirical data were collected. flickr users have the option to list a “current location” in their profile. We assumed each user's current location was the origination of each trip and used this information to calculate the proportion of photographers originating from each country (hereafter, POC).
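The POC calculation can be sketched as a tally of photographers by stated home country. In this minimal illustration each photographer is counted once per destination, and users without a listed location are dropped; that exclusion rule, the function name and the sample data are assumptions rather than details given in the text.

```python
from collections import Counter


def origin_proportions(user_home_countries):
    """Return the proportion of photographers from each home country.

    `user_home_countries` maps user_id -> home country string, or None
    when the profile lists no location. Users with no location are
    excluded (an assumption made for this sketch).
    """
    counts = Counter(c for c in user_home_countries.values() if c)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {country: n / total for country, n in counts.items()}


# Hypothetical photographers who took photos in one destination country.
profiles = {"u1": "India", "u2": "India", "u3": "USA", "u4": None}
poc = origin_proportions(profiles)
```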

We then explored the relationship between the EOC and POC values using a GLM. First, we aggregated repeated annual EOC and POC estimates into a single value of average annual proportion ($\overline{EOC}$ and $\overline{POC}$) across all years that empirical data were collected. The GLM regressed $\overline{EOC}$ on $\overline{POC}$, a categorical variable, D, representing the destination country, and the $\overline{POC} \cdot D$ interaction term. We used an ANCOVA to test whether $\overline{EOC}$ is related to $\overline{POC}$ and whether the slope of this potential relationship varies across the five destination countries of the analysis.

Visitation over time

Finally, we present examples of how geotagged images can track the aggregate movement of people responding to cultural and environmental cues. As examples of cultural attractions, we examine visitation rates to Zuccotti Park, in New York City, the epicenter of the Occupy Movement of 2011 and to the Black Rock Desert, Nevada, home of the Burning Man festival. In the first, we expect higher PUD values from the start of the protest on September 17, 2011 until people were barred from the park on November 15, 2011. In the second, we expect annual peaks of visitation, measured as PUD, surrounding the annual week-long festival. To display the influence of a natural attribute on visitation, we plot seasonal patterns of PUD in southern Vermont where “leaf-peepers” are drawn by the colors of the foliage annually in October.
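Weekly totals of photo-user-days, as plotted for these three examples, can be sketched by grouping unique (user, day) pairs by week. The Monday-based week boundary and the function name are assumptions made for illustration; the text does not say how weeks were delimited.

```python
from collections import Counter
from datetime import date, timedelta


def weekly_pud(photos):
    """Return {week_start_date: PUD}, with weeks starting on Monday.

    `photos` is an iterable of (user_id, date_taken) pairs for one site.
    Repeated photos by one user on one day count as a single user-day.
    """
    weeks = Counter()
    for _user_id, taken in set(photos):
        week_start = taken - timedelta(days=taken.weekday())
        weeks[week_start] += 1
    return dict(sorted(weeks.items()))


# Hypothetical records around the start of the protest on 17 Sept 2011.
sample_photos = [
    ("u1", date(2011, 9, 17)),
    ("u1", date(2011, 9, 17)),  # duplicate user-day, counted once
    ("u2", date(2011, 9, 18)),
    ("u1", date(2011, 9, 19)),
]
weekly = weekly_pud(sample_photos)
```

A spike in such a weekly series that coincides with a known event window, such as the shaded periods in Fig. 6, is the pattern the comparison looks for.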