Exercise contagion in a global social network

We leveraged exogenous variation in weather patterns across geographies to identify social contagion in exercise behaviours across a global social network. We estimated these contagion effects by combining daily global weather data, which creates exogenous variation in running among friends, with data on the network ties and daily exercise patterns of ∼1.1M individuals who ran over 350M km in a global social network over 5 years. Here we show that exercise is socially contagious and that its contagiousness varies with the relative activity of and gender relationships between friends. Less active runners influence more active runners, but not the reverse. Both men and women influence men, while only women influence other women. While the Embeddedness and Structural Diversity theories of social contagion explain the influence effects we observe, the Complex Contagion theory does not. These results suggest interventions that account for social contagion will spread behaviour change more effectively.

their activity information with friends through social networks operated by the platforms.
We collected and analyzed exercise and social network data from a global fitness tracking network to better understand peer effects in exercise behavior and human health interdependence more broadly. The fitness tracking technology creates an accurate running monitor and provides real-time feedback to runners during and after each run. The technology allows runners to keep track of all of their routes including a breakdown of pace and distance at different points during a run. Runners can analyze their own running data and connect with the website after each run to instantly save the run and share it with friends via the site itself as well as on Twitter, Facebook or other social media.
After each run, the fitness tracker can be connected to the runner's personal account at the platform's website, where personal fitness activity is stored. The website helps runners monitor their running experience with dynamic graphs that compare distance and time between single sessions, as well as weekly and monthly totals.
The website also allows individuals to form social ties and follow other individuals' running activity. An individual can therefore track her own training records and review her friends' activity. The website also allows runners to initiate or participate in competitions with friends, compare themselves to other runners across the globe and use a mapping tool that illustrates individual running routes, which can be shared with others. Given these features, as we develop in the main text and in the argumentation below, our main hypothesis is that peer effects should play a major role in driving individual training and performance patterns.
Data Collection Procedures. The data contain anonymized running activities (distance, duration, pace and calories burned) for each run (and, in 29% of runs, a GPS trace of the actual run location and trajectory), as well as demographic information for all individuals using the network's fitness tracking devices. Running activity observations were collected over a five-year period [We excluded a tiny fraction of records that we believe are either physically impossible as daily running performance or likely recording errors. We removed runs that exceed a duration of 14 hours (860 minutes), a distance of 120 km (74.5 miles) or a pace of 1.07 km/min (40 mph).]. At the same time, data on the fitness tracking social network were also collected. The dataset is organized as dyadic relations (from-to) with a timestamp indicating when the social tie was formed. We observe link formation for a period of five years. After an individual forms a social tie, each time they finish a run, their running performance is automatically shared with their friends. At the end of the observation period there are 3.4 million unique social links in the data among 1.1 million people (network nodes) who have at least one connection. This subset of individuals accounts for 59 million running activity events and 359M kilometers run.
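The exclusion rule in the bracketed note can be sketched as a simple filter. This is a minimal illustration, not the pipeline used for the study: the `Run` record and its field names are hypothetical, the duration cut-off uses the minutes figure quoted in the note, and rejecting non-positive distances or durations is an added assumption.

```python
from dataclasses import dataclass

# Thresholds quoted in the text (duration uses the minutes figure).
MAX_DURATION_MIN = 860.0
MAX_DISTANCE_KM = 120.0
MAX_SPEED_KM_PER_MIN = 1.07


@dataclass
class Run:
    """Hypothetical record for a single run."""
    distance_km: float
    duration_min: float


def is_plausible(run: Run) -> bool:
    """Return True if the run passes all three plausibility cut-offs."""
    # Assumption: non-positive values are treated as recording errors.
    if run.distance_km <= 0 or run.duration_min <= 0:
        return False
    speed = run.distance_km / run.duration_min
    return (
        run.duration_min <= MAX_DURATION_MIN
        and run.distance_km <= MAX_DISTANCE_KM
        and speed <= MAX_SPEED_KM_PER_MIN
    )


runs = [Run(10.0, 55.0), Run(130.0, 700.0), Run(5.0, 3.0), Run(42.2, 240.0)]
kept = [r for r in runs if is_plausible(r)]
```

In this toy list the 130 km run fails the distance cut-off and the 5 km run in 3 minutes fails the speed cut-off, so two runs are kept.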
Supplementary Table 1  Finally, 50% of runners are at a normal weight, 35% of runners are moderately overweight and only 2% of the sample is underweight.
Supplementary Figure 2B and C display, respectively, daily activity measured in number of runs per day and the average pace per run for individuals in different demographic categories. Women are more active than men on average and run at a faster pace. Interestingly, older people (especially those in their 50s and 60s) run more frequently than younger people. However, these are also the ages with the slowest pace during runs. Runners from Japan run very frequently, but their pace is significantly slower than that of individuals from other countries. Finally, individuals who are at a normal weight or are slightly overweight are more active runners, both in the number of daily runs and in the pace of their runs.
When we analyze exercise at the daily level, we see that activity depends not only on the day of the week but also on the specific time of day (see Supplementary Figure 4). Running is more popular during the weekend and less popular at the beginning of the work week. Also, people in our sample prefer an afternoon run to running early in the morning (Supplementary Figure 4). All of the above mentioned time fixed effects (day specific, season specific and year specific) are controlled for in our subsequent analysis for identifying exercise influence.

Social Network Data
The underlying running social network is organized in dyadic form (from-to) with a timestamp indicating when the social tie was formed. After an individual forms a social tie, each time they finish a run, their running performance is shared automatically with their running friends.
At the end of the observation period we have ∼1.1M individuals that are connected by ∼3.4M links.
The running social network is a sparse network with average degree (the number of connections or ties an individual has) close to 3.7 (S.D.=8.2). As is fairly typical, the network has a heavy-tailed degree distribution. While the vast majority of runners have a small number of connections, there is a small number of people with many connections, with the maximum of 1330 (Supplementary Figure 5). Long tail degree distributions are a common characteristic of natural and socio-technical networks, from protein-protein interactions to human mobility systems (2).

Weather Data
Since our objective is to use the weather as an instrument to identify and quantify exercise influence, we are interested in collecting complete daily weather data (precipitation and temperature) for the period of observation. We collect weather data at the station (or tower) level worldwide (Supplementary Figure 8). The correlation coefficient between the two distributions is 0.59, suggesting that in highly populated areas the distribution of weather stations is very dense compared with sparsely populated areas. All of these observations are helpful for our analysis below, since one of our objectives is to pair weather to individual runners with high precision.
We assign each individual j to a weather station g by choosing the station that is closest to their running activity if GPS locations are recorded for that individual, or otherwise to the address they provided during registration (Supplementary Figure 9). Individuals who train in areas more than 30 km away from the closest weather station are excluded from the analysis (about 3% of our sample), since it is impossible to identify correctly the weather they experience. At the end of this process we have a 1:1 matching between stations and fitness tracking individuals, and therefore a time series of precipitation and temperature for the period of interest for each individual.
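The station assignment can be sketched as a nearest-neighbour search with a 30 km cut-off. The source does not state which distance measure was used; the haversine great-circle distance below is an assumption, and the station identifiers and coordinates are illustrative only.

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_STATION_DISTANCE_KM = 30.0  # cut-off quoted in the text


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def assign_station(runner_latlon, stations):
    """Return the id of the nearest station, or None if it is more than 30 km away."""
    best_id, best_dist = None, float("inf")
    for station_id, (lat, lon) in stations.items():
        dist = haversine_km(runner_latlon[0], runner_latlon[1], lat, lon)
        if dist < best_dist:
            best_id, best_dist = station_id, dist
    return best_id if best_dist <= MAX_STATION_DISTANCE_KM else None


# Illustrative stations; ids and coordinates are made up for the example.
stations = {"boston": (42.36, -71.06), "chicago": (41.88, -87.63)}
near = assign_station((42.37, -71.11), stations)   # a few km from "boston"
far = assign_station((39.0, -95.0), stations)      # hundreds of km from both
```

A runner for whom `assign_station` returns `None` would be dropped from the analysis, mirroring the exclusion described above.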
Precipitation. For each day during the period of social network observation, we collected daily precipitation data at the weather station level. Daily precipitation is recorded in tenths of a millimeter and indicates the total precipitation measured by each of the world weather stations each day. Precipitation is always non-negative, and values greater than zero indicate a rainy day. The maximum precipitation recorded in our data is 179 mm (7 inches) (see Supplementary Table 2).
Temperature. For each day in the same period of observation, we collected daily temperature data at the weather station level. The daily temperature is recorded in tenths of a degree Celsius and indicates the maximum temperature that the weather station experiences each day. The temperature can be either positive or negative on either the Celsius or Fahrenheit scale. In our dataset the minimum temperature recorded was −43 °C (−45 °F) and the maximum was 54.5 °C (129 °F) (see Supplementary Table 2).

Supplementary Note 2: Model Specification and Estimation Procedures

A Causal Model of Exercise Contagion
Since dyadic models can be biased due to heterogeneity in individuals' connectivity, from here on we specify estimation models at the ego level.
Let $A_{it}$ be the fitness activity of individual $i$ on day $t$. Individual fitness activity is measured in daily distance (km), duration (min), pace (km/min) or calories burned (cal). We define $c_{ijt}$ to be the binary indicator of the existence (= 1) or not (= 0) of a relationship between individuals $i$ and $j$ at time $t$ (the adjacency matrix). We can then define the degree (connectivity) of an individual $i$ at time $t$ as $k_{it} = \sum_j c_{ijt}$.
We specify four factors that affect fitness activity. First, there are time fixed effects, which include holidays, weekends, marathon days etc., that we denote with $\nu_t$ for each time period $t$. Second, there are time-invariant, individual fixed effects that separate individuals with different fitness habits and motivations, that we denote with $\eta_i$ for each individual $i$. Third, there are time-varying characteristics like degree. Finally, there is exogenous variation in environmental conditions that perturbs individual utility for outdoor training, like changes in weather patterns. These effects, which we denote as $w_{it}$ for individual $i$ during time period $t$, are time-varying and individual-specific (through the location of the individual).
We further specify an endogenous factor that influences the fitness habits of an individual that is a function of their social ties. In other words we assume that each individual i's fitness activity on day t or the next couple of days, t + δt, is influenced by the fitness activity of her social circle, i.e. from the specific activity on day t of each individual j to whom she is connected.
Using the above definitions and assumptions we specify a linear model for the running activity of individual $i$ at time $t+\delta t$ ($\delta t = 0, 1, 2, \ldots$ days). In the special case where $\delta t = 0$, we assume a memoryless model where individuals influence each other only within a day and not across time periods.
The above model assumes that the running activity $A_{i,t+\delta t}$ of Ego $i$ at time $t+\delta t$ ($\delta t = 0, 1, 2, \ldots$) is an additive linear function of other factors measured at the same time $t+\delta t$ or in previous time periods $t+\delta t-1, \ldots, t$, including the time fixed effects $\nu_{t+\delta t}$; the effect of exogenous factors $w_{i,t+\delta t}$ (the temperature and precipitation that $i$ experiences on day $t+\delta t$); the effect $\beta$ of an endogenous factor $\bar{A}^p_{it} = \frac{1}{k_{it}} \sum_j c_{ijt} A_{jt}$ (the average running activity of the social contacts of $i$ on day $t$); and the effect of the running history of the individual on previous days.

The usual assumption is that the error term, $\varepsilon_{it}$, is i.i.d. (independent and identically distributed), but this is clearly violated here since our estimation takes place in a population of individuals connected in a network. A natural approach in such cases is to assume "clustered errors", i.e. that observations within a network cluster $u$ are correlated in some unknown way, inducing correlation in $\varepsilon_{it}$ within $u$. In the presence of clustered errors, OLS or IV estimates are unbiased but standard errors may be wrong, leading to incorrect inference in a surprisingly high proportion of finite samples.
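For orientation, the linear model described in the text can be sketched in one line. This is a reconstruction from the verbal description only: the symbols $\nu$, $\eta$, $w$, $\beta$, $\bar{A}^p_{it}$ and $\varepsilon$ come from the text, whereas the coefficient names $\gamma$ and $\alpha_m$, the lag index $m$ and the lag count $L$ are assumptions introduced here for illustration.

```latex
A_{i,t+\delta t}
  = \nu_{t+\delta t} + \eta_i
  + \gamma\, w_{i,t+\delta t}
  + \beta\, \bar{A}^{p}_{it}
  + \sum_{m=1}^{L} \alpha_m\, A_{i,t+\delta t-m}
  + \varepsilon_{it},
\qquad
\bar{A}^{p}_{it} = \frac{1}{k_{it}} \sum_{j} c_{ijt}\, A_{jt}
```

Here $\beta$ is the social influence coefficient of interest; time-varying controls such as degree would enter as further additive terms.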
Although this model seems straightforward to estimate, the reciprocal influence of an individual on her friends' running state and vice versa makes it difficult to interpret a simple association in their fitness behavior. Correlation in exercise habits may not only result from pairwise mutual influence, but also from triangles in the social network. For example, j might influence l's fitness behavior, which in turn affects i's fitness behavior, and so on. We address the inherent endogeneity of contagion in the next section using instrumental variable theory, a well known and understood method widely used in the econometrics literature for identifying causal effects in non-networked data.

Instrumental Variable Theory
Endogeneity exists when an explanatory variable is related to the error term in the population model of the data generating process, for example due to omitted variables, measurement error, or other sources of simultaneity bias or reverse causality. It causes the ordinary least squares (OLS) estimator to be biased and inconsistent (4). Instrumental Variables (IV) is an estimation method, widely used in economics, education and epidemiology, that provides a way to obtain consistent parameter estimates (5,6). An instrument Z for an endogenous regressor X in a model of an outcome Y must (a) correlate with X and (b) not correlate with the error term u. The first assumption requires that there is an association between the instrument Z and the variable being instrumented X, while the second assumption excludes the instrument Z as a regressor in the model of Y. In linear models, (a) and (b) are basic requirements for using IV theory.
If the instrument Z is valid, i.e. satisfies the above conditions of relevance and exogeneity, then the coefficient $\beta$ can be estimated using an IV estimator in a Two Stage Least Squares (2SLS) procedure, where in the second stage $\hat{A}^p_{it}$ are the predicted values from the first stage, which are assumed to be orthogonal to $\varepsilon_{it}$, and $\beta$ is the social influence coefficient (causal effect) that we are interested in estimating.
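The two-stage logic can be illustrated with a deliberately small sketch: one instrument, one endogenous regressor, no controls or fixed effects. It is not the estimator used for the reported results; the function names and toy data are assumptions. In the toy data, z plays the role of peers' weather, x of peers' running, y of Ego's running and u of an unobserved factor shared by Ego and peers.

```python
def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def two_stage_least_squares(z, x, y):
    """Point estimate of the effect of x on y, using z as the instrument."""
    # Stage 1: project the endogenous regressor on the instrument.
    a1, b1 = ols_simple(z, x)
    x_hat = [a1 + b1 * zi for zi in z]
    # Stage 2: regress the outcome on the fitted values.
    _, beta = ols_simple(x_hat, y)
    return beta


# Toy data: u confounds x and y but is uncorrelated with z by construction.
z = [0, 1, 0, 1, 0, 1, 0, 1]
u = [1, 1, -1, -1, 2, 2, -2, -2]
x = [2 * zi + ui for zi, ui in zip(z, u)]
y = [0.5 * xi + 3 * ui for xi, ui in zip(x, u)]   # true effect of x is 0.5

beta_ols = ols_simple(x, y)[1]
beta_iv = two_stage_least_squares(z, x, y)
```

Because u raises both x and y, the naive OLS slope overstates the effect, while the IV estimate recovers 0.5. The sketch returns a point estimate only; standard errors taken from a hand-rolled second stage are not valid and would need the usual 2SLS correction plus clustering.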
The credibility of these estimates hinges on the selection of suitable instruments. Good instruments are often created by policy changes. For example, the cancellation of a federal student-aid scholarship program may reveal the effects of aid on some students' outcomes.

The Weather as an Instrument
We propose the weather as an instrument for detecting contagion in exercise habits across social network ties. Instead of changing individuals' running activity directly with an experimental treatment (which seems difficult and expensive), we let the weather do the work for us by measuring how weather-induced changes in individual exercise behavior predict changes in the individuals' friends' exercise behavior. First, we must establish the two requirements for using an instrumental variable approach: relevance, i.e. detecting a strong relationship between the instrument (weather) and the endogenous predictor (friends' running activity), and exogeneity, i.e. weather changes that friends experience do not directly affect Ego's running behavior.
Relevance. Weather is unlikely to be affected by individuals' running activity; therefore, if we find a relationship between the two, it suggests that weather influences the running activity of individuals and not vice versa. We have two distinct weather indicators available, rainfall and temperature. In Supplementary Figure 10, we plot the daily, per capita running activity for fifteen large cities in the United States as a function of the precipitation and temperature experienced in those cities. It is clear that more precipitation is associated with lower running activity, in a monotonic fashion (similar to the graph in Supplementary Figure 11A). On the other hand, the relationship between running activity and temperature is non-monotonic, suggesting that very high and very low temperatures are associated with reduced exercise activity (similar to the graph in Supplementary Figure 11B). We further use the granularity of our dataset to visualize how bad weather affects running.

We have established a strong association between weather and running activity, suggesting that precipitation and temperature can potentially serve as instrumental variables to detect peer effects in fitness habits and activities. We test the strength of these instruments more formally in our evaluation of the first stage regressions as described below.
Exclusion Restriction-Exogeneity. One of the biggest concerns in a model like this is that friends' weather is correlated, so the instrument might actually just be a proxy for the direct effect of weather on a person's running behavior -a violation of the "exclusion restriction" (5).
Unfortunately, weather patterns are highly correlated, both spatially and temporally. For example, geographically proximate regions are more likely to experience the same weather on the same day: a rainy day in Chicago, IL implies with large probability a rainy day in nearby locations (Supplementary Table 3).
In order to meet the exclusion restriction requirements in our IV analysis, we only consider how running behavior is transmitted between social dyads that have no correlation in weather patterns. To do so we compute the sample Pearson correlation coefficient (11) between the weather of all links $(i_{t+\delta t}, j_t)$ using the weather history for 45 months, where $w_{i,t+\delta t}$ denotes the weather that individual $i$ (Ego) experiences on day $t+\delta t$ ($\delta t = 0, 1, 2, \ldots$) [...] and ∼0.9M when we consider two-day difference correlations ($t+2$ vs $t$). In the case of more than two-day differences, the correlation in almost all dyads drops below the threshold point and therefore almost no link is excluded (see also right panel in Supplementary Figure 13). We provide detailed tests of the sensitivity of our analysis to this choice of threshold below. The results of these analyses show that our estimates are not sensitive to the choice of threshold.
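The dyad screen can be sketched as a lagged Pearson correlation followed by a threshold test. This is a minimal sketch: the function names are hypothetical, the threshold value in the example is made up (the text does not state it at this point), and treating a constant weather series as uncorrelated is an added assumption.

```python
import math


def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: a constant series is treated as uncorrelated
    return cov / math.sqrt(var_a * var_b)


def lagged_weather_correlation(ego_weather, peer_weather, lag):
    """Correlate Ego's weather on day t + lag with the peer's weather on day t."""
    if lag > 0:
        return pearson(ego_weather[lag:], peer_weather[:-lag])
    return pearson(ego_weather, peer_weather)


def keep_dyad(ego_weather, peer_weather, lag, threshold):
    """Keep the dyad unless its weather correlation is larger than the threshold."""
    return lagged_weather_correlation(ego_weather, peer_weather, lag) <= threshold


ego = [0, 1, 0, 1, 0, 1]           # daily rain indicator for Ego
same_city_peer = [0, 1, 0, 1, 0, 1]
distant_peer = [1, 1, 0, 0, 1, 1]
```

With an illustrative threshold of 0.1, `keep_dyad(ego, same_city_peer, 0, 0.1)` drops the same-city dyad (correlation 1) and `keep_dyad(ego, distant_peer, 0, 0.1)` keeps the distant one (correlation 0).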

Choosing Optimal Instruments: The Lasso (Post-Lasso) Method
For each peer $j$ on day $t$ we consider a collection of binary weather indicators: $N$ for the precipitation ($r$) and $M$ for the temperature ($\theta$) that the individual experiences. In order to generate binary indicators we divide the range of precipitation and temperature that individuals experience into percentiles and define areas where the precipitation and temperature are larger or smaller than point percentiles, as we described in Supplementary Figure 14. In this way, we take into account and differentiate peers who live in cities with different average weather patterns. For example, a day with 2 inches of precipitation in Seattle is different than a day with the same amount of precipitation in Los Angeles. We assume that a rainy day in a typically dry city has larger marginal effects on running behavior than in a city that is typically wet. Let's assume that we have two peers, one living in a typically wet and cold city and one in a typically dry and warm city. We design our binaries to make sure that 2 inches of rain and a temperature of 20 °C in both cities activate different precipitation and temperature binaries. In this way we take into account city-specific weather effects.
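The city-relative indicators can be sketched by comparing a day's value with percentiles of that city's own weather history. The cut points (50th, 75th, 90th), the use of "greater than" only, and the toy histories are assumptions for illustration; the actual definitions are those of Supplementary Figure 14.

```python
def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list."""
    if len(sorted_values) == 1:
        return sorted_values[0]
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def weather_indicators(value, local_history, cut_points=(50, 75, 90)):
    """One binary indicator per cut point: 1 if value exceeds the local percentile."""
    ordered = sorted(local_history)
    return [1 if value > percentile(ordered, q) else 0 for q in cut_points]


# Toy daily precipitation histories in mm (made up for the example).
wet_city = [0, 5, 10, 20, 30, 40, 50, 60, 70, 80]
dry_city = [0, 0, 0, 0, 0, 0, 1, 2, 5, 10]

same_rain_mm = 50  # roughly 2 inches
wet_flags = weather_indicators(same_rain_mm, wet_city)
dry_flags = weather_indicators(same_rain_mm, dry_city)
```

The same 50 mm day switches on only the lowest indicator in the wet city but all three in the dry city, which is the differentiation between typically wet and typically dry locations described above.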
In the first-stage regression, $\bar{A}^p_{it}$ is the average running activity of $i$'s peers and $W$ collects the binary weather indicators. Using the LASSO, we select optimal instruments that minimize the sum of squared residuals $\sum (\bar{A}^p_{it} - \hat{A}^p_{it})^2$ subject to $|\lambda_{r,0}| + |\lambda_{r,1}| + \ldots + |\lambda_{\theta,M-1}| \le s$, where the first sum is taken over observations in the dataset and $\hat{A}^p_{it}$ are the predicted values of the regression. The bound $s$ is a tuning parameter that controls the tradeoff between the penalty and the fit (loss/likelihood). When $s$ is large enough, the constraint has no effect and the solution is just the usual multiple linear least squares regression of $\bar{A}^p_{it}$ on $W$.
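Instrument selection with the LASSO can be sketched with a small coordinate-descent solver. The sketch uses the penalised (Lagrangian) form with penalty weight `alpha`, which is equivalent to the constrained form with bound $s$ for a suitable pairing of the two; it assumes centred data, and the function names and toy design are hypothetical. A post-LASSO step would then refit ordinary least squares on the selected columns only.

```python
def soft_threshold(rho, alpha):
    """Shrink rho towards zero by alpha, returning exactly 0 inside [-alpha, alpha]."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimise (1/2n) * ||y - X b||^2 + alpha * ||b||_1 for centred X and y."""
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes column j.
            rho = 0.0
            for row, target in zip(X, y):
                partial = sum(row[k] * coef[k] for k in range(p) if k != j)
                rho += row[j] * (target - partial)
            rho /= n
            coef[j] = soft_threshold(rho, alpha) / col_sq[j]
    return coef


# Toy design: three centred candidate instruments (columns), eight observations.
X = [[a, b, c] for a in (-1.0, 1.0) for b in (-1.0, 1.0) for c in (-1.0, 1.0)]
# Peers' activity depends strongly on column 0 and only weakly on column 2.
y = [2.0 * row[0] + 0.05 * row[2] for row in X]

coef = lasso_coordinate_descent(X, y, alpha=0.1)
selected = [j for j, b in enumerate(coef) if b != 0.0]
```

The penalty removes the irrelevant and the weak candidate and keeps only column 0, with its coefficient shrunk from 2.0 to 1.9; the post-LASSO refit would undo that shrinkage.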

Estimating Treatment Effect Heterogeneity
In order to gain insight regarding influential members of the running community we study, we introduce heterogeneous treatment effects in the Ego-level estimation by examining time-invariant features of runners. First, we are interested in whether a more active friend is more influential than a less active one. To measure these effects we split the neighborhood of each Ego $i$ into subsets of peers according to the ratio between their overall running activity over the period of observation and Ego's total running activity. We first calculate the overall running activity of the Ego and of each peer ($A_i = \sum_t A_{i,t}$ and $A_j = \sum_t A_{j,t}$, $j = 1, 2, \ldots, k_{it}$, respectively), where $k_{it}$ is the number of peers Ego $i$ has at time $t$, and calculate all ratios $\Lambda_{ij}$ of peer's running activity to Ego's running activity. We then define several continuous ranges for $\Lambda_{ij}$: (i) $\Lambda < 1/16$, (ii) $1/16 \le \Lambda < 1/8$, (iii) $1/8 \le \Lambda < 1/4$, (iv) [...]

[...] and inactive ($pa$ = "L") runner groups and define a model of exercise contagion that includes an interaction term for the level of activeness. The interaction term considers all four possible scenarios: (i) Ego is an active runner and peers are active runners, (ii) Ego is an active runner and peers are inactive runners, (iii) Ego is an inactive runner and peers are active runners, (iv) Ego is an inactive runner and peers are inactive runners. We then estimate an instrumental variable regression to identify the causal effect $\beta$, using the precipitation and temperature binary indicators chosen from the LASSO penalized regression to instrument for the endogenous interaction term.
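Grouping peers by relative activity amounts to computing the ratio of totals and looking up its range. In the sketch below only the first three edges (1/16, 1/8, 1/4) are taken from the text; the remaining edges assume that the doubling pattern continues up to 16, and the function names are hypothetical.

```python
from bisect import bisect_right

# First three edges are quoted in the text; the rest are an assumption.
EDGES = [1 / 16, 1 / 8, 1 / 4, 1 / 2, 1, 2, 4, 8, 16]


def activity_ratio(peer_total, ego_total):
    """Ratio of the peer's total running activity to Ego's (ego_total must be > 0)."""
    return peer_total / ego_total


def ratio_bin(ratio):
    """Index of the half-open range [lower, upper) that contains the ratio."""
    return bisect_right(EDGES, ratio)


# A peer who ran 100 km over the period, paired with an Ego who ran 400 km.
bin_index = ratio_bin(activity_ratio(100.0, 400.0))
```

Bin 0 is the range below 1/16, bin 1 is 1/16 up to but excluding 1/8, and so on; the example ratio of exactly 1/4 falls in bin 3, the first range above the three quoted in the text.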
We next consider an interaction model that investigates the role of running consistency (in time) in social contagion. We are interested in whether a consistent running friend is more influential than a sporadically active friend, or the other way around. First, for each runner in our dataset, we identify the periods during which their running activity is consistent. We do so by isolating the periods where activity is continuous without any inactivity lasting more than 2 weeks (see Supplementary Figure 15 below for illustration). By following the same methodology for all the available runners, we identify 376,000 distinct running activities with an average consistency length of 34 days (S.D. 62 days).
We then define an individual as a consistent runner if the largest consistency period lasts more than 1 month. Otherwise we define the individual as an inconsistent runner. For each Ego $i$ we split their neighborhood (peers $j = 1, \ldots, k_{it}$) into consistent "C" and inconsistent "I" runner groups and we define a model of exercise contagion that includes an interaction term for running consistency, where $\bar{A}^{p(pc)}_{it}$ is the average running activity of the peers at time $t$ in the consistent group ($pc$ = C) and the inconsistent group ($pc$ = I), and $E^{(ec)}$ is an indicator variable denoting whether Ego is consistent ($ec$ = C) or inconsistent ($ec$ = I). The interaction term considers all four possible scenarios: (i) Ego is a consistent runner and peers are consistent runners, (ii) Ego is a consistent runner and peers are inconsistent runners, (iii) Ego is an inconsistent runner and peers are consistent runners, and (iv) Ego is an inconsistent runner and peers are inconsistent runners. We then estimate an instrumental variables regression to identify the causal effect $\beta$, using precipitation and temperature instruments chosen by the LASSO penalized regression to instrument for the endogenous interaction term.
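The consistency classification can be sketched by cutting a runner's history wherever the gap between consecutive runs exceeds two weeks. Representing days as integers, measuring a period as the span from its first to its last run, and reading "1 month" as 30 days are assumptions made for this illustration.

```python
MAX_GAP_DAYS = 14          # inactivity longer than this ends a period
MIN_CONSISTENT_DAYS = 30   # assumption: "1 month" read as 30 days


def consistency_periods(run_days):
    """Split run days (integer day numbers) into (start, end) periods of continuous activity."""
    days = sorted(set(run_days))
    if not days:
        return []
    periods = []
    start = prev = days[0]
    for day in days[1:]:
        if day - prev > MAX_GAP_DAYS:
            periods.append((start, prev))
            start = day
        prev = day
    periods.append((start, prev))
    return periods


def is_consistent(run_days):
    """True if the longest period of continuous activity spans more than one month."""
    return any(end - start > MIN_CONSISTENT_DAYS
               for start, end in consistency_periods(run_days))


weekly_runner = [0, 7, 14, 21, 28, 35, 60, 62]
sporadic_runner = [0, 3, 40, 45]
```

The weekly runner has one 35-day period followed by a short one after a 25-day break, so she is classified as consistent; the sporadic runner's longest period spans only 5 days.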
Furthermore, we are interested in how gender affects exercise influence. For each Ego $i$ we split their neighborhood (peers $j = 1, \ldots, k_{it}$) into male "M" and female "F" peers and we define a model of exercise contagion that includes an interaction term for gender. The interaction term covers all four combinations of peer and Ego gender: male peers/male Ego, male peers/female Ego, female peers/male Ego and female peers/female Ego. We then estimate an instrumental variables regression to identify the causal effect $\beta$, using precipitation and temperature instruments chosen by the LASSO penalized regression to instrument for the endogenous interaction term.
Finally, using a slightly modified model, we investigate the role of same-gender and cross-gender influence. For each Ego $i$ we split their neighborhood (peers $j = 1, \ldots, k_{it}$) into a same-gender group "S" and a cross-gender group "C" of peers with respect to Ego's gender. We define a model of exercise contagion that includes an interaction term for gender, where $\bar{A}^{p(pg)}_{it}$ is the average running activity of the peers in the same-gender group ($pg$ = "S") and in the cross-gender group ($pg$ = "C") at time $t$. The two categories are (i) same gender and (ii) cross gender. We then estimate an instrumental variables regression to identify the causal effect $\beta$, using precipitation and temperature instruments chosen by the LASSO penalized regression to instrument for the endogenous term.

Testing Structural Theories of Social Contagion
The Complex Contagion theory contends that multiple sources of exposure to a behavior increase the likelihood that an individual adopts the behavior (16,17). We test whether complex contagion explains contagion in our exercise data by investigating the impact of the number of running friends on Ego's running behavior. We do so by defining a model for Ego's activity (dependent variable) where the endogenous effect is the number of friends that are active on the same day, $\#FR_t$, controlling for the total number of connections Ego has, $k_{it}$, and all other characteristics of Ego and their peers, elements of $X_{it}$ and $X^p_{it}$ respectively. We use an instrumental variable method where we instrument the endogenous variable $\#FR_t$ with the precipitation and temperature binary indicators chosen from the LASSO penalized regression. We anticipate that the number of active friends is a positive predictor of social influence.
To double check the functional relationship between exercise influence and the number of active friends, we define an additional model with two endogenous regressors, the number of active friends $\#FR_t$ and its square $(\#FR_t)^2$, with coefficients $\beta_1$ and $\beta_2$ respectively. The functional form of the relationship between exercise influence and the number of active friends depends on the absolute value and sign of the estimated coefficient $\beta_2$. We use the 2SLS methodology, instrumenting the two endogenous regressors with at least two weather binary indicators that we choose using a LASSO regression analysis.
We then examine how the structural diversity of the Ego's neighborhood affects exercise influence by investigating how fitness contagion is driven by the number of (running) active connected components in Ego's network. To count the number of connected components that are active, we first go through all nodes (runners), identify their immediate connections (neighborhood), and find the connected components in the neighborhoods (see Supplementary Figure 16 for illustration).
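Counting active components can be sketched as a graph search restricted to Ego's neighbourhood. The adjacency representation (a dict mapping each runner to a set of friends), the function name and the rule that a component counts as active when at least one of its members ran that day are assumptions for illustration.

```python
def active_components(ego, adjacency, active):
    """Count connected components among ego's friends (ego removed) that contain an active runner."""
    neighbours = adjacency[ego]
    seen = set()
    count = 0
    for start in neighbours:
        if start in seen:
            continue
        # Depth-first search that only walks ties between ego's friends.
        stack = [start]
        seen.add(start)
        has_active = False
        while stack:
            node = stack.pop()
            if node in active:
                has_active = True
            for nxt in adjacency[node]:
                if nxt in neighbours and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if has_active:
            count += 1
    return count


# Ego "e" has four friends; only "a" and "b" know each other.
adjacency = {
    "e": {"a", "b", "c", "d"},
    "a": {"e", "b"},
    "b": {"e", "a"},
    "c": {"e"},
    "d": {"e"},
}
ran_today = {"a", "b", "c"}
n_active_components = active_components("e", adjacency, ran_today)
```

Here three friends ran but they sit in only two distinct components ({a, b} and {c}), which is the distinction between the number of active friends and the number of active components that the structural diversity test relies on.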
We then define a model of Ego's activity in which the endogenous effect is the number of connected components that are active on the same day ($\#CR_t$), controlling for the total number of connections that Ego has, $k_{it}$, an element of $X_{it}$. We use a 2SLS instrumental variable method to estimate the causal effect $\beta$ using a subset of the binary weather indicators to instrument for the number of running (active) components $\#CR_t$.
Ugander et al. recently showed in a Facebook study that the probability of contagion is highly correlated with the number of connected components in an individual's contact neighborhood, rather than with the actual size of the neighborhood (18). We test this hypothesis by defining an exercise contagion model with two endogenous regressors, the number of active friends $\#FR_t$ and the number of active connected components in Ego's neighborhood $\#CR_t$, making sure that we control for Ego's connectivity at time $t$, $k_{it}$, an element of $X_{it}$. We use an instrumental variable method to estimate the two causal effects $\beta_f$ and $\beta_c$ by instrumenting the two endogenous variables with a subset of weather binary indicators chosen using a LASSO penalized regression.
One of the most widely studied social factors theorized to affect the strength of social influence is structural embeddedness, the extent to which individuals share common peers. In this subsection we investigate how structural embeddedness moderates social influence in exercise habits, while simultaneously controlling for confounding factors that can bias inference in networked settings. Here, we adopt the conventional network structural measure of embeddedness, defined as the number of common friends shared by individuals and their peers (19)(20)(21). We first split the neighborhood of each Ego $i$ into two groups of peers, one in which peers share no common friends with Ego ($e_{ij} = 0$) and one in which all peers share at least one common friend with Ego ($e_{ij} = 1$), where $e$ is a categorical variable. We then propose an interaction model of exercise contagion based on our estimation model, in which the endogenous term is the average running activity of the set of peers that are embedded. We also estimate a model that examines the influence ($\beta$) of the set of peers in Ego's neighborhood that are not embedded. We use an instrumental variable method to identify the effect $\beta$ by instrumenting the endogenous effect with a subset of the available weather binary indicators.
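The embeddedness split can be sketched with set intersections over friend lists. The dict-of-sets adjacency structure and the function names are hypothetical; the definition (number of common friends, with peers grouped by zero versus at least one) follows the text.

```python
def embeddedness(ego, peer, adjacency):
    """Number of friends that ego and peer have in common."""
    return len((adjacency[ego] & adjacency[peer]) - {ego, peer})


def split_by_embeddedness(ego, adjacency):
    """Return (embedded, non_embedded) peers: at least one vs. no common friend with ego."""
    embedded, non_embedded = [], []
    for peer in sorted(adjacency[ego]):
        if embeddedness(ego, peer, adjacency) >= 1:
            embedded.append(peer)
        else:
            non_embedded.append(peer)
    return embedded, non_embedded


# Ego "e" has four friends; only "a" and "b" know each other.
adjacency = {
    "e": {"a", "b", "c", "d"},
    "a": {"e", "b"},
    "b": {"e", "a"},
    "c": {"e"},
    "d": {"e"},
}
embedded_peers, non_embedded_peers = split_by_embeddedness("e", adjacency)
```

In this toy network "a" and "b" each share one common friend with Ego and form the embedded group, while "c" and "d" share none; the average running activity would then be computed separately within each group.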

Model-Free Evidence of Exercise Clustering
We first present some model free evidence for running activity clustering in the network. In Supplementary Figure

Peer Effects-IV Estimation Results
While the fixed effect models provide evidence of the possible existence of peer effects in the system, their estimates are biased. To produce unbiased estimates of the magnitude of peer effects in exercise, we execute the IV estimation method described in Supplementary Note 2.
We organize our data into Ego $i$ and day $t$ panels. For Ego $i$ on each day $t$, we have two meteorological binary indicators, one for the precipitation and one for the temperature, $w_{it} = (r_{it}, \theta_{it})$. The binary indicator for Ego's precipitation on day $t$ ($r_{it}$) takes the value 1 if the total precipitation that Ego experiences on day $t$ is larger than their seasonal average calculated for a period of 2 months centered on day $t$, and 0 otherwise.
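Ego's precipitation indicator can be sketched as a comparison with a centred moving average. Reading "2 months" as 30 days on either side of day t and truncating the window at the ends of the series are assumptions made for this illustration.

```python
HALF_WINDOW_DAYS = 30  # assumption: a 2-month window is 30 days either side of t


def precipitation_indicator(precip, t, half_window=HALF_WINDOW_DAYS):
    """Return 1 if day t is wetter than the average of the window centred on t, else 0."""
    lo = max(0, t - half_window)                    # truncate at the series start
    hi = min(len(precip), t + half_window + 1)      # truncate at the series end
    window = precip[lo:hi]
    seasonal_mean = sum(window) / len(window)
    return 1 if precip[t] > seasonal_mean else 0


# Sixty-one dry days with a single 10 mm rain event on day 30.
precip = [0.0] * 61
precip[30] = 10.0

rainy_flag = precipitation_indicator(precip, 30)
dry_flag = precipitation_indicator(precip, 10)
```

The rain event on day 30 exceeds its own window average and is flagged, whereas a dry day whose window contains that event is not.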
The binary indicator for Ego's temperature on day $t$ takes the value 1 if the temperature that Ego experiences on day $t$ is either in the range ($-\infty$, [...], where $\theta_{i,\min}$, $\theta_{i,\max}$ and $\theta_i$ are the minimum, maximum and average temperature, respectively, that Ego $i$ experiences in a 2-month period centered on day $t$.
We also control for Ego's past running activity. To specify the endogenous effect $\bar{A}^p_{it}$, for each Ego $i$ we identify their running buddies and compute the Pearson correlation coefficient between the weather each peer experiences and the weather Ego experiences, dropping all the peers whose correlation coefficient is larger than the threshold described in the "Weather as an Instrument" section of Supplementary Note 2. By excluding all links for which peers' weather correlates with Ego's weather, we ensure the validity of our exclusion restriction. Using the remaining peers ($k_{it}$ in total), we calculate their average running activity as $\bar{A}^p_{it} = \frac{1}{k_{it}} \sum_j c_{ijt} A_{jt}$, where $A_{jt}$ is the running activity of peer $j$ and $c_{ijt}$ is the adjacency matrix. Note that when we are interested in identifying same-day social influence ($\delta t = 0$), we take time zones into account when constructing $\bar{A}^p_{it}$, making sure that peers' running took place before Ego's running.
We also prepare the time varying characteristics of Ego i (X it ) as well as the time varying and time invariant characteristics of peers that we control for in our model (X p it ). The former includes the degree of i while the latter includes the average degree of peers, the average age of peers, the average height and weight of peers, the fraction of peers that are men (women) and the fraction of peers that are located in US, UK, Canada, or another country.
We finally specify the weather variables that instrument for the endogenous effect $\bar{A}^p_{it}$. For the $k_{it}$ peers of each Ego $i$, we identify the $C_j$ unique weather towers to which peers are most closely located. Note that the number of unique weather towers that the $k_{it}$ peers are closely located to is always $C_j \le k_{it}$, since it is possible that more than one of Ego's peers are located in the same city. By considering only the towers (or cities) with distinct weather, we make sure that we do not violate the exclusion criterion of the IV model. Each of these $C_j$ weather towers experiences different weather. For each of these weather towers ($l = 1, 2, \ldots, C_j$), we define $N$ rain and $M$ temperature binary indicators according to the methodology described in Supplementary Figure 14. We finally define $N+M$ variables as the sum of the weather binary indicators over the $C_j$ unique weather towers.

Treatment Effect Heterogeneity
In this subsection we present the results from our heterogeneous treatment effects models 5 to 9, defined in the "Estimating Treatment Effect Heterogeneity" in Supplementary Note 2. In all of the models we first split each Ego i's network neighborhood into several groups according to each model's specifications and specify the endogenous term as the interaction between the group type and the average running activity of the peers in the group. We instrument for the endogenous term with an interaction between the group type and the weather variables in order to identify the causal peer effect β.
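The logic of instrumenting peer activity with peer weather can be illustrated with a deliberately simplified sketch: one group, one endogenous regressor and one binary instrument, in which case the 2SLS estimate reduces to the ratio of covariances cov(z, y) / cov(z, x). The data, variable names and the unobserved confounder below are invented for illustration and do not come from the study; the actual models use many instruments, controls and group interactions.

```python
def covariance(a, b):
    """Population covariance of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n


def iv_estimate(z, x, y):
    """Just-identified IV estimate of the effect of x on y using instrument z."""
    return covariance(z, y) / covariance(z, x)


def ols_estimate(x, y):
    """OLS slope of y on x (with intercept), for comparison."""
    return covariance(x, y) / covariance(x, x)


# Hypothetical toy data.
rain = [0, 1, 0, 1, 0, 1, 0, 1]            # instrument: rain at the peers' location
shared = [1, 1, -1, -1, 1, 1, -1, -1]      # unobserved factor affecting both runners
peer_km = [5 - 2 * r + s for r, s in zip(rain, shared)]
true_beta = 0.3                            # assumed causal peer effect in the toy data
ego_km = [3 + true_beta * p + 2 * s for p, s in zip(peer_km, shared)]

beta_iv = iv_estimate(rain, peer_km, ego_km)
beta_ols = ols_estimate(peer_km, ego_km)
```

In this toy data the instrument is uncorrelated with the shared factor by construction, so the IV estimate recovers approximately 0.3, whereas the OLS slope (approximately 1.3) is inflated by the confounder.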
Supplementary Table 9 reports the estimates, with standard errors, t-statistics, p-values, 95% confidence intervals, and diagnostic statistics, for the second stage of the 2SLS regression for the interaction model in Supplementary Equation 5. Surprisingly, friends who are less active than Ego influence Ego's running habits more. Specifically, peers with four to eight times less running activity than Ego are the most influential on average, with an influence coefficient close to 0.5. On the other hand, an extra kilometer run by a more active friend has no significant effect on Ego's running activity. Results of the model are graphically displayed in Figure 2A of the main manuscript.
Supplementary are generally more susceptible to exercise influence, especially when influence is coming from inactive runners. It is also worth mentioning that in conjunction with the results in Supplementary Table 9, active friends have no significant influence on non-active runners. Results of the model are graphically displayed in Figure 2B of the main manuscript.
Supplementary Table 11 reports the estimates, with standard errors, t-statistics, p-values, 95% confidence intervals, and diagnostic statistics, for the second stage of the 2SLS regression for the interaction model in Supplementary Equation 7. Similar to the results on active and inactive runners, here we find that inconsistent peers are very influential over consistent runners and that consistent peers do not influence inconsistent runners. We also find that the influence coefficient is almost identical when Ego and peers are both either consistent or inconsistent. Results of the model are graphically displayed in Figure 2C of the main manuscript.
In Supplementary Table 12 we report the estimates, with standard errors, t-statistics, p-values, 95% confidence intervals, and diagnostic statistics, for the second stage of the 2SLS regression for the gender interaction model in Supplementary Equation 8. Men tend to be more influential runners, especially with respect to their influence on other men. However, the influence coefficient estimates become insignificant when we consider men influencing women. On the other hand, women exert significant influence on other women and on men. Results of the model are graphically displayed in Figure 2D of the main manuscript. Finally, in Supplementary Table 13 we report the estimates, along with standard errors and diagnostic statistics, for the second stage of the 2SLS regression for the same-gender and cross-gender interaction model in Supplementary Equation 9. We find that same-gender influence is significantly larger (t-stat=4.98) than cross-gender influence. Results of this model are graphically displayed in the inset of Figure 2D of the main manuscript.

Structural Theories of Social Contagion
Complex Contagion. The Complex Contagion Theory of behavioral contagion suggests that the number of behaviorally active friends in one's Ego network is a significant (non-linear) predictor of social influence. In Supplementary Table 14 In Supplementary Table 15 we report the second stage results for the social influence coefficients $\beta_1$ and $\beta_2$. The negative though small $\beta_2$ estimate suggests that there are diminishing returns to additional peers' influence.
Structural Diversity. The Structural Diversity Theory of behavioral contagion suggests that the number of behaviorally active components in one's Ego network, rather than the number of active friends, is the main predictor of social influence. In Supplementary Table 17
Embeddedness. The Embeddedness Theory of behavioral contagion suggests that the more mutual friends two people share, the more influential they will be on one another. Supplementary Table 18 reports the estimates, with standard errors, t-statistics, p-values, 95% confidence intervals, and diagnostic statistics, for the second stage of the 2SLS regression for the interaction model in Supplementary Equation 14. The results, illustrating the correspondence between structural embeddedness and influence, are displayed in Figure 3D of the main manuscript for social influence on run distance and in Supplementary Figure 21 for influence on run duration.
We observe that individuals are statistically significantly more influential on peers with whom they are embedded, i.e. share common friends (t-stat=2.45). This result is evidence for the Embeddedness Theory and is consistent with the empirical evidence described in Aral

Exogeneity
We test for any direct causal relationship between weather changes that Ego's peers experience and Ego's running activity. As we discussed in the sections on model specification, an important component of the model is to exclude all links between individuals whose weather patterns are correlated. To investigate whether our methodology is reliable, we consider here a simple model with Ego's running activity as the dependent variable and, as independent variables, the instruments $Z_{jt}$ used in our model estimation, controlling for all other exogenous factors and for peers' running activity. In Supplementary Table 19 we report the estimates of δ from this model for the four running indicators (distance, pace, duration and calories). The non-significance of the estimates for δ indicates that peers' weather does not correlate with Ego's running (except through its effect on peer running), providing evidence of the exogeneity of the instruments.

Non-Independence: Clustering and Standard Errors
The usual assumption in these types of models is that $\varepsilon_{it}$ is iid (independent and identically distributed). But this assumption could be violated in many of the cases we consider. A natural generalization while working on networks is to assume "clustered errors", that is, that observations within group u are correlated in some unknown way, inducing correlation in $\varepsilon_{it}$ within u, but that groups u and v do not have correlated errors. In the presence of clustered errors, OLS and IV estimates are both still unbiased but standard errors may be quite wrong, leading to incorrect inference in a surprisingly high proportion of finite samples.
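To illustrate why ignoring within-group correlation can mislead inference, the sketch below compares the conventional (iid) standard error of an OLS slope with a cluster-robust sandwich standard error in a one-regressor model. It is a simplified illustration with invented data, it uses no finite-sample correction, and it is not the estimator or clustering scheme used in the study.

```python
from collections import defaultdict
from math import sqrt


def slope_with_cluster_se(x, y, cluster):
    """Fit y = a + b*x by OLS; return (b, iid_se, cluster_se) for the slope b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    resid = [yi - a - b * xi for xi, yi in zip(x, y)]

    # Conventional standard error, valid only under iid errors.
    s2 = sum(e * e for e in resid) / (n - 2)
    iid_se = sqrt(s2 / sxx)

    # Cluster-robust sandwich: sum the scores (x_i - mean(x)) * e_i within each
    # cluster before squaring, so within-cluster correlation is accounted for.
    score = defaultdict(float)
    for xi, ei, g in zip(x, resid, cluster):
        score[g] += (xi - mx) * ei
    cluster_se = sqrt(sum(s * s for s in score.values())) / sxx
    return b, iid_se, cluster_se


# Toy data: 4 clusters of 3 observations; regressor and error are both
# constant within a cluster, the worst case for iid standard errors.
cluster = [g for g in range(4) for _ in range(3)]
x = [float(g) for g in cluster]
shock = {0: 1.0, 1: -1.0, 2: -1.0, 3: 1.0}
y = [1.0 + 0.5 * xi + shock[g] for xi, g in zip(x, cluster)]

b, iid_se, cluster_se = slope_with_cluster_se(x, y, cluster)
```

In this example the cluster-robust standard error is noticeably larger than the iid one, which is the pattern that leads to over-rejection when clustering is ignored.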
The optimal way of avoiding such a problem in a network topology would be having a

Alternative Instrument Design
As an alternative robustness check, we design the instruments in a slightly different, less sophisticated way to make sure our more complex specification is not somehow producing spurious results. Instead of specifying N (and M) binary indicators for the precipitation (and temperature) that peers experience and using the Post LASSO method to identify which set of binaries to use as instruments, here we propose a global design of instruments that are identical across peers.
For each peer we define two simple binary indicators for rain and temperature respectively. For each individual $j$ on each day $t$ we consider a binary rain indicator $r_{jt}$ that is equal to 1 if the precipitation that individual $j$ experiences on day $t$, $pr_{jt}$, is more than a seasonal average $\overline{pr}_{jt}$, and 0 otherwise (Supplementary Figure 22A). We compute the seasonal average as the average precipitation in a two-month period, from 30 days before to 30 days after the current day $t$, $\overline{pr}_{jt} = \frac{1}{60} \sum_{\tau=t-30}^{t+31} pr_{j\tau}$. In this way, we account for seasonality as we differentiate 2 inches of precipitation during a wet winter from 2 inches of precipitation during a dry summer. At the same time, for each individual, we build a binary indicator for temperature $\theta_{j,t}$ that is equal to 1 if the temperature $T_{j,t}$ that individual $j$ experiences is outside a normal temperature range $(T_0, T_1) = (35, 85)$ °F and 0 otherwise (Supplementary Figure 22B).
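The two indicators just described can be sketched as follows. The function names and input layout are hypothetical. Two further assumptions are made: the seasonal window is truncated at the ends of the series and the average is taken over the days actually available (rather than dividing by a fixed 60), and temperatures exactly at 35 or 85 °F are counted as outside the normal range.

```python
def rain_indicator(precip, t, half_window=30):
    """Return 1 if precipitation on day t exceeds the local seasonal average.

    precip: daily precipitation series for one individual (list of floats).
    The seasonal average is the mean over days t-half_window .. t+half_window,
    truncated at the ends of the series (assumption).
    """
    lo = max(0, t - half_window)
    hi = min(len(precip), t + half_window + 1)
    window = precip[lo:hi]
    seasonal_avg = sum(window) / len(window)
    return 1 if precip[t] > seasonal_avg else 0


def temperature_indicator(temp_f, t_low=35.0, t_high=85.0):
    """Return 1 if the temperature (in °F) is outside the normal range."""
    return 0 if t_low < temp_f < t_high else 1


# Toy example: a single wet day in an otherwise dry 100-day series.
precip = [0.0] * 100
precip[50] = 2.0
wet_day = rain_indicator(precip, 50)   # above the seasonal average
dry_day = rain_indicator(precip, 49)   # below the seasonal average
```

Because the threshold is a moving local average, the same absolute amount of rain can be flagged in a dry season and not flagged in a wet one, which is the seasonality adjustment the text describes.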
After we establish the exclusion criterion by dropping dyads whose weather is correlated using the methodology described in "Weather as an Instrument" in Supplementary Note 2, we define the two variables that will serve as instruments for the average activity of Ego's friends in our analysis as the sum of the binaries over the set of unique weather towers ($l = 1, \ldots, c_j$) that peers of Ego $i$ are located close to: $R^f_t = \sum_{l=1}^{c_j} r_{l,t}$ for the rain and $\Theta^f_t = \sum_{l=1}^{c_j} \theta_{l,t}$ for the temperature.
We then use the instrumental variable method using the weather indicators 60 days afterwards as an instrument for friends' activity to predict the exercise influence coefficient β.
identify the exercise influence coefficient using the same identification strategy. We randomly rewire each link in the underlying social network with probability 1, making sure that the total number of links remains unchanged. We then use the model in Supplementary Equation 1 to identify the exercise influence coefficient.
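The text does not specify the exact rewiring procedure, so the following is only one possible scheme, given as a sketch: every link keeps one endpoint and receives a new, randomly chosen partner, while self-loops and duplicate links are avoided so that the total number of links is preserved. The function name and the edge-list representation are assumptions.

```python
import random


def rewire_all_links(edges, nodes, rng):
    """Rewire every undirected link once (probability 1), keeping the link count.

    Each link (u, v) keeps endpoint u and gets a new partner w drawn uniformly
    from nodes other than u and v that are not already linked to u in the
    rewired network. Raises IndexError if no such partner exists.
    """
    rewired = set()
    for u, v in edges:
        candidates = [w for w in nodes
                      if w not in (u, v) and frozenset((u, w)) not in rewired]
        w = rng.choice(candidates)
        rewired.add(frozenset((u, w)))
    return [tuple(sorted(link)) for link in rewired]


# Toy example with a seeded generator for reproducibility.
nodes = list(range(20))
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
placebo_edges = rewire_all_links(edges, nodes, random.Random(0))
```

Re-estimating the model on such a placebo network should yield an influence coefficient near zero if the original estimate reflects real social ties rather than an artifact of the procedure.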

Supplementary
In Supplementary Tables 25, 26 We check the results for multiple realizations and find that in all cases the social influence coefficient is near zero and insignificant. These findings again suggest that our results are robust.

Sensitivity Analysis on the Weather Correlation Threshold
As discussed above in the sections on model specification, one procedure we followed was to exclude all links between individuals whose weather patterns correlate. For our main analysis, we set a weather correlation threshold of $\rho_c = +0.025$ over which we drop all links (i.e. we exclude links in which Ego and Friend experience weather with a correlation coefficient larger than 0.025).
A question that naturally arises in this context is: how robust are our estimates to variations in

Compliers and Non-Compliers
We finally analyze the running population to understand who "complies" with shocks from our instruments and who does not, in order to make our generalizations more precise. For each runner, we calculate the fraction of runs that happen on a rainy day, $f_i$ (mean=0.1899, S.D.=0.1294).
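As a small illustration, the per-runner quantity $f_i$ could be computed as below. The function name and the representation of runs and rainy days as day identifiers are assumptions made for the example.

```python
def rainy_run_fraction(run_days, rainy_days):
    """Fraction of a runner's runs that took place on a rainy day.

    run_days: one day identifier per run (repeats allowed for multiple runs).
    rainy_days: day identifiers on which the runner experienced rain.
    Returns None for a runner with no recorded runs.
    """
    if not run_days:
        return None
    rainy = set(rainy_days)
    return sum(1 for day in run_days if day in rainy) / len(run_days)


# Toy example: five runs, one of which falls on a rainy day.
f_i = rainy_run_fraction([1, 2, 3, 4, 5], [2, 9])
```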
We then define a linear model which uses all the available time-invariant characteristics of individuals (average daily activity, age, gender, height, weight, country and others) to predict compliance with the weather instrument (not running when it rains), while controlling for how much rain an individual experiences through $nr_i$, the total number of rainy days that individual $i$ experiences. In Supplementary Table 29 we show the results of this regression. The results show that the more active someone is, the more likely they are to run through the rain. Men, younger individuals and those of normal weight are also more likely to run in the rain, while height plays no role. Finally, we find that individuals in the United States, UK, Canada, Germany, Spain, Brazil, France and the Netherlands are more likely to run on a rainy day than people in Australia, Mexico and Japan.
These results help us more precisely characterize the types of people to whom our results most directly generalize.

Supplementary Figures
Supplementary Figure 1: A network visualization of a random 10% sample of the giant connected component of the network displayed using a force-directed graph drawing algorithm. Also shown are insets showing characteristic motifs of the network structure. The algorithm situates nodes of the graph in two-dimensional space so that all the edges are of more or less equal length and there are as few crossing edges as possible. This is achieved by assigning forces among the set of edges and the set of nodes and then using these forces either to simulate the motion of the edges and nodes or to minimize their energy (1).
Supplementary Figure 11: More precipitation is monotonically associated with less running (see Figure 4C in the main manuscript). On the other hand, the relationship between running and temperature is non-monotonic, suggesting that very high and low temperatures are associated with less exercise activity (see Figure 4C in the main manuscript).
Supplementary Figure 15: Illustration of the methodology used to extract the training consistency of runners. Periods of consistent activity are defined as those during which no period of inactivity longer than two weeks exists.
Supplementary Figure 16: Illustration of the methodology used to extract the number of active connected components a runner (Ego) has at each time. In this particular example an Ego i on day t has a neighborhood of 6 friends (2 of which are running) and 3 connected components (2 of which are active).
to peers that share the same or highly correlated weather. Also to design the weather variables that can potentially serve as instruments for the peers' running activity we consider only the distinct number of towers that have different weather. For example, in the above illustration, Ego has five friends in four different cities. First, we remove links between Ego and the friends that they have in the same city. Furthermore, in order to design the variables that can serve as instruments, we use the weather of the three distinct cities in which Ego has peers (city B, city C and city D). The above methodology ensures that the exclusion criterion is not violated.
Supplementary Table 9: Results of the second stage of the interaction model in Supplementary Equation 5. The same results are graphically displayed in Figure 2A of the main manuscript.
Supplementary Table 10: Results of the second stage of the interaction model in Supplementary Equation 6. The same results are graphically displayed in Figure 2B of the main manuscript.
Supplementary Table 11: Results of the second stage of the interaction model in Supplementary Equation 7. The same results are graphically displayed in Figure 2C of the main manuscript.
Figure 2D of the main manuscript.
Supplementary Table 13: Results of the second stage of the interaction model in Supplementary Equation 9. The same results are graphically displayed in the inset of Figure 2D in the main manuscript.
figure 3A of the main manuscript. The results for the "Duration" are graphically displayed in Supplementary Figure 19.
Supplementary Table 17: The effect of the number of running friends (as a single variable) and the number of running connected components on Ego's activity. Results for "Distance" are displayed in Figure 3B of the main manuscript, while the results for running "Duration" are displayed in Supplementary Figure 20.
Figure 3C in the main manuscript for the running distance and in the Supplementary Figure 21 for the running duration.