Main

The rapid global spread of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the novel virus causing COVID-19 (refs. 1,2,3), has created an unprecedented public health emergency. In the United States, efforts to slow the spread of disease have included, to varying extents, social distancing, home quarantine and treatment of infected patients, mandatory facial covering, closure of schools and non-essential businesses, and test–trace–isolate measures4,5. The COVID-19 pandemic and ensuing response have produced a concurrent economic crisis of a scale not seen for nearly a century6, exacerbating the effect of the pandemic on different socioeconomic groups and producing adverse health outcomes beyond COVID-19. As a result, there is currently intense pressure to safely wind down these measures. Yet, in spite of widespread lockdowns and social distancing throughout the United States, many states continue to exhibit steady increases in the number of cases (https://www.worldometers.info/coronavirus/). To understand where and why the disease continues to spread, there is a pressing need for real-time individual-level data on COVID-19 infections and tests, as well as on the behaviour, exposure and demographics of individuals at the population scale with granular location information. These data will allow medical professionals, public health officials and policy makers to understand the effects of the pandemic on society, tailor intervention measures, efficiently allocate testing resources and address disparities.

One approach to collecting these types of data on a population scale is to use web- and mobile-phone-based surveys that enable large-scale collection of self-reported data. Previous studies, such as FluNearYou, have demonstrated the potential for using online surveys for disease surveillance7. Since the start of the COVID-19 pandemic, several different applications have been launched throughout the world to collect COVID-19 symptoms, testing and contact-tracing information8. Studies in the United States and Canada (CovidNearYou, https://covidnearyou.org/us/en-US/; and ref. 9), the United Kingdom (Covid Symptom Study10,11, also in the United States) and Israel (PredictCorona12) have reported large cohorts of users drawn from the general population with the goal of capturing information about COVID-19 along a variety of dimensions, from symptoms to behaviour, and have demonstrated some ability to detect and predict the spread of disease10,11,12. This field has rapidly evolved since the beginning of the pandemic, with many analyses of these datasets focusing on COVID-19 diagnostics (that is, symptoms, test results, medical background)9, care seeking13, contact tracing14, patient care15, effects on healthcare workers16, hospital attendance13, cancer17, primary care18, clinical symptoms19 and triage20. Here, we perform a comprehensive analysis of a source of COVID-19-related information spanning diagnostic and behavioural factors sampled from the general population during the beginning of the pandemic in the United States. We investigate exposure, demographic and behavioural factors that affect the chain of transmission; identify the factors determining who has been tested; and assess the prevalence of asymptomatic, presymptomatic and mildly symptomatic cases21.

To fill the gap and achieve these goals, we developed How We Feel (HWF; http://www.howwefeel.org) (Fig. 1a–d), a web and mobile-phone application for collecting de-identified self-reported COVID-19-related data. Rather than targeting patients with suspected COVID-19 or existing study cohorts, HWF aims to collect data from users representing the population at large. Because HWF draws from a large user base across the United States, whose members learn about the study through word of mouth and government partnerships, its results are complementary to other studies such as the Covid Symptom Study and CovidNearYou that also include sizable US populations and are targeted towards the general public. Users are asked to share information on demographics (gender, age, race/ethnicity, household structure, ZIP code), COVID-19 exposure and pre-existing medical conditions. They then self-report daily how they feel (well or not well), any symptoms they may be experiencing, test results, behaviour (for example, use of face coverings) and sentiment (for example, feeling safe to go to work) (Fig. 1c and Extended Data Fig. 1). To protect privacy, users are not identifiable beyond a randomly generated number that links repeated logins on the same device. A key feature of the app is the ability to rapidly release revised versions of the survey as the pandemic evolves. In the first month of operation, we released three iterations of the survey with increasingly expanded sets of questions (Fig. 1b).

Fig. 1: The HWF application and user base.
figure 1

a, The HWF app: longitudinal tracking of self-reported COVID-19-related data. b, Responses over time, as well as percentage of users reporting feeling unwell, with releases of major updates to the survey indicated. c, Information collected by the HWF app. d, Users by state across the United States. e, Age distribution of users. Note: users had to be 18 years of age or older to use the app. f, Distribution of self-reported gender. g, Distribution of self-reported race or ethnicity. Users were allowed to report multiple races. ‘Multiracial’ means the user indicated more than one category. ‘Other’ includes American Indian/Alaskan Native and Hawaiian/Pacific Islander, as well as users who selected ‘Other’.

We find that symptomatic subjects, healthcare workers and essential workers are more likely to be tested. Because of asymptomatic and mildly symptomatic infections and heterogeneous symptom presentation, our results show that commonly used symptoms may not be sufficient criteria for evaluating COVID-19 infection. Further, we find that exposure both outside and within the household is a major risk factor for users testing positive, and we build a predictive model to identify likely COVID-positive users. African-American users, Hispanic/Latinx users, and healthcare workers and essential workers are at a higher risk of infection, after accounting for the effects of pre-existing medical conditions. Finally, we find that even at the height of lockdowns throughout the United States, the majority of users were leaving their homes, and a large fraction were not engaging in social distancing or face protection.

Results

The app was launched on 2 April 2020 in the United States. As of 12 May 2020, the app had 502,731 users in the United States, with 3,661,716 total responses (Fig. 1b and Supplementary Table 1). In total, 74% of users responded on multiple days, with an average of seven responses per user (Extended Data Fig. 2). Each day, ~5% of users who accessed the app reported feeling unwell (Fig. 1b). The user base was distributed across all 50 states and several US territories, with the largest numbers of users in more populous states such as California, Texas, Florida and New York (Fig. 1d). Connecticut had the largest number of users per capita, as the result of a partnership with the Connecticut state government (Fig. 1d). Users were required to be 18 years of age or older and were 42 years old on average (mean, 42.0; s.d., 16.3), including 18.4% in the bracket of 60+, which has experienced the highest mortality rate from COVID-19 (Fig. 1e)22,23. Users were primarily female (82.7%) (Fig. 1f) and white (75.5%, excluding 20.3% with missing data) (Fig. 1g). Although the survey ran from 2 April until 12 May, users could report test results from before 2 April.

A major ongoing problem in the United States is the overall lack of testing across the country24 and disparities in test accessibility, infection rates and mortality rates in different regions and communities25. In the absence of population-scale testing, it will be critical during a reopening to allocate limited testing resources to the groups or individuals most likely to be infected to track the spread of disease and break the chain of infection. We therefore first examined who in our user base was currently receiving testing. We analysed 4,759 users who took the Version 3 (V3) survey and who were PCR tested for SARS-CoV-2 (of 272,392 total users) (Fig. 2a and Extended Data Fig. 3a). Of these, 8.8% were PCR positive. The number of tests reported by test date displays a similar trend to the estimated number of tests across the United States, suggesting that our sampling captures the increase in test availability (Fig. 2a). The number of PCR tests per HWF user is highly correlated with external estimates of per-capita tests by state (Fig. 2b and Extended Data Fig. 3b; Pearson correlation 0.77)26.

Fig. 2: SARS-CoV-2 PCR testing and symptoms.
figure 2

a, Stacked bar plot of user-reported test results over time, overlaid with official number of tests across the United States based on COVID Tracking Project data (n = 4,759 users who took the V3 survey and reported a test result, of 277,151 users). b, Left: map of per-capita test rates across the United States. Right: map of COVID-19 tests per number of users by state. c, Associations of professions and symptoms with receiving a SARS-CoV-2 PCR test, adjusted for demographics and other covariates (see Methods). Common symptoms listed by the CDC are starred (n = 4,759 users with a reported test within 14 d of a survey response of 277,151 users). d–f, UMAP visualization of 667,651 multivariate symptom responses among HWF users that reported at least one symptom. Colouring indicates responses according to users feeling well (d), the reported number of COVID-19 symptoms listed by the CDC (e), and the COVID-19 test result among tested users (f). g, Proportions of patients positive for COVID-19 (red) and negative for COVID-19 (blue) experiencing CDC common symptoms (dark), only non-CDC symptoms (light) or no symptoms (grey) on the day of their test. n = 1,170 positive users and 8,892 negative users who reported a test result between 2 April and 12 May 2020. h, Histogram of reported symptoms among COVID-19-tested users. i, Longitudinal self-reported symptoms from users that tested positive for COVID-19. Dates are centred on the self-reported test date. j, Ratio of symptoms comparing users that tested positive versus those that tested negative for COVID-19.

We first examined via logistic regression which factors either collected in the survey or inferred from US Census data by user ZIP code were associated with receiving a SARS-CoV-2 PCR test, regardless of test result. As expected, we observed a higher fraction of tested users from states with higher per-capita test numbers, according to the COVID Tracking Project26 (Extended Data Fig. 3b). Healthcare workers (odds ratio (OR), 2.94; 95% confidence interval (95% CI), 2.75, 3.15; P < 0.001) and other essential workers (OR, 1.39; 95% CI, 1.28, 1.52; P < 0.001) were more likely to have received a PCR test compared with users who did not report those professions (Fig. 2c). Users who reported experiencing fever, cough or loss of taste/smell (among other symptoms) had higher odds of being tested compared with users who never reported these symptoms (Fig. 2c). The majority of these symptoms are listed as common for COVID-19 cases by the Centers for Disease Control and Prevention (CDC) (Fig. 2c, starred)27. A less-common symptom, reporting a tight feeling in one’s chest, was also associated with receiving a PCR-based test (OR, 2.27; 95% CI, 1.93, 2.66; P < 0.001). These results suggest that the most commonly reported symptoms are being used as screening criteria for determining who receives a test, potentially missing asymptomatic and mildly symptomatic individuals. This group could include those who are at high risk for infection but do not meet the testing eligibility criteria.

To obtain a global view of self-reported symptom patterns, we applied an unsupervised manifold learning algorithm to visualize how symptoms were correlated across users (see Methods). As expected, we found that symptom presentation separated broadly by feeling well versus feeling unwell (Fig. 2d and Extended Data Fig. 4). Users who felt unwell were concentrated in a single cluster indicating similar overall symptom profiles, which was characterized by high proportions of common COVID-19 symptoms as defined by the CDC27 (Fig. 2e), and contained the vast majority of responses from users with both positive (+) and negative (−) SARS-CoV-2 PCR tests (Fig. 2f). Thus, COVID-19 symptoms tend to overlap with symptoms for other diseases and do not necessarily predict positive test results.

This overlap suggests that commonly used symptoms may not be sufficient criteria for evaluating COVID-19 infection. It has previously been reported that many people infected with SARS-CoV-2 are asymptomatic, mildly symptomatic or in the presymptomatic phase of their presentation28,29,30 and therefore unaware that they are infected. In our dataset, on the day of their test, most users (73%) that tested PCR positive for SARS-CoV-2 reported feeling unwell with the common symptoms listed by the CDC (dry cough, shortness of breath, chills/shaking, fever, muscle/joint pain, sore throat, loss of taste/smell). However, 11.5% of positive users reported feeling unwell and exclusively reported symptoms not listed as common for COVID-19 by the CDC on the day of their test, and 15.4% reported feeling no symptoms at all (Fig. 2g). Because of the commonly used symptom- and occupation-based screening criteria for receiving a PCR test and under-testing, this total of 26.9% probably underestimates the true fraction of asymptomatic, presymptomatic and mildly symptomatic cases, which in Wuhan, China, was estimated to be ~87% (ref. 21), and in the United States was estimated to be >80%. A large number of asymptomatic cases were also observed in serological studies31,32. In total, 48.9% of users testing negative for SARS-CoV-2 reported feeling unwell with the most common COVID-19 symptoms, compared with an expected false-negative rate of 20–30% for PCR-based tests of symptomatic patients33, again suggesting symptom presentation overlap with other diseases (Fig. 2g).

We investigated the symptoms that were most predictive of COVID-19 by exploring the distribution and dynamics of symptoms in PCR test (+) and (−) users around the test date. PCR test (+) users reported a higher rate of common COVID-19 symptoms, including dry cough, fever, loss of appetite, and loss of taste and/or smell, than PCR test (−) users (Fig. 2h). Many PCR-tested users longitudinally reported symptoms in the app in an interval extending ±2 weeks from their test date (Extended Data Fig. 5). We used these data to examine the time course of symptoms among those who tested positive. In the days preceding a test, dry cough, muscle pain and nasal congestion were among the most commonly reported symptoms. Reported symptoms peaked in the week following a test and declined thereafter (Fig. 2i). Taking the ratio of the symptom rates at each point in time between PCR test (+) and (−) users showed that the most distinguishing feature in users who tested positive was loss of taste and/or smell, as has been previously reported11 (Fig. 2j).

We next investigated medical and demographic factors associated with testing PCR positive for acute SARS-CoV-2 infection, focusing on 3,829 users who took the V3 survey within ±2 weeks of their reported PCR test date (315 positive, 3,514 negative) (Fig. 3a and Supplementary Tables 2–6). These users are a subset of all of the users who reported taking a test in the V3 survey, as some of the reported test results were outside of this time window. To correct for selection bias of receiving a PCR test when studying the risk factors of a positive test result, we incorporated the probability of receiving PCR tests as inverse probability weights (IPWs) into our logistic model of PCR test result status (+/−) (see Methods)34. As with the analysis of who received a test, the reported symptom of loss of taste and/or smell was most strongly associated with a positive test result (OR, 33.17; 95% CI, 17.3, 67.94; P < 0.001). Other symptoms associated with testing positive included fever (OR, 6.27; 95% CI, 2.82, 13.70; P < 0.001) and cough (OR, 4.45; 95% CI, 2.83, 6.99; P < 0.001). Women were less likely to test positive than men (OR, 0.55; 95% CI, 0.38, 0.80; P = 0.002), and both Hispanic/Latinx users (OR, 2.59; 95% CI, 1.67, 3.97; P < 0.001) and African-American/Black users (OR, 2.35; 95% CI, 1.29, 4.18; P = 0.004) were more likely to test positive than white users, highlighting potential racial disparities involved with COVID-19 infection risk. The odds of testing positive were also higher for those in high-density neighbourhoods (OR, 1.85; 95% CI, 1.15, 3.07; P = 0.014). Healthcare workers (OR, 1.92; 95% CI, 1.36, 2.73; P < 0.001) and other essential workers (OR, 1.69; 95% CI, 1.13, 2.52; P = 0.01) also had higher odds of testing positive compared with non-essential workers. Pregnant women were substantially more likely to test positive (OR, 6.30; 95% CI, 2.45, 14.68; P < 0.001). However, we note that this result is based on a small sample of 48 pregnant women included in this analysis (9 test positive, 39 test negative) and is unstable, subject to potentially high selection bias. Performing this analysis with and without correction for selection bias produced similar results (Fig. 3a). As a further sensitivity analysis, we reran the analyses excluding users from the states of California and Connecticut, the states containing the most users (Extended Data Fig. 7a), and correcting for broader demographic differences using US Census data (Extended Data Fig. 7b), obtaining similar results to the uncorrected model in both cases. Finally, we performed Firth-corrected logistic regression to check for bias in our testing model related to the large fraction of users testing negative, and obtained similar results to our uncorrected model (Extended Data Fig. 8).

Fig. 3: SARS-CoV-2 PCR test result associations and predictions.
figure 3

a, Factors associated with respondents receiving and reporting a positive test result, as determined through logistic regression. Left: results from unweighted model. Right: results from model incorporating selection probabilities via IPWs. Reference categories are indicated where relevant; when not indicated, the reference is not having that specific feature. log ORs and their confidence intervals are plotted, with red indicating positive association and blue indicating negative association. Darker colours indicate confidence intervals that do not cover 0. Population density and neighbourhood household income were approximated using county-level data. L, lower bound; U, upper bound of 95% CIs; n = 3,829 users (315 positive, 3,514 negative) who took the V3 survey within ±2 weeks of receiving a test. b, Prediction of positive test results using ±2 weeks of data from the test date, using fivefold cross-validation, shown as ROC curves. The XGBoost model was trained on different subsets of questions: CDC symptom questions, using just the subset of COVID-19 symptoms listed by the CDC; all survey questions, using the entire survey; four-question survey, using a reduced set of four questions that were found to be highly predictive. Numerical values are AUC; n = 3,829 users.

Motivated by previous studies that reported that high cluster transmissions occurred in families in China, Korea and Japan35,36,37, we explored household and community exposures as risk factors for users testing PCR positive. The odds of testing positive were much higher for those who reported within-household exposure to someone with confirmed COVID-19 than for those who reported no exposure at all (see Methods) (OR, 19.10; 95% CI, 12.30, 30.51; P < 0.001) (Fig. 3a and Supplementary Table 5). This is stronger than comparing the odds of testing positive among those who reported exposure outside their household versus no exposure at all (OR, 3.61; 95% CI, 2.54, 5.18; P < 0.001). Further, the odds of testing PCR positive are much higher for those exposed in the household versus those exposed outside their household or not exposed at all, after adjusting for similar factors (OR, 10.3; 95% CI, 6.7, 15.8; P < 0.001) (Supplementary Table 10). These results are consistent with previous findings that indicate a very high relative risk associated with within-household infection36,38,39,40,41. This is compatible with the findings that other closed areas with high levels of congregation and close proximity, such as churches42, food-processing plants43 and nursing homes44, have shown similarly high risks of transmission.

Developing models to predict who is likely to be SARS-CoV-2(+) from self-reported data has been proposed as a means to help overcome testing limitations and identify disease hotspots11,12. We used data from the 3,829 users who used the app within ±2 weeks of their reported PCR test results to develop a set of prediction models that were able to distinguish positive and negative results with a high degree of predictive accuracy on cross-validated data (Fig. 3b). We used the machine learning method XGBoost, which outperformed other classification methods (Extended Data Fig. 6). For each user, we predicted their test results either using data before the test (pre-test), which would be most useful in predicting COVID-19 cases in the absence of molecular testing, or using data before and after the test (all data) as a benchmark for the best possible prediction we could make using all available data. We considered: (1) a symptoms-only model, which included only the most common COVID-19 symptoms listed by the CDC; (2) an expanded model, which further incorporated other features observed in the survey; and (3) a minimal-features model, which retained only the four most predictive features (loss of taste and/or smell, exposure to someone with COVID-19, exposure in the household to someone with confirmed COVID-19 and exposure to household members with COVID-19 symptoms) (see Methods and Supplementary Tables 11–14). The symptoms-only model achieved a cross-validated area under the receiver operating characteristic (ROC) curve (AUC) of 0.76 using data before and after a test, and AUC 0.69 using just the pre-test data. Expanding the set of features to include other survey questions substantially improved performance (cross-validated AUC 0.92 all data, 0.79 pre-test). In the minimal-features model, we were able to retain high accuracy (cross-validated AUC 0.87 all data, AUC 0.80 pre-test) despite only including four questions, one referring to a symptom and three referring to potential contact with known infected individuals. Restricting the observed inputs to the 1,613 users (89 positive, 1,524 negative) who answered the survey in the 14 d before being tested limited the sample size and reduced the overall accuracy, but the relative performance of the models was similar (Fig. 3b).

The fact that a fraction of SARS-CoV-2(+) users report no symptoms or only less-common symptoms (Fig. 2g) raises the possibility that many infected users might behave in ways that could spread disease, such as leaving home while unaware that they are infectious. In spite of widespread shelter-in-place orders during the sample period, we found extensive heterogeneity across the United States in the fraction of users reporting leaving home each day, with 61% of the responses from 24 April to 12 May indicating the user had left home that day (Fig. 4a). The majority (77%) of these users reported leaving for non-work reasons, including exercising; 19% left for work (Fig. 4b). Of people who left home, a majority of users, but not all, reported social distancing and using face protection (Fig. 4c). Different states had persistently different levels of people wearing masks and leaving home (Extended Data Fig. 9). This incomplete shutdown, with partial adherence to social distancing and physical protective measures and insufficient isolation of infected cases, may contribute to continued disease spread.

Fig. 4: Behavioural factors potentially contributing to COVID-19 spread.
figure 4

a, Proportion of responses indicating users leaving home across the United States (map) or overall (inset pie chart) (n = 1,934,719 responses from 279,481 users). b, Percentage of responses for users reporting work or other reason for leaving home (n = 1,176,360 responses from 244,175 users). c, Reported protective measures taken by users upon leaving home per response (n = 1,176,360 responses from 244,175 users). d, Time course of proportion of SARS-CoV-2 PCR-tested (+) or (−) users staying home, leaving for work and leaving for other reasons (n = 4,396 total users who reported being tested positive or negative in the V3 survey and responded on at least 1 d within ±1 week of being tested). e,f, Proportion of users SARS-CoV-2 PCR-tested (+) or (−), or untested, going to work (e) (n = 14 of 203 positive, 664 of 2,533 negative, 62,483 of 269,833 untested), and going to work without a mask (f) (n = 7 of 203 positive, 255 of 2,533 negative, 34,481 of 269,833 untested), who responded within the 2–7 d post-test for tested (T) or the 3 weeks since last check-in for untested (U). Healthcare workers and other essential workers are compared with non-essential workers as the baseline. g, Average reported number of contacts per 3 d in the 2–7 d after their test date. T(+), n = 138 users; T(−), n = 2,269 users; U, n = 254,751 users. LB, lower bound; UB, upper bound. h, Logistic regression analysis of factors contributing to users going to work in the 2–7 d after their COVID-19 test (n = 678 users going to work of 2,736 users with definitive test outcome and survey responses in the 2–7 d after their test date).

Given the large number of people leaving home each day, it is important to understand the behaviour of people who are potentially infectious and therefore likely to spread SARS-CoV-2. To this end, we further analysed the behaviour of people reporting to be PCR test (+) or (−). There was an abrupt, large increase in users reporting staying home after receiving a positive test result (Fig. 4d,e). Many, but not all, PCR test (+) users reported staying home in the 2–7 d after their test date (7% still went to work, n = 14 of 203 users), whereas 23% (n = 62,483 of 269,833 users) of untested and 26% (n = 664 of 2,533 users) of PCR test (−) users left for work (Fig. 4d,e). Similarly, 3% of PCR test (+) users (n = 7 of 203 users) reported going to work without a mask, in contrast with untested (12.7%, n = 34,481 of 269,833 users) and PCR test (−) (10%, n = 255 of 2,533 users) users (Fig. 4f). Positive individuals reported coming into close contact with a median of 1 person over 3 days, in contrast to individuals who tested negative or were untested, who reported a median of 4 close contacts over 3 days (Fig. 4g). Regression analysis suggested that healthcare workers (OR, 9.3; 95% CI, 7.3, 11.8; P < 0.001) and other essential workers (OR, 6.8; 95% CI, 5.2, 8.9; P < 0.001) were much more likely to go to work after taking a positive or negative test, and PCR-positive users were more likely to stay home (OR, 0.1; 95% CI, 0.1, 0.2; P < 0.001) (Fig. 4h and Supplementary Table 15).

Discussion

Using individual-level data collected from the HWF app, we showed that incorporating information beyond symptoms—in particular, household and community exposure—is vital for identifying infected individuals from self-reported data. This finding is particularly important for risk assessment at the early stage of transmission (for example, during the latent and presymptomatic periods when subjects have not yet developed symptoms), so that high-risk subjects can be prioritized for testing and quarantine and their close contacts can be traced, to block the transmission chain early on. Our results show that vulnerable groups include subjects with household and community exposure, healthcare workers and essential workers, and African-American and Hispanic/Latinx users. These groups are at higher risk of infection and should be prioritized for testing and protection. Our findings also show a statistically significant racial disparity after adjusting for the effects of pre-existing medical conditions, which needs to be addressed.

We find evidence among our users for several factors that could contribute to continued COVID-19 spread despite widespread implementation of public health measures. These include a substantial fraction of users leaving their homes on a daily basis across the United States; users who claim to not socially isolate or return to work after receiving a PCR test (+) result; self-reports of asymptomatic, mildly symptomatic or presymptomatic presentation; and a much higher risk of infection for people with within-household exposure.

That said, we note several limitations of this study. The HWF user base is inherently a non-random sample of voluntary users of a smartphone app, and hence our results may not fully generalize to the broader US population. In particular, the study may be subject to selection bias, both by not capturing populations without internet access, such as low-income or minority populations, who may be at elevated risk, and by over-representing women. Our results are based on self-reported survey data, and hence may suffer from misclassification bias—particularly those based on self-reported behaviours. Moreover, a relatively small percentage of subjects received PCR testing. As shown in Fig. 2, the subjects who were tested were more likely to be symptomatic, healthcare workers and essential workers, and people of colour. Naïve regression analysis of test results using responses of subjects who were tested could be subject to selection bias. To mitigate this, we have attempted to correct for these selection biases via the inverse probability weighting approach, estimating the selection probability (the probability of receiving a test) using the observed covariates (see Methods). Some residual bias may persist if there remain some unobserved factors related to underlying disease status and receiving a test, or if the selection model is misspecified. In addition, although our regression analysis conditioned on a wide range of covariates to account for possible selection bias, if any unobserved factors associated with underlying disease status are also related to using the app—for example, health literacy and access to the internet, particularly in vulnerable groups such as low-income families—the results may be subject to additional selection bias.

Although there is enormous economic pressure on states, businesses and individuals to be able to return to work as quickly as possible, our findings highlight the ongoing importance of social distancing, mask wearing, large-scale testing of symptomatic, asymptomatic and mildly symptomatic people, exposure assessment and, potentially, even more rigorous ‘test–trace–isolate’ approaches45,46,47,48, as implemented in several states, such as Massachusetts, New York, New Jersey and Connecticut, which have bent the infection curve. Applying predictive models on a population scale will be vitally important to provide an ‘early warning’ system for timely detection of a second wave of infections in the United States and for guiding an effective public policy response.

As testing resources are expected to continue to be limited, HWF results could be used to identify which groups should be prioritized, or potentially to triage individuals for molecular testing based on predicted risk. HWF’s integration of behavioural, symptom, exposure and demographic data provides a powerful platform to address emerging problems in controlling infection chains, to rapidly assist public health officials and governments with developing evidence-based guidelines in real-time and to stop the spread of COVID-19.

Methods

Ethics statement

The HWF application was approved as exempt by the Ethical & Independent Review Services LLP IRB (Study ID 20049–01). The analysis of HWF data was also approved as exempt by Harvard University Longwood Medical Area Institutional Review Board (IRB) (Protocol no. IRB20-0514) and the Broad Institute of MIT and Harvard IRB (Protocol no. EX-1653). Informed consent was obtained from all users and the data were collected in de-identified form.

Open-source software

We used the following open-source software in the analysis: the Python packages statsmodels (v.0.11.1), umap-learn, scikit-learn, XGBoost and CensusData; the logistf R package; and React Native, Google App Engine and Google BigQuery for the application.

Application

The HWF application was developed in React Native (https://reactnative.dev/), using Google App Engine (https://cloud.google.com/appengine) and Google BigQuery (https://cloud.google.com/bigquery) for the backend, and launched on the Android and iOS platforms. Users were identified only with a device-specific randomly generated number. Users below the age of 18 were not allowed to use the application.

Inclusion criteria

If a user logged in multiple times in a day, only the first response was retained. We excluded any users who responded to a survey version on one day and then on a later day responded to an older survey version. We excluded any users who reported different genders on different days, and we excluded any observations with a missing feeling (well or not well), gender or smoking history.

Before survey V3, users reported only whether or not they had received a COVID-19 test, and we assumed that they received a PCR test. In survey V3, users reported the type of test they received, and we excluded antibody tests from analyses.

Logistic regression: receiving a test

The HWF app allows users to report previous COVID-19 test information, including test date, test type (swab versus antibody), test result (positive, negative or unknown), location of test and reason for receiving the test (Fig. 2). A user may report that the test result is not yet known, and then update this information in future check-ins. A test was considered to be ‘unique’ if it was reported by the same user with the same test date (including ‘NA’ (not available), n = 11) and type. For this analysis, ‘swab’ tests were assumed to be PCR-based tests for SARS-CoV-2. Tests with a reported test date before 1 January 2020 were excluded. Before V3, users were not asked about their test type. Tests from the same user with the same test date may have been missing a reported test type in earlier check-ins, but the user may have filled in this information at later check-ins; in this case, we consider this to be the same test and assign the reported test type. For each unique test, all test information (including result) from the user’s most recent check-in was used.

We compared testing data from HWF with the COVID Tracking Project (https://covidtracking.com/) for all 50 states and the District of Columbia. For comparison with HWF data used in this analysis, we extracted COVID Tracking Project data until 11 May 2020. Tests with a ‘not yet known’ test result were excluded from this analysis. In Extended Data Fig. 3, the left panel compares the number of unique swab tests divided by the number of unique users in HWF with the total tests per state (totalTestResults) reported by the COVID Tracking Project divided by the state population as estimated by the 2010 Census (https://pypi.org/project/CensusData/). The right panel compares the proportion of unique swab tests in HWF with a positive result with the proportion of tests in the COVID Tracking Project with a positive result.
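As an illustration of this comparison, the following is a minimal sketch of computing per-user and per-capita test rates and their Pearson correlation with pandas and SciPy; the file and column names are hypothetical stand-ins for the per-state summaries described above.

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical per-state summaries (one row per state).
# hwf: columns ['state', 'n_swab_tests', 'n_users'] derived from HWF reports.
# ctp: columns ['state', 'totalTestResults', 'population'] from the
#      COVID Tracking Project and 2010 Census estimates.
hwf = pd.read_csv("hwf_tests_by_state.csv")
ctp = pd.read_csv("covid_tracking_by_state.csv")

merged = hwf.merge(ctp, on="state")
merged["hwf_tests_per_user"] = merged["n_swab_tests"] / merged["n_users"]
merged["ctp_tests_per_capita"] = merged["totalTestResults"] / merged["population"]

r, p = pearsonr(merged["hwf_tests_per_user"], merged["ctp_tests_per_capita"])
print(f"Pearson correlation across states: r = {r:.2f} (P = {p:.3g})")
```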

For the analysis of who received a test, the outcome was 1 if a user reported a swab test, or 0 otherwise. We fit a logistic regression model using demographics, professions, exposure and symptoms, among other covariates. Time-varying measures (for example, symptoms) were averaged over each user's V3 survey responses. Analysis was conducted with the statsmodels package (v.0.11.1) in Python54,55. We reported the log ORs and ORs, along with corresponding 95% CIs. Supplementary Table 3 lists the covariates used in the selection (who received a test) regression model, as well as the estimated coefficients, 95% CIs and P values.
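A minimal sketch of this selection model using the statsmodels Logit API, assuming a hypothetical user-level data frame and an illustrative subset of covariate names (the full covariate list is in Supplementary Table 3):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical user-level frame: binary outcome `received_test` plus covariates
# averaged over each user's V3 responses (illustrative subset of covariates).
df = pd.read_csv("hwf_v3_user_level.csv")
covariates = ["healthcare_worker", "essential_worker", "fever",
              "dry_cough", "loss_taste_smell", "age", "female"]

X = sm.add_constant(df[covariates].astype(float))
fit = sm.Logit(df["received_test"], X).fit(disp=0)

# Log ORs with 95% CIs, and exponentiated ORs.
ci = fit.conf_int()
summary = pd.DataFrame({"log_OR": fit.params,
                        "ci_lower": ci[0], "ci_upper": ci[1],
                        "p_value": fit.pvalues})
summary["OR"] = np.exp(summary["log_OR"])
print(summary)
```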

Uniform manifold approximation and projection for dimension reduction (UMAP)

Of the 3,661,716 survey responses collected by HWF up until 12 May 2020, 667,651 reported having at least one symptom (excluding ‘feeling_not_well’) from the set of 25 symptom questions asked across all surveys. Only these responses were used for UMAP analysis (Fig. 2d–f). Each of the 25 queried symptoms was treated as a binary variable. The input data were therefore a matrix of 667,651 survey responses with 25 binary symptom variables. UMAP was applied to this matrix following McInnes and Healy60 using the Python package umap-learn with the parameters n_neighbors=1000, min_dist=0.5 and metric='hamming'. The resulting two-dimensional embedding was plotted with different colourmaps for each response in Fig. 2. The distributions of all 25 symptoms are shown individually in Extended Data Fig. 4.
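A minimal sketch of this embedding step with umap-learn, using the parameters stated above and assuming the binary symptom matrix has already been assembled (the file name is hypothetical):

```python
import numpy as np
import umap  # umap-learn

# Hypothetical file holding the binary symptom matrix: one row per response
# that reported at least one symptom, one 0/1 column per queried symptom.
symptom_matrix = np.load("symptom_matrix.npy")  # shape (n_responses, 25)

reducer = umap.UMAP(n_neighbors=1000, min_dist=0.5, metric="hamming",
                    random_state=0)
embedding = reducer.fit_transform(symptom_matrix)  # shape (n_responses, 2)

# `embedding` can then be scatter-plotted and coloured by feeling well/unwell,
# number of CDC-listed symptoms or test result, as in Fig. 2d-f.
```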

Asymptomatic analysis

Symptoms were categorized as CDC symptoms or non-CDC symptoms, and users reporting no symptoms were considered asymptomatic (Fig. 2g). CDC symptoms comprised dry cough, shortness of breath, chills/shaking, fever, muscle/joint pain, sore throat and loss of taste/smell, reported while feeling either well or unwell. Non-CDC symptoms comprised any symptoms not listed by the CDC, including abdominal pain, confusion, diarrhoea, facial numbness, headache, irregular heartbeat, loss of appetite, nasal congestion, nausea/vomiting, tinnitus, wet cough and runny nose, among others.

We restricted analysis to the subset of patients for which we observed symptom data on their test date. For each user that tested positive or negative, we categorized participants into three groups: {CDC symptoms, Non-CDC symptoms, Asymptomatic}. Participants were grouped into CDC symptoms if they reported any CDC symptoms and participants that reported only non-CDC symptoms were grouped in the Non-CDC symptoms category. Participants were considered asymptomatic if they reported none of the above symptoms. Proportions were reported and graphically represented for each group in Fig. 2g.
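A sketch of this grouping, with illustrative column names for the symptom indicators (the exact symptom sets are given above and the input file is hypothetical):

```python
import pandas as pd

# Illustrative column names for the symptom indicators described above.
CDC_SYMPTOMS = ["dry_cough", "shortness_of_breath", "chills_shaking", "fever",
                "muscle_joint_pain", "sore_throat", "loss_taste_smell"]
NON_CDC_SYMPTOMS = ["abdominal_pain", "confusion", "diarrhoea", "headache",
                    "loss_of_appetite", "nasal_congestion", "nausea_vomiting",
                    "wet_cough", "runny_nose"]

def symptom_group(row: pd.Series) -> str:
    """Assign a test-day response to one of the three mutually exclusive groups."""
    if row[CDC_SYMPTOMS].any():
        return "CDC symptoms"
    if row[NON_CDC_SYMPTOMS].any():
        return "Non-CDC symptoms"
    return "Asymptomatic"

# Hypothetical frame of test-day responses: binary symptom columns + test result.
responses = pd.read_csv("test_day_responses.csv")
responses["group"] = responses.apply(symptom_group, axis=1)

# Proportions within positive and negative users, as plotted in Fig. 2g.
print(responses.groupby("test_result")["group"].value_counts(normalize=True))
```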

COVID-19 symptoms and dynamics

In the HWF survey data up to 12 May 2020, a total of 8,429 unique users reported the result of a quantitative PCR (qPCR) COVID-19 test (1,067 positive, 7,362 negative) (Fig. 2h–j). For surveys V1–2, we assumed that all tests were qPCR tests since antibody tests were rare before 24 April. In the V3 survey (24 April to 12 May) the test type was explicitly asked. Among qPCR-tested users, each response was assigned a date in days relative to the self-reported test date. The aggregate fraction of responses reporting each symptom was visualized in a histogram in Fig. 2h. The aggregate fraction of responses reporting each symptom at each timepoint among users that tested positive was visualized in a heatmap in Fig. 2i. Figure 2j shows the element-wise log ratio of the positive-test and negative-test heatmaps; that is, each element equals log(fraction of positive-test responses reporting the symptom at time t / fraction of negative-test responses reporting the symptom at time t). The heatmaps were smoothed by taking the average for each symptom within a sliding window of ±1 d for visualization.
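The log-ratio and smoothing steps could be computed as in the following sketch, assuming a hypothetical long-format response table with binary symptom columns and a days-from-test offset:

```python
import numpy as np
import pandas as pd

# Hypothetical long-format frame: one row per (user, day) response with binary
# symptom columns, a 'days_from_test' offset and a 'result' label.
df = pd.read_csv("qpcr_user_responses.csv")
symptoms = [c for c in df.columns
            if c not in ("user_id", "days_from_test", "result")]

def symptom_fractions(sub: pd.DataFrame) -> pd.DataFrame:
    # Fraction of responses reporting each symptom at each day relative to test.
    return sub.groupby("days_from_test")[symptoms].mean()

pos = symptom_fractions(df[df["result"] == "positive"])
neg = symptom_fractions(df[df["result"] == "negative"]).reindex(pos.index)

# Element-wise log ratio of positive vs negative symptom fractions (Fig. 2j),
# smoothed along the time axis with a +/-1 d sliding window.
eps = 1e-6  # guard against division by zero for rarely reported symptoms
log_ratio = np.log((pos + eps) / (neg + eps))
log_ratio_smoothed = log_ratio.rolling(window=3, center=True, min_periods=1).mean()
print(log_ratio_smoothed.head())
```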

Logistic regression: test results

A large number of risk factor survey questions were added in V3 of the survey, so we restricted analysis to V3 survey data for the purposes of identifying risk factors associated with SARS-CoV-2(+) test results (Fig. 3a). User responses were selected using a symmetric 28-d window around the last reported COVID-19 swab test date for any given user. Users that had no reported test outcome, or reported both positive and negative outcomes in different responses, were removed. Users who identified as ‘other’ in the gender response were dropped due to small sample size. Median neighbourhood household income was estimated by mapping user ZIP codes to corresponding ZCTAs (ZIP code tabulation areas) from the census, and then using the American Community Survey 5-year average results from 2018 to infer a neighbourhood household income (B19013_001E). Population density was calculated at the county level for each user based on data from the Yu Group at University of California at Berkeley61.

Race was a categorical variable, with distinct groups: ‘white’; ‘African-American’; ‘Hispanic/Latinx’; ‘Asian’; ‘multiracial’ for those who marked two or more race categories; ‘other’ for those who marked ‘other’, ‘Native American’ or ‘Hawaiian/Pacific Islander’; and ‘unknown’ for those who did not disclose their race. A given food source was marked as ‘True’ if the user had indicated the use of that food source over any response within the given time window.

Because the HWF app asks for a separate set of symptoms depending on whether or not the user reports feeling ‘well’, there is not a one-to-one correspondence between symptoms reported by those feeling ‘well’ and ‘not well’. We excluded symptoms that were only present among those feeling ‘well’ or only among those feeling ‘not well’. For symptoms reported by both those who were ‘well’ and ‘not well’, we combined them into single symptoms. Supplementary Table 2 shows the variables merged using the ‘any’ function. Each symptom’s responses were then averaged over all available responses over the 28-d window. Similarly, distribution of sleep was averaged across the time window.
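A sketch of the merging and averaging steps, with hypothetical column names standing in for the variable pairs listed in Supplementary Table 2:

```python
import pandas as pd

# Hypothetical response-level frame in which the same symptom appears under two
# column names, depending on whether the user reported feeling well or not.
responses = pd.read_csv("v3_responses_window.csv")

merge_pairs = {  # illustrative stand-ins for the pairs in Supplementary Table 2
    "dry_cough": ["well_dry_cough", "not_well_dry_cough"],
    "fever": ["well_fever", "not_well_fever"],
    "loss_taste_smell": ["well_loss_taste_smell", "not_well_loss_taste_smell"],
}

# Combine the 'well' and 'not well' variants with a logical OR ('any'), then
# average each merged symptom over a user's responses in the 28-d window.
for merged, cols in merge_pairs.items():
    responses[merged] = responses[cols].fillna(0).astype(bool).any(axis=1)

user_level = responses.groupby("user_id")[list(merge_pairs)].mean()
print(user_level.head())  # per-user fraction of responses reporting each symptom
```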

Multiple logistic regression was performed using statsmodels with the binary response outcome being the swab test outcome (positive coded as 1, negative as 0) to estimate coefficients, which were converted to ORs using exponentiation. Supplementary Table 4 lists the covariates used in this outcome regression model, as well as the estimated coefficients, 95% CIs and P values.

To mitigate selection bias inherent in restricting the analysis to those who have received a test, we used several inverse probability weighting adjustments. The probability of selection was estimated via the logistic regression analysis of who received a test (Fig. 2c). These estimated selection probabilities were incorporated into the outcome model via inverse probability weighting, and we reported confidence intervals based on robust (sandwich-form) standard errors and bootstrap standard errors. As inverse probability weighting can be sensitive to very small selection probabilities, we truncated them at several different thresholds: at 0.1 and 0.9, and at 0.05 and 0.95. The results using the truncated IPW selection probabilities at 0.1 and 0.9 are reported in Fig. 3. The results using truncated IPW selection probabilities at 0.05 and 0.95 were similar. Supplementary Table 5 lists the covariates used in the outcome regression model with IPW truncation at 0.1 and 0.9, as well as the estimated coefficients and 95% CIs. Confidence intervals were obtained by bootstrapping the entire model selection process with 2,000 replicates. Specifically, for each bootstrap replicate, the entire dataset was resampled with replacement, a new selection/propensity model was fitted for who gets a test, followed by a new IPW model fit using the inferred propensities from the bootstrap sample. Coefficient estimates for the IPW models across the bootstrap samples were used to generate the confidence intervals and mean value of the coefficient.
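A simplified sketch of the weighted outcome model and bootstrap, assuming hypothetical column names and a precomputed selection probability per user; note that in the full analysis the selection model itself is refitted within each bootstrap replicate, which is abbreviated to a comment here:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical frame of tested users: binary `test_positive`, covariates and a
# fitted selection probability `p_test` from the 'who received a test' model.
df = pd.read_csv("tested_users.csv")
covariates = ["loss_taste_smell", "fever", "dry_cough", "female",
              "healthcare_worker", "household_exposure"]

def fit_ipw(data, lower=0.1, upper=0.9):
    # Truncate selection probabilities and weight observations by their inverse.
    weights = 1.0 / data["p_test"].clip(lower, upper)
    X = sm.add_constant(data[covariates].astype(float))
    return sm.GLM(data["test_positive"], X,
                  family=sm.families.Binomial(),
                  freq_weights=weights).fit()

point = fit_ipw(df)

# Bootstrap: in the full analysis the selection model is also refitted on each
# resample (and p_test recomputed) before refitting the weighted outcome model.
boot_params = []
for i in range(2000):
    sample = df.sample(n=len(df), replace=True, random_state=i)
    boot_params.append(fit_ipw(sample).params)
boot = pd.DataFrame(boot_params)
ci = boot.quantile([0.025, 0.975])
print(np.exp(point.params))  # ORs
print(np.exp(ci))            # bootstrap 95% CIs on the OR scale
```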

For additional sensitivity analysis, we used the bivariate probit model with sample selection from econometrics to simultaneously estimate a selection (who gets tested) equation and an outcome (who tests positive) equation incorporating the selection probability as an additional covariate. Due to possible collinearities, not all features could be used in both the selection model and the outcome model. Specifically, profession could only be included in the selection model, and thus should be interpreted with caution. Supplementary Table 6 lists the covariates used in the full information maximum likelihood estimates of the selection and outcome regression model, as well as the estimated coefficients, 95% CIs and P values. Qualitatively, the trends observed in the simultaneous selection/outcome model fitting are similar to those found in the two-step selection + IPW outcome logistic models.

To address sample bias in the user distribution in comparison with the distribution of individuals in the United States, we employed a poststratification correction for non-probability sampling models as an additional analysis. Poststratification using age, gender, ethnicity and location was performed on the testing selection model, which generates the IPWs for the testing-positive model. The United States was subdivided into the nine major census regions (see Supplementary Table 7). A joint distribution of estimated population over age, gender, ethnicity and region was obtained from the American Community Survey 5-year estimates from 2018. The corresponding distribution of users was generated across the same variables, and the ratio between each cell in the census distribution and the user distribution was used as the corresponding inverse probability weight in the testing selection model. The testing selection model thus should represent a user's probability of getting tested from a corrected user base distribution matching major US Census demographics. The census-corrected testing selection model was used to generate IPWs for the subsequent testing-positive model, and the analysis was otherwise performed in the same way as when using only the probability of receiving a test estimated from the HWF sample. Bootstrapping was performed on the entire process. The coefficient estimates for the poststratification testing model are shown in Supplementary Table 8, while estimates and confidence intervals for the subsequent poststratified IPW test outcome model are shown in Supplementary Table 9. A comparison of the census-based poststratification-corrected models with the uncorrected models can be found in Extended Data Fig. 7; the corrected models yield coefficients and confidence intervals similar to those of the uncorrected models.
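A sketch of the poststratification weight construction, assuming hypothetical files holding the ACS and HWF joint distributions over the four cell variables:

```python
import pandas as pd

cell_vars = ["age_bracket", "gender", "ethnicity", "region"]

# Hypothetical joint distributions over the poststratification cells:
# census: cell variables + 'pop_share' (ACS 5-year estimates, 2018).
# users:  cell variables + 'user_share' (HWF user base).
census = pd.read_csv("acs_joint_distribution.csv")
users = pd.read_csv("hwf_joint_distribution.csv")

cells = census.merge(users, on=cell_vars, how="left")
# Poststratification weight per cell: census share divided by user share.
cells["ps_weight"] = cells["pop_share"] / cells["user_share"]

# Attach the weight to each user via their cell membership; these weights serve
# as the IPWs in the census-corrected testing selection model described above.
user_table = pd.read_csv("hwf_users.csv")
user_table = user_table.merge(cells[cell_vars + ["ps_weight"]],
                              on=cell_vars, how="left")
print(user_table["ps_weight"].describe())
```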

To assess whether or not the states with the largest numbers of users bias the results, we also performed a comparison between the selection and outcome models with IPW correction with and without users from California and Connecticut (Extended Data Fig. 7). When removing California and Connecticut data, coefficient estimates from the selection and outcome models remain largely similar, suggesting limited bias due to California and Connecticut. Moreover, there is an overall increase in confidence interval widths of the outcome model, reflecting an overall increase in variance. Together, this comparison suggests that the California and Connecticut user base adds observations without adding substantial bias that may make the overall sample and corresponding analyses unrepresentative of the entire US population.

Household transmission analysis

In the HWF survey V3, users were first asked if they were exposed to someone with confirmed COVID-19. If they answered ‘yes’, then they were asked if that person lived in their household. We removed users who answered the second question despite answering something other than ‘yes’ to the first. Additionally, we restricted the analysis to users who reported a negative or positive COVID-19 swab test and those who reported two or more household members. The outcome of interest was the binary outcome of testing positive on the COVID-19 swab test. The exposure of interest was the binary variable of having a household member test positive for COVID-19; we grouped respondents who answered ‘no’ with those who did not answer the question regarding household members.

The rest of the analysis proceeded similarly to the analysis for Fig. 3a, including the covariates used and the symptom collapsing strategy for each user across their responses within the 2-week window before the test and 2-week window after the test. We also performed sensitivity analysis using symptoms before the test. The difference between this analysis and that in Fig. 3a is that the reference group for household exposure was any other exposure or no exposure, whereas the reference group for household exposure and for other exposure in Fig. 3a is no exposure.

We performed logistic regression both without (unadjusted) and with (adjusted) the covariates. For the results in Supplementary Table 10, the 95% CIs were calculated on the log OR scale and then exponentiated to obtain ORs.

Sensitivity analysis: Firth regression

Because of the small number of users in the user base who received a SARS-CoV-2 PCR test (1.7%) and the small number of tested users who received a positive test (8.2%), it is possible for standard logistic regression to be biased. To address this issue, we performed sensitivity analysis with Firth regression62, as implemented in the logistf R package (https://cran.r-project.org/package=logistf). We found very little difference between the Firth regression results and the logistic regression results presented in the paper (Extended Data Fig. 8), indicating that the imbalance of tested users or users who tested positive was not so severe as to bias the results.

Prediction models

XGBoost was compared across different featurizations and subsets of the data to assess the predictiveness of the algorithm on the HWF test result data (Fig. 3b). Two datasets were generated according to the data selection and featurization used in the regression analysis of COVID-19 swab test outcomes, with the difference between the two sets being the time span used for the window, and the inclusion of additional features not used for inference. In the pre-test dataset, the window was selected such that only responses from 14 days before the test up until the day before the last reported test were included for analysis. The post-test dataset, on the other hand, is identical to the regression analysis dataset, using data from 14 days before and after the last reported test. The features for the different feature sets are shown in Supplementary Tables 11–13. Mask wearing and social isolation were computed as time averages of the responses to these questions. Models were trained and tested using fivefold cross-validation over the datasets. Within each fold, an additional threefold cross-validation was performed on the training set to optimize model hyperparameters before testing on the test set of that fold (see Supplementary Table 14 for grid-search coordinates). Test set AUCs from each fold were averaged to form a final AUC estimate. Final ROC curves were computed using the combined test set scoring and test set labels from each fold.
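A sketch of this nested cross-validation scheme with XGBoost and scikit-learn, using an illustrative hyperparameter grid rather than the one in Supplementary Table 14, hypothetical input files, and pooled out-of-fold scores for the ROC curve:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict
from xgboost import XGBClassifier

# Hypothetical feature matrix (one row per tested user) with binary label.
data = pd.read_csv("prediction_features.csv")
y = data.pop("test_positive")
X = data

# Outer fivefold split with an inner threefold grid search in each fold; the
# grid here is illustrative, not the one in Supplementary Table 14.
param_grid = {"max_depth": [2, 4, 6], "n_estimators": [50, 100, 200]}
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

clf = GridSearchCV(XGBClassifier(eval_metric="logloss"),
                   param_grid, scoring="roc_auc", cv=inner)

# Out-of-fold predicted probabilities for every user, pooled across folds to
# draw the ROC curve; per-fold AUCs can be averaged instead, as in the text.
scores = cross_val_predict(clf, X, y, cv=outer, method="predict_proba")[:, 1]
print("cross-validated AUC:", roc_auc_score(y, scores))
fpr, tpr, _ = roc_curve(y, scores)
```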

In addition to the models shown in the main text, we tested a range of classifiers, feature sets and data aggregation strategies for their performance at predicting COVID-19 test results from HWF survey data (shown in Extended Data Fig. 6). Input data were restricted to V3 survey data collected between 24 April and 12 May, and to qPCR-tested users who responded within −10 to +14 d of their test: a total of 3,514 negative tests and 315 positive tests. Three different feature sets, each consisting of a series of binary input variables from the HWF survey, were used: 56 symptoms, 77 additional features or all 133 features together. Note that this featurization differs slightly from the featurization used in the logistic regression in Fig. 3a, the goal of which was estimation and inference rather than prediction. Each of the 3,829 qPCR-tested users responded between 1 and 25 times within the time window of analysis. To account for time and sparse response rates, we binned data across time in four different ways: (1) average response for each feature in the 9 d preceding the test date (pre-test); (2) average response from −10 to +14 d (average); (3) binning the data into 3 weeks ([−10,−1], [0,7], [8,14]) and averaging each separately, creating a separate time-indexed feature label for each time bin (week_bins_avg); or (4) imputing the response for days with no data by backfilling, then forward filling, then proceeding as in (3) (week_bins_imp). The classifiers were implemented from the scikit-learn and XGBoost Python packages with the following parameter choices: LogisticRegression(), LassoCV(max_iter=2000), ElasticNetCV(max_iter=2000), RandomForestClassifier(n_estimators=100), MLPClassifier(max_iter=2000), XGBClassifier(). Hyperparameters for cross-validation (CV) methods were automatically optimized by grid-search using fivefold cross-validation. Mean AUC was calculated for each classifier using fivefold cross-validation.

Post-test behaviour analysis

Users with post-test information (in the 2–7 d after their test date, or hypothetical test date for untested users) were collected and analysed (Fig. 4d–g). All featurization on this post-test window was identical to that of the selection/test outcome models. For computing whether a user went to work at least once, all V3 responses reporting whether or not the user left the house were used, and if any response from a user contained a ‘yes’ answer to leaving the house for work, the user was marked as leaving home for work. A similar analysis was performed for going to work without a mask, marking the user as a ‘yes’ if they reported they were going to work and separately reported not using a mask when leaving the house that day. Proportions of each behaviour across the three populations (tested positive, tested negative and untested) were computed, and were bootstrapped with 2,000 replicates to generate confidence intervals.
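A sketch of the bootstrap procedure for a single behaviour proportion, assuming a hypothetical user-level table with a test-status label and a binary went-to-work flag:

```python
import numpy as np
import pandas as pd

def bootstrap_proportion(flags: pd.Series, n_boot: int = 2000, seed: int = 0):
    """Proportion of users with a given behaviour, with a bootstrap 95% CI."""
    rng = np.random.default_rng(seed)
    values = flags.to_numpy(dtype=float)
    boots = [rng.choice(values, size=len(values), replace=True).mean()
             for _ in range(n_boot)]
    return values.mean(), np.percentile(boots, [2.5, 97.5])

# Hypothetical user-level table: test-status label plus a binary flag for
# whether the user reported going to work in the 2-7 d after their test date.
users = pd.read_csv("post_test_behaviour.csv")
for status, grp in users.groupby("test_status"):  # positive / negative / untested
    prop, (lb, ub) = bootstrap_proportion(grp["went_to_work"])
    print(f"{status}: {prop:.1%} (95% CI {lb:.1%}-{ub:.1%})")
```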

The estimated number of contacts was computed similarly, except using the average value over each user's responses across the 2–7 d after their test.

Logistic analysis was performed to understand the effect of PCR test result on user behaviour in the 2–7 d after the test, adjusting for other potential covariates. Supplementary Table 15 lists the covariates used in the unadjusted outcome regression model, as well as the estimated coefficients, 95% CIs and P values.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.