Main

Governments, businesses and researchers rely on survey data to inform the provision of government services7, steer business strategy and guide the response to the COVID-19 pandemic8,9. With the ever-increasing volume and accessibility of online surveys and organically collected data, the line between traditional survey research and Big Data is becoming increasingly blurred10. Large datasets enable the analysis of fine-grained subgroups, which are in high demand for designing targeted policy interventions11. However, counter to common intuition12, larger sample sizes alone do not ensure lower error. Instead, small biases are compounded as sample size increases1.

We see initial evidence of this in the discrepancies in estimates of first-dose COVID-19 vaccine uptake, willingness and hesitancy from three online surveys in the US. Two of them—Delphi–Facebook’s COVID-19 symptom tracker2,3 (around 250,000 responses per week and with over 4.5 million responses from January to May 2021) and the Census Bureau’s Household Pulse survey4 (around 75,000 responses per survey wave and with over 600,000 responses from January to May 2021)—have large enough sample sizes to render standard uncertainty intervals negligible; however, they report significantly different estimates of vaccination behaviour with nearly identically worded questions (Table 1). For example, Delphi–Facebook’s state-level estimates for willingness to receive a vaccine from the end of March 2021 are 8.5 percentage points lower on average than those from the Census Household Pulse (Extended Data Fig. 1a), with differences as large as 16 percentage points.

The US Centers for Disease Control and Prevention (CDC) compiles and reports vaccine uptake statistics from state and local offices13. These figures serve as a rare external benchmark, permitting us to compare survey estimates of vaccine uptake to those from the CDC. The CDC has noted the discrepancies between their own reported vaccine uptake and that of the Census Household Pulse14,15, and we find even larger discrepancies between the CDC and Delphi–Facebook data (Fig. 1a). By contrast, the Axios–Ipsos Coronavirus Tracker5 (around 1,000 responses per wave, and over 10,000 responses from January to May 2021) tracks the CDC benchmark well. None of these surveys use the CDC benchmark to adjust or assess their estimates of vaccine uptake, thus by examining patterns in these discrepancies, we can infer each survey’s accuracy and statistical representativeness, a nuanced concept that is critical for the reliability of survey findings16,17,18,19.

The Big Data Paradox in vaccine uptake

We focus on the Delphi–Facebook and Census Household Pulse surveys because their large sample sizes (each greater than 10,000 respondents20) present an opportunity to examine the Big Data Paradox1 in surveys. The Census Household Pulse is an experimental product designed to rapidly measure pandemic-related behaviour. Delphi–Facebook has stated that the intent of their survey is to make comparisons over space, time and subgroups, and that point estimates should be interpreted with caution3. However, despite these intentions, Delphi–Facebook has reported point estimates of vaccine uptake in its own publications11,21.

Delphi–Facebook and Census Household Pulse surveys persistently overestimate vaccine uptake relative to the CDC’s benchmark (Fig. 1a) even taking into account Benchmark Imprecision (Fig. 1b) as explained in ‘Decomposing Error in COVID Surveys’. Despite being the smallest survey by an order of magnitude, the estimates of Axios–Ipsos track well with the CDC rates (Fig. 1a), and their 95% confidence intervals contain the benchmark estimate from the CDC in 10 out of 11 surveys (an empirical coverage probability of 91%).

One might hope that estimates of changes in first-dose vaccine uptake are correct, even if each snapshot is biased. However, errors have increased over time, from just a few percentage points in January 2021 to Axios–Ipsos’ 4.2 percentage points [1–7 percentage points with 5% benchmark imprecision (BI)], Census Household Pulse’s 14 percentage points [5% BI: 11–17] and Delphi–Facebook’s 17 percentage points [5% BI: 14–20] by mid-May 2021 (Fig. 1b). For context, for a state that is near the herd immunity threshold (70–80% based on recent estimates22), a discrepancy of 10 percentage points in vaccination rates could be the difference between containment and uncontrolled exponential growth in new SARS-CoV-2 infections.

Conventional statistical formulas for uncertainty further mislead when applied to biased big surveys because as sample size increases, bias (rather than variance) dominates estimator error. Figure 1a shows 95% confidence intervals for vaccine uptake based on the reported sampling standard errors and weighting design effects of each survey23. Axios–Ipsos has the widest confidence intervals, but also the smallest design effects (1.1–1.2), suggesting that its accuracy is driven more by minimizing bias in data collection than by post-survey adjustment. The 95% confidence intervals of Census Household Pulse are widened by large design effects (4.4–4.8) but they are still too narrow to include the true rate of vaccine uptake in almost all survey waves. The confidence intervals for Delphi–Facebook are extremely narrow, driven by large sample size and moderate design effects (1.4–1.5), and give us a negligible chance of being close to the truth.
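
To make the link between sample size, design effect and interval width concrete, the following sketch computes a normal-approximation 95% half-width for a proportion, inflating the simple-random-sampling variance by the design effect. The function name, the common 60% estimate and the round sample sizes are illustrative assumptions; the design effects are picked from the ranges quoted above, and this is not the surveys' actual variance estimator.

```python
import math

def ci_half_width(p_hat, n, design_effect, z=1.96):
    """Half-width of a normal-approximation CI for a proportion.

    The simple-random-sampling variance p(1-p)/n is multiplied by the
    design effect, which is equivalent to using n / design_effect as the
    classical effective sample size.
    """
    variance = design_effect * p_hat * (1.0 - p_hat) / n
    return z * math.sqrt(variance)

# Illustrative inputs only: (sample size per wave, design effect).
surveys = {
    "Axios-Ipsos": (1_000, 1.2),
    "Census Household Pulse": (75_000, 4.6),
    "Delphi-Facebook": (250_000, 1.5),
}
# Hypothetical 60% estimate used for every survey.
half_widths = {
    name: ci_half_width(0.60, n, deff) for name, (n, deff) in surveys.items()
}
for name, hw in half_widths.items():
    print(f"{name}: +/- {100 * hw:.2f} percentage points")
```

Under these assumptions the two large surveys get half-widths well below one percentage point, so an error of more than ten points cannot be covered by the interval, whatever its centre.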

One benefit of such large surveys might be to compare estimates of spatial and demographic subgroups24,25,26. However, relative to the CDC’s contemporaneously reported state-level estimates, which did not include retroactive corrections, Delphi–Facebook and Census Household Pulse overestimated CDC state-level vaccine uptake by 16 and 9 percentage points, respectively (Extended Data Fig. 1g, h) in March 2021, and by equal or larger amounts by May 2021 (Extended Data Fig. 2g, h). Relative estimates were no better than absolute estimates in March of 2021: there is little agreement in a survey’s estimated state-level rankings with the CDC (a Kendall rank correlation of 0.31 for Delphi–Facebook in Extended Data Fig. 1i and 0.26 for Census Household Pulse in Extended Data Fig. 1j) but they improved in May of 2021 (correlations of 0.78 and 0.74, respectively, in Extended Data Fig. 2i, j). Among 18–64-year-olds, both Delphi–Facebook and Census Household Pulse overestimate uptake, with errors increasing over time (Extended Data Fig. 6).
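
The rank agreement described above can be sketched with a small Kendall correlation routine. This is the tie-free tau-a variant, which is an assumption here (the text does not say which variant was used), and the five state values are hypothetical.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall rank correlation (tau-a); assumes no ties in x or y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1      # pair ordered the same way in both lists
        elif sign < 0:
            discordant += 1      # pair ordered in opposite ways
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical uptake (%) for five states: survey estimate vs. benchmark.
survey = [52.0, 47.5, 61.0, 44.0, 55.5]
benchmark = [38.0, 36.5, 41.0, 37.0, 35.0]
tau = kendall_tau_a(survey, benchmark)
print(f"Kendall tau-a = {tau:.2f}")
```

A value near 1 means the survey orders states as the benchmark does even if every level is biased; a value near 0 means the ordering carries little information.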

These examples illustrate a mathematical fact: when biased samples are large, they are doubly misleading, because they produce confidence intervals with incorrect centres and substantially underestimated widths. This is the Big Data Paradox: “the bigger the data, the surer we fool ourselves”1 when we fail to account for bias in data collection.

A framework for quantifying data quality

Although it is well-understood that traditional confidence intervals capture only survey sampling errors27 (and not total error), the traditional survey framework lacks analytic tools for quantifying nonsampling errors separately from sampling errors. A previously formulated statistical framework1 permits us to exactly decompose the total error of a survey estimate into three components:

$$\begin{array}{c}{\rm{Total}}\,{\rm{error}}=\,{\rm{Data}}\,{\rm{quality}}\\ {\rm{defect}}\times {\rm{Data}}\,{\rm{scarcity}}\times {\rm{Inherent}}\,{\rm{problem}}\,{\rm{difficulty}}\end{array}$$
(1)

This framework has been applied to COVID-19 case counts28 and election forecasting29. Its full application requires ground-truth benchmarks or their estimates from independent sources1.

Specifically, the ‘total error’ is the difference between the observed sample mean $${\overline{Y}}_{n}$$ and the ground truth that it estimates, the population mean $${\overline{Y}}_{N}$$. The ‘data quality defect’ is measured using $${\hat{\rho }}_{Y,R}$$, called the ‘data defect correlation’ (ddc)1, which quantifies total bias (from any source), measured by the correlation between the event that an individual’s response is recorded and its value, Y. The effect of data quantity is captured by ‘data scarcity’, which is a function of the sample size n and the population size N, measured as $$\sqrt{(N-n)/n}$$; hence what matters for error is the relative sample size (that is, how close n is to N) rather than the absolute sample size n. The third factor is the ‘inherent problem difficulty’, which measures the population heterogeneity (via the standard deviation $${\sigma }_{Y}$$ of Y), because the more heterogeneous a population is, the harder it is to estimate its average well. Mathematically, equation (1) is given by $${\overline{Y}}_{n}-{\overline{Y}}_{N}={\hat{\rho }}_{Y,R}\times \sqrt{(N-n)/n}\times {\sigma }_{Y}$$. This expression was inspired by the Hartley–Ross inequality for biases in ratio estimators30. More details on the decomposition are provided in ‘Calculation and interpretation of ddc’ in the Methods, in which we also present a generalization for weighted estimators.
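
Because the decomposition is an algebraic identity rather than an approximation, it can be checked numerically on any finite population. The sketch below does so on a synthetic population; the population size, the 40% uptake and the response propensities are all made up for illustration, and the helper names are hypothetical.

```python
import math
import random

def pop_mean(values):
    return sum(values) / len(values)

def pop_sd(values):
    """Finite-population standard deviation (divides by N, not N - 1)."""
    m = pop_mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def pop_corr(a, b):
    """Finite-population Pearson correlation between two equal-length lists."""
    ma, mb = pop_mean(a), pop_mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)
    return cov / (pop_sd(a) * pop_sd(b))

random.seed(0)
N = 20_000
# Y = 1 if vaccinated; hypothetical 40% uptake.
Y = [1 if random.random() < 0.40 else 0 for _ in range(N)]
# Hypothetical biased recording: vaccinated people respond twice as often.
R = [1 if random.random() < (0.10 if y else 0.05) else 0 for y in Y]

n = sum(R)
sample_mean = sum(y for y, r in zip(Y, R) if r) / n
total_error = sample_mean - pop_mean(Y)

ddc = pop_corr(Y, R)                      # data quality defect
data_scarcity = math.sqrt((N - n) / n)    # data scarcity
difficulty = pop_sd(Y)                    # inherent problem difficulty
product = ddc * data_scarcity * difficulty

print(f"total error = {total_error:.6f}, product of factors = {product:.6f}")
```

The two printed numbers agree up to floating-point rounding, provided the standard deviations and the correlation are computed over the whole population with divisor N.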

Decomposing error in COVID surveys

Although the ddc is not directly observed, COVID-19 surveys present a rare case in which it can be deduced because all of the other terms in equation (1) are known (see ‘Calculation and interpretation of ddc’ in the Methods for an in-depth explanation). We apply this framework to the aggregate error shown in Fig. 1b, and the resulting components of error from the right-hand side of equation (1) are shown in Fig. 1c–e.

We use the CDC’s report of the cumulative count of first doses administered to US adults as the benchmark8,13, $${\overline{Y}}_{N}$$. This benchmark time series may be affected by administrative delays and slippage in how the CDC centralizes information from states31,32,33,34. The CDC continuously updates their entire time series retroactively for such delays as they are reported. But to account for potentially unreported delays, we present our results with Benchmark Imprecision (BI) in case the CDC’s numbers from our study period, 9 January to 26 May 2021, as reported on 26 May by the CDC suffer from ±5% and ±10% imprecision. These scenarios were chosen on the basis of analysis of the magnitude by which the CDC’s initial estimate for vaccine uptake by a particular day increases as the CDC receives delayed reports of vaccinations that occurred on that day (Extended Data Fig. 3, Supplementary Information A.2). That said, these scenarios may not capture latent systemic issues that affect CDC vaccination reporting.

The total error of each survey’s estimate of vaccine uptake (Fig. 1b) increases over time for all studies, most markedly for Delphi–Facebook. The data quality defect, measured by the ddc, also increases over time for Census Household Pulse and for Delphi–Facebook (Fig. 1c). The ddc for Axios–Ipsos is much smaller and steady over time, consistent with what one would expect from a representative sample. The data scarcity, $$\sqrt{(N-n)/n}$$, for each survey is roughly constant across time (Fig. 1d). Inherent problem difficulty is a population quantity common to all three surveys that peaks when the benchmark vaccination rate approaches 50% in April 2021 (Fig. 1e). Therefore, the decomposition suggests that the increasing error in estimates of vaccine uptake in Delphi–Facebook and Census Household Pulse is primarily driven by increasing ddc, which captures the overall effect of the bias in coverage, selection and response.
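
The behaviour of the two non-ddc factors can be sketched directly from their definitions. The adult population figure below is an assumed round number and the per-wave sample sizes are the approximate values quoted earlier; the function names are hypothetical.

```python
import math

def data_scarcity(n, N):
    """sqrt((N - n) / n): depends on the sample size relative to N."""
    return math.sqrt((N - n) / n)

def problem_difficulty(p):
    """Standard deviation of a binary outcome with population mean p."""
    return math.sqrt(p * (1.0 - p))

N_ADULTS = 255_000_000  # assumed round figure for the US adult population
waves = [
    ("Axios-Ipsos", 1_000),
    ("Census Household Pulse", 75_000),
    ("Delphi-Facebook", 250_000),
]
for name, n in waves:
    print(f"{name}: data scarcity = {data_scarcity(n, N_ADULTS):.1f}")

# Difficulty as a function of the true uptake rate, on a 5%-step grid.
uptake_grid = [i / 100 for i in range(5, 100, 5)]
hardest = max(uptake_grid, key=problem_difficulty)
print(f"difficulty peaks at uptake = {hardest:.2f}")
```

Data scarcity stays fixed as long as n and N do, and the difficulty term is bounded by 0.5, so neither factor can account for an error that keeps growing after uptake passes 50%.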

Equation (1) also yields a formula for the bias-adjusted effective sample size $${n}_{{\rm{eff}}}$$, which is the size of a simple random sample that we would expect to exhibit the same level of mean squared error (MSE) as what was actually observed in a given study with a given ddc. Unlike the classical effective sample size23, this quantity captures the effect of bias as well as that of an increase in variance from weighting and sampling. For details of this calculation, see ‘Error decomposition with survey weights’ in the Methods.

For estimating the US vaccination rate, Delphi–Facebook has a bias-adjusted effective sample size of less than 10 in April 2021, a 99.99% reduction from the raw average weekly sample size of 250,000 (Fig. 2). The Census Household Pulse is also affected by over 99% reductions in effective sample size by May 2021. A simple random sample would have controlled estimation errors by controlling ddc. However, once this control is lost, small increases in ddc beyond what is expected in simple random samples can result in marked reductions of effective sample sizes for large populations1.

Comparing study designs

Understanding why bias occurs in some surveys but not others requires an understanding of the sampling strategy, modes, questionnaire and weighting scheme of each survey. Table 1 compares the design of each survey (more details in ‘Additional survey methodology’ in the Methods, Extended Data Table 1).

All three surveys are conducted online and target the US adult population, but vary in the methods that they use to recruit respondents35. The Delphi–Facebook survey recruits respondents from active Facebook users (the Facebook active user base, or FAUB) using daily unequal-probability stratified random sampling2. The Census Bureau uses a systematic random sample to select households from the subset of the master address file (MAF) of the Census for which they have obtained either cell phone or email contact information (approximately 81% of all households in the MAF)4.

In comparison, Axios–Ipsos relies on inverse response propensity sampling from Ipsos’ online KnowledgePanel. Ipsos recruits panellists using an address-based probabilistic sample from USPS’s delivery sequence file (DSF)5. The DSF is similar to the MAF of the Census. Unlike the Census Household Pulse, potential respondents are not limited to the subset for whom email and phone contact information is available. Furthermore, Ipsos provides internet access and tablets to recruited panellists who lack home internet access. In 2021, this ‘offline’ group typically comprises 1% of the final survey (Extended Data Table 1).

All three surveys weight on age and gender; that is, assign larger weights to respondents of underrepresented age by gender subgroups and smaller weights to those of overrepresented subgroups2,4,5 (Table 1). Axios–Ipsos and Census Household Pulse also weight on education and race and/or ethnicity (hereafter, race/ethnicity). Axios–Ipsos additionally weights to the composition of political partisanship measured by “recent ABC News/Washington Post telephone polls”5 in 6 of the 11 waves we study. Education, a known correlate of propensity to respond to surveys36 and of social media use37, is notably absent from Delphi–Facebook’s weighting scheme, as is race/ethnicity. As noted before, none of the surveys use the CDC benchmark to adjust or assess estimates of vaccine uptake.
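
The idea of giving larger weights to underrepresented subgroups can be sketched with the simplest form of post-stratification, in which each cell's weight is its population share divided by its sample share. The surveys' actual weighting procedures are more elaborate and are not reproduced here; the cells, counts and shares below are hypothetical.

```python
def poststratification_weights(sample_counts, population_shares):
    """Weight per cell = population share / sample share.

    sample_counts: dict mapping cell label -> number of respondents.
    population_shares: dict mapping cell label -> population proportion.
    """
    n = sum(sample_counts.values())
    weights = {}
    for cell, count in sample_counts.items():
        if count == 0:
            raise ValueError(f"no respondents in cell {cell!r}")
        weights[cell] = population_shares[cell] / (count / n)
    return weights

# Hypothetical two-cell example (not taken from any of the three surveys).
sample_counts = {"no college degree": 400, "college degree": 600}
population_shares = {"no college degree": 0.65, "college degree": 0.35}
weights = poststratification_weights(sample_counts, population_shares)

# After weighting, the sample composition matches the population shares.
weighted_total = sum(sample_counts[c] * weights[c] for c in sample_counts)
weighted_share = sample_counts["no college degree"] * weights["no college degree"] / weighted_total
print(weights, round(weighted_share, 2))
```

Weighting of this kind fixes the composition only on the variables used to define the cells; any variable left out, such as education in a scheme that uses only age and gender, keeps its imbalance.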

Explanations for error

Table 2 illustrates some consequences of these design choices. Axios–Ipsos samples mimic the actual breakdown of education attainment among US adults even before weighting, whereas those of Census Household Pulse and Delphi–Facebook do not. After weighting, Axios–Ipsos and Census Household Pulse match the population benchmark, by design. Delphi–Facebook does not explicitly weight on education, and hence the education bias persists in their weighted estimates: those without a college degree are underrepresented by nearly 20 percentage points. The case is similar for race/ethnicity. Delphi–Facebook’s weighting scheme does not adjust for race/ethnicity, and hence their weighted sample still overrepresents white adults by 8 percentage points, and underrepresents the proportions of Black and Asian individuals by around 50% of their size in the population (Table 2).

The overrepresentation of white adults and people with college degrees explains part of the error of Delphi–Facebook. The racial groups that Delphi–Facebook underrepresents tend to be more willing and less vaccinated in the samples (Table 2). In other words, reweighting the Delphi–Facebook survey to upweight racial minorities will bring willingness estimates closer to Household Pulse and the vaccination rate closer to CDC. The three surveys also report that people without a four-year college degree are less likely to have been vaccinated compared to those with a degree (Table 2, Supplementary Table 1). If we assume that vaccination behaviours do not differ systematically between non-respondents and respondents within each demographic category, underrepresentation of less-vaccinated groups would contribute to the bias found here. However, this alone cannot explain the discrepancies in all the outcomes. Census Household Pulse weights on both race and education4 and still overestimates vaccine uptake by over ten points in late May of 2021 (Fig. 1b).

Delphi–Facebook and Census Household Pulse may be unrepresentative with respect to political partisanship, which has been found to be correlated with vaccine behaviour38 and with survey response39, and thus may contribute to observed bias. However, neither Delphi–Facebook nor Census Household Pulse collects partisanship of respondents. US Census agencies cannot ask about political preference, and no unequivocal population benchmark for partisanship in the general adult population exists.

Rurality may also contribute to the errors, because it correlates with vaccine status8 and home internet access40. Neither Census Household Pulse nor Delphi–Facebook weights on sub-state geography, which may mean that adults in more rural areas who are less likely to be vaccinated are also underrepresented in the two surveys, leading to overestimation of vaccine uptake.

Axios–Ipsos weights to metropolitan status and also recruits a fraction of its panellists from an ‘offline’ population of individuals without internet access. We find that dropping these offline respondents (n = 21, or 1% of the sample) in their 22 March 2021 wave increases Axios–Ipsos’ overall estimate of the vaccination rate by 0.5 percentage points, thereby increasing the total error (Extended Data Table 2). However, this offline population is too small to explain the entirety of the difference in accuracy between Axios–Ipsos and either Census Household Pulse (6 percentage points) or Delphi–Facebook (14 percentage points), in this time period.

Careful recruitment of panellists is at least as important as weighting. Weighting on observed covariates alone cannot explain or correct the discrepancies we observe. For example, reweighting the Axios–Ipsos 22 March 2021 wave using only Delphi–Facebook’s weighting variables (age group and gender) increased the error in their vaccination estimates by 1 percentage point, but this estimate with Axios–Ipsos data is still more accurate than that from Delphi–Facebook during the same period (Extended Data Table 2). The Axios–Ipsos estimate with Delphi–Facebook weighting overestimated vaccination by 2 percentage points, whereas Delphi–Facebook overestimated it by 11 percentage points.

The key implication is that there is no silver bullet: every small part of panel recruitment, sampling and weighting matters for controlling the data quality measured as the correlation between an outcome and response—what we call the ddc. In multi-stage sampling, which includes for example the selection of potential respondents followed by non-response, bias in even a single step can substantially affect the final result (‘Population size in multi-stage sampling’ in the Methods, Extended Data Table 3). A total quality control approach, inspired by the total survey error framework41, is a better strategy than trying to prioritize some components over others to improve data quality. This emphasis is a reaffirmation of the best practice for survey research as advocated by the American Association for Public Opinion Research:6 “The quality of a survey is best judged not by its size, scope, or prominence, but by how much attention is given to [preventing, measuring and] dealing with the many important problems that can arise.”42

Addressing common misperceptions

The three surveys discussed in this article demonstrate a seemingly paradoxical phenomenon—the two larger surveys that we studied are more statistically confident, but also more biased, than the smaller, more traditional Axios–Ipsos poll. These findings are paradoxical only when we fall into the trap of the intuition that estimation errors necessarily decrease in larger datasets12.

A limitation of our vaccine uptake analysis is that we only examine ddc with respect to an outcome for which a benchmark is available: first-dose vaccine uptake. One might hope that surveys biased on vaccine uptake are not biased on other outcomes, for which there may not be benchmarks to reveal their biases. However, the absence of evidence of bias for the remaining outcomes is not evidence of its absence. In fact, mathematically, when a survey is found to be biased with respect to one variable, it implies that the entire survey fails to be statistically representative. The theory of survey sampling relies on statistical representativeness for all variables achieved through probabilistic sampling43. Indeed, Neyman’s original introduction of probabilistic sampling showed the limits of purposive sampling, which attempted to achieve overall representativeness by enforcing it only on a set of variables18,44.

In other words, when a survey loses its overall statistical representativeness (for example, through bias in coverage or non-response), which is difficult to repair (for example, by weighting or modelling on observable characteristics) and almost impossible to verify45, researchers who wish to use the survey for scientific studies must supply other reasons to justify the reliability of their survey estimates, such as evidence about the independence between the variable of interest and the factors that are responsible for the unrepresentativeness. Furthermore, scientific journals that publish studies based on surveys that may be unrepresentative17—especially those with large sizes such as Delphi–Facebook (biased with respect to vaccination status (Fig. 1), race and education (Table 2))—need to ask for reasonable effort from the authors to address the unrepresentativeness.

Some may argue that bias is a necessary trade-off for having data that are sufficiently large for conducting highly granular analysis, such as county-level estimation of vaccine hesitancy26. Although high-resolution inference is important, we warn that this is a double-edged argument. A highly biased estimate with a misleadingly small confidence interval can do more damage than having no estimate at all. We further note that bias is not limited to population point estimates, but also affects estimates of changes over time (contrary to published guidance3). Both Delphi–Facebook and Census Household Pulse significantly overestimate the slope of vaccine uptake relative to that of the CDC benchmark (Fig. 1b).

The accuracy of our analysis does rely on the accuracy of the CDC’s estimates of COVID vaccine uptake. However, if the selection bias in the CDC’s benchmark is significant enough to alter our results, then that itself would be another example of the Big Data Paradox.

Discussion

This is not the first time that the Big Data Paradox has appeared: in February 2013, Google Trends predicted more than twice as many influenza-like illnesses as the CDC did46. This analysis demonstrates that the Big Data Paradox applies not only to organically collected Big Data, like Google Trends, but also to surveys. Delphi–Facebook is “the largest public health survey ever conducted in the United States”47. The Census Household Pulse is conducted in collaboration between the US Census Bureau and eleven statistical government partners, all with enormous resources and survey expertise. Both studies take steps to mitigate selection bias, but substantially overestimate vaccine uptake. As we have shown, the effect of bias is magnified as relative sample size increases.

By contrast, Axios–Ipsos records only about 1,000 responses per wave, but makes additional efforts to prevent selection bias. Small surveys can be just as wrong as large surveys in expectation—of the three other small-to-medium online surveys additionally analysed, two also miss the CDC vaccination benchmark (Extended Data Fig. 5). The overall lesson is that investing in data quality (particularly during collection, but also in analysis) minimizes error more efficiently than does increasing data quantity. Of course, a sample size of 1,000 may be too small (that is, leading to unhelpfully large uncertainty intervals) for the kind of 50-state analyses made possible by big surveys. However, small-area methods that borrow information across subgroups48 can perform better with higher-quality—albeit few—data, and whether that approach would outperform the large, biased surveys is an open question.

There are approaches to correct for these biases in both probability and nonprobability samples alike. For COVID-19 surveys in particular, since June 2021, the AP–NORC multimode panel has weighted their COVID-19 related surveys to the CDC benchmark, so that the weighted ddc for vaccine uptake is zero by design49. More generally, there is an extensive literature on approaches for making inferences from data collected from nonprobability samples50,51,52. Other promising approaches include integrating surveys of varying quality53,54, and leveraging the estimated ddc in one outcome to correct bias in others under several scenarios (Supplementary Information D).

Although more needs to be done to fully examine the nuances of large surveys, organically collected administrative datasets and social media data, we hope that this comparative study of ddc highlights the concerning implications of the Big Data Paradox—how large sample sizes magnify the effect of seemingly small defects in data collection, which leads to overconfidence in incorrect inferences.

Methods

Calculation and interpretation of ddc

The mathematical expression for equation (1) is given here for completeness:

$${\overline{Y}}_{n}-{\overline{Y}}_{N}={\hat{\rho }}_{Y,R}\times \sqrt{\frac{N-n}{n}}\times {\sigma }_{Y}$$
(2)

The first factor $${\hat{\rho }}_{Y,R}$$ is called the data defect correlation (ddc)1. It is a measure of data quality represented by the correlation between the recording indicator R (R = 1 if an answer is recorded and R = 0 otherwise) and its value, Y. Given a benchmark, the ddc $${\hat{\rho }}_{Y,R}$$ can be calculated by substituting known quantities into equation (2). In the case of a single survey wave of a COVID-19 survey, n is the sample size of the survey wave, N is the population size of US adults from US Census estimates55, $${\overline{Y}}_{n}$$ is the survey estimate of vaccine uptake and $${\overline{Y}}_{N}$$ is the estimate of vaccine uptake for the corresponding period taken from the CDC’s report of the cumulative count of first doses administered to US adults8,13. We calculate $${\sigma }_{Y}=\sqrt{{\overline{Y}}_{N}(1-{\overline{Y}}_{N})}$$ because Y is binary (but equation (2) is not restricted to binary Y).
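
The substitution described above amounts to one line of arithmetic. The sketch below solves equation (2) for the ddc of a binary outcome; the function name is hypothetical and the inputs (a 77% survey estimate, a 60% benchmark, a wave of 250,000 and an assumed adult population of 255 million) are illustrative rather than values from the study.

```python
import math

def ddc_from_benchmark(y_survey, y_benchmark, n, N):
    """Solve equation (2) for the ddc of a binary outcome."""
    sigma_y = math.sqrt(y_benchmark * (1.0 - y_benchmark))
    data_scarcity = math.sqrt((N - n) / n)
    return (y_survey - y_benchmark) / (data_scarcity * sigma_y)

# Hypothetical wave: 77% survey estimate against a 60% benchmark.
ddc = ddc_from_benchmark(
    y_survey=0.77, y_benchmark=0.60, n=250_000, N=255_000_000
)
print(f"ddc = {ddc:.4f}")
```

With these inputs the ddc comes out at roughly 0.01. A correlation that small would usually be dismissed, yet multiplied by a data scarcity of about 32 and a standard deviation of about 0.49 it reproduces the full 17-point error.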

We calculate $${\hat{\rho }}_{Y,R}$$ by using total error $${\overline{Y}}_{n}-{\overline{Y}}_{N}$$, which captures not only selection bias but also any measurement bias (for example, from question wording). However, with this calculation method, $${\hat{\rho }}_{Y,R}$$ lacks the direct interpretation as a correlation between Y and R, and instead becomes a more general index of data quality directly related to classical design effects (see ‘Bias-adjusted effective sample size’).

It is important to point out that the increase in ddc does not necessarily imply that the response mechanisms for Delphi–Facebook and Census Household Pulse have changed over time. The correlation between a changing outcome and a steady response mechanism could change over time, hence changing the value of ddc. For example, as more individuals become vaccinated, and vaccination status is driven by individual behaviour rather than eligibility, the correlation between vaccination status and propensity to respond could increase even if the propensity to respond for a given individual is constant. This would lead to large values of ddc over time, reflecting the increased impact of the same response mechanism.
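
A two-group toy calculation illustrates this point. In the sketch below the response propensities are held fixed while only the vaccination rates change; the group sizes, propensities and rates are hypothetical, and respondents within a group are assumed to be vaccinated at their group's rate.

```python
import math

def ddc_two_groups(n_a, n_b, resp_a, resp_b, vax_a, vax_b):
    """ddc when response depends only on group membership.

    n_a, n_b: group sizes; resp_a, resp_b: fixed response propensities;
    vax_a, vax_b: vaccination rates. Expected counts are used throughout,
    so there is no sampling noise in the result.
    """
    N = n_a + n_b
    n = n_a * resp_a + n_b * resp_b
    pop_rate = (n_a * vax_a + n_b * vax_b) / N
    sample_rate = (n_a * resp_a * vax_a + n_b * resp_b * vax_b) / n
    sigma_y = math.sqrt(pop_rate * (1.0 - pop_rate))
    return (sample_rate - pop_rate) / (math.sqrt((N - n) / n) * sigma_y)

# Same response mechanism in both periods (group A responds twice as often).
# Early: uptake set by eligibility, identical across groups.
early = ddc_two_groups(100_000, 100_000, 0.02, 0.01, vax_a=0.10, vax_b=0.10)
# Late: uptake set by behaviour, higher in the group that responds more.
late = ddc_two_groups(100_000, 100_000, 0.02, 0.01, vax_a=0.80, vax_b=0.40)
print(f"early ddc = {early:.4f}, late ddc = {late:.4f}")
```

The early ddc is zero and the late ddc is positive even though no individual's propensity to respond has changed.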

Error decomposition with survey weights

The data quality framework given by equations (1) and (2) is a special case of a more general framework for assessing the actual error of a weighted estimator $${\overline{Y}}_{w}=\frac{{\sum }_{i}{w}_{i}{R}_{i}{Y}_{i}}{{\sum }_{i}{w}_{i}{R}_{i}}$$, where $${w}_{i}$$ is the survey weight assigned to individual $$i$$. It is shown in Meng1 that

$${\overline{Y}}_{{\rm{w}}}-{\overline{Y}}_{N}={\hat{\rho }}_{Y,{R}_{{\rm{w}}}}\times \sqrt{\frac{N-{n}_{{\rm{w}}}}{{n}_{{\rm{w}}}}}\times {\sigma }_{Y},$$
(3)

where $${\hat{\rho }}_{Y,{R}_{{\rm{w}}}}={\rm{Corr}}(Y,{R}_{{\rm{w}}})$$ is the finite population correlation between $${Y}_{i}$$ and $${R}_{{\rm{w}},i}={w}_{i}{R}_{i}$$ (over i = 1, …, N). The ‘hat’ on ρ reminds us that this correlation depends on the specific realization of $$\{{R}_{i},i=1,\ldots ,N\}$$. The term $${n}_{{\rm{w}}}$$ is the classical ‘effective sample size’ due to weighting23; that is, $${n}_{{\rm{w}}}=\frac{n}{(1+{{\rm{CV}}}_{{\rm{w}}}^{2})}$$, where $${{\rm{CV}}}_{{\rm{w}}}$$ is the coefficient of variation of the weights for all individuals in the observed sample, that is, the standard deviation of weights normalized by their mean. It is common for surveys to rescale their weights to have mean 1, in which case $${{\rm{CV}}}_{{\rm{w}}}^{2}$$ is simply the sample variance of the weights.
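
The classical effective sample size can be computed from the respondents' weights alone. The sketch below assumes the divide-by-n form of the variance, in which case the formula coincides with Kish's $$(\sum w)^{2}/\sum {w}^{2}$$; the function name and the weights are hypothetical.

```python
def effective_sample_size(weights):
    """Classical effective sample size n_w = n / (1 + CV_w^2).

    CV_w^2 is computed with the divide-by-n variance, so the result
    equals Kish's (sum w)^2 / (sum w^2).
    """
    n = len(weights)
    mean_w = sum(weights) / n
    var_w = sum((w - mean_w) ** 2 for w in weights) / n
    cv_squared = var_w / mean_w ** 2
    return n / (1.0 + cv_squared)

equal = effective_sample_size([1.0] * 4)               # no loss from weighting
unequal = effective_sample_size([1.0, 1.0, 3.0, 3.0])  # weights vary, n_w < n
print(equal, unequal)
```

Equal weights leave the sample size untouched, and any variation in the weights lowers it, which is the variance cost of weighting mentioned in the text.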

When all weights are the same, equation (3) reduces to equation (2). In other words, the ddc term $${\hat{\rho }}_{Y,{R}_{{\rm{w}}}}$$ now also takes into account the effect of the weights as a means to combat the selection bias represented by the recording indicator R. Intuitively, if $${\hat{\rho }}_{Y,R}={\rm{Corr}}(Y,R)$$ is high (in magnitude), then some values $${Y}_{i}$$ have a higher chance of entering our dataset than others, thus leading to a sample average that is a biased estimator for the population average. Incorporating appropriate weights can reduce $${\hat{\rho }}_{Y,R}$$ to $${\hat{\rho }}_{Y,{R}_{{\rm{w}}}}$$, with the aim of reducing the effect of the selection bias. However, this reduction alone may not be sufficient to improve the accuracy of $${\overline{Y}}_{{\rm{w}}}$$ because the use of weights necessarily reduces the sampling fraction $$f=\frac{n}{N}$$ to $${f}_{{\rm{w}}}=\frac{{n}_{{\rm{w}}}}{N}$$ as well, as $${n}_{{\rm{w}}} < n$$. Equation (3) precisely describes this trade-off, providing a formula to assess when the reduction of ddc is significant enough to outweigh the reduction of the effective sample size.
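
Equation (3) can be verified on a tiny explicit population, which also shows the trade-off at work. The ten outcomes, recording indicators and weights below are hypothetical, and the helper names are not from the original.

```python
import math

def pop_mean(v):
    return sum(v) / len(v)

def pop_sd(v):
    """Finite-population standard deviation (divides by N)."""
    m = pop_mean(v)
    return math.sqrt(sum((x - m) ** 2 for x in v) / len(v))

def pop_corr(a, b):
    ma, mb = pop_mean(a), pop_mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)
    return cov / (pop_sd(a) * pop_sd(b))

# Hypothetical population of N = 10 with outcomes Y, recording indicators R
# and weights w (weights of non-respondents never enter the estimator).
Y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
R = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
w = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0]

N = len(Y)
R_w = [wi * ri for wi, ri in zip(w, R)]
weighted_mean = sum(rw * y for rw, y in zip(R_w, Y)) / sum(R_w)
# Kish form of n / (1 + CV_w^2), using the respondents' weights only.
n_w = sum(R_w) ** 2 / sum(rw ** 2 for rw in R_w)

lhs = weighted_mean - pop_mean(Y)
rhs = pop_corr(Y, R_w) * math.sqrt((N - n_w) / n_w) * pop_sd(Y)

unweighted_error = sum(y * r for y, r in zip(Y, R)) / sum(R) - pop_mean(Y)
print(f"weighted error = {lhs:.4f} (identity gives {rhs:.4f}), "
      f"unweighted error = {unweighted_error:.4f}, n = {sum(R)}, n_w = {n_w:.2f}")
```

In this example the weights shrink the error while lowering the effective sample size from 5 to about 4.5, so the reduction in ddc outweighs the loss in $${n}_{{\rm{w}}}$$. Other weight choices can tip the balance the other way.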

Measuring the correlation between Y and R is not a new idea in survey statistics (though note that ddc is the population correlation between Y and R, not the sample correlation), nor is the observation that as sample size increases, error is dominated by bias instead of variance56,57. The new insight is that ddc is a general metric to index the lack of representativeness of the data we observe, regardless of whether or not the sample is obtained through a probabilistic scheme, or weighted to mimic a probabilistic sample. As discussed in ‘Addressing common misperceptions’ in the main text, any single ddc deviating from what is expected under representative sampling (for example, probabilistic sampling) is sufficient to establish that the sample is not representative (but the converse is not true). Furthermore, the ddc framework refutes the common belief that increasing sample size necessarily improves statistical estimation1,58.

Bias-adjusted effective sample size

By matching the mean-squared error of $${\overline{Y}}_{{\rm{w}}}$$ with the variance of the sample average from simple random sampling, Meng1 derives the following formula for calculating a bias-adjusted effective sample size, or $${n}_{{\rm{eff}}}$$:

$$\begin{array}{r}{n}_{{\rm{eff}}}=\frac{{n}_{{\rm{w}}}}{N-{n}_{{\rm{w}}}}\times \frac{1}{E[{\hat{\rho }}_{Y,{R}_{{\rm{w}}}}^{2}]}\end{array}$$

Given an estimator $${\overline{Y}}_{{\rm{w}}}$$ with expected total MSE T due to data defect, sampling variability and weighting, this quantity neff represents the size of a simple random sample such that its sample mean, as an estimator for the same population mean $${\overline{Y}}_{N}$$, would have the identical MSE T. The term $$E[{\hat{\rho }}_{Y,{R}_{{\rm{w}}}}^{2}]$$ represents the amount of selection bias (squared) expected on average from a particular recording mechanism R and a chosen weighting scheme.
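As a minimal sketch of the formula above, the following Python function evaluates neff for given values of nw, N and the squared ddc. The function name and the numerical inputs are hypothetical and chosen only for illustration; they are not estimates from any of the three surveys.

```python
def effective_sample_size(n_w: float, N: float, ddc_squared: float) -> float:
    """Bias-adjusted effective sample size: n_eff = n_w / (N - n_w) * 1 / E[rho^2]."""
    if not 0 < n_w < N:
        raise ValueError("need 0 < n_w < N")
    if ddc_squared <= 0:
        raise ValueError("ddc_squared must be positive")
    return n_w / (N - n_w) / ddc_squared


# Hypothetical inputs: a weighted sample size of 100,000 drawn from a
# population of 250 million, with a ddc of 0.005 in magnitude.
n_eff = effective_sample_size(n_w=100_000, N=250_000_000, ddc_squared=0.005 ** 2)
print(f"n_eff = {n_eff:.1f}")
```

With these illustrative inputs, a seemingly tiny ddc leaves a sample of 100,000 with the information content of a simple random sample of only a few dozen respondents.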

For each survey wave, we use $${\hat{\rho }}_{Y,{R}_{{\rm{w}}}}^{2}$$ to approximate $$E[{\hat{\rho }}_{Y,{R}_{{\rm{w}}}}^{2}]$$. This estimation is unbiased by design, as we use an estimator to estimate its expectation. Therefore, the only source of error is the sampling variation, which is typically negligible for large surveys such as Delphi–Facebook and the Census Household Pulse. This estimation error may have more impact for smaller surveys such as the Axios–Ipsos survey, an issue that we will investigate in subsequent work.

We compute $${\hat{\rho }}_{Y,{R}_{{\rm{w}}}}$$ by using the benchmark $${\overline{Y}}_{N}$$, namely, by solving equation (3) for $${\hat{\rho }}_{Y,{R}_{{\rm{w}}}}$$,

$${\hat{\rho }}_{Y,{R}_{{\rm{w}}}}=\frac{{Z}_{{\rm{w}}}}{\sqrt{N}},{\rm{where}}\,{Z}_{{\rm{w}}}=\frac{{\overline{Y}}_{{\rm{w}}}-{\overline{Y}}_{N}}{\sqrt{\frac{1-{f}_{{\rm{w}}}}{{n}_{{\rm{w}}}}}{\sigma }_{Y}}$$
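The calculation can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code used for the analysis: the function names are hypothetical, Y is assumed to be binary so that the population standard deviation is $$\sqrt{{\overline{Y}}_{N}(1-{\overline{Y}}_{N})}$$, and the inputs are made-up numbers.

```python
import math


def standardized_error(y_w: float, y_N: float, n_w: float, N: float) -> float:
    """Z_w: actual error in units of the simple-random-sample standard error."""
    f_w = n_w / N
    sigma_Y = math.sqrt(y_N * (1 - y_N))  # assumes a binary outcome Y
    return (y_w - y_N) / (math.sqrt((1 - f_w) / n_w) * sigma_Y)


def ddc_from_benchmark(y_w: float, y_N: float, n_w: float, N: float) -> float:
    """Solve equation (3) for the ddc: rho = Z_w / sqrt(N)."""
    return standardized_error(y_w, y_N, n_w, N) / math.sqrt(N)


# Hypothetical inputs: weighted estimate 70%, benchmark 60%,
# weighted sample size 50,000, population 255 million.
z_w = standardized_error(0.70, 0.60, 50_000, 255_000_000)
rho = ddc_from_benchmark(0.70, 0.60, 50_000, 255_000_000)
print(f"Z_w = {z_w:.2f}, ddc = {rho:.5f}")
```

Multiplying the returned ddc by $$\sqrt{(1-{f}_{{\rm{w}}})/{f}_{{\rm{w}}}}$$ and by the standard deviation of Y recovers the original error, which is a quick check that the rearrangement of equation (3) is consistent.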

We introduce this notation Zw because it is the quantity that determines the well-known survey efficiency measure, the so-called ‘design effect’, which is the variance of Zw for a probabilistic sampling design23 (when we assume the weights are fixed). For the more general setting in which $${\overline{Y}}_{{\rm{w}}}$$ may be biased, we replace the variance by MSE, and hence obtain the bias-adjusted design effect $${D}_{e}=E[{Z}_{{\rm{w}}}^{2}]$$, which is the MSE relative to the benchmark measured in the unit of the variance of an average from a simple random sample of size nw. Hence $${D}_{I}\equiv E[{\hat{\rho }}_{Y,{R}_{{\rm{w}}}}^{2}]$$, which was termed the ‘data defect index’1, is simply the bias-adjusted design effect per unit, because $${D}_{I}=\frac{{D}_{e}}{N}$$.

Furthermore, because $${Z}_{{\rm{w}}}$$ is the standardized actual error, it captures any kind of error inherent in $${\overline{Y}}_{{\rm{w}}}$$. This observation is important because when Y is subject to measurement errors, $$\frac{{Z}_{{\rm{w}}}}{\sqrt{N}}$$ no longer has the simple interpretation as a correlation. But because we estimate $${D}_{I}$$ by $$\frac{{Z}_{{\rm{w}}}^{2}}{N}$$ directly, our effective sample size calculation is still valid even when equation (3) does not hold.

Asymptotic behaviour of ddc

As shown in Meng1, for any probabilistic sample without selection biases, the ddc is on the order of $$\frac{1}{\sqrt{N}}$$. Hence the magnitude of $${\hat{\rho }}_{Y,R}$$ (or $${\hat{\rho }}_{Y,{R}_{{\rm{w}}}}$$) is small enough to cancel out the effect of $$\sqrt{N-n}$$ (or $$\sqrt{N-{n}_{{\rm{w}}}}$$) in the data scarcity term on the actual error, as seen in equation (2) (or equation (3)). However, when a sample is unrepresentative (for example, when those with Y = 1 are more likely to enter the dataset than those with Y = 0), $${\hat{\rho }}_{Y,R}$$ can far exceed $$\frac{1}{\sqrt{N}}$$ in magnitude. In this case, error will increase with $$\sqrt{N}$$ for a fixed ddc and growing population size N (equation (2)). This result may be counterintuitive in the traditional survey statistics framework, which often considers how error changes as sample size n grows. The ddc framework considers a more general set-up, taking into account individual response behaviour, including its effect on sample size itself.
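The contrast between the two regimes can be illustrated numerically. The sketch below assumes that equation (2) takes the form of Meng's identity, error = ddc × $$\sqrt{(N-n)/n}$$ × σY, and uses hypothetical values for n, σY and the fixed ddc; the function name is likewise an illustrative choice.

```python
import math


def actual_error(ddc: float, n: float, N: float, sigma_Y: float) -> float:
    """Error implied by the identity: ddc * sqrt((N - n) / n) * sigma_Y."""
    return ddc * math.sqrt((N - n) / n) * sigma_Y


n, sigma_Y = 1_000, 0.5  # hypothetical sample size and standard deviation of Y
for N in (10**4, 10**6, 10**8):
    fixed = actual_error(0.005, n, N, sigma_Y)                 # ddc fixed by behaviour
    shrinking = actual_error(1 / math.sqrt(N), n, N, sigma_Y)  # ddc of order 1/sqrt(N)
    print(f"N = {N:>11,}  fixed ddc: {fixed:.4f}  ddc ~ 1/sqrt(N): {shrinking:.4f}")
```

When the ddc shrinks at the rate $$\frac{1}{\sqrt{N}}$$, the error stays close to σY/$$\sqrt{n}$$ regardless of N; when the ddc is held fixed, each hundredfold increase in N multiplies the error by about ten.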

As an example of how response behaviour can shape both total error and the number of respondents n, suppose individual response behaviour is captured by a logistic regression model

$${\rm{logit}}[{\rm{\Pr }}(R=1|Y)]=\alpha +\beta Y.$$
(4)

This is a model for a response propensity score. Its value is determined by α, which drives the overall sampling fraction $$f=\frac{n}{N}$$, and by β, which controls how strongly Y influences whether a participant will respond or not.

In this logit response model, when $$\beta \ne 0$$, $${\hat{\rho }}_{Y,R}$$ is determined by individual behaviour, not by population size N. In Supplementary Information B.1, we prove that ddc cannot vanish as N grows, nor can the observed sample size n ever approach 0 or N for a given set of (finite and plausible) values of {α, β}, because there will always be a non-trivial percentage of non-respondents. For example, an f of 0.01 can be obtained under this model for either α = −4.6, β = 0 (no influence of individual behaviour on response propensity), or for α = −3.9, β = −4.84. However, despite the same f, the implied ddc and consequently the MSE will differ. For example, the MSE for the former (no correlation with Y) is 0.0004, whereas the MSE for the latter (a −4.84 coefficient on Y) is 0.242, over 600 times larger.
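The effect of β can be checked with a short calculation. The sketch below evaluates model (4) for a binary Y and assumes, purely for illustration, that half of the population has Y = 1; this prevalence is not stated in the text. The function names are hypothetical. For the β = 0 case, the intercept is obtained by solving logit(f) = α for f = 0.01 directly.

```python
import math


def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def response_summary(alpha: float, beta: float, p: float = 0.5):
    """Sampling fraction f and bias of the respondent mean under model (4).

    p is the assumed population share with Y = 1 (an illustrative assumption).
    """
    pr_y1 = logistic(alpha + beta)   # Pr(R = 1 | Y = 1)
    pr_y0 = logistic(alpha)          # Pr(R = 1 | Y = 0)
    f = p * pr_y1 + (1 - p) * pr_y0  # overall response rate
    respondent_mean = p * pr_y1 / f  # expected mean of Y among respondents
    return f, respondent_mean - p


# Response depends on Y: alpha = -3.9, beta = -4.84.
f_dep, bias_dep = response_summary(-3.9, -4.84)
# Response independent of Y: choose alpha so that f = 0.01 exactly.
f_ind, bias_ind = response_summary(math.log(0.01 / 0.99), 0.0)

print(f"beta = -4.84: f = {f_dep:.4f}, squared bias = {bias_dep ** 2:.3f}")
print(f"beta = 0:     f = {f_ind:.4f}, squared bias = {bias_ind ** 2:.3f}")
```

Under the assumed prevalence, both settings give a sampling fraction of about 0.01, but only the β = −4.84 setting produces a squared bias of about 0.24, which then dominates the MSE however many responses are collected.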

See Supplementary Information B.2 for the connection between ddc and a well-studied non-response model from econometrics, the Heckman selection model59.

Population size in multi-stage sampling

We have shown that the asymptotic behaviour of error depends on whether the data collection process is driven by individual response behaviour or by survey design. The reality is often a mix of both. Consequently, the relevant ‘population size’ N depends on when and where the representativeness of the sample is destroyed; that is, when the individual response behaviours come into play. Real-world surveys that are as complex as the three surveys we analyse here have multiple stages of sample selection.

Extended Data Table 3 takes as an example the sampling stages of the Census Household Pulse, which has the most extensive set of documentation among the three surveys we analyse. As we have summarized (Table 1, Extended Data Table 1), the Census Household Pulse (1) first defines the sampling frame as the reachable subset of the MAF, (2) takes a random sample of that population to prompt (send a survey questionnaire) and (3) waits for individuals to respond to that survey. Each of these stages reduces the desired data size, and the corresponding population size is the intended sample size from the prior stage (in notation, $${N}_{s}={n}_{s-1}$$, for s = 2, 3). For example, in stage 3, the population size N3 is the intended sample size n2 from the second stage (random sample of the outreach list), because only the sampled individuals have a chance to respond.

Although all stages contribute to the overall ddc, the stage that dominates is the first stage at which the representativeness of our sample is destroyed, provided that the relevant population size decreases markedly at each step. We label the population size at that stage the dominating population size (dps). However, we must bear in mind that dps refers to the worst-case scenario, when biases accumulate, instead of (accidentally) cancelling each other out.

For example, if the 20% of the MAFs excluded from the Census Household Pulse sampling frame (because they had no cell phone or email contact information) is not representative of the US adult population, then the dps is N1, or 255 million adults contained in 144 million households. Then the increase in bias for a given ddc is driven by the rate of $$\sqrt{{N}_{1}}$$, where $${N}_{1}=2.55\times {10}^{8}$$, which is large indeed (with $$\sqrt{2.55\times {10}^{8}}\approx \mathrm{16,000}$$). By contrast, if the sampling frame is representative of the target population and the outreach list is representative of the frame (and hence representative of the US adult population) but there is non-response bias, then the dps is $${N}_{3}={10}^{6}$$ and the impact of ddc is amplified by the square root of that number ($$\sqrt{{10}^{6}}=\mathrm{1,000}$$). Axios–Ipsos, in turn, reports a response rate of about $$50 \%$$ and obtains a sample of n = 1,000, so its dps could be as small as N3 = 2,000 (with $$\sqrt{\mathrm{2,000}}\approx 45$$).
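The amplification factors quoted in this example are simply square roots of the candidate dominating population sizes, as the following short Python sketch reproduces. The dictionary keys are illustrative labels, not terminology from the survey documentation.

```python
import math

# Candidate dominating population sizes (dps) taken from the example above.
dominating_population_sizes = {
    "census_stage_1": 2.55e8,  # sampling frame stage, US adults
    "census_stage_3": 1e6,     # non-response stage
    "axios_ipsos": 2_000,      # n = 1,000 at a response rate of about 50%
}

# For a given ddc, the error is amplified by roughly sqrt(dps).
amplification = {
    stage: math.sqrt(size) for stage, size in dominating_population_sizes.items()
}

for stage, factor in amplification.items():
    print(f"{stage}: sqrt(dps) = {factor:,.0f}")
```

The same ddc would therefore translate into an error several hundred times larger if representativeness is lost at the first Census Household Pulse stage than if it is lost at the Axios–Ipsos response stage.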

This decomposition is why our comparison of the surveys is consistent with the ‘Law of Large Populations’1 (estimation error increases with $$\sqrt{N}$$), even though all three surveys ultimately target the same US adult population. Given our existing knowledge about online–offline populations40 and our analysis of Axios–Ipsos’ small ‘offline’ population, Census Household Pulse may suffer from unrepresentativeness at Stage 1 of Extended Data Table 3, where N = 255 million, and Delphi–Facebook may suffer from unrepresentativeness at the initial stage of starting from the Facebook user base. By contrast, the main source of unrepresentativeness for Axios–Ipsos may be at a later stage at which the relevant population size is orders of magnitude smaller.

CDC estimates of vaccination rates

Our analysis of the nationwide vaccination rate covers the period between 9 January 2021 and 19 May 2021. We used CDC’s vaccination statistics published on their data tracker as of 26 May 2021. This dataset is a time series of counts of first-dose vaccinations for every day in our time period, reported for all ages and disaggregated by age group.

This CDC time series obtained on 26 May 2021 included retroactive updates to dates covering our entire study period, as does each daily update provided by the CDC. For example, the CDC benchmark we use for March 2021 is not only the vaccination counts originally reported in March but also includes the delayed reporting for March that the CDC became aware of by 26 May 2021. Analysing several snapshots before 26 May 2021, we find that these retroactive updates 40 days out could change the initial estimate by about 5% (Extended Data Fig. 3), hence informing our sensitivity analysis of ±5% and ±10% benchmark imprecision.

To match the sampling frame of the surveys we analyse, US adults 18 years and older, we must restrict the CDC vaccination counts to doses administered to adults. However, because states and jurisdictions report their vaccination statistics in different ways, the CDC did not possess age-coded counts for some jurisdictions, such as Texas, at the time of our study. The number of vaccinations with missing age data reached about 10% of the total US vaccinations at its peak at the time of our study. We therefore assume that the day-by-day fraction of adults among individuals for whom age is reported as missing is equal to the fraction of adults among individuals with age reported. Because minors became eligible for vaccinations only towards the end of our study period, the fraction of adults in data reporting age never falls below 97%.
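The imputation rule for records with missing age can be written as a one-line proportional allocation. The Python sketch below is an illustration only: the function name and the daily counts are hypothetical and do not come from the CDC data.

```python
def estimate_adult_doses(adult_known: float, minor_known: float, age_missing: float) -> float:
    """Daily adult first-dose count, allocating age-missing records proportionally.

    Assumes the adult share among records with missing age equals the adult
    share among records with age reported on the same day.
    """
    known_total = adult_known + minor_known
    if known_total <= 0:
        raise ValueError("need at least one record with age reported")
    adult_fraction = adult_known / known_total
    return adult_known + adult_fraction * age_missing


# Hypothetical day: 970,000 adult doses, 30,000 minor doses, 100,000 with age missing.
print(estimate_adult_doses(970_000, 30_000, 100_000))
```

Because the adult share among age-coded records stays at or above 97% throughout the study period, the result is only mildly sensitive to this assumption.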

The Census Household Pulse and Delphi–Facebook surveys are the first of their kind for each organization, whereas Ipsos has maintained their online panel for 12 years.

Question wording

All three surveys ask whether respondents have received a COVID-19 vaccine (Extended Data Table 1). Delphi–Facebook and Census Household Pulse ask similar questions (“Have you had/received a COVID-19 vaccination/vaccine?”). Axios–Ipsos asks “Do you personally know anyone who has already received the COVID-19 vaccine?”, and respondents are given response options including “Yes, I have received the vaccine.” The Axios–Ipsos question wording might pressure respondents to conform to their communities’ modal behaviour and thus misreport their true vaccination status, or may induce acquiescence bias from the multiple ‘yes’ options presented60. This pressure may exist both in high- and low-vaccination communities, so its net effect on Axios–Ipsos’ results is unclear. Nonetheless, Axios–Ipsos’ question wording does differ from that of the other two surveys, and may contribute to the observed differences in estimates of vaccine uptake across surveys.

Population of interest

All three surveys target the US adult population, but with different sampling and weighting schemes. Census Household Pulse sets the denominator of their percentages as the household civilian, non-institutionalized population in the United States of 18 years of age or older, excluding Puerto Rico or the island areas. Axios–Ipsos designs samples to be representative of the US general adult population of 18 or older. For Delphi–Facebook, the US target population reported in weekly contingency tables is the US adult population, excluding Puerto Rico and other US territories. For the CDC Benchmark, we define the denominator as the US 18+ population, excluding Puerto Rico and other US territories. To estimate the size of the total US population, we use the US Census Bureau Annual Estimates of the Resident Population for the United States and Puerto Rico, 2019 (ref. 55). This is also what the CDC uses as the denominator in calculating rates and percentages of the US population60.

Axios–Ipsos and Delphi–Facebook generate target distributions of the US adult population using the Current Population Survey (CPS), March Supplement, from 2019 and 2018, respectively. Census Household Pulse uses a combination of 2018 1-year American Community Survey (ACS) estimates and the Census Bureau’s Population Estimates Program (PEP) from July 2020. Both the CPS and ACS are well-established large surveys by the Census and the choice between them is largely inconsequential.

Axios–Ipsos data

The Axios–Ipsos Coronavirus tracker is an ongoing, bi-weekly tracker intended to measure attitudes towards COVID-19 of adults in the US. The tracker has been running since 13 March 2020 and has released results from 45 waves as of 28 May 2021. Each wave generally runs over a period of 4 days. The Axios–Ipsos data used in this analysis were scraped from the topline PDF reports released on the Ipsos website5. The PDF reports also contain Ipsos’ design effects, which we have confirmed are calculated as 1 plus the variance of the (scaled) weights.

Census Household Pulse data

The Census Household Pulse is an experimental product of the US Census Bureau in collaboration with eleven other federal statistical agencies. We use the point estimates presented in Data Tables, as well as the standard errors calculated by the Census Bureau using replicate weights. The design effects are not reported; however, we can calculate them as $$1+{{\rm{CV}}}_{{\rm{w}}}^{2}$$, where CVw is the coefficient of variation of the individual-level weights included in the microdata23.
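A minimal Python sketch of this design-effect calculation follows. It assumes that the coefficient of variation is computed with the population variance of the weights (dividing by n), which makes the result identical to Kish's expression n Σw² / (Σw)²; whether the surveys use this or the sample variance is not stated here. The function name and the example weights are hypothetical.

```python
import statistics


def weighting_design_effect(weights: list[float]) -> float:
    """Design effect from unequal weighting: 1 + CV_w^2."""
    mean_w = statistics.fmean(weights)
    var_w = statistics.pvariance(weights)  # population variance (assumption)
    return 1 + var_w / mean_w ** 2


# Hypothetical individual-level weights.
weights = [0.5, 1.0, 1.0, 1.5, 3.0]
deff = weighting_design_effect(weights)
print(f"design effect = {deff:.3f}")
print(f"n / design effect = {len(weights) / deff:.2f}")
```

Dividing the raw sample size by this design effect gives a Kish-style weighted sample size, which is one common way to obtain a quantity playing the role of nw.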