Introduction

The SARS-CoV-2 virus, and the COVID-19 pandemic it caused, hardly needs introducing more than 2 years after the World Health Organization (WHO) first announced evidence of human-to-human transmission in January 2020 [1]. As of this writing, the WHO states there have been 626 million confirmed cases and more than 6 million deaths attributed to COVID-19 worldwide [2]. Post-acute sequelae of SARS-CoV-2 infection (PASC) have been widely reported and can include any complication resulting from SARS-CoV-2 infection weeks or months after infection [3,4,5]. Long COVID is a single diagnosis that encapsulates a broad array of symptoms attributed to PASC. The WHO used a Delphi method to create a clinical definition of long COVID that includes both clinically observed and patient-reported features [6]. Long COVID is a multi-system disease, characterized by diverse features such as dyspnea, chest pain, fatigue, cognitive impairment, deep vein thrombosis, and gastrointestinal dysfunction [7,8]. Numerous efforts to define long COVID using electronic health record (EHR) data exist, with the goals of supporting public health surveillance and research [9]. However, a gold standard definition of long COVID has been elusive. We previously presented the first machine learning EHR-based long COVID definition (a Computable Phenotype, or CP model), which leveraged the long COVID ICD-10 code U09.9 as well as visits to long COVID specialty clinics to train a classifier to identify putative long COVID patients [10]; later work by others has followed [11].

The National Institutes of Health (NIH) created the RECOVER initiative to address the uncertainty surrounding long COVID by coordinating research across hundreds of researchers and more than 30 institutions [12]. The National COVID Cohort Collaborative (N3C) [13], sponsored by NIH’s National Center for Advancing Translational Sciences, provides access to harmonized electronic health records through the N3C Data Enclave. More than 75 sites have contributed longitudinal data for over 15.5 million patients with a confirmed SARS-CoV-2 infection, COVID-19 symptoms, or their matched controls.

Vaccines have been shown to be safe and effective at dramatically reducing the risk of severe COVID-19 [14,15]. However, their impact on long COVID is less well understood, with most studies indicating a significant protective effect [16,17,18,19] while others reported mixed or no effects [20], or even an anti-protective effect [21]. While some have studied the impact of administering vaccines after the onset of PASC [22,23,24], we attempt to address ambiguity around the association between pre-COVID-19 vaccination and eventual long COVID diagnosis. To our knowledge, we are the first to study vaccination with long COVID defined as a clinical diagnosis or computable phenotype [10]; previous studies have relied on surveys or the occurrence of one or two symptoms consistent with long or acute COVID. Ours is also the largest study to leverage time-to-event modeling or control for differences in the vaccinated and unvaccinated populations.

Results

Patients with a COVID-19 infection between August 1, 2021 and January 31, 2022 were split into two cohorts. In a clinic-based cohort of 47,404 individuals, 695 (1.5%) received a clinical diagnosis of long COVID and 26,354 (55.6%) were fully vaccinated (Supplementary Table 1 shows the number of individuals with each of the long COVID diagnoses; a single person can receive multiple diagnoses). In a model-based cohort of 198,514 individuals, 3391 (1.7%) had a computational phenotype (CP) [10] score above a threshold of 0.9 and were labeled as having long COVID; 86,248 (43.4%) were fully vaccinated. All available EHRs beginning 45 days after COVID-19 infection were used to establish evidence of long COVID. The end of patients’ follow-up periods varied depending on the cadence of their healthcare facility’s data contributions and ranged from June 10, 2022 to August 1, 2022. The minimum observed follow-up period between COVID-19 infection and the end of an individual’s data availability was 164 days. Distributions of the length of the follow-up period for both cohorts are shown in Supplementary Table 2. Full summaries of patient characteristics for both cohorts are shown in Tables 1–4. Unadjusted cross-tabulations of vaccination status and long COVID diagnosis are shown in Table 5.

Table 1 Model-based cohort patient demographics
Table 2 Model-based cohort medical characteristics
Table 3 Clinic-based cohort patient demographics
Table 4 Clinic-based cohort medical characteristics
Table 5 Long COVID by vaccination status: unadjusted counts

Statistical analysis

Inverse probability of treatment weighting (IPTW) was applied to logistic regression and Cox proportional hazards models for both cohorts to account for known confounders between vaccination propensity and risk of long COVID. The result is four adjusted estimates, which showed consistent protective associations between vaccination and long COVID diagnosis and are reported in Table 6. The full tables of model coefficients are provided as Supplementary Tables 3–6. Unadjusted estimates, which do not exhibit the same association, are also reported in Table 6. The IPTW-adjusted Kaplan–Meier curves for the model-based and clinic-based outcomes are shown in Fig. 1.

Table 6 Long COVID by vaccination status: measures of association
Fig. 1: IPTW-adjusted Kaplan–Meier curves.

Our definition of long COVID (LC) can only be observed at least 45 days after index; time from the COVID index therefore starts at 45. Long COVID events can only be observed for the model-based outcome in 30-day increments, resulting in the observed stair-step structure. A reduced vertical axis scale is used to highlight the differentiation between the vaccinated and unvaccinated curves.

Key results of the sensitivity analyses are summarized in Fig. 2. Adjusted and unadjusted estimates were evaluated across multiple CP score thresholds, and by including or excluding covariates in addition to vaccination status. The association between vaccination and long COVID was robust to excluding either IPTW-adjustment or non-vaccination covariates, but not both. While not relevant in the clinic-based outcome, the association in the model-based outcome was not robust to the varying CP thresholds, with lower thresholds resulting in a progressively weaker protective association. In the proportional hazards models, an additional analysis determined that the estimates were not sensitive to whether or not post-COVID-19 vaccination events are treated as censoring events (uncensored points are not pictured in Fig. 2 as they closely overlap the censored points). The remaining sensitivity analysis results are shown in Supplementary Table 7. In the clinic-based cohort, an analysis showed the results to be robust to using only the most specific ICD-10 code (U09.9) to label long COVID. Including only sites with the most complete vaccine reporting (with recorded vaccine ratios of at least 89%) resulted in associations similar to or stronger than the four primary associations. Censoring patients in the clinic-based analysis after their last recorded healthcare visit and eliminating the requirement for a post-COVID-19 visit resulted in an association slightly stronger than, but not significantly different from, that in our primary results.

Fig. 2: Sensitivity analysis of vaccination associations.

Odds ratios (OR) are shown for logistic regression (LR); hazard ratios (HR) are shown for proportional hazards (PH). Point estimates are from models built using the full cohorts and are shown with 95% confidence intervals derived from 200 bootstrap samples. The vertical line at 1.0 represents no association. The clinic diagnosis points (n = 47,404 individuals) use the clinic-based outcome; the long COVID (LC) model points (n = 198,514 individuals) represent different thresholds of the computational phenotype model used to label LC. Higher thresholds represent higher confidence in an LC phenotype. With or without covariates refers to the presence or absence of non-vaccination predictors in the outcome models. Adjusted or unadjusted refers to the presence or absence of IPTW weighting.

The subanalysis did not offer robust evidence that the association between vaccination and long COVID diagnosis is dependent on the time between vaccination and acute COVID-19 onset. The full tables of subanalysis coefficients, including for indicators of vaccination timing, are shown in Supplementary Tables 8–11.

After IPTW-adjustment, all covariates were well-balanced (Supplementary Figs. 1, 2 illustrate the standardized differences in covariates in both cohorts). Logistic regression diagnostics did not indicate any overly influential observations. Observations with large residuals tended to have low leverage and vice versa. In the model-based analysis, the greatest Cook’s distance was <0.01 and the greatest absolute DFBETA for vaccination status was 0.07. In the clinic-based analysis, the greatest Cook’s distance was 0.01 and the greatest absolute DFBETA for vaccination status was 0.09. In the model-based analysis, five patients had stabilized inverse probability of treatment weights above 20 (max of 32); excluding these patients did not impact vaccination coefficients at the precision reported here. The maximum weight in the clinic-based analysis was ten.

Discussion

Our four analyses yielded consistent results. We see protective associations of vaccination with long COVID diagnosis in both logistic and time-to-event models, and in both clinic-based and model-based outcomes. While these findings are similar to those of other large observational studies [16,17,18,19], previous sources have only looked for evidence of COVID-associated symptoms as evidence of long COVID. A major finding of our analysis is that the protective association remains consistent in results requiring a clinical diagnosis, and among those who contracted COVID-19 in a later period that includes Omicron infections.

The use of a clinical diagnosis resulted in a significantly lower long COVID prevalence in our study (less than 2% in both cohorts) than studies based on long COVID symptoms, which have reported prevalences between 8 and 38%, depending on which and how many symptoms were required [17,18,19]. However, both of our cohorts are large, and the use of a CP allowed us to expand our sample from six to eleven sites and 47,404 to 198,514 COVID-positive patients, providing a sufficient sample of strictly defined long COVID diagnoses. Due to the underdiagnosis of long COVID in a clinical setting, our conclusions are limited to associations with diagnosis and not with long COVID onset more generally.

Interestingly, the protective association of vaccination with long COVID diagnosis is weaker or reversed in the unadjusted coefficients and cross-tabulations (Table 6 and Fig. 2). Several features that are associated with a higher likelihood of long COVID (coefficients in Supplementary Tables 3–6) are also associated with a higher likelihood of vaccination (coefficients in Supplementary Tables 12, 13). The most significant is age: Supplementary Table 14 shows how older adults are both more likely to be vaccinated and more likely to develop long COVID in comparison to younger adults. Failing to account for the substantial differences between individuals who were and were not vaccinated prior to COVID-19 could lead one to inaccurately conclude that vaccination is harmful.

The sensitivity analysis presents other instructive complexities. Reducing the CP score threshold lowers the amount of evidence required to denote someone as having long COVID; it also moderates the protective association of vaccination with long COVID (key results in Fig. 2, full range of thresholds in Supplementary Fig. 3). We expect that including healthy adults in the long COVID population would dilute the observed association, but individuals with a CP score between 0.6 and 0.9 are not entirely healthy; they have some evidence of long COVID. In fact, our sample’s long COVID incidence rate at lower thresholds is closer to long COVID incidence rates reported elsewhere (although the true incidence rate of long COVID is unknown). This suggests a hypothesis that vaccination may be more effective at preventing clinically diagnosed long COVID than undiagnosed long COVID. More research is needed to determine the differences between high confidence and clinically diagnosed long COVID cases compared to low confidence and undiagnosed cases. If the former are more severe, then our results could suggest that vaccination is associated with reduced severity of long COVID symptoms.

Healthcare utilization is one of the most important features of the CP model [10]. If fully vaccinated patients are more likely to utilize the healthcare system, the CP model’s marginal predictions may be assigning more fully vaccinated individuals to long COVID because they are more likely to interact with the healthcare system, depressing the observed benefit of vaccination. A known challenge of analyzing EHR data is that they tend to provide more information on individuals who regularly utilize healthcare systems [25], though we attempt to control for this by requiring multiple recorded encounters outside of COVID-19 for inclusion in the study.

Our use of long COVID diagnosis and a computable phenotype as outcomes differentiates this study from others [17,18,20], which measure the association between vaccination and a curated list of long COVID symptoms. Each approach has its strengths. Our clinical outcomes reduce measurement error due to false positives (e.g., long COVID symptoms caused by something other than long COVID). However, other studies show that long COVID symptoms differ in their relationship with vaccination. Our outcomes obscure such variation. We conclude that it is beneficial to study this relationship from both perspectives.

Vaccination reduces the risk of developing COVID-19 for a period of time after vaccination [14,15], offering one mechanism for preventing long COVID. However, there is evidence that widely circulated vaccines are less effective against now-dominant Omicron than earlier SARS-CoV-2 variants [26,27,28], increasing interest in whether or not vaccination reduces the risk of long COVID in breakthrough infections. That is the aim of this study, in which all eligible patients had a COVID-19 diagnosis. As a result, we exclude any effect of vaccination’s primary prevention of COVID-19 itself, an effect that is present in the general population.

Several studies conclude that the protective effect of vaccination on acute COVID-19 infection severity wanes over time [27,29], but we are unaware of any studies making the same claim for long COVID. As can be seen in Supplementary Tables 8–11, the subanalysis incorporating time between vaccination and acute COVID-19 does not offer any evidence that the association between vaccination and long COVID diagnosis changes over time. The reference level for Weeks Since Last Vaccination is those who received their last vaccine dose at least 25 weeks prior to their COVID-19 infection. Negative coefficients for the modeled indicators suggest stronger protective associations. Three models present statistical significance (alpha = 0.05) for at least one indicator, indicating a significant difference between that level and those vaccinated 25+ weeks prior to COVID-19, but results are not consistent between models. Contrary to intuition and previously reported results with acute COVID-19, those vaccinated at least 25 weeks prior to COVID-19 are among the least likely to be diagnosed with long COVID across the four models. We do not present this as evidence that the benefits of vaccination with respect to long COVID do not wane. Caution should be used when interpreting conditional coefficients, and investigating the time between vaccination and COVID-19 was not a primary focus in this study [30].

IPTW is often used to estimate causal effects from observational data and is employed here to provide more robust associations. However, we do not interpret these results as causal effects. This is for two reasons: (1) we are unwilling to assume that there are no unmeasured confounders in our treatment model and (2) our causal model includes several latent variables, which obstruct the estimation of treatment effects through covariate adjustment. We explore each reason in the Supplementary Discussion and provide a directed acyclic graph of confounders in Supplementary Fig. 4.

Our study is limited by its reliance on EHRs and other factors. Those who choose to not seek healthcare, or are unable to do so, are not represented in EHRs. This could be particularly problematic among long COVID patients, who may lack the energy or resources required to receive a clinical diagnosis, or whose providers may not be familiar enough with long COVID symptoms to make a diagnosis. If vaccinated long COVID patients are less likely to be clinically diagnosed than unvaccinated long COVID patients, then our estimate of the association between vaccination and long COVID diagnosis will overstate the association between vaccination and long COVID onset. Furthermore, we had previously identified a heterogeneous set of features that were differentially present in clinical observations versus patient-reported symptoms [9]. This agreed with the WHO suggestion that a definition of long COVID must necessarily include both clinician-reported and patient-reported features, the latter of which are not commonly available in the EHR.

We are forced to assume that those without a recorded condition or symptom do not exhibit it, including our exposure (vaccination), our outcome (long COVID diagnosis), and potentially unrecorded reinfections of COVID-19. We take two steps to mitigate the risk of missing records relevant to the outcome: (1) we require that all participants in the study had established care at the partner facility prior to COVID-19 infection, as evidenced by two healthcare visits in the year prior, and (2) we require that all participants in the study were seen at the partner facility at least 120 days after COVID-19 infection. Our utilization-related inclusion criteria result in a cohort that is disproportionately female, which may be due in part to females being higher healthcare utilizers than males on average [31,32,33]. We account for confounding due to sex by including sex as a covariate in treatment weighting and in all primary models. The utilization criteria result in a significantly smaller cohort and bias the sample towards high utilizers and those with hospitalizations. However, the cohort remains sufficiently large for analysis and has a lower risk that long COVID will go undiagnosed, as patients were active users of the partner facility both before and four months after COVID-19 infection. In the clinic-based cohort, there is an additional requirement that the facilities have a track record of diagnosing long COVID (though the variation between doctors remains).

A sensitivity analysis that censors individuals after their last healthcare visit (rather than at the end of the study period) yields an association similar to our four primary results (Supplementary Table 7). Censoring is not possible in logistic regression models, but allows the proportional hazards model to relax the assumption that individuals that established care at a facility continue to use that facility after their last recorded visit. For this analysis, we did not require a recorded visit after the acute COVID-19 infection, but individuals remained ineligible for long COVID designation until 45 days after infection.

Our cohort is further refined by our requirement that the partner facility have reasonably high recorded vaccine ratios, as defined in our Methods. Most facilities fail to achieve recorded vaccine ratios greater than 66%, as they are not the primary provider of vaccinations in their community, do not link to their state’s vaccine registry, or do not consistently record the vaccinations they provide in the EHR. We do not use a facility’s recorded vaccine ratio as an individual characteristic in our models, but rather as a facility-wide inclusion criterion. By limiting to partner facilities with a high recorded vaccine ratio, as with our utilization criteria, we refine our cohort to be smaller but more data-rich.

We strictly define our study cohort to minimize the underreporting of vaccination and long COVID, though we acknowledge that it is not entirely resolved. Our sensitivity analysis using only sites with the highest recorded vaccine ratios (≥89%) offers some evidence that incomplete vaccine records result in a conservative estimate. The cohorts are small (the model-based cohort has 10,122 patients; the clinic-based cohort has 5545), resulting in wide confidence intervals that include the primary estimates for every model (Supplementary Table 7). However, the mean estimated associations are stronger than our primary results in three of the four models and remain significant in all four models with 95% confidence. We conclude that our primary estimates are likely conservative, but our primary result—that pre-COVID-19 vaccination is associated with a reduced risk of long COVID diagnosis—is not threatened.

The confidence intervals around the CP model-based risk estimates are likely too narrow, as there remains residual misclassification of long COVID outcomes in that cohort not factored into the confidence interval boundaries. We did not distinguish between vaccine types, though previous studies and initial tabulations failed to detect significant differences in their associations with long COVID [17,18,19]. The ICD-10 code for long COVID, U09.9, was not implemented until October 2021, and it has not been fully adopted. The previously recommended ICD-10 code, B94.8, is more general and is used to diagnose long-term complications from any viral infection. We accepted B94.8 as a long COVID diagnosis because the use of the code in our data by mid-2021 was 40 times higher than its baseline use in 2018 and 2019. A sensitivity analysis using only U09.9 returned nearly identical results.

In conclusion, vaccination was consistently associated with lower odds of both a long COVID clinical diagnosis and a high-confidence computationally derived diagnosis, regardless of viral epoch and taking into account age, sex, and demographics. This multi-method strategy provides additional evidence on the controversial, understudied, and challenging question of whether vaccination reduces the risk of long COVID.

Methods

Base population

This study is part of the NIH Researching COVID to Enhance Recovery (RECOVER) Initiative, which seeks to understand, treat, and prevent PASC. For more information on RECOVER, visit https://recovercovid.org. All analyses described here were performed within the secure N3C Data Enclave. N3C’s methods for patient identification, data acquisition, ingestion, data quality assessment, and harmonization have been described previously in refs. 13,34. The study population was drawn from 5,434,528 COVID-19-positive patients available in N3C. A COVID-19 index date (index) was defined as the earliest recorded indication of COVID-19 infection. Individuals who met the following inclusion criteria were eligible: (1) having an International Classification of Diseases-10-Clinical Modification (ICD-10) COVID-19 diagnosis code (U07.1) or a positive SARS-CoV-2 PCR or antigen test between August 1, 2021 and January 31, 2022; (2) having a recorded health care visit between 120 and 300 days after index; (3) having at least two recorded health care visits in the year prior to index; (4) being ≥18 years old at index; and (5) having either completed or not started a COVID-19 vaccine regimen at index. One exclusion criterion for a clinical cohort is detailed in the outcome definitions. The end of individuals’ follow-up periods varied according to when their healthcare providers last submitted new data, ranging from June 10, 2022 to August 1, 2022.

A known limitation of EHR data is that only those healthcare encounters and services provided by the specific health system are available in the data [35]. The proportion of patients with a recorded vaccination at a given healthcare site is driven by two factors: (1) the true rate of vaccination among the population served and (2) how consistently vaccines are captured by the site. Some sites report no vaccinations, while others sync vaccination records with their state’s vaccine registry. There is no explicit indicator of non-vaccination in the N3C Data Enclave, but sites with better-recorded vaccine ratios offer more confidence that patients with no recorded vaccine exposure are unvaccinated. We calculated the recorded vaccine ratio at each site as the ratio of two statistics: the observed proportion of patients with a vaccination record and an expected vaccination rate derived from CDC reporting [36] for the population served. Sites with an observed proportion of at least two-thirds of their expected vaccination rate were eligible for analysis, leaving 198,514 patients at eleven sites that met our inclusion criteria. A full breakdown of how many patients met our inclusion criteria is shown in Fig. 3.
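The site-eligibility rule described above can be illustrated with a minimal sketch. The function names, site labels, and all counts and rates below are hypothetical and chosen only for illustration; the derivation of the expected rate from CDC reporting is not reproduced here and is simply taken as an input.

```python
# Illustrative sketch only: names and numbers are hypothetical, not study data.

ELIGIBILITY_THRESHOLD = 2 / 3  # "at least two-thirds of their expected vaccination rate"


def recorded_vaccine_ratio(n_with_vaccine_record, n_patients, expected_rate):
    """Observed proportion of patients with a vaccination record,
    divided by the expected vaccination rate for the population served."""
    if n_patients <= 0 or not 0 < expected_rate <= 1:
        raise ValueError("need n_patients > 0 and 0 < expected_rate <= 1")
    observed_proportion = n_with_vaccine_record / n_patients
    return observed_proportion / expected_rate


def eligible_sites(sites):
    """Return the names of sites whose recorded vaccine ratio meets the threshold."""
    return [
        name
        for name, (n_vax, n_total, expected) in sites.items()
        if recorded_vaccine_ratio(n_vax, n_total, expected) >= ELIGIBILITY_THRESHOLD
    ]


# Hypothetical sites: (patients with a vaccine record, all patients, expected rate)
sites = {
    "site_a": (450, 1000, 0.60),  # observed 0.45 / expected 0.60 = 0.75
    "site_b": (200, 1000, 0.65),  # observed 0.20 / expected 0.65, about 0.31
    "site_c": (0, 800, 0.55),     # site reports no vaccinations at all
}

print(eligible_sites(sites))
```

Because the ratio is applied at the site level, a patient's own vaccination record never enters this calculation; it only determines which sites contribute patients.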

Fig. 3: Cohort definition flowchart.

Cumulative number of patients meeting the study’s inclusion criteria.

As much as possible, we account for confounding due to sex through the inverse probability of treatment weighting and by including sex as a covariate in all primary models. Demographics were defined through standard concepts available in the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) [37,38]. Gender is not available in the CDM [39], and was therefore not considered in the study.

Exposure definition

Those who completed their vaccine regimen (two doses of BNT162b2 or mRNA-1273 or a single dose of Ad26.COV2.S) two weeks prior to the index were considered vaccinated, while those with no recorded vaccines at the index were considered unvaccinated. Partially vaccinated patients at index failed to meet the fifth inclusion criterion.
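As a minimal sketch of the exposure rule above, the following classifies a patient from a list of recorded doses. Several details are assumptions made for illustration rather than statements about the study's implementation: doses are counted per product (mixed-product regimens are not handled), a dose recorded on the index date counts as recorded at index, and "two weeks" is taken as 14 days. The function and label names are hypothetical.

```python
from datetime import date, timedelta

# Doses needed to complete each regimen, as stated in the exposure definition.
DOSES_REQUIRED = {"BNT162b2": 2, "mRNA-1273": 2, "Ad26.COV2.S": 1}


def exposure_status(index_date, doses):
    """Classify exposure at index.

    doses: list of (dose_date, product) tuples.
    Returns "vaccinated", "unvaccinated", or "excluded" (partially vaccinated
    at index, which fails the fifth inclusion criterion).
    """
    # Assumption: a dose dated on the index date counts as recorded at index.
    recorded_at_index = sorted(d for d in doses if d[0] <= index_date)
    if not recorded_at_index:
        return "unvaccinated"

    cutoff = index_date - timedelta(days=14)  # assumption: two weeks = 14 days
    counts = {}
    for dose_date, product in recorded_at_index:
        counts[product] = counts.get(product, 0) + 1
        # The dose that completes the regimen must fall on or before the cutoff.
        if counts[product] == DOSES_REQUIRED[product] and dose_date <= cutoff:
            return "vaccinated"
    return "excluded"


index = date(2021, 9, 1)
print(exposure_status(index, [(date(2021, 3, 1), "BNT162b2"),
                              (date(2021, 3, 22), "BNT162b2")]))
print(exposure_status(index, [(date(2021, 8, 1), "mRNA-1273")]))
print(exposure_status(index, []))
```

Doses recorded after the index do not change the exposure group in this sketch; in the study they are handled separately as censoring events.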

Outcome definitions

Definitions of long COVID vary. The CDC defines post-COVID symptoms as those beginning at least 4 weeks after infection [40], while the WHO defines long COVID as beginning “usually 3 months” from COVID-19 onset with symptoms lasting more than 2 months not explained by another condition [6]. We use two definitions, one each for the clinic and model-based cohorts, that balance these organizational definitions with the strength of evidence available in each cohort.

Clinical definition

We considered three clinical indicators of long COVID: (1) an ICD-10 code for post-COVID-19 condition (U09.9), (2) an ICD-10 code for sequelae of other specific infectious and parasitic diseases (B94.8), or (3) a visit to a long COVID clinic. Prior to the introduction of U09.9 in October 2021, the CDC endorsed B94.8 to indicate long-term complications of SARS-CoV-2 infection. As with vaccination, not all sites report clinical indicators of long COVID. Six out of eleven sites, comprising 47,404 of 198,514 eligible patients, submitted clinical indicators of long COVID for at least 250 patients. We used patients from these six sites to form a clinic-based cohort of patients, whom we deemed eligible for receiving a clinical long COVID indicator. The clinic-based cohort has one additional exclusion criterion: those with a long COVID clinical indicator within 45 days of the index were omitted, because diagnoses within this time period are less likely to align with the generally accepted long COVID definitions.

Any long COVID clinical indicator was sufficient to label a patient as having had long COVID in the logistic regression. If patients had multiple encounters with a clinical indicator of long COVID, the earliest was used as the event date for purposes of the time-to-event analysis. Death and COVID-19 vaccination after COVID-19 onset were censoring events.
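The clinic-based outcome rules can be sketched as a small function that turns dates into a time-to-event record. This is an illustration under stated assumptions, not the study's code: the function and argument names are hypothetical, time is measured from index + 45 days (the time origin used in the proportional hazards model), "within 45 days" is taken to mean strictly before day 45, and a censoring date that falls before day 45 is mapped to a duration of zero.

```python
from datetime import date, timedelta

LANDMARK_DAYS = 45  # long COVID indicators are only considered from this day on


def clinic_outcome(index_date, indicator_dates, end_of_followup,
                   death_date=None, post_index_vaccination_date=None):
    """Return (days_from_landmark, event_observed), or None if excluded.

    indicator_dates: dates of U09.9, B94.8, or long COVID clinic visits.
    """
    start = index_date + timedelta(days=LANDMARK_DAYS)

    # Exclusion: any clinical indicator between index and day 45.
    if any(index_date <= d < start for d in indicator_dates):
        return None

    # The earliest qualifying indicator is the event date.
    later = [d for d in indicator_dates if d >= start]
    event_date = min(later) if later else None

    # Death and post-index vaccination are censoring events, as is the
    # end of available follow-up.
    censor_date = min(
        d for d in (end_of_followup, death_date, post_index_vaccination_date)
        if d is not None
    )

    if event_date is not None and event_date <= censor_date:
        return ((event_date - start).days, True)
    return (max((censor_date - start).days, 0), False)


index = date(2021, 9, 1)
end = date(2022, 6, 10)
print(clinic_outcome(index, [date(2022, 1, 14)], end))
print(clinic_outcome(index, [], end, post_index_vaccination_date=date(2021, 12, 15)))
```

For the logistic regression, only the second element of the returned pair would matter if censoring were ignored; the duration is used by the time-to-event analysis.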

Model-based definition

Long COVID was classified in the model-based cohort as a computational phenotype (CP) using the long COVID cohort identification machine learning model described in ref. 10. A CP is a model trained on EHR data, which can be used to infer the likelihood that a patient has a phenotype (in this case, long COVID) based on their clinical history [41]. For the purposes of this study, the CP model was retrained with U09.9 diagnoses as the target event and without vaccination status as an input. The model calculates a long COVID likelihood score (range 0 to 1) for each patient beginning 100 days after the index using only conditions and drugs observed as of that day. New scores are generated in 30-day intervals until 300 days after the index or June 1, 2022, whichever comes first. Patients scoring above 0.9 in any interval were labeled as having long COVID. A threshold of 0.9 was chosen as it resulted in a similar prevalence of long COVID across the model-based and clinic-based outcomes. The earliest interval receiving a score above 0.9 was assigned as the event date for purposes of the time-to-event analysis. As in the clinic-based definition, death and COVID-19 vaccination were censoring events.
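The scoring schedule and labeling rule can be sketched as follows. The CP model itself is not reproduced; scores are taken as given inputs, and the function names and example scores are hypothetical. The sketch assumes scoring days fall at 100, 130, 160, and so on, stopping at the last such day that is no later than both day 300 and the June 1, 2022 cutoff.

```python
from datetime import date, timedelta


def scoring_days(index_date, first_day=100, step=30, last_day=300,
                 cutoff=date(2022, 6, 1)):
    """Days after index at which a CP score would be generated."""
    days = []
    day = first_day
    while day <= last_day and index_date + timedelta(days=day) <= cutoff:
        days.append(day)
        day += step
    return days


def first_event_day(scores_by_day, threshold=0.9):
    """Earliest scoring day with a score strictly above the threshold, else None."""
    above = [day for day, score in sorted(scores_by_day.items()) if score > threshold]
    return above[0] if above else None


# A patient infected late in the study window has few scoring days before the cutoff.
print(scoring_days(date(2021, 8, 1)))
print(scoring_days(date(2022, 1, 31)))

# Hypothetical scores for one patient.
print(first_event_day({100: 0.42, 130: 0.93, 160: 0.97}))
```

Because events can only be assigned on scoring days, event times in the model-based cohort fall on a 30-day grid, which is consistent with the stair-step shape noted in the Fig. 1 caption.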

Any patient meeting our inclusion criteria from any of the eleven sites was eligible for a model-derived indicator of long COVID and was included in the model-based cohort. Therefore, all patients in the clinic-based cohort are also included in the model-based cohort, where they can (and sometimes do) have a different assigned long COVID outcome. This is not unexpected—the CP model was trained using U09.9 as the target, while we include U09.9, B94.8, and long COVID clinic visits as valid clinical diagnoses. Both labels are rare and imperfect; we do not expect one indication to guarantee the other.

Institutional review board oversight

The N3C data transfer to the United States National Center for Advancing Translational Sciences (NCATS) was approved under a Johns Hopkins University Reliance Protocol #IRB00249128 or individual site agreements with NIH. The use of human data for this study was approved by the Johns Hopkins Medicine Institutional Review Board (IRB) #IRB00279988 through a data use agreement entitled “Characterization of long-COVID: definition, stratification, and multi-modal analysis”. The N3C Data Enclave is managed under the authority of the NIH; information can be found at https://ncats.nih.gov/n3c/resources.

Statistical analysis

Two analyses were carried out to estimate the association between vaccination and long COVID: (1) logistic regression to calculate an overall association while controlling for patient characteristics and (2) Cox proportional hazards to incorporate differences in the time-to-event for long COVID. We consider both analyses as primary, as each has its own strengths and weaknesses. Proportional hazards uses censoring to account for varying follow-up horizons but requires a date for a long COVID diagnosis and hazard functions that are proportional over time. Logistic regression considers varying follow-up horizons through indicators of acute COVID-19 onset timing, but does not explicitly model times-to-event as done in proportional hazards. We present the results of both analyses as a test of the robustness of the association.

Given our requirement that long COVID diagnoses occur at least 45 days after index, our proportional hazards model uses index + 45 days as the beginning of the modeled time period.

Inverse probability of treatment weighting (IPTW) was applied to both logistic regression and proportional hazards to control for differences in patient characteristics across the vaccinated and unvaccinated groups. IPTW is a method to adjust for confounding (covariates which affect both the treatment and the outcome) in observational studies. IPTW creates a pseudo-cohort in which the likelihood of treatment is independent of the measured covariates [42]. Logistic regression was used to estimate a treatment propensity score based on sex, demographics, medical history, social determinants of health, and spatial and temporal variables. Our selection of covariates was informed by the literature on important indicators of long COVID [10,43,44]. Covariate balance before and after weighting was evaluated with standardized mean differences. Covariates with a standardized mean difference of less than 0.1 were considered well-balanced. Stabilized treatment weights were calculated as outlined in ref. 45. Standard errors in the IPTW-adjusted models were calculated from 200 bootstrapped iterations based on the standard deviation of the estimates [46]. Unadjusted associations were also calculated and reported.
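To make the weighting and balance check concrete, the sketch below computes stabilized weights and a weighted standardized mean difference on a toy example. It uses the commonly cited form of stabilized weights (the marginal probability of the received treatment divided by its propensity-based probability); whether this matches ref. 45 term for term is an assumption, as is the pooled-variance form of the standardized mean difference. The propensity scores are supplied directly rather than fitted, and all data are invented for illustration.

```python
from math import sqrt


def stabilized_weights(treated, propensity):
    """Stabilized IPT weights: P(T=1)/e(x) if treated, P(T=0)/(1-e(x)) otherwise."""
    p_treated = sum(treated) / len(treated)
    return [
        p_treated / e if t else (1 - p_treated) / (1 - e)
        for t, e in zip(treated, propensity)
    ]


def _weighted_mean_var(values, weights):
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, var


def standardized_mean_difference(x, treated, weights=None):
    """(Weighted) difference in group means divided by the pooled standard deviation."""
    if weights is None:
        weights = [1.0] * len(x)
    x1 = [v for v, t in zip(x, treated) if t]
    w1 = [w for w, t in zip(weights, treated) if t]
    x0 = [v for v, t in zip(x, treated) if not t]
    w0 = [w for w, t in zip(weights, treated) if not t]
    m1, v1 = _weighted_mean_var(x1, w1)
    m0, v0 = _weighted_mean_var(x0, w0)
    pooled_sd = sqrt((v1 + v0) / 2)
    return 0.0 if pooled_sd == 0 else (m1 - m0) / pooled_sd


# Toy data: a binary covariate that is more common among the treated.
x = [1] * 6 + [0] * 2 + [1] * 2 + [0] * 6
treated = [1] * 8 + [0] * 8
propensity = [0.75 if xi == 1 else 0.25 for xi in x]  # assumed known, not fitted

w = stabilized_weights(treated, propensity)
print(standardized_mean_difference(x, treated))      # imbalance before weighting
print(standardized_mean_difference(x, treated, w))   # near zero after weighting
```

In this toy example the weighted difference falls below the 0.1 criterion used in the text, while the unweighted difference does not. Stabilization keeps the mean weight close to 1, which limits the influence of any single observation.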

For logistic regression models, studentized residuals, leverage scores, Cook’s distances, and DFBETAS were examined to identify influential observations. Residual analysis helps to identify whether the regression assumption of homogeneous variance is violated. The other statistics identify observations that have an outsized influence on model parameters, which may indicate that the model is unstable47. For proportional hazards models, the Lifelines package’s CoxPHFitter.check_assumptions method was used to test the assumption that each covariate’s effect on the hazard rate is constant over time48,49. Interactions with time were added to the model for covariates that did not meet the proportional hazards assumption. Variables with more than two levels were binned and represented through indicators. Any indicator with fewer than ten patients identified as having long COVID for a given analysis was excluded from that analysis.
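The indicator rule in the last two sentences can be sketched as follows. This is a minimal illustration with hypothetical helper names and made-up age bins; it does not choose a reference level, which a real model would also need.

```python
from collections import Counter
from typing import Dict, List, Sequence

# From the text: an indicator needs at least ten long COVID patients to be used.
MIN_EVENTS = 10


def usable_levels(
    levels: Sequence[str],
    has_long_covid: Sequence[bool],
    min_events: int = MIN_EVENTS,
) -> List[str]:
    """Return the levels whose long COVID count reaches the minimum."""
    events = Counter(level for level, y in zip(levels, has_long_covid) if y)
    return sorted(level for level in set(levels) if events[level] >= min_events)


def to_indicators(levels: Sequence[str], kept: Sequence[str]) -> List[Dict[str, int]]:
    """One 0/1 indicator per kept level; dropped levels get all zeros."""
    return [{f"is_{k}": int(level == k) for k in kept} for level in levels]


# Hypothetical binned variable with 12, 10, and 9 long COVID patients per bin.
levels = ["18-44"] * 30 + ["45-64"] * 30 + ["65+"] * 12
has_long_covid = (
    [True] * 12 + [False] * 18
    + [True] * 10 + [False] * 20
    + [True] * 9 + [False] * 3
)

kept = usable_levels(levels, has_long_covid)
indicators = to_indicators(levels, kept)
```

Here the "65+" bin has only nine long COVID patients, so its indicator is dropped and patients in that bin are coded with zeros on the remaining indicators.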

Sensitivity analyses

Six sensitivity analyses were conducted. The first four used the same cohorts as the primary analyses. They tested the sensitivity of the IPTW-adjusted and unadjusted vaccination status coefficients in the logistic regression and proportional hazards models across four dimensions: (1) the CP score threshold (0.3 to 0.95), (2) whether independent features other than vaccination were included, (3) whether post-index vaccination was treated as a censoring event, and (4) whether only U09.9 diagnoses were used to label long COVID. The first sensitivity dimension was not relevant for the clinic-based outcome, the third was not relevant for logistic regression analyses, and the fourth was not relevant for the model-based outcome.

The fifth and sixth sensitivity analyses used modified cohorts. The fifth analysis included only patients from partner facilities with the highest recorded vaccine ratios (≥89%). Four sites have recorded vaccine ratios of 89–90%; the next highest is 78%. The sixth sensitivity analysis eliminated the requirement for a recorded healthcare visit after COVID-19 infection and censored individuals after their last visit. Those without a visit after COVID-19 were censored the day after the index. For this analysis, the modeled time period began the day after the index (instead of 45 days after the index in the primary analysis), though individuals remained ineligible for long COVID designation until 45 days after the index. The sixth analysis was only relevant for the proportional hazards model in the clinic-based analysis, as the computable phenotype model requires a post-COVID-19 visit at least 60 days after COVID-19 and censoring is not available in logistic regression models10. Vaccination propensity scores, as well as the model coefficients, were recalculated for each of the modified cohorts in the fifth and sixth sensitivity analyses.

Subanalysis

A subanalysis was performed to determine whether the timing of vaccination relative to acute COVID-19 diagnosis substantially modifies the association between vaccination and long COVID diagnosis. Both primary analyses were repeated with the addition of indicators for the number of weeks between an individual’s last pre-COVID-19 vaccination and their COVID-19 diagnosis date.

All analyses were conducted using Python (version 3.6.10) with the Statsmodels (0.12.2) and Lifelines (0.26.4) packages. Preprocessing was done in R (3.5.1) and Python (3.6.10) with the PySpark (3.2.1), pandas (0.25.3), and numpy (1.19.5) packages. Study design elements, methods, and results were reported in accordance with STROBE guidelines50.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.