Large-scale, real-world data, such as those from electronic health records (EHRs), have been used to establish vaccine efficacy, to elucidate the genetic etiologies of diseases, and to advance epidemiological research (refs. 1,2,3). Real-world data also have the potential to capture the wide spectrum of clinical features attributed to post-acute sequelae of SARS-CoV-2, also called long COVID, in diverse patient populations (ref. 4).

We are an international consortium that has operationalized definitions of long COVID using health-agency guidelines, and established a chart-review procedure based on these definitions (ref. 5). During this process, we identified three major challenges in using real-world data to study long COVID: ambiguity and heterogeneity in clinical coding of long COVID; inadequacy of diagnostic codes in capturing the constellation of symptoms; and biases in EHR data arising from variability in the number and kind of contacts with the healthcare system. These challenges warrant special attention if the clinical community wishes to arrive at a robust understanding of long COVID using evidence derived from real-world data.

We carried out a manual medical-record review of 300 randomly sampled individuals who had been infected with SARS-CoV-2 and assigned the International Classification of Diseases (ICD)-10 code for long COVID (U09.9) at the Beth Israel Deaconess Medical Center, the University of Pittsburgh Medical Center, and the national US Veterans Health Administration (ref. 5). These three health systems collectively serve more than 15 million patients each year.

We evaluated the extent to which patients with the ICD-10 code for this condition met our operationalized definitions, which were based on guidelines from the World Health Organization (WHO) and the US Centers for Disease Control and Prevention (CDC) (refs. 5,6,7). Our definition of long COVID based on WHO guidelines required that a patient present with at least two new-onset persistent symptoms lasting for 60 days after infection, whereas our definition based on CDC guidelines required that a patient present with at least one new-onset persistent symptom lasting for 30 days (ref. 5).
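
As a minimal sketch of how the two operationalized definitions differ, the thresholds above can be written as a simple rule. The `Symptom` record, its field names and the helper functions are hypothetical and are not taken from the chart-review protocol; the sketch also assumes that "lasting for N days" means a duration of at least N days.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symptom:
    """Hypothetical chart-review record for one symptom."""
    name: str
    new_onset: bool      # True if the symptom was absent before infection
    duration_days: int   # days the symptom persisted after infection


def meets_definition(symptoms, min_symptoms, min_days):
    """Return True if enough new-onset symptoms persisted long enough."""
    qualifying = [
        s for s in symptoms
        if s.new_onset and s.duration_days >= min_days
    ]
    return len(qualifying) >= min_symptoms


def meets_who(symptoms):
    # WHO-based definition: at least two new-onset symptoms, 60 days.
    return meets_definition(symptoms, min_symptoms=2, min_days=60)


def meets_cdc(symptoms):
    # CDC-based definition: at least one new-onset symptom, 30 days.
    return meets_definition(symptoms, min_symptoms=1, min_days=30)


# Illustrative patient (invented values, not study data).
patient = [
    Symptom("fatigue", new_onset=True, duration_days=75),
    Symptom("brain fog", new_onset=True, duration_days=40),
    Symptom("pain", new_onset=False, duration_days=200),  # pre-existing
]
who_result = meets_who(patient)  # False: only fatigue reaches 60 days
cdc_result = meets_cdc(patient)  # True: two symptoms reach 30 days
```

The invented patient shows how one chart can satisfy the CDC-based definition without satisfying the WHO-based one, since only the threshold arguments differ between the two rules.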

A comparison of real-world EHR and administrative data with manually extracted clinical information (obtained through chart review of patients with the U09.9 code) found that functional definitions of long COVID varied widely by provider, which led to inconsistencies in coding practice and in adherence to clinical definitions. Among patients assigned the U09.9 code, an average of 40.2% met the more-stringent WHO definition, 58.3% had a single symptom that met the WHO definition, and 65.4% met the least-stringent CDC definition (ref. 5). This shows that the ICD-10 code is an unreliable surrogate for long-COVID status in research. Research and policy efforts are needed to converge on a definition that will standardize coding practices and improve the reliability of the ICD-10 code.

Clinical coding is further obfuscated by the potential for misclassifying long-lasting complications of acute hospitalization, which are not specific to COVID-19, as long COVID (ref. 8). We found that an average of 42.3% of patients assigned the U09.9 code were hospitalized after infection, and an average of 12.3% received intensive care; both can produce long-lasting symptoms that overlap with those of long COVID (ref. 5). Physical and physiological effects of hospitalization or critical care are important patient-level factors that should not be misattributed to SARS-CoV-2 infection.

Capturing the symptomatology of long COVID using diagnosis codes is difficult, as the syndrome encompasses a constellation of nonspecific symptoms (including pain, fatigue and brain fog) that are not well represented by coding schemes such as ICD-10 (refs. 4,5,6,7). Leveraging textual data from EHRs may improve the ability to capture symptomatology. When we compared the capture of symptoms by ICD-10 codes with that by natural-language processing of clinical narratives, such as clinician notes and discharge summaries, we found that the incorporation of narrative data substantially improved the identification of symptoms, compared with using diagnosis codes alone (ref. 5). This shows the potential of natural-language-processing techniques to provide a more-complete representation of a patient’s health.

A further challenge is the definition of cohorts of individuals with long COVID, as the syndrome is defined by a time to presentation and therefore requires an index date from which to observe clinical outcomes. The index date is usually an initial infection date, which may become increasingly difficult to ascertain with the use of at-home testing, the results of which are inconsistently reported in EHRs. Researchers should therefore carry out routine quality controls (such as checking the time period between initial infection and input of the ICD-10 code for long COVID), to better understand the biases present in the data. Researchers should also allow for some flexibility in defining index dates, such as considering an infection time period rather than a single date, which would help to account for delays in billing or data processing.
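
The quality control and flexible index date described above could be sketched as follows. This is an illustration under stated assumptions, not a method from the study: the `flag_record` helper, the 30-day minimum lag and the 7-day infection window are all hypothetical choices, and the recorded infection date is treated as the midpoint of a window rather than as an exact date.

```python
from datetime import date


def flag_record(infection_date, code_date, min_lag_days=30, window_days=7):
    """Classify the lag between infection and the first U09.9 code.

    min_lag_days and window_days are illustrative assumptions, not
    values taken from the source. The true infection date is assumed
    to lie within +/- window_days of the recorded date.
    """
    if infection_date is None:
        # No index date, e.g. an unreported at-home test.
        return "missing_index_date"
    lag = (code_date - infection_date).days
    shortest_lag = lag - window_days  # infection at the end of the window
    longest_lag = lag + window_days   # infection at the start of the window
    if longest_lag < min_lag_days:
        return "code_too_early"       # too early even at the window's edge
    if shortest_lag < min_lag_days:
        return "borderline"           # depends on where infection falls
    return "plausible"


# Invented records: (recorded infection date, date of U09.9 code).
records = [
    (None, date(2022, 1, 10)),
    (date(2022, 1, 1), date(2022, 1, 10)),  # lag of 9 days
    (date(2022, 1, 1), date(2022, 2, 3)),   # lag of 33 days
    (date(2022, 1, 1), date(2022, 4, 1)),   # lag of 90 days
]
flags = [flag_record(infected, coded) for infected, coded in records]
# flags == ["missing_index_date", "code_too_early", "borderline", "plausible"]
```

Tabulating such flags across a cohort gives a quick view of how many records depend on an uncertain or missing index date before any outcome analysis is run.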

Researchers must remain cognizant of potential biases in patient selection in real-world data; using visits to a long-COVID clinic as a proxy for true disease status is problematic. We found that, on average, only 24.0% of patients assigned the U09.9 code visited a long-COVID clinic, suggesting that the majority of patients sampled were being coded by physicians who do not work at these clinics (ref. 5). Among patients who met the WHO definition of long COVID, an average of just 35.6% visited a long-COVID clinic, suggesting that many patients are not being seen at these specialty care facilities (ref. 5).

Studies should also account for differential data density and healthcare utilization. We found that individuals who visited a long-COVID clinic were, on average, annotated with more new-onset conditions when compared with individuals who never visited a long-COVID clinic (ref. 5). Physicians working at long-COVID clinics could be more experienced with the syndrome and therefore document the disease more thoroughly. This contributes to a difference in data density and granularity, which can confound findings if not properly addressed.

Studies of long COVID using real-world data must be based on robust and comprehensive clinical data sets. The incorporation of narrative data obtained using natural-language-processing techniques should better capture symptoms, and researchers should take care when using only the ICD-10 code or a visit to a long-COVID clinic as surrogates for disease status.

Computational phenotypes (in which data elements are combined, using machine-learning algorithms, to describe a particular disorder) have the potential to account for the longitudinal persistence of symptoms while avoiding the misattribution of conditions that existed prior to initial infection (ref. 9). Semi-supervised machine-learning algorithms are resistant to some of these challenges, and so may be powerful tools to capture complex underlying temporal patterns in the data using a small number of manually curated labels. Rule-based algorithms may be less suited for the inherent complexity of long COVID (ref. 10).

Real-world data have an important role in supporting research into long COVID, but these pitfalls should be considered so that the most equitable clinical and policy decisions can be informed by population-level studies.