To the Editor: Here, we report the VDJdb database update prepared between 2019 and 2022, a period marked by the emergence of SARS-CoV-2, the causative agent of COVID-19.

In 2016, we started a community effort to gather and curate publicly available sequence data acquired from T cell receptors (TCRs) with defined antigen specificities, as well as datasets communicated by our colleagues, by developing the VDJdb database. VDJdb has since been extended with a web interface that allows batch querying of adaptive immune receptor repertoire sequencing (AIRR-seq) datasets and the identification of TCR sequence motifs linked with specific epitopes1.

In the current pandemic era, a large majority of recent T cell repertoire profiling and antigen-specificity studies have focused on TCR variants that target the SARS-CoV-2 coronavirus2,3,4. As a consequence, millions of TCR sequences have now been isolated from donors with COVID-19. To complement these efforts, in the latest release of VDJdb, we incorporated TCR specificity data from various studies of COVID-19. We collected data from an international network of laboratories focused on assaying antigen-specific T cell responses in COVID-19 (Fig. 1a). Data acquired from multiple laboratories across the world feature over 3,000 TCR α and β chain sequences recognizing dozens of SARS-CoV-2 epitopes. Analysis of these data revealed a set of reproducible TCR motifs that could find utility in large-scale clinical and experimental studies focused on COVID-19, and showed that TCR specificity data are consistent and reproducible across laboratories. The inferred TCR motifs will facilitate the tracking of SARS-CoV-2-specific T cells and the discovery of immune signatures associated with protection against COVID-19. T cell antigen specificity is encoded by somatically rearranged TCRs. Current techniques allow the comprehensive profiling of TCR repertoires via high-throughput sequencing, which is compatible with various methods for elucidating the antigen specificity of T cell populations5.

Fig. 1: Overview of COVID-19 data compendium stored in VDJdb.

a, General pipeline used to acquire and store COVID-19 TCR specificity data. SARS-CoV-2 epitopes of interest are selected and used to construct MHC multimers, which are in turn used to enrich and select T cells specific to a given epitope; those T cells are then subjected to a conventional TCR repertoire sequencing procedure (part 1). The data on TCR sequences and their cognate epitopes are acquired independently by proficient laboratories around the globe; pie chart sizes reflect the number of TCR specificity records, with chart colors representing distinct epitopes (part 2). Data are processed, curated and stored in VDJdb, which provides means to browse the COVID-19 compendium and annotate novel TCR sequences of unknown specificity (part 3). Maps are adapted from the open-source R package “maps”, released under the GPL-2 license; copyright 2015–2022 VDJdb Developers, reproduced with permission of VDJdb Developers. b, Numbers of TCR specificity records for SARS-CoV-2 epitopes presented by various HLAs. Correspondence is shown using an alluvial plot with bands colored by epitope. Epitopes are coded by their first three letters; only epitopes with ≥10 records are shown; band widths represent the log-scaled number of records. c, Comparison of TCR repertoires specific for the HLA-A*02-restricted YLQ epitope from SARS-CoV-2 obtained by different laboratories, shown as a sequence similarity map in which each dot represents a unique CDR3 sequence (top). Dot locations are based on CDR3 sequence similarity graphs generated using the TCRNET algorithm (see Supplementary Methods). Each dot is colored according to the parental dataset (key). Large red dots represent CDR3 sequences that were identified in multiple datasets. Left, TCR α chains; right, TCR β chains. Labels highlight TCRs that were successfully used to refold TCR–peptide–MHC complexes6. Sequence motif logos for clusters from the similarity map are shown below. Two recurring motifs per chain (CVVNXXDKIIF and CVVNXXDDMRF for TCRα; CAS-NTGELFF and CASSXDIEAFF for TCRβ) were shared among datasets (“Multi-lab” means shared across all laboratories).

The first set of TCR repertoires with known specificity for SARS-CoV-2 epitopes was acquired from the Efimov laboratory4. This work prioritized the HLA-A*02-restricted YLQ and RLQ epitopes, producing 573 VDJdb records (unpaired TCR α and β chains), which were subsequently detected in other studies and served as a template for the first SARS-CoV-2-specific TCR–peptide–MHC crystal structures6. This submission was followed by a number of studies from different laboratories performed in 2021. One dataset reported multiple TCR sequences specific for SARS-CoV-2 epitopes restricted by HLA-A*24 (ref. 7), a prominent HLA class I allotype among indigenous Asian populations. A report from the Kedzierska laboratory complemented these data with the addition of TCR sequences specific for SARS-CoV-2 epitopes restricted by HLA-A*02, HLA-A*24 and HLA-B*07 (ref. 3). A large set of paired TCRαβ sequences specific for a range of SARS-CoV-2 epitopes was acquired from the Thomas laboratory8. Smaller datasets were also imported from other published works and private communications (all listed in the issues section of the VDJdb GitHub repository), including one notable study that reported TCR sequences specific for SARS-CoV-2 epitopes restricted by HLA class II allotypes9. In total, the current VDJdb release features 3,187 unique TCR specificity records spanning 46 distinct SARS-CoV-2 epitopes (Fig. 1b and Supplementary Table 1).

An important test of consistency for any biological dataset is independent reproducibility, and TCR repertoire sequencing in particular is prone to methodological and operator-dependent biases. To explore potential biases in the SARS-CoV-2-related VDJdb dataset, we performed a comparative analysis of TCR α and β chain specificity records for the most widely studied epitope, YLQ-HLA-A*02. No preferential clustering of these specificity records was observed across laboratories (Fig. 1c, top), while the overall structure of the TCR similarity map was preserved, suggesting that different laboratories sampled uniformly from the same space of epitope-specific TCR sequences.
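The logic of this cross-laboratory check can be illustrated with a minimal sketch. It is not the TCRNET procedure used for Fig. 1c; as a simplifying assumption, it links CDR3 sequences of equal length that differ by exactly one residue, groups them into connected clusters and lists the laboratories contributing to each cluster. The sequences and laboratory labels below are made up for illustration, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def differ_by_one(a, b):
    """True if a and b have equal length and exactly one mismatch."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def similarity_clusters(cdr3s):
    """Connected components of the one-mismatch graph (simple union-find)."""
    parent = {s: s for s in cdr3s}

    def find(s):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in combinations(sorted(parent), 2):
        if differ_by_one(a, b):
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for s in parent:
        clusters[find(s)].add(s)
    return list(clusters.values())


def labs_per_cluster(records):
    """records: iterable of (cdr3, lab) pairs -> list of (cluster, labs) tuples."""
    labs_by_cdr3 = defaultdict(set)
    for cdr3, lab in records:
        labs_by_cdr3[cdr3].add(lab)
    result = []
    for cluster in similarity_clusters(labs_by_cdr3):
        labs = set().union(*(labs_by_cdr3[s] for s in cluster))
        result.append((cluster, labs))
    return result


# Hypothetical toy records (illustrative sequences, not taken from VDJdb).
records = [
    ("CASSPDIEAFF", "lab_A"),
    ("CASSLDIEAFF", "lab_B"),
    ("CASSLDIEAFF", "lab_C"),
    ("CASSQDIEAFF", "lab_C"),
    ("CASSYNTGELFF", "lab_A"),
    ("CASSFNTGELFF", "lab_B"),
]

for cluster, labs in labs_per_cluster(records):
    print(sorted(cluster), "<-", sorted(labs))
```

In this toy example, every cluster draws sequences from more than one laboratory, which is the pattern expected when laboratories sample from the same space of epitope-specific TCRs; clusters populated by a single laboratory would instead point to a dataset-specific bias.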

Conversely, the independently generated data validated a set of TCR complementarity-determining region 3 (CDR3) sequences, which clustered as clearly defined motifs across different laboratories (Fig. 1c). Of note, the most commonly obtained CDR3 sequences were used successfully in crystallographic studies to generate ternary structures6, providing new insights into the molecular mechanisms that underpin TCR recognition of the YLQ epitope in complex with HLA-A*02.
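As an illustration of how such motifs might be applied to new data, the following sketch checks whether a CDR3 sequence matches one of the motifs listed in the Fig. 1c legend. It assumes that "X" denotes any single amino acid and that a match must span the full CDR3; the chain labels, function names and query sequences are hypothetical. The motif written with a hyphen in the legend is left out because its notation is not defined here.

```python
import re

# Motifs from the Fig. 1c legend; "X" is assumed to mean any amino acid.
MOTIFS = {
    "TRA": ["CVVNXXDKIIF", "CVVNXXDDMRF"],
    "TRB": ["CASSXDIEAFF"],
}

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def motif_to_regex(motif):
    """Translate a motif string into a compiled regular expression."""
    pattern = "".join(
        f"[{AMINO_ACIDS}]" if ch == "X" else re.escape(ch) for ch in motif
    )
    return re.compile(pattern)


def matching_motifs(cdr3, chain):
    """Return the motifs for the given chain that match the whole CDR3."""
    return [
        motif
        for motif in MOTIFS.get(chain, [])
        if motif_to_regex(motif).fullmatch(cdr3)
    ]


# Hypothetical query sequences, for illustration only.
print(matching_motifs("CVVNGGDKIIF", "TRA"))
print(matching_motifs("CASSPDIEAFF", "TRB"))
print(matching_motifs("CASSPPDIEAFF", "TRB"))
```

Exact motif matching of this kind is deliberately strict; a practical annotation pipeline would probably tolerate mismatches and also take V and J gene usage into account.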

Imprints of common infections can be detected in TCR repertoire sequencing datasets10, which in turn can be used to predict immune responses and stratify patients with COVID-19 (ref. 5). VDJdb has been used successfully in the past for similar purposes and currently serves as a benchmark standard for testing TCR-specificity prediction algorithms2. In this work, we demonstrated that the COVID-19 TCR-specificity compendium is unaffected by inter-laboratory biases and thus can be employed as a reference in TCR repertoire annotation. These precedents suggest that VDJdb can be used in the future to build classifiers trained to identify biologically relevant T cell responses in patients with COVID-19. Overall, we anticipate that the present release will enhance the versatility of VDJdb in the pandemic era, supporting the development of more effective vaccines and addressing future challenges associated with viral evolution and the emergence of new pathogens beyond SARS-CoV-2.