Rapid evaluation of COVID-19 vaccine effectiveness against symptomatic infection with SARS-CoV-2 variants by analysis of genetic distance

Timely evaluation of the protective effects of Coronavirus Disease 2019 (COVID-19) vaccines against severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants of concern is urgently needed to inform pandemic control planning. Based on 78 vaccine efficacy or effectiveness (VE) data from 49 studies and 1,984,241 SARS-CoV-2 sequences collected from 31 regions, we analyzed the relationship between genetic distance (GD) of circulating viruses against the vaccine strain and VE against symptomatic infection. We found that the GD of the receptor-binding domain of the SARS-CoV-2 spike protein is highly predictive of vaccine protection and accounted for 86.3% (P = 0.038) of the VE change in a vaccine platform-based mixed-effects model and 87.9% (P = 0.006) in a manufacturer-based model. We applied the VE-GD model to predict protection mediated by existing vaccines against new genetic variants and validated the results by published real-world and clinical trial data, finding high concordance of predicted VE with observed VE. We estimated the VE against the Delta variant to be 82.8% (95% prediction interval: 68.7–96.0) using the mRNA vaccine platform, closely matching the reported VE of 83.0% from an observational study. Among the four sublineages of Omicron, the predicted VE varied between 11.9% and 33.3%, with the highest VE predicted against BA.1 and the lowest against BA.2, using the mRNA vaccine platform. The VE-GD framework enables predictions of vaccine protection in real time and offers a rapid evaluation method against novel variants that may inform vaccine deployment and public health responses. A model that predicts the effectiveness of COVID-19 vaccines against circulating SARS-CoV-2 variants, before the acquisition of real-world effectiveness data, may help guide more rapid public health and research responses to new variants of concern.

Vaccination is a crucial measure to control the scale of SARS-CoV-2 transmission and mitigate the severity of COVID-19. To date, 38 vaccines against SARS-CoV-2 are in early use or have been approved for application in the general population 1 . However, the protective effect of the various vaccine products is challenged by new genetic variants. VE against COVID-19, which measures the relative reduction of risk for a disease outcome in clinical trials or in the general population, exhibited a wide range of variation, from −2.7% to 97.2% 2,3 .
Several factors may contribute to the variations in VE that make it difficult to directly interpret the protective effect of vaccines. The notable contributors include the technology platforms, calendar period of studies, the target population, dosing interval, differences in study protocols and background risk of COVID-19, among others. The various vaccine technology strategies generated non-identical immune responses to provide protection against SARS-CoV-2 infection 4 . For instance, the LNP-mRNA vaccine, mRNA-1273, induces spike (S)-specific IgG, high TH1 cell responses, low TH2 cell responses and CD8+ T cell responses 5,6 , whereas the inactivated virus vaccine, CoronaVac, elicits robust CD4+ and CD8+ T cell responses to the structural proteins, including S, nucleocapsid (N), envelope (E) and matrix (M), in addition to humoral responses 7,8 . Among all the influencing factors, emerging genetic variants relative to the vaccine strain play a critical role in determining vaccine effectiveness. Serology studies showed that neutralizing activity against the Omicron variant decreased substantially in recipients of two COVID-19 vaccine doses 9,10 . Viral structure studies demonstrated that the amino acid substitutions in the receptor-binding domain (RBD) and N-terminal domain (NTD) alter virus-host cell interactions and reshape antigenic surfaces of the major neutralizing sites, leading to immune evasion 9,11-14 . Although the mechanisms of immune escape caused by the new mutations are being elucidated in experimental studies, an integrative framework to quantify the effect of genetic mismatch on VE would be instrumental for efficient evaluation of vaccine protection for any country in real time.
In this study, we evaluated the link between genetic mismatch of circulating SARS-CoV-2 viruses and reported COVID-19 VE from population studies. Based on our bioinformatics approach previously established for influenza viruses 15,16 , we tailored the VE estimation framework for COVID-19 by controlling the clustered random variation of technology platforms or manufacturers using a mixed-effects model. Through extensive analysis of publicly reported VE studies and genetic sequences, we showed that a substantial proportion of the change in VE could be explained by GD, and we proposed an efficient approach to evaluate vaccine protection against symptomatic COVID-19.

Results
GD, or genetic mismatch, is calculated as the average Hamming distance on the RBD between the circulating viruses and the vaccine strain during the timeframe of the VE studies. The VE data used are detailed in Supplementary Table 1. The prediction method for VE was constructed through a mixed-effects model using GD as the main predictor, controlling for confounding variables, including the midpoint (days) since the second dose and the age group of the study. In particular, variations in VE caused by technology platform or manufacturer were controlled by random effects in the mixed model (see Methods for details). In the following, we first describe the variations in VE and GD by vaccine platform and then investigate their relationship.
VE and GD distributions by vaccine platform.
VE and GD of the four vaccine platforms with authorized use are compared in Fig. 1. Within each vaccine platform, vaccine effectiveness is generally lower than efficacy (Fig. 1a), whereas, in terms of genetic mismatch (Fig. 1b and Extended Data Fig. 1), the vaccine effectiveness cohorts encompass larger genetic mismatch than the vaccine efficacy cohorts. This result indicates that genetic mismatch increased during the mass vaccination phase compared to the earlier clinical trial periods. This could be due to the accumulation of virus mutations through time, as well as the generally longer evaluation period of the effectiveness studies compared to the efficacy trials. Across the technology platforms, vaccine protection (efficacy/effectiveness) shows considerable differences (ANOVA test P < 0.001; Fig. 1a). Interestingly, the genetic mismatch of these platforms shows a perfect reverse trend, in which the mRNA vaccine cohorts correspond to the smallest mismatch and the other platforms exhibit larger mismatches. The timeframe of the vaccine evaluations might also contribute to this pattern, as the mRNA trials were the earliest to complete and corresponded to a more homogeneous viral population. The genetic mismatch summarizes the deviation of genetic variants with respect to the vaccine strains, accounting for time, locations and multiple strain co-circulation, for vaccine evaluation at population level using sequencing data.
Relationship between vaccine protection and GD.
Next, we explored the effect of GD on vaccine protection. At most, 86.3% of the variations in VE can be explained by the GD measure, controlling for the random effects of vaccine technology platforms (Table 2). The NTD and S protein demonstrate weaker per-amino-acid-substitution associations with VE (P = 0.086 and P = 0.082, respectively) (Extended Data Fig. 4 and Supplementary Table 4). When no genetic mismatch is present, VE for the mRNA vaccines is expected to be 95.8% (95% CI: 92.0-99.5), as estimated from the RBD region; the protein subunit vaccine's expected VE is similar; and the inactivated and viral vector vaccines are expected to exhibit systematically lower VEs, by 17.3% and 20.6%, respectively, compared to the mRNA vaccines. The estimates using the manufacturer-based model can be found in Supplementary Table 3.
In Fig. 3a, the predicted and observed VEs for the genetic variants are overlaid. The calibration plot (Fig. 3b) 20 . These validation results demonstrate high predictive feasibility of using genetic mismatch to estimate vaccine performance.

Prediction for variants and Omicron sublineages without known VEs.
Next, we fitted the model with all available data and predicted VE against circulating variants as well as the Omicron sublineages for which there were no observed VE data at the time of writing (Fig. 4a). Interestingly, among the four sublineages of Omicron (BA.1, BA.1.1, BA.2 and BA.3), the expected VEs vary between 11.9% for BA.2 and 33.3% for BA.1, using the mRNA vaccines. This might contribute to the considerable variations in VEs for Omicron reported from observational studies, whose cohorts might have been infected by divergent Omicron sublineages, in addition to differences in immune history. The model predicts that VEs against variants of concern (VOCs) or variants of interest (VOIs) other than Omicron, such as the Lambda and Mu variants, are expected to be above 50% within 3 months after the second dose of an mRNA vaccine; however, the VEs of inactivated vaccines against symptomatic infection are predicted to wane most under the challenge of new genetic variants.

Depicting trend of VE in serial cross-sectional sequencing data.
We demonstrated the application of predicting VE in real time against the circulating virus in a given geographical region, using California as an example. Sequencing data of virus isolates from California were downloaded from public databases. VEs were estimated for the major vaccine platforms at weekly intervals from the GD in the serial cross-sectional sequencing data (Fig. 4b). In general, VE shows a decreasing trend, with a sharp drop once Omicron became predominant in December 2021. The observed VEs from clinical trials and observational studies conducted during the period in the United States are overlaid on the prediction outcomes for reference 2,21-32 .
Exploration of candidate vaccine strains.
We further explored the possibility of developing region-specific vaccines and how well they would match the circulating virus profiles. We investigated the optimal candidate vaccine strains for 13 regions, including the United Kingdom, Germany, South Africa, Russia, India, Hong Kong, Malaysia, Japan, California, New York, Mexico, Peru and Brazil. Based on the GD between the vaccine strain and observed viruses circulating in a given region and period, hierarchical clustering of regions was performed to show the similarity of vaccine mismatches (Fig. 5 and Extended Data Fig. 5). We found that, although the Omicron sublineages matched the epidemic viruses in all investigated regions except Russia during January and February 2022, the dominant sublineages were not the same in these regions. This suggests that updating vaccine compositions with a single genetic variant might not be sufficient for matching the distribution of the global viral population.
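The weekly GD computation that underlies serial cross-sectional VE estimates of the kind shown for California can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `weekly_gd`, the record layout (ISO collection date plus aligned amino acid sequence), the use of ISO calendar weeks and the toy sequences are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date


def weekly_gd(records, vaccine, positions):
    """Average mismatch count per ISO week.

    records   -- iterable of (collection_date, sequence); dates are ISO
                 strings (YYYY-MM-DD) and sequences are aligned amino acid
                 strings of the same length as `vaccine` (assumed layout)
    vaccine   -- aligned amino acid sequence of the vaccine strain
    positions -- 0-based codon positions of the region of interest
    """
    totals = defaultdict(lambda: [0, 0])  # week -> [mismatch sum, sample count]
    for day, seq in records:
        iso = date.fromisoformat(day).isocalendar()
        week = (iso[0], iso[1])  # (ISO year, ISO week number)
        mismatches = sum(1 for j in positions if seq[j] != vaccine[j])
        totals[week][0] += mismatches
        totals[week][1] += 1
    return {week: s / n for week, (s, n) in sorted(totals.items())}


# Toy data, hypothetical: three samples over two consecutive weeks.
toy_vaccine = "AAAA"
toy_records = [
    ("2021-12-06", "AAAA"),
    ("2021-12-08", "AAAT"),
    ("2021-12-13", "TTAT"),
]
trend = weekly_gd(toy_records, toy_vaccine, range(4))
for week, gd in trend.items():
    print(week, gd)
```

Each weekly GD value would then be passed to the fitted VE-GD model to obtain a weekly VE estimate; that step is not shown here.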

Discussion
As novel variants of SARS-CoV-2 keep emerging in the ongoing pandemic, rapid assessment of vaccine performance in populations is crucial to inform public health and clinical responses. This study established an efficient computational framework to estimate COVID-19 VE against symptomatic infection using viral sequence data. We show that the predicted VEs against genetic variants are close to the observed outcomes. The framework has several advantages. First, it enables prediction of VE against novel variants using existing virus surveillance networks to derive a rapid estimate; thus, it could inform timely public health preparedness. Second, it provides an integrated measure to facilitate the interpretation of vaccine effects, which accounts for the potential confounding effects of time and location related to genetic evolution. Third, through mixed-effects modeling, the framework controls for variations by vaccine type, providing a consistent and adaptable prediction framework for inclusion of multiple vaccine platforms and manufacturers.
Among candidate genomic regions, the RBD region exhibits the strongest statistical association with VE. Weaker associations between VE and GD were detected for NTD and the entire S protein. These findings are also supported by biological evidence. The RBD is the major target for neutralizing antibodies that interfere with viral receptor binding 33,34 . The NTD is reported to be the target of 5-20% of S-specific monoclonal antibodies from memory B cells against SARS-CoV-2 (refs. 35,36 ).
Recent studies have investigated the use of neutralization titer as a predictor of vaccine efficacy 37-39 ; however, the neutralizing results against SARS-CoV-2 genetic variants showed varying outcomes. The vaccine protection against the B.1.351 variant reduced from 95.0% 2 to 75.0% 40 for BNT162b2 in early 2021. Due to differences in standardization and cohorts, one neutralization study showed that the titer against B.1.351 is 7.6-fold and nine-fold lower compared to the early Wuhan-related Victoria variant in the BNT162b2 vaccine serum and AZD1222 vaccine serum, respectively 41 , whereas another experiment reported a 2.7-fold decrease in neutralization titers against the B.1.351 lineage in the BNT162b2-elicited serum 42 . Similar results have also been observed for the Omicron variant 43-45 . The varying neutralization results increase the challenge of inferring vaccine performance solely from neutralization levels. The association of neutralization with protection across studies showed that neutralizing antibodies might not be deterministic in mediating protection, and the effect of other vaccine-induced immune responses also needs to be quantified. This work uses an alternative angle to bridge the link between genetic variations and population-level vaccine responses. Further investigations are needed to integrate potential correlates of vaccine protection and improve the existing framework.
Although 42% of the world population has not completed the full vaccine primary series up to this date 46 , additional booster doses of vaccine are being rolled out in many places. Neutralization activity after the booster can be restored to a higher level for a short period of time. BNT162b2 immune sera of individuals who received only two doses had a low ability to neutralize the Omicron variant, whereas a third dose of BNT162b2 increased the Omicron-neutralizing titer 23-fold relative to the level at 21 days after the second dose 47 . Similar results have been reported for the mRNA-1273 vaccine 48 . The booster-enhanced neutralizing level against Omicron was lower than that against the Beta, Delta and Wuhan strains and declined faster than that against the D614G variant 47,48 . Recent studies showed that the VE against symptomatic infection with Omicron is restored to approximately 50% after the booster. In Qatar, the VE against symptomatic Omicron infection was 56.6% and 53.1% for the BNT162b2 and mRNA-1273 vaccines, respectively, 1 month after the third dose 49 ; and, in Israel, the VE against symptomatic Omicron infection was 43% and 31% for BNT162b2 and mRNA-1273, respectively, 1 month after the fourth dose among healthcare workers 50 . The flexible VE-GD framework proposed here could be further extended to account for the booster's protection as more effectiveness data from homologous and heterologous booster studies become available.
VE against infection is generally lower compared to the VE against symptomatic infection. For instance, in the Coronavirus Efficacy (COVE) phase 3 trial of the mRNA-1273 vaccine, the VEs for infection and symptomatic infection are 82.0% and 93.2%, respectively 51 . In view of waning immunity, a systematic review including 78 VE studies up to 2 December 2021 showed that the VE dropped by 21.0% (95% CI: 13.9-29.8) and 24.9% (95% CI: 13.4-41.6) against infection and symptomatic infection, respectively, 6 months after the second dose, aggregating the data from several vaccine platforms 52 . VE against severe disease or hospitalization showed longer preservation compared to the protection against symptomatic infection. In Qatar and Canada, the VE against hospitalization due to infection with the Alpha, Beta and Delta variants among all age groups was above 90% after the second dose of the mRNA-1273, BNT162b2 and AZD1222 vaccines 53-56 . VE against hospitalization with Delta infection remained above 80% in the United Kingdom 20 weeks after vaccination with the BNT162b2 and AZD1222 vaccines 57 . In Qatar and South Africa, VE against hospitalization was in the range of 70-80% during the Omicron predominance within 6 months after the second dose for mRNA vaccines 49,58 .
Previously, the effect of genetic diversity on vaccine efficacy was investigated by sieve analysis, originating in the study of the human immunodeficiency virus 1 (HIV-1) vaccines 59-61 . Sieve analysis compares the infection strains between vaccinated and unvaccinated individuals and estimates the odds ratio of a viral strain type to penetrate the vaccine protection barrier. The sieve method requires individual-level data of virus isolate sequences and infection outcome of trial participants, whereas the model proposed in this study uses viral sequences in the general population and integrates multiple VE studies. Other studies have considered the proportion of genetic mismatch in the dominant epitope region to account for variations in the VE against influenza viruses 62,63 , whereas the VE-GD model in this report provides a unified framework to account for multiple genes and vaccine types.
This study has several limitations. First, the scope of inference is subject to the range of VE studies included in model fitting; thus, the VE estimated is presumably for a time close to the second vaccine dose. In model estimation, the effect of waning immunity on VE was controlled by a proxy time variable at population level, and the VE decline corresponding to time was estimated to be 2.4% (95% CI: 1.0-3.8) per 30 days for mRNA vaccines. This estimation is in line with the phase 2/3 efficacy trial of the BNT162b2 vaccine through 6 months of follow-up 32 , which showed an average decline of 2.5% per month by comparing the VE after 4-6 months to the VE within 2 months since the second dose. The exact relationship between time and waning of host immunity will be calibrated in individual-level data, in which the main variable of interest is time-to-infection. For these analyses, including the genetic mismatch information would be helpful to control for the genetic variant's effect on vaccine breakthrough alongside waning of host immunity. Second, VE prediction in this study only considered the GD of the vaccine strain to circulating strains, and the effect of prior infection on vaccine protection was not captured. Studies showed that natural infection, either before or after vaccination, substantially increased vaccine protection for symptomatic infection and hospitalization during the Beta-predominant and Delta-predominant periods 64 and against the Omicron variant by the mRNA vaccine 65,66 . As more hybrid immunity data become available, the mixed-effects prediction model could be extended to account for this additional level of variation. Moreover, bias might occur if sequences in databases disproportionately represented regions with known circulation of a given variant. Enhanced efforts are needed to ensure better geographical representativeness of available SARS-CoV-2 sequences.
Despite these limitations, we demonstrated a robust relationship between genetic mismatch and VE, which we validated using independent data.
To conclude, this work developed a modeling framework integrating data from genetics and epidemiological studies for estimating COVID-19 vaccine effectiveness against a specific variant or for a particular cohort in a given period and region. Rapid assessment of VE against an evolving pathogen can be a useful instrument to inform vaccine development, distribution and public health responses.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41591-022-01877-1.

Methods
VE data. VE is calculated as (1 − RR) × 100%, where RR is the relative risk of a disease outcome in the vaccinated group compared to the unvaccinated group. Vaccine efficacy is measured in randomized controlled trials, whereas vaccine effectiveness is obtained from observational studies. VE reports before 24 December 2021 were collected from published articles and preprint articles. Inclusion criteria for the vaccine effectiveness studies include: target population is a cohort without special conditions; the primary outcome is symptomatic COVID-19 infection after the second vaccine dose; and the study period of VE evaluation is clearly reported. A total of 78 VE data from 49 studies were obtained for estimating the effect size of GD, of which 33 were efficacy data and 45 were effectiveness data. The vaccine efficacy studies include 28 phase 3 trials, one phase 2 trial and four phase 2/3 trials. The vaccine effectiveness studies include 16 cohort studies and 29 case-control studies. Detailed information of VE studies is available in Supplementary Table 1.
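As a worked illustration of the VE definition above, the following minimal sketch computes VE from raw case counts in two groups. The function name `vaccine_effect` and the counts are hypothetical and are not taken from any study in Supplementary Table 1; real studies typically adjust RR for confounders, which this sketch does not.

```python
def vaccine_effect(cases_vax, n_vax, cases_unvax, n_unvax):
    """Return VE in percent, computed as (1 - RR) * 100.

    RR is the risk in the vaccinated group divided by the risk in the
    unvaccinated group. A negative VE means the observed risk was higher
    in the vaccinated group.
    """
    if n_vax <= 0 or n_unvax <= 0:
        raise ValueError("group sizes must be positive")
    if cases_unvax <= 0:
        raise ValueError("RR is undefined without cases in the unvaccinated group")
    risk_vax = cases_vax / n_vax
    risk_unvax = cases_unvax / n_unvax
    relative_risk = risk_vax / risk_unvax
    return (1.0 - relative_risk) * 100.0


# Hypothetical counts: 10 cases among 10,000 vaccinated participants
# versus 100 cases among 10,000 unvaccinated participants.
ve_example = vaccine_effect(10, 10_000, 100, 10_000)
print(f"VE = {ve_example:.1f}%")
```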
Genetic sequences. Human SARS-CoV-2 strains with collection dates ranging from 4 August 2020 to 6 March 2022 were retrieved from the Global Initiative on Sharing All Influenza Data (GISAID) EpiCoV database 67 . All available sequences that matched to the period and locations of the clinical trials or observational studies totaled 1,984,241 full-length genome sequences from 31 geographical regions. The sources of SARS-CoV-2 sequences involved in this study are reported in the Supplementary Acknowledgement Table. Strains with duplicated names and unclear collection time of samples were removed. Multiple sequence alignment was performed using MAFFT (version 7). The 'Wuhan-Hu-1' genome (GenBank NC_045512.2 or GISAID EPI_ISL_402125) was set as the reference sequence. The variants involved in this study are summarized in Supplementary Tables 6 and 7. Lineage classification for sequences was referenced from the GISAID.
Statistical methods. GD. Following our previous framework developed for influenza virus 15 , let $X = \{x_{ij}\}$ denote the sequences sampled from the GISAID database for a target population, where $x_{ij}$ is the amino acid at codon position $j$ of the $i$-th sample, $i = 1, \ldots, n$, $j = 1, \ldots, J$; and let $V = \{v_j\}$ denote the vaccine strain applied in the target population, where index $j$ indicates the $j$-th codon position in the sequence. Denote the codon positions of a given genomic region as $W = \{w_k\}$, where $k$ is the index for codon positions contained in the segment, $k = 1, \ldots, K$, $0 \le K \le J$. With the Hamming distance as the basic measure of dissimilarity between two sequences, the vaccine genetic distance ($d$) calculated for the target population is:

$$ d = \frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} I\left(x_{i w_k} \neq v_{w_k}\right), $$

where $I(\cdot)$ is the indicator function. Thus, $d$ summarizes the average amino acid mismatch of circulating strains versus the vaccine strain based on a given genomic segment in a target population. In this study, we considered a wide range of candidate $W$, including the RBD, NTD and S, E, M, N, ORF1ab and accessory proteins. A schematic representation of the SARS-CoV-2 genome and the structure of S protein are available in Supplementary Figs. 1 and 2. All vaccine strains are based on the Wuhan strain isolated in January 2020. When the target population is composed of individuals infected with multiple co-circulating variants, $d$ captures the average mismatch over all co-circulating variants in the cohort, whereas, when the target population is a single genetic variant, $d$ captures the variant-specific distance.
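The average-mismatch definition of d can be sketched in a few lines of code. This is an illustrative sketch only: the function name `genetic_distance`, the 0-based position indexing and the toy sequences are assumptions, and the treatment of gaps or ambiguous residues (counted here as ordinary mismatches) is not specified in the text.

```python
def genetic_distance(samples, vaccine, positions):
    """Mean Hamming distance between sampled sequences and the vaccine strain.

    samples   -- list of aligned amino acid sequences (strings)
    vaccine   -- aligned amino acid sequence of the vaccine strain
    positions -- 0-based codon positions belonging to the genomic region W
    """
    if not samples:
        raise ValueError("at least one sample is required")
    total_mismatches = 0
    for seq in samples:
        if len(seq) != len(vaccine):
            raise ValueError("sequences must be aligned to the same length")
        # Hamming distance restricted to the positions in W
        total_mismatches += sum(1 for j in positions if seq[j] != vaccine[j])
    return total_mismatches / len(samples)


# Toy alignment (hypothetical residues, five codon positions).
toy_vaccine = "NKSLE"
toy_samples = ["NKSLE", "NKSLK", "TKSLK"]

d_full = genetic_distance(toy_samples, toy_vaccine, range(5))
d_region = genetic_distance(toy_samples, toy_vaccine, [0, 1, 2])
print(d_full, d_region)
```

Restricting `positions` to a sub-region, as in `d_region`, mirrors the choice of a candidate segment W such as the RBD or NTD.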
The VE-GD mixed-effects model. A two-level mixed-effects model was adopted to account for the random effect associated with vaccine type (technology platform or manufacturer). The genetic distance, $d_{ij}$, is the main predictor variable for study $i$ and vaccine type $j$, $i = 1, \ldots, n_j$, where $n_j$ is the number of studies for vaccine type $j$. The following random intercept and random slope model is specified for the VE response $Y_j$:

$$ Y_j = X_j \beta + Z_j u_j + \varepsilon_j . $$

In the equation, $X_j$ is the covariate matrix of fixed factors, and $\beta$ is the fixed-effect vector. $Z_j = [1, d_j]$ is the matrix containing a unit vector and the $n_j$-length genetic distance vector $d_j$; and $u_j = (u_{0j}, u_{1j})^T$ is composed of a random intercept variable $u_{0j}$ and a random slope variable $u_{1j}$, with $u_j \sim N(0, D)$, where $D$ is a variance component matrix. The fixed factors include the age category of the study, the midpoint (days) after the second dose extracted from each study and the genetic distance $d_j$. $\varepsilon_j \sim N(0, R_j)$ is the error term of the mixed-effects model, with $R_j = \sigma^2 I_{n_j}$. The model was fitted using the R package lmerTest 68 . The prediction interval of the mixed-effects model was calculated using the R package merTools 69 . All analyses were performed using R statistical software (version 4.0.3). Statistical significance was declared if P < 0.05.
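To make the role of the random intercept and random slope concrete, the sketch below evaluates the conditional mean of such a model for one vaccine type once the coefficients are known. It does not fit the model (the study used the R package lmerTest for that), it omits the age-category covariate for brevity, and every coefficient value shown is a placeholder chosen for illustration rather than a fitted estimate. The function name `predict_ve` is hypothetical.

```python
def predict_ve(gd, days, beta, u=(0.0, 0.0)):
    """Conditional mean VE for one vaccine type.

    gd   -- genetic distance d of the circulating viruses
    days -- midpoint (days) after the second dose
    beta -- fixed effects (intercept, GD slope, per-day slope)
    u    -- random effects for the vaccine type (intercept shift, GD slope shift);
            (0, 0) gives the population-level prediction
    """
    b_intercept, b_gd, b_day = beta
    u_intercept, u_gd = u
    # Fixed part X*beta plus random part Z*u with Z = [1, d]
    return (b_intercept + u_intercept) + (b_gd + u_gd) * gd + b_day * days


# Placeholder coefficients, NOT estimates from the study.
beta_demo = (90.0, -5.0, -0.1)
population_level = predict_ve(gd=2.0, days=30.0, beta=beta_demo)
type_specific = predict_ve(gd=2.0, days=30.0, beta=beta_demo, u=(3.0, -1.0))
print(population_level, type_specific)
```

The random effects let each platform or manufacturer have its own baseline VE and its own sensitivity to genetic mismatch, while the fixed effects are shared across vaccine types.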
Model assessment was performed in a training-validation setting. A total of 23 variant-specific VEs were extracted from the data as the validation set (Supplementary Table 5). The model was fitted using the remaining 55 VEs (non-variant specific), and predictions were made for the genetic variants. The agreement between the predicted and observed VEs was measured by the concordance correlation coefficient (CCC) 70 .
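Assuming that CCC here refers to Lin's concordance correlation coefficient, the agreement measure can be sketched as below. The function name `concordance_cc` and the toy values are hypothetical, and the sketch uses the 1/n (population) variance and covariance; software packages may differ in this normalization.

```python
def concordance_cc(predicted, observed):
    """Lin's concordance correlation coefficient between two paired series.

    CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2),
    using 1/n normalization for the variances and covariance.
    """
    n = len(predicted)
    if n != len(observed) or n < 2:
        raise ValueError("need two paired series with at least two values")
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted) / n
    var_o = sum((o - mean_o) ** 2 for o in observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed)) / n
    denom = var_p + var_o + (mean_p - mean_o) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for two identical constant series")
    return 2.0 * cov / denom


# Toy values (hypothetical): perfectly correlated but shifted by one unit.
ccc_shifted = concordance_cc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(ccc_shifted)
```

Unlike the Pearson correlation, the CCC penalizes a systematic offset between predicted and observed values, which is why the shifted toy series scores below 1 despite being perfectly correlated.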
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
All data used in this study are publicly available. Detailed information of VE outcomes is available in the Supplementary Materials. Viral sequence data were downloaded from the GISAID at http://platform.gisaid.org/, and the accession numbers are provided in the online Supplementary Acknowledgment Table (https://github.com/VaccineEffectivenessPrediction/COVID19-Vaccine-Effectiveness).

Reporting Summary
Last updated by author(s): May 11, 2022
Nature Portfolio wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency in reporting. For further information on Nature Portfolio policies, see our Editorial Policies and the Editorial Policy Checklist.

Software and code
Policy information about availability of computer code

Data collection
No software was used for data collection.

Data analysis
We used MAFFT (version 7) for multiple sequence alignment and R statistical software (version 4.0.3) in all statistical analyses. R packages used in this study include lmerTest (3.1-3) and merTools (0.5.2). All code is freely available at https://github.com/VaccineEffectivenessPrediction/COVID19-Vaccine-Effectiveness.
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors and reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Portfolio guidelines for submitting code & software for further information.

Data
Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A description of any restrictions on data availability
- For clinical datasets or third party data, please ensure that the statement adheres to our policy

All data used in this study are publicly available. The detailed information of VE outcomes is available in the Supplementary Information. Viral sequence data were downloaded from the Global Initiative on Sharing All Influenza Data (GISAID) at http://platform.gisaid.org/, and the accession numbers are provided in the online Supplementary Acknowledgment Table (https://github.com/VaccineEffectivenessPrediction/COVID19-Vaccine-Effectiveness).

Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.

- Life sciences
- Behavioural & social sciences
- Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf

Life sciences study design
All studies must disclose on these points even when the disclosure is negative.

Sample size
The study extracted vaccine efficacy or vaccine effectiveness (VE) data before 24 Dec, 2021 from published articles and preprint articles. A total of 78 VE data were obtained for model building. All available sequences that matched to the period and locations of the clinical trials or observational studies totaled 1,984,241 full-length genome sequences from 31 geographical regions.

Data exclusions
For VE data, exclusion criteria include: target population has special conditions; the primary outcome is not symptomatic COVID-19 infection after the second vaccine dose; and the study period of VE evaluation is not reported. For sequence data, strains with duplicated names and unclear collection time of samples were removed.

Replication
This study demonstrated a clear relationship between COVID-19 VE and genetic distance on the RBD, NTD and entire S protein. Our findings are supported by biological experiments. We first collected data before June 2021 and determined that genetic distance is associated with VE against symptomatic infection. After adding subsequent data before March 2022, the results are consistent with the previous results. Such relationships exist in different vaccine platforms and vaccine products. The prediction results were validated by independent data. All attempts at replication were successful. Additionally, this bioinformatics framework has been applied to influenza A/H1N1pdm09, H3N2 and influenza B viruses, and such a relationship was also detected.

Randomization
Randomization is not applicable in our study design. The vaccine efficacy outcomes included in this study were based upon clinical trials. The vaccine effectiveness outcomes were obtained from observational studies. All available sequences that matched to the period and locations of the clinical trials or observational studies were collected.

Blinding
Blinding is not relevant to the study. This study used population-level data and did not involve individual participants.

Reporting for specific materials, systems and methods
We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material, system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response.