In the first wave of the pandemic, many countries restricted non-essential travel to mitigate the spread of SARS-CoV-2. The restrictions crippled most tourist economies, with estimated losses among European countries of US$1 trillion and 19 million jobs3. As conditions improved from April to July, countries sought to partially lift these restrictions, not only for tourists, but also for the flow of goods and labour.

Different countries adopted different border screening protocols, typically based on the origin country of the traveller. Despite the variety of the protocols, we group those used in early summer 2020 into four broad types: allowing unrestricted travel from designated ‘white-list’ countries; requiring travellers from designated ‘grey-listed’ countries to provide proof of a negative test by PCR with reverse transcription before arrival; requiring all travellers from designated ‘red-listed’ countries to quarantine on arrival; forbidding any non-essential travel from designated ‘black-listed’ countries.

Most nations employed a combination of all four strategies. However, the choice of which ‘colour’ to assign to a country differed across nations. For example, as of 1 July 2020, Spain designated the countries specified in ref. 1 as white-listed, whereas Croatia designated these countries as grey-listed or red-listed.

To the best of our knowledge, in all European nations except Greece, the above ‘colour designations’ were entirely based on population-level epidemiological metrics (for example, see refs. 1,2) such as cases per capita, deaths per capita and/or positivity rates that were available in the public domain4,5,6. (An exception is the UK, which engaged in small-scale testing at select airports that may have informed their policies.) However, such metrics are imperfect owing to under-reporting7, symptomatic population biases8,9 and reporting delays.

These drawbacks motivated our design and nationwide deployment of Eva: the first fully algorithmic, real-time, reinforcement learning system for targeted COVID-19 screening with the dual goals of identifying asymptomatic, infected travellers and providing real-time information to policymakers for downstream decisions.

Overview of the Eva system

Eva as presented here was deployed across all 40 points of entry to Greece, including airports, land crossings and seaports from 6 August to 1 November 2020. Figure 1 schematically illustrates its operation; Supplementary Fig. 7 provides a more detailed schematic diagram of Eva’s architecture and data flow.

Fig. 1: A reinforcement learning system for COVID-19 testing (Eva).

Arriving passengers submit travel and demographic information 24 h before arrival. On the basis of these data and testing results from previous passengers, Eva selects a subset of passengers to test. Selected passengers self-isolate for 24–48 h while laboratories process samples. Passengers testing positive are then quarantined and contact tracing begins; passengers testing negative resume normal activities. Results are used to update Eva to improve future testing and maintain high-quality estimates of prevalence across traveller subpopulations.

We next describe the main steps in processing a passenger.

Passenger locator form

All travellers must complete a passenger locator form (PLF; one per household) at least 24 h before arrival, containing (among other data) information on their origin country, demographics, and point and date of entry. Ref. 10 describes the exact fields and how these sensitive data were handled securely.

Estimating prevalence among traveller types

We estimate traveller-specific COVID-19 prevalence using recent testing results from previous travellers through Eva. Prevalence estimation entails two steps. First, we leverage LASSO (least absolute shrinkage and selection operator) regression from high-dimensional statistics11 to adaptively extract a minimal set of discrete, interpretable traveller types based on their demographic features (country, region, age and gender); these types are updated on a weekly basis using recent testing results. Second, we use an empirical Bayes method to estimate each type’s prevalence daily. Empirical Bayes has previously been used in the body of literature on epidemiology to estimate prevalence across many populations12,13. In our setting, COVID-19 prevalence is generally low (for example, ~2 in 1,000), and arrival rates differ substantively across countries. Combined, these features cause our testing data to be both imbalanced (few positive cases among those tested) and sparse (few arrivals from certain countries). Our empirical Bayes method seamlessly handles both challenges. Estimation details are provided in Supplementary Methods 2.2.
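The shrinkage idea behind the second step can be sketched as follows. This is a minimal illustration, assuming a Beta-binomial model whose prior is fitted by a crude method of moments on the observed positivity rates; the estimator actually used by Eva is specified in Supplementary Methods 2.2 and may differ. The function names and the toy counts are hypothetical.

```python
from statistics import mean, pvariance


def fit_beta_prior(positives, tests):
    """Fit a Beta(a, b) prior to per-type positivity rates by method of moments.

    Assumption: a crude fit that ignores binomial sampling noise in each rate.
    """
    rates = [x / n for x, n in zip(positives, tests) if n > 0]
    m, v = mean(rates), pvariance(rates)
    if v <= 0 or v >= m * (1 - m):
        strength = 1.0  # weak fallback prior when the moments are unusable
    else:
        strength = m * (1 - m) / v - 1  # a + b implied by the two moments
    return m * strength, (1 - m) * strength


def posterior_mean(x, n, a, b):
    """Posterior mean prevalence for a type with x positives in n tests."""
    return (a + x) / (a + b + n)


# Toy data (hypothetical): four traveller types, one with no tests at all.
positives = [0, 1, 4, 0]
tests = [500, 10, 1000, 0]
a, b = fit_beta_prior(positives, tests)
estimates = [posterior_mean(x, n, a, b) for x, n in zip(positives, tests)]
```

A type with few tests is pulled strongly toward the pooled rate, and a type with no tests simply inherits the prior mean, which is how sparse and imbalanced data can still yield a usable estimate for every type.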

Allocating scarce tests

Leveraging these prevalence estimates, Eva targets a subset of travellers for (group) PCR testing on arrival on the basis of their type alone, but no other personal information. The Greek National COVID-19 Committee of Experts approved group (Dorfman) testing14 in groups of five but eschewed larger groups and rapid testing owing to concerns over testing accuracy.
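To illustrate why pooling stretches a scarce test budget, the sketch below computes the expected number of PCR runs per traveller under classic two-stage Dorfman testing. It assumes a perfectly accurate test, independent infections within a pool and individual retesting of every member of a positive pool; none of these operational details is stated in the text, and the function name is hypothetical.

```python
def dorfman_tests_per_person(prevalence, group_size):
    """Expected PCR runs per traveller under two-stage Dorfman pooling.

    Assumptions: perfect test accuracy, independent infections, and every
    member of a positive pool is retested individually.
    """
    if group_size == 1:
        return 1.0  # no pooling: one test per traveller
    p_pool_positive = 1 - (1 - prevalence) ** group_size
    return 1 / group_size + p_pool_positive


# Prevalence of about 2 in 1,000 with pools of five, as quoted in the text.
per_person = dorfman_tests_per_person(0.002, 5)
```

At low prevalence almost every pool is negative, so the cost per traveller stays close to one fifth of an individual test.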

Eva’s targeting must respect various port-level budget and resource constraints that reflect Greece’s testing supply chain, which included 400 health workers staffing 40 points of entry, 32 laboratories across the country and delivery logistics for biological samples. These constraints were (exogenously) defined and adjusted throughout the summer by the General Secretariat of Public Health.

The testing allocation decision is entirely algorithmic and balances two objectives. First, given current information, Eva seeks to maximize the number of infected asymptomatic travellers identified (exploitation). Second, Eva strategically allocates some tests to traveller types for which it does not currently have precise estimates to better learn their prevalence (exploration). This is a crucial feedback step: today’s allocations determine the data that will be available to the prevalence estimation step above when it computes future estimates. Hence, if Eva simply (greedily) allocated tests to the types that currently have high estimated prevalence, then, within a few days, it would have no recent testing data on many other types with moderate prevalence. Since COVID-19 prevalence can spike quickly and unexpectedly, this would leave a ‘blind spot’ for the algorithm and pose a serious public health risk. Such allocation problems can be formulated as multi-armed bandits15,16,17,18, which are widely studied in the reinforcement learning literature and have been used in numerous applications such as mobile health19, clinical trial design20, online advertising21 and recommender systems22.

Our application is a nonstationary23,24, contextual25, batched bandit problem with delayed feedback26,27 and constraints28. Although these features have been studied in isolation, their combination and practical implementation pose unique challenges. One such challenge is accounting for information from ‘pipeline’ tests (allocated tests whose results have not yet been returned); we introduce a novel algorithmic technique of certainty-equivalent updates to model information we expect to receive from these tests, allowing us to effectively balance exploration and exploitation in nonstationary, batched settings. To improve interpretability, we build on the optimistic Gittins index for multi-armed bandits29; each type is associated with a deterministic index that represents its current ‘risk score’, incorporating both its estimated prevalence and uncertainty. Algorithm details are provided in Supplementary Methods 2.3.
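The interplay between a per-type index and certainty-equivalent updates can be sketched as below. This is an illustration under strong assumptions, not the algorithm of Supplementary Methods 2.3: each type carries a Beta(a, b) posterior, the index is a simple stand-in (posterior mean plus a multiple of the posterior standard deviation) rather than the optimistic Gittins index, and the port-level constraints are reduced to a single budget. All names are hypothetical.

```python
import math


def ce_update(a, b, n_pending):
    """Certainty-equivalent update: treat pending tests as if they will return
    the posterior-mean outcome. The mean is unchanged; uncertainty shrinks."""
    m = a / (a + b)
    return a + n_pending * m, b + n_pending * (1 - m)


def risk_index(a, b, z=1.0):
    """Stand-in index (NOT the optimistic Gittins index): mean + z * sd."""
    m = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return m + z * math.sqrt(var)


def allocate(posteriors, arrivals, budget, z=1.0):
    """Assign tests one at a time to the open type with the highest index,
    applying a certainty-equivalent update after each assignment."""
    state = dict(posteriors)
    alloc = {t: 0 for t in posteriors}
    for _ in range(budget):
        open_types = [t for t in state if alloc[t] < arrivals.get(t, 0)]
        if not open_types:
            break  # every arriving traveller is already selected
        best = max(open_types, key=lambda t: risk_index(*state[t], z))
        alloc[best] += 1
        state[best] = ce_update(*state[best], 1)
    return alloc


# Toy example (hypothetical types and counts).
posteriors = {"A": (2, 998), "B": (1, 99), "C": (1, 9)}
arrivals = {"A": 100, "B": 50, "C": 3}
allocation = allocate(posteriors, arrivals, budget=20)
```

Because each assigned test lowers the uncertainty component of a type's index without changing its mean, tests within a batch do not all pile onto the single most uncertain type, even though no results have been returned yet.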

Grey-listing recommendations

Eva’s prevalence estimates are also used to recommend particularly risky countries to be grey-listed, in conjunction with the Greek COVID-19 taskforce and the Presidency of the Government. Grey-listing a country entails a tradeoff: requiring a PCR test reduces the prevalence among incoming travellers, but it also reduces non-essential travel significantly (approximately 39%; Supplementary Methods 5), because of the relative difficulty/expense in obtaining PCR tests in summer 2020. Hence, Eva recommends grey-listing a country only when necessary to keep the daily flow of (uncaught) infected travellers at a sufficiently low level to avoid overwhelming contact-tracing teams30. Ten countries were grey-listed over the summer of 2020 (Supplementary Methods 5).

Unlike testing decisions, our grey-listing decisions were not fully algorithmic, but instead involved human input. Although in theory one might determine an ‘optimal’ cutoff for grey-listing to balance infected arrivals and reduced travel, in practice it is difficult to elicit such preferences from decision-makers directly. Rather, they preferred to retain some flexibility in grey-listing to consider other factors in their decisions.

Closing the loop

Results from the tests performed according to the test allocation step are logged within 24–48 h, and then used to update the prevalence estimates from the previous step.

To give a sense of scale, during peak season (August and September), Eva processed 41,830 (±12,784) PLFs each day, and 16.7% (±4.8%) of arriving households were tested each day.

Value of targeted testing

We first present the number of asymptomatic, infected travellers caught by Eva relative to random surveillance testing (that is, where every arrival at a port of entry is equally likely to be tested). Random surveillance testing was Greece’s initial proposal and is very common, partly because it requires no information infrastructure to implement. However, we find that such an approach comes at a significant cost to performance (and therefore public health).

We perform counterfactual analysis using inverse propensity weighting31,32, which provides a model-agnostic, unbiased estimate of the performance of random testing.
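The core of an inverse propensity weighting estimate can be sketched as follows. The sketch assumes that, for every traveller actually tested, we know the probability with which the deployed policy tested them and the probability with which the counterfactual policy would have; the record layout, function name and toy numbers are hypothetical, and the full analysis is described in Supplementary Methods 3.

```python
def ipw_counterfactual_catches(records):
    """Estimate infections an alternative policy would have caught.

    records: iterable of (positive, p_deployed, p_alternative) for travellers
    who were actually tested; positive is 1 or 0. Each observed positive is
    reweighted by p_alternative / p_deployed.
    """
    total = 0.0
    for positive, p_deployed, p_alternative in records:
        if p_deployed <= 0:
            raise ValueError("deployed policy must test every type with positive probability")
        total += positive * p_alternative / p_deployed
    return total


# Toy port (hypothetical): 100 high-risk and 900 low-risk arrivals, 100 tests.
# The targeted policy tests 50 of each group; random testing tests 10% of all.
records = (
    [(1, 0.5, 0.1)] * 5 + [(0, 0.5, 0.1)] * 45
    + [(1, 50 / 900, 0.1)] * 1 + [(0, 50 / 900, 0.1)] * 49
)
estimate = ipw_counterfactual_catches(records)  # compare with 6 caught by targeting
```

The estimate relies only on observed outcomes and known testing probabilities, which is why it needs no model of prevalence; it does require that the deployed policy tested every type with non-zero probability.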

During the peak tourist season, we estimate that random surveillance testing would have identified 54.1% (±8.7%) of the infected travellers that Eva identified. (For anonymity, averages and standard deviations are scaled by a (fixed) constant, which we have taken without loss of generality to be the actual number of infections identified by Eva in the same period for ease of comparison.)

In other words, to achieve the same effectiveness as Eva, random testing would have required 85% more tests at each point of entry, a substantive supply chain investment. In October, when arrival rates dropped, the relative performance of random testing improved to 73.4% (±11.0%; Fig. 2). This difference is largely explained by the changing relative scarcity of testing resources (Fig. 3). As arrivals dropped, the fraction of arrivals tested increased, thereby reducing the value of targeted testing. In other words, Eva’s targeting is most effective when tests are scarce. In the extreme case of testing 100% of arrivals, targeted testing offers no value as both random and targeted testing policies test everyone. See Supplementary Methods 3 for details.
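The 85% figure follows from the 54.1% estimate by simple arithmetic, assuming the number of infections caught by random testing scales linearly with the number of tests performed. The helper name below is hypothetical.

```python
def extra_tests_needed(relative_catch):
    """Fractional increase in tests for random testing to match Eva,
    assuming catches under random testing scale linearly with tests."""
    return 1 / relative_catch - 1


peak_extra = extra_tests_needed(0.541)  # peak-season estimate from the text
```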

Fig. 2: Comparing Eva versus randomized surveillance testing.

The number of infections caught by Eva (red) versus the estimated number of cases caught by random surveillance testing (teal). The peak (respectively, off-peak) season is 6 August to 1 October (respectively, 1 October to 1 November) and is denoted with triangular (respectively, circular) markers. Seasons are separated by the dashed line. The solid lines denote cubic-spline smoothing, with the 95% confidence intervals in grey.

Fig. 3: Relative efficacy of Eva over random surveillance versus fraction tested.

The ratio of the number of infections caught by Eva relative to the number of (estimated) infections caught by random surveillance testing, as a function of the fraction of tested travellers. The short-dashed (respectively, long-dashed) line indicates the average fraction tested during the peak (respectively, off-peak) tourist season. Triangular (circular) markers denote estimates from peak (off-peak) days. The solid blue line denotes cubic-spline smoothing, with the 95% confidence interval in grey.

Value of reinforcement learning

We now compare the performance of Eva with that of policies that require infrastructure similar to Eva’s, namely PLF data, but instead target testing based on population-level epidemiological metrics (for example, as proposed by the European Union2) rather than reinforcement learning. The financial investment required by such approaches is similar to that of Eva, yet we show that these policies identify fewer cases. (Supplementary Methods 3.2.3 highlights additional drawbacks of these policies, including poor data reliability and a mismatch in prevalence between the general population and the asymptomatic traveller population.)

We consider three separate policies that test passengers with probability proportional to cases per capita, deaths per capita or positivity rates for the passenger’s country of origin4,5,6, while respecting port budgets and arrival constraints. We again use inverse propensity weighting to estimate counterfactual performance (Fig. 4).

Fig. 4: Comparing Eva to policies based on epidemiological metrics.

The lines represent cubic-spline smoothing of daily infections caught for each policy; raw points are shown only for Eva and the ‘Cases’ policy for clarity. The dashed line separates the peak (6 August to 1 October) and off-peak (1 October to 1 November) tourist seasons. The inset table shows the relative improvement of Eva over a policy based on the indicated epidemiological metric with the same testing budget for both the peak season and the off-peak season.

During the peak tourist season (August and September), we found that policies based on cases, deaths and positivity rates identified 69.0% (±9.4%), 72.7% (±10.6%) and 79.7% (±9.3%), respectively, of the infected travellers that Eva identified per test. In other words, Eva identified 1.25×–1.45× more infections with the same testing budget and similar PLF infrastructure. In October, when arrival rates dropped, the relative performance of counterfactual policies based on cases, deaths and positivity rates improved to 91.5% (±11.7%), 88.8% (±10.5%) and 87.1% (±10.4%), respectively. Like our results in the previous section, our findings show that the value of smart targeting is larger when testing resources are scarcer. In fact, Eva’s relative improvement over these policies was highest in the second half of the peak season (when infection rates were much higher and testing resources were scarcer). See Supplementary Methods 3 for details.

Supplementary Methods 4 discusses possible reasons underlying the poor performance of simple policies based on population-level epidemiological metrics, including reporting delays and systematic differences between the general and asymptomatic traveller populations.

Poor predictive power of epidemiological metrics

Given the poor performance of simple policies based on population-level epidemiological metrics, a natural question is whether more sophisticated functions of these metrics would perform better. Although it is difficult to eliminate this possibility, we argue that this is probably not the case through a related analysis of the extent to which population-level epidemiological metrics can be used to predict COVID-19 prevalence among asymptomatic travellers as measured by Eva. Surprisingly, our findings suggest that widely used epidemiological data are generally ineffective in predicting the actual prevalence of COVID-19 among asymptomatic travellers (the group of interest for border control policies).

Specifically, we examine the extent to which these data can be used to classify a country as high risk (more than 0.5% prevalence) or low risk (less than 0.5% prevalence); such a classification informs whether a country should be grey- or black-listed. (A cutoff of 0.5% was typical for initiating grey-listing discussions with the Greek COVID-19 taskforce, but our results are qualitatively similar across a range of cutoffs.) We compute the true label for a country at each point in time on the basis of Eva’s (real-time) estimates. We then train several models using a gradient boosted machine33 on different subsets of covariates derived from the 14-day time series of cases per capita, deaths per capita, testing rates per capita and testing positivity rates. Figure 5 summarizes their predictive accuracy; we obtained similar results for other state-of-the-art machine learning algorithms.

Fig. 5: Predictive power of publicly reported epidemiological metrics.

Each of the models 1–4 uses a different subset of features from: 14-day time series of cases per capita, deaths per capita, tests performed per capita and testing positivity rate. Model 5 additionally includes country fixed effects to model country-level idiosyncratic behaviour. Models 1–4 are essentially no better than random prediction, while model 5 achieves slightly better performance. See Supplementary Methods 4.1 for details on model construction and features used in each model. AUROC, area under the receiver operating characteristic curve.

Note that a random model that uses no data has an area under the receiver operating characteristic curve of 0.5. Thus, models 1–4 offer essentially no predictive value, suggesting that these population-level epidemiological data are not informative of prevalence among asymptomatic travellers.
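The 0.5 baseline can be checked directly from the definition of the metric: the area under the receiver operating characteristic curve equals the probability that a randomly chosen high-risk example receives a higher score than a randomly chosen low-risk one, with ties counted as one half. The pure-Python helper below is an illustrative sketch, not the evaluation code used for Fig. 5; its name and the toy labels are hypothetical.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half a correct pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# A model that ignores the data and assigns every country the same score.
uninformative = auroc([0, 1, 0, 1], [0.3, 0.3, 0.3, 0.3])
# A model that separates the two classes perfectly.
perfect = auroc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
```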

Model 5, which additionally uses country-level fixed effects, offers some improvement. These fixed effects collectively model country-specific idiosyncrasies representing aspects of their testing strategies, social distancing protocols and other non-pharmaceutical interventions that are unobserved in the public, epidemiological data. The improvement of model 5 suggests that these unobserved drivers are critical to distinguishing high- and low-risk countries.

Overall, this analysis raises concerns not only about travel protocols proposed by the European Union2 based solely on widely used epidemiological metrics, but also about any protocol that treats all countries symmetrically. Indeed, the idiosyncratic effects of model 5 suggest that the threshold for deciding whether COVID-19 prevalence in travellers from country A is spiking may differ significantly from that of country B. See Supplementary Methods 4.1 for details.

In Supplementary Methods 4.3, we also study the information delay between a country’s publicly reported cases (the most common metric) and prevalence among asymptomatic travellers from that country. We expect a lag because of the time taken for symptoms to manifest, and reporting delays induced by poor infrastructure. We find a modal delay of 9 days.

Value of grey-listing

Eva’s measurements of COVID-19 prevalence were also used to provide early warnings for high-risk regions, in response to which Greece adjusted travel protocols by grey-listing these nations. We estimate that Eva prevented an additional 6.7% (±1.2%) infected travellers from entering the country through its early grey-listing decisions in the peak season; results in the off-peak season are similar. For privacy, we have expressed the benefit relative to the number of infected travellers identified by Eva. See Supplementary Methods 5 for details.

Lessons learned from deployment and design

Eva is a large-scale data-driven system that was designed and deployed during the COVID-19 crisis. Leading up to and throughout deployment, we met twice a week with the COVID-19 Executive Committee of Greece, an interdisciplinary team of scientists and policymakers. Through those meetings, we gleaned several lessons that shaped Eva’s design and contributed to its success.

Design the algorithm around data minimization

Data minimization (that is, requesting the minimum required information for a task) is a fundamental tenet of data privacy and the General Data Protection Regulation (GDPR). We met with lawyers, epidemiologists and policymakers before designing the algorithm to determine what data and granularity may legally and ethically be solicited by the PLF. Data minimization naturally entails a tradeoff between privacy and effectiveness. We limited requests to features thought to be predictive on the basis of the best available research at the time (origin, age and gender34,35), but omitted potentially informative but invasive features (for example, occupation). We further designed our empirical Bayes estimation strategy around these data limitations.

Prioritize interpretability

For all parties to evaluate and trust the recommendations of a system, the system must provide transparent reasoning. An example from our deployment was the need to communicate the rationale for ‘exploration’ tests (that is, tests for types with moderate but very uncertain prevalence estimates). Such tests may seem wasteful. Our choice of empirical Bayes allowed us to easily communicate that types with large confidence intervals may have significantly higher risk than their point estimate suggests, and thus require some tests to resolve uncertainty (see, for example, Supplementary Figs. 9 and 11, which were featured on policymakers’ dashboards).

A second example was our choice to use Gittins indices, which provide a simple, deterministic risk metric for each type that incorporates both estimated prevalence and corresponding uncertainty, driving intuitive test allocations. In contrast, using upper-confidence-bound or Thompson sampling with logistic regression36,37 would have made it more difficult to visualize uncertainty (a high-dimensional ellipsoid or posterior distribution), and test allocations would depend on this uncertainty through an opaque computation (a high-dimensional projection or stochastic sampling).

This transparency fostered trust across ministries of the Greek Government using our estimates to inform downstream policymaking, including targeting contact-tracing teams, staffing of mobile testing units and adjusting social distancing measures.

Design for flexibility

Finally, as these systems require substantial financial and technical investment, they need to be flexible to accommodate unexpected changes. We designed Eva in a modular manner, decoupling type extraction, estimation and test allocation. Consequently, one module can easily be updated without altering the remaining modules. For example, had vaccine distribution begun in the summer of 2020, we could have defined new types based on passengers’ vaccination status without altering our procedure for prevalence estimates or test allocation. Similarly, if rapid testing were approved, our allocation mechanism could be updated to neglect delayed feedback without affecting other components. This flexibility promotes longevity, as it is easier to get stakeholder buy-in for small adjustments to an existing system than for a substantively new approach.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.