Initiatives to gather massive epidemiological datasets aim to cut through national COVID-19 statistics in a bid to understand the new coronavirus and aid public health policymakers.
Two population-scale COVID-19 studies in the UK, OpenSAFELY and the COVID-19 Symptom Study, are the first robust demonstrations of the power of big data to cut through the confusing mass of information and statistics generated by the pandemic to uncover biological signals. Although machine learning has the potential to deepen researchers’ understanding of this new virus and how it affects its human host, most platforms capable of accurately analyzing real-world data remain immature, owing to the scarcity of the high-quality data needed to develop such models. And although many artificial intelligence (AI) initiatives may have little impact on the present crisis, in the long run some could profoundly shape future pandemic preparedness.
The OpenSAFELY team analyzed the National Health Service (NHS) electronic health records of over 17 million individuals — about 40% of England’s adult population — to identify the main risk factors for COVID-19-related death. In addition to old age and the presence of underlying medical conditions, they found that Black or South Asian ethnicity was among the main risk factors for mortality. The COVID-19 Symptom Study collected self-reported data from over 2.6 million people using a smartphone app developed by King’s College London and London-based AI firm Zoe Global. By analyzing real-time reports from people with and without symptoms, the researchers established the combination of symptoms most likely to predict infection — including loss of taste and smell, which the group was the first to flag as a COVID-19 symptom.
Because of their scale, both studies have a statistical rigor that has been lacking in many other COVID-19 population studies. One of the most troubling uncertainties weighing on epidemiologists is the confusion about the actual number of cases. Behind the slick interfaces of the many dashboards developed to track the virus in real time lies a morass of messy, inconsistent data, which renders any cross-country comparisons invidious. “We’ve ended up with a jumble of metrics, some of which are useful, some of which are not useful,” says Murray Aitken, executive director at the IQVIA Institute for Human Data Science. Patchy approaches to testing and uncertainties about both asymptomatic carriers and how long immunity persists in those who have recovered from the infection mean that epidemiologists’ understanding of the pandemic remains incomplete. The uncertainty is compounded by divergent reporting standards across and within countries and by the constant revision of official case data. “I can’t emphasize enough how chaotic the situation is,” says Aditya Prakash, associate professor in the Georgia Institute of Technology’s School of Computational Science and Engineering.
Traditional statistical analysis and machine learning can help to clean up messy or incomplete data. Zoe and Intellegens are among several firms (Table 1) repurposing machine learning capabilities developed in other disciplines for pandemic forecasting. Intellegens, which has been focused on a range of industries, including materials science and drug discovery, received funding from Innovate UK to apply its deep learning algorithm Alchemite to build a predictive tool for governments and healthcare providers. The aim is to improve the algorithm’s forecasting accuracy so analysts can assess the likely impacts of different policy interventions.
Zoe was set up not for COVID-19 but for weight loss and dietary health. Its initial focus was to apply machine learning to analyze individuals’ metabolic responses to food. Its COVID-19 work has demonstrated that it is feasible to gather high-quality data directly from users. “This is real citizen science,” says Zoe CEO Jonathan Wolf. “You can collect data from millions of people today,” he says. In addition to the symptom study, which was led by Zoe’s cofounder and long-term collaborator Tim Spector of King’s College London, the company has also recruited 800,000 volunteers to participate in a clinical study to explore whether machine learning could serve as a digital diagnostic for COVID-19. Obtaining a data signal is difficult at present, however, because of the relatively low levels of coronavirus circulating in the United Kingdom.
IQVIA and xCures have also been running community-based population studies, albeit at a smaller scale. IQVIA has so far recruited about 20,000 volunteers for its web-based CARE registry study and is following participants longitudinally. “The main focus is on progression of symptoms and severity,” says Nancy Dreyer, CSO at IQVIA’s Real World Solutions arm. “We have very little information in the epidemic about symptoms and symptom severity outside of the hospital setting.” xCures’ BEAT19 study has recruited about 4,000 volunteers in the United States and is now tapping into Brazilian data through relationships with several medical centers there.
Machine learning is a relatively new addition to epidemiology forecasting. It is not a replacement for classical mechanistic modeling, but it can integrate unstructured and informal data from multiple data streams and can uncover hidden patterns within diverse datasets. In contrast, mechanistic models draw on more limited datasets but incorporate an understanding of the transmission dynamics of the disease outbreak. As Inga Holmdahl and Caroline Buckee, of the Harvard T.H. Chan School of Public Health, note in a recent Perspective, the two approaches address different questions and have different limitations. Machine-learning forecasts are more suited to short-term predictions, which can help prioritize the allocation of healthcare resources, for example, whereas mechanistic models are used to examine long-term trends and the impacts of policy measures, such as encouraging social distancing or mandating face masks. The distinction between the two approaches is becoming blurred, Prakash says, by hybrid approaches that include elements of both.
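The mechanistic side of this divide can be illustrated with the textbook ‘susceptible, infected and recovered’ compartmental model. The sketch below uses arbitrary parameter values (a transmission rate of 0.4 per day and a recovery rate of 0.1 per day, i.e., a basic reproduction number of 4) chosen only to show the mechanics; nothing here is fitted to COVID-19 data.

```python
# Minimal deterministic SIR model, integrated with simple Euler steps.
# beta: transmission rate per day; gamma: recovery rate per day.
def simulate_sir(beta=0.4, gamma=0.1, n=1_000_000, i0=100, days=300, dt=0.1):
    s, i, r = float(n - i0), float(i0), 0.0
    history = []
    for step in range(int(days / dt)):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        if step % int(1 / dt) == 0:  # record once per simulated day
            history.append((s, i, r))
    return history

history = simulate_sir()
peak_day = max(range(len(history)), key=lambda d: history[d][1])
print(f"epidemic peaks around day {peak_day}")
```

Because the model encodes transmission dynamics explicitly, changing a single parameter (for example, lowering beta to mimic social distancing) directly changes the projected long-term trajectory — the kind of policy question mechanistic models are used to probe.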
The annual FluSight challenge on influenza forecasting run by the US Centers for Disease Control and Prevention, in which Prakash participates, provides an important testbed for assessing new approaches to disease forecasting. Prakash is also one of several dozen researchers working on a large-scale computational epidemiology initiative funded by the US National Science Foundation, which aims to build sophisticated network models, operating at multiple scales and on multiple data layers, to develop insights into the control of epidemics and pandemics. The C3.ai Digital Transformation Institute, an AI research consortium founded in March by the AI software firm C3.ai, Microsoft and several US academic institutions, is also developing modeling and AI-based tools to mitigate pandemics. Its first research awards are focused on a broad swathe of topics that intersect with COVID-19, including social issues, such as housing precariousness and the social determinants of health, as well as technical problems, such as mathematical modeling and computational biology.
For example, Vince Poor, professor of electrical engineering at Princeton University, and collaborators at Princeton, Carnegie Mellon University and the University of Pennsylvania are applying network engineering concepts to model the epidemiology of COVID-19. A key element of their approach is to incorporate a more nuanced description of R0, which is the average number of new infections expected to arise in a naïve (fully susceptible) population from one infected individual. By modifying the ‘susceptible, infected and recovered’ (SIR) model, they aim to develop a more accurate picture of the spread of the virus. “Instead of applying a uniform R0 across the entire population, the idea is to apply a probability distribution to the transmissibility of each individual in the population,” he says.
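The Princeton team’s model is not described in detail here, but the core idea — replacing a single uniform R0 with a per-individual distribution of transmissibility — can be sketched as a simple branching process. Everything in the sketch below (gamma-distributed individual reproduction numbers, the dispersion parameter k, the outbreak-size threshold) is an illustrative assumption, not their implementation:

```python
import math
import random

def poisson(lam, rng):
    """Knuth's algorithm for drawing a Poisson-distributed count."""
    if lam <= 0:
        return 0
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= limit:
            return k - 1

def outbreak_dies_out(r0, k, rng, threshold=1000):
    """Branching-process sketch: each case draws a personal
    transmissibility from a gamma distribution with mean r0 and
    shape k (small k = strong individual variation), then infects
    a Poisson number of new cases. Returns True if the chain goes
    extinct before reaching `threshold` cumulative cases."""
    cases, total = 1, 1
    while cases > 0:
        new = 0
        for _ in range(cases):
            personal_r = rng.gammavariate(k, r0 / k)  # mean r0
            new += poisson(personal_r, rng)
        cases = new
        total += new
        if total >= threshold:
            return False  # a major outbreak took off
    return True  # the chain fizzled out

rng = random.Random(42)
n = 500
extinct_hetero = sum(outbreak_dies_out(2.5, 0.1, rng) for _ in range(n)) / n
extinct_homog = sum(outbreak_dies_out(2.5, 50.0, rng) for _ in range(n)) / n
print(f"extinction fraction, heterogeneous transmissibility: {extinct_hetero:.2f}")
print(f"extinction fraction, near-uniform transmissibility:  {extinct_homog:.2f}")
```

Even with the same average R0, strong individual variation makes most introductions fizzle out while a few seed explosive clusters — a qualitatively different picture from the uniform-R0 assumption, which is the kind of nuance such modified models aim to capture.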
At a completely different scale, genomics represents another domain that may be amenable to analysis with machine learning. In response to the COVID-19 pandemic, Adaptive Biotechnologies and Microsoft have extended an existing alliance to map T cell receptor (TCR) sequences to specific disease states. Adaptive is making TCR sequence data from de-identified, geographically and ethnically diverse COVID-19-infected people available to vaccine and drug developers, to enable them to assess T cell responses in clinical trials. At the same time, the partners are developing a COVID-19 diagnostic powered by machine learning, which works by identifying all possible TCRs in a blood sample capable of binding a SARS-CoV-2 antigen. “Our big thesis is [that] the immune system is really, really good at detecting disease,” says Harlan Robins, cofounder and CSO at Adaptive. Because TCRs recognize only antigen fragments presented by major histocompatibility complex (MHC) molecules, the ‘rules’ for understanding TCR–antigen recognition are constrained. “It’s a hard problem, but it’s tractable,” Robins says. The company aims to ship the test during the fourth quarter. “The diagnostic is complete. Now we have to validate it in a way the FDA agrees to.”
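Adaptive’s classifier is proprietary and far more sophisticated, but the underlying idea — scoring a blood sample by how many of its TCR sequences match a reference set associated with SARS-CoV-2 — can be caricatured in a few lines. The sequences and the positivity threshold below are invented purely for illustration:

```python
# Hypothetical sketch: flag a sample as likely SARS-CoV-2-exposed if its
# T cell repertoire contains enough receptor sequences from a reference
# set of virus-associated TCRs. These CDR3-style strings are made up.
COVID_ASSOCIATED_TCRS = {"CASSLGQAYEQYF", "CASSPGTGELFF", "CSARDRTGNGYTF"}

def classify_repertoire(repertoire, reference=COVID_ASSOCIATED_TCRS,
                        min_hits=2):
    """Count distinct repertoire sequences found in the reference set;
    call the sample positive if the count reaches min_hits."""
    hits = len(set(repertoire) & reference)
    return hits >= min_hits, hits

sample = ["CASSLGQAYEQYF", "CASSIRSSYEQYF", "CSARDRTGNGYTF"]
positive, hits = classify_repertoire(sample)
print(positive, hits)  # -> True 2
```

A real diagnostic would weight matches statistically rather than counting them, and the hard machine learning problem is building the reference map of which TCRs respond to which antigens in the first place.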
AI-driven diagnostics based on lung imaging have received substantial attention, although some critics argue that the field is still at a nascent stage. “It’s so easy to create machine learning models,” says Roger Noble, founder and CEO of Oxford-based Zegami, which combines development of machine learning tools with image management and analysis software. “The difficulty is in making sure that they work correctly, that they’re unbiased, that they’re fair, and they work against the real-world data that they’re seeing, not just the training and validation data that’s been collected maybe in a more sterile environment.”
Sophia Genetics, a Lausanne, Switzerland-based firm that cut its teeth in AI-powered analysis of genomic data for cancer, is convinced that a multimodal approach to COVID-19 will generate useful insights. This involves layering lung imaging data on top of other clinical and host and viral genomic data to predict the course of disease in patients. “You can probably only make predictions if you have multimodal data,” says Philippe Menu, chief medical officer at Sophia Genetics.
In other domains, such as facial recognition, computer vision and police profiling, AI has been widely criticized because of a range of biases that result in racially skewed outcomes. In healthcare, those who would introduce AI in ways that may alter clinical practice have a high bar to clear in terms of equity as well as safety and efficacy. The string of retractions associated with Surgisphere, the notorious healthcare analytics firm that claimed to have built a global database of electronic health records from hundreds of hospitals, epitomizes the fast-and-loose culture of data publishing that has thrived during the COVID-19 pandemic.
At the same time, clinging to the status quo may not represent the best response either. The OpenSAFELY project, which is jointly led by Ben Goldacre, director of the Evidence-Based Medicine DataLab at the University of Oxford, and Liam Smeeth, professor of clinical epidemiology at the London School of Hygiene and Tropical Medicine, demonstrates the potential of unlocking the enormous data resources of the UK’s NHS to better understand and improve health outcomes. Goldacre decries the “insufficient focus on the implementation of data science in healthcare.” The present crisis has lowered the administrative and cultural barriers that in more normal times would have either prevented or delayed the implementation of such a study. Crucially, its design and implementation have been exemplary, in stark contrast to the discredited studies based on Surgisphere’s data. A handful of AI-based diagnostics in several indication areas have already gained approval, but the regulatory framework for assessing the utility of AI-based technologies in healthcare is still evolving. Whichever way these emerging technologies are assessed, the usual clinical parameters of safety and efficacy will still apply.
Of course, no smart technologies or sophisticated models can serve any useful purpose if political leaders and their supporters continue to flout basic public health guidance. The extraordinary trajectory of the pandemic in the United States and several other countries led by populist politicians who pay little heed to scientific advice is a stark illustration of the price ordinary people pay when the public health arena is transformed into a political battleground. The chaos that is a feature of the COVID-19 pandemic in some countries is as much a function of bad politics as it is of inconsistent reporting. No technology can alleviate that fundamental problem.
Sheridan, C. Massive data initiatives and AI provide testbed for pandemic forecasting. Nat Biotechnol 38, 1010–1013 (2020). https://doi.org/10.1038/s41587-020-0671-4