There is art in ‘big data’: in the poetic claims that it competes in volume with all the stars in the firmament. And in the seductive potential of its exponential, uncontrolled, ungraspable growth to improve our lives, by allowing medical treatments to be developed and approved more quickly and, ultimately, by enabling truly personalized medicine.

But at a workshop held in London earlier this month by the European Medicines Agency, it was apparent to all just how much science has to happen to make this beautiful future a reality. Patient groups and research scientists attended, alongside computational heavyweights from IBM Watson Health and Google Cloud Platform. Together, they tackled chewy questions to which there are few answers.

How many data are ‘enough’ to reliably predict clinical effect? Which data sets can be useful? How can they be managed? What’s the best way to win the confidence of public and regulators? And, crucially, is academia training enough mathematicians and medical-data scientists, who will have to develop and harness all this new potential? The last of these questions at least has a clear answer: no.

Big data sets in medicine include genomics, transcriptomics and proteomics (which, respectively, describe our genomes, identify which of our genes are being expressed and catalogue all of the proteins in a tissue sample). Genomic data sets alone have already shown their value. The presence or absence of a particular gene variant can put people in high- or low-risk groups for various diseases and, in some cases, can identify which people with cancer are likely to respond to certain drugs.

But a single molecular data set will not contain sufficient information to tell the whole story of an individual’s medical fate. Integration of different types of molecular data might tell more, but that remains a computational challenge. Even more would emerge if an individual’s molecular data were placed in the context of their physiology, behaviour and health. Electronic health records, the numbers of which are skyrocketing, could be useful here. So could disease registries, hospital and health-insurance records, as well as research publications and clinical-trial data. New types of data set come from wearables and apps, which collect health data directly from individuals, and from genome sequences of volunteers. Some researchers are even trying to extract medically relevant data from social-media platforms such as Twitter.

This adds up to a mind-boggling volume of data. According to one estimate presented at the workshop, clinical data from a single individual amount to around 0.4 terabytes over a lifetime, genomic data to around 6 terabytes, and additional data to around 1,100 terabytes. By 2020, the total amount of health-related data gathered will double every 73 days. Health professionals will confront more data than do those in finance.
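
Taken at face value, those figures lend themselves to some simple arithmetic. The sketch below (in Python, using only the per-lifetime estimates and the 73-day doubling time quoted above) is illustrative rather than a precise forecast.

```python
# Back-of-the-envelope sketch using the per-lifetime estimates quoted above.
clinical_tb = 0.4       # clinical records per individual, in terabytes
genomic_tb = 6.0        # genomic data per individual, in terabytes
additional_tb = 1100.0  # all other health-related data in the estimate

per_person_tb = clinical_tb + genomic_tb + additional_tb
print(f"Estimated health data per person: {per_person_tb:,.1f} TB")  # ~1,106.4 TB

# A 73-day doubling time implies the total pool grows by a factor of
# 2 ** (365 / 73) = 2 ** 5 = 32 each year.
annual_growth_factor = 2 ** (365 / 73)
print(f"Implied annual growth factor: {annual_growth_factor:.0f}x")
```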

To collect and hold all of this information within strict privacy regulations — non-negotiable for medical data — is another challenge. Some tech firms, such as IBM Watson Health and Hewlett-Packard, are building systems that keep data local — algorithms can dip into them, but the data are not transmitted anywhere else. Google, unsurprisingly, thinks that all these data are safer on the Cloud.
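
One way to picture the ‘keep data local’ approach is the generic sketch below: each site runs the analysis on its own records and shares only aggregate summaries. This is a schematic illustration of that idea, not a description of any vendor’s actual system.

```python
# Schematic 'keep the data local' analysis: each hospital computes a summary
# on-site, and only the aggregates (never the raw records) leave the site.
from statistics import mean

def local_summary(patient_ages):
    """Runs inside one hospital; the raw records never leave the site."""
    return {"n": len(patient_ages), "mean_age": mean(patient_ages)}

def combine(summaries):
    """Runs centrally, on aggregate summaries only."""
    total = sum(s["n"] for s in summaries)
    pooled_mean = sum(s["n"] * s["mean_age"] for s in summaries) / total
    return {"n": total, "mean_age": pooled_mean}

site_a = local_summary([34, 58, 71, 45])   # hypothetical on-site records
site_b = local_summary([62, 29, 80])
print(combine([site_a, site_b]))           # only summaries were shared
```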

The big question for scientists is how to take the next step and convert these artistic sketches of potential into scientific knowledge. Data sets vary in their reliability. Those derived from skimming social media will be very messy and will have to prove their usefulness. And large data sets, reliable or not, inevitably throw up spurious correlations; recognizing meaningful patterns therefore requires a deep understanding of biology, something that software developers do not generally have. Future biologists (funders and universities should note) will need much more training in mathematics and data science.
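
The spurious-correlation pitfall is easy to demonstrate. The minimal simulation below (an illustrative sketch, not data from the workshop) compares a purely random ‘clinical outcome’ against thousands of equally random ‘molecular’ features and still finds apparently strong associations.

```python
# With enough unrelated variables, some will correlate with an outcome purely
# by chance. Every value here is random noise; any 'signal' found is spurious.
import numpy as np

rng = np.random.default_rng(0)
n_patients, n_features = 100, 10_000
outcome = rng.normal(size=n_patients)                  # random 'clinical outcome'
features = rng.normal(size=(n_patients, n_features))   # random 'omics' features

# Pearson correlation of each feature with the outcome.
outcome_z = (outcome - outcome.mean()) / outcome.std()
features_z = (features - features.mean(axis=0)) / features.std(axis=0)
correlations = features_z.T @ outcome_z / n_patients

strongest = float(np.abs(correlations).max())
n_apparent = int((np.abs(correlations) > 0.3).sum())
print(f"Strongest correlation found in pure noise: {strongest:.2f}")
print(f"Features with |r| > 0.3 despite no real signal: {n_apparent}")
```

With only 100 simulated patients and 10,000 features, a number of features clear a correlation threshold that would look meaningful in isolation; telling such artefacts apart from real biology is exactly where domain expertise comes in.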

Big data has the exciting potential to allow clinical trials to be conducted partly in silico — which would mean using fewer animals in drug testing as well as recruiting fewer patients to the actual trials. Yet the field is developing at a time when public trust in experts is at an all-time low.

Regulators are ready to be persuaded to accept computational information in clinical trials. On occasion, the European Medicines Agency and the US Food and Drug Administration already accept it for pharmacokinetics, one of the simplest data sets that drug developers must supply to agencies. But the trust of doctors, patients and regulators in the abstract informatics and mathematical know-how that go into developing scientific and clinical predictions cannot be taken for granted. As medics and researchers nudge the field gradually from poetry and seduction to delivery, they must engage the next generation of scientists and the public alike, informing and educating them in the art and science of what is possible.