The UK Biobank database contains health data from more than half a million people. It is an invaluable resource for understanding diseases and developing predictive machine-learning algorithms (see also Nature 562, 163–164; 2018). In our experience, however, the proprietary data formats used by manufacturers of medical equipment are obstructing the sharing of clinical data.
In ophthalmology, for example, the Eye and Vision consortium has collected data for more than 100,000 individuals from the UK Biobank data set — one of the largest annotated data sets available. Unfortunately, a key data modality — from optical coherence tomography (OCT) — is inaccessible. It is present only in a proprietary format that is defined by the manufacturer of the OCT machine.
We obtained approval to develop new machine-learning algorithms for age-related macular degeneration, only to find that the OCT data could not be read without the manufacturer’s commercial software suite. That software does not allow bulk processing. Moreover, the data cannot be read using analytics languages such as Python or R.
Manufacturers must fall in line with open-data obligations if they are not to jeopardize the design of future cohort studies.
Nature 565, 429 (2019)