This year, a study of ten million people revealed the power of longitudinal research. A team mined a vast medical database of US military personnel, including tissue samples and health records collected over 20 years. The researchers found that those who had previously been infected with Epstein–Barr virus had a 32-fold higher risk of developing multiple sclerosis, the strongest evidence yet for a causal link (K. Bjornevik et al. Science 375, 296–301; 2022).

Such massive medical databases could have the power to identify causes of cancer — but the bulk of cancer registries fall short. They include too few individuals or are missing people’s past medical histories. They are insufficient to untangle the many risk factors for some cancers, or why the disease progresses differently in different people.

Right now, for example, it’s hard to determine how people from minority ethnic groups respond to therapies or which risk factors are unique to their cancers. In the United States, Black men are 50% more likely than white men to develop prostate cancer and are twice as likely to die of it. Without large, diverse data sets, we can’t identify unique, targetable genetic or molecular features or lifestyle factors that underlie increased cancer risk in this group and others.

The dearth of data is a major obstacle in my own efforts to identify causes of soft-tissue sarcomas, rare but devastating tumours in muscle, fat, connective tissue and blood vessels. Standard treatment involves radiation therapy, major surgery or even amputation. Immunotherapies are emerging, but only some people respond well. Researchers have struggled to explain whether the differences in responses are due to genetics, features of the immune system or past medical events. To find out, we need a comprehensive registry of individuals with sarcoma.

The ideal cancer registry would aggregate information from millions of consenting participants; include populations of different ancestry and socio-economic status; collect information going forwards from the time of cancer diagnosis — including imaging, tissue samples and genetic data; and capture participants’ histories by automatically linking to their complete medical records. With these detailed profiles, we could trace cancer diagnoses, health effects and risk of death back to potential risk factors.

No registries fit the bill right now. Long-standing ones, such as those in Norway and Denmark, are relatively small, and often collect only information from the time of diagnosis. The UK Biobank and the Count Me In initiative of the Broad Institute in Cambridge, Massachusetts, do link cancer diagnoses with medical histories, but participants must be willing to enrol; provide scans, bloodwork and other data; and undergo repeated medical assessments.

To bring together information at this scale, the cancer research community must take advantage of technologies that can link disparate but relevant data. As a scientific consultant at Datavant in San Francisco, California, a company that provides data-linkage tools, I am predisposed to say that. But the past decade has seen major improvements in methods to link health data and preserve patient privacy. An individual’s records can be connected across many databases by replacing personally identifiable information with an encrypted ‘token’ and by matching participant-specific tokens across data sets. This technology is widely used in the pharmaceutical sector and to link hospital medical records with private health-insurance claims. All corporate, government and academic stakeholders should ensure that the software is continually improved and that any new confidentiality risks are addressed by privacy experts.
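To make the token-matching idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any vendor's product: it uses a keyed hash (HMAC-SHA256) in place of whatever encryption scheme a real linkage tool applies, and the field names, the secret key, the helper functions (`normalize`, `make_token`, `tokenize`, `link`) and the sample records are all hypothetical.

```python
import hashlib
import hmac
import unicodedata

# Hypothetical shared secret. A real deployment would keep this with a
# trusted key custodian, never in source code.
SECRET_KEY = b"demo-key-not-for-production"

# Hypothetical set of identifying fields that are replaced by the token.
ID_FIELDS = ("first_name", "last_name", "birth_date")


def normalize(value: str) -> str:
    """Lowercase, strip accents and drop non-alphanumeric characters so that
    trivial formatting differences do not change the token."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.lower() if ch.isalnum())


def make_token(first_name: str, last_name: str, birth_date: str,
               key: bytes = SECRET_KEY) -> str:
    """Derive a token from identifying fields using a keyed hash (assumption:
    HMAC-SHA256 stands in for the tool's actual encryption)."""
    message = "|".join(normalize(p) for p in (first_name, last_name, birth_date))
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).hexdigest()


def tokenize(records: list[dict], key: bytes = SECRET_KEY) -> list[dict]:
    """Replace the identifying fields of each record with a token."""
    out = []
    for rec in records:
        token = make_token(*(rec[f] for f in ID_FIELDS), key=key)
        payload = {k: v for k, v in rec.items() if k not in ID_FIELDS}
        out.append({"token": token, **payload})
    return out


def link(left: list[dict], right: list[dict]) -> list[dict]:
    """Join two tokenized data sets on their shared token."""
    by_token = {r["token"]: r for r in right}
    return [{**l, **by_token[l["token"]]} for l in left if l["token"] in by_token]


# Made-up records: the same person appears in both sources with different
# spelling and capitalization.
registry = [
    {"first_name": "José", "last_name": "Álvarez", "birth_date": "1970-03-02",
     "diagnosis": "sarcoma"},
    {"first_name": "Ann", "last_name": "Lee", "birth_date": "1985-11-30",
     "diagnosis": "sarcoma"},
]
claims = [
    {"first_name": "jose", "last_name": "ALVAREZ", "birth_date": "1970-03-02",
     "prescription": "drug-A"},
]

linked = link(tokenize(registry), tokenize(claims))
print(linked)
```

In this sketch, the two sources never exchange names or birth dates, only tokens, and a party without the key cannot recompute a token from a guessed identity. Exact matching of this kind misses records with typos or changed surnames, which is one reason real tools need continual improvement and review by privacy experts.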

A huge obstacle to building better cancer registries in the United States is fragmented health care. A patient might see community oncologists for routine visits and doctors at academic medical centres for cutting-edge treatments. But the records from those visits are not electronically linked. Information on comorbidities, prescription medications, family medical history and previous diagnoses is scattered. Aggregating these data manually is time-consuming and expensive. It took a student in my laboratory six months to sort through the medical records of around 100 people with muscle abnormalities, looking for root causes of sarcoma. Electronically linked data would hasten the process, as countries with centralized health systems, such as the United Kingdom and Denmark, already know.

Building registries on such a huge scale will require buy-in from academic institutions, government and the public. Critics might argue that gaining public support is a large hurdle, but millions of people around the world have already shown willingness to participate in health registries, which have generated crucial scientific results.

The COVID-19 pandemic has catalysed promising initiatives in the United States that offer a blueprint for cancer registries. More than 75 institutions work together in the National COVID Cohort Collaborative, organized by the National Institutes of Health, to collect de-identified clinical data on more than 6.5 million people with COVID-19. The US government should prioritize registries as part of its Cancer Moonshot programme, which aims to halve the death rate from cancer by 2047. It should also require privacy-preserving record-linkage methods, such as tokenization, for existing cancer registries and clinical trials. Such changes would benefit people and bring much-needed unification to a fragmented health-care system.