With the increasing use of artificial intelligence (AI) and machine learning applications in healthcare, genomics and medical research, urgent questions have emerged regarding the collection and control of the human data needed to develop these applications. For example, whose data are being collected? What information is contained in the data? Why were the data collected in the first place, and under what conditions? Where are the data stored, and who ‘owns’, oversees or otherwise controls them? How are the data used, and by whom?

A Comment in this issue by Boscarino et al. discusses data sovereignty in genomics datasets for Indigenous peoples. Recent research has emphasized the importance of including diverse populations in genomics and healthcare datasets, in order to train machine learning models on heterogeneous data and thereby avoid health inequities. But a paradox may arise from this approach, as open data sharing is not always in the best interest of certain communities; Indigenous people, for example, are under-represented in genomics datasets, but are understandably apprehensive about contributing their data owing to mistrust and scepticism born of past interactions with medical researchers1. Data sovereignty is a way for Indigenous communities to control, or take back control of, their personal data.

The creation, curation and sharing of data in genomics has a complex history. When the Human Genome Project (HGP) was launched in the early 1990s, research data were largely considered to belong to individual researchers for investigation within their own labs. In 1996, HGP researchers established the Bermuda Principles, requiring human genome sequences to be made available in publicly accessible databases within 24 hours of generation. Subsequent decades saw scientific progress, but also data dilemmas: various entities such as funding agencies, research institutes, governments, publishers and private research consortia created data repositories with byzantine rules or policies for access and sharing. In a Nature Feature, this lack of data access and sharing was described as a “broken promise” to researchers who depend on these rich data being made available2. But this perspective, focused on researchers’ interests, underplays the fundamentally personal nature of genomic data, which may form part of the personal identity of disadvantaged groups, who have their own interests and hopes regarding how their data are used. In a Correspondence in response to this article, scientists wrote that the broken promise in genomics is to Indigenous people, whose data have been collected but who see little benefit from medical research because of persistent inequities3.

Data, algorithms and compute (the main elements of AI) have advanced rapidly over the past two decades and are being deployed in people’s lives in disruptive ways — for good and ill — without much regulation or, until recently, community deliberation. The fallout is considerable, and many concerns have been raised about the harms that AI algorithms inflict on individuals and groups in society. A recent example is a white paper by the American Civil Liberties Union (ACLU) entitled “AI in healthcare may worsen medical racism”4. The paper discusses several studies that inadvertently used biased data and machine learning models to make harmful medical decisions. In recent years, the machine learning community has called for participatory approaches whereby those whose lives are affected by algorithms have a major role in their development5. But such approaches can only work if the underlying power relations are addressed, as discussed in a recent paper by Birhane et al.6. Some of the challenges can be traced back to the question of who owns and controls the data that support machine learning developments. For instance, the authors of ref. 6 discuss a case study in which the Māori community in New Zealand participated in a research project to record and annotate audio data of their native language. The community contested open data sharing of this dataset and developed a data sovereignty protocol to take control over their own data and prevent them from falling into the hands of corporate entities7.

In their Comment, Boscarino et al. describe a way forward for Indigenous communities to increase participation in scientific projects while keeping control over data resources, for a more equitable use of data and AI. Rather than being openly shared, datasets can be made available for federated machine learning, a technique for training machine learning models without uploading datasets to a central server. Instead, a partially trained model is sent to the data, and the updated model is returned to the central server8. The technique is being developed for various applications in which individuals or institutes with large datasets need to keep their data private. It can be used, for example, to collaboratively train models on patient datasets that remain at the individual hospitals that collect and store them.
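To make the mechanics concrete, the round-trip described above — model out, trained weights back, data staying put — can be sketched as federated averaging. The following is a minimal, self-contained illustration with a hypothetical one-parameter linear model (y = w·x) and two invented local datasets standing in for data-holding sites; it is not the specific protocol Boscarino et al. propose.

```python
# Sketch of federated averaging: the server never sees raw data,
# only model weights returned after local training at each site.

def local_update(w, data, lr=0.01, epochs=20):
    """Train the received model on a local dataset; raw data never leave the site."""
    for _ in range(epochs):
        # gradient of the mean squared error for the toy model y = w * x
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_round(w_global, local_datasets):
    """One round: send the global model out, average the returned weights."""
    local_weights = [local_update(w_global, d) for d in local_datasets]
    # weight each site's contribution by its dataset size
    total = sum(len(d) for d in local_datasets)
    return sum(w * len(d) for w, d in zip(local_weights, local_datasets)) / total

# Two hypothetical sites holding (x, y) pairs generated by y = 3x;
# neither dataset is ever uploaded to the central server.
site_a = [(1.0, 3.0), (2.0, 6.0)]
site_b = [(3.0, 9.0), (4.0, 12.0), (5.0, 15.0)]

w = 0.0
for _ in range(50):
    w = federated_round(w, [site_a, site_b])
print(round(w, 2))  # prints 3.0 — the model converges without pooling the data
```

In a real deployment the "model" would be a neural network and the exchange would run over secure channels, but the governance point is the same: each community retains physical custody of its data and can withdraw from training at any round.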

As Boscarino et al. discuss, resources would be needed to set up the infrastructure to enable federated machine learning, and Indigenous communities should decide whether to implement the approach and how to manage it. This tool deserves further exploration as a way to increase diversity in data, with benefits for all, while ensuring that communities are treated as partners in research rather than as data subjects.