More than one million people have now had their genome, or its protein-coding regions (the exome), sequenced. The hope is that this information can be shared and linked to phenotype, specifically disease, and so improve medical care. An obstacle is that only a small fraction of these data are publicly available.

In an important step, we report this week the first publication from the Exome Aggregation Consortium (ExAC), which has generated the largest catalogue so far of variation in human protein-coding regions. It aggregates sequence data from some 60,000 people. Most importantly, it puts the information in a publicly accessible database that is already a crucial resource (http://exac.broadinstitute.org).

There are challenges in sharing such data sets — the project scientists deserve credit for making this one open access. Its scale offers insight into rare genetic variation across populations. It identifies more than 7.4 million (mostly new) variants at high confidence, and documents rare mutations that independently emerged, providing the first estimate of the frequency of their recurrence. And it finds 3,230 genes that show nearly no cases of loss of function. More than two-thirds have not been linked to disease, which points to how much we have yet to understand.

The study also raises concern about how genetic variants have been linked to rare disease. The average ExAC participant has some 54 variants previously classified as causal for a rare disorder; many show up at an implausibly high frequency, suggesting that they were incorrectly classified. The authors review evidence for 192 variants reported earlier to cause rare Mendelian disorders and found at a high frequency by ExAC, and uncover support for pathogenicity for only 9. The implications are broad: these variant data already guide diagnoses and treatment (see, for example, E. V. Minikel et al. Sci. Transl. Med. 8, 322ra9; 2016 and R. Walsh et al. Genet. Med. http://dx.doi.org/10.1038/gim.2016.90; 2016).

These findings show that researchers and clinicians must carefully evaluate published results on rare genetic disorders. They also demonstrate the need to filter variants seen in sequence data against the ExAC data set and other reference tools, a practice widely adopted in genomics.
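As a rough illustration of what such filtering can look like, the sketch below flags candidate variants whose allele frequency in a reference panel is too high to be plausible for a rare disorder. It is a minimal example under stated assumptions, not the ExAC authors' method: the `Variant` record, the `filter_rare` function, the variant identifiers, the frequencies and the 0.001 cut-off are all hypothetical and chosen only for demonstration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    variant_id: str      # hypothetical identifier, not a real variant
    reference_af: float  # allele frequency in a reference panel, 0.0 to 1.0


def filter_rare(variants, max_af):
    """Split variants into (kept, flagged) by reference allele frequency.

    A variant is flagged when its reference frequency exceeds max_af,
    meaning it is too common to be a plausible cause of a rare disorder
    under the assumed cut-off.
    """
    kept, flagged = [], []
    for v in variants:
        if v.reference_af > max_af:
            flagged.append(v)
        else:
            kept.append(v)
    return kept, flagged


# Made-up candidates; the frequencies are illustrative, not ExAC values.
candidates = [
    Variant("var_a", 0.00001),
    Variant("var_b", 0.02),
    Variant("var_c", 0.0),
]

# The 0.001 cut-off is an assumption for this sketch only.
kept, flagged = filter_rare(candidates, max_af=0.001)
print("kept:", [v.variant_id for v in kept])
print("flagged:", [v.variant_id for v in flagged])
```

In practice, a defensible cut-off would depend on the prevalence and inheritance pattern of the disorder in question, so a single fixed threshold such as the one above should be read only as a placeholder.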

The ExAC project plans to grow over the next year to include 120,000 exome and 20,000 whole-genome sequences. It relies on the willingness of large research consortia to cooperate, and highlights the huge value of sharing, aggregation and harmonization of genomic data. This is also true for patient variants — there is a need for databases that provide greater confidence in variant interpretation, such as the US National Center for Biotechnology Information’s ClinVar database.

Improving clinical genetics will need continued investment in such databases; more contributions from clinical labs, researchers and clinicians; expanded human genetic-reference panels; and work to link these panels to phenotype data. Such linking often involves re-contacting volunteers and donors; it will be trialled with an ExAC data subset where consents allow.

More broadly, enabling the sharing of linked genetic and clinical data in ways that do not violate privacy requires fresh thinking in regulation and ethics. The US National Institutes of Health and the Global Alliance for Genomics and Health have begun to tackle this; others should follow. The ExAC study highlights the potential rewards.