More than 1.2 million coronavirus genome sequences from 172 countries and territories have now been shared on a popular online data platform, which is a testament to the hard work of researchers around the world during the pandemic.
“Because countries are submitting data from so many parts of the world, you have a system where we can watch how the virus spreads through the world, and see if control measures and the vaccines still work,” says Sebastian Maurer-Stroh, a Singapore-based scientific adviser at the non-profit organization hosting the repository, GISAID, the Global Initiative on Sharing Avian Influenza Data.
Several databases for genome sequences exist, but GISAID is by far the most popular for SARS-CoV-2. It was conceived in 2006 as a repository of genomic data from flu viruses. At the time, many countries withheld genomic information for a range of reasons. One fear was that the countries generating the data would not get credit, or would not reap the benefits of research stemming from their original sequencing work. But after two years of negotiations between governments and scientists about data-sharing agreements, GISAID launched.
Charting the spread
When COVID-19 began spreading in China, Maurer-Stroh says, the GISAID team immediately reached out to researchers and politicians around the world, to understand what barriers might prevent them from sharing genomic data on SARS-CoV-2.
For example, when researchers in West Africa said that they lacked bioinformatics training, a scientist affiliated with GISAID in Senegal began to hold workshops on sequencing, analytics and how to use the tools on the platform. Some of GISAID’s features allow researchers to see how genomes they’ve uploaded relate to others, or to explore where new variants appear from day to day.
Although outreach has helped, Maurer-Stroh says the site’s popularity is mainly due to its mechanism of sharing and the quality of its tools for sequence display and analysis.
Some wealthy countries have uploaded huge numbers of sequences and account for the lion’s share in their regions (see ‘Collaboration in the time of COVID’). For example, as of 20 April, the United States had shared 303,359 sequences, and the United Kingdom’s tally stood at 379,510 sequences.
Not entirely comprehensive
But glaring gaps exist. Not a single SARS-CoV-2 sequence has been uploaded from Tanzania, where the late president John Magufuli denied the existence of the pandemic for many months. And several countries with significant outbreaks, including El Salvador (67,851 cases, but only 6 sequences uploaded) and Lebanon (513,006 cases, 49 sequences uploaded), are lagging far behind.
To search or download sequences from GISAID, or use the platform’s genomic-analysis tools, people must register with their name, and agree to terms that include not publishing studies based on the data without acknowledging the scientists who uploaded the sequences, and even contacting them to ask about collaboration. Such gatekeeping has upset some scientists, who argue that there should be no barriers standing in the way of access.
But GISAID probably would not have hit the one-million mark without such an approach, because it would have lacked assurances against exploitation, speculates Tulio de Oliveira, the director of the KwaZulu-Natal Research Innovation and Sequencing Platform in Durban, South Africa. He says: “This is the first time I’ve seen people sharing so much data before publication.”
Nature 593, 21 (2021)