Transmission Electron Microscope photograph of SARS-CoV-2. © CSIRO

Researchers mapping the genetic blueprint of the novel coronavirus SARS-CoV-2 have by now shared more than 12,000 genome sequences from across the world on the open platform Global Initiative on Sharing All Influenza Data (GISAID). The repository has seen unprecedented activity since December when the first sequence from Wuhan in China came in. On NCBI’s GenBank, more than 20,000 nucleotide and protein sequences of the virus have already been submitted.

The virus is all set to become the most sequenced ever in history.

Researchers, however, warn that unless the sequences are accompanied by de-identified data from patients, the billions of dollars being spent on sequencing the virus globally will yield little clinical or epidemiological value, which is crucial during a rapidly evolving pandemic.

Laboratories, clinicians, epidemiologists and governments wanting to quickly use this gold mine of information are meeting a stumbling block as they look for more granular data that should ideally supplement the primary sequence data.

“We badly need de-identified meta-data from the patients from whom these sequences came so that it makes sense for any kind of analysis,” says Seshadri Vasan, who leads the Dangerous Pathogens team at the Australian Animal Health Laboratory and is senior principal research consultant for Health and Biosecurity at the Commonwealth Scientific and Industrial Research Organization (CSIRO), Australia's national science agency.

De-identified data does not reveal the identity of the patient. Vasan says the minimum set of de-identified data that researchers need is the patient’s age, gender, if they had mild, moderate or severe disease and if they survived. Questions around lifestyle and comorbidities, such as do they smoke, have pre-existing respiratory illness or diabetes, are also important to add meaning to this data. “We usually get information on country and city, but it may be beneficial to have postcode and ethnicity data too,” he says.
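The minimum set Vasan describes can be pictured as a small record attached to each sequence. The sketch below is only an illustration: the class name, field names and the `EXAMPLE-0001` label are assumptions made here, not a schema used by GISAID, GenBank or CSIRO.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Severity(Enum):
    MILD = "mild"
    MODERATE = "moderate"
    SEVERE = "severe"


@dataclass
class PatientMetadata:
    """Hypothetical de-identified record accompanying one viral sequence."""
    sequence_id: str                      # illustrative label, not a real accession
    age_years: Optional[int] = None
    gender: Optional[str] = None
    severity: Optional[Severity] = None
    survived: Optional[bool] = None
    # Lifestyle and comorbidity details that add meaning to the data
    smoker: Optional[bool] = None
    comorbidities: List[str] = field(default_factory=list)
    country: Optional[str] = None
    city: Optional[str] = None

    def missing_minimum_fields(self) -> List[str]:
        """Return which of the four minimum fields are still empty."""
        required = ("age_years", "gender", "severity", "survived")
        return [name for name in required if getattr(self, name) is None]


record = PatientMetadata(
    sequence_id="EXAMPLE-0001",
    age_years=54,
    severity=Severity.MODERATE,
)
print(record.missing_minimum_fields())  # ['gender', 'survived']
```

A check of this kind would let a database flag submissions that arrive without the minimum clinical context.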

India has announced an ambitious 1000-genome sequencing project to better understand the viral and host genomics of the COVID-19 outbreak. India’s Council for Scientific and Industrial Research (CSIR), which undertook a mega 1008-human genome sequencing project last year, has been leading the sequencing efforts in India.

Scientists at the Centre for Cellular and Molecular Biology (CCMB), Hyderabad; Institute of Genomics and Integrated Biology (IGIB), Delhi; Institute of Microbial Technology, Chandigarh; the National Institute of Virology, Pune, and Gujarat Biotechnology Research Centre, Gandhinagar are sequencing the viral genome. Besides, the Central Drug Research Institute (CDRI), Lucknow and IICB, Kolkata are also gearing up to sequence the viral genome.

With the 1000-genome project, about 10 more facilities across the country will be pulled in to sequence the virus.

Virologist Mitali Mukerji, a genomic scientist at IGIB who is coordinating CSIR’s sequencing efforts, says that at the moment scientists are only trying to analyse the strain of the virus and where the sequences came from. “Clinical history is not getting submitted from any place. It’s very important since this is not the end of the outbreak we are seeing,” she says. Epidemiologists need to identify people who might be more at risk, and analysing clinical information will be crucial, she says.

IGIB director and clinician scientist Anurag Agrawal, who is overseeing a molecular and digital surveillance project around the genome sequences from India, says it would be extremely useful to know the viral loads and numbers of symptomatic versus asymptomatic cases. “Nothing is meaningful for molecular epidemiology or our knowledge of clusters unless these clinical parameters are well defined in the data,” he says.

The biggest barrier, he says, is coordination among researchers sequencing the data and agencies uploading it on to the databases. “We work with the National Centre for Disease Control (NCDC), who have the underlying patient information and since they upload the sequences, they do add much more value to the data.”

Upasana Ray Banerjee, a virologist at the CSIR-Indian Institute of Chemical Biology (IICB) whose team recently analysed the genome sequence from a COVID-19 patient from Gujarat, agrees. “This remains a concern for most of us – to correlate this data with our analysis,” she told Nature India. “It is extremely important for us when we want to assign clinical significance to our sequencing efforts,” she says.

The reason this additional data is needed is that the same viral strain could be fatal for one person, and result in mild, moderate or severe symptoms in others. “And some strains could also be more or less virulent than others,” Vasan adds.

Vasan, who holds an honorary chair in Health Sciences at the University of York in the UK, says the World Health Organisation should lead this effort to standardise the meta-dataset that can be followed globally, with consistent definitions to categorise severity and outcomes of COVID-19. “No country can solve this problem in isolation. It is important for the WHO to specify the minimal meta-dataset not just for SARS-CoV-2 but also a future ‘Disease X’,” he told Nature India.

In the absence of patient meta-data “we don’t know how the disease is progressing, how long the virus shedding occurs in different settings and what kind of immunity levels exist in individuals or populations,” says epidemiologist Giridhara R Babu from the Public Health Foundation of India (PHFI).

“As we move forward, we have to be very careful in improving the quality of the meta-data and, more importantly, have it unbiasedly assessed by people who don’t run the clinical trials,” Babu told Nature India. That way measurement errors and selection biases can be removed from the data to make it more useful.

Information on severity of symptoms and disease progression dynamics would be immensely helpful when combined with the genomic sequences. “For instance, one could actually know if there is a sub-group of asymptomatic people who never go on to develop the disease. They would be way more useful to design a disease modifying mechanism or immunomodulation, instead of the quest for a vaccine as the endgame.”

Disregarding all these data elements eliminates the possibility of other non-pharmacological interventions to disrupt the transmission of the virus, Babu says.

Evolution, mutations and clades

The global effort to peer into the genetic make-up of the pandemic-causing virus since the start of the COVID-19 outbreak has provided real-time understanding of the organism. Databases such as the GenBank and GISAID provide ammunition to researchers trying to understand the evolution and mutations of viruses. They are also solid tools for research and development of drugs and vaccines against the virus.

The data so far reveals some minor mutations in the virus which may have no functional consequence, Vasan says. “For instance, when we looked at 388 sequences from Australia, only 162 had protein-changing mutations,” he says. However, his team was unable to determine clinical or epidemiological impacts of these minor mutations without the underlying meta-data. Only 14 out of these 388 sequences had clinical annotations; the rest were either annotated as unknown or not annotated at all.
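To put those Australian figures in proportion, the short calculation below works out the shares from the three numbers quoted. The variable and function names are chosen here for illustration only.

```python
# Figures quoted for the Australian sequences
total_sequences = 388
protein_changing = 162
clinically_annotated = 14


def share(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


pct_protein_changing = share(protein_changing, total_sequences)
pct_annotated = share(clinically_annotated, total_sequences)
without_annotation = total_sequences - clinically_annotated

print(pct_protein_changing)  # 41.8
print(pct_annotated)         # 3.6
print(without_annotation)    # 374
```

In other words, roughly two in five sequences carried a protein-changing mutation, while fewer than one in 25 came with the clinical context needed to interpret it.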

CSIRO has developed¹ a novel visualisation platform – similar to the one used to analyse the human genome – to pinpoint differences among the thousands of individual genetic sequences of COVID-19 now globally available. The data visualisation platform highlights evolving genetic mutations of the virus as it continues to change and adapt to new environments.

“Analysing global data on the published genome sequences of this novel coronavirus will help fast track our understanding of this complex disease, how changes in the virus could affect its behaviour and impact,” Vasan says. “Assessing the evolutionary distance between these data points helps researchers find out about the different strains of the virus – including where they came from and how they continue to evolve,” he says.

Vasan, whose team has analysed the first 181 published genome sequences from the current COVID-19 outbreak, says the RNA virus can "evolve into a number of distinct clusters that share mutations." The analysis has already helped determine which strains of the virus are suitable for testing vaccines underway at the Australian Centre for Disease Preparedness in Geelong.
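One simple way to picture the distance between sequences is to count the positions at which two aligned genomes differ. The sketch below uses that count (the Hamming distance) on three invented eight-letter fragments; the article does not say which distance measure the CSIRO team used, and real analyses work on aligned genomes of roughly 30,000 bases.

```python
from itertools import combinations


def hamming_distance(a: str, b: str) -> int:
    """Count positions at which two aligned, equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


# Invented toy fragments, not real SARS-CoV-2 data
toy_sequences = {
    "seq_A": "ATGCTAGC",
    "seq_B": "ATGCTAGT",
    "seq_C": "ATGATAGT",
}

# Pairwise distances between every pair of toy sequences
distances = {
    (first, second): hamming_distance(toy_sequences[first], toy_sequences[second])
    for first, second in combinations(sorted(toy_sequences), 2)
}

for pair, distance in distances.items():
    print(pair, distance)
```

Sequences separated by small distances and sharing the same differences would fall into the same cluster; here seq_B sits one mutation away from both seq_A and seq_C.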

RNA viruses, Vasan adds, generally evolve into clusters and show ‘quasispecies diversity’, meaning not just a single genotype but an ensemble of related sequences. Quasispecies arise from rapid genomic evolution powered by the high mutation rate of RNA viral replication. The novel coronavirus, an RNA virus, emerged from China, and restrictions on air travel and the movement of people were not put in place for some time after the outbreak in Wuhan. “Therefore, the clusters do not correspond to countries. For instance, the first 181 published genomic sequences could be grouped into three clusters (with three more emerging), and Australian isolates can be found in each of them,” he says.

For this reason it is unhelpful to call the virus ‘an Indian strain’ or ‘Australian strain’ or ‘Chinese strain’ or make claims that one regional strain is more virulent than the other.

“Over time, we may likely find clusters with varied virulence in all countries. The real question is whether we can link the accumulated mutations in the genome to clinical meta-data and find clinically/epidemiologically meaningful correlations,” he says.

A GISAID statement says the virus strains circulating globally can be classified into a number of different clades based on genetic variation. “These are part of the natural evolution of the virus currently not known to be associated with any differences in virulence,” it says. Data from the early outbreak period is not enough for a detailed interpretation of the early history of global transmissions from a few genomes, according to GISAID.

Ray Banerjee, whose team reported in a preprint paper² two novel mutations in the spike protein of the SARS-CoV-2 isolate from Gujarat as compared with the Wuhan virus isolates, says these mutations have a somewhat different origin. “One of the mutations is exclusive in the virus obtained from Gujarat whereas the other was also seen in North American and European isolates.”

Almost 95 per cent of the strains reported in global databases till now are from Wuhan in China where the outbreak began. “The rest five per cent are from the rest of the world. So some descriptions of virulence being low or high in a particular region are wishful thinking at best,” Giridhara Babu says.

1. Bauer, D. C. et al. Pandemic response using genomics & bioinformatics, a case study on the emergent SARS-CoV-2 outbreak. Transbound. Emerg. Dis. (2020) doi: 10.1111/tbed.13588

2. Banerjee, A. K. et al. Novel mutations in the S1 domain of COVID 19 spike protein of isolate from Gujarat origin, Western India. Preprints (2020) doi: 10.20944/preprints202004.0450.v1