Machine learning spots treasure trove of elusive viruses

Artificial intelligence could speed up metagenomic studies that look for species unknown to science.


Many viruses are difficult to study because they cannot be grown in the lab. Credit: Sebastian Kaulitzki/SPL/Getty

Researchers have used artificial intelligence (AI) to discover nearly 6,000 previously unknown species of virus. The work, presented on 15 March at a meeting organized by the US Department of Energy (DOE), illustrates an emerging tool for exploring the enormous, largely unknown diversity of viruses on Earth.

Although viruses influence everything from human health to the degradation of trash, they are hard to study. Scientists cannot grow most viruses in the lab, and attempts to identify their genetic sequences are often thwarted because their genomes are tiny and evolve fast.

In recent years, researchers have hunted for unknown viruses by sequencing DNA in samples taken from various environments. To identify the microbes present, researchers search for the genetic signatures of known viruses and bacteria — just as a word processor’s ‘find’ function highlights words containing particular letters in a document. But that method often fails, because virologists cannot search for what they do not know. A form of AI called machine learning gets around this problem because it can find emergent patterns in mountains of information. Machine-learning algorithms parse data, learn from them and then classify information autonomously.
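The limits of the 'find' approach described above can be seen in a minimal Python sketch. It assumes the simplest possible search, an exact substring match, and uses invented toy sequences. Real pipelines rely on alignment tools that tolerate some mismatches, but sequences that have diverged far enough from every reference still escape them.

```python
def find_known_signatures(sample: str, signatures: dict[str, str]) -> list[str]:
    """Return the names of reference signatures that occur verbatim in the sample."""
    return [name for name, sig in signatures.items() if sig in sample]


# Toy data, invented for illustration only.
signatures = {"known_virus": "ATGCGTACGTTAG"}
sample_with_known = "CCCATGCGTACGTTAGGGA"    # contains the signature exactly
sample_with_variant = "CCCATGCGAACGTTAGGGA"  # one letter differs from the signature

hits_known = find_known_signatures(sample_with_known, signatures)
hits_variant = find_known_signatures(sample_with_variant, signatures)

print(hits_known)    # the exact copy is found
print(hits_variant)  # the one-letter variant is missed
```

A single changed letter is enough to hide the variant from this kind of lookup, which is the gap that pattern-based methods aim to close.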

“Previously, people had no method to study viruses well,” says Jie Ren, a computational biologist at the University of Southern California in Los Angeles. “But now we have tools to find them.”

For the latest study, Simon Roux, a computational biologist at the DOE Joint Genome Institute (JGI) in Walnut Creek, California, trained computers to identify the genetic sequences of viruses from one unusual family, Inoviridae. These viruses live in bacteria and alter their host’s behaviour: for instance, they make the bacteria that cause cholera, Vibrio cholerae, more toxic. But Roux, who presented his work at the meeting in San Francisco, California, organized by the JGI, estimates that fewer than 100 species had been identified before his research began.

Roux presented a machine-learning algorithm with two sets of data — one containing 805 genomic sequences from known Inoviridae, and another with about 2,000 sequences from bacteria and other types of virus — so that the algorithm could find ways of distinguishing between them.
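The article does not say which algorithm Roux used, so the following is only a sketch of the general idea of learning from two labelled sets. It assumes a nearest-centroid classifier and a deliberately crude feature, the share of each DNA letter in a sequence. The function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"


def base_composition(seq):
    """Fraction of A, C, G and T in a sequence (a stand-in for richer features)."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in BASES) or 1
    return [counts[b] / total for b in BASES]


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def train(positives, negatives):
    """Summarize each labelled set by the mean of its feature vectors."""
    return {
        "target": centroid([base_composition(s) for s in positives]),
        "other": centroid([base_composition(s) for s in negatives]),
    }


def classify(model, seq):
    """Assign the label whose centroid is closest to the sequence's features."""
    features = base_composition(seq)
    return min(model, key=lambda label: math.dist(features, model[label]))


# Toy training sets, invented for illustration only.
positives = ["ATATATTAATATGA", "TTATAATATTAAAC", "AATTATATAATTAG"]
negatives = ["GCGCGGCCGCGCTA", "CCGGCGCGGGCCAT", "GGCCGCGCCGGCTA"]

model = train(positives, negatives)
prediction_at_rich = classify(model, "TATAATTATAATAG")
prediction_gc_rich = classify(model, "GCCGGCGCGCGGAT")
```

The second set matters as much as the first: without examples of what the target family is not, the model has nothing to distinguish it from.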

Next, Roux fed the model massive metagenomic data sets. The computer recovered more than 10,000 Inoviridae genomes, and clustered them into groups indicative of different species. The genetic variation between some of these groups was so wide that Inoviridae is probably many families, he said.
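Grouping recovered genomes into candidate species can be sketched as a similarity clustering. The example below is an assumption-laden illustration, not the method used in the study: it compares genomes by the overlap of their short DNA words (Jaccard similarity) and groups them greedily. The word length, the threshold of 0.5 and the toy sequences are all hypothetical.

```python
def kmer_set(seq, k=4):
    """Set of all overlapping words of length k in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Shared items divided by total distinct items across two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_cluster(genomes, threshold=0.5, k=4):
    """Put each genome in the first cluster whose representative is similar enough."""
    clusters = []  # list of (representative k-mer set, member names)
    for name, seq in genomes.items():
        kmers = kmer_set(seq, k)
        for representative, members in clusters:
            if jaccard(kmers, representative) >= threshold:
                members.append(name)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append((kmers, [name]))
    return [members for _, members in clusters]


# Toy genomes, invented for illustration only.
genomes = {
    "genome_a": "ATGCGTACGTTAGCCTAGGA",
    "genome_b": "ATGCGTACGTTAGCCTAGGT",  # differs from genome_a in the last letter
    "genome_c": "GGGGCCCCAAAATTTTGGCC",  # unrelated sequence
}

groups = greedy_cluster(genomes)
```

Here the two near-identical genomes fall into one group and the unrelated one forms its own. Wide gaps between groups are the kind of signal that would suggest more than one family.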

Viral learning

In a separate study, Deyvid Amgarten, a bioinformatician at the University of São Paulo in Brazil, deployed machine learning to find viruses in compost piles at the city’s zoo. He programmed his algorithm to search for a few distinguishing features of virus genomes, such as the density of genes in DNA strands of a given length. After the training, the computer recovered several genomes that seem to be new, says Amgarten, who presented his results at the JGI meeting. The final step will be to learn what proteins those viruses produce, and see whether any of them speed the rate at which organic matter breaks down. “We want to improve the efficiencies of composting,” he says.
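A feature such as gene density is simple to compute once genes have been located. The sketch below assumes that a separate gene-prediction step has already produced start and end coordinates for each gene. The function name and the numbers are hypothetical and do not come from Amgarten's work.

```python
def gene_density(gene_coords, contig_length):
    """Number of predicted genes per 1,000 bases of a DNA strand."""
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    return len(gene_coords) * 1000 / contig_length


# Toy coordinates (start, end), invented for illustration only.
densely_packed = [(1, 600), (610, 1200), (1210, 1800), (1810, 2400),
                  (2410, 3000), (3010, 3600), (3610, 4200), (4210, 4800)]
sparsely_packed = [(1, 900), (1500, 2300), (3000, 3700), (4200, 4900)]

dense_value = gene_density(densely_packed, contig_length=5000)
sparse_value = gene_density(sparsely_packed, contig_length=5000)

print(f"dense strand:  {dense_value:.1f} genes per 1,000 bases")
print(f"sparse strand: {sparse_value:.1f} genes per 1,000 bases")
```

A value like this becomes one number in the list of features that a learning algorithm weighs when deciding whether a strand looks viral.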

Amgarten took his cue from a machine-learning tool reported last year, called VirFinder (ref. 1), from Ren’s team. VirFinder is programmed to notice combinations of DNA letters, such as AT or CG, in DNA strands. Ren applied the algorithm to metagenomic samples from faeces of healthy people and those with liver cirrhosis, a condition caused by diseases ranging from hepatitis to chronic alcoholism. Once the machine classified groups of viruses in the samples, the team noticed that particular types were more or less common in healthy people compared with those with cirrhosis, suggesting that some viruses might play a part in disease.
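Counting combinations of DNA letters amounts to building a frequency profile of short words, often called k-mers. The following sketch is not VirFinder's code. It assumes a word length of two, to match the AT and CG examples above, and the function name is hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=2):
    """Relative frequency of every possible DNA word of length k in a sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Only words made of A, C, G and T are counted; ambiguous letters are ignored.
    total = sum(counts[kmer] for kmer in all_kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


# Toy sequence, invented for illustration only.
profile = kmer_frequencies("ATATCGCG")

for kmer, freq in profile.items():
    if freq > 0:
        print(kmer, round(freq, 3))
```

Each sequence is reduced to the same fixed list of numbers, whatever its length, which is what lets a classifier compare a newly sequenced fragment with the viral and non-viral examples it was trained on.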

Ren’s is a tantalizing finding: biomedical researchers have long wondered whether viruses contribute to the symptoms of several elusive conditions, such as chronic fatigue syndrome (also known as myalgic encephalomyelitis) and inflammatory bowel disease. Derya Unutmaz, an immunologist at the Jackson Laboratory for Genomic Medicine in Farmington, Connecticut, speculates that viruses might trigger a destructive inflammatory reaction — or they might modify the behaviour of bacteria in a person’s microbiome, which in turn could destabilize metabolism and the immune system.

With machine learning, Unutmaz says, researchers might identify viruses in patients that have remained hidden. Further, because AI has the ability to find patterns in massive data sets, he says, the approach might connect data on viruses to bacteria, and then to protein changes in people with symptoms. Says Unutmaz, “Machine learning could reveal knowledge we didn’t even think about.”



1. Ren, J., Ahlgren, N. A., Lu, Y. Y., Fuhrman, J. A. & Sun, F. Microbiome 5, 69 (2017).
