Identification of mobile genetic elements with geNomad

Identifying and characterizing mobile genetic elements in sequencing data is essential for understanding their diversity, ecology, biotechnological applications and impact on public health. Here we introduce geNomad, a classification and annotation framework that combines information from gene content and a deep neural network to identify sequences of plasmids and viruses. geNomad uses a dataset of more than 200,000 marker protein profiles to provide functional gene annotation and taxonomic assignment of viral genomes. Using a conditional random field model, geNomad also detects proviruses integrated into host genomes with high precision. In benchmarks, geNomad achieved high classification performance for diverse plasmids and viruses (Matthews correlation coefficient of 77.8% and 95.3%, respectively), substantially outperforming other tools. Leveraging geNomad’s speed and scalability, we processed over 2.7 trillion base pairs of sequencing data, leading to the discovery of millions of viruses and plasmids that are available through the IMG/VR and IMG/PR databases. geNomad is available at https://portal.nersc.gov/genomad.


Supplementary Note 1: Marker-based classification features
To classify a given sequence into chromosome, plasmid, or virus with the marker-based classifier, geNomad performs gene prediction using prodigal-gv and annotates the predicted proteins by aligning them to geNomad's markers using MMseqs2. From the sequence's gene structure, RBS motifs, and the identity of the markers that were assigned to its proteins, a total of 25 informative features are computed and used as input for the classification model. Below we list and describe each one of these features:
• strand_switch_rate: The fraction of genes located on a different strand from the gene upstream.
• coding_density: Sum of the lengths of all the protein-coding regions (in base pairs) divided by the total sequence length.
• no_rbs_freq: Fraction of genes without a detectable RBS motif.
• cc_marker_freq: Number of genes assigned to the CC specificity class (high chromosome SPM, low plasmid SPM, low virus SPM) divided by the total number of genes.
• cp_marker_freq: Number of genes assigned to the CP specificity class (high chromosome SPM, medium plasmid SPM, low virus SPM) divided by the total number of genes.
• cv_marker_freq: Number of genes assigned to the CV specificity class (high chromosome SPM, low plasmid SPM, medium virus SPM) divided by the total number of genes.
• pc_marker_freq: Number of genes assigned to the PC specificity class (medium chromosome SPM, high plasmid SPM, low virus SPM) divided by the total number of genes.
• pp_marker_freq: Number of genes assigned to the PP specificity class (low chromosome SPM, high plasmid SPM, low virus SPM) divided by the total number of genes.
• pv_marker_freq: Number of genes assigned to the PV specificity class (low chromosome SPM, high plasmid SPM, medium virus SPM) divided by the total number of genes.
• vc_marker_freq: Number of genes assigned to the VC specificity class (medium chromosome SPM, low plasmid SPM, high virus SPM) divided by the total number of genes.
• vp_marker_freq: Number of genes assigned to the VP specificity class (low chromosome SPM, medium plasmid SPM, high virus SPM) divided by the total number of genes.
• vv_marker_freq: Number of genes assigned to the VV specificity class (low chromosome SPM, low plasmid SPM, high virus SPM) divided by the total number of genes.
• gv_marker_freq: Number of genes annotated with giant virus markers divided by the total number of genes.
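As an illustration of how the features listed above follow from their definitions, the minimal sketch below computes a subset of them for a single sequence. The gene record layout, the example values, and the function name `marker_features` are hypothetical and are not taken from geNomad's code. The sketch also assumes that the first gene never counts as a strand switch and that genes do not overlap.

```python
from collections import Counter

# Hypothetical gene records: (start, end, strand, rbs_motif or None, marker_class or None).
# Coordinates are 1-based and inclusive; this layout is an assumption for illustration.
genes = [
    (1, 300, "+", "AGGAG", "VV"),
    (350, 800, "+", None, None),
    (900, 1400, "-", "GGA", "VV"),
    (1500, 1800, "-", None, "PC"),
]
seq_length = 2000

SPECIFICITY_CLASSES = ("CC", "CP", "CV", "PC", "PP", "PV", "VC", "VP", "VV")

def marker_features(genes, seq_length):
    """Compute a subset of the marker-based features for one sequence."""
    n = len(genes)
    # A gene counts as a switch when its strand differs from the gene upstream.
    switches = sum(1 for prev, cur in zip(genes, genes[1:]) if prev[2] != cur[2])
    # Sum of coding lengths in base pairs (overlapping genes would be double counted).
    coding_bp = sum(end - start + 1 for start, end, *_ in genes)
    no_rbs = sum(1 for gene in genes if gene[3] is None)
    class_counts = Counter(gene[4] for gene in genes if gene[4] is not None)
    features = {
        "strand_switch_rate": switches / n,
        "coding_density": coding_bp / seq_length,
        "no_rbs_freq": no_rbs / n,
    }
    # One frequency per specificity class, normalized by the total number of genes.
    for cls in SPECIFICITY_CLASSES:
        features[f"{cls.lower()}_marker_freq"] = class_counts[cls] / n
    return features

features = marker_features(genes, seq_length)
```

In this toy sequence one of four genes switches strand, two lack an RBS motif, and two carry VV markers, so the corresponding features are 0.25, 0.5, and 0.5.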
2. Each profile has three associated SPM values that range from 0 to 1 and measure how specific that profile is to each one of the three classes (chromosome, plasmid, and virus).
3. Markers were assigned to the nine specificity classes (CC, CP, CV, PC, PP, PV, VC, VP, and VV) based on their SPM values. Briefly, we used the "binned_statistic_dd" function from the SciPy Python library (version 1.7.3) to divide the three-dimensional SPM space into 125 equally sized bins. Next, each marker was assigned to a bin based on its SPM profile, so that all the markers within a given bin had similar chromosome, plasmid, and virus SPMs. Finally, we manually labeled each bin, and the markers within it, with the nine specificity classes, depending on their SPM profiles.
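The binning step can be sketched as follows. This is a pure-Python stand-in for the SciPy "binned_statistic_dd" call, not the original code. It assumes that the 125 bins correspond to five equal-width intervals per SPM axis over the range [0, 1]. The marker names, SPM values, and the `manual_labels` table are made up for illustration; in the actual workflow the bin labels were assigned manually.

```python
def spm_bin(spm, bins_per_axis=5):
    """Map a (chromosome, plasmid, virus) SPM triple to a triple of bin indices."""
    index = []
    for value in spm:
        if not 0.0 <= value <= 1.0:
            raise ValueError("SPM values must lie in [0, 1]")
        # The last bin is closed on the right so that 1.0 falls inside it.
        index.append(min(int(value * bins_per_axis), bins_per_axis - 1))
    return tuple(index)

# Hypothetical markers with made-up SPM profiles (chromosome, plasmid, virus).
markers = {
    "marker_a": (0.05, 0.10, 0.98),
    "marker_b": (0.10, 0.95, 0.50),
    "marker_c": (0.99, 0.05, 0.03),
}

# Group markers by bin, so markers in the same bin share similar SPM profiles.
bins = {}
for name, spm in markers.items():
    bins.setdefault(spm_bin(spm), []).append(name)

# Illustrative manual labels for the three occupied bins (assumed, not the real table).
manual_labels = {(0, 0, 4): "VV", (0, 4, 2): "PV", (4, 0, 0): "CC"}
marker_classes = {
    name: manual_labels[bin_index]
    for bin_index, names in bins.items()
    for name in names
}
```

With five intervals per axis there are 5 × 5 × 5 = 125 bins, matching the number stated above.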

Supplementary Note 2: Score calibration
During the inference process, a classification model assigns scores to predictions, indicating the level of confidence in each prediction, where higher values signify greater confidence. However, these scores do not necessarily represent the true probabilities of the predictions being correct, as classification models will exhibit varying false discovery rates when classifying samples with distinct underlying compositions.
For example, if the same classification model is used to identify viruses in a metagenome (where cellular sequences outnumber viral sequences) and in a virome (that is enriched in viral sequences), it is expected that the model will yield a higher proportion of false positive viruses in the metagenome, where more cellular sequences (that are prone to be misclassified as viruses) will be present (Extended Data Fig. 2A). This issue stems from the fact that models assign the same score to a given sequence regardless of the composition of the sample.
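The effect of sample composition on the false discovery rate can be made concrete with a short calculation. The sensitivity, false positive rate, and viral fractions below are hypothetical values chosen for illustration and are not measurements from this study.

```python
def false_discovery_rate(virus_fraction, sensitivity, false_positive_rate):
    """Fraction of predicted viruses that are not viruses, for a given composition."""
    true_pos = sensitivity * virus_fraction
    false_pos = false_positive_rate * (1.0 - virus_fraction)
    return false_pos / (true_pos + false_pos)

# The same hypothetical classifier (95% sensitivity, 2% false positive rate)
# applied to two samples that differ only in their viral fraction.
metagenome_fdr = false_discovery_rate(virus_fraction=0.01, sensitivity=0.95,
                                      false_positive_rate=0.02)
virome_fdr = false_discovery_rate(virus_fraction=0.50, sensitivity=0.95,
                                  false_positive_rate=0.02)
```

Under these assumptions, roughly two thirds of the virus predictions in the metagenome are false positives, compared with about 2% in the virome, even though the classifier and its scores are identical.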
To address this, we devised an optional calibration mechanism in geNomad that leverages sample composition data to approximate the true underlying probabilities. The algorithm consists of a dense neural network that takes raw scores and the empirical sample composition (i.e., the frequency of chromosomes, plasmids, and viruses in the pre-calibration classification) as inputs and outputs calibrated scores (Fig. 1A, box A3) that accurately approximate probabilities (mean absolute errors for pre- and post-calibration scores in Fig. 1D). Because this process depends on reliable estimates of the underlying compositions, it works best for samples of sufficient size (e.g., ≥ 1,000 sequences), for which the mean absolute error of the calibration is very low (≈ 1%, Extended Data Fig. 2B). In essence, the calibration mechanism adjusts raw scores by reducing the scores of a given class (chromosome, plasmid, or virus) when its frequency within the sample is low and increasing them when its frequency is high (Extended Data Fig. 2C and D).
When the sample composition is very uneven, this tends to result in large changes in raw scores, while very high or low scores are less affected (Extended Data Fig. 2C and E). The calibrated scores produced by geNomad offer users two benefits: (1) the estimated probabilities can be used to compute false discovery rates, allowing users to make more informed decisions (e.g., setting a threshold to achieve a desired proportion of false positives), and (2) classification performance improves, because the labels assigned to some sequences change after score calibration.
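The direction of this adjustment, and the use of calibrated scores to estimate false discovery rates, can be illustrated with a simple stand-in. The sketch below is not geNomad's calibration network: it applies a classical prior-shift correction, which rescales each class score by the ratio of its frequency in the sample to its frequency in the training data. The uniform training frequencies and all example values are assumptions.

```python
def prior_shift(raw_scores, train_freq, sample_freq):
    """Rescale class scores by sample/training frequency ratios and renormalize."""
    weighted = {
        cls: raw_scores[cls] * sample_freq[cls] / train_freq[cls]
        for cls in raw_scores
    }
    total = sum(weighted.values())
    return {cls: value / total for cls, value in weighted.items()}

def expected_fdr(virus_probabilities, threshold):
    """Expected fraction of false positives among calls at or above the threshold."""
    selected = [p for p in virus_probabilities if p >= threshold]
    if not selected:
        return 0.0
    # If scores are true probabilities, each call is wrong with probability 1 - p.
    return sum(1.0 - p for p in selected) / len(selected)

# Hypothetical raw scores for one sequence, and assumed class frequencies.
raw = {"chromosome": 0.3, "plasmid": 0.1, "virus": 0.6}
train_freq = {"chromosome": 1 / 3, "plasmid": 1 / 3, "virus": 1 / 3}
sample_freq = {"chromosome": 0.90, "plasmid": 0.05, "virus": 0.05}

calibrated = prior_shift(raw, train_freq, sample_freq)
label_before = max(raw, key=raw.get)
label_after = max(calibrated, key=calibrated.get)

# Expected FDR for a set of hypothetical calibrated virus scores.
fdr_at_half = expected_fdr([0.99, 0.90, 0.60, 0.20], threshold=0.5)
```

In this example the label changes from virus to chromosome, because viruses are rare in the assumed sample. This corresponds to the second benefit described above, while `expected_fdr` corresponds to the first.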

Supplementary Note 3: Classification performance benchmarks
To evaluate the classification performance of geNomad and compare it to other virus and plasmid identification tools that use different approaches for sequence classification (Table 1), we used test datasets consisting of diverse sequence fragments with varying lengths (Extended Data Fig. 5A). To minimize overestimation of geNomad's performance due to the presence of similar sequences in the train and test data, we randomly assigned RCs to five different data splits and performed cross-validation using the leave-one-group-out strategy, which forced sequences from the same RC to remain together in either the train or test sets. Performance metrics for all tools were measured five times, using each data split as the test set in turn. The following metrics were computed: precision (fraction of true plasmids/viruses among the sequences classified as plasmid/virus); sensitivity (fraction of the true plasmids/viruses that were classified as such); Matthews correlation coefficient (MCC, correlation between the true and predicted labels); and F1-score (harmonic mean of sensitivity and precision).
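The evaluation scheme and metrics described above can be sketched as follows. The function names, RC identifiers, split assignments, and confusion-matrix counts are hypothetical; the metric formulas are the standard definitions for binary classification.

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Precision, sensitivity, F1-score, and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    pr_sum = precision + sensitivity
    f1 = 2 * precision * sensitivity / pr_sum if pr_sum else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "sensitivity": sensitivity, "f1": f1, "mcc": mcc}

def leave_one_group_out(groups):
    """Yield (train_indices, test_indices), holding out one group at a time."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield train, test

# Each sequence inherits the data split of its RC, so an RC is never divided
# between the train and test sets (toy example with two splits instead of five).
sequence_rc = ["rc1", "rc1", "rc2", "rc3", "rc3"]
rc_split = {"rc1": 0, "rc2": 1, "rc3": 1}
groups = [rc_split[rc] for rc in sequence_rc]
folds = list(leave_one_group_out(groups))

metrics = classification_metrics(tp=90, fp=10, fn=10, tn=890)
```

For the example counts, precision, sensitivity, and F1-score are all 0.9, and the MCC is 80,000 / 90,000 ≈ 0.889.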

geNomad exhibits better overall classification performance when compared to other tools
By inspecting the classification performance as a function of the similarity to the train data, we found that geNomad's performance dropped amongst sequences that were more divergent from the train data.
However, it still performed rather well on unseen sequences (Extended Data Fig. 5B), especially viruses, illustrating its potential for the discovery of new viral taxa. Measurement of geNomad's performance on sequences with varying marker coverage (i.e., fraction of proteins assigned to markers) revealed that even those that were targeted by no or few markers were still detected due to the sequence branch of the algorithm (Extended Data Fig. 5C).
When compared to other tools, geNomad presented superior overall classification performance across all sequence length ranges in both plasmid and virus classification tasks (Fig. 3A and B, Supplementary Tables 3 and 4). The difference in performance was especially apparent in short sequences (< 6 kb): while the performance of most tools declined due to the limited genetic information in such sequences, geNomad leveraged its extensive marker dataset and alignment-free neural network to extract as much information as possible and maintain high sensitivity and precision. This highlights the usefulness of geNomad in metagenomic and metatranscriptomic assemblies, where most scaffolds are short.

Score calibration improves sequence classification
geNomad's calibration mechanism enhances the classification process by incorporating sample composition data and assigning estimated probabilities to each sequence, which reflect the likelihood of the sequence belonging to each class. By using calibrated scores instead of raw scores to assign labels, the average classification performance improves because biases introduced during model training are corrected. Indeed, our analysis showed that the plasmid classification performance increased with the use of calibrated scores, particularly for shorter sequences (average ΔMCC: +11.8% for sequences < 3 kb, +5.6% for 3-6 kb, and +3.2% for 6-9 kb) (Extended Data Fig. 5D). We also found that short virus sequences benefited from calibration, though the improvement was not as pronounced. These results showcase the effectiveness of the introduced calibration mechanism for improving classification quality.

Plasmid classification benchmarks
Plasmid classification is a challenging task due to the variable genetic makeup of these elements, their similarity to other mobile elements that can integrate into host chromosomes, and the lack of a standard for reporting plasmids in sequencing data. As a result, most evaluated tools (DeepMicroClass¹, PPR-Meta², PlasClass³, and viralVerify⁴) had low average classification precision (11.0-40.1%, Supplementary Table 3), even when classifying long sequences (Supplementary Table 4), as they often produced a high number of false positives that can impact downstream analysis. In contrast, PlasX⁵ had high precision (81.6%) but low sensitivity (40.5%), which impairs the detection of plasmids in sequencing data. geNomad had the best overall performance by a substantial margin (Fig. 3A, MCC and F1-score in Supplementary Tables 3 and 4), with the highest sensitivity (89.8%) and the second highest precision (70.8%), after PlasX. It's worth noting that geNomad's marker branch, which can be run independently, achieved a considerably higher precision (91.2%) than PlasX.
Most of the plasmid sequences in public databases are limited to a few taxa, such as Gammaproteobacteria and Bacilli, which can bias the training process if taxonomic imbalance is not taken into account. Because it was designed to reduce the effects of taxonomic representation biases during marker selection and training, geNomad is able to identify plasmids from underrepresented groups more accurately. A similar process was also used in PlasX. When compared to other plasmid identification tools, geNomad had the best performance across all appraised taxa (Supplementary Table 5). Notably, geNomad was the only tool to accurately identify the majority of Archaea plasmids (92.54%), which were frequently missed by other tools (0.0-55.3%), and it greatly outperformed other tools for identifying plasmids from major phyla such as Cyanobacteria (geNomad: 96.7%, other tools: 6.3-64.3%), Actinobacteria (geNomad: 95.5%, other tools: 2.5-61.9%), and Bacteroidota (geNomad: 86.4%, other tools: 2.4-69.2%).
Plasmid identification algorithms can be affected by low-quality plasmid annotations in public data.
Extrachromosomal viruses and secondary chromosomes are often incorrectly labeled as plasmids in these databases, so it's important to carefully filter the data to train reliable models and assess classification performance (details in the Methods section). To evaluate if existing plasmid identification tools are prone to misclassifying viruses as plasmids (possibly due to contamination in the training data), we measured the fraction of viruses in our test dataset that were labeled as plasmids by the benchmarked tools (Supplementary Table 6). geNomad, PlasX, and viralVerify had the best performances in this benchmark (1.7%, 1.5%, and 3.7%, respectively), while DeepMicroClass, PlasClass, and PPR-Meta performed the worst (11.3%, 64.4%, and 9.8%, respectively). Of note, geNomad's marker branch classified only 0.2% of the virus sequences as plasmids, which highlights the limitations of current alignment-free tools at this task and the importance of careful dataset curation.
The development of tools that can accurately identify diverse viral taxa is challenging, as no genes are universally shared across the virosphere. Additionally, unequal representation of viral groups in sequencing data, illustrated by the dominance of tailed phages from the Caudoviricetes class, can bias classification models and prevent the discovery of underrepresented taxa. In a benchmark study using representative genomes from the ICTV, we found that geNomad outperformed other tools in all major taxa we evaluated (Fig. 3C, Supplementary Table 7). Notably, geNomad was the only tool that achieved high sensitivity for viruses that encode an RNA-dependent RNA polymerase (Orthornavirae, 98.64%) and for giant viruses (Megaviricetes, 94.74%) at a fixed false discovery rate of 5%. The only other tools to display sensitivity over 50% for all taxa were DeepMicroClass and viralVerify, while the remaining tools failed to achieve this for at least two of the groups. When evaluating sensitivity across different host clades, we found that geNomad was the only tool that identified more than 90% of the viruses infecting bacteria, archaea, and multiple eukaryotic groups, while other tools struggled to identify viruses that infect at least two of the eukaryotic groups that were evaluated (Supplementary Table 8). In an additional benchmark where we measured classification sensitivity on a catalog of metagenomic Inovirus¹⁰, which are known to be challenging to detect automatically, geNomad (sensitivity: 84.8%) also outperformed other evaluated tools (average sensitivity: 32.5%, Supplementary Table 9). These results show that geNomad can be employed to identify a wide range of virus taxa infecting a variety of hosts, enabling the discovery of viruses that are often missed in metagenomic analyses, such as non-tailed phages and viruses that infect eukaryotes. It is worth noting that several of the tested tools (DeepVirFinder, PPR-Meta, Seeker, and VIBRANT) were trained only on phage data and are therefore not designed to identify viruses that don't infect prokaryotes. In fact, VIBRANT was a top performer for Caudoviricetes, Tokiviricetes, Tubulavirales, and Microviridae.
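Sensitivity at a fixed false discovery rate can be computed by scanning score thresholds and keeping the most permissive one whose false discovery rate stays within the limit. The sketch below shows one plausible way to do this. It is an assumption about the procedure rather than the exact implementation used in the benchmark, and the scores and labels are made up.

```python
def sensitivity_at_fdr(scores, labels, max_fdr=0.05):
    """Highest sensitivity over score thresholds whose FDR is at most max_fdr.

    scores: virus scores assigned by a tool; labels: 1 for true viruses, 0 otherwise.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one true virus is required")
    # Walk down the ranking, treating each distinct score as a candidate threshold.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best = 0.0
    tp = fp = 0
    for i, (score, is_virus) in enumerate(ranked):
        if is_virus:
            tp += 1
        else:
            fp += 1
        # Only evaluate once all sequences sharing this score have been counted.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        if fp / (tp + fp) <= max_fdr:
            best = max(best, tp / total_pos)
    return best

# Hypothetical scores and labels for six sequences.
scores = [0.99, 0.95, 0.90, 0.80, 0.70, 0.40]
labels = [1, 1, 1, 0, 1, 0]
strict = sensitivity_at_fdr(scores, labels, max_fdr=0.05)
relaxed = sensitivity_at_fdr(scores, labels, max_fdr=0.25)
```

In this toy example, a 5% limit admits only the three top-scoring sequences (sensitivity 0.75), whereas a 25% limit admits the fifth sequence as well and recovers all four true viruses.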

4. To label profiles as giant virus markers, we treated giant viruses (Nucleocytoviricota, Pandoravirus, Mollivirus, Pithoviridae, Naldaviricetes) as a fourth class, separate from all other viruses, and recomputed SPM values. Profiles with giant virus SPM ≥ 0.94 were considered giant virus markers. This threshold was picked based on the SPM of profiles of known Megaviricetes capsid proteins.