Covid-19 diagnosis by combining RT-PCR and pseudo-convolutional machines to characterize virus sequences

The Covid-19 pandemic, a disease transmitted by the SARS-CoV-2 virus, has already caused the infection of more than 120 million people, of which 70 million have been recovered, while 3 million people have died. The high speed of infection has led to the rapid depletion of public health resources in most countries. RT-PCR is Covid-19’s reference diagnostic method. In this work we propose a new technique for representing DNA sequences: they are divided into smaller sequences with overlap in a pseudo-convolutional approach and represented by co-occurrence matrices. This technique eliminates multiple sequence alignment. Through the proposed method, it is possible to identify virus sequences from a large database: 347,363 virus DNA sequences from 24 virus families and SARS-CoV-2. When comparing SARS-CoV-2 with virus families with similar symptoms, we obtained \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$0.97 \pm 0.03$$\end{document}0.97±0.03 for sensitivity and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$0.9919 \pm 0.0005$$\end{document}0.9919±0.0005 for specificity with MLP classifier and 30% overlap. When SARS-CoV-2 is compared to other coronaviruses and healthy human DNA sequences, we obtained \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$0.99 \pm 0.01$$\end{document}0.99±0.01 for sensitivity and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$0.9986 \pm 0.0002$$\end{document}0.9986±0.0002 for specificity with MLP and 50% overlap. Therefore, the molecular diagnosis of Covid-19 can be optimized by combining RT-PCR and our pseudo-convolutional method to identify DNA sequences for SARS-CoV-2 with greater specificity and sensitivity.

Related works. Several studies have proposed rapid tests for the diagnosis of Covid-19. The most common methods are based on antibodies 15 . proposed a simple and rapid test for the combined detection of IgG and IgM antibodies. Both antibodies are indicative of infection. However, immunoglobulins M provide an immediate response to viral infections and can be detected within 3 to 6 days after infection. Immunoglobulin G, on the other hand, is important for long-term immunity or for the body's immune memory. The test was developed to detect IgM and IgG simultaneously in blood samples, allowing detection in a longer time window. For the development of the rapid test, the authors collected samples from eight different laboratories and hospitals in China, with a total of 397 patient samples positive for Covid-19 and 128 negative samples. These results were confirmed by the RT-PCR technique in a sample of the respiratory tract. Blood samples from the patients were pipetted into the test kit, followed by two or three drops of dilution buffer. After 15 min, it was possible to analyze the result using three markers. The first marker (letter C) or line on the display appears in red when the sample is negative.  46 repository. We employed 24 virus families with more than 500 DNA viral sequences per family, including the Coronaviridae family. SARS-CoV-2 sequences are separated from other Coronaviridae sequences, since we are interested in SARS-CoV-2 identification. Thus, the classification problem involves 25 classes, since SARS-CoV-2 samples are inserted separated from Coronaviridae family. We converted each sequence to numeric vectors using our pseudo-convolutional process. Afterwards, we performed several experiments with different state-of-the-art machine learning methods. Results were evaluated considering four metrics: accuracy, kappa index, sensitivity and specificity. Figure 1 presents a general flow-chart of our proposal, whilst Fig. 2 illustrates our specific contribution. The pseudo-convolutional representation process is described as follows. Initially, the genome sequence is divided into n sub-sequences. These sub-sequences are overlapped with its neighbors. In the overlapping process, a parameter received by the method determines the size of the superimposed pieces. Every ith sub-sequence is combined with a piece of the sub-sequence immediately to its left, the (i − 1)th, and with the piece at its right, the (i + 1)th. An exception is made for the first and the last sequences of the matrix, given that they have only one sub-sequence from which to take a piece. This procedure results into two new sequences for each of the sub-sequences generated from the original genome. After that, these smaller sequences are represented by 4 × 4 co-occurrence matrices. Each element of the matrix represents the number of occurrences of a given pair of nucleotide bases, as well as the relationship between bases and their immediate neighbors. These elements are AA, AC, AT, AG, CA, CC, CT, CG, TA, TC, TT, TG, GA, GC, GT, and GG. For instance, reading a sequence from left to right, when the present nucleotide base is T and its right neighbor is G, the matrix element in row T and column G is incremented. The final matrix is then normalized, where its maximum value is used as normalization factor. Finally, all the 4 × 4 matrices are stacked together, forming a 4 × 4 × m volume, wherein m is the number of sub-sequences resultant from the overlapping process. In general terms: (1) m = (n − 1) × 2. Figure 1. General scheme of the proposal: the RT-PCR process converts and amplify the virus RNA obtained from collected samples and returns a textual DNA sequence as a text file. This textual DNA sequence is converted into a numeric vector by the pseudo-convolutional representation process and then classified into healthy human tissue or one of the 25 virus families or SARS-CoV-2 species by a determined learning machine. The analysis software then returns the diagnostic decision: if the input sample corresponds to a healthy human sample or a determined virus. These virus DNA sequences were obtained from the NIAID Virus Pathogen Database and Analysis Resource-ViPR 46 . Human participants were not involved in our research. No demographic data were collected as well. Each sequence is analysed, being divided into several sub-sequences considering with a determined degree of overlapping. Each sub-sequence is then represented by 4 × 4 co-occurrence matrices representing the distribution of nucleotides neighborhoods. The 4 × 4 × m volume is then stacked to generate the respective feature vector used as input for the classifier. www.nature.com/scientificreports/ After obtaining this set of matrices, they are then concatenated, forming an attribute vector, i.e. a feature vector. These extracted attributes correspond to a high-level vectorial representation of the initial DNA sequence, independent from its size. The detailed view of this representation process is presented on Fig. 3.

Results
Multiclass classification. To assess the efficiency of the proposed method of extracting characteristics, this first round of experiments was conducted in a more challenging scenario with twenty-five different viruses, including SARS-CoV-2. Five types of classifiers were tested: IBk, Multilayer Perceptron (MLP), Naive Bayes classifier (NBC), Random Forest and Support Vector Machines (SVM). All experiments were carried out with the Weka data mining Java library. The parameters used in each machine learning method are shown in Table 1. Figure 4 shows the accuracy for all classifiers in the datasets with 30%, 50%, and 70% of overlap, respectively. Considering this multiclass classification, all three datasets (with 30% , 50% , and 70% overlap) presented Random Forest classifier with the highest accuracies (approximately 94% in all the datasets). Figure 5 shows box plots for the Kappa statistic. Since Kappa statistic is less sensitive to the high imbalanced test dataset, it is a better evaluation metric then accuracy. Nevertheless, the Random Forest classifier achieves the highest Kappa statistics compared with the other classifiers (above 0.88 in all experiments). Figure 6 presents the weighted average sensitivity, specificity, and ROC area for all datasets and classifiers. For the weighted average sensitivity and ROC area, Random Forest results are higher or at least equal to other classifiers. For the weighted average specificity, visual analysis of Fig. 6b suggests that the IBK classifier achieves Steps of the proposed method: a new technique for representing genome sequences based on the analysis of the relationship between nitrogenous bases. The complete genome sequence is subdivided into n folders. Each sub-sequence is combined with a piece of its neighbors, generating two new sequences. These smaller sequences are represented by co-occurrence matrices, considering the occurrence of each of the nitrogenous bases, and the relationship between bases and their immediate neighbors. In the next step, theses matrices are stacked together as a volume. Finally, this set in concatenated, forming attribute vectors, which are a high-level vectorial representation of the original sequence. www.nature.com/scientificreports/ higher scores on this metric. However, all classifiers, except Naive Bayes Classifier, achieved results above 0.99 on weighted average specificity. Therefore the Random Forest is presented as a robust classifier for this task. Aiming to evaluate the overlap percentage in the feature extraction method, Fig. 7 shows box plots for accuracy, Kappa statistic, weighted average precision, recall and ROC area for the Random Forest classifier in the datasets with 30%, 50%, and 70% overlap percentages. The variance of accuracy and kappa in the dataset with 30% overlap is higher than in the 50% and 70% overlap dataset. However, 30% overlap seems to be slightly better, or at least at the same level, as the others overlap percentages.
Due to class imbalance in the test dataset, we need to evaluate sensitivity, specificity, and ROC area for each class separately. Considering the Random Forest classifier in the dataset with 30% overlap, Table 2 shows the results of sensitivity, specificity, and ROC area individually for each virus in the database. Specificity and ROC Area results are above 0.9 for every virus. The sensitivity varies from 0.99391 for Pneumoviridae to 0.23397 for Filoriviridae. However, for most of the classes, sensitivity has values greater than 0.8, including SARS-CoV-2 class with a sensitivity of 0.82.
In order to perform a visual analysis of these results, Fig. 8 shows the average confusion matrix for the Random Forest classifier in the dataset with 30% overlap. The confusion matrix is expressed in terms of percentage for the particular class, and the classes indexes numbers are the same as shown in Table 2. We can see that, for some classes, there is a confusion with another virus. For example, most of the Picornaviridae virus (index 0) is classified as Enterovirus (index 14). This confusion is not symmetrical: Picornaviridae is misclassified as Enterovirus, but Enterovirus is not misclassified as Picornaviridae. The only exception for this observation of confusion with another virus type is the Phasma Viridae (index 23), which is confused with two other viruses: Hantaviridae  Since the ROC area for SARS-CoV-2 is 0.99883 (Table 2), we performed a threshold adjustment for SARS-CoV-2 class in order to reach 100% sensitivity. The new average confusion matrix is shown in Figure 9. Binary classification. Since the highest false positives for SARS-CoV-2 are from Coronaviridae in the multiclass scenario, we evaluated the same classifiers for a binary classification between Coronavirus and SARS-CoV-2. For this experiment, only the dataset with 30% overlap was used, since this overlap percentage has shown to represent the virus genome sequences satisfactorily. Figure 10 shows the accuracy, kappa statistic, weighted average sensitivity, specificity, and ROC area for each classifier. It is important to state that there is still a relevant imbalance between the number of Coronaviridae and SARS-CoV-2 examples in the dataset (3256 and 171, respectively). So, the Kappa statistic is still more appropriate than accuracy to assess the classifier's overall performance. Regarding Kappa statistics, weighted average specificity, and ROC area, MLP results are higher or equal to other classifiers. For the weighted average sensitivity, SVM achieves higher results than MLP. Nevertheless, given that average sensitivity for MLP is higher than 0.96 and MLP overcomes SVM in all other metrics, MLP seems to be a more robust classifier for this particular task. Table 3 shows the sensitivity, specificity and ROC Area for each class. It is possible to notice that each one of those metrics has values above 0.96. Figure 11 shows the average Confusion Matrix for MLP classifier. There was no relevant difference with the multiclass scenario regarding the confusion between Coronaviridae and SARS-CoV-2 since there is still a 3.85% of Coronaviridae examples misclassified as SARS-CoV-2. However, about the confusion between SARS-CoV-2 and Coronaviridae, the binary MLP classifier achieved 2.61% of confusion while 11% in the multiclass scenario.
Viruses with similar symptoms. In this experiment, viruses were selected due to similar symptoms. The dataset was arranged into four classes: SARS-CoV-2, Coronaviridae, Paramyxoviridae, and Miscellaneous. The Miscellaneous Class is a compound of Peneumoviridae, Hantaviridae, Enterovirus, and Nairoviridae. Then, the same classifiers used previously were evaluated in this classification task. Figures 12 and 13 shows the accuracy and kappa for all classifiers and datasets in this classification task. Except for the Naive Bayesian classifier, classifiers have similar performance metrics, with approximately 97% accuracy and kappa equal to 0.96. Figure 14 shows the weighted average specificity and sensitivity and ROC are. The weighted average sensitivity and specificity look very similar to all classifiers (except Naive Bayes Classifier). However, the weighted average ROC areas for MLP and Random Forest classifiers are slightly higher than the other classifiers, although IBK and SVM classifiers also achieved weighted average ROC areas above 0.98 in all datasets.
In order to better evaluate the MLP and Random Classifier, Fig. 15 shows the confusion matrices for those classifiers in all datasets. The Random Forest presents a confusion between the SARS-CoV-2 and the www.nature.com/scientificreports/ Coronaviridae of approximately 10%. It is very similar to the achieved results in the multiclass scenario. However, the MLP classifier achieved significantly low confusion between SARS-CoV-2 and Coronaviridae (1.57% in the datasets with 30% and 50% overlap). The main confusion found in the MLP classifier is between Conronarividae and SARS-CoV-2 (3.81% for the dataset with 30% overlap). By MLP confusion matrix analysis is not possible to www.nature.com/scientificreports/ find significant differences between the 30%, 50%, or 70% overlap percentages. Since the 30% overlap requires less computational effort to extract the features, we can select the MLP classifier with a 30% overlap dataset as a better approach to this particular task. Table 4 shows the sensitivity, specificity and ROC area for each class. The average ROC Area and specificity are above 0.99 for all classes. The average sensitivity is also above 0.99 for the Paramyxoviridae and Miscellaneous classes. The lowers sensitivity is for Coronaviridae (0.959), while a slightly higher sensitivity is achieved for SARS-CoV-2 (0.97).
Real test scenario. In this scenario, the SARS-CoV-2 test is designed as a three-class classification problem: SARS-CoV-2 (the test target), GRCh38 (the healthy human reference), and Coronaviridae (a virus control sample). The same classifiers used in the other experiments were applied to this new task. Figure 16 shows the accuracy, whilst Fig. 17 presents the kappa statistic results. Except for the Naive Bayes Classifier, all other classifiers achieved accuracies higher than 99% and kappa indeces above 0.9. By these metrics, it is not possible to distinguish the best classifier. The same behavior is observed in the weighted average metrics www.nature.com/scientificreports/ shown in Fig. 18. Weighted average sensitivity, specificity, and ROC area are higher than 0.99 for all classifiers except the Naive Bayes Classifier. Aiming to better evaluate the results of the classifiers in the SARS-CoV-2 test task, all the confusion matrices for IBK, MLP, Random Forest, and SVM classifiers are shown on Fig. 19. IBK and Random Forest classifiers presents a confusion between SARS-CoV-2 and Coronaviridae that varies from 10.26% (Fig. 19h) to 14.77% (Fig. 19c). This outcome is even worse for SVM classifier since most of the SARS-CoV-2 examples are misclassified as Coronaviridae. By confusion matrix analysis, the MLP classifier has lower confusion rates between SARS-CoV-2 and Coronaviridae. The results from MLP classifier in the dataset with 50% overlap (Fig. 19e) shows 99.92% average true positive rate for GRCh38 class, and 98.82% for the SARS-CoV-2. For the Coronaviridae class, this classifier achieves 96.2%, while only 3.73% of Coronaviridae examples are misclassified as SARS-CoV-2. Table 5 shows the sensitivity, specificity and ROC Area for each of the classes for this MLP classifier.

Discussion
Considering the proposal for the representation of genomic sequences, the good classification results using classic machine learning methods show evidences that the pseudo-convolutional method for feature extraction proposed in this work may be sufficient to guarantee the identification of viruses from the transcribed DNA with high sensitivity, specificity and area of the ROC curve. The Random Forest classifier obtained the best overall performance for multiclass scenarios, while the MLP classifier presented the best results for scenarios with fewer classes.
Evaluating the parameters for the proposed sequence-based feature extraction method, dividing the virus genome sequence into four folders ( n = 4 ) seems to be sufficient to produce representative characteristics. Regarding the percentage of overlaping, the proposed feature extraction method is not very sensitive to this parameter, although 30% to 50% seems to be sufficient to produce good characteristic representations.
The first multiclass scenario (with 25 virus classes) is a scenario designed to demonstrate extreme difficulty. It is highly unlikely that many of the virus families present in the classification are diagnostic possibilities, since the forms of contagion can be different and the diseases can manifest themselves through very different symptoms. However, the Random Forest classifier achieved sensitivity and specificity above 0.9 for many classes. For classes with lower sensitivity, the confusion matrix shows that most confusions are particular between two families of viruses. For example, Filoriviridae is the class with the lowest sensitivity rate (0.23). However, checking the confusion matrix, on average 76.27% of Filoriviridae are mistakenly classified as Ebola virus. There is no other significant confusion for Filoriviridae. Then it is possible to design a cascade of classifiers to solve this specific confusion between two viruses. www.nature.com/scientificreports/ A particular class of virus is the Pharma Viridae. This family has only 42 samples in the data set (30 used for training and 12 for testing). Even with this small number of samples in the multiclass scenario, the significant incorrect classifications for Pharma Viridae are Henteraviridae (22.78%) and Peribunyavirida (35.26%). With a larger sample size for Pharma Viridae, classifiers could find a better decision frontier at this false-negative rate. However, for this particular class, cascading classifiers of three classes can be evaluated to deal with these types of errors.
Regardless of the characteristics extraction parameters or even the classifier used, there are still 3-4% of Coronaviridae samples classified incorrectly as SARS-CoV-2. However, this is an expected result, since SARS-CoV-2 belongs to the family Coronaviridae. We artificially separate the SARS-CoV-2 sequences from the Coronaviridae family, in order to verify whether the proposed pseudo-convolutional representation of genomic sequences would be able to help detect differences between SARS-CoV-2 and the other human coronaviruses . Thus, visualizing the extracted features, we found some samples of SARS-CoV-2 and Coronaviridae that cannot be distinguished, as shown in Fig. 20. Therefore, it is difficult for any classifier to ideally separate these two classes.
In this work we presented a novel method to represent DNA sequences by using pseudo-convolutions and co-occurrence matrices. With this method, we were able to represent hundreds of thousands of DNA sequences from 24 virus families. Then we separated SARS-CoV-2 sequences from the Coronaviridae family and demonstrated that our model is able to differentiate all virus families present on our database. SARS-CoV-2 was discriminated from virus families other than Coronaviridade and even from other coronaviruses with very high sensitivity and specificity.
Our goal was to show the ability to optimize the molecular diagnosis of Covid-19 by combining RT-PCR, Covid-19's gold standard diagnostic method, and our pseudo-convolutional method to identify DNA sequences www.nature.com/scientificreports/ from SARS-CoV-2 from others candidate viruses without high computational cost bioinformatics methods, such as multiple sequence alignment. From the results obtained, we can assume that the proposed method is able to characterize DNA sequences from SARS-CoV-2 transcribed from the RT-PCR process. This new representation of DNA sequences can be used successfully as a feature extraction stage for fully connected networks, in order to use the deep learning philosophy or other classical classification architectures. The assessment of the proposed approach in real test scenarios, necessarily reduced to a limited set of candidate virus families and DNA from healthy human samples, also showed high sensitivity (greater than 0.988) and specificity (greater than 0.998). Consequently, other researchers can use our solution and our methods to improve their results and diagnose Covid-19 and other viral diseases considering a considerably high number of candidate viruses with very high accuracy, precision, sensitivity and specificity. The virus identification system proposed in this work was designed for clinical application, in clinical analysis laboratories. Potential users are biomedical technicians and other health professionals responsible for analyzing DNA samples from RT-PCR tests. The system was proposed as an alternative to web systems based on DNA sequence alignment, which can be used in a similar way from the point of view of potential users, i.e. biomedical technicians. The main limitation of the system is its restriction for analyzing DNA sequences of viruses, since many viruses are expressed as RNA, and not as DNA. The system is in the process of being implemented at the Biomedical Computing Laboratory of the Federal University of Pernambuco. As future work, we propose to adapt the proposed pseudo-convolutional network approach to RNA sequences, providing potential users to identify viruses both by their expression in RNA or in DNA and by the translation of RNA into DNA, in the case of RNA viruses. Random Forest. Random Forests are based on decision tree committees organized in bagging 47 . Decision trees, as illustrated in Fig. 21, separate data iteratively, testing one property at a time. The resulting sheets represent the most specific category. The root represents the raw data. The random forest is built with many of these trees, all with their own class prediction for any input provided. The most voted class is the departure of Random Forest. Random Forests have been used to solve a plethora of biomedical problems, specially to develop intelligent systems to support diagnosis [48][49][50] . www.nature.com/scientificreports/ As the most relevant characteristics to determine the decision boundary between classes of virus DNA sequences are unknown, Random Forests can be powerful methods for classification, as they are able to verify many relevant properties through their different trees. In the bagging process, each tree receives a version of the training set with a reduced number of attributes. Thus, it is possible to build decision criteria that take into account only a few attributes and these criteria can be winners in the vote, determining the final decision of the classifier.
Naive Bayes classifier. This machine learning model is based on Bayesian Decision Theory 51 . Given a set of classes � = {ω 1 , ω 2 , . . . , ω m } and a universe R n of feature vectors x = (x 1 , x 2 , . . . , x n ) T that model instances of a representation universe, the probability of deciding for class w i given an attribute vector x is given by: Figure 11. MLP average Confusion Matrix for binary classification task (Coronaviridae vs. SARS-CoV-2 using the 30% overlap dataset). www.nature.com/scientificreports/ where whilst P(ω i ) is the a priori probability of ω i and p(x|ω i ) is the density probability distribution of x given the class ω i . The decision is given according to the following rule, called the maximum probability hypothesis: It is called naive because it assumes independence in the features that lead to the events, i.e. given a feature vector x = (x 1 , x 2 , . . . , x n ) T , the features x i and x j are statistically independent, since i = j . Then, we can have the following: Consequently, the discriminant functions to model each class, P(ω i |x) , are defined as following: where p(x j |ω i ) are usually modeled as Gaussian probability density functions: whilst x j |ω i ∼ N(µ j,i , σ 2 j,i ) , ∀i, j = 1, 2, . . . , n. Naive Bayesian classifiers are commonly tested against other classifiers in diagnosis support solutions [48][49][50] . Furthermore, they assume all features/predictors have an equal weight. This approach could be beneficial because it explores the possibility that the genomes have characteristics that are not correlated. Should that be the case, this classifier might yield good results.
Instance-based learner. In this algorithm, also known as IBK 52 , a model is not built. It stores the training set. Given a sample, the prediction is made by calculating the distance k between that instance and the training set instances, as shown in Fig. 22. The distance is determined by some similarity criteria. Thus, this algorithm makes the decision based on the search for instances similar to the input instance. Thus, it may be able to identify genome sequences belonging to the same virus. Figure 13. Kappa Statistic for similar symptoms scenario.  Fig. 23, is an artificial neural network designed to solve nonlinearly separable problems 53 . Each artificial neuron has weights that multiply the input, which in turn is processed by an activation function to generate the output. The weights are adjusted until the net can satisfy a certain pre- www.nature.com/scientificreports/ cision at the output. In this way, he could identify the characteristics that are particular to each class. Multilayer perceptrons have been used to several biomedical problems, specially to develop intelligent systems to support diagnosis [48][49][50]54 .
Support Vector Machines. Support Vector Machines (SVMs) were originally created for classification tasks where classes can be separated linearly, that is, by means of hyperplanes 55 . Through search algorithms, such as the Sequential Minimum Optimization Algorithm (SMO), vectors of characteristics of each class of the training set are chosen as supports. These support vectors are used to find the optimal hyperplan of separation of classes, according to the scheme of Fig. 24. To deal with nonlinear problems, SVMs use kernel transformations, which take the original attribute vectors to other representation spaces. According to the Coverage Theorem, the nonlinear mapping of attribute vectors in representation universes with more dimensions makes it more likely to find linear solutions to discriminate classes. SVMs have been applied to solve several biomedical problems, specially on developing intelligent systems for diagnosis support [48][49][50]54 . Since SVMs are commonly used in biomedical problems, presenting good results, it is reasonable to evaluate a considerable amount of SVM configurations and kernels to solve our problem considering the sequence representation method presented in this work.      46 , which features multiple whole-genome sequences (WGS) from several viruses. Table 6 displays the number of examples per virus famliy for each of the selected viruses. Human participants were not involved in our research. No demographic data were collected as well.
The viruses have different sample sizes, ranging from 42 sequences, as for Phasmaviridae, to 216,223 nucleotide sequences, for Hepatitis C. The bar graphs below depict the distribution of sample sizes in both a linear and a logarithmic scale.
The second dataset used in this paper is from the Genome Reference Consortium 57 . Its purpose was to represent the human genome. We used 103,959 sequences.

Feature selection.
To evaluate the relevance of the features generated by the DNA sequences representation method proposed in this work, we used meta-heuristic optimization methods. Populations of 20 individuals were used, evolving in 20 generations. As an objective function, we use a decision tree, trained and tested using 10-fold cross validation. Each individual represents the attributes used in the classification by means of a binary vector, where 1 models the presence of that attribute, while 0 represents its opposite. We employed meta-heuristic libraries developed in Java for Weka data mining platform 58 . We adopted the following feature selection methods:    www.nature.com/scientificreports/ • Particle Swarm Optimization, individual weight of 0.34, inertia weight of 0.33, mutation probability of 0.01, report frequency of 20, social weight of 0.33 [86][87][88][89] .
All feature selection parameters were defined empirically, taking into account the most commonly used parameters values.
Each attribute selection method returns the probability of relevance for each of the 97 attributes extracted from the base of virus DNA sequences. At the end, we checked the agreement between the methods, seeking to find a consensus among the attributes selected by each method. Interestingly, all methods returned 100% relevance for each of the 97 attributes, showing that the attributes generated by the DNA sequence representation method proposed in this work generate equally relevant and distinctive attributes, with no representation redundancies. Experiment setups. In order to validate our proposal, we designed four experiments based on theoretical and practical situations in which it is necessary to identify SARS-CoV-2 among a determined set of virus candidates represented by transcripted DNA sequences: (a) multiclass classification; (b) binary classification; (c) virus with similar symptoms; and (d) real test scenario. These experiments are described as following.
Multiclass classification. The purpose of this experiment is to identify the SARS-CoV-2 virus considering 24 virus families, including its own family: Coronaviridae. The SARS-CoV-2 sequences were separated from the Coronaviridae family to differentiate SARS-CoV-2 from other human coronaviruses. The complete list of virus families is shown in Table 6. All 25 classes in Table 6 were used to build the database, which was divided into a training set and a test set. For classes with more than 500 instances, the training set consisted of 500 instances, while the remaining instances were allocated to the test set. Classes with less than 500 distances were divided into 70% of instances for training and 30% for testing. In addition, the n attribute extraction hyperparameter was set to 4. The sequence overlap was tested for 30%, 50% and 70%. The general objective of this experiment is to evaluate the ability of the proposed sequence representation algorithm to classify virus DNA sequences transcribed from RT-PCR in the worst case scenario, where there is no knowledge about the virus present in the sample. This is a purposefully difficult scenario, although unlikely from a practical point of view, but one that tends to test the performance of the proposal in the worst case scenario.
Binary classification. This test was used to analyze the efficiency of the proposed method in differentiating SARS-CoV-2 from other viruses belonging to the Coronaviridae family. Viruses in the same family can be potentially challenging to classify when compared, given that the structural similarity is considerably high. In this scenario, the two with their genomes are contrasted only with each other. The training and testing division was carried out exactly as in the multiclass evaluation. The n feature extraction hyperparameter was set to 4 and the overlap was set to 30%, a percentage considered satisfactory in the multiclass experiment.
Viruses with similar symptoms. A third test was planned to classify viruses with symptoms similar to SARS-CoV-2. This test scenario is useful for situations where it is necessary to determine whether a patient has been infected with SARS-CoV-2 but has symptoms that may indicate infection with other viruses, which have similar symptoms. Four classes were established: SARS-CoV-2; Coronaviridae; Paramyxoviridae; Peneumoviridae, Hantaviridae, Enterovirus, and Nairoviridae. The criteria for building the training and test sets and for choosing the n hyperparameter were maintained as in the previous tests. The overlap was defined as 30%, 50% and 70%.
Real test scenario. This test was designed to reproduce the most common scenario of using RT-PCR to diagnose Covid-19. This scenario included three classes: the human genome, obtained from the Genome Reference Consortium 57 , SARS-CoV-2 and the other viruses in Table 6. It tests the actual use case of the proposed method, in which SARS-CoV-2 must be identified between the human genome and other viruses. The training and test sets were built in the same way as in the previous scenarios. The n hyperparameters has been maintained. The overlap was tested for 30%, 50% and 70%.
Metrics. The following metrics were used to evaluate the classification performance of our proposal in all test scenarios:

• Confusion Matrix
The confusion matrix provides a straightforward framework for portraying the model output. The lines represent the prior knowledge and the columns represent the results predicted by the model. n expresses the total number of instances. Each line, when added, is equivalent to the total number of instances per class. The number of correctly classified instances can be obtained by adding all the elements of the main diagonal. On the other hand, the number of instances classified incorrectly is obtained by adding the elements outside the main diagonal.

• Accuracy
The accuracy describes the rate of correct classification of instances and is the most commonly used metric in machine learning. Considering a confusion matrix T = [t i,j ] n×n for a classification task with n classes, in which i denotes the index of the ith true class and j points to the index of the class associated to the classification decision, the jth class, the accuracy is defined as following: where FN is the number of false negatives.

• Sensitivity
The sensitivity, or True Positive Rate, is given by: • Specificity The specificity, or True Negative Rate (TNR), if defined as following: where TN is the number of true negatives.

• Area Under the ROC Curve
The Receiver Operating Characteristic (ROC) curve is a graph that plots the True Positive Rate (TPR) and False Positive Rate (FPR) of classification for different thresholds. The FPR is defined by: Thus, the Area Under the ROC Curve (AUC) measures performance for all possible thresholds of classification in a given model, and therefore it portrays the quality of results independently of it.
Human or animal subjects. This article does not contain any studies with human or animal subjects.

Data availability
The main data supporting the results in this study are available within the paper. Data was obtained from the NIAID Virus Pathogen Database and Analysis Resource (ViPR) 46 .