Proof of concept of the potential of a machine learning algorithm to extract new information from conventional SARS-CoV-2 rRT-PCR results

Cabrera Alvargonzález, Jorge; Larrañaga Janeiro, Ana; Pérez Castro, Sonia; Martínez Torres, Javier; Martínez Lamas, Lucía; Daviña Nuñez, Carlos; Del Campo-Pérez, Víctor; Suarez Luque, Silvia; Regueiro García, Benito; Porteiro Fresco, Jacobo

doi:10.1038/s41598-023-34882-6

Download PDF

Article
Open access
Published: 13 May 2023

Proof of concept of the potential of a machine learning algorithm to extract new information from conventional SARS-CoV-2 rRT-PCR results

Jorge Cabrera Alvargonzález^1,2,3,
Ana Larrañaga Janeiro⁴,
Sonia Pérez Castro^1,2,3,
Javier Martínez Torres⁵,
Lucía Martínez Lamas^1,2,
Carlos Daviña Nuñez¹,
Víctor Del Campo-Pérez⁶,
Silvia Suarez Luque⁷,
Benito Regueiro García^1,2,8 &
…
Jacobo Porteiro Fresco⁴

Scientific Reports volume 13, Article number: 7786 (2023) Cite this article

718 Accesses
1 Citations
2 Altmetric
Metrics details

Subjects

Abstract

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has been and remains one of the major challenges modern society has faced thus far. Over the past few months, large amounts of information have been collected that are only now beginning to be assimilated. In the present work, the existence of residual information in the massive numbers of rRT-PCRs that tested positive out of the almost half a million tests that were performed during the pandemic is investigated. This residual information is believed to be highly related to a pattern in the number of cycles that are necessary to detect positive samples as such. Thus, a database of more than 20,000 positive samples was collected, and two supervised classification algorithms (a support vector machine and a neural network) were trained to temporally locate each sample based solely and exclusively on the number of cycles determined in the rRT-PCR of each individual. Overall, this study suggests that there is valuable residual information in the rRT-PCR positive samples that can be used to identify patterns in the development of the SARS-CoV-2 pandemic. The successful application of supervised classification algorithms to detect these patterns demonstrates the potential of machine learning techniques to aid in understanding the spread of the virus and its variants.

Early prognosis of respiratory virus shedding in humans

Article Open access 25 August 2021

Real-time prediction of COVID-19 related mortality using electronic health records

Article Open access 16 February 2021

COVID-19 in early 2021: current status and looking forward

Article Open access 08 March 2021

Introduction

There have been 235 million cases and more than 4.5 million deaths of SARS-CoV-2, the causal agent of coronavirus disease 2019 (COVID-19)¹. The progress of this pandemic has been characterized by a continuous rise and fall of infections in somewhat unpredictable temporal intervals. These periods of higher infection-rate are generally identified as epidemic “waves”, although its definition is somehow subjective since the onset and ending of each wave has not been formally defined^2,3,4.

The standard method for the detection of SARS-CoV-2 is based on real-time reverse transcriptase polymerase chain reaction (rRT-PCR) performed using a nasopharyngeal swab sample⁵. Massive testing, in conjunction with other control measures, has been implemented to identify symptomatic or asymptomatic carriers to prevent the spread of SARS-CoV-2. The cycle threshold (Ct) value is inversely related to the amount of RNA of the virus present in the sample, which has awoken interest as an indirect method to predict infectivity, disease progression, severity and even associated mortality⁶.rRT-PCR tests are commonly considered qualitative tests (i.e., providing just a positive or negative result); however, they provide a Ct value for each target gene, which indicates the number of PCR cycles required to reach the threshold level of fluorescence associated with a positive result. Hence, the Ct value is inversely proportional to the viral load, although this correlation is not linear and depends on many factors. A recent work reviewed several works published on the connection of Ct with patient conditions and clinical outcomes⁷.

In addition, mutations that affect the template sequence where the primers bind can affect the amplification efficiency and, therefore, the Ct values for a specific gene. For SARS-CoV-2, increases in Ct values or nondetection of different genes associated with specific mutations of the virus have been described^8,9,10. In fact, variant B.1.1.7 can be efficiently identified with specific tests through an undetectable S gene target or significant increase in the Ct value of the S gene compared to other targets^11,12,13. Nevertheless, a significant delay of the Ct to N gene was observed in another test¹⁴, which confirms that the differences in Ct values between genes depend on the virus mutations but also on the primers used and, therefore, could be considered assay-dependent.

The mutation rates of coronavirus are low in general. However, in the case of SARS-CoV-2, the numbers reached in the pandemic have led to the accumulation of mutations and the emergence of multiple lineage and variants. Some of them are classified as variants of interest (VOIs), and variants of concern (VOCs)^15,16,17, which have relevant epidemiological characteristics that may affect the virus’s properties, spreads, clinical characteristics of disease, and vaccine and drug performance. For this reason, it is important to track known variants and implement surveillance systems capable of detecting significant changes in the predominant variants, the variants causing an outbreak or even the emergence of a variant of high consequence (VOHC). The gold standard for identifying and tracking variants in circulation is the whole genome sequencing, although it has important limitations related to costs, resource availability, lack of expertise, standardization, and time delay.

Currently, machine learning or deep learning models are being used in the medical field and, more specifically, in the field of COVID, even as a proof of concept like this work ^18,19. In particular, several works have developed classification models based on X-ray images^20,21,22,23. Beyond image-based models, several works have addressed information from blood²³ or medical information^19,24. More specifically, different works present a model established to predict COVID diagnose^25,26 and PCR results²⁷ based on clinical information alone.

There are two main approaches: unsupervised learning and supervised learning. In unsupervised learning, we should allow the algorithm to seek its own way to classify the data; however, for the supervised approach, we must give the algorithm a target set of clusters. In this work, we opted for the latter.

The Microbiology Department of the University Hospital of Vigo (Complexo Hospitalario Universitario de Vigo, CHUVI), was a pioneer in the use of pooling in saliva for detecting SAR-CoV-2 in a nonsymptomatic population²⁸; and over the pandemic, the ‘pooling lab’ has screened more than 750,000 individual samples by pooling. When we analyzed the results of the positive samples in pools versus individuals, we observed how the relationship between Ct values remained constant for each sample despite the increase caused by dilution. These relationships between Ct values could be gathered into similar profiles, and some groups seemed to have temporal accumulation. These findings and those explained in the previous paragraph made us consider whether there is an underlying ‘signature’ or ‘pattern’ in the Ct results of an rRT-PCR that can be used for the classification of samples.

The main goal of this work is to assess whether there is a Ct pattern that is characteristic of virus variants. Since ML algorithms require a high volume of training data to be effective and genome sequencing for variant determination was impractical, we decided to use the wave index as the characteristic target cluster. Hence, the working hypothesis is the following: each wave has a distinguishable pattern (signature) on the rRT-PCR results that allow an ML algorithm to efficiently predict the wave to which each individual test belongs.

Due to limitations in the number of tests completely sequenced, we will train our algorithm to predict the waves inside the evolution of the pandemic and then compare this prediction with the arrival and predominance of the different virus variants in the area where this study was conducted. By creating a large database to efficiently train a classification algorithm, we show that there is an underlying signature response in rRT-PCR results probably related to variations in viral genome.

The aim of this study is to demonstrate the presence of a distinguishable signature in the Ct pattern of rRT-PCR. However, it is important to note that any further generalizations should only be made once the results are confirmed through various labs, test conditions, and sequenced samples in which the variant is already determined. Until such verification is obtained, it would be premature to draw further conclusions.

Methods

Brief description of the rRT-PCR technique employed, primers, and software

We performed nucleic acid extraction in a MicrolabStarlet IVD platform using the STARMag 96 × 4 Universal Cartridge Kit (Seegene Inc, Seoul, South Korea). To detect SARS-CoV-2, we applied the Allplex™ SARS-CoV-2 Assay (Seegene Inc, Seoul, South Korea), a multiplex one-step rRT-PCR able to simultaneously detect four viral targets, including the structural protein envelope (E) gene, the RNA-dependent RNA polymerase (RdRP) gene, the spike (S) gene, the nucleocapsid (N) gene, and an exogenous RNA-based internal control (IC). This rRT-PCR step was run on a CFX96™ system (Bio-Rad Laboratories, Hercules, CA, USA), and the analysis was performed using Seegene Viewer-specific SARS-CoV-2 software (Seegene Inc, Seoul, South Korea), resulting in separate cycle threshold (Ct) values for the E and N genes and one combined Ct value for the RdRp and S genes (RdRp/S) in the FAM, Cal Red 610 and Quasar 670 channels, respectively. The HEX channel is used for internal control. Regarding interpretation of the results, according to the manufacturer’s instructions, Cts values ≤ 40 are considered detected, and Cts value > 40 or not applicable (N/A) are considered not detected.

Ethics

The study protocol (2021/295) was approved by the Galician network of committees of research ethics conformed to the principles outlined in the declaration of Helsinki. All methodologies were performed according to the relevant guidelines and regulations, and patient data were anonymized. The dataset used and the waiver of informed consent was approved by Galician network of committees of research ethics.

Description of the dataset employed in this work

Positive samples from two different sources were used for this study. First, 3274 positive samples were obtained from the 688,763 samples processed by the pooling techniques between August 2020 and July 2021. Second, we also included 17,144 positive samples obtained from the 313,939 samples processed in the microbiology laboratory between February 2020 and March 2021.

The samples processed in the ‘pooling lab’ are screenings to detect SAR-CoV-2 in a non-symptomatic population. The participants were asked to collect saliva (self-sampling) in TRANSPORT MEDIUM-2 (Vircell® Ref: TM013) immediately after waking up, following the manufacturer's instructions. Although each result pertains to an individual rRT-PCR for each sample, these samples were first flagged as possibly positive by group testing. Then, the original samples from the positive pool are individually analyzed. These are the results that are used here. Individual samples and pools were analyzed following the same standard rRT-PCR protocol described in “Brief description of the rRT-PCR technique employed, primers, and software” section.

The other samples were nasopharyngeal swabs processed individually in the laboratory of the CHUVI Microbiology Department as part of the assistance routine for SARS-CoV-2 diagnosis. It is important to note that the supply of this source of positive samples ended prematurely in March 2021 due to the need to change the reagent used in this laboratory (Allplex™ SARS-CoV-2 Assay to Allplex™ SARS-CoV-2/Flu A/Flu B/RSV assay, both from Segene Inc.) because of the high demand. In this way, we were able to keep the Allplex™ SARS-CoV-2 assay to group testing, since in this case, an assay change requires a full re-evaluation of the system, and the increase in the Cts for the N gene previously described by Wollschäger et al.¹⁴ may have greater significance in group testing. As explained in “Classification results” section, the data from 12,313 positive samples obtained by the Allplex™ SARS-CoV-2/Flu A/Flu B/RSV assay between February and August 2021 could not be included in the present study.

Characterization of the wave concept

Since the pandemic began in March 2020, the concurrent increases and decreases in cases have been linked to the concept of ‘wave’, which are determined using subjective, unofficial criteria. To the best of the authors’ knowledge, this is an abstract nomenclature whose rigorous definition has not yet been clearly established to date. To characterize the pandemic dynamics in our area, we tracked the curve of active cases at the level of Galicia and, more specifically, Vigo and determined the boundaries between the so-called ‘waves’ in a data-driven way.

The database of active cases in the entire Galician region during the SARS-CoV-2 pandemic was obtained from data provided by the public health service of the Autonomous Spanish Community. In order to determine the time limits of each wave, the contagion curve is fit to a smoothed spline (R² = 0.99), and the waves are defined by the local minima and maxima of the curve, as shown in Fig. 1. Therefore, it can be concluded that although vaguely defined, waves are quite distinguishable, and the number of samples is inherently higher near the peak of each wave and much lower in their frontiers. Additionally, each new wave could also be associated with a higher proportion of samples with lower Ct values at the beginning²⁹.

Focusing now on the data available for this study, Fig. 2 shows how the existence of the waves is reduced to four clearly differentiated peaks. First, the slight increase in the number of cases experienced in autumn of 2020 is not seen in the data collected. The peak of cases in the second wave is concentrated in the last months of the year.

As will be explained in this work, the key aspect is the capacity of the machine learning algorithm to correctly predict the wave to which each sample belongs based on the numerical results of rRT-PCR for each gene. Therefore, a change in the target genes of the PCR performed during the fourth wave is too strong to indicate the temporal position of those tests and therefore had to be removed from our database to avoid giving an unfair advantage to the algorithm. Unluckily, the dire circumstances under which laboratories had to work during the pandemic led to this type of disturbance. Fortunately, in our case, it only significantly affected the fourth wave.

Descriptive analysis of the database used in the work

Even after extracting the samples that could lead to unfair results, the resulting database used in this study corresponds to a set of 20,418 PCR samples collected by the Microbiology Department of the CHUVI from March 2020 to July 2021. For each sample that tested positive for SARS-CoV-2 the database included an anonymized identification number, the date when the sample was taken, the threshold value for each target gene (E, N and RdRP/S) and the threshold value for the internal control (IC). The RdRP and S genes share the same channel; therefore, we obtained a single Ct value for both genes.

Figure 3 shows the distribution of the number of cycles from the analyzed gene profiles, where the average Ct value is approximately 26 for genes E and N and close to 28 for the combination of genes RDRP/S.

Some visual features arise from the simple analysis of the data csollected. As seen in Fig. 4, the RdRP/S gene distribution seems to be slightly offset toward higher Ct values; and in fact, a more abrupt end is shown. However, a strong linear relationship between the number of cycles of the three genes can be observed from the database (R²_E-N = 0.96, R²_R-N = 0.95, and R²_E-R = 0.97). This is anticipated since the presence of the genes is expected to be similar and each number of cycles is allegedly related to the viral load of the individual; thus, the values of the numbers of cycles detected in any sample are usually quite close.

Figure 5 shows the temporal evolution of the number of cycles of each gene during the pandemic. The figure clearly shows that, at least at first glance, there is no trend or time evolution that points to Ct differentiation over time.

Classification techniques employed

To cluster the samples, a supervised learning technique would allow predicting the membership of a sample to a wave based simply on the number of cycles presented as a result of the PCR. Supervised learning algorithms are based on using labeled input data, i.e., with a correct answer with reference to its classification. Thus, as the algorithm is trained, it compares its predicted output with the correct input response until the error in its decision is minimized. In this work, we have labeled each sample with the wave that was dominant at the time the sample was taken from the individual. Since the definition of each wave, described in “Characterization of the wave concept” section, is unique, there is no possible ambiguity on the wave that we assigned to each sample. However, when a new wave becomes dominant, that means that, for the reasons discussed later, there is are new mechanisms in the pandemic progression that become dominant over the receding conditions of the previous wave. Hence, there is an intrinsic overlap that, by the methodology employed, cannot be resolved. Instead, the wave assigned to each sample is simply the dominant wave at that specific date.

Considering that a supervised algorithm was chosen to classify the waves, it was decided to confront two classic approaches within the machine learning field: a support vector machine (SVM) and a neural network (NN), using MATLAB R2021a as the main tool for the model’s development and post-processing the results. The fundamentals of each of the classification techniques tested are completely different although their final performance, as will be shown, is similar. In the NN approach, the model learns according to the training strategy and adjusts the weights of each of the neurons towards the optimum, whereas in the SVM approach a maximum margin hyperplane is created by means of kernel functions that allow the increase of dimension and thus facilitates the classification task. A detailed mathematical description of both models utilized in this work, SVM and NN, can be found in the Supplementary material.

Structure of the model

Figure 6 includes the information detailing the several steps that constitute the entire ML pipeline. First, the number of cycles of each of the genes for a single sample and an additional number of cycles corresponding to the internal control are taken as input parameters. Then, the training process starts after the machine learning algorithm is chosen.

The output of the algorithm corresponds to a confidence score that represents the probability that a sample belongs to a particular group. Considering that the main objective is to predict the membership of a sample to a wave, the output of this algorithm will correspond to a confidence level associated with the probability that a sample belongs to a wave. Thus, the wave with the highest confidence level assigned to it will be the one chosen as the predicted wave.

Subsequently, the prediction will be compared with the real wave value assigned to the sample. If the prediction coincides with the real value, it represents a good estimating; conversely, it represents an incorrect classification. The actual wave value of the sample is determined from the date of the sample taken and the estimated cutoffs with the approximation of the wave of active cases to a spline.

Results

This section will discuss the results obtained during the present investigation. The section will begin with the characteristics of the algorithms used, then address choosing the best alternative by evaluating the outcome and end with some conclusions drawn from the analysis.

Metrics

First, for the training of both models, a technique called cross-validation was utilized to avoid overfitting (i.e., the situation in which the network overlearns and extracts some noise as the main structure of the data, which affects the generalization of the predictions that tend to be incorrect). The idea behind cross-validation is to divide the database into a number of randomly chosen partitions, usually balanced, in order to train the model on subsets that it has not seen before. Thus, in this case, the database was divided into five parts. Four parts were used for training and the remaining part was used for testing. Then, the part used for validation was alternated as each training step was completed.

The criteria used to compare the results of the supervised algorithms was the accuracy together with the detailed information drawn from the confusion matrix. Considering that cross-validation was utilized, the accuracy was calculated as the percentage of observations correctly classified considering only the number of samples held out for validation in each training segment.

Among the diverse metrics used to evaluate a model, the confusion matrix is often used to evaluate the results of a model and identify its “weak points”. In this matrix, the diagonal shows the percentage of correctly identified results (i.e., second waves identified as such), and the off-diagonal elements show the failure rates.

Furthermore, the confusion matrix also shows the true positive rate (i.e., the TPR) and false negative rate (i.e., the FNR) in the right-hand columns. The TPR is the proportion of samples correctly classified with respect to their true class, and the FNR is the proportion of samples incorrectly classified with respect to their true class.

Overview of the performances of the SVM and NN

The classifications made with both algorithms obtained similar results. This can be seen in the accuracy results in Table 1. This similarity was expected from the linearity observed in the data, which led to equivalent results for both approaches. The cases in which the SVM shows a real improvement over the other techniques are usually related to nonlinear trends among the original data, which was not found in the present study.

Table 1 Classification algorithm specifications and training results.

Full size table

Table 1 includes the model characteristics for the SVM, following a one-versus-one multiclass method with no standardization of the data; and for the NN, which has a unique hidden layer with 100 neurons and uses the ReLU activation function. It is important to note that despite the large difference in training times between the two algorithms, their accuracies are practically identical.

Figure 7a and b include the confusion matrices for the algorithms, where the blue color is related to responses that are correct and the orange tone is related to misclassified samples. The fourth wave is the most clearly identified, with an accuracy (micro averaged) of over 93% in both cases. However, the time frame between the second and third waves, as well as the transition from the first to the second, causes more confusion in the algorithms. As seen in the TPR column, the lowest hit percentage corresponds to the first wave in the case of the SVM (Fig. 7a) and to the third wave in the case of the NN (Fig. 7b).

Classification results

When discussing the results, the outcome offered by the SVM is considered since they showed quite similar results and the SVM is a model that has certain advantages over the neural network in terms of performance and extrapolation of results. One such advantage is the opportunity to avoid retraining the model with the input of new samples, which is required in the case of the neural network.

As seen in Fig. 8, the majority of individuals are classified within their wave, and those points (located in the middle-bottom area of the figure) that have been misclassified are usually assigned to the upcoming or preceding wave.

An example of this confusion can be seen clearly between the second and third waves, where the yellow dots (individuals from the second wave that have been identified as third) can be identified within the second wave region. This can also be noticed, to a lesser extent, in the section of the first wave where a small cloud of green dots is located in the lower area (that is, real first one individual wrongly predicted by the classifier as second wave samples).

Moreover, Fig. 9 shows how the increase in failures is related to a higher uncertainty in the solution obtained by the classifier. It should be considered that since the classifier has four possible answers, if a decision is made with a score close to 0.25, it means that all four options are necessarily similar in terms of score level. In contrast, those answers obtaining greater certainty (i.e., with a confidence level close to unity), correspond to a higher number of correct classifications and almost no confusion. Hence, the algorithm seems to be well aware of its real performance.

It is worth mentioning that since the ground truth for each sample is based only on the date of their tests, which in the end is converted into a class membership (wave), the overlapping waves and the underlying causes that will be discussed later can easily be explained the fact that some (maybe many) of the individuals misclassified by ML could be in fact individuals misclassified to their assigned waves.

In fact, the algorithm precision tends to be higher at wave centers where the sample characterization is more solid. In contrast, those points that are located near the frontiers of the waves tend to be more conflicting (which can be easily seen, for example, in the intersection between the second and third waves in Fig. 8).

Discussion

Based on the results obtained in the classification, a deep reason to justify how a mathematical model, using only the detected number of cycles of each gene together with the number of cycles detected by the internal control as input data, could have such a high and accurate precision rate for each wave (as shown in the gray areas of Fig. 10) was sought.

Starting with the results corresponding to the first wave (Fig. 10a), the accuracy area (whose maximum reaches 80% at the peak of the wave) coincides temporally with the appearance of variant 20A in the Galician region. More specifically, the end of the wave also coincides temporally with the peak and decrease in active variant 20B cases (dashed line in the yellow area). This indicates that in addition to being characterized by variant 20A, this wave also picks up some features of variant 20B that cause a spike in the failure rate in October 2020 and again in July 2021.

In the case of the second wave (Fig. 10b), the tendency of the accuracy rate to follow the presence of specific variants still persists with the appearance of 20E (EU1) at the start of the second wave. Even so, this wave shows a higher error rate prior to its beginning and once it has ended since the 20B variant is present simultaneously with 20E (EU1) throughout the entire wave. Again, the figures show how the peaks in the error rate (red solid line) correspond approximately with spikes of 20B. Furthermore, the figures also show that from January 2021, the error rate decreases as the 20E (EU1) variant disappears.

The third wave (Fig. 10c), once again, its characterized by the presence of the 20E(EU1) variant together with 20I. This again causes a certain spike in the error rate due to the presence of these same variants during the rest of the waves. However, it is clearly observed that a higher accuracy corresponds to the coincidence of a higher percentage of cases of these variants simultaneously.

Finally, the fourth wave (Fig. 10d) is more clearly identified than the rest because the 21A variant is only present during the summer months of 2021. This means that the failure rate remains practically null until the arrival of this wave in April 2021.

Limitations

The main limitation of this study is that it is applied only to the data from a single lab and area. This is justified by the fact that the dataset required for training had to avoid any trace of data that could lead the algorithm to distinguish data from other data. In fact, as mentioned above, this reduced the size of our dataset since we had to discard many tests due to the use of a different set of target genes during a specific period of time. The purpose of this work is to show that a distinguishable signature on the Ct pattern seems to exist, but until proven using different labs, test conditions, etc., no further generalization should be made than the mere existence of this pattern. Until verified by sequenced samples in which the variant is determined, no generalization can be made about the capacity of this model to distinguish virus variants.

Assessment of the potential interest of the proven concept

The ML tool proposed in this work represents an additional tool that can improve the relevance of rRT-PCR results. The classical interpretation of a qualitative, Boolean result (positive or negative) can be completed with additional information regarding the estimation of probable virus variants (if recognized by an ML algorithm). This can be useful for many purposes, such as the following: pandemic control (quick detection of the arrival of new variants), as a screening tool for virus sequencing, as a quality check of the tests and/or reagents, etc.

In addition, this work is just the initial step toward a completely new methodology applied to rRT-PCR not only in the case of SARS-CoV-2 but also in many other diseases.

Conclusions

In this work, we found that an ML algorithm trained with a sufficiently rich database can efficiently identify the moment in the SARS-CoV-2 pandemic when an individual was infected based on a simple, standard, rRT-PCR test with three channels (E, N and RdRP/S). No additional information regarding gender, age, condition, etc. was required by the algorithm. The subjacent reason for the precision of the ML algorithm seems to be an underlying characteristic signature of the main SARS-CoV-2 variants. Only by collecting a sufficient amount of data of different variants, individuals, tests, laboratories, etc. can the concept presented here be proven directly and not through the wave clustering concept employed here.

The results of this work can be a first step toward a new accessible and inexpensive surveillance method for tracking and/or selecting candidate samples for sequencing. Even with its limitations, this method may help monitor the changes to the virus and extend the surveillance to areas where current systems are scarcely implemented and have contributed significantly to the expansion of VOCs.

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

References

Dong, E., Du, H. & Gardner, L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Infect. Dis. 20, 533–534 (2020).
Article CAS PubMed PubMed Central Google Scholar
Cacciapaglia, G., Cot, C. & Sannino, F. Multiwave pandemic dynamics explained: How to tame the next wave of infectious diseases. Sci. Rep. 11, 6638 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Jefferson, T., Heneghan, C. Covid 19—Epidemic ‘waves’. The centre for evidence-based medicine (2020).
Zhang, S. X., Arroyo Marioli, F., Gao, R. & Wang, S. A second wave? What do people mean by COVID waves?—A working definition of epidemic waves. Risk Manag. Healthc. Policy 14, 3775–3782 (2021).
Article PubMed PubMed Central Google Scholar
World Health O. Laboratory Testing of 2019 Novel Coronavirus (2019-nCoV) in Suspected Human Cases: Interim Guidance (World Health Organization, 2020).
Rao, S. N., Manissero, D., Steele, V. R. & Pareja, J. A Systematic Review of the Clinical Utility of Cycle Threshold Values in the Context of COVID-19.
Shah, V. P. et al. Association between SARS-CoV-2 cycle threshold values and clinical outcomes in patients with COVID-19: A systematic review and meta-analysis. Open Forum Infect. Dis. https://doi.org/10.1093/ofid/ofab453 (2021).
Article PubMed PubMed Central Google Scholar
Ziegler, K. et al. SARS-CoV-2 samples may escape detection because of a single point mutation in the N gene. Euro Surveill. 25, 2001650 (2020).
Article PubMed PubMed Central Google Scholar
Hasan, M. R. et al. A novel point mutation in the N gene of SARS-CoV-2 may affect the detection of the virus by reverse transcription-quantitative PCR. J. Clin. Microbiol. 59, e03278-e3220 (2021).
Article CAS PubMed PubMed Central Google Scholar
Artesi, M. et al. A recurrent mutation at position 26340 of SARS-CoV-2 is associated with failure of the E gene quantitative reverse transcription-PCR utilized in a commercial dual-target diagnostic assay. J. Clin. Microbiol. 58, e01598-e1520 (2020).
Article CAS PubMed PubMed Central Google Scholar
Borges, V. et al. Tracking SARS-CoV-2 lineage B.1.1.7 dissemination: Insights from nationwide spike gene target failure (SGTF) and spike gene late detection (SGTL) data, Portugal, week 49 2020 to week 3 2021. Euro Surveill. 26, 2100131 (2021).
Article CAS PubMed PubMed Central Google Scholar
Brown, K. A. et al. S-gene target failure as a marker of variant B.1.1.7 among SARS-CoV-2 isolates in the greater Toronto Area, December 2020 to March 2021. JAMA 325, 2115–2116 (2021).
Article CAS PubMed PubMed Central Google Scholar
Washington, N. L., White, S., Barrett, K. M. S., Cirulli, E. T., Bolze, A. & Lu, J. T. S gene dropout patterns in SARS-CoV-2 tests suggest spread of the H69del/V70del mutation in the US. medRxiv. https://doi.org/10.1101/2020.12.24.20248814v1 (2020).
Wollschläger, P. et al. SARS-CoV-2 N gene dropout and N gene Ct value shift as indicator for the presence of B.1.1.7 lineage in a commercial multiplex PCR assay. Clin. Microbiol. Infect. 27, 1353 (2021).
Article PubMed Central Google Scholar
World Health, O. Tracking SARS-CoV-2 Variants. Available in: https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/ (2021).
Control ECfDPa. SARS-CoV-2 Variants of Concern as of 12 November 2021. Available in: https://www.ecdc.europa.eu/en/covid-19/variants-concern (2021).
Prevention CfDCa. SARS-CoV-2 Variant Classifications and Definitions. Available in: https://www.cdc.gov/coronavirus/2019-ncov/variants/variant-info.html (2021).
Tran, N. K. et al. Novel application of automated machine learning with MALDI-TOF-MS for rapid high-throughput screening of COVID-19: A proof of concept. Sci. Rep. 11, 8219 (2021).
Article CAS PubMed PubMed Central Google Scholar
Gangloff, C., Rafi, S., Bouzillé, G., Soulat, L. & Cuggia, M. Machine learning is the key to diagnose COVID-19: A proof-of-concept study. Sci. Rep. 11, 7166 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Zargari Khuzani, A., Heidari, M. & Shariati, S. A. COVID-classifier: An automated machine learning model to assist in the diagnosis of COVID-19 infection in chest X-ray images. Sci. Rep. 11, 9887–9887 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Shinde, S. V. & Mane, D. T. Deep learning for COVID-19: COVID-19 Detection based on chest X-ray images by the fusion of deep learning and machine learning techniques. In Understanding COVID-19: The Role of Computational Intelligence (eds Nayak, J. et al.) (Springer, 2022).
Google Scholar
Du, R. et al. Machine learning application for the prediction of SARS-CoV-2 infection using blood tests and chest radiograph. Sci. Rep. 11, 14250 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Kukar, M. et al. COVID-19 diagnosis by routine blood tests using machine learning. Sci. Rep. 11, 10738 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Ramanathan, S. & Ramasundaram, M. Accurate computation: COVID-19 rRT-PCR positive test dataset using stages classification through textual big data mining with machine learning. J. Supercomput. 77, 7074–7088 (2021).
Article PubMed PubMed Central Google Scholar
Fernandes, F. T. et al. A multipurpose machine learning approach to predict COVID-19 negative prognosis in São Paulo, Brazil. Sci. Rep. 11, 3343 (2021).
Article ADS CAS PubMed PubMed Central Google Scholar
Papoutsoglou, G. et al. Automated machine learning optimizes and accelerates predictive modeling from COVID-19 high throughput datasets. Sci. Rep. 11, 15107 (2021).
Article CAS PubMed PubMed Central Google Scholar
Langer, T. et al. Development of machine learning models to predict RT-PCR results for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in patients with influenza-like symptoms using only basic clinical data. Scand. J. Trauma Resusc. Emerg. Med. 28, 113 (2020).
Article PubMed PubMed Central Google Scholar
Cabrera, J. J., et al. Pooling for SARS-COV-2 control in care institutions. medRxiv. 2020.2005.2030.20108597 (2020).
Mishra, B. et al. High proportion of low cycle threshold value as an early indicator of COVID-19 surge. J. Med. Virol. 94, 240–245 (2022).
Article CAS PubMed Google Scholar

Download references

Acknowledgements

The authors would like to thank all the professionals who collaborated in information management, sampling, supplies, contact tracing and organization at all levels during pandemic. Especially to the technician staff of the Microbiology Department of the CHUVI. The work of Ana Larrañaga has been supported by the 2020 predoctoral grant of the University of Vigo.

Author information

Authors and Affiliations

Microbiology and Infectology Research Group, Galicia sur Health Research Institute (IIS Galicia Sur), SERGAS-UVIGO, Vigo, Spain
Jorge Cabrera Alvargonzález, Sonia Pérez Castro, Lucía Martínez Lamas, Carlos Daviña Nuñez & Benito Regueiro García
Microbiology Department, Complexo Hospitalario Universitario de Vigo (CHUVI), Sergas, Vigo, Spain
Jorge Cabrera Alvargonzález, Sonia Pérez Castro, Lucía Martínez Lamas & Benito Regueiro García
Universidade de Vigo, Vigo, Spain
Jorge Cabrera Alvargonzález & Sonia Pérez Castro
CINTECX, GTE, Universidade de Vigo, 36310, Vigo, Spain
Ana Larrañaga Janeiro & Jacobo Porteiro Fresco
Applied Mathematics I, Telecommunications Engineering School, Universidad de Vigo, 36310, Vigo, Spain
Javier Martínez Torres
Department of Preventive Medicine and Public Health, Álvaro Cunqueiro Hospital, Vigo, Pontevedra, Spain
Víctor Del Campo-Pérez
Dirección Xeral de Saúde Pública, Consellería de Sanidade, Xunta de Galicia, Santiago de Compostela, A Coruña, Spain
Silvia Suarez Luque
Microbiology and Parasitology Department, Medicine and Odontology, Universidade de Santiago, Santiago de Compostela, Spain
Benito Regueiro García

Authors

Jorge Cabrera Alvargonzález
View author publications
You can also search for this author in PubMed Google Scholar
Ana Larrañaga Janeiro
View author publications
You can also search for this author in PubMed Google Scholar
Sonia Pérez Castro
View author publications
You can also search for this author in PubMed Google Scholar
Javier Martínez Torres
View author publications
You can also search for this author in PubMed Google Scholar
Lucía Martínez Lamas
View author publications
You can also search for this author in PubMed Google Scholar
Carlos Daviña Nuñez
View author publications
You can also search for this author in PubMed Google Scholar
Víctor Del Campo-Pérez
View author publications
You can also search for this author in PubMed Google Scholar
Silvia Suarez Luque
View author publications
You can also search for this author in PubMed Google Scholar
Benito Regueiro García
View author publications
You can also search for this author in PubMed Google Scholar
Jacobo Porteiro Fresco
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

J.C., A.L., J.M., and J.P. conceived the idea, designed the methodology and discussed the concept with B.R., J.C., S.P., L.M., C.D., V.D.C., S.S. and B.R. participated in laboratory results generation and collected the data. J.C., A.L., J.M., and J.P. analyzed the data and wrote draft of the paper. All authors reviewed and edited the manuscript. All authors read and approved the final version of the paper. BR and JP supervised the work.

Corresponding author

Correspondence to Jacobo Porteiro Fresco.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Cabrera Alvargonzález, J., Larrañaga Janeiro, A., Pérez Castro, S. et al. Proof of concept of the potential of a machine learning algorithm to extract new information from conventional SARS-CoV-2 rRT-PCR results. Sci Rep 13, 7786 (2023). https://doi.org/10.1038/s41598-023-34882-6

Download citation

Received: 20 December 2021
Accepted: 09 May 2023
Published: 13 May 2023
DOI: https://doi.org/10.1038/s41598-023-34882-6

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.