The application of a deep learning system developed to reduce the time for RT-PCR in COVID-19 detection

Reducing the time to diagnose COVID-19 helps to manage insufficient isolation-bed resources and adequately accommodate critically ill patients. There is currently no alternative method to real-time reverse transcriptase polymerase chain reaction (RT-PCR), which requires 40 cycles to diagnose COVID-19. We propose a deep learning (DL) model to improve the speed of COVID-19 RT-PCR diagnosis. We developed and tested a DL model using the long short-term memory method with a dataset of fluorescence values measured in each cycle of 5810 RT-PCR tests. Among the DL models developed here, the diagnostic performance of the 21st model showed an area under the receiver operating characteristic (AUROC), sensitivity, and specificity of 84.55%, 93.33%, and 75.72%, respectively. The diagnostic performance of the 24th model showed an AUROC, sensitivity, and specificity of 91.27%, 90.00%, and 92.54%, respectively.

Development of the DL model. The RT-PCR results (positive or negative) were used as the output variable to train the models. A total of 40 models were developed and validated, from the model trained with the fluorescence value of the first RT-PCR cycle to the model trained from the fluorescence value of all 40 RT-PCR cycles.
For example, the first model was trained with the fluorescence value of the first RT-PCR cycle, and the second model was trained with the fluorescence values from the first to second RT-PCR cycles. In the same way, the 39th model was trained with the fluorescence values from the first to the 39th RT-PCR cycle, and the 40th model was trained with the fluorescence values from the first to the 40th RT-PCR cycle.
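The cumulative input scheme described above can be sketched as follows. This is only an illustration under stated assumptions: the array `curves` is synthetic random data standing in for the fluorescence readings, and the helper name `prefix_inputs` and the (samples, timesteps, features) layout are hypothetical choices, not details taken from the study.

```python
import numpy as np

def prefix_inputs(curves: np.ndarray, n_cycles: int) -> np.ndarray:
    """Return cycles 1..n_cycles of every curve, shaped (samples, timesteps, 1)."""
    if not 1 <= n_cycles <= curves.shape[1]:
        raise ValueError("n_cycles must lie between 1 and the number of cycles")
    return curves[:, :n_cycles, np.newaxis]

# Synthetic stand-in for the fluorescence readings: 8 tests x 40 cycles.
rng = np.random.default_rng(0)
curves = rng.random((8, 40))

# One input array per model: the k-th model only sees cycles 1..k.
inputs_by_model = {k: prefix_inputs(curves, k) for k in range(1, 41)}

print(inputs_by_model[1].shape, inputs_by_model[24].shape, inputs_by_model[40].shape)
```

Each of the 40 arrays would feed a separate model, so the k-th model never sees fluorescence values measured after cycle k.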
The raw RT-PCR test data were obtained from the first cycle to the 40th cycle according to the passage of time. In other words, the raw data were collected as a time series. RT-PCR diagnosis is based on the cycle at which the fluorescence value, measured at every cycle, reaches a threshold value.
Thus, for the model development in this study, we applied the long short-term memory (LSTM) method, a recurrent neural network (RNN) variant designed to address the vanishing gradient problem that conventional RNNs encounter with time series data.
Since the fluorescence values derived in the RT-PCR process have the characteristics of time series data, we developed a total of 40 DL models using LSTM (Fig. 1). All deep learning analyses were performed using Python.
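To illustrate how an LSTM processes a fluorescence curve cycle by cycle, the following is a minimal NumPy sketch of the forward pass of a single LSTM layer followed by a logistic output. It is not the architecture used in the study, which is not specified in this section: the hidden size, the random untrained weights, and the names `lstm_positive_probability` and `random_params` are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_positive_probability(curve, params):
    """Run one curve through a single LSTM layer and a logistic output unit."""
    W, U, b, w_out, b_out = params  # W: (4H, 1), U: (4H, H), b: (4H,), w_out: (H,)
    hidden = U.shape[1]
    h = np.zeros(hidden)  # hidden state
    c = np.zeros(hidden)  # cell state
    for x_t in curve:  # one fluorescence value per RT-PCR cycle
        z = W @ np.array([x_t]) + U @ h + b
        i, f, o, g = np.split(z, 4)  # input, forget, output gates and candidate
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        # The additive cell update is what mitigates vanishing gradients.
        c = f * c + i * g
        h = o * np.tanh(c)
    return float(sigmoid(w_out @ h + b_out))

def random_params(hidden, rng):
    """Hypothetical untrained weights, for demonstration only."""
    return (
        rng.normal(0.0, 0.5, (4 * hidden, 1)),
        rng.normal(0.0, 0.5, (4 * hidden, hidden)),
        np.zeros(4 * hidden),
        rng.normal(0.0, 0.5, hidden),
        0.0,
    )

rng = np.random.default_rng(1)
params = random_params(8, rng)
curve_24_cycles = rng.random(24)  # synthetic input for a "24th model"
print(lstm_positive_probability(curve_24_cycles, params))
```

Because the loop simply iterates over whatever cycles it is given, the same function accepts inputs of 1 to 40 cycles; in practice the weights would be learned from labelled curves with a DL framework rather than drawn at random.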
Training and test datasets. The results of the RT-PCR virology test were used as the reference to train the models. Of the 5810 patients' data included in the study, 181 had positive RT-PCR results, while 5629 had negative results. These data were divided into two datasets for training and testing. The data for training and validation were composed of curves of RT-PCR results of 91 positive cases and 2814 negative cases. The data of 90 positive and 2815 negative cases were used for testing (Fig. 2).
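A class-wise split with these counts can be sketched as below. The section does not state how cases were assigned to each dataset, so the random per-class shuffling, the seed, and the function name `stratified_split` are assumptions; only the case counts come from the text.

```python
import numpy as np

def stratified_split(labels, n_train_pos, n_train_neg, seed=0):
    """Shuffle each class separately, then cut it at the requested training size."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    pos = rng.permutation(np.flatnonzero(labels == 1))
    neg = rng.permutation(np.flatnonzero(labels == 0))
    train = np.concatenate([pos[:n_train_pos], neg[:n_train_neg]])
    test = np.concatenate([pos[n_train_pos:], neg[n_train_neg:]])
    return np.sort(train), np.sort(test)

# 181 positive and 5629 negative RT-PCR results, as reported above.
labels = np.array([1] * 181 + [0] * 5629)
train_idx, test_idx = stratified_split(labels, n_train_pos=91, n_train_neg=2814)

print(len(train_idx), int(labels[train_idx].sum()))  # 2905 91
print(len(test_idx), int(labels[test_idx].sum()))    # 2905 90
```

Splitting within each class keeps the proportion of positive cases nearly identical in the two datasets, which matters when positives make up only about 3% of all tests.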
Outcomes. Primary outcomes were the sensitivity, specificity and area under the receiver operating characteristic curve (AUROC) of each model. As secondary outcomes, we identified an optimal model using the positive predictive value (PPV), negative predictive value (NPV) and accuracy of each model, calculated according to the prevalence in several countries: the United States, Italy, and South Korea. The prevalence data for each country were referenced from the "COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE)" at Johns Hopkins University 11 . The prevalence was based on values measured for each country between June and July 2021. To visualize diagnostic performance, the prevalence-dependent PPV, NPV, and accuracy values of each model were plotted on a triangular radar chart, and the models were compared by the ratio of the triangle area covered by each model to the total triangle area of the radar chart (Figs. 4, 5, 6).
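The area ratio of a three-axis radar chart can be computed as in the sketch below. It assumes that the three axes are spaced 120 degrees apart, that each axis runs from 0 to 1, and that the three metrics carry equal weight; the function name `radar_area_ratio` is hypothetical, and the exact procedure used in the study may differ.

```python
import math

def radar_area_ratio(ppv, npv, accuracy):
    """Area of the triangle spanned by the three values divided by the full triangle."""
    values = (ppv, npv, accuracy)
    # Two neighbouring axes enclose a sub-triangle of area 0.5 * r_a * r_b * sin(120 deg).
    k = 0.5 * math.sin(2.0 * math.pi / 3.0)
    covered = k * sum(values[j] * values[(j + 1) % 3] for j in range(3))
    full = k * 3.0  # all three values equal to 1
    return covered / full

print(radar_area_ratio(1.0, 1.0, 1.0))
print(radar_area_ratio(0.0643, 0.9990, 0.9672))  # illustrative low-prevalence values
```

Under these assumptions the ratio reduces to (PPV x NPV + NPV x accuracy + accuracy x PPV) / 3. With a PPV of 6.43%, an accuracy of 96.72%, and an NPV anywhere between 99.72% and 100%, it evaluates to about 36.4%, so a very low PPV caps the ratio near one third even when NPV and accuracy are high.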
Statistical analysis. All statistical analyses were performed using SPSS software V.26.0 (IBM, SPSS, Inc., Chicago, IL, United States). Sensitivity (the proportion of true positives) and specificity (the proportion of true negatives) were calculated in comparison with the positivity or negativity of RT-PCR results. We calculated the false-positive and false-negative rates using the confusion matrix and calculated the PPVs and NPVs of each model using the COVID-19 prevalence for three countries: the United States, Italy and South Korea.
Table 2). In Italy, which showed a prevalence of 6.98%, the model with the highest PPV was the 25th model, at 65.02% (95% CI 59.68% to 69.97%), and the model with the highest NPV was the 33rd model, at 100% (95% CI N/A). The accuracy was the highest, at 95.68% (95% CI 94.88% to 96.39%), in the 35th model. The NPV of the 25th model with the highest PPV was 98.47% (95% CI 97.71% to 98.98%), and the accuracy was 95.60% (95% CI 94.79% to 96.31%). The PPV of the 33rd model with the highest NPV was 55.57% (95% CI 51.92% to 59.13%), and the accuracy was 94.42% (95% CI 93.52% to 95.22%) (Table 3).
In South Korea, which showed a prevalence of 0.27%, the model with the lowest PPV was the 1st model, at 0.23% (95% CI 0.10% to 0.49%), and the highest PPV was that of the 25th model, at 6.43% (95% CI 5.07% to 7.76%). The model with the lowest NPV was the 1st model, at 99.72% (95% CI 99.71% to 99.74%), and the highest NPV was that of the 33rd model, at 100% (95% CI N/A). Accuracy was the lowest, at 43.52% (95% CI 41.70% to 45.34%), in the 3rd model and the highest, at 96.72% (95% CI 96.01% to 97.34%), in the 25th model. The 25th model was the model with the largest proportion of area occupied by the radar chart, at 36.44% (95% CI 35.29% to 37.54%) (Fig. 6).
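The dependence of PPV, NPV and accuracy on prevalence follows from Bayes' theorem and can be sketched as below. The function name `predictive_values` is hypothetical, and the example inputs are the sensitivity and specificity reported for the 24th model together with the three national prevalence values; the sketch illustrates the standard formulas and is not a reproduction of the SPSS analysis or of any table in the study.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Prevalence-adjusted PPV, NPV and accuracy from sensitivity and specificity."""
    true_pos = sensitivity * prevalence
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return {
        "ppv": true_pos / (true_pos + false_pos),
        "npv": true_neg / (true_neg + false_neg),
        "accuracy": true_pos + true_neg,
    }

# Sensitivity and specificity of the 24th model; prevalence per country.
prevalence_by_country = {"United States": 0.1006, "Italy": 0.0698, "South Korea": 0.0027}
for country, prevalence in prevalence_by_country.items():
    result = predictive_values(0.9000, 0.9254, prevalence)
    print(country, {name: round(value, 4) for name, value in result.items()})
```

At a fixed sensitivity and specificity, a lower prevalence drives the PPV down and the NPV up, which is the pattern seen across the three countries.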

Discussion
In this study, we developed a total of 40 DL models to reduce the time required for the diagnosis of COVID-19 using RT-PCR as much as possible and compared the diagnostic and screening performance of each model. In a previous meta-analysis, Kim et al. 12 determined that the pooled sensitivity of RT-PCR was 89%, and the PPVs and NPVs, affected by the prevalence, were 47.3% to 98.3% and 93.4% to 99.9%, respectively. We used the pooled sensitivity of the RT-PCR test investigated by Kim et al. as a reference value against which to compare the performance of each model obtained in this study.
Considering a pooled RT-PCR sensitivity of 89% as a sensitivity reference value 12 , the sensitivity of the 21st model exceeded this standard, at 93.33% (95% CI 86.05% to 97.51%). In addition, considering the approximate trend of diagnostic performance of all models, the 24th model, with a sensitivity of 90% (95% CI 81.86% to 95.32%), showed a tendency to exceed the sensitivity reference value (Fig. 3). In view of these results, using a Ct value of 36 rather than the time taken by 40 cycles for RT-PCR diagnosis, it can be inferred that a meaningful time reduction may be possible through the development of this DL model. Furthermore, the sensitivity reference value was exceeded or showed a similar level from the 3rd model to the 9th model and in the 11th, 16th, and 18th models (Supplementary Table 1). However, the specificities of these models were generally lower than 80%, so it was difficult to judge whether the model was appropriate based on the diagnostic performance.
In the case of the PPV in this study, in the United States, where the prevalence was 10.06%, the 25th model showed the highest PPV at 73.33%. Similarly, in Italy, with a prevalence of 6.98%, and South Korea, with a prevalence of 0.27%, the PPV was highest in the same model as that in the United States, at 65.02% and 6.43%, respectively (Tables 2, 3, 4). However, according to the study results of Kim et al. 12 , in the United States, with a prevalence of 17.7% in March-April 2020; Germany, with a prevalence of 5.7%; and Taiwan, with a prevalence of 1%; the PPVs of RT-PCR itself were 95%, 84.3% and 47.3%, respectively. Although the prevalence did not match between the two studies and the timing at which the prevalence was measured was different, considering the range of prevalence levels, it can be inferred that the positive screening performance of the model developed in this study is somewhat inferior to that of RT-PCR.
On the other hand, in the case of negative screening performance, which is affected by the prevalence, in the United States, where the prevalence is 10.06%, the 20th model showed an NPV of 96.34% (95% CI 95.89% to 98.33%), and in Italy (prevalence 6.98%) and South Korea (prevalence 0.27%), the NPVs were 98.21% (95% CI 97.18% to 98.86%) and 99.21% (95% CI 99.90% to 99.96%) in the same model, respectively. These findings show that the negative screening performance of the model developed using fluorescence values up to 20 cycles, which is half of the 40 cycles, is very good (Tables 2, 3, 4).
Furthermore, in research reported by Kim et al. 12 , the PPV and NPV of RT-PCR showed a distribution of 47.3% to 98.3% and 93.4% to 99.9%, respectively, according to the national prevalence (prevalence range of 1% to 39% from March to April 2020). The negative screening performance of the models developed in this study can be considered at a similar level to that of RT-PCR. Although the statistical significance cannot be compared, this result shows that the model trained only with raw data up to 20 cycles differs little from the negative screening performance of RT-PCR itself, for which all 40 cycles were evaluated.
In this study, we created a radar chart for each model using PPV, NPV and accuracy, which were affected by prevalence, representing screening performance (Figs. 4, 5, 6). Then, the screening performance of each model was expressed as the ratio of the area covered by each model to the total area of the radar chart as a percentage, and the area ratio of each model was entered into a radar chart. This chart confirmed that the model with the largest area ratio was the 25th model when considering the PPV, NPV and accuracy. We propose that it would be reasonable to present the 25th model as a model with minimal bias in negative screening performance, positive screening performance and accuracy based on these results.
To the best of our knowledge, no study has reduced the time required to diagnose based on RT-PCR by developing a model trained with raw RT-PCR data and confirming its diagnostic performance. In addition, since the start of the COVID-19 pandemic, no similar research design has been reported in papers that reviewed the performance of various artificial intelligence or deep learning models for diagnosing COVID-19 until recently 13 . Although there was a single study that used RT-PCR curves to build an AI model such as a convolutional neural network (CNN) to reduce false-positive diagnoses, the study was not related to shortening the time for diagnosis and used graph images, differentiating it from our study 14 .
In addition, recently published AI- and DL-related COVID-19 diagnostic studies presented models trained on CT images or CXR images using various CNN methods. Other studies on the diagnosis of COVID-19 have reported on models trained with blood test results or clinical information. First, in the studies that reported the performance of models trained based on CNNs using chest CT images, the sensitivity ranged from 77 to 90%, the specificity ranged from 68 to 96.6%, and the AUROC ranged from 0.85 to 0.97 [1][2][3][15][16][17][18][19][20] . Second, in studies that reported the performance of models trained on CNNs using chest CXR images, the sensitivity ranged from 78 to 97%, the specificity ranged from 72.6 to 99.17%, and the AUROC ranged from 0.77 to 0.92 [4][5][6][7][21][22][23] . Third, there have been studies evaluating the diagnostic performance for COVID-19 of models trained with blood tests or clinical information. In these studies, the sensitivity ranged from 66 to 93%, the specificity ranged from 64 to 97.9%, and the AUROC ranged from 0.86 to 0.979 [24][25][26] . Considering the diagnostic performance of the various models presented in these references, the diagnostic performance of the model developed in this study appears to be sufficiently high.
What is needed in the clinical field is to increase the efficiency of hospital bed resource management through rapid isolation, rapid diagnosis, and rapid and safe release from isolation. From that perspective, the above studies suggest that COVID-19 diagnosis may be possible through the application of AI. Nevertheless, the models presented in the existing references have lower clinical relevance when considering the realistic clinical conditions due to the following problems.
Due to the imbalance and bias of the data selected for use in training, we question whether this approach can be safely used in clinical settings for the diagnosis of COVID-19. On these issues, Laghi agrees that efforts to diagnose COVID-19 through AI models are necessary. However, he noted that it seems very risky to trust the diagnostic performance of the AI models presented in these studies and use it in clinical settings because imaging tests such as CXR or chest CT at the early stage of COVID-19 infection can show normal findings 27 . The model developed in the present study is not trained from imaging tests such as CXR or chest CT, blood test results, or clinical information, as in previous studies. In this study, a model was trained with LSTM, a DL method suited to time series data, using raw data from 1 to 40 cycles of RT-PCR. Thus, there is potential for early diagnosis via RT-PCR using the DL model developed in this study.
In this study, the sensitivity of the 21st model was the first to exceed the sensitivity reference value, and both the sensitivity and the specificity of the 24th model reached at least 90% (Table 1, Fig. 3). Considering the time it takes to reach a diagnosis with RT-PCR, the diagnostic performance of the model developed in this study shows the possibility of reducing the time required for RT-PCR diagnosis by almost half.
In addition, the model developed in this study showed that the PPV had somewhat lower positive screening performance than RT-PCR; however, the NPV showed negative screening performance similar to that of RT-PCR (Tables 2, 3, 4). Considering this excellent negative screening performance, if various information, such as the patient's clinical characteristics, blood test results, and imaging information, such as CXR or chest CT results, are combined with this DL model, it can be assumed that the diagnostic performance for early diagnosis will be improved. We can infer that employing this model has the potential to contribute to improving the efficiency of in-hospital bed resource management for patients with fever or screening symptoms.
This study has several limitations. First, the dataset contained far fewer positive cases (181) than negative cases (5629). This data imbalance can affect the diagnostic performance of the developed DL models, and in the end, it is difficult to apply the DL model universally. However, through this study, we were able to confirm that the diagnostic performance was not significantly impaired by not performing all 40 cycles of PCR.
Second, no DL method other than LSTM that can be trained using time series data was applied. As a result, it is not known whether LSTM is the best method because comparative analysis with models that can be developed through other DL methods has not been performed. Nevertheless, LSTM is an RNN-based method that was selected first in this study because it was developed to solve the vanishing gradient problem of existing RNNs 28 . A follow-up study should collect additional data and perform a comparative analysis with other DL methods applicable to time series data.
Third, a range of evaluation metrics were not used in this study. As described in the second limitation, the proposed model could not be compared with models developed through other methods. We acknowledge that it is difficult to apply evaluation methods other than showing the level of diagnostic performance of the model with this study design. Understanding this limitation, we paid attention to the difference in prevalence by country, investigated the screening performance of the model for each representative country, and presented the results.
Fourth, the method of presenting the screening performance of the model as the ratio of the area of the radar chart is not generally employed. The area of the triangle is calculated assuming the PPV, NPV, and accuracy to have a 1:1:1 weight ratio. Therefore, if this weight ratio is set differently, that is, if the three weights are set differently according to need (such as accuracy being more important, etc.), the calculated area and the ratio may be different. Nevertheless, as the PPV, NPV and accuracy all have high values, it is natural that the screening power is high. We believe that the ratio of the area of the radar chart does not perfectly reflect the screening power of the DL model; however, it does help to explain the approximate trend.

Conclusion
Through the test results of the DL models developed in this study, we confirmed the possibility of shortening the diagnosis time of RT-PCR without impairing its diagnostic performance. This reduction in time to diagnosis is expected to be of great help in managing insufficient bed resources in the clinical field.

Data availability
All data generated or analysed during this study are included in this published article (and its supplementary information files).