Background & Summary

The first cases of coronavirus disease 2019 (COVID-19), caused by SARS-CoV-2, were detected in China in December 2019, and the virus then spread rapidly to other countries from around January-February 20201. From the first months of viral dissemination, SARS-CoV-2 clusters were observed, in particular in occupational environments of several essential sectors such as health care centres2,3 or food processing plants4,5,6,7. Transmission of coronaviruses among humans has been reported to occur through aerosols (inhalation of aerosolized or falling contaminated droplets) or through contact (hands, objects or surfaces)8,9. Infected persons (symptomatic or asymptomatic) can emit contaminated droplets that may remain airborne, be directly inhaled or settle on surfaces, and subsequently infect other persons. However, how the infectious viral load evolves in contaminated droplets liable to be inhaled and to cause disease is not well known, either in aerosols or on surfaces. Furthermore, the persistence of SARS-CoV-2 depends on environmental conditions, which differ widely from one occupational setting to another10,11. Factors such as airflow, ventilation, temperature and relative humidity modify the probability of SARS-CoV-2 transmission through respiratory pathways, since they affect droplet movement and virus survival, notably through droplet desiccation12. Some studies have reported the influence of a few temperature and/or relative humidity conditions: for example, the virus remains infectious for longer periods at lower temperatures and very high relative humidity10, and metal surfaces such as stainless steel may allow the virus to remain infectious longer than other materials under some specific temperature-humidity conditions13,14.

Studies on coronavirus persistence have generally been conducted as laboratory experiments monitoring virus kinetics under controlled conditions. The reduction of virus infectivity over time can then be evaluated by fitting mathematical inactivation models to the experimental data. Studying the effects of the environmental conditions encountered in food premises requires collecting kinetics data as exhaustively as possible, so as to cover wide ranges of values for each condition: inert or food surfaces, temperature, relative humidity, experimental quantification method, virus strain, etc. To our knowledge, such an exhaustive kinetics dataset is not yet available in the literature. It should also be noted that gathering data across different studies is not straightforward: kinetics are frequently plotted in research articles, but the raw data are often not provided as quantitative (numerical) values that can be re-used in further studies.

Thus, the main goal of this work was to compile and make available a large dataset of quantitative kinetics for SARS-CoV-2 and other coronaviruses under different conditions, useful for assessing and modelling their persistence in food processing environments.

Methods

The overall procedure for collecting and pre-processing literature data is illustrated in Fig. 1. Firstly, a literature review was carried out to identify relevant publications presenting kinetics data. This search was based on queries of scientific bibliographic databases in accordance with the PRISMA guidelines15 (Step 1). The second step consisted of converting raw data from scientific publications (texts, tables or figures) into a ready-to-use numerical dataset, with manual collection for texts and tables or semi-automated collection after digitalization of figures (Step 2). A primary inactivation model was then fitted to each kinetics to estimate the viral infectivity reduction parameter and its uncertainty (Step 3). Finally, a quality-ranking step (Step 4) was performed to evaluate the quality of the data collected for each kinetics. The quantitative dataset and the different tools used in this procedure are freely available and detailed in the following sub-sections.

Fig. 1
figure 1

Schematic overview of the data collection workflow.

Scoping review

The scoping review is part of a scientific project describing the persistence of coronaviruses in food production environments. Firstly, a query process was carried out to identify relevant records, using weekly advanced searches on several topics associated with SARS-CoV-2. This weekly literature search was conducted between March 2020 and 25th August 2021 using a combination of keywords related to the main theme (1) “SARS-CoV-2 and coronavirus”, joined by the logical connector AND to one of the following themes: (2a) “Human and food”, (2b) “Water” or (2c) “Environmental persistence”. The keywords used for each theme are specified in Table 1. Studies were collected weekly from two bibliographic search engines, PubMed and Scopus (from March 2020 to August 2021), and by queries on Frontiers (from November 2020 to August 2021). A date restriction was applied: only publications from the 1st of January 2020 onwards were collected. The search was limited to publications with abstracts written in English. The last query performed with the above-mentioned criteria, on the 25th of August 2021, identified 14,267 references overall, which were exported to EndNote software after duplicate removal. From this corpus, a thematic filter on “Persistence” was built with the “from group” tool in EndNote. This filter selected articles with “persisten*”, “survival” or “stability” in the field “Title–Abstract–Keywords”, joined by the logical connector AND to “environment*”. This search resulted in 418 references. These records were then filtered in accordance with the PRISMA Statement guidelines15 (Fig. 2), using inclusion/exclusion criteria based on title, abstract and, when needed, full text. The inclusion criteria were (1) studies on persistence on materials, surfaces or aerosols or (2) studies on working environments. The exclusion criteria were (1) studies on therapeutic or vaccine development, (2) studies on untreated wastewater, (3) studies on diet or nutrition, (4) languages other than English and French, and (5) full text not available. A first screening step identified 82 references whose full papers were read to determine whether they included fully documented persistence kinetics data (either in tables or in figures) or could be used to complete the identification of relevant publications (e.g. in the case of reviews). All the above screening and completion stages identified a final total of 65 studies from which available raw kinetics data could be extracted from tables and/or figures.

Table 1 Keywords used in the four themes for weekly bibliographic database queries.
Fig. 2
figure 2

Flow chart outlining the procedure for quantitative data collection from the literature based on the PRISMA guidelines statement and preliminary studies.

Information associated with kinetics data

The second step consisted of converting raw data from scientific publications into a ready-to-use numerical dataset by extracting data from texts and tables and/or by using figure digitalization techniques. This step provided several “kinetics”, i.e. the monitoring of viral titer at different time points under given conditions. Kinetics corresponding to viral genome quantification (e.g. by RT-qPCR techniques) were excluded since such quantification does not reflect viral infectivity. In total, 464 kinetics were available from the 65 identified studies11,13,14,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77 (Tables 2 and 3). It is worth noting that kinetics collected in these studies for coronaviruses other than SARS-CoV-2 were also retained for analysis.

Table 2 Overview of the persistence kinetics of SARS-CoV-2 on different matrices and the associated environmental conditions.
Table 3 Overview of the persistence kinetics of coronaviruses (Alphacoronavirus and Betacoronavirus other than SARS-CoV-2) on different matrices and the associated environmental conditions.

Information associated with the above kinetics was gathered in an Excel spreadsheet (more details in the Data records section). Each identified kinetics was assigned a unique kinetics key. The virus Strain, Species, Subgenus and Genus were indicated for each kinetics, as well as other conditions such as the nature of the material (stainless steel, plastic, paper, …), the medium (liquid media, aerosol, porous and non-porous surfaces), temperature and relative humidity (expressed as a range and/or a median value), pH (if available), etc. Other information was also recorded for each kinetics, such as the initial virus load, the cell type used for infectious viral titration (Vero, etc.) or the number of experimental replicates. Table 2 provides an overview of the kinetics extracted from SARS-CoV-2 studies, and Table 3 of those extracted from studies on other coronaviruses.

Data extraction

The persistence kinetics data were taken from texts, tables or figures of the research papers identified above. Data from texts or tables were entered manually into the database. Raw data from figures were extracted using the R package metaDigitise78. This tool makes it possible to process many figures simultaneously and ensures reproducibility (e.g. correcting the digitalized data, sharing digitalizations). For reproducibility purposes, all files associated with the raw digitalizations of our study are provided and detailed in the Data records section. The monitored time points extracted from all studies (from tables or from figure digitalization) were converted into hours in order to allow comparisons between studies.
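For illustration, the figure extraction step could be reproduced along the following lines; this is a sketch and not the authors' exact script, and the directory name is hypothetical (the 'caldat' sub-directory is created by the package itself).

```r
# Minimal sketch of extracting raw points from figures with metaDigitise
library(metaDigitise)

# Interactive session: for each image, calibrate the axes by clicking reference
# points, then click the data points; summary = FALSE returns the raw extracted
# points rather than group summaries.
raw_points <- metaDigitise(dir = "figures_to_digitize/", summary = FALSE)
str(raw_points)   # extracted values, organized by plot type and user-defined group names
```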

Viral infectivity reduction parameter

The third step aimed to estimate a parameter characterizing the viral infectivity reduction for each kinetics (i.e. each condition). This parameter, denoted D and expressed in hours, is the decimal reduction time (the time required for a 1-log10 reduction of the infectious titer) and was estimated by fitting a primary inactivation model to the extracted data79. The value of D corresponds to the negative reciprocal of the slope of the log-linear model10,71 written as follows:

$${\log }_{10}\frac{N}{{N}_{0}}=-\frac{t}{D},$$

with N0 and N corresponding to the numbers of infectious viruses at the initial time point and at time t (expressed in hours), respectively. The model was fitted independently to each kinetics using the nls() function (with tools from the R package nlstools80), running the ‘nl2sol’ algorithm from the Port library81. The starting parameter values required by the fitting algorithm were set according to the kinetics curves or, in some rare cases of non-convergence, optimized using the R package nls282. For each kinetics, a value of D was estimated together with its uncertainty, expressed as the standard error SE, and the value of log10 D was computed accordingly. Finally, the coefficient of variation was calculated as the ratio CV = SE/D.
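For illustration, a minimal version of this fitting step, written with base R's nls(), could look as follows; the data frame kin and its column names (time_h for time in hours, log10_reduction for log10(N/N0)) are assumptions and not part of the published scripts.

```r
# Minimal sketch of the D-value estimation for one kinetics
# (data frame 'kin' and its column names are assumptions)
fit <- nls(log10_reduction ~ -time_h / D,
           data      = kin,
           start     = list(D = 10),   # starting value chosen from the shape of the curve
           algorithm = "port")         # 'nl2sol' algorithm from the Port library

D_hat  <- coef(fit)[["D"]]
SE     <- summary(fit)$coefficients["D", "Std. Error"]
CV     <- SE / D_hat                   # coefficient of variation used in the quality scoring
log10D <- log10(D_hat)
```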

Evaluation of kinetics quality

In this work, kinetics data were collected from publications in which the laboratory experiments did not follow the same design. These data therefore exhibited considerable variability, e.g. in terms of number of time points, replicates, etc. Criteria are thus useful to evaluate and rank the quality of the collected kinetics, since the quality of the raw data is likely to influence the statistical estimation of D. Criteria-based ranking approaches have been proposed in some predictive microbiology studies to deal with difficulties of data selection (inclusion or exclusion for modelling)83,84. Herein, to establish a quality score per kinetics, we considered three criteria: (i) the number of time points of the kinetics; (ii) the weight of each extracted point, i.e. whether it represented a single value or multiple measurements (at least two technical replicates); and (iii) the value of the coefficient of variation (CV) of the estimated D, characterizing the goodness of fit of the inactivation model. For each kinetics, we attributed three scores corresponding to these three criteria and classified them into categories. It is worth noting that the threshold values separating these categories, as well as the corresponding score values, can be defined arbitrarily depending on the study and the extracted dataset. In our work, for each kinetics, the score associated with the number of time points, denoted s1, was defined as follows:

$${s}_{1}=\left\{\begin{array}{c}1\left({n}_{t}\le 3\right),\\ 2\left(4\le {n}_{t} < 6\right),\\ 3\left(6\le {n}_{t} < 8\right),\\ 4\left({n}_{t}\ge 8\right),\end{array}\right.$$

where nt corresponds to the number of time points collected from the kinetics.

The score s2, based on the importance of points (‘unique’ or ‘multiple’), was defined as follows:

$${s}_{2}=\left\{\begin{array}{c}1\left(unique\right),\\ 3\left(multiple\right).\end{array}\right.$$

The score s3, based on the coefficient of variation CVi of the kinetics i, was given as follows:

$${s}_{3}=\left\{\begin{array}{c}1\left(CV > 0.3\right),\\ 2\left(0.2\le CV < 0.3\right),\\ 3\left(0.1\le CV < 0.2\right),\\ 4\left(CV < 0.1\right);\end{array}\right.$$

with s3 set to 1 for kinetics whose standard errors and coefficients of variation could not be computed (kinetics with only two time points).

Finally, a global score S taking into account all criteria was calculated for each kinetics:

$$S={s}_{1}\times {s}_{2}+{s}_{3}.$$

The calculated scores for all kinetics are gathered in the spreadsheets and RData objects provided in the Data Records section.
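A compact way to reproduce this scoring in R is sketched below; only the thresholds and the formula for S come from the definitions above, while the function and argument names are hypothetical.

```r
# Minimal sketch of the kinetics quality score (thresholds as defined above)
quality_score <- function(n_t, replicate, CV) {
  s1 <- ifelse(n_t <= 3, 1, ifelse(n_t < 6, 2, ifelse(n_t < 8, 3, 4)))
  s2 <- ifelse(replicate == "multiple", 3, 1)
  s3 <- ifelse(is.na(CV), 1,            # CV not computable (two-point kinetics)
        ifelse(CV > 0.3, 1,
        ifelse(CV >= 0.2, 2,
        ifelse(CV >= 0.1, 3, 4))))
  data.frame(s1 = s1, s2 = s2, s3 = s3, S = s1 * s2 + s3)
}

quality_score(n_t = 7, replicate = "multiple", CV = 0.08)   # gives S = 3 * 3 + 4 = 13
```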

Data Records

Intermediate data: Figure digitalization raw files

The source figures (jpg or png image files) were digitalized in the second step of the collection procedure (see Fig. 1). The digitalization was carried out using the R package metaDigitise78, which automatically creates a directory named ‘caldat’ containing the raw digitalization files, ensuring traceability and avoiding having to redo the manual digitalization at every run of the procedure. The raw digitalization files generated by metaDigitise were automatically named after their corresponding image files. All source figures and raw digitalization files used herein are provided in a data repository85.

Input and output quantitative data spreadsheet

Output spreadsheets were produced at the end of the overall collection procedure in Excel and CSV file formats (see Fig. 1) (“DataRecord_OutputData.xlsx” and “DataRecord_OutputData.csv”).

They contain all information reported from the publications (experimental conditions, figure sources, publication references, DOI, etc.) related to each kinetics used as input, as well as the corresponding estimated values described in the Methods section (D, coefficient of variation, scores, etc.)85. Each row represents a kinetics; each column is filled, when available, with the following qualitative and quantitative variables:

  • ID of each kinetics (Kinetics key), denoted for example ‘K001’, ‘K002’, etc.;

  • ID of each study (Study key);

  • Studied viruses and their classification: Genus, Sub-genus, Virus, Strain;

  • Temperature considered in the experimental design: temperatures were recorded as the precise values reported in the publication when available (column Temperature) or as ranges (column Temperature range), in which case the median value was also reported (e.g. 23.5 °C in Temperature for a range of 22–25 °C given in the publication);

  • Relative humidity (columns Relative humidity and Relative humidity range) considered in the experimental design: as for temperature, RH was reported as precise values and/or ranges;

  • pH values (column pH) if available;

  • Information related to the matrices sorted in three columns:

    • (i) the studied matrices (column Studied matrices – fully named) as described in the publications, with some details;

    • (ii) the standardized matrices (column Standardized matrices), a practical annotation used to group similar matrices, such as liquid medium, stainless steel, etc.; and

    • (iii) the medium, grouping the above matrices into four classes: “liquid media”, “porous surface”, “non-porous surface” or “aerosol” (column Medium);

  • Information related to the kinetics monitoring methods including:

    • (i) the quantification method (column Quantification method) indicating the experimental techniques such as viral infectivity assays by different cell types;

    • (ii) the inoculum used (column Inoculum); and

    • (iii) the replicate status (column Replicate), indicating whether the monitored kinetics were extracted as unique or multiple time points;

  • Sources of kinetics including

    • the bibliographic references (column References);

    • the name of tables or figures (column Table or Figure of the study) in the original publication where the kinetics raw data were transcribed or digitalized and

    • the corresponding file names of these tables and figures (column Re-transcribed tables or digitalized figures) provided in Data records allowing their re-use by other researchers;

  • Total number of points (column nb_points) extracted from each kinetics;

  • Different estimated values for all kinetics collected in the present study as described in the Method section:

    • (i) values of D (column Dvalues)

    • (ii) its standard error (column Dvalues_stderr) and

    • (iii) the coefficient of variation (column Dvalues_CV);

    • (iv) the decimal log of D (column log10D)

  • For some kinetics and for comparison purposes, the estimated values of log10D previously estimated using another modelling approach10 (column log10D_AEM);

  • The scores given to each kinetics, including s1, s2 and s3 as well as the global score S (columns s1, s2, s3 and S, respectively).

Input and output as RData object

All inputs used and outputs obtained at the end of the collection workflow are also provided as a ready-to-use RData object (DATASET.RData)85. When this RData object is opened in R/RStudio, one can extract the following (a minimal loading sketch is given after this list):

  • the input and output data spreadsheet described above (object DATASET);

  • raw data (from tables or figures) associated with the monitoring of each kinetics (measured values at each sampling time point); only data above the limit of quantification (LOQ) are recorded (object kinetics_rawdata);

  • regression plots (from the linear inactivation model, see the Methods section) generated for each kinetics (object regplot). These regression plots were also exported as the provided PDF files (output_adjusted_kinetics). The pattern of the inactivation kinetics (increasing or decreasing) may differ depending on the unit used by the authors (e.g. logTCID50/ml, \({\rm{\log }}(\frac{{N}_{t}}{{N}_{0}})\), viral titer reduction in percentage, etc.).
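The following minimal sketch shows how to load and inspect these objects in R; only the file and object names documented above are used.

```r
# Minimal sketch of re-using the provided RData object
load("DATASET.RData")   # creates the objects DATASET, kinetics_rawdata and regplot

head(DATASET)           # one row per kinetics: conditions, D, SE, CV, scores, ...
str(kinetics_rawdata)   # raw time/titer points (above LOQ) for each kinetics
```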

Technical Validation

The technical validation focused on the figure digitalization step, since it remains a manual task that may vary from one user to another. In order to check the quality of the data collected by digitalization, this step was repeated by three independent users to evaluate its repeatability and reproducibility. This checking procedure was performed on a random sample of eleven kinetics among those collected, and each kinetics was digitalized three times per user. The values of the parameter D were then estimated as described in the previous sections. The values of D estimated by the different users were first compared by fitting a major axis regression model to bootstrap data generated for each pair of users86,87. Afterwards, the Gage R&R tool from the R package SixSigma88,89 was used to identify and quantify the parts of the error in the estimated values of log10 D due to within-user repeatability and between-user reproducibility, respectively. The R scripts and data associated with the technical validation procedure are provided (see the ‘Code Availability’ section below).
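As an indication, the Gage R&R step could be reproduced along the following lines; the data frame valid and its column names are assumptions, not those of the provided scripts.

```r
# Minimal sketch of the Gage R&R analysis with the SixSigma package
# (data frame 'valid', with one row per digitalization attempt and columns
# 'log10D', 'kinetics' and 'user', is an assumption)
library(SixSigma)

rr <- ss.rr(var = log10D, part = kinetics, appr = user,
            data = valid, main = "Gage R&R on log10 D")
# The output decomposes the variance of log10 D into repeatability
# (within-user) and reproducibility (between-user) components.
```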

As an illustration, the comparison between users (denoted users 1, 2 and 3) by major axis regression is plotted in Fig. 3 (user 1, plotted on the X-axis, was arbitrarily chosen as the reference for comparison). The Gage R&R analysis showed good repeatability: the part of the error due to intra-user variation was estimated at only 0.01% of the overall variation. The part of the error due to between-user variation was estimated at 1.5%. After discussion among the experimenters, this inter-user error could be explained by difficulties in choosing the points to digitalize. Indeed, points below the limit of quantification (LOQ) should not be included, in order to avoid bias. This choice can strongly influence the estimation of log10 D, as illustrated in Fig. 3, since this parameter determines the slope of the linear model fitted to the chosen data points. Yet, in many articles, the LOQ is not reported. In such cases, it is up to the scientist in charge of the digitalization to decide which points to include or exclude. This is prone to introduce uncertainty, especially for points corresponding to the end of the experiment. In view of this user-dependent choice, we therefore provide all the raw digitalization files, which can be imported, re-used or modified by other users according to their expertise.

Fig. 3
figure 3

Illustration - Comparison of the D values estimated by different users performing repeatedly the figure digitalization step on the same subset of kinetics.

Usage Notes

Re-use of figure digitalization files

The digitalization step was carried out using the R package metaDigitise78, which provides reproducible and flexible tools for tracing every digitalization. In practice, digitalizations were performed using R commands (see the R scripts provided in the Code Availability section) allowing users to process the different image files ready to digitalize. For each image (plot), this process consists of manually clicking on chosen points of the image to calibrate the plotted axes, and then converting the clicked points (from curves, barplots, etc.) into numerical values saved in an R object. The different groups of points can be assigned user-defined group names in order to separate different kinetics extracted from the same image if necessary. For each digitalized image, a digitalization file is automatically generated in a specific directory, named ‘caldat’, ensuring the traceability of this manual step. Such a file makes it possible, using R commands, to import the numerical values already digitalized and/or to edit or recalibrate some values/points without having to re-process the whole image (the detailed schema of the procedure, including the R scripts and data record files used, is given in Fig. 4).
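A minimal sketch of this re-use is given below; the directory path is hypothetical and should point to a folder containing the source images together with the provided 'caldat' sub-directory.

```r
# Minimal sketch of re-using previously digitalized figures with metaDigitise
library(metaDigitise)

# When a 'caldat' sub-directory is present, metaDigitise proposes to import the
# existing digitalizations or to edit/recalibrate them instead of starting over.
previous <- metaDigitise(dir = "digitalized_figures/", summary = FALSE)
```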

Fig. 4
figure 4

Detailed schema of the data collection procedure including the used R scripts and data records files.