Introduction

Acute infectious respiratory diseases are one of the main causes of morbidity and mortality worldwide, and viral infections of the lower respiratory tract account for a large proportion1. Among them, coronaviruses are the largest group of non-segmented, single-stranded, positive-sense ribonucleic acid viruses (+ ssRNA)2. They belong to the order Nidovirales, family Coronaviridae, subfamily Coronavirinae, and cause zoonotic infections in many vertebrates3. In December 2019, a new coronavirus, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), was reported for the first time in the city of Wuhan, Hubei Province, China, causing a severe infection in humans (COVID-19) that rapidly became pandemic. SARS-CoV-2 was sequenced as an enveloped ssRNA virus with a complete genomic sequence containing 29,903 nucleotides and encoding 7986 amino acids4. Phylogenetic analysis of coronavirus genomes has revealed that SARS-CoV-2 belongs to subgenus Sarbecovirus in genus Betacoronavirus, with high similarity (96%) to bat betacoronavirus RaTG13, suggesting its potential zoonotic origin5.

Like other RNA viruses, betacoronaviruses can have complex and dynamic cycles of genomic variation within a population or within a single host, and thus exhibit significant polymorphism6. The rate of evolution of SARS-CoV-2 is considered moderate, initially estimated at 1.19–1.31 × 10⁻³ substitutions per site per year6, and has since tended to increase to around 2.68–3.86 × 10⁻³ per site per year, mainly due to the low fidelity of its RdRp, which could itself evolve with time7. Thus, new mutants, clones, and then viral variants arise from each infected host, with different infectivity and contagiousness, and strongly shape the different epidemic waves of COVID-198. As an example, a link between increased mutations and treatment has recently been demonstrated, as has the selection pressure of the host immune system, which is associated with more mutations in the spike domain9. However, the origin of these new inter-individual viral entities called "variants" may be more subtle: by analogy with HIV and other RNA viruses, several teams have studied whether significant intra-individual variability could lead to this genetic polymorphism10. The advent of next generation sequencing (NGS) techniques has allowed the identification of these intra-individual viral subpopulations, called quasispecies, in patients infected with SARS-CoV11 and MERS-CoV12, suggesting that SARS-CoV-2 also forms quasispecies7, with an estimated average genetic distance of ~ 8.36 × 10⁻⁴. The presence of these SARS-CoV-2 quasispecies has been observed in various types of biological samples, particularly nasopharyngeal ones, with minority variants distributed evenly along the genome and ranging in frequency from 1 to ~ 30%13. The appearance of viral variants has now been strongly suggested to be an indirect consequence of the finest intra-individual genetic evolution, which raises fair questions about what is accountable for this mechanism14.
Information is missing from the literature, first, on the effect of anti-SARS-CoV-2 treatments and vaccines on mutability, and also on the clinical risk factors for becoming a “variant maker”, while preventing escape mutations through genomic surveillance has become indispensable.

Among other open questions, the persistence of SARS-CoV-2 viral excretion is not yet well defined and could potentially accelerate viral genomic evolution15. Two meta-analyses, including 79 and 28 studies, converged on a nasopharyngeal viral shedding duration of 17 days (mean) and 18.4 days (median), respectively16,17. Viral persistence, defined as longer viral shedding (> 17 days), affected about 30% of patients during the initial outbreak with the WU strain, mainly patients who were immunocompromised, had comorbidities, or reached a severe clinical stage18, and has also been reported more recently with emerging VOCs such as Omicron 21K (https://wwwnc.cdc.gov/eid/article/28/5/22-0197_article).

Significant differences in cytokine profiles and immune transcriptomes also exist between persistent and non-persistent patient populations, associated with a longer host–pathogen interaction and, consequently, a higher mutational risk19. It is thus legitimate to assess the role of these persistent patients in the evolution of SARS-CoV-2, through both a longer period of transmission risk and the fixation of pre-existing mutations in the viral subpopulation.

In this context, we conducted a prospective study on 160 nasopharyngeal PCR-positive SARS-CoV-2 samples to assess possible differences in intra-individual genetic variability between persistent and non-persistent patients. The primary endpoint was the mean intra-individual genomic variability, compared between the so-called "persistent" and "non-persistent" patient populations. As secondary endpoints, we analyzed intra-host variation in the spike gene, examined in detail the most variable genomic positions and patients, and investigated whether mutations of interest that now circulate were already present in viral subpopulations before they spread.

Results

Characteristics of patients

A total of 160 samples were collected, from 105 persistent and 55 non-persistent patients (control group). After analysis of the clinical data from the persistent group, 17 patients were excluded: 14 because persistence of viral shedding had been wrongly recorded (below 17 days) and 3 because they were below 18 years of age (Fig. 1). After sequence quality analysis, poor-quality sequencing was found in 17 patients from the persistent group and 1 from the control group. Among persistent patients, mean age was 67 years (SD 17.8) and 63% were men. Mean shedding duration was 26 (± 6) days. Immunodepression background was classified as follows: 0: none (46%); 1: diabetes mellitus (29%); 2: hemopathies (5%); 3: solid organ graft with immunosuppressors (6.5%); 4: active solid cancer with chemotherapy or immunotherapy (6.5%) and 5: autoimmune disease with immunotherapy (3.9%). We do not have background data for 2.6% of patients (n = 2). Most patients (75%) received a specific SARS-CoV-2 treatment, including azithromycin, hydroxychloroquine, dexamethasone and ivermectin, alone or in combination (Table 1). Antibiotic therapy against bacterial infections was not assessed. Severity of the disease was graded in 4 stages according to the maximum stage reached by the patient: 1: mild stage (ambulatory); 2: moderate stage (hospitalization); 3: severe stage (intensive care unit); 4: death. Thus, we had 27.8% of patients at stages 1 and 3, 25% at stage 2 and 2.8% at stage 4. Data were missing for 6.5% of patients. No differences were found between the persistent group and the control group in propensity-matched multivariate analyses, except for age, disease stage 3 and ICU admission (Table 1).

Figure 1

Flowchart.

Table 1 Characteristics of patients.

Characteristics of sequencings

After clinical data analysis, we sequenced 144 SARS-CoV-2 samples from nasopharyngeal swabs. Mean genome coverage was 90.7% (± 12.5), median 99.7%, and mean depth per position was 1.738 reads (± 1.065). The SARS-CoV-2/human reads ratio was as follows: median 0.89, mean 0.79 (± 0.25). Whole genome quality was also assessed with Nextstrain and Auspice (Supplementary Fig. 2). Twelve sequencings were excluded for too low coverage, 11 in the persistent group and 1 in the control group. Details on sequencing, including additional mutations, are reported in Supplementary Table 2. According to Nextstrain analysis, the clade distribution was 47% and 25% of 20A, 15% and 6% of 20E, 19% and 32% of 20I/Alpha variant, 12% and 32% of 20J/Beta variant, and 7% and 6% of other clades, in the persistent and non-persistent groups, respectively (Supplementary Fig. 2).

To assess the risk of bias from ARTIC amplification, we analyzed read distribution. Linear regression showed a negative correlation between Ct value and number of reads per position (R = 0.44, p < 0.001) and between mean variability per sample and number of reads per position per sample, and a positive correlation between Ct and mean variability per sample (R = 0.29, p < 0.001). Thus, a position with a high number of reads does not wrongly reflect high variability (Supplementary Fig. 3).
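As an illustration of the kind of correlation reported here, the following minimal sketch computes a Pearson coefficient in plain Python. The helper name `pearson_r` is hypothetical and the Ct values and read depths are made-up placeholders, not study data; the original analysis was run in GraphPad Prism.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Placeholder values (NOT study data): higher Ct, fewer reads per position.
ct_values = [15.0, 20.0, 25.0, 30.0, 34.0]
mean_reads = [3000.0, 2400.0, 1500.0, 900.0, 300.0]

r = pearson_r(ct_values, mean_reads)
print(f"r = {r:.2f}")  # negative for these placeholder values
```

A negative coefficient corresponds to the direction described for Ct versus reads per position; the sign convention used for the reported R values is not stated in the text.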

Comparison between variabilities from persistent and non-persistent patients in whole genomes and in Spike domain

In the global analysis (Fig. 2a), the mean intra-host variability for all samples across whole genomes was 5.4% (SD 0.9%) in the persistent group versus 4.6% (SD 0.3%) in the control group, with a significant difference of the means and variances on unpaired t-test with Welch correction (−0.67 ± 0.12; p < 0.001). In the analysis by clade (Fig. 2b), intra-host variability was significantly higher in persistent than in non-persistent samples from clades 20A and 20I (p = 0.009 and p = 0.019, respectively), but not from clade 20J (p = 0.15). In the analysis by severity (Fig. 2c), no difference in means was found between persistent and non-persistent patients with severe COVID-19 (clinical stages 3 and 4) (5146 vs 4522, p = 0.17), whereas significant differences occurred between persistent and non-persistent patients in the mild and moderate clinical groups (5019 vs 4143, p = 0.0005 and 5222 vs 4414, p = 0.019, respectively). In the spike gene (positions 21,563–25,384), we found ten super-variable positions (21,635; 22,063; 22,210; 23,104; 23,144; 23,231; 24,056; 24,290; 24,673 and 25,101). Four showed significant mean differences: positions 22,210; 23,104 and 24,056 showed higher variability in persistent samples (differences between means: + 9.5, p < 0.001; + 5.5, p = 0.002; + 8.9, p < 0.001, respectively), while variability was higher in non-persistent samples at position 23,231 (difference between means −6.7, p = 0.0017) (Fig. 2d,e). We did not find any correlation between age and intra-host variability on simple linear regression, with R² equal to 0.009840 and Sy.x equal to 0.83 (Fig. 2f). A global representation of variability per sample across the whole genome is given in Fig. 3.

Figure 2

Comparison of intra-host variability according to several criteria between samples from persistent (yellow) and non-persistent (blue) patients. (a) Global comparison of intra-host variability per sample. Welch’s t comparison test. (b) Differences between clades. Kolmogorov–Smirnov comparison test. (c) Differences between severity of COVID-19. Unpaired t test used. (d) Hotspot mutations in spike domain. Mixed effect analysis, Šídák's multiple comparisons test. (e) Variance comparison in spike domain. Mixed effect analysis, Šídák's multiple comparisons test. (f) Linear regression between intra-host variability and age. Simple linear regression test.

Figure 3

Details of variability for each sequenced sample (Y rows) by position in the SARS-CoV-2 genome (X columns). The variability scale is shown on the right.

Description of hot-spot positions

A total of 123 hot-spot positions were found: 5 in the 5’UTR, 3 in NSP1, 9 in NSP2, 19 in NSP3, 7 in NSP4, 2 each in NSP5 and NSP8, 4 each in NSP6 and NSP10, 6 in RdRp, 3 each in the Helicase, Endonuclease, Exonuclease and Methylase domains, 22 in the spike gene, 11 in gene “E”, 3 each in genes M and ORF8, 8 in gene “M” and 3 in the 3’UTR domain (Fig. 4 and Table 2). Comparing persistent (P) and non-persistent (NP) samples, only 25 positions showed significant differences, most of them with higher variability in the persistent group: 5 in 5’UTR, 1 in NSP1 and NSP2, 4 in the NSP3 gene, 2 in NSP4, 1 in RdRp, 2 in the Methylase gene, 5 in the spike domain, 3 in gene “E” and 2 in gene “N” (Fig. 4 (stars); Table 2). Significantly higher intra-host variation in non-persistent samples was found at positions 3833; 7814; 21,409; 24,673; 26,562 and 28,215 (6 out of 25).

Figure 4

Hotspot positions, selected when the median variability at a given position was higher than 25%. Each bar represents the mean variability at one position: blue bars and circles for non-persistent samples, and yellow bars for persistent samples.

Table 2 Uncorrected Fisher's LSD test comparing differences between persistent and non-persistent samples (one-way) at positions showing a median variability higher than 25%. Spike domain in bold characters. CI: confidence interval; LSD: least significant difference; Diff.: difference.

Presence of intra-host N501Y and P681H variants in 20A clade samples

We restricted this analysis to clade 20A samples from our cohort, a clade that contains neither the N501Y nor the P681H mutation, to look for those mutations among intra-host variants. There were 35 clade 20A samples from P patients and 10 from NP patients. In P samples, the N501Y mutation was present as a minor variant in 15 out of 35 samples (43%), ranging from 1.6 to 28.6% of variants per sample (median: 15.9%). The P681H mutation was, in turn, present as a minor variant in 28 out of 35 P samples (80%), ranging from 1.1 to 44.6% of variants per sample (median: 2.5%). In the 10 NP samples from clade 20A, we found 6 N501Y variants (60%), with a median of 3.9%, and 8 P681H variants (80%), with a median of 5.9% (Fig. 5). Using ANOVA, we could not find any significant difference between P and NP samples (p = 0.63 for the N501Y mutation and p = 0.45 for the P681H mutation).

Figure 5

Mutant cloud (minority variants) found at positions 23,063 and 23,604, corresponding to the N501Y and P681H mutations, assessed in patients infected with clade 20A only. Comparison between P and NP samples at those positions does not show significant differences.

Discussion

The origins of mutations in SARS-CoV-2 evolution are hard to assess, and even harder to prevent, as shown by Wu et al. in a one-year retrospective study of emerging mutations and their interactions with the host immune system20. Quasi-species, well studied in HIV, remain a challenge for current research on SARS-CoV-2, because they require looking behind mutations and deeper into genomic flows, beyond the consensus sequence14. What is very interesting about what is described as a "cloud of viral mutants" is the way in which these populations are intrinsically selected. The pathogenesis was well described for other RNA viruses by Domingo et al. in 2019, as an accumulation of micro-evolutionary events creating a rich phenotypic intra-host reservoir, shifting between dominance among variant clouds and interactions with the host and within mutant spectra21. Studies of SARS-CoV-2 quasi-species are rare, but tend to make quasi-species the number one suspect in the genesis of mutations22.

We describe here a large SARS-CoV-2 quasi-species study, in a relatively early population of viruses in the pandemic, notably before the appearance of the large monophyletic variants of concern (VOCs) Delta and Omicron, and we suggest that our persistent population has a higher ability to add hidden nucleotide events at crucial positions. Persistent COVID-19, as noted above, is an emerging entity that suggests high intra-host variability and concerns immune-impaired populations19. In a recent study, Perez-Lago et al. showed remarkable SARS-CoV-2 intra-host variability over time in three cases of persistent shedding23. They observed mutations arising from genomic weak points, especially in the Spike and ORF1ab domains. This finding converges with our results, since the most variable positions in our cohort, and those that differed from NP, were in the Spike and NSP3 domains. The NSP3 gene, which codes for the papain-like protease (PLpro), has been shown to play an important role in host interactions, through ubiquitin-like action on the inflammatory response and evasion of the type 1 β-interferon immune response24. Evidence is also growing for a function of PLpro in the control of viral spreading25. As persistence of viral shedding is linked with those host-pathogen interactions, we can extrapolate from our results that a higher intra-host variability might be due to those interactions, rather than the reverse.

In addition, in our cohort, intra-host variability was found especially in patients with persistent viral shedding. In particular, we detected the same types of subvariant mutations (deletion, transversion, transition) in persistent and non-persistent samples, but at a higher percentage per position in persistent samples. Whereas quasi-species are commonly analyzed along an evolutionary timeline built from several samples from the same patient, we chose a different approach, capturing quasi-species at a single time point from one patient sample. Most of the subvariant cloud modifications found in persistent samples were deletions or synonymous mutations, as in several studies on quasi-species26,27,28,29, which could suggest natural correction and disappearance of those potential sources of mutation. However, synonymous mutations may play a silent role: Khateeb et al. described a significant reduction of infectivity and escape from the BNT162b2 vaccine in a minor part of the nasal pseudovirus population that was composed mainly of synonymous mutations30.

In our spike gene analysis (positions 21,563–25,384), ten super-variable positions (21,635; 22,063; 22,210; 23,104; 23,144; 23,231; 24,056; 24,290; 24,673 and 25,101) were found, corresponding to amino acids 25, 167, 216, 514, 528, 557, 832, 910, 1037 and 1180, respectively. In the literature, Rocheleau et al. described intra-individual variability early in 2021, mainly in the spike domain, with a positive correlation between high variability per nucleotide location and gene length29. Among 15,289 SARS-CoV-2 genomes analyzed, they detected high-frequency intra-host variability at codons 194, 215, 261, 655, 1254, 1258 and 1259 in the spike domain, a region close to our super-variable codons and with a seemingly similar distribution, close to key mutations such as E484 and N501. Agius et al. identified highly variable clouds near the main VOC mutations, and considered a potential role of this variability in deep mutational processes, linked with strong interactions with the immune system27. In their work, intra-host variability was greatest in the ORF1a and spike domains, in line with what we found for the spike and NSP3 domains.
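The correspondence between genomic positions and spike amino acids quoted above follows from simple codon arithmetic. The sketch below assumes only the spike coordinates given in the text (21,563–25,384, 1-based, in the reference frame); the function name `spike_codon` is hypothetical and introduced for illustration.

```python
SPIKE_START = 21563  # first nucleotide of the spike ORF, as given in the text
SPIKE_END = 25384    # last nucleotide of the spike ORF, as given in the text

def spike_codon(genome_pos: int) -> int:
    """Return the 1-based spike codon number containing a genomic position."""
    if not SPIKE_START <= genome_pos <= SPIKE_END:
        raise ValueError(f"position {genome_pos} is outside the spike gene")
    return (genome_pos - SPIKE_START) // 3 + 1

positions = [21635, 22063, 22210, 23104, 23144,
             23231, 24056, 24290, 24673, 25101]
codons = [spike_codon(p) for p in positions]
print(codons)
# [25, 167, 216, 514, 528, 557, 832, 910, 1037, 1180]
```

The same arithmetic maps positions 23,063 and 23,604 to codons 501 and 681, consistent with the N501Y and P681H mutations examined in the Results.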

In our cohort, the initial populations differed in age and severity, which could have an important impact on our conclusions, although no link was found between age and variability in our linear regression analysis. Patients with malignancies or receiving immunosuppressive treatment face a higher COVID-19-related mortality risk and longer viral shedding. Although Laubscher et al. showed no increase in quasi-species in 6 patients from an oncology department31, our high-throughput analysis showed a higher number of subvariants in persistent shedding; this discrepancy could be explained by their smaller number of patients compared with our study. Moreover, they did not include samples collected more than 3 weeks after diagnosis.

Diabetes mellitus accounted for a large share (30%, n = 22) of our persistent patients compared with the non-persistent ones, and we did not conduct any subgroup analysis of these patients. To our knowledge, no study of quasi-species in diabetic patients with acute COVID-19 has been reported in the literature yet; such studies are still needed to better understand intra-host SARS-CoV-2 evolution. We also saw differences in intra-host genomic variability between persistent and non-persistent patients with mild disease, which lends reliability to our findings, because persistent viral shedding in mild patients has been related to longer interaction with the host immune system32. Interestingly, Al Khatib et al. found higher intra-host variability in severe patients, which differs from our results, likely because there were not enough severe patients in our control group for us to conclude on a significant difference26.

Furthermore, our study suffers from biases, as the ARTIC protocol is a source of significant variability. Oxford Nanopore technology is also characterized by a higher per-base error rate than short-read sequencing techniques. However, we circumvented this by using a dedicated bioinformatic pipeline to avoid amplification errors (unpublished source), and the sequencing depth we obtained is such that these errors are, in the end, in similar quantities to other NGS techniques. In fact, the majority of viral quasi-species studies use Illumina technology, which is described as more reliable11, and we demonstrate here the feasibility of in-depth analysis with Nanopore technology.

An important finding of this work may be the presence of the N501Y and P681H mutations in the spike domain, at high percentages in samples from clade 20A collected before the rise of the Alpha (20I) or Omicron (21K) variants. Although not all minority variants may emerge as VOCs, intensive sequencing and analysis of SARS-CoV-2 quasi-species by NGS, especially in persistent patients, would make it possible to anticipate the spread of potential future variants8. Indeed, SARS-CoV-2 cellular entry, which relies on the spike protein and the ACE2 receptor, can be dramatically changed by a single nucleotide difference that alters the entire 3D conformation of the target relative to its receptor33. Moreover, cell biologists can now predict not only the conformational changes in the spike domain that result from mutations, but also the resulting affinity of the virus for its target-cell receptor34, which remains extremely sensitive, as studies have revealed particular links between the speed of SARS-CoV-2 cellular entry and clinical severity35. We strongly encourage teams to include quasi-species analysis in large-scale surveillance of variants of concern, so that we can stay one step ahead and add another arrow to our quiver.

Conclusion

We found significant differences in the global number of quasi-species clouds between persistent and non-persistent patients, which validates the hypothesis that patients with persistent viral shedding could be a variant nursery. Further studies are needed to characterize variant virus “farmers” and provide clues for variant hunters.

Materials and methods

Collection samples

Among the thousand or so SARS-CoV-2 samples taken daily for routine screening and centralized at the IHU Mediterranean infection, APHM, Marseille, France, we prospectively and randomly selected 205 nasopharyngeal samples positive for SARS-CoV-2 by real-time polymerase chain reaction. Samples were selected from a routine sample list collected from March 2020 to August 2021. Inclusion criteria were as follows: age over 18 years and a positive RT-PCR test for SARS-CoV-2 with a cycle threshold (Ct) between 10 and 34, regardless of clade, comorbidities, treatment, duration of symptoms or stage of disease severity. Only patients with two positive PCR tests at least 17 days and at most 90 days apart were selected, the upper limit serving to avoid including samples from re-infection. Randomization was computer-based, from a list of patients who met all inclusion criteria. For the control population, we selected positive SARS-CoV-2 nasopharyngeal samples in the same way, with randomization from a list of samples that underwent routine sequencing in our center. The inclusion criterion was viral clearance within 17 days.

Sequencing protocol

Samples that were positive for SARS-CoV-2, identified by real-time PCR with a Ct value from 10 to 34, were processed for next-generation sequencing. Whole genome sequencing was performed following the Eco PCR tiling of SARS-CoV-2 virus with native barcoding protocol (Oxford Nanopore, version PTCE_9122_v109_revB_10feb2020). Viral RNA was extracted from 200 μL of nasopharyngeal swab fluid with the EZ1 Virus Mini Kit v2.0. Briefly, cDNA was synthesized from 10 μL of viral RNA using the LunaScript RT SuperMix kit (NEB, USA) with random hexamers. PCR was performed using Q5 Hot Start High-Fidelity DNA Polymerase (NEB, USA) and a set of primers targeting regions of the SARS-CoV-2 genome designed by the ARTIC network (https://artic.network/ncov-2019). The PCR mixture was initially incubated for 30 s at 98 °C for denaturation, followed by 35 cycles of 98 °C for 15 s and 65 °C for 5 min. The purified DNA was repaired with NEBNext Ultra II End Repair (NEB, USA), followed by DNA end preparation using the NEBNext Ultra II End repair/dA-tailing Module (NEB, USA) and the successive attachment of native barcodes and sequencing adapters supplied in the EXP-NBD196 kit (Oxford Nanopore Technologies, UK) to the DNA ends. The DNA concentration was determined with a Qubit 3.0 instrument using a dsDNA HS Assay Kit (Thermo Fisher, USA). Repaired and end-prepped products were pooled (480 µL for 48 samples) and purified with 192 µL of AMPure XP beads (Beckman Coulter, USA) and Short Fragment Buffer (NEB, USA) to exclude small nonspecific fragments. After priming the flow cell, 20 ng of DNA per sample was pooled in a DNA library with a final volume of 12 μL. A GridION Mk1 was used to perform genome sequencing on a new R9.4.1 flow cell for 2 to 4 h (depending on run quality and reads obtained).

Bioinformatic analysis

Base calling was performed using guppy (https://community.nanoporetech.com) with the High Accuracy Model (flip-flop) and the parameter settings “-c dna_r9.4.1_450bps_hac.cfg -x auto”; samples were demultiplexed and adapters were trimmed with the additional parameter settings “-trim_barcodes -barcodes EXP-NBD104/EXP-NBD114/EXP-NBD196”. FASTQ reads were filtered for quality control according to a cutoff “length ≥ 200 and Phred value ≥ 7” using the program “filtlong v0.2.0” (https://github.com/rrwick/Filtlong). Reads between 400 and 700 base pairs were kept, so that potential chimeric reads were removed, using the artic pipeline (https://artic.network/ncov-2019). Selected reads were mapped against the SARS-CoV-2 reference (Genbank accession no: NC_045512) using Minimap2 (v2.9). Sam2 consensus was used to sort the aligned BAM files and to obtain coverage data and a consensus sequence. Consensus sequences were extracted with a minimum depth of coverage of 150X and a stringency of 70%. The mappings (BAM files) were then loaded into CLC Genomics workbench v.7 software, which was also used to inspect the data and calculate alignment statistics. All sequences obtained were deposited on the GISAID website (https://www.gisaid.org/) or in Genbank under submission number SUB11504102 (https://www.ncbi.nlm.nih.gov/genbank/).
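The read-selection criteria described above (length window and minimum quality) can be illustrated with a small stand-alone filter. This is a sketch only, not the filtlong or artic implementation: the helper names are hypothetical, quality strings are assumed to be Phred+33 encoded, and the read quality is taken here as the arithmetic mean of per-base Phred scores, which may differ from how the actual tools summarize quality.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Arithmetic mean of per-base Phred scores (Phred+33 assumed)."""
    return sum(ord(c) - offset for c in quality) / len(quality)

def keep_read(sequence: str, quality: str,
              min_len: int = 400, max_len: int = 700,
              min_q: float = 7.0) -> bool:
    """Keep reads inside the amplicon length window with mean Phred >= min_q."""
    if not sequence or len(sequence) != len(quality):
        return False  # malformed record
    if not min_len <= len(sequence) <= max_len:
        return False  # too short, or long enough to be a potential chimera
    return mean_phred(quality) >= min_q

# Synthetic reads (NOT study data): '5' encodes Q20 and '#' encodes Q2.
reads = [
    ("A" * 500, "5" * 500),   # kept: in window, Q20
    ("A" * 300, "5" * 300),   # dropped: shorter than 400 bp
    ("A" * 900, "5" * 900),   # dropped: longer than 700 bp
    ("A" * 500, "#" * 500),   # dropped: mean quality below 7
]
kept = [seq for seq, qual in reads if keep_read(seq, qual)]
print(len(kept))  # 1
```

The 400 to 700 bp window and the Phred cutoff of 7 are the values stated in the text; everything else in the sketch is illustrative.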

Nucleotide variation representation (supp data)

SARS-CoV-2 genomes and the reference genome (NC_045512.2) were aligned using MAFFT v.7 (Katoh and Standley, 2013) before using the snipit tool (https://github.com/aineniamh/snipit), which summarises SNPs relative to the given reference genome.

Phylogenetic analysis with whole genome

Phylogenetic trees were constructed using the nextstrain/ncov tool (https://github.com/nextstrain/ncov) and visualized with Auspice (2.36.0) software (https://auspice.us/). Pangolin lineage was added from a tsv file in the Auspice interface.

Quasi-species analysis

Genomic variability was assessed for each sample using an in-house Excel matrix available in the supplementary data (Supplementary Table 1). Sequencing data were exported in “.TSV” format from CLC Genomic workbench v.7, then copied and pasted into the in-house matrix, which gives, for every position, the proportion of reads carrying each nucleotide as a percentage (for each position: % of A, T, C, G and deletion). We defined a stable variant quasi-species as variability higher than 25% at a specific position, as previously described14. The 25% threshold for positions of interest was also chosen following a tangent line on the distribution of variability for all samples (Supplementary Fig. 1). Intra-host variability was thus defined by a difference of 25% in the nucleotide distribution at a given genomic position.
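The per-position calculation can be sketched in a few lines of Python. This is not the in-house Excel matrix: the function name is hypothetical, the read counts are invented, and "variability" is assumed here to mean the percentage of reads that do not carry the majority nucleotide at that position, which is one plausible reading of the definition above.

```python
STABLE_VARIANT_THRESHOLD = 25.0  # percent, threshold stated in the text

def position_variability(counts: dict) -> float:
    """Percentage of reads at one position that differ from the majority call.

    `counts` maps each state (A, C, G, T, deletion) to its read count.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads cover this position")
    return 100.0 * (total - max(counts.values())) / total

# Invented counts for a single position (NOT study data).
example = {"A": 700, "C": 0, "G": 280, "T": 10, "del": 10}
variability = position_variability(example)
is_stable_variant = variability > STABLE_VARIANT_THRESHOLD

print(variability, is_stable_variant)  # 30.0 True
```

With these invented counts, 300 of 1,000 reads differ from the majority base, so the position exceeds the 25% threshold.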

We searched for hot spots of variation, defined as positions where more than 50% of samples had a genomic variation > 24% (Supplementary Fig. 1).
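The hot-spot rule can be expressed as a short function over a samples-by-positions matrix of variability percentages. The sketch below is illustrative: `find_hotspots` is a hypothetical name, the matrix is invented, and both comparisons are taken as strict ("more than"), following the wording above.

```python
def find_hotspots(variability_by_sample, min_variability=24.0, min_fraction=0.5):
    """Return 1-based positions where more than `min_fraction` of samples
    show variability strictly above `min_variability` percent."""
    n_samples = len(variability_by_sample)
    if n_samples == 0:
        return []
    n_positions = len(variability_by_sample[0])
    hotspots = []
    for pos in range(n_positions):
        n_variable = sum(
            1 for row in variability_by_sample if row[pos] > min_variability
        )
        if n_variable / n_samples > min_fraction:
            hotspots.append(pos + 1)
    return hotspots

# Invented matrix (NOT study data): 4 samples (rows) x 3 positions (columns).
matrix = [
    [30.0, 5.0, 26.0],
    [28.0, 3.0, 10.0],
    [40.0, 2.0, 30.0],
    [10.0, 1.0, 20.0],
]
print(find_hotspots(matrix))  # [1]
```

Position 3 is variable in exactly half of the invented samples and is therefore not reported, since the rule requires more than 50%.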

Ethical statement

Whole genome sequencing was performed on nasopharyngeal samples that were collected in the context of routine diagnosis and not for research purposes. No additional samples were actively collected for this study. Clinical data were retrospectively retrieved from medical files and anonymized before analysis, only at the Assistance Publique-Hôpitaux de Marseille site, and all methods were carried out in accordance with the French GDPR (General Data Protection Regulation). The experimental protocol was approved by the IRB research department unit of Assistance publique-Hôpitaux de Marseille under the number PADS-BJP737. No human genome was sequenced. In line with the European General Data Protection Regulation No 2016/679, patients were informed of the potential use of their medical data and that they could refuse the use of their data. No ethical approval was required other than informed consent.

Statistical analyses

Statistical analyses were carried out using Prism 9 for macOS (Version 9.1.1 (223), April 16, 2021, GraphPad Software, LLC, URL: https://www.graphpad.com). Categorical variables are presented as numbers and percentages, and continuous variables as means ± SD (standard deviation). Comparative analyses of mean variability between persistent and non-persistent patients were performed with the GraphPad multiple comparison tools, using Welch’s t-test or ANOVA. Percentages were compared with Chi-square or Fisher’s exact tests, as appropriate. The alpha risk was set at 0.05, and p values < 0.05 were considered significant.
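For readers who want to reproduce the main comparison outside Prism, the Welch statistic can be computed with the Python standard library alone. This is a sketch under stated assumptions: `welch_t` is a hypothetical helper, the two groups are invented placeholder values (not study data), and only the t statistic and the Welch–Satterthwaite degrees of freedom are computed, not the p value.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two independent samples with possibly unequal variances."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least two observations")
    va = variance(a) / na  # squared standard error of the mean, group a
    vb = variance(b) / nb  # squared standard error of the mean, group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Placeholder mean variability per sample, in percent (NOT study data).
persistent = [5.1, 5.9, 6.3, 4.8, 5.5]
non_persistent = [4.5, 4.7, 4.6, 4.4, 4.8]

t_stat, dof = welch_t(persistent, non_persistent)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

A p value would additionally require the t distribution; where SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same Welch test and returns it directly.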