In a previous study performed in March 2020 including patients with mild and severe SARS-CoV-2 infection1, we reported that a minor population of viral genomes accumulated deletions at the S1/S2 cleavage site (PRRAR/S), generating a frameshift with appearance of a premature stop codon. This finding concurred with the results of a study published in 2020, where spike and S1 proteins were detected in serum of COVID19 patients2. We suggested a mechanism by which this event could reduce the severity of the infection and tissue damage without losing viral transmission capability.

With progression of the COVID-19 pandemic over time, new variants have arisen and some have been considered variants of concern (VOCs), such as the Alpha (B.1.1.7), Beta (B.1.351), Gamma (P.1), Delta (B.1.617.2)3, and more recently, Omicron (B.1.1.529) variant4. In our geographic area (Barcelona, Spain), B.1.5 and B.1.1 were the dominant variants during the first pandemic wave. These were replaced by B.1.177, which predominated in many European countries during the second wave. The Alpha variant replaced B.1.177 during the third wave, and later Delta outcompeted the other prevalent variants in regions where it appeared5,6,7. Omicron has quickly spread to become the world’s dominant variant. Within 4 weeks of its emergence, it was found in 100% of infected patients consulting in primary care centers of Barcelona city. A widespread explanation for this dominance is its higher transmissibility and likely higher resistance to the acquired immune response attained with the current COVID-19 vaccines and previous infections8,9. A surprising feature of this variant is that Omicron sequences are clustered away from the other SARS-CoV-2 genomes uploaded to GISAID10,11,12, which opens an interesting debate about the origin of Omicron.

Although data on the SARS-CoV-2 consensus sequence, point mutations, deletions, and insertions are continuously added to the GISAID repository, there is a lack of information on the composition of the viral quasispecies in clinical samples, and on changes in the quasispecies structure occurring over the consecutive pandemic waves. This information could be of value to understand the particular biology of SARS-CoV-2 and it might have implications regarding the epidemiology and origin of the virus and its variants.

Therefore, the aim of this study was to compare the type and frequency of defective deletions found in the SARS-CoV-2 spike gene in patients with mild infection caused by the most prevalent variants from the first to the sixth pandemic waves, including B.1.5, B.1.1, B.1.177, Alpha, Beta, Delta, and Omicron.


The analysis included 9 samples from patents with the B.1.5 variant, 8 with B.1.1, 11 with B.1.177, 9 with Alpha, 11 with Beta, 11 with Delta, and 19 with Omicron (Supplementary Table S1). In total, 91,526,555 reads (range 3814–620,216 reads per amplicon) were obtained from the 78 COVID-19-infected patients (Supplementary Table S2). Sequences from the B.1.5, B.1.1, B.1.177, Alpha, Beta, and Delta variants were uploaded to the GenBank Sequence Read Archive (SRA) database with BioProject accession number PRJNA788442, and sequences from Omicron with accession number SUB11151740.

Amplicon A78, running from nt 1905 to 2260 in the spike protein (aa636–aa753), is of particular interest because it includes the S1/S2 O-linked glycan residue polybasic TMPRSS2 cleavage site (aa685–aa686). We obtained coverages of 10,346 to 215,520 reads per amplicon (Supplementary Table S2) for A78, and it had the narrowest interquartile range (IQR) compared to the other spike regions (Supplementary Fig. S1). Comparison of coverage between amplicons showed no significant differences, with IQRs lower than 0.242 in all cases except for A82 and A84 (Supplementary Table S3).

Sixty-two of the 78 patients (79.5%) showed deleted haplotypes over 14 overlapping amplicons (Supplementary Figure S2A to S2G). In 28 of the 78 patients (35.9%), the deletions resulted in the appearance of a premature inframe stop codon, thus generating a defective genome (Supplementary Figure S3A to S3G). Specifically, we identified 49 defective deletions in these 28 patients, resulting in the loss of 2 (Δ2) to 100 (Δ100) nucleotides (Tables 1, 2, 3, 4, 5, 6 and 7). Deletions did not randomly accumulate along the S gene. Several were found in the same nucleotide position in different patients (Supplementary Figure S2A to S2G), even in some patients infected with different variants (Table 8). For example, deletion Δ817F-822L was found in three different patients (V69S13, V70S02, V70S03) with the B.1.5 variant (Table 1), but also in one patient (V69S03) with B.1.1 (Table 8).

Table 1 Variant B.1.5.
Table 2 Variant B.1.1.
Table 3 Variant B.1.177.
Table 4 Variant B.1.1.7 (Alpha).
Table 5 Variant B.1.351 (Beta).
Table 6 Variant B.1.617.2 (Delta).
Table 7 Variant B.1.1.529 (Omicron).
Table 8 Population frequency of defective genomes per deleted amino acid position (Del/Def) and variant.

Table 1, 2, 3, 4, 5, 6 and 7 Deletions found along the spike gene from amplicon A73 to A83. Population frequency was calculated as the number of reads with deletions per amplicon divided by the total number of reads per amplicon. Table 1 for variant B.1.5; 1B for B.1.1; 1C for B.1.177; 1D for Alpha (B.1.1.7); 1E for Beta (B.1.351); 1F for Delta (B.1.617.2); and 1G for Omicron (B.1.1.529). Reads Δ = number of reads showing a deletion.

Sixteen of the 28 (57.1%) patients with defective mutations belonged to the first and second waves of the pandemic (B.1.5, B.1.1, and B.1.177) (Supplementary Figs. S3A−SG), whereas only 2 in 9 (7.1%) of Alpha, 2 in 11 (7.1%) of Beta, and 1 in 11 (3.6%) of Delta patients showed defective genomes. However, in contrast to the small percentage of defective mutations seen in Alpha, Beta and Delta patients, the Omicron variant showed defective genomes in 7 of the 28 (25%) patients.

Comparison of variants according to the genomic location of the defective positions showed some differences. The predominant lineages in the first and second wave (B.1.5, B.1.1, and B.1.177), as well as Omicron in the sixth wave, showed 11 to 18 deleted genomic regions leading to defective particles along the whole spike gene, from amplicon A71 to amplicon A84 (Tables 1, 2, 3 and 7). However, patients infected with the Alpha, Beta and Delta variants showed sporadic defective genomes only in amplicons A71, A72, A75, A76, A77, A79, A80, and A82 (Tables 4, 5 and 6).

As to the relative frequency of the variants found (number of defective reads per amplicon divided by total number of reads per amplicon) in all patients and in the whole spike gene, the largest percentage of variants with defective genomes were seen in the first and second pandemic waves (Tables 1, 2 and 3) and in Omicron (Table 7). The most prevalent defective deletion was Δ246R-249L in 7.03% of Omicron patients (Table 7), Δ852N–856V in 4.12% of B.1.5 patients (Table 1), and Δ656Y-670Y in 2.02% of B.1.177 and 1.82% of B.1.5. All other defective deletions were present at frequencies below 1% (Tables 1, 2 and 3).

Of particular note, we found a significant presence of defective deletions in the Δ640S-674Y region (nt1920-2021) in amplicon A78 on the S1/S2 cleavage site in patients from the first and second waves and in Omicron patients, but not in Alpha, Beta and Delta patients (Table 2, Supplementary Fig. S4). The absence of defective genomes at this position in Alpha, Beta and Delta was not the result of inadequate sequencing coverage or number of patients studied (Supplementary Table S4). Defective genomes close to S1/S2 appeared in variants occurring at the beginning of the pandemic and in Omicron regardless of the number of sequences studied, whereas none of the Alpha, Beta or Delta patients showed any defective genomes in this sequence (Fig. 1). Moreover, although Omicron had the lowest sequencing coverage, it showed a pattern of defective deletions similar to those of B.1.1, B.1.5, and B.1.177.

Figure 1
figure 1

Amplicon A78 (nt 1905 to 2260, aa636-aa753) coverage per variant. Red dots represent reads with no defective haplotypes detected in amplicon A78 and blue triangles indicate reads with defective haplotypes in this position.

Some specific defective deletions in the Δ640S–674Y fragment coincided in 2 or more patients infected by variants from the first and second waves. For example, the Δ656Y–670Y deletion was found in 3 patients with B.1.5 and 6 with B.1.177 (Tables 1, 2 and 3). Furthermore, Δ640S–674Y deletions were found in 4 patients with Omicron (Table 7). However, no deleted genomes in patients infected with the Alpha, Beta and Delta variants were detected in amplicon A78 (Tables 4, 5 and 6). Comparison of specific defective deletions between patients showed that deletions in the Δ640S–674Y region appeared only in patients with variants from the first and second pandemic waves (B.1.5, B.1.1, and B.1.177) at frequencies of 1.82%, 0.23%, and 2.02%, respectively, and in those with the Omicron variant (0.69%). This deletion was not detected in Alpha, Beta or Delta patients (Table 2) even though coverage was similar among all variants studied (Supplementary Fig. S4, Supplementary Table S4). However, we cannot exclude the presence of defective haplotypes with abundances below the 0.1% threshold.

The number, type, and frequency of deleted positions coincided in 22 of the 25 (88%) samples amplified in parallel using the ARTIC and N071 protocols, ruling out bias caused by primer amplification (primers shown in Supplementary Table S5). In the remaining 4 samples, N07 detected deletions of 2 nucleotides that were not found with ARTIC, possibly because of the double PCR (RT-PCR-Nested) required for N07, which has a higher risk of introducing artefactual mutations, or the 2 to 5 times higher coverage in the N07 study. Interestingly, in 2 samples (patients V69S13 and V70S02), ARTIC was able to identify defective deletions that were not visualized with the N07 primers, even though ARTIC had lower coverage (81,248 and 45,575 reads) than N07 (242,697 and 240,491 reads) (Supplementary Table S6). In general, the ARTIC protocol identified a larger number of defective deletions, except for a 7-nt deletion in V70S12 and a 29-nt haplotype in V71S13, which were only seen using the N07 protocol. Our data show that results using ARTIC were concordant with those of N07, and that in the worst case, the N07 protocol might even underestimate the rate of defective deleted haplotypes.


The arrival of Omicron (B.1.1.529) to our geographic area has changed the profile of circulating SARS-CoV-2 variants, as it is now detected in 100% of infected patients in Barcelona city (Fig. 2). This predominance suggests that Omicron has biological advantages over the other variants: higher transmissibility and likely greater resistance to the immune response acquired by current COVID-19 vaccines or overcoming a previous SARS-CoV-2 infection8,9. The origin of Omicron is an open question. One hypothesis is that it could have evolved and emerged from a population undergoing little surveillance before being detected. Another proposal is that it may have resulted from viral evolution in patients with long-term persistent infection, a situation reported in clinical groups with a weak immune response, such as immunocompromised patients. A third hypothesis is that Omicron may have emerged by a cross-species jump from humans to nonhuman species, subsequently spilling back to humans13,14. Independently of its origin, whole-genome sequencing using next-generation techniques has shown that Omicron sequences are clustered far from the millions of SARS-CoV-2 genomes uploaded to GISAID. The present study points to an additional differentiating feature of Omicron: the pattern of mutations occurring at the S1/S2 cleavage site.

Figure 2
figure 2

Lineage distribution per week and positive samples detected (discontinuous line) from March 2020 to February 2022 in Barcelona, Spain. Defective genomes are shown below the lineage distribution in bar plots starting from the first patients detected in March 2020: B.1.5 (yellow), B.1.1 (dark green), B.1.177 (violet), Alpha (light green), Beta (pink), Delta (blue), and Omicron (black). The bar plots below each variant indicate defective deletions in amplicon A78 in the 78 patients studied at the nucleotide level. The box located in the left corner are the defective bar plots related to patients in March 20201. The x-axis provides the multiple alignment (MA) nucleotide positions and the amplitude of the deletions by subregions, and the y-axis shows the frequency of the deletion (percentage) on the right and the number of reads on the left. As no insertions causing defective genomes were observed, the MA positions correspond to S gene positions. Dashed lines indicate the S1/S2 (left) and S2’ (right) cleavage sites. Detailed bar plots for the 78 patients by amplicon are provided in supplementary figures (Supplementary Figures S2A−2G for deletions and 3A-3G for defective deletions).

A higher presence of any variant in the human population as the pandemic progressed might reflect acquisition of a certain biological advantage over previous variants15,16,17. The scientific community has shown special interest in VOCs because of reported evidence indicating an impact on transmissibility and immunity attributed to multiple mutations in the receptor binding domain of the spike protein6,7. Another type of mutation involves deletions, which are a loss of nucleotides during the replication process that can cause a change in the correct reading frame. In a previous study1, we found that the early lineages showed deletions across the spike gene, some of them close to the S1/S2 spike cleavage site, which, in most cases, caused a frameshift and the appearance of a premature stop codon. Using a ribosome-profiling approach in samples from Germany, Finkel et al.18 described a deletion located in the furin-like cleavage site (TNSPRRAR, referred to in the present study as the S1/S2 cleavage site), affecting nucleotides 23,595 to 23,615, that was recurrently selected during passage of the original SARS-CoV-2 (EPI_ISL_406862) in Vero E6 cells18. In our analysis of 5,730,959 sequences from 78 patients, we did not find the exact deletion of that nucleotide region, which includes the amino acids PRRAR or HRRAR (Omicron). The deletions we identified occurred just before the cleavage site, specifically from nucleotide 23,555 (aa640S) to 23,580 (aa674Y). Nonetheless, the deletions found in both cases (naturally occurring and selected during Vero E6 viral cell culture) occurred in a small region of the spike, near or above the furin-like cleavage site, suggesting that this region is a hot spot for nucleotide deletions that could provide the virus with an evolutionary advantage. Postnikova et al.19 have suggested that the presence of two tandem CGG codons (the rarest in the SARS-CoV-2 genome), coding the two consecutive Rs in the PRRAR region, could cause ribosomal pausing and even frameshifting. Translational pausing resulting from usage of an extremely rare dicodon, together with naturally occurring defective deletions close to S1/S2 that cause a frameshift with appearance of a new stop codon1, might suppress spike protein extension and lead to production of S1 free protein, a situation suggested in our previous study1. It is reasonable to postulate that the deletions and usage of rare codons indicate that this region is extremely important for the biology of the virus, enabling SARS-CoV-2 to readily infect humans. Free spike and truncated S1 protein, recently reported in plasma of severely ill COVID-19 patients, has been attributed to tissue damage resulting from severe disease2. In any case, this finding indicates that the complete spike protein and the truncated S1 form are soluble and can be secreted by infected cells, perhaps even in mild cases, as we suggested previously1.

Here we show that variants dominating the first (B.1.5, B.1.1) and second (B.1.177) pandemic waves had a large frequency of minority mutants with deletions causing defective genomes, whereas the Alpha, Beta and Delta variants, predominant in several regions of the world5,20,21, had a smaller presence of deleted genomes, suggesting that this situation may have favored their spread in the human population, overcoming other variants. Defective viral genomes occur in most, if not all, RNA viruses during infection22. Deep sequencing has shown that several species of defective genomes are generated, as was seen in samples from influenza virus-infected patients and cultures of metapneumovirus and measles virus23,24,25. Coronaviruses are not an exception1,26,27,28,29,30,31. It has been reported that defective genomes can interfere with wild-type viral infection, causing a reduction of virulence in vivo32,33, and there are many examples of defective genomes having an impact on human and animal health as a part of viral evolution and adaptation to the host22.

The Alpha, Beta and Delta variants appeared at a time when there was a high incidence of new infections, and they had to compete with other variants. The success of these lineages suggests that they had acquired a biological advantage to beat their competitors and dominate the pandemic wave. Our results show that Alpha, Beta, and Delta had no defective deletions in the spike region close to the S1/S2 cleavage site and much fewer defective deletions than the B.1.5, B.1.1, or B.1.177 lineages. A smaller presence of defective genomes might be associated with a greater presence of infectious viral particles, without affecting viral load (measured using the surrogate Ct value). With fewer defective genomes, there would be less circulating free SI and the putative competition of free S1 with infecting particles would decease or fail, with one outcome being greater infectivity. In addition, a significant reduction in defective genomes could be associated with longer duration of positive PCR testing and greater disease severity. It has been reported that the time interval from symptoms onset to viral load decrease is longer in infection by Delta than by other variants, and that Delta infection is associated with a higher risk of severe disease in unvaccinated people, and a greater need for remdesivir or corticosteroid treatment34. In their publication, the CDC concluded that Delta was more contagious than previous variants35 and a study from Ontario, Canada reported that there was a pronounced increase in hospitalization, ICU admission, and death during Delta variant dominance, particularly in unvaccinated people36.

Surprisingly, Omicron has shown a frequency of minority mutants with deletions more similar to variants dominating the first and second waves than to Alpha, Beta and Delta. Delta and its subvariants accounted for the vast majority of infections up to the end of 2021, and it was expected that one of these subvariants would ultimately predominate over others and become endemic and seasonal. In this scenario, the appearance of Omicron was unexpected and surprising, and its rise and dominance over Delta suggests that Omicron has higher fitness than other variants. The Omicron genome includes mutations that help the virus overcome the host’s immune response and spike gene mutations and deletions affecting the receptor binding domain and the S1/S2 cleavage site (HRRAR in Omicron) that increase host cell ACE2 receptor recognition. The explosion of Omicron, which at this time is the most prevalent variant all over the world, may also be explained by its tropism for the upper respiratory tract, which facilitates viral transmission37. Nonetheless, despite its dominance over Delta, Omicron subvariants BA.1 and BA.1.1 have not led to an increased severity of the infection (hospital admissions to ICU units) or mortality38.

In conclusion, the frequency of concomitant defective deletions close to the S1/S2 cleavage site was higher in dominant variants seen at the beginning of the pandemic than in the Alpha, Beta and Delta variants, suggesting that these mutations may contribute to adapting SARS-CoV-2 to human infectivity. The observation obtained here that the presence of mutations in Omicron is similar to that of variants at the beginning of the pandemic provides information that could be of value when studying the evolution of Omicron subvariants spreading among the human population. Our results concur with findings from previous studies indicating that the S1/S2 cleavage site is an important region for the biology of the virus, affecting the capability of SARS-CoV-2 to readily infect humans.

Materials and methods

Patients and methods

Naso/oropharyngeal swab samples were collected from laboratory-confirmed SARS-CoV-2 patients meeting the World Health Organization (WHO) definition39 admitted to the Vall d’Hebron University Hospital (HUVH) and from primary care centers in Barcelona. From April 2020 to December 2021, samples from patients carrying the most prevalent SARS-CoV-2 lineages over the 6 pandemic waves occurring in the Barcelona metropolitan area were randomly selected for deep-sequencing of the spike gene. B.1.5 and B.1.1 dominated the first wave (week 11/2020 to 26/2020), B.1.177 the second (week 27/2020 to 52/2020), B.1.1.7 (Alpha variant) the third (week 53/2020 to 13/2021), B.1.351 (Beta variant) the fourth (week 14/2021 to 25/2021), B.1.617.2 (Delta variant) the fifth wave (week 26/2021 to 43/2021), and B.1.1.519 (Omicron variant) the sixth wave (week 44/2021- ongoing).

The patients’ demographic data (sex and age) were obtained retrospectively. Clinical data, sample extraction data, cycle threshold (Ct) value (when measured), and symptoms are specified in Supplementary Table S1. Only samples with Ct values lower than 30 (with some exceptions) were included in the whole-genome sequencing weekly analysis5. Only samples from mild or asymptomatic patients were selected for study because this group showed the highest prevalence of deletions at the beginning of the pandemic1. Institutional review board of Clinical Research Ethics Committee (CEIm) Vall d’Hebron University Hospital approval was obtained(PR(AG)259/2020). The need for informed consent was waived by CEIm Vall d’Hebron University Hospital, as study used routinely collected for surveillance activities. All methods were performed in accordance with the relevant guidelines and regulations.

SARS-CoV-2 detection is described in the Supplementary Material for this study and in Andres et al.1. To sequence the whole spike gene, we mainly used ARTIC v3 primers including the 21,658 bp to 25,673 bp position, corresponding to the nCoV-2019_72 (A72) to nCoV-2019_84 (A84) overlapping amplicons (artic28-ncov2019/nCoV-2019.scheme.bed, ARTIC Network), and reformulated the ones that gave amplification problems due to deletions. Spike pair and impair primers were tested for efficacy and mixed in 2 different pools for the 2 posterior multiplexed PCRs. Each pool was optimized by adjusting primer concentration to obtain a balanced number of reads for each amplicon.

The bioinformatics analysis was done as reported by Andres et al.1, and the Wilcoxon test was used to compare Ct values between samples with and without emergent deletions. Pearson’s chi-square test was used to compare the emergence of deletions with sex or presence of COVID-19 disease symptoms (eg, high fever, anosmia, ageusia, persistent headache).

Rationale for the bioinformatics analysis used

Conventional bioinformatics analyses can be carried out using several tools and platforms. In terms of sequence quality control, the analysis is performed with tools that can analyze various quality metrics and correct these if necessary. For example, in normal conditions, sequence trimming can be performed with the trimmomatic40 or fastp41 tools, which can remove low-quality sequences, nucleotides, primers, and adapters, among others. However, this approach could not be applied in our analysis, as reads would be shortened according to their quality and ultimately have different lengths, making proper analysis and comparison impossible.

With the use of conventional pipelines, once the reads are trimmed, you would map them against a reference to perform variant calling, where you would be able to see all changes. Several tools can be used for this type of analysis, such as bwa42, bowtie243 or bbmap44 for alignment, and samtools or BCF tools45 for variant calling, among others. However, our aim was to detect deletions with specific characteristics according to the haplotypes present, which cannot be achieved using conventional pipelines, as this would result in the identification of all changes according to the reference.

In addition, analysis of the variants would lead to the consensus sequence, which was not the aim of our pipeline, as we were interested in studying the diversity of haplotypes.

Ethics approval

The study was approved by the Institutional review board of Clinical Research Ethics Committee Vall d’Hebron University Hospital (CEIm), with reference PR(AG)259/2020. (