Genomic epidemiology of SARS-CoV-2 reveals multiple lineages and early spread of SARS-CoV-2 infections in Lombardy, Italy

From February to April 2020, Lombardy (Italy) reported the highest numbers of SARS-CoV-2 cases worldwide. By analyzing 346 whole SARS-CoV-2 genomes, we demonstrate the presence of seven viral lineages in Lombardy, frequently sustained by local transmission chains and at least two likely to have originated in Italy. Six single nucleotide polymorphisms (five of them non-synonymous) characterized the SARS-CoV-2 sequences, none of them affecting N-glycosylation sites. The seven lineages, and the presence of local transmission clusters within three of them, revealed that sustained community transmission was underway before the first COVID-19 case had been detected in Lombardy.


Sampling criteria and patients' characteristics 28
For 9,251 patients we were able to retrieve information regarding sex, age, and residence. In order 29 to exclude sampling bias that could affect viral diversity, only one patient per family unit was 30 selected (n=7,617). In order to have the measure of viral load of the selected samples, samples 31 with Ct available were retrieved (n=1,561 samples). To warrant high quality sequences and good 32 genomic coverage, samples with Ct values >35 cycles (n=418) were excluded. Out of the 33 remaining 1,143 patients, 371 samples were selected for inclusion, according to the geographical 34 distribution of COVID-19 cases. In Supplementary Table 1, the characteristics of the 9,251 Sars-35 CoV-2 infected patients with sex, age, and residence information available, were compared with 36 the 371 selected samples. Likelihood Ratio Test, followed by a multinomial logistic regression 37 model to estimate 95% confidence intervals of odds ratios, was used to compare demographic and 38 clinical findings between general and selected SARS-CoV-2 infected populations. By looking at 39 sex, age distribution, the selected population is well representative of SARS-CoV-2 infected 40 general population at that time (at the time of writing the epidemiology is substantially different). 41 Prevalence of chronic comorbidities is also similar, with the exception of a higher prevalence of 42 cardiovascular and lung diseases in the selected population, compared to the general population 43 (33.2% vs. 24.5%, P<0.001 by Likelihood Ratio Test; and 14.2% vs. 11.3%, P=0.04 by Likelihood 44 Ratio Test, respectively). Disease severity and evidence of interstitial pneumonia were largely 45 comparable, even though a lower prevalence of critical COVID-19 cases was observed in the 46 selected population (4.3% vs. 9.6% in the general population; P=0.001 by Likelihood Ratio Test). 47 The most frequent symptom observed is fever in both populations (66.0% and 63.4%; P=0.290 by 48 Likelihood Ratio Test), followed by cough and dyspnea, whose prevalence were lower in the 49 selected population (46.0% vs. 52.2% in general population, P=0.001 by Likelihood Ratio Test; and 50 38.8% vs. 50.1% in general population, P<0.001 by Likelihood Ratio Test). The geographical 51 distribution is also comparable between general and selected populations, with the exception of 52 Milan, Pavia and Como. In this respect, it should be noted that this retrospective observational 53 study involved two major hospitals localized in Milan and Pavia. Consequently, most of the SARS-CoV-2 infected population resided in these two provinces (Milan: 31.1%; Pavia: 25.8%). In order to 55 balance the geographical distribution in relation to population density and general prevalence of 56 COVID-19 cases, patients from Milan and Pavia were under-sampled down to 20.6% and 19.2% of 57 the selected population, respectively. A higher prevalence of patients residing in Como remained in 58 the selected population in relation to the general population (19.2% vs. 8.7%, P<0.001 by 59 Likelihood Ratio Test). 60 Overall, the study population was well representative of the whole Lombardy region, with the 61 exception of the eastern part and northern valleys (i.e. Brescia, Mantua, Valtellina and 62 Valcamonica, Figure 1). 63

Homoplasies checking 66
To account for regions which might potentially be the result of hypervariability or sequencing 67 artifacts, alignment positions showing significant homoplasy were identified by a combined 68 approach. Homoplasies were firstly identified using HomoplasyFinder, and then confirmed by 69 Treetime (homoplasy setting). 1,2 In detail, MPBoot was run on the alignment to reconstruct the 70 Maximum Parsimony tree and to assess branch support following 1000 replicates (−bb 1000). The 71 resulting Maximum Parsimony treefile was used, together with the input alignment, to rapidly 72 identify homoplasies using HomoplasyFinder. 1 To obtain a set of high confidence homoplasies, we 73 then confirmed the results obtained by HomoplasyFinder using Treetime (homoplasy setting). 2 The 74 top-10 significant homoplastic positions identified by TreeTime and confirmed in HomoplasyFinder 75 were masked in the final alignment. 76

SARS-CoV-2 genome data set 77
In order to represent the global diversity of the lineages by the end of April 2020 while minimizing 78 the impact of sampling bias, 395 GISAID deposited sequences were added to the 346 consensus 79 sequences obtained by our samples. 80 The 395 GISAID sequences were selected as follows. All available whole-genome SARS-CoV-2 81 sequences (n=3244) on GISAID (gisaid.org) on 3 May 2020 were downloaded and submitted to 82 the Pangolin application. Sequences from GISAID that were error-rich, and identical sequences 83 from each country outbreak were removed. The exact date of virus collection was available for all 84 sequences except for one genome from Lithuania for which only the month of viral collection 85 (February, 2020) was available. In this case, the lack of tip date precision was accommodated by 86 sampling uniformly across a 30-day window. Finally, the dataset was reduced to 395 sequences by 87 only retaining the earliest, and the most recently sampled sequences from each country outbreak 88

Maximum likelihood tree and Bayesian interference 92
In order to explore the phylogenetic structure of SARS-CoV-2, we used both the maximum 93 likelihood (ML) and Bayesian coalescent methods. The ML phylogeny was estimated with IqTree 3 94 using the best-fit model of nucleotide substitution GTR+I. 4 Tree topology was assessed with the 95 fast bootstrapping function with 1000 replicates. The ML tree was inspected in TempEst, 5

in order 96
to define the correlation between genetic diversity (root-to-tip divergence) and time of sample 97 collection ( Supplementary Fig. 3). In order to obtain a corresponding time-scaled maximum clade 98 credibility tree, a Bayesian coalescent tree analysis was undertaken with BEAST v1.10.5, 6 using 99 the GTR+I substitution model with an exponential population growth tree prior and strict molecular