Introduction

The SARS-CoV-2 pandemic has caused substantial morbidity and mortality globally1,2. Universities have been considered conduits for transmission due to extensive social networks of young adults, many of whom live communally, and in-person teaching of large groups3. Outbreaks of SARS-CoV-2 have been observed in a number of higher education institutions, but the drivers for transmission in these settings are poorly understood4. It is speculated that infection dynamics are dependent on transmission chains involving student courses, residence, study year and social networks5. Understanding these dynamics is essential in order to devise effective infection control measures while minimising disruption to teaching, research and the mental health of students and staff6. Furthermore, while university students are less likely to develop severe COVID-19 disease, there is concern that university outbreaks could seed infections in more vulnerable populations, including staff, the local community, and upon returning home to older relatives7. Identifying possible sources of cross-transmission is therefore vital.

Although SARS-CoV-2 genome sequencing has clear utility to identify virus emergence and cryptic transmission8,9, no large-scale genomic studies in university settings have been conducted. The United Kingdom has an extensive community genomics surveillance programme through COG-UK10 which complements traditional contact tracing approaches by providing understanding of circulating viral populations.

We report the results of a genomic epidemiology study of SARS-CoV-2 across a complete term at the University of Cambridge (UoC). Importantly, these findings are from a study period prior to the established circulation of variants of concern and the availability of vaccination, with therefore fewer confounding factors. From 5 October to 6 December 2020, the UoC ran PCR-based symptomatic testing for all staff and students, and offered asymptomatic screening to 15,500 students living in university-managed accommodation. We therefore provide a unique study of SARS-CoV-2 infection that encompasses pre-symptomatic and asymptomatic students11. Positive samples from the UoC were sequenced and compared with systematic surveillance SARS-CoV-2 sequences from the local community. The results were analysed in conjunction with epidemiological data derived from the screening programme and national contact tracing. Overall, we describe introductions of SARS-CoV-2 into a higher education setting, the dynamics of transmission both within the university and between the university and the surrounding community, and the impact of local and national measures to control the spread of SARS-CoV-2 infections.

Results

In total, 972 SARS-CoV-2 cases were identified among university students and staff over the course of term (5 October to 6 December 2020). High-quality genomes were generated from 446/778 (57.3%) positive cases from the university testing programme, from 107/266 (40.2%) cases identified through the Healthcare worker (HCW) screening programme (95 HCWs, 8 students, 4 university staff) and 104 patients identified by hospital testing (71 SARS-CoV-2 positive patients from Cambridge University Hospitals (CUH) and 33 from other medical facilities in Cambridgeshire). A further 797 local cases identified by community testing during the study period were present within the COG-UK dataset, of which 17 were identified as students, 7 as university staff and 26 as HCWs (Fig. 1). Of all identified SARS-CoV-2 cases from Cambridgeshire (university and community) during this period, 8.0% were sequenced (Supplementary Fig. 1).

Fig. 1: Study cohort and available genome sequences.

*Includes 14 students identified through ad hoc asymptomatic screening conducted as part of an outbreak investigation by the University of Cambridge in conjunction with local public health authorities, responding to increased rates of infection in a block of student accommodation (described in further detail in cluster 2 below). **Includes two students associated with a single sequenced pooled sample (see supplementary methods). CUH Cambridge University Hospitals.

SARS-CoV-2 lineages and transmission clusters

Over the 9-week term, 62 Pango lineages were identified across the university and community (Fig. 2a, c). In the university, 23 Pango lineages were identified, and 438/482 (90.9%) cases were from just 4 lineages (B.1.160.7, B.1.177, B.1.36, B.1.177.16), all of which were detected by the second week of term. Twelve lineages were only observed after the second week of term and accounted for 6.9% of cases. By comparison, 57 lineages were identified in the local community over the same 9-week period. Viral genomes containing mutations in the spike protein that have been linked to decreased sensitivity to antibody-mediated immunity or that impact viral transmission were observed in the university population: three sequences from the B.1.258 lineage containing the N439K mutation and ∆H69/∆V70; two cases of B.1.1.7/alpha variant and its associated mutations12; and 88 cases of B.1.177 with the A222V mutation13. Of these, Pango lineage B.1.1.7 is most reliably associated with increased transmission14; both cases of B.1.1.7 were amongst postgraduate students with no epidemiological links, during national lockdown, and failed to transmit further within the university.

Fig. 2: Genomic diversity of SARS-CoV-2 in the university and community.

a Maximum likelihood tree showing that the majority of lineages from university isolates were distinct from community isolates. The node leaves (branch tips) show case location and global PANGO lineage is illustrated in the vertical bar. b Time-scaled coalescent tree including university members and local community isolates from study period with visible segregation between the two groups. College affiliation is shown for university members in the second set of vertical columns, highlighting the ‘top nine’ colleges by cluster 1 prevalence. c Epidemic curves demonstrating a steeper decline in SARS-CoV-2 cases in the University of Cambridge (i) compared to the local community (ii), with associated lineages. Only cases with available genomes are included. University term ran from the week commencing October 5 to the week commencing November 30. The light blue shaded area reflects a 4-week national lockdown in the UK, which was associated with a large fall in COVID-19 cases in University students. Specific lineages highlighted are the four largest lineages within the University (minimum 20 cases over the study period) and the community (minimum 50 cases over the study period). For (i), weekly individual case ascertainment for staff and students testing positive for SARS-CoV-2 through both symptomatic and asymptomatic testing pathways provided at the University of Cambridge is indicated. For (ii), weekly cases with genomes available from the local community are shown. Source data are provided as a Source Data file.

In total, 198 putative transmission clusters were defined by CIVET (https://github.com/artic-network/civet). Only 8/36 clusters with university cases contained five or more university members (range 6–337), which together represented 91.3% of all university cases, signifying that the majority of introductions into UoC did not cause ongoing transmission. To further investigate the largest of these, cluster 1 described below, we identified groups of identical samples (0 SNP differences), which produced 19 additional clusters (a total of 34 clusters with ≥2 university cases) for further analysis.
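The grouping of identical samples can be illustrated with a minimal sketch. It assumes aligned consensus genomes of equal length, skips ambiguous sites (`N` or `-`), and links samples by single linkage. The function names `snp_distance` and `zero_snp_groups` are hypothetical, and this is not the procedure implemented in CIVET or necessarily the one used in the study.

```python
from itertools import combinations

# Sites carrying these characters are treated as uninformative (an assumption).
AMBIGUOUS = {"N", "-"}


def snp_distance(a: str, b: str) -> int:
    """Count sites where two aligned genomes differ, skipping ambiguous sites."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x != y and x not in AMBIGUOUS and y not in AMBIGUOUS
    )


def zero_snp_groups(seqs: dict[str, str], min_size: int = 2) -> list[set[str]]:
    """Group samples connected by chains of 0-SNP pairs (single linkage)."""
    parent = {name: name for name in seqs}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if snp_distance(seqs[a], seqs[b]) == 0:
            parent[find(a)] = find(b)

    groups: dict[str, set[str]] = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return [g for g in groups.values() if len(g) >= min_size]


# Toy alignment (invented data): s1 and s2 are identical, s3 and s4 are not.
toy = {"s1": "ACGT", "s2": "ACGT", "s3": "ACGA", "s4": "TCGA"}
groups = zero_snp_groups(toy)
```

Because ambiguous sites are skipped, a low-coverage genome can appear identical to two genomes that differ from each other, so single linkage may merge groups that a stricter definition would keep apart.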

Determinants of viral spread across the university

To determine transmission dynamics following introduction into the university, we performed a detailed investigation of the largest genomic cluster (Cluster 1), which accounted for 337/484 (69.6%) sequenced university cases (Fig. 3). This was widely dispersed across the university by the middle of term, affecting students from 29/31 colleges, 28 undergraduate courses and 208 households in university accommodation alone (Fig. 4).

Fig. 3: Emergence and transmission of SARS-CoV-2 in a large university cluster.

a Time-scaled phylogenetic tree of largest university cluster (cluster 1) derived from the BDSKY model implemented in BEAST 2.6 (Fig. 5). The left-sided heatmap is coloured by case location, and the right-sided heatmap is coloured by student college affiliation, highlighting the top nine colleges by cluster 1 prevalence. Cluster 1 was widely dispersed across the university with limited transmission into the community. b Frequency of Lineage B.1.160.7 (to which cluster 1 belongs) in each region of the UK and the University of Cambridge. Regions are defined as ‘Nomenclature of territorial units for statistics’ (NUTS) regions, where the UK has 9 regions. It is visible that the lineage B.1.160.7 was first sequenced in Wales, and then in the neighbouring South West of England, before becoming prevalent within the University of Cambridge. The lineage remained infrequently detected in the community populating the wider surrounding region (Cambridgeshire, East Anglia, Bedfordshire and Hertfordshire, and Essex, making up East of England) throughout the university term. c A continuous transmission chain of SARS-CoV-2 infections in cluster 1 commenced with a single introduction. Relationships between individuals in cluster 1 were calculated within A2B-COVID. Colours denote potential transmission events from the donor (vertical axis) to the recipient (horizontal axis) that are consistent with transmission12 or which are borderline possibilities (yellow). The plot shows that the data are consistent with a continuous transmission chain of SARS-CoV-2 infections in cluster 1 occurring via a single introduction; there are multiple potential networks of transmission events between these individuals for which each event would be consistent with a statistical model of direct transmission. We note that individuals in this plot are ordered by the date of the first positive COVID test. Source data are provided as a Source Data file.

Fig. 4: Demographics of Cluster 1 across the first university term.

a Cumulative number of colleges involved in the cluster. Cases included in this cluster were spread across a number of colleges early in the university term. b Frequency of cases involved in the cluster by year of study. c Frequency of cases involved in the cluster by course type. Source data are provided as a Source Data file.

Cluster 1 was classified as belonging to Pango lineage B.1.160.7. No mutations previously noted to be associated with increased transmissibility were observed in this lineage compared to other genomes in the study. Interrogation of the entire COG-UK dataset of samples from 2020 showed that this lineage was first identified in the UK on 4 October 2020, in Wales, before becoming predominantly sampled in the UoC (Fig. 3b). The B.1.160.7 lineage was not identified in the local community until term week 3 (19–25 October 2020). This was supported by the median estimate of the time to the most recent common ancestor of cluster 1 and its most closely related cluster from Cambridgeshire community isolates, which was 165 days (C.I. 127–207) prior to the start of term (5 October 2020). Together, these results suggest the university cases were introduced from outside Cambridgeshire. In additional analysis with A2B-COVID15, which uses genomic data alongside timing of infection data to evaluate the plausibility of transmission between individuals, we showed that these sequences were consistent with a single introduction into the university (Fig. 3c).

National and university contact tracing data were used to identify the initial source of dispersion of this cluster. Ten students from the first two weeks of term reported visiting the same nightclub (venue A). Nine individuals either had an isolate from cluster 1 or (in the event that their sample did not yield a high-quality sequence) were household contacts of an individual with a sequenced cluster 1 isolate. No information was available for one student (Supplementary Fig. 5).

Transmission of cluster 1 was sustained from the first week of term until a national lockdown was enforced on 5th November. Students testing positive in the two weeks around lockdown reported common exposure events predominantly linked to nightclub venues (25/59 (42.4%) of exposures external to the university reported by 48 students). Venue A, identified above as the possible source of dispersion of this cluster at the start of term, was also the most common venue identified in the two weeks around lockdown (n = 16). 9/16 cases had sequences in cluster 1, and a further five individuals (where no sequence was available) were household contacts of sequenced cases in cluster 1 (Supplementary Fig. 6).

To determine the impact of lockdown and other control measures within the university, a birth-death skyline model16 was used to measure changes in the effective reproduction number (Re) within cluster 1. The model indicated an initial Re at the start of term that was slightly larger than 1, albeit with wide uncertainty (median 1.14; 95% HPD: 0.27–2.21 on 5 October). Over the next 2 weeks Re continued to rise (median 1.52; 95% HPD 0.94–2.22 on 15 October), followed by a gradual decline over the next 2 weeks (Fig. 5a). There was a rise immediately prior to the start of lockdown (median 1.55; 95% HPD 1.25–1.86 on 5 November), followed by a steep decrease thereafter (median 0.23; 95% HPD 0.07–0.41 on 19 November) (Fig. 5a), consistent with declining absolute numbers of SARS-CoV-2 infections seen during this time (Fig. 2c). The model estimated the median effective infectious period for individuals in the cluster at 3.03 days (95% HPD: 2.44–3.59 days) (Fig. 5b). As the model does not explicitly incorporate an incubation period and assumes that individuals cannot transmit after being sampled, the effective infectious period represents the mean time from infection until testing positive and assumes perfect infection control measures thereafter. Estimates of Re and the effective infectious period are robust to model parameterisations (Supplementary Figs. 8–10). Sampling proportion estimates largely overlap with empirical estimates based on the number of positive cases that were sequenced during each week (Fig. 5c). Although sampling proportion estimates are sensitive to the prior specifications, Re estimates are unaffected (Supplementary Fig. 11).
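The quantities reported above relate to the underlying birth-death rates in a simple way. In the usual birth-death skyline parameterisation, Re = λ/(μ + ψ), the effective infectious period is 1/(μ + ψ), and the sampling proportion is ψ/(μ + ψ), where λ is the transmission rate, μ the rate of becoming uninfectious without sampling and ψ the rate of sampling with removal. The sketch below converts between the two views. The class and function names are hypothetical, and the sampling proportion of 0.5 is an illustrative placeholder rather than an estimate from this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BirthDeathRates:
    birth: float     # lambda: transmissions per infected individual per day
    death: float     # mu: becoming uninfectious without being sampled, per day
    sampling: float  # psi: being sampled (and removed), per day


def rates_from_epi(
    re: float, infectious_period_days: float, sampling_proportion: float
) -> BirthDeathRates:
    """Convert (Re, effective infectious period, sampling proportion) to rates."""
    if infectious_period_days <= 0:
        raise ValueError("infectious period must be positive")
    if not 0.0 <= sampling_proportion <= 1.0:
        raise ValueError("sampling proportion must lie in [0, 1]")
    delta = 1.0 / infectious_period_days  # total become-uninfectious rate
    return BirthDeathRates(
        birth=re * delta,
        death=delta * (1.0 - sampling_proportion),
        sampling=delta * sampling_proportion,
    )


# Median estimates quoted in the text; 0.5 is an assumed sampling proportion.
pre_lockdown = rates_from_epi(1.55, 3.03, 0.5)
post_lockdown = rates_from_epi(0.23, 3.03, 0.5)
```

With an effective infectious period of 3.03 days, the pre-lockdown median Re of 1.55 corresponds to roughly 0.51 transmissions per infected individual per day, whichever sampling proportion is assumed.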

Fig. 5: Effective reproduction number and infectious period of SARS-CoV-2 from a dominant university cluster.

A 20-epoch birth-death skyline model shows the effect of local infection control measures and the national lockdown on the effective reproduction number (Re), and estimates of the mean effective infectious period as 3.03 (95% HPD = 2.44-3.59) days. a Re posterior estimates (dark shading = 50% HPD; light shading = 95% HPD). The dotted line indicates the start of term and the light blue shaded area the 4-week national lockdown in the UK, which was associated with a large fall in COVID-19 cases in University students. The red dashed line indicates Re= 1. b Effective infectious period posterior estimates (shaded region = 95% HPD; dashed line = median). c Weekly sampling proportion posterior estimates (dark shading = 50% HPD; light shading = 95% HPD). The red dashed line indicates the empirical sampling proportion estimates for each week in term (number of sequenced genomes from all University clusters divided by the number of positive tests among University staff and students). Source data are provided as a Source Data file.

Transmission within university households

There was evidence of transmission of SARS-CoV-2 in student accommodation in 18/34 university clusters. In cluster 1, 169/337 (50.1%) students had a virus genome sequence identical to at least one other student living in the same or neighbouring household (sub-clusters within 0 SNPs ranging between 2 and 11 students).

The largest cluster associated with transmission in accommodation was cluster 2 (lineage B.1.36). By term week 3, this cluster involved 30 students, of whom 24 (80%) lived in the same accommodation block in College A and 4 students lived in two separate households in the same college (Supplementary Fig. 12). Interventions from the university, supported by local public health authorities, included isolation of all households in the main accommodation block and individual screening offered to all students. Half of all cases in this cluster were diagnosed by asymptomatic screening. No further genomically related isolates were identified after term week 3, indicating a successful intervention and cessation of transmission.

To quantify the importance of household transmission, a Reed-Frost Chain Binomial Model was employed to estimate the household attack rate. Using A2B-COVID15, we identified 265 households in which the data were consistent with only 1 introduction of SARS-CoV-2. The per household contact probability that an infected person passed on the virus to an uninfected individual within the same household was estimated at 7.8% (95% C.I. 6.9–8.7%).
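A Reed-Frost chain binomial model of this kind can be sketched as follows. In each generation, every remaining susceptible escapes infection from each current infective independently with probability 1 − p, so the final outbreak size in a household has a distribution that depends only on p and household size. The code assumes one introduction per household and that only household size and final case count are observed. The function names and the toy data are invented for illustration, and the likelihood actually used in the study may differ.

```python
from math import comb, log


def final_size_pmf(n_susceptible: int, p: float, n_index: int = 1) -> list[float]:
    """P(k secondary cases) for k = 0..n_susceptible under Reed-Frost."""
    q = 1.0 - p
    pmf = [0.0] * (n_susceptible + 1)

    def walk(s: int, i: int, prob: float) -> None:
        # The chain stops when nobody is infectious or nobody is susceptible.
        if i == 0 or s == 0:
            pmf[n_susceptible - s] += prob
            return
        escape = q ** i  # chance one susceptible escapes all i infectives
        for new in range(s + 1):
            step = comb(s, new) * (1.0 - escape) ** new * escape ** (s - new)
            if step > 0.0:
                walk(s - new, new, prob * step)

    walk(n_susceptible, n_index, 1.0)
    return pmf


def log_likelihood(p: float, households: list[tuple[int, int]]) -> float:
    """households: (household size, total cases including the index case)."""
    return sum(log(final_size_pmf(size - 1, p)[cases - 1]) for size, cases in households)


def estimate_p(households: list[tuple[int, int]], grid: int = 1000) -> float:
    """Maximum-likelihood estimate of p by grid search over (0, 1)."""
    candidates = [k / grid for k in range(1, grid)]
    return max(candidates, key=lambda p: log_likelihood(p, households))


# Toy data (invented): ten two-person households, one with a secondary case.
toy_households = [(2, 1)] * 9 + [(2, 2)]
p_hat = estimate_p(toy_households)
```

For the toy data the estimate is 0.1, the observed fraction of contacts infected, which is what the model reduces to when every household has a single susceptible contact. Larger households contribute through the full chain of generations.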

Further genomic clusters where transmission between household members was implicated are outlined in Supplementary Table 1. They follow similar patterns, with groups of cases confined to a single college not leading to sustained transmission.

Other transmission routes among university members

In addition to household transmission, there was evidence of viral spread between students in the same course and year of study in 14/34 genomic clusters, with the highest proportion being students in their first year of study. In cluster 1, 203/337 (60.2%) students had an identical isolate to at least one other student studying the same course in the same year (cluster size range 2–14 students). Statistical modelling using data from cluster 1 across the term showed a bias towards infections being observed in first year students (p-value = 0.002) (Supplementary Fig. 13, model details in Supplementary Methods). Two further small clusters comprised postgraduate students working in the same university department. However, we were not able to determine the probable location of transmission in most cases: there is considerable overlap between course and household clusters, and complex social and study networks exist between students (illustrated in Supplementary Table 1, for example in clusters 3, 4 and 10). Of note, 23/34 clusters with 2 or more genomically linked cases in the dataset contained at least one university member that could not be epidemiologically linked with any other case in their cluster.

The number of SARS-CoV-2 sequences from university staff members was limited in comparison to students (n = 30). There was evidence of transmission between staff members working in the same department, college or ancillary role in four genomic clusters. Two clusters contained staff members who shared the same household. There were eight clusters involving both university staff and students. However, epidemiological associations between these two groups could only be identified in one cluster: a shared household between a student and staff member working in separate university departments.

Transmission between the university and local community

We next sought to address the degree of transmission between the university and the local community. Two distinct phylogenetic approaches, shown in Fig. 2, demonstrate segregation of the majority of community and university cases into separate clusters and therefore a lack of substantial cross-transmission. 29/198 (14.6%) transmission clusters contained both university and community cases. Only six clusters contained five or more university cases and included three or more community cases.

To identify transmission clusters involving university and hospital (patient and healthcare worker) cases, we ran CIVET (https://github.com/artic-network/civet) separately with these cases for a focused phylogenetic analysis of this setting. Associations were identified between the university and hospital settings, with 17 clusters involving both university members and either patients or staff. Cluster 1 (69.6% of student cases) contained only 1 patient and 1 healthcare worker, neither of whom had an identifiable epidemiological link to students. The remaining 16 clusters comprised 133 individuals, including 26 patients, 55 hospital staff or their family members and 52 university members (including 18 staff and 15 clinical medical students). The second-largest cluster of university members (n = 21 university and hospital cases) included nine medical students, five healthcare workers and two patients. Phylogenetically, the medical students and one of the healthcare workers were closely linked (Supplementary Fig. 14) and analysis of these cases with A2B-COVID15 confirmed the plausibility of transmission. All nine medical students were on clinical rotations at the time of diagnosis of the index case; 7/9 lived in neighbouring households in the same college and the remaining two were named contacts of the index student. Plausible transmission events between this group and the other cluster members were refuted using A2B-COVID (Supplementary Fig. 14).

To further investigate epidemiological associations in clusters involving university members and the local community, 1243/1455 of the cases sequenced over the sampling period were linked to national contact tracing data (excluding hospital cases). Of these, 219 (17.6%) cases reported 127 common exposure events. Cluster 1, representing 69.6% of cases within the university, included only 17/976 (1.7%) community cases; only one community case had a common exposure with a university student, dining at the same restaurant. No other epidemiological links were identified in any other genomic cluster. Transmission suspected in 19 epidemiologically linked clusters defined by common exposures was refuted by phylogenetic variation (i.e. the cases were placed in separate transmission clusters as defined by CIVET).

Discussion

We report the first comprehensive and integrated epidemiological and genomic analysis of SARS-CoV-2 transmission in a higher education setting. Following a limited number of introductions, the majority of cases were linked to a single genetic cluster, which was likely to have dispersed across the university following multiple social gatherings at a nightclub. There was considerable transmission associated with student accommodation and student courses, but minimal evidence of transmission within departments, or between students and staff. We observe that the great majority of transmissions occur either within the university or within the local community, rather than between them. Finally, we present evidence demonstrating the efficacy of university measures and national lockdown in reducing COVID-19 cases.

Nearly 70% of all university cases belonged to one genetic cluster (cluster 1), introduced into the UoC by the arrival of students and likely forming a single transmission chain. A nightclub was implicated as an important transmission event at the start of term and again prior to lockdown. This corroborates previous studies identifying such venues as a risk factor for substantial SARS-CoV-2 transmission17,18. We urge a cautious approach to the access of such venues during a SARS-CoV-2 pandemic, particularly in the context of a young susceptible student population.

Our data suggest a substantial change in case numbers and the effective reproduction number over the course of the term. This likely reflects a combination of changes in student behaviour and effective interventions to reduce transmission. Overall, we note that incidence and the effective reproduction number within the university are lower than in other higher education settings and the general UK young adult population during the study period19. We highlight a limited number of introductions and low lineage diversity in the university compared to the surrounding community. While the natural extinction of lineages is relatively common20, multiple genetically diverse clusters may be expected given the congregation of students from across the globe (international students make up 35% of students in college accommodation)11. The lack of diversity may reflect the impact of robust and widely implemented university infection control measures maintained throughout the term. Full details are provided in the Supplementary Materials; the measures included social distancing, mask wearing and quarantine of international students at the beginning of term.

There was an initial rise in cases over the first two weeks, coinciding with the first week of term and university Freshers' week. This is known to be a period of more intense social mixing between students in venues both inside and outside university premises. Between term weeks three and five there was a fall in the effective reproduction number, which coincides with both a reduction in social mixing and the identification of, and subsequent university measures to control, transmission events identified in college residences. In multiple clusters, transmission in student households was successfully interrupted through a combination of measures provided by the university, including rapid case identification through asymptomatic screening, readily available symptomatic testing, contact tracing and comprehensive support provided by colleges for cases and their contacts while in isolation. Further details, including the elaboration of the specific measures to control cluster 2, an outbreak associated with a large accommodation block described above, are provided in the Supplementary Materials. Although we have demonstrated that transmission between students in the same accommodation block is an important factor in the spread of SARS-CoV-2, we report a lower secondary household attack rate (7.8%) than that identified in domestic households (16.6–21.1%) and a lower than expected effective infectious period (3.0 days)21.

University measures may have been less successful in controlling transmission in settings outside colleges. There was a rise in the effective reproduction number coinciding with the announcement of a national lockdown on 31 October, to begin on 5 November 2020. This announcement prior to implementation of a major socially restrictive public health measure, alongside existing Halloween festivities, may have led to increased levels of behaviour associated with a higher risk of transmission. This supports either reducing the time from announcement to implementation of socially restrictive measures, or the need for a targeted public health campaign to limit high-risk activities where this is not possible. In addition, having identified considerable transmission between students on the same course, we suggest that further mitigation of viral spread may be obtained by implementing shared student accommodation based on university courses.

The national lockdown dramatically reduced case numbers within the university, at a faster rate than the local community, demonstrating high levels of compliance from our study population with an effective control strategy. Contemporary studies conducted elsewhere in the UK have demonstrated that adherence to COVID-19 prevention measures, such as national lockdown, is mixed22. Although young age is a risk factor for poor adherence, other associations are less common within the university population, such as having a dependent child in the household, financial hardship and working in a key sector. Although no direct incentives were provided to students, the expectation of individuals to adhere to rules was communicated widely in both national and university media. We also believe that the key to the successful implementation of lockdown was the additional support provided by the collegiate university, ranging from the practical provision of food and drink through to the pastoral and community support provided by established networks of staff, tutors and student representatives.

Finally, we observed limited transmission between the university and the local community. The largest university cluster, accounting for the majority of student infections, was largely phylogenetically distinct from community cases. Further, epidemiological evidence describing common exposures for community and university cases was sparse. However, clinical medical students were disproportionately represented within community clusters. This is an important epidemiological link between secondary care and the university; we highlight this group as being at-risk for both acquisition and transmission of SARS-CoV-2 and medical students should therefore be prioritised for interventions such as vaccination.

A combination of contact tracing and genomics was instrumental to understanding transmission within the university and with its surrounding population; notably in refuting transmission within epidemiologically linked clusters. We advocate for a combined genomic epidemiological approach to inform outbreak investigations as used in other settings8,23.

This study has a number of limitations. Incomplete sampling and subsequent sequence filtering in both the university and community should be considered when interpreting transmission; the asymptomatic and active case ascertainment in this study should mitigate this discrepancy. The lower community case ascertainment may result in unobserved transmission chains (such as those when assessing the introduction of Pango lineage B.1.160.7 into the university). Further, epidemiological links are dependent on self-reporting and therefore some data will be missing; whilst a lack of epidemiological association between groups in clusters is important and reassuring (such as between staff and students), it does not confirm a lack of transmission. We highlight shared student courses as a risk factor for transmission; this does not take into account the setting of transmission, i.e., during educational or social activities. Finally, the UoC is distinct in its collegiate structure with limited integration with the community; any generalisation of conclusions should be tempered by the study setting.

We present the first comprehensive integrated epidemiological and genomic evaluation of transmission of SARS-CoV-2 within a university. The insights gained will inform public policy regarding infection control measures in higher education settings. We find containment of transmission in student accommodation necessary to mitigate onward propagation. We highlight the importance of targeted public health measures towards nightclub venues to limit transmission. Critically, these findings are likely to be informative for future pandemic preparedness.

Methods

Ethics

The COG-UK study protocol was approved by the Public Health England Research Ethics Governance Group (reference: R&D NR0195). Public Health England affiliated authors had access to identifiable Cambridgeshire community case data. These data were processed under Regulation 3 of The Health Service (Control of Patient Information) Regulations 2002, which permits the processing of confidential patient information for communicable disease and other risks to public health; as such, individual patient consent is not required. Other authors only had access to anonymised or summarised data. Ethical approval for the UoC asymptomatic COVID-19 screening programme was granted by the UoC Human Biology Research Ethics Committee (HBREC.2020.35) with informed consent gained from participants.

Study setting

The UoC has ~23,000 students and 12,600 staff. The university is divided into 31 colleges and 150 departments, faculties and other institutions. Students belong to a college community, as well as being members of the university and an academic faculty/department. Colleges provide residential accommodation for approximately two thirds of students, either on campuses or in off-site housing, and offer social and sports activities, pastoral and academic support for each individual24. All colleges have membership from students across multiple courses. The university is based in the City of Cambridge25 (estimated population 123,900), in the county of Cambridgeshire26 (estimated population 855,796 in 2019) in the East of England.

Participants and samples

Samples were derived from university symptomatic testing and asymptomatic COVID-19 screening programmes between 5 October 2020 and 6 December 2020, covering the full term. Testing for all symptomatic students and staff was available on weekdays. The asymptomatic screening programme has been described in detail elsewhere11. In brief, screening was offered on a voluntary basis to all students residing in accommodation owned or managed by a college or the Cambridge Theological Federation. In total, 15,561 students were eligible to participate. To optimise testing efficiency, multiple swabs were pooled into the same tube of viral transport medium at the time of sample collection. Testing pools varied in size from 1 to 10 students, with each devised to include one or more student households as far as possible11. In this study, households are defined as individuals who share a kitchen, bathroom and/or lounge facilities. The members of any pool testing positive were re-tested individually by PCR to confirm the result and identify the positive subject(s) (see Supplementary Methods for further details, including infection prevention and control measures). Only samples from individuals confirmed positive on re-testing were used for sequencing.
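The two-stage design described above (one pooled tube per group, then individual confirmatory PCR for every member of a positive pool) can be illustrated with a minimal sketch. The function name, the example pools and the assumption that a pool is positive whenever it contains at least one positive swab are all illustrative and are not taken from the screening programme's own software.

```python
def count_tests(pools, positive_ids):
    """Count PCR tests needed for one round of two-stage pooled screening.

    pools: list of lists of student identifiers (one list per pooled tube).
    positive_ids: set of identifiers that would test positive individually.
    Assumption: a pool tests positive if it holds at least one positive swab.
    """
    pooled_tests = len(pools)
    # Every member of a positive pool receives an individual confirmatory test.
    confirmatory_tests = sum(
        len(pool) for pool in pools
        if any(student in positive_ids for student in pool)
    )
    return pooled_tests + confirmatory_tests


# Hypothetical example: three households pooled separately, one infected student.
example_pools = [["a", "b", "c", "d"], ["e", "f"], ["g", "h", "i"]]
total = count_tests(example_pools, {"f"})
```

In this toy round, nine students are screened with three pooled tests plus two confirmatory tests, rather than nine individual tests.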

SARS-CoV-2 strains circulating in the local community were identified from the COG-UK dataset for Cambridgeshire. These data were derived from local community samples from non-hospitalised, symptomatic individuals, who requested a free diagnostic test via national community testing. Other samples were derived from patients treated at three Cambridgeshire hospital trusts: Cambridge University Hospitals NHS Foundation Trust (a teaching hospital providing secondary care services for Cambridge and the surrounding area as well as tertiary referral services for the East of England and surge capacity for COVID-19); Royal Papworth Hospital NHS Foundation Trust (specialist heart and lung hospital, also providing surge capacity for COVID-19); Cambridgeshire and Peterborough NHS Foundation Trust (provider of community, mental health and learning disability services in Cambridgeshire). Hospital samples were obtained from both asymptomatic screening and patients exhibiting COVID-19 symptoms. Finally, samples were derived from the asymptomatic healthcare worker (HCW) screening programme at Cambridge University Hospitals27.

Sequencing

Positive samples from UoC testing with a PCR cycle threshold value ≤33 were selected and sequenced using the GridION platform (Oxford Nanopore). All Cambridgeshire samples sequenced between 24 September and 21 December 2020 were included to overlap with the university term. Samples from the local Cambridgeshire community and hospital cases (described above) were collected as part of national SARS-CoV-2 testing, and sequenced at one of seventeen COG-UK sequencing sites (further details in Supplementary Methods). The samples were prepared using either the ARTIC28 or veSeq29 protocols, and were sequenced using Illumina or Oxford Nanopore platforms. Genomic data were filtered to exclude sequences with >5% Ns and those of spuriously low file sizes (<29 KB). Genomes were aligned with minimap230 to the Wuhan Hu-1 reference genome (MN908947.3), collected December 2019. All samples were processed through COVID-CLIMB pipelines31,32. Protocols are available at https://github.com/COG-UK.
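As an illustration of the ambiguity filter, the following minimal sketch computes the fraction of N calls in a consensus genome and keeps sequences at or below the 5% threshold. The function names and toy sequences are hypothetical, the file-size filter is not shown, and this is not the COVID-CLIMB pipeline code.

```python
def n_fraction(seq):
    """Fraction of ambiguous 'N' calls in a consensus genome."""
    seq = seq.upper()
    # An empty sequence is treated as fully ambiguous so that it is excluded.
    return seq.count("N") / len(seq) if seq else 1.0


def passes_qc(seq, max_n_fraction=0.05):
    """Keep genomes with no more than 5% Ns (threshold taken from the text)."""
    return n_fraction(seq) <= max_n_fraction


# Toy sequences (far shorter than a real ~30 kb genome), for illustration only.
genomes = {
    "sample_1": "ACGT" * 25,   # no Ns
    "sample_2": "ACGTN" * 20,  # 20% Ns
}
kept = [name for name, seq in genomes.items() if passes_qc(seq)]
```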

Phylogenetic analysis

Maximum likelihood phylogenetic trees were estimated using IQ-TREE (version 2.1.2 COVID-edition)33 and rooted using Wuhan Hu-1 (MN908947.3) as an outgroup. Trees were constructed using the GTR + Γ substitution model34, as determined by ModelFinder35. Branch support statistics were generated using the ultrafast bootstrap method36. TempEst37 was used to explore the temporal signal in the data. Trees were visualised, explored, and labelled with associated metadata using Microreact38 to identify epidemiological links supported by the genomic data. Specified mutations were identified using type_variants (https://github.com/cov-ert/type_variants). Possible transmission clusters were defined by extracting phylogenetic neighbourhoods identified using the CIVET tool (version 2.1.0) on 11 January 2021 (https://github.com/artic-network/civet). In selected clusters, further evaluation was conducted using A2B-COVID15. A2B-COVID evaluates data from individuals in a pairwise manner. Using viral genome sequences from two individuals, alongside data describing the timing of infection, it evaluates whether or not these data are consistent with a hypothesis that SARS-CoV-2 was transmitted directly from one individual to the other; data from each pair are described as being either consistent, borderline, or unlikely to have been observed given this hypothesis. Where indicated, collapsed nodes from trees generated from CIVET were inspected to visualise data in the context of the COG-UK national database (https://www.cogconsortium.uk/). For further evaluation of transmission in the largest cluster identified by CIVET, pairwise SNP differences between sequences were determined using snp-dists (https://github.com/tseemann/snp-dists/releases/tag/v0.7.0).
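The pairwise SNP comparison can be sketched as follows. This is not the snp-dists implementation; it is a minimal illustration that assumes the sequences are already aligned to equal length and that only positions where both sequences carry an unambiguous A, C, G or T are compared. The sample names and sequences are hypothetical.

```python
from itertools import combinations

UNAMBIGUOUS = set("ACGT")


def snp_distance(seq_a, seq_b):
    """Count differing sites, skipping gaps, Ns and other ambiguity codes."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in UNAMBIGUOUS and b in UNAMBIGUOUS and a != b
    )


def pairwise_snp_distances(alignment):
    """Return {(name_1, name_2): distance} for every unordered pair."""
    return {
        (n1, n2): snp_distance(alignment[n1], alignment[n2])
        for n1, n2 in combinations(sorted(alignment), 2)
    }


# Toy alignment, for illustration only.
toy_alignment = {
    "uni_1": "ACGTACGT",
    "uni_2": "ACGTACGA",
    "community_1": "ACNTTCGA",
}
distances = pairwise_snp_distances(toy_alignment)
```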

Lineages

Global Pango Lineages39 were assigned to each genome using Pangolin (https://github.com/cov-lineages/pangolin/releases/tag/v2.1.6) with analyses performed on COVID-CLIMB32 (further details in Supplementary Methods).

Molecular clock and phylodynamic analyses

BEAST v1.10.440 was used to perform a time-scaled phylogenetic analysis using an exponential growth coalescent tree prior and a GTR + Γ substitution model, including all university and community high-quality genomes from the study period. As there was a lack of clear temporal signal in our dataset due to the relatively short time period analysed, the substitution rate was fixed to 8 × 10⁻⁴ substitutions per site per year (s/s/y) under a strict clock model, in line with previous SARS-CoV-2 analyses13,41,42,43,44. Two chains of 100 million iterations were run independently to ensure convergence to the correct posterior distribution. Convergence was assessed using Tracer45, and 10% of states were removed to account for burn-in. Finally, a maximum clade credibility (MCC) tree was generated using TreeAnnotator.
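Simple arithmetic on the fixed clock rate helps to show why a short sampling window carries little temporal signal. The sketch below assumes a genome length of 29,903 nucleotides (the length of the MN908947.3 reference) and uses the term dates given in the text; it is an illustration only and is not part of the BEAST analysis.

```python
from datetime import date

CLOCK_RATE = 8e-4        # substitutions per site per year (from the text)
GENOME_LENGTH = 29903    # assumed: length of the MN908947.3 reference genome


def expected_substitutions(days):
    """Expected substitutions per genome along one lineage over `days` days."""
    return CLOCK_RATE * GENOME_LENGTH * days / 365.0


term_days = (date(2020, 12, 6) - date(2020, 10, 5)).days
per_year = expected_substitutions(365)
over_term = expected_substitutions(term_days)
```

Under these assumptions a lineage accumulates about 24 substitutions per year, or roughly four over the term, so genomes sampled a few weeks apart are expected to differ by only a handful of sites.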

To estimate the effective reproduction number (Re) and infectious period of SARS-CoV-2 over the term, a dominant clade (representing 69.6% of all university genomes) was selected and all community genome sequences that cluster with it incorporated, resulting in a total of 354 genomes. A Bayesian birth-death skyline (BDSKY) model16 was employed using BEAST v2.646. A GTR + Γ substitution model was used along with a strict clock model, placing a lognormal prior with mean 8 × 10⁻⁴ s/s/y (in real space) and standard deviation 0.1 on the clock rate. A lognormal prior with mean 0 and standard deviation 1 was placed on Re and a Beta prior with α = 5 and β = 5 was placed on the sampling proportion. Re was parameterised into 20 epochs, equidistantly spaced between the origin time and the most recent sequence collection date. The sampling proportion was fixed to 0 before the first week of term and estimated for each week thereafter. The rate at which infected patients become non-infectious was assumed to be constant and a lognormal prior with mean 48.7 year⁻¹ (in real space) and standard deviation 0.25 was placed on it, resulting in a prior mean effective infectious period between ~5 and ~15 days. To test the robustness of the posterior estimates, different parameterisations were used for Re and the sampling proportion, and the sampling proportion prior was varied. Further details are provided in the Supplementary Methods. To test the robustness of posterior estimates to the clock rate prior, all analyses were repeated using a lognormal prior with mean 1 × 10⁻³ s/s/y (in real space) and standard deviation 0.1 on the clock rate. Finally, to test the assumption of a strict clock model, analyses were repeated using an uncorrelated lognormally distributed relaxed clock model47. In these analyses the 95% HPD interval of the coefficient of variation of the clock rate did not exclude 0, indicating poor support for a relaxed clock model in this dataset. Furthermore, estimates of the BDSKY model parameters did not differ significantly from estimates under a strict clock model. Therefore, we only show results under a strict clock model. For all models, three chains of 200 million iterations were run independently. Convergence was assessed using the R package coda48, and 10% of states were removed to account for burn-in. MCC trees were generated using TreeAnnotator.
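The relationship between the prior on the become-non-infectious rate and the implied infectious period can be sketched as below. The sketch assumes that a lognormal prior specified by its mean M in real space and log-scale standard deviation S has log-scale mean ln(M) − S²/2, and that the infectious period is the reciprocal of the rate. The choice of a central 95% prior interval is an assumption made here for illustration; the text does not state which interval underlies the quoted range.

```python
import math
from statistics import NormalDist


def lognormal_quantile(real_mean, sigma, q):
    """Quantile of a lognormal specified by its real-space mean and log-scale sd."""
    mu = math.log(real_mean) - sigma ** 2 / 2.0   # log-scale mean
    return math.exp(mu + sigma * NormalDist().inv_cdf(q))


def infectious_period_days(rate_per_year):
    """Convert a become-non-infectious rate (per year) to a period in days."""
    return 365.0 / rate_per_year


RATE_MEAN = 48.7   # per year, real space (from the text)
RATE_SD = 0.25     # log-scale standard deviation (from the text)

median_days = infectious_period_days(lognormal_quantile(RATE_MEAN, RATE_SD, 0.5))
# A high rate corresponds to a short infectious period, and vice versa.
short_days = infectious_period_days(lognormal_quantile(RATE_MEAN, RATE_SD, 0.975))
long_days = infectious_period_days(lognormal_quantile(RATE_MEAN, RATE_SD, 0.025))
```

Under these assumptions the prior median infectious period is a little under eight days.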

Household attack rates

A2B-COVID15 was used to exclude households for which the sequence and epidemiological data were inconsistent with a single viral introduction to the household. A chain binomial model was then used to estimate the probability that an infected person transmitted the virus to an uninfected person within the same household (further details in Supplementary Methods).
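The exact model formulation is given in the Supplementary Methods and is not reproduced here. As a hedged illustration of the general approach, the sketch below assumes the classic Reed-Frost form of a chain binomial model, with a single introduction per household and a per-pair, per-generation transmission probability p, and estimates p by a simple grid search over the likelihood of observed household final sizes. The household data and function names are hypothetical.

```python
from math import comb, log


def final_size_dist(susceptible, infectious, p):
    """Reed-Frost distribution of the number of further household infections."""
    if susceptible == 0 or infectious == 0:
        return {0: 1.0}
    escape = (1.0 - p) ** infectious   # probability of escaping all infectives
    dist = {}
    for new in range(susceptible + 1):
        prob = comb(susceptible, new) * (1 - escape) ** new * escape ** (susceptible - new)
        # Recurse on the next generation: `new` infectives, fewer susceptibles.
        for later, sub_prob in final_size_dist(susceptible - new, new, p).items():
            dist[new + later] = dist.get(new + later, 0.0) + prob * sub_prob
    return dist


def log_likelihood(p, households):
    """households: list of (household_size, total_infected), one introduction each."""
    return sum(
        log(final_size_dist(size - 1, 1, p)[infected - 1])
        for size, infected in households
    )


# Hypothetical households: (size, number infected including the index case).
example_households = [(4, 1), (4, 2), (6, 1), (3, 3)]
grid = [i / 100 for i in range(1, 100)]
p_hat = max(grid, key=lambda p: log_likelihood(p, example_households))
```

A grid search is used only to keep the example dependency-free; a numerical optimiser or a Bayesian fit would be the more usual choice in practice.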

Epidemiological data

University student demographic data were derived from the UoC student electronic record system CamSIS, and household structure and membership data from the UoC asymptomatic screening programme. To identify university affiliated cases (students and staff) and hospital staff accessing the national SARS-CoV-2 testing service, Second Generation Surveillance System (SGSS) data and contact-tracing data provided by NHS Test and Trace (T&T) were interrogated. Epidemiologically linked common exposures for students, university staff and the local community were identified through T&T data. Common exposures were defined by T&T as locations or events that two or more people testing positive for COVID-19 visited in the same two-to-seven-day period before symptom onset or positive test. Additional contact tracing information was also provided by the UoC COVID helpdesk. These data were compared with observed phylogenetic clusters to identify potential sources of transmission and determine the extent of transmission between the university and community.
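One possible reading of the common exposure definition can be sketched as follows. The T&T algorithm itself is not described in the text, so the interpretation used here is an assumption: a venue is flagged when at least two cases each visited it between two and seven days before their own reference date (symptom onset or positive test), without requiring the visits to fall on the same day. The function name, case identifiers and venues are hypothetical.

```python
from collections import defaultdict
from datetime import date


def common_exposures(cases, visits, min_days=2, max_days=7, min_cases=2):
    """Flag venues shared by several cases within their exposure windows.

    cases: {case_id: reference_date} (symptom onset or positive test date).
    visits: iterable of (case_id, venue, visit_date).
    """
    venue_cases = defaultdict(set)
    for case_id, venue, visit_date in visits:
        reference = cases.get(case_id)
        if reference is None:
            continue  # visit belongs to someone who is not a confirmed case
        days_before = (reference - visit_date).days
        if min_days <= days_before <= max_days:
            venue_cases[venue].add(case_id)
    return {
        venue: sorted(ids)
        for venue, ids in venue_cases.items()
        if len(ids) >= min_cases
    }


# Hypothetical data, for illustration only.
example_cases = {
    "c1": date(2020, 10, 20),
    "c2": date(2020, 10, 21),
    "c3": date(2020, 10, 25),
}
example_visits = [
    ("c1", "venue_A", date(2020, 10, 16)),  # 4 days before reference date
    ("c2", "venue_A", date(2020, 10, 16)),  # 5 days before
    ("c3", "venue_A", date(2020, 10, 24)),  # 1 day before: outside window
    ("c3", "venue_B", date(2020, 10, 20)),  # in window, but only one case
]
flagged = common_exposures(example_cases, example_visits)
```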

Epidemiological data from UoC were initially compiled in Microsoft Azure SQL and Excel 2013 (Microsoft) and analysed in Stata 14.2 (StataCorp, College Station, TX, USA). Further data manipulation, statistical analysis and figure generation were undertaken with RStudio (version 1.3.1093) using R (version 4.0.2). Network diagrams were produced with the R package igraph (v1.2.6).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.