Introduction

Invasive cervical cancer (ICC) is the fourth most common cancer worldwide1, and virtually all cases are caused by an infection with one of the 13 high-risk (HR) human papillomavirus (HPV) types2. The natural history of HPV leading to ICC is well-established, mostly based on HPV16 and squamous cell carcinoma (SCC), and is characterized by a multistage disease model that starts with an HPV infection that remains persistently detectable over time when not controlled by the immune system2. These persistent infections often lead to the development of precancerous lesions that grow within the epithelium, often for years, and can eventually invade the surrounding tissue to become ICC2. The natural history of adenocarcinoma (ADC), the second most common histologic subtype, remains poorly understood.

The Cancer Genome Atlas (TCGA) project has identified driver mutations that presumably lead to ICC3,4. However, ICC is a heterogeneous disease with distinct somatic mutation spectra related to SCC and ADC histologies4. Recurrent somatic mutations in PIK3CA, FBXW7, MAPK1, PTEN, EP300, NFE2L2, CASP8, STK11, HLA-A, and HLA-B are enriched in SCC, while in ADC, ELF3, CBFB, KRAS and ARID1A are enriched4,5. One study also noted that the epigenomic and transcriptomic landscape of ICC differed by HPV species groups (Alpha 9 vs. Alpha 7)6. ICC is additionally enriched with somatic mutations induced by the off-target activity of APOBEC3 enzymes, which induce C to T or C to G changes at specific trinucleotide motifs (5′-TCW-3′, where W is A or T) in response to the exogenous DNA from HPV infection7,8,9,10. The two most frequent mutational hotspots in ICC, E542K and E545K in PIK3CA, are linked to APOBEC3 activity4,5,11. If the intended anti-viral activity of APOBEC3 does not lead to viral clearance, it is postulated that the off-target somatic mutations may instead help drive progression to precancer and cancer12, although it is still unclear what triggers or promotes off-target APOBEC3 activity. The established cervical carcinogenic model, with a well-defined initiation event of HPV infection, represents a valuable opportunity to investigate when in the multi-step model somatic mutations arise and which of them drive carcinogenesis.
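As a minimal illustration of the motif definition above, the following Python sketch checks whether a single-base substitution matches the APOBEC3-associated pattern (C to T or C to G within 5′-TCW-3′, W being A or T). The function name `is_apobec3_motif` and its inputs are hypothetical. Reading a G change on the opposite strand as a C change is an assumption made here for illustration, not a procedure stated in this section.

```python
# Watson-Crick complements, used to read a G change on the opposite strand.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def is_apobec3_motif(context, alt):
    """Check a substitution against the 5'-TCW-3' pattern (W = A or T).

    context: reference trinucleotide, 5'->3', mutated base in the middle.
    alt:     the base that replaces the middle base.
    """
    context, alt = context.upper(), alt.upper()
    if len(context) != 3 or any(b not in COMPLEMENT for b in context + alt):
        raise ValueError("expected a 3-base context and a 1-base alt (A/C/G/T)")
    if context[1] == "G":
        # Assumption: a G change is evaluated as a C change on the other strand.
        context, alt = reverse_complement(context), COMPLEMENT[alt]
    return (
        context[0] == "T"
        and context[1] == "C"
        and context[2] in "AT"
        and alt in "TG"
    )
```

For example, `is_apobec3_motif("TCA", "T")` is true, while `is_apobec3_motif("TCG", "T")` is false because G is not an allowed third base.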

Different HR-HPV types, defined by ≥ 10% DNA sequence difference in the viral L1 gene13,14, are linked to profound differences in both risk and prevalence of ICC and its histological subtypes. The three most common HPV types detected in ICC worldwide are HPV16, HPV18 and HPV4515. HPV18 and HPV45 are genetically similar, with 74% sequence homology, and both are part of the Alpha 7 species group, while HPV16 is more genetically distant, with 52% sequence homology to both HPV18 and HPV45, and is part of the Alpha 9 species group2,16. Within each HR-HPV type there are lineages and sublineages, defined by 0.5–9% DNA sequence difference, and even finer genetic variants, that have been further linked to differences in precancer/cancer risk and lesion histology17,18,19. HPV16 is the most common cause of all ICC worldwide, including 62% of the SCC and 56% of the ADC, while HPV18 and HPV45 combined are relatively more common among ADC (43%) than among SCC (15%)15,20,21. Somatic mutations in PIK3CA are more frequent in SCC compared to ADC and in HPV16-positive tumors compared to HPV18 and HPV45 tumors11. It is unknown if the different mutation patterns observed across tumors are primarily linked to differences in histology or to the different associated HPV types and lineages/variants, or both.

In this study we took advantage of the Persistence and Progression (PaP) cohort, which collected residual exfoliated cervical cell samples from women routinely screened for cervical cancer precursors at Kaiser Permanente Northern California (KPNC), to investigate host somatic hotspot mutations (i.e., previously reported cancer driver mutations) by deep targeted sequencing (average coverage 820x). We utilized these residual exfoliated cervical cell samples to evaluate important driver mutations, not only in cancer samples, but also in both precancers and transient HPV infections (<cervical intraepithelial neoplasia grade 2 [CIN2] or subsequently cleared infections) at a very low variant allele fraction (VAF). We also evaluated differences in somatic mutations associated with different HPV types and APOBEC3.

Results

Somatic hotspot mutations are detected only in HPV-positive exfoliated cervical cells

We detected previously reported cervical cancer driver mutations among 3351 HPV-positive exfoliated cervical cell samples. A total of 3192 nucleotide loci were detected with one or more mutations (i.e., mutated sites) after quality control and somatic mutation filters in these single time-point samples (one sample per woman; Table S1), including 27 TIER1 mutated sites, 165 TIER2, 1236 TIER3, and 1764 passenger mutated sites (Fig. 1, Table 1, Supplementary Data 2). The VAF of mutations detected at TIER1, TIER2, TIER3 and passenger classified sites are illustrated in Fig. S2; TIER3 and passenger mutations likely included rare germline mutations with a high VAF (>0.50). Therefore, we focused on TIER1 and TIER2 mutations, and detected 176 and 784 total hotspot mutations at these TIER1 and TIER2 sites in 14 of the 20 genes sequenced (Fig. 2a, Table 1). Mutated sites in PIK3CA, FBXW7 and KRAS were most common in TIER1, while mutated sites in TP53 and PTEN were most common in TIER2 (Fig. 2a). Twenty-two percent of the TIER1 mutated sites (6 of 27 sites) and 9.1% of the TIER2 sites (15 of 165) were APOBEC3-associated mutations. In our 144 ICC samples, somatic hotspot mutations were detected at 55% of the TIER1 sites and 15% of TIER2 sites.
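The site counts and percentages in the paragraph above can be cross-checked with a few lines of arithmetic. This Python sketch uses only the numbers quoted in the text; the variable names are illustrative.

```python
# Mutated-site counts per class, as reported in the text above.
sites = {"TIER1": 27, "TIER2": 165, "TIER3": 1236, "passenger": 1764}

# APOBEC3-associated mutated sites per TIER, also as reported.
apobec3_sites = {"TIER1": 6, "TIER2": 15}

# Total number of mutated sites across all four classes.
total_sites = sum(sites.values())

# Percentage of sites in each TIER that match an APOBEC3 motif.
apobec3_pct = {
    tier: round(100 * n / sites[tier], 1) for tier, n in apobec3_sites.items()
}

print(total_sites)   # 3192
print(apobec3_pct)   # {'TIER1': 22.2, 'TIER2': 9.1}
```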

Fig. 1: Workflow of mutation filters and the TIER classification scheme.

Footnote: Samples from single time-point analyses only. aa = amino acid.

Table 1 Distribution of mutated sites by gene and by TIER classification
Fig. 2: Distribution of hotspot mutations by gene and TIER classification.

a Distribution of mutated sites classified in each TIER. b Proportion of mutations in each gene by the total number of mutations in each TIER, by status. c Proportion of mutations in each gene by the total number of mutations in each TIER, by HPV type. Mutations from multiple HPV16/18/45 type co-infections were excluded. Footnote: CIN3 Cervical intraepithelial neoplasia grade 3, AIS adenocarcinoma in situ. P-values were estimated with two-sided Fisher’s Exact tests for count data with simulated p-values (based on 2000 replicates). Source data are provided as a Source Data file.

Only the hotspot mutations at TIER1 sites were distinctly distributed across disease status groups (Fig. 2b). Specifically, hotspot mutations were detected in 33.3% of SCC, 21.4% of ADC, 4.5% of CIN3, 10.2% of AIS, 3.1% of CIN2, and 2.6% of controls. Of the hotspot mutations detected in the ICC samples, the most common was PIK3CA E545K (15.2% of all ICC), both for each histology (11.4% of ADC, 20.6% of SCC) and by HPV type (16.5% of HPV16-positive ICC, 11.1% of HPV18-positive, 9.1% of HPV45-positive). Hotspot mutations FBXW7 R505G (2.9%) and STK11 c.290+1G>A (2.9%) were the next most common in ADC, while for SCC, the next most common mutations were EP300 D1399N (4.8%) and MAPK1 E322K (4.8%) (Fig. 3).

Fig. 3: Frequency of individual hotspot mutations in the ICC cases by HPV type and histology.

Footnote: SCC squamous cell carcinoma, ADC adenocarcinoma, Cancer* = unknown histology. Somatic mutation distribution for 141 total cancers, HPV16-positive or HPV18/45-positive, using single time point samples within 2 years of diagnosis. Source data are provided as a Source Data file.

We further evaluated whether somatic hotspot mutations were present in 32 HPV-negative exfoliated cervical cell samples. No somatic hotspot mutations were detected, and only one presumed germline heterozygous TIER2 mutation was found (TP53 R175C) with a VAF of 0.53.

Frequency of hotspot mutations differs by HPV type

First, we evaluated the distribution of hotspot mutations among single HPV16, HPV18, and HPV45 infections only (i.e., HPV co-infected samples were excluded). Among hotspot mutations, the distribution of TIER1 mutations was significantly different between HPV16-, HPV18-, and HPV45-positive samples (p = 0.01; Fig. 2c). In particular, PIK3CA mutations were more common in HPV16-positive samples (47.2% of mutations) compared to both HPV18-positive (33.3%) and HPV45-positive (26.7%) samples; this pattern was consistent for both SCC and ADC. In contrast, FBXW7 mutations and KRAS mutations were less common in the HPV16-positive samples (FBXW7: HPV16 19.2% vs. HPV18 25.0% and HPV45 26.7% of mutations; KRAS: HPV16 4.8% vs. HPV18 29.2% and HPV45 6.7% of mutations). The distributions of mutations in HPV18- and HPV45-positive samples were more similar to each other (p = 0.21) than either was to HPV16 (p ≤ 0.1). The distributions of TIER2 mutations were similar among HPV types (p = 0.54; Fig. 2c).

The mean number of hotspot mutations per ICC sample was 1.43 (range of 1–4). HPV18/45-positive ICC were 11-fold more likely to have ≥ 2 hotspot mutations than HPV16-positive ICC (p = 5.6 × 10−3, OR = 11.2, 95%CI = 2.0–61.9), while between histologies the hotspot mutation counts were not significantly different (Table 2). HPV18/45-positive ADC were also more likely to have ≥ 2 hotspot mutations than HPV16-positive ADC (p = 0.04, OR = 7.9, 95%CI = 1.1–56.1). No such association was seen for SCC, although this is likely due to the small number of HPV18/45-positive SCC with a hotspot mutation (N = 1; Table 2).
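Comparisons of this kind rest on a 2×2 table (for example, HPV group by whether a cancer carries ≥ 2 hotspot mutations). The Python sketch below shows one way such a table could be analyzed: a two-sided Fisher's exact test plus a sample odds ratio with a Woolf (log-based) confidence interval. The counts are hypothetical, since Table 2 is not reproduced here, and the interval method is an assumption; the odds ratios and intervals reported in this paper may have been computed differently.

```python
from math import comb, exp, log, sqrt


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    support = range(max(0, col1 - row2), min(row1, col1) + 1)
    # Sum every table that is no more probable than the observed one.
    p = sum(prob(x) for x in support if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def odds_ratio_woolf(a, b, c, d, z=1.959964):
    """Sample odds ratio with a Woolf (log) confidence interval (95% by default)."""
    if min(a, b, c, d) <= 0:
        raise ValueError("the Woolf interval needs all four cells > 0")
    odds_ratio = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return odds_ratio, exp(log(odds_ratio) - z * se), exp(log(odds_ratio) + z * se)


# Hypothetical counts, NOT taken from Table 2:
# rows = two HPV groups, columns = (>= 2 hotspot mutations, exactly 1).
p_value = fisher_exact_two_sided(3, 1, 1, 3)
odds_ratio, ci_low, ci_high = odds_ratio_woolf(3, 1, 1, 3)
```

With counts this small the interval is very wide, which mirrors the caution in the text about subgroups containing only a handful of cases.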

Table 2 Hotspot mutation counts among cancers only, by HPV type/group and cancer histology

Hotspot mutations are progressively enriched in CIN3/AIS precancers and cancers, and influenced by viral genetic variation

We evaluated the occurrence of hotspot mutations in samples collected either at the time of or within 2 years of the case/control diagnosis (N = 3031; Table S1). Compared to controls, the frequency of TIER1 hotspot mutations was similar in CIN2 (p = 0.59), but was statistically significantly increased in CIN3/AIS precancers and highest in ICC, overall and by squamous and glandular histologies: CIN3 and AIS were 2- and 4-fold more associated with TIER1 hotspot mutations, and SCC and ADC were 18- and 10-fold more associated with TIER1 hotspot mutations than controls, respectively (Table 3). The APOBEC3-associated TIER1 hotspot mutations were more strongly associated with ICC compared to controls (p = 2.5 × 10−16, OR = 32.5, 95%CI = 14.1–74.8) than TIER1 hotspots at non-APOBEC3 motifs (p = 1.4 × 10−7, OR = 6.7, 95%CI = 3.3–13.6; Table 4). Since PIK3CA was the most mutated gene in our cohort, we further evaluated whether it was driving the associations between ICC and hotspot mutations. Among TIER1 hotspots, PIK3CA mutations were 39-fold more associated with ICC (p = 2.2 × 10−16, OR = 38.9, 95%CI = 16.3–92.7) compared to controls, while non-PIK3CA mutations were 6-fold more associated with ICC (p = 1.3 × 10−6, OR = 5.9, 95%CI = 2.9–12.2) compared to controls (Table 3). These findings indicate that PIK3CA is a key driver of cervical carcinogenesis, although other mutations also play a significant role.

Table 3 Association of TIER1 hotspot mutations with precancers and cancers among single time-point samples collected within 2 years of outcome ascertainment
Table 4 Associations of TIER1 hotspot mutations matching APOBEC3 and non-APOBEC3 motifs

For all TIER2 hotspot mutations, the total frequencies in controls and cases were similar (Table S5). However, both the specific TIER1 and TIER2 hotspot mutation sites that were observed in our ICC cases were rarely observed in the controls (TIER1: 1.8% of controls vs. 25.7% of ICC, p < 2.2 × 10−16; TIER2: 4.0% vs. 15.3% of ICC, p = 8.2 × 10−7) (Table S6).

Given the varying distribution of hotspots among HPV types/groups, we examined the relationship between hotspot mutations and disease status with respect to HPV types/groups and within type lineages/sublineages. For HPV16-positive samples, TIER1 hotspot mutations were significantly increased in AIS precancers (p = 8.8 × 10−4, OR = 3.8, 95%CI = 1.7–8.5), and in both SCC (p = 6.7 × 10−16, OR = 19.8, 95%CI = 9.6–40.8) and ADC (p = 8.6 × 10−6, OR = 7.4, 95%CI = 3.0-17.9) compared to controls (Table 5). For HPV18/45-positive samples, TIER1 hotspot mutations were significantly increased in AIS (p = 0.04, OR = 3.4, 95%CI = 1.1–10.7) and ADC (p = 4.3 × 10−7, OR = 13.6, 95%CI = 4.9–37.3) compared to controls, but not significantly for squamous lesions (Table 5). Comparing cancers to controls for each HPV type and lineage/sublineage, only the previously identified ‘riskier’ HPV16 A4/D2/D3 sublineages18 were more likely to be cancers with TIER1 hotspot mutations (p < 2 × 10−16, OR = 7.7, 95%CI = 3.5–17.3) (Fig. S3). Both APOBEC3-associated and PIK3CA TIER1 hotspot mutations were more enriched in ICC than the non-APOBEC3 and non-PIK3CA mutations, respectively, compared to controls for each HPV type/group, particularly for HPV16-positive SCC (p = 1.5 × 10−10; OR = 76.4, 95%CI = 20–288; Table 5, Table S7). Non-PIK3CA mutations were only associated with HPV16-positive SCC (p = 2.7 × 10−5, OR = 8.9, 95%CI = 3.1–24.8) and only with HPV18/45-positive ADC (p = 1.7 × 10−4, OR = 10.9, 95%CI = 3.1–37.7) (Table 5).

Table 5 Association of TIER1 hotspot mutations with precancers and cancers by HPV type/groups among samples collected within 2 years of outcome ascertainment

Hotspot mutations were detected years before clinical diagnosis

We leveraged the prospective aspect of this cohort to look for hotspot mutations in samples collected years before clinical diagnosis. Interestingly, we detected hotspot mutations in women whose cervical samples were collected 3 or more years before their cancer diagnosis (18% of SCC; Table S8). Compared to control samples collected within the same time-period, SCC samples collected ≥3 years prior to diagnosis had significantly more TIER1 hotspot mutations (p = 0.02, OR = 7.7, 95%CI = 1.3–45.2) (Table S8). The SCC cases with hotspot mutations detected in cervical cell samples collected 4 to 6 years prior to diagnosis (Table S8), had normal (WNL) or benign cytology (ASCUS) at this prior screening visit at which the hotspot was detected. For the 2 AIS and 2 ADC cases that had hotspot mutations detected 3–5 years prior to their diagnosis, one AIS had an atypical glandular cell (AGC) cytology and the other AIS and ADC cases had normal cytology at the prior screening visit where the hotspot was detected.

Allele fractions of hotspot mutations are highest in ICC and increase over time

We investigated whether the allele fraction of hotspot mutations differed by disease status over time prior to diagnosis. Our exfoliated cervical cell samples represent an admixed cell population, which includes both normal and tumor cells for the cases, therefore, the allele fraction could be a proxy of cellular clonal expansion. The variant allele fraction of TIER1 hotspot mutations was highest in ICC (median VAF = 0.09) compared to CIN3/AIS precancers (median VAF = 0.05, p = 3.0 × 10−4) and to controls (median VAF = 0.04, p = 1.3 × 10−9) (Fig. 4a). In addition, CIN3/AIS precancers had significantly higher TIER1 hotspot mutation allele fractions compared to controls (CIN3/AIS median VAF = 0.05 vs. controls median VAF = 0.04, p = 3.4 × 10−4). For TIER2 hotspot mutations, the allele fractions varied less by disease status, although the allele fraction in ICC (median VAF = 0.05) was significantly higher than CIN3/AIS (median VAF = 0.04, p = 0.02) and controls (median VAF = 0.04, p = 3.0 × 10−4), but allele fractions were similar between CIN3/AIS and controls (p = 0.09) (Fig. S4a). These findings could indicate that some TIER2 mutations were acquired later or secondary to invasion, possibly contributing to genome instability, rather than being causal.

Fig. 4: Variant allele fraction (AF) of hotspot mutations by status in the single time-point analyses for TIER1 mutations, and by serial time-point analyses in cancer samples.

Footnote: CIN3 cervical intraepithelial neoplasia grade 3, AIS adenocarcinoma in situ. P-values estimated using Wilcoxon rank sum test with continuity correction. Tests were two sided. ns not significant; ***p-value ≤ 0.001; ****p-value ≤ 0.0001. Source data are provided as a Source Data file, and provide the exact p-values.
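The test named in this footnote, a two-sided Wilcoxon rank sum test with continuity correction, can be sketched in plain Python as follows. This is a minimal version based on the normal approximation with a tie correction; the function name `rank_sum_test` is hypothetical and the input values in the usage note are invented, not VAFs from this study.

```python
from math import erfc, sqrt


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test (normal approximation).

    Applies a continuity correction and a tie correction.
    Returns (U statistic for x, p-value).
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups need at least one observation")
    n = n1 + n2
    # Pool the observations, remembering which group each came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x = 0.0
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values occupy ranks i+1..j and share their average (midrank).
        midrank = (i + 1 + j) / 2
        t = j - i
        tie_term += t**3 - t
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # every value is tied, so there is no evidence of a shift
    # Continuity correction: shrink the deviation from the mean by 0.5.
    z = max(abs(u_x - mean_u) - 0.5, 0.0) / sqrt(var_u)
    return u_x, erfc(z / sqrt(2))
```

For instance, `rank_sum_test([1, 2, 3], [4, 5, 6])` gives U = 0 and a p-value of about 0.08, showing how little power the test has with three observations per group.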

We also examined this clonal expansion process from a different perspective by using multiple samples from the same women collected over a series of clinical visits. We investigated how far before cancer diagnosis a mutation would be detectable and whether the VAF of these mutations increased over time, to validate the observation presented above across different infection stages. For 396 women with an additional serial sample collected, 35 had either a TIER1 or TIER2 mutation detected in the most recent visit (TP1), including 15 ICC (34.9%), 15 CIN3/AIS precancers (6.9%), 2 CIN2 (3.2%), and 3 of the controls (4.0%) (Table S9). We then looked for these specific mutations in samples collected prior to the most recent visit (TP2, TP3, TP4 and TP5, with TP2 being the closest to TP1). We assessed the mutation VAF across time points for each woman. Among the 15 women with ICC and a hotspot mutation, the VAF was significantly higher in TP1 (median VAF = 0.06) than TP2 (median VAF = 0.02, paired Wilcoxon test, p = 4.9 × 10−4) (Fig. 4b). The median VAF was also higher in TP1 for non-cancer samples, but the difference was not statistically significant (Fig. S4b). No mutations were detected in TP4 and TP5 (Fig. S4b). For this analysis, we included the TP1 mutations that were observed in the TP2-TP5 time-points at a VAF below the 0.02 threshold (detailed in methods). Using a 0.02 VAF threshold for these samples, only two PIK3CA E545K mutations would have been detected in the TP2-TP5 time-points (Fig. S5).
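To make the 0.02 threshold concrete, the Python sketch below computes a VAF from read counts and checks it against a cutoff. The function names and read counts are hypothetical, and treating the threshold as inclusive (≥) is an assumption. The relaxed criteria used for mutations already seen at TP1 are described in the Methods and are not reproduced here.

```python
def variant_allele_fraction(alt_reads, total_depth):
    """VAF = reads supporting the variant / total reads covering the site."""
    if total_depth <= 0:
        raise ValueError("total_depth must be positive")
    if not 0 <= alt_reads <= total_depth:
        raise ValueError("alt_reads must lie between 0 and total_depth")
    return alt_reads / total_depth


def passes_threshold(alt_reads, total_depth, min_vaf=0.02):
    """True if the VAF meets the cutoff (assumed inclusive)."""
    return variant_allele_fraction(alt_reads, total_depth) >= min_vaf


# Hypothetical site covered at 820x, the average coverage quoted in the text.
depth = 820
print(passes_threshold(16, depth))  # False: 16/820 is about 0.0195
print(passes_threshold(17, depth))  # True:  17/820 is about 0.0207
```

At this depth a 0.02 cutoff corresponds to roughly 16 to 17 variant reads, which is why mutations present in only a small fraction of cells can fall below it in earlier samples.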

Lastly, we investigated whether the VAF of specific mutations increased faster over time than that of others, which would suggest that these mutations could be the leading drivers being selected. Because samples were collected in a clinical setting, the time interval between TPs varied, with a mean time interval of 1.24 years (range 0.50–5.37 years). We therefore calculated a rate of VAF change that accounts for the time interval. Mutations with the fastest average rates of change were PIK3CA E542K (r = 0.132 VAF increase per year) and E545K (r = 0.129), ERBB2 R678Q (r = 0.062), TP53 R213X (r = 0.045), ARID1A R1446X (r = 0.034), and PTEN Q17X (r = 0.022) (Fig. S5). The three mutations with the fastest VAF change were all missense mutations in oncogenes, and the next three were nonsense mutations in tumor suppressor genes.
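A rate of this kind can be sketched as the change in VAF divided by the elapsed time between two visits, averaged over the women carrying a given mutation. The exact formula is not stated in this section, so the definition below is an assumption, and the function names and input values are hypothetical.

```python
def vaf_rate_per_year(vaf_earlier, vaf_later, interval_years):
    """Change in VAF per year between an earlier and a later time point."""
    if interval_years <= 0:
        raise ValueError("interval_years must be positive")
    return (vaf_later - vaf_earlier) / interval_years


def mean_vaf_rate(observations):
    """Average rate over (vaf_earlier, vaf_later, interval_years) tuples."""
    if not observations:
        raise ValueError("need at least one observation")
    rates = [vaf_rate_per_year(*obs) for obs in observations]
    return sum(rates) / len(rates)


# Hypothetical serial measurements of one mutation in two women (not study data).
example = [(0.02, 0.08, 0.5), (0.01, 0.05, 2.0)]
print(round(mean_vaf_rate(example), 3))  # 0.07
```

Dividing by the interval keeps a large VAF jump over a long gap from being mistaken for faster growth than a smaller jump over a short gap.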

Discussion

We have shown that hotspot mutations can be detected in exfoliated cervical cell samples collected prior to precancer/cancer clinical diagnosis at routine cervical cancer screening visits and that HPV types/groups, and within type lineages/sublineages, influence somatic mutation frequencies. Using exfoliated cervical cells and deep targeted sequencing, we were able to detect important hotspot driver mutations. These mutations were significantly more prevalent in precancers and cancers, with up to a 76-fold increase in cancers depending on mutation type and HPV type, compared with controls. PIK3CA and APOBEC3-induced mutations were the most common mutations detected in this cohort, and some non-PIK3CA mutations were also significantly associated with ICC compared with controls. We observed an increase in the allele fraction of hotspot mutations from controls (i.e., HPV transient infections: < CIN2 or subsequently cleared infections) through precancers and cancers, in line with the predicted cellular clonal expansion in cancer development.

We have identified important TIER1 mutation differences by viral genetic variation, and demonstrate that HPV type/group influences the somatic landscape. The overall distribution of TIER1 mutations and the number of mutations were significantly different between HPV16-, HPV18-, and HPV45-positive cases. Hotspot mutations in PIK3CA were more common in HPV16-positive cases, while mutations in KRAS and FBXW7 were more common in HPV18-positive and HPV45-positive samples, respectively. The non-PIK3CA mutations were HPV type and tumor histology dependent. Interestingly, we also observed that the HPV16 A4/D2/D3 sublineages, which have been previously associated with an increased risk of ICC and particularly strong increased risks of ADC18, were specifically associated with cancers having a hotspot mutation, compared to the other HPV16/18/45 sublineages. In addition, the HPV18/45-positive cancers had a higher number of mutations compared to the HPV16-positive cancers, independent of histology. The HPV18/45-positive ICCs were 11 times more likely to have 2 or more hotspot mutations compared to HPV16-positive ICC. It is possible that HPV16, as a potentially stronger carcinogen, may require fewer additional somatic mutations in host cells to drive carcinogenesis, and in contrast, HPV18/45 may require more mutations, although we did not evaluate other somatic events such as copy number alterations and viral integration. HPV18 and HPV45 have a higher prevalence of HPV integration than HPV16 and are likely associated with more chromosomal damage4,22, which further supports HPV16 being a stronger carcinogen with fewer associated somatic events. It is possible that the less prevalent/carcinogenic HR-HPV types could have even more somatic mutations driving carcinogenesis. Follow-up studies to characterize and compare somatic mutations in case samples positive for the less carcinogenic HR-HPV types, including HPV31 and HPV35 (the HPV16-related types), are needed.

Our comprehensive mutation classification scheme into TIER1 and TIER2, based on previously published somatic and functional data, was critical for distinguishing hotspot mutations more likely to be drivers for ICC. Only TIER1 mutations, defined as hotspots previously reported as cervical cancer drivers, were significantly associated with precancers and cancers in our cohort, indicating that mutations classified as somatic drivers for non-cervical cancers (TIER2) were less important and were not the main drivers of cervical carcinogenesis. TIER1 hotspot mutations were enriched in both HPV16- and HPV18/45-positive glandular precancers and cancers (AIS and ADC) compared to controls, while only HPV16-positive SCC had significantly more mutations than controls. However, there were only 10 HPV18/45-positive SCC in our cohort, which likely limited statistical power. Differences in the somatic landscape of ICC by squamous and glandular histologic subtypes have been previously observed and related to expression profiles4,5; however, it is not clear if specific driver mutations trigger tumor differentiation differently or if the somatic landscape is influenced by HPV type. The majority of significantly mutated genes previously reported in ADC were also observed in SCC; however, very few ADC samples (only 4–31 [TCGA]) were investigated in earlier studies4,6,11. Future studies focusing on larger numbers of glandular lesions may help to identify new significantly mutated genes in this subtype, and lead to a better understanding of the true differences in ICC etiology related to specific histologic subtypes and HPV types.

Importantly, our HPV-negative cervical cell samples had no TIER1 mutations, suggesting these mutations are uncommon in HPV-negative cells. Recent studies have shown that some cancer associated recurrent mutations are also identified in normal cells from the same epithelium23,24, proposing that tissue-specific transformation likely requires additional factors such as environmental exposures, additional mutations or ineffective immune surveillance25. In our cohort, it is possible that some of the somatic mutations found in the HPV-positive control samples are a consequence of errors in intrinsic processes of the infected dividing epithelium, such as aberrant DNA replication or repair, and potentially related to the HPV infection; and these mutations are likely not enough to drive transformation alone.

We detected somatic driver mutations in exfoliated cervical cell specimens, which is important because these samples represent a less invasive sampling procedure using residual samples from current routine cervical cancer screening, as compared to tumor blocks or tissue biopsies. We demonstrated that these samples can be used to evaluate potential diagnostic and predictive somatic mutations. Our deep gene-panel sequencing assay with 800x mean depth allowed us to identify somatic hotspot driver mutations in these cervical cells at a low allele fraction. In our single time-point analyses, compared with controls, the allele fraction of TIER1 hotspot mutations was higher in precancers and highest in ICC. In the multiple time-point analyses, the allele fraction of hotspot mutations was also significantly higher closest to diagnosis, supporting the predicted increasing trend of the allele fraction with cellular clonal expansion. A similar study of residual liquid-based cytology specimens from the thyroid, lymph node, breast, pancreas and other fluids, which used targeted NGS with a mean depth of 500x and a VAF threshold of 10%, likewise identified important somatic mutations primarily in the samples with a higher proportion of tumor cells26. Our exfoliated cervical samples represent an admixture of normal and tumor cells. The allele fractions of TIER2 hotspot mutations were similar between controls and precancers in the single time-point analyses, but significantly higher in ICC. This may indicate that some of these mutations are capable of driving tumorigenesis in later clones, or that they are less important passengers carried along with pre-malignant cells that may contribute to genomic instability.

In our prospective serial sample collection, we also observed hotspot mutations years prior to the case diagnosis and differences in the allele fraction by time. The majority of mutations were detected within 2 years of the time of diagnosis, and, although rare, we also detected three TIER1 hotspot mutations and six TIER2 hotspot mutations between 3 and 6 years before the cancer diagnosis. These findings support that driver mutations that confer a proliferative advantage for tumors can be detected years before clinical diagnosis27. We may have missed more of these early mutations because residual cervical cell specimens collected many years prior to diagnosis may not be enriched enough for cancer precursor cells compared to the normal cells. The ratio of VAF changes per time-point was highest for PIK3CA E542K and E545K mutations. It is possible that clonal expansion was faster in cells harboring these mutations or that these mutations were mapped to genomic regions that were later amplified. In both scenarios, these mutations could have been selected and contributed to driving carcinogenesis. Selection of driver mutations is still not completely understood, as many coding mutations are tolerated during carcinogenesis24,28, and more studies are needed to understand when driver mutations arise.

PIK3CA was the most frequently mutated gene in our cohort, and we detected strong differences in PIK3CA mutation frequencies between cases and controls, depending on HPV type and histology (with up to 76-fold differences in risk associations). PIK3CA mutations were present in 18% of our ICC samples, which is slightly lower than previously published data from tissue samples, which reported PIK3CA mutations in 22% of tumors from Latin America and up to 45% of HIV-negative tumors from Uganda4,6,11. A combination of other non-PIK3CA mutations including FBXW7, KRAS, PTEN, ERBB3, TP53, ERBB2, EP300 and MAPK1 were also associated with precancer and ICC, but only when classified as TIER1. APOBEC3-induced mutations are a recognized source of somatic mutations29, and here we confirmed that TIER1 hotspot mutations consistent with being induced by APOBEC3 are enriched in our cancers but not in the HPV-positive controls. The most common PIK3CA hotspot, E545K, matching an APOBEC3 motif, was ~18-fold more frequent in ICC than controls. Interestingly, even though APOBEC3, as part of the host’s intracellular defense, is activated upon HPV infection, APOBEC3-induced somatic mutations were significantly less frequent in both control (i.e., HPV transient infection: <CIN2 or subsequently cleared infection) and persistent HPV+ infections in our study, potentially indicating that APOBEC3 mutated host DNA more in the cases. We previously observed that APOBEC3-induced viral mutations in the HPV16 genome were significantly associated with a benign infection or viral clearance30. Together, these data support a double-edged sword hypothesis: when APOBEC3 mutations in the viral genome do not lead to viral clearance, its off-target activity can instead result in host somatic driver mutations.
Among HPV-positive oropharyngeal cancers, we have further shown that APOBEC3 mutations in paired HPV16 genomes and host somatic genomes were correlated, suggesting a common mechanism of APOBEC3 mutagenesis in the host and viral genomes of these tumors31. However, more work is needed to fully understand the combination of mechanisms that likely lead to APOBEC dysregulation and off-target mutagenesis as they are also prevalent in non-virally mediated tumors.

Our study has some limitations that should be noted. Given that our samples are from a clinical setting, DNA from matched “normal” samples was not available, therefore, we used publicly available polymorphism databases to filter out germline variants. Unfortunately, this approach likely missed rare germline variants with a lower-than-expected VAF (e.g., <50%), which could have been considered somatic. For example, the mutation in TP53 (R175C) detected in the HPV-negative sample with an allele fraction of 0.53 is likely a rare germline variant. However, we did not want to restrict mutations to only those with a low allele fraction (e.g., <30%) because this would cause us to miss important driver mutations potentially located in amplified regions (for example at 3q4). Our study was not designed for discovering novel somatic driver mutations, and instead we limited our analysis to a fixed number of important genes to allow for deep targeted sequencing and only to previously reported somatic mutations. Follow-up studies including matched germline samples and evaluations to confirm the cervical tumor origin of the cervical cells are needed. Our prospective cohort includes cervical cell samples collected from women undergoing routine cervical cancer screening, therefore, the precise timing of disease diagnosis may be limited by the timing of the screening visit intervals. To account for potential undetected disease due to the screening visit intervals, we grouped the samples collected within 2 years of the date of clinical diagnosis as ‘at diagnosis’ samples. However, we cannot exclude that a longer time interval underlies prevalent disease.

In summary, our study has identified somatic driver mutations for cervical cancer in residual cervical cell samples that are routinely collected from a clinical setting prior to precancer/cancer diagnosis and demonstrates the feasibility of using them for detecting driver mutations that are potentially diagnostic biomarkers. We further demonstrate that HPV type and genetic variation influence the host somatic landscape and that specific somatic driver mutations are enriched in precancers and cancers compared to HPV-positive control samples (<CIN2 or subsequently cleared infections). Our deep targeted sequencing approach using cervical cells requires validation as it has the potential to be translated into a diagnostic for cervical precancer/cancer in a clinical laboratory, and could be higher throughput than full molecular profiling of cancers for treatment. Our findings demonstrate the potential of using these convenient samples to detect important somatic driver events before cancer diagnosis.

Methods

Study population and sample collection

The Kaiser Permanente Northern California (KPNC) institutional review board (IRB) approved use of the data, and the National Institutes of Health Office of Human Subjects Research deemed this study exempt from IRB review.

Residual exfoliated cervical cell samples were selected from women in the KPNC-NCI HPV Persistence and Progression (PaP) cohort. A full and detailed description of the cohort was previously reported32. Briefly, the PaP cohort includes ~55,000 women out of ~1 million who underwent routine cervical cancer screening using the Hybrid Capture 2 assay (HC2; Qiagen Inc., Gaithersburg, MD) and cytology between December 2006 and January 2011. Participants could opt out of retaining residual cervical specimens from Pap smears, in which case those samples were discarded (~8% of women opted out). The retained residual cervical cells were stored in liquid-based specimen transport medium (STM; Qiagen Inc., Gaithersburg, MD). Women were followed over time and we obtained coded information on subsequent cervical cancer screening test results and histology results from electronic health records through 2019. All personally identifying information was kept strictly at KPNC. Histology was determined based on the cervical intraepithelial neoplasia (CIN) classification system.

For our study, we included cervical cell samples from precancer/cancer cases and controls (described below) that were positive for HPV16, HPV18 and/or HPV45 using Linear Array (LA; Roche Molecular Systems, Pleasanton, CA, USA), Cobas (Roche) and/or lab-specific polymerase chain reaction (PCR). A total of 3,351 women were included: 1,478 controls, 561 CIN grade 2 (CIN2; equivocal squamous precancer), 984 CIN grade 3 (CIN3; squamous precancer), 166 adenocarcinoma in situ (AIS; glandular precancer), 1 precancer with unknown histology, 74 SCC, 76 ADC, and 11 ICC with unknown histology (Table S1). We selected all available HPV16-, HPV18- and HPV45-positive precancer (CIN3, AIS) and cancer (ICC, SCC, ADC) samples. Eight adenosquamous carcinoma cases were included with ADC for histology analyses. Controls were defined as women with baseline HPV16-, HPV18- and/or HPV45-positive specimens who subsequently cleared their infection and/or had an infection classified as normal or low-grade lesion (ASCUS, LSIL, CIN1), with no histologic evidence of equivocal precancer or worse (CIN2+) during the study follow-up period according to data obtained from the electronic health records. Of our controls, 77.4% subsequently cleared their infections during the study follow-up time. At least 1 control per CIN3/AIS+ case was randomly selected for comparisons (Table S1). A subset of the total CIN2 cases available was randomly selected for inclusion. The average age of the women included in our study was 38 years (SD 11), and the majority self-reported their race/ethnicity as White (51%), followed by Hispanic (20%), Asian/PI (15%), Black (8%), or multiracial/other (7%). There were 136 women who had HPV18 and HPV16 co-infections, 21 had HPV18 and HPV45 co-infections, 84 had HPV45 and HPV16 co-infections, and 9 had HPV16, HPV18 and HPV45 co-infections (Table S2).
In addition, 396 women in our study had at least one additional serial sample, collected prior to their most recent screening visit, available for inclusion (N = 974 serial samples in total, counting the most recent sample from each of these women; Table S3; serial time-point samples). We included all available serial samples. In total, there were 3,929 samples collected from 3,351 women in the study (Tables S1 and S3).

Because all samples are from a prospective cohort, we conducted two sets of analyses: ‘single time-point’ analyses, and ‘serial time-point’ analyses. For ‘single time-point’ analyses, we included one sample per woman, collected during the antecedent screening visit close to (within 2 years, N = 3,031) or far from (≥3 years, N = 320) their clinical diagnosis date (total N = 3,351 samples) (Table S1). We categorized these samples based on these two time periods from diagnosis because women could have had missed or undetected disease during the 2-year time frame prior to diagnosis, depending simply on when their screening visits were scheduled. For ‘serial time-point’ analyses, we included 2–5 samples per woman, collected up to 10 years before diagnosis (N = 974 total samples from 396 women). We categorized ‘serial time-point’ samples based on their collection time relative to diagnosis: time-point (TP) one or TP1 = closest to or at the time of diagnosis; TP2 = second available sample collected, next to TP1; TP3 = third available sample collected, next to TP2; TP4 = fourth available sample collected, next to TP3; and TP5 = fifth available sample collected, next to TP4 (Table S3). All case samples were collected before precancer/cancer diagnosis.
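
The serial time-point labelling described above can be illustrated with a short sketch. It assumes that collection dates are available for each woman and that TP1 is simply her most recent pre-diagnosis sample; the function name and input layout are hypothetical and are not taken from the study pipeline.

```python
from datetime import date


def label_time_points(collection_dates):
    """Label one woman's serial samples TP1..TPn, with TP1 the most recent.

    `collection_dates` is a hypothetical list of datetime.date objects.
    """
    if not 2 <= len(collection_dates) <= 5:
        raise ValueError("serial analyses used 2-5 samples per woman")
    # Most recent sample first, so that TP1 is closest to diagnosis.
    ordered = sorted(collection_dates, reverse=True)
    return {f"TP{i}": d for i, d in enumerate(ordered, start=1)}


example = label_time_points(
    [date(2008, 3, 1), date(2010, 6, 15), date(2007, 1, 20)]
)
```

Here `example["TP1"]` is the 2010 sample and `example["TP3"]` the 2007 sample.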

We additionally included 32 exfoliated cervical cell samples from women in PaP that were HPV-negative with normal cytology and tested negative for HPV for at least 2 consecutive screening visits, to investigate the occurrence of somatic mutations in HPV-negative cervical cell samples compared to HPV-positive samples.

DNA extraction and sequencing

Selected samples were transferred to the National Cancer Institute, Cancer Genomics Research (CGR) laboratory. DNA was isolated by transferring 30 µL of the STM specimens to a buffer containing 200 µg/mL of proteinase K, followed by incubation at 55 °C and 95 °C for 2 h and 10 min, respectively33. DNA was prepared for sequencing using the Thermo Fisher Life Science Ion Torrent S5 GeneStudio platform (Thermo Fisher Scientific, Waltham, MA, USA). A set of custom primers was designed to amplify the exonic regions of 20 genes that have been previously described as significantly mutated in ICC4,5,11, including ARID1A, CASP8, ELF1, EP300, ERBB2, ERBB3, FBXW7, HLA-A, HLA-B, HRAS, KRAS, MAPK1, MED1, NFE2L2, PIK3CA, PIK3R1, PTEN, STK11, TGFBR2, TP53. Libraries were constructed using the AmpliSeq Library Preparation kit 2.0-96LV (Thermo Fisher Scientific, Waltham, MA, USA). Library quantification was performed with the Kapa Biosystems Library Quantification Kit-IonTorrent/LightCycler 480 (Roche, Basel, Switzerland), and the Agilent BioAnalyzer DNA High-Sensitivity LabChip (Agilent Technologies, Santa Clara, California). The average read depth per amplicon for all samples was 820x (SD 527), and the average/median coverage metrics for each sample are provided in Supplementary Data 1.

Mutation calling and quality control

An in-house pipeline was developed to analyze the amplicon panel, detailed in the Supplemental Material. First, sequence reads underwent read quality assessment and trimming, then reads were mapped to the human reference genome hg19 using Torrent Suite software (Thermo Fisher Scientific, Waltham, MA, USA). Somatic variant calling was performed in a single-sample fashion without paired normal samples (i.e., tumor-only), given that our samples are cervical cells from residual Pap smears in a clinical setting without a matched blood collection. Single nucleotide variants were called using the Torrent Variant Caller (TVC) v.5.0.3 (Thermo Fisher Scientific, Waltham, MA, USA), following the manufacturer's recommendations with low-VAF parameters for a minimum allele fraction of 2%.

To assess the performance of our amplicon-based assay in detecting known somatic mutations at a low VAF and to establish filters to remove potential false positive variants, we sequenced the Acrometrix Oncology Hotspot Control DNA (AOH) (Thermo Fisher Scientific, Waltham, MA, USA) (detailed in the Supplemental Material: Mutation calling quality control). We applied the following variant calling parameters established in this experiment to improve detection of true low VAF somatic mutations: FILTER = PASS (passed TVC default variant calling parameters), QUAL ≥ 10, FDP ≥ 100, FAO ≥ 6, STB < 0.9, MLLD > 55. We achieved between 70.0% and 98.9% sensitivity for detecting known mutations with an allele fraction between 2% and 35% (Table S4 and Fig. S1). In the serial time-point analyses, we relaxed these filtering criteria in TP2 and beyond to ≥1 read per mutation and ≥100 read depth only for the specific loci with mutations detected in TP1.
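
The thresholds listed above can be written as a simple predicate. The following is a minimal sketch, assuming each variant call has already been parsed into a dictionary keyed by the field names quoted in the text; the function names, the record layout, and the mapping of "≥1 read per mutation" to FAO and "read depth" to FDP are assumptions made for illustration, not a description of the in-house pipeline.

```python
def passes_low_vaf_filters(call):
    """True if a call meets the thresholds quoted in the text.

    `call` is a hypothetical dict with the keys FILTER, QUAL, FDP, FAO,
    STB and MLLD; parsing the variant caller output is not shown.
    """
    return (
        call["FILTER"] == "PASS"
        and call["QUAL"] >= 10
        and call["FDP"] >= 100
        and call["FAO"] >= 6
        and call["STB"] < 0.9
        and call["MLLD"] > 55
    )


def passes_relaxed_serial_filters(call, tp1_loci):
    """Relaxed criteria for TP2 and beyond, as read from the text.

    Assumption: `tp1_loci` is the set of (chrom, pos, ref, alt) tuples
    carrying a mutation in the same woman's TP1 sample.
    """
    locus = (call["CHROM"], call["POS"], call["REF"], call["ALT"])
    return locus in tp1_loci and call["FAO"] >= 1 and call["FDP"] >= 100


good = {"FILTER": "PASS", "QUAL": 35, "FDP": 800, "FAO": 24,
        "STB": 0.55, "MLLD": 90}
low_support = dict(good, FAO=3)
```

With these illustrative records, `good` passes the strict filter and `low_support` fails it because fewer than 6 reads support the alternate allele.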

SnpEff v.3.6c34 was used to annotate synonymous and nonsynonymous mutations, indels, and frameshifts, and Annovar35 was used to annotate exonic and non-exonic (e.g., introns, UTR) mutations, the gnomAD frequencies, and COSMIC information. To remove potential known germline variants and to keep known somatic mutations, we excluded mutations reported with a minor allele frequency (MAF) ≥ 0.01 in gnomAD36, and kept only those reported in COSMIC v9237. Then, to keep mutations more likely to be cancer drivers, we excluded synonymous mutations and mutations located in intronic regions that did not affect splice sites. Lastly, we excluded mutations located in homopolymers with ≥3 bases and those that mapped to repetitive regions (details described below) (Fig. 1).
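
The sequence of annotation-based exclusions above can be sketched as a single function. This is an illustration under stated assumptions: the dictionary keys (`gnomad_maf`, `in_cosmic`, `consequence`, `affects_splice_site`, `homopolymer_len`, `in_repetitive_region`) are hypothetical names for values that would come from the annotation step, and a variant absent from gnomAD is assumed to be retained.

```python
def keep_candidate_driver(variant):
    """Apply the exclusions described in the text, in the same order."""
    maf = variant.get("gnomad_maf")  # None when not reported in gnomAD
    if maf is not None and maf >= 0.01:
        return False  # likely a known germline polymorphism
    if not variant["in_cosmic"]:
        return False  # keep only previously reported somatic mutations
    if variant["consequence"] == "synonymous":
        return False
    if variant["consequence"] == "intronic" and not variant["affects_splice_site"]:
        return False
    if variant["homopolymer_len"] >= 3 or variant["in_repetitive_region"]:
        return False
    return True


candidate = {
    "gnomad_maf": None,
    "in_cosmic": True,
    "consequence": "missense",
    "affects_splice_site": False,
    "homopolymer_len": 1,
    "in_repetitive_region": False,
}
```

Keeping the checks in the order given in the text makes it straightforward to count how many variants each step removes.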

Hotspot mutations and viral genetic variation classifications

To identify somatic hotspot mutations previously described (i.e., not to discover new mutations), we used the following somatic mutation databases from previous cancer genomics studies: the Cancer Genome Interpreter (CGI)38, Mutagene39,40, cBioPortal-TCGA-cervix (Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinomas)4,41,42, and CHASMplus43. A detailed description is in the Supplemental Material. Based on these databases, we classified previously reported mutations as either known drivers in ICC (i.e., TIER1) or known drivers in other cancers (i.e., TIER2). Specifically, we assigned mutations to TIER1 (restricted classification) if they were mutations reported to be drivers in ICC by CGI or Mutagene, or an amino-acid (aa) change in ≥2 samples from the cBioPortal-TCGA-cervix database; or to TIER2 (expanded classification) if they were mutations reported to be drivers in other cancers by CGI or Mutagene, or mutations at the same aa position in ≥3 samples from the cBioPortal-TCGA-cervix database. Therefore, a hotspot mutation is defined as a mutation previously observed in the aforementioned somatic mutation databases and classified as TIER1 or TIER2. Mutations were classified as being potentially induced by APOBEC3 if they were a C > T or C > G DNA change occurring at a 5’TCW3’ [W is A or T] trinucleotide motif12. All hotspot mutations were visually inspected by manually reviewing each hotspot nucleotide position in the aligned reads in IGV44 and excluded if they were present in repetitive regions of the genome, called by ambiguously mapped reads (mapping quality = 0) or reads with low base quality (quality ≤20), or showed forward or reverse strand bias45.
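
The APOBEC3 motif rule above can be illustrated with a short sketch. It assumes that the reference base and its two flanking bases are given on the reference strand, and that changes at G are also counted when the reverse complement matches the motif; the strand handling and the function name are assumptions for illustration and are not stated in the text.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def is_apobec3_like(ref, alt, upstream, downstream):
    """True for C>T or C>G at 5'-TCW-3' (W = A or T), on either strand.

    `upstream` and `downstream` are the bases immediately 5' and 3' of
    the mutated base on the reference strand.
    """
    bases = [b.upper() for b in (ref, alt, upstream, downstream)]
    if any(b not in COMPLEMENT for b in bases):
        return False  # indels or ambiguous bases are not classified
    ref, alt, upstream, downstream = bases
    if ref == "G":
        # Re-express the change on the opposite strand, where it is a C.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
        upstream, downstream = COMPLEMENT[downstream], COMPLEMENT[upstream]
    return (
        ref == "C"
        and alt in {"T", "G"}
        and upstream == "T"
        and downstream in {"A", "T"}
    )
```

For example, a G > A change with a T upstream and an A downstream on the reference strand corresponds to C > T at TCA on the opposite strand, so it is classified as APOBEC3-like under this sketch.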

We classified the other non-hotspot mutations as TIER3 (new mutations in our cohort predicted as drivers by CGI and reported as common in CHASMplus) or passengers (known as passengers or likely neutral by CGI or Mutagene). Mutation filters, counts and classification criteria are summarized in Fig. 1. We categorized samples as having no hotspot mutation, at least 1 hotspot mutation, or 2 or more hotspot mutations.
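
The tier definitions and the per-sample hotspot categories can be summarized in a small decision sketch. The boolean and count inputs are hypothetical summaries of the database lookups, the order in which tiers are checked (TIER1 before TIER2, and so on) is an assumption, and the "unclassified" fallback is added only so that the function always returns a value.

```python
def classify_mutation(icc_driver, tcga_same_change_n, other_cancer_driver,
                      tcga_same_position_n, cgi_predicted_driver=False,
                      chasmplus_common=False, known_passenger=False):
    """Assign a tier from hypothetical, pre-computed database summaries."""
    if icc_driver or tcga_same_change_n >= 2:
        return "TIER1"
    if other_cancer_driver or tcga_same_position_n >= 3:
        return "TIER2"
    if cgi_predicted_driver and chasmplus_common:
        return "TIER3"
    if known_passenger:
        return "passenger"
    return "unclassified"  # fallback assumed for this sketch only


def hotspot_flags(tiers):
    """Per-sample flags; a hotspot is a TIER1 or TIER2 mutation."""
    n = sum(t in ("TIER1", "TIER2") for t in tiers)
    return {"none": n == 0, "at_least_1": n >= 1, "two_or_more": n >= 2}
```

Because the "at least 1" and "2 or more" categories overlap, the sketch returns separate flags rather than a single label.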

To assess hotspot mutations by disease status and HPV type/variant, we combined HPV18 and HPV45 for most statistical analyses because mutation counts in each stratum by histology were limited, the two types are genetically related (both are part of the Alpha 7 species group), and both have a similarly higher relative frequency among ADC compared to SCC. We categorized samples as HPV16-positive when HPV16 was present as a single infection or with multiple types other than HPV18 or HPV45, and as HPV18/45-positive when HPV18 and/or HPV45 were present as single or concurrent infections or with multiple types other than HPV16.
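
The two HPV categories defined above can be expressed as a short rule. This sketch assumes genotyping results are available as a collection of type labels such as "HPV16"; the function name is hypothetical, and samples carrying HPV16 together with HPV18 or HPV45 are returned as `None` because the rule above assigns them to neither category.

```python
def hpv_group(detected_types):
    """Map a sample's detected HPV types to the analysis category."""
    types = set(detected_types)
    has_16 = "HPV16" in types
    has_18_45 = bool(types & {"HPV18", "HPV45"})
    if has_16 and not has_18_45:
        return "HPV16"
    if has_18_45 and not has_16:
        return "HPV18/45"
    return None  # not covered by either definition in the text
```

For example, `hpv_group(["HPV16", "HPV31"])` gives `"HPV16"`, and `hpv_group(["HPV18", "HPV45"])` gives `"HPV18/45"`.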

Statistical analyses

To compare hotspot mutation frequencies, we used logistic or multinomial logistic regression models to assess differences by disease status and across histology and HPV types, calculating the odds ratio (OR) and 95% confidence intervals (CI). Disease outcomes were defined as controls (i.e., HPV transient infection: < CIN2 or subsequently cleared infection), CIN2, precancers (CIN3 or AIS), and ICC (including SCC and ADC). When comparing controls versus precancer or cancer cases, we also adjusted models for age, smoking status, and race/ethnicity; none of these covariates affected the direction or strength of the associations, so we present only the unadjusted models. To assess differences in the VAF of hotspot mutations across disease status, we used the non-parametric Mann-Whitney test. We investigated the HPV type/lineage-specific18,46 (see Supplemental Material: HPV genome sequencing and lineage assignment) association between hotspot mutations and disease status, using a generalized linear mixed-effects model. To assess differences in the allele fraction of hotspot mutations across serial time-point samples, we calculated a rate of VAF change ‘r’ per year [r = (VAF.TP2 – VAF.TP1) / (years.TP2 – years.TP1)] and tested differences with the non-parametric paired Mann-Whitney test. All statistical analyses were performed in R version 4.1.0.
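
The rate of VAF change between two serial samples can be illustrated as follows. The sketch reads the formula as the difference in VAF divided by the difference in years, and assumes that `years_tp1` and `years_tp2` are the times of each sample expressed on a common scale in years (for example, years before diagnosis); the function and argument names are hypothetical.

```python
def rate_of_vaf_change(vaf_tp1, years_tp1, vaf_tp2, years_tp2):
    """Return r = (VAF.TP2 - VAF.TP1) / (years.TP2 - years.TP1)."""
    elapsed = years_tp2 - years_tp1
    if elapsed == 0:
        raise ValueError("the two time points must differ")
    return (vaf_tp2 - vaf_tp1) / elapsed


# Illustrative values only: VAF 0.10 at TP1 (0 years before diagnosis)
# and 0.04 at TP2 (2 years before diagnosis).
r = rate_of_vaf_change(0.10, 0.0, 0.04, 2.0)
```

With these illustrative values, `r` is approximately -0.03 per year on the years-before-diagnosis scale, meaning the VAF was lower in the earlier sample.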

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.