Background

The factors that favour progression of human papillomavirus (HPV) infection to cervical cancer are incompletely understood. Advances in next-generation sequencing nowadays allow deep, high-quality sequencing of entire viral genomes in HPV-positive samples from large cohorts, enabling assessment of the importance of viral genomics. HPV16 variant lineages have been implicated in cervical carcinogenesis. The definition of a variant lineage is that the L1 open reading frame differs by more than 1% but less than the 10% that would make it another HPV type1. A variant sublineage is defined as groups of sequences with 0.5–1.0% differences between genomes2. The four identified variants of HPV16 are divided into sublineages A1–3 (formerly termed European), A4 (Asian); B (African-1), C (African-2) and D1–3 (North-American, Asian-American) and have been associated with different cervical precancer and cancer risk3. Even within HPV16 variants, genetic polymorphisms may play a key role for infection persistence and oncogenic potential3.

Primer sets that amplify the whole 8-kilobase pair HPV16 genome have been described4. Conservation of the E7 gene appears to be required for cervical carcinogenesis5. The remaining issues that need to be addressed are: (i) if thorough sequencing will reveal additional sequence variability (single-nucleotide polymorphisms (SNPs)), (ii) if variability is preferentially associated with invasive or in situ cancer and (iii) if variability can be detected already before development of disease in prospective studies. We therefore investigated viral gene variation in a prospective, population-based study of invasive and in situ cervical cancer, using the Illumina NextSeq sequencing platform that enables deep and large-scale sequencing.

Materials and methods

Study participants

The study design has been described previously6. In brief, all squamous cervical cancer (SCC) case women in Sweden during the period 1969–2002 were identified using the Swedish National Cervical Screening Register. We used the same register to draw a random sample of case women with carcinoma in situ (CIS; equivalent to CIN3). Using case–control sampling, we then identified one woman, matched on county, date of entry into cohort (all women in Sweden with a normal smear) (±3 months) and age at first normal smear (±1 year), as an individually matched control for each CIS and SCC case. We retrieved archival pre-diagnostic smears from the case women and matched smears from the control women. Histologic specimens from the identified cases were re-reviewed by a senior pathologist. All pre-diagnostic specimens were tested for HPV and the viral load determined by real-time PCR7.

The first smear from all single-infection HPV16-positive samples (n= 1250) that were found to have an HPV16 viral load >100 copies/µl (n= 439) were retrieved for sequencing. Overall, we used 242 samples from 242 different women with either CIS/CIN3 (146 samples: 134 cases and 12 controls) or SCC (96 samples: 92 cases and 4 controls).

Extraction of DNA

The smears taken prior to the date of diagnosis of the case in each case–control pair had been requested from comprehensive archives. DNA was extracted from archival cervical conventional Pap smears, as previously described6. For our HPV16 sequencing, the HPV16-positive samples were subjected to an additional purification step which was performed using MagNa Pure LC Total Nucleic Acid Isolation Kit in accordance with the manufacturer's protocol (Roche Molecular Systems, Inc., Alameda, CA, USA).

DNA amplification and primer pooling

The entire HPV16 genome (7906 bp) was amplified as 47 overlapping amplicons, ranging in size from 181 bp to 375 bp, as previously described4. Amplification primers were divided into five different reactions (Supplementary Table 1) to reduce occurrence of self-dimers and cross-primer dimers, and extracted DNA was amplified separately by all five PCR reactions for every sample. After PCR, amplification products were pooled together according to sample name, before library preparation. To control for contamination and accuracy, negative controls (Sigma water as well as HPV16-negative but human DNA-positive controls) were also amplified. Each PCR was performed with 5 µl DNA in 20 µl reaction containing 1x Qiagen Multiplex PCR Master Mix (Qiagen, Hilden, Germany) and 0.2 µM of each primer. A pre-heat of 95 °C for 15 min was followed by 45 cycles at 95 °C 30 s, 57 °C 90 s and 72 °C, 90 s, with a final extension at 72 °C for 10 min.

Illumina library preparation

A quality analysis (Bioanalyzer, Agilent, Santa Clara, CA, USA) was performed to check DNA amplicon length prior to library preparation. A total of 255 libraries (242 samples, 9 PCR negative controls with Sigma water and 4 PCR negative controls containing human DNA) were prepared using the TruSeq Nano DNA Sample Preparation kit according to the user guide revision A (Illumina, San Diego, CA, USA) with the following modifications: as each sample consisted of approximately 200–400 bp long PCR products, the tagmentation, end-repair and size selection steps were omitted, and hence the library preparation started with adenylation of 3’-ends. We used 75 ng of PCR product as input in a volume of 17.5 μl of resuspension buffer and 2 adaptor indexed primers were ligated to each sample.

All individual libraries were validated, normalised to 2 nM and pooled in different pools. Each pool contained approximately 48 libraries (including specimens and negative controls) and was denatured and diluted, resulting in a 1.8 pM DNA solution. All library pools were sequenced paired-end 151 + 151 cycles once, using the NextSeq 500 instrument and NextSeq 500 High Output reagent kit (Illumina) as described in the user guides Denature and Dilute Libraries Guide v02 for the NextSeq System, NextSeq 500 kit Reference Guide revision F and NextSeq 500 System Guide v02.

Sequence analyses

We used indices included in the Illumina adaptors to assign raw sequence reads obtained from the NextSeq 500 (Illumina) platform to the originating samples. Reads were quality and adaptor trimmed with Trimmomatic8. All reads with a read length below 150 base pairs (bp) were discarded for further analysis. We aligned 150 bp long quality reads to a modified HPV16REF (human papillomavirus 16 reference sequence from the The PapillomaVirus Episteme, 7906 bp) using NextGenMap9. We considered only paired-end reads where both reads mapped to the genome, with the correct orientation and distance, with >90% identity over 75% of their length as valid and further analysed.

As the HPV genome is circular, the reference genome was modified by adding to the end (after position 7906), the first 258 nucleotides in order to not lose coverage of amplicons 46 and 47, which start at the end of the genome and end at the beginning.

To filter out human reads, HPV16 mapped reads from the first 48 samples were screened against the human reference genome hg19 using NextGenMap9 with same parameters (>90% identity over 75% of their length). No reads mapped to human sequences, and therefore filtering of human reads was omitted from the bioinformatics pipeline.

Resulting BAM files were merged according to sample names, processed through a quality control and left aligned using the GATK version 3.8, LeftAlignIndels Module.

GATK DepthofCoverage was used to perform coverage analysis and generating coverage summary plots. Each nucleotide position had to have more than 5 reads to be considered as “covered”. Amplicons primers were trimmed from aligned reads using BamUtil Trimbam (http://genome.sph.umich.edu/wiki/BamUtil:_trimBam), trimming 32 bases from the 5′-end.

HPV16 variant calling

The HPV16 genome was genotyped by GATK HaplotypeCaller Version 3.8. SNP and indel calls were made and hard filtered, following GATK Best Practices. All variant calls met the following conditions: QualbyDepth <2.0, FisherStrand >60.0, Root Mean Square Mapping Quality <40.0, Mapping Quality Rank Sum Test <−12.5 and Read Pos Rank Sum Test <−8.0 to avoid strand biases, inflation when there was deep coverage, false calls at the end of the reads and low-quality variant calls. All identified nucleotide variants were manually inspected and were only considered as true variants if the call showed at least 100 reads. For each sample, a whole-genome sequence fasta file was generated. All identified nucleotide variants were annotated in a database including the HPV16 gene or region and amino-acid changes.

HPV16 variant lineage assignment

HPV16 variant lineage assignment was based on the maximum likelihood tree topology constructed using MEGA 710, including 10 HPV16 European and non-European variant lineage reference sequences1, and lineage assignments were confirmed with SNP patterns. Variant lineage assignment was performed for all specimens excluding those with poor read depth (<200 median depth) and/or low genome coverage (<80% genome coverage).

Identification of novel SNPs

All sequences with SNPs were blasted against HPV16 sequences deposited in GenBank (both complete and incomplete genomes, deposited until 18 August 2018), and those polymorphisms not reported in any deposited sequence were considered as “novel”.

Statistical analyses

Descriptive statistics on viral mutations were presented along with 95% confidence intervals (CIs) and significance testing carried out using chi-square tests for differences in proportions between categories. All tests were two-sided and a p value of <0.05 was considered statistically significant.

The study was approved by the Regional Ethical Review Board of Stockholm which determined that informed consent from the participants was not required.

Results

HPV genome coverage

The median HPV16 viral load for all samples originally included was 496 copies/µl (range 103 copies/µl to 25,950 copies/µl). Overall, Illumina sequencing generated high-quality sequencing data at median depths of 200× to over 2000×, that covered 80 to 100% of the HPV16 genome in a single sequencing run (Table 1). A total of 218/242 specimens (90.1%) covered >80% of HPV16 complete genome (7906 bp) with sequencing median depths of >200×, with 135 samples (55.8%) covering 100% of the complete genome with sequencing median depths of >2000×. The median sequence depth per sample was high (11,288×).

Table 1 Sequence depth across genome coverage in 242 cervical samples

To consider a position in the genome as “covered”, a sequence depth of more than 5 reads was considered as mandatory. In one viral region (nt 5314–5404) 80–89/242 specimens (33.1–36.8%) did not present more than 5 reads per position. This 91 bp region is located within the HPV16 L2 gene, accounting for 1.2% of the viral genome and corresponded to amplicon 32 (non-overlapping part). Water controls and HPV16-negative but human DNA-positive controls had no HPV16 sequencing reads.

Study participants

The median age at smear taking was 29 years for women with subsequent CIN3, and 39 years for women with subsequent SCC, with a median time from smear to diagnosis of 1.4 years and 2.1 years, respectively. The control women had a median age of 26 years at their smear (Table 2).

Table 2 Basic characteristics of the study participants

Variant lineages

We determined HPV16 variant lineage assignment based on a phylogenetic maximum likelihood tree for all 218 women, after excluding samples with poor median read depth (<200×) and/or low genome coverage (<80%) (Table 3). The 211 out of 218 women (96.8%) in our study had HPV16 variant lineage A infection, corresponding to 144 (66.1%) A1, 63 (28.9%) A2 and 4 (1.8%) A4 sublineages. We found 7/218 (3.2%) women with an HPV16 non-European variant lineage infection. Lineage C was detected in 1/218 (0.5%) women and lineage D in 6/218 (2.8%) women, corresponding to 1 woman classified as sublineage D1, 1 woman classified as sublineage D2 and 4 women classified as sublineage D3. Lineage B was not found in any participant. Lineage A thus dominated cases of both CIN3 and SCC, with no difference between lineages by phenotype (p = 0.27).

Table 3 HPV16 variants and subvariant lineages by diagnosis

HPV16 SNPs and indels

We detected a total of 598 SNPs and 25 indels (insertions and deletions) across the HPV16 genome when analysing all 218 specimens. We detected 563/598 (94.1%) SNPs and 6/25 (24.0%) indels within the HPV genes and upstream regulatory region (URR; Table 4), while 35/598 (5.9%) SNPs and 19/25 (76.0%) indels were detected within the HPV region 4102–4236 bp (non-coding region between E5 and L2). SNPs and indels occurring within the E4 region were considered as synonymous/missense/nonsense considering both protein-coding genes E2 and E4. Therefore, one SNP in one position, e.g., position 3362, might be considered as missense in E2, but as synonymous in E4. Considering SNPs in all protein-coding regions, we detected a total of 280/510 (54.9%) non-synonymous substitutions, 223/510 (43.7%) silent variations, 7/510 (1.4%) SNPs that translated into premature stop codons and 5 indels (Table 4).

Table 4 HPV16 SNPs and indels

Regions exhibiting greater variability were URR (10.8% variability), followed by E4 (10.1%), E5 (9.5%) and L2 (9.4%), while more conservative regions included the E7, E6 and L1 genes (3.4%, 5.2% and 5.6% variability, respectively). The most variable gene was thus E4 (10.07%, 95% CI 6.59–13.54%) and the least variable E7 (3.37%, 95% CI 1.32–5.42%; p = 0.001, Table 4).

Isolates showed a median substitution of 15 SNPs when we compared their complete sequences to the reference genome HPV16REF (min: 1 SNP, max: 131 SNPs). The majority of SNPs (320/598, 53.5%) were detected in only one specimen each. In all, 224/598 SNPs (37.5%) were identified in 2 to 10 specimens each (1 to 5% of total specimens), 30/598 SNPs (5.0%) were detected in 11 to 21 specimens (5–10% of total specimens) and 24/598 SNPs (4.0%) were found in at least 20 specimens each (>10% of total 218 specimens).

When we stratified by diagnosis (CIS or SCC), 430 SNPs were detected in CIN3 cases, and 387 SNPs in SCCs (Table 5). L2 and E1 genes showed the highest numbers of nucleotide substitutions (98 and 80 SNPs for CIN3 cases, and 84 and 70 SNPs for SCC cases, respectively). E7 was the most conserved gene with only 8 identified SNPs for CIN3 and 9 for SCC (Table 5). Most women (183/218; 83.9%) displayed an E7 sequence which was identical to the reference genome, i.e., fully conserved. By case–control status, 170/204 case women and 13/14 control women displayed 0 SNPs (Table 6). These women all had lineage A1, i.e., the same lineage as the reference genome we used. We observed a maximum of 3 SNPs per sample in the E7 gene, and these occurred in women who had an infection of non-European lineage. Women infected with variants A1–A3 had highly conserved E7 sequences with up to only 1 SNP difference to the reference sequence in their E7 sequence.

Table 5 HPV16 SNPs and indels detected according to diagnosis
Table 6 Number of SNPs in the E7 gene in the 218 study participants

Novel SNPs

We identified up to 147 novel SNPs (not reported in GenBank), including 26 SNPs located within non-coding regions, 26 silent substitutions, 86 non-synonymous nucleotide variations, 5 nonsense substitutions and 4 SNPs that were classified as both silent and non-synonymous, depending on the protein to translate (E4 vs E2 or L2 vs L1).

For each of the 8 protein-coding genes, the proportion of non-synonymous nucleotide substitutions that were novel ranged from 6.9% to 25.6% of all SNPs. All SNPs classified as novel are listed in Supplementary Table 2.

HPV16 isolates

We studied 135 samples with coverage at 100% to determine how many women had identical isolates to each other. In these 135 women, we detected up to 123 unique isolates. Out of 135 women, 114 (84.4%) displayed HPV16 isolates differing by at least one nucleotide to any other woman’s sample. Among the remaining 21/135 women, 6 identical pairs of isolates were shared by two women each (6 isolates, 12 women) and 3 identical isolates were shared by 3 women each (3 isolates, 9 women).

Discussion

We here report HPV16 genomic variation in a population-based set of cervical smears from a stringently designed prospective study of women in the Swedish population who had a normal smear but subsequently developed cervical cancer in situ or invasive squamous cervical cancer (CIN3+). We report the hitherto greatest sequencing depth in the literature. We applied a strong epidemiological study design, a high-quality sequencing protocol and a stringent bioinformatical algorithm. We identified mainly variant lineage A-positive women with a large amount of SNPs, 24.6% of which were classified as novel, and confirmed that the E7 was the HPV gene that exhibits the least variability across samples. We detected more non-synonymous variants (55%) than silent substitutions, and all SNPs were determined not to be the result of an HPV lineage coinfection (data not shown) by identifying putative heterozygous allele calls (HPV is monoploid) at nucleotide positions that distinguished between variant (sub)lineages.

In samples with 100% sequencing coverage, we found that the sequence diversity was extensive, with 84.4% of women infected with isolates differing from those of other women in the study. Our data show perfect concordance with previous studies5 that reported nearly exactly the same estimates of unique circulating isolates (84.5%).

Similar to an American study, we found a dominance of A1/A2 subvariants, but few women with other variants or subvariants, likely due to the relative ethnic homogeneity of the Swedish population. Our finding is also expected given the strong association between A1/A2 subvariants and squamous epithelial cervical lesions3. We included no women with glandular lesions, who may be more prone to display D2/D3 variants3,11. We used the reference genome HPV16REF from The PapillomaVirus Episteme, since the original reference clone has known sequencing errors12. Others have opted to use HPV16R as their reference clone4, but the difference between these two2 reference genomes (HPV16R and HPV16REF) is only one nucleotide variation (A2926G, which was detected in all our 218 specimens). No isolate has yet been identified with identical sequence to HPV16R, but one isolate (GenBank accession number NC001526) shares identical nucleotide sequence to HPV16REF. This isolate does not, however, have the same position numbering as the HPV16REF. Use of the same HPV reference genome and making the calls in the same positions is essential to facilitate sharing and comparing results with other authors13.

The HPV16 E7 gene was recently found to exhibit hypovariability in a large sample of invasive cervical cancers and this hypovariability was posited as a requirement for cervical carcinogenesis5. Using greater sequencing depth, we can here confirm that the E7 gene was by far the most conserved gene across CIN3+ samples, indeed with a maximum of only 1 SNP difference observed to the reference among women infected with subvariants A1–A3, and only up to 3 SNPs even in women infected with other subvariant lineages.

The strengths of this study include the population-based prospective design within a national cervical screening programme. This meant that the HPV16-positive samples we included are guaranteed to have a very high generalisability to the female population. Other strengths include the comprehensive genotyping procedure assuring HPV16 presence in the sample, and stringency of sequencing analyses and bioinformatical procedures. Indeed, even in these archival samples, originally far from optimised for DNA sequencing, we obtained excellent statistics on depth and coverage. We improved 3/4 poorly performing amplicons previously reported4 using a deeper throughput as well as by dividing amplification primers into five different reactions to reduce occurrence of self-dimers and cross-primer dimers. While certain positions from one set of primers (amplicon 32) did not show coverage for up to 36.8% samples (89/292), the other 3 amplicons (amplicons 24, 26 and 47) had all their positions covered in at least 249/292 (85.3%) of samples. With the very high sequencing depth, the risk for erroneous calling of genomic variation was minimised.

The chief limitation of our study was that due to few control women being positive for HPV16, meaningful analyses of risk associations comparing to healthy controls were not possible and inference on E7 hypovariability in healthy women was limited due to the low numbers. We also had low numbers of multiple infections with both HPV16 and other HPV genotypes in the original study and chose to not include any such samples in this study.

Finally, most of the smears analysed in this study were originally sampled in the late 1980s to early 2000s, meaning that the isolates we observed may be somewhat different from today’s circulating isolates. However, given the limited mutation rate of HPV types over time2, we do not believe this to have substantially influenced our results.

In conclusion, we here describe high-precision lineage assignation and mutation calling for a national population-based study of women with normal smears who later developed cervical cancer in situ or invasive cervical cancer. A1/A2 variant sublineage infections were most common, and the E7 gene exhibited the least variability among the HPV genes—close to the point of full conservation. The large HPV16 genomic variability would be of interest to study as a possible risk factor in cervical screening.