Use of Cap Analysis Gene Expression to detect human papillomavirus promoter activity patterns at different disease stages

Transcription of human papillomavirus (HPV) genes proceeds unidirectionally from multiple promoters. Direct profiling of transcription start sites (TSSs) by Cap Analysis Gene Expression (CAGE) is a powerful strategy for examining individual HPV promoter activity. The objective of this study was to evaluate alterations of viral promoter activity during infection using CAGE technology. We used CAGE-based sequencing of 46 primary cervical samples, and quantitatively evaluated TSS patterns in the HPV transcriptome at a single-nucleotide resolution. TSS patterns were classified into two types: early promoter-dominant type (Type A) and late promoter-dominant type (Type B). The Type B pattern was more frequently found in CIN1 and CIN2 lesions than in CIN3 and cancer samples. We detected transcriptomes from multiple HPV types in five samples. Interestingly, in each sample, the TSS patterns of both HPV types were the same. The viral gene expression pattern was determined by the differentiation status of the epithelial cells, regardless of HPV type. We performed unbiased analyses of TSSs across the HPV genome in clinical samples. Visualising TSS pattern dynamics, including TSS shifts, provides new insights into how HPV infection status relates to disease state.

start sites (TSSs) in the HPV genome, and have used this technology to quantify the activity of multiple promoters in three cell lines and one patient sample 13 . Direct evaluation of TSSs may represent a novel diagnostic strategy to assess HPV infection status and disease progression.
HPV genes are differentially expressed in parallel with the differentiation programme of the cervical epithelium. At the initial stage of HPV infection, the copy number of the viral genome in cells in the basal layer of the cervical epithelium is very low. Viral DNA replication proceeds along with epithelial differentiation 14,15 . In the upper epithelial layers, the viral late genes L1 and L2 are expressed to allow viral capsid assembly, packaging, and shedding from the superficial layer of the epithelium. As the viral late gene expression is promoted, E2 suppresses the activity of the early promoter by binding to the E2 binding sites (E2BS) of the LCR [16][17][18] . Thus, in the late stages of epithelial differentiation, HPV early promoter activity is relatively suppressed. As the severity of cervical intraepithelial neoplasia (CIN) increases, sustained high expression of E6 and E7 is driven by the early promoter, and, conversely, L1 gene expression is suppressed 19 . Several methods have been devised to evaluate the expression of late genes, such as L1 or E4, as biomarkers for CIN progression [19][20][21][22][23][24][25] . Precise evaluation of the late gene expression patterns could support their use as novel biomarkers for cervical cancer progression.
In this context, we propose that a quantitative assessment of promoter activity, by evaluating TSS activity, would allow for classification of HPV status, as well as CIN severity. In the present study, we developed a novel approach for the evaluation of differences of viral promoter activity at the single-nucleotide level using CAGE technology in clinical HPV samples.

Results
HPV TSS patterns of cervical lesions. Forty-six cervical lesions, from normal and cancerous lesions, were analysed by nAnTi-CAGE or nanoCAGE (9 for nAnTi-CAGE and 37 for nanoCAGE). As the principle of both nAnTi-CAGE and nanoCAGE is highly similar, we first performed nAnTi-CAGE analysis for 9 samples, and we used 37 samples for nanoCAGE analysis, which is a novel technology developed after nAnTi-CAGE to meaningfully observe TSS pattern dynamics with CAGE analysis. The HPV TSS patterns were classified into broad TSS types. First, we visualised TSS activity at a single-nucleotide level using ZENBU software 26 . We identified two TSS patterns when focusing on the most activated TSS clusters: the early promoter-activated pattern and the late promoter-activated pattern, which were designated Type A and Type B, respectively. To analyse multiple HPV subtypes in parallel, we defined broad windows containing the early and late promoters in any HPV genome: from nucleotide 80 to 110, and from nucleotide 600 to 950, respectively. We discovered TSS patterns indicative of the early and late promoters, and we subsequently refined the TSS pattern definitions so that Type A included samples where one-third of early promoter activity ≥ late promoter activity; Type B, one-third of early promoter activity < late promoter activity (Fig. 1). The cervical lesion grades and corresponding HPV TSS types are summarised in Fig. 2 and Table 1. Type B was more common in CIN2 or CIN1 than in other samples, while CIN3 or cancerous lesions were predominantly Type A (chi-square test, p = 0.0224), and the observed frequency of Type B decreased with CIN progression (Cochran-Armitage test, p = 0.0208).
We then investigated whether the initial observation of multiple TSS patterns would be supported by a more systematic approach. We fitted Gaussian mixture models 27 to investigate the accuracy of the classified HPVderived TSS types. Among 37 samples analyzed by nano-CAGE, 33 samples of which HPV-derived TSS was detected were included in this study. Of them, 2 samples were co-infected with two HPV genotypes. Thirty-five HPV-derived TSS types were classified by Gaussian mixture models, and compared to the types of HPV-derived TSSs classified according to the averaged difference in expression between the early and late promoters, defined as (early -late)/(early + late). The model with the highest likelihood was univariate, with two components and unequal variance: this corresponded closely to Type A and the union of Type B, since only one Type A sample (sample #27) was classified as Type B (Fig. 3).
TSS patterns in multiple infections. We detected transcriptomes of multiple HPV types in five samples in the current study ( Table 1). The following co-infections were observed: HPV-16 and HPV-52 (samples C1072_ACG and #30); HPV-31 and HPV-58 (sample #6); HPV-16 and HPV-58 (sample #27); and HPV-67 and HPV-58 (sample #12). Interestingly, the TSS patterns of both detected HPV types were the same in each sample (Fig. 4). Furthermore, a dominant HPV type was apparent in each case of co-infection ( Fig. 4 and Table 1).

Assessment of small promoters by nAnTi-CAGE technology.
In a previous study, we identified numerous HPV-derived TSS clusters in a CIN cell line and a CIN sample 13 . In the present study, we used nAnTi-CAGE to detect small HPV16-derived TSS clusters, as well as the prominent early and late promoters, in clinical samples ( Table 2). One of the small TSS clusters was found to be for the E8^E2 gene, and is located at nt1125-1148. We identified the E8^E2 TSS in 3 of 6 cancer samples and 2 of 3 CIN samples. Another small TSS cluster found to be for the E5 gene, located at nt3391-3420 7 , was identified in all CIN samples. Furthermore, we also identified a cluster located at nt12-15 in all cervical cancer samples. Focusing on the early and late promoters, as well as the cluster located at nt12-15 for the E6/E7 genes and nt3391-3420 for the E5 gene, there are changes in gene expression according to the usage of each viral promoter (Fig. 5).

Discussion
In the present study, we noted that the TSS patterns in the HPV genome may reflect the lesion stage of infected tissue. In all cancer samples and in several CIN samples, the prominent TSS patterns corresponded to the early promoter, while in low-grade CIN samples, the dominant TSS clusters had shifted from the early to the late promoter. Furthermore, in lesions with multiple infections, the prominent TSS patterns were the same, regardless of HPV type. representing both normal and cancerous lesions, were analysed by nAnTi-CAGE or nanoCAGE (9 for nAnTi-CAGE and 39 for nanoCAGE). The HPV TSS patterns were investigated and classified by the prominent TSS types. Regardless of the HPV type, early and late promoter activity was defined by the numbers of TSSs in each transcriptome that started either between nucleotides 80 and 110, or between nucleotides 600 and 950. The TSS patterns were defined as follows: Type A, one-third of early promoter activity ≥ late promoter activity; Type B, one-third of early promoter activity < late promoter activity. To visualise TSS levels at the single-nucleotide level, nanoCAGE data were visualised using ZENBU software. Representative data for each TSS pattern are shown. TSS transcription start site. www.nature.com/scientificreports/ Table 1. Summary of clinical information for patients with cervical lesions. Forty-six cervical lesions, from normal and cancerous lesions, were analyzed by nAnTi-CAGE or nanoCAGE (9 for nAnTi-CAGE and 37 for nanoCAGE). The HPV TSS patterns were classified into broad TSS types: the early promoter-activated pattern and the late promoter-activated pattern, which were designated Type A and Type B, respectively. We defined broad windows containing the early and late promoters in any HPV genome: from nucleotide 80 to 110, and from nucleotide 600 to 950, respectively. TSS patterns were defined as follows: Type A, one-third of early promoter activity ≥ late promoter activity: and Type B, one-third of early promoter activity < late promoter activity. AIS adenocarcinoma in situ; CIN cervical intraepithelial neoplasia; CxCa cervical cancer; NILM negative for intraepithelial lesion malignancy; TSS transcription start site.  #34  37  NILM  52  Type A  1048  48  --#40  49  NILM  ------#25  28  NILM  16  Type B  2  3  --#13  29  CIN1  16  Type B  2  3  --C1072_ATG  37  CIN1  16  Type B  441  17,349 --#4  36  CIN1-2 31  Type A  71    www.nature.com/scientificreports/ Quantitative visualisation of TSS activation had previously been difficult to capture. Previous studies revealed the presence of two major promoters on the HPV genome: the early and late promoters. The activation of these promoters has been extensively investigated in reporter assays using cultured cells. However, quantitative evaluation of promoter activity in clinical samples remained challenging. Furthermore, overlapping transcripts complicated the quantitative evaluation of the individual transcripts. CAGE technology facilitates this by enabling the detection of precise TSSs and the quantitative evaluation of their activity. In the present study, we quantitatively assessed TSS activation in clinical samples using CAGE, and determined the occurrence of at least two TSS patterns in clinical samples: Type A and Type B. Considering that the expression of late genes is up-regulated in later stages of the viral life cycle, which is coordinated with epithelial differentiation, the Type B TSS pattern could Table 2. Summary of the HPV-16-derived tag numbers of cervical samples by nAnTi-CAGE. CAGE tag 5′-coordinates were used for Paraclu clustering with the following parameters: (i) minimum five tags per cluster; (ii) (maximum density/baseline density) ≥ 2; and (iii) 100-bp maximum cluster length. Tag numbers < five were designated as negative for each TSS cluster. nt12-15, nt1125-1148, and nt3391-3420 are highlighted in bold. "·" indicates negative for each TSS cluster (tag numbers < five).  Fig. 2, most of the samples with lower CIN grades (i.e. CIN1 and CIN2) showed the Type B pattern, which could represent a normal viral life cycle, while the Type A pattern accounted for a larger proportion in CIN3 and cancer samples. Another important finding was the detection of weak TSS cluster activity, such as that of the TSS clusters at nt12-15, nt1125-1148, and nt3391-3420. In particular, in our cohort, a weak E6/E7 TSS cluster, nt12-15, was only detected in cervical cancer samples; however, an E5 TSS cluster, nt3391-3420, was only detected in CIN samples. In addition to the TSS patterns, the expression of these weak TSS clusters could serve as diagnostic biomarkers for cervical cancer progression. In the present study, we also identified the cluster at nt1125-1148, a TSS cluster of E8^E2 28 , which is regulated by E1 and E2. The E8^E2 protein plays an important role in regulating viral genome replication during the course of infection [28][29][30][31] , and E8^E2 expression inhibits the proliferation of cancer cells 32 . However, until now, there has been no evidence for the existence of E8^E2 in clinical samples. After a direct evaluation of TSSs, we report here for the first time the identification and quantitation of an activated E8 promoter in clinical samples in three of seven cervical cancers and two of three CIN samples. Further analysis and clinical follow-up of specific patients are required to elucidate the association between E8^E2 expression and cancer progression.
We then demonstrated that the TSS pattern was the same in co-infected samples, regardless of the HPV types involved. It is plausible that viral gene expression changes in parallel with the differentiation of the infected epithelial cells 33,34 . The viral gene expression pattern may thus be determined by the differentiation status of the epithelial cells, regardless of HPV type. In well-differentiated superficial cells, the HPV late promoter is activated 33,34 . In contrast, in high-grade CIN samples, lack of epithelial differentiation may be associated with a stable expression of HPV early genes, such as genes encoding the E6 and E7 oncoproteins.
We originally defined the A and B TSS patterns based on visual inspection of the data, and defined Type A as having early > late promoter activity, and Type B as having late > early promoter activity. Independent classification based on Gaussian mixture models suggested that these definitions could be refined using machine learning. Nevertheless, we demonstrated the feasibility of a novel method for the evaluation of altered HPV promoter activity during disease progression in clinical samples. This constitutes a proof-of-principle for the utility of TSS patterns as a diagnostic marker for CIN severity or progression. An extended CAGE study with more samples would allow for further assessment of the possibility of linking TSS patterns to disease state. Such a study would need to balance the requirement for screening a large number of samples with the requirement for sequencing a sufficiently high number of reads from each sample. Using the classification proposed in the present study, distinguishing between Type A and B patterns required at least 16 tags (in total) for the early and late promoters. www.nature.com/scientificreports/ The HPV transcriptome represents only a fraction of the available sequence libraries, varying roughly between 1 per million and 1 per cent. Therefore, either new samples should be sequenced at a depth of 10 to 20 million reads, or an enrichment method should be developed to address this issue. A limitation of the present study was that the number of clinical samples was not sufficient to allow statistical validation of the association between different TSS patterns and the severity of CIN lesions, or the differentiation status of the epithelium.
In conclusion, in this study we demonstrated the feasibility to analyse TSS activity at the single-nucleotide level using CAGE technology in clinical HPV samples. Further work on a larger cohort following the same patients over time will be needed for determining the sensitivity and specificity of the quantification of dynamic changes of TSS patterns as a biomarker of disease progression.

Methods
Patients and clinical samples. HPV-infected cervical tissues were obtained from biopsy or surgery samples. Diagnosis was confirmed by experienced pathologists and gynaecological oncologists through pathological and colposcopic examination at the University of Tokyo Hospital. HPV-infected cervical tissues were also examined by H&E staining, and the extent of dysplasia was evaluated. Cervical intraepithelial neoplasia was categorised as grade 1, 2, and 3 (CIN1, CIN2, and CIN3) depending upon the proportion of abnormal cell thickness. Then, experienced gynaecological oncologists confirmed the biopsied samples as CIN or a cervical cancer lesion. The samples for CAGE analysis were taken from the same area that met the diagnostic criteria of a cervical lesion. All experimental procedures were approved by the institutional review board at The University of Tokyo (approval number G0637-6), and signed informed consent for the use of the tissues and genomic data was obtained from each participant. Preparation of nanoCAGE libraries at RIKEN was approved by the institutional review board at the Yokohama Campus (approval number H26-26). For the analysis, RNA (5 µg for nAnTi-CAGE and 500 ng for nanoCAGE) was extracted from each sample using an miRNeasy kit (Qiagen, Hilden, Germany). RNA quality was assessed using a Bioanalyzer (Agilent) and standardised to an RNA integrity number (RIN) of > 7.0 for nAnTi-CAGE or > 5.0 for nanoCAGE. The purity of RNA samples was assessed using NanoDrop analysis, which confirmed that the A 260 /A 290 and A 260 /A 230 ratios were > 1.7.
Ethical considerations. This study was approved by the institutional review board at The University of Tokyo (approval number G0637-6) in accordance with the Declaration of Helsinki. All patients provided written informed consent for study participation.
nAnTi-CAGE library construction. First-strand cDNA was transcribed to include the 5′-end of capped RNA, and CAGE 'barcode' tags were attached as previously described 35 . The sequenced CAGE tags were mapped to the HPV-16 and HPV-52 genome based on the infected HPV genotypes using BWA software (v0.5.9), discarding ribosomal or non-A/C/G/T base-containing RNAs. For the HPV-16 genes, CAGE tag 5′-coordinates were used for Paraclu clustering 36 with the following parameters: (i) minimum five tags per cluster; (ii) (maximum density/baseline density) ≥ 2; and (iii) 100-bp maximum cluster length. Tag numbers < 5 were designated as negative for each TSS cluster. nanoCAGE library construction. nanoCAGE libraries were constructed from isolated RNA as previously described 12 , with some modifications. The reverse-transcription products were eluted in 40 µL, and qPCR was conducted using the SYBR Premix Ex Taq kit (TaKaRa). Cycle numbers were estimated as Ct + 4 cycles, and PCR was conducted to generate cDNA using the Ex Taq enzyme (TaKaRa). PCR products were eluted in 30 μL of sterile distilled water after purification, and 0.3 ng of each sample was tagmented individually at 55 °C for 5 min. The extension time of the final PCR was 30 s, and the final purification was achieved using one volume of AMPure reagent (Beckman Coulter, Inc), with the products eluted in 25 μL of reaction mixture. The multiplexed libraries were then paired-end sequenced in five lanes of a HiSeq 2000 sequencer (Illumina) and aligned to human genome version hg38 supplemented with all the HPV genomes available on the Papillomavirus Episteme database 37 on 5 Sep, 2016, using the CAGEscan pipeline v3.0 (https ://gitla b.com/mcfri th/cages can-pipel ine, Kratz et al., in preparation), which assembles overlapping pairs originating from the same molecule and maps them to the genome using the LAST aligner 38 . Statistics. The association between cervical lesion grades and TSS types was evaluated by the chi-square test and Cochrane-Armitage test using JMP Pro version 12.2.0 (SAS Institute, Cary, NC, USA). A p value < 0.05 was considered statistically significant. The 'densityMclust' function in the R package mclust v5.4 27 was used to compare the likelihood of different Gaussian mixtures. After defining a score as the average difference between the expression levels of the early and late promoters, we fitted Gaussian mixture models to these scores. The TSS patterns were classified according to the averaged difference in expression between the early and late promoters, calculated as (early-late)/(early + late). The minimum number of samples required for achieving confidence < 0.25 (i.e. to distinguish between Types A and B) was determined using the 'ciss.wald' function in the R package binomSamSize v0.1-5 (Fig. 3).

Data availability
Demultiplexed sequence files are being submitted to the Japanese Genotype-Phenotype Archive (JGA).