Chromatin structure predicts survival in glioma patients

The pathological changes in epigenetics and gene regulation that accompany the progression of low-grade to high-grade gliomas are under-studied. The authors use a large set of paired atac-seq and RNA-seq data from surgically resected glioma specimens to infer gene regulatory relationships in glioma. Thirty-eight glioma patient samples underwent atac-seq sequencing and 16 samples underwent additional RNA-seq analysis. Using an atac-seq/RNA-seq correlation matrix, atac-seq peaks were paired with genes based on high correlation values (|r2| > 0.6). Samples clustered by IDH1 status but not by grade. Surprisingly there was a trend for IDH1 mutant samples to have more peaks. The majority of peaks are positively correlated with survival and positively correlated with gene expression. Constructing a model of the top six atac-seq peaks created a highly accurate survival prediction model (r2 = 0.68). Four of these peaks were still significant after controlling for age, grade, pathology, IDH1 status and gender. Grade II, III, and IV (primary) samples have similar transcription factors and gene modules. However, grade IV (recurrent) samples have strikingly few peaks. Patient-derived glioma cultures showed decreased peak counts following radiation indicating that this may be radiation-induced. This study supports the notion that IDH1 mutant and IDH1 wildtype gliomas have different epigenetic landscapes and that accessible chromatin sites mapped by atac-seq peaks tend to be positively correlated with expression. The data in this study leads to a new model of treatment response wherein glioma cells respond to radiation therapy by closing open regions of DNA.

www.nature.com/scientificreports/ closed inaccessible DNA chromatin with subsequent gene down-regulation [18][19][20] however atac-seq data on IDH1 mutant glioma samples is limited. Atac-seq is a relatively new sequencing technology that identifies open accessible DNA regions (e.g. "peaks") with very little cellular input enabling the analysis of surgically resected samples. Its low-cost and relative ease has led to wide-spread use across multiple cell types. However, when atac-seq data is used in isolation without additional lines of supporting evidence, it has been difficult to determine how best to interpret the biological significance of these open regions (measured as "peaks" from aligned sequencing reads). Data sets involving multiple modalities (atac-seq and RNA-seq) on the same samples would be useful in supporting or refuting these assumptions. To address these short-comings this study combines a large set of atac-seq and RNA-seq data from surgically removed glioma specimens to create an atac-seq/RNA-seq matrix to identify highly correlated peaks and genes. By correlating these data sets with demographic and survival data we have identified open DNA regions that have prognostic and thereby likely biological significance.

Methods
Generation of clinical database. The University of Cincinnati maintains an IRB-approved biorepository of surgically resected specimens with associated clinical and demographic information including medical record number, diagnosis, name, gender, age, date of collection and survival. A collection of 38 specimens were chosen from the bio-repository with a goal to obtain a heterogenous mixture of low-grade and high-grade specimens as well as those with long and short survival. Specimens were between 2 and 9 years old prior to selection to allow for adequate follow-up for survival analysis. All specimens denoted as "Grade II, " "Grade III, " or "Grade IV primary" are taken from the patient's initial surgery and have not been exposed to any chemotherapy or radiation treatment. All grades were assigned based on the WHO guidelines of that time. Since then, grading guidelines have changed to include more molecular markers (e.g. IDH1, p53, 1p/19q) 21 . Following the initial surgery all patients underwent standard therapy including temozolomide and radiation. Survival was calculated as time between surgical resection and death. The presence of the IDH1 mutation was determined via either immunostaining or by direct sequencing.
Protocol for generating Atac-seq libraries. Nuclei were isolated from patient tumor tissue as described by Habib et al. 22 , with slight modifications. Briefly, frozen tissue samples (< 0.5 cm) were incubated on ice in 1 mL of Nuclei EZ Lysis Buffer (Sigma, #Nuc-101). The tissue was then homogenized using a glass tissue grinder. 1 mL of additional EZ Lysis Buffer was added, the tissue was incubated on ice for 5 min, and then filtered through a 70 μm cell strainer. Nuclei were collected by centrifugation and resuspended in 1 mL ATAC-resuspension buffer (10 mM Tris-HCl (pH 7.4), 10 mM NaCl, 3 mM MgCl 2 , and 0.1% Tween-20) and filtered through a 40 μm cell strainer. 1 mL of ATAC-resuspension buffer was then added and the nuclei were filtered through a 5 μm cell strainer. ATAC-seq library preparation was performed as previously described 23 . Processing of ATAC-seq data. ATAC-seq reads in FASTQ format were first subjected to quality control to assess the need for trimming of adapter sequences or bad quality segments. The programs used in these steps were FastQC v0.11.7, Trim Galore! v0.4.2 and cutadapt v1.9.1. The trimmed reads were aligned to the reference human genome version GRCh38/hg38 with the program HISAT2 v2.0.5. Aligned reads were stripped of duplicate reads with the program sambamba v0.6.8. Peaks were called using the program MACS v2.1.2 using the broad peaks mode.

RNA-seq library creation.
To obtain the consensus set of unique peaks, called peaks from all samples are merged at 50% overlap using BEDtools v2.27.0. The consensus peaks, originally in BED format were converted to a Gene Transfer Format (GTF) to enable fast counting of reads under the peaks with the program featureCounts v1.6.2. Each feature in the GTF file has the value "peak" on the second column. Peaks located on chromosomes X, Y and mitochondrial DNA are excluded from further analysis. Raw read counts are normalized with respect to library size and transformed to log2 scale using rlog() function in R package DESeq2 v1.26.0. Random forest regression model. Log-transformed data were filtered to remove low variance peaks using the VarianceThreshold function from scikit-learn v0.24.1 using a threshold of 0.45. The resulting 8439 features were used to build a random forest regression model using RandomForestRegressor from the ensemble module of scikit-learn. The options used was n_estimators: 1000, max_samples = 0.9, oob_score = True. Permutation importance was used to rank the peaks. The out-of-bag R 2 value for the regression was 0.31.
Elastic-net regularized generalized linear models. We selected the top 20 peaks with highest mean feature importance value from 8439 peaks. To assess the survival predictive power of clinical metadata and top 20 ATAC-seq peaks, we trained elastic-net regularized generalized linear models using 'glmnet' package in R. We tested the performance of three different models that are built using following features (1) top 20 ATAC-seq peaks (2) clinical metadata only; (3) clinical metadata + top 20 ATAC-seq peaks. Leave one out cross validation imputation was applied and alpha parameter was explored between 0 and 1 with incrementing step size of 0.1. www.nature.com/scientificreports/ Out of 36 samples, IDH mutation information of 4 samples was not available. These 4 samples were excluded from models which include clinical metadata as features. Prediction accuracy of models was assessed by computing R 2 value.

Scientific Reports
Processing of RNA-seq data. The quality control check on RNA-seq reads was performed with FastQC v0.11.7. Adapter sequences and bad quality segments were trimmed using Trim Galore! v0.4.2 and cutadapt v1.9.1. The trimmed reads were aligned to the reference human genome version GRCh38/hg38 with the program STAR v2.6.1e. Duplicate aligned reads were removed using the program sambamba v0.6.8. Gene-level expression was assessed by counting features for each gene, as defined in the NCBI's RefSeq database. Read counting was done with the program featureCounts v1.6.2 from the Rsubread package. Raw counts were normalized with respect to library size and transformed to log2 scale using rlog() function in R package DESeq2 v1.26.0.
ATAC-seq and RNA-seq correlation analysis. Sixteen samples have atac-seq chromatin accessibility as well as RNA-seq gene expression data which is used for correlation analysis. To compare the chromatin accessibility and gene expression, we identified the peak-gene pairs which are located within same topologically associating domain (TAD). Peaks having at least 90% overlap with TAD are selected for further analysis. Out of 8439 peaks, 8419 peaks show ≥ 90% overlap with TAD region and 8416 peaks have at least 1 gene within the overlapping TAD. Pearson correlation coefficient was calculated for each peak-gene pair.
Analysis of differential chromatin accessibility.  25 ) underwent a radiation dose of 9 Gy in three fractions in a XenX (X-Strahl, Suwanee, GA) pre-clinical cabinet irradiator, with output calibrated using NIST-traceable instruments. The instrument parameters were: 220 kVp, 0.67 mm Cu HVL, dose rate ~ 6 cGy/s, delivered using a 100 mm × 100 mm collimator with 2 cm backscatter below the cell plates.
Sequencing database. The human database is publicly available from "basespace" login.illumina.com.
All experimental protocols were approved by the University of Cincinnati IRB committee. This study was used de-identified patient specimens and was designated "Not human research" 19-02-25-02 (5/3/2019). Consent is not applicable.
Human database. The creation of the human database followed all methods in accordance with relevant guidelines and regulations. This study was used de-identified patient specimens and was designated "Not human research" 19-02-25-02 (5/3/2019). Consent is not applicable.
Consent to participate. This study does not involve human subjects.

Results
Atac-seq/RNA-seq correlation matrix creates chromatin maps with candidate target genes. This study includes thirty-eight glioma specimens spanning all grades (Grade II-7 specimens, Grade III-9 specimens, Grade IV primary-9 specimens, Grade IV recurrent-13 specimens). Seventeen specimens were determined to have an IDH1 mutation, 17 specimens were determined to be IDH1 wildtype. In four specimens, the IDH1 mutant status could not be determined. The collection includes three paired specimens in which a single patient underwent two surgeries (MG 18/MG19, MG20/MG21, MG27/28). Sixteen samples were of sufficient quality to allow RNA-sequencing analysis. The demographic and clinical information is shown in Table 1. Restricting analysis to those peaks with a univariate correlation of |r 2 | ≥ 0.6 revealed that the majority of peaks were positively correlated with survival (Fig. 1A) and positively correlated with gene expression (Fig. 1B).
Gene ontology enrichment analysis of the identified correlated genes (peak-survival correlation |r 2 | ≥ 0.6 and peak-gene correlation |r 2 | ≥ 0.6) identified "Nervous System Development" as the most enriched gene module (Fig. 1C). Clustering using Uniform Manifold Approximation and Projection (UMAP) identified two clusters that segregated by IDH1 mutant status (Fig. 1D,E). While it is widely presumed that the IDH1 mutation induces DNA methylation and a closed conformation [18][19][20] , there was a non-significant trend (p = 0.1) towards IDH1 mutant samples having more peaks (Fig. 1F).
Prognostic model predicts identifies atac-seq peaks associated with grade/survival. Restrict-ing the focus to only those genes with the most prognostic value, a random forest model identified 8439 peaks that were predictive of survival. The majority of these peaks mapped to intergenic regions. Using the atac-seq/ RNA-seq correlation matrix, 29,679 genes were correlated with the identified 8439 peaks (gene-peak correlation |r 2 | ≥ 0.6). Contrary to the hypothesis that peaks target the "nearest gene", only 1863 (2.5%) of these correlations were between a peak and its "nearest gene". Linear regression models identified six peaks that predicted survival www.nature.com/scientificreports/ with a high degree of accuracy (R 2 = 0.6845). The six most closely correlated target gene regions are shown in Fig. 2A . 2B). Combining the atac-seq peaks and clinical variables created the most accurate model (R 2 = 0.8164 Fig. 2C) indicating that four of the atac-seq peaks have predictive value independent of traditionally used clinical variables. www.nature.com/scientificreports/  www.nature.com/scientificreports/

Recurrent grade IV gliomas are associated with fewer open chromatin sites. Each grade was
compared to all others to identity any differentially represented atac-seq peaks (Fig. 3). These grade-specific peaks then underwent HOMER motif analysis to determine which likely transcription factors drove gene expression in each grade. Finally, peak-associated genes were identified and underwent gene ontology analysis to determine candidate target gene pathways for each grade. Grade II (2326 peaks, 3204 genes), grade III (1708 peaks, 2508 genes), and grade IV (primary) (3058 peaks, 3819 genes) samples had similar and heavily overlapping sets of transcription factor motifs and enriched gene modules (Fig. 3). In contrast, grade IV recurrent (14 peaks, 37 genes) had relatively few peaks and no significantly enriched gene modules.
To determine if this finding was specific to the identified 8439 peaks or part of a larger trend, the total peak count was calculated for each of the grades. Similar to the previous finding, grade IV (recurrent) samples had significantly fewer total peaks than grade II samples (p = 0.02) (Fig. 4A,B). This collection contains three "paired" samples each of which were grade IV that underwent resection followed by standard adjuvant treatment (temozolomide and radiation) and later a second resection. In each case, the later sample ("recurrent" sample) had fewer peaks (Fig. 4C). This invites two possibilities. The first possibility is the decrease in peak count is a selection process where only cells with fewer peaks grow back. The second possibility is that the decrease in peak count is a consequence of the clinical treatment (e.g. surgery, temozolomide and radiation). In support of the latter induction hypothesis, MG 30/31 was a patient who had two distinct tumors that were present on initial diagnosis. The first (MG 30) was resected and the second tumor (MG31) underwent chemotherapy and radiation followed by resection. To provide further evidence for the latter hypothesis, three previously characterized patient-derived glioma cultures (HK296 24 , HK357 24 , TS603 25 ) underwent atac-seq analysis before and after 9 Gray of radiation in three daily fractions. All three cell lines showed a decrease in peak count following radiation (Fig. 4D).

Discussion
Glioblastoma is a currently incurable disease and despite decades of research and hundreds of clinical trials, the prognosis remains grim. One frustration encountered by several trials 6 including those investigating the roles of radiation 4 is that improvements in progression-free survival do not always translate into improvements in overall survival. One possible explanation for this phenomenon is that while certain therapies may initially www.nature.com/scientificreports/ impede tumor growth those cells that survive may take on additional attributes that make them more destructive and refractory to treatment. The data from this study provides an intriguing new model for this transition. The majority of atac-seq peaks identified in this study were correlated with better survival and tended to positively correlate with genes of nervous system differentiation. There was a non-significant trend for total peak count to decrease as the grade increased with the most significant decrease seen in the recurrent grade IV samples. It may be the case, that the closing of DNA regions responsible for driving neural differentiation induces more malignant behavior and that this process can be accelerated by radiation. If this were true, it may be beneficial to pair radiation therapy with other compounds that could potentially minimize this unwanted side effect. With advancements in the availability and cost of sequencing, traditional histological classifications have been replaced with more sophisticated genetic markers. Despite the wealth of glioma expression data, attempts to identify patterns of gene expression that predict survival have been underwhelming 32,33 . In contrast, epigenetic data such as DNA methylation (e.g. G-CIMP) seems to have a high degree of prognostic value 18,32,34 . Supporting this observation, this study created an accurate prognostic survival model with only six atac-seq peaks, four of which remained significant when controlling for grade, age and IDH1 status.
The discovery of the IDH1 35 and H3K27M 36,37 mutations has revealed the importance of epigenetic dysfunction in the early stages of cancer development. The presumed mechanism of the IDH1 mutation is to favor DNA methylation by inhibiting DNA demethylase enzymes (e.g. Tet family) 18,19 . This is hypothesized to induce decreased DNA accessibility with subsequent gene down-regulation 20 . In the current study, UMAP clustering divided samples by IDH1 status although several IDH1 mutant samples clustered with the IDH1 wildtype group. Running counter to the hypothesis that the IDH1 mutation leads to hyper-methylation and decreased DNA accessibility, there was a trend for IDH1 mutant samples to have more total peaks. This implies that the relationship between DNA methylation and DNA accessibility may be more complicated than previously assumed.
The relative ease and low input requirement of atac-seq technology has led to wide-spread use on a variety of tissue types [38][39][40] . However, while the data is easy to obtain it is challenging to interpret without additional sources of supporting evidence. In the absence of this supporting evidence several assumptions have become common in the literature namely that a given peak positively regulates the nearest gene. In the current study, the authors assembled a sufficiently large dataset to create a correlation matrix to make some rational assumptions about www.nature.com/scientificreports/ the relationships of peaks and target genes. Using relatively strict cut-offs, the data showed that most peaks were correlated with one or at most a few genes. The majority of peaks correlated positively with target gene expression. However, there was no correlation between peaks and the nearest genes. On the contrary, peaks and correlated genes were frequently separated by long distances and including many intervening genes. It should be noted that while this study provides some evidence of specific intergenic regions regulating specific target genes, the data is correlational and indirect and needs to be strengthened by alternative methods such as genetic manipulation (CRISPR) or chromosome conformation capture-based methods.
In conclusion, this study uses an innovative peak:gene correlation matrix to create an epigenetic gene regulatory map of gliomas and then uses this matrix to both identify specific gene regulatory regions of interest as well as introduce a possible mechanism of epigenetic malignant progression.

Data availability
All sequencing data will be made publicly available.