Brain metastasis DNA methylomes, a novel resource for the identification of biological and clinical features

Brain metastases (BM) are one the most lethal and poorly managed clinical complications in cancer patients. These secondary tumors represent the most common intracranial neoplasm in adults, most frequently originating from lung cancer, breast cancer, and cutaneous melanoma. In primary brain tumors, such as gliomas, recent advances in DNA methylation profiling have allowed for a comprehensive molecular classification. Such data provide prognostic information, in addition to helping predict patient response to specific systemic therapies. However, epigenetic alterations of metastatic brain tumors with specific biological and translational relevance still require much further exploration. Using the widely employed Illumina Infinium HumanMethylation 450K platform, we have generated a cohort of genome-wide DNA methylomes from ninety-six needle-dissected BM specimens from patients with lung cancer, breast cancer, and cutaneous melanoma with clinical, pathological, and demographic annotations. This resource offers an unprecedented and unique opportunity to identify novel DNA methylation features influencing the behavior of brain metastasis, and thus accelerate the discovery of BM-specific theranostic epigenetic alterations.


Background & Summary
Patients diagnosed with brain metastasis (BM) have a poor quality of life and a dismal prognosis, with survival ranging from three to 25 months 1,2 . While developments in systemic drug treatments have significantly improved the survival of patients with extracranial metastases, BM lesions have shown limited response rates to these approaches 3 . This trend, along with improvements in neuro-diagnostic imaging techniques, has resulted in an increased incidence of metastatic brain tumors 4 . In large epidemiological studies, 8-10% of patients with solid tumors develop BM, this number increases up to 26% when brain autopsy studies are performed [5][6][7][8][9] . In fact, BM represents the most common intracranial neoplasm in adults, outnumbering even primary brain tumors. Interestingly, reflecting an organotropic behavior of tumor cells, the vast majority of secondary brain neoplasms (75-90%) are originated in patients with lung cancer, breast cancer, and cutaneous melanoma [5][6][7][8] .
Genomic and epigenomic landscapes of primary brain tumors have been extensively investigated. In gliomas, for example, the cytosine-guanine island (CGI) methylator phenotype (CIMP) is frequently found in patients with lower grade gliomas harboring mutations on the IDH1 gene and is significantly associated with a better overall prognosis 10 . Additially, a favorable response to the DNA alkylating antineoplastic agent, temozolomide, has been directly connected to high DNA methylation (DNAm) level in the promoter region of the MGMT gene 11 . DNAm profiling has recently been shown to have the potential to accurately stratify primary central nervous system (CNS) tumours 12 and to significantly improve the diagnosis of cancer of unknown primary 13 . These clinically relevant findings have demonstrated DNAm profiling to be a valuable tool in the histomolecular evaluation of brain tumors 14,15 . Yet, while genomic and transcriptomic characterization has been performed to some extent 16 , clinically relevant epigenetic alterations of metastatic brain tumors are still poorly understood. Therefore, given this significant knowledge gap, we constructed a comprehensive dataset that can be used to accelerate the identification of novel DNAm features with biological and clinical relevance for the three most frequent types of BM. Here, we present a dataset including genome-wide DNA methylomes constructed using Illumina Infinium HumanMethylation 450K BeadChips (HM450K) of 96 micro-dissected BM specimens from patients with breast cancer, lung cancer, and cutaneous melanoma (Fig. 1). In addition to DNAm data, this report provides a detailed description of the methodological approaches for patient selection, compliance matters, tissue processing and DNA preparation, data normalization, bioinformatics analyses, and usage notes including clinical and demographic information for all patients in the study. Seven of these patients are part of a cohort study that we previously analyzed to identify genome-wide DNAm variations during cutaneous melanoma progression to BM [17][18][19][20] . Therefore, the current cohort of BM DNA methylomes is composed of HM450K profiles included in two different NCBI's Gene Expression Omnibus (GEO) datasets (GSE108576 and GSE44661). We believe that these datasets offer a unique opportunity for the discovery of novel diagnostic and prognostic biomarkers, while simultaneously providing insight into the underlying biology of this serious clinical complication. In this regard, we have employed these data to further explore the utility of DNAm profiles to accurately discriminate between primary and metastatic brain tumors, identify the origin of the BM lesions, and specifically classify BCBM into therapeutically relevant molecular subtypes 21 . Thus, we generated and validated a three-steps BM DNAm based classifier named "BrainMETH" 21 .

Tissue specimen collection
A total of 96 metastatic brain formalin-fixed paraffin-embedded (FFPE) tumor samples from 94 patients diagnosed with breast cancer BM (BCBM; n = 30), lung cancer BM (LCBM; n = 22), and cutaneous melanoma BMs (MBM; n = 44) were included in this study. Two breast cancer patients presented synchronous or asynchronous multiple lesions. The clinical and demographic characteristics of the patients included in the study have been summarized according to relevant information for each cancer type ( Table 1). All patient-derived samples and clinical and demographic data were collected under research protocols approved by the joint Institutional Review Board of Providence Saint John's Health Center/John Wayne Cancer Institute, the Western Institutional Review Board, the Institutional Review Board of Swedish Medical Center, and the Sydney Local Health District (Royal Prince Alfred Hospital Zone) Human Ethics Review Committee. All patients signed an informed consent before joining the study. The experiments were performed in accordance with the World Medical Association Declaration of Helsinki and the National Institutes of Health Belmont Report. Tissues were de-identified and coded according to recommendations of the Health Insurance Portability and Accountability Act (HIPAA) to ensure confidentiality of the patients.

Histopathological classification of brain metastasis
The BCBM specimens were classified into molecular subtypes according to the expression status of the hormone receptors (HR), i.e. estrogen receptor (ER) and progesterone receptor (PgR), and the human epidermal growth factor receptor 2 (HER2). ER and PgR were assessed by immunohistochemistry (IHC), and HER2 by IHC and/or in situ hybridization assays (ISH). FFPE tissue slides were sectioned at 4 μm, mounted onto plus-coated glass slides, and immunohistochemically stained using a Ventana BenchMark ULTRA automated slide stainer ( Age (years)   Table 1). The MBM samples were categorized according to the mutational status of BRAF and NRAS genes. Genomic DNA from MBM was amplified with standardized primers specific for exon 15 of BRAF, and exons 1 and 2 of NRAS 20 . Polymerase chain reaction (PCR) products were purified using QIAquick ® PCR Purification Kit (#28106 Qiagen, Germany) and subsequently visualized in 2.2% agarose gel DNA cassettes for gel electrophoresis (FlashGel™ System, Lonza Inc, Rockland, ME, USA). Successfully amplified samples were then quantified by UV absorption spectrophotometry and sequenced using an internal primer 20 by Eurofins MWG Operon LCC (Eurofins Genomics LCC, Louisville, KY, USA). Sequencing results were analyzed using Chromas Lite v2.6.5 (Technelysium Pty Ltd, Australia) and mutations in NRAS and BRAF genes were annotated according to the Catalogue of Somatic Mutations in Cancer (COSMIC v86, Wellcome Sanger Institute, Cambridge, UK; http://cancer.sanger.ac.uk/cosmic) 24 .
As the presence of BRAF and NRAS mutations were mutually exclusive events, the MBM specimens were classified into 3 categories: a-BRAF mutated, b-NRAS mutated, and c-BRAF/NRAS wild-type (Table 1). Due to limited tissue availability, two specimens were not profiled for oncogenic mutations on the BRAF or NRAS genes and presented in Tables 1 and 2 as not available (N/A).
The LCBM were histologically classified into non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC; Table 1). Of note, we added four BM specimens from female patients with a presumptive diagnosis of LCBM, but with inconclusive IHC analysis, or with a previous diagnosis of both primary lung and breast cancer. In agreement with the clinical-pathological diagnosis an origin of lung cancer was confirmed by DNAm profiling 21 .

Genomic DNA extraction
Representative FFPE tissue blocks from each metastatic brain lesion were selected by the respective Pathology Departments. FFPE tissue blocks were cut into 4 μm and 8 μm serial slides. Neuropathologists reviewed 4 μm tissue slides stained with hematoxylin & eosin (H&E) for all specimens and labeled representative brain metastatic areas with an estimated tumor purity higher than 70%. After deparaffinization, hematoxylin staining was performed in 8 μm thick serial tissue sections and needle microdissected using the labeled 4 μm tissue slides as template. Genomic DNA (gDNA) was then isolated using ZR FFPE DNA MiniPrep (D3066; Zymo Research, Irvine, CA, USA), according to the manufacturer's instructions. Genomic DNA was quantified by Qubit ® 3.0 Fluorometer (Q33216; Thermo Fisher Scientific, Carlsbad, CA).

Data processing and analysis
Data was extracted from Illumina .idat files using the Bioconductor package minfi 25 . The 'preprocessNoob' function in minfi was used for normalization and dye-bias correction as described in Triche et al. 26 . DNAm levels were reported as β-values [β = intensity of the methylated allele/(intensity of the unmethylated allele + intensity of the methylated allele)], and calculated using the signal intensity value for each CpG site. The effect of normalization on the distribution of β values across samples is shown in Fig. 2.
Using the normalized β values, we compared the genome-wide DNAm profiles for specific genomic features across the three BM groups. DNAm level of CpG sites in high-CpG density regions (known as CpG islands; CGI) and low-CpG density regions (known as CGI shore, CGI shelves, and open sea) were also variable among the three BM groups (Fig. 3a). Additionally, DNAm levels varied among the three BM groups for CpG sites in the promoter regions, 5'UTRs, the first exon, gene body, and intergenic regions (IGRs; Fig. 3b). Finally, to check for overall structure within our dataset, we used the t-distributed stochastic neighbor embedding (t-SNE) 27,28 method with the top 2,500 most variable HM450K probes to cluster all BM specimens. Three distinct clusters were observed that corresponded to each of the three BM types, with MBMs showing the greatest degree of separation from BCBM and LCBM which were positioned more closely to each other (Fig. 3c). No outlier samples were observed.

Code availability
All analyses were performed using open source R and Bioconductor packages. Specifically, the minfi 25 package was used to process raw array data and perform normalizations (see "Data processing and analysis" section), summary statistics were calculated using functions in base R and the matrixStats 29 package, density distribution plots were generated using the densityPlot function in minfi 25 , all other figures were generated using the ggplot2 30 and RColorBrewer 31 packages, and the t-SNE analysis was performed using the Rtsne 32 package. No custom code was used in the processing or analysis of this data.

Data Records
All HM450K raw and normalized data that support the findings of this study have been deposited in the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO; https://www. ncbi.nlm.nih.gov/geo/) datasets under the series records GSE108576 (Data Citation 1) and GSE44661 (Data Citation 2). The data is presented in a tabular format that includes the unmethylated intensity values, the methylated intensity values, the p-value from the statistical evaluation of the differences between signal and noise, and the corrected β value. The DNAm data can also be accessed as raw intensity files (.idat). Additionally, the integration of the clinical-demographic characteristics of the 96 BM specimens with the matched .idat file names and the GEO sample identifiers (GSM) is provided in Table 2 (available online only).

Technical Validation
To ensure that only samples with high overall quality were included in this dataset, we applied a three-step quality control pipeline: 1) We filtered samples by probe detection p-value to identify samples with an elevated level of background noise. A significance level of 0.05 for the mean per sample detection p-value was used as a cut-off. All 96 samples included in this dataset showed mean detection p-values less than 0.05 (Fig. 4a). 2) We calculated the number of probes with missing β values per sample. Across all 96 samples, the median number of probes with missing β values was 7.0 probes per sample with a range of 1 to 246 probes (Fig. 4b). Overall, the number of probes with missing β values represents a minuscule fraction of the total number of probes present on the array and therefore is highly unlikely to have an adverse effect on downstream analysis. 3) For each probe, we calculated the number of samples with missing β values. Notably, of the 485,577 probes included on the HM450K microarray, probe cg01550828 showed missing β values in 79 samples (Fig. 4c). Probe cg01550828 is located in the body of the ring finger protein 168 (RNF168) gene and is one of five probes within the RNF168 gene body. While cg01550828 showed missing values, none of the other four RNF168 gene body probes showed any missing values across the 96 samples.  cg01550828  cg15224432  cg02673002  cg18531559  cg21370869  cg10500653  cg13259205  cg16700025  cg12953628  cg21132649  cg18504511  cg11249173  cg26176103  cg25762396  cg06609882  cg00394712  cg24019999  cg19039673  cg22966895  cg02987635  cg10065825  cg00657460  cg03769088  cg04576441  cg11898347 Number of NAs

Usage Notes
To enhance the utility of this resource, we have integrated the most relevant clinical and demographic features of the patient cohort and DNAm data for each BM specimen. In Table 2 (available online only), we included patient age at BM diagnosis, gender, primary cancer of origin, and cancer-specific subtypes matched with GEO sample names and .idat identifiers. This information can be accessed from the respective GEO series GSE108576 (Data Citation 1) and GSE44661 (Data Citation 2). The dataset we present here can be further analyzed to study the differential methylation profiles among the three BM groups described here and/or integrated into larger methylation analyses using new or existing publicly available array data deposited in GEO. Data normalization and differential methylation analysis can be performed using various open source Bioconductor packages. In particular, the ChAMP Bioconductor package provides a comprehensive analysis pipeline that utilizes many wellestablished methods for the normalization and analysis of Illumina HM450K microarray data 33 . This package is well documented and provides a useful first pass pipeline for processing array data.