Visualizing genomic characteristics across an RNA-Seq based reference landscape of normal and neoplastic brain

In order to better understand the relationship between normal and neoplastic brain, we combined five publicly available large-scale datasets, correcting for batch effects and applying Uniform Manifold Approximation and Projection (UMAP) to RNA-Seq data. We assembled a reference Brain-UMAP including 702 adult gliomas, 802 pediatric tumors and 1409 healthy normal brain samples, which can be utilized to investigate the wealth of information obtained from combining several publicly available datasets to study a single organ site. Normal brain regions and tumor types create distinct clusters and because the landscape is generated by RNA-Seq, comparative gene expression profiles and gene ontology patterns are readily evident. To our knowledge, this is the first meta-analysis that allows for comparison of gene expression and pathways of interest across adult gliomas, pediatric brain tumors, and normal brain regions. We provide access to this resource via the open source, interactive online tool Oncoscape, where the scientific community can readily visualize clinical metadata, gene expression patterns, gene fusions, mutations, and copy number patterns for individual genes and pathway over this reference landscape.

To characterize and better understand the molecular intricacies of brain tumors, we downloaded uniformly processed RNA-Seq abundance values from recount-brain, a curated repository for human brain RNA-Seq datasets, for three different uniformly processed datasets-702 adult glioma samples from TCGA 1 , 270 adult glioma samples from CGGA 5,6 , 1409 healthy normal brain samples from GTEx 4 across 12 GTEx-defined brain regions (Supplementary Table 1a).Retrieving data from recount 7 ensured that consistent bioinformatic pipelines were used for these three datasets thus resulting in no batch effects between the three datasets.
The most common solid tumors in children are brain tumors with approximately 1.15 to 5.14 cases per 100,000 children in the United States alone 8 .To adequately represent a wide range of CNS tumors in our reference landscape, we additionally included 802 pediatric tumor samples (Supplementary Table 1b) from the Children Brain Tumor Network (CBTN) 3 .Figure 1a represents an overview of data sources (details in Supplementary Table 1c).
Gene expression data from each of the datasets was converted to units of transcripts per million (TPM) to avoid inter-pipeline difference and were limited to a common set of 19,142 protein-coding genes 9 .While both CBTN and recount used different bioinformatic pipelines (Supplementary Table 1c), in order to ensure that there were no batch effects we used combat 10 method from the R package "sva" to remove unwanted variation in our combined dataset.
We explored three different dimension reduction techniques (Principal Component Analysis (PCA), t-distributed Stochastic Neighbor Embedding (tSNE) and uniform manifold approximation and projection (UMAP) for data visualization.We chose UMAP to build a Brain-UMAP (Fig. 1b) on batch corrected TPM integrated counts, as UMAP segregated the mini clusters well and was very effective in visualizing clusters and their relative proximities (Supplementary Figure .

Distinct gene expression profiles in normal human brain
The 1409 normal brain samples segregated into two distinct clusters of multiple supratentorial regions and cerebellum, as we have previously shown 11 (Fig. 1c).The supratentorial regions further revealed three anatomically distinct regions for basal ganglia (caudate, nucleus accumbens, putamen), cortex (amygdala, Brodmann Area 24, cerebral cortex), and the midline structures (spinal cord, substantia nigra and hypothalamus).We confirmed that different classifiers such as postmortem interval (PMI), age (years), sex, Hardy score and type of sample preparation (Supplementary Figure 1b) were not associated with distinct clustering patterns, suggesting that the sample clusters we observe are based on actual biological differences between these brain regions rather than sample preparation parameters.However, we observed that a few samples with much lower RNA integrity number (RIN) compared to all the other samples from different brain regions converged at a point (Supplementary Fig. 1b).

Clustering of transcriptomic glioma datasets reveals distinct glioma subtype
Within the adult glioma clusters, we observe that while the samples from the two TCGA datasets cluster together, there is a more continuous pattern when looking at gene expression profiles for samples from TCGA-GBM and TCGA-LGG (Fig. 2a).This contrasts with the distinct clusters observed by analyzing whole exome single nucleotide alterations (SNAs) and whole genome copy number alterations (CNAs) from the same patients, as shown by Bolouri 12 et al.In line with previous reports, we observed that the age of the patients at diagnosis for the TCGA-GBM and TCGA-LGG samples revealed a sharp gradient illustrating the known correlation between age and outcome (Fig. 2b).By contrast, patient gender was not associated with any specific clusters (Supplementary Fig. 2a).Regarding chromosomal alterations, tumors with a gain of chromosome 7 and hemizygous deletion of chromosome 10 (Fig. 2c) or co-gain of chromosome 19 and 20 (Supplementary Figure 2b) co-localized in the top area of the UMAP containing the TCGA-GBM samples (Fig. 2c).Tumors exhibiting a co-deletion of chromosome 1p and 19q and mutation in isocitrate dehydrogenase 1 and 2 genes (IDH-mut) (Fig. 2d) were concentrated in the lower half of the UMAP containing the TCGA-LGG samples.Tumors containing mutation in www.nature.com/scientificreports/IDH1 (Fig. 2e), TP53 (Fig. 2f) and ATRX (Fig. 2g) were also found to be concentrated and clustered together in specific regions of the adult glioma UMAP.Using the supervised DNA methylation classification, transcriptional subtype, MGMT promoter status and TERT promoter status (Supplementary Figure 2c,d,e,f) to color the adult gliomas from TCGA-LGG and TCGA-GBM also revealed distinct patterns across the adult glioma landscape.Selecting for common glioma mutations and copy number alterations clearly shows three distinct subtypes of adult type diffuse gliomas as defined by the fifth edition of the WHO Classification of Tumors of the Central Nervous System (WHO CNS5) 13 -IDHmutant-1p19q co-deleted oligodendrogliomas, the IDH mutant astrocytomas with TP53 and ATRX mutations, and the IDH-wildtype (IDH-wt) glioblastomas with gain of chromosome 7 and loss of chromosome 10.(Fig. 2h).

Adult glioma subtypes from TCGA and CGGA show similar gene expression profiles
We next assessed if glioma samples of similar molecular subtypes from the TCGA and CGGA datasets exerted similar gene expression profiles and co-localized in their respective clusters.Similar to the TCGA samples, using grade, IDH mutation status and chromosome 1p 19q codeletion status for the CGGA samples, three distinct subtypes of adult glioma from the CGGA samples were observed.Interestingly, we observed that the IDH mutant -1p19q co-deleted oligodendrogliomas, IDH-mutant astrocytomas and IDH-wt glioblastomas from both CGGA and TCGA colocalized (Fig. 3a,b,c,d,e,f).Separated from the large cluster containing adult type diffuse gliomas, there were two small clusters consisting of 47 gliomas from the CGGA dataset (Supplementary Table 1d).Based on their grade, IDH mutation status, chr 1p/19q co-deletion status one cluster consisted mostly of grade 2/3 IDH mutated astrocytoma and the second cluster consisted of grade 4 IDH-wt glioblastomas.For the remainder of the paper, we will refer to the diffuse adult type gliomas from TCGA and CGGA by their molecular subtypes (IDH mutant-1p19q co-deleted oligodendrogliomas, IDH-mutant astrocytomas and IDH-wt glioblastomas).We analyzed the survival for each subtype for all three glioma subtypes and observed similar survival rates between TCGA and CGGA datasets for the respective subtypes.(Fig. 3g).Since survival data is not available for all patients, we then used a nearest neighbor approach to predict the survival for different UMAP subregions.Survival was predicted for samples with the median survival of its nearest neighbors, present in close proximity on the UMAP landscape (Fig. 3h).
We observed a small number of glioma samples formed an isthmus connecting to the normal brain samples (Supplementary Figure 3b).These samples were characterized by a low amount of copy number alterations and were a mix of IDH-wt glioblastomas, IDH mutant-1p19q co-deleted oligodendrogliomas or IDH-mutant astrocytoma.These tumors were characterized by longer survival compared to other gliomas of their respective molecular subtype.It is conceivable that in some samples this region may represent either early forms of gliomas or a mix of glioma and normal brain.Taken together, our results show that the transcriptomic data from different datasets can be combined to generate a population of adult gliomas, creating a landscape where the location of specific tumor sample can be predictive of subtype and outcome.

Pediatric Tumors cluster by disease type
We added the 802 pediatric tumors from the CBTN dataset to our analysis to compare their gene expression patterns to both normal brain and adult glioma samples (Fig. 4a,b).We observed the formation of distinct subclusters for several tumor types that correlated with established molecular subgroups.As an example, we observed that medulloblastoma samples were split into three distinct clusters that correlated with known Medulloblastoma subtypes 14 (Wnt, Sonic hedgehog (SHH) and groups 3,4).(Supplementary Figure 4a).Similarly, Ependymomas (EPN) samples formed several clusters that correlated with the anatomic tumor location (supratentorial (ST)-EPN, spinal-EPN, and posterior fossa (PF)-EPN) 14 (Fig. 4a, Supplementary Figure 4b).Pediatric pilocytic astrocytomas (PAs) and pediatric low-grade gliomas clustered closely together suggesting that they exert similar gene expression patterns.The schwannomas separated from the pediatric tumors and were located near the neurofibromas, which were localized adjacent to the malignant peripheral nerve sheath tumors (MPNST).The subependymal giant cell astrocytoma (SEGA) form a tight cluster as do the meningiomas.Building a UMAP with just normal brain samples and the pediatric tumors, showed similar clustering profile (Supplementary Figure 4c).Taken together, these results suggest that pediatric tumors cluster by disease type and also form subclusters made by subtypes in the case of medulloblastomas and ependymomas.

Using the reference landscape to understand pathway regulation
Alterations in signaling pathways are a hallmark of cancer and understanding the extent to which these pathways are dysregulated in tumor samples compared to healthy normal brain can help inform researchers about the underlying mechanisms of different cancer types.
Bulk gene expression from adult type diffuse gliomas, pediatric brain tumors and healthy brain samples was subjected to a Gene Set variation Analysis (GSVA) and the gene set variation scores for each pathway were used to color in the Brain-UMAP.A score closer to 1 represented up-regulation of pathway in the given samples, whereas a score closer to − 1 represented down-regulation of the pathway.We calculated GSVA scores for all pathways present in Reactome 15 , KEGG 16 and biocarta 17 pathways.We then tested whether there is a difference between the GSVA enrichment scores from different pair of phenotypes using a linear model and moderated t-statistic.
In medulloblastoma, which appear as solid mass in the cerebellum 18 , we found 160 up-regulated and 175 down-regulated pathways compared to cerebellum samples from GTEx (Supplementary Table  www.nature.com/scientificreports/pathways involved in cell cycle, and DNA repair were up-regulated in medulloblastoma, NFΚB pathway, RELA pathway, Interleukin-2 (IL2), Interleukin-27(IL27) signaling pathways were down-regulated in medulloblastoma compared to healthy cerebellum samples from GTEx (Supplementary Table 2a,b, Supplementary Figure 5a).We can thus use the landscape to find pathways which are differentially regulated in regionalized tumors compared to specific healthy brain regions from GTEx.Additionally, we found that 605, 589, and 529 (Supplementary Table 2c-e) pathways were up-regulated in IDH-wt glioblastomas, IDH-mutant astrocytomas and IDHmutant-1p19q co-deleted oligodendrogliomas respectively compared to healthy brain samples from GTEx.A total of 456 pathways (Supplementary Table 2f) were up regulated in all three adult type diffuse glioma subtypes compared to healthy brain.The top pathways which were enriched in all adult glioma subtypes were pathways enriched for cell cycle, DNA repair, translation, splicing, oncogenic signaling pathways such as RAS pathway, Notch pathway, MHC pathway, PI3K/AKT Signaling, Wnt pathway, SHH pathway (Fig. 5, Supplementary Figure 5b).Additionally, neurotransmitter pathways, calcium channel, potassium channel opening pathways were up regulated in healthy brain regions compared to pediatric tumors and adult gliomas (Supplementary Figure 5c).
Interestingly, we noted that the two small clusters (IDH mutated grade 2 and grade3 oligodendroglioma and grade 4 IDH-wt glioblastomas) from the CGGA dataset were enriched in pathways related to olfaction, glucoronidation, ascorbate and aldarate metabolism and xenobiotics (Supplementary Fig. 5d) in comparison to the main adult glioma cluster.
While visualizing pathways across the reference Brain-UMAP is extremely informative, researchers exploring targets for drug development may be also interested in investigating individual genes of a particular pathway.For example, Reactome's mismatch repair pathway (R-HAS-5358508) and Biocarta RELA pathway (M10183) were both found to be up-regulated in all adult glioma subtypes compared to healthy normal brain, coloring in the gene expression for individual genes over the Brain-UMAP show different gene expression patterns across members of the same pathway.(Fig. 6, Supplementary Figure 6).For example, almost all the genes in the mismatch repair pathway have elevated gene expression levels in medulloblastoma, except EXO1, RPA3 and POLD4.

Studying candidate genes at multiple genomic levels
While bulk gene expression exhibited illuminating patterns in our data, analyzing other genomic information such as copy number alteration, gene fusions, or somatic alteration for each of these tumors can further enhance our understanding for a given gene of interest.Since processed copy number, gene fusions and somatic variants were publicly available only for two out of five datasets (TCGA and CBTN), we first built a much smaller UMAP using only the bulk gene expression data from these two datasets.The resulting UMAP (Fig. 7a) showed a similar clustering pattern as our original Brain-UMAP.Next, we downloaded the copy number calls, gene fusion calls and somatic variants for adult gliomas from Genomic Data Commons (GDC) and the pediatric tumors from CBTN.
Oncogenes show altered gene expression in tumor samples, leading to abnormal phenotype in samples.Understanding gene expression patterns across various cancers of the nervous system can further our understanding of the disease.When studying known oncogenes such as EGFR, PTEN and CIC at a gene level (Fig. 8), as expected, we observe high number of mutations, high number of gene fusions, amplified gene expression values and copy number gains for EGFR across IDH-wt GBM.This contrasts with PTEN which shows loss of 1 copy in IDH-wt GBM samples.CIC, a transcriptional repressor, shows high number of mutations and copy number loss in oligodendrogliomas.For pediatric tumors, we observe BRAF gene fusions (Fig. 8) in 63% pilocytic astrocytoma tumors and 34% of low grade pediatric tumors (Supplementary Table 3a) ALK (Fig. 8) mutations are also observed in 38% high-grade pediatric tumors, 33% spinal cord ependymomas, 22% ATRT and 21% of the medulloblastoma tumors (Supplementary Table 3b).This reference landscape is a useful research tool for the scientific community, where researchers can explore existing data to increase their understanding of oncogenic pathways and individual genes that make up these pathways, potentially uncovering candidates for novel therapeutic targets.By providing access to this reference landscape via an open source website like Oncoscape, we provide an interactive tool to researchers doing CNS research.

Discussion
As costs for performing RNASeq continues to decline with increasing technological advances, more and more tumors will be sequenced, and additional tumor banks will be created with the underlying goal to understanding cancer's complexity.The wealth of knowledge that that already exists in publicly available datasets such as those described here (GTEx, TCGA, CGGA, CBTN) is remarkable.Individually these datasets comprise of well-defined biologically similar set of patient samples and allow the analysis in exquisite resolution of genetic changes from one sample to the next.For example, comparing medulloblastomas to other medulloblastomas allows for precise characterization of molecular subtypes.As researchers, we can get so focused on comparing like with like that we lose sight of the proverbial forest for focusing too much on the leaves of a single tree.By integrating multiple datasets while correcting for batch effects, such as with the Brain-UMAP presented here, we can harness the power of multiple datasets.This landscape can be used to prospectively compare patients, similar to work that has been done using methylation arrays.Unlike methylation arrays, however, RNA-Seq is based on gene expression and therefore each cluster represented on the landscape contains granular information about the underlying tumor biology of the samples it contains.Because the overall expression pattern is identifiable for each tumor, this allows for cross comparison of various kinds of cancer to allow for characterization of tumor types not previously known.For example, asking questions about expression of particular genes or pathways across a wide panel of samples and tumor cohorts may uncover previously unknown roles or similarities between adult and pediatric tumor subtypes, which can possibly open new avenues for therapeutic discovery.
The landscape that we present reveals correlations between tumor type and outcome.We have learned how diagnosis of a given tumor type can be confirmed using various pieces of genomic information.For example -the adult glioma subtype of oligodendroglioma can be confirmed by presence of co-deletion of chr1p/19q and somatic alteration in IDH1.
We are also able to compare tumor samples which were sequenced years apart (TCGA and CGGA) in two different continents and confirm that these differences did not contribute to expression changes in majority of the tumor samples.We are also able to identify novel entities and gain insight into genes that drive their unique character, such as in the case of the two adult glioma subtypes seen only in the CCGA data that appear to strongly and uniquely express genes involved in olfaction.
Using this study as an example, we have seen how one diagnosis can be comprised of more than one expression entity, such as in the case of medulloblastoma and ependymoma.The UMAP also indicates room for better classification of tumors.For example, we found tumors which were documented as one type in the publicly available database, but in our UMAP were found to cluster with tumors of a different type.For example, some embryonal tumors ended up clustering with the medulloblastoma samples indicating that they may be medulloblastomas, or at least share many common features with medulloblastoma.We also have seen that multiple diagnoses may really be one entity as in the case of tumors diagnosed as either pilocytic astrocytoma (PA) or pediatric low-grade glioma.
Reference landscapes like the Brain-UMAP, will be informative for researchers who wish to obtain a quick diagnosis or characterization of newly obtained tumor samples.For example, if we look at the medulloblastoma subtypes from CBTN, there are 14 group 3, 49 group 4, 30 SHH, 9 WNT and 19 unclassified medulloblastoma samples.For the 19 unclassified samples, we can predict which subtype they belong to, based on which medulloblastoma subtype samples they cluster with.(Supplementary Figure 4a).
By combining results from both RNA-Seq data (gene expression, gene fusions) as well as whole genome sequencing (copy number, mutation calls) and with the help of a few examples, we illustrate the knowledge that can be mined from this resource, at both a pathway level as well as a gene level.The work presented here demonstrates the utility of reference landscapes for combining genomic data across multiple tumor types for both diagnosis, prognosis and better understanding the biology of the tumors that are similar to a given patient collected prospectively.The power of visualizing gene expression changes, regulation of pathways, chromosomal alteration, gene fusions across multiple tumor types is informative to every researcher, especially those who may not have immediate access to computational biology experts.The reference landscape described here provides a useful tool for researchers interested in gene level questions across large scale patient data, while the methods used for integrating data sources highlight the tremendous potential for combining future datasets with existing resources to address complex biological questions.

Obtaining gene expression RNA-Seq Data
RNA-Seq gene expression was downloaded from two sources for adult gliomas, GTEx defined healthy brain samples and pediatric tumors respectively (Supplemental Table 1).

Conversion of abundance estimates to transcripts per million (TPM)
For consistency, we converted all FPKM gene expression data to TPM data using the formula

Obtaining mutations
Annotated Variant call Format (VCF) files containing somatic variants from MuSE, VarScan2, MuTect2 and Somatic Sniper were downloaded from GDC portal for TCGA-GBM and TCGA-LGG.The variants for each patient were combined based on custom script present in our github repository.Somatic Variant calls for pediatric tumors was obtained from https:// github.com/ Alexs Lemon ade/ OpenP BTA-analy sis.

GSVA pathway analysis
Gene sets for all the pathways from Biocarta, Kyoto Encylcopedia of Genes and Genomes (KEGG) and Reactome pathways were downloaded from Molecular Signature Databases (MSigDB)(v7.2) 26 .https:// www.gsea-msigdb.org/ gsea/ msigdb/ colle ctions.jsp# C2 Batch corrected log2(TPM) counts from each pipeline were used to conduct a Gene Set Variation Analysis (GSVA) 27 using Biocarta, KEGG and Reactome pathways.GSVA scores obtained from 1 and -1 for each sample, were visualized using ggplot2.R package limma 28 was used to test whether there is a difference between enrichment scores for a given pair of phenotypes.

Kaplan-Meier curves for TCGA and CGGA patients
To generate Kaplan-Meier curves for TCGA and CGGA patients, all samples labeled recurrent, secondary, or normal tissue were removed.Duplicate samples were also excluded so that each patient was represented by exactly one sample.Kaplan-Meier curves were drawn by the Python package lifelines (Python version 3.8.6,lifelines version 0.25.2) 29 using survival data and glioma subtype labels downloaded from GDC, where the world health organization (WHO) 2016 criteria for the classification of adult diffuse gliomas was used to determine glioma subtype 30 .

Survival-annotated UMAP
To annotate the Brain-UMAP with survival data, each TCGA and CGGA sample was colored with the median survival of a cohort of patients (nearest neighbors) close to the sample in question on the UMAP landscape.The notion of nearest neighbors was defined as a number of samples within a radius of 2 from the sample in question under the constraint that all nearest neighbors must be of the same glioma subtype and from the same dataset (TCGA or CGGA) as the sample in question.The radius parameter was chosen qualitatively.The number of nearest neighbors was defined as 25% of the total number of samples of the same glioma subtype and dataset as the sample in question.If the median survival of a sample's cohort of nearest neighbors is undefined, or if a sample has fewer than 10 nearest neighbors, the point is not colored in.Non-primary and duplicate samples were excluded in the same manner as was done for the Kaplan-Meier curve analysis.

Figure 1 .
Figure 1.Overview of data analyzed (a) showing datasets used, batch correction and construction of Brain-UMAP.(b) UMAP of complete dataset including adult gliomas from The Cancer Genome Atlas Low Grade Glioma (TCGA-LGG), The Cancer Genome Atlas Glioblastoma Multiforme (TCGA-GBM) and The Chinese Glioma Genome Atlas (CGGA), pediatric tumors and GTEx-defined normal brain.(c) UMAP showing unique clustering of GTEx-defined brain-regions.

Figure 3 .
Figure 3. Co-localization of adult gliomas from two publicly available datasets TCGA and CGGA.Top panel shows (a) grade, (c) IDH mutation status and (e) 1p19q co-deletion status for adult gliomas from TCGA colored in, and adult gliomas from CGGA greyed out.Bottom panel shows (b) grade, (d) IDH mutation status and (f) 1p19q co-deletion status for adult gliomas from CGGA colored in, and adult gliomas from TCGA greyed out.(g) Survival analysis for adult glioma subtypes IDH wildtype(red), Astrocytoma (blue) and Oligodendroglioma (green) from TCGA and CGGA shown in solid and dotted lines respectively.(h) Prediction of survival time (in years) using a nearest neighbor approach shows a gradient for adult gliomas.

Figure 4 .
Figure 4. (a) UMAP of pediatric tumors (b) Updated coloring of the Brain-UMAP showing pediatric tumors and three subtypes for the adult gliomas.

Figure 5 .
Figure 5. Visualization of GSVA Pathway scores across Brain-UMAP for cancer pathways and cellular processes.

Figure 6 .Figure 7 .
Figure 6.Visualization of gene expression profiles for genes from the Reactome mismatch repair pathway across the Brain-UMAP.

Figure 8 .
Figure 8. Integration and visualization of genomic information such as gene expression, mutation, copy number and gene fusions at a single gene level across Brain-UMAP for 5 genes-EGFR, PTEN, CIC, BRAF and ALK.