Uniform genomic data analysis in the NCI Genomic Data Commons

The goal of the National Cancer Institute’s (NCI’s) Genomic Data Commons (GDC) is to provide the cancer research community with a data repository of uniformly processed genomic and associated clinical data that enables data sharing and collaborative analysis in the support of precision medicine. The initial GDC dataset include genomic, epigenomic, proteomic, clinical and other data from the NCI TCGA and TARGET programs. Data production for the GDC started in June, 2015 using an OpenStack-based private cloud. By June of 2016, the GDC had analyzed more than 50,000 raw sequencing data inputs, as well as multiple other data types. Using the latest human genome reference build GRCh38, the GDC generated a variety of data types from aligned reads to somatic mutations, gene expression, miRNA expression, DNA methylation status, and copy number variation. In this paper, we describe the pipelines and workflows used to process and harmonize the data in the GDC. The generated data, as well as the original input files from TCGA and TARGET, are available for download and exploratory analysis at the GDC Data Portal and Legacy Archive (https://gdc.cancer.gov/).

get the up-to-date file summary directly from data repository facet view.
In this work, we discuss the general considerations, implementation details and quality comparisons of a large-scale uniform genomics data analysis that was used by the NCI's Genomic Data Commons for processing and harmonizing the cancer genomics data that it shares with its users.

Results
Reference genome and gene model. The GDC chose GRCh38 as the reference human genome build for all data analyses, because of its improved coverage and accuracy over the previous major build GRCh37 3 . The GRCh38 major human genome assembly was released by Genome Reference Consortium (GRC) on Dec 2013 with GenBank assembly accession GCA_000001405. 15. The complete assembly downloaded from NCBI contains 456 sequences, including 25 continuous chromosomal and mitochondrial sequences, 42 unlocalized scaffolds, 127 unplaced scaffolds, 261 alternative scaffolds, and 1 EBV decoy sequence.
At the time of the development of the GDC pipelines, there was a lack of tool support for alternative scaffolds analysis. As a result, the GDC excluded alternative contigs from the GDC reference sequence, and used GCA_000001405.15_GRCh38_no_alt_analy-sis_set from the NCBI ftp site as a modified GCA_000001405.15 reference contig set without alternative loci and patches.
In addition to the continuous chromosomal and mitochondrial sequences, unlocalized and unplaced scaffolds described above, the GDC reference sequence also includes 2385 human decoy sequences, together named hs38d1 with GenBank assembly accession GCA_000786075.2. These human decoy sequences are additional unlocalized sequences not officially recognized by GRC at the time of reference release, and having them in the GDC reference helps to reduce false alignments of such reads in other regions 4 .
Similar to the idea of improving alignment quality with human decoy sequence, we collected genomic sequences from ten types (200 subtypes) of cancer related viruses in the reference genome, together called "virus decoy" as a collection. The additional benefit of having such virus sequences in the reference genome is to give users the opportunity to directly identify virus genome derived reads. These viruses include Human Cytomegalovirus (CMV, HHV-5), Epstein-Barr virus (EBV, HHV-4), Hepatitis B virus (HBV), Hepatitis C virus Reproducibility of analysis. All major GDC data production pipelines are written in the Common Workflow Language (CWL, https://www.commonwl.org/). In each workflow, a main CWL file describes how tools and sub-workflows, also written in CWL, can be used in clearly defined steps. All major tools have been containerized using Docker containers to support reproducibility and portability of the workflows. The GDC will be redistributing the main GDC workflows to the research community to support reproducible research.
DNA-Seq alignment, mutation calling, annotation, and somatic variant aggregation. The DNA-Seq alignment process includes initial alignment and a post-alignment optimization process. In the initial alignment step, reads are mapped to GRCh38 using BWA, followed by BAM sorting, merging of read group BAMs into a single BAM, and then duplicate-marking using Picard. When the read length of a read group is larger than 70 bps, BWA MEM 8 is used for alignment; or otherwise, BWA Aln 8,9 is used.
After initial alignment of WXS data, we follow GATK Best Practices (https://software.broadinstitute.org/gatk/best-practices/) to process all BAMs from the same patient together for a postalignment optimization process called "co-cleaning" in which includes GATK IndelRealigner and BaseQualityScoreRecalibration (BQSR). IndelRealigner performs local realignment to further improve mapping quality acrossing all reads at loci close to indels, and BQSR detects and fixes systematic errors made by the sequencer when it estimates the quality score of each base call [10][11][12]  Artificial chimera reads can form during the Multiple Displacement Amplification reaction 16 , such as the one used to generate the WGA libraries by REPLI-g. This phenomenon is most obvious in MuTect2 calls as we observed about 10 folds increase of INS/SNP ratio when tumor is a WGA sample (data not shown), compared to other cases in the same project, and most of such INDELs are primarily supported by softclipped reads. In order to increase specificity, we re-analyzed all TCGA tumor WGA samples with MuTect2 option -dontUseSoftClippedBases, and successfully removed most of these false-positive INDELs. However, as these artifacts were introduced during library preparation, they could also exist in MuTect2 SNPs, as well as variants called by other tools, in a smaller scale. We recommend users to consume these WGA calls with care.
Raw Somatic Variant Call Format files (VCFs) generated from each pipeline are further processed by caller-specific filters to tag low quality variants in the FILTER column in VCF. For SomaticSniper, variants with Somatic Score (SSC) < 25 are removed. These VCFs are then annotated using Variant Effect Predictor 17 to generate Annotated Somatic VCFs.
To allow easier investigation of variants and annotations, the VCFs are transformed into project-level tab-delimited Mutation Annotation Format (MAF) files 18 . This is done with a custom tool based upon VCF2MAF (https://github.com/mskcc/vcf2maf) from Memorial Sloan Kettering Cancer Center. VCF and MAF files may contain germline variants and therefore all VCFs and MAFs described above are only available as controlled-access files. To access controlled-access files in the GDC, users must obtain approval from the Data Access Committee through dbGaP (database of genotypes and phenotypes) 19 .
We have also created open-access MAF files by applying stringent criteria to remove potential germline variants. These open-access MAF files are used to support variant visualization and simpler data sharing. We did not produce open-access MAFs for the TARGET program because of privacy concerns of child sample donors. Mutation loads of point mutations and INDELs from both open-access (public) and controlled-access (protected) MAFs of all TCGA projects are displayed in Fig. 1. Of note, this germline-masking process is so stringent that some real somatic variants, for example somatic variants in areas of low sequencing coverage in the paired normal samples, may have been removed from open-access MAF. We encourage users to explore controlled-access MAFs to view additional variants.
Quality assessment of GDC somatic variants. Somatic variant detection is still in a stage of rapid algorithm development, and no single caller is superior to others in every respect [20][21][22][23] . As shown in Fig. 2a, a direct comparison of SNV calls from different callers show significant overlaps, but also many tool-specific calls. Here, we only compared high quality variants, which means they do not have a non-PASS caller assigned FILTER value or GDC assigned GDC_FILTER values in the somatic MAF files of data release version 10. In summary, 56.0% of the clean variants have been identified by all four callers, 15.1% by three callers, 14.0% by two callers, and 14.9% called by only one caller. Among all four somatic callers, MuTect2 has detected the highest number of unique calls, while SomaticSniper has the least. Additional efforts are needed to examine the validity of such calls, especially those not called by all pipelines, in order to evaluate the performance of each tool, and hopefully lead to machine-learning algorithms that help to unify caller outputs. Please note that these results are only relevant to the GDC implementation of these pipelines and filtering strategy.
Evaluation of somatic variant callers often requires comparison with a so-called gold standard dataset that has been extensively sequenced using multiple independent methods 23 , or using a simulated dataset. In a previous effort to evaluate quality of somatic variant callers, TCGA re-sequenced regions of many called variants using orthogonal sequence technologies, such as Sanger Sequencing, and therefore have produced a set of validated variants that GDC can use to evaluate False-Negatives in current GDC callers. However, no validation experiments were designed for GDC-called variants, thus False-Positives or specificity will not be evaluated in this paper. An independent analysis has been performed by comparing GDC-generated somatic variants to variants previously identified by the TCGA Consortium the same samples, and concluded that both datasets are highly concordant 24 .
We extracted and lifted over all TCGA validation information from 196 MAFs available on May 2016 to GRCh38 coordinates, and selected mutation calls from the same tumor samples that GDC has also successfully analyzed with all four pipelines. This results in 1,911 unique tumor samples and 115,476 validated variants, across 13 TCGA projects (BLCA, BRCA, CESC, COAD, GBM, KIRC, LAML, OV, PAAD, READ, SARC, THYM, and UCS). Comparison to GDC called somatic variants of the sample tumor sample shows that only 3.2% of TCGA validated variants are not re-discovered by any of the GDC pipelines (Fig. 2b). This number further decreases to 1.6% if we only compare variants from exactly the same tumor and normal aliquot pairs. 95.6% of the validated variants are called by at least two GDC pipelines; 86.2% by at least three pipelines; and 71.6% called by all four GDC pipelines.
To further investigate the impact of cancer type or project to each caller, we analyzed the validated variant recall rate for each pipeline (Fig. 3). In most of the cancer types, MuTect2, MuSE, and VarScan2 show good performance, while SomaticSniper typically recalls less. In particular, SomaticSniper displays a very low average recall rate of <50% in PAAD (Pancreatic Adenocarcinoma). Interestingly, SomaticSniper has the best recall rate for LAML. The algorithm of SomaticSniper may have been designed to better tolerate high-level tumor contaminations in germline control samples, which often exists as infiltration of liquid tumor cells in skin or buccal swabs of LAML patients.
RNA-Seq data processing and quality assessment. RNA-Seq analysis by the GDC makes use of a modified workflow (https:// github.com/ucscCancer/icgc_rnaseq_align) created by the  International Cancer Genome Consortium. Input RNA-Seq FASTQ files were aligned to GRCh38 using the STAR 2-pass method 25 , and quantified using HTSeq 26 and DEXSeq 27 . The GDC uses GENCODE v22 as the default gene model for both mRNA-Seq alignment and quantification. DEXSeq exon-level quantification results have not yet been imported into GDC Data Portal at the time of publication.
Gene expression is measured using HTSeq. In this pipeline, only reads or read pairs that can be uniquely assigned to a gene are counted. HTSeq supports mapping of both stranded (either forward or reverse) and unstranded libraries; however, the read assignment is different between these two different modes resulting in an inability to compare across these library types. Because the GDC emphasizes data comparability across projects, we have run HTSeq on all projects as if they were unstranded libraries.
For convenience to the scientific community, the GDC also produces gene level quantification in the units of FPKM (Fragments Per Kilobase of transcript per Million mapped reads) 28 and FPKM-UQ (Upper-Quartile Normalized FPKM) 28,29 . The definition of these units is described in the "Materials and Methods" section. Note that the denominators of such normalizations are read counts of all the protein coding genes, instead of all genes. If users are interested in a different set of genes, they are encouraged to perform a normalization based on the genes they are working on, or to use a more sophisticated method, such as DESeq 30 or EdgeR 31 . In addition, GDC does not perform RNA expression batch corrections.
We also compared GDC FPKM-UQ expression data to the original TCGA upper-quartile normalized RSEM expression values using Spearman correlation. This comparison was performed over 10,243 shared aliquots and 18,038 shared genes in both dataset. The average correlation between the same sample from two datasets (Fig. 4. Top) is 0.944, and the majority of the samples have correlation higher than 0.90.
We can also measure the relative expression of the same gene among different samples. The average correlation between the same gene from two datasets is 0.929. We suspected that most of the deviation is from sporadic low-level expressed genes. To address this concern, we further categorized genes into four quartile groups (Q1-Q4) based on their average expression values in the GDC, and then examined gene level correlations within each of these four groups with TCGA results (Fig. 4. Bottom). We observed an average correlation of 0.98 in highlevel expressed Q3 and Q4 groups and much lower values in Q1 and Q2. miRNA data processing and quality assessment. The GDC miRNA quantification analysis workflow is based on the profiling pipeline that was developed by the British Columbia Genome Sciences Centre 32 . After realignment of miRNA-Seq reads to GRCh38 using BWA Aln, the profiling pipeline generates TCGAformatted miRNA gene expression and isoform expression results by comparing the individual reads to sequence feature annotations in miRBase v.21 7 . Of note, however, the tool only annotates those reads that have an exact match with known miRNAs in miRBase and therefore does not identify novel miRNA or transcript with mutations. Similar to RNA-Seq, miRNA expression data has not been batch corrected.
We compared TPM (Transcripts Per Kilobase Million) normalized miRNA gene expression from GDC GRCh38 pipeline to the original TCGA Hg19 pipeline of the same aliquot using Spearman correlation. Similar to what we have described before for mRNA expression, only 9516 shared aliquots, and 641 shared mature miRNAs are used in this comparison. As shown in the boxplots at the bottom of Fig. 5, the majority of the samples have a correlation coefficient of greater than 0.975, with an average of 0.984. Three cancer types, colon adenocarcinoma (COAD), rectum adenocarcinoma (READ), and ovarian carcinoma (OV), have relatively lower correlation, which may reflect specific sensitivities of miRNA species in these cancer types to the reference genome and miRNA database versions.
Expression level groups are also created in the miRNA datasets for detailed correlation comparisons (Fig. 5. Bottom). Because there are many fewer miRNAs compared to mRNA genes, and many of them are expressed at low expression levels, we categorized miRNA into two groups. In the "Low-Expressed" group, all miRNAs show low (Q1 or Q2) in both TCGA and GDC quantifications; and the rest of miRNAs belong to the "Other" group. The average Spearman correlation in these two groups are about 0.956 and 0.979, respectively.
Array-based copy number variation data processing. TCGA used Affymetrix Genome-Wide Human SNP 6.0 (SNP6) array data to identify genomic regions that have Copy Number Variations (CNV) by aggregation of typed loci into larger contiguous regions. Direct liftover of region boundaries from hg19 to GRCh38 results in fragmented segments with poor data quality due to change of probe loci between different genome builds. For this reason, the GDC SNP6 pipeline is built onto the existing TCGA level 2 tangent-normalized copy number data generated by Birdsuite 33 and uses the DNAcopy version 1.44.0 34 R-package to perform a Circular Binary Segmentation (CBS) analysis 35 with GRCh38 probeset metadata. This pipeline normalizes noisy intensity measurements into chromosomal regions of equal copy number in the form of segment mean values, which are equal to log2(copy-number/2), so that diploid regions will have a segment mean of zero, amplified regions will have positive values, and deletions will have negative values.
To be consistent with the original TCGA data, two different output files were produced for each input: copy number segment files that generated from all probes, and masked copy number segment files, equivalent to the original TCGA "nocnv" file, generated by excluding certain probes that have been previously identified to carry copy number variations in a pool of germline samples.
Array-based methylation data processing. TCGA used Illumina Infinium HumanMethylation27 (HM27) and HumanMethyla-tion450 (HM450) BeadChip to measure the level of methylation at known CpG sites as beta values, calculated from array intensities (Level 2 data) as Beta = M/(M + U), where M is the methylated probe intensity and U is the unmethylated probe intensity. The GDC inherited these beta values from existing hg19-based TCGA Level 3 DNA methylation data, and reannotated each probeset with new metadata information based on GRCh38 and GENCODE v22.
Integrated genomic data clustering. To show how users can take advantages of GDC harmonized datasets for cross-project analysis, here we demonstrate integrated analysis with t-distributed stochastic neighbor embedding (t-SNE) 36,37 to reduce high dimensional information in the molecular data to two dimensions (Fig. 6). We utilized four distinct GDC generated molecular features of TCGA primary tumor samples, including mRNA expression, miRNA expression, methylation and copy number variation. Our results are shown only for TCGA data, but could be expanded to other GDC projects.
In our result, different cancer types are well separated and some interesting patterns arise. For example, it has been previously suggested that colon and rectum cancer should be grouped as one colorectal cancer 38 and that esophageal adenocarcinomas resembles a subtype of stomach cancer 39 . Our method supports those observations. In our 2D-clustering the majority of the colon cancer (COAD) samples are co-clustered with rectum cancer (READ) samples, and some esophageal cancer samples cluster together with stomach cancer (STAD) samples. We also identify a few samples designated as primary tumor in one cancer type, but which cluster together with samples from another cancer type. For example, a paraganglioma and phenochromocytoma sample is located within a cluster of adrenocortical carcinoma samples.

Discussion
The rapid decrease in sequencing costs has led to a rapid increase in the resources needed for storage and computation. The GDC provides a solution for this problem by centralizing storage and processing of genomics data. This model currently enables researchers to perform analysis in three different ways: (1) Quick data analysis and exploration using existing GDC visualization tools without the need to download files; (2) Data analysis using GDC generated high level processed data. These files are much smaller than the raw data; (3) Resource-extensive data analysis by downloading raw sequencing data to other data centers or commercial clouds.
The genomic data available through the GDC are analyzed uniformly using common algorithms and pipelines, and will be reanalyzed in the future as improved algorithms and methods are developed. As there is no consensus among the scientific community on the best algorithms on somatic variant detection, the GDC implements four callers, and each generates its own set of variant calling output. As shown in Fig. 2a, b, only 55.97% of the variants are called by all four pipelines; while this category also includes 71.57% of the TCGA validated variants. This suggests a simple strategy to combine results from multiple callers to increase specificity. The boxplots in the bottom of Fig. 3 show the effect of a combined caller approach to the recovery rate of TCGA validated variants by different cancer types. The downside of using a consensus call is a decrease in sensitivity. While we cannot measure sensitivity directly, a good approach may be to combine results from only two somatic callers rather than require a variant be called by all four pipelines. The GDC is in the development stage to generate a new set of MAFs that contain merged results from each tool.
The GDC data harmonization process also makes cross-project analysis much simpler, and reduces artificial findings due to differences in algorithms. Users will be then able to perform joint analysis of data originating from different projects because all data are processed by the same tools and represented in the same format.
Major GDC production workflows are available in GitHub https://github.com/NCI-GDC/gdc-workflow-overview. As workflow updates are inevitable for an on-going data processing project like the GDC, users are able to find workflow versions using the GDC data portal and API. In addition, major version changes are also described in detail in the GDC workflow documentation. When new workflows are introduced, GDC will reprocess old data if necessary.
The NCI's Genomic Data Commons has provided a feasible and scalable model for easy sharing of large set of genomics data. We hope our efforts in data sharing and uniform data processing will be valuable not only to researchers, but also to clinicians, patients, and other interested parties, and help accelerate the long-term goal of precision oncology.

Methods
Pipeline development and production. The GDC takes advantage of containerization technology and built pipelines using Docker (www.docker.com) to ensure pipeline scalability, portability, and reproducibility. The alignment was managed by an in-house job management system that creates new virtual machines of the designated configuration on demand in an OpenStack environment. The corresponding Docker container that holds the entire workflow then runs inside of the virtual machine for an additional layer of security. During development of high-level data generation pipelines, we transitioned from single Docker workflows to using Common Workflow Language (CWL, www.commonwl.org) to describe analysis workflows of multiple Dockerized tools. CWL provides an additional transparent layer between workflow description and workflow execution, that allows even better scalability through parallel execution and portability. The GDC managed CWL pipeline production uses the Slurm Workload Manager (slurm.schedmd.com).
DNA-Seq alignment and co-cleaning. The GDC DNA-Seq alignment pipeline follows GATK Best Practices (https://software.broadinstitute.org/gatk/bestpractices/). The main steps include regenerating FASTQ files from BAM input on a per-read group basis using biobambam2 40 and alignment by read group using BWA (version 0.7.12) in both paired-end and single-end mode. This was followed by BAM sort, merge, and MarkDuplicates using Picard 41,42 . The GDC maps reads using either BWA MEM mode if length is equal or larger than 70 bp or BWA Aln mode if below. BWA Aln is also used when mapping older FASTQ reads formatted with Illumina-1.3 and Illumina-1.5 quality scores. Multiple QC metrics were collected both before and after realignment using FastQC 43 , samtools 41 and Picard 42 . Re-aligned BAM files from the same patient are then collected together for cocleaning using GATK 3.6 IndelRealigner and BaseQualityScoreRecalibration.
The TCGA and TARGET BAMs to be harmonized were originally processed over a relatively large time scale in relation to the development of NGS technology. Some originated from as early as 2010 and were generated by a variety of workflows and reference genomes (11 different reference genomes in total including variants of hg18, hg19, GRCh36, and GRCh37). Some of the workflows had introduced incorrect information into the data, such as sample swapping and mislabeled experimental strategies, which were corrected in the GDC harmonization process.
MuTect2 is built upon the capability of local de novo assembly by HaplotypeCaller and somatic genotyping engine of Mutect. Mutect applies a Bayesian classifier to detect somatic mutations 11 . The GDC uses MuTect2 tools from the GATK nightly-2016-02-25-gf39d340 version. Before tumor normal pairs can be used for somatic variant calling, it is important to generate a Panel of Normals (PoNs) filter that contains calling artifacts and potential germline variants. As mentioned previously, whole genome amplified (WGA) samples are analyzed with dontUseSoftClippedBases turned on.
VarScan2 is another somatic variant caller that identifies both SNV and INDELS. It uses heuristics and statistics to identify variants and considers the confounding impacts of read depth, base quality, variant allele frequency and statistical significance 13 . GDC uses VarScan2 version 2.3.9. The first step of VarScan2 calling is to generate a mpileup file of both tumor and normal BAMs using samtools for a single mpileup file. We set the quality cutoff for samtools to be 1 and also disabled Base Alignment Quality score computation. The mpileup is then used as input to VarScan Somatic to generate a VCF file that contains both SNP and INDEL calls. The resulting VCF is filtered for significant calls using VarScan ProcessSomatic.
MuSE calls somatic variants using Markov Substitution model for Evolution 12 . The first step, "MuSE call", estimates the equilibrium frequencies of all four alleles and presents the maximum a posteriori on every genomics locus. The second step, "MuSE sump", performs a tier based cutoff based on a sample-specific error model which also takes dbSNP information into account. GDC uses MuSE version 1.0rc_submission_c039ffa. Parallelization can be implemented for the first step of MuSE, based on genomic chunks, which can accelerate the production close to linear. The GDC currently only passes calls with quality filter "PASS" to the GDC public MAF files; however, variants with other quality Tier values could also be considered a user's discretion.
SomaticSniper is a somatic variant caller that only identifies SNPs. It uses a bayesian inference to compare genotype likelihoods between tumor and normals 14 . GDC uses the default parameter settings of SomaticSniper version 1.0.5.0.
In addition to the built-in filters in each somatic caller, the GDC also applies additional filtering tools to label caller-generated variants. Because these filters are frequently updated, we have highlighted only a few of the major steps below.
A. False Positive Filter (FPFilter, https://github.com/ucscCancer/fpfilter-tool) was applied to both VarScan2 and SomaticSniper VCFs. B. SomaticSniper variants with SSC < 25 are removed from annotated VCFs. This is the only step in the entire GDC somatic variant pipeline in which low-quality variants are removed, instead of tagged. C. A WXS Panel of Normals was generated internally by MuTect2 calling on about 5,000 TCGA normal WXS samples in artifact detection mode and combined using GATK CombineVariants. The GDC received the sample list from the TCGA Genomics Data Analysis Center (GDAC) as TCGA normal samples that were previously identified to be free of hematopoiesis events (unpublished) at the time of GDC data processing. This PoN is not only used as a MuTect2 built-in filter 44 , but also applied to the other three somatic calling outputs in a similar manner. D. d-ToxoG (http://archive.broadinstitute.org/cancer/cga/dtoxog) is used to remove oxoG artifacts from point mutation calls. These artifacts were generated due to oxidative DNA damage during sample preparation 45 .
E. DKFZ Strandbias Filter (https://github.com/eilslabs/DKFZBiasFilter) is used to tag variants that are supported with significant bias from one strand direction compared to the other. Mutation Annotation Format (MAF) is a tab-delimited text file with aggregated mutation information from VCF Files and are generated on a project-level. The GDC currently produces two types of MAF files: controlled-access MAFs that contain all variants in VCFs, and open-access somatic MAFs that contain filtered variants and reduced germline contaminations and thus considered "high quality". Any user can explore the open-access somatic MAF for high quality calls; while a more sophisticated user may want to apply for dbGaP access to obtain the superset of mutations in the controlled-access MAF. With the larger set of mutations they may perform custom filtering based on FILTER and GDC_FILTER columns, or collect information that was removed from the open-access version, such as supporting read depth in the normal samples.
RNA-Seq alignment. The RNA-Seq alignment pipeline performs alignments of raw reads against the reference genome using a two-pass approach using STAR 25 . The first pass alignment recognizes splice junctions in the sample, and the second pass uses those splice junctions to perform the final alignment. STAR version 2.4.0f1 was initially used and we then switched to version 2.4.2a that fixed a bug and allowed us to complete processing all the input files. If both BAM and FASTQ input files existed, only FASTQ files were used.
HTSeq mRNA quantification. The HTSeq (version 0.6.p1) 26 pipeline is used for calculating the number of reads that align to different genes in the genome. As mentioned previously, only reads that can be uniquely assigned to a gene are counted. The GDC ran HTSeq on all samples as unstranded libraries in order to maintain consistency for cross-sample comparisons.
The raw counts are normalized into Fragment per kilobase million mapped reads (FPKM) and Upper Quartile Normalized FPKM (FPKM-UQ) using all protein-coding genes as the denominator.
where, DexSeq exon quantification. The GDC has also generated exon-level quantification using the DEXSeq 27,46 pipeline. The first step in this pipeline is to create the flattened General Feature Format (GFF) file, which essentially collapses the information for multiple transcripts spanning the same exon into exon counting bins for that exon. Once the flattened GFF file is obtained, the number of reads that overlap with each exon counting bin are calculated. The result is a flat file which has raw counts for each exon. This data type is not currently available in the GDC data portal and will appear in a later data release.
miRNA-Seq alignment and profiling. The GDC miRNA harmonization pipeline begins with a realignment of TCGA and TARGET miRNA-Seq reads using a similar strategy of the GDC DNA-Seq alignment pipeline. Because reads of miRNA-Seq are typically short, only BWA Aln was used. miRNA quantification is done with a modified version of the miRNA Profiling Pipeline v0.2.7 32 from BCGSC (British Columbia Genome Sequencing Center). In this pipeline, miRNA species and miRNA isoforms are counted differently, and normalized Reads Per Million values are also derived. The final results from each miRNA-Seq sample is a miRNA species quantification file and a miRNA isoform quantification file, in a human-readable format compatible to the original TCGA data. SNP 6.0 array copy number segmentation. The hg19-based probeset metadata were obtained from the Affymetrix website, and then lifted over to GRCh38. Probes with reference bases not matching between hg19 and GRCh38 were removed.
To generate Copy Number Segment file, all SNP and CNV probes are used for CBS calculation, with the only exception that probes in the Pseudo-Autosomal (PAR) regions were removed in males prior to calculation. To generate the Masked Copy Number Segment file from this result, all probesets in chromosome Y and in the frequent copy number variant regions in germlines obtained from GenePattern were also removed prior to calculation. Methylation array beta value annotation. Using probe sequence information provided in the manufacturer's manifest, HM27 and HM450 probes were remapped to the GRCh38 reference genome 47 . Type II probes with a mapping quality of <10, or Type I probes for which the methylated and unmethylated probes map to different locations in the genome, and/or had a mapping quality of <10, had an entry of '*' for the 'chr' field, and '−1' for coordinates 47 . These coordinates were then used to identify the associated transcripts from GEN-CODE v22, the associated CpG island, and the CpG sites' distance from each of these features. Multiple transcripts overlapping the target CpG were separated with semicolons. Beta values were inherited from existing TCGA Level 3 DNA methylation data (hg19-based) based on Probe IDs.
Variant comparison. The same genetic variant can be represented in VCF format in multiple different ways 48 , and many of these discrepancies can not be easily solved by existing normalization tools. In order to reduce false-positive annotations, GDC requires a strict matching of Chromosome, Position, and Alternative Alleles during implementation of MAF annotations. However, in various variant comparisons in this paper, we applied a loose matching strategy to regard two variants the same if they have overlapping regions between starting and ending positions. This is particularly useful when a non-INDEL caller, such as SomaticSniper or MuSE, represents INDEL sites as point mutations.
t-SNE clustering. mRNA expression count, miRNA expression count, Copy Number Segmentation, and Methylation Beta Values were collected from the GDC Data Portal. For mRNA expression and miRNA expression, we removed low-expressed genes and miRNAs if 99% or more samples have less than or equal to 1 count. For Copy Number Segments, we computed average segmentation means on each gene weighted on overlapped length between segments and genes. For methylation data, we removed probes that are empty in more than 5% of the samples, and imputed the remaining empty values with the probeset mean. The methylation data is also randomly down-sampled by 25% of the probes to reduce computational burden.
To integrate these data together for t-SNE clustering, we first computed the top 200 Principal Components (PCs) from each of the four data types using the R function prcomp. The top 200, ranked by "variation explained", from these 800 combined PCs were selected and further scaled to the arbitrary weights of 3:3:1:1 for mRNA:Methylation:miRNA:CNV, and finally used as input for the t-SNE analysis with the R package Rtsne. We gave less weights to miRNA and copy number because they don't separate cancer types as well compared to the other two data types. We found that the 3:3:1:1 ratio gives the best visual performance in the scatter plot compared to the other combinations tested. We ran t-SNE 1000 times with random seeds and displayed the result that minimizes the cost function 49 .
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The datasets generated during the current study are all available in the NCI's Genomic Data Commons (GDC, https://portal.gdc.cancer.gov/). These include the raw data used to generate all figures and statistical analysis. While the majority of the data used in the paper is open-access, access to TCGA protected MAFs from the GDC requires dbGaP approval (phs000178, https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi? study_id=phs000178.v11.p8). The following databases were used in data production: the GRCh38 major human genome assembly with GenBank assembly accession GCA_000001405.15 https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/; hs38d1 human decoy sequences with GenBank assembly accession GCA_000786075.2 https://www.ncbi.nlm.nih.gov/assembly/GCA_000786075.2/; HPV sequences from The PapillomaVirus Episteme (PaVE, http://pave.niaid.nih.gov/); gene model of GENCODE version 22 from https://www.gencodegenes.org/human/release_22.html; and, miRNA database miRBase version 21 from http://www.mirbase.org/. GenBank accessions of additional virus sequences are listed in Table 1. The remaining data are available within the Article, Supplementary Information or available from the authors upon request. Source data are provided with this paper.