Chromatin-based, in cis and in trans regulatory rewiring underpins distinct oncogenic transcriptomes in multiple myeloma

Multiple myeloma (MM) is a genetically heterogeneous cancer of the bone marrow plasma cells (PC). Distinct myeloma transcriptome profiles are primarily driven by myeloma initiating events (MIE) and converge into a mutually exclusive overexpression of the CCND1 and CCND2 oncogenes. Here, with reference to their normal counterparts, we find that myeloma PC enhanced chromatin accessibility combined with paired transcriptome profiling can classify MIE-defined genetic subgroups. Across and within different MM genetic subgroups, we ascribe regulation of genes and pathways critical for myeloma biology to unique or shared, developmentally activated or de novo formed candidate enhancers. Such enhancers co-opt recruitment of existing transcription factors, which, although not transcriptionally deregulated per se, organise aberrant gene regulatory networks that help identify myeloma cell dependencies with prognostic impact. Finally, we identify and validate the critical super-enhancer that regulates ectopic expression of CCND2 in a subset of patients with MM and in chronic lymphocytic leukemia.

High-throughput sequencing data for patient and normal donor BM samples and MM cell lines were generated in-house. Additional MM patient WGS and RNA-seq data were obtained from the MMRF database (https://research.themmrf.org/). The Chromatin State Segmentations (ChromHMM) for 19 B-cell lineage stages were retrieved from the DeepBlue Epigenomic Data Server (https://deepblue.mpi-inf.mpg.de/). CLL sample data were downloaded from http://inb-cg.bsc.es/hcli/IDIBAPS_Biomedical_Epigenomics/CLL_Reference_Epigenome/. Additional information is provided in the Supplementary file.

Software and code
All computational methods used in this paper are detailed in Supplementary Methods. In short, quality control of high-throughput sequencing data was performed using FastQC (v0.11.3). The human genome GRCh38 annotations were obtained from Ensembl (v85). Bowtie2 was used for ChIP-seq data alignment. Picard (4.0.1) was used to mark and remove duplicate reads. MACS2 (2.1) was used for peak calling. deepTools (v2.0) was used to create signal tracks from BAM files. Tools from the HOMER package (v4.9) were used, in default mode, for motif analysis, super-enhancer calling and annotation of genomic regions against the hg38 human genome. Salmon (v0.11.4) was used to obtain expression estimates and perform fragment GC bias correction. DESeq2 (v1.18.1) was used for RNA-seq data normalization and differential expression. Batch effects were removed using the limma package (v3.34.9). Unannotated TSS present in samples were obtained by mapping RNA-seq reads using HISAT (v0.1.6). StringTie (v1.2.3) was used to assemble mapped reads and identify novel transcripts. Cutadapt (v1.9.1) and Sickle (v1.33) were used for ATAC-seq adapter trimming. Functions from the samtools (v1.3.1) and bedtools (v2.20.0) packages were used for file re-formatting and filtering. The R library 'annotatr' (v1.8.0) was used to annotate accessible chromatin regions using the TxDb.Hsapiens.UCSC.hg38.knownGene reference package.
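The ChIP-seq steps named above (alignment, duplicate removal, peak calling) can be sketched as a small command builder. This is an illustrative sketch only and not the authors' pipeline: the sample name, file names, index prefix and thread count are hypothetical, the `picard` wrapper executable is assumed to be on the PATH, and the flags shown are standard Bowtie2, samtools, Picard and MACS2 options rather than the settings used in the study, which are given in Supplementary Methods. Single-end reads are assumed for simplicity.

```python
import shlex
from typing import List, Optional


def bowtie2_cmd(index_prefix: str, fastq: str, sam_out: str, threads: int = 4) -> List[str]:
    # -x: index prefix, -U: unpaired reads, -S: SAM output, -p: threads
    return ["bowtie2", "-p", str(threads), "-x", index_prefix, "-U", fastq, "-S", sam_out]


def samtools_sort_cmd(sam_in: str, bam_out: str) -> List[str]:
    # Coordinate-sort the alignments and write BAM, as required by MarkDuplicates
    return ["samtools", "sort", "-o", bam_out, sam_in]


def picard_dedup_cmd(bam_in: str, bam_out: str, metrics: str) -> List[str]:
    # REMOVE_DUPLICATES=true drops duplicates instead of only flagging them
    return ["picard", "MarkDuplicates", f"I={bam_in}", f"O={bam_out}",
            f"M={metrics}", "REMOVE_DUPLICATES=true"]


def macs2_cmd(treatment_bam: str, name: str, control_bam: Optional[str] = None) -> List[str]:
    # -g hs: human effective genome size; -c adds an optional input control
    cmd = ["macs2", "callpeak", "-t", treatment_bam, "-f", "BAM", "-g", "hs", "-n", name]
    if control_bam is not None:
        cmd += ["-c", control_bam]
    return cmd


def chipseq_plan(sample: str, index_prefix: str, fastq: str,
                 control_bam: Optional[str] = None) -> List[List[str]]:
    """Return the ordered commands for one sample without running them."""
    sam = f"{sample}.sam"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    return [
        bowtie2_cmd(index_prefix, fastq, sam),
        samtools_sort_cmd(sam, sorted_bam),
        picard_dedup_cmd(sorted_bam, dedup_bam, f"{sample}.dup_metrics.txt"),
        macs2_cmd(dedup_bam, sample, control_bam),
    ]


if __name__ == "__main__":
    # Hypothetical sample and paths, printed as shell-safe command lines
    for cmd in chipseq_plan("sample1", "index/GRCh38", "sample1.fastq.gz"):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the commands as argument lists and only printing them keeps the sketch runnable without any of the tools installed; each list could be passed to `subprocess.run` once the parameters have been checked against the Supplementary Methods.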

October 2018


Life sciences study design
RNA-seq and ATAC-seq data for the ND3_CD19neg sample were excluded due to low sequencing quality.
All primary samples were sufficient for only one RNA-seq and one ATAC-seq reaction, so replication of the primary sample analysis was not possible. [...] after the analysis, making replication of the in vivo part of the study impossible. Cell line data were replicated (other experiments, n=2-3). Each replication was successful.
Primary sample preparation and analysis were performed randomly, based on the availability of clinical samples. For MM cell line experiments, cells were randomly split into treatment and control groups and subjected to the same culture conditions and processing, to exclude any technical bias.
All primary samples were anonymous. Clinical information was revealed only after sample processing, to assist in subgroup identification and analysis. For MM cell line experiments, blinding was not relevant as none of the recorded data were subjective (e.g. CRISPRi constructs, GFP+ fluorescence, etc.).

Karyotyping was performed for all cell lines. All cell lines were routinely tested (every two weeks) for mycoplasma throughout this study. The cell lines used were confirmed negative.
No commonly misidentified lines were used in this study.
The covariate-relevant population characteristics considered in this study were gender, disease sub-category, disease stage and cytogenetic information. Our experimental design aimed for a roughly equal representation of these covariate characteristics across all processed samples (additional information is provided in the Suppl. Data1 table).
Samples were processed randomly, based on resource availability. No other inclusion/exclusion criteria were implemented.
NHS Health Research Authority, East of England - Cambridge Central Research Ethics Committee. Reference: 11/H0308/9