STATegra, a comprehensive multi-omics dataset of B-cell differentiation in mouse

Gomez-Cabrero, David; Tarazona, Sonia; Ferreirós-Vidal, Isabel; Ramirez, Ricardo N.; Company, Carlos; Schmidt, Andreas; Reijmers, Theo; Paul, Veronica von Saint; Marabita, Francesco; Rodríguez-Ubreva, Javier; Garcia-Gomez, Antonio; Carroll, Thomas; Cooper, Lee; Liang, Ziwei; Dharmalingam, Gopuraja; van der Kloet, Frans; Harms, Amy C.; Balzano-Nogueira, Leandro; Lagani, Vincenzo; Tsamardinos, Ioannis; Lappe, Michael; Maier, Dieter; Westerhuis, Johan A.; Hankemeier, Thomas; Imhof, Axel; Ballestar, Esteban; Mortazavi, Ali; Merkenschlager, Matthias; Tegner, Jesper; Conesa, Ana

doi:10.1038/s41597-019-0202-7

Data Descriptor
Open access
Published: 31 October 2019

David Gomez-Cabrero^1,2,3^na1,
Sonia Tarazona⁴^na1,
Isabel Ferreirós-Vidal⁵^na1,
Ricardo N. Ramirez⁶^na1,
Carlos Company⁷^na1,
Andreas Schmidt⁸^na1,
Theo Reijmers⁹^na1,
Veronica von Saint Paul¹⁰^na1,
Francesco Marabita ORCID: orcid.org/0000-0001-6180-0106²,
Javier Rodríguez-Ubreva⁷,
Antonio Garcia-Gomez⁷,
Thomas Carroll⁵,
Lee Cooper ORCID: orcid.org/0000-0002-4425-1843⁵,
Ziwei Liang⁵,
Gopuraja Dharmalingam⁵,
Frans van der Kloet¹⁶,
Amy C. Harms ORCID: orcid.org/0000-0002-2931-4295⁹,
Leandro Balzano-Nogueira¹¹,
Vincenzo Lagani^12,13,
Ioannis Tsamardinos^13,14^na2,
Michael Lappe¹⁵^na2,
Dieter Maier¹⁰^na2,
Johan A. Westerhuis ORCID: orcid.org/0000-0002-6747-9779^16,17^na2,
Thomas Hankemeier⁹^na2,
Axel Imhof ORCID: orcid.org/0000-0003-2993-8249⁸^na2,
Esteban Ballestar⁷^na2,
Ali Mortazavi⁶^na2,
Matthias Merkenschlager⁵^na2,
Jesper Tegner ORCID: orcid.org/0000-0002-9568-5588^2,3,18^na2 &
Ana Conesa¹¹^na2

Scientific Data volume 6, Article number: 256 (2019)

Abstract

Multi-omics approaches use a diversity of high-throughput technologies to profile the different molecular layers of living cells. Ideally, the integration of this information should result in comprehensive systems models of cellular physiology and regulation. However, most multi-omics projects still include a limited number of molecular assays and there have been very few multi-omic studies that evaluate dynamic processes such as cellular growth, development and adaptation. Hence, we lack formal analysis methods and comprehensive multi-omics datasets that can be leveraged to develop true multi-layered models for dynamic cellular systems. Here we present the STATegra multi-omics dataset that combines measurements from up to 10 different omics technologies applied to the same biological system, namely the well-studied mouse pre-B-cell differentiation. STATegra includes high-throughput measurements of chromatin structure, gene expression, proteomics and metabolomics, and it is complemented with single-cell data. To our knowledge, the STATegra collection is the most diverse multi-omics dataset describing a dynamic biological system.

Measurement(s)	messenger RNA • miRNA • methylation • deoxyribonuclease activity • assay for transposase-accessible chromatin using sequencing • scRNA • chromatin immunoprecipitation • protein expression profiling • metabolite • metabolite (http://purl.obolibrary.org/obo/CHEBI_25212)
Technology Type(s)	RNA sequencing • RRBS • DNase-Seq • ATAC-seq • scRNA-seq • scATAC-seq (Microfluidics) • ChIP-seq • mass spectrometry • ultra-performance liquid chromatography-mass spectrometry • gas chromatography-mass spectrometry
Factor Type(s)	Ikaros level • harvesting time
Sample Characteristic - Organism	Mus musculus

Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.9773624

Background and Summary

The concept of multi-omics and data integration has been increasingly used during the last 5 years to describe the multitude of high-throughput molecular technologies that can be applied to the study and analysis of biological systems¹. Such techniques hold the promise to uncover the different biological processes and layers of regulatory complexity within biological systems. In brief, high-throughput molecular methods can extract information on essentially three basic, yet different, components of living cells. Nucleic acids can readily be profiled using massively parallel sequencing, which in turn provides a deep characterization of chromatin properties (e.g. Hi-C², ATAC-seq³, DNase-seq⁴, ChIP-seq⁵, WGBS⁶, RRBS⁷) and the dynamics of gene expression (e.g. RNA-seq⁸, microRNA-seq^9,10, PAR-CLIP¹¹, iCLIP-seq¹²). Proteins are measured by proteomics and phosphoproteomics approaches, based on Liquid Chromatography (LC) and isobaric tag labeling (iTRAQ) coupled to Mass Spectrometry (MS). Finally, the metabolome and lipidome, i.e. organic compounds, are captured using mature techniques such as LC/GC-MS or Nuclear Magnetic Resonance (NMR). Increasingly, multi-omics technologies are applied under the same physiological conditions, to either the same or different samples, to generate a comprehensive set of data spanning multiple molecular levels. The general expectation of multi-omics projects is that the combination of multi-layered data will reveal aspects of the complexity of biological systems that cannot be fully understood using only one data type. Moreover, in addition to the exciting technical reality of being able to monitor several complementary data types, the community has come to realize the power of using time in the experimental design.
Hence, by collecting data over time, where as a rule the different molecular entities are correlated, it becomes easier to extract key processes from each data type and to uncover dependencies between different regulatory layers. These technical and conceptual advances are currently being transferred into the vibrant single-cell biology community. Thus, recent advances in single-cell omics technologies have made it feasible to perform multi-omics profiling of individual cells. Consequently, the single-cell community can benefit from the experiences and lessons derived from time-dependent bulk multi-omics analysis. Clearly, high-resolution single-cell analysis has proven crucial to assess tissue heterogeneity^13,14,15 and cell fate^16,17. In conclusion, we are most likely entering an era where we can target regulatory networks in single cells¹⁸ using a temporal paradigm coupled to a multi-omics analysis.

While multi-omics projects are frequently depicted as a set of stacked molecular layers that are connected to pass information from the genetic component to the organismal phenotype, the harsh reality is that many multi-omics projects are still constrained by budgetary restrictions and sample limitations, which reduce the number of technologies that can realistically be assessed. In most cases, only a few data types can be included, with a limited number of samples, and analysis is as a rule restricted to 2 or 3 regulatory layers. A few international projects have, however, successfully collected large datasets and generated comprehensive portfolios of omics measurements. For example, ENCODE¹⁹, TCGA²⁰, IHEC²¹ and ImmGen²² had the explicit goal to perform an extensive characterization of a particular set of cells or tissues. These projects have shaped the scope and type of analysis methods and scientific discoveries that can be achieved so far by the multi-omic approach. In some cases, multi-level data are combined to increase the statistical power available to classify samples or predict disease outcomes. By measuring different types of features the chance of identifying relevant biomarkers increases, but the analysis does not automatically lend itself to a mechanistic account of the inter-dependencies between these biomarkers or of their relationship with the outcome, such as a disease. In other cases, however, two specific omics layers are measured in order to probe their regulatory relationships. For example, methods that integrate ATAC-seq or RRBS with RNA-seq might shed light on the epigenetic control of gene expression²³, while integrating transcriptomics and metabolomics data may help elucidate metabolic regulation^24,25. Yet, there have been very few multi-omic studies that evaluate dynamic processes such as cellular growth, development and adaptation.
Hence, we still lack formal analysis methods and comprehensive multi-omics datasets that can be leveraged to develop true multi-layered models for dynamic cellular systems. This state of affairs has been the rationale underpinning the formulation of what is referred to as the STATegra project (http://www.stategra.eu/). This is a transnational initiative to develop methods, software and data for dynamic multi-omics analyses. From the STATegra project several tools for integrative multi-omics data analyses have been published and released^26,27,28,29,30,31,32,33.

Here we share the collection of the different STATegra datasets, a multi-omics dataset that combines measurements from up to 10 different omics technologies applied to the same biological system. STATegra uses a well-studied system, namely mouse pre-B-cell differentiation, in a cell line model³⁴. This is a highly reproducible in vitro system^33,34,35,36 that allows the generation of sufficient material to deploy a comprehensive set of omics measurements. STATegra covers the three types of biomolecules and the different layers that comprise the basic flow of genetic information: chromatin structure (through DNase-seq, RRBS and ChIP-seq), gene expression (RNA-seq and miRNA-seq), proteomics and metabolomics. The collection is complemented with single-cell RNA-seq and ATAC-seq data on the differentiating conditions. The STATegra multi-omics dataset is unique in the number and diversity of omics technologies available and in the dynamic nature of the system. Our ambition has been to generate this collection of data to serve, in full or in part, as a workbench for the development of integrative analysis methods for multi-layered systems biology.

In previous studies, ChIP-seq data from this collection have been used to identify Ikaros targets³⁴. ChIP-seq, DNase-seq, RNA-seq and scRNA-seq datasets were used in Vidal et al.³⁵ to describe the cross-talk between the Ikaros, Foxo1 and Myc transcription factors in regulating B-cell development. scATAC-seq, scRNA-seq and ATAC-seq data have been used to develop new statistical methods for the integration of single-cell multi-omics³³.

Methods

Experimental design

Figure 1 illustrates the STATegra dataset. The mouse B3 cell line models the pre-BI (or Hardy fraction C’) stage. Upon nuclear translocation of the Ikaros transcription factor these cells progress to the pre-BII (or Hardy fraction D) stage, where B cell progenitors undergo growth arrest and differentiation^34,37. The B3 cell line was retrovirally transduced with a vector encoding an Ikaros-ERt2 fusion protein, which allows control of nuclear levels of Ikaros upon exposure to the drug Tamoxifen³⁴. In parallel, cells were transduced with an empty vector to serve as control for the Tamoxifen effect. After drug treatment, cultures were harvested at 0 h, 2 h, 6 h, 12 h, 18 h and 24 h (Fig. 1a) and profiled by several omics technologies: long messenger RNA-seq (mRNA-seq) and micro RNA-seq (miRNA-seq) to measure gene expression; reduced representation bisulfite sequencing (RRBS) to measure DNA methylation; DNase-seq to measure chromatin accessibility as DNaseI Hypersensitive Sites (DHS) and transcription factor footprints; shotgun proteomics; and targeted metabolomics of primary carbon and amino-acid metabolism. Moreover, single-cell RNA-seq (scRNA-seq) data were obtained for the entire time series, while bulk ATAC-seq (ATAC-seq) and single-cell ATAC-seq (scATAC-seq) data were obtained in a later round of experiments for the 0 h and 24 h time points of Ikaros induction only (no control series were run for these datasets). The dataset is complemented by existing ChIP-seq data on the same system equivalent to our 0 h and 24 h time points³⁴. In total, 793 different samples across the different omics datasets define the STATegra data collection (Fig. 1b).

The time points analyzed were based on previous microarray studies³⁴ and have been fully validated by comparing the transcriptional response in this experimental system to pre-B cell differentiation in vivo. Ikaros translocates to the nucleus of B3 cells within minutes, binds to target promoters and changes RNAP2 occupancy and primary transcript levels with immediate effect³⁶. The 2 h time point is relatively late compared to changes in primary transcript levels³⁶ and was chosen because the data presented here were generated by conventional RNA-seq, which relies on changes in steady state, rather than primary transcript levels.

Culture conditions

B3 cells containing inducible Ikaros can be expanded before induction of Ikaros to produce sufficient material for all omics experiments. G1 arrest occurs within 16 h following Ikaros induction. Cells containing inducible Ikaros were generated by transducing the mouse pre-B cell line B3 with murine stem cell virus (MSCV) retroviral vectors encoding a fusion protein of haemagglutinin-tagged wild type Ikaros (HA-Ikaros) and the estrogen receptor hormone-binding domain (ERt2), followed by an internal ribosomal entry site (IRES) and GFP. Control cells were generated by transducing the mouse pre-B cell line B3 with MSCV retroviral vectors encoding the estrogen receptor hormone-binding domain (ERt2) followed by an IRES and GFP. Retrovirally infected B3 cells were sorted based on GFP levels. GFP-positive cells were expanded in culture for a few days (3–4) and then frozen. Frozen vials containing 5 million cells were stored in liquid nitrogen.

For time course experiments, 10 million control and Ikaros cells were thawed and expanded for 4 days, after which cells were plated for induction of the different time points. Both control and Ikaros cells were split into flasks containing 20 million cells each at a density of 0.5 million cells per ml. For time point inductions, 0.5 µM 4-hydroxy-tamoxifen (4-OHT) was added to both a flask containing Ikaros cells and a flask containing control cells at one of the specified times: 2 h, 6 h, 12 h, 18 h or 24 h before collection. Cells for time point 0 h (no 4-OHT induction) were obtained separately in three different batches (Fig. 1b). All cells within the same experimental batch were harvested simultaneously. Cells were centrifuged for 5 min at 1200 rpm, washed twice in PBS and counted for aliquoting. Aliquots of 10 million cells were prepared for the RNA-seq, metabolomics and proteomics platforms, and aliquots of 5 million cells for the miRNA-seq and Methyl-seq platforms. Cell pellets were snap-frozen in liquid nitrogen and stored at −80 °C. For DNase-seq and bulk ATAC-seq samples, 20–25 million and 50,000 cells were used, respectively. The full time course experiment was repeated several times (batches) to generate biological replicates (Fig. 1b). The same physical cultures were used to obtain cells for mRNA-seq, miRNA-seq, RRBS and proteomics. Other omics technologies ran their own cultures to obtain cell material.

Acquisition of Multi-omics data

RNA-seq

Total RNA was isolated with RNAbee (Ambion), frozen at ICL and transported via courier (<1 day) to Karolinska Institutet. To account for the impact of the different sources of variability during RNA-seq profiling, we implemented a carefully balanced distribution of samples in relation to time points (6 time points), treatment (Ikaros vs Control), library preparation, bar-code, sequencing run and lanes, and biological replicates (3 batches). Briefly, samples were first balanced across six library preparation runs of 6 samples each (Fig. 2). Secondly, each RNA-seq library was split into two (total of 72) in order to better account for variability associated with sequencing. Finally, for sequencing (75-nucleotide paired-end reads), the 72 libraries were balanced across 4 flow-cells, with 3 libraries in each lane. In each lane, we ensured that different libraries, different batches and different time points were included and that both conditions were present. Additionally, we balanced the time points, conditions and batches within each flow-cell. For each flow-cell, a full lane was reserved for quality control. We aimed to obtain 50 M reads per library, and therefore 100 M reads per sample. Libraries were built using the strand-specific RNA-seq dUTP protocol³⁸. Sequencing was conducted on an Illumina HiSeq 2500 platform.
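To make the idea of balancing concrete, the following is a minimal sketch of one way to distribute the 36 samples (6 time points × 2 conditions × 3 batches) over six library preparation runs of 6 samples each. The Latin-square style rule used here is an illustrative assumption and is not the allocation shown in Fig. 2; the function and variable names are hypothetical.

```python
from collections import Counter
from itertools import product

TIME_POINTS = ["0h", "2h", "6h", "12h", "18h", "24h"]
CONDITIONS = ["Ikaros", "Control"]
BATCHES = [1, 2, 3]


def assign_library_runs():
    """Assign 36 samples to six runs of six samples each.

    Illustrative rule (an assumption, not the published design): the six
    condition/batch groups are indexed k = 0..5, and the sample at time
    index t in group k goes to run (t + k) mod 6. Every run then holds
    one sample per group and one sample per time point.
    """
    runs = {r: [] for r in range(6)}
    groups = list(product(CONDITIONS, BATCHES))  # 6 condition/batch groups
    for k, (cond, batch) in enumerate(groups):
        for t, time in enumerate(TIME_POINTS):
            runs[(t + k) % 6].append((time, cond, batch))
    return runs


def summarize(samples):
    """Count how often each time point, condition and batch occurs in a run."""
    return (
        Counter(s[0] for s in samples),
        Counter(s[1] for s in samples),
        Counter(s[2] for s in samples),
    )


if __name__ == "__main__":
    for run, samples in assign_library_runs().items():
        times, conds, batches = summarize(samples)
        print(run, dict(conds), dict(batches), sorted(times))
```

Under this rule each run contains every time point once, three samples per condition and two samples per batch, so no experimental factor is confounded with the library preparation run.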

Small RNA-seq for miRNA analysis

Small RNA-seq analysis was performed using Trizol-extracted total RNA of 3 biological replicates (4, 5 and 6) for time 0 h and total RNA of 3 biological batches (1, 2 and 3) for times 2 h, 6 h, 12 h, 18 h and 24 h. RNA quality was assessed using a Bioanalyzer (Agilent Technologies) to evaluate the RNA integrity number (RIN). The library was generated using the TruSeq Small RNA Sample Preparation Kit and deep sequencing was performed on an Illumina HiSeq 2000 platform. Between 15 and 20 million sequencing reads were obtained from each sample.

The library preparation and sequencing of the biological replicates were conducted on two different occasions (technical batches). Figure 3 shows the experimental design according to the batch in which samples were processed. There were two experimental conditions (C = Control, IK = Ikaros) and the 3 biological replicates per condition and time point were numbered as 1, 2 and 3. For some of these biological replicates one additional technical replicate was generated (Fig. 3) in order to estimate the variability between technical batches and to correct any potential batch effect.

DNase-seq

DNase-seq was performed on ~20–25 million cells with 3 biological replicates for all time points (0–24 hours) and conditions (Ikaros-inducible and control). Briefly, cells were harvested and washed with cold 1X PBS, prior to nuclei lysis. Lysing conditions were optimized to ensure >90% recovery of intact nuclei. DNaseI concentrations were titrated on Ikaros-inducible and control cells using qPCR against known positive DNaseI hypersensitive promoters (Ap2a1, Ikzf1, Igll1) and negative, inaccessible promoters (Myog, Myod) in our biological system, thereby reducing excessive digestion of DNA. Enrichment of DNaseI hypersensitive fragments (0–500 bp) was performed using a low-melt gel size selection protocol. Libraries were prepared and sequenced as 43 bp paired-end reads on an Illumina NextSeq 500. DNaseI libraries were sequenced at a minimum depth of 20 million reads per biological replicate. To perform DNaseI footprinting analysis, libraries were further sequenced and merged to achieve a minimum of 200 million mapped reads.

RRBS

Genomic DNA was isolated using the high salt method and used for reduced representation bisulfite sequencing (RRBS), a bisulfite-based protocol that enriches CG-rich parts of the genome, thereby reducing the amount of sequencing required while capturing the majority of promoters and other relevant genomic regions. This approach provides both single-nucleotide resolution and quantitative DNA methylation measurements. In brief, genomic DNA is digested using the methylation-insensitive restriction enzyme MspI in order to generate short fragments that contain CpG dinucleotides at the ends. After end-repair, A-tailing and ligation to methylated Illumina adapters, the CpG-rich DNA fragments (40–220 bp) are size selected, subjected to bisulfite conversion, PCR amplified and then sequenced as 2 × 100 bp paired-end reads on an Illumina HiSeq 2500³⁹. Around 30 million sequencing reads were obtained from each sample.
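The digestion and size-selection steps can be illustrated with a small in silico sketch. It assumes the standard MspI recognition site CCGG with the cut placed after the first C (C^CGG) and applies the 40–220 bp window quoted above to the raw fragment length; the function names are hypothetical, and the code ignores adapters, end-repair and bisulfite conversion.

```python
import re

MSPI_SITE = "CCGG"  # MspI recognition site; the cut falls after the first C (C^CGG)


def mspi_fragments(seq):
    """Return the fragments produced by cutting `seq` at every MspI site."""
    seq = seq.upper()
    # A lookahead finds overlapping sites too; +1 places the cut after the first C.
    cuts = [m.start() + 1 for m in re.finditer(f"(?={MSPI_SITE})", seq)]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def size_select(fragments, min_len=40, max_len=220):
    """Keep fragments whose length lies within the size-selection window."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


if __name__ == "__main__":
    # Toy sequence with two MspI sites separated by 50 bases.
    toy = "A" * 10 + "CCGG" + "T" * 50 + "CCGG" + "A" * 300
    fragments = mspi_fragments(toy)
    print([len(f) for f in fragments])
    print(size_select(fragments))
```

Fragments flanked by two MspI sites start with CGG and end with C, which is why every selected fragment carries a CpG at each end.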

Single-cell RNA-seq

Single cells were isolated using the Fluidigm C1 System. Single-cell C1 runs were completed using the smallest IFC (5–10 µm) based on the estimated size of B3 cells. Briefly, cells were collected for each time point at a concentration of 400 cells/μl in a total of 50 μl. To optimize cell capture rates on the C1, buoyancy estimates were optimized prior to each run. Our C1 single-cell capture efficiency was ~75–90% across 8 C1 runs. Each individual C1 capture site was visually inspected to ensure single-cell capture and cell viability. After visualization, the IFC was loaded with Clontech SMARTer kit lysis, RT, and PCR amplification reagents. After harvesting, cDNA was normalized across all libraries to 0.1–0.3 ng/μl and libraries were constructed using Illumina’s Nextera XT library prep kit per Fluidigm’s protocol. Constructed libraries were multiplexed and purified using AMPure beads. The final multiplexed single-cell library was analyzed on an Agilent 2100 Bioanalyzer for fragment distribution and quantified using Kapa Biosystems’ universal library quantification kit. The library was normalized to 2 nM and sequenced as 75 bp paired-end dual-indexed reads using Illumina’s NextSeq 500 system at a depth of ~1.0–2.0 million reads per library. Each Ikaros time point was performed once, with the exception of the 18 and 24 hour time points, for which two C1 runs were required in order to achieve approximately 50 single cells per time point.
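As a worked illustration of normalizing a library to 2 nM, the sketch below converts a mass concentration into molarity and computes the dilution volumes. It assumes the common approximation of 660 g/mol per base pair of double-stranded DNA; the example concentration, fragment length and function names are hypothetical and not taken from the study.

```python
def library_molarity_nM(conc_ng_per_ul, mean_fragment_bp, bp_mass=660.0):
    """Convert a dsDNA library concentration (ng/ul) to molarity (nM).

    Assumes an average mass of `bp_mass` g/mol per base pair.
    """
    return conc_ng_per_ul * 1e6 / (bp_mass * mean_fragment_bp)


def dilution_volumes(stock_nM, target_nM, final_ul):
    """Return (library volume, diluent volume) in ul, from C1*V1 = C2*V2."""
    if stock_nM < target_nM:
        raise ValueError("stock is more dilute than the target concentration")
    library_ul = target_nM * final_ul / stock_nM
    return library_ul, final_ul - library_ul


if __name__ == "__main__":
    # Hypothetical library: 1.32 ng/ul with a 500 bp mean fragment size.
    stock = library_molarity_nM(1.32, 500)
    print(stock)
    print(dilution_volumes(stock, target_nM=2.0, final_ul=20.0))
```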

Bulk and single-cell ATAC-seq

Single-cell ATAC-seq was performed using the Fluidigm C1 system as done previously⁴⁰. Briefly, cells were collected at 0 and 24 hours post-treatment with tamoxifen, at a concentration of 500 cells/μl in a total of 30–50 μl. Additionally, 3 biological replicates of ~50,000 cells were collected for each measured time point to generate bulk ATAC-seq measurements. Bulk ATAC-seq was performed as previously described³. ATAC-seq peak calling was performed using bulk ATAC-seq samples. ATAC-seq peaks were then used to estimate the single-cell ATAC-seq signal. Our C1 single-cell capture efficiency was ~70–80% for our pre-B system. Each individual C1 capture site was visually inspected to ensure single-cell capture. In brief, amplified transposed DNA was collected from all captured single cells and dual-indexing library preparation was performed. After PCR amplification of single-cell libraries, all subsequent libraries were pooled and purified using a single MinElute PCR purification (Qiagen). The pooled library was run on a Bioanalyzer and normalized using the Kapa library quantification kit prior to sequencing. A single pooled library was sequenced as 40 bp paired-end dual-indexed reads using the high-output (75 cycle) kit on the NextSeq 500 from Illumina. Two C1 runs were performed for the 0 and 24-hour single-cell ATAC-seq experiments.
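One simple way to estimate a per-cell signal from bulk-derived peaks is to count, for each cell, the reads that fall inside each peak. The sketch below shows that idea only; the data layout (each read reduced to a chromosome and a single position, peaks as half-open intervals) and all names are assumptions for illustration, not the processing pipeline used for this dataset.

```python
from collections import defaultdict


def count_reads_in_peaks(peaks, cell_reads):
    """Build a cell-by-peak count table.

    peaks      : list of (chrom, start, end), half-open intervals [start, end)
    cell_reads : dict mapping cell id -> list of (chrom, position)
    Returns a dict mapping cell id -> list of counts, one per peak.
    """
    peaks_by_chrom = defaultdict(list)
    for idx, (chrom, start, end) in enumerate(peaks):
        peaks_by_chrom[chrom].append((start, end, idx))

    counts = {}
    for cell, reads in cell_reads.items():
        row = [0] * len(peaks)
        for chrom, pos in reads:
            # Linear scan per chromosome; adequate for a small illustration.
            for start, end, idx in peaks_by_chrom.get(chrom, []):
                if start <= pos < end:
                    row[idx] += 1
        counts[cell] = row
    return counts


if __name__ == "__main__":
    bulk_peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
    reads = {
        "cell_A": [("chr1", 150), ("chr1", 199), ("chr1", 200), ("chr2", 100)],
        "cell_B": [("chr3", 10)],
    }
    print(count_reads_in_peaks(bulk_peaks, reads))
```

For genome-scale data an interval index or a dedicated genomics library would replace the linear scan.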

Proteomics

A heavy-isotope labeled cell line representing the preB3 cell line at the starting condition was spiked into each sample before trypsin digestion to balance differences in sample amount resulting from sample preparation. After tryptic digestion, the 36 biological samples were analyzed by one-dimensional nanoRP-C18 LC-MS/MS in technical triplicates on an LTQ Orbitrap platform coupled to an Ultimate 3000 RSLC system (Thermo-Fisher). First, peptide mixtures were desalted on a trapping column (0.3 × 5 mm, Acclaim PepMap C18, 5 µm, Thermo-Fisher) at a flow rate of 25 µl/min of 0.05% TFA. A linear gradient from 3% to 32% acetonitrile in 0.1% formic acid over 4 h was applied for optimal separation of the complete proteome sample. Peptides eluting from the column were directly transferred to the gas phase via a nano-electrospray ionization source (Proxeon) and detected in the mass spectrometer. A data-dependent acquisition cycle consisting of 1 survey scan at a resolution of 60,000 and up to 7 MS/MS scans was employed. Orbitrap MS spectra were internally calibrated on the siloxane signal at 442.1 m/z. Charge-state detection was enabled, allowing for a precursor selection of charges 2–5 and excluding precursors with undefined, single and higher charge. Precursors with a minimal signal intensity of 5000 cps were isolated within a 1.2 Da window, fragmented by CID (normalized collision energy 35, activation time 30 ms, Q 0.25) and analyzed in the ion trap. Previously analyzed precursors were dynamically excluded from MS/MS selection for 180 seconds.

Metabolomics

Metabolomics measurements were performed on different biological batches than the other omics platforms because sample preparation for metabolomics differs from that of the other platforms. In particular, metabolomics requires acute stopping of all metabolic reactions after sampling, while for other types of measurements this is not so critical. The cell extraction protocol for metabolomics consisted of filtration, washing, and quenching steps to remove medium from the cells and stop metabolism. Four biological batches (9, 10, 11 and 12) were acquired. Visual inspection of the cell pellets showed that batches 11 and 12 contained samples that were not completely dry. The metabolomics measurements were obtained with two different analytical platforms, a targeted liquid chromatography mass spectrometry (LC-MS) platform and a gas chromatography mass spectrometry (GC-MS) platform. The LC-MS is a targeted platform measuring amino acids and biogenic amines, and the GC-MS focuses on polar metabolites of the primary metabolism such as glycolysis, the citric acid cycle and amino acid metabolism. The LC-MS and GC-MS data had measurements for 36 and 40 metabolites, respectively. The measurements were done on exactly the same samples: 80% of the pooled extract was used for GC-MS, 10% for LC-MS and 8% for protein weight. Some metabolites were measured on both platforms. In that case, the LC-MS value was selected.
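The rule for metabolites measured on both platforms can be expressed as a small merge step. The sketch below assumes each platform's measurements for one sample are held in a dictionary keyed by metabolite name; the metabolite names, values and function name are hypothetical.

```python
def merge_platforms(lcms, gcms):
    """Combine LC-MS and GC-MS values for one sample.

    When a metabolite is present on both platforms the LC-MS value is kept.
    Returns a dict mapping metabolite -> (value, source platform).
    """
    merged = {name: (value, "GC-MS") for name, value in gcms.items()}
    # LC-MS entries are written last, so they override any GC-MS duplicate.
    merged.update({name: (value, "LC-MS") for name, value in lcms.items()})
    return merged


if __name__ == "__main__":
    lcms_sample = {"alanine": 1.8, "serotonin": 0.2}
    gcms_sample = {"alanine": 2.1, "citrate": 5.4}
    for metabolite, (value, source) in sorted(merge_platforms(lcms_sample, gcms_sample).items()):
        print(metabolite, value, source)
```

Recording the source platform alongside each value keeps the provenance of every measurement visible after merging.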

The metabolomics measurement pipeline includes two types of control: the quality control (QC) sample and the internal standard solution. The QC sample is typically a mixture of study samples that is inserted after every six study samples in the measurement series and is used to correct for experimental drift of the analytical instrument. Because of the limited availability of sample material, the QC sample used here was not a mixture of study samples but material of control B3 cells not activated with tamoxifen. The internal standard solution for the GC-MS and LC-MS consists of 13C-labeled yeast extract, which is added to each study sample at the beginning of the sample preparation process to correct for experimental errors made during the sample processing. For LC-MS an additional internal standard solution is added, consisting of 13C-labeled amines for most of the amines measured with the platform. For LC-MS the labeled versions of the metabolites were used as internal standard, while for GC-MS the best internal standard was chosen based on the smallest residual standard deviation of the QC samples. During the process of measurement, the time points for each batch were randomized, but each Ikaros sample and its control were maintained together.
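A measurement sequence with these properties can be sketched as follows: time points are shuffled within a batch, each Ikaros sample stays next to its control, and a QC injection is placed after every six study samples. The sample naming scheme, the random order within each pair and the seed are assumptions made for illustration only.

```python
import random

TIME_POINTS = ["0h", "2h", "6h", "12h", "18h", "24h"]


def run_order(batch, seed=0, qc_every=6):
    """Return a randomized measurement order for one batch.

    Time points are shuffled, each Ikaros/control pair is kept adjacent,
    and a QC sample is inserted after every `qc_every` study samples.
    With an even `qc_every` a QC never splits a pair.
    """
    rng = random.Random(seed)  # seeded for a reproducible illustration
    times = TIME_POINTS[:]
    rng.shuffle(times)

    order = []
    n_study = 0
    for time in times:
        pair = [f"batch{batch}_IK_{time}", f"batch{batch}_C_{time}"]
        rng.shuffle(pair)  # assumption: order within a pair is also random
        for sample in pair:
            order.append(sample)
            n_study += 1
            if n_study % qc_every == 0:
                order.append(f"QC_{n_study // qc_every}")
    return order


if __name__ == "__main__":
    for position, sample in enumerate(run_order(batch=9, seed=1), start=1):
        print(position, sample)
```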