Manually curated transcriptomics data collection for toxicogenomic assessment of engineered nanomaterials

Toxicogenomics (TGx) approaches are increasingly applied to gain insight into the possible toxicity mechanisms of engineered nanomaterials (ENMs). Omics data can be valuable to elucidate the mechanism of action of chemicals and to develop predictive models in toxicology. While vast amounts of transcriptomics data from ENM exposures have already been accumulated, a unified, easily accessible and reusable collection of transcriptomics data for ENMs is currently lacking. In an attempt to improve the FAIRness of already existing transcriptomics data for ENMs, we curated a collection of homogenized transcriptomics data from human, mouse and rat ENM exposures in vitro and in vivo including the physicochemical characteristics of the ENMs used in each study.

Measurement(s): microarray data • transcriptome • RNA • Toxicogenomics
Technology Type(s): digital curation
Factor Type(s): exposure to engineered nanomaterials
Sample Characteristic - Organism: Homo sapiens • Mus musculus • Rattus

Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.13154381

Metadata curation. Next, supporting information (metadata) for each entry in the initial collection was downloaded and manually curated in R (version 3.5.2). Metadata give context to the data by mapping each sample to biological variables, such as dose and time point, as well as technical variables crucial for the preprocessing of the data.
Metadata were obtained from the sample records of GEO entries using the function getGEO from the R package GEOquery 5 . For data sets available only on ArrayExpress, the sample information for each entry was downloaded. These data were then manually curated to produce a homogenized file for each data set consisting of the following variables: GSE (a unique identifier for each data set), GSM (sample id), treatment (exposure; i.e. ENM or control), group (experimental group; combination of a unique exposure, dose, and time point), organism, biological system, dose, dose unit, time point, time point unit, slide, array, dye and platform. Although some of these variables are not relevant for RNA-Seq data, all the columns were included for all the data to ensure convenient data usability. The nomenclature was unified to the extent allowed by the information provided in the original metadata. Each sample was then mapped to its corresponding raw data file (column filenames) or later matched to the FASTQ files based on the sample names (GSM). If one or more predefined technical variables were missing, the column was left empty (NA). However, if biological variables were missing or ambiguous, the data set was discarded. Lastly, for entries containing human primary cells, the donor was included in the metadata as an additional column, donor.
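As an illustration of the homogenized structure, the following base-R sketch builds a metadata table with the variables listed above. All values are hypothetical placeholders, not entries from the actual collection; in the real pipeline the sample records would be retrieved with GEOquery::getGEO().

```r
# Sketch of the homogenized metadata layout (illustrative values only).
metadata_columns <- c("GSE", "GSM", "treatment", "group", "organism",
                      "biological_system", "dose", "dose_unit",
                      "time_point", "time_point_unit",
                      "slide", "array", "dye", "platform", "filenames")

metadata <- data.frame(
  GSE               = c("GSE00000", "GSE00000"),        # data set identifier
  GSM               = c("GSM0000001", "GSM0000002"),    # sample identifiers
  treatment         = c("TiO2", "control"),             # ENM or control
  group             = c("TiO2_10_24", "control_0_24"),  # exposure_dose_time
  organism          = c("Homo sapiens", "Homo sapiens"),
  biological_system = c("BEAS-2B", "BEAS-2B"),
  dose              = c(10, 0),
  dose_unit         = c("ug/ml", "ug/ml"),
  time_point        = c(24, 24),
  time_point_unit   = c("h", "h"),
  slide             = NA,   # technical variables missing from the original
  array             = NA,   # record are left as NA rather than guessed
  dye               = NA,
  platform          = c("GPL10558", "GPL10558"),
  filenames         = c("sample1.txt", "sample2.txt"),
  stringsAsFactors  = FALSE
)
```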
ENM physicochemical characteristics curation. The majority of the datasets were associated with a published article describing the study and including some details of the materials used and their physico-chemical characteristics. In some cases, the information provided was the nominal size information from the ENM manufacturer, while others provided more detailed characterization of the ENM in the exposure medium. Newer studies tended to provide more detailed characterization information than older ones, as the community knowledge regarding minimum characterization needs and properties influencing ENM toxicity increased 6,7 . Several of the studies utilized ENMs already used in previous studies and referred to the characterization provided in those earlier studies, in which case the information was manually extracted from the earlier papers. The curated information for the ENMs includes information on the supplier (including batch and lot information where available), the purity / impurities, the nominal size and surface area, as well as characterization data such as the core particle size and shape as determined by Transmission Electron Microscopy (TEM), the hydrodynamic size and zeta potential (surface charge) in water and/or the exposure medium determined by Dynamic Light Scattering (DLS), information on the presence of endotoxin contamination (where provided) and a link to the commercial provider's material specification sheet where relevant. As many of the studies utilized several different ENMs, or several variants (e.g. sizes, capping agents, polymeric coatings etc.), each individual ENM within each study is described in a separate row of the ENM characteristics datasheet.
Manual quality assessment. The quality of transcriptomics data is highly dependent on the experimental design 2 . A low number of replicates results in weak statistics, while transcriptomics technologies themselves are often prone to technical bias. In order to ensure the quality and usability of each individual data set, evaluation was carried out based on the availability of raw data and supporting information as well as technical aspects of the experimental setup. The experiment was considered inappropriate for the collection if the experimental groups consisted of fewer than three biological replicates or if the experimental design introduced an unmanageable batch effect. Such batch effects were commonly introduced by consistently labeling different experimental groups with separate dyes in a two-color microarray experiment (i.e. lack of dye swapping). Furthermore, data sets representing non-commercial/custom or marginally represented platforms, for instance microarrays specific for miRNA or lncRNA, were excluded. As a result, only commercial gene expression microarrays from Agilent, Affymetrix, and Illumina were included alongside Illumina RNA-Seq platforms. The manual quality assessment of the collection is further described in the section Technical Validation.

Fig. 1 The workflow applied to compile the data collection. Solid-lined boxes represent the steps applied while the output is marked with a dashed line.
Data preprocessing. Preprocessing of transcriptomics data must be performed prior to any further analysis. The current standard preprocessing pipeline for microarray data includes steps for sample quality checking, probe filtering, data normalization, batch effect assessment and correction as well as probe annotation 8 . Similarly, the state-of-the-art preprocessing of RNA-Seq data includes quality control, read alignment, read count extraction, filtering low counts, normalization, and batch effect assessment 8 . Here, each data set was preprocessed and analyzed individually. Data sets consisting of several cell lines or tissues were further separated by the biological system to better focus on the transcriptional differences between the exposures.
Preprocessing was performed in the R programming language (R version 3.5.2) following standard preprocessing pipelines suitable for each platform. For Agilent and Affymetrix microarrays, the preprocessing was implemented in the software eUTOPIA 9 . For Illumina BeadChips, a similar approach was applied following the suggested workflow of the R Bioconductor package lumi 10 . The preprocessing workflow applied to each platform is summarized in Fig. 2.

Quality check.
Omics data are prone to technical errors that can arise from sample handling as well as the intrinsic characteristics of the platforms 8 . Therefore, an important step prior to any manipulation of the data is the quality check (QC), which allows the assessment of gene expression distributions across samples, revealing outliers and poor-quality samples. We applied a platform-specific QC to each data set to evaluate the quality of the samples as well as the prevalence of outliers in the data.
For Agilent microarrays, the R package arrayQualityMetrics 11 was used, while the QC for Affymetrix was performed using the R packages affyQCreport 12 and yaqcaffy 13 . Outliers were further assessed based on the visual representation in the form of density plots, bar plots, dendrograms, and multi-dimensional scaling (MDS) plots, which were also the primary method of outlier detection for Illumina arrays. Outliers were removed from subsequent preprocessing and analysis.
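As a rough sketch of how MDS plots can expose outlying samples, the following base-R example embeds simulated samples with cmdscale and flags the one engineered to deviate. The distance cut-off is an illustrative rule of thumb, not the visual assessment actually used in the study.

```r
# Simulated (genes x samples) expression matrix; sample S5 is globally
# shifted to act as an obvious outlier.
set.seed(1)
expr <- matrix(rnorm(1000 * 5), nrow = 1000,
               dimnames = list(NULL, paste0("S", 1:5)))
expr[, "S5"] <- expr[, "S5"] + 3   # artificial global shift

# Classical multi-dimensional scaling on inter-sample distances
mds <- cmdscale(dist(t(expr)), k = 2)

# Flag samples whose distance from the MDS centroid exceeds
# median + 3 * MAD (illustrative threshold)
centroid <- colMeans(mds)
d <- sqrt(rowSums(sweep(mds, 2, centroid)^2))
outliers <- names(d)[d > median(d) + 3 * mad(d)]
```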
Quality checking of the RNA sequencing data was performed using FastQC v0.11.7 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

Read alignment. RNA sequencing reads of mouse samples were aligned against the mouse reference genome assembly GRCm38, while sequencing reads of human samples were aligned against the human reference genome assembly GRCh38. The alignment was performed using the HISAT2 algorithm 14,15 employing the genome indexes built for usage with HISAT2 (retrieved from https://ccb.jhu.edu/software/hisat2/manual.shtml).
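A typical HISAT2 invocation of the kind described above can be assembled from R as follows. The index basename and file names are placeholders, and the flags shown (-p, -x, -U, -S) cover only the simplest single-end case.

```r
# Illustrative construction of a HISAT2 alignment call (placeholder file
# names; the genome index is assumed to be a pre-built index downloaded
# as described above).
hisat2_cmd <- function(index, fastq, sam_out, threads = 4) {
  paste("hisat2",
        "-p", threads,   # number of alignment threads
        "-x", index,     # basename of the pre-built genome index
        "-U", fastq,     # unpaired reads (use -1/-2 for paired-end data)
        "-S", sam_out)   # SAM output file
}
cmd <- hisat2_cmd("grch38/genome", "GSM0000001.fastq.gz", "GSM0000001.sam")
# The command could then be executed with system(cmd)
```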

Fig. 2 Preprocessing workflow applied to Agilent, Affymetrix, and Illumina microarrays and Illumina RNA-sequencing. Boxes with a blue background represent preprocessing steps and methods applied for each platform while boxes outlined with a dashed line represent the output obtained for each data set. The lack of a box indicates that the step was not applied for the platform.
Low counts filtering. In order to filter out the transcripts with low expression levels in the samples of each RNA-Seq data set, the proportion test was used as implemented in the Bioconductor NOISeq package (v2.31.0) 18 .

Probe filtering. For microarray experiments, probe filtering is commonly applied to remove probes showing low variance in the intensity range similar to the background 8 . These low-intensity probes were removed prior to data normalization. For Agilent microarrays, filtering was based on estimating the robustness of the probe signal intensities against the background (negative control probes) and applying a quantile-based method for eliminating probes with low signals. Individual thresholds were determined for each Agilent data set based on the data and the number of experimental groups and replicates. For Illumina gene expression microarrays, probe filtering was performed after normalization based on the detection p-values 10 provided in the raw data. Only probes with a detection p-value < 0.01 in at least one sample were considered for further analysis.
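The quantile-based filtering idea used for Agilent arrays can be sketched in base R as follows. The background quantile and the minimum number of samples are hypothetical values, since the actual thresholds were chosen per data set.

```r
# A probe is kept if its intensity exceeds a per-sample background cut-off,
# estimated from negative control probes, in at least `min_samples` samples.
set.seed(42)
signal   <- matrix(rexp(200 * 4, rate = 1/100), nrow = 200)  # probe signals
neg_ctrl <- matrix(rexp(50 * 4,  rate = 1/20),  nrow = 50)   # background probes

filter_probes <- function(signal, neg_ctrl, q = 0.95, min_samples = 2) {
  bg <- apply(neg_ctrl, 2, quantile, probs = q)  # per-sample background cut-off
  above_bg <- sweep(signal, 2, bg, FUN = ">")    # probe x sample logical matrix
  rowSums(above_bg) >= min_samples               # keep robustly detected probes
}
keep <- filter_probes(signal, neg_ctrl)
filtered <- signal[keep, , drop = FALSE]
```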
Normalization. Normalization of transcriptomics data is crucial for robust comparisons of gene expression. Here, the normalization of the expression signal distribution in the samples was performed on the log2-transformed signal intensities using the quantile normalization from the R package limma 19 for Agilent, and the function justRMA from the package affy 20 for Affymetrix microarrays. For Illumina microarrays, quantile normalization was performed with the function lumiN from the lumi R package 10 , while for Illumina RNA-Seq data, normalization was performed using the Bioconductor DESeq2 package 21 . In detail, the filtered raw counts were normalized by the median-of-ratios method implemented in the package (for details see the DESeq2 documentation).
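The median-of-ratios idea behind the DESeq2 normalization can be illustrated in base R. This is a simplified sketch (the authoritative implementation is DESeq2's estimateSizeFactors), applied to a toy count matrix in which the three samples differ only by sequencing depth.

```r
# Toy counts: sample 2 has twice, sample 3 half, the depth of sample 1.
counts <- matrix(c(10,  20,  5,
                   100, 200, 50,
                   30,  60,  15,
                   40,  80,  20), nrow = 4, byrow = TRUE)

median_of_ratios <- function(counts) {
  # Geometric mean of each gene across samples (pseudo-reference)
  log_geo_means <- rowMeans(log(counts))
  keep <- is.finite(log_geo_means)               # drop genes with zero counts
  # Per-sample size factor: median ratio of counts to the pseudo-reference
  size_factors <- apply(counts[keep, , drop = FALSE], 2, function(col)
    exp(median(log(col) - log_geo_means[keep])))
  sweep(counts, 2, size_factors, FUN = "/")      # normalized counts
}
norm_counts <- median_of_ratios(counts)
```

After normalization the depth difference is removed, so all three columns carry identical values for this toy example.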
Batch effect assessment and correction. Microarray experiments are susceptible to technical variation arising, for example, from the experimental setup, sample preparation, and the equipment. This type of variation can lead to decreased quality and incorrect results. Thus, reducing the variation associated with technical variables (batch effect), while maintaining biological variation, improves the robustness of the results. Here, batch effects were evaluated by inspecting the results of principal component analysis, hierarchical clustering and multi-dimensional scaling 9 . Technical variation arising from unknown batches was evaluated with the function sva from the R package sva 22 . If variation associated with known technical variables or any of the surrogate variables was observed, its correlation with biological variables of interest was assessed via a confounding plot 23 . Batches that were not confounded with any of the variables of interest were corrected using the ComBat 24 function from the R package sva 22 .

Probe annotation. Lastly, it is meaningful to map the probes to genes. For Agilent, the latest version of the annotation file for the specific microarray design was downloaded from the Agilent eArray website (https://earray.chem.agilent.com/earray/, 2020), and the probes were mapped to the Ensembl transcript IDs 25 . For Affymetrix gene expression arrays, the latest available alternative CDF files with Ensembl gene ID mappings were downloaded from Brainarray (http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/CDF_download.asp, 2020), while for Illumina BeadChips, the platform-specific R annotation packages (illuminaHumanv3.db 26 , illuminaHumanv4.db 27 , illuminaRatv1.db 28 or illuminaMousev2.db 29 ) were used.
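A minimal check of whether a known batch variable is confounded with the biological groups, the situation described in the batch effect assessment above in which correction is not applicable, can be sketched as:

```r
# Toy design: four samples, two biological groups, two batch layouts.
group     <- c("control", "control", "ENM", "ENM")
batch_ok  <- c("b1", "b2", "b1", "b2")  # each group spans both batches
batch_bad <- c("b1", "b1", "b2", "b2")  # batch coincides exactly with group

# A batch effect is separable from the group effect only if every biological
# group is observed in more than one batch level (illustrative rule of thumb,
# not the confounding-plot procedure used in the paper).
is_confounded <- function(group, batch) {
  any(rowSums(table(group, batch) > 0) < 2)
}
```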
Multiple probes mapped onto the same gene ID were summarized by their median values. Agilent probes that were initially annotated to Ensembl transcripts were further mapped to the corresponding Ensembl gene IDs. If multiple transcripts were mapped to the same gene, the one with the highest absolute score, calculated as −log(p-value) × log2(fold change) for each exposure vs. control pairwise comparison, was selected.
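The summarization and transcript-selection steps above can be sketched in base R with toy values (log10 is used for the p-value term here; the logarithm base is an assumption):

```r
# Median summarization of probes mapped to the same gene
probe_expr <- c(p1 = 2.0, p2 = 2.4, p3 = 5.0)       # probe-level values
probe2gene <- c(p1 = "GENE_A", p2 = "GENE_A", p3 = "GENE_B")
gene_expr  <- tapply(probe_expr, probe2gene[names(probe_expr)], median)

# Transcript selection by |-log10(p) * log2FC| score for one gene
transcripts <- data.frame(
  gene   = c("GENE_A", "GENE_A"),
  logFC  = c(1.0, 0.2),
  pvalue = c(0.001, 0.5)
)
transcripts$score <- abs(-log10(transcripts$pvalue) * transcripts$logFC)
best <- transcripts[which.max(transcripts$score), ]  # retained transcript
```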

Differential expression analysis.
Transcriptomics analysis aims at identifying gene expression differences between biological conditions. Here, we performed a differential expression analysis on each microarray data set using the R package limma 19 . Comparisons were made between each specific experimental group, consisting of a single exposure, dose, and time point, and its corresponding control samples. Batch-corrected variables were included as covariates of the linear model. When the biological material was obtained from human donors, the donor was included as a covariate in the analysis. For RNA-Seq based data sets, similar comparisons were made using the Bioconductor DESeq2 package 21 .
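To convey the core of an exposure-vs-control comparison, here is a deliberately simplified base-R sketch using ordinary per-gene t-tests on simulated data; limma's moderated statistics, covariates, and the donor term are omitted.

```r
set.seed(7)
expr <- matrix(rnorm(100 * 6), nrow = 100)   # log2 expression, 100 genes
expr[1:5, 4:6] <- expr[1:5, 4:6] + 3         # 5 genes induced by the exposure
groups <- c("control", "control", "control", "ENM", "ENM", "ENM")

# Per-gene log2 fold change and Welch t-test p-value
de_table <- t(apply(expr, 1, function(g) {
  c(logFC   = mean(g[groups == "ENM"]) - mean(g[groups == "control"]),
    p.value = t.test(g[groups == "ENM"], g[groups == "control"])$p.value)
}))

# Filter as in the collection: |logFC| > 0.58 (about a 1.5-fold change)
de_genes <- which(abs(de_table[, "logFC"]) > 0.58 & de_table[, "p.value"] < 0.05)
```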
As a result of the differential expression analysis, we provide full lists of genes with their specific fold changes and statistics as well as the results filtered to only contain significantly differentially expressed genes with the threshold of |logFC| > 0.58.

FAIRness optimization. To further assist accessibility, interoperability and reusability, the data sets have been curated, imported and made publicly available from the NanoPharos database (https://db.nanopharos.eu/), which has been developed under the Horizon 2020 (H2020) NanoSolveIT 30 (https://www.nanosolveit.eu) and NanoCommons projects (https://nanocommons.eu/). The NanoPharos database has been primarily developed to include computationally derived data based on simulations for ENMs at different levels of accuracy. The database was then further extended to include ENM characterization data and biological effects. With the inclusion of omics data, the NanoPharos database now covers, in a ready-for-modelling format, the full spectrum of data needed to initiate a computational workflow for in silico exploitation of the data. The data set was checked for inconsistencies in the data structure and harmonized where needed. The ENM physico-chemical characterization data have been enriched, where applicable, with molecular (e.g. atomic/ionic radii, electronegativity, energy band gap) and structural (e.g. crystallographic space group, unit cell dimensions and angles) properties. Each ENM has been linked to the respective transcriptomics data set to facilitate querying and study by users. The datasets can be queried and grouped, among others, based on the ENM core material, ENM batch, exposure time and dose, biological information, experiment type, analysis platform etc. (Supplementary File 1).
The NanoPharos database has been designed under the FAIR data principles 3 to offer users high-quality, ready-for-modelling data sets, while allowing further development, adaptation and expansion. The FAIR data principles are meant to help database managers improve data accessibility and reusability by the wider community in a way resembling Library Science 31 . To achieve this, data digitization in the NanoPharos database is being optimized to be machine readable to allow seamless data comparison, transformation and, where possible, combination, providing the user with bigger and more complete data sets. On top of that, the NanoPharos database goes beyond the technical character of the FAIR data principles and implements the scientific FAIR data principles (SFAIR) as defined recently by Papadiamantis et al. 31 , providing users with the necessary scientific context and background information to be able to reuse the data with the highest possible confidence. Furthermore, NanoPharos is readily accessible via a Representational State Transfer (REST) application programming interface (API) and is able to interact with external databases (e.g. the NanoSolveIT Cloud) and modelling tools through programmatic API access. The available datasets can be accessed through https://db.nanopharos.eu/Queries.

Data Records
The data collection 32 generated here is freely available on Zenodo at https://doi.org/10.5281/zenodo.4146981. The collection comprises 85 preprocessed microarray-based data sets totaling 506 unique ENM vs. control comparisons and 16 RNA-Seq based data sets representing 23 ENM vs. control comparisons. Additionally, 24 comparisons of non-nanoparticle compounds used as positive/negative controls in the original experiments are included for the microarray data sets and 7 additional compounds are included for the RNA-Seq data. All of the data sets and their descriptions are available in Online-only Table 1, while the physico-chemical characteristics of the tested ENMs are available in Online-only Table 2.
In order to facilitate the selection of data suitable for different applications and modelling approaches, we classified the data into four categories based on the experimental design. The proportion of each data class in the collection is visualized in Fig. 3a. Each class contains data obtained both in vivo and in vitro with at least two organisms represented (Fig. 3b). The collection covers a range of ENM compositions, as well as variants in size, shape, surface capping/coating etc. within a specific composition, in multiple biological systems in these organisms (Fig. 3c,d).
Files available for each data set. Each data set contains a homogenized metadata file, normalized and batch corrected expression matrices as well as complete and filtered results of the differential expression analysis (Table 1).

Technical Validation
The quality of transcriptomics data is a product of careful design of the experiment, technical execution as well as reporting of the data. The results of each downstream analysis substantially rely on the quality of the data. For this reason, we ensured that the collection contains high-quality data sets and defined a selection of criteria for data sets to be included:
• Three or more biological replicates are included for statistical robustness
• Microarray platform is a commercial gene expression microarray produced by Agilent, Affymetrix or Illumina
• The labelling of 2-color microarrays has been done considering dye swapping
• Non-normalized raw data is available
• Supporting information reports all variables required for preprocessing
• Untreated control samples are included
Each entry was evaluated based on these criteria, and either removed from the collection or selected for further preprocessing and analysis. The number of entries discarded for each of the listed reasons is reported in Table 2. Out of the 124 original entries, 84 passed the quality assessment and were further divided into a total of 101 data sets (85 microarray and 16 RNA-Seq) based on the biological systems as specified in Data preprocessing.
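The inclusion criteria above can be encoded as a simple predicate; the field names of the per-entry summary are hypothetical, chosen here only to mirror the listed criteria.

```r
# Hypothetical per-entry summary of the information checked during curation
entry <- list(
  n_replicates      = 3,
  platform_vendor   = "Agilent",
  two_color         = TRUE,
  dye_swap          = TRUE,   # only relevant for two-color designs
  raw_data          = TRUE,
  metadata_complete = TRUE,
  has_controls      = TRUE
)

# TRUE only if all inclusion criteria are met
passes_qc <- function(e) {
  e$n_replicates >= 3 &&
    e$platform_vendor %in% c("Agilent", "Affymetrix", "Illumina") &&
    (!e$two_color || e$dye_swap) &&
    e$raw_data && e$metadata_complete && e$has_controls
}
```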

Usage Notes
Here we provide the largest homogenized collection of transcriptomics data sets in the field of nanosafety, supplemented with metadata and ENM physico-chemical characteristics. The collection offers a valuable source for multiple analysis and modeling approaches 33 . For instance, the mechanism of action of each ENM can be characterized by investigating the provided lists of differentially expressed genes, and may be linked to specific physico-chemical characteristics such as size, surface capping or coating, which can guide the redesign of ENMs that are safer and may support grouping into sets of nanoforms in accordance with the REACH regulation (https://echa.europa.eu/documents/10162/13655/how_to_register_nano_en.pdf/f8c046ec-f60b-4349-492b-e915fd9e3ca0), for example. Moreover, pathway enrichment analysis can be performed to annotate these genes onto biological functions 34 . ENMs can be further compared and grouped based on the similarities between their molecular alteration profiles.
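As an example of the pathway enrichment analysis mentioned above, over-representation of differentially expressed genes in a pathway can be tested with a hypergeometric test in base R (the gene counts below are toy values):

```r
# Toy over-representation test: is the pathway enriched among DE genes?
universe_size <- 20000   # genes measured on the platform
pathway_size  <- 100     # genes annotated to the pathway
de_size       <- 500     # differentially expressed genes
overlap       <- 12      # DE genes that fall in the pathway

# P(X >= overlap) for a hypergeometric draw of `de_size` genes
p_enrich <- phyper(overlap - 1, pathway_size,
                   universe_size - pathway_size, de_size,
                   lower.tail = FALSE)
```

With an expected overlap of only 500 × 100 / 20000 = 2.5 genes, observing 12 yields a very small enrichment p-value.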
Due to the homogenized preprocessing and manual curation of the metadata, this collection is a relevant resource for the identification of toxicity biomarkers. This can be addressed by using multiple feature selection approaches 35,36 or more advanced data modelling techniques 37-39 . Biomarkers could also be detected by means of gene co-expression network analysis, under the assumption that central network genes play a key role in the adaptation to the exposure 40,41 .
The availability of data for multiple organisms or tissues can contribute to the development of more accurate adverse outcome pathways by linking ENM-specific molecular initiating events with cascades of relevant biological processes leading to an adverse outcome 42,43 . In addition, our data collection can be easily integrated with other transcriptomics data in the context of a read-across analysis to identify similarities between the molecular alterations induced by the ENMs and those induced by other phenotypic entities such as chemicals, drugs, and diseases 44 . Moreover, the data sets that we denoted as class I and II, where exposures at multiple doses are available, can be further analyzed to identify dose-dependent molecular alterations 45-48 .
Our manually curated transcriptomics data collection with supporting ENM descriptions will have a high impact on the nanosafety community and can aid the development of new methodologies for nanomaterial safety assessment 2,8,30,33,43 .

Code availability
Preprocessing of the data was performed in R version 3.5.2. The preprocessing of Agilent and Affymetrix expression data was performed using eUTOPIA 9 , an R Shiny application freely available at https://github.com/Greco-Lab/eUTOPIA. Custom scripts used for preprocessing of Illumina BeadChip and RNA sequencing data are available on GitHub at https://github.com/grecolab/Public_Nano.

Table 2. Reasons for discarding data during the manual quality assessment.
Lack of replicates: 26
Non-commercial or marginally represented platform: 5
Two-color setup with no dye swapping: 4
No raw data available: 2
Incomplete metadata: 2
Lack of control samples: 1
Total entries discarded: 40