Background & Summary

Engineered nanomaterials (ENMs) are an emerging class of chemicals with great technological and societal impact. Their unique physicochemical properties have already inspired a multitude of applications, ranging from medicine to industry and consumer products. While these unique properties make ENMs attractive for countless applications, they can also be responsible for potentially harmful effects on human health and the environment. ENMs can be synthesized in various sizes, shapes, and chemistries, with even small differences in composition leading to novel properties and effects that need to be considered. Rigorous risk assessment is therefore needed to ensure the safety of ENMs. Toxicogenomics (TGx) has emerged as a complementary approach to traditional toxicology with the potential to facilitate faster and cheaper hazard assessment of ENMs1,2. The large-scale profiling of exposure-induced molecular alterations sets the stage for mechanistic toxicology and expedites the development of predictive models. Furthermore, the application of TGx data to nanosafety can provide novel possibilities for grouping and classifying ENMs based on the similarity of molecular alterations in biological systems, and can help to derive biomarkers that identify nano-specific signatures.

Transcriptomics technologies are at the frontline of TGx. Vast amounts of transcriptomics data for multiple ENMs have already been generated, offering a valuable resource for future studies and applications. However, the data are scattered across public repositories, and their FAIRness is currently hampered by their heterogeneous nature and the lack of standardization in preprocessing and analysis. The FAIR principles for scientific data were defined in 2016 and have since guided efforts towards more Findable, Accessible, Interoperable, and Reusable data3. The FAIRness of ENM-relevant databases, including ArrayExpress, the Gene Expression Omnibus (GEO), eNanoMapper, and NanoCommons, has recently been evaluated. While the six datasets extracted from these met the majority of the criteria defined by the FAIR maturity indicators, the identified areas for improvement included the use of standard schemas for metadata and the presence of specific attributes in registries of repositories that would increase the FAIRness of datasets4. To unleash the full potential of existing transcriptomics data on ENM exposures, much of which lacks metadata on exposure conditions and ENM characteristics, we created a unified collection of 101 manually curated and preprocessed data sets, covering a range of ENMs, organisms, and exposure setups, using the approach represented in Fig. 1.

Fig. 1
figure 1

The workflow applied to compile the data collection. Solid-lined boxes represent the steps applied, while the output is marked with a dashed line.

The overarching aim of this study was to manually curate a comprehensive collection of transcriptomics data in the field of nanosafety, thereby producing a collection with a higher degree of FAIRness than the individual original data sets composing it.

Methods

Data set identification and collection

The first step in compiling the collection was to identify relevant data sets across public repositories. The search was limited to human, mouse, and rat data. We queried the Gene Expression Omnibus (GEO) and ArrayExpress databases with the following search terms: “engineered nanomaterial”, “nanomaterial” and “nanoparticle”. The initial search yielded 124 unique entries, which underwent manual assessment. Raw, non-normalized data for each microarray-based entry were downloaded from the series entry page, while for RNA-Seq data sets raw sequencing data in .fastq format were retrieved from the European Nucleotide Archive (ENA) (https://www.ebi.ac.uk/ena/browser/home).

Metadata curation

Next, supporting information (metadata) for each entry in the initial collection was downloaded and manually curated in R (version 3.5.2). Metadata give context to the data by mapping each sample to biological variables, such as dose and time point, as well as technical variables crucial for the preprocessing of the data.

Metadata were obtained from the sample records of GEO entries using the function getGEO from the R package GEOquery5. For data sets available only on ArrayExpress, the sample information for each entry was downloaded. These data were then manually curated to produce a homogenized file for each data set consisting of the following variables: GSE (a unique identifier for each data set), GSM (sample id), treatment (exposure; i.e. ENM or control), group (experimental group; a combination of a unique exposure, dose, and time point), organism, biological system, dose, dose unit, time point, time point unit, slide, array, dye, and platform. Although some of these variables are not relevant for RNA-Seq data, all columns were included for all data sets to ensure convenient usability. The nomenclature was unified to the extent allowed by the information provided in the original metadata. Each sample was then mapped to its corresponding raw data file (column filenames) or later matched to the .fastq files based on the sample names (GSM). If one or more predefined technical variables were missing, the column was left empty (NA). However, if biological variables were missing or ambiguous, the data set was discarded. Lastly, for entries containing human primary cells, the donor was included in the metadata as an additional column (donor).
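
As an illustration of this step, the sketch below retrieves the sample metadata for a single GEO series and initializes the homogenized table; the accession number is a placeholder, and the mapping of original columns to curated variables (dose, time point, etc.) was resolved manually for each study.

    # Sketch: retrieve sample metadata for one GEO series (placeholder accession)
    # and build the skeleton of the homogenized metadata table.
    library(GEOquery)

    gse <- getGEO("GSE00000", GSEMatrix = TRUE)  # hypothetical accession
    pheno <- pData(gse[[1]])                     # sample-level metadata

    metadata <- data.frame(
      GSE        = "GSE00000",
      GSM        = pheno$geo_accession,
      treatment  = pheno$title,                  # manually recoded to ENM or control
      dose       = NA, dose.unit = NA,           # filled in manually per study
      time.point = NA, time.point.unit = NA,
      stringsAsFactors = FALSE
    )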

ENM physicochemical characteristics curation

The majority of the datasets were associated with a published article describing the study and including some details of the materials used and their physico-chemical characteristics. In some cases, the information provided was the nominal size from the ENM manufacturer, while other studies provided a more detailed characterization of the ENM in the exposure medium. Newer studies tended to provide more detailed characterization information than older ones, as the community knowledge regarding minimum characterization needs and properties influencing ENM toxicity increased6,7. Several of the studies utilized ENMs already used in previous studies and referred to the characterization provided therein, in which case the information was manually extracted from the earlier papers. The curated information for the ENMs includes the supplier (including batch and lot information where available), the purity/impurities, the nominal size and surface area, characterization data such as the core particle size and shape as determined by Transmission Electron Microscopy (TEM), the hydrodynamic size and zeta potential (surface charge) in water and/or the exposure medium as determined by Dynamic Light Scattering (DLS), information on the presence of endotoxin contamination (where provided), and a link to the commercial provider's material specification sheet where relevant. As many of the studies utilized several different ENMs, or several variants (e.g. sizes, capping agents, polymeric coatings etc.), each individual ENM within each study is described in a separate row of the ENM characteristics datasheet.

Manual quality assessment

The quality of transcriptomics data is highly dependent on the experimental design2. A low number of replicates results in weak statistical power, while transcriptomics technologies themselves are often prone to technical bias. In order to ensure the quality and usability of each individual data set, an evaluation was carried out based on the availability of raw data and supporting information as well as technical aspects of the experimental setup. An experiment was considered inappropriate for the collection if its experimental groups consisted of fewer than three biological replicates or if the experimental design introduced an unmanageable batch effect. Such batch effects were commonly introduced by consistently labeling different experimental groups with separate dyes in a two-color microarray experiment (i.e. lack of dye swapping). Furthermore, data sets representing non-commercial/custom or marginally represented platforms, for instance microarrays specific for miRNA or lncRNA, were excluded. As a result, only commercial gene expression microarrays from Agilent, Affymetrix, and Illumina were included alongside Illumina RNA-Seq platforms. The manual quality assessment of the collection is further described in the section Technical Validation.

Data preprocessing

Preprocessing of transcriptomics data must be performed prior to any further analysis. The current standard preprocessing pipeline for microarray data includes steps for sample quality checking, probe filtering, data normalization, batch effect assessment and correction as well as probe annotation8. Similarly, the state-of-the-art preprocessing of RNA-Seq data includes quality control, read alignment, read count extraction, filtering low counts, normalization, and batch effect assessment8. Here, each data set was preprocessed and analyzed individually. Data sets consisting of several cell lines or tissues were further separated by the biological system to better focus on the transcriptional differences between the exposures.

Preprocessing was performed in the R programming language (R version 3.5.2) following standard preprocessing pipelines suitable for each platform. For Agilent and Affymetrix microarrays, the preprocessing was implemented in the software eUTOPIA9. For Illumina BeadChips, a similar approach was applied following the suggested workflow of the R Bioconductor package lumi10. The preprocessing workflow applied to each platform is summarized in Fig. 2.

Fig. 2
figure 2

Preprocessing workflow applied to Agilent, Affymetrix, and Illumina microarrays and Illumina RNA-sequencing. Boxes with a blue background represent preprocessing steps and methods applied for each platform, while boxes outlined with a dashed line represent the output obtained for each data set. The absence of a box indicates that the step was not applied for that platform.

Quality check

Omics data are prone to technical errors that can arise from sample handling as well as the intrinsic characteristics of the platforms8. Therefore, an important step prior to any manipulation of the data is the quality check (QC), which allows the assessment of gene expression distributions across samples, revealing outliers and poor-quality samples. We applied a platform-specific QC to each data set to evaluate the quality of the samples as well as the prevalence of outliers in the data.

For Agilent microarrays, the R package arrayQualityMetrics11 was used, while the QC for Affymetrix arrays was performed using the R packages affyQCReport12 and yaqcaffy13. Outliers were further assessed based on visual representations in the form of density plots, bar plots, dendrograms, and multi-dimensional scaling (MDS) plots, which were also the primary method of outlier detection for Illumina arrays. Outliers were removed from subsequent preprocessing and analysis.
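
As a sketch of how such a QC report can be generated (assuming eset is an ExpressionSet of raw intensities; the output directory name is illustrative):

    # Sketch: generate an HTML QC report that flags outlier arrays; flagged
    # samples were then inspected visually before removal.
    library(arrayQualityMetrics)

    arrayQualityMetrics(expressionset = eset,  # assumed: raw intensities
                        outdir = "QC_report",  # illustrative output folder
                        force = TRUE,
                        do.logtransform = TRUE)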

Quality checking of the RNA sequencing data was performed using FastQC v0.11.7 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

Read alignment

RNA sequencing reads of mouse samples were aligned against the mouse reference genome assembly GRCm38, while sequencing reads of human samples were aligned against the human reference genome assembly GRCh38. The alignment was performed using the HISAT2 algorithm14,15 employing the genome indexes built for usage with HISAT2 (retrieved from https://ccb.jhu.edu/software/hisat2/manual.shtml). Sequencing file format conversions, such as .sam to .bam, sorting, and extraction of uniquely mapped reads were performed using SAMtools (version 1.8-27-g0896262)16.

Read counts extraction

Raw read counts for the RNA-Seq data were computed using the R package Rsubread (v2.2.3)17. The human Gencode version 35 annotation was used for read count extraction from human samples, while for mouse samples the mouse Gencode version M25 was employed. Both annotations were downloaded from https://www.gencodegenes.org.
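
A minimal sketch of the counting step with Rsubread's featureCounts, assuming sorted .bam files of uniquely mapped reads and the Gencode GTF downloaded as described above; the file paths and the paired-end flag depend on the individual data set.

    # Sketch: extract gene-level raw counts from aligned reads.
    library(Rsubread)

    fc <- featureCounts(
      files = list.files("bam", pattern = "\\.bam$", full.names = TRUE),
      annot.ext = "gencode.v35.annotation.gtf",  # Gencode M25 for mouse samples
      isGTFAnnotationFile = TRUE,
      GTF.featureType = "exon",
      GTF.attrType = "gene_id",
      isPairedEnd = TRUE                         # depends on the library layout
    )
    raw_counts <- fc$counts                      # genes x samples matrix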

Low counts filtering

In order to filter out the transcripts with low expression levels in the samples of each RNA-Seq dataset, the proportion test was used as implemented in the Bioconductor NOISeq package (v2.31.0)18.
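
A minimal sketch of this filter, assuming raw_counts is the gene-by-sample count matrix from the previous step and groups holds the experimental group of each sample:

    # Sketch: remove lowly expressed genes with the NOISeq proportion test
    # (method = 3); counts are not yet normalized at this stage.
    library(NOISeq)

    filtered_counts <- filtered.data(raw_counts,
                                     factor = groups,  # experimental groups
                                     norm   = FALSE,
                                     method = 3,       # proportion test
                                     p.adj  = "fdr")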

Probe filtering

For microarray experiments, probe filtering is commonly applied to remove probes whose signal intensities lie in a range similar to the background8. These low-intensity probes were removed prior to data normalization. For Agilent microarrays, filtering was based on estimating the robustness of the probe signal intensities against the background (negative control probes) and applying a quantile-based method for eliminating probes with low signals; individual thresholds were determined for each Agilent data set based on the number of experimental groups and replicates. For Illumina gene expression microarrays, probe filtering was performed after normalization based on the detection p-values10 provided in the raw data. Only probes with a detection p-value < 0.01 in at least one sample were considered for further analysis.
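
A sketch of a quantile-based filter of this kind for single-channel Agilent arrays, following the approach described in the limma user's guide; the replicate threshold k and the input files are assumptions that were set per data set.

    # Sketch: keep regular probes whose signal exceeds the 95th percentile of
    # the negative control probes in at least k samples.
    library(limma)

    raw <- read.maimages(metadata$filenames, source = "agilent", green.only = TRUE)
    neg95 <- apply(raw$E[raw$genes$ControlType == -1, ], 2,
                   function(x) quantile(x, p = 0.95))
    cutoff <- matrix(1.1 * neg95, nrow(raw$E), ncol(raw$E), byrow = TRUE)
    k <- 3  # assumed: size of the smallest experimental group
    isexpr <- rowSums(raw$E > cutoff) >= k
    raw_filtered <- raw[raw$genes$ControlType == 0 & isexpr, ]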

Normalization

Normalization of transcriptomics data is crucial for robust comparisons of gene expression. Here, the normalization of the expression signal distribution in the samples was performed on the log2-transformed signal intensities using quantile normalization from the R package limma19 for Agilent and the function justRMA from the package affy20 for Affymetrix microarrays, respectively. For Illumina microarrays, quantile normalization was performed with the function lumiN from the lumi R package10, while for Illumina RNA-Seq data, normalization was performed using the Bioconductor DESeq2 package21. In detail, the filtered raw counts were normalized by the median-of-ratios method implemented in the package (for details see the DESeq2 documentation).
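
The sketch below illustrates the normalization calls for two of the platforms, assuming raw_filtered (Agilent, from the probe filtering step) and filtered_counts with its sample table metadata (RNA-Seq); function names follow the packages cited above.

    # Sketch: platform-specific normalization.
    library(limma)
    library(DESeq2)

    # Agilent: quantile normalization of log2-transformed intensities
    norm_expr <- normalizeBetweenArrays(log2(raw_filtered$E), method = "quantile")

    # RNA-Seq: median-of-ratios normalization via DESeq2 size factors
    dds <- DESeqDataSetFromMatrix(countData = filtered_counts,
                                  colData   = metadata,
                                  design    = ~ group)
    dds <- estimateSizeFactors(dds)
    norm_counts <- counts(dds, normalized = TRUE)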

Batch effect assessment and correction

Microarray experiments are susceptible to technical variation arising, for example, from the experimental setup, sample preparation, and the equipment. This type of variation can lead to decreased quality and incorrect results. Thus, reducing the variation associated with technical variables (batch effect), while maintaining biological variation, improves the robustness of the results. Here, batch effects were evaluated by inspecting the results of principal component analysis, hierarchical clustering, and multi-dimensional scaling9. Technical variation arising from unknown batches was evaluated with the function sva from the R package sva22. If variation associated with known technical variables or any of the surrogate variables was observed, its correlation with the biological variables of interest was assessed via a confounding plot23. Batches that were not confounded with any of the variables of interest were corrected using the ComBat24 function from the R package sva22.
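
A condensed sketch of this step, assuming norm_expr is the normalized expression matrix, metadata the curated sample table, and the slide variable an example of a known, non-confounded batch:

    # Sketch: estimate surrogate variables for unknown batches, then correct a
    # known batch variable that is not confounded with the biology.
    library(sva)

    mod  <- model.matrix(~ group, data = metadata)  # variables of interest
    mod0 <- model.matrix(~ 1, data = metadata)      # null model
    sv   <- sva(norm_expr, mod, mod0)               # surrogate variable estimation

    expr_corrected <- ComBat(dat = norm_expr,
                             batch = metadata$slide,  # example batch variable
                             mod = mod)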

Probe annotation

Lastly, it is meaningful to map the probes to genes. For Agilent, the latest version of the annotation file for the specific microarray design was downloaded from the Agilent eArray website (https://earray.chem.agilent.com/earray/, 2020), and the probes were mapped to the Ensembl transcript IDs25. For Affymetrix gene expression arrays, the latest available alternative CDF files with Ensembl gene ID mappings were downloaded from Brainarray (http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/CDF_download.asp, 2020), while for Illumina BeadChips, the platform specific R annotation packages (illuminaHumanv3.db26, illuminaHumanv4.db27, illuminaRatv1.db28 or illuminaMousev2.db29) were used.

Multiple probes mapped onto the same gene ID were summarized by their median values. Agilent probes that were initially annotated to Ensembl transcripts were further mapped to the corresponding Ensembl gene IDs. If multiple transcripts were mapped to the same gene, the one with the highest absolute score, calculated as −log(p-value) × log2(fold change) for each exposure vs. control pairwise comparison, was selected.
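
A minimal sketch of the median summarization, assuming expr_corrected has probes as rows and probe2gene is a named vector mapping probe IDs to Ensembl gene IDs:

    # Sketch: collapse multiple probes per gene to their median value.
    gene_expr <- aggregate(as.data.frame(expr_corrected),
                           by  = list(gene = probe2gene[rownames(expr_corrected)]),
                           FUN = median)
    rownames(gene_expr) <- gene_expr$gene
    gene_expr$gene <- NULL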

Differential expression analysis

Transcriptomics analysis aims at identifying gene expression differences between biological conditions. Here, we performed a differential expression analysis on each microarray data set using the R package limma19. Comparisons were made between each specific experimental group, consisting of a single exposure, dose, and time point, and its corresponding control samples. Batch-corrected variables were included as covariates in the linear model. When the biological material was obtained from human donors, the donor was included as a covariate in the analysis. For RNA-Seq based data sets, similar comparisons were made using the Bioconductor DESeq2 package21.

As a result of the differential expression analysis, we provide full lists of genes with their specific fold changes and statistics, as well as results filtered to contain only significantly differentially expressed genes, with thresholds of |logFC| > 0.58 and Benjamini & Hochberg adjusted p-value < 0.05. Due to the independent filtering implemented in DESeq2 (for details see the DESeq2 documentation), we also computed the adjusted p-values for RNA-Seq data outside of DESeq2 to obtain the full list of adjusted p-values with no missing values. These values are included in the unfiltered result files of the differential expression analysis under the column “adj.P.Val.no.ind.filt”.
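
As a sketch of the microarray workflow described above (group levels, the donor covariate, and object names are illustrative):

    # Sketch: limma differential expression for one exposure vs. control
    # comparison, followed by the filtering thresholds used in the collection.
    library(limma)

    design <- model.matrix(~ 0 + group + donor, data = metadata)
    fit  <- lmFit(gene_expr, design)
    cont <- makeContrasts(ENMvsCtrl = groupENM - groupControl, levels = design)
    fit2 <- eBayes(contrasts.fit(fit, cont))
    res  <- topTable(fit2, coef = "ENMvsCtrl", number = Inf, adjust.method = "BH")

    degs <- subset(res, abs(logFC) > 0.58 & adj.P.Val < 0.05)  # filtered list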

FAIRness optimization

To further assist accessibility, interoperability, and reusability, the data sets have been curated, imported, and made publicly available through the NanoPharos database (https://db.nanopharos.eu/), which has been developed under the Horizon 2020 (H2020) NanoSolveIT30 (https://www.nanosolveit.eu) and NanoCommons (https://nanocommons.eu/) projects. The NanoPharos database was primarily developed to include computationally derived data based on simulations of ENMs at different levels of accuracy. The database was then further extended to include ENM characterization data and biological effects. With the inclusion of omics data, the NanoPharos database now covers, in a ready-for-modelling format, the full spectrum of data needed to initiate a computational workflow for in silico exploitation of the data. The data set was checked for inconsistencies in the data structure and harmonized where needed. The ENM physico-chemical characterization data have been enriched, where applicable, with molecular (e.g. atomic/ionic radii, electronegativity, energy band gap) and structural (e.g. crystallographic space group, unit cell dimensions and angles) properties. Each ENM has been linked to the respective transcriptomics data set to facilitate querying and use. The datasets can be queried and grouped, among others, based on the ENM core material, ENM batch, exposure time and dose, biological information, experiment type, analysis platform etc. (Supplementary File 1).

The NanoPharos database has been designed under the FAIR data principles3 to offer users high-quality, ready-for-modelling data sets, while allowing further development, adaptation, and expansion. The FAIR data principles are meant to help database managers improve data accessibility and reusability for the wider community in a way resembling Library Science31. To achieve this, data digitization in the NanoPharos database is being optimized for machine readability, allowing seamless data comparison, transformation, and, where possible, combination, and providing the user with bigger and more complete data sets. Beyond the technical character of the FAIR data principles, the NanoPharos database also implements the scientific FAIR data principles (SFAIR) as recently defined by Papadiamantis et al.31, providing users with the necessary scientific context and background information to reuse the data with the highest possible confidence. Furthermore, NanoPharos is readily accessible via a Representational State Transfer (REST) application programming interface (API) and is able to interact with external databases (e.g. the NanoSolveIT Cloud) and modelling tools through programmatic API access. The available datasets can be accessed through https://db.nanopharos.eu/Queries.
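
As an illustration of programmatic access, the sketch below queries the database over HTTP from R; note that the endpoint path and query parameter are hypothetical placeholders, as the actual API routes are documented via the query interface at https://db.nanopharos.eu/Queries.

    # Sketch: hypothetical REST query against the NanoPharos API; the endpoint
    # "/api/datasets" and the parameter "core_material" are assumptions.
    library(httr)
    library(jsonlite)

    resp <- GET("https://db.nanopharos.eu/api/datasets",  # hypothetical endpoint
                query = list(core_material = "TiO2"))     # hypothetical parameter
    stopifnot(status_code(resp) == 200)
    datasets <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))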

Data Records

The data collection32 generated here is freely available on Zenodo at https://doi.org/10.5281/zenodo.4146981. The collection comprises 85 preprocessed microarray-based data sets totaling 506 unique ENM vs. control comparisons and 16 RNA-Seq based data sets representing 23 ENM vs. control comparisons. Additionally, 24 comparisons of non-nanoparticle compounds used as positive/negative controls in the original experiments are included for the microarray data sets, and 7 additional compounds are included for the RNA-Seq data. All data sets and their descriptions are available in Online-only Table 1, while the physico-chemical characteristics of the tested ENMs are available in Online-only Table 2.

In order to facilitate the selection of data suitable for different applications and modelling approaches, we classified the data into four categories based on the experimental design as follows:

I – Multiple doses, multiple time points.

II – Multiple doses, one time point.

III – One dose, multiple time points.

IV – One dose, one time point.

The proportion of each data class in the collection is visualized in Fig. 3a. Each class contains data obtained both in vivo and in vitro, with at least two organisms represented (Fig. 3b). The collection covers a range of ENM compositions, as well as variants in size, shape, surface capping/coating etc. within a given composition, tested in multiple biological systems across these organisms (Fig. 3c,d).

Fig. 3
figure 3

The data collection comprises various experimental setups and exposures of multiple ENM compositions. (a) The 101 data sets were divided into four classes based on the experimental setup. The pie chart represents the distribution of data sets by class. (b) Bars represent the proportion of data sets for each organism, divided by the four classes. In vivo and in vitro exposures are separated. (c) Horizontal bars represent the number of data sets with the specific ENM core material or material type. Grey bars represent in vivo exposures and pink bars in vitro exposures, respectively. (d) Bars represent the biological system used in the experiment. In vitro exposures are represented by pink bars and in vivo exposures by grey bars.

Files available for each data set

Each data set contains a homogenized metadata file, normalized and batch corrected expression matrices as well as complete and filtered results of the differential expression analysis (Table 1).

Table 1 Files provided for each entry in the collection.

Technical Validation

The quality of transcriptomics data is a product of careful experimental design, technical execution, and reporting of the data. The results of each downstream analysis rely substantially on the quality of the data. For this reason, we ensured that the collection contains high-quality data sets and defined the following selection criteria for inclusion:

  • Three or more biological replicates are included for statistical robustness

  • Microarray platform is a commercial gene expression microarray produced by Agilent, Affymetrix or Illumina

  • The labelling of two-color microarrays was performed with dye swapping taken into account

  • Non-normalized raw data is available

  • Supporting information reports all variables required for preprocessing

  • Untreated control samples are included

Each entry was evaluated against these criteria and either removed from the collection or selected for further preprocessing and analysis. The number of entries discarded for each of the listed reasons is reported in Table 2. Out of the 124 original entries, 84 passed the quality assessment and were further divided into a total of 101 data sets (85 microarray and 16 RNA-Seq) based on the biological systems, as specified in Data preprocessing.

Table 2 Reasons for discarding data during the manual quality assessment.

Usage Notes

Here we provide the largest homogenized collection of transcriptomics data sets in the field of nanosafety, supplemented with metadata and ENM physico-chemical characteristics. The collection offers a valuable resource for multiple analysis and modeling approaches33. For instance, the mechanism of action of each ENM can be characterized by investigating the provided lists of differentially expressed genes and may be linked to specific physico-chemical characteristics such as size, surface capping, or coating. This, in turn, can guide the redesign of safer ENMs and may support grouping into sets of nanoforms in accordance with the REACH regulation (https://echa.europa.eu/documents/10162/13655/how_to_register_nano_en.pdf/f8c046ec-f60b-4349-492b-e915fd9e3ca0), for example. Moreover, pathway enrichment analysis can be performed to annotate these genes onto biological functions34. ENMs can be further compared and grouped based on the similarities between their molecular alteration profiles.

Due to the homogenized preprocessing and manual curation of the metadata, this collection is a relevant resource for identification of toxicity biomarkers. This can be addressed by using multiple feature selection approaches35,36 or more advanced data modelling techniques37,38,39. Biomarkers could also be detected by means of gene co-expression network analysis, under the assumption that central network genes play a key role in the adaptation to the exposure40,41.

The availability of data for multiple organisms or tissues can contribute to the development of more accurate adverse outcome pathways by linking ENM-specific molecular initiating events with cascades of relevant biological processes leading to an adverse outcome42,43. In addition, our data collection can be easily integrated with other transcriptomics data in the context of a read-across analysis to identify similarities between the molecular alterations induced by ENMs and those induced by other phenotypic entities such as chemicals, drugs, and diseases44. Moreover, the data sets that we denoted as class I and II, where exposures at multiple doses are available, can be further analyzed to identify dose-dependent molecular alterations45,46,47,48.

Our manually curated transcriptomics data collection with supporting ENM descriptions will have a high impact on the nanosafety community and can aid the development of new methodologies for nanomaterial safety assessment2,8,30,33,43.