Background & Summary

The recent advances of nanotechnology have led to concerns for the potential release of engineered nanomaterials (ENM) into the environment causing exposure to, and perhaps adverse effects on, humans or sensitive ecological species1. Accordingly, the United States Environmental Protection Agency (US EPA) Office of Research and Development (ORD) has developed a research program aimed at understanding the potential environmental implications of ENM. ORD research encompasses potential releases of ENM from manufacturing and commercial uses; environmental transformations, fate, and transport; exposures; and potential adverse health effects. A framework was developed to organize and integrate this diverse set of information2. To support this larger effort, a relational database was developed containing ORD nanomaterial research data to better enable the use and synthesis of study results, and to facilitate higher-order analyses such as quantitative structure-activity relationships (QSAR). One goal is to probe the relationships between physical and chemical properties of ENM and their environmental actions to see if predictive relationships can be determined. This publication announces the release of “NaKnowBase” (NKB), a knowledge base containing the results of multiple ORD publications on the actions of ENM in environmental or biological media.

The design of NKB was intended to complement efforts in nanoinformatics – the strategic curation and collation of nanomaterial data for analytic purposes. A roadmap for nanoinformatics in the European Union (EU) and US was recently published providing a comprehensive overview of the inter-related scientific disciplines of nanomaterials science, physicochemical characterization, computational modelling, informatics, and ecological and human toxicology3. This analysis identified three challenges facing nanoinformatics: (1) limited datasets, (2) limited data access, and (3) regulatory requirements for validating and accepting computational models. NKB partially addresses the first two of these issues by providing a publicly available source of curated data relevant to ENM environmental health and safety (EHS). Collating datasets from multiple sources facilitates more comprehensive meta-analyses, QSAR, and risk assessment approaches such as read-across4. To date, such “big data” endeavours in ENM EHS tend to be designed around large datasets that must be generated in advance, or remain limited by a paucity of relevant, curated data from disparate sources4,5,6. Efforts like NKB can help overcome these research hurdles by being strategically designed to leverage extant data while also being amenable to newly generated data.

There are other nanomaterial-related databases indexed in the appendix section of the EU-US roadmap3. These databases are independently operated and vary according to the intended use and operability, the types of data captured, and the data format, access, and control. Although it may appear advantageous to consolidate these, there are several factors favouring the maintenance of independent databases: ability to control access to, quality of, and integrity of the data, managing and protecting proprietary and confidential business information, the pragmatics of scale, and the availability and continuity of funding. Therefore, the original scope of the NKB was limited to data collected by the EPA ORD. To our knowledge, the data provided in NKB are not collected elsewhere. The data in NKB represent the only collated source of published data from the US Environmental Protection Agency in a relational database regarding the potential environmental effects of engineered nanomaterials.

NKB was built as an SQL relational database. The overall structure is shown in Fig. 1. The database has separate tables on the source publication, the tested materials and their physicochemical properties, the media in which the materials were tested, the assays performed, the parameters evaluated, and the results. There are sub-tables to capture data on chemical contaminants, attached functional groups, and test media additives. Data entry is accomplished by curators via a set of prescribed Excel spreadsheets that are then imported to the database using a script. During curation, efforts are maintained to use terminology consistent with an expanded nanomaterial ontology being developed by several nanoinformatics groups including the EU NanoSafety Cluster and the Center for the Environmental Implications of Nanotechnology (CEINT), in coordination with the foundational work published by the eNanoMapper database7,8. In addition, a simple, user-friendly interface was developed which allows users to search the database and obtain outputs of data in spreadsheet format.

Fig. 1

Overview of the NKB SQL structure. The lines indicate the nature of each relationship. Each relationship is one-to-many, where the end with two lines is “one” and the end with a triangle is “many”; for example, one publication can have many media.

Methods

Publications selected for curation were limited to research conducted by ORD and related to environmental or biological actions of ENM. This included in vivo, in vitro, and in silico experiments as well as life-cycle analyses and physicochemical characterisations. The data in the database reflect over 120 relevant publications from approximately 2012 through November 2019. Over 70 unique nanomaterials, as defined by the combined composition of the core, shell and coatings, were studied. Over 160 named assays and 22,000 individual assays were run. We expect to maintain the database and continue to make additions over time as new research becomes available. Though NKB will be made available through the Office of Science Management as a public EPA database tool, pertinent NKB data will also be integrated with the CompTox Chemicals Dashboard (https://comptox-prod.epa.gov/dashboard/chemical_lists/), which maps the DSSTox substance records to the most current list of NKB nanomaterials. The addition of new data will be announced via the CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard/) on the ‘News’ (https://comptox.epa.gov/dashboard/news_info) and ‘Downloads’ (https://comptox.epa.gov/dashboard/downloads) pages of the Dashboard, as appropriate.

The EPA maintains various repositories for planned, ongoing, and completed research and projects. These repositories were searched for relevant publications for curation. The description and content of these repositories are detailed below.

STICS

The Scientific & Technical Information Clearance System (STICS) is used by ORD to electronically approve and monitor scientific and technical products produced by ORD. STICS allows approved users with an EPA account and password (such as EPA employees and contractors) to search entries and download the results.

Science inventory

The Science Inventory (SI) stores publicly available records about research conducted by the EPA, allowing EPA account-holding users to search through entries. Much of the database-relevant information in SI overlaps with STICS.

Science hub

Science Hub is a data storage site for datasets associated with recently published EPA journal articles (beginning in 2016). EPA employees and contractors may access these datasets directly through Science Hub while the general public is granted access through a separate portal (The Environmental Dataset Gateway; https://edg.epa.gov/metadata/catalog/main/home.page).

Direct input from investigators

Where available, ORD researchers provided their publication(s) and original data for inclusion in the database. These papers and submitted data were evaluated on a case-by-case basis and formatted by trained curators for inclusion in the database. Approximately 9% of the entries were submitted directly by the investigators. Reasons that original data may not have been available included the primary investigators having left the Agency, data having been archived, lack of access to raw data from scientific instruments, and incompatible formats. An example of an incompatible format was a list of differentially expressed genes encoded as “increased” or “decreased” where the data fields in the NKB required numeric value entries.

Systematic article selection

Papers of interest were identified by running keyword searches through STICS, Science Inventory and Science Hub. A list of entries containing “nano” in the keywords or title was obtained. Additional queries were run separately using search terms including the composition of common ENM (e.g. silver, copper, titanium dioxide, cerium dioxide). Results were checked for duplicates; posters, abstracts, and meeting presentations were not considered for curation. Over 600 titles were identified for further screening. These results were then reviewed to identify only original, peer-reviewed research. Finally, titles and abstracts were carefully read for relevance to nanotoxicology, environmental effects of nanomaterials, physical and chemical properties, and ENM life cycle. Other nanomaterial papers, including literature reviews and those relating to topics such as incidental or naturally occurring nanomaterials, method development or “green chemistry” synthesis of nanomaterials, were excluded.
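The screening steps described above (keyword matching, exclusion of posters, abstracts and presentations, and duplicate removal) can be sketched as a short filter. This is only an illustration under stated assumptions: the `Entry` record layout, the `product_type` labels, and the example titles are hypothetical, since the export format of STICS, Science Inventory and Science Hub is not described here, and the later manual review of titles and abstracts is not modelled.

```python
from dataclasses import dataclass

# Hypothetical record layout; the real repository exports are not described in the text.
@dataclass(frozen=True)
class Entry:
    title: str
    keywords: tuple
    product_type: str  # e.g. "journal article", "poster", "abstract", "presentation"

# Search terms named in the text: "nano" plus compositions of common ENM.
SEARCH_TERMS = ("nano", "silver", "copper", "titanium dioxide", "cerium dioxide")
EXCLUDED_TYPES = {"poster", "abstract", "presentation"}

def matches(entry):
    """True if any search term occurs in the title or keywords (case-insensitive)."""
    text = " ".join((entry.title,) + tuple(entry.keywords)).lower()
    return any(term in text for term in SEARCH_TERMS)

def screen(entries):
    """Drop excluded product types and non-matching entries, then remove duplicate titles."""
    seen = set()
    kept = []
    for entry in entries:
        if entry.product_type.lower() in EXCLUDED_TYPES:
            continue
        if not matches(entry):
            continue
        key = entry.title.strip().lower()  # simple normalised title as the duplicate key
        if key in seen:
            continue
        seen.add(key)
        kept.append(entry)
    return kept

# Made-up entries, for illustration only.
entries = [
    Entry("Silver nanoparticle toxicity in zebrafish", ("nano",), "journal article"),
    Entry("Silver nanoparticle toxicity in zebrafish", ("nano",), "poster"),
    Entry("Silver Nanoparticle Toxicity in Zebrafish ", (), "journal article"),
    Entry("Ozone trends in urban air", ("air quality",), "journal article"),
    Entry("Cerium dioxide fate in soil", (), "journal article"),
]
selected = screen(entries)
```

Excluded product types are filtered before duplicate detection so that a poster sharing a title with a journal article cannot mask the article.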

Table organization and curation procedures

The curation of data into the database required a set of trained data curators and a substantial commitment of time and effort. Artificial intelligence or other automated procedures were not used. The original training of data curators was generously conducted by the database experts of the Center for the Environmental Implications of Nanotechnology (CEINT) in association with the Nano Informations Common (CEINT NIC), a database maintained at Duke University in Durham NC. Experienced NKB curators subsequently oversaw the training of new data curators as needed. Training consisted of explaining the overall purpose and structure of the database and the data input templates, and then overseeing the curation of selected model datasets which had been curated previously by others. When the novice curators were sufficiently proficient at capturing data from the training sets, they began to encode new manuscripts under oversight. Curators typically became proficient within a few weeks. Once curators were proficient, curation of data from each new manuscript typically required from one to several workdays depending on the complexity of the material. Questions or uncertainty about experimental procedures or parameters were referred to the project management and occasionally required contact with the authors of the original manuscripts for clarification. Thus, the robust curation of data for the database required considerable time and effort of skilled personnel.

Data extraction and curation occurred in accordance with an approved EPA quality assurance project plan (QAPP E-TAB-0030177, Project ID “Emerging Materials Project 18.02”). In summary, all data were collected from published journal articles. Metadata were attached to all curated data. Data were extracted from manuscript figures using a web application called WebPlotDigitizer (https://automeris.io/WebPlotDigitizer/). Modifications to curated data (for correction of curation errors, etc.) were logged and described in a separate text file.

Publications were added to NKB by entering metadata, experimental procedures, and results into a data collection template comprising 11 preformatted Excel spreadsheets. Once the template was completed, an in-house Java program transformed its contents into database-ready tables (csv files) for automated upload into the database.
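The template-to-csv step can be illustrated with a minimal sketch. It is not the in-house Java program, which is not reproduced here; it is written in Python for brevity, represents one template sheet as a list of dictionaries, and uses hypothetical column names (`PublicationID`, `Title`, `Year`). The check that the first column is populated is likewise an assumed validation rule, not a documented one.

```python
import csv
import io

# Hypothetical columns for one template sheet; the actual fields are listed in the data table descriptions.
PUBLICATION_COLUMNS = ["PublicationID", "Title", "Year"]

def sheet_to_csv(rows, columns):
    """Write template rows (dicts) as CSV text, rejecting rows whose first column is blank."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for number, row in enumerate(rows, start=1):
        if not str(row.get(columns[0], "")).strip():
            raise ValueError(f"row {number}: missing {columns[0]}")
        # Blank template cells are written as empty strings.
        writer.writerow({column: row.get(column, "") for column in columns})
    return buffer.getvalue()

# Made-up row, for illustration only.
rows = [{"PublicationID": 1, "Title": "Example study", "Year": 2015}]
csv_text = sheet_to_csv(rows, PUBLICATION_COLUMNS)
```

Rejecting rows without an identifier before export keeps incomplete template entries from reaching the database import.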

SQL structure

The overall SQL structure of NKB is presented in Fig. 1, and a brief description of each data table is provided in Table 1. The fields (columns) in each NKB data table are detailed in Tables 2–11. Field names are in PascalCase to distinguish them from the lowercase data table names. Primary keys, fields containing a unique identifier for each entry in a data table, are listed first. Most tables use a single field as the primary key; the Material, Assay, and Medium tables use two keys. Primary keys and foreign keys are used to connect related data that are stored in different tables.
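To make the key structure concrete, the following minimal sketch builds a two-table fragment in an in-memory SQLite database using Python's standard library. It is not the NKB schema: the field names (`PublicationID`, `MediumID`, `MediumDescription`) are hypothetical, and the choice of which two fields form the medium table's primary key is an assumption made for illustration. It shows a one-to-many relationship (one publication, many media), a two-field primary key, and a foreign key joining the tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Lowercase table names and PascalCase field names, following the convention in the text.
conn.executescript("""
CREATE TABLE publication (
    PublicationID INTEGER PRIMARY KEY,
    Title         TEXT NOT NULL
);
CREATE TABLE medium (
    MediumID          INTEGER NOT NULL,
    PublicationID     INTEGER NOT NULL REFERENCES publication(PublicationID),
    MediumDescription TEXT,
    PRIMARY KEY (MediumID, PublicationID)  -- assumed two-field key
);
""")

# Made-up rows: one publication with two test media.
conn.execute("INSERT INTO publication VALUES (1, 'Example study')")
conn.executemany(
    "INSERT INTO medium VALUES (?, ?, ?)",
    [(1, 1, "Moderately hard water"), (2, 1, "Cell culture medium")],
)

# Follow the foreign key from each medium back to its publication.
rows = conn.execute("""
    SELECT p.Title, m.MediumDescription
    FROM publication AS p
    JOIN medium AS m ON m.PublicationID = p.PublicationID
    ORDER BY m.MediumID
""").fetchall()
```

With foreign keys enabled, inserting a medium row that references a nonexistent publication raises `sqlite3.IntegrityError`, which is how the keys protect the links between tables.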

Table 1 An overview of the data tables in NKB, with a brief description of the general type or category of data collated in each table.
Table 2 Publication table data fields.
Table 3 Medium table data fields.
Table 4 Additive table data fields.
Table 5 Material table data fields.
Table 6 Contam table data fields.
Table 7 Materialfg table data fields.
Table 8 Assay table data fields.
Table 9 Parameters table data fields.
Table 10 Results table data fields.
Table 11 Molecularresults table data fields.

NKB User interface

The NKB user interface application is currently under development. Deployment is expected in 2023 under the EPA web domain naknowbase.epa.gov. Here, curated data can be accessed through a user-friendly interface and search results can be downloaded for subsequent analysis by the user. NKB data can be filtered by numerous parameters such as ENM composition, physical and chemical characteristics, assay name and type, assay parameters, and result name. NKB data points are also linked to the original peer-reviewed publications via a single hyperlink.

The NKB user interface allows users to search for data using a pre-defined list of relevant search terms categorized by data tables and table fields. The searchable data fields were derived from those listed in Tables 2–11.

Data Records

Figure 1 and Table 1 describe all the individual data sources integrated in NKB. The NKB data frame has been uploaded into a single collection entitled “NaKnowBase-SQL backend-080121” 9. The files contained in this collection include the most recent SQL data structure for NKB, including all tables, as well as corresponding data categories and keys for the backend of the database.

EPA nanomaterials present in NKB are also provided through the CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard/chemical-lists/NAKNOWBASE), which maps EPA chemical substance records to the most current list of NKB nanomaterial substance records (last updated 12/14/2020).

Technical Validation

In general, there are many varied methodologies for cataloguing nanomaterials metadata and physicochemical properties; NKB attempts to capture as much of this information as possible.

Publications considered for curation were limited to ORD research, which is subject to rigorous internal and external quality control and peer review. All research conducted at ORD must have a corresponding Quality Assurance Project Plan (QAPP). QAPPs describe the necessary quality assurance and quality control measures needed to produce results that meet stated performance criteria. ORD QAPPs are peer-reviewed, approved by management, overseen by a quality assurance manager, and subject to periodic QA and performance quality checks. Manuscripts submitted for publication are linked to approved QA plans and are subject to QA review and approval. Furthermore, manuscripts are subject to thorough internal scientific peer review before undergoing additional external, independent peer review by the publishing journal. These systems are intended to ensure the quality and accuracy of ORD data, and help assure the reliability of data being curated in NKB. Because of this, the results of the papers themselves were not checked for errors during data curation. Instead, quality control efforts focused on ensuring the accuracy of the curated data compared to the original raw data, as well as consistent curation procedure between curators.

To assess the quality of NKB curation, a random sample (approx. 5%) of curated papers was manually checked for quality control. It was found that data derived from the digitization of published graphs differed from the original data by an average of 0.20% ± 0.29% (N = 316) and that curation of the same data by different curators differed by an average of 0.33% ± 3.3% (N = 736). Values are reported as mean ± SD, normalized to the axis scale.
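One plausible way to compute such a statistic is sketched below. The exact formula used for the quality check is not given in the text, so this is an assumption: each difference is taken as the absolute difference between a curated value and its reference value, expressed as a percentage of the axis range, and the spread is the sample standard deviation. The function name and the example values are hypothetical.

```python
from statistics import mean, stdev

def normalized_differences(reference, curated, axis_min, axis_max):
    """Absolute differences between paired values, as a percentage of the axis range."""
    span = axis_max - axis_min
    if span <= 0:
        raise ValueError("axis_max must be greater than axis_min")
    if len(reference) != len(curated):
        raise ValueError("reference and curated must have the same length")
    return [abs(c - r) / span * 100 for r, c in zip(reference, curated)]

# Made-up values read from a graph whose axis runs from 0 to 50.
reference = [10.0, 20.0, 30.0, 40.0]
digitized = [10.1, 19.9, 30.2, 40.0]

diffs = normalized_differences(reference, digitized, axis_min=0.0, axis_max=50.0)
summary = (mean(diffs), stdev(diffs))  # (mean, sample SD) in percent of axis scale
```

Normalizing by the axis range rather than by each value avoids inflated percentages for data points near zero, which is one reason an axis-scale normalization may be preferred for digitized graphs.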

Usage Notes

Potential uses of the data include input to quantitative structure-activity relationships (QSAR), meta-analyses, or other modeling or investigative approaches. Users should be aware that data obtained from the NKB includes a large number of potential parameters related to physicochemical properties of ENM. Because relatively few of these properties were entirely consistent across sources, the NKB contains many sparsely populated fields. Users should consider this when planning analyses of data from the NKB. Updates to the NKB described herein help inform new testable hypotheses about the etiology and mechanisms underlying ENM effects in the environment and adverse health outcomes of toxicological concern in relation to human exposure to nanomaterials.
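Because many fields are sparsely populated, a useful first step before modeling is to measure how complete each field is in the downloaded data. The sketch below assumes the search results have been loaded as a list of dictionaries; the field names (`CoreComposition`, `ZetaPotential`, `SurfaceArea`), the values, and the 50% threshold are hypothetical and chosen only for illustration.

```python
def field_completeness(records):
    """Return, per field, the fraction of records holding a non-empty value."""
    if not records:
        return {}
    fields = sorted({name for record in records for name in record})
    total = len(records)
    return {
        field: sum(1 for record in records if record.get(field) not in (None, "")) / total
        for field in fields
    }

# Made-up records, for illustration only.
records = [
    {"CoreComposition": "Ag", "ZetaPotential": -30.5, "SurfaceArea": None},
    {"CoreComposition": "TiO2", "ZetaPotential": None, "SurfaceArea": None},
    {"CoreComposition": "CeO2", "ZetaPotential": "", "SurfaceArea": 45.0},
    {"CoreComposition": "Cu", "ZetaPotential": None, "SurfaceArea": None},
]

completeness = field_completeness(records)
# Keep only fields populated in at least half of the records (arbitrary threshold).
usable_fields = [field for field, fraction in completeness.items() if fraction >= 0.5]
```

Screening fields this way lets an analysis state up front which descriptors have enough coverage to support a QSAR or meta-analysis and which would need imputation or exclusion.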