The 2021 update of the EPA’s adverse outcome pathway database

The EPA developed the Adverse Outcome Pathway Database (AOP-DB) to better characterize adverse outcomes of toxicological interest that are relevant to human health and the environment. Here we present the most recent version of the EPA Adverse Outcome Pathway Database (AOP-DB), version 2. AOP-DB v.2 introduces several substantial updates, which include automated data pulls from the AOP-Wiki 2.0, the integration of tissue-gene network data, and human AOP-gene data by population, semantic mapping and SPARQL endpoint creation, in addition to the presentation of the first publicly available AOP-DB web user interface. Potential users of the data may investigate specific molecular targets of an AOP, the relation of those gene/protein targets to other AOPs, cross-species, pathway, or disease-AOP relationships, or frequencies of AOP-related functional variants in particular populations, for example. Version updates described herein help inform new testable hypotheses about the etiology and mechanisms underlying adverse outcomes of environmental and toxicological concern. Measurement(s) adverse outcome pathway • gene interactions • Orthologous Gene • chemical gene interactions • molecular pathway • disease gene associations • SNP Technology Type(s) digital curation Measurement(s) adverse outcome pathway • gene interactions • Orthologous Gene • chemical gene interactions • molecular pathway • disease gene associations • SNP Technology Type(s) digital curation Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.14737557


Background & Summary
There is a need for approaches to understand the biological mechanism of adverse outcomes and human variability in response to environmental chemical exposure. A recent legislation, the Frank R. Lautenberg Chemical Safety for the twenty-first Century Act of 2016 1 , requires the US Environmental Protection Agency to evaluate new and existing toxic chemicals with explicit consideration of susceptible populations of all types (life stage, exposure, genetic, etc.). In addition, on September 10, 2019, EPA Administrator Andrew Wheeler signed a directive that prioritizes efforts to reduce animal testing. In response to this directive, the EPA has developed a 2019 Strategic Plan to Promote the Development and Implementation of Alternative Test Methods Strategies (or New Approach Methodologies (NAMs)) per TSCA Section 4(h)(2)(C). The EPA Adverse Outcome Pathway Database (AOP-DB) is a decision support tool developed by the EPA's Center for Public Health and Environmental Assessment, which contributes to NAMs (e.g. computational toxicology tools) used for TSCA. The EPA Adverse Outcome Pathway Database (AOP-DB) is a database resource that combines different data types (AOP, gene, chemical, disease, pathway, orthology, and ontology) to characterize the impacts of chemicals to human health and the environment 2 , and for the characterization of human genetic susceptibility for the purpose of human health risk assessment 3 . The AOP-DB was originally developed with the primary aim of integrating AOP molecular target information with other publicly available datasets and related toxicological data, to facilitate computational analyses of AOP information. Near term goals for use of the AOP-DB are to address the biological and mechanistic aspects of alternative test methods in terms of the adverse outcome pathway construct to facilitate Integrated Approaches to Testing and Assessment (IATA) for regulatory purposes [4][5][6] .
Here we present an updated version of the database, AOP-DB v.2, which includes an increased number of adverse outcomes and corresponding key events derived from updated feeds from the AOP-Wiki 2.0 (https:// AOP-Wiki.org/), as well as updated chemical, disease, tissue, individual and population level data and ontology information. In this second iteration of the database, we update all code and previously described data from www.nature.com/scientificdata www.nature.com/scientificdata/ AOP-DB version 1, as well as provide integration with three new data areas: tissue, individual and population level data. We discuss the collaborative semantic mapping efforts for AOP-DB data, and highlight the AOP-DB web user interface, which will be deployed to the public in 2021.
Pittman et al. 2 presented the first version of the AOP-DB which focused on chemical and species-specific analyses. Mortensen et al. 3 developed a computational approach that implements the AOP-DB to integrate mechanistic data associated with an AOP with data capturing human genetic variability and function, for the purpose of characterizing human molecular variation that may impact individual and population level responses to environmental chemicals. The human individual and population level data included in AOP-DB v.2 provide the relevant data sources and organizational structure to envision the approach presented by Mortensen et al. 3 . We believe that this work represents a first step in organizing a coherent research program in molecular environmental adverse outcomes. Current and future work will focus on the application and interrogation of these data for case study examples, the development of computational, quantitative AOP (qAOP) models for estimation of chemical-MIE protypes, and machine learning methods to understand human susceptibility and variation in functional response that will inform chemical safety assessment.

Methods
Data included in the AOP-DB v.2 represent an aggregation of publicly available sources associated with adverse outcome pathways. All data included in AOP-DB v.2 were acquired or generated as detailed below. Online-Only Table 1 lists the current data sources integrated in the AOP-DB v.2. Table 1 lists the actual record count for integrated data within the AOP-DB v.2. For example, 261 supported AOPs in the AOP-DB v.2, as indicated in Table 1, indicates there are 261 unique, expert-derived AOPs that map to one or more unique gene or protein accession numbers. Similarly, for those 261 AOPs, 1029 chemical stressors map chemical ID to AOP ID, underlining a unique AOP-chemical association. The AOP-DB v.2 contains a full list of AOP stressors obtained and updated from the AOP-Wiki 2.0, which is available and updated as part of the OECD supported AOP Knowledge Base (AOP-KB) through e.AOP.Portal (https://aopkb.oecd.org/).
Methods and included data sources presented here reflect the AOP-DB v.2 as of May 5th, 2021. In addition to regular updates to the AOP-DB, we anticipate that additional data will be added to the database over time. The addition of new data will be announced via the AOP-DB web user interface https://aopdb.epa.gov/, on the AOP-DB 'Home' page https://www.epa.gov/healthresearch/adverse-outcome-pathway-aop-database, and the CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard/), hereafter referred to as the "Dashboard", on the 'News' page, (https://comptox.epa.gov/dashboard/news_info) and 'Downloads' (https://comptox.epa.gov/ dashboard/downloads) pages, as appropriate. The third-party data sets included in the AOP-DB v.2, and additional details of methods used for any modifications performed in the integration process, are described briefly below and with full details in Fig. 1  gene and protein accession numbers manually using the concept of the Event-Component using selected gene and protein ontologies, according to the methods described in Ives et al. 7  AOP-Wiki XML import, and mapping of molecular identifiers. AOP-DB v.2 has updated the primary source of AOP information, and now implements the AOP-Wiki 2.0 XML dump, which is updated quarterly. This dump includes all stressor, event, status, and description data associated with AOP's from the AOP-Wiki 2.0 in XML format. We used the XML package (version 3.98-1.2) in conjunction with xPath (version 3.1) notation to parse the AOP-Wiki 2.0 XML into tabular data for import into the AOP-DB v.2, in order to create a data frame with essential columns. Once organized into tabular format, this processed XML data is separated into AOP_info, AOP_gene, AOP_stressor, stressor_info, and event_info tables. Individual subroutines automate and validate the accuracy of the parsed data at each stage. The AOP_stressor and stressor_info tables, due to somewhat loosely structured submission structure, require manual processing and mapping of chemical ids to chemicals without mapping information, discussed below.
To create the link between AOP's from the AOP-Wiki 2.0 and gene identifiers in the AOP-DB v.2, which are not supplied in the AOP-Wiki 2.0 XML directly, we map key event information within each AOP containing a biological object, where a biological object is comprised of three ontological components: Process; Object; and Action. When the object term contains a protein ontology (PR) value, this value is considered a "molecular identifier" and is used to map that key event to a corresponding gene identifier (e.g. Entrez, UniProt, etc.).

AOP-Wiki Stressor tables.
After the AOP-Wiki XML is parsed, the AOP-DB stressor tables undergo an additional processing step. The AOP-Wiki XML contains stressor_name, user_term, stressor_id and Chemical Abstract Service Identifier (CASRN). Stressors entered into the AOP-Wiki can include a link to chemical stressors, via the DSSTox Substance Identifier (DTXSID), which maps the stressor to substances registered in the DSSTox database 29 . The chemical DTXSID, a unique substance identifier, provides a link to the Dashboard using www.nature.com/scientificdata www.nature.com/scientificdata/ the process described in Williams et al. 30 . When no DTXSID is provided for stressors imported from the AOP-DB, manual curation to the Dashboard has been performed on individual substances, on a substance-by-substance basis and using available identifiers (e.g. CAS Registry Numbers and chemical names) according to the process described in Grulke et al. 31 . A number of issues were addressed in this mapping process for the AOP-DB stressor mapping; for example, the same chemical with different chemical identifiers, mapping to the same substance (e.g. Dexamethasone vs Stressor:492 Dexamethasone); spelling errors (e.g. Tacrorimus versus Tacrolimus); general naming that prevents specific mapping to a DTXSID, for instance, "various hydrocarbons", that is in no way definitive for a particular substance and cannot be mapped to a DTXSID. It should however be noted that ambiguous substances can map to individual chemical structures, to mixtures or to UVCB chemicals (Unknown or Variable Composition, Complex Reaction Products and Biological Materials), for example, as discussed in Williams et al. 30 . The AOP-DB v.2 stressor linkages are provided through the Dashboard (https://comptox.epa. gov/dashboard/chemical_lists/AOPSTRESSORS), and map EPA chemical substance records to the most current list of AOP stressors (last updated 06/30/21).

AOP-DB Tissue-gene networks.
The AOP construct organizes biological and related data from molecular gene targets, both cellular and organ specific, to individual and population levels, so that we can better understand the progression of initiating events of toxicological concern to their benign or adverse outcomes.
The primary aim of the AOP-DB is to organize publicly available data so that users can understand and characterize the biological context of any given AOP. The AOP-DB v.2 addresses the need to understand tissue specific events in an AOP-tissue specific context. Here we implement HumanBase 21 data, which includes tissue-specific gene networks for 144 human tissues. By combining genome-wide association studies (GWAS) with tissue-specific gene networks, it is possible to develop tissue-specific gene-interaction networks with probabilities representing the strength of the interaction 21 . Using the HumanBase API (https://hb.flatironinstitute.org/api/), we import tissue-gene networks for all 144 tissues included in HumanBase and link to each AOP gene. So, for each AOP gene, 144 individual, tissue-gene networks are added. The number of networks will increase as the number of AOP genes in the AOP-DB increases. The final tabular form of the tissue-gene data contains edge data (Entrez1 and Entrez2) representing gene interactions, with individual edge probability scores. Each edge record has a tissue and query gene field, indicating network relationship.
The AOP-DB v.2 implements the HumanBase tissue-gene networks in the creation of a tissue-gene visualizer in the AOP-DB web user interface (Fig. 2). This allows users to navigate to AOP relevant genes and tissue-specific results in a real-time network viewer. The tissue-gene network results can be explored for hypothesis generation and, for example, to understand how tissue-specific AOP activity relates to known disease states in the AOP-DB.

AOP Haplotypes from 1000 genomes single nucleotide polymorphisms (SNPs). To our knowl-
edge the AOP-DB v.2 is the first tool that incorporates the novel integration of human population level data with AOP information. AOP-DB v.2 incorporates individual and population level data using data from Ensembl 22,23 , ENCODE GWAS 24 , GTex [25][26][27] , and the 1000 Genomes Project 28,32 , with the goal of identifying functionally important AOP-gene variants. The organization of such functional variants for described AOPs makes it possible to identify potential differences in human susceptibility to adverse outcomes on both an individual and group or population level. AOP-DB v.2 incorporates population level SNP frequencies for AOP-genes for five of the geographic super-populations reported by the 1000 Genomes Consortium (European, S. Asian, E. Asian, African, American 32 ) for the Phase 3 data.
In addition, we identified SNPs as functionally important from the ENCODE GWAS catalog 24 and the GTEx portal 25,26 for each AOP-gene. The SNPs were filtered secondarily using the Ensembl Regulatory Build 23 to be www.nature.com/scientificdata www.nature.com/scientificdata/ located in gene regulatory regions. The resulting SNPs were used to request genotypes using the Ensembl REST API's variation endpoint for each SNP for each AOP-gene. With these data we can construct individual AOP-gene haplotypes. These data are stored in the Haplotype table of AOP-DB v.2. The resulting functional AOP-gene list for each AOP was then screened using 1000 genomes samples to identify population frequencies of functionally relevant AOP gene SNPs and AOP haplotypes. Figure 3 illustrates the minor allele frequencies (MAF) observed for 104 functionally relevant Human AOP-gene SNPs. These multi-allelic, functional AOP-gene haplotypes can be used for further inquiries to determine levels of variation and significant differences in outcome across population groups.
AOP-DB web-based user interface. The EPA AOP-DB user interface is delivered via a graphical web user interface (coded in Java and React), an application programming interface (API; coded in node.js) and a relational database (MySQL and MariaDB). The AOP-DB web user interface, publicly available at https://aopdb. epa.gov/, has been developed with the goal of providing both data accessibility and data visualization according to the FAIR Principles 33 2019b . Basic search functionality of the AOP-DB web user interface allows users who prefer to interact with the web user interface to easily navigate, organize and download AOP and related data to their desktop.
Basic functionality includes a search interface (Fig. 4), batch query tool, and downloads page. The AOP-DB system has been modified for EPA standard visualization as well as integration with current Office of Science and Information Management (OSIM) security protocols for EPA internet hosting. At present, data updates to the AOP-DB occur approximately every 6 months. Initial AOP-DB data visualization has been optimized and  www.nature.com/scientificdata www.nature.com/scientificdata/ includes the following queries: AOP-gene, AOP-disease, AOP-pathway, and AOP-chemical. Search box functionality has also been extended to include autofill processing for inputs. Download capabilities have been extended to include user column select and scroll features, including data download in .csv and .xml, or other useful formats. The 'Documents' page links to the AOP-DB v.2 User Manual, which describes all current search functionality enabled for the user interface in detail, and includes sections on maintenance of new records and updates, how content is selected, query tips, and frequently asked questions.
Research description framework mapping of the AOP-DB data tables. Semantic mapping using a Research Description Framework (RDF) 34 was performed for four AOP-DB tables 35 as part of an ongoing, international collaboration between the US EPA, OpenRiskNet (https://openrisknet.org/) and researchers at the Department of Bioinformatics -BiGCaT, Maastricht University (http://www.bigcat.unimaas.nl/). Figure 5 illustrates the AOP-DB RDF tables for Chemical-Gene Interaction, Protein-Protein Interaction, ToxCast Assay, and Pathway, and each table's secondary (and tertiary) keys. A link to the AOP-DB SPARQL endpoint and the workflows generated using these mappings is provided by the OpenRiskNet e-infrastructure (https://openrisknet. org/e-infrastructure/services/147/).

Data Records
Online-only Table 1 describes all the individual data source integrated in the AOP-DB. Data sources are organized according to their biological category, with a short description of the data, along with any manipulations performed, and the URL. The AOP-DB v.2 data frame with all corresponding table data and custom code has been uploaded to a single collection entitled "The Adverse Outcome Pathway Database (AOP-DB) version 2.0" 36 . The files contained in this collection include the most recent SQL data structure for the AOP-DB v.2, all custom code, tables, and corresponding data categories and keys to create the backend of the database. www.nature.com/scientificdata www.nature.com/scientificdata/

Technical Validation
The main quality objective for AOP-DB is ensuring that data included in the data set accurately reflects the data obtained from the raw data source. Data sets considered for curation in the AOP-DB are limited to publicly available, peer-reviewed datasets that have been subjected to rigorous quality control and peer review in their own right. Quality assurance measures to validate the data have focused on ensuring that data records were transcribed accurately from the original data source and represented appropriately in the repository. As a research product of the United States Environmental Protection Agency Office of Research and Development (ORD), the AOP-DB has a Quality Assurance Project Plan (QAPP) which describes each individual data source integrated in the database, how it is implemented, stored and modified. QAPPs describe the necessary quality assurance and quality control measures needed to produce results that meet stated performance criteria. ORD QAPPs are peer-reviewed, approved by management, overseen by a quality assurance manager, and subject to periodic QA and performance quality checks. Figure 1 illustrates, for each data source, each step in obtaining the data from the third-party source, any subroutine that has been created to modify the data, and including any stored local files. In all cases, custom scripts were written to identify and correct any errors in the process of migrating data from source formats into the MySQL database (e.g. duplicated entries).
With public deployment of the AOP-DB v.2, data updates occur approximately every 6 months. Additional, periodic updates will be made to the production version of the database when substantial changes have been made requiring a version update. Changes to the development version of the database will be ongoing and as new, and relevant data become available.

Usage Notes
Though data included in AOP-DB v.2 can be used in many ways in future analyses, users should be aware of limitations of the data set, and appropriate usage of the data. The AOP-DB v.2 incorporates only those AOPs that have molecular identifiers. Many AOPs found in the AOP-Wiki 2.0 do not associate with gene or protein molecular identifiers. This may be due to the nature of the biology described or the state of the science for a particular research area. Users should also be aware that AOP data obtained from the AOP-DB v.2 may be under various levels of review (e.g. Status or SAAOP Status "Do not cite", or "Included in OECD Workplan", for example) or the AOP may simply be incomplete. This is due to how AOPs are entered into the AOP-Wiki 2.0 at this time. Additionally, because an AOP may be developed with reference to a particular species, the AOP-DB v.2 reports the gene/protein accession for that species and maps to orthologous gene pairs in humans or other species orthologs when possible, without indication of relevance.

code availability
All custom code created to process of manipulate external datasets in the construction or subsequent update of the AOP-DB v.2 relational database tables are made publicly available by the U.S. Environmental Protection Agency, Office of Research and Development (ORD) 36 .