From 1994 through 2021, the International Agency for Research on Cancer (IARC) and the World Health Organization maintained a comprehensive database on variations in the tumor protein p53 gene (TP53), one of the most frequently mutated genes in human cancer. TP53 plays crucial roles in cell signaling, apoptosis, metabolism, DNA repair and transcription, earning it the moniker “guardian of the genome” (reviewed in [1]). Germline genetic variants in TP53 are the primary cause of Li-Fraumeni syndrome (LFS, OMIM 151623), a hereditary cancer predisposition disorder [2] associated with an approximately 24 times higher lifetime incidence of any cancer compared with the general population [3]. The database was initiated by Hollstein et al. [4] and was subsequently developed and curated at IARC by Pierre Hainaut and Magali Olivier from 1995 until 2021. During this period, the dataset grew from 2500 to over 50,000 annotated variations in the current database release [5], making it the largest single-locus cancer database. The database has served as an important resource for numerous TP53- and LFS-associated studies. Since 1997, the key publications describing and referencing the database have accumulated over 9000 citations in the scientific and medical literature (source: Google Scholar; selected significant papers include [5,6,7,8]). Data from the IARC TP53 database have been widely mined and analyzed to systematically explore functional and structural properties of p53 variants [9,10,11], genotype–phenotype associations [12], temporal patterns of cancer penetrance [13], carcinogen-induced mutation signatures [14,15,16], and cancer prognosis and outcomes [17]. The most recent publication using this database, as of the writing of this commentary, focused on the germline dataset and investigated differences in variant distribution and cancer patterns to better refine the variable LFS-associated phenotypic spectrum [18].

On October 25th, 2021, the IARC-sponsored TP53 database was fully transferred to the US National Cancer Institute (NCI), which now hosts it and is facilitating important upgrades to its infrastructure (https://tp53.isb-cgc.org). The original TP53 Database ran on an on-premises server at IARC, using a Microsoft platform. The NCI-sponsored TP53 Database is hosted on the Google Cloud Platform, primarily using its App Engine, BigQuery, and Cloud Storage services. The high-level architecture, along with additional specifications, is illustrated in Fig. 1. The web application was rewritten using the Python-based Flask framework and now runs on App Engine, which provides automatic load balancing to ensure scalability and high availability. Data files available for download and files used to support the web application are kept in Cloud Storage. The BigQuery dataset isb-cgc-tp53.P53_data, which serves as the read-only database for the application, stores 49 tables and 21 views. This dataset is publicly readable and can thus be used by researchers directly for cloud-based analyses using the Google BigQuery API. For the initial rollout, the existing database and files were copied directly from the IARC system. To obtain input from the TP53-associated scientific and clinical community, the database content was divided into three main subgroups according to the data types available: germline, tumor (somatic), and mouse and other experimental models. We invited 380 individuals previously registered for two conferences (the 17th and 18th International p53 Workshops) to participate in the working groups convened to oversee the transition. Seventy individuals, from 11 countries, expressed interest in being included in at least one of the working groups. The “Germline Variants” working group is made up of 48 members, the “Mouse and Other Experimental Models” working group has 25 members, and the “Somatic Variants” working group has 23 members (Supplementary Table 1).

Fig. 1: A high-level system architecture of the redesigned NCI-sponsored TP53 database (https://tp53.isb-cgc.org).

Application Environment: Google Cloud Platform, Google App Engine flexible environment. Application Framework: Flask. Programming Language: Python 3.7. Object Data Storage: Google Cloud Storage. Database Platform: Google BigQuery. For questions and requests: tp53-info@isb-cgc.org. The application source code can be found on GitHub: https://github.com/isb-cgc/TP53.
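Because the BigQuery dataset isb-cgc-tp53.P53_data is publicly readable, its contents can in principle be inspected programmatically. The following is a minimal sketch, not the database team's own tooling: it lists the tables and views through BigQuery's INFORMATION_SCHEMA.TABLES metadata view and counts them by type. It assumes that the google-cloud-bigquery Python client is installed, that Google Cloud credentials are configured, that public read access extends to the dataset's metadata, and that the caller supplies a billing project of their own. The function names and the placeholder project name are hypothetical.

```python
"""Sketch: inventory the tables and views of the public TP53 BigQuery dataset."""
from collections import Counter

# Dataset name as given in the text; all function names below are illustrative.
TP53_DATASET = "isb-cgc-tp53.P53_data"


def build_inventory_sql(dataset: str = TP53_DATASET) -> str:
    """Return a query that lists every table and view in the dataset."""
    return (
        "SELECT table_name, table_type "
        f"FROM `{dataset}.INFORMATION_SCHEMA.TABLES` "
        "ORDER BY table_name"
    )


def summarize(rows) -> Counter:
    """Count (table_name, table_type) pairs by type, e.g. 'BASE TABLE' vs 'VIEW'."""
    return Counter(table_type for _name, table_type in rows)


def fetch_inventory(billing_project: str):
    """Run the inventory query and return (table_name, table_type) pairs.

    Assumption: `billing_project` is the caller's own Google Cloud project,
    used only to bill the query; it is not part of the TP53 database.
    """
    # Imported lazily so the helpers above can be used without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    result = client.query(build_inventory_sql()).result()
    return [(row.table_name, row.table_type) for row in result]
```

With credentials in place, a call such as summarize(fetch_inventory("my-billing-project")), where "my-billing-project" is a placeholder, would be expected to report counts of base tables and views that can be compared with the figures quoted in the text. Querying metadata first avoids guessing at table names, which are not listed in this commentary.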

Based on insights from the working groups and digital media/user-experience experts, several important updates were made. The web interface was redesigned to implement user-centered design principles and modern aesthetics, optimized for search, and programmed to function responsively across device platforms. The language throughout the site, including the user manual and database descriptions, was updated to improve clarity and usability. One of the major enhancements allows users to preview the downloadable data and filter rows by column values, so that data of interest can be located and downloaded more efficiently. Whenever possible, the database will follow the “Findable, Accessible, Interoperable, Reusable” (FAIR) principles to guide data management [19], and efforts will be made to replace tools available only behind a paywall with publicly available resources. Links to external databases, additional tools, and publications will be added on an ongoing basis to ensure the database remains an effective and up-to-date resource for variant curation, in close collaboration with ClinGen’s TP53 Variant Curation Expert Panel (https://clinicalgenome.org/affiliation/50013/). We also intend to add resources to advance studies on emerging topics, such as the variable LFS phenotypic spectrum and potential role of genetic modifiers, abnormal TP53 variant allele frequency, and TP53-related clonal hematopoiesis. Variant annotation will focus on pertinent new in silico prediction tools and functional assays, mutational signatures and hotspots, variant-specific DNA-binding affinity, variant-associated neoantigens, haplotype associations, and other mechanisms of p53 impairment.
We also seek to include data on additional model organisms (such as Drosophila melanogaster [fruit fly], Danio rerio [zebrafish], and Caenorhabditis elegans, among others) to enable comparative genomics studies, promote collaborative research, and maximize the use of reagents and strains of animal models.

The mission of the NCI-sponsored TP53 Database is to serve as a publicly available resource by providing data to better understand existing and new aspects related to the TP53 gene, its pathways, and the phenotypic manifestations caused by changes in its structure. The NCI TP53 Database team will coordinate data inclusion and curation requests. Database updates will be prioritized in consultation with the working groups, and with research consortia and other collaborative efforts such as the LiFT UP study (ClinicalTrials.gov Identifier: NCT04541654) and ClinGen’s TP53 Variant Curation Expert Panel. Future endeavors will be geared towards curation of new literature, fostering research, collecting and linking resources to new types of data, and integrating collaborative efforts among clinicians, scientists, and commercial laboratories to expand the characterization of both TP53 and LFS.

Disclaimer

Where authors are identified as personnel of the International Agency for Research on Cancer/World Health Organization, the authors alone are responsible for the views expressed in this article and they do not necessarily represent the decisions, policy, or views of the International Agency for Research on Cancer/World Health Organization. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.