Biomedical data are amassed at an ever-increasing rate, and machine learning tools that use prior knowledge in combination with biomedical big data are gaining much traction [1,2]. Knowledge graphs (KGs) are rapidly becoming the dominant form of knowledge representation. KGs are data structures that represent knowledge as a graph to facilitate navigation and analysis of complex information, often by leveraging semantic information. Their versatility has made them popular in areas such as data storage, reasoning, and explainable artificial intelligence [3]. However, for many research groups, building their own biomedical KG is prohibitively expensive. This motivated us to build the BioCypher framework to support users in creating KGs (https://biocypher.org).
The ability to build a task-specific KG is important, since directly standardizing the representation of biomedical knowledge is not appropriate for the diverse research tasks in the community. While human researchers can contextualize and abstract concepts easily, the same does not apply to algorithms. For example, drug discovery tasks (viewing genes as functional ancestors of protein targets) require a different KG structure and content from the implementation of a molecular tumor board (genes as clinical markers), which is different still from research into cell type-contextualized gene regulatory network inference (genes as targets of regulatory mechanisms). Even for similar tasks, the KG structure or subtle decisions about included resources lead to different results for many modern analytic methods [2]. In addition, decisions about how to represent knowledge at each primary resource pose problems in their integration, for instance via the use of different identifier namespaces, levels of granularity or licenses [4,5].
The current landscape of biomedical KGs is not easily navigated; neither the KGs themselves nor the pipelines used to build them consistently adhere to FAIR (Findable, Accessible, Interoperable and Reusable) [6] and TRUST (Transparency, Responsibility, User focus, Sustainability and Technology) [7] principles. Understandably, the overhead required to implement these principles may not be justified when building a one-off, task-specific KG for research. Thus, many KGs are built manually for specific applications, which leads to issues in their reuse and integration [4]. For downstream users, the resulting KGs are too distinct to easily compare or combine [5]. Maintaining KGs for the community requires even more work; once maintenance stops, they quickly deteriorate, leading to reusability and reproducibility issues [4] (Supplementary Note 1).
BioCypher has been built with continuous consideration of the FAIR and TRUST principles, yielding benefits to the entire community in multiple respects:
Modularity. To rationalize efforts across the community, we propose a modular architecture that maximizes reuse of data and code in three ways: input, ontology and output (Fig. 1a). Input adapters allow delegating maintenance work to one central place for each resource, ontology adapters give access to the wealth of structured information curated by the ontology community, and output adapters allow benchmarking and selection of database management systems. Together, these mechanisms enable a workflow that reduces the time and effort to develop and deploy custom KGs.
Harmonization. By using ontologies as expertly crafted repositories of conceptual hierarchies, we facilitate harmonization from a biological perspective. We help with the technical aspects of using and manipulating ontologies — for instance, by flexibly extending or hybridizing complementary ontologies.
Reproducibility. By sharing the mapping of KG contents to ontologies, we facilitate reproduction of the structure of the corresponding database without access to the primary data, which may be prohibited by license or privacy issues. We also enable extraction of subgraphs, effectively converting storage-oriented to task-specific KGs, which because of their reduced sizes are easier to share alongside analyses.
Reusability and accessibility. The sustainability of research software is strongly related to adoption in, and contributions from, the community. BioCypher is developed as TRUSTworthy open-source software, applying methods of continuous integration and deployment and including a diverse community of researchers and developers from the beginning. This facilitates workflows that are tested end-to-end, including the integrity of the scientific data. We operate under the permissive MIT license and provide community members with guidelines for their contributions and a code of conduct (https://github.com/biocypher).
Different measures further increase the accessibility and FAIRness of our framework. For example, we provide a template repository for a BioCypher pipeline with adapters, including a Docker Compose setup. To enable learning by example, we curate existing pipelines, as well as all adapters they use, in our GitHub organization. Using the GitHub API and a BioCypher pipeline, we build a ‘meta-graph’ for the simple browsing and analysis of BioCypher workflows (https://meta.biocypher.org). To inform the contents of this meta-graph, we have reactivated and now maintain the Biomedical Resource Ontology [8], which helps to categorize pipelines and adapters into research areas, data types and purposes (Supplementary Note 2).
BioCypher is implemented as a Python library that provides a low-code access point to data processing and ontology manipulation, emphasizing the reuse of existing resources to the highest extent possible. We have begun to open the platform to other bioinformatics ecosystems, starting with R/Bioconductor (https://biocypher.org/r-bioc.html). Through our design principles and the automation of data management tasks, we aim to free up developer time and guide decision making on how to represent knowledge, bridging the gap between the field of biomedical ontology and the broad application of databases in research.
By abstracting the KG build process as a combination of modular input adapters, we save developer time in the maintenance of integrative resources built from overlapping primary sources (Fig. 1b): for instance, OmniPath [9], Bioteque [2], CROssBAR DB [10] and the Clinical Knowledge Graph [11].
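As a rough illustration of the input adapter idea, the following Python sketch wraps a single toy resource and emits generic node and edge tuples that any downstream KG build could consume. The class name, the record layout and the tuple shapes are assumptions made for this example; they are not taken from the BioCypher API, and the records are illustrative only.

```python
from typing import Dict, Iterator, List, Tuple

# Toy records standing in for one primary resource (hypothetical layout).
RAW_INTERACTIONS: List[Dict[str, str]] = [
    {"source": "P00533", "target": "P04626", "type": "binds"},
    {"source": "P04626", "target": "P21860", "type": "binds"},
]


class ToyInteractionAdapter:
    """Hypothetical input adapter: one class per resource, so the
    resource-specific parsing is maintained in a single place."""

    def __init__(self, records: List[Dict[str, str]]) -> None:
        self.records = records

    def get_nodes(self) -> Iterator[Tuple[str, str, dict]]:
        """Yield each protein once as (id, label, properties)."""
        seen = set()
        for rec in self.records:
            for accession in (rec["source"], rec["target"]):
                if accession not in seen:
                    seen.add(accession)
                    yield (f"uniprot:{accession}", "protein",
                           {"accession": accession})

    def get_edges(self) -> Iterator[Tuple[str, str, str, dict]]:
        """Yield each record as (source_id, target_id, label, properties)."""
        for rec in self.records:
            yield (f"uniprot:{rec['source']}", f"uniprot:{rec['target']}",
                   "protein_protein_interaction", {"type": rec["type"]})


adapter = ToyInteractionAdapter(RAW_INTERACTIONS)
nodes = list(adapter.get_nodes())
edges = list(adapter.get_edges())
```

Because every adapter in such a design exposes the same two generators, several integrative resources that draw on the same primary source could share one adapter instead of each maintaining its own parser.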
By mapping the contents of those resources onto a common ontological space, we gain interoperability between the different biomedical domains (Fig. 1c). BioCypher helps with the mapping procedure by providing examples and an interface, as well as numerous user-friendliness measures. By using the industry standard Web Ontology Language (OWL) format, we provide access to most available ontologies. Separating the ontology framework from the modeled data enables the implementation of reasoning applications at the ontology level — for instance, the ad hoc harmonization of disease ontologies.
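The mapping step can be pictured with a small Python sketch in which labels from two resources are mapped onto classes of a shared is-a hierarchy. The hierarchy, the resource names and the label mapping below are invented for illustration; a real pipeline would read the hierarchy from an OWL file rather than a hand-written dictionary.

```python
from typing import Dict, List, Tuple

# Toy is-a hierarchy (child -> parent), a stand-in for an OWL ontology.
TOY_ONTOLOGY: Dict[str, str] = {
    "protein": "polypeptide",
    "polypeptide": "biological entity",
    "gene": "biological entity",
    "biological entity": "entity",
}

# Hypothetical mapping of (resource, resource-specific label) to a class.
LABEL_MAPPING: Dict[Tuple[str, str], str] = {
    ("resource_a", "prot"): "protein",
    ("resource_b", "UniProtEntry"): "protein",
    ("resource_b", "HGNC_symbol"): "gene",
}


def lineage(term: str, ontology: Dict[str, str]) -> List[str]:
    """Return the term followed by all of its ancestors, up to the root."""
    chain = [term]
    while chain[-1] in ontology:
        chain.append(ontology[chain[-1]])
    return chain


def harmonize(resource: str, label: str) -> List[str]:
    """Map a resource-specific label to its ontology class and ancestors."""
    return lineage(LABEL_MAPPING[(resource, label)], TOY_ONTOLOGY)


def common_ancestor(a: List[str], b: List[str]) -> str:
    """Return the most specific class shared by two lineages."""
    return next(term for term in a if term in b)


protein_a = harmonize("resource_a", "prot")
protein_b = harmonize("resource_b", "UniProtEntry")
gene_b = harmonize("resource_b", "HGNC_symbol")
shared = common_ancestor(protein_a, gene_b)
```

In this sketch the two differently named protein labels resolve to the same class, while proteins and genes meet only at a more general ancestor, which is the kind of relationship a query at the ontology level could exploit.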
By providing access to a range of modular output adapters, we facilitate the project-specific benchmarking and selection of suitable database management systems. For instance, a Neo4j adapter provides rapid access to extensive databases for maintenance of knowledge and enables queries from analysis (Jupyter) notebooks. Switching to alternative graph or relational databases (for example, ArangoDB or PostgreSQL) allows task-specific performance optimization. A comma-separated values (CSV) writer and Python-native adapters (for example, Pandas, sparse matrix or NetworkX formats) yield knowledge representations that can directly be used programmatically by a wide range of machine learning frameworks. As a result of BioCypher’s modular nature, more output adapters can quickly be added.
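To make the output side concrete, the sketch below serializes the same node and edge tuples in two ways: as CSV text and as an in-memory adjacency mapping. The function names, the tuple layout and the adapter registry are assumptions for this example, written with the Python standard library only; they do not reproduce BioCypher's own writers.

```python
import csv
import io
from typing import Callable, Dict, List, Tuple

Node = Tuple[str, str]        # (id, label)
Edge = Tuple[str, str, str]   # (source_id, target_id, label)

NODES: List[Node] = [("uniprot:P00533", "protein"),
                     ("uniprot:P04626", "protein")]
EDGES: List[Edge] = [("uniprot:P00533", "uniprot:P04626", "interacts_with")]


def to_csv(nodes: List[Node], edges: List[Edge]) -> Tuple[str, str]:
    """Return (node_csv, edge_csv) as strings with header rows."""
    node_buf, edge_buf = io.StringIO(), io.StringIO()
    node_writer = csv.writer(node_buf, lineterminator="\n")
    node_writer.writerow(["id", "label"])
    node_writer.writerows(nodes)
    edge_writer = csv.writer(edge_buf, lineterminator="\n")
    edge_writer.writerow(["source", "target", "label"])
    edge_writer.writerows(edges)
    return node_buf.getvalue(), edge_buf.getvalue()


def to_adjacency(nodes: List[Node],
                 edges: List[Edge]) -> Dict[str, List[Tuple[str, str]]]:
    """Return {source_id: [(target_id, edge_label), ...]} for all nodes."""
    adjacency: Dict[str, List[Tuple[str, str]]] = {nid: [] for nid, _ in nodes}
    for source, target, label in edges:
        adjacency[source].append((target, label))
    return adjacency


# Hypothetical registry: a new output format is added by registering
# one more function with the same (nodes, edges) signature.
OUTPUT_ADAPTERS: Dict[str, Callable] = {"csv": to_csv,
                                        "adjacency": to_adjacency}

node_csv, edge_csv = OUTPUT_ADAPTERS["csv"](NODES, EDGES)
adjacency = OUTPUT_ADAPTERS["adjacency"](NODES, EDGES)
```

Keeping all writers behind one signature is what would let a project benchmark several storage back ends against the same KG content without touching the input or ontology stages.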
Application programming interfaces (APIs) built on top of the BioCypher KGs enable complex and versatile queries and simplify the interaction of users with the knowledge. For example, web widgets and apps (such as drug discovery and repositioning with https://crossbar.kansil.org and analysis workflows with https://drugst.one) allow researchers to browse and customize the database and to plug it into standard pipelines. Additionally, a structured, semantically enriched knowledge representation facilitates connection to and improves performance of modern natural language processing applications such as GPT [12], which can be specifically tuned for biomedical research [13]. The use of common standards enables sharing of tools across projects and communities or in cloud-based services that preserve sensitive patient data (Supplementary Note 3).
There have been numerous attempts at standardizing KGs and making biomedical data stores more interoperable. We can identify three general types of approaches, in increasing order of abstraction: centrally maintained databases, explicit standard formats (modeling languages) and KG frameworks. With BioCypher, we aim to improve user-friendliness on all three levels of abstraction (see Supplementary Note 4). Despite many efforts, there is no widely accepted solution. Very often, resources take the ‘path of least resistance’ in adopting their own, arbitrary formats of representation. To our knowledge, no framework provides easy access to state-of-the-art KGs to the average biomedical researcher, a gap that BioCypher aims to fill. We demonstrate some key advantages of BioCypher by case studies in Supplementary Note 5.
We believe that creating a more interoperable biomedical research community is as much a social effort as it is a scientific software problem. To facilitate adoption of any approach, the process must be made as simple as possible, and it must yield tangible rewards, such as substantial savings in developer time. We will provide hands-on training for all interested researchers, and we invite all database and tool developers to join our collective effort.
1. Li, M. M., Huang, K. & Zitnik, M. Nat. Biomed. Eng. 6, 1353–1369 (2022).
2. Fernández-Torras, A., Duran-Frigola, M., Bertoni, M., Locatelli, M. & Aloy, P. Nat. Commun. 13, 5304 (2022).
3. Tiddi, I. & Schlobach, S. Artif. Intell. 302, 103627 (2022).
4. Bonner, S. et al. Brief. Bioinform. 23, bbac404 (2022).
5. Callahan, T. J., Tripodi, I. J., Pielke-Lombardo, H. & Hunter, L. E. Annu. Rev. Biomed. Data Sci. 3, 23–41 (2020).
6. Wilkinson, M. D. et al. Sci. Data 3, 160018 (2016).
7. Lin, D. et al. Sci. Data 7, 144 (2020).
8. Tenenbaum, J. D. et al. J. Biomed. Inform. 44, 137–145 (2011).
9. Türei, D., Korcsmáros, T. & Saez-Rodriguez, J. Nat. Methods 13, 966–967 (2016).
10. Doğan, T. et al. Nucleic Acids Res. 49, e96 (2021).
11. Santos, A. et al. Nat. Biotechnol. 40, 692–702 (2022).
12. Andrus, B. R., Nasiri, Y., Cui, S., Cullen, B. & Fulda, N. Proc. AAAI Conf. Artif. Intell. 36, 10436–10444 (2022).
13. Lobentanzer, S. & Saez-Rodriguez, J. Preprint at https://doi.org/10.48550/arxiv.2305.06488 (2023).
This project has received funding from the European Union’s Horizon 2020 research and innovation programme (grant agreement No 965193 (DECIDER) and 116030 (TransQST)), the German Federal Ministry of Education and Research (BMBF, Computational Life Sciences grant No 031L0181B and MSCoreSys research initiative research core SMART-CARE 031L0212A), the US Defense Advanced Research Projects Agency (DARPA) Young Faculty Award (W911NF-20-1-0255), and the Medical Informatics Initiative Germany, MIRACUM consortium (FKZ: 01ZZ2019). We thank Henning Hermjakob, Benjamin Haibe-Kains, Pablo Rodriguez-Mier, Daniel Dimitrov and Olga Ivanova for feedback on the manuscript and Ben Hitz and Pedro Assis for feedback on their use of BioCypher.
J.S.-R. reports funding from GSK, Pfizer and Sanofi and fees from Travere Therapeutics and Astex Pharmaceuticals.
Peer review information
Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.
Cite this article
Lobentanzer, S., Aloy, P., Baumbach, J. et al. Democratizing knowledge representation with BioCypher. Nat Biotechnol 41, 1056–1059 (2023). https://doi.org/10.1038/s41587-023-01848-y