Main

With collaborations of over 2,000 scientists across more than 1,000 institutes from 76 countries to date, the Human Cell Atlas (HCA) has generated comprehensive molecular profiles of tens of millions of single cells across 18 different organs and systems, which in turn are advancing our understanding of the definition of cell types and states1,2. Technological advances in single-cell and spatial genomics are rapidly expanding the compendium of known cell types3 and accelerating discoveries of a large variety of novel cell populations.

For instance, these efforts have been applied to system-level disciplines such as immunology and neuroscience, both of which require an understanding of vast networks of cells and tissues. In immunology, cell types have been historically recognized and well characterized. Yet the number of discrete cell types and specific cell states identified from single-cell genomics has exceeded expectations, in particular with respect to the diversity of cell states derived from developmental dynamics4, tissue-resident phenotypes5 and activation states6. For example, transcriptomics profiling identified three decidual natural killer cell populations at the maternal–fetal interface that show varying levels of immunoregulatory properties and modulate trophoblast invasion7. Transcriptomics and genomics profiling studies have also captured an increasing variety of cell types and gene programmes in the central and peripheral nervous systems. Cell atlasing (that is, the creation of a cell atlas) of mammalian brains has led to the discovery of previously uncharacterized cell types, including more than 100 cell types in a single region of the neocortex8, as well as of cellular diversity due to species-specific adaptations in the cortex8. A similar dramatic increase in diversity has been reported in the peripheral nervous system such as in the enteric nervous system9,10.

This incredible progress takes us closer to answering a general question that motivates stem and developmental cell biologists, as well as the HCA project: what is the complete cellular makeup of the human body? Annotating cells and gene programmes is crucial not only to address this question but also to fully exploit these data for biological discovery, including in pathological states. This can only be achieved by naming the entities we study in a consolidated manner so that findings can be related between studies and one study can build on findings from multiple previous ones as knowledge is accrued and expanded. However, most annotations of single-cell genomics datasets to date have used uncontrolled free text (that is, arbitrary naming schemes) for cell type names, which makes the cross-searching of annotations across separate datasets challenging and unreliable. In some cases, where no naming scheme exists, cells are described merely by a subset of their molecular characteristics and are therefore hard to match between studies.

To fully answer the question of what constitutes the cellular composition of the human body, there is an urgent need to put new discoveries from the HCA into the context of classical cell biology and anatomy, as well as developmental biology, neurobiology and pathology. Cell ontologies, which are structured controlled vocabularies for cell types in animals, are a tremendously powerful way of formalizing such knowledge, which in turn opens up opportunities for quantitative scientific interrogation of the HCA data in new and exciting ways.

In this Perspective, we discuss the utility and parts of cell ontologies, review the state of current cell ontologies and conclude with ongoing efforts and how they can be applied to discovery over the coming years.

Using cell ontology for knowledge integration and mining

Biomedical ontologies originated in simple controlled vocabularies developed to supplement or replace the free-text metadata in databases, clinical records and medical billing systems11. Standardizing the text used to record, for example, diseases, gene functions, anatomical structures and cell types within and between databases makes it possible to reliably search and group records referring to the same entities (for example, by diseases or cell types). However, controlled vocabularies are not sufficient for searching and grouping records with closely related contents. For example, a user searching a database for records relating to macrophages or liver sinusoid would not find records for Kupffer cells unless the data structures driving the search had some meaningful ways to relate the terms ‘macrophage’, ‘Kupffer cell’ and ‘liver sinusoid’. Cell ontologies provide mechanisms for this integration, which allows us to record a Kupffer cell as a type of macrophage located in the liver sinusoid and then to enrich search results to take advantage of the classification and location relationships (Fig. 1).

Fig. 1: Representation of part of CL centred around the term Kupffer cell.

A graph showing the relationships between terms for anatomical structures (for example, hepatic sinusoid), cell types (for example, macrophage) and functional roles (for example, erythrocyte clearance). The following relationships are shown: ‘is a’, which records the classification; ‘part of’, which relates cells to their tissues and organs; ‘located in’, which relates cells to spaces such as the hepatic sinusoid; ‘develops in’, which records the developmental origin; and ‘capable of’, which records the function.
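
The enrichment of search results described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how CL or any database implements it: the triples are a hand-written toy subset of the relationships in Fig. 1, and the dataset names and the helpers `expand_query` and `search` are hypothetical.

```python
from collections import defaultdict, deque

# Toy (subject, relation, object) triples loosely based on Fig. 1.
# They are written by hand for illustration and are not loaded from CL.
TRIPLES = [
    ("Kupffer cell", "is_a", "macrophage"),
    ("Kupffer cell", "located_in", "hepatic sinusoid"),
    ("endothelial cell of hepatic sinusoid", "is_a", "endothelial cell"),
    ("endothelial cell of hepatic sinusoid", "part_of", "hepatic sinusoid"),
]

# Hypothetical database records, each annotated with a single term.
RECORDS = {
    "dataset_A": "Kupffer cell",
    "dataset_B": "endothelial cell of hepatic sinusoid",
    "dataset_C": "macrophage",
}


def expand_query(term, triples=TRIPLES):
    """Return the query term plus every term that reaches it via any relation."""
    reverse = defaultdict(set)
    for subject, _relation, obj in triples:
        reverse[obj].add(subject)

    matched = {term}
    queue = deque([term])
    while queue:  # breadth-first walk over the reversed edges
        current = queue.popleft()
        for related in reverse[current]:
            if related not in matched:
                matched.add(related)
                queue.append(related)
    return matched


def search(term, records=RECORDS):
    """Return the names of records annotated with the term or a related term."""
    wanted = expand_query(term)
    return sorted(name for name, annotation in records.items() if annotation in wanted)


print(search("macrophage"))        # ['dataset_A', 'dataset_C']
print(search("hepatic sinusoid"))  # ['dataset_A', 'dataset_B']
```

A plain label match on 'macrophage' would return only `dataset_C`; following the recorded relationships also retrieves the Kupffer cell record. A real system would treat each relation type separately rather than following all of them indiscriminately.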

Ontologies of cell types such as those in Cell Ontology (CL)12 and Drosophila Anatomy Ontology13 are increasingly being used to annotate single-cell transcriptomics data. The use of ontology terms in dataset annotation relates annotated data back to hard-earned legacy knowledge, classical terminologies and the accompanying understanding of cell types, anatomies and development. Such annotation makes data cross-searchable, discoverable, integrable and more accessible to general cell biologists. It facilitates cross-dataset analyses, which then allows more quantitative analyses of similarities across thousands of individual cells and leads to more nuanced views of cell types, their classification and their properties.

CL was first developed as a platform in 2004 to collect major cell types from humans and model organisms, and has since been applied to various fields. For example, the Encyclopedia of DNA Elements (ENCODE) Consortium used CL to annotate its compendium of cell types, which yielded a prioritized set of genetic and epigenetic elements14. Because the precise terms used for cell types, anatomical structures and diseases often greatly vary across sources, biomedical ontologies, including CL, typically use a bipartite system of universally resolvable identifiers (IDs) in the form of URLs for ontology terms, with each linked to an official label. For example, the term with the primary label ‘Kupffer cell’ in CL is identified by the permanent URL http://purl.obolibrary.org/obo/CL_0000091, which is further abbreviated to the compact form CL:0000091 (ref. 15). Critically, using resolvable IDs rather than labels to refer to cell types in database records allows associated metadata (for example, labels, descriptions and references) and their relationships (for example, anatomy, development, functional and pathological relevance) to evolve over time with no cost for the databases and records that use IDs to refer to them (Fig. 1).
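
The correspondence between the permanent URL and the compact form can be illustrated with a short Python sketch. The function names `purl_to_curie` and `curie_to_purl` are hypothetical helpers written for this example; only the URL prefix and the Kupffer cell identifier come from the text above.

```python
OBO_PURL_PREFIX = "http://purl.obolibrary.org/obo/"


def purl_to_curie(purl: str) -> str:
    """Convert a permanent URL such as .../obo/CL_0000091 to the compact form CL:0000091."""
    if not purl.startswith(OBO_PURL_PREFIX):
        raise ValueError(f"not a recognised permanent URL: {purl!r}")
    local_id = purl[len(OBO_PURL_PREFIX):]
    # Split on the last underscore so that the numeric part is always separated.
    prefix, separator, number = local_id.rpartition("_")
    if not separator or not prefix or not number:
        raise ValueError(f"cannot split local identifier: {local_id!r}")
    return f"{prefix}:{number}"


def curie_to_purl(curie: str) -> str:
    """Convert a compact form such as CL:0000091 back to its permanent URL."""
    prefix, separator, number = curie.partition(":")
    if not separator or not prefix or not number:
        raise ValueError(f"not a compact identifier: {curie!r}")
    return f"{OBO_PURL_PREFIX}{prefix}_{number}"


print(purl_to_curie("http://purl.obolibrary.org/obo/CL_0000091"))  # CL:0000091
print(curie_to_purl("CL:0000091"))  # http://purl.obolibrary.org/obo/CL_0000091
```

Because database records store only the identifier, the label attached to it can be corrected or extended in the ontology without touching those records.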

Ontologies can serve to link and integrate heterogeneous data types related to the same cell type across multiple modalities. For example, Virtual Fly Brain16,17 and the Fly Cell Atlas18 use the same ontology terms to annotate three-dimensional images of neurons (>70,000 images), connectomics data (>3.5 million pairwise connections) and single-cell transcriptomics data (~600,000 cells). Similarly, CL terms, classifications and relationships are also increasingly being used to define and classify terms in the Gene Ontology database19 (>750 terms) and in widely used ontologies of phenotypes (730 terms in Human Phenotype Ontology20) and diseases (>3,000 terms in Mondo Disease Ontology21). These links make it possible to combine single-cell, phenotype and disease data relating to the same cell types. With the advent of large-scale single-cell transcriptomics atlasing, community-driven nomenclature- and ontology-building projects have emerged and are coordinating with existing ontology-building efforts (for example, HCA Biological Networks2, the Human BioMolecular Atlas Program (HuBMAP)22, BRAIN Initiative Cell Census Network (BICCN)23 and Cell Annotation Platform (http://celltype.info)).

This is already affecting our ability to organize our knowledge of cell types for comparisons of datasets across individual laboratories and, notably, for effectively interpreting health and disease using the knowledge from both classical histopathology and single-cell genomics. For instance, ontological distinctions between fetal and mature cells in the kidney are mirrored by differences in their molecular signatures, which are critical to understanding the divergent origins of paediatric and adult kidney cancers, respectively24. Similarly, datasets that were annotated in a consistent manner facilitated cross-tissue meta-analyses for COVID-19 that identified specialized nasal epithelial cells that were enriched in the expression of entry factors for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)25. Such datasets also enabled the identification of covariates such as age, sex and smoking status associated with SARS-CoV-2 entry factor expression in lung and airway cells26, and a comparison between cells from autopsy tissue samples from patients with COVID-19 and from healthy and other disease conditions27. Together, these studies highlight the necessity and utility of establishing previously agreed ontological classifications.

Considerations in the classification of human cell types

Biologists have long recognized that the natural world lends itself to hierarchical systems of classification, which capture the underlying hierarchical processes that drive biology, such as the phylogenetic classification of species by morphological and molecular observations. Similarly, cell types can be hierarchically classified and categorized in ever-increasing levels of resolution, from a general cell type such as an endothelial cell to more specialized types such as a liver sinusoidal endothelial cell (LSEC) and then down to highly specialized types found in specific locations such as a periportal LSEC. As with taxonomy for a species, various kinds of observations inform the ultimate classification, and these different types of information are often used in concert to arrive at a particular cell-type definition.

Taking anatomical locations as an example, CL12 imports information about anatomical structures and features from the Uber-anatomy Ontology (Uberon)28 and relates them to CL terms using, for example, ‘part of’ to relate cell types to the tissues and organs, and ‘located in’ to relate cell types to cavities within structures. For example, the definition of an LSEC in CL includes a ‘part of’ relationship to ‘hepatic sinusoid’, which indicates that the LSEC forms part of the structure of the hepatic sinusoid as defined in Uberon. By contrast, the definition of Kupffer cells records that they are ‘located in’ (the lumen of) the hepatic sinusoid. Higher in the anatomical hierarchy, the definition of hepatic sinusoid involves relations to the liver lobule and the liver overall, which is in turn defined by its structure, location and physiological role in the body. The LSEC is therefore hierarchically defined relative to the whole organism down to its individual position in the specific tissue where it is found (Fig. 2a). Furthermore, since CL classifies cell types hierarchically from generic cell types down to more specialized types, an LSEC is also defined as a descendant of the general endothelial cell class in CL. The main LSEC class (officially ‘endothelial cell of hepatic sinusoid’) has its own descendant classes, which represent the following further specializations of LSECs: ‘endothelial cell of periportal hepatic sinusoid’ and ‘endothelial cell of pericentral hepatic sinusoid’.

Fig. 2: CL links human cell types with anatomy and cell-state transition.

a, CL has terms for a variety of cell types associated with the hepatic sinusoid (Uberon:0001281). The classification of these cell types allows them to be grouped with other cells from the same location. For example, Kupffer cells (CL:0000091) can be grouped with other tissue-resident macrophages or with cells of the hepatic sinusoid. b, Ontologies can be used to encode transitions through diverse cell states. Examples include T-cell activation following antigen recognition, cell cycling, neuron development and maturation, smooth muscle cell (SMC) contraction and relaxation, and cell destruction after oxidative stress.

Sources of information that contribute to a cell-type categorization include morphological features, developmental origins and functional profiles. Ontologies attempt to capture all terms that are used by different scientific communities to refer to the same cell type, as well as alternative names that may not be commonly used. Historically, different fields in biology have focused on different aspects of cells to drive their naming. For example, many immune cells have been classified according to which cell surface protein(s) they express29,30,31,32,33,34,35,36, whereas cells of the nervous system have been named according to a combination of features, including morphologies, physiologies, connectivities and the roles they play in the neuronal circuitry37. In some systems, such as the retina38, there is strong evidence to indicate that cell types can be consistently classified regardless of the features used to classify them. In these cases, classically defined cell types typically align well with those identified by analyses of single-cell transcriptomics data, which makes cell annotation straightforward. In other cases, different features could in principle lead to different cell-type classifications, which makes consistent annotation more challenging. Formal ontologies are able to support multiple overlapping classification schemes and can therefore potentially help reconcile different classification schemes, at least at the level of more generally grouped classes.

Cell ontologies also represent developmental lineages and, to a more limited extent, cell states such as activation, cycling, morphological changes and stresses (Fig. 2b) either directly or through extensions of existing annotations. Cell-cycle states, for example, can be represented in the annotation system by combining a term from CL with a term from the Gene Ontology Cell Cycle Phase. Developmental or actively regenerating tissues present particular challenges to cell ontology development, as a plethora of intermediate states and continuous branching lineages can be partitioned. In such a setting, cell annotation needs to emphasize the relative ordering of states or their positions on a continuous differentiation path. There are also striking examples of developmental convergence (developmental homoplasy). Somatosensory neurons, for example, can be of mixed origin from the neural crest or sensory placodes39. Similarly, dermal fibroblasts in different parts of the trunk or face are derived from distinct embryonic lineages despite molecular and phenotypic similarities40. Nevertheless, cell ontologies record gross lineage relationships, with limited temporal resolution between developing/progenitor and mature cell types using specific relations where these relationships are stereotyped and consistent. To date, CL records lineage and differentiation relationships for more than 1,900 cell types and connects developing cell types to developing tissues and stages via links to Uberon.
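
As an illustration of how a cell state can be recorded by combining a CL term with a Gene Ontology cell-cycle phase term, the following Python sketch pairs the two in one annotation record. It is only a sketch under stated assumptions: the classes `Term` and `CellAnnotation` are hypothetical, and the Gene Ontology identifier is a deliberate placeholder that would need to be looked up in the Gene Ontology itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Term:
    """An ontology term referenced by its compact identifier and label."""
    term_id: str
    label: str


@dataclass(frozen=True)
class CellAnnotation:
    """A cell-type term, optionally combined with a cell-cycle phase term."""
    cell_type: Term
    cell_cycle_phase: Optional[Term] = None

    def describe(self) -> str:
        if self.cell_cycle_phase is None:
            return self.cell_type.label
        return f"{self.cell_type.label} in {self.cell_cycle_phase.label}"


kupffer = Term("CL:0000091", "Kupffer cell")
# Placeholder identifier: the real ID must be taken from the Gene Ontology.
s_phase = Term("GO:XXXXXXX", "S phase")

annotation = CellAnnotation(cell_type=kupffer, cell_cycle_phase=s_phase)
print(annotation.describe())  # Kupffer cell in S phase
```

Keeping the state as a separate field means the cell-type term itself is unchanged, so records for cycling and non-cycling cells of the same type can still be grouped together.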

Many processes that drive cell diversifications, including ontogeny (cell differentiation), morphogenesis (often driven by continuous gradients) and the dual impact of the differentiation history and tissue context of a cell, are imprinted in the molecular properties of the cell and can be captured by hierarchical representations. Therefore, molecular features can serve as the basis for robust cell-type classification that reflects these underlying processes (even when the process is not explicitly known). Currently, cell types and states can be elucidated from single-cell transcriptomics, epigenomics and proteomics expression profiles using different software such as SCCAF41. Further complemented by morphological, physiological, developmental and functional properties, this data-driven framework makes cell annotations comparable across independent ontology efforts and makes the inferred cell types understandable across different communities. Of note, while these inferences are unbiased, it is important to reconcile them with conventional biological and clinical understanding and terminologies.

Current state of ontologies

First developed as platforms to integrate cross-species ontology information, CL and Uberon are now species-neutral ontologies with a strong focus on mammalian cell types and anatomies, with standard mechanisms for recording the species applicability of terms. To date, CL has 2,401 terms covering all major cell types. The granularity of this coverage is variable, with the greatest coverage currently for the immune system (>500 cell types). Uberon defines over 14,000 types of anatomical structures and records many types of relationships between them. In practical terms, CL and Uberon are tightly integrated with each other. Almost 2,000 cell types in CL are linked by ‘part of’ relationships to the anatomical structures defined in Uberon. Further combining CL with newly discovered cell populations from HCA data, we are beginning to extensively cover major organs and cell types in the human body (Table 1).

Table 1 Current status of cell-type enumerations in data from CL and HCA

The human-applicable components of CL and Uberon are under active development as part of multiple collaborative efforts. For human data, terms are being added in a coordinated fashion to both ontology platforms in response to the requests of individual laboratories, as well as to the annotation needs of atlasing projects including the Data Coordination Platform of the HCA2 (https://data.humancellatlas.org) and the Cambridge Cell Atlas portal (www.cambridgecellatlas.org). Editing of CL and Uberon is coordinated by a team of researchers drawn from a growing number of collaborating projects, including the HCA (Chan Zuckerberg Initiative), HuBMAP (National Institutes of Health (NIH)), the Monarch Initiative (NIH) and the Cell Annotation Platform (a collaborative effort funded by Schmidt Futures). This team of editing researchers runs regular open training sessions, and anyone trained to edit the ontology terms can join the editing team. Edits are coordinated and reviewed on GitHub (https://github.com/obophenotype/cell-ontology), with all changes and releases subject to automated quality-control tests before approval. Issues not resolved after discussion on open tickets are coordinated via monthly editor video conferences, which also coordinate the general focus of CL and Uberon efforts. These calls frequently feature guest speakers with a particular interest in extending CL or Uberon in specific areas. CL and Uberon are both members of the Open Biological and Biomedical Ontology Foundry group of ontologies15, which is a loose alliance of ontologies committed to adopting common standards and aligning semantics and ontology infrastructure. All these endow CL and Uberon with the ability to continuously evolve with inputs from various projects and perspectives and to supply formalized ontology information back to the projects (Table 2). Examples of the co-evolution of CL and human cell ontology-building efforts are listed below.

Table 2 Projects using and contributing to CL

The Brain Data Standards Initiative, which is part of the NIH BRAIN Initiative Cell Census Network, is extending CL with terms for cortical cell types defined by single-cell transcriptomics, with a current focus on the primary motor cortex of human, marmoset and mouse42. This work leverages existing efforts on nomenclature standards43, but importantly aims to use the quantitative hierarchical cell-type classification from single-cell genomics as a data-driven foundation for ontological definitions. Different data types about these cell types are integrated at different levels of the hierarchy, including their spatial tissue distributions, morphological and physiological properties, and axonal projection targets. Ultimately, such a data-driven approach may be used across the entire human body to provide a common metric in gene usage to measure similarities and potential common developmental origins across organs.

The ASCT+B effort44, presented as an accompanying Perspective in this issue, is a HuBMAP, Human Tumor Atlas (HTAN) and HCA community-wide project to build tables representing the human anatomy and cell-type terminology needed for annotating single cell RNA sequencing (scRNA-seq) datasets, and to record expert-approved lists of markers for cell types. Entries in these tables are mapped to existing CL or Uberon terms where possible or turned into term requests for these data resources when new terms are needed. The relationships between cell types and anatomical structures encoded in these tables are validated against CL and Uberon. The results of this validation are relayed to improve the tables, Uberon and CL through discussions and agreement with experts. For example, the ASCT+B project is building an expert-validated ontological model of the human vasculature that is feeding hundreds of new terms and relationships back into Uberon. An important outcome of this work will be a curated subset of CL and Uberon terms to annotate human scRNA-seq data in a reliable manner, both for the healthy HCA data as well as disease samples.

As part of the human cell-focused Sanger–European Bioinformatics Institute (EBI) Cambridge Cell Atlas portal (https://www.cambridgecellatlas.org), an effort to make results from human single-cell gene expression experiments easily accessible to a broad community of users, including clinicians, CL is being enriched and extended based on contributions from pathologists and clinicians. This will introduce human cell types annotated with details of specific immunohistochemical markers that are in routine clinical use in diagnostic pathology. This ontology can then be integrated into the search functionality of the Cambridge Cell Atlas platform to enable searching based on a specific immunohistochemical marker or panel of markers. This will enable the identification of the normal cell type(s) (and potentially pathogenic cell types) that express the marker(s). This functionality could be useful to pathologists in interpreting and contextualizing the range of cell types stained by different immunohistochemical markers on histological sections, cytological preparations or by flow cytometry, and in understanding perturbations in staining patterns in pathological states.

Applications of a cell ontology

Cell ontologies provide the community with a single place to look up cell types. Through this portal, knowledge can be aggregated and standardized in an encyclopaedic sense. First, cross-modal data integration can reinforce or refine the identity of a cell type. For example, the survey on the mammalian neocortex revealed the correspondence of various cellular properties when imaging, electrophysiology and connectivity data were overlaid with transcriptomics profiles37. Second, mining of an ontological classification system can reveal major trends with respect to shared cell types across organ-specific atlases (for example, immune, stromal and endothelial cells) versus specialized types (for example, goblet cells in the gut and lung). This emphasizes the concept of a tissue being the collective of its cells operating in concert in a specific three-dimensional organization.

Importantly, with more single-cell resources employing the cell and anatomy ontologies, including but not limited to the Fly Cell Atlas, EBI’s Single Cell Expression Atlas and the Sanger–EBI Cambridge Cell Atlas, cell ontologies can link scientific and medical communities through common nomenclatures and markers for human cell biology, pathology and disease. This link, in a broader sense, represents cross-community research in which a common cell-type reference can be referred to. For example, a well-defined cell-type classification of human head and neck tumours, which covered major immune and non-immune cell populations, was utilized as the reference to interrogate the cellular signals contributing to bulk samples of head and neck squamous cell carcinoma from The Cancer Genome Atlas (TCGA)45. This analysis revealed the association of tumour-infiltrating regulatory T cells with improved survival in head and neck cancer45.

At the same time, immunohistochemical markers in routine clinical use (such as those listed by Pathology Outlines, https://www.pathologyoutlines.com/stains.html), which are linked to the non-pathological cell types by the Cambridge Cell Atlas project, could also be curated and further linked to pathological tissues and cell states that express them. This would provide hundreds of antibodies to link cell types and anatomical structures with CL and Uberon, albeit with a focus on pathological states (CL and Uberon currently focus on healthy homeostatic states).

The application of cell ontologies will be most pertinent in the context of interactive and automated systems for the interpretation and annotation of single-cell genomics datasets. A number of efforts to design such systems are under way, including automated cell annotation projection pipelines46,47,48,49,50,51,52. For example, as part of the HCA initiative, the Cell Annotation Platform (CAP) aims to provide a general repository for cell annotations of different datasets in combination with interactive tools for annotating new datasets. For a cell of interest, CAP user interfaces will suggest the appropriate ontology terms based on text search, learned synonyms and eventually molecular signatures themselves. Where no appropriate term is available from CL, free-text annotation will be used as the basis for the addition of a new term to CL. Similarly, the HuBMAP data portal assigns cell annotations to scRNA-seq datasets with an Azimuth-based label transfer procedure49 based on a vocabulary of cell types from CL, which aims to assess cellular diversities at different levels of resolution. With an initial focus on immune cells, CellTypist uses an expandable cross-tissue cell reference before predicting cell identities with a logistic regression-based label transfer pipeline, with all derived cell types directly interpretable by CL48. Conversely, the resulting knowledge base of commonly used annotation terms and associated molecular signatures will provide a useful resource to extend ontologies as well as to train and optimize machine-learning models that automate the annotation task. In parallel to these efforts, data-driven ontology development is advancing community engagement in specific research domains such as Neuroscience Multi-Omic (NeMO) Analytics for the brain (https://nemoanalytics.org) and gene expression analysis resource (gEAR) for the ear53.
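
The idea of suggesting ontology terms from a text search over labels and synonyms can be sketched as follows. This is not the CAP implementation, which is not described here; it is a minimal Python sketch that assumes a tiny hand-written vocabulary, uses the standard-library `difflib` module for approximate matching, and does not attempt the learned synonyms or molecular signatures mentioned above. The names `VOCABULARY`, `build_name_index` and `suggest_terms` are hypothetical.

```python
import difflib

# Toy vocabulary: official label -> synonyms. Hand-written for illustration only.
VOCABULARY = {
    "Kupffer cell": [],
    "macrophage": [],
    "endothelial cell": [],
    "endothelial cell of hepatic sinusoid": [
        "liver sinusoidal endothelial cell",
        "LSEC",
    ],
}


def build_name_index(vocabulary):
    """Map every lower-cased label and synonym to its official label."""
    index = {}
    for label, synonyms in vocabulary.items():
        for name in [label, *synonyms]:
            index[name.lower()] = label
    return index


def suggest_terms(query, vocabulary=VOCABULARY, limit=3):
    """Suggest official labels for a free-text query.

    Exact (case-insensitive) matches on a label or synonym win; otherwise
    approximate string matches are returned, best match first.
    """
    index = build_name_index(vocabulary)
    key = query.strip().lower()
    if key in index:
        return [index[key]]
    close = difflib.get_close_matches(key, list(index), n=limit, cutoff=0.6)
    suggestions = []
    for name in close:
        label = index[name]
        if label not in suggestions:  # several synonyms may map to one label
            suggestions.append(label)
    return suggestions


print(suggest_terms("LSEC"))         # ['endothelial cell of hepatic sinusoid']
print(suggest_terms("kupfer cell"))  # misspelt query; approximate matches
```

When no suggestion is returned, the free-text annotation would be kept and could serve as the basis for a new term request, as described above.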

Summary and outlook

Resolving the cellular makeup of the human body warrants the categorization of cells in a standardized framework. The CL database offers one such avenue for consolidating this knowledge in an encyclopaedic manner, with applications from cell and tissue biology all the way to the clinic. Despite potential cell classification ambiguities and transient cellular states, each facet of a cell, ranging from morphological to molecular features, can be taken into account until a defining status is reached and recognized by the community.

Many HCA-related resources, such as cellxgene54, have been using CL for de novo cell annotation. Cell ontologies serve other sources of data by retrieving or delivering ontology-level information. We anticipate that the synergy between the HCA project and CL will continue to grow over the coming years and beyond the completion of HCA, with dimensions of human genetic variation, ageing and disease on the horizon. HCA single-cell omics data provide a foundation for the development of cell ontologies, which are powerful resources to define cell types that are universal across the entire body or specific to subsets of tissues and will facilitate future research. This will become more pressing and clearer as the number of HCA studies of individual tissues and organs increases. The HCA Biological Networks will provide nucleation points for expert community efforts to achieve gold standard, consensus cell annotations with cell ontology terms. With such a quantitative approach, common phenotypes and developmental origins of cell types will become understandable through shared gene usage, and functional similarities will be revealed in gene patterns. Whole-body consequences of disease will be understandable through differential gene usage in differently located cells. This will create opportunities for a new and different kind of quantitative data-driven framework that extends and potentially transforms existing ontology efforts.