Building conceptual spaces for exploring and linking biomedical resources

Establishing links between data (e.g., patient records) and Web resources (e.g., literature), and properly visualizing the knowledge discovered in this way, is still a challenge in most Life Science domains (e.g., biomedicine). In this paper we present an infrastructure to annotate information resources, to discover relationships among them, and to represent and visualize the newly discovered knowledge. We have also implemented a Web-based prototype tool that integrates the proposed infrastructure.


Introduction
The ever-increasing volume of web resources, together with the data generated by automated applications, is challenging current approaches to biomedical information processing and analysis. One current trend is to build semantic spaces onto which corporate data and knowledge resources can be mapped in order to ease their exploration and integration. Semantic spaces are usually defined in terms of widely accepted knowledge resources (e.g., thesauri and domain ontologies), and they are populated by applying (semi)automatic semantic annotation processes.
Apart from these semantic spaces, it is also crucial to propose new summarization tools that help both users and machines to better analyze and extract knowledge from them. On-Line Analytical Processing (OLAP) techniques have been used very successfully to analyze summarized data from different perspectives (dimensions) and detail levels (categories). However, OLAP cannot be directly applied to the aforementioned semantic spaces for two reasons: first, data resources and knowledge are highly heterogeneous and dynamic, and second, semantic annotations are based on graph structures, which are difficult to translate into OLAP multidimensional spaces. Despite these limitations, OLAP operators could be very useful, as they provide an intuitive and interactive way to explore multidimensional spaces.
In this paper we propose a new visual paradigm, called 3D conceptual maps, which allows users to explore and analyze interesting associations derived from data and web resources that have previously been annotated with a reference domain ontology. Conceptual maps can be built dynamically according to the users' analysis requirements, and they provide interactivity through operators similar to traditional OLAP operators (e.g., drill-down, roll-up, etc.). The main novelty of the new operators is that they are semantics-aware, that is, they take into account the semantics of the domain ontologies to summarize the data visualized in the conceptual maps. We also present a web-based prototype tool called 3D Knowledge Browser (3DKB), which integrates this visual paradigm and its operations.
As far as we know, there are no similar tools in the literature that allow summarizing and exploring discovered concepts and relationships from different biomedical sources (not only literature). Previous work exists on discovering biomedical relationships from semantic annotations, for example [1,2,3] to mention a few, but these approaches only present results as tabular data, and the target collection is always PubMed abstracts. Instead, our proposal aims to deal with multiple sources (e.g., PubMed abstracts, patient records, public databases, and so on), and it provides mechanisms to explore the discovered relationships through the reference ontologies.
The paper is organized as follows. In Section 2 we introduce the motivating scenario. Then, Sections 3 and 4 present our prototype and its use through two use cases. Section 5 is devoted to the methodological aspects. First, we describe the normalization formalism used to represent both the knowledge resources and the target collections. Then, we introduce the main operators required over the normalized representation to provide interactivity with the conceptual maps. Finally, we give some conclusions and future work.

Motivating Scenario
The need to semantically integrate different biomedical sources arose in the context of the European Health-e-Child (HeC) [4,5] integrated project. HeC aimed to develop an integrated health care platform to allow European paediatricians to access, analyse, evaluate, enhance and exchange integrated biomedical information focused on three paediatric diseases: (1) heart disorders, (2) inflammatory disorders and (3) brain tumours. The biomedical information sources covered six distinct levels of granularity (also referred to as vertical levels), classified as molecular (e.g., genomic and proteomic data), cellular (e.g., results of blood tests), tissue (e.g., synovial fluid tests), organ (e.g., affected joints, heart description), individual (e.g., examinations, treatments), and population (e.g., epidemiological studies).
The 3DKB tool is mainly aimed at providing an integrated and interactive way to browse biomedical concepts as well as to access external information (e.g., PubMed abstracts) and HeC patient data related to those concepts. The 3DKB is intended to facilitate the integration by providing the clinician with a predefined subset of semantically annotated web objects that are relevant to her domain. These objects are thus implicitly linked to clinician and patient data, which are also semantically annotated with the same knowledge resource.
In our current implementation, we selected the Unified Medical Language System Metathesaurus (UMLS) [6] as the knowledge resource with which semantic annotations are generated. UMLS represents the main effort for the creation of a multipurpose reference thesaurus. UMLS contains concepts from more than one hundred terminologies, classifications, and thesauri, e.g., FMA, MeSH, SNOMED CT or ICD. UMLS includes two million terms and more than three million term names, a hypernymy classification with more than one million relationships, and around forty million other kinds of relationships.

Prototype Implementation
The current prototype has been developed using AJAX (Asynchronous JavaScript and XML) technologies. Figure 1 shows an overall view of the 3DKB tool [7] for the JIA domain. It consists of three main parts: (1) the configuration of the 3D Conceptual Map (from now on 3D-Map), which contains the selected vertical levels (i.e., HeC levels) and an optional free text query to evaluate against the visualized concepts, (2) the 3D-Map itself, which contains the biomedical concepts stratified in vertical levels according to the previous configuration, and (3) a series of tabs that contain a ranked list of objects associated with a selected concept from the 3D-Map. In the latter, each tab represents a different type of object (e.g., PubMed abstract, Swissprot protein and HeC patient data). There is a special tab entitled "Tree" which contains all the possible levels that can be selected to configure and build the 3D-Map. The levels are based on the UMLS semantic types [8,9], which are grouped within the corresponding HeC levels as in [10,11]. The layers of the 3D-Map can be defined by selecting levels of the "Tree" tab and also through a keyword-based query. In the second case, only the most specific concepts whose lexical forms match the query are visualized.
The visual paradigm of 3D-Maps relies on the vertical integration vision proposed in HeC. That is, all the involved knowledge, data and information are organized into different disjoint conceptual levels (i.e., vertical levels), each one representing a different perspective of biomedical research. In this way, the 3DKB presents a stratified view of the information based on vertical levels (see Individual.Disease and Organ boxes in 3D-Map of Figure 1). Within each level, biomedical concepts deemed relevant for both the clinician domain (e.g., rheumatology, cardiology and oncology) and the clinician information requests are shown as balls in the 3D-Map. The relevance of a concept is defined in terms of its frequency in the collection (e.g., PubMed abstracts), and it is represented in the 3D-Map through the ball size. Regarding the color of the ball, normal concepts are displayed in blue, expanded concepts in red and concepts containing query entities in green.
Semantic bridges are another important visual element of the 3D-Map. They are defined as links between concepts of two different vertical levels and are represented as 3D lines in the 3D-Map. Semantic bridges can represent either co-occurrences of concepts in the target collection or well-known relationships between concepts stated in some domain ontology (e.g., UMLS). Semantic bridges can help clinicians to select the context in which the required information must hold. For example, from the 3D-Map in Figure 1 we can retrieve documents or patient IDs about arthritis related to limb joints by clicking an existing bridge between the concepts Arthritis and Limb Joints. Finally, semantic bridges also have an associated relevance index, which depends on the correlation measure chosen for their definition (e.g., count, log ratio, odds ratio, etc.).
Another interesting feature of 3D-Maps is the ability to browse through the taxonomical hierarchies of the biomedical concepts (e.g., the UMLS hierarchy). In the example of Figure 2, the user can expand the concepts Operation and Implantation (biggest balls in Figure 2(a)). The resulting concepts are red-coloured (Figure 2(b)) and represent more specific concepts like Catheterisation, Surgical repair, Intubation, or Cardiovascular Operations.
In order to manage the elements of the 3D-Map, a series of operations are provided in the 3D-Map tools panel (see left-hand side of Figure 1). These operations are split into two categories: operations to manage the whole 3D-Map (rotate, zoom and shift) and concept-related operations. The operations to manage the concept visualization involve (1) the retrieval of the objects associated with the clicked concept, (2) the expansion of the clicked concept, (3) the removal of the concepts of a level with the exception of the clicked concept, and (4) the deletion of the clicked concept.

Use Cases
In this section we will show the functionalities of the 3DKB through two use cases based on some HeC clinician information requests.

Case 1: Exploring the relation between procedures and results in the Tetralogy of Fallot (ToF) domain
In this case, the clinician is interested in knowing the relation between the different surgical techniques reported in the literature and the findings and results that are usually correlated with them. For this purpose, the clinician builds the 3D-Map for the semantic levels Individual.Health Procedures and Individual.Finding. As a result, the clinician obtains the map presented in Figure 3(a). However, the clinician is only interested in repair operations, so she refines the query by specifying the keyword repair in the query input field. The resulting 3D-Map is shown in Figure 3(b), where relevant concepts are coloured in green. These relevant concepts contain at least one sub-concept (including the concept itself) matching the specified query. Now, the clinician can select one of the green-coloured concepts, for example Repair Fallot Tetralogy, in order to filter the map to just those concepts that are related to it (see Figure 3(c)). Finally, she finds an interesting bridge between the selected concept and the finding concept Death. Figure 3(d) shows the documents that are retrieved by clicking this bridge. Notice that these abstracts are about death cases related to ToF repair.

Case 2: Finding potential proteins that can be related to different types of a disease within the Brain Tumours (BT) domain
In this use case, the clinician is interested in comparing the proteins related to a disease and its subtypes. Taking the brain tumour domain, the clinician specifies the concept query epilepsy without selecting any vertical level. As a result, she obtains the 3D-Map of Figure 4(a), which contains the concepts attack epileptic, epilepsy intractable, epilepsy lobe temporal, epilepsy extratemporal and epilepsy focal.
To retrieve the proteins related to these diseases, the tab @Swissprot is selected. For example, Figure 4(b) shows the proteins related to attack epileptic. The user can then get much more information about these proteins by clicking the buttons NCBI and KEGG, which jump to the corresponding pages in the Entrez Gene and KEGG sites respectively. Note that the relevance of each protein entry is calculated from the frequency of the concept and its sub-concepts in the Swissprot DB description of the protein.

Method
OLAP (On-line Analytical Processing) [12] tools were introduced to ease information analysis and navigation from large amounts of transactional data. OLAP systems rely on multidimensional data models, which are based on the fact/dimension dichotomy. Data are represented as facts (i.e., subject of analysis), while dimensions contain a hierarchy of levels, which provide different granularities to aggregate the data. Since multidimensionality provides a friendly, easy-to-understand and intuitive visualization of data for non-expert end-users, we have borrowed the previous concepts and operations to apply them to our 3D conceptual maps.

Representation of Semantic Spaces
In order to achieve a browsable analytical semantic space, it is necessary to normalize the representation of both the knowledge resource and the target collection (e.g., patient records, PubMed abstracts, and so on). This normalization consists of two main steps: (1) to arrange existing concepts into a well-structured multidimensional schema, and (2) to represent the object collection under this schema. The first step must be guided by a series of predefined dimensions which roughly represent semantic groups. For example, in the HeC project dimensions correspond to vertical levels: population, disease, organ, and so on. The main issue to be addressed in this step is the irregular structure of the taxonomies provided by existing knowledge resources. The second step has two main tasks: (1) to semantically annotate the object collection with concepts from the knowledge resource, and (2) to normalize the annotation sets of each object to the multidimensional schema defined in the previous step. The subsequent sections describe this process in detail.

Semantic Annotation
In recent years, we have witnessed a great interest in massively annotating biomedical scientific literature. Most of the current annotators rely on well-known lexical/ontological resources such as MeSH, Uniprot, UMLS and so on. These knowledge resources usually provide both the lexical variants for each inventory concept and the concept taxonomies. Some knowledge resources are more formal (e.g., FMA, Galen, etc.), providing logic definitions for concepts from which the taxonomy can be inferred.
In our work, the knowledge resource used to generate semantic annotations is called the reference ontology, denoted $O$. The lexical variants associated with each ontology concept $c$ are denoted $lex(c)$, which is a list of strings. The taxonomic relation between two concepts $a$ and $b$ ($a$ is a sub-concept of $b$) is represented as $a \preceq b$. A semantic annotation of a text fragment $T$ consists of identifying the concepts in $O$ that are most likely to represent the meaning of $T$.
One of the main drawbacks of current semantic annotation systems is that they usually focus on very specific entity types like proteins and diseases. In our work, we aim to generate semantic annotations of any entity type involved in biomedical research. For this reason, we have chosen UMLS-Meta as the knowledge resource, which provides more than 100 entity types (semantic types). However, just a few annotation systems are able to manage the huge amount of lexical information provided by UMLS-Meta, and they are too slow to deal with large text collections. As a consequence, we developed a novel annotation system, called Concept Retrieval, which is based on information retrieval techniques to efficiently perform the text annotation [17]. This annotation system was tested in the CALBC competition over a collection of 150,000 PubMed abstracts about immunology [16].

Knowledge normalization
In order to build semantic spaces for analyzing document collections, the reference ontology $O$ associated with the knowledge resource is normalized as follows:
- First, a set of dimensions $(D_1, \ldots, D_n)$ is defined, which represents a partition of the concepts in the domain ontology. Each dimension $D_i$ represents a different semantic space (e.g., semantic types or vertical levels), and cannot share any sub-concept with the other dimensions.
- Each dimension $D_i$ can define a set of categories or levels $L_i^j$, which in turn form a partition over $D_i$ subject to the following constraints: (1) there cannot be two concepts $c$ and $d$ in $L_i^j$ such that either $c \preceq d$ or $d \preceq c$, and (2) all the concepts in $L_i^j$ have a common super-concept that belongs to $D_i$.
- Every concept of the ontology is encoded under the labeling scheme presented in [18]. Thus, each concept $c \in O$ is represented with the descriptor $\langle c, \mathit{pre\_index}, \mathit{anc\_index}, \mathit{desc\_intervals}, \mathit{anc\_intervals}, \mathit{topo\_order} \rangle$, where pre_index is the pre-order index in the spanning tree of $O$, desc_intervals is the list of index intervals of the descendants of $c$ (i.e., $\{c' \mid c' \preceq c\}$), anc_index is the pre-order index in the reversed spanning tree, and anc_intervals is the list of index intervals of the ancestors of $c$. Finally, topo_order is the topological order of the concept in the spanning tree of $O$.
More specifically, this descriptor represents two labeling schemes: $L^-$ for descendants and $L^+$ for ancestors. Under these labeling schemes, queries over the taxonomical relationships are efficiently computed with a specific interval algebra [18].
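The descendant side of such a labeling can be illustrated with a minimal sketch. It assumes a tree-shaped taxonomy (one parent per concept), so that the descendants of a concept occupy a single contiguous interval of pre-order indexes; the scheme of [18] works with lists of intervals over a spanning tree and is not reproduced here. The toy taxonomy and the function names are hypothetical.

```python
def label_preorder(children, root):
    """Assign each concept the pair (pre_index, last_descendant_index).

    Simplifying assumption: the taxonomy is a tree, so one interval per
    concept is enough to describe all of its descendants.
    """
    labels = {}
    counter = 0

    def visit(concept):
        nonlocal counter
        pre_index = counter
        counter += 1
        for child in children.get(concept, []):
            visit(child)
        # Every index handed out during this visit belongs to a descendant.
        labels[concept] = (pre_index, counter - 1)

    visit(root)
    return labels


def is_subconcept(labels, a, b):
    """True if a is b itself or a descendant of b (interval containment)."""
    low, high = labels[b]
    return low <= labels[a][0] <= high


# Toy taxonomy, for illustration only (not taken from UMLS).
toy_children = {
    "Disease": ["Arthritis", "Heart disorder"],
    "Arthritis": ["Juvenile arthritis"],
}
toy_labels = label_preorder(toy_children, "Disease")
```

With these labels, a sub-concept test reduces to two integer comparisons, which is the property that makes interval-based taxonomy queries cheap.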
One interesting application of the labeling scheme $L^+$ is the efficient construction of ontology fragments tailored to an input set of concepts, called the signature. In this way, we can automatically build each dimension $D_i$ with the ontology fragment obtained from the signature formed by all the concepts that are identified in the collection (through semantic annotation) and that belong to some semantic group representing the dimension (e.g., disease, protein, and so on). To obtain the categories of a dimension $D_i$, we take into consideration the taxonomic relationships in the fragment and the previous restrictions over dimensions and their categories.

Data and resource normalization
After semantic annotation, each document of the target collection $Col$ has an associated list of concepts from the reference ontology $O$. However, these annotation sets are not suited for multidimensional analysis, and therefore a normalization process similar to that applied to the ontology must be performed. The main goal of object normalization is to represent the semantic annotations within the normalized multidimensional space. Thus, each document $d \in Col$ is represented as the multidimensional fact $(c_1, \ldots, c_n)$, where $c_i$ ($1 \le i \le n$) is either a concept from the dimension $D_i$ or the null value. Recall that concepts are represented under the labeling scheme $L^-$, and consequently they are expressed through their pre_index numbers.
As a semantic annotator can tag more than one concept of the same dimension, the normalization process consists in selecting the most relevant concepts for each dimension. For this purpose, for each document $d$ we first build a concept affinity matrix $M_d$ of size $N_c \times N_c$, where $N_c$ is the number of distinct concepts present in the annotations of $d$. This matrix is initialized as follows: The affinity matrix can be used in several existing graph-based algorithms that aim to rank the nodes according to their neighbors' contributions. We have chosen the regularization framework proposed in [19], which can be summarized with the following formula: Here, $R$ is the vector representing the rank of concepts. This is obtained by finding an optimal smooth function that best fits a given vector $Y$, which is achieved by applying the Laplacian operator over the affinity matrix $M_d$ as follows: In our case, the vector $Y$ consists of the frequencies of each concept in the document $d$. The parameter $\alpha$ is directly related to the smoothness of the approximation function (we set $\alpha = 0.9$).
An alternative to this method is to use a centrality-based algorithm over $M_d$. Our preliminary experiments over the HeC collections showed that this method obtains ranks very similar to those of the previous one.
Once the rank $R_d$ is obtained, the normalization process consists in selecting the top-scored concepts of each dimension to represent the fact of $d$.
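The ranking and selection steps can be sketched as follows. This is only an illustration under explicit assumptions: the iteration $R \leftarrow \alpha S R + (1 - \alpha) Y$, with $S$ the symmetrically normalised affinity matrix, is one common form of graph regularization and is not claimed to be the exact formula of [19]; the affinity values, the toy concepts and the choice of keeping a single concept per dimension are likewise hypothetical.

```python
def rank_concepts(affinity, freqs, alpha=0.9, iterations=200):
    """Smooth the frequency vector Y over the concept affinity graph.

    Assumption: S = D^(-1/2) M D^(-1/2) and the rank is the fixed point of
    R <- alpha * S * R + (1 - alpha) * Y.
    """
    n = len(freqs)
    degree = [sum(row) for row in affinity]
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if affinity[i][j] and degree[i] > 0 and degree[j] > 0:
                s[i][j] = affinity[i][j] / (degree[i] * degree[j]) ** 0.5
    rank = [float(f) for f in freqs]
    for _ in range(iterations):
        rank = [
            alpha * sum(s[i][j] * rank[j] for j in range(n))
            + (1 - alpha) * freqs[i]
            for i in range(n)
        ]
    return rank


def select_fact(rank, concepts, dimension_of):
    """Keep, for each dimension, the concept with the highest rank."""
    best = {}
    for concept, score in zip(concepts, rank):
        dim = dimension_of[concept]
        if dim not in best or score > best[dim][1]:
            best[dim] = (concept, score)
    return {dim: concept for dim, (concept, _) in best.items()}


# Toy document: Arthritis and Knee joint are linked, Fever is isolated.
concepts = ["Arthritis", "Fever", "Knee joint"]
dimension_of = {"Arthritis": "disease", "Fever": "disease", "Knee joint": "organ"}
affinity = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]
freqs = [2, 1, 1]
rank = rank_concepts(affinity, freqs)
fact = select_fact(rank, concepts, dimension_of)
```

In this toy case the isolated concept Fever keeps only a fraction of its own frequency, while Arthritis is reinforced by its neighbour, so Arthritis represents the disease dimension in the resulting fact.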

Building 3D conceptual maps
As mentioned in the introduction, our main aim is to build a browsable representation of the semantic spaces defined in the previous section. For this purpose, we define the 3D conceptual map, which is a sequence of different layers that correspond to different dimensions expressed at some detail level (category). In this map, concepts are visualized as balls, which are placed within their corresponding layer with a size proportional to their relevance w.r.t. the target collection. Concept bridges (or conceptual associations) are visualized as links between concepts of adjacent layers. 3D maps are built from the normalized conceptual representation described in the previous section, by using a series of basic operations, which are described in turn.

Basic operations
The basic operations that can be defined over a dimension $D_i$ are the following:
- Layer definition, which establishes the concepts that will be placed in the layer. This operation can be done either by specifying one dimension category or through a keyword-based query. In the first case, all the concepts of the dimension category are visualized, whereas in the second case only the most specific concepts in $D_i$ whose lexical forms match the query are visualized.
- Concept containment, which returns all the sub-concepts of a selected concept $q$ of a dimension $D_i$, that is, the set $\{c \mid c \in D_i \wedge c \preceq q\}$.
- Text containment, which checks whether there exists some concept $c \preceq q$ whose lexicon, $lex(c)$, matches the specified keywords, and returns the matching concepts: $contains(D_i, q, kywds) = \{c \mid c \in D_i \wedge c \preceq q \wedge matches(lex(c), kywds)\}$
- Direct sub-concepts, denoted $children(D_i, c)$, which returns the set of direct sub-concepts of a concept $c$. This operation is used to browse the taxonomy downwards (drill-down operation).
All these operations are efficiently performed by using the interval algebra over the $L^-$ scheme associated with the ontology concepts.
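The containment and drill-down operations above can be sketched over a plain parent-to-children dictionary. This is a hypothetical simplification: it walks the taxonomy recursively instead of using the interval algebra, it leaves the dimension $D_i$ implicit, and it assumes that a lexical variant matches when it contains every keyword as a substring, which the text does not specify.

```python
def descendants_or_self(children, concept):
    """All sub-concepts of `concept`, including the concept itself."""
    found = [concept]
    for child in children.get(concept, []):
        found.extend(descendants_or_self(children, child))
    return found


def contains(children, lex, q, keywords):
    """Sub-concepts of q with a lexical variant containing every keyword."""
    def matches(variants):
        return any(
            all(k.lower() in v.lower() for k in keywords) for v in variants
        )
    return {c for c in descendants_or_self(children, q) if matches(lex.get(c, []))}


def direct_subconcepts(children, concept):
    """Direct sub-concepts of `concept` (one drill-down step)."""
    return list(children.get(concept, []))


# Toy taxonomy and lexicon, for illustration only.
toy_children = {
    "Operation": ["Repair", "Catheterisation"],
    "Repair": ["Repair Fallot Tetralogy"],
}
toy_lex = {
    "Operation": ["operation"],
    "Repair": ["surgical repair"],
    "Repair Fallot Tetralogy": ["repair of tetralogy of Fallot", "ToF repair"],
    "Catheterisation": ["catheterisation"],
}
matching = contains(toy_children, toy_lex, "Operation", ["repair"])
```

A layer defined through a keyword query could then keep only the most specific concepts of `matching`, for instance those that have no descendant in the same set.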

Aggregations
Summarization is one of the main purposes of the proposed analytical tool, as it facilitates the exploration of the collection contents. As in OLAP-like systems, summarization is performed through well-defined aggregations over the semantic annotations of the object collections. More specifically, the following aggregations are performed to visualize summarized information:
- Concept relevance. The relevance of a concept $c$ is calculated by aggregating the relevance of its sub-concepts w.r.t. each specific collection. Formally, an aggregation function $\Gamma$ (e.g., sum, avg, and so on) is applied to the values that a scoring function, score, evaluated against the collection, assigns to the sub-concepts of $c$. The simplest scoring function is the number of hits, that is, the number of objects of the collection in which the concept occurs. Alternatively, the scoring function can take into account the relevance of each concept in the documents in which it appears. Thus, we can aggregate the relevance scores estimated to select concept facts (see Formula 1).
- Concept associations. Given two dimension levels $L_i^n$ and $L_j^m$, belonging to dimensions $D_i$ and $D_j$ ($i \neq j$) respectively, a 2D cube stores the aggregated contingency tables necessary for correlation analysis. For each pair of concepts $(c_i, c_j)$ of these levels, $n_{i,j}$ measures the number of objects in the collection where $c_i$ and $c_j$ co-occur, $n_i$ is the number of objects where $c_i$ occurs, and $n_j$ is the number of objects where $c_j$ occurs. Notice that $n_i$ and $n_j$ are calculated in a similar way to concept relevance. The contingency table for each pair $(c_i, c_j)$ is calculated as shown in Table 1.
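The concept relevance aggregation can be sketched with the number of hits as scoring function. The data layout (a dictionary from object identifiers to sets of annotated concepts) and all names are assumptions made for the example.

```python
def hits(annotations, concept):
    """Number of objects annotated with `concept`."""
    return sum(1 for concepts in annotations.values() if concept in concepts)


def concept_relevance(children, annotations, concept, aggregate=sum):
    """Aggregate the hit counts of `concept` and of all its sub-concepts."""
    stack, scores = [concept], []
    while stack:
        current = stack.pop()
        scores.append(hits(annotations, current))
        stack.extend(children.get(current, []))
    return aggregate(scores)


# Toy taxonomy and annotated collection, for illustration only.
toy_children = {"Epilepsy": ["Focal epilepsy", "Temporal lobe epilepsy"]}
toy_annotations = {
    "doc1": {"Focal epilepsy"},
    "doc2": {"Epilepsy", "Temporal lobe epilepsy"},
    "doc3": {"Fever"},
}
relevance_sum = concept_relevance(toy_children, toy_annotations, "Epilepsy")
relevance_max = concept_relevance(toy_children, toy_annotations, "Epilepsy", max)
```

Note that with sum as the aggregation function an object annotated with both a concept and one of its sub-concepts (doc2 here) contributes twice, so the choice of aggregation function changes how the ball sizes should be read.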

Table 1. Contingency table for scoring bridges

The measures $n_{i,j}$, $n_i$ and $n_j$ are calculated as follows:
A semantic bridge is a strong association between concepts that has good evidence in the target collection. Bridges are calculated from contingency tables by defining a scoring function $\phi(c_i, c_j)$. In this way, bridges are those concept associations whose score is greater than a specified threshold $\delta$:
$Bridges^{\phi}_{Col}(L_i, L_j) = \{(c_i, c_j, \phi(c_i, c_j)) \mid \phi(c_i, c_j) > \delta\}$
As an example, we can use the interest factor as score, that is:
$\phi(c_i, c_j) = \frac{|Col| \cdot n_{i,j}}{n_i \cdot n_j}$
In our current setting, we use a series of well-known interestingness measures such as log likelihood ratio, mutual information, interest factor and F1-measure.
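Bridge detection with the interest factor can be sketched as follows. The counts are taken here from direct annotations only, without aggregating over sub-concepts, and the collection layout, the threshold and the function names are hypothetical.

```python
def cooccurrence_counts(annotations, c_i, c_j):
    """Return (n_ij, n_i, n_j, n_total) for a pair of concepts."""
    n_i = sum(1 for cs in annotations.values() if c_i in cs)
    n_j = sum(1 for cs in annotations.values() if c_j in cs)
    n_ij = sum(1 for cs in annotations.values() if c_i in cs and c_j in cs)
    return n_ij, n_i, n_j, len(annotations)


def interest_factor(n_ij, n_i, n_j, n_total):
    """Observed co-occurrences over those expected under independence."""
    if n_i == 0 or n_j == 0:
        return 0.0
    return (n_total * n_ij) / (n_i * n_j)


def bridges(annotations, level_i, level_j, delta):
    """Concept pairs whose interest factor exceeds the threshold delta."""
    result = []
    for c_i in level_i:
        for c_j in level_j:
            score = interest_factor(*cooccurrence_counts(annotations, c_i, c_j))
            if score > delta:
                result.append((c_i, c_j, score))
    return result


# Toy collection of four annotated objects, for illustration only.
toy_annotations = {
    "d1": {"Arthritis", "Limb joints"},
    "d2": {"Arthritis", "Limb joints"},
    "d3": {"Arthritis"},
    "d4": {"Fever"},
}
found = bridges(toy_annotations, ["Arthritis", "Fever"], ["Limb joints"], delta=1.0)
```

An interest factor above 1 means that the two concepts co-occur more often than independence would predict, which is why a threshold of 1.0 is a natural starting point in this sketch.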

Browsing conceptual maps
Two main browsing operations can be performed in a conceptual map: (1) expanding a concept into its sub-concepts, and (2) going to a ranked list of objects associated with the clicked map elements (concepts and bridges). The semantics of these operations correspond to the well-known drill-down and drill-through OLAP operations.
- Drill-down: If we expand a concept $c$ in the 3D map, the map must be updated accordingly. Thus, the concept $c$ is replaced by its children in the taxonomy of $O$, the bridges involving $c$ are removed from the map, and new bridges are calculated for the sub-concepts of $c$ and drawn in the map.
- Drill-through: If a concept (or bridge) is selected for drill-through, the system must retrieve the objects of the target collection relevant to it. The ranked list of objects is shown in a separate list (e.g., a tab) ordered by relevance. Notice that we can simply use the score calculated to construct facts (i.e., $R_d$) for ranking documents w.r.t. concepts; formally, $relevance(d, c)$ is the component of $R_d$ corresponding to $c$. For ranking documents w.r.t. bridges, we just combine the scores of the concepts involved in the selected bridge: $relevance(d, (c_i, c_j, \phi)) = relevance(d, c_i) \cdot relevance(d, c_j)$
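The drill-through ranking for a bridge can be sketched as below. The per-document rank vectors are assumed to be stored as dictionaries from concepts to scores, with a missing concept counting as zero; the identifiers and score values are invented for the example.

```python
def relevance(ranks, doc, concept):
    """Score of `concept` in the rank vector of `doc` (0.0 if absent)."""
    return ranks.get(doc, {}).get(concept, 0.0)


def drill_through_bridge(ranks, c_i, c_j):
    """Documents relevant to the bridge (c_i, c_j), best first.

    The bridge score of a document is the product of the scores of the two
    concepts, so documents lacking either concept are dropped.
    """
    scored = [
        (doc, relevance(ranks, doc, c_i) * relevance(ranks, doc, c_j))
        for doc in ranks
    ]
    scored = [(doc, score) for doc, score in scored if score > 0]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy rank vectors R_d for three documents, for illustration only.
toy_ranks = {
    "doc_a": {"Repair Fallot Tetralogy": 0.8, "Death": 0.5},
    "doc_b": {"Repair Fallot Tetralogy": 0.9},
    "doc_c": {"Repair Fallot Tetralogy": 0.4, "Death": 0.9},
}
ranked = drill_through_bridge(toy_ranks, "Repair Fallot Tetralogy", "Death")
```

Because the combination is a product, a document must be relevant to both ends of the bridge to appear in the list, which matches the intent of retrieving objects that hold in the context selected by the bridge.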

Conclusions
In this paper we have presented a novel semantics-aware integration and visualization paradigm that allows users to easily explore and navigate discovered relations between data and web resources.The contribution is two-fold.
On the one hand, we provide the infrastructure for integrating different information resources through semantic annotation with domain ontologies. On the other hand, users can interactively build conceptual maps according to their requirements and explore them with classical OLAP-style operations such as roll-up and drill-down. Future work includes refining the created dimension hierarchies in order to obtain more meaningful aggregations, and devising a more efficient calculation of new bridges. Finally, we plan to develop an on-line service to provide conceptual maps on demand.

Fig. 3. Interesting relationships between procedures and findings in the literature

Fig. 4. Proteins retrieved through the @Swissprot tab for concept attack epileptic