Cytokines are signaling molecules secreted and sensed by immune and other cell types, enabling dynamic intercellular communication. Although a vast amount of data on these interactions exists, this information is not compiled, integrated or easily searchable. Here we report immuneXpresso, a text-mining engine that structures and standardizes knowledge of immune intercellular communication. We applied immuneXpresso to PubMed to identify relationships between 340 cell types and 140 cytokines across thousands of diseases. The method is able to distinguish between incoming and outgoing interactions, and it includes the effect of the interaction and the cellular function involved. These factors are assigned a confidence score and linked to the disease. By leveraging the breadth of this network, we predicted and experimentally verified previously unappreciated cell–cytokine interactions. We also built a global immune-centric view of diseases and used it to predict cytokine–disease associations. This standardized knowledgebase (http://www.immunexpresso.org) opens up new directions for interpretation of immune data and model-driven systems immunology.
Protective immunity is mediated through a complex system of interacting cells whose communication network is primarily governed by secreted molecules, chiefly members of the cytokine and chemokine family of proteins. Until recently, researchers tackled the high complexity of the immune system with reductionist approaches, but technological advances now enable acquisition of large data sets, with broad enumeration of cell subset types and functions, proteins, gene expression and more1. In addition, papers in immunology alone are being published at the rate of approximately one every 30 min. To maximize discovery, research results must transition to organized, standardized models of knowledge, on which automated computational processing can be deployed.
Biomedical text-mining efforts have been an important means of grasping at the breadth and complexity of biological systems. With efforts invested into recognizing biologically relevant entities—such as genes, diseases, chemicals and genomic variants2,3,4,5,6,7,8—driven by gold standards9,10 and community-wide efforts11,12,13,14, text-mining is enabling automatic identification of complex biological relations15,16 and full-scale networks. Recent research has expanded to additional types of molecular events17,18,19, with relation-extraction methods ranging from co-occurrence15,19, pattern matching and rule-based methods to dependency parse graph analysis20,21 and machine-learning21. To date, however, text-mining approaches have not addressed large-scale intercellular communication networks and, in particular, those describing directional cell–cytokine interactions.
Biological literature mining has shown utility for hypothesis generation, particularly in disease contexts22,23,24. Similarly, data-driven disease classifications have shown benefit in understanding shared mechanisms, empowering target identification and drug repositioning choices25,26,27,28. Yet to date, such classifications have not addressed cellular cross-talk and how the immune system may impact disease.
To establish a foundation for systematic reasoning over the intercellular network, we built immuneXpresso (iX), a comprehensive high-resolution knowledgebase of directional intercellular interactions that was text-mined from all available PubMed abstracts across a broad range of disease conditions. Interactions captured by iX include both direct cytokine binding or secretion events and more distant, indirect influencing relations, scored and filtered to emphasize precision. We used the resulting knowledge standardization to characterize the immune intercellular network and to predict and experimentally validate cell–cytokine interactions. By leveraging the breadth and context-awareness of the knowledgebase, we built an immune-centric view of diseases and explored its modularity to predict cytokine–disease associations.
A text-mining pipeline to extract intercellular interactions
We designed a computational pipeline that was focused on mining the primary literature for identification of cells, intercellular signaling molecules (i.e., cytokines) and the directional relations between them (Fig. 1a and Online Methods), and we applied it across the entire PubMed (approximately 16 million articles published electronically by July 2017). Briefly, for each individual sentence, the analysis pipeline tagged cells, cytokines and diseases, as well as standardized terminology through official ontologies to allow for hierarchical data analysis at multiple resolutions (Supplementary Tables 1,2,3,4). We examined sentence structure to identify syntactically related cell, cytokine and verb mentions. For each such 'evidence record', we detected the relation's directionality, polarity (representing its positive, negative or neutral effect) and when possible, the resulting cellular biological function (Supplementary Table 5). We distinguished between 'outgoing' relations (describing cytokine secretion by a given cell type) and 'incoming' relations (describing events in which a cytokine affected a cell type, either directly via binding or indirectly). Finally, for each unique triple of cell, cytokine and directionality summarized across all of the evidence records, we used a trained machine-learning classifier to make a call on whether the collected evidence indeed described an interaction (Online Methods). We assigned confidence scores to these and linked them to the conditions (e.g., diseases) co-mentioned in the same abstracts. In addition, we annotated independent entity mentions, without interaction, of cells and cytokines to allow for entity co-occurrence and enrichment statistics.
To assess the precision of entity recognition, expert human curators evaluated 100 randomly chosen annotations each for cells, cytokines and diseases and found the automatic annotation to be 91%, 96% and 93% correct, respectively (Supplementary Tables 6,7,8 and Online Methods). Similarly, for precision of relation extraction, we randomly chose 590 interaction evidence records (i.e., particular sentence instances) and manually evaluated entity recognition, ontology mapping, verb and relation detection, directionality, polarity and cellular function identification (Supplementary Tables 9 and 10 and Online Methods). We observed a conservative true-positive rate of 75% when all of the metrics were considered, 82% when assessing triple precision (cell–cytokine–directionality) and 93% when checking cell–cytokine relation pair extraction only, ignoring directionality, polarity and other relation characteristics (Fig. 1b).
To evaluate performance in interaction recall (i.e., identifying known interactions), we created a gold-standard set by manually curating directional interactions from a reference book spanning up-to-date knowledge of cytokines29 (Online Methods). Our machine learning–derived knowledgebase covered 79% of the interactions described in the reference book, yet it was approximately five times larger, containing an additional 3,055 directional interactions. Manual assessment of 200 of these yielded an 11.5% false-positive rate, suggesting that a large number of biologically meaningful interactions appeared only in the primary literature.
Finally, we unified all identified interactions (manual and machine derived) into a single knowledgebase, which we named immuneXpresso (iX) (Fig. 1c). To quantify the advantages of this semantic-based approach, we compared its precision and recall with an alternative of assuming cells and cytokines co-occurring in the same sentence interact, without sentence structure analysis (Online Methods). Though the full set of co-occurrence–based relations showed 98% recall of both the reference book–curated and iX cell–cytokine pairs, 75.6% of these relations appeared in neither resource, suggesting a very high false-positive rate for a co-occurrence approach, even when we set a threshold by a minimal number of repeat co-occurrences (Fig. 1d).
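The co-occurrence comparison can be illustrated with a minimal sketch. It assumes sentences have already been tagged with cell and cytokine names; the function names, the toy sentences and the toy reference set are all hypothetical and are not part of the iX pipeline. The sketch only shows how a repeat-count threshold trades pairs for precision while false positives can remain.

```python
from collections import Counter
from itertools import product


def cooccurrence_pairs(sentences, min_count=1):
    """Return (cell, cytokine) pairs co-mentioned in at least min_count sentences.

    Each sentence is a (cells, cytokines) tuple of already-tagged entity names.
    """
    counts = Counter()
    for cells, cytokines in sentences:
        # Every cell/cytokine combination in a sentence counts once per sentence.
        for pair in product(set(cells), set(cytokines)):
            counts[pair] += 1
    return {pair for pair, n in counts.items() if n >= min_count}


def precision_recall(predicted, reference):
    """Precision and recall of predicted pairs against a reference set."""
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Toy data (hypothetical): tagged sentences and a curated reference set.
sentences = [
    (["T cell"], ["IL-2"]),
    (["T cell", "monocyte"], ["IL-2", "IL-6"]),
    (["monocyte"], ["IL-6"]),
    (["T cell"], ["IL-6"]),
    (["B cell"], ["TNF"]),
]
reference = {("T cell", "IL-2"), ("monocyte", "IL-6")}

for threshold in (1, 2):
    pairs = cooccurrence_pairs(sentences, min_count=threshold)
    precision, recall = precision_recall(pairs, reference)
    print(threshold, len(pairs), round(precision, 2), round(recall, 2))
```

In this toy case recall stays complete at both thresholds, while a repeatedly co-mentioned but unrelated pair survives the threshold, which is the failure mode the comparison is meant to expose.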
At present, iX contains a total of 4,118 directional cell–cytokine interactions (Supplementary Table 11), with three times as many incoming interactions as outgoing ones, an enrichment qualitatively echoed in reference book–annotated interactions. These interactions stemmed from >31,000 articles (Supplementary Table 12). In addition, using the iX pipeline we collected annotations of thousands of diseases (11,260 distinct disease terms; 2,179 of them appearing in at least 100 papers) and identified mentions of 1,300 cell types (360 of which were hematopoietic) and 170 cytokines in these disease contexts (Supplementary Tables 11 and 13). iX is freely accessible for querying through http://www.immunexpresso.org, as well as via the ImmPort web site30.
System-level characterization of intercellular interactions
iX offers an opportunity for a system-level view of intercellular information flow. Given the large number of cells and cytokines, many of which are poorly understood, we first grouped the cells into 16 major categories on the basis of cell ontology hierarchy and the cytokines into families on the basis of structure and function (Online Methods). This yielded a bipartite intercellular interaction network that showed information flow between cell types and cytokine families (Fig. 2a). We noted that cell types, irrespective of the number of identified cell subsets in their lineage, interacted with a large number of cytokine families. By replotting the network using the highest cell and cytokine resolution in iX, we observed an increase in distinct cell subset–cytokine profiles per cell type, particularly for outgoing interactions (Supplementary Fig. 1a, shown for CD4+ T cells). Yet, the bulk of interactions was still described solely at a low cellular resolution, which indicated that the unique cytokine milieu profile of distinct cellular subsets was, for the most part, still lacking.
Signaling of some cytokines may be highly specific or may broadly affect multiple cell subsets, constituting hubs of the intercellular interaction network. iX enables the study of global properties of the intercellular interaction network. For each interaction, we identified the highest cellular resolution it was reported for and, for each cytokine, calculated the number of cellular interactions it was associated with (i.e., its degree), covering both hematopoietic cell types (HPCs) and nonhematopoietic cell types. This demonstrated the existence of only a few hubs in the network, followed by a long tail of modestly and low-interacting cytokines (see Fig. 2b for incoming interactions and Supplementary Fig. 1b for outgoing interactions). We noted that 50% of the incoming interactions in the network were formed by 23 (16% of total) cytokines. These included the top hubs tumor necrosis factor (TNF), transforming growth factor (TGF)-β, interleukin (IL)-6 and interferon (IFN)-γ. Similarly, in the reverse outgoing direction, we attributed 50% of edges to 17 (15% of total) cytokines. Cytokine degrees in incoming and outgoing directions showed high correlation (Fig. 2c; Pearson's r = 0.86), as in the gold-standard reference book (Supplementary Fig. 1c; Pearson's r = 0.69). This correlation was lower, yet still observed, after removal of autocrine interactions that may have inflated similarity (Pearson's r = 0.73 and r = 0.36 for iX and the reference book, respectively), as well as following removal of low-degree cytokines (Pearson's r = 0.5), suggesting that hubs in the literature-derived intercellular interaction network appeared to be bidirectional, both targeting and secreted by a large number of cell types.
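As a rough illustration of the degree analysis, the sketch below counts distinct cell partners per cytokine for one direction and then finds the smallest set of top-degree cytokines that accounts for a given fraction of edges. The triple representation, the function names and the toy interactions are assumptions made for illustration, not the iX data model.

```python
def cytokine_degrees(interactions, direction):
    """Number of distinct cell partners per cytokine for one direction.

    interactions: iterable of (cell, cytokine, direction) triples.
    """
    partners = {}
    for cell, cytokine, d in interactions:
        if d == direction:
            partners.setdefault(cytokine, set()).add(cell)
    return {cytokine: len(cells) for cytokine, cells in partners.items()}


def hubs_covering(degrees, fraction=0.5):
    """Top-degree cytokines that together account for >= fraction of all edges."""
    total = sum(degrees.values())
    covered, hubs = 0, []
    # Highest degree first; names break ties so the result is deterministic.
    for cytokine, degree in sorted(degrees.items(), key=lambda kv: (-kv[1], kv[0])):
        if covered >= fraction * total:
            break
        hubs.append(cytokine)
        covered += degree
    return hubs


# Toy network (hypothetical triples).
interactions = [
    ("T cell", "TNF", "incoming"),
    ("B cell", "TNF", "incoming"),
    ("monocyte", "TNF", "incoming"),
    ("T cell", "IL-6", "incoming"),
    ("B cell", "IL-6", "incoming"),
    ("T cell", "IL-7", "incoming"),
    ("monocyte", "IL-34", "incoming"),
    ("T cell", "IL-2", "outgoing"),
]

incoming = cytokine_degrees(interactions, "incoming")
print(incoming)
print(hubs_covering(incoming, fraction=0.5))
```

A heavy-tailed network shows up here as a short hub list relative to the total number of cytokines.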
Immune intercellular network knowledge is biased
Analyzing the cytokine degree distribution, we could not reject a power-law distribution for either incoming or outgoing interactions (Supplementary Fig. 2, P = 0.73 and P = 0.47, respectively, for incoming and outgoing interactions; Online Methods). Heavy-tailed network distributions may arise due to a research bias, yielding a 'rich-get-richer' phenomenon31,32. Conversely, such degree distributions may arise naturally due to biological network structure33. Analysis of cytokine interaction knowledge accumulation showed the existence of one or two connection-rich leaders per cytokine family, with the other family members trailing well behind (Fig. 3a and Supplementary Fig. 3). These were predominantly founding cytokine family members, such as IL-6, IFN-γ, IL-10 and TNF, and they maintained their overwhelming dominance even when we discarded all explicit references to global family mentions in the text (e.g., TNF family). We detected a low global correlation between a cytokine's date of discovery and its degree, suggesting that intercellular communication knowledge had not reached saturation, either for hubs or for less-connected cytokines (Supplementary Fig. 4; Pearson's r = −0.27 and r = −0.26 for incoming and outgoing interactions, respectively, which were driven by a few highly dominant hubs). Analysis of recent 5-year change in connectivity degree suggested that for some hubs, such as FGF2, their iX degree likely reflected cellular interactivity potential, whereas others, such as IL-6 and TGF-β, were still accumulating new connections at a high rate (Fig. 3b).
To assess how well the literature-derived knowledge represented experimental data, we compared iX cytokine degrees to those obtained from ImmProt, a high-resolution proteomics compendium quantifying protein expression in resting and activated states in more than 20 sorted immune cell types34. We approximated outgoing and incoming interactions based on the expression of cytokine and cytokine receptor proteins in ImmProt cellular profiles. Comparison of incoming cytokine degrees in the resulting experimental network with the literature-derived ones (Fig. 3c) demonstrated significant correlation (Spearman ρ = 0.38; P < 0.001; based on 82 cytokines present in both data sets), with lower correlation for the opposite direction (Spearman ρ = 0.26; P = 0.1; based on 41 shared cytokines), likely due to the fact that the ImmProt secretion profiles were measured in a single condition.
Prediction of cell–cytokine interactions
Using the subset of the iX knowledgebase comprising articles published up to 2014, we systematically predicted cell–cytokine interactions using three orthogonal approaches (Fig. 4a and Online Methods): (i) an unsupervised analysis of iX interaction profiles, (ii) a supervised analysis of iX cytokine–cell interactions grouped by cytokine families and (iii) a contrast of iX information with receptor or cytokine expression data, as reflected in the ImmProt34 and ImmGen databases35 (e.g., LTA in Fig. 3c). This systematic prediction process yielded 472 incoming and 367 outgoing ranked interaction candidates (Supplementary Tables 14 and 15). Of these, we manually evaluated 78 predictions by extensive literature searches (Fig. 4b). This process identified 55% of the candidates as already observed, true-positive interactions, 3% with evidence published later than the 2014 training set, and 3% with evidence of cytokine receptor expression only. This high rate of recovery of known interactions was reassuring and suggested that the remaining 40% of candidate interactions were enriched for previously unrecognized interactions.
We tested the validity of two top-rated candidate interaction predictions: for IL-7, our supervised cytokine family–prediction approach suggested an outgoing interaction (i.e., secretion) in monocytes (the literature currently describes the opposite interaction only, i.e., that IL-7 affects monocytes36). In agreement with the prediction, we observed IL-7 production by monocytes in the activated human peripheral blood mononuclear cell (PBMC) population. In addition, as expected, we detected dendritic cell production of IL-7, and no IL-7 production in CD8+ T cells, as reported in the literature (Fig. 4c). Similarly, our unsupervised prediction approach suggested that IL-34 activates signaling in T cells. We also detected expression of a corresponding receptor, CSF1R, on resting CD8+ T cell subsets in ImmProt34 (Supplementary Fig. 5). We stimulated human PBMCs with IL-34 and observed robust phosphorylation of the kinase ERK in monocytes (Fig. 4d), as has previously been reported37. We also observed activation of the transcription factors NF-κB and STAT5 by IL-34 in CD8+ effector memory T cells (Fig. 4d). Moreover, our prediction also supported our recent validation of CD4+ memory cell induction by IL-34 signaling following upregulation of CSF1R during activation34.
Immune-centric classification of diseases
We reasoned that the structured format and breadth of iX could be leveraged to obtain an immune-centered perspective on diseases and their relations. To do so, we picked a set of 188 broadly studied diseases whose associated abstracts we sampled to obtain a characteristic intercellular immune profile (Fig. 5a, Supplementary Fig. 6 and Supplementary Table 16). We clustered these in an unsupervised manner to assemble an immune-centered map of disease similarities and differences (Online Methods).
Analysis of this clustering outcome divided diseases into 18 modules on the basis of associated cells, cytokines and interactions (Fig. 5b). We observed mixed agreement of this classification with a clinically based one (SNOMED)—some modules clustered clinically similar phenotypes; for example, module 2 showed strong clustering of cardiovascular diseases, whereas module 9 captured inflammatory bowel disorders and psoriasis. Similarly, cancers were grouped into four modules on the basis of tissue type. In contrast, in some cases the immune-centered clustering yielded modules that included diseases from presumably unrelated clinical conditions—modules 14 and 15 suggested a high degree of immune similarity between metabolic disorders and a subset of cardiovascular diseases, an observation that was also supported by high interconnectivity of modules 14 and 15 with module 2.
We used iX to automatically generate an intercellular immune-interaction map for type 2 diabetes on the basis of 5,484 abstracts (Fig. 5c and Online Methods). The network recapitulated the molecular basis of the disease and captured the key role of tissue-resident inflammatory macrophages, monocytes, fat cells and preadipocytes in secreting the pro-inflammatory cytokines TNF, IL-6 and IL-1β, which trigger the release of adipokines that contribute to development of insulin resistance38. This profile co-clustered with other diseases in module 15, which consists of metabolic and cardiovascular diseases, an appreciated link39. Enrichment analysis of module 15 and its closely associated module 2, as compared to all others (Fig. 5d and Online Methods), showed a strong association with the pro-inflammatory adipokine RETN and its interaction with monocytes. RETN has been proposed to link the heightened inflammatory state in aged and obese individuals to insulin resistance, vascular inflammation and low-density lipoprotein (LDL)-cholesterol levels, thus contributing to the risk of developing metabolic syndrome40. Moreover, elevated levels of RETN directly induce IL-6 secretion41, which contributes to the constant low-grade inflammatory state associated with age and cardiovascular conditions42,43. Thus, whereas modules 15 and 2 phenotypically represent different pathological conditions, their molecular commonalities suggest consideration for common therapeutic interventions.
Prediction of cytokine–disease associations
Analysis of cytokine profiles across the 188 diseases revealed three predominant cytokine classes with respect to disease: disease specific, module specific and backbone cytokines that were associated with the overwhelming majority of tested diseases (Fig. 6a, Supplementary Table 17 and Online Methods). We noted a high overlap between backbone cytokines and those previously identified as hubs (Fig. 2b), suggesting that their pan-disease universality stems from the dominant role in the overall intercellular interaction network.
We hypothesized that given the unequal within-module knowledge per disease we might be able to predict de novo cytokine–disease associations. For each module, we hierarchically clustered disease subsets across all cytokines and systematically predicted cytokine–disease associations (Online Methods). This yielded over 466 ranked candidates, which stemmed from all 18 modules (Supplementary Table 18). As we assembled the disease immune profiles by sampling, we checked the predicted cytokine–disease associations on the full, nonsampled iX knowledgebase, which validated 81% of predicted interactions as true positives and suggested a likely enrichment for unappreciated cytokine–disease associations in the remaining set (Supplementary Fig. 7).
To test this, we looked for experimental confirmation of two of the top-rated candidate associations, the chemokines CCL8 and CCL24 in psoriasis, which, to the best of our knowledge, have not been reported. We analyzed two publicly available gene expression data sets for psoriasis44,45 and found CCL8 to be significantly upregulated in psoriatic lesions versus healthy skin in both data sets (paired two-sided Wilcoxon signed-rank test, P = 0.0048, P = 2.47 × 10^−5; Fig. 6b,c). For CCL24 we observed significant differences in one study45 (P = 1.04 × 10^−7).
Knowledge of the immune intercellular network is crucial for understanding immune responses in health and disease. However, the high system complexity leaves even expert researchers struggling to maintain a mental picture of the immune milieu and often leads to knowledge biases. We leveraged a computational model of intercellular interaction network knowledge to accelerate discovery of interactions and the context in which they occurred, identify disease-associated immune profiles and build an immune-centric disease classification partially overlapping with existing clinical disease classification.
The knowledge we captured was unique in breadth and resolution, yet it could be expanded even further. Full-text article support, as well as identification of interactions described beyond sentence boundaries and using more complex sentence structures, can boost the volume of captured evidence and our ability to obtain a more reliable and richer view of interactions, with less bias, particularly of those interactions that are understudied. We foresee extension of this approach to also include additional events, such as direct cell–cell interactions and downstream intercellular signaling to capture complex cellular interaction cascades. The extensive metadata we extracted for each article, including MESH terms and bibliographic information, together with detailed characterization of the captured interactions, could be used for advanced filtering, which would allow focus on the most authoritative knowledge. Beyond this, we envision that the structured formatting of knowledge we have achieved can be leveraged by machine-learning applications, using statistical analyses of domain frequency and chronological pattern biases to identify potential discrepancies and erroneous claims in the published knowledge.
Technological advances now enable us to step beyond assaying a narrow set of measures to high-dimensional phenotyping across the breadth of the immune system at an unprecedented scale. These data are primarily analyzed by statistical methods that are geared to identify differences and correlations, yet these methods lack any back-end model of the immune system's structure and function and cannot leverage domain knowledge or interpret what the differences mean. This results in an interpretation process that is primarily manual, not systematized, and that relies strongly on investigator conjecture. Intelligent systematized interpretation requires having a machine-readable map of how immune components are connected and a formalized reasoning framework on which one could test hypotheses and refine knowledge. Here we built immuneXpresso, a framework that structures and standardizes our knowledge of immune intercellular interactions, under many conditions, and updates periodically. Its integration with high-dimensional immune data will enable paradigms of reasoning over heterogeneous cell populations, making first steps towards transforming immunology to systemized, model-based science, a true 'systems immunology'.
iX pipeline execution environment.
The computational pipeline to assemble the iX knowledgebase was executed on a high-performance computing cluster, by running a batch job scheduling system with up to 150 simultaneous jobs allowed. Details of the specific pipeline steps appear below. It typically took ∼2 weeks to generate the iX database from start to finish for the entire corpus.
Corpus selection, parsing and indexing.
Abstracts of all English-language articles published electronically between 1960 and July 2017 (∼16 million) were downloaded from PubMed using the EFetch utility. For each abstract, we extracted additional metadata, such as the article title, year of publication, whether it was a review or not, and its assigned MESH terms. We focused on the mammalian immune system because of the rapid evolution of the immune system46,47, and in particular restricted our analysis to abstracts that were assigned a mouse and/or human MESH term, as these were the prime organisms relevant for biomedical research for which information is available. We used the Stanford Parser engine48 to split the abstract text into sentences and words, perform part-of-speech tagging, and extract sentence syntactic structure and grammatical relations (i.e., 'typed dependencies'). This information was then indexed within the Elasticsearch engine (https://www.elastic.co/) to allow querying for sub-corpora that potentially contained biological entities of interest. Moreover, preserving text processing products in the index eliminated the need for future time-consuming abstract reparsing if entity recognition were expanded to support additional entity types (e.g., drugs, tissues or pathways).
Entity recognition and ontologies.
To identify mentions of biological entities of interest (diseases, cells and cytokines) within article abstracts, we applied a dictionary-based approach with dictionaries either adapted from standard public knowledgebases or assembled from scratch. For diseases, we first queried Elasticsearch for articles that contained synonym phrases listed in a manually curated compilation of the UMLS Metathesaurus49, and then post-processed the returned abstracts, sentence by sentence, to look for the matches and extract precise information for them. In particular, for each identified disease mention, we recorded its position in the sentence, the particular synonym used, as well as the public 'concept unique identifier' (CUI) and concept name, as defined by the SNOMED CT–controlled vocabulary of the Metathesaurus (Supplementary Tables 8 and 13). This post-processing stage dropped conditions contained within longer disease entities (e.g., 'deficiency' and 'vitamin deficiency' were dropped, and only the most specific 'vitamin A deficiency' term was retained within the sentence “The essential role of vitamin A in kidney development has been demonstrated in vitamin A deficiency and gene-targeting studies.”). To achieve this, we automatically examined all diseases with overlapping sentence positions and retained those with the longest position span and, among them, those containing the highest number of words.
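The overlap rule for disease mentions can be sketched as follows. This is a minimal illustration that assumes each mention is a (start, end, text) tuple with an exclusive end offset and that overlaps are resolved greedily, longest span first; the function name and the greedy strategy are assumptions, since the text specifies only the ranking criteria (span length, then word count).

```python
def resolve_overlaps(mentions):
    """Keep the longest mention among overlapping ones.

    mentions: list of (start, end, text) with end exclusive.
    Ranking: longer character span first, then more words, then earlier start.
    """
    ranked = sorted(
        mentions,
        key=lambda m: (-(m[1] - m[0]), -len(m[2].split()), m[0]),
    )
    kept = []
    for start, end, text in ranked:
        # A candidate is dropped if it overlaps any already-kept, higher-ranked mention.
        overlaps = any(start < k_end and k_start < end for k_start, k_end, _ in kept)
        if not overlaps:
            kept.append((start, end, text))
    return sorted(kept)


sentence = (
    "The essential role of vitamin A in kidney development has been "
    "demonstrated in vitamin A deficiency and gene-targeting studies."
)

# Two dictionary hits at overlapping positions in the sentence.
candidates = []
for term in ("deficiency", "vitamin A deficiency"):
    start = sentence.find(term)
    candidates.append((start, start + len(term), term))

resolved = resolve_overlaps(candidates)
print([text for _, _, text in resolved])
```

Mentions that do not overlap anything are passed through unchanged, so the rule affects only nested or partially overlapping dictionary hits.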
For cells, initial testing suggested that straightforward lookup of terms contained within the official Cell Ontology50 would miss a substantial fraction of cell occurrences in text due to the large number of possible forms of describing cell subsets. This pluralism in naming was hard to anticipate automatically, whether cells were described by name (e.g., 'human CD8+ terminally differentiated memory cell' does not appear in the Cell Ontology and would not be captured by straightforward dictionary lookup) or by cell surface marker combination (e.g., CD3+CD4+CD45RA+ cell), whose delineation in the Cell Ontology is limited. To resolve this, we expanded the Cell Ontology with a manually curated set of synonyms, and, more importantly, introduced a small lexicon of seed words that served as a starting point, an anchor, for cell recognition in sentences (e.g., cell, lymphocyte and macrophage; see Supplementary Table 1 for the full list).
The cytokine dictionary was assembled manually, due to lack of an established lexicon (Supplementary Note 1).
For cell and cytokine entity recognition, we queried Elasticsearch for articles containing either a mention of a cytokine synonym or a cell 'seed' word. The focus was on articles that mentioned a cell, a cytokine or both, to serve as the basis for further relation extraction. Akin to diseases, we post-processed candidate articles to better characterize the matches and, for cells, to expand from seeds to often multi-word, hard-to-anticipate cell name phrases, using typed dependencies (Supplementary Fig. 8 and Supplementary Note 2). In addition, the captured cell phrases were mapped to the Cell Ontology (Supplementary Note 3).
Lastly, following cell- and cytokine-mention candidate extraction, we analyzed their within-sentence positions to filter out erroneous identification of overlapping entities from different ontologies (e.g., erroneous 'granulocyte' or 'macrophage' cell matches within the cytokine entity 'granulocyte–macrophage colony-stimulating factor'). For all remaining cell and cytokine mentions, we recorded their start and end positions in a sentence, the particular phrase or synonym used, as well as the representative ID and concept name, assigned either by official Cell Ontology for cells or by our manually constructed dictionary for cytokines (Supplementary Tables 6 and 7; see Supplementary Tables 1 and 3 for frequencies of terms captured for each cell seed and cytokine lexicon concept, respectively).
Following extraction of articles containing either the cytokine synonym or cell seed word mentions, we applied sentence-by-sentence post-processing to detect verbs, and when possible, linked cells, cytokines and verbs into relations. We analyzed sentence typed dependencies48 to detect semantically related cells and cytokines within sentence boundaries and identify the directionality (i.e., a cytokine acting on a cell, like 'IL-6 promotes TH17 cell differentiation', or the opposite, a cell producing a cytokine, like 'T cells secrete IL-2'), polarity (i.e., positive, negative or neutral effect of the interaction, such as upregulation, inhibition or just alteration of a cell function, respectively), as well as the cellular biological function impacted by this relation (e.g., cell proliferation or apoptosis elicited by the acting cytokine).
The per-sentence relation extraction process included several steps. First, we found verbs in sentences by looking for words tagged as VB, VBD, VBG, VBN, VBP or VBZ by the part-of-speech tagger. For each verb, we examined all typed dependencies it was governing and attempted to resolve verb voice (active or passive). In particular, we marked verbs that were tagged as VBN and that governed 'passive nominal subject', 'passive auxiliary' or 'agent' dependencies as passive. Second, to allow further relation-directionality detection, we performed cell–cytokine semantic linking by examining all possible entity combinations with a verb located between a cell and a cytokine. We considered a candidate (verb, cell and cytokine) tuple as semantically related if the sentence contained a directional dependency tree path from one of the elements to the other two. For example, in the sentence “These results suggest that IL6 promotes IL22 secreting TH17 subset differentiation” in Figure 1a, there is a path from the verb “promotes” to the cytokine IL-6 and to the TH17 cell, allowing for (promote, IL-6, TH17) relation identification. Third, we used the verb voice and the cell and cytokine order in the sentence to resolve relation directionality. In the example above, because the verb is in the active voice and the cytokine precedes the cell entity, we identified the cytokine as acting on the cell. Finally, we applied a manually assembled verb classification lexicon (Supplementary Table 5) to assign relation polarity (i.e., positive in the example above). Verb classification lexicon lookup was performed by using stemmed forms for both lexicon terms and the relation verb.
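The passive-verb check and the directionality decision can be sketched as below. The dependency labels `nsubjpass`, `auxpass` and `agent` are the Stanford typed-dependency names for the three relations listed above. The function names and token positions are hypothetical, and the rule that a passive verb reverses the actor is an assumption: the text spells out only the active case with the cytokine first.

```python
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
PASSIVE_DEPS = {"nsubjpass", "auxpass", "agent"}


def is_passive(verb_tag, governed_deps):
    """A VBN verb governing a passive subject, passive auxiliary or agent is passive."""
    return verb_tag == "VBN" and bool(PASSIVE_DEPS & set(governed_deps))


def resolve_direction(verb_tag, governed_deps, cell_pos, cytokine_pos):
    """Return 'incoming' (cytokine acts on cell) or 'outgoing' (cell produces cytokine).

    Positions are token indices in the sentence; the verb lies between them.
    """
    if verb_tag not in VERB_TAGS:
        raise ValueError(f"not a verb tag: {verb_tag}")
    cytokine_first = cytokine_pos < cell_pos
    passive = is_passive(verb_tag, governed_deps)
    # Active voice: the entity preceding the verb is the actor.
    # Assumption: passive voice reverses that reading.
    cytokine_is_actor = cytokine_first != passive
    return "incoming" if cytokine_is_actor else "outgoing"


# "... IL6 promotes IL22 secreting TH17 subset differentiation"
print(resolve_direction("VBZ", ["nsubj", "dobj"], cell_pos=8, cytokine_pos=4))
# "T cells secrete IL-2"
print(resolve_direction("VBP", ["nsubj", "dobj"], cell_pos=0, cytokine_pos=3))
# "IL-2 is secreted by T cells"
print(resolve_direction("VBN", ["nsubjpass", "auxpass", "agent"], cell_pos=4, cytokine_pos=0))
```

A lexicon-driven override for verbs with an inherent direction (such as express or stimulate) would sit on top of this rule and is not shown here.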
For the special case of cell–cytokine relations included within cell descriptions, 'noun phrase–internal relations' (e.g., 'IFN-γ-producing CD4+ T cells', 'IL-2–activated NK cells' or 'NK cell IFN-γ'), we applied a tailored rule-based approach (Supplementary Note 4).
If relation directionality could be deduced directly from the verb (e.g., express or stimulate), as marked by human curators of the verb lexicon, then we preferred the directionality denoted by the lexicon to directionality-related decisions made by the algorithm described above.
Finally, we applied cell entity–containing noun phrase analysis to detect cellular biological functions elicited by the interacting cytokine, such as 'migration' in the sentence “GM-CSF enhanced reactive oxygen species release and neutrophil migration in vitro”. We examined noun phrase words on the right of the recognized cell entity to look for one or more matches from a manually curated cellular function list.
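As a minimal sketch of this lookup, assuming the noun phrase is already tokenized and the position of the cell entity is known, the search to the right of the cell entity could look as follows. The function list shown is a hypothetical fragment, not the curated list used by iX.

```python
# Hypothetical fragment of a curated cellular-function list.
CELL_FUNCTIONS = {"migration", "proliferation", "apoptosis", "differentiation"}


def functions_right_of_cell(noun_phrase, cell_end):
    """Return function terms found among the noun-phrase tokens that follow
    the cell entity; `cell_end` is the index of the entity's last token."""
    return [tok.lower() for tok in noun_phrase[cell_end + 1:]
            if tok.lower() in CELL_FUNCTIONS]


# Noun phrase from the example sentence; the cell entity is 'neutrophil'.
print(functions_right_of_cell(["neutrophil", "migration"], cell_end=0))
# ['migration']
```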
In terms of terminology, we referred to each detected (cell entity occurrence, cytokine entity occurrence, directionality, polarity and cellular function) relation in a sentence as a 'relation evidence record', whereas summarization of evidence records through representative labels and IDs of the entities produced unique candidate (cell, cytokine and directionality) interactions. Polarity and cellular function were ignored during summarization. We defined interaction 'context' by the disease entities that co-occurred in the same abstracts with the identified relation evidence records.
Filtering and scoring.
We chose to put strong emphasis on precision over recall to ensure benefit and increase adoption by the community. To do so, we developed a confidence-scoring methodology for both individual relation evidence records and summarized interactions, and we performed four filtering stages as follows:
(i) Baseline filtering. For each putative relation evidence record (i.e., a directional cell–cytokine relation extracted from an individual sentence), we compiled a set of sentence-level features to capture the complexity of the sentence from which this evidence emerged. These included sentence length, number of typed dependencies, entities and relations detected, negation presence, cell ontology mapping score, as well as the distance between the cell and the cytokine occurrences. Next, we filtered evidence with potentially lower confidence, such as negated sentences or sentences with more than one non-noun phrase–internal relation detected or evidence with low cell ontology mapping score (Supplementary Note 3). This yielded a 'baseline' subset of putative interaction evidence records that was used in the subsequent scoring and filtering stages.
(ii) Evidence record confidence scoring. All individual records were assigned confidence scores, based on sentence-level features (Supplementary Note 5) to allow focus on the highest-confidence evidence, both when queried by iX web-based interface users and during analyses, as well as to serve as a feature for further filtering, as detailed below.
(iii) Summary scoring. All summarized interactions (i.e., unique cell–cytokine–directionality triples summarized across the corpus) were assigned 'summary' and 'enrichment' scores (Supplementary Note 6) to allow focusing on highly cited interactions.
(iv) Lasso-based filtering of summarized interactions. Lastly, to choose optimal parameters for classification and further filtering of summarized interactions (i.e., unique cell–cytokine–directionality triples), we built a lasso logistic regression model51. Here we summarized evidence-level features into the interaction level across all evidence records, such as minimal or maximal sentence length, minimal or maximal cell–cytokine entity distance in sentence, the presence of mouse or human MESH term annotation in any of the evidence papers, and minimal or maximal evidence confidence scores described in (ii) above. Additionally, we defined interaction-level features, such as the overall number of evidence records, and the summary and enrichment scores defined in (iii) above. For all of the verbs in the training set, we added an interaction-level feature that reflected verb presence in any of the evidence records, which produced 178 features in total. We trained the lasso regression model on a set of 203 randomly selected interactions, summarizing 788 baseline evidence records identified by iX. We labeled the summarized interaction as positive if and only if at least one of its evidence records was manually classified as having cell, cytokine and directionality identified correctly. The resulting set of manually labeled (cell–cytokine–directionality) triples was used for training the lasso model, separately for incoming and outgoing interactions, with leave-one-out cross-validation. This procedure identified a linear combination of sparse feature weightings, which we then applied to all putative summarized interactions to classify them as either 'true' or 'false'. The most prominent features included maximal evidence confidence score, having a 'mouse' MESH assigned for evidence record articles, as well as the presence of several verbs, such as 'produce', 'synthesize' and 'affect'.
We enforced classification of 'true' for interactions that had at least one noun phrase–internal evidence record, due to their very high identification precision. Filtering out summarized interactions that were classified as 'false', together with all of their evidence records, yielded the resulting data set, which we used for further performance evaluation and system-wide analyses (see Supplementary Table 11 for a breakdown of the articles, records, entities and interactions remaining at various stages and Supplementary Table 12 for PubMed articles for the resulting cell–cytokine relation evidence).
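The data flow from baseline filtering to interaction-level features can be sketched in Python as below. This is an illustration under stated assumptions: the `Evidence` fields, the 0.5 ontology-score cutoff and the feature names are hypothetical, the filter on sentences with multiple relations is omitted, and fitting the lasso model itself is not shown.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    triple: tuple              # (cell, cytokine, direction)
    sentence_length: int
    distance: int              # cell-cytokine distance within the sentence
    confidence: float          # evidence score from stage (ii)
    negated: bool = False
    np_internal: bool = False  # noun phrase-internal relation
    ontology_score: float = 1.0


def baseline_filter(records, min_ontology_score=0.5):
    """Stage (i): drop negated or poorly mapped evidence.
    The 0.5 cutoff is a placeholder, not the published value."""
    return [r for r in records
            if not r.negated and r.ontology_score >= min_ontology_score]


def interaction_features(records):
    """Stage (iv) input: min/max summaries of evidence-level features,
    grouped by (cell, cytokine, direction) triple."""
    grouped = {}
    for r in records:
        grouped.setdefault(r.triple, []).append(r)
    features = {}
    for triple, recs in grouped.items():
        features[triple] = {
            "n_evidence": len(recs),
            "min_sentence_length": min(r.sentence_length for r in recs),
            "max_sentence_length": max(r.sentence_length for r in recs),
            "min_distance": min(r.distance for r in recs),
            "max_distance": max(r.distance for r in recs),
            "min_confidence": min(r.confidence for r in recs),
            "max_confidence": max(r.confidence for r in recs),
            # Noun phrase-internal evidence forces a 'true' classification.
            "force_true": any(r.np_internal for r in recs),
        }
    return features


key = ("T cell", "IL-2", "outgoing")
records = [
    Evidence(key, sentence_length=18, distance=2, confidence=0.9),
    Evidence(key, sentence_length=40, distance=9, confidence=0.4, np_internal=True),
    Evidence(key, sentence_length=25, distance=3, confidence=0.7, negated=True),
]
features = interaction_features(baseline_filter(records))
print(features[key]["n_evidence"], features[key]["max_confidence"])  # 2 0.9
```

Each feature dictionary would form one row of the design matrix passed to the classifier, with verb-presence indicators appended as further columns.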
Named entity recognition (NER) performance evaluation.
NER precision for all entity types was assessed by human curators. In addition, for cells, both precision and recall were examined by comparing to the gold-standard Colorado richly annotated full text (CRAFT) corpus9 (Supplementary Note 7).
Relation precision evaluation.
Relation extraction precision assessment was based on manual evaluation of randomly chosen relation evidence records (Supplementary Note 8).
Reference book network assembly and relation recall evaluation.
To the best of our knowledge, no gold standard for directional cell–cytokine relations exists as of now. Thus, to assess iX text-mining process performance in capturing existing knowledge in general, and to evaluate relation recall in particular, 11 human curators manually annotated interaction mentions, specifying the cell and cytokine terms and interaction direction from a reference book with encyclopedic display of up-to-date knowledge about cytokines and their interactions29. We mapped these to the Cell Ontology and cytokine lexicon, respectively, to summarize and acquire reference IDs consistent with those used in iX. This process yielded 725 unique (cell, cytokine and directionality) interaction triples, which we compared to those captured by iX, while allowing non-exact cell type match along the Cell Ontology hierarchy to account for the varying level of granularity of cell type reporting in the literature. This comparison assessed the proportion of reference book triples captured by iX, as well as the number of triples unique to either the reference book or iX. For the latter, to estimate false-positive proportion, we manually evaluated 200 (cell, cytokine, directionality) triples that were randomly selected from the iX knowledgebase and not covered by the reference book. An interaction was classified as false positive by the human curators, if none of the evidence records collected by iX for that triple reported the directional interaction in question. Following the assessment, we unified the list of interactions identified via literature text-mining and reference book annotation into a single knowledgebase.
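The non-exact matching used for recall can be illustrated with a short Python sketch. The toy child-to-parent map stands in for the Cell Ontology hierarchy, and the rule that two cell types match when one is an ancestor of the other is our assumption about how matching "along the hierarchy" was applied.

```python
# Toy child -> parent map standing in for the Cell Ontology hierarchy.
PARENT = {
    "CD4+ T cell": "T cell",
    "TH17 cell": "CD4+ T cell",
    "T cell": "lymphocyte",
    "B cell": "lymphocyte",
}


def ancestors(cell):
    """Return the cell type together with all of its ancestors."""
    lineage = {cell}
    while cell in PARENT:
        cell = PARENT[cell]
        lineage.add(cell)
    return lineage


def cells_match(a, b):
    """Non-exact match: identical, or one is an ancestor of the other."""
    return a in ancestors(b) or b in ancestors(a)


def recall(reference, mined):
    """Proportion of reference triples matched by at least one mined triple."""
    hits = sum(
        any(cyt == m_cyt and direction == m_dir and cells_match(cell, m_cell)
            for m_cell, m_cyt, m_dir in mined)
        for cell, cyt, direction in reference
    )
    return hits / len(reference)


reference = [("T cell", "IL-2", "outgoing"), ("B cell", "IL-6", "outgoing")]
mined = [("CD4+ T cell", "IL-2", "outgoing"), ("B cell", "IL-6", "incoming")]
print(recall(reference, mined))  # 0.5
```

The first reference triple is recovered through a more specific cell type, whereas the second is missed because the direction differs.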
Evaluation of co-occurrence-based relation extraction performance.
A co-occurrence–based approach would not be able to capture interaction directionality, an essential characteristic of intercellular communication, as our typed dependency-based method inherently does. Still, we aimed to assess the quality of those easier-to-extract relations. To the best of our knowledge, no gold standard for cell–cytokine relations exists as of now. Thus, to evaluate co-occurrence–based relation extraction performance and contrast it with the typed dependency-based approach, we used the interaction network we manually assembled from the reference book (see the subsection “Reference book network assembly and relation recall evaluation”) and discarded interaction directionality for both reference-derived and typed dependency-based interactions (post-lasso filtering, hereafter referred to as 'iX interactions') to produce two sets of cell–cytokine pairs to compare with. We linked cell and cytokine entities, which we previously recognized as co-appearing within sentence boundaries, into relation evidence records and summarized them into interactions, with the number of distinct articles mentioning the relation defining the strength of evidence.
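A minimal Python sketch of the co-occurrence baseline is given below, assuming that entity recognition has already produced, for each sentence, the article identifier and the sets of cell and cytokine IDs. The input layout and the function name are illustrative.

```python
from collections import defaultdict


def cooccurrence_strength(sentences):
    """`sentences` yields (pmid, cells, cytokines) for each sentence, where
    cells and cytokines are the entity IDs recognized in that sentence.
    Returns {(cell, cytokine): number of distinct articles}."""
    articles = defaultdict(set)
    for pmid, cells, cytokines in sentences:
        for cell in cells:
            for cytokine in cytokines:
                articles[(cell, cytokine)].add(pmid)
    return {pair: len(pmids) for pair, pmids in articles.items()}


sentences = [
    (1, {"T cell"}, {"IL-2"}),
    (1, {"T cell"}, {"IL-2", "IFN-g"}),   # same article, counted once per pair
    (2, {"T cell"}, {"IL-2"}),
]
strength = cooccurrence_strength(sentences)
print(strength[("T cell", "IL-2")])  # 2
```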
Cytokine degrees and power-law fit.
The literature-derived nature of the iX network inherently reflects the fact that study of cellular interactions is performed at a varying level of granularity, rather than necessarily using the most specific cell type. As such, to avoid situations whereby the same interaction is counted multiple times when calculating degree distributions, we discarded interactions of less specific cell types, if a more specific cell type was known to interact with the same cytokine in the same direction. Cytokine degree counts, calculated separately per direction, took into account interactions with both hematopoietic and non-hematopoietic cells. Discrete power-law and log-normal distribution fit calculations were performed using the poweRlaw R package52.
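The pruning of less specific cell types before degree counting can be sketched as follows. The single-parent map is a simplification (the Cell Ontology allows multiple parents), and the function names are illustrative; the power-law fit itself is not reproduced here.

```python
def prune_less_specific(interactions, parent):
    """Drop (cell, cytokine, direction) triples whose cell type has a
    descendant interacting with the same cytokine in the same direction."""
    def ancestors_of(cell):
        seen = set()
        while cell in parent:
            cell = parent[cell]
            seen.add(cell)
        return seen

    shadowed = {(anc, cyt, d)
                for cell, cyt, d in interactions
                for anc in ancestors_of(cell)}
    return {t for t in interactions if t not in shadowed}


def cytokine_degrees(interactions, direction):
    """Number of distinct interacting cell types per cytokine, one direction."""
    degrees = {}
    for cell, cyt, d in interactions:
        if d == direction:
            degrees[cyt] = degrees.get(cyt, 0) + 1
    return degrees


parent = {"TH17 cell": "T cell"}
interactions = {
    ("T cell", "IL-6", "incoming"),      # shadowed by the TH17 triple
    ("TH17 cell", "IL-6", "incoming"),
    ("B cell", "IL-6", "incoming"),
    ("T cell", "IL-2", "outgoing"),
}
pruned = prune_less_specific(interactions, parent)
print(cytokine_degrees(pruned, "incoming"))  # {'IL-6': 2}
```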
We examined the similarity of iX and proteome interactions by inspecting each interaction direction separately and comparing cytokine degrees in these data sets. For outgoing interactions, we compared the number of distinct interacting cell types captured by iX and the number of those reported to express the particular cytokine in the ImmProt compendium34. For incoming interactions, we examined the proteomic profiles of genes encoding cytokine receptors and approximated cytokine–cell interactions by mapping expressed receptors to the respective cytokines, based on the KEGG “Cytokine–cytokine receptor interaction—Homo sapiens (human)” (hsa04060) pathway entry53,54. We then compared the resulting proteome-derived degrees to those captured by iX for each cytokine. To avoid situations in which the same interaction was counted multiple times due to literature reporting at varying levels of cell-type granularity, we discarded iX interactions of less-specific cell types if a more specific cell type was known to interact with the same cytokine in the same direction. Moreover, to focus on similar cellular compartments, we discarded non-hematopoietic cell interactions from iX, thereby calculating cytokine degrees across hematopoietic cell types only in both data sets.
Cell–cytokine interaction prediction.
To systematically predict interactions between immune cells and cytokines, we applied several strategies, separately for each interaction direction:
(i) Based on the similarity of global signaling profiles, we assembled a global literature-derived Boolean matrix that indicated whether an interaction has been described for each cell–cytokine pair. We used hierarchical clustering to group together cytokines with similar interaction profiles across all cell types and hypothesized that non-interacting cell–cytokine pairs located within highly connected clusters might actually interact. Therefore, we scanned clusters, confined to all possible combinations of column and row dendrogram branches, and derived interaction candidates for 'gaps' in clusters with at least 85% of interacting pairs. We scored the resulting candidates by counting the number of clusters that predicted the interaction to be novel.
(ii) On the basis of similarity of cytokine structure or function we calculated the proportion of cytokine family members known from the literature to interact with each cell type and derived interaction candidates by hypothesizing that cells interacting with at least 30% of the members of a cytokine family and with its most interactive member (i.e., usually the founding family member) might interact with other cytokines in the family as well. We used the aforementioned proportion to rank the resulting cell–cytokine interaction candidates.
(iii) On the basis of differences between literature-derived knowledge and experimental data (Supplementary Note 9).
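Strategy (ii) lends itself to a compact illustration. The Python sketch below assumes that known interactions for one direction are given as a map from cell type to the set of interacting cytokines, and that the "most interactive member" is the family member linked to the most cell types. The entity names are placeholders, not biological claims.

```python
def family_candidates(known, family, min_fraction=0.30):
    """`known` maps cell type -> set of cytokines it interacts with (one
    direction); `family` lists the cytokines of one family.
    Returns [(cell, cytokine, fraction)] sorted by decreasing fraction."""
    members = set(family)
    # Most interactive member: the one linked to the most cell types.
    top = max(family, key=lambda c: sum(c in cyts for cyts in known.values()))
    candidates = []
    for cell, cyts in known.items():
        linked = cyts & members
        fraction = len(linked) / len(members)
        if fraction >= min_fraction and top in linked:
            candidates += [(cell, c, fraction) for c in sorted(members - linked)]
    return sorted(candidates, key=lambda t: -t[2])


known = {"cell X": {"A1", "A2"}, "cell Y": {"A2"}, "cell Z": {"A3"}}
family = ["A1", "A2", "A3", "A4"]
print(family_candidates(known, family))
# [('cell X', 'A3', 0.5), ('cell X', 'A4', 0.5)]
```

Here only cell X passes both conditions: it interacts with half of the family, including A2, the most interactive member.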
Manual evaluation of cell–cytokine interaction candidates.
We evaluated 78 interactions with the best overall scoring (Supplementary Tables 14 and 15). Candidates were scored by manual evaluation (e.g., we considered predictions made by multiple methodologies to be stronger); however, multiple other criteria were also incorporated in deciding which candidate interactions to validate. For example, we considered a candidate interaction whose reverse directionality had not yet been reported to be stronger than one for which the reverse directionality had been captured by iX (as that reverse directionality might result from erroneous identification, invalidating prediction novelty). A subset of candidate interactions with no evidence identified using a manual search was then chosen for experimental validation.
Experimental validation of cell–cytokine interaction predictions.
Whole blood was obtained by consent from healthy volunteers through venipuncture. The PBMC fraction was separated by a standard centrifugation at 1,500 r.p.m. on a Ficoll gradient without applying the brake. For phospho-flow, cells were washed twice with Dulbecco's PBS and subjected to stimulation for 15 min with IL-34 and CSF1 (Peprotech, Asia) at 100 ng/ml. Cells were fixed for 10 min at room temperature with 1.6% paraformaldehyde (PFA; Pierce) and stained for 1 h at room temperature with a mix of metal-tagged antibodies. Furthermore, cells were permeabilized with ice-cold methanol and subjected to intracellular staining with metal-tagged antibodies against a phosphorylated form of ERK1–ERK2, pNF-κB, pSTAT1 and pSTAT5. Antibodies to the phosphorylated targets were obtained from Fluidigm.
For intracellular cytokine staining, PBMCs were stimulated with PMA (150 ng/ml) and ionomycin (1 mM) (Sigma) for 4 h at 37 °C in complete medium containing monensin and brefeldin-A at a 1:1,000 dilution (eBioscience). Extracellular epitopes were stained for 1 h with a mix of metal-tagged antibodies, and cells were fixed with PFA as described above and permeabilized for 1 h on ice with saponin permeabilization buffer (eBioscience). Intracellular staining of IL-7 was performed on ice in saponin-containing buffer. All extracellular and cytokine-specific antibodies were conjugated in-house using MaxPar kits from Fluidigm or pre-conjugates purchased from Fluidigm. Cells were stained with Cell-ID Ir191/193 for viability stain and acquired on a CyTOF1 (DVS, Fluidigm) instrument.
Assembly of disease similarity modules.
We defined a context for cells, cytokines and cell–cytokine interactions by diseases co-occurring in the same abstracts, whereas disease mentions were captured by using a manually curated compilation of the UMLS Metathesaurus49. To identify disease-similarity modules, we focused on the 188 top-cited diseases (co-mentioned in at least 500 papers with cytokines and at least 500 papers with hematopoietic cells) that we could classify as pertaining to at least one of the eight predefined clinical categories (e.g., disorder of cardiovascular system, generalized metabolic disorder, neoplasm of hematopoietic and nonhematopoietic cell types, autoimmune diseases and hypersensitivity conditions). To define clinical categories, we used SNOMED CT ontology (available through UMLS Metathesaurus49) and manually expanded its autoimmune disease category with a publicly available autoimmune-related disease list (http://www.aarda.org/autoimmune-information/list-of-diseases). As a preliminary step to module assembly, we extracted a nonspecific across-disease control profile by repeated paper sampling from the entire corpus of 521,625 disease–HPC and 438,012 disease–cytokine co-occurrence papers, respectively, without limiting to any particular context (200 iterations of 200 papers each). We examined cells and cytokines mentioned in the sampled papers, assembling the distribution of hits for each HPC and cytokine across sampling iterations, to serve as a control for further disease-specific profile assembly.
Next, for each of the 188 diseases of interest, we assembled its underlying signaling profile by applying several steps: (i) performing 200 samplings of 200 random papers from the disease-specific sub-corpus to control for differences in corpus size, followed by extraction of the distribution of hits for each cell and cytokine across sampling attempts, (ii) filtering out under-represented entities, i.e., those with a hit proportion median lower in the disease-specific sampling than in the control sampling and (iii) linking cells and cytokines that co-occur in the same disease profile to interacting pairs, using interaction potential captured in the overall iX network.
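Steps (i) and (ii) can be sketched in Python as below. The sketch assumes each paper is represented by the set of entities it mentions, that sampling is done without replacement, and that an entity absent from the control sampling has a control proportion of zero; these choices, the fixed seed and the function names are illustrative assumptions.

```python
import random
from statistics import median


def hit_proportions(papers, n_iter=200, sample_size=200, seed=0):
    """`papers` is a list of sets, each holding the entities mentioned in one
    paper. Returns {entity: median proportion of sampled papers mentioning it}."""
    rng = random.Random(seed)
    entities = set().union(*papers)
    per_iter = {e: [] for e in entities}
    for _ in range(n_iter):
        sample = rng.sample(papers, min(sample_size, len(papers)))
        for e in entities:
            per_iter[e].append(sum(e in p for p in sample) / len(sample))
    return {e: median(v) for e, v in per_iter.items()}


def disease_profile(disease_papers, control_papers, **kw):
    """Keep entities whose median hit proportion in the disease sub-corpus is
    not lower than in the control sampling (step ii)."""
    disease = hit_proportions(disease_papers, **kw)
    control = hit_proportions(control_papers, **kw)
    return {e for e, p in disease.items() if p >= control.get(e, 0.0)}


# Toy corpora: 'T cell' is rare in the disease papers but ubiquitous in control.
disease_papers = [{"neutrophil", "IL-8"}] * 390 + [{"T cell"}] * 10
control_papers = [{"T cell", "neutrophil"}] * 100 + [{"T cell"}] * 300
print(sorted(disease_profile(disease_papers, control_papers)))
# ['IL-8', 'neutrophil']
```

Drawing a fixed number of papers per iteration keeps the proportions comparable between large and small disease sub-corpora.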
Lastly, we performed unsupervised clustering of the resulting Boolean disease profiles ('0' or '1' to indicate whether the particular cell, cytokine or pair was a part of the profile) using the WGCNA R package55,56 to assemble the set of immune-centric disease similarity modules based on the binary distance metric.
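For reference, a binary distance between two Boolean profiles can be computed as in the Python sketch below. It assumes the metric corresponds to the 'binary' method of R's `dist` function, that is, the proportion of features present in exactly one profile among those present in at least one. Returning 0 for two empty profiles is a convention chosen for this sketch.

```python
def binary_distance(a, b):
    """Fraction of features present in exactly one profile among those
    present in at least one of the two profiles."""
    either = sum(1 for x, y in zip(a, b) if x or y)
    if either == 0:
        return 0.0
    differ = sum(1 for x, y in zip(a, b) if bool(x) != bool(y))
    return differ / either


profile_a = [1, 1, 0, 0]
profile_b = [1, 0, 1, 0]
print(round(binary_distance(profile_a, profile_b), 3))  # 0.667
```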
Disease module signature extraction.
To extract features (i.e., cells, cytokines and their interacting pairs) that characterized disease modules, we applied a hypergeometric test, independently for each feature, examining whether the feature was over-represented in the particular module as compared to that in the entire set of 188 diseases. We corrected for multiple testing using a Benjamini–Hochberg correction.
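The over-representation test and the multiple-testing correction can be written with the Python standard library alone, as sketched below. Here N is the total number of diseases, K the number of diseases carrying the feature, n the module size and k the number of module diseases carrying the feature; the function names and example p-values are illustrative.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n diseases (the module) out of N, of which K
    carry the feature."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, upper + 1)) / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from largest p to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Probability that both members of a 2-disease module carry a feature
# present in 2 of 4 diseases: 1/6.
print(round(hypergeom_sf(2, N=4, K=2, n=2), 4))  # 0.1667
print([round(p, 3) for p in benjamini_hochberg([0.01, 0.04, 0.03, 0.5])])
# [0.04, 0.053, 0.053, 0.5]
```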
Cytokine–disease association prediction.
To test cytokine utilization across conditions, we repeatedly sampled papers from disease-specific sub-corpora, separately for each of the 188 top-cited diseases (200 iterations of 200 papers each). We extracted the distribution of paper hit proportions for each cytokine–disease pair across iterations and used the median proportion as the measure of cytokine mention frequency in that particular context.
To systematically predict cytokine–disease associations, we employed global within-module immune similarity and applied the following steps separately for each disease module: (i) assembly of literature-derived Boolean matrix indicating whether a cytokine was a part of the disease profile, for each disease in the module and (ii) hierarchical clustering to group together module diseases displaying similar profiles across all cytokines and hypothesizing that an association should exist between currently unlinked cytokine–disease pairs within highly linked clusters. Therefore, we scanned clusters, which were confined to all possible combinations of column and row dendrogram branches, and derived cytokine–disease association candidates for 'gaps' in clusters with at least 30% of linked cytokine–disease pairs. We scored the resulting candidates by counting the number of clusters that predicted the particular association to be novel.
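The gap-scanning step can be sketched in Python as follows. Construction of the dendrograms is omitted: clusters are passed in explicitly as (rows, columns) pairs, which is an assumption of this illustration, as are the function names and the toy matrix.

```python
def cluster_gaps(matrix, rows, cols, min_density=0.30):
    """`matrix[r][c]` is truthy when disease r is linked to cytokine c.
    `rows` and `cols` delimit one cluster (a pair of dendrogram branches).
    Returns the unlinked (row, col) pairs if the cluster is dense enough."""
    cells = [(r, c) for r in rows for c in cols]
    linked = sum(bool(matrix[r][c]) for r, c in cells)
    if not cells or linked / len(cells) < min_density:
        return []
    return [(r, c) for r, c in cells if not matrix[r][c]]


def score_candidates(matrix, clusters, min_density=0.30):
    """Count, for every gap, the number of clusters that predict it."""
    votes = {}
    for rows, cols in clusters:
        for gap in cluster_gaps(matrix, rows, cols, min_density):
            votes[gap] = votes.get(gap, 0) + 1
    return votes


matrix = {
    "d1": {"c1": 1, "c2": 1, "c3": 0},
    "d2": {"c1": 1, "c2": 0, "c3": 0},
}
clusters = [
    (["d1", "d2"], ["c1", "c2"]),         # density 3/4
    (["d1", "d2"], ["c1", "c2", "c3"]),   # density 3/6
    (["d2"], ["c3"]),                     # density 0, skipped
]
votes = score_candidates(matrix, clusters)
print(votes[("d2", "c2")])  # 2
```

A candidate supported by more clusters, such as (d2, c2) here, receives a higher score than one predicted by a single cluster.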
Life Sciences Reporting Summary.
Further information on experimental design is available in the Life Sciences Reporting Summary.
Data of the intercellular communication network and disease context are hosted on ImmPort with periodic updates and are available for query and download with standardization to ontologies at http://www.immunexpresso.org.
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
We thank A. Butte and M. Davis for fruitful discussions and advice, N. Geifman for assistance with cytokine ontology development, D. Dougall for contribution to the cell lexicon, members of the Shen-Orr lab for reference book curation, D. Cohen for the high-performance computing cluster support, R. Reichart for Text Mining insights, and P. Dunn and S. Bhattacharya for the user interface development support. This work was supported by the US National Institutes of Health (NIH)-National Institute of Allergy and Infectious Diseases (U19 AI057229, and BISC contract HHSN272201200028C) and an award from the Rappaport Family Institute for Research in the Medical Sciences (S.S.S.-O.).
Integrated supplementary information
Cell seed recognition statistics.
Blacklist of Cell Ontology nodes.
Cytokine recognition statistics.
Cytokine lexicon fragment for CXC chemokine family.
Verb classification lexicon.
Cell entity fields and manual precision evaluation results.
Cytokine entity fields and manual precision evaluation results.
Disease entity fields and manual precision evaluation results.
Manual precision evaluation for noun phrase-internal relation evidence records.
Manual precision evaluation for non-noun phrase-internal relation evidence records.
PubMed IDs and statistics for relation evidence records.
Disease term recognition statistics.
Novel incoming cell-cytokine interaction candidates.
Novel outgoing cell-cytokine interaction candidates.
Detailed profiles of 188 top-cited diseases.
Cytokine sampling for 188 top-cited diseases.
Novel cytokine-disease association candidates.