A growing variety of statistical analysis approaches are available to identify groups of genes that share common expression patterns; however, the interpretation of the biological characteristics of genes in such clusters remains primarily a manual task. We have developed a data-mining method that uses indexing terms from the published literature linked to specific genes to present a view of the conceptual similarity of genes within a cluster or group of interest. The method takes advantage of the hierarchical nature of medical subject headings used to index citations in the MEDLINE database and the registry numbers applied to enzymes. The results are generated as dynamic HTML with links to the citations whose keywords appear in the term hierarchies. We have applied this method to gene clusters in the publication by Golub et al. (ref. 1) describing statistical methods for classifying acute myeloblastic leukemia (AML) and acute lymphoblastic leukemia (ALL) without a priori biological knowledge. In both sets of genes, the most common enzymatic descriptor class is that of complement-activating enzymes. In the ALL-predictive set of genes, these enzyme descriptors include endonucleases, endopeptidases, amidohydrolases and acid anhydride hydrolases. In the AML-predictive set, several plasminogen activators occur as keywords, a finding that may correlate with defibrination syndromes and other hemostatic abnormalities that are associated with AML but not with ALL. Overall, complement activation is a common and potentially clinically significant phenomenon in acute leukemias, and the high frequency of this descriptor in the set of highly expressed genes is consistent with our observations that informative genes were not merely markers of hematopoietic lineage, but encoded proteins important in cancer pathogenesis.
These conceptual similarities, revealed by the automated summing and organization of literature keywords associated with these 50 genes, are a new finding that complements the interpretations of the authors of the original paper.
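The summing of keywords over a term hierarchy described above can be illustrated with a minimal sketch. It assumes that each indexing term carries a dotted tree number whose prefixes name its ancestors, as MeSH tree numbers do, and it counts how many genes in a cluster touch each node. The gene names, the tree numbers and the functions `ancestors` and `roll_up` are hypothetical and chosen only for illustration; the abstract does not specify whether the original method counts genes, citations or keyword occurrences.

```python
from collections import Counter


def ancestors(tree_number):
    """Return the node itself and all of its ancestors, root first.

    Assumes a dotted tree number, e.g. "X01.100.200" ->
    ["X01", "X01.100", "X01.100.200"].
    """
    parts = tree_number.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def roll_up(gene_terms):
    """Count, for every hierarchy node, how many genes are linked to it.

    gene_terms maps a gene name to the tree numbers of the keywords
    found in its linked citations. Each gene contributes at most once
    per node (an assumption of this sketch, not of the original method).
    """
    counts = Counter()
    for tree_numbers in gene_terms.values():
        seen = set()
        for tree_number in tree_numbers:
            seen.update(ancestors(tree_number))
        counts.update(seen)
    return counts


# Hypothetical cluster: tree numbers are made up, not real MeSH codes.
toy_cluster = {
    "geneA": ["X01.100.200", "X01.100.300"],
    "geneB": ["X01.100.200"],
    "geneC": ["X02.500"],
}

counts = roll_up(toy_cluster)

# Print nodes in hierarchy order, indented by depth.
for node in sorted(counts):
    depth = node.count(".")
    print("  " * depth + f"{node}: {counts[node]} gene(s)")
```

Rolling counts up to ancestor nodes is what lets a shared concept surface even when individual genes are indexed with different, more specific terms beneath it.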
References
1. Golub, T. et al. Science 286, 531–537 (1999).
Cite this article
Masys, D., Welsh, J., Fink, J. et al. Use of controlled terminology hierarchies to detect common characteristics of genes within expression clusters. Nat Genet 27 (Suppl 4), 72 (2001). https://doi.org/10.1038/87204