A universal information theoretic approach to the identification of stopwords

Gerlach, Martin; Shi, Hanyu; Amaral, Luís A. Nunes

doi:10.1038/s42256-019-0112-6

Article
Published: 02 December 2019

Nature Machine Intelligence volume 1, pages 606–612 (2019)

Abstract

One of the most widely used approaches in natural language processing and information retrieval is the so-called bag-of-words model. A common component of such methods is the removal of uninformative words, commonly referred to as stopwords. Currently, most practitioners use manually curated stopword lists. This approach is problematic because it cannot be readily generalized across knowledge domains or languages. As a result of the difficulty in rigorously defining stopwords, there have been few systematic studies on the effect of stopword removal on algorithm performance, which is reflected in the ongoing debate on whether to keep or remove stopwords. Here we address this challenge by formulating an information theoretic framework that automatically identifies uninformative words in a corpus. We show that our framework not only outperforms other stopword heuristics, but also allows for a substantial reduction of document size in applications of topic modelling. Our findings can be readily generalized to other bag-of-words-type approaches beyond language such as in the statistical analysis of transcriptomics, audio or image corpora.
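
The abstract does not spell out the measure itself, so the following is only a minimal sketch of one way such a framework could look. It assumes that the information content of a word is the gap between the entropy of its distribution over documents in a randomly shuffled corpus (averaged over several shuffles) and the entropy observed in the real corpus, and that stopword candidates are the words whose gap falls at or below a threshold. The function names, the `n_shuffles` and `threshold` parameters and the toy corpus are hypothetical and are not taken from the authors' code.

```python
import math
import random
from collections import Counter


def word_entropy(counts_per_doc):
    """Shannon entropy (bits) of one word's distribution over documents."""
    total = sum(counts_per_doc)
    return -sum((c / total) * math.log2(c / total) for c in counts_per_doc if c > 0)


def observed_entropies(docs):
    """Map each word to the entropy of its counts across the documents."""
    per_word = {}
    for doc in docs:
        for word, count in Counter(doc).items():
            per_word.setdefault(word, []).append(count)
    return {word: word_entropy(counts) for word, counts in per_word.items()}


def shuffled_docs(docs, rng):
    """Shuffle all tokens across the corpus while keeping document lengths."""
    tokens = [token for doc in docs for token in doc]
    rng.shuffle(tokens)
    out, start = [], 0
    for doc in docs:
        out.append(tokens[start:start + len(doc)])
        start += len(doc)
    return out


def information_content(docs, n_shuffles=50, seed=0):
    """Assumed measure: mean entropy under shuffling minus observed entropy."""
    rng = random.Random(seed)
    observed = observed_entropies(docs)
    null_mean = dict.fromkeys(observed, 0.0)
    for _ in range(n_shuffles):
        for word, h in observed_entropies(shuffled_docs(docs, rng)).items():
            null_mean[word] += h / n_shuffles
    return {word: null_mean[word] - observed[word] for word in observed}


def stopword_candidates(info, threshold):
    """Words whose assumed information content is at or below the threshold."""
    return {word for word, value in info.items() if value <= threshold}


if __name__ == "__main__":
    # Hypothetical toy corpus: "the" is spread evenly, the other words are topical.
    corpus = [
        ["the"] * 5 + ["cat"] * 5,
        ["the"] * 5 + ["dog"] * 5,
        ["the"] * 5 + ["fish"] * 5,
        ["the"] * 5 + ["bird"] * 5,
    ]
    scores = information_content(corpus)
    print(sorted(scores.items(), key=lambda item: item[1]))
    print(stopword_candidates(scores, threshold=0.0))
```

In this sketch a word that is spread across documents no differently from chance scores near zero, while a word concentrated in a few documents scores well above zero. The threshold is a free parameter here; how it should be chosen is not stated in the abstract.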

**Fig. 1: Using entropy as a universal measure to quantify the information content of a word.**

**Fig. 2: Identification of stopwords by thresholding of information content.**

**Fig. 3: Removal of information theoretic stopwords makes the topic model more accurate and stable.**

**Fig. 4: Universal improvement of topic model inference for different language corpora.**

**Fig. 5: Robustness of supervised classification accuracy with respect to removal of information theoretic stopwords.**

**Fig. 6: Application to data from scRNA-seq reveals ‘stopgenes’.**

Data availability

The text data are available in the public repository https://github.com/amarallab/stopwords.

Code availability

The code for this Article, along with an accompanying computational environment, is available in the public repository https://github.com/amarallab/stopwords and is executable online as a Code Ocean capsule. Code for the calculation of the information theoretic measure \(I\) and for the experiments with topic models can be found at https://doi.org/10.24433/CO.6204149.v1⁴².

References

1. Manning, C. D. & Schütze, H. Foundations of Statistical Natural Language Processing (MIT Press, 1999).
2. Evans, J. A. & Aceves, P. Machine translation: mining text for social theory. Ann. Rev. Sociol. 42, 21–50 (2016).
3. Rebholz-Schuhmann, D., Oellrich, A. & Hoehndorf, R. Text-mining solutions for biomedical research: enabling integrative biology. Nat. Rev. Genet. 13, 829–839 (2012).
4. García, S., Luengo, J. & Herrera, F. Data Preprocessing in Data Mining (Springer, 2014).
5. Dasu, T. & Johnson, T. Exploratory Data Mining and Data Cleaning (John Wiley & Sons, 2003).
6. Schoenfeld, B., Giraud-Carrier, C., Poggemann, M., Christensen, J. & Seppi, K. Preprocessor selection for machine learning pipelines. Preprint at http://arXiv.org/abs/1810.09942 (2018).
7. Blei, D. M. Probabilistic topic models. Commun. ACM 55, 77–84 (2012).
8. Boyd-Graber, J., Hu, Y. & Mimno, D. Applications of topic models. Found. Trends Inf. Retr. 11, 143–296 (2017).
9. Luhn, H. P. The automatic creation of literature abstracts. IBM J. Res. Dev. 2, 159–165 (1958).
10. Rasmussen, E. in Encyclopedia of Database Systems (eds Liu, L. & Özsu, M. T.) (2009).
11. McCallum, A. K. Mallet: a machine learning for language toolkit. http://mallet.cs.umass.edu (2002).
12. Nothman, J., Qin, H. & Yurchak, R. Stop word lists in free open-source software packages. In Proc. Workshop for NLP Open Source Software (NLP-OSS) (eds Park, E. L. et al.) 7–12 (Association for Computational Linguistics, 2018).
13. Lo, R. T.-W., He, B. & Ounis, I. Automatically building a stopword list for an information retrieval system. J. Digit. Inf. Manag. 5, 17–24 (2005).
14. Zou, F., Wang, F. L., Deng, X., Han, S. & Wang, L. S. Automatic construction of Chinese stop word list. In Proc. 5th WSEAS International Conference on Applied Computer Science (ACOS’06) (Huang, W. et al.) 1009–1014 (World Scientific and Engineering Academy and Society, 2006).
15. Salton, G. & Yang, C. S. On the specification of term values in automatic indexing. J. Doc. 29, 351–372 (1973).
16. Blei, D. M., Ng, A. Y. & Jordan, M. I. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003).
17. Wang, C., Paisley, J. & Blei, D. M. Online variational inference for the hierarchical Dirichlet process. In Proc. 14th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research Vol. 15, 752–760 (AISTAT, 2011).
18. Hoffman, M. D., Blei, D. M. & Bach, F. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 23 (NIPS 2010) (eds Lafferty, J. D. et al.) 1–9 (Neural Information Processing Systems Foundation, 2010).
19. Blei, D. M., Griffiths, T. L. & Jordan, M. I. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. J. ACM 57, 1–30 (2010).
20. Blei, D. M. & McAuliffe, J. D. Supervised topic models. In Advances in Neural Information Processing Systems (eds Platt, J. C. et al.) Vol. 20, 121–128 (NIPS 2007).
21. Achakulvisut, T., Acuna, D. E., Ruangrong, T. & Kording, K. Science concierge: a fast content-based recommendation system for scientific publications. PLoS ONE 11, e0158423 (2016).
22. Schofield, A., Magnusson, M. & Mimno, D. Pulling out the stops: rethinking stopword removal for topic models. In Proc. 15th Conference of the European Chapter of the Association for Computational Linguistics (eds Lapata, M. et al.) Vol. 2, 432–436 (Association for Computational Linguistics, 2017).
23. Montemurro, M. A. & Zanette, D. H. Towards the quantification of the semantic information encoded in written language. Adv. Complex Syst. 13, 135–153 (2010).
24. Gries, S. T. Dispersions and adjusted frequencies in corpora. Int. J. Corpus Linguist. 13, 403–437 (2008).
25. Zipf, G. K. Human Behaviour and the Principle of Least Effort (Addison-Wesley, 1949).
26. Fan, A., Doshi-Velez, F. & Miratrix, L. Prior matters: simple and general methods for evaluating and improving topic quality in topic modeling. Preprint at http://arXiv.org/abs/1701.03227 (2017).
27. Schofield, A. & Mimno, D. Comparing apples to apple: the effects of stemmers on topic models. Trans. Assoc. Comput. Linguist. 4, 287–300 (2016).
28. Shi, H., Gerlach, M., Diersen, I., Downey, D. & Amaral, L. A new evaluation framework for topic modeling algorithms based on synthetic corpora. In Proc. Machine Learning Research Vol. 89 (eds Chaudhuri, K. & Sugiyama, M.) 816–826 (PMLR, 2019).
29. Peel, L., Larremore, D. B. & Clauset, A. The ground truth about metadata and community detection in networks. Sci. Adv. 3, e1602548 (2017).
30. Lancichinetti, A. et al. High-reproducibility and high-accuracy method for automated topic classification. Phys. Rev. X 5, 011007 (2015).
31. Aggarwal, C. C. & Zhai, C. in Mining Text Data (eds Aggarwal, C. C. & Zhai, C.) 77–128 (Springer, 2012).
32. Uysal, A. K. & Gunal, S. The impact of preprocessing on text classification. Inf. Process. Manag. 50, 104–112 (2014).
33. Skinnider, M. A., Squair, J. W. & Foster, L. J. Evaluating measures of association for single-cell transcriptomics. Nat. Methods 16, 381–386 (2019).
34. Bravo González-Blas, C. et al. cisTopic: cis-regulatory topic modeling on single-cell ATAC-seq data. Nat. Methods 16, 397–400 (2019).
35. Alberts, B. et al. Molecular Biology of the Cell Sixth International Student Edition (W. W. Norton & Co., 2014).
36. Zheng, C. et al. Landscape of infiltrating T cells in liver cancer revealed by single-cell sequencing. Cell 169, 1342–1356.e16 (2017).
37. Solé-Boldo, L. et al. Single-cell transcriptomes of the aging human skin reveal loss of fibroblast priming. Preprint at bioRxiv https://doi.org/10.1101/633131 (2019).
38. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26 (eds Burges, C. J. C. et al.) 3111–3119 (Curran Associates, 2013).
39. Pritchard, J. K., Stephens, M. & Donnelly, P. Inference of population structure using multilocus genotype data. Genetics 155, 945–959 (2000).
40. Broderick, T., Mackey, L., Paisley, J. & Jordan, M. I. Combinatorial clustering and the beta negative binomial process. IEEE Trans. Pattern Anal. Mach. Intell. 37, 290–306 (2015).
41. Yan, X., Jeub, L. G. S., Flammini, A., Radicchi, F. & Fortunato, S. Weight thresholding on complex networks. Phys. Rev. E 98, 042304 (2018).
42. Gerlach, M., Shi, H. & Amaral, L. A. N. Stopwords-filtering. Code Ocean https://doi.org/10.24433/CO.6204149.v1 (2019).

Acknowledgements

L.A.N.A. acknowledges a John and Leslie McQuown Gift to NICO and support from the Department of Defense Army Research Office (grant number W911NF-14-1-0259). M.G. thanks T. Stoeger and Z. Ren for insightful discussion on scRNA-seq.

Author information

These authors contributed equally: Martin Gerlach, Hanyu Shi.

Authors and Affiliations

Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL, USA
Martin Gerlach, Hanyu Shi & Luís A. Nunes Amaral
Northwestern Institute on Complex Systems, Northwestern University, Evanston, IL, USA
Luís A. Nunes Amaral
Department of Physics and Astronomy, Northwestern University, Evanston, IL, USA
Luís A. Nunes Amaral
Department of Medicine, Northwestern University, Evanston, IL, USA
Luís A. Nunes Amaral

Authors

Martin Gerlach
Hanyu Shi
Luís A. Nunes Amaral

Contributions

M.G. and L.A.N.A. conceptualized the study. M.G. and H.S. obtained all data and conducted all analysis. M.G. and L.A.N.A. wrote the first draft. M.G., H.S. and L.A.N.A. edited and revised the manuscript.

Corresponding authors

Correspondence to Martin Gerlach or Luís A. Nunes Amaral.

Ethics declarations

Competing interests

The authors declare no competing interests.

Supplementary information

Supplementary methods, notes, figures and references.

About this article

Cite this article

Gerlach, M., Shi, H. & Amaral, L.A.N. A universal information theoretic approach to the identification of stopwords. Nat Mach Intell 1, 606–612 (2019). https://doi.org/10.1038/s42256-019-0112-6

Received: 07 March 2019
Accepted: 09 October 2019
Published: 02 December 2019
Issue Date: December 2019
DOI: https://doi.org/10.1038/s42256-019-0112-6

This article is cited by

Preprocessing of Unstructured Data Using 2D Coiflet Wavelet-Based Optimized Back-Propagation Neural Network for Opinion Mining
- H. Mohamed Zakir
- S. Vinila Jinny
Arabian Journal for Science and Engineering (2023)
Semantic Academic Profiler (SAP): a framework for researcher assessment based on semantic topic modeling
- Felipe Viegas
- Antônio Pereira
- Leonardo Rocha
Scientometrics (2022)
An approach for detecting the commonality and specialty between scientific publications and patents
- Shuo Xu
- Ling Li
- Guancan Yang
Scientometrics (2021)