Computational thematics: Comparing algorithms for clustering the genres of literary fiction

What are the best methods of capturing thematic similarity between literary texts? Knowing the answer to this question would be useful for automatic clustering of book genres, or any other thematic grouping. This paper compares a variety of algorithms for unsupervised learning of thematic similarities between texts, which we call "computational thematics". These algorithms belong to three steps of analysis: text preprocessing, extraction of text features, and measuring distances between the lists of features. Each of these steps includes a variety of options. We test all the possible combinations of these options: every combination of algorithms is given the task of clustering a corpus of books belonging to four pre-tagged genres of fiction. This clustering is then validated against the "ground truth" genre labels. Such a comparison of algorithms allows us to learn the best and the worst combinations for computational thematic analysis. To illustrate the sharp difference between the best and the worst methods, we then cluster 5000 random novels from the HathiTrust corpus of fiction.


Introduction
Computational literary studies have rapidly grown in prominence in recent years. One of the most successful directions of inquiry within this domain, in terms of both methodological advances and empirical findings, has been computational stylometry, or computational stylistics: a discipline that develops algorithmic techniques for learning stylistic similarities between texts (Bories et al., 2023; Burrows, 1987; Eder et al., 2016). For this purpose, computational stylometrists extract linguistic features specifically associated with authorial style, or individual authorial habits. Often, these features are the most frequent words from the analyzed literary texts - they tend to be function words ("a", "the", "on", etc.) - to which various measures of similarity (e.g., Euclidean distance) are applied. The most common goal of computational stylistics is attributing the authorship of texts where it is disputed, like the authorship of Molière's plays, the Nobel Prize-winning novel And Quiet Flows the Don (Iosifyan & Vlasov, 2020), or Shakespeare and Fletcher's play Henry VIII (Plecháč, 2021). Thanks to numerous systematic comparisons of various approaches to computational stylometry, we now have a fairly good idea of which procedures and textual features are the most effective - depending on the goal of stylometric analysis, the language of the texts, or their genre (Neal et al., 2017; Plecháč et al., 2018).
At the same time, we lack such systematic comparisons in the research area that might be called "computational thematics": the study of thematic similarities between texts. (Thematic similarities: say, that novels A and B both tell a love story or have a "fantasy" setting.) Why is learning about thematic similarities important? Genre -a population of texts united by broad thematic similarities -fantasy, romance, science fiction, and the like -is a central notion in literary studies, necessary not only for categorizing and cataloging literary works, but also for the historical scholarship of literature. Genres are evolving populations of texts that emerge at certain moments of time, spread across the field of literary production, and then disappear in their original form -usually becoming stepping stones for subsequent genres (Fowler, 1971). For example, the genre of "classical" detective fiction crystallized in the 1890-1930s, and then gave birth to multiple other genres of crime fiction, such as "hardboiled crime fiction", "police procedural", "historical detective", and others (Symons, 1985). Studying the historical dynamics of genres -not only of literature, but also music or painting -is an important task of art history and sociology, and digital archives allow doing so on a much larger scale (Allison et al., 2011;Klimek et al., 2019;Sigaki et al., 2018). But to gain the most from this larger scale, we must determine the best, most reliable algorithms for detecting the thematic signal in books -similarly to how computational stylometrists have learnt the most effective algorithms for detecting the signal of authorship.
Quantitative analysis of genres usually takes one of these forms. The first one is manual tagging of books by genre, or using datasets where such tagging has already been done via large crowdsourced efforts, like the data collected on the Goodreads website (Thelwall, 2019). This approach is prone to human bias, it is laborious, and it rests on the idea that the differences between genre populations are qualitative, not quantitative (e.g., a certain book is either a "detective" or "romance", or both, but not 0.78 detective and 0.22 romance, which, we think, would be a more informative description). The second approach is an extension of manual tagging: supervised machine learning of book genres using a training dataset with manually tagged genres (Piper et al., 2021; Underwood, 2019). This approach has important strengths: it is easily scalable and it provides not qualitative but quantitative estimates of a book's belongingness to a genre. Still, it has a problem: it can only assign genre tags included in the training dataset; it cannot find new, unexpected book populations - which is an important component of the historical study of literature. The third approach is unsupervised clustering of genres: algorithmic detection of book populations based on their similarity to each other (Schöch, 2017). This approach is easily scalable, allows quantitative characterization of book genres, and does not require a training dataset with manually assigned tags, thus allowing the detection of new, unexpected book populations. All these features of unsupervised clustering make it highly suitable for historical research, and this is why we will focus on it in this paper.
Unsupervised clustering can be conducted in a variety of ways. For example, texts can be lemmatized or not lemmatized; as text features, simple word frequencies can be used or some higher-level units, such as topics of a topic model; to measure the similarity between texts, a host of distance metrics can be applied. Hence, the question: what are the best computational methods for detecting thematic similarities in literary texts? This is the main question of this paper. To answer it, we will compare various combinations of (1) preprocessing (which, in this study, we will also call "thematic foregrounding"), (2) text features, and (3) the metrics used for measuring distance between features. To assess the effectiveness of these combinations, we use a tightly controlled corpus of four well-known genres -detective fiction, science fiction, fantasy, and romance -as our "ground truth" dataset. To illustrate the significant difference between the best and the worst combinations of algorithms for genre detection, we later cluster genres in a much larger corpus, containing 5000 works of fiction.

Materials and Methods
Data: The "ground truth" genres

Systematic research on computational stylistics is common, while research on computational thematics is still rare (Allison et al., 2011; Schöch, 2017; Underwood, 2016). Why? Computational stylistics has clear "ground truth" data against which various methods of text analysis can be compared: authorship. The methods of text analysis in computational stylistics (e.g., Delta distance or Manhattan distance) can be compared as to how well they perform in the task of classifying texts by their authorship. We write "ground truth" in quotes, as authorship is no more than a convenient proxy for stylistic similarity, and, as any proxy, it is imprecise. It assumes that texts written by the same author should be more similar to each other than texts written by different authors. However, there are many cases in which an author's writing style evolves significantly over the span of their career, or is deliberately manipulated (Brennan et al., 2012). Authorship as a proxy for "ground truth" is a simplification - but a very useful one.
The lack of a widely accepted "ground truth" proxy for thematic analysis leads to comparisons of algorithms based on nothing more than subjective judgment (Egger & Yu, 2022). Such subjective judgment cannot lead us far: we need quantitative metrics of the performance of different algorithms. For this, an imperfect "ground truth" is better than none at all. What could play the role of such an imperfect, but still useful, ground truth in computational thematics? At the moment, it is genre categories. They capture, to different degrees, thematic similarity between texts. To different degrees, because genres can be organized according to several principles, or "axes of categorization": they can be based on the similarity of storylines (adventure novel, crime novel, etc.), settings (historical novel, dystopian novel, etc.), the emotions they evoke in readers (horror novel, humorous novel, etc.), or their target audience (e.g., young adult novels). These various "axes of categorization" do seem to correlate: say, "young adult" novels are appreciated by young adults because they often have similar storylines or characters. Or, horror novels usually draw on a broad, but consistent, arsenal of themes and settings that are efficient at evoking pleasant fear in readers (like the classical Gothic setting). Still, some axes of genre categorization are probably better for comparing the methods of computational thematics than others. Genres defined by their plots or settings may provide a clearer thematic signal than genres defined by their target audience or evoked emotions.
We have assembled a tightly controlled corpus of four genres (50 texts in each) based on their plots and settings:
• Detective fiction (recurrent thematic elements: murder, detective, suspects, investigation)
• Fantasy fiction (recurrent elements: magic, imaginary creatures, quasi-medieval setting)
• Romance fiction (recurrent elements: affection, erotic scenes, love triangle plot)
• Science fiction (recurrent thematic elements: space, future, technology)
We took several precautions to remove potential confounds. First, these genres are situated on a similar level of abstraction: we are not comparing rough-grain categories (say, romance or science fiction) to fine-grain ones (historical romance or cyberpunk science fiction). Second, we limited the time span of book publication to a rather short period, 1950-1999, to make sure that our analysis is not affected too much by language change (which would inevitably happen if we compared, for example, 19th-century gothic novels to 20th-century science fiction). Third, each genre corpus has a similar number of authors (29-31), each represented by 1-3 texts. Several examples of books in each genre are shown in

Analysis: The race of algorithms
To compare the methods of detecting the thematic signal, we developed a workflow consisting of four steps - see Figure 1. Like our corpus, all the detailed steps of the workflow were pre-registered.
Step 1. Choosing a combination of thematic foregrounding, features, and distance

As a first step, we choose a combination of (a) the level of thematic foregrounding, (b) the features of analysis, and (c) the measure of distance.
By thematic foregrounding (Step 1a in Figure 1) we mean the extent to which the thematic aspect of a text is highlighted (and the stylistic aspect backdropped). With weak thematic foregrounding, only the most basic text preprocessing is done: lemmatizing words and removing the 100 most frequent words (MFWs) - the most obvious carriers of a strong stylistic signal. The 100 MFWs roughly correspond to function words (or closed-class words) in English, routinely used in authorship attribution beginning with the classical study of the Federalist Papers (Mosteller & Wallace, 1963). With medium thematic foregrounding, in addition to lemmatizing, we also remove entities (named entities, proper names, etc.) using the spaCy tagger (Honnibal & Montani, 2017). Additionally, we perform part-of-speech tagging and remove all words that are not nouns, verbs, adjectives, or adverbs - the most content-bearing parts of speech. With strong thematic foregrounding, in addition to all the steps of medium foregrounding, we also apply lexical simplification: we simplify the vocabulary by replacing less frequent words with their more frequent synonyms - namely, we replace all words outside the 1000 MFWs with their more common semantic neighbors (out of the 10 closest neighbors), with the help of a pre-trained FastText model that includes 2 million words and is trained on English Wikipedia.
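As a rough illustration of the weak level of thematic foregrounding, the MFW-removal step can be sketched in a few lines of plain Python. This is a minimal sketch, not the paper's code: lemmatization is omitted, the function name `remove_top_mfw` is ours, and the frequency ranking here is computed over a single token list rather than the whole corpus.

```python
from collections import Counter

def remove_top_mfw(tokens, n_mfw=100):
    """Drop the n_mfw most frequent word types (mostly function
    words, the carriers of stylistic signal) from a token list."""
    mfw = {w for w, _ in Counter(tokens).most_common(n_mfw)}
    return [t for t in tokens if t not in mfw]

tokens = "the cat sat on the mat and the dog sat on the rug".split()
filtered = remove_top_mfw(tokens, n_mfw=2)  # drop the 2 most frequent types
print(filtered)
```

In the study itself, the MFW list would be ranked across the whole corpus before filtering individual novels, so that all texts lose the same word types.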
Then, we transform our pre-processed texts into lists of features (Step 1b in Figure 1). We vary both the type of features and the length of the lists. We consider four types of features. The simplest features are the most frequent words, used in the bag-of-words approach (1000, 5000, or 10,000 of them) - a common solution for thematic analysis in computational literary studies (Hughes et al., 2012; Underwood, 2019). The second type of feature is topic probabilities generated with the Latent Dirichlet Allocation (LDA) algorithm (Blei et al., 2003) - another common choice (Jockers, 2013; Liu et al., 2021). LDA has several parameters that can influence results, such as the predefined k of topics or the number of most frequent words used. Moreover, a long text like a novel is too large for meaningful LDA topic modeling, and the typical solution is dividing the text into smaller chunks. We use an arbitrary chunk size of 1000 words. The third type of feature is modules generated with weighted correlation network analysis, also known as weighted gene co-expression network analysis (WGCNA) - a method of dimensionality reduction that detects clusters (or "modules") in networks (Langfelder & Horvath, 2008). WGCNA is widely used in genetics (Bailey et al., 2016; Ramírez-González et al., 2018), but has also shown promising results as a tool for topic modeling of fiction. We used it with either 1000 or 5000 most frequent words. Typically, WGCNA is used without chunking data, but, since chunking leads to better results in LDA, we decided to try WGCNA both with and without chunking, with a chunk size of 1000 words. All the parameters of WGCNA were kept at their defaults. Finally, as the fourth type of feature, we use doc2vec document-level embeddings (Lau & Baldwin, 2016; Le & Mikolov, 2014) that directly position documents in a latent semantic space defined by a pre-trained distributional language model - FastText.
Document representations in doc2vec depend on the features of the underlying model: in our study, each document is embedded in 300 dimensions of the original model. Doc2vec and similar word embedding methods are increasingly used for assessing the similarity of documents (Dynomant et al., 2019;Kim et al., 2019;Pranjic et al., 2020). As a result of Step 1b, we obtain a document-term matrix formed of texts (rows) and features (columns).
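The chunking step applied before topic modeling can be sketched as follows. The paper does not specify how a trailing partial chunk is handled; keeping it, as this sketch does, is an assumption.

```python
def chunk_text(tokens, size=1000):
    """Split a novel's token list into consecutive chunks of
    `size` tokens, as done before LDA topic modeling.
    A trailing partial chunk is kept (an assumption of this sketch)."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

toks = ["word"] * 2500
chunks = chunk_text(toks, size=1000)
print([len(c) for c in chunks])  # -> [1000, 1000, 500]
```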
Finally, we must learn the similarity between the texts represented with the chosen lists of features - by using some metric of distance (Step 1c in Figure 1). There exist a variety of metrics for this purpose: Euclidean, Manhattan, Delta, Cosine, and Cosine Delta distances, as well as Jensen-Shannon divergence (symmetrized Kullback-Leibler divergence) for features that are probability distributions (in our case, LDA topics and bag-of-words features).
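Two of these metrics can be sketched in plain Python: Jensen-Shannon divergence (base 2, so it is bounded by 1) and cosine distance. The sketch assumes that the inputs to JSD are probability distributions over the same features (e.g., the topic proportions of two novels); the function names are ours.

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence in bits; zero-probability terms are skipped
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence: symmetrized KL against the mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (na * nb)

p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
print(jensen_shannon(p, p))  # identical distributions -> 0.0
print(jensen_shannon(p, q))  # symmetric, between 0 and 1 in base 2
```

Note that cosine distance, unlike JSD, applies to any real-valued feature vectors, which is why it can also be combined with doc2vec dimensions and WGCNA module weights.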

Variants of Steps 1a, 1b, and 1c can be assembled in numerous combinations. In our "race of algorithms", each combination is a competitor - and a potential winner. Say, we could choose a combination of weak thematic foregrounding, LDA topics with 50 topics on 5000 most frequent words, and Euclidean distance. Or, medium thematic foregrounding, a simple bag-of-words with 10,000 most frequent words, and Jensen-Shannon divergence. Some of these combinations are researchers' favorites, while others are underdogs - used rarely, or not at all. Our goal is to map out the space of possible combinations - to empirically test how each combination performs in the task of detecting the thematic signal. In total there are 291 competing combinations.
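The combination grid can be enumerated programmatically as a cartesian product with a compatibility filter. The sketch below is illustrative, not the paper's exact 291-cell design: the real grid also varies MFW counts, the k of topics, and chunking, and the option names here are ours. What it shows is the filtering idea - Jensen-Shannon divergence is excluded for features that are not probability distributions.

```python
from itertools import product

foregrounding = ["weak", "medium", "strong"]
features = ["bow", "lda", "wgcna", "doc2vec"]
distances = ["euclidean", "manhattan", "delta", "cosine",
             "cosine_delta", "jensen_shannon"]

# Keep only compatible combinations: JSD requires probability
# distributions, so it cannot pair with WGCNA or doc2vec features.
combos = [
    (f, feat, d)
    for f, feat, d in product(foregrounding, features, distances)
    if not (d == "jensen_shannon" and feat in {"wgcna", "doc2vec"})
]
print(len(combos))  # -> 66 in this simplified grid
```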
Step 2. Sampling for robust results

A potential problem with our experiment could be that some combinations perform better or worse simply because they happen to suit our corpus of novels - for whatever reason. To reduce the impact of individual novels in our corpus, we use cross-validation: instead of analyzing the corpus as a whole, we analyze smaller samples from the corpus multiple times. Each sample contains 120 novels: 30 books from each genre. Altogether, we perform the analysis for each combination on 100 samples. For each sample, all the models that require training - LDA, WGCNA, and doc2vec - are trained anew.

Figure 1. The workflow of the study. The outer loop iterates over combinations of thematic foregrounding (Step 1a), feature type (1b), and distance metric (1c). For each such combination, a small loop is run: it randomly draws a genre-stratified sample of 120 novels (Step 2), clusters the novels using the Ward algorithm (Step 3), and validates the clusters on the dendrogram using the Adjusted Rand Index (Step 4). As a result of these four steps, each combination receives an ARI score: a score of its performance in detecting genres.
Step 3. Clustering

As a result of Step 2, we obtain a matrix of text distances. Then, we need to cluster the texts into groups - our automatically generated genre clusters, which we will later compare to the "true" clusters. For this, we could have used a variety of algorithms (e.g., k-means). We use hierarchical clustering with Ward's linkage (Ward, 1963): at each step, it merges the two clusters whose union yields the smallest increase in within-cluster variance. Although Ward's algorithm was originally defined only for Euclidean distances, it has been shown empirically to outperform other linkage strategies in text-clustering tasks (Ochab et al., 2019). We assume that novels from the four defined genres should roughly form four distinct clusters (as the similarity of texts within a genre is greater than the similarity of texts across genres). To obtain the groupings from the resulting tree, we cut it at the assumed number of clusters (four).
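Assuming SciPy is available, the clustering and tree-cutting of Step 3 can be sketched as follows. The toy matrix stands in for real text features; the two-group setup and the cut at two clusters are illustrative (the study cuts at four).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy stand-in for text features: two well-separated groups of "texts".
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (5, 3)),   # "genre A" texts
               rng.normal(5.0, 0.1, (5, 3))])  # "genre B" texts

Z = linkage(X, method="ward")                    # hierarchical tree, Ward's linkage
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)
```

When a non-Euclidean metric from Step 1c is used, `linkage` can instead be given a condensed distance vector (e.g., from `scipy.spatial.distance.pdist`), which is how a precomputed distance matrix would enter the pipeline.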
Step 4. Cluster validation

How similar are our generated clusters to the "true" genre populations? To learn this, we compare the clusters generated by each chosen combination to the original genre labels. For this, we use a measure of cluster validation called the adjusted Rand index (ARI) (Hubert & Arabie, 1985). The ARI score of a particular combination shows how well this combination performs in the task of detecting genres - and thus, in picking up the thematic signal. Steps 1-4 are performed for every combination, so that every combination receives its ARI score. At the end of the analysis, we obtain a dataset of 29,100 rows (291 combinations, each tested on 100 random samples).

Results

Figure 2 shows the average performance of all the combinations of thematic foregrounding, features, and distance metrics. Our first observation: the average ARI of the best-performing algorithms ranges from 0.66 to 0.7, which is rather high for the complicated, noisy data that is literary fiction. This gives additional support to the idea that unsupervised clustering of fiction genres is possible. Even a cursory look at the 10 best-performing combinations immediately reveals several trends. First, none of the top combinations have weak thematic foregrounding. Second, 6 out of 10 best-performing features are LDA topics. Third, 8 out of 10 distances on this list are Jensen-Shannon divergence. But how generalizable are these initial observations? How shall we learn the average "goodness" of a particular kind of thematic foregrounding, or a feature type, or a distance metric? To learn this, we need to control for their influence on each other, as well as for additional parameters, such as the number of most frequent words and chunking. Hence, we have constructed five Bayesian linear regression models (see Supplement 5.1).
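The adjusted Rand index itself is simple enough to compute from a contingency table. A minimal pure-Python version (the function name is ours; library implementations such as scikit-learn's `adjusted_rand_score` exist):

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index (Hubert & Arabie, 1985): chance-corrected
    agreement between two partitions; 1 = identical, ~0 = random."""
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    a = Counter(labels_true)   # cluster sizes in the first partition
    b = Counter(labels_pred)   # cluster sizes in the second partition
    index = sum(comb(nij, 2) for nij in contingency.values())
    sum_a = sum(comb(ai, 2) for ai in a.values())
    sum_b = sum(comb(bj, 2) for bj in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate partitions
        return 1.0
    return (index - expected) / (max_index - expected)

true = [0, 0, 1, 1]
print(adjusted_rand_index(true, [1, 1, 0, 0]))  # relabeled but identical -> 1.0
print(adjusted_rand_index(true, [0, 1, 0, 1]))  # maximally crossed: negative (close to -0.5)
```

Because ARI is invariant to cluster relabeling, the dendrogram cut from Step 3 can be compared directly to the genre tags without matching cluster numbers to genres.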
They answer questions about the performance of various combinations of thematic foregrounding, features, and distance metrics, helping us reach conclusions about the performance of individual steps of thematic analysis. All the results of this study are described in detail in Supplement 5.1. Below, we focus only on key findings.

Figure 3. The effect of thematic foregrounding (weak, medium, or strong) on clustering genres, stratified by feature type.

Conclusion 1. Thematic foregrounding improves genre clustering
The goal of thematic foregrounding was to highlight the contentful parts of the texts and to backdrop the stylistic parts. So, does stronger thematic foregrounding improve genre recognition? As expected, we have found that weak thematic foregrounding shows the worst performance across all four feature types (see Figure 3). For LDA and bag-of-words, it leads to drastically worse performance. At the same time, we do not see a large difference between the medium and strong levels of thematic foregrounding. The main addition at the strong level of thematic foregrounding is lexical simplification, which has not led to a noticeable improvement in genre recognition. The gains of using strong thematic foregrounding for document embeddings, LDA, and bag-of-words are marginal and inconsistent.

Conclusion 2. Various feature types show similarly good performance
Does the choice of feature type matter for the performance of genre clustering? We have found that almost all feature types can perform well. As shown in Figure 2, three out of four feature types - doc2vec, LDA, and bags of words - when used in certain combinations, can lead to almost equally good results. But how good are they on average? Figure 4 shows the posterior distributions of ARI for each type of feature used in our analyses - in each case, for the strong level of thematic foregrounding. As we see, doc2vec shows the best average performance, but this study has not experimented enough with the other parameters of this feature type. It might be that a different number of dimensions (e.g., 100 instead of 300) would worsen its performance. More research is needed to better understand the performance of doc2vec. LDA is the second-best approach - and, interestingly, the variation of parameters in LDA (such as the k of topics or the n of MFWs) does not increase the variance compared to doc2vec. The bag-of-words approach, despite being the simplest kind of feature, proves to be surprisingly good. It does not demonstrate the best performance, but it is not far behind doc2vec and LDA. At the same time, bags of words have a powerful advantage: simplicity. They are simpler to use and require fewer computational resources, meaning that in many cases they can still be a suitable choice for thematic analysis. Finally, WGCNA shows the worst ARI scores on average.

Conclusion 3. The performance of LDA does not seem to depend on k of topics and n of most frequent words
LDA modeling depends on parameters, namely the k of topics and the n of most frequent words, which must be decided, somewhat arbitrarily, before modeling. There exist algorithms for estimating a "good" number of topics, which help assess how many topics are "too few" and how many are "too many" (Sbalchiero & Eder, 2020). In our study, however, we find no meaningful influence of either of these choices on learning the thematic signal (Figure 5). The single most important factor, with a massive influence on the effectiveness of thematic classification, is thematic foregrounding. Weak thematic foregrounding (in our case, only lemmatizing words and removing the 100 most frequent words) proves to be a terrible choice that noticeably reduces ARI scores. Our study points towards the need for further systematic comparisons of various approaches to thematic foregrounding, as it seems to play a key role in the solid performance of LDA.

Conclusion 4. Bag-of-words approach requires a balance of thematic foregrounding and n of most frequent words
Using bags of words as features is the simplest approach in thematic analysis, but still an effective one, as we have demonstrated. But how does one maximize the chances that bags of words perform well? We have varied two parameters in the bag-of-words approach: the level of thematic foregrounding and the number of MFWs used. Figure 6 illustrates our findings: both of these parameters influence performance. Using 5000 instead of 1000 MFWs drastically improves ARI scores. Similarly, using medium instead of weak thematic foregrounding makes a big difference. At the same time, pushing these two parameters further - using 10,000 MFWs and strong thematic foregrounding - brings only marginal, if any, improvement in ARI scores.

Figure 6. The influence of the number of most frequent words, used as text features, on learning the thematic signal, measured with ARI. There is a positive relationship between the n of words and ARI, as well as between the level of thematic foregrounding and ARI. However, the middle parameter values of both (5000 MFWs and medium foregrounding) should be enough for most analyses.

Conclusion 5. Jensen-Shannon divergence is the best distance metric for genre recognition, Euclidean -the worst
Choosing the right distance metric is crucial for improving genre clustering. Figure 7 shows the performance of various distances for each type of feature (note that Jensen-Shannon divergence, which was formulated for probability distributions, could not be applied to doc2vec dimensions and WGCNA module weights). For LDA and bag-of-words, Jensen-Shannon divergence is the best distance, with Delta and Manhattan distances being highly suitable too. For doc2vec, the choice of distance matters less. Interestingly, Euclidean distance is the worst-performing distance for LDA, bag-of-words, and WGCNA. This is important because this distance is often used in text analysis, including in combination with LDA (Jockers, 2013; Schöch, 2017), while our study suggests that this distance should be avoided in computational thematic analysis. Cosine distance is known to be useful for authorship attribution, when combined with bag-of-words as a feature type. At the same time, cosine distance is sometimes used to measure the distances between LDA topic probabilities, and our study shows that it is not the best combination.

Figure 7. The influence of distance metrics on ARI scores, separately for each feature type. Note that Jensen-Shannon divergence could not be combined with WGCNA and doc2vec.

Comparison of algorithms on a larger dataset
How well does this advice apply to clustering other corpora, not just our corpus of 200 novels? A common problem in statistics and machine learning is overfitting: tailoring one's methods to a particular "sandbox" dataset, without making sure that these methods would work "in the wild". In our case, this means: would the same combinations of methods work well/poorly on other genres and other books than those included in our analysis? One precaution that we took to deal with overfitting was sampling from our genre corpus: instead of analyzing the full corpus just once, we analyzed smaller samples from it. But, additionally, it would be useful to compare the best-performing and the worst-performing methods against a much larger corpus of texts.
For this purpose, we use a sample of 5000 books from the NovelTM dataset of fiction, built from the HathiTrust corpus (Underwood et al., 2020). Unlike our small corpus of four genres, these books do not have reliable genre tags, so we could not simply repeat our study on this corpus. Instead, we decided to inspect how a larger sample of our four genres (detective, fantasy, science fiction, and romance) would cluster within the HathiTrust corpus. For this, we included all the books in these four genres that we could easily identify (see Supplement for details) and seeded them into a random sample of 5000 works of fiction. Then we clustered all these books using two approaches: a particularly bad combination of methods for identifying genres (weak thematic foregrounding, bag-of-words with 5000 words, cosine distance) and a particularly good one (medium thematic foregrounding, LDA on 1000 words with 100 topics, clustered with Delta distance). The result, visualized with two UMAP projections (McInnes et al., 2018), is shown in Figure 8. One combination of methods resulted in a meaningful clustering, while the other resulted in chaos. However, this is only a first step towards further testing the various algorithms of computational thematics "in the wild".

Discussion
This study aimed to answer the question: how good are various techniques of learning thematic similarities between works of fiction? In particular, how good are they at detecting genres - and are they good at all? For this, we tested various techniques of text mining, belonging to three consecutive steps of analysis: pre-processing, extraction of features, and measuring distances between the lists of features. We used four common genres of fiction as our "ground truth" data, based on a tightly controlled sample of books. Our main finding is that unsupervised learning can be effectively used for detecting thematic similarities, but algorithms differ in their performance. Interestingly, the algorithms that are good for computational stylometry (and its most common task, authorship attribution) are not the same as those good for computational thematics. To give an example, one common approach to authorship attribution - using limited pre-processing, with a small number of most frequent words as features, and cosine distance - is one of the least accurate approaches for learning thematic similarities. How important are these differences in a real-world scenario, not limited to our small sample of books? To test this, we contrasted one of the worst-performing combinations of algorithms with one of the best-performing combinations, using a large sample of the HathiTrust corpus of books.
Systematic comparisons between various algorithms for computational thematic analysis will be key to a better understanding of which approaches work and which do not - a requirement for assuring reliable results in the growing area of research which we suggest calling "computational thematics". Using a reliable set of algorithms for thematic analysis would allow tackling several large problems that remain unsolved in the large-scale analysis of books. One such problem is creating better genre tags for systematizing large historical libraries of digitized texts. Manual genre tags in corpora such as HathiTrust are often missing or highly inconsistent, which leads to attempts to use supervised machine learning, trained on manually tagged texts, to automatically learn the genres of books in the corpus overall. However, this approach, by design, captures only the genres we already know about, and not the genres we do not know exist: "latent" genres. Unsupervised thematic analysis can be used for this task. Another important problem that unsupervised approaches to computational thematics may be good at is the historical analysis of literary evolution. So far, we lack a comprehensive "map" of literary influences, based on the similarity of books. Such a map would allow creating a computational model of literary macroevolution, similar to the phylogenetic trees (Bouckaert et al., 2012; Tehrani, 2013) or rooted phylogenetic networks (Neureiter et al., 2022; Youngblood et al., 2021) used in cultural evolution research on languages, music, or technologies. Having reliable unsupervised algorithms for measuring thematic similarities would be crucial for any historical models of this sort. Also, measuring thematic similarities may prove useful for creating book recommendation systems. Currently, book recommendation algorithms are mostly based on the analysis of user behavior: ratings or other forms of interaction (Duchen, 2022).
Such methods are highly effective when user-generated data is abundant, as for songs or brief videos. However, for longer content types, which take more time to consume, the amount of user-generated data is much smaller. Improving the tools for content-based similarity detection in books would allow recommending books based on their content, as is already happening with songs: projects such as Spotify's Every Noise at Once (https://everynoise.com/) combine user behavior data with the acoustic features of the songs themselves to learn similarities between songs and recommend them to listeners.
This study is a preliminary attempt at systematizing various approaches to computational thematics. More work is needed to further test its findings and to overcome its limitations. One apparent limitation is the concept of "ground truth" genres: it may be noted, rightly, that there are no "true" genres and that genre tags overall may not be the best approach for testing thematic similarities. As a further step, we envision using large-scale user-generated tags from Goodreads and similar websites as a proxy for "ground truth" similarity. This study has also certainly not exhausted all possible techniques for computational thematics. For example, much wider testing of vector models, such as doc2vec, but also BERTopic (Grootendorst, 2022) or Top2Vec, is an obvious next step, as is testing other network-based methods for community detection (Gerlach et al., 2018). Likewise, text simplification may hold considerable potential for thematic analysis and must be tested further. Perhaps the most straightforward way to test our findings would be attempting to replicate our results on other genre corpora, containing more books or other genres; testing these methods on books in other languages is also critical. The approach taken in this paper offers a simple analytical pipeline, and we encourage other researchers to use it for testing the various other computational approaches. Such a communal effort will be key to assuring robust results in the area of computational thematics.

Competing interests
The author(s) declare no competing interests.

Ethical approval
The study included no human or non-human participants, and thus requires no ethical approval.

Corpus summary
The corpus was constructed so that books roughly span the same time period across genres (Figure S1); also, each genre subcorpus includes no more than three books per author (Figure S2). The total number of authors contributing to each genre was also similar across subcorpora.

Thematic foregrounding: weak
At the first level of thematic foregrounding we remove the 100 most frequent words (MFWs) from the analysis. The top 100 MFWs roughly correspond to the function words (or closed-class words) of English that are routinely used in authorship attribution, starting from the classical study of The Federalist Papers (Mosteller & Wallace, 1963). Removing MFWs is a cheap way to lower the impact of style (which heavily depends on grammar and syntactic differences) in favor of semantics and content.
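The MFW-removal step can be sketched as follows (a minimal illustration with a hypothetical token list, not from our corpus; in the paper n=100):

```python
from collections import Counter

def remove_top_mfw(tokens, corpus_tokens, n=100):
    """Drop the n most frequent word types of the corpus from a token list."""
    mfw = {w for w, _ in Counter(corpus_tokens).most_common(n)}
    return [t for t in tokens if t not in mfw]
```

Note that the MFW list is computed over the whole corpus, so the same word types are removed from every book.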

Thematic foregrounding: medium
At the second level of thematic foregrounding, words are pruned systematically based on morphology: we allow only nouns, adjectives, verbs and adverbs (auxiliary verbs are excluded as well). We also remove named entities and proper nouns, which might be specific to an author or a series of novels. Morphological tagging and named entity recognition were done with a basic spaCy language model for English, chosen for its accessibility.
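The filtering logic can be sketched over spaCy-style part-of-speech tags (a minimal sketch; the tagged triples here are hypothetical, while in the pipeline they would come from a spaCy document via `token.pos_` and `token.ent_type_`):

```python
# Keep only open-class words; auxiliaries and named entities are dropped.
KEEP_POS = {"NOUN", "ADJ", "VERB", "ADV"}

def filter_tokens(tagged):
    """tagged: iterable of (token, pos_tag, is_entity) triples."""
    return [tok for tok, pos, is_ent in tagged
            if pos in KEEP_POS and not is_ent]
```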
We did not use an external list of stopwords, since such lists are often arbitrary, can significantly alter results, and are dominated by industry (specifically, information retrieval) standards. Recently, there has been a tendency to minimize stopword usage, or to avoid stopword lists completely in tasks like topic inference over a collection of documents (e.g., the top2vec algorithm).

Thematic foregrounding: strong
The third level of thematic foregrounding includes the steps from the medium level and adds naive semantic simplification. We reduce the sparseness of the feature space by turning less frequent words into more frequent words from similar semantic domains: a word outside of the 1000 MFWs is replaced with its closest semantic neighbor (out of the 10 closest neighbors) if that neighbor is itself among the 1000 MFWs. To infer semantic similarity we use an off-the-shelf FastText model, which includes 2M words and is trained on English Wikipedia, providing a slice of 'modern' language use. Again, this model is easily accessible and scalable to different tasks or languages.
Table S1 presents a random example of 20 semantic replacements.
As seen from the examples, this lexical simplification can loosely sort target words into semantic domains represented by their more frequent semantic neighbors and, in some cases, clean the original texts (loove -> claim). Noise is present too, stemming both from the domain-specific language of the underlying word2vec model (download -> free) and from the lack of context-based semantic disambiguation (filmclip -> song).
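The replacement rule can be sketched as follows (the nearest-neighbor lookup is stubbed with a toy dictionary; in our pipeline it would be the FastText nearest-neighbor query):

```python
def simplify(tokens, mfw_1000, neighbors):
    """Replace each word outside the MFW list with the first of its (up to)
    10 nearest semantic neighbors that is itself in the MFW list."""
    out = []
    for tok in tokens:
        if tok in mfw_1000:
            out.append(tok)
            continue
        repl = next((nb for nb in neighbors.get(tok, [])[:10]
                     if nb in mfw_1000), None)
        out.append(repl if repl is not None else tok)
    return out
```

Words with no qualifying neighbor are kept as-is, so the simplification only ever maps rare words onto frequent ones.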
Finally, Figure S3 shows the filtering effects that the different pre-processing strategies have on the corpus. The largest drop in word-type diversity, predictably, happens after morphological filtering at medium thematic foregrounding; our naive lexical simplification removes another 5% of word types while preserving the number of tokens.

Bag of words
A classic multivariate representation of texts as bags of words. We follow the stylometric tradition that treats any weighting as part of the distance measure (e.g., Burrows' Delta is scaled Manhattan distance; see more about scaling features and vector-length normalization in ), so we only transform word counts into relative frequencies, ultimately treating a text as a probability distribution over an ordered set of words (arranged by frequency) and defined by an MFW cut-off. Different weighting techniques (TF-IDF, logarithmic transformation, etc.) are widely used in information retrieval, but in our setup any such weighting would belong to the distance measure rather than the representation.
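The representation reduces to relative frequencies over a fixed, MFW-ordered vocabulary (a minimal sketch):

```python
from collections import Counter

def bow_vector(tokens, vocab):
    """Relative frequencies of `tokens` over an MFW-ordered `vocab`.
    The vector sums to the share of tokens covered by the vocabulary."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total for w in vocab]
```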

Topic probabilities (LDA)
Latent Dirichlet Allocation (Blei et al., 2003) is the most widely used probabilistic algorithm for topic modeling, and it still performs competitively with newer methods (Harrando et al., 2021). LDA infers groups of words (topics) based on their co-occurrence in documents. Because LDA is generative, we can in turn represent each document as a probability distribution over topics. This compact lexical representation also makes the feature space more interpretable. We use the topicmodels LDA implementation in R (Grün & Hornik, 2011). We vary the number of topics and the number of MFWs used. We leave the hyperparameters alpha and delta at their 0.1 defaults and do not rely on coherence/perplexity measures, since we do not aim to fine-tune the LDA to a particular corpus; there is also empirical evidence that perceived LDA performance does not fully align with such validation measures (Hoyle et al., 2021; see also Antoniak, 2022 for a summary of research on LDA performance).
An important pre-processing step for LDA is chunking of texts. A complete novel is too large a context for inferring topics: in large documents, too many words co-occur with too many other words. Thus, instead of representing each novel as one bag of words, we represent it as many smaller bags of words from consecutive parts (chunks) before training an LDA. We use an arbitrary chunk size of 1000 tokens, though other structural cues (paragraphs, pages, chapters) might also serve as chunk boundaries. We aggregate the probabilities from these smaller documents back to a single novel by averaging the probability distributions (taking a centroid). Table S2 demonstrates a sample of topics (10 most probable words per topic) in a model built on texts at the medium level of thematic foregrounding, with 100 topics and a document-term matrix (DTM) cut at 1000 MFWs:

topic  terms
2      ship captain space sea leave time control war send command
3      hand eye head smile sit nod shake stand hold voice
4      father mother son daughter family child sister die live home
7      answer question speak doctor word call talk moment voice reply
8      feel eye smile walk moment stand sit suddenly hand slowly
11     human world planet life time system people space race war
12     dog bone animal bird mouth leg stick foot head eat
15     magic creature power castle change animal form head tree human
16     dragon fire wing fly eye head hold time land shoulder
19     time reach feel moment start completely hope system begin mind

Topics clearly capture thematic groups like locations and settings, and are often linked to actions and relationships.
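The chunk-then-aggregate step can be sketched as follows (topic inference itself is stubbed out; in our pipeline each chunk's topic distribution comes from the trained LDA):

```python
def chunk(tokens, size=1000):
    """Split a novel's token list into consecutive chunks of `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def novel_topic_vector(chunk_distributions):
    """Centroid of per-chunk topic distributions: one vector per novel."""
    k = len(chunk_distributions[0])
    n = len(chunk_distributions)
    return [sum(d[t] for d in chunk_distributions) / n for t in range(k)]
```

Averaging the per-chunk distributions keeps the novel-level vector a probability distribution over topics.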

Module's weights (WGCNA)
Weighted gene correlation network analysis (WGCNA) is similar to LDA but comes from a different research field: genetics. Some point out its promising features for text analysis, such as relative independence from high-frequency function words. WGCNA has one advantage over LDA: there is no need to guess the optimal number of topics, as WGCNA "modules" are determined automatically from a network of similarity in behavior between traits. Internally, WGCNA already relies on hierarchical clustering to derive modules that describe the variation across individuals/documents, and it can be greedy, reducing the word behavior of distinct genres to only one or two modules, especially if the analyzed texts have been chunked.
An example of this behavior from one of the sampling runs (120 novels) is presented in Figure S4. WGCNA was run on chunked novels, with medium thematic foregrounding and 5000 MFWs, using the implementation by Langfelder & Horvath (2008). The algorithm derived only one module of words, which shows almost perfectly opposite expression in detective and fantasy fiction; unsurprisingly, these are the easiest genres to distinguish. However, one module is too greedy when it comes to clustering: romance and sci-fi are simply mixed into the two other distinct genres (sharing more similarity with detectives than with fantasy). To find the most defining words for this global module we use a connectivity measure: here, these are the words with the highest positive correlation to an "eigengene", the joint expression of a module across samples/documents. Figure S5 shows the 20 words most correlated with the module from the same sample as in Figure S4. These are clearly words from a police-procedural universe and, more generally, words of a 'modern-like' urban setting, which also explains the module's expression in romance and science fiction. Conversely, the most inversely correlated words point to open spaces of adventure, magic and medieval attributes.
To give an example of WGCNA producing several meaningful distinct modules akin to LDA topics, we can use another model without chunking (medium thematic foregrounding, 5000 MFWs). Examples of the 10 most closely correlated words for a sample of modules are presented in Table S3. Unsurprisingly, removing chunking makes modules closely associated with specific books:

module  words
1       bugger fighter commander formation strategy practice maneuver video bunk enemy
4       nuclear crisis mayor empire tech scientific science trader policy navy
9       dimension magician disguise assassin kid demon grumble flagon terrific mumble
10      hairy star meadow unicorn chain stall caravan gap innkeeper wax
12      sellsword tyrion Arya godswood maester raven eunuch ranger direwolf knight
16      menion mystic flick beyond sentry awesome massive attacker terrain quickly
19      oblige beg daresay countenance acquaint contrive disposition lordship shocking fashionable
24      laird iain elder outsider topic announce nudge blurt agreement argue
25      camera cowboy truck bridge vest vegetable brandy magazine corn bracelet
26      runciter talent organization anti spray employee commercial tv anyhow elevator

Document embeddings (doc2vec)
Doc2vec directly embeds documents into a latent semantic space defined by a distributional language model. In the end, each document is represented as a vector in this n-dimensional space (the dimensionality depends on the underlying model). Again, we embed each novel split into chunks, in order to capture semantic variation at a small scale, and then average the vectors to obtain a single-vector-per-novel representation (a centroid of chunk vectors). We use chunks of 800 words, a pretrained FastText model for vector representation of word semantics (300 dimensions, 2M words, trained on Wikipedia), and a doc2vec implementation that allows fine-tuning and follows Angelov's algorithm (2020). Figure S6 shows UMAP projections of averaged novel vectors.

Distance measures
We infer similarity between novels by calculating pairwise distances between representations/vectors. We test several classic distances used for measuring text similarity (and widely used in stylometry): Euclidean, Manhattan, Burrows' Delta (scaled Manhattan), cosine, cosine delta (scaled cosine) and Jensen-Shannon divergence.
Euclidean: the square root of the sum of squared pairwise differences in features.
Manhattan: the sum of absolute pairwise differences in features (cityblock distance).
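These measures can be written compactly (a sketch; the "delta" variants, Burrows' Delta and cosine delta, are simply the Manhattan and cosine distances applied to feature vectors z-scored across the corpus, which we assume has been done to the inputs):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # On z-scored vectors this is (unnormalized) Burrows' Delta.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_dist(a, b):
    # On z-scored vectors this is cosine delta.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (na * nb)

def jensen_shannon(p, q):
    """JS divergence between two probability distributions (base-2 logs)."""
    def kl(r, s):
        return sum(x * math.log2(x / y) for x, y in zip(r, s) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return (kl(p, m) + kl(q, m)) / 2
```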

Clustering
We use hierarchical clustering with Ward's linkage, which merges items so as to minimize within-cluster variance; in principle, any other clustering algorithm could have been used (e.g., k-means). Although Ward's algorithm was originally defined only for Euclidean distances, it has been empirically shown to outperform other linkage strategies in text-clustering tasks (Ochab et al., 2019).
We assume that novels from the four defined genres should roughly form four distinct clusters (similarity of texts within a genre is greater than similarity of texts across genres). To obtain groupings from a resulting tree, we cut it at the number of assumed clusters (k=4). We then compare the resulting classes to the ideal clustering using the Adjusted Rand Index (ARI), which has similarly been used for unsupervised clustering of literary texts. An ARI of 1 corresponds to a perfect classification, while 0 means the clustering is no better than random.
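The ARI can be computed directly from the contingency table of the two partitions (a self-contained sketch; in practice a library implementation such as sklearn's `adjusted_rand_score` would be used):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI: chance-corrected agreement between two partitions."""
    n = len(labels_true)
    pair = Counter(zip(labels_true, labels_pred))  # contingency cells
    a = Counter(labels_true)   # row sums
    b = Counter(labels_pred)   # column sums
    sum_ij = sum(comb(c, 2) for c in pair.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate partitions
        return 0.0
    return (sum_ij - expected) / (max_index - expected)
```

Note that the chance correction makes mildly negative values possible (worse-than-chance agreement).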

Dendrogram of all novels
To provide an example of clustering performance, we build a dendrogram for all the novels in the four genres (Figure S7). The underlying features are document embeddings at the medium level of thematic foregrounding, with cosine distance used for dissimilarity. Branch colors are based on the majority genre among neighboring leaves. The Adjusted Rand Index of the tree presented below is 0.786.

Confusion matrices
As seen in several figures above (S6, S7), genres differ in clustering consistency: detective and fantasy books group together better than science fiction and romance. To quantify this difference, we create a confusion matrix, based on all 100 cross-validation runs, showing the dispersion of books across the four clusters. Since this is not supervised classification, the confusion matrix requires a heuristic to determine which clusters correspond to which genres in each clustering tree, and so can only show approximate results (we consider a cluster the 'detective' cluster if the majority of books in it are detectives).
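The cluster-to-genre assignment heuristic is simple majority voting (a sketch):

```python
from collections import Counter

def label_clusters(cluster_ids, genres):
    """Assign each cluster the majority genre of its members."""
    members = {}
    for cid, g in zip(cluster_ids, genres):
        members.setdefault(cid, []).append(g)
    return {cid: Counter(gs).most_common(1)[0][0]
            for cid, gs in members.items()}
```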
This confusion matrix presents the total share of labeled novels that end up in different clusters across 29100 confusion matrices (100 samples, 291 clustering rounds in each). In the case of perfect clustering, the diagonal of the matrix would contain only ones. As expected, the most diffused genres are romance (often grouped with detectives, 30% of hits) and science fiction (often grouped with fantasy, 24% of hits).
However, not all the methods summarized in the matrix above are equal, and some distance measures (like Euclidean for bags of words) are 'bad choices' by default. To trim the matrix, we follow the same strategy we employ for modeling: use only the best-suited distance measure for each method and remove chunked WGCNA, which proved a poor choice for thematic clustering. The "good" clustering numbers are now higher, but the difference between romance and science fiction becomes more pronounced: romance tends to form much more diffused clusters than science fiction (a tendency visible in Figure S6).
Do different methods show different sensitivity to genres and cluster formation (Figure S8)? Overall, the same pattern holds across all methods, which is to be expected: they all rely on the same lexical frequency-based information. Doc2vec has an advantage, since it uses an external representation of word co-occurrence based on a very large corpus, but its higher numbers compared to LDA should not be taken at face value: doc2vec has fewer degrees of freedom (and, as a result, fewer ways to fail), being used in only 3 different combinations per sample, while LDA, for instance, was used in 27. Figure S9 shows the overall distribution of ARI values, with and without the chunked WGCNA option. The concentration of values at zero comes from distance choices that are inadequate for a given feature space (e.g., Euclidean for bags of words). When the data is filtered by better-performing distances, the distribution is not zero-inflated (see Figure S12). Alongside distance calculation and hierarchical clustering, we ran k-means clustering (with k = 4), but its average performance in separating books into four clusters (Figure S10), as measured by ARI, was considerably worse.

Distance selection
To simplify inference, we deal only with results obtained with the best-suited distance measure for each feature type. This removes the factor of distances altogether and equalizes the models' chances for comparison. There is no good reason to lump results from different distance measures together, since different kinds of data (e.g., probabilities vs. feature weights) have different sensitivity to distance selection, while some distances were not measured for some feature types (JSD for WGCNA and doc2vec).
To choose the suited distances we fit a simple model, ari ~ 1 + feature * distance, to obtain estimates of each distance measure's performance with each feature type (Figure S11). All further models were built using the distances with the highest posterior averages.

General model: effect of thematic foregrounding
What is the effect of thematic foregrounding for different feature types? For this model, the data was filtered by removing chunked WGCNA results and selecting the distances with the highest averages.
We fit a multilevel model with an interaction between feature type (Feature) and the level of thematic foregrounding (Level), pooled by individual samples. In the R library brms formula notation: ari ~ 1 + Feature * Level + (1|sample). We use regularizing priors for the 'intercept' and 'slope' coefficients. (We use dummy coding with the brms interface for categorical variables, so coefficients represent the difference between a Feature-Level combination and the reference 'intercept', which is doc2vec at level 1; a slope prior of Normal(0, 0.1) does not expect any difference on average.) All further models have the same structure and priors. We model the level of thematic foregrounding as a categorical variable, not an ordinal one, because we constructed the 'levels' artificially: there might not be any ordered relationship between them. That said, modeling via monotonic effects would still work, and the resulting models are similar (as shown by leave-one-out cross-validation in Table S6). Additionally, including varying slopes for individual samples does not improve model prediction much, which suggests that, across the 100 samples of texts, methods and thematic foregrounding behaved similarly relative to each other. Since adding slopes to random effects can complicate model fitting and chain convergence, we only fit models grouped by samples. Multilevel models with group-level effects for individual samples are always a better fit than those without: they allow the model to be more uncertain about the mean estimates, since clustering results differ notably from sample to sample.
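Under dummy coding, the multilevel model above might be written out as follows (a sketch of our reconstruction, not the exact specification; only the Normal(0, 0.1) slope prior is stated in the text, so the intercept and variance priors shown are placeholders):

```latex
\begin{aligned}
\mathrm{ARI}_i &\sim \mathrm{Normal}(\mu_i, \sigma) \\
\mu_i &= \alpha + \alpha_{\mathrm{sample}[i]}
        + \beta_{F[i]} + \beta_{L[i]} + \beta_{F[i] \times L[i]} \\
\beta &\sim \mathrm{Normal}(0, 0.1)
        && \text{(all difference coefficients)} \\
\alpha &\sim \mathrm{Normal}(\mu_0, \sigma_0)
        && \text{(intercept: doc2vec at level 1; placeholder prior)} \\
\alpha_{\mathrm{sample}} &\sim \mathrm{Normal}(0, \sigma_{\mathrm{sample}})
        && \text{(group-level intercepts per sample)}
\end{aligned}
```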
The left-hand side of Figure S13 shows posterior ARI means for each level and each feature type. The right-hand side shows the same relationship with the mean taken marginal of samples: the credible intervals are now much wider.
At medium and strong thematic foregrounding, three out of four feature types behave similarly, with doc2vec having the upper hand. We can directly compare their posterior distributions (Figure S14); dotted lines represent the mean of the distribution for each feature. Sampling introduces considerable variation into the behavior of all feature types. We can use posterior predictions to check differences in specific samples (10 samples drawn at random, Figure S15). Note that doc2vec has only one observation per sample for each level, but the model uses the grand mean to keep estimates conservative.

Overall best performance, distances filtered
To get an overall picture of comparable method performance, we filter the results by the selected distances only. Figure S16 shows ARI boxplots for each of the 51 combinations. First, we look at the direct effects of the number of topics and of MFWs on ARI, across all levels of thematic foregrounding and marginal of novel samples (Figure S17). It appears that LDA with a larger number of topics and a smaller number of MFWs performs slightly better on average. LDA models with 100 topics also show the smallest variance in performance across sampling runs. These effects, however, mostly come from the corpus with weak thematic foregrounding, as seen in Figure S18 (bars mark posterior .95 CIs, shaded dots show empirical LDA results). Again, the level of thematic foregrounding has the largest influence on LDA performance. At the medium and strong levels, however, the impact of topics and MFWs is less clear: on average, increasing the number of topics tends to improve clustering for a small number of features, while the effect is reversed for a large number of features (a smaller number of topics has a slight edge; see Figure S19). Overall, the choice of the number of topics and features is more critical in a corpus without pre-processing and becomes less influential when features are foregrounded.

Bags of words
Again, we fit a Bayesian multilevel interaction model. Two factors drive the performance of bag-of-words features: the level of thematic foregrounding and the length of the vector (number of MFWs). In brms notation: ari ~ level * MFWs + (1|sample_no). Figure S20 shows that clustering with bags of words improves on average with longer vectors: since there is no algorithm that summarizes similarity in the behavior of individual words, word frequencies depend more on diverse lexical pools and sparse DTMs. This might be a suboptimal way to model texts, since the final clustering would rely on groups of present/absent words rather than the actual distribution. We would also expect results to plateau if the length of the bag of words were increased further; the plateau is better visible when posterior estimates are averaged over foregrounding levels and taken marginal of samples (Figure S21).

WGCNA
We model three factors in WGCNA performance: chunking, level and MFWs: ari ~ chunking * level * MFWs + (1 + chunking + level + MFWs | sample_no). First, Figure S22 clearly confirms that chunking texts drastically reduces the performance of clustering with WGCNA modules, because of the greedy module identification problem (see Section 2.3). Figure S23 shows posterior means for different MFW cut-offs and thematic foregrounding levels (only for models without chunking).
Non-chunked WGCNA, on average, benefits mostly from medium thematic foregrounding and increasing MFWs.

doc2vec
There is only one predictor for the behavior of doc2vec in our setup: the level of thematic foregrounding. We fit a model with varying slopes per novel sample (the Bayesian framework handles single observations per sample just fine): ari ~ 1 + level + (1 + level|sample_n). Doc2vec embeddings perform similarly across the different levels of thematic foregrounding (Figure S23), which is not surprising, since doc2vec uses an external representation of semantics and does not depend much on word filtering. However, there is a steady increase in ARI, which suggests that filtering words and simplifying the lexicon can improve document representation, even when the same model is used both for semantic similarity scores and for document embeddings.

Clustering HathiTrust corpus
To test whether our results hold in the 'outside' world, we turned to the HathiTrust corpus of fiction and sampled 5000 "unknown" novels from the same period (books released after 1950). We could not simply use our small target corpus as a seed of "known" novels, because HathiTrust does not provide the original texts: only per-page token counts alongside morphological tagging. It is still possible to train an LDA model with this data, but not to reproduce our spaCy pre-processing steps exactly. In addition, many books from our corpus had no match in the HathiTrust data.
We therefore used another approach. We found all 97 authors from our four-genre dataset in the HathiTrust corpus and marked all books by these authors as belonging to the corresponding genre. For example, while our original dataset contained only 3 novels by Agatha Christie, HathiTrust contains 71 novels by her; we labeled all of them as "detective" (which, of course, is a simplification). The distribution of books across the four genres acquired this way is shown in Figure S25. Table S12 shows the 10 authors with the largest number of books.
We chose two combinations of methods to show the difference between 'better' and 'worse' approaches. We compare their performance by projecting all 6293 novels into two dimensions with UMAP, expecting the better option to retain visible clusters by genre. Figure S26 sets the two projections side by side.