Linguistic measures of chemical diversity and the “keywords” of molecular collections

Computerized linguistic analyses have proven of immense value in comparing and searching through large text collections (“corpora”), including those deposited on the Internet; indeed, it would nowadays be hard to imagine browsing the Web without, for instance, search algorithms extracting the most appropriate keywords from documents. This paper describes how such corpus-linguistic concepts can be extended to chemistry based on characteristic “chemical words” that go beyond traditional functional groups and instead capture the common structural fragments that molecules share. Using these words, it is possible to quantify the diversity of chemical collections/databases in new ways and to define molecular “keywords” by which such collections are best characterized and annotated.

Chemical words and vocabularies. (a) Illustration of a common maximal substructure, MCS (colored red), between two molecules, formoterol (an anti-asthmatic/COPD drug) and morphine. (b) Blue lines are statistics of distinct MCS "words" for the entire 1.75-million-word chemical vocabulary and over 100 randomly chosen subsets of Reaxys molecules (each subset with 500 to 9,000 molecules and 124,750 to 40,495,500 word tokens). The red, green, and orange lines are the distributions of words in, respectively, Conan Doyle's collected works, Joyce's "Finnegans Wake" novel, and Shakespeare's works. All dependencies are rescaled by the number of words/molecules in a given set. As seen, the distributions for all sets are similar. (c) Examples of chemical words: those in the upper row are popular but not very specific fragments, whereas those in the lower row are less popular but immediately signal a specific group of chemicals (from left to right: 1st = penicillins and cephalosporins, 2nd = coumarins, 3rd = carbohydrates, 4th = steroids). Note that the structures shown are molecular fragments, not actual molecules with correct valences (e.g., if oxygen is monovalent, it can be attached to H, alkyl, aryl, etc.).

Molecular collections.
With the chemical words defined as above, we performed linguistic analyses of various types of chemical collections: (1) a set of ca. 104,000 unique molecules chosen at random from the Reaxys repository (www.reaxys.com) for which we calculated (using RDKit, version 2015.09.2, http://www.rdkit.org/) 668,000,000 unique pairwise comparisons (chosen at random; 14% of the total possible five billion molecule-to-molecule comparisons); (2) multiple 1,000-molecule subsets of (1); (3) a set of 1,000 natural products chosen randomly from 1,489 natural-product entries in the Zinc Database (http://zinc.docking.org/catalogs/specsnp); (4) 1,000 FDA-approved drugs chosen at random from 1,800 drugs deposited in https://www.drugbank.ca/; and (5) ten samples of 1,000 molecules each from the libraries sold by Mcule (www.mcule.com), a leading commercial provider of compound libraries. The "vocabularies" derived from these collections comprised 1.75 million unique MCS words for (1), and tens of thousands of words for the other, smaller collections.

Results and Discussion
Linguistic measures of chemical diversity. We first considered a diversity measure called the type-token ratio, TTR, which is used widely in corpus linguistics 22 to quantify the lexical and morphological richness of a language 23, improvement in writing skills 24, or the individual styles of authors 25. In some languages, TTR does not depend on the text genre (e.g., in Czech 25), whereas in others the differences are pronounced, also between written and spoken language (e.g., in English 26). TTR is simply the ratio of the number of unique words to the total number of words in a given text. For instance, the opening sentence of Arthur Conan Doyle's "A Study in Scarlet" reads as follows: "In the year 1878 I took my degree of Doctor of Medicine of the University of London, and proceeded to Netley to go through the course prescribed for surgeons in the army." This sentence comprises 32 words ("tokens"), of which 24 are unique "types" (since "the" and "of" each occur four times while "in" and "to" each occur twice), such that TTR = 24/32 = 0.75.
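The worked example above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names `tokenize` and `type_token_ratio` are illustrative, and the tokenization rule (lowercasing, so that "In" and "in" count as one type, and ignoring punctuation) is an assumption chosen to match the counts quoted in the text.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (assumed rule: letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def type_token_ratio(tokens):
    """Ratio of unique tokens (types) to all tokens."""
    if not tokens:
        raise ValueError("cannot compute TTR of an empty token list")
    return len(set(tokens)) / len(tokens)

sentence = (
    "In the year 1878 I took my degree of Doctor of Medicine of the "
    "University of London, and proceeded to Netley to go through the "
    "course prescribed for surgeons in the army."
)

tokens = tokenize(sentence)
n_tokens = len(tokens)        # 32 tokens
n_types = len(set(tokens))    # 24 types
ttr = type_token_ratio(tokens)
print(n_tokens, n_types, ttr)
```

The same function applies unchanged to a list of chemical words, provided each MCS fragment is represented by a hashable identifier (for example, a canonical string).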
Our TTR analyses were performed using subsets of 50,000 words chosen randomly from larger vocabularies characterizing a given collection of molecules (for a collection of size n, the vocabulary comprises n(n − 1)/2 MCSs derived from pairwise molecule-to-molecule comparisons). These analyses give TTR values of 0.1058 for randomly chosen molecules (collection (2), with averaging done over 100 subsets), 0.2051 for the collection (3) of natural products, and 0.1469 for the collection (4) of drugs. Quite remarkably, this linguistic richness of chemical collections is commensurate with that of the works of Shakespeare (0.1296) and Conan Doyle (0.1228), but is lower than that of Joyce's novel "Finnegans Wake" (0.3385), which is known to be a linguistic outlier with an extremely rich word inventory drawn from numerous languages. Within chemistry, natural products are more diverse than drugs, and both types of sets are more diverse than an equally-sized sample of molecules taken at random from Reaxys. In making such comparisons, however, it must be remembered that they are strictly valid only for text samples of the same length. This is so because TTR is sensitive to, and generally decreases with, the length of the input text 18 (as common words start repeating). One way around this problem is to divide the text into equal-size parts and then take an average over the TTRs of these parts. Another approach is to use moving averages, in which a "window" of a given length is moved over the text and the TTR scores are averaged over all window positions 27. Kubát & Milička 25 suggested that one can also calculate the distribution of the number of windows enclosing text of a given TTR. We follow this Moving Window TTR, MWTTR, approach in Fig. 2a, which plots the distributions of TTR values within moving windows that are 1,000 words (for literary samples) or 1,000 chemical words (for sets of molecules) long.
As seen, the MWTTR measure preserves the ordering of simple TTR but (i) the differences between the samples are more spread out and (ii) natural products are now more diverse than Joyce's novel (see Supplementary Fig. 2 for further illustration of the linguistic uniqueness of "Finnegans Wake").
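The moving-window procedure can be sketched as follows. This is an illustrative implementation rather than the code used in the study: the function name `moving_window_ttr`, the window step of one token, and the toy token list are all assumptions. The sketch returns one TTR value per window position, from which either the average or the full distribution (as plotted in Fig. 2a) can be obtained.

```python
from collections import Counter

def moving_window_ttr(tokens, window):
    """Return the TTR of every window of `window` consecutive tokens (step of 1)."""
    if window <= 0 or window > len(tokens):
        raise ValueError("window must be between 1 and len(tokens)")
    counts = Counter(tokens[:window])        # token counts inside the current window
    ratios = [len(counts) / window]
    for start in range(1, len(tokens) - window + 1):
        outgoing = tokens[start - 1]         # token leaving the window
        incoming = tokens[start + window - 1]  # token entering the window
        counts[outgoing] -= 1
        if counts[outgoing] == 0:
            del counts[outgoing]             # keep len(counts) equal to the number of types
        counts[incoming] += 1
        ratios.append(len(counts) / window)
    return ratios

# Toy example with hypothetical "words"; real use would pass ~1,000-word windows.
toy_tokens = ["a", "b", "a", "c", "a"]
toy_ratios = moving_window_ttr(toy_tokens, window=3)
mean_mwttr = sum(toy_ratios) / len(toy_ratios)
print(toy_ratios, mean_mwttr)
```

Updating the counts incrementally keeps the cost linear in the number of tokens, which matters when the window slides over vocabularies with tens of thousands of entries.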
Based on the above considerations, we conclude that while various forms of TTR give qualitatively similar rankings of molecules' diversity, certain differences between these measures exist; indeed, in linguistics, TTR alone is considered too simple a measure to provide reliable information about word distributions. Consequently, TTR is often supplemented by other metrics, especially those that plot the growth rate of unique vocabulary as a function of text length, that is, by curves plotting the number of new words ("word types") one encounters while reading the text (comprised of "word tokens"). Such curves are described by the so-called Herdan's law (also known as Heaps' law 28,29), $V_R(n) = K n^{\beta}$, where $V_R$ is the number of distinct words in a text of size $n$, and $K$ and $\beta$ are free parameters that are determined empirically. Dashed lines in Fig. 2b trace vocabulary growth for the works of Joyce, Shakespeare, and Conan Doyle considered before: for the latter two authors, the number of new words starts levelling off relatively early; in contrast, Joyce's "Finnegans Wake" keeps surprising the reader with new vocabulary until the very end. Extending this representation to the words of chemistry (solid lines) and scanning through the vocabularies derived from our various chemical collections, we see that natural products and drugs show similar trends (though for the natural products, the rate of increase is initially steeper) whereas the vocabulary of common chemicals from Reaxys is more constrained. In other words, drugs and natural products are again more internally diverse than random chemicals. We emphasize that in all cases, the curves fit Heaps' law well, with $R^2$ values as high as 0.99 for linguistic corpora and 0.98 for chemical data.
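To make the vocabulary-growth analysis concrete, the sketch below builds a growth curve from a token sequence and estimates the two parameters of the power law by ordinary least squares on log-transformed data. The fitting procedure is an assumption (the text does not state how the fits were obtained), and the names `vocabulary_growth` and `fit_heaps_law` are illustrative.

```python
import math

def vocabulary_growth(tokens):
    """Number of distinct words seen after reading 1, 2, ..., len(tokens) tokens."""
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth

def fit_heaps_law(sizes, vocab_sizes):
    """Fit V(n) = K * n**beta by linear regression of log V on log n (assumed method)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in vocab_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    K = math.exp(mean_y - beta * mean_x)
    return K, beta

# Growth curve for a toy token list.
toy_growth = vocabulary_growth(["a", "b", "a", "c"])

# Synthetic check: points generated from V = 3 * n**0.5 should return K = 3, beta = 0.5.
sizes = [10, 100, 1000, 10000]
vocab = [3 * n ** 0.5 for n in sizes]
K_fit, beta_fit = fit_heaps_law(sizes, vocab)
print(toy_growth, K_fit, beta_fit)
```

A log-log regression weights small and large text sizes differently from a direct nonlinear fit, so the fitted parameters may differ slightly from those reported with other fitting routines.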
These conclusions merit two additional comments. First, the curves for the chemical collections are insensitive to the order in which the molecules are "read" (Supplementary Fig. 3). Second, it should be remembered that when comparing collections with different numbers of molecules, the number of MCS words in their "vocabularies" will be different, which, in turn, will affect the rate of increase. This is seen in Supplementary Fig. 4, where the shallower, blue curve is for ~20,000 words derived from 1,000 molecules whereas the steeper, orange curve is for 1,750,000 words derived from collection (1) (based on 668,000,000 molecule-to-molecule comparisons). As in the case of TTR measures, it is therefore important to make comparisons between like-sized sets.

Diversity within molecular libraries. Practically, the above measures can be used to estimate the diversity of chemical libraries and also to visualize it in ways not available with traditional approaches based on Tanimoto coefficients (cf. a typical matrix of Tanimoto coefficients in Fig. S2 in ref. 17). To illustrate this, we teamed up with the Mcule company (www.mcule.com), a leading European provider of molecular libraries, who shared with us ten samples of their choosing, each of 1,000 molecules, drawn from their commercial libraries of potential lead compounds. The exercise was structured as a blind test in the sense that we were initially not provided with any information about the samples' diversity. By plotting the rate of new chemical word increase (as in Fig. 2b), we readily established that the samples group into two families of similar diversities: five less diverse (set of five lower curves in Fig. 3a) and five significantly more diverse (upper five curves in Fig. 3a).
We also characterized the samples by TTR values and found that for the first family of five samples, TTRs were between 0.024 and 0.069, whereas for the second family of five, they were between 0.150 and 0.164 (note: these values can be compared with the values of other collections discussed above since the vocabularies were similarly sized). We then communicated these results to Mcule who, in turn, provided us with their own estimates of diversity. In Mcule's measure, popular in the drug-discovery industry, each molecule was compared with the other molecules in the set and assigned a maximum Tanimoto coefficient (e.g., a value of, say, 0.76 for a given molecule would mean that the closest analogue in a given set has a Tanimoto coefficient of 0.76 and all other molecules are less similar). When averaged over an entire collection, this measure decreases with increasing diversity. Figure 3b plots the values of our linguistic TTR metric against the average Mcule values for each 1,000-molecule sample. As seen, the two measures correlate closely ($R^2$ = 0.9545) though, again, only our approach allows for the visualization as in Fig. 3a.
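The Tanimoto-based measure described above (the maximum similarity of each molecule to any other molecule in the set, averaged over the set) can be sketched as below. For self-containment, each fingerprint is represented as a plain Python set of "on" bit indices; this representation, the function names, and the toy fingerprints are assumptions, and in practice the fingerprints would come from a cheminformatics toolkit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # convention assumed here for two empty fingerprints
    return len(fp_a & fp_b) / union

def mean_nearest_neighbor_tanimoto(fingerprints):
    """Average, over all molecules, of the highest Tanimoto to any *other* molecule."""
    if len(fingerprints) < 2:
        raise ValueError("need at least two fingerprints")
    best = []
    for i, fp_i in enumerate(fingerprints):
        best.append(max(
            tanimoto(fp_i, fp_j)
            for j, fp_j in enumerate(fingerprints) if j != i
        ))
    return sum(best) / len(best)

# Hypothetical toy fingerprints: two close analogues and one unrelated molecule.
toy_fps = [{1, 2, 3}, {1, 2, 3, 4}, {7, 8}]
toy_score = mean_nearest_neighbor_tanimoto(toy_fps)
print(toy_score)
```

Lower values of this average indicate a more diverse set, which is why it runs opposite to TTR in the comparison of the two measures.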
Another potential advantage over Tanimoto-based averages is that we can estimate the diversity of a molecular collection based on the analysis of only its subset. Recall that curves such as those in Figs 2b and 3a fit Heaps' law, $V_R(n) = K n^{\beta}$, well. Such functional dependencies are also observed when analyzing portions (say, 30% or 70% of all molecules) of a collection. With the increasing size of this subset, the fits converge to the distribution characterizing the entire library (Fig. 4a). Importantly, we have verified that this convergence is similar for the different datasets we studied; in particular, as the size of the subset under study increases, the best fits are such that the prefactor $K$ and the exponent $\beta$ are related by a power law, $\beta \sim K^{-\gamma}$ (Fig. 4b). Knowing this universal behavior, we can then extrapolate relatively well the diversity of the entire collection by analyzing only its subset (Fig. 4c). In other words, we can significantly reduce the number of molecule-to-molecule comparisons (e.g., by a factor of 4 if 50% of the collection is taken) yet still obtain decent estimates of diversity. We note that this type of extrapolation cannot be used for Tanimoto-based methods, for which no universal scaling with dataset size is known or should generally be expected.
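The saving in comparisons follows directly from the n(n − 1)/2 count of unique molecule pairs. The short sketch below (the function name `pairwise_comparisons` is illustrative) shows the arithmetic for a 1,000-molecule library and its 50% subset.

```python
def pairwise_comparisons(n_molecules):
    """Number of unique molecule-to-molecule pairs, n(n - 1)/2."""
    return n_molecules * (n_molecules - 1) // 2

full = pairwise_comparisons(1000)   # whole 1,000-molecule sample
half = pairwise_comparisons(500)    # 50% subset
reduction = full / half             # close to 4, as stated in the text
print(full, half, round(reduction, 3))
```

Because the count grows quadratically, a 30% subset needs roughly one eleventh of the comparisons required for the full collection.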

Distributions of chemical words.
A potentially quite informative feature of chemical "vocabularies" is the nature of the words they contain. For example, although drugs and natural products exhibit similar richness of vocabulary, one might suspect that the "words" in the two sets are of different lengths. Figure 5a plots the distribution of "lengths" (measured as the number of constituent non-hydrogen atoms) of unique MCS words derived from the randomly-chosen molecules in set (1), natural products in set (3), and drugs in set (4). As seen, the words in random molecules and in drugs are relatively short (maximum of the distribution at around 10 atoms) and distributed similarly. Examples of drug-based words in Fig. 5b correspond to structural fragments found in polypeptide antibiotics (e.g., bacitracin, dactinomycin; fragment 1 in the figure), glycoside-containing drugs (e.g., amikacin antibiotic, topiramate anticonvulsant; 2), indole alkaloids (e.g., vincanire, ondansetron; 3), nonsteroidal anti-inflammatory drugs (e.g., ibuprofen, ketoprofen; 4), medications for erectile dysfunction or sulphonamide antibiotics (e.g., sildenafil, sulfacetamide; 5), and cephalosporin antibiotics (e.g., cefadroxil, cefazolin; 6). In contrast, the analogous distribution for natural products is much broader and features a main peak centered around 15-16 atoms and a "satellite" peak centered around 20 atoms. As illustrated in Fig. 5c, longer words occurring in natural products can usually be easily recognized as characteristic scaffolds of certain classes of compounds (e.g., steroids, 10, flavonoids, 11, or opioids, 12). Shorter words are generally less informative (e.g., structure 7) but in some cases can be recognized as substructures of classes such as sugars, 8, or catecholamines, 9.

Characteristic chemical "keywords".
Examples of chemical words in a given set of molecules prompt our final question, namely, how to determine quantitatively those words that are most characteristic of a given collection of molecules and could thus serve as its "keywords". To do so, it is necessary to first develop a metric that measures a "distance" between molecules or sets of molecules. In linguistics, the efforts to compare two texts/corpora 30 date back to the 1950s, but the existing measures are too simplistic or altogether unsuitable for meaningful comparisons of chemical data. For instance, when a popular measure proposed by Kilgarriff 31 (Chi-by-degrees-of-freedom) is applied to our molecule collections, the distances from drugs, natural products, or other like-sized subsets of randomly-chosen Reaxys molecules to the larger Reaxys collection (1) are all similar (respectively, 1143, 1036, 1077-1087). Accordingly, we considered several other metrics, ultimately focusing on the one based on the frequency-corrected positions of "chemical words" in ranked lists (i.e., sorted from the most to the least popular words). Specifically, consider a set of N "chemical words" ranked according to their frequency of occurrence in a given corpus/collection. Denoting the rank of word x as r(x) and its frequency as f(x), we can define the normalized position, P, of this word by summing up the frequencies of this and all words with lower ranks (i.e., more popular than x) as $P_{r(x)} = \sum_{y:\, r(y) \le r(x)} f(y) \big/ \sum_{y} f(y)$. Then, the distance between the same word in two sets of molecules, say A and B, can be defined as $\delta_{AB}(x) = |P_{r(x),A} - P_{r(x),B}|$ (see also Fig. 6). Similarly, the distance between the entire two sets can be defined as an average of the word-to-word distances, $\delta_{AB} = \langle \delta_{AB}(x) \rangle_x$. We note that if a word is present only in one list (say, A) and absent in the other (B), we assign the maximal distance possible, $1 - P_{r(x),A}$, as if the missing word were added at the very end of list B.
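A minimal sketch of the ranked-list distance described in this paragraph is given below. It rests on several assumptions that the text does not spell out: the normalized position of a word is taken as the cumulative frequency up to and including its rank divided by the total frequency; the per-word distance is the absolute difference of the positions in the two lists; a word missing from one list is treated as if appended at the very end of that list (position 1); the set-to-set distance averages over the union of words in both lists; and frequency ties are broken alphabetically. All function names are illustrative.

```python
def normalized_positions(freqs):
    """Map each word to its cumulative, normalized frequency position in the ranked list."""
    total = sum(freqs.values())
    # Rank from most to least frequent; ties broken alphabetically (assumption).
    ranked = sorted(freqs, key=lambda word: (-freqs[word], word))
    positions, running = {}, 0
    for word in ranked:
        running += freqs[word]
        positions[word] = running / total
    return positions

def ranked_list_distance(freqs_a, freqs_b):
    """Return (average distance, per-word distances) between two word-frequency lists."""
    pos_a = normalized_positions(freqs_a)
    pos_b = normalized_positions(freqs_b)
    words = set(pos_a) | set(pos_b)
    # A word absent from a list is placed at its very end, i.e. at position 1.0.
    per_word = {w: abs(pos_a.get(w, 1.0) - pos_b.get(w, 1.0)) for w in words}
    return sum(per_word.values()) / len(per_word), per_word

# Hypothetical word counts for two small collections.
set_a = {"x": 6, "y": 3, "z": 1}
set_b = {"x": 1, "y": 1}
delta_ab, word_deltas = ranked_list_distance(set_a, set_b)
print(delta_ab, word_deltas)
```

Under these assumptions, words with large per-word distances to a reference collection are natural candidates for the "keywords" of a collection, since their rank positions differ most from the reference.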
We also observe that an important and appealing feature of this distance metric based on ranked MCS word lists is that it does not depend on the

Prediction of the type-token ratios, TTRs, based on the partial fits for different types of collections. The true value of the entire collection is taken as 100%. "Database group 1" and "database group 2" are the two families of Mcule databases from Fig. 3. The largest discrepancy between fits and real diversity is observed for natural products, whose linguistic peculiarity is also manifest in our other analyses (cf. Fig. 2b, where the natural-products curve intersects the dependencies for drugs and Reaxys molecules). For other collections, estimating 30-50% of the content already gives decent estimates of their actual diversity.

In implementing these ideas, we take the set R of molecules chosen randomly from Reaxys (collection (1)) as a chemical "universe" and a reference, and calculate the distances of molecules in other collections to this reference; for instance, the natural products are more distant ($\delta_{AR} = 0.073$) from this reference than drugs ($\delta =$ ).

Conclusions
In summary, we have extended the concepts of linguistic similarity to collections of molecules. The measures we propose provide alternative and complementary means of assessing chemical diversity and also visualizing it (cf.   Tanimoto-based approaches. This being said, we see the main value of our chemical-linguistic approach in annotating chemical collections with characteristic "keywords" by which such collections can then be searched/navigated, akin to searches of web documents/texts. In the business of small-molecule libraries, chemical keywords could be used to discern sets of molecules most resembling specific classes of drugs. If additional, higher-order linguistic considerations (e.g., collocations of words "travelling together" or "avoiding each other" during chemical reactions) were taken into account, they could provide more information not only about characteristic structural features but also about characteristic reactivity patterns 32,33. In all such analyses, the "rate-determining" step is the extraction of vocabularies characterizing a given collection of molecules (entailing large numbers of molecule-to-molecule comparisons; e.g., ca. 5 billion for a typical 34 molecular library of ~100,000 compounds). Such calculations, however, can be accelerated by extrapolations based on Heaps' law, are performed only once for a given set, and with modern computing resources can be completed within hours to days; all subsequent "keyword" comparisons/searches can then follow on much shorter time-scales.

Data availability. Data, including vocabularies of MCS words, and computer codes that support the findings of this study are available from the corresponding author upon request.