Introduction

Textual comprehension is a cornerstone of learning that contributes to a range of educational and developmental phenomena, including language acquisition, vocabulary building and literacy (Nowak et al., 2000). A major limitation in the field is that the skills and competencies associated with textual comprehension are currently evaluated using subjective criteria, typically related to outcomes such as spelling test scores (Sigman et al., 2014).

Textual comprehension is often defined as the construction of mental representations from the semantic connections in a text, wherein codes are translated and combined (Kendeou et al., 2014). This process is governed not by notions of reading difficulty but by rules of statistical linguistics. Features such as entropy, fatigue and, ultimately, comprehension can thus be gauged from the mathematical relations of word usage. Zipf (1935) established specific laws of word frequency over a body of textual information, observing that only a few words are used often while a much larger number are used very rarely. When these rules were formalized mathematically, it was noted that the most common word in a text corpus occurs roughly twice as often as the second most frequent word, and that this pattern repeats for subsequent frequency rankings (the word in position n appears 1/n times as often as the most frequent one).

Shannon (1948) further described how statistical linguistics affect the user and communicator, specifically in relation to entropy and fatigue. He proposed a five-element model of communication through which accuracy can be estimated from the conformity of the text (Shannon, 1948). First, an information source produces a message, or sequence of messages, to be communicated to a receiving terminal. Second, a transmitter operates on the message to produce a signal suitable for transmission over the given channel. Third, the channel itself mediates the signal from the transmitter to the receiver. Fourth, the receiver ordinarily performs the inverse operation of the transmitter, reconstructing the message from the signal. The fifth and final element is the destination: the intended person, reader or observer. These functions form the overarching framework within which Zipf's equation on textual comprehension operates.

Information theory-based studies have thus sought to measure the amount of information, the informational entropy (Montemurro and Zanette, 2002, Montemurro and Zanette, 2011, Debowski, 2011, Kalimeri et al., 2015) and the semantic information in a message (Bar-Hillel and Carnap, 1952, D’Alfonso, 2011, Bao et al., 2011, Marcelo and Damian, 2010), showing, from databases of texts, a universality of entropy across languages and a link between semantic information and statistical linguistics (Montemurro and Zanette, 2016, Montemurro, 2014). For example, linguistic analysis has found that the letter a is used frequently in English yet carries little information and contributes high entropy, since the more uniform the letter occurrences, the higher the entropy (Rosenfeld, 2000). However, these principles have yet to be systematically applied to the development of software suites capable of objectively extracting statistical patterns from text databases and using them to infer textual comprehension.
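
To make the letter-frequency argument concrete, the short R sketch below (our own illustration, not code from any of the cited studies) tabulates letter probabilities in a string and computes each letter's self-information and its contribution to the overall entropy; frequent letters such as a receive low self-information values.

```r
# Illustrative only: letter probabilities, self-information and entropy contributions.
letter_stats <- function(text) {
  chars <- unlist(strsplit(gsub("[^a-z]", "", tolower(text)), ""))
  p <- table(chars) / length(chars)            # occurrence probability of each letter
  data.frame(letter      = names(p),
             probability = as.numeric(p),
             information = -log2(as.numeric(p)),                 # self-information (bits)
             entropy     = -as.numeric(p) * log2(as.numeric(p))) # contribution to text entropy
}

letter_stats("a frequent letter such as a carries little surprise")
```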

Another issue in the field is the scarcity of instruments developed for non-English languages. In this study, we focused on Portuguese. Some tools designed for English, such as the Flesch-Kincaid grade level and the Coh-Metrix measurements, have been adapted to evaluate textual intelligibility in Portuguese (Gasperin et al., 2009). Although English allows complex computational constructions, these have been observed to occur infrequently (Thorne and Szymanik, 2015). However, despite these tools and the recent availability of Portuguese word repositories (Ferreira, 2014, Wiktionary, 2017, Mark, 2011, Github, 2017), there have been relatively few systematic analyses of the correlation between letter occurrences and syllabic structures in Portuguese (Hartmann et al., 2014, Soares et al., 2017, Rabêlo and Moraes, 2008).

Using an information theory-based approach, we developed a software tool (CalcuLetra) for quantifying the average difficulty of reading comprehension for a text, under the premise that the words most frequently used in writing are the words most familiar to readers. We hypothesized that the level of reading comprehension is related to high-probability (highly familiar), low-entropy words, and tested this hypothesis using the objective quantitative parameters of information theory applied to word occurrences in a bank of academic theses in Portuguese. We treat entropy as a degree of uncertainty about textual comprehension; a highly comprehensible text is therefore one containing many low-entropy, high-probability words. We also collected samples of Portuguese denotative texts (e.g., articles, essays, and abstracts) as a test database for the developed software. We propose that the quantitative model presented in this paper can objectively evaluate the relative complexity of a given text, thus constituting an objective evaluation tool for textual semantic decoding and comprehension-level complexity.

Methods

Model parameters

We assumed that each word’s frequency f and occurrence probability P determine a corresponding knowledge probability Pk and unknowledge probability Pu, obtained by mapping the scale 0 ≤ P ≤ 4.987 onto 0 ≤ Pk ≤ 100%, with the average occurrence probability Pa = 0.003 corresponding to Pk = 50% and Pu = 100 − Pk. Given their irregular distribution, the words were divided into two groups, G1 (50% ≤ Pk ≤ 100%) and G2 (0 ≤ Pk ≤ 50%), to avoid distortions in the mean. Thus, using linear interpolation, for G1, \(Pk = \frac{(P - 0.003) \times 50}{4.987} + 50\), and for G2, \(Pk = \frac{P \times 50}{0.003}\). The information quantity \(I = \log(1/Pu)\) and the information entropy \(h = I \times Pu\) were also measured for each word (Rosenfeld, 2000). Equivalence formulas had to be obtained because different letters can have similar entropies yet generate words with different degrees of familiarity, implying that the amount of information and the entropy could be the same in texts with different comprehension-difficulty classifications. For example, the isolated letters c and m have similar informational entropy, so their amount of information and entropy can coincide whether a text is easily comprehensible or not; yet a given text may contain more unusual words with the letter c than with the letter m, and those words have different entropies. The variables Pk and Pu therefore needed to be formulated to express entropy in semantic-communication terms, correcting any misrepresentation of words that were more likely to appear in texts at lower reading levels. Such inferences were confirmed using probabilistic averages for text quantification over a relatively broad database rather than focusing on isolated words.
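
As a minimal sketch of these definitions, the R function below maps an occurrence probability P to Pk, Pu, I and h using the interpolation formulas above; the base of the logarithm and the proportion scale used for Pu inside I and h are our assumptions, since the text does not fix them.

```r
# Word-level parameters from the model above (log base and Pu scale are assumptions).
word_parameters <- function(P, Pa = 0.003, Pmax = 4.987) {
  Pk <- ifelse(P >= Pa,
               (P - Pa) * 50 / Pmax + 50,   # G1: 50% <= Pk <= 100%
               P * 50 / Pa)                 # G2: 0% <= Pk <= 50%
  Pu <- 100 - Pk                            # unknowledge probability (%)
  I  <- log(1 / (Pu / 100))                 # information quantity
  h  <- I * (Pu / 100)                      # information entropy
  data.frame(P = P, Pk = Pk, Pu = Pu, I = I, h = h)
}

word_parameters(c(0.001, 0.003, 1.5, 4.987))
```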

CalcuLetra

All equations, words and their respective quantitative parameters were stored in the software suite CalcuLetra, which we have made openly available (https://github.com/LouiseBogea/CalcuLetra). For a text, it reports the number of letters, words, sentences and paragraphs, as well as the letter or word frequency f and occurrence probability P. For each word it also reports the knowledge probability Pk, the unknowledge probability Pu, the information quantity I and the entropy h, and for each variable it reports the minimum, maximum and average values. The average Pk of a text was calculated from the sum of the Pk of each of its words. The stored Pk values did not change when new text samples were entered, and words in the entered texts that are not stored in CalcuLetra are automatically excluded to avoid distortions in the quantification analysis. A .xls or .csv file can be imported to expand the CalcuLetra database or to reset it. We collected denotative text samples in Portuguese (n = 3330) as test data and inserted them into CalcuLetra to validate the quantitative model. Further information about the text sample sources is given in Supplementary Table S3.
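
The averaging step can be sketched in R as follows; `word_db` (a data frame with columns `word` and `Pk`) and the tokenization are hypothetical stand-ins for CalcuLetra's internal C# implementation.

```r
# Sketch of text-level averaging; words absent from the stored vocabulary are excluded.
average_pk <- function(text, word_db) {
  tokens <- unlist(strsplit(tolower(gsub("[^[:alpha:][:space:]]", " ", text)), "[[:space:]]+"))
  tokens <- tokens[tokens != ""]
  known  <- tokens[tokens %in% word_db$word]
  mean(word_db$Pk[match(known, word_db$word)])
}

# word_db <- read.csv("calculetra_words.csv")   # hypothetical export of the training vocabulary
# average_pk("O conhecimento transforma a leitura.", word_db)
```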

Sample selection

Academic theses published during the last 10 years (2007–2017) were manually collected (n = 1032) from the Brazilian Digital Library of Theses and Dissertations and the Digital Library of Theses and Dissertations of the University of São Paulo (USP). Their word usage statistics were fully quantified (n = 33,101 words) and used as training data. Further information about the repositories is given in Supplementary Tables S1 and S2. Foreign and misspelled words, and words not included in Portuguese dictionaries, were manually removed, as were figures, proper nouns, numbers and other symbols (e.g., ( ) [ ] { } ? + // = * < > : ; ' % @ ! # - & |), to avoid discrepancies in the statistical analyses. Words were categorized as either denotative or connotative based on their explicit definitions in dictionaries. We chose academic theses (denotative language) to avoid analysis errors caused by the figurative meaning of words.
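
A rough R sketch of this kind of symbol and number clean-up is shown below, purely as an illustration under our own assumptions; the removal of proper nouns, foreign words and misspellings described above was done manually and is not automated here.

```r
# Keep only letters (including accented ones) and spaces, then collapse whitespace.
clean_text <- function(text) {
  text <- gsub("[^[:alpha:][:space:]]", " ", text)
  gsub("[[:space:]]+", " ", trimws(text))
}

clean_text("Resultados (n = 1032): ver Tabela 2; p < 0,05!")
```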

Model application

Training words were stored in the information-processing software CalcuLetra, which was developed for this study in C# with Microsoft Visual Studio 2012, with SQL Server Management Studio 2012 used to manage the database. SQL queries filtered out words with 17 or more letters, words occurring fewer than 4 times and words with an occurrence probability of P ≤ 0.0001, which were attributable to misspelled or mixed words. Duplicate words, as well as prefixes and suffixes (e.g., pós-, sub-, vice-, intra-, inter-, pro-, pre-, anti-, ante-, -mente, -issimo, -oso, -osa), were also removed owing to their very low occurrence, to avoid distortions in the mean.
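
The same filter can be expressed in R (rather than SQL) as in the sketch below; `words`, with columns `word`, `freq` and `P`, is a hypothetical data frame of the training vocabulary.

```r
# Vocabulary filter mirroring the SQL step described above.
filter_vocabulary <- function(words) {
  keep <- nchar(words$word) < 17 &   # drop words with 17 or more letters
          words$freq >= 4 &          # drop words occurring fewer than 4 times
          words$P > 0.0001           # drop words with P <= 0.0001 (misspelled/mixed)
  unique(words[keep, ])              # drop duplicate entries
}
```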

Validation and data analysis

BioEstat 5.3 was used to evaluate comparisons within the training database and between the quantitative parameter analyses, with emphasis placed on the variations in Pk and h (Ayres et al., 2007). A decision tree was used to establish the Pk and Pu ranges in relation to entropy (Breiman, 2017). Decision trees are a commonly used predictive and exploratory alternative for regression and classification problems, as they handle both categorical and numerical variables. The decision tree algorithm assisted in predicting and classifying the data based on the values of the response variable. It executes successive binary partitions to obtain subsets that are increasingly homogeneous with respect to the response variable and arranges the results in a hierarchical tree of decision rules for prediction or classification, with each data partition forming a node; internal nodes are distinguished from terminal nodes, which are called leaves. The main steps of the algorithm were: (1) generate the root node (from the entire database); (2) determine the best way to divide the data into two groups (find the nodes to be divided); (3) choose the attribute that best classifies the data by considering the single paths between the root node and each subsequent node; and (4) create and design the nodes and the associated branches. The algorithm then returns to step 2, as in the sketch below.
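
A minimal R sketch of this step is shown below, assuming the training words sit in a hypothetical data frame `train` with columns `h` and `Pk`; `rpart` implements the CART-style recursive partitioning described above, and the `minsplit` value anticipates the minimum group size of 5000 stipulated in the Results.

```r
library(rpart)

# Regression tree for entropy as a function of knowledge probability.
tree_pk <- rpart(h ~ Pk, data = train,
                 method  = "anova",                       # numeric response (regression)
                 control = rpart.control(minsplit = 5000))

print(tree_pk)   # each leaf reports the mean entropy of a Pk band (if-then rules)
```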

If all cases in a terminal node have identical values, the impurity of the node is minimal, the homogeneity is maximal and the prediction is perfect. One way to control the division is to allow it to continue until all terminal nodes are pure or contain no more than a specified minimum number of cases or objects. In most cases, the results can be summarized in a relatively simple tree, which is useful not only for the rapid classification of new observations but also because it often yields a much simpler model for explaining why observations are classified or predicted in a particular way.

Because decision trees are nonparametric, they do not assume an underlying distribution and can provide a relatively unbiased initial inspection of the training data, giving an image of its structure and thereby improving the interpretation of the results. Here, the quadratic relationship between entropy and Pk was modeled using a Gaussian regression model (Montgomery et al., 2006). Deviation measures were computed, including the mean absolute deviation (MAD), the simple mean of the absolute differences between the actual and adjusted values; the mean square deviation (MSD), the simple mean of the squared differences between the actual and adjusted values; and the mean absolute percent error (MAPE), the percentage error between the actual and adjusted values.
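
These three deviation measures can be written in a few lines of R, as in the sketch below (our own helper, with `y` the actual and `yhat` the adjusted entropy values).

```r
# MAD, MSD and MAPE between actual (y) and adjusted (yhat) values.
deviation_measures <- function(y, yhat) {
  c(MAD  = mean(abs(y - yhat)),
    MSD  = mean((y - yhat)^2),
    MAPE = 100 * mean(abs((y - yhat) / y)))
}
```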

As the test data were inserted into CalcuLetra, the readability of each text was determined from the objective parameters of each of its words in the stored database, and the average Pk values were compared to quantify comprehension difficulty. To identify the approximate points at which the slope changed between the number of denotative texts inserted in CalcuLetra and the average knowledge probabilities, a binary segmentation method was used to identify changes in the mean and variance of the knowledge probabilities as the number of texts grew (Scott and Knott, 2006). For each change-point interval, the relationship between the knowledge probabilities and the texts was described using linear regressions. For all analyses, including tables and figures, we used R (version 3.2.4) (R Core Team, 2013).
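
As a hedged illustration of this step, the sketch below uses the R `changepoint` package (our choice of implementation; the paper cites Scott and Knott, 2006 and does not name one), where `pk_series` would hold the average knowledge probabilities ordered by the number of inserted texts.

```r
library(changepoint)

# Binary segmentation on the mean and variance of the average Pk series.
cp <- cpt.meanvar(pk_series, method = "BinSeg", Q = 6)  # Q: maximum number of change points (assumed)
cpts(cp)                                                # estimated change-point locations

# Within each interval [a, b], a linear regression of Pk on the text index can then be fitted, e.g.:
# lm(pk_series[a:b] ~ seq(a, b))
```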

Results

Representative word groups

We considered the upper and lower bounds of the training database to be the most relevant (Fig. 1), as they contained the words with Pk ≥ 0.7 (G1), which contributed to comprehension, and those with Pu ≥ 0.3 (G2), which were likely to cause a communication failure.

Fig. 1

Distributions of the probabilities, information, and entropy for each word stored in CalcuLetra. a, b Harmonic distribution of entropy, except for deviations in G1 (Pk ≥ 0.7), which predominantly included conjunctions, prepositions, and articles; this is due to a lack of data in this group, with inverse results in b. The distributions had two atypical points: one observation with a high Pk (99.97) and a low h (0.31), and one observation with a high Pu (100.00) and a null h (0.00). c Around 17.33% of the texts had entropy measures greater than 10, and these were also the texts with the lower word counts; for instance, entropy equaled 10 when the number of words was 3259 and 5 when the number of words was 5724, while a four-word count was associated with the maximum entropy (53.05). d Monotonic decreasing, non-linear relationships between the variables: when Pu was maximized, I was minimized

In the scatter plot of the data pairs, we found a positive and significant correlation (r = 0.99, p-value < 0.001) between the entropy and Pk, i.e., the greater the entropy, the greater the Pk and vice versa. Conversely, we found a significant negative correlation (r = −0.99, p-value < 0.001) between the entropy and Pu, i.e., the higher the entropy, the lower the Pu and vice versa.

In Fig. 2, we evaluated the cut-off points for Pk (a, c) and Pu (b, d) for which the entropy distribution was homogeneous, employing a machine-learning regression tree technique with the stipulation that the minimum group size for a division be 5000, to avoid an overly large tree and facilitate data interpretation. Classification and regression tree algorithms aim to achieve the best possible predictive accuracy. Accordingly, the division at each node was chosen to improve forecast accuracy by finding sub-groups that are internally homogeneous (intra-group) and heterogeneous between one another (inter-group). Operationally, the most accurate forecast was defined as the one with the lowest rate of classification errors.

Fig. 2

Regression tree (a, b) for entropy as the response variable, with the knowledge probability and the probability of being unknown as explanatory variables, and descriptive boxplots of the variables (c, d). In each leaf, entropy is represented as h = leaf mean. The results of such tree methods can be summarized in a series of logical if-then conditions, unlike the usual regression models, which define a prediction equation for the relationship between variables. In addition, there is no implicit assumption that the relationships between the predictor variable and the dependent variable are linear

As Fig. 2 shows, from the cut-off points identified by the regression tree (panels 2a, b), entropy was described in bands of knowledge probability (Table 1) and of the probability of being unknown (Table 2). Descriptive statistics for each variable are also represented in boxplots (panels 2c, d). Further information about the boundaries in the boxplots is given in Supplementary Table S4.

Table 1 Description of entropy by knowledge probability
Table 2 Description of entropy by the probability of being unknown

Quadratic regression

To describe the relationship between the entropy and the knowledge probability, we used a quadratic regression model to determine the knowledge probability value that maximized entropy, with the knowledge probability variable centered on its average. The adjusted model had a variance inflation factor of 6.24, indicating the absence of multicollinearity, and was thus considered well adjusted (Fox, 2008). The model equations, Eq. (1) for entropy h as a function of the knowledge probability Pk and Eq. (2) for the unknown probability Pu, were defined as follows:

$$\begin{array}{l}\overline{Pk} = Pk - 3.210436\\ \widehat{h} = 4.6457866 + 1.4350649 \times \overline{Pk} - 0.0101171 \times \overline{Pk}^{\,2}\end{array}$$
(1)
$$\begin{array}{l}\overline{Pu} = Pu - 96.78956\\ \widehat{h} = 4.6457866 - 1.4350649 \times \overline{Pu} - 0.0101171 \times \overline{Pu}^{\,2}\end{array}$$
(2)

In this sense, when Pk increased by 10 units, h increased by approximately 13.92 units on average. This function allowed the maximum points to be identified and the entropy to be estimated from the knowledge probability. The position that maximized entropy was a knowledge probability of 70.92% (Pu = 30.74%); beyond this value, h tended to decrease as Pk or Pu increased (Table 3).
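
Evaluating Eq. (1) with the published coefficients reproduces this maximum analytically, as in the R sketch below; note that the coefficients refer to the centered variable, so the vertex is expressed on that scale (our reading of the text).

```r
# Fitted quadratic from Eq. (1), in the centered knowledge probability.
h_hat <- function(Pk_centered) {
  4.6457866 + 1.4350649 * Pk_centered - 0.0101171 * Pk_centered^2
}

Pk_at_max <- 1.4350649 / (2 * 0.0101171)  # vertex of the parabola, ~70.92
h_hat(Pk_at_max)                          # estimated maximum entropy
```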

Table 3 Bounds of the training database

Figure 3 shows the comparison of actual and adjusted values for Pk (a) and Pu (b). Note that after the inflection point, the adjusted curve did not exactly follow the actual curve because of the small number of points examined in this range. This was the main limitation of the training database; however, despite the missing values in G1 (Pk ≥ 0.7), it was possible to infer that h tended to decrease for familiar words.

Fig. 3

Adjusted regression model: knowledge probability a and probability of being unknown b. The adjusted values by the regression analysis show that the model can accurately reproduce the actual entropy values. The adjusted model shaped the data characteristics up to a 60% knowledge probability. The lack of adjustment for the higher values of Pk reflected the lack of samples with probabilities higher than 60% and the presence of atypical observations (Pk = 99.97 and h = 0.31)

The MAD and MSD estimates were low, suggesting proximity between the adjusted values and the actual data. The MAPE estimate indicated that the forecast was, on average, 3.009% incorrect, which is satisfactory in terms of fit quality.

Quantification of the texts in CalcuLetra

Test data (e.g., articles, essays, abstracts, and news) were entered into CalcuLetra (n = 3330) in order to extract the quantitative word values based on the objective parameters stored in the training database (Fig. 4). This process inferred the difficulty of semantic comprehension from the average Pk values. The texts were already organized syntactically in standard Portuguese. Texts with an average Pk ≤ 0.3 could be flagged as having a higher chance of causing incomprehension.
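
A toy illustration of this thresholding in R is given below; the band labels are our own, and only the Pk ≤ 0.3 cut-off and the 0.3-0.5 range come from the results reported here.

```r
# Illustrative banding of texts by average Pk (labels are ours, cut-offs from the text).
classify_text <- function(avg_pk) {
  cut(avg_pk,
      breaks = c(-Inf, 0.3, 0.5, Inf),
      labels = c("likely to cause incomprehension", "typical of the test data", "above the typical range"))
}

classify_text(c(0.23, 0.42, 0.69))
```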

Fig. 4

Random register of denotative texts in CalcuLetra and the average Pk correlations. The average Pk for the test data was 0.3 ≤ Pk ≤ 0.5, and texts with an average Pk ≤ 0.3 were unlikely to be understood by most neuro-typical readers. The statistical significance level was set at Pk ≥ 0.23. The total range for the average Pk values was from 0.23 to 0.69, and the overall average Pk was 0.42. In the comparison of actual and adjusted Pk values, the mean percentage error was very low, indicating almost perfect prediction. MAD (0.024) and MSD (0.002) estimates were low, suggesting that the adjusted values were in close proximity to the real data. Also, the MAPE estimate indicated that the average forecast was incorrect by 0.064%

We used a binary segmentation method to identify changes in the mean and variance of Pk as the number of texts grew (Scott and Knott, 2006). Similar to decision trees, this method searches for the values that make the groups internally more homogeneous in mean and variance and more heterogeneous between groups. In this way, we identified change points at texts 24, 551, 2724, 3152, 3295, and 3330. Within each change-point interval, the relationship between Pk and the texts was strictly linear, so six linear regressions were adjusted according to the intervals identified by the change points. When the number of texts increased by 10 units, Pk increased on average by 3.00 units in the interval [1; 24], by 0.03 units in the interval (25; 551), by 0.10 units in the interval (552; 2724), by 0.03 units in the interval (2725; 3152), by 0.30 units in the interval (3153; 3295), and by 3.00 units in the interval (3296; 3330). In all models, the R² value was essentially perfect, as one variable thoroughly explained the variability in the other.

Discussion

Textual complexity analyses often include several domains (Crossley et al., 2017). In this paper, comprehension difficulty was quantified in order to derive an objective measure of the complexity of denotative written material in Portuguese, using texts related to educational language standards. Text comprehension was estimated from the correlations between the semantic representations of the words and their respective probabilities, to ensure that textual complexity was related to the semantic entropy of the words and their degree of familiarity. To the best of our knowledge, this is the first investigation of Portuguese textual comprehension difficulty using quantitative methods and objective parameters. The use of a quantitative model that expresses comprehension in terms of word-complexity measures may help overcome many of the problems associated with reliance on qualitative standards and subjective evaluations (Gathercole and Alloway, 2006). Importantly, the entropy analysis in our model revealed that the relationship between word comprehensibility and probability is not linear (Smith, 2012). This means that word features, such as a word’s relationship to the sentence, are non-linearly related to textual features, such as occurrence. The model presented here can elucidate that relationship through the use of regression analysis (Gastón and García-Viñas, 2011).

We chose academic texts due to their potentially more comprehensive vocabulary and lower use of figurative language. We acknowledge that using a corpus of academic words may have biased our sampling, as it is possible that university students intentionally use unusual words to demonstrate a more elaborate vocabulary. There is a possibility that we did not accurately sample more common words that would exhibit higher degrees of familiarity. However, we highlight that one can use any corpus (including from other languages) as training data for CalcuLetra to quantify textual comprehension difficulty. Corpus selection by itself does not directly impact the validity of our quantitative model, and future studies may systematically address how the training dataset can affect comprehension estimates.

Our results showed that the labor of text comprehension was quantifiable: a low degree of familiarity, a low amount of information, and low entropy together predicted a high probability of incomprehension (summarized in Fig. 5). In a study of word frequency in books, the amount of information in 300 to 3000 unpredictable words was reported to be 0.2 bits per word, which is congruent with the findings in our dataset (Marcelo and Damian, 2010). Our results are also consistent with several available Portuguese word-frequency lists (Ferreira, 2014, Wiktionary, 2017, Mark, 2011, Github, 2017).

Fig. 5

Overview of the database parameters. The lower and upper bounds of the training database were delimited by low-entropy words with the highest probabilities of causing incomprehension (i.e., nouns and adjectives) or of facilitating semantic decoding (i.e., prepositions and conjunctions). Despite the missing values in G1 (Pk ≥ 0.7), the score resulted from the words eliminated at the bounds; this avoided distortions in the mean, while the linear interpolation made in this range could be deduced from the harmonic distribution established in the other bands

Previous studies have shown that complex aspects of information organized in symbolic sequences can be quantified (Montemurro and Zanette, 2002, Montemurro and Zanette, 2011, Debowski, 2011, Kalimeri et al., 2015). In this work, we ranked semantic decoding difficulties using the equivalence formulas established above. Unlike other research that sought to quantify the entropy of word order in sentences and found that relative entropy remains constant despite vocabulary diversity, this study set an average ratio for the comprehension of Portuguese denotative text (Montemurro, 2014). By using an average ratio for text comprehension, deviations due to vocabulary diversity can be overcome. While this may suggest that the parameters depend on the restricted description of a single language, they can be altered to suit any specified vernacular. In this sense, the findings of this research are not restricted to Portuguese comprehension, and the parameters can be applied to any linguistic modality that incorporates a defined semantic code (language) (Zwaan, 2016). This generalizability arises because, despite the known differences in structure and vocabulary across language systems, the impact of word order within the structure of a language is thought to be a statistical linguistic universal (Fox, 2008). In moving away from reliance on word ordering and concentrating instead on a quantified flow of information sequencing, word usage and cognitive processes such as repetition and rehearsal are acknowledged in the function of comprehension (Zwaan, 2016). Several applications of our software suite can be envisaged. For example, an objective quantification of textual comprehensibility could be used in the development of diagnostic and therapeutic tools for learning disorders such as dyslexia (Harley and O’Mara, 2016). A quantitative tool that identifies words in relation to their embedded discourse might thus provide a valuable objective guideline for evaluating the severity of learning disorders, considering that current therapeutic practices are based mainly on qualitative measures (Kirkby et al., 2011). In addition, the regression analysis technique employed in this study may make it possible to estimate the optimal rate of acquiring a new word from its textual representations and its occurrence across different texts, which could serve as a benchmark for clinical studies and diagnostics (Oliveira and Gomes, 2010).

Our software might also be used in standard education. In the latest edition of the Programme for International Student Assessment (2015), Brazil ranked 59th in reading out of 70 countries, revealing that many students have insufficient comprehension skills. The National Indicator of Functional Literacy (INAF) reports that 68% of the 30.6 million Brazilians between 15 and 64 years old who have studied for up to 4 years, and 75% of the 31.1 million who have studied for up to 8 years, remain at basic literacy levels. The use of inadequate teaching material and improper screening for learning disabilities are likely contributors to these failures in public education outcomes. Specifically, the annual National Textbook Program (PNLD) has been heavily criticized, especially regarding the creation and selection of textbooks (Di Giorgi et al., 2014). Using CalcuLetra, teachers might select more appropriate course material based on quantitative entropy parameters (Cidrim and Madeiro, 2017); of course, the entropy levels best suited to specific grades or educational goals would have to be determined empirically. CalcuLetra could also be applied to the creation of textbooks and essay materials and to the automated evaluation of textual comprehension and student production, and it could be used to quantify the functional acquisition of specific vocabularies using different training datasets. Although our model has not yet been fitted to or compared with the comprehension of actual human readers, it could plausibly be used to monitor the development of a given student’s textual comprehension skills by analyzing that student’s production against progressively complex training datasets inserted in CalcuLetra. Understanding trends and differences in textual representation can also add to ongoing theoretical debates about the role of automated textual analysis in education. Indeed, automated textual analyses may contribute to exploring intersectional perspectives on social groups studying in Portuguese, such as specific ethnic groups and individuals with learning disabilities.

Conclusions

By applying information theory principles, we developed a new textual analysis software suite (CalcuLetra) and demonstrated that the comprehension of Portuguese denotative texts could be quantified using objective parameters. Our results revealed groups of words with low entropy that either cause incomprehension or facilitate semantic decoding. We propose that the methodology and software developed may eventually be used as an auxiliary evaluation tool for teaching materials and textual comprehension assessments, as well as for the study and therapy of learning disorders. This model may also be adapted to other languages to evaluate the difficulty of semantic decoding and semantic complexity.