Information gain modulates brain activity evoked by reading

The human brain processes language to optimise efficient communication. Studies have shown extensive evidence that the brain's response to language is affected both by lower-level features, such as word length and frequency, and by syntactic and semantic violations within sentences. However, our understanding of cognitive processes at the discourse level remains limited: how does the relationship between words and the wider topic one is reading about affect language processing? We propose an information-theoretic model to explain cognitive resourcing. In a study in which participants read sentences from Wikipedia entries, we show that information gain, an information-theoretic measure that quantifies the specificity of a word given its topic context, modulates word-synchronised brain activity in the EEG. Words with high information gain amplified a slow positive shift in the event-related potential. To show that the effect persists for individual and unseen brain responses, we furthermore show that a classifier trained on EEG data can successfully predict information gain from previously unseen EEG. The findings suggest that biological information processing seeks to maximise performance subject to constraints on information capacity.

The generative probability of a document given a word, P(d|w), is defined using a derivative of a generative likelihood model (1). First, we assign a probability P(w|M_d) for a word w given a document model M_d for a document d. The document model is a bag-of-words representation of a document, in which the order of the words is disregarded and only the frequency of each word is preserved. The probability of a word w being generated by a document d can be estimated as

$$P(w \mid M_d) = \frac{f_{w,d}}{f_d},$$

where f_{w,d} stands for the frequency of word w in document d and f_d is the total number of words in d.

Next, since we are interested in the distribution of documents given a word, we calculate P(d|w). By utilising Bayes' rule this becomes

$$P(d \mid w) = \frac{P(w \mid d)\, P(d)}{P(w)},$$

where P(w) can be ignored, since it is the same for all d. Since we defined the documents to have a uniform prior probability, the equation can be simplified further:

$$P(d \mid w) \propto P(w \mid d).$$

Due to this, P(w|d) can be used to compute the probability of a word "generating" a document.

We are now ready to compute the a priori entropy over documents, H(D), and the entropy over documents when observing a word, H(D|w). By using the definitions of entropy and conditional entropy, we get

$$H(D) = -\sum_{d \in D} P(d) \log_2 P(d)$$

and

$$H(D \mid w) = -\sum_{d \in D} P(d \mid w) \log_2 P(d \mid w).$$

Since P(d) is uniform, H(D) will yield the maximum entropy for the given set of documents, formally H(D) = log₂(|D|). From here it follows that we now have a model for computing the information gain of a word w given a collection of documents D:

$$IG(w) = H(D) - H(D \mid w).$$

To understand how the measure of information gain works, let us view how the generative distribution of documents changes when conditioned on different words. Consider a collection D′ of 50 Wikipedia articles. A language model is generated for each of these documents as specified above, and the generative probabilities P(d|w) are computed for all d ∈ D′ given the words the, small, and cat. These words are examples of low, medium, and high information gain words, respectively. Figure S1 displays the probability distributions P(D′|w) for each of the aforementioned words, along with the conditional entropy of each distribution, H(D′|w). We see that H(D′|w) is highest for the word the, because the frequency of the is roughly the same in all of the documents. This implies that the is not very good at discriminating documents from each other. On the other hand, the word cat occurs in only one document in our limited collection. This makes the entropy of the document distribution fall to zero, because there is no uncertainty about a document given the word: we are certain that the document is the one in which cat occurs. In a larger collection, say one consisting of millions of documents, it would be very unlikely for a word to occur in only one document. Lastly, the word small falls between the and cat in terms of entropy. It occurs in some documents but not all, and is thus somewhat descriptive in terms of documents.

To study the information gains of these three words, we simply subtract the conditional entropy from the a priori entropy, which for our collection is H(D′) = log₂ 50 ≈ 5.64. We see that the highest information gain of these three words is achieved with the word cat, with the word the having the least information gain, and small falling between these two.
To conclude, words that occur in only a select few documents with varying frequencies will tend to have a higher information gain than words that occur in a great many documents with approximately equal frequency. Thus, information gain is an estimate of the information gained about a topic upon observing a particular word.
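For illustration, the computation above can be condensed into a short program. The following is a minimal sketch in Python; the three-document toy collection and the function name are illustrative and not the collection used in the study.

```python
# A minimal sketch of the information gain measure described above.
import math
from collections import Counter

def information_gain(word, documents):
    """IG(w) = H(D) - H(D|w) over a collection of bag-of-words documents."""
    # P(w|M_d): maximum likelihood estimate f_{w,d} / f_d for each document.
    likelihoods = [doc[word] / sum(doc.values()) for doc in documents]
    total = sum(likelihoods)
    if total == 0:
        return 0.0  # word unseen in the collection; no information gained
    # Uniform prior over documents, so P(d|w) is proportional to P(w|d).
    posteriors = [p / total for p in likelihoods]
    # A priori entropy is maximal under the uniform prior: H(D) = log2 |D|.
    h_prior = math.log2(len(documents))
    # Conditional entropy H(D|w) = -sum_d P(d|w) log2 P(d|w).
    h_posterior = -sum(p * math.log2(p) for p in posteriors if p > 0)
    return h_prior - h_posterior

# Toy collection: each document is a bag-of-words Counter.
docs = [Counter("the cat sat on the mat".split()),
        Counter("the small dog ran".split()),
        Counter("the small house is the smallest".split())]
for w in ("the", "small", "cat"):
    print(w, round(information_gain(w, docs), 3))
```

Running the sketch reproduces the qualitative ordering discussed above: the yields an information gain near zero, cat yields the maximum log₂ 3, and small falls in between.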

Computation of information gain. In the present study, the information gain of each word was computed from the English Wikipedia using the aforementioned model. Document models of all of Wikipedia's articles were generated. Prior to constructing these models, punctuation marks were removed from the text and the words were stemmed using the Porter stemming algorithm (2). A word likelihood model was constructed using the aforementioned models. Using these models, information gain was computed for each of the stemmed words. Words with information gain at or above the 75th percentile were labelled as high information gain words (label 1), and words with information gain below the 75th percentile as low information gain words (label 0). These labels were employed for data visualisation and classifier training, but not for significance testing, for which continuous values of information gain were used. A histogram of the information gain values of words is shown in Figure 1 (left).
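The pipeline just described can be sketched as follows, assuming NLTK's PorterStemmer and the information_gain function from the sketch above; the corpus here is reduced to placeholder strings, as loading the Wikipedia dump is omitted.

```python
# A sketch of the preprocessing and 75th-percentile labelling pipeline.
import string
import numpy as np
from collections import Counter
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def preprocess(text):
    # Remove punctuation, lower-case, and stem each word.
    stripped = text.translate(str.maketrans("", "", string.punctuation))
    return [stemmer.stem(w) for w in stripped.lower().split()]

# Stand-ins for the Wikipedia article texts used in the study.
articles = ["The cat sat on the mat.", "A small dog ran outside."]
docs = [Counter(preprocess(a)) for a in articles]
vocabulary = set().union(*docs)
ig = {w: information_gain(w, docs) for w in vocabulary}

# Words at or above the 75th percentile get label 1 (high information
# gain), the rest label 0; continuous values are kept for testing.
threshold = np.percentile(list(ig.values()), 75)
labels = {w: int(ig[w] >= threshold) for w in ig}
```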

Technical details of experimental procedure and data analysis

Apparatus and stimuli. Words were presented in an 18-point Lucida Console black typeface at the centre of a 19" LCD screen. They were shown against a silver (RGB 82%, 82%, 82%) background in the middle of a 300 × 100 pixel pattern mask.

The mask was a black rectangle with a grid-like pattern, with an opening to show the word. This was used to control the degree to which word length affected the light reaching the eyes (i.e. to ensure that longer words did not simply mean more black pixels on the screen). Sentence separators were word-like character repetitions consisting of 4 to 9 digits (e.g. 3333333) or other non-alphabetic characters (e.g. &&&&&&), which were designed to mimic the same early visual activity as words without evoking psycholinguistic processing.
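For concreteness, separators of the kind described could be generated as in the following sketch; the particular character set and the random sampling are assumptions, and only the repetition lengths (4 to 9) follow the description above.

```python
# A small sketch of generating word-like sentence separators.
import random

SEPARATOR_CHARS = "0123456789&%#?!"  # digits or other non-alphabetic characters

def make_separator():
    # A single non-alphabetic character repeated 4 to 9 times, e.g. "3333333".
    return random.choice(SEPARATOR_CHARS) * random.randint(4, 9)
```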

The screen was positioned approximately 60 cm from the participants and ran at a resolution of 1680 × 1050 with a refresh rate of 60 Hz. Stimulus presentation, timing, and EEG synchronisation were controlled using E-Prime 2 Professional.

Linear mixed models. Hypothesis testing was based on a comparison between an alternative hypothesis model and a null hypothesis model. The initial models were designed according to the "keep it maximal" principle (3). Due to convergence failures, however, we dropped the random effects explaining the least variance and refit the models until convergence was achieved, as suggested in (3, 4).

β0 is the overall intercept and e_pi ∼ N(0, σ²) represents the general error term. The null model was the same as the alternative hypothesis model, except that the fixed effect of information gain was omitted.

After dropping the effects explaining the least variance to achieve convergence, the alternative hypothesis model was refit with the reduced random effects structure. The null model was constructed by removing the fixed effect of information gain, as above. This formulation was used to compute the results displayed in Table 1.
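For illustration, the model comparison logic can be sketched as follows, using Python's statsmodels as a stand-in; the study's exact predictors, random effects structure, and analysis software are not reproduced here, so the formulas and column names below are assumptions, with a random intercept per subject only.

```python
# A sketch of comparing the alternative and null models by likelihood ratio.
import statsmodels.formula.api as smf
from scipy import stats

def compare_models(df):
    # df is assumed to hold one row per word epoch with columns:
    # 'amplitude' (ERP amplitude), 'info_gain' (continuous), 'subject'.
    # Alternative model: fixed effect of information gain plus a random
    # intercept per subject; fit with ML (reml=False) for the LR test.
    alt = smf.mixedlm("amplitude ~ info_gain", df,
                      groups=df["subject"]).fit(reml=False)
    # Null model: identical, except the fixed effect of interest is omitted.
    null = smf.mixedlm("amplitude ~ 1", df,
                       groups=df["subject"]).fit(reml=False)
    # Likelihood ratio test with 1 df for the single dropped fixed effect.
    lr = 2 * (alt.llf - null.llf)
    return lr, stats.chi2.sf(lr, df=1)
```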

Since LMMs without a random slope structure may have an increased Type 1 error rate (3), we wanted to ensure that we achieved similar results from the full (non-converging) and reduced (converging) models, and thus compared their performance.

Classifier training. The classifiers were trained with the information gain labels (low/high); a sketch of this analysis is given at the end of this section. The split at the 75th percentile resulted in an imbalanced label distribution, with approximately three low information gain words for every high information gain word.

Subject   Threshold (µV)   Trials recorded   Trials dropped   Channels dropped
S01       57.42            1 941             388              None
S02       33.88            1 961             392              Fp1, Fp2, TP9, TP10, FT10
S03       65.54            1 936             387              Fp1, Fp2
S04       30.64            1 986             397              Fp1, Fp2, P7
S05       31.19            1 959             391              Fp1, Fp2, F7, TP9, TP10
S06       51.04

[Lists of characteristic words for the ten Wikipedia topic articles: painting, Plato, politics, Michelangelo, savanna, schizophrenia, homeschooling, society, star, telephone.]

Movie S1. Animation of differential scalp topographies for low/high information gain words for the time
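The classification analysis referenced above can be sketched as follows; the classifier choice (regularised logistic regression on flattened epochs), the cross-validation scheme, and the variable names are assumptions, and only the low/high label definition comes from the text.

```python
# A sketch of predicting low/high information gain labels from EEG epochs.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def classify(epochs, labels):
    # epochs: array of shape (n_trials, n_channels, n_samples);
    # labels: 0 (low) / 1 (high), split at the 75th percentile.
    X = epochs.reshape(len(epochs), -1)  # flatten channels x time per trial
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    # With the roughly 3:1 class imbalance noted above, ROC AUC is a more
    # informative score than raw accuracy.
    return cross_val_score(clf, X, labels, cv=5, scoring="roc_auc")
```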