A potential cortical precursor of visual word form recognition in untrained monkeys

Skilled human readers can readily recognize written letters and letter strings. This domain of visual recognition, known as orthographic processing, is foundational to human reading, but it is unclear how it is supported by neural populations in the human brain. Behavioral research has shown that non-human primates (baboons) can learn to distinguish written English words from pseudo-words (lexical decision), successfully generalize that behavior to novel strings, and exhibit behavioral error patterns that are consistent with humans. Thus, non-human primate models, while not capturing the entirety of human reading abilities, may provide a unique opportunity to investigate the neuronal mechanisms underlying orthographic processing. Here, we investigated the neuronal representation of letters and letter strings in the ventral visual stream of naive macaque monkeys, and asked to what extent these representations could support visual word recognition. We recorded the activity of hundreds of neurons at the top two levels of the ventral visual form processing pathway (V4 and IT) while monkeys passively viewed images of letters, English words, and non-word letter strings. Linear decoders were used to probe whether those neural responses could support a battery of orthographic processing tasks such as invariant letter identification and lexical decision. We found that IT-based decoders achieved baboon-level performance on these tasks, with a pattern of errors highly correlated to the previously reported primate behavior. This capacity to support orthographic processing tasks was also present in the high-layer units of state-of-the-art artificial neural network models of the ventral stream, but not in the low-layer representations of those models. 
Taken together, these results show that the IT cortex of untrained monkeys contains a reservoir of precursor features from which downstream brain regions could, with some supervised instruction, learn to support the visual recognition of written words. This suggests that the acquisition of reading in humans did not require a full rebuild of visual processing, but rather the recycling of a brain network evolved for other visual functions.


Introduction
[...] recordings were made in four Rhesus monkeys using chronically implanted intracortical microelectrode arrays (Utah) (Figure 1A). Neuronal responses were measured while each monkey passively viewed streams of images, consisting of alphabet letters, English words, and pseudo-word strings, presented in a rapid serial visual presentation (RSVP) protocol at the center of gaze (Fig. 1). Images were presented in randomized order, and each image was shown at least 25 times. Crucially, monkeys had no previous supervised experience with orthographic stimuli, and they were not rewarded contingently on the stimuli, but solely for accurately fixating. This experimental procedure resulted in [...] a decoder, which computes a simple weighted sum over the IT population activity, to discriminate between two classes of stimuli (e.g. words versus pseudo-words). The decoder weights are "learned" using the IT population responses to a subset of stimuli (using 90% of the stimuli for training), and then the performance of the decoder is tested on held-out stimuli. The overarching prediction of the "IT precursor" hypothesis was that, if a putative neural mechanism (i.e. a particular readout of a particular neural population) is sufficient for primate orthographic processing behaviors, then it should be easy to learn (i.e. few supervised examples), its learned performance should match the overall primate performance, and its learned performance should have similar patterns of errors as primates that have learned those same tasks. This logic has been previously applied to the domain of core object recognition to uncover specific neural linking hypotheses (25) that have been successfully validated with direct causal perturbation of neural activity (26, 27).
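The train-on-90%, test-on-held-out logic of such a weighted-sum readout can be sketched in a few lines of Python. This is only an illustration on synthetic data: the population responses are simulated, and the decoder is a simple mean-difference (nearest class mean) rule chosen for brevity, which is an assumption and not necessarily the linear decoder used in the study. All function names are hypothetical.

```python
import random

def fit_mean_difference_decoder(X, y):
    """Return (w, b) such that sum(w * x) + b > 0 predicts class 1."""
    n_sites = len(X[0])
    class_mean = {}
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        class_mean[c] = [sum(r[j] for r in rows) / len(rows) for j in range(n_sites)]
    # Weights point from the class-0 mean towards the class-1 mean.
    w = [m1 - m0 for m0, m1 in zip(class_mean[0], class_mean[1])]
    midpoint = [(m0 + m1) / 2 for m0, m1 in zip(class_mean[0], class_mean[1])]
    b = -sum(wj * mj for wj, mj in zip(w, midpoint))
    return w, b

def predict(w, b, x):
    """Weighted sum over the population response, thresholded at zero."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

def simulate_population(n_stimuli=200, n_sites=50, shift=1.0, seed=0):
    """ASSUMPTION: synthetic sites with unit Gaussian noise and a class-dependent mean shift."""
    rng = random.Random(seed)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n_sites)]
    X, y = [], []
    for i in range(n_stimuli):
        label = i % 2  # alternate the two classes (e.g. word vs. pseudo-word)
        X.append([rng.gauss(0.0, 1.0) + label * shift * s for s in signs])
        y.append(label)
    return X, y

def held_out_accuracy(X, y, train_fraction=0.9, seed=1):
    """Fit on a random 90% of the stimuli and score on the remaining 10%."""
    rng = random.Random(seed)
    order = list(range(len(X)))
    rng.shuffle(order)
    n_train = int(train_fraction * len(order))
    train, test = order[:n_train], order[n_train:]
    w, b = fit_mean_difference_decoder([X[i] for i in train], [y[i] for i in train])
    hits = sum(predict(w, b, X[i]) == y[i] for i in test)
    return hits / len(test)

X, y = simulate_population()
accuracy = held_out_accuracy(X, y)
print(f"held-out accuracy: {accuracy:.2f}")
```

Because the test stimuli never enter the fit, the reported accuracy measures generalization to novel stimuli rather than memorization of the training set.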

Lexical decision
[...] using a random subset of the stimuli tested on baboons (17). We collected the responses of 510 IT neural sites and 277 V4 neural sites to a base set of 308 four-letter written words and 308 four-letter pseudo-words (see Figure 1B, "base set" for example stimuli). To test the capacity of the IT neural representation to support lexical decision, we trained a linear decoder using the IT population responses to a subset of words and pseudo-words, and tested the performance of the decoder on held-out stimuli. Note that this task requires generalization of a learned lexical classification to novel orthographic stimuli, rather than the mere memorization of orthographic properties. Figure 2A [...]

Figure 1: (A) We recorded the activity of hundreds of neural sites in IT while monkeys passively fixated images of orthographic stimuli. (As a control, we also recorded from the dominant input to IT, area V4.) We then tested the sufficiency of the IT representation on 30 tests of orthographic processing (e.g. lexical decision, letter identification, etc.) using simple linear decoders. (B) Example visual stimuli. Images consisted of four-letter English words and pseudo-word strings presented in canonical views, as well as with variation in case (upper/lower) and size (small/medium/large), and single alphabet letters presented at four different locations.

Figure 2: (A) Comparison of baboon behavior and a linear readout of IT neurons, plotted as the proportion of stimuli categorized as "words." The 616 individual stimuli were grouped into equally-sized bins based on the baboon performance, separately for words (red) and pseudo-words (blue). Error bars correspond to SEM, obtained via bootstrap resampling over stimuli; the dashed line corresponds to the unity line, marking a perfect match between baboon behavior and IT-based decoder outputs. (B) Performance of decoders trained on IT and V4 representations on lexical decision, for varying number of neural sites. Distribution of individual baboon performance is shown on the right. (C) Consistency with baboon behavioral patterns of decoders trained on IT and V4 representations, for varying number of neural sites. (D) Distribution of selectivity of lexical decision for individual IT sites, highlighting the subpopulation of sites with selectivity significantly different from zero. (E) Performance of decoders trained on the subpopulation of selective sites from (D) compared to the remaining IT sites and all IT sites.

[...] baboons and neural populations exhibited similar behavioral patterns across stimuli, e.g. whether letter strings that were difficult to categorize for baboons were similarly difficult for these neural populations. To reliably measure [...] population and the pool of baboons using a noise-adjusted correlation (see Methods). We observed that the pattern of performances obtained from the IT population was highly correlated with the corresponding baboon pool behavioral pattern (noise-adjusted correlation ρ̃ = 0.64; Figure 2C, blue). Perhaps any neural population would [...] than the corresponding value estimated from the V4 population, which is only one visual processing layer away from IT (ρ̃ = 0.11; Figure 2C, green). By holding out data from each baboon subject from the pool, we additionally estimated the consistency between each individual baboon subject and the remaining pool of baboons (median ρ̃ = 0.67, inter-quartile range = 0.27, n=6 baboon subjects). This consistency value corresponds to an estimate of the ceiling of behavioral consistency, accounting for inter-subject variability. Importantly, the consistency of IT-based decoders is within this baboon behavioral range; this demonstrates that this neural mechanism is as consistent with the baboon pool as each individual baboon is with the baboon pool, at this behavioral resolution. Together, these results suggest that the distributed neural representation in macaque IT cortex is sufficient to explain the lexical decision behavior of baboons, which itself was previously found to be correlated with human behavior (17).

We next explored how the distributed IT population's capacity for supporting lexical decision arose from single neural sites. Figure 2D shows the distribution of word selectivity of individual sites in units of d', with positive values corresponding to increased firing rate response for words over pseudo-words. We observed that, across the population, IT did not show strong selectivity for words over pseudo-words (average d' = 0.008 ± 0.09, mean ± SD over 510 IT sites), and that no individual IT site was strongly selective for words vs. pseudo-words (|d'| < 0.5 for all recorded sites). However, a small but significant proportion of sites (10%; p < 10⁻⁵, binomial test with 5% probability of success) exhibited a weak but significant selectivity for this contrast (inferred by a two-tailed exact test with bootstrap resampling over stimuli). Note that this subset of neural sites includes both sites that responded preferentially to words and sites that responded preferentially to pseudo-words. We measured the lexical decision performance of decoders trained on this subpopulation of neural sites, compared to the remaining subpopulation. Importantly, to avoid a selection bias in this procedure, we selected and tested neural sites based on independent sets of data (disjoint split-halves over trial repetitions). As shown in Figure 2E, we observed that decoders trained on this subset of selective neural sites performed better than a corresponding sample from the remaining non-selective population, [...] whether this subset of selective neural sites was topographically organized on the cortical tissue. For this subset of neural sites, we did not observe a significant hemispheric bias (p=0.13, binomial test with probability of success matching our hemisphere sampling bias), or significant spatial clustering within each 10x10 electrode array (Moran's I=0.11, p=0.70, see Methods). Thus, we observed no direct evidence for topographically organized specialization (e.g. orthographic category-selective domains) in untrained non-human primates, at the resolution of single neural sites. Taken together, these results suggest that lexical decision behavior could be supported by a sparse, distributed read-out of the IT representation in untrained monkeys, and provide a baseline against which to compare future studies of trained monkeys.
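The binomial test on the proportion of selective sites can be reproduced approximately as a worked calculation. The sketch below assumes a one-sided (upper tail) test and assumes that "10%" corresponds to 51 of the 510 IT sites; the exact count and the sidedness of the test are not stated in the text, so both are labeled as assumptions in the code.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly over the tail."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_sites = 510        # IT sites reported in the text
chance_rate = 0.05   # probability of success under the null (5%)
n_selective = 51     # ASSUMPTION: 10% of 510; the exact count is not given

# ASSUMPTION: one-sided test for an excess of selective sites.
p_value = binomial_upper_tail(n_selective, n_sites, chance_rate)
print(f"P(X >= {n_selective} | n={n_sites}, p={chance_rate}) = {p_value:.2e}")
```

Under these assumptions, about 25 sites would be expected to pass a 5% criterion by chance alone, so observing roughly twice that number yields a tail probability below the reported bound of 10⁻⁵.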

Importantly, human readers can not only discriminate between different orthographic objects, but also do so [...] shows the performance of a decoder trained on the IT neuronal representation on each of these three types of behavioral tests, as a function of the neural sample size. For comparison, we also show the same decoder test for the V4 population. Once again, we observe that the IT population achieved relatively high performance across all tasks, and that this performance was greater than the corresponding estimated performance from the measured V4 population. We note that performance values for invariant lexical decision should not be directly compared with those in Figure 2B, as invariant tests here were conducted with fewer training examples for the decoders (i.e. trained/tested on 40 distinct word/pseudo-word strings, rather than 308 strings in Figure 2B).
We additionally tested the feature representation obtained from a deep recurrent convolutional neural [...] defined cortical area in the ventral visual hierarchy (V1, V2, V4, and IT).

Finally, we tested how the IT population's capacity for these 29 invariant orthographic processing tests was distributed across individual IT neural sites. We computed the selectivity of individual sites in units of d' for each of these tests, and estimated the statistical significance of each selectivity index using a two-tailed exact test with bootstrap resampling over stimuli (see Methods). Figure 3B shows a heatmap of significant selectivity indices for all pairs of neural sites and behavioral tests; each row corresponds to one behavioral test, each column to a single IT neural site, and filled bins indicate statistically significant selectivity. The histogram above shows the number of behavioral tests that each neural site exhibited selectivity for (median: 3 tests, inter-quartile range: 5), and the [...] sites, inter-quartile range: 23/337).
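For concreteness, a per-site selectivity index and a bootstrap significance test over stimuli can be sketched as follows. The d' definition used here (difference of means divided by the pooled standard deviation) is a common convention, and the sign-flip bootstrap is one plausible reading of "two-tailed test with bootstrap resampling over stimuli"; both are assumptions, since the Methods details are not reproduced in this section. The responses are synthetic.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def d_prime(a, b):
    """Selectivity index: difference of means over the pooled standard deviation."""
    pooled_sd = ((sample_variance(a) + sample_variance(b)) / 2) ** 0.5
    return (mean(a) - mean(b)) / pooled_sd

def bootstrap_two_tailed_p(a, b, n_boot=1000, seed=0):
    """Resample stimuli with replacement and ask how often d' changes sign."""
    rng = random.Random(seed)
    n_positive = 0
    for _ in range(n_boot):
        resampled_a = [rng.choice(a) for _ in a]
        resampled_b = [rng.choice(b) for _ in b]
        if d_prime(resampled_a, resampled_b) > 0:
            n_positive += 1
    frac_positive = n_positive / n_boot
    return min(1.0, 2 * min(frac_positive, 1 - frac_positive))

# ASSUMPTION: one synthetic site, one response value per stimulus,
# 308 stimuli per class as in the base set, true d' of 0.45.
rng = random.Random(42)
word_responses = [rng.gauss(0.45, 1.0) for _ in range(308)]
pseudoword_responses = [rng.gauss(0.0, 1.0) for _ in range(308)]

selectivity = d_prime(word_responses, pseudoword_responses)
p_value = bootstrap_two_tailed_p(word_responses, pseudoword_responses)
print(f"d' = {selectivity:.2f}, bootstrap p = {p_value:.3f}")
```

Applying such a test to every pair of site and behavioral test would give a binary significance matrix of the kind summarized in the heatmap.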

Taken together, these results suggest that a sparse, distributed read-out of the adult IT representation of untrained non-human primates is sufficient to support many visual discrimination tasks, including ones in the domain of orthographic processing, and that that neural mechanism could be learned with a [...]

We first asked if individual IT neural sites exhibit any selectivity for letters. To test this, we measured the [...] Figure 4B shows the average normalized response to each of the 26 letters, across these 222 neural sites. For each neural site, letters were sorted according to the site's response magnitude, estimated using half of the data (split-[...] on the remaining half (individual sites in grey, mean ± SEM in black). Across the entire population, we observe that some neural sites reliably respond more to some letters than others, but this modulation is generally not selective [...] 4C, formatted as in Figure 4B), with a greater response to letters presented contralateral to the recording site, while also exhibiting substantial tolerance across positions.
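The reason for sorting letters on one half of the data and plotting the other half can be illustrated with simulated sites that have no true letter preference. In this sketch (synthetic Gaussian noise, hypothetical function names), sorting and plotting the same half produces an apparent tuning curve out of noise alone, whereas the held-out half stays flat.

```python
import random

def rank_ordered_response(sort_half, plot_half):
    """Order letters by response in one half; return the other half's responses in that order."""
    order = sorted(range(len(sort_half)), key=lambda i: sort_half[i], reverse=True)
    return [plot_half[i] for i in order]

rng = random.Random(0)
n_sites, n_letters = 200, 26

same_half_curve = [0.0] * n_letters   # sorted and plotted on the same data (biased)
held_out_curve = [0.0] * n_letters    # sorted on half A, plotted on half B (unbiased)

for _ in range(n_sites):
    # ASSUMPTION: a site with no letter preference, so both halves are independent noise.
    half_a = [rng.gauss(0.0, 1.0) for _ in range(n_letters)]
    half_b = [rng.gauss(0.0, 1.0) for _ in range(n_letters)]
    biased = rank_ordered_response(half_a, half_a)
    unbiased = rank_ordered_response(half_a, half_b)
    for rank in range(n_letters):
        same_half_curve[rank] += biased[rank] / n_sites
        held_out_curve[rank] += unbiased[rank] / n_sites

biased_range = same_half_curve[0] - same_half_curve[-1]
unbiased_range = held_out_curve[0] - held_out_curve[-1]
print(f"same-half range: {biased_range:.2f}, held-out range: {unbiased_range:.2f}")
```

Any slope that survives in the held-out curve of real data therefore reflects reliable letter preferences rather than a selection artifact.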

Next, we asked whether the encoding of letter strings could be approximated as a local combination of responses to individual letters. To test this, we linearly regressed each site's response to letter strings on the responses to the corresponding individual letters at the corresponding position, cross-validating over letter strings.

Using the neural responses to all four letters, we observed that the predicted responses of such a linear reconstruction were modestly correlated with the measured responses to letter strings (see Figure 4D, right-most bar; ρ̃ = 0.29 ± 0.06, median ± standard error of median, n = 222 neural sites). To investigate if this explanatory power arose from all four letters, or whether 4-letter string responses could be explained just as well by a substring of letters, we trained and tested linear regressions using responses to only some (1, 2, or 3) letters. Given that there are many possible combinations for each, we selected the best such mapping from the training data, ensuring that selection and testing were statistically independent. We observed that reconstructions using only some of the letters were significantly poorer in predicting letter string responses (three letters: ρ̃ = 0.12 ± 0.02, median ± standard error of median). Finally, we tested how well a position-agnostic (or "bag of letters") model performed on the same reconstruction task by training and testing linear regressions that mapped responses to letters at incorrect positions (using a fixed, random shuffling of letter positions) onto the responses to whole letter strings. We found that this "bag of letters" model performed significantly worse (ρ̃ = 0.11 ± 0.02, median ± standard error of median).
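The comparison between the position-specific and shuffled-position regressions can be illustrated on a single simulated site. The sketch below is not the authors' analysis: the letter responses are synthetic and independent across positions, the string response is assumed to be a noisy weighted sum of letter responses, a cyclic shift stands in for the fixed random shuffle, and the reported value is a raw Pearson correlation rather than a noise-adjusted one.

```python
import random

def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_least_squares(X, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    rows = [[1.0] + list(x) for x in X]
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(XtX, Xty)

def predict(beta, x):
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

rng = random.Random(0)
N_POS, N_LETTERS, N_STRINGS = 4, 26, 300

# Synthetic responses of one site to each letter at each position.
letter_resp = [[rng.gauss(0, 1) for _ in range(N_LETTERS)] for _ in range(N_POS)]
strings = [[rng.randrange(N_LETTERS) for _ in range(N_POS)] for _ in range(N_STRINGS)]
# ASSUMPTION: the string response is a weighted sum of letter responses plus noise.
true_w = [1.0, 0.8, 0.6, 0.4]
string_resp = [sum(w * letter_resp[p][s[p]] for p, w in enumerate(true_w)) + rng.gauss(0, 0.5)
               for s in strings]

def features(s, position_map):
    """Predictors for string s; position_map[p] is the position at which letter p is looked up."""
    return [letter_resp[position_map[p]][s[p]] for p in range(N_POS)]

def cross_validated_r(position_map):
    """Fit on the first half of the strings and correlate predictions on the second half."""
    half = N_STRINGS // 2
    X = [features(s, position_map) for s in strings]
    beta = fit_least_squares(X[:half], string_resp[:half])
    predictions = [predict(beta, x) for x in X[half:]]
    return pearson(predictions, string_resp[half:])

r_position_specific = cross_validated_r([0, 1, 2, 3])
r_shuffled_positions = cross_validated_r([1, 2, 3, 0])  # fixed, incorrect positions
print(f"position-specific r = {r_position_specific:.2f}")
print(f"shuffled-position r = {r_shuffled_positions:.2f}")
```

In this toy setting the position-specific model recovers most of the variance by construction, while the shuffled model typically does much worse; the real data show a far more modest ceiling, which is the point made in the text.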

Note that all correlation values reported above were adjusted to account for the reliability of measured neural responses, such that a fully predictive model would have a noise-adjusted correlation of 1.0 regardless of the finite amount of data that were collected. Yet, the maximal value of ρ̃ = 0.29 that we obtained using the linear superposition [...]

Taken together, these observations demonstrate that individual IT neural sites in untrained non-human primates, while failing to exhibit strong orthographic specialization, collectively suffice to support a battery of [...]

Figure 4: [...] (B) Average normalized response to each of the 26 letters, across 222 IT neural sites. For each neural site, letters were sorted according to the site's response magnitude (estimated using an independent half of the data) and plotted in gray. Averaging across the entire population, we observe that neural sites reliably respond more to some letters than others (black, mean ± SE across sites; note that SE is very small). However, this modulation is not very selective to individual letters or small numbers of letters, as quantified by the sparsity of letter responses (bottom panel). (C) Individual sites were also modulated by the letter position, exhibiting substantial tolerance across positions (formatted as in B). (D) To test if the encoding of letter strings can be approximated as a local combination of responses to individual letters, we reconstructed letter string responses from letter responses, for each neural site. As illustrated by the inset, we used the neural response to images of individual constituent letters to predict the response to images of the corresponding letter strings; predictions were made using linear regressions, cross-validating over letter strings. The bar plot shows the noise-adjusted correlation of different regression models (median ± SE across neural sites). The "bag of letters" model uses responses of each of the four letters, at arbitrary positions, to predict responses of whole letter strings. Each of the position-specific models uses the responses of up to four letters at the appropriate position to predict letter string responses.
[...] abilities in the human species, we reasoned that the high-level visual representations in the primate ventral visual stream could serve as a precursor that is recycled by developmental experience for human orthographic processing abilities. In other words, we hypothesized that the neural representations that directly underlie human orthographic processing abilities must be strongly constrained by the prior evolution of the primate visual cortex, such that representations present in naïve, illiterate, non-human primates could be minimally adapted to support orthographic processing. Here, we observed that orthographic information was explicitly encoded in sampled populations of spatially distributed IT neurons in naïve, illiterate, non-human primates. Our results are consistent with the hypothesis that the population of IT neurons in each subject forms an explicit representation of orthographic objects, and could serve as a common substrate for learning many visual discrimination tasks, including ones in the domain of orthographic processing.

We tested a battery of 30 orthographic tests, spanning a lexical decision task (words versus pseudo-words), invariant letter recognition, and invariant bigram recognition, as well as modifications that required tolerance to variation in text size and case. We do not claim that these tasks form an exhaustive characterization of orthographic processing, but rather a good starting point for that greater goal. Importantly, this battery of tasks could not be [...]

Behavioral consistency. To quantify the behavioral similarity between baboons and candidate visual systems (both neural and artificial) with respect to the pattern of unbiased performance described above, we used a measure called "consistency" (ρ̃) as previously defined (46), computed as a noise-adjusted correlation of behavioral signatures (47).

For each system, we randomly split all behavioral trials into two equal halves and estimated the pattern of unbiased performance on each half, resulting in two independent estimates of the system's behavioral signature. We took the Pearson correlation between these two estimates of the behavioral signature as a measure of the reliability of that [...].

Since all correlations in the numerator and denominator were computed using the same number of trials, we did not need to make use of any prediction formulas (e.g. extrapolation to larger number of trials using Spearman-Brown prediction formula). This procedure was repeated 10 times with different random split-halves of trials. Our rationale for using a reliability-adjusted correlation measure for consistency was to account for variance in the behavioral signatures that is not replicable by the experimental condition (image and task).
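A reliability-adjusted correlation of this kind can be sketched as follows. The exact estimator is not reproduced in this section, so the form used here is an assumption: the cross-system Pearson correlation, averaged over the four pairings of split-halves, is divided by the geometric mean of the two split-half reliabilities. The signatures are synthetic, and a single split is shown rather than the 10 repetitions described above.

```python
import random

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def noise_adjusted_correlation(a1, a2, b1, b2):
    """a1, a2: split-half signatures of system A; b1, b2: split-half signatures of system B."""
    # Numerator: cross-system correlation, averaged over the four half pairings.
    cross = (pearson(a1, b1) + pearson(a1, b2) + pearson(a2, b1) + pearson(a2, b2)) / 4
    # Denominator: geometric mean of the two split-half reliabilities.
    reliability_product = pearson(a1, a2) * pearson(b1, b2)
    if reliability_product <= 0:
        raise ValueError("split-half reliability must be positive for both systems")
    return cross / reliability_product ** 0.5

# ASSUMPTION: two systems sharing the same true signature over 616 stimuli,
# each half measured with independent unit-variance noise.
rng = random.Random(0)
n_stimuli = 616
shared_signature = [rng.gauss(0, 1) for _ in range(n_stimuli)]

def noisy_estimate(signature):
    return [s + rng.gauss(0, 1) for s in signature]

a1, a2 = noisy_estimate(shared_signature), noisy_estimate(shared_signature)
b1, b2 = noisy_estimate(shared_signature), noisy_estimate(shared_signature)

raw = pearson(a1, b1)
adjusted = noise_adjusted_correlation(a1, a2, b1, b2)
print(f"raw correlation = {raw:.2f}, noise-adjusted = {adjusted:.2f}")
```

In this example the raw correlation is attenuated by measurement noise to roughly 0.5, while the adjusted value is close to 1.0, as expected for two systems with identical underlying signatures.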

Single neuron analyses. For each neural site, we estimated the selectivity with respect to a number of contrasts