Investigating lexical categorization in reading based on joint diagnostic and training approaches for language learners

Efficient reading is essential for societal participation, making reading proficiency a central educational goal. Here, we use an individualized diagnostics and training framework to investigate processes in visual word recognition and evaluate its usefulness for detecting training responders. To introduce the framework, we (i) motivated a training procedure based on the Lexical Categorization Model (LCM). The LCM describes pre-lexical orthographic processing implemented in the left ventral occipital cortex and is vital to reading. German language learners trained their lexical categorization abilities while we monitored their change in reading speed. In three studies, most language learners increased their reading skills. Next, we (ii) estimated, for each word, the LCM-based features and assessed each reader's lexical categorization capabilities. Finally, we (iii) explored machine learning procedures to find the optimal feature selection and regression model to predict the benefit of the lexical categorization training for each individual. The best-performing pipeline increased the reading speed gain from 23% in the unselected group to 43% in the machine-selected group. This selection process strongly depended on parameters associated with the LCM. Thus, training in lexical categorization can increase reading skills, and accurate computational descriptions of brain functions that allow the motivation of a training procedure, combined with machine learning, can be powerful for individualized reading training procedures.


Supplementary Methods 2: Cross-validated diagnostic procedure
Leave-one-out cross-validation (LOOCV). The diagnostic procedure included three levels (see Fig. 1C): feature extraction, feature selection, and fitting procedure. To reduce the chance of overfitting, we created unseen test data using a cross-validation procedure, i.e., LOOCV (see Fig. 1C). This procedure ensured that the prediction was as unbiased by the data as possible (each prediction was trained on an independent dataset). At the same time, the training sets remained relatively large because only one dataset was left out as the test set (each training set comprised n-1 participants). In this proof-of-concept study introducing the diagnostic machine learning approach, we focused on optimizing the prediction method and ran multiple models with different hyperparameters.
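For illustration, the outer loop can be sketched in R as follows (a minimal sketch with hypothetical object names; fit_pipeline and predict_pipeline stand in for the three levels described below):

```r
# Minimal LOOCV sketch. 'd' is a hypothetical data frame with one row
# per participant; fit_pipeline() and predict_pipeline() stand in for
# the three levels of the diagnostic pipeline described below.
loocv_predictions <- vapply(seq_len(nrow(d)), function(i) {
  train <- d[-i, ]                # training set: n-1 participants
  test  <- d[i, , drop = FALSE]   # test set: the one left-out participant
  predict_pipeline(fit_pipeline(train), test)
}, numeric(1))

# Prediction strength: correlation between predicted and observed benefit
cor(loocv_predictions, d$observed_benefit)
```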
To prevent overfitting at the level of hyperparameters (e.g., which features are selected or which machine learning method is used), we implemented an additional cross-validation loop supplementary to the LOOCV (see Supplementary Fig. 1). This additional loop allowed us to evaluate the hyperparameters and apply the winning variant to left-out data. To this end, we ran a nested cross-validation procedure (1; 2) in which the first step consisted of tuning hyperparameters, again by the leave-one-out method, but this time within nested inner loops while one dataset was held out entirely. This step serves the goal of selecting the best parameter combination for the prediction in the outer loop. The hyperparameter selection across all inner loops was based on their consensus. We therefore call the procedure consensus nested cross-validation: it selects the hyperparameters of the prediction model that produced the most accurate prediction (i.e., the highest correlation between predicted and observed values).
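The consensus logic can be sketched as follows (hypothetical names; score_combination stands in for a full leave-one-out evaluation of one hyperparameter combination on the given data, returning the correlation between predicted and observed values, and predict_with applies a combination to new data):

```r
# Consensus nested cross-validation sketch. 'grid' is a hypothetical
# data frame with one hyperparameter combination per row.
outer_predictions <- vapply(seq_len(nrow(d)), function(i) {
  train <- d[-i, ]                          # outer training set
  # Inner loops: leave one further participant out and record which
  # hyperparameter combination wins on the remaining data.
  winners <- vapply(seq_len(nrow(train)), function(j) {
    scores <- vapply(seq_len(nrow(grid)), function(g)
      score_combination(grid[g, ], train[-j, ]), numeric(1))
    which.max(scores)                       # best combination in this inner loop
  }, integer(1))
  # Consensus: the most often winning combination predicts the held-out set.
  best <- as.integer(names(which.max(table(winners))))
  predict_with(grid[best, ], train, d[i, , drop = FALSE])
}, numeric(1))
```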
Due to this method's high computational demand (i.e., 75 leave-one-out validations within each LOOCV), we restricted the number of hyperparameter combinations to 144 instead of 720 (4 predictor compositions x 9 feature selection criteria x 4 fitting procedures; see Supplementary Fig. 2-4) by focusing on only one model structure. Specifically, during feature extraction we used only a simple mixed-model structure, i.e., lmer(log(response time) ~ [predictor composition] + ([effect] | subject), [data]), as we could not observe any benefit from more complex model structures (e.g., adding the random effect of items or adding behavioral measures; see Supplementary Figure 2). After that, we applied the most often winning model from the inner loops to predict the dataset outside the loop.
We are aware that we would optimally need an additional held-out dataset that stays completely unseen during hyperparameter tuning, and that consensus nested cross-validation still needs to be better established. Still, we can justify our choice by findings that nested cross-validation is largely robust against overfitting (2) and that standard and consensus nested cross-validation achieve similar accuracy (1). We preferred consensus nested cross-validation because standard nested cross-validation is not perfectly applicable here, as we aim at dimensional rather than categorical predictions. When different models are applied across participants, the considerable differences in feature extraction, selection, and fitting can result in inconsistent predictions and deteriorate the overall prediction strength (i.e., the correlation between predicted and observed values).
Cross-validation for categorization. We explored whether the best dimensional prediction model, which numerically predicted the training success (e.g., an expected reading speed increase of 10%), could also be used to categorize responders vs. non-responders (i.e., as a categorization model). A categorization criterion typically optimizes the tradeoff between sensitivity (the proportion of correctly identified responders) and specificity (the proportion of correctly identified non-responders), i.e., both should be optimally balanced. Here, we tested within the inner loops of the nested cross-validation which predicted training benefit should serve as the criterion. In the next step, we aggregated these values across the loops based on the median and applied the resulting criterion value (decision boundary; i.e., a participant was selected if the expected training success was larger than 13.5%).
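A sketch of deriving such a criterion (the balancing rule used here, maximizing the sum of sensitivity and specificity, is an assumption for illustration; inner_loop_thresholds and predicted_benefit are hypothetical objects):

```r
# Find the predicted-benefit threshold that balances sensitivity and
# specificity (illustrative balancing rule: maximize their sum).
best_threshold <- function(predicted, is_responder) {
  candidates <- sort(unique(predicted))
  balance <- vapply(candidates, function(t) {
    selected <- predicted > t
    sensitivity <- mean(selected[is_responder])    # responders correctly selected
    specificity <- mean(!selected[!is_responder])  # non-responders correctly rejected
    sensitivity + specificity
  }, numeric(1))
  candidates[which.max(balance)]
}

# Aggregate the per-inner-loop optima by their median and apply the
# resulting decision boundary (reported value: 13.5%).
criterion <- median(inner_loop_thresholds)
selected_for_training <- predicted_benefit > criterion
```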
Supplementary Figure 1: Schematic depiction of the consensus nested cross-validation used to tune the best hyperparameter combination (e.g., which prediction method is best). For each leave-one-out run, we implemented an inner loop. Within these inner loops, we again ran leave-one-out procedures to find the pipeline that resulted in the highest correlation. Across all inner loops, we searched for the hyperparameter combination selected most often (i.e., we found a consensus on the pipeline for the outer loop). We used the consensually selected pipeline for the predictions in the outer loop.

Prediction procedure
Level 1 - Feature extraction. The first step of our machine learning pipeline is extracting features from the response time and accuracy measures of the first training session. In addition, we used metadata such as the training week or the incoming reading speed as features. To extract features from the response times, we fitted linear mixed models, including specific random slope estimates on the random effect of participants. Linear mixed models allow the hierarchical structure of the data to be fitted explicitly via random effects. Thus, one can take into account the interindividual variance and the individual slope on fitted parameters.
Exploiting this possibility, we fitted a separate model for each parameter or interaction of interest and extracted the corresponding individual slope (see Supplementary Figure 7A). For example, if we hypothetically included parameters a, b, and c as predictors plus the interaction of a and b, we would need to fit four models to obtain the individual estimates of these parameters. Running multiple models rather than a single model that includes all random effects (which might be more intuitive) resulted in higher reliability of the results (more observations within each cluster). The interindividual variance was thereby maximal, and the shrinkage typically implemented for complex models was restricted (see (3)). Thus, we used the individual slope estimates for each parameter and interaction of interest as separate features.
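In R, this per-parameter extraction can be sketched as follows (a sketch with hypothetical column names; the exact predictors are described below):

```r
# Sketch of Level 1 feature extraction with lme4: one model per
# parameter of interest; the participant-specific slope estimate
# becomes one feature per participant.
library(lme4)

extract_slopes <- function(term, data) {
  f <- as.formula(paste("log(response_time) ~", term,
                        "+ (", term, "| subject)"))
  m <- lmer(f, data = data)
  coef(m)$subject            # per-participant intercept and slope estimates
}

# Separate models for each parameter and for an interaction of interest
feat_difficulty   <- extract_slopes("categorization_difficulty", trials)
feat_wordlikeness <- extract_slopes("word_likeness", trials)
feat_interaction  <- extract_slopes("categorization_difficulty * word_likeness",
                                    trials)
```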
We based the predictor compositions of the linear mixed models on the assumptions of the LCM, identical to the group-based analyses described above. Notably, we included the interaction of categorization difficulty and word likeness, as both are supposed to be relevant for pre-semantic word processing in the visual word form area (see (4)). For our basic model structure, we fitted log-transformed response times and added one random effect, taking the interindividual hierarchical structure into account. In the context of the hyperparameter search, we ran 20 models varying the model structures and predictor compositions (5 different structures times 4 different compositions). With these variations, we aimed (i) to find the optimal model for outcome prediction and (ii) to ensure the robustness of our interpretations, as all models showed certain similarities (see all combinations in Supplementary Table 1).
Varying the predictor composition comprised, in particular, adding interactions to the basic variant described above. Furthermore, we added the log-transformed sequence index to all alternative models, capturing the training effect within one session. Focusing on the interactions, we tested 3-way interactions of categorization difficulty and word likeness with (i) lexicality, (ii) word frequency, and (iii) the sequence index. Pre-lexical processing is likely affected by whether the letter string represents any semantics (interaction with lexicality) or whether the word can be found very fast or very slowly in the mental lexicon (interaction with word frequency). Finally, we might achieve better predictions by estimating the training effect on word likeness and categorization difficulty within the first session (interaction with the sequence index).
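As a sketch, the four compositions can be written as formula fragments (placeholder variable names; seq_index denotes the trial sequence within the session):

```r
# The four predictor compositions (illustrative formula fragments)
compositions <- c(
  basic      = "difficulty * word_likeness",
  lexicality = "difficulty * word_likeness * lexicality + log(seq_index)",
  frequency  = "difficulty * word_likeness * frequency  + log(seq_index)",
  sequence   = "difficulty * word_likeness * log(seq_index)"
)
```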
Varying the model structure, we (i) used the basic structure, (ii) added the random effect of items (i.e., which letter string was presented), (iii) added behavioral measures from the pre-diagnostics to the response time data (SLS correct pre-diagnostics + SLS errors pre-diagnostics + design), (iv) subset our data to correct items only, and (v) fitted the binomial distribution of response correctness with generalized linear mixed models (glmer instead of the lmer function). Finally, independently of extracting features from the random effects of the mixed models, we accounted for the pre-training reading level (the number of correctly and falsely answered items in the SLS), the training week (week 1 or 2), and the design variable (indicating whether the post-diagnostic was implemented on day 3 or 4).
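Combining both, the five structures can be sketched as (g)lmer calls (a sketch with placeholder column names, not the exact code used; preds is one of the four compositions above, slope the term whose individual estimate is extracted):

```r
library(lme4)

# Sketch of the five model structure variants described in the text
fit_structure <- function(variant, preds, slope, d) {
  re  <- sprintf("(%s | subject)", slope)
  fml <- paste("log(response_time) ~", preds, "+", re)
  switch(variant,
    basic    = lmer(as.formula(fml), data = d),                         # (i)
    items    = lmer(as.formula(paste(fml, "+ (1 | item)")), data = d),  # (ii)
    behavior = lmer(as.formula(paste(fml,
                 "+ sls_correct + sls_errors + design")), data = d),    # (iii)
    correct  = lmer(as.formula(fml), data = subset(d, accuracy == 1)),  # (iv)
    binomial = glmer(as.formula(paste("accuracy ~", preds, "+", re)),
                     data = d, family = binomial)                       # (v)
  )
}
```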
Level 2 - Feature selection. We applied a stepwise regression procedure to filter out irrelevant features and reduce noise (i.e., unrelated features) and redundancy (i.e., select one of several highly correlated predictors). The selection starts with the intercept model, and predictors are added sequentially based on the fit improvement measured by the Akaike Information Criterion (AIC; (5; 6)). Our maximal model included all features and their 2-way interactions. At each stage of the selection process, the procedure examined all variables to identify whether a feature could be excluded without deteriorating the model fit. The search for relevant features ends when the fit cannot be improved anymore (i.e., no AIC reduction when additional parameters are added). With an evaluation based on AIC, we aim for a simple model, as AIC penalizes model complexity. The crucial factor in choosing this feature selection method was that it allows for interactions of features.
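This selection can be sketched with step() from the stats package ('features' is a hypothetical data frame of extracted features plus the observed training benefit of one training set):

```r
# AIC-based stepwise selection: start from the intercept model, allow
# all features and their 2-way interactions as the upper scope.
feature_names <- setdiff(names(features), "benefit")
upper <- as.formula(paste("benefit ~ (",
                          paste(feature_names, collapse = " + "), ")^2"))

selected <- step(lm(benefit ~ 1, data = features),   # intercept-only start
                 scope = list(lower = ~ 1, upper = upper),
                 direction = "both",  # terms can be added and dropped again
                 trace = FALSE)       # step() evaluates by AIC by default
names(coef(selected))                 # features this run deems relevant
```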
The stepwise regression ran multiple times within the leave-one-out cross-validation method (i.e., 75 training sets x 74 stepwise regressions per training set = 5,550 runs in total per hyperparameter combination; see Supplementary Figure 7B). Subsequently, we counted how often the stepwise regression solutions identified a feature as relevant. If this count exceeded a cutoff value across the n-1 runs (testing cutoff values of 0, 5, 10, 20, 30, 40, 50, 60, and 70), the feature was selected to be used by the fitting procedures. In this way, we found a robust set of features that should be unlikely to overfit. Note that the feature selection had to be run for all training sets of each model variant (i.e., for the outer loop: 75 training sets x 74 runs of stepwise regression x 20 model variants x 9 cutoff values x 4 fitting procedures = 3,996,000). In the inner loop, we omitted running the stepwise regression 73 times within each inner training set to save computational resources (this would have been 75 x 74 x 73 = 405,150 stepwise regressions per hyperparameter combination instead of 5,550). Thus, we used a feature selection identical to the outer loop (i.e., across all 74 datasets within each training set). This simplification hardly influences the results, as only features identified as relevant exactly as often as the cutoff value would be added or removed differently.
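The consensus filter itself is simple (sketch; 'runs' is a hypothetical list with one character vector of selected feature names per stepwise regression):

```r
# Count how often each feature was selected across the stepwise runs
# and keep those exceeding the cutoff.
counts <- table(unlist(runs))
cutoff <- 40                                  # one of 0, 5, 10, ..., 70
robust_features <- names(counts)[counts > cutoff]
```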
Level 3 - Fitting procedure. We used the selected features to predict the reading speed increase from the lexical categorization training with the following model types: (i) multiple regression, (ii) support vector machine with a linear kernel, (iii) support vector machine with a radial kernel, and (iv) random forest. Notably, the models were used as regression algorithms predicting outcomes on a continuous scale, since we are interested in the extent to which the training might be helpful. We considered the correlation, the t-value, and the mean squared error of the comparison between predicted and observed values as indicators of the model fit. We tested all four fitting procedures for each of the 20 model variants.
Multiple regression represents the most straightforward way of fitting data linearly, based on the minimization of the residuals between the dependent variable and the sum of weighted variables. A support vector machine fits a line in a multidimensional space, minimizing the distance between the fitted line and the observed data points. The linear kernel restricts the fit to be linear; with the radial kernel, the fit can be curved, in other words, non-linear. Besides defining the kernel, we kept the default settings of the package e1071 (7). The random forest algorithm is based on combining randomly drawn decision trees. We chose to draw 500 decision trees with three features each (m = 3 in the bag; the remaining out-of-bag data are used to evaluate each tree). Finally, to ensure the prediction was independent of the randomization, we ran the random forest model 30 times and averaged the predictions.
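A condensed sketch of the four fitting procedures (e1071 is named in the text; the randomForest package, the feature and outcome names, and 'train'/'test' data frames are assumptions for illustration):

```r
library(e1071)
library(randomForest)

# Regression formula from the consensus-selected features
f <- as.formula(paste("benefit ~", paste(robust_features, collapse = " + ")))

fit_lm   <- lm(f, data = train)                      # (i)   multiple regression
fit_svml <- svm(f, data = train, kernel = "linear")  # (ii)  SVM, linear kernel
fit_svmr <- svm(f, data = train, kernel = "radial")  # (iii) SVM, radial kernel

# (iv) random forest: 500 trees, 3 features per split; run 30 times and
# average so the prediction does not depend on one randomization
rf_preds <- Reduce(`+`, lapply(seq_len(30), function(k) {
  rf <- randomForest(f, data = train, ntree = 500, mtry = 3)
  predict(rf, newdata = test)
})) / 30

predictions <- data.frame(
  lm   = predict(fit_lm,   newdata = test),
  svml = predict(fit_svml, newdata = test),
  svmr = predict(fit_svmr, newdata = test),
  rf   = rf_preds)
```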

Supplementary Table 1: Combinations of predictor compositions and model structures, resulting in 20 different models.