Main text

The greater degree of personalization offered today by learning apps promises to facilitate the design and implementation of automated, data-driven teaching policies that adapt to each learner’s knowledge over time. However, to fulfill this promise, it is necessary to develop adaptive data-driven models of the learners, which accurately quantify their knowledge, and efficient methods to find teaching policies that are optimal under the learners’ models1,2.

In this context, research in the computer science literature has typically focused on finding teaching policies that either enjoy optimality guarantees under simplified mathematical models of the learner’s knowledge3,4,5,6,7, adapt empirically to learners8,9,10, or optimize engagement11,12. In contrast, research in the cognitive sciences has focused on measuring, using (usually small) randomized controlled trials, the effectiveness of a variety of heuristics that optimize review times informed by psychologically valid models of the learner’s knowledge13,14,15,16,17. Only very recently, Tabibian et al.18 introduced a machine learning modeling framework that bridges the gap between both lines of research: their framework can be used to determine the provably optimal review times under psychologically valid models of the learner’s memory state whose parameters are estimated from real review and recall data using a variant of half-life regression12. However, in the evaluation of their framework, the authors resorted to a natural experiment using data from a popular language-learning online platform rather than a randomized controlled trial, the gold standard in the cognitive sciences literature. As a result, it has been argued that, in an interventional setting, a learner following the recommended rate of study may fail to achieve optimal performance1.

We perform a large-scale randomized controlled trial involving ~50,700 learners of at least 18 years of age in Germany who used an app to study for the written portion of the driver’s permit exam from December 2019 to July 2020 and gave consent to participate in the trial. The goal of the randomized controlled trial is to evaluate to what extent a machine learning algorithm that builds upon Tabibian et al. can help people learn and remember more effectively. However, rather than optimizing the rate of study, which is typically chosen by the learner, as in Tabibian et al., the algorithm determines which questions to include in a learner’s study sessions over time. To facilitate research at the intersection of cognitive science and machine learning, we are releasing an open-source implementation of our algorithm and all the data gathered during our randomized controlled trial.

During the randomized controlled trial, each learner was randomly assigned to a ‘select’, a ‘difficulty’, or a ‘random’ group throughout her entire usage of the app (refer to Supplementary Information for more details on the random assignment). In the ‘select’ group (n = 10,151 learners), the questions of each study session were chosen using our machine learning algorithm. In the ‘difficulty’ group (n = 34,029), they were chosen in circular order according to their initial difficulty, i.e., easier questions first. In the ‘random’ group (n = 13,600), they were chosen uniformly at random with replacement. The only difference in app functionality across groups was the item selection algorithm, and learners did not know to which item selection algorithm they had been assigned. Moreover, in the ‘select’ group, as long as there were questions that the learner had not yet reviewed at least once, these were chosen first in order of initial difficulty, i.e., easier questions first. By the end of the randomized controlled trial, we had recorded more than 16.75 million answers to ~1900 questions by ~50,700 learners in ~628,000 study sessions. Most of the learners were based in Germany (99.1%), they were evenly split between male (50.8%) and female (49.2%), and the most common age group was 18–24 (64.7%), followed by 25–34 (28%).

For consistency, we removed the data from the 6774 learners who reinstalled the app during the trial period and were assigned to a different group after the re-installation (or who installed the app on different devices). Moreover, since we do not expect any algorithm to help learners who are cramming for tests, we did not use data from the 32,445 learners who used the app for less than 2 days. After these preprocessing steps, the resulting dataset contains ~894,000, ~3.3 million, and ~693,000 unique (learner, question) reviewing sequences from 1564, 7582, and 2335 learners in the ‘select’, ‘difficulty’, and ‘random’ groups, respectively (refer to Supplementary Information for more details).
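As an illustration of these preprocessing steps, the sketch below filters a table of recorded answers with pandas. The column names (learner_id, group, timestamp) and the file name are hypothetical and may differ from those used in the released dataset.

```python
import pandas as pd

# Hypothetical schema: one row per recorded answer, with columns
# 'learner_id', 'group', and 'timestamp'; the released data may differ.
answers = pd.read_csv("answers.csv", parse_dates=["timestamp"])

# Drop learners whose group assignment changed (e.g., after a re-installation).
n_groups = answers.groupby("learner_id")["group"].nunique()
answers = answers[answers["learner_id"].isin(n_groups[n_groups == 1].index)]

# Keep only learners who used the app for at least 2 days.
span = answers.groupby("learner_id")["timestamp"].agg(lambda t: (t.max() - t.min()).days)
answers = answers[answers["learner_id"].isin(span[span >= 2].index)]
```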

We first compare learners of the ‘select’, ‘difficulty’, and ‘random’ groups in terms of the normalized empirical forgetting rate18 (Fig. 1). After controlling for review time and number of reviews, the median normalized empirical forgetting rate for the learners in the ‘select’ group was lower than that of the learners in the ‘difficulty’ and ‘random’ groups in 83.5% of the cases, and the decrease was statistically significant (Mann–Whitney U-test, two-sided; p value < 0.05/36, Bonferroni correction) in 66.7% of the cases. Moreover, the median decrease in the median empirical forgetting rate for learners in the ‘select’ group was ~48% and ~40% when compared to learners in the ‘random’ and ‘difficulty’ groups, respectively, and the corresponding median increase in the median half-lives was ~92% and ~40%.

Fig. 1: Normalized empirical forgetting rate.

(Lower is better.) Each triplet of bars in the figure corresponds to (learner, question) pairs in which the learner reviewed the question the same number of times (# reviews) for approximately the same period of time (T). Boxes indicate 25% and 75% quantiles and crosses indicate median values, where lower values indicate better performance. For each triplet of bars, an asterisk indicates a statistically significant difference (Mann–Whitney U-test, two-sided; p value < 0.05/36, Bonferroni correction).
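For concreteness, the significance test reported above can be reproduced for a single bar triplet along the following lines; the arrays below are placeholder data standing in for the per-group normalized empirical forgetting rates of one (# reviews, T) bucket.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Placeholder data; the real values are the normalized empirical forgetting
# rates of the (learner, question) pairs in one (# reviews, T) bucket.
select_rates = rng.exponential(scale=0.5, size=200)
random_rates = rng.exponential(scale=1.0, size=200)

alpha = 0.05 / 36  # Bonferroni-corrected significance level (36 comparisons)
stat, p_value = mannwhitneyu(select_rates, random_rates, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.3g}, significant: {p_value < alpha}")
```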

To further analyze the influence of each algorithm on the normalized empirical forgetting rate, we also fit the following regression model to the data for a fixed number of reviews:

$$\frac{\hat{n}}{\hat{n}_{0}} = c + w_{T}\,(T - T_{\text{median}}) + w_{\text{difficulty}}\, I_{\text{difficulty}} + w_{\text{random}}\, I_{\text{random}},$$
(1)

where c is the intercept term, Tmedian is the median time of the last (test) review, wT captures the impact of the spacing effect, Idifficulty and Irandom are indicator variables that equal 1 if the learner was assigned to the ‘difficulty’ or ‘random’ group, respectively, and 0 otherwise, and (wrandom, wdifficulty) capture the effect of being assigned to the ‘random’ or ‘difficulty’ group, respectively, relative to being assigned to the ‘select’ group. Table 1 summarizes the results, which suggest that the spacing effect holds true in aggregate, i.e., the coefficient associated with the review time T − Tmedian is negative, indicating that the more spaced the reviews are, the lower the final normalized empirical forgetting rate. The results also suggest that, since wdifficulty > 0 and wrandom > 0 across all # reviews, learners in the ‘select’ group have a lower forgetting rate and, since the coefficients for the other groups seem to increase with the number of reviews, the competitive advantage offered by our machine learning algorithm increases with the number of reviews.

Table 1 Multiple regression analysis of the dependence of the normalized empirical forgetting rate on the group assignment and the review time (lower is better).
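As a minimal sketch of how Eq. (1) can be fit for a fixed number of reviews, one can use ordinary least squares with indicator variables for the group assignment. The function below is illustrative and is not part of the released implementation.

```python
import numpy as np

def fit_forgetting_regression(norm_rate, T, group):
    """Ordinary least squares fit of Eq. (1) for a fixed number of reviews.

    norm_rate : normalized empirical forgetting rates (n_hat / n0_hat)
    T         : times of the last (test) review
    group     : labels in {'select', 'difficulty', 'random'}
    """
    T = np.asarray(T, dtype=float)
    group = np.asarray(group)
    X = np.column_stack([
        np.ones_like(T),                        # intercept c
        T - np.median(T),                       # T - T_median
        (group == "difficulty").astype(float),  # I_difficulty
        (group == "random").astype(float),      # I_random
    ])
    coef, *_ = np.linalg.lstsq(X, np.asarray(norm_rate, dtype=float), rcond=None)
    c, w_T, w_difficulty, w_random = coef
    return c, w_T, w_difficulty, w_random
```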

In terms of engagement, learners of the ‘select’ (‘difficulty’) group were 50.6% (47.6%) more likely, in the median, to return to the app within 4–7 days than learners of the ‘random’ group. However, learners of the ‘select’ group were also more likely to stop using the app within the initial 2 days than those of the other groups. Refer to Supplementary Information for more details.

While our results have direct implications for the learning of large sets of paired-associate items by young learners using machine learning-based instruction, we acknowledge that more research at the intersection of cognitive sciences and machine learning is needed to generalize our results to different populations of learners, different materials, or other tasks. In this context, it would also be interesting to compare our algorithm with stronger baselines and experiment with different feedback modalities to further understand which aspects are most responsible for the improved engagement and performance.

Methods

Modeling framework of spaced selection

Given a set of questions \(\mathcal{I}\) whose answers a learner wants to learn, we represent each study session as a triplet \(e:=(t,\mathcal{S},r_{\mathcal{S}})\), where \(\mathcal{S}\subseteq \mathcal{I}\) is the set of questions that the learner reviewed at time t and \(r_{\mathcal{S}}\) is a vector in which each entry corresponds to a question in the set \(\mathcal{S}\) and indicates whether the learner recalled (r = 1) or forgot (r = 0) the answer to the question. Here, note that in the learning app that we used in our randomized experiment, the learner is tested in each study session, similar to most spaced repetition software and online platforms such as Mnemosyne, Synap, and Duolingo, and the seminal work of Roediger and Karpicke19.
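In code, a study session can be represented directly by this triplet. The following dataclass is a minimal sketch of such a representation; the names are ours and not those of the released implementation.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class StudySession:
    """One study session e = (t, S, r_S).

    t        : time of the session (e.g., a Unix timestamp).
    outcomes : maps each reviewed question id in S to its recall outcome,
               1 if the learner recalled the answer and 0 if she forgot it.
    """
    t: float
    outcomes: Dict[int, int]

# Example: at time t = 1000.0 the learner reviewed questions 3 and 7,
# recalled the answer to question 3 and forgot the answer to question 7.
session = StudySession(t=1000.0, outcomes={3: 1, 7: 0})
```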

Given the above representation, we keep track of the study times using a counting process N(t), which counts the number of study sessions up to time t. Following the literature on temporal point processes20, we characterize this counting process using its corresponding intensity u(t), i.e., E[dN(t)] = u(t)dt, and think of the set of questions \(\mathcal{S}\) and the vector \(r_{\mathcal{S}}\) as its binary marks. Moreover, we utilize the well-known memory model from the psychology literature, the exponential forgetting curve model with binary recalls21,22,23,24, to estimate the probability mi(t) that a learner recalls the answer to a question i at time t. Under the exponential forgetting curve model, the recall probability depends on the time since the last review Δi(t) and the forgetting rate \(n_{i}(t)\in \mathbb{R}^{+}\), which may depend on many factors, e.g., the number of previous (un)successful recalls of the answer to the question. To estimate the value of the forgetting rate ni(t), we use a variant of half-life regression12 proposed by Tabibian et al.18 (refer to Supplementary Information).
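Concretely, under the exponential forgetting curve model the recall probability is \(m_i(t)=\exp(-n_i(t)\,\Delta_i(t))\). The sketch below implements this curve together with a simple multiplicative update of the forgetting rate after each review, in the spirit of Tabibian et al.18; the update parameters alpha and beta are illustrative, and the actual estimates come from the half-life regression variant described in Supplementary Information.

```python
import math

def recall_probability(n_i, delta_i):
    """Exponential forgetting curve: m_i(t) = exp(-n_i(t) * Delta_i(t))."""
    return math.exp(-n_i * delta_i)

def update_forgetting_rate(n_i, recalled, alpha=0.05, beta=0.05):
    """Illustrative multiplicative update of the forgetting rate after a review.

    A successful recall slows forgetting down, a failed recall speeds it up;
    alpha and beta are placeholder values, not the estimates used in the app.
    """
    return n_i * (1.0 - alpha) if recalled else n_i * (1.0 + beta)
```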

The SELECT algorithm

Given a set of questions \(\mathcal{I}\), we cast the optimization of the study sessions as the search for the optimal selection probabilities \(p_{i}(t):= \mathbb{P}[i\in \mathcal{S}]\) that minimize the expected value of a particular (quadratic) loss function l(m(t), n(t), Δ(t), p(t)) of the recall probability of the answers to the questions \(\boldsymbol{m}(t)={[m_{i}(t)]}_{i\in \mathcal{I}}\), the forgetting rates \(\boldsymbol{n}(t)={[n_{i}(t)]}_{i\in \mathcal{I}}\), the times since their last review \(\boldsymbol{\Delta}(t)={[\Delta_{i}(t)]}_{i\in \mathcal{I}}\), and the selection probabilities \(\boldsymbol{p}(t)={[p_{i}(t)]}_{i\in \mathcal{I}}\) over a time window (t0, tf].

To solve the above problem, we resort to the theory of stochastic optimal control of jumps and proceed similarly as in Tabibian et al.18. However, in contrast with Tabibian et al., rather than optimizing the rate of study, we optimize the selection probability of each question in each study session. In Supplementary Information, we show that, for each question \(i\in {{{{{\mathcal{S}}}}}}\), the optimal selection probability is:

$$p_{i}^{*}(t)=\frac{1}{\sqrt{q}}\left(1-m_{i}(t)\right)$$
(2)

where q ≥ 1 is a given parameter, which trades off the recall probability upon review against the size of the study sessions—the higher its value, the shorter the study sessions. In practice, in our randomized trial, the app presents questions according to the order given by the selection probabilities and the learner chooses the size of the study session. Therefore, our results are agnostic to the value of the parameter q.

Finally, since the optimal selection probability depends only on the recall probability, which is estimated using the exponential forgetting curve model, we can implement a very efficient procedure to construct study sessions, which we name SELECT (refer to Supplementary Information).
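The following is a minimal sketch of such a procedure under the assumptions above: unseen questions are presented first in order of initial difficulty, and reviewed questions are ordered by their optimal selection probability from Eq. (2). The data layout and names are ours, not those of the released implementation.

```python
import math

def select_session(questions, now, q=1.0):
    """Sketch of SELECT: order questions for the next study session.

    questions : list of dicts with keys 'id', 'forgetting_rate' (n_i(t)),
                'last_review' (time of the last review, or None if the
                question has never been reviewed), and 'difficulty'
                (initial difficulty used to order unseen questions).
    now       : current time, in the same units as 'last_review'.
    q         : trade-off parameter from Eq. (2); it only rescales the
                selection probabilities, so the ordering does not depend on it.
    """
    unseen = [x for x in questions if x["last_review"] is None]
    seen = [x for x in questions if x["last_review"] is not None]

    # Unseen questions come first, easier ones before harder ones.
    unseen.sort(key=lambda x: x["difficulty"])

    def selection_probability(x):
        delta = now - x["last_review"]
        m = math.exp(-x["forgetting_rate"] * delta)  # recall probability m_i(t)
        return min(1.0, (1.0 - m) / math.sqrt(q))    # Eq. (2), capped at 1

    # Reviewed questions are ordered by decreasing optimal selection
    # probability; the learner decides how many of them to study.
    seen.sort(key=selection_probability, reverse=True)
    return unseen + seen
```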

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.