Abstract
The widespread adoption of online courses opens opportunities for analysing learner behaviour and optimising web-based learning adapted to observed usage. Here, we introduce a mathematical framework for the analysis of timeseries of online learner engagement, which allows the identification of clusters of learners with similar online temporal behaviour directly from the raw data without prescribing a priori subjective reference behaviours. The method uses a dynamic time warping kernel to create a pairwise similarity between timeseries of learner actions, and combines it with an unsupervised multiscale graph clustering algorithm to identify groups of learners with similar temporal behaviour. To showcase our approach, we analyse task completion data from a cohort of learners taking an online postgraduate degree at Imperial Business School. Our analysis reveals clusters of learners with statistically distinct patterns of engagement, from distributed to massed learning, with different levels of regularity, adherence to pre-planned course structure and task completion. The approach also reveals outlier learners with highly sporadic behaviour. A posteriori comparison against student performance shows that, whereas high-performing learners are spread across clusters with diverse temporal engagement, low performers are located significantly in the massed learning cluster, and our unsupervised clustering identifies low performers more accurately than common machine learning classification methods trained on temporal statistics of the data. Finally, we test the applicability of the method by analysing two additional data sets: a different cohort of the same course, and timeseries of a different format from another university.
Introduction
The application of data analytics to educational data has surged in the past few years, facilitated by the adoption of online learning platforms.^{1} However, in parallel to the increased access to detailed information, it is crucial to identify both the right type of data and the analytical approaches that will allow us to gain interpretable insights into online engagement and learning patterns.^{2} The process of learning extends over time and thus the analysis of temporal data is an important focus for educational data analytics. In this work, we describe a methodology for the study of timeseries data collected from the engagement of learners with the tasks and stages of online courses. The analysis of temporal statistics has been shown to provide a fruitful avenue for identifying learners at risk of failure,^{3} predicting performance,^{4} detecting learners likely to drop out of a course,^{5,6,7,8} and identifying learner behaviours.^{9} Despite such developments, a recent review of the field suggested that temporal analyses are currently insufficient in number, and that additional methodologies are required.^{10}
Temporal analytics has been used in the educational context to investigate massed versus distributed study modes, i.e., to compare the performance of learners that study the material ‘massed’ (or ‘crammed’) into a single study period to that of learners that ‘distribute’ their study of the material across a number of shorter study periods. The general conclusion has been that distributed practice is the more effective strategy.^{11} The benefits of such ‘spacing effect’^{12} have been shown over differing periods and within different contexts,^{13} although other reports have noted that the effect does not apply to all learning contexts.^{14} However, a feature of previous data analyses is that they generally allocate subjects in advance to one of the two predetermined study modes. Indeed, preallocation is also an inherent restriction in supervised machine learning approaches, where labels are assigned a priori to train an algorithm.
Recent studies have collected timeseries of learners’ behaviours and used them to cluster learners according to preselected features of the data (e.g., task focus, resource usage, etc.) chosen to describe different approaches to problem solving. However, such methods are highly dependent both on the temporal features chosen as descriptors, which are based on specific knowledge of the data, and on the number of groups that are obtained by the clustering. For example, a recent study extracted particular features from learners following a blended course (i.e., on two platforms: face-to-face and online) and identified four behavioural groups separated according to their differing levels of engagement across the two platforms.^{15} Such studies exemplify how the combination of temporal analytics and cluster analysis can provide insights of use to educators, course designers, and researchers in learning analytics.^{10,16}
Here, we present an unsupervised methodology that allows the direct analysis of raw timeseries gathered from the engagement of learners as they complete tasks of online courses, without imposing a priori either the statistical descriptors of the timeseries or the number or type of groups of learners to be detected. Hence the obtained learner clusters are not predetermined or identified subjectively based on prior features but are detected algorithmically during the data analysis. To exemplify our approach, we analysed in detail the timeseries (i.e., timestamped data of task completion) of 81 learners as they undertook the six online compulsory courses that form the first year of a 2-year part-time postgraduate management degree. The courses extended over three terms and the patterns of task completion differ greatly across the learner group. Three examples of such highly distinct timeseries are shown in Fig. 1, showing a variety of behaviours: from steady completion to highly massed behaviour to sporadic patterns. To highlight the applicability of the method, we also applied it to two additional data sets: a different set of timeseries of task completion collected from the same degree programme but from a different year cohort, and a set of timeseries of online interactions (not of task completion) collected by a different university and therefore with distinct characteristics.
The methodology is summarised in Fig. 2. We use the raw, timestamped series of online actions from each learner and employ a dynamic time warping (DTW) kernel^{17} to calculate a similarity score between all pairs of learner timeseries. Although several alternative methods exist to measure the similarity between two timeseries (e.g., Euclidean distance, Fourier coefficients, autoregressive models, edit distance, or minimum jump models),^{18} DTW has been shown to outperform a variety of measures in classification tasks^{19} and provides a principled way to use the full, raw information of the timeseries without preselecting features or functional representations.^{20} From the ensuing DTW similarity matrix, we construct a similarity graph, where nodes are learners and weighted links represent similarities between learners. This graph construction step is carried out using the Relaxed Minimum Spanning Tree algorithm,^{21} which aims to encapsulate the locally strong and globally relevant similarities in the data set. Relaxed minimum spanning tree (RMST) has been shown to perform well in conjunction with the multiscale, unsupervised graph partitioning methodology of Markov Stability,^{22,23} which we apply to our graph to obtain clusters of learners with similar temporal behaviours. Alternative methods to cluster timeseries data, with and without the creation of graphs, have been proposed in other contexts and applications.^{24,25,26,27} Instead of finding one particular clustering, our algorithm produces a multiscale description, given by a set of consistent clusterings of different coarseness obtained by robustly optimising across all levels of resolution in an unsupervised manner, without preimposing the number or type of clusters (see Fig. 3a for an example). Clusterings of different coarseness can then be used by the analyst according to their needs. If no robust clusterings are found, the algorithm will signal a lack of natural clusters in the data. 
Details of the computational analysis are given in the Methods section.
When applied to our case study data set, our analysis identifies a set of clusterings of learners at different levels of resolution. The clusters of learners reflect their differing temporal engagement as they progress through the online courses. In particular, our data-driven clusters capture behaviours associated with massed (i.e., completion of a large number of tasks within a short time period) and distributed learning, as well as finer behaviours that differentiate these learning types into subgroups. For instance, at a coarse level, the algorithm identifies a cluster of learners that follow the course in a sequential and distributed manner; yet, at a finer resolution, this cluster is subdivided into two clusters, which differ by 1–2 weeks in the average completion times of tasks (i.e., ‘early birds’ and ‘on time’). Our approach also finds sporadic learners that skip a large number of tasks or exhibit irregular massed learning depending on particular courses or at different times of the year. Similar outcomes are observed for the other two data sets, although with differences reflecting the particularities of the data. We then used exam grades a posteriori to establish whether particular online engagement behaviours can negatively affect learner performance, and we compared our groupings against classification based on statistical features computed from the timeseries data.
Results
Unsupervised clustering reveals clusters of learners with differing online engagement
To find groups of learners with similar online engagement in an unsupervised manner, we follow the procedure summarised in Fig. 2. We first create a similarity matrix between learners using a dynamic time warping kernel. This matrix is transformed into a similarity graph using a sparsification based on the Relaxed Minimum Spanning Tree,^{21} a procedure that retains global network connectivity while discarding weak similarities that can be explained through longer chains of strong similarities. Through this process, we create a graph where the nodes are learners linked by edges weighted according to their time-course similarity. Hence, two learners that complete the tasks of the course in a similar manner will be linked by a strong edge.
The constructed similarity graph is then analysed using Markov Stability (MS), a multiscale graph partitioning algorithm that uses a Markov process to scan the graph across Markov time in order to find optimised and robust partitions of the graph at any level of resolution.^{22,23} The partitions are found by maximising a resolution-dependent cost function (the Markov Stability) at all levels of resolution, as given by the Markov time, t. We then select robust partitions in the following sense: (i) they are persistent across scales (i.e., optimal over an extended Markov time t, as given by a plateau with a low value of VI(t, t′)), and (ii) they are robust to small changes in the optimisation (i.e., consistently found as a good partition over those scales, as given by a relative dip in VI(t)). Such robust partitions identify clusters of learners that exhibit similar online temporal patterns. The definitions of the different measures and some details of the Markov Stability framework are given in Methods.
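The role of the Markov time as a resolution parameter can be illustrated with a small numerical sketch. The code assumes the widely used autocovariance form of Markov Stability for a continuous-time random walk, r(t, H) = trace(H^T [Π P(t) − π π^T] H) with P(t) = exp(−t(I − D^{-1}A)); the definition actually used in this work is the one given in Methods and may differ in detail. The function name `markov_stability`, the toy graph and the chosen Markov times are illustrative assumptions, and the optimisation over partitions is not shown.

```python
import numpy as np

def markov_stability(A, labels, t):
    """Autocovariance-type Markov Stability of a partition at Markov time t.

    Assumed form (not taken from the text):
    r = trace(H^T [Pi P(t) - pi pi^T] H), P(t) = exp(-t (I - D^{-1} A)),
    for a symmetric, weighted adjacency matrix A.
    """
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)
    pi = d / d.sum()                      # stationary distribution of the walk
    # Symmetrised operator D^{-1/2} A D^{-1/2}, so a symmetric eigensolver applies.
    S = A / np.sqrt(np.outer(d, d))
    lam, V = np.linalg.eigh(np.eye(len(d)) - S)
    exp_sym = (V * np.exp(-t * lam)) @ V.T
    # P(t) = D^{-1/2} exp(-t (I - S)) D^{1/2}
    P_t = exp_sym * np.sqrt(d)[None, :] / np.sqrt(d)[:, None]
    R = pi[:, None] * P_t - np.outer(pi, pi)
    labels = np.asarray(labels)
    H = (labels[:, None] == np.unique(labels)[None, :]).astype(float)
    return float(np.trace(H.T @ R @ H))

# Toy graph (hypothetical): two 4-cliques joined by one weak edge.
A = np.zeros((8, 8))
A[:4, :4] = 1.0
A[4:, 4:] = 1.0
np.fill_diagonal(A, 0.0)
A[3, 4] = A[4, 3] = 0.1

singletons = np.arange(8)
two_blocks = np.array([0, 0, 0, 0, 1, 1, 1, 1])
for t in (0.0, 1.0):
    print(t, markov_stability(A, singletons, t), markov_stability(A, two_blocks, t))
```

In this toy example the partition into singletons scores highest at t = 0, whereas the two-block partition scores higher at t = 1, which mirrors the statement that increasing the Markov time favours coarser partitions.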
Figure 3a summarises the results of our multiscale clustering method applied to the timeseries of task completion of six online courses by 81 learners pursuing a postgraduate part-time Management degree at Imperial College Business School over one year. See Methods for further details about the data. As the Markov time is increased, the level of resolution is decreased and the method reveals robust partitions of decreasing granularity. In Fig. 3a, we illustrate the partitions found from ten clusters down to two clusters, with a notably robust partition into six clusters. Note the quasi-hierarchical aggregation of the finer clusters into coarser ones, a feature that is intrinsic to the data and not imposed by our clustering algorithm. (For a more detailed view of the multiscale clustering structure, see Supplementary Fig. 1.) The quasi-hierarchical organisation across levels of resolution reflects the fact that subtle temporal details characterise the finer clusters, but broader similarities of the time profiles define the coarser clusters. Hence, our computational framework allows for adjustable granularity, which can be tailored to the needs of the analyst.
To exemplify the characterisation of the results in our data set, we focus mainly on the 6-cluster partition, which contains four large groups and two single learners that remain unclustered due to their highly individual sporadic behaviour. The 6-cluster partition exhibits the largest relative drop in VI(t) and a long plateau in VI(t, t′). The 10-cluster and 8-cluster partitions are equally of interest and provide a more refined clustering consistent with the 6-way partition, as seen in Fig. 3a. The coarser 2-cluster partition is also of interest: the two clusters are found to separate learners that exhibit distributed and massed learning. In the rest of the paper, we concentrate on a more detailed description of behaviours emerging from the 6-cluster partition, as it provides a nuanced, data-driven level of resolution on the data.
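The variation of information (VI) used above to assess robustness compares two partitions of the same set of learners. The following is a minimal sketch of the standard, unnormalised definition VI(A, B) = H(A) + H(B) − 2I(A; B), computed in nats; any normalisation applied in this work (see Methods) is not reproduced, and the function name is an illustrative assumption.

```python
from collections import Counter
from math import log

def variation_of_information(labels_a, labels_b):
    """Unnormalised VI (in nats) between two labelings of the same items."""
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        # Contribution to H(A|B) + H(B|A) from this pair of clusters.
        vi -= p_ab * (log(p_ab / (count_a[a] / n)) + log(p_ab / (count_b[b] / n)))
    return vi

# Identical partitions (up to relabelling) have VI = 0.
print(variation_of_information([0, 0, 1, 1], [5, 5, 7, 7]))
# Independent partitions of four items: VI = 2 log 2.
print(variation_of_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

A low VI between the partitions returned by repeated optimisations at the same Markov time, or between partitions found at neighbouring Markov times, indicates that the same grouping is recovered consistently.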
Characterisation of the clusters of online learners
As shown in Fig. 3b, the 6-cluster partition is robust, and the data-driven groupings it provides have an appropriate level of resolution to gain meaningful insight into the observed patterns of online learners. Two of the clusters contain only one learner, with highly individual and sporadic behaviour. For each of the other four clusters, we use Gaussian Process Regression (GPR)^{28} to compute the average engagement trajectory of the group of learners, and compare it with the average GPR trajectory for the whole set of 81 learners. The computed GPRs allow us to quantify statistically the differences in the temporal patterns of the different clusters using Bayes factors of the processes. In particular, we found that the trajectories of each cluster are statistically more likely to derive from separate processes defined within their own cluster, as follows. A GPR was fitted to the entire set of trajectories and the log-likelihood of the entire set of trajectories was calculated. Equally, the log-likelihood of each separate cluster of trajectories from that same Gaussian Process was calculated. The Bayes factor, calculated as the sum of log-likelihoods of each separate cluster minus the log-likelihood of the entire set of trajectories^{12} was found to be large (K = 3.37 × 10^{10}). This indicates that the behaviours of each cluster are statistically different from each other and are derived from different behavioural processes. This computation was repeated for the differences between each pair of neighbouring clusters. The Bayes factors were: K = 0.38 × 10^{10} between the ‘early birds’ and ‘on time’ clusters; K = 1.52 × 10^{10} between the ‘on time’ and ‘low engager’ clusters; and K = 0.17 × 10^{10} between the ‘low engager’ and ‘crammers’ clusters. These numbers provide statistical evidence of the differences between the obtained clusters.
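The comparison of log-likelihoods described above can be sketched with a toy Gaussian Process model. The code below is a simplified illustration under stated assumptions: a zero-mean GP with a squared-exponential kernel and fixed, hand-picked hyperparameters (no fitting is performed), with each cluster's trajectories pooled into one set of observations. The function names, the hyperparameter values and the toy data are hypothetical and do not reproduce the GPR procedure or the values reported here.

```python
import numpy as np

def gp_log_marginal_likelihood(t, y, length_scale=1.0, signal_var=1.0, noise_var=0.1):
    """Log marginal likelihood of y observed at times t under a zero-mean GP
    with a squared-exponential kernel (fixed hyperparameters: an assumption)."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    K = signal_var * np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / length_scale ** 2)
    K += noise_var * np.eye(len(t))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # -1/2 y^T K^{-1} y - 1/2 log|K| - n/2 log(2 pi), with 1/2 log|K| = sum(log L_ii)
    return float(-0.5 * y @ alpha - np.log(np.diag(L)).sum()
                 - 0.5 * len(t) * np.log(2.0 * np.pi))

def log_bayes_factor(clusters):
    """Sum of per-cluster log-likelihoods minus the log-likelihood of the pooled data.

    clusters: list of (t, y) pairs, one pair of arrays per cluster.
    """
    separate = sum(gp_log_marginal_likelihood(t, y) for t, y in clusters)
    t_all = np.concatenate([t for t, _ in clusters])
    y_all = np.concatenate([y for _, y in clusters])
    return separate - gp_log_marginal_likelihood(t_all, y_all)

# Toy data (hypothetical): two clusters observed at the same eight time points.
t = np.linspace(0.0, 3.0, 8)
distinct = [(t, np.ones(8)), (t, -np.ones(8))]
similar = [(t, np.ones(8)), (t, np.ones(8))]
print(log_bayes_factor(distinct), log_bayes_factor(similar))
```

A large positive value favours modelling the clusters as separate processes; in the toy data, the clusters with opposite trajectories yield a much larger value than the clusters with identical trajectories.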
Each of the clusters in this partition has been given a descriptive title that encapsulates the group behaviour. The learners in the ‘Early Birds’ group (green cluster) generally exhibit a highly sequential and ordered approach to their learning and tend to complete their tasks earlier than the cohort average with a systematic 1–2 week advance offset. The behaviour of learners in the ‘On time’ group (cyan cluster) is similar to the ‘Early Birds’, except that they finish tasks closer to the average. Hence both the green and cyan groups present a similar ‘distributed learning’ behaviour only distinguished by a slight delay, which explains why both groups are agglomerated into a single cluster in the coarser 2-way partition (Fig. 3a). The learners in the ‘Low Engagers’ (orange) cluster also exhibit a relatively distributed workflow (similar to the cyan and green clusters) but with less anticipation in the second half of the year (and especially in the third term). Furthermore, this group had a high number of tasks that were never completed. The ‘Crammers’ cluster (magenta) contained learners that exhibited massed learning (indicated by the presence of plateaux in their timeseries, suggesting tasks being completed in a short period of time), low task completion and an ordering of task completion that deviates from the proposed course sequence. Finally, the outliers (learners 43 and 46), which form their own clusters, exhibit highly sporadic learning behaviours, with tasks completed at later dates without following sequentially the layout order of the course.
To further characterise our results, we computed standard timeseries metrics for each learner. Figure 4 shows the graph of learners coloured according to two such statistical metrics derived directly from the timeseries: the mean massed session length (commonly known as binge learning), and the percentage of completed tasks. Figure 4a shows the mean massed session length, i.e., the length of the plateaux in the number of tasks over time calculated via an isotonic regression (see Methods). This measure captures events where a learner has completed a large number of tasks within a short time frame. We find that the ‘Crammers’ cluster has a higher mean massed session length. Figure 4b shows the graph of learners coloured according to the percentage of tasks completed relative to the total number of available tasks. In general, the ‘Crammers’ cluster shows the lowest mean task completion (66%), followed by a completion ratio of 80% in the ‘Low Engagers’ group, and higher mean task completion rates in the ‘On time’ (86%) and ‘Early Birds’ (90%) clusters.
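The two metrics can be sketched as follows. The precise definition of the massed session length is given in Methods and is not reproduced here; this sketch assumes that a learner's series is the list of completion times in course order, that a massed session is a plateau (a run of two or more equal values) in the isotonic, non-decreasing fit to that series, and that its length is counted in tasks. The isotonic fit uses the pool-adjacent-violators algorithm; all function names and the example data are hypothetical.

```python
def isotonic_fit(values):
    """Pool-adjacent-violators: least-squares non-decreasing fit to a sequence."""
    blocks = []  # each block stores [sum of values, number of values]
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while the previous block mean exceeds the current one.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

def mean_massed_session_length(completion_times):
    """Mean length (in tasks) of plateaux in the isotonic fit; 0.0 if there are none."""
    fitted = isotonic_fit(completion_times)
    runs, length = [], 1
    for prev, cur in zip(fitted, fitted[1:]):
        if cur == prev:
            length += 1
        else:
            if length > 1:
                runs.append(length)
            length = 1
    if length > 1:
        runs.append(length)
    return sum(runs) / len(runs) if runs else 0.0

def completion_percentage(completion_times):
    """Percentage of tasks completed; None marks a task that was never completed."""
    done = sum(1 for t in completion_times if t is not None)
    return 100.0 * done / len(completion_times)

# Hypothetical learner: completion day of each task in course order.
days = [1, 3, 20, 20, 20, 19, 25, None]
completed = [d for d in days if d is not None]
print(mean_massed_session_length(completed))  # one plateau spanning four tasks
print(completion_percentage(days))
```

In the example, tasks three to six are completed within two days of each other and out of order, so the isotonic fit merges them into a single plateau of four tasks.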
Cluster analysis identifies groups of learners at risk of low performance
We have also carried out an a posteriori evaluation of our behavioural clusters with respect to the performance of the learners. Figure 5a shows the mapping of the final average marks on the learner graph, where we have also highlighted high-performing (>70%, top 15%, ‘Distinction’) and low-performing (<60%, bottom 7.5%, below ‘Merit’) learners. Figure 5b shows that 6 out of 7 low-performance learners lie in the ‘Crammers’ cluster associated with massed learning and reduced task completion. There was a specific learner (77, cyan cluster) who attained a low grade and yet did not exhibit timeseries behaviours indicative of a low performance. The high performers tend to be distributed across all other clusters, suggesting that the learning behaviours of a high performer are not as critical to their success. Still, 9 out of the 13 high-performing learners are found in the ‘Early Birds’ or ‘On time’ clusters characterised by a sequential approach to their learning with minimal massed learning sessions. The sporadic learners in single clusters (43 and 46) did not attain either a low performance or a distinctly high one.
Although our method captures information congruent with timeseries statistical metrics (e.g., those shown in Fig. 4 related to massed learning and task completion rates), the data-driven clusters we obtain encompass global timeseries information beyond such predetermined standard statistical measures. To test this idea, we compared the results of our data-driven clusters to standard classification methods from Machine Learning based on statistical features. Figure 5c illustrates the classification map obtained by training two common machine-learning algorithms using the two statistical features in Fig. 4. The first learning algorithm is a support vector machine (SVM) using a radial basis function kernel and the second is a decision tree with a depth of 4 branches^{29} (see Methods). For both methods, we find that the accuracy of learner classification against performance is low: only 3–4 out of 7 low-performance learners were accurately predicted. This result suggests that using a finite set of predetermined timeseries features reduces the information available to differentiate the necessary behaviours relevant to performance. In contrast, our graph construction and clustering methodology utilises the full content of the timeseries (including attributes that are not evident from inspection of particular statistical metrics), thus providing a more comprehensive grouping of learners with similar temporal behaviours.
Testing the methodology on two additional data sets
We have applied the methodology to analyse task completion timeseries data from a second cohort of 46 learners taking the online management course at Imperial College Business School. The results we obtain are similar, as shown in the multiscale clustering presented in Supplementary Fig. 2 and the detailed analysis of the 6-cluster partition in Fig. 6a. In this case, we identified a robust 9-cluster partition (with four major clusters and five single-learner clusters) and a robust 6-cluster partition (with three major clusters and three single outliers). The major clusters in the 6-way partition (shown in Fig. 6a) showed similar behaviours to those observed in the first cohort we analysed. In particular, the green cluster in Fig. 6a corresponds to the ‘Early Birds’ and ‘On time’ groups in Fig. 3, whereas the blue cluster in Fig. 6a is similar to the task-skipping ‘Low Engagers’ group in Fig. 3, and the purple cluster in Fig. 6a exhibits similar traits to the ‘Crammers’ cluster in Fig. 3. Within this 6-cluster partition, we found that of the 8 low-performance learners, 4/8 were located in the massed learning cluster, 2/8 were sporadic outliers, and 1/8 was in the low engagement cluster. Only 1/8 was located in the distributed learning cluster. Moreover, comparing against the standard classification procedures shown in Supplementary Fig. 3, we found that our methodology was superior at grouping learners with similar performance. These findings highlight the consistency of the methodology across the cohorts, while remaining attuned to the particularities of the data.
The types of temporal engagement data collected from learners will differ across educators or institutions depending on the particularities of the Learning Management System. To test the methodology on a different kind of data, we have studied a set of 100 learners undertaking an anonymised course within the Open University (OULAD data set^{30}). The OULAD data set differs from our data set in several ways: (i) the timestamp data in OULAD corresponds to page clicks and not necessarily to task completion; (ii) the time stamps were coarse-grained to days; (iii) pages could be revisited. The results of applying our methodology to the OULAD data set in Fig. 6b (and Supplementary Fig. 4 of the Supplementary Information) show that the multiscale clustering is robust to the sparsification implicit in the graph creation step. A robust 3-way partition is consistently found in our analysis, with two major clusters and a minor cluster of outliers. The two major clusters corresponded to a separation of learners who exhibited higher massed learning and lower task engagement versus learners with distributed learning. We found that 6/7 of the low-performance learners (<60%) were located in the cluster associated with massed learning, while one low-performance learner was located in the minor outlier cluster and none were in the distributed learning group.
Discussion
We have described an approach for the analysis of temporal data of online learning behaviours, in which distinct clusters of learners are obtained algorithmically without using a priori statistical information about individual behaviours or about the number or type of expected behaviours across the cohort. The mathematical framework is general, and can be applied broadly to any timeseries data in physical or social sciences to identify distinct temporal behaviours. In the context of learning analytics, we showcased the method through three data sets of online learner activity of different types and origins.
Our method uses a dynamic time warping similarity kernel to generate a sparsified similarity graph between learners, to which we apply a multiscale graph partitioning algorithm in order to find optimised and robust clusters of learners with similar temporal behaviours at any level of resolution in an unsupervised manner. As our method uses the full timeseries, it inherently encompasses richer temporal information than standard methods based on selecting statistical features of the timeseries.
In the data sets analysed here, we obtained a quasi-hierarchy of robust partitions, from finer to coarser, which provide different levels of information, as required by the analyst. For instance, in our main case study in Fig. 3, we found robust partitions into 10, 6 and 2 clusters. The 6-way partition consists of four large learner clusters (‘Early Birds’, ‘On time’, ‘Low Engagers’ and ‘Crammers’) and two single unclustered learners (‘Sporadic outliers’), which were shown to be statistically different from each other according to the GPR Bayes factor (12). A posteriori comparison with learner performance indicates good correspondence with the obtained clustering: low performers are generally located (6 out of 7) in the ‘Crammers’ cluster (associated with massed learning and low task completion) and are generally absent from the ‘Early Birds’ and ‘On time’ clusters (associated with distributed learning and high task completion). On the other hand, high performers are distributed across several clusters, albeit with higher prevalence in the clusters associated with distributed learning. These results provide an improved characterisation as compared to common machine-learning classification algorithms trained on two statistical measures from the timeseries. The analysis could be enhanced by the use of finer partitions (e.g., the 10-way partition has clusters with over-representation of low performers (purple cluster, hypergeometric p-value = 0.00023) and high performers (charcoal cluster, hypergeometric p-value = 0.026), as seen in Supplementary Fig. 1). Similar general behaviours and classifications are obtained for the two additional data sets presented in Fig. 6 and the Supplementary Information.
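The hypergeometric p-values quoted above measure how surprising it is to find a given number of low (or high) performers inside one cluster if cluster membership were unrelated to performance. The following is a minimal sketch of such an over-representation test using only the standard library. The function name is illustrative, and the cluster size of 20 in the usage example is a hypothetical value chosen for illustration, since cluster sizes are not restated here; the result therefore does not reproduce the reported p-values.

```python
from math import comb

def hypergeom_enrichment_pvalue(population, successes, draws, observed):
    """P(X >= observed) for a hypergeometric variable X: the number of marked
    items among `draws` items sampled without replacement from a population
    of size `population` that contains `successes` marked items."""
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(comb(successes, k) * comb(population - successes, draws - k)
               for k in range(observed, upper + 1))
    return tail / total

# 81 learners, 7 low performers, 6 of them inside one cluster.
# The cluster size (20) is a hypothetical value used only for illustration.
print(hypergeom_enrichment_pvalue(population=81, successes=7, draws=20, observed=6))
```

A small p-value indicates that the concentration of low performers in the cluster is unlikely to arise from a random allocation of learners to clusters of that size.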
The fact that low performers tend to concentrate in the massed learning cluster and be absent from the distributed learning clusters is in agreement with previous studies, which found that learners that ‘crammed’ retained less information when tested at a later date,^{31} and provides support for the risks associated with this behaviour. On the other hand, the fact that high performers are distributed across several clusters, albeit with higher prevalence in the clusters with high task completion and distributed learning, suggests they follow a host of diverse learning patterns, in agreement with a latent class model that suggested that the ‘spacing effect’ is less prominent for high performers.^{32} These observations were found to be consistent when testing our methodology on a second cohort of learners within the same institution and online degree, and broadly in agreement with a different type of data (‘page clicks’) from a set of online learners at the Open University, where we found a strong distinction between a low performing ‘massed learning’ cluster vs. a ‘distributed learning’ cluster.
Clearly, temporal behaviours do not fully account for learner performance, and this methodology is not intended as a diagnostic tool, but rather as a method to explore and identify learner engagement behaviours for the purposes of aid, intervention and course design. Combining the temporal analysis introduced here with established ‘early warning system’ analyses^{33} could aid in such tasks. Although educators might encourage learners to pursue a distributed study behaviour, our results suggest a nuanced approach for high performers, with flexibility provided in course design so that high-performing learners may pursue the study strategies they personally find effective.
Future work within different learning contexts, coupled with additional dependent variables of interest (e.g., learner satisfaction, career success, interruption and withdrawal rates), could be important to provide broader support for the initial results reported here. We remark that the methodology is scalable to larger data sets through adjustments of the computation of both the DTW kernel and the Markov Stability cost function (see Methods). Further improvements of the similarity kernel using constrained DTW^{34} and end point invariance^{35} could also be used to improve the sensitivity and accuracy of the method in representing the different temporal behaviours. Together with the study of how online behaviour changes over time for each learner, these directions constitute areas for further research.
Methods
The methods section describes the data and unsupervised mathematical pipeline used to analyse the trajectories of learners. The research was performed without any a priori knowledge or allocation of the learners, making it similar to a blind investigation.
Temporal data
The main case study of this research was based on task completion data from 81 post-experience learners pursuing a postgraduate part-time management degree at Imperial College London. These learners formed part of a cohort of 87 learners. Data from the remaining six learners was not included here as these learners either interrupted their studies or withdrew from the programme. Subjects ranged in age from 28 to 53 years old, with a gender balance of 57 males to 24 females, and they resided in 18 geographically disparate countries. The data corresponds to interactions with six online courses, which together comprised the first academic year of the 2-year degree programme. Although the subjects met face-to-face at the start of each academic year, the six courses were studied completely online. Subjects proceeded in a lockstep manner through the academic year, which was split into three 10-week terms each containing two of the six courses. The anticipated study load was 5 to 7 h per week for each course, so 10 to 12 h in total. The courses were assessed via a combination of coursework and exam. However, participation in these separate assessed activities was not included in the data set analysed here; only the learners' final 2-year grade was used as an indication of their performance.
To highlight the applicability of the method, we also applied the analysis to two additional data sets: (i) timestamped task completion series from a second cohort of 46 post-experience learners pursuing the postgraduate part-time management degree at Imperial College London; (ii) timestamped data of ‘page clicks’ (not equal to task completion) from 100 learners undertaking Open University courses (OULAD data set^{30}). For further details on these data sets, see the Supplementary Information.
Ethical approval from the Education Ethics Review Process (EERP) at Imperial College London was attained (EERP 1718032b) and a waiver for informed consent was granted for this study.
Construction of the learner similarity graph using a dynamic time warping kernel and RMST sparsification
Creating a similarity matrix between learners using dynamic time warping
To compute the similarity between the task completion time traces of every two learners i and j, we use a similarity kernel, i.e., a generalised inner product. Common approaches for sequence analysis use L_{p} norms (when p = 2 we obtain the Euclidean norm), which are fast to compute and easy to index. However, their one-to-one matching often ignores sequential patterns that are nonlinearly misaligned. Instead, our approach uses a dynamic time warping (DTW) kernel, which provides an elastic matching of two time sequences incorporating both the sequential ordering of the trajectory and the absolute values of time.^{17} The DTW similarity kernel is defined as:
where D_{l} denotes the DTW distance. The distance D_{l} is calculated by constructing an n × m matrix where n and m are the lengths of the two vectors we wish to compare. Using the pairwise cost cost(x_{i}, y_{j}) = x_{i} − y_{i}^{2}, we minimise the overall cost over the path from (i, j) = (1, 1) to (i, j) = (n, m) where each cell (i, j) along the path contributes cost(x_{i}, y_{j}) to the cumulative cost (summed over the path). This method is able to implicitly stretch both sequences to get a single dynamic time warping match between the two vectors, i.e., we find the cost required to match the two timeseries trajectories for each learner. The higher the cost, the higher distance in Hilbert space, and therefore the lower similarity between learners.
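As an illustration of the cumulative-cost recursion described above, the following is a minimal sketch rather than the implementation used in this study. The function names `dtw_distance` and `similarity_matrix`, the parameter `sigma`, and in particular the exponential form used to turn a distance into a similarity are assumptions made for this example only.

```python
import numpy as np


def dtw_distance(x, y):
    """Minimal cumulative DTW cost between two 1-D sequences (squared-difference cost)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    # acc[i, j] is the minimal cumulative cost of aligning x[:i] with y[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # Allowed moves: stretch x, stretch y, or advance both sequences.
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]


def similarity_matrix(series, sigma=1.0):
    """N x N similarity matrix from pairwise DTW distances.

    ASSUMPTION: the map exp(-D / sigma^2) is a hypothetical choice for this
    sketch; it is not taken from the source.
    """
    n = len(series)
    A = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = dtw_distance(series[i], series[j])
            A[i, j] = A[j, i] = np.exp(-d / sigma**2)
    return A
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero because the second sequence is a stretched copy of the first, whereas a pointwise Euclidean comparison is not even defined for sequences of different length.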
For N learners we produce an N × N similarity matrix A where each element A_{ij} is the DTW similarity (1) between learners i and j. For longer time series and for larger numbers of learners N, for which the DTW calculations may become computationally expensive, dimensionality reduction methods can be used to speed up the similarity calculations,^{36} or segmented dynamic time warping algorithms with speeds comparable to Euclidean distances can be used.^{37}
Creating a similarity graph using RMST sparsification
The similarity matrix A can be thought of as the adjacency matrix of a fully connected, weighted graph, where every learner is connected to every other learner in the network with a strength given by their pairwise similarity. The high redundancy present in this full similarity matrix both increases the computation time and reduces the effectiveness of many clustering algorithms. We therefore sparsify the similarity matrix to produce a similarity graph by reducing the number of edges present. To do this, we employ a pruning algorithm (the Relaxed Minimum Spanning Tree, or RMST), which is based on geometric graph heuristics and preserves edges according to both their strength and their relevance to long paths within the graph. RMST has been shown to balance the local and global structure of data sets and performs well under multiscale graph clustering methods.^{21,38} Supplementary Fig. 4 shows that the community structure is relatively stable when the sparsification parameter of RMST is varied.
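The sparsification step can be sketched as follows. This is a hedged illustration based on our reading of the RMST criterion in refs. 21 and 38, not the code used for the analysis: an edge (i, j) is kept if its distance is smaller than the largest edge on the minimum-spanning-tree path between i and j plus a local term scaled by a parameter. The function name `rmst_mask`, the parameters `gamma` and `k`, and the use of a distance matrix as input (the similarity would first have to be converted to a distance) are assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree


def rmst_mask(dist, gamma=0.5, k=1):
    """Boolean N x N mask of edges kept by an RMST-style rule.

    dist: symmetric matrix of strictly positive off-diagonal distances
    (SciPy treats zero entries as missing edges).
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    mst = minimum_spanning_tree(dist).toarray()
    mst = np.maximum(mst, mst.T)

    # mlink[i, j]: largest edge weight on the MST path between i and j.
    # Adding MST edges in increasing order, the edge that first joins two
    # components is the largest edge on every path between them.
    rows, cols = np.nonzero(np.triu(mst))
    order = np.argsort(mst[rows, cols])
    mlink = np.zeros((n, n))
    members = [{i} for i in range(n)]
    label = list(range(n))
    for e in order:
        a, b = int(rows[e]), int(cols[e])
        ca, cb = label[a], label[b]
        for u in members[ca]:
            for v in members[cb]:
                mlink[u, v] = mlink[v, u] = mst[a, b]
        for v in members[cb]:
            label[v] = ca
        members[ca] |= members[cb]
        members[cb] = set()

    # Distance from each node to its k-th nearest neighbour (column 0 is the node itself).
    dk = np.sort(dist, axis=1)[:, k]
    keep = dist < mlink + gamma * (dk[:, None] + dk[None, :])
    keep |= mst > 0  # the MST itself is always retained
    np.fill_diagonal(keep, False)
    return keep
```

With `gamma = 0` the rule returns the minimum spanning tree alone, and larger values of `gamma` progressively restore edges of the full graph, which is the sense in which the parameter controls the sparsification.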
Visualisations and layouts of the similarity graphs for the different data sets were produced using Gephi with the Force Atlas setting.^{39}
Finding clusters of learners using Markov Stability graph partitioning
Community detection methods for graphs aim to partition the nodes of a graph into subgraphs (communities) that are well connected within themselves and weakly connected to each other. There are multiple ways to define communities, and many methods and criteria to score the resulting partitions.^{40} Such methods are also related to graph partitioning problems.
Markov Stability (MS) is a generalised method for identifying communities in graphs at all scales. MS employs a random walk on the graph to define a time-dependent cost function that measures the probability that a random walker is contained within a subgraph over a time scale t. If the random walker becomes trapped in particular subgraphs over that time scale, this identifies a good partition. As the time scale of the Markov process increases, the method identifies larger subgraphs, leading to coarser partitions. Hence MS can identify intrinsically relevant communities at all scales by using the dynamic scanning provided by the diffusive process. For a detailed description of the method, see refs. 22, 23.
The random walk is governed by the N × N transition matrix Q = D^{−1} A, where N is the number of nodes in the graph, A is the adjacency matrix, and D = diag(A1) is the degree matrix, where 1 is a vector of ones. Q defines the probability of the random walk transitioning from node i to node j, as given by the discrete-time process:
\[p_{t+1} = p_{t} D^{-1} A \equiv p_{t} Q, \qquad (2)\]
where p_{t} is a 1 × N node vector describing the probability of the random walker to be at each node at time t. An associated continuous-time diffusive process in terms of the graph combinatorial Laplacian L = D − A has the time-dependent solution:
The time t is denoted the Markov time and is distinct from any real time. Markov time can be understood as a dimensionless quantity related to the diffusive process, which acts as a resolution parameter in that it allows for the exploration of the graph at different scales: as the Markov time increases, the partitions become coarser.
A partition of the graph into c communities is encoded into an N × c membership matrix H establishing the correspondence between the nodes and the clusters:
\[H_{i\alpha} = \begin{cases} 1 & \text{if node } i \text{ belongs to community } \alpha, \\ 0 & \text{otherwise.} \end{cases} \qquad (4)\]
The goodness of the partition encoded by H at time t under the dynamics governed by L is defined in terms of the c × c block autocovariance matrix:
where π is the stationary solution of (3) and Π = diag(π). The meaning of this matrix is clear: the element [R(t, H)]_{αβ} encodes the probability that a random walker starting in community α will be at community β after time t, and the diagonal elements, [R(t, H)]_{αα}, indicate the probability of remaining contained in community α over time scale t. Hence a good partition H will maximise the sum of the diagonal elements, i.e., the trace of R(t, H). This leads us to our definition of the cost function, the Markov Stability of the partition:
\[r(t, H) = \mathrm{trace}\, R(t, H), \qquad (6)\]
which is to be maximised at every time t by searching in the space of partitions H:
\[\max_{H}\; r(t, H). \qquad (7)\]
Because the optimisation (7) is non-convex and NP-hard, we use an efficient greedy algorithm known as the Louvain algorithm,^{41} which has been shown to perform well in practice and against benchmarks. Given its greedy nature, the optimised partition found by Louvain is not always the same, as it depends on the initialisation of the optimisation algorithm. Therefore, we repeat the optimisation \(\ell = 100\) times using different starting points for the algorithm. For each Markov time we thus obtain 100 optimised partitions \(H_i^ \ast (t)\) and we pick the one with maximal Markov Stability (6) in the set as the optimal partition at t:
\[H^{\ast}(t) = \mathop{\mathrm{arg\,max}}_{H_i^{\ast}(t),\; i = 1, \ldots, \ell}\; r(t, H_i^{\ast}(t)). \qquad (8)\]
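To make the cost function concrete, the sketch below evaluates the Markov Stability of a given partition and selects the best candidate from a set of partitions. It is illustrative only and does not reproduce the Louvain optimisation or the released PartitionStability code. It assumes continuous-time random-walk dynamics with transition matrix exp(−t D^{−1} L) and stationary distribution proportional to node degree; this choice of dynamics, and the names `markov_stability` and `best_partition`, are assumptions of the example.

```python
import numpy as np
from scipy.linalg import expm


def markov_stability(A, labels, t):
    """Trace of the clustered autocovariance matrix for one partition at Markov time t."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    degree = A.sum(axis=1)
    pi = degree / degree.sum()                         # stationary distribution (assumed dynamics)
    L_rw = np.eye(len(degree)) - A / degree[:, None]   # D^{-1} L with L = D - A
    P_t = expm(-t * L_rw)                              # transition matrix over Markov time t
    communities = np.unique(labels)
    H = (labels[:, None] == communities[None, :]).astype(float)  # N x c membership matrix
    R = H.T @ (np.diag(pi) @ P_t - np.outer(pi, pi)) @ H
    return float(np.trace(R))


def best_partition(A, candidates, t):
    """Pick the candidate partition with maximal Markov Stability at time t."""
    scores = [markov_stability(A, labels, t) for labels in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]
```

In practice the candidates would be the partitions returned by repeated runs of the optimiser at a fixed Markov time; here they are simply passed in as label vectors.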
To identify the important partitions across time, we use the following two robustness criteria:^{23}
Consistency of the optimised partition
A relevant partition should be a robust outcome of the optimisation, i.e., the ensemble of \(\ell\) optimised solutions should be similar. To assess this consistency, we employ an information-theoretical distance between partitions: the normalised variation of information between two partitions H and H′, defined as:^{42}
\[\mathrm{VI}(H, H^{\prime}) = \frac{2\,\Omega(H, H^{\prime}) - \Omega(H) - \Omega(H^{\prime})}{\log N}, \qquad (9)\]
where \({\mathrm{\Omega }}(H) = - \mathop {\sum}\nolimits_{\cal{C}} p ({\cal{C}}){\mathrm{log}}\,p({\cal{C}})\) is a Shannon entropy, with \(p({\cal{C}})\) given by the relative frequency of finding a node in community \({\cal{C}}\) in the partition H, and Ω(H, H′) is the Shannon entropy of the joint probability. The variation of information VI(H, H′) ∈ [0, 1] is a true metric distance between two partitions based on information theory, and VI(H, H′) = 0 indicates that two partitions are identical.
A measure of the robustness to the optimisation, at a given Markov time t, is given by the average variation of information of the ensemble of solutions obtained from the \(\ell\) Louvain runs:
\[\mathrm{VI}(t) = \frac{1}{\ell(\ell - 1)} \sum_{i \neq j} \mathrm{VI}\left(H_i^{\ast}(t), H_j^{\ast}(t)\right). \qquad (10)\]
If all runs of the optimisation return similar partitions, then VI(t) is small, indicating robustness of the partition to the optimisation. Hence, we select partitions with low values (or dips) of VI(t).
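The consistency measure can be sketched as below. This is an illustrative implementation under the assumption that the variation of information is normalised by log N, with N the number of nodes; the function names `normalised_vi` and `mean_vi` are hypothetical, and partitions are represented as integer label vectors.

```python
from itertools import combinations

import numpy as np


def _entropy(counts):
    """Shannon entropy of a set of cluster sizes."""
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))


def normalised_vi(a, b):
    """Variation of information between two label vectors, divided by log N (assumed)."""
    a = np.asarray(a)
    b = np.asarray(b)
    _, counts_a = np.unique(a, return_counts=True)
    _, counts_b = np.unique(b, return_counts=True)
    # Joint distribution: how many nodes share each (label in a, label in b) pair.
    _, counts_ab = np.unique(np.stack([a, b], axis=1), axis=0, return_counts=True)
    vi = 2 * _entropy(counts_ab) - _entropy(counts_a) - _entropy(counts_b)
    return vi / np.log(len(a))


def mean_vi(partitions):
    """Average pairwise normalised VI over an ensemble of partitions."""
    pairs = combinations(partitions, 2)
    return float(np.mean([normalised_vi(a, b) for a, b in pairs]))
```

Because the measure is symmetric, averaging over unordered pairs gives the same value as averaging over all ordered pairs of distinct partitions.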
Persistence of the partition across levels of resolution
Relevant partitions should also be optimal across stretches of Markov time. Such persistence is indicated both by a plateau in the number of communities over t and by a low-value plateau of the cross-time variation of information:
\[\mathrm{VI}(t, t^{\prime}) = \mathrm{VI}\left(H^{\ast}(t), H^{\ast}(t^{\prime})\right). \qquad (11)\]
This provides a second measure of robustness of a partition across resolution scales, and is commonly visualised via a heatmap where blocks along the diagonal indicate partitions that are persistent. Within a time block of persistent partitions we choose the most robust partition, i.e., the one with the lowest VI(t).
The Markov Stability code is available at github.com/michaelschaub/PartitionStability. When the computation of the matrix exponential in (5) becomes costly for moderately large N, the linearisation of e^{−tL} provides an efficient approximate method to analyse very large graphs within the same framework.
Isotonic regression
An isotonic regression is a model that identifies the optimal least squares fit to a data set given the constraint that the model must be a non-decreasing function. The optimisation is:
where x_{i} must be greater than or equal to x_{i−1}, i.e., x_{0} ≤ x_{1} ≤ ... ≤ x_{n}. The algorithm looks for violations of monotonicity and adjusts the estimate to fit within the constraints.
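The violation-and-adjustment procedure described above corresponds to the pool-adjacent-violators idea, which can be sketched in a few lines. This is a minimal unweighted version written for illustration, with the hypothetical name `isotonic_fit`; it is not the routine used in the analysis.

```python
def isotonic_fit(y):
    """Least-squares non-decreasing fit to y by pooling adjacent violators (unweighted)."""
    blocks = []  # each block is [sum of values, number of values]
    for value in y:
        blocks.append([float(value), 1])
        # While the previous block's mean exceeds the last block's mean,
        # monotonicity is violated: merge the two blocks and re-check.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)
    return fit
```

For the input `[1, 3, 2, 4]`, the pair (3, 2) violates monotonicity and is replaced by its mean, giving `[1.0, 2.5, 2.5, 4.0]`.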
Gaussian process regression
The Gaussian process regression (GPR) was implemented using the sklearn Python package. The implementation is based on Algorithm 2.1 of Gaussian processes for machine learning (GPML) by Rasmussen and Nickisch.^{28}
A GPR model can be thought of as defining a distribution over functions, with inference undertaken directly in the space of functions. As such, a mean and variance that model the data can be calculated. Given that the GPR is probabilistic, we can calculate the log-likelihood of any set of trajectories being derived from a GPR optimised on another set of trajectories. Bayes factors are a method of Bayesian model comparison, which quantify the support for one model over another. The Bayes factor K for two models M_{1} and M_{2} given some data D is:
\[K = \frac{\Pr(D \mid M_{1})}{\Pr(D \mid M_{2})}.\]
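A hedged numerical sketch of this comparison is given below. It assumes that each fitted model supplies a predictive mean and standard deviation at the observation points, and it treats the observations as independent Gaussians, which ignores the covariance between points that a full GPR likelihood would include. The function names and the `(mean, std)` tuple format are assumptions of the example.

```python
import numpy as np


def gaussian_log_likelihood(y, mean, std):
    """Log-likelihood of y under independent Gaussians (simplifying assumption)."""
    y, mean, std = (np.asarray(v, dtype=float) for v in (y, mean, std))
    return float(np.sum(-0.5 * np.log(2 * np.pi * std**2) - 0.5 * ((y - mean) / std) ** 2))


def log_bayes_factor(y, model_1, model_2):
    """log K = log Pr(D | M1) - log Pr(D | M2); each model is a (mean, std) pair."""
    return gaussian_log_likelihood(y, *model_1) - gaussian_log_likelihood(y, *model_2)
```

A positive value of `log_bayes_factor` indicates that the data are better supported by the first model, and a negative value favours the second.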
Additional classification algorithms
To classify learners into high-, medium- and low-performance groups, we used an SVM and a Decision Tree. Both algorithms are commonly used in classification tasks and were implemented using the scikit-learn Python package.^{29}

An SVM acts as a non-probabilistic binary linear classifier that attempts to find a hyperplane in a high- or infinite-dimensional space that maximises the distances between data points of differing classes. We implemented the SVM with the radial basis function kernel.

The Decision Tree attempts to find optimal branches (decisions) that represent conjunctions of features that lead to accurate prediction of class labels. We implemented a Decision Tree with a depth of four branches; increasing the number of branches did not improve the classification accuracy.
Instead of using regression analysis between continuous dependent variables (performance) and independent variables (temporal features), we implemented classification algorithms to provide a closer comparison to our clustering results.
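The two classifiers can be set up in scikit-learn along the following lines. This is a sketch on synthetic placeholder data, not the learner data or the exact configuration used in the study: the feature matrix, the labels, the five-fold cross-validation and all hyperparameters other than the RBF kernel and the tree depth of four are assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for temporal features (rows: learners, columns: features)
# and for high/medium/low performance labels encoded as 2/1/0.
y = np.repeat([0, 1, 2], 30)
X = rng.normal(size=(90, 3))
X[:, 0] += y  # make the classes partially separable

models = {
    "svm_rbf": SVC(kernel="rbf"),
    "tree_depth4": DecisionTreeClassifier(max_depth=4, random_state=0),
}

# Mean cross-validated accuracy for each classifier.
scores = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in models.items()}
```

The resulting accuracies can then be compared with how well cluster membership alone separates the performance groups.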
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
To maintain anonymity of the learners that took part in this study we have not released the data.
Code availability
In accordance with the code policy at Science of Learning, we have provided links to the functions required for the mathematical framework detailed in this manuscript: • Clustering algorithm (Markov Stability): https://wwwf.imperial.ac.uk/mpbara/Partition_Stability/ and https://github.com/michaelschaub/PartitionStability. • Dynamic time warping: https://github.com/pierrerouanet/dtw.
References
 1.
van Bruggen, J. Theory and practice of online learning. Br. J. Educ. Technol. 36, 111–120 (2005).
 2.
Lodge, J. M. & Corrin, L. What data and analytics can and do say about effective learning. npj Sci. Learn. 2, 5 (2017).
 3.
Mahzoon, M. J., Maher, M. L., Eltayeby, O. & Dou, W. A sequence data model for analyzing temporal patterns of student data. J. Learn. Anal. 5, 55–74 (2018).
 4.
Papamitsiou, Z. & Economides, A. A. Temporal learning analytics for adaptive assessment. J. Learn. Anal. 1, 165–168 (2014).
 5.
Ye, C. & Biswas, G. Early prediction of student dropout and performance in MOOCs using higher granularity temporal information. J. Learn. Anal. 1, 169–172 (2014).
 6.
Ye, C. et al. Behavior prediction in MOOCs using higher granularity temporal information. In Proc Second ACM Conference on Learning @ Scale  L@S ’15, 335–338 (ACM, New York, NY, 2015).
 7.
Taylor, C., Veeramachaneni, K. & O’Reilly, U. Likely to stop? Predicting stopout in massive open online courses. Preprint at http://arxiv.org/abs/1408.3382 (2014).
 8.
Jiang, S., Williams, A. E., Schenke, K., Warschauer, M. & Dowd, D. O. Predicting MOOC performance with week 1 behavior. In Proc 7th International Conference on Educational Data Mining, 273–275 (EDM, 2014).
 9.
Antonenko, P. D., Toy, S. & Niederhauser, D. S. Using cluster analysis for data mining in educational technology research. Educ. Technol. Res. Dev. 60, 383–398 (2012).
 10.
Knight, S., Friend Wise, A. & Chen, B. Time for change: why learning analytics needs temporal analysis. J. Learn. Anal. 4, 7–17 (2017).
 11.
Gerbier, E. & Toppino, T. C. The effect of distributed practice: neuroscience, cognition, and education. Trends Neurosci. Educ. 4, 49–59 (2015).
 12.
Ebbinghaus, H. Memory: a contribution to experimental psychology. Ann. Neurosci. 20, 155 (2013).
 13.
Toppino, T. C. & Gerbier, E. About practice: repetition, spacing, and abstraction. Psychol. Learn. Motiv. 60, 113–189 (2014).
 14.
Donovan, J. J. & Radosevich, D. J. A meta-analytic review of the distribution of practice effect: now you see it, now you don’t. J. Appl. Psychol. 84, 795–805 (1999).
 15.
Carroll, P. & White, A. Identifying patterns of learner behaviour: what business statistics students do with learning resources. INFORMS Trans. Educ. 18, 1–13 (2017).
 16.
Lee, A. V. Y. & Tan, S. C. Promising ideas for collective advancement of communal knowledge using temporal analytics and cluster analysis. J. Learn. Anal. 4, 76–101 (2017).
 17.
Berndt, D. & Clifford, J. Using dynamic time warping to find patterns in time series. Workshop Knowl. Knowl. Discov. Databases 398, 359–370 (1994).
 18.
Serrà, J. & Arcos, J. L. An empirical evaluation of similarity measures for time series classification. Knowl.-Based Syst. 67, 305–314 (2014).
 19.
Mueen, A. & Keogh, E. Extracting optimal performance from dynamic time warping. In Proc 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2129–2130 (ACM, New York, 2016).
 20.
Wang, X. et al. Experimental comparison of representation methods and distance measures for time series data. Data Min. Knowl. Discov. 26, 275–309 (2013).
 21.
Beguerisse-Díaz, M., Vangelov, B. & Barahona, M. Finding role communities in directed networks using Role-Based Similarity, Markov Stability and the Relaxed Minimum Spanning Tree. In Proc 2013 IEEE Global Conference on Signal and Information Processing, 937–940 (IEEE, New York, 2013).
 22.
Delvenne, J. C., Yaliraki, S. N. & Barahona, M. Stability of graph communities across time scales. Proc. Natl Acad. Sci. 107, 12755–12760 (2010).
 23.
Lambiotte, R., Delvenne, J. C. & Barahona, M. Random walks, Markov processes and the multiscale modular organization of complex networks. IEEE Trans. Netw. Sci. Eng. 1, 76–90 (2014).
 24.
Rodrigues, P. P., Gama, J. & Pedroso, J. P. Hierarchical clustering of timeseries data streams. IEEE Trans. Knowl. Data Eng. 20, 615–627 (2008).
 25.
Fenn, D. J. et al. Dynamic communities in multichannel data: an application to the foreign exchange market during the 2007–2008 credit crisis. Chaos 19, 033119 (2009).
 26.
Ando, T. & Bai, J. Clustering huge number of financial time series: a panel data approach with highdimensional predictors and factor structures. J. Am. Stat. Assoc. 112, 1182–1198 (2017).
 27.
Hoffmann, T., Peel, L., Lambiotte, R. & Jones, N. S. Community detection in networks with unobserved edges. Preprint at https://arxiv.org/abs/1808.06079 (2018).
 28.
Rasmussen, C. & Nickisch, H. Gaussian processes for machine learning (GPML) toolbox. J. Mach. Learn. Res. 11, 3011–3015 (2010).
 29.
Pedregosa, F. et al. Scikitlearn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
 30.
Kuzilek, J., Hlosta, M. & Zdrahal, Z. Data descriptor: open university learning analytics dataset. Sci. Data 4, 1–8 (2017).
 31.
Bloom, K. C. & Shuell, T. J. Effects of massed and distributed practice on the learning and retention of secondlanguage vocabulary. J. Educ. Res. 74, 245–248 (1981).
 32.
Verkoeijen, P. P. J. L. & Bouwmeester, S. Using latent class modeling to detect bimodality in spacing effect data. J. Mem. Lang. 59, 545–555 (2008).
 33.
Beck, H. P. & Davidson, W. D. Establishing an early warning system: predicting low grades in college students from survey of academic orientations scores. Res. High. Educ. 42, 709–723 (2001).
 34.
Ratanamahatana, C. A. & Keogh, E. Making timeseries classification more accurate using learned constraints. In Proc 2004 SIAM international conference on data mining, 11–12 (SIAM, 2004).
 35.
Silva, D. F., Batista, G. E. A. P. A. & Keogh, E. On the effect of endpoints on dynamic time warping. In SIGKDD Workshop on Mining and Learning from Time Series II, San Francisco, CA (ACM, New York, 2016).
 36.
Keogh, E., Chakrabarti, K., Pazzani, M. & Mehrotra, S. Dimensionality reduction for fast similarity search in large time series databases. Knowl. Inf. Syst. 3, 263–286 (2006).
 37.
Keogh, E. J. & Pazzani, M. J. Scaling up dynamic time warping to massive datasets. In European Conference on Principles of Data Mining and Knowledge Discovery, 1–11 (Springer, Berlin, Heidelberg, 2010).
 38.
Beguerisse-Díaz, M., Garduño-Hernández, G., Vangelov, B., Yaliraki, S. N. & Barahona, M. Interest communities and flow roles in directed networks: the Twitter network of the UK riots. J. R. Soc. Interface 11, 20140940 (2014).
 39.
Bastian, M., Heymann, S. & Jacomy, M. Gephi: an open source software for exploring and manipulating networks. In Third International AAAI Conference on Weblogs and Social Media, 361–362 (AAAI, Palo Alto, CA, 2009).
 40.
Fortunato, S. Community detection in graphs. Phys. Rep. 486, 75–174 (2010).
 41.
Blondel, V. D., Guillaume, J. L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech.: Theory Exp. 2008, P10008 (2008).
 42.
Meila, M. Comparing clusterings by the variation of information. In Learning Theory and Kernel Machines, 173–187 (Springer, Berlin, Heidelberg, 2003).
Acknowledgements
We would like to thank Dr. Nai Li, Marc Wells, Gavin Symonds, Samuel McGarry, and Phil Tulip for assistance with data collection and interpretation. We would also like to thank Prof Alan Spivey for helping promote the project and attain funding from Imperial College London. We would like to thank Dr. Iro Ntonia and Prof. Martyn Kingsbury for their insightful suggestions and advice on ethical procedures. This research has been funded by a President’s Excellence Award from Imperial College London. M.B. and S.N.Y. acknowledge support from EPSRC award EP/N014529/1 funding the EPSRC Centre for Mathematics of Precision Healthcare at Imperial.
Author information
Contributions
R.L.P. and M.B. designed the mathematical framework and data analytics pipeline. R.L.P. built and coded the Python toolbox for data collection, data cleaning and data analysis, and carried out the computational analysis. M.B., S.N.Y. and D.L. supervised the project design and data analytics research. All authors contributed to writing the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Peach, R.L., Yaliraki, S.N., Lefevre, D. et al. Data-driven unsupervised clustering of online learner behaviour. npj Sci. Learn. 4, 14 (2019). https://doi.org/10.1038/s41539-019-0054-0