Data-driven unsupervised clustering of online learner behaviour

The widespread adoption of online courses opens opportunities for analysing learner behaviour and optimising web-based learning adapted to observed usage. Here, we introduce a mathematical framework for the analysis of time-series of online learner engagement, which allows the identification of clusters of learners with similar online temporal behaviour directly from the raw data without prescribing a priori subjective reference behaviours. The method uses a dynamic time warping kernel to create a pair-wise similarity between time-series of learner actions, and combines it with an unsupervised multiscale graph clustering algorithm to identify groups of learners with similar temporal behaviour. To showcase our approach, we analyse task completion data from a cohort of learners taking an online post-graduate degree at Imperial Business School. Our analysis reveals clusters of learners with statistically distinct patterns of engagement, from distributed to massed learning, with different levels of regularity, adherence to pre-planned course structure and task completion. The approach also reveals outlier learners with highly sporadic behaviour. A posteriori comparison against student performance shows that, whereas high-performing learners are spread across clusters with diverse temporal engagement, low performers are located significantly in the massed learning cluster, and our unsupervised clustering identifies low performers more accurately than common machine learning classification methods trained on temporal statistics of the data. Finally, we test the applicability of the method by analysing two additional data sets: a different cohort of the same course, and time-series of different format from another university.


Introduction
The application of data analytics to educational data has increased rapidly in the past few years, facilitated by the adoption of online learning platforms [1]. However, it is crucial to identify both the right data and the analytical approaches that will allow us to gain interpretable insights into online engagement and learning patterns [2]. Learning occurs over time, and thus the analysis of temporal data is an important focus for educational data analytics. In this work, we describe a methodology for the study of time-series data collected from the engagement of learners with the tasks and stages of online courses. Temporal statistics have been shown to provide a fruitful avenue for identifying learners at risk of failure [3], predicting performance [4], predicting drop-out from a course [5,6,7,8], or identifying learner behaviours [9]. Despite such developments, a recent review of the field suggested that such temporal analyses are currently insufficient in number and that additional methodologies are required [10].

Figure 1: The time-series of completion times for tasks are plotted for three example learners (learners 65, 48 and 43). The entire course is divided into three semesters, with two modules completed during each semester. The mean time-series of task completion are plotted (coloured lines) against the average task completion trajectory for the entire cohort of learners (black line). Learner 65 (left, green line) completes tasks consistently earlier than average, whereas learner 48 (right, magenta line) tends to finish tasks after the average, particularly in term 3. Learner 43 exhibits rather sporadic behaviour during the first and second semesters, whereby many tasks are finished late or during massed learning sessions.
A common form of temporal analysis in the educational context investigates massed versus distributed study modes. Such studies commonly compare the performance of learners studying material 'massed' (or 'crammed') into a single study period to that of learners that 'distribute' their study of the same material across a number of shorter study periods. The general conclusion has been that distributed practice is the more effective strategy [11]. Such beneficial 'spacing effect' [12] has been shown in study distributed over differing periods and within different contexts [13], although other reports have noted that the effect does not apply to all learning contexts [14]. A feature of such data analyses is that they generally allocate subjects in advance to one of the two pre-determined study modes. Indeed, pre-allocation is also an inherent restriction in supervised machine learning approaches, where the labels must be determined a priori to train an algorithm.
Recent studies have used time series of learners and grouped them according to pre-selected features of the data (e.g., task focus, resource usage), revealing different approaches to problem solving and, consequently, to improved scores. However, such methods are highly dependent on the temporal features chosen as descriptors, which are based on specific knowledge of the data, as well as on the number of groups obtained by the clustering. Another recent study identified four behavioural groups of learners following a blended course (face-to-face and online), with groups separated according to their differing levels of engagement across the two platforms [15]. However, this study was also influenced by the particular features extracted from the data. Such studies exemplify how the combination of temporal analytics and cluster analysis can provide insights that can be used by educators, course designers, and researchers in learning analytics [10,16].
Here, we present an unsupervised methodology that allows the direct analysis of raw time series gathered from the engagement of learners as they complete tasks of online courses without imposing a priori the number or type of groups of learners, i.e., distinct learner clusters are not pre-determined nor identified subjectively based on prior features but are detected algorithmically during the data analysis. Specifically, we analysed the time series (i.e., time stamped data of task completion) of 81 learners as they undertook the six online compulsory courses that form the first year of a 2-year part-time postgraduate management degree. The courses extended over three terms and the patterns of task completion differ greatly across the learner group. Three examples of such highly distinct time series are shown in Figure 1, showing a variety of behaviours: from steady completion to highly massed behaviour to sporadic patterns.
The methodology is summarised in Figure 2. Using the time stamps of task completion by each learner, we employ a dynamic time warping kernel [17] to calculate similarities between learner time series and construct a similarity score between all pairs of learners. From this information, we then construct a similarity graph of the learner group [18] and apply to it Markov Stability, a multiscale unsupervised graph partitioning framework, in order to obtain clusters of learners with similar temporal behaviours [19]. Details of the different steps of the analysis are given in the Methods section.
Our analysis identifies distinct clusters of learners with differing temporal engagement when studying an online course at different levels of resolution. We observe clusters of learners that contain behaviours associated with 'massed' (i.e., completion of a large number of tasks within a short time period) and 'distributed' learning, as well as finer behaviours that differentiate these learning types into subgroups. For instance, at a very coarse level, the algorithm identifies a cluster of learners that follow the course in a sequential and 'distributed' manner; yet, at a finer resolution, this cluster is sub-divided into two clusters which differ by a 1-2 week difference in the average completion times of tasks (i.e., 'early birds' and 'on time'). Our approach also finds sporadic learners that skip a large number of tasks or exhibit irregular 'massed' learning depending on particular courses or times of the year. Exam grades were used a posteriori to establish whether particular online engagement behaviours can negatively affect learner performance, and were compared against classification based on statistics computed from the time-series data.

Figure 2: Summary of the methodology. A dynamic time warping kernel [17] calculates a measure of similarity between two time-series. (C) We construct an adjacency matrix A where each element A_ij describes the pairwise similarity between learners i and j. Similarity is measured between 0 and 1 (blue to red), where a higher value indicates a higher similarity. (D) The adjacency matrix A encodes a network where each node represents a learner and the edges are weighted by the similarity between learners. Community detection can then be applied to cluster learners according to their similar time-series behaviours.

To find groups of learners with similar online engagement in an unsupervised manner, we follow the procedure summarised in Figure 2. We first create a similarity matrix between learners using a Dynamic Time Warping kernel. This matrix is transformed into a similarity graph using a sparsification based on the Relaxed Minimum Spanning Tree [18], a procedure that retains global network connectivity whilst discarding weak similarities that can be explained through longer chains of strong similarities. Through this process, we create a graph in which the nodes are learners linked by edges weighted according to their time-course similarity. Hence, two learners that complete the tasks of the course in a similar manner will be linked by a strong edge.
The constructed similarity graph is then analysed using Markov Stability (MS), a multiscale graph partitioning algorithm that uses a Markov process to scan the graph across Markov time in order to find meaningful partitions of the graph at different levels of resolution [19]. Robust partitions of the graph are signalled by two variation of information (VI) [20] measures: an extended plateau in VI(t, t') and a dip in VI(t). Such robust partitions identify clusters of learners that exhibit similar online temporal patterns. As the Markov time is increased, the method reveals robust partitions of decreasing granularity: from 10 to 6 to 2 clusters in this case (Figure 3A). Note the quasi-hierarchical merging of the finer clusters into broader temporal patterns, a phenomenon that is intrinsic to the data and not imposed by our clustering algorithm. This feature reflects the fact that subtle temporal details characterise the finer clusters, whereas broader similarities of the time profiles define the coarser clusters. Hence, our computational framework allows for adjustable granularity, which can be tailored to the needs of the analyst. To exemplify the characterisation of the clustering results in our dataset, we now focus mainly on the 6-cluster partition, which contains four large groups and two learners that remain unclustered due to their highly individual sporadic behaviour.

Characterisation of the clusters of online learners
The clusters obtained through our data-driven procedure capture distinct temporal behaviours of online learners at different levels of resolution. For purposes of illustration, we concentrate here on the 6-cluster partition, which is robust and provides a sufficient level of resolution in the behavioural clusters it detects, yielding meaningful insights into the observed patterns (Figure 3B). Two of the clusters contain only one learner each, with highly individual and sporadic behaviour. For each of the other four clusters, we use Gaussian Process Regression (GPR) [21] to compute the average engagement trajectory of the group and compare it with the average GPR trajectory for the whole set of 81 learners. The computed GPRs also allow us to quantify statistically the differences in the temporal patterns of the different clusters.
In particular, using Bayes factors of the processes, we found that the trajectories of each cluster are statistically more probable to be derived from separate processes defined within their own cluster. A GPR was fitted to the entire set of trajectories and the log likelihood of the entire set of trajectories under this GPR was calculated. Equally, the log likelihood of each separate cluster of trajectories under that same Gaussian Process was calculated. The Bayes factor, calculated as the sum of the log likelihoods of each separate cluster minus the log likelihood of the entire set of trajectories, was found to be positive (BF = 3.37 × 10^10). This indicates that the behaviours of the clusters are statistically different from each other and are derived from different behavioural processes. This computation was repeated for the differences between each pair of neighbouring clusters. The Bayes factors were: 0.38 × 10^10 between the 'early birds' and 'on time' clusters; 1.52 × 10^10 between the 'on time' and 'low engager' clusters; and 0.17 × 10^10 between the 'low engager' and 'crammers' clusters. These numbers provide statistical evidence of the differences between the obtained clusters.
Each of the clusters in this partition has been given a descriptive title that encapsulates the group behaviour. The learners in the 'Early birds' group (green cluster) generally exhibit a highly sequential and ordered approach to their learning and tend to complete their tasks earlier than the group average. The behaviour of learners in the 'On time' group (cyan cluster) is similar to the 'Early birds', except for a systematic 1-2 week offset in their time series. Hence, both the green and cyan groups present a similar 'distributed learning' behaviour, distinguished only by a slight delay, which explains why both groups are agglomerated into a single cluster in the coarser 2-way partition (Figure 3A). The learners in the 'Low engagers' (orange) cluster also exhibit a relatively distributed work flow (similar to the cyan and green clusters) but with less anticipation in the second half of the year (and especially in the third term). Furthermore, this group had a high number of tasks that were never completed. The 'Crammers' cluster (magenta) contained learners that exhibited massed learning (indicated by the presence of plateaux in their time-series, suggesting many tasks completed in a short period of time), low task completion, and an ordering of task completion that deviates from the proposed course sequence. Finally, the outliers (learners 43 and 46), which form their own clusters, exhibit highly sporadic learning behaviours, with tasks completed at later dates and in a non-sequential order (relative to the layout of the course).
To further characterise our results, we have computed standard time-series metrics for each learner. Figure 4 shows the graph of learners coloured according to two such metrics derived directly from the time-series data: the mean 'massed' session length (massed learning is also commonly known as binge learning), defined as the length of a plateau in an isotonic regression, and the percentage of completed tasks. Figure 4A shows the mean massed session length calculated via an isotonic regression (see Methods), i.e., the length of plateaux in the number of tasks over time. More specifically, this measure captures events where a learner has completed a large number of tasks within a short time frame, an indication of massed learning. We find that the 'Crammers' cluster has a higher mean massed session length. Figure 4B maps the percentage of tasks completed relative to the total number of available tasks onto each learner in the network structure. In general, the 'Crammers' cluster shows the lowest mean task completion (66%), with a low completion ratio also in the 'Low engagers' group (80%) and higher mean task completion rates in the 'On time' (86%) and 'Early birds' (90%) clusters.

Cluster analysis identifies groups of learners at risk of low performance
We have also carried out an a posteriori evaluation of our behavioural clusters with respect to the performance of the learners. Figure 5A shows the mapping of the final average marks on the learner graph, where we have also highlighted high-performing (>70%, top 15%, 'Distinction') and low-performing (<60%, bottom 7.5%, below 'Merit') learners. Figure 5B shows that 6 out of 7 low-performing learners lie in the 'Crammers' cluster associated with massed learning and reduced task completion. The high performers tend to be distributed across the rest of the clusters, suggesting that no single learning behaviour is critical to high performance. Yet, 9 out of the 13 high-performing learners are found in the 'Early birds' or 'On time' clusters, characterised by a sequential approach to their learning with minimal massed learning sessions. There was a single outlier (learner 77, cyan cluster) who attained a low grade and yet did not exhibit time-series behaviours indicative of low performance. The outlier learners (43 and 46) did not attain a low performance (nor a distinctly high one).
Although our method captures information congruent with individual metrics of the time-series, such as those shown in Figure 4 related to massed learning and task completion rates, our data-driven clusters encompass global time-series information beyond pre-determined standard statistical features. To test this idea, we have compared learner performance predictions using our computational methodology and standard classification methods based on statistical features. Figure 5C illustrates the classification map obtained by training two common learning algorithms on the two features illustrated in Figure 4. The first learning algorithm is a support vector machine (SVM) with a radial basis function kernel, and the second is a decision tree with a depth of 4 branches [22] (see Methods). In both cases, we find that the accuracy of learner classification against performance is low: only 3-4 out of 7 low-performing learners were accurately predicted. This result suggests that using a finite set of pre-determined time-series features reduces the information available to differentiate the behaviours that correspond to performance. In contrast, our graph construction and clustering methodology utilises the full content of the time series, thus providing a more comprehensive grouping of learners with similar temporal learning behaviours by including time-series attributes that are not evident from correlation with particular statistical metrics.

Discussion
Our work describes an approach for the temporal analyses of learning behaviours in which distinct clusters of learners are detected algorithmically without any a priori information about individual behaviours or the number of differing learning behaviours present across the cohort. The mathematical pipeline is general, and can be applied broadly to any time-series data in physical or social sciences, providing robust distinctions between temporal behaviours.
We have showcased the analysis using data from 81 learners pursuing online postgraduate courses in management. The methodology proved effective in finding four distinct learner clusters ( 'early birds', 'on time', 'low engagers' and 'crammers') and two learners ('sporadic outliers') that formed their own separate clusters. The time-courses of the clusters obtained are statistically different to each other according to the GPR Bayes factors. Our methodology inherently encompasses more time-series information than standard time-series feature derivation and performance prediction was improved relative to common classification algorithms.
Our data-driven approach found that low performers were generally located in a single category ('Crammers'), which was associated with 'massed' learning and low task completion, in agreement with previous studies where learners that 'crammed' retained less information when tested at a later date [23]. Furthermore, the clusters 'early birds' and 'on time' most closely correlate with the expected behaviours proposed by the course design and their behaviour also corresponds to 'distributed' learning. The lack of low performing learners in these categories does provide support for the advantages associated with this behaviour.
Interestingly, the high performing learners were distributed widely across clusters, yet with a tendency to 'distributed' on-time learning. This suggests a less uniform pattern of learning behaviour for high performing learners, in agreement with a latent class modelling analysis that suggested that the 'spacing effect' is less prominent for high performing learners [24]. There was a single outlier (learner 77) who appeared in a cluster of learners that studied diligently, yet still underperformed, indicating that learning behaviours do not account fully for low performance. Hence combining temporal analysis with established 'early warning system' analyses [25] could improve predictive accuracy. We remark that our analysis allowed us to identify behavioural patterns in a small group of learners, a result that is difficult to emulate when regressing different features against performance due to possible bias. Although educators might encourage learners to pursue a more distributed study behaviour, these results suggest a nuanced approach; the distribution of high performing learners across different clusters suggests that flexibility should be provided in course design so that high performing learners may pursue the study strategies they find effective. These ideas could be tested further by applying the mathematical framework to larger datasets of learners, such as those made available by the Open University (OULAD) [26]. Future work within different learning contexts, coupled with additional dependent variables of interest (e.g., learner satisfaction, career success, interruption and withdrawal rates) will be important to provide broader support for the initial results reported here.

Methods
This section describes the data and the mathematical pipeline used to analyse the trajectories of learners. The research was performed without any a priori knowledge or allocation of the learners, making it similar to a blind investigation. Where required, statistical tests were used to validate the findings in this paper. Specifically, we made the assumption that the completion times of a task across learners were normally distributed, and thus we could calculate the mean and variance of the time-series trajectories using a Gaussian Process Regression. These are seen in Figures 1 and 3B, where the average time-series trajectory is compared against each cluster of learners.

Temporal Data
The subjects in this research were 81 post-experience learners pursuing a post-graduate part-time management degree at a selective research-oriented institution. These learners formed part of a cohort of 87 learners; data from the remaining 6 learners were not included here, as these learners either interrupted their studies or withdrew from the programme. Subjects ranged in age from 28 to 53 years old, the gender balance was 70% male and 30% female, and they resided in 18 geographically disparate countries. The data correspond to interactions with six online courses which together comprised the first academic year of the 2-year degree programme. Although the subjects met face-to-face at the start of each academic year, the six courses were studied completely online. Subjects proceeded in a lock-step manner through the academic year, which was split into three 10-week terms, each containing two of the six courses. The anticipated study load was 5 to 7 hours per week for each course, i.e., 10 to 14 hours in total. The courses were assessed via a combination of coursework and exam; however, participation in these separate assessed activities was not included in the dataset analysed here, and only their final 2-year grade was used as an indication of their performance.
The ethical approval from the Education Ethics Review Process (EERP) at Imperial College London was attained (EERP 1718-032b), and a waiver for informed consent was granted for this study.

Construction of the learner similarity graph
A similarity matrix is constructed that makes meaningful comparisons between the task completion trajectories. Each entry of the similarity matrix is a pairwise comparison of the i-th and j-th learner using a similarity kernel. A kernel is a generalised inner product which provides a measure of similarity between two objects. Common approaches for sequence analysis use L_p norms (when p = 2 we obtain the Euclidean norm), which are fast to compute and easy to index. However, their one-to-one matching often ignores sequential patterns that are non-linearly misaligned. Instead, our approach uses a dynamic time warping (DTW) kernel, which provides an elastic matching of two time sequences [17], i.e., it accounts for both the sequential ordering of the trajectory and the absolute values of time. The DTW kernel is defined as

A_ij = exp(−D_l(x_i, x_j)),

where D_l denotes the DTW distance. The distance D_l is calculated by constructing an n × m matrix, where n and m are the lengths of the two vectors we wish to compare. Using a cost function, such as cost(x_i, y_j) = ||x_i − y_j||^2, we minimise the cumulative cost along a path starting from (i, j) = (1, 1) and ending at (i, j) = (n, m), where each cell (i, j) along the path contributes cost(x_i, y_j) to the cumulative cost. This method implicitly stretches both sequences to obtain a single dynamic time warping match between the two vectors, i.e., we find the cost required to match the time-series trajectories of two learners. Higher costs correspond to larger distances in the Hilbert space and therefore to lower similarity between learners.
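To make the elastic matching concrete, the following sketch implements the standard DTW dynamic-programming recursion and an exponential similarity kernel; this is our own minimal illustration (the `scale` parameter is a hypothetical choice), not the exact kernel implementation of [17].

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic time warping distance between two 1-D sequences,
    using a squared-difference cost and the full DP table."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # extend the cheapest of the three admissible predecessor cells
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def dtw_similarity(x, y, scale=1.0):
    """Map the DTW distance into a (0, 1] similarity with an exponential kernel."""
    return np.exp(-dtw_distance(x, y) / scale)
```

Note that the elastic matching gives zero distance between sequences that differ only by repeated values, e.g. `[1, 2, 3]` and `[1, 2, 2, 3]`, which a one-to-one Euclidean comparison cannot even align.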
For N learners we produce an N × N similarity matrix A, where each element A_ij is a measure of similarity between learners i and j. This matrix can be thought of as a fully connected, weighted graph (where every learner is connected to every other learner in the network) with edge strengths given by the similarities. However, the full similarity matrix introduces complexity and redundancy, and it increases the computation time of any algorithm run on the graph. To reduce the number of edges, we employ a pruning algorithm which removes edges whilst retaining long paths within the graph, and which has been shown to perform well [18].
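The Relaxed Minimum Spanning Tree construction of [18] is not reproduced here; as a simplified stand-in for illustration, the sketch below keeps a maximum-similarity spanning tree (guaranteeing global connectivity) plus each node's k strongest similarities. The function name and the choice of k are our own assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def sparsify(S, k=3):
    """Prune a dense symmetric similarity matrix S (higher = more similar)
    into a sparse weighted graph: keep a maximum-similarity spanning tree
    for global connectivity, plus each node's k most similar neighbours.
    Simplified stand-in for the RMST sparsification of the paper."""
    N = S.shape[0]
    dist = S.max() - S + 1e-12        # convert similarities to positive 'distances'
    np.fill_diagonal(dist, 0.0)       # zero diagonal = no self-edges in csgraph
    mst = minimum_spanning_tree(dist).toarray() != 0
    keep = mst | mst.T                # symmetrise the tree edges
    for i in range(N):
        # strongest k similarities of node i (skip the diagonal entry)
        for j in np.argsort(S[i])[::-1][:k + 1]:
            if j != i:
                keep[i, j] = keep[j, i] = True
    A = np.where(keep, S, 0.0)
    np.fill_diagonal(A, 0.0)
    return A

# demo: a random symmetric similarity matrix over 20 'learners'
rng = np.random.default_rng(0)
B = rng.random((20, 20))
S = (B + B.T) / 2
np.fill_diagonal(S, 1.0)
A_sparse = sparsify(S, k=2)
```

The spanning-tree component is what preserves the long paths mentioned above: even weakly similar learners remain reachable through chains of strong edges.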
Visualisations of the graph structure were produced using Gephi with the Force Atlas setting [27].

Finding clusters of learners using Markov Stability graph partitioning
Graph partitioning methods attempt to divide the nodes into sub-graphs (clusters) that are well-connected internally and weakly connected to each other. There are multiple ways to define clusters, and many methods exist which attempt to partition the graph and to score and rank the resulting partitions [28]. Markov Stability is a multiscale graph partitioning algorithm with the intrinsic capability to identify clusters across all scales, without imposing a particular size or number of clusters a priori. For a detailed description of the method see [19,29]. Markov Stability finds optimised partitions at every scale, and the robustness of these partitions is tested using a random walker that explores the graph. If the random walker becomes trapped within a subgraph over a given Markov time scale, then that subgraph indicates a good partition. As the time scale of the Markov process increases, the method looks at coarser subgraphs.
The topology of a graph is described by an N × N adjacency matrix A, which is symmetric and whose values A_ij are '1' if two vertices are connected and '0' otherwise (assuming unweighted edges). The degree of each vertex, d_i, can be compiled into an N × 1 vector d, which can also be arranged into the diagonal N × N matrix D = diag(d).
A random walk on a graph can be encoded in a Markov chain whereby the probability of leaving a vertex is uniformly split across the outgoing edges, with a transition probability of 1/d_i for each edge:

p_{t+1} = p_t M, where M = D^{-1} A,

where p_t is a 1 × N probability vector and M is the transition matrix corresponding to the graph. This equation describes the evolution of the probability vector p according to the transition matrix M. Under these assumptions, the stationary distribution π = d^T / Σ_i d_i = d^T / 2m is defined as the distribution that no longer changes under the dynamics, π = πM.
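These definitions are easy to verify numerically; the snippet below builds the transition matrix and stationary distribution for a small example graph of our own choosing, and evolves an arbitrary initial distribution towards it.

```python
import numpy as np

# toy undirected graph: a path 0-1-2-3 plus the edge 1-3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

d = A.sum(axis=1)          # degree vector d
M = A / d[:, None]         # transition matrix M = D^{-1} A
pi = d / d.sum()           # stationary distribution pi = d^T / 2m

# evolve an arbitrary initial distribution: p_{t+1} = p_t M
p = np.array([1.0, 0.0, 0.0, 0.0])
for _ in range(200):
    p = p @ M
```

Because this small graph is connected and non-bipartite, the evolved distribution p converges to the degree-proportional stationary distribution pi.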
Cluster detection in Markov Stability is obtained by analysing the clustered 'auto-covariance' matrix, which is defined using the Markov transition matrix:

R(t, H) = H^T (Π M^t − π^T π) H.

Here H is an N × c indicator matrix that encodes the cluster partition; π describes the probability at stationarity, and Π = diag(π). The c × c matrix R encodes the probability of starting in community α and finishing in community β after a Markov time t has elapsed. Hence, the diagonal elements, [R(t)]_αα, indicate the probability of being trapped in a particular cluster over a timescale t. The sum of these diagonal elements, the trace, must be maximised to find a good partition:

r(t, H) = min_{τ≤t} Tr(R(τ, H)).

Because testing all possible partitions is computationally intractable, we use a greedy and efficient algorithm known as the Louvain algorithm [30] to maximise the trace of R, and we look for robust optimised partitions as meaningful. This is done through two criteria: (i) the range of Markov time over which a particular partition persists and is resistant to perturbations indicates a robust partition, as signalled by extended plateaux in VI(t, t'); (ii) at each Markov time, 100 optimised partitions are found using 100 different initial conditions for the algorithm, and a robust partition is indicated by a similar partition being found consistently, as signalled by a low value (or dip) in VI(t). Similarities and differences between partitions are calculated using the Variation of Information (VI) [20], a true metric of distance between clusterings based on information theory, closely related to mutual information. Low values of VI indicate similar partitions.
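As an illustration of the quantities above, the sketch below evaluates the discrete-time stability trace for a hard partition of a toy graph (two triangles joined by a single edge); in the actual pipeline the partition is optimised with the Louvain algorithm rather than supplied by hand, and this toy graph is our own example.

```python
import numpy as np

def stability(A, labels, t=1):
    """Discrete-time Markov Stability trace Tr[H^T (Pi M^t - pi^T pi) H]
    for a hard partition given as integer labels (toy illustration only)."""
    d = A.sum(axis=1)
    M = A / d[:, None]                       # transition matrix D^{-1} A
    pi = d / d.sum()                         # stationary distribution
    c = labels.max() + 1
    H = np.eye(c)[labels]                    # N x c indicator matrix
    R = H.T @ (np.diag(pi) @ np.linalg.matrix_power(M, t)
               - np.outer(pi, pi)) @ H
    return np.trace(R)

# two triangles joined by the single edge (2, 3)
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

good = np.array([0, 0, 0, 1, 1, 1])   # the two triangles
bad  = np.array([0, 1, 0, 1, 0, 1])   # an arbitrary split
```

The partition into the two triangles traps the random walker and scores a higher trace than the arbitrary split, which is the signal the Louvain optimisation seeks to maximise.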

Isotonic regression
An isotonic regression is a model that identifies the optimal least-squares fit to a data set under the constraint that the fitted model must be a non-decreasing function. The optimisation is

min Σ_i (y_i − x_i)^2 subject to x_0 ≤ x_1 ≤ ... ≤ x_n,

i.e., each fitted value x_i must be larger than or equal to x_{i−1}. The algorithm looks for violations of monotonicity and adjusts the estimate to fit within the constraints.
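For instance, using scikit-learn's implementation, plateaux in the isotonic fit of completion day against task index flag massed sessions; the synthetic learner below is our own example, not data from the study.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# synthetic learner: completion day of each successive task;
# the three tasks finished on day 30 form a massed session
task_index = np.arange(1, 11, dtype=float)
day = np.array([1, 3, 5, 28, 29, 30, 30, 30, 31, 60], dtype=float)

# non-decreasing least-squares fit of completion day vs task index
fitted = IsotonicRegression().fit_transform(task_index, day)

# plateau lengths = runs of equal fitted values; long runs indicate
# many tasks completed within a short time frame (massed learning)
boundaries = np.flatnonzero(np.r_[True, np.diff(fitted) != 0, True])
plateau_lengths = np.diff(boundaries)
```

The mean of `plateau_lengths` corresponds to the mean massed session length mapped onto the learner graph in Figure 4A.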

Gaussian Process Regression
The Gaussian process regression (GPR) was implemented using the sklearn Python package. The implementation is based on the Algorithm 2.1 of Gaussian Processes for Machine Learning (GPML) by Rasmussen and Williams [21].
A GPR model can be thought of as defining a distribution over functions, with inference undertaken directly in the space of functions. As such, a mean and variance that model the data can be calculated. Given that the GPR is probabilistic, we can calculate the log likelihood of any set of trajectories being derived from a GPR optimised on another set of trajectories. Bayes factors are a method of Bayesian model comparison which quantify the support for one model over another. The Bayes factor K for two models M_1 and M_2 given some data D is

K = P(D | M_1) / P(D | M_2).
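The comparison can be sketched with scikit-learn's GPR by summing per-cluster log marginal likelihoods and subtracting that of a pooled fit; the synthetic clusters and the kernel choice below are our own assumptions, not the settings used for the learner data.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def log_likelihood(X, y):
    """Log marginal likelihood of a GPR fitted to (X, y)."""
    k = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
    gpr = GaussianProcessRegressor(kernel=k, normalize_y=True).fit(X, y)
    return gpr.log_marginal_likelihood_value_

# two synthetic clusters of trajectories with clearly distinct trends
rng = np.random.default_rng(0)
X = np.linspace(0, 1, 40)[:, None]
cluster_a = np.sin(2 * np.pi * X[:, 0]) + 0.05 * rng.standard_normal(40)
cluster_b = -np.sin(2 * np.pi * X[:, 0]) + 0.05 * rng.standard_normal(40)
X_all = np.vstack([X, X])
y_all = np.concatenate([cluster_a, cluster_b])

# log Bayes factor: separate models per cluster vs one pooled model
log_bf = (log_likelihood(X, cluster_a) + log_likelihood(X, cluster_b)
          - log_likelihood(X_all, y_all))
```

A positive log Bayes factor, as here, supports modelling the clusters as separate processes rather than a single pooled one, mirroring the comparison reported in the Results.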

Additional classification algorithms
To classify learners into high, medium and low performance groups, we used an SVM and a Decision Tree. Both algorithms are commonly used in classification tasks and were implemented using the scikit learn Python package [22].
• An SVM acts as a non-probabilistic binary classifier that attempts to find a hyperplane in a high- or infinite-dimensional space that maximises the distance between data points of differing classes. We implemented the SVM with a radial basis function kernel.
• The Decision Tree attempts to find optimal branches (decisions) that represent conjunctions of features leading to accurate prediction of class labels. We implemented a Decision Tree with a depth of 4 branches; increasing the number of branches did not improve the classification accuracy.
Instead of using regression analysis between continuous dependent variables (performance) and independent variables (temporal features), we implemented classification algorithms to provide a closer comparison to our clustering results.
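A minimal version of this classification setup can be written with scikit-learn; the feature distributions and the labelling rule below are invented stand-ins for the two features of Figure 4, used only to show the model configuration (RBF-kernel SVM, depth-4 tree).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-ins for the two features of Figure 4:
# mean massed-session length and percentage of tasks completed
rng = np.random.default_rng(1)
n = 80
massed = rng.gamma(2.0, 2.0, n)           # session length (days)
completion = rng.uniform(60, 100, n)      # % of tasks completed
X = np.column_stack([massed, completion])

# illustrative labels: 'low performers' follow long massed sessions
# and low completion (this rule is invented for the demo, not from the data)
y = ((massed > 5) & (completion < 80)).astype(int)

svm = SVC(kernel="rbf").fit(X, y)                     # RBF-kernel SVM
tree = DecisionTreeClassifier(max_depth=4).fit(X, y)  # depth-4 decision tree

svm_acc = svm.score(X, y)    # training accuracy, for illustration only
tree_acc = tree.score(X, y)
```

Training accuracy is reported here purely to show the API; the evaluation in the paper compares predicted low performers against held-out grade outcomes.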

Code availability
In accordance with the code policy at Science of Learning, we have provided links to the necessary functions required for the mathematical framework detailed in this manuscript:
• Clustering algorithm (Markov Stability): https://wwwf.imperial.ac.uk/~mpbara/Partition_Stability/ and https://github.com/michaelschaub/PartitionStability
• Dynamic time warping: https://github.com/pierre-rouanet/dtw

Data availability
To maintain anonymity of the learners that took part in this study we have not released the data.

Acknowledgements
We would like to thank Nai Li, Marc Wells, Gavin Symonds, Samuel McGarry and Phil Tulip for assistance with data collection and interpretation. We would also like to thank Prof Alan Spivey for helping promote the project and attain funding. We would like to thank Dr Iro Ntonia and Prof Martyn Kingsbury for their insightful suggestions and their advice on ethical procedures.