Data-driven unsupervised clustering of online learner behaviour

Peach, Robert L.; Yaliraki, Sophia N.; Lefevre, David; Barahona, Mauricio

doi:10.1038/s41539-019-0054-0

Article
Open access
Published: 03 September 2019

npj Science of Learning volume 4, Article number: 14 (2019)

Subjects

Education

Abstract

The widespread adoption of online courses opens opportunities for analysing learner behaviour and optimising web-based learning adapted to observed usage. Here, we introduce a mathematical framework for the analysis of time-series of online learner engagement, which allows the identification of clusters of learners with similar online temporal behaviour directly from the raw data without prescribing a priori subjective reference behaviours. The method uses a dynamic time warping kernel to create a pair-wise similarity between time-series of learner actions, and combines it with an unsupervised multiscale graph clustering algorithm to identify groups of learners with similar temporal behaviour. To showcase our approach, we analyse task completion data from a cohort of learners taking an online post-graduate degree at Imperial Business School. Our analysis reveals clusters of learners with statistically distinct patterns of engagement, from distributed to massed learning, with different levels of regularity, adherence to pre-planned course structure and task completion. The approach also reveals outlier learners with highly sporadic behaviour. A posteriori comparison against student performance shows that, whereas high-performing learners are spread across clusters with diverse temporal engagement, low performers are located significantly in the massed learning cluster, and our unsupervised clustering identifies low performers more accurately than common machine learning classification methods trained on temporal statistics of the data. Finally, we test the applicability of the method by analysing two additional data sets: a different cohort of the same course, and time-series of different format from another university.

Introduction

The application of data analytics to educational data has surged in the past few years, facilitated by the adoption of online learning platforms.¹ However, in parallel to the increased access to detailed information, it is crucial to identify both the right type of data and analytical approaches that will allow us to gain interpretable insights into online engagement and learning patterns.² The process of learning extends over time and thus the analysis of temporal data is an important focus for educational data analytics. In this work, we describe a methodology for the study of time-series data collected from the engagement of learners with the tasks and stages of online courses. The analysis of temporal statistics has been shown to provide a fruitful avenue for identifying learners at risk of failure,³ predicting performance,⁴ anticipating dropout from a course,⁵,⁶,⁷,⁸ and identifying learner behaviours.⁹ Despite such developments, a recent review of the field suggested that temporal analyses are currently insufficient in number, and that additional methodologies are required.¹⁰

Temporal analytics has been used in the educational context to investigate massed versus distributed study modes, i.e., to compare the performance of learners that study the material ‘massed’ (or ‘crammed’) into a single study period to that of learners that ‘distribute’ their study of the material across a number of shorter study periods. The general conclusion has been that distributed practice is the more effective strategy.¹¹ The benefits of such ‘spacing effect’¹² have been shown over differing periods and within different contexts,¹³ although other reports have noted that the effect does not apply to all learning contexts.¹⁴ However, a feature of previous data analyses is that they generally allocate subjects in advance to one of the two pre-determined study modes. Indeed, pre-allocation is also an inherent restriction in supervised machine learning approaches, where labels are assigned a priori to train an algorithm.

Recent studies have collected time-series of learners’ behaviours and used them to cluster learners according to pre-selected features of the data (e.g., task focus, resource usage, etc.) chosen to describe different approaches to problem solving. However, such methods are highly dependent both on the temporal features chosen as descriptors, which are based on specific knowledge of the data, and on the number of groups that are obtained by the clustering. For example, a recent study extracted particular features from learners following a blended course (i.e., on two platforms: face-to-face and online) and identified four behavioural groups separated according to their differing levels of engagement across the two platforms.¹⁵ Such studies exemplify how the combination of temporal analytics and cluster analysis can provide insights of use to educators, course designers, and researchers in learning analytics.¹⁰,¹⁶

Here, we present an unsupervised methodology that allows the direct analysis of raw time-series gathered from the engagement of learners as they complete tasks of online courses, without imposing a priori either the statistical descriptors of the time-series or the number or type of groups of learners to be detected. Hence the obtained learner clusters are not pre-determined or identified subjectively based on prior features but are detected algorithmically during the data analysis. To exemplify our approach, we analysed in detail the time-series (i.e., time-stamped data of task completion) of 81 learners as they undertook the six online compulsory courses that form the first year of a 2-year part-time post-graduate management degree. The courses extended over three terms and the patterns of task completion differ greatly across the learner group. Three examples of such highly distinct time-series are shown in Fig. 1, showing a variety of behaviours: from steady completion to highly massed behaviour to sporadic patterns. To highlight its applicability, we also applied the method to two additional data sets: a different set of time-series of task completion collected from the same degree programme but from a different year cohort, and a set of time-series of online interactions (not of task completion) collected by a different university and therefore with distinct characteristics.

The methodology is summarised in Fig. 2. We use the raw, time-stamped series of online actions from each learner and employ a dynamic time warping (DTW) kernel¹⁷ to calculate a similarity score between all pairs of learner time-series. Although several alternative methods exist to measure the similarity between two time-series (e.g., Euclidean distance, Fourier coefficients, auto-regressive models, edit distance, or minimum jump models),¹⁸ DTW has been shown to outperform a variety of measures in classification tasks¹⁹ and provides a principled way to use the full, raw information of the time-series without preselecting features or functional representations.²⁰ From the ensuing DTW similarity matrix, we construct a similarity graph, where nodes are learners and weighted links represent similarities between learners. This graph construction step is carried out using the Relaxed Minimum Spanning Tree (RMST) algorithm,²¹ which aims to encapsulate the locally strong and globally relevant similarities in the data set. RMST has been shown to perform well in conjunction with the multiscale, unsupervised graph partitioning methodology of Markov Stability,²²,²³ which we apply to our graph to obtain clusters of learners with similar temporal behaviours. Alternative methods to cluster time-series data, with and without the creation of graphs, have been proposed in other contexts and applications.²⁴,²⁵,²⁶,²⁷ Instead of finding one particular clustering, our algorithm produces a multiscale description, given by a set of consistent clusterings of different coarseness obtained by robustly optimising across all levels of resolution in an unsupervised manner, without pre-imposing the number or type of clusters (see Fig. 3a for an example). Clusterings of different coarseness can then be used by the analyst according to their needs. If no robust clusterings are found, the algorithm will signal a lack of natural clusters in the data. Details of the computational analysis are given in the Methods section.

When applied to our case study data set, our analysis identifies a set of clusterings of learners at different levels of resolution. The clusters of learners reflect their differing temporal engagement as they progress through the online courses. In particular, our data-driven clusters capture behaviours associated with massed (i.e., completion of a large number of tasks within a short time period) and distributed learning, as well as finer behaviours that differentiate these learning types into subgroups. For instance, at a coarse level, the algorithm identifies a cluster of learners that follow the course in a sequential and distributed manner; yet, at a finer resolution, this cluster is sub-divided into two clusters, which differ by 1–2 weeks in the average completion times of tasks (i.e., ‘early birds’ and ‘on time’). Our approach also finds sporadic learners that skip a large number of tasks or exhibit irregular massed learning depending on particular courses or at different times of the year. Similar outcomes are observed for the other two data sets, although with differences reflecting the particularities of the data. We then used exam grades a posteriori to establish whether particular online engagement behaviours can negatively affect learner performance, and we compared our groupings against classification based on statistical features computed from the time-series data.

Results

Unsupervised clustering reveals clusters of learners with differing online engagement

To find groups of learners with similar online engagement in an unsupervised manner, we follow the procedure summarised in Fig. 2. We first create a similarity matrix between learners using a dynamic time warping kernel. This matrix is transformed into a similarity graph using a sparsification based on the Relaxed Minimum Spanning Tree,²¹ a procedure that retains global network connectivity while discarding weak similarities that can be explained through longer chains of strong similarities. Through this process, we create a graph where the nodes are learners linked by edges weighted according to their time-course similarity. Hence, two learners that complete the tasks of the course in a similar manner will be linked by a strong edge.
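The sparsification idea can be sketched in a few lines. The rule used below (keep an edge if its distance is smaller than the largest edge on the minimum-spanning-tree path between its endpoints, plus a local relaxation term) is our reading of the RMST construction in ref. 21 and is an assumption here, since the exact criterion is not reproduced in this section. The function name `rmst_edges` and the parameters `gamma` and `k` are hypothetical, and the input is assumed to be a symmetric matrix of pair-wise distances (for instance, DTW distances).

```python
def rmst_edges(dist, gamma=0.5, k=1):
    """Return the undirected edges kept by an RMST-style sparsification.

    dist  : symmetric list-of-lists of pair-wise distances
    gamma : relaxation strength (assumed parameter; gamma = 0 gives the plain MST)
    k     : neighbour rank used to set each node's local distance scale (assumed)
    """
    n = len(dist)
    in_tree = [0]
    mst = set()
    # mlink[i][j] = largest edge weight on the MST path between i and j
    mlink = [[0.0] * n for _ in range(n)]
    # best[v] = (cheapest distance from v to the current tree, tree node achieving it)
    best = {v: (dist[0][v], 0) for v in range(1, n)}

    # Prim's algorithm, updating the path bottlenecks as each node joins the tree
    while best:
        v = min(best, key=lambda node: best[node][0])
        w, u = best.pop(v)
        mst.add((min(u, v), max(u, v)))
        for x in in_tree:
            mlink[x][v] = mlink[v][x] = max(mlink[x][u], w)
        in_tree.append(v)
        for z in best:
            if dist[v][z] < best[z][0]:
                best[z] = (dist[v][z], v)

    # Local scale: distance from each node to its k-th nearest neighbour
    d_k = [sorted(dist[i][j] for j in range(n) if j != i)[k - 1] for i in range(n)]

    # Keep all MST edges (global connectivity) plus edges passing the relaxed test
    edges = set(mst)
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] < mlink[i][j] + gamma * (d_k[i] + d_k[j]):
                edges.add((i, j))
    return edges


# Three hypothetical learners whose pair-wise distances are 1, 9 and 10
toy = [[0, 1, 10], [1, 0, 9], [10, 9, 0]]
print(sorted(rmst_edges(toy, gamma=0.0)))
print(sorted(rmst_edges(toy, gamma=0.5)))
```

With `gamma = 0` only the spanning tree survives; increasing `gamma` re-admits edges that are only slightly weaker than the chain of strong similarities already connecting their endpoints.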

The constructed similarity graph is then analysed using Markov Stability (MS), a multiscale graph partitioning algorithm that uses a Markov process to scan the graph across Markov time in order to find optimised and robust partitions of the graph at any level of resolution.²²,²³ The partitions are found by maximising a resolution-dependent cost function (the Markov Stability) at all levels of resolution, as given by the Markov time, t. We then select robust partitions in the following sense: (i) they are persistent across scales (i.e., optimal over an extended Markov time t, as given by a plateau with a low value of VI(t, t′)), and (ii) they are robust to small changes in the optimisation (i.e., consistently found as a good partition over those scales, as given by a relative dip in VI(t)). Such robust partitions identify clusters of learners that exhibit similar online temporal patterns. The definitions of the different measures and some details of the Markov Stability framework are given in Methods.
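As an illustration of the robustness measure, the snippet below computes the variation of information between two partitions of the same set of learners, taking VI to be the sum of the two conditional entropies. This is a minimal sketch: the function name is hypothetical, the result is in nats and unnormalised, and any normalisation or averaging over optimisation runs used in the paper (see Methods) is not reproduced here.

```python
import math
from collections import Counter


def variation_of_information(labels_a, labels_b):
    """Variation of information (in nats) between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same items")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    vi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        # H(A|B) + H(B|A) = -sum p_ab * [log(p_ab / p_a) + log(p_ab / p_b)]
        vi -= p_ab * (math.log(n_ab / count_a[a]) + math.log(n_ab / count_b[b]))
    return vi


# Identical partitions (up to relabelling) are at distance zero
print(variation_of_information([0, 0, 1, 1], [1, 1, 0, 0]))
# Two maximally unrelated 2-way splits of four items
print(variation_of_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

A low value between partitions found at neighbouring Markov times, or between repeated optimisations at the same Markov time, indicates that essentially the same grouping keeps being recovered.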

Figure 3a summarises the results of our multiscale clustering method applied to the time-series of task completion of six online courses by 81 learners pursuing a post-graduate part-time Management degree at Imperial College Business School over one year. See Methods for further details about the data. As the Markov time is increased, the level of resolution is decreased and the method reveals robust partitions of decreasing granularity. In Fig. 3a, we illustrate the partitions found from ten clusters down to two clusters, with a notably robust partition into six clusters. Note the quasi-hierarchical aggregation of the finer clusters into coarser ones, a feature that is intrinsic to the data and not imposed by our clustering algorithm. (For a more detailed view of the multiscale clustering structure, see Supplementary Fig. 1). The quasi-hierarchical organisation across levels of resolution reflects the fact that subtle temporal details characterise the finer clusters, but broader similarities of the time profiles define the coarser clusters. Hence, our computational framework allows for adjustable granularity, which can be tailored to the needs of the analyst.
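To make the role of the Markov time concrete, a commonly used form of the Markov Stability cost function from the literature cited above²²,²³ is sketched here. The notation (H, Π, π, L) is introduced only for illustration and is an assumption on our part; the precise definition used in this work is the one given in Methods.

$$r(t,H) = \mathrm{trace}\left[ H^{T} \left( \Pi \, e^{-tL} - \pi^{T}\pi \right) H \right], \qquad L = I - D^{-1}A,$$

where A is the weighted adjacency matrix of the learner graph, D is the diagonal matrix of node strengths, π is the stationary distribution of the random walk (a row vector proportional to the node strengths), Π = diag(π), and H is the N × c indicator matrix assigning each of the N learners to one of c clusters. Each diagonal entry of the bracketed term measures how much more likely a random walker is to be found in its starting cluster after time t than at stationarity. For short Markov times the walker has barely moved, so small clusters score well; as t grows, only larger groups retain the walker, which is why increasing the Markov time yields coarser partitions.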

To exemplify the characterisation of the results in our data set, we focus mainly on the 6-cluster partition, which contains four large groups and two single learners that remain unclustered due to their highly individual sporadic behaviour. The 6-cluster partition exhibits the largest relative drop in VI(t) and a long plateau in VI(t, t′). The 10-cluster and 8-cluster partitions are equally of interest and provide a more refined clustering consistent with the 6-way partition, as seen in Fig. 3a. The coarser 2-cluster partition is also of interest: the two clusters are found to separate learners that exhibit distributed and massed learning. In the rest of the paper, we concentrate on a more detailed description of behaviours emerging from the 6-cluster partition, as it provides a nuanced, data-driven level of resolution on the data.

Characterisation of the clusters of online learners

As shown in Fig. 3b, the 6-cluster partition is robust, and the data-driven groupings it provides have an appropriate level of resolution to gain meaningful insight into the observed patterns of online learners. Two of the clusters contain only one learner, with highly individual and sporadic behaviour. For each of the other four clusters, we use Gaussian Process Regression (GPR)²⁸ to compute the average engagement trajectory of the group of learners, and compare it with the average GPR trajectory for the whole set of 81 learners. The computed GPRs allow us to quantify statistically the differences in the temporal patterns of the different clusters using Bayes factors of the processes. In particular, we found that the trajectories of each cluster are statistically more likely to be derived from separate processes defined within their own cluster. The test proceeds as follows. A GPR was fitted to the entire set of trajectories and the log-likelihood of the entire set of trajectories was calculated. Equally, the log-likelihood of each separate cluster of trajectories from that same Gaussian Process was calculated. The Bayes factor, calculated as the sum of log-likelihoods of each separate cluster minus the log-likelihood of the entire set of trajectories (Eq. (12)), was found to be large (K = 3.37 × 10¹⁰). This indicates that the behaviours of each cluster are statistically different from each other and are derived from different behavioural processes. This computation was repeated for the differences between each pair of neighbouring clusters. The Bayes factors were: K = 0.38 × 10¹⁰ between the ‘early birds’ and ‘on time’ clusters; K = 1.52 × 10¹⁰ between the ‘on time’ and ‘low engager’ clusters; and K = 0.17 × 10¹⁰ between the ‘low engager’ and ‘crammers’ clusters. These numbers provide statistical evidence of the differences between the obtained clusters.

Each of the clusters in this partition has been given a descriptive title that encapsulates the group behaviour. The learners in the ‘Early Birds’ group (green cluster) generally exhibit a highly sequential and ordered approach to their learning and tend to complete their tasks earlier than the cohort average, with a systematic 1–2 week advance offset. The behaviour of learners in the ‘On time’ group (cyan cluster) is similar to that of the ‘Early Birds’, except that they finish tasks closer to the average. Hence both the green and cyan groups present a similar ‘distributed learning’ behaviour, distinguished only by a slight delay, which explains why both groups are agglomerated into a single cluster in the coarser 2-way partition (Fig. 3a). The learners in the ‘Low Engagers’ (orange) cluster also exhibit a relatively distributed workflow (similar to the cyan and green clusters) but with less anticipation in the second half of the year (and especially in the third term). Furthermore, this group had a high number of tasks that were never completed. The ‘Crammers’ cluster (magenta) contained learners that exhibited massed learning (indicated by the presence of plateaux in their time-series, suggesting tasks being completed in a short period of time), low task completion and an ordering of task completion that deviates from the proposed course sequence. Finally, the outliers (learners 43 and 46), which form their own clusters, exhibit highly sporadic learning behaviours, with tasks completed at later dates and without following the sequential layout of the course.

To further characterise our results, we computed standard time-series metrics for each learner. Figure 4 shows the graph of learners coloured according to two such statistical metrics derived directly from the time-series: the mean massed session length (commonly known as binge learning), and the percentage of completed tasks. Figure 4a shows the mean massed session length, i.e., the length of plateau in the number of tasks over time calculated via an isotonic regression (see Methods). This measure captures events where a learner has completed a large number of tasks within a short time frame. We find that the ‘Crammers’ cluster has a higher mean massed session length. Figure 4b shows the graph of learners coloured according to the percentage of tasks completed relative to the total number of available tasks. In general, the ‘Crammers’ cluster shows the lowest mean task completion (66%), followed by a completion ratio of 80% in the ‘Low Engagers’ group, and higher mean task completion rates in the ‘On time’ (86%) and ‘Early Birds’ (90%) clusters.
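One way to read the massed-session metric is sketched below. It assumes that each learner is described by the completion day of each task, listed in course order, and that a plateau is a run of tasks sharing one fitted value after a non-decreasing (isotonic) fit. Both the input format and the threshold `min_tasks` are assumptions made for illustration; the definition actually used is the one in Methods.

```python
def isotonic_blocks(values):
    """Pool-adjacent-violators fit; returns (fitted value, run length) per block."""
    blocks = []  # each block is [sum of values, number of values]
    for v in values:
        blocks.append([float(v), 1])
        # Merge while the previous block's mean is not below the current one,
        # so ties and order violations both end up in a single plateau.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    return [(total / count, count) for total, count in blocks]


def mean_massed_session_length(completion_days, min_tasks=2):
    """Mean number of tasks per plateau, counting plateaus of at least min_tasks."""
    lengths = [n for _, n in isotonic_blocks(completion_days) if n >= min_tasks]
    return sum(lengths) / len(lengths) if lengths else 0.0


# Hypothetical learner: three tasks on day 1, one on day 5, two on day 9
print(mean_massed_session_length([1, 1, 1, 5, 9, 9]))
# Hypothetical learner completing one task per day has no plateau
print(mean_massed_session_length([1, 2, 3]))
```

Under this reading, a learner who clears many tasks in one sitting produces long plateaus, whereas steady, sequential completion produces none.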

Cluster analysis identifies groups of learners at risk of low performance

We have also carried out an a posteriori evaluation of our behavioural clusters with respect to the performance of the learners. Figure 5a shows the mapping of the final average marks on the learner graph, where we have also highlighted high performing (>70%, top 15%, ‘Distinction’) and low performing (<60%, bottom 7.5%, below ‘Merit’) learners. Figure 5b shows that 6 out of 7 low-performance learners lie in the ‘Crammers’ cluster associated with massed learning and reduced task completion. There was a specific learner (77, cyan cluster) who attained a low grade and yet did not exhibit time-series behaviours indicative of a low performance. The high performers tend to be distributed across all other clusters, suggesting that the learning behaviours of a high performer are not as critical to their success. Still, 9 out of the 13 high-performing learners are found in the ‘Early Birds’ or ‘On time’ clusters characterised by a sequential approach to their learning with minimal massed learning sessions. The sporadic learners in single clusters (43 and 46) did not attain either a low performance or a distinctly high one.

Although our method captures information congruent with time-series statistical metrics (e.g., those shown in Fig. 4 related to massed learning and task completion rates), the data-driven clusters we obtain encompass global time-series information beyond such pre-determined standard statistical measures. To test this idea, we compared the results of our data-driven clusters to standard classification methods from Machine Learning based on statistical features. Figure 5c illustrates the classification map obtained by training two common machine-learning algorithms using the two statistical features in Fig. 4. The first learning algorithm is a support vector machine (SVM) using a radial basis function kernel and the second is a decision tree with a depth of 4 branches²⁹ (see Methods). For both methods, we find that the accuracy of learner classification against performance is low: only 3–4 out of 7 low-performance learners were accurately predicted. This result suggests that using a finite set of pre-determined time-series features reduces the information available to differentiate the necessary behaviours relevant to performance. In contrast, our graph construction and clustering methodology utilises the full content of the time-series (including attributes that are not evident from inspection of particular statistical metrics), thus providing a more comprehensive grouping of learners with similar temporal behaviours.

Testing the methodology on two additional data sets

We have applied the methodology to analyse task completion time-series data from a second cohort of 46 learners taking the online management course at Imperial College Business School. The results we obtain are similar, as shown in the multiscale clustering presented in Supplementary Fig. 2 and the detailed analysis of the 6-cluster partition in Fig. 6a. In this case, we identified a robust 9-cluster partition (with four major clusters and five single learner clusters) and a robust 6-cluster partition (with three major clusters and three single outliers). The major clusters in the 6-way partition (shown in Fig. 6a) showed similar behaviours to those observed in the first cohort we analysed. In particular, the green cluster in Fig. 6a corresponds to the ‘Early Birds’ and ‘On time’ groups in Fig. 3, whereas the blue cluster in Fig. 6a is similar to the task-skipping ‘Low Engagers’ group in Fig. 3, and the purple cluster in Fig. 6a exhibits similar traits to the ‘Crammers’ cluster in Fig. 3. Within this 6-cluster partition, we found that of the 8 low-performance learners, 4/8 were located in the massed learning cluster, 2/8 were sporadic outliers, and 1/8 was in the low engagement cluster. Only 1/8 was located in the distributed learning cluster. Moreover, using standard classification procedures in Supplementary Fig. 3 we found that our methodology was superior at grouping learners with similar performance. These findings highlight the consistency of the methodology across the cohorts, while remaining attuned to the particularities of the data.

The types of temporal engagement data collected from learners will differ across educators or institutions depending on the particularities of the Learning Management System. To test the methodology on a different kind of data, we have studied a set of 100 learners undertaking an anonymised course within the Open University (OULAD data set³⁰). The OULAD data set differs from our data set in several ways: (i) the time-stamp data in OULAD corresponds to page clicks and not necessarily to task completion; (ii) the time stamps were coarse-grained to days; (iii) pages could be revisited. The results of applying our methodology to the OULAD data set in Fig. 6b (and Supplementary Fig. 4 of the Supplementary Information) show that the multiscale clustering is robust to the sparsification implicit in the graph creation step. A robust 3-way partition is consistently found in our analysis, with two major clusters and a minor cluster of outliers. The two major clusters corresponded to a separation of learners who exhibited higher massed learning and lower task engagement versus learners with distributed learning. We found that 6/7 of the low-performance learners (<60%) were located in the cluster associated with massed learning, while one low-performance learner was located in the minor outlier cluster and none were in the distributed learning group.

Discussion

We have described an approach for the analysis of temporal data of online learning behaviours, in which distinct clusters of learners are obtained algorithmically without using a priori statistical information about individual behaviours or about the number or type of expected behaviours across the cohort. The mathematical framework is general, and can be applied broadly to any time-series data in physical or social sciences to identify distinct temporal behaviours. In the context of learning analytics, we showcased the method through three data sets of online learner activity of different types and origins.

Our method uses a dynamic time warping similarity kernel to generate a sparsified similarity graph between learners, to which we apply a multiscale graph partitioning algorithm in order to find optimised and robust clusters of learners with similar temporal behaviours at any level of resolution in an unsupervised manner. As our method uses the full time-series, it inherently encompasses richer temporal information than standard methods based on selecting statistical features of the time-series.

In the data sets analysed here, we obtained a quasi-hierarchy of robust partitions, from finer to coarser, which provide different levels of information, as required by the analyst. For instance, in our main case study in Fig. 3, we found robust partitions into 10, 6 and 2 clusters. The 6-way partition consists of four large learner clusters (‘Early Birds’, ‘On time’, ‘Low Engagers’ and ‘Crammers’) and two single unclustered learners (‘Sporadic outliers’), which were shown to be statistically different to each other according to the GPR Bayes factor (Eq. (12)). A posteriori comparison with learner performance indicates good correspondence with the obtained clustering: low performers are generally located (6 out of 7) in the ‘Crammers’ cluster (associated with massed learning and low task completion) and are generally absent from the ‘Early Birds’ and ‘On time’ clusters (associated with distributed learning and high task completion). On the other hand, high performers are distributed across several clusters, albeit with higher prevalence in the clusters associated with distributed learning. These results provide an improved characterisation as compared to common machine-learning classification algorithms trained on two statistical measures from the time-series. The analysis could be enhanced by the use of finer partitions (e.g., the 10-way partition has clusters with over-representation of low performers (purple cluster, hypergeometric p-value = 0.00023) and high performers (charcoal cluster, hypergeometric p-value = 0.026), as seen in Supplementary Fig. 1). Similar general behaviours and classifications are obtained for the two additional data sets presented in Fig. 6 and the Supplementary Information.
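The hypergeometric p-values quoted above measure over-representation: the probability of finding at least the observed number of low (or high) performers in a cluster if cluster membership were unrelated to performance. The sketch below computes this upper tail with the standard library only. The function name is hypothetical, and the cluster size used in the example is a made-up value, since cluster sizes are not listed in this section.

```python
from math import comb


def hypergeom_enrichment_pvalue(population, successes, draws, observed):
    """P(X >= observed) for X ~ Hypergeometric(population, successes, draws).

    population : total number of learners
    successes  : learners of the category of interest (e.g., low performers)
    draws      : size of the cluster being tested
    observed   : learners of that category found inside the cluster
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, x) * comb(population - successes, draws - x)
        for x in range(observed, upper + 1)
    )
    return tail / total


# 81 learners, 7 low performers, 6 of them in one cluster;
# the cluster size of 20 is a hypothetical value chosen for illustration only.
print(hypergeom_enrichment_pvalue(population=81, successes=7, draws=20, observed=6))
```

A small value indicates that the concentration of low performers in the cluster is unlikely to arise by chance allocation alone.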

The fact that low performers tend to concentrate in the massed learning cluster and be absent from the distributed learning clusters is in agreement with previous studies, which found that learners that ‘crammed’ retained less information when tested at a later date,³¹ and provides support for the risks associated with this behaviour. On the other hand, the fact that high performers are distributed across several clusters, albeit with higher prevalence in the clusters with high task completion and distributed learning, suggests they follow a host of diverse learning patterns, in agreement with a latent class model that suggested that the ‘spacing effect’ is less prominent for high performers.³² These observations were found to be consistent when testing our methodology on a second cohort of learners within the same institution and online degree, and broadly in agreement with a different type of data (‘page clicks’) from a set of online learners at the Open University, where we found a strong distinction between a low performing ‘massed learning’ cluster vs. a ‘distributed learning’ cluster.

Clearly, temporal behaviours do not fully account for learner performance, and this methodology is not intended as a diagnostic tool, but rather as a means to explore and identify learner engagement behaviours with the aim of supporting learners, guiding interventions and informing course design. Combining the temporal analysis introduced here with established ‘early warning system’ analyses³³ could aid in such tasks. Although educators might encourage learners to pursue a distributed study behaviour, our results suggest a nuanced approach for high performers, with flexibility provided in course design so that high-performing learners may pursue the study strategies they personally find effective.

Future work within different learning contexts, coupled with additional dependent variables of interest (e.g., learner satisfaction, career success, interruption and withdrawal rates), could be important to provide broader support for the initial results reported here. We remark that the methodology is scalable to larger data sets through adjustments of the computation of both the DTW kernel and the Markov Stability cost function (see Methods). Further improvements of the similarity kernel using constrained DTW³⁴ and end point invariance³⁵ could also be used to improve the sensitivity and accuracy of the method in representing the different temporal behaviours. Together with the study of how online behaviour changes over time for each learner, these directions constitute areas of further research.

Methods

The methods section describes the data and unsupervised mathematical pipeline used to analyse the trajectories of learners. The research was performed without any a priori knowledge or allocation of the learners, making it similar to a blind investigation.

Temporal data

The main case study of this research was based on task completion data from 81 post-experience learners pursuing a post-graduate part-time management degree at Imperial College London. These learners formed part of a cohort of 87 learners. Data from the remaining six learners were not included here as these learners either interrupted their studies or withdrew from the programme. Subjects ranged in age from 28 to 53 years old, with a gender balance of 57 males to 24 females, and they resided in 18 geographically disparate countries. The data corresponds to interactions with six online courses, which together comprised the first academic year of the 2-year degree programme. Although the subjects met face-to-face at the start of each academic year, the six courses were studied completely online. Subjects proceeded in a lock-step manner through the academic year, which was split into three 10-week terms each containing two of the six courses. The anticipated study load was 5 to 7 h per week for each course, so 10 to 12 h in total. The courses were assessed via a combination of coursework and exam; however, participation in these separate assessed activities was not included in the data set analysed here, and only the final 2-year grade was used as an indication of performance.

To highlight the applicability of the method, we also applied the analysis to two additional data sets: (i) time-stamped task completion series from a second cohort of 46 post-experience learners pursuing the post-graduate part-time management degree at Imperial College London; (ii) time-stamped data of ‘page-clicks’ (not equal to task completion) from 100 learners undertaking Open University courses (OULAD data set³⁰). For further details on these data sets, see the Supplementary Information.

Ethical approval from the Education Ethics Review Process (EERP) at Imperial College London was attained (EERP 1718-032b) and a waiver for informed consent was granted for this study.

Construction of the learner similarity graph using a dynamic time warping kernel and RMST sparsification

Creating a similarity matrix between learners using dynamic time warping

To compute the similarity between the task completion time traces of every two learners i and j, we use a similarity kernel, i.e., a generalised inner product. Common approaches for sequence analysis use L_p norms (when p = 2 we obtain the Euclidean norm), which are fast to compute and easy to index. However, their one-to-one matching often ignores sequential patterns that are non-linearly misaligned. Instead, our approach uses a dynamic time warping (DTW) kernel, which provides an elastic matching of two time sequences incorporating both the sequential ordering of the trajectory and the absolute values of time.¹⁷ The DTW similarity kernel is defined as:

$$k_l(x,y) = e^{ - D_l(x,y)/\sigma ^2},$$

(1)

where D_l denotes the DTW distance. The distance D_l is calculated by constructing an n × m matrix where n and m are the lengths of the two vectors we wish to compare. Using the pair-wise cost cost(x_i, y_j) = ||x_i − y_j||², we minimise the overall cost over the path from (i, j) = (1, 1) to (i, j) = (n, m), where each cell (i, j) along the path contributes cost(x_i, y_j) to the cumulative cost (summed over the path). This method is able to implicitly stretch both sequences to get a single dynamic time warping match between the two vectors, i.e., we find the cost required to match the time-series trajectories of the two learners. The higher the cost, the greater the distance in Hilbert space, and therefore the lower the similarity between learners.

For N learners we produce an N × N similarity matrix A where each element A_ij is the DTW similarity (1) between learners i and j. For longer time-series and for a larger number of learners N, where the DTW calculations may become computationally expensive, dimensionality reduction methods can be implemented to improve the speed of similarity calculations,³⁶ or segmented dynamic time warping algorithms with speeds comparable to Euclidean distances can be used.³⁷
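The DTW distance, the kernel (1) and the similarity matrix A described above can be sketched as follows. This is a minimal illustration and not the code used in the study: the function names, the default `sigma` and the standard step pattern (match, insertion, deletion) are assumptions, and each time series is taken to be a plain list of numbers.

```python
import math


def dtw_distance(x, y):
    """Cumulative DTW cost between two numeric sequences (squared pair-wise cost)."""
    n, m = len(x), len(y)
    inf = float("inf")
    # acc[i][j] is the minimal cumulative cost of aligning x[:i] with y[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # assumed step pattern: extend the path from one of three neighbours
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


def dtw_kernel(x, y, sigma=1.0):
    """DTW similarity kernel exp(-D(x, y) / sigma^2), as in Eq. (1)."""
    return math.exp(-dtw_distance(x, y) / sigma ** 2)


def similarity_matrix(series, sigma=1.0):
    """Symmetric N x N matrix of pair-wise DTW similarities."""
    n = len(series)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            A[i][j] = A[j][i] = dtw_kernel(series[i], series[j], sigma)
    return A
```

In this sketch two sequences that differ only by a repeated value, such as `[1, 2, 3]` and `[1, 2, 2, 3]`, have zero DTW distance and hence similarity 1, which illustrates the elastic matching that a one-to-one norm would miss.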

Creating a similarity graph using RMST sparsification

The similarity matrix A can be thought of as the adjacency matrix of a fully connected, weighted graph, where every learner is connected to every other learner in the network with a different strength given by their pair-wise similarity. The high redundancy present in this full similarity matrix both increases the computation time and reduces the effectiveness of many clustering algorithms. We therefore sparsify the similarity matrix to produce a similarity graph by reducing the number of edges present. To do this, we employ a pruning algorithm (the Relaxed Minimum Spanning Tree, or RMST), which is based on geometric graph heuristics that preserve edges based on both their strength and their relevance to long paths within the graph. RMST has been shown to balance the local and global structure of data sets and performs well under multiscale graph clustering methods.²¹,³⁸ Supplementary Fig. 4 shows that the community structure is relatively stable when the sparsification parameter of RMST is varied.
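The text above does not spell out the RMST rule, so the following sketch rests on an assumption: it uses the criterion as we understand it from refs. 21 and 38, where an edge (i, j) is kept if its distance is smaller than the largest edge on the minimum spanning tree path between i and j plus a local term `gamma * (d_k[i] + d_k[j])`, with `d_k` the distance to the k-th nearest neighbour. The names `rmst`, `gamma` and `k` are illustrative, and the input is assumed to be a symmetric NumPy distance matrix (for instance the DTW distances) rather than the similarity matrix itself.

```python
import numpy as np


def rmst(dist, gamma=0.5, k=1):
    """Boolean adjacency matrix of a relaxed minimum spanning tree (sketch)."""
    n = dist.shape[0]
    added = np.zeros(n, dtype=bool)
    added[0] = True
    in_tree = [0]
    parent = np.zeros(n, dtype=int)
    best = dist[0].astype(float)  # cheapest known link from the tree to each node
    best[0] = np.inf
    mst = np.zeros((n, n), dtype=bool)
    mlink = np.zeros((n, n))  # largest edge weight on the MST path between i and j

    # Prim's algorithm, updating mlink as each node joins the tree
    while len(in_tree) < n:
        j = int(np.argmin(best))
        p = parent[j]
        mst[p, j] = mst[j, p] = True
        for u in in_tree:
            mlink[u, j] = mlink[j, u] = max(mlink[u, p], dist[p, j])
        in_tree.append(j)
        added[j] = True
        best[j] = np.inf
        closer = ~added & (dist[j] < best)
        best[closer] = dist[j][closer]
        parent[closer] = j

    # distance from each node to its k-th nearest neighbour (index 0 is the node itself)
    d_k = np.sort(dist, axis=1)[:, k]
    keep = dist < mlink + gamma * (d_k[:, None] + d_k[None, :])
    np.fill_diagonal(keep, False)
    return keep | mst  # MST edges are always retained
```

Under this reading, `gamma = 0` returns the minimum spanning tree alone and larger values of `gamma` progressively restore edges, which is the kind of sparsification parameter sweep referred to in Supplementary Fig. 4.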

Visualisations and layouts of the similarity graphs for the different data sets were produced using Gephi with the Force Atlas setting.³⁹

Finding clusters of learners using Markov Stability graph partitioning

Community detection methods for graphs aim to partition the nodes of a graph into subgraphs (communities) that are well-connected within themselves and weakly connected to each other. There are multiple ways to define communities, and many methods and criteria to score the resulting partitions.⁴⁰ Such methods are also related to graph partitioning problems.

Markov Stability (MS) is a generalised method for identifying communities in graphs at all scales. MS employs a random walk on the graph to define a time-dependent cost function that measures the probability that a random walker is contained within a subgraph over a time scale t. If the random walker becomes trapped in particular subgraphs over that particular timescale, this identifies a good partition. As the time scale of the Markov process increases, the method identifies larger subgraphs leading to coarser partitions. Hence MS has the ability to identify intrinsically relevant communities at all scales by using the dynamic scanning provided by the diffusive process. For a detailed description of the method, see refs. 22 and 23.

The random walk is governed by the N × N transition matrix Q = D⁻¹ A, where N is the number of nodes in the graph, A is the adjacency matrix, and D = diag(A1) is the degree matrix where 1 is a vector of ones. Q defines the probability of the random walk transitioning from node i to node j, as given by the discrete-time process:

$${\mathbf{p}}_{t + 1} = {\mathbf{p}}_t\,Q,$$

(2)

where p_t is a 1 × N node vector describing the probability of the random walker to be at each node at time t. An associated continuous-time diffusive process in terms of the graph combinatorial Laplacian L = D − A has the time-dependent solution:

$${\mathbf{p}}(t) = {\mathbf{p}}(0)\,e^{ - tL}.$$

(3)

The time t is denoted the Markov time and is distinct from any real time. Markov time can be understood as a dimensionless quantity related to the diffusive process, which acts as a resolution parameter in that it allows for the exploration of the graph at different scales: as the Markov time increases, the partitions become coarser.
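As an illustration of Eqs. (2) and (3), the sketch below builds Q = D⁻¹A, takes one discrete step, and evaluates the continuous-time solution. It assumes a symmetric adjacency matrix stored as a NumPy array, so that the matrix exponential of the combinatorial Laplacian can be obtained from its eigendecomposition; the function names are illustrative only.

```python
import numpy as np


def transition_matrix(A):
    """Row-normalised transition matrix Q = D^{-1} A."""
    degrees = A.sum(axis=1)
    return A / degrees[:, None]


def discrete_step(p, Q):
    """One step of the discrete-time walk, p_{t+1} = p_t Q."""
    return p @ Q


def continuous_solution(p0, A, t):
    """p(t) = p(0) exp(-t L) with L = D - A, for symmetric A."""
    L = np.diag(A.sum(axis=1)) - A
    w, V = np.linalg.eigh(L)  # L = V diag(w) V^T
    return p0 @ (V * np.exp(-t * w)) @ V.T
```

For a two-node graph with a single edge, a walker starting on the first node jumps to the second node in one discrete step, whereas the continuous-time solution relaxes smoothly towards the uniform distribution as the Markov time grows.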

A partition of the graph into c communities is encoded into an N × c membership matrix establishing the correspondence between the nodes and the clusters:

$$H_{ic} = \left\{ {\begin{array}{*{20}{l}} 1 \hfill & {{\mathrm{if}}\,{\mathrm{node}}\,i\,{\mathrm{belongs}}\,{\mathrm{to}}\,{\mathrm{community}}\,c} \hfill \\ 0 \hfill & {{\mathrm{otherwise}}} \hfill \end{array}} \right.$$

(4)

The goodness of the partition encoded by H at time t under the dynamics governed by L is defined in terms of the c × c block auto-covariance matrix:

$$R(t;H) = H^T({\mathrm{{\Pi}}}e^{ - tL} - {\mathrm{\pi }}^T{\mathrm{\pi }})H,$$

(5)

where π is the stationary solution of (3) and Π = diag(π). The meaning of this matrix is clear: the element of the matrix [R(t; H)]_αβ encodes the probability that a random walker starting in community α will be at community β after time t, and the diagonal elements, [R(t, H)]_αα, indicate the probability of remaining contained in community α over time scale t. Hence a good partition H will maximise the sum of the diagonal elements, i.e., the trace of R(t, H). This leads us to our definition of the cost function, the Markov Stability of the partition:

$$r(t,H) = \mathop {{\min }}\limits_{\tau < t} \,{\mathrm{Tr}}\,\left[ {R(\tau ,H)} \right],$$

(6)

which is to be maximised at every time t by searching in the space of partitions H:

$$r^ \ast (t) = \mathop {{\max }}\limits_H r(t,H)\,{\mathrm{and}}\,H^ \ast (t) = \mathop {{\arg \max }}\limits_H r(t,H).$$

(7)
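The quantities in Eqs. (5) and (6) can be evaluated directly for a small graph, as in the sketch below. It assumes a connected, undirected graph, so that the stationary distribution of the Laplacian dynamics (3) is uniform, and it approximates the minimum over τ by a grid of `n_grid` values between 0 and t. Function names and the grid are assumptions; the search over partitions in Eq. (7) is not attempted here.

```python
import numpy as np


def heat_kernel(A, t):
    """exp(-t L) for the combinatorial Laplacian L = D - A of a symmetric A."""
    L = np.diag(A.sum(axis=1)) - A
    w, V = np.linalg.eigh(L)
    return (V * np.exp(-t * w)) @ V.T


def autocovariance(A, labels, t):
    """Block auto-covariance R(t; H) of Eq. (5) for a list of community labels."""
    n = A.shape[0]
    communities = sorted(set(labels))
    # membership matrix H of Eq. (4)
    H = np.array([[1.0 if lab == c else 0.0 for c in communities] for lab in labels])
    pi = np.full(n, 1.0 / n)  # uniform stationary distribution (assumes connectivity)
    return H.T @ (np.diag(pi) @ heat_kernel(A, t) - np.outer(pi, pi)) @ H


def markov_stability(A, labels, t, n_grid=20):
    """Eq. (6): minimum of Tr R(tau; H) over a grid of tau in [0, t]."""
    taus = np.linspace(0.0, t, n_grid)
    return min(np.trace(autocovariance(A, labels, tau)) for tau in taus)
```

On a toy graph made of two triangles joined by one edge, the partition that separates the two triangles scores a higher Markov Stability at t = 1 than a partition that mixes them, in line with the interpretation of the diagonal of R(t; H) as the probability of remaining within a community.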

Owing to the optimisation (7) being non-convex and NP-hard, we use an efficient greedy algorithm known as the Louvain algorithm,⁴¹ which has been shown to perform well in practice and against benchmarks. Given its greedy nature, the optimised partition found by Louvain is not always the same, as it depends on the initialisation of the optimisation algorithm. Therefore, we repeat the optimisation $\ell = 100$ times using different starting points for the algorithm. For each Markov time we thus obtain 100 optimised partitions $H_i^ \ast (t)$ and we pick the one with maximal Markov Stability (6) in the set as the optimal partition at t:

$$\widehat H(t) = \mathop {{\arg \max }}\limits_{H_i^ \ast (t),\,i = 1, \ldots ,\ell } r(t,H_i^ \ast (t)).$$

To identify the important partitions across time, we use the following two robustness criteria:²³

Consistency of the optimised partition

A relevant partition should be a robust outcome of the optimisation, i.e., the ensemble of $\ell$ optimised solutions should be similar. To assess this consistency, we employ an information-theoretical distance between partitions: the normalised variation of information between two partitions H and H′, defined as:⁴²

$$VI(H,H\prime ) = \frac{{2\,{\mathrm{\Omega }}(H,H\prime ) - {\mathrm{\Omega }}(H) - {\mathrm{\Omega }}(H\prime )}}{{{\mathrm{log}}(N)}},$$

(8)

where ${\mathrm{\Omega }}(H) = - \mathop {\sum}\nolimits_{\cal{C}} p ({\cal{C}}){\mathrm{log}}\,p({\cal{C}})$ is a Shannon entropy, with $p({\cal{C}})$ given by the relative frequency of finding a node in community ${\cal{C}}$ in the partition H, and Ω(H, H′) is the Shannon entropy of the joint probability. The variation of information VI(H, H′) ∈ [0, 1] is a true metric distance between two partitions based on information theory and VI(H, H′) = 0 indicates that two partitions are identical.

A measure of the robustness to the optimisation, at a given Markov time t, is given by the average variation of information of the ensemble of solutions obtained from the $\ell$ Louvain runs:

$$VI(t) = \frac{1}{{\ell (\ell - 1)}}\mathop {\sum}\limits_{i \ne j} {VI} (H_i^ \ast (t),H_j^ \ast (t)).$$

(9)

If all runs of the optimisation return similar partitions, then VI(t) is small, indicating robustness of the partition to the optimisation. Hence, we select partitions with low values (or dips) of VI(t).
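Equations (8) and (9) can be sketched in a few lines. Here each partition is represented as a list of community labels, one per node, which is an assumed encoding equivalent to the membership matrix H; the function names are illustrative.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(counts, n):
    """Shannon entropy of relative frequencies counts / n."""
    return -sum((c / n) * math.log(c / n) for c in counts)


def variation_of_information(a, b):
    """Normalised variation of information of Eq. (8) between two label lists."""
    n = len(a)
    joint = entropy(Counter(zip(a, b)).values(), n)
    vi = 2 * joint - entropy(Counter(a).values(), n) - entropy(Counter(b).values(), n)
    return vi / math.log(n)


def mean_pairwise_vi(partitions):
    """Average VI over all pairs of partitions in the ensemble, as in Eq. (9)."""
    pairs = list(combinations(partitions, 2))
    return sum(variation_of_information(a, b) for a, b in pairs) / len(pairs)
```

Because VI is symmetric, averaging over unordered pairs gives the same value as the sum over i ≠ j divided by ℓ(ℓ − 1). Identical partitions give VI = 0, so an ensemble of Louvain runs that all agree yields VI(t) = 0.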

Persistence of the partition across levels of resolution

Relevant partitions should also be optimal across stretches of Markov time. Such persistence is indicated both by a plateau in the number of communities over t and a low value plateau of the cross-time variation of information:

$$VI(t,t^{\prime} ) = VI(\widehat H(t),\widehat H(t^{\prime} )).$$

(10)

This provides a second measure of robustness of a partition across resolution scales, and is commonly visualised via a heatmap where blocks along the diagonal indicate partitions that are persistent. Within a time-block of persistent partitions we choose the most robust partition, i.e., with lowest VI(t).

Markov Stability code is available at github.com/michaelschaub/PartitionStability. When the computation of the matrix exponential in (5) becomes costly for large N, the linearisation of e^(−tL) provides an efficient approximate method to analyse very large graphs within the same framework.

Isotonic regression

An isotonic regression is a model that identifies the optimal least squares fit to a data set given the constraint that the model must be a non-decreasing function. The optimisation is:

$$\mathop {{\arg \min }}\limits_x \left| {y - x} \right|^2,$$

(11)

where x_i must be greater than or equal to x_{i−1}, i.e., x₀ ≤ x₁ ≤ ... ≤ x_n. The algorithm looks for violations of monotonicity and adjusts the estimate to fit within the constraints.
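One standard way to solve (11) is the pool-adjacent-violators algorithm, sketched below in plain Python. The source does not say which solver was used, so this is offered as an illustration of the idea of detecting monotonicity violations and averaging them away; `isotonic_fit` is a hypothetical name.

```python
def isotonic_fit(y):
    """Non-decreasing least-squares fit to y via pool adjacent violators."""
    blocks = []  # each block stores [sum of values, number of values]
    for value in y:
        blocks.append([float(value), 1])
        # merge backwards while the previous block mean exceeds the last block mean
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit = []
    for total, count in blocks:
        fit.extend([total / count] * count)
    return fit
```

For example, in the sequence 1, 3, 2, 4 the pair (3, 2) violates monotonicity and is replaced by its mean, giving the fit 1, 2.5, 2.5, 4.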

Gaussian process regression

The Gaussian process regression (GPR) was implemented using the sklearn Python package. The implementation is based on the Algorithm 2.1 of Gaussian processes for machine learning (GPML) by Rasmussen and Nickisch.²⁸

A GPR model can be thought of as defining a distribution over functions, with inference taking place directly in the space of functions. As such, a mean and variance that model the data can be calculated. Given that the GPR is probabilistic, we can calculate the log-likelihood of any set of trajectories being derived from a GPR optimised on another set of trajectories. Bayes factors are a method of Bayesian model comparison, which quantify the support for one model over another. The Bayes factor K for two models M₁ and M₂ given some data D is:

$$K = \frac{{Pr(M_1|D)}}{{Pr(M_2|D)}}\frac{{Pr(M_2)}}{{Pr(M_1)}}$$

(12)
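To connect (12) with the log-likelihoods mentioned above, Bayes' theorem can be applied to each posterior, Pr(M_k|D) = Pr(D|M_k) Pr(M_k)/Pr(D). The priors and Pr(D) then cancel, so the Bayes factor reduces to a ratio of likelihoods:

$$K = \frac{{Pr(D|M_1)\,Pr(M_1)}}{{Pr(D|M_2)\,Pr(M_2)}}\frac{{Pr(M_2)}}{{Pr(M_1)}} = \frac{{Pr(D|M_1)}}{{Pr(D|M_2)}},\quad \log K = \log Pr(D|M_1) - \log Pr(D|M_2).$$

On this reading, and assuming each model M_k is a GPR optimised on one set of trajectories, log K would be the difference between the log-likelihoods that the two GPRs assign to the same trajectories D.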

Additional classification algorithms

To classify learners into high, medium and low-performance groups, we used an SVM and a Decision Tree. Both algorithms are commonly used in classification tasks and were implemented using the scikit-learn Python package.²⁹

An SVM acts as a non-probabilistic binary linear classifier that attempts to find a hyperplane in a high or infinite dimensional space that maximises the distances between data points of differing classes. We implemented the SVM with the radial basis function kernel.
The Decision Tree attempts to find optimal branches (decisions) that represent conjunctions of features that lead to accurate prediction of class labels. We implemented a Decision Tree with a depth of four branches; increasing the number of branches did not improve the classification accuracy.
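A configuration along these lines can be sketched with scikit-learn as follows. The feature matrix here is synthetic and purely a stand-in for the temporal statistics of the learners, which are not released; reading "a depth of four branches" as `max_depth=4` is also an assumption, as are all variable names.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: 60 learners, 3 temporal features, 3 performance classes
labels = np.repeat(["low", "medium", "high"], 20)
centres = {"low": 0.0, "medium": 2.0, "high": 4.0}
features = np.array([rng.normal(centres[lab], 1.0, size=3) for lab in labels])

# SVM with a radial basis function kernel
svm = SVC(kernel="rbf").fit(features, labels)

# Decision Tree limited to a depth of four (assumed meaning of "four branches")
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(features, labels)

svm_pred = svm.predict(features)
tree_pred = tree.predict(features)
```

In practice the predictions would be evaluated on held-out learners rather than on the training data used here for brevity.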

Instead of using regression analysis between continuous dependent variables (performance) and independent variables (temporal features), we implemented classification algorithms to provide a closer comparison to our clustering results.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

To maintain anonymity of the learners that took part in this study we have not released the data.

Code availability

In accordance with the code policy at Science of Learning we have provided links to the necessary functions required for the mathematical framework detailed in this manuscript:
• Clustering algorithm (Markov Stability): https://wwwf.imperial.ac.uk/mpbara/Partition_Stability/ and https://github.com/michaelschaub/PartitionStability.
• Dynamic time warping: https://github.com/pierre-rouanet/dtw.

References

1. van Bruggen, J. Theory and practice of online learning. Br. J. Educ. Technol. 36, 111–120 (2005).
2. Lodge, J. M. & Corrin, L. What data and analytics can and do say about effective learning. npj Sci. Learn. 2, 5 (2017).
3. Mahzoon, M. J., Maher, M. L., Eltayeby, O. & Dou, W. A sequence data model for analyzing temporal patterns of student data. J. Learn. Anal. 5, 55–74 (2018).
4. Papamitsiou, Z. & Economides, A. A. Temporal learning analytics for adaptive assessment. J. Learn. Anal. 1, 165–168 (2014).
5. Ye, C. & Biswas, G. Early prediction of student dropout and performance in MOOCs using higher granularity temporal information. J. Learn. Anal. 1, 169–172 (2014).
6. Ye, C. et al. Behavior prediction in MOOCs using higher granularity temporal information. In Proc Second ACM Conference on Learning @ Scale - L@S ’15, 335–338 (ACM, New York, NY, 2015).
7. Taylor, C., Veeramachaneni, K. & O’Reilly, U. Likely to stop? Predicting stopout in massive open online courses. Preprint at http://arxiv.org/abs/1408.3382 (2014).
8. Jiang, S., Williams, A. E., Schenke, K., Warschauer, M. & Dowd, D. O. Predicting MOOC performance with week 1 behavior. In Proc 7th International Conference on Educational Data Mining, 273–275 (EDM, 2014).
9. Antonenko, P. D., Toy, S. & Niederhauser, D. S. Using cluster analysis for data mining in educational technology research. Educ. Technol. Res. Dev. 60, 383–398 (2012).
10. Knight, S., Friend Wise, A. & Chen, B. Time for change: why learning analytics needs temporal analysis. J. Learn. Anal. 4, 7–17 (2017).
11. Gerbier, E. & Toppino, T. C. The effect of distributed practice: neuroscience, cognition, and education. Trends Neurosci. Educ. 4, 49–59 (2015).
12. Ebbinghaus, H. Memory: a contribution to experimental psychology. Ann. Neurosci. 20, 155 (2013).
13. Toppino, T. C. & Gerbier, E. About practice: repetition, spacing, and abstraction. Psychol. Learn. Motiv. 60, 113–189 (2014).
14. Donovan, J. J. & Radosevich, D. J. A meta-analytic review of the distribution of practice effect: now you see it, now you don’t. J. Appl. Psychol. 84, 795–805 (1999).
15. Carroll, P. & White, A. Identifying patterns of learner behaviour: what business statistics students do with learning resources. INFORMS Trans. Educ. 18, 1–13 (2017).
16. Lee, A. V. Y. & Tan, S. C. Promising ideas for collective advancement of communal knowledge using temporal analytics and cluster analysis. J. Learn. Anal. 4, 76–101 (2017).
17. Berndt, D. & Clifford, J. Using dynamic time warping to find patterns in time series. Workshop Knowl. Knowl. Discov. Databases 398, 359–370 (1994).
18. Serrà, J. & Arcos, J. L. An empirical evaluation of similarity measures for time series classification. Knowl.-Based Syst. 67, 305–314 (2014).
19. Mueen, A. & Keogh, E. Extracting optimal performance from dynamic time warping. In Proc 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2129–2130 (ACM, New York, 2016).
20. Wang, X. et al. Experimental comparison of representation methods and distance measures for time series data. Data Min. Knowl. Discov. 26, 275–309 (2013).
21. Beguerisse-Diaz, M., Vangelov, B. & Barahona, M. Finding role communities in directed networks using Role-Based Similarity, Markov Stability and the Relaxed Minimum Spanning Tree. In Proc 2013 IEEE Global Conference on Signal and Information Processing, 937–940 (IEEE, New York, 2013).
22. Delvenne, J. -C., Yaliraki, S. N. & Barahona, M. Stability of graph communities across time scales. Proc. Natl Acad. Sci. 107, 12755–12760 (2010).
23. Lambiotte, R., Delvenne, J. C. & Barahona, M. Random walks, Markov processes and the multiscale modular organization of complex networks. IEEE Trans. Netw. Sci. Eng. 1, 76–90 (2014).
24. Rodrigues, P. P., Gama, J. & Pedroso, J. P. Hierarchical clustering of time-series data streams. IEEE Trans. Knowl. Data Eng. 20, 615–627 (2008).
25. Fenn, D. J. et al. Dynamic communities in multichannel data: an application to the foreign exchange market during the 2007-2008 credit crisis. Chaos 19, 033119 (2009).
26. Ando, T. & Bai, J. Clustering huge number of financial time series: a panel data approach with high-dimensional predictors and factor structures. J. Am. Stat. Assoc. 112, 1182–1198 (2017).
27. Hoffmann, T., Peel, L., Lambiotte, R. & Jones, N. S. Community detection in networks with unobserved edges. Preprint at https://arxiv.org/abs/1808.06079 (2018).
28. Rasmussen, C. & Nickisch, H. Gaussian processes for machine learning (GPML) toolbox. J. Mach. Learn. Res. 11, 3011–3015 (2010).
29. Pedregosa, F. et al. Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
30. Kuzilek, J., Hlosta, M. & Zdrahal, Z. Data descriptor: open university learning analytics dataset. Sci. Data 4, 1–8 (2017).
31. Bloom, K. C. & Shuell, T. J. Effects of massed and distributed practice on the learning and retention of second-language vocabulary. J. Educ. Res. 74, 245–248 (1981).
32. Verkoeijen, P. P. J. L. & Bouwmeester, S. Using latent class modeling to detect bimodality in spacing effect data. J. Mem. Lang. 59, 545–555 (2008).
33. Beck, H. P. & Davidson, W. D. Establishing an early warning system: predicting low grades in college students from survey of academic orientations scores. Res. High. Educ. 42, 709–723 (2001).
34. Ratanamahatana, C. A. & Keogh, E. Making time-series classification more accurate using learned constraints. In Proc 2004 SIAM international conference on data mining, 11–12 (SIAM, 2004).
35. Silva, D. F., Batista, G. E. A. P. A. & Keogh, E. On the effect of endpoints on dynamic time warping. In SIGKDD Workshop on Mining and Learning from Time Series II, San Francisco, CA. Association for Computing Machinery-ACM (ACM, New York, 2016).
36. Keogh, E., Chakrabarti, K., Pazzani, M. & Mehrotra, S. Dimensionality reduction for fast similarity search in large time series databases. Knowl. Inf. Syst. 3, 263–286 (2006).
37. Keogh, E. J. & Pazzani, M. J. Scaling up dynamic time warping to massive datasets. In European Conference on Principles of Data Mining and Knowledge Discovery, 1–11 (Springer, Berlin, Heidelberg, 2010).
38. Beguerisse-Diaz, M., Garduño-Hernández, G., Vangelov, B., Yaliraki, S. N. & Barahona, M. Interest communities and flow roles in directed networks: the Twitter network of the UK riots. J. R. Soc. Interface 11, 20140940 (2014).
39. Bastian, M., Heymann, S. & Jacomy, M. Gephi: an open source software for exploring and manipulating networks. In Third International AAAI Conference on Weblogs and Social Media, 361–362 (AAAI, Palo Alto, CA, 2009).
40. Fortunato, S. Community detection in graphs. Phys. Rep. 486, 75–174 (2010).
41. Blondel, V. D., Guillaume, J. -L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech.: Theory Exp. 2008, P10008 (2008).
42. Meila, M. Comparing clusterings by the variation of information. In Learning Theory and Kernel Machines, 173–187 (Springer, Berlin, Heidelberg, 2003).


Acknowledgements

We would like to thank Dr. Nai Li, Marc Wells, Gavin Symonds, Samuel McGarry, and Phil Tulip for assistance with data collection and interpretation. We would also like to thank Prof Alan Spivey for helping promote the project and attain funding from Imperial College London. We would like to thank Dr. Iro Ntonia and Prof. Martyn Kingsbury for their insightful suggestions and advice on ethical procedures. This research has been funded by a President’s Excellence Award from Imperial College London. M.B. and S.N.Y. acknowledge support from EPSRC award EP/N014529/1 funding the EPSRC Centre for Mathematics of Precision Healthcare at Imperial.

Author information

Authors and Affiliations

Department of Mathematics, Imperial College London, London, SW7 2AZ, UK
Robert L. Peach & Mauricio Barahona
Imperial College Business School, Imperial College London, London, SW7 2AZ, UK
Robert L. Peach & David Lefevre
Department of Chemistry, Imperial College London, London, SW7 2AZ, UK
Sophia N. Yaliraki

Authors

Robert L. Peach
Sophia N. Yaliraki
David Lefevre
Mauricio Barahona

Contributions

R.L.P. and M.B. designed the mathematical framework and data analytics pipeline. R.L.P. built and coded the Python toolbox for data collection, data cleaning and data analysis, and carried out the computational analysis. M.B., S.N.Y. and D.L. supervised the project design and data analytics research. All authors contributed to writing the manuscript.

Corresponding author

Correspondence to Mauricio Barahona.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Peach, R.L., Yaliraki, S.N., Lefevre, D. et al. Data-driven unsupervised clustering of online learner behaviour. npj Sci. Learn. 4, 14 (2019). https://doi.org/10.1038/s41539-019-0054-0


Received: 14 January 2019
Accepted: 17 July 2019
Published: 03 September 2019
DOI: https://doi.org/10.1038/s41539-019-0054-0
