Optimized collusion prevention for online exams during social distancing

Online education is important in the COVID-19 pandemic, but online exams taken at individual homes invite students to cheat in various ways, especially collusion. While physical proctoring is impossible during social distancing, online proctoring is costly, compromises privacy, and can still leave collusion prevalent. Here we develop an optimization-based anti-collusion approach for distanced online testing (DOT) that minimizes the collusion gain and can be coupled with other cheating-prevention techniques. Given prior knowledge of student competences, our DOT technology optimizes sequences of questions and assigns them to students in synchronized time slots, reducing the collusion gain by 2–3 orders of magnitude relative to a conventional exam in which all students receive the same questions simultaneously. Our DOT theory allows the collusion gain to be controlled at a sufficiently low level. Our recent final exam in the DOT format was successful, as evidenced by statistical tests and a post-exam survey.

Supplementary Figure 1: Consecutive sequences and a reduced test length M_1 help reduce the maximum question leakage. A simple example of 4 questions from a pool of 6 questions properly assigned to 6 students to suppress the collusion gain via grouping: c, a primitive assignment of four questions (in black) to six students via circular shifting, yielding a maximum individual collusion gain of 48.75% and a worst-case average collusion gain of ~14.6%; d, an assignment following the general grouping-based anti-collusion scheme, reducing the maximum individual collusion gain to 10% and the worst-case average collusion gain to ~3%. Note that in d, the students are grouped by similarity in their competence levels to bound the maximum collusion gain.

Supplementary Table 1: Notations.

Notation | Definition
M_2 | The number of MCQs in the question pool.
M_1 | The number of questions asked of each student.
Q | The number of choices for each MCQ, of which only one is correct.
Y | The competence profile of the students; y_i ∈ [1/Q, 1] (i ∈ [N]) is the probability that student i correctly answers a question. The entries of Y are ranked in descending order.
P | The colluding matrix; p_{j,i} (i, j ∈ [N]) is the probability that student i actively cheats from student j if i ≠ j (the first index indicates the source of the answers, the second the destination). p_{j,i} = 0 if j > i by assumption (1), and p_{i,i} = 1 − Σ_{j=1}^{i−1} p_{j,i}.
A | The assignment A = {a_i ∈ P_SQ | i = 1, 2, ..., N}, a set of SQs from P_SQ depicting the mapping from the students [N] to the permutation pool P_SQ; a_i is the SQ assigned to student i.
P_SQ | All possible M_1-length SQs formed by permutations of the M_2 MCQs, also referred to as the permutation pool.
n | The size of the permutation pool P_SQ, n = M_2!/(M_2 − M_1)!.
q_i(A) | The expected score of student i ∈ [N] with collusion under assignment A, referred to as the cheating score for short. Note that q_i(A) also depends on how P is defined.
q_i^* | The expected honest score of student i ∈ [N] without collusion (independent of the assignment A), referred to as the honest score for short.
Z | The positional matrix; z_{j,i} (i, j ∈ [N]) is the number of questions that student i can copy from student j if j ≠ i, with the special case z_{i,i} defined as M_1. Note that Z is calculated directly from the assignment A and is independent of P and Y.
D | The competence difference matrix D = (d_{j,i})_{i,j∈[N]} with d_{j,i} = max(y_j − y_i, 0), introduced for ease of expressing other variables.
g_i | The expected collusion gain of student i under assignment A, defined as the difference between his/her cheating score and honest score, g_i = q_i(A) − q_i^*.
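To make these definitions concrete, the following Python sketch (our own code, not the paper's) computes F_z and the collusion gains g_i under one plausible instantiation of the score model, in which the per-question gain from source j is p_{j,i} (z_{j,i}/M_1) d_{j,i}; copying within the same synchronized slot is assumed possible, consistent with z_{i,i} = M_1:

```python
def f_z(src, dst):
    """F_z(src, dst): number of questions in dst that also appear in src
    at the same or an earlier synchronized time slot (copyable questions).
    Same-slot copying is assumed allowed, so f_z(s, s) = M1."""
    pos = {q: t for t, q in enumerate(src)}
    return sum(1 for t, q in enumerate(dst) if q in pos and pos[q] <= t)

def collusion_gains(A, Y, P):
    """Expected collusion gain g_i for every student under assignment A,
    competence profile Y (descending) and colluding matrix P, assuming
    the gain model g_i = sum_{j != i} p_{j,i} * (z_{j,i}/M1) * d_{j,i}."""
    N, M1 = len(A), len(A[0])
    gains = []
    for i in range(N):
        g = 0.0
        for j in range(N):
            if j != i:
                d = max(Y[j] - Y[i], 0.0)               # d_{j,i}
                g += P[j][i] * (f_z(A[j], A[i]) / M1) * d
        gains.append(g)
    return gains
```

For example, two students sharing the identical sequence leak all M_1 questions (z = M_1), so the weaker student's gain is the full competence difference.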
Supplementary Table 2: Summary of Metrics.
Metric | Definition
g | The average collusion gain, defined as the average of the expected collusion gains g_i over all students [N]; this is the main metric that our optimizations minimize.
g_W | The worst-case average collusion gain, defined as the average collusion gain g in the situation where every student achieves his/her maximum possible collusion gain (the maximum possible collusion gain of student i is achieved by setting to 1 the probability that i cheats from the student j from whom i obtains the largest gain among all choices).
g_MI | The maximum individual collusion gain, the maximum over all students of their maximum possible collusion gains, max_i [max_P (g_i)].
Note: g_W and g_MI are independent of P and can be used to revisit the optimized results for worst-case analysis. g_W can be treated as a reliable upper-bound estimate of the collusion gain under the given competence profile Y. g_MI can be used to assess the fairness of the exam in terms of the maximum collusion gain any student can achieve.

Proof of Collusion Control Theorem
Theorem 1. Given sequences of M_1 questions drawn from a bank of M_2 MCQs, each with only one correct choice out of Q, the maximum individual collusion gain is no more than (1 − 1/Q)/(M_2 − M_1 + 1) under the grouping-based anti-collusion scheme.
Proof. For any two question sequences s_1 and s_2 of the same length, let F_z(s_1, s_2) denote the number of questions that can be copied from s_1 to s_2. Let us denote the M_2 questions by {1, 2, 3, ..., M_2}.
Following the grouping-based anti-collusion scheme, (1) by circular shifting we can easily create the M_2 − M_1 + 1 consecutive sequences s_k = (k, k+1, ..., k+M_1−1), k = 1, ..., M_2 − M_1 + 1. It is easy to check that for any pair s_i and s_j among them, F_z(s_i, s_j) = 0 if i < j. (2) Let us divide the possible student competence range [1/Q, 1] into M_2 − M_1 + 1 intervals, each of length (1 − 1/Q)/(M_2 − M_1 + 1), as shown in Supplementary Figure 2. We then group the students whose competences fall in the same interval, obtaining M_2 − M_1 + 1 groups, and assign the sequences s_1, s_2, ..., s_{M_2−M_1+1} to the groups ranked in descending order of competence. By doing so, we achieve two goals: first, there is no collusion gain between groups, ensured by (1); second, the maximum individual gain inside a group cannot exceed the interval length (1 − 1/Q)/(M_2 − M_1 + 1), ensured by (2), regardless of how many students are in the group. Hence, we have proved the theorem.
Supplementary Figure 2: Illustration of the collusion control theorem, which controls the maximum individual collusion gain below any desired level. Dividing the competence range into C intervals and grouping students accordingly bounds the maximum individual gain by the interval length (1 − 1/Q)/C, where C = M_2 − M_1 + 1.
Students inside one interval receive the same sequence.

Algorithm 1: Grouping-based Assignment Scheme (GAS)
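The grouping step can be sketched in a few lines of Python (our illustration of the GAS idea, with hypothetical function names, not the paper's exact Algorithm 1): build the C = M_2 − M_1 + 1 shifted sequences, split the competence range [1/Q, 1] into C equal intervals, and give the k-th sequence to the k-th interval counted from the top.

```python
def gas_assignment(Y, M1, M2, Q):
    """Grouping-based assignment sketch: students whose competences fall
    in the k-th interval (from the top) of [1/Q, 1] all receive the k-th
    consecutive (shifted) sequence; the interval length L is the proven
    bound on the maximum individual collusion gain (Theorem 1)."""
    C = M2 - M1 + 1
    seqs = [tuple(range(k, k + M1)) for k in range(1, C + 1)]  # s_k = (k,...,k+M1-1)
    L = (1 - 1 / Q) / C                  # interval length = gain bound
    A = []
    for y in Y:                          # Y sorted in descending order
        k = min(int((1.0 - y) // L), C - 1)   # index of y's competence interval
        A.append(seqs[k])
    return A, L
```

For M_2 = 6, M_1 = 4, Q = 4 this yields C = 3 groups and a gain bound of (1 − 1/4)/3 = 0.25, matching the theorem.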

Algorithm 2: Cyclic Greedy Searching (CGS)
Building on Algorithm 1, we propose CGS (Algorithm 2) to search P_CS and greedily improve upon the assignment computed by Algorithm 1. Algorithm 2 proceeds in two phases. In Phase 1, we use the result A_0 from Algorithm 1 as our preferred initialization and, together with other reasonable/random initializations, keep the best among the respectively optimized results. In Phase 2, in each of N rounds, a student is selected in order of competence from high to low, and the sequence in P_CS that minimizes the average collusion gain is assigned to that student (the assignment is updated only if the update reduces the current average collusion gain, which ensures convergence). For any assignment A, we use (s, a_{−i}) to denote the assignment in which student i's sequence a_i is replaced with a sequence s ∈ P_CS. The steps in Phase 2 are repeated for at most N_rep rounds or until a local minimum is reached.
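Phase 2 can be sketched as follows (our simplified rendering, reusing the illustrative gain model from earlier in this supplement; the strict-improvement rule guarantees the average gain is monotonically non-increasing):

```python
def f_z(src, dst):
    """Copyable questions from src to dst (same-or-earlier slot)."""
    pos = {q: t for t, q in enumerate(src)}
    return sum(1 for t, q in enumerate(dst) if q in pos and pos[q] <= t)

def avg_gain(A, Y, P):
    """Average collusion gain g under the illustrative gain model."""
    N, M1 = len(A), len(A[0])
    total = 0.0
    for i in range(N):
        for j in range(N):
            if j != i:
                total += P[j][i] * (f_z(A[j], A[i]) / M1) * max(Y[j] - Y[i], 0.0)
    return total / N

def cyclic_pool(M1, M2):
    """P_CS: the M2 circularly shifted M1-length windows over 1..M2."""
    qs = list(range(1, M2 + 1))
    return [tuple(qs[(k + t) % M2] for t in range(M1)) for k in range(M2)]

def cgs(A0, Y, P, M2, n_rep=10):
    """Phase-2 sketch of CGS: sweep students from high to low competence,
    replacing a_i with the cyclic-pool sequence minimizing avg_gain;
    accept only strict improvements, so the gain is monotone."""
    pool = cyclic_pool(len(A0[0]), M2)
    A = list(A0)
    for _ in range(n_rep):
        improved = False
        for i in range(len(A)):
            best = min(pool, key=lambda s: avg_gain(A[:i] + [s] + A[i+1:], Y, P))
            if avg_gain(A[:i] + [best] + A[i+1:], Y, P) < avg_gain(A, Y, P):
                A[i] = best
                improved = True
        if not improved:                 # local minimum: stop early
            break
    return A
```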

Algorithm 3: Min-Max Greedy Matching (MMM)
Our next algorithms remove the restriction of the search space to the cyclic pool P_CS and instead search for assignments in the pool P_SQ of all possible question sequences. In the pseudocode of Algorithm 3 (some notation is copied from the main text for clarity), for a given student i and any s ∈ P_SQ, the edge weights w_(l,j) over l ≤ M_1 and j ≤ M_2 are the marginal gains over A of placing question j as student i's l-th question, initialized as w_(l,j) ← 0.

Algorithm 4: Integer Linear Programming (ILP)
We cast the assignment optimization problem as an integer linear program to find a globally optimal assignment in the permutation space, as shown below in Algorithm 4.
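On toy instances, the globally optimal assignment that the ILP computes can be cross-checked by brute-force enumeration of the permutation space; the sketch below is a stand-in for Algorithm 4 (our own code, using the same illustrative gain model as elsewhere in this supplement), practical only for tiny N, M_2, M_1:

```python
from itertools import permutations, product

def f_z(src, dst):
    """Copyable questions from src to dst (same-or-earlier slot)."""
    pos = {q: t for t, q in enumerate(src)}
    return sum(1 for t, q in enumerate(dst) if q in pos and pos[q] <= t)

def avg_gain(A, Y, P):
    """Average collusion gain g under the illustrative gain model."""
    N, M1 = len(A), len(A[0])
    total = 0.0
    for i in range(N):
        for j in range(N):
            if j != i:
                total += P[j][i] * (f_z(A[j], A[i]) / M1) * max(Y[j] - Y[i], 0.0)
    return total / N

def brute_force_min(Y, P, M1, M2):
    """Enumerate all |P_SQ|^N assignments and return one with the
    minimum average collusion gain (tractable only for tiny instances)."""
    pool = list(permutations(range(1, M2 + 1), M1))
    return min(product(pool, repeat=len(Y)),
               key=lambda A: avg_gain(list(A), Y, P))
```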
Supplementary Figure 3: The ILP formulation to compute a feasible assignment with the minimum expected collusion gain.

Performance Comparison of the Algorithms
We evaluate the performance of the algorithms proposed in the previous section on multiple synthetic datasets with different settings, i.e., choices of values for N, M_2, and M_1, over 100 instances per setting. For each setting, we generate the competences of the N students i.i.d. uniformly at random from [0.25, 1), and generate the colluding probabilities for each student i ∈ {2, ..., N} to cheat from the students k < i (i − 1 in total) from the (i − 1)-variate Dirichlet distribution with concentration parameter α = 10, so that Σ_{k<i} p_{k,i} = 1 and p_{1,1} = 1 in our experiments. In addition to random assignments and assignments obtained directly by circular shifting, we consider four algorithms: CGS (Algorithm 2), MMM (Algorithm 3), MMM-CGS, and ILP (Algorithm 4, which computes an optimal assignment with minimized gain). CGS was first initialized with the assignment generated by GAS, which provides a proven upper bound on the collusion gain, and then randomly initialized from the cyclic pool P_CS nine more times; the best result was selected for comparison and used as the initialization of MMM-CGS. Experiments were performed on a computer equipped with an AMD Ryzen 7 2700X processor running at 4.0 GHz and 16 GB of system memory. Owing to practical limits on running time and system memory, we evaluate against the ILP on instances with at most N = 10, M_2 = 5, and M_1 = 3. We evaluate the relative performance of our greedy heuristic algorithms on larger instances with N = 100, M_2 = 30, and different values of M_1 ∈ {10, 20}.
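The synthetic-instance generation can be sketched as follows (our own code; the uniform competence range and the Dirichlet concentration follow the description above):

```python
import numpy as np

def synth_instance(N, alpha=10.0, seed=0):
    """Synthetic setting sketch: competences i.i.d. Uniform[0.25, 1),
    sorted in descending order as the model assumes; for student i > 1,
    the colluding probabilities over the i-1 higher-ranked students are
    drawn from a symmetric Dirichlet(alpha) and hence sum to 1."""
    rng = np.random.default_rng(seed)
    Y = np.sort(rng.uniform(0.25, 1.0, N))[::-1]   # descending competence
    P = np.zeros((N, N))
    P[0, 0] = 1.0                                  # top student never cheats
    for i in range(1, N):
        P[:i, i] = rng.dirichlet(np.full(i, alpha))
    return Y, P
```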
Supplementary Figure 4 presents our experimental results. For each of the four algorithms, plus the random and cyclic assignments (Y-axis), Supplementary Figure 4 shows the gain over the honest score computed using Equation 7 from the main text (X-axis) over 100 instances, with a box representing the upper and lower quartiles, an orange line within the box the median, a green arrowhead the mean, and whiskers extending from either end of the box the range of observed values. Statistical outliers are plotted individually using the 'o' symbol.
Comparing the different algorithms, we observe immediately from Supplementary Figure 4 that each of our greedy algorithms exhibits significantly lower gain than the random assignments on average, which demonstrates the usefulness of our anti-collusion schemes in reducing collusion gains. Second, the low average collusion gain of the optimal solution computed by ILP validates our approach of minimizing the gain from collusion. Third, MMM-CGS and CGS approximate the minimum gain computed by ILP well, highlighting the effectiveness of the greedy algorithms in practice.
Comparing the optimized solutions computed under different settings, it is easy to observe that the optimized average collusion gain (with the optimal ILP algorithm, for example) does not necessarily correlate with the size of the permutation space; e.g., (5, 3, 2) and (5, 3, 3) have permutation spaces of the same size but different optimal gains; from (10, 3, 2) to (10, 5, 3), the size of the permutation space increases but the optimal gain decreases; and from (10, 5, 3) to (10, 5, 5), the size of the permutation space increases but the optimal gain increases. However, the optimized collusion gains do correlate with the difference M_2 − M_1 (as implied by Theorem 1); i.e., an increased M_2 − M_1 value significantly reduces the collusion gain, as demonstrated by (5, 3, 3) versus (5, 3, 2), as well as (10, 5, 5) and (10, 3, 2) versus (10, 5, 3). In addition, increasing the number of students can also raise the collusion gain, as shown by (5, 3, 2) versus (10, 3, 2).

Supplementary Figure 4: Comparison of the greedy algorithms with the ILP method in terms of the average collusion gains.
For each of the settings presented in Supplementary Figure 4, the ILP solution has a very low collusion gain on average, validating our approach of suppressing collusion by question assignment without proctoring. Surprisingly, the CGS solution matches the minimum possible gain obtained by ILP in over 95 of 100 instances in the settings (5, 3, 2), (5, 3, 3), and (10, 3, 2), and in over 65 of 100 instances in the settings (10, 5, 3) and (10, 5, 5), validating the usefulness of our greedy heuristic algorithms. Additionally, the MMM-CGS algorithm dominates the other heuristic algorithms but does not significantly outperform CGS in our experiments; CGS performs close to optimally even though its search space is restricted to the set of circularly shifted sequences. MMM, which is initialized by a random assignment, appears sensitive to the initialization and prone to getting stuck in locally optimal solutions, but still significantly outperforms random assignments.
Our experiments provide two key takeaways for educators. For a given number of students N: (1) the minimum collusion gain obtainable by our approach (e.g., the ILP solution) is determined by the permutation space available for assigning sufficiently different sets of questions in orderings that minimize collusion, which can be controlled either by increasing the size M_2 of the question pool or by decreasing the number of questions M_1 in each student's sequence; (2) greedy heuristic algorithms significantly outperform random assignments and often approach the minimum possible average collusion gain in settings with a small permutation space for question assignment.
In Supplementary Figure 5, we compare our greedy algorithms on relatively large instances with N = 100 students, a pool of M_2 = 30 questions, and M_1 = 10 or 20 questions per student. Our experiments again demonstrate that (1) our greedy algorithms significantly outperform random assignment, and (2)

Final Exam Design
On April 28, 2020, 78 out of 85 undergraduate students in two classes separately taught by two instructors took the final exam of an undergraduate imaging course according to our optimized design [1]. The course is based on a standard textbook [2], with all video lectures available online [3] complemented with online lecture notes [4].
The class itself is divided into two main parts: fundamentals and imaging. The fundamentals part covers measurements, linear systems, convolution, Fourier analysis, basic signal processing techniques such as filtering and sampling, basic imaging definitions, and measures of test accuracy. The imaging part covers the medical imaging modalities: x-ray, computed tomography (CT), nuclear imaging such as PET (positron emission tomography) and SPECT (single photon emission computed tomography), magnetic resonance imaging (MRI), ultrasound, and optical techniques such as microscopy and optical coherence tomography (OCT). The final exam covered signal processing and all imaging modalities. From this material, a pool of multiple choice questions was created, each question with four options. The questions tested the main concepts of the class subjects and were similar in length, since they all carried equal grade points and allowed a time period of two minutes each to answer. The number of questions was proportional to the material taught in class; i.e., 20% of the questions covered x-ray and CT combined, and each of nuclear imaging, MRI, and ultrasound was likewise the subject of 20% of the questions. 10% of the questions were about optical techniques, and the remaining 10% covered basic imaging definitions and measures of test accuracy.
The questions included a mix of text, formulas, and figures, and were designed for open-book testing. To simplify the testing platform and make direct online searching of the questions more difficult, all questions were included in the exam as images, and students had four boxes labeled A to D to choose from by clicking on the desired option. Students could change their answer within the time period allocated to each question, but could not make changes afterwards. The length of the time window was empirically adjusted to be enough for high-competence students to finish a question comfortably, but insufficient for unprepared students to search for the answer without a good understanding of the content.
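The per-question timing rule can be sketched as a small state machine (illustrative class and method names, not the actual platform code):

```python
class SlotTimer:
    """Sketch of the per-question timing rule: a student may change the
    answer while the slot's time window is open, but the answer is
    locked once the window (e.g., 120 seconds) has elapsed."""

    def __init__(self, window=120.0):
        self.window = window
        self.answers = {}              # slot -> last accepted choice

    def submit(self, slot, choice, elapsed):
        """Accept (or overwrite) the choice only within the open window."""
        if elapsed < self.window:
            self.answers[slot] = choice
            return True
        return False                   # too late: the answer cannot change
```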
For the final exam, a pool of 80 questions was created, 60 of which were used, i.e., M_2 = 60; the remaining 20 were reserved for students who requested a makeup exam. The exam consisted of 40 questions (M_1 = 40). Therefore, not all students were tested on the same 40 questions. Additionally, students were asked to join a WebEx video conference session with their respective instructor for questions or technical difficulties, which also served as simple online proctoring. Students also needed to log in to our DOT platform with their RCS ID and RIN (unique identifiers assigned to each student by our institute) to attend the exam. The students' identities were double-checked through the video by the instructor.

Sequence Assignment
Based on our anti-collusion scheme, an optimized assignment for the final exam (N = 85, M_2 = 60, M_1 = 40, Q = 4) was first designed by GAS and then refined with our heuristic CGS algorithm.

Competence Estimation
The students' competences were estimated from their performance in the mid-term exam held before social distancing. The two classes were taught by different instructors and had different mid-term exams, but took the same final at the same time. Thus, their relative performances in class, rather than their raw scores, were treated as their competence scores. The grade distributions of the two classes were first normalized to zero mean and unit standard deviation and then combined. It is worth mentioning that the students who did not take the mid-term exam were removed before the normalization procedure and then put back into the combined profile with a value of 0 (i.e., their performance was estimated by the average). Finally, the combined normalized grades were linearly transformed to the range [0.25, 1) to form the prior competence profile Y of the combined set of students.
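The normalization pipeline can be sketched as follows (our own code; student ordering is simplified in this sketch, and missing mid-term grades, marked as None, are re-inserted at z = 0 as described):

```python
import numpy as np

def competence_profile(grades_a, grades_b, lo=0.25, hi=1.0):
    """Sketch of the competence estimation: z-score each class's mid-term
    grades separately, pool them, re-insert students without a mid-term
    grade at z = 0 (average performance), then map linearly to [lo, hi).
    Note: for brevity this sketch appends missing students at the end
    rather than preserving the original student order."""
    def zscore(g):
        g = np.asarray([x for x in g if x is not None], dtype=float)
        return (g - g.mean()) / g.std()

    pooled = np.concatenate([zscore(grades_a), zscore(grades_b)])
    n_missing = sum(x is None for x in grades_a) + sum(x is None for x in grades_b)
    pooled = np.concatenate([pooled, np.zeros(n_missing)])  # z = 0 for missing
    span = pooled.max() - pooled.min()
    return lo + (hi - lo) * (pooled - pooled.min()) / (span + 1e-9)  # in [lo, hi)
```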

Colluding Matrix Construction
To perform the optimization, we heuristically construct a colluding matrix P depicting the probability of every student cheating from every other student. Following the notation in the main text, we make the following assumptions about colluding mechanisms: (1) the probability of student i actively cheating is related to his/her competence y_i; student 1 tends not to cheat since he/she would gain nothing (risk greater than benefit), while student N will try all means to cheat since he/she always gains (benefit greater than risk); (2) the probability that collusion happens between two students A and B is related to the difference between y_A and y_B; student i has the strongest willingness to cheat from student 1, the least willingness to cheat from a student j with y_j = y_i (since he/she cannot trust j more than himself/herself), and will never cheat from j if y_i > y_j.
Based on the assumptions above, the colluding matrix P is heuristically constructed by Eqs. (S1) and (S2), where n_f(i) is defined as the number of elements in Y that are greater than y_i, and η is a nonnegative constant that can be used to adjust the students' willingness to cheat: a larger η increases the colluding probability, and students always commit active cheating if η = ∞. Eqs. (S1) and (S2) define the probabilities of the non-cheating and cheating states of student i, respectively; in the cheating state, the probability that student i cheats from student j is proportional to their competence difference y_j − y_i normalized by the sum of the competence differences over all possible sources, i.e., p_{j,i} = (1 − p_{i,i}) d_{j,i} / Σ_{k=1}^{i−1} d_{k,i} for j < i. We further assume, without loss of generality, that the students have distinct competences (y_1 > y_2 > ... > y_N; adding tiny differences to two equal y's negligibly affects g), which simplifies the expression to n_f(i) = i − 1 and allows p_{j,i} to be written out explicitly. Note that this heuristic colluding matrix P may not exactly match real life, but it is a reasonable starting point for optimization. In our construction, P puts a larger weight on collusion between students with a larger competence difference than on that with a smaller difference, which helps limit the collusion gain in the worst case. Since mismatches between the model and practice are likely, a worst-case analysis should be performed on the optimized result. If the collusion gain calculated in the worst case for the output assignment is not acceptable, the result should be used with caution, or different initializations should be used to generate diverse solutions from which the best one is picked.
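A sketch of this construction follows. The exact functional form of Eq. (S1) is not reproduced in this supplement, so the exponential decay used below for p_{i,i} is our assumption; the proportional split of the cheating mass follows the text.

```python
import numpy as np

def colluding_matrix(Y, eta=1.0):
    """Heuristic colluding-matrix sketch. ASSUMPTION: the non-cheating
    probability decays exponentially with the number of higher-ranked
    students, p_ii = exp(-eta * n_f(i)) (a stand-in for Eq. (S1)); the
    cheating mass 1 - p_ii is split over students j < i in proportion to
    the competence differences y_j - y_i, as described for Eq. (S2)."""
    N = len(Y)
    P = np.zeros((N, N))
    for i in range(N):
        p_stay = np.exp(-eta * i)        # n_f(i) = i with 0-based indexing
        P[i, i] = p_stay
        if i > 0:
            diffs = Y[:i] - Y[i]         # d_{j,i}, positive since Y descends
            P[:i, i] = (1.0 - p_stay) * diffs / diffs.sum()
    return P
```

Note that the top-ranked student never cheats (p_{1,1} = 1), each column sums to 1, and a larger competence gap receives a larger share of the cheating probability, consistent with the assumptions above.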

Optimization Results
After optimization with CGS, the average collusion gain was reduced to 0.0073%, with the worst-case average collusion gain and the maximum individual collusion gain at 0.91% and 6.88%, respectively; the distribution of individual collusion gains in the worst case is shown in Supplementary Figure 6. From the figure, it can be seen that 90% of the students hold a maximum possible collusion gain below 2%, while the others sparsely range from 3% to 7%, suggesting that this is a practically good result.
Besides the anti-collusion feature, the nature of circularly shifted sequences enabled us to ensure that every student received the same number of questions from the same lecture.

Robustness Relative to Noisy Y
The proposed general anti-collusion scheme works very well in terms of the maximum individual collusion gain in the simulation. Notably, this does not rely on accurate student competence profiling: the scheme works well even with only the ranking of the students' competences and controls the collusion gain to a desired level, as shown in the proof of Theorem 1, where only ranking information is used to implement the scheme.
During the design of the exam, one may ask how robust our method is to noise in the students' competence data, since we need to infer Y from the students' previous performance and randomness inevitably makes Y noisy. In principle, the scheme should be robust even with noisy competence data. This can be readily understood: small noise only moves a few students across interval boundaries. A down-dropping student (DDS) could increase the maximum individual collusion gain g_MI, since other students in the augmented group can cheat from the DDS, but the increment in g_MI will be no larger than the noise magnitude. This will also increase the worst-case g_W, because all students in the same group gain. On the other hand, students in the upper group with lower competence than the DDS can potentially cheat from the DDS, creating inter-group collusion gains, but these are negligible in terms of g_W and g_MI, since the inter-group gain is much smaller than the maximum intra-group gains. In the up-floating case, only the up-floating student (UFS) obtains an increased collusion gain through intra-group collusion, and again the increment in g_MI is smaller than the noise magnitude; the UFS also benefits from inter-group collusion, but that gain is much smaller than his/her intra-group collusion gain. Thus, the grouping-based anti-collusion scheme should be robust against noise in the students' competences.

Statistical Testing: Did Significant Collusion Occur?
To assess whether the optimized DOT approach resulted in significant collusion, we formulated hypotheses for aspects (i) and (ii). For aspect (ii), the hypothesis is that there is no difference in the average number of correct answers between the first and the last 20 questions; we tested it using the Wilcoxon signed-rank test, a standard non-parametric hypothesis test. For aspect (i), the hypothesis is that significant collusion did occur. This section details how we tested this hypothesis, which is based on examining cases in which pairs of students gave the same answer to particular questions.
The format of the final exam stipulated that the 78 students were divided into 22 groups. As each group received a different set of problems, our focus here is on assessing the potential for intra-group collusion. More precisely, we can examine abnormal trends within the exam results a posteriori. With this in mind, we designed the following test procedure, focused on the 17 groups that had at least 3 students. We started by selecting a random integer between 1 and 17, thereby identifying a group at random. Next, from the selected group, we randomly selected two students, which we hypothesized to have engaged in collusion. Finally, we randomly selected n_q questions (5 ≤ n_q ≤ 40) for which we assumed that collusion had occurred.
If the pair of students gave the same answer to one of the n_q problems, irrespective of whether the answer was correct, we assigned the label 0 to this case; conversely, if the two students gave different answers to a particular problem, we assigned the label 1. To test for significant collusion, we repeated the procedure laid out in the preceding paragraph in a Monte Carlo fashion [5], i.e., randomly selecting groups, student pairs, and problems. The next step is combining m_p randomly selected student pairs (5 ≤ m_p ≤ 30). To construct a random variable, we define the indicator function I_j(i) = 0 if the pair gave the same answer (the students are considered to have cheated) and I_j(i) = 1 if they gave different answers (the students did not cheat), where the index 1 ≤ i ≤ n_q refers to a randomly selected question and the index j ≥ 1 labels the set of m_p randomly selected student pairs. The random variable X = x_j then describes the sample mean of the indicator values for the j-th set of m_p randomly selected student pairs, with 0 ≤ x_j ≤ 1. Finally, selecting M, e.g. M = 30, the number of Monte Carlo runs is M × n_q. Based on the above procedure, we translate aspect (i), that there was significant collusion, into the null hypothesis that the students cheated, i.e., that the mean of X is zero. To test this hypothesis, we can utilize the random variable X. By the well-known central limit theorem, X has an asymptotically normal distribution. Using the sample containing the M observations of X, i.e. x_1, ..., x_M, we determine the value of the test statistic T = t, t = x̄/s, where x̄ is the sample mean and s is the standard error of the mean. The test statistic T has a t-distribution with M − 1 degrees of freedom [6]. This allows computing the p-value, and the null hypothesis is rejected if p < α, with α the significance level of the test, selected as α = 0.05. The final step is to repeat the above procedure a total of K times, which yields p_1, p_2, ..., p_k, ..., p_K.
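The Monte Carlo core of this procedure can be sketched as follows (our own code; the returned t-statistic is then compared with the t-distribution with M − 1 degrees of freedom to obtain the p-value):

```python
import numpy as np

def collusion_t_stat(answers_by_group, n_q=10, m_p=5, M=30, seed=0):
    """Sketch of the test for aspect (i). H0 = significant collusion,
    i.e. paired answers are identical and the mismatch indicator
    I = 1{answers differ} has mean zero; a large positive t rejects H0.
    answers_by_group: list of groups, each a list of per-student answer
    sequences (students in a group received the same questions)."""
    rng = np.random.default_rng(seed)
    xs = []
    for _ in range(M):                                # M Monte Carlo sets
        rates = []
        for _ in range(m_p):                          # m_p random pairs
            g = answers_by_group[rng.integers(len(answers_by_group))]
            a, b = rng.choice(len(g), size=2, replace=False)
            qs = rng.choice(len(g[a]), size=n_q, replace=False)
            rates.append(np.mean([g[a][q] != g[b][q] for q in qs]))
        xs.append(np.mean(rates))                     # sample mean x_j
    xs = np.asarray(xs)
    # t-statistic: sample mean over its standard error (df = M - 1)
    return xs.mean() / (xs.std(ddof=1) / np.sqrt(M))
```

For independently answering students the mismatch rate is far from zero, so t is large and H0 (collusion) is rejected; perfectly colluding pairs would drive the mismatch rate, and hence t, toward zero.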
As advocated in [7], to statistically evaluate these p-values, we computed adjusted p-values to account for the false discovery rate (FDR). This, in turn, allows determining the FDR threshold: an FDR threshold below the significance level α implies that each of the adjusted p-values is below α. For K = 100 repetitions of this Monte Carlo method, 5 ≤ n_q ≤ 20 randomly selected problems (# Problems), and 5 ≤ m_p ≤ 30 randomly selected student pairs (# Comparisons), Figure 3 in the main text shows that the FDR threshold is below α = 0.05.
Even the extreme cases, describing a set of 100 times 5 randomly selected student pairs (two students within the same group), show that there is no case in which we fail to reject the null hypothesis. We therefore reject the hypothesis that the students engaged in significant collusion. In sharp contrast, we cannot make any statistically sound judgment as to whether individuals assisted each other in answering the 40 questions. More precisely, there is no empirical evidence to support that significant collusion occurred, for instance by using cellphones to take pictures and send text messages, or by sending short emails with answers to particular problems.
It is interesting to observe that the number of questions has a negligible effect on the false discovery rate threshold. This was not expected, as we considered the case of occasional collusion (a smaller value of n_q, say 5) to be more likely than systematic collusion (a larger value of n_q, closer to 40). Conversely, the increase in the false discovery rate threshold when reducing the number of student pairs m_p is expected: by decreasing the sample size m_p, the size of the acceptance region increases accordingly, which reduces the probability of rejecting an incorrect null hypothesis (a Type II error).
Reducing the lower boundary for m_p below 5 is not advisable, as m_p constitutes the sample size. More precisely, we observed a reduction in the FDR threshold for m_p = 3, 4 compared to the values obtained for m_p = 5, which we attributed to the lack of statistical information in such a small sample. A point of contention is the assumption that the average over 10 ≤ n_q ≤ 40 questions is drawn from a normal distribution. To verify this assumption, we utilized the Anderson-Darling test [8] to test whether each sample of m_p observations was drawn from a normal distribution. Accepting 10% of violations, we observed that violations arose in around 18% of the cases with n_q < 25. In practice, the use of t-tests often yields satisfactory results even when the assumption of normality is violated, in place of alternative standard non-parametric hypothesis tests [9]. Moreover, we repeated the same testing procedure using the standard sign test [6], which produced results similar to those depicted in Figure 3.

DOT Platform
To implement our DOT technology, we developed a software system using Flask [10], a web application framework written in Python [11]. This prototyping framework supports real-time communication between a secured database system and a user-friendly frontend interface. PostgreSQL [12] was used to record user information, and all data between PostgreSQL and Flask were transmitted with Psycopg2 [13], a PostgreSQL database adapter library that can handle multiple database requests simultaneously. Furthermore, we used Jinja [14], embedded in Flask, for the frontend interface; it is a designer-friendly templating language for web development in Python. Through this mediator, connected to PostgreSQL within the web framework, DOT is capable of handling a large number of requests at the same time.
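The platform pattern can be illustrated with a minimal Flask sketch (the routes, the JSON responses, and the in-memory answer store are ours for illustration only; the real system renders Jinja templates and persists answers in PostgreSQL via Psycopg2):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Illustrative data: per-student question sequences and an in-memory
# answer store standing in for the PostgreSQL tables.
ASSIGNMENT = {"s1": [3, 1, 2], "s2": [1, 2, 3]}   # student -> sequence of question ids
ANSWERS = {}                                       # (student, slot) -> choice

@app.route("/question/<student>/<int:slot>")
def question(student, slot):
    """Return the question id shown to this student in this time slot."""
    return jsonify(question=ASSIGNMENT[student][slot])

@app.route("/answer/<student>/<int:slot>", methods=["POST"])
def answer(student, slot):
    """Record (or overwrite, within the slot) the student's choice."""
    ANSWERS[(student, slot)] = request.get_json()["choice"]
    return jsonify(ok=True)
```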
Several interface screenshots of the DOT platform are shown as Supplementary Figures 7 to 11. The aforementioned online exam, as well as the data collection, was conducted on this DOT platform.
Supplementary Figure 7: The log in interface where students input their accounts and passwords to join the exam.
Supplementary Figure 8: The instruction interface displaying the general guidelines for the exam. This is right after students login, and students will read the guidelines and listen to the proctor' instructions waiting for the exam starts. Students can also choose whether to activate the beep function to remind them to put in their answers when there are only ten seconds left.
Supplementary Figure 9: The exam interface, where questions are displayed. Above each question, a timer counts down the time allocated to that question. When fewer than ten seconds remain, the timer turns red to remind students to enter their answers. If a student has activated the beep function, he/she will also hear a short beep when the timer counts down to ten seconds.
Supplementary Figure 10: The exam interface, where questions are displayed. Below each question are four boxes that students click to indicate their answer choices.
Supplementary Figure 11: The finish interface. When a student finishes the exam (i.e., the exam period ends), he/she is automatically directed to this interface, which indicates the end of the exam. Students can also provide feedback by clicking the feedback button and answering a questionnaire.

Random Sampling
As mentioned in the Cyclic Greedy Searching section, in the scenario without prior knowledge of students' competences, we prefer to randomly assign students question sequences drawn from the cyclic pool P_CS rather than from the permutation pool P_SQ. In this section, we calculate the expected collusion gain under random sampling from P_CS and under random sampling from P_SQ, and prove that the former is more desirable (i.e., yields a smaller expected collusion gain) than the latter.
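This claim can be checked numerically on a small case. The sketch below assumes that Fz(src, dst) counts the questions the dst student could copy from the src student, i.e., the question at slot j of dst counts if src answers it at slot ≤ j (slots are synchronized); the case M_1 = M_2 = 5 is purely illustrative.

```python
# Compare the expected Fz between two distinct randomly drawn sequences
# from the cyclic pool P_CS versus the permutation pool P_SQ (M1 = M2 = 5).
from itertools import permutations

def fz(src, dst):
    # Question at slot j of dst is copyable if src answers it at slot <= j.
    pos = {q: p for p, q in enumerate(src)}
    return sum(1 for j, q in enumerate(dst) if pos[q] <= j)

def mean_fz_distinct(pool):
    """Mean Fz over all ordered pairs of distinct sequences in the pool,
    i.e., the expected Fz under sampling without replacement."""
    total = sum(fz(a, b) for a in pool for b in pool if a != b)
    return total / (len(pool) * (len(pool) - 1))

M = 5
base = tuple(range(M))
p_cs = [base[k:] + base[:k] for k in range(M)]  # cyclic pool P_CS
p_sq = list(permutations(base))                  # permutation pool P_SQ
```

For this case the cyclic pool gives a mean Fz of 2.5 versus roughly 2.98 for the permutation pool, consistent with the claim that sampling from P_CS yields the smaller expected collusion gain.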
Let us define an operation EZ(·) on an SQ pool S that calculates the expectation of Fz(s_1, s_2), where s_1 and s_2 are two randomly selected elements from S. In other words, EZ(S) = mean{Z(S)}, where Z(S) is the positional matrix of S with diagonal entries set to the length of a sequence element of S, if replacement is allowed; otherwise, EZ(S) equals the mean of the non-diagonal elements of Z(S).

Proof. If we first randomly select two sequences v_1 and v_2 from S, by definition we have

E(Fz(v_1, v_2)) = E(Fz(v_2, v_1)) = EZ(S). (S5)

Then we can continue to randomly select the remaining sequences from the rest of S, but this does not affect the result already shown in Equation (S5). On the other hand, these two steps can be taken as a single step in which we randomly select m different sequences from S, so that the m sequences cannot be differentiated from each other. In that sense, the expectation of Fz between any two of them should be the same. Combining this with Equation (S5), we have E(Fz(v_i, v_j)) = EZ(S) for i, j = 1, 2, ..., m and i ≠ j. Hence, the mean of the non-diagonal elements of the positional matrix of V equals EZ(S). This proves the theorem.

Proof. Follow the same idea as the proof of Theorem 2, but in a one-by-one manner.

The definition of EZ(·) reminds us that the mean of the positional matrix of a sequence pool can be calculated as the expectation of the Fz number between two randomly sampled sequences from the pool. Without loss of generality, we can assume that the first sequence is s_ref = [1, 2, ..., M_1], because for any other case, e.g., [k_1, k_2, ..., k_{M_1}], we can relabel tag 1 to k_1, 2 to k_2, ..., and M_1 to k_{M_1}. Note that this re-indexing does not change the questions themselves. For the case with replacement, s_ref can be combined with any sequence from P_CS with equal probability.
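The EZ(·) operation can be sketched directly from its definition. The code below assumes Fz(src, dst) counts the questions in dst that src answers at an earlier or equal slot (hence copyable); the 5-question cyclic pool is illustrative only.

```python
# Sketch of EZ(.): build the positional matrix Z (Fz off the diagonal,
# the sequence length M1 on the diagonal) and average it.
def fz(src, dst):
    pos = {q: p for p, q in enumerate(src)}
    return sum(1 for j, q in enumerate(dst) if pos[q] <= j)

def ez(pool, replacement=True):
    """Mean of Z(pool) with replacement; mean of its off-diagonal
    entries without replacement."""
    n, m1 = len(pool), len(pool[0])
    z = [[m1 if i == j else fz(a, b)
          for j, b in enumerate(pool)] for i, a in enumerate(pool)]
    if replacement:
        return sum(map(sum, z)) / (n * n)
    off_diagonal = [z[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(off_diagonal) / len(off_diagonal)

base = tuple(range(5))
p_cs = [base[k:] + base[:k] for k in range(5)]  # illustrative cyclic pool
```

For this cyclic pool, ez(p_cs) evaluates to 3.0 with replacement and 2.5 without, illustrating how the diagonal (self-pairing) entries raise the with-replacement mean.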
If we list the sequences in P_CS in a right-circular-shifting manner starting with s_ref, then for the second sequence s_i chosen from P_CS, where the integer i indicates the position of the sequence in the sorted list of P_CS and s_i = [v_1, v_2, ..., v_{M_1}], it is easy to see that s_i can copy question v_j if v_j ≤ j. This yields a simple criterion for judging whether a question v_j ∈ s_i contributes to Fz(s_ref, s_i).

Note that random permutation can generate many more question sequences than circular shifting: P_SQ contains n = M_2!/(M_2 − M_1)! elements. Suppose that s_i is a placeholder for a sequence. Then the sum of all Fz(s_ref, s_i) values can be calculated by question positions instead of by sequences, as given in Equation (S15) for the case allowing replacement and in Equation (S16) for the case not allowing replacement. Each question from {1, 2, ..., M_2} has an equal probability of being v_j (the j-th question in a sequence), and the frequency with which each question appears as the j-th question equals n/M_2 in P_SQ. Thus, based on the criterion, Equations (S15) and (S16) can be easily evaluated, yielding Equations (S19) and (S20). Hence, we can calculate the mean of all Fz(s_ref, s_i) values by normalizing the results in Equations (S19) and (S20) by their corresponding numbers of cases, and obtain EZ(P_SQ) with and without replacement; with replacement,

EZ(P_SQ) = [n(M_1 + 1)/2] / n = (M_1 + 1)/2. (S22)
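As a sanity check on Equation (S22), the sketch below enumerates the permutation pool for the full-permutation case M_1 = M_2 = 4 (a simplifying assumption for illustration) and confirms that the mean Fz with replacement, with the diagonal entries fixed at M_1, equals (M_1 + 1)/2.

```python
# Verify EZ(P_SQ) = (M1 + 1)/2 with replacement for M1 = M2 = 4.
from itertools import permutations

def fz(src, dst):
    # Criterion above: the question at slot j of dst contributes if the
    # src student answers it at slot <= j; note fz(s, s) = len(s).
    pos = {q: p for p, q in enumerate(src)}
    return sum(1 for j, q in enumerate(dst) if pos[q] <= j)

M1 = 4
p_sq = list(permutations(range(1, M1 + 1)))  # n = M1! sequences
n = len(p_sq)
# Average over all n*n ordered pairs (with replacement); the diagonal
# pairs already contribute fz(s, s) = M1 as required.
ez_with_replacement = sum(fz(a, b) for a in p_sq for b in p_sq) / (n * n)
```

The enumeration returns exactly 2.5 = (M_1 + 1)/2 for this case, matching the closed form in Equation (S22).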