Anytime collaborative brain–computer interfaces for enhancing perceptual group decision-making

In this paper we present, and test in two realistic environments, collaborative Brain-Computer Interfaces (cBCIs) that can significantly increase both the speed and the accuracy of perceptual group decision-making. The key distinguishing features of this work are: (1) our cBCIs combine behavioural, physiological and neural data in such a way as to be able to provide a group decision at any time after the quickest team member casts their vote, but the quality of a cBCI-assisted decision improves monotonically the longer the group decision can wait; (2) we apply our cBCIs to two realistic scenarios of military relevance (patrolling a dark corridor and manning an outpost at night where users need to identify any unidentified characters that appear) in which decisions are based on information conveyed through video feeds; and (3) our cBCIs exploit Event-Related Potentials (ERPs) elicited in brain activity by the appearance of potential threats but, uniquely, the appearance time is estimated automatically by the system (rather than being unrealistically provided to it). As a result of these elements, in the two test environments, groups assisted by our cBCIs make both more accurate and faster decisions than when individual decisions are integrated in more traditional manners.

Figure 1. An illustration of our form of cBCI. A combination of response time, reported confidence and neural signals is used to estimate the probability of a decision being correct; the group decision is then made by aggregating the confidence-weighted responses.

Outpost: the character appears only in the second frame of the example, followed by the response of the participant (marked in red). After the response, the participant indicates their degree of confidence, shown as 100 in this example.

www.nature.com/scientificreports/

In Experiment 2 (Outpost), a character appears in the distance and starts walking towards the outpost (see Fig. 2(c)). Time pressure and a reward/penalty system were included to simulate a situation where both erroneous and slow decisions may have had negative consequences.
In both scenarios, participants had the task of reporting whether the characters appearing were wearing a helmet or a cap by pressing a mouse button. Both experiments received the UK Ministry of Defence (MoD)'s ethical approval in July 2017 and were performed in accordance with relevant guidelines and regulations. The tasks for both experiments were designed after consultation with the MoD. Decision confidences derived by the cBCI from neural and behavioural features were used in combination with their corresponding decisions to reach a final group consensus for each trial. Participants performed the experiments individually, and group decisions for groups of sizes two to ten were simulated post hoc by considering all possible combinations of participants.
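The post-hoc group formation described above can be sketched in a few lines (a minimal illustration; the function name and group encoding are ours, not the paper's):

```python
from itertools import combinations

def enumerate_groups(participants, min_size=2, max_size=None):
    """Enumerate every possible group of each size, for post-hoc
    simulation of group decisions from individually recorded data."""
    if max_size is None:
        max_size = len(participants)
    for size in range(min_size, max_size + 1):
        for group in combinations(participants, size):
            yield group

# With ten participants there are C(10, k) groups of size k,
# e.g. 45 pairs and 252 groups of size five.
groups = list(enumerate_groups(range(10)))
```

Because the same stimulus sequence was shown to every participant, each simulated group can simply replay the stored individual responses trial by trial.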
The tasks are difficult for individual decision makers. Figure 3 shows the individual accuracies of the participants in Experiments 1 (left) and 2 (right). Due to the poor lighting conditions, the tasks are relatively difficult, the average decision accuracies (dashed lines in the figures) being 79.94% ± 9.67% and 85.72% ± 11.42% (first reported in 79), respectively. Experiment 1 is difficult because of the poor lighting conditions and because the character appears on the screen for only 250 ms and at random locations. Experiment 2 also has very poor lighting conditions but it is slightly easier, as the character stays on the screen for much longer and becomes progressively bigger, which makes it possible for participants to foveate and wait until there is enough detail to be reasonably sure of their response. Part of our objective for this study is to show the improvement of group decision-making over individual decision-making (as shown in Fig. 5).
ERP analysis shows differences in brain activity for correct and incorrect decisions. We have examined the Event-Related Potentials (ERPs) associated with correct and incorrect decisions across all participants. Figure 4 (top plots) shows the response-locked grand averages of the ERPs at the FCz electrode location for correct and incorrect trials. Green shading marks the regions where the Wilcoxon signed-rank test indicated that differences between correct and incorrect trials are statistically significant. For Experiment 1, it is apparent that differences are significant for approximately 500 ms preceding the response. For Experiment 2, differences are also present in the period preceding the response, but they are statistically significant only in much smaller time intervals than for Experiment 1.
The situation is similar for many other electrode sites, as one can see in the scalp maps in Fig. 4 (bottom), which represent the activation maps during correct and incorrect decisions (grand averages) and the p-values of the Wilcoxon signed-rank tests that compared the grand averages of the correct and incorrect responses at 300 ms and 80 ms before the response.
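The response-locked grand averaging and the pointwise statistical comparison can be sketched as follows. This is a simplified illustration: it uses the normal approximation to the Wilcoxon signed-rank statistic and ignores tie and continuity corrections, so the paper's exact analysis pipeline may differ.

```python
import numpy as np
from math import erf, sqrt

def grand_average(epochs):
    """Average response-locked epochs (trials x samples) into an ERP."""
    return np.asarray(epochs, dtype=float).mean(axis=0)

def wilcoxon_signed_rank_p(x, y):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.
    x, y: paired values (e.g. per-participant mean amplitudes at one
    latency for correct vs incorrect trials). No tie correction."""
    d = np.asarray(x, float) - np.asarray(y, float)
    d = d[d != 0]                                   # drop zero differences
    n = d.size
    ranks = np.argsort(np.argsort(np.abs(d))) + 1   # ranks of |d|
    w_pos = ranks[d > 0].sum()                      # sum of positive ranks
    mu = n * (n + 1) / 4
    sigma = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_pos - mu) / sigma
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided p
```

Applied at each epoch sample, such p-values yield the shaded significance regions of Fig. 4.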
The differences in the patterns of brain activity recorded in the two experiments are most likely due to the fact that in Experiment 1 the uniformed characters on which the decision is based appear suddenly, remain visible for a very short time and then disappear, while in Experiment 2 they appear initially very small and then progressively become bigger as they walk towards the outpost. So, in Experiment 2 there is no well-defined event that can trigger a strong ERP.
The differences in EEG recordings for correct and incorrect decisions illustrated in Fig. 4 can be exploited within a cBCI (typically in combination with other measurements) to estimate the probability of each decision being correct, which is a form of confidence.
Groups assisted by a collaborative BCI are more accurate than traditional groups. Figure 5 shows the mean accuracies and standard errors of the mean for individuals and groups of sizes two to ten using different cBCI-based decision support systems for Experiments 1 (left) and 2 (right). The different cBCIs use different inputs: neural features and RT (cBCI(nf+RT)), or neural features, RT and reported confidence (cBCI(nf+RT+Rep.Conf)). For reference, we also report the results obtained from decision support systems that use standard majority (Majority, in blue), only RTs (RT, in green) and a dictatorial system (Dictator, in orange). To reconstruct the RT, we employed an algorithm (see Methods section) that performed pairwise comparisons of the frames preceding the response to identify the one where a significant difference occurred; the time at which such a frame was presented is taken to be the stimulus onset. In the dictatorial system, the system identifies a group's most skilled member based on individual accuracy, and the other members of the group trust the best member's judgement to make the final decision. We performed pairwise comparisons of the accuracies of all confidence estimation methods discussed above over all groups of sizes two to nine using the two-tailed Wilcoxon signed-rank test with Holm-Bonferroni adjustments (more information on the validation of our statistical comparison approach is provided in the supplementary section of this article).
The less marked superiority of cBCI(nf+RT+Rep.Conf) over RT+Rep.Conf in Experiment 2 is a reflection of the weaker differences in the ERPs associated with correct and incorrect trials in that experiment (see Fig. 4 (right)). (Additional statistical comparisons between methods can be found in Tables S1a-g and S2a-g within the supplementary section of this article.) As one can see in Fig. 5, the differences in performance between all confidence-weighted methods (RT, RT+Rep.Conf and all cBCIs) and standard majority are larger for even-sized groups than for odd-sized groups. This is caused by the different behaviours exhibited by majority and the confidence-weighted methods in the presence of ties (which are only possible in groups of even size). In the presence of a tie, standard majority breaks the tie by flipping a coin (there is no better strategy, since classes are equiprobable). On the contrary, with the confidence-weighted methods ties are simply resolved by picking the class with the higher total confidence, which is more often than not the correct decision. This is particularly beneficial for groups of size two, which present the biggest improvement over traditional methods: pairs are more likely to generate ties than larger groups, and hence they benefit the most from the ability to break ties in favour of correct decisions afforded by the weighted confidences derived from cBCIs, RTs and reported confidence.
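The confidence-weighted majority vote and its tie-breaking behaviour can be sketched as follows (a minimal illustration; the ±1 vote encoding and the function name are ours, not the paper's):

```python
import random

def weighted_group_decision(votes, confidences):
    """Confidence-weighted majority vote over a binary decision.
    votes: individual decisions encoded as +1/-1.
    confidences: per-member estimated probabilities of being correct.
    A standard-majority tie is automatically broken in favour of the
    side with the higher total confidence."""
    score = sum(v * c for v, c in zip(votes, confidences))
    if score > 0:
        return 1
    if score < 0:
        return -1
    return random.choice([1, -1])  # exact tie in total confidence: coin flip
```

In a disagreeing pair, for example, the more confident member's vote prevails, which is why pairs benefit most from this scheme.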
Decision confidences derived from physiological and neural measures are good at assessing one's decision. Figure 6 presents the mean confidence available from decision support systems based on: (a) reported confidence, (b) RT only (confidence(RT)), (c) RT and reported confidence (confidence(RT+Rep.Conf)), (d) neural features and RT (cBCI confidence(nf+RT)), and (e) neural features, RT and reported confidence (cBCI confidence(nf+RT+Rep.Conf)). Results for the ten participants for Experiments 1 and 2 are shown in the bar charts on the left and right of the figure, respectively. The confidences are divided into two classes, associated with correct (in blue) and incorrect (in red) responses, respectively. The differences between these two conditions are also reported (in grey).
It is clear from the figure that participants reported higher confidence when they responded correctly than when they erred (Wilcoxon signed-rank test, p < 0.007 for both experiments). This is expected, as confidence is a self-assessment of one's decisions and, therefore, decisions made with high confidence should be more likely to be correct than incorrect.
The differences in average confidence for the incorrect and correct responses shown in the figure (grey bars) indicate that all decision support systems introduced in this paper have at least as good a separation between the two classes as the actual reported confidence. In fact, taken in the order shown in the figure, the separation is 5.22%, 15.06%, 11.95% and 17.66% better than the reported confidence in Experiment 1, and 17.38%, 24.22%, 18.43% and 24.80% better in Experiment 2. While these differences are consistent, individually they are not statistically significant. However, the picture changes drastically when, later in the paper, we use these decision support systems to aid group decision-making. There we will see not only that the apparent superiority of all the decision support systems over the standard reported confidence is real, but also that the cBCI based on neural features, RT and reported confidence is superior to all the other decision support systems.

Figure 5. The average group accuracies of all possible groups of sizes one to ten formed from the ten participants for Experiments 1 and 2 and the corresponding standard error of the mean (computed using a bootstrapping procedure). Results for the following decision aggregation strategies are shown: majority (in blue), dictatorial decisions (in orange), RT (in green), RT and reported confidence-based estimation (in red), a cBCI using neural features and RT (in purple), and a cBCI using neural features, RT and reported confidence (in maroon).

Anytime morphing between decision support systems gives optimal time vs accuracy trade-offs. As noted from Fig. 5, the cBCI-based group decision-making system with reported confidence as an additional feature (cBCI(nf+RT+Rep.Conf)) is superior in performance to the other alternatives. A limitation of group decision-making systems based on reported confidence is that a decision can only be made after the members of the group have registered their confidences. These processes can easily take several seconds, which may be incompatible with the decision times required by many real-world situations. The cBCI-based group decision-making system not using reported confidence can produce a less accurate decision sooner, that is, immediately after all group members have provided a response. This may still require an excessively long time, especially in large groups. To get even quicker decisions, as we suggested in 71, one could take a decision after the fastest N responders have cast their vote.
Here we explored an alternative strategy that tries to obtain the best compromise between accuracy and decision speed from all the above-mentioned methods. The approach smoothly morphs from the fastest system, where only the quickest responder determines the group decision, to the slowest one, where all participants have reported decisions and confidences and all contribute.
The strategy gathers all of the information (neural signals, decisions and reported confidence) available from any number of group members at any given time after the fastest responder has provided a decision. It then feeds such information to the appropriate types of decision support system. Such systems must all speak the same language; i.e., they must return an evaluation of the probability of the decision provided by a participant being correct (confidence). This makes it possible to form group decisions, via a confidence-weighted majority vote, even if the confidence of participants was evaluated by different systems. In this way, a group decision is available at any time. The decision is then updated as soon as new information is available, making such a system an anytime algorithm 80.
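The anytime scheme can be sketched as follows. This is a minimal illustration: the record field names, the two stand-in confidence models and the above-chance weighting are our assumptions, not details given in the paper.

```python
def anytime_group_decision(responses, t, conf_with_report, conf_without_report):
    """Group decision at time t after the first response.
    responses: dicts with 'vote' (+1/-1), 'rt' (response time) and
    'conf_time' (when the member finished reporting confidence, or None).
    conf_with_report / conf_without_report: callables returning an estimated
    probability of correctness (stand-ins for the two trained models)."""
    score = 0.0
    for r in responses:
        if r['rt'] > t:
            continue                     # member has not responded yet
        if r['conf_time'] is not None and r['conf_time'] <= t:
            c = conf_with_report(r)      # model trained with reported confidence
        else:
            c = conf_without_report(r)   # model trained without it
        score += r['vote'] * (c - 0.5)   # weight votes by confidence above chance
    return 1 if score >= 0 else -1       # ties default to +1 for determinism
```

Re-evaluating this function every 100 ms reproduces the anytime behaviour: the decision can change as more members respond and report their confidence.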
We applied this morphing strategy to three pairs of decision support systems: (1) the two cBCIs tested in Fig. 5, (2) a decision support system based on RT and one based on RT as well as reported confidence, and (3) standard and confidence-weighted majority voting. For the standard majority system, confidence was a static quantity equal to the average accuracy of all participants in the training set. Figure 7 reports the results obtained with the corresponding anytime decision support systems.
More specifically, the figure shows how the accuracies of groups of size two to five and for Experiment 1 (right column) and Experiment 2 (left column) vary as a function of time after the first response for each of the three anytime systems. Decisions were updated by each system every 100 ms. The figure also shows how many members on average had responded by each time (shaded region with secondary ordinate axis) and the number of responders who had also reported their confidence (shaded blue region).
Figure 6. Distribution of the mean of the actual confidence reported by the participants (Actual Confidence) and the estimated confidences derived from RT (confidence(RT)), RT and reported confidence (confidence(RT+Rep.Conf)), cBCI confidence based on neural features and RT (cBCI confidence(nf+RT)) and, finally, cBCI confidence based on neural features, RT and reported confidence (cBCI confidence(nf+RT+Rep.Conf)) for correct (in blue) and incorrect decisions (in red) made by the participants in Experiments 1 (left column) and 2 (right column), respectively. The grey bars indicate the difference in mean confidence (both actual and estimated) between correct and incorrect decisions.

It is clear from Fig. 7 that both the cBCI and the system based on RTs present a monotonically increasing accuracy profile: the more time available for the group decision, the more accurate that decision. Interestingly, in most cases, after a rather rapid transient, accuracy tends to plateau, which suggests that near-optimal decisions can be obtained well before all participants have responded and reported their confidence. It is also clear that, thanks to the use of neural information, the cBCI always has an edge over the purely behavioural system based on RT. The cBCI anytime method also outperforms the majority-based system. Somewhat surprisingly, the accuracy of the majority-based group-decision system is not always a monotonic function of time. This effect is associated with the fact that the best performers in a group are often also the fastest responders. In the majority system all responses have the same weight until confidence values are available. During this period, as more and more weaker members cast their vote, the group accuracy may fail to increase (or, worse, it can even decrease) over time. The situation improves as more and more members express their confidence.
However, accuracy eventually plateaus to a markedly lower value than for the other systems.

Discussion.
Metacognitive processes make decision-makers consciously or unconsciously aware of the likelihood of their decision being correct, through a feeling that we call confidence. In our previous research 33,71,75,81,82, we found that, when decision makers act in isolation, i.e. in the absence of communication or peer pressure, a BCI can provide estimates of confidence on a decision-by-decision basis that are often more correlated with decision correctness than the confidence reported by participants themselves. We then used these estimates to improve the performance of groups of decision-makers by simply weighting decisions by the corresponding BCI confidence, a system that we call a collaborative BCI, or cBCI for short. All of our tests to date involved decisions based on either static images or speech.
In this paper, we have extended and then applied our cBCI to assist with perceptual decisions in two dynamic realistic environments. In the first environment, participants viewed video feeds showing the perspective of a user walking along a dark corridor and trying to identify possible threats. The second environment simulated an even more realistic situation: an outpost at night where potential threats would quickly walk towards the outpost and where the outcome of an erroneous and/or slow decision could be very severe. In both these situations one could imagine that an automated computer-vision system for target detection could be a better solution. However, for ethical reasons, many decisions in the military domain that can lead to possible fatalities (including those represented in the two scenarios studied in this paper) cannot be made by an AI system in full autonomy. A human needs to be always in the loop 83,84 . For this reason, it makes sense to augment and assist human-decision making using AI-based technologies.
In addition to dealing with the challenges imposed by such environments, we decided to address an additional challenge: in many real-world applications precise RTs are unavailable because situations requiring a decision present themselves at random times and users must realise by themselves that a situation requires a decision in the first place. For the first time, our decision-support systems are capable of reconstructing RTs, thereby dealing with this challenge and making them even more applicable in practice.
Despite these challenges, for both environments, results confirm that the cBCI based on neural features, RT and reported confidence is significantly better than the traditional standard-majority and dictatorial systems and also, most often, than other machine-learning-based decision-support systems relying on behavioural data (RT and reported confidence) to estimate confidences. The RT-based decision-support systems are also significantly better than the standard-majority and dictatorial systems. So, in the absence of neural recording infrastructure such as an EEG system, the RT-based systems can be implemented as an alternative, albeit with a small compromise in performance.
Group decision support systems that rely on reported confidence present the drawback that decisions can only be made after the process of assessing and reporting individual confidence values is complete, which may take an additional few seconds. Our cBCI based on neural features and just RT does not present this problem and is the second-best choice, being significantly better than both majority and the decision-support system relying on RT to estimate confidences.
It is clear from our results that using reported confidence as an additional feature allows our decision support systems to provide more reliable estimates of the probability of correctness. While, as noted above, confidence reporting requires extra time, it is often the case that by the time the slowest responders in a group have provided their decisions (thereby enabling the group decision), the fastest ones have also reported their confidence. Also, there may be cases where one can afford more time for the decision, which would allow more group members to report their confidence.
With this in mind, in this paper we proposed and tested three anytime decision support systems (both behavioural and cBCI-based). At any given time after the first response, our anytime systems estimate the decision confidence of all responders available so far, using a model trained without the reported confidence as an input for users who have not yet reported their confidence, and a model trained with it for users who have. They then make the group decision. This decision, however, may change over time as more and more users make decisions and report their confidence.
Results indicate that the anytime cBCI-based decision support system is superior to the two behavioural anytime systems in the test environments considered. They also suggest that, after a certain experiment-dependent time, group accuracy does not improve significantly further with time. So, our systems are on par in accuracy with the corresponding non-anytime versions, but are faster. If an application requires even faster decisions, our anytime systems can provide them, but at the cost of a reduced group accuracy. For these reasons, such systems may be particularly suitable for perceptual decision-making scenarios in defence, policy-making and healthcare, where rapid decision-making may be needed.
Although our two environments have been designed to mimic realistic situations, they are still crude approximations of the rich set of sensory inputs and bodily reactions that people might encounter in real-world situations, particularly in the presence of real (as opposed to simulated) risk. Also, our participants were tested in very controlled lab conditions (e.g., they sat in a comfortable chair; there was very little noise and other distractions from the environment; the experiments were of a limited duration, thereby only inducing mild fatigue; etc.). While these conditions are not completely atypical of the military domain (e.g., in the virtual cockpit of a drone, or in remote/distributed C2 decision-making), there are many real, complex environments where they do not apply. In such cases, one should expect that, in general, poorer results might be obtained. Muscular artefacts produced by physical activity may not necessarily be an issue, as we recently reported in 85, where we found that walking on a treadmill did not produce any negative effect on individual performance in the patrol task, instead improving cBCI/group performance, likely due to the increased level of alertness associated with walking. However, EEG signals would be drastically affected by strenuous exercise, intense accelerations, intense mental fatigue, etc., which would likely render the cBCI approach presented here inapplicable.
Another limitation of the approach is that cBCIs have mostly been tested in assisting perceptual decision-making in situations where there are only two options and there is some form of time pressure and/or where the perceptual information is available only for a short time, is inconsistent or is overwhelmingly detailed. Only a fraction of all situations have these characteristics, many others involving strategic decisions where resources (rather than time) are limited and where more than two choices are available. We are currently exploring these situations through a joint US DoD/UK MoD research initiative (https://basicresearch.defense.gov/Pilots/BARI-Bilateral-Academic-Research-Initiative/). There, we are extending our cBCI application to more complex problems in which, for example, decisions do not necessarily have a correct or incorrect choice, and are not just based on perceptual input given in the current trial, but also on information gained and decisions made in past trials. In this case, additional factors need to be taken into consideration, such as, for example, problem framing 5, which can change decisions as individual perspectives change, and the notion that human decisions often lead to solutions that are not necessarily optimal but, rather, satisfactory, as suggested by bounded rationality 86.

Methods.
Participants. Two different groups of ten healthy participants took part in the experiments mentioned above: six females, four left-handed, age = 35.4 ± 2.6 years in Experiment 1, and four females, one left-handed, age = 34.3 ± 11.67 years in Experiment 2. All participants self-reported normal or corrected-to-normal vision and no history of epilepsy. All participants were provided with a participant information sheet informing them about the nature and objective of the experiment and were also briefed about the experiment before the start of the session. They then signed an informed consent form if they agreed to proceed with the experiment. The participants were comfortably seated in a medical chair at about 80 cm from an LCD screen. After the experiment, the participants received a monetary remuneration for their time of £16 in Experiment 1, and of £12 plus an additional remuneration of up to £6 (depending on their performance) in Experiment 2. The total duration of the experiments was around 50 to 70 minutes, depending on the speed of response of the participants.
Stimuli description. Experiment 1: Patrol. Participants were presented with video sequences (frame rate = 4 Hz) of a dynamic environment representing the viewpoint of a user walking at a constant pace along a corridor, where characters could appear from doorways, located on either side of the corridor, for one frame (Fig. 2(b)). Each participant had to decide, as quickly as possible and within 2.5 s, whether the character crossing the corridor was wearing a helmet (by clicking the left mouse button) or a cap (by clicking the right mouse button). After reporting their decision, participants were asked to indicate, within 2 s and using the mouse wheel, their degree of confidence in that decision on an 11-point scale (from 0 = not confident to 100 = very confident, in steps of ten). The experiment was composed of 12 blocks of 42 trials, each trial corresponding to a doorway encountered while walking down the corridor. In each block, 14 trials had empty doorways (no decision required), 14 trials contained a person wearing a helmet, and 14 trials contained a person wearing a cap. The sequence of trials was randomised, and the same sequence was used with all participants, which allowed group decisions to be simulated offline. Prior to the start of the experimental session, each participant underwent a brief training session of 21 trials (approximately two minutes) to familiarise them with the task. Experiment 2: Outpost. In this experiment, each participant viewed a scene simulating their being at an outpost and viewing an area with a house and several trees through a (simulated) night-vision camera (Fig. 2(c)). In each trial, a character appeared from a distance, either from the house or from the adjoining forest cover on either side, and walked towards the outpost. The video sequence had a frame rate of 10 Hz.
The participant had to decide, as quickly as possible, whether the character was wearing a helmet (by clicking the left mouse button) or a cap (by clicking the right mouse button). After each response, participants were asked to indicate (within 2 s) their decision confidence on a scale from 0 (not confident) to 100 (very confident), in steps of ten, by using the mouse wheel. The experiment included a point-based reward system based on the correctness of the decision and the RT of the participant. When a participant made a correct decision, they gained more points for faster RTs than for slower ones. In the case of incorrect responses, points were deducted (a penalty) in proportion to the RT. Moreover, to simulate the risk of waiting too long to make a decision, in each trial the character disappeared after a random time. If the participant had not made any decision by then, the trial was labelled as incorrect and the maximum penalty was applied. At the end of the experiment, the number of points accumulated by the participant was converted into currency (between £0 and £6) to determine the extra remuneration for the volunteer. The point-based reward system attempted to simulate a high-pressure, critical decision-making situation where the user must respond correctly and as quickly as possible. The experiment was composed of six blocks of 60 trials. In each block, 30 trials contained a person wearing a helmet, and 30 trials contained a person wearing a cap. The sequence of trials was randomised, and the same sequence was used with all participants to enable group decisions to be simulated offline. Prior to the start of the experimental session, each participant underwent a brief training session of 15 trials (approximately two minutes) to familiarise them with the task.
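The paper does not give the exact point formula, but a scheme with the stated properties (faster correct answers earn more, slower errors cost more, a missed trial gets the maximum penalty) might look like this. All constants and the linear scaling are illustrative assumptions:

```python
def trial_points(correct, rt, max_rt, max_gain=100, max_penalty=100):
    """Illustrative point scheme for one trial.
    correct: whether the decision was right; rt: reaction time in seconds
    (None if no response was made); max_rt: time at which the character
    disappears. Constants are hypothetical, not taken from the paper."""
    if rt is None or rt > max_rt:          # missed trial: maximum penalty
        return -max_penalty
    speed = 1 - rt / max_rt                # 1 = instant, 0 = at the deadline
    if correct:
        return max_gain * speed            # faster correct answers earn more
    return -max_penalty * (1 - speed)      # slower errors are penalised more
```

Summing these points over all trials and mapping the total linearly onto £0-£6 would reproduce the performance-dependent extra remuneration.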
Data recording and pre-processing. A Biosemi ActiveTwo EEG system was used to record the neural signals from 64 electrode sites following the 10-20 international system. The EEG data were sampled at 2048 Hz, referenced to the mean of the electrodes placed on the earlobes, and band-pass filtered between 0.15 and 40 Hz to reduce electrical noise. Artefacts caused by eye-blinks and other ocular movements were removed using a standard subtraction algorithm based on correlations to the averages of the differences between channels Fp1-F1 and Fp2-F2. EEG signals, RT, reported confidence, skin conductance, heart rate variability, respiration frequency and profile, pupil dilation, eye movements and eye blinks were simultaneously recorded during the experiments. RTs were measured by time-stamping the clicks of an ordinary USB mouse when the participant responded. For this study, we used only the EEG, the RTs and the reported confidence.
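The correlation/subtraction approach to ocular-artefact removal can be illustrated with a regression-based sketch. This is a simplification of the algorithm described above (the exact method may differ): the bipolar differences Fp1-F1 and Fp2-F2 serve as ocular proxies, and each EEG channel has its least-squares projection onto them subtracted.

```python
import numpy as np

def subtract_ocular_artefacts(eeg, fp1, f1, fp2, f2):
    """Regression-based ocular artefact removal (illustrative).
    eeg: (channels, samples); fp1, f1, fp2, f2: single-channel arrays.
    The residual of each channel after regressing out the two bipolar
    EOG proxies is returned."""
    eog = np.vstack([fp1 - f1, fp2 - f2])              # (2, samples) proxies
    # per-channel regression coefficients, solved by least squares
    b, *_ = np.linalg.lstsq(eog.T, eeg.T, rcond=None)  # (2, channels)
    return eeg - b.T @ eog                             # subtract the fit
```

By construction the residual is orthogonal to the proxy signals, so blink activity correlated with Fp1-F1 and Fp2-F2 is removed from every channel.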
For each trial, the EEG data were segmented into response-locked epochs, starting from 1700 milliseconds (ms) before the response and lasting for 1900 ms. The epochs were then detrended and low-pass filtered with a pass band of 0-14 Hz and a stop band of 16-1024 Hz using an optimal Finite Impulse Response (FIR) filter designed with the Remez exchange algorithm. Finally, the data were down-sampled to 32 Hz and each epoch was trimmed by removing 200 ms from its beginning and end. The remaining 1500 ms of each epoch were further analysed.

In our study, we simulated exactly the circumstance discussed earlier, where the reaction (a button press in our experiment) of the participant was known to the BCI system, but what caused it, and when, was not. Hence, to reconstruct the RT for such situations, we needed to detect the onset of stimuli ('stimuli detector'). To achieve this, in each trial we stepped back, frame by frame, from the time of the response ('response event') until a frame was found where the change in average RGB values with respect to the preceding frame was above a certain threshold; this frame was considered to represent the moment of appearance of the character ('stimulus event') that eventually caused the button press. The reconstructed RTs were then calculated by subtracting the stimulus event time from the corresponding response event time. Figure 8 shows the difference between the averages of the true (in blue) and estimated (in red) RTs across all participants in the patrol and outpost experiments. The efficacy of our RT reconstruction algorithm is evident from the small absolute difference between the true and estimated RTs in both experiments (0.0875 s in the patrol experiment and 0.0807 s in the outpost experiment). In the patrol experiment, the estimated RT is larger than the true RT because the stimuli detector sometimes missed the characters on-screen, in which case the stimulus onset is taken to be the onset of the previous stimulus.
Conversely, the estimated RT is lower than the true RT in the outpost experiment because, in some circumstances, the stimuli detector identified the character on screen later than its actual onset. Nevertheless, the small differences in the RTs produced no significant changes in the confidence estimates of the decision-support methods.
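The backward frame scan described above can be sketched as follows. This is a minimal illustration: the threshold value, the frame rate and the fall-back behaviour are our assumptions, not the experiment's actual parameters.

```python
import numpy as np

def reconstruct_rt(frames, response_frame, response_time, fps=60.0,
                   threshold=5.0):
    """Scan backwards from the response to locate the stimulus onset.

    frames:         sequence of HxWx3 video frames.
    response_frame: index of the frame shown at the button press.
    response_time:  time of the button press in seconds.
    threshold:      minimum jump in mean RGB value taken to mark the
                    appearance of a character (value is hypothetical).
    """
    for f in range(response_frame, 0, -1):
        change = abs(frames[f].astype(float).mean()
                     - frames[f - 1].astype(float).mean())
        if change > threshold:
            return response_time - f / fps   # response time minus stimulus time
    return None  # onset missed; the text falls back to the previous stimulus
```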
Labelling the epochs. Our cBCI approach to group decision-making assigns higher weights to individual decisions for which a participant was confident (and therefore more likely to be correct) and lower weights to decisions for which the participant was unsure (and more likely to be incorrect) 13,82 . To achieve this, we trained our cBCI using the correctness of individual decisions, which is available to the cBCI in the training set: trials in which the participant made a correct decision were labelled as correct, while those in which the participant made an incorrect decision were labelled as incorrect. In this approach, the cBCI is trained to predict whether the user made a correct or an incorrect decision, rather than to decode targets and non-targets. The same approach was used to train the decision-support systems that employ only behavioural data (RTs and reported confidence) to make their predictions.
Estimation of individual decision confidences. Common Spatial Patterns (CSP) 87 was used to extract, from each epoch, characteristic neural features that distinguish between correctly and incorrectly labelled trials. The main idea behind CSP is to transform the multi-channel EEG data into a low-dimensional spatial subspace using a projection matrix that maximises the difference in variance between the two classes of signals. We used an eight-fold cross-validation to split the data into training and test sets. Each training set was used to compute a CSP projection matrix, which was then applied to transform the data of the corresponding test set into the low-dimensional subspace. The differences in variance between the two classes (i.e., correct and incorrect responses) are largest in the first and last dimensions of the subspace. Hence, the logarithms of the variances of the first and last spatial subspaces, together with the reconstructed RT (which is known to influence decisions 88 ) and the reported confidence (when required), were used as features for a random forest model that predicts the decision confidence. The model was fitted using 100 decision trees and the Gini criterion. The random forest approach fits sub-samples (drawn with replacement) of the dataset to individual decision trees and averages their outputs; this improves prediction accuracy and reduces over-fitting. Thanks to cross-validation, all confidence estimates were obtained from test sets, i.e., from inputs not previously seen by the machine-learning model. This method was adopted to avoid over-fitting and to deliver robust confidence estimates even with small data samples. A similar random forest model was used to calibrate the decision confidence of trials from their corresponding response times (when required).
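A minimal sketch of the CSP feature extraction is given below, using the generalised eigendecomposition formulation of CSP; the composite-covariance normalisation shown is one common variant and may differ in detail from the cited implementation. The resulting two log-variance features, concatenated with the reconstructed RT and (when required) the reported confidence, would then be passed to a random forest, e.g. scikit-learn's RandomForestClassifier(n_estimators=100, criterion='gini').

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(epochs_a, epochs_b):
    """Compute a CSP projection from two lists of (channels x samples) epochs.

    Solves the generalised eigenproblem C_a w = lambda (C_a + C_b) w;
    the first/last eigenvectors minimise/maximise the variance of class a
    relative to the composite covariance.
    """
    def mean_cov(epochs):
        return np.mean([np.cov(e) for e in epochs], axis=0)

    ca, cb = mean_cov(epochs_a), mean_cov(epochs_b)
    _, w = eigh(ca, ca + cb)   # eigenvalues in ascending order
    return w                    # columns are spatial filters

def csp_logvar_features(epoch, w):
    """Normalised log-variance of the first and last CSP components."""
    z = w.T @ epoch
    v = np.var(z[[0, -1]], axis=1)
    return np.log(v / v.sum())
```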
Formation of groups. Formally, each participant, p, has a final confidence weight w_{p,i}(t) for each trial i, obtained from their estimated decision confidence (cBCI-based or not), with or without the reported confidence, depending on the time t after the stimulus event. Group decisions are then made by weighted majority:

D_i(t) = sign( Σ_p w_{p,i}(t) d_{p,i}(t) ),    (1)

where d_{p,i}(t) ∈ {−1, +1} is the decision of participant p in trial i when checked at time t. Both w_{p,i}(t) and d_{p,i}(t) are taken to be 0 if the participant has not yet made a decision at time t.
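The weighted-majority rule can be sketched in a few lines of Python; how ties (a zero sum) are resolved is our assumption, since the text does not specify it.

```python
import numpy as np

def group_decision(weights, decisions):
    """Weighted majority: sign of the confidence-weighted sum of votes.

    weights:   w_{p,i}(t); 0 for members who have not yet responded.
    decisions: d_{p,i}(t) in {-1, +1}; 0 for members who have not responded.
    A non-positive sum is mapped to -1 (tie handling is hypothetical).
    """
    return 1 if float(np.dot(weights, decisions)) > 0 else -1
```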
Groups of size m = 2, . . . , 10 were formed offline by considering all C(10, m) combinations of the 10 participants.
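For example, the offline group enumeration amounts to:

```python
from itertools import combinations
from math import comb

participants = range(10)
# all groups of size m = 2..10 drawn from the 10 participants
groups = {m: list(combinations(participants, m)) for m in range(2, 11)}
```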
Designing the anytime morphing approach to make group decisions. The anytime morphing approach works as follows. In a group of responders, when the first responder reacts to a stimulus event in the video feed by clicking a mouse button to signify the presence of a target or a non-target, a clock starts. Within a few milliseconds the software identifies the stimulus event and can, therefore, reconstruct the RT of the first responder. The EEG data are also already available, so a first approximation of the confidence can be computed immediately by the BCI. The group decision at this stage is simply the decision of the first responder. Then, every 100 ms after the first response, the system looks for other group members who have responded, uses the first responder's stimulus event to estimate their RTs, computes their cBCI confidences and uses the corresponding weighted majority (Eq (1)) to produce the group decision (which may, therefore, change over time as more team members react to the stimulus). At every clock tick, the system also checks whether any of the team members who previously responded have manually provided a confidence value; where this has happened, the reported confidence is added to the input features to obtain a new cBCI-estimated confidence. Every time either of these events occurs, the weights and the group decision are updated accordingly.
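The tick-by-tick procedure can be sketched as an offline replay of a single trial. In this sketch the confidence function stands in for the random-forest cBCI output, and the 3-second horizon is an arbitrary choice of ours, not a parameter from the experiment.

```python
import numpy as np

def anytime_group_decisions(response_times, decisions, confidence_fn,
                            tick=0.1, horizon=3.0):
    """Replay the anytime morphing procedure for one trial.

    response_times: reconstructed RTs in seconds; use float('inf') for
                    members who never respond.
    decisions:      individual decisions in {-1, +1}.
    confidence_fn:  (participant, time) -> current confidence estimate;
                    a stand-in for the random-forest cBCI output.
    Yields (t, group_decision) every `tick` seconds from the first
    response, so the group decision can morph as members respond.
    """
    t = min(response_times)
    end = t + horizon
    while t <= end:
        # only members who have responded by time t contribute
        w = np.array([confidence_fn(p, t) if rt <= t else 0.0
                      for p, rt in enumerate(response_times)])
        d = np.array([dec if rt <= t else 0
                      for dec, rt in zip(decisions, response_times)])
        yield t, (1 if float(w @ d) > 0 else -1)
        t += tick
```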