## Introduction

Studies exploring the ‘wisdom of crowds’ effect on perceptual decision making have, so far, not investigated the implications of optimum group size, task difficulty, and inter-subject correlations for group performance using neural signal analysis. The current study systematically addresses these issues, for which, to the best of our knowledge, no evidence is available. We show that overall group performance improves as a function of group size. A gradual decrease in individual confidence and an increase in the participants’ classification error rate are observed with increasing task difficulty. The error is significantly reduced upon combining the individual decisions using the group aggregation rule. We demonstrate that similar performance benefits can be achieved with a smaller group instead of all members of the largest group, which can potentially reduce cost and computation time. It is generally believed that high correlations in participant judgements could imply common sources of variability in perception, hence reducing pooling benefits19. Inter-subject correlation (ISC) has previously been shown to be a potential marker of engagement24,25. ISC has also been shown to predict memory in the context of online educational videos26. Here we explored the Group-ISC in the context of perceptual decision making under varying task difficulty. The group inter-subject correlation decreased as group decision accuracy increased across difficulty levels, and it can be argued that ISC is indicative of group decision benefit.

Overall, our results demonstrate that although the relative advantages of combining neural activity across multiple brains might depend on the inter-subject correlation across difficulty levels, group decisions achieve accuracy rates that far surpass that of the average individual.

## Materials and methods

### Ethics statement

The study was carried out following institutional guidelines. All experimental protocols were approved by the ‘Institute Ethics Committee’ of Indian Institute of Science Education and Research Kolkata, India. Participants gave written informed consent in accordance with the Declaration of Helsinki.

### Stimuli and display

The data set consisted of 7.8 inch $$\times$$ 7.8 inch 8-bit indoor or outdoor natural scene images, with 400 target-present images and 400 target-absent images. The participants were seated at a distance of 30 inches from the display. Images subtended a visual angle of 14.81 degrees $$\times$$ 14.81 degrees.

### Experiment

#### Experiment 1

Eighteen healthy human volunteers (ages: 21–27, mean: 24.66, std: 2.42, seven female) participated in the study. All participants had normal or corrected-to-normal vision, disclosed no history of neurological problems, and were naive to the purpose of the study. The experiment consisted of 800 trials split into 32 successive sessions performed in one sitting. The total duration of the experiment was $$\sim 1.5$$ hours. All subjects were right-handed and used the same hand throughout the experiment to register their responses via mouse click. One participant was excluded from the analysis due to a high degree of noise in the neural data. The participants performed a visual search task in which, following the presentation of a cue, they detected targets present in natural images. Targets were semantically consistent with the scene. Participants had no prior information about the target location or appearance. Participants fixated on a central cross and clicked anywhere on the screen when they were ready. After a delay of 50 ms, a target cue was presented for 100 ms, followed by a variable delay of 500–800 ms. The stimulus was presented for 50 ms, followed by a delay of 700 ms, after which the response screen appeared. The participants reported their decision using a 10-point confidence scale (the numbers 1 to 10 were displayed in random order to avoid any motor bias), with a rating of 1 indicating complete confidence that the target was present and a rating of 10 indicating complete confidence that it was absent (see Fig. 1A).

#### Experiment 2

To assess the difficulty of the trials, another behavioural experiment was conducted with 20 new participants (ages: 23–28, mean: 24.5, std: 1.32, 11 female) using the same set of images and target cues as in Experiment 1. After each stimulus presentation, participants were asked to report whether the task of detecting the presence/absence of the target within the natural-scene image seemed ‘Easy’ or ‘Difficult’ to them. The response screen consisted of the words ‘Easy’ and ‘Difficult’ on either side of the screen in random order. Tasks were labelled based on the proportion of participants rating a particular target detection task difficult, and the 800 images were divided into 5 difficulty groups for further analysis.

### Data acquisition and preprocessing

EEG activity was recorded from 64 active shielded electrodes mounted in an EEG cap following the international 10/20 system, at a sampling rate of 512 Hz. Trials were band-pass filtered (0.01–40 Hz), epoched, and then re-referenced to the average reference. Trials with ocular artifacts (blinks and eye movements), detected using bipolar electro-oculograms (EOG) with amplitude exceeding ± 100 μV or by visual inspection, were excluded from the analysis. The high-dimensional (number of electrodes $$\times$$ number of time-points) EEG data was reduced to a one-dimensional neural decision variable using a multivariate pattern classifier27. The preprocessed EEG signals were time-locked to stimulus onset and included a 200 ms pre-stimulus baseline and a 700 ms post-stimulus interval.
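The main preprocessing steps (band-pass filtering, epoching, average referencing, baseline correction) can be sketched as follows. This is a minimal SciPy illustration, not the authors' exact pipeline; the array shapes and onset indices are assumptions for the example.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_epoch(eeg, onsets, fs=512, band=(0.01, 40.0), pre=0.2, post=0.7):
    """Band-pass filter continuous EEG (channels x samples), then cut
    stimulus-locked epochs spanning `pre` s before to `post` s after onset."""
    sos = butter(4, band, btype='bandpass', fs=fs, output='sos')
    filt = sosfiltfilt(sos, eeg, axis=-1)            # zero-phase filtering
    n_pre, n_post = int(pre * fs), int(post * fs)
    epochs = np.stack([filt[:, o - n_pre:o + n_post] for o in onsets])
    # average reference: subtract the mean across channels at every sample
    epochs -= epochs.mean(axis=1, keepdims=True)
    # baseline correction using the pre-stimulus interval
    epochs -= epochs[:, :, :n_pre].mean(axis=2, keepdims=True)
    return epochs                                    # (trials, channels, samples)
```

Artifact rejection (EOG amplitude thresholding and visual inspection) would then remove contaminated epochs before further analysis.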

### Dimension reduction

Since the neural data is high-dimensional and suffers from the small-sample-size problem27, a principal component analysis (PCA) based non-linear feature extraction technique, ‘Classwise Principal Component Analysis’ (CPCA)27, was used to reduce the dimensionality of the EEG signals and extract informative features. CPCA has been used successfully in previous studies to extract the multivariate pattern of neural signals28,29,30,31,32,33. The main goal of CPCA is to identify and discard the non-informative subspace in the data by applying principal component analysis to each class. In this technique, PCA is applied locally to each class rather than to the whole data, yielding a low-dimensional subspace per class. The PCA subspace is further reduced to extract discriminatory features using a linear feature extraction technique such as Linear Discriminant Analysis (LDA). The features retained in CPCA are the ones that carry discriminatory information between the two classes. The classification is then carried out in the residual space, in which small-sample-size conditions and the curse of dimensionality no longer hold. The method is described thoroughly in the Supplementary Information.
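As a rough illustration of the classwise-PCA-then-LDA idea (a simplified sketch, not the published CPCA algorithm, which is described in the Supplementary Information), the following two-class example applies PCA within each class, pools the retained subspaces, and computes Fisher-LDA weights there. The variance threshold and the small regularisation constant are arbitrary choices for the example.

```python
import numpy as np

def classwise_pca_lda(X, y, var_keep=0.95):
    """Simplified two-class sketch: PCA within each class discards
    low-variance directions; Fisher LDA in the pooled retained subspace
    yields a 1-D decision variable. X: (n_trials, n_features); y: 0/1."""
    bases = []
    for c in (0, 1):
        Xc = X[y == c] - X[y == c].mean(axis=0)
        _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        var = S**2 / (S**2).sum()
        k = int(np.searchsorted(np.cumsum(var), var_keep)) + 1
        bases.append(Vt[:k])                 # top-k class components
    Q, _ = np.linalg.qr(np.vstack(bases).T)  # orthonormal union of subspaces
    Z = X @ Q
    mu0, mu1 = Z[y == 0].mean(0), Z[y == 1].mean(0)
    Sw = np.cov(Z[y == 0].T) + np.cov(Z[y == 1].T)
    w = np.linalg.solve(Sw + 1e-6 * np.eye(Sw.shape[0]), mu1 - mu0)
    return lambda Xnew: (Xnew @ Q) @ w       # projection to decision variable
```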

### Integrating behavioural and neural decision variables across multiple brains

The behavioural confidence ratings and the neural decision variables were combined separately, following the two rules described below, to generate group behavioural and group neural decisions.

#### Majority pooling algorithm

While combining the behavioural and neural decision variables, a criterion was fixed per participant, on the basis of which the final outcome of the group decision was determined. For participant j, let the criterion be $$c_j$$. For behavioural data, confidence ratings range between 1 and 10, both endpoints inclusive, and $$c_j$$ is taken to be the midway value, 5.5, for each j. In the case of neural data, however, the neural decision criterion $$c_j$$ was varied across all possible values of the training neural decision variables such that the true positive rate is maximized; hence it differed from one participant to another (see19 for details). For each trial i, the number of participants deciding on the presence of the target is quantified as

$$N_i = \sum _{j=1}^N I_{[0,\infty )}(c_j-R_{i,j}),$$

where $$R_{i,j}$$ is the rating or neural decision variable on the $$i$$th trial for the $$j$$th individual, N is the number of participants in the group, and

$$I_{[0,\infty )}(x) = {\left\{ \begin{array}{ll} 1 &{} \text{ if } x \in [0,\infty ),\\ 0 &{} \text{ otherwise } . \end{array}\right. }$$

The outcome of the group decision $$G_i$$ is then defined as follows:

If N is odd,

$$G_i = {\left\{ \begin{array}{ll} 1 &{} \text{ if } N_i > (N-1)/2,\\ 2 &{} \text{ if } N_i \le (N-1)/2. \end{array}\right. }$$

If N is even,

$$G_i = {\left\{ \begin{array}{ll} 1 &{} \text{ if } N_i > N/2,\\ 2 &{} \text{ if } N_i < N/2,\\ r &{} \text{ if } N_i = N/2, \end{array}\right. }$$

where r is a random decision (about presence/absence of target) drawn from a discrete uniform distribution with support $$\{1,2\}$$. Here the outcome ‘1’ denotes presence of the target and ‘2’ denotes the absence.
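A minimal sketch of this majority rule, assuming R is a trials $$\times$$ participants array and using the coding above (1 = present, 2 = absent), could be:

```python
import numpy as np

def majority_pool(R, c, rng=None):
    """Majority pooling. R: (n_trials, N) ratings or neural decision
    variables; c: length-N per-participant criteria. Returns G_i per trial,
    1 = target present, 2 = absent; even-N ties are broken at random."""
    rng = np.random.default_rng() if rng is None else rng
    N = R.shape[1]
    Ni = (R <= c).sum(axis=1)          # votes for 'present': c_j - R_ij >= 0
    G = np.where(Ni > N / 2, 1, 2)     # strict majority => present
    ties = Ni == N / 2                 # possible only when N is even
    G[ties] = rng.integers(1, 3, size=ties.sum())
    return G
```

For integer counts and odd N, the condition $$N_i > N/2$$ coincides with $$N_i > (N-1)/2$$, so a single comparison covers both the odd and even cases.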

#### Weighted average pooling algorithm

A decision variable $$D_i$$ was obtained by taking the weighted average of the confidence ratings/ neural decision variables for every trial i for a group of participants (refer to19). Mathematically, $$D_i$$ is represented as

$$D_i = \varvec{w}'\varvec{R_i},$$

where the column vector $$\varvec{R_i}$$ consists of the ratings or neural decision variables of the participants in the group, and the column vector $$\varvec{w}$$ contains the linear weights.

For each random group of N participants, the behavioural ratings/neural decision variables corresponding to 90% of the trials were used for training and the rest for testing the performance of the algorithm. This was repeated 10 times. The weights $$\varvec{w}$$ were estimated from the training data by maximizing Fisher’s criterion function as follows:

$$\varvec{w}= S_w^{-1} (\varvec{\mu _1} - \varvec{\mu _2}),$$

where $$\varvec{\mu _1}$$ and $$\varvec{\mu _2}$$ are the mean rating/neural decision variable for present trials and absent trials, respectively, for N participants, and

$$S_w = \sum _{\varvec{R_i} \in C_1} (\varvec{R_i}-\varvec{\mu _1})(\varvec{R_i}-\varvec{\mu _1})' + \sum _{\varvec{R_i} \in C_2} (\varvec{R_i}-\varvec{\mu _2})(\varvec{R_i}-\varvec{\mu _2})',$$

with $$C_1$$ and $$C_2$$ denoting the class of present trials and absent trials, respectively. But before inverting $$S_w$$, to tackle the issue of singularity (if any), it is regularized as

$$S_w(\gamma )= S_w + \gamma \varvec{I}_{N \times N},$$

where $$\gamma$$ is the ridge parameter. In our case, we chose $$\gamma = 2$$. $$\varvec{I}_{N \times N}$$ is the identity matrix, and N is the total number of participants in the group. The decision criterion C, which maximized the proportion correct, was also estimated from the training data. The group decision $$G_i$$, defined below, was obtained by comparing $$D_i$$ to criterion C. Define

$$G_i = {\left\{ \begin{array}{ll} 1 &{} \text{ if } D_i \le C,\\ 2 &{} \text{ if } D_i > C. \end{array}\right. }$$

The outcome ‘1’ denotes the presence of the target and ‘2’ denotes the absence.
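Putting the pieces above together, a sketch of the weighted-average rule could look like the following. One practical detail not spelled out in the equations is that the orientation of $$D_i$$ relative to C depends on the class coding, so this sketch learns both the criterion and the comparison direction from the training data; the data shapes are assumptions for the example.

```python
import numpy as np

def fisher_pool(R_train, y_train, R_test, gamma=2.0):
    """Weighted-average pooling sketch. R_*: (n_trials, N) ratings or neural
    decision variables; y: 1 = target present, 2 = absent. Weights maximise
    Fisher's criterion; S_w is ridge-regularised before inversion."""
    P, A = R_train[y_train == 1], R_train[y_train == 2]
    mu1, mu2 = P.mean(axis=0), A.mean(axis=0)
    Sw = (P - mu1).T @ (P - mu1) + (A - mu2).T @ (A - mu2)
    w = np.linalg.solve(Sw + gamma * np.eye(R_train.shape[1]), mu1 - mu2)
    D = R_train @ w
    # choose the criterion C (and the comparison direction s, since the
    # sign of w depends on class coding) maximising training accuracy
    best = (-1.0, 0.0, 1)
    for C in np.unique(D):
        for s in (1, -1):                  # s = 1: 'present' iff D <= C
            acc = np.mean(np.where(s * (D - C) <= 0, 1, 2) == y_train)
            if acc > best[0]:
                best = (acc, C, s)
    _, C, s = best
    return np.where(s * (R_test @ w - C) <= 0, 1, 2)
```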

We have limited the group decision rules to simple majority and weighted average pooling since these are the two most commonly used pooling algorithms19,20. Linear combination has been shown to be superior to majority voting on visual search tasks20 based on behavioural and eye tracking data. We have used both the pooling algorithms to fuse data from multiple brains using both behavioural and neural data.

### Computation of classification error rate

Quantifying the error in collective decision making is crucial to understanding the performance benefits achieved by ‘the wisdom of crowds’. In the present study, the behavioural errors were calculated in a classification setup where the participants’ confidence ratings were taken as the explanatory variables, or features, and the ground truth (absent/present) of the trials was used as the binary categorical variable. Weights were calculated using 10-fold cross-validation34 on the trials as before. The absolute difference between the posterior probabilities obtained from classification and the ground truth was taken as the classification error rate.
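As an illustration of the per-trial error measure, the sketch below fits a plain logistic model by gradient descent (the study's actual classifier and cross-validation setup may differ) and returns |posterior − ground truth| per test trial; the 0/1 coding of the ground truth is an assumption for the example.

```python
import numpy as np

def posterior_error(X_train, y_train, X_test, y_test, lr=0.1, steps=2000):
    """Fit a logistic model on training features, then compute the per-trial
    classification error rate |posterior - ground truth| on the test set.
    y is coded 1 = target present, 0 = absent."""
    Xb = np.hstack([X_train, np.ones((len(X_train), 1))])  # add bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):                                 # gradient descent
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y_train) / len(y_train)
    Xt = np.hstack([X_test, np.ones((len(X_test), 1))])
    post = 1.0 / (1.0 + np.exp(-Xt @ w))                   # posterior P(present)
    return np.abs(post - y_test)                           # per-trial error in [0, 1]
```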

To fully evaluate the potential of multi-brain computing for perceptual categorization, it is also important to quantify the error in predicting the ground truth using the neural information from multiple brains. The high-dimensional (59 electrodes $$\times$$ 354 time-points) EEG data coming from each participant was reduced to a single dimension using a multivariate feature extraction algorithm, discussed before in ‘Dimension Reduction’ subsection (27, see Supplementary Information for details). For each group, the weighted average decision rule was used to combine the neural information across the individual participants and the posterior probabilities of the classification were used to compute the errors.

In both cases, for every group size N, 100 distinct samples each containing N participants were generated, and the mean error across these 100 groups of size N was extracted per trial. Next, trials were categorized according to their difficulty levels. This allowed us to compare the errors across group sizes and task difficulty levels.

While it is well-known that collective judgement is more accurate than individual decisions, whether increasing the group size always results in a better group performance is still debatable. To address this, we performed pairwise comparisons between the classification error rates of groups of different sizes across all difficulty levels using the Wilcoxon rank sum test35. Benjamini-Hochberg method36 was used to correct for multiple hypothesis testing related to the different difficulty levels.
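The Benjamini-Hochberg step-up correction used for these pairwise comparisons can be sketched as follows; in practice the input would be p-values from, e.g., `scipy.stats.ranksums` comparisons across the difficulty levels (the values in the usage below are illustrative).

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg step-up FDR correction: returns adjusted p-values
    (each is compared to the desired FDR level, e.g. 0.05)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = np.empty(m)
    running = 1.0
    for rank in range(m, 0, -1):       # from the largest p down to the smallest
        i = order[rank - 1]
        running = min(running, p[i] * m / rank)
        adj[i] = running               # enforce monotonicity of adjusted values
    return adj
```

For example, `bh_adjust([0.01, 0.02, 0.03, 0.5])` yields adjusted values `[0.04, 0.04, 0.04, 0.5]`.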

### Inter-subject correlation of a group

The Inter-subject correlation of a group (or the Group-ISC) is defined as the average of the pairwise correlations of the members of that group. To find the relationship between group inter-subject correlation and task difficulty, we consider pairwise correlations of participants’ behaviour across trials of a particular difficulty level. For a group of size $$N (>1)$$, we take the mean of the $$\left( {\begin{array}{c}N\\ 2\end{array}}\right)$$ pairwise correlations. For $$N<17$$, this procedure was repeated 100 times, generating distinct samples of N participants each time, and the Group-ISC was extracted (for $$N=17$$, only one distinct sample is possible, so the procedure is performed on that single sample).
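The behavioural Group-ISC and its resampling over random groups can be sketched directly from this definition; the ratings matrix layout is an assumption for the example.

```python
import numpy as np
from itertools import combinations

def group_isc(ratings, members):
    """Group-ISC: mean of the C(N,2) pairwise Pearson correlations between
    the chosen participants' ratings. ratings: (n_trials, n_participants)."""
    pcs = [np.corrcoef(ratings[:, a], ratings[:, b])[0, 1]
           for a, b in combinations(members, 2)]
    return float(np.mean(pcs))

def mean_group_isc(ratings, N, n_samples=100, rng=None):
    """Average Group-ISC over random groups of size N (a single group when
    N equals the total number of participants)."""
    rng = np.random.default_rng() if rng is None else rng
    P = ratings.shape[1]
    if N == P:                          # only one distinct group possible
        return group_isc(ratings, range(P))
    return float(np.mean([group_isc(ratings, rng.choice(P, N, replace=False))
                          for _ in range(n_samples)]))
```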

Assuming that the neural signal acquired using EEG is reflective of the behavioural decision of the participant, group inter-subject correlation should be obtainable from the EEG data as well. Post-stimulus pre-processed EEG data was used to generate the inter-subject correlations using correlated component analysis37 based on the concept of finding the linear combinations of electrodes that have maximum correlations between participants. Here components are found using the following steps:

Step 1: The cross-covariance of the electrodes in participant k with the electrodes in participant l is calculated as

$$R_{kl}= \frac{1}{T \times TR} \sum _{t=1}^{T} \sum _{tr=1}^{TR}(\varvec{x_k}(t,tr)-\overline{\varvec{x_k}})(\varvec{x_l}(t,tr)-\overline{\varvec{x_l}})',$$

where T and TR are the total number of time points and trials, respectively, vector $$\varvec{x_k}(t,tr)$$ represents the voltages recorded across all electrodes at time t for trial tr in participant k, and $$\overline{\varvec{x_k}}$$ is their mean value over all time points of all trials.

Step 2: The pooled between subject cross-covariance is calculated as

$$R_b= \frac{1}{n(n-1)} \sum _{k=1}^{n} \sum _{l=1,l\ne k}^{n} R_{kl},$$

where n is the total number of participants.

Step 3: The pooled within-subject cross-covariance is calculated as

$$R_w= \frac{1}{n} \sum _{k=1}^{n} R_{kk}.$$

Step 4: The largest eigenvalue of $$R_w^{-1}R_b$$ and the corresponding eigenvector are calculated. This vector captures the largest correlation between the participants. But before inverting $$R_w$$, to tackle the issue of singularity (if any), it is regularized as

$$R_w(\gamma )= R_w + \gamma \varvec{I}_{E \times E},$$

where $$\gamma$$ is the ridge parameter. In our case, we chose $$\gamma = 0.005$$. $$\varvec{I}_{E \times E}$$ is the identity matrix, and E is the total number of electrodes.

Step 5: Let $$\varvec{v}$$ be the eigenvector corresponding to the largest eigenvalue of $$R_w(\gamma )^{-1}R_b$$. Projecting $$(\varvec{x_k}(t,tr)-\overline{\varvec{x_k}})$$ in the direction of $$\varvec{v}$$ we get a component

$$y_{k}(t,tr)= \varvec{v}' (\varvec{x_k}(t,tr)-\overline{\varvec{x_k}}),\, k \in \{1,2,\ldots ,n\}.$$

$$y_{k}(t,tr)$$ is a linear combination of all the voltages of electrodes (centered about their mean) with weights being determined by the components of the eigenvector $$\varvec{v}$$.

The pairwise correlation of participant k and l is given by

$$\rho _{kl} = \underset{1 \le tr \le TR}{\text{ median }}\left( \frac{\frac{1}{T}\sum \nolimits _{t=1}^{T}{y_k(t,tr)y_l(t,tr)}}{\sqrt{\sigma _{k}^2(tr)\sigma _{l}^2(tr)}}\right) ,$$

where $$\sigma _{i}^2(tr) = \frac{1}{T}\sum \nolimits _{t=1}^{T} y_{i}^2(t,tr)$$, for $$i= k,\, l$$, respectively. The Inter-subject correlation of a group is calculated by taking the average of the pairwise correlations of its constituent members.
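Steps 1–5 can be condensed into a short numerical sketch. The tensor layout (subjects $$\times$$ electrodes $$\times$$ time points $$\times$$ trials) and the toy dimensions in the test are assumptions; a real analysis would run this on the pre-processed post-stimulus EEG.

```python
import numpy as np

def corrca_isc(X, gamma=0.005):
    """Correlated component analysis sketch (Steps 1-5).
    X: (n, E, T, TR) = subjects x electrodes x time points x trials.
    Returns (rho, v): the n x n pairwise-correlation matrix of the maximally
    correlated component, and the projection vector v."""
    n, E, T, TR = X.shape
    Xc = X - X.mean(axis=(2, 3), keepdims=True)      # centre per electrode
    flat = Xc.reshape(n, E, T * TR)
    # Step 1: cross-covariances R_kl averaged over time points and trials
    R = np.einsum('kas,lbs->klab', flat, flat) / (T * TR)
    # Steps 2-3: pooled within- and between-subject cross-covariances
    Rw = np.mean([R[k, k] for k in range(n)], axis=0)
    Rb = (R.sum(axis=(0, 1)) - n * Rw) / (n * (n - 1))
    # Step 4: regularise R_w, then take the leading eigenvector of Rw^-1 Rb
    M = np.linalg.solve(Rw + gamma * np.eye(E), Rb)
    evals, evecs = np.linalg.eig(M)
    v = np.real(evecs[:, np.argmax(np.real(evals))])
    # Step 5: project each subject; pairwise correlation, median over trials
    Y = np.einsum('e,kets->kts', v, Xc)              # (n, T, TR)
    rho = np.eye(n)
    for k in range(n):
        for l in range(k + 1, n):
            num = (Y[k] * Y[l]).mean(axis=0)
            den = np.sqrt((Y[k] ** 2).mean(axis=0) * (Y[l] ** 2).mean(axis=0))
            rho[k, l] = rho[l, k] = np.median(num / den)
    return rho, v
```

Averaging the off-diagonal entries of `rho` over a group's members then gives that group's neural Group-ISC.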

### Statistical tests

To compare the performance benefits arising from pooling individual decisions using the majority pooling and the weighted average algorithms, one-sided paired t-tests were conducted. Further, a two-sample t-test was performed in a control study to check whether the performance benefit achieved by combining electrodes from multiple brains is significantly higher than that achieved by adding electrodes from the same brain. A two-sample t-test, with the Benjamini-Hochberg procedure to control the false discovery rate (FDR), was used in the ERP analysis as well.

We used a two-way ANOVA to test for main effects of group size and difficulty level on the classification error rates of collective judgement. A three-way ANOVA was used to explore the main effects of group size, difficulty level and time on proportion correct. The epoched EEG signals were segmented into 40 ms windows and classification was performed individually in these windows. To inspect main effects of group size and difficulty level on the Group-ISC, a two-way ANOVA was performed. After the ANOVA, “Tukey’s honestly significant difference procedure”38 was carried out to identify the difficulty levels at which the Group-ISCs were significantly different.

The Wilcoxon rank sum test was used to test for equality of median classification error rates between different group sizes. The same non-parametric test was applied to compare the median proportion correct between two difficulty levels at different time windows. Another non-parametric test, the Wilcoxon signed rank test, was used to compare the median Group-ISC at difficulty level 1 with that at difficulty level 5 for different time windows. In each case, the issue of multiple hypothesis testing arises, as the tests need to be performed for either different group sizes or different time windows; to control the FDR, we used the Benjamini–Hochberg method36 throughout.

The sample correlation coefficients (r) between the difficulty levels and individual confidence were calculated separately for the 17 participants. The significance of the correlation coefficient was tested for each participant using a one-sample t-test. Further, to utilize all sample correlation coefficients together, we transformed each r to z using Fisher’s Z-transformation and tested the hypothesis regarding the correlation coefficient using a z-test. The association between the weighted average of the behavioural ratings and the linear combination of the neural decision variables for each difficulty level was measured by calculating the sample correlation coefficient between the two. The partial correlation coefficient between the weighted average of the behavioural ratings and the linear combination of the neural decision variables was also calculated. Significance of these correlation coefficients was tested using a one-sample t-test.
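The pooled Fisher-Z step can be sketched as below. This assumes each participant's correlation is computed from the same number n of trials and uses the standard large-sample variance 1/(n−3) of the transformed coefficient.

```python
import numpy as np
from scipy.stats import norm

def fisher_z_test(rs, n):
    """Pooled one-sided test of H0: rho >= 0 vs H1: rho < 0 via Fisher's
    Z-transformation. rs: per-participant sample correlations, each based on
    n paired observations; under rho = 0, each z_i ~ N(0, 1/(n-3))."""
    z = np.arctanh(np.asarray(rs, dtype=float))   # Fisher's Z-transform
    se = 1.0 / np.sqrt((n - 3) * len(rs))         # sd of the mean of the z_i
    stat = z.mean() / se
    return stat, norm.cdf(stat)                   # left-tailed p value
```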

## Results

The individual behavioural performance of the seventeen participants on the visual search (in natural scenes) task is shown in Fig. 1B. Response time and accuracy varied with decision confidence (see figures S1 C-D of Supplementary Information). Using logistic regression, it could be shown that RT successfully predicts single trial accuracy and for every individual, the odds ratio of accuracy decreases with increasing RT (refer to figure S1 E of Supplementary Information).

### Association between mean confidence and task difficulty

We hypothesize that the participants’ confidence in their decisions should decrease as the perceptual task becomes more difficult. In this experiment, the difficulty of a task was quantified by the proportion of participants from Experiment 2 who rated it difficult. Confidence ratings were converted to individual confidence values, which could take only integer values in the interval [1,5]. For example, a confidence rating of 1 or 10 means complete confidence that the target is present or absent, respectively; in both cases, the confidence value is taken as 5. A confidence rating of 5 or 6 implies least confidence and is assigned a confidence value of 1 in both cases. We calculated the correlation coefficient for each participant individually, and it turned out to be significantly negative for every one of them using a one-sample t-test (max(p values)$$<10^{-5}$$). Further, the null hypothesis $$\rho \ge 0$$ was tested against the alternative hypothesis $$\rho <0$$ using Fisher’s Z-transformation, where each $$r_i$$ (the sample correlation coefficient for the ith participant) is transformed to $$z_i$$ and the test statistic is formed using $$\overline{z}$$. The z-value and the corresponding p value turned out to be − 52 and less than $$10^{-100}$$, respectively. For visual clarity, the scatter plot of the mean confidence value on the 800 trials across the 17 participants against task difficulty is shown in Fig. 1C, with overall correlation coefficient − 0.78 and corresponding p value $$< 10^{-100}$$.
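The rating-to-confidence folding described above can be written compactly; this is a direct restatement of the stated mapping in code.

```python
import numpy as np

def rating_to_confidence(rating):
    """Map a 10-point confidence rating (1 = surely present, 10 = surely
    absent) to a 5-point confidence value: ratings 1/10 -> 5 (complete
    confidence), ratings 5/6 -> 1 (least confidence)."""
    r = np.asarray(rating)
    return np.where(r <= 5, 6 - r, r - 5)
```

For example, ratings `[1, 5, 6, 10]` map to confidence values `[5, 1, 1, 5]`.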

### ERP analysis

ERP analysis was performed on average-referenced and baseline-subtracted EEG signals for Target-Present and Target-Absent trials. Epochs of a particular channel were marked as noisy if their absolute differences from the median exceeded 5 times the interquartile range; such noisy epochs were excluded from further ERP analysis. After outlier detection, on average 681 (± 78.82 std) trials remained for analysis.

To elucidate whether the Target-Present and Target-Absent conditions induce different neural processing mechanisms, the grand average difference waveform was plotted (refer to Fig. 2A) on the scalp map using the EEGLAB Toolbox39. A clear difference between Target-Present and Target-Absent ERPs was visible across parietal electrodes from 300 ms post stimulus onset onwards.

Additionally, statistical analysis was performed on the ERP signals to select electrodes showing differences in activity between the Target-Present and Target-Absent conditions. Two-sample t-tests after multiple-comparison correction ($$p< 0.05$$, FDR corrected, in the window 220–350 ms post stimulus onset) revealed significant differences in activity in the parietal electrodes. The ERP waveform in Fig. 2B is computed by averaging across four electrodes: P3, Pz, P4 and POz. All four electrodes individually show significant differential activity post 250 ms. The ERP plot and the scalp map show the presence of the well-known P300 component40,41, which is known to occur in target detection42,43,44. The P300 is elicited when observers respond to a stimulus-driven task and is independent of sensory modality45. It has also been known to contribute to the attentional blink46.

### Performance benefits from integrating behavioural and neural decision variables across multiple brains

While the simple majority pooling algorithm has been effectively used to combine expert/non-expert judgements in single-location tasks, the weighted average of individual decisions has been found to give superior benefits in simple visual search tasks, where a small section of the group members often has more information than the majority. We explore whether, in a more complicated scenario like visual search in natural scenes, the weighted average dominates the majority rule in terms of performance accuracy.

To estimate the benefits of combining the behavioural and neural decision variables, 400 random groups were generated per group size, and the expected group decision was computed using the majority and weighted average pooling algorithms. The group performance was compared with the expected individual performance. Figure 3A-B shows the mean performance accuracy in terms of Proportion Correct (PC) per group size, for behavioural and neural data, respectively. Clearly, the PC increases with group size for both pooling algorithms. The maximum performance benefit achieved by group decision (i.e., by aggregating data from all 17 participants) is 15% for behavioural data and 12% for neural data using the weighted average rule, and 13% and 7%, respectively, using the majority rule. Comparing the performance of the two algorithms per group size, we see notably superior performance of the weighted average rule ($$p < 10^{-5}$$ across all odd group sizes from 3 to 15, FDR corrected, paired t-test, df = 399). This holds irrespective of whether we combine the behavioural ratings of the group members or the neural decision variables.

While it is clear that the inclusion of information from multiple brains is associated with improved classifier performance, it needs to be verified that this is not solely due to an increase in the amount of data used for classification (regardless of whether the data comes from a single brain or multiple brains). If classifier performance improved because of an increase in the volume of data alone, then performance should not differ much between fusing neural signals from two groups of electrodes from two different brains and combining neural information from two groups of electrodes from a single brain. We tested this by comparing performance using multiple brains and using a single brain, keeping the total number of electrodes comparable in both cases. The procedure was repeated 100 times by randomly sampling two brains and choosing two random groups of 30 electrodes each. Results using 30 electrodes from a single brain are also shown for comparison. Multivariate analysis reveals that combining information from multiple brains has a unique benefit for decision making (refer to Fig. 3C). Doubling the number of electrodes certainly leads to an information gain, but the gain from doubling the number of heads clearly surpasses it (two-sample t-test, $$t = 3.9204$$, $$p= 6 \times 10^{-5}$$, df = 198).

### Correlation between groups’ behavioural and neural decision variables across difficulty levels

To explore the relationship between the collective stimulus-driven neural signal and the group decision confidence, we evaluate the correlation between the weighted average of the behavioural ratings and the combination of the neural decision variables. If the neural decision variable reflects the group decision regarding the presence/absence of the target in the natural scene, we expect a significant positive correlation between the weighted average of the participants’ confidence ratings and the weighted average of the neural decision variables from multiple brains. Figure 4A-E depicts scatter plots of the resulting behavioural and neural decision variables across all difficulty levels. The linear combination of behavioural confidence ratings and the integration of the neural decision variables of all seventeen brains are positively correlated at all difficulty levels ($$r=0.84,0.92,0.84,0.77,0.67$$, respectively, with increasing difficulty). All correlations are significant ($$p< 10^{-15}$$ for all five difficulty levels). The partial correlation between the neural and behavioural decision variables, controlling for the effect of difficulty level, was found to be 0.85 ($$p < 10^{-50}$$). Our results show that a significant correlation exists between behavioural and neural decision variables irrespective of difficulty level. The corresponding scatter plot is provided in Fig. 4F.

### Classification error rate as a function of group size and difficulty level

While from Fig. 3A,B it is evident that aggregating individual decisions into a group decision increases decision accuracy, a relevant question is whether the group size and the difficulty level of the task play any significant role in this net information gain from using multiple individuals. Classification error rates of collective judgement were computed for each group size and each difficulty level (average classification error rates are depicted in Fig. 5A,C, for behavioural and neural data, respectively), and a two-way ANOVA was performed. Significant main effects of both group size ($$p< 10^{-20}$$, $$F = 7.96$$, df = 6) and difficulty level ($$p< 10^{-20}$$, $$F = 143.89$$, df = 4) were observed. Similar effects were seen when considering the error in judgement from the neural group decision variables: the main effects of group size ($$p< 0.005$$, $$F=3.55$$, df = 6) and difficulty level ($$p< 10^{-20}$$, $$F=13.73$$, df = 4) were significant. The benefit of group decision making across the five difficulty levels was computed, and a reduction in the percentage benefit in classification error rate was identified with increasing difficulty (Fig. 5B,D, for behavioural and neural data, respectively).

We next examine whether the information gain from group decision making is stable across all group sizes or is always larger for larger groups. Pairwise comparisons between all pairs of group sizes, after correcting for multiple comparisons across the five difficulty levels, revealed that increasing the group size gave an advantage only up to about two-thirds of our maximum group size (seventeen). In our case, there was no significant difference between the classification error rates of thirteen- or fifteen-member groups and those of seventeen-member groups across difficulty levels two to five (Wilcoxon rank sum test, $$0.1<p <0.4$$, FDR corrected), for both neural and behavioural performance. Hence we can conclude that the optimum group size is not necessarily the maximum group size when considering the improvement in performance.

### Integration of neural activity in multiple brains across time and difficulty levels

To assess whether the benefits of collective decision making arise in the early perceptual processing stage or the late decision making stage, we evaluate classifier performance on the EEG data for different time windows. Epochs were divided into temporal windows spanning $$\sim$$ 40 ms each. Multivariate pattern classification was conducted separately for each time window using all electrodes. A three-way ANOVA performed on these time-windowed PC values, with factors difficulty level, group size and time, revealed significant main effects of all three factors on classification accuracy ($$p<10^{-20}$$ in all three cases, $$F = 5768.55, \, 479.8,\, 2003.77$$ and df = 4, 6, 16, respectively). Figure 6D shows the neural timeline for individual versus group decision (for group size 13) at difficulty levels 1 and 5, respectively. The Wilcoxon rank sum test was used to compare the proportion correct between the two difficulty levels for a fixed group size and between the two group sizes for a fixed difficulty level, respectively. PC at the first difficulty level was consistently higher than PC at the highest difficulty level (right-tailed test, $$p<10^{-5}$$, FDR corrected) for the individual decision (post 360 ms) as well as the group decision (group of size 13, post 320 ms). For the first difficulty level, a considerable benefit was achieved using the group decision from 280 ms onwards ($$p<10^{-4}$$, FDR corrected, for group size 13, Group PC > Individual PC). But for the fifth difficulty level, i.e., for tasks of maximum difficulty, a consistent significant difference between individual and group decisions (Group PC > Individual PC) was observed much later (after 560 ms, $$p<10^{-5}$$, FDR corrected). It appears that the benefits accrued from group decisions in visual search depend on task difficulty: the greater the difficulty, the later the neural benefit.

### Relationship between group inter-subject correlation and task difficulty

The Group Inter-Subject Correlation (Group-ISC) is computed as the mean of the pairwise correlations among the group members. To obtain the behavioural Group-ISC, pairwise correlations were computed between the confidence ratings of the participants on trials of the same difficulty. An ANOVA performed on these Group-ISCs, with factors difficulty level and group size, revealed a significant main effect of difficulty level on the Group-ISC ($$p<10^{-50}$$, $$F = 17091.57$$, df = 4) as well as a significant main effect of group size ($$p<0.01$$, $$F = 2.99$$, df = 6). Multiple comparison tests using ‘Tukey's honestly significant difference procedure’38 identified the difficulty levels at which the Group-ISCs differed significantly. All pairs of difficulty levels showed a significant difference ($$p<0.0001$$ in all cases) in their mean Group-ISC values. Figure 6A shows the mean Group-ISC across difficulty levels. The Group-ISC decreases after the 2nd difficulty level, indicating that as difficulty increases, the correlation between the individual decisions of the participants reduces.
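The Group-ISC definition above amounts to averaging the Pearson correlation over all member pairs. A minimal sketch (in Python rather than the MATLAB used for the analyses; the ratings below are toy values):

```python
import numpy as np
from itertools import combinations

def group_isc(ratings):
    """Mean pairwise Pearson correlation across group members.

    ratings: array of shape (n_members, n_trials), each row holding one
    member's confidence ratings on trials of a single difficulty level.
    """
    pairs = combinations(range(len(ratings)), 2)
    r = [np.corrcoef(ratings[i], ratings[j])[0, 1] for i, j in pairs]
    return float(np.mean(r))

# Toy example: three members rating ten trials, sharing a common signal
rng = np.random.default_rng(2)
base = rng.standard_normal(10)
ratings = np.stack([base + 0.5 * rng.standard_normal(10) for _ in range(3)])
isc = group_isc(ratings)
```

For a group of $$n$$ members this averages over $$n(n-1)/2$$ pairs, so the statistic is bounded by the same $$[-1, 1]$$ range as an individual correlation.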

To test whether this increasing variability in the participants' behaviour with task difficulty is also reflected in their neural activity, we computed the Group-ISC from the EEG data (see Materials and Methods section for details). An ANOVA performed on these Group-ISCs, with factors difficulty level and group size, revealed a significant main effect of difficulty level on the Group-ISC ($$p< 10^{-20}$$, $$F = 29.44$$, df = 4) as well as a significant main effect of group size ($$p<10^{-20}$$, $$F =38.19$$, df = 6). Multiple comparison tests revealed no significant difference between the mean Group-ISCs of difficulty levels 1 and 2 ($$p= 0.1495$$), 1 and 3 ($$p= 0.9999$$), and 2 and 3 ($$p=0.1063$$). All other pairs of difficulty levels showed a significant difference ($$p<0.01$$ in all cases) in their mean Group-ISC values. Figure 6B shows that the mean neural Group-ISC decreases after the 2nd difficulty level, mirroring its behavioural counterpart (Fig. 6A).

To explore how the Group-ISC evolves over time, the post-stimulus neural data were divided into distinct 40 ms time windows as before, and Group-ISCs were computed separately in each window. The values increase from 160 ms after stimulus onset and reach their peak in the 200–240 ms time window (refer to Fig. 6C), followed by a sharp decline across all difficulty levels. At the peak, the Group-ISC is highest for difficulty level 2 and lowest for difficulty level 5. A Wilcoxon signed rank test reveals that the Group-ISC at difficulty level 1 is significantly greater than that at difficulty level 5 in the 200–440 ms window post stimulus onset ($$p < 0.05$$, FDR corrected). Thus, irrespective of task difficulty, the highest average correlations between group members are observed in the same time frame, with task difficulty determining the magnitude of this effect.
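The windowed Group-ISC can be sketched in the same way (a Python illustration; the 250 Hz sampling rate, group size, and simulated time series are placeholder assumptions):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
# Hypothetical per-member neural time series for one condition:
# 7 group members x 200 samples (10 samples per 40 ms at 250 Hz assumed).
data = rng.standard_normal((7, 200))

win = 10
isc_t = []  # Group-ISC per 40 ms time window
for start in range(0, data.shape[1] - win + 1, win):
    seg = data[:, start:start + win]
    r = [np.corrcoef(seg[i], seg[j])[0, 1]
         for i, j in combinations(range(len(seg)), 2)]
    isc_t.append(float(np.mean(r)))

peak_win = int(np.argmax(isc_t))  # window index of the highest Group-ISC
```

Computing the mean pairwise correlation window by window is what allows the peak (here, the 200–240 ms window in the real data) to be located and compared across difficulty levels.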

All data analyses were performed using MATLAB 2018.