A numerical study on efficient jury size

For judicial democracy, many societies adopt jury trials, where verdicts are made by a unanimous vote of, conventionally, 12 lay citizens. Here, using the majority-vote model, we show that such jury sizes achieve the best balance between the accuracy of verdicts and the time spent on unanimous decision-making. First, we identify two determinants of the efficient jury size: the opinion homogeneity in a community decreases the efficient jury size by affecting the accuracy of verdicts; the anti-conformity tendency in the community also reduces the efficient jury size by prolonging the time to reach unanimous verdicts. Moreover, we find an inverse correlation between these two determinants, which prevents over-shrinking and excessive expansion of the efficient jury size. Finally, by applying these findings to real-life settings, we narrow down the efficient jury size to 11.8 ± 3.0. Given that such a simple toy model can explain the jury sizes in actual societies, the number of jurors may have been implicitly optimised for efficient unanimous decision-making throughout human history.


Introduction
Jury trials are regarded as an embodiment of democracy in courtrooms (Ellsworth, 1989; Ellsworth and Getman, 1987), and it has been repeatedly debated whether decisions made by jurors sufficiently represent the opinions of the entire community (Ellsworth, 1989; Saks and Marti, 1997). In this sense, juries would ideally consist of all the community members or a sufficiently large number of them; in reality, however, it is impractical to implement such a direct or quasi-direct democracy in every trial. Instead, since the mid-twelfth century at the latest, juries have tended to comprise ~12 individuals, especially when they are requested to reach unanimous verdicts (Warren, 1973). In fact, despite some studies suggesting the superiority of much smaller juries (Fay et al., 2000; Nagel and Neef, 1975), the conventional ~12-juror systems have survived (Forsyth and Macdonnell, 2009).
Why has such a specific jury size been widely chosen? Across a line of psychological and sociological studies on various properties of the jury system, for example its impartiality (Hamilton, 1978; Stephan, 1974; Young et al., 2014), consistency (Davis et al., 1976; Werner et al., 1985) and accuracy (Garrett et al., 2020; Ross et al., 2019), some researchers have pointed to multiple societal and political events that could have shaped the current number of jurors (Ellsworth, 1989; Faust, 1959; Hans, 2008; Maccoun, 1989; Thomas and Fink, 1963). A meta-analysis of 17 experimental and observational studies on group decision-making demonstrated sociological and democratic advantages of 12-juror systems over 6-juror ones (Saks and Marti, 1997).
In the meantime, given the robustness and ubiquity of juries with ~12 lay citizens (Warren, 1973; Forsyth and Macdonnell, 2009), a more purely statistical rationale may underlie this specific jury size.
Here, we searched for such an account by examining the hypothesis that juries consisting of ~12 individuals achieve the best balance between the accuracy of jury verdicts and the time to reach unanimous decisions. Based on previous literature (Saks and Marti, 1997) and the concept of jury systems (Ellsworth, 1989; Ellsworth and Getman, 1987), the verdict accuracy was defined as how accurately the jury verdicts represented decisions that would be made by the full community (Fig. 1a). The deliberation time, the so-called consensus time (Krapivsky and Redner, 2003; Masuda, 2014), was defined as the number of voting steps required for the jury to reach a unanimous consensus.
We estimated these two metrics using a majority-vote model with noise (Liggett, 2005; Oliveira, 1992; Tome et al., 1991) on a complete (fully connected) network (Fig. 1b). In this widely used model (Chen et al., 2017; Costa and de Souza, 2005; Lima, 2010, 2012; Melo et al., 2010; Vilela and Moreira, 2009; Vilela et al., 2012; Vilela and Stanley, 2018), each node represents an individual who holds one of two opinions (here, guilty or not guilty) at every time point. The opinions change over time: each member adopts the majority opinion among its connected neighbours (i.e., all the other members in this study) at the previous time point with probability 1 − q and chooses the minority opinion with probability q. This noise parameter q is often called a social temperature (Vilela and Stanley, 2018) or an anti-conformity index (Nowak and Sznajd-Weron, 2019) and is taken to represent the degree to which individuals do not follow the majority opinion among their acquaintances.
With this model, we simulated opinion dynamics between jurors and found that the most efficient jury size was largely affected by the opinion homogeneity and anti-conformity tendency in the original community. In addition, these two factors were inversely correlated, which appears to prevent over-shrinking and excessive expansion of the efficient jury size. Finally, we applied these findings to real-life networks and narrowed down the most efficient jury size to 11.8 ± 3.0.
Model
Majority-vote model. Using the majority-vote model with noise q (0 ≤ q ≤ 0.5; Fig. 1b) (Liggett, 2005; Oliveira, 1992; Tome et al., 1991), we numerically analysed opinion dynamics in which N jurors (N ≥ 2) continued discussions until they reached a unanimous verdict. To reflect actual jury deliberation, we assumed that every juror interacts with all the other members and implemented the majority-vote model on a complete (fully connected) graph.
At a time point t (t = 0, 1, 2, …), each juror i had a dichotomic opinion σ_i(t) (σ_i(t) = ±1; i = 1, 2, …, N). The initial opinion σ_i(t = 0) was defined by random sampling from a community of 10^5 individuals whose opinions were divided into 1 or −1 at the ratio of F_Major to 1 − F_Major (0.5 < F_Major ≤ 1). At the following time point t + 1, each juror chose the majority opinion at the previous time point with probability 1 − q or the opposite opinion with probability q. Therefore, individual decision-making at a given time point can be partially affected by the juror's own opinion at the previous time point.
When the opinions were equally divided at the time point t, σ_i(t + 1) was determined at random.
We repeated this opinion update until all the opinions became the same. The time point when such unanimity was achieved was defined as the deliberation time T, and the unanimous verdict was also recorded as a jury verdict.
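The update rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function name `simulate_jury`, the synchronous update of all jurors, the approximation of sampling from the 10^5-member community by independent draws with probability F_Major, and the `max_steps` safety cap are all assumptions made here.

```python
import random


def simulate_jury(n_jurors, f_major, q, rng, max_steps=100_000):
    """Simulate one deliberation; return (T, verdict) with verdict in {+1, -1}."""
    # Initial opinions: +1 is the community's major opinion. Independent draws
    # approximate random sampling from a large community (assumption).
    opinions = [1 if rng.random() < f_major else -1 for _ in range(n_jurors)]
    t = 0
    while abs(sum(opinions)) != n_jurors:  # loop until the jury is unanimous
        if t >= max_steps:
            raise RuntimeError("no unanimous verdict within max_steps")
        balance = sum(opinions)
        if balance == 0:
            # Evenly split jury: every juror picks an opinion at random.
            opinions = [rng.choice((1, -1)) for _ in range(n_jurors)]
        else:
            majority = 1 if balance > 0 else -1
            # Each juror follows the majority with probability 1 - q
            # and takes the opposite opinion with probability q.
            opinions = [majority if rng.random() >= q else -majority
                        for _ in range(n_jurors)]
        t += 1
    return t, opinions[0]


rng = random.Random(2020)
trials = [simulate_jury(12, 0.6, 0.075, rng) for _ in range(1000)]
```

Here `trials` holds 1000 pairs of deliberation time and verdict for one illustrative setting (N = 12, F_Major = 0.6, q = 0.075); the study itself used 10^6 trials per parameter set.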
We conducted these calculations 10^6 times (Fig. 1a). That is, for one set of N, F_Major and q, we simulated 10^6 different jury trials, and consequently obtained 10^6 pairs of T and verdict accuracy score.
Deliberation time. The deliberation time was summarised by its median, 〈T〉_N, because T showed a skewed distribution (Fig. 1c).
Verdict accuracy. Regarding the jury verdict, we first evaluated its accuracy by comparing it with that made by all the community members from which the jurors were selected. In other words, the verdict accuracy was defined as how accurately the jury verdicts represented the community decisions.
The community verdicts were estimated in essentially the same manner as the jury verdicts, except for the following two points. First, we assumed that the verdicts of the community, which hypothetically consisted of 10^5 individuals, were made through full interactions between all the 10^5 individuals, not between N randomly selected members. Second, the community verdicts were based on a majority-vote rule, not on unanimity, because the opinions in the community were not likely to converge to unanimous ones. The community verdicts were defined by the majority opinions at t = 100.
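A community verdict of this kind might be sketched as follows. On a complete graph every individual sees the same majority, so it is enough to track how many individuals hold opinion +1. The function name `community_verdict`, its parameters, the synchronous update and the random tie-break at the final step are assumptions of this sketch rather than details given in the text.

```python
import random


def community_verdict(f_major, q, rng, size=100_000, steps=100):
    """Majority opinion (+1 or -1) of a fully connected community after `steps` updates."""
    n_plus = round(f_major * size)  # holders of opinion +1 at t = 0
    for _ in range(steps):
        if 2 * n_plus == size:
            p_plus = 0.5          # evenly split: opinions are chosen at random
        elif 2 * n_plus > size:
            p_plus = 1 - q        # +1 is the majority and is followed w.p. 1 - q
        else:
            p_plus = q            # -1 is the majority; +1 is chosen w.p. q
        n_plus = sum(rng.random() < p_plus for _ in range(size))
    if 2 * n_plus == size:
        return rng.choice((1, -1))  # assumed tie-break, not specified in the text
    return 1 if 2 * n_plus > size else -1
```

With the default `size` and `steps` this draws 10^7 random numbers per call, so a smaller community is convenient for quick checks, for example `community_verdict(0.6, 0.1, random.Random(0), size=1001, steps=20)`.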
When the jury verdict was the same as the community verdict, the verdict accuracy score was set at 1. Otherwise, it was set at 0.
Next, we calculated the average of the accuracy score, 〈Accuracy〉_N, across the 10^6 jury trials. To infer effects of the jury size on the verdict accuracy, we then compared this 〈Accuracy〉_N with the accuracy of the most basic group decision-making (i.e., the verdict accuracy of 2-juror juries, 〈Accuracy〉_2). That is, the jury effect on the verdict accuracy, Δ〈Accuracy〉_N, was defined as 〈Accuracy〉_N − 〈Accuracy〉_2.
Jury efficiency and efficient jury size. Finally, the jury efficiency was calculated as Δ〈Accuracy〉_N / 〈T〉_N. We repeated this calculation for different N (2 ≤ N ≤ 50) and searched for the most efficient jury size N_efficient. Technically, we fitted a polynomial curve to the jury efficiency, differentiated it and defined N_efficient as the N at which the jury efficiency reached a local maximum.
Brute-force search and ranges of parameters. Using this procedure, we conducted a brute-force search for the most efficient jury size over wide ranges of F_Major and q.
Regarding F_Major, we set its range to 0.52 ≤ F_Major ≤ 0.66 because the jury efficiency is likely to matter particularly when the community opinions on trials are not homogeneous but somewhat divided. After all, if a supermajority exists in the community (i.e., F_Major > 2/3), the jury can reach a verdict representing such a dominant community opinion in a relatively short deliberation time.
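The efficiency calculation and the peak search can be illustrated with a short Python sketch. It is not the authors' procedure: the function names are hypothetical, and instead of fitting a polynomial over the full range of N it fits a parabola through the best grid point and its two neighbours, a simpler stand-in used here to keep the example dependency-free.

```python
from statistics import median


def jury_efficiency(times, verdicts, community_verdicts, baseline_accuracy):
    """Return (mean accuracy - baseline accuracy) / median deliberation time."""
    hits = sum(1 for v, c in zip(verdicts, community_verdicts) if v == c)
    accuracy = hits / len(verdicts)
    typical_time = median(times)
    if typical_time == 0:
        raise ValueError("median deliberation time is zero; efficiency undefined")
    return (accuracy - baseline_accuracy) / typical_time


def peak_jury_size(sizes, efficiencies):
    """Estimate the N that maximises efficiency on an equally spaced grid of sizes."""
    k = max(range(len(sizes)), key=lambda i: efficiencies[i])
    if k == 0 or k == len(sizes) - 1:
        return float(sizes[k])  # peak on the boundary: no interpolation possible
    y0, y1, y2 = efficiencies[k - 1], efficiencies[k], efficiencies[k + 1]
    curvature = y0 - 2 * y1 + y2
    if curvature == 0:
        return float(sizes[k])
    step = sizes[k + 1] - sizes[k]
    # Vertex of the parabola through the three points around the maximum.
    return sizes[k] + 0.5 * step * (y0 - y2) / curvature
```

Given the simulated values for each N from 2 to 50, `peak_jury_size` returns a non-integer estimate of N_efficient, which is why values such as 12.1 can arise from integer jury sizes.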
Regarding q, we set its range to [0.05, 0.2] so as to increase the possibility that the jury can reach a unanimous verdict. Previous analytical studies (Chen et al., 2015; Fronczak and Fronczak, 2017) demonstrated that, in the majority-vote model on a complete network with N nodes, the critical noise value q_c can be described as $q_c = \frac{1}{2} - \frac{1}{2}\sqrt{\frac{\pi}{2(N-1)}}$ (Fig. 1d). Given this, to enhance the tendency for the network to reach the ordered (i.e., ferromagnetic) state, we kept q below q_c even when the number of nodes in the network is relatively small (e.g., N ~ 10).
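As a quick numerical check of this parameter choice, the critical noise on a complete graph, q_c = 1/2 − (1/2)√(π / (2(N − 1))), can be evaluated directly. The helper name `critical_noise` is introduced here only for illustration.

```python
import math


def critical_noise(n_nodes):
    """Critical noise q_c of the majority-vote model on a complete graph with n_nodes nodes."""
    return 0.5 - 0.5 * math.sqrt(math.pi / (2 * (n_nodes - 1)))


# Jury sizes for which the upper end of the chosen range (q = 0.2) is not below q_c.
too_small = [n for n in range(2, 51) if critical_noise(n) <= 0.2]
```

For N = 10 this gives q_c of about 0.29, so the whole range 0.05 ≤ q ≤ 0.2 lies below the critical value; only for very small juries (those listed in `too_small`) does q = 0.2 reach or exceed q_c.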
Two community factors and efficient jury size. To examine effects of F_Major and q on N_efficient, we first conducted regression analyses. Based on the regression equations, we explored a simplified expression of N_efficient and found that N_efficient can be described as N_efficient = β_Acc/β_Time, where β_Acc and β_Time were regression coefficients. Next, we calculated Pearson's correlation coefficients r between the two components of N_efficient (β_Acc and β_Time) and the two community parameters (F_Major and q). The correlation coefficients were compared statistically using tests of the significance of the difference between correlation coefficients.
Two community factors in real-life data. Finally, we examined an association between F_Major and q in real-life networks. The following three real-life network structures were obtained from the Stanford Large Network Data set Collection (https://snap.stanford.edu/data/) (Leskovec and Sosic, 2016): a collaboration network (ca-CondMat), an email network (email-Eu-core) and a Facebook network (ego-Facebook) (Table 1). All the network structures were used as undirected unweighted graphs.
Fig. 1 Research design. a-c We evaluated the efficiency of juries with N jurors by comparing the verdict accuracy with the deliberation time (a) using the majority-vote model with noise on a complete graph (b). The verdict accuracy 〈Accuracy〉 was defined by whether the jury verdict σ_jury was the same as the verdict hypothetically made by the full community members σ_Comm. We then quantified the beneficial effect of N jurors on the verdict accuracy, Δ〈Accuracy〉_N, by calculating how much better the verdict accuracy in the N-juror system, 〈Accuracy〉_N, was compared to 〈Accuracy〉_2, the accuracy of the most basic collective decision-making system. The deliberation time T was summarised by its median value 〈T〉 because T showed a skewed distribution (c). Finally, we estimated the jury efficiency by calculating the ratio of Δ〈Accuracy〉 to 〈T〉. In the majority-vote model (b), the jurors change their opinions to the majority opinion at a time point with a probability 1 − q or to the minor one with a probability q. If the opinion in the jury is equally divided, the jurors change their opinions randomly. q is a noise parameter, which is also referred to as social temperature (Vilela and Stanley, 2018) and anti-conformity index (Nowak and Sznajd-Weron, 2019). F_Major represents the proportion of the major opinion in the entire community. d The graph shows an association between the number of jurors N and the critical noise q_c in the majority-vote model on complete graphs (Chen et al., 2015). The range of q was set to [0.05, 0.2] so that q is less than q_c even when the number of jurors is relatively small (e.g., N ~ 10).
We simulated opinion dynamics on these three networks using the same majority-vote model. First, we set q (0.05 ≤ q ≤ 0.15) and the initial opinion homogeneity F_Major_initial (0.52 ≤ F_Major_initial ≤ 0.66). An initial opinion of each individual was randomly assigned so that the opinion bias in the entire network met F_Major_initial. Then, we updated the opinions until the opinion homogeneity in the community reached a plateau, F_Major_converge. The plateau was defined as a period where the fluctuation in F_Major was within 0.1 over 10 opinion updates. We repeated this procedure for different F_Major_initial and found that, when q was constant, the community opinion homogeneity tended to converge to the same F_Major_converge regardless of F_Major_initial. We then calculated this F_Major_converge for different q and examined correlations between F_Major_converge and q. By applying this F_Major-q association to the results of the brute-force analysis, we narrowed down the efficient jury size.
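The plateau criterion can be made concrete with a small Python sketch. The text does not say how the fluctuation was measured, so this example assumes it is the range (maximum minus minimum) of F_Major over a sliding window of 10 consecutive updates; the function name `find_plateau` and its return format are hypothetical.

```python
def find_plateau(f_major_series, window=10, tolerance=0.1):
    """Return (start_index, mean F_Major) of the first window within tolerance, else None."""
    for start in range(len(f_major_series) - window + 1):
        chunk = f_major_series[start:start + window]
        # Assumed definition of "fluctuation": spread of values inside the window.
        if max(chunk) - min(chunk) <= tolerance:
            return start, sum(chunk) / window
    return None
```

Applied to the trajectory of F_Major recorded after each opinion update, the second element of the result would serve as the estimate of F_Major_converge for the chosen q.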

Results
Brute-force search for efficient jury size. As an example, let us consider a jury trial employing 12 jurors who were randomly selected from a community with a 60/40 opinion split on the verdict for the trial (F_Major = 0.6 in Fig. 1a). Also, we set the noise parameter q in the majority-vote model at 0.075 (Fig. 1b), which is smaller than the critical noise q_c (Fig. 1d) and should increase the possibility that the jury reaches a unanimous verdict. With this setting, we simulated 10^6 different opinion dynamics and obtained 10^6 different deliberation time lengths T and unanimous jury verdicts.
The deliberation time to reach a unanimous verdict showed a skewed distribution (Fig. 1c); thus, we adopted the median of the time, 3.76, as a representative deliberation time length 〈T〉_12.
To estimate the verdict accuracy, we first calculated how many jury verdicts were the same as those that would be made by the full community members (here, hypothetically 10^5 individuals). We then divided the number of accurate verdicts by the total number of simulated opinion dynamics (i.e., 10^6) and obtained the verdict accuracy 〈Accuracy〉_12 = 0.75.
Next, we measured the beneficial effect of the jury systems on the verdict accuracy, Δ〈Accuracy〉_12, by comparing 〈Accuracy〉_12 with the accuracy of the most basic collective decision-making system (i.e., 〈Accuracy〉_2 = 0.60).
Based on this definition, we then searched for the most efficient jury size. In this setting of F_Major and q, we repeated the calculation of the jury efficiency for different jury sizes (2 ≤ N ≤ 50). Both 〈T〉 and Δ〈Accuracy〉 increased with the number of jurors (Fig. 2a, b), and the jury efficiency showed a peak at N_efficient = 12.1 (Fig. 2c).
In the same manner, we conducted a brute-force search for the most efficient jury size by independently changing the two parameters in broader ranges (0.52 ≤ F_Major ≤ 0.66 and 0.05 ≤ q ≤ 0.2). N_efficient was found between 6.70 (F_Major = 0.66 and q = 0.2) and 18.40 (F_Major = 0.52 and q = 0.05) (Fig. 2d).
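As a worked illustration using only the rounded values quoted in this example, and the definition of jury efficiency as Δ〈Accuracy〉_N / 〈T〉_N, the arithmetic for the 12-juror case would run as follows. The final figure is derived here from rounded inputs and is not a value reported in the study.

```latex
\begin{align}
\Delta\langle \mathrm{Accuracy} \rangle_{12}
  &= \langle \mathrm{Accuracy} \rangle_{12} - \langle \mathrm{Accuracy} \rangle_{2}
   = 0.75 - 0.60 = 0.15, \\
\text{jury efficiency}
  &= \frac{\Delta\langle \mathrm{Accuracy} \rangle_{12}}{\langle T \rangle_{12}}
   = \frac{0.15}{3.76} \approx 0.040 .
\end{align}
```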
Simple expression of efficient jury size. Next, to infer determinants of the efficient jury size, we explored a simple expression of N_efficient.
In the above example, 〈T〉 showed an exponential increase when N increased (R^2* = 0.96; Fig. 2a), whereas Δ〈Accuracy〉 increased more slowly in a manner that was well fitted by a log-log linear regression model (R^2* = 0.95; Fig. 2b).
Such associations were seen in a broader parameter space (0.52 ≤ F_Major ≤ 0.66 and 0.05 ≤ q ≤ 0.2; R^2* ≥ 0.92), which enabled us to describe 〈T〉 and Δ〈Accuracy〉 as ln 〈T〉 = β_Time N + ε_Time and ln Δ〈Accuracy〉 = β_Acc ln N + ε_Acc, respectively. β_Time and β_Acc represent coefficients, and ε_Time and ε_Acc denote intercepts in the regression models.
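The two regressions can be sketched with ordinary least squares in plain Python. The helper names `ols_slope` and `estimate_efficient_size` are hypothetical, the inputs are assumed to be strictly positive so that their logarithms exist, and the final ratio β_Acc/β_Time anticipates the simplified expression for N_efficient used in this study.

```python
import math


def ols_slope(xs, ys):
    """Slope of the least-squares line through the points (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def estimate_efficient_size(sizes, median_times, accuracy_gains):
    """Return beta_Acc / beta_Time from the two log-scale regressions."""
    # ln <T> = beta_Time * N + eps_Time
    beta_time = ols_slope(sizes, [math.log(t) for t in median_times])
    # ln Delta<Accuracy> = beta_Acc * ln N + eps_Acc
    beta_acc = ols_slope([math.log(n) for n in sizes],
                         [math.log(a) for a in accuracy_gains])
    return beta_acc / beta_time
```

Because N = 2 defines the baseline and therefore has Δ〈Accuracy〉_2 = 0, it would have to be excluded from `sizes` before taking logarithms.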
Given these expressions, the derivative of the logarithm of the jury efficiency can be described as $\frac{d}{dN}\ln\left(\Delta\langle\mathrm{Accuracy}\rangle / \langle T\rangle\right) = \beta_{Acc}/N - \beta_{Time}$, and thus the jury efficiency should peak when N is β_Acc/β_Time. This prediction was validated by a significant correlation between β_Acc/β_Time and N_efficient (R^2* = 0.78, coefficient of variation = 0.14; Fig. 3a).
Jury size and two community factors. Based on this simple expression of N_efficient, we examined how the efficient jury size was affected by the two community properties, F_Major and q. We first calculated β_Acc and β_Time in wide ranges of F_Major and q (Fig. 3b) and then estimated correlation coefficients between them.
These results suggest that larger opinion homogeneity in a community decreases the efficient jury size by dampening the beneficial effects of collective decision-making on the verdict accuracy, whereas a stronger anti-conformity tendency in the community also reduces N_efficient by accelerating the increase in the deliberation time (Fig. 3e).
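The step from the two regression models to the peak condition can be written out explicitly. The following derivation is a sketch under the assumption that both fits hold exactly and that β_Acc and β_Time are positive; it uses only the symbols defined in the regression models above.

```latex
\begin{align}
\ln\frac{\Delta\langle \mathrm{Accuracy} \rangle}{\langle T \rangle}
  &= \ln \Delta\langle \mathrm{Accuracy} \rangle - \ln \langle T \rangle
   = \beta_{Acc}\ln N + \varepsilon_{Acc} - \beta_{Time} N - \varepsilon_{Time}, \\
\frac{d}{dN}\ln\frac{\Delta\langle \mathrm{Accuracy} \rangle}{\langle T \rangle}
  &= \frac{\beta_{Acc}}{N} - \beta_{Time} = 0
  \quad\Longrightarrow\quad N = \frac{\beta_{Acc}}{\beta_{Time}}, \\
\frac{d^{2}}{dN^{2}}\ln\frac{\Delta\langle \mathrm{Accuracy} \rangle}{\langle T \rangle}
  &= -\frac{\beta_{Acc}}{N^{2}} < 0 \quad \text{for } \beta_{Acc} > 0 .
\end{align}
```

Since the logarithm is monotonic, the stationary point of the log-efficiency is also the maximum of the efficiency itself, and the negative second derivative confirms that it is a peak rather than a trough.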
Why do jury sizes not shrink or expand? If the two community factors affect the efficient jury size in such seemingly independent manners, N_efficient could overly shrink (e.g., N = 2 or 3) or excessively expand (e.g., N = 100). However, such extremely small or large juries are rarely seen in real-life settings.
To solve this contradiction, we investigated associations between F_Major and q in real-life data sets. We simulated opinion dynamics on three real-life large-scale social networks (Leskovec and Sosic, 2016) and traced the changes in the opinion homogeneity F_Major when the anti-conformity index q was set as a constant.
Fig. 2 Brute-force search for efficient jury size. In an example case (F_Major = 0.6 and q = 0.075), the deliberation time 〈T〉 and verdict accuracy change Δ〈Accuracy〉 increased along with the jury size N (a and b), and the jury efficiency showed a peak at N = 12.1 (c). The y-axes in the panels a and b are in a logarithmic scale. We conducted such a search for N_efficient in a brute-force manner in broader ranges of F_Major and q.
In all three social networks, the opinion homogeneity converged toward a q-dependent specific value even when the initial F_Major was widely varied (e.g., Fig. 4a). Moreover, the convergence value of the opinion homogeneity, F_Major_converge, was negatively correlated with the anti-conformity index q (|r| ≥ 0.98, P ≤ 0.0034, P_Bonferroni < 0.05; Fig. 4b). When we considered this F_Major-q inverse correlation together with the results of the brute-force search for N_efficient (Fig. 2d), the efficient jury size was narrowed down to a range from 8.8 to 14.7 (11.8 ± 3.0) in the three real-life social networks (Fig. 4c).
These results show that the inverse correlation between the opinion homogeneity and anti-conformity tendency in a community can be one of the key mechanisms that prevent the over-shrinking and excessive expansion of the jury size (Fig. 4d).

Discussion
This numerical study examined opinion dynamics during jury deliberation with the majority-vote model, calculated the efficiency of the jury system and searched for the most efficient jury size. We first found that the efficient jury size was determined by two community factors in different ways: larger opinion homogeneity in a community made the efficient jury size more compact by affecting the verdict accuracy, whereas a larger anti-conformity tendency decreased the efficient jury size by accelerating the increase in the deliberation time. These two community factors were inversely correlated with each other, which prevented over-shrinking and excessive expansion of the efficient jury size. By bringing all these findings into real-life networks, the most efficient jury size was narrowed down to 11.8 ± 3.0, which is close to actual jury sizes in most of the jury trial systems (Ellsworth, 1989).
This study has demonstrated that even a simple statistical model can provide an account for the jury sizes seen in different countries; however, such simplicity could impose limitations.
Sociologically, a series of studies have suggested that the jury sizes must have been chosen and changed through a series of societal and political events (Ellsworth, 1989; Hans, 2008; Maccoun, 1989), which were not covered in this research. We did not consider a variety of the jury trial systems, either: in some communities, their jury verdicts are not made by unanimous votes but by majority votes with different criteria (Hans, 2008); some jury systems include professional judges (Hans, 2008). Also, the model we adopted here may be too simple in statistical and psychological contexts. The majority-vote model with noise has been used to understand various opinion dynamics (Chen et al., 2017; Costa and de Souza, 2005; Lima, 2010, 2012; Melo et al., 2010; Vilela and Moreira, 2009; Vilela et al., 2012; Vilela and Stanley, 2018), but the current form does not consider effects of some social and psychological factors, such as the existence of strong opinion holders in the juries (Vilela and Stanley, 2018) and individual differences in the anti-conformity tendency (Costa and de Souza, 2005; Vilela et al., 2012). This study assumed dichotomy in the opinion distribution; therefore, the current model took into account neither more nuanced differences in opinions (Lima, 2012; Melo et al., 2010) nor effects of confidence levels of individual opinions (Bahrami et al., 2012; Bahrami et al., 2010).
Fig. 3 Determinants of efficient jury size. a The most efficient jury size N_efficient could be approximated by β_Acc/β_Time, where β_Acc and β_Time are regression coefficients in ln Δ〈Accuracy〉 = β_Acc ln N + ε_Acc and ln 〈T〉 = β_Time N + ε_Time. CV represents the coefficient of variation and R^2* denotes the adjusted coefficient of determination in a regression model using N_efficient = β_Acc/β_Time. b-e We calculated β_Acc and β_Time in broader ranges of F_Major and q (b) and estimated correlation coefficients between them. The panels c and d show the exemplary results of the correlation analyses. β_Acc was specifically correlated with F_Major (c), whereas β_Time was exclusively correlated with q (d). These observations indicate that larger F_Major decreases N_efficient by reducing β_Acc, whereas larger q decreases N_efficient by increasing β_Time (e). In the panels c and d, r shows a correlation coefficient and P indicates the statistical significance in a test of the significance of the difference of the correlation coefficient.
Fig. 4 Efficient jury size in real-life settings. We examined associations between the opinion homogeneity F_Major and anti-conformity tendency q in the three real-life large-scale social networks (Leskovec and Sosic, 2016). In all the three networks, F_Major converged toward a q-dependent specific value even when the initial F_Major was largely different (a). The convergence F_Major was negatively correlated with q (b). By applying such an inverse association between F_Major and q, we could narrow down N_efficient into 11.8 ± 3.0 (c). These results imply a statistical mechanism, which avoids over-shrinking/expansion of the jury size in the real-life society (d).
HUMANITIES AND SOCIAL SCIENCES COMMUNICATIONS | (2020) 7:62 | https://doi.org/10.1057/s41599-020-00556-1
In addition to these limitations, the current results could be validated analytically. A previous work on the majority-vote model on complete graphs used the heterogeneous mean-field theory and successfully identified the critical noise for both infinite-size and finite-size networks (Chen et al., 2015). Another study adopted the master equation approach and obtained the exact equation for the probability of a given opinion pattern (Fronczak and Fronczak, 2017). Combining these analytical findings would yield clearer, more precise and more extensible expressions of the quantitative associations between the anti-conformity index, opinion homogeneity in the community, verdict accuracy and deliberation time.
These sociological, statistical, psychological and analytical concerns should be investigated in future studies; however, the fact that even a simple toy model can explain actual jury sizes may imply that the number of jurors has been implicitly optimised in different communities with different cultures through a common mechanism.

Data availability
The data sets of the three real-life network structures are available in the Stanford Large Network Data set Collection (https://snap.stanford.edu/data/). The current work generated neither original data nor a novel mathematical model. The code used here is available from the author upon reasonable request.