Entangled and correlated photon mixed strategy for social decision making

Collective decision making is important for maximizing total benefits while preserving equality among individuals in the competitive multi-armed bandit (CMAB) problem, wherein multiple players try to gain higher rewards from multiple slot machines. The CMAB problem represents an essential aspect of applications such as resource management in social infrastructure. In a previous study, we theoretically and experimentally demonstrated that entangled photons can physically resolve the difficulty of the CMAB problem. This decision-making strategy completely avoids decision conflicts while ensuring equality. However, decision conflicts can sometimes be beneficial if they yield greater rewards than non-conflicting decisions, indicating that greedy actions may provide positive effects depending on the given environment. In this study, we demonstrate a mixed strategy of entangled- and correlated-photon-based decision-making so that total rewards can be enhanced when compared to the entangled-photon-only decision strategy. We show that an optimal mixture of entangled- and correlated-photon-based strategies exists depending on the dynamics of the reward environment as well as the difficulty of the given problem. This study paves the way for utilizing both quantum and classical aspects of photons in a mixed manner for decision making and provides yet another example of the supremacy of mixed strategies known in game theory, especially in evolutionary game theory.

www.nature.com/scientificreports/

conflicts are avoided and the maximum total reward is accomplished while ensuring equality 16. More recently, we have theoretically derived optimal quantum states that provide the maximum total reward, while preserving equality, for more than three players on two-armed bandit problems 21. We also showed that classical photons, in the sense of non-entangled states such as single photons and correlated photon pairs, cannot resolve decision conflicts 16. In these studies, the reward dispensed from a slot machine at each play is constant. Hence, in the event of a decision conflict, each player's reward is reduced by the number of overlapping choices, resulting in a reduction in the total reward. However, depending on the given environmental conditions, decision conflicts can also provide a greater total reward. For example, if the individual reward from a particular slot machine is not reduced even in the case of conflicts, choosing the same slot machine (decision conflict) yields a higher total reward than choosing different slot machines (non-conflicting decision). Indeed, we can observe similar real-life scenarios, for example, in the form of enhanced services or resource availability, such as computing power, or reduced sale prices offered for a limited time. Similarly, the notion of a critical mass that has to be reached for an activity to be sustainable reflects non-decreasing rewards under conflicting decisions.
In this study, to accommodate the aforementioned changes in the environmental conditions and maximize total rewards, we propose and demonstrate a mixed strategy of utilizing entangled photons and classical photons (specifically, polarization-entangled photon pairs and polarization-correlated photon pairs) to find the optimal solution of 2-player, 2-armed bandit problems. While utilizing entangled photons, which guarantee non-conflicted and fully equal decisions, each player accumulates information about the reward environment. When recognizing that the conflicted choice provides greater rewards, we utilize correlated photons to fully exploit the reward from the environment. We show that an optimal mixture of entangled and correlated photons exists depending on the dynamics of the reward environment as well as the difficulty of finding the higher reward probability machine. Although the following discussion is restricted to 2-player, 2-choice problems, the present study captures the essential aspects of entangled and classical-photon mixed strategies that can be extended for solving more generalized problems.

Results
System architecture. We consider two players (Players 1 and 2), each of whom chooses one of two slot machines (Machines A and B) with the intention of maximizing the total reward, i.e., the sum of the rewards of the individual players. The reward probabilities of Machines A and B are denoted as P_A and P_B, respectively. Although the present study examines the properties of entangled and classical photon states theoretically and numerically, it assumes technologically feasible experimental optical systems that generate photon pairs by spontaneous parametric down-conversion (SPDC), as schematically represented in Fig. 1a; this is essentially the same as the experimental setup proposed in our previous study 16. The photon-pair generation is based on a standard Sagnac loop architecture 22 to induce SPDC. The signal and idler photons correspond to the decisions of Players 1 and 2, respectively. The signal photon goes through a half-wave plate (HW_1) followed by a polarization beam splitter (PBS_1). If the photon is detected by the photodetector corresponding to horizontally polarized light (PD1), the decision of Player 1 is to choose Machine A, whereas if the photon is detected by the photodetector corresponding to vertically polarized light (PD2), the decision of Player 1 is to choose Machine B. Similarly, the decision of Player 2 is determined by the detection of the idler photon by PD3 or PD4, which correspond to the decisions of selecting Machines A and B, respectively.
We introduce several notations to describe the system. The input photon state for the decision of Player i (i = 1, 2) is denoted as |θ_i⟩, where θ_i is the linear polarization angle. The roles of HW_i and PBS_i are given by

|θ_i⟩ → |2θ_HW_i − θ_i⟩,  (1)

and

|θ_i⟩ → cos θ_i |H_i⟩ + sin θ_i |V_i⟩,  (2)

where θ_HW_i is the angle of HW_i, and |H_i⟩ and |V_i⟩ indicate photon states with horizontal and vertical polarizations propagating in orthogonal directions beyond PBS_i 23. One strategy for realizing collective decision making is to link the decisions of Players 1 and 2 by introducing correlations among the decisions at the level of photon states. Here, we consider polarization-orthogonal photon pairs, denoted by |θ_1, θ_2⟩, where

θ_2 = θ_1 + π/2,  (3)

as input photon states to the two players. In practice, we can fix θ_i (i = 1, 2) by controlling the polarizer and half/quarter wave plates in the path of the excitation laser (respectively denoted by P, HW_E, and QW_E in Fig. 1a). Let us set θ_1 = 0 and θ_2 = π/2 for the sake of simplicity. The probabilities of observing photons at PD1 and PD3 (meaning that both players choose Machine A) and at PD2 and PD4 (both players select Machine B) are represented by the following equations:

P_C(A, A) = cos²(2θ_HW_1) cos²(2θ_HW_2 − π/2),  (4)

and

P_C(B, B) = sin²(2θ_HW_1) sin²(2θ_HW_2 − π/2).  (5)

By letting θ_HW_1 = 0 and θ_HW_2 = π/4 or their Nπ angle-shifted equivalents (where N is an integer) in Eq. (4), P_C(A, A) becomes unity, indicating that both players always choose Machine A, which is schematically illustrated in Fig. 1b. Similarly, P_C(B, B) becomes unity when θ_HW_1 = π/4 and θ_HW_2 = 0 or their Nπ angle-shifted equivalents in Eq. (5). That is, with polarization-orthogonal photon pairs, both players can choose the same intended machine with appropriate half-wave plate settings. Meanwhile, the probability of observing photons at PD1 and PD4 is given by

P_C(A, B) = cos²(2θ_HW_1) sin²(2θ_HW_2 − π/2),  (6)

which becomes unity when θ_HW_1 = 0 and θ_HW_2 = 0 or their Nπ angle-shifted equivalents. P_C(A, B) = 1 implies that Player 1 always chooses Machine A while Player 2 always selects Machine B. There is indeed no decision conflict in this case. However, equality is severely deteriorated; for example, when Machine A has a higher reward probability than Machine B, Player 1 earns greater rewards than Player 2. More details can be found in Ref. 16. In the discussion of the mixed strategy below, such fixed choices prevent the players from autonomously realizing which machine is actually dispensing higher rewards.
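As a sanity check, Eqs. (4)-(6) can be evaluated numerically. The following sketch assumes the input setting θ_1 = 0, θ_2 = π/2 used above; function and variable names are ours, not from the paper. Since the correlated pair is a product state, the joint probabilities factor into each player's marginals:

```python
import math

def correlated_probs(t_hw1, t_hw2):
    """Joint choice probabilities for polarization-correlated photon pairs
    with inputs theta1 = 0, theta2 = pi/2, after half-wave plates at angles
    t_hw1 and t_hw2 (Eqs. (4)-(6)). The product state factorizes, so joint
    probabilities are products of the two players' marginals."""
    p1_a = math.cos(2 * t_hw1) ** 2                # Player 1 chooses Machine A
    p2_a = math.cos(2 * t_hw2 - math.pi / 2) ** 2  # Player 2 chooses Machine A
    return {("A", "A"): p1_a * p2_a,
            ("B", "B"): (1 - p1_a) * (1 - p2_a),
            ("A", "B"): p1_a * (1 - p2_a),
            ("B", "A"): (1 - p1_a) * p2_a}

# theta_HW1 = 0, theta_HW2 = pi/4: both players deterministically choose A.
p = correlated_probs(0.0, math.pi / 4)  # p[("A", "A")] -> 1.0
```

The same function reproduces the other deterministic settings discussed above, e.g. `correlated_probs(math.pi / 4, 0.0)` gives a (B, B) probability of unity.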
To overcome this issue, we utilize a coherent superposition of states, namely entangled states. Here, we consider the maximally entangled singlet photon state given by

|ψ⟩ = (1/√2)(|θ_1, θ_2⟩ − |θ_2, θ_1⟩),  (7)

where θ_1 and θ_2 are orthogonal to each other, as specified in Eq. (3). Maximally entangled photons are usually represented in the form (1/√2)(|HV⟩ − |VH⟩); a different notation is used in Eq. (7) to maintain consistency with the aforementioned polarization-correlated photons (|θ_1, θ_2⟩) and to clearly present the role of the half-wave plates in the following discussion. Considering the probability amplitude originating from the second term in Eq. (7), the probabilities of the two players' decisions are given by 16:

P_E(A, A) = P_E(B, B) = (1/2) sin²(2(θ_HW_1 − θ_HW_2)),  (8)

and

P_E(A, B) = P_E(B, A) = (1/2) cos²(2(θ_HW_1 − θ_HW_2)),  (9)

which means that if θ_HW_1 = θ_HW_2 is satisfied, the non-conflict probability is always unity, and equality is ensured by Eq. (9). This also means that the conflict probability in Eq. (8) is always zero, regardless of the value of θ_i and of the common half-wave plate angle. That is, both players randomly but equally select Machine A or B, yet a conflict never happens. Such collective decision-making is schematically illustrated in Fig. 1b. If θ_HW_1 and θ_HW_2 are arranged so that the measured polarization bases are orthogonal, for example, θ_HW_1 = θ_HW_2 + π/4, the relationship is completely reversed: a decision conflict is always induced, with equal probability at Machine A and Machine B. Although such an orthogonally arranged configuration is another interesting aspect of the entangled-photon state given by Eq. (7), it is not exploited in the following discussion, for simplicity. One remark is that, while such an orthogonal arrangement of θ_HW_1 and θ_HW_2 provides conflicting decisions, the machine to be chosen cannot be specified to the intended one, as opposed to the polarization-correlated photon pairs discussed earlier; hence, this configuration does not fit the mixed strategy discussed shortly below.
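The singlet-state behaviour can be sketched numerically as well. The function below encodes the reconstructed Eqs. (8) and (9) (our own naming); for the maximally entangled state, only the difference of the half-wave plate angles matters:

```python
import math

def entangled_probs(t_hw1, t_hw2):
    """Joint choice probabilities for the singlet state of Eq. (7) after
    half-wave plates at angles t_hw1 and t_hw2 (Eqs. (8)-(9)). Only the
    plate-angle difference matters; the input angle theta_i drops out."""
    conflict = 0.5 * math.sin(2 * (t_hw1 - t_hw2)) ** 2    # P_E(A,A) = P_E(B,B)
    no_conflict = 0.5 * math.cos(2 * (t_hw1 - t_hw2)) ** 2  # P_E(A,B) = P_E(B,A)
    return {("A", "A"): conflict, ("B", "B"): conflict,
            ("A", "B"): no_conflict, ("B", "A"): no_conflict}

# Equal plate angles: conflicts never occur; each non-conflicting outcome
# has probability 1/2 (random but equal choices).
p = entangled_probs(0.3, 0.3)  # p[("A", "A")] -> 0.0, p[("A", "B")] -> 0.5
```

Setting `t_hw1 = t_hw2 + math.pi / 4` instead reproduces the reversed, always-conflicting configuration mentioned above.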
When the two players make the same decision, the reward for each player is usually divided in half, as schematically illustrated in Fig. 2a. Therefore, from the viewpoint of maximizing the total reward, when a player chooses the best slot machine, the other player should select the other machine. Hence, entangled-photon-based decision making theoretically provides the maximum total reward 16. However, as discussed in the introduction, decision conflicts could yield a greater total reward depending on the reward environment. Here, we define the notion of a happy hour. During a happy hour, one of the two slot machines dispenses a reward of unity per play to all players who select that machine, even when the decisions are conflicting, as schematically shown in Fig. 2b. In the present study, we assume that the higher-reward-probability machine occasionally provides a happy hour. This means that, during the happy hour, a player gets one coin upon winning even if the decision conflicts with that of the other player. At the same time, it should be emphasized that the reward per play is also unity during non-happy hours; that is, a player gets one coin upon winning if the decision is not conflicting. Therefore, a player cannot detect the occurrence of a happy hour simply by observing the amount of reward per play. On the other hand, a player can immediately realize the end of a happy hour because the dispensed reward decreases to one-half due to decision conflict.
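The reward rule just described can be stated compactly. The sketch below is our own encoding of the per-player, per-play rule of Fig. 2a,b; the function name is an assumption, not from the paper:

```python
def dispensed_reward(win, conflicted, happy_hour):
    """Per-player reward for one play: a winning play pays one coin,
    halved when both players chose the same machine, unless that machine
    is currently in a happy hour (Fig. 2a,b)."""
    if not win:
        return 0.0
    if conflicted and not happy_hour:
        return 0.5
    return 1.0

# A winning conflicted play during a happy hour pays the same as a winning
# non-conflicted play, so the start of a happy hour cannot be detected from
# the reward alone; its end (1.0 -> 0.5) can.
```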
Mixed strategy. The aim of the present study is to statistically mix entangled-photon-based and correlated-photon-based decision making. While the entangled photons provide non-conflicting decisions, the players can accumulate information about the slot machines. Assume that Machine i is selected N_i times and the number of wins is L_i. Based on maximum likelihood estimation, the estimated reward probability of Machine i is given by P̂_i = L_i / N_i (i = A, B). Here, we consider the machine that gives the maximum P̂_i to be the best machine, denoted as Machine m. The source photon states are then switched (by HW_E and QW_E in Fig. 1a) so that they provide correlated photons. Although here we tune a common photon-pair source for both types of states, this could be done by switching from one distinct photon-pair source to another, without any difference in the results. At the same time, the half-wave plates of Players 1 and 2 (denoted, respectively, by HW_1 and HW_2 in Fig. 1a) are configured in such a way that Machine m is chosen based on Eqs. (4) and (5); i.e., a conflicting decision is intentionally induced. If the amount of reward is unity in such an intentionally induced conflicting decision, we can deduce that Machine m is indeed operating in a happy hour. In addition, once Machine m returns to non-happy-hour operation, the players can immediately detect the end of the happy hour because the dispensed reward becomes one-half due to decision conflict. Such a mixed strategy of entangled and correlated photons is summarized in Algorithm 1 and Fig. 2c.

1. [Entangled strategy] Play the slot machines based on entangled-photon-based decision making while accumulating knowledge about the reward probabilities of the slot machines. Repeat this strategy during SI steps, where SI refers to the search interval. Determine the highest-reward-probability machine by m = arg max_i (P̂_i).
2. [Correlated strategy] Play the slot machines based on correlated-photon-based decision making. Here, the half-wave plates of Players 1 and 2 are configured so that both players select Machine m. Repeat this strategy during CP steps, where CP refers to the check span.

3. If the dispensed reward does not become unity, go back to entangled-photon decision making (Step 1). If the dispensed reward is unity (i.e., Machine m is operating in a happy hour), the correlated-photon strategy is maintained; when the dispensed reward becomes one-half, go back to entangled-photon decision making (Step 1).
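The steps above can be sketched in simulation form. This is our own minimal re-implementation under the stated assumptions (Machine A alternates happy and non-happy hours every `period` plays with a random initial phase; per-play rewards follow Fig. 2); it is not the authors' code, and details such as tie-breaking and per-phase reset of the estimates are our choices:

```python
import random

def mixed_strategy(p_a, p_b, si=14, cp=2, total_plays=1500, period=50, seed=0):
    """Minimal sketch of Algorithm 1. Machine A alternates happy and
    non-happy hours every `period` plays, with a random initial phase.
    Returns the total reward of both players over `total_plays` plays."""
    rng = random.Random(seed)
    probs = {"A": p_a, "B": p_b}
    offset = rng.randint(1, 2 * period)
    happy = lambda t: ((t + offset) // period) % 2 == 0  # A in a happy hour?
    total, t = 0.0, 0
    while t < total_plays:
        # Step 1 (entangled): SI plays; the players always cover both machines.
        wins = {"A": 0, "B": 0}
        for _ in range(min(si, total_plays - t)):
            for machine in ("A", "B"):
                if rng.random() < probs[machine]:
                    wins[machine] += 1
                    total += 1.0            # full reward: no conflict
            t += 1
        if t >= total_plays:
            break
        m = "A" if wins["A"] >= wins["B"] else "B"   # arg max of L_i / SI
        # Steps 2-3 (correlated): both players intentionally pick Machine m.
        happy_detected, steps = False, 0
        while t < total_plays:
            t += 1
            steps += 1
            if rng.random() < probs[m]:     # both players win together
                per_player = 1.0 if (m == "A" and happy(t)) else 0.5
                total += 2 * per_player
                if per_player == 1.0:
                    happy_detected = True   # unity reward: happy hour found
                elif happy_detected:
                    break                   # reward halved: happy hour ended
            if not happy_detected and steps >= cp:
                break                       # no happy hour within CP plays
    return total

# Average over repetitions, as in the paper's Fig. 3a setting:
avg = sum(mixed_strategy(0.6, 0.4, seed=s) for s in range(100)) / 100
```

With no happy hours and P_A + P_B = 1, a run of this kind averages around 1500, matching the entangled-photon-only baseline discussed below.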

Discussion
The present study is stimulated by the notion of an evolutionarily stable strategy (ESS) known in evolutionary game theory 24. Let us roughly describe the Hawk-Dove game to convey the fundamental concept of an ESS; detailed discussions are available in the literature, such as Ref. 24. Players who choose the Hawk strategy always take adversarial actions when confronted by their opponents. By doing so, they can gain large rewards if they win the battle, but they may also suffer heavy damage if they lose. Conversely, players who choose the Dove strategy always avoid battles when they face their enemies. Hence, there is no gain (because they avoid battles), but there is also no risk of loss. In evolutionary game theory, there exists an optimal mixture of Hawk and Dove strategies that maximizes the expected reward, and this mixed strategy can be superior to both the pure Hawk and pure Dove strategies depending on the environment. The optimal mixture depends on the gains and losses in the battle. We observe similarities between this concept in evolutionary game theory and the present study of the CMAB problem. The Dove strategy is similar to the entangled-photon strategy, which attempts to secure the achievable total reward, while the Hawk strategy is like the correlated-photon strategy, which seeks greater reward at a certain degree of risk. The difference lies in the method by which the optimal mixture is derived.

In the following numerical analysis, 1500 consecutive slot machine plays are conducted for the 2-player, 2-armed CMAB problem. For the sake of simplicity, we assume fixed reward probabilities P_A and P_B throughout the 1500 plays. The total reward, which is the sum of the rewards gained by Players 1 and 2, is calculated by averaging over 1000 repetitions of such 1500 consecutive plays. See the "Methods" section for details.
In addition, we assume that P_A is greater than P_B while the condition P_A + P_B = 1 holds. Therefore, if there are no happy hours, the expected maximum total reward is 1500 under the entangled-photon decision strategy. This is because the entangled photons ensure the absence of conflict, meaning that the two different machines are always selected; hence the expected total reward per play is P_A + P_B = 1. Moreover, the condition P_A + P_B = 1 leads to a constant total reward under the entangled-photon-only strategy, which allows us to isolate the effect of the mixed strategy. The dashed blue line in Fig. 3a shows the calculated total reward. Now, we examine the impact of the occurrence of happy hours. We assume that happy and non-happy hours interchange periodically every T steps, with T being an integer. Let us first focus on the case T = 50. The red curve in Fig. 3a represents the average total reward as a function of the search interval when (P_A, P_B) = (0.6, 0.4). We observe that the total reward is greater than that of the entangled-photon-only strategy if SI is between 6 and 30, while the maximum total reward is realized when SI = 14. Hence, SI = 14 provides the optimal mixed strategy for this particular reward environment. A search interval that is too short implies excessively greedy actions (SI < 6), whereas an excessively long search interval implies missing large rewards during the happy hours (SI > 30). For the numerical evaluation, the initial starting time of the happy hour was determined for each repetition by a uniformly distributed random natural number between 1 and 2 × T, where T is the interval of happy and non-happy hours. The solid curve is the average of the total reward over 1000 such randomly arranged repetitions. The error bars show the corresponding standard deviation, which was found to be unaffected by larger repetition numbers (up to 10,000).
Note that the standard deviation is indicated only for the search intervals of 1 and 5 × n (n = 1, …, 10) in Fig. 3a. The computing environments are described in the "Methods" section. The green, magenta, and brown curves in Fig. 3a show the average total reward when (P_A, P_B) is equal to (0.7, 0.3), (0.8, 0.2), and (0.9, 0.1), respectively. The optimal search interval that yields the maximum total reward decreases as P_A becomes larger. This is because the rewards gained during happy hours increase dramatically as P_A increases. When P_A = 0.9, the total reward is almost 2050, which is nearly a 40% increase compared with the entangled-photon-only strategy. In addition, when P_A is greater than 0.7, the total reward exceeds that of the entangled-photon-only strategy even with extremely short as well as extremely long search intervals, indicating that the gain accomplished during a happy hour surely outweighs the cost of conducting correlated-photon-based greedy actions.
The optimal mixture, however, also depends on the frequency of happy hours. The curves in Fig. 3b examine the total rewards when the happy-hour interval is increased from 10 to 100 steps. First, we consider the case of (P_A, P_B) = (0.6, 0.4), shown in Fig. 3b-iv. As the happy-hour interval decreases, the maximum total reward decreases, indicating that the mixed strategy cannot fully adapt to rapid environmental changes. Nevertheless, we also observe that the search interval that yields the maximum total reward decreases as the happy-hour interval decreases, meaning that more frequent use of the correlated-photon-based decision-making strategy provides a greater total reward. The same tendency is observed for the other reward probability settings (P_A, P_B) = (0.9, 0.1), (0.8, 0.2), and (0.7, 0.3), summarized in Fig. 3b-i, ii, and iii, respectively. Furthermore, it is interesting to observe the oscillatory behaviour of the total reward as a function of the search interval. This is clearly due to the interdependence between the environmental switching and the strategy switching, as the period of oscillation is about twice the period of environmental switching and does not depend on the reward probabilities.
For the sake of obtaining higher rewards regardless of the given environmental change dynamics, and to accommodate uncertainty in the given environment, we discuss the optimality of the search interval. The curves in Fig. 3c represent the normalized total reward, (R − R_MIN)/(R_MAX − R_MIN), where R is the total reward for a given search interval, and R_MIN and R_MAX indicate the minimum and maximum total rewards over the range of search intervals under study (1 ≤ SI ≤ 50). The black curves represent the average total reward over different happy-hour intervals. The search interval that maximizes the normalized total reward is summarized in Fig. 4a. The horizontal axis represents the difficulty of finding the higher-reward-probability machine, defined as 1 − (P_A − P_B). The reward probability combination (P_A, P_B) = (0.9, 0.1) corresponds to a difficulty of 0.2, whereas (P_A, P_B) = (0.5, 0.5) corresponds to a difficulty of unity, which means that the slot machines are identical. One remark here is that Machine A provides the happy hour in the case of P_A = P_B, while Machine B does not. The optimal search interval monotonically increases as the difficulty of finding the better machine increases. Furthermore, if the difficulty is less than 0.8 (i.e., the reward probability difference is greater than 0.2), a search interval of approximately 5-10 is close to optimal. This is also confirmed by Fig. 3b, suggesting that such a search interval can accommodate the uncertainty of reward environments in terms of both the reward probability values and the dynamics of happy hour occurrences.
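The normalization used in Fig. 3c is a simple min-max rescaling over the studied search intervals; a sketch with our own naming:

```python
def normalize(rewards):
    """Min-max rescaling (R - R_MIN) / (R_MAX - R_MIN) of the total rewards
    collected over the studied range of search intervals (Fig. 3c)."""
    r_min, r_max = min(rewards), max(rewards)
    return [(r - r_min) / (r_max - r_min) for r in rewards]

# e.g. total rewards for three search intervals mapped onto [0, 1]:
scaled = normalize([1400.0, 1600.0, 1500.0])  # -> [0.0, 1.0, 0.5]
```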
The check span is another parameter that influences the optimal mixture of the entangled- and correlated-photon-based decision-making strategies. Here we focus on the case where the reward probabilities are given by (P_A, P_B) = (0.7, 0.3), whose optimal search interval is 9, as observed in Fig. 4a. While keeping this search interval (SI = 9), the square marks in Fig. 4b show the average total reward as a function of CP when the happy-hour interchange interval is 50. The number of repetitions is 1000, with the initial starting time of the happy hour randomly specified for each repetition, as in the above analyses. The maximum average total reward is obtained when CP is 2. This result indicates that too short a CP (CP < 2) may miss the detection of a happy hour because of the probabilistic attributes of the slot machine, whereas an excessive CP (CP > 2) leads to excessive loss. The analysis in Fig. 3 was conducted with a CP value of 2. However, it should be noted that the achievable rewards are comparable in view of the standard deviation denoted by the error bars. This indicates that CP is not a dominant parameter; hence, the optimality of CP cannot be conclusively affirmed.
In the demonstrations above, the sum of the reward probabilities has been kept at unity: P_A + P_B = 1. Again, this condition serves to analyze the effect of the mixed strategy while keeping the total reward the same as that of the entangled-photon-only strategy. The proposed strategy works with other reward environments in general. Figure 5 shows the total reward as a function of the search interval for the cases where P_A is given by 0.9, 0.7, 0.5, and 0.3 while P_B is kept at 0.2. The other conditions are the same as in Fig. 3a; the happy and non-happy hours are periodically switched every 50 steps, and the check span is 2. The dashed lines depict the total reward when the entangled-photon-only strategy is adopted, which is given by (P_A + P_B) × 1500, where 1500 is the total number of plays. The total reward obtained by the mixed strategy becomes increasingly larger than that of the entangled-photon-only strategy as P_A increases, because the reward obtained during happy hours increases. On the other hand, when P_A = 0.3 and P_B = 0.2, the merit of the mixed strategy is negligible or even negative because the difference in the reward probabilities is small, similar to the observations in Figs. 3 and 4.
Before concluding the paper, we make a few remarks on this study. In this work, we focussed on the 2-player, 2-machine CMAB problem to highlight the central concept and principle of the entangled- and correlated-photon mixed strategy. The extension of the proposed method to the general N-player, M-machine CMAB problem is an important future study. Indeed, Chauvet et al. have already demonstrated optimal entangled photon states for three, four, and five players for the 2-armed bandit problem 21. Scalability analysis is also critical from the viewpoint of practical applications such as resource management in information and communication infrastructure 10. Also, the mixed strategy studied herein allows the players to immediately change the photon states from correlated to entangled photons when they detect the end of a happy hour in Step 3 of Algorithm 1. Such controllability or accessibility of the photon source by the players could be generalized, for example by introducing a time delay, which is an interesting future topic.

The reward probability estimation also needs to be studied. In Step 1 of Algorithm 1, the expected reward probabilities of Machines A and B were evaluated as P̂_A = L_A/SI and P̂_B = L_B/SI, respectively, after SI-step slot machine plays, where L_A and L_B denote the numbers of wins on Machines A and B, respectively. It is remarkable that the denominators of P̂_A and P̂_B are both SI because the decisions were based on entangled photons; i.e., Machines A and B were chosen exactly the same number of times. However, it should also be noted that, in the present study, the information is assumed to be integrated from both players. In a future study, we will tackle the case where the reward probability estimation is conducted completely independently by each player.
We would also presume, however, that the impact of this independent estimation may be minor, particularly when SI is moderately large, owing to the equality secured by the entangled photons. Meanwhile, an alternative approach to finding the higher-reward-probability machine is to utilize round-robin scheduling; for instance, Players 1 and 2 select Machines A and B, respectively, during SI/2 plays, and vice versa for the subsequent SI/2 plays, using classical photon pairs. This approach, however, requires pre-determined coordination among the players to avoid conflicting selections, which is outside the problem setting of the present study. Entangled photon pairs, on the other hand, autonomously provide random and non-conflicting choices for the players.
It should also be noted that order recognition is required for solving general N-player, M-machine CMAB problems, especially when M > N. Hence, finding the highest-reward-probability machine alone is not sufficient, and a novel strategy should be developed for order recognition. In this respect, we presume that the random and non-conflicting selections by entangled photons may provide more efficient recognition of the reward environment, even when compared with the pre-coordinated round-robin scheduling approach. To achieve this, several approaches, such as using confidence intervals 25 and Schubert calculus 26, could be integrated with the present study in the future.

[Figure 4 caption: (a) The optimal search interval monotonically increases as the difficulty of finding the better slot machine increases. Also, if the difficulty is less than 0.8, a search interval of approximately 5-10 yields nearly optimal total rewards (see Fig. 3b). (b) The check span (CP) is another parameter for the mixed strategy. A CP of 2 yields the maximum average total reward. However, concerning the standard deviation denoted by the error bars, the achievable rewards are comparable, indicating that CP is not necessarily a dominant parameter.]

Conclusion
We theoretically and numerically demonstrated an entangled- and correlated-photon-based mixed decision-making strategy to obtain enhanced total rewards in dynamically changing reward environments. Entangled-photon-based decision making completely avoids conflicts and secures equal opportunities for all players. However, conflict avoidance does not necessarily maximize the total reward in reward environments where greedy actions are beneficial, even socially. By introducing the notion of happy hours into competitive multi-armed bandit problems, the cases in which conflicts are beneficial were systematically examined. We demonstrated an optimal mixture of entangled- and correlated-photon strategies in terms of adequate switching intervals between the two strategies. The present study is relevant to evolutionarily stable strategies known in evolutionary game theory, where the optimally mixed strategy provides greater expected rewards than other mixed and pure strategies in biological species. We observe similarities between the proposed method and evolutionary game theory in terms of the mixture of strategies themselves as well as the dependence on the given environment. This study paves the way for utilizing both quantum and classical aspects of photons in a mixed manner, as well as demonstrating, yet again, the supremacy of mixed strategies.