Entangled-photon decision maker

The competitive multi-armed bandit (CMAB) problem is related to social issues such as maximizing total social benefits while preserving equality among individuals by overcoming conflicts between individual decisions, which could seriously decrease social benefits. The study described herein provides experimental evidence that entangled photons physically resolve the CMAB in the two-armed, two-player case, maximizing the social rewards while ensuring equality. Moreover, we demonstrated that deception, or outperforming the other player by receiving a greater reward, cannot be accomplished in a polarization-entangled-photon-based system, whereas deception is achievable in systems based on classical polarization-correlated photons with fixed polarizations. In addition, random polarization-correlated photons have been studied numerically and shown to ensure both equality between players and deception prevention, although the maximum CMAB performance is reduced compared with the entangled-photon experiments. Autonomous alignment schemes for polarization bases were also experimentally demonstrated based only on decision conflict information observed by an individual, without communication between players. This study paves the way for collective decision making in uncertain, dynamically changing environments based on entangled quantum states, a crucial step toward utilizing quantum systems for intelligent functionalities.


Section 1. Single-player and two-non-cooperative-player decision-making strategies
We describe the case when only Player 1 plays the slot machines. If, in cycle t, the selected machine is Machine A and it yields a reward (in other words, if Player 1 wins by playing Machine A), then the polarization adjuster value of Player 1 (PA1) is updated at cycle t + 1 according to

$$PA_1(t+1) = -\Delta_1 + \alpha_1\, PA_1(t),$$

where $\alpha_1$ is a forgetting parameter (refs. S1, S2) and $\Delta_1$ is a constant increment. In this experiment, $\Delta_1 = 1$ and $\alpha_1 = 0.999$. The initial value of PA1 is zero. If the selected machine (Machine A) does not yield a reward (i.e., if Player 1 loses), PA1 is updated according to

$$PA_1(t+1) = +\Omega_1 + \alpha_1\, PA_1(t),$$

where $\Omega_1$ is a parameter that is adaptively configured according to the history (ref. S2). In this study, $\Omega_1$ was a constant ($\Omega_1 = 1$), assuming the sum of the reward probabilities to be known (PA + PB = 1).
Intuitively speaking, PA1 decreases if Machine A is more likely to win and increases if Machine B is considered more likely to yield rewards. The value of PA1 is then used for polarization control of HP1. Specifically, the orientation of HP1 at cycle t is determined by $\lfloor PA_1(t) \rceil$, where $\lfloor \cdot \rceil$ represents the round-off function to the closest whole number. In the experiment, the rounded integer PA values were −3, −2, −1, 0, 1, 2, and 3. For these values, the orientation of HP1 was set as shown in Fig. S1a, so that the ratio of the photon count by APD1 to that by APD2 was 20, 10, 5, 1, 1/5, 1/10, and 1/20, respectively, to resolve the asymmetry effectively. In addition, to prevent the PA value from becoming too large or too small, which could limit the speed of adaptation to environmental changes, the maximum and minimum PA values were set to 10 and −10, respectively.
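The update and rounding steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code. It assumes the update takes the form PA(t+1) = α·PA(t) − Δ on a win with Machine A and PA(t+1) = α·PA(t) + Ω on a loss with Machine A, that the signs are mirrored when Machine B is played, that ties in rounding go away from zero, and that rounded values beyond ±3 saturate at ±3. All function and constant names are hypothetical.

```python
import math

ALPHA = 0.999    # forgetting parameter (value from the text)
DELTA = 1.0      # constant increment applied on a win (value from the text)
OMEGA = 1.0      # increment applied on a loss (value from the text)
PA_LIMIT = 10.0  # PA is kept within [-10, 10]

# APD1:APD2 photon-count ratio for each rounded PA value (from the text).
COUNT_RATIO = {-3: 20, -2: 10, -1: 5, 0: 1, 1: 1 / 5, 2: 1 / 10, 3: 1 / 20}


def update_pa(pa, machine, rewarded):
    """Return the polarization adjuster value for the next cycle.

    ASSUMPTION: playing Machine B mirrors the rule given for Machine A.
    """
    step = -DELTA if rewarded else OMEGA  # rule for Machine A
    if machine == "B":
        step = -step                      # mirrored rule (assumption)
    new_pa = ALPHA * pa + step
    return max(-PA_LIMIT, min(PA_LIMIT, new_pa))


def rounded_setting(pa):
    """Round PA to the closest integer and saturate at the seven settings.

    ASSUMPTION: ties round away from zero; values beyond +/-3 saturate.
    """
    magnitude = int(math.floor(abs(pa) + 0.5))
    rounded = magnitude if pa >= 0 else -magnitude
    return max(-3, min(3, rounded))


def prob_machine_a(pa):
    """Probability that the first photon is seen by APD1 (Machine A)."""
    ratio = COUNT_RATIO[rounded_setting(pa)]
    return ratio / (1 + ratio)
```

For example, starting from PA = 0, a win on Machine A moves PA to −1, which corresponds to a count ratio of 5 and therefore a probability of 5/6 of choosing Machine A at the next play.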
For a given slot machine play, the actual decision of Player 1 was made by the identity of the channel where the first photon arrival was measured (ref. S3). If the first photon was detected by APD1, the decision was to select Machine A, whereas if it was detected by APD2, the decision was to choose Machine B. Likewise, the decision of Player 2 was made based on the first photon detection either by APD3 or APD4.
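To illustrate why a photon-count ratio translates into a decision probability, the first-photon rule can be mimicked numerically. The sketch below assumes that the two detectors see independent Poisson photon streams, so that the waiting time to the first click is exponentially distributed; this detection model and the function name are assumptions made for illustration only.

```python
import random


def first_photon_decision(rate_apd1, rate_apd2, rng):
    """Return 'A' if APD1 clicks first and 'B' if APD2 clicks first.

    ASSUMPTION: independent Poisson streams with the given mean count rates.
    """
    t1 = rng.expovariate(rate_apd1)  # waiting time to the first click on APD1
    t2 = rng.expovariate(rate_apd2)  # waiting time to the first click on APD2
    return "A" if t1 < t2 else "B"


# With a 5:1 count ratio, Machine A should be chosen in about 5/6 of the plays.
rng = random.Random(0)
n_plays = 20000
share_a = sum(first_photon_decision(5.0, 1.0, rng) == "A"
              for _ in range(n_plays)) / n_plays
print(f"fraction of plays choosing Machine A: {share_a:.3f}")
```

Under this model, a count ratio r between APD1 and APD2 gives a probability r/(1 + r) of selecting Machine A.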

Section 2. Implementation of collective decision making
Using the multiple-event time digitizer, four kinds of coincidence of observing photons at (i) APD1 and APD3, (ii) APD2 and APD3, (iii) APD1 and APD4, and (iv) APD2 and APD4 were measured.
For a given slot machine play, the identity of the first observation of the coincidence among the four possible combinations was considered to represent the decisions of Players 1 and 2. The correspondence between the photon measurement and the decision to be made was the same as in the single-player cases described in Sec. 1; for example, measurement of a photon by APD1 corresponded to the decision of choosing Machine A for Player 1, while the detection of a photon by APD3 corresponded to the decision of choosing Machine A for Player 2. Therefore, for example, if the first coincident observation was made by APD1 and APD3, the decision of Player 1 was to choose Machine A, while that of Player 2 was also to choose Machine A.
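The mapping from a coincidence event to the pair of decisions can be written down directly. In the minimal sketch below, the record layout `(timestamp, p1_detector, p2_detector)` and the function names are hypothetical, and a "conflict" is taken to mean that both players chose the same machine.

```python
# Detector-to-machine mapping as described in the text.
PLAYER1_DETECTORS = {"APD1": "A", "APD2": "B"}
PLAYER2_DETECTORS = {"APD3": "A", "APD4": "B"}


def decisions_from_coincidence(p1_detector, p2_detector):
    """Translate one coincidence into (choice of Player 1, choice of Player 2, conflict)."""
    choice1 = PLAYER1_DETECTORS[p1_detector]
    choice2 = PLAYER2_DETECTORS[p2_detector]
    # Conflict: both players selected the same machine.
    return choice1, choice2, choice1 == choice2


def first_coincidence(events):
    """Pick the earliest coincidence among (timestamp, p1_detector, p2_detector) records.

    The record layout is a hypothetical stand-in for the time digitizer output.
    """
    _, p1_detector, p2_detector = min(events, key=lambda event: event[0])
    return decisions_from_coincidence(p1_detector, p2_detector)


print(decisions_from_coincidence("APD1", "APD3"))  # both players choose Machine A
```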
Fig. S1c characterizes the optical system by presenting the coincidence rate statistics, measured over 2 s, of the joint photodetection as a function of the rotation of HW1 and HW2. The following detector pairs were used: (i) APD1 and APD3 (downward triangles), (ii) APD1 and APD4 (circles), (iii) APD2 and APD3 (diamonds), and (iv) APD2 and APD4 (upward triangles). The large coincidences of (ii) and (iii) and the smaller coincidences of (i) and (iv) throughout the polarization basis clearly demonstrate the successful establishment of cross-polarized entangled photon pairs.
The total rewards accomplished by correlated and entangled photon pairs presented in Fig. 3 in the main manuscript are the average accumulated rewards at cycle 100 when the polarization bases were 0°, 15°, 30°, and 45°. This was done to ensure fair comparison between the correlated and entangled photons concerning the contrasting behavior in the case of the correlated photons with polarization bases of 0° and 45°.

Section 3. Definition of equality
The equality shown in Figs. 3a,iii and 3b,iii in the main manuscript was evaluated as follows.
(1) Calculate the ratio between the average numbers of times that Players 1 and 2 selected the higher-reward-probability machine from the first play to the 50th play.
(2) Calculate the ratio between the average numbers of times that Players 1 and 2 selected the higher-reward-probability machine from the 51st play to the 100th play.
(3) Calculate the average of (1) and (2).
This average is a reasonable metric of the equality of the opportunities to select the higher-reward-probability machine, taking into account the fact that the casino setting was different between the first 50 cycles and the next 50 cycles.
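The three steps above can be sketched as follows. The data layout (one list of 0/1 flags per repeated run, with 1 marking a play in which the player chose the higher-reward-probability machine), the direction of the ratio (Player 1 over Player 2), and the function names are assumptions made for illustration.

```python
def mean_count(runs, start, stop):
    """Average number of correct selections in plays [start, stop) over all runs."""
    return sum(sum(run[start:stop]) for run in runs) / len(runs)


def equality(p1_runs, p2_runs, split=50):
    """Average of the Player 1 / Player 2 ratios before and after the casino change.

    ASSUMPTION: the ratio is taken as Player 1 over Player 2; the text does not
    state the direction. Player 2 must have at least one correct selection in
    each half, otherwise the ratio is undefined.
    """
    n_plays = len(p1_runs[0])
    ratios = []
    for start, stop in ((0, split), (split, n_plays)):
        n1 = mean_count(p1_runs, start, stop)
        n2 = mean_count(p2_runs, start, stop)
        ratios.append(n1 / n2)
    return sum(ratios) / len(ratios)


# Two players alternating perfectly over 100 plays are exactly equal.
p1 = [[1, 0] * 50]
p2 = [[0, 1] * 50]
print(equality(p1, p2))  # 1.0
```

A value close to 1 indicates that both players had the same opportunity to select the better machine in both halves of the experiment.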

Section 4. Dependence of total rewards on casino setting
In order to check the sensitivity of the decision-making performance with respect to the reward probabilities PA and PB, they were changed from PA = 0.2 and PB = 0.8 to PA = 0.4 and PB = 0.6. Finding the higher-reward-probability machine was more difficult in this case than in the former cases because of the smaller difference between the reward probabilities. The blue bars in Fig. S2 show the accumulated rewards at cycle 100. Indeed, for Player 1 only, Player 2 only, and two noncooperative players, the total reward is substantially lower than in the former cases (depicted by the red bars). These differences are due to the longer time needed to reach stable selection of the higher-reward-probability machine. In contrast, with correlated and entangled photons, the team reward does not change, and the entangled photons again provide the maximum total reward. This finding clearly demonstrates that collective decision making based on entangled photons ensures that the social maximum reward will be achieved regardless of the difficulty of the given problem. This has strong implications for resource allocation, for example in network communications, as the maximized efficiency is ensured whatever the actual qualities of the two channels, which may fluctuate in time.

Section 5. Randomly cross-polarized photon pairs
Polarization-correlated and polarization-entangled photon pairs show very different characteristics in the configuration used for these experiments, notably a fixed polarization state for the former and a coherent sum of states for the latter. In particular, a player can fully determine the polarization state of polarization-correlated photon pairs by using his/her half-wave plate, whereas it is impossible to do so for polarization-entangled photon pairs. Another kind of photonic state can be used for comparison with entangled photon pairs: a series of photon pairs cross-polarized along random directions. In this situation, using the same notation as in the main text, each photon pair has a polarization state of the form

$$|\theta, \theta + \pi/2\rangle,$$

where the angle $\theta$ varies randomly from pair to pair. With this input state, players have a 50% probability on average of selecting either Machine A or Machine B whatever the angle of their own waveplate; as a consequence, individual rewards are necessarily identical between players on average. The only way for players to change the total rewards is to tune the conflict rate by modifying the relative angle between their waveplates.
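The 50% figure can be checked with a short calculation. As a sketch, assume that a photon linearly polarized at angle $\theta$ is detected in the Machine A channel of an analyzer oriented at angle $\varphi$ with probability $\cos^2(\theta - \varphi)$ (Malus's law), and that $\theta$ is uniformly distributed over $[0, \pi)$. The symbol $\varphi$ is introduced here only for illustration.

```latex
\langle P_A \rangle
  = \frac{1}{\pi} \int_0^{\pi} \cos^2(\theta - \varphi)\, d\theta
  = \frac{1}{\pi} \int_0^{\pi} \frac{1 + \cos\bigl(2(\theta - \varphi)\bigr)}{2}\, d\theta
  = \frac{1}{2},
\qquad
\langle P_B \rangle = 1 - \langle P_A \rangle = \frac{1}{2}.
```

The cosine term integrates to zero over a full period for any $\varphi$, which is why the result does not depend on the waveplate angle chosen by the player.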
Simulations have been run to check this behavior with the following set-up: at each play, a uniformly distributed pseudorandom number is used to select the angle $\theta$ of the state $|\theta, \theta + \pi/2\rangle$ used as the input photon pair state for the players, with each photon then following the same rules and transformations as for polarization-correlated pairs. The reward probabilities follow the same rules as before, with PA = 0.2 and PB = 0.8. Figure S3 shows numerical results with cooperative players (i.e., fixed waveplate angle) for (a) identical polarization bases, (b) polarization bases tilted by 45 degrees from each other, and (c) polarization bases tilted by 90 degrees. As predicted, while the correct decision ratio (CDR) shown in item (i) is independent of the polarization bases (CDR stays around 0.5) and identical for both players on average, the conflict ratio in (ii) varies according to the relative angle tilt. These changes explain the differences in accumulated reward in (iii) between the three cases: aligned bases lead to an average accumulated total reward of 87.9 in (a), whereas a 50% conflict ratio lowers it to 74.7 in (b) and a higher conflict ratio lowers it to 62.7 in (c). These results indicate that the outcome depends on the relative angle between waveplates, as is the case for entangled photon pairs, though the maximum total accumulated reward of 100 cannot be reached because of residual conflicts due to experimental imperfections. Figure S4 summarizes the comparison between the two kinds of resources and their performance with respect to the misalignment angle between the players' bases, by presenting the accumulated total reward for different tilts between the players' measurement bases. In particular, although entangled photon pairs enable players to reach the maximum accumulated reward when their bases are mutually aligned, they suffer heavier losses than randomly cross-polarized photon pairs when the misalignment exceeds 45 degrees.
The lower sensitivity obtained with this kind of resource may then be of interest for applications more susceptible to noise and perturbations between players.
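The trend described in this section can be reproduced with a small Monte Carlo sketch. It is not the authors' simulation code and rests on several assumptions: ideal detection following Malus's law, fixed waveplates (no adaptation), a tilt expressed as an angle between polarization bases, and a conflict rule in which two players choosing the same machine share a single reward draw from that machine. The entangled-pair formula assumes an ideal rotationally invariant anticorrelated state, for which the probability of both players choosing the same machine is sin² of the tilt. All function names are hypothetical.

```python
import math
import random


def simulate_random_pairs(tilt_deg, n_plays=100, n_runs=2000,
                          p_a=0.2, p_b=0.8, seed=0):
    """Return (conflict rate, mean accumulated total reward per run)."""
    rng = random.Random(seed)
    tilt = math.radians(tilt_deg)
    conflicts = 0
    total_reward = 0
    for _ in range(n_runs * n_plays):
        theta = rng.uniform(0.0, math.pi)  # random pair orientation
        # Malus's law: Player 1 analyzes at 0, Player 2 at `tilt`.
        prob1_a = math.cos(theta) ** 2
        prob2_a = math.cos(theta + math.pi / 2 - tilt) ** 2
        choice1 = "A" if rng.random() < prob1_a else "B"
        choice2 = "A" if rng.random() < prob2_a else "B"
        if choice1 == choice2:
            # ASSUMPTION: on a conflict, one reward draw is shared.
            conflicts += 1
            p = p_a if choice1 == "A" else p_b
            total_reward += rng.random() < p
        else:
            # Each machine is played by exactly one player.
            total_reward += (rng.random() < p_a) + (rng.random() < p_b)
    return conflicts / (n_runs * n_plays), total_reward / n_runs


def expected_total_random(tilt_deg, n_plays=100, p_a=0.2, p_b=0.8):
    """Closed-form expectation under the same assumptions."""
    conflict = 0.5 - math.cos(2 * math.radians(tilt_deg)) / 4
    return n_plays * (p_a + p_b) * (1 - conflict / 2)


def expected_total_entangled(tilt_deg, n_plays=100, p_a=0.2, p_b=0.8):
    """Idealized entangled pair: conflict probability sin^2(tilt) (assumption)."""
    conflict = math.sin(math.radians(tilt_deg)) ** 2
    return n_plays * (p_a + p_b) * (1 - conflict / 2)


if __name__ == "__main__":
    for tilt in (0, 45, 90):
        rate, reward = simulate_random_pairs(tilt)
        print(tilt, round(rate, 3), round(reward, 1),
              round(expected_total_entangled(tilt), 1))
```

Under these assumptions the closed-form totals for randomly cross-polarized pairs are 87.5, 75 and 62.5 at tilts of 0, 45 and 90 degrees, close to the 87.9, 74.7 and 62.7 reported above, while the idealized entangled-pair totals are 100, 75 and 50. The two curves cross at 45 degrees, consistent with the heavier losses of entangled pairs at larger misalignment.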