Conflict-free collective stochastic decision making by orbital angular momentum of photons through quantum interference

In recent cross-disciplinary studies involving both optics and computing, single-photon-based decision making has been demonstrated by utilizing the wave-particle duality of light to solve multi-armed bandit problems. Furthermore, entangled-photon-based decision making has managed to solve a competitive multi-armed bandit problem in such a way that conflicts of decisions among players are avoided while equality is ensured. However, as these studies are based on the polarization of light, the number of available choices is limited to two, corresponding to the two orthogonal polarization states. Here we propose a scalable principle to solve competitive decision-making situations by using the orbital angular momentum of photons, whose high dimensionality theoretically allows an unlimited number of arms. Moreover, by extending the Hong-Ou-Mandel effect to more than two states, we theoretically establish an experimental configuration able to generate multi-photon states with orbital angular momentum, and conditions that provide conflict-free selections at every turn. We numerically examine total rewards in three-armed bandit problems, for which the proposed strategy accomplishes almost the theoretical maximum, exceeding that of a conventional mixed strategy intended to realize the Nash equilibrium. This is enabled by the quantum interference effect, which achieves conflict-free selections even during the exploration phase used to find the best arms.

Optics and photonics are expected to play crucial roles in future computing systems 1 , motivating intensive study of a variety of devices and systems such as optical fibre-based neuromorphic computing 2 , on-chip optical neural networks 3 , and optical reservoir computing 4 , among others. While these works basically fall under supervised learning, reinforcement learning is another important branch of artificial intelligence 5 . The Multi-Armed Bandit (MAB) problem is an example of a reinforcement learning situation; it formulates a fundamental issue of decision making in dynamically changing, uncertain environments, where the target is to find the best selection among many slot machines, also referred to as arms, whose reward probabilities are unknown 6 . In solving MAB problems, exploration actions are necessary to find the best arm, although too much exploration may reduce the final amount of reward obtained from exploitation. Conversely, insufficient exploration may lead to missing the best arm. Furthermore, when multiple players are involved, decision conflicts become serious, as they induce congestion and inhibit socially achievable benefits 7,8 . Equality among players is another critical issue, as an unfair repartition of outcomes may lead players to distrust the system. This whole problem is known as the competitive MAB (CMAB) problem.
In order to solve these complex issues, photonic solutions have recently been considered. For example, the wave-particle duality of single photons has been utilized to resolve the two-armed bandit problem 9 . Moreover, Chauvet et al. theoretically and experimentally demonstrated that polarization-entangled photon pairs provide conflict-free and equality-assured decisions in two-player, two-armed bandit problems 10 . Entangled photon states that allow more than three players while guaranteeing optimal outcome and equal repartition have also been demonstrated 11 .

Results
Scalable decision maker with OAM. System architecture for solving the 1-player K-armed bandit problem. We first describe the problem under study, which is a stochastic multi-armed bandit problem with rewards following Bernoulli distributions, defined as follows. There are K available slot machines (or arms): when the player selects arm i, the player wins with probability P_i (and receives a fixed reward of 1) or loses with probability 1 − P_i (and receives a reward of 0), with i an integer ranging from 1 to K. Let a player choose one arm at each time step over a total of T plays; the goal of the bandit problem is then to find the strategy for choosing arms that maximizes the accumulated reward. If the slot machine with the highest winning probability were known, the best strategy would be to draw that specific arm for all T plays, but the player initially has no information about the arms. Therefore, exploration actions are required to identify the best arm, whereas too much exploration potentially forfeits rewards that could have been collected from the best machine.
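The setting above can be sketched in a few lines of code. This is a minimal illustration, not the paper's simulation; the reward probabilities, class name, and horizon below are illustrative assumptions.

```python
import random

class BernoulliBandit:
    """K-armed bandit: pulling arm i pays 1 with probability p[i], else 0."""

    def __init__(self, p, seed=0):
        self.p = list(p)                 # unknown to the player in the actual problem
        self.rng = random.Random(seed)   # seeded for reproducibility

    def pull(self, i):
        """Play arm i (0-indexed) once and return the stochastic reward."""
        return 1 if self.rng.random() < self.p[i] else 0

bandit = BernoulliBandit([0.9, 0.5, 0.1])
# An oracle player who knew the arms would pull the best arm for all T plays:
total = sum(bandit.pull(0) for _ in range(1000))  # accumulated reward, close to 0.9 * 1000
```

A real player must estimate `p` from the observed rewards, which is exactly the exploration-exploitation dilemma described above.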
In the previous work on a single-photon decision maker using polarization 9 , two orthogonal linear polarizations of photons are associated with two slot machines; that is, horizontal and vertical polarization correspond to slot machines 1 and 2, respectively. The exploration is physically realized by the probabilistic attribute of photon measurement, whose outcome depends on the polarization direction of linearly polarized single photons. Therein, the polarization degree of freedom physically and directly specifies the probabilistic selection of slot machines. However, as mentioned in the "Introduction" section, the number of arms is limited to two, although it is extendable in a single-player setup to powers of two via a tournament-based approach 12 .
The fundamental idea of the present study is to associate the dimensions of OAM with the selection of multiple arms, whatever the number of arms. Allen et al. pointed out that a Laguerre-Gaussian (LG) beam has an angular momentum independent of polarization; they called it OAM to distinguish it from the polarization-dependent spin angular momentum 25 . The spatial mode of an LG beam can be expressed in the near-axis approximation as u_{m,l}(ρ, θ, z) = f_m(ρ, z) e^{ilθ} e^{ikz}, where ρ is the distance from the optical axis, θ is the azimuthal angle around the optical axis, z is the coordinate along the propagation direction, f_m is the complex amplitude distribution, and k is the wavenumber. m and l are integers that respectively describe the order of the Laguerre polynomial for the radial distribution and the azimuthal rotation number. In our study, m is fixed at 0, while l takes any integer value. Correspondingly, |l⟩ is the state in which there is one photon in the l mode, whose angular momentum is equal to lℏ, where ℏ is Planck's constant divided by 2π. Since modes with different l are orthogonal to each other, the quantum state can be expressed as a linear superposition using these modes as a basis. Figure 1a schematically illustrates examples of beams with different l-valued OAM, where l is an integer from −3 to 3. Non-zero l beams exhibit spiral isophase spatial distributions. Figure 2 shows a schematic diagram of the proposed system architecture for solving the MAB problem using OAM. Here we illustrate the case where the number of arms is three, but the same principle applies to a larger number of arms.
Conventional laser sources generate beams that do not carry orbital angular momentum. Technologically, methods to generate light with OAM from a plane wave or a Gaussian beam include the use of phase plates 26 , computer-generated holograms (CGH) 27 , and mode converters 28,29 . Spatial light modulators (SLMs) are widely utilized for this purpose, as they enable direct and tunable amplitude and/or phase modulation of an incoming light beam 30 . The simplest and most widely used method is a CGH-based approach implemented with an SLM and a 4f optical setup 15 . In Fig. 2, a photon with a Gaussian spatial profile emitted from a laser is sent to a phase SLM displaying a CGH pattern to generate OAM states, each carrying a phase factor e^{ilθ} that depends on the azimuthal angle θ and the OAM number l. l could be any integer, but when all generated l are positive, the output photon is described by the state |ψ⟩ = (1/√K)(e^{iφ_1}|+1⟩ + e^{iφ_2}|+2⟩ + · · · + e^{iφ_K}|+K⟩), where φ_1, φ_2, . . . , φ_K denote the phase changes associated with each OAM, with l values of +1, +2, . . . , and +K, respectively, and |l⟩ denotes the photon state with OAM value l. That is to say, a single photon emitted from the source system contains K OAM states with equal probability amplitudes. Meanwhile, a mirror flips the twisted structure of any given OAM; that is, the action of a beam splitter (BS) on the propagating light is represented by |l⟩ → (1/√2)(|l⟩ + iR|l⟩), where R represents flipping of the OAM state, for example, R|+1⟩ = |−1⟩. In the case K = 3, we generate a photon state that carries l = +1, +2, +3 equally by setting φ_1 = φ_2 = φ_3 = 0. That is, the output after the SLM is given by (1/√3)(|+1⟩ + |+2⟩ + |+3⟩). This photon is then transferred to an array of BSs and a single-photon detection system to examine which l-valued OAM is detected. Among a variety of methods for measuring the OAM of light 31 , the system architecture shown in Fig. 2 uses a hologram (HG) followed by a zeroth-order extraction system 32 . In a practical implementation, the zeroth-order extraction system could be free-space optics with spatial filtering or a single-mode optical fibre.
This hologram adds a phase factor e^{i l_HG θ} to the state |l⟩ with OAM l, which results in the transformation |l⟩ → |l + l_HG⟩. After injection into a zeroth-order extraction system, only an l = 0 photon propagates through it. In other words, the zeroth-order extraction system acts as a filter that extracts the l = 0 component only. If the hologram induces a shift of OAM by l_HG and a photon is detected by the subsequent photodetector, the OAM of the incoming photon is identified as l = −l_HG. Based on this principle, in the system shown in Fig. 2, three holograms HG1, HG2, and HG3 are arranged, which transform |l⟩ into |l − 1⟩, |l − 2⟩, and |l − 3⟩, respectively.
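The hologram-and-filter measurement logic can be sketched as follows. This is a classical bookkeeping model of which detector fires, not a simulation of the optics; the function name is our own, and the shift values follow Fig. 2.

```python
def detecting_photodetector(l_in, hologram_shifts):
    """Return the 1-indexed detector that fires for an incoming photon with OAM
    l_in. Each branch's hologram shifts the OAM by l_HG, and the zeroth-order
    extraction system passes only the l = 0 component, so branch j's detector
    can fire only when l_in + l_HG == 0."""
    for detector, l_hg in enumerate(hologram_shifts, start=1):
        if l_in + l_hg == 0:
            return detector
    return None  # photon filtered out in every branch

# HG1, HG2, HG3 shift the OAM by -1, -2, -3, respectively (Fig. 2),
# so PD1, PD2, PD3 identify incoming OAM +1, +2, +3 (slot machines 1-3).
shifts = [-1, -2, -3]
```

For example, a photon arriving with l = +2 passes the l = 0 filter only in the branch whose hologram shifts by −2, so PD2 fires and slot machine 2 is played.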
One remark here is that, although multiple BSs and holograms are employed in Fig. 2, more compact realization is indeed possible by, for example, a geometric optical transformation technique 33 , which has been extended to more than 50 OAM states 34 . The reason behind the introduction of the measurement architecture shown in Fig. 2 regards the following procedure related to photon detections.
The output light is subjected to attenuators (ATT1, ATT2, ATT3) to control detection probabilities and a zeroth-order extraction system, followed by photodetectors (PD1, PD2, PD3). Based on the filtering by the zeroth-order extraction system, photon detection by PD1, PD2, and PD3 means observing OAM values of 1, 2, and 3, respectively. Photon detection by PD1 immediately means playing slot machine 1. Similarly, PD2 and PD3 are associated with the decision of playing slot machines 2 and 3, respectively. It should be emphasized that in this configuration, a machine is only selected if a photon is detected.
Initially, since the probabilities of the detected photons to be measured by PD1, PD2, and PD3 are all equal to 1/3, all machines are explored equally. Depending on the obtained results, the attenuation levels by ATT1, ATT2, ATT3 are updated.
After a single photon is detected by any photodetector, the selection yields an eventual reward from the slot machine, and the result is registered in the history H(t). Referring to the history H(t), the next decision is determined by following a certain policy of the player. The softmax policy is one of the best-known feedback algorithms for this decision, and it is also considered to accurately emulate human decision making 35,36 . In the softmax policy, the player selects each machine based on maximum likelihood estimates of the reward probabilities P̂_1(t), P̂_2(t), . . . , P̂_K(t), and the probability of selecting machine i is given by s_i(t) = e^{β P̂_i(t)} / Σ_{k=1}^{K} e^{β P̂_k(t)}, where β, also known as the inverse temperature by analogy with statistical mechanics, is a parameter that balances exploration and exploitation. While the optimal β depends on the reward probabilities and methods for tuning β have been proposed 37 , this paper, for simplicity, sets it to a constant value β = 20 based on moderate tuning. The amplitude transmittances of the attenuators (ATT1, ATT2, ATT3) are denoted by d_1, d_2, d_3, which are initially all one. These values are updated after every trial based on d_i(t) = √( s_i(t) / max_k s_k(t) ). (5) In this way, d_i(t) is revised as time elapses so that the photon detection event becomes most likely at the photodetector corresponding to the best slot machine, i.e., the machine with the highest reward probability. For example, if slot machine 1 has the highest reward probability, the transmittance of ATT1 should become higher while those of ATT2 and ATT3 should become smaller.
Here is a remark about the denominator on the right-hand side of Eq. (5). The probability of detecting state i is proportional to d_i(t)². Dividing each s_i(t) by the common value max_k s_k(t) does not introduce any unintended bias into the detection probabilities, but it keeps the transmission efficiency of the attenuators high. That is, the loss of photons in the attenuators is minimized.
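The softmax selection probabilities and the attenuator update described above can be sketched as follows, with β = 20 as in the text; the reward-probability estimates fed in below are illustrative placeholders.

```python
import math

def softmax(p_hat, beta=20.0):
    """Selection probabilities s_i = exp(beta * p_hat_i) / sum_k exp(beta * p_hat_k)."""
    w = [math.exp(beta * p) for p in p_hat]
    z = sum(w)
    return [wi / z for wi in w]

def amplitude_transmittances(s):
    """Attenuator amplitudes d_i = sqrt(s_i / max_k s_k): detection probabilities
    stay proportional to s_i (no bias), while the attenuator of the most likely
    arm is fully open, minimizing photon loss."""
    s_max = max(s)
    return [math.sqrt(si / s_max) for si in s]

s = softmax([0.9, 0.5, 0.1])      # illustrative estimates P̂_1, P̂_2, P̂_3
d = amplitude_transmittances(s)   # d[0] == 1.0: best arm's attenuator fully open
```

Note that with β = 20 the policy is already strongly exploitative: the estimated-best arm dominates the selection probabilities once the estimates separate.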
Finally, we make one more important remark regarding the architecture for solving the single-player, multi-armed bandit problem shown in Fig. 2. The principle maximizes the detection probability of the OAM state corresponding to the best machine. Instead of reconfiguring the attenuators, we could accomplish the same functionality by reconfiguring the phase pattern displayed on the SLM at the light source; indeed, this alternative directly and dynamically exploits the high-dimensional property of OAM 38 . This architecture, however, requires a complex arbitration mechanism when the principle is extended to the two-player situations considered below. That is, control of the light source by a single player is feasible, but source management by two players is non-trivial. In contrast, player-specific attenuator control does not require any global server. For these reasons, we adopt the fundamental architecture shown in Fig. 2.
Simulation results for the 1-player 3-armed bandit problem. Figure 3 summarizes simulation results for the 1-player 3-armed bandit problem with the OAM system following the softmax policy. The solid, dashed, and dash-dotted curves in Fig. 3a show the time evolution of the selection probabilities of machines 1, 2, and 3, respectively, for a given reward probability configuration [P_1, P_2, P_3]. Here the number of repetitions is 1000. We can clearly observe that the probability of selecting the machine with the maximum reward probability, here machine 1, monotonically increases. Figure 3b examines the correct decision rate (CDR), defined as the fraction of selections of the highest-reward-probability machine over 1000 trials, for differently configured reward environments. The blue, red, and yellow curves show the time evolution of the CDR when the reward environment [P_1, P_2, P_3] is given by [0.9, 0.7, 0.1], [0.9, 0.5, 0.1], and [0.9, 0.3, 0.1], respectively. Here the maximum and minimum reward probabilities are the same across configurations. As the difference between the maximum and second-maximum reward probabilities becomes smaller, the increase of the CDR toward unity becomes slower. Nevertheless, we can observe the monotonic increase of the probability of selecting the best machine in Fig. 3a,b. Since there is no theoretical limitation on the number of OAM states, the system configuration herein can be used for probabilistic selection among a large number of choices. Note that the softmax policy itself is also scalable.

Solving the 2-player 3-armed bandit problem with OAM and quantum interference. System architecture for solving the 2-player 3-armed competitive bandit problem with OAM and quantum interference. This section discusses stochastic selection of arms in the CMAB problem using OAM quantum states of photon pairs. The system presented in Fig. 2 is extended to the case of two players (Players A and B) by the architecture shown in Fig. 4. This time, the assumption is that a selection happens only when exactly one photon is detected simultaneously by each player on their photodetectors.
In the source part, a photon pair is created by a nonlinear crystal such as periodically poled KTP (PPKTP) and then subjected to an interferometer. One photon of the pair is supplied to the Detection A system, and the other goes to the Detection B system. The internal structure of the detection systems is the same as that of the one-player system depicted in Fig. 2. Thanks to quantum interference, even though there is no explicit communication between the players, the detection results of the two photons are correlated with each other, as discussed in detail later.
In quantum research using light, it has been common to use quantum states based on properties such as polarization, spatial mode, and phase, but since the discovery of the orbital angular momentum of light, many studies on quantum states using OAM have been reported 39 . The availability of orbital angular momentum, with its infinite number of states, is very important in quantum research. In 2001, Mair et al. used parametric down conversion (PDC) to study the generation of photon pairs in states with entangled orbital angular momentum 39 . Subsequently, a theoretical study of the change in orbital angular momentum during the PDC process was performed 40 , and photon pairs with three entangled orbital angular momentum states were also studied 41 . In the present study, we utilize quantum interference given by an extension of the Hong-Ou-Mandel effect 22 .
Generation of OAM photon pairs with quantum interference. The Hong-Ou-Mandel effect, whereby two identical photons entering a 1:1 beam splitter are always detected together in the same output path, has been well studied 22 . We extend the description of this phenomenon to input photons carrying multiple OAM states. When a photon in OAM state |ψ⟩ is sent to the beam splitter, the transmitted term (side A) and the reflected term (side B) can be described as |ψ⟩ → (1/√2)(|ψ⟩_A + iR|ψ⟩_B), where R represents flipping of the OAM state, for example, R|+1⟩ = |−1⟩. As shown in Fig. 5a, when the input photons on the two BS inputs are in states |ψ⟩ and |φ⟩, the output state |ψ′⟩ ⊗ |φ′⟩ can be described as |ψ′⟩ ⊗ |φ′⟩ = (1/2)(|ψ⟩_A + iR|ψ⟩_B) ⊗ (iR|φ⟩_A + |φ⟩_B). With K being the number of OAM states used in the system, the input states |ψ⟩, |φ⟩ can be set to |ψ⟩ = (1/√K) Σ_{k=1}^{K} e^{iφ_k}|+k⟩ and |φ⟩ = (1/√K) Σ_{k=1}^{K} e^{iψ_k}|−k⟩, considering that the two photons have the same polarization and wavelength and are synchronized on the beam splitter. Expanding the output state term by term, the probability of detecting the same state on the same side, that is, |+k⟩_A ⊗ |+k⟩_A or |−k⟩_B ⊗ |−k⟩_B, is given by 1/(2K²). By introducing the parameters θ_k = (φ_k − ψ_k)/2, which depend on the phase difference of the two input states, the probability of detecting different states on the same side, that is, |+k_1⟩_A ⊗ |+k_2⟩_A or |−k_1⟩_B ⊗ |−k_2⟩_B with k_1 ≠ k_2, is given by cos²(θ_{k_1} − θ_{k_2})/K², and finally the probability of detecting a pair of states on different sides, that is, |+k_1⟩_A ⊗ |−k_2⟩_B, is given by sin²(θ_{k_1} − θ_{k_2})/K². Figure 5b summarizes the probability of detecting each output state, while Fig. 6 shows all the probabilities for K ranging from 1 to 4. The probabilities depend only on θ_k, which can be tuned by controlling the SLM phases φ_k and ψ_k.
A pair of photons being detected on both sides is indicated by the red frames in Fig. 6; these events are utilized as selections by the two players. What is remarkable is that the probability of detecting the same state on different sides is always zero, because the corresponding probability term sin²(θ_k − θ_k) is always equal to zero. For K = 1, this phenomenon corresponds to the well-known Hong-Ou-Mandel effect. As the detected OAM states correspond to the selections of the players, the probability of both players selecting the same machine is limited only by experimental imperfections such as multiple-pair generation, meaning that conflict-free decisions are accomplished.
The probabilities in the red frames include the probabilities of the two players detecting different states. Remarkably, these probabilities can be made equal when K is less than or equal to three. For example, when K = 2, by assigning θ_1 = 0 and θ_2 = π/2, all such probabilities become 1/4. Similarly, when K = 3, by setting (θ_1, θ_2, θ_3) = (0, π/3, 2π/3), the probabilities are all 1/12. Namely, all arm combinations except selecting the same arm occur equally. Note, however, that when K is greater than or equal to four, these probabilities cannot be perfectly equalized by tuning θ_1, θ_2, . . . , θ_K alone. This point is discussed in the "Discussion" section. In this study, we focus on the case K = 3 because the equal selection of pairs is ensured, as discussed above.
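These joint detection probabilities can be checked numerically. The sketch below assumes the cross-side expression P(+k1 at A, −k2 at B) = sin²(θ_{k1} − θ_{k2})/K², which reproduces the stated values: zero probability for conflicts, 1/4 per usable pair for K = 2, and 1/12 for K = 3.

```python
import math

def cross_side_probs(thetas):
    """Joint probability that player A detects arm k1 and player B detects arm k2
    (both 1-indexed) for the given phase parameters theta_k; same-arm entries
    vanish because sin(theta_k - theta_k) = 0."""
    K = len(thetas)
    return {(k1 + 1, k2 + 1): math.sin(thetas[k1] - thetas[k2]) ** 2 / K ** 2
            for k1 in range(K) for k2 in range(K)}

probs = cross_side_probs([0.0, math.pi / 3, 2 * math.pi / 3])  # K = 3 setting
conflict = sum(p for (a, b), p in probs.items() if a == b)     # exactly 0
usable = sum(probs.values())                                   # 1/2: the post-selected half
```

The total of 1/2 matches the later remark that half of all photon pairs (those exiting on the same side) are discarded by post-selection.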
Simulation results for the 2-player 3-armed bandit problem. In the CMAB problem in the present study, the rewards are equally split among the players who selected the same machine; that is, a decision conflict between multiple players reduces the individual benefit. Furthermore, the total reward is reduced because of the conflicting choice. Here we begin with a brief overview of two-player decision-making situations in a game-theoretic formalism 42 , while mentioning its intuitive implications. We denote by P_{k*}, P_{k**}, P_{k***} the first-, second-, and third-highest reward probabilities, respectively. First, when P_{k*} > 2 × P_{k**}, the situation in which both players select the best machine (machine k*) is the only Nash equilibrium. That is, conflict is unavoidable if both players act greedily, because the best machine is far better than the other machines.
Second, when P_{k*} < 2 × P_{k**}, a Nash equilibrium is achieved when one player chooses the best machine (machine k*) and the other selects the second-best machine (machine k**), and vice versa. That is, conflicting decisions are avoided because changing either player's decision would decrease his/her reward. However, there is a problem from the viewpoint of equality, as one of the players can keep selecting the higher-reward machine while the other is locked into the lower-reward choice.
Third, there exists another, symmetric Nash equilibrium with a mixed strategy, meaning that both players select each machine with a certain probability. The details are described in the "Methods" section. Intuitively speaking, under this mixed strategy both players sometimes intentionally refrain from choosing the best machine, so decision conflicts can sometimes be avoided. Indeed, Lai et al. successfully utilized a mixed strategy for dynamic channel selection in communication systems 7 . However, it should be noted that perfect conflict avoidance cannot be ensured by mixed strategies.
In order to quantitatively evaluate the performance differences among policies, we compare the quantum interference system with the following two policies. One is a greedy policy, in which both players take greedy actions as if they were playing alone. The second is an equilibrium policy, in which both players try to achieve the symmetric Nash equilibrium via a mixed strategy. The details are described in the "Methods" section. Figure 7 shows the results for solving the 2-player 3-armed bandit problem. Figure 7a shows how the selection probabilities of both players evolve under each policy. With the greedy policy, recalling that machine 1 has the highest reward probability of 0.9, its selection probability approaches almost 1 for both players, as in the single-player case. For the equilibrium policy, the selection probabilities of the two most rewarding machines, 1 and 2, converge to the probabilities defined by the mixed strategy. With the quantum interference strategy, by contrast, machines 1 and 2 are selected with equal probability by both players. Figure 7b shows the ratio of each selection combination of the two players. The greedy policy is associated with a large number of conflicts, as both players almost exclusively select machine 1, while the equilibrium policy reduces the number of conflicts to some extent as the selections are distributed. Finally, the quantum interference policy completely avoids conflicts. The final rewards under these selections are shown in Fig. 7c, for each player and for the total reward. We observe that the quantum interference policy achieves an almost ideal total reward as well as equality between the players. By contrast, the total rewards of the greedy and equilibrium policies are smaller than that of the quantum interference policy because they suffer from unavoidable decision conflicts. Figure 7d shows how the final reward of each policy varies when the reward probabilities of the three machines are modified.
Under the greedy and equilibrium policies, the total reward varies with the rate at which the lowest-rewarding machine 3 is selected in the exploration phase. With the quantum interference policy, on the other hand, the larger the difference between the reward probabilities of machines 2 and 3, the easier it is to determine the top two machines, and the higher the final total reward; even so, this variation in total reward is mild compared with the gap separating the other two policies.
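The conflict-avoidance mechanism can be rerun in miniature. This is not the paper's full simulation (no learning, no attenuator feedback); it merely draws joint selections from the post-selected cross-side probabilities for the K = 3 phase setting, assuming the sin² form of those probabilities.

```python
import math
import random

def sample_pair(thetas, rng):
    """Draw a joint selection (arm_A, arm_B), 1-indexed, with probability
    proportional to sin^2(theta_k1 - theta_k2); same-arm pairs have weight 0."""
    K = len(thetas)
    pairs = [(a + 1, b + 1) for a in range(K) for b in range(K)]
    weights = [math.sin(thetas[a - 1] - thetas[b - 1]) ** 2 for a, b in pairs]
    return rng.choices(pairs, weights=weights)[0]

rng = random.Random(7)
thetas = [0.0, math.pi / 3, 2 * math.pi / 3]
draws = [sample_pair(thetas, rng) for _ in range(2000)]
no_conflict = all(a != b for a, b in draws)  # True: conflicts carry zero weight
```

Every draw assigns the two players different machines, so splitting of rewards never occurs; the six non-conflicting pairs appear with equal frequency, which is the source of the equality between players seen in Fig. 7c.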

Discussion
In this study, we show that we can benefit from the high dimensionality of OAM for scalability in solving multi-armed bandit problems. Furthermore, appropriate quantum interference constructions achieve high rewards while maintaining a fair repartition between two players in competitive bandit problems. The total reward optimization is guaranteed by the two players selecting the two best machines in a non-conflicting manner, while the fair repartition is guaranteed by the equal selection probabilities among players arising from quantum interference.
The main assumption is the simultaneous detection of exactly one photon per player. In the proposed optical design, this serves the extended Hong-Ou-Mandel effect, or quantum interference, which guarantees that identical photons exit on the same side of the beam splitter, at the price of post-selecting half of all photon pairs. While this is a strong constraint for potential applications, this design is only an example, and nothing forbids obtaining the target state with other designs that do not rely on post-selection.
Regarding the extension to more arms, the current design is limited to three arms due to a fundamental constraint: there are not enough degrees of freedom to constrain the two-photon state. This may be solved by allowing the relative amplitudes of the OAM states to be tuned with the SLM and/or by additional mechanisms. Once again, the goal of the setup presented in this study is only to demonstrate the principle of utilizing OAM for multiple arms in MAB problems and quantum interference for competitive decision-making. We believe that the extension to many arms is a technological problem without theoretical constraints 34 .
The next discussion point concerns security. The two-player CMAB system herein lets the players directly influence the detection probabilities via the attenuation amplitudes in front of their detectors. While this architecture ensures that machine selection and attenuation updates are independent among the players, it presents one fundamental weakness: if a player only wants to select the highest-rewarding arm, that player can maximize the attenuation of the lower arms, letting photons reach only the corresponding detector. However, this situation is easily identifiable by the other player, who can recognize that the probability of selecting a particular machine decreases. The solution for that player is straightforward: attenuate the second-best arm more as well to correct the imbalance (in the case of slight inequality), which amounts to not playing anymore if the other player completely blocks the other photons.
This brings us to the photon utilization efficiency. In this study, only the simultaneous detection of exactly one photon at the detectors of each and every player triggers the selection of arms by both players. The reason is to post-select the output states in which one photon goes to each player, rather than two photons to a single player. With the current operating principle based on quantum interference, summarized in Fig. 6, half of all photon pairs are strictly unusable by the players. Although this loss is unavoidable, further photon losses are induced in the system architecture shown in Fig. 4 because of the multiple BSs. As discussed earlier, this part can be improved by technological methods developed in the literature 33,34 .

Conclusion
To overcome the scalability limitations of former single-photon-based decision making, which relies on two orthogonal polarizations to resolve the two-armed bandit problem, we associate orbital angular momentum states of photons with individual arms, which theoretically allows ideal scalability. When multiple players are involved, conflict of decisions becomes a serious issue, known as the competitive multi-armed bandit problem. Previously, polarization-entangled photons were shown to realize conflict-free decision making in two-player, two-armed situations; however, their arm-scalability is limited to two. In this study, by extending the Hong-Ou-Mandel effect to more than two states, we theoretically establish an experimental configuration able to generate quantum interference among states with orbital angular momentum, and conditions that provide conflict-free selections. We numerically examine total rewards in two-player, three-armed bandit problems, for which the proposed principle accomplishes almost the theoretical maximum, exceeding that of a conventional mixed strategy intended to realize the Nash equilibrium. This study paves the way toward photon-based intelligent systems and extends the utility of the high dimensionality of the orbital angular momentum of photons and of quantum interference in artificial intelligence domains.
Methods (estimation of reward probabilities). If there exists a non-selected machine, all the machines are selected randomly with the same probability. Otherwise, P̂_k(t) = w_k(t) / (w_k(t) + l_k(t)), where the player has been rewarded w_k(t) times and not rewarded l_k(t) times by machine k. As a moderate tuning, β is set to 20 in this study.