Abstract
In recent cross-disciplinary studies involving both optics and computing, single-photon-based decision-making has been demonstrated by utilizing the wave-particle duality of light to solve multi-armed bandit problems. Furthermore, entangled-photon-based decision-making has managed to solve a competitive multi-armed bandit problem in such a way that conflicts of decisions among players are avoided while ensuring equality. However, as these studies are based on the polarization of light, the number of available choices is limited to two, corresponding to two orthogonal polarization states. Here we propose a scalable principle to solve competitive decision-making situations by using the orbital angular momentum of photons, whose high dimensionality theoretically allows an unlimited number of arms. Moreover, by extending the Hong-Ou-Mandel effect to more than two states, we theoretically establish an experimental configuration able to generate multi-photon states with orbital angular momentum, together with conditions that provide conflict-free selections at every turn. We numerically examine total rewards for three-armed bandit problems, for which the proposed strategy accomplishes almost the theoretical maximum, greater than that of a conventional mixed strategy intended to realize the Nash equilibrium. This is thanks to the quantum interference effect, which achieves conflict-free selections even in the exploration phase used to find the best arms.
Introduction
Optics and photonics are expected to play crucial roles in future computing systems^{1}, prompting intensive study of a variety of devices and systems such as optical-fibre-based neuromorphic computing^{2}, on-chip optical neural networks^{3}, and optical reservoir computing^{4}, among others. While these works basically fall under supervised learning, reinforcement learning is another important branch of artificial intelligence^{5}. The Multi-Armed Bandit (MAB) problem is an example of a reinforcement learning situation; it formulates a fundamental issue of decision making in dynamically changing, uncertain environments, where the target is to find the best selection among many slot machines, also referred to as arms, whose reward probabilities are unknown^{6}. In solving MAB problems, exploration actions are necessary to find the best arm, although too much exploration may reduce the final amount of reward obtained from exploitation. Conversely, insufficient exploration may lead to missing the best arm. Furthermore, when multiple players are involved, decision conflicts become serious, as they induce congestion and inhibit socially achievable benefits^{7,8}. Equality among players is another critical issue, as an unfair repartition of outcomes may lead players to distrust the system. This whole problem is known as the competitive MAB (CMAB) problem.
In order to solve these complex issues, photonic solutions have recently been considered. For example, the wave-particle duality of single photons has been utilized for the resolution of the two-armed bandit problem^{9}. Moreover, Chauvet et al. theoretically and experimentally demonstrated that polarization-entangled photon pairs provide conflict-free and equality-assured decisions in two-player, two-armed bandit problems^{10}. Entangled photon states that allow more than three players while guaranteeing optimal outcome and equal repartition have also been demonstrated^{11}.
However, since these former principles rely on the polarization of light as the tunable degree of freedom, the number of possible selections or arms is limited to only two, although potential scalability for the single-player MAB is feasible within a tournament-based approach^{12}. Therefore, a scalable principle of decision-making has been an important and fundamental issue, especially for multi-player situations. In this paper, we introduce the use of the orbital angular momentum (OAM) of photons^{13,14} to resolve the scalability issue of photonic decision-making, following the concept summarized in Fig. 1.
Photons that carry OAM^{13} realize high-dimensional state spaces, restricted only by the precision and accuracy of the generation technique and the transmission medium^{15} (Fig. 1a); hence, one of the basic ideas of this study is to associate individual selections with different-valued OAM (Fig. 1b). Applications of OAM have progressed in diverse areas, including the manipulation of cold atoms, communications, nonlinear optics, and optical solitons. The high dimensionality of OAM is particularly attractive for quantum information processing in increasing the dimension of elementary quantum information carriers beyond the qubit^{16,17,18,19,20,21}. Likewise, in the present study, the multidimensionality of OAM plays a crucial role in extending the maximum number of arms as well as in utilizing the probabilistic attributes of single photons carrying OAM.
Furthermore, to resolve CMAB problems when the number of arms is greater than two, we extend the notion of the Hong-Ou-Mandel effect^{22} to more than two (OAM) vector states to induce quantum interference. We show that conflicting decisions between two players can be perfectly avoided by an adequate quantum interference design generating OAM two-photon states, relying on a coherent photon-pair source. In the literature, OAM has been examined from game-theoretic perspectives, such as resolving the prisoner's dilemma^{23} and the duel game^{24}. In the present study, we benefit from quantum interference for non-conflicting decision-making to maximize total rewards, which is similar to the insight gained in the quantum game literature. Additionally, in solving CMAB problems with many arms, exploration actions are necessary. We numerically examine total rewards for three-armed bandit problems, where the proposed quantum-interference-based strategy accomplishes nearly the theoretical maximum total reward. We confirm that the proposed strategy clearly outperforms conventional ones, including the mixed strategy intended to realize the Nash equilibrium^{7}.
Moreover, equality among players is important in CMAB problems. We demonstrate that equality is perfectly ensured by appropriate quantum interference constructions when the number of arms is three. At the same time, however, we also show that it is unfortunately impossible to accomplish perfect equality in the proposed scheme and with the current hypotheses when the number of arms is equal to or larger than four. Note also that perfect collision avoidance is ensured for any number of arms.
These properties are made possible thanks to the high dimensionality of OAM for scalability and to the quantum interference effect for conflict-free selections, even in the exploration phase used to find the best arms.
Results
Scalable decision maker with OAM
System architecture for solving the 1-player K-armed bandit problem
We first describe the problem under study, which is a stochastic multi-armed bandit problem with rewards following Bernoulli distributions, defined as follows. There are K available slot machines (or arms): when the player selects arm i, the player wins with probability \(P_i\) (and receives a fixed reward of 1) or loses with probability \(1-P_{i}\) (and receives a reward of 0), with i an integer ranging from 1 to K. The player chooses one arm per turn, for a total of T turns; the goal of the bandit problem is to find which strategy should be followed in choosing arms so that the resulting accumulated reward is maximized. If the slot machine with the highest winning probability were known, the best strategy would be to draw that specific arm all T times, but the player initially has no information about the arms. Therefore, exploration actions are required to identify the best arm, whereas too much exploration potentially leads to missing a higher total amount of reward from the best machine.
In the previous work on a single-photon decision maker using polarization^{9}, two orthogonal linear polarizations of photons are associated with two slot machines; that is, horizontal and vertical polarizations correspond to slot machines 1 and 2, respectively. The exploration is physically realized by the probabilistic attribute of photon measurement, whose outcome depends on the direction of polarization of linearly polarized single photons. Therein, the polarization degree of freedom physically and directly specifies the probabilistic selection of slot machines. However, as mentioned in the “Introduction” section, the number of arms is limited to only two, although extendable in a single-player setup to powers of two via a tournament-based approach^{12}.
The fundamental idea of the present study is to associate the dimensions of OAM with the selection of multiple arms, whatever the number of arms. Allen et al. pointed out that a Laguerre-Gaussian (LG) beam has an angular momentum independent of polarization; they called it OAM to distinguish it from the polarization-dependent spin angular momentum^{25}. The spatial mode of an LG beam can be expressed using the near-axis approximation
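A form consistent with the variable definitions that follow is (the sign conventions of the exponents vary between references):

```latex
u_{m,l}(\rho ,\theta ,z) = f_{m}(\rho ,z)\,\exp (il\theta )\,\exp (ikz)
```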
where \(\rho \) is the distance from the optical axis, \(\theta \) is the azimuthal angle around the optical axis, z is the coordinate along the propagation direction, \(f_m\) is the complex amplitude distribution, and k is the wavenumber. m and l are integers that respectively describe the order of the Laguerre polynomial for the radial distribution and the azimuthal rotation number. In our study, m is fixed at 0, while l takes any integer value. Correspondingly, \(|l\rangle \) is the state in which there is one photon in the l mode, whose angular momentum is equal to \(l \hbar \), where \(\hbar \) is Planck’s constant divided by \(2\pi \). Since modes with different l are orthogonal to each other, the quantum state can be expressed as a linear superposition using these modes as a basis. Figure 1a schematically illustrates examples of beams with different l-valued OAM, where l is an integer from \(-3\) to 3. Beams with nonzero l exhibit spiral isophase spatial distributions. Figure 2 shows a schematic diagram of the proposed system architecture for solving the MAB problem using OAM. Here we illustrate the case where the number of arms is three, but the same principle applies when extending to a larger number of arms.
Conventional laser sources generate beams that do not carry orbital angular momentum. Technologically, methods to generate light with OAM from a plane wave or a Gaussian beam include the use of phase plates^{26}, computer-generated holograms (CGH)^{27}, or mode converters^{28,29}. Spatial light modulators (SLMs) are widely utilized for this purpose, as they enable direct and tunable amplitude and/or phase modulation of an incoming light beam^{30}. The simplest and most widely used method is a CGH-based approach implemented with an SLM and a 4f optical setup^{15}. In Fig. 2, a photon with a Gaussian spatial profile emitted from a laser is sent to a phase SLM displaying a CGH pattern to generate OAM states, each carrying a phase factor \(e^{i l\theta }\) that depends on the azimuthal angle \(\theta \) and the OAM number l. l could be any integer, but when all generated l are expected to be positive, the output photon is described by the state:
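Consistent with the special case quoted below for \(K=3\) (equal amplitudes and zero phases), the emitted state takes the form:

```latex
|s\rangle = \frac{1}{\sqrt{K}}\sum _{k=1}^{K} e^{i\phi _k}\,|{+k}\rangle
```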
where \(\phi _1, \phi _2, \ldots , \phi _K\) denote the phase shifts associated with each OAM component with l values of +1, +2, \(\ldots \), +K, respectively, and \(|l\rangle \) denotes the photon state with OAM value l. That is to say, a single photon emitted from the source system contains K OAM states with equal probability amplitudes.
Meanwhile, a mirror flips the twisted structure of any given OAM; accordingly, the function of a beam splitter (BS) in the light propagation is represented by
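With the convention that reflection flips the OAM state and contributes a factor of \(i\) (an assumed convention of this sketch), the BS action on a state \(|\Phi \rangle \) reads:

```latex
|\Phi \rangle \;\longrightarrow \; \frac{1}{\sqrt{2}}\Big(|\Phi \rangle _{\mathrm{transmitted}} + i\,R\,|\Phi \rangle _{\mathrm{reflected}}\Big)
```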
where R represents the flipping of the OAM state; for example, \(R|{+1}\rangle = |{-1}\rangle \). In the case \(K=3\), we generate a photon state that carries \(l=+1, +2, +3\) with equal weights by setting \(\phi _1=\phi _2=\phi _3=0\). That is, the output after the SLM is given by \((1/\sqrt{3}) \times (|{+1}\rangle + |{+2}\rangle + |{+3}\rangle )\).
This photon is then transferred to an array of BSs and a single-photon detection system to examine which l-valued OAM is detected. Among the variety of methods for measuring the OAM of light^{31}, the system architecture shown in Fig. 2 illustrates a method utilizing a hologram (HG) followed by a zeroth-order extraction system^{32}. In a practical implementation, the zeroth-order extraction system could be free-space optics with spatial filtering or a single-mode optical fibre.
This hologram adds a phase factor of \(e^{i l_{HG} \theta }\) to the state \(|l\rangle \) with OAM l, which results in the transformation \(|l\rangle \rightarrow |{l+l_{HG}}\rangle \). After injection into the zeroth-order extraction system, only an \(l=0\) photon propagates in it. In other words, the zeroth-order extraction system acts as a filter that extracts the \(l=0\) component only. Hence, if the hologram induces a shift of OAM by \(l_{HG}\) and a photon is detected by the subsequent photodetector, the OAM of the incoming photon is identified as \(l = -l_{HG}\). Based on this principle, in the system shown in Fig. 2, three holograms HG1, HG2, and HG3 are arranged, which transform \(|l\rangle \) into \(|{l-1}\rangle \), \(|{l-2}\rangle \), and \(|{l-3}\rangle \), respectively.
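The branch logic above can be sketched as follows; `detect_arm` and the `hologram_shifts` parameter are illustrative names of ours, not part of the original setup:

```python
def detect_arm(l_photon, hologram_shifts=(-1, -2, -3)):
    """Identify which photodetector clicks for a photon with OAM l_photon.

    Each branch's hologram shifts the OAM by l_HG; the zeroth-order
    extraction system then passes only the l == 0 component, so the
    branch with shift l_HG detects photons with l == -l_HG.
    """
    for arm, l_hg in enumerate(hologram_shifts, start=1):
        if l_photon + l_hg == 0:
            return arm  # PD<arm> clicks: play slot machine <arm>
    return None  # photon filtered out, no selection this round
```

With the shifts of HG1, HG2, and HG3, a photon with OAM \(+2\) is routed to PD2 and thus selects slot machine 2.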
One remark here is that, although multiple BSs and holograms are employed in Fig. 2, a more compact realization is indeed possible by, for example, a geometric optical transformation technique^{33}, which has been extended to more than 50 OAM states^{34}. The reason for introducing the measurement architecture shown in Fig. 2 is the following procedure related to photon detections.
The output light is subjected to attenuators (ATT1, ATT2, ATT3), which control the detection probabilities, and to a zeroth-order extraction system, followed by photodetectors (PD1, PD2, PD3). Based on the filtering by the zeroth-order extraction system, photon detection by PD1, PD2, and PD3 means observing OAM values of 1, 2, and 3, respectively. Photon detection by PD1 immediately means playing slot machine 1. Similarly, PD2 and PD3 are associated with the decisions of playing slot machines 2 and 3, respectively. It should be emphasized that in this configuration, a machine is selected only if a photon is detected.
Initially, since the probabilities of the detected photons to be measured by PD1, PD2, and PD3 are all equal to 1/3, all machines are explored equally. Depending on the obtained results, the attenuation levels by ATT1, ATT2, ATT3 are updated.
After a single photon is detected by any photodetector, the selection yields an eventual reward from the slot machine, and the result is registered in the history H(t). Referring to the history H(t), the next decision is determined by following a certain policy of the player. The softmax policy is one of the most well-known feedback algorithms for this decision and is also considered to accurately emulate human decision making^{35,36}. In the softmax policy, the player selects each machine based on maximum likelihood estimates of the reward probabilities \({\hat{P}}_1(t),{\hat{P}}_2(t),\ldots ,{\hat{P}}_K(t)\), and the probability of selecting machine i is given by the following equation:
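In its standard form, with the estimates \({\hat{P}}_i(t)\) and the inverse temperature \(\beta \) defined below, the softmax selection probability is:

```latex
s_i(t) = \frac{\exp \big(\beta {\hat{P}}_i(t)\big)}{\sum _{k=1}^{K}\exp \big(\beta {\hat{P}}_k(t)\big)}
```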
where \(\beta \), also known as the inverse temperature by analogy with statistical mechanics, is a parameter that controls the balance between exploration and exploitation. While the optimal \(\beta \) depends on the reward probabilities, and methods for tuning \(\beta \) have been proposed^{37}, this paper, for simplicity, sets it to a constant value \(\beta = 20\) based on moderate tuning. The amplitude transmittances of the attenuators (ATT1, ATT2, ATT3) are denoted by \(d_1, d_2, d_3\), which are initially all one. These values are updated after every trial based on:
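An update rule consistent with the remark below, namely that the detection probability is proportional to \(d_i(t)^2\) and that each \(d_i(t)^2\) is normalized by \(\max _k s_k(t)\), is:

```latex
d_i(t) = \sqrt{\frac{s_i(t)}{\displaystyle \max _{k} s_k(t)}}
```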
In this way, \(d_{i}(t)\) is revised as time elapses, so that a photon detection event becomes most likely at the photodetector corresponding to the best slot machine, that is, the machine with the highest reward probability. For example, if slot machine 1 has the highest reward probability, the transmittance of ATT1 should remain high while those of ATT2 and ATT3 should become smaller.
Here is a remark about the denominator on the right side of Eq. (5). The probability of detecting state i is proportional to \(d_i(t)^2\). Dividing each \(d_i(t)^2\) by the same value \(\displaystyle {\max _{k}}{s_k(t)}\) does not introduce any unintended bias into the detection probabilities, while the transmission efficiency of the attenuators is kept high. That is, the loss of photons at the attenuators is minimized.
Finally, we make one more important remark regarding the architecture for solving the single-player, multi-armed bandit problem shown in Fig. 2. The principle maximizes the detection probability of the OAM state corresponding to the best machine. Instead of reconfiguring the attenuators, we could accomplish the same functionality by reconfiguring the phase pattern displayed on the SLM located at the light source. Indeed, this alternative directly and dynamically utilizes the high-dimensional property of OAM^{38}. This architecture, however, imposes a complex arbitration mechanism when the principle is extended to the two-player situations discussed below: controlling the light source by a single player is feasible, but source management by two players is non-trivial. By contrast, player-specific attenuator control requires no global server. For these reasons, we discuss the fundamental architecture shown in Fig. 2.
Simulation results for the 1-player 3-armed bandit problem
Figure 3 summarizes simulation results for the 1-player 3-armed bandit problem with the OAM system following the softmax policy. The solid, dashed, and dash-dotted curves in Fig. 3a show the time evolution of the selection probabilities of machines 1, 2, and 3, respectively, when the reward probabilities of the slot machines are given by \([P_1, P_2, P_3] = [0.9, 0.7, 0.1]\). Here the number of repetitions is 1000. We can clearly observe that the probability of selecting the machine with the maximum reward probability, here machine 1, monotonically increases.
Figure 3b examines the correct decision rate (CDR), defined as the number of selections of the machine with the highest reward probability over 1000 trials, when the reward environment is configured differently. The blue, red, and yellow curves show the time evolution of the CDR when the reward environment \([P_1, P_2, P_3]\) is given by [0.9, 0.7, 0.1], [0.9, 0.5, 0.1], and [0.9, 0.3, 0.1], respectively. Here the maximum and minimum reward probabilities are the same across environments. As the difference between the maximum and the second-highest reward probability becomes smaller, the increase of the CDR toward unity becomes slower. Nevertheless, we can observe the monotonic increase of the probability of selecting the best machine in Fig. 3a,b. Since there is no theoretical limitation on the number of OAM states, the system configuration herein can be used for probabilistic selection among a large number of choices. Note that the softmax policy itself is also scalable.
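A minimal numerical sketch of this single-player loop is given below. The function names and the 0.5 initial estimate for unplayed arms are our own assumptions, not the paper's exact implementation; note that, conditioned on a photon being detected, the attenuator rule makes the selection probabilities equal to the softmax probabilities, so the sketch samples from them directly.

```python
import math
import random

def softmax_probs(estimates, beta=20.0):
    # Softmax selection probabilities s_i from reward-probability estimates
    mx = max(estimates)
    weights = [math.exp(beta * (e - mx)) for e in estimates]  # shift for numerical stability
    total = sum(weights)
    return [w / total for w in weights]

def attenuator_amplitudes(probs):
    # Amplitude transmittances d_i = sqrt(s_i / max_k s_k): detection
    # probability is proportional to d_i^2, and the most likely arm is left
    # unattenuated so that photon loss stays minimal
    m = max(probs)
    return [math.sqrt(p / m) for p in probs]

def run_bandit(p_true, T=1000, beta=20.0, seed=0):
    # Simulate T turns of the 1-player K-armed Bernoulli bandit
    rng = random.Random(seed)
    K = len(p_true)
    wins, plays, total = [0] * K, [0] * K, 0
    for _ in range(T):
        # maximum-likelihood estimates; 0.5 prior for unplayed arms (our assumption)
        est = [w / n if n else 0.5 for w, n in zip(wins, plays)]
        probs = softmax_probs(est, beta)
        arm = rng.choices(range(K), weights=probs)[0]
        reward = 1 if rng.random() < p_true[arm] else 0
        wins[arm] += reward
        plays[arm] += 1
        total += reward
    return total, plays
```

With \([P_1, P_2, P_3] = [0.9, 0.7, 0.1]\) and \(\beta = 20\), the selection counts typically concentrate on the high-reward machines, qualitatively reproducing the trend of Fig. 3a.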
Solving the 2-player 3-armed bandit problem with OAM and quantum interference
System architecture for solving the 2-player 3-armed competitive bandit problem with OAM and quantum interference
This section discusses stochastic selections of arms in the CMAB problem using photon-pair OAM quantum states. The system presented in Fig. 2 is extended to the case of two players (Players A and B) by the architecture represented in Fig. 4. This time, the assumption is that a selection only happens when exactly one photon is detected simultaneously by each player on their photodetectors.
In the source part, a photon pair is created by a nonlinear crystal, such as periodically poled KTP (PPKTP), and then subjected to an interferometer. One photon of the pair is supplied to the Detection A system, and the other goes to the Detection B system. The internal structure of the detection systems is the same as in the one-player system depicted in Fig. 2. Thanks to quantum interference, even though there is no explicit communication between the players, the detection results of the two photons are correlated with each other, as discussed in detail later.
In quantum research using light, it has been common to use quantum states based on properties such as polarization, spatial mode, and phase; since the discovery of the orbital angular momentum of light, however, many studies on quantum states using OAM have been reported^{39}. The availability of orbital angular momentum, with its infinite number of states, is very important in quantum research. In 2001, Mair et al. used parametric down-conversion (PDC) to study the generation of photon pairs in states with entangled orbital angular momentum^{39}. Subsequently, a theoretical study of the change in orbital angular momentum during the PDC process was performed^{40}, and photon pairs with three entangled orbital angular momentum states were also studied^{41}. In the present study, we utilize quantum interference given by an extension of the Hong-Ou-Mandel effect^{22}.
Generation of OAM photon pair with quantum interference
The Hong-Ou-Mandel effect is well known for two identical photons entering a 1:1 beam splitter: they are always detected together in the same output path^{22}. We extend the description of this phenomenon to input photons carrying multiple OAM states. When an input photon in OAM state \(|\Phi \rangle \) is sent to the beam splitter, the transmitted term A and the reflected term B can be described in the following forms:
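Under the same reflection convention as above (OAM flip plus a factor of \(i\); an assumed convention), the two terms read:

```latex
A = \frac{1}{\sqrt{2}}\,|\Phi \rangle , \qquad B = \frac{i}{\sqrt{2}}\,R\,|\Phi \rangle
```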
where R represents the flipping of the OAM state; for example, \(R|{+1}\rangle = |{-1}\rangle \). As shown in Fig. 5a, when input photons in OAM states \(|\Phi \rangle \) and \(|\Psi \rangle \) enter the two BS inputs, the output state \(|\Phi '\rangle \otimes |\Psi '\rangle \) can be described in the following form:
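Applying the single-photon BS relation to each input independently gives, under the stated conventions:

```latex
|\Phi '\rangle \otimes |\Psi '\rangle = \frac{1}{2}\Big(|\Phi \rangle _A + i\,R\,|\Phi \rangle _B\Big)\otimes \Big(|\Psi \rangle _B + i\,R\,|\Psi \rangle _A\Big)
```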
With K being the number of OAM states used in the system, the input states \(|\Phi \rangle \) and \(|\Psi \rangle \) can be set to
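A pair of input superpositions consistent with the detection labels used below (side A detects positive OAM, side B negative) is:

```latex
|\Phi \rangle = \frac{1}{\sqrt{K}}\sum _{k=1}^{K} e^{i\phi _k}\,|{+k}\rangle , \qquad |\Psi \rangle = \frac{1}{\sqrt{K}}\sum _{k=1}^{K} e^{i\psi _k}\,|{-k}\rangle
```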
considering that the two photons have the same polarization and wavelength and are synchronized on the beam splitter. Each term of the output state given by Eq. (7) is described by the following:
Therefore, the output state \(|\Phi '\rangle \otimes |\Psi '\rangle \) is given by the following terms:
Correspondingly, the probability of detecting the same state on the same side, that is, \(|{+k}\rangle _A\otimes |{+k}\rangle _A\) or \(|{-k}\rangle _B\otimes |{-k}\rangle _B\), is given by
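This probability is independent of the phases; a form consistent with the two-photon interference above and with the values shown in Fig. 6 is:

```latex
P\big(|{+k}\rangle _A\otimes |{+k}\rangle _A\big) = P\big(|{-k}\rangle _B\otimes |{-k}\rangle _B\big) = \frac{1}{2K^2}
```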
By introducing the parameters \(\theta _k=\frac{\phi _k-\psi _k}{2}\), which depend on the phase differences of the two input states, the probability of detecting different states on the same side, that is, \(|{+k_1}\rangle _A\otimes |{+k_2}\rangle _A\) or \(|{-k_1}\rangle _B\otimes |{-k_2}\rangle _B\), is given by
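A form consistent with the normalization of the other outcome probabilities (for \(k_1 \neq k_2\)) is:

```latex
P\big(|{+k_1}\rangle _A\otimes |{+k_2}\rangle _A\big) = \frac{\cos ^2(\theta _{k_1}-\theta _{k_2})}{K^2}
```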
and finally, the probability of detecting a pair of states on different sides, that is, \(|{+k_1}\rangle _A\otimes |{-k_2}\rangle _B\), is given by
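Consistent with the vanishing of same-state cross-side detections and with the values 1/4 (\(K=2\)) and 1/12 (\(K=3\)) quoted below, this probability reads:

```latex
P\big(|{+k_1}\rangle _A\otimes |{-k_2}\rangle _B\big) = \frac{\sin ^2(\theta _{k_1}-\theta _{k_2})}{K^2}
```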
Figure 5b summarizes the probability of detecting each output state, while Fig. 6 shows all the probabilities with K ranging from 1 to 4. The probabilities depend only on \(\theta _k\), which can be tuned by controlling the SLM phases \(\phi _k\) and \(\psi _k\).
Pairs of photons detected on both sides are displayed with red frames in Fig. 6; these outcomes are utilized as selections by the two players. What is remarkable is that the probability of detecting the same state on different sides is always zero, because the probability term \(\sin ^2(\theta _k-\theta _k)\) is always equal to zero. For \(K=1\), this phenomenon corresponds to what is known as the Hong-Ou-Mandel effect. As the detected OAM states correspond to the selections of the players, the probability of both players selecting the same machine is zero up to experimental constraints such as multiple-pair generation, meaning that conflict-free decisions are accomplished.
The probabilities in the red frames include the probabilities of the two players detecting different states. It is remarkable that these probabilities can take equal values when K is less than or equal to three. For example, when \(K = 2\), by assigning \(\theta _1 = 0\) and \(\theta _2 = \pi /2\), all such probabilities become 1/4. Similarly, when \(K = 3\), by setting \((\theta _1, \theta _2, \theta _3) = (0, \pi /3, 2\pi /3)\), the probabilities are all 1/12. Namely, all arm combinations except selecting the same arm occur equally. Note, however, that when K is greater than or equal to four, we cannot perfectly equalize these probabilities by tuning only \(\theta _1,\theta _2,\ldots ,\theta _K\). This point is discussed in the “Discussion” section. In this study, we focus on the case \(K = 3\) because the equal selection of pairs is ensured, as discussed above.
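These interference probabilities can be checked numerically. The sketch below assumes the beam-splitter convention used above (transmission preserves the OAM state; reflection flips its sign and multiplies the amplitude by \(i\)); the function name and the mode-labelling scheme are ours:

```python
import cmath
import math

def bs_output_probs(phis, psis):
    """Two-photon output probabilities after the beam splitter.

    Assumed convention: transmission preserves the OAM state; reflection
    flips the OAM sign and multiplies the amplitude by i.  Side A then only
    sees positive OAM values and side B only negative ones.
    """
    K = len(phis)
    amps = {}
    for k1 in range(K):          # photon with OAM +(k1+1) enters input a
        for k2 in range(K):      # photon with OAM -(k2+1) enters input b
            c = cmath.exp(1j * (phis[k1] + psis[k2])) / K
            t1 = [(('A', k1 + 1), 1 / math.sqrt(2)),
                  (('B', -(k1 + 1)), 1j / math.sqrt(2))]
            t2 = [(('B', -(k2 + 1)), 1 / math.sqrt(2)),
                  (('A', k2 + 1), 1j / math.sqrt(2))]
            for mode1, a1 in t1:
                for mode2, a2 in t2:
                    key = tuple(sorted((mode1, mode2)))  # unordered mode pair
                    amps[key] = amps.get(key, 0) + c * a1 * a2
    probs = {}
    for key, a in amps.items():
        p = abs(a) ** 2
        if key[0] == key[1]:
            p *= 2  # two photons in one mode: bosonic |2> normalisation
        probs[key] = p
    return probs
```

Setting \((\phi _1,\phi _2,\phi _3) = (0, 2\pi /3, 4\pi /3)\) and \(\psi _k = 0\) gives \((\theta _1,\theta _2,\theta _3) = (0, \pi /3, 2\pi /3)\); the same-arm cross-side probabilities then vanish and the six conflict-free outcomes each occur with probability 1/12.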
Simulation results for the 2-player 3-armed bandit problem
In the CMAB problem considered in the present study, rewards are equally split among players who selected the same machine; that is, a decision conflict among multiple players reduces individual benefits. Furthermore, the total reward is reduced because of the conflicting choice.
Here we begin with a brief overview of two-player decision-making situations in a game-theoretic formalism^{42}, while mentioning its intuitive implications. We denote by \(P_{k^{*}}, P_{k^{**}}, P_{k^{***}}\) the first-, second-, and third-highest reward probabilities, respectively. First, when \(P_{k^{*}} > 2\times P_{k^{**}}\), both players selecting the best machine (machine \(k^{*}\)) is the only Nash equilibrium. That is, conflict is unavoidable if both players act in a greedy manner, because the best machine is far better than the other machines.
Second, when \(P_{k^{*}}<2 \times P_{k^{**}}\), a Nash equilibrium is achieved when player 1 chooses the best machine (machine \(k^{*}\)) and player 2 selects the second-best machine (machine \(k^{**}\)), and vice versa. That is, conflicting decisions are avoided because unilaterally changing a player's decision would decrease his/her reward. However, there is a problem from the viewpoint of equality, as one of the players can keep selecting the higher-reward machine while the other is locked into the lower-reward decision.
Third, there exists another, symmetric Nash equilibrium with a mixed strategy, meaning that the players select each machine with a certain probability. The details are described in the “Methods” section. Intuitively speaking, under this mixed strategy, both players sometimes intentionally refrain from choosing the best machine; therefore, decision conflicts can sometimes be avoided. Indeed, Lai et al. successfully utilized a mixed strategy in dynamic channel selection in communication systems^{7}. It should be remarked, however, that perfect conflict avoidance cannot be ensured by mixed strategies.
In order to quantitatively evaluate the performance differences among policies, we compare the quantum interference system with the following two policies. One is a greedy policy, where both players take greedy actions as if they were playing alone. The second is an equilibrium policy, where both players try to achieve the symmetric Nash equilibrium by a mixed strategy. The details are described in the “Methods” section.
Figure 7 shows the results for solving the 2-player 3-armed bandit problem. Figure 7a shows how the selection probabilities of both players evolve under each policy. With the greedy policy, recalling that machine 1 has the highest reward probability of 0.9, its selection probability approaches almost 1 for both players, as in the single-player case. For the equilibrium policy, the selection probabilities of the two most rewarding machines, 1 and 2, converge to the probabilities defined by the mixed strategy. With the quantum interference strategy, in contrast, machines 1 and 2 are selected with equal probability by both players. Figure 7b shows the ratio of each selection combination from both players. The greedy policy is associated with a large number of conflicts, as both players almost always select machine 1, while the equilibrium policy reduces the number of conflicts to some extent as the selections are distributed. Finally, the quantum interference policy completely avoids conflicts. The final rewards for such selections are shown in Fig. 7c, for each player and for the total attributed reward. We observe that the quantum interference policy achieves almost ideal total rewards as well as equality between players. By contrast, the total rewards of the greedy and equilibrium policies are small compared with the quantum interference policy because they suffer from unavoidable decision conflicts.
Figure 7d shows how the final reward of each policy varies when the reward probabilities of the three machines are modified. For the greedy and equilibrium policies, the total reward changes according to the rate of selection of the lowest-rewarding machine 3 in the exploration phase. With the quantum interference policy, on the other hand, the larger the difference between the reward probabilities of machines 2 and 3, the easier it is to determine the top two machines, and hence the higher the final total reward; even so, this variation in total reward is mild in comparison with the gap to the other two policies.
Discussion
In this study, we show that the high dimensionality of OAM can be exploited for scalability in solving multi-armed bandit problems. Furthermore, appropriate quantum interference constructions achieve high rewards while maintaining a fair repartition between two players in competitive bandit situations. The total reward optimization is guaranteed by the two players selecting the two best machines in a non-conflicting manner, while the fair repartition is guaranteed by the equal selection probabilities among players through quantum interference.
The main assumption is the simultaneous detection of exactly one photon for each player. In the proposed optical design, this serves the extended Hong-Ou-Mandel effect, or quantum interference, which guarantees that identical photons go to the same side of the beam splitter, at the price of post-selecting half of all photon pairs. While this is a strong constraint for potential applications, this design is only an example, and nothing forbids obtaining the target state with other designs that do not rely on post-selection.
Regarding the extension to more arms, the current design is limited to three arms due to fundamental constraints (the lack of enough degrees of freedom to constrain the two-photon state). This may be solved by allowing the relative amplitude of each OAM component to be tuned with the SLM and/or by additional mechanisms. Once again, the goal of the setup presented in this study is only to present the principle of utilizing OAM for multiple arms in MAB problems and quantum interference for competitive decision-making. We believe that the extension to many arms is a technological problem without theoretical constraints^{34}.
The next discussion point is security. The two-player CMAB system herein lets the players directly influence the detection probabilities via the attenuation amplitudes in front of the detectors. While this architecture ensures independent machine selection and independent revision of the attenuation among the players, it presents one fundamental weakness: if a player only wants to select the highest-rewarding arm, the attenuation will be maximized for the lower arms, letting photons reach only the corresponding detector. However, this situation is easily identifiable by the other player, who can recognize that the probability of selecting a particular machine decreases. The solution for that player is straightforward: attenuate the second-best arm more as well to correct the imbalance (in the case of slight inequality), which is equivalent to not playing anymore if the other player completely blocks the other photons.
This brings us to the photon utilization efficiency. In this study, only the simultaneous detection of exactly one photon at the detectors of each player triggers the selection of arms by both players. The reason is to implement the post-selection of output states where one photon goes to each player rather than two photons to only one player. With the current operating principle based on quantum interference, summarized in Fig. 6, half of all photon pairs are strictly unusable by the players. Although such a loss is unavoidable, further photon losses are induced in the system architecture shown in Fig. 4 because of the multiple BSs. As discussed earlier, this part can be improved by technological methods developed in the literature^{33,34}.
Conclusion
To overcome the scalability limitation of former single-photon-based decision making, which relies on two orthogonal polarizations to resolve the two-armed bandit problem, we associate the orbital angular momentum of photons with individual arms, which theoretically allows ideal scalability. When multiple players are involved, conflicts of decisions become a serious issue, known as the competitive multi-armed bandit problem. Formerly, polarization-entangled photons were shown to realize conflict-free decision making in two-player, two-armed situations; however, their arm scalability is limited to only two. In this study, by extending the Hong–Ou–Mandel effect to more than two states, we theoretically establish an experimental configuration able to generate quantum interference among states with orbital angular momentum, together with conditions that provide conflict-free selections. We numerically examine total rewards for two-player, three-armed bandit problems, in which the proposed principle accomplishes almost the theoretical maximum, greater than that of a conventional mixed strategy intended to realize Nash equilibrium. This study paves the way toward photon-based intelligent systems and extends the utility of the high dimensionality of the orbital angular momentum of photons and quantum interference in artificial intelligence domains.
Methods
Detailed algorithms of the greedy policy, equilibrium policy, and quantum interference policy
Greedy policy (strategy for the single-player MAB)
Both players independently decide the probability of selecting each machine at each round. The algorithm is based on the softmax policy^{5}, and the probability of selecting machine \(i\) at round \(t\) is given by the following equation:
\(P_i(t)=\frac{\exp (\beta {\hat{P}}_i(t))}{\sum _{k}\exp (\beta {\hat{P}}_k(t))}\)
If there exists a machine that has not been selected yet, all the machines are selected randomly with the same probability. Otherwise, \({\hat{P}}_k(t)=\frac{w_k(t)}{w_k(t)+l_k(t)}\), where the player has been rewarded \(w_k(t)\) times and not rewarded \(l_k(t)\) times by machine \(k\). As a moderate tuning, \(\beta \) is set to 20 in this study.
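As an illustration, the greedy (softmax) policy described above can be sketched in Python as follows (a minimal sketch; the function and variable names are ours, not from the original implementation):

```python
import math
import random

def softmax_probs(wins, losses, beta=20.0):
    """Selection probabilities of the greedy (softmax) policy.

    wins[k] / losses[k]: times machine k was / was not rewarded so far.
    """
    n = len(wins)
    # A machine that has never been played triggers a uniform random draw.
    if any(w + l == 0 for w, l in zip(wins, losses)):
        return [1.0 / n] * n
    # Maximum-likelihood estimate of each reward probability.
    p_hat = [w / (w + l) for w, l in zip(wins, losses)]
    # Boltzmann weights with inverse temperature beta (beta = 20 in this study).
    weights = [math.exp(beta * p) for p in p_hat]
    total = sum(weights)
    return [w / total for w in weights]

def select_machine(wins, losses, beta=20.0):
    """Draw one machine index according to the softmax probabilities."""
    probs = softmax_probs(wins, losses, beta)
    return random.choices(range(len(wins)), weights=probs)[0]
```

With \(\beta = 20\), the policy concentrates strongly on the machine with the highest estimated reward probability while retaining some exploration.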
Equilibrium policy
Table 1 shows the profit table of the expected reward for a single selection. In a Nash equilibrium, no player gains anything by changing only their own strategy. The equilibrium strategy may consist of selecting one particular machine, but it may also choose multiple machines probabilistically. In the situation of Table 1, strategies are defined by the probabilities of selecting machine 1, machine 2, and machine 3, namely \(\alpha _1,\alpha _2,\alpha _3\) for player A and \(\beta _1,\beta _2,\beta _3\) for player B. In what follows, the notations \(k^{*},k^{**},k^{***}\) represent the indices of the first-, second-, and third-best machines, respectively. The Nash equilibria are summarized below:

in case \(P_{k^{*}}>2P_{k^{**}}\)

\(\diamond \) \((\alpha _{k^{*}},\alpha _{k^{**}},\alpha _{k^{***}}) = (\beta _{k^{*}},\beta _{k^{**}},\beta _{k^{***}}) = (1,0,0)\)


in case \(P_{k^{*}}<2P_{k^{**}}\) and \(P_{k^{*}}P_{k^{**}}/Q>\frac{2}{5}\)

\((\alpha _{k^{*}},\alpha _{k^{**}},\alpha _{k^{***}}) = (1,0,0), \ (\beta _{k^{*}},\beta _{k^{**}},\beta _{k^{***}}) = (0,1,0)\)

\((\alpha _{k^{*}},\alpha _{k^{**}},\alpha _{k^{***}}) = (0,1,0), \ (\beta _{k^{*}},\beta _{k^{**}},\beta _{k^{***}}) = (1,0,0)\)

\(\diamond \) \((\alpha _{k^{*}},\alpha _{k^{**}},\alpha _{k^{***}}) = (\beta _{k^{*}},\beta _{k^{**}},\beta _{k^{***}}) = (\frac{2P_{k^{*}}-P_{k^{**}}}{P_{k^{*}}+P_{k^{**}}}, \frac{2P_{k^{**}}-P_{k^{*}}}{P_{k^{*}}+P_{k^{**}}},0)\)


in case \(P_{k^{*}}<2P_{k^{**}}\) and \(P_{k^{*}}P_{k^{**}}/Q<\frac{2}{5}\)

\((\alpha _{k^{*}},\alpha _{k^{**}},\alpha _{k^{***}}) = (1,0,0), \ (\beta _{k^{*}},\beta _{k^{**}},\beta _{k^{***}}) = (0,1,0)\)

\((\alpha _{k^{*}},\alpha _{k^{**}},\alpha _{k^{***}}) = (0,1,0), \ (\beta _{k^{*}},\beta _{k^{**}},\beta _{k^{***}}) = (1,0,0)\)

\(\diamond \) \((\alpha _{k^{*}},\alpha _{k^{**}},\alpha _{k^{***}}) = (\beta _{k^{*}},\beta _{k^{**}},\beta _{k^{***}}) = (2 - \frac{5P_{k^{**}}P_{k^{***}}}{Q}, 2 - \frac{5P_{k^{***}}P_{k^{*}}}{Q}, 2 - \frac{5P_{k^{*}}P_{k^{**}}}{Q})\)

where \(Q = P_{k^{*}}P_{k^{**}}+P_{k^{**}}P_{k^{***}}+P_{k^{***}}P_{k^{*}}\). With the equilibrium policy, both players try to achieve the symmetric Nash equilibrium, marked with \(\diamond \) above, in a situation where the reward probabilities are not known exactly. In the simulation algorithm, each player decides which machines are better, and which Nash equilibrium to target, based on their own maximum-likelihood estimation of the reward probabilities. In the actual algorithm, the parameters of player 1 are calculated as below, with \({\hat{k}}^{*}, {\hat{k}}^{**}, {\hat{k}}^{***}\) respectively representing the machine indices with the first-, second-, and third-highest estimated reward probability:

in case \({\hat{P}}_{{\hat{k}}^{*}}>2{\hat{P}}_{{\hat{k}}^{**}}\)

\((\alpha ^{*},\alpha ^{**},\alpha ^{***}) = (1,0,0)\)


in case \({\hat{P}}_{{\hat{k}}^{*}}<2{\hat{P}}_{{\hat{k}}^{**}}\) and \({\hat{P}}_{{\hat{k}}^{*}}{\hat{P}}_{{\hat{k}}^{**}}/Q>\frac{2}{5}\)

\((\alpha ^{*},\alpha ^{**},\alpha ^{***}) = (\frac{2{\hat{P}}_{{\hat{k}}^{*}}-{\hat{P}}_{{\hat{k}}^{**}}}{{\hat{P}}_{{\hat{k}}^{*}}+{\hat{P}}_{{\hat{k}}^{**}}}, \frac{2{\hat{P}}_{{\hat{k}}^{**}}-{\hat{P}}_{{\hat{k}}^{*}}}{{\hat{P}}_{{\hat{k}}^{*}}+{\hat{P}}_{{\hat{k}}^{**}}},0)\)


in case \({\hat{P}}_{{\hat{k}}^{*}}<2{\hat{P}}_{{\hat{k}}^{**}}\) and \({\hat{P}}_{{\hat{k}}^{*}}{\hat{P}}_{{\hat{k}}^{**}}/Q<\frac{2}{5}\)

\((\alpha ^{*},\alpha ^{**},\alpha ^{***}) = (2 - \frac{5{\hat{P}}_{{\hat{k}}^{**}}{\hat{P}}_{{\hat{k}}^{***}}}{Q}, 2 - \frac{5{\hat{P}}_{{\hat{k}}^{***}}{\hat{P}}_{{\hat{k}}^{*}}}{Q}, 2 - \frac{5{\hat{P}}_{{\hat{k}}^{*}}{\hat{P}}_{{\hat{k}}^{**}}}{Q})\)

where \(Q = {\hat{P}}_{{\hat{k}}^{*}}{\hat{P}}_{{\hat{k}}^{**}}+{\hat{P}}_{{\hat{k}}^{**}}{\hat{P}}_{{\hat{k}}^{***}}+{\hat{P}}_{{\hat{k}}^{***}}{\hat{P}}_{{\hat{k}}^{*}}\), and \({\hat{P}}_{{\hat{k}}^{*}},{\hat{P}}_{{\hat{k}}^{**}},{\hat{P}}_{{\hat{k}}^{***}}\) represent the first-, second-, and third-highest estimated reward probabilities. The parameters of player 2 are calculated in the same way from that player's own reward probability estimates. The probability of selecting each machine is calculated as below:
where \(\pi (P_{{\hat{k}}^{*}}=P_{k^{*}} \mid H(t))\) represents the probability that machine \({\hat{k}}^{*}\) has the highest reward probability, estimated with the softmax policy from the history \(H(t)\). Here, \(\pi (P_a>P_b>P_c \mid H(t))\) represents the probability that the reward probabilities are ordered as \(P_a>P_b>P_c\) under this estimation, and it is calculated as below:
Therefore, the probabilities that machine \(a\) is the first-, second-, or third-best machine under the softmax-policy estimation are:
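For illustration, the case analysis above for player 1's mixed strategy can be sketched in Python (a minimal sketch; the function and variable names are ours):

```python
def equilibrium_strategy(p_hat):
    """Symmetric Nash-equilibrium mixed strategy for one player, given
    that player's estimated reward probabilities p_hat (three machines).
    Returns the probability of selecting each machine."""
    order = sorted(range(3), key=lambda k: p_hat[k], reverse=True)
    p1, p2, p3 = (p_hat[k] for k in order)  # first-, second-, third-best
    Q = p1 * p2 + p2 * p3 + p3 * p1
    if p1 > 2 * p2:
        # Pure strategy: always select the best machine.
        alphas = (1.0, 0.0, 0.0)
    elif p1 * p2 / Q > 2 / 5:
        # Mixed strategy over the top two machines.
        s = p1 + p2
        alphas = ((2 * p1 - p2) / s, (2 * p2 - p1) / s, 0.0)
    else:
        # Mixed strategy over all three machines.
        alphas = (2 - 5 * p2 * p3 / Q,
                  2 - 5 * p3 * p1 / Q,
                  2 - 5 * p1 * p2 / Q)
    # Map the ranked probabilities back to the machine indices.
    strategy = [0.0] * 3
    for rank, k in enumerate(order):
        strategy[k] = alphas[rank]
    return strategy
```

In every case the returned probabilities sum to one, and the branch conditions match the regimes in which each equilibrium expression stays non-negative.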
Quantum interference policy
In the quantum interference policy, both players try to select the first- and the second-best machines with the same probability so as to achieve fairness between the two players. Therefore, the probabilities of selecting the machines are given by the following equation:
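In simulation terms, the resulting conflict-free joint statistics can be emulated classically as follows (a sketch of the post-selected target statistics only; in the actual system the optical interference itself produces them, and the function names are ours):

```python
import random

def joint_selection(best, second):
    """One conflict-free joint draw under the quantum interference policy.

    With probability 1/2, player A gets the best machine and player B the
    second-best, and vice versa, so the two selections never collide and
    both players receive each machine equally often on average."""
    if random.random() < 0.5:
        return best, second  # (player A's machine, player B's machine)
    return second, best
```

This is exactly the property the post-selected two-photon OAM state provides at every turn, including during exploration.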
References
Kitayama, K. et al. Novel frontier of photonics for data processing–photonic accelerator. APL Photonics 4, 090901 (2019).
De Lima, T. F. et al. Machine learning with neuromorphic photonics. J. Lightwave Technol. 37, 1515–1534 (2019).
Shen, Y. et al. Deep learning with coherent nanophotonic circuits. Nat. Photonics 11, 441 (2017).
Van der Sande, G., Brunner, D. & Soriano, M. C. Advances in photonic reservoir computing. Nanophotonics 6, 561–576 (2017).
Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction 1st edn. (MIT Press, 1998).
Auer, P., Cesa-Bianchi, N. & Fischer, P. Finite-time analysis of the multi-armed bandit problem. Mach. Learn. 47, 235–256 (2002).
Lai, L., El Gamal, H., Jiang, H. & Poor, H. V. Cognitive medium access: Exploration, exploitation, and competition. IEEE Trans. Mob. Comput. 10, 239–253 (2010).
Kim, S.-J., Naruse, M. & Aono, M. Harnessing the computational power of fluids for optimization of collective decision making. Philosophies 1, 245–260 (2016).
Naruse, M. et al. Single-photon decision maker. Sci. Rep. 5, 13253 (2015).
Chauvet, N. et al. Entangled-photon decision maker. Sci. Rep. 9, 12229 (2019).
Chauvet, N. et al. Entangled N-photon states for fair and optimal social decision making. Sci. Rep. 10, 20420 (2020).
Naruse, M. et al. Single photon in hierarchical architecture for physical decision making: Photon intelligence. ACS Photonics 3, 2505–2514 (2016).
Allen, L., Barnett, S. M. & Padgett, M. J. Optical Angular Momentum (CRC Press, 2003).
Forbes, A., de Oliveira, M. & Dennis, M. R. Structured light. Nat. Photonics 15, 253–262 (2021).
Yao, A. M. & Padgett, M. J. Orbital angular momentum: Origins, behavior and applications. Adv. Opt. Photonics 3, 161–204 (2011).
Flamini, F., Spagnolo, N. & Sciarrino, F. Photonic quantum information processing: A review. Rep. Prog. Phys. 82, 016001 (2019).
Forbes, A. & Nape, I. Quantum mechanics with patterns of light: Progress in high dimensional and multidimensional entanglement with structured light. AVS Quantum Sci. 1, 011701 (2019).
Krenn, M., Malik, M., Erhard, M. & Zeilinger, A. Orbital angular momentum of photons and the entanglement of Laguerre–Gaussian modes. Phil. Trans. R. Soc. A 375, 20150442 (2017).
Zhang, Y. et al. Engineering two-photon high-dimensional states through quantum interference. Sci. Adv. 2, e1501165 (2016).
Mirhosseini, M. et al. High-dimensional quantum cryptography with twisted light. New J. Phys. 17, 033033 (2015).
Molina-Terriza, G., Torres, J. P. & Torner, L. Twisted photons. Nat. Phys. 3, 305–310 (2007).
Hong, C.-K., Ou, Z.-Y. & Mandel, L. Measurement of subpicosecond time intervals between two photons by interference. Phys. Rev. Lett. 59, 2044 (1987).
Pinheiro, A. R. C. et al. Vector vortex implementation of a quantum game. JOSA B 30, 3210–3214 (2013).
Balthazar, W. F., Passos, M. H. M., Schmidt, A. G. M., Caetano, D. P. & Huguenin, J. A. O. Experimental realization of the quantum duel game using linear optical circuits. J. Phys. B 48, 165505 (2015).
Allen, L., Beijersbergen, M. W., Spreeuw, R. & Woerdman, J. Orbital angular momentum of light and the transformation of Laguerre–Gaussian laser modes. Phys. Rev. A 45, 8185 (1992).
Beijersbergen, M., Coerwinkel, R., Kristensen, M. & Woerdman, J. Helical-wavefront laser beams produced with a spiral phaseplate. Opt. Commun. 112, 321–327 (1994).
Heckenberg, N., McDuff, R., Smith, C. & White, A. Generation of optical phase singularities by computer-generated holograms. Opt. Lett. 17, 221–223 (1992).
Beijersbergen, M. W., Allen, L., Van der Veen, H. & Woerdman, J. Astigmatic laser mode converters and transfer of orbital angular momentum. Opt. Commun. 96, 123–132 (1993).
Padgett, M., Arlt, J., Simpson, N. & Allen, L. An experiment to observe the intensity and phase structure of Laguerre–Gaussian laser modes. Am. J. Phys. 64, 77–82 (1996).
Wang, J. et al. Terabit free-space data transmission employing orbital angular momentum multiplexing. Nat. Photonics 6, 488–496 (2012).
Leach, J., Padgett, M. J., Barnett, S. M., FrankeArnold, S. & Courtial, J. Measuring the orbital angular momentum of a single photon. Phys. Rev. Lett. 88, 257901 (2002).
Vaziri, A., Pan, J.-W., Jennewein, T., Weihs, G. & Zeilinger, A. Concentration of higher dimensional entanglement: Qutrits of photon orbital angular momentum. Phys. Rev. Lett. 91, 227902 (2003).
Lavery, M. P. et al. Refractive elements for the measurement of the orbital angular momentum of a single photon. Opt. Express 20, 2110–2115 (2012).
Lavery, M. P. et al. Efficient measurement of an optical orbital-angular-momentum spectrum comprising more than 50 states. New J. Phys. 15, 013024 (2013).
Cohen, J. D., McClure, S. M. & Yu, A. J. Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration. Philos. Trans. R. Soc. B 362, 933–942 (2007).
Daw, N. D., O'Doherty, J. P., Dayan, P., Seymour, B. & Dolan, R. J. Cortical substrates for exploratory decisions in humans. Nature 441, 876–879 (2006).
Cesa-Bianchi, N., Gentile, C., Lugosi, G. & Neu, G. Boltzmann exploration done right. arXiv:1705.10257 (2017).
Pinnell, J., Rodríguez-Fajardo, V. & Forbes, A. Single-step shaping of the orbital angular momentum spectrum of light. Opt. Express 27, 28009–28021 (2019).
Mair, A., Vaziri, A., Weihs, G. & Zeilinger, A. Entanglement of the orbital angular momentum states of photons. Nature 412, 313–316 (2001).
Franke-Arnold, S., Barnett, S. M., Padgett, M. J. & Allen, L. Two-photon entanglement of orbital angular momentum states. Phys. Rev. A 65, 033823 (2002).
Vaziri, A., Weihs, G. & Zeilinger, A. Experimental two-photon, three-dimensional entanglement for quantum communication. Phys. Rev. Lett. 89, 240401 (2002).
Nash, J. F. et al. Equilibrium points in n-person games. Proc. Natl. Acad. Sci. USA 36, 48–49 (1950).
Acknowledgements
This work was supported in part by the CREST Project (JPMJCR17N2) funded by the Japan Science and Technology Agency, Grants-in-Aid for Scientific Research (JP20H00233) funded by the Japan Society for the Promotion of Science, and the CNRS-UTokyo Excellence Science Joint Research Program.
Author information
Authors and Affiliations
Contributions
M.N., N.C., and G.B. directed the project. T.A., N.C., G.B., S.H., and M.N. designed the system architecture. T.A. and N.C. conducted physical modeling and numerical performance evaluations. N.C., G.B., S.H., and R.H. examined technological constraints. All authors discussed the results. T.A., N.C., and M.N. wrote the manuscript. All authors reviewed the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Amakasu, T., Chauvet, N., Bachelier, G. et al. Conflict-free collective stochastic decision making by orbital angular momentum of photons through quantum interference. Sci. Rep. 11, 21117 (2021). https://doi.org/10.1038/s41598-021-00493-2