Abstract
Reinforcement learning involves decision making in dynamic and uncertain environments and constitutes an important element of artificial intelligence (AI). In this work, we experimentally demonstrate that the ultrafast chaotic oscillatory dynamics of lasers efficiently solve the multiarmed bandit problem (MAB), which requires decision making concerning a class of difficult tradeoffs called the exploration–exploitation dilemma. To solve the MAB, a certain degree of randomness is required for exploration purposes. However, pseudorandom numbers generated using conventional electronic circuitry encounter severe limitations in terms of their data rate and the quality of randomness due to their algorithmic foundations. We generate laser chaos signals using a semiconductor laser sampled at a maximum rate of 100 GSample/s, and combine it with a simple decisionmaking principle called tug of war with a variable threshold, to ensure ultrafast, adaptive, and accurate decision making at a maximum adaptation speed of 1 GHz. We found that decisionmaking performance was maximized with an optimal sampling interval, and we highlight the exact coincidence between the negative autocorrelation inherent in laser chaos and decisionmaking performance. This study paves the way for a new realm of ultrafast photonics in the age of AI, where the ultrahigh bandwidth of light wave can provide new value.
Introduction
Unique physical attributes of photons have been utilized in information processing in the literature on optical computing^{1}. New photonic processing principles have recently emerged to solve complex timeseries prediction problems^{2,3,4}, and issues in spatiotemporal dynamics^{5} and combinatorial optimization^{6}, which coincide with the rapid shift to the age of artificial intelligence (AI). These novel approaches exploit the ultrahigh bandwidth attributes of light wave and their enabling device technologies^{2,3,6}. This paper experimentally demonstrates the usefulness of ultrafast chaotic oscillatory dynamics in semiconductor lasers for reinforcement learning, which is among the most important elements in machine learning.
Reinforcement learning involves decision making in dynamic and uncertain environments^{7}. It forms the foundation of a variety of applications, such as information infrastructures^{8}, online advertisements^{9}, robotics^{10}, transportation^{11}, and Monte Carlo tree search^{12}, which is used in computer gaming^{13}. A fundamental of reinforcement learning is known as the multiarmed bandit problem (MAB), where the goal is to maximize the total reward from multiple slot machines, the reward probabilities of which are unknown^{7,14,15}. To solve the MAB, one needs to explore better slot machines. However, too much exploration may result in excessive loss, whereas too quick a decision, or insufficient exploration, may lead to the neglect of the best machine. There is a tradeoff, referred to as the exploration–exploitation dilemma^{7}. A variety of algorithms for solving the MAB have been proposed in the literature, such as εgreedy^{14}, softmax^{16}, and upper confidence bound^{17}.
These approaches typically involve probabilistic attributes, especially for exploration purposes. While the implementation and improvements of such algorithms on conventional digital computing systems are important for various practical applications, understanding the limitations of the algorithms and investigating novel approaches are also important based on perspectives from postsilicon computing. For example, pseudorandom number generation (RNG) used in conventional algorithmic approaches has severe limitations, such as its data rate, owing to the operating frequencies of digital processors (~gigahertz (GHz) range). Moreover, the quality of randomness in RNG has serious limitations^{18}. The usefulness of photonic random processes for machine learning is also discussed by utilizing multiple optical scattering^{19}.
We consider that directly utilizing physical irregular processes in nature is an exciting approach with the goal of realizing artificially constructed, physical decisionmaking machines^{20}. Indeed, the intelligence of slime moulds or amoebae, singlecell natural organisms, has been used in solution searches, whereby complex intercellular spatiotemporal dynamics play a key role^{21}. This has stimulated the subsequent discovery of a new principle of decisionmaking strategy called tug of war (TOW), invented by Kim et al.^{22,23}. The principle of the TOW method originated from the observation of the slime mould: its body dynamically expands and shrinks while maintaining a constant intracellularresource volume, allowing the mould to collect environmental information; this conservation of the volume entails a nonlocal correlation within the body. The fluctuation, or probabilistic behaviour, in the body of amoeba is important for the exploration of better solutions. The name TOW is a metaphor used to represent such a nonlocal correlation while accommodating fluctuation, which enhances decisionmaking performance^{23}.
This principle can be adapted to photonic processes. In past research, we experimentally showed physical decision making based on nearfieldmediated optical excitation transfer at the nanoscale^{20,24} and with single photons^{25}. These former studies pursued the ultimate physical attributes of photons in terms of diffraction limitfree spatial resolutions and energy efficiency by considering nearfield photons^{26,27}, and the quantum attributes of singlelight quanta^{28}. The nonlocal aspect of TOW is directly physically represented by the wave nature of a single photon or an excitonpolariton, whereas fluctuation is also directly represented by the intrinsic probabilistic attributes. However, fluctuations are limited by the practical limitations on the measurements and control systems (second order in the worst case) as well as the singlephoton generation rate (kHz range).
The ultrafast, highbandwidth aspect of light wave, for instance beyond THz order if the wavelength is around 1.5 μm in optical communications, is another promising physical platform for TOWbased decision making to complement diffractionlimitfree and lowenergy nearfield photon approaches as well as quantumlevel singlephoton strategies. As demonstrated below, chaotic oscillatory dynamics of lasers that contains negative autocorrelation experimentally demonstrates 1GHz decision making. In addition to the resultant speed merit, it should be emphasized that the technological maturity of ultrafast photonic devices allows for relatively easy and scalable system implementation through commercially available photonic devices. Furthermore, the applications of the proposed ultrafast photonicsbased, and the former nearfield/singlephotonbased, decision making are complementary: the former targets highend, data centre scenarios by highlighting ultrafast performance, whereas the latter appeals to lowenergy, InternetofThingsrelated^{29}, and security^{30} applications.
In this study, we demonstrate ultrafast reinforcement learning based on chaotic oscillatory dynamics in semiconductor lasers^{31,32,33,34} that yields adaptation from zeroprior knowledge at a GHz range. The randomness is based on complex dynamics in lasers^{32,33,34}, and its resulting speed is unachievable in other mechanisms, at least through technologically reasonable means. We experimentally show that ultrafast photonics has significant potential for reinforcement learning. The proposed principles using ultrafast temporal dynamics can be matched to applications including an arbitration of resources at data centres^{35}, highfrequency trading^{36}, where decision making is required at least within milliseconds, and other such highend utilities. Scientifically, this study paves a way toward the understanding of the physical origin of the enhancement of intelligent abilities (which is reinforcement learning herein) when natural processes (laser chaos herein) are coupled with external systems; this is what we call natural intelligence.
Chaotic dynamics in lasers has been examined in the literature^{32,33,34}, and its applications have exploited the ultrafast attributes of photonics for secure communication^{37,38,39}, RNG^{31,40,41}, remote sensing^{42}, and reservoir computing^{2,3,4}. Reservoir computing is a type of neural network similar to deep learning^{13} that has been intensively studied to provide recognition and predictionrelated functionalities. Reinforcement learning described in this study differs completely from reservoir computing from the perspective that neither a virtual network nor machine learning for output weights is required. However, it should be noted that reinforcement learning is important in complementing the capabilities of neural networks, indicating the potential for the fusion of photonic reservoir computing with photonic reinforcement learning in future work.
Principle of reinforcement learning
For the simplest case that preserves the essence of solving the MAB, we consider a player who selects one of two slot machines, called slot machines 1 and 2 hereafter, with the goal of maximizing reward (known as the twoarmed bandit problem). Denoting the reward probabilities of the slot machines by P_{ i } (i = 1, 2), the problem is to select the machine with the highest reward probability. The amount of reward dispensed by each slot machine for a play is assumed to be the same in this study. That is, the probability of ‘win’ by playing the slot machine i is P_{ i } and the probability of ‘lose’ by playing the slot machine i is 1 − P_{ i }. The sum of ‘win’ and ‘lose’ is unity in playing a particular slot machine, whereas the sum of P_{ i } among all slot machines may not be unity.
The measured chaotic signal s(t) is subjected to the threshold adjuster (TA), according to the TOW principle. The output of the TA is immediately the decision concerning the slot machine to choose. If s(t) is equal to or greater than the threshold value T(t), the decision is made to select slot machine 1. Otherwise, the decision is made to select slot machine 2. The reward—the win/lose information of a slot machine play—is fed back to the TA.
The chaotic signal level s(t) is compared with the threshold value T(t) denoted by
where TA(t) is the threshold adjuster value at cycle t, \(\lfloor TA(t)\rfloor \) is the nearest integer to TA(t) rounded to zero, and k is a constant determining the range of the resultant T(t). In this study, we assumed that \(\lfloor TA(t)\rfloor \) takes the values \(N,\ldots 1,0,1,\ldots ,N\), where N is a natural number. Hence the number of the thresholds is 2N + 1, referred to as TA’s resolution. The range of TA(t) is limited between −kN and kN by setting T(t) = kN when \(\lfloor TA(t)\rfloor \) is greater than N, as well as T(t) = −kN when \(\lfloor TA(t)\rfloor \) is smaller than −N.
If the selected slot machine yields a reward at cycle t (in other words, wins the slot machine play), the TA value is updated at cycle t + 1 based on
where α is referred to as the forgetting (memory) parameter^{20}, and Δ is the constant increment (in this experiment, Δ = 1 and α = 0.999). In this study, the initial TA value was zero. If the selected machine does not yield a reward (or loses in the slot machine play), the TA value is updated as follows:
where Ω is the increment parameter defined below. Intuitively speaking, the TA takes a smaller value if slot machine 1 is considered more likely to win, and a greater value if slot machine 2 is considered more likely to earn the reward. This is as if the TA value is being pulled by the two slot machines at both ends, coinciding with the notion of a tug of war.
The fluctuation, necessary for exploration, is realized by associating the TA value with the threshold of digitization of the chaotic signal train. If the chaotic signal level s(t) is equal to or greater than the assumed threshold T(t), the decision is immediately made to choose slot machine 1; otherwise, the decision is made to select slot machine 2. Initially, the threshold is zero; hence, the probability of choosing either slot machine 1 or 2 is 0.5. As time elapses, the TA value shifts (becomes positive or negative) towards the slot machine with the higher reward probability based on the dynamics shown in Eqs (2) and (3). We should note that due to the irregular nature of the incoming chaotic signal, the possibility of choosing the opposite machine is not zero, and this is a critical feature of exploration in reinforcement learning. For example, even when the TA value is sufficiently small (meaning that slot machine 1 seems highly likely to be the better machine), the probability of the decision to choose slot machine 2 is not zero.
In TOWbased decision making, the increment parameter Ω in Eq. (3) is determined based on the history of betting results. Let the number of times when slot machine i is selected in cycle t be S_{ i } and the number of wins in selecting slot machine i be L_{ i }. The estimated reward probabilities of slot machines 1 and 2 are given by
Ω is then given by
The initial Ω is assumed to be unity, and a constant value is assumed when the denominator of Eq. (5) is zero. The detailed derivation of Eq. (5) is shown in ref.^{23}.
Results
The architecture of laser chaosbased reinforcement learning is schematically shown in Fig. 1a. A semiconductor laser is coupled with a polarizationmaintaining (PM) coupler. The emitted light is incident on a variable fibre reflector, by which a delayed optical feedback is supplied to the laser, leading to laser chaos^{40}. The output light at the other end of the PM coupler is detected by a highspeed, ACcoupled photodetector through an optical isolator (ISO) and an attenuator, and is sampled by a highspeed digital oscilloscope at a rate of 100 GSample/s (a 10ps sampling interval). The detailed specifications of the experimental apparatus are described in the Methods section.
Figure 1b(i) shows an example of the chaotic signal train. Figure 1c and d show the optical and radio frequency (RF) spectra of laser chaos measured by optical and RF spectrum analysers, respectively. The semiconductor laser was operated at a centre wavelength of 1547.785 nm. The standard bandwidth^{31} of the RF spectrum was estimated as 10.8 GHz. Figure 1e summarizes the histogram of the signal levels of the chaotic trains spanning from −0.5 to 0.5, with the level zero at its maximum incidence, and slightly skewed between the positive and negative sides. A remark is that the small incidence peak at the lowest measured amplitude is due to our experimental apparatus (not the laser), and it does not critically affect the present study. Meanwhile, Fig. 1b(ii) exhibits an example of a quasiperiodic signal train generated from the same laser by changing the optical feedback power from the external reflector (see the Methods section for details). Besides, Fig. 1b(iii) shows an example of a coloured noise signal train containing negative autocorrelation calculated in a computer on the basis of the Ornstein–Uhlenbeck process using white Gaussian noise and a lowpass filter^{43} with the cutoff frequency of 10 GHz. Relevant details and performance comparisons with respect to decision making will be discussed later.
The rounded TA values, \(\lfloor TA(t)\rfloor \), are also schematically illustrated at the righthand side of Fig. 1b and the upper side of Fig. 1e, assuming N = 10. This means that \(\lfloor TA(t)\rfloor \) ranges from −10 to 10. Signal value s(t) spans between −0.5 and 0.5. The particular example of TA values shown in Fig. 1b and e shows the case when the constant k in Eq. (1) is given by 0.05, so that the actual threshold T(t) spans from −0.5 to 0.5. In the experimental demonstration in this study, the TA and the slot machines were emulated in offline processing, whereas online processing was technologically feasible owing to the simple procedure of the TA (Remarks are given in the Discussion section).
[EXPERIMENT1] Adaptation to sudden environmental changes
We first solved the twoarmed bandit problems given by the following two cases, where the reward probabilities {P_{1}, P_{2}} were given by {0.8, 0.2} and {0.6, 0.4}. We assumed that the sum of the reward probabilities P_{1} + P_{2} was known prior to slot machine plays. With this knowledge, Ω is unity by Eq. (5).
The slot machine was consecutively played 4,000 times, and this play was repeated 100 times. The red and blue curves in Fig. 2a show the evolution of the correct decision rate (CDR), defined by the ratio of the number of selections of the machines that yielded a higher reward probability at cycle t in 100 trials, with respect to the probability combination of {0.8, 0.2} and {0.6, 0.4}. The chaotic signal was sampled every 10 ps; hence, the total duration of the 4,000 plays of the slot machine was 40 ns. To represent sudden environmental changes (or uncertainty), the reward probability was forcibly interchanged every 10 ns, or every 1,000 plays (For example, {P_{1}, P_{2}} = {0.8, 0.2} was reconfigured to {P_{1}, P_{2}} = {0.2, 0.8}). The resolution of the TA was set to 9 (or N = 4).
We observed that the CDR curves quickly approached unity, even after sudden changes in the reward probabilities, showing successful decision making. The adaptation was steeper in the case of a reward probability combination of {0.8, 0.2} than that of {0.6, 0.4}, since the difference in the reward probability was greater in the former case (0.8 − 0.2 = 0.6) than the latter (0.6 − 0.4 = 0.2). This meant that decision making was easier. The red and blue curves in Fig. 2b represent the evolution of TA values (TA(t)) in the case of {0.8, 0.2} and {0.6, 0.4}, respectively, where the TA value became greater than 4 or −4; hence, it took a long time for the TA values to be inverted to the other polarity following environmental change. This is the origin of some delay in the CDR in responding to the environmental changes observed in Fig. 2a. A simple method of improving adaptation is to limit the maximum and minimum values of TA(t). The forgetting parameter α is also crucial to improving adaptation speed.
Figure 2c,d and e characterize decisionmaking performance dependencies with respect to the configuration of TA. Figure 2c concerns TA resolutions demonstrated by the eight curves therein, with corresponding TA resolutions of 3 to 257 (or \(N={2}^{i1}(i=1,\ldots \,,8)\)), where the adaptation was quicker in the case of lower resolutions than in the case of higher resolutions. Figure 2d considers TA range dependencies: while keeping the centre of the TA value at zero and the TA resolution at 8, the three curves in Fig. 2d compare CDRs in the TA ranges of 1, 0.25, and 0.125. We observed that a full coverage (TA range of 1) of the chaotic signal yields the best performance. Figure 2e shows the TA’s centre value dependencies while maintaining a TA range of 1 and a resolution of 8, and we can clearly observe the deterioration of CDRs with the shift in the centre of the TA from zero.
[EXPERIMENT2] Adaptation from zero prior knowledge
Here we considered decisionmaking problems without any prior knowledge of the slot machines. Hence, parameter Ω needed to be updated. The reward probabilities of the slot machines were set to {P_{1}, P_{2}} = {0.5, 0.1}, and 100 consecutive plays were executed. Figure 3a shows the time evolution of the CDR until the 75th slot play cycle to enlarge the early phase of the adaptation. The red and blue curves in Fig. 3a depict CDRs on the basis of chaotic signal trains with sampling intervals of 10 and 50 ps, respectively. Hence, the time needed to complete 100 consecutive slot machine plays differed among these; it ranges from \(10\,{\rm{ps}}\times 100=1\,{\rm{ns}}\) to \(50\,{\rm{ps}}\times 100=5\,{\rm{ns}}\). Meanwhile, the green curve in Fig. 3a represents the CDR with quasiperiodic signal trains sampled at 50ps intervals. We observed that the CDR based on chaotic signals sampled at 50ps intervals exhibited more rapid adaptation than quasiperiodic signals. The time evolution of the CDR for the coloured noise signal is shown by the magenta curve in Fig. 3a, which stays lower than the experimentally observed chaotic signal trains.
Moreover, the black curve shows the CDR obtained by uniformly distributed pseudorandom numbers generated with the Mersenne Twister as the random signal source, instead of the experimentally observed chaotic signals. In addition, we characterized CDRs with normally distributed random numbers (referred to as RANDN) to ensure that the statistical incidence patterns of the laser chaos, which were similar to a normal distribution shown in Fig. 1e, were not the origins of the fast adaptation of decision making. The solid, dotted, and dashed curves in Fig. 3a exhibit the CDR obtained by RANDN of which standard deviation (σ) was configured as 0.2, 0.1, and 0.01 while keeping the mean value at zero. We observed that the CDRs based on chaotic signals exhibited more rapid adaptation than the uniformly and normally distributed pseudorandom numbers.
The most prompt adaptation, or the optimal performance, was obtained at a particular sampling interval. The blue square marks in Fig. 3b compare CDRs at the cycle t = 5 as a function of the sampling interval ranging from 10 ps to 4000 ps. The sampling interval of 50 ps yielded the best performance, indicating that the original chaotic dynamics of the laser could physically be optimized such that the most prompt decisionmaking is realized. Indeed, the autocorrelation of the laser chaos signal trains was evaluated as shown by the blue square marks in Fig. 3c, and its negative maximum value is taken when the time lag is given by 5 or −5, corresponding exactly to the sampling interval of 50 ps (5 × 10 ps). In other words, the negative correlation of chaotic dynamics affects the exploration ability for decision making. Furthermore, this finding suggests that optimal performance is obtained at a particular sampling rate (or data rate) by physically tuning the dynamics of the original laser chaos, which will be an important and exciting topic for future investigation. The adaptation speed of decision making was estimated as 1 GHz in this optimal case, where CDR was larger than 0.95 at 20 cycles with a 50ps sampling interval (20 GSamples/s) in Fig. 3a (1 GHz = (50 ps × 20 cycles)^{−1}). In other words, the latency of decision making is approximately 1 ns, and the throughput is 20 GDecision/s.
Meanwhile, the autocorrelation of the quasiperiodic signal train also yields negative values, as shown by the green circular marks in Fig. 3c. In fact, the absolute value of the negative autocorrelation at the time lag of 5 (and −5) is larger than that of the chaotic signal train. However, the adaptation performance of the chaotic signal trains is superior (see Fig. 3b). At the same time, the best adaptation performance of the quasiperiodic signal train is obtained when the sampling interval is 70 ps (see Fig. 3b), which coincides with the peak of the negative maximum value of autocorrelation with the time lag of 7 (=70 ps) (see Fig. 3c).
Another indication of the observation is that the coloured noise containing negative autocorrelation might contribute to performance improvement. We generated the coloured noise signal by computer simulations of which autocorrelation is shown by the diamond marks in Fig. 3c whereby the negative maximum is obtained at the time lag of 7 (sampling interval: 70 ps). The samplingintervaldependence of the CDR is summarized by the diamond marks in Fig. 3b, where a peak is observed at the sampling interval of 7, and also coincides with the peak of negative autocorrelation. These results imply the necessity of gaining deeper insights in future studies as discussed in the Discussion section.
Figure 4 also presents the CDR at the cycle t = 5 obtained by the chaotic signal, quasiperiodic time series, coloured noise signal, and pseudorandom numbers. Moreover, the CDR at the cycle t = 5 was evaluated by surrogating, or randomly permutating, the chaotic signal time series sampled at 50 ps intervals, and this resulted in poorer performance compared with the original chaotic signal, as shown in Fig. 4. These evaluations support the claim that laser chaos is beneficial to the performance of reinforcement learning, in addition to its ultrafast data rate for random signals.
Discussion
The performance enhancement of decision making by chaotic laser dynamics is demonstrated and the impacts of negative autocorrelation are clearly suggested. Further understanding between chaotic oscillatory dynamics and decision making is a part of important future research.
The first consideration involves physical insights. Toomey et al. recently showed that the complexity of laser chaos varies within the coherence collapse region in the given system^{44}. The level of the optical feedback, injection current of the laser becomes an important parameter in determining the complexity of chaos, and leads to thorough insights. We also consider the use of the bandwidth enhancement technique^{31} with optically injected lasers to improve the adaptation speed of decision higher than tens of GHz.
We showed the exact coincidence between the time lag that yielded the negative maximum of the autocorrelation and the sampling interval that provided the highest adaptation performance with the use of laser chaos signals (Fig. 3b). The absolute value of the negative autocorrelation itself is, however, larger with the quasiperiodic signals rather than with chaotic signals (Fig. 3c). Nevertheless, superior adaptation performance is realized with the chaotic signals (Figs 3b and 4). These observations indicate that, besides the negative autocorrelation inherent in chaotic and quasiperiodic oscillatory dynamics of lasers as well as the coloured noise generated by computers, other perspectives could explain the underlying mechanism, such as diffusivity^{45} and Hurst exponents^{46}.
The decisionmaking latency of 1 ns demonstrated by chaotic lasers in this study may overachieve the expected performance demanded by practical applications today. However, highend applications require even shorter latency; for example, microseconds and nanoseconds applications are discussed in the literature^{47,48}. Meanwhile, fully electrical online processing using multiple processors presents difficulties in reducing latency while having comparable throughput; such an architectural standpoint continues to be an important aspect in future studies.
Online processing can become feasible through electric circuitry, as already demonstrated for a randombit generator^{40,49} since the required processing is very simple as described in Eqs (1) to (5). More specifically, combining a demultiplexer and multiple data processing circuits (such as fieldprogrammable gate array chips) could provide the requisite online processing capability. In this study, in contrast, the prime interest is to present the principle and the usefulness of laser chaos for reinforcement learning, and we do not experimentally adapt the electrical circuits for online processing that need substantial development costs. In addition, the performance comparison between laser chaos signals and numerically generated coloured noise signals shows the usefulness of direct usage of physical signal sources besides the advantages in terms of latency and technological implementations in the highspeed domain.
In the experimental demonstration, the nonlocal aspect of the TOW principle was not directly, physically relevant to the given chaotic oscillatory time series of the lasers whereas our former single photonbased decision maker directly utilizes the physical property therein (the wave–particle duality of single photons) for decision making as already mentioned in the introduction. By combining the chaotic dynamics with the threshold adjustor, the nonlocal and fluctuation properties of the TOW principle emerged which have not been completely reported in the literature nor realized in our past experimental studies^{24,25}. Such a hybrid realization of nonlocality in TOW leads to higher likelihood of technological implementability and better scalability and extension to highergrade problems.
All optical realization is an interesting issue to explore, and has already been implied by the analysis where the timedomain correlation of laser chaos strongly influences decisionmaking performances. The mode dynamics of multimode lasers^{50} are very promising for the implementation of nonlocal properties of fully photonic systems required for decision making. Synchronization and its clustering properties in coupled laser networks^{51,52} are also interesting approaches to physically realizing the nonlocality of the TOW. These systems can automatically tune the optimal settings for decision making (e.g., negative autocorrelation properties), which can lead to autonomous photonic intelligence.
For scalability, a variety of approaches can be considered, such as timedomain multiplication, which exploits the ultrafast attributes of chaotic lasers and is a frequently used strategy in ultrafast photonic systems. Introducing multithreshold values^{31,41} is another simple extension of our proposed scheme. Furthermore, the relation between the difficulty level of the given MAB problem and the necessary irregularity of chaotic signal trains would be an interesting future study topic. In this respect, Naruse et al. experimentally examined a hierarchical method and successfully solved fourarmed bandit problems using single photons^{53}; the timedomain realization of such a hierarchical approach based on laser chaos may highlight the uniqueness of chaotic oscillatory dynamics of lasers.
We also make note of the extension of the principle demonstrated in this paper to highergrade machine learning problems. This paper studied the MAB where the problem is decision making to maximize the income for a single player. The highergrade problem in this context is competitive MAB^{54} where the issue is decision making to maximize the total reward for multiple players as a whole. The competitive MAB involves the socalled Nash equilibrium where the decisions based on an individual’s reward maximization do not yield maximum reward as a whole. This is a foundation of such important applications as resource allocation and social optimization. Investigating the possibilities of extending the present method and utilizing ultrafast laser dynamics for competitive MAB is highly interesting.
Conclusion
We experimentally established that laser chaos provides ultrafast reinforcement learning and decision making. The adaptation speed of decision making reached 1 GHz in the optimal case with the sampling rate of 20 GSample/s (50ps decisionmaking intervals) using the ultrafast dynamics inherent in laser chaos. The maximum adaptation performance coincided with the negative maximum of the autocorrelation of the original timedomain laser chaos sequences, demonstrating the strong impact of chaotic lasers on decision making. The origin of superior performance was also validated by comparing with experimentally observed quasiperiodic signals, computergenerated coloured noises, uniformly and normally distributed pseudorandom numbers, and surrogated arrangements of original chaotic signal trains. This study is the first demonstration of ultrafast photonic reinforcement learning or decision making, to the best of our knowledge, and paves the way for research on photonic intelligence and new applications of chaotic lasers in the realm of artificial intelligence.
Methods
Optical system
The laser used in the experiment was a distributed feedback semiconductor laser mounted on a butterfly package with optical fibre pigtails (NTT Electronics, KELD1C5GAAA). The injection current of the semiconductor laser was set to 58.5 mA (5.37 I_{ th }), where the lasing threshold I_{ th } was 10.9 mA. The relaxation oscillation frequency of the laser was 6.5 GHz. The temperature of the semiconductor laser was set to 294.83 K. The laser output power was 13.2 mW. The laser was connected to a variable fibre reflector which reflected a fraction of light back into the laser, inducing highfrequency chaotic oscillations of optical intensity^{32,33,34}. The values of the optical feedback power (ratio) from the external reflector to the laser were 211 μW (1.6%) and 15 μW (0.11%) in generating chaotic and quasiperiodic signals, respectively. The fibre length between the laser and the reflector was 4.55 m, corresponding to the feedback delay time (round trip) of 43.8 ns. Polarizationmaintaining fibres were used for all optical fibre components. The optical output was converted to an electronic signal by a photodetector (New Focus, 1474A, 38 GHz bandwidth) and sampled by a digital oscilloscope (Tektronics, DPO73304D, 33 GHz bandwidth, 100 GSample/s, eightbit vertical resolution). The RF spectrum of the laser was measured by an RF spectrum analyser (Agilent, N9010A544, 44 GHz bandwidth). The optical wavelength of the laser was measured by an optical spectrum analyser (Yokogawa, AQ6370C20).
Data analysis
[EXPERIMENT1]
A chaotically oscillated signal train was sampled at a rate of 100 GSample/s by 10,000,000 points, and lasted approximately 10 μs. As described in the main text, 4,000 consecutive plays were repeated 100 times; hence, the total number of slot machine plays was 400,000. With a 10ps interval sampling, the initial 400,000 points of the chaotic signal were used for the decisionmaking experiments. The processing time required for 400,000 iterations of slot machine plays was approximately 0.93 s (or 2.3 μs/decision), on a personal computer (HewlettPackard, Z800, Intel Xeon CPU, 3.33 GHZ, 48 GB RAM, Windows 7, MATLAB R2011b).
[EXPERIMENT2]

(1)
Sampling methods: A chaotic signal train was sampled at 10ps intervals with 10,000,000 sampling points. Such a train was measured 120 times. Each chaotic signal train was referred to as chaos_{ i }, and there were 120 kinds of such trains: \(i=1,\ldots \,,120\). In demonstrating 10 × M ps sampling intervals, where M was a natural number ranging from 1 to 400 (that is, the sampling intervals were 10 ps, 20 ps, … and 4000 ps), we chose one of every M samples from the original sequence.

(2)
Evaluation of CDR regarding a specific chaos sequence: For every chaotic signal train chaos_{ i }, 100 consecutive plays were repeated 100 times. Consequently, 10,000 points were used from chaos_{ i }. Such evaluations were repeated 100 times. Hence, 1,000,000 slot machine plays were conducted in total. These CDRs were calculated for all signal trains chaos_{ i } (\(i=1,\ldots \,,120\)).

(3)
Evaluation of CDR of all chaotic sequences: We evaluated the average CDR of all chaotic signal trains (\(i=1,\ldots \,,120\)) derived in (2) above, which were the results discussed in the main text. The methods of the performance evaluation of CDR with respect to the experimentally observed quasiperiodic signals, RAND, and RANDN were the same as that described in (2) and (3).

(4)
Autocorrelation of chaotic signals: The autocorrelation was computed based on all 10,000,000 sampling points of chaos_{ i }, and was evaluated for all chaos_{ i } (\(i=1,\ldots \,,120\)). The autocorrelation demonstrated in Fig. 3c was evaluated as the average of these 120 kinds of autocorrelations. The autocorrelation of quasiperiodic signals was evaluated in the same manner.

(5)
Coloured noise: Coloured noise was calculated on the basis of the Ornstein–Uhlenbeck process using white Gaussian noise and a lowpass filter in numerical simulations^{43}. We assumed that the coloured noise was generated at the sampling rate of 100 GHz, and the cutoff frequency of the lowpass filter was set to 10 GHz (the correlation time was 100 ps). Forty sequences of 10,000,000 points were generated. The methods of calculating CDR and autocorrelation were the same as that with (2), (3), and (4) whereas the number of sequences was 40 (not 120). The reduction in the number of sequences was due to the excessive computational costs in our computing environment.

(6)
Surrogate methods: The surrogate time series of original chaotic sequences were generated by the randperm function in MATLAB which is based on the sorting of pseudorandom numbers generated by the Mersenne Twister.
Data availability
The data sets generated during the current study are available from the corresponding author on reasonable request.
Change history
05 April 2018
A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has not been fixed in the paper.
References
Jahns, J. & Lee, S. H. Optical Computing Hardware. (Academic Press, San Diego, 1994).
Larger, L. et al. Photonic information processing beyond Turing: an optoelectronic 3 implementation of reservoir computing. Opt. Express 20, 3241–3249 (2012).
Brunner, D., Soriano, M. C., Mirasso, C. R. & Fischer, I. Parallel photonic information processing at gigabyte per second data rates using transient states. Nat. Commun. 4, 1364 (2013).
Vandoorne, K. et al. Experimental demonstration of reservoir computing on a silicon photonics chip. Nat. Commun. 5, 3541 (2014).
Tsang, M. & Psaltis, D. Metaphoric optical computing of fluid dynamics. arXiv:physics/0604149v1 (2006).
Inagaki, T. et al. A coherent Ising machine for 2000node optimization problems. Science, doi:10.1126/science.aah4243 (2016).
Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction. (The MIT Press, Massachusetts, 1998).
Awerbuch, B. & Kleinberg, R. Online linear optimization and adaptive routing. J. Comput. Syst. Sci. 74, 97–114 (2008).
Agarwal, D., Chen, B. C. & Elango, P. Explore/exploit schemes for web content optimization. Proc. of ICDM 1–10, doi:10.1109/ICDM.2009.52 (2009).
Kroemer, O. B., Detry, R., Piater, J. & Peters, J. Combining active learning and reactive control for robot grasping. Robot. Auton. Syst. 58, 1105–1116 (2010).
Cheung, M. Y., Leighton, J. & Hover, F. S. Multiarmed bandit formulation for autonomous mobile acoustic relay adaptive positioning. In 2013 IEEE Intl. Conf. Robot. Auto. 4165–4170 (2013).
Kocsis, L. & Szepesvári, C. Bandit based Monte Carlo planning. Machine Learning: ECML (2006), LNCS 4212, 282–293, doi:10.1007/11871842_29 (2006).
Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489 (2016).
Robbins, H. Some aspects of the sequential design of experiments. B. Am. Math. Soc. 58, 527–535 (1952).
Lai, T. L. & Robbins, H. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6, 4–22 (1985).
Daw, N., O’Doherty, J., Dayan, P., Seymour, B. & Dolan, R. Cortical substrates for exploratory decisions in humans. Nature 441, 876–879 (2006).
Auer, P., CesaBianchi, N. & Fischer, P. Finitetime analysis of the multiarmed bandit problem. Machine Learning 47, 235–256 (2002).
Murphy, T. E. & Roy, R. The world’s fastest dice. Nat. Photon. 2, 714–715 (2008).
Saade, A., et al. Random projections through multiple optical scattering: Approximating Kernels at the speed of light. In IEEE International Conference on Acoustics, Speech and Signal Processing, March 20–25, 2016, Shanghai, China 6215–6219 (IEEE, 2016).
Kim, S.J., Naruse, M., Aono, M., Ohtsu, M. & Hara, M. Decision Maker Based on Nanoscale PhotoExcitation Transfer. Sci Rep. 3, 2370 (2013).
Nakagaki, T., Yamada, H. & Tóth, Á. Intelligence: Mazesolving by an amoeboid organism. Nature 407, 470–470 (2000).
Kim, S.J., Aono, M. & Hara, M. Tugofwar model for the twobandit problem: Nonlocally correlated parallel exploration via resource conservation. BioSystems 101, 29–36 (2010).
Kim, S.J., Aono, M. & Nameda, E. Efficient decisionmaking by volumeconserving physical object. New J. Phys. 17, 083023 (2015).
Naruse, M. et al. Decision making based on optical excitation transfer via nearfield interactions between quantum dots. J. Appl. Phys. 116, 154303 (2014).
Naruse, M. et al. Singlephoton decision maker. Sci. Rep. 5, 13253 (2015).
Pohl, D. W. & Courjon, D. Near Field Optics. (Kluwer, The Netherlands, 1993).
Naruse, M., Tate, N., Aono, M. & Ohtsu, M. Information physics fundamentals of nanophotonics. Rep. Prog. Phys. 76, 056401 (2013).
Eisaman, M. D., Fan, J., Migdall, A. & Polyakov, S. V. Singlephoton sources and detectors. Rev. Sci. Instrum. 82, 071101 (2011).
Kato, H., Kim, S.J., Kuroda, K., Naruse, M. & Hasegawa, M. The Design and Implementation of a Throughput Improvement Scheme based on TOW Algorithm for Wireless LAN. In Proc. 4th KoreaJapan Joint Workshop on Complex Communication Sciences, January 12–13, Nagano, Japan J13 (IEICE, 2016).
Naruse, M., Tate, N. & Ohtsu, M. Optical security based on nearfield processes at the nanoscale. J. Optics 14, 094002 (2012).
Sakuraba, R., Iwakawa, K., Kanno, K. & Uchida, A. Tb/s physical random bit generation with bandwidthenhanced chaos in threecascaded semiconductor lasers. Opt. Express 23, 1470–1490 (2015).
Soriano, M. C., GarcíaOjalvo, J., Mirasso, C. R. & Fischer, I. Complex photonics: Dynamics and applications of delaycoupled semiconductors lasers. Rev. Mod. Phys. 85, 421–470 (2013).
Ohtsubo, J. Semiconductor lasers: stability, instability and chaos (Springer, Berlin, 2012).
Uchida, A. Optical communication with chaotic lasers: applications of nonlinear dynamics and synchronization (WileyVCH, Weinheim, 2012).
Bilal, K., Malik, S. U. R., Khan, S. U. & Zomaya, A. Y. Trends and challenges in cloud datacenters. IEEE Cloud Computing 1, 10–20 (2014).
Brogaard, J., Hendershott, T. & Riordan, R. Highfrequency trading and price discovery. Rev. Financ. Stud. 27, 2267–2306 (2014).
Colet, P. & Roy, R. Digital communication with synchronized chaotic lasers. Opt. Lett. 19, 2056–2058 (1994).
Argyris, A. et al. Chaosbased communications at high bit rates using commercial fibreoptic links. Nature 438, 343–346 (2005).
AnnovazziLodi, V., Donati, S. & Scire, A. Synchronization of chaotic injectedlaser systems and its application to optical cryptography. IEEE J Quantum Electron. 32, 953–959 (1996).
Uchida, A. et al. Fast physical random bit generation with chaotic semiconductor lasers. Nat. Photon. 2, 728–732 (2008).
Kanter, I., Aviad, Y., Reidler, I., Cohen, E. & Rosenbluh, M. An optical ultrafast random bit generator. Nat. Photon. 4, 58–61 (2010).
Lin, F.Y. & Liu, J.M. Chaotic lidar. IEEE J. Sel. Top. Quantum Electron. 10, 991–997 (2004).
Fox, R. F., Gatland, I. R., Roy, R. & Vemuri, G. Fast, accurate algorithm for numerical simulation of exponentially correlated colored noise. Phys. Rev. A 38, 5938–5940 (1988).
Toomey, J. P. & Kane, D. M. Mapping the dynamic complexity of a semiconductor laser with optical feedback using permutation entropy. Opt. Express 22, 1713–1725 (2014).
Kim, S.J., Naruse, M., Aono, M., Hori, H. & Akimoto, T. Random walk with chaotically driven bias. Sci. Rep. 6, 38634 (2016).
Lam, W. S., Ray, W., Guzdar, P. N. & Roy, R. Measurement of Hurst exponents for semiconductor laser phase dynamics. Phys. Rev. Lett. 94, 010602 (2005).
HighFrequency Trading Is Nearing the Ultimate Speed Limit, MIT Technology Review. https://www.technologyreview.com/s/602135/highfrequencytradingisnearingtheultimatespeedlimit/ (Last access: 05/07/2017).
High Frequency Trading Turns to High Frequency Technology to Reduce Latency. http://www.recusa.com/press/High%20Frequency%20Trading%20Turns%20to%20High%20Frequency.pdf (Last access: 05/07/2017).
Ugajin, K. et al. Realtime fast physical random number generator with a photonic integrated circuit. Opt. Express 25, 6511–6523 (2017).
Aida, T. & Davis, P. Oscillation mode selection using bifurcation of chaotic mode transitions in a nonlinear ring resonator. IEEE J. Quantum Electron. 30, 2986–2997 (1994).
Nixon, M. et al. Controlling synchronization in large laser networks. Phys. Rev. Lett. 108, 214101 (2012).
Williams, C. R. S. et al. Experimental observations of group synchrony in a system of chaotic optoelectronic oscillators. Phys. Rev. Lett. 110, 064104 (2013).
Naruse, M. et al. Single Photon in Hierarchical Architecture for Physical Decision Making: Photon Intelligence. ACS Photonics 3, 2505–2514 (2016).
Kim, S.J., Naruse, M. & Aono, M. Harnessing the Computational Power of Fluids for Optimization of Collective Decision Making. Philosophies Special Issue ‘Natural Computation: Attempts in Reconciliation of Dialectic Oppositions’ 1, 245–260 (2016).
Acknowledgements
The authors thank Mr. Takatomo Mihana of Saitama University for his help in numerically generating coloured noise signals. This work was supported in part by the CoretoCore Program A. Advanced Research Networks and the GrantsinAid for Scientific Research (A) from the Japan Society for the Promotion of Science.
Author information
Authors and Affiliations
Contributions
M.N., A.U., and S.J.K. directed the project. M.N. designed the system architecture. Y.T. and A.U. designed and implemented the laser chaos. Y.T., A.U., and M.N. conducted the optical experiments and the data processing. M.N., A.U., and S.J.K. analysed the data, and M.N., A.U., and S.J.K. wrote the paper.
Corresponding author
Ethics declarations
Competing Interests
The authors declare that they have no competing interests.
Additional information
Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Naruse, M., Terashima, Y., Uchida, A. et al. Ultrafast photonic reinforcement learning based on laser chaos. Sci Rep 7, 8772 (2017). https://doi.org/10.1038/s41598017085858
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598017085858
This article is cited by

Deep neural networkbased phase calibration in integrated optical phased arrays
Scientific Reports (2023)

Harnessing microcombbased parallel chaos for random number generation and optical decision making
Nature Communications (2023)

Efficient dynamic channel assignment through laser chaos: a multiuser parallel processing learning algorithm
Scientific Reports (2023)

Decision making for largescale multiarmed bandit problems using bias control of chaotic temporal waveforms in semiconductor lasers
Scientific Reports (2022)

Entangled and correlated photon mixed strategy for social decision making
Scientific Reports (2021)
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.