Introduction

Unique physical attributes of photons have been utilized for information processing in the literature on optical computing1. New photonic processing principles have recently emerged to solve complex time-series prediction problems2,3,4, as well as problems in spatiotemporal dynamics5 and combinatorial optimization6, coinciding with the rapid shift to the age of artificial intelligence (AI). These novel approaches exploit the ultrahigh-bandwidth attributes of light waves and their enabling device technologies2,3,6. This paper experimentally demonstrates the usefulness of ultrafast chaotic oscillatory dynamics in semiconductor lasers for reinforcement learning, which is among the most important elements of machine learning.

Reinforcement learning involves decision making in dynamic and uncertain environments7. It forms the foundation of a variety of applications, such as information infrastructures8, online advertisements9, robotics10, transportation11, and Monte Carlo tree search12, which is used in computer gaming13. A fundamental problem in reinforcement learning is the multi-armed bandit problem (MAB), where the goal is to maximize the total reward from multiple slot machines whose reward probabilities are unknown7,14,15. To solve the MAB, one needs to explore for better slot machines. However, too much exploration may result in excessive loss, whereas too quick a decision, or insufficient exploration, may lead to overlooking the best machine. This trade-off is referred to as the exploration-exploitation dilemma7. A variety of algorithms for solving the MAB have been proposed in the literature, such as ε-greedy14, softmax16, and upper confidence bound17.

These approaches typically involve probabilistic attributes, especially for exploration purposes. While the implementation and improvement of such algorithms on conventional digital computing systems are important for various practical applications, understanding the limitations of these algorithms and investigating novel approaches are also important from the perspective of post-silicon computing. For example, pseudorandom number generation (RNG) used in conventional algorithmic approaches has severe limitations, such as its data rate, owing to the operating frequencies of digital processors (~gigahertz (GHz) range). Moreover, the quality of the randomness obtainable from pseudorandom generators is itself limited18. The usefulness of photonic random processes for machine learning has also been discussed in the context of multiple optical scattering19.

We consider that directly utilizing irregular physical processes in nature is an exciting approach toward realizing artificially constructed, physical decision-making machines20. Indeed, the intelligence of slime moulds or amoebae, single-celled natural organisms, has been used in solution searches, whereby complex intracellular spatiotemporal dynamics play a key role21. This has stimulated the subsequent discovery of a new decision-making principle called tug of war (TOW), invented by Kim et al.22,23. The TOW principle originated from observations of the slime mould: its body dynamically expands and shrinks while maintaining a constant intracellular-resource volume, allowing the mould to collect environmental information; this conservation of volume entails a nonlocal correlation within the body. The fluctuation, or probabilistic behaviour, in the body of the amoeba is important for the exploration of better solutions. The name TOW is a metaphor for such a nonlocal correlation that accommodates fluctuation, which enhances decision-making performance23.

This principle can be adapted to photonic processes. In past research, we experimentally showed physical decision making based on near-field-mediated optical excitation transfer at the nanoscale20,24 and with single photons25. These former studies pursued the ultimate physical attributes of photons in terms of diffraction-limit-free spatial resolution and energy efficiency by considering near-field photons26,27, and the quantum attributes of single light quanta28. The nonlocal aspect of TOW is directly physically represented by the wave nature of a single photon or an exciton-polariton, whereas fluctuation is directly represented by the intrinsic probabilistic attributes. However, the achievable speed is limited by practical constraints on the measurement and control systems (on the order of seconds in the worst case) as well as by the single-photon generation rate (kHz range).

The ultrafast, high-bandwidth nature of light waves, for instance beyond the THz order at the ~1.5-μm wavelengths used in optical communications, provides another promising physical platform for TOW-based decision making, complementing the diffraction-limit-free, low-energy near-field photon approaches as well as the quantum-level single-photon strategies. As demonstrated below, chaotic oscillatory laser dynamics containing negative autocorrelation experimentally enables 1-GHz decision making. In addition to the resulting speed merit, it should be emphasized that the technological maturity of ultrafast photonic devices allows relatively easy and scalable system implementation with commercially available photonic devices. Furthermore, the applications of the proposed ultrafast photonics-based decision making and the former near-field/single-photon-based decision making are complementary: the former targets high-end, data centre scenarios by highlighting ultrafast performance, whereas the latter appeals to low-energy, Internet-of-Things-related29, and security30 applications.

In this study, we demonstrate ultrafast reinforcement learning based on chaotic oscillatory dynamics in semiconductor lasers31,32,33,34 that yields adaptation from zero prior knowledge at GHz rates. The randomness is based on complex laser dynamics32,33,34, and the resulting speed is unachievable by other mechanisms, at least through technologically reasonable means. We experimentally show that ultrafast photonics has significant potential for reinforcement learning. The proposed principles using ultrafast temporal dynamics are matched to applications including the arbitration of resources at data centres35, high-frequency trading36, where decisions are required within milliseconds at the latest, and other such high-end utilities. Scientifically, this study paves the way toward understanding the physical origin of the enhancement of intelligent abilities (reinforcement learning herein) when natural processes (laser chaos herein) are coupled with external systems; this is what we call natural intelligence.

Chaotic dynamics in lasers has been examined in the literature32,33,34, and its applications have exploited the ultrafast attributes of photonics for secure communication37,38,39, RNG31,40,41, remote sensing42, and reservoir computing2,3,4. Reservoir computing is a type of neural network, related to deep learning13, that has been intensively studied to provide recognition- and prediction-related functionalities. The reinforcement learning described in this study differs completely from reservoir computing in that neither a virtual network nor machine learning of output weights is required. However, it should be noted that reinforcement learning is important in complementing the capabilities of neural networks, indicating the potential for the fusion of photonic reservoir computing with photonic reinforcement learning in future work.

Principle of reinforcement learning

For the simplest case that preserves the essence of the MAB, we consider a player who selects one of two slot machines, called slot machines 1 and 2 hereafter, with the goal of maximizing the reward (known as the two-armed bandit problem). Denoting the reward probabilities of the slot machines by Pi (i = 1, 2), the problem is to select the machine with the higher reward probability. The amount of reward dispensed by each slot machine per play is assumed to be the same in this study. That is, the probability of a ‘win’ when playing slot machine i is Pi, and the probability of a ‘lose’ is 1 − Pi. The probabilities of ‘win’ and ‘lose’ sum to unity for a particular slot machine, whereas the sum of Pi over all slot machines need not be unity.
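As a minimal illustration (with hypothetical reward probabilities, not those of the experiment), a single play of slot machine i can be modelled as a Bernoulli trial:

```python
import random

def play(reward_probability):
    """One play of a slot machine: returns True ('win') with the given probability."""
    return random.random() < reward_probability

# Hypothetical two-armed bandit with P1 = 0.8 and P2 = 0.2
P = [0.8, 0.2]
outcome = play(P[0])   # 'win' with probability 0.8, 'lose' otherwise
```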

The measured chaotic signal s(t) is subjected to the threshold adjuster (TA), according to the TOW principle. The output of the TA is immediately the decision concerning the slot machine to choose. If s(t) is equal to or greater than the threshold value T(t), the decision is made to select slot machine 1. Otherwise, the decision is made to select slot machine 2. The reward—the win/lose information of a slot machine play—is fed back to the TA.

The chaotic signal level s(t) is compared with the threshold value T(t) denoted by

$$T(t)=k\times \lfloor TA(t)\rfloor \,,$$
(1)

where TA(t) is the threshold adjuster value at cycle t, \(\lfloor TA(t)\rfloor \) is the nearest integer to TA(t) rounded toward zero, and k is a constant determining the range of the resultant T(t). In this study, we assumed that \(\lfloor TA(t)\rfloor \) takes the values \(-N,\ldots -1,0,1,\ldots ,N\), where N is a natural number. Hence, the number of threshold levels is 2N + 1, referred to as the TA resolution. The range of T(t) is limited to between −kN and kN by setting T(t) = kN when \(\lfloor TA(t)\rfloor \) is greater than N, and T(t) = −kN when \(\lfloor TA(t)\rfloor \) is smaller than −N.
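A minimal sketch of this quantization and of the selection rule described above is given below (the function names are illustrative; rounding toward zero is implemented here as truncation, and k = 0.05 and N = 10 correspond to the example of Fig. 1):

```python
import math

def threshold(TA, k=0.05, N=10):
    """Decision threshold T(t) of Eq. (1): k times TA(t) rounded toward zero,
    limited to the 2N + 1 levels between -N and N."""
    level = max(-N, min(N, math.trunc(TA)))
    return k * level

def decide(s, T):
    """Select slot machine 1 if the signal level s(t) >= T(t), otherwise machine 2."""
    return 1 if s >= T else 2
```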

If the selected slot machine yields a reward at cycle t (in other words, the play is won), the TA value is updated at cycle t + 1 based on

$$\begin{array}{cc}TA(t+1)=-{\Delta }+\alpha TA(t) & {\rm{if}}\,{\rm{slot}}\,{\rm{machine}}\,{\rm{1}}\,{\rm{wins}}\\ TA(t+1)=+{\Delta }+\alpha TA(t) & {\rm{if}}\,{\rm{slot}}\,{\rm{machine}}\,{\rm{2}}\,{\rm{wins}}\end{array},$$
(2)

where α is referred to as the forgetting (memory) parameter20, and Δ is the constant increment (in this experiment, Δ = 1 and α = 0.999). In this study, the initial TA value was zero. If the selected machine does not yield a reward (or loses in the slot machine play), the TA value is updated as follows:

$$\begin{array}{cc}TA(t+1)=+{\Omega }+\alpha TA(t) & {\rm{if}}\,{\rm{slot}}\,{\rm{machine}}\,{\rm{1}}\,{\rm{fails}}\\ TA(t+1)=-{\Omega }+\alpha TA(t) & {\rm{if}}\,{\rm{slot}}\,{\rm{machine}}\,{\rm{2}}\,{\rm{fails}}\end{array},$$
(3)

where Ω is the increment parameter defined below. Intuitively speaking, the TA takes a smaller value if slot machine 1 is considered more likely to win, and a greater value if slot machine 2 is considered more likely to earn the reward. This is as if the TA value is being pulled by the two slot machines at both ends, coinciding with the notion of a tug of war.
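Equations (2) and (3) correspond to the following update, sketched here with illustrative names (Δ = 1 and α = 0.999 as in the experiment; Ω is the loss increment of Eq. (5) below):

```python
def update_TA(TA, machine, win, Omega, Delta=1.0, alpha=0.999):
    """Update the threshold adjuster after playing `machine` (1 or 2)."""
    if win:
        step = -Delta if machine == 1 else +Delta   # Eq. (2): pull toward the winner
    else:
        step = +Omega if machine == 1 else -Omega   # Eq. (3): push away on a loss
    return step + alpha * TA
```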

The fluctuation, necessary for exploration, is realized by associating the TA value with the threshold of digitization of the chaotic signal train. If the chaotic signal level s(t) is equal to or greater than the assumed threshold T(t), the decision is immediately made to choose slot machine 1; otherwise, the decision is made to select slot machine 2. Initially, the threshold is zero; hence, the probability of choosing either slot machine 1 or 2 is 0.5. As time elapses, the TA value shifts (becomes positive or negative) towards the slot machine with the higher reward probability based on the dynamics shown in Eqs (2) and (3). We should note that due to the irregular nature of the incoming chaotic signal, the possibility of choosing the opposite machine is not zero, and this is a critical feature of exploration in reinforcement learning. For example, even when the TA value is sufficiently small (meaning that slot machine 1 seems highly likely to be the better machine), the probability of the decision to choose slot machine 2 is not zero.

In TOW-based decision making, the increment parameter Ω in Eq. (3) is determined based on the history of betting results. Let the number of times slot machine i has been selected up to cycle t be Si, and the number of wins obtained in selecting slot machine i be Li. The estimated reward probabilities of slot machines 1 and 2 are then given by

$${\hat{P}}_{1}=\frac{{L}_{1}}{{S}_{1}},{\hat{P}}_{2}=\frac{{L}_{2}}{{S}_{2}}.$$
(4)

Ω is then given by

$$\Omega =\frac{{\hat{P}}_{1}+{\hat{P}}_{2}}{2-({\hat{P}}_{1}+{\hat{P}}_{2})}.$$
(5)

The initial Ω is assumed to be unity, and a constant value is assumed when the denominator of Eq. (5) is zero. The detailed derivation of Eq. (5) is shown in ref.23.
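A minimal sketch of Eqs (4) and (5) follows (illustrative names; returning unity when a machine has not yet been played, or when the denominator vanishes, is an assumption consistent with the text, which only specifies that a constant is used in the latter case):

```python
def estimate_omega(S1, L1, S2, L2):
    """Omega from the betting history: S_i plays and L_i wins of machine i."""
    if S1 == 0 or S2 == 0:
        return 1.0                          # initial Omega is unity
    p1_hat, p2_hat = L1 / S1, L2 / S2       # Eq. (4): estimated reward probabilities
    denom = 2.0 - (p1_hat + p2_hat)
    if denom == 0.0:
        return 1.0                          # assumed constant when the denominator is zero
    return (p1_hat + p2_hat) / denom        # Eq. (5)
```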

Results

The architecture of laser chaos-based reinforcement learning is schematically shown in Fig. 1a. A semiconductor laser is coupled with a polarization-maintaining (PM) coupler. The emitted light is incident on a variable fibre reflector, by which a delayed optical feedback is supplied to the laser, leading to laser chaos40. The output light at the other end of the PM coupler is detected by a high-speed, AC-coupled photodetector through an optical isolator (ISO) and an attenuator, and is sampled by a high-speed digital oscilloscope at a rate of 100 GSample/s (a 10-ps sampling interval). The detailed specifications of the experimental apparatus are described in the Methods section.

Figure 1

Architecture of photonic reinforcement learning based on laser chaos. (a) Architecture and experimental configuration of laser chaos-based reinforcement learning. Ultrafast chaotic optical signal is subjected to the tug-of-war (TOW) principle that determines the selection of slot machines. ISO: optical isolator. (b) (i) A snapshot of chaotic signal trains sampled at 100 GSample/s. The signal level is subjected to a threshold adjustment (TA) for decision making. (ii) A snapshot of quasiperiodic signal trains (also sampled at 100 GSample/s). (iii) A snapshot of pseudorandom signals (coloured noise containing negative autocorrelation) numerically generated by a computer. (c) Optical spectrum and (d) RF spectrum of the laser chaos used in the experiment. (e) Incidence statistics (histogram) of the signal level of the laser chaos signal.

Figure 1b(i) shows an example of the chaotic signal train. Figure 1c and d show the optical and radio frequency (RF) spectra of the laser chaos measured by optical and RF spectrum analysers, respectively. The semiconductor laser was operated at a centre wavelength of 1547.785 nm. The standard bandwidth31 of the RF spectrum was estimated as 10.8 GHz. Figure 1e summarizes the histogram of the signal levels of the chaotic trains, which span from −0.5 to 0.5, with the maximum incidence at level zero and a slight skew between the positive and negative sides. We remark that the small incidence peak at the lowest measured amplitude is due to our experimental apparatus (not the laser) and does not critically affect the present study. Meanwhile, Fig. 1b(ii) exhibits an example of a quasiperiodic signal train generated from the same laser by changing the optical feedback power from the external reflector (see the Methods section for details). In addition, Fig. 1b(iii) shows an example of a coloured noise signal train containing negative autocorrelation, calculated in a computer on the basis of the Ornstein–Uhlenbeck process using white Gaussian noise and a low-pass filter43 with a cut-off frequency of 10 GHz. Relevant details and performance comparisons with respect to decision making are discussed later.
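For reference, a minimal numerical sketch of such a coloured-noise generator is shown below: white Gaussian noise passed through a first-order low-pass filter, i.e., a discrete analogue of the Ornstein–Uhlenbeck construction43. The 100-GHz sampling rate and 10-GHz cut-off follow the Methods section, whereas the particular one-pole filter and its coefficient are assumptions; the exact filter used in the study may differ.

```python
import numpy as np

def coloured_noise(n_points, fs=100e9, f_cut=10e9, seed=0):
    """White Gaussian noise filtered by a one-pole low-pass filter (discrete OU process)."""
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(n_points)
    a = np.exp(-2.0 * np.pi * f_cut / fs)   # one-pole coefficient for the 10-GHz cut-off
    y = np.empty(n_points)
    y[0] = white[0]
    for n in range(1, n_points):
        y[n] = a * y[n - 1] + (1.0 - a) * white[n]
    return y
```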

The rounded TA values, \(\lfloor TA(t)\rfloor \), are also schematically illustrated on the right-hand side of Fig. 1b and the upper side of Fig. 1e, assuming N = 10. This means that \(\lfloor TA(t)\rfloor \) ranges from −10 to 10. The signal value s(t) spans between −0.5 and 0.5. The particular example of TA values shown in Fig. 1b and e corresponds to the case when the constant k in Eq. (1) is 0.05, so that the actual threshold T(t) spans from −0.5 to 0.5. In the experimental demonstration in this study, the TA and the slot machines were emulated in offline processing, although online processing is technologically feasible owing to the simplicity of the TA procedure (remarks are given in the Discussion section).

[EXPERIMENT-1] Adaptation to sudden environmental changes

We first solved the two-armed bandit problems given by the following two cases, where the reward probabilities {P1, P2} were given by {0.8, 0.2} and {0.6, 0.4}. We assumed that the sum of the reward probabilities P1 + P2 was known prior to the slot machine plays. With this knowledge (P1 + P2 = 1), Ω is unity by Eq. (5).

The slot machine was consecutively played 4,000 times, and this set of plays was repeated 100 times. The red and blue curves in Fig. 2a show the evolution of the correct decision rate (CDR), defined as the fraction of the 100 trials in which the machine with the higher reward probability was selected at cycle t, for the probability combinations {0.8, 0.2} and {0.6, 0.4}, respectively. The chaotic signal was sampled every 10 ps; hence, the total duration of the 4,000 plays of the slot machine was 40 ns. To represent sudden environmental changes (or uncertainty), the reward probabilities were forcibly interchanged every 10 ns, or every 1,000 plays (for example, {P1, P2} = {0.8, 0.2} was reconfigured to {P1, P2} = {0.2, 0.8}). The resolution of the TA was set to 9 (or N = 4).
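To make the protocol concrete, the following sketch simulates one trial of this experiment using the `threshold` and `update_TA` functions sketched in the Principle section. A uniformly distributed synthetic signal stands in for the measured chaotic trace, and k = 0.125 is an assumed value chosen so that the nine threshold levels (N = 4) cover the signal range from −0.5 to 0.5; Ω is fixed at unity because P1 + P2 = 1 is assumed known.

```python
import numpy as np

def run_trial(signal, P=(0.8, 0.2), swap_every=1000, k=0.125, N=4,
              Delta=1.0, alpha=0.999, Omega=1.0, seed=0):
    """One EXPERIMENT-1 trial: one play per signal sample, with the reward
    probabilities swapped every `swap_every` plays. Returns, for each cycle,
    whether the machine with the higher reward probability was selected."""
    rng = np.random.default_rng(seed)
    P = list(P)
    TA = 0.0
    correct = np.zeros(len(signal), dtype=bool)
    for t, s in enumerate(signal):
        if t > 0 and t % swap_every == 0:
            P.reverse()                                   # sudden environmental change
        machine = 1 if s >= threshold(TA, k, N) else 2
        win = rng.random() < P[machine - 1]
        TA = update_TA(TA, machine, win, Omega, Delta, alpha)
        correct[t] = P[machine - 1] == max(P)
    return correct

# CDR(t): average over 100 repeated trials (here driven by synthetic signals)
trials = [run_trial(np.random.default_rng(i).uniform(-0.5, 0.5, 4000), seed=i)
          for i in range(100)]
CDR = np.mean(trials, axis=0)
```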

Figure 2

Reinforcement learning in dynamically changing environments. (a) Evolution of the correct decision rate (CDR) when the reward probabilities of the two slot machines are {0.8, 0.2} and {0.6, 0.4}. The sum of the reward probabilities, which is unity in these cases, is assumed to be known. The reward probability is intentionally swapped every 10 ns to represent sudden environmental changes or uncertainty. Rapid and adequate adaptation is observed in both cases. (b) Evolution of the threshold adjuster (TA) value underlying correct decision making. (c–e) Dependence of the CDR performance on the TA settings: (c) TA resolution, (d) TA range, and (e) TA centre value.

We observed that the CDR curves quickly approached unity, even after sudden changes in the reward probabilities, demonstrating successful decision making. The adaptation was steeper for the reward probability combination {0.8, 0.2} than for {0.6, 0.4}, since the difference in the reward probabilities was greater in the former case (0.8 − 0.2 = 0.6) than in the latter (0.6 − 0.4 = 0.2), meaning that decision making was easier. The red and blue curves in Fig. 2b represent the evolution of the TA value (TA(t)) in the cases of {0.8, 0.2} and {0.6, 0.4}, respectively, where the TA value grew beyond 4 or below −4; hence, it took a long time for the TA value to be inverted to the opposite polarity following an environmental change. This is the origin of the delay in the CDR response to the environmental changes observed in Fig. 2a. A simple method of improving adaptation is to limit the maximum and minimum values of TA(t). The forgetting parameter α is also crucial to improving the adaptation speed.

Figure 2c–e characterize the dependence of decision-making performance on the configuration of the TA. Figure 2c concerns the TA resolution, demonstrated by the eight curves therein with TA resolutions ranging from 3 to 257 (or \(N={2}^{i-1}\), \(i=1,\ldots ,8\)); adaptation was quicker with lower resolutions than with higher ones. Figure 2d considers the TA range dependence: while keeping the centre of the TA value at zero and the TA resolution at 8, the three curves in Fig. 2d compare CDRs for TA ranges of 1, 0.25, and 0.125. We observed that full coverage of the chaotic signal (a TA range of 1) yields the best performance. Figure 2e shows the dependence on the TA centre value while maintaining a TA range of 1 and a resolution of 8; the CDR clearly deteriorates as the centre of the TA shifts from zero.

[EXPERIMENT-2] Adaptation from zero prior knowledge

Here we considered decision-making problems without any prior knowledge of the slot machines. Hence, the parameter Ω needed to be updated. The reward probabilities of the slot machines were set to {P1, P2} = {0.5, 0.1}, and 100 consecutive plays were executed. Figure 3a shows the time evolution of the CDR up to the 75th play cycle to highlight the early phase of adaptation. The red and blue curves in Fig. 3a depict CDRs based on chaotic signal trains with sampling intervals of 10 and 50 ps, respectively. Hence, the time needed to complete 100 consecutive slot machine plays differed between them, ranging from \(10\,{\rm{ps}}\times 100=1\,{\rm{ns}}\) to \(50\,{\rm{ps}}\times 100=5\,{\rm{ns}}\). Meanwhile, the green curve in Fig. 3a represents the CDR with quasiperiodic signal trains sampled at 50-ps intervals. We observed that the CDR based on chaotic signals sampled at 50-ps intervals exhibited more rapid adaptation than that based on quasiperiodic signals. The time evolution of the CDR for the coloured noise signal is shown by the magenta curve in Fig. 3a, which stays lower than those of the experimentally observed chaotic signal trains.

Figure 3

Reinforcement learning from zero prior knowledge. (a) Evolution of the CDR with laser chaos signals (sampling intervals: 10 and 50 ps), quasiperiodic signals (sampling interval: 50 ps), coloured noise (sampling interval: 50 ps), and uniformly and normally distributed pseudorandom numbers. The CDR exhibits prompt adaptation with laser chaos when the sampling interval is 50 ps. (b) The CDR evaluated as a function of the sampling interval from 10 ps to 4000 ps, where the maximum performance is obtained at 50 ps with laser chaos. The CDR with chaotic lasers yields superior performance compared with quasiperiodic and coloured noise signals. (c) Autocorrelation of the experimentally observed laser chaos, the quasiperiodic signal trains, and the numerically generated coloured noise. The chaotic signals exhibit their maximum negative autocorrelation at a time lag of 5 or −5, exactly coinciding with the fact that the optimal adaptation is realized at a 50-ps (10 ps × 5) sampling interval.

Moreover, the black curve shows the CDR obtained with uniformly distributed pseudorandom numbers generated by the Mersenne Twister as the random signal source, instead of the experimentally observed chaotic signals. In addition, we characterized CDRs with normally distributed random numbers (referred to as RANDN) to confirm that the statistical incidence pattern of the laser chaos, which resembles a normal distribution as shown in Fig. 1e, was not the origin of the fast adaptation of decision making. The solid, dotted, and dashed curves in Fig. 3a show the CDRs obtained with RANDN whose standard deviation (σ) was configured as 0.2, 0.1, and 0.01, respectively, while keeping the mean value at zero. We observed that the CDRs based on chaotic signals exhibited more rapid adaptation than those based on the uniformly and normally distributed pseudorandom numbers.

The most prompt adaptation, or the optimal performance, was obtained at a particular sampling interval. The blue square marks in Fig. 3b compare CDRs at cycle t = 5 as a function of the sampling interval ranging from 10 ps to 4000 ps. The sampling interval of 50 ps yielded the best performance, indicating that the original chaotic dynamics of the laser could physically be optimized such that the most prompt decision making is realized. Indeed, the autocorrelation of the laser chaos signal trains was evaluated as shown by the blue square marks in Fig. 3c, and its maximum negative value occurs at a time lag of 5 or −5, corresponding exactly to the sampling interval of 50 ps (5 × 10 ps). In other words, the negative correlation of chaotic dynamics affects the exploration ability for decision making. Furthermore, this finding suggests that optimal performance is obtained at a particular sampling rate (or data rate) by physically tuning the dynamics of the original laser chaos, which will be an important and exciting topic for future investigation. The adaptation speed of decision making was estimated as 1 GHz in this optimal case, where the CDR was larger than 0.95 at 20 cycles with a 50-ps sampling interval (20 GSample/s) in Fig. 3a (1 GHz = 1/(50 ps × 20 cycles)). In other words, the latency of decision making is approximately 1 ns, and the throughput is 20 GDecision/s.
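The autocorrelation analysis behind this observation can be sketched as follows; `chaos` is a hypothetical array holding one 100-GSample/s trace, and the lag is expressed in units of the 10-ps sampling step.

```python
import numpy as np

def autocorrelation(x, max_lag=20):
    """Normalized autocorrelation of a signal for lags 0..max_lag (mean removed)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:x.size - lag], x[lag:]) / denom
                     for lag in range(max_lag + 1)])

# acf = autocorrelation(chaos)
# lag = 1 + np.argmin(acf[1:])   # lag of the most negative autocorrelation (5 in Fig. 3c)
# decimated = chaos[::lag]       # e.g. every 5th point, i.e. a 50-ps sampling interval
```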

Meanwhile, the autocorrelation of the quasiperiodic signal train also takes negative values, as shown by the green circular marks in Fig. 3c. In fact, the absolute value of the negative autocorrelation at the time lag of 5 (and −5) is larger than that of the chaotic signal train. However, the adaptation performance of the chaotic signal trains is superior (see Fig. 3b). At the same time, the best adaptation performance of the quasiperiodic signal train is obtained when the sampling interval is 70 ps (see Fig. 3b), which coincides with the maximum negative autocorrelation at a time lag of 7 (= 70 ps) (see Fig. 3c).

Another implication of these observations is that coloured noise containing negative autocorrelation might also contribute to performance improvement. We generated the coloured noise signal by computer simulation; its autocorrelation is shown by the diamond marks in Fig. 3c, where the negative maximum is obtained at a time lag of 7 (sampling interval: 70 ps). The sampling-interval dependence of the CDR is summarized by the diamond marks in Fig. 3b, where a peak is observed at a sampling interval of 70 ps, again coinciding with the peak of negative autocorrelation. These results indicate the need for deeper insights in future studies, as discussed in the Discussion section.

Figure 4 also presents the CDR at cycle t = 5 obtained with the chaotic signal, quasiperiodic time series, coloured noise signal, and pseudorandom numbers. Moreover, the CDR at cycle t = 5 was evaluated by surrogating, or randomly permuting, the chaotic signal time series sampled at 50-ps intervals, which resulted in poorer performance compared with the original chaotic signal, as shown in Fig. 4. These evaluations support the claim that laser chaos is beneficial to the performance of reinforcement learning, in addition to providing an ultrafast data rate for random signals.
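A surrogate sequence of this kind preserves the amplitude histogram while destroying the temporal correlation; a minimal sketch is a random permutation of the sampled trace (the study used MATLAB's randperm, and `chaos_50ps` below is a hypothetical array of the 50-ps-sampled signal):

```python
import numpy as np

rng = np.random.default_rng(0)
surrogate = rng.permutation(chaos_50ps)   # randomly reordered copy of the sampled trace
```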

Figure 4

Comparison of learning performance (CDR at the cycle t = 5) by laser chaos, quasiperiodic signals, coloured noise signals, uniformly and normally distributed pseudorandom numbers, and surrogate laser chaos signals. The laser chaos signal sampled at 50-ps intervals exhibits the best performance compared with other cases, showing that the dynamics of laser chaos positively influences reinforcement learning ability.

Discussion

The performance enhancement of decision making by chaotic laser dynamics has been demonstrated, and the impact of negative autocorrelation is clearly suggested. A deeper understanding of the relation between chaotic oscillatory dynamics and decision making is an important part of future research.

The first consideration involves physical insights. Toomey et al. recently showed that the complexity of laser chaos varies within the coherence collapse region in the given system44. The level of the optical feedback and the injection current of the laser are important parameters in determining the complexity of chaos, and examining them will lead to more thorough insights. We also consider the use of the bandwidth enhancement technique31 with optically injected lasers to raise the adaptation speed of decision making beyond tens of GHz.

We showed the exact coincidence between the time lag that yielded the maximum negative autocorrelation and the sampling interval that provided the highest adaptation performance with the laser chaos signals (Fig. 3b). The absolute value of the negative autocorrelation itself is, however, larger for the quasiperiodic signals than for the chaotic signals (Fig. 3c). Nevertheless, superior adaptation performance is realized with the chaotic signals (Figs 3b and 4). These observations indicate that, besides the negative autocorrelation inherent in the chaotic and quasiperiodic oscillatory dynamics of lasers and in the computer-generated coloured noise, other perspectives, such as diffusivity45 and Hurst exponents46, could explain the underlying mechanism.

The decision-making latency of 1 ns demonstrated by chaotic lasers in this study may exceed the performance demanded by practical applications today. However, high-end applications require ever shorter latencies; for example, microsecond- and nanosecond-scale applications are discussed in the literature47,48. Meanwhile, fully electrical online processing using multiple processors has difficulty reducing latency while maintaining comparable throughput; such architectural considerations remain an important aspect of future studies.

Online processing can become feasible through electronic circuitry, as already demonstrated for random-bit generators40,49, since the required processing is very simple, as described in Eqs (1) to (5). More specifically, combining a demultiplexer with multiple data-processing circuits (such as field-programmable gate array chips) could provide the requisite online processing capability. In this study, in contrast, the prime interest is to present the principle and usefulness of laser chaos for reinforcement learning, and we did not experimentally implement the electronic circuits for online processing, which would entail substantial development costs. In addition, the performance comparison between laser chaos signals and numerically generated coloured noise signals shows the usefulness of directly using physical signal sources, beyond the advantages in terms of latency and technological implementation in the high-speed domain.

In the experimental demonstration, the nonlocal aspect of the TOW principle was not directly physically embodied in the chaotic oscillatory time series of the laser, whereas our former single-photon-based decision maker directly utilizes the relevant physical property (the wave–particle duality of single photons) for decision making, as mentioned in the introduction. By combining the chaotic dynamics with the threshold adjuster, the nonlocal and fluctuation properties of the TOW principle emerge in a form that has not previously been reported in the literature nor realized in our past experimental studies24,25. Such a hybrid realization of nonlocality in the TOW leads to a higher likelihood of technological implementability, better scalability, and extension to higher-grade problems.

All-optical realization is an interesting issue to explore, and is already implied by the analysis showing that the time-domain correlation of laser chaos strongly influences decision-making performance. The mode dynamics of multi-mode lasers50 are very promising for implementing the nonlocal properties required for decision making in fully photonic systems. Synchronization and its clustering properties in coupled laser networks51,52 are also interesting approaches to physically realizing the nonlocality of the TOW. Such systems could automatically tune the optimal settings for decision making (e.g., the negative autocorrelation properties), which could lead to autonomous photonic intelligence.

For scalability, a variety of approaches can be considered, such as time-domain multiplexing, which exploits the ultrafast attributes of chaotic lasers and is a frequently used strategy in ultrafast photonic systems. Introducing multiple threshold values31,41 is another simple extension of the proposed scheme. Furthermore, the relation between the difficulty level of the given MAB problem and the necessary irregularity of the chaotic signal trains would be an interesting topic for future study. In this respect, Naruse et al. experimentally examined a hierarchical method and successfully solved four-armed bandit problems using single photons53; the time-domain realization of such a hierarchical approach based on laser chaos may highlight the uniqueness of the chaotic oscillatory dynamics of lasers.

We also note the extension of the principle demonstrated in this paper to higher-grade machine learning problems. This paper studied the MAB, where the problem is decision making that maximizes the reward of a single player. A higher-grade problem in this context is the competitive MAB54, where the issue is decision making that maximizes the total reward of multiple players as a whole. The competitive MAB involves the so-called Nash equilibrium, where decisions based on individual reward maximization do not yield the maximum reward as a whole. This is the foundation of important applications such as resource allocation and social optimization. Investigating the possibility of extending the present method and utilizing ultrafast laser dynamics for the competitive MAB is highly interesting.

Conclusion

We experimentally established that laser chaos provides ultrafast reinforcement learning and decision making. The adaptation speed of decision making reached 1 GHz in the optimal case with a sampling rate of 20 GSample/s (50-ps decision-making intervals), using the ultrafast dynamics inherent in laser chaos. The maximum adaptation performance coincided with the maximum negative autocorrelation of the original time-domain laser chaos sequences, demonstrating the strong impact of chaotic lasers on decision making. The origin of the superior performance was also validated by comparison with experimentally observed quasiperiodic signals, computer-generated coloured noise, uniformly and normally distributed pseudorandom numbers, and surrogate arrangements of the original chaotic signal trains. To the best of our knowledge, this study is the first demonstration of ultrafast photonic reinforcement learning or decision making, and it paves the way for research on photonic intelligence and new applications of chaotic lasers in the realm of artificial intelligence.

Methods

Optical system

The laser used in the experiment was a distributed feedback semiconductor laser mounted on a butterfly package with optical fibre pigtails (NTT Electronics, KELD1C5GAAA). The injection current of the semiconductor laser was set to 58.5 mA (5.37 Ith), where the lasing threshold Ith was 10.9 mA. The relaxation oscillation frequency of the laser was 6.5 GHz. The temperature of the semiconductor laser was set to 294.83 K. The laser output power was 13.2 mW. The laser was connected to a variable fibre reflector that reflected a fraction of the light back into the laser, inducing high-frequency chaotic oscillations of the optical intensity32,33,34. The optical feedback powers (ratios) from the external reflector to the laser were 211 μW (1.6%) and 15 μW (0.11%) for generating the chaotic and quasiperiodic signals, respectively. The fibre length between the laser and the reflector was 4.55 m, corresponding to a feedback delay time (round trip) of 43.8 ns. Polarization-maintaining fibres were used for all optical fibre components. The optical output was converted to an electronic signal by a photodetector (New Focus, 1474-A, 38 GHz bandwidth) and sampled by a digital oscilloscope (Tektronix, DPO73304D, 33 GHz bandwidth, 100 GSample/s, eight-bit vertical resolution). The RF spectrum of the laser was measured by an RF spectrum analyser (Agilent, N9010A-544, 44 GHz bandwidth). The optical wavelength of the laser was measured by an optical spectrum analyser (Yokogawa, AQ6370C-20).

Data analysis

[EXPERIMENT-1]

A chaotically oscillating signal train was sampled at a rate of 100 GSample/s over 10,000,000 points, corresponding to approximately 100 μs. As described in the main text, 4,000 consecutive plays were repeated 100 times; hence, the total number of slot machine plays was 400,000. With 10-ps-interval sampling, the initial 400,000 points of the chaotic signal were used for the decision-making experiments. The processing time required for 400,000 iterations of slot machine plays was approximately 0.93 s (or 2.3 μs/decision) on a personal computer (Hewlett-Packard, Z-800, Intel Xeon CPU, 3.33 GHz, 48 GB RAM, Windows 7, MATLAB R2011b).

[EXPERIMENT-2]

(1) Sampling methods: A chaotic signal train was sampled at 10-ps intervals with 10,000,000 sampling points. Such a train was measured 120 times. Each chaotic signal train is referred to as chaos_i, and there were 120 such trains (\(i=1,\ldots \,,120\)). In demonstrating 10 × M ps sampling intervals, where M was a natural number ranging from 1 to 400 (that is, the sampling intervals were 10 ps, 20 ps, …, 4000 ps), we chose one of every M samples from the original sequence.

(2) Evaluation of the CDR for a specific chaotic sequence: For every chaotic signal train chaos_i, 100 consecutive plays were repeated 100 times. Consequently, 10,000 points were used from chaos_i. Such evaluations were repeated 100 times. Hence, 1,000,000 slot machine plays were conducted in total. These CDRs were calculated for all signal trains chaos_i (\(i=1,\ldots \,,120\)).

(3) Evaluation of the CDR over all chaotic sequences: We evaluated the average CDR of all chaotic signal trains (\(i=1,\ldots \,,120\)) derived in (2) above, which are the results discussed in the main text. The methods of evaluating the CDR with respect to the experimentally observed quasiperiodic signals, RAND (uniformly distributed pseudorandom numbers), and RANDN were the same as those described in (2) and (3).

(4) Autocorrelation of chaotic signals: The autocorrelation was computed based on all 10,000,000 sampling points of chaos_i, and was evaluated for all chaos_i (\(i=1,\ldots \,,120\)). The autocorrelation shown in Fig. 3c is the average of these 120 autocorrelations. The autocorrelation of the quasiperiodic signals was evaluated in the same manner.

(5) Coloured noise: Coloured noise was calculated on the basis of the Ornstein–Uhlenbeck process using white Gaussian noise and a low-pass filter in numerical simulations43. We assumed that the coloured noise was generated at a sampling rate of 100 GHz, and the cut-off frequency of the low-pass filter was set to 10 GHz (the correlation time was 100 ps). Forty sequences of 10,000,000 points were generated. The methods of calculating the CDR and autocorrelation were the same as those in (2), (3), and (4), whereas the number of sequences was 40 (not 120). The reduction in the number of sequences was due to the excessive computational costs in our computing environment.

(6) Surrogate methods: The surrogate time series of the original chaotic sequences were generated by the randperm function in MATLAB, which is based on the sorting of pseudorandom numbers generated by the Mersenne Twister.

Data availability

The data sets generated during the current study are available from the corresponding author on reasonable request.