Scalable photonic reinforcement learning by time-division multiplexing of laser chaos

Reinforcement learning involves decision-making in dynamic and uncertain environments and constitutes a crucial element of artificial intelligence. In our previous work, we experimentally demonstrated that the ultrafast chaotic oscillatory dynamics of lasers can be used to efficiently solve the two-armed bandit problem, which requires decision-making concerning a class of difficult trade-offs called the exploration–exploitation dilemma. However, only two selections were employed in that research; hence, the scalability of the laser-chaos-based reinforcement learning should be clarified. In this study, we demonstrated a scalable, pipelined principle of resolving the multi-armed bandit problem by introducing time-division multiplexing of chaotically oscillated ultrafast time series. We present experimental demonstrations in which bandit problems with up to 64 arms were successfully solved and in which laser chaos time series significantly outperformed quasiperiodic signals, computer-generated pseudorandom numbers, and coloured noise. Detailed analyses are also provided, including performance comparisons among laser chaos signals generated in different physical conditions, which coincide with the diffusivity inherent in the time series. This study paves the way for ultrafast reinforcement learning by taking advantage of the ultrahigh bandwidths of light waves and practical enabling technologies.

Recently, the use of photonics for information processing and artificial intelligence has been intensively studied by exploiting the unique physical attributes of photons. The latest examples include a coherent Ising machine for combinatorial optimization 1 , photonic reservoir computing to perform complex time-series predictions 2,3 , and ultrafast random number generation using chaotic dynamics in lasers 4,5 in which the ultrahigh bandwidth attributes of light bring novel advantages. Reinforcement learning, also called decision-making, is another important branch of research which involves making decisions promptly and accurately in uncertain, dynamically changing environments 6 and constitutes the foundation of a variety of applications ranging from communication infrastructures 7,8 and robotics 9 to computer gaming 10 .
The multi-armed bandit problem (MAB) is known to be a fundamental reinforcement learning problem in which the goal is to maximize the total reward from multiple slot machines whose reward probabilities are unknown and could dynamically change 6 . To solve the MAB, it is necessary to explore higher-reward slot machines. However, too much exploration may result in excessive loss, whereas too quick decision-making or insufficient exploration may lead to missing the best machine; thus, there is a trade-off referred to as the exploration-exploitation dilemma 11 .
In our previous study, we experimentally demonstrated that the ultrafast chaotic oscillatory dynamics of lasers [2][3][4][5] can be used to solve the MAB efficiently 12,13 . With a chaotic time series generated by a semiconductor laser with a delayed feedback sampled at a maximum rate of 100 GSample/s followed by a digitization mechanism with a variable threshold, ultrafast, adaptive, and accurate decision-making was demonstrated. Such ultrafast decision-making is unachievable using conventional algorithms on digital computers 11,14,15 that rely on pseudorandom numbers. It was also demonstrated that the decision-making performance is maximized by using an optimal sampling interval that exactly coincides with the negative autocorrelation inherent in the chaotic time series 12 . Moreover, even when assuming that pseudorandom numbers and coloured noise were available in such a high-speed domain, the laser chaos method outperformed these alternatives; that is, chaotic dynamics yields superior decision-making abilities 12 .
Scientific REPORTS | (2018) 8:10890 | DOI: 10.1038/s41598-018-29117-y
However, only two options, or slot machines, were employed in the MAB investigated therein; that is, the two-armed bandit problem was studied. A scalable principle and technologies toward an N-armed bandit with N being a natural number are strongly demanded for practical applications. In addition, detailed insights into the relations between the resulting decision-making abilities and properties of chaotic signal trains should be pursued to achieve deeper physical understanding as well as performance optimization at the physical or photonic device level.
In this study, we experimentally demonstrated a scalable photonic reinforcement learning principle based on ultrafast chaotic oscillatory dynamics in semiconductor lasers. Taking advantage of the high-bandwidth attributes of chaotic lasers, we incorporated the concept of time-division multiplexing into the decision-making strategy; specifically, consecutively sampled chaotic signals were used in the proposed method to determine the identity of the slot machine in a binary digit form.
In the recent literature on photonic decision-making, near-field-mediated optical excitation transfer 16,17 and single photon 18,19 methods have been discussed; the former technique involves pursuing the diffraction-limit-free spatial resolution 20 , whereas the latter reveals the benefits of the wave-particle duality of single light quanta 21 . A promising approach for achieving scalability by means of near-field-coupled excitation transfer or single photons is spatial parallelism; indeed, a hierarchical principle has been successfully demonstrated experimentally in solving the four-armed bandit problem using single photons 19 . In contrast, the high-bandwidth attributes of chaotic lasers accommodate time-division multiplexing and have been successfully used in optical communications 22 .
In this study, we transformed the hierarchical decision-making strategy 19 into the time domain, transcending the barrier toward scalability. We also successfully resolved the bandit problem with up to 64 arms. Meanwhile, four kinds of chaotic signals experimentally generated in different conditions, as well as quasiperiodic sequences, were subjected to performance comparisons and characterizations, including diffusivity analysis. In addition, computer-generated pseudorandom signals and coloured noise were used to clarify the similarities and differences with respect to chaotically fluctuating random signals. A detailed dependency analysis with regard to the precision of parameter adjustments, sampling interval of chaotic time series, and difficulties of given decision-making problems as well as diffusivity analyses of time series were also performed. The experimental findings will facilitate understanding of the characteristics of laser-chaos-based decision-making and the future design of integrated systems.

Principle
We considered an MAB problem in which a player selects one of $N$ slot machines, where $N = 2^M$ with $M$ being a natural number. The $N$ slot machines are distinguished by the identity given by natural numbers ranging from 0 to $N - 1$, which are also represented in an $M$-bit binary code given by $S_1 S_2 \cdots S_M$ with $S_i$ ($i = 1, \ldots, M$) being 0 or 1. For example, when $N = 8$ (or $M = 3$), the slot machines are numbered by $S_1 S_2 S_3 = \{000, 001, 010, \ldots, 111\}$ (Fig. 1a). The reward probability of slot machine $i$ is represented by $P_i$ ($i = 0, \ldots, N - 1$), and the problem addressed herein is the selection of the machine with the highest reward probability. The reward amount dispensed by each slot machine per play is assumed to be the same in this study. That is, the probability of winning by playing slot machine $i$ is $P_i$, and the probability of losing by playing slot machine $i$ is $1 - P_i$.
The principle consists of the following three steps: [STEP 1] decision-making for each bit of the slot machine in a pipelined manner, [STEP 2] playing the selected slot machine, and [STEP 3] updating the threshold values. The exact details and general formula are given in the Methods section.
[STEP 1] Decision for each bit of the slot machine. The identity of the slot machine to be chosen is determined bit by bit from the most significant bit (MSB) to the least significant bit in a pipelined manner. For each of the bits, the decision is made based on a comparison between the measured chaotic signal level and the designated threshold value.
First, the chaotic signal s(t 1 ) measured at t = t 1 is compared to a threshold value denoted as TH 1 (Fig. 1b). The output of the comparison is immediately the decision of the MSB concerning the slot machine to choose. If s(t 1 ) is less than or equal to the threshold value TH 1 , the decision is that the MSB of the slot machine to be chosen is 0, which we denote as D 1 = 0. Otherwise, the MSB is determined to be 1 (D 1 = 1). Here we suppose that s(t 1 ) < TH 1 ; then, the MSB of the slot machine to be selected is 0.
Based upon the determination of the MSB, the chaotic signal s(t 2 ) measured at t = t 2 is subjected to another threshold value denoted by TH 2,0 . The first number in the suffix, 2, means that this threshold is related to the second-most significant bit of the slot machine, while the second number of the suffix, 0, indicates that the previous decision, related to the MSB, was 0 (D 1 = 0). If s(t 2 ) is less than or equal to the threshold value TH 2,0 , the decision is that the second-most significant bit of the slot machine to be chosen is 0 (D 2 = 0) (Fig. 1b). Otherwise, the second-most significant bit is determined to be 1 (D 2 = 1). Note that the second-most significant bit is determined by the other threshold value TH 2,1 if the MSB is 1 (D 1 = 1).
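The bit-by-bit comparison described so far can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the dictionary layout (thresholds keyed by the tuple of earlier decisions) and the initial threshold value of 0 are assumptions made here, not details taken from the experimental system.

```python
from itertools import product


def initial_thresholds(m_bits):
    """One threshold per possible prefix of earlier decisions.

    () holds TH_1, (0,) holds TH_{2,0}, (1,) holds TH_{2,1}, and so on.
    The starting value 0 is an assumption for illustration.
    """
    return {prefix: 0
            for k in range(m_bits)
            for prefix in product((0, 1), repeat=k)}


def decide(samples, thresholds):
    """Return the decisions D_1..D_M (MSB first) from M signal samples."""
    bits = []
    for s in samples:
        th = thresholds[tuple(bits)]      # threshold selected by earlier bits
        bits.append(0 if s <= th else 1)  # s <= TH gives 0, otherwise 1
    return bits


def machine_index(bits):
    """Convert MSB-first bits into the machine number 0..N-1."""
    index = 0
    for b in bits:
        index = 2 * index + b
    return index
```

For example, with three bits and all thresholds at 0, the samples -5, 12 and 0 give the bits 0, 1, 0, that is, Machine 2. The dictionary returned by `initial_thresholds(3)` holds seven entries, one for each threshold of an eight-armed problem.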
All of the bits are determined in this manner. In general, there are $2^{k-1}$ kinds of threshold values related to the $k$-th bit; hence, there are $2^M - 1 = N - 1$ kinds of threshold values in total. What is important is that the incoming signal sequence is a chaotic time series which enables efficient exploration of the searching space, as discussed later.
[STEP 2] Slot machine play. The selected slot machine is played.
[STEP 3] Threshold values adjustment. Suppose that the selected slot machine yields a reward (i.e. the player wins the slot machine play). Then, the threshold values are adjusted so that the same decision will be highly likely to be selected in the subsequent play. Therefore, for example, if the MSB of the selected machine is 0, TH 1 should be increased because doing so increases the likelihood of obtaining the same decision regarding MSB being 0. All of the other threshold values involved in determining the decision are updated in the same manner.
It should be noted that due to the irregular nature of the incoming chaotic signal, the possibility of choosing the opposite values of bits is not 0 even if the above-described threshold adjustments have been made. This feature is critical in exploration in reinforcement learning. For example, even when the value of TH 1 is sufficiently small (indicating that slot machines whose MSBs are 1 are highly likely to be better machines), the probability of the decision to choose machines whose MSBs are 0 is not 0. This mechanism is of particular importance when the given decision-making problem is difficult (i.e. the differences among the reward probabilities are minute); this situation will be discussed in detail later.
If the selected slot machine does not yield a reward (i.e. the player loses the slot machine play), then the threshold values are adjusted so that the same decision becomes less likely to be selected in the subsequent play. Therefore, for example, if the MSB of the selected machine is 0, TH 1 should be decreased because doing so decreases the likelihood of obtaining the same decision regarding MSB being 0. All of the other threshold values involved in determining the decision are revised in the same manner.
As described above, the threshold adjustment involves increasing or decreasing the threshold values based on the betting results, which seems to be symmetric between the cases of winning and losing. However, the adjustment must be made asymmetrically except in special cases for the following reason.
Suppose that the reward probabilities of Machines 0 and 1 are given by 0.9 and 0.7, respectively, where the probability of receiving a reward is rather high. Indeed, the probability of receiving a reward regardless of the decision is (0.9 + 0.7)/2 = 0.8 while that of no reward is (0.1 + 0.3)/2 = 0.2. Thus, the event of losing is rare and should occur four times (0.8/0.2 = 4) less than the event of winning. Hence, if the amount of threshold adjustment in the case of winning is set to 1, that in the case of losing should be 4. On the contrary, if the reward probabilities of Machines 0 and 1 are given by 0.1 and 0.3, respectively, the tendency becomes the opposite since most of the betting results in losing; hence, the amount of threshold adjustment in the case of losing must be attenuated by four times compared to that in the case of winning.
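The arithmetic of this example can be checked with a short sketch. The function name is hypothetical, and the sketch assumes the reward probabilities are known, which is not the case for the player; it only reproduces the ratio worked out above.

```python
def loss_adjustment(p0, p1):
    """Ratio of the mean win probability to the mean loss probability.

    With the winning adjustment fixed at 1, this is the losing adjustment
    suggested by the example (illustrative helper, not the paper's update rule).
    """
    mean_win = (p0 + p1) / 2
    mean_loss = 1 - mean_win
    return mean_win / mean_loss


# Rewards are common (0.9 and 0.7): a loss is rare and weighs about 4.
print(loss_adjustment(0.9, 0.7))
# Rewards are rare (0.1 and 0.3): a loss is common and weighs about 0.25.
print(loss_adjustment(0.1, 0.3))
```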
In the present study, the amount of threshold adjustment in the case of winning is given by 1 while that of losing is given by the parameter Ω. In view of the adaptation in dynamically changing environments, forgetting past events is also important; hence, we introduced forgetting (memory) parameters 12,13 for all threshold values. The detailed definition is provided in the Methods section. Ω is also updated during the course of play based on the betting history concerning the numbers of wins and selections. Notably, Ω must be configured differently based on the designated bit. For the MSB, for example, the win/lose events should be related to the two groups of slot machines whose MSBs are 0 and 1, while for the second-most significant bit when the MSB is 0, the win/lose events are related to the two groups of slot machines whose second-most significant bits are 0 and 1 and have MSBs of 0.
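The direction of the threshold adjustment in STEP 3 can be sketched as follows. This is a simplified illustration under stated assumptions: a single Ω is passed for all bits (whereas Ω is configured per bit in this study), the forgetting parameters and the rounding to integer levels defined in the Methods section are omitted, and the function name, dictionary layout and clipping to the range −128 to 128 are choices made for this sketch.

```python
def update_thresholds(thresholds, bits, win, omega, th_min=-128, th_max=128):
    """Adjust every threshold that took part in the decision `bits`.

    thresholds : dict keyed by the tuple of earlier decisions (assumed layout)
    bits       : decided bits, MSB first
    win        : True if the played machine returned a reward
    omega      : adjustment size after a loss (a win always moves by 1)
    """
    amount = 1.0 if win else -omega
    for k, bit in enumerate(bits):
        key = tuple(bits[:k])
        # Bit 0 is chosen when s <= TH, so raising TH makes 0 more likely,
        # while lowering TH makes 1 more likely.
        toward_same_bit = 1.0 if bit == 0 else -1.0
        new_value = thresholds[key] + toward_same_bit * amount
        thresholds[key] = max(th_min, min(th_max, new_value))
    return thresholds
```

After a win with bits 1, 0, the MSB threshold is lowered by 1 and the threshold for the second bit (given an MSB of 1) is raised by 1; after a loss, both move in the opposite direction by Ω.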

Results
A schematic diagram of the laser-chaos-based scalable decision-making system is shown in Fig. 1c. A semiconductor laser operated at a centre wavelength of 1547.785 nm is coupled with a polarization-maintaining (PM) coupler. The light is connected to a variable fibre reflector which provides delayed optical feedback to the laser, generating laser chaos [23][24][25] . The output light at the other end of the PM coupler is detected by a high-speed, AC-coupled photodetector through an optical isolator (ISO) and an optical attenuator. The signal is sampled by a high-speed digital oscilloscope at a rate of 100 GSample/s (a 10 ps sampling interval) with an eight-bit resolution; the signal level takes integer values ranging from −127 to 128. The details of the experimental setup are described in the Methods section. Figure 2a shows examples of the chaotic signal trains. Four kinds of chaotic signal trains were generated, and are referred to as (i) Chaos 1, (ii) Chaos 2, (iii) Chaos 3, and (iv) Chaos 4 in Fig. 2a, by varying the reflection by the variable reflector by letting 210, 120, 80, and 45 μW of optical power be fed back to the laser, respectively. A quasiperiodic signal train was also generated, as depicted in Fig. 2a(v), by the variable reflector by providing a feedback optical power of 15 μW. Figure 2b summarizes the experimentally observed radio-frequency (RF) power spectra obtained using Chaos 1, 2, 3, and 4 and quasiperiodic signals. It can be seen that the chaotic time series contain wide bands of signals 25 and that there are clear differences among the shapes of the RF spectra corresponding to Chaos 1-4, even though the time-domain waveforms shown in Fig. 2a(i-iv) look similar. The experimental details of the RF spectrum evaluation are provided in the Methods section.
In addition, Fig. 2a(vii) shows an example of a coloured noise signal train containing negative autocorrelation calculated using a computer based on the Ornstein-Uhlenbeck process using white Gaussian noise and a low-pass filter 26 with a cut-off frequency of 10 GHz 12 . The black curve in Fig. 2a(vi) marked with RAND depicts a sequence generated by a uniformly distributed pseudorandom generator based on the Mersenne Twister. For RAND, the horizontal axis of Fig. 2a should be read as 'cycles' instead of physical time, but we dealt with RAND as if it were available at the same sampling rate as the laser chaos signals to investigate the performance differences between the laser chaos sequences and pseudorandom numbers both in qualitative and quantitative ways. The details of the time series are described in the Methods section.
Two-armed bandit. We began with the two-armed bandit problem which is the simplest case 12 . The slot machine was played 250 times consecutively, and such play was repeated 10,000 times. The reward probabilities of the two slot machines, referred to as Machines 0 and 1, were 0.9 and 0.7, respectively; hence the correct decision was to choose Machine 0 because it was the machine with the higher reward probability. As shown later, the maximum and the second-maximum reward probabilities were consistently given by 0.9 and 0.7, respectively, for MAB problems with 4, 8, 16, 32, and 64 arms. For the sake of maintaining coherence with respect to the given problem throughout the study, we chose 0.9 and 0.7 for the reward probabilities for a two-armed bandit problem.
The red, green, blue, and cyan curves in Fig. 2c show the evolution of the correct decision ratio (CDR), defined as the ratio of the number of times when the selected machine has the highest reward probability at cycle t based on the time series of Chaos 1, 2, 3, and 4, respectively. The chaotic signal was sampled every 50 ps; that is, a single cycle corresponds to 50 ps in physical time. The magenta, yellow, and black curves in Fig. 2c represent the CDRs obtained based on quasiperiodic, coloured noise, and RAND sequences. Clearly, the chaotic sequences approach a CDR of unity more quickly than the other signals. Although the difference is subtle, Chaos 3 exhibits the best adaptation among the four chaotic time series; a CDR of 0.95 was achieved at cycle 122, corresponding to 6.1 ns.
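As an illustration of how the CDR defined above can be evaluated from repeated plays, the following sketch assumes the selections are stored in an array with one row per repetition and one column per cycle. The function names, the array layout and the convention of counting cycles from 1 are assumptions for this example.

```python
import numpy as np


def correct_decision_ratio(selections, best_machine):
    """Fraction of repetitions that chose the best machine at each cycle."""
    selections = np.asarray(selections)
    return (selections == best_machine).mean(axis=0)


def first_cycle_reaching(cdr, level=0.95):
    """First cycle (counted from 1) at which the CDR reaches `level`."""
    hits = np.nonzero(np.asarray(cdr) >= level)[0]
    return int(hits[0]) + 1 if hits.size else None


def cycles_to_ns(cycles, interval_ps=50):
    """Convert a number of cycles to nanoseconds for a given sampling interval."""
    return cycles * interval_ps / 1000
```

With a 50 ps sampling interval, `cycles_to_ns(122)` reproduces the 6.1 ns quoted for Chaos 3.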
In the previous study, an exact coincidence between the autocorrelation of the laser chaos signal trains and the resulting decision-making performance was obtained 12 ; specifically, it was found that the sampling interval yielding the negative maximum of the autocorrelation provided the fastest decision-making abilities. To solve a two-armed bandit problem, a single threshold (TH 1 ) and single chaotic signal sample are needed to derive a decision (D 1 = 0 or 1). The sampling interval, or more precisely the inter-decision sampling interval, of chaotic signals to configure the threshold (TH 1 ) is defined by Δ S , which is shown in Fig. 1b. Figure 2d compares the autocorrelations of Chaos 1-4 as well as the quasiperiodic and coloured noise. Chaos 1-4 exhibit negative maxima at time lags of around 5 and 6 (and −5 and −6), whereas the quasiperiodic and coloured noise yield negative maxima at time lags of around 7 (and −7). The physical time difference is the time lag multiplied by 10 ps, which is the sampling interval; hence, for example, a time lag of 5 means that the time difference is 50 ps.
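The search for the time lag with the most negative autocorrelation can be sketched with NumPy as follows. This is a generic estimator written for illustration, not the analysis code used in the study; the function names are hypothetical and the 10 ps default follows the sampling interval stated above.

```python
import numpy as np


def autocorrelation(x, max_lag):
    """Normalised autocorrelation of x for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])


def negative_peak_interval_ps(x, max_lag, sample_ps=10):
    """Physical time difference at which the autocorrelation is most negative."""
    lag = int(np.argmin(autocorrelation(x, max_lag)))
    return lag * sample_ps


# Synthetic check: a cosine with a period of 12 samples is most
# anticorrelated at half a period, i.e. a lag of 6 samples.
signal = np.cos(2 * np.pi * np.arange(1200) / 12)
print(negative_peak_interval_ps(signal, max_lag=10))
```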
Correspondingly, Fig. 3 characterizes the CDRs as a function of the inter-decision sampling interval Δ S by setting the reward probabilities of the two slot machines to 0.1 and 0.5, which are the same values studied in ref. 12 in which the relevance to negative autocorrelation was found. In Fig. 3a, the CDRs at cycle 10 are compared among Chaos 1-4, while Fig. 3b shows the CDRs at cycle 100 for the quasiperiodic, coloured noise, RAND, and Chaos 3 series. In Fig. 3a, the CDRs obtained using the chaotic time series show maxima around the sampling intervals of 50 ps and 60 ps, which coincide well with the negative maxima of the autocorrelations, as we reported previously 12 . At the same time, the negative maxima of the chaotic time series follow the order Chaos 4, 3, 2, and 1 in Fig. 2d, whereas the greater decision-making performances follow the order Chaos 3, 2, 4, and 1 in Fig. 3a with a sampling interval of 50 ps. That is, the order of the absolute values of the autocorrelation does not explain the resulting decision-making performances. We will discuss the relation between the decision-making performance and the characteristics of chaotic time series via other metrics at the end of the paper. Meanwhile, the black curve in Fig. 3b, which corresponds to RAND, does not show dependency on the inter-decision sampling interval, whereas the magenta and yellow curves corresponding to quasiperiodic signals and coloured noise exhibit peaky characteristics with respect to the sampling interval, clearly indicating the qualitative differences between correlated time series and conventional pseudorandom signals.
Multi-armed bandit. We applied the proposed time-division multiplexing decision-making strategy to bandit problems with more than two arms. Here, we first describe the problem to be solved and the assignment of reward probabilities (Fig. 4a).
Note that the difference is 0.2, which is retained in the subsequent settings.
(2) Four-armed: In addition to the threshold used to determine the MSB ($TH_1$), two more thresholds are necessary to determine the second bit ($TH_{2,D_1}$ with $D_1 \in \{0, 1\}$). The reward probabilities of Machines 0, 1, 2, and 3 are defined as 0.7, 0.5, 0.9, and 0.1, respectively, where the correct decision is to select Machine 2 (Fig. 4a(ii)). Note that the difference between the highest and second-highest reward probabilities is 0.2, as in the two-armed bandit problem. In addition, the sum of the reward probabilities of the first two machines (Machines 0 and 1: 0.7 + 0.5 = 1.2) is larger than that of the second two machines (Machines 2 and 3: 0.9 + 0.1 = 1.0). This situation is called contradictory 19 since the maximum-reward-probability machine (Machine 2) belongs to the latter group whose reward-probability sum is smaller than that of the former group.
(3) Eight-armed: In addition to the thresholds used to determine the MSB ($TH_1$) and the second bit ($TH_{2,D_1}$ with $D_1 \in \{0, 1\}$), four more thresholds are needed to decide the third bit ($TH_{3,D_1,D_2}$ with $D_1, D_2 \in \{0, 1\}$). The reward probabilities of Machines 0, 1, 2, 3, 4, 5, 6, and 7 are given by 0.7, 0.5, 0.9, 0.1, 0.7, 0.5, 0.7, and 0.5, respectively. First, the difference between the highest and second-highest reward probabilities is 0.2, as in the two- and four-armed bandit problems described above. Second, the sums of the reward probabilities of the slot machines whose MSBs are 0 and 1 are 2.2 and 2.4, respectively, whereas the maximum-reward-probability machine (Machine 2) has an MSB of 0, which is a contradictory situation. Similarly, the sums of the reward probabilities of the slot machines whose second MSBs are 0 and 1 (as well as whose MSBs are 0) are 1.2 and 1, respectively, while the best machine belongs to the latter group, which is also a contradiction (Fig. 4a(iii)). In the following bandit problem definitions, all of these contradictory conditions are satisfied for the sake of coherent comparison with the increased arm numbers. The contradiction rules apply, as in the previous cases. The details are described in the Methods section (Fig. 4a(v)). The contradiction rules apply, as in the previous cases. The details are described in the Methods section (Fig. 4a(vi)).
Figure 4b(i-vi) summarizes the results of the two-, four-, eight-, 16-, 32-, and 64-armed bandit problems, respectively. The red, green, blue, and cyan curves show the CDR evolution obtained using Chaos 1, 2, 3, and 4, respectively, while the magenta, black, and yellow curves depict the evolution obtained using quasiperiodic, RAND, and coloured noise, respectively. The threshold values take integer values ranging from −128 to 128. The sampling interval of the chaotic signal trains for the MSB (Δ S ) is 50 ps, whereas that of the subsequent bits, called the inter-bit sampling interval (Δ L ), is 100 ps. Consequently, one decision needs a processing time of around M × Δ L if the number of bits is M. However, note that the consecutive decision making can be computed with the interval of Δ S thanks to the pipeline structure of the system, as discussed shortly below. The impacts due to the choice of Δ L are discussed later. From Fig. 4b, it can be seen that Chaos 3 provides the promptest adaptation to the unity value of the CDR, whereas the nonchaotic signals (quasiperiodic, RAND, and coloured noise) yield substantially deteriorated performances, especially in bandit problems with more than 16 arms. The number of cycles necessary to obtain the correct decision increases as the number of arms increases. The square marks in Fig. 4c indicate the numbers of cycles required to reach a CDR of 0.95 as a function of the number of slot machines $N$; the data follow the relation $aN^b$, where $a$ and $b$ are approximately 52 and 1.16, respectively. These results support the successful operation of the proposed scalable decision-making principle using laser-generated chaotic time series.
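The fitted relation can be evaluated directly to get a feel for the scaling. The sketch below simply plugs the approximate values of a and b quoted above into $aN^b$; the function names are hypothetical, and converting cycles to time with a 50 ps cycle assumes the Δ S used in this experiment.

```python
def cycles_to_reach_cdr(n_arms, a=52.0, b=1.16):
    """Approximate number of cycles needed to reach a CDR of 0.95."""
    return a * n_arms ** b


def time_to_reach_cdr_ns(n_arms, cycle_ps=50.0):
    """Same estimate expressed in nanoseconds (one cycle = cycle_ps)."""
    return cycles_to_reach_cdr(n_arms) * cycle_ps / 1000.0


for n in (2, 4, 8, 16, 32, 64):
    print(n, round(cycles_to_reach_cdr(n)), round(time_to_reach_cdr_ns(n), 1))
```

Because b is only slightly above 1, the estimated number of cycles grows a little faster than linearly with the number of arms.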
In this study, we investigated the MAB problems when the number of arms ($N$) is given by $2^K$ with $K$ being an integer value. When $N$ is not a power of 2, one implementation strategy is to virtually assign zero reward probability machines to the "unused" arms by preparing a $2^{K'}$-arm system where $K'$ is the minimum integer that yields $2^{K'}$ larger than $N$. A certain performance enhancement should be expected by, for example, uniformly distributing such zero-reward arms within the system so that the overall decision-making performance is accelerated. Such an aspect will be a topic of future studies.
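A minimal sketch of the padding strategy is given below. The function name is hypothetical, and appending the zero-reward arms at the end is the simplest possible placement, chosen only for illustration; other placements, such as the uniform distribution mentioned above, are not explored here.

```python
def pad_to_power_of_two(reward_probs):
    """Append zero-reward virtual arms until the arm count is a power of two.

    Returns the padded list and the number of bits needed to address it.
    """
    n = len(reward_probs)
    k = 0
    while 2 ** k < n:
        k += 1
    padded = list(reward_probs) + [0.0] * (2 ** k - n)
    return padded, k


# Five real arms are embedded in an eight-arm (three-bit) system.
print(pad_to_power_of_two([0.9, 0.7, 0.5, 0.3, 0.1]))
```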
Pipelined processing. We emphasize that the proposed decision-making principle and architecture have simple structures of pipelined processing 27 :
(1) The decision of the first bit ($S_1$) of the slot machine depends only on the first threshold ($TH_1$) and first sampled data. No other information is required.
(2) The decision of the second bit ($S_2$) depends on the decision of the first bit ($D_1$) obtained in the previous step, the second threshold ($TH_{2,D_1}$), and the second sampled data. Simultaneously, the decision of the first bit can proceed to the next decision by sampling the next signal.
(3) The same architecture continues until the M-th bit. The earlier stages do not depend on the results obtained in the latter stages. Such a structure is particularly preferred due to the benefits of the ultrahigh-speed chaotic time series signals and greater throughput of the total system. At the same time, it should be noted that the latency between the timing when a decision is determined by the chaos-based decision maker and the timing when the corresponding reward is returned from the agent (slot machine/environment) could be long due to, for example, communication delay. Therefore, the reward information subjected to the chaos-based decision maker at t = t 1 could be originated from the decision that was generated much before t 1 . Nevertheless, what is important is that the above-mentioned pipelined structure is valid in terms of the signal flow and the high throughput is maintained.
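The distinction between the latency of a single decision and the throughput of the pipeline can be made concrete with a small calculation. The sketch uses the intervals stated in the Results (Δ L = 100 ps and Δ S = 50 ps) as defaults; the function names are hypothetical, and delays in the environment are ignored.

```python
def decision_latency_ps(m_bits, delta_l_ps=100.0):
    """Approximate processing time of one M-bit decision (about M x delta_L)."""
    return m_bits * delta_l_ps


def decisions_per_second(delta_s_ps=50.0):
    """Rate at which new decisions can be started in the pipeline."""
    return 1e12 / delta_s_ps


# A 64-armed problem uses 6 bits: each decision takes roughly 600 ps,
# yet a new decision can start every 50 ps.
print(decision_latency_ps(6), decisions_per_second())
```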
Meanwhile, in the present study, we conduct the performance evaluation in the following manner by eliminating the impact of delay in the agent. Upon the receipt of a reward at $t = t_0$, a decision is determined at around $t_0'$. We assume that the reward is instantaneously returned. Furthermore, we assume that the next decision, initiated at $t_1 = t_0 + \Delta_S$, is based on the reward obtained at $t_0'$ to confirm fundamental decision-making mechanisms and to conduct the basic performance analysis for MAB problems with many arms. In fact, the dynamic change of reward probabilities is not assumed in this study; hence we consider that the present approach yields essential attributes. In fully considering the timing issues of decisions and rewards, communication latency and dynamical attributes of agents will need to be accommodated while a specific signal processing strategy should be organized in the initial phase when no reward information is observed. Such more general considerations will be the focus of exciting future studies.

Discussion
Inter-bit sampling interval dependencies. In resolving MAB problems in which the number of bandits is greater than two and is given by $2^M$, M samples are needed with the interval being specified by Δ L , as schematically shown in Fig. 1b. In this study, we investigated the Δ L dependency by analysing the four kinds of four-armed bandit problems shown in Fig. 5a and labelled as Types 1-4. The reward probabilities of Type 1 are equal to those in the case shown in Fig. 4a(ii); P 0 , P 1 , P 2 , and P 3 are given by 0.7, 0.5, 0.9, and 0.1, respectively. The correct decision is to select Machine 2; that is, the machine identity is given by S 1 S 2 = 10. In deriving the correct decision, the first sample should be greater than the threshold (TH 1 ) to obtain the decision S 1 = 1, whereas the second sample should be smaller than the threshold (TH 2,1 ) to obtain the decision S 2 = 0. Consequently, if Δ L is 0, the search for the best selections does not work well since the same sampling provides the same searching traces that do not satisfy the conditions for both bits. Indeed, the cyan circular marks in Fig. 5b characterize the CDR at cycle 100 as a function of Δ L , where Δ L = 0 ps (i.e. the same sample is used for multiple bits) yields a CDR of 0. Chaos 3 was used for the evaluation. The CDR exhibits the maximum value when Δ L = 50 ps, which is reasonable because 50 ps is the interval that provides the negative autocorrelation that easily allows oppositely arranged bits to be found (S 1 S 2 = 10). Types 2, 3, and 4 contain the same reward probability values but in different arrangements. In Type 3, the correct decision is to select Machine 1, or S 1 S 2 = 01, which is similar to the correct decision in Type 1 in the sense that the two bits have opposite values. Consequently, the inter-bit-interval dependence, shown by the yellow circular marks in Fig. 5b, exhibits traces similar to those of Type 1, where Δ L = 0 ps gives a CDR of 0, whereas Δ L values that yield negative autocorrelations provide greater CDRs. In Types 2 and 4, on the other hand, the correct decisions are given by Machines 0 and 3, or S 1 S 2 = 00 and S 1 S 2 = 11. For such problems, Δ L = 0 ps gives a greater CDR due to the eventually identical values of the first and second bits, whereas Δ L values corresponding to negative autocorrelations yield poorer performance, unlike for Types 1 and 3, as clearly represented by the magenta and black circular marks in Fig. 5b.
It is also noteworthy that pseudorandom numbers provide no characteristic responses with respect to the inter-bit intervals, as shown by the square marks in Fig. 5b which clearly indicate that the temporal structure inherent in chaotic signal trains affects the decision-making performance.
The decision-making system must deal with all of these types of problems, namely, all kinds of bit combinations; thus, temporal structures, such as positive and negative autocorrelations, may lead to inappropriate consequences. To derive a moderate setting, the circular marks in Fig. 5c show the coefficient of variation (CV) which is defined as the ratio of the standard deviation to the mean value, for the four types of problems shown in Fig. 5b. A smaller CV is preferred. The inter-bit sampling interval of 30 ps eventually provides the minimum CV, although slight changes could lead to larger CVs. Indeed, the autocorrelation is about 0 with this inter-bit sampling interval (Fig. 2d). An inter-bit sampling interval of approximately 100 ps constantly offers smaller CV values. The square marks in Fig. 5c correspond to RAND, and no evident inter-bit interval dependency related to the CV is observable, in clear contrast to the chaotic time series cases.
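The CV used here can be computed as in the following sketch. The text does not state whether the population or the sample standard deviation was used, so the population form is an assumption; the function name and the input values are made up for illustration and are not measured data.

```python
import statistics


def coefficient_of_variation(values):
    """Ratio of the standard deviation to the mean (population form assumed)."""
    return statistics.pstdev(values) / statistics.mean(values)


# Hypothetical CDRs of the four problem types at one inter-bit interval.
# Similar CDRs across the types give a small CV, which is the preferred case.
print(coefficient_of_variation([0.80, 0.82, 0.79, 0.81]))
print(coefficient_of_variation([0.95, 0.10, 0.90, 0.15]))
```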
Decision-making difficulties. The adjustment precision of the thresholds is important when searching for the maximum-reward-probability machine, especially in many-armed bandit problems that include the contradictory arrangement discussed earlier 19 . Here, we discuss the dependency of the decision-making difficulty by focussing on the two-armed bandit problem; even in simple two-armed cases, the threshold precision clearly affects the resulting decision-making performance. Figure 6a,b present two-armed bandit problems in which the reward probability of Machine 0 is 0.9, whereas that of Machine 1 is 0.5 and 0.7, respectively. Since the probability difference is larger in the former case, it is easier to find the maximum-reward-probability machine in that case. Indeed, the curves in Fig. 6a show steeper adaptation than those in Fig. 6b. The eight curves shown therein depict the CDRs corresponding to the numbers of threshold levels given by $2^B + 1$, where $B$ is the bit resolution of the threshold values and takes integer values from 1 to 8. From Fig. 6a,b, it can be seen that the CDR saturates before approaching unity when the number of threshold levels is small; in particular, the CDR is limited to significantly lower values in difficult problems. […] (2) and (6)), the absolute value of the threshold decreases by an amount of 0.45. This value is larger than the average of the decrement and increment caused by threshold updating involving $\Omega_1$, leading to the saturation of $\mathrm{TH}_1$ and resulting in a limited CDR. Figure 6c summarizes the CDR at cycle 200 as a function of decision difficulty, where precise threshold control is necessary to obtain a higher CDR, especially in difficult problems.

Diffusivity and decision-making performance.
In the results shown for bandit problems with up to 64 arms in Fig. 4, Chaos 3 provides the best performance among the four kinds of chaotic time series. The negative autocorrelation indeed affects the decision-making ability, as discussed in Fig. 5; however, the value of the negative maximum of the autocorrelation shown in Fig. 2d does not coincide with the order of performance superiority, indicating the necessity of further insights into the underlying mechanisms.
In this respect, we analysed the diffusivity of the temporal sequences based on the ensemble averages of the time-averaged mean square displacements (ETMSDs) 28,29 in the following manner. We first generated a random walker via comparison between the chaotic time series and uniformly distributed random numbers. If the value of the uniformly distributed random number, which was generated based on the Mersenne Twister, is smaller than the chaotic signal $s(t)$, the walker moves to the right, $X(t) = +1$; otherwise, $X(t) = -1$. The details about the random numbers and the formation of the walker are described in the Methods section. Hence, the position of the walker at time $t$ is given by

$$x(t) = X(1) + X(2) + \cdots + X(t).$$

We then calculate the ETMSD using

$$\left\langle \overline{\delta^2(\tau)} \right\rangle = \left\langle \frac{1}{T - \tau} \sum_{t=1}^{T-\tau} \left( x(t + \tau) - x(t) \right)^2 \right\rangle,$$

where $x(t)$ is the time series, $T$ is the last sample to be evaluated, and $\langle \cdot \rangle$ denotes the ensemble average over different sequences. The ETMSDs corresponding to Chaos 1, 2, 3, and 4 and quasiperiodic, RAND, and coloured noise are shown in the inset of Fig. 7, all of which monotonically increase as a function of the time difference $\tau$. It should be noted that at $\tau = 1000$, Chaos 3 exhibits the maximum ETMSD, followed by Chaos 2, 1, and 4, as shown by the circular marks in Fig. 7a. This order agrees with the superiority order of the decision-making performance in the 64-armed bandit problem shown in Fig. 4b. At the same time, RAND yields an ETMSD of 1000 at $\tau = 1000$, which is a natural consequence considering the fact that the mean square displacement of a random walk is given by $\langle x(t)^2 \rangle = 4pqt + (p - q)^2 t^2$, where $p$ and $q$ are the probabilities of flight to the right and left, respectively. If $p = q = 1/2$ (via RAND), then the mean square displacement is $t$. From Fig. 7a, RAND and coloured noise actually exhibit larger ETMSD values than Chaos 1–4, although the decision-making abilities are considerably poorer for RAND and coloured noise, implying that the ETMSD alone cannot perfectly explain the performances.
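The walker construction and the ETMSD described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the source signal is assumed to be normalized to the range [−1, 1], the uniform random numbers are assumed to be drawn from the same range (the actual ranges are specified in the Methods and are not reproduced here), and the function names are hypothetical. Python's standard `random` module is itself based on the Mersenne Twister.

```python
import random


def walker_positions(signal, rng, lo=-1.0, hi=1.0):
    """Build x(t): step +1 if a uniform draw in [lo, hi) is below s(t), else -1.

    The range [lo, hi) of the uniform numbers is an assumption of this sketch.
    """
    x = 0
    positions = []
    for s in signal:
        x += 1 if rng.uniform(lo, hi) < s else -1
        positions.append(x)
    return positions


def tmsd(positions, tau):
    """Time-averaged mean square displacement of one sequence at lag tau."""
    n = len(positions) - tau
    return sum((positions[t + tau] - positions[t]) ** 2 for t in range(n)) / n


def etmsd(sequences, tau):
    """Ensemble average of the time-averaged MSD over several walker sequences."""
    return sum(tmsd(p, tau) for p in sequences) / len(sequences)


# Usage: an unbiased source (s(t) = 0, so p = q = 1/2) should give an
# ETMSD close to tau, in line with the random-walk argument in the text.
rng = random.Random(0)
walks = [walker_positions([0.0] * 2000, rng) for _ in range(40)]
print(etmsd(walks, 50))
```

A biased source (for example a signal with a positive mean) makes the walker drift, so its ETMSD grows faster than linearly in the lag.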
Figure 7b explains diffusivity in another way, where the average displacements 〈x(t)〉 and 〈x(t + D)〉 are plotted for each time series superimposed in the XY plane with D = 10,000. Although the quasiperiodic and coloured noise, shown by the magenta and yellow curves, respectively, move toward positions far from the Cartesian origin, their trajectories are biased toward limited coverage in the plane. Meanwhile, the trajectories of the chaotic time series cover wider areas, as shown by the red, green, blue, and cyan curves. The trajectories generated via RAND, shown by the black curve, remain near the origin.

Figure 6. Difficulty of the given decision-making problem and threshold precision control. While retaining the higher reward probability in the two-armed bandit problem (P 0 = 0.9), the lower reward probability P 1 was set to (a) 0.5 and (b) 0.7 to examine the decision difficulty. The CDR increases more rapidly in the easier decision-making problem (a) than in the harder one (b). In addition, a decrease in the number of threshold levels prevents the system from reaching the correct decision, especially for harder problems, due to insufficient exploration. (c) CDR at cycle 200 as a function of decision difficulty.
To quantify such differences, we evaluated the covariance matrix $\Theta = \mathrm{cov}(X_1, X_2)$ by substituting $x(t)$ and $x(t + D)$ for $X_1$ and $X_2$, where the $ij$-element of $\Theta$ is defined by

$$\Theta_{ij} = \frac{1}{N} \sum_{k=1}^{N} \left( X_{i}(k) - \bar{X}_i \right) \left( X_{j}(k) - \bar{X}_j \right),$$

with $N$ denoting the number of samples and $\bar{X}_i$ denoting the average of $X_i$. The condition number of $\Theta$, which is the ratio of the maximum singular value to the minimum singular value 30,31 , indicates the uniformity of the sample distribution. A larger condition number means that the trajectories are skewed toward a particular orientation, whereas a condition number closer to unity indicates uniformly distributed data. The square marks in Fig. 7a show the calculated condition numbers, where Chaos 1–4 achieve smaller values, which are even smaller than that achieved by RAND, and the quasiperiodic and coloured noise yield larger scores.
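The condition-number measure can be illustrated with a short Python sketch for the two-variable case. It relies on the fact that a covariance matrix is symmetric and positive semidefinite, so its singular values coincide with its eigenvalues, which have a closed form for a 2×2 matrix. The normalization by $N$ (rather than $N - 1$) is an assumption of this sketch; it does not affect the condition number because the factor cancels in the ratio. The function names are hypothetical.

```python
import math


def covariance_2x2(x1, x2):
    """Return (c11, c12, c22) of the covariance matrix of two equal-length samples."""
    n = len(x1)
    m1 = sum(x1) / n
    m2 = sum(x2) / n
    c11 = sum((a - m1) ** 2 for a in x1) / n
    c22 = sum((b - m2) ** 2 for b in x2) / n
    c12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / n
    return c11, c12, c22


def condition_number(c11, c12, c22):
    """Ratio of the largest to the smallest singular value of [[c11, c12], [c12, c22]]."""
    half_trace = (c11 + c22) / 2.0
    radius = math.hypot((c11 - c22) / 2.0, c12)
    smallest = half_trace - radius
    if smallest <= 0.0:
        return math.inf  # all points lie on a line: maximally skewed
    return (half_trace + radius) / smallest


# Usage: evenly spread points give a value of 1, elongated clouds give larger values.
print(condition_number(*covariance_2x2([1, -1, 0, 0], [0, 0, 1, -1])))
print(condition_number(*covariance_2x2([2, -2, 0, 0], [0, 0, 1, -1])))
```

To mirror the analysis in the text, `x1` and `x2` would be the walker positions $x(t)$ and $x(t + D)$.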
In addition, Fig. 7b shows that the trajectories of the quasiperiodic signal and the coloured noise are biased toward the upper right and lower left corners, respectively. Figure 7c shows the histogram of the position of the walker at t = 100,000, where large incidences are observed in positively and negatively biased positions for the quasiperiodic signal and coloured noise, respectively. Indeed, the ensemble averages of the time averages of the source sequences, $\langle \overline{s(t)} \rangle$, were 0.1407 (Chaos 1), 0.1295 (Chaos 2), 0.1298 (Chaos 3), 0.1129 (Chaos 4), 0.5903 (quasiperiodic), 0.0022 (RAND), and −0.4902 (coloured noise); that is, the biases inherent in the source signals impacted the walker's behaviour. At the same time, it should be noted that whereas the walker's statistical distributions of the four kinds of chaos and RAND exhibit similarities in Fig. 7c, their decision-making performances (Fig. 4) and the diffusivity in a two-dimensional diagram (Fig. 7b) are quite different between chaos and RAND. Through these analyses using the ETMSDs and condition numbers related to the diffusivity of the time series, a clear correlation between the greater diffusion properties inherent in laser-generated chaotic time series and the superiority in the resulting decision-making ability is observable.
Simultaneously, however, we consider further insight to be necessary to draw general conclusions regarding the origin of the superiority of chaotic time series in the proposed decision-making principle. For example, as can be seen in Fig. 2b, the RF spectrum differs among the chaotic time series, suggesting a potential relation to the resulting decision-making performances. The impact of inherent biases, which can be dynamically configured, should be further investigated. Ultimately, an artificially constructed optimal chaotic time series that provides the best decision-making ability could be derived, which is an important and interesting topic requiring future study.

Conclusion
We proposed a scalable principle of ultrafast reinforcement learning or decision-making using chaotic time series generated by a laser. We experimentally demonstrated that multi-armed bandit problems with $N = 2^M$ arms can be successfully solved using $M$ points of signal sampling from the laser chaos and comparison to multiple thresholds. Bandit problems with up to 64 arms were successfully solved using chaotic time series even though the presence of difficulties that we call contradictions can potentially lead to trapping in local minima. We found that laser chaos time series significantly outperform quasiperiodic signals, computer-generated pseudorandom numbers, and coloured noise. Based on the experimental results, the required latency scales as $N^{1.16}$, with $N$ being the number of slot machines or bandits. Furthermore, by physically changing the laser chaos operation conditions, four kinds of chaotic time series were subjected to the decision-making analysis; a particular chaos sequence provided superiority over the other chaotic time series. Diffusivity analyses of the time sequences through the ETMSDs and covariance matrix condition numbers accounted well for the underlying mechanisms for quasiperiodic sequences, computer-generated pseudorandom numbers, and coloured noise. This study is the first demonstration of photonic reinforcement learning with scalability to larger decision problems and paves the way for new applications of chaotic lasers in the realm of artificial intelligence.

Methods
Optical system. The laser was a distributed feedback semiconductor laser mounted on a butterfly package with optical fibre pigtails (NTT Electronics, KELD1C5GAAA). The injection current of the semiconductor laser was set to 58.5 mA (5.37 $I_{th}$), where the lasing threshold $I_{th}$ was 10.9 mA. The relaxation oscillation frequency of the laser was 6.5 GHz, and its temperature was maintained at 294.83 K. The optical output power was 13.2 mW. The laser was connected to a variable fibre reflector through a fibre coupler, where a fraction of light was reflected back to the laser, generating high-frequency chaotic oscillations of optical intensity 23–25 . The length of the fibre between the laser and reflector was 4.55 m, corresponding to a feedback delay time (round trip) of 43.8 ns. Polarization-maintaining (PM) fibres were used for all of the optical fibre components. The optical signal was detected by a photodetector (New Focus, 1474-A, 38 GHz bandwidth) and sampled using a digital oscilloscope (Tektronix, DPO73304D, 33 GHz bandwidth, 100 GSample/s, eight-bit vertical resolution). The RF spectrum of the laser was measured by an RF spectrum analyser (Agilent, N9010A-544, 44 GHz bandwidth). The observed raw data were subjected to moving averaging over a 20-point window, yielding the RF spectrum curves shown in Fig. 2b.

Details of the principle.

Decision of the most significant bit. The chaotic signal $s(t_1)$ measured at $t = t_1$ is compared to a threshold value denoted as $\mathrm{TH}_1$ (Fig. 1b). The output of the comparison is immediately the decision of the most significant bit (MSB) concerning the slot machine to choose. If $s(t_1)$ is less than or equal to the threshold value $\mathrm{TH}_1$, the decision is that the MSB of the slot machine to be chosen is 0, which we denote as $D_1 = 0$. Otherwise, the MSB is determined to be 1 ($D_1 = 1$).
Decision of the second-most significant bit. Suppose that $s(t_1) < \mathrm{TH}_1$; then, the MSB of the slot machine to be selected is 0. The chaotic signal $s(t_2)$ measured at $t = t_2$ is compared to another threshold value denoted by $\mathrm{TH}_{2,0}$. The first number in the suffix, 2, means that this threshold is related to the second-most significant bit of the slot machine, while the second number of the suffix, 0, indicates that the previous decision, related to the MSB, was 0 ($D_1 = 0$). If $s(t_2)$ is less than or equal to the threshold value $\mathrm{TH}_{2,0}$, the decision is that the second-most significant bit of the slot machine to be chosen is 0 ($D_2 = 0$). Otherwise, the second-most significant bit is determined to be 1 ($D_2 = 1$).
Decision of the least significant bit. Suppose that $s(t_2) > \mathrm{TH}_{2,0}$; then, the second-most significant bit of the slot machine to be selected is 1. In such a case, the third comparison with regard to the chaotic signal $s(t_3)$ measured at $t = t_3$ is performed using another threshold adjuster value denoted by $\mathrm{TH}_{3,0,1}$. The 3 in the subscript 3,0,1 indicates that the threshold is related to the third-most significant bit, and the second and third numbers, 0 and 1, indicate that the most and second-most significant bits were determined to be 0 and 1, respectively. Such threshold comparisons continue until all $M$ bits of information that specify the slot machine have been determined. If $M = 3$, the result of the third comparison corresponds to the least significant bit of the slot machine to be chosen. Suppose that the result of the comparison is $s(t_3) < \mathrm{TH}_{3,0,1}$; then, the third bit is 0 ($D_3 = 0$). Finally, the decision is to select the slot machine with $D_1 D_2 D_3 = 010$; that is, the slot machine to be chosen is 2.
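The bit-by-bit decision described above can be sketched in Python as follows. The representation is an assumption made for illustration: each threshold is stored in a dictionary keyed by the tuple of previously decided bits, so that the empty tuple corresponds to $\mathrm{TH}_1$, the key `(0,)` to $\mathrm{TH}_{2,0}$, the key `(0, 1)` to $\mathrm{TH}_{3,0,1}$, and so on. The function name and the sample values are hypothetical.

```python
def decide_machine(samples, thresholds):
    """Decide the slot machine from M signal samples, MSB first.

    samples:    sequence of M sampled signal values s(t_1), ..., s(t_M)
    thresholds: dict mapping the tuple of previously decided bits to a threshold
                value; missing entries default to 0 (the initial threshold value)
    Returns the decided bits (D_1, ..., D_M) and the machine index they encode.
    """
    bits = ()
    for s in samples:
        th = thresholds.get(bits, 0.0)
        bit = 0 if s <= th else 1  # s <= TH gives bit 0, otherwise bit 1
        bits += (bit,)
    machine = int("".join(str(b) for b in bits), 2)
    return bits, machine


# Usage reproducing the example in the text with M = 3:
# s(t_1) <= TH_1, s(t_2) > TH_{2,0}, s(t_3) <= TH_{3,0,1}  ->  D_1 D_2 D_3 = 010.
thresholds = {(): 0.0, (0,): 0.0, (0, 1): 0.0}
bits, machine = decide_machine([-5, 7, -3], thresholds)
print(bits, machine)
```

Keying the thresholds by the bit prefix means that only the $M$ thresholds on the decided path are consulted in each play, out of the $2^M - 1$ thresholds in total.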
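When winning the slot machine play: If the selected machine yields a reward (i.e. the player wins in the slot machine play), the threshold values are updated as follows.

(1) MSB The threshold value of the MSB is updated according to

$$\mathrm{TH}_1(t+1) = \begin{cases} +\Delta + \alpha\,\mathrm{TH}_1(t) & \text{if } D_1 = 0 \\ -\Delta + \alpha\,\mathrm{TH}_1(t) & \text{if } D_1 = 1 \end{cases} \qquad (2)$$

where $\alpha$ is referred to as the forgetting (memory) parameter 12,13 and $\Delta$ is the constant increment (in this experiment, $\Delta = 1$ and $\alpha = 0.99$). The intuitive meaning of the update given by Eq. (2) is that the threshold value is revised so that the likelihood of choosing the same machine in the next cycle increases.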
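(2) Second-most significant bit The threshold adjuster values related to the second-most significant bit are revised based on the following rules:

$$\mathrm{TH}_{2,0}(t+1) = \begin{cases} +\Delta + \alpha\,\mathrm{TH}_{2,0}(t) & \text{if } D_1 = 0,\ D_2 = 0 \\ -\Delta + \alpha\,\mathrm{TH}_{2,0}(t) & \text{if } D_1 = 0,\ D_2 = 1 \end{cases} \qquad (3)$$

when the MSB has been determined to be 0 ($D_1 = 0$) and

$$\mathrm{TH}_{2,1}(t+1) = \begin{cases} +\Delta + \alpha\,\mathrm{TH}_{2,1}(t) & \text{if } D_1 = 1,\ D_2 = 0 \\ -\Delta + \alpha\,\mathrm{TH}_{2,1}(t) & \text{if } D_1 = 1,\ D_2 = 1 \end{cases} \qquad (4)$$

when the MSB has been determined to be 1 ($D_1 = 1$).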

(3) General form
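As was done with the most and second-most significant bits in Eqs (2)–(4), all of the threshold values are updated. In a general form, the threshold value for the $K$-th bit is given by

$$\mathrm{TH}_{K,S_1,\ldots,S_{K-1}}(t+1) = \begin{cases} +\Delta + \alpha\,\mathrm{TH}_{K,S_1,\ldots,S_{K-1}}(t) & \text{if } D_K = 0 \\ -\Delta + \alpha\,\mathrm{TH}_{K,S_1,\ldots,S_{K-1}}(t) & \text{if } D_K = 1 \end{cases} \qquad (5)$$

when the decisions from the MSB to the $(K-1)$-th bit have been determined by $D_1 = S_1$, $D_2 = S_2$, …, and $D_{K-1} = S_{K-1}$.

When losing the slot machine play: If the selected machine does not yield a reward (i.e. the player loses in the slot machine play), the threshold values are updated as follows.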
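(1) MSB The threshold value of the MSB is updated according to

$$\mathrm{TH}_1(t+1) = \begin{cases} -\Omega_1 + \alpha\,\mathrm{TH}_1(t) & \text{if } D_1 = 0 \\ +\Omega_1 + \alpha\,\mathrm{TH}_1(t) & \text{if } D_1 = 1 \end{cases} \qquad (6)$$

where $\Omega_1$ is determined as follows. Let the numbers of times that slot machines for which the MSB is 0 ($S_1 = 0$) and 1 ($S_1 = 1$) are selected be given by $C_{S_1=0}$ and $C_{S_1=1}$, and let the corresponding numbers of wins be given by $L_{S_1=0}$ and $L_{S_1=1}$, respectively. Then, the estimated reward probability, or winning probability, by choosing slot machines for which the MSB is $k$ is given by

$$\hat{P}_{S_1=k} = \frac{L_{S_1=k}}{C_{S_1=k}}, \qquad (7)$$

and $\Omega_1$ is given by

$$\Omega_1 = \frac{\hat{P}_{S_1=0} + \hat{P}_{S_1=1}}{2 - \left( \hat{P}_{S_1=0} + \hat{P}_{S_1=1} \right)}. \qquad (8)$$

The initial $\Omega_1$ is assumed to be unity, and a constant value is assumed when the denominator of Eq. (8) is 0. $\Omega_1$ is the figure that designates the degrees of winning and losing. Indeed, the numerator of Eq. (8) indicates the degree of winning, whereas the denominator shows that of losing.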
(2) Second-most significant bit The threshold adjuster values related to the second-most significant bit are updated when the MSB of the decision is 0 ($D_1 = 0$) by using the following formula:

$$\mathrm{TH}_{2,0}(t+1) = \begin{cases} -\Omega_{2,0} + \alpha\,\mathrm{TH}_{2,0}(t) & \text{if } D_1 = 0,\ D_2 = 0 \\ +\Omega_{2,0} + \alpha\,\mathrm{TH}_{2,0}(t) & \text{if } D_1 = 0,\ D_2 = 1 \end{cases} \qquad (9)$$

Let the number of times that slot machines for which the MSB is 0 ($S_1 = 0$) and the second-most significant bit is 0 ($S_2 = 0$) are selected be given by $C_{S_1=0,S_2=0}$. Let the number of times that slot machines for which the MSB is 0 ($S_1 = 0$) and the second-most significant bit is 1 ($S_2 = 1$) are selected be given by $C_{S_1=0,S_2=1}$. Let the numbers of wins by selecting slot machines for which the MSB is 0 ($S_1 = 0$) and the second-most significant bit is 0 ($S_2 = 0$) or 1 ($S_2 = 1$) be given by $L_{S_1=0,S_2=0}$ and $L_{S_1=0,S_2=1}$, respectively. Then, the estimated reward probability, or winning probability, by choosing slot machines for which the MSB is 0 and the second-most significant bit is $k$ is given by

$$\hat{P}_{S_1=0,S_2=k} = \frac{L_{S_1=0,S_2=k}}{C_{S_1=0,S_2=k}}, \qquad (10)$$

and, analogously to Eq. (8), $\Omega_{2,0}$ is given by

$$\Omega_{2,0} = \frac{\hat{P}_{S_1=0,S_2=0} + \hat{P}_{S_1=0,S_2=1}}{2 - \left( \hat{P}_{S_1=0,S_2=0} + \hat{P}_{S_1=0,S_2=1} \right)}.$$

Initially, all of the threshold values are 0; hence, for example, the probability of determining the MSB of the slot machine to be 1 or 0 is 0.5 since $\mathrm{TH}_1 = 0$. As time elapses, the threshold values are biased towards the slot machine with the higher reward probability based on the updates described by Eqs (2)–(10). It should be noted that due to the irregular nature of the incoming chaotic signal, the possibility of choosing the opposite values of bits is not 0, and this feature is critical in exploration in reinforcement learning. For example, even when the value of $\mathrm{TH}_1$ is sufficiently small (indicating that slot machines whose MSBs are 1 are highly likely to be better machines), the probability of the decision to choose machines whose MSBs are 0 is not 0.

The number of threshold levels was limited to a finite value in the experimental implementation. Furthermore, the threshold resolution affects the decision-making performance, as discussed below. In this study, we assumed that the actual threshold level takes the values $-Z, \ldots, -1, 0, 1, \ldots, Z$, where $Z$ is a natural number; thus, the number of the threshold levels is $2Z + 1$ (Fig. 1c).
More precisely, the actual threshold value is defined by

$$T(t) = a \lfloor \mathrm{TH}(t) \rfloor,$$

where $\lfloor \mathrm{TH}(t) \rfloor$ is the nearest integer to $\mathrm{TH}(t)$ rounded to 0, and $a$ is a constant for scaling to limit the range of the resulting $T(t)$. The value of $T(t)$ ranges from $-aZ$ to $aZ$ by assigning the limits $T(t) = aZ$ when $\lfloor \mathrm{TH}(t) \rfloor$ is greater than $Z$ and $T(t) = -aZ$ when $\lfloor \mathrm{TH}(t) \rfloor$ is smaller than $-Z$. In the experiment, the chaotic signals $s(t)$ take integer values from −127 to 128 (eight-bit signed integer); hence, $a$ was given by $a = 128/Z$ in the present study.
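The threshold bookkeeping described in this subsection can be sketched in Python as follows. Several points are assumptions of this sketch rather than statements taken from the text: the signs of the updates are chosen so that a win makes the same bit more likely in the next cycle (a bit of 0 corresponds to $s \le \mathrm{TH}$, so the threshold is raised) and a loss does the opposite; $\Omega$ is assumed to be the ratio of the summed estimated reward probabilities (degree of winning) to the summed estimated loss probabilities (degree of losing); and $\Omega$ is kept at its initial value of 1 until both branches have been played. The function names are hypothetical.

```python
import math


def omega(wins0, plays0, wins1, plays1, default=1.0):
    """Assumed form: (P0 + P1) / (2 - (P0 + P1)) with P_k = wins_k / plays_k."""
    if plays0 == 0 or plays1 == 0:
        return default  # assumption: keep the initial value until both branches are played
    total = wins0 / plays0 + wins1 / plays1
    return total / (2.0 - total) if total < 2.0 else default  # constant if denominator is 0


def update_threshold(th, bit, win, omega_value, delta=1.0, alpha=0.99):
    """Return TH(t+1) for the threshold that produced `bit` in the current play.

    Win:  shift by delta so that the same bit becomes more likely.
    Loss: shift by omega_value in the opposite direction.
    alpha is the forgetting (memory) parameter applied to the previous value.
    """
    step = delta if win else -omega_value
    direction = 1.0 if bit == 0 else -1.0
    return direction * step + alpha * th


def actual_threshold(th, z, signal_max=128):
    """Quantize TH(t) to one of 2Z+1 levels: round toward zero, clip to [-Z, Z], scale by a."""
    a = signal_max / z  # a = 128 / Z as stated in the text
    level = max(-z, min(z, math.trunc(th)))
    return a * level


# Usage: a win with bit 0 raises the threshold, a subsequent loss lowers it again.
th = update_threshold(0.0, bit=0, win=True, omega_value=1.0)
th = update_threshold(th, bit=0, win=False, omega_value=omega(9, 10, 5, 10))
print(th, actual_threshold(th, z=4))
```

If the number of levels $2Z + 1$ is matched to the $2^B + 1$ levels used in the Results, then $Z = 2^{B-1}$, so a coarse resolution such as $B = 1$ leaves only the three levels $-a$, $0$, and $a$.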