Scalable reservoir computing on coherent linear photonic processor

Photonic neuromorphic computing is of particular interest due to its significant potential for ultrahigh computing speed and energy efficiency. The advantage of photonic computing hardware lies in its ultrawide bandwidth and parallel processing utilizing inherent parallelism. Here, we demonstrate a scalable on-chip photonic implementation of a simplified recurrent neural network, called a reservoir computer, using an integrated coherent linear photonic processor. In contrast to previous approaches, both the input and recurrent weights are encoded in the spatiotemporal domain by photonic linear processing, which enables scalable and ultrafast computing beyond the input electrical bandwidth. As the device can process multiple wavelength inputs over the telecom C-band simultaneously, we can use ultrawide optical bandwidth (~5 terahertz) as a computational resource. Experiments for the standard benchmarks showed good performance for chaotic time-series forecasting and image classification. The device is considered to be able to perform 21.12 tera multiplication–accumulation operations per second (MAC ∙ s−1) for each wavelength and can reach petascale computation speed on a single photonic chip by using wavelength division multiplexing. Our results are challenging for conventional Turing–von Neumann machines, and they confirm the great potential of photonic neuromorphic processing towards peta-scale neuromorphic super-computing on a photonic chip. Optical computing holds promise for high-speed, low-energy information processing due to its large bandwidth and ability to multiplex signals. The authors propose a recurrent neural network implementation using reservoir computing architecture in an integrated photonic processor capable of performing ~10 tera multiplication–accumulation operations per second for each wavelength channel.

N owadays, machine learning techniques are advancing at a tremendous speed 1 , and their applications for artificial intelligence (AI) systems are penetrating society.This is motivating the development of special-purpose AI hardware such as application-specific integrated circuits (ASICs) and fieldprogrammable gate arrays (FPGAs) 2,3 , which provide much faster and more energy-efficient computational resources.Recently, photonic implementations of artificial neural networks (ANNs) are attracting interest because they have great potential to reduce operational power, increase speed, and reduce latency beyond what is possible in electronic computing [4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20][21] .Optical circuits can perform a large-scale multiply-accumulate (MAC) operation-a dominant factor in ANN computation-with ultrahigh processing speed thanks to their ultrawide bandwidth (terahertz region) and inherent parallelism in space, time, phase, and wavelength domains.As this operation is executed simply by light propagation and interference, the principal energy consumption is very small.Although the basic operation of this alternative computation was originally proposed in the early 1970s [22][23][24] , the research was suspended due to the rapid development of electronic largescale-integration (LSI) technology and an AI winter.The revival of AI through deep learning technology has led to a rethinking of neuromorphic photonic systems.Recent progress in photonic integrated circuits 25 , which were developed for telecom applications, enables us to implement photonic ANN compactly 4 .This will boost the research of photonic ANN for alternative computational devices.
Among them, RC [28][29][30] is gaining attention for its affinity with optical elements and its excellent performance.The standard deep neural network (DNN) requires fine tuning of each weight through the use of the error back propagation algorithm 31 , which in turn requires highly accurate and uniform large-scale integration of tunable optical elements, which is a very challenging issue for fabrication.In addition, the training time of a photonic ANN is generally much longer than that with electrical devices due to the slow response of the phase shifters in the MZIs (on the order of milliseconds for thermo-optic devices).Therefore, standard photonic ANNs are effective only at the inference stage.On the other hand, in the RC framework, only output weights are trained, which is carried out by using a linear regression scheme.The other weights are fixed randomly.Thus, there is no need for any fine tuning of the optical system under training.As the training time is determined by forward propagation in RC, it can be accelerated by using photonics.In spite of the simple training, DNNs based on photonics have shown excellent performance comparable to that of standard DNNs on a series of benchmark tasks, such as speech recognition 20 , economic forecasting 32 , action detection in movie data 33 , and telecom signal compensation 18,19,28 .Apart from optics, many physical implementations have been reported such as spintronic devices 34 .
However, most conventional demonstrations have been limited to implementations of a middle recurrent layer called a "reservoir".The scalability for chip integration is highly limited due to their simple architectures (less than 23 neurons in the reservoir [14][15][16] ).In addition, the bandwidth is limited by the electrical components used for the nonlinear activation functions and input-output (I/O) frontend, such as response-lasers (gigahertz order) 12,15 and spatial light modulators (hertz order) 17 .As the slowest one determines the operating time, the potential bandwidth of light (>THz) is limited to the electrical bandwidth.As a result, the computation speed, the distinguishing feature of photonic RC, is still considered to be below or at the same level as that of commercial electrical hardware such as central processing units (CPUs) or graphics processing units (GPUs).Note that there has not been a clear discussion in the literature.Our estimations are discussed in "Discussion" section.
In this paper, we propose a scalable photonic implementation of an RC system based on coherent linear photonic processors driven by an optical pulse source.In contrast to previous approaches, both the input and reservoir weights are optically encoded in the spatiotemporal domain, which enables scalable integration on a compact chip.The ultrafast optical pulse source up-converts the input signals to the higher frequency region beyond the electrical bandwidth.In the linear photonic processor, we can operate the photonic signals beyond the electrical bandwidth limitation.Although the nonlinear activation functions are only implemented in the input and output, the complex-valued evolution in coherent systems ensures rich dynamics comparable to that of incoherent nonlinear systems as described in the ref. 13 .We also demonstrate parallel processing based on wavelength division multiplexing (WDM).This enables the use of ultrawide optical bandwidth (>THz) as a computational resource, which boosts the computation efficiency beyond the computation bottleneck remaining from the I/O bandwidth.Experiments with the standard benchmarks showed good performance for chaotic time-series forecasting and image classifications with ultrafast processing speed of 17.1 ns per image.The device can achieve ten TMAC/s for each wavelength and can reach petascale computation speed on a single photonic chip by using WDM.Our results are challenging for conventional Turing-von Neumann machines, and they significantly advance photonic neuromorphic processing towards peta-scale neuromorphic super-computing on a photonic chip.

Results
Basic operation principle.Figure 1a shows the proposed device architecture for photonic RC.In this system, the radio-frequency (RF) input signal u(t) with modulation time interval T is coded to the amplitude of the electromagnetic field of optical pulse, δ(t), by an optical Mach-Zehnder modulator (MZM).Then, the ultrafast optical pulse up-converts the RF-signal to the optical frequency, and this electro-optic conversion results in a sinusoidal nonlinearity.The modulated optical amplitude u′(t) is described as where V o is offset voltage, and γ and V π are the characteristic voltages of the MZM.Then, the converted signals u′(t) are input to first stage of the linear optical circuits for the spatiotemporal input-masking.In this part, they are split into N-branches and then transmitted through different delay lines with a delay differential θ.The θ is set to satisfy the relationship θ = T/N, where N is the virtual node count in a single optical cavity 11,12,35 .Then, they are weighted by optical cross connecting units.This means that the input signals spread along time and space division with complex-valued weights.In general, the masked response s l (t) with continuous time representation is described as where h l (τ) is an impulse response to the lth output port for the input mask circuit.For the comparison with the digital mask operation 11,12,35 , we consider a discretized time t(n,i) corresponding to each interval of duration θ and T: t′(n, i) = nT + iθ, where n ∈ Z and i ∈ [1; N].By considering the ideal optical impulse with repeating time of T, the discretized expression of ( 1) is described as where m l,i ∈ [0; 1] and Ψ l,i ∈ [0; 2π] are the amplitude and phase delay from ith delay line to lth spatial.Thus, the input mask circuit acts as all-optical complex-valued spatiotemporal input weight generator.The matrix shape h l,I and its programmability depend on the type of optical cross connects.More detailed explanations for the input mask circuit are described in the Supplementary Note 1 and Supplementary Fig. 1.The masked optical signals s l (t) are input to the second stage of the optical processor.In this part, the integrated L-array of coherent cavities acts as spatially parallel delay-based optical reservoir with complex-valued evolution.As discussed in the refs. 36,37, a parallel reservoir computer increase the virtual node count in the reservoir, which enhance the performance.The evolution equation of the complex-valued amplitude inside the lth reservoir cavity is given by where T l is the roundtrip time, α l is the feedback gain that is reconfigured through a programmable optical attenuator inside the cavity, β l is the transmission coefficient of the input fiber coupler, j is an imaginary unit, and φ 1 is the phase detuning of the cavity.The subscript l denotes that the parameters are for lth cavity.The T l is typically set to T l = TN/(N-q), where 0 ≤ q < N.
Then, the continuous time evolution in (2) can be approximated by the following discrete time evolution equations: Thus, the parallel optical cavities act as parallel reservoir cavities.Although the nonlinear activation functions are only implemented in the input and output, the complex-valued evolution in coherent systems ensures rich dynamics comparable to that in incoherent nonlinear systems, as described in 13 .The signals from the reservoir cavity are directly detected by photodiodes (PDs) installed after each cavity.Their dynamics are sampled by using an analog-to-digital converter or oscilloscope with a sampling interval of θ′.Their discretized dynamic responses are considered as the squared norm of complex-valued virtual node response jx l n; i ð Þj 2 , where n′ ∈ [1; N′(=Nθ′/θ)] and N′ are the measurable nodes.The outputs y(n′) are obtained from weighted summation of jx l n0; i ð Þj 2 , which is described as: where ω l,i are trainable read-out weights, which are determined by minimizing the mean square error using the Tikhonov regularization or standard gradient descent method.We can obtain the y(n′) value with time interval T, which is the same as the input one.
The equilibrium network architecture is also shown in Fig. 1a.As the proposed architecture can integrate the spatiotemporal RC system on a compact chip, it is more scalable than the conventional on-chip integration [13][14][15] .In contrast to previous digital pre-processing 11,12,35 , we can set the short virtual node interval θ beyond the RF input sampling rates by setting a short delay difference and using an ultrafast pulse source.For instance, we can achieve the ultra-wideband optical computing on a single photonic chip by employing the femtosecond (>THz) or attosecond pulse lasers (>PHz).However, in this condition, the measurement of the reservoir state and post processing are difficult to perform due to the output sampling rate θ′.Thus, it remains as the final bottleneck.Although the output bandwidth still remains bottleneck to access the ultrawide bandwidth of light, we can overcome it by using WDM and thus use the ultrawide optical bandwidth (>THz) as a computational resource to boost the computation efficiency much more.The details are described in subsection "Parallel data processing using WDM".
Integrated coherent linear photonic processor.As the proposed system requires coherent interference for the MAC operation, we need to accurately maintain the phases and delays in the optics.In addition, the system needs functional optics, including couplers, phase shifters, variable optical attenuators (VOAs), and MZIs.Thus, it is very difficult to build this optics with bulk fiber optics.Here, we employ our photonic platform technology, called the planar lightwave circuit (PLC), to integrate such a complex system on a compact chip.The PLC is a silica-based waveguide technology that has excellent features for composing functional optical devices: low transmittance loss (~0.02 dB/cm), mass fabrication using standard wafer processes, excellent stability (>10year operation 38 ), many available lineup functional components, including interferometers and delay lines 39 , and a wide operating bandwidth (visible 40 to mid-infrared wavelengths 38,39 ).Thanks to these features, PLC devices have already been installed as optical systems for optical fiber links and have also been utilized as advanced circuits for fundamental science such as a quantum photonics 41 and an optical lattice clock 42 .Here, we apply this technology to the photonic RC for the first time.
As a first demonstration, we integrated the above described linear optical system with N = 32 and L = 16 into the PLC.Thus, there are 512 virtual neurons in the circuit.Figure 1b show a schematic of the PLC layout for the input mask.For simplicity, the case for the RC with N = 8 is illustrated.The delay difference θ in the input mask circuit is set to ~8.3 ps (1/120 GHz) by adjusting the waveguide length.As this delay value corresponds to the optical path length of ~2.49mm, it can be accurately implemented to the PLC by using a standard lithography technique (submicrometer accuracy).For the splitting and weighting of the signal, the MZIs are integrated in the mask circuits.Each MZI has two heaters to adjust its phase by using the thermos-optic effect.The MZIs apply the following unitary conversion: where ξ 1 and ξ 2 are the phase shift of MZIs shown in the inset of Fig. 1b.By cascading the MZIs, we can realize various N × N optical topologies.Here, we illustrate three types of possible installations at the bottom of Fig. 1b.The simplest case is to use a mirror of a 1:N variable splitter as shown in Type I in Fig. 1b 43 .This filter supports the arbitrary dense connection to the center output port with only an (N − 1) array of MZIs .However, the connection to the other spatial port becomes sparse.This means that the hl,I for l = L/2 is dense and fully programmable, but the other ports are sparse and not programmable.The dense connection can be realized by using an N(N − 1)/2 array of MZIs as shown in Type II in Fig. 1b, which is called a universal unitary operator 44,45 .Although this architecture can realize dense and fully programmable connections to the output ports, only unitary conversion is supported.The arbitrary matrix connection is realized by combining 2N(N − 1) MZIs as shown in Type III in Fig. 1b 4,46,47 .This network uses a physical instantiation of the singular value decomposition, which is a factorization of any matrix (M) as M = UΣV † , where U is an N × N unitary matrix; Σ is an N × N diagonal, rectangular matrix of nonnegative real numbers; and V is an N × N unitary matrix.Here, two universal unitary circuits (U, V † ) are connected by a column of single MZIs that are used as variable attenuators implementing Σ.Although the densely connected input mask is preferred to achieve better RC performance, the dense unitary conversion requires 496 MZIs for N = 32, which is challenging for the first trial due to the complex wiring and operation.Thus, in this study we employed Type I, which only requires 31 MZIs.Figure 1c show a schematic of reservoir coherent cavities.For simplicity, the case for RC with L = 4 is illustrated in this figure.A VOA, phase shifter, and variable coupler is installed in each cavity.A previous coherent cavity-based RC 13 was implemented in an optical fiber ring with a length of over 200 m.In contrast, we can integrate the system on a compact optical chip thanks to the small θ value.The previous fiber cavity was unstable in terms of temperature and vibration, and thus required the huge thermal isolation and an external feedback operation to stabilize the device.On the other hand, our cavity can be compactly implemented on an optical chip by using silica-based optical waveguide technology.Moreover, it is highly stable and does not require any feedback operation thanks to the telecom-grade PLC technology.The cavity length is set to (6.0 + 0.4 × l) cm, where l is the cavity number.The corresponding roundtrip time is (~290 + 20 × l) ps.In the reservoir PLC, the coupling ratio to the passive cavity is set to 10/90, which corresponds to β in Eq. ( 11) is (0.1) 1/2 = 0.316.The α l ∈ [0, α l , max ] and φ ∈ [0, 2π] for each optical layer is also reconfigurable by tuning the MZIs on the RC chips.The achievable α l , max value depends on the cavity number due to the loss increase from the intersection points, which is almost completely determined by α l;max ¼ 0:87 À 0:021 l.This dependence could be reduced by optimizing the waveguide design.Figure 2 shows the appearance of our optical RC circuit fabricated by the PLC technique.The input mask and reservoir circuit were fabricated on the separate chips for the ease of characterization.These chips are connected by the optical fiber array.The footprint of the input and reservoir circuit were 41 × 46 and 28 × 47 mm 2 , respectively.
Here, we performed experiments to examine the fabricated chip's performance.First, we input the ASE light to the PLC for input mask and observed the transmittance spectrum to confirm the filtering performance.In this setup, two neighboring port were coupled to the center output port with the same optical power.Figure 3a shows the experimentally obtained filtering shape of the fabricated input mask circuit for the center output port (number 16) over the wavelength range of 1530-1565 nm.A magnified view is shown in Fig. 3b.For this experiment, the delay difference was set to θ as shown in Fig. 3e.As can be seen in Fig. 3a, flat and repeated wavelength dependence is obtained, suggesting the circuit can be used for an ultra-wideband optical processor.The observed extinction ratio of 20 dB suggests the good performance of the fabricated MZIs.The observed wavelength duration of 120 GHz suggests accurate implementation of θ.By changing the interfering port as shown in Fig. 3f, we observed a change in the wavelength duration of 60 GHz as expected [Fig.3c].When we randomly set the input mask parameter [Fig.3g], we can set the random filtering shape as shown in Fig. 3d.Thus, the fabricated circuits can be used for the optical mask circuit.Next, we examined the time domain response and the stability of the fabricated chip.For this experiment, we input the masked signal into the center coherent cavity (number 8) to check the overall characteristics of the RC system as shown in Fig. 3k. Figure 3i shows the reservoir response of cavity number 8 before and after the 3-h operation.The input signal [Fig.3h] and the output error [Fig.3j] are also plotted.In this experiment, we randomly generated the mask function and α and φ were set to 0.5 and 0.1π, respectively.A continuous coherent pulse with repeating time of 512 ps [top of Fig. 3e] was input to the system The response of the center output port was observed with an oscilloscope.As shown in this figure, we can successfully generate the time domain response along the time domain thanks to the input mask and reservoir circuit [Fig.3i].As is well known, a coherent system is highly sensitive to the changes in the optical phase due to changes in the surrounding temperature or vibration.Thus, the previous coherent cavitybased RC requires a complex feedback system to stabilize the RC outputs 13 .On the other hand, thanks to the solid-state circuit technology, the reservoir response for the fabricated circuit is stable over a period of three hours without any feedback operation, which is enough time to execute most RC tasks.
Chaotic time-series expectation.We used the Santa-Fe timeseries prediction task 48 to evaluate the RC performance.The aim of this task is to perform one-step-ahead prediction of chaotic data.The chaotic laser data were generated from a far-infrared laser.We used 2000 steps for training and 1000 steps for testing.For the training, we utilized the same teacher signal d(n) = u(n + 1).The training was done by standard Tikhonov regression.The amplitude of the Santa Fe time series was normalized so that the input signal u(n) of the Santa Fe time series ranged from 0 to 1.
First, we compared the bandwidth of input signal u(t) and reservoir output |x(t)| 2 to confirm our concept described above.We used the center cavity (lane 8) for this experiment.The relative feedback phase was set to Δϕ l = 0.1π. Figure 4a shows the measured bandwidth of input data u(t) of Santa-Fe chaotic timeseries and reservoir output |x l (t)| 2 .As can be seen in this figure, the bandwidth of the RF input is elongated thanks to the optical pulse modulation.Thus, we can execute RC processing beyond the input RF bandwidth.The observed maximum bandwidth was limited by the bandwidth of the oscilloscope (~20 GHz) in this experiment.By utilizing an electrical system with higher bandwidth, the signals can be up-converted to a much higher frequency region with the same optical circuits.
Next, we examined the performance on the chaotic-time-series expectation of our RC for a single cavity.Figure 4b shows the normalize mean square error (NMSE) as a function of feedback gain.For comparison, the simulation performance for the passive cavity and optoelectric reservoir with N = 16 are also plotted in the figure.For this simulation, we estimated the NMSE using Eqs.
(3) and ( 5) with q = 1.The impulse response was set randomly to keep the following constraint: |Σh l, i | 2 = 1.The feedback gain was swept from zero to one, and the relative feedback phase was set to Δϕ l = 0.1π, the same as the physical setup.The experimental results well agree with the simulation, suggesting the good performance of our constructed optical circuit.As shown in the figure, the NMSE highly depended on the spectral radius of the reservoir connection (i.e., feedback gain in this case).This behavior was also observed in previous RC systems, and it implies the enhanced memory capacity in the large feedback condition.The observed NMSE was comparable to the optoelectric RC 11 in spite of the lack of the nonlinear element in the cavity.
Next, we investigated the performance using multiple cavities.The reservoir parameters were randomly set for each cavity.The measured reservoir responses of all the cavities are shown in Fig. 4c.Various responses were observed from each cavity.From these responses, we estimated the NMSE as a function of the number of parallel cavities as shown in Fig. 4d.The NMSE is monotonically reduced by adding the cavities, and NMSE of 0.06 is achieved, which is superior to that of a previous on-chip RC 15 thanks to the increment of virtual nodes.This result indicates that the virtual nodes obtained from the parallel cavities are more effective for the chaotic time-series prediction task than those obtained from a single temporal sequence of a single cavity.
Image classification.We tested the image classification task to confirm the multidimensional data classification using multiple cavities.The dataset comprised hand-written-digits in the Modified National Institute of Standards and Technology (MNIST) database 49 .The processing procedure for multidimensional inputs is shown in Fig. 5a.For preprocessing, the original 28 × 28 images were resized to 16 × 16 ones.To covert the temporal data, the 2 × 2 time-sliding window was used for data input.Different digit data are multiplexed in different time slots.The sliding step was set to 2. Thus, the input image data were converted to a fourdimensional time series with 64 time steps.These signals were converted to optical data, and they were processed by the input mask and reservoir optical circuits.The detailed operation for the multidimensional input is described in the "Methods" section.The output signals from each reservoir cavity were acquired by multiple photodetectors.The signals were weighted and summed by using Eq. ( 6).For the training, we used 60,000 images from the training dataset of MNIST.As this requires large-scale memory for Tikhonov regression, we optimized the readout weights in real time using the gradient descent method with the minibatch size of 200 and the learning rate of 5 × 10 −3 .The training was executed in only one epoch.For the test phase, we validated the 10,000 data from the test dataset.
An example of the time domain reservoir responses (cavity number 8) is plotted in Fig. 5b.These responses indicate that we can process the input images at a speed of 17.1 ns per image, which is much faster than the standard RC using electrical hardware; RC with same architecture takes ~1 ms using an 8-core 3.1-GHz CPU.The training and test results are shown in Fig. 5c,  d.As shown in Fig. 5b, the training accuracy almost monotonically increased, suggesting the success of the training.The test accuracies were improved by increasing the parallelism of the RC, and we achieved maximum test accuracy of 91.3% for the 16 parallel RC case.To the best of our knowledge, this is first experimental demonstration of image classification using on-chip photonic RC.However, this value is still inferior to the accuracy for standard deep neural network models.As discussed in 21 , around 16,000 reservoir nodes are required to achieve state-ofthe-art performance for the MNIST benchmark.As the node number of our device is 512, we need to construct 32 times larger circuits: e.g., the proposed optical circuit with N = L = 128 array is required, which needs the 128 × 128 optical cross-connects for the input mask and 128 array of 32-cm delay lines for the reservoir circuit.The fabrication of these components are challenging.However, it could be possible by using state-of-theart manufacturing technology [50][51][52] .
Parallel data processing using WDM.As described above, we can operate the photonic signals beyond the input RF limitation by using our device.The circuit also has potential for ultrawideband operation thanks to the flat passband over 1530-1570 nm (~5 THz) as shown in Fig. 3a.However, the remaining bottleneck from the output still limits the processing speed.Here, we demonstrate parallel processing based on WDM to boost the computation efficiency.Figure 6a shows the WDM processing with the proposed chip.In this scheme, the optical inputs with wavelength channel spacing of Δλ are multiplexed into a single optical transmission line and then input to the optical circuit.The input mask operation of our circuit is equilibrium to the frequency filtering of the up-converted signals.As shown in this figure, the filtering shape is repeated along the λaxis due to the circulating nature of the optical phase.This feature was experimentally confirmed as shown in Fig. 3a-d.As the repeating frequency is determined by the minimum delay difference in the filter, we can generate the same optical mask function when we set the channel spacing Δλ = cθ.Thus, we can use the full optical bandwidth of our device by multiplexing the multiple wavelength inputs.The feedback α l can be kept constant by setting the optical paths length in the MZMs to the same length.Unfortunately, the relative phase and delay length are changed due to the group delay of the circuit.However, the impact on the performance is relatively smaller than that of the other hyper parameters in RC.The output signals from the optical circuits are demultiplexed and detected by the detectors.The reservoir responses for each wavelength channel are weighted and summed individually, which is considered as the wavelength parallel RC outputs.Thus, we can realize parallel RC on the same optical chip simultaneously.As a result, we can use the full optical bandwidth of our circuit (~5 THz) beyond the remaining bottleneck from the outputs bandwidth.
For the demonstration, we examined the parallel RC processing using two wavelength inputs.We used different datasets (MNIST and Fashion-MNIST datasets) for the classification task.The input spectrum is shown in Fig. 6b.The Fashion-MNIST data were modulated onto the wavelength of 1550 nm, and the MNIST data were modulated onto that of 1550.966nm, corresponding to channel spacing of 120-GHz to set the same optical mask shape.We employed only one cavity in this demonstration.The training results are shown in Fig. 6c.The training accuracy increased almost monotonically, which suggests the success of the proposed WDM approach.The observed test accuracies were 79.2% for the MNIST and 70.1% for the Fashion-MNIST datasets.The results for the MNIST are almost comparable to those for the singlewavelength experiment [the result for the single cavity in Fig. 5d] despite the change of wavelength.This suggests that the performance degradation in the WDM approach is small.Thus, we think that we can improve the accuracy to 91.3% by using the multiple cavities as shown in Fig. 5d.Although we only confirmed the feasibility for two-wavelength input due to the limitations of the WDM experimental setup, the fabricated circuit can support 40 wavelength channels over the telecom C-band at least.The results confirm the potential of ultra-wideband processing through WDM.

Discussion
Here, we discuss the computational efficiency of the proposed circuit.Although there are many indices for expressing the performance of a computing device, the multiply-accumulate per second speed (MAC•s −1 ) is now widely considered to be a milestone in the photonic neuromorphic computation region 9,53,54 .Thus, we discuss this index for the reservoir computer.Our photonic circuit can compute the input and reservoir layer propagation as described in Eqs.(3) and ( 5).The equilibrium algorithm for our photonic reservoir computer with multiple ring cavities for the case of one-dimensional input (M = 1) is shown in Algorithm 1.The second term is input mask operation as described in Eq. (3).The first term and the summation of first and second terms are the reservoir operation as described as Eq. ( 5).Note that Ω ijl in the Algorithm 1 is an interreservoir connection, which depends on the q value (delay length of each cavity).For ease of understanding, we describe the case for q = 1.The computational complexity of the photonic circuit does not depend on the q value.This algorithm requires the following MAC operations for each time step: 6 NL times multiplication for the reservoir operation (first term), 6N Lσ times multiplication for the mask operation, where σ is the non-zero density of h li (second term), and (2 NL) times for summation of the first and second terms.Note that the processing is executed on the complex space, which requires two (six times) calculations for the sum (multiplex) operation.The average σ value for the FIR filter is considered as 0.5.Thus, the total MAC for each time-step in our device is considered as 11 NL.As the operation time of each time step for the optical circuit is determined by modulation time interval T, the MAC/s can be expressed as 11 L/θ.For the Mdimensional case, input mask operation becomes more complex, and it requires 0.2 (4M − 1) σNL times operations (see Supplementary Note 1).Thus, the MAC•s −1 for our photonic RC can be described as follows: As M is an input-data-oriented value, L/θ is the important value for the device.The L value is the limited to the size and device components of the optical circuit.The T (=Nθ) is determined by the optical pulse width.Note that the typical single delay-line RC approach gives MAC•s −1 of only 2 N/θ because it does not have an optical input mask and L = 1 with real-valued processing.Algorithm 1 Photonic reservoir computing with multiple ring cavity.
. . .3) and ( 5) in parallel, which result in much faster computation speed beyond the input RF-bandwidth.The performance reached 21.12 and 44.16 T MAC•s −1 .for M = 1 and 4 for our fabricated circuit (θ = 12.5 ps, L = 16).The total power consumption of our system is estimated to be 1330 W: 20 W for the optical circuit; 20 W for the detector, including the post RF amplifier; 120 W for the optical pulse source, including the optical modulator and RF amplifier; 700 W for the oscilloscope, 180 W for the arbitrary waveform generator, and 290 W for our desk-top PC.As the values were estimated from nominal values in the catalog spec sheet, the actual total power consumption is less than this estimation.Thus, the energy efficiency of our circuit [MAC s −1 per watt (MAC s −1 W −1 )] can be estimated as 15.9 and 33.2 G MAC s −1 W −1 for M = 1 and 4.This value is for state-of-the-art electric computational devices (the present best performance is 21.108 GMAC s −1 for MN-3 55 in June 2020).As can be seen, most of the power consumption originates from the electric devices (oscilloscope, AWG, and desk-top PC); therefore, we can reduce the power consumption much more by constructing application-specific circuits.In addition, as demonstrated in this study, the WDM technique enables the optical circuit to share the parallel data inputs.Our circuit supports the C-band (1530-1570 nm, which has ~5-THz bandwidth) with wavelength-spacing of 120 GHz.Thus, we can potentially use 40 wavelength channels in our circuit, which realizes petascale optical reservoir processing (0.845 and 1.77 PMAC s −1 for M = 1 and 4).These speeds are much higher than the theoretical operation speed of recent CPUs [~500 G MAC s −1 (16 MAC × 3 GHz × 10 core)] and GPUs [~6 T MAC s −1 (2 MAC × 1 GHz × 3000 core)].This value is not far from that for current state-of-the art of supercomputers, which ranges from 1 to 100 PMAC/s.Therefore, our approach poses a challenge to conventional Turing-von Neumann machines, and it confirms the great potential of photonic neuromorphic processing.

Conclusion
In this paper, we demonstrated photonic RC based on coherent linear photonic processors.In contrast to previous approaches, both the input signals and reservoir weights are optically encoded in the spatiotemporal domain, which enables scalable integration on a compact chip.The ultrafast optical pulse source up-converts the input signals to a higher frequency region beyond the electrical bandwidth.We also demonstrated parallel processing based on the wavelength division multiplexing (WDM), which enables the use of ultrawide optical bandwidth (>THz) as a computational resource.As a result, the computation efficiency is boosted beyond the computation bottleneck remaining from the I/O bandwidth.Experiments for the standard benchmarks showed good performance for chaotic time-series forecasting and image classifications with record-high processing speed of ~17.1 ns per image.The device can achieve 21.12 T MAC s −1 for each wavelength and can reach peta-scale computation speed on a single photonic chip by using WDM.

Methods
Experimental setup.Optical pulses with a pulse width of 30 ps and repeating time T of ~266.7 ps (1/120 GHz ×32), corresponding to p = 1, were generated by using a coherent laser light source with a 1550-nm wavelength and an amplitude modulator with a 45-GHz bandwidth.The input data was generated by using an arbitrary waveform generator with a 20-GHz bandwidth and sampling rate of 60 GSa/s.The synthesizer and arbitrary waveform generator (AWG) were synchronized by using the same clock.The signal was amplified by an optical fiber amplifier to 10 dBm and input to the fabricated PLC.The mask condition and reservoir condition were randomly set by controlling the Mach-Zehnder modulators and phase shifters in the chip.The output signals were filtered by the wavelength filter.They were measure by using 50-GHz bandwidth photodetectors, and the observed RF signals were amplified by using low-noise amplifier with a 40-GHz bandwidth.The signals were acquired by using a digital storage oscilloscope with an 18-GHz bandwidth and 60 GSa/s (DSO).Thus, the measurable node count decreased to θ′/θ =1/2 [θ′ ~16.7 ps, N' ′= 16].
Multidimensional data processing.To process the multidimensional data such as image information, RC inputs often become time-dependent vector u(t) = [u 1 (t), u 2 (t), …, u M (t)].To process the such type of data, (M × N)-sized random matrix are usually employed for the input mask 20,21 .In the previous work, it is easy to realize because the input mask was processed in the electric domain.However, there are no reports for input masking for photonic implementation.Here, we consider the method for input masking by using the same photonic convolutional filter.In our device, there are two ways to process the multidimensional inputs.The first way is simply inputs the signal to multiple input ports like the conventional on-chip ANNs 4 .However, this architecture requires multiple laser sources, which requires complex and expensive hardware update; e.g., for MNIST task, which is a standard image classification task, requires 28 inputs.It requires 28 laser inputs with optoelectric modulator array.In addition, when their relative phases and/or intensity are fluctuated, the optical output is also fluctuated due to interferometric condition changes.It leads the poor accuracy of the system.Thus, we adapt the second choice, that is time division multiplexing as shown in Fig. 1.This method does not require any hardware update.In this method, the multi-dimensional inputs are arranged in different time slot.To realize it, the radio-frequency (RF) input signal u(t) is set to following form.
where the modulation interval T′ is set to T′ = T/M.The repeating time of optical pulse is also set to T′.Then, the δ(t) is described as follows; Then, the modulated optical pulse u′(n) for discretized time t(n,i) can be described as the orthonormal basis of input vector To satisfy the synchronization between RF-input and optical pulse, N/M should be set to integer value.The reservoir processing and post processing are same as the way for one-dimensional case.Thus we can process the multidimensional data by using the same optical circuit.
Training of readout.The parameter of RC is only readout weight ω, which forms a linear combination of the reservoir states.For the training, we collected the reservoir response |x| 2 from each photodetector during training period, T train Then, we obtained the N r × T train state matrix S, where N r is the number of reservoir nodes (in our case, N r = NL).The goal of the optimization is to find the ω in such a way that the actual output Y=ωS matches the desired output Y teacher as close as possible in the least-squares sense.This is a linear problem.For the offline case, the optimum ω is calculated by using the Moore-Penrose pseudoinverse S † of state matrix S: S y ¼ ðS T SÞ À1 S T ; ð12Þ ω ¼ Y teacher ðS y À λS T IÞ À1 ; ð13Þ where λ is the parameter for Ticknov regularization to avoid overfitting, and I is an N r -dimensional unity matrix.For the online training, we divided the training data into chunks (mini-batches).The readout was updated by the following simple gradient descent method.
where ε is learning rate.

Fig. 1
Fig. 1 Scalable reservoir computer (RC) on coherent photonic processor.a Schematics of proposed reservoir computing architecture.The electric input signal u(t) with modulation time interval T is coded on optical pulse, δ(t) by optical modulator.This electro-optic conversion results in a sinusoidal nonlinearity (F in ).In the optical circuit, both the input and reservoir weights are encoded in the spatiotemporal domain.L-array of optical cavities with N temporal nodes acts as RC system with N × L virtual nodes.The output signals from optical circuit are detected by photodetectors (PDs), which generate nonlinear conversion (F out ).The detected signals |x l | 2 are weighted and summed by digital processing unit to obtain final output y(t).b, c Schematic of waveguide layout for b input mask circuit, and c reservoir circuit.Three types of possible installations for optical connection in the input mask circuit are also shown in c.The input weights h l and reservoir parameters α l , β l , φ l can be reconfigured by tuning the phase shifter (PS), Mach-Zehnder interferometer (MZI) and variable optical attenuator (VOA).

Fig. 2
Fig. 2 Fabricated optical circuit for reservoir computing.a, b Appearance of fabricated optical circuit for a input mask and b reservoir circuit.Silica-based waveguide platform technology, called the planar lightwave circuit (PLC), was employed to integrate a complex optical system on a compact chip.

Fig. 3
Fig. 3 Frequency and time response of fabricated circuit.a Experimentally obtained wavelength (frequency) filtering shape of the fabricated input mask circuit for the center output port (number 16) over the wavelength range of 1530-1565 nm.The delay difference was set to θ. b Magnified view for a. c, d Filtering shape with c delay difference of 2θ and d randomly set input mask.e-g Setting state of input mask circuit with e delay difference of θ [for a and b], f 2θ [for c], and g random state [for d].The observed frequency response suggests the accurate implementation of input mask circuit.We also examined the time domain response and the stability of the fabricated chip by inputting the masked signal into the center coherent cavity (number 8).h-j, h Input pulse and i time responses of reservoir computer before and after 3-h operation and j their error.k Setting state of input mask and reservoir circuit [for h-j].Thanks to the solid-state circuit technology, the reservoir response for the fabricated circuit is stable over a period of three hours without any feedback operation, which is enough time to execute most RC tasks.

Fig. 4
Fig. 4 Chaotic time-series expectation.a Observed bandwidth for radio-frequency (RF) input chaotic time-series and reservoir response.b Normalized mean square error (NMSE) as a function of feedback gain for single cavity experiment.c Reservoir response of all cavity.d NMSE as a function of cavity parallelism.Error bars in b and d represent one standard deviation.

Fig. 5
Fig. 5 Hand-written-digit recognition.a Processing scheme for Modified National Institute of Standards and Technology (MNIST) database.Original digit images were converted to optical time-series data, and they were processed by the input mask and reservoir optical circuits.The signals were detected by photodetectors (PDs) and classified by digital output layer.b Measured time response of reservoir for the center cavity (l = 8).c Training accuracy as a function of learned images.d Test accuracy as a function of cavity parallelism.Error bars represent one standard deviation.

Figure 7
Figure 7 shows estimated MAC•s −1 for each wavelength (MAC•s −1 •λ −1 ) as a function of θ for various L values.Here, we assume M = 1 [Fig.7a] and M = 4 [Fig.7b], which are the values we used for the chaotic series expectation and image classification tasks in this work.In our circuit, the spatially distributed delay lines, optical cavities, and optical interferometers solve Eqs.(3) and (5) in parallel, which result in much faster computation speed beyond the input RF-bandwidth.The performance reached 21.12 and 44.16 T MAC•s −1 .for M = 1 and 4 for our fabricated circuit (θ = 12.5 ps, L = 16).The total power consumption of our system is estimated to be 1330 W: 20 W for the optical circuit; 20 W for the detector, including the post RF amplifier; 120 W for the

Fig. 6
Fig. 6 Parallel processing using wavelength division multiplexing (WDM).a Schematic of parallel reservoir computing using WDM.As the filtering shape of input mask is repeated along the wavelength (λ) axis, we can process the WDM-input data in parallel with cθ channel spacing.b Spectra of WDM-inputs.c Training accuracy as a function of learned images.

Fig. 7
Fig. 7 Estimated processing speed.a, b Estimated multiply-accumulate (MAC) per second per wavelength λ (MAC s −1 λ −1 ) for proposed photonic RC as a function of 1/θ with a M = 1 and b M = 4.