## Introduction

Machine learning (ML) covers a wide range of algorithms and modeling tools with automated data-processing capabilities based on experience1,2. ML, with the prominent example of neural networks, has proven successful for tackling practical processing tasks that are unsuitable for conventional computer algorithms3,4,5,6,7. With the deployment of ML algorithms, their limitations and inefficiencies when running on top of conventional computing hardware arise both in terms of power consumption and computing speed8. The demand for an increased efficiency is currently fueling the field of unconventional computing, which aims at developing hardware and algorithms that go beyond the traditional von Neumann architectures9,10,11. Recent extensions of neural networks and other ML techniques based on quantum systems12,13,14,15,16 aim to offer and identify novel capabilities17,18,19,20,21,22,23. In this context, reservoir computing (RC) is a machine-learning paradigm that is amenable to unconventional hardware-based approaches in the classical domain, e.g., in photonics24,25,26,27,28,29,30 and spintronics31,32.

RC exploits the dynamics of a nonlinear system—the reservoir—for information processing of time-dependent inputs33,34. RC has its roots in the discovery that in recurrent neural networks, i.e., neural networks with an internal state, it is sufficient to only train the connections leading to the final output layer, avoiding optimization difficulties well-known in neural networks, without any apparent loss in computational power35,36. In fact, restricting the training to only the final layer does not necessarily restrict the scope of tasks that can be done; indeed, universality has been shown for distinct classes of RC such as so-called liquid state machines37 and echo-state networks38. Unlike feed-forward neural networks, said to be universal when they can approximate any continuous function39,40, RC realizes transformations between sequences of data and is said to be universal when it can approximate any so called fading memory function37,38, which can be thought of as a continuous function of a finite number of past inputs. In practice, reservoir computers have achieved state-of-the-art performance in tasks such as continuous speech recognition41 and nonlinear time series prediction42 thanks to their intrinsic memory43.

RC has the potential to be extended to the quantum regime22,44,45,46,47,48,49,50,51,52,53,54,55. Indeed, since the reservoir requires neither fine tuning nor precise engineering such extensions are good candidates for NISQ (noisy intermediate-scale quantum) technologies56. Continuous-variable quantum systems amenable to an optical implementation would have the additional benefit of intrinsic resilience to decoherence even at room temperature and allow to easily read-out a portion of the optical signal for output processing, including (direct or indirect) measurements. Previous proposals for neural networks realized with continuous-variable quantum systems have either used non-Gaussian gates19 or nonlinear oscillators52 to go beyond linear transformations of the modes. Such nonlinearity is required, e.g., for universal quantum computing in continuous-variable systems57, i.e., for the ability to reproduce any unitary transformation57,58, however, non-Gaussian gates in particular are an unsolved engineering challenge so far59,60,61,62.

Here, we put forward the framework for RC with Gaussian states of continuous-variable quantum systems, in bosonic reservoirs given by random harmonic networks. The input is encoded into single-mode Gaussian states injected into the network by periodic state resets, whereas the output is a trained function of the network observables. It is worth noticing that the encoding is in general nonlinear, providing nonlinearity for the overall input-output map. The linear network then provides an analog system—not based on gates or circuits—where different encodings may be exploited. We explore the utility of the method spanning across classical and quantum states and our results are applicable to several physical reservoirs. As elaborated in the Discussion, the restriction to Gaussian dynamics of linear oscillators brings the model within reach of state-of-the-art optical63,64,65,66 or superconducting and optomechanical platforms67,68,69,70, however, one might expect it to have limited information processing capabilities. We find that even universal RC, i.e., approximation of any fading memory functions, is possible. Versatility is provided by the ability to tune information processing from fully linear to highly nonlinear with the input encoding. Therefore, even a fixed network can adapt to the requirements of the task at hand. Finally, we reveal that encoding to field fluctuations, or covariances, instead of in field intensities, or first moments, can significantly increase the system’s information processing capacity (IPC)71, a task-independent quantifier of RC capability. However, encoding to classical (thermal) fluctuations does not provide any nonlinearity. To solve the problem, we take a step towards exploiting the quantumness of the system and propose to harness the quantum fluctuations provided by squeezed vacuum instead. Indeed, this achieves simultaneously increased capacity, nonlinearity and universal RC.

## Results

### The model

We consider a linear network of N interacting quantum harmonic oscillators as detailed in Methods. The scheme for using this network for RC is shown in Fig. 1. The input sequence s = {…, sk−1, sk, sk+1, …}, where $${{\bf{s}}}_{k}\in {{\mathbb{R}}}^{n}$$ represents each input vector and $$k\in {\mathbb{Z}}$$, is injected into the network by resetting at each timestep k the state of one of the oscillators, called ancilla (A), accordingly. As often done in RC, each input vector can be mapped to a single scalar in the ancilla through a function of the scalar product vsk where $${\bf{v}}\in {{\mathbb{R}}}^{n}$$. The rest of the network acts as the reservoir (R), and output is taken to be a function h of the reservoir observables before each input.

To express the dynamics, let x = {q1, p1, q2, p2, …} be the vector of network momentum and position operators and let xk be the form of this vector at timestep k, after input sk has been processed. We may take the ancilla to be the Nth oscillator without a loss of generality. Let the time between inputs be Δt. Now operator vector xk is related to xk−1 by

$${{\bf{x}}}_{k}={\bf{S}}({{\Delta }}t)\left({P}_{R}{{\bf{x}}}_{k-1}\oplus {{\bf{x}}}_{k}^{A}\right),$$
(1)

where PR drops the ancillary operators from xk−1 (reservoir projector, orthogonal to the ancilla vector) and $${{\bf{x}}}_{k}^{A}$$ is the vector of ancillary operators conditioned on input sk, while $${\bf{S}}({{\Delta }}t)\in {\rm{Sp}}(2N,{\mathbb{R}})$$ is the symplectic matrix induced by the Hamiltonian in Eq. (7) (see Methods) and time Δt (for more details on this formalism see, e.g., refs. 72,73,74). The dynamics of reservoir operators $${{\bf{x}}}_{k}^{R}={P}_{R}{{\bf{x}}}_{k}$$ is conveniently described dividing St) into blocks as

$${\bf{S}}({{\Delta }}t)=\left(\begin{array}{ll}{\bf{A}}&{\bf{B}}\\ {\bf{C}}&{\bf{D}}\end{array}\right),$$
(2)

where A is 2(N − 1) × 2(N − 1) and D is 2 × 2. Now the formal definition of the proposed reservoir computer reads

$$\left\{\begin{array}{l}{{\bf{x}}}_{k}^{R}={\bf{A}}{{\bf{x}}}_{k-1}^{R}+{\bf{B}}{{\bf{x}}}_{k}^{A},\\ {o}_{k}=h({{\bf{x}}}_{k}^{R}),\hfill\end{array}\right.$$
(3)

where h maps the reservoir operators to elements of the real output sequence o = {…, ok−1, ok, ok+1, …}.

For Gaussian states, the full dynamics of the system conditioned by the sequential input injection is entailed in the first moments vector $$\langle {{\bf{x}}}_{k}^{R}\rangle$$ and covariance matrix $$\sigma ({{\bf{x}}}_{k}^{R})$$. The values at step 0, given a sequence of previous m inputs s = {sm+1, …, s−1, s0} encoded in the corresponding ancilla vectors, is obtained through repeated application of Eq. (3) and reads

$$\left\{\begin{array}{lll}\sigma ({{\bf{x}}}_{0}^{R})={{\bf{A}}}^{m}\sigma ({{\bf{x}}}_{-m}^{R}){({{\bf{A}}}^{\top })}^{m}\hfill\\ \qquad\quad \,+\mathop{\sum }\limits_{j=0}^{m-1}{{\bf{A}}}^{j}{\bf{B}}\sigma ({{\bf{x}}}_{-j}^{A}){{\bf{B}}}^{\top }{({{\bf{A}}}^{\top })}^{j},\hfill\\ \langle {{\bf{x}}}_{0}^{R}\rangle ={{\bf{A}}}^{m}\langle {{\bf{x}}}_{-m}^{R}\rangle +\mathop{\sum }\limits_{j=0}^{m-1}{{\bf{A}}}^{j}{\bf{B}}\langle {{\bf{x}}}_{-j}^{A}\rangle ,\end{array}\right.$$
(4)

where $$\sigma ({{\bf{x}}}_{-m}^{R})$$ and $$\langle {{\bf{x}}}_{-m}^{R}\rangle$$ are the initial conditions, i.e., the initial state of the reservoir. This is the Gaussian channel for the reservoir conditioned on the input encoded in xA. Different Gaussian states of the ancilla can be addressed, such as coherent states, squeezed vacuum or thermal states (see Fig. 1), respectively characterized by the complex displacement α, squeezing degree r and phase φ, and thermal excitations nth (see Methods, Eq. (9)). Finally, the output is taken to be either linear or polynomial in either the elements of $$\sigma ({{\bf{x}}}_{k}^{R})$$ or $$\langle {{\bf{x}}}_{k}^{R}\rangle$$.

Observables can be be extracted with Gaussian measurements such as homodyne or heterodyne detection. We will next show that the introduced model not only satisfies the requirements for RC, but is universal for RC even when restricting to specific input encodings.

### Universality for reservoir computing

To begin with, we show that instances of the model defined by Eq. (3) and the dependency of $${{\bf{x}}}_{k}^{A}$$ on sk can be used for RC, i.e., the dynamics conditioned by the input can be used for online time series processing by adjusting the coefficients of the polynomial defined by h to get the desired output.

As explained in Methods (Reservoir computing theory), the goal is more formally to reproduce a time-dependent function f(t) = F[{…, st−2, st−1, st}], associated with a given input s and functional F from the space of inputs to reals. Consequently, we say that the system can be used for RC if there is a functional from the space of inputs to reals that is both a solution of Eq. (3) and sufficiently well-behaved to facilitate learning of different tasks. These two requirements are addressed by the echo-state property (ESP)43 and the fading memory property (FMP)75, respectively. The associated functions are called fading memory functions. In essence, a reservoir has ESP if and only if it realizes a fixed map from the input space to reservoir state space—unchanged by the reservoir initial conditions—while FMP means that to get similar outputs it is enough to use inputs similar in recent past—which provides, e.g., robustness to small past changes in input. The two are closely related and in particular both of them imply that the reservoir state will eventually become completely determined by the input history; in other words forgetting the initial state is a necessary condition for ESP and FMP.

Looking at Eq. (4), it is readily seen that the model will become independent of the initial conditions at the limit m →  of a left infinite input sequence if and only if ρ(A) < 1, where ρ(A) is the spectral radius of matrix A. Therefore, ρ(A) < 1 is a necessary condition for having ESP and FMP. The following lemma (proven in Supplementary Note 2) states that it is also sufficient when we introduce the mild constraint of working with uniformly bounded subsets of the full input space, i.e., there is a constant that upper bounds sk for all k in the past.

Lemma 1 Suppose the input sequence s is uniformly bounded. Let ancilla parameters be continuous in input and let h be a polynomial of the elements of $$\sigma ({{\bf{x}}}_{k}^{R})$$ or $$\langle {{\bf{x}}}_{k}^{R}\rangle$$. The corresponding reservoir system has both ESP and FMP if and only if the matrix A in Eq. (3) fulfills ρ(A) < 1.

This is the sought condition for RC with harmonic networks, either classical or quantum. Importantly, it allows to discriminate useful reservoirs by simple inspection of the parameters of the network through the spectral radius of A.

We now turn our attention to universality. The final requirement to fulfill is separability, which means that for any pair of different time series there is an instance of the model that can tell them apart. Then the class of systems defined by Eq. (3) is universal38,76 in the following sense. For any causal time-invariant fading memory functional F (see Methods, reservoir computing theory) there exists a finite set of functionals realized by our model that can be combined to approximate F up to any desired accuracy. Physically, such combinations can be realized by combining the outputs of many instances of the model with a polynomial function. Mathematically, this amounts to constructing the polynomial algebra of functionals.

The next theorem (of which we give a simplified version here and a full version in the Supplementary Note 3) summarizes our analysis of the model described in Eq. (3).

Universality theorem (simplified) Given instances of reservoir systems governed by Eq. (3) with a given Δt and for ρ(A)<1, hence providing a family $${\mathcal{Q}}$$ of fading memory functionals, the polynomial algebra of these functionals has separability. This holds also for the subfamilies $${{\mathcal{Q}}}_{\text{thermal}}$$, $${{\mathcal{Q}}}_{\text{squeezed}}$$, and $${{\mathcal{Q}}}_{\text{phase}}$$, that correspond to thermal, squeezed, and phase encoding, respectively. As any causal, time-invariant fading memory functional F can be uniformly approximated by its elements, the reservoir family of Eq. (3) with the specified constraint is universal.

We sketch the main ingredients of the proof. Since the model admits arbitrarily small values of ρ(A), there are instances where ρ(A) < 1; by virtue of Lemma 1, $${\mathcal{Q}}$$ can therefore be constructed and it is not empty. We show that the associated algebra has separability. Since the space of inputs is taken to be uniformly bounded, we may invoke the Stone-Weierstrass Theorem77 and the claim follows. Full proof and additional details are in Supplementary Note 3.

We note that unlike ESP and FMP, separability depends explicitly on the input encoding. In Supplementary Note 3 we show separability for three different encodings of the input to elements of $$\sigma ({{\bf{x}}}_{k}^{A})$$: thermal (nth), squeezing strength (r) and phase of squeezing (φ). It should be pointed out that separability (and hence, universality) could be shown also for first moments encoding in a similar manner.

### Controlling performance with input encoding

The here derived Universality Theorem guarantees that for any temporal task, there is a finite set of reservoirs and readouts that can perform it arbitrarily well when combined. Let us now assume a somewhat more practical point of view: we possess a given generic network, and we attempt to succeed in different tasks by training the output function h to minimize the squared error between output o and target output. For simplicity, we will also take inputs to be sequences of real numbers, rather than sequences of vectors—we stress that universality holds for both. Beyond universality of the whole class of Gaussian reservoirs, what is the performance and versatility of a generic one?

First of all, we will address how to single out instances with good memory. As pointed out earlier, memory is provided by the dependency of the reservoir observables on the input sequence. Informally speaking, reservoirs with good memory can reproduce a wide range of functions of the input and, therefore, learn many different tasks. Furthermore, to be useful a reservoir should possess nonlinear memory, since this allows the offloading of nontrivial transformations of the input to the reservoir. Then nonlinear time series processing can be carried out while keeping the readout linear, which simplifies training and reduces the overhead from evaluating the trained function.

Memory is strongly connected to FMP; in fact, a recent general result concerning reservoirs processing discrete-time data is that under certain mild conditions, FMP guarantees that the total memory of a reservoir—bounded by the number of linearly independent observables used to form the output—is as large as possible71. Consequently, all instances that satisfy the spectral radius condition of Lemma 1 have maximal memory in this sense. Indeed with Lemma 1 the condition for FMP is straightforward to check. Furthermore, we find that reservoir observables seem to be independent as long as L in Eq. (7) does not have special symmetries—as a matter of fact, numerical evidence suggests a highly symmetric network such as a completely connected network with uniform frequencies and weights never satisfies ρ(A) < 1. Having FMP says nothing about what kind of functions the memory consists of, however.

Is there nonlinear memory? It is evident that the top equation of Eq. (3) is linear in reservoir and ancilla operators, but the encoding is not necessarily linear because of the way ancilla operators $${{\bf{x}}}_{k}^{A}$$ depend on the input. For single-mode Gaussian states (see Eq. (9) in Methods), it can be seen that the reservoir state is nonlinear in input when encoding to either magnitude r or phase φ of squeezing, or the phase of displacement $$\arg (\alpha )$$. Otherwise, that is for encoding in coherent states amplitude or thermal states average energy, it is linear (see Eq. (10) in Methods). This implies that the input encoding is the only source of nonlinear memory when the readout is linear—nonlinearity comes from pre-processing the input to ancilla states, not from an activation function, which is the conventional source of nonlinearity in RC.

The performance of Gaussian RC can be assessed considering different scenarios. For the remainder of this work we fix the network size to N = 8 oscillators and form the output using the reservoir observables and a bias term; see Methods for details about the networks used in numerical experiments and their training. We consider nonlinear tasks in Fig. 2. In panels a and b we take the output function h to be a linear function of $$\langle {{\bf{x}}}_{k}^{R}\rangle$$ and inputs sk to be uniformly distributed in [−1, 1], and consider two different encodings of the input into the ancilla $$\langle {{\bf{x}}}_{k}^{A}\rangle$$, as the amplitude and phase of coherent states. Setting α → sk + 1 and phase to a fixed value $$\arg (\alpha )\to 0$$ leads to fully linear memory, which leads to good performance in the linear task of panel a only. In contrast, setting α → 1 and encoding the input to phase as $$\arg (\alpha )\to 2\pi {s}_{k}$$ leads to good performance in the nonlinear task shown in panel b and limited success in the linear one of panel a.

Nonlinearity of reservoir memory is not without limitations since $$\langle {{\bf{x}}}_{k}^{R}\rangle$$ does not depend on products of the form sksjsl where at least some of the indices are unequal, i.e., on products of inputs at different delays, as can be seen from Eq. (10). This is a direct consequence of the linearity of reservoir dynamics. When h is linear in $$\langle {{\bf{x}}}_{k}^{R}\rangle$$ the output will also be independent of these product terms, hindering performance in any temporal task requiring them. While Universality Theorem implies the existence of a set of reservoirs for any task, we will show that even a single generic reservoir can be sufficient when nonlinearity is introduced at the level of readout, at the cost of relatively more involved training.

To illustrate the nontrivial information processing power of a single generic reservoir, we consider the parity check task78 (Fig. 2c, d), defined as

$${({\rm{PC}}(\tau ))}_{k}={\rm{mod}}\left(\mathop{\sum }\limits_{l=0}^{\tau }{s}_{k-l},2\right)$$
(5)

where sk {0, 1}; the target output is 0 if the sum of τ + 1 most recent inputs is even and 1 otherwise. It can be shown that (PC(τ))k coincides with a sum of products of inputs at different delays for binary input considered here. When encoding in a coherent state amplitude (α → sk, $$\arg (\alpha )\to 0$$) for readout function h polynomial in $$\langle {{\bf{x}}}_{k}^{R}\rangle$$, the results show that increasing the polynomial degree d of the function h allows to solve the task for higher delays τ. In particular, we find that the reservoir can solve the well-known XOR problem for nonlinearity degrees d ≥ 2, which coincides with the parity check at τ = 1. The parity check at τ = 3 works, in turn, for d ≥ 4.

### Information processing capacity

Besides providing nonlinearity, input encoding also facilitates a versatile tuning of the linear and nonlinear memory contributions. To demonstrate this, we consider how input encoding affects the degree of nonlinear functions that the reservoir can approximate, as quantified by the information processing capacity (IPC)71 of the reservoir. The IPC generalizes the linear memory capacity79 often considered in RC to both linear and nonlinear functions of the input. Even if its numerical evaluation is rather demanding, it has the clear advantage to provide a broad assessment of the features of RC, beyond the specificity of different tasks.

We may define the IPC as follows. Let X be a fixed reservoir, z a function of a finite number of past inputs and let h be linear in the observables of X. Suppose the reservoir is run with two sequences $${\bf{s}}^{\prime}$$ and s of random inputs drawn independently from some fixed probability distribution p(s). The first sequence $${\bf{s}}^{\prime}$$ is used to initialize the reservoir; observables are recorded only when the rest of the inputs s are encoded. The capacity of the reservoir X to reconstruct z given $${\bf{s}}^{\prime}$$ and s is defined to be

$${C}_{{\bf{s}}^{\prime} ,{\bf{s}}}(X,z)=1-\frac{{\min }_{h}\sum _{k}{({z}_{k}-{o}_{k})}^{2}}{\sum _{k}{z}_{k}^{2}}$$
(6)

where the sums are over timesteps k after initialization, each zk is induced by the function z to be reconstructed and the input, and we consider the h that minimizes the squared error in the numerator. The maximal memory mentioned earlier may be formalized in terms of capacities: under the conditions of Theorem 7 in the work by Dambre et al.71, the sum of capacities $${C}_{{\bf{s}}^{\prime} ,{\bf{s}}}(X,z)$$ over different functions z is upper bounded by the number of linearly independent observables used by h, with the bound saturated if X has FMP. This also implies that finite systems have finite memory, and in particular increasing the nonlinearity of the system inevitably decreases the linear memory71,80. Importantly, infinite sequences $${\bf{s}}^{\prime}$$, s and a set of functions that form a complete orthogonal system w.r.t. p(s) are required by the theorem; shown results are numerical estimates. We consistently take p(s) to be the uniform distribution in [−1, 1]; examples of functions z orthogonal w.r.t. this p(s) include Legendre polynomials P1 and P5 appearing in Fig. 2, as well as their delayed counterparts. Further details about the estimation of total capacity are given in Methods.

We consider the breakdown of the normalized total capacity to linear (covering functions z with degree 0 or 1), nonlinear (degrees 2–3) and highly nonlinear (degree 4 or higher) regimes in Fig. 3. We take h to be a linear function of $$\langle {{\bf{x}}}_{k}^{R}\rangle$$ and address the capability to have Gaussian RC operating with different linear and nonlinear capabilities by varying the input encoding into a coherent ancillary state from amplitude to phase α → (1 − λ)(sk + 1) + λ, $$\arg (\alpha )\to 2\pi \lambda s_k$$ where sk [−1, 1]; this is a convex combination of the two encodings used in panels a and b of Fig. 2. As can be seen in Fig. 3, adjusting λ allows one to move from fully linear (for amplitude encoding) to highly nonlinear (for phase) information processing, which can be exploited for a versatile tuning of the reservoir to the task at hand. Remarkably, this can be done without changing neither the parameters of the Hamiltonian in Eq. (7) (that is, the reservoir system) nor the observables extracted as output in h. The earlier mentioned trade-off between linear and nonlinear memory keeps the total memory bounded, however, Lemma 1 ensures that this bound is saturated for all values of λ.

### From intensities to field fluctuations and RC with quantum resources

Previously we considered coherent states for the ancilla, encoding the input to α and $$\arg (\alpha )$$. In the limit of large amplitudes α 1, coherent states allow for a classical description of the harmonic network, with field operator expectation values corresponding, for instance, to classical laser fields81,82. This realization of RC would be impossible in the limit of vanishing fields where α → 0 since then $$\langle {{\bf{x}}}_{k}^{R}\rangle$$ becomes independent of input. Here, we put forward the idea of harnessing instead the fluctuations encoded in the covariance matrix $$\sigma ({{\bf{x}}}_{k}^{R})$$, which also increases the amount of linearly independent reservoir observables. Even if only a subset of them can be directly monitored, the rest will play the role of hidden nodes that may be chosen to contribute as independent computational nodes, and have already been suggested to be a source of computational advantage in quantum RC in spin systems44. Here, we analyze the full IPC of the system including all observables.

In the classical regime, thermal fluctuations could be used, corresponding to thermal states (encoding to nth) for the ancilla. While representing an original approach to the best of our knowledge, it will be seen that these classical fluctuations will provide only linear memory and as such cannot be used to solve nontrivial temporal tasks without relying on external sources of nonlinearity. We propose to solve this problem by relying instead on quantum fluctuations provided by encoding the input to squeezed vacuum (encoding to r, φ), i.e., by adopting a quantum approach to RC with Gaussian states. By virtue of the Universality Theorem, either thermal or squeezed states encoding can be used for universal RC.

We compare the classical and quantum approaches in Fig. 4, where we show how the capacity of the system of seven reservoir oscillators and the ancilla gets distributed for the aforementioned encodings. For comparison, we also include the case of an echo-state network (ESN, see Methods) consisting of 8 neurons, a classical reservoir computer based on a recurrent neural network43. We separate the contributions to total capacity according to degree to appreciate the linear memory capacity (degree 1) and nonlinear capacity and take sk to be uniformly distributed in [−1, 1]. The total capacity, as displayed in Fig. 4, allows to visualize the clear advantage of the oscillators network over the ESN, as well as the particularities of the different input encodings.

The cases α and $$\arg (\alpha )$$ in Fig. 4 correspond to Fig. 3 for λ = 0 and λ = 1, respectively. For them, we take h to be linear in $$\langle {{\bf{x}}}_{k}^{R}\rangle$$. The total capacity for ESN and harmonic networks with α and $$\arg (\alpha )$$ encoding differ by a factor 2 because the neurons of ESN each provide a single observable, while two quadratures are considered for reservoir oscillators. Comparing capacities, ESN provides more nonlinear memory than amplitude encoding of coherent states α, but for phase encoding we can see significant nonlinear contributions to total capacity from higher degrees, up, e.g., to 7, in spite of the small size of the harmonic network.

Let us now reduce α → 0. In this case, we show in Fig. 4 that classical thermal fluctuations provided by setting nth → sk + 1 and taking h to be a linear function of $$\sigma ({{\bf{x}}}_{k}^{R})$$ increase the capacity significantly, as the number of available observables is proportional to the square of the number of oscillators (as explained in Methods, training of the network). For thermal encoding, the capacity does not reach the total capacity bound, limited by dependencies between some of the observables. Furthermore, there is only linear memory. As shown in Fig. 4, considering instead quantum fluctuations gives both a quadratic capacity increase and nonlinear memory. Setting r → sk + 1, φ → 0 has a ratio of linear to nonlinear memory somewhat similar to the ESN, while setting r → 1, φ → 2πsk gives significant amounts of nonlinear memory (last bar in Fig. 4), displaying the advantage of encoding in squeezed vacuum.

## Discussion

The concept of universality in different fields and applications depends on the scope. In the context of quantum computing it refers to the ability to reproduce any unitary transformation57,58, in feed-forward neural networks it characterizes the possibility to approximate any continuous function39,40, and in RC it identifies the ability to approximate as accurately as one wants so called fading memory functions37,38.

Recently, universality of RC has been proved in the full class of classical linear systems76, in echo-state networks38, and on the quantum side, for the spins maps of the work by Chen et al.49 tested with superconducting quantum computers. The choice of the quantum map governing the RC dynamics is also relevant. For instance, at variance with the work by Chen et al.49, a previous universality demonstration with spins47 applies to an input-dependent dynamical map that is not guaranteed to be convergent for all input sequences. Although in general more challenging to implement, it has been suggested that quantum reservoirs may have a more favorable scaling of computational power with size than classical reservoirs44; importantly, universality guarantees that this power may be harnessed for achieving any temporal tasks. In this work we have established that even with the limited resources offered by linear quantum optics platforms, Gaussian measurements, and a restricted class of input Gaussian states, such as thermal or squeezed vacuum, universal reservoir computing is achieved. Our work provides the foundations for RC with generic quadratic Hamiltonians in linear optics platforms, classical or quantum, with minimal resources83,84.

We find that performance is not limited by the reduced resources in place, as well as a clear improvement when information is encoded in quantum fluctuations of the field, providing superior capacity. This is displayed in the quadratic growth of IPC with system size, as opposed to linear in the case of first moments or standard ESNs. Versatility is also achieved without modifying the network reservoir, as input encoding leads to easily tunable nonlinear memory—already proven to be beneficial to tackle different computational tasks in classical RC85,86. Non-classical resources in Gaussian systems originate from squeezing and this is shown to provide nonlinear memory necessary for achieving nontrivial RC with a linear readout clearly beyond the performance achieved with thermal fluctuations, which are classical.

RC realized with physical systems is a new and rapidly developing field87, with the extension to quantum regime proposed in a handful of pioneering works44,45,46,47,49,52. Our proposal can be adapted to many classical and quantum platforms modeled by continuous variables in linear networks26,63,64,67,68,69,70,88,89. As an example, any harmonic network described by Eq. (1) can be implemented in optical parametric processes pumped by optical frequency combs, based on interacting spectral or temporal modes. The network size and structure are controlled by shaping the pump and multimode measurements, allowing in principle to create any network within a fixed optical architecture63. Lemma 1 provides a simple and mild condition to identify networks with echo-state and fading memory properties in this reconfigurable platform. Even if the full numerical analysis of the IPC here is limited to 8 nodes, larger networks have already been reported64. These linear quantum optics platforms are already well-established for measurement-based quantum computing64,65,66, have intrinsic resilience to decoherence even at room temperature and high potential for scalability. Our study reveals their potential as promising physical systems for quantum RC.

The proposed scheme assumes Gaussian measurements as homodyne or heterodyne detection90. A proof of principle realization of this protocol would require several copies of the experiment44,45,49 at each time. Otherwise, when monitoring the system output in order to exploit RC for temporal tasks, the back-action effect of quantum measurement needs to be taken into account. Even if the system performance is robust up to some level of classical noise and the effective dissipation introduced by measurement turns out to be beneficial for the fading memory, the design of platforms including measurement protocols is a question open to further research. In optical implementations, a clear advantage is the possibility to exploit arbitrary-node manipulation via beam splitters operations and feedback loops, as in recent experiments64,89. Measurements may also be a potential source of computational power introducing nonlinearity52. As examples, even for Gaussian states, non-Gaussian measurements allow to reproduce any unitary transformation91 or to make a problem intractable for classical computers, as in Boson sampling92.

The most direct advantage when considering RC with quantum systems is the possibility to access a larger Hilbert space44. With the minimal resources of a Gaussian model, this leads to a quadratic advantage. An interesting perspective is to explore a potential superior performance based on non-Gaussian resources93,94,95, to achieve exponential scaling of the total processing capacity in the quantum regime and a genuine quantum advantage in terms of computational complexity91,96. A further possibility is to embed the (quantum) reservoir within the quantum system to be processed in order to explore quantum RC in quantum tasks, providing a quantum-quantum machine learning (in the data to be analyzed and in the RC platform, respectively). Finally, the extension of the learning theory to quantum output is still missing; tasks of interests could include training the reservoir to simulate given quantum circuits, for instance. Static versions of such tasks carried out in feed-forward, as opposed to recurrent, architecture have been considered48,53,97 and could be used as a starting point.

## Methods

### Linear network Hamiltonian

We consider a network of interacting quantum harmonic oscillators acting as the reservoir for RC, with spring-like interaction strengths gij. The Hamiltonian of such a system can be conveniently described in terms of the Laplacian matrix L having elements Lij = δijkgik − (1 − δij)gij. We adopt such units that the reduced Planck constant  = 1 and the Boltzmann constant kB = 1. Arbitrary units are used for other quantities such as frequency and coupling strength. The resulting Hamiltonian is

$$H=\frac{{{\bf{p}}}^{\top }{\bf{p}}}{2}+\frac{{{\bf{q}}}^{\top }({{\boldsymbol{\Delta }}}_{{\boldsymbol{\omega }}}^{2}+{\bf{L}}){\bf{q}}}{2},$$
(7)

where p = {p1, p2, …, pN} and q = {q1, q2, …, qN} are the vectors of momentum and position operators of the N oscillators while the diagonal matrix Δω holds the oscillator frequencies ω = {ω1, ω2, …, ωN}.

### Reservoir computing theory

A common way to process temporal information is to use artificial neural networks with temporal loops. In these so-called recurrent neural networks, the state of the neural network nodes depends on the input temporal signals to be processed but also on the previous states of the network nodes, providing the needed memory98. Unfortunately, such recurrent neural networks are notorious for being difficult to train99. Reservoir Computing, in turn, leads to greatly simplified and faster training, enlarges the set of useful physical systems as reservoirs, and lends itself to simultaneous execution of multiple tasks by training separate output weights for each task while keeping the rest of the network—the reservoir—fixed34.

Here, we provide an overview of Reservoir computing theory that introduces the relevant definitions and concepts in context. For proper development of the discussed material we refer the reader to38,100. We will also briefly discuss the application of the framework to quantum reservoirs.

#### Reservoir computers

We consider sequences of discrete-time data s = {…, si−1, si, si+1, …}, where $${{\bf{s}}}_{i}\in {{\mathbb{R}}}^{n}$$, n is the dimension of the input vector and $$i\in {\mathbb{Z}}$$. Let us call the space of input sequences $${{\mathcal{U}}}_{n}$$ such that $${\bf{s}}\in {{\mathcal{U}}}_{n}$$. Occasionally, we will also use left and right infinite sequences defined as $${{\mathcal{U}}}_{n}^{-}=\{{\bf{s}}=\{\ldots ,{{\bf{s}}}_{-2},{{\bf{s}}}_{-1},{{\bf{s}}}_{0}\}| {{\bf{s}}}_{i}\in {{\mathbb{R}}}^{n},i\in {{\mathbb{Z}}}_{-}\}$$ and $${{\mathcal{U}}}_{n}^{+}=\{{\bf{s}}=\{{{\bf{s}}}_{0},{{\bf{s}}}_{1},{{\bf{s}}}_{2},\ldots \}| {{\bf{s}}}_{i}\in {{\mathbb{R}}}^{n},i\in {{\mathbb{Z}}}_{+}\}$$, respectively. Formally, a reservoir computer may be defined by the following set of equations:

$$\left\{\begin{array}{l}{{\bf{x}}}_{k}=T({{\bf{x}}}_{k-1},{{\bf{s}}}_{k})\\ {o}_{k}=h({{\bf{x}}}_{k}),\end{array}\right.$$
(8)

where T is a recurrence relation that transforms input sequence elements sk to feature space elements xk—in general, in a way that depends on initial conditions—while h is a function from the feature space to reals. When T, a target $$\overline{{\bf{o}}}$$ and a suitable cost function describing the error between output and target are given, the reservoir is trained by adjusting h to optimize the cost function. The error should remain small also for new input that was not used in training.

The general nature of Eq. (8) makes driven dynamical systems amenable to being used as reservoirs. This has opened the door to so-called physical reservoir computers that are hardware implementations exploiting different physical substrates87. In such a scenario time series s—often after suitable pre-processing—drives the dynamics given by T while xk is the reservoir state. A readout mechanism that can inspect the reservoir state should be introduced to implement function h. The appeal of physical RC lies in the possibility to offload processing of the input in feature space and memory requirements to the reservoir, while keeping the readout mechanism simple and memoryless. In particular, this can lead to efficient computations in terms of speed and energy consumption with photonic or electronic systems28,101.

Online time series processing—what we wish to do with the system in Eq. (8)—is mathematically described as follows. A temporal map $$M:{{\mathcal{U}}}_{n}\to {{\mathcal{U}}}_{1}$$, also called a filter, transforms elements from the space of input time series to the elements of the space of output time series. In general M is taken to be causal, meaning that (M[s])t may only depend on sk where k ≤ t, i.e., inputs in the past only. When M is additionally time-invariant, roughly meaning that it does not have an internal clock, (M[s])t = F({…, st−2, st−1, st}) for any t for some fixed $$F:{{\mathcal{U}}}_{n}^{-}\to {\mathbb{R}}$$38. We will later refer to such F as functionals. When M is given, fixing s induces a time-dependent function that we will denote by f, defined by f(t) = F({…, st−2, st−1, st}).

To process input s into o in an online mode requires to implement f(t); real-time processing is needed. We will later refer to such tasks as temporal tasks. Reservoir computing is particularly suited for this due to the memory of past inputs provided by the recursive nature of T and online processing accomplished by the readout mechanism acting at each timestep.

#### Properties of useful reservoirs

In general, ok in Eq. (8) depends on both the past inputs and the initial conditions, but f(t) depends only on the inputs; therefore any dependency on the initial conditions should be eliminated by the driving. It may also be expected that reservoirs able to learn temporal tasks must be in some sense well-behaved when driven. These informal notions can be formalized as follows.

The echo-state property (ESP)43 requires that for any reference time t, $${{\bf{x}}}_{t}={\mathcal{E}}(\{\ldots ,{{\bf{s}}}_{t-2},{{\bf{s}}}_{t-1},{{\bf{s}}}_{t}\})$$ for some function $${\mathcal{E}}$$, that is to say at the limit of infinitely many inputs the reservoir state should become completely determined by the inputs, and not by the initial conditions. This has two important consequences. First, it guarantees that the reservoir always eventually converges to the same trajectory of states for a given input, which also means that initial conditions do not need to be taken into account in training. Second, it ensures that the reservoir together with a readout function can realize a temporal map. A strongly related condition called the fading memory property (FMP)75 requires that for outputs to be similar, it is sufficient that the inputs are similar up to some finite number of past inputs. The formal definition can be given in terms of so-called null sequences as explained in the Supplementary Note 1. It can be shown that FMP imposes a form of continuity to the overall input-output maps that can be produced by the reservoir computer described by Eq. (8)37; the associated temporal maps (functionals) are called fading memory temporal maps (functionals).

A useful quantifier for the processing power of a single reservoir was introduced by Dambre et al.71. They showed that when the readout function h is linear in reservoir variables, the ability of a reservoir to reconstruct orthogonal functions of the input is bounded by the number of linearly independent variables used as arguments of h. The central result was that all reservoirs with FMP can saturate this bound.

Considering a class of reservoirs instead offers a complementary point of view. If the reservoirs have ESP and FMP then they can realize temporal maps. If additionally the class has separability, i.e., for any $${{\bf{s}}}_{1},{{\bf{s}}}_{2}\in {{\mathcal{U}}}_{n}^{-}$$, s1 ≠ s2, some reservoir in the class will be driven to different states by these inputs, then universality in RC becomes possible. This can be achieved by imposing mild additional conditions on the input space and realizing an algebra of temporal maps by combining the outputs of multiple reservoirs with a polynomial function76. When these properties are met, universality then follows from the Stone-Weierstrass Theorem (Theorem 7.3.1 in the work by Dieudonne et al.77).

### The explicit forms of the covariance matrix and first moments vector

For a single-mode Gaussian state with frequency Ω, covariances and first moments read

$$\left\{\begin{array}{l}\sigma ({\bf{x}})=({n}_{\text{th}}+\frac{1}{2})\left(\begin{array}{ll}(y+{z}_{\text{cos}}){{{\Omega }}}^{-1}&{z}_{\sin }\\ {z}_{\sin }&(y-{z}_{\text{cos}}){{\Omega }}\end{array}\right),\\ \langle {\bf{x}}\rangle =\left(\begin{array}{l}| \alpha | \cos (\arg (\alpha ))\sqrt{2{{{\Omega }}}^{-1}}\\ | \alpha | \sin (\arg (\alpha ))\sqrt{2{{\Omega }}}\end{array}\right),\hfill\end{array}\right.$$
(9)

where $$y=\cosh (2r)$$, $${z}_{\text{cos}}=\cos (\varphi )\sinh (2r)$$ and $${z}_{\sin }=\sin (\varphi )\sinh (2r)$$. Here, nth controls the amount of thermal excitations, r and φ control the magnitude and phase of squeezing, respectively, and finally α and $$\arg (\alpha )$$ control the magnitude and phase of displacement, respectively. The input sequence may be encoded into any of these parameters or possibly their combination.

Suppose that s = {sm+1, …, s−1, s0} and each input sk is encoded to all degrees of freedom as nthnth(sk), rr(sk), φφ(sk), αα(sk) and $$\arg (\alpha )\mapsto \arg (\alpha ({{\bf{s}}}_{k}))$$. Then from Eq. (4) it follows that

$$\left\{\begin{array}{l} [\left.\sigma ({{\bf{x}}}_{0}^{R})-{{\bf{A}}}^{m}\sigma ({{\bf{x}}}_{-m}^{R}){({{\bf{A}}}^{\top })}^{m}\right)]_{ij}=\sum_{k=0}^{m-1}{a}_{k}^{ij}{n}_{\text{th}}({{\bf{s}}}_{k})\left({\mathrm{cosh}} (2r({{\bf{s}}}_{k}))+({b}_{k}^{ij}{\mathrm{cos}} (\varphi ({{\bf{s}}}_{k})) + {c}_{k}^{ij}{\mathrm{sin}} (\varphi ({{\bf{s}}}_{k}))){\mathrm{sinh}} (2r({{\bf{s}}}_{k}))\right),\\ \left[\langle {{\bf{x}}}_{0}^{R}\rangle -{{\bf{A}}}^{m}\langle {{\bf{x}}}_{-m}^{R}\rangle \right]_{i}=\sum _{k=0}^{m-1}| \alpha ({{\bf{s}}}_{k})| ({a}_{k}^{i}{\mathrm{cos}} ({\mathrm{arg}} (\alpha ({{\bf{s}}}_{k})))+{b}_{k}^{i}{\mathrm{sin}} ({\mathrm{arg}} (\alpha ({{\bf{s}}}_{k}))))\hfill\end{array}\right.$$
(10)

where $${a}_{k}^{ij}$$, $${b}_{k}^{ij}$$, $${c}_{k}^{ij}$$, $${a}_{k}^{i}$$, and $${b}_{k}^{i}$$ are constants depending on the Hamiltonian in Eq. (7) and Δt. That is to say the part of the observables independent of the initial conditions $${{\bf{x}}}_{-m}^{R}$$ are linear combinations of nth(sk) and α(sk), while the dependency on r(sk), φ(sk) and $$\arg (\alpha ({{\bf{s}}}_{k}))$$ is nonlinear. When the dynamics of the reservoir is convergent the effect of the initial conditions vanishes at the limit m →  and the corresponding terms on the L.H.S. may be omitted.

### The networks used in numerical experiments

We have used a chain of N = 8 oscillators for all results shown in Figs. 2 and 4. For simplicity, all oscillators have the same frequency ω = 0.25 and all interaction strengths are fixed to g = 0.1. The ancilla is chosen to be one of the oscillators at the ends of the chain. For the aforementioned parameter values of ω and g, we have computed ρ(A) as a function of Δt. We have set Δt = 59.6, which is close to a local minimum of ρ(A); in general values of Δt that achieve ρ(A) < 1 are common and produce similar results. It should be pointed out that the choice of ancilla matters, e.g., choosing the middle oscillator in a chain of odd length seems to lead to ρ(A) ≥ 1 for any choice of Δt.

The starting point for the results shown in Fig. 3 is a completely connected network of N = 8 oscillators with uniform frequencies ω = 0.25 and random interaction strengths uniformly distributed in the interval g = [0.01, 0.19]. We point out that the condition ρ(A) < 1 is favored by interaction strengths that break the symmetry of the system. A suitable value for Δt is then found as follows. We consider values Δtω0 = 0.01, 0.02, …, 29.99, 30 and find the corresponding ρ(A) for each. Then we choose a random Δt out of all values for which ρ(A) ≤ 0.99. In the rare event that none of the values can be chosen, new random weights are drawn and the process is repeated. We have checked that choosing instead the Δt that minimizes ρ(A) leads to virtually the same results, confirming that reservoir memory is primarily controlled by the encoding, not the choice of Δt.

### Training of the network

For all shown results, we take the initial state of the reservoir to be a thermal state and use the first 105 timesteps to eliminate its effect from the reservoir dynamics, followed by another M = 105 timesteps during which we collect the reservoir observables used to form the output. We wish to find a readout function h that minimizes

$${\rm{SE}}(\overline{{\bf{o}}},{\bf{o}})=\sum _{k}{({\bar{o}}_{k}-{o}_{k})}^{2},$$
(11)

i.e., the squared error between target output $$\overline{{\bf{o}}}$$ and actual output o.

In Fig. 2a, b and in Figs. 3 and 4, h is linear in reservoir observables. In this case, the collected data is arranged into M × L matrix X, where L is the number of used observables. In the case of first moments, L = 2(N − 1), whereas in the case of covariances L = 2N2 − 3N + 1; the latter comes from using the diagonal and upper triangular elements of the symmetric 2(N − 1) × 2(N − 1) covariance matrix. We introduce a constant bias term by extending the matrix with a unit column so that the final dimensions of X are M × (L + 1). Now we may write $${o}_{k}=h({{\bf{X}}}_{k})=\mathop{\sum }\nolimits_{i}^{N+1}{W}_{i}{X}_{ki}$$ where Xk is the kth row of X, Xki its ith element and $${W}_{i}\in {\mathbb{R}}$$ are adjustable weights independent of k. Let W be the column vector of the weights. Now XW = o. To minimize (11), we set

$${\bf{W}}={{\bf{X}}}^{+}{\overline{{\bf{o}}}}^{\top }$$
(12)

where X+ is the Moore–Penrose inverse102,103 of X. When X has linearly independent columns—meaning that the reservoir observables are linearly independent—$${{\bf{X}}}^{+}={({{\bf{X}}}^{\top }{\bf{X}})}^{-1}{{\bf{X}}}^{\top }$$.

In Fig. 2c, d, h is taken to be polynomial in reservoir observables. In this case, the training proceeds otherwise as above except that before finding X+ we expand X with all possible products of different reservoir observables up to a desired degree, increasing the number of columns. Powers of the same observable are not included since they are not required by the parity check task.

### The used echo-state network

An echo-state network (ESN) is used for some of the results shown in Fig. 4. For a comparison with a harmonic network of eight oscillators (one of them the ancilla) and a bias term we set the size of the ESN to N = 8 neurons, all of which are used to form the output, and include a bias term.

The ESN has a state vector $${{\bf{x}}}_{k}\in {{\mathbb{R}}}^{N}$$ with dynamics given by $${{\bf{x}}}_{k}=\tanh (\beta {\bf{W}}{{\bf{x}}}_{k-1}+\iota{\bf{w}}{s}_{k})$$ where W is a random N × N matrix, w a random vector of length N, β, and $$\iota$$ are scalars and $$\tanh$$ acts element-wise. W and w are created by drawing each of their elements uniformly at random from the interval [−1, 1]. Furthermore, W is scaled by dividing it by its largest singular value. Parameters β and $$\iota$$ are used to further adjust the relative importance of the previous state xk−1 and scalar input sk. We use a single fixed realization of W and w and set β = 0.95 and $$\iota$$  = 1. The readout function is a linear function of the elements of xk and training follows a similar procedure to the one described for the oscillator networks in Methods D. We note that the precise linear and nonlinear memory contributions of the ESN to the IPC bar in Fig. 4 depend on the choice of the parameters values for β and $$\iota$$. For this manuscript, the relevant aspect is that the total IPC of the ESN is bounded to the number of neurons (8), independently of the choice of the parameter values.

### Estimation of total information processing capacity

Information processing capacity is considered in Figs. 3 and 4. By total capacity we mean the sum of capacities over a complete orthogonal set of functions and using infinite sequences $${\bf{s}}^{\prime}$$ and s. Shown results are estimates of the total capacity found as follows.

All estimates are formed with input i.i.d. in [−1, 1]. One choice of functions orthogonal w.r.t. this input is described in Eq. (12) of the work by Dambre et al.71, which we also use. More precisely, the considered orthogonality is defined in terms of the scalar product in the Hilbert space of fading memory functions given in Definition 5 of the work by Dambre et al.71—it should be stressed that in general, changing the input changes, which functions are orthogonal. Since σ(xR) and 〈xR〉 can only depend on products of the inputs at the same delay, we only consider the corresponding subset of functions. They are of the form $${({P}_{d}^{\tau })}_{k}={P}_{d}({s}_{k-\tau })$$ where Pd is the normalized Legendre polynomial of degree d and $$\tau \in {\mathbb{N}}$$ is a delay. In Fig. 4, an estimate for the total capacity of an echo-state network is also shown, for which we consider the full set of functions.

For each considered function, we compute the capacity given by Eq. (6) by finding the weights of the optimal h as described previously in the Methods. We use finite input sequences, which in general can lead to an overestimation of the total capacity. As explained in Sec. 3.2 of the Supplementary Material of the work by Dambre et al.71, the effect of this can be reduced by fixing a threshold value and setting to 0 any capacity at or below the value. We use the same method.

In practice, only a finite number of degrees d and delays τ can be considered for the numerical estimates, which can lead to an underestimation. We have found the following approach useful when searching for capacities larger than the threshold value. We fix a maximum degree d (for all results we have used 9) and for each degree we order the functions according to delay and find the capacity of N/2 (rounded to an integer) functions at a time, until none of the N/2 functions in the batch contribute to total capacity. All total capacities except the one for thermal encoding—where we have verified that some of the observables are in fact linearly dependent or almost—are very close to the theoretical maximum.

A different approach is used for the echo-state network, which we briefly describe in the following. We still fix the maximum degree as 9. For a fixed degree d we consider a sequence of delays {τ1, τ2, …, τd} where the sequence is non-descending to avoid counting the same function multiple times. Then we form the product $${\prod }_{{\tau }_{i}}{P}_{m({\tau }_{i})}({s}_{k-{\tau }_{i}})$$ over distinct delays of the sequence where m(τi) is the multiplicity of τi in the sequence. The lexical order of non-descending sequences of delays allows us to order the functions, which is exploited to generate each function just once. Furthermore, we have found that functions that contribute to total capacity seem to have a tendency to be early in the ordering, which makes it faster to get close to saturating the theoretical bound.