## Introduction

Artificial neural networks mimic biological neural networks to perform information processing tasks. They are highly versatile, applying to vehicle control, trajectory prediction, game-playing, decision making, pattern recognition (such as facial recognition, spam filters), financial time series prediction, automated trading systems, mimicking unpredictable processes, and data mining.1, 2 The networks can be trained to perform tasks without the programmer necessarily detailing how to do it. Novel techniques for training networks of many layers (deep networks) are credited with giving impetus to the neural networks approach.3

The field of quantum machine learning is developing rapidly, though the focus has arguably not been on the connection to neural networks. Quantum machine learning (see e.g. refs. 4–19) employs quantum information processing (QIP).20 QIP uses quantum superpositions of states with the aim of faster processing of classical data as well as tractable simulation of quantum systems. In a superposition each bit string is associated with two numbers: the probability of the string and its phase.21 The phase impacts the future probabilities via a time evolution law. There are certain promising results that concern quantum versions of recurrent neural networks, wherein neurons talk to each other in all directions rather than feeding signals forward to the next layer, e.g. with the purpose of implementing quantum simulated annealing.10, 16, 22, 23 In ref. 24 several papers proposing quantum neural network designs are discussed and critically reviewed. A key challenge to overcome is the clash between the nonlinear, dissipative dynamics of neural network computing and the linear, reversible dynamics of quantum computing.24 A key reason for wanting well-functioning quantum neural networks is that these could do for quantum inputs what classical networks can do for classical inputs, e.g., compressing data encoded in quantum superpositions to a minimal number of qubits.

We here accordingly focus on creating quantum generalisations of classical neural networks, which can take quantum inputs and process them coherently. Our networks contribute to a research direction known as quantum learning,25,26,27,28,29 which concerns learning and optimising with truly quantum objects. The networks provide a route to harnessing the powerful neural network paradigm for this purpose. Moreover they are strict generalisations of the classical networks, providing a clear framework for comparing the power of quantum and classical neural networks.

The networks generalise classical neural networks to the quantum case in a similar sense to how quantum computing generalises classical computing. We start with a common classical neural network family: feedforward perceptron networks. We make the individual neurons reversible and then naturally generalise them to being quantum reversible (unitary). This resolves the classical-quantum clash mentioned above from ref. 24. An efficient training method is identified: global gradient descent for a quantum generalisation of the cost function, a function evaluating how close the outputs are to the desired outputs. To illustrate the ability of the quantum network we apply it to (i) compressing information encoded in superpositions onto fewer qubits (an autoencoder) and (ii) re-discovering the quantum teleportation protocol—this illustrates that the network can work out QIP protocols given only the task. To make the connection to physics clear we describe how to simulate and train the network with quantum photonics.

We proceed as follows. Firstly, we describe the recipe for generalising the classical neural network. Then it is demonstrated how the network can be applied to the tasks mentioned above, followed by a discussion of the results. Finally, we present a design of a quantum photonic realisation of a neural module.

## Results

Classical neural networks are composed of elementary units called neurons. We begin with describing these, before detailing how to generalise them to quantum neurons.

### Quantum neural networks

#### The neuron

A classical neuron is depicted at the top of Fig. 1. In this case, it has two inputs (though there could be more). There is one output, which depends on the inputs (bits in our case) and a set of weights (real numbers): if the weighted sum of inputs is above a set threshold, the output is 1, else it is 0.

We will use the following standard general notation. The j-th neuron in the l-th layer of a network takes a number of inputs, $$a_k^{(l - 1)}$$, where k labels the input. The inputs are each multiplied by a corresponding weight, $$w_{jk}^{(l)}$$, and an output, $$a_j^{(l)}$$, is fired as a function of the weighted input $$z_j^{(l)} = \mathop {\sum}\nolimits_{k = 1}^n {w_{jk}^{(l)}a_k^{(l - 1)}}$$, where n is the number of inputs to the neuron (top of Fig. 1). The function relating the output to the weighted input is called the activation function. This is normally taken to be a non-linear function (in general $$f\left( {\mathop {\sum}\nolimits_i {\left( {{r_i}{a_i}} \right)} } \right) \ne \mathop {\sum}\nolimits_i {{r_i}f\left( {{a_i}} \right)}$$ for inputs $$a_i$$ and real numbers $$r_i$$).1 Commonly it is the Heaviside step function or a sigmoid.1 For example, the neuron in the top of Fig. 1 with a Heaviside activation function gives an output of the form:

$$a_j^{(l)} = \left\{ {\begin{array}{ll} {1,} & {{\rm{if}}\,z_j^{(l)} > 0.5} \\ {0,} & {{\rm{otherwise}}.} \end{array}} \right.$$
(1)
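As a minimal sketch of Eq. (1), the threshold neuron can be written in a few lines of Python. The function names and the example weights (0.4, 0.4) are illustrative assumptions, not values taken from the paper; with these weights the two-input neuron happens to implement a logical AND.

```python
def weighted_input(weights, inputs):
    """z_j = sum_k w_jk * a_k for a single neuron."""
    return sum(w * a for w, a in zip(weights, inputs))


def heaviside_neuron(weights, inputs, threshold=0.5):
    """Output 1 if the weighted input exceeds the threshold, else 0 (Eq. (1))."""
    return 1 if weighted_input(weights, inputs) > threshold else 0


# Illustrative weights (an assumption): 0.4 + 0.4 > 0.5, but 0.4 alone is not.
and_weights = (0.4, 0.4)
truth_table = {
    (a, b): heaviside_neuron(and_weights, (a, b))
    for a in (0, 1)
    for b in (0, 1)
}
```

Changing the weights to, say, (0.6, 0.6) would turn the same neuron into a logical OR, which illustrates how the weights select the function being computed.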

This paper aims to generalise the classical neuron to a quantum mechanical one. In the absence of measurement, quantum mechanical processes are required to be reversible, and more specifically, unitary, in a closed quantum system.20, 30 This suggests the following procedure for generalising the neuron first to a reversible gate and finally to a unitary gate:

Irreversible → reversible: For an n-input classical neuron having $$({\rm{in}}_1, {\rm{in}}_2, ..., {\rm{in}}_n) \to {\rm{out}}$$, create a classical reversible gate taking $$({\rm{in}}_1, {\rm{in}}_2, ..., {\rm{in}}_n, 0) \to ({\rm{in}}_1, {\rm{in}}_2, ..., {\rm{in}}_n, {\rm{out}})$$. Such an operation can always be represented by a permutation matrix.31 This is a clean way of rendering the classical neuron reversible. The extra ‘dummy’ input bit is used to make it reversible30; in particular, some of the ‘2 bits in, 1 bit out’ functions the neuron can implement require 3 bits to be made reversible in this manner.

Reversible → unitary: Generalise the classical reversible gate to a quantum unitary taking input $$\left( {{{\left| {{\psi _{{\rm{in}}}}} \right\rangle }_{1,2,...,n}}\left| 0 \right\rangle } \right) \to {\left| {{\psi _{{\rm{out}}}}} \right\rangle _{1,2,...,n,{\rm{out}}}}$$, such that the final output qubit is the output of interest. This is the natural way of making a permutation matrix unitary.

If the input is a mixture of states in the computational basis and the unitary a permutation matrix,32 the output qubit will be a mixture of $$\left| 0 \right\rangle$$ or $$\left| 1 \right\rangle$$: this we call the classical special case. This way the quantum neuron can simulate any classical neuron as defined above. The generalisation recipe summarised in Fig. 1 also illustrates how any irreversible classical computation can be recovered as a special case from reversible classical computation (by ignoring the dummy and copied bits), which in turn can be recovered as a special case from quantum computation.
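The first step of the recipe can be sketched numerically. The text only fixes the gate's action when the dummy bit is 0; the sketch below assumes the common completion (in, d) → (in, d XOR out), which makes the map a permutation of all basis states. The helper names and the AND-like weights are illustrative assumptions.

```python
from itertools import product

import numpy as np


def and_neuron(bits):
    """Classical threshold neuron; the weights 0.4, 0.4 are an illustrative assumption."""
    return int(sum(0.4 * b for b in bits) > 0.5)


def reversible_gate(neuron, n_inputs):
    """Permutation matrix for (in_1..in_n, d) -> (in_1..in_n, d XOR out)."""
    dim = 2 ** (n_inputs + 1)
    gate = np.zeros((dim, dim), dtype=int)
    for bits in product((0, 1), repeat=n_inputs):
        out = neuron(bits)
        for dummy in (0, 1):
            # Basis-state index read as a binary number, dummy bit last.
            src = int("".join(str(b) for b in bits + (dummy,)), 2)
            dst = int("".join(str(b) for b in bits + (dummy ^ out,)), 2)
            gate[dst, src] = 1
    return gate


toffoli_like = reversible_gate(and_neuron, n_inputs=2)
```

For the AND-like neuron the resulting 8 × 8 matrix coincides with the Toffoli gate. Being a permutation matrix, it is already unitary, so it can act directly on superposition inputs, which is the classical special case of the quantum neuron.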

#### The network

In order to form a neural network, classical neurons are connected together in various configurations. Here, we consider feedforward classical networks, where neurons are arranged in layers and each neuron in the l-th layer is connected to every neuron in the (l − 1)-th and (l + 1)-th layers, but with no connections within the same layer. For an example of such a classical network, see Fig. 2a. Note that in this case the same output of a single neuron is sent to all the neurons in the next layer.1, 2

To make the copying reversible, in line with our approach of firstly making the classical neural network reversible, we propose the recipe:

Irreversible → reversible: For a classical irreversible copying operation of a bit b → (b, b), create a classical reversible gate, which can be represented by a permutation matrix,30 taking (b, 0) → (b, b).

In the quantum case the no-cloning theorem shows one cannot do this in the most naive way.20 For a 2-qubit case, one can use a CNOT for example to copy in the classical computational basis30: $$\left| b \right\rangle \left| 0 \right\rangle \to \left| b \right\rangle \left| b \right\rangle$$, if $$\left| b \right\rangle \in \left\{ {\left| 0 \right\rangle ,\left| 1 \right\rangle } \right\}$$. Thus one may consider replacing the copying with a CNOT. However when investigating applications of the network we realised that there are scenarios (the autoencoder in particular) where entanglement between different neurons is needed to perform the task. We have therefore chosen the following definition:

Reversible → unitary: The classical CNOT is generalised to a general 2-qubit ‘fan-out’ unitary $$U_F$$, with one dummy input set to $$\left| 0 \right\rangle$$, such that $$\left| b \right\rangle \left| 0 \right\rangle \to {U_F}\left| b \right\rangle \left| 0 \right\rangle$$. As this unitary does not in general copy quantum states that are non-orthogonal we call it a ‘fan-out’ operation rather than a copying operation, as it distributes information about the input state into several output qubits. Note that a quantum network would be trained to choose the unitary in question.
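The distinction between copying and fanning out can be checked directly for the CNOT special case. The sketch below (variable and function names are illustrative assumptions) shows that a CNOT copies computational basis states but maps $$\left| + \right\rangle \left| 0 \right\rangle$$ to an entangled Bell state rather than to $$\left| + \right\rangle \left| + \right\rangle$$.

```python
import numpy as np

ket0 = np.array([1, 0], dtype=complex)
ket1 = np.array([0, 1], dtype=complex)
plus = (ket0 + ket1) / np.sqrt(2)

# CNOT with the first qubit as control and the second (dummy) qubit as target.
CNOT = np.array(
    [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 1, 0]],
    dtype=complex,
)


def fan_out(state, U_F=CNOT):
    """Apply a 2-qubit fan-out unitary to |state>|0>."""
    return U_F @ np.kron(state, ket0)


copied_one = fan_out(ket1)   # |1>|1>: a faithful copy
fanned_plus = fan_out(plus)  # (|00> + |11>)/sqrt(2): entangled, not |+>|+>
bell = (np.kron(ket0, ket0) + np.kron(ket1, ket1)) / np.sqrt(2)
```

Passing a different 4 × 4 unitary as `U_F` gives the general fan-out gate of the definition above; the CNOT is only one point in that parameter space.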

#### Efficient training with gradient descent

A classical neural network is trained to perform particular tasks. This is done by randomly initialising the weights and then propagating inputs through the network many times, altering the weights after each propagation in such a way as to make the network output closer to the desired output. A cost function, C, relating the network output to the desired output is defined by

$$C = \frac{1}{2}{\left| {{{\vec y}^{(L)}} - {{\vec a}^{(L)}}} \right|^2},$$
(2)

where $${\vec y^{(L)}}$$ is a vector of the desired outputs from each of the final layer l = L neurons and $${\vec a^{(L)}}$$ is the vector of actual outputs, which depends on the network weights, and $$\left| {\left( . \right)} \right|$$ is the $$\ell_2$$-norm. The cost function is minimised to zero when the weights propagate the input in such a way that the network output vector equals the desired output vector.

Since the weights are continuous variables, the numerical partial derivatives of the cost function w.r.t. each weight can be found by approximating $$\frac{{\partial C}}{{\partial w}} \approx \frac{{C(w + \epsilon ) - C(w)}}{\epsilon }$$. After each propagation, these partial derivatives are computed and the weights are altered in the direction of greatest decrease of the cost function. Specifically, each weight $$w_{jk}^{(l)}$$ is increased by $${\rm{\delta }}w_{jk}^{(l)}$$, with

$${\rm{\delta }}w_{jk}^{(l)} = - \eta \frac{{\partial C}}{{\partial w_{jk}^{(l)}}},$$
(3)

where η is an adjustable non-negative parameter. This training procedure is known as gradient descent.1

Note that gradient descent normally also requires a continuous and differentiable activation function, to allow small changes in the weights to relate to small changes in the cost. For this reason, the Heaviside activation function has traditionally been replaced by a sigmoid function.1, 2 Nevertheless, gradient descent has also been achieved using Heaviside activation functions, by taking the weights as Gaussian variables and taking partial derivatives w.r.t. the means and standard deviations of the appropriate Gaussian distributions.33, 34
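The training rule of Eq. (3) with forward-difference derivatives can be sketched as follows. This is a toy illustration under stated assumptions: a single sigmoid neuron with fixed inputs (1, 1), a target output of 0.8, η = 1 and ε = 10⁻⁶ are all arbitrary choices, not settings from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost(weights, inputs=(1.0, 1.0), target=0.8):
    """C = 0.5 * (y - a)^2 for one sigmoid neuron (toy stand-in for Eq. (2))."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return 0.5 * (target - sigmoid(z)) ** 2


def numerical_gradient(cost_fn, weights, eps=1e-6):
    """Forward-difference estimate of dC/dw for each weight."""
    base = cost_fn(weights)
    grad = []
    for k in range(len(weights)):
        shifted = list(weights)
        shifted[k] += eps
        grad.append((cost_fn(shifted) - base) / eps)
    return grad


def gradient_descent(cost_fn, weights, eta=1.0, steps=2000):
    """Repeatedly apply delta_w = -eta * dC/dw (Eq. (3))."""
    weights = list(weights)
    for _ in range(steps):
        grad = numerical_gradient(cost_fn, weights)
        weights = [w - eta * g for w, g in zip(weights, grad)]
    return weights


trained = gradient_descent(cost, [0.0, 0.0])
```

Each gradient estimate costs one extra evaluation of the cost function per weight, which is the counting that underlies the efficiency statements made for the quantum case.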

In the reversible generalisation, where each neuron is replaced by a permutation matrix, we find that the output is no longer a function of the inputs and continuous weights, but rather of the inputs and a discrete set of permutation matrices. However, in the generalisation to unitaries, for a gate with n inputs and outputs, there exist an infinite number of unitaries, in contrast with the discrete set of permutation matrices. This means that the unitaries can be parametrised by continuous variables, which once again allows the application of gradient descent.

Given that any unitary matrix U can be expressed as $$U = e^{iH}$$, where H is a Hermitian matrix,20 and that such matrices can be written as linear combinations of tensor products of the Pauli matrices and the identity, it follows that a general N-qubit unitary can be expressed as

$${U_N} = {\rm{exp}}\left[ {i\left( {\mathop {\sum}\limits_{{j_1},...,{j_N} = 0,...,0}^{3,...,3} {{\alpha _{{j_1},...,{j_N}}}\left( {{\sigma _{{j_1}}} \otimes ... \otimes {\sigma _{{j_N}}}} \right)} } \right)} \right],$$
(4)

where $$\sigma_i$$ are the Pauli matrices for $$i \in \{1, 2, 3\}$$ and $$\sigma_0$$ is the 2 × 2 identity matrix. This parametrisation allows the use of the training rule of Eq. (3), but replacing the weight $$w_{jk}^{(l)}$$ with a general parameter $${\alpha _{{j_1},...,{j_N}}}$$ of the unitary $$U_N$$:

$${\rm{\delta }}{\alpha _{{j_1},...,{j_N}}} = - \eta \frac{{\partial C}}{{\partial {\alpha _{{j_1},...,{j_N}}}}}.$$
(5)
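The parametrisation of Eq. (4) can be sketched with NumPy. The function name `unitary_from_alphas` and the storage of the parameters as a real array of shape (4, ..., 4) are assumptions made for illustration; the matrix exponential is computed from the eigendecomposition of the Hermitian exponent.

```python
from functools import reduce
from itertools import product

import numpy as np

PAULIS = [
    np.eye(2, dtype=complex),                    # sigma_0 (identity)
    np.array([[0, 1], [1, 0]], dtype=complex),   # sigma_1
    np.array([[0, -1j], [1j, 0]], dtype=complex),  # sigma_2
    np.array([[1, 0], [0, -1]], dtype=complex),  # sigma_3
]


def unitary_from_alphas(alphas):
    """Build U_N = exp(i * sum_j alpha_j sigma_j1 x ... x sigma_jN), Eq. (4).

    `alphas` is a real array with one axis of length 4 per qubit.
    """
    n_qubits = alphas.ndim
    dim = 2 ** n_qubits
    H = np.zeros((dim, dim), dtype=complex)
    for idx in product(range(4), repeat=n_qubits):
        H += alphas[idx] * reduce(np.kron, (PAULIS[j] for j in idx))
    # H is Hermitian, so exp(iH) = V diag(exp(i * eigenvalues)) V^dagger.
    evals, evecs = np.linalg.eigh(H)
    return (evecs * np.exp(1j * evals)) @ evecs.conj().T


rng = np.random.default_rng(0)
alphas = rng.uniform(-1, 1, size=(4, 4))  # 16 real parameters for 2 qubits
U = unitary_from_alphas(alphas)
```

Because every real choice of the parameters yields a unitary, small shifts of a single parameter can be used in a finite-difference estimate of the derivative in Eq. (5) without leaving the set of valid gates.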

A simpler and less general form of $$U_N$$ has been sufficient for the tasks discussed in this paper:

$${U_3} = \mathop {\sum}\limits_{j = 1}^4 {\left| {{\tau _j}} \right\rangle \left\langle {{\tau _j}} \right| \otimes {T_j}} ,$$
(6)

where $$\left\{ {\left| {{\tau _j}} \right\rangle } \right\}_{j = 1}^4 = \left\{ {V\left| {00} \right\rangle ,V\left| {01} \right\rangle ,V\left| {10} \right\rangle ,V\left| {11} \right\rangle } \right\}$$. V is a general 2-qubit unitary of the form of Eq. (4). Each $$T_j$$ is similarly a general 1-qubit unitary and one can see, using the methods of ref. 35 on Eq. (4), that this can be expressed as a linear combination of the identity and the Pauli matrices, $$\sigma_j$$:

$${U_{{\rm{1 - qubit}}}} = {e^{i{\alpha _0}}}\left( {{\rm{cos}}\,\Omega \,{\Bbb{1}} + i\frac{{{\rm{sin}}\,\Omega }}{\Omega }\mathop {\sum}\limits_{j = 1}^3 {{\alpha _j}{\sigma _j}} } \right),$$
(7)

where $$\Omega = \sqrt {\alpha _1^2 + \alpha _2^2 + \alpha _3^2}$$.35 To extend this to higher dimensional unitaries, see e.g. ref. 36.
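As a numerical sanity check, the closed form of Eq. (7) can be compared with a direct exponentiation of the 1-qubit case of Eq. (4). The function names and the test parameters are illustrative assumptions.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
SIGMA = [
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[0, -1j], [1j, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
]


def one_qubit_closed_form(a0, a1, a2, a3):
    """Eq. (7): exp(i a0) (cos(Omega) 1 + i sin(Omega)/Omega sum_j a_j sigma_j)."""
    omega = np.sqrt(a1**2 + a2**2 + a3**2)
    sin_over_omega = np.sinc(omega / np.pi)  # sin(omega)/omega, equal to 1 at omega = 0
    vec = a1 * SIGMA[0] + a2 * SIGMA[1] + a3 * SIGMA[2]
    return np.exp(1j * a0) * (np.cos(omega) * I2 + 1j * sin_over_omega * vec)


def one_qubit_exponential(a0, a1, a2, a3):
    """Direct evaluation of exp(iH) with H = a0 1 + sum_j a_j sigma_j."""
    H = a0 * I2 + a1 * SIGMA[0] + a2 * SIGMA[1] + a3 * SIGMA[2]
    evals, evecs = np.linalg.eigh(H)
    return (evecs * np.exp(1j * evals)) @ evecs.conj().T


params = (0.3, -0.7, 0.2, 0.9)
closed = one_qubit_closed_form(*params)
direct = one_qubit_exponential(*params)
```

The closed form avoids a matrix exponential altogether, which is convenient when the 1-qubit blocks $$T_j$$ are re-evaluated many times during training.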

The cost function we use for the quantum neural networks is, with experimental feasibility in mind, determined by the expectation values of local Pauli matrices ($$\sigma_1$$, $$\sigma_2$$, $$\sigma_3$$) on individual output qubits, j. It has the form

$$C = \mathop {\sum}\limits_{i,j} {{f_{ij}}{{\left( {{{\left\langle {\sigma _i^{(j)}} \right\rangle }_{{\rm{actual}}}} - {{\left\langle {\sigma _i^{(j)}} \right\rangle }_{{\rm{desired}}}}} \right)}^2}} ,$$
(8)

where $$f_{ij}$$ is a real non-negative number (in the examples to follow $$f_{ij} \in \{0, 1\}$$). We note that in the classical mode of operation, where the total density matrix state is diagonal in the computational basis, only $$\sigma_3$$ will have non-zero expectation, and the cost function becomes the same as in the classical case (Eq. (2)) up to a simple transformation.
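A cost of the form of Eq. (8) can be sketched for pure states as below. The function names, the use of state vectors rather than density matrices, and the array layout (qubit index first, Pauli index second, i.e. transposed relative to $$f_{ij}$$) are assumptions made for this illustration.

```python
from functools import reduce

import numpy as np

I2 = np.eye(2, dtype=complex)
SIGMA = [
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[0, -1j], [1j, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
]


def local_pauli_expectations(psi):
    """Return an (n_qubits, 3) array of <sigma_i> on each qubit of |psi>."""
    n_qubits = psi.size.bit_length() - 1
    values = np.zeros((n_qubits, 3))
    for q in range(n_qubits):
        for i, sigma in enumerate(SIGMA):
            factors = [I2] * n_qubits
            factors[q] = sigma
            observable = reduce(np.kron, factors)
            values[q, i] = np.real(np.vdot(psi, observable @ psi))
    return values


def pauli_cost(actual, desired, f=None):
    """Eq. (8): weighted squared differences of local Pauli expectation values."""
    diff = local_pauli_expectations(actual) - local_pauli_expectations(desired)
    if f is None:
        f = np.ones_like(diff)  # f_ij = 1 for every qubit and Pauli
    return float(np.sum(f * diff**2))


ket00 = np.array([1, 0, 0, 0], dtype=complex)
ket01 = np.array([0, 1, 0, 0], dtype=complex)
classical_expectations = local_pauli_expectations(ket01)
```

For computational basis states only the $$\sigma_3$$ column is non-zero, in line with the classical special case described above. Only 3n numbers are needed for n qubits, at the price that states with identical local expectation values are not distinguished by this cost.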

It is important to note that the number of weights grows polynomially in the number of neurons. Each weight shift is determined by evaluating the cost function twice to get the RHS of Eq. (5). Thus the number of evaluations of the cost function for a given iteration of the gradient descent grows polynomially in the number of neurons. The training procedure is efficient in this sense. Here we do not attempt to provide a proof that the convergence to zero cost-function, where possible, will always take a number of iterations that grows polynomially in the number of neurons. Note also that the statements about the efficiency of the training procedure refer to the physical implementation with quantum technology: the simulation of quantum systems with a classical computer is, with the best known methods, in general inefficient.

### Example: autoencoder for data compression

We now demonstrate applications of our quantum generalisation of neural networks described in the previous section. We begin with autoencoders. These compress an input signal from a given set of possible inputs onto a smaller number of bits, and are ‘work-horses’ of classical machine learning.2

#### Classical autoencoder

Autoencoders are commonly achieved by a feedforward neural network with a bottleneck in the form of a layer with fewer neurons than the input layer. The network is trained to recreate the signal at a later layer, which necessitates reversibly compressing it (as well as possible) to a bit size equal to the number of neurons in the bottleneck layer.2 The bottleneck layer size can be varied as part of the training to find the smallest compression size possible, which depends on the data set in question. After the training is complete, the post-bottleneck part of the network can be discarded and the compressed output taken directly from after the bottleneck.

In Fig. 2a a basic autoencoder designed to compress two bits into a single bit is shown. (Here the number of input bits is $$j_{\rm{max}} = 2$$.) The basic training procedure consists of creating a cost function:

$$C = \mathop {\sum}\limits_{j = 1}^{{j_{{\rm{max}}}}} {{{\left( {{\rm{i}}{{\rm{n}}_j} - {\rm{ou}}{{\rm{t}}_j}} \right)}^2}} ,$$
(9)

with which the network is trained using the learning rule of Eq. (3). If the outputs are identical to the inputs (to within numerical precision), the network is fully trained. The final layer is then removed, revealing the second-to-last layer, which should contain the compressed data. The number of neurons in a given hidden layer for a classical neuron will not exceed $$j_{\rm{max}}$$. Once the network is trained, the removal of the post-bottleneck layer(s) will yield a last layer of fewer neurons, achieving dimensional reduction.2

#### Quantum autoencoder

We now generalise the classical autoencoder as shown in Fig. 2a to the quantum case. We generalise the neurons labelled 1, 2 and 3 in Fig. 2a into unitary matrices $$U_1$$, $$U_2$$ and $$U_3$$, respectively, with the addition of a ‘fan-out’ gate, $$U_F$$, as motivated in the previous sections. The result is shown in Fig. 2b as a quantum circuit model. (We follow the classical convention that this neural network is drawn with the input neurons as well, but they are identity operators which let the inputs through regardless, and can be ignored in the simulation of the network.) The input state of interest $$\left| {{\rm{i}}{{\rm{n}}_{{\rm{12}}}}} \right\rangle$$ is on 2 qubits, each fed into a different neuron, generalising the classical autoencoder in Fig. 2a. From each of these neurons, one output qubit each is led into the bottleneck neuron $$U_1$$, followed by a fan-out of its output. We add as an extra desideratum that the compressed bit, the output of $$U_1$$, is diagonal in the computational basis. The final neurons have the task of recreating $$\left| {{\rm{i}}{{\rm{n}}_{{\rm{12}}}}} \right\rangle$$ on the outputs labelled 6 and 8 respectively.

This means that a natural and simple cost function is

$$C = \mathop {\sum}\limits_{j = 0,k = 0}^3 {{{\left( {{\rm{Tr}}\left( {{\rho _{6,8}}{\sigma _j} \otimes {\sigma _k}} \right) - {\rm{Tr}}\left( {{\rho _{\rm{in}_{1,2}}}{\sigma _j} \otimes {\sigma _k}} \right)} \right)}^2}} .$$
(10)

Training is then conducted via global gradient descent of the cost w.r.t. the $${\alpha _{{j_1},...,{j_N}}}$$ parameters, as defined in Eq. (5). During the training the network was fed states from the given input set, picked independently and identically for each step (i.i.d.). Standard speed-up techniques for learning were used, e.g., a momentum term.1, 2 In training with a variety of pairs of orthogonal input states, including superposition states, the cost function of the quantum autoencoder converged towards zero through global gradient descent in every case, starting with uniformly randomised weights, $${\alpha _{{j_1},...,{j_N}}} \in \left[ { - 1,1} \right]$$. For two non-orthogonal inputs and a 1-qubit bottleneck the cost function will not converge to zero, as is to be expected, but the training rather results in an approximately compressing unitary. Figure 2c shows the network learning to compress in the case of two possible inputs: $$\left( {\left| {00} \right\rangle + \left| {11} \right\rangle } \right){\rm{/}}\sqrt 2$$ and $$\left( {\left| {00} \right\rangle - \left| {11} \right\rangle } \right){\rm{/}}\sqrt 2$$. One can force the compressed output to be diagonal in a particular basis by adding an extra term to the cost function (e.g., desiring the expectation values of Pauli X and Y to be zero in the case of a single qubit will push the network to give an output diagonal in the Z-basis).
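To make the target of the training concrete, the sketch below evaluates the cost of Eq. (10) for one hand-picked compressing unitary (a CNOT followed by a Hadamard on the first qubit). This encoder is an assumption chosen for illustration, not the unitary found by the trained network, and the circuit is simplified to an encoder, a 1-qubit bottleneck and a decoder.

```python
from itertools import product

import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
PAULIS = [I2, X, Y, Z]
HADAMARD = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex)

ENCODER = np.kron(HADAMARD, I2) @ CNOT  # hand-picked, not learned
DECODER = ENCODER.conj().T
P0 = np.array([[1, 0], [0, 0]], dtype=complex)  # fresh dummy qubit |0><0|


def autoencode(psi):
    """Encode, keep only the first qubit, re-attach |0>, then decode."""
    encoded = (ENCODER @ psi).reshape(2, 2)
    rho_compressed = encoded @ encoded.conj().T  # trace out the second qubit
    rho_out = DECODER @ np.kron(rho_compressed, P0) @ DECODER.conj().T
    return rho_compressed, rho_out


def pauli_vector(rho):
    """All 16 expectation values Tr(rho sigma_j x sigma_k)."""
    return np.array([
        np.real(np.trace(rho @ np.kron(PAULIS[j], PAULIS[k])))
        for j, k in product(range(4), repeat=2)
    ])


def autoencoder_cost(psi):
    """Eq. (10) for a pure 2-qubit input |psi>."""
    _, rho_out = autoencode(psi)
    rho_in = np.outer(psi, psi.conj())
    return float(np.sum((pauli_vector(rho_out) - pauli_vector(rho_in)) ** 2))


phi_plus = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)
phi_minus = np.array([1, 0, 0, -1], dtype=complex) / np.sqrt(2)
ket01 = np.array([0, 1, 0, 0], dtype=complex)
```

For the two Bell-state inputs the compressed qubit is $$\left| 0 \right\rangle$$ or $$\left| 1 \right\rangle$$, i.e. diagonal in the computational basis, and the cost vanishes; an input outside the training set, such as $$\left| {01} \right\rangle$$, is not reconstructed.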

### Example: neural network discovers teleportation protocol

With quantum neural networks already shown to be able to perform generalisations of classical tasks, we now consider the possibility of quantum networks discovering solutions to existing and potentially undiscovered quantum protocols. We propose a quantum neural network structure that can, on its own, work out the standard protocol for quantum teleportation.20

The design and training of this network is analogous to the autoencoder and the quantum circuit diagram is shown in Fig. 3a. The cost function used was:

$$C = \mathop {\sum}\limits_{j = 0}^3 {{{\left( {{\rm{Tr}}\left( {\left| \psi \right\rangle \left\langle \psi \right|{\sigma _j}} \right) - {\rm{Tr}}\left( {{\rho _6}{\sigma _j}} \right)} \right)}^2}} .$$
(11)

A fully trained network can teleport the state $$\left| \psi \right\rangle$$ (from Alice) to the output port of qubit 6 (to Bob). Once trained properly, $${\rho _{{\rm{ou}}{{\rm{t}}_{\rm{1}}}}}$$ will no longer be $$\left| \psi \right\rangle \left\langle \psi \right|$$, as the teleportation has ‘messed up’ Alice’s state.37

In order to train the teleportation for any arbitrary state $$\left| \psi \right\rangle$$ (and to avoid the network simply learning to copy $$\left| \psi \right\rangle$$ from Alice to Bob), the training inputs are randomly picked from the axis intersection states on the surface of the Bloch sphere.20 Figure 3b shows the convergence of the cost function during training, simulated on a classical computer. As can be seen, the training was found to be successful, i.e., the cost function converged towards zero. This held for all tests with randomly initialised weights.
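The cost of Eq. (11) and the six training inputs can be illustrated on the textbook teleportation circuit with coherent (controlled-gate) corrections. This is a sketch under assumptions: it uses a 3-qubit circuit with qubits labelled 0 (Alice's input), 1 (Alice's half of a Bell pair) and 2 (Bob), which differs from the labelling of Fig. 3a, and it checks the known protocol rather than a trained network.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
PAULIS = [I2, X, Y, Z]
HADAMARD = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex)
BELL = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)

# Controlled-Z between qubits 0 and 2: sign flip on |101> and |111>.
CZ_0_2 = np.diag([1, 1, 1, 1, 1, -1, 1, -1]).astype(complex)
# Applied right to left: CNOT(0->1), H on 0, CNOT(1->2), CZ(0,2).
CIRCUIT = CZ_0_2 @ np.kron(I2, CNOT) @ np.kron(HADAMARD, np.eye(4)) @ np.kron(CNOT, I2)

s = 1 / np.sqrt(2)
AXIS_STATES = [
    np.array(v, dtype=complex)
    for v in ([1, 0], [0, 1], [s, s], [s, -s], [s, 1j * s], [s, -1j * s])
]


def teleport(psi):
    """Return the reduced states of Alice's input qubit and Bob's qubit."""
    out = CIRCUIT @ np.kron(psi, BELL)
    alice = out.reshape(2, 4)
    bob = out.reshape(4, 2)
    return alice @ alice.conj().T, bob.T @ bob.conj()


def teleport_cost(psi):
    """Eq. (11): squared differences of Pauli expectations, input vs. Bob."""
    _, rho_bob = teleport(psi)
    rho_in = np.outer(psi, psi.conj())
    return float(sum(
        np.real(np.trace(rho_in @ p) - np.trace(rho_bob @ p)) ** 2 for p in PAULIS
    ))


costs = [teleport_cost(psi) for psi in AXIS_STATES]
rho_alice_after, _ = teleport(AXIS_STATES[0])
```

All six costs vanish for this circuit, while Alice's qubit ends in $$\left| + \right\rangle$$ regardless of the input, consistent with her copy of the state being destroyed.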

## Discussion

### Quantum vs. classical

Can these neural networks show some form of quantum supremacy? The comparison of classical and quantum neural networks is well-defined within our set-up, as the classical networks correspond to a particular parameter regime for the quantum networks. A key type of quantum supremacy is that the quantum network can take and process quantum inputs: it can for example process $$\left| + \right\rangle$$ and $$\left| - \right\rangle$$ differently. Thus, there are numerous quantum tasks it can do that the classical network cannot, including the two examples above. We anticipate that they will, moreover, in some cases be able to process classical inputs faster, by turning them into superpositions; investigating this is a natural follow-on from this work.

We also mention that we term our above design a quantum neural network with classical learning parameters, as the parameters in the unitaries are classical. It seems plausible that allowing these parameters to be in superpositions, while experimentally more challenging, could give further advantages.

While adding the ancillary qubits ensures that the network is a strict generalisation of the classical network, it can of course be experimentally and numerically simpler to omit these. Then one would sacrifice performance in the classical mode of operation, and the network may not be as good as a classical network with the same number of neurons for all tasks.

### Visualising the cost function landscape

To gain intuitive understanding, one can visualise the gradient descent in 3D by reducing the number of free parameters. We sampled the cost surface and gradient descent path of a one-input neuron (4 × 4 unitary matrix). With the second qubit expressed as the dummy-then-output qubit, the task for the neuron was $$\left| + \right\rangle \otimes \left| 0 \right\rangle \to \left| + \right\rangle \otimes \left| 0 \right\rangle$$ and $$\left| - \right\rangle \otimes \left| 0 \right\rangle \to \left| - \right\rangle \otimes \left| 1 \right\rangle$$. We optimised, similarly to Eq. (6), over unitaries of the form

$$U = \left| \tau \right\rangle \left\langle \tau \right| \otimes {\sigma _0} + \left| {{\tau ^ \bot }} \right\rangle \left\langle {{\tau ^ \bot }} \right| \otimes {\sigma _1},$$
(12)

where $$\left| \tau \right\rangle = {\rm{cos}}(\theta {\rm{/}}2)\left| 0 \right\rangle + {e^{i\phi }}{\rm{sin}}(\theta {\rm{/}}2)\left| 1 \right\rangle$$ and $$\left| {{\tau ^ \bot }} \right\rangle = {\rm{sin}}(\theta {\rm{/}}2)\left| 0 \right\rangle - {e^{i\phi }}{\rm{cos}}(\theta {\rm{/}}2)\left| 1 \right\rangle$$. We performed gradient descent along the variables θ and ϕ as shown by the red path in Fig. 4.
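The two-parameter landscape can be reproduced in a few lines. The exact cost used for Fig. 4 is not spelled out here, so the sketch assumes a cost of the form of Eq. (8) on the output qubit, summed over the two training inputs with all weights set to 1; the starting point, step size and number of steps are likewise arbitrary choices.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
ket0 = np.array([1, 0], dtype=complex)
plus = np.array([1, 1], dtype=complex) / np.sqrt(2)
minus = np.array([1, -1], dtype=complex) / np.sqrt(2)


def neuron_unitary(theta, phi):
    """Eq. (12): |tau><tau| x sigma_0 + |tau_perp><tau_perp| x sigma_1."""
    tau = np.array([np.cos(theta / 2), np.exp(1j * phi) * np.sin(theta / 2)])
    tau_perp = np.array([np.sin(theta / 2), -np.exp(1j * phi) * np.cos(theta / 2)])
    return (np.kron(np.outer(tau, tau.conj()), I2)
            + np.kron(np.outer(tau_perp, tau_perp.conj()), X))


def output_expectations(theta, phi, in_state):
    """Pauli expectation values (X, Y, Z) on the output (second) qubit."""
    psi = (neuron_unitary(theta, phi) @ np.kron(in_state, ket0)).reshape(2, 2)
    rho_out = psi.T @ psi.conj()  # trace out the first qubit
    return np.array([np.real(np.trace(rho_out @ s)) for s in (X, Y, Z)])


def cost(theta, phi):
    """Assumed cost: output should be |0> for input |+> and |1> for input |->."""
    targets = [(plus, np.array([0, 0, 1])), (minus, np.array([0, 0, -1]))]
    return float(sum(
        np.sum((output_expectations(theta, phi, s) - t) ** 2) for s, t in targets
    ))


def descend(theta, phi, eta=0.2, steps=2000, eps=1e-6):
    """Forward-difference gradient descent in (theta, phi)."""
    path = [(theta, phi)]
    for _ in range(steps):
        base = cost(theta, phi)
        d_theta = (cost(theta + eps, phi) - base) / eps
        d_phi = (cost(theta, phi + eps) - base) / eps
        theta, phi = theta - eta * d_theta, phi - eta * d_phi
        path.append((theta, phi))
    return path


path = descend(1.0, 0.8)
```

Under this assumed cost the landscape has the closed form $$C = 2{(1 - {\rm{sin}}\,\theta \,{\rm{cos}}\,\phi )^2}$$, with its minimum at θ = π/2, ϕ = 0, i.e. $$\left| \tau \right\rangle = \left| + \right\rangle$$. The minimum is quartic in the deviations, so plain gradient descent approaches it slowly.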

### Scaling to bigger networks

The same scheme can be used to make quantum generalisations of networks whose generalised neurons have more inputs/outputs and connections. Figure 5 illustrates an M-qubit input quantum neuron with a subsequent N-qubit fan-out gate.

If one wishes the number of free parameters of a neuron to grow no more than polynomially in the number of inputs, one needs to restrict the unitary. It is natural to demand it to be a polynomial length circuit of some elementary universal gates. In particular, if the input states are known to be generated by a polynomial length circuit of a given set of gates, it is natural to let the unitary be restricted to that set of gates.

The evaluation of the cost function can be kept to a sensible scaling if we restrict it to be a function of local observables on each qubit, in particular a function of the local Pauli expectation values, as was used in this paper, for which case a vector of 3n expectation values suffices for n qubits.

## Methods

### Quantum photonics neuron module

To investigate the physical viability of these quantum neural networks we consider quantum photonics. This is an attractive platform for QIP: it has room temperature operation, the possibility of robust miniaturisation through photonic integrated circuits; in general it harnesses the highly developed optical fibre-related technology for QIP purposes.38 Moreover optical implementations have been viewed as optimal for neural networks, in the classical case, due to the low design cost of adding multiple connections (as light passes through light without interacting).39 A final motivation for choosing this platform is that the tuning can be naturally implemented, as detailed below.

We design a neuron as a module that can then be connected to other neurons. This makes it concrete how experimentally complex the network would be to build and operate, including how it could be trained.

The design employs the Cerf–Adami–Kwiat (C–A–K) protocol,40 where a single photon with polarisation and multiple possible spatial modes encodes the quantum state; the scheme falls into the category of hyper-entangling schemes, which entangle different degrees of freedom. One qubit is the polarisation; digital encodings of the spatial mode labels give rise to the others. With four spatial modes this implements 3 qubits, with basis $$\left| {0{\rm{/}}1} \right\rangle \left| {H{\rm{/}}V} \right\rangle \left| {0{\rm{/}}1} \right\rangle$$, where H/V are two different polarisation states, and the other bits label the four spatial modes. The first bit says whether the photon is in the top or bottom pair of modes and the last bit whether it is in the upper or lower mode of that pair. This scheme and related ones such as in refs. 41, 42 are experimentally viable, theoretically clean and can implement any unitary on a single photon spread out over spatial modes. In such a single photon scenario, however, they do not scale well: the number of spatial modes grows exponentially in the number of qubits. Thus for larger networks our design below would need to be modified to something less simple, e.g., accepting probabilistic gates in the spirit of the KLM scheme,43 or using measurement-based cluster state quantum computation approaches.38

Before describing the module we make the simplifying restriction that there is one input qubit to the neuron and one dummy input. We will ensure that the designated output qubit can be fed into another neuron, as in Fig. 6a.

We propose to update the neural network by adjusting both variable polarisation rotators and spatial phase shifters in a set of Mach–Zehnder interferometers, as shown in Fig. 6c. In this way we are able to change the outputs from each layer of the network. The spatial shift could be induced by varying the strain or temperature on the waveguides at given locations, to change their refractive indices and hence the relative phase; this may have additional difficulties in that silicon waveguides are birefringent.44 Alternatively we can tune both polarisation and spatial qubits via the electro–optic effect.

This circuit can be made more robust and miniaturised using silicon or silica optical waveguides.38 They have been extensively used to control spatial modes and recently also polarisation.45 Several labs can implement the phase shifting via heaters or the electro–optic effect. Conventionally phase shifters built upon the electro–optic effect are known to work in the megahertz region and have extremely low loss.38 For many applications this would be considered slow, but our tuning only requires (in the region of) a few thousand steps. Taking into account that each step requires approximately 1000 repetitions, around 300 for each of the three Pauli measurements, a learning task could be completed in the order of seconds. While it appears that this effect will be the limiting factor in terms of speed, photodetectors are able to reach reset times in the tens of nanoseconds, while the production of single photons through parametric down conversion has megahertz repetition rates.46
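The order-of-magnitude estimate above can be written out explicitly. The figures below are assumptions standing in for the approximate ones in the text: 3000 tuning steps for "a few thousand", 1000 repetitions per step, and a 1 MHz repetition rate for "the megahertz region".

```python
def training_time_seconds(steps, repetitions_per_step, repetition_rate_hz):
    """Total time if every repetition takes one period of the repetition rate."""
    return steps * repetitions_per_step / repetition_rate_hz


# Assumed round numbers: 3000 steps x 1000 repetitions at 1 MHz.
estimate = training_time_seconds(
    steps=3000, repetitions_per_step=1000, repetition_rate_hz=1e6
)
```

With these inputs the estimate is about 3 s, consistent with a learning task being completed in the order of seconds.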

### Data availability

This is a theoretical paper and there is no experimental data available beyond the numerical simulation data described in the paper.