Experimental graybox quantum system identification and control

Understanding and controlling engineered quantum systems is key to developing practical quantum technology. However, given the current technological limitations, such as fabrication imperfections and environmental noise, this is not always possible. To address these issues, a great deal of theoretical and numerical methods for quantum system identification and control have been developed. These methods range from traditional curve fittings, which are limited by the accuracy of the model that describes the system, to machine learning methods, which provide efficient control solutions but no control beyond the output of the model, nor insights into the underlying physical process. Here we experimentally demonstrate a"graybox"approach to construct a physical model of a quantum system and use it to design optimal control. We report superior performance over model fitting, while generating unitaries and Hamiltonians, which are quantities not available from the structure of standard supervised machine learning models. Our approach combines physics principles with high-accuracy machine learning and is effective with any problem where the required controlled quantities cannot be directly measured in experiments. This method naturally extends to time-dependent and open quantum systems, with applications in quantum noise spectroscopy and cancellation.

Quantum technology promises to deliver exponentially faster computation, provably secure communications, and high-precision sensing [1].However, during the fabrication and operation of a quantum device, there are many factors that can significantly impact its functionality, requiring characterization and control techniques to achieve high-level performance.Generally, we are interested in uncovering the unknown relation between the control and the Hamiltonian governing the device, and then utilizing this information to drive the device toward a desired target.Typical targets include unitary gates, a specific Hamiltonian, or certain output probability distributions.
Approaches that directly aim to control the quantum device without first identifying it, includes dynamical decoupling and dynamically-corrected gates [2][3][4][5], as well as direct gradient-based optimization, such as the commonly used GRAPE algorithm [6] and its variants [7][8][9][10][11][12][13][14].These techniques only work when the dependence of the Hamiltonian on the control is known, because they are based on optimizing the fidelity to some target with respect to control.In situations where this dependence is unknown, the fidelity (and in general other cost functions) and/or its gradient, can be computed iteratively from experimental data.The control is optimized after each iteration and directly applied to the system for the next iteration.The physical system becomes part of a feedback architecture for designing the pulses without a need for a model.Examples of this approach are in [14][15][16][17], and are sometimes referred to as "learning quantum control" [18].Evolutionary algorithms, such as Genetic Algorithm (GA) have been also proposed [19], as optimization techniques with the advantage of being gradient-free, and are more likely to find global minima.These techniques can also be applied with a known model or directly from experimental measurements.Reinforcement learning methods [20][21][22][23], are also model-free and have been recently explored for the purposes of removing the reliance on assumptions on the physical system, and have been successfully applied to controlling quantum systems.
In many situations, it is important to reconstruct a mathematical model of the system from experimental data, a process referred to as "system identification".An identified model can be used to compare the behaviour of a fabricated device to its design, or to understand the underlying noise process affecting the system.As the model can predict the behaviour of the system, it can be used for control as well.
The traditional approach to characterizing and controlling physical devices is based on theoretical models of the underlying processes governing the relationship between input and output signals.Example of fitting data to a physical model include [24,25].These "whitebox" (WB) models are based on parameter estimation via curve fitting and can be computationally expensive, inaccurate, or in-complete.For example, they do not consider unexpected input parameters or dynamics.Moreover, the models commonly used to describe the dynamics of an open quantum system (such as Lindblad's master equation), is valid only under strict assumptions and approximations of the noise (such as Markovianity) and the control (such as being ideal impulses).These assumptions do not hold for many quantum platforms, making the use of fixed a-priori WB models inaccurate.
To increase accuracy and remove some of the limitations of the WB method, supervised machine learning techniques are proving useful for modeling and controlling complex physical systems.In particular, techniques such as neural networks are, in many cases, superior if not the only viable approach.However, this approach, referred to as "blackbox" (BB), does not provide any information about the underlying physics of the system.Nonetheless, it has been used in many applications such as quantifying Non-Markovianity of quantum systems [26], characterizing qubits and environments [27][28][29][30], quantum control [31][32][33], quantum error correction [34,35], optimization of experimental quantum measurements [36,37], and calibration of quantum devices [38].A closely-related approach to system identification is the "Hamiltonian Learning" problem [39,40], where a fixed time-independent Hamiltonian (control is fixed) is inferred from quantum measurements.
In situations where the quantities of interest, such as Hamiltonians, unitaries, and noise operators, cannot be directly accessed from the model, the use of a hybrid WB-BB or "graybox" (GB) approach allows to both identify and control these quantities beyond the measurable dataset.Following the standard control engineering definition [41], the aim of GB models is to merge an abstract mathematical structure, such as a neural network, with physical laws.While direct control methods, such as GRAPE, are sometimes referred to as grayboxes, because they are data-driven and may rely on prior knowledge, these methods do not identify a useful mathematical description of the system.The graybox architecture provides access to any physical quantity available from the physical part (WB) of the model.
GB has been proposed to model electrical drift in quantum photonic circuits [42], as well as open quantum systems subject to classical [43] and quantum noise [44], covering the case of a time-dependent Hamiltonian problem.The GB model was also applied in the context of noise detection in the presence of a spectator qubit that acts as a sensor of the environment [45], and to geometric quantum gate synthesis [46].
While GB has been used experimentally to characterize superconducting qubits [47], to date, no quantum device has been characterised using a GB model where the identified model was then used to design optimal quantum control.
Here, we experimentally demonstrate how to model and control a quantum device using the GB architecture when the Hamiltonian dependence on the control is unknown.We report high-fidelity preparation of arbitrary unitaries and output probability distributions of a reconfigurable three-mode integrated photonic device and uncover the Hamiltonian dependence on the control.Our GB approach outperforms the traditional model fitting methods and can successfully prepare unitary operations, which are not accessible from the structure of a BB.Our results show a promising approach to enhance quantum control by understanding the physical processes, and open the way to improve the engineering of quantum devices.

Modeling quantum devices
We consider the class of quantum devices, shown in Fig. 1a, described by a time-independent Hamiltonian undergoing a closed-system evolution (i.e. in the absence of quantum noise).Photonic devices are examples of this class when the Hamiltonian is not modulated faster than the evolution time of a propagating photon.We focus on the case where the system is described by a finite Ndimensional Hilbert space (i.e. a qudit).The Hamiltonian governing the dynamics of the system can be represented in the most general form as an N × N complex Hermitian matrix that depends on a set of external controls.We encode the set of controls in a M × 1-dimensional vector , where V k is the k th control.We assume that during the system evolution, the control vector is fixed.An example of such controls is the voltages applied to a reconfigurable photonic chip, schematically shown in Fig. 1b.The system starts in an initial state |ψ 0 ⟩ and evolves to the state |ψ T ⟩ at time t = T .The state is then measured on some basis to obtain a set of probability outcomes.A more general form of time-dependent evolution in the presence of unwanted interactions with the environment has been considered in our previous work [43].
Our aim is to obtain a machine learning (ML) model that describes the behavior of the device given a set of controls V and use it for controlling the device.The input to the ML model is the M -dimensional control vector V, while the output is the set of measured outcomes of the state after evolution.Generally, it is required to have informationally-complete measurements to fully characterize a quantum system.Here, we restrict the initial states as well as measurement basis to the set of computational basis states {|0⟩ , |1⟩ , • • • |N ⟩} in order to be compatible with our experimental setup.For each of these N initial states, we have N possible outcomes with an associated probability P j→k corresponding to the j th input and k th output, giving a total of N 2 outputs.The approach, however, is independent of this choice, and any set of states can be used.In Supplementary Materials D, we discuss a more general measurement scheme.It is FIG.1: Physical and machine learning models of the class of quantum devices considered in this paper.
These are described by a time-independent Hamiltonian in the absence of interaction with the environment.a) Representation of the class of quantum devices considered in this work.b) A schematic of an integrated photonic voltage-controlled reconfigurable waveguide array chip, implementing a noiseless time-independent Hamiltonian.Photons enter from the input port (on the left), undergo a voltage-controlled propagation along the chip, and are then measured at the output port of the chip (on the right).c) The structure of the proposed graybox model.The input to the model is the set of M controls, while the outputs are the quantum measurements for the set of computational basis as initial states.P a→b indicates the transition probability from input port a to output port b.The graybox is a combination of black and white boxes.The blackbox estimates the real and imaginary components of each matrix element of the Hamiltonian.The whitebox layers construct the Hamiltonian matrix and perform the quantum evolution and measurements.d) A fully whitebox architecture where a physical model is utilized.The first layer generates predefined Hamiltonian parameters that follow a known analytical dependence.The remaining layers perform the quantum evolution and measurements.e) A fully blackbox model where only a generic neural network is utilized with no physical model.
important to emphasize that the model input is the set of controls V applied to the system, and not the initial state |ψ 0 ⟩.

Graybox architecture
Our starting point is our theoretical proposal [42] for modelling and controlling quantum photonic circuits using a GB architecture.The work aimed at stabilizing the effect of electrical drift and preparing quantum gate sequences at the same time.In order to model such an effect, a GB was desgined to capture variations over a "classical" time scale (i.e.slower than the evolution time of a single photon).So, a recurrent neural network was used, particularly a Gated-Recurrent Unit (GRU) as the black part of the model.The inputs and outputs of the model are slowly time-varying waveforms.However, this resulted in optimal voltage pulses that did not belong to the class of pulses in the training set, which is not available in our experimental device, and may not in general be available.
Here, we focus only on modeling the unknown Hamiltonian-voltage dependence, and stabilize the drift in hardware using fast pulsing, a well known technique in integrated photonics.The pulses have fixed frequency and duty cycle, so the only controllable parameters are the amplitude on each electrode.This makes it possible to restrict the controller solution to the space of training pulses.Therefore, in what follows in this paper, we use standard feed-forward neural networks as opposed to recurrent neural networks as there is no need to model a sequence over time.
The GB structure we propose, shown in Fig. 1c, consists of a BB (in the form of a neural network) followed by a WB part that processes the outputs of the BB into measurable physical quantities.The purpose of the blackbox is to map the controls V (the model inputs) to the Hamiltonian of the system.The output of the BB then represents the elements of the Hamiltonian matrix.A general N × N complex matrix has 2N 2 degree of freedom (N 2 components with real and imaginary parts).
Thus, the output layer of the BB must consist of 2N 2 neurons.The other BB layers can be designed arbitrarily, and are custom engineered to provide the best performance for a given dataset.
The second layer is a Hamiltonian construction layer that arranges the outputs of the BB into an N ×N matrix.A valid Hamiltonian has to be Hermitian (i.e.H † = H) and this is ensured by calculating the Hermitian part of the constructed matrix and discarding the anti-Hermitian part.This can be done simply by adding the matrix to its Hermitian conjugate.The output of this layer is a valid control-dependent Hamiltonian.We did not enforce any structure or constraint on the Hamiltonian, to enable the best fitting allowed by the rules of quantum mechanics for a given dataset.
The Hamiltonian is then followed by subsequent layers that transform the Hamiltonian matrix into the set of probability outcomes-which can be measured experimentally-utilizing the laws of quantum mechanics.Starting from a valid Hamiltonian, there is no need to further use a BB since the dynamical equations are known.This saves the algorithm from trying to learn the rules of quantum mechanics from experimental data, which would complicate the process and result in a less accurate model.We use WB layers for the remaining steps.In particular, there is a layer that calculates the evolution unitary by the matrix exponentiation of the Hamiltonian U = e −iHT , which is the solution of the Schrödinger's time-independent equation.The final part of the GB model is a concatenation of N -layers representing the quantum measurement operation for each of the N input |j⟩.In each layer we calculate |ψ T ⟩ = U |j⟩, where |j⟩ is the initial state of the system.After evolving the state to |ψ T ⟩, the probabilities P j→k are calculated by taking the absolute value squared of each entry of the state, that is applying the Born's rule for quantum expectation values.
The added WB layers do not include any trainable parameter; they only exist in the BB part.Therefore, when we train the model on a dataset, the only updates occur in the BB, generating a set of outputs that can be interpreted as a Hamiltonian.
An important aspect of the GB architecture is that it is independent of the physics of the system, i.e. it provides the most general form of a map between the control vector and the quantum measurements while keeping the most important physical quantities accessible via software, namely Hamiltonian, unitary and evolved state.This is the key aspect needed to perform quantum control as we are usually interested in implementing a quantum gate represented by the unitary or Hamiltonian and not by the measured evolved state.Having access to those quantities, even though the model is trained with quantum measurements, is the key feature of the GB architecture.

Whitebox and blackbox architectures
We benchmark the performance of our GB model against the fully WB and fully BB architectures.The WB approach (Fig. 1d) is equivalent to the standard model (curve) fitting and the details of the architecture depend on the physical system.The assumption is that all the relations between different dynamical variables are exactly known except for the parameters we are fitting.Generally, a WB consists of several layers.The first layer represents the mathematical relations between the controls and the Hamiltonian, with a set of unknown parameters.The input of this layer is the control vector, and the output is a mathematically valid Hamiltonian.The remaining layers are identical to those of the GB and represent quantum evolution and quantum measurements.For some systems, we might need more layers (e.g. to model the fan-in/out in photonic devices [48]).The WB provides the same access to hidden quantities as the GB and even provides more physics as we know exactly the analytical relations between Hamiltonian and control.However, if we do not know these relations, or they do not match the physical reality, the WB will fail and will not be useful for further applications.
The BB architecture (Fig. 1e) is largely different from the WB model since the relation between the control vector and the quantum measurement is modeled by a neural network.While any structure can be used, in this paper we consider fully-connected neural networks with a softmax output layer of N neurons.This enforces the outputs to form a probability distribution (i.e.positive numbers in [0, 1] whose sum is equal to 1).This is consistent with what the model outputs represent, which are the probability amplitudes of the evolved quantum state.Since we are characterizing with N -initial basis states, we need N of such layers with the initial state chosen accordingly.This will give a total of N 2 outputs.The structure of the other hidden layers can only be determined and optimized by examining the performance on an actual dataset.No other physical quantities can be accessed through this architecture, but it is very good in fitting a dataset since it gives the maximum freedom in terms of representation.If we are only interested in controlling FIG. 2: Protocol schematic.The first step is to construct an experimental dataset by applying controls to the system and measuring the corresponding outputs.The dataset is then used to train the machine learning models (1).Next, the trained models are tested (2) for generalization by comparing their output predictions against a different experimental testing dataset.After that, the trained models can be used to optimize controls (3) to achieve a certain target, which could be a Hamiltonian, a unitary gate, or an output probability distribution.Finally, the obtained controls are tested (4) experimentally and the controlled system output is compared against the desired target.
the outputs, a BB would be an efficient solution.But, if the goal is to estimate physical quantities (like unitary gates), then a BB model is not useful at all.

Protocol for training, testing, and controlling
Our protocol for training and testing models and controllers is schematically depicted in Fig. 2. It starts with preparing a dataset that will be used to train and test the machine learning BB.The dataset consists of examples.Each example is made of the M control inputs V and the N 2 outputs of the model P j→k .So we start by generating random values for our control, let the system evolve, then perform the measurements and obtain the probabilities P j→k .We repeat this for the N input states we consider to obtain all the outputs.The procedure is repeated for multiple examples.The number of examples of the dataset depends on the particular structure of the ML model, the noise level in the experiment, and the acceptable perfor-mance level.Generally, the larger dataset is, the better the ML algorithm will perform.In our previous work [42], only computer simulated datasets were considered.In this paper, we create and test experimental datasets.This comes with many challenges including performing the experiment itself, the limited dataset size (to be feasible to collect), and the presence of noise not modeled by the quantum dynamics.In particular, initially, we found that statistical noise caused inconsistencies in the dataset, resulting in a poor performance of the method.As a result, we modified the dataset collection protocol, in particular the normalization of output power measurements as discussed in Supplementary Materials C.This new step improved the performance of the ML significantly.
After the dataset is collected, it is split into the training and testing subsets, and the ML model is trained.The purpose of training is to minimize a loss function that measures the distance between the predictions and actual outputs from the training dataset.In [42], the loss function measured the similarity between waveforms, because the model was modeling drift, and the RMSprop algorithm [49] was used for the training.Here, we use the standard mean square error (MSE) as a loss function and use the ADAM algorithm [50] for training.Once the model is trained, its parameters are fixed and do not change in any of the remaining protocol stages.Next, the model is evaluated using the testing dataset, and its predictions are compared to the true corresponding values.
The testing examples are not included in the optimization procedure, so they provide an unbiased evaluation of the performance of the model.
Let's now consider the trained model for control purposes.In this case, the model acts as a replacement for the actual experimental setup and can be probed via software for any purpose.In [42], the controller was designed to be a GRU, and therefore the optimal control was not restricted to any class of pulses.Here, we use a different controller that is designed to directly obtain the parameters of a fixed pulse shape (i.e. the voltage amplitudes).We consider two types of applications.The first is for obtaining the values of the control V to achieve a target output (i.e.probability amplitudes).In this case, we use the MSE as the control cost function, and the optimal control vector V * can be expressed as where ŷP (•) is the ML output predictions, y d is the desired target, and I is the control domain, which reflects the maximum allowed range for the controls.To get accurate results, the ML model should be trained with dataset examples that lie in the same control domain as well.
Note that the internal ML model parameters that define ŷP (•) are not allowed to change during the optimization as they have been fixed after the training.The other case is for achieving a target quantum gate (i.e. a target unitary).For this application, we use the gate fidelity as the control cost function for the controller defined as where U and W are two unitary matrices, and N is their dimension.The gate fidelity lies in the range [0, 1], with 1 representing the maximum overlap between the two gates.The optimal control voltages can then be represented as where ŷU (•) is the evolution unitary obtained from the ML model, and U d is the desired target gate.In this case, we can only use the GB and WB models, since a BB does not provide access to unitaries as discussed earlier.With this modular approach, the optimization algorithm of the controller can be chosen arbitrarily.While we choose a gradient-based method in this paper, other techniques such as genetic algorithms could also be used.Finally, the optimal controls for a set of targets are applied experimentally, and the system is measured to construct the "control" dataset.This dataset is then assessed and compared against the desired targets.Our main goal is to control a quantum system, and thus the assessment of any model should not just rely on its prediction capabilities, but also on how it performs in conjunction with a controller when tested experimentally.

Experimental results and discussion
We tailor the design of the models described above around the device used for the experimental verification for our proposal, a voltage-controlled quantum photonic circuit of continuously coupled waveguides based on lithium niobate technology, schematically shown in Fig. 1b.The details about the chip's fabrication and its physical model are given in Supplementary Materials A and B. The chip has 3 waveguides, corresponding to a qutrit system, and is controlled by 4 electrodes and their respective voltages.In principle, there is no or negligible cross-talk in our device.This is guaranteed by the confinement of the electric field within the material due to the shielding effect from neighboring electrodes, as opposed to other technologies such as thermo-optic switching [51].Thus, the electrodes can be activated simultaneously, which is how we perform the experiments in this paper.We implemented the ML models using the TensorFlow Python package [52,53], applied the protocol for training the three models and then verified the performance of the controllers.The details of the implementations are also given in Supplementary Materials C.Moreover, we provide independent results of applying our method to a simulated synthetic dataset in Supplementary Materials D including a showcase for a 32-mode chip with 33 electrodes.
The results of the models training and testing as well as the control performance are reported in Table I.The MSE evaluated at each iteration for training and testing sets are shown in Fig. 3a and Fig. 3b.The plot of the learning curve in Fig. 3a shows the superior performance of the GB in terms of accuracy compared to the WB.This is due to the constraints imposed by the physical model that is used to construct the Hamiltonian in the case of the WB.As detailed in Supplementary Materials B, the commonly-used device Hamiltonian is assumed to be tri-diagonal, real-valued, and linearly dependent on voltages.Our results show that these assumptions do not hold for a real device, and thus the degradation of the WB performance.On the other hand, the GB learns a general mathematically valid Hamiltonian and thus is able to better fit the experimental data.Initially, we designed the GB to enforce the Hamiltonian to be real-valued but otherwise arbitrary, and the fitting was not good.
When we relaxed this assumption to allow a complexvalued Hamiltonian, the results were improved.Here we note that the Hamiltonian remains Hermitian to allow a unitary evolution, and thus it does not model losses.
The normalization procedure (detailed in Supplementary Materials C) that we perform on the power measurements makes it unnecessary to model losses.The BB had similar performance to GB, with the main drawback of losing the physical picture.In terms of the testing performance, Fig. 3b shows that the three models do not overfit, as the final MSE of the testing set is close to the final MSE of the training set.This means that the models do not memorize the examples of the training set.In other words, the model generalizes-there's no significant loss in prediction accuracy-confirming that the dataset, the model structure, and the training algorithm are well designed.Furthermore, the GB and BB clearly perform better than the WB.
The performance of the models is limited by the size of the dataset.Because the experimental data always suffer from some level of noise, the minimum MSE obtainable without overfitting is limited as well.Usually, the acceptable level of MSE depends on the specific application.In this paper, our application is to control the chip to obtain target power distributions as well as target unitary operations.The ML models then act as a replacement/simulator of the actual setup.The experimental assessment of the optimal control will determine whether the model performance is accepted or need improvement.In general, the way to improve models is by constructing very large datasets which is the standard approach in most typical machine learning applications.For engineering applications, where we characterize and control a physical device, we are limited by how many measurements we can obtain.Thus, it becomes a tradeoff between the amount of time and resources needed to construct the dataset experimentally, and the accuracy of the trained models, which will also affect the performance of the controller.
The histogram of the controller fidelities between the desired target and the experimental measurements for 1000 randomly prepared output power distributions and 1000 randomly prepared unitary gates are shown in Fig. 4a and Fig. 4b respectively.The plots show that the WB is particularly skewed towards lower fidelities (minimum is 80.53% compared to a minimum of approximately 91% for GB and BB for the output controller).Similarly, for the gate controller, the minimum fidelity is 86.6% and 94.95% for WB and GB.In Fig. 4c, we summarize the statistics of the MSE between the ML model predictions and actual outputs for the training and testing datasets, in comparison with the MSE between the experimentally controlled measurements and the targets for each of the two controllers.
The results of the controller for obtaining a target power distribution, show once again the superior performance of the GB and BB over the WB in terms of both MSE and average fidelity.The same controller/optimization algorithm and cost function are used for the three models.Thus, the lower performance comes from the lower accuracy of the WB model itself.When considering the controller for target unitary, it is only possible to use a WB or a GB as they are the only models that can give access to the overall unitary evolution matrix.A BB cannot be utilized in this case since it only encodes the dynamics in an abstract machine-suitable format, and does not provide any physical picture.
The performance assessment of ML-based algorithms on real rather than synthetic datasets is critical.Different noise sources could affect the data in unpredicted ways, which may also be difficult to simulate.This can affect the performance of the ML algorithm.We see that the final MSE of training and testing for the experimental dataset is two orders of magnitude less than that of the synthetic dataset (shown in the Supplementary Materials D).However, the control performance on the experimental dataset is accepted and thus we also accept the model prediction performance.In other situations, this might not be the case, and the ML design has to be modified to achieve higher performance.Therefore, using a design based on simulations such as [42] and applying it directly to an experimental dataset would not result in adequate performance, and so the whole workflow needs to be reexecuted.
Determining the required dataset size as well as NN architecture and complexity for higher-dimensional systems is generally difficult and has to be studied case by case.In Supplementary Materials D, we show promising results for a simulated 32-mode chip.And while the dataset and neural network sizes had to be increased, the overall protocol was still feasible to execute.
Finally, the GB provides sufficient physical insights for most purposes, for example in Fig. 5a and 5b, where the tunability of the Hamiltonian as a function of a single electrode voltage is explored.The figure shows the GB prediction of the different Hamiltonian elements as a function of the voltage applied to a single electrode.We can use these predictions more generally when more than one electrode is tuned (although it would be more difficult to plot in this situation) for unitary control.We can use the GB to predict the unitary given the set of voltages even though the dataset originally did not include this information, but rather the power distribution.Another advantage of using the GB compared to WB, is the incorporation of unmodelled effects such as cross-talk.While the effect is negligible in our technology, in other situations it can be difficult to have an exact/accurate WB model.
It is also important to realize that this predicted Hamiltonian, besides not being unique mathematically, represents physically an effective quantity, and so it will differ in structure (such as the existence of an imaginary part) from the ideal Hamiltonian that one may expect for a system.Depending on the purpose of the use of the GB we can control how much we make it "blacker" or "whiter".
For control applications, the best architecture is to have this effective Hamiltonian.In another application such as modeling a device for the purpose of completely understanding the physics, the architecture of the GB might need to be modified to allow Hamiltonians that are closer to some expected structure.In Supplementary Materials E, we explore this idea in more detail.In particular, we explore the relaxation of the WB assumptions gradually until we reach the structure of the GB.The results show that the best architecture that fits the experimental data is a complex non-tridiagonal Hamiltonian with non-linear dependence on the control voltage.This suggests that the reconfigurable section of the chip has variations along the propagation direction, which could be the result of fabrication imperfections.In other words, the estimated Hamiltonian effectively represents a time-dependent system.Finally, it is worth mentioning that reaching this conclusion was based on interpreting the mathematical structure obtained from the GB.Thus, the GB approach helped us understand better the behaviour of our system.

Conclusion and outlook
We have shown how a GB model can be designed for a general quantum device, trained on experimental data, and verified by generating target unitary operations and output distributions with high fidelity.The performance was benchmarked against WB and BB models, showing the superior performance of our approach.Our approach is general and can be applied to any quantum system, it can be extended to time-dependent and open quantum systems with the needs of modifying the machine learning structure and dedicated dataset-taking process for specific hardware or quantum systems [43][44][45].There are many possible extensions to this work as well.One possibility is to design a GB for other physical models such as a Lindblad master equation for Markovian open quantum systems.One could also consider a hybrid approach between Hamiltonian learning using Genetic Algorithms (such as [54,55]) and our numerical GB for the purpose of obtaining more detailed physical models.In terms of the ML-aspects of this application, a study about the scaling requirements for the NN structures of the GB in relation to the dimensionality of the system, would be interesting.However, it will be challenging because asymptotic analysis of ML algorithms is difficult or might be impossible.On the other hand, relying on numerical analysis might not be sufficient since the analysis will be restricted to a particular range of the scaling parameter, and cannot be generalized outside that range.Another aspect related to any ML algorithm is the requirements of the training dataset size.For complex devices, it can be challenging to collect a large-sized dataset.However, some emerging techniques can facilitate this process including incremental learning [56], transfer learning [57], and adaptive online learning [58].While these techniques are constantly developing in classical machine learning literature, there is still a gap in porting such methods to physics-based applications and especially quantum applications.
Senior Research Fellowship and a Google Faculty Research Award.ML was supported by the Australian Research Council (ARC) Future Fellowship (FT180100055).BH was supported by the Griffith University Postdoctoral Fellowship.This work was supported by the Australian Government through the Australian Research Council under the Centre of Excellence scheme (No: CE170100012), and the Griffith University Research Infrastructure Program.This work was performed in part at the Queensland node of the Australian National Fabrication Facility, a company established under the National Collaborative Research Infrastructure Strategy to provide nanoand micro-fabrication facilities for Australia's researchers.This research was also undertaken with the assistance of resources from the National Computational Infrastructure (NCI Australia), an NCRIS enabled capability supported by the Australian Government.

AUTHOR CONTRIBUTIONS
A.Y. and A.P. conceived the experiment.B.H., F.L. and M.L. fabricated the integrated photonic device.A.Y, Y.Y, R.J.C. and A.P. carried out the experiments and performed the data analysis.All the authors discussed the results and contributed to the writing of the paper.

COMPETING INTERESTS
The Authors declare no Competing Financial or Non-Financial Interests.
Lithium niobate is an ideal choice for a reconfigurable coupled waveguide array due to its large electro-optic coefficient, and consolidated waveguide fabrication technology.The reconfigurable coupled waveguide array devices (Figure S1a) are fabricated in X-Cut bulk lithium niobate (Gooch & Housego) using the annealed proton exchange technique.The design was optimised for transmission at 808nm using simulations and methods described in [1].Waveguides were formed with a width of 5.3 µm by a proton exchange step through a titanium mask.An exchange depth d e =0.33 µm was formed by immersion in pure benzoic acid at 150 • C for 45 minutes.The subsequent annealing was performed in air at 328 • C for 9 hours and 50 minutes.A silicon dioxide layer of 200 nm is sputtered on top of the lithium niobate before gold electrodes are defined via photolithography and lift-off in a configuration shown in Figure S1b.Electrical connection to the chip is made via wire bonding.Waveguide separation at the ends of the device is 127 µm to match the pitch of the butt-coupled fibre array.Consequently, a fan-in and fan-out region is required to bring the waveguides into the coupling region where the centre-to-centre waveguide separation is 10 µm, chosen to achieve a coupling coefficient of C=140 at 808 nm.The bends in the fan-in and fan-out regions are defined by cosine curves to maximise bending radius and minimise loss.

B. CHIP MODELLING
The Hamiltonian of the photonic chip we use in this paper is typically assumed to be real-valued with this FIG.S1: A schematic of the three waveguide chip we utilized in our experiment.a) Top view b) Cross-section tridiagonal form: where each of the diagonal elements β i represents the propagation constant along the i th waveguide, while the off-diagonal elements C ij represent the coupling between the neighbouring i th and j th waveguides.These Hamiltonian parameters depend on the voltages applied to the electrodes.As first order approximation, the dependence can be considered linear as follows: where ∆V i is the potential difference across the i th waveguide, and β where ∆V i and ∆V i are the potential difference across the i th and i th waveguide, and C (0) ij and ∆C ij are parameters representing zero-voltage coupling constant and sensitivity to change in voltage.Therefore, there is a total of 4N − 2 trainable parameters to completely specify this Hamiltonian.The length of the waveguides l is fixed and known from the design of the chip, and is equivalent to time for the evolution.This model is based on what is widely known and accepted by the photonics community.In particular, the assumption of a real-valued Hamiltonian is justified by the fact that the propagation constant and the coupling coefficient are real for lossless devices [2].The tri-diagonal form is justified by the fact that only nearest-neighbor coupling between waveguides is significant, and the coupling decays exponentially for further waveguides [3,4].Finally, the linear dependence on voltage is due to the Pockels effect where the change in the refractive index of the material is proportional to the amplitude of the applied electric field, as shown in [2].
In addition, there exists an effective fan-in and fan-out sections of the chip where the waveguides are gradually separated to the input and output ports of the chip.This effectively induces extra coupling, and the complete device can be modeled as a cascade of three unitaries: the fan-in unitary, the voltage-controlled unitary, and the fan-out unitary.In other words, where U (V ) is the voltage-dependent evolution unitary, and each of the fan-in and fan-out unitaries take the form of e −iH fan l fan .The fan Hamiltonian takes the same tri-diagonal form of the chip Hamiltonian, but with one main difference that it is voltage independent.The fan length l fan is chosen arbitrarily to be 1 (since it can be absorbed in fan Hamiltonian).The parameters of the fan Hamiltonian alongside the chip Hamiltonian parameters form the set parameters of the physical model of the chip.
We choose the initial states to be the computational basis states {|0⟩ , |1⟩ , • • • |N ⟩, where the |j⟩ state encodes a single photon entering the chip at port j.Equivalently, if classical laser is used, then the state corresponds to light entering through input port j.Therefore, we actually measure N power distributions corresponding to each input basis states.In other words, we measure the probability P i→j of transitioning from input port i to output port j.This gives a total of N 2 outputs.For the purposes of this paper, these measurements are sufficient.However, in the absence of any phase measurements, the model predictions are be accurate only in terms of probability amplitudes of the output state (or equivalently the square amplitudes of the unitary describing the chip).One way to measure phase shifts is to use a Mach-Zehnder interferometer to convert phases into powers.In this case, there will be 2N 2 outputs.

C. PROTOCOL IMPLEMENTATION
We collected a dataset of 7000 examples for training, and 1000 for testing.Each example consists of the voltages applied to four electrodes, chosen randomly in the interval [−1, 1], and the corresponding measured output power distribution.The electrodes are activated simultaneously in all experiments.Thus, the models we use will have four inputs, and 9 outputs.
Because the evolution of the system is unitary, and the input states are orthogonal, the output states must also be orthogonal.Particularly the matrix |U | 2 , which is the unitary after taking squared absolute value element-wise must form a bistochastic matrix.In other words, the columns must be normalized to 1, and the rows as well.Both the matrix and its inverse U −1 = U † (backward evolution in time) are unitary, and thus the extra requirement of row normalization.While measuring the probability amplitudes of an evolved state, the column normalization is automatically done so that it yields a valid probability distribution.However, the row normalization is not naturally guaranteed due to the presence of experimental noise.Therefore, the measured power distributions are further post-processed to ensure the normalization holds for both rows and columns.We use iterated proportional fitting algorithm which is a standard procedure in statistics.First we arrange the measured power distributions into an 3 × 3 matrix where each column corresponds to the output of a particular input state.The idea simply is to keep alternating between normalizing rows and normalizing columns until convergence, which is theoretically guaranteed [5].Finally, the columns are stacked together to form an 9 × 1 vector that matches the model output.The procedure for normalizing the power distributions is done for each example of the dataset.It is also done for the control dataset.
For the GB model, we implemented the structure described in the main text of the paper.In particular, the neural network consists of three layers of 50, 100, and 18 neurons respectively.The first two layers have a hyperbolic tangent activation, while the last one has linear activation.The whitebox parts consists of standard quantum evolution and measurement operations.Because we do not impose any constraints of the Hamiltonian structure, the interpretation is that it is an effective Hamiltonian that results in the overall unitary (i.e.chip unitary and fan unitaries) as described in Equation S4.In fact any unitary can be written as the imaginary evolution of some Hermitian operator.
For the fully WB model, we implement the physical model described by Equations S1,S2, S3, S4 setting N = 3.Thus, the first layer of the ML model implements the set of parameterized equations S2, S3.The second layer in the model, is the Hamiltonian construction where the Hamiltonian parameters (that are the output of the first layer), are arranged into an 3 × 3 matrix as described by Equation S1.This layer is followed by the quantum evolution layer to calculate the voltage dependent unitary.
After that, we construct a layer that implements the overall evolution by including the fan-in and fan-out of the waveguides, implementing Equation S4.Finally, for the fully BB model, we have three layers.The first layer consists of 50 neurons with hyperbolic tangent activation function.The second layer consists of 100 neurons of hyperbolic tangent activation as well.The final output layer consists of three concatenated layers of 3 neurons with softmax activation.
We then trained the three models using the Adam optimizer [6] (which is one of the most commonly-used and successful methods for training NN) with learning rate 0.003 for 3000 iterations.For each of the three ML models, we run the output controller as well as the unitary controller for 1000 random states and unitaries.After obtaining the optimal controls for all cases, we apply them experimentally on the chip, and measure the corresponding output distributions for all initial states.We also the same post-processing step that is used for the training and testing datasets.Since we restrict the measurements in this paper to computational basis, we can use the classical fidelity F (p, q) = i √ p i √ q i between probability distributions as a metric of how close the measured distributions are to the target ones.Since we have 3 initial states, we get three fidelities for each target example, so we take the average over the three initial states.We repeat this for each of the 1000 targets, and perform statistical analysis on the calculated fidelities.For each of the three models considered, we report below the statics distributions for each separate input.It can be note that the performance slightly change between the inputs.

D. SIMULATIONS
We implemented the proposed protocol on a synthetic simulated dataset to study the performance of the different ML models in an ideal scenario.For the simulator, we considered a Hamiltonian with an extra quadratic term in voltage to model the possibility of having a non-linearity that is not known for a WB physical model.Additionally, we considered measuring phase information and recording it as a part of the dataset.This is done by doing interferometer measurements on the outputs of the chip as described in [7].It is possible to infer the magnitude and phases of the output state if we did two interferometer measurements with different angles for each input/output pair.This will result in double the number of outputs (i.e.2N 2 ) which is the requirement for completely determining all the elements of the evolution unitary.We matched all the settings (such as dataset size, and ML hyperparameters) as the those used for the experimental data in the paper.The BB model however was modified in the output layer where a sigmoid activation was used instead of softmax, as we no longer require the normalization of the outputs since they now represent the interference measurements.
Figures S3a, S3b show the results of the training and testing the three ML architectures against the simulated dataset.It clearly shows that the WB had the worst performance because it is based on a physical model where the Hamiltonian depends linearly on the voltage whereas the simulated chip has a quadratic term as well.The interesting observation though is that at the end of iterations, the GB performed much better than the BB for the same number of iterations.This motivates the idea that with the GB structure, it is easier for the machine to learn the dataset because we already provide part of the dynamical equations.Fig. S3c shows the results for comparing the gate fidelity using the WB and GB.As expected the GB outperforms the WB because it models the data more accurately and so it is expected to have better control.In conclusion, the simulation results agree with the experimental results, and also shows the applicability of the proposed method for a different and more complex setup.
Additionally, we performed another simulation for a 32-mode chip, to study the applicability of the proposed method to much higher-dimensional system.The dataset consisted of 65536 examples for training, and 8192 for testing.The layers of the neural networks had 100 and 300 nodes for the hidden layers.For assessing control, we utilized 1024 examples out of the testing set. Figure S4 shows the training, testing, and unitary control performance.The GB maintains its performance compared to BB and WB.Comparing to the 3 × 3 case, the training and testing MSE are higher.However, the unitary control in the GB case still shows most examples are concentrated around 99% of fidelity.

E. RE-DESIGNING THE WHITEBOX
The results of our experiments, particularly from the GB, show that the best way to model the chip is using a general N × N complex-valued Hermitian Hamiltonian that is non-linearly dependent on the control voltages.This deviates significantly from the standard WB model that uses the tri-diagonal real-valued linearly voltagedependent Hamiltonian.In this section, we explore the effect of relaxing the WB assumptions based on the new knowledge acquired from the GB results.
Here we introduce three additional WB models:   A) complex, tri-diagonal, and linear voltage dependence, B) complex, non-tridiagonal, and linear voltage dependence, C) complex, non-tridigonal, and non-linear voltage dependence (using an NN) For all models (including the original WB in the main text), we separate the fan-in and fan-out Hamiltonians (that are voltage independent), so that we can study the reconfigurable part separately.We use the same mathematical structure for the fan-in and fan-out Hamiltonians as the reconfigurable part.Next, we fit those models with the experimental data, and compare against the standard WB and the GB. Figure S5 shows the MSE for the training and testing datasets.The results show that relaxing the the assumptions on the WB, the MSE improves.Model C (with least assumptions) performs comparably to the GB towards the end of the iterations.The fact that the complex non-tridiagonal fits better implies that we are fitting an effective Hamiltonian of non-commuting sections, or more generally a time-dependent Hamiltonian.In other words, the model is trying to fit where T + is the time-ordering operator, and H(V, z) is the physical Hamiltonian that depends on the control voltage V as well as position z along the propagation direction.Since we already separated the fan-in and fan-out from the reconfigurable part, we can conclude that the reconfigurable part itself is varying along the propagation direction.For example, the electrodes might not be symmetric due to fabrication imperfections.
As for the non-linear dependence on the voltage, this can be attributed to multiple sources.First, it can be physical due to higher-order Pockel's effect, although it is unlikely in our experiment as we limit the maximum control voltage.Alternatively, the non-linearity could be a mathematical artifact of the optimization process.By looking back at Equation S5, we see that the effec-tive Hamiltonian will be likely non-linear because of the matrix logarithm and time-ordered evolution operations.Moreover, due to the non-uniqueness of Hamiltonians generating a given unitary, we can have a case where both a linear and a non-linear Hamiltonian would generate exactly the same unitary.We illustrate this with a 2 × 2 system as an example.Let H 1 (V) be a 2 × 2 Hamiltonian of the form where f 1 (V) and g 1 (V) are two linear functions of the control voltage V. We want to find another equivalent Hamiltonian H 2 of the same form i.e. they generate the same unitaries U 1 and U 2 , such that, f 2 (V) and g 2 (V) are now non-linear functions of V.For this form of the Hamiltonian, we can express the evolved unitary easily if we rewrite the Hamiltonian in the Pauli vector notation A = a(n • ⃗ σ) U = e iA = I cos(a) + i sin(a)(n where a is a scalar, I is the identity matrix, n := [n x , n y , n z ] T is a three-dimensional unit vector, and n • ⃗ σ := n x σ x + n y σ y + n z σ z , σ j is the Pauli matrix along the j th direction.Thus, rewriting the two Hamiltonians in that form we have, The evolution can be calculated then as Equating the two unitaries, we obtain the three conditions to satisfy: We can see that by exploiting the periodicity of the cosine, we can obtain a non-linear solution for f 2 (V) and f 2 (V).Thus, there exist an infinite family of transformations parameterized by an integer k, that define an equivalent class of Hamiltonians in terms of the evolved unitary.The general concept can be extend to larger systems, while it can verified numerically, finding an analytical expression would be very difficult.
The consequence of this model invariance, and the fact that we cannot directly measure the Hamiltonian, but only the unitary encoded as power distributions, means that there is no way to decide from experimental data whether a linear or a non-linear model is the groundtruth.Moreover, the NN is more likely to find a non-linear so-

FIG. 3 :
FIG. 3: Experimental performance of the machine learning models.The whitebox model consists of fan-in, reconfigurable, and fan-out sections, each modelled as a real-valued tri-diagonal Hamiltonian in addition to a linear dependence on voltage for the reconfigurable section.(a) Results of training the different models on the experimental dataset.The MSE is plotted versus iteration number (b) The results of evaluating the different models on the testing set.

FIG. 4 :
FIG. 4: Experimental quantum control performance.The distribution of the fidelity between the experimentally measured output power distribution and the desired target distribution for the three models.The whitebox model utilizes a real-valued tri-diagonal Hamiltonian with linear dependence on voltages, in addition to fan-in and fan-out sections.The results are for (a) the output controller, and (b) the unitary controller.The reported values are the average over the three distributions corresponding to each possible initial state.(c) Violin plot showing the statistics of the MSE obtained for the training, testing, and control datasets.The horizontal lines represent from bottom to top, the minimum, median, and maximum respectively.The plot also shows an estimated kernel density for the data.

FIG. 5 :
FIG. 5: Dependence of the Hamiltonian elements to a subset of input voltages, as predicted by the graybox model.(a) Real and (b) imaginary parts of the Hamiltonian matrix elements as a function of voltage when all electrodes are grounded except the first electrode.It should be noted that the imaginary parts of H 11 , H 22 , and H 33 are by definition equal to zero.The non-linear dependence, the second off-diagonal elements, and the imaginary components indicate an effective Hamiltonian being estimated for a time-dependent system, attributed to non-homogeneity of the chip along the propagation direction.
and ∆β i represent the zero-voltage propagation constant and sensitivity to change in voltage.The arXiv:2206.12201v4 [quant-ph] 29 Nov 2023 coupling coefficients take the form: FIG. S2:The distribution of fidelity between the experimentally measured output power distribution and the desired target distribution.The rows show the results for each of the whitebox (WB), blackbox (BB), and graybox (GB) respectively.The WB model is based on a tri-diagonal real-valued Hamiltonian with linear dependence on voltages.The three columns represent different input states.
FIG. S3: Results of training the different model on the simulated dataset for the 3-mode chip.The WB model is based on a tri-diagonal real-valued Hamiltonian with linear dependence on voltages.The MSE is plotted versus iteration number for a) training set, and b) for testing set.c) The distribution of gate fidelity between the simulated unitary and the desired target unitary for each of the whitebox and graybox controllers.
FIG. S4: Results of training the different models on the simulated dataset for a 32-mode chip.The WB model is based on a tri-diagonal real-valued Hamiltonian with linear dependence on voltages.The MSE is plotted versus iteration number for a) training set, and b) testing set.c) The distribution of gate fidelity between the simulated unitary and the desired target unitary for each of the whitebox and graybox controllers.