Experimental neural network enhanced quantum tomography

Quantum tomography is currently ubiquitous for testing any implementation of a quantum information processing device. Various sophisticated procedures for state and process reconstruction from measured data are well developed and benefit from precise knowledge of the model describing state preparation and the measurement apparatus. However, physical models suffer from intrinsic limitations, as actual measurement operators and trial states cannot be known precisely. This scenario inevitably leads to state-preparation-and-measurement (SPAM) errors degrading reconstruction performance. Here we develop and experimentally implement a machine-learning-based protocol reducing SPAM errors. We trained a supervised neural network to filter the experimental data and hence uncovered salient patterns that characterize the measurement probabilities for the original state and the ideal experimental apparatus free from SPAM errors. We compared the neural network state reconstruction protocol with a protocol treating SPAM errors by process tomography, as well as with a SPAM-agnostic protocol with idealized measurements. The average reconstruction fidelity is shown to be enhanced by 10% and 27%, respectively. The presented methods apply to the vast range of quantum experiments which rely on tomography.

Introduction. Rapid experimental progress realizing quantum enhanced technologies places an increased demand on methods for validation and testing.
As such, various approaches to augment state and process tomography have recently been proposed. A persistent problem faced by these contemporary approaches is systematic errors in state preparation and measurement (SPAM). Such notoriously challenging errors are inevitable in any experimental realization [2][3][4][5][6][7][8][9][10][11][12]. Here we develop a data-driven, deep-learning-based approach to augment state and detector tomography that successfully minimizes SPAM errors on quantum optics experimental data.
Several prior approaches have been developed to circumvent the SPAM problem. One line of thought leads to the so-called randomized benchmarking protocols [3,13,14], which were designed for quality estimation of quantum gates in the quantum circuit model. The idea is to average the error over a large set of randomly chosen gates, thus effectively minimizing the average influence of SPAM. Randomized benchmarking in its initial form, however, only allowed estimation of an average fidelity for the set of gates, so more elaborate and informative procedures were developed [4,15]. Another example is gate set tomography [5,16,17]. Therein the experimental apparatus is treated as a black box with external controls allowing for (i) state preparation, (ii) application of gates and (iii) measurement. These unknown components (i)-(iii) are inferred from measurement statistics. Both approaches require long sequences of gates and are not suited for a simple prepare-and-measure scenario in quantum communication applications. Indeed, in such a scenario the experimenter faces careful calibration of the measurement setup, or in other words quantum detector tomography [6,7,18], which works reliably if known probe states can be prepared [19][20][21][22].
* All data and source code are available online at [1].
As (imperfect) quantum tomography is a data-driven technique, recent proposals suggest a natural benefit offered by machine learning methods. Bayesian models were used to optimise the data collection process by adaptive measurements in state reconstruction [8,9,23], process tomography [24], Hamiltonian learning [25] and other problems in experimental characterisation of quantum devices [26]. Neural networks were proposed to facilitate quantum tomography in high dimensions. In such approaches neural networks of different architectures, such as restricted Boltzmann machines [10,11,27], variational autoencoders [12] and other architectures [28], are used for efficient state reconstruction; interestingly, a model for tackling the more realistic scenario of mixed quantum states has also been proposed [29].
Our framework differs significantly and is based on supervised learning, specifically tailored to address SPAM errors. Our method hence compensates for measurement errors of the specific experimental apparatus employed, as we demonstrate on real experimental data from high-dimensional quantum states of single photons encoded in spatial modes. The success of our approach builds on the well-known class of noise-filtering techniques in machine learning.
Quantum tomography. Performing quantum state estimation implies the reconstruction of the density matrix $\rho$ of an unknown quantum state given the outcomes of known measurements [30][31][32]. In general, a measurement is characterized by a set of positive operator-valued measures (POVMs) $\{M_\alpha\}$, where the index $\alpha \in A$ labels the different configurations of the experimental apparatus (set $A$).
Given the configuration $\alpha$, the probability of observing an outcome $\gamma$ is given by the Born rule:

$$P(\gamma|\alpha,\rho) = \mathrm{Tr}(M_{\alpha\gamma}\rho), \qquad (1)$$

where $M_{\alpha\gamma} \in M_\alpha$ are POVM elements, i.e. positive operators satisfying the completeness relation $\sum_\gamma M_{\alpha\gamma} = I$. A statistical estimator maps the set of all observed outcomes $D_N = \{\gamma_n\}_{n=1}^{N}$ onto an estimate $\hat\rho$ of the unknown quantum state. A more general concept of quantum process tomography stands for a protocol dealing with estimation of an unknown quantum operation acting on quantum states [33,34]. Process tomography uses measurements on a set of known test states $\{\rho_\alpha\}$ to recover the description of an unknown operation.¹
The reconstruction procedure requires knowledge of the measurement operators $\{M_{\alpha\gamma}\}$, as well as the test states $\{\rho_\alpha\}$ in the case of process tomography. However, both tend to deviate from the experimenter's expectations due to stochastic noise and systematic errors. While stochastic noise may to some extent be circumvented by increasing the sample size, systematic errors are notoriously hard to correct. The only known way to make tomography reliable is to explicitly incorporate these errors in (1). Thus, trial states and measurements should be considered as acted upon by some SPAM processes, $\tilde\rho_\alpha = \mathcal{R}(\rho_\alpha)$ and $\tilde M_{\alpha\gamma} = \mathcal{M}(M_{\alpha\gamma})$, and the models for these processes should be learned independently from a calibration procedure. Such calibration is essentially tomography in its own right. For example, the reconstruction of measurement operators is known as detector tomography [6,7,18,35,36] and requires ideal preparation of calibration states. The most straightforward approach is calibration of the measurement setup with some close-to-ideal and easy-to-prepare test states, or calibration of the preparation setup with known and close-to-ideal measurements. In this case, one may then infer the processes $\mathcal{R}$ and/or $\mathcal{M}$ explicitly, for example in the form of the corresponding operator elements, and incorporate this knowledge in the reconstruction procedure.
Ideally, this procedure should produce an estimator free from bias caused by systematic SPAM errors.²
Denoising by deep learning. The problem of fighting SPAM is essentially a denoising problem. Given the estimates of raw probabilities inferred from the experimental dataset, $\tilde P(\gamma|\alpha,\tilde\rho) = \mathrm{Tr}(\tilde M_{\alpha\gamma}\tilde\rho)$, one wants to establish a one-to-one correspondence with the ideal probabilities, $\tilde P(\gamma|\alpha,\tilde\rho) \leftrightarrow P(\gamma|\alpha,\rho)$, for the measurement setup free from systematic SPAM errors. We use a deep neural network (DNN) in the form of an overcomplete autoencoder trained on a dataset $D_N$ to approximate the map from $\tilde P$ to $P$.
The DNN modifies its internal parameters to find a function $F: \tilde P(\gamma|\alpha,\tilde\rho) \to P(\gamma|\alpha,\rho)$ which translates between the experimentally estimated probabilities $\tilde P(\gamma|\alpha,\tilde\rho)$, subjected to SPAM errors, at the input and the ideal $P(\gamma|\alpha,\rho)$ at the output. To achieve this goal the network is trained to reduce the Kullback-Leibler divergence between pairs of distributions. Early stopping is applied in order to avoid overfitting during the training phase.
¹ See Supplemental Material (Section 3) for the thorough discussion of quantum process tomography and its application for calibration of the measurement setup (Section 1).
² See Supplemental Material (Section 3) for the detailed description of this procedure applied to our experiment (Section 2).
To train and test the DNN we prepare a dataset of $N$ Haar-random pure states. For a $d$-dimensional Hilbert space, reconstruction of a Hermitian density matrix with unit trace requires at least $d^2$ different measurements. The network is trained on the dataset, consisting of $d^2 \times N$ frequencies experimentally obtained by performing the same $d^2$ measurements $\{M_\gamma\}_{\gamma=1}^{d^2}$ for all $N$ states (in our experiments $d = 6$, i.e. we deal with a six-dimensional Hilbert space). These frequencies are fed to the input layer of the feed-forward network consisting of $d^2 = 36$ neurons.³ We use a DNN with two hidden layers as shown in Fig. 1. The first hidden layer consists of 400 neurons, whilst the second contains 200. To prevent overfitting we applied dropout between the two hidden layers with drop probability equal to 0.2, i.e. at each iteration we randomly drop 20% of the neurons of the first hidden layer, so that the network becomes more robust to variations. We use a rectified linear unit as the activation function after both hidden layers, while in the final $d^2$-dimensional output layer we use a softmax function to transform the predicted values into valid normalized probability distributions. Following the standard paradigm of statistical learning, we divided our dataset of overall $N = 10500$ states (represented by their density matrix elements) into 7000 states for training, 1500 states for validation and 2000 for testing. The validation set is an independent set and is used to stop the network training as soon as the error evaluated for this set stops decreasing (generally, this is referred to as early stopping: we examine validation loss every 100 epochs). Our loss function is computed over mini-batches of data of size 40.
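The architecture described above can be sketched as a plain NumPy forward pass. This is only an illustration under stated assumptions: the layer sizes, ReLU activations, dropout probability and softmax output come from the text, whereas the weight initialisation, the function names (`init_params`, `forward`) and the random stand-in input batch are hypothetical and are not taken from the authors' code.

```python
import numpy as np

D2 = 36            # d^2 input/output probabilities for d = 6
H1, H2 = 400, 200  # hidden-layer widths quoted in the text
P_DROP = 0.2       # dropout probability between the hidden layers

def init_params(rng):
    """He-style random initialisation (an assumption, not from the paper)."""
    sizes = [(D2, H1), (H1, H2), (H2, D2)]
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in sizes]

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract row maximum for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(params, freqs, rng=None, train=False):
    """Map a batch of measured frequencies (batch, 36) to probabilities."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = np.maximum(freqs @ w1 + b1, 0.0)          # first hidden layer + ReLU
    if train:
        # Inverted dropout: zero ~20% of first-layer activations, rescale the rest.
        mask = rng.random(h1.shape) >= P_DROP
        h1 = h1 * mask / (1.0 - P_DROP)
    h2 = np.maximum(h1 @ w2 + b2, 0.0)             # second hidden layer + ReLU
    return softmax(h2 @ w3 + b3)                   # normalised output layer

rng = np.random.default_rng(0)
params = init_params(rng)
batch = rng.dirichlet(np.ones(D2), size=40)        # stand-in for one mini-batch of 40
out_eval = forward(params, batch)
out_train = forward(params, batch, rng=rng, train=True)
```

Dropout is active only when `train=True`; at evaluation time the full network is used, so `out_eval` is deterministic for fixed parameters.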
Kullback-Leibler divergence. Training is performed by minimization of the loss function, defined as the sum of Kullback-Leibler (KL) divergences between the distributions of predicted probabilities $\{p^i_\gamma\}_{\gamma=1}^{d^2}$ at the output layer of the network and the ideally expected probabilities $\{P^i_\gamma\}_{\gamma=1}^{d^2}$, which are calculated for the test states as $P^i_\gamma = \mathrm{Tr}(M_\gamma\rho_i)$ assuming errorless projectors $M_\gamma$:

$$L = \sum_i D_{KL}(P^i\|p^i) = \sum_i \sum_{\gamma=1}^{d^2} P^i_\gamma \log\frac{P^i_\gamma}{p^i_\gamma}. \qquad (2)$$

The minimization of the KL divergence of Eq.
(2) is achieved by gradient descent with respect to the parameters $\{\theta_k\}$ of the DNN, which updates its internal weights. The KL divergence for a pair $(P^i, p^i)$ can be expressed in terms of the cross-entropy $H(P^i, p^i) = -\sum_{\gamma=1}^{d^2} P^i_\gamma \log p^i_\gamma$, which has to be minimized, since the entropy of $P^i$ does not depend on the network parameters. For this purpose, we utilized the RMSprop [37] algorithm, in which the learning rate is adapted for each of the parameters $\{\theta_k\}$, dividing the learning rate for a weight by a running average of the magnitudes of recent gradients for that weight according to

$$v_k^{(t)} = \alpha\, v_k^{(t-1)} + (1-\alpha)\left(\frac{\partial L}{\partial\theta_k}\right)^2, \qquad (3)$$

where $\alpha = 0.1$, while the parameters are updated as

$$\theta_k^{(t)} = \theta_k^{(t-1)} - \frac{\eta}{\sqrt{v_k^{(t)}}}\,\frac{\partial L}{\partial\theta_k}, \qquad (4)$$

with $\eta$ standing for the learning rate.
Experimental dataset. We fix the set of tomographically complete measurements $\{M_\alpha\} = M$ to estimate all matrix elements of $\rho$ using (1) and an appropriate estimator. We will assume that our POVM $M$ consists of $d^2$ one-dimensional projectors $M_\gamma = |\varphi_\gamma\rangle\langle\varphi_\gamma|$. These projectors are transformed by systematic SPAM errors into some positive operators $\tilde M_\gamma$. Experimental data consist of frequencies $f_\gamma = n_\gamma/n$, where $n_\gamma$ is the number of times an outcome $\gamma$ was observed in a series of $n$ measurements with an identically prepared state $\rho$. For the time being, we assume that all the SPAM errors can be attributed to the measurement part of the setup, and that the state preparation may be performed reliably. This is indeed the case in our experimental implementation (see Supplemental Material).
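How frequencies $f_\gamma = n_\gamma/n$ arise from projective measurements on a Haar-random state can be illustrated with a small simulation. The sketch below is an assumption-laden toy: it uses the qubit ($d = 2$) tetrahedral SIC POVM, for which the projectors can be written down explicitly, rather than the $d = 6$ SIC POVM of the experiment, and the shot number and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Pauli matrices and the four tetrahedral Bloch vectors of a qubit SIC POVM.
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]], dtype=complex)
sz = np.array([[1, 0], [0, -1]], dtype=complex)
bloch = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]) / np.sqrt(3)

d = 2
# Rank-one projectors |phi><phi| = (I + n.sigma) / 2 for each Bloch vector n.
projectors = [(np.eye(2) + n[0] * sx + n[1] * sy + n[2] * sz) / 2 for n in bloch]
povm = [p / d for p in projectors]          # POVM elements summing to identity

def haar_random_state(dim, rng):
    """Haar-random pure state: a normalised complex Gaussian vector."""
    v = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    return v / np.linalg.norm(v)

psi = haar_random_state(d, rng)
rho = np.outer(psi, psi.conj())

# Born rule: P_gamma = Tr(M_gamma rho).
probs = np.array([np.trace(m @ rho).real for m in povm])
probs = probs / probs.sum()                 # guard against rounding errors

# Finite-statistics frequencies f_gamma = n_gamma / n from n repetitions.
n_shots = 10_000
counts = rng.multinomial(n_shots, probs)
freqs = counts / n_shots
```

In this idealised simulation the only difference between `freqs` and `probs` is statistical noise; systematic SPAM errors would correspond to sampling from probabilities computed with distorted operators instead of `povm`.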
We reconstruct high-dimensional quantum states encoded in the spatial degrees of freedom of photons. The most prominent example of such encoding uses photonic states with orbital angular momentum (OAM) [38] as relevant to numerous experiments in quantum optics and quantum information. However, OAM is only one of two quantum numbers associated with orthogonal optical modes, and the radial degree of freedom of Laguerre-Gaussian beams [39,40] as well as the full set of Hermite-Gaussian (HG) modes [41] offer viable alternatives for increasing the accessible Hilbert space dimensionality. One of the troubles with using the full set of orthogonal modes for encoding is the poor quality of projective measurements. Existing methods to remedy the situation [42] gain reconstruction quality at the expense of efficiency, significantly reducing the latter. Complex high-dimensional projectors are especially vulnerable to measurement errors, and fidelities of state reconstruction are typically at most ∼ 0.9 in high-dimensional tomographic experiments [43]. This provides a challenging experimental scenario for our machine-learning-enhanced methods.
FIG. 2. Experimental setup for preparation and measurement of spatial qudit states. In the generation part, single photons from a heralded source are beam-shaped by a single mode fiber (SMF) and then transformed by a hologram displayed on a spatial light modulator. Analogously, the detection part consists of a hologram corresponding to the chosen detection mode, followed by a single mode fiber and a single photon counter. The hologram in the generation part produces high-quality HG modes with the use of amplitude modulation, while a phase-only hologram at the detection part sacrifices projection quality for efficiency.
Our experiment is schematically illustrated in Fig. 2. We use phase holograms displayed on the spatial light modulator as spatial mode transformers. At the preparation stage an initially Gaussian beam is modulated both in phase and in amplitude to an arbitrary superposition of HG modes, which are chosen as the basis in the Hilbert space. At the detection stage the beam passes through a phase-only mode-transforming hologram and is focused into a single mode fiber, filtering out a single Gaussian mode. This sequence corresponds to a projective measurement in mode space, where the projector $\tilde M_\gamma$ is determined by the phase hologram.⁴ In dimension $d = 6$, we are able to prepare an arbitrary superposition expressed in the basis of HG modes as $|\psi\rangle = \sum_{i,j=0}^{2} c_{ij}|\mathrm{HG}_{ij}\rangle$, where only the six modes with $i + j \le 2$ are used. In the measurement phase we used a SIC (symmetric informationally complete) POVM, which is close to optimal for state reconstruction and may be relatively easily realized for spatial modes [43].
FIG. 3. (c) Fidelity histogram for the case when the state is reconstructed to be pure. The results of the filtering process are clearly witnessed by the modification of the data histogram shapes. Besides the shift towards higher values, which shows the average gain over our experimental data, the reduction of the FWHM indicates the filtering performed by the neural network.
Experimental results. We performed state reconstruction using maximum likelihood estimation [44] for both raw experimental data and DNN-processed data.⁵ In the former case, the log-likelihood function to be maximized with respect to $\rho$ has been chosen as $L(f^i_\gamma|\rho) = \sum_\gamma f^i_\gamma \log \mathrm{Tr}(M_\gamma\rho)$, with frequencies $f_\gamma = n_\gamma/n$ and $i$ numbering the test set states, whereas in the latter case these frequencies have been replaced with the predicted probabilities $p_\gamma$. The results of comparing $\hat\rho^i_{(raw)} = \arg\max_\rho L(f^i_\gamma|\rho)$ and $\hat\rho^i_{(nn)} = \arg\max_\rho L(p^i_\gamma|\rho)$ with the prepared states $|\psi_i\rangle$ are shown in Fig. 3. Interestingly, the average reconstruction fidelity increases from $F^{(raw)} = (0.82 \pm 0.05)$ to $F^{(nn)} = (0.91 \pm 0.03)$ and this increase is uniform over the entire test set. Similar behavior is observed for the purity: since we did not force the state to be pure in the reconstruction, the average purity of the estimate is less than unity, $\pi^{(raw)} = (0.78 \pm 0.07)$, whereas $\pi^{(nn)} = (0.88 \pm 0.04)$. If the restriction to pure states is explicitly imposed in the reconstruction procedure, the fidelity increase is even more significant, as shown in Fig. 8c. In this case the initially relatively high fidelity of $\tilde F^{(raw)} = (0.94 \pm 0.03)$ increases to $\tilde F^{(nn)} = (0.98 \pm 0.02)$, a very high value given the dimensionality of the states.
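The two figures of merit quoted above, the fidelity with the prepared pure state and the purity of the estimate, can be computed as in the following minimal sketch. The white-noise model and the value `eps = 0.1` are hypothetical stand-ins for a reconstructed density matrix and are not experimental data; the function names are likewise illustrative.

```python
import numpy as np

def fidelity_with_pure(psi, rho):
    """Fidelity <psi|rho|psi> between a pure target state and an estimate."""
    return float(np.real(np.vdot(psi, rho @ psi)))

def purity(rho):
    """Purity Tr(rho^2); equals 1 only for pure states."""
    return float(np.real(np.trace(rho @ rho)))

d = 6
psi = np.zeros(d, dtype=complex)
psi[0] = 1.0                                    # toy target state
eps = 0.1                                       # hypothetical white-noise weight
rho = (1 - eps) * np.outer(psi, psi.conj()) + eps * np.eye(d) / d

F = fidelity_with_pure(psi, rho)                # equals (1 - eps) + eps / d
pi_est = purity(rho)
```

For this noise model both quantities drop below unity together, which mirrors the qualitative behaviour reported for the unconstrained reconstruction.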

Conclusion.
Our results were obtained with analytical correction for some known SPAM errors already performed. In particular, we have explicitly taken into account the Gouy phase shifts acquired by the modes of different order during propagation (see Supplemental Material). This correction is, however, unnecessary for neural-network post-processing. The DNN can be trained on the experimental dataset without any preprocessing, that is, without introducing any phase correction in our initial data, so that the phase shifts are treated as the effect of a channel process $\mathcal{E}$. In this completely agnostic scenario we achieved average estimation fidelities of $F^{(nn)} = (0.81 \pm 0.19)$ as compared to $F^{(raw)} = (0.54 \pm 0.12)$, showing a dramatic improvement by straightforward application of a learning approach. To conclude, our results unambiguously demonstrate that the use of a neural-network architecture on experimental data can provide a reliable tool for quantum state and detector tomography.
setup (Section 1) and state preparation and detection methods (Section 2).
⁵ See also Supplemental Material (Section 4) for extra information on spatial probability distribution of reconstructed states.
[1] Experimental data and source code. https://github.com/Quantum-Machine-Learning-Initiative/dnnquantumtomography.

We use spatial degrees of freedom of photons to produce high-dimensional quantum states. The corresponding continuous Hilbert space is typically discretized using the basis of transverse modes; for this purpose we chose Hermite-Gaussian (HG) modes $\mathrm{HG}_{nm}(x, y)$, which are the solutions of the Helmholtz equation in Cartesian coordinates $(x, y)$ and form a complete orthonormal basis. The HG modes are separable in the $x$- and $y$-coordinates, so that $\mathrm{HG}_{nm}(x, y) = \mathrm{HG}_n(x) \times \mathrm{HG}_m(y)$. Each mode is characterized by indices $n$ and $m$ which indicate the orders of the corresponding Hermite polynomials $H_n(x)$ and $H_m(y)$:

$$\mathrm{HG}_{nm}(x, y) \propto H_n\!\left(\frac{\sqrt{2}\,x}{w}\right) H_m\!\left(\frac{\sqrt{2}\,y}{w}\right) \exp\!\left(-\frac{x^2 + y^2}{w^2}\right), \qquad (5)$$

where $w$ is the mode waist. We limited the dimensionality of the Hilbert space to 6 by using only the beams with $n + m \le 2$. The basis of HG modes is fully equivalent to the commonly used basis of Laguerre-Gaussian (LG) modes, which are also solutions of the Helmholtz equation but in cylindrical coordinates. Most commonly, only the azimuthal part of the LG basis, associated with the orbital angular momentum (OAM) of photons, is considered in experiments, primarily due to the simplicity of detection [43]. Here we use the full two-dimensional mode spectrum of HG modes, which is equivalent to including the radial degree of freedom in addition to OAM. This is rarely done in quantum experiments, and one of the reasons is the poor quality of projective measurements. Thus, this choice of physical system nicely fits the purpose of our demonstration. The experimental setup is presented in Fig. 4. Two light sources were used: an attenuated 808 nm diode laser and a heralded single photon source. Heralded single photons were obtained from spontaneous parametric down conversion in a 15 mm periodically-poled KTP crystal pumped by a 405 nm volume-Bragg-grating-stabilized diode laser.
Beams from both sources were filtered by a single-mode fiber (SMF) and then collimated by an aspheric lens L2 (11 mm). We used one half (right) of the SLM (Holoeye Pluto) to generate the desired mode in the first diffraction order of the displayed hologram. Since SLM's working polarization is vertical, the half-wave plate (HWP) was inserted into the optical path to let the beam pass through the polarizing beamsplitter (PBS). The combination of lenses L3 and L4 with equal focal lengths (100 mm) separated by a 200 mm distance was used to cut off the zero diffraction order with a pinhole in the focal plane. After the double pass through this telescope and a quarter-wave plate (QWP) the beam was reflected by the PBS and directed back to the SLM. Using the hologram displayed on the left half of the SLM and a single mode fiber followed by a single photon counting module (SPCM) we realized a well-known technique of projective measurements in the spatial mode space [45]. To focus the first diffraction order of the reflected beam on the tip of the fiber we used an aspheric lens L5 with the same focal length (11 mm) as L1.
All data used for the neural network training and evaluation were taken with an attenuated laser source due to its much higher data acquisition rate. When the NN trained on the attenuated laser data was applied to a dataset taken with the heralded single photon source, the reconstruction fidelity slightly degraded: we observed $F^{(nn)} = 0.86 \pm 0.04$ vs. $F^{(raw)} = 0.81 \pm 0.05$, while $\pi^{(nn)} = 0.84 \pm 0.04$ vs. $\pi^{(raw)} = 0.75 \pm 0.07$. The most likely reason for this is some non-uniformity of the datasets caused by experimental drifts, since the data for heralded single photons were taken after some period of time. We believe the performance may be recovered if heralded-photon data are used for training as well, using a much larger amount of data.

State generation and detection methods
To generate beams with arbitrary phase and amplitude profiles with a phase-only SLM, we calculated hologram patterns $F(i, j)$, which can be described as a superposition of a desired phase profile $\Phi(i, j)$ and a blazed grating pattern with a period $\Lambda$ modulated by the corresponding amplitude mask $A(i, j)$:

$$F(i, j) = A(i, j)\,\mathrm{mod}\!\left(\Phi(i, j) + \frac{2\pi i}{\Lambda},\; 2\pi\right), \qquad (6)$$

where $i$ and $j$ are the pixel coordinates. A spatially dependent blazing function allows one to control the intensity in the first diffraction order by changing the phase depth of the hologram. The presence of an amplitude mask $A(i, j)$ significantly decreases the diffraction efficiency, but at the same time corrects the alterations caused by diffraction. Bolduc et al. showed that with the modification $\Phi(i, j) \to \Phi(i, j) - \pi A(i, j)$ this technique guarantees accurate conversion of a plane wave into a beam of arbitrary spatial profile [46]. This allows one to safely assume that the preparation errors in our setup are small. Considering the states generated with amplitude modulation as ideal, we compared the quality of detection with and without such modulation ($A(i, j) = 1$ for the latter case). The result is illustrated in Fig. 5, where the experimentally measured probabilities $\tilde P^i_j = \mathrm{Tr}(\tilde M_j|\varphi_i\rangle\langle\varphi_i|) = |\langle\varphi_i|\tilde\varphi_j\rangle|^2$ are shown. The projectors $M_j = |\varphi_j\rangle\langle\varphi_j|$ were chosen to be elements of the SIC (symmetric informationally complete) POVM for the $d = 6$ dimensional Hilbert space, and $\tilde M_j$ are their SPAM-corrupted counterparts. Since SIC vectors satisfy $|\langle\varphi_i|\varphi_j\rangle|^2 = 1/(d+1)$ for $i \ne j$, the ratio between $P^i_{j=i}$ and $P^i_{j\ne i}$ was expected to be close to $d + 1 = 7$. To quantify the deviation between $\tilde M$ and $M$ we used the similarity parameter $S = \left(\sum_{i,j}\sqrt{\tilde P^i_j P^i_j}\right)^2 / \left(\sum_{i,j}\tilde P^i_j \sum_{i,j}P^i_j\right)$, where $\tilde P^i_j$ and $P^i_j$ stand for the experimentally measured and theoretically expected probabilities, respectively. We found that the value of the similarity parameter decreased from 0.99 to 0.96 after switching off the amplitude modulation in the detection holograms. At the same time, the total number of observed counts rose from $6.2 \times 10^6$ to $40.9 \times 10^6$ due to the increased diffraction efficiency of the hologram. This illustrates a known tradeoff between the projection measurement quality and detection efficiency. One of the applications of the results developed here is in increasing the detection efficiency for complex measurements of spatial states of photons without sacrificing quality.
FIG. 5. Experimentally measured cross-talk probabilities $\tilde P^i_j = |\langle\varphi_i|\tilde\varphi_j\rangle|^2$ for the projectors from the POVM for the cases of detection with (a) and without (b) amplitude modulation.
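The similarity parameter used above to compare measured and expected cross-talk matrices can be evaluated as in the sketch below. It assumes the square-root form $S = (\sum\sqrt{\tilde P P})^2 / (\sum\tilde P \sum P)$, which equals 1 for identical matrices; the "measured" matrix here is a hypothetical noisy copy of the ideal SIC cross-talk pattern, not the experimental data.

```python
import numpy as np

def similarity(p_exp, p_th):
    """S = (sum sqrt(P_exp * P_th))^2 / (sum P_exp * sum P_th)."""
    p_exp = np.asarray(p_exp, dtype=float)
    p_th = np.asarray(p_th, dtype=float)
    return np.sum(np.sqrt(p_exp * p_th)) ** 2 / (p_exp.sum() * p_th.sum())

d = 6
n_proj = d * d
# Ideal SIC cross-talk matrix: 1 on the diagonal, 1/(d+1) elsewhere.
p_ideal = np.full((n_proj, n_proj), 1.0 / (d + 1))
np.fill_diagonal(p_ideal, 1.0)

rng = np.random.default_rng(2)
# Hypothetical "measured" matrix: ideal values with multiplicative noise.
p_meas = p_ideal * rng.uniform(0.7, 1.3, size=p_ideal.shape)

s_same = similarity(p_ideal, p_ideal)
s_noisy = similarity(p_meas, p_ideal)
```

By the Cauchy-Schwarz inequality $S \le 1$, with equality only when the two matrices are proportional, so the parameter is insensitive to an overall change in count rate.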
There is a simple way to understand why the projection measurement quality is lower than that of state preparation. The orthogonality condition for the detection of the HG mode $\mathrm{HG}_{nm}(x, y)$ with a hologram corresponding to $\mathrm{HG}_{n'm'}(x, y)$ can be written as

$$\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} \mathrm{HG}^*_{n'm'}(x, y)\,\mathrm{HG}_{nm}(x, y)\,dx\,dy = \delta_{n'n}\delta_{m'm},$$

but since the aforementioned hologram calculation method is designed for a plane-wave input and not a Gaussian beam, one has to introduce an additional Gaussian term with a waist $w_f$ corresponding to the detection mode waist, which breaks the orthogonality:

$$\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} \mathrm{HG}^*_{n'm'}(x, y)\,\mathrm{HG}_{nm}(x, y)\,\exp\!\left[-\frac{x^2 + y^2}{w_f^2}\right] dx\,dy \ne \delta_{n'n}\delta_{m'm}. \qquad (7)$$
One possible way to fix the problem is to increase the waist $w_f$ of the detection mode, as in the experimental work [42]. Unfortunately, this reduces the detection efficiency to the level of a few percent. Thus, we used a different approach, modifying the HG mode equation (5) used for the hologram calculation with a second independent width parameter $\tilde w$ in the following way:

$$\widetilde{\mathrm{HG}}_{nm}(x, y) \propto H_n\!\left(\frac{\sqrt{2}\,x}{w}\right) H_m\!\left(\frac{\sqrt{2}\,y}{w}\right) \exp\!\left(-\frac{x^2 + y^2}{\tilde w^2}\right),$$

where the parameter $\tilde w$ was chosen to satisfy the relation

$$\frac{1}{\tilde w^2} + \frac{1}{w_f^2} = \frac{1}{w^2}$$

to compensate for the SMF term in (7).
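The loss of orthogonality caused by the extra Gaussian term, and its recovery by a modified Gaussian width, can be checked numerically in one dimension. The sketch below assumes the standard normalised form of the 1D Hermite-Gaussian functions and assumes that the compensation condition reads $1/\tilde w^2 = 1/w^2 - 1/w_f^2$; the waist values and function names are arbitrary illustrative choices.

```python
import numpy as np
from math import factorial, pi, sqrt
from numpy.polynomial.hermite import hermval

def hg(n, x, w, w_gauss=None):
    """1D Hermite-Gaussian mode; w_gauss optionally overrides the Gaussian width."""
    w_gauss = w if w_gauss is None else w_gauss
    coeffs = np.zeros(n + 1)
    coeffs[n] = 1.0                                  # selects the physicists' H_n
    norm = (2.0 / pi) ** 0.25 / sqrt(2.0 ** n * factorial(n) * w)
    return norm * hermval(sqrt(2.0) * x / w, coeffs) * np.exp(-x**2 / w_gauss**2)

def overlap(f, g, x):
    """Riemann-sum approximation of the overlap integral of two real profiles."""
    return float(np.sum(f * g) * (x[1] - x[0]))

w, w_f = 1.0, 2.0                                    # mode and fibre-mode waists (toy values)
x = np.linspace(-12.0, 12.0, 24001)
fibre = np.exp(-x**2 / w_f**2)                       # extra Gaussian term

norm_00 = overlap(hg(0, x, w), hg(0, x, w), x)                   # normalisation check
ideal = overlap(hg(2, x, w), hg(0, x, w), x)                     # orthogonal modes
broken = overlap(hg(2, x, w) * fibre, hg(0, x, w), x)            # orthogonality lost

# Assumed compensation: 1/w_tilde^2 = 1/w^2 - 1/w_f^2 for the hologram mode.
w_tilde = 1.0 / sqrt(1.0 / w**2 - 1.0 / w_f**2)
fixed = overlap(hg(2, x, w, w_gauss=w_tilde) * fibre, hg(0, x, w), x)
```

With the compensated width the three Gaussian factors multiply to the same envelope as in the ideal overlap, so the Hermite polynomials are again integrated against their natural weight and the cross-talk between the two modes vanishes.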

Gouy phases reconstruction by process tomography
Importantly, as the lengths of optical paths in our setup were comparable with the Rayleigh lengths of the collimated beams, the generated states suffered from additional Gouy phase shifts, which depend on the mode orders. In order to avoid the related difficulties we reconstructed these phases with a standard process tomography procedure and took them into account during state generation.
Quantum process tomography is a protocol dealing with estimation of an unknown quantum operation $\mathcal{E}$ acting on quantum states. The most general form of such an operation in the absence of loss is a completely positive trace-preserving (CPTP) map, which can be written in the following form:

$$\mathcal{E}(\rho) = \sum_{k=1}^{K} E_k \rho E_k^\dagger, \qquad \sum_{k=1}^{K} E_k^\dagger E_k = I,$$

with $K \le d^2$ in a $d$-dimensional state space, known as an operator-sum representation. The problem of quantum process tomography boils down to reconstruction of the operators $\{E_k\}$ given the observed outcomes of measurements performed on some test states $\rho_\alpha$ with probabilities $P(\gamma|\alpha, \mathcal{E}) = \mathrm{Tr}(M_{\alpha\gamma}\mathcal{E}(\rho_\alpha)) = \mathrm{Tr}\!\left(M_{\alpha\gamma}\sum_{k} E_k \rho_\alpha E_k^\dagger\right)$. We have reconstructed the operator elements $E_k$ for the process associated with the spatial state evolution between the preparation and measurement. In this case masks with amplitude modulation were used both for state preparation and measurement. The process $\mathcal{E}(\rho) = \sum_{k=1}^{K} E_k \rho E_k^\dagger$ turned out to be close to a rank-one process with a single dominating operator element $E_1$. As one can see from Fig. 6, the reconstructed $E_1$ is close to a diagonal matrix with pure phases on the diagonal. These phase shifts are naturally interpreted as Gouy phase shifts, since they are almost equal for the modes of a particular order $n + m = \mathrm{const}$. The inferred Gouy phase shifts were found to be equal to $0.92 \pm 0.02$ and $1.97 \pm 0.03$ radians for mode orders of $n + m = 1$ and $n + m = 2$, respectively. Only when these additional phase shifts were taken into account were fidelities above 0.8 achieved without the neural-network-enhanced post-processing.
FIG. 6. Experimentally reconstructed first operator element $E_1$ of the process $\mathcal{E}$ associated with the spatial state evolution between the preparation and measurement stages. The matrix elements are expressed in the Hermite-Gaussian mode basis. Ideally, it should be an identity matrix, but additional phase shifts, known as Gouy phase shifts, were observed in our setup.
The Gouy phase-shift elimination case is a nice example of the situation where the machine-learning-based approach helps even if the correct model of the detector is unknown to the experimenter. Indeed, instead of performing the full process reconstruction to find out the relevant phase shifts for the modes of different order, one may stay agnostic of these shifts and consider them as just another contribution to systematic SPAM errors. We have tested the performance of the neural network trained on the states for which no correction of the Gouy phase shifts was made. To give an impression of how dramatically these phase shifts affect the measurement, we show the cross-talk probabilities for the projectors of the SIC POVM with no phase correction, i.e. modified as $\tilde M_j = E_1 M_j E_1^\dagger$, in Fig. 7a. Without phase correction the state reconstruction of the 2000 test states gives an average fidelity of only $F^{(raw)} = 0.54 \pm 0.12$ and an average purity of the estimate $\pi^{(raw)} = (0.77 \pm 0.07)$. When the state is reconstructed as a pure one, the value of the average fidelity increases to $\tilde F^{(raw)} = 0.60 \pm 0.13$. At the same time, a DNN trained on the same dataset without any information about the Gouy phase shifts gives the corresponding fidelities of $F^{(nn)} = 0.81 \pm 0.19$ and $\tilde F^{(nn)} = 0.89 \pm 0.22$ (see Fig. 7b and Fig. 7c).
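The effect of uncorrected Gouy phases on a projective measurement can be illustrated with a diagonal operator element acting on the six HG modes with $n + m \le 2$. In the sketch below the phase values per mode order are the ones quoted in the text, while the zero reference phase for order 0, the sign convention, the mode ordering and the equal-superposition test state are assumptions made only for illustration.

```python
import numpy as np

# Six HG modes with n + m <= 2, ordered by mode order n + m.
modes = sorted(((n, m) for n in range(3) for m in range(3) if n + m <= 2),
               key=lambda nm: (nm[0] + nm[1], nm))
# Gouy phases per mode order (radians), values quoted in the text;
# the zero phase for order 0 and the sign convention are assumptions.
gouy = {0: 0.0, 1: 0.92, 2: 1.97}
E1 = np.diag([np.exp(1j * gouy[n + m]) for n, m in modes])

def apply_process(kraus_ops, rho):
    """Operator-sum representation: E(rho) = sum_k E_k rho E_k^dagger."""
    return sum(E @ rho @ E.conj().T for E in kraus_ops)

d = len(modes)
psi = np.ones(d, dtype=complex) / np.sqrt(d)      # equal superposition (toy state)
rho = np.outer(psi, psi.conj())
M = rho.copy()                                    # ideal projector onto |psi>

rho_out = apply_process([E1], rho)                # rank-one process with E_1 only
M_tilde = E1 @ M @ E1.conj().T                    # phase-shifted projector

p_ideal = np.trace(M @ rho).real                  # unit overlap for the ideal projector
p_shifted = np.trace(M_tilde @ rho).real          # reduced by the phase mismatch
```

Because `E1` is unitary the process preserves the trace, yet the overlap between the intended and the effective projector drops well below one for a state spread over several mode orders, which is why neglecting these phases degrades the raw reconstruction so strongly.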

Neural Network
Throughout the paper we consider a feed-forward neural network [47] with two hidden layers of 400 and 200 neurons respectively, which maps input probabilities to the ideal ones and can be regarded as an autoencoder. To prevent overfitting in our model we use dropout between the two hidden layers with drop probability equal to 0.2; this means that at each iteration we randomly drop 20% of the neurons of the first hidden layer, so that the network becomes more robust to variations in the input data. After both hidden layers we use the Rectified Linear Unit (ReLU) as the activation function, while in the final output layer of 36 dimensions we use a softmax function to transform the predicted values into probabilities. The network is trained using the Kullback-Leibler divergence (KLD) between the predicted values and the real target probabilities. Thus, we aim at minimizing the distance between the predicted distribution $p$ and the target distribution $P$ according to

$$D_{KL}(P\|p) = \sum_{\gamma=1}^{d^2} P_\gamma \log\frac{P_\gamma}{p_\gamma}.$$

In the following, we address the performance of the DNN with respect to varying the size of the dataset, constituted by 10500 states, and the performance for the reconstruction task. At each iteration we select 2000 states from the dataset that we consider as testing samples. Then, from the remaining dataset of $K = 8500$ states we sample a fraction of the data in the range from $\eta = 0.1$ to $\eta = 1$ with steps of 0.1 (i.e., 10% of the data, or 850 samples). We train the network over 200 epochs and compute both the loss function and the Bhattacharyya coefficient (classical fidelity) on the test data sample. We do this 5 times and average the results to report a stable value, as shown in Fig. 9. Interestingly, little data and few epochs are necessary to learn how to generate probabilities that are close to the ideal ones. The fidelity for $\eta = 0.1$, i.e. for a training set consisting of 10% of the data, is already equal to 0.9720.
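The KLD loss, the classical fidelity used as a metric, and an RMSprop-style update can be tied together in a toy example. The sketch below is not the network training itself: it optimises the 36 logits of a single softmax distribution towards one random target. The value `alpha = 0.1` is the one quoted in the main text, whereas placing `alpha` on the previous running average, the learning rate, the small constant `tiny`, and the number of steps are assumptions.

```python
import numpy as np

def kl_divergence(p_target, p_pred, eps=1e-12):
    """D_KL(P || p) = sum_gamma P_gamma * log(P_gamma / p_gamma)."""
    p_target = np.clip(p_target, eps, None)
    p_pred = np.clip(p_pred, eps, None)
    return float(np.sum(p_target * np.log(p_target / p_pred)))

def bhattacharyya_coefficient(p, q):
    """Classical fidelity sum_gamma sqrt(p_gamma * q_gamma)."""
    return float(np.sum(np.sqrt(p * q)))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(3)
target = rng.dirichlet(np.ones(36))        # stand-in for ideal probabilities
logits = np.zeros(36)                      # trainable parameters (toy model)
v = np.zeros(36)                           # running average of squared gradients
alpha, lr, tiny = 0.1, 0.01, 1e-8          # alpha from the text; lr and tiny assumed

kl_start = kl_divergence(target, softmax(logits))
for _ in range(3000):
    grad = softmax(logits) - target        # gradient of the KLD w.r.t. the logits
    v = alpha * v + (1 - alpha) * grad**2  # RMSprop running average
    logits -= lr * grad / (np.sqrt(v) + tiny)
kl_end = kl_divergence(target, softmax(logits))
bc_end = bhattacharyya_coefficient(target, softmax(logits))
```

Dividing each gradient component by the root of its running average makes the step size roughly uniform across parameters, which is the per-parameter learning-rate adaptation referred to in the text.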
The experimental setting used to obtain the result described in the paper is as follows: we divide our initial dataset into three subsets, namely training, validation and testing sets, in line with the commonly accepted ratio of 80% (or 60%) for training, 10% (or 20%) for validation and 10% (or 20%) for testing. We use roughly 20% of the dataset (2000 samples) for testing, roughly 15% for validation (1500 samples) and the remaining 7000 samples for training. This division ensures that there is enough data to train the network and that we can test on a sample that is almost 20% of the original data.
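The split described above amounts to a single shuffle of the state indices followed by slicing. The following sketch assumes the 10500 states are addressed by integer index; the seed and variable names are arbitrary, and the training-set size follows from 10500 − 2000 − 1500 = 7000.

```python
import numpy as np

N_TOTAL, N_TEST, N_VAL = 10500, 2000, 1500
rng = np.random.default_rng(4)
order = rng.permutation(N_TOTAL)           # shuffle state indices once

test_idx = order[:N_TEST]
val_idx = order[N_TEST:N_TEST + N_VAL]
train_idx = order[N_TEST + N_VAL:]         # the remaining 7000 states
```

Keeping the three index sets disjoint is what allows the test loss to serve as an estimate on unseen data.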
The training set is used to train the model, while the validation set is an independent set used to stop the network training as soon as the error no longer decreases on the validation set. This technique is generally referred to as early stopping [48]; we stop training if the error does not decrease within 100 epochs. At the end of the early stopping we restore the weights of the network that had the best validation loss during training. Finally, we test the model on the test set. This last step gives a less biased estimate of the loss on a completely unseen set of data, since the model has been chosen using the validation set and is therefore biased towards it.
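The early-stopping logic with restoration of the best weights can be sketched generically as follows. The function `train_with_early_stopping` and its callback arguments are hypothetical names introduced for illustration; the default patience of 100 epochs matches the text, and the toy "model" with a single drifting weight only serves to exercise the loop.

```python
import copy

def train_with_early_stopping(model, train_epoch, val_loss, patience=100,
                              max_epochs=10_000):
    """Train until the validation loss has not improved for `patience` epochs,
    then return the parameters that gave the best validation loss."""
    best_loss = float("inf")
    best_model = copy.deepcopy(model)
    best_epoch = -1
    for epoch in range(max_epochs):
        train_epoch(model)                       # one pass over the training set
        loss = val_loss(model)
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            best_model = copy.deepcopy(model)    # snapshot of the best weights
        elif epoch - best_epoch >= patience:
            break                                # no improvement within patience
    return best_model, best_loss, best_epoch

# Toy usage: a single weight drifts past the validation optimum at w = 1.
model = {"w": 0.0}

def train_epoch(m):
    m["w"] += 0.1

def val_loss(m):
    return (m["w"] - 1.0) ** 2

best, loss, epoch = train_with_early_stopping(model, train_epoch, val_loss, patience=5)
```

In the toy run the live model keeps drifting after the optimum, while the returned snapshot stays at the weight with the lowest validation loss, which is the behaviour the restoration step is meant to guarantee.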