Accurate prediction of X-ray pulse properties from a free-electron laser using machine learning

Free-electron lasers providing ultra-short high-brightness pulses of X-ray radiation have great potential for a wide impact on science, and are a critical element for unravelling the structural dynamics of matter. To fully harness this potential, we must accurately know the X-ray properties: intensity, spectrum and temporal profile. Owing to the inherent fluctuations in free-electron lasers, this mandates a full characterization of the properties for each and every pulse. While diagnostics of these properties exist, they are often invasive and many cannot operate at a high repetition rate. Here, we present a technique for circumventing this limitation. Employing a machine learning strategy, we can accurately predict X-ray properties for every shot using only parameters that are easily recorded at a high repetition rate, by training a model on a small set of fully diagnosed pulses. This opens the door to fully realizing the promise of next-generation high-repetition-rate X-ray lasers.

where $E$ indicates the expected value with respect to the distribution of the data. Since a generic function approximator $f_a$, defined in terms of a set of parameters $a$, is normally used, the problem reduces to finding the parameters $a$ of the function that solve the optimization problem:
$$a = \arg\min_{a} E\left[ \lVert y - f_a(x) \rVert^2 \right]$$
In supervised learning for regression, the parameters of the model are optimized using training data consisting of $n_e$ training examples, with feature values $\{x_1, x_2, \ldots, x_{n_e}\}$ and corresponding target values $\{y_1, y_2, \ldots, y_{n_e}\}$. This allows the previous optimization problem to be approximated in terms of the training data as:
$$a = \arg\min_{a} J(a), \qquad J(a) = \frac{1}{n_e} \sum_{i=1}^{n_e} \lVert y_i - f_a(x_i) \rVert^2$$
where $J(a)$ is known as the cost function. Any parameters of $f$ that are not optimized in this process are known as hyperparameters.
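As a minimal sketch of how the cost function is evaluated on training data, the snippet below computes J(a) for a hypothetical one-parameter model f_a(x) = a·x and picks the best value of a from a small grid. The squared-error form of the cost, the function names and the toy data are assumptions made for illustration and are not taken from this work.

```python
def cost(a, xs, ys, f):
    """Mean squared error of model f with parameter a over the training set."""
    n_e = len(xs)
    return sum((y - f(a, x)) ** 2 for x, y in zip(xs, ys)) / n_e


def f_linear(a, x):
    """Hypothetical one-parameter model f_a(x) = a * x."""
    return a * x


# Toy training set generated by y = 2x (illustrative data only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 2.0, 4.0, 6.0]

# Scan a small grid of candidate parameters and keep the minimiser of J(a).
candidates = [0.5 * k for k in range(9)]  # 0.0, 0.5, ..., 4.0
best_a = min(candidates, key=lambda a: cost(a, xs, ys, f_linear))
```

A grid scan is used here only to keep the example short; the models discussed in this note rely on analytical solutions or gradient-based optimizers instead.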
Once the optimal parameters $a$ have been found, predictions for the targets can be made using:
$$y = f_a(x)$$
Different machine learning models allow $f$ to take different analytical forms. Depending on the form, the model may be able to represent a narrow or a wide set of functions. This is described qualitatively in terms of capacity: the more flexible a model is, and the more nonlinearities it can represent, the larger its capacity. Typically, models with low capacity tend to underfit the data, while models with large capacity tend to overfit the data.
In the case of the linear model, each of the $n_t$ targets is calculated as:
$$y_j = \sum_{i} a_{ij} x_i$$
where $x_i$ is the $i$-th input feature, the sum runs over all input features, and $a_{ij}$ are the parameters of the model. Finding the optimal values of the parameters is equivalent to the linear regression problem, which has an analytical solution.
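For reference, the analytical solution of the linear regression problem can be written in matrix form. The sketch below assumes a squared-error cost and introduces, for illustration only, a matrix $X$ whose rows are the training feature vectors, a matrix $Y$ whose rows are the corresponding targets, and a matrix $A$ with entries $a_{ij}$; the symbols $X$, $Y$ and $A$ do not appear in the original text.

```latex
J(A) = \frac{1}{n_e} \lVert Y - X A \rVert_F^2 ,
\qquad
\nabla_A J = -\frac{2}{n_e} X^{\top} (Y - X A) = 0
\;\Longrightarrow\;
A = \left( X^{\top} X \right)^{-1} X^{\top} Y
```

This expression holds when $X^{\top} X$ is invertible, which requires at least as many training examples as features and no exactly collinear features.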
The quadratic model is equivalent to the linear model, with the difference that instead of working directly with the input variables as features, auxiliary features are created by calculating all possible products across the input variables up to second order. For example, for an input vector
$$x = (x_1, x_2)$$
the following vector of auxiliary features would be used:
$$(x_1, \; x_2, \; x_1^2, \; x_1 x_2, \; x_2^2)$$
Similarly, cubic or quartic models can be defined; the higher the degree of the polynomial, the larger the capacity of the model.
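The construction of auxiliary features can be sketched as follows. The function name `polynomial_features` is hypothetical, and omitting a constant (order-zero) feature is an assumption of this example.

```python
from itertools import combinations_with_replacement


def polynomial_features(x, degree=2):
    """Return all products of the entries of x from order 1 up to `degree`."""
    features = []
    for order in range(1, degree + 1):
        # Each index tuple selects one monomial, e.g. (0, 1) -> x[0] * x[1].
        for idx in combinations_with_replacement(range(len(x)), order):
            product = 1.0
            for i in idx:
                product *= x[i]
            features.append(product)
    return features


# For x = (x1, x2) = (2, 3) the quadratic features are
# [x1, x2, x1^2, x1*x2, x2^2].
quadratic = polynomial_features([2.0, 3.0], degree=2)
cubic = polynomial_features([2.0, 3.0], degree=3)
```

The expanded vector is then passed to the linear model unchanged, so the fit remains a linear regression in the auxiliary features. The number of features grows quickly with the polynomial degree, which is one way to see why capacity increases.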
The support vector regressor (SVR) model (reference 2) attempts to find a solution for each of the $n_t$ targets calculated as:
$$y_j = \sum_{i=1}^{n_e} a_{ij} \, k(x_i, x)$$
where the summation is performed across all the examples in the training set, $a_{ij}$ are parameters of the model, and $k$ is a kernel function, in this case a Gaussian kernel defined as:
$$k(x_i, x) = \exp\left( -\gamma \lVert x_i - x \rVert^2 \right)$$
where $\gamma$ is a hyperparameter of the model. While this method would in principle need to store every single example in the training set to evaluate the function, in practice the training is performed in terms of two additional hyperparameters, $C$ and $\epsilon$, finding a solution for which $a_{ij}$ is zero for most of the training examples. The examples for which $a_{ij}$ is not zero are called the support vectors. For more information on the kernel functions, the training and the hyperparameters, see reference 2.
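To illustrate how a trained SVR is evaluated, the sketch below computes the kernel expansion over a set of support vectors for a single target. The support vectors, coefficients and value of γ are made-up numbers, no bias term is included, and the training step that determines the coefficients from C and ε is not shown.

```python
import math


def gaussian_kernel(x_i, x, gamma):
    """k(x_i, x) = exp(-gamma * ||x_i - x||^2)."""
    sq_dist = sum((u - v) ** 2 for u, v in zip(x_i, x))
    return math.exp(-gamma * sq_dist)


def svr_predict(x, support_vectors, coeffs, gamma):
    """Kernel expansion over the support vectors for a single target."""
    return sum(a * gaussian_kernel(sv, x, gamma)
               for sv, a in zip(support_vectors, coeffs))


# Hypothetical trained model: only two training examples have non-zero
# coefficients, so only these two need to be stored.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
coeffs = [1.5, -0.5]
gamma = 0.5

prediction = svr_predict([0.0, 0.0], support_vectors, coeffs, gamma)
```

Because training examples with a zero coefficient contribute nothing to the sum, the cost of a prediction scales with the number of support vectors rather than with the size of the training set.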
In the case of artificial neural networks (ANN; reference 3), the model calculates the output in a series of layers of the following kind:
$$x_{l+1} = \varphi\left( A_l^{\top} x_l + b_l \right)$$
where $x_l$ is the vector of variables at a given layer, $x_{l+1}$ is the vector of values at the next layer, $A_l$ is a matrix of parameters of size $n_l \times n_{l+1}$, where $n_l$ and $n_{l+1}$ are the number of variables in the current layer and the next layer, respectively, and $b_l$ is a vector of parameters of size $n_{l+1}$. The selection of the activation function $\varphi$, which breaks the linearity of the model, is treated as a hyperparameter. In this case we chose the widely used rectified linear activation (ReLU) function, applied element-wise:
$$\varphi(z) = \max(0, z)$$
The input to the first layer, $x_0$, corresponds to the input features $x$, and the output of the last layer, $x_{N_l}$, corresponds to the targets $y$, where $N_l$ is the number of layers in the ANN. The activation function is removed for the last layer so that the function can output any real number. Intermediate layers calculating the internal variables $x_l$ are known as hidden layers, and the number of variables used at each hidden layer is known as the number of hidden cells per layer. Both the number of hidden layers and the number of hidden cells per layer are hyperparameters of the model. Increasing these numbers increases the capacity of the model.
Interleaving linear multiplications with the rectified linear activation function gives an output that is linear in the inputs almost everywhere, except at the points where the argument of a rectified linear activation is exactly 0. As a result, the ANN can be seen as a sophisticated piecewise linear function formed of a number of linear regions that increases as the numbers of hidden layers and hidden cells increase (reference 4).
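The layer-by-layer evaluation described above can be sketched as a short forward pass. The network size and parameter values are hypothetical, and each matrix is stored as a list of rows with shape n_l × n_{l+1}, so that the layer computes the transpose of A_l applied to x_l, plus b_l.

```python
def relu(v):
    """Rectified linear activation applied element-wise."""
    return [max(0.0, z) for z in v]


def affine(A, b, x):
    """Compute A^T x + b, where A has shape n_l x n_{l+1} (list of rows)."""
    n_next = len(b)
    return [sum(A[i][j] * x[i] for i in range(len(x))) + b[j]
            for j in range(n_next)]


def forward(x, layers):
    """Apply every layer in turn; no activation after the last layer."""
    for k, (A, b) in enumerate(layers):
        x = affine(A, b, x)
        if k < len(layers) - 1:
            x = relu(x)
    return x


# Hypothetical network: 2 inputs -> 2 hidden cells -> 1 output.
layers = [
    ([[1.0, -1.0], [0.0, 2.0]], [0.0, -1.0]),  # A_0 (2 x 2), b_0
    ([[1.0], [1.0]], [0.5]),                   # A_1 (2 x 1), b_1
]
output = forward([1.0, 2.0], layers)
```

Leaving the last layer without an activation allows negative outputs, which a ReLU would otherwise clip to zero.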
Training a neural network consists of finding the set of matrices $A_l$ and vectors $b_l$ that solve the optimization problem for the training set. This is typically done with the iterative gradient descent technique, in which the derivative of the cost function with respect to each parameter of the model is calculated and used to update the parameter in the direction opposite to the derivative, decreasing the value of the cost function. In this work, we use a variation of this algorithm named AdaGrad (reference 5) for this purpose. For more information on ANNs, see reference 3.
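As a minimal sketch of an AdaGrad-style update, the snippet below fits a two-parameter linear model by stepping against the gradient, with each parameter's step divided by the square root of its accumulated squared gradients. The learning rate, step count, toy data and function names are illustrative assumptions and are not the settings used in this work.

```python
import math


def adagrad(grad, a0, learning_rate=0.5, n_steps=2000, eps=1e-8):
    """Minimise a cost function given its gradient, using AdaGrad updates."""
    a = list(a0)
    accum = [0.0] * len(a)  # running sum of squared gradients per parameter
    for _ in range(n_steps):
        g = grad(a)
        for i in range(len(a)):
            accum[i] += g[i] ** 2
            # Step against the gradient, scaled by the accumulated magnitude.
            a[i] -= learning_rate * g[i] / (math.sqrt(accum[i]) + eps)
    return a


# Toy data generated by y = 2*x + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def grad_mse(a):
    """Gradient of the mean squared error of the model a[0]*x + a[1]."""
    n = len(xs)
    residuals = [a[0] * x + a[1] - y for x, y in zip(xs, ys)]
    return [2.0 * sum(r * x for r, x in zip(residuals, xs)) / n,
            2.0 * sum(residuals) / n]


a_fit = adagrad(grad_mse, [0.0, 0.0])
```

The per-parameter scaling means that parameters with consistently large gradients take smaller steps, which reduces the need to tune a single global learning rate. For an ANN, the gradient would be obtained by backpropagation through the layers rather than from a closed-form expression.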

Supplementary Note 2. Details about the variables used for the prediction.
Fast shot-to-shot variables. We list here all of the fast shot-to-shot variable names used as features for prediction, currently measured at 120 Hz at the Linac Coherent Light Source (LCLS):
• ebeamCharge and ebeamDumpCharge: Electron beam charge measured at the accelerators and at the electron dump.
• ebeamEnergyBC1 and ebeamEnergyBC2: Electron beam energy measured at each of the two bunch compressors.
• ebeamPkCurrBC1 and ebeamPkCurrBC2: Electron beam peak current measured at each of the two bunch compressors.
• ebeamL3Energy: Electron beam energy measured after the third linear acceleration stage.
• ebeamLTUPosX and ebeamLTUPosY: Horizontal and vertical electron beam positions at the Linac to Undulator (LTU) transport line.
• ebeamLTUAngX and ebeamLTUAngY: Horizontal and vertical electron beam angles at the LTU transport line.
• ebeamLTU250 and ebeamLTU450: Electron beam position in two dispersive regions at the LTU transport line.
• ebeamUndPosX and ebeamUndPosY: Horizontal and vertical electron beam positions at the undulator.
• ebeamUndAngX and ebeamUndAngY: Horizontal and vertical electron beam angles at the undulator.
• f_11_ENRC and f_12_ENRC: Redundant X-ray total energy measurements before attenuation from two gas detectors.
• f_21_ENRC and f_22_ENRC: Redundant X-ray total energy measurements after attenuation from two gas detectors.
• f_63_ENRC and f_64_ENRC: Redundant X-ray total energy measurements corrected to be accurate for small signals (<0.5 mJ).
Slow EPICS variables. We list here typical slow environmental properties recorded as Experimental Physics and Industrial Control System (EPICS; reference 6) variables measured at 2 Hz at LCLS:
• Positions of translation stages involved in the control feedback loops.
• Voltages of power supplies involved in the control feedback loops.
• Strength of magnetic fields in the magnetic chicanes, and bending magnets.
• Nominal values for the amplitude and phases of the radiofrequency fields.
• Pressures from the vacuum systems.
• Temperatures at different stages.
• Calibration values inputted manually by operators.
• Status of beam blockers.

After pre-processing and normalization, the input dataset is divided into three groups: the training set, the validation set and the test set. Different models with different sets of hyperparameters are trained on the training set and used to predict the targets for the validation set, which gives the validation error. Once the set of hyperparameters that yields the smallest validation error is found, the final error of the model is obtained by making predictions on the test set, which is kept isolated during the previous stages. In the figure, datasets are shown in light brown, features in orange, targets in blue, models in purple and calculated prediction errors in red.
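The training, validation and test procedure described above can be sketched as follows. The model here is a deliberately simple stand-in (a Gaussian-weighted average of the training targets with a single hyperparameter γ), not one of the models used in this work, and the split fractions, error metric and synthetic data are assumptions chosen for illustration.

```python
import math
import random


def split(data, frac_train=0.6, frac_val=0.2, seed=0):
    """Shuffle and divide the examples into training, validation and test sets."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_train = int(frac_train * len(shuffled))
    n_val = int(frac_val * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def predict(train, x, gamma):
    """Stand-in model: Gaussian-weighted average of the training targets."""
    weights = [math.exp(-gamma * (xi - x) ** 2) for xi, _ in train]
    return sum(w * yi for w, (_, yi) in zip(weights, train)) / sum(weights)


def mean_abs_error(train, examples, gamma):
    """Mean absolute prediction error over a set of (feature, target) pairs."""
    return sum(abs(predict(train, x, gamma) - y)
               for x, y in examples) / len(examples)


# Synthetic one-feature dataset (illustrative only).
rng = random.Random(1)
data = [(x, math.sin(x) + rng.gauss(0.0, 0.05))
        for x in (rng.uniform(0.0, 6.0) for _ in range(200))]

train, val, test = split(data)

# The hyperparameter search uses only the training and validation sets.
gammas = [0.01, 0.1, 1.0, 10.0, 100.0]
val_errors = {g: mean_abs_error(train, val, g) for g in gammas}
best_gamma = min(val_errors, key=val_errors.get)

# The test set is used once, to report the final error.
test_error = mean_abs_error(train, test, best_gamma)
```

Keeping the test set out of the hyperparameter search means that the reported error is not biased by the selection step.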