Efficient prediction of attosecond two-colour pulses from an X-ray free-electron laser with machine learning

X-ray free-electron lasers are sources of coherent, high-intensity X-rays with numerous applications in ultra-fast measurements and dynamic structural imaging. Due to the stochastic nature of the self-amplified spontaneous emission process and the difficulty in controlling injection of electrons, output pulses exhibit significant noise and limited temporal coherence. Standard techniques for characterizing two-colour X-ray pulses are challenging to apply, as they are either invasive or diagnostically expensive. In this work, we employ machine learning methods such as neural networks and decision trees to predict the central photon energies of pairs of attosecond fundamental and second harmonic pulses using parameters that are easily recorded on a single-shot basis at high repetition rates. Using real experimental data, we apply a detailed feature analysis to the input parameters while optimizing the training time of the machine learning methods. Our predictive models are able to make predictions of central photon energy for one of the pulses without measuring the other pulse, thereby leveraging the use of the spectrometer without having to extend its detection window. We anticipate applications in X-ray spectroscopy using XFELs, such as in time-resolved X-ray absorption and photoemission spectroscopy, where improved measurement of input spectra will lead to better experimental outcomes.


Introduction
In recent years, X-ray free-electron lasers (XFELs) [1][2][3] have emerged as a versatile tool for research, with applications ranging from damage-free dynamic imaging of molecules [4] and proteins [5][6][7] to new spectroscopic methods for quantum chemistry [8,9] and resonant X-ray spectroscopy of nanostructures in condensed matter [10,11]. The versatility of XFELs is based on their tunability, brightness and very short pulse durations, which make the tracking of ultra-fast dynamics of electrons in matter feasible.
XFEL sources generate X-ray pulses by accelerating electron bunches to relativistic speeds in a linear accelerator of radiofrequency (RF) cavities and allowing them to interact with magnetic fields generated by an undulator [1][2][3], see Fig. 1. An XFEL can emit coherent or partially coherent radiation because of a favourable self-organization of the electrons in a relativistic beam as it passes through an appropriately tuned undulator. Different configurations are chosen that lead to the modulation of the phase space for the electron bunch and lasing. This can be used to generate pulses with different properties. Using an additional pre-modulation of the electron beam energy in a short wiggler section, followed by phase space manipulation to transfer the energy into a very short duration of high electron current, leads to so-called enhanced SASE, which results in sub-femtosecond pulses of the kind studied here [12]. SASE and enhanced SASE pulses are important tools in ultrafast science [13], where dynamics can be resolved using pump-probe configurations with synchronization to infra-red or optical laser fields [6,14] or by using two-pulse XFEL modes [15][16][17]. Despite the versatility of XFELs in creating two-colour pulses in the femtosecond regime [18], shot-to-shot variation of the pulse properties is significant; for example, photon energy fluctuations of more than 1% of the mean, pulse energy fluctuations of up to 100% of the mean and bandwidth fluctuations of more than 20% of the mean are common in existing machines.
Multiple factors contribute to the instability of output X-ray properties. The working principle of XFEL machines relies on SASE, which is inherently a stochastic process, with amplification seeded by broadband emission from noise in the distribution of electrons in the bunch [19]. In the case of traditional SASE operation, there are several temporal spikes within the width of the pulse that are not coherent with each other and are amplified, producing only partial longitudinal coherence across the XFEL pulse. This is compounded by fluctuations in the RF amplitudes or RF phases, which can translate to variation of the spatial and energy distribution of the electrons within a bunch.
Techniques like XFEL seeding and optical active stabilization may improve stability, but the issue of temporal fluctuations is still relevant at the few-femtosecond level. Alternatively, one can circumvent issues of unstable pulse properties by performing a full X-ray characterization for each XFEL shot. However, single-shot characterization of XFEL pulses requires higher-dimensional inputs, such as the X-ray spectrum, which are obtained in a data-expensive manner, e.g. using an X-ray spectrometer with a CCD image readout. In addition to the slow and invasive diagnostics, the processing of large volumes of image data, given inevitable limits to computational power and data transfer rates, restricts the rate of characterization [20][21][22]. Diagnostics in current machines operate at kHz repetition rates, and technological advances in high speed diagnostics must be accompanied by increased efficiency to reduce complexity and cost. An interesting solution to the issue of slow characterization of XFEL pulses was suggested in [23], where machine learning techniques were used to make accurate predictions of XFEL properties using data collected solely from fast diagnostics. The key concept relies on exploiting the correlation of various XFEL properties, such as photon energy and spectral shape of the X-ray pulses, with data that can be acquired at a higher repetition rate, such as electron beam properties. Since the detailed modelling of every experimental aspect that determines this correlation is currently out of reach, machine learning methods can prove to be extremely useful in this context, as further illustrated in [24]. Whilst the quantum fluctuations associated with SASE will not, in principle, be amenable to machine learning, the complex interplay of the other fluctuating parameters gives some hope that machine learning strategies can be applied to predict the X-ray parameters with improved fidelity.
In this work, we use techniques of supervised learning to make efficient, high-fidelity predictions of central photon energies for attosecond fundamental and harmonic pulses, in a manner that can be applied to any XFEL facility. Enhanced SASE is realised by manipulation of the electron bunch spikes from the photoinjector, with the undulator split into two sections for radiation at the ω and 2ω frequencies [25]. We use two different approaches of supervised learning, namely artificial neural networks (ANNs) and gradient boosted decision trees (GB), for our predictions. While the former consists of multiple layers of inter-connected nodes (artificial neurons), the latter consists of an ensemble of decision trees with better performance and lower overfitting than simple decision trees. By applying feature selection analysis, we reduce the dimensionality of the entire input space to the most relevant features. This leads to a simpler neural network architecture and optimal decision trees that make accurate predictions for real experimental data while enhancing the training efficiency when compared to [23]. Moreover, despite XFEL beamlines being typically designed with the flexibility to allow for different experimental configurations (targets, diagnostics, etc.), at current facility beamlines it is not usually possible to measure the X-ray spectrum both before and after a sample. Many experiments are also unable to measure multiple pulses simultaneously, due to the limited spectral range of available spectrometers. One of the key results of our work is the intriguing possibility of using machine learning methods to predict the photon energy for the second harmonic pulse without relying on measurements of the fundamental pulse. Thus our methods offer a more pragmatic approach to maximising useful information from available resources whilst adding little experimental overhead.

Building the prediction model
A prediction model mathematically connects the output variables to the input parameters. This mathematical function is often non-trivial, especially for noisy experiments, which exhibit large variance of the affected parameters and variables. This leads to difficulties in discriminating between noise and signal, while further establishing an upper bound on the quality of predictions we can achieve. Naturally, the quality of the model is benchmarked by its ability to make successful predictions for future measurements.
Figure 1 illustrates machine learning of the prediction model. The objective is to predict the pulse characterization y from the diagnostics x. There are three main stages to building the theoretical prediction model. The first step is to perform pre-processing on the raw experimental data, which mainly involves filtering and normalizing the data. Here, filtering implies removing outlier events, such as events that correspond to low variance or to measurements that were not properly recorded. The next step is to randomly split the pre-processed data into three different data sets: 70% of the data set is used for training to fit different models, 15% is used for testing and another 15% for validation. The models chosen for this work are artificial neural networks (ANNs) [26] and gradient boosting (GB) [27]. We train, validate and benchmark the performance of the prediction models on the test set. Later in this work, the performance of the machine learning methods is compared with a simpler model, namely a linear regression model [28]. The final step is to optimize the prediction model in terms of its training cycle period. For this, it is important to identify the most relevant input features that contribute to the prediction of the output.
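The 70/15/15 split described above can be sketched as follows; the function and variable names, and the fixed seed, are our own illustration rather than the experiment's analysis code:

```python
import numpy as np

def split_dataset(X, y, seed=0):
    """Randomly split events into 70% training, 15% test, 15% validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))                    # shuffle event indices
    n_train = int(0.70 * len(X))
    n_test = int(0.15 * len(X))
    train, test, val = np.split(idx, [n_train, n_train + n_test])
    return (X[train], y[train]), (X[test], y[test]), (X[val], y[val])
```

The training set is used to fit the model parameters, the validation set to select hyperparameters, and the held-out test set only for the final benchmark.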

Reducing the dimensionality of input space
The goal is to identify the most relevant set of input features, which in this case are the XFEL electron beam properties, by assessing their importance in the prediction of the output. Typically, a few hundred parameters are recorded for each event, including measurements of the electron beam properties, basic photon diagnostics (such as gas detectors for the pulse energy) and a large number of other environmental variables. Many of the environmental features are collected at a reduced rate of 1 Hz and therefore only capture slower fluctuations. This is done to reduce data flow rates, as these variables are generally uninformative at high repetition rates but could, in principle, be measured on every shot. Many of these parameters, such as environmental variables, are empirically known to be disconnected from the XFEL operation and thus have no predictive value. These are systematically removed to reduce the total number of input features for an event from hundreds to N ≃ 80. Focusing on the remaining features, especially those with large fluctuations, it is a priori unclear whether they are expected to have predictive value. For such instances, it is useful to perform a thorough statistical analysis on the remaining features and rank them in order of their relevance using the permutation feature importance function [29]. Before describing the importance function, we define the input matrix, denoted by $\tilde{\mathbf{x}}$, whose dimensions are S (total number of events) × N (input features for each event). Throughout this work, the tilde will be used to indicate that the data have been normalized to zero mean and unit standard deviation. Thus, for the i-th event, the row vector of N input features is $\tilde{\mathbf{x}}_i = (\tilde{x}_i^1, \tilde{x}_i^2, \ldots, \tilde{x}_i^N)$, while for the j-th input feature, the column vector over the S events is $\tilde{\mathbf{x}}^j = (\tilde{x}_1^j, \tilde{x}_2^j, \ldots, \tilde{x}_S^j)^T$. The mean absolute error calculated over S events is given as

$$ M(\tilde{\mathbf{x}}, N) = \frac{1}{S} \sum_{i=1}^{S} \big| \tilde{Y}_i - f(\tilde{\mathbf{x}}_i, N) \big|, $$

where $\tilde{Y}_i$ denotes the output for the i-th event and $f(\tilde{\mathbf{x}}_i, N)$ is the estimator for the output observable generated using the input vector $\tilde{\mathbf{x}}_i$. The relevance of the j-th input feature is quantified using the normalized permutation feature importance function [29], denoted here by $I_j$. It measures the increase in the mean absolute error when the j-th input feature is randomly replaced by an incorrect one and is defined as follows,

$$ I_j \propto \frac{1}{R} \sum_{r=1}^{R} \Big[ M\big(\mathbf{p}^r(j), N\big) - M(\tilde{\mathbf{x}}, N) \Big], \qquad \sum_{j=1}^{N} I_j = 1, $$

where $\mathbf{p}^r(j) = (\mathbf{p}_1^r(j), \mathbf{p}_2^r(j), \ldots, \mathbf{p}_S^r(j))^T$ is the matrix obtained by applying the r-th permutation to the j-th input feature. Its individual row vectors are denoted as $\mathbf{p}_i^r(j) = (p_i^{1,r}(j), p_i^{2,r}(j), \ldots, p_i^{N,r}(j))$. In these vectors, only the j-th input feature is replaced, using a permutation operator $\Pi_r$, which gives the elements

$$ p_i^{k,r}(j) = \begin{cases} [\Pi_r(\tilde{\mathbf{x}}^j)]_i, & k = j, \\ \tilde{x}_i^k, & k \neq j. \end{cases} $$

Here $\Pi_r(\tilde{\mathbf{x}}^j)$ gives the r-th permutation from a series of random permutations applied to the column vector $\tilde{\mathbf{x}}^j$, and the i-th value of the resultant vector is given by the element $[\Pi_r(\tilde{\mathbf{x}}^j)]_i$. All other column vectors $\tilde{\mathbf{x}}^{k \neq j}$ remain unaltered.
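A direct NumPy implementation of permutation feature importance might look as follows. This is a sketch: the final normalization so that the importances sum to one is our assumption about the "normalized" importance of [29], and the names are illustrative:

```python
import numpy as np

def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean increase in MAE when feature j is randomly permuted.

    `model` is any fitted estimator exposing a .predict(X) method.
    Importances are normalized to sum to one (our convention).
    """
    rng = np.random.default_rng(seed)
    base_mae = np.mean(np.abs(y - model.predict(X)))
    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])   # scramble only column j
            increases.append(np.mean(np.abs(y - model.predict(Xp))) - base_mae)
        importance[j] = np.mean(increases)
    return importance / importance.sum()
```

Any of the fitted estimators discussed in this work (ANN, GB or LIN) can be passed as `model`.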
Figure 2(a) is a plot which ranks the input features using the permutation feature importance $I_j$ while predicting the central photon energy of the second pulse E 2 with an ANN using only non-pulse measurement data. The relevance of a particular input feature is ranked with descending values of j, and the plot of the mean absolute error (M(j) = M(x̃, j)) reaches its lowest value for the top ten relevant features, most of which are related to the electron beam properties. A listing with descriptions of the ten most important features is given in Table 1. Adding further features leads to over-fitting, as is seen from the rise in M(j) for higher j values.
Figure 2(b) shows a scatter plot which compares the measured values of the central photon energy of the pulse E 2 with the predicted values estimated by the ANN. The predictions obtained with GB match these both quantitatively and qualitatively, as illustrated for a range of data in the supplemental material. For a perfect predictor, the points would all lie exactly along the diagonal, with deviations from this distribution indicating reduced accuracy of prediction. The blue and red scatter points correspond to the full input space (M = N = 87) and the reduced input space (M = 10), respectively. The main deviations in this prediction are shared between both full and reduced input spaces, and are visible as the weak, nearly uncorrelated background and the deviation of the predictions far from the mean energy. The former is likely due to the highly stochastic nature of SASE, while the latter is indicative of low estimator confidence leading to more conservative estimates closer to the mean. These error signatures are nearly identical for both full and reduced spaces, and the overall quality of predictions was identical, with a mean absolute error of 2.48 eV. Thus, we can perform training of simpler estimators with smaller architectures by using the reduced input space, without compromising the quality of predictions. By including only the most relevant features, we introduce a feature-restricted mean absolute error M = M(x̃, 10), which will be used to estimate the performance of predictor models for the rest of this work. To further allow for comparability between different prediction targets, we will proceed by normalizing the mean absolute error M with respect to the standard deviation σ of the target data. Using this notation, the results seen in Fig. 2(b) are equivalent to M = 0.54σ. Whilst the accuracy of the predictions is modest, this was achieved without additional probes of electron and X-ray properties beyond those already in use at LCLS. The methods employed in this work can therefore be used generally for the prediction of beam properties. For example, the results shown in Figs. (S1) and (S2) of the Supplemental Material were obtained using a completely different experimental setup [23] and provide M ∼ 0.2σ and M ∼ 0.3σ for the prediction of the time delay and central energies respectively. The data for Figs. (S1) and (S2) indicate that the input-output correlation in the data for the time delay parameter between the pulses is much higher than that for the central photon energies of the pulses. The difference in performance between predictions for these two experiments indicates that the limiting factors are the specifics of the experimental setup and inherent noise, rather than the machine learning method itself. These limitations likely manifest themselves in low correlation between input features and labels, and in errors in the ground truth of measurements, respectively.

Independent prediction of a single pulse
Figure 3 focuses on predicting the central photon energy of Pulse 2 (E 2) using two different detection schemes for the experiment. One setting corresponds to a configuration of the spectrometer which detects both pulses simultaneously (depicted with blue lines) and uses the energy of Pulse 1 (E 1) as an input feature, while the other measures only the second pulse (depicted with dashed lines). The green dashed line is the prediction of the central photon energy for Pulse 2 with the ANN, while the magenta dashed line is with the LIN model. These predictions are made with experimental data where different numbers of undulators were used between the pulses. Although both LIN and ANN models make accurate predictions of E 2 without the spectral information of Pulse 1, we find that the accuracy of their predictions depends on the number of undulators between the pulses. Predictions of E 2 improve with an increasing number of undulators used for generating the second pulse. One plausible explanation for this is that, as each additional undulator provides amplification to Pulse 2, the accuracy of central photon energy estimation (the ground truth of our prediction models) improves. Alternatively, this can be understood by considering the interaction time each pulse shares with the electron beam. The first pulse only shares a short interaction with the beam, so it may not correlate with properties of the entire beam, but rather only with a part of it. The second pulse is seeded by the first, which leads to high correlation between the two and similarly poor predictability of the second pulse for low undulator counts. However, as the number of undulators for the second pulse goes up, its interaction time with the beam increases, and this may explain the improved predictions using overall beam properties.
It is further worth noting that although Pulse 2 is a harmonic of Pulse 1 and is generated from the same electron bunch, the spectrometer was optimized for Pulse 2, and thus the accuracy in determining E 2 for both the training and test data is improved when compared to the setup where both pulses were measured. Often in experiments, measurement of the energy spectrum of both pulses simultaneously is not possible due to the limited spectral range of the spectrometer. Furthermore, it may only be possible to measure photon spectra after transmission through target samples, which in many settings alters the spectrum, e.g. due to absorption. This result, which allows for prediction of the photon energy without input from the spectrometer except during training, adds directly to the capabilities of current XFEL experiments, allowing important information about the incoming pulses to be extracted within typical experimental constraints.

Discussion
Conventional X-ray spectrometers involve high volumes of data and are still too slow for future XFEL experiments (which will run at MHz repetition rates), and proposed high-data-rate schemes using photo-electron spectrometers [30,31] would add significantly to experimental cost and complexity. Another issue is the limited spectral range of the available spectrometers. In both cases, machine learning methods can be advantageous, as demonstrated in this work.

Figure 3. The plot shows the precision with which the central photon energy of the second pulse (E 2) can be predicted as a function of the number of undulators between the pulses for two different detection scenarios: one where both pulses are measured simultaneously, and one where only Pulse 2 is measured. Using the former data, an ANN is used to predict E 2 (blue solid line with circles), while using the latter data, both LIN (magenta dashed line with circles) and ANN (green dashed line with triangles) models were used to predict E 2.

Although there have been prior
works relying on the concept of using data from the photon spectrometer to train the neural networks, our work suggests that gradient boosting methods are orders of magnitude more efficient than neural networks in making spectroscopic predictions while giving comparable accuracy. It is well established that there is a strong dependence of the properties of two-colour pulses on the electron beam parameters. Although most of the environmental variables are usually not relevant, it could be that certain environmental parameters specific to a given facility beamline play a crucial role in making more accurate predictions. One of the challenges in pre-processing the data used for predictor models is to separate the relevant features from the redundant ones. In this work, the dimension of the input parameter space was drastically reduced using the feature selection analysis, without having to compromise the prediction results. However, the data collected in the experiment were not tailored to machine learning, and the electron beam and photon properties recorded were only incidentally of use for predictions; in future experiments, collection of more relevant electron beam properties may allow for improved prediction accuracy.

A: Experiment details for attosecond two-colour pulses
In our experiment, data with two pulses at different energies were obtained from a configuration similar to [12], utilizing an enhanced SASE mode. The phases between SASE-emitting microbunches are not predetermined and, as a result, the temporal properties are difficult to predict from purely spectral measurements. The photon energy of the emission is determined by the period of the undulators, the energy of the electron bunch and the position of the SASE emission within the bunch. The spatial and energetic distribution of electrons within the bunch varies on a shot-to-shot basis due to fluctuations in the electron accelerator. In the two-colour mode, a second set of undulators was used to produce a second pulse (see Fig. 1), either at the second or third harmonic of the first, with the emission from the first pulse seeding the second.
Separation of the X-rays from the electrons due to a difference in their group velocities, i.e. slippage, was used to create a time delay between the pulses, for use in a separate pump-probe experiment. With more undulators in the second section, the slippage is larger at the centre of mass of the second pulse generation, so the delay is greater. Both pulses are estimated to have temporal durations below 500 attoseconds [12]. Pulses were generated at 120 Hz with photon energies of approximately 250 eV, using either the second or third harmonic at 500 and 750 eV respectively, with 2-10 eV FWHM bandwidth and up to 50 µJ energy in each pulse.

Data Filtering
A typical experimental data set will contain many events, which are labelled by i ∈ 1, 2, ..., S, where S = 35000-40000. After filtering, the total number of events in each data set reduces to S = 16000-32000 (varying between the different data sets) that can actually be used for building the predictive model. For each event, we typically have around 300 recorded input features that are collected during the experiment. These include environmental variables such as current and voltage measures for different XFEL machines, total pulse energies as measured by gas monitor detectors, as well as electron beam properties at the dump, which include electron beam charge and energy. We remove from this set of features any that take fewer than 10 distinct values across the full data set. Furthermore, we eliminate any features that are nearly perfectly correlated (correlation coefficient above 0.995). The combination of these two methods brings our overall feature count down to around 80 (depending on the individual data set). Based on the statistical dispersion of the data, we also remove outlier events, which can negatively impact the prediction results. Thus, any event containing a feature value that deviates from that feature's median by more than four median absolute deviations is removed. Finally, we impose a lower limit of 5 µJ on the total pulse energy as measured by the gas monitor detectors.
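The filtering steps above might be sketched as follows. This is our own illustrative implementation, not the original analysis code; `mad_threshold=4.0` mirrors the four-deviation cut described in the text:

```python
import numpy as np

def filter_events(X, mad_threshold=4.0):
    """Drop outlier events: any event with a feature more than
    `mad_threshold` median absolute deviations from that feature's median."""
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    mad[mad == 0] = 1e-12                         # guard constant columns
    keep = (np.abs(X - med) / mad <= mad_threshold).all(axis=1)
    return X[keep], keep

def filter_features(X, min_distinct=10, corr_threshold=0.995):
    """Drop features with too few distinct values or near-perfect correlation."""
    distinct_ok = np.array([len(np.unique(X[:, j])) >= min_distinct
                            for j in range(X.shape[1])])
    X = X[:, distinct_ok]
    corr = np.corrcoef(X, rowvar=False)
    keep = np.ones(X.shape[1], dtype=bool)
    for j in range(X.shape[1]):
        for k in range(j + 1, X.shape[1]):
            if keep[k] and abs(corr[j, k]) > corr_threshold:
                keep[k] = False                   # drop the later duplicate
    return X[:, keep]
```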

Normalisation of data
Let the vector of input features for event i be denoted by $\mathbf{x}_i = (x_i^1, x_i^2, \ldots, x_i^N)$, where N is the total number of recorded features, and let the output for this event be denoted by $Y_i$. Then the normalised input and output data are given as

$$ \tilde{x}_i^{j} = \frac{x_i^{j} - \mu_{x^j}}{\sigma_{x^j}}, \qquad \tilde{Y}_i = \frac{Y_i - \mu_Y}{\sigma_Y}. $$

Here $\mathbf{x}^j = (x_1^j, x_2^j, \ldots, x_S^j)^T$ is a vector consisting of the j-th input variable from every event and $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_S)^T$ is the output vector. Additionally, µ and σ respectively correspond to the mean and standard deviation of the subscripted data column across all events.

C: Key to the top ten input parameters for Fig. 2(a)

Table 1. Input feature ranking by permutation feature importance for pulse energy data.

- (feature name missing): The measured vertical gap between the undulator magnet arrays in one undulator, which is tuned to adjust the K-factor of the undulator (1 Hz)
- epics_UND_36_gap_act: The measured vertical gap between the undulator magnet arrays in another undulator (1 Hz)
- xgmd_energy: The pulse energy measured after all attenuation using the total ionisation of an N2-filled cell (X-ray gas monitor detector) (120 Hz)
- epics_UND_34_gap_des: The desired vertical gap between the undulator magnet arrays in the first undulator (1 Hz)
- ebeam_ebeamPkCurrBC2: The peak electron bunch current measured in the second bunch compressor (120 Hz)
- epics_GMD_ElectronMesh: The voltage applied to an electron mesh in the gas monitor detector (GMD), to extract the electrons towards the electrode (1 Hz)
- gmd_energy: The pulse energy measured after all attenuation using the total ionisation of a Kr-filled cell (X-ray gas monitor detector) (120 Hz)
- epics_UND_28_gap_act: The measured vertical gap between the undulator magnet arrays in a third undulator (1 Hz)
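The normalisation to zero mean and unit standard deviation described above can be sketched as (illustrative names):

```python
import numpy as np

def normalise(X, Y):
    """Standardise each input feature column and the output vector
    to zero mean and unit standard deviation across all events."""
    Xn = (X - X.mean(axis=0)) / X.std(axis=0)
    Yn = (Y - Y.mean()) / Y.std()
    return Xn, Yn
```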

Linear modeling
A linear regression model (LIN) fits a general linear function

$$ \bar{Y}_i^{(\mathrm{LIN})} = \mathbf{c} \cdot \tilde{\mathbf{x}}_i + c_0 $$

across S events. The parameters $\mathbf{c}$, $c_0$ are varied to minimize the residuals-squared, given by

$$ \sum_{i=1}^{S} \big( \tilde{Y}_i - \mathbf{c} \cdot \tilde{\mathbf{x}}_i - c_0 \big)^2 . $$

Here, $\mu_{\tilde{Y}}$ is the mean of the normalized labels $\tilde{Y}$, such that $\mu_{\tilde{Y}} \equiv 0$. We then use the mean absolute error M to calculate the model performance. While linear regression methods can be very useful and simple to implement, they naturally fail with data that are highly non-linear. Since the generation of XFEL pulses is a highly non-linear process, it is helpful to use this method to get a sense of the level of non-linearity in the data set.
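A least-squares fit of this linear model on the normalised data can be sketched as follows (function names are our own illustration):

```python
import numpy as np

def fit_linear(Xn, Yn):
    """Least-squares fit of Yn ~ Xn @ c + c0 on normalised data."""
    A = np.hstack([Xn, np.ones((len(Xn), 1))])    # append intercept column
    coef, *_ = np.linalg.lstsq(A, Yn, rcond=None)
    return coef[:-1], coef[-1]                    # (c, c0)

def mae(y_true, y_pred):
    """Mean absolute error M used to benchmark all models."""
    return np.mean(np.abs(y_true - y_pred))
```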

Gradient boosting decision trees
Decision tree learning is a supervised machine learning approach often used for classification or regression problems. A decision tree is built by splitting the root node (which is at the apex) into subsets, and this process of splitting continues for each subset recursively until further splitting no longer improves the predictions. The rules for splitting a node are determined by the classification features. Gradient boosted decision trees are an ensemble learning method where, rather than using a single decision tree to make predictions, multiple decision trees are combined to enhance the model's accuracy. The basic premise of boosting is to iteratively combine weak "learners" into a single strong learner. The success of the boosting scheme is evaluated by defining a suitable loss function that is minimized using a gradient descent scheme.
In our case, the full set of events S forms the root node, which is subsequently split into subsets $S_i$ that are distinguished based on the values of different categorical or numerical features. We partition the input space into D regions d ∈ 1, ..., D, where we split the data using the indicator

$$ z_d(\tilde{\mathbf{x}}_i) = \begin{cases} 1, & \tilde{\mathbf{x}}_i \in \text{region } d, \\ 0, & \text{otherwise}. \end{cases} $$

By predicting a constant value $h_d$ across each of these regions, we can define the output of the decision tree as

$$ \bar{y}_t(\tilde{\mathbf{x}}_i; N) = \sum_{d=1}^{D} h_d \, z_d(\tilde{\mathbf{x}}_i). $$

Here, $h_d$ is the average of the target across all points within the region d, and is used as the model output for all points where $z_d = 1$. The predictions of an individual decision tree are generally heavily biased, and thus ensemble methods are often used. Apart from random forests, which use independent decision tree predictors, gradient boosting (GB) is another commonly used method, where trees are added to the estimator successively and fitted to the pseudo-residuals of all the previous trees' predictions. A gradient boosting regressor [27] is an ensemble method that gives an estimate $\bar{Y}_i^{(\mathrm{GB})}$ from the weighted sum of estimates given by T base regressors $\bar{y}_t(\tilde{\mathbf{x}}_i; N)$, written as

$$ \bar{Y}_i^{(\mathrm{GB})} = \sum_{t=1}^{T} \gamma_t \, \bar{y}_t(\tilde{\mathbf{x}}_i; N), $$

where we used the decision trees to define our base estimator. The gradient boosting regressor is then constructed iteratively under consideration of a differentiable loss function $L(\tilde{Y}_i, \bar{Y}_i)$. We begin by considering a constant average estimate $\bar{Y}_{i,0}^{(\mathrm{GB})} = \mu_{\tilde{Y}} = 0$, where the subscript 0 indicates that no estimators have been added yet. We then iterate over t ∈ 1, ..., T and at each step perform the following:

1. For each i, find the pseudo-residuals given by

$$ r_{i,t} = -\left[ \frac{\partial L(\tilde{Y}_i, \bar{Y}_i)}{\partial \bar{Y}_i} \right]_{\bar{Y}_i = \bar{Y}_{i,t-1}}. $$

2. Fit a decision tree estimator $\bar{y}_t(\tilde{\mathbf{x}}_i; N)$ to the set of pseudo-residuals.

3. Find $\gamma_t$ to minimize L for the new set of estimates $\bar{Y}_{i,t} = \bar{Y}_{i,t-1} + \gamma_t \, \bar{y}_t(\tilde{\mathbf{x}}_i)$.
After adding T base estimators in this manner, we have our fully fitted estimator $\bar{Y}_i^{(\mathrm{GB})} = \bar{Y}_{i,T}$. This approach has the advantage of focusing on regions of bad prediction and improving them. While many tree parameters are fit in the algorithm, others are hyperparameters that have to be specified a priori, such as the number of trees, the number of decisions per tree, the use of regularization and the number of data points to consider for each decision. Often the intuitive interpretation of the regressor obtained from decision trees can be lost when using an ensemble of decision trees. We found an estimator with 20 trees, no specified depth limit and l2 regularization to yield the best results, with only minor overfitting, as seen in Figure 4. To evaluate the performance of the gradient boosting estimator, we evaluated the mean absolute error across the test set and compared it to the performance of the ANN and the linear model.

Figure 4. (a) Predicting the central photon energy of the pulse using [12] and (b) time delay prediction using [23].
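As an illustration of the boosting loop described above, here is a minimal sketch that uses decision stumps (single-split trees) as base estimators with a squared loss, for which the pseudo-residuals are simply the current residuals. Note the simplifications: the actual estimator used 20 full-depth trees, l2 regularization and a fitted weight γ_t, whereas here a fixed learning rate stands in for γ_t:

```python
import numpy as np

def fit_stump(X, r):
    """Best single-split tree (stump) minimising squared error on residuals r."""
    best = (np.inf, 0, 0.0, r.mean(), r.mean())
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:       # candidate split points
            left = X[:, j] <= thr
            hl, hr = r[left].mean(), r[~left].mean()
            err = np.sum((r[left] - hl) ** 2) + np.sum((r[~left] - hr) ** 2)
            if err < best[0]:
                best = (err, j, thr, hl, hr)
    return best[1:]                               # (feature, threshold, h_left, h_right)

def gb_fit_predict(X, y, T=20, lr=0.5):
    """Minimal gradient boosting: each stump is fitted to the residuals of the
    running estimate (the pseudo-residuals for squared loss)."""
    pred = np.full(len(y), y.mean())              # constant initial estimate
    stumps = []
    for _ in range(T):
        j, thr, hl, hr = fit_stump(X, y - pred)   # fit to pseudo-residuals
        pred += lr * np.where(X[:, j] <= thr, hl, hr)
        stumps.append((j, thr, hl, hr))
    return pred, stumps
```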

Neural networks
Artificial Neural Networks (ANNs) are one of the most widely used modern machine learning techniques and have been very successful in making predictions for various physical systems. In this work, we use feed-forward neural networks, as we are performing supervised learning on a set of independent data points. Conceptually, a neural network can be represented by a graph, with values and biases associated with each node (or neuron) and weights associated with each edge. We group the nodes into layers, and allow edges only between nodes of neighbouring layers. The data propagate through this network layer by layer in one direction (feed-forward) only. The overall architecture of the neural network is defined by the hyperparameters, which include the number of neurons in each layer, the number of layers and the choice of activation function applied to the outputs of different nodes. Regularization schemes and the choice of optimizer constitute further hyperparameters, while biases b and weights W are parameters fit using the backpropagation algorithm. The last layer must have the same size as the number of prediction labels in the data, 1 in our case. For each of the L + 1 layers, labelled by l ∈ 0, ..., L, we define the node activation by a vector $\mathbf{v}_l$, the node bias by a vector $\mathbf{b}_l$, the edge weights for edges between layers l and l + 1 by a matrix $W_l$ and the differentiable activation function for each node in the layer as $a_l$. We then perform forward propagation of the data for event i by setting $\mathbf{v}_0^i = \tilde{\mathbf{x}}_i$ and propagating the data using

$$ \mathbf{v}_{l+1}^i = a_l\big( W_l \, \mathbf{v}_l^i + \mathbf{b}_l \big), $$

and use $\bar{Y}_i^{(\mathrm{ANN})} = \mathbf{v}_L^i$ as our estimate of $\tilde{Y}_i$. The crucial task is then to train the estimator by finding $W_l$ and $\mathbf{b}_l$ such that our loss, chosen as M, is minimized. We initialize these parameters randomly, and then perform backpropagation with gradient descent, implemented through the Adagrad algorithm [32]. We used Bayesian optimization to find the optimal neural network architecture, activation functions, regularization and drop-out. This technique uses Bayesian inference to guess combinations of hyperparameters that yield the best predictions for the smallest computational cost. We find that the optimal network sufficient to make accurate predictions for both the two-pulse delay and the pump-probe energies consists of two hidden layers of 20 cells each. The network is also l2-regularized and there is no drop-out, leading to no overfitting (Figure 4) and training convergence after a few thousand epochs (Figure 5). The activation function on the hidden layers is chosen to be a ReLU (rectified linear unit). In combination with the reduced feature count, this results in a substantial speed-up of model fitting and requires far fewer data to be collected.
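The forward propagation described above can be sketched as follows; the initialisation scheme and names are our own illustration, and training via backpropagation with Adagrad is omitted:

```python
import numpy as np

def relu(z):
    """Rectified linear unit activation used on the hidden layers."""
    return np.maximum(z, 0.0)

def forward(x, weights, biases):
    """Forward propagation v_{l+1} = a_l(W_l v_l + b_l); linear output layer."""
    v = x
    for W, b in zip(weights[:-1], biases[:-1]):
        v = relu(W @ v + b)
    return weights[-1] @ v + biases[-1]           # output layer of size 1

def init_net(sizes, seed=0):
    """Random initialisation for given layer sizes, e.g. [10, 20, 20, 1]."""
    rng = np.random.default_rng(seed)
    weights = [rng.normal(0.0, np.sqrt(2.0 / m), size=(n, m))
               for m, n in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(n) for n in sizes[1:]]
    return weights, biases
```

With the reduced input space of 10 features, the architecture quoted in the text corresponds to `init_net([10, 20, 20, 1])`.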

Supplemental Material
Here we present the results of our ML prediction for central photon energies using the experimental setup [12] described in the main text and compare them against the predictions of energies and time delay using data obtained from another two-colour experiment [23], which has a different modus operandi than [12]. Although [23] does not utilize an enhanced SASE scheme like that in [12], both methods use a variable line spectrometer to measure the X-ray spectrum. To create the two pulses, a double slotted foil is inserted into the bunch compressor. In the bunch compressor there is a space-to-energy mapping, so the spatial windows spoil the bunch except in two energy regions, which are then the only regions able to lase. As energy maps to time in the undulators, this space-to-energy mapping becomes a space-to-time mapping for the emission. The result is reduced total brightness and emission confined to two short periods, i.e. pulses. The widths of the slits determine the widths of the pulses, and the slits' separation scales linearly with the delay [33]. The space-to-energy mapping in the bunch compressor is equally important, and it jitters with the electron beam energy. Pulses of up to 30 µJ were produced in this way, with photon energies centred close to 540 eV and separated by 14 eV. The repetition rate was 120 Hz, though the complete pulse diagnostics operated at only 60 Hz. For the double slotted foil method, the temporal structure of the pulses is retrieved using XTCAV.
In general, we find higher input-output correlation for the data from [23]. Testing on this highly correlated data with limited non-linearity helps to benchmark our theoretical prediction models.

I. Predicting central photon energies with two-pulse data from experiment in [12]
Figure S1 shows the validity of predicting the central photon energies of the individual pulses (E_1 and E_2) using different machine learning methods. Despite the complex inter-dependence of these energies on the diagnostics, which is in some cases highly non-linear, the linear regression model (LIN) makes reasonable predictions of the central photon energy for either pulse, as seen in Fig. S1(a) and (d). Both gradient boosting (GB) and artificial neural networks (ANN) make better predictions than the LIN model for the central photon energies of the individual pulses, as depicted in Fig. S1(b), (e) and Fig. S1(c), (f), respectively. In general, independent of the prediction model, the mean absolute error for the predictions of E_2 is 2.4 times larger than for E_1, which can be attributed to the fact that Pulse 2 is the second harmonic of Pulse 1. Thus, the second pulse experiences the effects of electron bunch energy jitter twice as strongly as Pulse 1, while the remaining difference may be attributed to the error in our energy measurements. It is promising to find that the GB and ANN models have similar accuracy in their predictions, especially since we find the GB models are faster to train than the ANN models by at least a factor of three.

II. Predictions of central photon energy and time delay for two-colour data from experiment in [23]
Figures S2 and S3 depict the prediction of the central photon energy of the second pulse and the time delay between the two pulses using the same machine learning methods as in the main manuscript. While the results agree with [23], our approach is more efficient in training time due to the reduced input parameter space, which is the result of our feature analysis. Interestingly, for this data the linear model is sufficient to make accurate predictions, with a mean absolute error that is a fraction of the variance of the data. This can perhaps be understood by considering how the double-slotted foil affects the electron bunch used to create the pulses. In summary, the data in [23] appears to be less non-linear in nature, with higher input-output correlation, than that in [12].
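The sufficiency criterion used above (mean absolute error small compared with the spread of the measured values) can be made concrete with a toy check. The 540 eV centre and 14 eV scale echo the numbers quoted for [23], but the linear relation and noise level are hypothetical illustrations, not the experimental mapping.

```python
import numpy as np

def mae_fraction(y_true, y_pred):
    """Mean absolute error expressed as a fraction of the spread
    (standard deviation) of the measured values; a small ratio means
    the model explains most of the shot-to-shot variation."""
    mae = np.mean(np.abs(y_true - y_pred))
    return mae / np.std(y_true)

# Toy data: a nearly linear diagnostic-to-energy relation (eV scale)
rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 540.0 + 14.0 * x + 0.5 * rng.normal(size=500)

coef = np.polyfit(x, y, 1)        # least-squares linear fit
y_pred = np.polyval(coef, x)
ratio = mae_fraction(y, y_pred)
print(f"MAE / spread = {ratio:.3f}")  # small ratio -> linear model suffices
```

When the underlying relation is close to linear, as argued for the data of [23], this ratio stays small and a linear model is adequate.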

Figure 1. (a) Diagram of the XFEL configuration for two-colour X-ray pulse generation: an electron bunch is modulated in energy-time phase space to yield a high peak current and propagates between two undulator sections separated by a chicane that introduces a delay between the two pulses. In each undulator section, self-amplified spontaneous emission (SASE) generates a bright, coherent X-ray pulse. A CCD camera is used to measure the spectrum of the two pulses. (b) Diagnostics are used to measure the energies of the two-colour XFEL pulses y(x), which depend on the input feature vector x. Both y(x) and x are used to build the prediction model, which consists of three main steps: pre-processing of the data, feature extraction, and training/validating/testing of the prediction model. Two different prediction models were used in this work: neural networks and decision trees based on a gradient boosting classifier. (c) An optimized neural network or gradient boosting classifier is applied directly to real-time experiments for efficient prediction of the central photon energies of two-colour XFEL pulses.

Figure 2. The grey bars in panel (a) depict the permutation importance I_j for each input parameter j when predicting the central photon energy of the second pulse (E_2) using neural networks. The mean absolute error M (solid blue line, in eV) is plotted for a varying number of input features selected by feature importance. Panel (b) is a scatter plot that compares the measured values of E_2 with the values predicted by the neural networks. Predictions using the reduced input space (red dots) agree with those using the full input space (blue dots), with a mean absolute error of M = 2.48 eV.
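The permutation-importance analysis underlying Figure 2 can be sketched with scikit-learn's `permutation_importance`: each feature is shuffled in turn and the resulting increase in prediction error gives its importance I_j. The data here is synthetic (only the first three of eight features carry signal), standing in for a diagnostic set in which few inputs matter; the gradient-boosting model is a stand-in for whichever trained predictor is being analysed.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: features 0-2 drive the label, the rest are noise
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 8))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=1000)

model = GradientBoostingRegressor(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

# Rank features by mean importance I_j; keep only the informative ones
ranking = np.argsort(result.importances_mean)[::-1]
print("feature ranking (most to least important):", ranking)
```

Truncating the input space to the top-ranked features, as done for Figure 2(b), reduces training cost while preserving prediction accuracy.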

Figure 3. The plot shows the precision with which the central photon energy of the second pulse (E_2) can be predicted as a function of the number of undulators between the pulses for two different detection scenarios: in one, both pulses are measured simultaneously, while in the other only Pulse 2 is measured. Using the former data, an ANN is used to predict E_2 (blue solid line with circles), while using the latter data both LIN (magenta dashed line with circles) and ANN (green dashed line with triangles) were used to predict E_2.
Table fragment: diagnostic input parameters.
No. | Parameter | Description | Rate
… | … | energy measurement of the electron beam orbit in a dispersive region of the linac-to-undulator beamline | 120 Hz
2 | ebeam_ebeamLTU250 | A position-based energy measurement of the electron beam orbit in a dispersive region of the linac-to-undulator beamline | 120 Hz
3 | epics_UND_34_gap_act | … | …

Figure 4. Convergence of the mean absolute error as a function of the number of events S used for training the decision trees/neural networks for (a) prediction of the central photon energy of the pulse using [12] and (b) prediction of the time delay using [23].

Figure 5. Convergence of the mean absolute error as a function of the number of epochs used in the neural networks for (a) predicting the central photon energy of the pulse using [12] and (b) predicting the time delay using [23].

Figure S1. Prediction of the XFEL energies of the individual pulses for two-colour data: (a-c) compare the measured values of the energies E_1 with the values predicted by different ML methods, while (d-f) show the same for E_2. Top-row panels represent the linear regression model (LIN), middle-row panels the gradient boosting method (GB) and bottom-row panels the neural networks (ANN). The 2D histogram plots are constructed by grouping the data into 50 bins along each direction, where the density is indicated by the intensity of the blue colouring.

Figure S2. Comparing the measured values of the higher-energy pulse from the two-colour data of an older operation mode [23] with the predictions from different ML methods: (a) linear regression model (LIN), (b) gradient boosting method (GB) and (c) neural networks (ANN). The 2D histogram plots are constructed in a similar way to Fig. S1. Note that the energies shown here differ from those presented in [23] by a small offset owing to a scaling factor, but this does not affect the performance of the fit.