Event Generation and Statistical Sampling for Physics with Deep Generative Models and a Density Information Buffer

We present a study of the generation of events from a physical process with generative deep learning. To simulate physical processes it is not only important to produce physical events, but also to produce these events with the right frequency of occurrence (density). We investigate the feasibility of learning the event generation and the frequency of occurrence with Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) to produce events like Monte Carlo generators. We study three toy models from high energy physics, i.e. a simple two-body decay, the processes $e^+e^-\to Z \to l^+l^-$ and $p p \to t\bar{t}$ including the decay of the top quarks and a simulation of the detector response. We show that GANs and the standard VAE are not able to learn the distributions precisely. By buffering density information of Monte Carlo events in the latent space obtained with the encoder of a VAE and by introducing a smudge factor $\alpha$, we are able to construct a prior for the sampling of new events from the decoder that yields distributions that are in very good agreement with real Monte Carlo events and are generated $\mathcal{O}(10^8)$ times faster. Applications of this work include generic density estimation and sampling, targeted event generation via a principal component analysis of encoded events in the latent space and the possibility to generate better random numbers for importance sampling, e.g. for the phase space integration of matrix elements in quantum perturbation theories. Finally, we create a deep generative model from real experimental data taken at the Large Hadron Collider.


I. INTRODUCTION
The simulation of physical and other statistical processes is typically done by sampling quasi-random numbers as the input of a simulation. The simulation turns random numbers into observable physical events. This is known as the Monte Carlo (MC) method. A fundamental problem with today's numerical calculations simulating physical processes is their immense need for computational resources, which restricts the corresponding scientific progress in speed, budget and therefore general availability. As an example, the full pipeline of the MC event generation in particle physics experiments including the detector response may take up to O(10) minutes per event [1][2][3][4][5][6][7][8] and largely depends on MC sampling algorithms such as VEGAS [9]. Accelerating the event generation pipeline with the help of machine learning will also provide a significant speed-up for signal studies, allowing e.g. broader searches for signals of new physics. Another issue is the probabilistic nature underlying event generation: the inability to exactly specify the event that ought to be generated. Data analysis often requires generating events which are kinematically similar to events seen in the data. Current event generators typically require the generation of many events and then, after generation, the selection of the interesting events with a low efficiency. Furthermore, currently the simulation itself cannot be learned directly from real detector data.
In this paper we outline an alternative approach with deep generative machine learning (ML) models and perform a feasibility study of our method. The main problem we tackle is to create a generator that learns to sample from distributions that are in good agreement with the distributions found in the training data. However, most efforts of the machine learning community regarding generative models are typically not aimed at learning the correct frequency of occurrence. So far, applications of generative ML approaches in particle physics have focused on image generation [10][11][12][13], owing to the recent successes in unsupervised machine learning with generative adversarial networks (GANs) [14][15][16] in generating realistic images according to human judgement [17,18]. Here, we investigate deep generative models, namely GANs and Variational Autoencoders (VAEs) [19], and demonstrate the feasibility of expanding the use of generative models in the form of VAEs beyond images for applications in particle physics. We find that, besides being difficult to tune, GANs are inferior for this task.
To test our setup, three different types of data have been generated. In a first step we construct generative models for a 10-dimensional two-body decay toy model and compare several distributions in the real MC data and the generated ML model data. There we find the potential of generative models to encode physical relations. Subsequently, we study two more realistic problems, where we generate data in the form of Z-boson production from e+e− collisions and its decay to two leptons, e+e− and µ+µ−, and in the form of tt production from proton collisions, where at least one of the top quarks is required to decay leptonically. We find that only by using a density information buffer with the decoder of a VAE are we able to produce a realistic collection of events that follows the distributions present in the MC event data. We perform a principal component analysis (PCA) [20,21] of MC events in the latent space of the VAE for tt production and demonstrate a strategy that enables us to steer the generation of events. Finally, we discuss several further applications of this work, including the construction of a generator from experimental data and the utilization for the phase space integration of matrix elements.

II. METHODOLOGY
This section briefly summarizes the methodology used to investigate the feasibility of using deep generative models to produce realistic events. First, we will explain how we produce the events with MC generators that are used as data to train the generative models. Then we proceed to present the ML techniques used in this work.

A. Event generation
As an example we study the generation of events from particle collisions at colliders. We have used three different sets of generated events: 1. a simple toy model, 2. Z-boson production from e+e− collisions and 3. top production and decay from proton collisions, i.e. pp → tt. Here, we describe the procedures for obtaining the training data sets. a. Toy model  For the toy model we assume a stationary particle with mass M decaying into two particles with masses m_1 and m_2 and calculate their momentum 4-vectors by sampling m_1, m_2, θ and φ from uniform distributions 10^6 times. The components of the model that are learned with the generative models are the energies E_1, E_2 of the daughter particles, the momentum components p_x, p_y, p_z for each particle and their masses m_1 and m_2. We introduce a degeneracy to check whether the generative models learn the relativistic dispersion relation.
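As an illustration, the toy-model generation described above can be sketched in a few lines of numpy. The mass value, sampling ranges and event count below are illustrative assumptions, not the exact settings used for the training data; the kinematics, however, follow directly from energy and momentum conservation in the rest frame of the mother particle.

```python
import numpy as np

def two_body_decay(M=1.0, n=1000, seed=0):
    """Toy model: a particle of mass M at rest decays into two daughters.

    Masses m1, m2 and decay angles theta, phi are drawn from uniform
    distributions (the concrete ranges here are illustrative assumptions).
    Returns an (n, 10) array: (E1, px1, py1, pz1, m1, E2, px2, py2, pz2, m2).
    """
    rng = np.random.default_rng(seed)
    # keep m1 + m2 < M so the decay is kinematically allowed
    m1 = rng.uniform(0.0, 0.45 * M, n)
    m2 = rng.uniform(0.0, 0.45 * M, n)
    theta = rng.uniform(0.0, np.pi, n)
    phi = rng.uniform(-np.pi, np.pi, n)

    # two-body kinematics in the rest frame of the mother particle
    E1 = (M**2 + m1**2 - m2**2) / (2.0 * M)
    E2 = M - E1
    p = np.sqrt(np.maximum(E1**2 - m1**2, 0.0))
    px = p * np.sin(theta) * np.cos(phi)
    py = p * np.sin(theta) * np.sin(phi)
    pz = p * np.cos(theta)

    # daughters are back to back: p2 = -p1
    return np.stack([E1, px, py, pz, m1, E2, -px, -py, -pz, m2], axis=1)

events = two_body_decay(M=1.0, n=10_000)
```

By construction every generated event satisfies the relativistic dispersion relation E^2 − p^2 − m^2 = 0 for both daughters, which is exactly the degeneracy the generative models are tested against.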
b. e+e− → Z → l+l−  The resulting events, given in LHEF format [22], have been read and we have extracted the four-momenta of the produced leptons as input data for the generative models.
c. pp → tt  We have generated 0.5 million events of pp → tt, where at least one of the top quarks is required to decay leptonically. We used MG5 aMC@NLO v6.3.2 [1] for the matrix element generation, using the NNPDF PDF set [23]. MadGraph is interfaced to Pythia 8.2 [2], which handles the showering of the events generated at matrix element level. The matching with the parton shower was done using the MLM merging prescription [24]. A fast detector simulation was then performed with Delphes 3 [3,4], using the ATLAS detector card.

B. Generative Models
In this section a general description of the applied machine learning methods (GANs and VAEs) is given and we provide the details of the corresponding architectures as well as hyperparameters, data augmentation and training procedures. For the VAE we show how the density information buffer is created and how we utilize it to generate events. The GANs and VAEs have been trained on an Nvidia Geforce GTX 970 and a Tesla K40m GPU using TensorFlow [25], Keras [26] and cuDNN [27].
a. Generative Adversarial Networks  GANs consist of two feed-forward neural networks, namely the generator G and the discriminator D, that play a two-player minimax game with the value function V(G, D),

$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_\mathrm{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$. (1)

Eq. 1 shows that D tries to tell apart real from fake samples and G tries to fool D. In [14] the authors show that given an optimal discriminator D*, the two-player game minimizes the Jensen-Shannon divergence (JSD)

$\mathrm{JSD}(p_\mathrm{data} \,\|\, p_g) = \frac{1}{2} D_{\mathrm{KL}}\big(p_\mathrm{data} \,\|\, \tfrac{1}{2}(p_\mathrm{data} + p_g)\big) + \frac{1}{2} D_{\mathrm{KL}}\big(p_g \,\|\, \tfrac{1}{2}(p_\mathrm{data} + p_g)\big)$, (2)

which is a distance measure between probability distributions. If the JSD equals 0 then p_data = p_g, i.e. the distribution formed by the samples generated by the GAN is identical to the true distribution given by the training data. Several issues can occur during training that are related to the fact that the training happens as a two-player game. G might learn how to fool D with bad samples, e.g. G might fool D with a low variance among samples over and over, or D might be too intelligent in the sense that it is able to discriminate correctly in every case. In both cases the training gets stuck in a local Nash equilibrium such that G most likely will not produce samples with the desired qualities. To avoid such issues, we have employed several techniques: label smoothing [17], label switching, batch-size scheduling, a custom activation function [28] for the generator and minibatch discrimination [17]. The capacity of the generator is chosen to be larger than that of the discriminator because D would often quickly get too smart: G comprises four layers with 64 neurons each, whereas D has only three layers with 40 neurons each. For G we used Adam [29] with a learning rate of α_G = 0.00045 while D was trained with Stochastic Gradient Descent (SGD) [30] with α_D = 0.01. The GAN was trained on 10^6 samples for 50 epochs at each batch-size, which increased from 512 to 2048, doubling each iteration. The advantage of big batch-sizes for achieving a better performance was already demonstrated by results obtained in [16].
The noise input of the generator during training and sampling is a 10-dimensional standard normal distribution. To optimize the selection of the weights of the GAN, the JSD is calculated for the relativistic dispersion relation, momentum conservation and θ and φ distributions for each epoch. The best generator was selected by taking the generator with the lowest value for a weighted sum of these JSDs.
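The JSD-based model selection can be sketched as follows. The helper below operates on binned distributions (histograms); treating the divergence this way is our assumption about how it was evaluated in practice, and the uniform toy inputs are placeholders for real MC and generated distributions.

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions.

    p, q are histogram counts or probabilities over the same bins; they are
    normalized internally. Returns a value in [0, log 2] (natural log).
    """
    p = np.asarray(p, float); p = p / p.sum()
    q = np.asarray(q, float); q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(np.where(a > 0, a * np.log((a + eps) / (b + eps)), 0.0))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# compare e.g. the phi histograms of MC events and generated events
mc_hist, edges = np.histogram(
    np.random.default_rng(1).uniform(-np.pi, np.pi, 10_000), bins=50)
gen_hist, _ = np.histogram(
    np.random.default_rng(2).uniform(-np.pi, np.pi, 10_000), bins=edges)
score = jsd(mc_hist, gen_hist)
```

Selecting the best generator then amounts to computing such scores for several observables per epoch and keeping the weights with the lowest weighted sum.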
b. Variational Autoencoders and density information buffering  VAEs also consist of two feed-forward neural networks, the encoder Enc and the decoder Dec. The output of Enc consists of three parts and constitutes the latent space of the VAE with dimension d,

$\mathrm{Enc}(x) = (\mu, \log(\mathrm{Var}), z)$, (3)

with $\mu = (z_{\mathrm{mean},1}, \cdots, z_{\mathrm{mean},d})^T$ and $\log(\mathrm{Var}) = (\log(\sigma_1^2), \cdots, \log(\sigma_d^2))^T$ being the mean and the logarithm of the variance of d Gaussians, while z is a sample drawn from these d Gaussians. The parameter z serves as the input for Dec and leads to an output that is the VAE reconstruction of the input of Enc. The loss function compares the output of Dec to the input of Enc as well as the Gaussians in the latent space to standard normal distributions, using the mean squared error (MSE) and the Kullback-Leibler divergence. We furthermore make use of the β-VAE [31], which adds a parameter β to the loss function,

$L = \mathrm{MSE}(x, \mathrm{Dec}(z)) + \beta \, D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}_d(0, 1)\big)$, (4)

with β ≤ 1. For β = 0 the variances of the Gaussians in latent space get close to 0, while for β = 1 the latent space distributions converge to standard normal distributions. The main difficulty is to find an appropriate β that allows a good reconstruction while reaching a point in training where MSE ≈ βD_KL. The β-VAE is a necessary ingredient for accurately modelling the probability densities because it makes it possible to separate the minimization of the MSE from the minimization of D_KL, resulting in two corresponding phases of the training. In the first phase, βD_KL ≪ MSE such that the optimization effectively only minimizes the MSE. In the second phase, the MSE is so low that MSE ≈ βD_KL such that the distribution in latent space becomes as important as the MSE for the optimizer. Training sufficiently long in both phases is crucial to obtain a good reconstruction loss as well as order in latent space.
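A minimal numpy sketch of the reparameterization trick and the β-VAE objective of Eq. (4), using the closed-form KL divergence between a diagonal Gaussian and a standard normal. The function names are ours and the snippet is framework-agnostic, not the exact Keras implementation used in this work.

```python
import numpy as np

def beta_vae_loss(x, x_rec, z_mean, z_logvar, beta=3e-6):
    """beta-VAE objective of Eq. (4): MSE reconstruction + beta * D_KL.

    z_mean, z_logvar parameterize d diagonal Gaussians; the KL term against
    a standard normal has the closed form
    D_KL = -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2).
    """
    mse = np.mean((x - x_rec) ** 2)
    kl = -0.5 * np.sum(1.0 + z_logvar - z_mean**2 - np.exp(z_logvar), axis=1)
    return mse + beta * np.mean(kl)

def reparameterize(z_mean, z_logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), keeping the
    sampling step differentiable with respect to mu and sigma."""
    eps = rng.standard_normal(z_mean.shape)
    return z_mean + np.exp(0.5 * z_logvar) * eps
```

With β small, the KL term is negligible at the start of training (phase one); once the MSE has shrunk to the same order as βD_KL, both terms shape the latent space (phase two).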
The usual sampling procedure when using VAEs as generative models is to assume a prior p(z) = N_d(0, 1), corresponding to d samples from a standard normal distribution as entries of z. However, using this prior assumes the ideal result for D_KL and ignores the actual latent space representation of real MC events x_i. By creating a density information buffer in the latent space we are able to account for deviations from the idealized standard normal prior and can sample a realistic collection of events from the decoder. To construct the density information buffer we marginalize over N true events and construct the prior following

$p(z) = \sum_{i=1}^{N} p(z|x_i)\, p(x_i) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{N}(z;\, \mu_i, \sigma_i^2)$, (5)

where p(z|x_i) is the Gaussian with means Enc(x_i)[0] and variances exp(Enc(x_i)[1]) and p(x_i) = 1/N. This procedure explicitly encodes the latent space representation of the real MC events in the sampling procedure by means of the prior used to sample with the decoder. Note that a large N is favorable due to effects related to the law of large numbers, i.e. the variance in the dataset decreases. We call the combination of this density information buffer with the decoder the B-VAE. When training the B-VAE, the setting of β is of utmost importance for optimally learning a smooth event density without merely reproducing the training data, since the setting β = 0 would converge to a latent space density buffer consisting of very narrow Gaussians around the training data. The method is similar to a Gaussian kernel density estimation [32] in the latent space of an autoencoder, where the β-term in Eq. (4) is related to the width of the Gaussians. We also introduce a smudge factor α that scales the width of the Gaussians,

$p_\alpha(z) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{N}(z;\, \mu_i, \alpha\, \sigma_i^2)$. (6)

As a consequence, our generative model, the B-VAE, has two additional hyperparameters α and β that impact the quality of the reconstructions as well as the diversity among the generated events.
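Sampling from the density information buffer of Eq. (5) with the smudge factor α of Eq. (6) amounts to picking a training event uniformly at random and drawing from its (widened) latent Gaussian. A minimal sketch, with an illustrative random buffer standing in for actual encoder outputs:

```python
import numpy as np

def sample_from_buffer(z_means, z_logvars, n_events, alpha=1.0, seed=0):
    """Draw latent samples from the density information buffer, Eq. (5).

    The prior is the mixture p(z) = (1/N) sum_i N(z; mu_i, alpha * sigma_i^2)
    over the encoded training events; alpha is the smudge factor widening
    each Gaussian. The returned z would then be fed through the decoder.
    """
    rng = np.random.default_rng(seed)
    N = z_means.shape[0]
    idx = rng.integers(0, N, n_events)          # uniform choice: p(x_i) = 1/N
    sigma = np.exp(0.5 * z_logvars[idx])        # per-event standard deviation
    eps = rng.standard_normal((n_events, z_means.shape[1]))
    return z_means[idx] + np.sqrt(alpha) * sigma * eps

# illustrative buffer: 1000 "encoded events" in a 16-dim latent space
buf_mu = np.random.default_rng(1).normal(size=(1000, 16))
buf_logvar = np.full((1000, 16), -4.0)          # narrow Gaussians
z = sample_from_buffer(buf_mu, buf_logvar, n_events=5000, alpha=5.0)
```

For α = 0 the sampler collapses onto the buffered means (reproducing training events); increasing α widens each Gaussian and thus the diversity of the generated events.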
The encoders and decoders of our VAEs have the same architectures for all datasets, consisting of four (toy model and Z → l+l−) or six hidden layers (pp → tt) with 128 neurons each and shortcut connections between every other layer [33,34]. We chose β_1 = 3 · 10^−6 (toy model and Z → l+l−) and β_{2,3} = {3 · 10^−7, 10^−3} (pp → tt), a batch-size of 1024 and the exponential linear unit (ELU) [35] as the activation function of the hidden layers. The output layer of the decoder is a hyperbolic tangent, such that we need to pre- and post-process the input and output of the VAE. We do this by dividing each component by the maximum of the absolute values found in the training data. We initialized the hidden layers following a normal distribution with mean 0 and a variance of (1.55/128)^0.5 such that the variance of the initial weights is approximately equal to the variance after applying the activation function to the weights. While for the toy model we use a simple training procedure with the Adam optimizer at default values for 100 epochs, we employ a learning rate schedule for 7 × 80 epochs and SWATS [36], i.e. switching from Adam to SGD during training, for the Z decay case. For the tt case the setup is identical except for the number of epochs: we train 4 × 240 epochs with Adam and then 4 × 120 epochs with SGD. The number of latent space dimensions is 9 (toy model), 10 (Z decay) and 16 or 32 (tt).
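The pre- and post-processing described above can be sketched as follows (hypothetical minimal helpers, not the exact code used in this work):

```python
import numpy as np

def fit_scale(train_data):
    """Per-feature scale: the maximum absolute value found in the training data."""
    return np.max(np.abs(train_data), axis=0)

def preprocess(x, scale):
    # maps every feature into [-1, 1], matching the tanh output layer
    return x / scale

def postprocess(y, scale):
    # inverse transform applied to the decoder output
    return y * scale
```

Because the decoder ends in a tanh, its raw outputs live in (−1, 1); dividing by the training maxima before training and multiplying back after generation keeps the physical units consistent.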

III. RESULTS
Our feasibility study finds that standard GANs and VAEs do not work well. Only by using the B-VAE are we able to capture the underlying distribution such that we can generate a collection of events that is in very good agreement with the distributions found in MC event data, as demonstrated in Figures 1 to 6. The failure of the standard VAE is attributed to the fact that the distributions of encoded real MC events in latent space are not standard normal distributions. We find that the density information buffer circumvents this issue.
The comparison of the generative model performances for the toy model in Fig. 1 indicates that the B-VAE with the adjusted prior given in Eq. 5 is the only investigated ML technique that is able to reliably model the p_x, p_y and p_z distributions when compared to GANs and VAEs with a standard normal prior, although these models still give good approximations. We find that all models learn the relativistic dispersion relation, which underlines the findings in [37,38]. It is noteworthy that the GANs did not work well. Since the discriminator evaluates single events or small batches of events (with minibatch discrimination) in the training process, it is typically ill-suited to resolving differences in the densities of the training data and the generated data. We therefore only trained GANs for the two-body decay and focused on the VAE and B-VAE for the other two processes. Fig. 2 shows the distributions of the first three out of a total of ten latent dimensions for Z events. In total, four of them are barely distinguishable from standard normal distributions, while the others have a more narrow peak and very fat tails with a cut-off. Sampling beyond the cut-off gives unphysical events, e.g. mixtures of electrons and muons or negative total energies, which would be part of the sampled events if one sampled only from standard normal distributions. To get proper events, and to get the frequency of occurrence correct, it is therefore crucial to sample from the actual latent space distribution, which is achieved with the density buffer. Figs. 3 and 4 show the results for the Z events, where the Z boson decays leptonically. Here we again find that the B-VAE is able to accurately generate events that respect the probability distribution of the MC events. We find perfect agreement between the B-VAE and MC events for the distributions of p_T, θ and φ, and good agreement for the invariant mass M_inv of the lepton pair around 91 GeV.
While the standard VAE fails to reproduce momentum conservation apart from having a peak around 0, the B-VAE is much closer to the true distribution. When plotting φ, θ and the transverse momentum p_T of lepton 1 against lepton 2 (Fig. 4), we find good agreement for the B-VAE, while the standard VAE results in a smeared-out distribution. In addition, it can be seen that the events generated by the standard VAE are not always produced back to back. We conclude that without density information buffering, the VAE is not able to accurately generate events that follow the Monte Carlo distributions. Therefore, we only investigate the B-VAE performance for the tt case.
In Figs. 5 and 6, the results for the more complicated tt production with a subsequent semi-leptonic decay are shown. We show results using two different values for the latent space dimension dim(z), as well as different values for α and β. We train the B-VAE on events that have a maximum of four jets and two leptons in the final state. For simplicity we discriminate neither between b-jets and light-flavored jets, nor between different kinds of leptons. A jet is defined as a clustered object that has a minimum transverse momentum (p_T) of 20 GeV in the Monte Carlo simulation. Regarding dim(z) we conclude that both 16 and 32 deliver satisfactory results, although we get slightly better reconstructions for dim(z) = 32. The results also show that we require a low value of β to get satisfying distributions. It can be seen in Fig. 5 that the B-VAE also generates events that have a leading-jet p_T of less than 20 GeV, even though the B-VAE was never trained on such events. For η of the leading jet and lepton and p_T of the leading lepton we find similar phenomena at the edges of the distributions. From this we conclude that although the distributions found in the MC data are learned by the B-VAE almost perfectly, it provides samples that follow a very good approximation of the underlying density and does not just generate copies. However, we must note that the corresponding widths, and therefore the diversity among the samples, are relatively small. As discussed before, we introduce the smudge factor α and observe that the distributions still agree very well while showing increased diversity among the events. In particular, we note that sampling from a standard normal distribution in latent space leads to kinematic distributions that do not resemble the ground truth.
From this we conclude that the best way to sample from the B-VAE is to use 16 or 32 latent space dimensions with a low value of β and an α that gives a satisfactory diversity of samples. For the φ distributions of leading jets and leptons we observe almost perfect agreement, from which we conclude that the B-VAE also learns effects coming from the granularity of the detector. Fig. 6 again shows very good agreement for the invariant mass of all four jets, the MET and MET-φ, with similar edge effects as seen before. Finally, generating 10^7 tt events with the VAE has taken 177.5 seconds on an Intel i7-4790K and is therefore O(10^8) times faster than the traditional MC methods. Fig. 7 shows eight histograms for the training data (≈ 475000 events) and ≈ 15 million generated events from the B-VAE with three different values α = {1, 5, 10} for φ of the leading jet vs. φ of the next-to-leading jet (φ_1 vs. φ_2). The left column shows the histogram for the full range [−π, π] × [−π, π], whereas the right column shows the same histogram zoomed in on [2, 3] × [2, 3]. We observe that for α = 1, most of the blank space in (φ_1, φ_2) is already filled with frequencies that globally follow the density pattern of the Monte Carlo data. However, we note that this pattern is not learned smoothly but in a way that resembles egg cartons, with local peaks that seem to come from random seeds in the training data whose latent space representations have a small width, and it still contains some holes. Increasing α alleviates this issue: for α = 5 almost all holes disappear, but the landscape in (φ_1, φ_2) is still mountainous, which is smoothed out for α = 10.

IV. APPLICATIONS
We have found that the B-VAE as a deep generative model can be both a very good generator and a very good density sampler. In this section we discuss and present several further applications of this work, such as the creation of a generator for new theoretical physics models and improved MC integration. We first demonstrate how one can utilize the B-VAE to steer the event generation and then, for the first time, create a generator directly from experimental data taken at the Large Hadron Collider.
To steer the event generation we need to find out which regions in latent space correspond to which events generated by the decoder. To this end we perform a principal component analysis of the latent space representations of MC events. The PCA is an orthogonal transformation of the data that defines new axes such that the first component accounts for most of the variance in the dataset. We look at the first two principal components, sample a grid in these components and apply the inverse PCA transformation to get back to a latent space representation. Fig. 8 shows a scatter plot of 10^4 PCA-transformed latent space representations of MC events in gray and 64 chosen points in red. The 64 points were picked after finding that MC events in PCA space are distributed on an approximately circular area. Since the histogram of the points in PCA space displayed a circular shape, we created an equidistant 8 × 8 grid in the polar coordinates r and φ. The grid in PCA space is then transformed back to a latent space representation and used as input for the decoder to generate the events displayed in Fig. 9. The numbers associated with the red dots in Fig. 8 correspond to the events with the same numbers shown in Fig. 9. This is effectively a mapping of the latent space. Inspecting the event displays reveals which regions of latent space yield events with a given number of jets and leptons, a given order of magnitude of the MET and a given kind of orientation. In case one wants to produce events that e.g. look like event 62, one can do this by sampling around r = 3.5 and φ = 225° in PCA space, transforming these samples back to a latent space representation and using that as input for the decoder. This again offers the possibility to narrow down the characteristics of the events even further, and iterating this procedure allows one to generate events with arbitrarily precise characteristics.
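The steering procedure can be sketched with a plain-numpy PCA: fit the principal axes of the latent codes, lay an equidistant polar grid in the first two components, and map the grid back to the full latent space. The random codes and grid ranges below are illustrative stand-ins for actual encoded MC events.

```python
import numpy as np

def pca_fit(z):
    """PCA of latent codes via SVD; returns (mean, components).
    Rows of `components` are the principal axes, sorted by explained variance."""
    mean = z.mean(axis=0)
    _, _, vt = np.linalg.svd(z - mean, full_matrices=False)
    return mean, vt

def polar_grid(radii, angles):
    """Equidistant grid in polar coordinates (r, phi) in the first two PCs."""
    r, a = np.meshgrid(radii, angles, indexing="ij")
    return np.stack([r * np.cos(a), r * np.sin(a)], axis=-1).reshape(-1, 2)

def grid_to_latent(grid2d, mean, components):
    """Inverse PCA: embed a 2-D grid in the first two PCs, zeros elsewhere."""
    full = np.zeros((grid2d.shape[0], components.shape[0]))
    full[:, :2] = grid2d
    return full @ components + mean

# 8 x 8 polar grid as in the text; decoding z_steer steers event generation
codes = np.random.default_rng(0).normal(size=(10_000, 16))
mean, comps = pca_fit(codes)
grid = polar_grid(np.linspace(0.5, 3.5, 8),
                  np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False))
z_steer = grid_to_latent(grid, mean, comps)
```

Each row of `z_steer` is a latent vector whose position in PCA space is known, so feeding it to the decoder maps out which latent region produces which kind of event.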
An application of GANs to create effective field theories has already been shown in Ref. [13]. Another possibility to explore the latent space is to interpolate between two latent space representations. For example, for the Z decay one could encode an e+e− and a µ+µ− event, define paths between them and sample from the decoder along the path in latent space. Although unphysical, mixed e+e−/µ+µ− events now appear naturally along such a path, and this is indeed what we find. By using different buffers the production can be steered: events from different processes (e.g. top and Z events) can be generated using the same latent space but different buffers. The idea of interpolating between latent space representations can be extended to the creation of new theory mixtures. In follow-up work we will create mixtures of beyond-the-standard-model theories such as supersymmetry and Z′.
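The interpolation between two latent representations described above can be sketched as a linear path (the zero and one vectors below are placeholders for actual encoded events):

```python
import numpy as np

def interpolate(z_a, z_b, steps=10):
    """Linear path between two latent representations. Decoding the points
    along the path morphs one event into the other, e.g. an e+e- event
    into a mu+mu- event."""
    t = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - t) * z_a[None, :] + t * z_b[None, :]

# path from one (placeholder) latent code to another in a 10-dim latent space
path = interpolate(np.zeros(10), np.ones(10), steps=5)
```

More elaborate paths (e.g. spherical interpolation) are possible; the linear version shown here is the simplest choice.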
Having found that the B-VAE can be used to sample according to highly complex probability distributions, one possible application may be to provide a very efficient method for the phase space integration of multi-leg matrix elements. Recent work has shown that machine learning approaches to Monte Carlo integration of multidimensional probability distributions [28] and phase space integration of matrix elements [39] may be able to obtain much better rejection efficiency than the current widely used methods [9]. We point out that event weights can be obtained from the B-VAE in similar fashion to the above papers. Given the ability of the B-VAE to model complex distributions almost perfectly, we expect most of these weights to be very close to unity.
Lastly, this approach makes it possible to create an event generator that is trained directly on experimental data, e.g. coming from astroparticle experiments or from a particle accelerator such as the Large Hadron Collider. In Fig. 10 we show several probability distributions characterizing the experimental data [40] taken from the MultiJet primary CMS open data release and the data created by our deep generative models. In all cases, we find very good agreement with the distributions found in the experiment, i.e. we have successfully created a deep generative model from experimental Large Hadron Collider data.
To create this generative model we have used a B-VAE with dim(z) = 8, β = 3 · 10^−8, a batch-size of 128, and all other hyperparameters including the training procedure identical to those used for pp → tt. The approach likely has applications in statistical sampling and generative modeling beyond particle physics, i.e. wherever it is beneficial to learn a generative model directly from data.

V. CONCLUSION
We have provided more evidence for the capability of deep generative models to learn the laws of physics, with GANs, VAEs and in particular with our modification of the VAE, the B-VAE. However, the GAN and the VAE with a standard normal prior fail to replicate the frequency of occurrence of the physical events. By creating a density information buffer of real MC events in the latent space of a VAE and by introducing a smudge factor α, i.e. with the B-VAE, we have presented a way to generate events whose probabilistic characteristics are in very good agreement with those found in data simulated with Monte Carlo event generators for a toy model, e+e− → Z → l+l− and pp → tt events. By performing a principal component analysis of the latent space representations of MC events and a subsequent exploration of the corresponding PCA space, we introduced a method to steer the event generation. For the first time, we have created a particle physics event generator directly from experimental data taken at the Large Hadron Collider. These results indicate usefulness not only for particle physics but beyond that for all branches of science that involve computationally expensive Monte Carlo simulations, an interest in creating a generative model from experimental data, or the need to sample from high-dimensional and complex probability distributions.

VI. ACKNOWLEDGEMENTS
This work was partly funded by and carried out in the SURF Open Innovation Lab project "Machine learning enhanced high performance computing applications and computations" and was partly performed using the Dutch national e-infrastructure.

FIG. 1 (caption fragment): "... line shows E^2 − p^2 − m^2 for particle 1 and 2 and the distribution for the azimuthal angle φ of particle 1."

FIG. 2: Distributions of the first three of ten latent space dimensions (left to right) in the trained VAE for the e+e− → Z → l+l− process. While the left plot displays great similarity with a standard normal distribution, the other two have a more narrow peak and a cut-off around ±2.5. If sampled beyond the cut-off, events become unphysical.

FIG. (caption fragment): "... show the ratio of the events generated by the generative models to the events generated by the Monte Carlo generator. The jump in η around |η| = 1.5 for the leading lepton is caused by the detector simulation: the calorimeter drops in efficiency at higher rapidities. The step-function-like shapes in φ of the leading jet that the B-VAE learns are not statistical fluctuations but an effect of the detector granularity that is accounted for in the event generation. The number of events is normalized to the number of generated Monte Carlo events. The indicated error is the statistical error, computed as √N_events for each bin."

FIG. 10 (caption fragment): "... [40] (gray) and three B-VAE configurations with α = 1 (blue), α = 5 (green) and α = 10 (red). Shown are the invariant mass distribution for the leading and subleading jet, the missing transverse energy, as well as φ and p_T of the leading jet."