High-precision regressors for particle physics

Monte Carlo simulations of physics processes at particle colliders like the Large Hadron Collider at CERN take up a major fraction of the computational budget. For some simulations, a single data point takes seconds, minutes, or even hours to compute from first principles. Since the necessary number of data points per simulation is on the order of 10^9–10^12, machine learning regressors can be used in place of physics simulators to significantly reduce this computational burden. However, this task requires high-precision regressors that can deliver data with relative errors smaller than 1% or even 0.1% over the entire domain of the function. In this paper, we develop optimal training strategies and tune various machine learning regressors to satisfy the high-precision requirement. We leverage symmetry arguments from particle physics to optimize the performance of the regressors. Inspired by ResNets, we design a Deep Neural Network with skip connections that outperforms fully connected Deep Neural Networks. We find that at lower dimensions, boosted decision trees far outperform neural networks, while at higher dimensions neural networks perform significantly better.
We show that these regressors can speed up simulations by a factor of 10^3–10^6 over the first-principles computations currently used in Monte Carlo simulations. Additionally, using symmetry arguments derived from particle physics, we reduce the number of regressors necessary for each simulation by an order of magnitude. Our work can significantly reduce the training and storage burden of Monte Carlo simulations at current and future collider experiments.


Introduction
Particle physics experiments like those at the Large Hadron Collider at CERN are running at progressively higher energies and are collecting more data than ever before. As a result, the experimental precision of the measurements they perform is continuously improving. However, to infer what these measurements mean for the interactions between the fundamental constituents of matter, they have to be compared with, and interpreted in light of, our current theoretical understanding. This is done by performing first-principles computations for these high-energy processes order by order in a power series expansion. After the computation, the resulting function is used in Monte Carlo simulations. The successive terms in the power series expansion, simplistically, become progressively smaller. Schematically, this can be written as

F(x) = f00(x) + α f01(x) + α^2 {f11(x) + f02(x)} + …,   (1)

where α ≪ 1 is the small expansion parameter. The term of interest to our current work is the one enclosed by the curly braces in Equation (1), which we will refer to as the second-order term¹. The function, F(x), must be evaluated on the order of 10^9–10^12 times for each simulation. However, for many processes, evaluating the second-order term, specifically f02, is computationally space- and time-intensive, and a single data point can take several seconds to compute. Moreover, these samples cannot be reused, leading to an overall high cost of computation for the entire process under consideration. Building surrogate models to speed up Monte Carlo simulations is highly relevant not only in particle physics but in a very large set of problems addressed by all branches of physics using perturbative expansions like the one in Equation (1). We give a broader overview of the physics motivation and applications in section 4.

¹ Here order refers to the power of the expansion coefficient α.
arXiv:2302.00753v1 [physics.comp-ph] 2 Feb 2023

A simple solution to speed up the computation of the functions is to build a regressor using a representative sample. However, to achieve the precision necessary for matching with experimental results, the regressors need to produce very high-accuracy predictions over the entire domain of the function. The requirements that we set for the regressors, and in particular what we mean by high precision, are:

High precision: prediction error < 1% over more than 90% of the domain of the function
Speed: prediction time per data point of < 10^-4 seconds
Lightweight: the disk size of the regressors should be a few megabytes at the most, for portability

In this work we explore the following novel concepts:
• With simulated data from real physics processes occurring in particle colliders, we study the error distributions over the entire input feature spaces of multi-dimensional distributions when using boosted decision trees (BDT), Deep Neural Networks (DNN) and Deep Neural Networks with skip connections (sk-DNN).
• We study these regressors for 2-, 4-, and 8-dimensional (D) data, comparing the performance of BDTs, DNNs and sk-DNNs with the aim of reaching errors smaller than 1%–0.1% over at least 90% of the input feature space.
• We outline the architectural decisions, training strategies and data volume necessary for building these various kinds of high-precision regressors.
In what follows, we will show that we can reduce the compute time of the most compute-intensive part, f11(x) + f02(x) (defined in Equation (1)), by several orders of magnitude, down from several seconds to sub-milliseconds, without compromising the accuracy of prediction. We show that physics-motivated normalization strategies, learning strategies, and the invocation of physics symmetries are necessary to achieve the goal of high precision. In our experiments, the BDTs outperform the DNNs at lower dimensions, while the DNNs give comparable (for 4D) or significantly better (for 8D) accuracy at higher dimensions. DNNs with skip connections perform comparably with fully connected DNNs even with far fewer parameters, and outperform DNNs of equivalent complexity. Moreover, DNNs and sk-DNNs meet and exceed the high-precision criteria with 8D data while BDTs fail. Our goal is to make the most lightweight regressor for real-time prediction, facilitating the speed-up of the Monte Carlo simulation.

Related Work
Building models for the regression of amplitudes has been a continued effort in the particle physics literature in the recent past. Boosted decision trees (BDTs) have been the workhorse of particle physics for a long time, but mostly for the classification of tiny signals against dominating backgrounds [1]. However, the use of BDTs as regressors for theoretical estimates of experimental signatures has only been advocated recently [2] and has been shown to achieve impressive accuracy for 2D data.
Several other machine learning algorithms have been used for speeding up sample generation for Monte Carlo simulations. Ref. [3] proposed the use of Normalizing Flows [4] with Invertible Neural Networks to implement importance sampling [5, 6]. Recently, neural network surrogates have been used to aid Monte Carlo simulations of collider processes [7]. Ref. [8] used Bayesian Neural Networks for the regression of particle physics amplitudes, with a focus on understanding error propagation and estimation. Ref. [9] attempted to reach the high-precision regime with neural networks and achieved 0.7% errors integrated over the entire input feature space. Ref. [10] tackled parametric integrals, i.e., where only some of the variables are integrated over, that arise in precision particle physics calculations, achieving 0.1 to 1 percent precision over a range of parameter values. Physics-aware neural networks were studied by Ref. [11] in an attempt to handle singularities in the regressed functions. In the domain of generative models, GANs [12–14] and VAEs [14] have been used for sample generation [15, 16]. Similar applications have surfaced in other domains of physics where Monte Carlo simulations are used. Self-learning Monte Carlo methods have been explored by Ref. [17]. Applications of Boltzmann machines [18], deep neural networks [19] and autoregressive neural networks [20] have appeared recently. Ref. [21] uses neural networks in Quantum Monte Carlo simulations to learn eigenvalues of Hamiltonians and the free energy of spin configurations, an application that lies outside the domain of particle physics. However, the primary goal of all these efforts has been to avoid first-principles computation and, hence, reduce compute time while staying below credible error budgets that are set in a problem-specific manner.
In contrast to prior works [2, 3, 15, 16], the novelty of our contribution is that we try to attain high precision over the entire domain of the function being regressed with fast and efficient regressors. For that, we compare BDTs and neural networks for functions with 2D, 4D, and 8D input feature spaces. We propose a new architecture, a DNN with skip connections, to avoid the problem of vanishing gradients in deeper neural networks, and show that it performs better than fully connected DNNs. We also propose novel methods, derived from physics domain knowledge, for scaling the function being regressed by another function that is computationally inexpensive to calculate and highly correlated with the function being regressed. We leverage the symmetry properties of the physical process under consideration to reduce the number of regressors that need to be trained. The applicability of our work goes beyond the domain for which it has been developed and extends to any application that requires high precision when speeding up simulations or sample generation.

Models: Decision Trees and Neural Networks
In this section, we will develop several methods that will enable us to achieve the high-precision requirements that we set. As a benchmark, we will use the condition: |δ| < 1% in over 90% of the domain of the function being regressed². Here, δ is defined as the difference between the predicted value of the function, f(x)_predicted, and its true value, f(x)_true, normalized by f(x)_true:

δ = (f(x)_predicted − f(x)_true) / f(x)_true = f(x)_predicted / f(x)_true − 1,   (2)

knowing a priori that f(x)_true is positive definite. Usually, the performance of a regressor and model comparison in machine learning is done using a single accuracy measure, a statistical average of the distribution of that accuracy measure over the entire test sample. This, however, does not provide a complete picture of the accuracy of the regressor in high-precision applications. Using error distributions instead of a single number leads to a better criterion for model selection and enhances the interpretability of the model.

Physics-informed normalization: An attempt to build regressors with the raw data from the Monte Carlo simulations results in a failure to meet the high-precision requirements that we have set. Hence, we have to appeal to

² For a more detailed explanation of the precision requirements please read section 4.
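As an illustration, the benchmark statistic in Equation (2) can be computed from a held-out test set in a few lines (a minimal sketch; the function and variable names are ours, not from a released codebase):

```python
import numpy as np

def delta_distribution(f_pred, f_true, threshold=0.01):
    """Relative error delta for each test point and the fraction of
    the test set with |delta| below the threshold (1% by default)."""
    delta = (f_pred - f_true) / f_true  # f_true is positive definite
    frac_below = float(np.mean(np.abs(delta) < threshold))
    return delta, frac_below
```

A model meets the benchmark when the returned fraction exceeds 0.9.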

DESY 22-174

a novel normalization method derived from the physics that governs the physical processes. The functions of interest in particle physics processes at colliders are often very highly peaked in one or more dimensions. This makes it quite difficult to build a regressor that retains the desired high precision over the entire domain of the function. This problem cannot be addressed by log scaling or by standardizing to zero mean and unit variance, since the peaks can be quite narrow and several orders of magnitude greater than the mean value of the function. A regressor trained on the log-scaled function provides an error distribution over the entire domain which, when exponentiated, transforms to large errors around the peak. This behavior is not desirable. Normal scaling does not help since the standard deviation of the distribution is much smaller than the peak value, often by several orders of magnitude, making the peak values outliers.
As a solution, we normalize the second-order contribution by the zeroth-order contribution, as defined in Equation (1), i.e., we transform to the distribution

f̃(x) = [f11(x) + f02(x)] / f00(x),   (3)

where f̃(x) is the function that will be regressed. The zeroth-order term, f00(x), also has a peak similar to, and is highly correlated with, the second-order term, f11(x) + f02(x), with ρ ∼ 0.9. Hence, this normalization yields a distribution, f̃(x), that is more tractable to regress. We show examples in Figure 1, where one can see that both f00(x) and [f11(x) + f02(x)] are very peaked and span several orders of magnitude, but their ratio spans only one order of magnitude since the two terms are highly correlated. Computation of the zeroth-order term from first principles is numerically inexpensive and does not require regression. Furthermore, we standardize the distribution by removing the mean and scaling to unit variance.
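The full target transformation, Equation (3) followed by standardization, can be sketched as follows (array names are ours; the inverse is needed at prediction time to recover the physical second-order term):

```python
import numpy as np

def normalize_target(f_second_order, f_zeroth_order):
    """Divide the expensive second-order term f11 + f02 by the cheap,
    highly correlated zeroth-order term f00 (Equation (3)), then
    standardize to zero mean and unit variance."""
    ratio = f_second_order / f_zeroth_order
    mu, sigma = ratio.mean(), ratio.std()
    return (ratio - mu) / sigma, mu, sigma

def invert_normalization(y, mu, sigma, f_zeroth_order):
    """Recover the second-order term from a regressor prediction y."""
    return (y * sigma + mu) * f_zeroth_order
```

Since f00 is inexpensive to evaluate exactly, the inverse transformation adds negligible cost at inference time.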

DNN design decisions
The DNN architectures that we use are fully connected layers with Leaky ReLU [22] activation for the hidden layers and linear activations for the output layer. We show a comparative study of activation functions in subsection 5.3, where we find that the Leaky ReLU outperforms other activation functions like ReLU [23, 24], softplus and ELU [25]. We do not consider any learnable activation functions in this work and leave them for future work. We use the following design decisions:

Objective function: we use the mean-square-error loss function without any regularization. While we use the linear relative error, δ, to estimate the performance of the model over the entire feature space, we choose the mean-square-error loss so as to preferentially penalize outliers and reduce their occurrence.
Learning rate: It is necessary to cool down the learning rate as a minimum of the objective function is approached. This is absolutely necessary to search out an optimum that allows for uniformly low error over the entire feature space. For both the DNN and the sk-DNN, we use the Adam optimization algorithm. Ref. [26] discusses an inverse-square-root decay scaling for the Adam optimizer. We do not find this optimal for this high-precision application: the square-root cooling is quite rigid in its shape, as it is invariant to scaling only up to a multiplicative constant. Hence, we use an exponential cooling of the learning rate, which has asymptotic behavior similar to the inverse-square-root scaling but whose shape is far more tunable. The learning rate is cooled down from 10^-3 at the beginning of the training to 10^-6 at 2500 epochs. Much of the learning is completed during the early stages of the training, i.e., within 200 epochs. The R^2 score at this point is about 0.5% from the final R^2 score (> 99.9%). However, to attain the high-precision requirements, the final stages of the training are necessary and take about 2500–3000 more epochs.
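The exponential cooling can be written as a simple per-epoch schedule; the sketch below (our own parameter names) interpolates between the stated endpoints, 10^-3 at epoch 0 and 10^-6 at epoch 2500:

```python
import math

def exponential_cooling(epoch, lr_start=1e-3, lr_end=1e-6, n_epochs=2500):
    """Learning rate decayed exponentially from lr_start to lr_end
    over n_epochs, then held at lr_end."""
    if epoch >= n_epochs:
        return lr_end
    rate = math.log(lr_end / lr_start) / n_epochs
    return lr_start * math.exp(rate * epoch)
```

A function like this can be passed to a per-epoch scheduler callback (e.g. Keras' `LearningRateScheduler`) or used to set the Adam learning rate manually.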
Training epochs and validation: An early-stopping criterion based on the RMSE is used to determine the number of epochs the regressors are trained for, with the patience set to an unusually large number, 200 epochs. We use this large patience to allow the optimizer, while having a very small learning rate, to move to a better optimum if one exists. We first split the data into a 20% test set and an 80% training-and-validation set. The latter is further split into a 60% training set and a 40% validation set. This results in a 20%-48%-32% split for test, train and validation, respectively. The large validation set is necessary to make sure that errors are uniformly low over the entire domain of the function being regressed. For all cases, we use a dataset with 10 million samples.

DNN with skip connections

In addition to a fully connected DNN, we also experiment with a DNN with skip connections (sk-DNN) to address the problem of vanishing gradients in deeper neural networks. The building block of the sk-DNN is illustrated in Figure 2. Given an input x, the output of the block is

y = g(h(x) + W x),

where h(x) is the output of the third layer with linear activation, g is a non-linear function, and W is a trainable weight matrix of dimension i × j when the input dimension, i, differs from the output dimension, j, and the identity I otherwise.
The structure of this block can be derived from the Highway Network [27] architecture with the transform gate set to I and the carry gate set to W when dim(x) ≠ dim(y) and to I otherwise. Structurally, the sk-DNN block is similar to a ResNet block [28] with a different set of hidden layers. We keep the normalization of the target variable and the learning rate decay schedule the same as for the DNN. We also tested the sk-DNN with the weight matrix W fixed at a random initialization of its elements and saw no difference in the accuracy of the model; hence we keep W trainable.
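As a concrete illustration, the forward pass of a single sk-DNN block can be sketched in NumPy (the two-hidden-layer structure and layer sizes here are illustrative assumptions, not the exact trained architecture):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def sk_block_forward(x, A1, A2, A3, W):
    """y = g(h(x) + W x): h is a small stack of Leaky-ReLU layers
    ending in a linear layer (A3); g is the non-linearity; W maps the
    input onto the output dimension (the identity when they match)."""
    h = leaky_relu(x @ A1)
    h = leaky_relu(h @ A2)
    h = h @ A3                     # linear activation on the last layer
    return leaky_relu(h + x @ W)   # skip connection applied inside g
```

With W fixed to the identity and the hidden weights at zero, the block reduces to g(x), which makes the role of the skip path explicit.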
Physics Motivation

In this section we elaborate on the requirements set out in the introduction, namely:
1. that the prediction error relative to the exact value be < 1%,
2. that this be the case for 90% of the domain of the function,
3. that the evaluation time be faster than O(10^-3) seconds,
4. and, finally, that the disk size of the regressors be on the order of megabytes rather than gigabytes.

These regressors will be used as surrogate models for exact functions that are numerically slow to evaluate. As a result of their (extreme) slowness, these exact functions, which are used in Monte Carlo simulations, are by far the biggest bottleneck in the simulation.
The physics context: the theoretical model that describes the fundamental particles and their interactions is called the Standard Model of particle physics. This model is an example of a quantum field theory; what this means exactly is not crucial here. Rather, the important feature of this theory is that computing observables (i.e., outcomes of experiments) cannot, in general, be done exactly, because such calculations are not tractable for several reasons whose explanation goes beyond the scope of this work. The usual way of obtaining predictions is by expanding the theory as a power series in a small expansion parameter and computing higher orders in this expansion to improve the accuracy of the prediction. Such perturbative expansions are ubiquitous in physics in general, since only a few systems, most notably the simple harmonic oscillator and the two-body inverse-r^2 problem, can be solved exactly. A very large fraction of physics problems spanning atomic physics, nuclear physics, condensed matter physics, astrophysics, cosmology, hydrodynamics, electrodynamics, quantum mechanics, complex systems, etc. requires the use of perturbative expansions, where the higher-order terms are very tedious and slow to compute. Hence, the methods we develop here apply to any problem where a perturbative expansion is used and/or a function that requires a very large number of evaluations is very slow to evaluate while a certain threshold of precision is required.
The slow functions that are the focus of this work arise at second order in this power series expansion. The number of terms produced at each order increases rapidly, and the complexity of the mathematical functions that appear also increases. For example, at first order in the expansion (if the zeroth order is a so-called tree process), polylogarithms of at most order 2 can appear; at second order, higher-order polylogarithms appear. On top of being time-consuming to evaluate numerically, these functions typically exhibit large cancellations among themselves, which requires using arbitrary-precision arithmetic libraries to circumvent the infamous 'catastrophic cancellation' problem of numerical analysis.
For example, the time penalty for improving the accuracy of the prediction of the rate of production of four electrons by including the second-order term is a factor of 1500; for details, see table 11 in the journal version of Ref. [29]. Herein lies the logic behind requirement 3: the second-order functions typically take 1–10 seconds per point to evaluate, while all other functions in the Monte Carlo simulation typically take milliseconds. Therefore, the bottleneck is removed if the surrogate model takes 10^-3 seconds per point or less to evaluate.
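The speed requirement is straightforward to verify empirically. A minimal benchmark helper (our own, not part of any released code) measures the per-point cost of a batched predictor:

```python
import time

def seconds_per_point(predict, X, repeats=3):
    """Best-of-N wall-clock time per data point for a batched
    predictor, e.g. a trained regressor's predict method."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        predict(X)
        best = min(best, time.perf_counter() - start)
    return best / len(X)
```

The surrogate removes the bottleneck when this number falls below 10^-3 seconds per point.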
The accuracy requirement: there are many sources of uncertainty that propagate to the final prediction. Roughly speaking, there are systematic errors of order 1% that cannot be reduced at the moment or in the foreseeable future. There are also statistical errors inherent in the finite samples produced by Monte Carlo simulations. The goal is to strive for Monte Carlo statistical errors much smaller than 1%, say 0.1%. Since the contribution of the second-order functions is of order 10%, it is sufficient for the surrogate models to be accurate to 1% for the error on the total prediction (including the zeroth and first order) to be of order 0.1%. While this precision would be good to have over the entire domain of the function, it is not necessary given the error margins we aim for. Assuming that the errors in the predictions made by the model follow a Gaussian distribution, we can safely set the threshold to 1% error over 90% of the function domain. We checked, after the fact, that the Gaussian assumption is approximately realized (cf. Figure 6). An elaboration of this requires a discussion of specific integrals for specific scattering processes in particle physics, which we leave for a more particle-physics-oriented work.
Portability: in contrast with the speed and accuracy requirements (items 1, 2 & 3), the requirement that the disk size of the models be of order megabytes (item 4) is desirable but not strict. In practice, to implement the surrogate models discussed in this work in Monte Carlo codes, several regressors are required. Since these codes must be downloaded locally by the users, it is desirable that the disk size remain small. As we show below, BDTs can reach several GB in compressed model format for higher-dimensional data. Hence, in this work, we focus on building specific neural networks that are far more portable.

Table 1. The number of total functions that arise at 2, 4, and 8D and the number of independent functions, a minimal subset from which the other functions can be generated by re-mapping the feature space.

Physics Simulations
The functions in question are maps, f_ij : R^n → R, where n ∈ {2, 4, 8} and i, j ∈ {0, 1, 2}, cf. Equation (1). The domain of the functions, i.e., the feature space, is mapped to the unit hypercube and populated from a uniform distribution. The corresponding datasets are generated from first principles with the particle physics simulation code VVAMP [30], using building-block functions that we will refer to as form factors. Apart from the 2D dataset, which is a special case of the 4D one, the same form factors were used to generate the 4D and 8D datasets. The difference between the 4D and 8D feature spaces lies in the physics of the process in question, namely the number of external particles the functions describe. The regressors of the 4D functions, g_ij ≈ f_ij^(4), can be used to generate the 8D functions, f_ij^(8), after multiplying by two other (exact) functions that are computationally inexpensive to calculate and summing the products.
The number of resulting functions, technically called helicity amplitudes, depends on the dimension, as shown in Table 1. While the number of required regressors for the 4D feature space is the largest, it also offers the most flexibility for downstream physics analyses. To generate the 8D functions, more details of the process have to be specified during data generation, which are then frozen into the regressor. Consequently, different physics analyses will require different regressors. By contrast, the 4D regressors are more general-purpose and do not contain any frozen physics parameters.
The smaller number of necessary functions in the third column of Table 1 is obtained by leveraging the symmetry properties, discussed below, derived from physics domain knowledge. For the data used in this analysis, it stems from the symmetries manifest in the scattering process that was simulated. The last column indicates whether summing the functions has a physics meaning; in the cases where it does, i.e., 2D and 8D, only a single regressor for the sum of the functions is required.
Symmetry properties: the full set of functions, f_ij, for any dimension, n, is overcomplete. Pairs of functions can be mapped into one another via particular permutations of the external particles of the process. This translates into a linear transformation on the second coordinate, x2, applied independently or in combination with a permutation of the third and fourth coordinates, x3 and x4, in feature space. For example, in 4D, the two permutations π12 : p1 ↔ p2 and π34 : p3 ↔ p4, where pi is the particle with label i, reduce the number of independent functions from 162 to 25.
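The practical payoff is that one trained regressor can serve several partner functions by remapping its inputs. The sketch below is purely illustrative: the specific map x2 → 1 − x2 on the unit hypercube is a placeholder for the actual linear transformation, which depends on the process:

```python
import numpy as np

def remap_features(X, flip_x2=True, swap_x3_x4=True):
    """Illustrative feature-space remapping on the unit hypercube: a
    permutation of external particles becomes a linear map on x2
    (placeholder: x2 -> 1 - x2) and/or a swap of x3 and x4."""
    X = X.copy()
    if flip_x2:
        X[:, 1] = 1.0 - X[:, 1]
    if swap_x3_x4:
        X[:, [2, 3]] = X[:, [3, 2]]
    return X
```

A partner amplitude is then evaluated as `regressor.predict(remap_features(X))` instead of training and storing a second regressor.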
Computational burden of Monte Carlo simulations: generating the 2D, 4D and 8D datasets required 144 hours on 96 AMD EPYC 7402 cores for 13 million data points per set. This had to be done twice: once for the 2D dataset and once for the 4D and 8D datasets, which were generated from the same computationally intensive form factors that have to be calculated from first principles. In contrast, the regressors that we build generate a million samples in a few seconds to a few minutes on any desktop computer.

Results
We proceed to derive the optimal hyperparameters for the models that will facilitate the high-precision requirements. We also study a set of activation functions to best design the neural network architectures. With these, we provide a comparison of the accuracy of all the models for all the datasets.

Figure 3. Left panel: the effect of the learning rate on achieving high precision with BDTs. The learning rate is not an important parameter at low dimensions but is significant at higher dimensions. Right panel: the data volume required to train the different regressors. For lower dimensions, small volumes of data (< 1M) are sufficient; for higher dimensions, much more data is necessary.

Boosted Decision Trees
We use XGBoost [31] to implement the BDTs. In varying the architecture of the regressors, we focus on the max-depth of the BDT, a hyperparameter that controls the maximum depth to which a tree grows in a boosted ensemble. In Figure 3 we show how changing the learning rate and the training data volume changes the accuracy of the trained BDT models. In the final version of our experiments, we use a learning rate of 0.01 and 10 million data points, of which 48% is used for training, 32% for validation and early stopping, and 20% for testing. More details on hyperparameter correlation and selection can be found in subsection 5.2.
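For reference, the final BDT configuration corresponds to a hyperparameter set along these lines (a sketch only; the surrounding `xgboost.XGBRegressor` call and the remaining defaults are assumed rather than quoted from a released script):

```python
# Final choices from the hyperparameter surveys; max_depth ~ 15-20
# and 25-round early stopping come from the depth and learning-rate
# surveys discussed in the text.
bdt_params = {
    "max_depth": 15,
    "learning_rate": 0.01,
    "n_estimators": 100_000,      # upper bound; early stopping decides
    "eval_metric": "rmse",
    "early_stopping_rounds": 25,  # stop once the validation RMSE plateaus
}
```

These keys match the standard XGBoost scikit-learn interface, so the dictionary can be unpacked directly into the regressor constructor.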

Hyperparameter surveys for boosted decision trees
Maximum depth of trees in the ensemble: The BDT models are trained with an early-stopping condition which halts the growth of the ensemble once the RMSE stops decreasing, checked over 25 rounds. This makes the hyperparameters used to train a BDT correlated to a certain extent. For example, a decrease in the learning rate increases the number of trees grown until the optimum is reached, as can be seen in Figure 4. However, as one increases the maximum depth to which each tree can grow, the total number of trees grown decreases. The number of nodes of a tree grows exponentially with its depth and, hence, allowing for a larger maximum depth results in a much larger disk size for the trained models. This is aggravated further with higher-dimensional data. Therefore, when portability is a concern, BDTs cannot be used for high-precision applications on higher-dimensional data.
Learning rate and maximum depth of trees: When exploring the learning rate for the BDT models in Figure 4, we find that, initially, with decreasing learning rate, starting at 1, the accuracy of the trained models increases, but beyond a point the accuracy decreases. This is most evident for shallower trees. We also note that the accuracy of the models increases with the maximum depth of the trees, but only up to a certain depth. In the example in the right panel of Figure 4, which uses the 8D data, the accuracy of the model increases up to a maximum depth of 15 and then decreases.

Figure 4. Left panel: the variation of the number of trees grown in a BDT ensemble increases rapidly with decreasing learning rate when using an early-stopping criterion. Note: the learning rate for BDTs with maximum depths 20 and 50 could not be reduced below 0.1, as the memory consumption while training the models with 8D data becomes too large for a single node in a high-performance computing cluster. Right panel: a larger maximum depth for BDT ensembles gives better accuracy up to a certain value, after which the accuracy falls; the optimal value for the maximum depth seems to be 15 or 20. The learning rate likewise has an optimum, after which accuracy decreases or plateaus.

Deep Neural Networks and Skip Connections
For the neural networks, we focus on the depth, width and number of trainable parameters of the regressor (denoted as width-depth (trainable parameters) in the tables and figures). The depth of the sk-DNN denotes the number of sequential sk-DNN blocks in the regressor, not the total number of layers. The width of the sk-DNNs is chosen to be half the width of the DNNs, and the depth of the sk-DNN is adjusted so that they have approximately the same number of parameters as the DNNs of similar depth. An sk-DNN with 2 blocks is an exception and has more parameters than the corresponding DNN with 2 layers. The data strategy remains the same as for the BDTs.

We performed tests for various activation functions keeping all other hyperparameters and data strategies the same. We use the 4D and 8D datasets with a 9-deep, 36-wide sk-DNN for 4D and a 9-deep, 50-wide sk-DNN for 8D, on an exponential learning rate schedule and with data normalized using Equation (3). We explore only non-trainable activation functions, namely the ReLU, Leaky ReLU, ELU and softplus activation functions. The last three were chosen as they are similar to ReLU and have shown improved learning abilities in several domains [22, 24, 25]. As in the other experiments, the models were trained with an early-stopping criterion. From Figure 5 we see that the Leaky ReLU activation function far outperforms all other activation functions, with a narrower error distribution. This is more prominent for the 4D data than for the 8D data. Hence, for all experiments in this work, we use the Leaky ReLU activation function.

Figure 6. The δ = (f(x)_predicted − f(x)_true)/f(x)_true distribution for the various 2D (upper panels), 4D (middle panels), and 8D (lower panels) regressors. The labels for the DNN and sk-DNN designate depth-width (number of parameters), where depth is the number of layers for the DNN and the number of blocks for the sk-DNN. For the BDTs, the labels denote max_depth: N, with max_depth being the maximum depth of the trees. Detailed analyses can be found in the text.
To compare the performance of the regressors we use the distribution of δ (defined in Equation (2)). We focus on this distribution because, for the high-precision requirement, it is important to identify the fraction of the test data that has large errors. We will use the following statistics:

|δ| < 1%: the fraction of the test set that has |δ| less than 1%
µ_δ: the mean of the δ distribution
σ_δ: the standard deviation of the δ distribution

Model Comparison
Baselines: To understand the efficacy of the optimization strategies that we developed, we build a baseline without any optimization for the BDTs, DNNs, and sk-DNNs. We do not normalize the data as described in Section 3; rather, we only log scale the data. We set the train-validation split to 80%-20%. For the BDTs, we use an ensemble with max_depth = 50 and set the learning rate to 0.1. For the DNNs and sk-DNNs, we fix the learning rate of the Adam optimizer at 10^−3, lower the patience to 10 rounds, and use the most effective architecture chosen from amongst the high-precision regressors. The results are presented in Table 2 and Figure 6. We see that without the optimizations the regressors perform very poorly.

Table 2. Parameters extracted from the error distributions shown in Figure 6. Predictions from 1 million test samples were used to generate these statistics. The best-performing models for each of 2D, 4D, and 8D are marked in bold. More details regarding the notations are available in the text.

Key results: We present the results of the experiments in Table 2 and Figure 6. We show the distribution of errors over two variables, the square root of the center-of-mass energy, √s → x_1, and cos θ → x_2, in Figure 7. It is clear that the BDTs far outperform the DNNs in 2D. However, at 4D and 8D the sk-DNN not only outperforms the fully connected DNNs but also outperforms the BDTs, as can be seen from the distributions in Figure 6 and the numbers in Table 2. While at 4D the improvement in accuracy of the DNN and sk-DNN over the BDTs is marginal, at 8D the improvement is quite significant. One major disadvantage of the BDTs is that they take up significant disk space as the ensemble grows large, especially at higher dimensions, which is necessary for high-precision applications but affects their portability. Hence the sk-DNNs are a good solution for a portable yet accurate regressor that meets the thresholds we set at the beginning of this work.

Figure 1 .
Figure 1. The effects of normalizing the second-order term, f_02(x), with the zeroth-order term, f_00(x). The two functions are highly correlated (ρ = 0.87 for 8D, ρ = 0.91 for 4D, and ρ = 0.96 for 2D), and the resulting normalized function, f(x) = [f_11(x) + f_02(x)]/f_00(x), is much less peaked.
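The effect of this normalization can be seen in a toy NumPy sketch: because the higher-order terms track the zeroth-order term closely, dividing by f_00 collapses a sharply peaked target into a narrow, nearly flat one (the synthetic correlation structure below is ours, chosen only for illustration):

```python
import numpy as np

# Toy stand-ins for the order-by-order terms; in the paper these come
# from first-principles computations and are highly correlated.
rng = np.random.default_rng(2)
f00 = rng.uniform(1.0, 10.0, size=1000)              # zeroth-order term
f11 = 0.10 * f00 * rng.uniform(0.9, 1.1, size=1000)  # correlated higher-order terms
f02 = 0.05 * f00 * rng.uniform(0.9, 1.1, size=1000)

# Normalizing by the correlated zeroth-order term flattens the target.
f_norm = (f11 + f02) / f00
```

The unnormalized sum f_11 + f_02 inherits the full dynamic range of f_00, while f_norm varies only within a narrow band, which is an easier target for a regressor.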

Figure 2 .
Figure 2. The building block for a DNN with skip connections. The first two layers are fully connected with Leaky ReLU activation. The last layer has linear activation and is added to the input through the skip connection before being transformed with a non-linear Leaky ReLU function. The matrix W is a trainable weight matrix if the input and output dimensions of the block differ, and the identity I otherwise. The blocks are stacked sequentially to form the neural network.
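The forward pass of one such block can be sketched in plain NumPy (an illustration of the caption's description, with our own parameter names; it is not the authors' implementation):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def sk_dnn_block(x, W1, b1, W2, b2, W3, b3, W_skip=None, alpha=0.01):
    """One sk-DNN block: two Leaky ReLU layers, a linear layer, a skip
    connection (identity when input and output widths match, a trainable
    matrix W_skip otherwise), then a final Leaky ReLU."""
    h = leaky_relu(x @ W1 + b1, alpha)
    h = leaky_relu(h @ W2 + b2, alpha)
    h = h @ W3 + b3                            # linear activation
    skip = x if W_skip is None else x @ W_skip # identity or trainable skip
    return leaky_relu(h + skip, alpha)

# Toy block with matching input/output width, so the skip is the identity.
rng = np.random.default_rng(3)
d = 4
params = [rng.normal(0, 0.1, size=s) for s in
          [(d, d), (d,), (d, d), (d,), (d, d), (d,)]]
out = sk_dnn_block(rng.normal(size=(2, d)), *params)
```

Stacking these blocks sequentially, as the caption describes, yields the sk-DNN; the skip path lets gradients bypass the non-linear layers, which is the same motivation as in ResNets.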

Figure 4 .
Figure 4. Left panel: The number of trees grown in a BDT ensemble increases rapidly with decreasing learning rate when using an early-stopping criterion. Note: the learning rate for BDTs with maximum depths 20 and 50 could not be reduced below 0.1, as the memory consumption while training the models with 8D data becomes too large for a single node in a high-performance computing cluster. Right panel: A larger maximum depth for BDT ensembles gives better accuracy up to a certain value, after which the accuracy falls; the optimal value for the maximum depth appears to be 15 or 20. The learning rate likewise has an optimal value, beyond which the accuracy decreases or plateaus.

Figure 5 .
Figure 5. A comparison between Leaky ReLU, ReLU, softplus, and ELU using an sk-DNN, for the 4D data with 9 blocks of width 36 (left panel) and for the 8D data with 9 blocks of width 50 (right panel). The Leaky ReLU activation function outperforms every other activation function, and this is more prominent at 8D than at 4D. We use it for all experiments with DNNs and sk-DNNs in our work.

Figure 6. The δ = (f(x)_predicted − f(x)_true)/f(x)_true distribution for the various 2D (upper panels), 4D (middle panels), and 8D (lower panels) regressors. The labels for the DNN and sk-DNN designate depth-width (number of parameters), where depth is the number of layers for the DNN and the number of blocks for the sk-DNN. For the BDTs, the labels denote max_depth: N, with max_depth being the maximum depth of the trees. Detailed analyses can be found in the text.