Developments in data science solutions for carnivore tooth pit classification

Competition for resources is a key issue in the study of early human evolution. Since the emergence of the first hominin groups, carnivores have played a fundamental role in the ecosystem. From this perspective, understanding the trophic pressure between hominins and carnivores can provide valuable insights into the context in which humans survived, interacted with their surroundings, and consequently evolved. While numerous techniques already exist for the detection of carnivore activity in archaeological and palaeontological sites, many of these techniques present important limitations. The present study builds on a number of advanced data science techniques to confront these issues, defining methods for the identification of the precise agents involved in carcass consumption and manipulation. For the purpose of this study, a large sample of 620 carnivore tooth pits is presented, including samples from bears, hyenas, jaguars, leopards, lions, wolves, foxes and African wild dogs. Using 3D modelling, geometric morphometrics, robust data modelling and artificial intelligence algorithms, the present study obtains between 88 and 98% classification accuracy, with balanced overall evaluation metrics across all datasets. From this perspective, and when combined with other sources of taphonomic evidence, these results show that advanced data science techniques can be considered a valuable addition to the taphonomist's toolkit for the identification of precise carnivore agents via tooth pit morphology.


Extended Samples
The samples used for the present study originate from multiple sources, all of which were collected and handled by one of the authors (JY). With the objective of providing the maximum level of transparency possible, the present supplementary section provides a number of details regarding the nature of these samples, alongside a list of publications that have used them.
All bite damage samples obtained were found on the diaphyses of appendicular long bones. The only exception to this is the Panthera pardus sample (see main text and Supplementary Appendix 2). Once all bones had been collected, they were cleaned for 12 hours in boiling water. So as to avoid any possible damage to the tooth marks, this boiling process was performed solely using water, without the use of any added chemicals.
Samples included:
• Brown Bears (Ursus arctos, Ursidae, 69 pits). This sample was obtained from bears living in the Cabárceno natural park, collected during the months of June and July, 2020. All marks were found on the diaphyses of horse long bones. All tooth marks were produced by multiple adult brown bears of varying sexes. Bones were collected a couple of days after the initial feeding. The present study is the first to use this sample.
• Spotted Hyenas (Crocuta crocuta, Hyaenidae, 86 pits). This sample was obtained from hyenas living in the Cabárceno natural park, collected during the year 2012. All marks were found on the diaphyses of horse tibiae and radii. All tooth marks were produced by a single female adult hyena. Bones were exposed to hyenas for only a few hours, considering how bones left for longer were typically completely consumed. This sample was originally published in the work of Domínguez-Rodrigo and colleagues (2015) 1 . This sample, however, has also been included in a large number of other studies, including analyses presented by Aramendi.
• Wolves (Canis lupus, Canidae, 80 pits). This sample was obtained from a mixture of wild wolves and others living in parks. Four main samples were used: the first two consisted of wild wolves from the areas of Flechas and Villardeciervos (Zamora, Castilla y León), a third sample was obtained from the Cabárceno natural park, and a final sample from the Hosquillo natural park. The original tooth marks from these 4 samples amounted to a total of 283 digitised tooth pits and 288 tooth scores. So as to avoid subjective bias induced by manually selecting 3D models from these 571 tooth marks, an algorithm was used to randomly select 80 marks. The number 80 was chosen primarily to match the relative sample sizes obtained from the other animals. Across all four samples, marks were found on the diaphyses of horse, sheep, deer, ibex, boar and cow appendicular elements. Differences in animal size have previously been shown not to be an important conditioning factor in these types of statistical analyses 10 . All tooth marks were produced by groups of adult wolves, with varying pack sizes (>5) and different sexes. In the case of the Hosquillo and Cabárceno samples, bones were collected between a number of days and a number of weeks after the initial feedings. A large number of studies have published data regarding these samples. The Cabárceno sample was originally published by Yravedra et al. (2014, 2017, 2019) 6,11,12 , and has subsequently been used by Aramendi 10 . Both the Villardeciervos and Flechas samples are yet to be published in detail (in prep).
• African Wild Dogs (Lycaon pictus, Canidae, 89 pits). This sample was obtained from African wild dogs living in the Cabárceno natural park, collected during the year 2010. All marks were found on the diaphyses of horse radii and tibiae. All tooth marks were produced by a pair of African wild dogs of both sexes. Bones were collected a couple of days after the initial feeding. This sample was originally published in the work of Yravedra et al. (2014) 12 .
• Foxes (Vulpes vulpes, Canidae, 53 pits). This sample was obtained from wild adult foxes in the area of Ayllón (Segovia), collected during the year 2002. All tooth marks were produced on the diaphyses of sheep long bones. This sample was originally published in the work of Yravedra et al. (2014) 15 , but has also been used in Yravedra et al. (2019) 11 .
• Jaguars (Panthera onca, Felidae, 77 pits). This sample was obtained from jaguars living in the Cabárceno natural park, collected during the years 2009 and 2010. All marks were found on the diaphyses of horse tibiae and radii. All tooth marks were produced by multiple jaguars of varying sexes. Bones were collected a couple of days after the initial feedings; the sample, however, includes feeding sessions spanning a number of months. This sample was originally published in the work of Domínguez-Rodrigo and colleagues (2015) 1 . This sample, however, has also been included in a large number of other studies, including analyses presented by Aramendi 16 .
• Leopards (Panthera pardus, Felidae, 84 pits). This sample was obtained from Sri Lankan leopards (subspecies P. p. kotiya) living in the Biopark of Fuengirola (Málaga), collected during the autumn of 2020. This is the only sample of the present study to include cow axial skeletal elements. All tooth marks were produced by a pair of leopards of varying sexes. Bones were collected a couple of days after the initial feeding. The present study is the first to use this sample.
• Lions (Panthera leo, Felidae, 82 pits). This sample was obtained from lions living in the Cabárceno natural park, collected during the year 2011. All marks were found on the diaphyses of horse radii and tibiae. All tooth marks were produced by a number of adult lions of varying sexes. Bones were collected a couple of days after the initial feeding. This sample was originally published by Domínguez-Rodrigo et al. (2012) 18 and Gidna et al. (2013) 19 . This sample, however, has also been included in a number of other studies, including analyses presented by Domínguez-Rodrigo et al.

Unsupervised Learning
Beyond PCA, a total of 8 algorithms with 11 variants were used for the purpose of advanced data science applications. These included both supervised and unsupervised algorithms. Prior to any advanced computational modelling, datasets were cleaned using unsupervised Isolation Forests (IFs) 21 . IFs are a development of recursive partitioning trees, using the term isolation to refer to the 'separation of instances that are "few" and "different" from the rest of the instances' 21 . IFs thus consider "isolated" instances to be data points that are separated closest to the root of the tree (i.e. those with the shortest average path lengths). From here, anomaly scores can be computed, with scores above 0.5 considered as outliers.
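As a brief illustration, the following is a minimal sketch of this type of cleaning using scikit-learn; it is not the authors' exact pipeline, and the feature matrix and threshold handling are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = rng.normal(size=(86, 30))       # placeholder: e.g. 86 tooth pits x 30 shape variables

iso = IsolationForest(random_state=42).fit(X)
anomaly = -iso.score_samples(X)     # sklearn returns the opposite of the original 0-1 anomaly score
X_clean = X[anomaly <= 0.5]         # scores above 0.5 treated as outliers, as described above
print(f"removed {np.sum(anomaly > 0.5)} of {X.shape[0]} observations")
```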
GANs are unsupervised algorithms that consist of two different neural networks: a Discriminator (D) and a Generator (G) 22,23 . G is used to produce synthetic information, while D evaluates G's output for authenticity. The two are trained simultaneously in a "zero-sum game" 23 , with the intention of producing a generator capable of creating synthetic data that D is unable to tell apart from the original dataset 22,23 . G creates this data by sampling from a hypothetical distribution (e.g. N (µ = 0, σ = 1)) in the form of a fixed-length vector, known as a latent vector. The neural network is then trained to find the best means of mapping the hypothetical distribution onto the real sample domain. The output of G is then fed as input into D, which attempts to predict class labels for the generated points (real (1) or fake (0)).
For GANs, three different loss functions were considered 24 : the Least Squares loss function (LSGAN) 30 , the Wasserstein loss (WGAN) 31 and a variant of WGAN using Gradient Penalties (WGAN-GP) 32,33 . Models were trained for 1000 epochs using an R^50 latent vector as input. For more details on the models used and their architectures consult Courtenay and González-Aguilera 25 .
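To make the general setup concrete, the sketch below shows a minimal GAN for augmenting a small tabular sample. It is not the authors' implementation (see Courtenay and González-Aguilera 25 for the architectures actually used): it uses a standard binary cross-entropy loss rather than the LSGAN/WGAN losses, and the layer sizes and number of input features are assumptions; only the R^50 latent vector and the 1000-epoch budget follow the text.

```python
import numpy as np
from tensorflow.keras import layers, models

LATENT_DIM = 50        # R^50 latent vector, as described above
N_FEATURES = 30        # assumed number of shape variables per tooth pit

def build_generator():
    return models.Sequential([
        layers.Dense(64, activation="relu", input_shape=(LATENT_DIM,)),
        layers.Dense(64, activation="relu"),
        layers.Dense(N_FEATURES, activation="linear"),   # one synthetic observation
    ])

def build_discriminator():
    return models.Sequential([
        layers.Dense(64, activation="relu", input_shape=(N_FEATURES,)),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),            # real (1) vs fake (0)
    ])

generator = build_generator()
discriminator = build_discriminator()
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# Combined model: D is frozen while G is trained to fool it
discriminator.trainable = False
z = layers.Input(shape=(LATENT_DIM,))
gan = models.Model(z, discriminator(generator(z)))
gan.compile(optimizer="adam", loss="binary_crossentropy")

def train(real_data, epochs=1000, batch=32):
    for _ in range(epochs):
        # train D on a half-real, half-fake batch
        idx = np.random.randint(0, real_data.shape[0], batch)
        noise = np.random.normal(0, 1, (batch, LATENT_DIM))
        fake = generator.predict(noise, verbose=0)
        discriminator.train_on_batch(real_data[idx], np.ones((batch, 1)))
        discriminator.train_on_batch(fake, np.zeros((batch, 1)))
        # train G so that D labels its output as "real"
        noise = np.random.normal(0, 1, (batch, LATENT_DIM))
        gan.train_on_batch(noise, np.ones((batch, 1)))

# usage (with a hypothetical, standardised sample of landmark-derived variables):
# train(real_sample)
# synthetic = generator.predict(np.random.normal(size=(100, LATENT_DIM)), verbose=0)
```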
MCMCs are a family of algorithms that are typically used as inference engines in probabilistic programming 29 . MCMCs, as their name suggests, combine two fundamental data science concepts: Markov Chains and Monte Carlo simulations. Markov Chains are stochastic models that can be used to explore the entirety of a distribution via "random walks", whereby the chain makes predictions on the position of future states based solely on the position of its present state, as well as the probability of transition to the next 28,29 . The Monte Carlo element of MCMC involves the simulation of samples drawn from the probability distribution, generating ensembles of Markov Chains that search for areas of high information density 28 . While many types of MCMC exist, the present study used the Metropolis-Hastings 1st Order variant 26,27 . The Metropolis-Hastings algorithm draws samples from a probability distribution p(x) based on the distribution's density. Samples are calculated iteratively, with the next state of the Markov Chain being determined by calculating its suitability using the Metropolis-Hastings criterion 29 . If the proposed next state is not accepted according to this criterion, the current state is maintained and the process begins again.
MCMCs were used to generate samples from each of the sample distributions, thus generating new synthetic data points that are statistically likely to belong to the same distribution. While a number of techniques exist to define the distribution to be sampled from (θ) 29 , the present study strictly defined p(θ|x) using robust statistical techniques. These considered robustly defined Gaussian priors, as supported by the Central Limit Theorem (see Supplementary Appendix 4), as well as skewed distributions 34,35 . In order to ensure the adaptability of the theoretical Gaussian to the empirical non-Gaussian distributions, robust metrics were used to define the central tendency and variance of these theoretical models (eq. S2). This included the calculation of sample medians (x̃) and the Normalized Median Absolute Deviation (NMAD) (eq. S3) 36 . Skewed distributions, on the other hand, were defined in a similar manner, while including an additional conditioning variable α to simulate the asymmetry observed in the original samples.
Once the target θ had been defined, MCMCs were trained for 10,000 to 30,000 iterations, discarding the first 2,000 iterations as the "burn-in" period. Evaluation of MCMCs was performed considering chain autocorrelation as well as the Effective Sample Size (ESS) metric 29,37,38 . Autocorrelation occurs when sampled values are dependent on the values sampled at other iterations, whereas ESS estimates the number of effectively independent samples a chain produces once this autocorrelation is taken into account. An ideal autocorrelation value for MCMC is ≈ 0, while ESS values between 1,000 and 10,000 are recommended. For the Metropolis-Hastings proposal mechanism, a number of different hyperparameters were also tested, mainly focused on defining the optimal σ value for the symmetrical proposal distribution. Each of these tests considered σ ∈ [0.1, 3), with σ = 0.3 producing the best results.
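A minimal sketch of this sampling scheme for a univariate variable is shown below; the robustly defined Gaussian target (median and NMAD), the burn-in, the iteration count and the proposal σ = 0.3 follow the text, while everything else (variable names, seed) is illustrative.

```python
import numpy as np
from scipy.stats import norm, median_abs_deviation

def metropolis_hastings(sample, n_iter=10_000, burn_in=2_000, proposal_sigma=0.3, seed=42):
    rng = np.random.default_rng(seed)
    # Robustly defined Gaussian target (eq. S2-S3): median and Normalised MAD
    loc = np.median(sample)
    scale = median_abs_deviation(sample, scale="normal")     # NMAD
    log_p = lambda x: norm.logpdf(x, loc=loc, scale=scale)

    chain = np.empty(n_iter)
    current = loc                                            # start the chain at the median
    for i in range(n_iter):
        proposal = current + rng.normal(0.0, proposal_sigma)  # symmetric proposal
        # Metropolis-Hastings acceptance criterion (log form)
        if np.log(rng.uniform()) < log_p(proposal) - log_p(current):
            current = proposal
        chain[i] = current
    return chain[burn_in:]                                   # discard the burn-in period

# usage, with a hypothetical vector of PC scores for one sample:
# draws = metropolis_hastings(pc1_scores)
```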
Final evaluation of both GANs and MCMCs was performed using robust statistical approaches, as described by Courtenay and González-Aguilera 25 . First, both GANs and MCMCs were used to generate synthetic samples of the exact same size as the original samples. This was performed to ensure that the samples being compared (synthetic and original) were balanced, so as not to condition hypothesis testing results according to Cohen's d. Both samples were then subjected to a Two One-Sided equivalency Test (TOST), for the evaluation of the magnitude of similarities via Superior (εS or ∆S) and Inferior (εI or ∆I) equivalence bound parameters. Depending on sample homogeneity, either the traditional Welch's t-statistic or the trimmed non-parametric Yuen's robust t-statistic was used 39,40 . Once the most realistic synthetic-data generator had been established, the final model was used to augment all samples to n = 100, while ensuring that no repeated values were present.
Unsupervised computational applications were performed in both the R and Python programming languages (Sup. Appendix 8).

Supervised Learning
Two different supervised learning algorithms were considered for the present study: Support Vector Machines (SVM) and an extension of SVMs using Neural Networks (NSVM).
SVMs are highly powerful modeling algorithms, useful for both regression and discriminant analyses. Unlike typical discriminant functions, SVMs use soft maximized margins with an associated cost (c) hyperparameter to define decision boundaries that are less conditioned by the presence of outliers. The smaller the value of c, the softer the margin. From this perspective, SVMs are less likely to overfit and can be considered more flexible 41 . SVMs additionally have the distinct advantage of using a kernel trick to overcome traditional limitations imposed by linearity 42 . From this perspective, SVMs use a defined function to project data points onto a non-linear, high-dimensional feature space where the consequent decision boundary, or hyperplane, can be constructed 41,42 . A number of kernel functions exist for SVMs. The present study primarily considered the Radial Basis Function (RBF; eq. S4), as well as the Polynomial function, of which the former obtained the best results on all accounts. The radial kernel function has 1 main hyperparameter (γ), which scales the influence two points (x and x′) will have during the kernel's re-projection of the data.
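As a brief sketch of this setup (the dataset below is synthetic and the c and γ values are purely illustrative; the study's actual values were sample-dependent and tuned via Bayesian optimisation, as described further below):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic placeholder data standing in for landmark-derived shape variables
X, y = make_classification(n_samples=300, n_features=30, n_informative=10,
                           n_classes=3, n_clusters_per_class=1, random_state=42)

svm = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=10.0, gamma=0.1),   # c controls margin softness, gamma the kernel radius
)
scores = cross_val_score(svm, X, y, cv=10)   # k-fold cross-validation (k = 10)
print(f"mean CV accuracy: {scores.mean():.3f}")
```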
The Neural SVM (NSVM) is a concept originally proposed as a means of adding "depth" to SVMs 43 , using non-linear Neural Network architectures as a means of training a specified kernel function directly on the dataset. The added use of Neural Networks (NNs) thus provides a highly versatile and flexible "kernel", based on the combined use of multiple layers of varying neuron densities for feature extraction. For the NSVM, the present study additionally uses a single Random Fourier Feature (RFF) layer to define a complicated function on our original domain using Fourier Feature mappings 44-46 . In this context, RFF layers project data points randomly onto a line, and then transform this information into a new R^n feature space using a specified shift-invariant kernel function 46 . Two different initialization functions were considered for RFF: a Gaussian transformation kernel and a Laplacian transformation kernel (eq. S5), of which the latter produced the best results. Beyond the kernel initialization, RFF layers require two additional parameters: (1) the number of dimensions data will be mapped onto, and (2) a scale value α with similar properties to γ from eq. S4.
The present system thus used a Laplacian RFF map as the first layer of the NSVM, followed by a series of hidden NN layers. For NN hidden layer design, numerous experiments were performed using different architectures, experimenting with a wide range of parameters including different numbers of layers and layer densities, dropout layers, kernel initializers, batch normalization, weight regularizers and weight constraints 22 . For neuron activation, two primary activation functions were tried and tested: the traditional Rectified Linear Unit (ReLU) and a self-gated variant known as Swish 47 (eq. S6). After careful consideration, Swish was chosen as the optimal activation function for the present study. Firstly, Swish has been noted to preserve and improve gradient flow, facilitating the training of deep neural networks (more details can be consulted in Supplementary Appendix 5). Secondly, it was considered best to avoid the creation of very sparse vectors in the final layers of the NN, considering how the final neuron layers are used as input into the SVM portion of the NSVM. For the final layer, initial training of the NN component of the NSVM was performed using the softmax activation function, where K is the number of target labels (eq. S7).
Once defined, NNs were trained using batches of 32 for 1000 epochs. The loss function used for initial training was categorical crossentropy, using categorical accuracy as an initial evaluation metric. NNs were trained using the Adam optimization algorithm 48 with a Triangular2 cyclic learning rate (Base α = 0.0001, Max α = 0.01, Step Size = 16) 49 . A training-validation split of 80:20% was additionally used.
After training, the final softmax layer of the NN was removed and replaced with a multi-class Linear SVM. The SVM portion of the NSVM was trained using k-fold cross-validation (k = 10).
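The sketch below condenses this workflow under stated assumptions: a Laplacian RFF layer, dense Swish layers with a temporary softmax head, followed by a linear SVM trained on the penultimate-layer features. Layer sizes mirror the hyperparameters reported in the next paragraph (500 RFF dimensions, scale 0.26, 250 neurons, dropout 0.5, L2 = 1e-4); the optimiser uses a constant learning rate rather than the Triangular2 cyclic schedule, and the data (X, y) is assumed.

```python
import numpy as np
from tensorflow.keras import layers, models, regularizers
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def build_nsvm_backbone(n_features, n_classes):
    inputs = layers.Input(shape=(n_features,))
    x = layers.experimental.RandomFourierFeatures(
        output_dim=500, kernel_initializer="laplacian", scale=0.26)(inputs)
    x = layers.Dense(250, activation="swish",
                     kernel_regularizer=regularizers.l2(1e-4))(x)
    x = layers.Dropout(0.5)(x)
    features = layers.Dense(250, activation="swish",
                            kernel_regularizer=regularizers.l2(1e-4))(x)
    outputs = layers.Dense(n_classes, activation="softmax")(features)
    # return both the full classifier and a feature-extractor view of the same network
    return models.Model(inputs, outputs), models.Model(inputs, features)

# X, y assumed: standardised shape variables and integer class labels
# full, extractor = build_nsvm_backbone(X.shape[1], n_classes=len(np.unique(y)))
# full.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# full.fit(X, y, batch_size=32, epochs=1000, validation_split=0.2, verbose=0)
#
# # Replace the softmax head with a multi-class linear SVM trained on the learned features
# deep_features = extractor.predict(X, verbose=0)
# svm = LinearSVC(C=1.0)                          # c tuned via Bayesian optimisation in practice
# scores = cross_val_score(svm, deep_features, y, cv=10)
```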
Hyperparameter optimization for both SVM and all components of NSVM was performed using Bayesian Optimization Algorithms (BOA) 50-52 . For NSVM, BOA was used to define a large series of different parameters in the NN, including: the optimal dropout rate (0.5), number of neurons (250), L 2 parameters (0.0001), as well as the output dimensions (500) and α (0.26) for RFF. These parameters proved efficient for all datasets, regardless of the complexity or number of target labels. BOA was then used to fine-tune the optimal c for the SVM part of NSVM as well as the c and γ values in the Radial SVM. For both c and γ, these values were dependent on the sample at hand. For NSVM, BOA was run for 100 iterations, using an Expected Improvement selection function via the Tree-structured Parzen Estimator algorithm 52 . For SVM, random optimization was used for initialization and hyperparameter prior distribution definition 53 , similarly followed by Expected Improvement BOA for 30 iterations. While Gaussian Process Upper Confidence Bound and Probability of Improvement selection functions were also trialled and tested, they did not provide significant differences from their Expected Improvement counterparts.
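One possible implementation of this kind of TPE-based optimisation for the radial SVM's c and γ is sketched below using the hyperopt library; the text does not specify which software the authors used, and the search ranges and 30-evaluation budget are illustrative.

```python
import numpy as np
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def objective(params):
    # X, y assumed: the training data for one of the datasets
    model = SVC(kernel="rbf", C=params["C"], gamma=params["gamma"])
    acc = cross_val_score(model, X, y, cv=10).mean()
    return {"loss": 1.0 - acc, "status": STATUS_OK}     # minimise 1 - accuracy

space = {
    "C": hp.loguniform("C", np.log(1e-2), np.log(1e3)),
    "gamma": hp.loguniform("gamma", np.log(1e-4), np.log(1e1)),
}

# trials = Trials()
# best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=30, trials=trials)
# print(best)
```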
Evaluation of supervised learning algorithms took into account a wide array of popular evaluation metrics in machine and deep learning. These included: Accuracy, Sensitivity, Specificity, Precision, Recall, the Area Under the Receiver Operating Characteristic curve (AUC), the F-Measure (also known as the F1 Score), Cohen's Kappa (κ) statistic, and model Loss. Each of these metrics is calculated using confusion matrices, measuring the ratio of correctly classified individuals (True Positive & True Negative) as well as misclassified individuals (False Positive & False Negative). For more details see Supplementary Appendix 6.
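Most of these metrics can be obtained directly from scikit-learn, as in the following self-contained sketch; the toy data and classifier are placeholders for the trained models, and specificity can be derived from the confusion matrix.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score, classification_report, cohen_kappa_score,
                             confusion_matrix, roc_auc_score)

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

clf = SVC(kernel="rbf", probability=True).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)

print(confusion_matrix(y_te, y_pred))              # TP / TN / FP / FN counts per class
print(classification_report(y_te, y_pred))         # precision, recall (sensitivity), F1
print("Accuracy:", accuracy_score(y_te, y_pred))
print("Cohen's kappa:", cohen_kappa_score(y_te, y_pred))
print("AUC (one-vs-rest):", roc_auc_score(y_te, y_prob, multi_class="ovr"))
```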

Supplementary Tables
Supplementary Table 9. Classification results for individual species in the African Taxa dataset. Reported values include: Accuracy (Acc.), Sensitivity (Sens.), Specificity (Spec.), Precision (Prec.), Recall (Rec.), Area Under Curve (AUC), F-Measure (F) and Loss. All evaluation metrics (with the exception of Loss) are recorded as values between 0 and 1, with 1 being the highest obtainable value. Values over 0.8 are considered an acceptable threshold for powerful classification models. Loss considers values closer to 0 as the most confident models.

Supplementary Figures
Supplementary Figure 1. Phylogeny of carnivore taxa analysed within the present study and primary morphological tendencies. Data regarding phylogenetic signal represented through morphology has also been included. Figure created using the scikit-learn, geomorph and phytools Python and R libraries.

Supplementary Figure 2. Detailed graphical description of the landmark models employed. LM = Landmark, l = maximum length, w = maximum width, d = distance. Red landmarks are fixed landmarks; yellow landmarks are semilandmarks whose positions are established computationally.

Supplementary Appendix 1 : Carnivores Studied
The present Appendix intends to provide a general overview of the different characteristics presented by each of the animal species included in this paper. This brief summary hopes to present a number of the conditioning factors that may be of importance when considering carnivore feeding behaviour, and thus the nature of the tooth marks observed.
Considering how not all of the variables presented here could have been confronted in the present research, we hope that future investigation may be able to shed some light on a number of these features, revealing how some of these physical and ethological attributes may be directly or indirectly linked to the tooth pit morphologies described in this paper.
Tables S14-S18 contain information ranging from the very general to the precise allometric components that are frequently used to calculate bite force coefficients. Bite force coefficients usually also include further metric variables regarding the moment arm of the masseter as well as the temporalis; for specific details about these features we encourage the reader to consult the original work of the relevant authors 54-57 . Needless to say, in the authors' opinion the greatest weight needs to be given to details regarding carnassial bite force, considering these teeth to be the most used during general chewing activities. Additional data used in the present study regarded the phylogeny of the different carnivores analysed; this data was used to create the phylogeny presented in Supplementary Figure 1 62 . While other sources were also consulted, considering the slight disagreement that exists regarding the precise phylogeny of the Panthera genus, we decided to use the most up-to-date and recent information.

Supplementary Table 15. General ecology and behavioural attributes of each of the studied carnivores. All data was derived from Gittleman (1985) 59 , with the exception of running speed 58 .
CCH, CPD and CMD were derived from Valkenburgh (1987) 63 . Animals are presented in order of Carnassial Bite Force Coefficient values, with the carnivores presenting the strongest carnassial bite force in relation to carnivore size appearing first.
Supplementary Table 18.
Finally, we consider it important to point out that not all values and tooth mark results should be considered a complete analogy with those that may be found in the fossil record. The carnivores analysed here were chosen as the closest possible analogies to extinct carnivores, especially in the case of the Pleistocene European Taxa dataset. Here we have chosen P. pardus and P. leo as the closest possible analogies with animals such as the Eurasian Cave Lion (P. spelaea) and saber-tooth cat species such as Megantereon and Homotherium. Likewise, C. crocuta was chosen as the closest possible analogy with the Cave Hyena (C. crocuta spelaea). Needless to say, each of these animals is likely to have differences in biomechanical properties, and careful consideration must therefore be made when choosing future comparisons.

Supplementary Appendix 2 : Rib Cortical Density
Numerous considerations were taken into account when collecting tooth mark samples from P. pardus individuals, namely: (1) the anatomical element, and (2) the size of the animal from which the anatomical element was obtained.
While diaphyses are preferred for the analysis of tooth mark samples, this is not always possible, forcing analysts to adapt to the rules and regulations set forth by natural parks 64 . In the case of P. pardus, animals are typically fed axial elements, including articulated ribs alongside all meat packets.
3 primary sources of information were used to assess the suitability of bovine axial elements for the generation of tooth mark samples. These include, in order of significance to the present study: (1) calculations of bovine carcass composition 65 ; (2) an assessment of the mechanical properties of bovine rib in comparison with other materials 66 ; and (3) Bone Mineral Density data derived from Bovid (Connochaetes taurinus), Cervid (Rangifer tarandus) and Equid (Equus sp.) anatomical elements 67 .
The first of these sources 65 consists of numerous calculations of carcass composition using data from 38 different cattle. Each of these cattle weighed between 312 and 388 kg, with measurements concluding that, on average, between 56.7 and 62.9% of each carcass is composed of muscle, followed by 20.7-28.5% fat and 13.3-14.6% bone. Additional data show an average of 6.8-13.6mm thickness of fat around the 12th rib. In light of this data, it can be estimated that the general meat packets and associated periosteum of bovine rib cages are relatively large, providing the bone surfaces with enough "coating" to be mostly protected from extreme exertions of force such as those produced during mastication. While the current study is unable to estimate the amount of energy these muscular packages are likely to absorb, it is unlikely that for simple feeding (as opposed to hunting) leopards produce the 2268.7N molar bite force they have been estimated to have 55 . Additionally considering the reported average canine crown height of 31.9mm 63 , when adding together factors of cortical hardness and density, our first hypothesis is that leopards are unlikely to exert enough force to produce punctures or excessively deep/large tooth pits on the thickest sections of bovine rib bones.
The second source 66 confronts the use of different types of materials for the simulation of properties observed during surgical drilling. This study used 3.1 mm tungsten drills to test the mechanical properties of fresh human rib, porcine rib and bovine rib, as well as a number of synthetic materials typically used for the simulation of bone mechanical properties. The study concluded bovine rib to be the most difficult to prepare, arguing that its more elevated cortical thickness (2.3 ± 0.13 mm) and density (4490 kg/m³ [213.5, 334.6]) were important conditioning factors.
The study presented by Lam et al. 67 is the first of the three aforementioned studies to present data relevant to archaeological site formation processes. This study uses Computed Tomography scanning to extract digital cross-sections of a wide array of different anatomical elements from adult Connochaetes taurinus, Rangifer tarandus and Equus sp. individuals. From here, the authors derive cross-sections from which to calculate Bone Mineral Density (BMD). BMD measures were extracted using two different calculations (BMD 1 and BMD 2 ), of which only BMD 1 was considered in the present study. Each cross-section was consequently labelled with an ID indicating the type of bone (RI for Rib), and a number associated with where the cross-section was extracted (Fig. 1 of Lam et al. 67 ). Although the average weight of Connochaetes taurinus (140-290 kg) is lower than that of typical European domestic cattle, bovid cortical densities were recorded as being much higher than those of both the equid and cervid individuals.
Building on the data provided by Lam et al. 67 , BMD 1 values were compared between ribs and each of the long bones to test for significance in equivalence using the Two One-Sided equivalency Test (TOST). For the purpose of finding the densest section of the rib possible, only values from RI3 and RI4 were considered. Because the original tables (Table 1 67 ) only report mean and standard deviation values, the magnitude of differences could only be tested using the traditional Welch's t-statistic 39 . TOST in this context evaluates the magnitude of similarities via Superior (εS or ∆S) and Inferior (εI or ∆I) equivalence bounds. For samples to be considered equivalent, both ∆S and ∆I need to be associated with significant p-values. Contrary to typical hypothesis testing using MANOVA or ANOVA, for TOST H 0 considers samples to be different. Under this premise, p < 0.05 or p < 0.005 are considered significantly equivalent. Finally, considering that in the present context the direction of the difference matters (it is not the same for a long bone to be denser than the rib cortical as vice versa), greater importance was given to the magnitude of ∆S. From this perspective, even if the overall test indicates that densities are significantly different, as long as ∆S tested insignificant, we can conclude that differences between tooth marks found on bovine ribs and tooth marks found on long bone shafts are not of significant importance.
In summary, and as can be seen through both the visualisation of average density values (Fig. S3) and the calculation of ∆S and ∆I equivalence values (Table S19), the density of bovine rib cortical should be sufficient to provide the closest possible analogy with tooth marks produced on diaphyses by other animals. The only anatomical areas likely to create issues, however, are the central-most parts of intermediate and lower long bone shafts (RA3, TI3, UL2, MC3 & MR3). Nevertheless, considering carnivore tooth damage is more likely to appear on the transition between diaphyses, metaphyses and epiphyses, these observations are not likely to be of concern for the present study.
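For reference, the sketch below shows how such a TOST can be computed from summary statistics alone (means, standard deviations and sample sizes) using Welch's t and the Welch-Satterthwaite degrees of freedom; the example values and equivalence bounds are illustrative, not the BMD values reported by Lam et al.

```python
import numpy as np
from scipy import stats

def tost_from_summary(m1, sd1, n1, m2, sd2, n2, low, high):
    """Two One-Sided Tests of equivalence for the difference in means (m1 - m2)."""
    se = np.sqrt(sd1**2 / n1 + sd2**2 / n2)                        # Welch standard error
    df = (sd1**2 / n1 + sd2**2 / n2) ** 2 / (
        (sd1**2 / n1) ** 2 / (n1 - 1) + (sd2**2 / n2) ** 2 / (n2 - 1))
    diff = m1 - m2
    t_lower = (diff - low) / se                                    # H0: diff <= lower bound
    t_upper = (diff - high) / se                                   # H0: diff >= upper bound
    p_lower = stats.t.sf(t_lower, df)                              # one-sided p-values
    p_upper = stats.t.cdf(t_upper, df)
    return p_lower, p_upper                                        # both small => equivalence

# example with made-up density values and +/- 0.1 equivalence bounds
p_inf, p_sup = tost_from_summary(1.05, 0.08, 10, 1.00, 0.07, 10, low=-0.1, high=0.1)
print(f"Delta_I p = {p_inf:.4f}, Delta_S p = {p_sup:.4f}")
```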

Supplementary Appendix 3 : p-Values, Bayes Factor Bounds & False Positive Risks
In October of 2017, the American Statistical Association (ASA) held a 2-day symposium to discuss the use of p-Values and Statistical Significance. The product of said symposium is summarised in the 73rd Volume (Issue Sup1) of The American Statistician: a collection of 43 detailed articles on the use and mis-use of p-Values and the term "Statistically significant", alongside recommendations on how to proceed in a world beyond p < 0.05 68 .

While this collection builds on the original statements and principles proposed by the ASA after the symposium of October 2015, and published in 2016 69 , the more extensive special issue of 2019 strongly discourages the use of the term "significant". Under this premise, careful evaluation of each of the proposals has led us to conclude that the most suitable solution for the present study was to build on the recommendations set forth by Benjamin and Berger 70 , and Colquhoun 71 .
While a detailed description and analysis of the use and mis-use of p-values goes far beyond the scope of the present study, here we compile a number of the key points that were considered of great importance for the preparation and presentation of our research.

A Brief Overview of p-Value Historiography
The p-value has a long history, arguably dating back to the XVIIIth century, while being popularised and standardised at the beginning of the XXth century 72 . Among the many possible authors, notable early usage of p-values includes some of the most relevant names in statistics: Laplace (1827) 73 , Poisson (1837) 74 and Pearson (1900) 75 , inter alia. Nevertheless, few authors actually make an attempt at clarifying an acceptable "threshold" or norm on which future analysts can base their work. As early as Edgeworth (1885)'s article 76 we can find a possible reference to a certain threshold of importance:
"by far the greater proportion (namely 0.995)... Accordingly, if out of a set of [...] N statistical numbers which fulfil the law of error, we take one at random, it is exceedingly improbable that it will differ from the Mean to the extent of twice, and à fortiori thrice, the modulus... Such is the nature of the law of error" -Edgeworth (1885), Page 185 76
To some, at the time, 0.005 was considered excessive and not adaptable to all situations. The most notable and widely-accepted contributions in this field of research, therefore, are not found until the later works of Pearson (1900) 75 , Elderton (1902) 77 and Gosset (a.k.a. Student, 1908) 78 . From each of these studies, we find valuable insights into early statistical reasoning, namely with the definition of probability distributions that laid down a framework from which p-value calculations would be standardised and developed (χ 2 , t & z distributions). Building from these studies, we find the first detailed publications of p-values, with notable contributions by R.A. Fisher. Alongside a series of articles, Fisher published his 1925 book titled Statistical Methods for Research Workers 79 , containing the referential passage:
"The value for which p = 0.05... is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant." -Fisher (1925), Page 47 79
Here it is important to point out specifically the phrase "twice the standard deviation", from which it can be seen that 0.05 was not an arbitrarily chosen value, but was derived from the proportion of a Gaussian distribution lying beyond two standard deviations of the mean (≈ 0.05). Nevertheless, it is also important to point out that Fisher did not only use 0.05, and in some cases we find his tables including a large number of different p-values, with reference to 0.01 as another threshold for different circumstances.
Nevertheless, Fisher's work was not published without some debate. In a series of studies by Neyman and Pearson 80,81 , we find a slightly different approach to defining, not only p-Value calculations in hypothesis testing, but also the reasoning behind defining the 0.05 threshold:
"...the testing of statistical hypotheses cannot be treated as a problem of estimation" (pg. 492)
"H 0 is accepted when some alternative H i is true, the consequences that follow will depend upon the nature of H i and its difference from H 0 ... while it is the 'size' of P I (w) that matters, it is what may be termed the quality of errors contributing to the risk P II (w), that must be taken into account" -Neyman and Pearson (Oct, 1933), Page 497 80
"We deal first with the class of alternatives for which α > α 0 . If ε = 0.05... we shall probably decide to accept the hypothesis H 0 as far as this class of alternatives is concerned" -Neyman and Pearson (Jan, 1933), Page 303 81
Once again, we also find the authors contemplating a more robust threshold of 0.01 depending on the type of probability distributions at hand 81 .
Here, Neyman and Pearson approach the concept of hypothesis testing from a different perspective, stating that the acceptance of a hypothesis should not just depend on the H 0 distribution, but should also be contextualised with the H a distribution 80 . From this perspective, these authors search for a means of reducing possible error, with Type I (P I ) and Type II (P II ) statistical errors being False Positive and False Negative errors respectively. In sum, Neyman and Pearson discuss a significance level α of 0.05 or 0.01 for the P I risk, while β (1 − statistical power) defines the P II risk.
While the p-Value threshold between the two approaches does not differ greatly, the means of deriving this threshold is quite different. Under this premise, it could quite possibly be argued that this constant mention of 0.05 across multiple sources, even if derived from different calculations, is what has generated such confusion in the use of p-Values. As 0.05 appears to resonate throughout the history of applied statistics, with few questions regarding the reliability of this threshold, many authors misuse the concept of p < 0.05 in hypothesis testing without understanding the true origin of this value.
While the present study will refrain from declaring a preference for one approach or the other, we feel it is important to note that both approaches have their advantages and disadvantages, and that a general understanding of their origin is necessary in order to correctly evaluate p-value calculations. If we take into account some of the more recent research in the field of robust statistics, it may be argued that 2σ is not always the best reflection of reality when dealing with non-symmetrical probability distributions. While the t and z distributions are more robust to a number of these cases, presenting longer tails and a means of making general inferences regarding the population, limitations are still present and the definition of a single p-Value threshold is not always as straightforward.

p-Value Calibrations and the use of Bayes Theorem
Beyond the more anecdotal components of this section, and in order to define the methods used to evaluate p-Values in the present study, we think it is important to define what we mean when we refer to p-Values and hypothesis testing in the XXIst century. For this purpose we quote Wasserstein and Lazar's 2016 ASA statement; "a p-Value is the probability under a specified statistical model that a statistical summary of the data [...] would be equal to or more extreme than its observed value." -Wasserstein and Lazar (2016), Page 131 69 Among the many papers presented by the ASA's 2019 Special Issue (SI), an important point raised by most authors builds on the exclusively Frequentist perspective most scientists adopt when approaching hypothesis testing. As such, most authors who contributed to the SI raised the point that one of the best means of overcoming p-Value misuse would be to find a compromise between Bayesian and Frequentist approaches. As pointed out by Colquhoun 71 , however, this is not an easy task and many analysts are likely to try and avoid these approaches. Part of the Bayesian-Frequentist conflict stems from the distrust many non-statisticians have toward the concept of Bayesian Priors 29,71 . For full disclosure, the present authors are also admittedly guilty of this. While in archaeology a typical argument against Bayesian approaches considers that not enough is known about the fossil record in order to construct truly informative priors for Bayesian analysis, in many cases this does not mean that Bayesian approaches should be discarded. As pointed out by most Bayesian practitioners, the use of priors is only one component of the theorem, and while informative priors can be greatly beneficial to the quality of results, weakly-informative priors, or even diffuse priors, are still an option if little is known about the subject matter. Similarly, weakly-informative priors are a popular technique for regularization in many Bayesian learning strategies.
When using Bayes' theorem for hypothesis testing, we can define the following formula (eq. S8);
P(H a |x) / P(H 0 |x) = [P(x|H a ) / P(x|H 0 )] × [P(H a ) / P(H 0 )]    (eq. S8)
where x is our data, H a is the alternative and H 0 the null hypothesis. Observing this equation, it can immediately be seen how Bayesian approaches have the distinct advantage of meeting the requirements set forth by Neyman and Pearson 80,81 , whereby the equation considers the test's results not only with regards to H 0 but also in context with H a . Similarly, Bayesian statistics employ the use of prior and posterior probabilities to quantify levels of uncertainty within a statement 29 , providing statisticians with a means of defining their certainty or uncertainty about H a . Furthermore, this not only implies that the reliability of a p-Value threshold can be empirically assessed, but also that it is adaptable to the specific data at hand. In Bayesian statistics, this is usually seen in how the final output of a model is not a single value, but a probability distribution of all the possible values 29 .
The Bayesian alternative to p-Values in hypothesis testing are known as Bayes Factors (BFs). A Bayes Factor is the ratio of the marginal likelihoods of H a and H 0 . These values are usually interpreted in ranges, with BF 1 to 3 being weak, 3 to 10 being moderate, 10 to 30 being substantial, 30 to 100 being strong, and over 100 being very strong evidence in favour of H a 82 . In a similar sense, different authors have made attempts at categorizing p-Values in this manner, with 0.1 to 0.05 implying weak, 0.05 to 0.01 moderate, 0.01 to 0.001 strong, and <0.001 very strong evidence for H a 83,84 . Nevertheless, in some senses this still assumes a one-size-fits-all p-Value, regardless of the specific H a and H 0 distributions a study may have. Likewise, this can also create an illusion that the p-Value is a calculation of the probability of something being true, while being conditioned primarily on H 0 82 . To directly confront the differences between Frequentist-derived p-Values and Bayesian metrics, multiple efforts have been made to propose means of "calibrating" p-Values with BFs 70,82,85 , although not all produce the same results 70 . In part this can be explained by some of the issues in the way p-values have been defined, whereby the conversion of p-values according to Fisher's definition 79 requires a different formula (eq. S9) to those proposed by Neyman and Pearson 80,81 (eq. S10);
B(p) = −e · p · ln(p)    (eq. S9)
α(p) = (1 + [−e · p · ln(p)]^−1)^−1    (eq. S10)
Equation S9 is defined as the lower bound on the odds provided by the data for H 0 relative to H a . Equation S10, on the other hand, is defined as the posterior probability of H 0 based on equation S9, combined with the assumption of an equal prior probability for H a and H 0 (i.e. 0.5) 85 . Under this premise, we can see how, once again, α(p) fulfils Neyman and Pearson's criteria on defining probability distributions for both H a and H 0 .
In the 2019 ASA SI, we find an interesting development of these equations by Benjamin and Berger, providing 3 recommendations for improving the use of p-Values 70 . The authors: (1) argue that 0.05 should be replaced by 0.005, with values between 0.05 and 0.005 being referred to as "suggestive" evidence rather than "significant"; (2) suggest that analysts report a Bayes Factor Bound (BFB) value that supports the reported p-Value; and (3) suggest that analysts report the prior odds of H 1 to H 0 .
Firstly, with regards to point (2), a BFB is the upper bound, or confidence interval, with regards to traditional BFs. Benjamin and Berger thus define the BFB as an indication of "the highest possible BF consistent with the observed p-Value" 70 . Adapting equation S9 to this statement, BFB can thus be defined as (eq. S11) 70 ;
BFB(p) = 1 / (−e · p · ln(p))    (eq. S11)
BFBs are generally reported as odds against H 0 70 . Their interpretation can therefore consider 0.05, whose corresponding BFB value is 2.44, to mean that there are at most 2.44:1 odds in favour of H a being correct. For those who find odds hard to interpret, and would thus prefer a % value, BFBs can easily be converted to percentages via the following formula (eq. S12), multiplied by 100;
P U (H a |p) = BFB(p) / (1 + BFB(p))    (eq. S12)
The third recommendation presented by Benjamin and Berger requires these values be presented as posterior odds. This is simply carried out by multiplying the BFB value by the prior odds.
One key disadvantage to the use of Bayes factors is that they require the specification of a prior distribution, a topic that frequently produces disagreements or is too complex to define. In some cases where the prior probabilities are unknown, this value can be derived from other variables (see: http://fpr-calc.ucl.ac.uk/ 86 ); however, as suggested by Colquhoun 71 , one fair assumption to make is that these probabilities are equal (50:50, a.k.a. random). This is considered a valid diffuse prior for most hypothesis testing. In light of this, posterior odds for a p-value of e.g. 0.05 would have a posterior probability of BFB(0.05) · 0.5.
For ease of comparisons, Table S20 contains a number of p-values calibrated with BFB, P U (H a |p) and their corresponding posterior odds.
The final consideration made regarding p-Values in the present study employed the use of the False Positive Risk (FPR) approach by Colquhoun 71 .
The FPR is a powerful notion developing a number of the components seen up to this point. FPR attempts to quantify the probability of declaring that an effect is real when it is, in fact, not. This is very similar to Neyman and Pearson's notions of the α threshold and P I risk, considering both refer directly to the False Positive Rate of a prediction. In order to calculate this, Colquhoun uses Bayes' theorem to define the likelihood ratio in favour of H a as eq. S13, where FPR can then be calculated through eq. S14 and, in the case where equal prior probabilities for H a and H 0 are defined, we can simplify this equation to eq. S15;

Throughout his article, Colquhoun provides three main formulae for defining L 10 , each with their own peculiarities. For the purpose of simplicity, however, and so as to provide a means of calibrating Benjamin and Berger's BFB calculations with FPR, the present study has employed the Sellke-Berger approach defined in Appendix A2 71 , using eq. S11 as L 10 . Interestingly, using BFB in this sense forces FPR to be the complement of P U (H a |p) (i.e. FPR = 1 − P U (H a |p)), creating a complementary metric that calculates the probability of Type I errors being present alongside the upper-bound probability of H a being true.
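A compact Python sketch of these calibrations is given below; the authors' own implementations are provided in R on the corresponding author's GitHub page, and equal prior odds are assumed unless stated otherwise.

```python
import numpy as np

def bfb(p):
    """Bayes Factor Bound: the largest BF in favour of Ha consistent with p (eq. S11)."""
    return 1.0 / (-np.e * p * np.log(p))

def posterior_upper_bound(p, prior_odds=1.0):
    """Upper bound on P(Ha | p), given prior odds of Ha to H0 (eq. S12)."""
    odds = bfb(p) * prior_odds
    return odds / (1.0 + odds)

def false_positive_risk(p, prior_odds=1.0):
    """Sellke-Berger style FPR: the complement of the posterior upper bound."""
    return 1.0 - posterior_upper_bound(p, prior_odds)

for p in (0.05, 0.005, 0.003):
    print(f"p = {p}: BFB = {bfb(p):.2f}, "
          f"P_U(Ha|p) = {posterior_upper_bound(p):.3f}, "
          f"FPR = {false_positive_risk(p):.3f}")
```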
So as not to saturate the main text with large quantities of hard-to-read numbers, we have provided the following tables and figures as a frame of reference for both the BFB and FPR approaches. The main text, alongside all Supplementary Tables, includes BFB ratios according to eq. S11. Posterior probabilities according to Benjamin and Berger's recommendation (3) have been included in Table S20. Likewise, this same table contains FPR calculations for each of the corresponding p-Values. Figures S4 & S5 provide a graphical representation of the trend between p-Values and each of the Bayesian-based metrics. One characteristic of Fig. S4 that is of notable importance is the difference between Bayes values when using α(p) and B(p), especially at p = 0.05. From here it can be seen that only when p ≈ 0.005, or closer to 3σ, do the two values coincide, likely indicating that 0.005 or 0.003 would be a more robust threshold for considering a p-Value "significant" in the eyes of Benjamin and Berger's recommendation (1) 70 .
When consulting plots of the curves for each of these equations, interesting patterns emerge, namely a non-symmetric U-shaped curve (Fig. S4 & S5), indicating that BFB and FPR values are not unique to a single p-Value. This can consequently be interpreted as a means of reversing the focus from H a to H 0 , with values falling on one side of the curve supporting either one of these hypotheses. To find the limit between p(H a ) and p(H 0 ), a simple experiment was devised, calculating the point of maximum curvature for each curve using steps of 0.0001 across p-Values. This experiment found p = 0.3681 to be the limit between each hypothesis. Under this premise, in the main text as well as in supplementary tables, p-Values over 0.3681 were reported as BFBs against H a , and vice versa.
Finally, to observe the effects different priors would have on FPR and P U (H a |p) curves, Figure S6 shows how priors affect the morphology of these curves. In conclusion, and as evidenced by the numerous equations, tables and figures provided within this appendix, part of the reasoning behind our choice of p < 0.005 as significant (and, where possible, p < 0.003) is based on how p = 0.05 only implies a 71% probability in favour of our alternative hypothesis when our prior probabilities are 1:2, leaving a 28.9% risk of Type I errors, while p = 0.005 lowers this risk to 6.74%, and p = 0.003 lowers it even further to 4.5% of Type I errors.

Implementations programmed in R for all formulae presented in the current appendix have been included in the Corresponding Author's GitHub page alongside all code for the present study.

Supplementary Appendix 4 : Central Limit Theorem (CLT)
The Central Limit Theorem (CLT) is a key concept in probability theory, and as such has proven a valuable concept for data scientists. While a number of CLT variants exist, some of the most classic versions of the CLT are associated with the French statistician Pierre-Simon Laplace 87 . In the simplest terms, the CLT states that no matter the shape of a population distribution, the sampling distribution of sample means approaches a normal distribution (eq. S1) as the sample size increases. Laplace was even able to show, similar to some previous research by Abraham de Moivre (1733), how a binomial distribution, such as that formed by the throwing of dice, can be approximated by a Gaussian distribution as the number of die throws increases.
While many types of probability distribution exist, one of the most popular and central distributions in statistics is that defined by Carl Friedrich Gauss 88 . The Gaussian distribution is also known by a number of different names, one of the most common being the "Normal" distribution (N ), and is central to many elements of statistics. The Gaussian distribution is popular because of many convenient features: the probability distribution function is symmetric, well defined by the mean (µ) and standard deviation (σ), and seems to appear everywhere (hence the CLT). Nevertheless, this distribution is also unfortunately overused and, in many cases, misused, inspiring the development of a number of statistical sub-disciplines such as that of robust statistics.
While the current authors strive to ensure that the statistical tests performed within the current research are robust (See "Robust Statistics" of the main text), and not entirely dependent on hidden mathematical assumptions, it cannot be denied that the Gaussian distribution is central to statistical theory. In light of this, the present appendix intends to show how the samples used within the present study can still be converged on a Gaussian distribution for the purpose of MCMC modelling.

CLT Theory
To expand on some elements touched upon in Appendix 3, an interesting component of early investigation into p-Value calculations touched on concepts of normality 75,77,78 , especially in the context of sample size. While many elements of statistical analyses are dependent on the Gaussian distribution, a wide array of different distributions also exist. One very popular distribution used frequently in hypothesis testing is the t-Distribution, sometimes referred to as Student's t-distribution. The term "Student" originates from Gosset's use of the pseudonym "Student" when publishing his research in 1908 78 . A simplified formula for the t-statistic is (eq. S16);
t = (x̄ − µ) / (σ / √n)    (eq. S16)
where µ is the population mean, x̄ is the sample mean, σ is the sample standard deviation and n is the sample size. This dependence on n can be seen both in equation S16 and in the equation for the standard deviation (eq. S17);
σ = √( Σ (x i − x̄)² / (n − 1) )    (eq. S17)

Both equations are dependent on the sample size n, and are thus sensitive to the degrees of freedom. The reason behind this is that, as the sample size increases, the t-distribution approximates a Gaussian distribution. The advantages of using this approach are that (1) the t-distribution is more robust to outliers thanks to the probability density function's heavy tails 29 , and therefore (2) it can provide better inferences on the population from smaller samples. This is an important notion to highlight, considering how, as the sample size increases, the distribution appears more "normal". From this perspective, many analysts consider sample size to be a primary cause of a continuous sample's apparent lack of normality. Under this premise, while hypothesis testing should take inhomogeneity into account (parametric vs non-parametric tests), these observed deviations from normality should not always be considered a true reflection of the population's distribution.

Present Application
Considering how the objective of our study is to use augmentation techniques to simulate a sample's original population, these notions of normality with regards to sample size can be considered fundamental. While the empirical samples themselves do not always comply with the definition of a Gaussian distribution, this is not to say that the observed kurtosis and skewness values are representative of the population's normality, and thus a carefully adapted use of Gaussian priors can be justified for the augmentation of our datasets using MCMC. As a final attempt at justifying this decision, a simple experiment was devised to show how sample size is likely to distort our view of whether a population is normally distributed or not. Figure S7 presents 4 multivariate (2D) theoretical samples of varying sizes. The x axis of each has been further represented univariately using a density plot in the upper panels (Fig. S7). In each case, the population probability distribution function has also been represented (red dotted line), calculated using equation S1, with µ = 0 and σ = 1. The population was then sampled from, first extracting 10 points, then 50, then 100, then 1000. As can be seen, simply from a visual perspective, the empirical (black) distribution does not begin to resemble the theoretical (red) population distribution until at least n = 100. Table S21 provides complementary descriptive statistics for each of these graphs (focusing simply on the univariate x-axis for ease of describing the distributions). First, normality was tested. A number of tests exist to detect whether a sample is Gaussian or not; the present study (both here and in the main paper) primarily used the Shapiro-Wilk test for normality 89,90 , chosen due to its observed statistical power when compared with other methods 91 . As can be seen, despite the small sample size in the first case, all 4 samples test as Gaussian, with n = 50 coming the closest to rejecting the H 0 . Nevertheless, under the assumptions of a normal distribution with a predefined density (µ = 0, σ = 1), we should expect (1) the mean µ and median x̃ to coincide at 0, (2) the standard deviation and Normalised Median Absolute Deviation (eq. S3) to also coincide at 1, and (3) skewness to reflect a symmetrical distribution (≈ 0) alongside relatively low kurtosis (≈ 0). As can be seen in the table, these patterns only appear for n = 100 and n = 1000.
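A small Python sketch of this kind of sample-size experiment is given below; the sample sizes follow the text, while the exact values will differ from Table S21 since the draws are random.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
for n in (10, 50, 100, 1000):
    x = rng.normal(loc=0.0, scale=1.0, size=n)              # draw from the N(0, 1) population
    w, p = stats.shapiro(x)                                  # Shapiro-Wilk test for normality
    nmad = stats.median_abs_deviation(x, scale="normal")     # Normalised MAD (eq. S3)
    print(f"n={n:5d}  Shapiro-Wilk p={p:.3f}  mean={x.mean():+.3f}  "
          f"median={np.median(x):+.3f}  sd={x.std(ddof=1):.3f}  NMAD={nmad:.3f}  "
          f"skew={stats.skew(x):+.3f}  kurtosis={stats.kurtosis(x):+.3f}")
```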

Supplementary Appendix 5 : Neural Network Activation Functions
One of the key advantages presented by the use of Neural Networks (NNs) is their highly versatile non-linear attributes. Upon combining multiple layers of neurons, each with differing densities, NNs have proven capable of approximating any distribution function, solving advanced questions in multiple fields 22 .
A neuron is defined by simply multiplying an incoming signal (x 1 , . . . , x n ) with a set of weights (w 1 , . . . , w n ), such that the value of a neuron h is (eq. S18);
h = Σ i (w i · x i )    (eq. S18)

Here we have chosen to use h to represent a single neuron, with h standing for the term "hidden". A neural network is thus built up of multiple hidden neurons, each fully connected by a set of weights to the hidden neurons in the next layer. This continues until the final output layer.
With the intention of replicating the biological functions of the human brain, the first conceptualisation of the neuron (a.k.a. the perceptron) also introduced the concept of an activation function. An activation function attempts to determine whether neurons are "activated" or not, thus controlling the flow of information between the inputs and the outputs. From this perspective, a complex combination of multiple neuron layers with varying densities can map out complex non-linear fluctuations of information, helping map a set of inputs (x 1 , . . . , x n ) to their corresponding outputs (y i ). The activation function thus takes the value of the neuron (h) as input (f(h i ), or essentially f(x)), and decides to what extent this information gets passed on to the next neuron.

Rectified Linear Units (ReLU)
Arguably the most famous and commonly used activation function is the Rectified Linear Unit (ReLU). ReLU can be considered a non-linear adaptation of a typical linear function. While linear functions are represented simply as y = x, ReLU provides a means of turning a neuron off/on by only allowing certain values through. This was originally conceptualised as a means of filtering negative values, with ReLU only allowing positive (h > 0) information through. Mathematically this is simply (eq. S19);
f(x) = max(0, x)    (eq. S19)
While adaptations of ReLU exist that slightly adjust this filter gate 92-95 , ReLU is one of the most widely used activation functions for numerous reasons. The first is the simplicity of the function, both at a computational level and in how simple it remains while still being non-linear. Secondly, the derivative (f′(x)) of ReLU is also easy to calculate, considering it is half of the linear function (eq. S20);
f′(x) = 1 for x > 0, and 0 otherwise    (eq. S20)
Derivatives are used in calculus to calculate the slope of a function. Derivatives are therefore fundamental when performing weight optimization, so as to know how to adjust a weight in order to reduce the error. This is commonly known as Stochastic Gradient Descent (SGD). Likewise, the second derivative of ReLU is simply (eq. S21);
f′′(x) = 0    (eq. S21)
Derivative calculations become increasingly important the deeper the network is. From this perspective, it is fundamental that derivatives be easy to compute. From all three equations, it can be seen how ReLU has the distinct advantage of being (1) non-linear, (2) easy to compute, and (3) easy to optimise when using SGD and its variants.
ReLU, however, has some disadvantages. Firstly, the cut-off limit can often be considered too harsh, with the hinge of this function being abrupt and thus leading to the unnecessary deactivation of some neurons. In some of these cases, alternatives such as the Leaky ReLU function are powerful, yet in others these alternatives can allow too much information to pass through. The latter is often a cause of exploding gradients, which is precisely the opposite of ReLU's objective. Under this premise, while being "bounded" below (i.e. not allowing information below a certain threshold through) is advantageous, having too hard a cut-off may be an issue for gradient flow.
Finally, from the precise perspective of the present study, considering how NNs are used as kernel functions, with their outputs fed directly into a Support Vector Machine (SVM), we felt that such a strict deactivation of neurons would produce overly sparse vectors (i.e. vectors in which the majority of neuron outputs are 0), which could possibly over-restrict the learning process for SVMs. Under this premise, we decided to gate neurons with a softer lower bound, allowing some information to pass through, but not so much as to damage the learning process.

Self-Gated Rectified Activation Functions: Swish
Swish is an adaptation of ReLU originally proposed in 2017 by Ramachandran et al. (2017) 47 . The authors describe this function as follows; "Like ReLU, Swish is unbounded above and bounded below. Unlike ReLU, Swish is smooth and non-monotonic" - Ramachandran et al. (2017), Page 2 69 . The Swish activation function is thus defined mathematically as (eq. S22-S24);

$$ f(x) = x \cdot \sigma(x) $$
$$ f'(x) = f(x) + \sigma(x)\,(1 - f(x)) $$
$$ f''(x) = \sigma(x)\,(1 - \sigma(x))\,\big(2 + x\,(1 - 2\sigma(x))\big) $$

Before delving into the advantages of Swish, one notable component of the formula that catches the eye is the similarity between Swish and the Sigmoid function, σ (eq. S25). The Sigmoid function is famously attributed to algorithms such as those found in Logistic Regression, can also be found in elements of Gradient Boost classification algorithms, and is the typical activation function used in the final output of a binary-classification NN. The multi-output version of Sigmoid is, in fact, the Softmax function, previously defined in Equation S7 of the present document. Sigmoid (eq. S25) is a powerful activation function that works by taking an input and returning a value between 0 and 1. The derivatives of Sigmoid (eq. S26 & S27) are equally useful, as they are relatively easy to compute. Sigmoid, however, contrary to functions such as ReLU and Swish, is bounded both above and below; the maximum value that can be computed is 1, and the lowest value is 0.
Returning to Swish, it can be seen that Equation S22 incorporates Equation S25 by multiplying the input x by the Sigmoid function. The product of this is a smooth version of ReLU, with a soft lower bound and no upper bound. Swish therefore approximates ReLU, but without as harsh a hinge separating negative values from positive values. This essentially allows very small negative values to exist, thus improving gradient flow. This, in turn, helps decrease the possibility of sparse vectors being used as input to the SVM, while still allowing the most important features through. Figure S9 visually presents the curves for ReLU (eq. S19-S21), Sigmoid (eq. S25-S27) and Swish (eq. S22-S24). Here it can be seen how Swish is effectively the combination of ReLU and Sigmoid, with derivatives very similar to those of Sigmoid and an $f(x)$ morphology similar to that of ReLU, but with a smoother cut-off. From this perspective, and as pointed out by Ramachandran et al. (2017) 47 , Swish is a non-monotonic, smooth activation function that works very well on very deep neural networks, while still preserving the advantages of non-linearity and having no upper bound for improved weight optimization.
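The following R sketch implements Sigmoid and Swish from the standard definitions given above (eq. S22-S27), plotting Swish against ReLU and Sigmoid for comparison; it is an illustration of these formulae rather than a copy of the code referenced below.

## Sigmoid, Swish and their first derivatives
sigmoid    <- function(x) 1 / (1 + exp(-x))                  # eq. S25
sigmoid_d1 <- function(x) sigmoid(x) * (1 - sigmoid(x))      # eq. S26
swish      <- function(x) x * sigmoid(x)                     # eq. S22
swish_d1   <- function(x) sigmoid(x) + x * sigmoid_d1(x)     # eq. S23

x <- seq(-5, 5, by = 0.01)
plot(x, swish(x), type = "l", ylim = c(-1, 5), ylab = "f(x)",
     main = "Swish (solid), ReLU (dashed) and Sigmoid (dotted)")
lines(x, pmax(0, x), lty = 2)     # ReLU for comparison
lines(x, sigmoid(x), lty = 3)     # Sigmoid for comparison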
The R code used to produce these graphs and to calculate the derivatives has been included on the corresponding author's GitHub page, alongside all other code related to the present research.

Supplementary Appendix 6 : Computational Learning Evaluation Metrics
The evaluation of classification algorithms in computational learning is of the utmost importance, yet can often be misleading. In most contexts, the words "80% classification accuracy" would be taken to describe a powerful algorithm; however, as will be shown here, this is often not the case. In contexts such as medicine, for the construction of diagnostic tools, more weight is given to values such as sensitivity and specificity. In other cases, such as economics and supervised anomaly detection in bank fraud, precision and recall are more reliable. In general, precision and recall are considered metrics for imbalanced classification, while sensitivity and specificity are less sensitive to imbalance, and are thus more reliable in general classification problems. The term imbalance here refers to cases where one of the labels we wish to classify is underrepresented in the dataset.
In bank fraud, for example, there may be 100 frauds out of 10,000 examples. In this case, the imbalance is described as 1:100. In medicine, this would correspond to the classification of a rare disease. A balanced problem, on the other hand, would be one in which we have the same number of healthy individuals as unhealthy ones (1:1). Finally, a slight imbalance (1:10) is not the same as a more extreme case (1:100), or a very extreme imbalance (1:1000). In each of these cases, different evaluation metrics will be more or less reliable.
From this perspective, each of the metrics is defined for a particular reason, although all carry out very similar functions. Evaluation metrics can thus be understood not only as a means of calculating classification accuracy, but also as a means of evaluating how often an algorithm is right, how often it is wrong, and the nature of those errors. While this can sometimes be confusing, consider the following situation; it is not the same for a doctor to diagnose a patient with an illness when they are actually healthy as it is to diagnose a patient as healthy when they are in fact ill. The former case could lead to a lawsuit, while the latter could even lead to death.
In order to perform these evaluations, a Confusion Matrix is defined;

                 Predicted P    Predicted N
    Actual P         TP             FN
    Actual N         FP             TN

where P refers to Positive, N to Negative, T to True and F to False. In the case of our aforementioned patient, a False Positive would be where we diagnose the patient as ill when they are in fact not. A True Positive would be where we diagnose the patient as ill and this turns out to be true.
For simplicity, as well as for the purpose of demonstrating how these evaluations work, each of the three matrices that will be tried and tested in this Appendix is defined in the context of binary classification (ill or healthy, fraud or not-fraud). Evaluation metrics for binary classification in this context can thus be defined as;

$$ Accuracy = \frac{TP + TN}{TP + FP + FN + TN} $$
$$ Sensitivity = \frac{TP}{TP + FN} \qquad Specificity = \frac{TN}{TN + FP} $$
$$ Precision = \frac{TP}{TP + FP} \qquad Recall = \frac{TP}{TP + FN} $$

With F-Measure (F):

$$ F = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} $$

and Kappa (κ);

$$ \kappa = \frac{P_o - P_e}{1 - P_e}, \qquad P_e = P_{pos} + P_{neg} \quad \text{(eq. S33)} $$

where $P_o$ is the observed accuracy, and $P_{pos}$ and $P_{neg}$ are the proportions of agreement expected by chance on the positive and negative classes respectively (eq. S34 & S35);

$$ P_{pos} = \frac{TP + FP}{N} \cdot \frac{TP + FN}{N}, \qquad P_{neg} = \frac{TN + FN}{N} \cdot \frac{TN + FP}{N} $$

with N the total number of observations. For ease of calculation, AUC and loss were excluded from this appendix. AUC, however, can be seen as a powerful balance between Sensitivity, Specificity and Precision.
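A minimal R sketch of these definitions is provided below; binary_metrics is a hypothetical helper written for this appendix (not part of the study's evaluation pipeline), and the example confusion matrix at the end is an invented "everything is positive" classifier, illustrating how 80% accuracy can coincide with κ = 0.

## Binary classification metrics computed from the four confusion-matrix cells
binary_metrics <- function(tp, fp, fn, tn) {
  n           <- tp + fp + fn + tn
  accuracy    <- (tp + tn) / n
  sensitivity <- tp / (tp + fn)                       # = recall
  specificity <- tn / (tn + fp)
  precision   <- tp / (tp + fp)
  recall      <- sensitivity
  f_measure   <- 2 * precision * recall / (precision + recall)
  p_pos       <- ((tp + fp) / n) * ((tp + fn) / n)    # chance agreement, positive class
  p_neg       <- ((tn + fn) / n) * ((tn + fp) / n)    # chance agreement, negative class
  p_e         <- p_pos + p_neg                        # eq. S33
  kappa       <- (accuracy - p_e) / (1 - p_e)
  round(c(Accuracy = accuracy, Sensitivity = sensitivity, Specificity = specificity,
          Precision = precision, Recall = recall, F = f_measure, Kappa = kappa), 3)
}

## Hypothetical classifier that labels everything positive: 80% accuracy, kappa = 0
binary_metrics(tp = 80, fp = 20, fn = 0, tn = 0)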
For the purpose of this demonstration we have defined three confusion matrices: Matrix A is an example of a very poor classifier with a misleadingly high accuracy; Matrix B is a more evidently poor classifier, with relatively high accuracy; and Matrix C is a relatively good classifier with more balanced accuracy. Matrices A and C both have 80% classification accuracy, a threshold that would normally be considered "good". Matrix B has 70% accuracy, a rate that is not particularly good but could be worse. The results for all three matrices can be consulted in the table below. As can be seen in the table, the most obvious observation is that 80% accuracy is still possible even when an algorithm is very weak. In this case, the 80% accuracy is completely unreliable. Likewise, for Matrix B, while the accuracy falls by 10%, comparison of the remaining metrics shows it to be a much more powerful classifier than Matrix A. This becomes increasingly obvious when the maths behind Kappa are worked through; considering how the number of True Negatives in Matrix A is 0, the first fraction in eq. S35 equates to 0, and a scalar 0 multiplied or divided by anything conditions the formula to reach 0, thus lowering the κ statistic drastically. Likewise, although Matrix B presents evaluation metrics higher than those of Matrix A, its low TN value still conditions κ to be much lower. Finally, even in Matrix C, where most of the sensitivity, precision and recall values are above the acceptable 0.8 threshold, κ still proves to be a strict statistic, determining Matrix C to be only a moderate result.
The next interesting component to consider is how, even in matrices that have specifically been designed to be poor (A and B), the Precision, Recall and F-Measure values are all very high. This is quite misleading, and raises the question: are these metrics really reliable? To answer this question, consider the following matrix, extracted from a credit card fraud case study ( https://blog.usejournal.com/credit-card-fraud-detection-by-neural-network-in-keras-4bd81cc9e7fe );

$$ D = \begin{pmatrix} 85268 & 28 \\ 31 & 116 \end{pmatrix} $$

Matrix D represents a classification algorithm for credit card fraud with an extreme imbalance of 147:85384. The resulting precision and recall in this case study come to 0.99967 and 0.99963 respectively, which would suggest that the algorithm that produced Matrix D is a very powerful classifier for both the smaller and the larger class. Specificity, however, does not reflect this power, with a value of only 0.81.
From this perspective, it can be seen that different evaluation metrics are designed for different purposes, and the choice of metrics in a study should reflect this. On a taxonomic level, the present study only has a single sample that is "imbalanced" with regards to the rest (V. vulpes), and even then this imbalance is relatively weak (≈ 5:8). Beyond this, the strongest imbalance present is in the Taxonomic Family dataset, when comparing Felidae tooth marks with Ursidae tooth marks; this case of imbalance is ≈ 7:24. So as to ensure our tables are comparable with those published by other authors, all evaluation metrics have been included. Nevertheless, the authors strongly encourage the reader to prioritise Sensitivity, Specificity, Kappa, AUC and loss metrics over the others, so as to ensure the correct interpretation of our results.

Supplementary Appendix 7 : Computational Learning Performance on Non-Augmented Datasets
The present Appendix provides adaptations of Tables 1 & S8-S12 using non-augmented datasets for the training of supervised computational learning algorithms.
A number of important components can be noted from these tables;
• Even without data augmentation, many algorithms are able to reach state-of-the-art performance. On average, accuracy only appears to drop 4% for SVM and 6% for NSVM. Likewise, with the exception of the Pleistocene European Taxa dataset (κ = 0.75) and the Taxonomic Family dataset (κ = 0.61), κ values remain above the standard 0.8 threshold.
• The greatest difference between augmented and non-augmented performance can be seen in the calculation of loss values. Central tendency calculations reveal a loss of 9% confidence when performing predictions without the aid of data augmentation (Tables 1 & S23). When observing performance on specific datasets, it can be seen that the Taxonomic Family dataset is the most affected, with loss values of over 0.5. This is an evident product of dataset imbalance, whereby the smallest classes present the highest loss values. This ties in strongly with the observations presented by Courtenay and González-Aguilera 25 , who reported a considerable drop in loss and an improvement in decision boundary definition when using augmented datasets to train Linear Discriminant models.
• Strongly tied to the previous point, sample imbalance in the Taxonomic Family dataset also produces a change in precision and recall metrics. As detailed in Supplementary Appendix 6, these two metrics are better suited to the evaluation of algorithms trained on imbalanced datasets. While data augmentation was used in the present study to reduce these effects, the moment it is excluded, each of these performance metrics drops notably for the poorly represented samples. Here it can be seen how Hyaenidae (n = 86) and Ursidae (n = 69) present the lowest recall values (Hyaenidae = 0.37, Ursidae = 0.42). In some cases precision is also seen to drop (Hyaenidae = 0.66