## Abstract

Data modeling requires a sufficient sample size for reproducibility. A small sample size can inhibit model evaluation. A synthetic data generation technique addressing this small sample size problem is evaluated: from the space of arbitrarily distributed samples, a subgroup (class) has a latent multivariate *normal characteristic*; synthetic data can be generated from this class with univariate kernel density estimation (KDE); and synthetic samples are statistically like their respective samples. Three samples (n = 667) were investigated with 10 input variables (X). KDE was used to augment the sample size in X. Maps produced univariate normal variables in Y. Principal component analysis in Y produced uncorrelated variables in T, where the probability density functions were approximated as normal and characterized; synthetic data was generated with normally distributed univariate random variables in T. Reversing each step produced synthetic data in Y and X. All samples were approximately multivariate normal in Y, permitting the generation of synthetic data. Probability density function and covariance comparisons showed similarity between samples and synthetic samples. A class of samples has a latent *normal characteristic*. For such samples, this approach offers a solution to the small sample size problem. Further studies are required to understand this latent class.

## Introduction

Data modeling requires sufficient data for exploration and reproducibility purposes. This is especially relevant to biomedical-healthcare research, where data can be limited; although this field is broad, a few examples include risk prediction^{1}, response to therapy^{2} and benign-malignant classification^{3,4,5}. Unfortunately, data can be limited for a variety of reasons: the study of low-incidence diseases or underserved/underrepresented subpopulations^{6}; clinic visitation hesitancy^{7,8}; the inability to share data across facilities^{9}; cost of molecular tests; and study timeframes. We, the authors, have worked in biomedical-healthcare research for many years and have experienced this persistent problem over decades.

Multivariate modeling is often exploratory, which can decrease model stability in various ways. Here we explain frequent approaches that we have experienced or witnessed. The process starts by analyzing data from the target population for a variety of goals such as: open-ended analyses by studying many different data characteristics searching for correlations and patterns; subgrouping the dataset; testing hypothesis feasibilities with varying endpoints; exploring multiple hypotheses simultaneously; feature selection; selecting the most suitable model; or estimating model parameters with an optimization procedure. In practice, there are virtually *unlimited* ways to search through a given data sample. Data mining of this sort may not always be viewed in the most positive light^{10}, but it is also the nature of discovery, and there is often a compromise between these positions. Extensive subgroup analyses can effectively deplete the sample. When this applies, we term it the *small sample problem*. In the *final* stage, the fully specified model (i.e., the model with its parameters fixed) is validated with new data to prove its generalizability. Both the exploration and final stages depend critically on having an adequate sample size.

Determining the adequate sample size in the multivariate setting is a difficult task^{11} and has relevance to the small sample problem. Adequate multivariate sample size is a function of both the analysis technique and covariance structure. For example, a multivariate two-sample test with normally distributed data and common covariance, Hotelling's T^{2}, is appropriate when comparing mean vectors^{12}. In a broad sense, when the variables under consideration tend to be more highly correlated, the adequate sample size decreases and vice versa. Adequate sample size is also a function of the number of free model parameters, which does not necessarily correspond with the number of variables^{13}. In ordinary linear regression modeling with d noninteracting variables, there are about d parameters that must be determined. In contrast, when taken to the limit, partial least squares regression^{14} has roughly d^{2} parameters, and deep neural network architectures have an even greater number of parameters, requiring large sample sizes for a given design^{15} (see related table in^{15} for examples). These modeling techniques illustrate that an adequate sample size under one condition may not be optimal for another. It is our premise that the adequate sample size for a given multivariate prediction problem that allows independent validation deserves more attention beyond *larger sample sizes are better,* especially when normality assumptions do not hold. By hypothesis, a technique that can generate realistic synthetic data will provide benefits in modeling endeavors. Such an approach could be used to augment an inadequate sample size for modeling and validation purposes or to study sample size requirements for a given multivariate covariance structure.

Synthetic data applications in health-related research use a variety of techniques. Some methods are used for generating samples from large populations^{16,17,18,19}. These approaches include hidden Markov models^{18}, techniques that reconstruct time series data coupled with sampling the empirical probability density function of the relevant variables^{16}, and methods that estimate probability density functions (pdfs) from the data, not accounting for variable correlation^{19}. Other work used moment matching to generate synthetic data but does not consider relative frequencies in the comparison analysis^{20}. Discussions on synthetic data generation techniques indicate that the small sample size condition has received little analytical attention^{20,21}.

Our synthetic data technique estimates a multivariate pdf for arbitrarily distributed data, including when normal approximations fail to hold^{22}. This initial work, based on multivariate kernel density estimation (mKDE) with unconstrained bandwidths, was illustrated with d = 5 data from mammographic case–control data^{22}. Synthetic populations (SPs) of arbitrarily large size were generated from samples of limited size. However, mKDE has noted efficiency problems for high dimensionality^{23,24}. As the dimensionality increases, the sample size requirement apparently becomes exceedingly large, noting this area is under investigation. Although categorizing a problem as high or low dimensionality may be dependent on many factors, reasonable arguments suggest that *high dimensionality* may be defined as 3 < d ≤ 50, where d is the number of variables considered^{25}; that is, density estimators should be able to address this range^{25}. Here we let d = 10 so that many of the findings can be presented graphically or reasonably tabulated; many modeling problems in healthcare research also fall within this range.

In this report, we present modifications to our method to mitigate the mKDE efficiency problem under specific conditions (latent normality) and address synthetic data generation in relatively higher dimensionality (d = 10). This modified approach decomposes an arbitrarily distributed multivariate problem into multiple univariate KDE (uKDE) problems while characterizing the covariance structure independently^{26}. We are evaluating whether this approach can transform an arbitrarily distributed multivariate sample into an approximate multivariate normal form, which we define as a sample with a *latent normal characteristic*. In the universe of arbitrarily distributed samples, there is a multivariate normal subgroup. The technique for generating synthetic data for this normal subgroup is relatively straightforward and well-practiced. By hypothesis, our approach seeks to extend these straightforward techniques to the latent normal class by determining when (or if) it exists. Developing the analytics to detect this condition and then leveraging it to generate synthetic data are essential elements of our work^{26}.

## Methods

### Overview

Our modified synthetic data generation and analytic techniques have sequential components and many related analyses. Therefore, clear definitions, preliminaries, and a brief outline are given before the details are provided. Justifications are also discussed here when warranted.

#### Definitions

*Population* is used to define a hypothetical collection of a virtually *unlimited* number of either real or synthetic entities from which samples comprised of observations or realizations may exist or can be drawn. The exception for the use of *population* is when explaining differential evolution (DE) optimization^{27} used for uKDE bandwidth determination. The *DE-population* is limited and defined specifically. *Sample* defines a collection of n real observations with d attributes (variables) from the space of possible samples, represented mathematically as an n × d matrix (rows = observations, columns = attributes). Column vectors are designated with lower-case bold letters. For example, individual attributes are referred to as **x**, a column vector. Vector components are designated with lower-case subscripted letters. The components of **x** are referenced as x_{j} for j = 1, 2, …, d and assumed continuous. The multivariate pdf for **x** is p(**x**). X refers to the input variable space, that is, **x** exists in the X representation. We assume p(**x**) exists at the population level but is not accessible. In practice, we evaluated normalized histograms (i.e., empirical pdfs) throughout this work for all variables considered, both univariate and multivariate; these are also referred to as pdfs for brevity, and we use this term to refer to attributes at the population level as well as at the sample level. One-dimensional (1D) marginal pdfs for p(**x**) are expressed as p_{j}(x_{j}). Matrices are designated with upper-case bold lettering. For example, **X** is the n × d matrix with n observations of **x** in its rows (i.e., the ith row contains the d attributes for the ith observation and the jth column of **X** has n realizations of x_{j}). Double subscripts are used for both specific realizations and matrix element indices. That is, x_{ij} is the jth component for the ith realization in X (and is also the indexing for **X**). Variables in X are mapped to the Y representation.
This creates the corresponding entities in Y: (1) the vector **y** with d components; (2) the multivariate pdf, g(**y**), and its marginal pdfs, g_{j}(y_{j}); and (3) the matrix **Y** defined analogously to **X**. We also work in the T representation (uncorrelated variables) as explained below, where **t**, t_{j}, and **T** are defined similarly. Likewise, r(**t**) is the multivariate pdf in T with marginals, r_{j}(t_{j}). We define the cumulative probability functions (i.e., the indefinite integral approximation of a given univariate pdf) for p_{j}(x_{j}) and g_{j}(y_{j}) as P_{j}(x_{j}) and G_{j}(y_{j}), respectively. Covariance quantities are calculated with the normal multivariate form: E[(w − m_{w})(v − m_{v})], where E is the expectation operator, and w and v are arbitrary random variables with means m_{w} and m_{v}. The corresponding covariance matrices are expressed as **C**_{x}, **C**_{y}, and **C**_{t}, respectively (or **C**_{k} generically). When an entity is given the subscript s, it defines the corresponding synthetic entity. Standardized normal defines a zero-mean, unit-variance normal pdf, used in both the univariate and multivariate scenarios. *Parametric* for this report refers to functions that can be expressed in closed form.

#### Preliminaries

General characteristics of biomedical-healthcare data are overviewed here. Measurements such as body mass index (BMI) and age, or measurements taken from image data, can have right-skewed pdfs because they are often positive-valued and not inclusive of zero (see Figs. 1, 2 and 3 for examples). Such measures can bear varying levels of correlation (see lower parts of Tables 1, 2 and 3). Thus, an arbitrary p(**x**) may not lend itself to parametric modeling in X (i.e., normality and non-correlation assumptions do not apply in many instances). To render such data into a more tractable form, a series of steps (see Fig. 4) were taken to condition **X**; by premise, these steps will permit characterizing the sample with parametric means and then generating similar multivariate synthetic data without mKDE.

#### Outline of the processing steps

When describing these steps, we also briefly discuss the analysis at a given step (also provided in detail in the methods). The process starts with a given sample in X (Fig. 4, top left). The processing flow for the sample (X → Y → T) is illustrated in the top row of Fig. 4, and the reversed SP generation flow (T_{s} → Y_{s} → X_{s}) in the bottom row.

*Step 1* Univariate maps were constructed (Fig. 4, top-left) to transform a given X measurement to a standardized normal, producing the respective marginal pdf set in Y (Fig. 4, top-middle). Maps were constructed with an augmented sample size using optimized uKDE, addressing the small sample size problem. uKDE was used to generate synthetic x_{j}. Here, we augmented the sample size with the goal of filling gaps in the input marginal pdfs of x_{j} (sample) to complement the map constructions. By hypothesis, this step addresses the small sample size problem by guaranteeing continuous smooth maps that will produce standardized normal pdfs from the sample. Synthetic x_{j} generated in this fashion do not maintain the covariance relationships in X and were not used further; only x_{j} from the sample were mapped to Y, and KDE was not used beyond this point. There is no guarantee that a set of normal marginals in Y will produce a multivariate normal pdf, although the reverse is always true because a multivariate normal has univariate normal marginals. In practice, g(**y**) from the sample could be assessed at this point to estimate how well it approximates normality. If the latent normal approximation is poor, another synthetic approach could be pursued, or the process could be discontinued. Here we forgo such testing at this step (normality was tested for in steps 3 and 4 instead) and move through all steps to illustrate the techniques. Later, in the discussion, we also describe a possible modification that could be investigated when the sample has a poor latent normal characteristic approximation.

*Step 2* Principal component analysis (PCA) was used to decouple the variables in Y, producing uncorrelated variables in T (Fig. 4, top-right).

*Step 3* Synthetic data was generated in T (Fig. 4, bottom-left) as uncorrelated random variables. Here we assumed that each marginal in T from the sample could be approximated as normal with variance given by the jth eigenvalue of **C**_{y}. To generate synthetic data, the columns of **T**_{s} were populated as normally distributed random variables with these specified variances (noting, the columns lack correlation). We refer to the realizations in T as the SP (i.e., **T**_{s}), noting the column length (number of synthetic entities) can be arbitrarily large. To address the normal characteristic, univariate marginal and multivariate pdfs from the sample were tested for normality at this point.

*Step 4* The inverse PCA transform (Fig. 4, bottom-middle) of **T**_{s} produced the SP in Y (**Y**_{s}), thereby restoring the covariance relationships that were removed in Step 2. We note, this technique (steps 3 and 4) of producing multivariate normal data is a practiced approach when the sample is multivariate normal or well approximated as such. For reference, the multivariate standardized normal in Y is expressed as

\[g_{n}(\mathbf{y}) = \frac{1}{(2\pi)^{d/2}\,|\mathbf{C}_{y}|^{1/2}} \exp\left(-\frac{1}{2}\,\mathbf{y}^{T}\mathbf{C}_{y}^{-1}\mathbf{y}\right), \tag{1}\]

where **C**_{y} can be approximated as the covariance matrix from a given sample. If synthetic samples are poor replicas of their sample, it follows that the sample's g(**y**) will be a poor approximation of multivariate normality (i.e., the sample is not in the latent normal class).

We evaluated how well the inverse PCA transformation preserved the covariance (**C**_{y}) and the method's capability of restoring the univariate/multivariate pdfs in Y (i.e., normality comparisons) rather than in Step 2.

*Step 5* Each synthetic variable in Y is inverse mapped to X (Fig. 4, bottom left). This reversed Step 1, thereby producing the SP in X (**X**_{s}), and restoring the covariance relationships by hypothesis. The respective pdfs and covariance matrices in X were compared with those from synthetic samples; pdfs were also tested for normality.

It is important to clarify a few aspects of this work. The univariate mapping from X to Y creates a set of normally distributed univariate marginals that can produce a multivariate normal, but this is not guaranteed. The comparison of univariate marginal pdfs, however, is no guarantee that the respective multivariate pdfs are reasonable facsimiles because the covariance structure has been removed. Many univariate pdf comparisons are provided between samples and synthetic samples in addition to multivariate comparisons, because these allow visualizing similarities with the above stipulations. When a given sample has the latent normal characteristic, the SP generation is greatly simplified, and then it is *defined* by Eq. (1). The work below shows how to generate synthetic data when this characteristic holds. We use three datasets that were selected *pseudo-randomly*. In the space of samples (virtually unlimited), we do not know the probability that a sample selected at random will have this latent normal characteristic. The main objectives are to present the analysis components with the methods for testing for the latent characteristic, give a thorough investigation, demonstrate that the synthesis produces realistic data when this latent condition exists, and then discuss further analyses.

### Study data

Samples were derived from two sources of measurements: (1) mammograms and related clinical data (n = 667), and (2) dried beans (n = 13,611)^{28}. Most technical aspects of these data are not relevant for this report. Mammography data included all observations with mammograms from a specific imaging technology, thereby defining n = 667. We used the dried bean data to add variation to the analyses as the variable nomenclatures are very different from the mammogram data, noting at this point the source of data is not germane. From mammograms, we considered two sets of measurements each with d = 10 variables referred to as Sample 1 (DS1) and Sample 2 (DS2). DS1 has 8 double precision measurements from the Fourier power spectrum in addition to age and BMI, both captured as integer variables. The Fourier attributes are from a set of measurements described previously^{29}; the first 8 measurements from this set are labeled as P_{i} for i = 1–8. These Fourier measures are consecutive and follow an approximate functional form^{30}, and thus represent variables that are very different than those in DS2 (or Sample 3 below). To cite the covariance quantities and correlation coefficients, we used a modified covariance (covariance for short) table format for efficiency because **C**_{k} is symmetric. In these tables, entries below the diagonal are the respective correlation coefficients, whereas the elements along the diagonal (variance quantities) and above are the covariance quantities. The covariance table for DS1 is shown in Table 1a. DS2 contains 8 double precision summary measurements derived from the image domain: mean; standard deviation (SD); SD of a high-pass (HP) filter output; SD of a low-pass (LP) filter output; local SD summarized; P_{20} Fourier measure (from the set described for DS1 measurements); local spatial correlation summarized^{31}; and breast (Br) area measured in cm^{2}. Age and BMI (from DS1) were also included in this dataset.
These variables were selected virtually at random to give d = 10 and possibly provide a different covariance structure than DS1. The covariance table is shown in Table 2a. Neither DS1 nor DS2 was used in our related prior mKDE synthetic data work. Selected measures and realizations from the dried bean dataset^{28} are referred to as Sample 3 (DS3). The bean data has 17 measurements (floating point) from 7 bean types. We selected 10 measures at random to make the dimensionality of DS3 compatible with the other two datasets, giving this set of variables: area (1); minor axis (5); eccentricity (6); convex area (7); equivalent diameter (8); extent (9); solidity (10); roundness (11); shape factor 3 (15); and shape factor 4 (16). Here, parenthetical references give the variable number listed in the respective resource (see^{28}). Both the bean type (Sira, with n = 2636) and the n = 667 observations were selected at random to create DS3. Keeping n = 667 constant across datasets permits consistent statistical comparisons. For example, confidence intervals (CIs) and other comparison metrics are dependent upon the number of observations. The covariance table is shown in Table 3a. The analysis of three samples supports the evaluation of the processing scheme under generalized scenarios. The means and standard deviations for x_{j} in each dataset are provided in Table 4. Note, the dynamic range of the means and standard deviations within a given sample varies widely in some instances.

### KDE, optimization, and mapping

The mapping (Step 1) relies on generating synthetic x_{j} with uKDE. Each bandwidth parameter was determined with an optimization process wherein synthetic data from uKDE was compared with the sample. There is a continued feedback loop between the sample, synthetic data generation, and comparison during the optimization process. When the optimization was completed, a given map was constructed.

#### uKDE

For the map construction in Step 1 and as a modification to our previous work^{22}, uKDE was used to generate realizations from each p_{j}(x_{j}) given by

where x_{j} is the synthetic variable for this discussion, x_{ij} are observations from a given sample, k_{j} is a normalization factor, and h_{j} is the univariate bandwidth parameter.
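To illustrate how synthetic x_{j} can be drawn from a univariate KDE, the following is a minimal sketch. It assumes a Gaussian kernel and treats h_{j} as the kernel variance; neither choice is stated in this section. The function name `sample_ukde` and the toy lognormal data are hypothetical.

```python
import math
import random


def sample_ukde(observations, h, size, rng):
    """Draw `size` values from a Gaussian-kernel KDE built on `observations`.

    Assumption: `h` is treated as the kernel variance, so the kernel
    standard deviation is sqrt(h).
    """
    sd = math.sqrt(h)
    # Sampling a KDE: pick one observation at random, then jitter it by the kernel.
    return [rng.choice(observations) + rng.gauss(0.0, sd) for _ in range(size)]


rng = random.Random(0)
# Toy right-skewed, positive-valued stand-in for one attribute (n = 667).
x_j = [rng.lognormvariate(3.2, 0.25) for _ in range(667)]
x_syn = sample_ukde(x_j, h=1.0, size=10_000, rng=rng)
```

Drawing from the KDE this way avoids evaluating the density on a grid; a larger h smooths more heavily and fills gaps between sparse observations.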

#### Optimization

Differential evolution optimization^{27} was used to determine each h_{j} in Eq. (2). The population of candidate h_{j} evolves over generations to a *solution*, as described in detail previously^{22}. To form generation zero for a given x_{j}, the DE-population (n = 1000) was initialized randomly (uniformly distributed) within these bounds: (0.0, 4 × the variance of x_{j}) because a given range should span the solution for h_{j}. The DE-population size stays constant across all generations. By expectation, the populations become more fit according to the fitness function over generations. A given generation is found by comparing two DE-populations (1000 pairwise competitions) of candidate h_{j} solutions derived from the previous generation. For a given pairwise competition, two respective SPs were generated, one for each candidate h_{j}. Synthetic samples (n = 667) were drawn from each SP, and P_{j}(x_{j}) from the sample was compared with each P_{j}(x_{j}) derived from its synthetic sample using Eq. (2). The h_{j} candidate [used in Eq. (2)] that produced a smaller D_{j} was used to populate the current generation, where D_{j} is the difference metric derived from the fitness function. This process was repeated for 30 generations. In summary, 30 × 2 × 1000 synthetic samples were compared with the sample via P_{j}(x_{j}) comparisons for a given x_{j} to derive its respective h_{j} used in Eq. (2). A given X to Y map was then constructed with x_{j} sampled from Eq. (2) with h_{j} = E[terminal population of candidate h_{j}].

As a modification, the Kolmogorov-Smirnov (KS) test^{32} was used as the fitness function in the DE optimization. This is a nonparametric test that can be used to compare two numerically derived cumulative probability functions or compare a numerically derived curve to a reference^{32}. Here we compare the respective numerical univariate cumulative probability functions derived from synthetic samples with those derived from the sample. The difference metric, D_{j}, for the KS test is the absolute maximum difference between the two cumulative probability functions under comparison.
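The difference metric D_{j} can be sketched as follows. This is an illustrative two-sample implementation using empirical cumulative functions, not the authors' code; the function name `ks_distance` is hypothetical.

```python
import bisect


def ks_distance(a, b):
    """Maximum absolute difference between the empirical CDFs of a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # The supremum is attained at an observed value, so checking every
    # pooled point with right-continuous empirical CDFs is sufficient.
    for v in a_sorted + b_sorted:
        fa = bisect.bisect_right(a_sorted, v) / len(a_sorted)
        fb = bisect.bisect_right(b_sorted, v) / len(b_sorted)
        d = max(d, abs(fa - fb))
    return d


d_same = ks_distance([1, 2, 3], [1, 2, 3])
d_disjoint = ks_distance([1, 2, 3], [4, 5, 6])
```

In a pairwise DE competition as described above, the candidate h_{j} whose synthetic sample yields the smaller distance to the sample would be retained.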

#### Mapping

For each map in Step 1, we solve P_{j}(x_{j}) = G_{j}(y_{j}) numerically with interpolation methods described previously^{33,34}, where P_{j}(x_{j}) and G_{j}(y_{j}) are assumed to be monotonically increasing. This solves the random variable transformation for each y_{j}, analogous to histogram matching with double precision accuracy. Synthetic y_{j} (n = 10^{6}) were generated as standardized normal random variables using the Box-Muller (BM) method. Maps from X to Y are expressed as y_{j} = m_{j}(x_{j}), where m_{j} is the jth map. The corresponding inverse maps, \({\mathrm{x}}_{\mathrm{j}}={\mathrm{m}}_{\mathrm{j}}^{-1}({\mathrm{y}}_{\mathrm{j}})\), were derived numerically by inverting a given map and solving for x_{j}. The map construction was complemented by generating synthetic x_{j} with Eq. (2) using h_{j} derived from DE optimization. Synthetic x_{j} generated here were not used further.
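A minimal sketch of one forward map and its inverse follows. It is a simplification under stated assumptions: the analytic standard normal quantile function replaces the Box-Muller reference sample, and an empirical cumulative function with linear interpolation stands in for the interpolation methods cited above. The names `forward_map`, `inverse_map`, and `reference` (a sorted, KDE-augmented set of x_{j} values) are hypothetical.

```python
import bisect
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # zero mean, unit variance


def forward_map(x, reference_sorted):
    """y = G^{-1}(P(x)): empirical CDF of x followed by the normal quantile."""
    n = len(reference_sorted)
    rank = bisect.bisect_right(reference_sorted, x)
    # Keep the probability strictly inside (0, 1) so the quantile is finite.
    p = min(max(rank / (n + 1), 1.0 / (n + 1)), n / (n + 1))
    return STD_NORMAL.inv_cdf(p)


def inverse_map(y, reference_sorted):
    """x = P^{-1}(G(y)): normal CDF followed by an interpolated empirical quantile."""
    n = len(reference_sorted)
    pos = STD_NORMAL.cdf(y) * (n + 1) - 1  # fractional index into the sorted values
    pos = min(max(pos, 0.0), n - 1.0)
    lo = int(pos)
    hi = min(lo + 1, n - 1)
    frac = pos - lo
    return reference_sorted[lo] * (1 - frac) + reference_sorted[hi] * frac


rng = random.Random(0)
reference = sorted(rng.lognormvariate(3.2, 0.25) for _ in range(5000))
y_values = [forward_map(x, reference) for x in reference]
x_back = inverse_map(forward_map(reference[100], reference), reference)
```

A denser reference set gives a smoother map, which is the role the uKDE augmentation plays in Step 1.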

### Synthetic population generation

Synthetic populations (SPs) were generated in the uncorrelated T representation and converted back to X via Y. In Step 2, the PCA transform for the sample is given by

\[\mathbf{T} = \mathbf{Y}\mathbf{P}, \tag{3}\]

where **P** is a d × d matrix with orthonormal columns. These are the normalized eigenvectors of **C**_{y} that capture the sample's covariance structure. **C**_{t} is diagonal with c_{jj} = \({\upsigma }_{\mathrm{j}}^{2}(\mathrm{t})\) corresponding to the ordered eigenvalues of **C**_{y}. We make the approximation that r(**t**) from the sample has the multivariate normal form expressed as

\[r(\mathbf{t}) = \frac{1}{(2\pi)^{d/2}\,|\mathbf{C}_{t}|^{1/2}} \exp\left(-\frac{1}{2}\,\mathbf{t}^{T}\mathbf{C}_{t}^{-1}\mathbf{t}\right). \tag{4}\]

When the multivariate normality approximation holds in Y, it should hold in T. In Step 3, synthetic t_{j} were populated as zero mean independent normally distributed random variables with variances = \({\upsigma }_{\mathrm{j}}^{2}(\mathrm{t})\) using the BM method, producing the SP in T (**T**_{s}). The column length (number of rows) of **T**_{s} defines the number of realizations in each SP and is arbitrary. Here, we let n = 10^{6} for all SPs. In Step 4, to construct the SP in Y (**Y**_{s}), the inverse PCA transform was used by substituting **T**_{s} for **T** in Eq. (3), giving

\[\mathbf{Y}_{s} = \mathbf{T}_{s}\mathbf{P}^{T}. \tag{5}\]

With this process, g(**y**) = g_{n}(**y**) for synthetic data. By premise, the covariance of **Y**_{s} should be like that of **Y**. In Step 5, to produce the SP in X (**X**_{s}), synthetic y_{j} were inverse mapped. Similarly, the covariance of **X**_{s} should be like that of **X**. An example is also provided to illustrate that **X**_{s} is densely populated in contrast with the sparse sample in X.
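Steps 2 to 4 can be sketched with NumPy as below. This is an illustration only: the toy matrix `Y` stands in for a mapped sample, d = 3 and a reduced SP size are used for brevity, and NumPy's normal generator replaces the Box-Muller method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the sample in Y: n = 667 correlated, zero-mean normal rows.
A = rng.normal(size=(3, 3))
Y = rng.normal(size=(667, 3)) @ A
Y -= Y.mean(axis=0)

C_y = np.cov(Y, rowvar=False)
eigvals, P = np.linalg.eigh(C_y)      # columns of P are orthonormal eigenvectors
order = np.argsort(eigvals)[::-1]     # order eigenvalues from largest to smallest
eigvals, P = eigvals[order], P[:, order]

T = Y @ P                             # Step 2: uncorrelated variables in T
# Step 3: independent zero-mean normals, variance of column j = jth eigenvalue.
T_s = rng.normal(size=(100_000, 3)) * np.sqrt(eigvals)
Y_s = T_s @ P.T                       # Step 4: inverse PCA restores the covariance
C_ys = np.cov(Y_s, rowvar=False)
```

Comparing `C_ys` with `C_y` element by element gives a quick check that the covariance structure removed in Step 2 is restored in Step 4.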

### Statistical methods

The goals are to evaluate the latent normal characteristic and to produce synthetic data that is statistically like its sample. This analysis is based on both multiple univariate/multivariate pdf and covariance comparisons. A given synthetic sample (n = 667) was drawn at random from its SP. The same realizations from a given synthetic sample were used for comparisons in X, Y, and T, when applicable.

#### Probability density function comparisons

##### Univariate pdf comparisons

The KS test (described in Step 1) was used for all univariate pdf comparisons. For such comparisons, we selected the test threshold at the 5% significance level as the critical value. In X, p_{j}(x_{j}) from the sample were tested for normality and compared with their respective pdfs from synthetic samples created in Step 5. In Y, g_{j}(y_{j}) from the sample were compared with the respective pdfs from their synthetic samples produced by Step 4; this implicitly evaluated univariate normality in Y because synthetic y_{j} were derived from a multivariate normal process in **T**_{s}. In T, we compared r_{j}(t_{j}) from the sample with their respective pdfs derived from zero mean normally distributed random variables (i.e., synthetic t_{j}) with variances = \({\upsigma }_{\mathrm{j}}^{2}(\mathrm{t})\) [from Step 3]. For each t_{j}, the sample was compared with 1000 synthetic samples, and the percentage of times that the measured D_{j} was less than the critical test value was tabulated.

##### Distribution free multivariate pdf comparisons

To evaluate whether the sample and synthetic samples were drawn from the same distribution without assumptions, we used the maximum mean discrepancy (MMD)^{35} test. This is a kernel-based (normal kernel) analysis that computes the difference between every possible vector combination between and within two samples (excluding same vector comparisons). To determine the kernel parameter for these tests, we used the median heuristic^{35,36}. This analysis is based on the critical value (\({\mathrm{MMD}}_{\mathrm{c}}\)) at the 5% significance level and the test statistic (\({\mathrm{MMD}}_{\mathrm{u}}^{2}\)). Both quantities are calculated from the two samples under comparison. This test has an acceptance region given the two distributions are the same: \({\mathrm{MMD}}_{\mathrm{u}}^{2}< {\mathrm{MMD}}_{\mathrm{c}}\) (see Theorem 10 and Corollary in^{35}). This test was applied in X, Y, and T. Note, when applying this test in either T or Y, it is implicitly testing the sample's likeness with multivariate normality. In X, Y, and T, 1000 synthetic samples were compared with the sample. The test acceptance percentage was tabulated. \({\mathrm{MMD}}_{\mathrm{u}}^{2}\) and \({\mathrm{MMD}}_{\mathrm{c}}\) values are provided as averages over all trials because they change per comparison.
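The test statistic can be sketched as follows. This is a generic unbiased estimate with a Gaussian kernel and a median-heuristic length scale, written for illustration; it does not compute the critical value, and the exact estimator and heuristic variant used in this work may differ. The function names are hypothetical.

```python
import numpy as np


def sq_dists(a, b):
    """Pairwise squared Euclidean distances between rows of a and rows of b."""
    d2 = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.maximum(d2, 0.0)  # clip tiny negatives from rounding


def mmd2_unbiased(x, y):
    """Unbiased MMD^2 estimate between samples x and y (rows = observations)."""
    pooled = np.vstack([x, y])
    d2 = sq_dists(pooled, pooled)
    # Median heuristic (assumed form): median pooled squared distance as sigma^2.
    sigma2 = np.median(d2[np.triu_indices_from(d2, k=1)])
    k = np.exp(-d2 / (2.0 * sigma2))
    m, n = len(x), len(y)
    kxx, kyy, kxy = k[:m, :m], k[m:, m:], k[:m, m:]
    # Within-sample terms exclude same-vector comparisons (the diagonal).
    term_x = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * kxy.mean()


rng = np.random.default_rng(0)
x = rng.normal(size=(300, 5))
y = rng.normal(size=(300, 5))
z = rng.normal(loc=1.0, size=(300, 5))
same = mmd2_unbiased(x, y)      # two samples from one distribution
shifted = mmd2_unbiased(x, z)   # second sample has a shifted mean
```

The statistic is near zero when both samples come from one distribution and grows as the distributions separate.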

##### Random projection multivariate normality evaluation

Random projections were used to develop a test for normality. The vector **w** with d components is multivariate normal if the scalar random variable, z = **u**^{T}**w**, is univariate normal for every **u**, where **u** is a d component vector with unit norm that is defined as a projection vector in this report^{37,38}. As mentioned by Zhou and Shao^{38}, this formalism can be developed into a test; we developed it into a specific random projection test. To actualize such a test to probe the similarity of samples and synthetic samples with normality, the projection vector **u** was generated randomly 1000 times, referenced as **u**_{s}. Here, s is the projection index ranging over [1, 1000]. The projection equation is then expressed as

\[z|s = \mathbf{u}_{s}^{T}\mathbf{w}, \tag{6}\]

where z|s defines the scalar z conditioned upon s. In Eq. (6), **x**, **y**, or **t** was substituted for **w**, and a given projection was taken over all realizations (i.e., n = 667) of a given sample. These realizations of z|s were used to form the normalized histogram that approximates the conditional pdf for the left side of Eq. (6), defined as f(z|s). A different series of **u**_{s} was produced for each representation; once a given series was produced, **u**_{s} remains fixed. The components of **u**_{s} were generated as standardized normal random variables, and **u**_{s} was then normalized to unit norm. For a given sample, f(z|s) was tested for normality using the KS test. This procedure was repeated for all random projections (all s), resulting in 1000 KS test comparisons for normality. The percentage of the times that the null hypothesis was not rejected was tabulated as the normality similarity gauge. We refer to this procedure as the *random projection test*. It is the percentage of times that **w** was not rejected when probed in 1000 random directions. This test was performed once for each sample and with 100 synthetic samples and averaged. Here, the same **u**_{s} series used to probe a given sample was also used to probe 100 of its respective synthetic samples. We note, synthetic samples are multivariate normal in Y and T by their construction. Tests were performed on synthetic samples (Y and T) to give control standards as normal comparators. Tests were also performed in X: (1) as a control comparator for the test itself; and (2) to determine if a given sample was multivariate normal before undergoing the mapping.
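A minimal sketch of the random projection test is given below. Several details are assumptions rather than the procedure used in this work: each projection is standardized with its own sample mean and SD before the KS comparison, and the asymptotic 5% critical value 1.358/√n is used, which is conservative when the parameters are estimated from the data. The function names are hypothetical, and 200 projections are used in the demonstration for speed.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf)


def ks_normal_stat(z):
    """KS distance between standardized z and the standard normal CDF."""
    z = np.sort((z - z.mean()) / z.std(ddof=1))
    n = len(z)
    cdf = 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))
    upper = np.arange(1, n + 1) / n - cdf   # empirical CDF just after each point
    lower = cdf - np.arange(0, n) / n       # empirical CDF just before each point
    return max(upper.max(), lower.max())


def random_projection_rate(w, n_proj=1000, seed=0):
    """Percentage of random unit directions along which normality is not rejected."""
    rng = np.random.default_rng(seed)
    n, d = w.shape
    crit = 1.358 / math.sqrt(n)             # asymptotic 5% critical value (assumption)
    u = rng.normal(size=(n_proj, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    z = w @ u.T                             # column s holds z|s for all n realizations
    passed = sum(ks_normal_stat(z[:, s]) < crit for s in range(n_proj))
    return 100.0 * passed / n_proj


rng = np.random.default_rng(1)
normal = rng.normal(size=(667, 3))
skewed = np.exp(1.5 * rng.normal(size=(667, 3)))
rate_normal = random_projection_rate(normal, n_proj=200)
rate_skewed = random_projection_rate(skewed, n_proj=200)
```

Fixing `seed` reproduces the same series of projection vectors, mirroring the use of one fixed **u**_{s} series for a sample and its synthetic samples.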

To gain insight into both Eq. (6) and the test, expressions for both z|s and f(z|s) are developed. First, z|s results from a linear random variable transform given by

where u_{k} are the components of **u**_{s}, and h_{k}(w_{k}) are the univariate pdfs for scaled w_{k}. To check one endpoint, we assume the w_{k} are independent as a coarse approximation to our samples. Then, f(z|s) results from repeated convolutions given by

where c^{-1} = u_{1} × u_{2} × u_{3} × u_{4} × … × u_{d}, and h_{k}(w_{k}) ~ h_{k}. If some h_{k}(w_{k}) have much larger variances (widths) than others, their functional forms can tend to dominate Eq. (8).
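As a sketch of the reasoning behind Eqs. (7) and (8), the projection and its pdf can be written out under the stated independence assumption. The intermediate densities g_k, the variances σ_k² of the w_k, and the absolute value on c are our notation and assumptions; the exact typeset form used by the authors may differ.

```latex
% Assumed forms: the w_k are independent with pdfs h_k and variances \sigma_k^2.
\begin{align}
z|s &= \sum_{k=1}^{d} u_k w_k, \\
g_k(v) &= \frac{1}{|u_k|}\, h_k\!\left(\frac{v}{u_k}\right)
  \quad \text{(pdf of the scaled variable } v = u_k w_k\text{)}, \\
f(z|s) &= \left(g_1 * g_2 * \cdots * g_d\right)(z)
  = |c| \left[ h_1\!\left(\tfrac{\cdot}{u_1}\right) * \cdots * h_d\!\left(\tfrac{\cdot}{u_d}\right) \right](z),
  \qquad c^{-1} = u_1 u_2 \cdots u_d, \\
\operatorname{Var}(z|s) &= \sum_{k=1}^{d} u_k^{2} \sigma_k^{2}.
\end{align}
```

The last line makes the dominance argument concrete: when a few σ_k² are orders of magnitude larger than the rest, those terms carry nearly all of the variance of z|s, so the shape of f(z|s) is governed mainly by the corresponding h_k.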

##### Mardia multivariate normality test

This is a two-component test that uses multivariate skewness and kurtosis to evaluate deviations from normality^{37}; it was applied in X, Y and T. It produces a deviation measure for each component as well as a combined measure; we report the component findings. The test was applied in X as a control. Outlier elimination techniques were not applied.
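For orientation, the two Mardia components can be computed as in the sketch below. It uses the standard textbook forms of the skewness and kurtosis statistics with their asymptotic chi-square and normal reference distributions; the function name and the simulated inputs are illustrative assumptions, and the software actually used in the study may apply small-sample corrections that are not shown here.

```python
import numpy as np
from scipy import stats


def mardia_test(x):
    """Mardia's multivariate skewness and kurtosis with asymptotic p-values."""
    n, d = x.shape
    centered = x - x.mean(axis=0)
    s = centered.T @ centered / n  # maximum-likelihood covariance estimate
    # m[i, j] = (x_i - mean)^T S^{-1} (x_j - mean)
    m = centered @ np.linalg.solve(s, centered.T)
    b1 = np.sum(m ** 3) / n ** 2  # multivariate skewness
    b2 = np.mean(np.diag(m) ** 2)  # multivariate kurtosis
    skew_stat = n * b1 / 6.0
    skew_dof = d * (d + 1) * (d + 2) / 6.0
    p_skew = stats.chi2.sf(skew_stat, skew_dof)
    kurt_z = (b2 - d * (d + 2)) / np.sqrt(8.0 * d * (d + 2) / n)
    p_kurt = 2.0 * stats.norm.sf(abs(kurt_z))
    return {"b1": b1, "p_skew": p_skew, "b2": b2, "p_kurt": p_kurt}


# Simulated stand-ins for a normal and a right-skewed sample with d = 4.
demo_rng = np.random.default_rng(0)
normal_result = mardia_test(demo_rng.standard_normal((667, 4)))
skewed_result = mardia_test(demo_rng.exponential(size=(667, 4)))
```

Under multivariate normality the kurtosis b2 is expected to lie near d(d + 2), which is 24 for d = 4, while the skewness b1 is expected to lie near zero.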

#### Covariance comparisons

Two methods were used to evaluate the covariance similarity between samples and synthetic samples: comparisons with confidence intervals (CIs) and eigenvalue comparisons.

##### Comparisons with confidence intervals

Each covariance matrix element from the sample was compared with the corresponding element from its respective synthetic samples using CIs. We assumed the sample and synthetic samples were drawn from the same distribution. We used the elements from each **C**_{x} and **C**_{y} as point estimates from a given sample in both X and Y. One thousand synthetic samples (n = 667) were used to calculate 1000 covariance matrices (in X and Y). For each matrix element, the respective univariate pdf was formed, and 95% CIs were calculated. This procedure was repeated 1000 times. The percentage of times the sample's point estimate (for each element in **C**_{y} and **C**_{x}) was within the synthetic element's CIs was tabulated.
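The CI comparison can be sketched as below. The function `covariance_ci_coverage`, the percentile-based interval, and the multivariate normal generator standing in for the synthetic data generator are assumptions made for illustration; the counts are reduced from 1000 to keep the example quick.

```python
import numpy as np


def covariance_ci_coverage(sample, draw_synthetic, n_syn=1000, n_repeats=1000,
                           level=0.95, seed=0):
    """Percentage of repeats in which each sample covariance element lies inside
    the CI formed from the covariance matrices of n_syn synthetic samples."""
    rng = np.random.default_rng(seed)
    c_ref = np.cov(sample, rowvar=False)  # point estimates from the sample
    lo_q = 100.0 * (1.0 - level) / 2.0
    hi_q = 100.0 * (1.0 + level) / 2.0
    inside = np.zeros_like(c_ref)
    for _ in range(n_repeats):
        covs = np.stack([np.cov(draw_synthetic(rng), rowvar=False)
                         for _ in range(n_syn)])
        lo, hi = np.percentile(covs, [lo_q, hi_q], axis=0)  # element-wise CI limits
        inside += (c_ref >= lo) & (c_ref <= hi)
    return 100.0 * inside / n_repeats


# Simulated sample with a known covariance structure (illustrative values).
demo_rng = np.random.default_rng(1)
true_cov = [[1.0, 0.6, 0.0], [0.6, 1.0, 0.3], [0.0, 0.3, 1.0]]
sample = demo_rng.multivariate_normal(np.zeros(3), true_cov, size=667)
sample_mean = sample.mean(axis=0)
sample_cov = np.cov(sample, rowvar=False)


def draw_synthetic(rng):
    # Hypothetical stand-in for the synthetic data generator.
    return rng.multivariate_normal(sample_mean, sample_cov, size=len(sample))


coverage = covariance_ci_coverage(sample, draw_synthetic, n_syn=200, n_repeats=5)
```

Each entry of `coverage` plays the role of one cell in the tabulated percentages: a value near 100 means the sample's covariance element was consistently inside the synthetic CI.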

##### Comparisons with PCA

The eigenvalues from **C**_{y} were used as the reference comparators under two conditions. For condition 1, the PCA transform determined with the sample was applied to a synthetic sample (sample/syn test). Synthetic eigenvalues were estimated by calculating the variances of synthetic t_{j}. For condition 2, the PCA transform determined with a synthetic sample selected at random was applied to the sample (syn/sample test). Eigenvalues were estimated by calculating the variances of t_{j} from the sample. For both conditions, each eigenvalue (or equivalently, variance) was compared to its respective reference (sample) using the F-test.
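The two conditions can be sketched as follows. The helper names, the two-sided form of the F-test, and the simulated data are assumptions for illustration; the text does not state whether a one-sided or two-sided test was used.

```python
import numpy as np
from scipy import stats


def pca_transform(y):
    """Mean, eigenvectors (columns), and eigenvalues of the covariance, largest first."""
    mean = y.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(y, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    return mean, eigvecs[:, order], eigvals[order]


def projected_variances(transform_source, data):
    """Variances of t_j when the PCA transform from transform_source is applied to data."""
    mean, p, _ = pca_transform(transform_source)
    return ((data - mean) @ p).var(axis=0, ddof=1)


def variance_f_test(var_a, var_b, n_a, n_b):
    """Two-sided F-test p-value for equality of two variances."""
    cdf = stats.f.cdf(var_a / var_b, n_a - 1, n_b - 1)
    return 2.0 * min(cdf, 1.0 - cdf)


# Simulated stand-ins for a sample in Y and one synthetic sample.
demo_rng = np.random.default_rng(0)
mixing = demo_rng.standard_normal((4, 4))
sample = demo_rng.standard_normal((667, 4)) @ mixing
synthetic = demo_rng.multivariate_normal(sample.mean(axis=0),
                                         np.cov(sample, rowvar=False), size=667)

reference = pca_transform(sample)[2]                # eigenvalues of the sample covariance
condition_1 = projected_variances(sample, synthetic)  # sample/syn test
condition_2 = projected_variances(synthetic, sample)  # syn/sample test
n = len(sample)
p_condition_1 = [variance_f_test(v, r, n, n) for v, r in zip(condition_1, reference)]
p_condition_2 = [variance_f_test(v, r, n, n) for v, r in zip(condition_2, reference)]
```

Applying the sample's own transform back to the sample returns the reference eigenvalues exactly, which is a convenient sanity check before running either condition.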

### Ethics and consent to participate

All methods were carried out in accordance with relevant guidelines and regulations. All experimental procedures were approved by the Institutional Review Board (IRB) of the University of South Florida, Tampa, FL under protocol #Ame13_104715. Mammography data was collected retrospectively on a waiver for informed consent approved by the IRB of the University of South Florida, Tampa, FL under protocol #Ame13_104715.

## Results

### Univariate normality analysis in the X representation

Figures 1, 2 and 3 show the univariate pdfs (solid) for each sample in X. Of note, many pdfs are observably non-normal, usually right skewed. Each x_{j} in DS1 showed significant deviation from normality (*p* < 0.0001) except for x_{9}. In DS2, neither x_{1} nor x_{9} (x_{9} is the same as in DS1) showed significant deviation from normality, while the remaining x_{j} exhibited significant deviations (*p* < 0.0003) except for x_{3} (*p* = 0.0144). In DS3, x_{1} through x_{5} and x_{9} did not show significant deviations from normality, whereas the remaining x_{j} deviated significantly (*p* < 0.002).

### Mapping and KDE optimization

#### Mapping

Figure 5 shows an example of the X to Y map for y_{9} in the left-pane and the inverse Y to X map in the right-pane. Red-dashed lines show the map and its inverse constructed without synthetic x_{j}. Staircasing effects are observable particularly in the tail regions, where sample densities are sparse. Black lines show the map and its inverse constructed with n = 667 (the sample) plus n = 10^{6} synthetic realizations produced with optimized uKDE. Staircasing effects were removed when incorporating synthetic x_{j}, which was common with all maps and inverses (not shown).
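A map of this kind can be sketched as below, under assumptions that are not confirmed by this section: the map is taken to be y = Φ^{-1}(F(x)) with F an empirical CDF tabulated from the sample pooled with Gaussian-kernel synthetic draws, linear interpolation is used between tabulated points, and the bandwidth of 0.2 and the 10^5 synthetic draws are arbitrary illustrative values.

```python
import numpy as np
from scipy import stats


def kde_augment(x, h, m, rng):
    """Draw m values from a Gaussian-kernel density estimate of x with bandwidth h."""
    return rng.choice(x, size=m) + h * rng.standard_normal(m)


def build_map(values):
    """Tabulate x -> y = Phi^{-1}(F(x)) from the empirical CDF of values."""
    grid = np.sort(values)
    n = grid.size
    cdf = (np.arange(1, n + 1) - 0.5) / n  # probabilities kept strictly inside (0, 1)
    return grid, stats.norm.ppf(cdf)


def forward(x, grid, y_grid):
    """X to Y map by linear interpolation."""
    return np.interp(x, grid, y_grid)


def inverse(y, grid, y_grid):
    """Y to X map by linear interpolation of the same table."""
    return np.interp(y, y_grid, grid)


# Simulated right-skewed variable standing in for one x_j.
demo_rng = np.random.default_rng(0)
x = demo_rng.gamma(2.0, size=667)
pooled = np.concatenate([x, kde_augment(x, h=0.2, m=100_000, rng=demo_rng)])
grid, y_grid = build_map(pooled)
y = forward(x, grid, y_grid)
x_back = inverse(y, grid, y_grid)
```

With only the 667 sample points in the table, the steps of the empirical CDF are wide in the sparse tails; pooling many synthetic draws shortens those steps, which is one way to picture how the staircasing is removed.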

#### uKDE optimization

The optimization produced bandwidth parameter solutions (h_{j}) used in Eq. (2). Here, we illustrate the evolution of the *solution* with h_{9} from DS1 and DS2 as an example. Figure 6 shows the scatter plot between the candidate h_{9} population and the respective D_{9} (KS test difference metric) for DE generation = 1 in the left-pane and for the terminal generation = 30 in the middle-pane. The solution space (middle-pane) is tightly clustered, indicating DE *convergence*. A closer view of this cluster is shown in the right-pane of Fig. 6. This relatively tight-cluster characteristic was common among all variables and datasets (not shown).
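A bandwidth search of this type can be sketched with SciPy's `differential_evolution`. The fitness function here is a stand-in and an assumption: it is the two-sample KS distance between the sample and a Gaussian-kernel synthetic draw, with the random draws fixed in advance so that the fitness is a deterministic function of h. The bounds, population size, and simulated data are likewise illustrative, and the fitness function defined earlier in the paper may differ.

```python
import numpy as np
from scipy import stats
from scipy.optimize import differential_evolution

# Simulated right-skewed variable standing in for one x_j.
demo_rng = np.random.default_rng(0)
x = demo_rng.gamma(2.0, size=667)

# Fixed random draws so every candidate h is scored on the same synthetic skeleton.
m = 5000
idx = demo_rng.integers(0, x.size, size=m)
eps = demo_rng.standard_normal(m)


def ks_distance(h):
    """KS distance D between the sample and a KDE-based synthetic draw with bandwidth h."""
    h = float(np.atleast_1d(h)[0])
    synthetic = x[idx] + h * eps
    return stats.ks_2samp(x, synthetic).statistic


result = differential_evolution(
    ks_distance,
    bounds=[(1e-3, 2.0)],  # assumed search range for h
    maxiter=30,            # fixed number of generations, as in the text
    popsize=15,
    tol=0.0,               # removes the relative-tolerance stopping rule
    seed=0,
    polish=False,          # keep the pure DE solution
)
h_opt = result.x[0]
d_opt = result.fun
```

Plotting the candidate population against D at the first and last generation would give a picture analogous to the left and middle panes of Fig. 6.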

### Univariate comparisons between samples and synthetic samples

#### Comparisons in T

These findings are summarized first because they start the flow back to X and can show departures from normality. Figures 7, 8 and 9 show the pdfs for the samples (solid) compared with their corresponding synthetic pdfs (dashed). Table 5 shows the variances in T for each sample (i.e., eigenvalues for each sample). Due to (1) the normalization in Y, and (2) that d = 10, multiplying a given \({\upsigma }_{\mathrm{j}}^{2}(\mathrm{t})\) by 10 gives the percentage of the total variance explained by its t_{j}. Table 6 shows the KS test findings for the univariate normality comparisons. Here we use a cutoff of < 65% to indicate deviation as most trends were well above this boundary. As shown in Table 6 (left column for each dataset): (1) the normal model did not deviate for any t_{j} in DS1 (7 t_{j} were < 94%); (2) the normal model deviated for t_{10} in DS2 (5 t_{j} were < 94%); and (3) the normal model deviated for t_{7}, t_{8}, t_{9}, and t_{10} in DS3 (3 t_{j} were < 94%). In DS2, t_{10} explains about 0.2% of the total variance. Similarly, in DS3, the sum of the variances of the four variables (t_{7}, t_{8}, t_{9}, and t_{10}) constituted about 0.14% of the total variance.
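The variance bookkeeping can be checked with a short example. With d = 10 unit-variance variables in Y, the eigenvalues sum to 10, so the percentage explained by t_j is 100 × σ_j²(t) / 10, that is, ten times the eigenvalue. The data below are simulated and the mixing matrix is an arbitrary assumption.

```python
import numpy as np

demo_rng = np.random.default_rng(0)
d = 10
# Hypothetical correlated data standing in for a sample in Y.
mixing = demo_rng.standard_normal((d, d))
y = demo_rng.standard_normal((667, d)) @ mixing
y = (y - y.mean(axis=0)) / y.std(axis=0, ddof=1)  # zero mean, unit variance per variable

# Eigenvalues of the covariance in Y are the variances of the t_j, largest first.
eigvals = np.linalg.eigvalsh(np.cov(y, rowvar=False))[::-1]
percent_explained = 10.0 * eigvals  # same as 100 * eigvals / d because d = 10
```

For instance, an eigenvalue of 0.02 corresponds to 0.2% of the total variance.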

#### Comparisons in Y

Figures 10, 11 and 12 show the pdfs in Y resulting from the mapped samples (i.e., mapped x_{j}) for each dataset (solid) compared with their respective synthetic pdfs (dashed), which are normal by construction. Comparisons in Y showed little departure from normality in any sample, as the tests were not rejected (about 99%) in most instances (Table 6, middle column for each dataset). Findings from DS2 and DS3 indicate that substituting normal pdfs in T whenever sample t_{j} deviated from normality had little influence on this analysis. This may be because these respective variables, in total or in isolation, explained a minute portion of the variance in the respective PCA models.

#### Comparisons in X

Figures 1, 2 and 3 show the pdfs in X for the samples (solid) compared with their corresponding synthetic pdfs (dashed). The pdfs from the sample did not deviate from their corresponding synthetic pdfs in any dataset, as the tests were not rejected (< 99%) in most instances (Table 6, right columns). The parenthetical entries in Table 6 show the test findings without using synthetic data for the map/inverse constructions (using the samples only). These show that complementing the map constructions with uKDE is a necessary component of this methodology, although the degree of deviation from the KS tests varied across datasets. Note that the improvement held in DS3 as well, which was normal in X (shown below).

### Multivariate comparisons and normality comparisons

#### MMD tests

Testing was performed in X, Y, and T, and the test metrics are provided in Table 7. These show the samples and respective synthetic samples were drawn from the same distributions. In 100% of the tests, measured \({\mathrm{MMD}}_{\mathrm{u }}^{2}\) values were less than the critical MMD_{c} quantities. The MMD tests in Y and T were also proxy tests for sample-normality due to the SP constructs.
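For reference, an unbiased estimate of the squared MMD with a Gaussian kernel can be sketched as follows. The function name, the use of the median pairwise distance of the pooled data as the kernel bandwidth, and the simulated inputs are assumptions; the critical quantity MMD_{c} is not computed here because its definition is given elsewhere in the paper.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist


def mmd2_unbiased(x, y, bandwidth=None):
    """Unbiased estimate of the squared maximum mean discrepancy (Gaussian kernel)."""
    if bandwidth is None:
        # Median heuristic (assumption): median pairwise distance of the pooled data.
        bandwidth = np.median(pdist(np.vstack([x, y])))
    gamma = 1.0 / (2.0 * bandwidth ** 2)
    kxx = np.exp(-gamma * cdist(x, x, "sqeuclidean"))
    kyy = np.exp(-gamma * cdist(y, y, "sqeuclidean"))
    kxy = np.exp(-gamma * cdist(x, y, "sqeuclidean"))
    m, n = len(x), len(y)
    # Diagonal (i = j) terms are excluded from the within-sample averages.
    term_x = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * kxy.mean()


# Simulated stand-ins: two draws from one distribution, and one mean-shifted draw.
demo_rng = np.random.default_rng(0)
a = demo_rng.standard_normal((300, 5))
b = demo_rng.standard_normal((300, 5))
shifted = demo_rng.standard_normal((300, 5)) + 1.0
mmd_same = mmd2_unbiased(a, b)
mmd_shifted = mmd2_unbiased(a, shifted)
```

The estimate fluctuates around zero when both samples come from the same distribution and grows when the means differ, which illustrates why the test is sensitive to changes in the mean.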

#### Random projection normality tests

Testing was performed in X, Y, and T, and the findings are shown in Table 8. Test findings were mixed for the samples in X; the tests were not rejected in approximately 40% of instances in DS1, 74% in DS2, and 99% in DS3. Thus, DS3 is better approximated as normal in X than the other samples. The tests for synthetic samples in X tracked the findings for their respective samples: 44%, 74% and 99%, respectively. In Y, the tests for the samples were not rejected in about 99% of instances in DS1, 97% in DS2, and 95% in DS3, whereas the tests for the synthetic samples showed no deviation from normality. Similarly in T, the tests for the samples were not rejected in about 99% of instances in DS1, 94% in DS2, and 95% in DS3. The X and Y analyses differ because the mapping normalizes the variables in Y. In DS3, the standard deviations vary over many orders of magnitude (see Table 4). As shown by Eq. (8), variables in DS3 with the larger standard deviations may wash out the other variables; comparing the variables in X that had normal marginals with those that did not indicates that a portion of the normal marginals had much larger standard deviations (in Table 4, see x_{1}–x_{4}). As another control experiment, we standardized all variables in X to zero mean and unit variance and performed the tests again. The tests for the samples gave 32.9%, 76.4%, and 75.2% for DS1, DS2, and DS3, respectively. For synthetic samples, these tests gave 33.6%, 80.1% and 89.1% for DS1, DS2, and DS3, respectively. Note that centering the means alone had no influence on the findings, as expected (data not shown). Thus, normalizing the univariate measures can influence the likeness to normality by virtue of Eq. (8). In sum, these tests show that all samples resemble multivariate normality in both Y and T and that the sample for DS3 resembles normality in X without mapping (without first normalizing the variances).

#### Mardia normality tests

Testing was performed in X, Y, and T. The findings are shown in Table 8. In X, the samples and synthetic samples all deviated from normality (both skewness and kurtosis). In both Y and T, the samples showed significant deviations from normality in all tests. In contrast, synthetic samples did not deviate significantly from normality in any test in Y or T, as expected.

### Covariance comparisons

#### Covariance matrix comparisons with confidence intervals

Test findings are provided in Tables 1, 2 and 3 for the respective datasets. Part-a of each table shows the X quantities, and part-b shows the corresponding Y quantities. For DS1 (Table 1), covariance references (sample) were within the CIs of the synthetic data for 100% of the trials in both X and Y. For DS2 (Table 2), most references agreed with the synthetic elements except for two entries in X (Table 2a). From the 1000 trials, the x_{2}x_{3} covariance was out of tolerance for 0.1% of the instances, and the x_{5}x_{10} covariance was out for 18.7% of instances. For DS3, all covariance references were within tolerance except the x_{7}x_{10} covariance, which was out of tolerance for 100% of the instances. In tests that showed more deviation (percentage > 0.1%), the reference covariances were approximately zero.

#### Eigenvalue comparison tests

Eigenvalues are provided in Table 5. This table is separated into three sections vertically. Reference eigenvalues are provided in the top row of each section. Eigenvalues calculated from the sample/syn (condition 1) and syn/sample (condition 2) are provided in the middle and bottom rows of each section, respectively. F-tests were not significant (*p* > 0.05) in any comparison with the references, indicating similarity.

### Sample sparsity and synthetic population space filling

This illustration demonstrates that the approach fills in the multidimensional space with synthetic realizations derived from a relatively sparse sample. We selected a synthetic entity at random from DS1, giving this vector: **x**^{T} = [4.20, 2.08, 1.61, 1.15, 0.85, 0.67, 0.54, 0.44, 52.0, 23.6]. We selected x_{1} and x_{8} as the scatter plot variables. For the other 8 components, all synthetic realizations within x_{ij} ± ½ σ_{j} (where σ_{j} is the standard deviation for x_{j}) were selected and viewed in the x_{1}x_{8} plane as a scatter plot. The same vector and limits were used to select realizations from the sample. The plots are provided in Fig. 13 for comparison. The sample (left-pane) produced n = 24 realizations, whereas the SP (right-pane) produced n = 36,398 realizations. These plots illustrate the sample's relative sparsity and that the synthetic approach produces a dense population with observations that did not exist in the sample.
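The selection rule used for this illustration can be sketched as below. The function name, the one-factor simulated data, and the synthetic population size of 10^5 are assumptions for demonstration and do not reproduce DS1.

```python
import numpy as np


def select_neighbors(data, reference, sigma, plot_cols, half_width=0.5):
    """Rows of data within +/- half_width * sigma of reference in every
    column except the two scatter plot columns."""
    other = np.setdiff1d(np.arange(data.shape[1]), plot_cols)
    inside = np.abs(data[:, other] - reference[other]) <= half_width * sigma[other]
    return data[inside.all(axis=1)]


demo_rng = np.random.default_rng(0)


def draw(n):
    # Hypothetical correlated population: one shared factor plus noise, d = 10.
    factor = demo_rng.standard_normal((n, 1))
    return factor + 0.3 * demo_rng.standard_normal((n, 10))


sample = draw(667)
synthetic = draw(100_000)
reference = synthetic[0]              # a synthetic entity used as the center
sigma = sample.std(axis=0, ddof=1)    # standard deviation of each x_j
plot_cols = [0, 7]                    # zero-based indices of x_1 and x_8
sparse = select_neighbors(sample, reference, sigma, plot_cols)
dense = select_neighbors(synthetic, reference, sigma, plot_cols)
```

Plotting columns 0 and 7 of `sparse` and `dense` side by side gives a comparison of the same kind as Fig. 13.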

## Discussion

The work involved several steps to generate synthetic data from arbitrarily distributed samples. To the best of our knowledge, new aspects and findings from this work include: (1) demonstrating a class of arbitrarily distributed samples has a latent normal characteristic, as exhibited by two of the samples; (2) conditioning the input variables with sample size augmentation and then constructing univariate transforms so that known techniques could be applied to generate synthetic data; (3) deploying multiple statistical tests for assessing both normality and general similarity in both the univariate and multivariate pdf settings; (4) developing methods for comparing covariance matrices; and (5) incorporating differential evolution (DE) optimization for uKDE bandwidth determination based on the KS fitness function. The related findings are discussed below in detail.

A method was presented that converts a given multivariate sample into multiple 1D marginal pdfs by constructing maps. These X-to-Y maps were constructed by augmenting the sample size with optimized uKDE. Comparing the analysis with and without data augmentation showed that augmentation improved the marginal pdf comparisons between the samples and synthetic samples; this also held in DS3 (four x_{j}), which was approximately normal in X. PCA applied to standardized normal variables in Y produced uncorrelated variables in T, where synthetic data was generated. This approach essentially decouples the problem into the covariance relationships (in **P** and its inverse) and 1D marginal pdfs (i.e., approximate parametric models in Y and T). This decoupling is similar to the objective of Copula modeling that follows from Sklar's work^{39,40}. Copula modeling allows specification of marginal pdfs and the correlation structure independently^{41}; in that approach, the marginals must be specified accurately and finding analytical solutions for d > 4 is difficult^{42}. In contrast with Copula modeling, which is flexible, the covariance (or correlation) structure in our approach is fixed by the normal form and empirically derived; the marginals were forced to normality rather than specified. As a benefit, the CIs for **C**_{k} with our approach were estimated from the pdfs for each matrix element without assumption other than the normal calculation form. Additionally, the eigenvalue comparison results reinforced the CI comparison findings. Outside of the multivariate normal situation, it is not clear when (1) comparing the marginals with one set of tests and (2) comparing the covariance relationships separately with another set of tests together yield a good overall empirical comparison between two multivariate samples. Such situations will require further analyses.

There are several other points worth noting about this work. An empirically driven stochastic optimization technique was used to estimate the uKDE bandwidth parameters for the map/inverse constructions. The relative efficiency of the approach is an important attribute in that it only requires multiple uKDE applications rather than mKDE. The number of generations in the optimization was fixed. This can be changed easily to a variable termination based on achieving a critical threshold or applying other appropriate fitness functions; for example, the stopping criteria could be based on the critical distance in the KS test or the change in this distance from one generation to the next. Likewise, there are plug-in kernel bandwidth parameters that can be used statically. These are derived by considering closed form expressions containing the constituent pdfs and minimizing the asymptotic behavior of either the mean integrated square error or mean squared error^{43}. We explored such parameters^{44}, but they did not perform as well as the KS test with DE, notwithstanding the number of computations used here to determine a given bandwidth parameter. Of note, the KS test has limitations, as it is more sensitive near the median of the distribution than in the tails. As an alternative, the Anderson-Darling test is a variant of the KS procedure that is sensitive to the distribution tails^{32}. The mapping from X to Y standardized the problem at the univariate level, but in general there is no guarantee that collectively it produced a multivariate normal in Y. Testing performed in Y (Step 2) could be used to discriminate input samples that have the latent normal characteristic from those that do not. The random projection test could be developed into a gauge at this step for assessing the deviation from normality.
Moreover, comparing r(t_{j}) from the sample against normality (following Step 3) also provides a basis for testing the sample's likeness to a multivariate normal (discussed below). When p(**x**) is approximately multivariate normal, as in DS3, the mapping is not required and generating synthetic data based on PCA (without the mapping step) is a practiced technique; our approach addresses the case when this approximation fails to hold. Standardizing the samples changed the similarity with normality reported by the random projection tests. Thus, the purpose for generating synthetic data should be considered before adjusting the input sample.

There are several other limitations and qualifications worth noting. Several multivariate pdf tests were examined with mixed findings. MMD tests in X, Y, and T showed each sample was statistically similar to its respective synthetic sample(s). These tests also indicated normality in Y and T (by default). This MMD test is sensitive to changes in the mean. In our processing, all means were forced either to identically zero or to statistical similarity via mapping. Likewise, the heuristic used for the kernel bandwidth determination can be less than optimal under certain conditions, decreasing the MMD test performance^{45}. Random projection tests in Y and T indicated that the samples did not deviate from normality in most instances, whereas synthetic samples showed essentially no deviation. Understanding the acceptable departure from normality for this test in the modeling context will require more work. This test also showed DS3 was approximately normal in X. In contrast, Mardia tests showed all samples deviated significantly from normality in X, Y and T. With the Mardia test, synthetic samples showed (1) essentially no deviation in Y or T, as expected, and (2) significant deviation in X. Here, we made no attempt to mitigate possible outlier interference when analyzing the samples^{46}. Note that testing for multivariate normality is not a trivial task; many of the complexities are covered by Farrell et al.^{47}

We conclude from these tests that each sample was approximately multivariate normal in Y and T, noting the approach may not be dependent upon this characteristic, as elaborated below. In planned research, these approximations will be tested in the modeling context to evaluate whether sample and synthetic data are interchangeable. When this normality approximation holds, it implies that the original multivariate pdf estimation problem in X was converted to a parametric normal model described by Eq. (1), which simplifies the synthetic data generation. If this conversion generalizes to other datasets (at least in part), it implies that some class (subset) of the multivariate sample space can be studied with simulations by altering n, d, and the covariance matrix to that of an arbitrary sample. Future work involves investigating arbitrarily selected samples to understand how often this latent normal characteristic is present.

Alternatively, analyzing the samples in T may provide another method for comparing datasets, evaluating similarity, evaluating normality in Y, or generalizing our approach. The marginals from each sample were approximated as univariate normal in T, although there were noted variations. For example, and as noted, about 99% of the total variance came from the first four variables in DS1 (see Table 5). DS2 and DS3 were found to be similar, with the first four variables accounting for 90–92% of the total variance. Thus, DS1 is more compressible than the other two datasets, as expected due to the high correlation from its approximate functional Fourier form. Although not the purpose of this report, the amount of compression is a likely metric for estimating the effective dimensionality (d_{e}) when d_{e} ≤ d, which could be useful for estimating sample size. When viewing the PCA transform through the NIPALS algorithm^{14}, it is clear that when the total variance is explained by a number of components d_{e} << d, the remaining components are residue (noise, chatter, rounding errors). This effect could explain why deviations from normality in DS2 and DS3 in T did not influence the multivariate normal approximation in Y. Here, we did not encounter non-normal variables (from the samples) in T that explained a significant portion of the total variance. When a given sample is well approximated as multivariate normal in Y, the PCA transformation will produce univariate normal marginal pdfs in T. This step could be developed into the *definitive* test for multivariate normality in Y by understanding the residual error of the non-normal marginal pdfs in T. In this work, the analysis in T supports the normality findings for each sample because the residual non-normal errors were parasitic.
Future work will investigate: (1) the impact of the residual error in T on normality in Y, and (2) causes for normality in T, i.e., possibly due to forced normality in Y, some characteristic of the X representation data, or the PCA transform. If required, the technique could be generalized to accommodate non-normal marginals in T. As a generalization, uKDE will be investigated for generating univariate non-normal distributions in T when called for, with the same method used to augment X for the map/inverse constructions. In this scenario, the Y description will deviate from Eq. (1). However, this premise will have to be investigated because the lack of correlation in T only guarantees t_{j} independence when r(**t**) is multivariate normal. We speculate that when the sample has low correlation between most variable pairs in X, this approximation may hold.
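The caveat that zero correlation implies independence only under multivariate normality can be illustrated with a small constructed example. The variables below are hypothetical and are not taken from any of the datasets: t2 is a deterministic function of t1, yet the two are uncorrelated.

```python
import numpy as np

demo_rng = np.random.default_rng(0)
t1 = demo_rng.standard_normal(100_000)
t2 = t1 ** 2 - 1.0  # zero mean, fully determined by t1, and non-normal

# Linear correlation is near zero because E[t1 * t2] = E[t1^3] - E[t1] = 0.
corr_linear = np.corrcoef(t1, t2)[0, 1]
# The dependence shows up once t1 is squared.
corr_squares = np.corrcoef(t1 ** 2, t2)[0, 1]
```

Generating t1 and t2 independently from their own univariate pdfs would reproduce both marginals and the zero correlation but not the joint structure, which is the risk when non-normal marginals are sampled separately in T.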

Choosing the most appropriate space to perform modeling or to analyze the samples deserves consideration. We have used the covariance form suitable for normally distributed variables. In Y, this form is likely appropriate. We used the same form in X as well; this form may not be optimal here because covariance relationships are not preserved over non-linear transformations. It is our contention that Y is best suited for modeling because the marginals are normal. It is common practice in univariate/multivariate modeling to adjust variables (univariately) to a standardized normal form or apply transforms to remove skewness. The X to Y map converted each x_{j} to unit variance. When the natural variation for x_{j} is important, the mapping can be modified easily to preserve the variance. If the variable interpretation is not important, modeling can also be performed in T.

The method in this report addresses the small sample problem given the sample has the latent normal characteristic or is normal. The approach will require further evaluation on different datasets to understand its general applicability and when the univariate mapping from X to Y approximately produces a multivariate normal. Multiple methods were explored to evaluate multivariate normality. These tests indicated that the samples approximated normality in both Y and T but also showed some deviation from normality. The interpretation of these findings in the context of data modeling may aid in understanding the limits of both the multivariate SPs and normality approximations in this report's data and beyond. For example, determining the limiting percentage of the random projection tests may be informative in the modeling context. In summary, we offer a definition for an insufficient sample size in the context of synthetic data. When considering a given sample with d attributes and specified covariance structure, a sample size that does not allow reconstructing its population can be considered as insufficient. In future work, we will apply the methods in this report to understand the minimum sample size, relative to d and a given covariance structure, that permits recovering the population.

## Data availability

The link to publicly available data is provided in text. Mammography summary data can be obtained upon request to the corresponding author: John Heine (john.heine@moffitt.org). Kernel parameters are also available upon request.

## References

1. Gail, M. H. & Pfeiffer, R. M. Breast cancer risk model requirements for counseling, prevention, and screening. *J. Natl. Cancer Inst.* **110**, 994–1002 (2018).
2. Garrido-Castro, A. C. & Winer, E. P. Predicting breast cancer therapeutic response. *Nat. Med.* **24**, 535–537 (2018).
3. Huo, Z. *et al.* Automated computerized classification of malignant and benign masses on digitized mammograms. *Acad. Radiol.* **5**, 155–168 (1998).
4. Lei, C. *et al.* Mammography-based radiomic analysis for predicting benign BI-RADS category 4 calcifications. *Eur. J. Radiol.* **121**, 108711. https://doi.org/10.1016/j.ejrad.2019.108711 (2019).
5. Nguyen, D. V. & Rocke, D. M. Tumor classification by partial least squares using microarray gene expression data. *Bioinformatics* **18**, 39–50 (2002).
6. Erves, J. C. *et al.* Needs, priorities, and recommendations for engaging underrepresented populations in clinical research: A community perspective. *J. Community Health* **42**, 472–480. https://doi.org/10.1007/s10900-016-0279-2 (2017).
7. Dickson, J. L. *et al.* Hesitancy around low-dose CT screening for lung cancer. *Ann. Oncol.* **33**, 34–41. https://doi.org/10.1016/j.annonc.2021.09.008 (2022).
8. Wang, G. X. *et al.* Barriers to lung cancer screening engagement from the patient and provider perspective. *Radiology* **290**, 278–287. https://doi.org/10.1148/radiol.2018180212 (2019).
9. Foraker, R., Mann, D. L. & Payne, P. R. O. Are synthetic data derivatives the future of translational medicine?. *JACC Basic Transl. Sci.* **3**, 716–718 (2018).
10. Elston, D. M. Data dredging and false discovery. *J. Am. Acad. Dermatol.* **82**, 1301–1302. https://doi.org/10.1016/j.jaad.2019.07.061 (2020).
11. Siddiqui, K. Heuristics for sample size determination in multivariate statistical techniques. *World Appl. Sci. J.* **27**, 285–287 (2013).
12. Wu, Y., Genton, M. G. & Stefanski, L. A. A multivariate two-sample mean test for small sample size and missing data. *Biometrics* **62**, 877–885 (2006).
13. Riley, R. D. *et al.* Calculating the sample size required for developing a clinical prediction model. *BMJ* **368**, m441. https://doi.org/10.1136/bmj.m441 (2020).
14. Geladi, P. & Kowalski, B. R. Partial least-squares regression: A tutorial. *Anal. Chim.* **185**, 1–17 (1986).
15. Chartrand, G. *et al.* Deep learning: A primer for radiologists. *Radiographics* **37**, 2113–2131 (2017).
16. Buczak, A. L., Babin, S. & Moniz, L. Data-driven approach for creating synthetic electronic medical records. *BMC Med. Inform. Decis.* **10**, 1–28 (2010).
17. Chen, J. Q., Chun, D., Patel, M., Chiang, E. & James, J. The validity of synthetic clinical data: A validation study of a leading synthetic data generator (Synthea) using clinical quality measures. *BMC Med. Inform. Decis. Mak.* https://doi.org/10.1186/s12911-019-0793-0 (2019).
18. Dahmen, J. & Cook, D. A synthetic data generation system for healthcare applications. *Sensors (Basel)* **19**, 1181. https://doi.org/10.3390/s19051181 (2019).
19. Goncalves, A. R., Sales, A. P., Ray, P. & Soper, B. NCI pilot 3-synthetic data generation report report no. Lawrence Livermore National Lab. (LLNL): LLNL-TR-747902 (2018).
20. Bogle, B. M. & Mehrotra, S. A moment matching approach for generating synthetic data. *Big Data* **4**, 160–178 (2016).
21. Quintana, D. S. A synthetic dataset primer for the biobehavioural sciences to promote reproducibility and hypothesis generation. *Elife* **9**, e53275 (2020).
22. Fowler, E. E., Berglund, A., Sellers, T. A., Eschrich, S. & Heine, J. Empirically-derived synthetic populations to mitigate small sample sizes. *J. Biomed. Inform.* **105**, 103408 (2020).
23. Scott, D. W. Feasibility of multivariate density estimates. *Biometrika* **78**, 197–205 (1991).
24. Hwang, J.-N., Lay, S.-R. & Lippman, A. Nonparametric multivariate density estimation: A comparative study. *IEEE Trans. Signal Process.* **42**, 2795–2810 (1994).
25. Wang, Z. & Scott, D. W. Nonparametric density estimation for high-dimensional data–Algorithms and applications. *Wiley Interdiscip. Rev. Comput. Stat.* **11**, e1461 (2019).
26. Heine, J., Fowler, E. E. E., Berglund, A., Schell, M. J. & Eschrich, S. A. Techniques to produce and evaluate realistic multivariate synthetic data. *bioRxiv.* https://doi.org/10.1101/2021.10.26.465952 (2021).
27. Price, K. V., Storn, R. M. & Lampinen, J. A. *Differential Evolution: A Practical Approach to Global Optimization* (Springer, 2005).
28. Koklu, M. & Ozkan, I. A. Multiclass classification of dry beans using computer vision and machine learning techniques. *Comput. Electron. Agric.* **174**, 105507 (2020).
29. Fowler, E. E. E. *et al.* Generalized breast density metrics. *Phys. Med. Biol.* **64**, 015006. https://doi.org/10.1088/1361-6560/aaf307 (2019).
30. Heine, J. J. & Velthuizen, R. P. Spectral analysis of full field digital mammography data. *Med. Phys.* **29**, 647–661 (2002).
31. Fowler, E. E. E. *et al.* Spatial correlation and breast cancer risk. *Biomed. Phys. Eng. Express* **5**, 045007. https://doi.org/10.1088/2057-1976/ab1dad (2019).
32. Press, W. H., Numerical Recipes Software (Firm). *Numerical Recipes in C* 2nd edn. (Cambridge University Press, 1992).
33. Oh, H. *et al.* Early-Life and adult anthropometrics in relation to mammographic image intensity variation in the nurses' health studies. *Cancer Epidemiol. Biomark. Prev.* **29**, 343–351. https://doi.org/10.1158/1055-9965.EPI-19-0832 (2020).
34. Velthuzen, R. P. & Clarke, L. P. In *SPIE proceedings series.* 179–187 (Society of Photo-Optical Instrumentation Engineers).
35. Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B. & Smola, A. A kernel two-sample test. *J. Mach. Learn. Res.* **13**, 723–773. https://doi.org/10.5555/2188385.2188410 (2012).
36. Garreau, D., Jitkrittum, W. & Kanagawa, M. Large sample analysis of the median heuristic. arXiv preprint https://arxiv.org/abs/1707.07269 (2017).
37. Zhou, M. & Shao, Y. A powerful test for multivariate normality. *J. Appl. Stat.* **41**, 351–363. https://doi.org/10.1080/02664763.2013.839637 (2014).
38. Shao, Y. & Zhou, M. A characterization of multivariate normality through univariate projections. *J. Multivar. Anal.* **101**, 2637–2640. https://doi.org/10.1016/j.jmva.2010.04.015 (2010).
39. Haugh, M. An introduction to copulas. In *IEOR E4602: Quantitative Risk Management. Lecture Notes* (Columbia University, 2016).
40. Durante, F., Fernández-Sánchez, J. & Sempi, C. *Aggregation Functions in Theory and in Practise* 85–90 (Springer, 2013).
41. Schirmacher, D. & Schirmacher, E. *Multivariate Dependence Modeling Using Pair-Copulas* (The Society of Actuaries, 2008).
42. Chandrasekara, N. & Tilakaratne, C. D. Determining and comparing multivariate distributions: An application to AORD and GSPC with their related financial markets. *GSTF J. Math. Stat. Oper. Res. JMSOR* **4**, 1–8 (2016).
43. Jones, M. C., Marron, J. S. & Sheather, S. J. A brief survey of bandwidth selection for density estimation. *J. Am. Stat. Assoc.* **91**, 401–407 (1996).
44. Gramacki, A. *Nonparametric Kernel Density Estimation and Its Computational Aspects* (Springer, Berlin, 2018).
45. Schrab, A. *et al.* MMD aggregated two-sample test. arXiv preprint https://arxiv.org/abs/2110.15073 (2021).
46. Korkmaz, S., Göksülük, D. & Zararsiz, G. MVN: An R package for assessing multivariate normality. *R J.* **6**, 151 (2014).
47. Farrell, P. J., Salibian-Barrera, M. & Naczk, K. On tests for multivariate normality and associated simulation studies. *J. Stat. Comput. Simul.* **77**, 1065–1080 (2007).

## Funding

This work was supported in part by Moffitt Cancer Center grant #17032001 (Miles for Moffitt) and National Institutes of Health grants R01CA166269 and U01CA200464.

## Author information

### Contributions

J.H. is the corresponding author and conceived the plan and methods; E.F. developed the computer code, assisted in developing the plan and methods, and prepared the figures; A.B. provided statistical and principal component analysis expertise; M.S. provided statistical expertise; S.E. assisted in developing the plan and methods. All authors reviewed the manuscript.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

## Additional information

### Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

**Open Access** This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

## About this article

### Cite this article

Heine, J., Fowler, E.E.E., Berglund, A. *et al.* Techniques to produce and evaluate realistic multivariate synthetic data. *Sci Rep* **13**, 12266 (2023). https://doi.org/10.1038/s41598-023-38832-0

DOI: https://doi.org/10.1038/s41598-023-38832-0
