Normalization of clonal diversity in gene therapy studies using shape constrained splines

Viral vectors are used to insert genetic material into semirandom genomic positions of hematopoietic stem cells which, after reinfusion into patients, regenerate the entire hematopoietic system. Hematopoietic cells originating from genetically modified stem cells will harbor insertions in specific genomic positions called integration sites, which represent unique genetic marks of clonal identity. Therefore, the analysis of vector integration sites present in the genomic DNA of circulating cells allows to determine the number of clones in the blood ecosystem. Shannon diversity index is adopted to evaluate the heterogeneity of the transduced population of gene corrected cells. However, this measure can be affected by several technical variables such as the DNA amount used and the sequencing depth of the library analyzed and therefore the comparison across samples may be affected by these confounding factors. We developed an advanced spline-regression approach that leverages on confounding effects to provide a normalized entropy index. Our proposed method was first validated and compared with two state of the art approaches in a specifically designed in vitro assay. Subsequently our approach allowed to observe the expected impact of vector genotoxicity on entropy level decay in an in vivo model of hematopoietic stem cell gene therapy based on tumor prone mice.

www.nature.com/scientificreports/ PCR protocols for IS retrieval, sequencing platforms and sequencing depth) of complex ecosystems, such as the population of vector integrations sites in the genome 12 . Thus, while Shannon diversity index provides an objective measure of the clonal complexity of any given IS sample, these confounding factors should be taken in account when the clonal complexity of different samples is compared. Since longitudinal studies of GT patients for IS monitoring could require the analysis of several samples collected under heterogeneous technical conditions, a method aimed at removing confounding effects in diversity index is needed. To remove confounding factors in the estimations of ecosystem diversity, several methods have been applied. Random subsampling without replacement, called "rarefying" 13 , is among the most popular methods for the normalization of species count data in ecology as well as for next generation sequencing (NGS) data in microbiology. Given a predefined sequence depth (total count, SD), a subsample from each library is generated by randomly picking reads without replacement, until the selected total number of counts is reached. Although rarefying has become the state-of-the-art tool for NGS data analysis 14 , some limitations have been recognized. Indeed 15 , demonstrated that rarefying is statistically inadmissible and should be avoided. Furthermore, in 16 it was highlighted that estimates of species diversity in sites/habitats at local scale, namely the α-diversity 17 , for rarefied microbiome count data may be strongly biased. This is mainly due to the rare species which may be over-(or under-) represented in the samples that have been normalized to a smaller depth by rarefaction. An alternative normalization to rarefying is scaling, which adjusts the size of all samples by scaling their counts to the same total amount. Scaling preserves the relative frequencies of the species and keeps the species richness unchanged. Therefore, simple scaling does not remove the effect provided from the library depth neither on species richness nor on species diversity. Beule 18 introduces a novel normalization method for species count data called scaling with ranked subsampling (SRS) and the authors demonstrate its suitability for the analysis of microbial communities.
The growing number of normalization and scaling approaches highlights that a robust method has not been developed yet. In this work we show that all proposed methods have limitations. In particular they miss of a precise quantification of the effect of each confounding variable on the Shannon entropy. Furthermore, we also show that the rescaled Shannon entropy index obtained by either rarefying or scaling with ranked subsampling still suffers from the effect of the confounders. We propose a spline-regression approach aimed to explain and remove those effects from the diversity indexes. The effect of the confounders is measured using a B-spline term whose shape is restricted according to a biological-sustained hypothesis. We test our framework by analysing a novel in-vitro dataset properly designed to simulate the same clonality state under different combinations of technical conditions. We also compare our method with the previously proposed methods from the literature in terms of efficiency according to hypothesis testing. That is, we consider a rescaling method to be more efficient if there is more evidence for the corresponding rescaled measure being independent from the effect of the candidate confounders. Finally, our rescaling approach allowed to unmask the expected impact of vector genotoxicity on entropy level decay in an in vivo model of hematopoietic stem cell gene therapy based on tumor prone mice 19,20 .

Next generation sequencing of clonal tracking data from gene therapy studies
There are several high-throughput systems capable to quantitatively track cell types repopulation from an individual stem cell after a gene therapy treatment [21][22][23] . Tracking cells by random labeling is one of the most sensitive systems 24 . In HSC-GT applications, haematopoietic stem cells (HSCs) are sorted from the bone marrow of the treated subject and uniquely labeled by the random insertion of a viral vector inside its genome. Each label, called clone, or integration site (IS), is defined as the genomic coordinates where the viral vector integrates. After transplantation, all the progeny deriving through cell differentiation inherits the original labels. During follow-up, the labels are collected from tissues and blood samples using Next Generation Sequencing (NGS) [25][26][27][28] . NGS is a recent approach for DNA and RNA sequencing, which consists of a complex interplay of chemistry, hardware, optical sensors and software [29][30][31][32] . In gene therapy applications NGS does allow identifying, quantifying and tracking clones arising from the same HSC ancestor. Over the past decades, clonal tracking has proven to be a cutting-edge analysis capable to unveil population dynamics and hierarchical relationships in vivo [33][34][35][36] . Clonal diversity, measuring how many distinct clones are collected and how they distribute, can address some of these aspects. Loosely speaking, the less distinct clones the lower the clonal diversity and in turn the less the system is being repopulated in that particular cell compartment. Furthermore, under the same number of different clones collected, the more their distribution is far from the uniform, the lower the clonal diversity and the more the dominance of few clones, thus suggesting the possible occurrence of an adverse event. The Shannon entropy index 37 , a well-established measure of population diversity in ecology studies 9 , nowadays is hugely used as a proxy of clonal diversity in gene therapy applications 10 . Following 37 , the Shannon entropy index is defined as where X has possible realisations x 1 , . . . , x n which occur with probabilities P(x 1 ), . . . , P(x n ) . Therefore, Shannon entropy is a special case (up to a change in sign and a multiplicative factor 1/n) of the Kullback-Leibler divergence 38 when the reference Q is the uniform distribution on x 1 , . . . , x n .  [39][40][41][42][43] . Indeed due to sampling and technical conditions, such as the amount of the host DNA being sequenced and the PCR being adopted, the number of reads obtained per library can span orders of magnitudes 12 which may affect the cellular counts and in turn their Shannon entropy. These differences in magnitude of library size/depth mainly depends on unequal pooling of PCR products before sequencing. In order to pool PCR products from individual samples in equimolar amounts 44 , DNA concentrations are commonly determined by ultraviolet-visible (UV) or fluorescence spectroscopy, real-time PCR or digital PCR 45 . Although these methods are very effective 46 , an identical library size across samples is difficult to achieve. Nonetheless, if we define the multiplicity of infection (MOI) as the average ratio between the number of virus particles and the number of target cells present in a defined space, then the actual number of viruses that will integrate on any given cell can be described by a stochastic process, such as some cells may absorb more than one infectious agent while others may not absorb any of them. Typically, the probability P(n|m) that a cell will absorb n virus particles when inoculated with an MOI of m can be modelled as a Poisson variable with rate m, Therefore, by definition, it is possible to increase the expected number of vector copies per cell (VCN) by properly tuning the MOI in the design of the experiment/treatment. As a result, the VCN may affect the number of IS collected and in turn the Shannon entropy.
In Fig. 1 we show the behaviour of the Shannon entropy index as a function of the DNA amount, the VCN and the sequencing depth (SD) in the case of an in-vitro assay described in "In-vitro assay" Section. Figure 1 suggests that the Shannon entropy index strongly depends on the quantitative confounders, until it reaches a steady-state. These features motivate us to use shape constrained splines (SCS) in order to model the effect of the candidate confounders on the entropy measurements.

Methods: shape constrained splines
Definition of the model. Shape-constrained splines (SCS) for fitting, smoothing and interpolation have been explored and proposed in various works, such as [47][48][49][50][51][52] . In this work we follow the cone-projection approach 51,52 . We model the logarithmic observed entropies h i 's, for i = 1, . . . , n , as a function of a SCS-bases C k i for every potential confounder k = 1, . . . , K plus a term F j i for any other additional feature of interest j = 1, . . . , J , so that where β 0 is the intercept, C k i is the basis of a quadratic spline for the k-th confounder used for observation i for which we assume a saturation state at the right boundary knot and a monotone increasing concave shape. For our applications, the boundary knots of a spline basis associated to a variable x are defined as the minimal and maximal value of x. The term F j i corresponds to the i-th observation of a basis describing the j-th additional component, such as the time or the cell type. The corresponding parameter vectors are β k c and β j f respectively. Finally we assume for the noise variable: Therefore, the model can be compactly written as www.nature.com/scientificreports/ where the number of features K and J depend on the project and/or the specific research questions that must be addressed. Quadratic splines are characterized by the discontinuity of the second-order derivative, which makes their treatments harder than cubic splines. This applies already to unconstrained spline fitting and interpolation.
In particular, the definition of quadratic penalised/smoothing splines is not straightforward. Therefore, in general, cubic splines should be preferred over quadratic splines. Despite this, in our in-vitro assay (VA) application only three distinct values of both the DNA amount and the vector copy number (VCN) are available. Therefore, in order to get a good trade-off between bias and variance as well as in order to obtain a full-rank design matrix, we chose quadratic splines with one interior knot for both the DNA amount and the vector copy number (VCN), and a quadratic spline with two interior knots for the sequencing depth. In Sect. S.4.1 of the supplementary material we compare the fits of quadratic and cubic splines (cf. Fig. S.1). For consistency, we also use quadratic splines in the mice study on genotoxicity, where our goal is to evaluate the impact of the vector design on the entropy decay; cf. Section "Viral vector safety in a mouse genotoxicity study".

Shape-constrained splines (SCS) normalization of Shannon entropy.
For simplicity, we set K = 1 and J = 0 in (4), namely we consider only one confounder and no additional factors of interest. The general case can be obtained straightforwardly. Here we follow 53 and we represent an (r + 1)-th order B-spline as where, for j = 1, . . . , q , the bases are iteratively computed as for a given sequence of evenly spaced knots ξ 1 ≤ ξ 2 ≤ · · · ≤ ξ q+r+2 , where q is the number of basis functions and β j 's are the corresponding coefficients. The first order derivative of (5) can be written as where δ is the distance between two adjacent knots. Since all B-spline basis functions are nonnegative by definition, a sufficient condition for m ′ (x) ≥ 0 , and in turn for the monotone-increasing shape of m(x), is Furthermore, the second order derivative of Eq. (5) can be written as Then a sufficient condition for m ′′ (x) ≤ 0 and in turn for the concavity of the spline in (5) is The monotonicity and concavity constraints can be written respectively as where If both monotonicity and concavity constraints must be satisfied, the first q − 2 constraints/rows of V are redundant, as stated by the following Lemma. www.nature.com/scientificreports/ Therefore by Lemma 1, if both constraints V and W are applied, the whole matrix of constraints reduces to Furthermore, we need to consider that sampling might be characterised by a sequencing saturation level due to technical limitations. In this case, the saturation level can be included by considering a steady-state/stationarypoint at the right boundary knot ξ q+r+2 , namely by setting where ∂ 1 B is the first derivative of the spline basis Our R code implementation allows to switch between the presence and absence of the saturation level by the additional logical input parameter SATU RAT ION. By default this parameter is set to TRUE, but it can be switched to FALSE if the user prefers not to implement a saturation level (or a steady state) w.r.t. a particular predictor variable. In our case of quadratic degree ( r = 1 ), the constraint (15) reduces to which can be written compactly using the following affine transformation where n X is the number of rows of X.
Estimation procedure. Given n observations (x i , y i ) of one predictor x and a response y, the restricted least squares estimate β RLS of β subject to the constraints (14) and (16) can be obtained as where Therefore, using (17)-(18) we can directly include the linear equality constraint inside the objective function, and the optimisation problem (19) reduces to Since f is a quadratic function, we solve (20) using quadratic optimization. To this end we use the function solve.QP() from the R package quadprog. Once the restricted least squares estimate β * RLS is obtained, we follow the cone projection approach 52 and we define a point-wise confidence interval (CI) with 1 − α/2 cover- where X I are the columns of BA indexed by I . Each weight p I represents the estimated probability that the projection of y lands on the cone's face corresponding to I . The probabilities p I are obtained by simulating many normal random vectors with mean vector ŷ = BAβ * RLS and covariance matrix σ 2 RLS I n , and recording the resulting sets I's, along with their frequencies. In case additional unconstrained components are present, the definition of (22) can be extended 52 . Furthermore, if we need to select from a set of candidate models featuring different covariates, we use information criteria 55 . For our analyses we use the corrected Akaike Information Criterion (AICc) for model selection, where L(β|y) is the likelihood of model M and p the number of parameters of M, which is equal to d in our set-up. In case some models have similar AICc values, we follow Burnham et al. 55 and we average across all ones using the frequentist model average estimator where β l is the parameter vector estimated under the l-th candidate model, and l the corresponding weight which can be computed as where BIC j is the Bayesian Information Criterion (BIC) associated with the j-th model. In case of model averaging, the BIC is preferred over AIC/AICc, since it provides a better estimation of the marginal likelihood 55 . From a Bayesian perspective, { j } j can be interpreted as an estimator of the posterior probabilities of the candidate models under a uniform prior {p(M j )} j , where

Applications of SCS in NGS data
In-vitro assay. To evaluate the reliability and sensitivity of the SCS rescaling method, we generated an IS dataset originating from an EBV-transformed B cell line transduced with a LV at Multiplicity of Infection (MOI) of 0.1, 1 and 10 to obtain DNA samples with increasing levels of polyclonality. Therefore, by increasing the MOI at each transduction we expect an increase in the vector copy number (VCN  Table S.2). The total number of sample's sequencing reads was used as proxy for the sample's sequencing depth (SD). The magnitude of VCN, DNA amount and SD affects the clonality so that the samples are incomparable.
Indeed Fig. 1 clearly shows a positive trend between the Shannon entropy index and the potential confounders. With the VA we are able to really understand the impact of the variables (confounding factors) to the entropy index, thus allowing a robust integrated analysis. We used the VA as "ground-truth" to compare our SCS-rescaling method with the competitor approaches (RAR and SRS). In our SCS method we took in consideration the DNA amount, VCN and SD as potential confounders.
In this case the number of candidate confounders is K = 3 with no additional factors of interest ( J = 0 ) and, according to the general formulation of Eq. (4), the model was defined as where we used two equidistant interior knots and the range of values as boundary knots for every SCS term in C . We report the corresponding fitted surface in Fig. 2 . In Fig. 2 we also show the rescaled values, i.e. the residuals that remain after having adjusted for the confounders. That is, according to the model definition in Eq. (4), we used the residuals as the rescaled values, where β c is the vector of the fitted parameters.
We compared our method with the two previously proposed in literature, such as the rarefaction (RAR ) 13 and the ranked subsampling (SRS) 18 approaches. We assessed the efficiency of the rescaling methods by correlation p-values for the two-sided test problem: where ρ(X 1 , X 2 ) is the Spearman's rank correlation between two random vectors X 1 and X 2 . We preferred Spearman's rank correlation over Pearson correlation since we assumed that the relationships are monotonic and possibly non-linear. Low p-values give statistical evidence for dependencies and thus for unsolved confounding effects. For the comparison, the total amount of reads of the sample with the lowest SD has been chosen as rarefaction level with 1000 replications for both the standard rarefaction (RAR ) and its ranked version (SRS). We report the results in Fig. 3. These pictures show that our SCS method outperformed both RAR and SRS methods in terms of correlation test p-values between the rescaled entropy and every potential confounder. Indeed, for all three confounders our new approach yields high p-values (0.37, 0.15 and 0.31 for DNA, VCN and SD respectively), so that we have no indication to reject the null hypothesis that the rescaled entropies and the confounder values are not correlated. For each of the two competing approaches we got 2-3 very low p-values ( ≪ 0.01 ), so that statistically significant amounts of correlations are left.
Subsequently, we also checked whether our SCS-rescaling method unveils comparable clonal levels among the samples. A proper rescaling method should return similar clonal diversities independently from the confounders. Indeed, Figs. 2 and 4 show that our SCS-rescaling method drastically reduced the variability of the observed www.nature.com/scientificreports/ entropies due to the effect of the confounders and, in turn, that the clonal level of the VA samples, measured by the SCS-rescaled Shannon entropy index, is approximately the same. It can be seen from Fig. 4 that the competitor methods RAR and SRS are also able to reduce the variability of the observed entropies. However, unlike our new SCS-rescaling method, the competing methods did not remove the effect of the confounders, as confirmed by the p-values of Spearman's rank correlation tests provided in Fig. 3. While the SCS-rescaling method made all (rank) correlations insignificant, the competing methods RAR and SRS left significant rank correlations (=dependencies) between confounders and the entropy.  19,20 . For the retrieval of vector insertion sites (ISs), peripheral blood was collected on a monthly basis from transplanted animals receiving transduced cells. Lymphoid B and T cells as well as myeloid cells were isolated by fluorescence activated cell sorting. To recover enough DNA material, equal amounts of blood from two or three mice belonging to the same experimental group were pooled before the sorting procedure. The composition of pools was maintained constant during the whole experiment, so that each pool is composed by the same mice over time. ISs were then retrieved by SLiM-PCR 58 at different time points from sorted T (CD3+) and B (CD19+) lymphocytes, from myeloid cells (CD11b+) and unsorted blood cells (total MNC). From the DNA purified from all the different sorted samples, we also measured the VCN by ddPCR. Overall, a higher amount of ISs were retrieved from the group of mice transplanted with Lin − cells transduced with PGK, reflecting the higher level of VCN observed in PGK versus LTR group of transplanted animals. Few statistics on the number of IS collected in each treatment/condition, along with the corresponding VCN, are reported in Table 1. The Shannon entropy index was then computed from each IS sample and the application of a simple spline without shape constraints and without considering any technical confounder yielded the results shown in Fig. 5. From Fig. 5 we cannot see a clear separation between the entropy profiles of the two vectors PGK and LTR. The prediction intervals overlap so that the differences in the profiles do not appear to be statistically significant. Henceforth, we cannot draw the conclusion that PGK is safer than LTR. However, the high variability of DNA amount (in nanograms), VCN, and the SD used for IS retrieval has a clear impact on the entropy measurements, as suggested by Fig. 6.
Furthermore, since some mice died faster, the size of each pool (PS) decreased over time, leading to variation in the cell counts and in turn in the Shannon diversity index calculations (see Fig. 6). Few statistics of these quantites are reported in Table 1 separately for each vector treatment.
This suggests that initial results of Fig. 6 might be biased by the presence of these confounding factors. The heterogeneity of these factors may affect the estimate of the cell counts and in turn the corresponding Shannon entropies. We therefore applied our shape-constrained spline approach of "Methods: shape constrained splines", including the DNA amount, VCN, SD and PS as potential confounders.
We proceeded as follows: We used the general formulation of Eq. (4), including a shape constrained spline (SCS) term with two interior knots for every confounder, plus a spline term w.r.t. the time decay for every combination of cell lineage/marker (L) and viral vector (V) as additional factors of interest. In this way we described the entropy decays separately for each combination of cell marker and treatment (viral vector) while removing the bias provided by the potential confounders. We also set the vector specific intercept V to zero to make sure all the individuals have the same clonal diversity before the treatments. Therefore, following the general formulation of Eq. (4), the model it has been explicitly defined as where C binds all the confounder's SCS bases and β c is the vector with all the corresponding parameters stacked together. Alike, F is a block-diagonal matrix where each block S l t is defined as www.nature.com/scientificreports/ and each sub-block S l,v t , corresponding to the l-th cell lineage and the viral vector v, is the basis of a monotone decreasing quadratic spline w.r.t. the time t for which we assume a steady-state to the left of the second right boundary knot. Indeed, each mouse pool started with 2-7 mice, which then successively died, until no mouse was left, so that no measurements could be taken anymore and therefore we do not expect any further change in the entropy thereafter. For this purpose we use again the affine transformation defined in Eqs. (17)- (18). Finally, we refer to β f as the vector with all the corresponding parameters stacked together. Therefore in this case the number of confounders is K = 4 and the number of additional factors of interest is J = 8 , namely a spline basis for the time-decay for every combination of the two treatments and the four cell lineages.
In order to identify the most important confounders among the candidates, we have fitted our model for each of the 2 4 − 1 = 15 possible confounder subsets. Each candidate model included always F as fixed term and featured at least one SCS term in C for the confounders. Then we averaged across the most likely models according to the frequentist criterion defined in Eqs. (24)-(25) and we reported in Fig. 7 the posterior distribution, along with the marginal inclusion probabilities of the four individual confounders.
Results from model averaging suggest that the posterior distribution is mainly dominated by three models namely: PS + SD (4th model), VCN + SD (6th model), and SD (8th model). The remaining 12 models get substantially lower posterior probabilities and thus have only negligible effects on the model averaging estimator. Therefore, after computing the frequentist model averaging estimator of Eq. (24), we used the residuals corresponding to the confounder terms as the rescaled values. Rescaled entropies are shown in Fig. 8 together with the lineage×vector-specific spline decays with a confidence interval of 0.95 coverage.
Thanks to the SCS-rescaling approach, a significant difference in the entropy decay for MNC, T-cells (CD3+) and Myeloid cells (CD11b) was observed depending on the genotoxicity level of the vector adopted. Whereas in the B-cell compartment no major differences in the entropy decay under the two vector treatments were observed. Indeed, consistently with the previous results 19,20 , the B-cell compartment is less affected by the genotoxicity of the LTR vector. In Sect. S.4.2 of the supplementary material we show that the RAR and the SRS approaches are less efficient than the proposed SCS approach, cf. Figs. S.5 and S.6 of the supplementary material.

Discussion
We have shown that the Shannon entropy index, a widely used measure of genetic variability, is strongly affected by the variability of technical factors. We have introduced a shape-constrained splines (SCS) approach aimed to quantitatively measure and remove the effect of confounders from the target of interest. In particular we have shown that our approach can remove confounding effects from the Shannon entropy index. We also have shown that our SCS approach outperforms all the state of the art rarefaction approaches like the RAR 13 and its ranked-subsampling version 18 . That is, using a correlation test, we have found statistical evidence that the SCSrescaled diversity measure does not significantly depend on the effect of the confounders anymore. Furthermore, our method is useful for genetic applications, as it provides an unbiased and more affordable measure of clonal diversity, and in turn it avoids to draw misleading results. As an example, the entropy decay of treatment-specific longitudinal studies may be erroneously interpreted if we do not take into account that the changes in the entropy increments may depend more on the confounders than on the biological treatments. Our method allows to discriminate between the two effects and to remove the one that comes from the confounders.
Since our approach is spline-regression based, its main limitation could be related to the available sample size. This is a potential problem when the number n of libraries is too low to define a spline basis. For the same reason the degree of the splines and the number of its knots should be chosen carefully. In particular, for the case of only  www.nature.com/scientificreports/ one library/sample it would be possible to rescale its diversity only using the parameters inferred from an external controlled environment, like the VA explored in Sect. 4.1, with a sufficient library size n. This is the main price we pay if we switch from a rarefaction-based rescaling approach to a spline-regression based one. Our model averaging approach allows also to rank the impact of the confounders according to their approximated posterior probabilities. We perform model averaging by means of the Bayesian Information Criterion which allows us to get an estimate of the marginal likelihood of each candidate model, and in turn, of the corresponding marginal confounder inclusion posterior probabilities. One possible methodological extension of this framework could be the implementation of a more precise method to estimate the marginal likelihood. This could, for example, either be done by Laplace Integration or by Bayesian thermodynamic integration.

Data availability
The data that support the findings of this study are openly available in the GitHub repository SHARES at https:// github. com/ calab rialab/ SHARES.

Code availability
The R code used to produce all the analyses of this work is openly available at the GitHub repository https:// github. com/ delco re-luca/ SCS.