Optimized design and analysis of preclinical intervention studies in vivo

Recent reports have called into question the reproducibility, validity and translatability of preclinical animal studies due to limitations in their experimental design and statistical analysis. To this end, we implemented a matching-based modelling approach for optimal intervention group allocation, randomization and power calculations, which takes full account of the complex animal characteristics at baseline, prior to interventions. In prostate cancer xenograft studies, the method effectively normalized the confounding baseline variability and resulted in animal allocations that were supported by RNA-seq profiling of the individual tumours. The matching information increased the statistical power to detect true treatment effects at smaller sample sizes in two castration-resistant prostate cancer models, thereby saving both animal lives and research costs. The novel modelling approach and its open-source, web-based software implementations enable researchers to conduct adequately powered and fully blinded preclinical intervention studies, with the aim of accelerating the discovery of new therapeutic interventions.

Supplementary Figure S1: Schematic illustrations of the common hierarchical structures that need to be taken into account in animal allocation, and the experiment designs of the two case studies. (a) Preclinical cancer studies commonly incorporate a layered hierarchical design, where multiple nested animals may originate from a single batch or cage, while multiple tumors may be located in a single animal. (b) ARN-509/MDV3100 intervention study with orthotopic VCaP prostate cancer cells in male immunodeficient mice (HSD: Athymic Nude Foxn1nu). According to the experimental procedure, orthotopic tumors were generated by injecting the cancer cells into the prostate of each animal. The growth of the tumors was followed by weekly measurements of serum PSA, indicating the tumor burden. The animals were castrated in two separate batches on subsequent weeks, resulting in two substrata with different tumor growth characteristics. The mice were followed by serum PSA measurements, and after the re-appearance of the tumors the mice were allocated into several CRPC treatment arms. The hierarchical allocation procedure, based on the global matching algorithm, ensures that the substrata are evenly distributed among the intervention groups. (c) ORX/ORX+Tx intervention study with analogous subcutaneous VCaP xenografts. A single substratum of animals was allocated into several intervention groups (out of which Control, ORX and ORX+Tx are presented in this paper), while some animals had to be excluded for ethical reasons.

Solving the non-bipartite submatching problem in the MDV3100/ARN-509 intervention study. (a) The animals (n = 75) were divided into two castration substrata, which were separately submatched only within each stratum and subsequently allocated evenly to the intervention arms (see Supplementary Fig. S1b). The matrix colors indicate dissimilarities in the baseline characteristics, and the box color indicates animals belonging to the same submatch. (b) Multidimensional scaling (MDS) two-dimensional projection of the complex baseline characteristics, with each submatch indicated by connecting edges and distinct coloring.

Supplementary Figure S5: Mixed-effects model fits in the ARN-509/MDV3100 intervention study (matched null hypothesis and matched inference; MDV3100 vs. Vehicle and ARN-509 vs. Vehicle).

Supplementary Figure S7: The difference between bipartite and non-bipartite matching, and a graphical representation of the steps in the branch and bound algorithm for solving the non-bipartite problem. (a) A bipartite matching problem, where the matching is identified between two pre-defined groups. (b) In the preclinical cancer context, non-bipartite matching enables the detection of comparable individuals from a single pool of animals, based on similarities in their baseline characteristics. (c) Branching implicitly enumerates all possible combinations of matches in the solution. In this particular example, the branching structure is presented for the matching of pairs (G = 2) for 6 individuals. (d) Concept of the bounding function in a continuous minimization task (lower objective function values are preferred). A bounding function is utilized to discard branches in the tree-like structure (panel c) by concluding that a certain range (branch) of solutions cannot improve the current best solution. In this example, the ranges x ≤ X1 and X4 ≤ x do not have to be searched, as the bounding function indicates that the current best solution (shown in red) cannot be improved in these solution ranges. However, solutions in X1 ≤ x ≤ X2 and X2 ≤ x ≤ X3 have to be tested, since the bounding function suggests a possible lower theoretical boundary in these solution ranges.

Supplementary Figure S8: Evaluation of the animal allocations in the ARN-509/MDV3100 VCaP xenograft study using Mantel's test, which compares the pre-intervention dissimilarity matrices of the baseline animal characteristics to the post-intervention mRNA gene expression profiles of the treated tumors. By visual inspection, two interesting dissimilarity sub-groups were identified (pink boxes).
Further, one exceptional baseline animal remained an outlier also at the tumor mRNA level (pink arrow). (a) The dissimilarity matrix of the baseline characteristics for the sequenced animals (n = 12) was calculated using the standardized Euclidean distance. (b) The dissimilarity matrix of the RNA-seq expression profiles (fragments per kilobase of exon per million mapped reads, FPKM) was calculated using the Euclidean distance. (c) Distribution of the permuted correlation statistic. A statistically significant Spearman correlation was observed between the baseline characteristics and the post-intervention mRNA expression (red line), based on n = 10,000 permutations of the dissimilarity matrices (Mantel's test).
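The Mantel permutation test described above can be sketched in a few lines. The snippet below is an illustrative Python sketch (the actual analyses were conducted in R); the function name mantel_test and its defaults are our own assumptions. The rows and columns of the second dissimilarity matrix are permuted jointly to build the null distribution of the correlation statistic.

```python
import numpy as np
from scipy.stats import spearmanr

def mantel_test(d1, d2, n_perm=1000, seed=0):
    """Permutation-based Mantel test: Spearman correlation between the
    upper triangles of two dissimilarity matrices; the null distribution
    is built by jointly permuting rows/columns of the second matrix."""
    rng = np.random.default_rng(seed)
    n = d1.shape[0]
    iu = np.triu_indices(n, k=1)
    observed = spearmanr(d1[iu], d2[iu])[0]
    perm_stats = np.empty(n_perm)
    for i in range(n_perm):
        p = rng.permutation(n)
        perm_stats[i] = spearmanr(d1[iu], d2[np.ix_(p, p)][iu])[0]
    # one-sided p-value: proportion of permuted correlations >= observed
    pval = (np.sum(perm_stats >= observed) + 1) / (n_perm + 1)
    return observed, pval
```

In practice, d1 would hold the baseline dissimilarities and d2 the post-intervention expression dissimilarities of the same animals, in the same order.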

Supplementary Figure S9: A simulation run of 1,000 matched two-group datasets was generated for each combination in the parameter grid, resulting in a total of 432,000 datasets, for which matching was conducted and data were drawn from multivariate normal distributions with the given parameters. The matching procedure was used as in the manuscript, while conventional randomization allocated groups of equal size to both experiment arms while ignoring the baseline information. A paired or non-paired t-test was used to determine whether there was a group difference at the α = 0.05 significance threshold. The following parameters were varied: magnitude of the true group difference µ1 − µ2 ∈ {0, 1, 2}; sample size per group N ∈ {5, 10, 15}; magnitude of informativeness s ∈ {0, 0.4, 0.7} of the q predictive baseline variables; count of predictive baseline variables q ∈ {1, 3, 10, 20}; and count of non-predictive baseline variables p ∈ {1, 3, 10, 20}. A few key results are annotated in the simulation results: (a) Interestingly, when matched allocation was used, the specificity of the testing was markedly increased when no true group difference was present. This phenomenon persisted in the non-paired testing, highlighting that matching-based allocation also serves to improve specificity, and that non-paired testing can benefit even if the matching information is not utilized in the post-intervention testing. (b) A benefit in sensitivity was observed for the small group-wise difference (µ1 − µ2 = 1) in comparison to the non-matched testing, as long as the number of predictive markers was at least the number of non-informative baseline markers (q ≥ p). As expected, this advantage was lost if no informative markers were present (s = 0), but no loss of accuracy was observed in comparison to the conventional methods.
(c) In small explorative studies (N = 5), a slight advantage in sensitivity was observed, especially if the baseline markers were highly predictive (s = 0.7), highlighting that predictive markers may help narrow down candidates more effectively in such explorative studies with their typically smaller sample sizes N.

Supplementary Figure S10: Simulation results as a function of the predictiveness parameter s, with the default R loess smoothing applied for the visualization of the curves. A positive detection was defined using the conventional significance threshold of p < 0.05 for the multiple regression term testing differences between the two simulated groups. The overall performance of each modeling strategy was assessed with the area under the curve (AUC) over the whole range of correlation of the covariate with the outcome (s). This summarizes the findings over the whole correlation spectrum, both where there was no predictive baseline information (low s) and where the single covariate had strong predictive power (high s) but was confounded by the three additional random confounder covariates. The three columns indicate the different sample sizes N ∈ {5, 10, 15}. (a-c) Simulations with no group difference present. (d-f) Mediocre group difference. (g-i) Strong group difference.

Supplementary Figure S11: A single end-point testing example from the VCaP ARN-509 / MDV3100 study.
Hotelling's T² multivariate extension of the t-test was used to illustrate how two end-point markers can be tested with or without the matching information. In this case the two end-point markers were highly correlated, illustrating that PSA was a feasible surrogate marker serving as a proxy for the actual tumor size in the orthotopic VCaP animal model. (a) In the non-paired case, the MDV3100 measurements overlapped to some extent with the sacrifice measurements from the Vehicle group. (b) Pairing the end-point markers and comparing them against the null hypothesis that the multivariate normal distribution has mean µ = {0, 0}. The paired adjustment revealed a difference between Vehicle and MDV3100, which was consistent with the results observed in the longitudinal PSA analysis (Table 1).
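The paired comparison in Supplementary Figure S11 follows the standard one-sample Hotelling T² construction on the pairwise differences, tested against the null mean vector µ = 0. A minimal Python sketch is given below for illustration (the analyses themselves were run in R, and the function name is hypothetical):

```python
import numpy as np
from scipy.stats import f as f_dist

def paired_hotelling_t2(x, y):
    """One-sample Hotelling T^2 test on the paired differences x - y
    against the null hypothesis that the mean vector equals zero."""
    d = np.asarray(x, float) - np.asarray(y, float)
    n, p = d.shape
    dbar = d.mean(axis=0)
    S = np.cov(d, rowvar=False)            # sample covariance of the differences
    t2 = n * dbar @ np.linalg.solve(S, dbar)
    fstat = (n - p) / (p * (n - 1)) * t2   # F-transform of T^2
    pval = f_dist.sf(fstat, p, n - p)
    return t2, pval
```

Here x and y would hold the two end-point markers (e.g. PSA and tumor size) for the matched animals of the two arms, one row per matched pair.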

Distance measure Formula
Gower dissimilarity *    continuous: |x_i − y_i| / R_i; binary/categorical: 0 if x_i = y_i, 1 otherwise

a obtained as a special case of Minkowski when r = 2; b obtained as a special case of Mahalanobis when S is a unit diagonal matrix; c obtained as a special case of Mahalanobis when S is a diagonal matrix; d obtained as a special case of Minkowski when r = 1; e obtained as a special case of Minkowski when r → ∞; † not scale-invariant, thus data normalization should be considered; * suitable for mixed-type data. Gower's dissimilarity coefficient is obtained by summarizing over all the available variables i = 1, 2, …, d.
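As a worked example of the Gower formula above, the Python sketch below computes the dissimilarity of two mixed-type observations, averaging the per-variable contributions over all d variables (the function and argument names are our own; the actual analyses used R):

```python
def gower(x, y, ranges, is_numeric):
    """Gower dissimilarity for two mixed-type observations.
    Numeric variables contribute |x_i - y_i| / R_i, where R_i is the
    observed range of variable i; binary/categorical variables
    contribute 0 if equal and 1 otherwise. The coefficient averages
    the contributions over all available variables."""
    terms = []
    for xi, yi, ri, numeric in zip(x, y, ranges, is_numeric):
        if numeric:
            terms.append(abs(xi - yi) / ri)
        else:
            terms.append(0.0 if xi == yi else 1.0)
    return sum(terms) / len(terms)
```

For instance, two animals with weights 1 and 3 (range 4) and differing categorical cage labels would have a Gower dissimilarity of (0.5 + 1) / 2 = 0.75.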

Orthotopic VCaP xenograft in immunodeficient mice: ARN-509 and MDV3100 interventions
We used our recently established orthotopic VCaP xenograft model, which enables one to model the characteristics of castration-resistant prostate cancer (CRPC) growth in vivo and to study intratumoral androgen biosynthesis in CRPC 22 . Similar to clinical CRPC, androgen receptor (AR) expression is restored in the VCaP model, despite undetectable serum androgen levels after castration. Furthermore, AR-mediated signaling continues to play a key role in tumor growth, as indicated by the response to anti-androgen treatments measured by serum PSA. In the analyzed experiment, we tested the effects of two AR antagonists on the castration-resistant growth of the VCaP cells. Both of the antiandrogens (MDV3100 and ARN-509) block the binding of the endogenous ligand to AR. Furthermore, MDV3100 also promotes AR degradation, whereas ARN-509 in particular inhibits the nuclear import and DNA binding of AR.
The study has been described in detail previously 22 . Briefly, adult male immunodeficient mice (HSD: Athymic Nude Foxn1nu, Harlan Laboratories, 6 to 8 weeks of age) were housed in individually ventilated cages under controlled conditions of light (12 h light / 12 h dark), temperature (21 ±3ºC), and humidity (55% ±15%) in specific pathogen-free conditions at the Central Animal Laboratory, University of Turku. The mice were given irradiated soy-free natural-ingredient feed (RM3 (E), Special Diets Services) and autoclaved tap water ad libitum. One million VCaP cells in 20 μl medium were inoculated orthotopically into the dorsolateral prostate through an abdominal incision. Isoflurane (Baxter) was used to induce anesthesia. For pain relief, mice were injected s.c. with buprenorphine (Temgesic, Reckitt Benckiser Healthcare, 0.05-0.1 mg/kg) and carprofen (Rimadyl, Pfizer, 5 mg/kg) before and after the operation, respectively.
Tumor growth was followed by weekly serum prostate-specific antigen (PSA) measurements until the mice were sacrificed. Tumors were allowed to grow for 4-5 weeks, until serum PSA had reached at least 5 µg/l in 60% of the animals and the mean serum PSA value was approximately 15 µg/l. Thereafter, all mice with tumors were castrated within two subsequent weeks (weeks 4 and 5). This resulted in a dramatic reduction in the serum PSA concentration in all mice, while after a few weeks castration-resistant tumors emerged in 83% of the mice that had increased PSA levels prior to castration. The mice were then allocated to multiple treatment arms while retaining a balance between the groups based on the PSA levels measured at week 10, the change of the PSA level from week 9 to 10, the body weights at week 9, the cage placement, and the week the castration took place. Subsequently, the mice were randomized to masked treatment arms, each containing 15 matched animals, in order to guarantee that prognostically comparable mice were available for the treatment effect assessment. All the groups were constrained to have an equal number of mice castrated on week 9 or 10, to prevent stratification with respect to this factor. Animals were treated with vehicle or with the novel antiandrogens MDV3100 or ARN-509 (20 mg/kg/day). The vehicle and the antiandrogens were administered by gavage once a day for 28 days. The antiandrogens were synthesized at Orion Pharma (Finland). The vehicle contained 50% PEG300 (Merck KGaA) + 35% 100 mg/ml glucose solution (Baxter) + 10% Tween80 (Merck KGaA) + 5% DMA (Merck KGaA). The study was conducted in accordance with the Animal Experiment Board in Finland (ELLA) for the care and use of animals under the license ESAVI/1993/04.10.03/2011.

Subcutaneous VCaP xenograft in immunodeficient mice: ORX and ORX+Tx interventions
Male immunodeficient mice (Athymic Nude-Foxn1 nu , Harlan Laboratories, initially 5 to 6 weeks of age) were housed in individually ventilated cages under controlled conditions of light (12h light /12h dark), temperature (22 ±2ºC) and humidity (55% ±15%) in specific pathogen-free conditions at the animal facilities of Orion Corporation, Orion Pharma, Finland. The mice were given irradiated soy-free natural-ingredient feed (RM3 (E), Special Diets Services, England) and filtered, UV treated, tap water ad libitum.
Two million VCaP cells in 150 μl of RPMI medium (Gibco®, Life Technologies, Canada) complemented with Matrigel™ (1:2, BD Biosciences, Belgium) were inoculated subcutaneously (s.c.) into the right flanks of the mice. Development of the tumors was monitored by measuring their volume twice a week, and by measuring the serum concentration of prostate-specific antigen (PSA) every ten days. The volume of the tumors was calculated according to the following formula: W² × L / 2 (W = shorter diameter, L = longer diameter of the tumor). For the PSA measurements, approximately 100 µl of blood was collected by saphenous vein puncture, and the PSA was measured with a time-resolved fluorometer (Wallac, PerkinElmer Analytical Life Sciences) as described previously 49 . Reagents for the PSA fluorometric assay were provided by Kim Petterson (University of Turku, Finland). Tumors were allowed to grow for 7 weeks, until the mean volume of the tumors reached approximately 300 mm³ and the mean serum PSA value was approximately 20 µg/l. Animals were randomized into three groups, taking into account the animal weight, tumor size and PSA concentration. Two thirds of the animals were castrated and the remainder were left as intact controls. Castrations (ORX and ORX+Tx, where Tx is an undisclosed intervention) were carried out under isoflurane-induced anesthesia (2-3%, Baxter S.A., Belgium). For pain relief, mice were injected s.c. with buprenorphine (0.1 mg/kg, Temgesic® 0.3 mg/ml, Reckitt Benckiser Healthcare Ltd., United Kingdom) and carprofen (5 mg/kg, Vet Rimadyl® 50 mg/ml, Pfizer SA, Belgium) before and after the operations. Animals were treated with vehicle (2 ml/kg) administered by gavage once a day, from day 50 onwards, until the end of the study (8 weeks). The vehicle contained 0.5% methylcellulose in water and 0.1% Tween® 80 (Merck, Germany) solution.
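The caliper-based volume formula above can be captured in a one-line helper; the Python sketch below is illustrative only (the function name is ours), with W the shorter and L the longer diameter in millimeters:

```python
def tumor_volume(w, l):
    """Ellipsoid-approximation tumor volume W^2 * L / 2 in mm^3,
    with W the shorter and L the longer caliper diameter (mm)."""
    if w > l:
        raise ValueError("W must be the shorter diameter")
    return w ** 2 * l / 2.0
```

For example, a tumor measuring 6 mm by 10 mm yields 6² × 10 / 2 = 180 mm³.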
The study was conducted in accordance with the Animal Experiment Board in Finland (ELLA) for the care and use of animals under the license ESAVI/7472/04.10.03/2012.

Constrained randomization in the optimal matching
Let N be the number of individuals participating in the experiment to be allocated to G equally sized experimental groups. Lower case letters g = {a,b,c, …} are used to annotate the groups.
The allocation is denoted by a label vector f = {f1, f2, …, fN}, where f1 is the masked group label given to the 1st individual, and so on. We assume that the desired groups are balanced in size; thus, each label from g occurs in f a total of N/G times. Since the labels are not fixed to specific actual intervention groups, the annotated labels a, b, c, … are interchangeable and only separate class boundaries. As an example, the allocation of 4 animals to two groups is done with a vector f of length 4, and g = {a,b}, where a and b indicate the masked group labels. Enumerating all the possible combinations for allocating 4 individuals to two groups gives the following list: {a, a, b, b}, {a, b, a, b}, {b, a, a, b}, {a, b, b, a}, {b, a, b, a}, {b, b, a, a}. The solution {a, b, b, a} is equivalent to {b, a, a, b} prior to assigning true treatment labels to the masked labels g, which is ideally performed by a person other than the randomizer. Suppose that the optimal matching has resulted in the matching of individuals {1,3}. This indicates that individuals 1 and 3 should be allocated to different groups. Similarly, the match {2,4} in the same optimal matching solution suggests that individuals 2 and 4 should be allocated to different groups. Therefore, the allowed allocations are constrained to solutions where f1 ≠ f3 and f2 ≠ f4 based on the optimal matching. This criterion is fulfilled in the following allocations: {a, a, b, b}, {b, a, a, b}, {a, b, b, a}, {b, b, a, a}. Out of the above set of 4 feasible solutions, one is finally picked at random with equal probability for each instance. This produces the final chosen allocation vector, which is then given as the masked labels to the experimental investigators.
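The 4-animal example above can be reproduced with a brute-force enumeration. The Python sketch below is purely illustrative (the hamlet package itself is written in R, and the function names are ours); it uses 0-based indices, so the matches {1,3} and {2,4} become (0, 2) and (1, 3):

```python
import itertools
import random

def constrained_allocations(n, groups, split_pairs):
    """Enumerate the balanced allocations of n individuals to the masked
    group labels, keeping only the allocations in which every matched
    pair is split across different groups. Brute force, so suitable
    only for small illustrative examples."""
    labels = [g for g in groups for _ in range(n // len(groups))]
    feasible = set()
    for perm in itertools.permutations(labels):
        if all(perm[i] != perm[j] for i, j in split_pairs):
            feasible.add(perm)
    return sorted(feasible)

# Example from the text: 4 animals, two groups, matches {1,3} and {2,4}
solutions = constrained_allocations(4, "ab", [(0, 2), (1, 3)])
final = random.choice(solutions)  # the randomly picked masked allocation
```

The enumeration yields exactly the four feasible allocations listed above, from which one is drawn uniformly at random.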
Additional constraints are trivial to add to the above methodology. For instance, in the ARN-509/MDV3100 experiment, a potential cage effect was normalized out by adding constraints to the matching-based allocation: for each pair of individuals i and j originating from the same cage, the allowed solutions had to fulfill the criterion fi ≠ fj.

Branch and bound algorithm (exact optimization)
The branch and bound steps in the algorithm for the global solution are defined as follows. Branching step: The branching step must implicitly enumerate all possible paths in the branch and bound tree, if the global optimum is to be guaranteed. We chose a strategy in which, at every branching step, the submatches are enumerated for each available unmatched individual (Supplementary Fig. S7c). This enumeration is structured so that potentially more similar submatches are prioritized in the tree search. The search tree begins with an empty node, where no individuals have yet been matched (root). At the first step, all possible matches that include the first index are enumerated. Similarly, further down in the branching tree, it is always the first free index that is enumerated. This spans a search tree that would, if needed, include all possible combinations exactly once.
Bounding step: In order to reduce the size of the search tree (Supplementary Fig. S7c), branches of the tree need to be discarded based on an optimistic bounding function. Supplementary Fig. S7d presents the concept of a bounding function in a minimization task for a continuous variable x, which generalizes to the discrete optimization task at hand. Assume we know some feasible solution (a known local minimum) to the optimization task, which may be used as a starting point. Based on a bounding function, we know a boundary for the best possible solutions found in a certain part of the solution space. For example, in Supplementary Fig. S7d it is known that the local minimum lies inside X3 < x ≤ X4, along with the value of that local minimum f(x). Based on the bounding function and our current best known minimum, solutions in the space x ≤ X1 may be discarded, as the optimistic bounding function shows that the best solution in this range cannot improve our current best known solution. Similarly, solutions in X4 ≤ x may be discarded. In contrast, solutions in the space X1 < x ≤ X2 need to be tested, since our bounding function does not guarantee that solutions in this range can be discarded. If the bounding function had been tighter, i.e. closer to the actual function being minimized, this range could also have been discarded, as it does not truly include a solution better than the current minimum. Furthermore, the range X2 < x ≤ X3 needs to be tested based on the bounding function, and in this case the global optimum would indeed be found in this range.
The bounding function is similarly utilized in the branch and bound algorithm to discard large numbers of solutions that cannot improve the current best known matching. After each branching step, a bounding function is computed for each branch through relaxation. For this purpose, the optimization problem in equations (1A-E) is temporarily modified by omitting either the constraint (1B) or (1C). Omission of either constraint allows matches to more than G − 1 individuals. For each individual index that has not yet been fixed in the search tree, the corresponding rows and columns in D are checked. Per each row, the G − 1 lowest distances are picked. These distances are then summed over all the non-fixed rows, which yields the lowest sum of distances available for the non-fixed indices. Notice that due to the relaxation, the obtained boundary may not be a truly feasible solution to the real optimization task, but it is as low as or lower than the minima that fulfill the criteria set in equations (1B-C). Therefore, it is used as an optimistic bounding function for the branch and bound algorithm.
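The branching and bounding steps above can be sketched for the pairwise case (G = 2). The Python sketch below is an illustration, not the hamlet implementation: it uses a simpler optimistic bound (half of each free index's cheapest remaining distance, summed) in place of the relaxation of equations (1B-C), and all names are our own.

```python
import math

def min_pair_matching(D):
    """Exact branch and bound for non-bipartite matching into pairs
    (G = 2): finds the pairing of all indices that minimizes the total
    within-pair distance in the distance matrix D."""
    n = len(D)
    best = {"cost": math.inf, "pairs": None}

    def bound(free):
        # Optimistic bound: every free index pays at least half of its
        # cheapest edge (each pair's distance is counted from both ends).
        return sum(min(D[i][j] for j in free if j != i) for i in free) / 2.0

    def branch(free, pairs, cost):
        if not free:
            if cost < best["cost"]:
                best["cost"], best["pairs"] = cost, pairs
            return
        if cost + bound(free) >= best["cost"]:
            return  # bounding step: prune this branch of the search tree
        i = free[0]  # always branch on the first free index
        for j in free[1:]:
            rest = [k for k in free if k not in (i, j)]
            branch(rest, pairs + [(i, j)], cost + D[i][j])

    branch(list(range(n)), [], 0.0)
    return best["cost"], best["pairs"]
```

The recursion mirrors the branching step (always expanding the first free index) and the pruning condition mirrors the bounding step; with the pruning line removed, the function degenerates to a full enumeration of all pairings.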

Genetic algorithm (heuristic optimization)
In order to provide a faster local optimization algorithm in addition to the globally optimizing branch and bound algorithm, we provided a genetic algorithm inspired by similar usage of this family of methods in improving experimental design 22 . The genetic algorithm (GA) is an open-ended framework for mimicking evolutionary behavior in solving an optimization task. The user provides a desired population size of candidate solutions, and new generations of solutions are then generated using pre-defined evolutionary mechanics. In our application, the fitness of a solution is better if the matching solution has a lower target function value, and a fitter solution is thus more likely to produce offspring or live on to the next generation of solutions. Our algorithm consists of the following evolutionary events (with experimentally observed feasible default tuning parameters): i. The initial random population is generated by creating a simple legal matching matrix and randomly permuting its rows and columns, creating new randomized legal solutions that fulfill the constraints set in equations (1A-E).
ii. Point mutations are randomly introduced in each generation by performing either a swap of two rows or of two columns in a single existing matching matrix, for randomly chosen individual solutions. iii. New generations are bred from two parents (two solution matrices M1 and M2) by combining the matches identified by both parents. Each element m_i,j = 1 that is shared by both parents is propagated to the offspring as it is, while the remaining elements, where either both parents had m_i,j = 0 or only one parent had m_i,j = 1, are randomly permuted to generate the rest of the solution. iv. Solutions with a worse fitness, i.e. a higher optimal target function value in equation (5), are more likely to die in each simulated generation and are replaced by the breeding of new solutions.
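The evolutionary events i-iv can be sketched for the pairwise case with a simpler permutation encoding (a permutation read off two indices at a time) instead of the matching-matrix encoding. This Python sketch is an illustration under those simplifying assumptions, with all names our own: mutation swaps two positions, crossover keeps the pairs shared by both parents and randomly pairs the rest, and lower total distance means higher fitness.

```python
import random

def ga_pair_matching(D, pop_size=30, generations=200, seed=0):
    """Heuristic genetic algorithm sketch for pairwise matching:
    returns the fittest permutation found, read as consecutive pairs."""
    rng = random.Random(seed)
    n = len(D)

    def pairs_of(perm):
        return {tuple(sorted(perm[i:i + 2])) for i in range(0, n, 2)}

    def cost(perm):  # lower total within-pair distance = better fitness
        return sum(D[i][j] for i, j in pairs_of(perm))

    def mutate(perm):  # event ii: swap two positions
        a, b = rng.sample(range(n), 2)
        child = list(perm)
        child[a], child[b] = child[b], child[a]
        return child

    def crossover(p1, p2):  # event iii: keep shared pairs, permute the rest
        shared = pairs_of(p1) & pairs_of(p2)
        rest = [i for i in range(n) if all(i not in pr for pr in shared)]
        rng.shuffle(rest)
        return [i for pr in shared for i in pr] + rest

    # event i: initial random legal population
    pop = [rng.sample(range(n), n) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        pop = pop[:pop_size // 2]  # event iv: worse solutions die
        while len(pop) < pop_size:
            p1, p2 = rng.sample(pop[:5], 2)
            pop.append(mutate(crossover(p1, p2)))
    return min(pop, key=cost)
```

Unlike the branch and bound algorithm, this heuristic does not guarantee the global optimum, but it scales to problem sizes where exact enumeration is infeasible.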

Data and code availability
The utilized R-package hamlet 25 along with its source code is available in the Comprehensive R Archive Network (CRAN) and can be installed by typing install.packages("hamlet") in the R terminal. After loading the package with library("hamlet"), all the measurements for the ARN-509 / MDV3100 study, for both the pre-intervention baseline and the post-intervention longitudinal data, are fully available from the R-package through the commands data(vcapwide) and data(vcaplong), respectively. Similarly, all the measurements for the ORX / ORX+Tx study, for the pre-intervention baseline and the post-intervention longitudinal data, are available with the commands data(orxwide) and data(orxlong), respectively. The RNA-sequencing files for the ARN-509 / MDV3100 study have been deposited in the European Nucleotide Archive (ENA) with the identification code PRJEB11552.

Graphical User Interface (GUI)
To enable the wider use of these algorithms, also by researchers without bioinformatics skills or resources, we have implemented the core set of functions as part of a web-based graphical user interface (GUI) named R-vivo. The web interface was implemented using the R Shiny software platform (RStudio, Inc.), and it is freely accessible at our in-house computer server (http://biomedportal.utu.fi:3838/utu-apps/Rvivo/). In addition to the power calculations, R-vivo includes user-friendly options and visualizations both for the pre-intervention analyses (animal matching, randomization and treatment group allocation) as well as for the post-intervention analyses (including statistical testing of the treatment effects). In its simplest form, the user can input the raw baseline measurements of animals/tumours, compute a local or global matching with or without randomization, and output the relevant information, such as sub-matches or blinded group labels, to be further processed by the experimental investigators. A practical challenge in the sampling-based power calculation is that, since the computations are not performed on closed-form parameter distributions, the simulations can be relatively time consuming. However, the user has options for quick pilot simulations to gain insight into the magnitude of the required sample size, and can then perform a more detailed simulation once the exact sample numbers are to be proposed. In the GUI, this precision is controlled by the number of bootstrap samples within each sample size (n). Response measurements and baseline variables from the VCaP xenograft experiments are available both within the hamlet R-package and in the R-vivo interface for testing purposes.

Simulation study for predictive baseline covariates
In order to examine the effects of matching as a function of the number and type of confounding baseline variables, the following single end-point schema was simulated:

Sample generation
The samples z were drawn from the multivariate normal distribution z ~ N(µ, Σ), where the covariance structure was constructed as follows. The symmetric covariance matrix Σ with dimensions (q + p + 1) × (q + p + 1) was separated into 3 components to simulate different types of baseline covariates: i) q predictive rows/columns with cross-covariances, unit variance, and a predictive covariance of magnitude s in connection to the final prediction vector y; ii) p non-predictive rows/columns with zero cross-covariances and unit variance; iii) the final simulated prediction vector y, which corresponded to the (q + p + 1):th dimension of the covariance matrix. The cross-covariance structure in Σ was adjusted using the Higham method 50 to obtain the closest positive definite covariance matrix, with the objective of being able to sample from the multivariate normal distribution.
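The covariance construction above can be sketched as follows. This Python sketch is illustrative and makes two labeled assumptions: the within-block cross-covariances of the predictive covariates are set to the same magnitude s as their covariance with y, and a simple eigenvalue clipping stands in for the Higham nearest-positive-definite projection used in the paper. All function names are our own.

```python
import numpy as np

def build_covariance(q, p, s):
    """Construct the (q+p+1) x (q+p+1) covariance matrix: q predictive
    covariates coupled to the outcome (last dimension) with magnitude s,
    p non-predictive covariates with zero cross-covariance, and unit
    variance throughout. Eigenvalue clipping is a simplified stand-in
    for the Higham nearest-positive-definite projection."""
    d = q + p + 1
    cov = np.eye(d)
    cov[:q, :q] = s                       # assumed within-block covariance
    np.fill_diagonal(cov[:q, :q], 1.0)    # restore unit variances
    cov[:q, -1] = cov[-1, :q] = s         # predictive block <-> outcome y
    # project onto the positive definite cone (simplified Higham step)
    w, v = np.linalg.eigh(cov)
    w = np.clip(w, 1e-6, None)
    return v @ np.diag(w) @ v.T

# Sampling 2N observations, here with q = 1, p = 3, s = 0.7 and N = 50
rng = np.random.default_rng(1)
Z = rng.multivariate_normal(np.zeros(5), build_covariance(1, 3, 0.7), size=100)
```

The last column of Z then plays the role of the prediction vector y, while the remaining columns form the baseline covariate matrix.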

Pseudoalgorithm
The following algorithm was then run 1,000 times over the parameter grid with given q, p, s, N, and µ1 − µ2:
A. Sample 2N observations from the multivariate normal distribution Z specified by q, p, and s.
B. Set aside the response vector y from the (q + p + 1):th dimension. The remaining (q + p)-dimensional observation matrix was utilized as the baseline matching information.
C. 1) If no matching was utilized, randomly allocate y to 2 sample groups of equal size using simple permutation. 2) If matching at baseline was utilized, match based on the (q + p)-dimensional confounder matrix and randomly allocate the submatches to 2 sample groups of size N, as described in the manuscript.
D. Increment the y belonging to the different sample groups according to the desired µ1 − µ2.
E. Perform statistical testing using either non-paired or paired t-testing with the statistical significance threshold α = 0.05: 1) If no matching information was to be utilized, perform conventional non-paired t-testing between the y belonging to the different group labels. 2) If matching was to be utilized, perform paired t-testing where the matching couples the pairs of y that were matched in step C2.
F. Compute the proportion of significant findings over all runs and report the findings over the parameter grid.
This led to three different approaches: i) matched randomization, paired testing: steps C2 and E2; ii) matched randomization, non-paired testing: steps C2 and E1; iii) conventional randomization, non-paired testing: steps C1 and E1. The sampled observations were kept constant for each of the different method approaches, and the resulting significant findings are reported in Supplementary Fig. S9.
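A single run of steps A-E can be sketched as below. This Python sketch makes simplifying assumptions that are not in the original procedure: baseline similarity is summarized by a single covariate sampled bivariately with the outcome, and matching pairs neighbouring animals after sorting on that covariate. All names are our own.

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

def one_run(N, effect, s=0.7, seed=0):
    """One simplified run of steps A-E: sample baseline and response,
    allocate by matched randomization, add the group effect, and test
    with both non-paired and paired t-tests."""
    rng = np.random.default_rng(seed)
    # Step A (simplified): one baseline covariate x correlated with y
    cov = [[1.0, s], [s, 1.0]]
    z = rng.multivariate_normal([0.0, 0.0], cov, size=2 * N)
    x, y = z[:, 0], z[:, 1].copy()
    # Step C2 (simplified): sort on baseline, pair neighbours, and
    # randomize the group assignment within each matched pair
    order = np.argsort(x)
    pairs = order.reshape(N, 2)
    flip = rng.integers(0, 2, N).astype(bool)
    pairs[flip] = pairs[flip, ::-1]
    g1, g2 = pairs[:, 0], pairs[:, 1]
    # Step D: add the true group difference to the second group
    y[g2] += effect
    # Step E: non-paired (E1) versus paired (E2) testing
    p_unpaired = ttest_ind(y[g1], y[g2]).pvalue
    p_paired = ttest_rel(y[g1], y[g2]).pvalue
    return p_unpaired, p_paired
```

Repeating such runs over the parameter grid and counting the proportion of p-values below α would reproduce step F of the pseudoalgorithm.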

Simulation study for baseline-adjusted or matched regression models
In the simulations showing how matched inference based on baseline matching of animals differs from a standard multiple regression model that adjusts for baseline differences after the interventions, we performed the random allocation according to 1) matched random allocation and 2) conventional randomization without using the matching information. We inspected three types of regression models: (i) a non-adjusted model that only included a group-difference term and that considered y without any pairing (even if the matching was used in the allocation); (ii) a covariate-adjusted multiple regression model, in which all the covariates were included as coefficients to compensate for possible baseline differences, and in which the non-paired y was modeled as the response; this was run both with and without baseline-matched randomization; and (iii) a matched model, in which the baseline-matched pairwise differences of y were modeled through the intercept of a simple linear regression model. This resulted in a total of 5 different modeling strategies (Supplementary Fig. S10).
In order to simulate a difficult scenario, where the researcher may have to choose between different confounders and may end up including even spurious ones, we simulated the case where 3 out of the 4 baseline covariates were randomly sampled from a standardized normal distribution with no cross-covariance to the outcome whatsoever, while the 4th covariate was sampled to have a given correlation (s) with the outcome y. The outcome y was also sampled from a standardized normal distribution and then shifted according to a matched-randomization-based group assignment or by conventional randomization. This was similar to the procedure used above in "Simulation study for predictive baseline covariates" for q = 1 and p = 3, with a continuous range of the predictiveness s. A total of 1,000 datasets were simulated in this parameter grid, with a known group difference (µ1 − µ2 ∈ {0, 1, 2} for no, mediocre, or strong group difference), sample size per group (N ∈ {5, 10, 15}), and unit variance for all the covariates (i.e., variance equal to one).
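The three model types can be illustrated with ordinary least squares. The Python sketch below uses hypothetical helper names of our own: the group-difference term is estimated either from the non-adjusted model (y ~ group), from the covariate-adjusted multiple regression model (y ~ group + covariates), or, for the matched model, as the intercept of a regression on the pairwise differences, which reduces to their mean.

```python
import numpy as np

def group_coefficient(y, group, covariates=None):
    """Least-squares estimate of the group-difference term, from either
    the non-adjusted model (y ~ group) or the covariate-adjusted
    multiple regression model (y ~ group + covariates)."""
    n = len(y)
    cols = [np.ones(n), np.asarray(group, float)]
    if covariates is not None:
        cols.extend(np.asarray(covariates, float).T)  # covariates: n x k
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]  # coefficient of the group indicator

def matched_difference(y, pairs):
    """Matched model: the group effect is the intercept of a regression
    on the baseline-matched pairwise differences, i.e. their mean."""
    d = np.array([y[j] - y[i] for i, j in pairs])
    return d.mean()
```

Running these estimators over datasets generated with and without matched randomization would reproduce the five modeling strategies compared in Supplementary Fig. S10.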

Introduction
hamlet is an R package intended for the statistical analysis of pre-clinical studies. This document is a basic introduction to the functionality of hamlet and a general overview of the analysis workflow of preclinical studies.
This document is structured as follows: First, a general overview to inputting and processing the raw data is presented. Second, functionality is presented for the processing of pre-intervention data. Finally, functionality is presented for the post-intervention period, along with brief discussion on the differences between non-matched and matched statistical approaches. Each section comes with a list of useful functions specific for the subtask.
The latest version of hamlet is available in the Comprehensive R Archive Network (CRAN, http://cran.r-project.org/). CRAN mirrors are by default available in the installation of R, and the hamlet package is installable using the R terminal command install.packages("hamlet"). This should prompt the user to select a nearby CRAN mirror, after which the installation of hamlet is performed automatically. After the install.packages call, the hamlet package can be loaded with either the command library("hamlet") or require("hamlet").
The following notation is used in the document: R commands, package names and function names are written in typewriter font. The notation of format pckgName::funcName indicates that the function funcName is called from the package pckgName. If only the function name is given, this indicates that it is located in the base package in R and is thus always available.

Analysis workflow
Two different types of case-control setups for the analysis of pre-clinical studies are presented in Fig. 1.
The type A experiment design in Fig. 1 is preferred, as matching is performed before allocation to the experiment groups, which improves the balance and power of the experiment. The alternative experiment type B requires a bipartite matching task, where suitable pairs of individuals are identified across two or more groups that existed prior to matching. This document focuses on the experiment design of type A, where the similarity information is utilized already before the interventions.

Pre-intervention analyses

2.1 Loading data into R
The hamlet package comes pre-installed with the VCaP dataset, which is used here to illustrate the workflow. Two different formats of the data are provided. The first is available in data(vcapwide), which includes the data in the so-called wide format. In this data format the columns are indicators for the different variables available for the experimental unit (here, an animal). For example, the first two rows of observations can be extracted with, e.g., vcapwide[1:2,]. The second format is available in data(vcaplong), which includes the data in the so-called long format. The former wide format is useful for summarizing multiple variables when constructing distance matrices for the data. The latter long format is typically used for longitudinal mixed-effects modeling, where observations are correlated through time.
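The relationship between the two formats can be demonstrated with base R's reshape function. The toy column names below are invented for the example; the real vcapwide/vcaplong columns differ.

```r
# Toy wide-format data: one row per animal, one column per time point
wide <- data.frame(ID = c("A1", "A2"),
                   PSA_week1 = c(1.2, 0.8),
                   PSA_week2 = c(2.5, 1.9))

# Convert to long format: one row per measurement, as required for
# longitudinal mixed-effects modeling
long <- reshape(wide, direction = "long",
                varying = c("PSA_week1", "PSA_week2"),
                v.names = "PSA", timevar = "Week",
                times = c(1, 2), idvar = "ID")
long
```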

Excel format data
An example view of a pre-clinical dataset is given in Fig. 2. Such a dataset can be saved in an R-friendly format by selecting option File > Save As and CSV (Comma delimited) as the save format in MS Excel.

CSV-files
CSV (Comma-Separated Values) is a suitable text-based format for the data to be read into R using either the function read.table or read.csv. The example CSV file presented above can be opened with the following command:

> ex <- read.table(file="example.csv", sep=";", dec=",", stringsAsFactors=FALSE, header=TRUE)

The parameters are interpreted as follows:
• file="example.csv" is the first parameter and indicates the input file in our current working directory. The working directory may be changed using the command setwd, or the full path may be included in the file parameter, e.g. file="D:/my/current/windows/working/directory/example.csv".
• sep=";" indicates that the values on each line are separated with the symbol ';', as is the format defined for CSV files with ","-decimals. This could also be a value such as "\t" (tabulator) or " " (space).
• dec="," indicates that the ',' symbol is used for decimals. The default value for indicating decimals is '.' otherwise.
• stringsAsFactors=FALSE indicates that strings should not be handled as factors. A factor is an R class in which a character string may only take instances of a predetermined set of values. As each of our animal IDs (which are read as strings) is unique, it is generally more flexible to conserve them as character strings.
• header=TRUE indicates that the first row of the CSV file is a header row, which includes a name for each column. If this value is set to header=FALSE, the first row of the text file is read as the first observation and the columns are left unnamed.
Depending on the country of origin, CSV files may use '.' decimals and ',' separators, or alternatively (as assumed here) ',' decimals and ';' separators.
List of useful functions: • read.table • read.csv
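The CSV conventions above can be demonstrated with a fully self-contained example that writes a small European-style file and reads it back. The file contents here are invented for the example; read.csv2 is a base-R shortcut with the same ";"/"," defaults.

```r
# Write a ';'-separated, ','-decimal CSV to a temporary file and read it back
tmp <- tempfile(fileext = ".csv")
writeLines(c("ID;PSA;BodyWeight",
             "A1;1,2;24,3",
             "A2;0,8;22,7"), tmp)

ex <- read.table(tmp, sep = ";", dec = ",", header = TRUE,
                 stringsAsFactors = FALSE)
# read.csv2(tmp) would use equivalent defaults for sep and dec
ex$PSA   # parsed as numeric despite the ','-decimals
```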

Distance and dissimilarity functions
A distance or dissimilarity function is used to describe the amount of dissimilarity between two experimental units. Common choices for computing the dissimilarity between two vectors x and y include:
• Euclidean distance: d(x, y) = sqrt(sum_i (x_i − y_i)^2)
• Standardized Euclidean distance: d(x, y) = sqrt(sum_i ((x_i − y_i) / s_i)^2)
• Mahalanobis distance: d(x, y) = sqrt((x − y)' S^(−1) (x − y))
Here, x and y are expected to be observation vectors of length P, where each dimension describes the measured value for a particular covariate. S describes the variance-covariance matrix between the covariates, and therefore incorporates inter-correlations between variables. The standard deviations s may be used to standardize differences in variation over the dimensions. Table 1 shows the Euclidean distance matrix for the 18 animals presented in Figure 2.
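All three dissimilarities can be computed with base R. The toy baseline values below are invented for illustration.

```r
# Toy baseline data: PSA and body weight for 4 animals (values invented)
X <- cbind(PSA = c(1.2, 0.8, 2.5, 1.9),
           BW  = c(24, 22, 27, 25))

d_euc <- dist(X)          # Euclidean distance between all pairs of rows
d_std <- dist(scale(X))   # standardized (unit-variance) Euclidean distance

# Mahalanobis distance between animals 1 and 2, using the sample
# variance-covariance matrix S of the covariates
S <- cov(X)
diff12 <- X[1, ] - X[2, ]
d_mah12 <- sqrt(drop(t(diff12) %*% solve(S) %*% diff12))
```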
List of useful functions: • Multigroup non-bipartite matching: hamlet::match.bb • Paired non-bipartite matching: hamlet::match.bb, nbpMatching::nonbimatch • Paired bipartite matching: optmatch::fullmatch

2.5 Non-bipartite optimal matching of animals at baseline (GA)

While the above-described Branch and Bound (BB) algorithm is guaranteed to identify the global optimum, in some cases it is not feasible due to the size of the search tree. In such cases, a feasible optimum is easily detected using the Genetic Algorithm implementation provided in hamlet::match.ga. A Genetic Algorithm (GA) commonly includes many parameters, as it aims to mimic evolutionary processes in solving a problem, here a non-bipartite multigroup matching problem. The basic parameters that should be considered include generations, which indicates for how many generations the simulation is run and thus increases the run time approximately linearly, and popsize, which indicates how many solutions should be "living" in the whole population at a given generation. A rule of thumb is that many problems are solved well by the 1,000th generation if the population size is at least 100, but the user may want to use the visualizations and diagnostic plots to inspect how well the GA has managed to solve the optimization problem. The convergence happens over the generations as presented in Figure 3.

> set.seed(1) # GA is a stochastic algorithm; fix the seed for reproducibility
> sol.ga <- match.ga(d, g=3, generations=100, popsize=100)
> sol.ga[[3]] # Identified solution by GA

The GA, with a linear run time, has resulted in a solution very close to the guaranteed global optimum identified using BB, which may in some cases lead to drastic increases in run time. Both algorithms may thus be applied where appropriate.

Randomization based on matched individuals
The submatches identified in the above section should not be mistaken for the randomly allocated intervention groups. The final intervention groups are obtained by dividing the members of each submatch in the found solution into separate treatment arms. Since the within-submatch distances are minimized, this guarantees that comparable individuals are randomly divided to separate arms:

> ex[,"Submatch"] <- submatches
> set.seed(1)

As seen in Table 2, each submatch (column Submatch) consists of similar experimental units in terms of the baseline characteristics (i.e. PSA and body weight). Furthermore, the baseline data has now been allocated in such a manner that each submatch distributes evenly to the proposed intervention groups (column AllocatedGroup), resulting in intervention groups that are balanced at baseline. The artificial labels A, B, and C may then be given to an external experimenter in a blinded manner and mapped to the true intervention labels in any fashion without any pre-fixed control group, as all pairwise contrasts have been considered in the submatching procedure.
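The within-submatch randomization step can be sketched in base R. This is a toy illustration of the principle, not hamlet's allocation code; the submatch labels are invented.

```r
# Minimal sketch of matched randomization: each submatch of size 3
# contributes exactly one animal to each of the three arms
set.seed(1)
submatch <- rep(1:3, each = 3)     # 9 animals in 3 submatches of size 3
arms <- c("A", "B", "C")

# Randomly permute the arm labels within every submatch
allocated <- unlist(lapply(split(seq_along(submatch), submatch),
                           function(idx) sample(arms)))
table(allocated)   # each arm receives one member of every submatch
```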

Visualizations for pre-clinical data
Various visualization functions are available to illustrate the baseline balance. For example, boxplots with respect to the allocation groups can be plotted using a command such as boxplot, as illustrated in Figure 4. Mixed variable scatterplots with annotations for the submatches or allocation groups are plotted using the function hamlet::mixplot, as seen in Figures 5 and 6.

Power analysis
The power simulations provided by the hamlet package are conducted through bootstrap (sampling with replacement) simulations using a pre-fitted mixed-effects model. For this purpose, it is essential that the user pre-defines a suitable mixed-effects model with the lmer function of the lme4 package, as this will be used in the sampling process. The function mem.powersimu is the main hamlet function that performs this sampling; it automatically identifies the suitable experimental unit from the lme4 object, and then re-fits the model structure a pre-defined number of times at the given N values.
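The general idea of estimating power by repeated model refitting can be illustrated with a simplified Monte Carlo sketch in base R. Note that this is not the mem.powersimu implementation: it simulates fresh data with a plain lm fit rather than bootstrapping a pre-fitted mixed-effects model, and the effect and error sizes are invented.

```r
# Simplified power estimation: simulate a two-group experiment 'boot' times
# at a given per-group sample size n, and record how often a simple linear
# model detects the (known) group effect at alpha = 0.05
set.seed(1)
power_at_n <- function(n, effect = 1, sd = 1, boot = 200) {
  pvals <- replicate(boot, {
    y <- c(rnorm(n, 0, sd), rnorm(n, effect, sd))
    g <- rep(0:1, each = n)
    summary(lm(y ~ g))$coefficients["g", "Pr(>|t|)"]
  })
  mean(pvals < 0.05)   # fraction of simulations detecting the effect
}
sapply(c(5, 10, 20), power_at_n)   # power increases with N
</imports>
```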

An artificial example
In a situation where the user wishes to generate artificial data, it is important for the experimenter to evaluate factors such as:
• How many measurement time points will be available
• What is the expected effect size
• Will right-censoring (death or sacrifice) occur, and what are the risk criteria
• Are there baseline differences, or is there a correlation between the initial baseline response level and the intervention efficacy
The user is encouraged to creatively produce such expert-curated data either by hand or through a tailored simulation function. As a practical example, such a function will be constructed below (mainly utilizing normal distributions). In order for the artificial data to be modeled using the lme4 package, it should follow the long format.
As an example, data with an initial baseline response level of µ = 5 and σ = 2 will be generated from the normal distribution. Four follow-up time points will be available after the initial baseline; the expected control growth will be 2 per time point, while the intervention is expected to change the growth by −1 per time point. Furthermore, we simulate right-censoring that has a 20% chance of occurring for individuals reaching response values > 10. In this artificial example, 5 individuals will be available for both the control and the intervention group. Each measurement will have a measurement error with no bias (µ = 0) and σ = 2.
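One possible generator for such data is sketched below. The function name and structure are illustrative (the parameters follow the description above); for brevity, censoring is applied here per measurement, whereas a death event would drop all subsequent rows of that individual.

```r
# Hedged sketch of the artificial long-format data described in the text
simulate_artdat <- function(n_per_group = 5, timepoints = 0:4,
                            baseline_mu = 5, baseline_sd = 2,
                            control_growth = 2, intervention_effect = -1,
                            error_sd = 2, censor_p = 0.2, censor_above = 10) {
  grid <- expand.grid(ID = 1:(2 * n_per_group), Time = timepoints)
  grid$Group <- ifelse(grid$ID <= n_per_group, 0, 1)  # 0 = control, 1 = intervention
  base <- rnorm(2 * n_per_group, baseline_mu, baseline_sd)
  slope <- control_growth + intervention_effect * grid$Group
  grid$Response <- base[grid$ID] + slope * grid$Time +
    rnorm(nrow(grid), 0, error_sd)
  # right-censoring: 20% chance of dropping a measurement above the threshold
  censored <- grid$Response > censor_above & runif(nrow(grid)) < censor_p
  grid[!censored, ]   # long format: one row per measurement
}
set.seed(1)
artdat <- simulate_artdat()
head(artdat)
```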

Structure of a mixed-effects model
A standard mixed-effects model would, for example, include the following coefficients, given the input data artdat (the formula coefficients need to correspond to column names in the input data.frame):
> f1a <- as.formula(Response ~ 1 + Time + Time:Group + (1 + Time|ID))
> f1b <- as.formula(Response ~ 1 + Time + Time:Group + (1|ID) + (0 + Time|ID))
This formula is structured as follows: • The left-hand side Response is our response vector y.
• The non-parenthesized coefficients following the tilde are the so-called fixed effects, which are here population-wise parameters. • The first right-hand side coefficient 1 stands for the standard model intercept, i.e. the y level when x = 0. • The coefficient Time captures the natural growth of the tumors as a function of time. • The coefficient Time:Group introduces the grouping information as an interaction with the growth coefficient, and thus tests whether the intervention confers a growth inhibition advantage.
• The terms in parentheses are random effects, with counterparts analogous to their fixed effects. The difference is that the grouping variable, indicated here with |ID, gives flexibility for each experimental unit to have a deviating intercept and growth slope. A separate value from a normal distribution with mean 0 and an estimated standard deviation is identified for each unit when fitting the mixed-effects model. The random effects allow individualized response curves, while accounting for the fact that multiple observations belong to a single individual (ID).
• An alternative non-correlated random-effects structure is given in f1b, indicated by separating the two random-effects terms so that no cross-correlation is estimated.

Bootstrap simulations
The package lmerTest is used to provide the Satterthwaite approximation for the p-values of the fixed effects in the linear mixed-effects model. Albeit the p-values are provided here for the model coefficients, we are interested in how the power in such a study would develop as a function of the number of animals N. For this purpose we can perform power simulations, which bootstrap the pre-fitted mixed-effects model on our artificial data:

> set.seed(1)
> pow <- mem.powersimu(fit1, N=c(3,5,7,9,11,13,15), boot=20, level="ID", strata="Group")
> abline(h=0.8, col="grey")
> pow

Notice that the artificial data simulations were run with a very limited bootstrap sample size, in order to save time in the generation of this vignette. A better estimate for the power, as well as a more exact N, would be given e.g. by setting boot=1000 and N=3:20. Furthermore, we indicated with level that our experimental unit is defined by the individual indicator ID, and with strata that we want to subsample evenly over the intervention groups. The resulting power curve is shown in Figure 8, suggesting that in order to achieve sufficient statistical power, our experiment should include at least N = 9 individuals in both intervention arms, with a total experiment size of N = 18 individuals.

Post-intervention analyses
The presented pre-intervention submatching procedure provides a unique opportunity to utilize this predictive power to improve accuracy in the post-intervention inference. We here provide the datasets from both the ARN-509/MDV3100 study and the ORX/ORX+Tx study, available by typing data(vcaplong) and data(orxlong), respectively. Alternatively, the pre-intervention data are available in data(vcapwide) and data(orxwide), respectively.

Long format and the presented datasets
As outlined before, in order to perform regression modeling, R requires the observations to be in the so-called long format, where each row in a data.frame corresponds to a single measurement. These measurements are then usually uniquely defined using individual identification codes as well as time points. Typically, an experimenter models a single interesting contrast with a single model; thus we will split vcaplong and orxlong into two separate data sets.

Collating to pairwise submatched observations
In order to fit pairwise-matched mixed-effects models, the experimenter should utilize the baseline submatch information to subtract the corresponding control growth from its intervened counterpart. For this, a field indicating the Submatch should be available in the data frame, and the subtraction is done in a pairwise manner per each time point DrugWeek. For example:

> arndat <- arndat[order(arndat[,'Submatch']),]
> arnpair <- do.call('rbind', by(arndat, INDICES=arndat[,'Submatch'], ...))

In the above example, the first pairwise computed PSA difference is 23.62 − 21.30 = 2.32 at time point DrugWeek = 0 (baseline). The following time point, DrugWeek = 1, results in turn in a much more drastic growth of the control tumor, i.e. 22.09 − 45.69 = −23.60. In this particular example, the treated tumor grows much slower for 3 weeks subsequent to the baseline, until it bounces back at the final time point.
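The pairwise subtraction can be demonstrated on toy data in base R. The data frame below is invented (only the m1 values reproduce the numbers quoted above); the real vignette operates on the arndat data frame.

```r
# Toy sketch of baseline-matched pairwise differences: within each submatch,
# the control response is subtracted from its treated counterpart per time point
dat <- data.frame(Submatch = rep(c("m1", "m2"), each = 4),
                  Group    = rep(rep(c("Control", "Treated"), each = 2), 2),
                  DrugWeek = rep(0:1, 4),
                  PSA      = c(21.30, 45.69, 23.62, 22.09,   # m1
                               18.00, 30.00, 19.50, 25.00))  # m2 (invented)

pairdiff <- do.call(rbind, by(dat, dat$Submatch, function(m) {
  ctrl <- m[m$Group == "Control", ]
  trt  <- m[m$Group == "Treated", ]
  trt  <- trt[match(ctrl$DrugWeek, trt$DrugWeek), ]   # align the time points
  data.frame(Submatch = ctrl$Submatch, DrugWeek = ctrl$DrugWeek,
             dPSA = trt$PSA - ctrl$PSA)
}))
pairdiff   # m1, week 0: 23.62 - 21.30 = 2.32; week 1: 22.09 - 45.69 = -23.60
```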

Fitting conventional and pairwise matched mixed-effects models
Linear mixed-effects models are composed of two main components:
• Fixed effects: population effects, usually considered to cover either the whole range of experimental units or a subpopulation such as an intervention group.
• Random effects: individual effects that allow flexible model fits. Typical random effects include a random intercept (variation at baseline) and a random slope (individual variation in the growth coefficient).
The formula interface in R for fitting linear mixed-effects models in the lme4 package consists of three parts:

y ~ Xb + Zu + e

where the LHS refers to the left-hand side, i.e. the response vector y, which is usually a tumor growth feature such as serum PSA or tumor volume. The right-hand side of ~ holds the fixed effects part (here Xb), the random effects part (here Zu), and the normally distributed error term e.
Fixed effects are separated using the + sign. A typical longitudinal preclinical model could be built on three fixed-effects terms:

1 + Time + Time:Group

where 1 refers to a common intercept, which could alternatively be omitted by using either a 0 or a −1 instead of the 1. Time is typically a running time point indicator that starts from 0 at baseline and extends over certain time units such as weeks or days, so that tumor growth is computed as a slope coefficient as a function of time. Furthermore, the term Time:Group adds a binary indicator Group that may be used to compare a subpopulation against the control tumor growth.
Furthermore, random effects follow the notation:

(1 + Time | GroupingFactor)
(1 | GroupingFactor) + (0 + Time | GroupingFactor)

where the notation indicates that each unique factor value within the variable GroupingFactor is treated as an instance of the experimental unit. The upper notation indicates that both an individual-specific intercept and an individualized time-dependent slope are estimated, along with a cross-covariance between the two normally distributed random effects. The lower notation in turn estimates these two effects, the individual-specific intercept and the individualized time-dependent slope, separately from each other. By default we utilized the lower notation, though the upper notation may offer an interesting alternative. The error term does not need to be explicitly included in the model formula. In the ARN-509 study, the presented mixed-effects models were fitted using:

R-vivo User Instructions
Introduction R-vivo is a browser-based interface for the hamlet R-package (Fig. 1). Its functionalities are intended mainly for refining and improving the experimental design and statistical analysis of pre-clinical intervention studies, although the tools can be generalized to cover other relevant fields. R-vivo analysis functions are divided into two main subcategories: • Pre-intervention analysis • Post-intervention analysis (incl. power analysis) Both subsections require specific data format(s). Pre-intervention analysis uses the so-called wide format data and post-intervention analysis uses the so-called long format data (see more detailed examples below).

File upload
R-vivo supports several different input file formats:
• Excel files: .xls, .xlsx
• Comma separated: .csv
• Text files: .txt
R-vivo automatically detects the most suitable of the above file formats. If your data is in the long format, press the "Data is in long format"-button after uploading the file. Long format data is suitable for downstream mixed-effects modelling, both for post-intervention statistical testing and for power analyses prior to conducting an actual experiment. In the wide format data (Fig. 2), the columns are indicators of the variables available for the experimental unit (here, an animal or a tumor). The wide format is utilized for constructing dissimilarity matrices from the multivariate baseline data. In the long format data (Fig. 3), a selected set of longitudinally interesting response variables is available (here, e.g., PSA or body weight), and the different observations belonging to a single experimental unit (here, an animal) are uniquely distinguished using the measurement time (field Week or DrugWeek) and/or identification codes (field ID). In this experiment, treatment is initiated at week 10, so DrugWeek = 0 is the baseline for the initiation of interventions. The long format is typically used for longitudinal mixed-effects modeling, where multiple observations are correlated within an experimental unit.
In this particular example (Fig. 4), matching of animals is conducted before allocation to the treatment arms based on three characteristics: PSA level at baseline, PSA level at the week prior to allocation, and body weight at baseline.

Figure 4:
Example data further used in this document.

Non-bipartite multigroup matching of the example data
The multigroup matching is the fundamental experimental design improvement presented in the current work. By identifying similar subgroups (submatches) of experimental units, the pipeline allows these similar units to be randomly distributed evenly across the intervention arms. This leads to each unit having a corresponding experimental unit in each of the other intervention arms, and results in balanced distributions of the baseline characteristics over the treatment arms. As the submatches consider all pairwise comparisons, the intervention arms can be blinded (for instance, there is no need to fix a control group), and therefore comply with the rigorous standards applied e.g. in controlled clinical trials.
The pre-intervention analyses begin with uploading of wide format data (Fig. 5).
Step 2: If unnecessary columns are present in the data, process the data accordingly (Fig. 6).
• Press the "Continue to data selection!"-button • Select the desired columns • Choose a representative name for the data • Press "Save and continue!" to advance to the next step Figure 6: Column selection.

Step 4: Matching
The non-bipartite optimal matching problem can be solved using the implemented Genetic Algorithm (GA). The algorithm takes as input a distance matrix and the desired size of the submatches (i.e. the number of animals/tumours per subgroup, and thus per intervention group). If the submatch size is for example G = 3, the GA minimizes the dissimilarity edges within triplets. These subgroups of the desired size are called submatches. (Fig. 8)
• Select the desired distance matrix
• Select the submatch size G, i.e. the count of desired intervention arms
• Press the "Match"-button
• The default options are suitable for small to mediocre sized experiments, but a better solution may be found by checking "Advanced options" and tuning the GA parameters. Notice that parameters such as generations will result in a linear increase in the GA run time.
• Press the "Save and continue!"-button to advance to the next step
Hierarchical Clustering:
• Go to the "Hierarchical Clustering"-tab
• Select the desired dissimilarity matrix to plot (Fig. 11)
• Press "Download" to get a PDF file of the clustering
Figure 11: Hierarchical clustering of the presented data.
• Press "Download" to get an interactive picture of the heatmap.

Scatterplot matrices
Scatterplot matrices can be used to roughly determine whether there are linear correlations between variables, or to identify other interesting pairwise trends between the variables in the data.

Steps:
• Go to "Scatterplotmatrix"-tab • Select variables for the plots (Fig. 13) • Press "Download" to get a PDF file of the scatterplot matrix

Post-intervention
Post-intervention refers to the modeling of the provided long format data, here using mixed-effects modeling. It may be utilized to fit a mixed-effects model and to test population-wise hypotheses (fixed effects) and to model the individual variation (random effects).
Power simulation here is done by bootstrapping (sampling with replacement) the data source of a pre-fitted mixed-effects model, and then re-fitting the model and estimating the fixed-effects coefficients on the bootstrapped datasets. However, please note that the power analysis is not intended to be used for retrospective speculation about an already conducted experiment. Its purpose is to model and simulate either an artificial or a pilot experiment, in order to improve accuracy and to guarantee sufficient statistical power in a representative future study.

Upload data
• Upload long format data • Press the "Data is in long format"-button (Fig. 14) • Press the "Continue to Post-intervention analysis!"-button to advance to the next step Figure 14: Long format data input for mixed-effects modeling.

Power analysis
The power analyses are sensitive to the model structure, and in order to obtain a smooth and accurate estimation of statistical power, the user should use a sufficient amount of bootstrapped datasets. Additionally, a tight grid of N values may ensure that exact numbers are reported. Here N refers to tested group-wise amounts of animals. It is a vector with a minimum value, maximum value, and a grid-spacing; for example, an N vector with minimum 5 and maximum 14 with a spacing of 3 would test N = {5, 8, 11, 14}.
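The example N grid above corresponds to a simple arithmetic sequence, which in R is generated with seq():

```r
# N grid: minimum 5, maximum 14, spacing 3
N <- seq(5, 14, by = 3)
N   # 5 8 11 14
```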
• After fitting a mixed-effects model as described above, go to the "Power analysis"-tab • Select a suitable sequence of N values to test • Select the number of bootstraps per N value (note: the calculation time increases linearly) • If the model contains grouping information, check the corresponding checkbox and select the tested group.
• If a loess smoothed curve is desired for approximation purposes, choose the "Smoothed curve"-checkbox • Each differently colored line represents the power of a fixed effects coefficient as a function of the sample size N (Fig. 16). • Download a report of the power simulations by pressing "Download" Figure 16: A bootstrapped power curve for each of the specified fixed effects, which utilizes the previously fitted mixed-effects model.
The ARRIVE Guidelines Checklist, adapted from Kilkenny et al.1 The background of orchidectomy (ORX) is well known in the literature, while the undisclosed intervention (Tx) remains to be further described.

Objectives 4
Clearly describe the primary and any secondary objectives of the study, or specific hypotheses being tested.
To longitudinally model the response to the antiandrogens ARN-509 and MDV3100 in orthotopic VCaP cells. To longitudinally model the response to orchidectomy (ORX) as well as to a novel therapeutic intervention (Tx) in subcutaneous VCaP cells.

Ethical statement 5
Indicate the nature of the ethical review permissions, relevant licences (e.g. Animal [Scientific Procedures] Act 1986), and national or institutional guidelines for the care and use of animals, that cover the research.
All animal handling was conducted in accordance with Finnish Animal Ethics Committee and institutional animal care policies, which fully meet the requirements defined in current NIH guidelines on animal experimentation (license number 1993/04.10.03/2011).
The study was conducted in accordance with the Animal Experiment Board in Finland (ELLA) for the care and use of animals under the license ESAVI/7472/04.10.03/2012.

Study design 6
For each experiment, give brief details of the study design including: a. the number of experimental and control groups; b. any steps taken to minimise the effects of subjective bias when allocating animals to treatment (e.g. randomisation procedure) and when assessing results (e.g. if done, describe who was blinded and when); c. the experimental unit (e.g. a single animal, group or cage of animals). A time-line diagram or flow chart can be useful to illustrate how complex study designs were carried out.
120 mice were originally used for the study. Eventually 45 mice were used to form 3 study groups, one control group and two treatment groups, each including 15 mice (Supplementary Fig. S1b). The experimental unit was a single animal. Blinding, allocation, and randomization were conducted as presented in this manuscript.
Originally 109 mice were allocated to 6 experimental groups, each containing 14 to 16 mice (Supplementary Fig. S1c). The experimental unit was a single animal. Blinding, allocation, and randomization were conducted as presented in this manuscript.