A framework to assess the impact of number of trials on the amplitude of motor evoked potentials

The amplitude of motor evoked potentials (MEPs) elicited by transcranial magnetic stimulation (TMS) is a common yet highly variable measure of corticospinal excitability. The tradeoff between maximizing the number of trials and minimizing experimental time remains a hurdle. It is therefore important to establish how many trials should be used. The aim of this study is not to provide rule-of-thumb answers that may be valid only in specific experimental conditions, but to offer a more general framework to inform the decision about how many trials to use under different experimental conditions. Specifically, we present a set of equations that show how the number of trials affects single-subject MEP amplitude, population MEP amplitude, hypothesis testing and test–retest reliability, depending on the variability within and between subjects. The equations are derived analytically, validated with Monte Carlo simulations, and representatively applied to experimental data. Our findings show that the minimum number of trials for estimating single-subject MEP amplitude largely depends on the experimental conditions and on the error considered acceptable by the experimenter. Conversely, estimating population MEP amplitude and hypothesis testing are markedly more dependent on the number of subjects than on the number of trials. These tools and results help to clarify the impact of the number of trials in the design and reproducibility of past and future experiments.


Results
Analytical results. Number of trials for estimating single-subject MEP amplitude: previous studies. In a hypothetical single-pulse TMS experiment in which MEP amplitude is collected for $n_{\max}$ trials in single subjects, the cumulative average $\hat{\mu}_{\text{trials}}(n)$ is defined as the average MEP amplitude obtained with the first $n$ trials, so that $\hat{\mu}_{\text{trials}}(n_{\max})$ is the sample average with all trials. For simplicity, we will refer to the sample average (with $n_{\max}$ trials) simply as $\hat{\mu}_{\text{trials}}$. Previous studies empirically defined the optimal number of trials for estimating single-subject MEP amplitude as the minimum number of trials $n_{\text{opt}}$ that allows the cumulative average to come within a certain level of 'acceptable similarity' to the sample average. Two main measures of 'acceptable similarity' were used: (i) a 95% confidence interval ($n_{\text{opt\_ci}}$) 26–29,31 , and (ii) a ± 10% difference ($n_{\text{opt\_\%diff}}$) 28 around the sample average.
We can define the inclusion of the cumulative average within the desired level of acceptable similarity as a probability of inclusion $p_{\text{incl}}$, so that $\alpha = 1 - p_{\text{incl}}$. With central limit theorem assumptions, here we show that both $n_{\text{opt\_ci}}$ and $n_{\text{opt\_\%diff}}$ are analytical functions of $n_{\max}$, namely:
$$ n_{\text{opt\_ci}} = \frac{n_{\max}}{1 + \left( \dfrac{z_{1-\alpha_{\text{ci}}/2}}{z_{1-\alpha/2}} \right)^{2}} \tag{1} $$
$$ n_{\text{opt\_\%diff}} = \frac{n_{\max}}{1 + n_{\max} \left( \dfrac{\eta\, \hat{\mu}_{\text{trials}}}{z_{1-\alpha/2}\, \hat{\sigma}_{\text{trials}}} \right)^{2}} \tag{2} $$
where $z_{1-\alpha_{\text{ci}}/2}$ is the critical value of the standard normal distribution for a confidence interval of $1 - \alpha_{\text{ci}}$ (e.g. for a 95% c.i., $\alpha_{\text{ci}} = 0.05$ and $z_{1-\alpha_{\text{ci}}/2} = 1.96$), $z_{1-\alpha/2}$ is the critical value corresponding to the probability of inclusion $p_{\text{incl}}$, $\eta$ is the relative error that defines the acceptable difference from the sample average (e.g. for ± 10%, $\eta = 0.1$), and $\hat{\mu}_{\text{trials}}$ and $\hat{\sigma}_{\text{trials}}$ are the sample average and standard deviation across trials (computed with $n_{\max}$ trials). For derivations, see "Methods". Unfortunately, $n_{\text{opt\_ci}}$ does not depend on the variability $\hat{\sigma}_{\text{trials}}$, and for both $n_{\text{opt\_ci}}$ and $n_{\text{opt\_\%diff}}$ the cumulative average is a priori bound to reach the required 'acceptable similarity' to the sample average with a number of trials that depends on, and is upper bounded by, the total number of trials available $n_{\max}$.
In Eq. (1), the definition of the optimal number of trials $n_{\text{opt\_ci}}$ is solely a function of $n_{\max}$, $z_{1-\alpha_{\text{ci}}/2}$ and $z_{1-\alpha/2}$. The above-cited studies were empirically trying to define the minimum number of trials $n_{\text{opt\_ci}}$ that allowed the cumulative average to come within a 95% confidence interval around the sample average. They thus assumed $\alpha_{\text{ci}} = 0.05$, which implies $z_{1-\alpha_{\text{ci}}/2} = 1.96$. They were also using $p_{\text{incl}} = 1$, which would correspond to a theoretical $z_{1-\alpha/2} = +\infty$, but in practice corresponded to an arbitrary $p_{\text{incl}} < 1$ due to the finite number of subjects. For example, if $z_{1-\alpha/2} = 2.493$, which corresponds to an arbitrary but mathematically elegant inclusion probability $p_{\text{incl}}$ between 0.95 and 0.99, then $n_{\text{opt\_ci}} = n_{\max}/\varphi$, where $\varphi = 1.618$ is the golden ratio. With this 'golden' inclusion probability, the 'optimal' number of trials estimated by the inclusion of the cumulative average within a 95% confidence interval around the true average would be $n_{\text{opt\_ci}} = 19$, 25 and 62 with $n_{\max} = 30$, 40 and 100, respectively. With an empirical $p_{\text{incl}} = 1$, as used in previous studies 26–29,31 , if the number of subjects increases, then the experimental estimate of $n_{\text{opt\_ci}}$ asymptotically tends to $n_{\max}$.
Unlike Eq. (1), Eq. (2) does take into account the trial-to-trial variability of MEP amplitude $\hat{\sigma}_{\text{trials}}$. Unfortunately, however, it still depends on (and is limited by) the total number of trials available $n_{\max}$. For example, if $\hat{\mu}_{\text{trials}} = 1$, $\hat{\sigma}_{\text{trials}} = 0.5$, $\eta = 0.1$ and $z_{1-\alpha/2} = 2.493$, then $n_{\text{opt\_\%diff}} = 25$, 32 and 61 with $n_{\max} = 30$, 40 and 100, respectively. With $p_{\text{incl}} = 1$ as previously used empirically 28 , if the number of subjects increases, then the experimental estimate of $n_{\text{opt\_\%diff}}$ also asymptotically tends to $n_{\max}$.
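The dependence of both empirical criteria on $n_{\max}$ can be checked numerically. The following is a minimal sketch, assuming that Eq. (1) has the form $n_{\max}/[1 + (z_{1-\alpha_{\text{ci}}/2}/z_{1-\alpha/2})^2]$ and Eq. (2) the form $n_{\max}/[1 + n_{\max}(\eta\hat{\mu}_{\text{trials}}/(z_{1-\alpha/2}\hat{\sigma}_{\text{trials}}))^2]$; the function and argument names are illustrative and not part of the original study.

```python
def n_opt_ci(n_max, z_ci=1.96, z_incl=2.493):
    """Assumed form of Eq. (1): trials needed for the cumulative average to
    enter the confidence interval of the sample average of n_max trials."""
    return n_max / (1.0 + (z_ci / z_incl) ** 2)


def n_opt_pct_diff(n_max, mean, sd, eta=0.1, z_incl=2.493):
    """Assumed form of Eq. (2): trials needed for the cumulative average to
    come within +/- eta of the sample average of n_max trials."""
    return n_max / (1.0 + n_max * (eta * mean / (z_incl * sd)) ** 2)


# Worked examples from the text, rounded to the nearest integer.
for n_max in (30, 40, 100):
    ci = round(n_opt_ci(n_max))
    pct = round(n_opt_pct_diff(n_max, mean=1.0, sd=0.5))
    print(f"n_max={n_max}: n_opt_ci={ci}, n_opt_%diff={pct}")
# n_max=30: n_opt_ci=19, n_opt_%diff=25
# n_max=40: n_opt_ci=25, n_opt_%diff=32
# n_max=100: n_opt_ci=62, n_opt_%diff=61
```

Both estimates grow with $n_{\max}$ even though the underlying variability is unchanged, which is the limitation discussed above.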
Number of trials for estimating single-subject MEP amplitude: a principled framework. In order to avoid the limitations of previous empirical studies attempting to define the number of trials for estimating single-subject MEP amplitude, we return to a more principled measure of 'acceptable similarity' that had already been used in the early TMS literature 14 : the inclusion of the cumulative average within an acceptable difference (e.g. ± 10%) from the true average. The optimal number of trials $n_{\text{opt}}$ for estimating single-subject MEP amplitude is thus simply obtained as the number of trials at which the confidence interval of the estimate of the true average equals the acceptable difference from the true average, i.e.
$$ n_{\text{opt}} = \left( \frac{z_{1-\alpha/2}\, CV_{\text{trials}}}{\eta} \right)^{2} \tag{3} $$
where the critical value $z_{1-\alpha/2}$ is now defined by the desired probability of inclusion $p_{\text{incl}}$ within the relative error $\eta$ around the true average $\mu_{\text{trials}}$, and $CV_{\text{trials}}$ is the corresponding coefficient of variation (i.e. $CV_{\text{trials}} = \sigma_{\text{trials}}/\mu_{\text{trials}}$). For example, if $CV_{\text{trials}} = 0.5$, then 96 trials are necessary to ensure that the estimated single-subject MEP amplitude stays within 10% of the true value ($\eta = 0.1$) with 95% probability ($p_{\text{incl}} = 0.95$, $z_{1-\alpha/2} = 1.96$). Crucially, in Eq. (3) $n_{\text{opt}}$ is not upper-bounded by the total number of trials available $n_{\max}$. Therefore, $n_{\text{opt}}$ can also be rigorously estimated from experimental data (without dependence on $n_{\max}$), by substituting the true $CV_{\text{trials}}$ with the sample estimate $\widehat{CV}_{\text{trials}}$.
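Note that Eq. (3) can also be derived as the theoretical asymptotic limit of Eq. (2) for a very large number of trials, when the sample average $\hat{\mu}_{\text{trials}}$ converges to the true average $\mu_{\text{trials}}$:
$$ n_{\text{opt}} = \lim_{n_{\max} \to \infty} n_{\text{opt\_\%diff}} = \left( \frac{z_{1-\alpha/2}\, \sigma_{\text{trials}}}{\eta\, \mu_{\text{trials}}} \right)^{2} \tag{4} $$
Equation (3) can be solved for $\eta$ to calculate the relative error (i.e. the acceptable difference from the true average) that is implicitly assumed when the MEP amplitude is estimated with a given number of trials $n$, i.e.
$$ \eta(n) = z_{1-\alpha/2}\, \frac{CV_{\text{trials}}}{\sqrt{n}} = z_{1-\alpha/2}\, \frac{SE_{\text{trials}}(n)}{\mu_{\text{trials}}} \tag{5} $$
where $SE_{\text{trials}}(n) = \sigma_{\text{trials}}/\sqrt{n}$ is simply the standard error of $\hat{\mu}_{\text{trials}}$ estimating $\mu_{\text{trials}}$ with $n$ trials. The statistical error thus decreases with the inverse of the square root of $n$. For example, if $CV_{\text{trials}} = 0.5$ and $z_{1-\alpha/2} = 1.96$, then reducing the number of trials $n$ from 96 to 30 or 20 increases the relative error $\eta$ from 10.0% to 17.9% and 21.9%, respectively.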
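The two directions of this calculation (trials from error, and error from trials) can be sketched in a few lines. The sketch assumes Eq. (3) in the form $(z_{1-\alpha/2} CV_{\text{trials}}/\eta)^2$ and Eq. (5) in the form $z_{1-\alpha/2} CV_{\text{trials}}/\sqrt{n}$; the function names are illustrative.

```python
import math


def n_opt(cv_trials, eta=0.1, z=1.96):
    """Assumed form of Eq. (3): trials needed for the estimate to stay within
    +/- eta of the true average with the probability implied by z."""
    return (z * cv_trials / eta) ** 2


def relative_error(cv_trials, n_trials, z=1.96):
    """Assumed form of Eq. (5): relative error implied by n_trials trials."""
    return z * cv_trials / math.sqrt(n_trials)


print(round(n_opt(0.5)))  # 96
for n in (96, 30, 20):
    print(n, f"{100 * relative_error(0.5, n):.1f}%")
# 96 10.0%
# 30 17.9%
# 20 21.9%
```

The function returns the unrounded value (96.04 for this example); whether to round to the nearest integer, as in the worked example, or round up is left to the experimenter.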
Number of trials for estimating population MEP amplitude. In many studies the objective may be to estimate the average MEP amplitude of a population of N subjects, which we will refer to as the population MEP amplitude.
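Substituting trials with subjects, Eq. (5) remains valid to calculate the relative error $\eta(N)$ that is assumed acceptable when the population MEP amplitude $\mu_{\text{subjects}}$ is estimated with $N$ subjects, given the coefficient of variation across subjects $CV_{\text{subjects}}$ or the standard error of the population MEP amplitude $SE_{\text{subjects}}(N)$, i.e.
$$ \eta(N) = z_{1-\alpha/2}\, \frac{CV_{\text{subjects}}}{\sqrt{N}} = z_{1-\alpha/2}\, \frac{SE_{\text{subjects}}(N)}{\mu_{\text{subjects}}} \tag{6} $$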
In Eq. (6) the statistical error decreases with the inverse of the square root of the number of subjects $N$. In order to understand how the error depends on the number of trials $n$, we can decompose the variance between subjects with $n$ trials, $\sigma^2_{\text{subjects}}(n)$, into the sum of the asymptotic variance between subjects with infinite trials $\sigma^2_{\text{subjects}}$ and the error variance of the sample average within subjects due to the finite number of trials $n$ 32 :
$$ \sigma^2_{\text{subjects}}(n) = \sigma^2_{\text{subjects}} + \frac{\sigma^2_{\text{trials}}}{n} \tag{7} $$
where $\sigma^2_{\text{trials}}$ is the MEP variance across trials, either assumed to be equal across subjects or pooled across subjects. The standard error of the population MEP amplitude then becomes
$$ SE_{\text{subjects}}(n, N) = \sqrt{\frac{\sigma^2_{\text{subjects}} + \sigma^2_{\text{trials}}/n}{N}} \tag{8} $$
The relative error $\eta$ of the population MEP amplitude thus depends on the number of trials $n$ as follows:
$$ \eta(n, N) = \frac{z_{1-\alpha/2}}{\mu_{\text{subjects}}} \sqrt{\frac{\sigma^2_{\text{subjects}} + \sigma^2_{\text{trials}}/n}{N}} \tag{9} $$
Equations (8) and (9) show that the statistical error can be reduced by increasing either the number of trials $n$ or the number of subjects $N$. However, increasing the number of trials $n$ provides only limited benefit. For example, consider a hypothetical population of $N = 20$ subjects with $\mu_{\text{subjects}} = 1.0$ mV, $\sigma_{\text{subjects}} = 0.5$ mV and $\sigma_{\text{trials}} = 0.5$ mV. The minimum relative error $\eta$ achievable for estimating the population MEP amplitude with an infinite number of trials is 21.9%. If we reduce the number of trials from infinite to 10 or even 5, then the error only increases to 23.0% and 24.0%, respectively. With 10 trials, if we double $\sigma_{\text{trials}}$ from 0.5 to 1.0 mV, the error only increases from 23.0 to 25.9%. Conversely, the error can always be decreased by increasing the number of subjects $N$.
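The limited benefit of additional trials at the population level can be reproduced with a short calculation. This is a sketch that assumes Eq. (9) in the form $\eta(n, N) = (z_{1-\alpha/2}/\mu_{\text{subjects}})\sqrt{(\sigma^2_{\text{subjects}} + \sigma^2_{\text{trials}}/n)/N}$; the function name and argument names are illustrative.

```python
import math


def population_relative_error(n_trials, n_subjects, mu_subjects,
                              sd_subjects, sd_trials, z=1.96):
    """Assumed form of Eq. (9). Pass float('inf') as n_trials for the
    asymptotic case of infinitely many trials per subject."""
    # Between-subject variance observed with n_trials trials (Eq. 7).
    var_between = sd_subjects ** 2 + sd_trials ** 2 / n_trials
    # Standard error of the population average (Eq. 8).
    se = math.sqrt(var_between / n_subjects)
    return z * se / mu_subjects


cases = [(float("inf"), 0.5), (10, 0.5), (5, 0.5), (10, 1.0)]
for n_trials, sd_trials in cases:
    eta = population_relative_error(n_trials, 20, 1.0, 0.5, sd_trials)
    print(f"n={n_trials}, sd_trials={sd_trials} mV: {100 * eta:.1f}%")
# n=inf, sd_trials=0.5 mV: 21.9%
# n=10, sd_trials=0.5 mV: 23.0%
# n=5, sd_trials=0.5 mV: 24.0%
# n=10, sd_trials=1.0 mV: 25.9%
```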
Number of trials for hypothesis testing. In many experimental situations, one might be interested in knowing if a certain number of trials is sufficient to perform hypothesis testing, for example to test if MEP amplitude is significantly different in a population of patients compared to a population of controls (unpaired), or if it is significantly different before and after an intervention on the same population of subjects (paired). The same reasoning used to estimate the population MEP amplitude can be applied to express the t statistic for a Student's paired t-test as a function of the number of subjects $N$ and trials $n$:
$$ t(n, N) = \frac{\mu_{\text{subjects1}} - \mu_{\text{subjects2}}}{\sqrt{\dfrac{2}{N} \left[ \sigma^2_{\text{subjects}} (1 - r) + \dfrac{\sigma^2_{\text{trials}}}{n} \right]}} \tag{10} $$
where $\mu_{\text{subjects1}}$ and $\mu_{\text{subjects2}}$ are the population MEP amplitudes of the two populations to be compared, assuming for simplicity equal variances, and $r$ is the asymptotic correlation of MEPs between the two populations (i.e. the correlation that would be obtained with an infinite number of trials). Note that if we assume $r = 0$, then Eq. (10) represents an unpaired t-test with equal $N$ and equal variances. A derivation of Eq. (10) is provided in the Methods.
The relationship between the number of trials and statistical power may be seen more directly in the corresponding formula for the calculation of the sample size $N_{\text{opt}}$ in a power analysis for the t-test 32 :
$$ N_{\text{opt}} = 2 \left( z_{1-\alpha/2} + z_{1-\beta} \right)^{2} \frac{\sigma^2_{\text{subjects}} (1 - r) + \sigma^2_{\text{trials}}/n}{\left( \mu_{\text{subjects1}} - \mu_{\text{subjects2}} \right)^{2}} \tag{11} $$
where $\alpha$ is the probability of type 1 error and $\beta$ is the probability of type 2 error ($1 - \beta$ is the power). With typical values of $\alpha = 0.05$ and $\beta = 0.20$ (i.e. $z_{1-\alpha/2} + z_{1-\beta} = 2.80$), Eq. (11) becomes:
$$ N_{\text{opt}} = 15.68\, \frac{\sigma^2_{\text{subjects}} (1 - r) + \sigma^2_{\text{trials}}/n}{\left( \mu_{\text{subjects1}} - \mu_{\text{subjects2}} \right)^{2}} \tag{12} $$
For example, if $\sigma_{\text{subjects}} = 0.5$ mV and $\sigma_{\text{trials}} = 0.5$ mV and we want to detect a difference $\mu_{\text{subjects1}} - \mu_{\text{subjects2}} = 0.2$ mV, Eq. (12) indicates the following. In a between-subjects design ($r = 0$), with only one trial ($n = 1$) we would need two groups of at least 196 subjects. By increasing the number of trials to $n = 5$ or 10, the number of subjects would conveniently decrease to 118 and 108, respectively. However, further increasing the number of trials would lead to negligible additional reduction of the number of subjects needed (e.g. 103 subjects with $n = 20$ trials, 101 subjects with $n = 40$ trials, 98 subjects with $n = \infty$ trials). In a within-subjects design with high correlation ($r = 0.9$), with only one trial ($n = 1$) we would need at least 108 subjects. By increasing the number of trials to $n = 5$, 10 or 20, the number of subjects would decrease considerably to 30, 20 and 15, respectively. Further increasing the number of trials would lead to a progressively smaller reduction of the number of subjects needed (e.g. 14 subjects with $n = 30$ trials, 13 subjects with $n = 40$, 10 subjects with $n = \infty$ trials).
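The sample sizes quoted above can be reproduced with the following sketch. It assumes Eq. (11) in the form $N_{\text{opt}} = 2(z_{1-\alpha/2} + z_{1-\beta})^2 [\sigma^2_{\text{subjects}}(1 - r) + \sigma^2_{\text{trials}}/n]/(\mu_{\text{subjects1}} - \mu_{\text{subjects2}})^2$, uses the rounded value $z_{1-\alpha/2} + z_{1-\beta} = 2.80$ from the text, and rounds up to the next whole subject; the function name is illustrative.

```python
import math


def subjects_required(delta, sd_subjects, sd_trials, n_trials, r=0.0, z_sum=2.80):
    """Assumed form of Eq. (11): subjects (per group if r = 0) needed to
    detect a mean difference delta. Use float('inf') for infinite trials."""
    var_term = sd_subjects ** 2 * (1.0 - r) + sd_trials ** 2 / n_trials
    n_opt = 2.0 * z_sum ** 2 * var_term / delta ** 2
    # Round away floating-point noise before rounding up to whole subjects.
    return math.ceil(round(n_opt, 6))


trial_counts = (1, 5, 10, 20, 40, float("inf"))
between = [subjects_required(0.2, 0.5, 0.5, n, r=0.0) for n in trial_counts]
within = [subjects_required(0.2, 0.5, 0.5, n, r=0.9) for n in trial_counts]
print(between)  # [196, 118, 108, 103, 101, 98]
print(within)   # [108, 30, 20, 15, 13, 10]
```

In both designs the required number of subjects flattens out quickly as trials are added, but the floor is much lower when the asymptotic correlation $r$ is high.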
Number of trials for test-retest reliability. Finally, the number of trials $n$ clearly has an impact on the test-retest reliability of TMS measures 33 , as reported in previous experimental studies 28,31 . In the case of MEP amplitude, we can show this impact analytically. For simplicity, we focus on Pearson's correlation, which is useful to assess test-retest reliability when only two time points are available, particularly if means and variances do not change across time points 34 . Substituting Eq. (7) in Eq. (40) (see "Methods"), the dependence of the Pearson's correlation coefficient $r(n)$ between repeated measures on the number of trials $n$ within measures can be expressed as follows:
$$ r(n) = r\, \frac{\sigma^2_{\text{subjects}}}{\sigma^2_{\text{subjects}} + \sigma^2_{\text{trials}}/n} \tag{13} $$
where $r$ and $\sigma^2_{\text{subjects}}$ are the asymptotic Pearson's correlation across repeated measures and the (pooled) variance across subjects with infinite trials, and $\sigma^2_{\text{trials}}$ is the (pooled) variance across trials. Note that if mean and variance do not change across time points (which should be the case in the context of test-retest reliability of TMS measures), then the Pearson's correlation coefficient is identical to the concordance correlation coefficient 34 , which in turn is virtually identical to a group of intraclass correlation coefficients that estimate the degree of absolute agreement between non-interchangeable measurements 35–37 . Equation (13) clarifies that increasing the number of trials can only increase the test-retest reliability up to a limit (i.e. $r$), which is consistent with previous experimental observations 28,31 .
Simulation results. Single-subject MEP amplitude. To validate Eq. (5), we simulated 10,000 single subjects with non-normally distributed MEPs at four levels of $CV_{\text{trials}}$ (0.25, 0.50, 0.75 and 1.00).
For each subject, we simulated 100 trials drawn from an independent lognormal distribution with mean $\mu_{\text{trials}} = 1.0$ mV and standard deviation $\sigma_{\text{trials}} = 0.25$, 0.5, 0.75 or 1.00 mV, with a corresponding skewness of 0.77, 1.63, 2.67 and 4.0. The lognormal distribution was obtained as the exponential of a normal distribution with mean
$$ \mu_{N} = \ln\!\left( \frac{\mu_{\text{trials}}^{2}}{\sqrt{\mu_{\text{trials}}^{2} + \sigma_{\text{trials}}^{2}}} \right) $$
and variance
$$ \sigma_{N}^{2} = \ln\!\left( 1 + \frac{\sigma_{\text{trials}}^{2}}{\mu_{\text{trials}}^{2}} \right) $$
We then calculated the cumulative average MEP amplitude for each subject. We finally calculated the 95th percentile of the distribution across subjects of the absolute errors of the cumulative average estimating the true average (divided by 1 mV), as a function of the number of trials $n$. This 95th percentile was used as an estimate of the relative error $\eta$ of the single-subject MEP amplitude. Note that this means that we considered a 95% probability of inclusion of the cumulative average within the relative error $\eta$ from the true average [i.e. $z_{1-\alpha/2} = 1.96$ in Eq. (5)]. The comparison between the simulated data and Eq. (5) is provided in Fig. 1A.
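A reduced version of this simulation can be sketched with the Python standard library alone. The sketch is not the code used in the study: it uses fewer simulated subjects, draws the average of $n$ trials directly instead of building the full cumulative average over 100 trials, and assumes Eq. (5) in the form $z_{1-\alpha/2} CV_{\text{trials}}/\sqrt{n}$. All function names are illustrative.

```python
import math
import random


def lognormal_params(mean, sd):
    """Mean and sd of the underlying normal distribution whose exponential
    is lognormal with the requested mean and sd (moment matching)."""
    mu_n = math.log(mean ** 2 / math.sqrt(mean ** 2 + sd ** 2))
    sigma_n = math.sqrt(math.log(1.0 + (sd / mean) ** 2))
    return mu_n, sigma_n


def simulated_relative_error(cv, n_trials, n_subjects=2000, mean=1.0,
                             p_incl=0.95, seed=0):
    """Empirical p_incl quantile, across simulated subjects, of the absolute
    relative error of the n_trials average with respect to the true mean."""
    rng = random.Random(seed)
    mu_n, sigma_n = lognormal_params(mean, cv * mean)
    errors = []
    for _ in range(n_subjects):
        total = sum(rng.lognormvariate(mu_n, sigma_n) for _ in range(n_trials))
        errors.append(abs(total / n_trials - mean) / mean)
    errors.sort()
    return errors[math.ceil(p_incl * len(errors)) - 1]


def analytical_relative_error(cv, n_trials, z=1.96):
    """Assumed form of Eq. (5)."""
    return z * cv / math.sqrt(n_trials)


if __name__ == "__main__":
    for n in (5, 20, 100):
        sim = simulated_relative_error(0.5, n)
        ana = analytical_relative_error(0.5, n)
        print(f"n={n}: simulated {100 * sim:.1f}%, analytical {100 * ana:.1f}%")
```

With only 2,000 simulated subjects the simulated quantile is itself noisy, so agreement with the analytical value should be expected only to within about a percentage point.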
Population MEP amplitude. To validate Eq. (9), we simulated 10,000 populations of $N = 10$, 20, 30 and 40 subjects. For each population of subjects, the single-subject MEP amplitude $\mu_{\text{trials}}(s)$ of each subject $s$ was drawn from a lognormal distribution with mean $\mu_{\text{subjects}} = 1.0$ mV (i.e. the true population MEP amplitude) and standard deviation $\sigma_{\text{subjects}} = 0.5$ mV (skewness = 1.63). For each subject $s$ within each population, we simulated 100 trials drawn from an independent lognormal distribution with mean $\mu_{\text{trials}}(s)$ and standard deviation $\sigma_{\text{trials}} = 0.5$ mV. We then calculated the cumulative population MEP amplitude for each population of subjects. Finally, we calculated the 95th percentile of the distribution across populations of the absolute errors of the cumulative population MEP amplitude estimating the true population MEP amplitude (divided by 1 mV), as a function of the number of trials $n$. This 95th percentile [i.e. $z_{1-\alpha/2} = 1.96$ in Eq. (9)] was used as an estimate of the relative error $\eta$ of the population MEP amplitude. The comparison between the simulated data and Eq. (9) is provided in Fig. 1B.
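The nested structure of this simulation (subjects drawn within populations, trials drawn within subjects) can be sketched as follows. This is a scaled-down illustration rather than the study's code: it uses fewer populations, evaluates a single number of trials at a time, and assumes Eq. (9) in the form $(z_{1-\alpha/2}/\mu_{\text{subjects}})\sqrt{(\sigma^2_{\text{subjects}} + \sigma^2_{\text{trials}}/n)/N}$. Function names are illustrative.

```python
import math
import random


def lognormal_params(mean, sd):
    """Normal-space mean and sd giving a lognormal with this mean and sd."""
    mu_n = math.log(mean ** 2 / math.sqrt(mean ** 2 + sd ** 2))
    sigma_n = math.sqrt(math.log(1.0 + (sd / mean) ** 2))
    return mu_n, sigma_n


def simulated_population_error(n_trials, n_subjects, n_populations=1000,
                               mu_subjects=1.0, sd_subjects=0.5,
                               sd_trials=0.5, p_incl=0.95, seed=0):
    """Empirical p_incl quantile, across simulated populations, of the
    absolute relative error of the estimated population MEP amplitude."""
    rng = random.Random(seed)
    mu_b, sigma_b = lognormal_params(mu_subjects, sd_subjects)
    errors = []
    for _ in range(n_populations):
        subject_means = []
        for _ in range(n_subjects):
            # True amplitude of this subject, then trials around it.
            true_amp = rng.lognormvariate(mu_b, sigma_b)
            mu_w, sigma_w = lognormal_params(true_amp, sd_trials)
            total = sum(rng.lognormvariate(mu_w, sigma_w) for _ in range(n_trials))
            subject_means.append(total / n_trials)
        estimate = sum(subject_means) / n_subjects
        errors.append(abs(estimate - mu_subjects) / mu_subjects)
    errors.sort()
    return errors[math.ceil(p_incl * len(errors)) - 1]


def analytical_population_error(n_trials, n_subjects, mu_subjects=1.0,
                                sd_subjects=0.5, sd_trials=0.5, z=1.96):
    """Assumed form of Eq. (9)."""
    var_between = sd_subjects ** 2 + sd_trials ** 2 / n_trials
    return z * math.sqrt(var_between / n_subjects) / mu_subjects


if __name__ == "__main__":
    for n in (5, 10):
        sim = simulated_population_error(n, 20)
        ana = analytical_population_error(n, 20)
        print(f"n={n}, N=20: simulated {100 * sim:.1f}%, analytical {100 * ana:.1f}%")
```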
The bivariate lognormal distribution was obtained as the exponential of a bivariate normal distribution with the corresponding means, variances and covariance. Note that we considered the unpaired t-test with equal sample sizes as a special case of the paired t-test (with null covariance). For each subject $s_i$ within each population pair, we simulated 100 trials drawn from a lognormal distribution with mean $\mu_{\text{trials}}(s_i)$ and standard deviation $\sigma_{\text{trials}} = 0.5$ mV, and we calculated the cumulative average MEP amplitude across trials. For each population pair, we then computed the average and standard deviation across subjects of the cumulative MEP amplitude differences. To reduce bias, the estimate of the standard deviation was divided by a correction factor. Population averages and standard deviations of the cumulative MEP amplitude differences were then averaged across population pairs. The t statistic was estimated with the standard formula, as a function of the number of simulated trials $n$ and simulated subjects $N$:
$$ t(n, N) = \frac{\bar{d}(n, N)}{s_{d}(n, N)/\sqrt{N}} $$
where $\bar{d}(n, N)$ and $s_{d}(n, N)$ represent the mean and standard deviation of the cumulative MEP amplitude differences averaged across all population pairs. The comparisons between the simulated data and Eq. (10) are provided in Fig. 1C (unpaired) and in Fig. 1D (paired).
Experimental results. Experiment 1. In the first experiment we addressed the relatively simple problem of estimating MEP amplitude (Fig. 2A). In 20 subjects we set a stimulus intensity intended to evoke MEPs of approximately 1–1.5 mV and we delivered 100 single pulses of TMS to the cortical location ('hot spot') representing the FDI. Note that 100 trials is an arbitrary number that is considerably higher than that which is typically used in TMS protocols. Importantly, we did not control for possible attentional drifts over the approximately 10 min required to complete the 100-trial protocol; we assume stationarity for simplicity.
Were 100 trials sufficient to estimate single-subject MEP amplitude? The estimated MEP variability within subjects ($\widehat{CV}_{\text{trials}}$) was 0.61 (range 0.29 to 0.87). According to Eq. (3), if we wanted to guarantee that the estimated single-subject MEP amplitude was with 95% probability (i.e. $z_{1-\alpha/2} = 1.96$) within an arbitrary error of ± 10.0% (i.e. $\eta = 0.1$) from the true MEP amplitude, we should have increased the number of trials to 143 (range 33–291). Yet Eq. (5) indicates that with our 100 trials the actual difference from the true MEP amplitude was not much higher, just ± 12.0% (range 5.7–17.1%). Using only 30 or 20 trials, the error would increase to ± 21.8% and ± 26.7%, respectively (Fig. 2B,C).
Were 100 trials sufficient to estimate population MEP amplitude? The estimated MEP variability between subjects with 100 trials, $\widehat{CV}_{\text{subjects}}$, was 0.39. Accordingly, Eq. (9) indicates that the estimated population MEP amplitude was with 95% probability within an error of ± 17.1% from the true population MEP amplitude. Importantly, this error would not increase much if the number of trials was decreased to 30 (± 17.7%), 20 (± 18.2%), 10 (± 19.3%) or even 5 (± 21.4%) (Fig. 2D,E), and it would barely decrease further if we had an infinite number of trials (± 16.9%).

Experiment 2. As a representative example of hypothesis testing, we considered the problem of designing an experiment to test whether stimulus intensity affects MEP amplitude (although we actually know that it does). We thus decided to deliver stimuli at two intensities commonly used in stimulus-response curves, 110% and 120% of the RMT 9,16,26 , and we used the results of Experiment 1 to make predictions for the following question: how many trials and subjects do we need to detect a difference in MEP amplitude between 110%RMT and 120%RMT?
In Experiment 1, the actual stimulus intensity employed was 122.5 ± 11.8% of the RMT, which elicited a population MEP amplitude $\hat{\mu}_{\text{subjects}} = 1.48$ mV, with an estimation error of 17.1%, a pooled within-subjects MEP variability $\sigma_{\text{trials(pooled)}} = 1.01$ mV and an estimated asymptotic between-subjects MEP variability $\sigma_{\text{subjects}} = 0.57$ mV. We thus make the following conservative estimations. (a) With 120%RMT intensity we will obtain a population MEP amplitude $\mu_{\text{subjects1}} = 1.48 \times (1 - 0.171) = 1.23$ mV (i.e. the lower confidence limit from Experiment 1). (b) With 110%RMT we will obtain a population MEP amplitude $\mu_{\text{subjects2}} = 1.23/2 = 0.62$ mV. (c) Both within-subjects and between-subjects MEP variability will be the same at 110%RMT and at 120%RMT, i.e. $\sigma_{\text{trials(pooled)}} = 1.01$ mV and $\sigma_{\text{subjects}} = 0.57$ mV. (d) The asymptotic correlation between MEPs obtained at 110%RMT and at 120%RMT will be $r = 0.61$. The latter was estimated from the split-half correlation of the first 40 trials in Experiment 1 (i.e. the correlation of the mean MEPs estimated from the first 20 trials with the mean MEPs estimated from the next 20 trials), eliminating one outlier.
With the above numbers (Fig. 3A), Eq. (11) indicates that in order to detect a significant difference in MEP amplitude between 110%RMT and 120%RMT, with type-I error α < 0.05 and type-II error β < 0.20 (i.e. power > 0.80), with infinite trials we would need only 6 subjects in a within-subjects design. This minimum number of subjects would increase to 7, 8, 10, and 14 with 30, 20, 10, and 5 trials, respectively. If instead we planned to perform the experiment in a between-subjects design ( r = 0, i.e. one group tested at 110%RMT and the other group tested at 120%RMT), Eq. (11) tells us that with infinite trials we would need at least 14 subjects per group, which would increase to 16 and 18 subjects with 30 or 10 trials, respectively (Fig. 3B).
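The planning numbers above can be reproduced from estimations (a) to (d). The sketch below assumes Eq. (11) in the form $N_{\text{opt}} = 2(z_{1-\alpha/2} + z_{1-\beta})^2 [\sigma^2_{\text{subjects}}(1 - r) + \sigma^2_{\text{trials}}/n]/(\mu_{\text{subjects1}} - \mu_{\text{subjects2}})^2$ with the rounded value $z_{1-\alpha/2} + z_{1-\beta} = 2.80$, and rounds up to whole subjects; the function name and the `PLAN` dictionary are illustrative.

```python
import math


def subjects_needed(mu1, mu2, sd_subjects, sd_trials, n_trials, r, z_sum=2.80):
    """Assumed form of Eq. (11), rounded up to whole subjects."""
    var_term = sd_subjects ** 2 * (1.0 - r) + sd_trials ** 2 / n_trials
    n_opt = 2.0 * z_sum ** 2 * var_term / (mu1 - mu2) ** 2
    return math.ceil(round(n_opt, 6))


# Conservative estimations (a) to (c) derived from Experiment 1, in mV.
PLAN = dict(mu1=1.23, mu2=0.62, sd_subjects=0.57, sd_trials=1.01)

within = [subjects_needed(n_trials=n, r=0.61, **PLAN)
          for n in (float("inf"), 30, 20, 10, 5)]
between = [subjects_needed(n_trials=n, r=0.0, **PLAN)
           for n in (float("inf"), 30, 10)]
print(within)   # [6, 7, 8, 10, 14]
print(between)  # [14, 16, 18]
```

The between-subjects value for 10 trials sits just below 18 before rounding up, so small changes in the assumed means or variances can shift it by one subject.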
We decided to perform Experiment 2 in a within-subjects design with 10 trials per intensity and 16 subjects, in order to have more than enough power to detect a significant difference in a within-subjects design (even with half of the trials), and almost sufficient power if assuming a between-subjects design. The two stimulus intensities (i.e. 110%RMT and 120%RMT) were delivered in the same experimental session, and the experiment was repeated twice to verify the consistency of the statistical results. As expected, MEP amplitude was greater at 120%RMT compared to 110%RMT both in the first session (1.57 ± 1.59 mV vs. 0.81 ± 0.85 mV) and in the second session (1.79 ± 1.64 mV vs. 0.76 ± 0.89 mV). Considering only the first 10 subjects (i.e. the minimum number of subjects to detect a significant difference as suggested by Eq. (11)), MEP amplitude was significantly higher with 120%RMT compared to 110%RMT, both in the first experimental session (paired t-test, $p = 0.010$) and in the second one ($p = 0.044$). The p-values decreased as expected considering the entire sample of 16 subjects, both in the first experimental session ($p = 0.003$) and in the second one ($p < 0.001$) (Fig. 3C). As predicted, the difference remained significant even when only 5 trials were used, both in the first session ($p = 0.007$) and in the second one ($p = 0.001$). Conversely, if we assumed that the experiment was performed in a between-subjects design (i.e. two groups of 16 subjects), the p-values reached significance in the second session (unpaired t-test, $p = 0.034$), but not in the first one ($p = 0.10$), consistent with the lower statistical power that had been expected (Fig. 3D).

Discussion
We presented a general framework of simple equations that show how the number of trials affects single-subject MEP amplitude, population MEP amplitude, hypothesis testing and test-retest reliability in TMS experiments. The equations were derived analytically, validated with Monte Carlo simulations, and applied to two sets of experimental data in a representative manner.

Analytical results. A number of recent experimental studies suggested that the minimum number of trials for estimating MEP amplitude would be around 30 trials 26,28,29 . However, we analytically showed that with the empirical approach used in these studies the estimated minimum number of trials essentially depends on the total number of trials available $n_{\max}$ [Eqs. (1) and (2)] and does not depend on the trial-to-trial variability [Eq. (1)]. This probably explains why in these studies the estimated minimum number of trials $n_{\text{opt\_ci}}$ for MEPs collected at 120%RMT was higher when $n_{\max}$ was 40 ($n_{\text{opt\_ci}} = 29$–31) 26,28,29 , compared to when $n_{\max}$ was 30 ($n_{\text{opt\_ci}} = 21$) 27 . Previous experimental estimates of the minimum number of trials to reliably estimate single-subject MEP amplitude thus do not lend themselves to generalization. Equation (3) formalizes the intuition that the minimum number of trials to estimate single-subject MEP amplitude should depend on the trial-to-trial variability in the specific experimental conditions and on the acceptable statistical error defined by the experimenter 14 . Indeed, depending on stimulus intensity and on the stimulus-response curve of the individual subject in the specific experimental condition 18,38 , MEP amplitude has a different trial-to-trial variability, as measured by the coefficient of variation ($CV_{\text{trials}}$) 9,16,39,40 . This affects the minimum number of trials required to estimate single-subject MEP amplitude, which is proportional to the square of $CV_{\text{trials}}$. When the same equation is solved for the acceptable statistical error [Eq. (5)], it becomes explicit that increasing the number of trials dramatically reduces the error when only a few trials are available, but it offers a progressively smaller advantage as the number of trials increases (Fig. 1A). Nevertheless, the present study warns us that, if the acceptable error is low, in many experimental conditions estimating single-subject MEP amplitude may require substantially more trials than previously suggested (but maintaining stationary conditions may become a challenge). However, increasing the number of trials can only improve the test-retest reliability of MEP amplitude up to a limit [Eq. (13)], in agreement with previous experimental results 28,31 . This is important, for example, for possible diagnostic applications 41,42 , or for assessing the reproducibility of non-invasive brain stimulation techniques in individual subjects 43–45 .
Equations (9), (10) and (11) define the impact of the number of trials for estimating population MEP amplitude and for hypothesis testing. Importantly, the non-linearity of the stimulus-response curve and its between-subjects variability contribute to both the between-subjects MEP amplitude variability $\sigma_{\text{subjects}}$ and the pooled within-subjects MEP amplitude variability $\sigma_{\text{trials}}$. This has a much higher impact on the minimum number of subjects than on the minimum number of trials required to estimate population MEP amplitude within a certain error or to detect a significant difference in hypothesis testing. In fact, the number of trials and trial-to-trial variability within subjects have a relatively minor impact on the estimation of population MEP amplitude, which mostly depends on the variability between subjects and on the number of subjects [Eq. (9); Fig. 1B]. Similarly, hypothesis testing is markedly more dependent on the number of subjects than on the number of trials [Eqs. (10) and (11)], particularly in unpaired experimental designs ($r = 0$; Fig. 1C). In paired designs ($0 < r < 1$), importantly, the number of trials becomes progressively more relevant if the asymptotic correlation $r$ between repeated measures is higher (Fig. 1D). Nevertheless, even for highly reliable paired conditions (e.g. $r = 0.9$), a decrease in the number of trials can always be compensated by an increase in the number of subjects. In general, unless very few trials are used, increasing the number of trials will only induce a minor improvement in statistical power and reproducibility of comparisons between subjects (e.g. patients vs. controls) or within subjects (e.g. effect of an intervention). If more statistical power is needed, then the number of subjects rather than trials should be increased. Indeed, if sufficient subjects are available, theoretically the minimum number of trials per subject to detect any difference is always $n = 1$.
Simulation results. MEP amplitudes are typically not normally distributed. However, our analytical framework does not assume that MEP amplitudes are normally distributed: it assumes that the sample estimates of MEP amplitude means are normally distributed. Normal distribution of sample means is indeed guaranteed when the samples are normally distributed, but it is also guaranteed by the central limit theorem even when the samples are not normally distributed, if sufficient trials are available. To support this point, we validated Eqs. (5), (9) and (10) with Monte Carlo simulations that assumed a lognormal distribution of single-trial MEP amplitudes within subjects and of single-subject MEP amplitude across subjects (Fig. 1). The results obtained with lognormal simulated data are highly consistent with the analytical equations. Note that very minor deviations from Eq. (5) are observed in the lognormal simulations, as expected, only with few trials and heavily skewed simulated data (skewness = 4 in Fig. 1; as a reference, the average skewness in Exp. 1 was 1.20, range [0.48–2.25]). Therefore, with low numbers of trials and/or in the presence of "outliers", the estimates obtained with the equations may be more accurate after normalizing the data, e.g. via an appropriate Box-Cox transformation 46,47 . Still, in most cases the equations can be readily applied to raw MEP data.
Experimental results. We provided a step-by-step application of the equations to estimate single-subject MEP amplitude and population MEP amplitude in a dataset of 100 MEP trials recorded in 20 subjects (Experiment 1). Our results show that 100 trials were sufficient to keep the estimation error of MEP amplitude below ± 20% in all our subjects, and they suggest that most experimental paradigms employing 20-30 trials (including ours) implicitly accept relatively large estimation errors for single-subject MEP amplitude. On the other hand, 100 trials were not only sufficient, but also unnecessarily high to estimate population MEP amplitude. The experimental results confirm that in the estimation of single-subject MEP amplitude the concept of "minimum number of trials" essentially depends on the error that is considered acceptable by the experimenter and the variability of MEPs in the individual subject. Conversely, the number of trials plays little role in the estimation of population MEP amplitude, which is more dependent on number of subjects.
We then used the data from Experiment 1 to define the optimal number of trials and subjects to be used in a representative experiment designed to detect significant MEP amplitude differences between two stimulus intensities (i.e. 110%RMT vs. 120%RMT; Experiment 2). Our results provide a practical example of how Eq. (11) can be used as a tool to assess the impact of the number of trials when designing new experiments. The same reasoning can be used to estimate the impact of the number of trials on experiments aiming to assess differences in MEP amplitude between groups of subjects (e.g. patients vs. controls) or changes in MEP amplitude before and after an intervention (e.g. non-invasive brain stimulation protocols).
Importantly, the equations have broad applicability and are generally valid for all experimental measures and conditions dealing with multiple trials per subject and populations of subjects. Within the TMS field, for example, the same equations can be directly applied to any measure of MEP amplitude (e.g. peak-to-peak, area, modulus, etc.) at any intensity on the stimulus-response curve, and to other single-pulse measures such as the silent period. Different experimental conditions (e.g. at rest, in activation, during a task, etc.) can be readily reflected in the equations by entering the corresponding values of within and between-subjects variability. The framework can also be extended, at least in principle, to more complex measures, such as the steepness of the stimulus-response curve and paired-pulse TMS measures. For these measures, however, some effort may be necessary to properly estimate within and between-subjects variability as a function of the number of trials. Indeed, the same equations and reasoning can also be applied to other fields (e.g. reaction times in behavioral tasks, etc.).
Practical recommendations. The aim of this study was not to provide rule-of-thumb answers that may be valid only in specific experimental conditions, but to offer a more general framework to inform the decision about how many trials to use under different experimental conditions. Still, we can provide the following practical recommendations:
1. For estimating single-subject MEP amplitude, the minimum number of trials largely depends on the variability of the subject in the exact experimental conditions, and on the error considered acceptable by the experimenter. Equation (3) can be used to directly estimate the minimum number of trials, given the variability and the acceptable error. Equation (5) can be used to estimate the error, given the variability and the number of trials. An important caveat is that the estimate of single-subject MEP amplitude and its corresponding error refer only to the moment of the test. Their ability to represent the subject in general depends on test-retest reliability [Eq. (13)]. With this in mind, the general recommendation for estimating single-subject MEP amplitude is to use a relatively high number of trials.
2. For hypothesis testing, the number of trials plays a relatively minor role. Equation (11) can be used to explicitly estimate the impact of the number of trials in a power analysis. The general recommendation for hypothesis testing is to use at least a few trials and to include a relatively high number of subjects.
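The first recommendation can be illustrated with a minimal sketch. It assumes the textbook normal approximation, in which the half-width of the confidence interval of a mean over n trials is z·σ_trials/√n. Equations (3) and (5) are not reproduced here and may differ in detail. The function names and the example coefficient of variation (0.5) are hypothetical.

```python
import math
from statistics import NormalDist

def relative_error(cv_trials, n_trials, confidence=0.95):
    """Confidence-interval half-width of the mean, as a fraction of the mean.

    cv_trials is the within-subject coefficient of variation (sd / mean).
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * cv_trials / math.sqrt(n_trials)

def min_trials(cv_trials, acceptable_error, confidence=0.95):
    """Smallest number of trials whose relative error is within the target."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * cv_trials / acceptable_error) ** 2)

if __name__ == "__main__":
    cv = 0.5  # hypothetical within-subject variability
    for err in (0.10, 0.20):
        n = min_trials(cv, err)
        print(f"acceptable error ±{err:.0%}: {n} trials "
              f"(achieved error ±{relative_error(cv, n):.1%})")
```

With the hypothetical variability used here, halving the acceptable error from ± 20% to ± 10% roughly quadruples the number of trials (25 vs. 97), which is why the acceptable error has to be stated explicitly before a "minimum number of trials" can be given.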
Overall, we hope these simple equations will offer a useful tool to solve the issue of maximizing the number of trials and minimizing experimental time in many experimental situations, and to clarify the impact played by the number of trials on the design and reproducibility of past and future experiments.

Methods
Subjects. The study was performed according to the Declaration of Helsinki and approved by the local Ethics Committee (Comité Ético de Investigación de HM Hospitales). We recruited 27 right-handed healthy participants (15 females; mean age ± standard deviation: 27.3 ± 5.7 years; range 20-40 years; 85% non-smokers) with a negative history of neurological or psychiatric conditions and medication-free at the time of the study. All subjects gave their informed consent.
Electromyographic recordings. We recorded EMG activity from the first dorsal interosseous (FDI) using disposable surface electrodes. EMG signals were band-pass filtered (2 Hz-2 kHz) and amplified (× 1000; D360, Digitimer Ltd, UK) and single trials were digitized (sample rate 5 kHz) using a CED 1401 A/D converter and Signal 5 software (Cambridge Electronic Design, Cambridge, UK). EMG signals were monitored online via visual feedback on a computer screen.
Transcranial magnetic stimulation. We used a 70-mm figure-eight-shaped magnetic coil connected to a Magstim 200² stimulator (Magstim Co. Ltd, UK) to perform monophasic single-pulse TMS. The coil was held tangential to the scalp with the handle oriented backwards and 45° from the midline. The induced current flowed in a posterior-anterior (PA) direction, preferentially activating I1 waves 48,49 . Both experiments were performed using a frameless neuronavigation system (BrainSight, Rogue Research, Canada) to guide the coil position with the help of a magnetic resonance imaging template in standard space. For all experiments we measured the individual RMT, defined as the minimum TMS output intensity required to evoke an MEP peak-to-peak amplitude of ≥ 0.05 mV in five out of 10 consecutive trials in the resting FDI. We delivered single TMS pulses with an inter-trial interval of 6 s ± 10%. This inter-trial interval was chosen to minimize the carryover effects in the initial transient state observed at intervals ≤ 5 s 24,50 and to be consistent with our recent studies 51-53 .
Experimental procedures. We performed two independent experiments. Eighteen subjects participated in one experiment and 9 subjects participated in both. Subjects sat in a comfortable chair and were instructed to relax both arms and hands on a pillow, keeping their eyes open for the duration of the experiment. Experiment 1 (n = 20; 11 females; mean age 27.7 ± 5.6 years): For each subject we determined the FDI 'hot spot' in the right motor cortex and measured the RMT. After establishing the TMS output intensity that evoked a peak-to-peak MEP amplitude of 1-1.5 mV, we recorded 100 MEPs at rest at that intensity. Experiment 2 (n = 16; 8 females; mean age 25.9 ± 4.8 years): Each subject performed two identical sessions, 7 days apart. In each session we determined the individual FDI 'hot spot' in the right motor cortex.
We measured the RMT and recorded 40 MEPs at rest at different TMS output intensities (110%, 120%, 130%, and 140%RMT; randomized). Only the data from 110%RMT and 120%RMT were used in this study. In both experiments, single-trial MEP amplitude was estimated as the peak-to-peak amplitude of the recorded EMG signal.

Derivation of Eq. (1).
The optimal number of trials $n_{opt\_ci}$ estimated by the inclusion of the cumulative average $\hat{\mu}_{trials}(n)$ within a 95% confidence interval around the sample average $\hat{\mu}_{trials}(n_{max})$, as used empirically in previous experimental studies [26][27][28][29]31 , can be defined analytically. We will refer to the sample average $\hat{\mu}_{trials}(n_{max})$ simply as $\hat{\mu}_{trials}$, and to the true average as $\mu_{trials}$.
First, the half width of the 95% confidence interval around the sample average $\hat{\mu}_{trials}$ is simply $z_{1-\alpha_{ci}/2}\,SE(n_{max})$, where $z_{1-\alpha_{ci}/2}$ is the critical value (for a 95% c.i., $\alpha_{ci} = 0.05$ and $z_{1-\alpha_{ci}/2} = 1.96$) and $SE(n_{max})$ is the standard error of the estimate of the true average $\mu_{trials}$ with the maximum number of trials available $n_{max}$. Second, we can define the 'inclusion of the cumulative average' within the above confidence interval around the sample average in probabilistic terms, as the confidence interval of the estimate of the sample average made by the cumulative average: $z_{1-\alpha/2}\,SE_{sample}(n)$, where $z_{1-\alpha/2}$ is the critical value defined by the probability of inclusion $p_{incl}$ (i.e. $\alpha = 1 - p_{incl}$) and $SE_{sample}(n)$ is the standard error of the cumulative average estimating the sample average with $n$ samples. Note that the above cited studies empirically used $p_{incl} = 1$, which would correspond to a theoretical $z_{1-\alpha/2} = +\infty$, but in practice corresponded to an arbitrary $p_{incl} < 1$ that depends on the number of subjects. The optimal number of trials $n_{opt\_ci}$ is then defined as the number of trials at which the confidence interval of the estimate of the sample average made by the cumulative average equals the confidence interval of the estimate of the true average made by the sample average, i.e.

$$z_{1-\alpha/2}\,SE_{sample}(n_{opt\_ci}) = z_{1-\alpha_{ci}/2}\,SE(n_{max}) \qquad (23)$$
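In Eq. (23), $SE(n_{max})$ is given by the well-known formula:

$$SE(n_{max}) = \frac{\sigma_{trials}}{\sqrt{n_{max}}} \qquad (24)$$

where $\sigma_{trials}$ is the standard deviation of MEP amplitude across trials.
$SE_{sample}(n)$ is somewhat less straightforward. Let $\varepsilon(n)$ be the error in the estimate of the sample average made by the cumulative average with $n < n_{max}$, i.e.

$$\varepsilon(n) = \hat{\mu}_{trials}(n) - \hat{\mu}_{trials}(n_{max}) \qquad (25)$$

From the decomposition of variances, it follows that:

$$Var[\hat{\mu}_{trials}(n)] = Var[\varepsilon(n)] + Var[\hat{\mu}_{trials}(n_{max})] \qquad (26)$$

where $Var[\varepsilon(n)]$ is the variance of the cumulative average estimating the sample average. Since the standard deviation of an estimator (in this case the cumulative average as an estimator of the sample average) is by definition the standard error of the estimator, we can write:

$$Var[\varepsilon(n)] = SE_{sample}(n)^2 \qquad (27)$$

$Var[\hat{\mu}_{trials}(n)]$ is the variance of the cumulative average estimating the true average, i.e.

$$Var[\hat{\mu}_{trials}(n)] = \frac{\sigma_{trials}^2}{n} \qquad (28)$$

and $Var[\hat{\mu}_{trials}(n_{max})]$ is the variance of the sample average estimating the true average, i.e.

$$Var[\hat{\mu}_{trials}(n_{max})] = \frac{\sigma_{trials}^2}{n_{max}} \qquad (29)$$

The variance of the cumulative average estimating the sample average, $SE_{sample}(n)^2$, can thus be readily obtained by subtracting the variance of the sample average from the variance of the cumulative average estimating the true average, i.e.

$$SE_{sample}(n)^2 = \frac{\sigma_{trials}^2}{n} - \frac{\sigma_{trials}^2}{n_{max}} \qquad (30)$$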
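The variance decomposition above lends itself to a quick numerical check. The sketch below is not the authors' simulation code: it assumes independent, normally distributed trials, and all function names are hypothetical. It compares the simulated variance of the difference between the cumulative average and the sample average with σ²_trials(1/n − 1/n_max). It also evaluates the closed form n = n_max / (1 + (z_ci/z)²), which follows algebraically when the two confidence-interval half-widths in Eq. (23) are set equal.

```python
import random
from statistics import NormalDist, fmean, pvariance

def n_opt_ci(n_max, p_incl=0.95, alpha_ci=0.05):
    """Closed form from equating the two half-widths (requires p_incl < 1)."""
    z_ci = NormalDist().inv_cdf(1 - alpha_ci / 2)
    z_incl = NormalDist().inv_cdf(1 - (1 - p_incl) / 2)
    return n_max / (1 + (z_ci / z_incl) ** 2)

def var_eps_theory(n, n_max, sigma=1.0):
    """Variance of (cumulative average - sample average) for i.i.d. trials."""
    return sigma ** 2 * (1 / n - 1 / n_max)

def var_eps_simulated(n, n_max, sigma=1.0, n_rep=5000, seed=0):
    """Monte Carlo estimate of the same variance (assumes Gaussian trials)."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_rep):
        trials = [rng.gauss(0.0, sigma) for _ in range(n_max)]
        # Cumulative average over the first n trials minus the sample average
        errors.append(fmean(trials[:n]) - fmean(trials))
    return pvariance(errors)

if __name__ == "__main__":
    print("theory:    ", var_eps_theory(20, 100))
    print("simulation:", var_eps_simulated(20, 100))
    print("n_opt_ci for n_max = 100:", n_opt_ci(100))
```

Under these assumptions, choosing p_incl = 0.95 together with a 95% confidence interval gives n_max/2 whatever the value of σ_trials, so the criterion depends only on how many trials were recorded.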
Substituting Eqs. (24) and (30) in Eq. (23) and solving for $n$ yields Eq. (1).

Derivation of Eq. (2).
The optimal number of trials $n_{opt\_\%diff}$ estimated by the inclusion of the cumulative average within a ± 10% difference around the sample average, as used empirically in one previous study 28 , can also be defined analytically, as follows:

Derivation of Eq. (10).
To derive Eq. (10), we start from the estimation of the t statistic in a paired Student's t-test, i.e.
where $\hat{\mu}_{trials1}$ and $\hat{\mu}_{trials2}$ are vectors of estimated single-subject MEP amplitudes for two repeated measures from the same population of subjects. Note that if we impose $cov(\hat{\mu}_{trials1}, \hat{\mu}_{trials2}) = 0$, then Eq. (36) becomes the t statistic for an unpaired t-test between two populations with an equal number of subjects. We assume equal variances (or pool them) so that $\sigma^2_{subjects1} + \sigma^2_{subjects2} = 2\sigma^2_{subjects}$, and we model the estimated single-subject MEP amplitudes as:

where $\mu_{trials}$ is the vector of true single-subject MEP amplitudes across subjects and $\varepsilon$ is the corresponding error vector for estimating the single-subject MEP amplitude with a limited number of trials. Assuming that the errors are independent, the covariance term can be rewritten as follows:

where $r(n)$ and $\sigma^2_{subjects}(n)$ are the Pearson's correlation across repeated measures and the (pooled) variance across subjects with $n$ trials, whereas $r$ and $\sigma^2_{subjects}$ are the asymptotic Pearson's correlation across repeated measures and (pooled) variance across subjects with infinite trials. Substituting Eqs. (7) and (38) in Eq. (36), we obtain:

which corresponds to Eq. (10). Note that

Therefore, $r(n)$ provides a lower bound for $r$, and $r$ can be estimated from the data. Note that Eq. (40) corresponds to a classic correction for attenuation 54,55 .
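The chain of displayed equations in this derivation can be sketched as follows. This is a reconstruction from the surrounding definitions, not the original typesetting. Three elements are assumptions: the symbol $N$ for the number of subjects, the hat notation for estimated quantities, and the form taken for Eq. (7), namely $\sigma^2_{subjects}(n) = \sigma^2_{subjects} + \sigma^2_{trials}/n$. The errors are also taken to be independent of the true amplitudes, and the final inequality holds for $r(n) \ge 0$.

```latex
% Reconstruction under the assumptions stated above (N, hats, form of Eq. 7).
\begin{align}
t &= \frac{\overline{\hat{\mu}_{trials1}} - \overline{\hat{\mu}_{trials2}}}
          {\sqrt{\dfrac{\sigma^2_{subjects1}(n) + \sigma^2_{subjects2}(n)
                 - 2\,\mathrm{cov}(\hat{\mu}_{trials1}, \hat{\mu}_{trials2})}{N}}}
   \tag{36} \\
\hat{\mu}_{trials} &= \mu_{trials} + \varepsilon
   \tag{37} \\
\mathrm{cov}(\hat{\mu}_{trials1}, \hat{\mu}_{trials2})
  &= \mathrm{cov}(\mu_{trials1}, \mu_{trials2})
  \;\Longrightarrow\;
  r(n)\,\sigma^2_{subjects}(n) = r\,\sigma^2_{subjects}
   \tag{38} \\
t &= \frac{\overline{\hat{\mu}_{trials1}} - \overline{\hat{\mu}_{trials2}}}
          {\sqrt{\dfrac{2\left[\sigma^2_{subjects}\,(1 - r)
                 + \sigma^2_{trials}/n\right]}{N}}}
   \tag{39} \\
r &= r(n)\,\frac{\sigma^2_{subjects}(n)}{\sigma^2_{subjects}}
   = r(n)\left(1 + \frac{\sigma^2_{trials}}{n\,\sigma^2_{subjects}}\right)
   \;\ge\; r(n)
   \tag{40}
\end{align}
```

Read this way, the number of trials enters the t statistic only through the term $\sigma^2_{trials}/n$, whereas the number of subjects scales the whole denominator, which is consistent with the statement that hypothesis testing depends more on subjects than on trials.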