# Statistical Guidelines

Comprehensive guidelines on the presentation of statistical material in medical and dental journals have been published elsewhere by Altman *et al.*^{1} The following is a slightly adapted summary of those guidelines, emphasising the areas that are particularly relevant to the submission of articles to the *British Dental Journal*.

Adherence to these guidelines should not be viewed as a substitute for obtaining appropriate statistical advice. Authors are strongly advised to consult with a statistician when undertaking analytical research and to do so early in the process, preferably at the design stage of any investigation.

## Methods section

### Methods

State:

- The objective of the research, including major hypotheses
- Subjects
  - Type, stating inclusion and exclusion criteria
  - Source and selection
  - Number, with justification
- Observations
  - Types
  - Measurement techniques

### Statistical methods

- Identify all methods used
- It is not sufficient to merely specify the computer software used (e.g. SPSS). The actual statistical techniques employed (e.g. unpaired t-test) should also be clearly identified.

- Common techniques
- These do not need to be described in detail, but methods with more than one version (e.g. paired and unpaired tests) need to be specified unambiguously.

- Complex methods
- These require some explanation. They may also benefit from appropriate references and/or a more detailed description as an appendix to the manuscript.

## Results section

### Statistical analysis

- Descriptive information
- Adequate description of the data should precede formal statistical analysis. Variables that are important for validity or interpretation of subsequent analyses should be described in the most detail. Continuous variables (e.g. age) can be summarised using the mean and standard deviation (SD). If the distribution of measurements is asymmetrical, the median and a percentile range (e.g. the interquartile range) provide a more appropriate summary. For ordinal data (e.g. data on an ordered scale such as the IOTN index of orthodontic treatment need, or the CPITN periodontal index) the use of means and standard deviations is incorrect; instead proportions should be employed.

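As an illustration of matching the summary measure to the type of data, a minimal sketch using Python's standard `statistics` module (all values invented):

```python
import statistics

# Hypothetical ages (years) from a somewhat skewed sample -- illustrative only
ages = [22, 24, 25, 26, 27, 28, 30, 33, 41, 64]

# Roughly symmetric data: mean and standard deviation
mean_age = statistics.mean(ages)
sd_age = statistics.stdev(ages)

# Skewed data: median and a percentile range such as the interquartile range
median_age = statistics.median(ages)
q1, _, q3 = statistics.quantiles(ages, n=4)   # quartile cut points
iqr = q3 - q1

# Ordinal data (e.g. an index score): report proportions in each category,
# not means and standard deviations
scores = [1, 1, 2, 2, 2, 3, 3, 4, 4, 5]
proportions = {k: scores.count(k) / len(scores) for k in sorted(set(scores))}
```

Here the large age of 64 pulls the mean above the median, which is one sign that the median and interquartile range are the more faithful summary.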
- Deviations
- Deviations from the intended study design should be reported (e.g. patients withdrawing from follow-up). For surveys, it is valuable to give information on the characteristics of non-responders compared to those who took part. In surveys depending on a representative sample drawn from a population, a response rate of less than 70% should normally be considered unacceptable, unless adequate evidence is available that the reasons for non-response will not affect the outcome of the main investigation.

- Baseline characteristics
- In a follow-up/intervention study, it is useful to compare the distribution of baseline characteristics in different groups. Differences that exist may influence results even if they are not statistically significant. Any differences should be allowed for in the analyses.

- Underlying assumptions
- Methods of analysis all rely to some extent on certain assumptions about the distribution of the variables being analysed. These assumptions should be explored, and in certain cases it may be appropriate to transform the data before proceeding with analysis, or to use so-called "distribution-free" (non-parametric) alternative methods.

- Hypothesis tests
- These should be used primarily to evaluate a strictly limited number of preformulated hypotheses. Subsidiary analyses that have been carried out because they have been suggested by preliminary inspection of the data are likely to give a false impression because in such circumstances the calculated P-value is too small. Special "multiple comparison" techniques are available for making pairwise comparisons among several groups. However, if multiple groups are to be compared that have a natural ordering, such as age-groups, the data should be analysed by a method that evaluates a possible trend across groups.

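The simplest multiple-comparison adjustment, a Bonferroni correction, can be sketched as follows (the P values are invented; dedicated pairwise or trend procedures are often preferable):

```python
# Bonferroni adjustment: each raw P value is multiplied by the number of
# tests performed, capped at 1.0. This guards against spurious significance
# when several comparisons are made.
raw_p = [0.012, 0.049, 0.200]   # illustrative P values, not real data
m = len(raw_p)
adjusted = [min(p * m, 1.0) for p in raw_p]

# At the conventional 0.05 threshold, only the first comparison survives
significant = [p < 0.05 for p in adjusted]
```

Note how a raw P of 0.049, nominally "significant", no longer is once the number of tests is allowed for.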
- Confidence intervals
- Reporting results in terms of confidence intervals rather than probability values alone (e.g. P = 0.36) is strongly recommended. Most studies are concerned with estimating some quantity, such as a mean difference between two groups. Since results are based on a sample rather than an entire population, they can only ever be an estimate of the true value. It is desirable to calculate the confidence interval around such an estimate. The 95% confidence interval, for example, is often interpreted as the range of values about which we are 95% confident that it includes the true value. Confidence intervals reveal the precision of an estimate. A wide interval points to a lack of information, and is a warning against over-interpreting the results of small studies. In a comparative study, confidence intervals should be reported for the differences between groups, not for the results of each group separately.

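For example, a large-sample 95% confidence interval for the difference between two independent means can be sketched as follows (measurements invented; with samples this small a real analysis would use the t distribution rather than z = 1.96):

```python
import math
import statistics

# Illustrative measurements from two independent groups (invented numbers)
group_a = [5.1, 4.8, 5.6, 5.0, 5.3, 4.9, 5.2, 5.4]
group_b = [4.6, 4.4, 4.9, 4.5, 4.8, 4.3, 4.7, 4.6]

# The quantity of interest: the difference between the group means
diff = statistics.mean(group_a) - statistics.mean(group_b)

# Standard error of the difference between two independent means
se = math.sqrt(statistics.variance(group_a) / len(group_a)
               + statistics.variance(group_b) / len(group_b))

# Large-sample 95% interval (z = 1.96)
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se
print(f"difference {diff:.2f} (95%CI {ci_low:.2f} to {ci_high:.2f})")
```

The interval is reported for the *difference* between the groups, as the guideline requires, not for each group separately.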
- Paired observations
- It is essential to distinguish between unpaired observations (e.g. measurements from two groups of people each receiving a different treatment A or B) and paired observations (e.g. two measurements made on the same individual before and after treatment). Different forms of tests exist for the analysis of paired or unpaired data. It should be made clear which form of test was used.

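The distinction can be made concrete: with paired data, the analysis operates on the within-patient differences rather than on the two columns of raw values (all figures invented):

```python
import math
import statistics

# Illustrative before/after measurements on the same eight patients
before = [12.0, 11.5, 13.2, 10.8, 12.5, 11.9, 13.0, 12.2]
after  = [11.1, 10.9, 12.0, 10.2, 11.8, 11.0, 12.1, 11.5]

# Paired data: analyse the within-patient differences
diffs = [b - a for b, a in zip(before, after)]
mean_change = statistics.mean(diffs)
sd_change = statistics.stdev(diffs)

# A paired t statistic is based on these differences, not on the
# two groups of raw values as an unpaired test would be
t_paired = mean_change / (sd_change / math.sqrt(len(diffs)))
```

Treating the same data as two unpaired groups would ignore the within-patient correlation and usually lose power; the manuscript should state which form of test was used.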
- Units of analysis
- In dental research it is common for several measurements to be made on the same patient (e.g. measurements may be taken from several sites in the same mouth, or from the same site on several different occasions), but the focus of interest usually remains the patient. Measurements taken from different sites within the same mouth should naturally be expected to be more similar to each other on average than measurements taken from different mouths. Failing to consider this situation of 'correlated' measurements causes multiple counting of individual patients and can lead to seriously distorted results. In particular it inflates the sample size and can lead to spurious statistical significance. Since the patient is the unit of the investigation, the patient should also be the unit of analysis. If the patient is not treated as the unit of analysis, complex specialised statistical methods need to be employed (e.g. Multilevel/hierarchical modelling). By contrast, groups are sometimes the focus of interest. This may be the case in cluster randomised trials where all the patients at a group of practices receive the same intervention (e.g. a practice leaflet or video) and they are compared to patients at another group of practices who do not receive the intervention. In such studies the cluster (practice) is the correct unit of analysis.

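A minimal sketch of the simplest remedy, collapsing site-level measurements to one summary value per patient before any between-patient analysis (probing depths invented):

```python
import statistics

# Illustrative probing depths (mm) at several sites per patient;
# the patient is the intended unit of analysis.
sites = {
    "patient1": [3.0, 3.5, 4.0],
    "patient2": [2.0, 2.5],
    "patient3": [5.0, 4.5, 5.5, 5.0],
}

# Wrong: pooling all site values treats them as independent observations,
# inflating the apparent sample size from 3 patients to 9 measurements
pooled_n = sum(len(v) for v in sites.values())

# Better simple approach: one summary value per patient
patient_means = {p: statistics.mean(v) for p, v in sites.items()}
n = len(patient_means)   # the true number of independent units
```

Multilevel/hierarchical models can use the site-level data directly, but this per-patient summary is often an adequate and transparent alternative.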
- Outliers
- Observations that are highly inconsistent with the main body of data should not be excluded from the analysis unless there are additional reasons to doubt their credibility. Any omission of outliers should be reported. Since the omission of such observations can have profound effects on the results, it is often useful to analyse the data both with and without such observations and to assess how much the conclusions depend on these values.

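For example, a sensitivity check with and without a suspect observation can be sketched as follows (values invented, one of them extreme):

```python
import statistics

# Illustrative data containing one highly inconsistent value (9.8)
values = [4.1, 4.3, 4.0, 4.2, 4.4, 9.8]

with_outlier = statistics.mean(values)       # analysis including the outlier
without_outlier = statistics.mean(values[:-1])  # analysis excluding it

# Report both; if the conclusions change materially, say so explicitly
shift = with_outlier - without_outlier
```

The size of `shift` indicates how heavily the conclusions depend on the single extreme value; any actual omission should be reported and justified.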
- Assessing agreement / reproducibility
- Many studies in dentistry attempt to assess the degree to which two or more sets of measurements are in agreement with each other, for example when assessing the agreement between different examiners who are observing the same group of patients. Specific techniques exist to measure agreement for both categorical data (e.g. Kappa statistics) and continuous data (e.g. quantifying the 'limits of agreement'). Standard statistical techniques and tests are often applied incorrectly in an attempt to measure agreement. Correlation techniques should not be used, since correlation measures linear association, not agreement. It is also not appropriate to use tests such as the t-test (paired or unpaired) to assess agreement: failing to show a statistically significant difference between sets of measurements is not the same as showing that they are the same. This is a common misconception.

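As an illustration, Cohen's kappa for two examiners can be computed directly from the observed and chance-expected agreement (ratings invented):

```python
# Cohen's kappa for two examiners rating the same patients.
def cohen_kappa(ratings_a, ratings_b):
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    cats = sorted(set(ratings_a) | set(ratings_b))
    # Observed agreement: proportion of patients given the same category
    p_obs = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected chance agreement, from each examiner's marginal rates
    p_exp = sum((ratings_a.count(c) / n) * (ratings_b.count(c) / n)
                for c in cats)
    return (p_obs - p_exp) / (1 - p_exp)   # assumes p_exp < 1

# Invented ratings of six patients by two examiners
examiner1 = ["caries", "sound", "caries", "sound", "sound", "caries"]
examiner2 = ["caries", "sound", "sound", "sound", "sound", "caries"]
kappa = cohen_kappa(examiner1, examiner2)
```

Unlike a correlation coefficient or a t-test, kappa directly quantifies agreement beyond what chance alone would produce.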
- Complex analyses and confounding factors
- In many studies the observations of prime interest may be influenced by several other variables. These might be anything that varies among subjects and which might have influenced the outcome being observed (e.g. treatment success may be influenced by patient age). Some or all of these variables (covariates) may be included in appropriate multiple regression techniques to explain or predict the outcome of interest while controlling (adjusting) for those variables that may influence the outcome. When statistical models are used to obtain estimates adjusted for other variables, it should be made clear which variables were adjusted for, on what basis they were selected, and if relevant, how they were treated in the analysis.

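Regression modelling is the usual tool for such adjustment. As a simpler illustration of why adjustment matters, the following sketch uses direct standardisation over an age band (all counts invented) to show how a crude comparison can reverse once a confounder is allowed for:

```python
# Illustrative success counts for two treatment groups, stratified by an
# age band that is distributed very differently between the groups.
# Each entry: (successes, total)
group_a = {"young": (40, 50), "old": (10, 50)}
group_b = {"young": (16, 20), "old": (24, 80)}

def crude_rate(g):
    """Overall success proportion, ignoring the age structure."""
    successes = sum(s for s, _ in g.values())
    total = sum(t for _, t in g.values())
    return successes / total

# Standard population: combined stratum sizes of both groups
weights = {k: group_a[k][1] + group_b[k][1] for k in group_a}

def adjusted_rate(g):
    """Age-adjusted success proportion via direct standardisation."""
    total = sum(weights.values())
    return sum((s / t) * weights[k] for k, (s, t) in g.items()) / total
```

Here the crude rates favour group A, yet within every age stratum group B does at least as well; the adjusted rates favour group B. A report should state which variables were adjusted for and how.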
### Presentation of results

- Presentation of summary statistics
- Mean values should not be presented without some measure of variability or precision. The standard deviation (SD) should be used to show the variability among individuals and the standard error of the mean (SE) to show the precision of the sample mean. It must be made clear which is presented. The use of the ± symbol causes confusion and should be avoided. For example, 14.2 ± 1.9 should be presented as 14.2 (SE 1.9) or 14.2 (SD 1.9) as appropriate. Confidence intervals are a good way of providing an indication of the uncertainty of sample means, proportions and other summary statistics. The use of a dash symbol (-) should similarly be avoided; confidence intervals should be presented as a range such as (95%CI 10.4 to 18.0) or (95%CI 10.4, 18.0). Note that it is also necessary to indicate the type of confidence interval that has been calculated (e.g. 95%CI or 99%CI). If the summary statistics are percentages, the denominator should always be made clear.

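These conventions are easy to follow mechanically; for example (figures illustrative):

```python
# Unambiguous presentation of summary statistics: label SD/SE explicitly
# and write confidence intervals as a labelled range, not with +/- or a dash
mean, sd, se = 14.2, 1.9, 0.6
ci_low, ci_high = 10.4, 18.0

sd_text = f"{mean} (SD {sd})"                       # variability among individuals
se_text = f"{mean} (SE {se})"                       # precision of the sample mean
ci_text = f"{mean} (95%CI {ci_low} to {ci_high})"   # interval type identified
```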
- Presentation of results of hypothesis tests
- When probability values (P) have to be reported, it is desirable to report the calculated values of the test statistics as well (e.g. χ² = 11.50, P = 0.001). The quantitative results being tested should be given whether or not the test was significant. Exact P values (e.g. P = 0.18) are preferable to notation such as P > 0.05, as they are more informative and avoid the use of arbitrary cut-off points between results being described as 'significant' or 'not significant'.

- Figures
- Graphical displays of results are helpful to readers, and figures that show individual observations are to be encouraged. Points on a graph relating to the same individual on different occasions should preferably be joined, or symbols used to indicate related points. Error bars of one standard error above and below the mean depict only an approximate 68% confidence interval and may cause confusion. Error bars presenting 95% confidence intervals (that are identified as such) are preferable. Scatter diagrams relating two variables should show all the observations.

- Tables
- The number of observations should be stated for each result in a table.

- Numerical precision
- When presenting means, standard deviations and other statistics the authors should bear in mind the precision of the original data. Means should not normally be given to more than one decimal place more than the raw data, but standard deviations and standard errors may need to be quoted to one extra decimal place. It is also usually sufficient to quote values of t, χ² and r to two decimal places.

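A small helper can encode this convention (the figures are illustrative):

```python
# Raw data recorded to one decimal place: present the mean with at most one
# extra decimal, and quote test statistics to two decimal places.
raw_decimals = 1

def present(value, extra=1):
    """Round a summary statistic relative to the precision of the raw data."""
    return round(value, raw_decimals + extra)

mean_out = present(14.23871)   # mean to two decimals
t_stat = round(2.34567, 2)     # t, chi-squared and r to two decimals
```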
- Repetition
- The presentation of the same results in multiple formats is to be discouraged. For example, results presented in a table should not be repeated verbatim in the body text of a manuscript. It is sufficient to refer to the table.

## Discussion section

- Interpretation of hypothesis tests
- A significant result does not necessarily indicate a real effect. There is always a risk of a false positive finding, but this risk diminishes for smaller P values. Also, a non-significant result does not mean that there is no effect, but only that the data are compatible with there being no effect. Furthermore, statistical significance should not be taken as being synonymous with clinical importance.

- Many hypothesis tests
- In many research projects some hypothesis tests relate to important comparisons that were envisaged when the research was initiated. Tests that were not decided in advance are subsidiary, especially if they were suggested by the results. More weight should be given to the former, while the latter should be viewed as exploratory only: a means of generating new hypotheses to be investigated in further studies.

- Association and causality
- Statistical association does not in itself provide direct evidence of causality. In observational studies, causality can only be established on non-statistical grounds.^{2} It is easier to infer causality in randomised trials.

- Weaknesses
- It is better to discuss weaknesses in the research, and to consider their possible effects on the results, than to ignore them in the hope that they will not be noticed.

- Non-sequiturs
- Discussion sections should not include statements or assertions that go beyond the remit of the investigation, or the scope of the results presented. Thus an investigation that was limited to showing that schoolchildren tended to ignore health education posters should not be used to support the contention that more health education is needed in schools.

## References

1. Altman D G, Machin D, Bryant T N, Gardner M J (eds). *Statistics with confidence: Confidence intervals and statistical guidelines*. Second edition. Bristol: BMJ Books, 2000.

2. Moles D R, dos Santos Silva I. Causes, associations and evaluating evidence; can we trust what we read? *Evid Based Dent* 2000; **2:** 75-78.