Introduction

In any statistical analysis, data are examined for unusual observations [1]. Outliers can be defined as observations that are unusually large or small compared with other data points of the same variable [2]. Several definitions have been proposed: outliers are observations that deviate extensively from the overall pattern or expectation of the other data points (see [3, 4]), observations with large residual values [5], or data values falling outside an expected range [6].

Extreme values may result from uncorrected data entry errors, system errors, or self-reporting bias, or they may be true observations arising from rare events. Regardless of their cause, these outliers may have a large influence on the regression model parameters and can change the direction of the effect, mask the effect, or lead to its underestimation or overestimation [7, 8].

Detection of outliers in longitudinal and time-to-event studies with continuous time-varying covariates is a challenge in most applied research areas [9]. Typically, data cleaning and management take a considerable amount of time because of the complexity of large datasets. Yet exploratory data analysis to detect extreme values at an early stage of the statistical analysis is necessary to avoid misleading conclusions that are based on only a few data points. Common approaches to detecting outliers include relying on the knowledge and experience of investigators, using published cross-sectional references, or eyeballing graphical outputs [10] to come up with single upper and/or lower cut-off values across the dataset. Some studies [11] have used conditional growth percentiles to identify outliers in growth trajectory data and defined outliers as observations 4 standard deviations \(\left(\sigma \right)\) away from the expected (conditional) value. However, \(\sigma\) and the mean \(\left(\mu \right)\) are themselves affected by extreme observations and thus may not be suitable measures of spread and location, respectively, when the data are skewed [12]. In addition, the conditional growth method cannot be applied to a subject’s first measurement (visit). Robust regression methods have also been used to detect outliers in longitudinal electronic health records [13]. Other methods include jackknife residuals [14], with a cut-off of ±5 used to define outliers in longitudinal childhood growth data, and studentized residuals [15], with a cut-off of \(> 2\sigma\), whereas [16] used a cut-off of ±6 in a longitudinal study of obesity prevalence and weight change in children and adolescents, together with cross-sectional z-score thresholds as defined by the Centers for Disease Control.

Non-parametric statistics such as the interquartile range (IQR) and the median absolute deviation (MAD) are simple and useful alternatives for detecting potential outliers during the exploratory data analysis stage. One study [17] calculated weight-for-length percentiles and then used the IQR method to detect and remove outliers in a longitudinal childhood genome study. The numerical IQR method, followed by graphical assessment using box plots, was described as an approach to successfully identify potential outliers in an educational achievement study to improve student learning [18]. The MAD method has been used to detect and remove outliers in several studies (see [12, 19, 20]). Both IQR and MAD are robust measures of dispersion that are more resilient to outliers than the standard deviation. The IQR measures the spread of the central half of the data for any shape of distribution; because it is based on percentiles, it requires neither symmetry nor distributional assumptions, which makes it robust to the presence of outliers.

The primary aim of this study was to describe how to identify potential outliers in longitudinal data with skewed distributions at follow-up visits using a non-parametric interquartile range method. This method may serve as a practical tool for a statistical analyst who has no prior knowledge or policy at hand for dealing with extreme values.

Materials and methods

Robustness of IQR and MAD statistics

The robustness of IQR and MAD has been shown by [21]: if \({T}_{n}\) is a statistic on an ordered sample of size \(n\), then \({T}_{n}\) has breakdown value \(b\), \(0\le b\le 1\), if for every \(\epsilon > 0\), \(\lim_{X(\{(1-b)n\})\to \infty }{T}_{n} < \infty\) and \(\lim_{X(\{(1-(b+\epsilon ))n\})\to \infty }{T}_{n}=\infty\). The sample median (\(m\)) remains unchanged in the presence of extreme low or high values. If less than 50% of the sample \(\to \infty\), then \(m\) and MAD will remain the same. If more than 50% of the sample \(\to \infty\), then \(m\to \infty\) and so does MAD. The median in that case will be located within the outliers, and thus MAD has a breakdown value of 50%. Similarly, IQR has a breakdown value of 25%, breaking down when \({Q}_{1}\) is located within the outliers [22].
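The breakdown behaviour described above can be illustrated numerically. The following is a minimal sketch, not the authors' code: the helper names (`iqr`, `mad`, `contaminate`), the toy sample 1 to 100, and the choice of the "inclusive" quartile definition are all assumptions made for illustration. It pushes the largest 10%, 30% and 60% of a sample towards very large values and recomputes both statistics.

```python
from statistics import median, quantiles


def iqr(values):
    """Interquartile range Q3 - Q1 (assumed 'inclusive' quartile definition)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def mad(values):
    """Unscaled median absolute deviation from the median."""
    m = median(values)
    return median([abs(v - m) for v in values])


def contaminate(values, fraction, huge=1e9):
    """Replace the largest `fraction` of the sample with distinct huge values."""
    ordered = sorted(values)
    n_bad = int(round(fraction * len(ordered)))
    kept = ordered[: len(ordered) - n_bad]
    return kept + [huge * (i + 1) for i in range(n_bad)]


clean = [float(v) for v in range(1, 101)]  # hypothetical toy sample 1..100
results = {}
for fraction in (0.10, 0.30, 0.60):
    sample = contaminate(clean, fraction)
    results[fraction] = (iqr(sample), mad(sample))
    print(f"{fraction:.0%} contaminated: IQR={results[fraction][0]:.3g}, "
          f"MAD={results[fraction][1]:.3g}")
```

In this toy setting both statistics stay small at 10% contamination, the IQR becomes very large at 30% while the MAD stays small, and the MAD itself becomes very large at 60%, in line with the 25% and 50% breakdown values quoted above.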

Interquartile range algorithm

For each continuous time-dependent variable of interest, the following defines the IQR algorithm and can be applied overall or stratified by important subject level factors such as country or longitudinal variables such as age or follow-up visit:

  (i) Calculate \({{\rm{IQR}}}={Q}_{3}-{Q}_{1}\).

  (ii) Define lower and upper limits of outliers as \(\left[{Q}_{1}-k\times {{\rm{IQR}}},{Q}_{3}+k\times {{\rm{IQR}}}\right]\) or \(\left[{Q}_{1}-h,{Q}_{3}+h\right]\) where \(h=k\times {{\rm{IQR}}}\) and \(k > 0\) is a scale factor.

  (iii) Flag observations outside the limits as potential outliers.

IQR lower-limits below zero can be set to zero, for instance, in food record data where intake cannot be negative.
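The three steps above, applied within strata, can be sketched as follows. This is an illustrative implementation under stated assumptions rather than the code used in the study: the record layout (`country`, `visit`, `value`), the function names, the toy intake values, and the "inclusive" quartile definition from Python's `statistics.quantiles` are all hypothetical choices. Different software packages use different quartile definitions, so limits may differ slightly between tools.

```python
from collections import defaultdict
from statistics import quantiles


def iqr_limits(values, k, floor_at_zero=False):
    """Steps (i)-(ii): return (lower, upper) = (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")  # needs >= 2 values
    h = k * (q3 - q1)
    lower, upper = q1 - h, q3 + h
    if floor_at_zero:  # e.g. intake data that cannot be negative
        lower = max(lower, 0.0)
    return lower, upper


def flag_outliers(records, k, floor_at_zero=True):
    """Step (iii): flag each record against the limits of its own stratum."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["country"], r["visit"])].append(r["value"])
    limits = {key: iqr_limits(vals, k, floor_at_zero)
              for key, vals in groups.items()}
    flags = []
    for r in records:
        lower, upper = limits[(r["country"], r["visit"])]
        flags.append(not (lower <= r["value"] <= upper))
    return flags


# Hypothetical toy records: two strata, one implausibly large value.
records = (
    [{"country": "A", "visit": 6, "value": v}
     for v in (2.0, 2.5, 3.0, 3.5, 4.0, 60.0)]
    + [{"country": "B", "visit": 6, "value": v}
       for v in (1.0, 1.5, 2.0, 2.5, 3.0)]
)
flags = flag_outliers(records, k=3)
print([r["value"] for r, f in zip(records, flags) if f])
```

Computing the limits per stratum means that a value is judged against its own country and visit rather than against a single cut-off for the whole dataset.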

Motivating TEDDY dietary data

This study was motivated by a large longitudinal and time-to-event dataset with time-varying covariates. Dietary intake data were obtained from The Environmental Determinants of Diabetes in the Young (TEDDY) study, an observational longitudinal study that investigates factors associated with type 1 diabetes (T1D) in children [23]. Of the 8676 children enrolled in TEDDY, 120 were HLA ineligible, 22 had no food records, and 33 had no record on islet autoimmunity (IA); these were dropped, leaving a total of 8501 subjects with 152,426 records, followed from birth until censoring at 10 years of age for this study. Diet was assessed by 24-h dietary recall at the age of 3 months, by 3-day food record at the ages of 6, 9, and 12 months, and every 6 months thereafter until the subject developed IA or was censored at the end of the study period. To illustrate the IQR method, we focused on daily intake of vitamin B12 (µg/day), including intake from foods and dietary supplements [24], as the exposure and on the risk of developing IA.

Statistical analysis

The basic Cox proportional hazards model [25] (see also [26, 27]) assumes that exposure covariates are fixed. This model can be extended to introduce variables that vary continuously with time in the form of \({z}_{i}\left(t\right)={z}_{i}g\left(t\right)\) and expressed as

$$\begin{array}{l}h\left(t\right)={h}_{o}\left(t\right)\exp \{{\beta }_{1}{x}_{1}+\cdots +{\beta }_{k}{x}_{k}+g\left(t\right)({\gamma }_{1}{z}_{1}+\cdots +{\gamma }_{m}{z}_{m})\}\,\\\quad\quad\,=\,{h}_{o}\left(t\right)\exp \left(\mathop{\sum }\limits_{{\rm{i}}=1}^{{\rm{k}}}{\beta }_{i}{x}_{i}+\mathop{\sum }\limits_{{\rm{i}}=1}^{{\rm{m}}}{\gamma }_{i}{z}_{i}(t)\right)\end{array}$$

where \({\boldsymbol{Z}}=\{{z}_{1},\ldots ,{z}_{m}\}\) are time-varying covariates and \({\gamma }_{i}\) is the regression coefficient for the covariate \(g\left(t\right){z}_{i}\), which is a function of time. With the data arranged in the counting process format, the extended time-dependent Cox model associating the risk of islet autoimmunity with the exposure variables is expressed as

$$h\left(t|{\boldsymbol{X}}{\boldsymbol{,}}{\boldsymbol{Z}}\right)={h}_{o}\left(t\right)\exp ({\beta }_{1}{{\rm{vitamin}}}{{\rm{B}}}_{12}\left(t\right)+{\beta }_{2}{{\rm{sex}}}+{\beta }_{3}{{\rm{fdr}}}+{\beta }_{4}{{\rm{hla}}}+{\beta }_{5}{{\rm{country}}})$$

where vitamin B12 (µg/day) is the continuous time-varying exposure of interest, adjusted for the fixed covariates sex, HLA DR3/4, FDR, and country. Standard errors (SE) were calculated using the robust (empirical) sandwich variance estimator to account for correlations within subjects [28, 29]. Vitamin B12 (µg/day) was energy adjusted by country and visit using Willett’s residual method [30].
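The counting process arrangement mentioned above represents each subject as a series of (start, stop] intervals, each carrying the covariate value in force over that interval and an event indicator. The sketch below shows one common way to build such rows; it is an assumption about the layout, not a description of the TEDDY data files. The function name, the field names (`start`, `stop`, `b12`, `event`) and the example visits are hypothetical.

```python
def to_counting_process(visits, end_time, event):
    """Turn one subject's visits into (start, stop] rows.

    visits   : list of (age_in_months, covariate_value), distinct ages assumed
    end_time : age at event or censoring
    event    : True if the subject had the event at end_time
    The value measured at a visit is carried forward until the next visit.
    """
    ordered = sorted(v for v in visits if v[0] < end_time)
    rows = []
    for idx, (start, value) in enumerate(ordered):
        is_last = idx + 1 == len(ordered)
        stop = end_time if is_last else ordered[idx + 1][0]
        rows.append({"start": start, "stop": stop, "b12": value, "event": 0})
    if rows and event:
        rows[-1]["event"] = 1  # the event is attached to the final interval only
    return rows


# Hypothetical subject: visits at 3, 6 and 9 months, event at 12 months.
rows = to_counting_process([(3, 0.4), (6, 0.9), (9, 1.3)],
                           end_time=12, event=True)
for row in rows:
    print(row)
```

Under this layout the event indicator lives on a single row, so deleting that row also removes the subject's event from the analysis.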

Detecting highly influential observations using residual diagnostics

Residual diagnostic plots were obtained after fitting Cox regression models to the full TEDDY dietary data to further assess the influence of extreme observations on the estimated parameters. The partial DFBETA measure of influence for vitamin B12 and the partial efficient score residuals were each plotted against analysis time to identify observations with disproportionate influence. Partial residuals were calculated for each observation within the subject. These are the additive contributions to a subject’s overall residual (see [26, 27] for details). The partial DFBETA value estimates the change in the regressor’s coefficient due to deletion of that individual record.
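The idea behind a DFBETA, the change in a coefficient when one record is deleted, can be shown with a much simpler model than the Cox regression used here. The sketch below substitutes an ordinary least-squares slope and exact leave-one-out refitting; it is a deliberately simplified stand-in, not the partial DFBETA computed by survival software (which is typically an approximation based on score residuals). The data and function names are hypothetical.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def dfbeta_slope(xs, ys):
    """Leave-one-out change in the slope: full-data slope minus slope without i."""
    full = slope(xs, ys)
    return [full - slope(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
            for i in range(len(xs))]


# Hypothetical data: five well-behaved points and one extreme x value.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 50.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.1, 10.0]
changes = dfbeta_slope(xs, ys)
most_influential = max(range(len(xs)), key=lambda i: abs(changes[i]))
print(most_influential, changes[most_influential])
```

Plotting such per-record changes against time or against the covariate, as done in this study, makes records with disproportionate influence stand out visually.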

Although the primary aim of this study is to describe methods to identify outliers, we provide some guidelines on how to handle potential outliers. The steps below were followed in this study:

  (i) The full dataset, including these outliers, was analyzed as a sensitivity analysis.

  (ii) The variable was log-transformed to base 2 to avoid dropping observations.

  (iii) Outliers were removed based on IQR scale factors \(k=3, 5, 7, 10\).

Detected outliers for \(k=3\) were inversely weighted so that they lie within the lower and upper bounds, using the following procedure. For each \({x}_{i}\) value outside the limits, compute the weight \({w}_{i}=\max (|{x}_{i}-{\rm{upper\;limit}}|,|{x}_{i}-{\rm{lower\;limit}}|)\). Then calculate a pseudo value \({x}_{j}=\frac{{x}_{i}}{{w}_{i}}+v,\) where \(v > 0\) is a constant (such as the lower limit, \(Q3\), etc.) chosen such that the outlier \({x}_{i}\) is replaced with an \({x}_{j}\) that is bounded within the limits. An alternative is to replace \({x}_{i}\) with a group-specific quartile value such as the median or a combination of quartiles \((Q1+Q2+Q3)\). Statistical analysis was conducted using SAS® software version 9.4 [31], R version 4.3.1 [32] and Stata statistical software (release 18) [33].
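The inverse-weighting and median-replacement procedures described above can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, and the limits (0 and 20 µg/day), the constant \(v=5\) and the group values are invented for the example rather than taken from the TEDDY data. Whether the pseudo value actually falls inside the limits depends on the choice of \(v\), so the sketch checks it explicitly.

```python
from statistics import median


def inverse_weight(x, lower, upper, v):
    """Return x unchanged if within limits, else the pseudo value x / w + v."""
    if lower <= x <= upper:
        return x
    # w is the larger distance from x to either limit (always > 0 here)
    w = max(abs(x - upper), abs(x - lower))
    return x / w + v


def replace_with_group_median(x, lower, upper, group_values):
    """Alternative: replace an out-of-limit value with its group median."""
    if lower <= x <= upper:
        return x
    return median(group_values)


# Hypothetical limits and constant for one country-by-visit group.
lower, upper, v = 0.0, 20.0, 5.0
pseudo = inverse_weight(670.81, lower, upper, v)
within = lower <= pseudo <= upper  # depends on the chosen v
print(pseudo, within)

replaced = replace_with_group_median(670.81, lower, upper,
                                     [2.0, 3.0, 4.0, 670.81])
print(replaced)
```

Both approaches keep the record, and therefore any event attached to it, in the dataset while pulling the covariate value back into the group range.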

Results

During the first 10 years of follow-up, 778 out of 8501 children developed persistent confirmed IA. The median age (IQR) at risk was 36.3 (18.1, 72.5) months. Of the 778 children with IA, 590 had complete records. The incidence rate of Islet Autoimmunity was estimated to be 0.042 with 95% CI: (0.039, 0.045).

Descriptive statistics of the demographics are displayed by country, FDR status, HLA DR3/4 status and sex in Table 1, together with the number of subjects at enrollment (age 3 months), the person-years of follow-up, and the number of children developing IA with incidence, as analyzed in the Cox model. Supplementary Table S1 displays additional descriptive statistics of the distribution of vitamin B12 intake, including interquartile range (IQR), median, standard deviations and standard errors by country, for the several datasets used.

Table 1 Demographics of study participants and the risk of getting Islet Autoimmunity where subjects are followed up for the first 10 years of life.

The trend of vitamin B12 intake (µg/day) by country and visit for the 8501 children in the TEDDY cohort, followed regularly until 10 years of age while at risk of developing Islet Autoimmunity, is shown in Fig. 1. Average daily intake of vitamin B12 varied between countries and by visit, with intake consistently higher in the USA and lowest in Germany. A box plot of the vitamin B12 intakes (Supplementary Fig. S1) shows the presence of potential outliers at some of the follow-up visits.

Fig. 1: Mean intake by country and visit.

Line graph of average vitamin B12 (µg/day) intake by country in the first 10 years of life.

Table 2 provides Cox regression estimates both on the log scale (log-hazard ratios) and in exponentiated form (HR), together with the robust sandwich SE and the 95% CI. The HR represents the risk associated with a one standard deviation (SD) change in intake. The two most extreme observations (vitamin B12 of 670.81 and 1666.83 µg) were scrutinized further: each was dropped in turn and the data re-analyzed, then both were dropped and the analysis repeated, and finally both were replaced with the medians calculated by country and visit and the data re-analyzed. Results indicated that dropping the observation with vitamin B12 intake of 670.81 µg/day at the 6th year visit (visit 72 months) from the USA had the largest impact on both the standard errors and the direction of the effect. This observation was a confirmed Islet Autoimmunity positive case. The other outlier, with a vitamin B12 intake of 1666.83 µg/day, was not an Islet Autoimmunity positive case and had minimal influence on the Cox regression model, as shown in Table 2 under the sensitivity analysis sub-title.

Table 2 Association of vitamin B12 (µg/day) intake/day with risk of Islet Autoimmunity (IA) from TEDDY dietary data.

After fitting the extended Cox regression models, residual diagnostic plots were examined for observations that were highly influential on the model parameters; these are given in Figs. 2–4. Figure 2 displays eight residual diagnostic plots, panels (a)–(h), used to identify influential extreme observations. Figure 2a, b shows standardized DFBETA residuals against vitamin B12 intake for the full data and the IQR-5 data. Extreme observations are noticeable in the full data. Similarly, Fig. 2c, d displays martingale residuals, with potential outliers seen on the plot of the full data. Deviance residual plots are provided in Fig. 2e, f, while score residuals are shown in Fig. 2g, h, for models fitted using the full data and IQR-5 datasets. All plots from the full data indicated the presence of at least two observations that have larger residuals than expected.

Fig. 2: Residual diagnostic plots from fitted models for full and IQR-5 TEDDY data where residuals are plotted against average vitamin B12 (μg/day) intake/day.

a, b The standardized DFBETA residuals, c, d the martingale residuals, e, f deviance residuals and g, h score residuals against the vitamin B12 intake. Extreme observations that appear to be further away from other points are clearly seen on the full data plots.

Fig. 3: Partial DFBETA residuals from full TEDDY dietary data showing vitamin B12 (μg/day) values that have larger or unusual partial DFBETA residuals compared to other data points.

The extreme observations in the full data, which come from the USA at visit 72 months (670.81 μg/day) and visit 120 months (1666.83 μg/day), appear to have a disproportionate influence on the fitted model.

Fig. 4: Partial DFBETA residuals from IQR-5 TEDDY data with labeled vitamin B12 (μg/day) values.

There are no obvious extreme values shown on the residual plot.

Figures 3 and 4 display partial DFBETA residuals by country for the full data and the IQR-5 data, respectively. We compared the magnitudes of the largest DFBETA values to the Cox regression coefficients and labeled the points by their vitamin B12 intake values. Fig. 3 showed that the USA’s vitamin B12 intake values of 670.81 μg/day at visit 72 months and 1666.83 μg/day at visit 120 months were disproportionately influential in the full data; no such influential values were apparent in the IQR-5 dataset (Fig. 4).

Similar patterns were observed when looking at the partial deviance residuals (Supplementary Figs. S2 and S3) and partial score residual plots (Supplementary Figs. S5 and S6) by country. The goodness-of-fit graphs based on Cox-Snell residuals plotted against cumulative hazard for models of the full data, IQR-5 data and log2-transformed data are provided in the Supplementary files (Supplementary Figs. S4a–c, respectively). The tails are slightly heavy, but the models fitted the data reasonably well.

Discussion

This study has provided practical illustrations of how the IQR method can be used to identify potential outliers in longitudinal datasets by follow-up visit.

Analysis of TEDDY dietary data following the IQR method showed the impact of extreme observations on the model (Table 2). Results from the full data analysis indicated a significant increase in risk of developing IA with higher intakes of vitamin B12. Analysis of the log-transformed and IQR-k-reduced datasets showed that an increase in intake of vitamin B12 reduces the risk of IA, although the association between exposure and outcome in the Cox model was not statistically significant. Some of the outliers detected by the IQR method were also found to be highly influential observations in the residual diagnostic plots. The most influential extreme observation was a case in which the unusually large vitamin B12 value of 670.81 µg was recorded at the 6th year visit in the USA, when the subject was diagnosed with Islet Autoimmunity. Results in Table 2 show the impact of this outlier on the model. When this unusual observation is removed, the time-to-event analysis handles this subject as a non-case up to the last visit that appears in the dataset when the data are arranged in the counting process format. As a result, the direction of the HR, the SE and the significance of the association changed drastically.

In many studies, researchers would not want to drop observations. We have illustrated how to avoid dropping outliers using these two extreme observations (vitamin B12 of 670.81 and 1666.83 µg), by replacing them with their group-specific medians and by using the log transformation and inverse weighting methods (section “Detecting highly influential observations using residual diagnostics”). This made the 1 outlier out of the 590 cases remain as a case in the survival model with an intake value that was within the group range. Results for this analysis were in the same direction as for the IQR-k methods and the log2-transformed method, showing a reduced risk of IA with intake of vitamin B12 and larger standard errors (0.216) than when the outliers are not handled (0.009).

The extreme value of vitamin B12 = 1666.83 µg occurred at the 10th year visit in the USA. Although it is extremely large compared with the values for that country and visit, this observation had only a mild impact on the survival model since it was not a case. When only this outlier is removed, in the presence of the other case-outlier, the HR still indicated an increased risk of IA with intake of vitamin B12.

Our study has shown that 1 outlier out of 590 cases of IA (Table 1) has a different impact on the model compared to 1 outlier out of 7911 (8501–590) non-cases. We could have had, say, 40 outliers out of 590 cases compared to 40 outliers out of 7911 non-cases. The IQR-k algorithm can identify such potential outliers, which could represent a genuine subgroup related to disease/case status. In survival regression models, cases with extreme values can be highly influential compared to non-cases.

The choice of the IQR scale factor \(k\) may depend on the distribution of the data and on how far from the median the researchers would like to keep data points. Smaller \(k\) values are more stringent and yield a narrower retained distribution, whereas larger \(k\) values allow larger variances. If no prior information on the distribution of the variable of interest is available, the analyst can examine several \(k\) values to identify and flag potential outliers, and then conduct sensitivity analyses to see whether the model results (estimates, standard errors, strength of association) change with or without the detected outliers.
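Examining several scale factors, as suggested above, amounts to recomputing the limits for each \(k\) and counting how many observations fall outside them. The sketch below does this for the scale factors used in this study on a small, hypothetical right-skewed sample; the function name, the data and the "inclusive" quartile definition are assumptions for illustration only.

```python
from statistics import quantiles


def count_flagged(values, k):
    """Number of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    h = k * (q3 - q1)
    return sum(1 for v in values if v < q1 - h or v > q3 + h)


# Hypothetical right-skewed intake values for one stratum.
values = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 12.0, 25.0, 60.0]
counts = {k: count_flagged(values, k) for k in (3, 5, 7, 10)}
for k, n_flagged in counts.items():
    print(f"k={k}: {n_flagged} flagged")
```

Because the limits widen as \(k\) grows, the number of flagged observations can only stay the same or decrease, which gives the analyst a simple way to see how sensitive the flagged set is to the choice of \(k\) before refitting the model.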

Examining residual diagnostic plots for the full models compared to those from the IQR-5 models revealed similar patterns of the presence of potential outliers that could be highly influential in the full models but not in the IQR-5 models.

We have illustrated the use of the IQR method to detect potential outliers in time-varying continuous variables in longitudinal datasets at follow-up visits. It can be used as a data quality control procedure to identify unusual observations. Once outliers are identified, they can be flagged and investigated further, including by conducting sensitivity analyses to ascertain their influence on the regression model.