Automated data cleaning of paediatric anthropometric data from longitudinal electronic health records: protocol and application to a large patient cohort

‘Big data’ in healthcare encompass measurements collated from multiple sources with varying degrees of data quality. These data require quality control assessment to optimise quality for clinical management and for robust large-scale data analysis in healthcare research. Height and weight represent one of the most abundantly recorded health statistics. The shift to electronic recording of anthropometric measurements in electronic healthcare records has rapidly inflated the number of measurements. WHO guidelines inform removal of population-based extreme outliers, but an absence of tools limits cleaning of longitudinal anthropometric measurements. We developed and optimised a protocol for cleaning paediatric height and weight data that incorporates outlier detection using robust linear regression methodology, using a manually curated set of 6,279 patients’ longitudinal measurements. The protocol was then applied to a cohort of 200,000 patient records collected from 60,000 paediatric patients attending a regional teaching hospital in South England. WHO guidelines detected biologically implausible data in <1% of records. Additional error rates of 3% and 0.2% for height and weight respectively were detected using the protocol. Inflated error rates for height measurements were largely due to small but physiologically implausible decreases in height. The lowest error rates were observed when data were measured and digitally recorded by staff routinely required to do so. The protocol successfully automates the parsing of implausible and poor-quality height and weight data from a voluminous longitudinal dataset and standardises the quality assessment of data for clinical and research applications.

It is desirable to expedite the cleaning and curation of contemporary electronic healthcare data. A linear relationship between age and both height-for-age z-scores (HAZ) and weight-for-age z-scores (WAZ) is expected [1]. It is therefore possible to exploit this expected relationship using linear regression (LR) methodology to automate the detection of a minimal subset of outlier data that is flagged for manual curation. Reducing the burden of curation to a minimal set facilitates discrimination of erroneous data entry from clinically plausible measurements.
Using LR outlier detection methodology, a regression line is first fitted to the observed data by minimising a target function, usually the mean of squared errors, where the error is the difference between the observed value and the value predicted by the regression. Outliers can then be flagged if the error exceeds a certain threshold, often twice the standard deviation (SD) of the errors.
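The flagging step described above can be sketched as follows. This is a minimal illustration, assuming a single child's longitudinal (age, WAZ) series; the function names and example values are invented, not taken from the study data.

```python
import statistics

def ols_fit(ages, values):
    """Ordinary least-squares fit: values ~ intercept + slope * age."""
    n = len(ages)
    mx, my = sum(ages) / n, sum(values) / n
    sxx = sum((a - mx) ** 2 for a in ages)
    slope = sum((a - mx) * (v - my) for a, v in zip(ages, values)) / sxx
    return my - slope * mx, slope

def flag_outliers(ages, values, q=2.0):
    """Flag points whose residual exceeds q times the residual SD."""
    intercept, slope = ols_fit(ages, values)
    residuals = [v - (intercept + slope * a) for a, v in zip(ages, values)]
    sd = statistics.stdev(residuals)
    return [abs(r) > q * sd for r in residuals]

# Illustrative series with one implausible jump in WAZ at age 4.
ages = [1, 2, 3, 4, 5, 6, 7]
waz = [0.1, 0.2, 0.2, 3.5, 0.3, 0.4, 0.4]
print(flag_outliers(ages, waz))
```

With these toy values only the implausible measurement at age 4 exceeds twice the residual SD, so only that point is flagged.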
However, substantial outliers pose a serious problem for any standard LR method because they can skew the original line fitting [2]. This issue cannot be surmounted with very sparse data, but can be addressed, where there are sufficient datapoints, using the jack-knife method [1]. This leave-one-out LR method iteratively removes a single datapoint, fits the model on the remaining data and evaluates the excluded datapoint against the fitted model. This approach is more sensitive in detecting singleton outliers than the standard residual LR method [3], but it loses power when multiple outliers exist within the data [2]. The jack-knife approach is also computationally expensive, scaling quadratically with the number of datapoints.
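The leave-one-out variant can be sketched as follows: each point is scored against a model fitted without it, so a lone outlier cannot drag the regression line towards itself. The O(n²) cost mentioned above is visible in the nested fit. Names and example data are illustrative.

```python
import statistics

def ols_fit(ages, values):
    """Ordinary least-squares fit: values ~ intercept + slope * age."""
    n = len(ages)
    mx, my = sum(ages) / n, sum(values) / n
    sxx = sum((a - mx) ** 2 for a in ages)
    slope = sum((a - mx) * (v - my) for a, v in zip(ages, values)) / sxx
    return my - slope * mx, slope

def jackknife_flags(ages, values, q=2.0):
    """Flag each point against a model fitted on all OTHER points."""
    flags = []
    for i in range(len(ages)):
        rest_a = ages[:i] + ages[i + 1:]
        rest_v = values[:i] + values[i + 1:]
        intercept, slope = ols_fit(rest_a, rest_v)
        resid = [v - (intercept + slope * a) for a, v in zip(rest_a, rest_v)]
        sd = statistics.stdev(resid)
        err = values[i] - (intercept + slope * ages[i])
        flags.append(abs(err) > q * sd)
    return flags

ages = [1, 2, 3, 4, 5, 6, 7]
waz = [0.1, 0.2, 0.2, 3.5, 0.3, 0.4, 0.4]
print(jackknife_flags(ages, waz))
```

When the implausible point is held out, the remaining six points fit almost perfectly, so its error dwarfs the residual SD; for every other point the held-in outlier inflates the SD, illustrating the loss of power with multiple outliers.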
To this end, robust regression methodology [2] was adopted in developing our outlier detection method, as it remains robust in the presence of multiple outliers by using influence measures such as Cook's distance [4], DFFITS and DFBETAS. Cook's distance estimates the influence of a datapoint in a least-squares LR analysis [4]: it measures the effect on the fitted model of removing a given observation. Datapoints with a large Cook's distance (>1) reflect large residuals (the difference between predicted and true values) or high leverage. Studentized DFFITS measures the influence of a single datapoint on the LR analysis. It is calculated as the change in the predicted value at a point when that point is excluded from the LR, divided by the estimated SD of the model at that point [5]. It has been suggested that datapoints with |DFFITS| > 2√(k/n) should be further investigated, where k is the number of parameters in the LR model and n is the number of datapoints [4]. For height and weight data, age is the only feature used to predict HAZ or WAZ, hence k = 1. DFBETAS measures the difference in a given parameter estimate with and without each individual datapoint [6]. A datapoint with |DFBETAS| > 2/√n is suggested for further investigation.
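For a single-predictor regression these influence measures have simple closed forms, sketched below. The studentized residual here is the internal version, a common simplification (the textbook DFFITS uses the externally studentized residual); function names and example data are illustrative.

```python
import math

def influence_stats(ages, values):
    """Leverage, Cook's distance and (internally studentized) DFFITS
    for a simple regression of values on age. Sketch only."""
    n = len(ages)
    mx, my = sum(ages) / n, sum(values) / n
    sxx = sum((a - mx) ** 2 for a in ages)
    slope = sum((a - mx) * (v - my) for a, v in zip(ages, values)) / sxx
    intercept = my - slope * mx
    resid = [v - (intercept + slope * a) for a, v in zip(ages, values)]
    p = 2  # fitted parameters: intercept + age slope (k = 1 predictor)
    mse = sum(r * r for r in resid) / (n - p)
    stats = []
    for a, r in zip(ages, resid):
        h = 1 / n + (a - mx) ** 2 / sxx        # leverage
        t = r / math.sqrt(mse * (1 - h))       # studentized residual
        cooks = t * t * h / (p * (1 - h))      # Cook's distance
        dffits = t * math.sqrt(h / (1 - h))    # DFFITS
        stats.append((h, cooks, dffits))
    return stats

ages = [1, 2, 3, 4, 5, 6, 7]
waz = [0.1, 0.2, 0.2, 3.5, 0.3, 0.4, 0.4]
n, k = len(ages), 1
for i, (h, cooks, dffits) in enumerate(influence_stats(ages, waz)):
    if cooks > 1 or abs(dffits) > 2 * math.sqrt(k / n):
        print(i, round(cooks, 2), round(dffits, 2))
```

On this toy series the implausible point does not exceed the Cook's distance cut-off of 1 but is caught by the DFFITS threshold, illustrating why several influence measures are used in combination.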
Datapoints with influence statistics exceeding the suggested thresholds are temporarily removed from the inference and the regression parameters are re-estimated from the remaining data. This results in a regression line that best fits the most reliable data without the computational expense of the jack-knife approach. It is this regression line that is used to discriminate outlying datapoints from the entire set of datapoints using the SD fold threshold q.
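The two-stage procedure above can be sketched end to end. For brevity this sketch screens on DFFITS only (the protocol also uses Cook's distance and DFBETAS), re-fits on the retained points, and then flags every point against the robust line at q times the residual SD; all names and data are illustrative.

```python
import math
import statistics

def ols_fit(ages, values):
    """Ordinary least-squares fit: values ~ intercept + slope * age."""
    n = len(ages)
    mx, my = sum(ages) / n, sum(values) / n
    sxx = sum((a - mx) ** 2 for a in ages)
    slope = sum((a - mx) * (v - my) for a, v in zip(ages, values)) / sxx
    return my - slope * mx, slope

def robust_flags(ages, values, q=2.0, k=1):
    n = len(ages)
    intercept, slope = ols_fit(ages, values)
    mx = sum(ages) / n
    sxx = sum((a - mx) ** 2 for a in ages)
    resid = [v - (intercept + slope * a) for a, v in zip(ages, values)]
    mse = sum(r * r for r in resid) / (n - 2)
    # Stage 1: temporarily drop high-influence points (DFFITS screen).
    keep = []
    for i, (a, r) in enumerate(zip(ages, resid)):
        h = 1 / n + (a - mx) ** 2 / sxx
        dffits = (r / math.sqrt(mse * (1 - h))) * math.sqrt(h / (1 - h))
        if abs(dffits) <= 2 * math.sqrt(k / n):
            keep.append(i)
    # Stage 2: re-estimate the line from the retained points only.
    intercept, slope = ols_fit([ages[i] for i in keep],
                               [values[i] for i in keep])
    sd = statistics.stdev(
        [values[i] - (intercept + slope * ages[i]) for i in keep])
    # Flag EVERY point against the robust line at q * SD.
    return [abs(v - (intercept + slope * a)) > q * sd
            for a, v in zip(ages, values)]

ages = [1, 2, 3, 4, 5, 6, 7]
waz = [0.1, 0.2, 0.2, 3.5, 0.3, 0.4, 0.4]
print(robust_flags(ages, waz))
```

Because the implausible point is excluded before the final fit, the robust line tracks the clean measurements and the outlier is flagged decisively, without the quadratic cost of refitting once per datapoint.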
2. Number of datapoints for effective outlier detection using linear regression (LR) methodology

Using real data from a single patient's WAZ values, we performed a simulation to determine the number of datapoints from which a linear regression-based method becomes effective. The simulation uses a series of WAZ values from an arbitrarily selected patient with 27 measurements (ai, si) (i = 1…27), where ai is age and si is the WAZ value at age ai. A single simulation proceeds as follows:

1. For n ranging from 4 to 12, randomly select n datapoints from the 27 datapoints and place them in a subset, a list of n pairs of values (anj, snj) (j = 1…n).
2. For each sampled subset of size n, for j in 1…n:
   a. Replace snj with y ∈ [-6, 6] (incrementing by 0.1).
   b. Perform the OLS LR analysis to detect outliers as described in the main text.
   c. Plot the simulated value y on the graph according to its position in the series: if y is not flagged, it is plotted in blue; if it is flagged, in yellow.

The simulation was implemented in Python; the pseudo-code, written in Python format, is available in Figure S1. The simulation process was repeated four times (Figure S2).

Figure S2. Simulation results to assess the number of datapoints from which OLS linear regression becomes effective in discriminating outliers. The figure shows simulation results of four independent runs of the process: (i) given a real dataset of weight measurements for a UHS patient with 27 measurements, for each run, with n ranging from 4 to 12, a set of n datapoints is randomly selected from the given set.

(ii) Then, for each datapoint in the sampled set, the WAZ value at that point is replaced by a value ranging from -6 to 6 in increments of 0.1, and the OLS LR method is performed on these n datapoints. (iii) The replaced point is marked in blue if the LR method accepts it as 'Plausible', in yellow if 'Implausible', and the original datapoint is marked in red. The original datapoints are not necessarily indicators of true non-outliers, and merely reflect possible real-life data. For values of n from 4 to 6, the LR method exhibits a wide acceptance range; only with more datapoints is the acceptance range reduced, indicating discriminatory power for outlier detection where there are seven or more datapoints.
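The simulation loop can be condensed as below (the full pseudo-code is in Figure S1). The 27-point series here is synthetic, not the UHS patient record used in the study, and `accepted_count` is an illustrative helper that counts how many of the 121 substituted values a position accepts as plausible.

```python
import random
import statistics

def flag_outliers(ages, values, q=2.0):
    """OLS residual flagging at q times the residual SD."""
    n = len(ages)
    mx, my = sum(ages) / n, sum(values) / n
    sxx = sum((a - mx) ** 2 for a in ages)
    slope = sum((a - mx) * (v - my) for a, v in zip(ages, values)) / sxx
    intercept = my - slope * mx
    resid = [v - (intercept + slope * a) for a, v in zip(ages, values)]
    sd = statistics.stdev(resid)
    return [abs(r) > q * sd for r in resid]

def accepted_count(ages, waz, j):
    """How many substituted WAZ values in [-6, 6] (step 0.1) at
    position j the method accepts as plausible (max 121)."""
    accepted = 0
    for t in range(-60, 61):          # integer steps avoid float drift
        trial = list(waz)
        trial[j] = t / 10
        if not flag_outliers(ages, trial)[j]:
            accepted += 1
    return accepted

random.seed(0)
# Synthetic 27-point series: near-linear WAZ trend with small wobble.
series = [(a, 0.05 * a + 0.01 * ((-1) ** a)) for a in range(1, 28)]
for n in (4, 7, 12):
    sample = sorted(random.sample(series, n))
    ages = [a for a, _ in sample]
    waz = [w for _, w in sample]
    print(n, max(accepted_count(ages, waz, j) for j in range(n)))
```

With n = 4 every substituted value is accepted: for four OLS residuals summing to zero, no single residual can exceed twice their sample SD, so flagging is mathematically impossible, consistent with the wide acceptance bands in Figure S2 at small n.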

Parameter tuning and evaluation
Typically, individual datapoints exceeding q = 2 times the standard deviation of a series of measurements are nominally identified as outliers, corresponding to an outlier rate of 5% [7]. However, for voluminous datasets of childhood growth data, this parameter may be unnecessarily stringent and invoke a higher rate of manual data inspection than is necessary.
We tuned q to identify the optimal value that minimised the set of data requiring manual inspection while retaining a high probability of flagging truly erroneous measurements for scrutiny. This was achieved by varying q over a wide range and evaluating the sensitivity and specificity at each value. A truth set, or gold-standard dataset, was required to facilitate this parameter tuning.
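The sweep over q can be sketched as follows. The series and expert labels below are invented for illustration (in the study the truth set comprised clinician-reviewed records); `confusion` is an illustrative helper.

```python
import statistics

def flag_outliers(ages, values, q):
    """OLS residual flagging at q times the residual SD."""
    n = len(ages)
    mx, my = sum(ages) / n, sum(values) / n
    sxx = sum((a - mx) ** 2 for a in ages)
    slope = sum((a - mx) * (v - my) for a, v in zip(ages, values)) / sxx
    intercept = my - slope * mx
    resid = [v - (intercept + slope * a) for a, v in zip(ages, values)]
    sd = statistics.stdev(resid)
    return [abs(r) > q * sd for r in resid]

def confusion(flags, truth):
    """Counts against the expert labels: TP, FP, FN, TN."""
    tp = sum(f and t for f, t in zip(flags, truth))
    fp = sum(f and not t for f, t in zip(flags, truth))
    fn = sum(not f and t for f, t in zip(flags, truth))
    tn = sum(not f and not t for f, t in zip(flags, truth))
    return tp, fp, fn, tn

ages = [1, 2, 3, 4, 5, 6, 7, 8]
waz = [0.1, 0.2, 0.2, 3.5, 0.3, 0.4, -2.8, 0.5]
truth = [False, False, False, True,
         False, False, True, False]  # invented expert labels
for q in (1.5, 2.0, 2.5, 3.0):
    tp, fp, fn, tn = confusion(flag_outliers(ages, waz, q), truth)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    npv = tn / (tn + fn) if tn + fn else float("nan")
    print(q, sens, npv)
```

Smaller q catches more of the expert-labelled errors (higher sensitivity) at the cost of flagging more points for review; the tuning seeks the value where both sensitivity and NPV remain acceptable.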
As with all regression modelling, the method had greater robustness to detect outliers as the number of longitudinal measurements increased, and reliably identified outliers with ≥7 measurements (Supplementary Figure S2). The gold-standard subset was defined as that containing all data collected up to 1 July 2018 and included only patients with ≥7 measurements within WHO parameters for each of height and weight respectively. These data were reviewed manually by a clinician (JHD) to provide expert opinion on the clinical plausibility of recorded measurements. For all patients in this set, each height or weight measurement was classified as either 'plausible' or 'implausible' by the clinician, by visual inspection of the patient's growth chart and a scatter plot of HAZ or WAZ, and by additional height checks.
The gold-standard dataset was further restricted to patients with an SD of the LR residuals within the 99th percentile (Figure S3). Specificity and PPV values remain consistently high (>75% and >99% respectively) as q increases and so provide little discriminatory power between models. The tuning therefore relies on the assessment of sensitivity and NPV to inform the optimal compromise between sensitivity to detect outliers and the burden of time required for expert manual review of flagged cases (NPV).

Parameter tuning results
Vertical yellow lines represent the selected values of q that maintain a balance between sensitivity (>90%) and the manual curation workload (NPV). Figure S5 demonstrates the distributions of patients by age at first measurement and length of follow-up for height and weight data in the UHS EPR system for patients aged 2-20 years.