Introduction

Transient elastography (TE) (FibroScan®, Echosens, Paris, France) is a rapid (5–10 min), non-invasive point-of-care technique, which measures liver stiffness.1,2 Liver stiffness measurement (LSM) serves as a surrogate marker for the degree of liver fibrosis, reducing the need for invasive tests such as liver biopsy.1,2,3 In adults an LSM cut-off of >13.0 kPa has been shown to have excellent diagnostic accuracy for liver cirrhosis.1,2 Fraquelli et al.4 evaluated the precision of TE in over 200 adults with liver disease. They reported high intraclass correlation coefficients (ICCs) for those with a Metavir score F ≥ 3; however, the ICC of 0.6 between repeated measures for milder forms of liver disease was poor.4

Cut-off values for milder degrees of fibrosis (Metavir stage ≥F2) have not been well established and vary according to the underlying pathology.1,2 Furthermore, a number of studies in adults have reported inconsistencies in repeated TE measurements in the absence of any change in the underlying pathology.5,6,7 Discrepancies of 2–5 kPa have been described between paired measurements, which could result in misclassification of the stage of fibrosis.3,6 In addition, extrahepatic cholestasis has been shown to increase LSM irrespective of fibrosis stage.8

In children, TE is potentially an attractive non-invasive test to identify and monitor liver disease progression. The hallmark of any clinical investigation is that it is repeatable and reproducible over a range of clinical circumstances and conditions.9 Therefore, careful evaluation of TE in children is required to ensure that measurements not only have good diagnostic accuracy (sensitivity and specificity) but also have precision in terms of repeatability and reproducibility.

While the diagnostic accuracy of TE has been examined in healthy children10,11,12 and in children with a variety of liver diseases,13,14,15,16,17,18,19,20,21 few previous studies have examined the precision of TE in children.11,19 Goldschmidt et al.,11 using a subset of 28 children from a larger study population of healthy children, found that repeatability was good if TE measurements were performed in sequence using a marked point. However, repeatability was not satisfactory if examinations were carried out at different times by the same operator, or by different operators.

We planned to use TE to evaluate and monitor the progression of liver disease in a large cohort of children with cystic fibrosis (CF) as part of the national prospective study on the risk factors and outcome for Cystic Fibrosis Liver Disease (CFLD).22 Before using TE as a research tool, we first had to confirm the precision of TE by examining repeatability and reproducibility in healthy children.

Aims

The aims of this study were to determine

  (i) the normal range of liver TE measurements in children,

  (ii) the repeatability and reproducibility of TE in healthy children.

Methods

Research participants

Two sports clubs, under the auspices of the Gaelic Athletic Association, were approached and agreed to facilitate the study in healthy children. Eligible participants were informed of the study by sports club management and parents of interested volunteers contacted the research team who attended club facilities at agreed times. Following informed consent, children between 7 and 18 years of age were recruited over a 7-month period (March–September 2015).

Following consent and before performance of TE, each participant provided a short medical history, including medication usage and recent food intake, and had height and weight measured and body mass index (BMI) calculated.23 None of the 235 volunteers had a history of liver disease, or other gastrointestinal disease, and none were taking any regular medication. Children with a BMI greater than the 85th percentile were not included. Children under 7 years were not included because in our experience they find TE uncomfortable.

Operator training

Both operators were experienced clinicians with a broad range of clinical, research and teaching expertise in diagnostic imaging and nursing. Prior to utilising the FibroScan® device, the investigators (A.McG., J.D.) attended a small-group, 4-h training session provided by the manufacturer’s training consultant. Based on their demonstrated ability to set up and correctly utilise the portable FibroScan® device, they were then certified by the manufacturer as competent to perform TE.

During preliminary TE preparation, operators (J.D. and A.McG.) encountered challenges in obtaining repeatable LSMs. Therefore, a complete re-evaluation of the operators’ performance was undertaken in conjunction with the manufacturer’s training consultant. Sample time-motion mode (TM-mode) and amplitude-mode (A-mode) images were reviewed with the training consultant, seeking clarification and advice on improving quality. In order to avoid inadvertent inclusion of non-liver structures (right lung field, rib, vascular or biliary structures) within the region of interest, the operator must be able to clearly visualise speckle and line patterns associated with the TM- and A-mode images on the TE monitor.24 To achieve this, operator seating position and room lighting were reviewed to ensure that only high-quality TM- and A-mode images were obtained and used to measure liver stiffness. Specifically, we ensured (i) a correct seating position for the operator, so that the probe was positioned perpendicular to the skin surface in both planes while allowing optimal viewing of the TE monitor, and (ii) correct lighting, in accordance with American Association of Physicists in Medicine (AAPM) guidelines, with overhead room lights switched off during elastography scanning and a lamp used to provide dimmed room lighting conditions.25 In addition, the FibroScan® device was positioned to minimise reflections from windows or the lamp, which would have interfered with features in the displayed images.

TE measurements

Using a portable FibroScan® machine, J.D. and A.McG. performed TE measurements on each participant at rest, on two separate occasions, at least 24 h apart, but within 3 weeks of the initial examination. An immediate repeat examination would not have been appropriate because persistent and easily observable skin indentation at the probe site could bias the second measurement.

Prior to positioning the patient, each operator ensured that seating, lighting and monitor positioning were optimal, as outlined above. The participant was positioned in a comfortable supine position for the duration of the examination, with the right arm extended in maximal abduction with the hand behind the head so that there was easy access to position the probe perpendicular to the skin surface. The M probe of the FibroScan® machine was used according to the manufacturer’s recommendations. The right lobe of the liver was accessed through an intercostal space at the level of the intersection of the mid-axillary line and a line extended laterally from the xiphoid process. In practice, the transducer probe tip, covered with coupling gel, was positioned perpendicular to the skin surface in both axes, in the seventh right intercostal space, and moved or angled slightly anteriorly from the mid-axillary location, or moved superiorly or inferiorly to a higher or lower intercostal space in order to ensure optimal quality TM-mode images and A-mode graphs were generated, avoiding large intrahepatic vessels or other heterogeneous areas within the liver. When adequate images and graphs with no major artefacts/vessels were viewed in the TM- and A-modes, ten measurements were taken at the same location.

As per the manufacturer’s guidelines, ten readings of liver stiffness were performed on each participant. The machine independently calculated the median (M), interquartile range (IQR), IQR/median (IQR/M) ratio and the number of valid measurements. A high IQR/M ratio implies a large distribution of valid LSMs and thus a higher risk of aberrant LSM median values. LSM accuracy using FibroScan® has been shown to decrease when the IQR/M ratio increases, and measurements with an IQR/M >30% have lower accuracy, while measurements with an IQR/M <10% have the highest accuracy, particularly with increasing measures of liver stiffness.26 The manufacturer’s criteria for a valid and reliable TE examination are a success rate of >60% (more than six valid readings) and an IQR/M ratio <30%. Participants with a success rate of <60% or an IQR/M ratio of >30% were automatically classified by the machine as having a failed examination and were excluded from the study.
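The exclusion rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the device's internal logic: the `Examination` class, its field names and the `is_failed` function are hypothetical, the success rate is assumed to be valid readings divided by attempted readings, and examinations lying exactly on a threshold (60% or 30%) are assumed not to fail because the text does not specify the boundary case.

```python
from dataclasses import dataclass


@dataclass
class Examination:
    """Summary of one TE examination (hypothetical structure for illustration)."""
    valid_readings: int       # number of valid stiffness readings
    attempted_readings: int   # total number of readings attempted
    median_kpa: float         # median liver stiffness (M), in kPa
    iqr_kpa: float            # interquartile range of the valid readings, in kPa

    @property
    def success_rate(self) -> float:
        # Assumption: success rate = valid readings / attempted readings.
        return self.valid_readings / self.attempted_readings

    @property
    def iqr_over_median(self) -> float:
        return self.iqr_kpa / self.median_kpa


def is_failed(exam: Examination,
              min_success_rate: float = 0.60,
              max_iqr_over_median: float = 0.30) -> bool:
    """Return True if the examination fails either criterion.

    Assumption: values exactly on a threshold are not treated as failures.
    """
    return (exam.success_rate < min_success_rate
            or exam.iqr_over_median > max_iqr_over_median)
```

Passing 0.80 and 0.25 as the two thresholds gives an approximation of the stricter criteria used for the upper limit of normal, again subject to the same assumption about boundary values.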

To calculate the upper limit of normal (ULN), we included only those volunteers who had IQR/M ratios <25% on both TE measurements and a success rate of >80% in order to include only measurements with a high degree of accuracy and reliability. Only 8/235 (3.4%) participants had two measurements with an IQR/M ratio of ≤10%.

Data analysis

Data were analysed with MedCalc (https://www.medcalc.org; 2016) (MedCalc Software, Ostend, Belgium). Data are presented as means with standard deviations (SDs) for continuous variables and the distribution of the sample was assessed with the Shapiro–Wilk test. As suggested by Bland and Altman, repeat measurements of liver stiffness by the same or two different operators were considered as different methods of measurement in this study.9,27 To examine the variability in repeat measurements within the same subject, the SD of the difference between pairs of repeated measures was calculated.27 Bland and Altman plots were used to demonstrate visually the degree of agreement between two observers. The differences between any pair of LSM were plotted against the mean of the measurements, indicating how large the disagreement was.9,27 The repeatability coefficient was calculated according to the formula 1.96 × SD.28 The repeatability coefficient is the difference, in kPa, that will be exceeded by 5% of pairs of measures on the same subject.
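The limits-of-agreement calculation described above can be illustrated with a minimal Python sketch. It is not the MedCalc implementation used in this study: the function name `bland_altman` and the returned dictionary keys are hypothetical, and the sketch assumes the sample standard deviation (n − 1 denominator) of the paired differences and a multiplier of 1.96.

```python
import statistics


def bland_altman(first, second, z=1.96):
    """Summarise agreement between two paired series of measurements (kPa).

    Returns the mean difference, the SD of the differences, the 95% limits
    of agreement (mean difference +/- z * SD) and the repeatability
    coefficient (z * SD), following the formulae given in the text.
    """
    if len(first) != len(second):
        raise ValueError("both series must contain the same number of values")
    if len(first) < 2:
        raise ValueError("at least two pairs are needed to estimate the SD")

    differences = [a - b for a, b in zip(first, second)]
    mean_difference = statistics.mean(differences)
    sd_difference = statistics.stdev(differences)  # sample SD (n - 1)
    repeatability = z * sd_difference

    return {
        "mean_difference": mean_difference,
        "sd_difference": sd_difference,
        "lower_limit": mean_difference - repeatability,
        "upper_limit": mean_difference + repeatability,
        "repeatability_coefficient": repeatability,
    }
```

For example, `bland_altman(lsm_1, lsm_2)` would be called with two equal-length lists of median LSM values, one per examination, listed in the same participant order.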

Correlation between measurements was analysed using Lin’s concordance correlation coefficient (CCC).29 Lin’s CCC does not require the assumption of a normal distribution. The concordance coefficients were classified as poor (<0.90), moderate (0.90–0.95), substantial (0.95–0.99) and excellent (>0.99) (MedCalc Software, Ostend, Belgium 2016). The upper and lower limits of liver stiffness were calculated using the 97.5% quantile of the Student’s t-distribution with n − 1 degrees of freedom.
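For readers unfamiliar with Lin's CCC, the following Python sketch shows one common form of the estimator, 2·s_xy / (s_x² + s_y² + (mean_x − mean_y)²). It is illustrative only and is not the MedCalc routine used here: the function names are hypothetical, the variances and covariance are assumed to use the 1/n divisor of Lin's original definition, and the handling of values that fall exactly on a category boundary (0.90, 0.95, 0.99) is an assumption, since the classification above does not specify it.

```python
def lins_ccc(x, y):
    """Lin's concordance correlation coefficient for paired measurements."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least two values")

    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Assumption: 1/n divisors, as in Lin's original definition.
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        raise ValueError("CCC is undefined when all values are identical")
    return 2 * cov_xy / denominator


def classify_ccc(ccc):
    """Map a CCC value to the descriptive categories quoted in the text.

    Assumption: boundary values are assigned to the lower category.
    """
    if ccc < 0.90:
        return "poor"
    if ccc <= 0.95:
        return "moderate"
    if ccc <= 0.99:
        return "substantial"
    return "excellent"
```

Unlike Pearson's correlation, the (mean_x − mean_y)² term in the denominator penalises a systematic shift between the two series, so two sets of readings that rise and fall together but differ by a constant offset do not score as concordant.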

Results

Two hundred and fifty-seven healthy volunteer children were enrolled, and each had two TE examinations carried out at least 24 h apart. Data from 22 (8.6%) volunteers were classified as failed examinations because they did not meet the manufacturer’s guidelines (success rate <60% or IQR/M ratio >30%) for both TE examinations. Operator A conducted two examinations with 121 children, Operator B conducted two examinations with 71 children, while 43 children had one examination conducted by each operator.

The mean age of the 235 volunteers was 11.70 years (SD 2.51, range 7.01–17.12 years) and 107 (45.53%) were males. Girls were older (mean age 12.27 years, SD 2.53) than boys (mean age 11.02 years, SD 2.32, p < 0.001).

Normal range of liver stiffness in healthy children

The characteristics of TE measurements for Examination 1 and Examination 2 are outlined in Table 1. The mean LSM of Examination 1 (LSM 1) was 4.76 kPa, SD 0.85 kPa; and of Examination 2 (LSM 2) was 4.67 kPa, SD 0.74 kPa. Based on data from healthy children (n = 214) who had two TE measurements fulfilling the following criteria: (i) ≥80% success rate and (ii) an IQR/M ratio of ≤25%, the range of normal LSM values in healthy children was established and ranged from 2.88 to 6.52 kPa (Table 1). As shown previously by others, the ULN was higher in children over 12 years of age compared to those under 12 years.10 Gender did not significantly alter the ULN (Table 1).

Table 1 Mean (±SD) values and upper and lower limits of normal for LSM for the first and second examination in healthy volunteers (n = 235) according to age and gender.

Agreement in healthy volunteers

The mean difference between paired measurements for liver stiffness for all healthy participants was −0.044 kPa, SD 0.414 (p = NS, paired t test). Figure 1 demonstrates that while the distribution of the paired differences for LSM 1 and LSM 2 followed a normal distribution (Shapiro–Wilk test W = 0.99, p = 0.37), there is a wide standard deviation of the differences. The 95% limits of agreement for repeated TE measurements ranged from −0.85 to +0.76 kPa in 235 healthy volunteers (Fig. 2). There was a wide scatter of data points with 15/235 (6.4%) lying on or outside the 95% limits of agreement. There was a difference of ≥1 kPa between the two examinations in 61/235 (25.9%) participants. The repeatability coefficient was 0.811 kPa.
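As an arithmetic check, the limits of agreement and the repeatability coefficient can be recomputed from the reported mean difference and SD alone. The short Python sketch below does this; the function name `limits_from_summary` is hypothetical, and the multiplier of 1.96 is taken from the formula given in the Data analysis section.

```python
def limits_from_summary(mean_difference, sd_difference, z=1.96):
    """Return (lower limit, upper limit, repeatability coefficient) in kPa."""
    repeatability = z * sd_difference
    return (mean_difference - repeatability,
            mean_difference + repeatability,
            repeatability)


# Reported summary values for the 235 healthy volunteers.
lower, upper, repeatability = limits_from_summary(-0.044, 0.414)
print(f"{lower:.3f} to {upper:.3f} kPa; repeatability coefficient {repeatability:.3f} kPa")
```

The computed values agree with the reported limits of agreement and repeatability coefficient to within rounding of the last digit.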

Fig. 1: Histogram of Paired Differences in TE measurements for 235 healthy children.

Histogram of the mean difference between liver stiffness measurement 1 (LSM 1) and liver stiffness measurement 2 (LSM 2) in n = 235 healthy children demonstrating the wide standard deviation (0.41) of the mean difference between the paired measurements.

Fig. 2: Bland and Altman Plot of the agreement between two TE measurements in 235 healthy children.

Bland and Altman plot that demonstrates the 95% limits of agreement (−0.77 to +0.86 kPa), between two liver stiffness measurements in 235 healthy children.

When a single operator carried out both measurements (Operator A, n = 121; Operator B, n = 71), the difference between the upper and lower limits of agreement for Operator A was 1.51 kPa (−0.79 to +0.718; Supplemental Fig. S1), with a repeatability coefficient of 0.75 kPa, and for Operator B was 1.80 kPa (−0.87 to +0.93), with a repeatability coefficient of 0.89 kPa. The level of disagreement (1.66 kPa) was similar when two different operators carried out the initial and repeat measurement (n = 43; Supplementary Fig. S2), with a repeatability coefficient of 0.83 kPa.

Within-subject variability

We used Lin’s CCC to examine within-subject variability and found that the concordance between two measurements in the same individual was poor, 0.85 (95% confidence interval (CI): 0.82–0.88; n = 235). When the same operator performed both examinations, within-subject variability was also high with poor CCCs: Operator A, 0.86 (95% CI: 0.81–0.89); Operator B, 0.85 (95% CI: 0.75–0.91).

The effect of age on precision of TE

There was no difference in the mean and standard deviation of paired differences in LSM for children under 14 years of age (n = 190, mean difference = 0.045, SD 0.41) compared to children over 14 years of age (n = 45, mean difference = 0.038, SD 0.41, p = NS). Neither was there a difference in paired LSM measurements for children under 10 years of age (n = 64, mean difference = −0.004, SD 0.46) compared to children over 10 years of age (n = 171, mean difference = 0.06, SD 0.39, p = NS).

Discussion

TE is now widely used in the evaluation of children with liver disease.13,14,15,16,17,18,19,20 In this study, our findings on the normal range of TE values in healthy children are consistent with those reported in other studies.10,11,12 However, we found that TE lacks acceptable precision as a diagnostic instrument for use in children.

Poorly performing diagnostic tests can negatively impact on patient safety and waste scarce medical resources.30 The terminology to describe the performance of a clinical instrument can be confusing, and the literature on diagnostic accuracy studies abounds with a range of different terms to describe the two important facets of any diagnostic test or instrument.28,31,32 When a new instrument or test is evaluated, it is important to determine both its precision (that on average two or more measurements taken over a short period of time will be the same) and diagnostic accuracy (that the instrument can clearly distinguish those with disease from the healthy population).33 Measures of precision include reliability, repeatability, reproducibility or agreement, while test accuracy is reported as sensitivity and specificity or predictive value.33 The prevalence of the disease determines the accuracy of the test, and therefore diagnostic accuracy in children should not be inferred from adult studies.1,33 In the case of TE, there must be evidence demonstrating clinically acceptable levels of both precision and diagnostic accuracy, before TE is widely deployed as a non-invasive test to diagnose or monitor liver disease in children.1,9,27,30

In this study, we assessed the precision of TE in healthy children, that is, how variable repeated TE measurements are when made by the same observer or different observers on the same child. We also examined the repeatability of TE in different age groups. We assumed, when planning our study, that there would be negligible differences between two examinations a short time apart. However, we found that the precision of TE in children was poor. Variability occurred irrespective of the age of the child or whether the test was performed by the same operator or different operators.

Bland and Altman developed the limits of agreement method to examine the differences between measurements made by two methods or two observers.9,27 Central to the method is examining the mean and standard deviation of the distribution of paired differences.27 A wide standard deviation demonstrates lack of agreement between observers or tests.9,27 This study demonstrates that while the distribution of paired differences follows a normal distribution, the standard deviation is wide (Fig. 1), signifying that agreement between repeated TE measurements is inadequate.27 For 95% of healthy children, a given measurement of liver stiffness could lie anywhere between 0.85 kPa less than and 0.76 kPa greater than a second measurement taken within a short period of time (Fig. 2). Over one-third of healthy children had an actual difference of 0.8 kPa between the first and second measurement, while 61/235 (25.9%) had a difference of >1 kPa between measurements. While these differences appear inconsequential, they could result in a change in liver disease classification based on TE measurements without any change in the underlying pathology. Adult studies have also reported discrepancies in repeated TE measurements, which could result in a change of liver disease classification.3,5,7

In clinical practice measurement variation is inevitable, but the degree of variation that can be deemed acceptable is determined by what constitutes a clinically important difference between measurements. Initial investigative studies of diagnostic accuracy should include an evaluation of the precision of the test. However, as noted by Harris and Smith32 and by Watson and Petrie34 procedures to assess reliability and measure agreement are often overlooked. In the first study of TE in adults, Sandrin et al.35 provided precision data in only 15/91 (16%) participants. From the data presented, it is not possible to determine if TE had clinically acceptable levels of precision. Subsequently, Fraquelli et al.4 examined the repeatability of TE in adults with biopsy-proven liver disease. They reported that while there was good agreement between repeat examinations in those with cirrhosis, TE precision was poor in those with milder forms of liver disease.4

There are only two paediatric studies11,19 prior to ours examining the precision of TE in children, and both highlight the inherent difficulties of getting good agreement between repeated measures.11,19 Goldschmidt et al.11 examined the repeatability of TE in 28 of 504 (5.5%) children and reported that agreement was good if TE measurements were performed in sequence using a marked point. However, repeatability was not satisfactory if examiners conducted the TE examination independently, which is reflective of clinical practice. Nobili et al.,19 in a study of children with non-alcoholic fatty liver disease, reported that they used ultrasound guidance to determine optimal probe position and obtain good measures of interobserver agreement. However, even with ultrasound guidance the authors report a wide CI at the 90% level for the ICC (ICC 0.96, 90% CI: 0.92–0.97), indicating poor precision. Taken together with the data in this study, these findings suggest that the precision of TE in children is unacceptable as a non-invasive marker of liver disease.

We have considered the potential impact of a number of factors, including probe size, participant’s age and operator capacity, which may explain our results. LSM measurements are not comparable when different probe sizes are used,11,36 and therefore we used the M probe for all children in this study. Although the M probe was not recommended by the manufacturer for children under 14 years of age, it has been widely used in paediatric studies both before and after the development of the smaller paediatric probes.13,14,17,19,20,21,37 Goldschmidt et al.11 report that optimal evaluation of liver stiffness in children requires the use of the largest size probe that achieves a satisfactory LSM output. They report that 7/42 children (16.3%) aged 10.3–17.2 years would have been classified as “significant fibrosis” (cut-off >7.0 kPa) if the smaller S2 probe had been used.11 Given that over 90% of participants in this study had >80% success rate for both examinations with IQR/M of <25%, it is unlikely that probe size can explain the lack of precision reported in this study.

To further evaluate the impact of probe size, we examined the effect of age on the standard deviation of the paired differences in LSM and demonstrate that even in children over the age of 14 years, for whom the M probe is recommended by the manufacturer, the SD of the paired differences is wide supporting a lack of precision of TE regardless of age.

Operator capacity is unlikely to explain our results. The manufacturer certified the training of our operators. The range of normal TE values documented in children in this study is very similar to that reported in other paediatric studies.10,11,12 Our rate of failed examinations (success rate <60% or IQR/M >30%) was 8.6% (22/257), which is less than the failure rate reported by others.11,15 Furthermore, we achieved an 80% success rate for both examinations in 91% (214/235) of participants, which suggests that operator capacity was very good.

Children were not included in this study if they had a substantial meal within 2 h of the TE examination. However, strict fasting was not a requirement for participation. Food intake prior to TE is an evolving area and a 3-h fast is now a requirement for TE examinations based on data from a number of adult studies.38,39 The evidence for a prolonged fast in children is ambiguous. TE examinations carried out 30 min after lunch have been shown to increase LSM values in healthy children.11,12 In children with biopsy-proven liver disease, Lee et al.15 reported that there were no differences in LSM values between children who fasted overnight (n = 40) and those who did not fast. Metavir score was the only determinant of differences in LSM.15 While the requirement for a 3 h fast is now considered necessary, prolonged fasting needs careful evaluation on the accuracy and precision of TE in children.

This study has a number of weaknesses. It was not initially designed as a precision or diagnostic accuracy study of TE, but rather to demonstrate that TE measurements could monitor change in the liver disease status of children with CF. It includes only healthy volunteers and relies on clinical history as the gold standard for the absence of liver disease. Liver biopsy is the gold standard for the diagnosis of liver disease, but liver biopsy is not a clinically appropriate investigation in adults or children unless there is a high suspicion of liver disease.

It is recommended that studies of diagnostic accuracy should follow Standards for Reporting of Diagnostic Accuracy (STARD)40 reporting guidelines. While this study was not a diagnostic accuracy study of the sensitivity and specificity of TE against a gold standard, rather an examination of agreement between repeated measurements, we report our data in so far as is possible in line with the principles of transparency as outlined in STARD.

The strength of this study was the early identification of issues with repeated TE measurements, carefully considered approaches to training and optimisation of our protocols and the inclusion of over 200 healthy children to evaluate the repeatability of TE.

As part of a national prospective study on risk factors and outcome for CFLD, we also wanted to determine if TE had the precision required to facilitate the early diagnosis of CFLD and monitor disease progression. However, recruitment of participants with CF was stopped because of the lack of precision of TE demonstrated in healthy children.

At the time this study was stopped, we had enrolled 137 participants with CF, of whom 128 had a valid TE examination; 20 (15.6%) had liver disease (CFLD) with clinical or radiological evidence of portal hypertension, 43 (33.6%) had nonspecific changes on ultrasound or biochemical indices (Non-specific Cystic Fibrosis Liver Disease (NSCFLD)), while 65 (50.8%) had no evidence of liver disease (No Liver Disease (NoLD)) as outlined in Supplemental Data (online).

Sixty-six participants with CF had two examinations on separate days, of whom 11 had CFLD, 26 had NSCFLD and 29 had NoLD. The mean difference between paired measurements for the 66 participants with CF was −0.41, SD 3.22 kPa (paired t test NS). Supplemental Figure S3 (online) shows that the paired differences were widely dispersed and not normally distributed. Using a Bland and Altman plot, the limits of agreement between the two measurements of liver stiffness ranged from −5.0 to 6.1 kPa (Supplementary Fig. S4 (online)). In those with CFLD (n = 11), the mean difference between paired measurements was 0.02 kPa with an SD of 5.05 and the limits of agreement ranged from −9.9 to 9.9 kPa (Supplemental Fig. S5 (online)). The repeatability coefficient for those with CFLD was 9.8 kPa. Concordance between two measures was poor, regardless of the CF liver disease status of participants. In those with NoLD, the CCC was 0.86 (95% CI: 0.72–0.93), while in those with CFLD, it was 0.82 (95% CI: 0.48–0.95).

The disagreement demonstrated between two TE measurements over a short time in participants with or without liver disease could change the classification of their disease without any change in the underlying pathology. These findings on the lack of precision of TE in children with liver disease are consistent with our findings in healthy children that we report in this study.

Conclusion

Adequate evaluation of any new technology is essential to ensure that appropriate clinical decisions lead to optimal recommendations for treatment.30 This study demonstrates that TE does not have acceptable precision in children, because random measurement variation results in the lack of agreement between paired examinations. Further development is required to optimise the precision of TE in children in order to ensure patient safety, appropriate clinical decision-making and the optimal use of healthcare resources.