Determining the reliability of liver biopsies in NASH clinical studies

Liver biopsy sample evaluation is an essential part of clinical studies in nonalcoholic steatohepatitis (NASH) and is key to excluding confounding morbidities. Current scoring systems, which are decisive for study inclusion, rely on imprecisely defined histological features, leading to high observer variability in disease categorization. In this News & Views, measures to overcome these limitations are discussed.

Refers to Davison, B. A. et al. Suboptimal reliability of liver biopsy evaluation has implications for randomized clinical trials. J. Hepatol. (2020).

In a new study, Davison et al. evaluated the observer variability of three liver pathologists reading digitized slides from 678 liver biopsy samples obtained from 339 patients during the EMMINENCE study, which examined an insulin sensitizer (MSDC-0602K) for the treatment of nonalcoholic steatohepatitis (NASH)1,2. The rationale for this analysis was the observation that MSDC-0602K improved insulin sensitivity and liver injury markers but failed to demonstrate statistically significant effects on the histological end points (for example, improvement of inflammation without worsening of fibrosis, NASH resolution and fibrosis improvement)1, which are required by both the United States Food and Drug Administration and the European Medicines Agency for approval of NASH drugs. A single liver pathologist performed the qualifying reads for screening of all patients included in the EMMINENCE study and re-evaluated the slides after receiving them coded and randomly mixed with the follow-up biopsy samples. Two other liver pathologists evaluated the 678 digitized slides for the current study only, and all three used the NASH Clinical Research Network (CRN) histological scoring system3. Kappa statistics were used to evaluate the inter-observer variability. Although there was substantial agreement (kappa 0.61–0.80) between the liver pathologists for steatosis, the agreement was moderate (kappa 0.41–0.60) for fibrosis and hepatocyte ballooning and fair (kappa 0.21–0.40) for lobular inflammation. Similarly, unweighted kappa values were moderate for diagnosis of NASH and fair for both NASH resolution without worsening fibrosis and fibrosis improvement without worsening NASH. Notably, 46% of the patients included in the EMMINENCE study were considered by at least one of the three liver pathologists as not meeting the study’s histological inclusion criteria.
On the basis of statistical simulations, the researchers reported that the lack of reliability of end points and inclusion criteria can drastically reduce study power and concluded that liver pathologists’ evaluation of liver biopsy samples might introduce a study bias that attenuates apparent treatment effects.
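To make the agreement bands quoted above concrete, the following is a minimal sketch of how an unweighted Cohen's kappa for two readers could be computed and labelled. The ballooning scores are invented for illustration and are not data from Davison et al.; the function names are hypothetical; and the labels outside the fair, moderate and substantial bands follow the common Landis and Koch convention, which is an assumption here. Only the pairwise case is shown, as the way agreement was aggregated across three readers is not described in this article.

```python
from collections import Counter


def cohen_kappa(reader_a, reader_b):
    """Unweighted Cohen's kappa for two readers scoring the same samples."""
    if len(reader_a) != len(reader_b) or len(reader_a) == 0:
        raise ValueError("both readers must score the same non-empty set of samples")
    n = len(reader_a)
    # Proportion of samples on which the two readers assign the same category.
    observed = sum(a == b for a, b in zip(reader_a, reader_b)) / n
    # Agreement expected by chance, from each reader's category frequencies.
    freq_a, freq_b = Counter(reader_a), Counter(reader_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both readers used one identical category throughout
    return (observed - expected) / (1.0 - expected)


def agreement_band(kappa):
    """Verbal label for a kappa value (Landis and Koch convention, assumed)."""
    if kappa > 0.80:
        return "almost perfect"
    if kappa > 0.60:
        return "substantial"
    if kappa > 0.40:
        return "moderate"
    if kappa > 0.20:
        return "fair"
    if kappa >= 0.0:
        return "slight"
    return "poor"


# Hypothetical ballooning grades (0-2) from two readers for ten biopsy samples.
reader_a = [0, 1, 1, 2, 2, 0, 1, 2, 1, 0]
reader_b = [0, 1, 2, 2, 1, 0, 1, 2, 0, 0]

kappa = cohen_kappa(reader_a, reader_b)
band = agreement_band(kappa)
print(f"kappa = {kappa:.2f} ({band})")
```

With these invented scores the readers agree on 7 of 10 samples, chance agreement is 0.33, and kappa is about 0.55, which falls in the moderate band even though raw agreement is 70%.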

That liver biopsy is subject to sampling variability even in diffuse liver disease, and to parameter-dependent intra-observer and inter-observer variability, has been known for decades (for example, kappa values were between 0.04 and 0.96 for paired biopsies in the METAVIR trial evaluating the reproducibility of liver biopsy interpretation in chronic hepatitis C in 1994)4. In fact, the reproducibility of categorical scoring is always limited and is determined by the precision of the feature definitions. This is also true for the NASH CRN scoring system5: for instance, macrovesicular steatosis is graded on the basis of the percentage of steatotic hepatocytes, but there is more than one definition of macrovesicular steatosis. Similarly, the grading of hepatocyte ballooning is vaguely defined (few versus many definitely ballooned cells), and markers that might result in a more objective evaluation (for example, keratin 18 immunohistochemistry) are not included6. Even the grading of lobular inflammation is open to interpretation, as an inflammatory focus is not defined and there is no guidance on how the overall assessment of all inflammatory foci should be performed (for example, calculating the average number of inflammatory foci per microscopic medium-power field (200x)). Fibrosis staging is of particular interest, as it reflects the duration and, to some extent, the reversibility of the driving liver disease. Importantly, histopathological fibrosis scoring provides ordinal values but is far from being quantitative, as the fibrosis stages are defined on the basis of qualitative features (for example, the location of fibrosis)3. In addition, the NASH CRN system has too few categories to reliably detect small differences induced by less-effective drugs.
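The ambiguity in aggregating inflammatory foci can be illustrated with a short sketch. It assumes the NASH CRN cut-offs for lobular inflammation (0: no foci; 1: fewer than 2 foci per 200x field; 2: 2 to 4 foci; 3: more than 4 foci) and compares two possible aggregation rules, the mean across fields and the most affected field. Neither rule is prescribed by the scoring system; the field counts and the function name are hypothetical.

```python
from statistics import mean


def lobular_inflammation_grade(foci_per_field):
    """Map a foci-per-200x-field value to a 0-3 grade (assumed NASH CRN cut-offs)."""
    if foci_per_field <= 0:
        return 0
    if foci_per_field < 2:
        return 1
    if foci_per_field <= 4:
        return 2
    return 3


# Hypothetical counts of inflammatory foci in six 200x fields of one biopsy sample.
fields = [0, 1, 0, 5, 1, 2]

# Rule A (assumption): grade the average number of foci per field.
grade_by_mean = lobular_inflammation_grade(mean(fields))
# Rule B (assumption): grade the most affected field.
grade_by_max = lobular_inflammation_grade(max(fields))

print(f"mean rule: grade {grade_by_mean}, worst-field rule: grade {grade_by_max}")
```

For this invented sample the averaging rule yields grade 1 (mean of 1.5 foci per field) and the worst-field rule yields grade 3, so two readers applying different but equally defensible rules would disagree by two categories.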

The length and width of a biopsy sample are other quality criteria that substantially affect the reliability of its evaluation. Both can vary within a given study; the liver pathologist can assess them but cannot control them. In fact, a substantial number of liver biopsy samples obtained for clinical studies are split and shared with independent research activities running alongside the study. Consequently, the input material for histopathological evaluation might be suboptimal, which can have a large effect on the representativeness of the liver biopsy sample and on the reliability of its analysis7. Unfortunately, detailed information on the length of the biopsy samples and the number of portal tracts available for study is not available for the EMMINENCE and the Davison et al. studies, and the effect of these criteria on the reliability of liver biopsy evaluation was not analysed1,2.

Despite these limitations, liver biopsy remains the gold standard for the study of liver disease. It is important for the exclusion of alternative diseases and co-morbidities such as autoimmune hepatitis or primary biliary cholangitis and provides unprecedented insights into the pathogenesis of the disease. In fact, the decision to not obtain liver biopsy samples from patients with hepatocellular carcinoma, as previously recommended by the clinical practice guidelines of both the American Association for the Study of Liver Diseases and the European Association for the Study of the Liver, resulted in a lost decade for the approval of targeted drugs, owing to a remarkably high number of failed late-stage clinical trials8,9.

A study relying on a single pathologist has a higher chance of improperly evaluated entry criteria and misclassification. Thus, it is recommended that liver biopsy samples in study settings are evaluated by at least two independent expert pathologists using precisely defined and agreed criteria for scoring each histological feature, which have been prudently selected during study design. In case of discrepant qualifying reads, a final consensus should be reached before patients are included in the trial. As fibrosis staging is qualitative, morphometric analyses should be considered as an addendum during study design, as they will enable quantitative assessment of biopsy features (for example, collagen fibre deposition).

The absence or presence of a single ballooned hepatocyte might lead to the inclusion or exclusion of a patient in a study. It should therefore be intensively discussed whether a categorical classification (such as the NASH CRN and other current scoring systems) is optimal in NASH studies, or whether it might better be replaced by the grade of disease activity and the stage of fibrosis to provide a better description of the disease spectrum. In addition, it seems reasonable to involve expert pathologists when designing clinical studies, especially as the success of such trials relies on the histological data.

There are some considerations to highlight with respect to this study. First, the efficacy of therapy in most patients with NASH could be substantially influenced by physical activity, healthier nutrition and weight loss, as lifestyle modifications have proven to be effective in NASH10. It remains unclear whether a structured self-evaluation of the patients was performed during the EMMINENCE study and whether these confounding parameters (such as physical activity) were recorded at the end of the study to correct for this important bias. In this context, it might also be worth considering whether a placebo response rate of 10% (as was used in the simulations by Davison et al.) is adequate for the power calculation of a study that is prone to a bias induced by behavioural change2. Furthermore, none of the pathologists generating the data for this study was listed as a co-author, and the primary data used are neither shown nor deposited and, therefore, not available for external validation. In addition, in our opinion, when publishing papers that could have an effect on clinical trial design and practices, the conflicts of interest of all stakeholders, including authors, handling editors and peer-reviewers, are an important consideration.
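How sensitive a power calculation is to the assumed placebo response rate, and to unreliable end point reads, can be illustrated with a simple analytic sketch. It is not the simulation method used by Davison et al.; it substitutes a normal approximation for a two-sided two-proportion test with equal arms. Apart from the 10% placebo response rate mentioned above, every number (treatment response, arm size, and the sensitivity and specificity of the histological read) is a hypothetical assumption, as are the function names.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def observed_rate(true_rate, sensitivity, specificity):
    """Response rate as read, given imperfect classification of responders."""
    return sensitivity * true_rate + (1.0 - specificity) * (1.0 - true_rate)


def approx_power(p_placebo, p_treatment, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test with equal arms."""
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    p_bar = (p_placebo + p_treatment) / 2.0
    # Standard error of the difference under the null and the alternative.
    se_null = sqrt(2.0 * p_bar * (1.0 - p_bar) / n_per_arm)
    se_alt = sqrt(
        (p_placebo * (1.0 - p_placebo) + p_treatment * (1.0 - p_treatment)) / n_per_arm
    )
    return _STD_NORMAL.cdf((abs(p_treatment - p_placebo) - z_crit * se_null) / se_alt)


N_PER_ARM = 100          # hypothetical arm size
ABSOLUTE_EFFECT = 0.20   # hypothetical absolute gain over placebo

# Scenario 1: 10% placebo response, end point read without error.
power_ideal = approx_power(0.10, 0.10 + ABSOLUTE_EFFECT, N_PER_ARM)

# Scenario 2: placebo response doubled (for example, through lifestyle change).
power_higher_placebo = approx_power(0.20, 0.20 + ABSOLUTE_EFFECT, N_PER_ARM)

# Scenario 3: 10% placebo response, but the read has an assumed
# sensitivity of 0.8 and specificity of 0.9 for detecting a response.
power_misread = approx_power(
    observed_rate(0.10, 0.8, 0.9),
    observed_rate(0.10 + ABSOLUTE_EFFECT, 0.8, 0.9),
    N_PER_ARM,
)

print(f"{power_ideal:.2f} {power_higher_placebo:.2f} {power_misread:.2f}")
```

Under these assumptions the approximate power is about 0.95 in the ideal case, about 0.88 when the placebo response is 20% instead of 10%, and about 0.64 when the end point is misread, because non-differential misclassification shrinks the observed difference between the arms from 0.20 to 0.14.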

In summary, the topic of liver biopsy in clinical studies, such as for treatments for NASH, is highly important. For the reasons discussed, the design of NASH trials needs attention, and we hope that our suggestions (Box 1) are helpful in the realization of future clinical studies.


1. Harrison, S. A. et al. Insulin sensitizer MSDC-0602K in non-alcoholic steatohepatitis: a randomized, double-blind, placebo-controlled phase IIb study. J. Hepatol. 72, 613–626 (2020).

2. Davison, B. A. et al. Suboptimal reliability of liver biopsy evaluation has implications for randomized clinical trials. J. Hepatol. (2020).

3. Kleiner, D. E. et al. Design and validation of a histological scoring system for nonalcoholic fatty liver disease. Hepatology 41, 1313–1321 (2005).

4. The French METAVIR Cooperative Study Group. Intraobserver and interobserver variations in liver biopsy interpretation in patients with chronic hepatitis C. Hepatology 20, 15–20 (1994).

5. Bedossa, P. et al. Histopathological algorithm and scoring system for evaluation of liver lesions in morbidly obese patients. Hepatology 56, 1751–1759 (2012).

6. Caldwell, S. et al. Hepatocellular ballooning in NASH. J. Hepatol. 53, 719–723 (2010).

7. Crawford, A. R., Lin, X. Z. & Crawford, J. M. The normal adult human liver biopsy: a quantitative reference standard. Hepatology 28, 323–331 (1998).

8. European Association for the Study of the Liver. EASL Clinical Practice Guidelines: management of hepatocellular carcinoma. J. Hepatol. 69, 182–236 (2018).

9. Bruix, J. & Sherman, M. Management of hepatocellular carcinoma: an update. Hepatology 53, 1020–1022 (2011).

10. Hallsworth, K. & Adams, L. A. Lifestyle modification in NAFLD/NASH: facts and figures. JHEP Rep. 1, 468–479 (2019).


The authors thank the members of the international liver pathology group the “Gnomes” for providing their invaluable input before drafting this article.

Author information



Corresponding author

Correspondence to Peter Schirmacher.

Ethics declarations

Competing interests

P.S. and T.L. are involved in centralized liver biopsy evaluation in clinical studies (Dr. Falk Pharma GmbH).

Cite this article

Longerich, T., Schirmacher, P. Determining the reliability of liver biopsies in NASH clinical studies. Nat Rev Gastroenterol Hepatol 17, 653–654 (2020).
