Dear Editor,

We carefully read the influential article by Pavageau and colleagues published in Pediatric Research in 2019.1 The purpose of that study was to determine the reliability of the modified Sarnat neurologic examination in late and moderately preterm neonates born at 32–36 weeks’ gestational age. Kappa values were used to evaluate agreement between examiners.1 According to the results, the reliability of the neurologic examination between the gold standard (GS) study investigator and groups of attending neonatologists was good to excellent (k > 0.72) in most categories, except for Moro and tone. Agreement for the tone and Moro categories was poor to fair in infants born at 32–34 weeks’ gestation (k = 0.20–0.60), whereas it was perfect at 35–36 weeks’ gestation (k = 1.0). However, when the GS examiner was compared to groups of attending examiners, the agreement in the Moro and tone categories was fair (k = 0.46).1

Reliability analysis can be performed in different ways depending on the type of variable. One option is the kappa coefficient, which is used to assess agreement for qualitative variables. However, kappa can give misleading results in particular circumstances. First, the kappa value depends on the prevalence in each category, so a difference in prevalence can change it significantly. Second, kappa is problematic when there are more than two categories.2,3,4,5,6 Finally, the last critical situation occurs when the marginal distributions of the raters' responses differ.2,3,4,5,6 In such situations, we strongly recommend using weighted kappa. Table 1 illustrates these circumstances with a hypothetical example and shows how much kappa can change (0.44, interpreted as moderate, versus 0.80, interpreted as very good) across different prevalence rates and numbers of categories.3,4,5,6

Table 1 Kappa and weighted kappa values for assessing reliability between two examiners with more than two categories, showing their dependence on prevalence.
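The dependence of kappa on prevalence and on the weighting scheme can be sketched numerically. The following Python example is only an illustration under stated assumptions: the function name `kappa` and the three contingency tables are hypothetical, are not taken from Table 1 or from the data of Pavageau and colleagues, and are not meant to reproduce the values 0.44 and 0.80. It computes kappa as one minus the ratio of observed to chance-expected disagreement, with optional linear or quadratic weights for ordered categories.

```python
from typing import List, Optional

Table = List[List[int]]  # square table: rows = examiner A, columns = examiner B


def kappa(table: Table, weighting: Optional[str] = None) -> float:
    """Cohen's kappa (weighting=None) or weighted kappa ("linear"/"quadratic").

    Computed as 1 - observed disagreement / chance-expected disagreement.
    Raises ZeroDivisionError if the expected disagreement is zero
    (for example, when both examiners use a single category only).
    """
    if weighting not in (None, "linear", "quadratic"):
        raise ValueError("weighting must be None, 'linear' or 'quadratic'")

    k = len(table)
    n = sum(sum(row) for row in table)
    row_marg = [sum(row) / n for row in table]
    col_marg = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    def disagreement(i: int, j: int) -> float:
        if weighting is None:
            return 0.0 if i == j else 1.0  # every disagreement counts fully
        d = abs(i - j) / (k - 1)           # distance between ordered categories
        return d if weighting == "linear" else d * d

    cells = [(i, j) for i in range(k) for j in range(k)]
    observed = sum(disagreement(i, j) * table[i][j] / n for i, j in cells)
    expected = sum(disagreement(i, j) * row_marg[i] * col_marg[j] for i, j in cells)
    return 1.0 - observed / expected


# Hypothetical 2x2 tables: both show 80% raw agreement, but prevalence differs.
balanced = [[40, 10], [10, 40]]
skewed = [[75, 10], [10, 5]]

# Hypothetical 3-category ordinal table (all disagreements are one step apart).
three_cat = [[20, 5, 0], [5, 20, 5], [0, 5, 20]]

print("balanced prevalence:", round(kappa(balanced), 3))
print("skewed prevalence:  ", round(kappa(skewed), 3))
print("3 categories, unweighted:", round(kappa(three_cat), 3))
print("3 categories, linear:    ", round(kappa(three_cat, "linear"), 3))
```

In this sketch the two 2x2 tables have identical raw agreement, yet the table with skewed prevalence yields a much lower kappa. In the three-category table, the linearly weighted kappa is higher than the unweighted one because near-misses between adjacent categories are penalized less than they are under unweighted kappa.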

The authors concluded that the modified Sarnat examination had strong reliability in preterm infants, with the exception of Moro and tone. However, the experience of the examiners can influence these results and can improve reliability in tone and Moro agreement after 35 weeks. Eventually, their institution adopted two goals: first, to provide education that targets the assessment of tone in preterm infants, and second, to test a new neurological examination form that omits the Moro and provides details for evaluating tone, adapted from the Dubowitz and Hammersmith infant neurological examinations.1 Such a conclusion may have resulted from inappropriate use of the statistical test, which can ultimately lead to a misleading message. In this letter, we have pointed out the disadvantages of using kappa to assess agreement.