We thank Maleki and Naderi1 for their correspondence letter in the March issue of Pediatric Research, which highlights the optimal methodological use of the kappa statistic, in response to “Interrater reliability of the modified Sarnat examination in preterm infants”.2

The key findings in late preterm infants of 33−36 weeks’ gestation, evaluated with the modified Sarnat exam in the first 6 h after birth, were the following: (1) the reliability between the gold-standard study investigator and the group of attending neonatologists was good to excellent (κ > 0.72) in most categories, with the exception of Moro and tone; (2) while agreement was poor for both the tone and Moro categories in infants born at 32–34 weeks’ gestation (κ = 0.20–0.60), it improved dramatically at 35−36 weeks, suggesting an important maturation effect.

Unweighted and weighted kappa are both widely used to measure the degree of agreement between two independent raters. While we agree with the general value of weighted kappa statistics and the importance of monitoring partial agreement, we purposefully selected the conservative unweighted kappa statistic for this particular clinical situation. The decision rests on the clinical need for complete agreement on what constitutes moderate−severe encephalopathy. In such a situation, a near miss is not acceptable: hypothermia is either categorically initiated or it is not. This dictated a statistical approach that credits only complete agreement between the gold-standard investigator and the other examiners.

As a review, Cohen first introduced unweighted kappa in 1960,3 as a chance-corrected index of agreement for categorical variables, with κ = 1 representing perfect agreement between two raters. Weighted kappa4 was subsequently introduced in 1968 to measure agreement between two raters while allowing for scaled disagreement and partial credit. Whereas unweighted kappa does not distinguish among degrees of disagreement, weighted kappa incorporates the magnitude of each disagreement and provides partial credit when agreement is not complete.5 The usual approach is to assign a weight to each disagreement pair. While linear and quadratic weights are common in the literature and available in statistical packages, other weights can be used depending upon the impact of the disagreement.
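In standard notation, the unweighted statistic is

\[ \kappa = \frac{p_o - p_e}{1 - p_e}, \]

where \(p_o\) is the observed proportion of agreement and \(p_e\) the proportion expected by chance. With disagreement weights \(w_{ij}\) assigned over \(k\) ordered categories, the weighted statistic is

\[ \kappa_w = 1 - \frac{\sum_{i,j} w_{ij}\, p_{ij}}{\sum_{i,j} w_{ij}\, e_{ij}}, \]

where \(p_{ij}\) and \(e_{ij}\) are the observed and chance-expected proportions in cell \((i,j)\). Linear weights take \(w_{ij} = |i-j|/(k-1)\) and quadratic weights \(w_{ij} = (i-j)^2/(k-1)^2\); unweighted kappa corresponds to \(w_{ij} = 1\) for every \(i \neq j\).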

In response to the correspondence, we have now also computed weighted kappa statistics with both linear and quadratic weights. As expected, the kappa statistics improved when moving from unweighted to linearly and quadratically weighted estimates, for both tone (0.46, 0.48, and 0.54, respectively) and Moro (0.51, 0.63, and 0.73, respectively). The conclusions and overall message of the study remain unchanged: the strong interrater reliability does not extend to tone and Moro, which are significantly influenced by gestational maturity and examiner experience.
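For illustration only, such statistics can be computed directly in common statistical software; a minimal Python sketch using scikit-learn’s cohen_kappa_score, with hypothetical rater scores rather than data from the study, is shown below.

```python
# Minimal sketch, for illustration only: unweighted, linearly weighted, and
# quadratically weighted Cohen's kappa via scikit-learn. The scores below are
# hypothetical examples, not data from the study.
from sklearn.metrics import cohen_kappa_score

# Hypothetical ordinal scores for one exam category (0 = normal, 1 = moderate, 2 = severe)
gold_standard = [0, 1, 2, 1, 0, 2, 1, 1, 0, 2]
neonatologist = [0, 2, 2, 1, 1, 2, 0, 1, 0, 1]

print("unweighted:", cohen_kappa_score(gold_standard, neonatologist))
print("linear:   ", cohen_kappa_score(gold_standard, neonatologist, weights="linear"))
print("quadratic:", cohen_kappa_score(gold_standard, neonatologist, weights="quadratic"))
```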

In conclusion, we emphasize that rigor and the correct statistical approach are essential for any scientific publication. This correspondence highlights the need for a collaborative effort from both clinical and statistical perspectives, with the clinical context dictating the best statistical approach. For the modified Sarnat exam, the clinically relevant measure is complete agreement, because it determines the decision to initiate urgent hypothermia therapy.