To the Editor:

In many fields, including the study of genetic variation, prediction methods are essential for interpreting experimental data, and it is important to present their performance in a systematic way. Recently, Kumar et al.1 published a Correspondence about the use of evolutionary information to predict the consequences of amino acid substitutions. The authors claimed that machine-learning classifiers would benefit from training separately at different amino acid conservation levels in order to better predict harmful protein variants.

The approach might be useful, but it is difficult to judge because its performance is reported in a defective and partly misleading way. Several measures are needed to fully capture method performance2,3. In the Correspondence1 some of those measures were used, but a number of important details were omitted. The greatest problem relates to the use of the Matthews correlation coefficient (MCC), one of the most widely used measures for binary predictor performance. The MCC is based on true positive (TP), true negative (TN), false positive (FP) and false negative (FN) values in a contingency table, with the accepted definition expressed as:

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
In contrast, Kumar et al.1 used ratios of the four values in their formulation. They also converted the incorrectly calculated MCC values to percentages, but only for the positive half of the values, thereby not considering their full range from −1 (perfect disagreement) to 1 (perfect agreement). The correct values are listed in Table 1 and affect the conclusions of the work in ref. 1. When the results are combined for the conservation classes ('total'; Table 1), it is evident that EvoD is overall the poorest of the tested methods.
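As an illustration only (the counts below are invented and are not taken from ref. 1 or Table 1), the accepted definition can be computed directly from the raw contingency-table counts rather than from ratios; a minimal Python sketch:

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from raw contingency-table counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # When any marginal sum is zero, the MCC is commonly defined as 0.
    return numerator / denominator if denominator else 0.0

# Invented counts for an imbalanced dataset: accuracy looks reasonable (~67%),
# yet the MCC (~0.14) reveals poor agreement; the coefficient would approach -1
# for a predictor that systematically disagrees with the true labels.
print(mcc(tp=90, tn=10, fp=40, fn=10))
```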

Table 1 Corrected MCC values

The use of erroneous and misleading performance parameters prevents readers from forming an accurate picture of a method's qualities. Evaluation of machine-learning methods has three prerequisites2: (i) there have to be sufficient numbers of known positive and negative cases available, for example, in the VariBench database for variation benchmark datasets4; (ii) proper measures have to be used for method assessment, and the class imbalance (difference in the number of positive and negative cases), if present, needs to be corrected; and (iii) training and test datasets should be disjoint.
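To make prerequisites (ii) and (iii) concrete, the sketch below (using hypothetical identifier sets and labels, not data from refs. 1, 2 or 4) shows how class imbalance can be reported and train/test disjointness verified before any performance figures are quoted:

```python
def check_evaluation_setup(train_ids: set[str], test_ids: set[str],
                           test_labels: list[int]) -> None:
    """Basic sanity checks for prerequisites (ii) and (iii)."""
    # (iii) Training and test datasets should be disjoint.
    overlap = train_ids & test_ids
    if overlap:
        raise ValueError(f"{len(overlap)} cases occur in both training and test sets")

    # (ii) Report class imbalance so it can be corrected for in the chosen measures.
    positives = sum(1 for label in test_labels if label == 1)
    negatives = len(test_labels) - positives
    print(f"Test set: {positives} positive vs. {negatives} negative cases "
          f"(ratio {positives / max(negatives, 1):.2f})")
```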

Kumar et al.1 did not address class imbalance and did not report whether the data used to train their EvoD method were also used for testing. Thus, the performance figures they cite may indicate how well EvoD learned its training data rather than how well it will perform on independent test data. Condel and PolyPhen2 were trained with the same cases that are now used to test performance. In their analysis, the authors also did not include methods that have been shown in a systematic comparison to have superior performance5.

Sequence conservation is known to be an important feature for variation predictors. Contrary to the conclusion of the Correspondence1, the results in Table 1 show that all three tested methods predict variations at ultraconserved and less conserved sites considerably less reliably than those at well-conserved sites.