Sir,

We read with great interest the work published by Shin et al (2015), which highlights the potential relevance of circulating cell-free miRNAs as biomarkers for the detection of triple-negative breast cancer (TNBC). Importantly, the authors identified three miRNAs (miR-16, miR-21 and miR-199-5p) as potential diagnostic biomarkers for TNBC. This information is of interest because the identification of miRNA signatures for TNBC, as well as for other types of cancer (Calin and Croce, 2006), is of increasing relevance. However, we found some issues that merit discussion. The authors’ conclusions seem to be based only on the results of a univariate analysis performed for each of the above-mentioned miRNAs. Specifically, they performed a receiver operating characteristic (ROC) curve analysis to assess the ability of each miRNA to discriminate TNBC patients from healthy controls. The results showed considerable discriminatory performance for each of the three miRNAs. Although the authors stated in the statistical analysis section that a ‘Multivariate logistic regression model was established and leave one-out cross validation to find the best logistic model’, no multivariate results were provided. The lack of an assessment of the more interesting question, namely the diagnostic accuracy achievable by combining the three miRNAs in a composite score, is a relevant drawback of the paper. This topic, which represents one of the most critical steps in developing a miRNA-based signature in cancer research, raises methodological considerations directly related to the theory of multivariate regression models (Harrell, 2001). Multivariate regression models that allow the simultaneous association of miRNAs and other predictors with a clinical outcome, such as logistic regression for the presence/absence of disease, are common building blocks of biomarker-based risk prediction tools.
It should be considered that in such a scenario the number of observations is generally not an order of magnitude greater than the number of variables. Results from multivariate regression models may thus be affected by the small number of events per variable (Verderio, 2012). As a consequence, the model may produce an overoptimistic estimate of the combined area under the curve (AUC) on the original data but fail when applied to an independent data set (Verderio et al, 2010). In addition, to improve prediction and generalisation to new data, the model should be defined according to the principle of parsimony, which is essential for separating the structural part (signal) of empirical data from the idiosyncratic part (noise) (Vandekerckhove et al, 2015). Although different approaches have been described in the literature for finding the linear combination of putative miRNAs that maximises the AUC (Su and Liu, 1993; Pepe and Thompson, 2000; Kang et al, 2013; Yan et al, 2015), we believe that there is an urgent need to delineate a procedure that is both methodologically robust and flexible enough to cover this fundamental step.
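
The optimism of an AUC estimated on the same data used to fit the model can be illustrated with a minimal sketch. It assumes simulated markers that carry no signal at all (40 subjects, 15 markers), a lightly ridge-penalised logistic fit chosen only for numerical stability, and leave-one-out cross-validation as the honest estimate. The function names, sample sizes and penalty value are illustrative assumptions, not taken from Shin et al (2015) or from our own analysis.

```python
import numpy as np

def auc(scores, labels):
    """Mann-Whitney estimate of the AUC; tied case/control pairs count one half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    diff = scores[labels == 1][:, None] - scores[labels == 0][None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def fit_logistic(X, y, ridge=1.0, n_iter=25):
    """Newton-Raphson logistic fit with a ridge penalty on the slopes only."""
    Xb = np.column_stack([np.ones(len(X)), X])
    penalty = ridge * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0  # the intercept is left unpenalised
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(Xb @ beta, -30, 30)))
        grad = Xb.T @ (y - p) - penalty @ beta
        hess = (Xb.T * (p * (1.0 - p))) @ Xb + penalty
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def linear_predictor(beta, X):
    """Composite score: intercept plus weighted sum of the markers."""
    return beta[0] + X @ beta[1:]

# Simulated data (assumption): 20 controls, 20 cases, 15 markers of pure noise.
rng = np.random.default_rng(2015)
n, n_markers = 40, 15
X = rng.standard_normal((n, n_markers))
y = np.repeat([0.0, 1.0], n // 2)

# Apparent AUC: the model is fitted and evaluated on the same subjects.
apparent_auc = auc(linear_predictor(fit_logistic(X, y), X), y)

# Leave-one-out AUC: each subject is scored by a model that never saw it.
loo_scores = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = fit_logistic(X[keep], y[keep])
    loo_scores[i] = linear_predictor(beta_i, X[i:i + 1])[0]
loo_auc = auc(loo_scores, y)

print(f"apparent AUC: {apparent_auc:.3f}; leave-one-out AUC: {loo_auc:.3f}")
```

Because the simulated markers are unrelated to disease status, any apparent discrimination is produced by fitting noise; the cross-validated estimate is the one that indicates what to expect in an independent data set.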

To this end, we are developing a comprehensive procedure that, starting from a set of candidate miRNAs, identifies a more powerful and parsimonious composite score. Briefly, the best combination of the candidate miRNAs is obtained by penalised maximum likelihood estimation (PMLE) regression methods (Harrell, 2001), which can provide more reliable results when the number of input variables is large. A more parsimonious final model is then obtained using a step-down procedure, as suggested by Ambler et al (2002).
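
One common form of step-down simplification can be sketched as follows. The sketch assumes that the full penalised model has already been fitted and that its linear predictor is available; markers are then removed backwards for as long as an ordinary least-squares approximation of that linear predictor retains a high R^2. The threshold of 0.95, the function names and the simulated coefficients are illustrative assumptions, and the sketch is not claimed to reproduce the exact procedure of Ambler et al (2002) or our own implementation.

```python
import numpy as np

def r_squared(X, target):
    """R^2 of an ordinary least-squares fit (with intercept) of target on X."""
    design = np.column_stack([np.ones(len(target)), X])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    centred = target - target.mean()
    return float(1.0 - (resid @ resid) / (centred @ centred))

def step_down(X, full_lp, min_r2=0.95):
    """Backward elimination that keeps the reduced score close to the full one.

    At each step the marker whose removal costs the least R^2 is dropped;
    elimination stops when any further removal would push R^2 below min_r2.
    Returns the column indices of the retained markers.
    """
    kept = list(range(X.shape[1]))
    while len(kept) > 1:
        trials = []
        for j in kept:
            remaining = [k for k in kept if k != j]
            trials.append((r_squared(X[:, remaining], full_lp), j))
        best_r2, drop = max(trials)
        if best_r2 < min_r2:
            break
        kept.remove(drop)
    return kept

# Simulated example (assumption): four markers, two of which dominate the
# full-model linear predictor while the other two contribute very little.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
full_coef = np.array([2.0, 1.5, 0.05, 0.02])  # stand-in for penalised estimates
full_lp = X @ full_coef

kept = step_down(X, full_lp)
print("retained marker columns:", kept)
```

In this simulated setting the two weak markers can be dropped with almost no change in the composite score, whereas removing either dominant marker would degrade the approximation well below the threshold.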

As an example, for illustration purposes only, we applied our procedure, in a context similar to that of Shin et al (2015), to data on circulating miRNAs in plasma from 20 hepatocellular carcinoma (HCC) patients and 20 healthy donors (GSE50013) retrieved from the Gene Expression Omnibus database (http://www.ncbi.nlm.nih.gov/gds). By applying our NqA algorithm (Verderio et al, 2014), four miRNAs were identified as potential diagnostic biomarkers for HCC. As reported in Table 1, the AUC values observed for these miRNAs individually ranged from 0.739 to 0.841. Interestingly, by combining these miRNAs with the PMLE approach, we observed an appreciable increase in predictive capability, with an AUC value of 0.953. In addition, we obtained a more parsimonious model based on only three miRNAs (AUC=0.923) without an appreciable loss of discriminatory power. A similar AUC value (AUC=0.920) was observed by applying the least absolute shrinkage and selection operator (LASSO) method (Tibshirani, 1996). Notably, the two approaches retained the same three miRNAs.
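
How the LASSO penalty performs selection can be shown with a minimal sketch of L1-penalised logistic regression fitted by proximal gradient descent. The data are simulated (two informative markers and three noise markers), and the penalty value, step size and function names are illustrative assumptions; the code does not use the GSE50013 data and is not the software used for the analysis reported here.

```python
import numpy as np

def soft_threshold(values, threshold):
    """Shrink values towards zero and set small ones exactly to zero."""
    return np.sign(values) * np.maximum(np.abs(values) - threshold, 0.0)

def lasso_logistic(X, y, lam, step=1.0, n_iter=3000):
    """L1-penalised logistic regression by proximal gradient descent.

    Minimises the mean negative log-likelihood plus lam * sum(|slopes|).
    The intercept is not penalised. The fixed step size assumes roughly
    standardised marker columns.
    """
    n, p = X.shape
    intercept, beta = 0.0, np.zeros(p)
    for _ in range(n_iter):
        eta = np.clip(intercept + X @ beta, -30, 30)
        resid = 1.0 / (1.0 + np.exp(-eta)) - y
        intercept -= step * resid.mean()
        beta = soft_threshold(beta - step * (X.T @ resid) / n, step * lam)
    return intercept, beta

# Simulated data (assumption): markers 0 and 1 drive disease status,
# markers 2 to 4 are noise.
rng = np.random.default_rng(0)
X = rng.standard_normal((400, 5))
eta_true = 2.0 * X[:, 0] - 2.0 * X[:, 1]
y = (rng.random(400) < 1.0 / (1.0 + np.exp(-eta_true))).astype(float)

_, beta = lasso_logistic(X, y, lam=0.1)
selected = np.flatnonzero(beta).tolist()
print("selected marker columns:", selected)
print("shrunken coefficients:", np.round(beta, 3))
```

The soft-thresholding step is what sets the coefficients of uninformative markers exactly to zero, so that estimation and selection happen in a single fit; in practice the penalty value would be chosen by cross-validation rather than fixed in advance.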

Table 1 Estimated AUC and 95% confidence interval (CI) of the considered models

In conclusion, this example shows that a more appropriate way to evaluate miRNAs as biomarkers could be to interpret their predictive role in a multivariate fashion or, following Collins et al (2015), to recognise that ‘Prediction is inherently multivariable’.

This suggests the need to resort to statistical procedures, generally based on advanced methods, in order to properly account for the complexity of the data, with the ultimate aim of better predicting the presence/absence of disease.