Abstract
Causal gene discovery methods are often evaluated using reference sets of causal genes, which are treated as gold standards (GS) for the purposes of evaluation. However, evaluation methods typically treat genes not in the GS positive set as known negatives rather than unknowns. This leads to inaccurate estimates of sensitivity, specificity, and AUC. Labeling biases in GS gene sets can also lead to inaccurate ordering of alternative causal gene discovery methods. We argue that the evaluation of causal gene discovery methods should rely on statistical techniques like those used for variant discovery rather than on comparison with GS gene sets.
Introduction
Identifying causal genes for complex diseases can highlight disease-specific dysregulated pathways, improve disease classification, and identify drug targets [1]. In genetic association analysis, it has become common practice to implicate putative causal genes (PCG) computationally by linking variant-level genetic association evidence and the existing biological knowledge base [2]. Some methods approach PCG implication as a supervised learning problem, aiming to predict unknown binary causal/non-causal gene labels for a given trait, while others rank the candidate genes by their likelihood of being PCGs, returning a continuous probability estimate or ranking for each gene. In this article, we focus on the common practice of evaluating PCG implication methods in reference to known sets of causal genes. While many papers making use of these sets for evaluation acknowledge that reference sets may be incomplete, this is rarely accounted for in evaluation techniques, where they are treated as gold-standard (GS) [2,3]. A critical challenge for this assessment strategy is that known causal genes may differ meaningfully from as-yet unidentified causal genes.
Table 1 summarizes methods used by recent publications to identify GS-positive genes. Genes assigned as gold standard positives by these methods are often reliable and based on stringent standards for causality. However, when proposed methods are evaluated against these GS gene sets, genes not labeled as positive are implicitly treated as non-causal or negative. These genes are almost certainly contaminated by some as-yet-unidentified causal genes (thus motivating continued PCG discovery research). In fact, the more confident we are in the positive labels of a GS gene set, the more causal genes we can expect to be mislabeled as non-causal [2]. GS gene sets also tend to favor genes with particular features determined by the method of constructing the set. For example, GS gene sets derived from the set of causal coding variants favor genes that act through protein-coding changes rather than expression regulatory mechanisms [3]. Most classification methods also have gene-feature-related biases due to the type of data they use as input. A PCG implication method will appear more accurate if it is evaluated using a GS gene set with similar feature-related biases to its own and less accurate if the GS gene set has different biases. Authors naturally select a GS gene set constructed using features they feel are important and may, therefore, unintentionally tilt the scales toward their own proposed method.
We show that when the GS gene set is incomplete, estimates of power, specificity, and the receiver operating characteristic (ROC) curve are inaccurate and may even misorder the relative quality of two different classifiers. This phenomenon can occur even if the GS gene set contains no false positives. We argue that no true GS sets of labeled genes are currently available. Therefore, we urge caution in interpreting comparisons of causal gene classifiers based on existing labels.
Effect of label contamination on evaluation metrics
Evaluation with PU-labeled gene sets
Genes outside the constructed GS set are more accurately viewed as unlabeled (U) rather than as negatives (N). Combined with accurately positively labeled genes in the GS set, the overall GS gene set should be regarded as positive-unlabeled (PU) data, a term used in semi-supervised machine learning. Using PU data to evaluate performance as though they were positive-negative (PN) labeled data results in inaccurate evaluations [4]. A PU-labeled gene set with perfect positive labeling consists of three subsets of genes: true causal genes that are correctly identified (labeled positives), true causal genes that are not labeled and therefore assumed to be non-causal (unlabeled positives), and non-causal genes that are unlabeled and therefore correctly assumed to be non-causal (unlabeled negatives) (Fig. 1a). Evaluation treating PU labels with perfect positive labeling as PN labels will always underestimate the positive predictive value (precision) and overestimate the negative predictive value (NPV) of a classifier (Fig. 1b).
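As a toy numerical illustration (the counts below are hypothetical, chosen only to show the direction of the bias), treating PU labels as PN labels deflates the estimated precision and inflates the estimated NPV:

```python
# Toy confusion counts for a classifier evaluated on PU data with
# perfect positive labeling. Among genes the classifier calls causal:
# 30 labeled positives, 10 unlabeled positives, 10 unlabeled negatives.
# Among genes it calls non-causal: 5 labeled positives, 15 unlabeled
# positives, 80 unlabeled negatives.
a, b, c = 30, 10, 10   # classified causal
A, B, C = 5, 15, 80    # classified non-causal

# True metrics: unlabeled positives count as positives.
true_precision = (a + b) / (a + b + c)   # 40/50 = 0.8
true_npv = C / (A + B + C)               # 80/100 = 0.8

# PU-as-PN metrics: all unlabeled genes are treated as negatives.
pu_precision = a / (a + b + c)           # 30/50 = 0.6
pu_npv = (B + C) / (A + B + C)           # 95/100 = 0.95

assert pu_precision < true_precision  # precision is underestimated
assert pu_npv > true_npv              # NPV is overestimated
```

The direction of both inequalities holds for any counts with unlabeled positives present, which is the point made in Fig. 1b.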
Figure 2 shows four possibilities for the performance of the classifier on the unlabeled positive genes and the corresponding relationship between the estimated and true sensitivity, specificity, and ROC curve. The classifier may perform differently on unlabeled positive genes than on labeled positive genes if the two groups differ on important features, depending on whether those features align with the features used to construct the classifier.
In almost all scenarios, estimation with PU labels leads to underestimating specificity. To see why, let A, B, C, a, b, and c be defined as illustrated in Fig. 1: the counts of labeled positive, unlabeled positive, and true negative genes that are classified as non-causal (upper case) or causal (lower case) by the classifier. With these definitions, \(\alpha =\frac{B+b}{C+c}\) is the ratio of truly causal to truly non-causal genes among the unlabeled genes. As long as the sensitivity of the classifier on unlabeled positives, \(b/(B+b)\), is higher than the probability of falsely predicting a true negative to be causal, \(c/(C+c)\), we have \(B/(B+b) < C/(C+c)\), and so \(B < \alpha C\). Therefore,

$$\widehat{{\rm{specificity}}}=\frac{B+C}{(B+b)+(C+c)}=\frac{B+C}{(1+\alpha )(C+c)} < \frac{\alpha C+C}{(1+\alpha )(C+c)}=\frac{C}{C+c}={\rm{specificity}}.$$
This means that specificity is underestimated in all cases except for the scenario in Fig. 2d. However, sensitivity may be either over- or underestimated depending on the feature biases of the labeled positive genes. If the classifier is more sensitive to unlabeled positives than to labeled positives (Fig. 2b), the sensitivity will be underestimated. If the classifier is less sensitive to unlabeled positives than to labeled positives (Fig. 2c, d), the sensitivity will be overestimated.
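The inequality is easy to check numerically; the counts below are hypothetical but satisfy the stated condition that sensitivity on unlabeled positives exceeds the false-positive rate on true negatives:

```python
# Counts as defined in Fig. 1: upper case = classified non-causal,
# lower case = classified causal (hypothetical values).
B, b = 4, 16   # unlabeled positives; sensitivity on them = 16/20 = 0.8
C, c = 90, 10  # true negatives; false-positive rate on them = 10/100 = 0.1

alpha = (B + b) / (C + c)         # causal-to-non-causal ratio among unlabeled
assert B / (B + b) < C / (C + c)  # the condition in the text
assert B < alpha * C              # its equivalent restatement

true_spec = C / (C + c)                  # 90/100 = 0.9
pu_spec = (B + C) / ((B + b) + (C + c))  # 94/120 ~ 0.783
assert pu_spec < true_spec  # specificity is underestimated under PU labels
```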
Errors in estimating sensitivity and specificity result in errors in the ROC curve and, therefore, in the area under the ROC curve (AUC). These errors also affect other measures that rely on the 2 × 2 confusion matrix, such as the Matthews correlation coefficient and the F1 score. These errors apply to evaluating ranking methods as well as methods that return only hard classifications. In the special case in Fig. 2a, the classifier has an equal ability to detect labeled and unlabeled positive genes, so sensitivity is estimated accurately. Motivated by this observation, refs. 4,5 rely on a “PU score”, which is analogous to the F1 score but relies only on sensitivity and not on specificity. However, if labeled positive genes are not representative of all positive genes, the PU score will also be inaccurate.
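For concreteness, one common form of such a score in the PU-learning literature is recall squared divided by the fraction of units predicted positive, with recall estimated on the labeled positives only (the exact variant used in refs. 4,5 may differ in details). A minimal sketch, with hypothetical toy inputs:

```python
import numpy as np

def pu_score(labels, preds):
    """PU criterion r^2 / P(pred = 1), where recall r is estimated on
    labeled positives only. A common form in the PU literature; the
    exact score used in refs. 4,5 may differ in details."""
    labels = np.asarray(labels)
    preds = np.asarray(preds)
    recall = preds[labels == 1].mean()  # sensitivity on labeled positives
    return recall ** 2 / preds.mean()

# Toy example: both labeled positives recovered (recall = 1.0),
# and half of all genes predicted positive.
print(pu_score([1, 1, 0, 0, 0, 0, 0, 0],
               [1, 1, 1, 1, 0, 0, 0, 0]))  # 1.0**2 / 0.5 = 2.0
```

Note that the score uses no specificity-like quantity, which is why it avoids the bias above but inherits any unrepresentativeness of the labeled positives.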
In genetics research, we expect labeling biases because there are multiple molecular mechanisms by which a causal gene can affect complex diseases, and different classification and GS identification methods will favor different mechanisms. For example, as shown in Table 1, several GS gene set construction strategies focus on genes with phenotype-associated coding variants. Genes that affect phenotypes primarily through expression dysregulation may not be represented in these GS gene sets, so classifiers particularly sensitive to causal genes acting through expression changes may appear to perform poorly when using these gene sets.
Simulated example
To illustrate this issue, we consider a hypothetical example in which each gene has two continuous, measurable features, Pr and Ex. We think of these features as continuous summaries of the evidence that a gene acts on the trait through mechanisms mediated by either protein sequence (Pr) or expression level (Ex). Let \({Y}_{i}\) be a binary indicator that gene \(i\) is causal for the trait of interest. We simulate \({{Pr}}_{i}\) and \({{Ex}}_{i}\) from independent standard normal distributions and generate \({Y}_{i}\) as a Bernoulli draw whose success probability is a logistic function of \({{Pr}}_{i}\) and \({{Ex}}_{i}\).
In our simulation, the protein feature, Pr, is a stronger predictor of causality than the expression feature, Ex.
In each simulated data set, we generate Pr, Ex, and causal status, Y, for 20,000 genes, divided evenly into 10,000 genes used for training and 10,000 used for testing. In the training set, we fit two classifiers, the Pr-classifier and the Ex-classifier, each a logistic regression with Y as the outcome and either Pr or Ex as the sole predictor. This differs from the way causal gene discovery methods are generally built, as no perfectly labeled gene sets are available for training. However, this strategy provides a straightforward way to obtain classifiers based on only one of the two gene features.
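This setup can be sketched in a few lines of Python. The generative coefficients below (intercept −2, slope 1.5 on Pr, 0.5 on Ex) are illustrative choices of ours, not necessarily the values used in the reported simulations, but they make Pr the stronger predictor as described:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Simulate features and causal status (illustrative coefficients).
Pr = rng.standard_normal(n)
Ex = rng.standard_normal(n)
prob = 1 / (1 + np.exp(-(-2 + 1.5 * Pr + 0.5 * Ex)))
Y = rng.binomial(1, prob)

train = slice(0, 10_000)  # first half for training, second for testing

def fit_logistic(x, y, iters=25):
    """Intercept + single-feature logistic regression via Newton-Raphson."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-(X @ beta)))
        grad = X.T @ (y - mu)
        hess = X.T @ (X * (mu * (1 - mu))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# The Pr-classifier and Ex-classifier each see only one feature.
beta_pr = fit_logistic(Pr[train], Y[train])
beta_ex = fit_logistic(Ex[train], Y[train])

# Pr carries more signal, so its fitted slope should be larger.
assert beta_pr[1] > beta_ex[1] > 0
```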
The 10,000 genes in the testing set function as our GS gene set. We consider three possibilities: either all genes are correctly labeled, positives with high levels of Pr are more likely to be correctly labeled, or positives with high levels of Ex are more likely to be correctly labeled. We refer to these as correct, Pr-enriched, and Ex-enriched labels. Let \({Z}_{C,i}\), \({Z}_{{Pr},i}\), and \({Z}_{{Ex},i}\) be the correct, Pr-enriched, and Ex-enriched labels for gene \(i\) in the testing set. We generate these as \({Z}_{C,i}={Y}_{i}\), \({Z}_{{Pr},i}={Y}_{i}{W}_{{Pr},i}\), and \({Z}_{{Ex},i}={Y}_{i}{W}_{{Ex},i}\), where \({W}_{{Pr},i}\) and \({W}_{{Ex},i}\) are Bernoulli label-retention indicators whose success probabilities increase with \({{Pr}}_{i}\) and \({{Ex}}_{i}\), respectively.
The Pr-enriched labels mislabel 4.5% of all positives as negative, while the Ex-enriched labels mislabel 13.5% of all positives as negative.
ROC curves estimated using each of the imperfect label sets are shown in Fig. 3, compared against ROC curves estimated using perfect labels. In both cases, label enrichment results in biased estimation of classifier performance. When Pr-enriched labels are used, the AUC of the Pr-classifier is overestimated and the AUC of the Ex-classifier is underestimated; however, the accuracy of the two classifiers is still correctly ordered. When the Ex-enriched labels are used, the pattern is reversed, resulting in the misordering of the two classifiers. These results align with our theoretical expectations. When using the Pr-enriched labels to evaluate the Pr-classifier or the Ex-enriched labels to evaluate the Ex-classifier, we are in the scenario of Fig. 2c, where sensitivity is overestimated and specificity is underestimated, pushing the ROC curve up from its true value. Conversely, when label enrichment does not favor the genes a classifier is most sensitive to, we are in the scenario of Fig. 2b, where sensitivity is underestimated, pushing the ROC curve down.
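The direction of this bias is easy to reproduce in a small self-contained sketch (again with illustrative, hypothetical coefficients and retention probabilities of our own choosing): enriching labels toward the classifier's own feature inflates its apparent sensitivity, the Fig. 2c scenario.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Testing-set genes: features and true causal status (illustrative
# coefficients, not the paper's exact values).
Pr = rng.standard_normal(n)
Ex = rng.standard_normal(n)
Y = rng.binomial(1, 1 / (1 + np.exp(-(-2 + 1.5 * Pr + 0.5 * Ex))))

# Pr-enriched labels: a causal gene is more likely to keep its positive
# label when its Pr evidence is strong (retention probability expit(2*Pr)).
retained = rng.binomial(1, 1 / (1 + np.exp(-2 * Pr)))
Z = Y * retained

# A hard Pr-classifier: call a gene causal when Pr exceeds a threshold.
pred = (Pr > 1.0).astype(int)

true_sens = pred[Y == 1].mean()  # sensitivity against the true labels
est_sens = pred[Z == 1].mean()   # sensitivity against Pr-enriched labels

# Fig. 2c: labels enriched for the classifier's own feature make the
# classifier look more sensitive than it really is.
assert est_sens > true_sens
```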
Outlook
It is currently impossible to confidently determine a comprehensive GS gene set that includes all causal genes for any trait due to the myriad biological mechanisms leading to complex phenotypes. Several studies have acknowledged that supervised ML methods designed to classify PCGs should not be trained on non-comprehensive GS gene sets [4,6,7]. Here, we draw attention to the fact that sensitivity and specificity estimated using non-comprehensive GS gene sets are also inaccurate, making it inappropriate to compare and evaluate methods using these or related measures such as AUC and F1 scores. However, properly evaluating PCG implication methods is critical to making progress in this field. To address a similar issue in other fields, researchers have proposed incorporating negative controls (i.e., known negatives) or weights estimating each unit's probability of being detected based on its features [8,9]. These methods do not clearly extend to the PCG implication problem. It may be possible to build a case for some negative control gene-trait pairs. However, these will likely differ meaningfully from unknown negatives, leading to similar issues of biased estimation. Using weighting in this context would require estimating the probability that each labeled positive gene was labeled given its features, which is impossible without knowledge of the feature distribution of all true positive genes.
An alternative that circumvents the issue is to use a statistical model-based approach for causal gene identification. This is common in the field of causal variant identification, where methods for statistical fine-mapping rely on probabilistic models, allowing them to obtain model-based measures of uncertainty such as posterior inclusion probabilities or confidence intervals [10,11]. Probabilistic methods can also be evaluated in simulations to test their robustness to violations of modeling assumptions. Using probabilistic models validated in simulations is the most practical route to defensible estimates of a method's false discovery rate under different parameter settings.
Incomplete GS gene sets may still play an important role in method evaluation, as they can be used to provide an empirical estimate of the sensitivity of a method at a given parameter setting corresponding to a known false discovery rate or to compute the PU score used by ref. 4. However, it is important to note that this is a context-specific measure of sensitivity that may not replicate in other gene sets with different features. When presenting sensitivity results, researchers should acknowledge potential feature biases of GS gene sets and evaluate their methods using multiple gene sets constructed from different information.
Finally, incomplete GS gene sets should not be used to construct ROC curves or compute measures such as the F1 statistic. This limitation of PU-labeled data provides an argument against the use of purely rank-based PCG implication methods. Methods that supply only a ranking and no model-based measure of false discovery rate are completely reliant on accurately labeled testing data for calibration. We have argued that no such testing data exists, meaning it is impossible to calibrate rank-only methods to a target false discovery rate. Instead, we urge researchers to develop PCG implication methods to make use of statistical approaches that provide model-based measures of label uncertainty.
Code availability
Code replicating simulations can be found in the Supplementary Data and at https://lijiaw.gitlab.io/GS-gene-sets/comparison.html.
References
1. Hormozdiari, F., Kichaev, G., Yang, W.-Y., Pasaniuc, B. & Eskin, E. Identification of causal genes for complex traits. Bioinformatics 31, i206–i213 (2015).
2. Picart-Armada, S. et al. Benchmarking network propagation methods for disease gene identification. PLOS Comput. Biol. 15, e1007276 (2019).
3. Weeks, E. M. et al. Leveraging polygenic enrichments of gene features to predict genes underlying complex traits and diseases. Nat. Genet. 55, 1267–1276 (2023).
4. Kolosov, N., Daly, M. J. & Artomov, M. Prioritization of disease genes from GWAS using ensemble-based positive-unlabeled learning. Eur. J. Hum. Genet. 29, 1527–1535 (2021).
5. Claesen, M., De Smet, F., Suykens, J. A. K. & De Moor, B. A robust ensemble approach to learn from positive and unlabeled data using SVM base models. Neurocomputing 160, 73–84 (2015).
6. Duda, M. et al. Brain-specific functional relationship networks inform autism spectrum disorder gene prediction. Transl. Psychiatry 8, 1–9 (2018).
7. Krishnan, A. et al. Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder. Nat. Neurosci. 19, 1454–1462 (2016).
8. Liu, L. & Peng, T. Clustering-based method for positive and unlabeled text categorization enhanced by improved tfidf. J. Inf. Sci. Eng. 30, 1463–1481 (2014).
9. Du Plessis, M. C., Niu, G. & Sugiyama, M. Analysis of learning from positive and unlabeled data. In Advances in Neural Information Processing Systems Vol. 27 (Curran Associates, Inc., 2014). https://proceedings.neurips.cc/paper_files/paper/2014/file/35051070e572e47d2c26c241ab88307f-Paper.pdf.
10. Benner, C. et al. FINEMAP: efficient variable selection using summary data from genome-wide association studies. Bioinformatics 32, 1493–1501 (2016).
11. Zou, Y., Carbonetto, P., Wang, G. & Stephens, M. Fine-mapping from summary data with the “Sum of Single Effects” model. PLoS Genet. 18, e1010299 (2022).
12. Connally, N. J. et al. The missing link between genetic association and regulatory function. eLife 11, e74970 (2022).
13. Greene, C. S. et al. Understanding multicellular function and disease with human tissue-specific networks. Nat. Genet. 47, 569–576 (2015).
14. Tranchevent, L.-C. et al. Candidate gene prioritization with Endeavour. Nucleic Acids Res. 44, W117–W121 (2016).
15. Mountjoy, E. et al. An open approach to systematically prioritize causal variants and genes at all published human GWAS trait-associated loci. Nat. Genet. 53, 1527–1533 (2021).
16. Gazal, S. et al. Combining SNP-to-gene linking strategies to identify disease genes and assess disease omnigenicity. Nat. Genet. 54, 827–836 (2022).
Author information
Contributions
L.W.: literature review, conducting simulations, manuscript preparation; X.W.: advising, manuscript preparation; J.M.: advising, manuscript preparation, figure creation.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Communications Biology thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: George Inglis, Luke Grinham and Aylin Bircan. A peer review file is available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wang, L., Wen, X. & Morrison, J. Imperfect gold standard gene sets yield inaccurate evaluation of causal gene identification methods. Commun Biol 7, 873 (2024). https://doi.org/10.1038/s42003-024-06482-1