Protein function prediction is a challenging but important task in bioinformatics. Many prediction methods have been developed, but their accuracy is still limited by the small number of available training samples. It is therefore valuable to develop a data augmentation method that generates high-quality synthetic samples to improve the accuracy of these methods. In this work, we propose FFPred-GAN, a novel method based on generative adversarial networks that accurately learns the high-dimensional distributions of protein sequence-based biophysical features and generates high-quality synthetic protein feature samples. The experimental results suggest that augmenting the original training protein feature samples with the synthetic samples improves prediction accuracy for all three domains of the Gene Ontology.
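The augmentation step summarized in the abstract can be illustrated with a minimal Python sketch. It only shows how synthetic feature vectors might be appended to a real training set before a classifier is trained. The function names `stand_in_generator` and `augment`, the four-feature toy vectors and the noise level are all assumptions made for illustration; in particular, the stand-in generator merely jitters real samples with Gaussian noise and is not a generative adversarial network, nor the FFPred-GAN implementation.

```python
import random


def stand_in_generator(real_vectors, n_samples, noise=0.05, seed=0):
    """Hypothetical placeholder for a trained GAN generator.

    It only jitters randomly chosen real feature vectors with Gaussian
    noise. It is NOT a GAN; it exists so the augmentation step can run.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_samples):
        base = rng.choice(real_vectors)
        synthetic.append([x + rng.gauss(0.0, noise) for x in base])
    return synthetic


def augment(real_vectors, real_labels, synthetic_vectors, synthetic_label):
    """Append synthetic samples (all sharing one label) to the real training set."""
    vectors = list(real_vectors) + list(synthetic_vectors)
    labels = list(real_labels) + [synthetic_label] * len(synthetic_vectors)
    return vectors, labels


# Toy data (made up): positive and negative samples for a single GO term,
# each described by four illustrative feature values.
real_pos = [[0.10, 0.90, 0.30, 0.50],
            [0.20, 0.80, 0.40, 0.60],
            [0.15, 0.85, 0.35, 0.55]]
real_neg = [[0.90, 0.10, 0.70, 0.20],
            [0.80, 0.20, 0.60, 0.30]]
real_X = real_pos + real_neg
real_y = [1, 1, 1, 0, 0]

# Generate three synthetic positive samples and add them to the training set.
synth_pos = stand_in_generator(real_pos, n_samples=3)
train_X, train_y = augment(real_X, real_y, synth_pos, synthetic_label=1)

print(len(train_X), sum(train_y))  # 8 training samples, 6 of them positive
```

The augmented `train_X` and `train_y` would then be passed to whichever classifier is being evaluated. Whether augmentation helps depends entirely on the quality of the synthetic samples, which this placeholder does not attempt to model.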
All data can be downloaded via http://bioinfadmin.cs.ucl.ac.uk/downloads/FFPredGAN.
The source code can be accessed via https://github.com/psipred/FFPredGAN.
We thank the members of the UCL Bioinformatics Group for valuable discussions. We also acknowledge the support of the high-performance computing facility of the Department of Computer Science at University College London. C.W. and D.T.J. acknowledge funding from the Biotechnology and Biological Sciences Research Council (BB/L002817/1) and the European Research Council Advanced Grant ‘ProCovar’ (Project ID 695558). This work was supported by the Francis Crick Institute, which receives its core funding from Cancer Research UK (FC001002), the UK Medical Research Council (FC001002) and the Wellcome Trust (FC001002).
The authors declare no competing interests.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
a-c, The rankings of MCC values obtained by different combinations of synthetic and real protein samples and three different classification algorithms for predicting biological process (a), molecular function (b) and cellular component (c) domains of protein functions.
a-c, The rankings of AUROC values obtained by different combinations of synthetic and real protein samples and three different classification algorithms for predicting biological process (a), molecular function (b) and cellular component (c) domains of protein functions.
Extended Data Fig. 3 The comparison of predictive accuracy obtained by the FFPred-GAN augmented training samples and the SMOTE augmented training samples.
a-f, Scatter plots of the MCC and AUROC values obtained with the FFPred-GAN augmented training samples and the SMOTE augmented training samples for predicting three domains of GO terms, using SVM (a-e) and RF (f) classification algorithms.
a, Box plot of the distributions of computational time needed to obtain the optimal synthetic protein samples for different GO terms; b, box plot of the distributions of sample sizes for different GO terms; c-h, scatter plots showing the correlation between computational time and sample size for positive and negative protein samples of different domains of GO terms.
Wan, C., Jones, D.T. Protein function prediction is improved by creating synthetic feature samples with generative adversarial networks. Nat Mach Intell 2, 540–550 (2020). https://doi.org/10.1038/s42256-020-0222-1