Protein function prediction is improved by creating synthetic feature samples with generative adversarial networks

Wan, Cen; Jones, David T.

doi:10.1038/s42256-020-0222-1

Article
Published: 31 August 2020

Nature Machine Intelligence volume 2, pages 540–550 (2020)

Abstract

Protein function prediction is a challenging but important task in bioinformatics. Many prediction methods have been developed, but are still limited by the bottleneck on training sample quantity. Therefore, it is valuable to develop a data augmentation method that can generate high-quality synthetic samples to further improve the accuracy of prediction methods. In this work, we propose a novel generative adversarial networks-based method, FFPred-GAN, to accurately learn the high-dimensional distributions of protein sequence-based biophysical features and also generate high-quality synthetic protein feature samples. The experimental results suggest that the synthetic protein feature samples are successful in improving the prediction accuracy for all three domains of Gene Ontology through augmentation of the original training protein feature samples.
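The augmentation step described in the abstract can be illustrated with a minimal sketch. It assumes that synthetic feature vectors have already been produced by some generator and only shows how they might be merged with the real training samples before a classifier is trained. The function name `augment_training_set`, the toy feature values and the choice to augment only the positive class are illustrative assumptions, not details taken from the article.

```python
import random


def augment_training_set(real_pos, real_neg, synth_pos, synth_neg, seed=0):
    """Merge real and synthetic feature vectors into one shuffled training set.

    Labels: 1 for positive samples (annotated with the GO term), 0 for negatives.
    This helper is hypothetical and only illustrates the idea of augmentation.
    """
    features = real_pos + synth_pos + real_neg + synth_neg
    labels = ([1] * (len(real_pos) + len(synth_pos))
              + [0] * (len(real_neg) + len(synth_neg)))
    # Shuffle features and labels together so each pair stays aligned.
    order = list(range(len(features)))
    random.Random(seed).shuffle(order)
    return [features[i] for i in order], [labels[i] for i in order]


# Toy three-dimensional feature vectors (hypothetical values).
real_pos = [[0.9, 0.1, 0.4], [0.8, 0.2, 0.5]]
real_neg = [[0.1, 0.7, 0.3], [0.2, 0.9, 0.2], [0.0, 0.8, 0.1]]
synth_pos = [[0.85, 0.15, 0.45]]  # stand-in for generator output
synth_neg = []                    # here only the positive class is augmented

X_train, y_train = augment_training_set(real_pos, real_neg, synth_pos, synth_neg)
```

The augmented `X_train` and `y_train` would then be passed to whichever classifier is being evaluated, and its accuracy compared with that of the same classifier trained on the real samples alone.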

**Fig. 1: The flowchart for FFPred-GAN.**

**Fig. 2: The CTST results and 2D visualization of real and synthetic protein feature samples.**

**Fig. 3: Comparison of predictive accuracy obtained by augmented training samples and the original training samples.**

**Fig. 5: Comparison of five different methods’ predictive performance.**

**Fig. 6: 2D visualizations of the learned SVM decision boundaries.**
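The CTST (classifier two-sample test) named in the caption of Fig. 2 can be sketched as follows. The idea is to train a classifier to tell real samples from synthetic ones: an accuracy close to 0.5 suggests the two sets are hard to distinguish, while an accuracy close to 1.0 suggests the synthetic samples are easy to spot. This sketch assumes a leave-one-out 1-nearest-neighbour classifier and Gaussian toy data; both are assumptions made for illustration and may differ from the setup used in the article.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ctst_accuracy(real, synthetic):
    """Leave-one-out 1-NN accuracy for separating real from synthetic samples."""
    samples = [(v, 1) for v in real] + [(v, 0) for v in synthetic]
    correct = 0
    for i, (vec, label) in enumerate(samples):
        # Find the nearest neighbour among all other samples.
        nearest = min(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: sq_dist(vec, samples[j][0]),
        )
        correct += samples[nearest][1] == label
    return correct / len(samples)


def gaussian_samples(n, dim, mean, rng):
    """Toy data: n vectors with independent unit-variance Gaussian features."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
real = gaussian_samples(100, 5, 0.0, rng)
good_synth = gaussian_samples(100, 5, 0.0, rng)   # same distribution as real
bad_synth = gaussian_samples(100, 5, 10.0, rng)   # clearly shifted distribution

acc_good = ctst_accuracy(real, good_synth)  # expected to be near chance level
acc_bad = ctst_accuracy(real, bad_synth)    # expected to be near 1.0
```

Under this reading, a generator checkpoint whose samples give an accuracy nearest to 0.5 would be the preferred one.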

Data availability

All data can be downloaded via http://bioinfadmin.cs.ucl.ac.uk/downloads/FFPredGAN.

Code availability

The source code can be accessed via https://github.com/psipred/FFPredGAN.

Acknowledgements

We thank the members of the UCL Bioinformatics Group for valuable discussions. We also acknowledge the support of the high-performance computing facility of the Department of Computer Science at University College London. C.W. and D.T.J. acknowledge funding from the Biotechnology and Biological Sciences Research Council (BB/L002817/1) and the European Research Council Advanced Grant ‘ProCovar’ (Project ID 695558). This work was supported by the Francis Crick Institute, which receives its core funding from Cancer Research UK (FC001002), the UK Medical Research Council (FC001002) and the Wellcome Trust (FC001002).

Author information

Authors and Affiliations

Biomedical Data Science Laboratory, The Francis Crick Institute, London, UK
Cen Wan & David T. Jones
Department of Computer Science, University College London, London, UK
Cen Wan & David T. Jones

Contributions

C.W. and D.T.J. conceived the idea. C.W. implemented the FFPred-GAN system and carried out the experiments. C.W. and D.T.J. analysed the experimental results and wrote the manuscript.

Corresponding author

Correspondence to David T. Jones.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Boxplots of the rankings of MCC values.

a-c, The rankings of MCC values obtained by different combinations of synthetic and real protein samples and three different classification algorithms for predicting biological process (a), molecular function (b) and cellular component (c) domains of protein functions.

Extended Data Fig. 2 Boxplots of the rankings of AUROC values.

a-c, The rankings of AUROC values obtained by different combinations of synthetic and real protein samples and three different classification algorithms for predicting biological process (a), molecular function (b) and cellular component (c) domains of protein functions.
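The two metrics ranked in Extended Data Figs. 1 and 2 are standard and can be computed as in the sketch below: the Matthews correlation coefficient (MCC) from a binary confusion matrix, and the area under the ROC curve (AUROC) as the fraction of positive–negative pairs that the scores rank correctly. The function names and toy inputs are illustrative, and returning 0.0 for an undefined MCC is a common convention adopted here as an assumption.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumed here): report 0.0 when the denominator is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def auroc(y_true, scores):
    """AUROC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels, hard predictions and scores (hypothetical values).
labels = [1, 1, 0, 0]
perfect_mcc = mcc(labels, [1, 1, 0, 0])
perfect_auroc = auroc(labels, [0.9, 0.8, 0.2, 0.1])
```

MCC ranges from -1 to 1 and is computed from hard predictions, whereas AUROC ranges from 0 to 1 and is computed from continuous scores, which is why the two can rank methods differently.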

Extended Data Fig. 3 The comparison of predictive accuracy obtained by the FFPred-GAN augmented training samples and the SMOTE augmented training samples.

a-f, Scatter plots of the MCC and AUROC values obtained with the FFPred-GAN augmented training samples and the SMOTE augmented training samples for predicting three domains of GO terms using SVM (a-e) and RF (f) classification algorithms.

Extended Data Fig. 4 Characteristics about the computational time and sample size.

a, Boxplot of the distributions of computational time needed to obtain the optimal synthetic protein samples for different GO terms. b, Boxplot of the distributions of sample sizes for different GO terms. c-h, Scatter plots showing the correlation coefficient values between computational time and sample size for positive and negative protein samples in different domains of GO terms.

Supplementary information

Supplementary Information

Supplementary Figs. 1–3 and Tables 1–7.

Supplementary Data 1

About this article

Cite this article

Wan, C., Jones, D.T. Protein function prediction is improved by creating synthetic feature samples with generative adversarial networks. Nat Mach Intell 2, 540–550 (2020). https://doi.org/10.1038/s42256-020-0222-1

Received: 22 October 2019
Accepted: 23 July 2020
Published: 31 August 2020
Issue Date: September 2020
DOI: https://doi.org/10.1038/s42256-020-0222-1
