Protein function prediction is improved by creating synthetic feature samples with generative adversarial networks

Abstract

Protein function prediction is a challenging but important task in bioinformatics. Many prediction methods have been developed, but they remain limited by the small number of available training samples. It is therefore valuable to develop a data augmentation method that can generate high-quality synthetic samples to further improve the accuracy of prediction methods. In this work, we propose a novel generative adversarial network-based method, FFPred-GAN, that accurately learns the high-dimensional distributions of protein sequence-based biophysical features and generates high-quality synthetic protein feature samples. The experimental results suggest that augmenting the original training protein feature samples with these synthetic samples improves prediction accuracy for all three domains of Gene Ontology.

Fig. 1: The flowchart for FFPred-GAN.
Fig. 2: The CTST results and 2D visualization of real and synthetic protein feature samples.
Fig. 3: Comparison of predictive accuracy obtained by augmented training samples and the original training samples.
Fig. 4: Precision-recall curves.
Fig. 5: Comparison of five different methods’ predictive performance.
Fig. 6: 2D visualizations of the learned SVM decision boundaries.
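
The caption of Fig. 2 refers to CTST, the classifier two-sample test. As a rough illustration of the general idea only, and not the authors' implementation, the sketch below trains a classifier to separate "real" from "synthetic" feature vectors: a cross-validated accuracy close to 0.5 suggests the two sets are hard to tell apart, while an accuracy close to 1 suggests they differ. The function name `ctst_accuracy`, the 1-nearest-neighbour classifier, the 5-fold cross-validation and the Gaussian toy data are all assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def ctst_accuracy(real, synthetic, n_neighbors=1, cv=5):
    """Cross-validated accuracy of a classifier that tries to tell
    real samples (label 1) from synthetic samples (label 0)."""
    X = np.vstack([real, synthetic])
    y = np.concatenate([
        np.ones(len(real), dtype=int),
        np.zeros(len(synthetic), dtype=int),
    ])
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    # For classifiers, an integer cv uses stratified folds, so each
    # fold contains both real and synthetic samples.
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    return float(scores.mean())


# Toy data standing in for protein feature vectors (hypothetical).
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 20))
similar = rng.normal(0.0, 1.0, size=(200, 20))   # same distribution as `real`
shifted = rng.normal(3.0, 1.0, size=(200, 20))   # clearly different distribution

acc_similar = ctst_accuracy(real, similar)
acc_shifted = ctst_accuracy(real, shifted)
print(f"same distribution:      {acc_similar:.2f}")
print(f"different distribution: {acc_shifted:.2f}")
```

In this toy setting, the first accuracy should sit near chance level and the second near 1, which is the contrast such a test relies on when judging the quality of generated samples.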

Data availability

All data can be downloaded via http://bioinfadmin.cs.ucl.ac.uk/downloads/FFPredGAN.

Code availability

The source code can be accessed via https://github.com/psipred/FFPredGAN.

Acknowledgements

We thank the members of the UCL Bioinformatics Group for valuable discussions. We also acknowledge the support of the high-performance computing facility of the Department of Computer Science at University College London. C.W. and D.T.J. acknowledge funding from the Biotechnology and Biological Sciences Research Council (BB/L002817/1) and the European Research Council Advanced Grant ‘ProCovar’ (Project ID 695558). This work was supported by the Francis Crick Institute, which receives its core funding from Cancer Research UK (FC001002), the UK Medical Research Council (FC001002) and the Wellcome Trust (FC001002).

Author information

Contributions

C.W. and D.T.J. conceived the idea. C.W. implemented the FFPred-GAN system and carried out the experiments. C.W. and D.T.J. analysed the experimental results and wrote the manuscript.

Corresponding author

Correspondence to David T. Jones.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Boxplots of the rankings of MCC values.

a–c, Rankings of MCC values obtained with different combinations of synthetic and real protein samples and three classification algorithms when predicting the biological process (a), molecular function (b) and cellular component (c) domains of protein function.

Extended Data Fig. 2 Boxplots of the rankings of AUROC values.

a–c, Rankings of AUROC values obtained with different combinations of synthetic and real protein samples and three classification algorithms when predicting the biological process (a), molecular function (b) and cellular component (c) domains of protein function.

Extended Data Fig. 3 Comparison of the predictive accuracy obtained with FFPred-GAN-augmented and SMOTE-augmented training samples.

a–f, Scatter plots of the MCC and AUROC values obtained with FFPred-GAN-augmented and SMOTE-augmented training samples when predicting the three domains of GO terms using the SVM (a–e) and RF (f) classification algorithms.

Extended Data Fig. 4 Characteristics of the computational time and sample size.

a, Boxplot of the distribution of computational time needed to obtain the optimal synthetic protein samples for different GO terms. b, Boxplot of the distribution of sample sizes for different GO terms. c–h, Scatter plots showing the correlation between computational time and sample size for positive and negative protein samples in the different domains of GO terms.

Supplementary information

Supplementary Information

Supplementary Figs. 1–3 and Tables 1–7.

Supplementary Data 1

About this article

Cite this article

Wan, C., Jones, D.T. Protein function prediction is improved by creating synthetic feature samples with generative adversarial networks. Nat Mach Intell 2, 540–550 (2020). https://doi.org/10.1038/s42256-020-0222-1
