Fast and effective protein model refinement using deep graph neural networks

A preprint version of the article is available at bioRxiv.


Protein model refinement is the last step applied to improve the quality of a predicted protein model. Currently, the most successful refinement methods rely on extensive conformational sampling and thus take hours or days to refine even a single protein model. Here, we propose a fast and effective model refinement method that applies graph neural networks (GNNs) to predict a refined inter-atom distance probability distribution from an initial model and then rebuilds three-dimensional models from the predicted distance distribution. Tested on the Critical Assessment of Structure Prediction refinement targets, our method has an accuracy that is comparable to those of two leading human groups (FEIG and BAKER), but runs substantially faster. Our method may refine one protein model within ~11 min on one CPU, whereas BAKER needs ~30 h on 60 CPUs and FEIG needs ~16 h on one GPU. Finally, our study shows that GNN outperforms ResNet (convolutional residual neural networks) for model refinement when very limited conformational sampling is allowed.
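To make the representation in the abstract concrete, the following is a minimal sketch of how an initial model's coordinates might be turned into a residue graph with discretized distance labels, the kind of target a distance-distribution predictor works with. It is not the authors' implementation: the one-atom-per-residue coordinates, the 10 Å edge cutoff, the 0.5 Å bin width and the function names `residue_graph` and `distance_bin` are all illustrative assumptions.

```python
import math

def residue_graph(coords, cutoff=10.0):
    """Return edges (i, j, distance) for residue pairs closer than `cutoff`.

    `coords` holds one 3D point per residue (an assumption for this sketch).
    The 10-angstrom cutoff is illustrative, not taken from the paper.
    """
    edges = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = math.dist(coords[i], coords[j])
            if d < cutoff:
                edges.append((i, j, d))
    return edges

def distance_bin(d, first_edge=2.0, last_edge=20.0, width=0.5):
    """Map a distance to a bin index (assumed bin layout).

    Bin 0 collects distances below `first_edge`, the last bin collects
    distances at or above `last_edge`, and the bins in between are
    `width` angstroms wide.
    """
    if d < first_edge:
        return 0
    if d >= last_edge:
        return int(round((last_edge - first_edge) / width)) + 1
    return int((d - first_edge) // width) + 1

# Toy "initial model": four residues on a line, 3.8 angstroms apart.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
edges = residue_graph(coords)
labels = [(i, j, distance_bin(d)) for i, j, d in edges]
```

In this toy case the pair (0, 3) is 11.4 Å apart and therefore gets no edge. A refinement network would predict a probability over the bins for each edge rather than the single label computed here.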


Fig. 1: GNNRefine for protein model refinement.
Fig. 2: Successful examples of refinement by GNNRefine.
Fig. 3: Running times of GNNRefine and DeepAccNet on CASP14 targets with respect to protein sequence length.

Data availability

Our in-house data are available at Click on this link and fill in your name, email address and organization name to obtain a data link, through which you will find a text file 0README.Data4GNNRefine.txt that specifies the names of the data files to be downloaded. The data are also available at Zenodo38. The DeepAccNet data are available at The CASP13 and CASP14 models for refinement are available at The CAMEO models are available at The CAMEO dataset includes 208 starting models for all the CAMEO hard targets released between 1 May 2018 and 1 May 2020. We keep only the targets with sequence length in [50, 500] and native structures containing at least 80% of sequence residues. Following CASPs, we select the best-predicted models (in terms of GDT-HA) for each target as the starting models, and only keep the starting models with lDDT > 50. For the CASP13 FM (free-modeling) dataset, there are 28 test targets corresponding to 32 official FM domains. For each target we build ~150 decoys as its starting models using our in-house template-free modeling software RaptorX-Contact. Source data are provided with this paper.
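The CAMEO selection rules described above can be sketched as a small filter. This is only an illustration under stated assumptions: the `StartingModel` record, its field names and the toy values are hypothetical, and the sketch assumes a single best model (by GDT-HA) is kept per target and that lDDT is on a 0–100 scale.

```python
from dataclasses import dataclass

@dataclass
class StartingModel:
    target: str           # target identifier (hypothetical field names)
    seq_len: int          # number of residues in the target sequence
    native_residues: int  # residues present in the native structure
    gdt_ha: float         # GDT-HA of this model
    lddt: float           # lDDT of this model, assumed on a 0-100 scale

def select_starting_models(models):
    """Apply the filters from the data description to candidate models."""
    best = {}
    for m in models:
        # Keep only targets with sequence length in [50, 500].
        if not 50 <= m.seq_len <= 500:
            continue
        # The native structure must contain at least 80% of the residues.
        if m.native_residues / m.seq_len < 0.80:
            continue
        # Keep the best-predicted model per target in terms of GDT-HA.
        if m.target not in best or m.gdt_ha > best[m.target].gdt_ha:
            best[m.target] = m
    # Only keep starting models with lDDT > 50.
    return [m for m in best.values() if m.lddt > 50]

# Toy candidates (made-up values, for illustration only).
models = [
    StartingModel("T1", 120, 110, 55.0, 62.0),
    StartingModel("T1", 120, 110, 60.0, 58.0),  # best GDT-HA for T1
    StartingModel("T2", 40, 40, 70.0, 80.0),    # sequence too short
    StartingModel("T3", 300, 200, 65.0, 70.0),  # native covers under 80%
    StartingModel("T4", 200, 190, 50.0, 45.0),  # lDDT too low
]
selected = select_starting_models(models)
```

Only the second T1 model passes every filter in this toy input.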

Code availability

The source code is available at Code Ocean39.


References

1. Wang, S., Sun, S., Li, Z., Zhang, R. & Xu, J. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comput. Biol. 13, e1005324 (2017).
2. Xu, J. Distance-based protein folding powered by deep learning. Proc. Natl Acad. Sci. USA 116, 16856–16865 (2019).
3. Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020).
4. Yang, J. et al. Improved protein structure prediction using predicted interresidue orientations. Proc. Natl Acad. Sci. USA 117, 1496–1503 (2020).
5. Read, R. J., Sammito, M. D., Kryshtafovych, A. & Croll, T. I. Evaluation of model refinement in CASP13. Proteins Struct. Funct. Bioinf. 87, 1249–1262 (2019).
6. Heo, L., Arbour, C. F. & Feig, M. Driven to near-experimental accuracy by refinement via molecular dynamics simulations. Proteins Struct. Funct. Bioinf. 87, 1263–1275 (2019).
7. Park, H. et al. High-accuracy refinement using Rosetta in CASP13. Proteins Struct. Funct. Bioinf. 87, 1276–1282 (2019).
8. Xu, D. & Zhang, Y. Improving the physical realism and structural accuracy of protein models by a two-step atomic-level energy minimization. Biophys. J. 101, 2525–2534 (2011).
9. Heo, L., Park, H. & Seok, C. GalaxyRefine: protein structure refinement driven by side-chain repacking. Nucleic Acids Res. 41, W384–W388 (2013).
10. Bhattacharya, D., Nowotny, J., Cao, R. & Cheng, J. 3Drefine: an interactive web server for efficient protein structure refinement. Nucleic Acids Res. 44, W406–W409 (2016).
11. Bhattacharya, D. refineD: improved protein structure refinement using machine learning based restrained relaxation. Bioinformatics 35, 3320–3328 (2019).
12. Lee, G. R., Won, J., Heo, L. & Seok, C. GalaxyRefine2: simultaneous refinement of inaccurate local regions and overall protein structure. Nucleic Acids Res. 47, W451–W455 (2019).
13. Hiranuma, N. et al. Improved protein structure refinement guided by deep learning based accuracy estimation. Nat. Commun. 12, 1340 (2021).
14. Mirjalili, V., Noyes, K. & Feig, M. Physics-based protein structure refinement through multiple molecular dynamics trajectories and structure averaging. Proteins Struct. Funct. Bioinf. 82, 196–207 (2014).
15. Sanyal, S., Anishchenko, I., Dagar, A., Baker, D. & Talukdar, P. ProteinGCN: protein model quality assessment using graph convolutional networks. Preprint at bioRxiv (2020).
16. Baldassarre, F., Hurtado, D. M., Elofsson, A. & Azizpour, H. GraphQA: protein model quality assessment using graph convolutional networks. Bioinformatics 37, 360–366 (2021).
17. Chaudhury, S., Lyskov, S. & Gray, J. J. PyRosetta: a script-based interface for implementing molecular modeling algorithms using Rosetta. Bioinformatics 26, 689–691 (2010).
18. Conway, P., Tyka, M. D., DiMaio, F., Konerding, D. E. & Baker, D. Relaxation of backbone bond geometry improves protein energy landscape modeling. Protein Sci. 23, 47–55 (2014).
19. Mariani, V., Biasini, M., Barbato, A. & Schwede, T. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics 29, 2722–2728 (2013).
20. Critical Assessment of Techniques for Protein Structure Prediction Thirteenth Round—Abstract Book (Prediction Center, 2018);
21. Critical Assessment of Techniques for Protein Structure Prediction Fourteenth Round—Abstract Book (Prediction Center, 2020);
22. Heo, L., Arbour, C. F., Janson, G. & Feig, M. Improved sampling strategies for protein model refinement based on molecular dynamics simulation. J. Chem. Theory Comput. 17, 1931–1943 (2021).
23. Shuid, A. N., Kempster, R. & McGuffin, L. J. ReFOLD: a server for the refinement of 3D protein models guided by accurate quality estimates. Nucleic Acids Res. 45, W422–W428 (2017).
24. Kabsch, W. & Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637 (1983).
25. Igashov, I., Olechnovič, L., Kadukova, M., Venclovas, Č. & Grudinin, S. VoroCNN: deep convolutional neural network built on 3D Voronoi tessellation of protein structures. Bioinformatics (2021).
26. Zhang, J. & Zhang, Y. A novel side-chain orientation dependent potential derived from random-walk reference state for protein fold selection and structure prediction. PLoS ONE 5, e15386 (2010).
27. Won, J., Baek, M., Monastyrskyy, B., Kryshtafovych, A. & Seok, C. Assessment of protein model structure accuracy estimation in CASP13: challenges in the era of deep learning. Proteins Struct. Funct. Bioinf. 87, 1351–1360 (2019).
28. Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).
29. Rao, R. et al. MSA transformer. Preprint at bioRxiv (2021).
30. Dawson, N. L. et al. CATH: an expanded resource to predict protein function through structure and sequence. Nucleic Acids Res. 45, D289–D295 (2017).
31. Wang, G. & Dunbrack, R. L. PISCES: a protein sequence culling server. Bioinformatics 19, 1589–1591 (2003).
32. Thomas, N. et al. Tensor field networks: rotation- and translation-equivariant neural networks for 3D point clouds. Preprint at (2018).
33. Huang, B. & Carley, K. M. Residual or gate? Towards deeper graph neural networks for inductive graph representation learning. Preprint at (2019).
34. Wang, M. et al. Deep Graph Library: a graph-centric, highly-performant package for graph neural networks. Preprint at (2020).
35. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32 (eds Wallach, H. et al.) 8026–8037 (Curran Associates, 2019).
36. Zhou, H. & Zhou, Y. Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci. 11, 2714–2726 (2002).
37. Park, H. et al. Simultaneous optimization of biomolecular energy functions on features from small molecules and macromolecules. J. Chem. Theory Comput. 12, 6201–6212 (2016).
38. Xu, J. Data for protein model refinement and model quality assessment. Zenodo
39. Jing, X. GNNRefine: fast and effective protein model refinement by deep graph neural networks (Code Ocean, 2021);


Acknowledgements

We thank D. Baker’s team, including H. Park, who provided us with the DeepAccNet training data and helpful comments on our manuscript. We are also grateful to L. Heo for explaining FEIG and FEIG-S to us. This work is supported by National Institutes of Health grant no. R01GM089753 to J.X. and National Science Foundation grant no. DBI1564955 to J.X. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information




X.J. conceived the research, developed GNNRefine and carried out the benchmarking experiments. J.X. built the in-house training data and guided the research. X.J. and J.X. analyzed the results and wrote the manuscript.

Corresponding author

Correspondence to Jinbo Xu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Nature Computational Science thanks Hahnbeom Park, Lim Heo and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Handling editor: Jie Pan, in collaboration with the Nature Computational Science team.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Extended Data Fig. 1 Quality improvement by different methods on the CASP13 refinement targets.

Boxplot of the distribution of ΔGDT-HA, ΔGDT-TS and ΔlDDT on the CASP13 refinement targets. From top to bottom, the five lines in each boxplot mark the maximum (Q3 + 1.5 IQR), the third quartile (Q3, 75th percentile), the median (50th percentile), the first quartile (Q1, 25th percentile) and the minimum (Q1 − 1.5 IQR), where IQR = Q3 − Q1. The precision is 2.
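As a worked illustration of the five boxplot lines defined in this caption, the sketch below computes them for a list of score changes. The input values are made up, the function name `boxplot_lines` is hypothetical, and the quartile interpolation rule (Python's "inclusive" method) is an assumption, since the caption does not say which rule was used.

```python
import statistics

def boxplot_lines(values):
    """Return the five boxplot lines as defined in the caption.

    Quartiles use the "inclusive" interpolation rule, which is an
    assumption of this sketch.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range, Q3 - Q1
    return {
        "maximum": q3 + 1.5 * iqr,
        "q3": q3,
        "median": median,
        "q1": q1,
        "minimum": q1 - 1.5 * iqr,
    }

# Made-up score changes for nine hypothetical targets.
deltas = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
lines = boxplot_lines(deltas)
```

Here Q1 = 0, the median is 2 and Q3 = 4, so IQR = 4 and the outer lines fall at 10 and −6. Note that some plotting libraries instead draw whiskers at the most extreme data points inside these limits, so a rendered plot may differ from the formula values.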

Source data

Extended Data Fig. 2 Quality improvement by different methods on the CASP14 refinement targets.

Boxplot of the distribution of ΔGDT-HA, ΔGDT-TS and ΔlDDT on the CASP14 refinement targets. From top to bottom, the five lines in each boxplot mark the maximum (Q3 + 1.5 IQR), the third quartile (Q3, 75th percentile), the median (50th percentile), the first quartile (Q1, 25th percentile) and the minimum (Q1 − 1.5 IQR), where IQR = Q3 − Q1. The precision is 2.

Source data

Supplementary information

Supplementary Information

Supplementary sections 1–12, Figs. 1–6 and Tables 1–23.

Source data

Source Data Fig. 2

The PDB structure files for Fig. 2.

Source Data Fig. 3

The running time data for GNNRefine and DeepAccNet.

Source Data Extended Data Fig. 1

The source data used to draw the boxplot of Extended Data Fig. 1.

Source Data Extended Data Fig. 2

The source data used to draw the boxplot of Extended Data Fig. 2.

About this article

Cite this article

Jing, X., Xu, J. Fast and effective protein model refinement using deep graph neural networks. Nat Comput Sci 1, 462–469 (2021).
