Table 1 Performance of AlphaProof on formal mathematics benchmarks
From: Olympiad-level formal mathematical reasoning with reinforcement learning
| Name | Compute budget (per problem) | miniF2F-test [20] | formal-imo | PutnamBench-test [21] |
|---|---|---|---|---|
| Previous state of the art (before IMO 2024) | | | | |
| GPT-F expert iteration [24] | – | 36.6% | – | – |
| Hypertree proof search [17] | – | 41.0% | – | – |
| InternLM2-Math-Plus-7B [23] | – | 43.4% | – | – |
| Previous state of the art (after IMO 2024) | | | | |
| Kimina-Prover Preview [25] | – | 80.7% | – | 1.6% |
| DeepSeek-Prover-V2 [26] | – | 88.9% | – | 5.3% |
| This work | | | | |
| AlphaProof | 2 TPU minutes | 96.3% | 33.2% | 27.9% |
| AlphaProof | 12 TPU hours | 97.7% | 43.7% | 39.4% |
| AlphaProof with TTRL | 50 TPU days | 97.5% | 53.9% | 45.5% |
| AlphaProof with TTRL | 500 TPU days | 99.6% | 58.3% | 56.1% |
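
The AlphaProof rows span very different compute budgets, so it can help to put them on a common unit before comparing scores. The following is a minimal sketch, not part of the paper: it converts each budget in Table 1 to TPU minutes and reports the percentage-point change relative to the 2 TPU minute row. The helper name `gains_vs_baseline` and the choice of the first row as baseline are illustrative assumptions, and the rows with and without TTRL are different procedures, so the compute multiples are only a rough guide.

```python
# Rows copied from Table 1 (AlphaProof only); budgets converted to TPU minutes.
MINUTES_PER_HOUR = 60
MINUTES_PER_DAY = 24 * MINUTES_PER_HOUR

BENCHMARKS = ("miniF2F-test", "formal-imo", "PutnamBench-test")

rows = [
    # (system, budget in TPU minutes, miniF2F-test %, formal-imo %, PutnamBench-test %)
    ("AlphaProof", 2, 96.3, 33.2, 27.9),
    ("AlphaProof", 12 * MINUTES_PER_HOUR, 97.7, 43.7, 39.4),
    ("AlphaProof with TTRL", 50 * MINUTES_PER_DAY, 97.5, 53.9, 45.5),
    ("AlphaProof with TTRL", 500 * MINUTES_PER_DAY, 99.6, 58.3, 56.1),
]


def gains_vs_baseline(rows):
    """Hypothetical helper: compute multiple and percentage-point change vs. the first row."""
    base = rows[0]
    out = []
    for name, minutes, *scores in rows:
        ratio = minutes / base[1]  # how many times more compute than the baseline row
        deltas = {
            bench: round(score - base_score, 1)  # difference in percentage points
            for bench, score, base_score in zip(BENCHMARKS, scores, base[2:])
        }
        out.append((name, ratio, deltas))
    return out


for name, ratio, deltas in gains_vs_baseline(rows):
    print(f"{name:22s} x{ratio:>9,.0f} compute  {deltas}")
```

Working in percentage points rather than relative percentages keeps near-saturated benchmarks such as miniF2F-test from looking artificially flat or inflated next to formal-imo and PutnamBench-test.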