Table 1 Performance of AlphaProof on formal mathematics benchmarks

From: Olympiad-level formal mathematical reasoning with reinforcement learning

| Name | Compute budget (per problem) | miniF2F-test [20] | formal-imo | PutnamBench-test [21] |
| --- | --- | --- | --- | --- |
| Previous state of the art (before IMO 2024) | | | | |
| GPT-F expert iteration [24] | – | 36.6% | – | – |
| Hypertree proof search [17] | – | 41.0% | – | – |
| InternLM2-Math-Plus-7B [23] | – | 43.4% | – | – |
| Previous state of the art (after IMO 2024) | | | | |
| Kimina-Prover Preview [25] | – | 80.7% | – | 1.6% |
| DeepSeek-Prover-V2 [26] | – | 88.9% | – | 5.3% |
| This work | | | | |
| AlphaProof | 2 TPU minutes | 96.3% | 33.2% | 27.9% |
| AlphaProof | 12 TPU hours | 97.7% | 43.7% | 39.4% |
| AlphaProof with TTRL | 50 TPU days | 97.5% | 53.9% | 45.5% |
| AlphaProof with TTRL | 500 TPU days | 99.6% | 58.3% | 56.1% |

  1. AlphaProof is compared against other methods, using the strongest reported result for each system, corresponding to their largest compute budgets. Cells with ‘–’ indicate unavailable compute budgets or evaluation results. For AlphaProof, the reported ‘compute budget (per problem)’ refers to the average computational cost as defined in Fig. 4a and is an inference-time budget only that does not include the amortized cost of the main RL training loop. AlphaProof’s miniF2F-test results are reported on a corrected version of the dataset (see Methods for details). Results for other methods on miniF2F-test are as reported in their respective publications, based on the dataset versions they utilized; direct comparison should therefore be made with caution, considering potential dataset differences. For PutnamBench-test, scores for previous work were recalculated based on the publicly available proofs reported by each system, for consistent comparison against our PutnamBench-test split; these may differ from figures reported by those systems for the full PutnamBench.
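The recalculation described in the note for PutnamBench-test can be pictured as restricting each system's published proofs to the problems in a given split and dividing by the size of that split. The following is a minimal sketch under that assumption, not the authors' evaluation code; the function name `split_solve_rate` and all problem identifiers are hypothetical and chosen for illustration only.

```python
def split_solve_rate(solved_ids, split_ids):
    """Return the fraction of problems in `split_ids` that appear in `solved_ids`.

    Problems solved outside the split do not count toward the rate.
    """
    split = set(split_ids)
    if not split:
        raise ValueError("split must contain at least one problem")
    return len(split & set(solved_ids)) / len(split)


# Hypothetical identifiers, for illustration only (not real benchmark data).
test_split = ["p1", "p2", "p3", "p4"]
published_proofs = ["p2", "p4", "p9"]  # "p9" lies outside the split and is ignored

rate = split_solve_rate(published_proofs, test_split)
print(f"{rate:.1%}")  # prints 50.0%
```

Because the denominator is the split rather than the full benchmark, a rate computed this way can differ from the figure a system reports for the full problem set, which is the caveat the note raises.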