Table 1 Performance of AlphaProof on formal mathematics benchmarks

Name	Compute budget (per problem)	miniF2F-test²⁰	formal-imo	PutnamBench-test²¹
Previous state of the art (before IMO 2024)
GPT-F expert iteration²⁴	–	36.6%	–	–
Hypertree proof search¹⁷	–	41.0%	–	–
InternLM2-Math-Plus-7B²³	–	43.4%	–	–
Previous state of the art (after IMO 2024)
Kimina-Prover Preview²⁵	–	80.7%	–	1.6%
DeepSeek-Prover-V2²⁶	–	88.9%	–	5.3%
This work
AlphaProof	2 TPU minutes	96.3%	33.2%	27.9%
AlphaProof	12 TPU hours	97.7%	43.7%	39.4%
AlphaProof with TTRL	50 TPU days	97.5%	53.9%	45.5%
AlphaProof with TTRL	500 TPU days	99.6%	58.3%	56.1%

AlphaProof is compared against other methods using the strongest reported result for each system, which corresponds to its largest compute budget. Cells with ‘–’ indicate that the compute budget or evaluation result is unavailable. For AlphaProof, ‘compute budget (per problem)’ is the average computational cost as defined in Fig. 4a; it is an inference-time budget only and does not include the amortized cost of the main RL training loop. AlphaProof’s miniF2F-test results are reported on a corrected version of the dataset (see Methods for details). Results for other methods on miniF2F-test are as reported in their respective publications, on the dataset versions those works used, so direct comparisons should be made with caution given potential dataset differences. For PutnamBench-test, scores for previous work were recalculated from the publicly available proofs reported by each system to allow a consistent comparison on our PutnamBench-test split; these scores may differ from the figures those systems report for the full PutnamBench.
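The recalculation described for PutnamBench-test amounts to restricting each system's solved problems to the evaluation split before computing a solve rate. The following is a minimal sketch of that arithmetic only, not the procedure actually used; the function name `solve_rate` and all problem identifiers are hypothetical and chosen purely for illustration.

```python
def solve_rate(solved_ids, split_ids):
    """Return the percentage of problems in split_ids that appear in solved_ids.

    Both arguments are iterables of problem identifiers (hypothetical here).
    Problems solved outside the split are ignored.
    """
    split = set(split_ids)
    if not split:
        raise ValueError("split must contain at least one problem")
    solved_in_split = split & set(solved_ids)
    return 100.0 * len(solved_in_split) / len(split)


# Illustrative identifiers only; these are not real benchmark data.
full_benchmark_solved = {"p1", "p2", "p3", "p7"}   # proofs a system published
test_split = {"p2", "p3", "p4", "p5", "p6"}        # problems in the split

rate = solve_rate(full_benchmark_solved, test_split)
print(f"{rate:.1f}%")
```

Under this sketch, a system's score on the split can differ from its score on the full benchmark, because both the numerator and the denominator change.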
