Introducing AutoBench 2.0: Our New Benchmarking Platform Is Out Just in Time to Evaluate GPT 5.2

Published December 17, 2025

We are also announcing our latest benchmark run (Run 5), made possible by three new platform features for more powerful and efficient benchmarking: Random Score Pooling, Nonlinear Weighting, and Parallel Iteration.


Today, we are releasing AutoBench 2.0, the new version of AutoBench. The new platform achieves the same level of accuracy and stability as AutoBench 1.0 with about half the number of evaluations (scores).

We are also releasing the results of Run 5 (Dec 2025), our latest LLM benchmark, generated by the new platform. This run offers the largest set of models to date (35 models) and a number of iterations (315 generated questions) comparable to our past benchmarks. However, it was generated with only 110,000 scores, about half the number required in past runs, without compromising benchmark quality in any way and while achieving high correlation with established benchmarks.

For full details, check out the AutoBench Leaderboard or visit autobench.org.


AutoBench 2.0: Powerful New Features

In Version 2.0, we have re-engineered the "Collective-LLM-as-a-Judge" engine to be both faster and more efficient.

  1. Random Score Pooling: Instead of using a fixed set of scoring models (rankers), we now pool a set of n random models (out of all the benchmark models) for every answer-scoring session. This lets us significantly reduce the number of scores per generated answer while increasing the representation of all models as rankers in the benchmark. It also adds noise, which lets the system explore the "LLM performance space" more thoroughly; this intentional variance is then effectively compensated for by our new nonlinear weighting system.

  2. Nonlinear Weighting: We have moved beyond simple linear averaging of scores by applying nonlinear weights (available options: exponential, power-law, asymptotic, Boltzmann). This has proven essential for improving convergence and evaluation quality in the presence of high-performance models such as Gemini 3 Pro, Claude Opus 4.5, and the recently released GPT 5.2.

  3. Parallel Iteration: A new parallelized architecture allows us to run multiple iterations simultaneously, completing in a few hours benchmarks that would previously have taken days. (A minimal sketch of how these three mechanisms fit together follows this list.)
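
To make the mechanics concrete, here is a minimal Python sketch of how the three features could fit together. It is an illustration under stated assumptions, not the actual AutoBench implementation: the names (`score_answer`, `POOL_SIZE`, `BETA`) are hypothetical, and the exponential weighting shown is just one of the options listed above.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 5   # hypothetical: size of the random ranker pool per session
BETA = 2.0      # hypothetical temperature for the exponential weighting

def score_answer(ranker: str, answer: str) -> float:
    """Placeholder: ask `ranker` to grade `answer` on a 1-5 scale."""
    return random.uniform(1.0, 5.0)

def pooled_score(answer: str, all_models: list[str]) -> float:
    # 1) Random Score Pooling: sample a fresh set of n rankers for this
    #    scoring session instead of using a fixed panel.
    rankers = random.sample(all_models, POOL_SIZE)
    scores = [score_answer(r, answer) for r in rankers]

    # 2) Nonlinear Weighting (exponential variant): higher scores receive
    #    disproportionately more weight than a plain linear average would
    #    give, sharpening separation among top models.
    weights = [math.exp(BETA * s) for s in scores]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def run_iteration(question: str, all_models: list[str]) -> dict[str, float]:
    """One iteration: every benchmark model answers one generated question."""
    return {m: pooled_score(f"{m}'s answer to: {question}", all_models)
            for m in all_models}

# 3) Parallel Iteration: run many iterations concurrently instead of
#    sequentially, cutting wall-clock time from days to hours.
models = [f"model-{i}" for i in range(35)]
questions = [f"generated question {i}" for i in range(8)]
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(lambda q: run_iteration(q, models), questions))
```

In this toy setup, the variance introduced by random pooling shows up as spread across sessions, and the nonlinear weights damp its effect on the aggregate score, which is exactly the interplay described in points 1 and 2 above.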

The result is a system that preserves all the benefits of the original AutoBench (gaming resistance, granularity, cost-effectiveness), correlates 89.38% with the Artificial Analysis Index, 82.21% with MMLU-Pro, and 71.84% with human preference (LMArena), and provides a more powerful and efficient tool for LLM evaluation.
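
For readers who want to reproduce such comparisons: a correlation figure of this kind is conventionally a Pearson coefficient computed over the models two benchmarks share, and we assume that convention in the minimal sketch below. The score values are placeholders, not Run 5 data.

```python
from scipy.stats import pearsonr

# Placeholder scores for five models shared by two benchmarks;
# NOT actual Run 5 or Artificial Analysis numbers.
autobench_scores = [4.48, 4.43, 4.41, 4.32, 4.18]
other_benchmark = [68.0, 66.5, 66.9, 61.2, 58.0]

r, p_value = pearsonr(autobench_scores, other_benchmark)
print(f"correlation: {r:.2%}")  # a reported 89.38% corresponds to r = 0.8938
```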



Run 5 Headlines

  • New King of the Hill: GPT-5.2 Pro establishes a new SOTA with a score of 4.48, but at a steep price ($0.81/answer).
  • The "Thinking" Tax: Open-source reasoning models like Kimi-k2-thinking perform exceptionally well (Score: 4.32) but require 2x the latency of standard models.
  • The Efficiency Champion: GPT-oss-120b delivers 93% of SOTA performance for 0.1% of the cost, making it a highly efficient option for business- and industrial-scale deployments.
  • Scientific Validation: AutoBench 2.0 methodology now correlates 89.38% with the Artificial Analysis Index, cementing its status as a reliable "Ecosystem Consensus."

Run 5 Key Insights

1. Is GPT 5.2 Pro Worth the Price?


GPT 5.2 Pro's performance is truly remarkable, with the model dominating most domains. However, that performance comes at a steep price.

  • The King: GPT 5.2 Pro achieves the highest AutoBench score recorded (4.48).
  • The Cost: It costs $0.8188 per average answer.
  • The Challenger: GPT-5.2 (standard) scores 4.43 but costs only $0.0736 per answer, in line with other high-performing models such as Gemini 3 Pro (score: 4.41).

The Takeaway: With GPT 5.2 Pro you are paying an 11x price premium ($0.8188 vs. $0.0736 per answer) for an average 1.1% performance gain (4.48 vs. 4.43).

2. The Open Source "Slow and Deep" and "Good Enough" Zones


A new breed of "Slow and Deep" models is emerging:

  • Kimi-k2-thinking (Score 4.32) and Qwen3-235B-Thinking (Score 4.20) perform exceptionally well for Open Source models.
  • The Trade-off: They require massive compute time. Kimi-k2-thinking has an average latency of 248 seconds, nearly double that of standard GPT-5.2.

GPT-oss-120b is the efficiency champion of Run 5. With a score of 4.18 and a cost of $0.0011 per answer, it delivers 93% of the top model's intelligence (4.18 / 4.48) for roughly 0.1% of its cost ($0.0011 / $0.8188).

For 90% of business use cases (summarization, RAG, extraction), these models are a viable alternative to proprietary ones.

3. Domain Specificity

While GPT 5.2 leads in almost every domain, it does not lead everywhere. Our domain breakdown shows:

  • Gemini 3 Pro leads in the Math and General Culture domains.
  • Claude Opus 4.5 retains a slight edge in Science.
  • Claude Sonnet 4.5 leads in the Creative Writing domain.
  • DeepSeek V3.2 and Kimi K2 Thinking punch significantly above their weight in Coding tasks, making them strong open-weight alternatives for developers.



What's Next? Benchmarks for the Agentic Era

Static benchmarks encourage "teaching to the test." AutoBench 2.0 generates fresh, unseen questions every run. This is the only way to evaluate models for the Agentic Era, where AI agents will face novel, unpredictable situations that cannot be memorized.

We are now using this data to build the AutoBench API Router, an intelligent layer that routes your prompt to the optimal model to answer it, potentially saving enterprises 40-60% on inference costs without sacrificing quality.
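
As a purely hypothetical illustration of the routing idea (the `route` interface below is invented for this sketch and is not the AutoBench API; the score and cost figures are taken from Run 5 as reported above):

```python
# Hypothetical router sketch: pick the cheapest model whose benchmark
# score clears a caller-specified quality bar.
PROFILES = {
    #  model           (score, $ per average answer)  -- Run 5 figures
    "gpt-5.2-pro":  (4.48, 0.8188),
    "gpt-5.2":      (4.43, 0.0736),
    "gpt-oss-120b": (4.18, 0.0011),
}

def route(min_score: float) -> str:
    """Return the cheapest model meeting the quality threshold."""
    eligible = [(cost, model) for model, (score, cost) in PROFILES.items()
                if score >= min_score]
    if not eligible:
        # Nothing clears the bar: fall back to the strongest model.
        return max(PROFILES, key=lambda m: PROFILES[m][0])
    return min(eligible)[1]

print(route(4.0))   # "gpt-oss-120b": good enough, at a fraction of the cost
print(route(4.45))  # "gpt-5.2-pro": only the flagship clears this bar
```

Routing most traffic to the cheaper tiers while reserving the flagship for prompts that genuinely need it is where the projected 40-60% inference savings would come from.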

Gratitude to Our Partners

This ambitious run would not have been possible without our ecosystem partners:

  • Translated and Marco Trombetti: For their continued support in compute resources and strategic insight.
  • DIAG, University of Rome La Sapienza: The team led by Prof. Fabrizio Silvestri continues to provide the scientific rigor that validates our methodology.
  • eZecute: For enabling the industrialization of this platform.

Join the Community

AutoBench 1.0 is open-source. We invite you to explore the data, fork the repo, and join the discussion.
