MLGym: A New Framework and Benchmark for Advancing AI Research Agents Paper • 2502.14499 • Published Feb 20 • 193
ProjectTest: A Project-level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms Paper • 2502.06556 • Published Feb 10 • 3
MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks Paper • 2507.12284 • Published Jul 16 • 7
Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data? Paper • 2309.08963 • Published Sep 16, 2023 • 12
Lean Meets Theoretical Computer Science: Scalable Synthesis of Theorem Proving Challenges in Formal-Informal Pairs Paper • 2508.15878 • Published Aug 21 • 1
Compact Neural Graphics Primitives with Learned Hash Probing Paper • 2312.17241 • Published Dec 28, 2023 • 8
From Theory to Practice: Plug and Play with Succinct Data Structures Paper • 1311.1249 • Published Nov 5, 2013 • 1
Health system learning achieves generalist neuroimaging models Paper • 2511.18640 • Published Nov 2025 • 3
Pre-trained knowledge elevates large language models beyond traditional chemical reaction optimizers Paper • 2509.00103 • Published Aug 27 • 1
CHESS: Contextual Harnessing for Efficient SQL Synthesis Paper • 2405.16755 • Published May 27, 2024 • 2
Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data Paper • 2406.14546 • Published Jun 20, 2024 • 3
The Frontier of Data Erasure: Machine Unlearning for Large Language Models Paper • 2403.15779 • Published Mar 23, 2024 • 1
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery Paper • 2408.06292 • Published Aug 12, 2024 • 127
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? Paper • 2506.11928 • Published Jun 13 • 24