Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models Paper • 2405.20541 • Published May 30, 2024 • 24
RedPajama: an Open Dataset for Training Large Language Models Paper • 2411.12372 • Published Nov 19, 2024 • 56
Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback Paper • 2503.22230 • Published Mar 28 • 45
DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle Paper • 2512.04324 • Published 22 days ago • 149
OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value Paper • 2512.14051 • Published 10 days ago • 38
Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers Paper • 2512.17351 • Published 7 days ago • 20