UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding Paper • 2606.07167 • Published about 1 month ago • 1
TABVERSE: Benchmarking Cross-Format Table Understanding in LLMs and VLMs Paper • 2606.09578 • Published 28 days ago
Cultural Benchmarking of LLMs in Standard and Dialectal Arabic Dialogues Paper • 2605.00119 • Published Apr 30
SAHM: A Benchmark for Arabic Financial and Shari'ah-Compliant Reasoning Paper • 2604.19098 • Published Apr 30
NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors Paper • 2506.10627 • Published Jun 12, 2025
A Parallel Cross-Lingual Benchmark for Multimodal Idiomaticity Understanding Paper • 2601.08645 • Published Feb 24
Can LLM Agents Be CFOs? A Benchmark for Resource Allocation in Dynamic Enterprise Environments Paper • 2603.23638 • Published Mar 24 • 11
FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs Paper • 2510.08886 • Published Oct 10, 2025 • 20
When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents Paper • 2510.11695 • Published Oct 13, 2025 • 3
FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR Evaluation Paper • 2511.14998 • Published Nov 19, 2025
Ebisu: Benchmarking Large Language Models in Japanese Finance Paper • 2602.01479 • Published Feb 1 • 17