

MMLU-Pro Evals

MMLU-Pro Evals is an advanced, high-stakes LLM benchmark: 12,000+ reasoning-focused questions, each with ten answer options, designed to counter benchmark saturation and sharply differentiate top-tier models.

MMLU-Pro is a next-generation evaluation suite for large language models, built to address the saturation of the original MMLU by increasing both difficulty and robustness. It comprises over 12,000 rigorously curated questions across 14 diverse domains, shifting the emphasis from simple knowledge recall to complex, multi-step reasoning. Crucially, MMLU-Pro expands the multiple-choice options from four to ten, lowering the random-guess baseline from 25% to 10% and improving discriminative power: it showed a 9% performance gap between GPT-4o and GPT-4-Turbo, compared to roughly 1% on the original MMLU. Top models such as Gemini 3 Pro (11/25) currently score around 90.1%, so the benchmark remains a high bar for general capability.
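As a minimal sketch of how the ten-option format is scored, the snippet below runs exact-match accuracy over hand-made records that mimic MMLU-Pro's schema (a list of options plus an answer letter A-J). The `sample` records, the `score` helper, and the predictions are all illustrative, not part of the official evaluation harness.

```python
import string

# Hypothetical sample records mimicking MMLU-Pro's schema: each question
# carries ten options and a gold answer letter in A..J.
sample = [
    {"question": "q1", "options": [f"opt{i}" for i in range(10)], "answer": "C"},
    {"question": "q2", "options": [f"opt{i}" for i in range(10)], "answer": "J"},
]

def score(records, predictions):
    """Exact-match accuracy over predicted answer letters."""
    letters = string.ascii_uppercase[:10]  # valid choices: "ABCDEFGHIJ"
    correct = 0
    for rec, pred in zip(records, predictions):
        assert rec["answer"] in letters and pred in letters
        correct += pred == rec["answer"]
    return correct / len(records)

# A random guesser averages 1/10 per question here, versus 1/4 on
# four-option MMLU, which is why the ten-option format is harder to game.
print(score(sample, ["C", "A"]))  # one of two correct -> 0.5
```

The real questions can be fetched with the Hugging Face `datasets` library via `load_dataset("TIGER-Lab/MMLU-Pro")` (the dataset linked below); the loop above would then iterate over those records instead of the toy sample.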

https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro


