MMLU
MMLU (Massive Multitask Language Understanding) is a widely used 57-subject benchmark for testing a large language model's (LLM) general knowledge and reasoning depth.
MMLU features 15,908 multiple-choice questions spanning 57 diverse academic and professional subjects (e.g., law, computer science, US history). The benchmark measures a model's breadth of world knowledge and problem-solving ability, pushing evaluation beyond simple conversational tasks. When it was released in 2020, the top model (GPT-3 175B) scored 43.9%; today's leading models such as GPT-4o score around 88%. We use MMLU to track model generalization and identify critical shortcomings across specialized domains.
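To make the scoring concrete, here is a minimal sketch of how accuracy is computed on MMLU-style four-option questions. The `MCQuestion` structure, the two sample questions, and the `always_b` placeholder model are hypothetical illustrations, not part of the official dataset or any evaluation harness:

```python
from dataclasses import dataclass

@dataclass
class MCQuestion:
    subject: str      # one of the 57 MMLU subjects
    question: str
    choices: list     # four answer options, MMLU-style
    answer: int       # index of the correct choice

# Hypothetical mini-sample in MMLU's four-option format.
sample = [
    MCQuestion("us_history",
               "In which year was the US Constitution signed?",
               ["1776", "1787", "1800", "1812"], 1),
    MCQuestion("computer_science",
               "What is the worst-case time complexity of binary search?",
               ["O(1)", "O(log n)", "O(n)", "O(n log n)"], 1),
]

def accuracy(questions, predict):
    """Fraction of questions where predict(q) matches the gold index."""
    correct = sum(1 for q in questions if predict(q) == q.answer)
    return correct / len(questions)

# A placeholder 'model' that always picks the second option.
always_b = lambda q: 1
print(accuracy(sample, always_b))  # 1.0 on this tiny sample
```

Reported MMLU scores are this same accuracy computed over all 15,908 questions, often broken down per subject to expose weak domains.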