HumanEval
HumanEval is the 164-problem benchmark developed by OpenAI to rigorously evaluate Large Language Models (LLMs) on functional code generation in Python.
HumanEval, released by OpenAI alongside the Codex model, assesses the functional correctness of LLM-generated code. It comprises 164 hand-written Python programming problems, each with a natural-language docstring prompt and an average of 7.7 hidden unit tests for objective evaluation. The core metric is `pass@k`, the probability that at least one of $k$ generated solutions passes all unit tests, which gives a clear, objective standard for model performance. The suite has become an industry standard for validating models such as GPT-4 and Codex.
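To make `pass@k` concrete, below is a minimal sketch of the unbiased estimator published with HumanEval: given `n` generated samples per problem of which `c` pass all unit tests, it computes the probability that at least one of `k` randomly drawn samples is correct.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated for the problem
    c: samples that passed all hidden unit tests
    k: evaluation budget (k <= n)
    """
    if n - c < k:
        # Too few failing samples to fill a draw of size k,
        # so at least one drawn sample must be correct.
        return 1.0
    # 1 minus the probability that all k drawn samples fail.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples, 50 correct, k = 1 -> 50/200 = 0.25
print(pass_at_k(200, 50, 1))
```

Computing `1 - C(n-c, k) / C(n, k)` directly, rather than naively raising `1 - c/n` to the k-th power, avoids the bias that the naive estimate introduces when sampling without replacement.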