HumanEval
HumanEval is the 164-problem benchmark developed by OpenAI to rigorously evaluate Large Language Models (LLMs) on functional code generation in Python.
HumanEval, released by OpenAI alongside the Codex model, assesses the functional correctness of LLM-generated code. It comprises 164 hand-written Python programming problems, each with a natural-language docstring prompt and an average of 7.7 hidden unit tests for objective evaluation. The core metric is `pass@k`, the probability that at least one of $k$ generated solutions passes all unit tests, which gives a clear, objective standard for model performance. The suite has become an industry standard for validating models such as GPT-4 and Codex.
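To make `pass@k` concrete, below is a minimal sketch of the unbiased estimator published with HumanEval: given `n` generated samples per problem of which `c` pass all unit tests, it computes the probability that at least one of `k` randomly drawn samples is correct.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated for the problem
    c: samples that passed all hidden unit tests
    k: evaluation budget (k <= n)
    """
    if n - c < k:
        # Too few failing samples to fill a draw of size k,
        # so at least one drawn sample must be correct.
        return 1.0
    # 1 minus the probability that all k drawn samples fail.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples, 50 correct, k = 1 -> 50/200 = 0.25
print(pass_at_k(200, 50, 1))
```

Computing `1 - C(n-c, k) / C(n, k)` directly, rather than naively raising `1 - c/n` to the k-th power, avoids the bias that the naive estimate introduces when sampling without replacement.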