SWE-bench
An evaluation framework that tests LLMs on their ability to resolve real-world GitHub issues through autonomous software engineering.
SWE-bench benchmarks large language models by tasking them with resolving 2,294 real GitHub issues drawn from popular open-source Python repositories such as django/django and scikit-learn/scikit-learn. Unlike static coding tests, it requires models to navigate large codebases, edit multiple files, and produce patches that are verified by running the repository's unit tests, including the tests that the original human fix made pass. By measuring the percentage of issues successfully resolved (% Resolved), it provides a rigorous metric for the practical autonomy of AI coding agents.
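As a rough illustration of how the benchmark is structured, the sketch below loads task instances and computes a resolved rate from per-instance outcomes. It assumes the dataset is published on the Hugging Face Hub as princeton-nlp/SWE-bench with fields such as repo and problem_statement (field names may vary by release), and the instance IDs and outcomes shown are hypothetical examples, not real evaluation results.

```python
# Minimal sketch: inspect SWE-bench task instances and compute a resolved rate.
# Assumes the dataset is available as "princeton-nlp/SWE-bench" on the Hugging
# Face Hub and exposes fields like "repo" and "problem_statement".
from datasets import load_dataset

dataset = load_dataset("princeton-nlp/SWE-bench", split="test")

# Each instance pairs a GitHub issue with the repository state it was filed
# against; a model must produce a patch that makes the failing tests pass.
example = dataset[0]
print(example["repo"])                      # e.g. "django/django"
print(example["problem_statement"][:200])   # the issue text the model sees


def resolved_rate(results: dict[str, bool]) -> float:
    """Fraction of instances whose generated patch passed the target tests
    without breaking the existing test suite."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)


# Hypothetical per-instance outcomes from an evaluation harness run.
outcomes = {
    "django__django-11099": True,
    "scikit-learn__scikit-learn-13142": False,
}
print(f"% Resolved: {100 * resolved_rate(outcomes):.1f}")
```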