SWE-bench
An evaluation framework that tests LLMs on their ability to resolve real-world GitHub issues through autonomous software engineering.
SWE-bench benchmarks large language models by tasking them with resolving 2,294 real GitHub issues drawn from popular open-source Python repositories such as django/django and scikit-learn/scikit-learn. Unlike static coding tests, it requires models to navigate large codebases, edit multiple files, and produce patches that are verified by running the repository's unit tests, including the tests that the original human fix made pass. By measuring the percentage of issues successfully resolved (% Resolved), it provides a rigorous metric for the practical autonomy of AI coding agents.
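As a rough illustration of how the benchmark is structured, the sketch below loads task instances and computes a resolved rate from per-instance outcomes. It assumes the dataset is published on the Hugging Face Hub as princeton-nlp/SWE-bench with fields such as repo and problem_statement (field names may vary by release), and the instance IDs and outcomes shown are hypothetical examples, not real evaluation results.

```python
# Minimal sketch: inspect SWE-bench task instances and compute a resolved rate.
# Assumes the dataset is available as "princeton-nlp/SWE-bench" on the Hugging
# Face Hub and exposes fields like "repo" and "problem_statement".
from datasets import load_dataset

dataset = load_dataset("princeton-nlp/SWE-bench", split="test")

# Each instance pairs a GitHub issue with the repository state it was filed
# against; a model must produce a patch that makes the failing tests pass.
example = dataset[0]
print(example["repo"])                      # e.g. "django/django"
print(example["problem_statement"][:200])   # the issue text the model sees


def resolved_rate(results: dict[str, bool]) -> float:
    """Fraction of instances whose generated patch passed the target tests
    without breaking the existing test suite."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)


# Hypothetical per-instance outcomes from an evaluation harness run.
outcomes = {
    "django__django-11099": True,
    "scikit-learn__scikit-learn-13142": False,
}
print(f"% Resolved: {100 * resolved_rate(outcomes):.1f}")
```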