SWE-bench has become the industry-standard benchmark for evaluating autonomous software engineering agents, used extensively by major language model providers. It presents language models with real-world software engineering tasks drawn from GitHub repositories, challenging them to resolve complex issues that require deep codebase understanding and reasoning beyond typical code generation.
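
Concretely, each task instance pairs a GitHub issue with the repository state it was filed against and the tests a correct fix must pass. The sketch below is a minimal illustration of inspecting one instance, assuming the public Hugging Face release of the dataset and its usual field names (which may vary between versions):

```python
# Minimal sketch: load SWE-bench task instances with the Hugging Face
# `datasets` library and inspect the key fields of one instance.
# Field names assume the public release and may differ across versions.
from datasets import load_dataset

# The "test" split holds the evaluation instances.
swebench = load_dataset("princeton-nlp/SWE-bench", split="test")

instance = swebench[0]
print(instance["instance_id"])        # unique task identifier
print(instance["repo"])               # source GitHub repository
print(instance["base_commit"])        # commit the agent starts from
print(instance["problem_statement"])  # the issue text shown to the agent
print(instance["FAIL_TO_PASS"])       # tests that must pass after the fix
```

An agent is judged purely on whether its generated patch, applied at the base commit, makes the failing tests pass without breaking the rest of the suite.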

While I joined the team after SWE-bench had already been published, I was involved with our follow-up project, SWE-bench Multimodal, in particular working on baselines with SWE-agent. SWE-bench Multimodal generalizes the SWE-bench framework to typical frontend engineering tasks, shifting the focus from Python to JavaScript and requiring visual understanding in addition to reasoning and agentic abilities. Agents evaluated on this benchmark must interpret images provided within task descriptions and use visual feedback while resolving and validating issues. Current state-of-the-art models continue to find the benchmark exceptionally difficult, solving fewer than 25% of the included tasks.