To train language models to be good software engineers, we need large amounts of high-quality training data: examples of software bugs and the steps taken to fix them. Collecting such data is difficult, however; it often requires hours of manual work and complex setups that are hard to scale.

That's where SWE-smith comes in.

SWE-smith is a system that automatically creates realistic training data for AI agents that work with code. Given any existing software project, it builds a runnable version of the code and then generates hundreds or thousands of small tasks — for example, by artificially introducing a bug. We can then use existing language models to attempt to fix these bugs, keep only the successful attempts, and then use this data to train a new language model.
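The loop described above can be sketched in a few lines. This is a toy illustration with hypothetical helper names (`run_tests`, `attempt_fix`), not SWE-smith's actual code: inject a bug, let an existing model attempt a fix, and keep only attempts that the test suite verifies.

```python
# Toy sketch of the data-generation loop (hypothetical helpers, not SWE-smith's API):
# 1) artificially introduce a bug, 2) let a model attempt a fix,
# 3) keep only attempts that pass the tests.

def run_tests(source: str) -> bool:
    """Stand-in for running the project's test suite on `source`."""
    env = {}
    exec(source, env)
    return env["add"](2, 3) == 5  # the "suite" is a single check on add()

CLEAN = "def add(a, b):\n    return a + b\n"
BUGGY = "def add(a, b):\n    return a - b\n"   # artificially introduced bug

def attempt_fix(buggy: str) -> str:
    """Stand-in for an existing language model proposing a patch."""
    return buggy.replace("a - b", "a + b")

training_data = []
if run_tests(CLEAN) and not run_tests(BUGGY):   # the bug must break the tests
    candidate = attempt_fix(BUGGY)
    if run_tests(candidate):                    # keep successful attempts only
        training_data.append({"bug": BUGGY, "fix": candidate})

print(len(training_data))  # → 1
```

In the real pipeline the test runner is the repository's own suite and the fixer is an LM agent, but the filtering logic is the same: only verified fixes become training data.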

We used SWE-smith to generate over 50,000 tasks from 128 popular open-source repositories and used this data to train SWE-agent-LM-32B. This model achieved the top score among open-source models on SWE-bench Verified, a benchmark that tests AI agents on real-world software engineering tasks, and outperformed GPT-4o.

Despite recent progress in language models for software engineering, collecting training data remains a significant pain point. The procedures for curating such datasets are often complex, requiring hundreds of hours of human labor; companion execution environments also take up several terabytes of storage, severely limiting their scalability and usability.

To address this pain point, we introduce SWE-smith, a novel pipeline for generating software engineering training data at scale. Given any Python codebase, SWE-smith constructs a corresponding execution environment, then automatically synthesizes 100s to 1,000s of task instances that break existing tests in the codebase. Using SWE-smith, we create a dataset of 50k instances sourced from 128 GitHub repositories, an order of magnitude larger than all previous works.
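The key acceptance criterion above is that a synthesized change must break existing tests: a candidate edit that leaves the suite green is not a usable task instance. A minimal sketch of that check, with a toy two-test "suite" standing in for a real repository's tests:

```python
# Hedged sketch (not SWE-smith's actual implementation): a candidate synthetic
# bug is kept only if it flips at least one previously passing test to failing.

def run_suite(module_src: str) -> dict:
    """Stand-in test runner: returns {test_name: passed} for a toy suite."""
    env = {}
    exec(module_src, env)
    return {
        "test_upper": env["shout"]("hi") == "HI!",
        "test_exclaim": env["shout"]("ok").endswith("!"),
    }

ORIGINAL  = 'def shout(s):\n    return s.upper() + "!"\n'
CANDIDATE = 'def shout(s):\n    return s.upper()\n'  # candidate synthetic bug

before = run_suite(ORIGINAL)
after = run_suite(CANDIDATE)
broken = [name for name, ok in before.items() if ok and not after[name]]

is_task_instance = bool(broken)  # keep only bugs that break existing tests
print(broken)  # → ['test_upper', 'test_exclaim']
```

Because the broken tests are known for each instance, they double as an automatic grader: an attempted fix is resolved exactly when those tests pass again.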

We train SWE-agent-LM-32B, achieving a 40.2% Pass@1 resolve rate on the SWE-bench Verified benchmark, outperforming GPT-4o and setting the state of the art among open-source models. We open-source SWE-smith (collection procedure, task instances, trajectories, models) to lower the barrier to entry for research on LM systems for automated software engineering.