Graph Neural Networks for Charged Particle Tracking

Research at Princeton University with IRIS-HEP
Since July 2022

The reconstruction of charged particle trajectories ("tracking") in particle physics detectors is one of the most computationally challenging tasks in the field, limiting the amount of high-quality data that can even be recorded. At particle collider experiments such as the CMS experiment, this task differs from many other problems that involve trajectories: there are millions of particle collisions per second, each with thousands of individual particles that need to be tracked; there is no time information (the particles travel too fast); and we do not observe a continuous trajectory, but only 5-15 points ("hits") along the way in different detector layers. The task can be described as a combinatorially very challenging "connect-the-dots" problem, essentially turning a cloud of points (hits) in 3D space into a set of O(1000) trajectories.

Unlike traditional tracking algorithms built around Kalman filters, this project uses graph neural networks, aiming for significant speed increases. A conceptually simple way to turn tracking into a machine learning task is to create a fully connected graph of all hits and then train an edge classifier to reject any edge that connects hits from different particles; the individual trajectories then remain as connected components of the initial fully connected graph. In this project, we instead explore the idea of object condensation, or learned clustering: a network maps all hits to a latent space, learning to place hits from the same track close to each other, such that simple clustering operations can recover the hits belonging to the same track.

Embedding of detector hits for charged particle tracking before and after training.
Charged particle tracking as an embedding task: the left side shows a t-SNE embedding of the raw hit features, with hits belonging to some (randomly selected) particles colored. Our learned embedding maps hits belonging to the same particle to the same place (right panel), such that tracks can be recovered by a simple clustering operation.
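
To make the learned-clustering idea concrete, here is a minimal sketch (not the actual project code: the MLP standing in for the graph neural network, the feature and latent dimensions, and the simplified attraction/repulsion loss are all illustrative assumptions):

```python
# Minimal sketch of the learned-clustering idea (all dimensions, the MLP
# standing in for the GNN, and the simplified loss are illustrative).
import torch
import torch.nn as nn

class HitEmbedder(nn.Module):
    """Maps per-hit features into a latent space where tracks form clusters."""

    def __init__(self, n_features: int = 6, latent_dim: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, hits: torch.Tensor) -> torch.Tensor:
        return self.mlp(hits)

def attraction_repulsion_loss(z, particle_id, margin=1.0):
    """Pull hits of the same particle together, push other pairs apart."""
    dist = torch.cdist(z, z)                  # pairwise latent distances
    same = particle_id.unsqueeze(0) == particle_id.unsqueeze(1)
    attract = dist[same].pow(2).mean()        # same-track pairs: small distance
    repulse = (margin - dist[~same]).clamp(min=0).pow(2).mean()
    return attract + repulse

# Toy usage: embed hits and compute the loss; at inference time, a clustering
# algorithm (e.g. DBSCAN) applied to z recovers the track candidates.
hits = torch.randn(100, 6)                    # 100 hits with 6 features each
pid = torch.randint(0, 10, (100,))            # toy particle labels
z = HitEmbedder()(hits)
loss = attraction_repulsion_loss(z, pid)
```

The actual object condensation loss is more involved (it handles noise hits and learns a "condensation point" per track), but the principle is the same: distances in the latent space, not in the detector, decide which hits belong together.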

Coordinating Software Training and Education efforts in High Energy Physics

Experimental high energy physics collaborations are tasked with analyzing petabytes of data, necessitating an ever-evolving, ever more complex software stack. Delivering the best possible science depends crucially on the software skills of a large workforce of researchers. Keeping up with the latest big data tools and technologies requires extensive training, covering everything from programming best practices to the latest industry tools and experiment-specific software frameworks.

From 2020 to 2022, I led the Belle II Software Training and Documentation group, which organizes training events and provides training material, primarily focusing on getting researchers up to speed with the Belle II software framework. In 2020, and again since 2022, I have also been coordinating software training across experiments as one of the conveners of the HSF Training Group. In addition, I have taught the basics of programming paradigms and software design patterns to more than 500 participants.

The HSF Training Center
Part of the new HSF Software Training Center that serves as an entry point for anyone wishing to learn with our materials.

Past projects

\(\bar B\longrightarrow D^*\ell^-\bar\nu_\ell\) Decays with Hadronic Tagging at Belle

PhD research at LMU Munich
2018 - June 2022

The decay \(\bar B\longrightarrow D^*\ell^-\bar\nu_\ell\) is used to precisely determine the CKM matrix element \(|V_{cb}|\), an important ingredient for tests of the flavor sector of the Standard Model. It is also the normalization channel for measurements of \(R(D^*)\), one of the key quantities of the flavor anomalies that recently sparked a flurry of interest in the field. Improving our understanding of \(\bar B\longrightarrow D^*\ell^-\bar\nu_\ell\) might help to understand and improve analyses of \(R(D^*)\) as well.

Reconstruction of a tag side \(B\) meson in addition to the semileptonically decaying \(B\) allows for a very clean data sample. Using the large Belle dataset but applying Belle II software for the analysis, we can improve upon previous studies: the Belle II Full Event Interpretation, a machine learning algorithm to reconstruct the tag side \(B\) meson, is almost twice as efficient as previously used algorithms. However, careful calibration studies are needed to address inconsistencies in its efficiency between data and Monte Carlo simulation.

Calibration factors for the Belle II Full Event Interpretation algorithm.
Calibration factors for the Belle II Full Event Interpretation algorithm.
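
Schematically, such a calibration factor is the ratio of the tag-side efficiency measured in data to the one predicted by simulation, applied as a per-category weight to the simulated sample (the actual procedure in the analysis is more involved):

\[
\epsilon_{\mathrm{cal}} = \frac{\varepsilon_{\mathrm{tag}}^{\mathrm{data}}}{\varepsilon_{\mathrm{tag}}^{\mathrm{MC}}}
\]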

Clustering of kinematic graphs

PhD research at LMU Munich
2018-2019
New Physics can manifest itself in kinematic distributions of particle decays. The parameter space defining the shape of such distributions can be large, which is challenging for both theoretical and experimental studies. Using clustering algorithms, the parameter space can however be dissected into subsets (clusters) that correspond to similar kinematic distributions. Clusters can then be represented by benchmark points, which allow for less involved studies and a concise presentation of the results. To demonstrate this concept, I have written the Python package ClusterKinG, an easy-to-use framework for the clustering of distributions that particularly aims to make these techniques more accessible in a High Energy Physics context. As a physics use case, its application has been demonstrated for the kinematic distributions of \(\bar B \longrightarrow D^{(*)}\tau^-\bar\nu_\tau\).
Example plot produced by ClusterKinG
Example plot produced by ClusterKinG: clustering a three-dimensional parameter space results in three distinct clusters. Benchmark points are highlighted with enlarged markers.
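
The underlying workflow can be sketched in a few lines (a toy illustration, not the ClusterKinG API: the polynomial toy distributions, the simplified chi-squared metric, and the use of SciPy's hierarchical clustering are assumptions):

```python
# Toy sketch: cluster points of a parameter space by the similarity of the
# kinematic distributions they generate (schematic, not the ClusterKinG API).
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
params = rng.uniform(-1, 1, size=(50, 3))   # 50 points in a 3D parameter space

def binned_distribution(p, bins=10):
    """Toy stand-in for a predicted kinematic distribution (normalized histogram)."""
    x = np.linspace(0, 1, bins)
    shape = np.clip(1 + p[0] * x + p[1] * x**2 + p[2] * x**3, 1e-9, None)
    return shape / shape.sum()

dists = np.array([binned_distribution(p) for p in params])

def chi2(d1, d2):
    """Simplified chi-squared distance between two binned distributions."""
    return np.sum((d1 - d2) ** 2 / (d1 + d2))

condensed = pdist(dists, metric=chi2)
clusters = fcluster(linkage(condensed, method="complete"), t=3, criterion="maxclust")

# Benchmark points: e.g. the parameter point closest to each cluster's center.
for c in np.unique(clusters):
    members = params[clusters == c]
    benchmark = members[np.argmin(np.linalg.norm(members - members.mean(0), axis=1))]
    print(f"cluster {c}: {len(members)} points, benchmark {benchmark.round(2)}")
```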

Belle II Software Integration and Performance Testing

Since 2018
Since 2018, I have been the maintainer of the Belle II validation framework, which tests the functionality and overall performance of the Belle II software. Each software package provides a selection of scripts (Python or C++) that run on small-scale realistic data samples. The validation framework resolves dependencies between these scripts, executes them on a central server, and uses different metrics to detect inconsistencies and performance degradations. The results are visualized on a dynamic website.
Website showing the detailed results of the validation package.
Web server reporting on the detailed results of the latest validation run.
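
The dependency resolution step amounts to a topological sort over the scripts. A minimal sketch (the script names and the hard-coded dependency map are hypothetical stand-ins for however the framework actually declares dependencies):

```python
# Minimal sketch of dependency resolution between validation scripts
# (hypothetical script names and dependency declarations).
from graphlib import TopologicalSorter

# Each script lists the scripts whose output files it needs.
dependencies = {
    "generate_events.py": [],
    "simulate_detector.py": ["generate_events.py"],
    "reconstruct.py": ["simulate_detector.py"],
    "make_validation_plots.py": ["reconstruct.py", "generate_events.py"],
}

# graphlib orders the scripts so that every script runs after its dependencies.
for script in TopologicalSorter(dependencies).static_order():
    print(f"running {script}")  # in the real framework: execute, collect metrics
```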

Construction of Angular Observables Sensitive to New Physics in \(\bar B\longrightarrow D^* \tau^-\bar\nu_\tau\) Decays and Measurements of Differential Cross Sections of \(\bar B\longrightarrow D^*\ell^-\bar\nu_\ell\) Decays with Hadronic Tagging at Belle

Thesis (M. Sc.) at LMU Munich, TU Munich
2017-2018

Recent measurements of \(\bar B\longrightarrow D^{(*)}\ell^-\bar\nu_\ell\) at Belle, BaBar and LHCb challenge lepton universality and thus the Standard Model at a combined significance close to four standard deviations. New measurements of differential decay rates could contribute to the understanding of these anomalies.

The differential cross section of the decay \(\bar B\longrightarrow D^*(\rightarrow D\pi)\ell^-\bar\nu_\ell\) is parametrized by its dependence on the three decay angles and on the coupling constants of potential new physics contributions. Observables based on binned measurements of the differential cross section are characterized and explicitly constructed, and optimal binnings for such measurements are derived from an estimate of the obtainable sensitivity. The discriminatory power of the resulting observables is assessed for a basis of dimension-six operators with renormalizable couplings contributing to \(\bar B\longrightarrow D^*\ell^-\bar\nu_\ell\).
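
The simplest instance of such a binned observable is a forward-backward asymmetry, sketched here for a generic decay angle \(\theta\):

\[
A_{\mathrm{FB}} = \frac{1}{\Gamma}\left(\int_0^1 \mathrm{d}\cos\theta - \int_{-1}^0 \mathrm{d}\cos\theta\right)\frac{\mathrm{d}\Gamma}{\mathrm{d}\cos\theta},
\]

i.e. the normalized difference between the decay rates in two \(\cos\theta\) bins; new physics contributions that distort the angular shape shift the values of such asymmetries.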

Furthermore, continuing work on an analysis of the \(\bar B\longrightarrow D^*(\rightarrow D\pi)\ell^-\bar\nu_\ell\) decay channel for \(\ell = e, \mu\) using data from the Belle detector at KEKB is presented. The events are selected from 772 million \(e^+e^- \longrightarrow \Upsilon(4S) \longrightarrow B\bar B\) events in which one \(B\) meson is fully reconstructed in hadronic modes. Unfolded differential decay rates in four kinematic variables are presented separately for \(\ell = e, \mu\) and for a combined fit, allowing for precise calculations of \(|V_{cb}|\) and the \(B\longrightarrow D^*\) form factors. The new lepton-flavor-specific results are also expected to impact the discussion about potential light-lepton flavor universality violations prompted by measurements of \(B\longrightarrow K^{(*)}\ell\ell\) decays.

World averages for the measurements of \( R(D) \) and \( R(D^*) \)
The world average for the measurements of the observables \( R(D^{(*)}) \) currently shows a \( 4\sigma \) deviation from the Standard Model. Result of the Heavy Flavor Averaging Group from 2017.

Complex Organic Molecules in Protoplanetary Disks

Summer Project at TITECH
July 2017 to September 2017

Complex Organic Molecules (COMs) in protoplanetary disks have been the subject of extensive studies using chemical reaction networks (CRNs) (e.g. Walsh et al., 2014). The accuracy of these models depends on our knowledge of the relevant chemical processes. Some classes of reactions have been comprehensively studied, resulting in large databases like the UMIST database of astrochemistry, which lists more than 6000 gas-phase reactions. However, other classes of reactions, such as grain-surface reactions, still pose challenges.

By expanding the previously studied CRNs with additional grain-surface reactions that are currently being studied in new laboratory experiments (and have so far mostly been considered in the context of meteorites), we tried to improve the description of COM formation in protoplanetary disks. More specifically, I used the existing simulation code to investigate the influence of physical and chemical parameters, such as temperature, density and activation energies, on the time evolution of the chemistry found on grains. To automate some of the time-consuming manual tasks necessary for such studies, I wrote a framework to repeatedly run the simulation with different settings and to visualize the resulting datasets. This framework will enable future students to conduct similar studies much more efficiently, thereby opening new research possibilities.

Flowchart of the analysis framework I wrote
Analysis framework, repeatedly calling the chemical reaction network (CRN), storing output and log files, parsing them and finally bringing them together for plotting/analysis.
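
The core of such a driver fits in a few lines (a minimal sketch: the executable name crn_sim, the parameter names, and the directory layout are hypothetical):

```python
# Minimal sketch of a parameter-sweep driver around an external simulation
# (hypothetical executable "crn_sim" and parameter names).
import itertools
import json
import subprocess
from pathlib import Path

grid = {
    "temperature": [10, 20, 50],   # K
    "density": [1e6, 1e8],         # cm^-3
}

outdir = Path("runs")
outdir.mkdir(exist_ok=True)

for i, values in enumerate(itertools.product(*grid.values())):
    settings = dict(zip(grid.keys(), values))
    run_dir = outdir / f"run_{i:03d}"
    run_dir.mkdir(exist_ok=True)
    (run_dir / "settings.json").write_text(json.dumps(settings))
    # Launch the simulation, keeping stdout/stderr as a log file for parsing.
    with open(run_dir / "sim.log", "w") as log:
        subprocess.run(
            ["crn_sim"] + [f"--{k}={v}" for k, v in settings.items()],
            stdout=log, stderr=subprocess.STDOUT, check=True,
        )
```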

Performance monitoring for LHCb DAQ

July 2015 to September 2015

In 2020, the data acquisition (DAQ) system of the LHCb experiment will be upgraded to feature a trigger-free readout. This requires an event builder network consisting of about 500 nodes with a total network capacity of 4 TB/s. DAQPIPE (Data Acquisition Protocol Independent Performance Evaluator) is a tool to simulate and evaluate the performance of such a DAQ system. At the time of this project, DAQPIPE only gave rough feedback about the event building rate.

The aim of this 10-week summer student project was to implement network monitoring for a more detailed performance evaluation of different transport protocols and to spot potential bottlenecks. First, several existing performance monitors were tested: DAQPIPE was run together with TAU, and the obtained performance data was visualized with ParaProf, Jumpshot and Vampir. In the second stage of the project, a lightweight performance analysis tool was written from scratch by wrapping the MPI communication calls in C++ to collect data.

Monitoring data sent by two readout units
Monitoring the data sent by two readout units (RUs). RUs collect incoming data fragments from different subdetectors and send them to builder units (BUs), which process the information.
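
The wrapping idea itself is simple, sketched here in Python with mpi4py rather than the C++ MPI library actually wrapped in the project (the names and the statistics kept are illustrative): every send goes through a thin function that records per-destination traffic and timing.

```python
# Sketch of the wrapping idea in Python/mpi4py (the actual project wrapped the
# C++ MPI library; the statistics collected here are illustrative).
import time
from collections import defaultdict

from mpi4py import MPI

comm = MPI.COMM_WORLD
stats = defaultdict(lambda: {"bytes": 0, "calls": 0, "seconds": 0.0})

def monitored_send(buf, dest, tag=0):
    """Send `buf` (anything with a buffer interface, e.g. a NumPy array)
    to rank `dest`, recording traffic volume and wall time per destination."""
    start = time.perf_counter()
    comm.Send(buf, dest=dest, tag=tag)
    entry = stats[dest]
    entry["bytes"] += memoryview(buf).nbytes
    entry["calls"] += 1
    entry["seconds"] += time.perf_counter() - start

def report():
    """At the end of a run, dump this rank's per-destination statistics."""
    for dest, s in sorted(stats.items()):
        rate = s["bytes"] / s["seconds"] / 1e9 if s["seconds"] else 0.0
        print(f"rank {comm.rank} -> {dest}: {s['calls']} sends, "
              f"{s['bytes']} bytes, {rate:.2f} GB/s")
```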

Truth-level based estimation of the sensitivity to pMSSM models in events with one hard lepton

Thesis (BA Sc. in Physics) at LMU Munich
2015
Based on the search for supersymmetry in final states containing one isolated lepton, jets and missing transverse momentum in proton-proton collision data recorded with the ATLAS detector at a center-of-mass energy of \( \sqrt{s} = 8\,\mathrm{TeV} \) in 2012, I looked into estimating the sensitivity to phenomenological MSSM models using the signal shape of truth-level signal samples. These estimates were then compared to the sensitivity calculated with MC samples on which a full detector simulation and reconstruction had been performed. The agreement was found to be generally poor. Several sources of error were ruled out, showing the necessity of a more detailed study of the underlying truth- and reco-level signal samples.
CLs values obtained by reco/truth level analysis
Comparison of the CLs values obtained by the reco-level analysis (y axis) and the truth-level analysis (x axis). Ideally both values would roughly agree (points falling on the red line \( x=y \)), but this is clearly not the case here.
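
For reference, the quantity compared in this figure is the standard CLs criterion used for exclusion limits,

\[
\mathrm{CL}_s = \frac{\mathrm{CL}_{s+b}}{\mathrm{CL}_b},
\]

where \(\mathrm{CL}_{s+b}\) and \(\mathrm{CL}_b\) are the confidence levels under the signal-plus-background and background-only hypotheses; a model point is typically excluded at 95% confidence level if \(\mathrm{CL}_s < 0.05\).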

Elliptic Functions

Thesis (BA Sc. in Mathematics) at LMU Munich
2014

Central subject of this thesis are so-called elliptic functions: meromorphic functions that are periodic in two directions, i.e. invariant under translations of their argument by two linearly independent complex numbers.

Among other applications, elliptic functions are of great use in number theory; in particular, there are interesting connections to sums of divisors of natural numbers. Furthermore, they appear in the theory of elliptic curves and elliptic integrals.
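
The prototypical example, also shown in the figure below, is the Weierstrass \(\wp\) function of a lattice \(\Lambda = \omega_1\mathbb{Z} + \omega_2\mathbb{Z}\) (with \(\omega_1, \omega_2\) linearly independent over \(\mathbb{R}\)):

\[
\wp(z) = \frac{1}{z^2} + \sum_{\omega \in \Lambda \setminus \{0\}}\left(\frac{1}{(z-\omega)^2} - \frac{1}{\omega^2}\right),
\]

a meromorphic function with double poles at the lattice points that satisfies \(\wp(z+\omega) = \wp(z)\) for all \(\omega \in \Lambda\).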

Imaginary part of the Weierstrass p function
Imaginary part of the Weierstrass p function \( \wp \), an example of an elliptic function. Clearly visible are the two periods \( \wp(x+2) = \wp(x) = \wp(x+2i) \) throughout the domain.