Is there a workflow where an AI can run tests, fix failures, and propose a patch as a PR I can review?

Most teams asking this question have already tried a copilot and hit a wall: you still run the test suite yourself, copy stack traces into chat, and hand‑apply the edits. What you actually want is different: a workflow where an AI runs tests in your real environment, fixes failures, and hands you a ready‑to‑review pull request, with full traceability and guardrails.

That workflow exists today. In Factory, you do it by delegating to a Droid.

This guide breaks down how that works in practice, what’s required under the hood, and where it’s safe to trust an AI system with tests, fixes, and PRs.

The workflow in one pass

Here’s what “AI runs tests, fixes failures, and proposes a patch as a PR I can review” looks like as a concrete sequence:

1. Trigger the Droid
- From your IDE/terminal (VS Code, JetBrains, Vim)
- From a Git branch or open PR
- From a CI pipeline or a ticket in your tracker
2. Environment discovery
- Detects language, framework, test runner, package manager
- Inspects package.json, pyproject.toml, go.mod, pom.xml, etc.
- Maps repo layout and relevant services/configs
3. Run tests (or a failing subset)
- Executes npm test, pytest, go test ./..., mvn test, or your custom script
- Captures exit codes, logs, stack traces, and coverage hints
4. Diagnose and plan
- Links failures back to specific files and lines
- Builds an explicit patch plan (files to touch, refactors to perform, tests to add/modify)
5. Apply fixes in a sandbox
- Edits code with minimal diffs, not full rewrites
- Reruns tests to confirm green (or at least “improved”) state
- Iterates under time and safety limits
6. Prepare artifacts for review
- Opens or updates a branch
- Creates a pull request with a diff, description, and test results
- Adds a technical summary: what changed, why, and how it was validated
7. You review and merge
- Standard code review flow in GitHub, GitLab, Bitbucket, or your platform of choice
- You own the approval; the Droid never bypasses your existing protections

The key is that this is not a chat prompt. It’s a delegated task running in an environment-aware, permissioned agent system.

Why copilots don’t get you there

Autocomplete tools help you write snippets faster, but they don’t:

Run your test suite reliably in your actual environment
Chase failures across multiple modules and config layers
Maintain context across hours or days of debugging and iteration
Produce PRs tied back to tickets and traceable artifacts

To deliver a workflow where an AI actually runs tests, fixes failures, and proposes a patch as a PR, you need:

Real terminal and filesystem access (not just an API to a model)
Planning and error recovery, not just single-shot code generation
Strict permissions and audit so you can trust what it’s touching
A way to measure outcomes: files changed, tests fixed, PRs raised

Factory was built around those requirements.

How Factory’s Droids run tests and fix failures end‑to‑end

1. Droids where you already work

You don’t move your workflow to match the AI; Droids meet you where you work:

In your editor/terminal: VS Code, JetBrains, Vim, shells
In the browser: zero-setup sessions on any repo
In CI/CD: scripted and parallelized via CLI
In Slack/Teams: “Droids in the war room” during incidents
In project trackers: “Droids in your backlog” triggered from tickets

The “run tests → fix → PR” workflow can start from any of these surfaces:

Right‑click a test failure in your IDE and delegate:
“Fix this test failure and open a PR with the patch.”
In CI: when a build fails, a Droid picks up the failing job and attempts to remediate.
In Slack: paste a failing job link and ask a Droid to investigate, propose a patch, and open a PR.

2. Environment‑grounded test execution

Running tests in real codebases is messy: custom scripts, flaky tests, multi-language repos.

Factory’s agent system does environment discovery first:

Detects test tooling
- JS/TS: Jest, Vitest, Mocha, Cypress, Playwright
- Python: pytest, unittest, nose, tox
- Java: JUnit, TestNG, Maven/Gradle wrappers
- Go: go test, gotestsum
- Plus custom commands from your CI configuration or Makefile
Locates the right command
- Reads package.json scripts, tox.ini, Makefile, and CI config (.github/workflows, .gitlab-ci.yml, etc.) to infer how tests are actually run
Executes tests under resource limits
- Sandboxed terminal session
- Timeouts and output capture
- Structured error reporting for the model

This is where agent design matters: the Droid doesn’t guess how you run tests; it discovers and reuses your existing commands and scripts.
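To make the discovery step concrete, here is a minimal Python sketch of the idea, not Factory’s implementation. The function names, the fallback table, and the 600-second default timeout are all assumptions made for illustration. It only looks at the repo root, and it runs the command on the host rather than in a sandbox, so treat it as an outline of the "discover, then execute with a timeout" pattern.

```python
import json
import subprocess
from pathlib import Path
from typing import Optional

# Hypothetical fallback table: manifest file -> conventional test command.
# The commands are the ones named in this guide; real repos often override them.
FALLBACK_COMMANDS = {
    "pyproject.toml": ["pytest"],
    "go.mod": ["go", "test", "./..."],
    "pom.xml": ["mvn", "test"],
}


def discover_test_command(repo: Path) -> Optional[list[str]]:
    """Infer a test command from files in the repo root, or return None."""
    package_json = repo / "package.json"
    if package_json.is_file():
        scripts = json.loads(package_json.read_text()).get("scripts", {})
        # Prefer the project's own "test" script over any guess.
        if "test" in scripts:
            return ["npm", "test"]
    for manifest, command in FALLBACK_COMMANDS.items():
        if (repo / manifest).is_file():
            return list(command)
    return None


def run_with_timeout(command: list[str], cwd: Path, timeout_s: float = 600.0):
    """Run the command and capture output; returns (exit_code or None, output)."""
    try:
        proc = subprocess.run(
            command, cwd=cwd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # A None exit code marks a run that hit the time limit.
        return None, f"timed out after {timeout_s} seconds"
    return proc.returncode, proc.stdout + proc.stderr
```

A fuller version would also read tox.ini, the Makefile, and CI config before falling back to defaults, since those files describe how tests are actually run.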

3. Diagnosing failures with explicit plans

Once tests fail, a Droid doesn’t immediately rewrite code. It:

Parses stack traces and logs
- Maps error lines back to files and line numbers
- Follows import/require chains to find upstream causes
Builds a change plan
- Enumerates files to inspect and edit
- Classifies each failure as a “bugfix,” a “test update,” or a “missing edge case”
- Prioritizes minimal diffs to keep review straightforward
Surfaces the plan as part of the PR description later:
“Plan: Update payment_service.py to handle null customer IDs, adjust test_payment_null_customer expectations, and add regression test for zero‑amount payments.”

That planning step is why Factory’s Droids can run longer tasks (like multi-module refactors or multi-step test fixes) without losing the plot.
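As an illustration of the first diagnosis step, the sketch below maps a Python traceback back to files and line numbers. It is an assumption-laden example rather than a description of how Droids parse logs: the `Frame` type, `parse_frames`, and `likely_culprit` are hypothetical names, and it handles only the frame format that CPython prints.

```python
import re
from typing import NamedTuple, Optional


class Frame(NamedTuple):
    path: str
    line: int
    function: str


# Matches the 'File "...", line N, in name' lines of a CPython traceback.
FRAME_RE = re.compile(
    r'^[ \t]*File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>.+)$',
    re.MULTILINE,
)


def parse_frames(traceback_text: str) -> list[Frame]:
    """Return frames in the order they appear (outermost call first)."""
    return [
        Frame(m["path"], int(m["line"]), m["func"])
        for m in FRAME_RE.finditer(traceback_text)
    ]


def likely_culprit(frames: list[Frame], repo_prefix: str) -> Optional[Frame]:
    """Pick the innermost frame inside the repo, skipping library frames."""
    for frame in reversed(frames):
        if frame.path.startswith(repo_prefix):
            return frame
    return None
```

Choosing the innermost in-repo frame is only a starting heuristic; the real cause may sit upstream, which is why following import chains matters.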

4. Applying code changes safely

With a plan, the Droid:

Edits specific files, not entire directories
Uses minimal, localized patches instead of wholesale rewrites
Adds or updates tests when needed to lock in the behavior

Examples of delegated tasks that work well:

“Fix the failing Jest snapshot tests and update snapshots if they’re correct.”
“Resolve these pytest failures without changing API contracts.”
“Refactor shared validation logic and fix any tests it breaks.”

While it edits, it keeps a structured record:

Files opened
Files edited
Command outputs (test runs, linters, formatters)

This record later feeds into PR descriptions and Factory Analytics.
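One way to picture that record is a small data structure that accumulates the three kinds of entries listed above. The sketch below is hypothetical: `SessionRecord`, `CommandResult`, and the 2000-character output tail are assumptions for illustration, not Factory’s actual schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CommandResult:
    command: str
    exit_code: int
    output_tail: str  # only the end of the output, to keep the record small


@dataclass
class SessionRecord:
    files_opened: list[str] = field(default_factory=list)
    files_edited: list[str] = field(default_factory=list)
    commands: list[CommandResult] = field(default_factory=list)

    def note_open(self, path: str) -> None:
        if path not in self.files_opened:
            self.files_opened.append(path)

    def note_edit(self, path: str) -> None:
        # An edited file has necessarily been opened as well.
        self.note_open(path)
        if path not in self.files_edited:
            self.files_edited.append(path)

    def note_command(self, command: str, exit_code: int, output: str,
                     tail_chars: int = 2000) -> None:
        self.commands.append(CommandResult(command, exit_code, output[-tail_chars:]))

    def to_json(self) -> str:
        """Serialize the whole record, e.g. for a PR description or audit log."""
        return json.dumps(asdict(self), indent=2)
```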

5. Re‑running tests until green (or improved)

After applying changes:

The Droid reruns tests, either:
- The full suite, or
- The failing subset, if that’s faster and your CI will handle the rest
It loops, under guardrails:
- Timeboxed sessions
- Hard limits on number of iterations
- Fallback behavior if tests remain flaky or non-deterministic
It classifies outcomes:
- Fully passing: ready to propose a patch
- Partially improved: fewer failures, but some remain
- Blocked: environmental issue (missing service, secret, or flaky test)

In the latter two cases, the Droid will still propose a patch and explicitly state what’s left unresolved, so you can make an informed call in review.
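The loop and its guardrails can be sketched in a few lines of Python. Everything here is an assumption made for illustration: `fix_loop`, its two callables, and the default limits are hypothetical. The sketch also simplifies "blocked" to mean "no reduction in failures within the limits", whereas a real system would separately detect missing services, secrets, or flaky tests.

```python
import time
from enum import Enum
from typing import Callable


class Outcome(Enum):
    FULLY_PASSING = "fully passing"
    PARTIALLY_IMPROVED = "partially improved"
    BLOCKED = "blocked"


def fix_loop(
    count_failures: Callable[[], int],  # reruns tests, returns failing count
    attempt_fix: Callable[[], None],    # applies one round of edits
    max_iterations: int = 5,            # hard limit on iterations
    budget_s: float = 900.0,            # timebox for the whole session
) -> Outcome:
    deadline = time.monotonic() + budget_s
    baseline = remaining = count_failures()
    iterations = 0
    while remaining > 0 and iterations < max_iterations and time.monotonic() < deadline:
        attempt_fix()
        remaining = count_failures()
        iterations += 1
    if remaining == 0:
        return Outcome.FULLY_PASSING
    if remaining < baseline:
        return Outcome.PARTIALLY_IMPROVED
    return Outcome.BLOCKED
```

Both limits are checked before every attempt, so a slow test suite cannot extend the session indefinitely even if the iteration cap has not been reached.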

6. Creating a PR you can actually review

Once tests are in a better state, the Droid:

Creates or updates a branch
- Follows your branch naming conventions when configured
- Links to the source ticket or incident if triggered from your tracker
Opens a pull request with:
- A clear title: “Fix flaky payment retry tests under slow network”
- A structured description:
  - Problem summary
  - Files changed
  - Tests run and their results
  - Known limitations or follow‑ups
- Links back to logs or incident if relevant
Keeps traceability from ticket to code
- Factory attaches a trail: which Droid ran, what commands it executed, and what artifacts it produced.

You stay in your standard GitHub/GitLab/Bitbucket review UI. No new review tool, no bypassing of required checks.
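For a sense of what the structured description could look like, here is a hedged Python sketch that renders the four sections listed above as plain text. `PullRequestDraft` and its fields are hypothetical names; the exact layout Factory produces may differ, and submitting the text to your Git host is left out.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequestDraft:
    title: str
    problem_summary: str
    files_changed: list[str]
    tests_run: dict[str, str]  # test command -> result summary
    known_limitations: list[str] = field(default_factory=list)

    def body(self) -> str:
        """Render the description sections in a fixed, reviewable order."""
        lines = ["Problem summary", self.problem_summary, "", "Files changed"]
        lines += [f"- {path}" for path in self.files_changed]
        lines += ["", "Tests run and their results"]
        lines += [f"- {cmd}: {result}" for cmd, result in self.tests_run.items()]
        lines += ["", "Known limitations or follow-ups"]
        # State explicitly when nothing is left open, rather than omitting the section.
        lines += [f"- {item}" for item in self.known_limitations] or ["- None"]
        return "\n".join(lines)
```

Keeping the "Known limitations" section even when it is empty makes partially improved or blocked outcomes easy to spot in review.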

Where this workflow shines in real teams

1. Flaky or failing tests in busy codebases

Scenario:

CI red because of intermittent Jest or pytest failures
No one on the team has the context to jump in immediately
Releases are blocked or slowed

Droid workflow:

CI job marks a test stage as failed.
A Droid is triggered via CLI or from Slack with the failing job URL.
It pulls logs, reruns the failing tests, and stabilizes them by fixing the underlying cause (faulty logic, timeout problems, bad mocks).
It opens a PR with:
- The fix
- A summary of root cause
- Tests it ran and their status

Result: your team reviews and merges; no engineer had to spend the day spelunking through test harness code.

2. Framework/library upgrades that break tests

Scenario:

Upgrading React, Django, Rails, or a core library
Hundreds of tests fail due to small API changes

Droid workflow:

Use “Droids at scale” via CLI to:
- Run targeted test subsets
- Apply repetitive fixes across files and modules
- Rerun and verify
- Open batched PRs for different risk levels

This turns what used to be a week of painful “search, tweak, rerun” into repeatable, parallel tasks—still reviewed by humans, but delegated in bulk.

3. Incident response and regression fixes

Scenario:

Production incident due to a bug that tests should have caught
You add a regression test that fails and now need a fast, safe fix

Droid workflow:

In Slack’s incident channel, you summon a Droid with the failing test and log snippet.
The Droid:
- Reproduces the failure locally
- Implements a fix
- Adds or adjusts tests
- Opens a PR with a full incident-oriented summary

On‑call engineers stay focused on coordination and rollout decisions while the Droid handles the mechanical “run tests → fix → PR” loop.

Safety, controls, and why enterprises trust this pattern

Letting an AI run tests and modify code in your repos demands serious guardrails. Factory’s setup is built for that:

Strict permissions enforcement
- Droids only see what the invoking user can already access
- Repo, branch, and ticket access follows your existing permissions
Single‑tenant sandboxed environments
- Dedicated VPC per customer
- Isolation between orgs and projects
Audit logging and compliance
- Full logs of commands, file edits, and PR actions
- Exportable to your SIEM
- SOC 2, GDPR/CCPA alignment, and early ISO 42001 adoption
Clear IP stance
- Factory does not use your code as training data without prior written consent
- Model provider choice is yours; Factory is interface- and vendor-agnostic
No unsupervised merges
- Droids stop at “ready-to-review PR”
- Your existing CI, branch protections, and approvals remain in full control

This is how you get the leverage of “AI runs tests and proposes patches” without sacrificing the governance your security and compliance teams require.

Measuring whether this workflow is worth it

Token counts won’t tell you if this workflow is paying off. Factory Analytics focuses on what matters for engineering leaders:

Outputs:
- Files created and edited by Droids
- Commits and PRs generated
- Tests added or fixed
Process metrics:
- Time from ticket/incident to PR
- Reduction in back‑and‑forth for context gathering
- Autonomy ratio: the share of work Droids complete end‑to‑end, as opposed to work that ends as suggestions only
Observability:
- OpenTelemetry export for integrating with your existing dashboards
- Drill‑downs by repo, team, or task type (e.g., “test fix” vs “refactor”)

You can validate, empirically, whether “AI runs tests, fixes failures, and proposes a patch” is cutting MTTR or speeding releases in your own environment.
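If you export task data, two of these metrics are simple to compute yourself. The Python sketch below assumes a hypothetical `TaskRecord` shape and one plausible definition of autonomy ratio (tasks completed end‑to‑end divided by all tasks); Factory Analytics may define and calculate it differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class TaskRecord:
    ticket_opened: datetime
    pr_opened: Optional[datetime]  # None when the task produced no PR
    completed_end_to_end: bool     # True if no human had to finish the work


def autonomy_ratio(tasks: list[TaskRecord]) -> float:
    """Share of tasks completed end-to-end (assumed definition)."""
    if not tasks:
        return 0.0
    return sum(t.completed_end_to_end for t in tasks) / len(tasks)


def median_time_to_pr(tasks: list[TaskRecord]) -> Optional[timedelta]:
    """Median time from ticket/incident to PR, over tasks that reached a PR."""
    durations = [
        t.pr_opened - t.ticket_opened for t in tasks if t.pr_opened is not None
    ]
    return median(durations) if durations else None
```

Comparing these numbers before and after adopting the workflow, per repo or task type, is one way to check the impact in your own environment.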

When to use this workflow vs. a simple copilot

Use a copilot when:

You’re writing new code with clear requirements
You want inline suggestions but don’t need automation

Use Factory Droids when:

Tests are failing and you want the AI to:
- Run the suite
- Diagnose failures
- Apply patches
- Propose a PR with evidence
You’re managing organization‑wide processes: migrations, upgrades, and incident response with traceability and controls.

The two are complementary: copilot inside the editor for local editing, Droids for delegated, end‑to‑end tasks across your stack.

Final verdict: yes, and here’s the precise shape of that workflow

There is a workflow where an AI can run tests, fix failures, and propose a patch as a PR you can review—but it looks less like a chat window and more like a delegated Droid operating inside your existing tools:

Run tests in real terminals and CI.
Inspect logs and stack traces.
Plan and apply minimal patches.
Rerun tests to validate.
Open a traceable, reviewable PR under your existing permissions and policies.

If you want to try this pattern in your own repos—starting from IDE, CI, Slack, or tickets—you can start a Droid in Factory and delegate your next failing test suite.
