Factory: how do I choose which model to use per task (Claude vs GPT vs Gemini)?

Most teams get stuck on the same question: if Factory lets Droids run on Claude, GPT, Gemini, and more, how do you actually choose the right model per task without turning every run into a research project?

The short answer: treat model choice as an implementation detail of your Droid workflows—not a new religion. Decide based on the task type, the environment (IDE, terminal, CI, Slack), and your tolerance for cost vs. depth of reasoning. Factory is model-agnostic by design, so you can mix and match without breaking your workflow.

This guide is written for that exact decision: how to choose between Claude, GPT, and Gemini for different engineering tasks inside Factory.

At-a-Glance Recommendation Framework

You can think of Factory’s model selection in three layers:

Default “everything” model
Your general-purpose workhorse for delegated coding tasks across IDE, web, and Slack.
Heavyweight “deep reasoning” model
Used when Droids need to perform multi-step debugging, complex refactors, or ambiguous incident triage.
Cost-optimized “batch / CI” model
Used for scripted Droids in CI/CD, migrations, and large-scale code review where throughput matters more than marginal accuracy.

Factory lets you wire all three into different Droid profiles or environments, so you don’t have to manually pick models every time.
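As a rough illustration of the three layers, the sketch below maps Droid profiles to tiers and tiers to models. It is not Factory's configuration format: `Tier`, `DroidProfile`, `TIER_MODELS`, and the model labels are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    DEFAULT = "default"
    DEEP_REASONING = "deep_reasoning"
    BATCH = "batch"


@dataclass(frozen=True)
class DroidProfile:
    name: str
    tier: Tier


# Placeholder labels for illustration, not real API model identifiers.
TIER_MODELS = {
    Tier.DEFAULT: "gpt-5",
    Tier.DEEP_REASONING: "claude-opus-4.1",
    Tier.BATCH: "cost-optimized-coding-model",
}


def model_for(profile: DroidProfile) -> str:
    """Resolve a profile to a model label through its tier."""
    return TIER_MODELS[profile.tier]


profiles = [
    DroidProfile("ticket-to-pr", Tier.DEFAULT),
    DroidProfile("incident-triage", Tier.DEEP_REASONING),
    DroidProfile("migration", Tier.BATCH),
]
for profile in profiles:
    print(f"{profile.name} -> {model_for(profile)}")
```

Keeping the tier as the only indirection means that swapping the batch model is a one-line change, and no individual profile has to be touched.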

How Factory Droids Use Models (and Why It Matters)

Factory doesn’t just “call a model.” Every Droid execution runs the same loop:

Plan the task in steps.
Discover context from code, git, tickets, logs, and chat.
Act with tools (edit files, run tests, call CLIs, navigate the web).
Recover from errors under timeouts and partial failures.
Produce artifacts: PRs, patches, tests, runbooks, reviews.
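To make the steps above concrete, here is a minimal sketch of a step runner that retries a failing step and reports what completed. It is an assumption-laden toy, not Factory's agent implementation: `run_task`, `discover`, `run_tests`, and the retry policy are invented for illustration, and real steps would call a model and real tools.

```python
from typing import Callable

Step = Callable[[dict], None]


def run_task(steps: list[Step], context: dict, max_retries: int = 2) -> dict:
    """Run planned steps in order, retrying a failing step before giving up on it."""
    report = {"completed": [], "failed": []}
    for step in steps:
        for _attempt in range(max_retries + 1):
            try:
                step(context)
            except RuntimeError as err:
                # Recover: keep the error so the next attempt can see it.
                context["last_error"] = str(err)
            else:
                report["completed"].append(step.__name__)
                break
        else:
            # Every attempt raised; record the failure and move on.
            report["failed"].append(step.__name__)
    return report


def discover(context: dict) -> None:
    # Stand-in for gathering context from code, git, tickets, or logs.
    context["files"] = ["service_a.py"]


def run_tests(context: dict) -> None:
    # Stand-in for a tool call that fails once (e.g. a timeout), then succeeds.
    context["test_attempts"] = context.get("test_attempts", 0) + 1
    if context["test_attempts"] < 2:
        raise RuntimeError("test runner timed out")


context: dict = {}
report = run_task([discover, run_tests], context)
print(report)
```

The point of the sketch is that recovery lives in the loop rather than in the model: a stronger model may need fewer retries, but the loop is what turns a transient failure into a second attempt.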

Model choice affects how well this loop runs:

Can the model maintain a multi-step plan across a long session?
Can it read and write large codebases without hallucinating structure?
Does it handle terminal errors and stack traces gracefully?
Is it cost-effective at the scale you need (CI, migrations, code review)?

In our own runs on benchmarks such as Terminal-Bench and SWE-bench Full, we consistently see that agent design is the decisive factor, but model choice still matters at the edges, especially for complex, under-specified tasks.

What We Know from Benchmarks and Real Usage

From Factory’s internal evaluations and public reporting:

Claude Opus 4.1
- Delivers the strongest overall performance on the hardest tasks.
- Excels at advanced debugging, multi-hop reasoning, and long-running plans.
- Great default when you care about solving tricky tickets more than saving pennies.
GPT-5 Codex-style models (and similar cost-optimized models)
- Show strong domain knowledge in ML workflows, video editing, and many “everyday” engineering tasks.
- Tend to be more conservative and avoid risky changes—a good fit for production-facing work with guardrails.
- Their drastically lower cost makes them ideal for batch Droids in CI/CD, migrations, and large-scale code review.
Factory’s model stance
- We support all state-of-the-art coding models, including GPT-5, Claude Sonnet 4, OpenAI o3, Gemini 2.5 Pro, Claude Opus 4.1, and more.
- Droids can even sample multiple models when generating candidate solutions, then validate with tests and select the best trajectory.

So when you see “Claude vs GPT vs Gemini” inside Factory, remember: you’re choosing within a model-agnostic agent system, not a fixed vendor stack.
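The idea of sampling several models and letting tests pick the winner can be sketched as follows. This is a simplified illustration under stated assumptions, not Factory's selection logic: `select_best`, the stubbed patches, and the pass rates are hypothetical, and a real setup would call actual models and run an actual test suite.

```python
from typing import Callable, Optional


def select_best(
    models: list[str],
    generate: Callable[[str], str],
    pass_rate: Callable[[str], float],
) -> Optional[dict]:
    """Generate one candidate per model and keep the one with the highest test pass rate."""
    best = None
    for model in models:
        patch = generate(model)
        score = pass_rate(patch)
        # Strict comparison: on a tie, the earlier model in the list wins.
        if best is None or score > best["score"]:
            best = {"model": model, "patch": patch, "score": score}
    # Reject the winner too if it passes no tests at all.
    if best is None or best["score"] == 0.0:
        return None
    return best


# Stubs standing in for real model calls and a real test run.
FAKE_PATCHES = {"model-a": "patch-a", "model-b": "patch-b", "model-c": "patch-c"}
FAKE_PASS_RATES = {"patch-a": 0.6, "patch-b": 1.0, "patch-c": 0.0}

winner = select_best(
    list(FAKE_PATCHES),
    FAKE_PATCHES.__getitem__,
    FAKE_PASS_RATES.__getitem__,
)
print(winner)
```

Because the tests make the selection, the choice of model becomes an input to the search rather than the final arbiter of quality.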

Core Comparison: Claude vs GPT vs Gemini in Factory

These are practical characterizations for Factory Droids—not marketing blurbs.

Claude (e.g., Claude Opus 4.1, Claude Sonnet 4)

Best as: Primary model for deep, complex engineering tasks.

Where it shines:

Complex, multi-step debugging
- Interpreting noisy stack traces.
- Navigating unfamiliar code paths.
- Iterating on hypotheses over several tool calls.
Large context + long sessions
- Long-lived Droid sessions (days) where context compaction is critical.
- Multi-file and cross-repo refactors.
Ambiguous tickets
- Under-specified Jira issues.
- “Something is slow” bug reports.
- Incident investigation where the plan must adapt.

Use Claude via Droids when you’re delegating tasks like:

“Diagnose and propose a fix for this flaky integration test across services A, B, and C.”
“Refactor this legacy module into a feature-flagged implementation and generate tests.”
“Investigate this incident from logs and metrics, propose mitigations, and draft a postmortem outline.”

GPT (e.g., GPT-5, OpenAI o3)

Best as: Balanced default for general-purpose coding and analysis.

Where it shines:

Fast, high-quality code generation
- Implementing well-specified features or functions.
- Creating tests, benchmarks, or small utilities.
General reasoning + explanation
- Explaining codebases and architecture.
- Drafting docs, ADRs, and design notes.
Cross-domain content
- Combining code changes with product docs, release notes, or communication drafts.

Use GPT via Droids for tasks like:

“Implement this ticket description across these files and open a PR.”
“Generate tests for this module and update the documentation accordingly.”
“Review this PR for security and correctness, add inline suggestions, and write a summary comment.”

Gemini (e.g., Gemini 2.5 Pro)

Best as: Complementary model where Google ecosystem integration, multimodal input, or multilingual support matters.

Where it shines:

Google ecosystem & tooling
- Integration with Docs/Sheets-style workflows (via your surrounding systems).
- Situations where the rest of your stack is already Google-heavy.
Multimodal / complex inputs
- Code plus diagrams, logs, or mixed content where multimodal capabilities help.
Non-English or international docs
- Projects with multilingual documentation and interfaces.

Use Gemini via Droids for tasks like:

“Analyze this code plus attached design diagram to validate the implementation.”
“Generate localized versions of UI copy, then update the frontend code and tests.”
“Summarize complex log files and map errors to code paths.”

Choosing by Task Type (What Most Teams Actually Need)

Instead of starting from “which vendor,” start from “what are we delegating to Droids?”

1. Refactors and Large Codebase Changes

Traits:

Cross-file, cross-module context.
High risk if changes are wrong.
Often long-running, with multiple passes.

Droid surfaces:

IDE plugins (VS Code, JetBrains, Vim).
Web IDE.
Scripted runs in CLI for codebase-wide refactors.

Model choice:

Use Claude Opus 4.1 as the primary model.
It handles long-range dependencies and complex plans better in our experience.
If cost is critical and the refactor is well-scoped, use a cost-optimized coding model (GPT-5 Codex-style) for bulk edits, and reserve Claude for a separate “review Droid” pass on top.

Example setup in Factory:

Refactor Droid (IDE/Web) → Claude Opus 4.1.
Bulk-Refactor Droid (CLI/CI) → GPT-5 Codex-style model for generation, plus Claude or GPT for a final review pass.

2. Incident Response and On-Call “War Room” Work

Traits:

Noisy signals (logs, metrics, chat snippets).
Time pressure and partial information.
Need for concise runbooks and mitigation plans.

Droid surfaces:

Slack/Teams (“Droids in the war room”).
Terminal (to inspect services, logs, metrics).
Web (dashboards, runbooks, ticket systems).

Model choice:

For triage and root cause analysis: Claude Opus 4.1.
It is better at advanced debugging and multi-step reasoning on messy input.
For summarizing incidents and drafting comms: GPT.
Strong at clear explanations and structured language.

Example setup in Factory:

Incident Triage Droid (Slack/Terminal) → Claude Opus 4.1.
Incident Comms + Postmortem Draft Droid → GPT-5.

3. Everyday Feature Work and Ticket Implementation

Traits:

Well-structured tickets.
Localized code changes.
Repeated across teams.

Droid surfaces:

IDE/terminal (day-to-day coding).
Project trackers (Droids triggered from Jira/Linear issues).
Web IDE for quick edits.

Model choice:

Use GPT-5 or Claude Sonnet 4 as the default.
Both are strong for “implement this ticket + tests + docs” flows.
Pick based on your org’s model preference and pricing; the agent behavior stays constant in Factory.

Example setup in Factory:

Ticket-to-PR Droid (issue-triggered) → GPT-5.
Local Dev Assistant Droid (VS Code/Vim) → GPT-5 or Sonnet 4.

4. CI/CD Automation, Code Review, and Migrations

Traits:

High volume, repeatable patterns.
Cost and throughput matter.
Often non-interactive, fully scripted runs.

Droid surfaces:

CLI (scripted/parallel Droids).
CI runners (GitHub Actions, GitLab CI, etc.).

Model choice:

Use GPT-5 Codex-style or similar cost-optimized models for:
- Automated style fixes.
- Dependency upgrades.
- Boilerplate migrations.
Optionally add a higher-end model (Claude/GPT) as a second pass for:
- Security-focused review.
- Complex diff evaluation.

Example setup in Factory:

Migration Droid (CLI, parallelized) → GPT-5 Codex.
Security Review Droid (CI, post-migration) → Claude Opus 4.1.
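One way to keep the second pass optional is to escalate only the diffs that look risky. The sketch below shows that routing decision. The escalation rules are assumptions made up for this example (`SENSITIVE_PREFIXES` and `MAX_LINES_WITHOUT_REVIEW` are hypothetical and should be tuned per repository), and nothing here reflects a built-in Factory feature.

```python
from dataclasses import dataclass


@dataclass
class Diff:
    path: str
    lines_changed: int


# Hypothetical escalation rules; tune these to your own repository.
SENSITIVE_PREFIXES = ("auth/", "payments/", "infra/")
MAX_LINES_WITHOUT_REVIEW = 200


def needs_second_pass(diff: Diff) -> bool:
    """Escalate diffs that touch sensitive paths or are large enough to be risky."""
    return (
        diff.path.startswith(SENSITIVE_PREFIXES)
        or diff.lines_changed > MAX_LINES_WITHOUT_REVIEW
    )


def plan_review(diffs: list[Diff]) -> dict:
    """Split diffs into those accepted after the cheap pass and those sent to the reviewer."""
    plan = {"cheap_pass_only": [], "second_pass": []}
    for diff in diffs:
        bucket = "second_pass" if needs_second_pass(diff) else "cheap_pass_only"
        plan[bucket].append(diff.path)
    return plan


diffs = [
    Diff("docs/README.md", 12),
    Diff("auth/session.py", 8),
    Diff("services/orders.py", 450),
]
print(plan_review(diffs))
```

With rules like these, the pricier reviewer only sees the small share of diffs where a mistake would be expensive, which is where its extra cost is easiest to justify.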

5. Documentation, Overviews, and Architecture Analyses

Traits:

Code + prose.
Need for structured, human-readable outputs.

Droid surfaces:

Web IDE (repo overviews).
IDE/terminal (inline doc generation).
Slack/Teams (architecture Q&A).

Model choice:

GPT-5 for high-quality explanations, ADR drafts, and structured docs.
Claude when documentation is tightly coupled to a deep debugging or refactor session (reuse the same context and reasoning chain).
Gemini for multimodal input or multilingual documentation.

Example setup in Factory:

Architecture Overview Droid (Web) → GPT-5.
Doc-from-PR Droid (CI) → GPT-5 or Gemini 2.5 Pro if multimodal.

Choosing by Environment (IDE, Web, CLI, Slack)

The surface a Droid runs on shapes the kind of task you delegate to it.

Droids Where You Code (IDE/Terminal)

Tasks: Local refactors, feature work, “explain this code,” write tests.
Model default: GPT-5 or Claude Sonnet 4.
When to escalate: Switch specific Droids to Claude Opus 4.1 for hard debugging sessions.

Droids in the Browser (Web IDE)

Tasks: Repo overviews, large refactors, quick edits without setup.
Model default: Claude Opus 4.1 for repo-wide insight; GPT-5 for doc-heavy tasks.

Droids at Scale (CLI / CI/CD)

Tasks: Migrations, dependency bumps, automated review, enforcement of standards.
Model default: Cost-optimized GPT-5 Codex-style models for generation; optional Claude/GPT for critical review steps.

Droids in the War Room (Slack/Teams)

Tasks: Incident triage, log analysis, “what changed?” queries.
Model default: Claude Opus 4.1 for root cause, GPT-5 for summaries and comms.

Droids in Your Backlog (Issue-Triggered)

Tasks: Ticket-to-PR flows, backlog grooming, auto-updates on tickets.
Model default: GPT-5 or Sonnet 4 for implementation and summarization.
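The per-surface defaults and their exceptions can be captured as a lookup with overrides. The following is a minimal sketch: the surface names, task names, and model labels are placeholders invented for this example rather than Factory settings, and where the text offers two defaults the sketch arbitrarily picks one.

```python
from typing import Optional

# Placeholder labels for illustration, not real API model identifiers.
SURFACE_DEFAULTS = {
    "ide": "gpt-5",
    "web": "claude-opus-4.1",
    "ci": "cost-optimized-coding-model",
    "slack": "claude-opus-4.1",
    "issue": "gpt-5",
}

# (surface, task) pairs that deviate from the surface default.
OVERRIDES = {
    ("ide", "hard_debugging"): "claude-opus-4.1",
    ("web", "docs"): "gpt-5",
    ("slack", "comms"): "gpt-5",
}


def pick_model(surface: str, task: Optional[str] = None) -> str:
    """Return the override for (surface, task) if one exists, else the surface default."""
    if surface not in SURFACE_DEFAULTS:
        raise ValueError(f"unknown surface: {surface!r}")
    return OVERRIDES.get((surface, task), SURFACE_DEFAULTS[surface])


print(pick_model("ide"))
print(pick_model("ide", "hard_debugging"))
print(pick_model("slack", "comms"))
```

Unknown tasks fall back to the surface default, so engineers only need to name a task when they want the exception.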

Cost, Risk, and Governance: How to Make This Enterprise-Ready

Model choice doesn’t happen in a vacuum. In enterprise environments, you also care about risk, compliance, and observability.

Factory’s stance:

Strict permissions enforcement
Droids only see what the user can see in the underlying system (repo, ticketing, chat).
Single-tenant, sandboxed environments with dedicated VPCs
Isolation between customers, aligned with SOC 2 and ISO 42001 practices.
Audit logging
Full execution traces exportable to your SIEM, including model interactions.
Clear IP posture
Factory does not use your code as training data without prior written consent.

On top of that, Factory Analytics ties model usage to real outputs:

Files created/edited.
Commits and PRs.
Autonomy ratio (how much work Droids do end-to-end).

This lets you measure whether “Claude vs GPT vs Gemini” decisions are actually improving PR throughput, MTTR, or migration completion time—not just changing your token charts.

Practical Model Selection Playbook

If you want a minimal, sane default for most orgs:

Pick a default model for everyday work
- Use GPT-5 or Claude Sonnet 4 as your baseline for IDE + ticket work.
Add a “deep reasoning” profile
- Configure a set of Droids (debugging, incident triage, major refactors) to use Claude Opus 4.1.
Add a “bulk automation” profile
- Configure CI/CLI Droids for migrations, automated review, and maintenance on GPT-5 Codex-style models or other cost-efficient options.
Keep Gemini in reserve for multimodal or specific needs
- Use Gemini 2.5 Pro where diagrams, mixed media, or multilingual needs show up.
Iterate using real metrics
- Use Factory Analytics and OTEL export to see which models produce:
  - More merged PRs per dollar.
  - Faster incident triage.
  - More autonomous task completion.
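As a sketch of how such a comparison might be computed from exported run records, the example below aggregates merged PRs per dollar and an autonomy ratio per model. The `Run` fields, the sample numbers, and the definition of autonomy ratio used here (share of runs finished without human intervention) are all assumptions for illustration; they are not the Factory Analytics schema or real measurements.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Run:
    model: str
    cost_usd: float
    pr_merged: bool
    autonomous: bool  # assumed meaning: finished without human intervention


def summarize(runs: list[Run]) -> dict:
    """Aggregate per-model merged PRs per dollar and autonomy ratio."""
    totals = defaultdict(lambda: {"runs": 0, "cost": 0.0, "merged": 0, "autonomous": 0})
    for run in runs:
        t = totals[run.model]
        t["runs"] += 1
        t["cost"] += run.cost_usd
        t["merged"] += int(run.pr_merged)
        t["autonomous"] += int(run.autonomous)
    return {
        model: {
            "merged_prs_per_dollar": t["merged"] / t["cost"] if t["cost"] else 0.0,
            "autonomy_ratio": t["autonomous"] / t["runs"],
        }
        for model, t in totals.items()
    }


# Made-up records, only to exercise the function.
runs = [
    Run("model-a", 2.0, True, True),
    Run("model-a", 2.0, True, False),
    Run("model-b", 0.25, True, True),
    Run("model-b", 0.25, True, False),
    Run("model-b", 0.25, True, True),
    Run("model-b", 0.25, False, False),
]
summary = summarize(runs)
print(summary)
```

Comparing models on outcome per dollar rather than on raw token spend is what keeps the evaluation tied to delivered work.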

Final Verdict

Within Factory, “Claude vs GPT vs Gemini” is less about brand loyalty and more about matching the model to the task and environment:

Use Claude Opus 4.1 when you need deep debugging, complex refactors, and long-running, ambiguous work.
Use GPT-5 or Claude Sonnet 4 as your default engine for day-to-day coding, tickets, and documentation.
Use cost-optimized coding models (like GPT-5 Codex variants) for CI/CD automation, migrations, and bulk code review, reserving pricier models as reviewers.
Use Gemini where multimodal inputs, Google-heavy environments, or multilingual docs are central.

Because Factory is model- and interface-agnostic, you can make these choices per Droid, per surface, and per workflow, without disrupting how engineers already work in their IDEs, terminals, chat, and CI.
