Engineering

How we automated code review with 5 AI agents

Sergej Drus · March 21, 2026 · 8 min read

Every developer knows the feeling: you open a pull request, ping your team, and wait. Sometimes hours. Sometimes days. The reviewer is busy. The context is lost. The bug slips through anyway.

We decided to stop waiting. Instead of one tired human reviewer, we built a swarm of five specialized AI agents that review every PR in under 90 seconds and catch things humans routinely miss.

Here's exactly how we did it, what broke, and what the numbers look like after three months in production.

The problem with single-agent code review

The obvious first attempt is to just ask Claude or GPT-4 to "review this PR." It works, sort of. But a single LLM trying to do everything at once is like hiring one person to be your security auditor, your style guide enforcer, your test coverage analyst, and your logic checker simultaneously. Attention gets diluted. The model satisfices instead of excelling.

We also noticed a subtler problem: large diffs (1000+ lines) would cause the model to focus on the first few files and trail off by the end. Critical bugs in file 12 of 15 would get a cursory glance.

Key insight: Specialization beats generalization. One agent that only looks for SQL injection catches more SQL injection than one agent that looks for everything.

The 5-agent architecture

We split the review responsibility across five agents, each with a focused mandate:

🔴 Security Scanner
claude-sonnet-4-6 · system prompt: security-only
Looks for injection flaws, exposed secrets, auth bypasses, insecure deserialization, and OWASP Top 10 issues. Refuses to comment on style or logic.
🟡 Logic Checker
claude-opus-4-6 · system prompt: correctness-only
Traces data flow, checks edge cases, and looks for off-by-one errors, null dereferences, race conditions, and incorrect algorithm behavior.
🔵 Style Enforcer
claude-haiku-4-5 · system prompt: style-only
Checks naming conventions, dead code, unnecessary complexity, and documentation gaps. Fast and cheap; runs on every file, including trivial ones.
🟢 Test Analyzer
claude-sonnet-4-6 · system prompt: tests-only
Maps test coverage to changed code, identifies untested branches, suggests specific test cases, and flags missing integration tests.
⚪ Orchestrator
claude-opus-4-6 · sees all agent outputs
Synthesizes all findings, resolves conflicts between agents, assigns severity (BLOCK / WARN / NOTE), and writes the final review comment.

How the pipeline works

When a PR is opened, a GitHub Actions workflow fires and hits our MindSwarm endpoint. The orchestrator splits the diff into chunks and fans out work to the four specialist agents in parallel. Each agent returns structured JSON findings. The orchestrator merges them, deduplicates, and posts a single formatted comment back to GitHub.

PR opened
    │
    ▼
Orchestrator splits diff
    │
    ├── Security Scanner ───────┐
    ├── Logic Checker ──────────┤  (parallel, ~40s)
    ├── Style Enforcer ─────────┤
    └── Test Analyzer ──────────┘
                                │
                                ▼
                    Orchestrator merges findings
                                │
                                ▼
                    GitHub PR comment posted

The entire pipeline runs in a single Claude Code session using the Agent tool to spawn subagents. Each agent gets exactly its slice of the diff β€” no full-diff overload, no attention dilution.
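The "exactly its slice" step is plain diff chunking. A minimal sketch (illustrative, not MindSwarm's actual code), assuming standard unified-diff input:

```python
# Minimal per-file diff chunking sketch, so each specialist sees only
# the files relevant to it. Illustrative, not production code.
def split_diff(diff: str) -> dict[str, str]:
    """Split a unified diff into {file_path: file_diff} chunks."""
    chunks: dict[str, str] = {}
    current_file, lines = None, []
    for line in diff.splitlines():
        if line.startswith("diff --git"):
            if current_file is not None:
                chunks[current_file] = "\n".join(lines)
            # "diff --git a/path b/path" -> take the b/ path
            current_file = line.split()[-1].removeprefix("b/")
            lines = []
        lines.append(line)
    if current_file is not None:
        chunks[current_file] = "\n".join(lines)
    return chunks
```

Chunking per file (rather than per token budget) is what prevents the "file 12 of 15 gets a cursory glance" failure mode described earlier.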

What we learned building this

1. Agent specialization requires explicit negative constraints

Early versions had the Security Scanner occasionally commenting on variable naming ("this could be more descriptive"). We fixed it not by telling the agent what to do, but by being explicit about what to ignore: "Do not comment on style, naming, or test coverage. If you find no security issues, return an empty findings array."
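Condensed for illustration, a security-only prompt with explicit negative constraints might look like the following (the exact wording of the production prompts is not shown here):

```python
# Sketch of the negative-constraint pattern: tell the agent what to
# ignore, not just what to find. Wording here is illustrative.
SECURITY_PROMPT = """\
You are a security reviewer. Look only for: injection flaws, exposed
secrets, auth bypasses, insecure deserialization, and OWASP Top 10 issues.

Do not comment on style, naming, or test coverage.
If you find no security issues, return an empty findings array:
{"findings": []}
"""
```

The empty-array instruction matters as much as the prohibition: without it, agents tend to pad their output with low-value observations rather than return nothing.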

2. Parallel is faster but sequencing matters for the orchestrator

The orchestrator needs all agent outputs before it can synthesize. We use a fan-out/fan-in pattern: spawn four agents in a single tool call (parallel), wait for all to complete, then pass all outputs to the orchestrator in one prompt. This cuts wall-clock time from ~4 minutes sequential to ~90 seconds parallel.
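The fan-out/fan-in shape is easy to sketch with `asyncio.gather`; here `call_agent` is a hypothetical stand-in for the real subagent invocation:

```python
import asyncio

# Fan-out/fan-in sketch. `call_agent` is a hypothetical stand-in for
# whatever client actually spawns a subagent.
async def call_agent(name: str, diff_slice: str) -> dict:
    await asyncio.sleep(0)  # placeholder for the real model call
    return {"agent": name, "findings": []}

async def review(diff_slices: dict[str, str]) -> list[dict]:
    # Fan out: all specialists run concurrently (~40s wall clock)
    # instead of sequentially (~4 minutes).
    outputs = await asyncio.gather(
        *(call_agent(name, diff) for name, diff in diff_slices.items())
    )
    # Fan in: this point is only reached once every specialist has
    # returned, so the orchestrator sees all outputs in one prompt.
    return list(outputs)
```

`gather` preserves the invariant the orchestrator depends on: it never sees a partial set of specialist outputs.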

3. Structured output is non-negotiable

Free-form review text from specialist agents is hard to merge. We enforce JSON output with a schema:

{
  "findings": [
    {
      "file": "src/auth/login.py",
      "line": 47,
      "severity": "BLOCK",
      "category": "sql-injection",
      "message": "User input passed directly to SQL query without parameterization",
      "suggestion": "Use parameterized queries: cursor.execute('SELECT * FROM users WHERE id = %s', (user_id,))"
    }
  ]
}

If an agent returns malformed JSON, the orchestrator retries once. If it fails twice, that agent's findings are marked as unavailable β€” the review posts anyway with a warning.

4. The Haiku style agent pays for itself

Using claude-haiku-4-5 for style review costs roughly $0.001 per PR. It catches 80% of style issues. Reserving Opus for logic and security β€” where reasoning depth matters β€” keeps costs under $0.05 per review while maintaining quality where it counts.

Results after 3 months

Production metrics (Jan–Mar 2026)
  • 847 PRs reviewed automatically
  • 91 seconds average review time (vs 4.2 hours human average)
  • 23 critical bugs caught before merge (6 were security issues)
  • $0.04 average cost per PR review
  • 94% of findings rated "useful" or "very useful" by developers
  • 0 false positives that blocked a valid merge (after week 2 tuning)

The most surprising finding: the Logic Checker caught three race conditions that had existed in the codebase for over a year. Human reviewers had approved those files dozens of times. The agent caught them on the first pass because it had no prior context bias β€” it read the code fresh every time.

What we'd do differently

If we were starting over, we'd add a fifth specialist earlier: a Dependency Auditor that checks new package additions against known CVEs and license compatibility. We added it in week 6 after a near-miss with a transitive dependency that had a known exploit.

We'd also instrument agent agreement rates from day one. When Security Scanner and Logic Checker both flag the same line, that's a high-confidence issue. When only one flags it, the orchestrator should weight it lower. We built this analysis retrospectively β€” it should have been in v1.
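The agreement weighting itself is simple to sketch; the weights below are illustrative, not the values we'd ship:

```python
from collections import defaultdict

# Agreement-based confidence sketch: findings that multiple specialists
# flag at the same file:line get a higher weight. Weights are illustrative.
def weight_by_agreement(all_findings: list[dict]) -> dict[tuple, float]:
    votes: dict[tuple, set] = defaultdict(set)
    for f in all_findings:
        votes[(f["file"], f["line"])].add(f["agent"])
    # One agent -> weight 1.0; each additional agreeing agent adds 0.5.
    return {loc: 1.0 + 0.5 * (len(agents) - 1) for loc, agents in votes.items()}
```

Because specialists have disjoint mandates, agreement across agents is a strong signal: it means the same line looks wrong from two independent angles.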

Try it yourself

The full code review skill is part of MindSwarm. It works with any GitHub repository and takes about 10 minutes to set up. The agent prompts, JSON schemas, and GitHub Actions workflow are all included.

The agents don't replace human judgment; they amplify it. Your team reviews fewer issues, but the ones that reach human eyes are the ones that actually need human eyes.

Run this on your codebase.

Set up in 10 minutes. Free tier included.

See pricing →