Every developer knows the feeling: you open a pull request, ping your team, and wait. Sometimes hours. Sometimes days. The reviewer is busy. The context is lost. The bug slips through anyway.
We decided to stop waiting. Instead of one tired human reviewer, we built a swarm of five specialized AI agents that review every PR in under 90 seconds, and catch things humans routinely miss.
Here's exactly how we did it, what broke, and what the numbers look like after three months in production.
The obvious first attempt is to just ask Claude or GPT-4 to "review this PR." It works, sort of. But a single LLM trying to do everything at once is like hiring one person to be your security auditor, your style guide enforcer, your test coverage analyst, and your logic checker simultaneously. Attention gets diluted. The model satisfices instead of excels.
We also noticed a subtler problem: large diffs (1000+ lines) would cause the model to focus on the first few files and trail off by the end. Critical bugs in file 12 of 15 would get a cursory glance.
We split the review responsibility across five agents, each with a focused mandate. An Orchestrator splits the diff and synthesizes the final review; four specialists do the actual reading: a Security Scanner, a Logic Checker, a Style Enforcer, and a Test Analyzer.
When a PR is opened, a GitHub Actions workflow fires and calls our MindSwarm endpoint. The orchestrator splits the diff into chunks and fans out work to all four specialist agents in parallel. Each agent returns structured JSON findings. The orchestrator merges them, deduplicates, and posts a single formatted comment back to GitHub.
PR opened
    │
    ▼
Orchestrator splits diff
    │
    ├── Security Scanner ──────┐
    ├── Logic Checker ─────────┤  (parallel, ~40s)
    ├── Style Enforcer ────────┤
    └── Test Analyzer ─────────┘
    │
    ▼
Orchestrator merges findings
    │
    ▼
GitHub PR comment posted
The entire pipeline runs in a single Claude Code session using the Agent tool to spawn subagents. Each agent gets exactly its slice of the diff: no full-diff overload, no attention dilution.
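The per-file slicing described above can be sketched as a simple split on unified-diff file boundaries. This is an illustrative sketch, not MindSwarm's actual splitter; `split_diff` is a hypothetical helper, and a real implementation might also group small files into shared chunks.

```python
def split_diff(unified_diff: str) -> list[str]:
    """Split a unified diff into per-file chunks on 'diff --git' markers."""
    chunks: list[list[str]] = []
    current: list[str] = []
    for line in unified_diff.splitlines():
        # A new 'diff --git' header starts the next file's chunk.
        if line.startswith("diff --git") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Each chunk can then be routed to whichever specialist needs it, so no agent ever sees more than its slice.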
Early versions had the Security Scanner occasionally commenting on variable naming ("this could be more descriptive"). We fixed it not by telling the agent what to do, but by being explicit about what to ignore: "Do not comment on style, naming, or test coverage. If you find no security issues, return an empty findings array."
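A minimal sketch of what that scoped prompt might look like, assuming the quoted exclusions from above; the surrounding wording is paraphrased, not the production prompt:

```python
# Hypothetical system prompt for the Security Scanner. The negative
# constraints ("do not comment on...") are the fix described above.
SECURITY_SCANNER_PROMPT = """\
You are a security reviewer. Report only security vulnerabilities
in the diff slice you are given.
Do not comment on style, naming, or test coverage.
If you find no security issues, return an empty findings array:
{"findings": []}
Output must be valid JSON matching the findings schema."""
```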
The orchestrator needs all agent outputs before it can synthesize. We use a fan-out/fan-in pattern: spawn four agents in a single tool call (parallel), wait for all to complete, then pass all outputs to the orchestrator in one prompt. This cuts wall-clock time from ~4 minutes sequential to ~90 seconds parallel.
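The fan-out/fan-in shape can be sketched with `asyncio.gather`, assuming a hypothetical `run_agent` coroutine standing in for the real model call:

```python
import asyncio
import time

async def run_agent(name: str, diff_slice: str) -> dict:
    # Stand-in for a real model API call with the agent's prompt.
    await asyncio.sleep(0.1)  # simulates model latency
    return {"agent": name, "findings": []}

async def review(diff_slices: dict) -> list:
    # Fan out: one task per specialist, launched concurrently.
    tasks = [run_agent(name, sl) for name, sl in diff_slices.items()]
    # Fan in: block until every specialist has reported back.
    return await asyncio.gather(*tasks)

slices = {a: "..." for a in ["security", "logic", "style", "tests"]}
start = time.perf_counter()
results = asyncio.run(review(slices))
elapsed = time.perf_counter() - start
# Wall-clock time is roughly one agent's latency, not the sum of four.
```

The same principle explains the sequential-to-parallel speedup: total latency collapses from the sum of agent runtimes to the maximum of them.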
Free-form review text from specialist agents is hard to merge. We enforce JSON output with a schema:
{
  "findings": [
    {
      "file": "src/auth/login.py",
      "line": 47,
      "severity": "BLOCK",
      "category": "sql-injection",
      "message": "User input passed directly to SQL query without parameterization",
      "suggestion": "Use parameterized queries: cursor.execute('SELECT * FROM users WHERE id = %s', (user_id,))"
    }
  ]
}
If an agent returns malformed JSON, the orchestrator retries once. If it fails twice, that agent's findings are marked as unavailable; the review posts anyway with a warning.
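The parse-once-retry-once logic can be sketched as follows; `retry_fn` is a hypothetical callback that re-prompts the agent:

```python
import json

def _try_parse(raw: str):
    """Return the findings list if raw is valid schema JSON, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    findings = data.get("findings") if isinstance(data, dict) else None
    return findings if isinstance(findings, list) else None

def parse_findings(raw: str, retry_fn=None) -> dict:
    findings = _try_parse(raw)
    if findings is None and retry_fn is not None:
        findings = _try_parse(retry_fn())  # exactly one retry
    if findings is None:
        # Degrade gracefully: the review still posts, minus this agent.
        return {"status": "unavailable", "findings": []}
    return {"status": "ok", "findings": findings}
```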
Using claude-haiku-4-5 for style review costs roughly $0.001 per PR. It catches 80% of style issues. Reserving Opus for logic and security, where reasoning depth matters, keeps costs under $0.05 per review while maintaining quality where it counts.
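This routing policy amounts to a small lookup table. A sketch, assuming the haiku model id from above; the Opus id here is a placeholder, not a confirmed model name:

```python
# Cheap model for mechanical checks, deep model where reasoning matters.
MODEL_FOR_AGENT = {
    "style":    "claude-haiku-4-5",    # ~$0.001/PR, per the numbers above
    "tests":    "claude-haiku-4-5",
    "security": "claude-opus-latest",  # placeholder id for the Opus tier
    "logic":    "claude-opus-latest",
}

def model_for(agent: str) -> str:
    return MODEL_FOR_AGENT[agent]
```

Keeping the table explicit makes the cost/quality trade-off auditable: changing one line re-routes an agent to a different tier.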
The most surprising finding: the Logic Checker caught three race conditions that had existed in the codebase for over a year. Human reviewers had approved those files dozens of times. The agent caught them on the first pass because it had no prior context bias; it read the code fresh every time.
If we were starting over, we'd add a fifth specialist earlier: a Dependency Auditor that checks new package additions against known CVEs and license compatibility. We added it in week 6 after a near-miss with a transitive dependency that had a known exploit.
We'd also instrument agent agreement rates from day one. When Security Scanner and Logic Checker both flag the same line, that's a high-confidence issue. When only one flags it, the orchestrator should weight it lower. We built this analysis retrospectively; it should have been in v1.
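Agreement weighting reduces to grouping findings by location and counting distinct agents. A sketch under the assumption that each finding carries an `agent` field alongside the schema fields above; the two-level confidence labels are illustrative:

```python
from collections import defaultdict

def weight_findings(findings: list) -> list:
    """Mark a finding high-confidence when 2+ agents flag the same file:line."""
    agents_at = defaultdict(set)
    for f in findings:
        agents_at[(f["file"], f["line"])].add(f["agent"])
    for f in findings:
        n_agents = len(agents_at[(f["file"], f["line"])])
        f["confidence"] = "high" if n_agents > 1 else "normal"
    return findings
```

Logged over time, the same grouping yields the per-pair agreement rates we wish we had tracked from v1.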
The full code review skill is part of MindSwarm. It works with any GitHub repository and takes about 10 minutes to set up. The agent prompts, JSON schemas, and GitHub Actions workflow are all included.
The agents don't replace human judgment; they amplify it. Your team reviews fewer issues, but the ones that reach human eyes are the ones that actually need human eyes.