Lessons from Open-Inspect.
Most teams ship background agents. Few have run one in production for a year.
The architecture, the credential pattern, the convention rules — and what we’d do differently.
Engineers start running parallel agents on their laptops — usually with git worktrees. The pattern works. Until it doesn’t.
The session count grows.
The machine’s resources don’t.
The pattern repeats across deployments — and it usually serves non-engineers as much as engineers.
The on-call engineer arrives to find a diagnosis and a candidate fix, not an empty terminal.
This is the most common pattern across every deployment we’ve seen.
Most tickets close in the same shift. The ones that do escalate arrive code-grounded, not vibes-based.
And the team grows into it: CS starts by root-causing and handing the session to engineering — later, the same people start proposing the fixes themselves. Small-scope, low-hundreds of lines, junior-engineer shape.
PMs become a steady source of reviewable work — not a queue of issues for engineering to translate.
Scope grows with confidence: most PRs start as small UI tweaks; over weeks, the same PMs land small-to-medium features at a junior-engineer scope.
The data team stops being the bottleneck for every operational question.
One caveat: read-only credentials, scoped to the warehouse — not the OLTP database.
Decisions get made against working artifacts, not text. Disagreements show up early — not in QA.
Honest note: it’s never quite ready to ship. That’s the point — throwaway is the feature.
State lives in the Durable Object, not the client. Every surface and every participant is a thin connector to the same source of truth.
Slack mention, GitHub review, Linear issue, or a typed prompt in the web UI — all spawn the same DO-backed session. One started in Slack shows up live in the web sidebar — no handoff needed.
Anyone with a token joins via WebSocket. Tool calls, sandbox status, and PR artifacts fan out to every connected client. New joiners replay the last ~200 events to catch up.
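The catch-up behaviour described above can be sketched as a bounded event buffer. This is a minimal in-memory illustration, not the Durable Object implementation: `SessionHub`, `join`, and `publish` are hypothetical names, and the limit of 200 is taken from the "~200 events" figure.

```python
from collections import deque
from typing import Any, Callable

REPLAY_LIMIT = 200  # assumed from the "~200 events" figure above


class SessionHub:
    """Hypothetical in-memory stand-in for a session's fan-out state."""

    def __init__(self, replay_limit: int = REPLAY_LIMIT) -> None:
        # A deque with maxlen drops the oldest event once the limit is reached.
        self._recent: deque = deque(maxlen=replay_limit)
        self._clients: list = []

    def join(self, send: Callable[[dict], None]) -> None:
        # Replay buffered events so the new client catches up, then subscribe it.
        for event in self._recent:
            send(event)
        self._clients.append(send)

    def publish(self, event: dict) -> None:
        # Record the event, then fan it out to every connected client.
        self._recent.append(event)
        for send in self._clients:
            send(event)
```

A client that joins late receives only the most recent events in the buffer, then every live event after that.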
Open a PR, @-mention the bot, or let it auto-review on open. The agent reads the diff, fetches context, and posts a real GitHub review.
Same agent, same docs and skills — just started by something other than a chat message. A clock, a webhook, an alert, a PR.
Cron, minimum 15-minute interval. Nightly deps, weekly reports, recurring audits.
One scheduled run at a time per automation.
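The two scheduling rules above (a 15-minute floor and one run at a time per automation) could be enforced with a guard like the one below. `AutomationScheduler` and its methods are hypothetical, and skipping an overlapping tick rather than queueing it is an assumption; the source does not say which the system does.

```python
from datetime import timedelta

MIN_INTERVAL = timedelta(minutes=15)  # from the stated 15-minute minimum


class AutomationScheduler:
    """Hypothetical guard: validates intervals, allows one run per automation."""

    def __init__(self) -> None:
        self._running: set = set()

    def validate(self, interval: timedelta) -> None:
        if interval < MIN_INTERVAL:
            raise ValueError("interval must be at least 15 minutes")

    def try_start(self, automation_id: str) -> bool:
        # Assumption: skip this tick if the previous run is still in progress.
        if automation_id in self._running:
            return False
        self._running.add(automation_id)
        return True

    def finish(self, automation_id: str) -> None:
        self._running.discard(automation_id)
```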
Authenticated HTTP POST, up to 64 KB. Optional JSONPath conditions filter what fires.
Idempotency keys deduplicate noisy senders.
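A rough sketch of the size limit and idempotency check on the webhook trigger follows. `WebhookGate` is a hypothetical name, 64 KB is interpreted as 64 * 1024 bytes, and the in-memory set stands in for whatever durable store the real system uses. Authentication and JSONPath filtering are left out of the sketch.

```python
from typing import Optional

MAX_BODY_BYTES = 64 * 1024  # assumption: "64 KB" means 65,536 bytes


class WebhookGate:
    """Hypothetical pre-check run before a webhook is allowed to start a session."""

    def __init__(self) -> None:
        self._seen_keys: set = set()

    def accept(self, body: bytes, idempotency_key: Optional[str]) -> str:
        if len(body) > MAX_BODY_BYTES:
            return "too_large"
        if idempotency_key is not None:
            # A repeated key means the sender retried; do not fire twice.
            if idempotency_key in self._seen_keys:
                return "duplicate"
            self._seen_keys.add(idempotency_key)
        return "accepted"
```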
HMAC-verified Custom Integration, up to 256 KB. Triage new issues the moment they arrive.
One issue, one session, one PR.
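HMAC verification of an incoming payload generally looks like the sketch below. The hash (SHA-256), the hex encoding of the signature, and the function name `verify_signature` are assumptions; the source only says the integration is HMAC-verified and capped at 256 KB.

```python
import hashlib
import hmac

MAX_BODY_BYTES = 256 * 1024  # assumption: "256 KB" means 262,144 bytes


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Return True only if the body is within the limit and the signature matches."""
    if len(body) > MAX_BODY_BYTES:
        return False
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, signature_hex)
```

The signature must be computed over the raw request bytes, before any JSON parsing, or re-serialisation differences will cause valid requests to fail.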
Triggered by PR, issue, or comment activity on the repos the App watches.
Linear event triggers are next.
For UI-heavy changes, “did it work” is a visual question. The sandbox ships browser-driving tools so the agent answers in pictures, video, or a live URL.
agent-browser screenshot — viewport, full-page, annotated, or diffed against a baseline.
Uploaded to R2 via upload-media. Lightbox in the session sidebar.
agent-browser record — silent MP4 of a flow. ffprobed for real dimensions and duration.
Uploaded as a media artifact, attached to the session.
Modal-native HTTPS for sandbox ports. Up to 10 user ports per session.
Click “Port 3000” in the sidebar — the running prototype, in your browser.
Modeled on Ramp’s internal Inspect system. Forkable, single-tenant, customisable to a company’s own services and conventions.
Every company’s infrastructure is unique. Background agents are critical infrastructure. Critical infrastructure should not be locked in to a vendor.
A perfectly pristine copy of the dev environment. The same one. Every time.
Quarantined from other sessions, from the developer’s laptop, and from production.
Container dies → harness provisions a replacement. State lives in the durable session log.
“In vs. out” is the wrong axis. The agent has to run somewhere. The question is what credentials the sandbox can touch.
Each tool call goes through the control plane. The agent sends arguments, gets back a result — the credential never enters the sandbox.
Agent never sees the token. Compromised sandbox can’t post anywhere it wasn’t already scoped to.
PR attributed to the human who triggered the run. The agent’s identity is its session, not its token.
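The credential pattern above can be illustrated with a small proxy. Everything named here (`ControlPlane`, `ToolCall`, the `vault` and `scopes` dictionaries) is hypothetical; the point of the sketch is only that the sandbox sends arguments and receives a result, while the token is looked up and used on the control-plane side.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToolCall:
    session_id: str
    tool: str
    args: dict


class ControlPlane:
    """Hypothetical proxy: holds credentials and runs tools on the agent's behalf."""

    def __init__(self, vault: dict, scopes: dict) -> None:
        self._vault = vault    # tool name -> credential, never sent to the sandbox
        self._scopes = scopes  # session id -> set of tool names it may call
        self._handlers: dict = {}

    def register(self, tool: str, handler: Callable[[dict, str], dict]) -> None:
        self._handlers[tool] = handler

    def execute(self, call: ToolCall) -> dict:
        # Scope check first: a compromised sandbox cannot widen its own access.
        if call.tool not in self._scopes.get(call.session_id, set()):
            raise PermissionError(f"{call.tool} is not in scope for this session")
        token = self._vault[call.tool]
        # The handler uses the token; only its result goes back to the sandbox.
        return self._handlers[call.tool](call.args, token)
```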
Bonus: “out of the box” still needs a container — the worker just lives elsewhere, and you reinject sandbox state on every turn. The container question is a distraction. The credential question is the work.
~10s
Target time-to-first-token. Slow agents don’t get adopted — speed is the prerequisite, not the polish.
Meet people where they already work. The web UI is the obvious surface; it’s rarely the most used.
An agent is a perpetual new hire that joins fresh every session — with no Slack to ping and no teammate to pair with. Anything undocumented becomes visible immediately.
Shared developer accounts, the same secret pasted into ten .env files, no source-of-truth.
Per-user identities. A vault. Egress-time injection so the sandbox holds placeholders, not keys.
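Egress-time injection can be pictured as a substitution step in a proxy that sits between the sandbox and the network. The `{{secret:NAME}}` placeholder syntax, the `inject_secrets` function, and the per-secret host allowlist are all assumptions made for illustration.

```python
import re

# Hypothetical placeholder syntax; the sandbox only ever sees this string.
PLACEHOLDER = re.compile(r"\{\{secret:([A-Za-z0-9_]+)\}\}")


def inject_secrets(headers: dict, host: str, vault: dict, allowed_hosts: dict) -> dict:
    """Swap placeholders for real values as the request leaves the sandbox."""

    def resolve(match: re.Match) -> str:
        name = match.group(1)
        # Assumption: each secret may only be sent to hosts it is scoped to.
        if host not in allowed_hosts.get(name, set()):
            raise PermissionError(f"{name} may not be sent to {host}")
        return vault[name]

    return {key: PLACEHOLDER.sub(resolve, value) for key, value in headers.items()}
```

Because the real value is added only on the way out, reading the sandbox's environment or files yields the placeholder and nothing else.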
Scope what each agent can touch. The agent has a human’s tools — not necessarily a human’s permissions.
“Service-level auth is usually in place. Granular access control to restrict what the agent can and can’t do, almost never is.”
The agent is managing a context window. Wrap your APIs with tools that take targeted requests and return slimmed, agent-tailored responses.
Agent must construct a query, paginate, parse, summarise. Most attempts at root-causing a production issue exhausted context before resolving it.
Tool builds the right query, returns just the matched lines and the surrounding context. Root-cause success rate “drastically went up.”
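A slimmed log-search tool of the kind described above might look like this. `search_logs`, its parameters, and the substring match are hypothetical; a real tool would query the logging backend rather than a list in memory. The shape is what matters: matched lines plus a little surrounding context, capped so the response cannot flood the context window.

```python
def search_logs(lines: list, needle: str, context: int = 2, max_lines: int = 50) -> list:
    """Return matching lines with nearby context, numbered from 1 and capped."""
    keep: set = set()
    for index, line in enumerate(lines):
        if needle in line:
            start = max(0, index - context)
            stop = min(len(lines), index + context + 1)
            keep.update(range(start, stop))
    # Cap the output so a noisy match cannot exhaust the agent's context.
    selected = sorted(keep)[:max_lines]
    return [f"{index + 1}: {lines[index]}" for index in selected]
```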
Custom tools shrink the agent’s context. Verification artifacts shrink the reviewer’s. A screenshot or a tunnel link tells reviewers in seconds what a diff takes minutes to convey.
For UI work, the agent attaches before/after screenshots to the PR. A PM or reviewer sees the change without pulling the branch.
Open a port from the sandbox. The PM clicks through their own prototype before opening the PR. Reviewers do the same instead of guessing from the diff.
The teams with the best numbers — Ramp, Stripe, Cursor, Browserbase — all describe voluntary adoption. None describe a mandate.
Two preconditions: management visibly backs the effort, and the experience is fast enough that an engineer reaches for the agent on instinct.
PRs per day per reviewer.
Producing code stops being the constraint. Review throughput takes its place.
AI code review on top of CI. Smaller, single-purpose PRs. Skills that produce code already shaped like what your team merges.
Engineering output as a review metric. Tying performance to PRs closed. The system optimises for whatever you measure.
Warnings are read by reviewers. Deterministic failures are read by the agent — and the agent corrects.
An agent skipped the repository layer. One rule at severity: ERROR, and the next PR used the helper.
Skips the repository, the read-replica router, the request-scoped session.
severity: ERROR + CI exit code = a red check in the agent’s own trajectory.
Same intent. Your conventions. Rule catches every future regression.
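The source does not name the lint tool behind this rule, so the sketch below swaps in a plain AST check to show the mechanism: a deterministic failure with a non-zero exit code that the agent sees in CI. The module path `app.db.session` and the directory `app/repositories/` are hypothetical conventions.

```python
import ast

FORBIDDEN_MODULE = "app.db.session"   # hypothetical: the raw session module
ALLOWED_PREFIX = "app/repositories/"  # hypothetical: only this layer may import it


def _imported_names(node: ast.AST) -> list:
    if isinstance(node, ast.Import):
        return [alias.name for alias in node.names]
    if isinstance(node, ast.ImportFrom):
        base = node.module or ""
        return [base] + [f"{base}.{alias.name}" for alias in node.names]
    return []


def find_violations(path: str, source: str) -> list:
    """Report direct imports of the session module outside the repository layer."""
    if path.startswith(ALLOWED_PREFIX):
        return []
    findings = []
    for node in ast.walk(ast.parse(source)):
        names = _imported_names(node)
        if any(n == FORBIDDEN_MODULE or n.startswith(FORBIDDEN_MODULE + ".") for n in names):
            findings.append(f"{path}:{node.lineno}: ERROR use the repository helper")
    return findings


def exit_code(findings: list) -> int:
    # A non-zero exit turns the finding into a red check in CI.
    return 1 if findings else 0
```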
If the steps live in a teammate’s head, the agent can’t follow them. Build the docs you should already have.
Treat it like any ML/AI system: continuous iteration. When a PR fails, give the agent the trajectory — and turn it into a doc, a skill, or a lint rule. The feedback loop is the product.
As the system rolls out across personas, edge cases accumulate. Staff a small team to keep the line running. The biggest risk is treating it as a project.
Sharper isolation between the agent loop and the secrets it operates with. Network-policy primitives the agent inside the box can’t reach around. Permission scopes that survive prompt injection.
Close the “PR fails → docs / skills / rules” loop automatically. Failed-trajectory mining for skill suggestions. The agent gets better at your codebase faster than you can write docs.
Harden permissions across the system as user counts grow. Finer-grained roles, per-session and per-tool scopes, audit trails that match what real orgs need.
All three are directions. Share what would be useful to you — find me after.
CM Engineering takes on advisory work for background-agent systems. Reach out to learn more.
@-mention the bot in a thread; a sandbox spins up and reports back live.
Branch from an active session to pursue a parallel approach without losing the parent’s context.
Capture a video of the agent navigating its sandbox — visual proof of the change, attached to the PR.
Assign the agent to a Linear issue; it picks up the context, runs in a sandbox, and reports back on the ticket.