Open Inspect

Modal · May 20 · 2026

A conversation

Running background agents in production.

Lessons from Open-Inspect.

Speaker: Cole Murray · CM Engineering
Date: May 20, 2026
Format: 60 min · conversation + Q&A

The starting point

One agent, one machine. Then five. Then ten.

Engineers start running parallel agents on their laptops — usually with git worktrees. The pattern works. Until it doesn’t.

01 Each session runs its own linters and tests.
02 Each session compiles, in compiled languages.
03 Each session boots its own dev server.

The session count grows.
The machine’s resources don’t.

01 The problem · Localhost doesn’t scale

What the local model can’t carry

Three places localhost breaks.

01 · Resource ceiling

Parallel agents consume parallel CPU, RAM, and disk. Worktrees don’t add hardware.

02 · Setup tax

20+ microservices, each with their own services, secrets, and ports. New laptop = lost day.

03 · Non-engineers

Asking a PM or support agent to configure your dev env is, in practice, asking them not to contribute.

A background agent system is, mechanically, the answer to all three at once.

01 The problem · Three failures

The high-leverage patterns

Five worth building for first.

The pattern repeats across deployments — and it usually serves non-engineers as much as engineers.

On-call triage & auto-fix.

SRE

Root-cause from the support inbox.

Customer support

PMs contributing PRs.

Product

Self-serve data & BI.

Ops · Finance · Product

Rapid prototyping against the real codebase.

Product · Design

03 Use cases · At a glance

Use case 01 · SRE

Alerts triage themselves.

Before

A page.

An engineer wakes up to an alert and starts hunting through logs, traces, and recent commits.

→

After

A PR.

The agent opened a session the moment the alert fired, gathered the same context, and proposed a fix.

The change

The on-call engineer arrives to a diagnosis with a candidate fix — not an empty terminal.

This is the most common pattern across every deployment we’ve seen.

03 Use cases · 01 SRE

Use case 02 · Support

Root-cause without an escalation.

Before

Escalate.

A ticket lands in engineering’s queue with a screenshot and a guess at what went wrong.

→

After

Resolve.

Support runs the agent against the same logs and account state engineers would — and answers the question.

The change

Most tickets close in the same shift. The ones that do escalate arrive code-grounded, not vibes-based.

And the team grows into it: CS starts by root-causing and handing the session to engineering; later, the same people start proposing the fixes themselves. Small scope, low hundreds of lines, junior-engineer shape.

03 Use cases · 02 Support

Use case 03 · Product

PMs file PRs, not issues.

Before

A spec.

A PM writes a doc, hands it to engineering, and waits for it to come back as a feature.

→

After

A diff.

The PM opens the codebase, runs the change, and arrives at engineering with a working PR.

The change

PMs become a steady source of reviewable work — not a queue of issues for engineering to translate.

Scope grows with confidence: most PRs start as small UI tweaks; over weeks, the same PMs land small-to-medium features at a junior-engineer scope.

03 Use cases · 03 PMs as contributors

Use case 04 · Data & BI

The dashboard you didn’t have to ask for.

Before

A ticket.

“How many trial accounts converted last week, by plan tier?” goes to the data team’s backlog.

→

After

A reply.

The agent has read access to the warehouse, writes the SQL, returns the table — in the same Slack thread.

The change

The data team stops being the bottleneck for every operational question.

One caveat: read-only credentials, scoped to the warehouse — not the OLTP database.
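The read-only caveat is easiest to trust when the connection itself enforces it, rather than the prompt. A minimal sketch, assuming SQLite as a local stand-in for the warehouse (the deck does not name one); the file, table, and column names are hypothetical.

```python
import os
import sqlite3
import tempfile

# Stand-in "warehouse": a throwaway SQLite file with a hypothetical schema.
path = os.path.join(tempfile.mkdtemp(), "warehouse.db")
seed = sqlite3.connect(path)
seed.execute("CREATE TABLE trials (plan TEXT, converted INTEGER)")
seed.executemany(
    "INSERT INTO trials VALUES (?, ?)",
    [("pro", 1), ("pro", 0), ("team", 1)],
)
seed.commit()
seed.close()

# The agent only ever receives this read-only handle (SQLite URI mode=ro).
agent_conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)

# Reads work: conversions by plan tier.
rows = agent_conn.execute(
    "SELECT plan, SUM(converted) FROM trials GROUP BY plan ORDER BY plan"
).fetchall()

# Writes fail at the connection, whatever the agent was asked to do.
try:
    agent_conn.execute("DELETE FROM trials")
    write_blocked = False
except sqlite3.OperationalError:
    write_blocked = True
```

In a real deployment the same idea would be a read-only database role on the warehouse; the point of the sketch is only that the restriction lives in the credential, not in the instructions.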

03 Use cases · 04 Data & BI

Use case 05 · Prototyping

Consensus before the PRD.

Before

A doc.

Stakeholders argue about a paragraph. The disagreement only surfaces once engineering has built the wrong thing.

→

After

A demo.

The agent builds it against the actual codebase — same components, same colors. People click through, not at.

The change

Decisions get made against working artifacts, not text. Disagreements show up early — not in QA.

Honest note: it’s never quite ready to ship. That’s the point — throwaway is the feature.

03 Use cases · 05 Rapid prototyping

The unit of work

One session. Many surfaces, many people.

State lives in the Durable Object, not the client. Every surface and every participant is a thin connector to the same source of truth.

Many surfaces.

Slack mention, GitHub review, Linear issue, or a typed prompt in the web UI — all spawn the same DO-backed session. One started in Slack shows up live in the web sidebar — no handoff needed.

Many people.

Anyone with a token joins via WebSocket. Tool calls, sandbox status, and PR artifacts fan out to every connected client. New joiners replay the last ~200 events to catch up.
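The fan-out and catch-up behaviour can be sketched in a few lines. This is an in-memory illustration only, not the Durable Object implementation: the `Session` class, its method names, and the exact replay limit of 200 (the slide says "~200") are assumptions.

```python
from collections import deque

REPLAY_LIMIT = 200  # assumption: the slide only says "~200"


class Session:
    """Single source of truth; every surface is a thin subscriber."""

    def __init__(self):
        self.events = deque(maxlen=REPLAY_LIMIT)  # oldest events fall off
        self.clients = {}                         # client_id -> send callable

    def join(self, client_id, send):
        # A new joiner first replays recent history, then receives live events.
        for event in self.events:
            send(event)
        self.clients[client_id] = send

    def publish(self, event):
        # Record once, then fan out to every connected client.
        self.events.append(event)
        for send in self.clients.values():
            send(event)


session = Session()
slack_seen, web_seen = [], []

session.join("slack", slack_seen.append)
session.publish({"type": "tool_call", "name": "bash"})
session.join("web", web_seen.append)      # late joiner catches up via replay
session.publish({"type": "pr_opened"})
```

Both clients end up with the same two events, even though the web client connected after the first one was published.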

04 What it does · Sessions

Capability 01 · code review

Reviews every PR, faster than a human.

Open a PR, @-mention the bot, or let it auto-review on open. The agent reads the diff, fetches context, and posts a real GitHub review.

GitHub webhook: four events trigger the bot

PR opened (gated) · review requested · @-mention in comment · inline review reply

↓ verify perms · “eyes” ack · spin up session · gh pr diff ↓

Agent posts: a real GitHub review, not just a comment

APPROVE · REQUEST_CHANGES · inline comments on specific lines
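The four triggers can be read as a small routing function over incoming webhook deliveries. A sketch under stated assumptions: the function name, the `auto_review` flag (one reading of "gated"), and the returned labels are hypothetical, and the mapping onto GitHub's `pull_request`, `issue_comment`, and `pull_request_review_comment` events is an interpretation, not Open Inspect's actual handler.

```python
from typing import Optional


def review_trigger(event, payload, bot_login, auto_review) -> Optional[str]:
    """Return which of the four triggers fired, or None to ignore the delivery."""
    action = payload.get("action")
    comment = payload.get("comment") or {}
    mentioned = f"@{bot_login}" in (comment.get("body") or "")

    if event == "pull_request" and action == "opened":
        # "Gated": only auto-review when the repo has opted in (assumption).
        return "pr_opened" if auto_review else None
    if event == "pull_request" and action == "review_requested":
        reviewer = (payload.get("requested_reviewer") or {}).get("login")
        return "review_requested" if reviewer == bot_login else None
    if event == "issue_comment" and action == "created":
        # Issue comments also fire on plain issues; require a PR and a mention.
        on_pr = "pull_request" in (payload.get("issue") or {})
        return "mention" if on_pr and mentioned else None
    if event == "pull_request_review_comment" and action == "created":
        is_reply = comment.get("in_reply_to_id") is not None
        return "inline_reply" if is_reply else None
    return None
```

Permission checks, the "eyes" acknowledgement, and session start-up would follow only when this returns a trigger.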

04 What it does · Code review

Capability 02 · automations

The agent runs without you.

Same agent, same docs and skills — just started by something other than a chat message. A clock, a webhook, an alert, a PR.

Schedule

Cron, minimum 15-minute interval. Nightly deps, weekly reports, recurring audits.

One scheduled run at a time per automation.

Inbound webhook

Authenticated HTTP POST, up to 64 KB. Optional JSONPath conditions filter what fires.

Idempotency keys deduplicate noisy senders.

Sentry alert

HMAC-verified Custom Integration, up to 256 KB. Triage new issues the moment they arrive.

One issue, one session, one PR.

GitHub event

Triggered by PR, issue, or comment activity on the repos the App watches.

Linear event triggers are planned next.

Stored as first-class objects: name, repo, branch, model, instructions (≤10 K chars). Three consecutive failures → auto-paused.
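The stored object and its guardrails can be sketched as a dataclass. The field names follow the slide; the class itself, the `interval_minutes` simplification (real schedules are cron expressions), and the method name `record_run` are assumptions made for illustration.

```python
from dataclasses import dataclass

MIN_INTERVAL_MINUTES = 15        # from the slide: minimum schedule interval
MAX_INSTRUCTION_CHARS = 10_000   # from the slide: instructions <= 10 K chars
MAX_CONSECUTIVE_FAILURES = 3     # from the slide: three failures -> auto-paused


@dataclass
class Automation:
    name: str
    repo: str
    branch: str
    model: str
    instructions: str
    interval_minutes: int            # simplification of a cron expression
    consecutive_failures: int = 0
    paused: bool = False

    def __post_init__(self):
        if self.interval_minutes < MIN_INTERVAL_MINUTES:
            raise ValueError("schedule interval must be at least 15 minutes")
        if len(self.instructions) > MAX_INSTRUCTION_CHARS:
            raise ValueError("instructions exceed 10,000 characters")

    def record_run(self, succeeded: bool) -> None:
        # A success resets the streak; the third failure in a row pauses.
        if succeeded:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
            self.paused = True


nightly = Automation(
    name="nightly-deps", repo="acme/api", branch="main",
    model="hypothetical-model", instructions="Bump dependencies.",
    interval_minutes=24 * 60,
)
for outcome in (False, False, True, False, False):
    nightly.record_run(outcome)
paused_before_third = nightly.paused
nightly.record_run(False)
```

Only consecutive failures count, so a flaky automation that recovers keeps running while a broken one stops burning sandbox time.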

04 What it does · Automations

Capability 03 · verification

The agent shows its work.

For UI-heavy changes, “did it work” is a visual question. The sandbox ships browser-driving tools so the agent answers in pictures, video, or a live URL.

Screenshots

agent-browser screenshot — viewport, full-page, annotated, or diffed against a baseline.

Uploaded to R2 via upload-media. Lightbox in the session sidebar.

Video

agent-browser record — silent MP4 of a flow. ffprobed for real dimensions and duration.

Uploaded as a media artifact, attached to the session.

Tunnels

Modal-native HTTPS for sandbox ports. Up to 10 user ports per session.

Click “Port 3000” in the sidebar — the running prototype, in your browser.

A PR with a screenshot is reviewed in 30 seconds. A PR with a tunnel link is reviewed by clicking through — the only UI review that actually matters.

04 What it does · Screenshots · video · tunnels

Open Inspect

An open-source background coding agent platform.

Modeled on Ramp’s internal Inspect system. Forkable, single-tenant, customisable to a company’s own services and conventions.

CTRL: Cloudflare Workers · Durable Objects · D1 (SQLite)
DATA: Modal sandboxes with Python, Node, Bun, Chromium, GH CLI
AGENT: OpenCode CLI in server mode, inside each sandbox
CLIENTS: Web · Slack · GitHub PRs · Linear · webhooks · cron · Sentry

Why open-source

Every company’s infrastructure is unique. Background agents are critical infrastructure. Critical infrastructure should not be a vendor lock-in.

05 Architecture · At a glance

The pattern every production system converges on

Split control from data.

Control plane: coordination, state, identity

Cloudflare Workers · Durable Objects (per-session SQLite) · D1 · AES-256-GCM secrets · GitHub OAuth + allowlist

↓ dispatch · stream · resume ↓

Data plane: per-session sandbox, the place code actually runs

Modal sandbox · repo + dev env · OpenCode · code-server · headless Chromium · custom CLI tools

Every production system — Stripe, Ramp, Browserbase, Anthropic — exhibits this split. It’s the durable shape, not the implementation.

05 Architecture · Control / data plane

The data plane in detail

A sandbox is what a developer has — without sharing their machine.

Reproducible

A perfectly pristine copy of the dev environment. The same one. Every time.

Isolated

Quarantined from other sessions, from the developer’s laptop, and from production.

Disposable

Container dies → harness provisions a replacement. State lives in the durable session log.

05 Architecture · Sandboxes

A debate worth re-framing

The agent is in the box. The secrets aren’t.

“In vs. out” is the wrong axis. The agent has to run somewhere. The question is what credentials the sandbox can touch.

Control plane: holds the tokens

Slack OAuth · GitHub user tokens · Linear · internal API keys · tool brokers

↓ tool calls cross this boundary · tokens do not ↓

Modal sandbox: no production credentials

OpenCode (agent loop) · repo + shell + browser · tests + dev tools

05 Architecture · Secrets not in the box

Two tool calls · same pattern

The agent never holds the token.

Each tool call goes through the control plane. The agent sends arguments, gets back a result — the credential never enters the sandbox.

Slack notification

agent → notify(channel, text)
ctrl plane → looks up Slack OAuth in vault,
posts on agent’s behalf
agent sees → { ok: true, ts: "1716..." }

Agent never sees the token. Compromised sandbox can’t post anywhere it wasn’t already scoped to.

Pull request creation

agent → open_pr(branch, title, body)
ctrl plane → uses the USER’s GitHub OAuth
(not a bot account)
agent sees → { url: "github.com/.../pull/4291" }

PR attributed to the human who triggered the run. The agent’s identity is its session, not its token.

Bonus: “out of the box” still needs a container — the worker just lives elsewhere, and you reinject sandbox state on every turn. The container question is a distraction. The credential question is the work.
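The brokerage pattern from the two traces above fits in a small class. This is a sketch, not Open Inspect's broker: `ToolBroker`, the vault layout, the per-session channel scope, and the fake Slack client are all hypothetical, chosen so the example runs offline.

```python
class ToolBroker:
    """Runs in the control plane; the sandbox only ever sees return values."""

    def __init__(self, vault, slack_post):
        self._vault = vault            # session_id -> {"token", "channels"}
        self._slack_post = slack_post  # callable(token, channel, text) -> ts

    def notify(self, session_id, channel, text):
        grant = self._vault[session_id]
        if channel not in grant["channels"]:
            # A compromised sandbox cannot widen its own scope.
            return {"ok": False, "error": "channel_not_in_scope"}
        ts = self._slack_post(grant["token"], channel, text)
        return {"ok": True, "ts": ts}  # the token never appears in the result


# Fake Slack client standing in for the real API call.
sent = []


def fake_slack_post(token, channel, text):
    sent.append((token, channel, text))
    return "1716.000100"  # made-up timestamp for the sketch


broker = ToolBroker(
    vault={"session-1": {"token": "xoxb-hypothetical", "channels": {"#oncall"}}},
    slack_post=fake_slack_post,
)
ok = broker.notify("session-1", "#oncall", "Fix proposed")
denied = broker.notify("session-1", "#finance", "hello")
```

The agent's side of the call carries only arguments and a session identity; the credential lookup and the scope check both happen on the far side of the boundary.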

05 Architecture · The brokerage

Performance is an adoption feature

Warm starts: the agent is ready before the user is.

IMG: Repo images rebuilt every 30 minutes: clone, install, build, snapshot.
FS: Filesystem snapshots after each prompt for follow-up turns.
POOL: Warm pools for high-volume repos.
PROACT: Spin a sandbox up while the user is still typing.

Why this matters

~10s

Target time-to-first-token. Slow agents don’t get adopted — speed is the prerequisite, not the polish.
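A warm pool is the simplest of the four techniques to illustrate. A minimal sketch, assuming a `boot` callable that creates a sandbox from the latest repo image; the `WarmPool` class and its method names are hypothetical.

```python
import itertools
from collections import deque


class WarmPool:
    """Keeps pre-booted sandboxes ready so a session skips the cold start."""

    def __init__(self, boot, target_size):
        self._boot = boot          # callable that boots one sandbox
        self._target = target_size
        self._ready = deque()

    def refill(self):
        # Run in the background, e.g. after each acquire or image rebuild.
        while len(self._ready) < self._target:
            self._ready.append(self._boot())

    def acquire(self):
        """Return (sandbox, was_warm)."""
        if self._ready:
            return self._ready.popleft(), True
        return self._boot(), False  # pool drained: fall back to a cold boot


counter = itertools.count(1)
pool = WarmPool(boot=lambda: f"sb-{next(counter)}", target_size=2)
pool.refill()

first = pool.acquire()
second = pool.acquire()
third = pool.acquire()   # pool is empty here, so this one boots cold
```

The proactive variant is the same `acquire` call, triggered when the user starts typing rather than when they press enter.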

05 Architecture · Warm starts

How a session starts

Eight ways in.

Meet people where they already work. The web UI is the obvious surface; it’s rarely the most used.

Chat

Slack

Most used surface, every system.

Web

Open Inspect UI

Multiplayer sessions, code-server, live preview.

GitHub bot

Triggered by PR comments and reviews.

Issue

Linear bot

Issue → branch → PR.

Schedule

Cron

Recurring sweeps and audits.

Webhook

Inbound HTTP

Filtered by JSONPath conditions.

Alert

Sentry

Reproduce → propose fix.

Loop

Sub-tasks

Parent agent spawns child agents in their own sandboxes.

05 Architecture · Invocation surfaces

The forcing function

Background agents expose every gap in your dev process.

An agent is a perpetual new hire that joins fresh every session — with no Slack to ping and no teammate to pair with. Anything undocumented becomes visible immediately.

Tribal knowledge → Documented setup scripts

Shared dev secrets → Centralised, scoped credentials

Service-level auth only → Granular access control

Special-flag launch incantations → Repeatable Docker images

The #1 gap in production: most teams can’t reliably run the full stack locally to begin with. Fixing it for the agent fixes new-hire onboarding too.

06 Forcing function · The pattern

A prerequisite checklist

Secrets and access control come first.

Find

Shared developer accounts, the same secret pasted into ten .env files, no source-of-truth.

Fix

Per-user identities. A vault. Egress-time injection so the sandbox holds placeholders, not keys.

Then

Scope what each agent can touch. The agent has a human’s tools — not necessarily a human’s permissions.
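Egress-time injection means the sandbox writes a placeholder and a proxy outside it swaps in the real value on the way out. A sketch under assumptions: the `{{secret:NAME}}` placeholder syntax, the function name, and the per-session allowlist are hypothetical, not Open Inspect's format.

```python
import re

# Hypothetical placeholder syntax the sandbox is allowed to hold.
PLACEHOLDER = re.compile(r"\{\{secret:([A-Z0-9_]+)\}\}")


def inject_at_egress(headers, vault, allowed):
    """Replace placeholders in outbound headers; runs outside the sandbox."""

    def resolve(match):
        name = match.group(1)
        if name not in allowed:
            raise PermissionError(f"secret {name!r} is not in scope for this session")
        return vault[name]

    return {key: PLACEHOLDER.sub(resolve, value) for key, value in headers.items()}


vault = {"BILLING_API_KEY": "sk-hypothetical", "PROD_DB_PASSWORD": "hunter2"}
sandbox_headers = {"Authorization": "Bearer {{secret:BILLING_API_KEY}}"}

outbound = inject_at_egress(sandbox_headers, vault, allowed={"BILLING_API_KEY"})
```

The allowlist is where "a human's tools, not necessarily a human's permissions" becomes concrete: a placeholder for a secret outside the session's scope fails at the proxy instead of resolving.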

“Service-level auth is usually in place. Granular access control to restrict what the agent can and can’t do, almost never is.”

06 Forcing function · Secrets & access

Highest-leverage investment

An agent needs a simpler interface than a human does.

It is managing a context window. Wrap your APIs with tools that take targeted requests and return slimmed, agent-tailored responses.

Before — raw API

GET /opensearch/_search
Body: 28 fields, paginated, JSON of unbounded depth

Agent must construct a query, paginate, parse, summarise. Most attempts at root-causing a production issue exhausted context before resolving it.

→

After — tailored CLI

oi-logs find --service=billing --since=15m --text="timeout"

Tool builds the right query, returns just the matched lines and the surrounding context. Root-cause success rate “drastically went up.”
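What such a wrapper does can be sketched without a search cluster. This is not the `oi-logs` implementation: an in-memory list stands in for OpenSearch, and the function name, record shape, and the `context` and `limit` parameters are assumptions.

```python
from datetime import datetime, timedelta


def find_logs(records, service, since, text, now, context=1, limit=20):
    """Return matched lines plus surrounding context as short strings.

    `records` are dicts with 'ts' (datetime), 'service', and 'msg',
    sorted by time; they stand in for search hits.
    """
    window = [
        r for r in records
        if r["service"] == service and r["ts"] >= now - since
    ]
    keep = set()
    for i, record in enumerate(window):
        if text in record["msg"]:
            lo = max(0, i - context)
            hi = min(len(window), i + context + 1)
            keep.update(range(lo, hi))
    # Slim output: one line per record, capped so it cannot flood the context.
    lines = [f'{window[i]["ts"]:%H:%M:%S} {window[i]["msg"]}' for i in sorted(keep)]
    return lines[:limit]


now = datetime(2026, 5, 20, 12, 0, 0)
records = [
    {"ts": now - timedelta(minutes=30), "service": "billing", "msg": "old timeout"},
    {"ts": now - timedelta(minutes=10), "service": "billing", "msg": "charge started"},
    {"ts": now - timedelta(minutes=9), "service": "billing", "msg": "upstream timeout after 30s"},
    {"ts": now - timedelta(minutes=8), "service": "billing", "msg": "retry scheduled"},
    {"ts": now - timedelta(minutes=7), "service": "auth", "msg": "login ok"},
    {"ts": now - timedelta(minutes=6), "service": "billing", "msg": "charge ok"},
]
result = find_logs(records, "billing", timedelta(minutes=15), "timeout", now)
```

The agent gets three short lines back instead of a paginated JSON document, which is the whole trade: the query construction and trimming move out of the context window and into the tool.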

06 Forcing function · Custom tools > raw APIs

The reviewer’s interface, not the agent’s

Make it verifiable.

Custom tools shrink the agent’s context. Verification artifacts shrink the reviewer’s. A screenshot or a tunnel link tells reviewers in seconds what a diff takes minutes to convey.

Screenshots from inside the sandbox.

For UI work, the agent attaches before/after screenshots to the PR. A PM or reviewer sees the change without pulling the branch.

Port-forward to the running prototype.

Open a port from the sandbox. The PM clicks through their own prototype before opening the PR. Reviewers do the same instead of guessing from the diff.

The agent ships faster than anyone can read. Verification artifacts — not better prose — are what keep the human in the loop.

06 Forcing function · Make it verifiable

Adoption pattern, every system, every time

Don’t mandate. Make it obviously better.

The teams with the best numbers — Ramp, Stripe, Cursor, Browserbase — all describe voluntary adoption. None describe a mandate.

Two preconditions: management visibly backs the effort, and the experience is fast enough that an engineer reaches for the agent on instinct.

01 Meet people where they already work: Slack, GitHub, Linear.
02 Run hackathons. Skeptics convert when they ship.
03 Make adoption visible: dashboards, public counters.

07 Adoption · Don’t mandate

The adoption curve we see in the field

Start with bugs the team already knows how to fix.

Phase 01

WEEK 1–2

Small, well-scoped bugs.

The team learns the experience of handing work off — and getting a reviewable PR back.

Phase 02

WEEK 2–3

Larger tasks delegated.

Confidence grows. Engineers reach for the agent on instinct rather than ceremony.

Phase 03

ONGOING

Custom tools, on the team’s own pain points.

Whatever the agent does badly today is what gets the next CLI tool, the next skill, the next setup hook.

07 Adoption · Start small

An unexpectedly large use case

Non-engineers contributing code.

Product managers

— Read the codebase to assess feasibility before writing the PRD.
— Run before/after analyses on launches.
— Send PRs themselves. Engineering reviews.

Customer support

— Root-cause issues straight from the inbound queue.
— Submit fixes for regressions they triaged.
— Write feature requests engineering can act on.

Same review loop, regardless of origin. CI runs. AI code reviewer runs. An engineer signs off.
It shouldn’t matter who pressed enter — the agent makes the outcome reproducible.

07 Adoption · Non-engineers

What you create when you solve production

The bottleneck moves.

1–3 → 10–15+ PRs per day per reviewer.

Production stops being the constraint. Review throughput becomes it.

What helps

AI code review on top of CI. Smaller, single-purpose PRs. Skills that produce code already shaped like what your team merges.

What hurts

Engineering output as a review metric. Tying performance to PRs closed. The system optimises for whatever you measure.

07 Adoption · Review is the new bottleneck

What we learned · the PR problem

Give the agent backpressure, not advice.

Warnings are read by reviewers. Deterministic failures are read by the agent — and the agent corrects.

Step 01

PR opens.

Agent ships, CI starts.

→

Step 02

CI fails.

Semgrep returns severity: ERROR, non-zero exit.

→

Step 03

Agent reads.

Check output enters its own trajectory.

→

Step 04

Agent retries.

Against the rule. PR now passes.

The rule isn’t for the reviewer. It’s signal the agent uses to iterate. Fix the pattern once — it persists across every future PR.
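The four steps amount to a retry loop in which the check's output, not a human, is the feedback. A sketch with stand-ins: `propose` and `run_check` are hypothetical callables representing the agent and the CI check, and the retry cap of three is an assumption.

```python
def iterate_until_green(propose, run_check, max_attempts=3):
    """Feed each failing check's output back to the agent until it passes."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        change = propose(feedback)            # agent writes or revises the change
        exit_code, output = run_check(change)
        if exit_code == 0:
            return change, attempt
        feedback = output                     # failure enters the trajectory
    raise RuntimeError("check still failing after retries")


# Stand-in agent: first try bypasses the repository layer, retry uses it.
def fake_agent(feedback):
    if feedback is None:
        return "from sqlalchemy import create_engine"
    return "from app.repositories import InvoiceRepository"


# Stand-in CI check: non-zero exit plus a message, like a failing lint rule.
def fake_check(change):
    if "sqlalchemy" in change:
        return 1, "ERROR no-direct-sqla: Use app.repositories.*"
    return 0, ""


final_change, attempts = iterate_until_green(fake_agent, fake_check)
```

A warning would never enter `feedback`; only a non-zero exit does, which is why the severity level matters more than the wording of the message.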

07 Adoption · Backpressure, not advice

Anatomy of a convention rule

One PR. One Semgrep rule.

An agent skipped the repository layer. One severity: ERROR rule, and the next PR used the helper.

Before · what shipped

from sqlalchemy import create_engine

engine = create_engine(DATABASE_URL)

def get_overdue_invoices(account_id):
    return engine.execute(
        "SELECT * FROM invoices ...",
    ).fetchall()

Skips the repository, the read-replica router, the request-scoped session.

Rule · Semgrep

rules:
  - id: no-direct-sqla
    message: Use app.repositories.*
    severity: ERROR
    languages: [python]
    pattern: from sqlalchemy import ...

severity: ERROR + CI exit code = a red check in the agent’s own trajectory.

After · on retry

from app.repositories import InvoiceRepository

def get_overdue_invoices(account_id):
    return InvoiceRepository.find_overdue(
        account_id=account_id,
    )

Same intent. Your conventions. Rule catches every future regression.
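The same convention can be expressed as a plain deterministic check, which shows what the rule is doing mechanically. This is a Python `ast` stand-in, not Semgrep itself: the function name is hypothetical, and unlike the rule above it also flags `import sqlalchemy`, which is an assumption about intent.

```python
import ast


def direct_sqlalchemy_imports(source):
    """Return the line numbers of any direct SQLAlchemy import."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom):
            module = node.module or ""
            if module.split(".")[0] == "sqlalchemy":
                hits.append(node.lineno)
        elif isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] == "sqlalchemy" for alias in node.names):
                hits.append(node.lineno)
    return sorted(hits)


before = (
    "from sqlalchemy import create_engine\n"
    "\n"
    "engine = create_engine(DATABASE_URL)\n"
)
after = "from app.repositories import InvoiceRepository\n"

before_hits = direct_sqlalchemy_imports(before)
after_hits = direct_sqlalchemy_imports(after)

# In CI this would become the exit code: non-zero is what the agent reacts to.
exit_code = 1 if before_hits else 0
```

The source is only parsed, never executed, so the check is safe to run on code the agent has just written.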

07 Adoption · Convention rule, captured

Three lessons for teams standing this up

Three things to plan for.

The agent is only as good as your documented processes.

If the steps live in a teammate’s head, the agent can’t follow them. Build the docs you should already have.

This is not a one-and-done system.

Treat it like any ML/AI system: continuous iteration. When a PR fails, give the agent the trajectory — and turn it into a doc, a skill, or a lint rule. The feedback loop is the product.

Plan for an operations load.

As the system rolls out across personas, edge cases accumulate. Staff a small team to keep the line running. The biggest risk is treating it as a project.

08 Lessons · Three things to plan for

The frame I’d leave you with

The agent is a perpetual new hire.
It joins fresh every session. It can’t ask Sarah how the staging server works. Whatever your team has written down is what it knows.

Build the agent. Build the company that can hire it.

08 Lessons · The closing frame

Roadmap · directions, not commitments

Where Open-Inspect is going.

Stronger sandboxing primitives.

Sharper isolation between the agent loop and the secrets it operates with. Network-policy primitives the agent inside the box can’t reach around. Permission scopes that survive prompt injection.

Skills that grow themselves.

Close the “PR fails → docs / skills / rules” loop automatically. Failed-trajectory mining for skill suggestions. The agent gets better at your codebase faster than you can write docs.

Better access control.

Harden permissions across the system as user counts grow. Finer-grained roles, per-session and per-tool scopes, audit trails that match what real orgs need.

All three are directions. Share what would be useful to you — find me after.

08 Lessons · What’s next

Thanks

Standing one up on your team?

CM Engineering takes on advisory work for background-agent systems. Reach out to learn more.

Speaker: Cole Murray

Consulting: CM Engineering

Project: github.com/background-agents

Writing: murraycole.com

X: @_colemurray

Slack → agent invocation

Forking a sub-session

Recording the agent's work

Linear → agent on the issue