Open Inspect
Modal · May 20 · 2026
A conversation

Running background agents in production.

Lessons from Open-Inspect.

Speaker: Cole Murray · CM Engineering
Date: May 20, 2026
Format: 60 min · conversation + Q&A
Why this talk
Most teams ship background agents.
Far fewer have run one in production for a year — and felt what breaks.
What we built · how teams use it · what we learned
Section 01
Why localhost stops working.
The starting point

One agent, one machine. Then five. Then ten.

Engineers start running parallel agents on their laptops — usually with git worktrees. The pattern works. Until it doesn’t.

  • 01 Each session runs its own linters and tests.
  • 02 Each session compiles, in compiled languages.
  • 03 Each session boots its own dev server.

The session count grows.
The machine’s resources don’t.

What the local model can’t carry

Three places localhost breaks.

01 · Resource ceiling
N
Parallel agents consume parallel CPU, RAM, and disk. Worktrees don’t add hardware.
02 · Setup tax
20+
Microservices, each with their own services, secrets, and ports. New laptop = lost day.
03 · Non-engineers
0
Asking a PM or support agent to configure your dev env is, in practice, asking them not to contribute.
A background agent system is, mechanically, the answer to all three at once.
Section 03
What people actually do with it.
The high-leverage patterns

Five worth building for first.

The pattern repeats across deployments — and it usually serves non-engineers as much as engineers.

01 · On-call triage & auto-fix. (SRE)
02 · Root-cause from the support inbox. (Customer support)
03 · PMs contributing PRs. (Product)
04 · Self-serve data & BI. (Ops · Finance · Product)
05 · Rapid prototyping against the real codebase. (Product · Design)
Use case 01 · SRE

Alerts triage themselves.

Before
A page.
An engineer wakes up to an alert and starts hunting through logs, traces, and recent commits.
After
A PR.
The agent opened a session the moment the alert fired, gathered the same context, and proposed a fix.
The change

The on-call engineer arrives to a diagnosis with a candidate fix — not an empty terminal.

This is the most common pattern across every deployment we’ve seen.

Use case 02 · Support

Root-cause without an escalation.

Before
Escalate.
A ticket lands in engineering’s queue with a screenshot and a guess at what went wrong.
After
Resolve.
Support runs the agent against the same logs and account state engineers would — and answers the question.
The change

Most tickets close in the same shift. The ones that do escalate arrive code-grounded, not vibes-based.

And the team grows into it: CS starts by root-causing and handing the session to engineering — later, the same people start proposing the fixes themselves. Small-scope, low-hundreds of lines, junior-engineer shape.

Use case 03 · Product

PMs file PRs, not issues.

Before
A spec.
A PM writes a doc, hands it to engineering, and waits for it to come back as a feature.
After
A diff.
The PM opens the codebase, runs the change, and arrives at engineering with a working PR.
The change

PMs become a steady source of reviewable work — not a queue of issues for engineering to translate.

Scope grows with confidence: most PRs start as small UI tweaks; over weeks, the same PMs land small-to-medium features at a junior-engineer scope.

Use case 04 · Data & BI

The dashboard you didn’t have to ask for.

Before
A ticket.
“How many trial accounts converted last week, by plan tier?” goes to the data team’s backlog.
After
A reply.
The agent has read access to the warehouse, writes the SQL, returns the table — in the same Slack thread.
The change

The data team stops being the bottleneck for every operational question.

One caveat: read-only credentials, scoped to the warehouse — not the OLTP database.

Use case 05 · Prototyping

Consensus before the PRD.

Before
A doc.
Stakeholders argue about a paragraph. The disagreement only surfaces once engineering has built the wrong thing.
After
A demo.
The agent builds it against the actual codebase — same components, same colors. People click through, not at.
The change

Decisions get made against working artifacts, not text. Disagreements show up early — not in QA.

Honest note: it’s never quite ready to ship. That’s the point — throwaway is the feature.

Section 04
What Open Inspect actually does.
The unit of work

One session. Many surfaces, many people.

State lives in the Durable Object, not the client. Every surface and every participant is a thin connector to the same source of truth.

01

Many surfaces.

Slack mention, GitHub review, Linear issue, or a typed prompt in the web UI — all spawn the same DO-backed session. One started in Slack shows up live in the web sidebar — no handoff needed.

02

Many people.

Anyone with a token joins via WebSocket. Tool calls, sandbox status, and PR artifacts fan out to every connected client. New joiners replay the last ~200 events to catch up.
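The join-and-replay behaviour can be sketched with a bounded buffer. This is a minimal in-memory illustration, not the Durable Object code: `SessionHub`, `REPLAY_LIMIT`, and the callback-style `send` are hypothetical names, and a real deployment would send over WebSockets and persist events.

```python
from collections import deque
from typing import Callable

REPLAY_LIMIT = 200  # assumption: mirrors the "~200 events" figure above


class SessionHub:
    """In-memory stand-in for the per-session state holder (hypothetical)."""

    def __init__(self, replay_limit: int = REPLAY_LIMIT) -> None:
        # Bounded buffer: the oldest events fall off once the limit is reached.
        self._recent: deque = deque(maxlen=replay_limit)
        self._clients: dict = {}

    def join(self, client_id: str, send: Callable[[dict], None]) -> None:
        # Replay buffered events first so the new joiner catches up...
        for event in self._recent:
            send(event)
        # ...then register the client for live fan-out.
        self._clients[client_id] = send

    def leave(self, client_id: str) -> None:
        self._clients.pop(client_id, None)

    def publish(self, event: dict) -> None:
        # Record the event, then fan it out to every connected client.
        self._recent.append(event)
        for send in self._clients.values():
            send(event)
```

Because every surface is a thin client of one hub, a Slack thread and the web sidebar see the same event stream without any handoff logic.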

04 What it doesSessions
14
Capability 01 · code review

Reviews every PR, faster than a human.

Open a PR, @-mention the bot, or let it auto-review on open. The agent reads the diff, fetches context, and posts a real GitHub review.

GitHub webhook: four events trigger the bot
PR opened (gated) · review requested · @-mention in comment · inline review reply
↓  verify perms · “eyes” ack · spin up session · gh pr diff  ↓
Agent posts: a real GitHub review, not just a comment
APPROVE · REQUEST_CHANGES · inline comments on specific lines
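One way to picture the four triggers is a small routing function over GitHub webhook deliveries. This is a sketch under assumptions: `review_trigger`, `bot_login`, and the `auto_review` flag are hypothetical names, and the real gating and permission checks are not shown in the slides. The event names and payload fields are standard GitHub webhook fields.

```python
from typing import Optional


def review_trigger(event: str, payload: dict, bot_login: str,
                   auto_review: bool = False) -> Optional[str]:
    """Return a trigger name if this webhook delivery should start a review."""
    action = payload.get("action")

    if event == "pull_request":
        if action == "opened":
            # "Gated": only fires when auto-review is switched on (assumed flag).
            return "pr_opened" if auto_review else None
        if action == "review_requested":
            reviewer = payload.get("requested_reviewer") or {}
            return "review_requested" if reviewer.get("login") == bot_login else None

    if event == "issue_comment" and action == "created":
        # Comments on a PR conversation arrive as issue_comment events.
        is_pr = "pull_request" in payload.get("issue", {})
        body = payload.get("comment", {}).get("body", "")
        return "mention" if is_pr and f"@{bot_login}" in body else None

    if event == "pull_request_review_comment" and action == "created":
        # Replies inside an inline thread carry in_reply_to_id.
        # Simplification: this does not check that the thread belongs to the bot.
        comment = payload.get("comment", {})
        return "inline_reply" if comment.get("in_reply_to_id") else None

    return None
```

Anything that returns `None` is ignored before a sandbox is ever provisioned, which keeps noisy repositories cheap.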
Capability 02 · automations

The agent runs without you.

Same agent, same docs and skills — just started by something other than a chat message. A clock, a webhook, an alert, a PR.

Schedule

Cron, minimum 15-minute interval. Nightly deps, weekly reports, recurring audits.

One scheduled run at a time per automation.

Inbound webhook

Authenticated HTTP POST, up to 64 KB. Optional JSONPath conditions filter what fires.

Idempotency keys deduplicate noisy senders.

Sentry alert

HMAC-verified Custom Integration, up to 256 KB. Triage new issues the moment they arrive.

One issue, one session, one PR.

GitHub event

Triggered by PR, issue, or comment activity on the repos the App watches.

Linear event triggers are next.

Stored as first-class objects: name, repo, branch, model, instructions (≤10 K chars). Three consecutive failures → auto-paused.
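The intake rules above (signature check, size cap, idempotency, auto-pause) fit in a few lines. This is a sketch, not the project's code: the `Automation` class and its fields are hypothetical, SHA-256 is an assumed HMAC digest, and the in-memory key set stands in for durable storage.

```python
import hashlib
import hmac
from dataclasses import dataclass, field

MAX_WEBHOOK_BYTES = 64 * 1024   # "up to 64 KB" from the slide
MAX_FAILURES = 3                # "three consecutive failures" from the slide


def signature_ok(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Compare an HMAC-SHA256 hex digest in constant time (digest is an assumption)."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)


@dataclass
class Automation:
    name: str
    paused: bool = False
    consecutive_failures: int = 0
    seen_keys: set = field(default_factory=set)

    def accept(self, body: bytes, idempotency_key: str) -> bool:
        """Decide whether a delivery should start a run."""
        if self.paused or len(body) > MAX_WEBHOOK_BYTES:
            return False
        if idempotency_key in self.seen_keys:
            return False  # duplicate from a noisy sender
        self.seen_keys.add(idempotency_key)
        return True

    def record(self, succeeded: bool) -> None:
        """Track run outcomes; pause after MAX_FAILURES failures in a row."""
        self.consecutive_failures = 0 if succeeded else self.consecutive_failures + 1
        if self.consecutive_failures >= MAX_FAILURES:
            self.paused = True
```

A success resets the failure counter, so only an unbroken run of failures pauses the automation.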
Capability 03 · verification

The agent shows its work.

For UI-heavy changes, “did it work” is a visual question. The sandbox ships browser-driving tools so the agent answers in pictures, video, or a live URL.

Screenshots

agent-browser screenshot — viewport, full-page, annotated, or diffed against a baseline.

Uploaded to R2 via upload-media. Lightbox in the session sidebar.

Video

agent-browser record — silent MP4 of a flow. ffprobed for real dimensions and duration.

Uploaded as a media artifact, attached to the session.

Tunnels

Modal-native HTTPS for sandbox ports. Up to 10 user ports per session.

Click “Port 3000” in the sidebar — the running prototype, in your browser.

A PR with a screenshot is reviewed in 30 seconds. A PR with a tunnel link is reviewed by clicking through — the only UI review that actually matters.
Section 05
The architecture behind Open Inspect.
Open Inspect

An open-source background coding agent platform.

Modeled on Ramp’s internal Inspect system. Forkable, single-tenant, customisable to a company’s own services and conventions.

  • CTRL: Cloudflare Workers · Durable Objects · D1 (SQLite)
  • DATA: Modal sandboxes with Python, Node, Bun, Chromium, GH CLI
  • AGENT: OpenCode CLI in server mode, inside each sandbox
  • CLIENTS: Web · Slack · GitHub PRs · Linear · webhooks · cron · Sentry
Why open-source

Every company’s infrastructure is unique. Background agents are critical infrastructure. Critical infrastructure should not be a vendor lock-in.

The pattern every production system converges on

Split control from data.

Control plane: coordination, state, identity
Cloudflare Workers · Durable Objects (per-session SQLite) · D1 · AES-256-GCM secrets · GitHub OAuth + allowlist
↓  dispatch · stream · resume  ↓
Data plane: per-session sandbox, the place code actually runs
Modal sandbox · Repo + dev env · OpenCode · code-server · headless Chromium · custom CLI tools
Every production system — Stripe, Ramp, Browserbase, Anthropic — exhibits this split. It’s the durable shape, not the implementation.
The data plane in detail

A sandbox is what a developer has — without sharing their machine.

Reproducible

A perfectly pristine copy of the dev environment. The same one. Every time.

Isolated

Quarantined from other sessions, from the developer’s laptop, and from production.

Disposable

Container dies → harness provisions a replacement. State lives in the durable session log.

A debate worth re-framing

The agent is in the box. The secrets aren’t.

“In vs. out” is the wrong axis. The agent has to run somewhere. The question is what credentials the sandbox can touch.

Control plane: holds the tokens
Slack OAuth · GitHub user tokens · Linear · internal API keys · tool brokers
↓  tool calls cross this boundary · tokens do not  ↓
Modal sandbox: no production credentials
OpenCode (agent loop) · repo + shell + browser · tests + dev tools
Two tool calls · same pattern

The agent never holds the token.

Each tool call goes through the control plane. The agent sends arguments, gets back a result — the credential never enters the sandbox.

Slack notification

agent → notify(channel, text)
ctrl plane → looks up Slack OAuth in vault,
               posts on agent’s behalf
agent sees → { ok: true, ts: "1716..." }

Agent never sees the token. Compromised sandbox can’t post anywhere it wasn’t already scoped to.

·

Pull request creation

agent → open_pr(branch, title, body)
ctrl plane → uses the USER’s GitHub OAuth
               (not a bot account)
agent sees → { url: "github.com/.../pull/4291" }

PR attributed to the human who triggered the run. The agent’s identity is its session, not its token.

Bonus: “out of the box” still needs a container — the worker just lives elsewhere, and you reinject sandbox state on every turn. The container question is a distraction. The credential question is the work.
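The brokerage pattern reduces to one rule: arguments go in, results come out, and the token is looked up on the control-plane side only. The sketch below is illustrative; `ToolBroker`, `register`, and `fake_slack_post` are hypothetical names, and a plain dict stands in for the vault.

```python
from typing import Callable


class ToolBroker:
    """Control-plane side. The vault never leaves this process."""

    def __init__(self, vault: dict) -> None:
        self._vault = vault
        self._tools: dict = {}

    def register(self, name: str, credential: str,
                 handler: Callable[..., dict]) -> None:
        # Each tool names the credential it needs; the sandbox never does.
        self._tools[name] = (credential, handler)

    def call(self, name: str, **args) -> dict:
        """Entry point for the sandbox: arguments in, result out."""
        credential, handler = self._tools[name]
        token = self._vault[credential]   # looked up here, never returned
        return handler(token, **args)


def fake_slack_post(token: str, channel: str, text: str) -> dict:
    # Stand-in for a real Slack API call made with `token`.
    return {"ok": True, "channel": channel}
```

Usage from the agent's side is a single call such as `broker.call("notify", channel="#oncall", text="deploy ok")`. The result carries no credential, so a compromised sandbox has nothing to exfiltrate beyond what the tool itself allows.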

Performance is an adoption feature

Warm starts: the agent is ready before the user is.

  • IMG: Repo images rebuilt every 30 minutes: clone, install, build, snapshot.
  • FS: Filesystem snapshots after each prompt for follow-up turns.
  • POOL: Warm pools for high-volume repos.
  • PROACT: Spin a sandbox up while the user is still typing.
Why this matters

~10s

Target time-to-first-token. Slow agents don’t get adopted — speed is the prerequisite, not the polish.
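A warm pool is the simplest of the four techniques to sketch. The version below is an assumption-laden illustration: `WarmPool`, `top_up`, and the `provision` callback are hypothetical, and real provisioning would restore a repo image or filesystem snapshot rather than return a string ID.

```python
from collections import deque
from typing import Callable, Tuple


class WarmPool:
    """Keeps up to `target` pre-provisioned sandboxes ready per repo."""

    def __init__(self, provision: Callable[[str], str], target: int = 2) -> None:
        self._provision = provision   # slow path: build a sandbox for a repo
        self._target = target
        self._ready: dict = {}

    def top_up(self, repo: str) -> None:
        # Run in the background, or as soon as the user starts typing.
        pool = self._ready.setdefault(repo, deque())
        while len(pool) < self._target:
            pool.append(self._provision(repo))

    def acquire(self, repo: str) -> Tuple[str, bool]:
        """Return (sandbox_id, was_warm)."""
        pool = self._ready.get(repo)
        if pool:
            return pool.popleft(), True
        return self._provision(repo), False   # cold-start fallback
```

Tracking `was_warm` per acquisition gives a direct measure of how often users pay the cold-start cost.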

How a session starts

Eight ways in.

Meet people where they already work. The web UI is the obvious surface; it’s rarely the most used.

Chat · Slack: Most used surface, every system.
Web · Open Inspect UI: Multiplayer sessions, code-server, live preview.
PR · GitHub bot: Triggered by PR comments and reviews.
Issue · Linear bot: Issue → branch → PR.
Schedule · Cron: Recurring sweeps and audits.
Webhook · Inbound HTTP: Filtered by JSONPath conditions.
Alert · Sentry: Reproduce → propose fix.
Loop · Sub-tasks: Parent agent spawns child agents in their own sandboxes.
Section 06
What you’ll fix along the way.
The forcing function

Background agents expose every gap in your dev process.

An agent is a perpetual new hire that joins fresh every session — with no Slack to ping and no teammate to pair with. Anything undocumented becomes visible immediately.

Tribal knowledge → Documented setup scripts
Shared dev secrets → Centralised, scoped credentials
Service-level auth only → Granular access control
Special-flag launch incantations → Repeatable Docker images
The #1 gap in production: most teams can’t reliably run the full stack locally to begin with. Fixing it for the agent fixes new-hire onboarding too.
A prerequisite checklist

Secrets and access control come first.

Find

Shared developer accounts, the same secret pasted into ten .env files, no source-of-truth.

Fix

Per-user identities. A vault. Egress-time injection so the sandbox holds placeholders, not keys.

Then

Scope what each agent can touch. The agent has a human’s tools — not necessarily a human’s permissions.

Service-level auth is usually in place. Granular access control, which restricts what the agent can and can’t do, almost never is.
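Egress-time injection can be sketched as a substitution step in an outbound proxy. Everything here is an assumption for illustration: the `{{secret:NAME}}` placeholder syntax, the `inject_secrets` function, and the per-session `allowed` set are hypothetical, not the project's actual mechanism.

```python
import re

# Hypothetical placeholder syntax the sandbox is allowed to see.
PLACEHOLDER = re.compile(r"\{\{secret:([A-Za-z0-9_]+)\}\}")


def inject_secrets(headers: dict, vault: dict, allowed: set) -> dict:
    """Swap placeholders for real values as a request leaves the sandbox."""

    def resolve(match: re.Match) -> str:
        name = match.group(1)
        if name not in allowed:
            # Scope check: the session may only use secrets granted to it.
            raise PermissionError(f"secret {name!r} is not in scope for this session")
        return vault[name]

    return {key: PLACEHOLDER.sub(resolve, value) for key, value in headers.items()}
```

The sandbox only ever writes `Bearer {{secret:BILLING_API_KEY}}`. The real key exists on the proxy side, and the `allowed` set is where per-agent scoping is enforced.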
Highest-leverage investment

An agent needs a simpler interface than a human does.

It is managing a context window. Wrap your APIs with tools that take targeted requests and return slimmed, agent-tailored responses.

Before — raw API

GET /opensearch/_search
Body: 28 fields, paginated, JSON of unbounded depth

Agent must construct a query, paginate, parse, summarise. Most attempts at root-causing a production issue exhausted context before resolving it.

After — tailored CLI

oi-logs find --service=billing --since=15m --text="timeout"

Tool builds the right query, returns just the matched lines and the surrounding context. Root-cause success rate “drastically went up.”
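The core of such a wrapper is two small functions: one that turns a few flags into a search body, and one that trims the response. This is a sketch only. `build_query` and `slim` are hypothetical names, and the field names (`service`, `@timestamp`, `level`, `message`) are assumptions about the log schema. The HTTP call itself is omitted.

```python
def build_query(service: str, since: str, text: str, limit: int = 20) -> dict:
    """Translate three flags into an OpenSearch-style search body."""
    return {
        "size": limit,                                   # bounded result count
        "sort": [{"@timestamp": "desc"}],                # newest first
        "_source": ["@timestamp", "level", "message"],   # only fields the agent needs
        "query": {
            "bool": {
                "filter": [
                    {"term": {"service": service}},
                    {"range": {"@timestamp": {"gte": f"now-{since}"}}},
                ],
                "must": [{"match_phrase": {"message": text}}],
            }
        },
    }


def slim(response: dict) -> list:
    """Reduce a search response to one short line per hit."""
    lines = []
    for hit in response.get("hits", {}).get("hits", []):
        src = hit.get("_source", {})
        lines.append(f'{src.get("@timestamp")} {src.get("level")} {src.get("message")}')
    return lines
```

The agent spends its context on a handful of matched lines instead of on query construction, pagination, and deeply nested JSON.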

The reviewer’s interface, not the agent’s

Make it verifiable.

Custom tools shrink the agent’s context. Verification artifacts shrink the reviewer’s. A screenshot or a tunnel link tells reviewers in seconds what a diff takes minutes to convey.

01

Screenshots from inside the sandbox.

For UI work, the agent attaches before/after screenshots to the PR. A PM or reviewer sees the change without pulling the branch.

02

Port-forward to the running prototype.

Open a port from the sandbox. The PM clicks through their own prototype before opening the PR. Reviewers do the same instead of guessing from the diff.

The agent ships faster than anyone can read. Verification artifacts — not better prose — are what keep the human in the loop.
Section 07
Adoption is a product problem.
Adoption pattern, every system, every time

Don’t mandate. Make it obviously better.

The teams with the best numbers — Ramp, Stripe, Cursor, Browserbase — all describe voluntary adoption. None describe a mandate.

Two preconditions: management visibly backs the effort, and the experience is fast enough that an engineer reaches for the agent on instinct.

  • 01 Meet people where they already work: Slack, GitHub, Linear.
  • 02 Run hackathons. Skeptics convert when they ship.
  • 03 Make adoption visible with dashboards and public counters.
The adoption curve we see in the field

Start with bugs the team already knows how to fix.

Phase 01
WEEK 1–2
Small, well-scoped bugs.
The team learns the experience of handing work off — and getting a reviewable PR back.
Phase 02
WEEK 2–3
Larger tasks delegated.
Confidence grows. Engineers reach for the agent on instinct rather than ceremony.
Phase 03
ONGOING
Custom tools, on the team’s own pain points.
Whatever the agent does badly today is what gets the next CLI tool, the next skill, the next setup hook.
An unexpectedly large use case

Non-engineers contributing code.

Product managers

  • Read the codebase to assess feasibility before writing the PRD.
  • Run before/after analyses on launches.
  • Send PRs themselves. Engineering reviews.

Customer support

  • Root-cause issues straight from the inbound queue.
  • Submit fixes for regressions they triaged.
  • Write feature requests engineering can act on.
Same review loop, regardless of origin. CI runs. AI code reviewer runs. An engineer signs off.
It shouldn’t matter who pressed enter — the agent makes the outcome reproducible.
What you create when you solve production

The bottleneck moves.

1–3 → 10–15+ PRs per day per reviewer.

Production stops being the constraint. Review throughput becomes the constraint instead.

What helps

AI code review on top of CI. Smaller, single-purpose PRs. Skills that produce code already shaped like what your team merges.

What hurts

Engineering output as a review metric. Tying performance to PRs closed. The system optimises for whatever you measure.

What we learned · the PR problem

Give the agent backpressure, not advice.

Warnings are read by reviewers. Deterministic failures are read by the agent — and the agent corrects.

Step 01
PR opens.
Agent ships, CI starts.
Step 02
CI fails.
Semgrep returns severity: ERROR, non-zero exit.
Step 03
Agent reads.
Check output enters its own trajectory.
Step 04
Agent retries.
Against the rule. PR now passes.
The rule isn’t for the reviewer. It’s signal the agent uses to iterate. Fix the pattern once — it persists across every future PR.
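The four steps form a loop in which failing check output becomes the agent's next input. The sketch below abstracts both sides as callables. `iterate_until_green`, `propose`, and `run_checks` are hypothetical names, and the attempt cap is an assumption rather than a documented setting.

```python
from typing import Callable, Tuple


def iterate_until_green(propose: Callable[[str], str],
                        run_checks: Callable[[str], Tuple[int, str]],
                        max_attempts: int = 3) -> Tuple[bool, int]:
    """Feed failing check output back to the agent until checks pass."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        change = propose(feedback)               # agent drafts or revises
        exit_code, output = run_checks(change)   # e.g. CI with a lint step
        if exit_code == 0:
            return True, attempt
        feedback = output                        # failure enters the trajectory
    return False, max_attempts
```

The loop only works when the check is deterministic and exits non-zero. A warning that leaves the exit code at 0 never reaches `feedback`, which is why advice alone does not change the agent's output.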
Anatomy of a convention rule

One PR. One Semgrep rule.

An agent skipped the repository layer. One severity: ERROR rule, and the next PR used the helper.

Before · what shipped

from sqlalchemy import create_engine

engine = create_engine(DATABASE_URL)

def get_overdue_invoices(account_id):
    return engine.execute(
        "SELECT * FROM invoices ...",
    ).fetchall()

Skips the repository, the read-replica router, the request-scoped session.

Rule · Semgrep

rules:
  - id: no-direct-sqla
    languages: [python]
    message: Use app.repositories.*
    severity: ERROR
    pattern: from sqlalchemy import ...

severity: ERROR + CI exit code = a red check in the agent’s own trajectory.

After · on retry

from app.repositories import InvoiceRepository

def get_overdue_invoices(account_id):
    return InvoiceRepository.find_overdue(
        account_id=account_id,
    )

Same intent. Your conventions. Rule catches every future regression.

Section 08
If you’re standing this up at your team.
Three lessons for teams standing this up

Three things to plan for.

01

The agent is only as good as your documented processes.

If the steps live in a teammate’s head, the agent can’t follow them. Build the docs you should already have.

02

This is not a one-and-done system.

Treat it like any ML/AI system: continuous iteration. When a PR fails, give the agent the trajectory — and turn it into a doc, a skill, or a lint rule. The feedback loop is the product.

03

Plan for an operations load.

As the system rolls out across personas, edge cases accumulate. Staff a small team to keep the line running. The biggest risk is treating it as a one-off project.

The frame I’d leave you with
The agent is a perpetual new hire.
It joins fresh every session. It can’t ask Sarah how the staging server works. Whatever your team has written down is what it knows.
Build the agent. Build the company that can hire it.
Roadmap · directions, not commitments

Where Open-Inspect is going.

01

Stronger sandboxing primitives.

Sharper isolation between the agent loop and the secrets it operates with. Network-policy primitives the agent inside the box can’t reach around. Permission scopes that survive prompt injection.

02

Skills that grow themselves.

Close the “PR fails → docs / skills / rules” loop automatically. Failed-trajectory mining for skill suggestions. The agent gets better at your codebase faster than you can write docs.

03

Better access control.

Harden permissions across the system as user counts grow. Finer-grained roles, per-session and per-tool scopes, audit trails that match what real orgs need.

All three are directions. Share what would be useful to you — find me after.

Thanks

Standing one up at your team?

CM Engineering takes on advisory work for background-agent systems. Reach out to learn more.

Demo

Slack → agent invocation

@-mention the bot in a thread; a sandbox spins up and reports back live.

Demo

Forking a sub-session

Branch from an active session to pursue a parallel approach without losing the parent's context.

Demo

Recording the agent's work

Capture a video of the agent navigating its sandbox — visual proof of the change, attached to the PR.

Demo

Linear → agent on the issue

Assign the agent to a Linear issue; it picks up the context, runs in a sandbox, and reports back on the ticket.