OpenAI Agents SDK Sandboxes Harness and Memory¶

The April 2026 OpenAI Agents SDK update ships three primitives — controlled sandboxes, an inspectable harness, and configurable memory — in one Python library.

What Shipped¶

OpenAI released the Agents SDK update on 2026-04-15, consolidating three primitives teams previously assembled themselves:

A model-native harness — the control plane around the model
Native sandbox execution — a compute plane for model-directed work
Configurable memory — two systems (session and sandbox)

Python-first; TypeScript is planned.

Harness / Compute Separation¶

The SDK separates a persistent, trusted harness from an ephemeral, untrusted compute environment (concepts guide):

Plane	Owns
Harness	Agent loop, model calls, tool routing, handoffs, approvals, tracing, recovery, run state
Compute	File reads/writes, command execution, dependency installs, mounted storage, exposed ports, state snapshots

Colocation would let model-generated shell commands read loop credentials. Separation contains blast radius and enables snapshot/rehydrate: when a sandbox fails or expires, the SDK restores state in a fresh container from the last checkpoint.

graph LR
    Loop[Harness] -->|Routes tool calls| Sandbox[Sandbox]
    Sandbox -->|Results, snapshots| Loop
    Loop -.->|Trusted state:<br/>credentials, approvals| Loop
    Sandbox -.->|Untrusted execution:<br/>model-generated code| Sandbox

Sandbox Primitives¶

Sandbox execution is authored through SandboxAgent, Runner.run, and RunConfig. SandboxAgent keeps the standard agent surface (instructions, tools, handoffs, mcp_servers, guardrails, hooks) and adds a Manifest plus LocalDir mounts declaring workspace file access.

Sandbox clients are pluggable (reference):

UnixLocalSandboxClient — local filesystem, dev-only
Docker — stronger isolation, production parity
Hosted providers — OpenAI partners with Cloudflare, Vercel, E2B, and Modal for container-based execution

The provider lives in RunConfig, not the agent — swap clients per environment while the agent, manifest, and capabilities stay stable.

Isolation caveat: partners ship containers (Modal uses gVisor). For cross-tenant threat models, container isolation is weaker than Firecracker microVMs — see Subprocess and PID-namespace sandboxing.

Harness Primitives¶

The harness standardises primitives previously bespoke per-agent (Help Net Security):

Tool use via MCP
Progressive disclosure via skills
Custom instructions via AGENTS.md
Code execution via a shell tool
File edits via an apply_patch tool
Compaction for long-running runs

Loop customisation is coarse. Runner manages turns, tools, guardrails, handoffs, and sessions — teams that want full loop control call the Responses API directly.

Memory: Two Systems¶

The SDK exposes two memory systems with distinct lifecycles. Confusing them is the most common mistake.

Session Memory¶

Conversation history with an explicit API (sessions guide):

add_items() — append messages
get_items() — retrieve history
pop_item() — remove most recent
clear_session() — wipe

After a non-streaming run, add_items() persists user input plus model outputs from the latest turn. Backends ship as first-class extras:

Backend	Purpose
`SQLiteSession` / `AsyncSQLiteSession`	Local dev, single-server
`SQLAlchemySession`	Production — Postgres, MySQL, SQLite
`RedisSession`	Shared cache-backed session
`AdvancedSQLiteSession`	Branching, analytics, structured queries
`EncryptedSession`	At-rest encryption wrapper

Sandbox Memory¶

Filesystem artifacts distilled from prior runs (agent memory guide). The workspace stores:

MEMORY.md — concise summary injected into later runs
memories/memory_summary.md — longer distilled lessons
raw_memories/ — unprocessed notes
workspace/sessions/<rollout-id>.jsonl — rollout transcripts

The agent searches MEMORY.md for keywords and opens deeper rollout summaries only when needed — progressive disclosure inside the workspace.

Neither system replaces a dedicated long-term vector or graph store for cross-agent knowledge — pair with agent memory patterns for scope beyond a workspace.

When to Pick the SDK¶

Pick the SDK when:

Python stack, no existing harness or sandbox investments
Container isolation meets your threat model
You accept an opinionated loop (Runner) and memory schema (MEMORY.md, rollout summaries)
You want durable execution without writing it yourself

Skip the SDK when:

You need TypeScript today
You require microVM isolation for cross-tenant blast radius
You need custom turn scheduling, non-standard handoffs, or heterogeneous model routing — call the Responses API directly
You already run a self-hosted harness with verification or replay

Example¶

A SandboxAgent run with a SQLAlchemySession for conversation history and a Docker sandbox for execution. The harness routes the tool call; the sandbox runs shell and apply_patch against a manifested workspace.

from agents import Runner, RunConfig
from agents.sandbox import SandboxAgent, Manifest, LocalDir
from agents.extensions.memory import SQLAlchemySession

session = SQLAlchemySession.from_url(
    "user-123",
    url="postgresql+asyncpg://app:pw@db/agents",
    create_tables=True,
)

agent = SandboxAgent(
    name="refactor-bot",
    instructions="Refactor the target module. Run tests after each change.",
    manifest=Manifest(mounts=[LocalDir("./target", read_write=True)]),
    # tools: shell + apply_patch are wired by the harness
)

result = await Runner.run(
    agent,
    input="Extract the auth middleware into its own module.",
    session=session,
    run_config=RunConfig(sandbox_client="docker"),
)

Swap sandbox_client="docker" for "unix_local" in dev or a hosted provider in production. The agent, manifest, and session stay stable.

Key Takeaways¶

Three primitives in one Python SDK: model-native harness, native sandbox execution, configurable memory — shipped 2026-04-15
Harness owns the trusted loop; sandbox owns untrusted execution — snapshot/rehydrate recovers from sandbox failure
Two memory systems: Session for conversation history (SQLAlchemy, SQLite, Redis, encrypted), sandbox memory for filesystem-distilled lessons across runs
Harness primitives are opinionated (shell, apply_patch, AGENTS.md, MCP, skills, compaction) — bypass Runner for custom loops
Container-level isolation via partner providers (Cloudflare, Vercel, E2B, Modal) — insufficient for threat models requiring microVMs

Sandbox Runtime Comparison — selection rubric across the OpenAI sandbox clients, docker sbx, bubblewrap, and Seatbelt
Sandbox rules and harness tools
Harness engineering
Managed vs self-hosted harness
Agent memory patterns
Session harness sandbox separation
Claude Agent SDK
Copilot SDK