Skip to content

OpenAI Agents SDK Sandboxes Harness and Memory

The April 2026 OpenAI Agents SDK update ships three primitives — controlled sandboxes, an inspectable harness, and configurable memory — in one Python library.

What Shipped

OpenAI released the Agents SDK update on 2026-04-15, consolidating three primitives teams previously assembled themselves:

  1. A model-native harness — the control plane around the model
  2. Native sandbox execution — a compute plane for model-directed work
  3. Configurable memory — two systems (session and sandbox)

Python-first; TypeScript is planned.

Harness / Compute Separation

The SDK separates a persistent, trusted harness from an ephemeral, untrusted compute environment (concepts guide):

Plane Owns
Harness Agent loop, model calls, tool routing, handoffs, approvals, tracing, recovery, run state
Compute File reads/writes, command execution, dependency installs, mounted storage, exposed ports, state snapshots

Colocation would let model-generated shell commands read loop credentials. Separation contains blast radius and enables snapshot/rehydrate: when a sandbox fails or expires, the SDK restores state in a fresh container from the last checkpoint.

graph LR
    Loop[Harness] -->|Routes tool calls| Sandbox[Sandbox]
    Sandbox -->|Results, snapshots| Loop
    Loop -.->|Trusted state:<br/>credentials, approvals| Loop
    Sandbox -.->|Untrusted execution:<br/>model-generated code| Sandbox

Sandbox Primitives

Sandbox execution is authored through SandboxAgent, Runner.run, and RunConfig. SandboxAgent keeps the standard agent surface (instructions, tools, handoffs, mcp_servers, guardrails, hooks) and adds a Manifest plus LocalDir mounts declaring workspace file access.

Sandbox clients are pluggable (reference):

The provider lives in RunConfig, not the agent — swap clients per environment while the agent, manifest, and capabilities stay stable.

Isolation caveat: partners ship containers (Modal uses gVisor). For cross-tenant threat models, container isolation is weaker than Firecracker microVMs — see Subprocess and PID-namespace sandboxing.

Harness Primitives

The harness standardises primitives previously bespoke per-agent (Help Net Security):

  • Tool use via MCP
  • Progressive disclosure via skills
  • Custom instructions via AGENTS.md
  • Code execution via a shell tool
  • File edits via an apply_patch tool
  • Compaction for long-running runs

Loop customisation is coarse. Runner manages turns, tools, guardrails, handoffs, and sessions — teams that want full loop control call the Responses API directly.

Memory: Two Systems

The SDK exposes two memory systems with distinct lifecycles. Confusing them is the most common mistake.

Session Memory

Conversation history with an explicit API (sessions guide):

  • add_items() — append messages
  • get_items() — retrieve history
  • pop_item() — remove most recent
  • clear_session() — wipe

After a non-streaming run, add_items() persists user input plus model outputs from the latest turn. Backends ship as first-class extras:

Backend Purpose
SQLiteSession / AsyncSQLiteSession Local dev, single-server
SQLAlchemySession Production — Postgres, MySQL, SQLite
RedisSession Shared cache-backed session
AdvancedSQLiteSession Branching, analytics, structured queries
EncryptedSession At-rest encryption wrapper

Sandbox Memory

Filesystem artifacts distilled from prior runs (agent memory guide). The workspace stores:

  • MEMORY.md — concise summary injected into later runs
  • memories/memory_summary.md — longer distilled lessons
  • raw_memories/ — unprocessed notes
  • workspace/sessions/<rollout-id>.jsonl — rollout transcripts

The agent searches MEMORY.md for keywords and opens deeper rollout summaries only when needed — progressive disclosure inside the workspace.

Neither system replaces a dedicated long-term vector or graph store for cross-agent knowledge — pair with agent memory patterns for scope beyond a workspace.

When to Pick the SDK

Pick the SDK when:

  • Python stack, no existing harness or sandbox investments
  • Container isolation meets your threat model
  • You accept an opinionated loop (Runner) and memory schema (MEMORY.md, rollout summaries)
  • You want durable execution without writing it yourself

Skip the SDK when:

  • You need TypeScript today
  • You require microVM isolation for cross-tenant blast radius
  • You need custom turn scheduling, non-standard handoffs, or heterogeneous model routing — call the Responses API directly
  • You already run a self-hosted harness with verification or replay

Example

A SandboxAgent run with a SQLAlchemySession for conversation history and a Docker sandbox for execution. The harness routes the tool call; the sandbox runs shell and apply_patch against a manifested workspace.

from agents import Runner, RunConfig
from agents.sandbox import SandboxAgent, Manifest, LocalDir
from agents.extensions.memory import SQLAlchemySession

session = SQLAlchemySession.from_url(
    "user-123",
    url="postgresql+asyncpg://app:pw@db/agents",
    create_tables=True,
)

agent = SandboxAgent(
    name="refactor-bot",
    instructions="Refactor the target module. Run tests after each change.",
    manifest=Manifest(mounts=[LocalDir("./target", read_write=True)]),
    # tools: shell + apply_patch are wired by the harness
)

result = await Runner.run(
    agent,
    input="Extract the auth middleware into its own module.",
    session=session,
    run_config=RunConfig(sandbox_client="docker"),
)

Swap sandbox_client="docker" for "unix_local" in dev or a hosted provider in production. The agent, manifest, and session stay stable.

Key Takeaways

  • Three primitives in one Python SDK: model-native harness, native sandbox execution, configurable memory — shipped 2026-04-15
  • Harness owns the trusted loop; sandbox owns untrusted execution — snapshot/rehydrate recovers from sandbox failure
  • Two memory systems: Session for conversation history (SQLAlchemy, SQLite, Redis, encrypted), sandbox memory for filesystem-distilled lessons across runs
  • Harness primitives are opinionated (shell, apply_patch, AGENTS.md, MCP, skills, compaction) — bypass Runner for custom loops
  • Container-level isolation via partner providers (Cloudflare, Vercel, E2B, Modal) — insufficient for threat models requiring microVMs
Feedback