Per-Model Harness Tuning¶
The same harness produces different behaviour on different backing models. Treat the model as a first-class harness variable — tune prompt structure, tool descriptions, and middleware per model — instead of forcing one generic configuration to work everywhere.
The Pattern¶
Frontier vendors train models on provider-specific prompt and tool conventions. A harness shaped for one violates the other's training distribution. Declare per-model overrides — system-prompt prefix and suffix, tool inclusion and naming, middleware, subagent configuration — that the harness applies whenever a particular model is selected.
LangChain's deepagents library shipped this as harness profiles on 2026-04-29. On a tau2-bench subset, profiles produced a 10-20 point jump per model: GPT 5.3 Codex went from 33% to 53%; Claude Opus 4.7 from 43% to 53% (LangChain).
graph TD
H[Generic harness] --> P{Model selected}
P -->|openai:gpt-5.5| O[OpenAI profile<br>apply_patch tool<br>parallel-batch prompt]
P -->|anthropic:opus-4.7| A[Anthropic profile<br>tool_result_reflection<br>tool_usage block]
P -->|google:gemini-3.1| G[Google profile]
O --> R[Per-model agent]
A --> R
G --> R
What Varies Per Model¶
The override surface covers components that vendor prompting guides prescribe differently:
| Component | Why it varies | Source |
|---|---|---|
| System prompt prefix/suffix | XML sections for Claude vs batch-call instructions for Codex | Claude prompting; Codex prompting |
| Tool inclusion and naming | Codex is trained on apply_patch and shell-style names; generic file-edit tools score worse |
LangChain |
| Tool descriptions | Calibrate against each model's literal-interpretation profile | Anthropic migration |
| Middleware selection | Summarisation middleware tuned for one model's verbosity over-corrects on another | Cursor |
| Subagent configuration | Spawn-frequency defaults differ across models | Anthropic migration |
Structural mechanisms — sandboxing, permission gates, file-persistent context, observability hooks — depend on the harness contract, not the model, and port cleanly. See Temporary Compensatory Mechanisms for the structural-vs-compensatory split.
Concrete Per-Model Deltas¶
LangChain published the actual changes that produced their tau2-bench deltas (LangChain).
For Codex, the changes were tool-shape: replace the default file_edit with apply_patch, alias execute as shell_command, and add a parallel-batch prompt: "Before any tool call, decide ALL files and resources you will need. Batch reads, searches, and other independent operations into parallel tool calls instead of issuing them one at a time."
For Opus 4.7, the changes were prompt-only — XML-tagged blocks the model is trained to recognise:
<tool_result_reflection>
After receiving tool results, carefully reflect on their quality
and determine optimal next steps before proceeding.
</tool_result_reflection>
<tool_usage>
When a task depends on the state of files, tests, or system output,
use tools to observe that state directly rather than reasoning from
memory about what it probably contains.
</tool_usage>
Cursor reports the same shape independently: shell-style tool names for Codex, explicit read_lints triggers, removed mid-turn user-talk language because Codex models cannot "talk" until the end of a turn (Cursor).
Expressing the Deltas¶
LangChain's profile API keys overrides on a bare provider name ("openai") or fully qualified provider:model ("openai:gpt-5.4") (Deep Agents Profiles docs).
from deepagents import HarnessProfile, register_harness_profile
register_harness_profile(
"openai:gpt-5.4",
HarnessProfile(
system_prompt_suffix="Respond in under 100 words.",
excluded_tools={"execute"},
excluded_middleware={"SummarizationMiddleware"},
),
)
Merge semantics are field-typed: prompts last-wins, tool-description mappings merge per key, excluded sets union, extra middleware appended by class identity. Provider-level and model-level profiles compose at resolve time — model-level inherits from provider-level. The create_deep_agent call site does not change.
The mechanism generalises beyond deepagents — any harness can adopt a model key, a delta record, and resolution at agent construction. The discipline is keeping deltas declarative so they version, diff, and ship as plugins rather than scattering if model == ... branches through the loop.
Relation to Adjacent Harness Patterns¶
graph LR
HE[Harness Engineering] -->|how to build| H[Your Harness]
HHC[Harness Hill-Climbing] -->|tune over time| H
HI[Harness Impermanence] -->|design for deletion| H
PMHT[Per-Model Harness Tuning] -->|tune per model| H
PRX[Prompt-Rewrite on Migration] -->|tune across versions| H
Per-model tuning is the orthogonal axis to existing harness disciplines:
- Harness Engineering builds the harness contract.
- Harness Hill-Climbing tunes a fixed configuration over time. Per-model tuning tunes across models at one point in time.
- Harness Impermanence designs for cheap removal when the next model subsumes the capability — author profiles as declarative overrides so the same removal seam applies.
- Prompt-Rewrite on Cross-Generation Migration handles the temporal dimension (one model upgrades). Per-model tuning handles the spatial dimension (N models against one harness).
When This Backfires¶
Per-model profiles cost more than they pay back under specific conditions:
- Single-model deployments. No portability surface to manage; profile machinery is dead weight.
- Minor-version successors with stable evals. "Claude Opus 4.7 should have strong out-of-the-box performance on existing Claude Opus 4.6 prompts and evals" (Anthropic: Migration guide) — a profile per minor version is churn without measurable gain.
- No representative eval suite. Per-model deltas need measurement to validate the override actually improves the target model. See Harness Hill-Climbing for the eval discipline.
- Provider-managed harnesses. Claude Managed Agents and Copilot consumer tiers route to successors automatically; the user has no harness surface to tune.
- Capability convergence. As frontier models converge on shared tool conventions (
apply_patch, shell, structured outputs), the eval gap shrinks while maintenance cost stays flat.
Per-model deltas are exactly the kind of hand-coded knowledge model releases tend to subsume. Author profiles as declarative overrides so deletion is one config change, not a refactor.
Example¶
A team running deepagents against both OpenAI and Anthropic registers two profiles at startup:
from deepagents import HarnessProfile, register_harness_profile
# Codex needs apply_patch and parallel-batch guidance
register_harness_profile(
"openai",
HarnessProfile(
system_prompt_suffix=(
"Before any tool call, decide ALL files and resources "
"you will need. Batch independent reads and searches into "
"parallel tool calls instead of issuing them one at a time."
),
excluded_tools={"file_edit"}, # apply_patch wins for Codex
),
)
# Opus 4.7 needs explicit reflection and tool-investigation cues
register_harness_profile(
"anthropic:claude-opus-4-7",
HarnessProfile(
system_prompt_suffix=(
"<tool_result_reflection>"
"After receiving tool results, reflect on quality and "
"determine optimal next steps before proceeding."
"</tool_result_reflection>"
),
),
)
The create_deep_agent(model=...) call site is unchanged. Switching model="openai:gpt-5.5" to model="anthropic:claude-opus-4-7" swaps the active profile automatically; the agent inherits the right deltas without per-call branching.
Key Takeaways¶
- The same harness produces measurably different behaviour on different backing models — LangChain documented 10-20 point tau2-bench deltas from profile work alone (LangChain).
- The override surface is system prompt, tool inclusion and naming, tool descriptions, middleware selection, and subagent configuration. Structural mechanisms (sandboxing, permissions, observability) port cleanly and stay shared.
- Express deltas as declarative model-keyed overrides so they version, compose, and remove. Avoid
if model == ...branches through the agent loop. - Profile work pays off only with a representative eval suite. Without measurement, per-model tuning is unmeasured prompt churn.
- Skip the discipline for single-model deployments, minor-version successors with stable evals, and provider-managed harnesses where there is no harness surface to tune.
- Per-model tuning is the spatial counterpart to Prompt-Rewrite on Cross-Generation Migration (temporal) and orthogonal to Harness Hill-Climbing (single-model, over time).
Related¶
- Harness Engineering — the discipline of building the harness this pattern overrides per model
- Harness Hill-Climbing — eval-driven tuning of a fixed configuration over time, orthogonal to per-model tuning
- Harness Impermanence — author scaffolding for cheap removal; profiles benefit from the same discipline
- Prompt-Rewrite on Cross-Generation Migration — the temporal counterpart for handling model-version upgrades
- Temporary Compensatory Mechanisms — structural-vs-compensatory split that determines which components belong in profiles
- Cross-Vendor Competitive Routing — running multiple vendor agents on the same task; per-model profiles are how each side gets a fair shot
- Harness Design Dimensions and Archetypes — the population-level dimensions per-model deltas vary along
- Managed vs Self-Hosted Harness — managed harnesses remove the surface this pattern operates on