Runtime Scaffold Evolution

A mutable scaffold lets capable agents synthesize domain-specific tools at runtime, outperforming fixed toolkits.

The core insight

A capable LLM already knows how to write code and reason about tooling. The missing piece is permission and prompting: you explicitly ask the agent to treat tool creation as a first-class action alongside tool use.

Live-SWE-agent demonstrated this by starting with bash-only access and autonomously evolving its toolkit. It reached 77.4% on SWE-bench Verified and 45.8% on SWE-Bench Pro without offline training or pre-built tool libraries (Xia et al., 2025; live-swe-agent leaderboard; reference implementation).

How it works

graph TD
    A[Receive task] --> B[Attempt with existing tools]
    B --> C[Observe result + reflect]
    C --> D{Would a custom tool help?}
    D -->|No| E[Continue solving]
    D -->|Yes| F[Synthesize tool as script]
    F --> G[Execute custom tool]
    G --> C
    E --> H{Task complete?}
    H -->|No| B
    H -->|Yes| I[Submit solution]

The mechanism is simple:

Minimal start: the agent begins with only bash access and no specialized tools.
Step-reflection prompt: after each action, a prompt asks, "Would creating or revising a tool accelerate progress?"
Tool synthesis: the agent writes a script with clear inputs, outputs, and error handling.
Iterative refinement: the agent revises tools as its understanding deepens, rather than designing them upfront.

The agentic loop itself is unchanged. The only additions are a reflection prompt and permission to create scripts.
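The four steps above can be sketched as an ordinary agent loop with one extra message appended after each observation. This is a minimal illustration, not the Live-SWE-agent implementation: `Step`, `run_agent`, and the `model` and `execute` callables are hypothetical names, and the model and shell are passed in as plain functions so the sketch runs without any LLM or subprocess.

```python
from dataclasses import dataclass
from typing import Callable, List

# Wording taken from the step-reflection prompt described above.
REFLECTION_PROMPT = "Would creating or revising a tool accelerate progress?"


@dataclass
class Step:
    """One model decision (hypothetical structure for this sketch)."""
    command: str        # shell command the model wants to run
    done: bool = False  # True once the model declares the task solved


def run_agent(
    task: str,
    model: Callable[[List[str]], Step],
    execute: Callable[[str], str],
    max_steps: int = 20,
) -> List[str]:
    """Plain agentic loop; the only addition is the reflection prompt."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        step = model(history)
        if step.done:
            break
        observation = execute(step.command)
        history.append(f"$ {step.command}\n{observation}")
        # Tool synthesis needs no special action type: the model can answer
        # this prompt with a command that writes a script and later call it.
        history.append(REFLECTION_PROMPT)
    return history
```

Because a synthesized tool is just a file written through the same bash channel, the loop needs no tool registry or schema changes; that is the sense in which the scaffold stays minimal.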

What the agent builds

Runtime tools fold multi-step bash sequences into single domain-specific operations:

Scenario	Bash approach	Runtime-synthesized tool
Code search	`grep -r` with manual filtering	Context-aware search excluding test fixtures and vendored code
Binary parsing	Chained `xxd`, `awk`, `sed`	Dedicated parser with structured output
Multi-file edits	Sequential `sed` commands	Batch editor with AST awareness and rollback

Tool-creation opportunities come from friction the agent hits, not from upfront design.
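To make the "multi-file edits" row concrete, here is a rough sketch of the kind of script an agent might write after repeating `sed` across files. It is an assumption-laden illustration: `batch_replace` is a hypothetical name, the edit is a plain regex substitution, and the AST awareness mentioned in the table is deliberately left out to keep the sketch short.

```python
import re
from pathlib import Path
from typing import Dict, Iterable


def batch_replace(paths: Iterable[Path], pattern: str, replacement: str) -> int:
    """Apply one regex substitution to many files; restore all on any error."""
    originals: Dict[Path, str] = {}
    try:
        for path in paths:
            text = path.read_text(encoding="utf-8")
            # Remember the original content before touching the file.
            originals[path] = text
            path.write_text(re.sub(pattern, replacement, text), encoding="utf-8")
    except Exception:
        # Rollback: put every file already processed back as it was.
        for path, text in originals.items():
            path.write_text(text, encoding="utf-8")
        raise
    return len(originals)
```

The rollback is what the sequential `sed` approach lacks: if the third file fails, the first two are not left half-migrated.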

The model-capability threshold

This is not a universal technique. It requires frontier-class models:

Model tier	Effect	Mechanism
Frontier	Significant improvement	Synthesizes useful, targeted tools that reduce step count
Mid-tier	Modest improvement	Creates tools but sometimes over-engineers them
Small	Performance degrades	Gets stuck in tool-creation loops, never solves the actual problem

In ablation experiments, the pattern improved results by 22.6% with Claude 4.5 Sonnet and degraded them by 68.2% with GPT-5-Nano. Weaker models lack the meta-reasoning to judge when tool creation is worthwhile, so the reflection prompt becomes a distraction trap (Xia et al., 2025).
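One way to act on this threshold is to attach the reflection hook only when the selected model is in the frontier tier. The sketch below is illustrative: the `Tier` enum, the `MODEL_TIERS` mapping, the model identifier strings, and `build_system_prompt` are all assumptions made for this example, and the tier assigned to each model is inferred from the ablation result above rather than stated by the source.

```python
from enum import Enum


class Tier(Enum):
    FRONTIER = "frontier"
    MID = "mid"
    SMALL = "small"


# Hypothetical identifiers and tier assignments, for illustration only.
MODEL_TIERS = {
    "claude-4.5-sonnet": Tier.FRONTIER,
    "gpt-5-nano": Tier.SMALL,
}

REFLECTION_HOOK = (
    "After each tool result, reflect: would creating a reusable script "
    "accelerate the remaining work? If yes, write it to /tmp/tools/ and "
    "invoke it in subsequent steps."
)


def build_system_prompt(base_prompt: str, model_name: str) -> str:
    """Append the reflection hook only for frontier-tier models."""
    # Unknown models are treated as small: the conservative default,
    # since a wrongly enabled hook can degrade performance.
    tier = MODEL_TIERS.get(model_name, Tier.SMALL)
    if tier is Tier.FRONTIER:
        return f"{base_prompt}\n\n{REFLECTION_HOOK}"
    return base_prompt
```

Whether mid-tier models should get the hook is a judgment call; this sketch leaves it off because the table above reports only a modest gain for that tier.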

Runtime versus offline evolution

Approach	Timescale	Persistence	Human involvement
Runtime scaffold evolution	Single session	Ephemeral	None
Introspective skill generation	Across sessions	Persisted to library	Validation gate
Continuous agent improvement	Weeks/months	Config updates	Human-driven
Agentic flywheel	Continuous	Harness modifications	Tiered approval

Tools vanish when the session ends. Promoting useful ones to a skill library bridges ad-hoc creation and governed reuse.
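Promotion could be as small as copying a session script into a persistent directory once it passes some checks. The following is a sketch under stated assumptions: `promote_tool` and its `approved` flag are hypothetical, and the two automatic checks (the script parses, and it has a module docstring) stand in for whatever validation gate a real skill library would apply.

```python
import ast
import shutil
from pathlib import Path
from typing import Optional


def promote_tool(script: Path, library: Path, approved: bool) -> Optional[Path]:
    """Copy a session tool into a persistent library if it passes the gate.

    Returns the destination path, or None if the tool was rejected.
    """
    source = script.read_text(encoding="utf-8")
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return None  # broken scripts never reach the library
    # Require a module docstring so future sessions can tell what it does.
    if ast.get_docstring(tree) is None:
        return None
    # 'approved' models the validation gate (human or automated review).
    if not approved:
        return None
    library.mkdir(parents=True, exist_ok=True)
    destination = library / script.name
    shutil.copy2(script, destination)
    return destination
```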

Cost and context trade-offs

Token overhead is modest. On SWE-bench Verified, Live-SWE-agent averaged $0.68 per issue versus $0.56 for the baseline agent, an incremental cost of roughly $0.12 per issue, which the authors describe as "minimal" relative to the accuracy gain (Xia et al., 2025).
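The arithmetic behind that figure, using only the two per-issue averages quoted above (the relative overhead is derived here, not reported by the source):

```python
baseline_cost = 0.56  # USD per issue, baseline agent
evolved_cost = 0.68   # USD per issue, Live-SWE-agent

incremental = evolved_cost - baseline_cost
relative = incremental / baseline_cost

print(f"incremental cost: ${incremental:.2f} per issue")
print(f"relative overhead: {relative:.1%}")
```

In relative terms the overhead is a little over a fifth of the baseline spend, so the "modest" label depends on how much the accuracy gain is worth for the workload in question.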

The hidden cost is context pressure. Each synthesized tool definition consumes tokens. In long sessions, accumulated definitions may crowd out problem-relevant context. Current implementations do not address active tool pruning.
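Since pruning is described as unaddressed, the following is only a sketch of what it could look like, not a description of any existing implementation. It assumes a hypothetical `ToolRegistry` that evicts the least recently used tool definitions once an estimated token budget is exceeded; the four-characters-per-token estimate is a crude stand-in for a real tokenizer.

```python
from collections import OrderedDict
from typing import List


class ToolRegistry:
    """Keeps tool definitions in context under a token budget (LRU eviction)."""

    def __init__(self, token_budget: int) -> None:
        self.token_budget = token_budget
        # name -> definition text, ordered from least to most recently used
        self._tools: "OrderedDict[str, str]" = OrderedDict()

    @staticmethod
    def estimate_tokens(text: str) -> int:
        # Assumption: roughly 4 characters per token.
        return max(1, len(text) // 4)

    def total_tokens(self) -> int:
        return sum(self.estimate_tokens(d) for d in self._tools.values())

    def register(self, name: str, definition: str) -> None:
        """Add or update a tool, then evict old ones if over budget."""
        self._tools[name] = definition
        self._tools.move_to_end(name)
        self._prune()

    def touch(self, name: str) -> None:
        """Mark a tool as just used so it is evicted last."""
        self._tools.move_to_end(name)

    def _prune(self) -> None:
        # Always keep at least the newest tool, even if it alone is over budget.
        while len(self._tools) > 1 and self.total_tokens() > self.token_budget:
            self._tools.popitem(last=False)

    def names(self) -> List[str]:
        return list(self._tools)
```

Evicting a definition from context does not have to delete the script on disk, so a pruned tool could be reintroduced later if the agent asks for it.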

When to use

Good fit: complex unfamiliar codebases, domain-specific file formats, and frontier-class models with large context windows.

Poor fit: well-defined workflows with known tool sets (use a fixed skill library), smaller models, and short tasks where tool-creation overhead exceeds the time saved.

Example

A SWE-bench agent receives a bug report about incorrect CSV parsing. The system prompt includes a reflection hook:

After each tool result, reflect: would creating a reusable script
accelerate the remaining work? If yes, write it to /tmp/tools/ and
invoke it in subsequent steps.

Turn 1: the agent runs `grep -r "csv" src/` and gets 200+ matches across test fixtures and vendored code.

Turn 2, reflection fires: the agent creates `/tmp/tools/search_src.py`:

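#!/usr/bin/env python3
"""Search source files, excluding tests and vendored directories."""
import pathlib
import re
import sys

if len(sys.argv) != 2:
    sys.exit(f"usage: {sys.argv[0]} PATTERN")

pattern = re.compile(sys.argv[1])
skip_dirs = {"tests", "vendor", "__pycache__"}

for p in pathlib.Path("src").rglob("*.py"):
    # Skip any file that lives under an excluded directory.
    if skip_dirs.intersection(p.parts):
        continue
    # errors="replace" keeps one badly encoded file from aborting the search.
    text = p.read_text(encoding="utf-8", errors="replace")
    for i, line in enumerate(text.splitlines(), 1):
        if pattern.search(line):
            print(f"{p}:{i}: {line.strip()}")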

Turn 3: the agent calls `python /tmp/tools/search_src.py "csv.*parse"`, immediately narrows to 4 relevant files, then locates and fixes the bug.

The agent created the tool in response to friction (noisy grep results), used it for the rest of the session, and discarded it on completion.

Key Takeaways

The mechanism is a single reflection prompt; simplicity is the point
Requires frontier-class models; weaker models get trapped in tool-creation loops
Ephemeral by default; combine with skill library persistence for cross-session reuse
Gate behind model capability routing: enable for the strongest model, disable for cost-optimized paths

Introspective Skill Generation — offline pattern mining across sessions
Agentic Flywheel — closed-loop harness self-improvement
Skill Library Evolution — lifecycle governance for persisted skills
Scaffold Architecture Taxonomy — three-layer framework the runtime evolution operates within, across control, tool interface, and resource dimensions
Harness Engineering — designing agent environments
Continuous Agent Improvement — human-driven observation-to-update loop
Temporary Compensatory Mechanisms — runtime tools as removable scaffolding
Tool Minimalism — counterpoint: fewer tools can outperform more