Subtask-Level Memory for Software Engineering Agents

Store and retrieve memory at the granularity of individual reasoning stages — not whole task sessions — to prevent misguided retrieval when tasks share surface similarity but require distinct reasoning at specific steps.

The Granularity Mismatch Problem

Instance-level memory stores a whole episode as one unit. Retrieval returns the full episode when a new task resembles it — useful when reasoning matches throughout, harmful when only one stage overlaps.

A bug that needs a Reproduce step may share a surface-level description with a prior episode that needed only an Edit; the retrieved memory then injects guidance from the wrong phase. (arXiv:2602.21611) The fix is to match the memory unit to the reasoning unit.

Subtask-Aligned Memory Architecture

A structurally aligned system stores memory per functional category. The paper (arXiv:2602.21611) defines four categories for software engineering agents:

Category	What It Covers
Analyze	Understanding the problem, locating relevant code
Reproduce	Constructing reproduction steps and test cases
Edit	Implementing the fix or change
Verify	Confirming correctness, running tests

Each memory entry is a structured triple (z, d, e):

z — the functional category (hard constraint on retrieval scope)
d — a structured description with objective and mechanism-level keywords (the retrieval anchor)
e — an abstracted experience with instance-specific noise removed (file paths, variable names stripped)
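The triple can be sketched as a small typed record. This is a minimal illustration, assuming d is split into an objective string and a keyword tuple; the class name `MemoryEntry`, the field names other than z and e, and the `anchor_text` helper are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Tuple

# The four functional categories listed in the table above.
CATEGORIES = ("Analyze", "Reproduce", "Edit", "Verify")


@dataclass(frozen=True)
class MemoryEntry:
    """One (z, d, e) triple; d is represented by objective + keywords."""

    z: str                    # functional category (hard retrieval constraint)
    objective: str            # part of d: what the subtask tried to achieve
    keywords: Tuple[str, ...]  # part of d: mechanism-level keywords
    e: str                    # abstracted experience, instance details removed

    def __post_init__(self):
        # Reject entries whose category is not one of the four defined above.
        if self.z not in CATEGORIES:
            raise ValueError(f"unknown category: {self.z!r}")

    def anchor_text(self) -> str:
        """Flatten d into the text that would be embedded for retrieval."""
        return self.objective + " | " + ", ".join(self.keywords)
```

Validating z at construction time keeps malformed categories out of the store, so the category filter at retrieval time can rely on exact matching.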

Abstraction is critical: raw trajectory storage yields +1.2 pp; LLM-abstracted entries deliver +3.9 pp because abstraction distills transferable insights and drops ungeneralizable artifacts. (arXiv:2602.21611)

Two-Stage Retrieval

Retrieval runs in two stages to prevent cross-phase contamination:

graph TD
    A[Current subtask z, d] --> B{Stage 1: Category filter}
    B -->|Keep only entries where m.z = z| C[Within-category pool]
    C --> D{Stage 2: Cosine similarity rank}
    D -->|highest cosine similarity| E[Best-match memory entry m*]
    E --> F[Inject into agent context]

Stage 1 hard-filters by category z, removing cross-phase entries before ranking. Stage 2 ranks within-category entries by cosine similarity between the current description embedding and stored anchor embeddings; only the best match is injected. (arXiv:2602.21611)

Implementation Notes

Transition prediction via system prompt. The agent predicts its current category and synthesizes a structured description during reasoning — driven by the system prompt, no separate orchestrator required. (arXiv:2602.21611)
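One way to consume the prediction is to have the system prompt ask for a small JSON object at each stage transition and to validate it before querying memory. The sketch below assumes that output shape; the JSON keys and the `parse_transition` function are hypothetical, since the source does not specify a format.

```python
import json
from typing import Optional, Tuple

CATEGORIES = {"Analyze", "Reproduce", "Edit", "Verify"}


def parse_transition(agent_output: str) -> Optional[Tuple[str, str]]:
    """Return (category, description), or None if the prediction is unusable.

    Assumes the agent was prompted to emit a JSON object such as
    {"category": "Reproduce", "description": "..."} (hypothetical format).
    """
    try:
        payload = json.loads(agent_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None

    category = payload.get("category")
    description = payload.get("description")

    # Only the four known categories are valid retrieval scopes.
    if not isinstance(category, str) or category not in CATEGORIES:
        return None
    # An empty description gives the similarity ranking nothing to match on.
    if not isinstance(description, str) or not description.strip():
        return None
    return category, description.strip()
```

Returning `None` lets the caller skip retrieval for that step instead of querying a pool chosen from a malformed prediction.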

Memory sparsity in early sessions. The first ~200 instances produce a slight dip (−1 pp) from retrieval overhead on sparse pools; gains accelerate with density, reaching +9–10 pp after 300+ instances. (arXiv:2602.21611)

Model-agnostic. Results hold across model families; Gemini 2.5 Pro sees +6.8 pp. (arXiv:2602.21611)

Results

Subtask-level memory improves mean Pass@1 by +4.7 pp on SWE-bench Verified over unaligned baselines. (arXiv:2602.21611) The broader principle — retrieval granularity should match reasoning granularity — is independently supported by dual-layer episodic-semantic memory work, where granular logs plus abstract concept synthesis outperform flat retrieval on multi-hop tasks. (arXiv:2601.02744)

Relation to Scope-Based Memory

This technique is orthogonal to the scope-based patterns (episodic, working, project, user) in Agent Memory Patterns. Scope controls where and how long memories persist; subtask-level memory controls the granularity at which they are stored and retrieved. The two combine, for example as subtask-aligned entries stored in a project-scoped episodic system.

Caveat: Dense-Retrieval Noise

The Stage-2 cosine step is a dense-retrieval operation. Follow-up work argues dense retrieval "fails to distinguish instances that are semantically similar but contextually distinct," yielding noisy matches even within a correctly filtered category; schema-constrained generation is proposed as an alternative. (arXiv:2604.20117) The category hard-filter mitigates cross-phase confusion but does not solve within-category ambiguity.

Example

The following shows the structure of a memory entry triple for the Reproduce category, and how two-stage retrieval uses it to inject only the relevant experience when a new task reaches its reproduction stage.

# Storing a memory entry after a successful Reproduce subtask
memory_store.add({
    "z": "Reproduce",   # functional category — hard constraint on retrieval
    "d": {              # structured description — the retrieval anchor
        "objective": "construct failing test case for off-by-one in pagination",
        "keywords": ["off-by-one", "pagination", "boundary condition", "unit test"]
    },
    "e": (              # abstracted experience — instance-specific details stripped
        "When reproducing boundary errors in list pagination, write a test that requests "
        "exactly page_size items, then page_size+1. The second call reveals the off-by-one "
        "because the slice index uses < instead of <=. Avoid mocking the data layer; "
        "use real objects to ensure the boundary arithmetic is exercised."
    )
})

At retrieval time, the agent predicts it is entering the Reproduce stage and synthesizes a description for the current task:

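# Two-stage retrieval
# embed, cosine, memory_store, and inject_into_context are assumed helpers
# that this example does not define; memory_store is assumed to be iterable.
current_z = "Reproduce"   # predicted by agent from system prompt instruction
current_d_embedding = embed("construct test case for cursor-based pagination bug")

# Stage 1: hard filter by category; eliminates Analyze/Edit/Verify entries
candidates = [m for m in memory_store if m["z"] == current_z]

# Stage 2: rank by cosine similarity within category, inject best match
def anchor_text(m):
    # The stored description is a dict, so flatten it to text before embedding.
    return m["d"]["objective"] + " " + " ".join(m["d"]["keywords"])

if candidates:  # an empty pool (cold start) means nothing is injected
    best_match = max(candidates, key=lambda m: cosine(embed(anchor_text(m)), current_d_embedding))
    inject_into_context(best_match["e"])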

The injected experience is the abstracted lesson from the prior pagination episode — not the raw trajectory with file paths and variable names from that specific codebase. The agent applies the transferable reasoning pattern to the new task.
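The snippets above rely on `embed`, `cosine`, `memory_store`, and `inject_into_context`, which the example leaves undefined. The following self-contained sketch fills them in so that the two-stage flow can be run end to end. It assumes a toy bag-of-words vector in place of a real embedding model and a plain list as the store; both are stand-ins chosen for illustration, not the paper's components.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def anchor_text(entry):
    """Flatten the structured description d into embeddable text."""
    return entry["d"]["objective"] + " " + " ".join(entry["d"]["keywords"])


def retrieve(memory_store, current_z, current_d):
    """Return the best within-category entry, or None if the pool is empty."""
    # Stage 1: hard filter by category.
    candidates = [m for m in memory_store if m["z"] == current_z]
    if not candidates:
        return None
    # Stage 2: rank the remaining entries by cosine similarity.
    query = embed(current_d)
    return max(candidates, key=lambda m: cosine(embed(anchor_text(m)), query))


memory_store = [
    {"z": "Reproduce",
     "d": {"objective": "construct failing test case for off-by-one in pagination",
           "keywords": ["off-by-one", "pagination", "boundary condition", "unit test"]},
     "e": "reproduce-pagination lesson"},
    {"z": "Reproduce",
     "d": {"objective": "reproduce race condition in cache invalidation",
           "keywords": ["race condition", "cache", "threads"]},
     "e": "reproduce-cache lesson"},
    {"z": "Edit",
     "d": {"objective": "fix pagination test case bug in cursor handling",
           "keywords": ["pagination", "cursor"]},
     "e": "edit-pagination lesson"},
]

best = retrieve(memory_store, "Reproduce",
                "construct test case for cursor-based pagination bug")
print(best["e"])
```

In this toy store the Edit entry scores higher than either Reproduce entry on raw similarity to the query, yet it never reaches Stage 2 because the category filter removes it first. That is the cross-phase contamination the hard filter is meant to prevent.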

Key Takeaways

Task description similarity doesn't imply reasoning stage similarity — instance-level memory causes granularity mismatch.
LLM-abstracted experience entries outperform raw trajectory storage; abstraction is not optional.
Gains are largest for long, multi-step tasks and grow with memory density.

When This Backfires

Subtask-level memory adds overhead that outweighs benefits in several conditions:

Cold-start penalty: The first ~200 instances yield a slight dip as retrieval fires on sparse per-category pools. Skip memory entirely for workloads that will not accumulate 200+ episodes.
Category misprediction: Misprediction routes retrieval to the wrong pool, injecting misleading experience. Ambiguous or interleaved phases (analyze and edit) raise this risk.
Abstraction cost: Each subtask requires an LLM call to abstract the experience, adding latency and token cost. Raw trajectory storage avoids this but delivers only +1.2 pp vs +3.9 pp.
Non-repetitive task streams: The pattern assumes recurring task structures. One-off or heterogeneous streams never reach the density where cross-episode transfer materializes.
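One possible guard against the cold-start condition is to skip retrieval until the relevant category pool has reached a minimum size. The sketch below is an assumption for illustration rather than something the source proposes: `should_retrieve` and `MIN_POOL_SIZE` are hypothetical, and the threshold value is arbitrary. The ~200 figure above counts processed instances, not entries per category, so it does not translate directly into a pool size.

```python
from collections import Counter

MIN_POOL_SIZE = 50  # hypothetical threshold; tune against your own workload


def should_retrieve(memory_store, category, min_pool_size=MIN_POOL_SIZE):
    """Return True only when the category pool is dense enough to query."""
    # Count stored entries per functional category z.
    pool_sizes = Counter(m["z"] for m in memory_store)
    return pool_sizes[category] >= min_pool_size
```

A caller would check `should_retrieve(memory_store, current_z)` before running the two-stage lookup and proceed without injected memory when it returns `False`.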

Related

Agent Memory Patterns: Learning Across Conversations
Episodic Memory Retrieval -- retrieval mechanics for episodic memory using trigger, context, and outcome indexing
Generative Agents Memory Stream -- companion on the observation-stream-plus-reflection axis, orthogonal to retrieval granularity
Dual-Trace Memory Encoding -- companion on the encoding-time axis, distinct from the retrieval-granularity angle here
Seeding Agent Context: Breadcrumbs in Code
Retrieval-Augmented Agent Workflows
Memory Synthesis: Extracting Lessons from Execution Logs
Code-Native Memory Substrates -- uses AST representations instead of natural language as memory substrate for code generation agents
Orchestrator-Worker Pattern