Memory Synthesis: Extracting Lessons from Execution Logs
Extract causal lessons from agent execution traces -- what worked, what failed, which approaches were abandoned and why -- so future runs improve.
Recording vs. Learning
Most agents save what happened without extracting why the outcome occurred. The gap is between configuration ("this build command works") and knowledge ("approach X fails for file type Y because Z").
| Level | Example | Improves future runs? |
|---|---|---|
| Passive recording | npm run build is the build command | Marginally |
| Active reflection | "Regex failed due to nested brackets" | Yes -- if retained and retrievable |
| Persistent synthesis | "For nested delimiters, use recursive descent, not regex" | Yes -- compounds across tasks |
The Synthesis Spectrum
graph LR
A[Raw logs] --> B[Passive recording]
B --> C[Verbal reflection]
C --> D[Structured lessons]
D --> E[Verified skill library]
style A fill:#333,stroke:#666
style B fill:#444,stroke:#888
style C fill:#555,stroke:#999
style D fill:#666,stroke:#aaa
style E fill:#777,stroke:#bbb
Passive recording: Claude Code saves observations to MEMORY.md -- build commands, debugging insights, style preferences. Context window constraints mean only a portion of a large memory file influences any given session.
Verbal reflection (Reflexion): Shinn et al. (2023) add a self-critique step after each failure and inject the critique as context on the retry. They report HumanEval pass@1 of 91%, against 80% for the GPT-4 baseline. Limitation: lessons are ephemeral and task-specific.
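The retry loop behind verbal reflection can be sketched as follows. This is a minimal illustration and not the Reflexion authors' implementation: `attempt`, `evaluate`, and `reflect` are hypothetical callables standing in for the model call, the test harness, and the self-critique step.

```python
from typing import Callable, List, Optional, Tuple


def reflexion_loop(
    attempt: Callable[[str, List[str]], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    reflect: Callable[[str, str], str],
    task: str,
    max_trials: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Retry a task, feeding each failure's self-critique into the next attempt.

    All three callables are assumptions made for illustration:
    - attempt(task, reflections) returns a candidate solution
    - evaluate(candidate) returns (passed, feedback) from an objective check
    - reflect(candidate, feedback) returns a short verbal critique
    """
    # The reflections list exists only for this task, which is why such
    # lessons are ephemeral unless something persists them afterwards.
    reflections: List[str] = []
    for _ in range(max_trials):
        candidate = attempt(task, reflections)
        passed, feedback = evaluate(candidate)
        if passed:
            return candidate, reflections
        reflections.append(reflect(candidate, feedback))
    return None, reflections
```

The function returns the reflections alongside the result so a caller can decide whether any of them deserve to outlive the episode.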
Structured lessons: Meta-Policy Reflexion (2025) consolidates reflections into transferable predicate-like rules that persist beyond the originating episode.
Verified skill libraries (Voyager): Wang et al. (2023) convert verified traces into executable code skills. Unverified attempts are refined or discarded.
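The gatekeeping idea behind a verified skill library can be sketched in a few lines. This is not Voyager's code; `SkillLibrary` and its `verify` callback are hypothetical names, and the only property illustrated is that a skill is stored only after it passes a check.

```python
from typing import Callable, Dict, List


class SkillLibrary:
    """Hypothetical store that admits a skill only after verification passes."""

    def __init__(self) -> None:
        self._skills: Dict[str, Callable[..., object]] = {}

    def add(
        self,
        name: str,
        skill: Callable[..., object],
        verify: Callable[[Callable[..., object]], bool],
    ) -> bool:
        """Run the check against the skill; store it only on success."""
        try:
            ok = bool(verify(skill))
        except Exception:
            # A skill that crashes during verification counts as unverified.
            ok = False
        if ok:
            self._skills[name] = skill
        return ok

    def get(self, name: str) -> Callable[..., object]:
        return self._skills[name]

    def names(self) -> List[str]:
        return sorted(self._skills)
```

A rejected skill is simply not stored here; a fuller system would route it back for refinement rather than dropping it silently.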
Anchoring Reflection to Signals
Self-critique without objective checks fails because models rationalize (nibzard agentic handbook). Anchor reflection to a verifiable signal: tests, lints, schema validation, or compilation.
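One way to enforce that anchoring is to accept a lesson only when a check command flipped from failing to passing. The sketch below assumes the signal is a process exit code; `Signal`, `run_check`, and `anchored_lesson` are hypothetical names introduced for illustration.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Signal:
    command: List[str]
    passed: bool
    output: str


def run_check(command: List[str], timeout: float = 60.0) -> Signal:
    """Run a test, lint, or compile command and capture its verdict."""
    proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    return Signal(command, proc.returncode == 0, (proc.stdout + proc.stderr).strip())


def anchored_lesson(claim: str, before: Signal, after: Signal) -> Optional[str]:
    """Keep a claim only if the check went from failing to passing."""
    if before.passed or not after.passed:
        # No observable change in the signal, so the claim is speculation.
        return None
    return f"WORKED: {claim} (signal: {' '.join(after.command)} now exits 0)"
```

For example, `run_check([sys.executable, "-m", "pytest", "-q"])` before and after a change yields the two signals to compare, assuming pytest is installed in that environment.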
Error retention vs. error summarization
Manus retains failure traces in context rather than summarizing them away, so the model can "implicitly update its internal beliefs." Premature summarization strips the diagnostic signal that makes reflection useful. (Manus: Context Engineering)
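A rough sketch of what retention can look like when rendering history into context: successful steps are compacted while failed steps keep their full output. This is an interpretation for illustration and not Manus's implementation; `Step`, `render_context`, and the 80-character cutoff are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    action: str
    ok: bool
    output: str


def render_context(steps: List[Step], success_chars: int = 80) -> str:
    """Compact successful steps; keep failed steps verbatim."""
    lines: List[str] = []
    for step in steps:
        if step.ok:
            # Successes carry little diagnostic value, so keep one short line.
            text = step.output.strip()
            first_line = text.splitlines()[0] if text else ""
            lines.append(f"[ok] {step.action}: {first_line[:success_chars]}")
        else:
            # Failures keep the full trace so the model can see what went wrong.
            lines.append(f"[FAILED] {step.action}\n{step.output}")
    return "\n".join(lines)
```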
Mining Failures for Training Signal
- SiriuS (Zhao et al., 2025): Repairs failed trajectories into positive training examples that become fine-tuning signal -- turning execution failures into direct model improvements.
A success confirms a path worked; a failure reveals why alternatives did not.
Storage Formats
| Format | Strengths | Weaknesses | Example |
|---|---|---|---|
| Flat markdown | Simple, human-editable, version-controllable | No semantic search; degrades at scale | Claude Code MEMORY.md |
| Structured predicates | Transferable, enforceable | Harder to audit; requires synthesis step | Meta-Policy Memory |
| Executable code | Composable, self-verifying | Brittle to environment changes | Voyager skill library |
| Hybrid vector + keyword | Relevance ranking + precision via FTS | Requires vector DB infrastructure | claude-mem |
Flat markdown suits most workflows; structured formats pay off at scale.
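To make the flat-markdown tradeoff concrete, here is a minimal keyword lookup over bullet lines. `search_lessons` is a hypothetical helper, not part of Claude Code or claude-mem; it matches exact words only, which is the "no semantic search" weakness listed in the table.

```python
import re
from typing import List


def search_lessons(memory: str, query: str) -> List[str]:
    """Return bullet lines that contain every query term as a whole word."""
    terms = [term.lower() for term in query.split()]
    hits: List[str] = []
    for line in memory.splitlines():
        if not line.lstrip().startswith("- "):
            continue  # skip headings and prose; only bullets hold entries
        words = set(re.findall(r"[a-z0-9_]+", line.lower()))
        if terms and all(term in words for term in terms):
            hits.append(line.strip())
    return hits
```

A query for "rate limit" finds an entry that uses those words, while a query for "throttling" finds nothing even though the entry is about the same problem.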
The Pruning Problem
Lessons expire: workarounds for old model limitations become wrong when models improve, and tool-specific patterns become irrelevant when tools change. Three strategies help: usage-based expiry (retire lessons that have not been used recently), version tagging (auto-deprecate a lesson when the version it was written for changes), and manual audit via /memory.
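Usage-based expiry and version tagging can be combined in a single pruning pass. The sketch below is one possible shape, assuming each lesson records when it was last used and, optionally, the tool version it was written against; the `Lesson` fields and the 90-day default are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List, Optional


@dataclass
class Lesson:
    text: str
    last_used: date
    tool: Optional[str] = None          # e.g. "pytest"; None if not tool-specific
    tool_version: Optional[str] = None  # version the lesson was written against


def prune(
    lessons: List[Lesson],
    today: date,
    current_versions: Dict[str, str],
    max_idle_days: int = 90,
) -> List[Lesson]:
    """Drop lessons that sat unused too long or were tagged with an old version."""
    kept: List[Lesson] = []
    for lesson in lessons:
        idle = (today - lesson.last_used) > timedelta(days=max_idle_days)
        current = current_versions.get(lesson.tool) if lesson.tool else None
        # Unknown tools are left alone; only a known, different version deprecates.
        version_changed = current is not None and current != lesson.tool_version
        if not idle and not version_changed:
            kept.append(lesson)
    return kept
```

In practice the dropped lessons might be moved to an archive for manual audit instead of being deleted outright.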
Bridging the Gap Today
End-of-Session Synthesis Prompt
Prompt the agent at session end:
Before ending this session, review what happened and write 2-3 lessons
in this format:
- WORKED: [approach] because [reason anchored to a verifiable signal]
- FAILED: [approach] because [reason], PREFER [alternative] instead
- ABANDONED: [approach] in favor of [alternative] because [tradeoff]
Only include lessons anchored to test results, build output, or
observable behavior -- not speculation.
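Because the prompt fixes a line format, the agent's reply can be parsed mechanically before anything is written to memory. The parser below is a minimal sketch; `parse_lessons` is a hypothetical helper, and it assumes each lesson starts with a `- KIND:` bullet and may wrap onto following lines.

```python
import re
from typing import Dict, List, Optional

LESSON_RE = re.compile(r"^-\s*(WORKED|FAILED|ABANDONED):\s*(.+)$")


def parse_lessons(text: str) -> List[Dict[str, str]]:
    """Extract WORKED / FAILED / ABANDONED bullets, joining wrapped lines."""
    lessons: List[Dict[str, str]] = []
    current: Optional[Dict[str, str]] = None
    for raw in text.splitlines():
        line = raw.strip()
        match = LESSON_RE.match(line)
        if match:
            current = {"kind": match.group(1), "body": match.group(2).strip()}
            lessons.append(current)
        elif current is not None and line and not line.startswith(("-", "#")):
            # A non-bullet, non-heading line continues the previous lesson.
            current["body"] += " " + line
        else:
            # Blank lines, headings, and other bullets end the current lesson.
            current = None
    return lessons
```

Bullets that do not use one of the three labels are ignored, which keeps plain observations out of the lesson list.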
Structured Memory Template
In MEMORY.md, separate observations from lessons:
## Observations (what happened)
- Build uses pnpm not npm
- API rate limit is 100 req/min
## Lessons (what to do differently)
- FAILED: Parallel API calls > 50 hit rate limit; use batch endpoint instead
- WORKED: Running type-check before tests catches 40% of failures faster
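Keeping the two sections separate is easier if entries are appended programmatically. The helper below is a sketch under the assumption that sections are introduced by `## ` headings as in the template; `append_entry` is a hypothetical name and works on the file's text, leaving reading and writing the file to the caller.

```python
def append_entry(memory: str, section: str, entry: str) -> str:
    """Append a bullet to the named section, creating the section if missing."""
    lines = memory.splitlines()
    bullet = f"- {entry}"
    # Find the heading that starts with "## <section>", e.g. "## Lessons (...)".
    start = next(
        (i for i, line in enumerate(lines) if line.startswith(f"## {section}")),
        None,
    )
    if start is None:
        lines += [f"## {section}", bullet]
        return "\n".join(lines) + "\n"
    # The section ends at the next "## " heading or at the end of the text.
    end = next(
        (i for i in range(start + 1, len(lines)) if lines[i].startswith("## ")),
        len(lines),
    )
    # Step back over trailing blank lines so the bullet joins the list.
    while end > start + 1 and not lines[end - 1].strip():
        end -= 1
    lines.insert(end, bullet)
    return "\n".join(lines) + "\n"
```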
Environmental Scaffolding as Alternative
Anthropic's harness engineering pattern -- progress files, git-based state, feature checklists -- offers a complementary approach: artifacts are verifiable and auditable without requiring a synthesis step. Synthesis pays off when the same class of problem recurs across projects or sessions.
When This Backfires
Three conditions where skipping synthesis is the better call:
- N=1 generalization: A single failure can produce a confidently-stated "lesson" ("never use library X") that reflects a one-off quirk, not a transferable rule. The form of the synthesized memory matters: distilled heuristics transfer across tasks better than replaying raw trajectories as few-shot examples (Experiential Reflective Learning, 2026).
- Tool/model churn: A workaround for a 2024-era context limit becomes wrong advice once the limit lifts, but the lesson sits in MEMORY.md for months. The deeper cost is trusting aged advice without re-verification.
- Context budget pressure: Retained lessons compete with task-relevant context; accumulated memory inflates cost and degrades selectivity (SSGM Framework, 2026). When the lesson library exceeds what retrieval can selectively surface, environmental scaffolding (progress files, git state) often pays off more reliably.
If the same class of problem does not recur across projects, do not synthesize.
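The recurrence rule can be applied mechanically by holding candidate lessons back until they show up in more than one session. The sketch below assumes lessons can be keyed by a normalized string; `LessonCandidates` and the threshold of two sessions are illustrative choices.

```python
from collections import defaultdict
from typing import Dict, List, Set


class LessonCandidates:
    """Hypothetical staging area that guards against N=1 generalization."""

    def __init__(self, min_sessions: int = 2) -> None:
        self.min_sessions = min_sessions
        self._seen: Dict[str, Set[str]] = defaultdict(set)

    def observe(self, lesson_key: str, session_id: str) -> None:
        """Record that a session produced this candidate lesson."""
        # A set means repeats within one session count only once.
        self._seen[lesson_key].add(session_id)

    def promoted(self) -> List[str]:
        """Return lessons seen in at least min_sessions distinct sessions."""
        return sorted(
            key
            for key, sessions in self._seen.items()
            if len(sessions) >= self.min_sessions
        )
```

A lesson such as "never use library X" that appears in a single session stays a candidate and never reaches the memory file unless the problem recurs.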
Example
A Claude Code session debugging a flaky test produces this raw MEMORY.md entry:
## Observations
- Test `test_upload` fails intermittently on CI
- Added retry logic with exponential backoff
- Root cause: S3 eventual consistency on newly created buckets
After applying the end-of-session synthesis prompt, the agent produces:
## Lessons
- FAILED: Retry with backoff on `test_upload` -- masked the real issue (S3 eventual
consistency) and made the test slow. PREFER creating the bucket in a shared fixture
with a waiter (`s3.get_waiter('bucket_exists').wait()`) instead.
- WORKED: Running `pytest --last-failed` before full suite caught the flaky test in
12 seconds vs. 4 minutes for the full run -- anchored to CI timing logs.
The raw observation records what happened; the synthesized lesson records what to do differently and why, anchored to a verifiable signal (CI timing, S3 API behavior).
Key Takeaways
- Recording what happened is not learning; synthesis extracts why an outcome occurred into a rule that transfers to future runs.
- Anchor every synthesized lesson to a verifiable signal -- tests, lints, compilation, schema validation -- so reflection updates beliefs instead of rationalizing.
- Distilled heuristics transfer across tasks better than replaying raw trajectories as few-shot examples.
- Lessons expire: prune via usage-based expiry, version tagging, or manual /memory audit so stale workarounds do not outlive their cause.
- Skip synthesis when the problem class does not recur, when tools/models churn fast, or when retained lessons crowd out task-relevant context.
Related
- Agent Memory Patterns: Learning Across Conversations
- Continual Learning for AI Agents: Three Layers of Knowledge Accumulation
- Memory Retrieval as a Control Decision
- Agentic Flywheel: Self-Improving Agent Systems
- Skill as Knowledge
- Trajectory Logging via Progress Files and Git History
- Agent Transcript Analysis
- Context Engineering: The Discipline of Designing Agent Context