Evaluating AGENTS.md: When Context Files Hurt More Than Help

Auto-generated context files reduce task success rates. Human-written files improve success only when they contain minimal, specific instructions — not architectural overviews or duplicated documentation.

The Evidence

Two benchmark studies and one PR-level analysis evaluated AGENTS.md-style context files on real coding work:

Study	Benchmark	Finding
Gloaguen et al. (2026)	SWE-bench Lite (300 tasks), AGENTbench (138 tasks)	LLM-generated files: -3% success, +20% cost. Human-written files: +4% success, +19% cost
Lulla et al. (2026)	10 repos, 124 PRs	AGENTS.md present: -28.6% runtime, -16.6% output tokens, completion rates unchanged
AIDev (2026)	Agentic PRs across many projects	Context files do not reliably improve merge rate: 27.7% of projects improved ≥20% while 26.35% degraded

Gloaguen et al. measure success; Lulla et al. measure efficiency. The AIDev analysis reaches a similar conclusion from the merge-rate angle: instruction/context files do not reliably improve agentic-PR merge rate, with roughly as many projects degrading (26.35%) as improving by ≥20% (27.7%). Taken together, context files can make agents faster but not more reliably successful.

Why Auto-Generated Files Fail

Running /init produces a document restating what the agent can already discover:

graph LR
    A[Auto-generated<br>context file] --> B[Duplicates existing<br>documentation]
    B --> C[Agent reads both<br>sources]
    C --> D[+20% token cost<br>No accuracy gain]
    B --> E[Remove existing docs<br>from repo]
    E --> F[+2.7% improvement<br>File now adds value]

When researchers removed existing documentation from repos, the same auto-generated files improved performance by 2.7% — confirming that redundancy, not the file itself, is the problem.
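One way to check for this kind of redundancy before committing a context file is to estimate how much of it already appears in the repo's documentation. The sketch below is an illustrative heuristic, not the method used in the studies: the function names `shingles` and `redundancy_ratio`, the 5-word shingle size, and the sample strings are all assumptions.

```python
import re


def shingles(text: str, size: int = 5) -> set[tuple[str, ...]]:
    """Return the set of overlapping word n-grams (shingles) in text."""
    words = re.findall(r"[a-z0-9_]+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def redundancy_ratio(context_file: str, existing_docs: str, size: int = 5) -> float:
    """Fraction of the context file's shingles that also occur in existing docs.

    Hypothetical heuristic: 0.0 means nothing is duplicated, 1.0 means every
    shingle in the context file already appears in the docs.
    """
    ctx = shingles(context_file, size)
    if not ctx:
        return 0.0
    return len(ctx & shingles(existing_docs, size)) / len(ctx)


# Made-up sample text standing in for a README and two candidate instructions.
readme = "Run the test suite with pytest -q from the repository root before opening a pull request."
duplicated = "Run the test suite with pytest -q from the repository root."
novel = "Never edit files under vendor/ because they are regenerated nightly."

print(redundancy_ratio(duplicated, readme))  # 1.0
print(redundancy_ratio(novel, readme))       # 0.0
```

A high ratio suggests the instruction restates what the agent can already read elsewhere; a low ratio suggests it carries information the docs do not.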

GPT-5.1 Mini and GPT-5.2 used 14% and 22% more reasoning tokens respectively with LLM-generated context files — effort spent processing information the agent would have found anyway.

Why Verbose Human-Written Files Trade Success for Cost

Human-written context files improved success by ~4% on AGENTbench but increased costs by up to 19% because agents followed instructions too faithfully — running more tests, reading more files, and executing more searches than the task required.

This is the compliance ceiling in action — agents treat every instruction as equally important, producing more work without proportional accuracy gains.

Architectural overviews did not help. Agents spent the same effort locating files regardless of overview presence.
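The success-versus-cost trade-off for human-written files can be made concrete as cost per successful task. This is a rough sketch under stated assumptions: the 40% baseline success rate and the normalized cost of 1.00 per task are made-up placeholders, and treating the ~4% gain as percentage points is an assumption, since the figures above do not say whether they are absolute or relative.

```python
def cost_per_success(success_rate: float, cost_per_task: float) -> float:
    """Average spend needed to obtain one successful task."""
    return cost_per_task / success_rate


# Hypothetical baseline: 40% of tasks solved at a normalized cost of 1.00 per task.
baseline = cost_per_success(success_rate=0.40, cost_per_task=1.00)

# Human-written file: +4 points of success (assumed absolute) and +19% cost.
with_file = cost_per_success(success_rate=0.44, cost_per_task=1.19)

change = with_file / baseline - 1
print(f"baseline: {baseline:.2f}, with file: {with_file:.2f}, change: {change:+.1%}")
# prints: baseline: 2.50, with file: 2.70, change: +8.2%
```

Under these placeholder numbers each success costs about 8% more with the file than without it. A different baseline would shift the result, so the calculation is worth rerunning with measured values.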

What Actually Works

One finding was unambiguous: tool-specific instructions change agent behavior reliably. Repository-specific tools averaged 2.5 calls per instance when the context file mentioned them, versus 0.05 when it did not.
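A calls-per-instance figure like this can be computed from agent tool-call logs. The following is a minimal sketch, assuming each task instance is logged as a list of tool names; the function `mean_calls_per_instance`, the tool name `repo-lint`, and the sample logs are hypothetical and do not come from the studies.

```python
def mean_calls_per_instance(trajectories: list[list[str]], tool: str) -> float:
    """Average number of times `tool` is invoked across task instances."""
    if not trajectories:
        return 0.0
    return sum(calls.count(tool) for calls in trajectories) / len(trajectories)


# Hypothetical logs, one list of tool names per task instance.
# "repo-lint" stands in for a repository-specific tool.
with_mention = [
    ["grep", "repo-lint", "repo-lint"],
    ["repo-lint", "pytest"],
    ["repo-lint", "repo-lint", "grep"],
]
without_mention = [["grep", "pytest"], ["grep"], ["pytest", "grep"]]

print(mean_calls_per_instance(with_mention, "repo-lint"))
print(mean_calls_per_instance(without_mention, "repo-lint"))  # 0.0
```

Comparing the two averages for the same tool shows whether mentioning it in the context file actually changed how often the agent reached for it.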

Include	Omit
Exact build/test/lint commands with flags	Architecture overviews
Non-obvious constraints agents cannot infer	Codebase structure descriptions
Repository-specific tool invocations	Information already in existing docs
Critical rules that apply to every task	Task-specific procedures (load on demand)

This aligns with the table of contents pattern — a pointer map outperforms an encyclopedia by avoiding the redundancy that makes auto-generated files fail.
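As an illustration of the pointer-map idea, the sketch below renders a context file that holds only commands, constraints, and pointers to documents loaded on demand. Everything in it is hypothetical: the function `build_pointer_file`, the section titles, the `make` targets, and the file paths are placeholders chosen for the example, not a prescribed AGENTS.md layout.

```python
def build_pointer_file(
    commands: dict[str, str],
    constraints: list[str],
    pointers: dict[str, str],
) -> str:
    """Render a minimal context file: commands, constraints, and pointers only."""
    lines = ["# AGENTS.md", "", "## Commands"]
    lines += [f"- {name}: `{cmd}`" for name, cmd in commands.items()]
    lines += ["", "## Constraints"]
    lines += [f"- {rule}" for rule in constraints]
    lines += ["", "## Where to look (load on demand)"]
    lines += [f"- {topic}: {path}" for topic, path in pointers.items()]
    return "\n".join(lines) + "\n"


# Placeholder values; substitute the repository's real commands and paths.
content = build_pointer_file(
    commands={"test": "make test", "lint": "make lint"},
    constraints=["Do not edit generated files under gen/."],
    pointers={"Release process": "docs/release.md"},
)
print(content)
```

The file stays short because task-specific procedures live in the pointed-to documents rather than in the context file itself.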

The Resolution

"AGENTS.md files hurt" overstates the finding. What the research shows:

Auto-generated context files are net negative — stop running /init and expecting improvement
Verbose human-written files trade marginal accuracy for significant cost — the compliance ceiling now has empirical backing
Minimal, specific instructions work — tool commands and non-inferable constraints change behavior reliably
Pointer files avoid the core failure mode — no duplication of discoverable information

The advice: remove everything the agent can already infer, and keep only what it cannot.
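A quick way to start applying that advice is to scan an existing context file for sections that look discoverable. The sketch below is a rough heuristic, assuming the file uses Markdown `#` headings; the keyword list `DISCOVERABLE_HEADINGS` and the function `flag_discoverable_sections` are hypothetical, and a keyword match only marks a section for human review.

```python
import re

# Hypothetical heading keywords suggesting content the agent could find on its own.
DISCOVERABLE_HEADINGS = ("architecture", "overview", "project structure", "directory layout")


def flag_discoverable_sections(context_file: str) -> list[str]:
    """Return headings in a Markdown context file that match the keyword list."""
    flagged = []
    for line in context_file.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match and any(key in match.group(1).lower() for key in DISCOVERABLE_HEADINGS):
            flagged.append(match.group(1).strip())
    return flagged


# Made-up sample file for illustration.
sample = """# AGENTS.md
## Architecture Overview
The service has three layers.
## Commands
- test: `make test`
## Directory Layout
src/ holds the code.
"""

print(flag_discoverable_sections(sample))
# ['Architecture Overview', 'Directory Layout']
```

Flagged sections are candidates for removal or for replacement with a one-line pointer; sections such as the commands list pass through untouched.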

Benchmark Limitations

Both benchmark studies (Gloaguen et al. and Lulla et al.) evaluated well-documented open-source repositories. Context file value is likely higher in:

Closed-source codebases with undocumented conventions
Projects with non-standard tooling or build systems
Repos where critical constraints are not inferable from code

This gap is untested — the evidence applies to the open-source case.

The two studies also used different model and agent sets, so it is unclear whether the efficiency gains Lulla et al. measured would hold for the models Gloaguen et al. tested, or vice versa.

Key Takeaways

Auto-generated context files duplicate discoverable information and raise costs by about 20% with no accuracy gain
Human-written files improve success by ~4% but raise costs by up to 19%
Tool-specific commands are the highest-value content: 2.5 calls when mentioned vs 0.05 when not
Architectural overviews do not reduce file discovery time — omit them
The research validates minimal instruction files and the pointer-map pattern

Sources

The Instruction Compliance Ceiling
Guardrails Beat Guidance: Rule Design for Coding Agents — complementary empirical finding on negative-vs-positive rule polarity for coding agents
AGENTS.md as Table of Contents, Not Encyclopedia
AGENTS.md Design Patterns: Commands, Boundaries, and Personas
AGENTS.md: A README for AI Coding Agents
Layered Instruction Scopes
Instruction File Ecosystem — the map of instruction-file types this evaluation lens sits inside
Configuration File Structure Compliance Gap — empirical null on file structure, complementary to this page's when-do-context-files-hurt question
Convention Over Configuration
Discoverable vs Non-Discoverable Context