Skip to content

Edit Format Selection: Diff vs. Search-Replace vs. Full Rewrite

Edit format is how an LLM expresses code changes — full file, search-replace, or structure-aware diff — and the choice swings accuracy and cost 2-3x.

Why Format Choice Matters

Switching a strict unified diff to a search-replace block raised GPT-4's score on Aider's editing benchmark from 26% to 59% with no model or prompt change (Aider, "Unified diffs make GPT-4 Turbo 3X less lazy"). The "To Diff or Not to Diff?" paper formalises why: fragile offsets and fragmented hunks make generation highly unnatural for LLMs (Cao et al., 2026, arxiv:2604.27296). The Diff-XYZ benchmark finds search-replace beats unified diffs for larger models, while smaller open models gain almost nothing from any format change (Glukhov et al., 2025, arxiv:2510.12487). The lever is real but not universal.

The Format Spectrum

Format Anchor Best for
Full rewrite Whole file Short files; small edits where anchor overhead exceeds the file
Search-replace block Exact old_str / new_str Frontier-model API users; mixed-language repos
Structure-aware diff (BlockDiff / FuncDiff) AST block (control structure or function) Long files; fine-tuned models; AST-supported languages
Line-numbered unified diff @@ -old,+new @@ headers Patch tools and humans, not LLMs

Full Rewrite

The model emits the entire updated file. Always applies, but cost scales linearly with file length. On function-level benchmarks, full rewrite is the cheapest option for ~50–60% of samples — short bodies make anchor overhead larger than the new content (arxiv:2604.27296 §5).

Search-Replace Block

The model emits exact old_str and new_str; the harness substitutes one for the other. Aider and Anthropic's str_replace_based_edit_tool both use this shape. Anthropic requires the match to be unique — multiple matches return an error and force the model to expand context (Claude text editor docs). Aider treats each unified-diff hunk as a search-replace operation, ignoring line-number headers (Aider unified-diffs).

Structure-Aware Diff

BlockDiff and FuncDiff use tree-sitter to align hunks to AST nodes — control structures for BlockDiff, functions and classes for FuncDiff. Anchors expand outward until contextually unique. On long-code edits (>300 tokens), AdaEdit cuts latency and cost by over 30% versus full rewrite while matching its accuracy (arxiv:2604.27296 §5).

Line-Numbered Unified Diff

LLMs drop blank lines, forget the leading +, and miscount offsets in @@ headers. Aider's documentation captures the practitioner consensus: GPT is terrible at working with source code line numbers (Aider unified-diffs).

The Mechanism: Distribution Alignment

LLMs are trained on coherent code spans, not patch fragments. A line-numbered diff forces a non-local positional commitment (the @@ header) while emitting code; any drift produces an unappliable patch. Aligning the diff unit to a syntactic block moves generation back inside the training distribution — the model emits a complete unit as it would during code completion (arxiv:2604.27296 §3).

Search-replace captures most of this gain without AST awareness: it swaps positional commitments for content anchors, and any unique span of code is an in-distribution target.

Adaptive Selection (AdaEdit)

For each source–target pair, AdaEdit picks whichever is shorter — diff or full rewrite — as the training label. The fine-tuned model learns to choose the cheaper format per sample, hitting >90% selection accuracy (>95% within a 20% token-deviation tolerance) (arxiv:2604.27296 §4).

Without fine-tuning, a harness can approximate this rule: full rewrite below ~300 tokens of file content, search-replace above.

Selection Heuristic

graph TD
    A[Edit request] --> B{File length}
    B -->|< 300 tokens| C[Full rewrite]
    B -->|>= 300 tokens| D{Fine-tuning available?}
    D -->|Yes| E[Structure-aware diff + AdaEdit]
    D -->|No| F{Match anchor unique?}
    F -->|Yes| G[Search-replace block]
    F -->|No| C

For frontier API models, the decision reduces to full rewrite versus search-replace. Structure-aware diffs pay off when you control fine-tuning data and the language has reliable AST tooling.

When This Backfires

  • Small or open-weights models. Diff-XYZ shows smaller models gain little from format engineering — they fail at all formats roughly equally (arxiv:2510.12487).
  • Short edits to short files. Anchor plus surrounding context exceeds the file; full rewrite is strictly cheaper.
  • Languages without solid AST tooling. BlockDiff/FuncDiff rely on tree-sitter grammars; templated or DSL-mixed source loses the structural guarantee.
  • Non-unique repeated code. Search-replace fails when the anchor matches multiple sites; Anthropic's tool errors and forces a retry with expanded context (Claude text editor docs).
  • No fine-tuning access. AdaEdit is a training-time strategy — API users can apply the formats but cannot replicate adaptive selection without harness-side heuristics.

Example

A 1,200-token Python file needs a five-line change inside one function. Three formats, three cost profiles:

  • Full rewrite — regenerate the entire file, ~1,200 output tokens. Always applies. Linear in file size.
  • Search-replace block — unique surrounding context as old_str, modified function as new_str, ~150–250 output tokens. Applies if the anchor is unique; otherwise the harness errors and the model retries with more context.
  • FuncDiff — modified function with an AST-derived anchor, ~120–200 output tokens. Requires tree-sitter at apply time and a model trained on the format to perform reliably (arxiv:2604.27296 §3).

Search-replace is the practical choice for a frontier API model on this file. Below ~300 tokens of file content, full rewrite would have been cheaper.

Key Takeaways

  • Edit format is a real lever on accuracy and cost — measured in 30%+ token reductions and 2–3× accuracy gains in published benchmarks.
  • The mechanism is distribution alignment: replace positional anchors (line numbers) with content anchors (unique strings or AST blocks) so the model emits coherent code instead of fragmented patches.
  • Format engineering helps frontier models more than smaller ones; verify the lever exists for your model class before investing.
  • For API consumers, the practical decision is full rewrite for short files, search-replace above ~300 tokens; structure-aware diffs add value mainly when you control fine-tuning.
Feedback