Instruction-Guided Code Completion

Functional correctness and instruction adherence are independent capabilities — a model that completes code correctly may still ignore your structural, algorithmic, and scope constraints.

The problem

Standard benchmarks measure whether generated code passes tests. HumanEval (Chen et al., 2021) scores functional correctness with unit tests and gives no signal on how the model implemented the solution. Yet developers routinely specify implementation constraints: a specific algorithm, a structural pattern, a limited completion scope. C3-Bench results show that most models treat scale instructions as suggestions: even advanced proprietary models score as low as 7% on scale-control tasks, and implementation-control adherence reaches only 50 to 60% for the top proprietary models.

C3-Bench (arXiv:2601.15879) is the first benchmark to measure this gap directly, testing 2,195 Python tasks across two instruction categories.

Two types of completion instructions

graph LR
    A[Developer Instruction] --> B[Implementation Control<br/>ICC]
    A --> C[Scale Control<br/>SCC]
    B --> D[Algorithm choice<br/>Control flow<br/>Structural pattern<br/>Parameter constraints]
    C --> E[Line count<br/>Block scope<br/>Statement boundaries]

Implementation-control (ICC) instructions specify how to implement: use recursion instead of iteration, follow a specific design pattern, constrain parameter types. Models handle these better than scale instructions but still inconsistently: top proprietary models reach instruction-following rates of 50 to 60%.

Scale-control (SCC) instructions specify how much to generate: complete only the next three lines, fill in just the if-block, stop at the function boundary. Even advanced models like Gemini-2.0-Flash (7.0% SCC) and GPT-4o (24.1% SCC) fail to respect scope boundaries in most cases.

Benchmark rankings mislead

Open-source models that top standard leaderboards underperform on instruction adherence. Qwen2.5-Coder-32B scores 49.2 EM on CrossCodeEval but only 28.8% on ICC instruction-following. Claude 3.5 Sonnet reaches 60.9% ICC, a gap of roughly 32 points that standard rankings do not show.

If your workflow involves guided completions (Cursor Composer, Copilot Chat, agent-driven code generation), benchmark scores do not reliably predict how well the model will follow your instructions.

What works

Be explicit about implementation constraints

Ablation studies show that removing instructions from prompts causes instruction-following scores to drop while functional correctness stays roughly the same. Models do respond to fine-grained guidance. Specify:

Algorithmic approach: "Use iterative depth-first search, not recursion"
Structural patterns: "Implement as a generator that yields results"
Control flow: "Handle the error case first with an early return"
Parameter constraints: "Accept only keyword arguments"
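Because adherence is far from guaranteed, it can help to check a completion against the constraints you asked for. The following is a minimal sketch using Python's standard `ast` module. The helper names (`is_generator`, `calls_itself`, `keyword_only`) are hypothetical and not part of C3-Bench, and they cover only simple cases of the constraint types listed above.

```python
import ast


def _first_function(source: str) -> ast.FunctionDef:
    """Return the first top-level function definition in `source`."""
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            return node
    raise ValueError("no top-level function found")


def is_generator(source: str) -> bool:
    """True if the function contains a yield or yield from.

    Simplification: a yield inside a nested function also counts.
    """
    func = _first_function(source)
    return any(isinstance(n, (ast.Yield, ast.YieldFrom)) for n in ast.walk(func))


def calls_itself(source: str) -> bool:
    """True if the function calls its own name directly (simple recursion)."""
    func = _first_function(source)
    return any(
        isinstance(n, ast.Call)
        and isinstance(n.func, ast.Name)
        and n.func.id == func.name
        for n in ast.walk(func)
    )


def keyword_only(source: str) -> bool:
    """True if the function accepts no positional parameters."""
    args = _first_function(source).args
    return not (args.posonlyargs or args.args or args.vararg)
```

Checks like these only inspect structure. They say nothing about whether the code is functionally correct, so they complement unit tests rather than replace them.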

Do not rely on scale instructions

Asking a model to "complete only the next 3 lines" or "just fill in the if-block" is unreliable across most models. Instead:

Use explicit stop markers or delimiters in context
Post-process completions to trim to the desired scope
Structure prompts so the completion boundary is syntactically unambiguous
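The post-processing options above could be sketched roughly as follows. The function names and the `# END` marker are illustrative assumptions, and `trim_to_block` assumes indentation-based Python code with consistent whitespace.

```python
def trim_at_marker(completion: str, marker: str = "# END") -> str:
    """Keep only the text before the first stop marker, if one is present."""
    head, _, _ = completion.partition(marker)
    return head.rstrip()


def trim_to_lines(completion: str, max_lines: int) -> str:
    """Keep at most `max_lines` lines of the completion."""
    return "\n".join(completion.splitlines()[:max_lines])


def trim_to_block(completion: str) -> str:
    """Keep lines until indentation drops below the first non-blank line.

    This approximates "fill in just this block". Blank lines are kept
    and do not end the block.
    """
    lines = completion.splitlines()
    # Indentation of the first non-blank line defines the block level.
    base = next((len(ln) - len(ln.lstrip()) for ln in lines if ln.strip()), 0)
    kept = []
    for line in lines:
        indent = len(line) - len(line.lstrip())
        if line.strip() and indent < base:
            break  # dedent: the model has left the requested block
        kept.append(line)
    return "\n".join(kept).rstrip()
```

For example, if a model asked to fill in a loop body also emits the following `return` statement, `trim_to_block` would drop everything from the first dedented line onward.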

Select models for instruction adherence

For workflows with heavy instruction guidance, which is the norm for agent-assisted coding, instruction-following capability matters more than raw completion accuracy. At the time of the C3-Bench evaluation (early 2025), proprietary models led on instruction-following: Claude 3.5 Sonnet reached 60.9% ICC and 50.8% SCC, while the top open-source model (Qwen2.5-Coder-32B-Instruct) scored 28.8% ICC and 16.9% SCC. Model capabilities shift with each release, so re-evaluate when adopting a new model version.

Training improves instruction-following

Qwen2.5-Coder-32B-C3 (a fine-tuned Qwen2.5-Coder variant) improved ICC instruction-following from 28.8% to 52.5% and SCC from 16.9% to 80.7% using 200K synthetic instruction-completion pairs, while also improving functional correctness (ICC Pass@1 rose from 49.8% to 62.0%). This suggests instruction-following is a trainable capability, not an inherent limitation. Teams running local models can invest in instruction-tuning data to close the gap.

When this backfires

Instruction-guided completion increases prompt complexity and slows iteration. These conditions reduce its value or make it counterproductive:

Exploratory or prototype code: when constraints are not yet known, injecting implementation instructions early locks in decisions before the design is stable. Prompts that pin a specific algorithm or structural pattern make it harder to change direction as the solution evolves.
Low ICC compliance models: if the model in use scores below about 40% on implementation-control adherence, instruction guidance produces inconsistent results. Prompts grow longer, constraint satisfaction varies run-to-run, and the overhead outweighs the benefit. Verify model ICC rates before investing in instruction-heavy workflows.
Scale control remains unreliable: even with best-practice prompting, most models ignore scope boundaries more than half the time (C3-Bench SCC median: under 25% for non-fine-tuned models). Workflows that depend on precise output length control need post-processing or syntactic delimiters; instruction guidance alone is not enough.
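One way to verify adherence on your own prompts, rather than relying on published numbers, is to sample several completions and count how many satisfy a check. This is a rough sketch: `adherence_rate` and `within_line_budget` are hypothetical helpers, and the completions are assumed to have been collected from the model beforehand.

```python
from typing import Callable, Iterable


def adherence_rate(completions: Iterable[str], follows: Callable[[str], bool]) -> float:
    """Fraction of completions for which the `follows` check returns True."""
    results = [follows(c) for c in completions]
    if not results:
        raise ValueError("need at least one completion")
    return sum(results) / len(results)


def within_line_budget(max_lines: int) -> Callable[[str], bool]:
    """Build a check for a scale instruction such as 'complete only N lines'."""
    def check(completion: str) -> bool:
        # Count non-blank lines only, so trailing newlines are not penalized.
        non_blank = [ln for ln in completion.splitlines() if ln.strip()]
        return len(non_blank) <= max_lines
    return check


# Three sampled completions for a prompt that asked for at most 3 lines.
samples = ["a = 1\nb = 2", "a = 1\nb = 2\nc = 3\nd = 4", "a = 1"]
rate = adherence_rate(samples, within_line_budget(3))
```

Here two of the three samples stay within the budget. A rate measured this way on a handful of samples is noisy, so treat it as a rough signal rather than a benchmark score.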

Key Takeaways

Instruction adherence and functional correctness are orthogonal. A model that writes correct code may still ignore your structural and scope constraints.
Implementation instructions work better than scale instructions. Tell models how to implement (algorithm, pattern, flow) and verify the result, but do not rely on telling them how much to generate.
Benchmark rankings do not predict instruction-following. Evaluate models on the specific capability your workflow requires.
Fine-tuning on instruction-completion pairs closes the gap. If you run local models, synthetic instruction data is an effective lever — it lifted scale-control adherence from 16.9% to 80.7% in the C3-Bench fine-tuning run above.

Example

A developer needs a function to flatten a nested dictionary. Without instruction guidance, the model chooses its own approach:

# Prompt (no instruction guidance):
# Write a function to flatten a nested dictionary

def flatten_dict(d, parent_key='', sep='.'):
    items = []
    for k, v in d.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key, sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

The model produces correct code but uses recursion, builds intermediate lists, and returns a new dictionary. With explicit implementation constraints:

# Prompt with instruction guidance:
# Write a function to flatten a nested dictionary.
# - Use an iterative approach with an explicit stack (no recursion)
# - Yield key-value pairs as a generator
# - Use '/' as the separator
# - Handle only dict nesting (skip lists)

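def flatten_dict(nested, sep='/'):
    """Yield (path, value) pairs for every non-dict leaf in `nested`."""
    # An explicit stack of (key prefix, dict) pairs replaces recursion.
    stack = [('', nested)]
    while stack:
        prefix, current = stack.pop()
        for key, value in current.items():
            full_key = f"{prefix}{sep}{key}" if prefix else key
            if isinstance(value, dict):
                # Defer nested dicts; lists and other values are treated as leaves.
                stack.append((full_key, value))
            else:
                yield full_key, value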

The second prompt specifies the algorithm (iterative with a stack), the output structure (generator), the separator, and which kind of nesting to handle. C3-Bench results show that even top proprietary models follow implementation-control instructions like these only about 50 to 60% of the time. That beats scale instructions but is still unreliable enough to require verification.
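To see that the two versions differ in implementation but not in behavior, a comparison along these lines may help. It is a sketch only: `compare_implementations` is a hypothetical helper, and it assumes both functions accept a `sep` keyword argument, as the two listings above do.

```python
import inspect
from typing import Any, Callable, Dict


def compare_implementations(
    baseline: Callable[..., Any],
    guided: Callable[..., Any],
    sample: Dict[str, Any],
    sep: str = "/",
) -> Dict[str, bool]:
    """Report whether two flatteners agree on output and differ in structure."""
    # dict() accepts both a mapping and an iterable of key-value pairs,
    # so it normalizes the dict-returning and the generator version.
    expected = dict(baseline(sample, sep=sep))
    actual = dict(guided(sample, sep=sep))
    return {
        "same_output": expected == actual,
        "baseline_is_generator": inspect.isgeneratorfunction(baseline),
        "guided_is_generator": inspect.isgeneratorfunction(guided),
    }
```

Since both listings define a function named `flatten_dict`, they would need distinct names (or separate modules) before being passed in. With the same separator, the two should produce the same flattened mapping while only the second is a generator function, which illustrates why passing tests says little about instruction adherence.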

Context Priming — Loading relevant context before completion shapes output quality; instruction-guided completion is a specific form of this discipline
Prompt Layering — Instructions arrive from multiple sources simultaneously; understanding precedence affects whether completion instructions are followed
Pass@k Metrics — Standard evaluation metric that measures functional correctness but not instruction adherence
Token-Efficient Code Generation — Structural patterns that reduce generated code tokens; a complementary lens on controlling model output quality
Repository-Level Retrieval for Code Generation — Cross-file context improves completion accuracy; instruction adherence and retrieval quality are complementary dimensions of code generation control