Stage-Targeted Prompt Structure for Pull Request Outcomes

Prompt structure splits into Specificity, Context, and Verification — each moves a different pull request stage; diagnose the failing stage, then raise that dimension.

A stage-based study of 265 manually validated developer-ChatGPT interactions extracted from self-admitted AI-assisted pull requests found that prompt structure does not have a single "quality" axis — its effects split across three dimensions that each dominate a different stage of the human-LLM-PR pipeline (Sserunjogi, Ogenrwot & Businge, 2026). Apply this only when the human is driving generation through prompts; in agent-loop mode, the harness mediates context regardless of the initial prompt (see When This Backfires).

The Three Dimensions

The framework codes every prompt against three orthogonal structural surfaces (Sserunjogi et al., 2026):

  • Context — what the model is given to work against: relevant code snippets, file paths, requirements, constraints, the surrounding system the change has to fit into.
  • Specificity — how narrowly the task is scoped: explicit inputs and outputs, named functions, concrete acceptance criteria, the shape the answer should take.
  • Verification — how the developer can cheaply decide whether the output is usable: tests to pass, expected behaviour, examples of correct output, evaluability cues.

These are independent — a prompt can be high on one axis and low on the others. The study's LLM-vs-human inter-rater check showed Specificity is the most reliably codeable; Context is systematically under-scored by automated assessment, requiring a hybrid human-LLM annotation strategy (Sserunjogi et al., 2026).
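The hybrid annotation idea can be sketched as a small record type. This is a minimal illustration, not the study's instrument: the 0 to 2 scale, the `DimensionCodes` type, and the `merge_codes` helper are all hypothetical, and the rule that only Context takes a human override is an assumption drawn from the under-scoring observation above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DimensionCodes:
    """One prompt coded on three independent dimensions.

    The 0-2 scale (0 = absent, 1 = partial, 2 = explicit) is a hypothetical
    choice for illustration, not the scale used in the study.
    """
    specificity: int
    context: int
    verification: int


def merge_codes(llm: DimensionCodes, human_context: Optional[int]) -> DimensionCodes:
    """Keep the automated codes, but let a human override Context.

    Assumption: only Context gets the human pass, because it is the
    dimension automated assessment tends to under-score.
    """
    if human_context is None:
        return llm
    return DimensionCodes(llm.specificity, human_context, llm.verification)


# A prompt can be high on one axis and low on the others.
llm_codes = DimensionCodes(specificity=2, context=0, verification=0)
final_codes = merge_codes(llm_codes, human_context=1)
print(final_codes)
```

Keeping the three codes as separate fields, rather than collapsing them into one score, preserves the independence the framework relies on.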

The Stage-to-Dimension Map

The three dimensions do not affect outcomes uniformly. Each dominates a different stage of the pipeline from prompt to merged PR (Sserunjogi et al., 2026):

| Stage | Failing if… | Dominant dimension |
| --- | --- | --- |
| Generation (model produces actionable code at all) | output is unusable, off-topic, or wrong shape | Specificity (with Context as a strong secondary) |
| Adoption (developer commits or pastes the response) | output looks plausible but you keep discarding it | Verification |
| Integration (response actually lands in the PR) | code adopted but rejected at review, breaks neighbours | Context |

This produces a directly actionable diagnostic. Instead of asking "is this prompt good?", ask "which stage is failing?":

  • The model keeps producing the wrong shape of answer → raise Specificity (name the function, fix the signature, declare the inputs and outputs).
  • You keep discarding plausible-looking answers without committing them → raise Verification (paste the failing test, name the expected behaviour, give a correct-output example).
  • Code gets adopted but breaks the existing codebase at review → raise Context (point at the file, name the patterns it has to follow, link the constraint document).
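The diagnostic above reduces to a lookup from failing stage to the dimension to raise. The sketch below encodes that mapping; the `Stage` and `Dimension` enums and the `dimension_to_raise` function are hypothetical names introduced for illustration, and the hint strings paraphrase the checklist rather than quote the study.

```python
from enum import Enum


class Stage(Enum):
    GENERATION = "generation"
    ADOPTION = "adoption"
    INTEGRATION = "integration"


class Dimension(Enum):
    SPECIFICITY = "specificity"
    CONTEXT = "context"
    VERIFICATION = "verification"


# Dominant dimension per stage, as given in the stage-to-dimension map.
DOMINANT = {
    Stage.GENERATION: Dimension.SPECIFICITY,
    Stage.ADOPTION: Dimension.VERIFICATION,
    Stage.INTEGRATION: Dimension.CONTEXT,
}

# Paraphrased reminders of how to raise each dimension.
HINTS = {
    Dimension.SPECIFICITY: "name the function, fix the signature, declare inputs and outputs",
    Dimension.VERIFICATION: "paste the failing test, name the expected behaviour, give a correct-output example",
    Dimension.CONTEXT: "point at the file, name the patterns to follow, link the constraint document",
}


def dimension_to_raise(failing_stage: Stage) -> Dimension:
    """Return the dimension that dominates the stage that is failing."""
    return DOMINANT[failing_stage]


dim = dimension_to_raise(Stage.ADOPTION)
print(dim.value, "->", HINTS[dim])
```

The lookup only captures the dominant dimension; the secondary role of Context at the Generation stage is left out to keep the sketch small.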

Why It Works

Each dimension addresses a different bottleneck in the same pipeline, which is why their effects show up at different stages (Sserunjogi et al., 2026).

  • Specificity narrows the model's solution space at generation time: fewer candidate completions match the constraints, so the first response is closer to actionable code and the developer has less surface to discard.
  • Context gives the model fallback evidence across multiple surfaces (codebase pointers, requirements, neighbours). When one surface is incomplete, others still carry signal, so the output is embeddable into the existing system rather than merely correct in isolation. The same independent-surfaces mechanism produces the 11.8% vs 0.9% robustness gap between HumanEval (one docstring) and LiveCodeBench (four independent specification layers) measured by Akli et al. (2026) and documented in Multi-Layer Specification Redundancy.
  • Verification cues let the developer cheaply distinguish a usable response from a plausible-but-wrong one. Adoption rises without raising generation cost because the evaluation step gets cheaper, not the generation step.

When This Backfires

The framework was measured on single-shot, human-driven ChatGPT-PR interactions. Several conditions shrink or invert its leverage:

  • Agent-loop mode dominates over prompt structure. When Claude Code or Copilot's agent mode iterates with the codebase — reading files, re-prompting itself, running tests — the harness mediates Context and Verification regardless of the initial prompt. The human's prompt becomes a smaller share of total context; three-dimension prompt discipline pays much less. The harness itself is the lever; see Effective context engineering for AI agents.
  • More Context can hurt past the attention sweet spot. Stuffing Context past the model's effective attention triggers lost-in-the-middle and context rot — additional Context starts degrading Adoption rather than helping it. Strategic curation beats volume.
  • Structural controls beat prompt structure when both are available. When a hook, type check, or test suite can deterministically catch the failure mode the Verification dimension is trying to encode, the structural control dominates. See Hooks vs Prompts and the Prompt Tinkerer Anti-Pattern — endless prompt refinement is a recognised failure mode when the problem is structural.
  • Throwaway exploration. Three-dimension prompt overhead is wasted when the goal is to discover whether an approach is feasible at all — there is no PR to integrate into.
  • High-density instruction stacks hit the compliance ceiling. Stacking many independent specification layers across all three dimensions pushes the prompt past the instruction-compliance ceiling (~68% at high densities per IFScale); marginal structure starts being silently ignored regardless of which dimension you raise.

Example

The same change request, evolved by raising one dimension at each stage of failure. The diagnostic is the stage the previous version failed at; the final prompt below is what survives all three raises and is what you would actually paste.

Generation fails (model produces the wrong shape of refactor):

Refactor this React form.

After raising Specificity — Generation passes; the output is plausible but the developer keeps discarding it without committing:

Refactor the <SignupForm> component at apps/web/src/auth/SignupForm.tsx
to use react-hook-form's useForm hook. Keep the existing field names and
validation messages. The function signature stays (props: SignupProps) =>
JSX.Element. Return one component, no new exports.

After raising Verification — Adoption passes; the change is committed, but it gets rejected at PR review for breaking a codebase convention the model could not have known:

Refactor the <SignupForm> component at apps/web/src/auth/SignupForm.tsx
to use react-hook-form's useForm hook. Keep the existing field names and
validation messages. The function signature stays (props: SignupProps) =>
JSX.Element. Return one component, no new exports.

The existing snapshot test at apps/web/src/auth/__tests__/SignupForm.test.tsx
must continue to pass. Also: pressing Enter inside any text input still submits
the form, and pasting an email with surrounding whitespace still trims it
before validation.

After raising Context — the final prompt, ready to paste:

Refactor the <SignupForm> component at apps/web/src/auth/SignupForm.tsx
to use react-hook-form's useForm hook. Keep the existing field names and
validation messages. The function signature stays (props: SignupProps) =>
JSX.Element. Return one component, no new exports.

The existing snapshot test at apps/web/src/auth/__tests__/SignupForm.test.tsx
must continue to pass. Also: pressing Enter inside any text input still submits
the form, and pasting an email with surrounding whitespace still trims it
before validation.

This codebase uses react-hook-form via the wrapper at
apps/web/src/lib/forms/useTrackedForm.ts which adds analytics. Use that
wrapper, not useForm directly. The wider auth flow's error handling is
documented at apps/web/src/auth/AGENTS.md — error messages have to follow
the conventions there.

Each raise targets the stage the previous prompt failed at. Effort spent on a different dimension does not move the failing stage.
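The incremental evolution above can be sketched as a prompt assembled from one section per dimension, where each failure adds to exactly one section. The `StagedPrompt` class is a hypothetical helper written for this illustration, not part of any library, and the strings are abbreviated from the example prompts.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StagedPrompt:
    """A prompt built from a task line plus one section per dimension.

    Hypothetical helper for illustration; not part of any library.
    """
    task: str
    specificity: List[str] = field(default_factory=list)
    verification: List[str] = field(default_factory=list)
    context: List[str] = field(default_factory=list)

    def render(self) -> str:
        # First paragraph: the task plus any Specificity details.
        # Verification and Context follow as their own paragraphs if present.
        paragraphs = [" ".join([self.task, *self.specificity])]
        for section in (self.verification, self.context):
            if section:
                paragraphs.append(" ".join(section))
        return "\n\n".join(paragraphs)


prompt = StagedPrompt(
    task="Refactor the <SignupForm> component at apps/web/src/auth/SignupForm.tsx."
)

# Generation failed: raise Specificity.
prompt.specificity.append(
    "The signature stays (props: SignupProps) => JSX.Element; no new exports."
)

# Adoption failed: raise Verification.
prompt.verification.append("The existing snapshot test must continue to pass.")

# Integration failed: raise Context.
prompt.context.append(
    "Use the wrapper at apps/web/src/lib/forms/useTrackedForm.ts, not useForm directly."
)

print(prompt.render())
```

Keeping the sections separate makes it visible which dimension each sentence serves, so a later failure can be traced back to the section that needs work.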

Key Takeaways

  • Prompt quality is not one axis — Specificity, Context, and Verification each move a different stage of the LLM-assisted PR pipeline (Sserunjogi et al., 2026).
  • Diagnose by stage: model output unusable → Specificity; plausible output you keep discarding → Verification; output adopted but breaks at review → Context.
  • The framework targets human-driven single-shot prompting. Inside agent loops the harness mediates Context and Verification — prompt discipline pays much less.
  • Structural controls (hooks, tests, schemas) beat Verification cues written into the prompt when both are available; see Hooks vs Prompts.