Constraint Tax: Tool Suppression Under JSON Schema Decoding

Co-enabling JSON-schema constrained decoding and tool calling silently suppresses tool invocation in open-weight models — schema compliance stays high, tools never fire.

Tool Suppression is a reliability failure in open-weight LLMs served with grammar-based constrained decoding and tool calling in the same request. The model keeps returning schema-valid JSON, so monitoring looks healthy, while tool-call tokens become unreachable in the decoder — the agent silently stops acting. The effect is reproducible across multiple open-weight model families and deployment configurations (Constraint Tax in Open-Weight LLMs (arxiv:2606.25605)).

When This Applies

The failure mode surfaces when all of these hold:

Open-weight serving stack with grammar-based decoding — vLLM grammar mode, llama.cpp grammars, Outlines, or llguidance compiling a JSON schema into a finite-state grammar
Tool calling and structured output enabled in the same request — and the response schema does not enumerate the tool-call branch as a valid first-token sequence
Multi-capability agent loops where a single turn might either call a tool or emit a final structured answer
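The conditions above can be turned into a rough pre-flight check on outgoing requests. This is a minimal sketch: the helper name `risks_tool_suppression` is hypothetical, and searching the schema text for a `tool_calls` key is a crude heuristic chosen for illustration, not something the paper or any serving stack prescribes.

```python
import json


def risks_tool_suppression(request: dict) -> bool:
    """Heuristic (hypothetical helper): True when a request combines tools
    with a JSON-schema response format that shows no tool-call branch."""
    has_tools = bool(request.get("tools"))
    fmt = request.get("response_format") or {}
    if not has_tools or fmt.get("type") != "json_schema":
        return False
    schema = fmt.get("json_schema", {}).get("schema", {})
    # Crude assumption: a reserved tool-call branch would mention "tool_calls".
    return '"tool_calls"' not in json.dumps(schema)
```

A flagged request is only a candidate for the failure mode; whether tools are actually suppressed still depends on the serving stack and should be confirmed empirically.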

Closed-weight vendor APIs that own the inference path document the combination as a supported workflow with explicit complexity limits (Anthropic Structured Outputs). The trap is specific to self-hosted stacks where the constrained-decoding layer is unaware of the tool-call grammar.

Why It Backfires

Grammar-based decoding compiles the JSON schema into a finite-state grammar. At each decoding step the server computes the set of tokens the grammar allows and masks every other token, typically by setting its logit to negative infinity so that its probability becomes exactly zero. Tool-calling formats begin with distinct opening tokens (<tool_call>, {"tool_calls":, vendor markers). If the response schema does not enumerate that branch as a valid first-token sequence, the mask removes those tokens before sampling, so the constraint engine outranks the model's preferences by construction (arxiv:2606.25605).
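The masking step can be illustrated with a toy example. This is a sketch under stated assumptions: the three-token vocabulary, the logit values, and the helper names are invented for illustration, and real engines operate on full tokenizer vocabularies with compiled grammar states.

```python
import math


def apply_grammar_mask(logits: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Keep logits for grammar-allowed tokens; set every other token to -inf."""
    return {tok: (val if tok in allowed else -math.inf) for tok, val in logits.items()}


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Numerically stable softmax; -inf logits map to probability 0.0."""
    peak = max(v for v in logits.values() if v != -math.inf)
    exps = {tok: (math.exp(v - peak) if v != -math.inf else 0.0) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Toy first-step logits: the model strongly prefers opening a tool call.
first_step_logits = {"<tool_call>": 5.0, "{": 1.0, "Sure": 0.5}
# A response schema describing a JSON object only allows "{" as the first token.
allowed_first_tokens = {"{"}

probs = softmax(apply_grammar_mask(first_step_logits, allowed_first_tokens))
print(probs)  # "<tool_call>" has probability 0.0 despite the highest raw logit
```

The model's preference for the tool call is still visible in the raw logits; it simply cannot be sampled once the mask is applied.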

The paper names the artefact Constraint Priority Inversion (CPI): schema satisfaction dominates action selection when constraints compete. Schema compliance stays near 100% while the tool-invocation rate collapses, so the dashboard signals success while the agent has stopped acting. Tam et al. (2024) report a related, broader pattern: stricter format constraints tend to produce greater degradation in reasoning accuracy.

When This Doesn't Apply

The tax is not universal. Safe configurations:

Closed-weight vendor APIs (Anthropic, OpenAI) that track combined-schema complexity and handle tool-call and JSON-output paths as documented features (Anthropic structured outputs)
Schemas that explicitly reserve a tool-call variant as an allowed top-level branch — the grammar then permits the tool-call tokens
Tool calling without structured response constraints — only tool arguments carry strict schema validation; the tool-call entry tokens stay unmasked

In all other open-weight self-hosted configurations, validate empirically before shipping.
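A schema that reserves a tool-call variant might look like the following sketch. The helper name `with_tool_call_branch` and the shape of the tool-call variant (a `tool_calls` array of `name` and `arguments` objects) are assumptions for illustration. This only helps when the tool-call format itself is JSON; it cannot make marker tokens such as <tool_call> reachable, and whether the serving stack's tool parser treats the emitted JSON as a tool call depends on the stack.

```python
def with_tool_call_branch(answer_schema: dict, tool_names: list[str]) -> dict:
    """Wrap an answer schema so a tool-call object is also a valid top-level
    branch (hypothetical helper; the variant's shape is an assumption)."""
    tool_call_variant = {
        "type": "object",
        "properties": {
            "tool_calls": {
                "type": "array",
                "minItems": 1,
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string", "enum": tool_names},
                        "arguments": {"type": "object"},
                    },
                    "required": ["name", "arguments"],
                },
            }
        },
        "required": ["tool_calls"],
    }
    # anyOf lets the grammar accept either the final answer or a tool call.
    return {"anyOf": [answer_schema, tool_call_variant]}
```

Grammar engines differ in how much of JSON Schema they support, so it is worth confirming that the chosen backend compiles `anyOf` before relying on this approach.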

Mitigations

The paper proposes Transparent Two-Pass Execution: decouple tool invocation from schema-constrained generation. Pass 1 runs without the response grammar so tool-call tokens remain reachable; pass 2 applies the schema only to the final response after tool results return (arxiv:2606.25605). Alternatives: extend the schema to reserve the tool-call branch, or evaluate tool calling and structured output separately before combining them. Add a tool-invocation-rate metric next to schema-compliance metrics so Constraint Priority Inversion is detectable in monitoring.
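The two-axis metric can be sketched as follows. The record type `TurnRecord`, its field names, and the function `summarize` are hypothetical, and JSON parseability stands in for full schema validation to keep the example standard-library only.

```python
import json
from typing import NamedTuple


class TurnRecord(NamedTuple):
    content: str          # raw assistant text for the turn
    tool_call_count: int  # number of tool calls the turn produced
    tool_expected: bool   # whether the eval case should have triggered a tool


def is_valid_json(text: str) -> bool:
    """Stand-in for schema validation: checks only that the text parses."""
    try:
        json.loads(text)
        return True
    except (json.JSONDecodeError, TypeError):
        return False


def summarize(records: list[TurnRecord]) -> dict[str, float]:
    """Report format compliance and tool invocation as separate rates."""
    expected = [r for r in records if r.tool_expected]
    valid = sum(is_valid_json(r.content) for r in records)
    invoked = sum(r.tool_call_count > 0 for r in expected)
    return {
        "json_valid_rate": valid / len(records) if records else 0.0,
        "tool_invocation_rate": invoked / len(expected) if expected else 0.0,
    }
```

A `json_valid_rate` near 1.0 together with a `tool_invocation_rate` near 0.0 on cases that should call a tool is the signature described above.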

Example

The suppression appears in vLLM-style serving when a JSON-mode response schema is passed alongside tool definitions:

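# Anti-pattern: open-weight + grammar JSON schema + tools in one request.
# The response schema enumerates only {"answer": str, "confidence": float};
# tool_calls is not a valid first-token branch, so the grammar mask removes
# the tool-call opening tokens. Tools are defined but never invoked.
from openai import OpenAI

# Assumption: a local OpenAI-compatible server such as vLLM; adjust base_url.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# weather_tool and search_tool are assumed to be OpenAI-style function
# definitions ({"type": "function", "function": {...}}) defined elsewhere.

response = client.chat.completions.create(
    model="open-weight-model",  # e.g. Llama / Mistral / Qwen variants
    tools=[weather_tool, search_tool],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "answer",  # illustrative; OpenAI-style APIs expect a schema name
            "schema": {
                "type": "object",
                "properties": {
                    "answer": {"type": "string"},
                    "confidence": {"type": "number"},
                },
                "required": ["answer", "confidence"],
            },
        },
    },
    messages=[{"role": "user", "content": "What is the weather in Oslo?"}],
)
# Returns schema-valid JSON; response.choices[0].message.tool_calls stays empty.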

The two-pass mitigation:

# Fix — pass 1 runs unconstrained so tools can fire; pass 2 applies the
# response schema only after tool results return.
draft = client.chat.completions.create(
    model="open-weight-model",
    tools=[weather_tool, search_tool],
    # no response_format — grammar masks do not touch tool-call tokens
    messages=[{"role": "user", "content": "What is the weather in Oslo?"}],
)
# Run any tool calls in draft.choices[0].message.tool_calls, append results
# to messages, then finalize with the schema applied:
final = client.chat.completions.create(
    model="open-weight-model",
    response_format={"type": "json_schema", "json_schema": {...}},
    messages=messages_with_tool_results,
)
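The step between the two passes, running the tool calls and building the message list for the final request, can be sketched as below. The helper `append_tool_results` and the `registry` mapping of tool names to Python callables are hypothetical; the message shapes follow the OpenAI chat-completions wire format, which the serving stack is assumed to accept.

```python
import json


def append_tool_results(messages: list[dict], tool_calls: list[dict], registry: dict) -> list[dict]:
    """Return a new message list with the assistant tool-call turn and one
    tool-result message per call appended (hypothetical helper)."""
    out = list(messages)
    out.append({"role": "assistant", "content": None, "tool_calls": tool_calls})
    for call in tool_calls:
        fn = registry[call["function"]["name"]]
        # Arguments arrive as a JSON-encoded string in this wire format.
        args = json.loads(call["function"]["arguments"])
        result = fn(**args)
        out.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        })
    return out
```

With the `openai` Python client, the tool calls from the first pass can usually be converted to plain dicts with `[tc.model_dump() for tc in draft.choices[0].message.tool_calls or []]` before being passed in. If that list is empty, the first-pass answer can go straight to the schema-constrained pass.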

Pair this with monitoring that tracks tool-invocation rate alongside schema-compliance rate so a future regression that re-enables the single-pass path is detectable.

Key Takeaways

Tool Suppression is a reproducible failure in open-weight stacks: schema-valid JSON with tool-invocation rate near zero
The mechanism is grammar masks zeroing the tool-call opening tokens, not a model-preference failure
Closed-weight vendor APIs document the combination as supported — the tax is specific to self-hosted grammar-decoding deployments
Mitigate with two-pass execution, by extending the schema to include a tool-call branch, or by evaluating tool calling and structured output separately before combining them
Monitor tool-invocation rate next to schema-compliance rate — single-axis monitoring hides Constraint Priority Inversion

Structured Output Constraints: Reducing Hallucination — the complementary benefit-side view; this page is the cost-side trade-off when combined with tool calling
Indiscriminate Structured Reasoning — another anti-pattern where a generally useful technique is applied without checking whether it helps the specific task
Dynamic Tool Fetching Breaks KV Cache — sibling failure where intuitive tool-side optimisation breaks a different layer (cache vs. decoding)
Token-Efficient Tool Design — tool-side design choices that affect agent reliability
Advanced Tool Use — broader reference for tool-calling reliability and harness design