Skill Eval Loop

Define test cases, benchmark pass rates, A/B-compare skill versions, and optimize trigger descriptions — bringing eval-driven development to skill authoring without writing code.

Skills fail on two independent axes: output quality (does it produce good results?) and trigger precision (does it activate at the right time?). The skill-creator framework addresses both through a structured loop. [Source: Improving skill-creator]

The eval loop

graph TD
    A[Define test cases] --> B[Run with-skill & baseline in parallel]
    B --> C[Grade outputs against assertions]
    C --> D[Aggregate benchmark]
    D --> E{Pass rate meets bar?}
    E -->|No| F[Revise skill instructions]
    F --> B
    E -->|Yes| G[Optimize trigger description]
    G --> H[Ship]

Step 1: Define test cases

Each eval in evals/evals.json has three parts: a realistic prompt with concrete details (paths, columns, context), an expected output description, and optional input files. Start with 2 to 3 cases; add assertions after the first run — you often cannot define "good" until you see what the skill produces. [Source: Evaluating skill output quality]

{
  "skill_name": "csv-analyzer",
  "evals": [
    {
      "id": 1,
      "prompt": "I have a CSV of monthly sales data in data/sales_2025.csv. Find the top 3 months by revenue and make a bar chart.",
      "expected_output": "A bar chart showing the top 3 months by revenue with labeled axes.",
      "files": ["evals/files/sales_2025.csv"]
    }
  ]
}
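A quick structural check can catch malformed cases before a run. The sketch below is only an illustration and is not part of skill-creator: `validate_evals` and `REQUIRED_KEYS` are hypothetical names, and the required fields are inferred from the example above rather than from a published schema.

```python
# Hypothetical validator; the required keys are an assumption based on the example file.
REQUIRED_KEYS = {"id", "prompt", "expected_output"}  # "files" is treated as optional


def validate_evals(doc: dict) -> list[str]:
    """Return human-readable problems; an empty list means the structure looks sane."""
    problems = []
    if not doc.get("skill_name"):
        problems.append("missing skill_name")
    cases = doc.get("evals", [])
    if not cases:
        problems.append("no evals defined")
    seen_ids = set()
    for index, case in enumerate(cases):
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            problems.append(f"eval #{index}: missing {sorted(missing)}")
        if "id" in case:
            if case["id"] in seen_ids:
                problems.append(f"eval #{index}: duplicate id {case['id']}")
            seen_ids.add(case["id"])
        if not isinstance(case.get("files", []), list):
            problems.append(f"eval #{index}: files must be a list")
    return problems


example = {
    "skill_name": "csv-analyzer",
    "evals": [
        {
            "id": 1,
            "prompt": "I have a CSV of monthly sales data in data/sales_2025.csv. "
                      "Find the top 3 months by revenue and make a bar chart.",
            "expected_output": "A bar chart showing the top 3 months by revenue with labeled axes.",
            "files": ["evals/files/sales_2025.csv"],
        }
    ],
}
print(validate_evals(example))
```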

Step 2: Run evals in parallel

skill-creator spawns independent agents per eval — one with the skill, one without (or the prior version). Each runs in an isolated context, preventing bleed between runs. [Source: Improving skill-creator]

Workspace structure after a run:

csv-analyzer-workspace/
└── iteration-1/
    ├── eval-1/
    │   ├── with_skill/
    │   │   ├── outputs/
    │   │   ├── timing.json
    │   │   └── grading.json
    │   └── without_skill/
    │       ├── outputs/
    │       ├── timing.json
    │       └── grading.json
    └── benchmark.json
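To make the layout concrete, here is a rough sketch of how one iteration could be fanned out over eval and configuration pairs. skill-creator handles this itself, so nothing here reflects its implementation: `run_agent_stub` is a hypothetical stand-in for launching an isolated agent, and the `response.txt` file name and the contents of `timing.json` are assumptions.

```python
import json
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

CONFIGS = ("with_skill", "without_skill")


def run_agent_stub(prompt: str, use_skill: bool) -> str:
    """Hypothetical placeholder for an isolated agent run; returns fake output text."""
    return f"[skill={use_skill}] response to: {prompt}"


def run_one(iteration_dir: Path, eval_id: int, prompt: str, config: str) -> Path:
    """Run a single (eval, configuration) pair and write its artifacts."""
    run_dir = iteration_dir / f"eval-{eval_id}" / config
    (run_dir / "outputs").mkdir(parents=True, exist_ok=True)
    started = time.perf_counter()
    output = run_agent_stub(prompt, use_skill=(config == "with_skill"))
    elapsed = time.perf_counter() - started
    (run_dir / "outputs" / "response.txt").write_text(output)  # assumed file name
    (run_dir / "timing.json").write_text(json.dumps({"time_seconds": elapsed}))  # assumed schema
    return run_dir


def run_iteration(workspace: Path, iteration: int, evals: list[dict]) -> list[Path]:
    """Submit every eval under every configuration to a thread pool."""
    iteration_dir = workspace / f"iteration-{iteration}"
    jobs = [(case["id"], case["prompt"], config) for case in evals for config in CONFIGS]
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(run_one, iteration_dir, *job) for job in jobs]
        return [future.result() for future in futures]


with tempfile.TemporaryDirectory() as tmp:
    run_dirs = run_iteration(Path(tmp), 1, [{"id": 1, "prompt": "Find the top 3 months by revenue."}])
    print(sorted(run_dir.name for run_dir in run_dirs))
```

Each job only touches its own directory, which mirrors the isolation the real runs rely on.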

Step 3: Grade and benchmark

Assertions should be specific and observable ("The bar chart has labeled axes"), not vague ("The output is good"). Grade with code-based checks for deterministic properties, LLM-as-judge for nuanced quality, or human review as the gold standard. [Source: Demystifying evals]
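As an illustration of the code-based option, the sketch below pairs each assertion's wording with a deterministic check. The `Assertion` class, the `grade` function, the sample output text, and the shape of the returned record are all hypothetical; the real `grading.json` schema is not shown in this document.

```python
from dataclasses import dataclass
from typing import Callable

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


@dataclass
class Assertion:
    text: str                     # the observable claim, phrased for humans
    check: Callable[[str], bool]  # deterministic check over the output text


def grade(output: str, assertions: list[Assertion]) -> dict:
    """Evaluate every assertion and report a pass rate (assumed record shape)."""
    results = [{"text": a.text, "passed": bool(a.check(output))} for a in assertions]
    passed = sum(result["passed"] for result in results)
    return {"results": results, "pass_rate": passed / len(results) if results else 0.0}


assertions = [
    Assertion("The response names exactly three months",
              lambda out: sum(month in out for month in MONTHS) == 3),
    Assertion("The response points to a saved PNG chart",
              lambda out: ".png" in out),
]

# Made-up agent output, used only to exercise the checks.
sample_output = ("Top 3 months by revenue: March, July, November. "
                 "Chart saved to outputs/top3_revenue.png with labeled axes.")
report = grade(sample_output, assertions)
print(report["pass_rate"])
```

Properties such as "the axes are labeled" are harder to verify from text alone, which is where an LLM judge or human review comes in.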

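Benchmark aggregation reports three metrics per configuration (pass rate, time, and tokens), each as a mean and standard deviation, plus the delta between configurations. The example below shows only the pass rate for without_skill: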

{
  "with_skill": {
    "pass_rate": { "mean": 0.83, "stddev": 0.06 },
    "time_seconds": { "mean": 45.0, "stddev": 12.0 },
    "tokens": { "mean": 3800, "stddev": 400 }
  },
  "without_skill": {
    "pass_rate": { "mean": 0.33, "stddev": 0.10 }
  },
  "delta": { "pass_rate": 0.50, "time_seconds": 13.0, "tokens": 1700 }
}

The delta weighs the skill's cost (time, tokens) against its benefit (pass rate). A 13-second overhead for a 50-percentage-point gain in pass rate is a very different trade-off from doubling token usage for a 2-point gain. [Source: Evaluating skill output quality]
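The aggregation itself is simple arithmetic, sketched here under stated assumptions. The function names are hypothetical, the per-run numbers are invented so that the means line up with the example above, and the use of the sample standard deviation is a guess, since the document does not say which estimator is used.

```python
from statistics import mean, stdev

METRICS = ("pass_rate", "time_seconds", "tokens")


def summarize(values: list[float]) -> dict:
    """Mean and sample standard deviation (assumed estimator) for one metric."""
    return {"mean": mean(values), "stddev": stdev(values) if len(values) > 1 else 0.0}


def benchmark(runs: dict[str, list[dict]]) -> dict:
    """Aggregate per-run metrics per configuration, then compute with minus without."""
    report = {
        config: {metric: summarize([run[metric] for run in config_runs]) for metric in METRICS}
        for config, config_runs in runs.items()
    }
    report["delta"] = {
        metric: report["with_skill"][metric]["mean"] - report["without_skill"][metric]["mean"]
        for metric in METRICS
    }
    return report


# Invented per-run data for illustration only.
runs = {
    "with_skill": [
        {"pass_rate": 0.8, "time_seconds": 40.0, "tokens": 3600.0},
        {"pass_rate": 0.9, "time_seconds": 50.0, "tokens": 3800.0},
        {"pass_rate": 0.8, "time_seconds": 45.0, "tokens": 4000.0},
    ],
    "without_skill": [
        {"pass_rate": 0.3, "time_seconds": 30.0, "tokens": 2000.0},
        {"pass_rate": 0.4, "time_seconds": 34.0, "tokens": 2100.0},
        {"pass_rate": 0.3, "time_seconds": 32.0, "tokens": 2200.0},
    ],
}
result = benchmark(runs)
print(result["delta"])
```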

Step 4: Analyze and iterate

Examine each iteration for actionable patterns: [Source: Evaluating skill output quality]

Always pass in both configurations: the assertion is not discriminating, so remove or replace it.
Always fail in both configurations: the assertion is broken or the task is impossible, so fix it before the next iteration.
Pass with the skill, fail without: this is where the skill adds clear value, so understand why.
High variance across runs: the instructions are ambiguous, so add examples or tighten guidance.
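These patterns can be read mechanically from an assertion's pass/fail history. The sketch below is a hypothetical helper, not part of skill-creator; it assumes each assertion has been graded over several runs per configuration, and it adds a fifth label for the case the list does not cover (passing only without the skill).

```python
def is_mixed(results: list[bool]) -> bool:
    """True when the same assertion both passed and failed across runs."""
    return any(results) and not all(results)


def classify(with_skill: list[bool], without_skill: list[bool]) -> str:
    """Label one assertion from its results across repeated runs."""
    if not with_skill or not without_skill:
        raise ValueError("need at least one run per configuration")
    if is_mixed(with_skill) or is_mixed(without_skill):
        return "high variance: add examples or tighten guidance"
    if all(with_skill) and all(without_skill):
        return "non-discriminating: remove or replace"
    if not any(with_skill) and not any(without_skill):
        return "always fails: fix the assertion or the task"
    if all(with_skill):
        return "skill adds value: understand why"
    # Not one of the four listed patterns: passes only when the skill is absent.
    return "unclassified: passes only without the skill"


print(classify([True, True, True], [False, False, False]))
print(classify([True, False, True], [False, False, False]))
```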

Revise SKILL.md from failed assertions and transcripts. Generalize fixes rather than patching individual cases. Rerun in iteration-N+1/ and compare.

Sequential evaluation introduces anchoring bias — the second version is judged relative to the first. Comparator agents remove this: a grader receives A and B outputs without labels and scores each criterion blindly. [Source: Improving skill-creator] This extends beyond skill versus no-skill to comparing versions, competing skills, or the same skill across models.
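The blinding step can be sketched as a shuffle plus a hidden key. This is a minimal illustration with hypothetical names (`blind_pair`, `unblind`, the `v1`/`v2` labels, and the score format); it does not describe how skill-creator's comparator agents are implemented.

```python
import random


def blind_pair(output_v1: str, output_v2: str, rng: random.Random) -> tuple[dict, dict]:
    """Return what the grader sees (A/B) and the hidden key mapping labels to versions."""
    versions = [("v1", output_v1), ("v2", output_v2)]
    rng.shuffle(versions)  # random order, so position carries no information
    shown = {"A": versions[0][1], "B": versions[1][1]}
    key = {"A": versions[0][0], "B": versions[1][0]}
    return shown, key


def unblind(scores: dict[str, dict[str, int]], key: dict[str, str]) -> dict[str, dict[str, int]]:
    """Map per-criterion scores from blind labels back to version names."""
    return {key[label]: criteria for label, criteria in scores.items()}


shown, key = blind_pair("output from version 1", "output from version 2", random.Random())
# A grader would score shown["A"] and shown["B"] here; these scores are made up.
blind_scores = {"A": {"labeled_axes": 1}, "B": {"labeled_axes": 0}}
print(unblind(blind_scores, key))
```

Keeping the key away from the grader until scoring is finished is what removes the ordering cue.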

Trigger description optimization

Output quality evals only matter if the skill triggers. Run the description optimization loop in three steps.

1. Generate around 20 trigger queries: 8 to 10 that should trigger (varied phrasings: casual, formal, implicit) and 8 to 10 that should not (near-misses that share keywords but differ in intent). [Source: Skill-creator SKILL.md]
2. Run the loop: skill-creator scores the current description against the queries and suggests edits that reduce false positives and false negatives.
3. Apply and verify: update the description field in the SKILL.md frontmatter and rerun the query set.

In testing across six public document-creation skills, description optimization improved triggering on 5 of the 6. [Source: Improving skill-creator]

Queries must be realistic and detailed. A weak query is "Format this data" — too vague. A strong query is "my boss sent me Q4_sales_final_v2.xlsx and wants a profit margin column — revenue is in C, costs in D" — concrete, casual, with no skill name mentioned.
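Scoring a query set comes down to counting false positives and false negatives. In the sketch below everything is hypothetical: the `text` and `should_trigger` field names, the `score_triggers` function, the spreadsheet-skill scenario, and especially `naive_keyword_trigger`, a deliberately crude stand-in for however triggering is actually decided, included to show why near-miss queries matter.

```python
from typing import Callable


def score_triggers(queries: list[dict], would_trigger: Callable[[str], bool]) -> dict:
    """Compare predicted triggering against the labels in the query set."""
    false_positives = [q["text"] for q in queries
                       if would_trigger(q["text"]) and not q["should_trigger"]]
    false_negatives = [q["text"] for q in queries
                       if not would_trigger(q["text"]) and q["should_trigger"]]
    correct = len(queries) - len(false_positives) - len(false_negatives)
    return {
        "accuracy": correct / len(queries) if queries else 0.0,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
    }


def naive_keyword_trigger(text: str) -> bool:
    """Crude illustrative trigger: fires on surface keywords only."""
    lowered = text.lower()
    return "spreadsheet" in lowered or ".xlsx" in lowered


queries = [
    {"text": "my boss sent me Q4_sales_final_v2.xlsx and wants a profit margin column",
     "should_trigger": True},
    {"text": "can you recommend a good book on the history of the spreadsheet?",
     "should_trigger": False},   # near-miss: shared keyword, different intent
    {"text": "add a margin column to the attached sales workbook",
     "should_trigger": True},    # implicit phrasing with no obvious keyword
]
scores = score_triggers(queries, naive_keyword_trigger)
print(scores["false_positives"])
print(scores["false_negatives"])
```

The keyword trigger gets only the explicit query right, which is the kind of gap the should-not-trigger and implicit queries are meant to expose.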

Model upgrade eval strategies

Two skill categories need different eval approaches on model upgrades: [Source: Improving skill-creator]

Capability uplift — encodes techniques the base model cannot do consistently. Compare the skill-augmented model against the raw model; if raw matches or exceeds it, retire the skill.
Encoded preference — sequences capabilities to fit team workflows. Verify workflow fidelity (step order, output format, required checks) rather than raw quality, because the model cannot infer your process.
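The two checks can be sketched as follows. Both functions are hypothetical illustrations: the retirement rule is a direct reading of "if raw matches or exceeds it", and the step-order check covers only one aspect of workflow fidelity, using made-up step names.

```python
def should_retire_uplift_skill(skill_pass_rate: float, raw_pass_rate: float) -> bool:
    """Capability uplift: retire when the raw model matches or exceeds the skill."""
    return raw_pass_rate >= skill_pass_rate


def steps_in_order(transcript_steps: list[str], required: list[str]) -> bool:
    """Encoded preference: every required step appears, in the required order."""
    remaining = iter(transcript_steps)
    # `step in remaining` consumes the iterator, so later steps must come later.
    return all(step in remaining for step in required)


print(should_retire_uplift_skill(skill_pass_rate=0.83, raw_pass_rate=0.33))
print(steps_in_order(["lint", "test", "review", "deploy"], ["test", "deploy"]))
print(steps_in_order(["deploy", "test"], ["test", "deploy"]))
```

In practice a pass-rate comparison would also need to account for run-to-run variance before retiring a skill.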

When this backfires

The loop has real overhead and fails predictably:

Rarely-triggered or single-use skills — harness cost (parallel runs, grading, bookkeeping) can exceed lifetime savings, so ad-hoc manual QA may win
Same-model LLM-as-judge grading — grader agents inherit the target model's biases, which inflates pass rates on outputs the model itself would not critique. Prefer code-based assertions and human spot-checks for subjective quality. [Source: Demystifying evals]
Assertion over-fitting — a fixed eval set can tune the skill to that set while it drifts on real traffic. Refresh cases from production prompts.
Subjective skills — writing style, design, and taste resist objective assertions, so force-fitting produces a green benchmark that tells you nothing. [Source: Evaluating skill output quality]

Key Takeaways

Skills have two independent failure surfaces, output quality and trigger precision; evaluate both
Start with 2-3 test cases; add assertions after the first run, not before
Run with-skill and baseline evals in parallel with isolated agent contexts to prevent cross-contamination
Use blind A/B comparison (comparator agents) to eliminate anchoring bias when iterating
Benchmark delta (pass rate, time, tokens) quantifies the cost-benefit trade-off of a skill
Optimize trigger descriptions with should-trigger and should-not-trigger query sets
On model upgrades, capability uplift skills may become obsolete; encoded preference skills need workflow fidelity checks

Extension Points — choosing between skills, hooks, rules, and other Claude Code mechanisms
Sub-Agents — the isolated execution model that powers parallel eval runs
Skill Authoring Patterns — description craft, implementation patterns, and troubleshooting
The Eval-First Development Loop — the general eval-first workflow this technique specializes
What Evals Are — foundational concepts on agent evaluations and non-determinism
Enterprise Skill Marketplace — skill lifecycle including eval-gated publishing at scale
Skill Library Evolution — managing skill lifecycle and deprecation