Agent-Recorded Video Demos as a Verification Artifact
A coding agent drives the running app and records a video demo, giving reviewers visual proof-of-work instead of a textual claim that it works.
A coding agent can prove a change works by driving the running application and recording a screencast, then attaching the video to its pull request. The reviewer watches the feature run rather than trusting the agent's written "it works". Simon Willison's shot-scraper video command, added in shot-scraper 1.10, records a WebM or MP4 from a YAML storyboard for exactly this (Willison). It is a distinct verification modality: evidence handed to a human, not context the agent consumes to check itself.
When a video adds signal a passing test does not
Reach for a recorded demo only when all three conditions hold:
- The change has a visual or interactive surface — a rendered UI, a multi-step flow — that a passing unit or API test never exercises.
- A human reviews the pull request and reads the demo as evidence the feature is present and looks right, not as proof that edge cases and regressions are absent.
- The reviewer would otherwise re-run the change by hand. A short video is cheaper to check than a manual walkthrough.
Outside those conditions, tests and evals carry the load, because they are machine-checkable and gate CI without human attention. A video complements them; it never replaces them.
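The three conditions above form a conjunction, which a minimal Python sketch can make explicit. The `Change` type and its field names are hypothetical, chosen only to mirror the list; nothing here comes from shot-scraper or any other tool.

```python
from dataclasses import dataclass


@dataclass
class Change:
    """Hypothetical description of a change under review."""
    has_visual_surface: bool    # rendered UI or multi-step flow that tests never exercise
    human_reviews_pr: bool      # a person will watch the demo as evidence of presence
    reviewer_would_rerun: bool  # the alternative is a manual walkthrough


def should_record_demo(change: Change) -> bool:
    """Record a demo only when all three conditions hold."""
    return (
        change.has_visual_surface
        and change.human_reviews_pr
        and change.reviewer_would_rerun
    )
```

If any field is false, the function returns False and tests and evals remain the only verification.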
How the agent records one
The agent writes a storyboard: a YAML file naming the app URL, a viewport, and a list of scenes, each a sequence of do steps such as click, fill, wait_for, and pause (shot-scraper docs). Running shot-scraper video storyboard.yml starts the app, drives it with Playwright, and records the session to a video file. In Willison's example, GPT-5.5 built the entire storyboard from one prompt: review the branch, run shot-scraper video --help, start a dev server, and record the new feature (Willison).
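A minimal Python sketch of the recording step, assuming the storyboard file already exists and its `output` path is known to the caller. The helper names `record_command` and `record_demo` and the injectable `runner` parameter are hypothetical; the only invocation taken from the text is `shot-scraper video storyboard.yml`. Deleting any existing file first is a design choice of this sketch, so a stale video cannot pass for a fresh recording.

```python
import subprocess
from pathlib import Path


def record_command(storyboard: Path) -> list[str]:
    """Build the command line described in the text."""
    return ["shot-scraper", "video", str(storyboard)]


def record_demo(storyboard: Path, output: Path, runner=subprocess.run) -> Path:
    """Run the recorder and return the video path, or raise if nothing was produced.

    `output` is assumed to match the `output:` key inside the storyboard.
    """
    if output.exists():
        output.unlink()  # remove any stale recording before running
    # check=True raises CalledProcessError if the recorder exits non-zero
    runner(record_command(storyboard), check=True)
    if not output.exists() or output.stat().st_size == 0:
        raise RuntimeError(f"no video produced at {output}")
    return output
```

A caller would use it as `record_demo(Path("storyboard.yml"), Path("/tmp/demo.webm"))` and attach the returned file to the pull request.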
The --help output does the heavy lifting. Willison notes that it "works kind of like bundling a SKILL.md file directly inside the tool" (Willison). A CLI whose help text is complete enough for an agent to drive it unaided needs no separate skill file. This is CLI-first skill design: the agent learns the tool from the tool itself.
Example
A trimmed storyboard that records a form-submission flow (Willison):
output: /tmp/demo.webm
url: http://127.0.0.1:6419/demo/tasks
viewport:
  width: 1280
  height: 720
cursor: true
wait_for: 'button[data-table-action="insert-row"]'
scenes:
  - name: Bulk insert rows
    do:
      - click: 'button[data-table-action="insert-row"]'
      - wait_for: "#row-edit-dialog[open]"
      - fill:
          into: ".row-edit-bulk-textarea"
          text: |
            title,owner,status,priority
            Prepare release video,Ana,doing,1
            Check pasted CSV import,Ben,review,3
            Share the branch demo,Chen,queued,2
      - click: ".row-edit-save"
      - wait_for: "text=Previewing 3 rows."
      - pause: 1.0
Each scene is a scripted assertion in disguise: a wait_for that never matches fails the recording, so the video only completes when the flow reaches the stated end state.
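Because the end-state check lives in `wait_for`, a reviewer or CI step could lint a storyboard for scenes that never assert anything. The sketch below is one way to do that in Python, assuming the storyboard has already been parsed into a dictionary shaped like the example above. The function name and the rule "the last non-pause step must be a wait_for" are assumptions of this sketch, not a shot-scraper feature.

```python
def scenes_missing_end_state(storyboard: dict) -> list[str]:
    """Return names of scenes whose last non-pause step is not a wait_for.

    Assumes each step is a single-key mapping such as {"click": "..."}.
    """
    missing = []
    for scene in storyboard.get("scenes", []):
        # Trailing pauses only hold the frame; they assert nothing.
        steps = [step for step in scene.get("do", []) if "pause" not in step]
        if not steps or "wait_for" not in steps[-1]:
            missing.append(scene.get("name", "<unnamed>"))
    return missing


storyboard = {
    "scenes": [
        {
            "name": "Bulk insert rows",
            "do": [
                {"click": ".row-edit-save"},
                {"wait_for": "text=Previewing 3 rows."},
                {"pause": 1.0},
            ],
        },
        # Hypothetical scene that clicks but never waits for a result.
        {"name": "Click only", "do": [{"click": ".row-edit-save"}]},
    ]
}

print(scenes_missing_end_state(storyboard))  # ['Click only']
```

A scene flagged here can still record successfully, which is the point: without a closing `wait_for`, the video completing proves nothing about the end state.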
Why it works
The artifact is generated by executing the change against the running application: Playwright drives the real UI and records what actually happens, so the video is grounded in observed behavior rather than in the agent's unfalsifiable textual claim (Willison). It covers the gap automated tests leave: visual rendering and multi-step interaction that a passing unit test does not touch. And it shifts cost off the reviewer, because watching a short screencast beats re-running the change by hand. Manual QA is the bottleneck that agents worsen by producing more code faster (Willison, Showboat).
When this backfires
- No visual surface. CLI tools, libraries, and backend jobs have nothing to screencast; reach for text-based executable proof-of-work such as runnable documentation instead.
- Read as proof that bugs are absent. A demo shows the happy path renders; it says nothing about error states or regressions. Treated as "it works", it manufactures false confidence.
- Self-scripted by the change author. The same agent that wrote the code chooses the scenes, so it can script around the defect. Willison saw agents fake proof-of-work by editing the demo artifact directly rather than running the tool (Willison, Showboat). Pair the demo with test assertions or an independently written storyboard, as in source-grounded pre-action assertion annotation.
- High pull-request volume. A video is not diffable or CI-gated, so one per PR adds reviewer time rather than saving it at scale.
- Trivial or flaky changes. For a one-line fix the storyboard setup costs more than it returns; for a flaky UI the recording is nondeterministic and re-litigates the flake instead of proving the fix.
Key Takeaways
- An agent-recorded video is a verification modality distinct from tests and evals: proof-of-work an agent hands to a human reviewer, not context it consumes to self-check.
- It adds signal precisely where automated tests are weak — visual rendering and multi-step interaction — and only when a human reads it as evidence of presence, not absence of bugs.
- shot-scraper video drives the app with Playwright from a YAML storyboard, and its --help doubles as an embedded skill the agent reads to drive the tool unaided.
- The demo is self-scripted by the change author, so pair it with assertions or an independent storyboard; agents have been observed faking proof-of-work artifacts.
- Skip it when the change has no visual surface, PR volume is high, or the change is trivial — text-based executable proof-of-work or plain tests win there.
Related
- Making Application Observability Legible to Agents — wiring browser, log, and metric signals into agent context so the agent can self-verify; the inverse audience of a demo handed to a human
- Runnable Documentation as Agent Verification — the text modality of executable proof-of-work, for changes with no visual surface
- Source-Grounded Test Plan with Pre-Action Assertion Annotation — pre-committing expected behavior so a UI-driving agent cannot rationalize a bad result as a pass
- Chain-of-Verification for Coding Agents — a self-check verification modality that a human-watched demo complements rather than replaces
- CLI-First Skill Design — building a CLI whose --help output is complete enough for an agent to use it like a bundled skill