20 Minutes of Talking Out Loud, Dozens of Issues Filed

The post-walkthrough problem

Every product team has a version of this: you record a screen walkthrough, narrate everything that feels off, and upload the video to Notion or Loom. A week later, two things get fixed. The rest of the transcript is never read again.

The problem is not the quality of the observations. The problem is the gap between raw narration and actionable issues. Turning one into the other requires someone to read the transcript carefully, filter it, classify each finding by type and severity, write acceptance criteria, and open tickets - and that someone is always you, and you never quite have the time.

For a client, we built a walkthrough skill on top of Claude Code’s experimental agent teams feature. It closes that gap automatically.

The input: just talk

The input is a text transcript. Ours came from narrated screen sessions: 20 minutes of clicking through the product and talking out loud. No structure required. No template to fill out. Just speech-to-text from a natural walkthrough - unedited, with all the interrupted sentences and “that doesn’t make sense” observations intact.

Drop the file in a folder, type /walkthrough, and the team takes over.

Four agents, four lenses

Claude Code’s agent teams feature lets you create a named team where agents can message each other directly, share a task list, and coordinate without the lead session mediating every exchange. That is what makes this work: each agent approaches the same transcript with a completely different question.

UX Designer reads for interaction design, hierarchy, copy, and mental model mismatches. Not looking for bugs - asking whether the interface sets the right expectation, whether the task flow is in the right order, whether labels match the context they appear in.

Lead Engineer cross-references the transcript against the codebase. For each user observation, it localises the probable root cause to a file and line range. It reads the relevant source before writing a note, and it is the agent that notices when several apparently separate complaints trace back to one root cause.

QA Tester produces a step-by-step ledger of everything the user did, with a works / partial / broken verdict on each step. It then maps broken steps to existing test coverage and generates a list of edge cases the user did not try but that should be tested.

Product Owner waits for the other three to finish, validates every finding against the transcript (anything not traceable to a specific user observation gets rejected), deduplicates across all three reports, and opens GitHub issues. It is the only agent authorised to call gh issue create.

The three notes agents run in parallel. The PO unblocks only when all three are done.
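The dependency shape is simple enough to sketch outside Claude Code. The following is an illustration only, not how the agent teams feature is implemented: `notes_agent`, `product_owner` and the role names are hypothetical stand-ins, and the "agents" just return canned strings. What it shows is the gate: the three notes tasks run concurrently, and the PO step cannot start until all of them have returned.

```python
import asyncio


async def notes_agent(role: str, transcript: str) -> tuple[str, str]:
    """Hypothetical stand-in for one notes agent; returns (role, report)."""
    await asyncio.sleep(0)  # yield control, as a real agent would while working
    return role, f"{role} notes on {len(transcript.split())} words"


async def product_owner(reports: dict[str, str]) -> list[str]:
    """Hypothetical PO step; it only ever receives a complete set of reports."""
    return [f"issue drawn from {role}" for role in sorted(reports)]


async def run_walkthrough(transcript: str) -> list[str]:
    roles = ["ux-designer", "lead-engineer", "qa-tester"]
    # gather() runs the three agents concurrently and returns only when all finish,
    # which is what keeps the PO blocked until every report exists.
    results = await asyncio.gather(*(notes_agent(r, transcript) for r in roles))
    return await product_owner(dict(results))


if __name__ == "__main__":
    print(asyncio.run(run_walkthrough("the save button does nothing")))
```

The sketch leaves out the part that matters most in practice, the direct messaging between agents while they work.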

Why the team structure matters

You could ask a single model to read a transcript and file issues. The output would be a flat list of observations with no architectural perspective, no test coverage analysis, and no cross-referencing between findings.

The team structure produces something different because each agent is working on a different question at the same time. The UX agent is not thinking about test coverage. The QA agent is not thinking about root causes. When the PO synthesises at the end, it has three independent lenses rather than one attempt to do everything at once.

The cross-messaging helps too. Mid-investigation, the engineer can tell the QA agent that two test specs it is writing cover the same underlying bug - QA consolidates them. The UX agent can ask the engineer whether a confusing label is intentional or just an oversight in the copy. That coordination does not happen in a single-model run or a sequential subagent chain.

The outcome

The PO does not just forward the raw notes. It compresses them. Three agents will independently flag variations of the same issue from different angles - the UX agent sees a broken feedback loop, the QA agent logs a broken step, the engineer traces it to a missing filter. The PO merges those into one issue rather than three that would conflict in review.
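As a rough sketch of that synthesis step, the two rules (reject anything not traceable to the transcript, then merge what overlaps) could look like the code below. The `Finding` shape, the `root_cause` grouping key and the sample data are all assumptions made for illustration; in the real skill the PO is a model reasoning over three reports, not a substring match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    agent: str       # which agent reported it
    summary: str     # that agent's framing of the problem
    quote: str       # transcript text the finding claims to come from
    root_cause: str  # hypothetical shared key used to group duplicates


def synthesise(findings: list[Finding], transcript: str) -> dict[str, list[Finding]]:
    """Drop untraceable findings, then group the rest by root cause."""
    spoken = transcript.lower()
    merged: dict[str, list[Finding]] = {}
    for finding in findings:
        # Reject anything that cannot be traced to something the user said.
        if not finding.quote or finding.quote.lower() not in spoken:
            continue
        merged.setdefault(finding.root_cause, []).append(finding)
    return merged


# Invented sample data, for illustration only.
transcript = "I archived it but it still shows in the list. That doesn't make sense."
findings = [
    Finding("ux", "broken feedback loop", "it still shows in the list", "list filter"),
    Finding("qa", "archive step is broken", "it still shows in the list", "list filter"),
    Finding("engineer", "missing filter", "it still shows in the list", "list filter"),
    Finding("ux", "colours are off", "the colours are ugly", "theme"),
]
merged = synthesise(findings, transcript)

if __name__ == "__main__":
    for cause, group in merged.items():
        print(cause, [f.agent for f in group])
```

Here the three views of the archive problem collapse into one group, and the colour complaint is dropped because the user never said it.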

Every issue follows the same structure: current behaviour, expected behaviour, acceptance criteria, explicit out-of-scope - the format a developer needs to start implementation with no translation step between “issue filed” and “issue ready to work on.”
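A minimal sketch of that four-part structure, assuming a Markdown issue body. The `Issue` class, its field names and the sample content are hypothetical; only `gh issue create` with its `--title` and `--body` flags comes from the GitHub CLI itself, and the command is built here but never run.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    title: str
    current_behaviour: str
    expected_behaviour: str
    acceptance_criteria: list[str] = field(default_factory=list)
    out_of_scope: list[str] = field(default_factory=list)

    def body(self) -> str:
        """Render the four sections as a Markdown issue body."""
        criteria = "\n".join(f"- [ ] {c}" for c in self.acceptance_criteria)
        excluded = "\n".join(f"- {o}" for o in self.out_of_scope)
        sections = [
            ("Current behaviour", self.current_behaviour),
            ("Expected behaviour", self.expected_behaviour),
            ("Acceptance criteria", criteria),
            ("Out of scope", excluded),
        ]
        return "\n\n".join(f"## {name}\n{text}" for name, text in sections)

    def gh_command(self) -> list[str]:
        """Build the argv for `gh issue create`; nothing is executed here."""
        return ["gh", "issue", "create", "--title", self.title, "--body", self.body()]


# Invented example content, for illustration only.
issue = Issue(
    title="Archived items still appear in the list",
    current_behaviour="Archiving an item leaves it visible in the list.",
    expected_behaviour="Archived items disappear from the list immediately.",
    acceptance_criteria=["Archived items are excluded from the list view"],
    out_of_scope=["Bulk archive"],
)

if __name__ == "__main__":
    print(issue.body())
```

Keeping out-of-scope as its own field, rather than a sentence buried in the description, is what lets a reviewer reject scope creep by pointing at the issue.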

Beyond the primary findings, the QA agent produces a list of edge cases the user never tried. Not guesses - structured hypotheticals derived from the flows the user did exercise: what happens when you do X with zero items, what happens when you change Y after Z has already run. These become follow-up issues or get folded into existing test specs.

Twenty minutes of narrating a product out loud, one command, and by morning you have a structured backlog. Not a pile of notes - a backlog.

It also surfaced issues that would not have been filed otherwise. When you are the person who built something, you know why a particular UI decision was made and you edit your own observations accordingly. The agents do not have that bias. They read what the user said and take it at face value.

What we’re going to try next

The current flow is parallel-then-synthesise: three agents work independently, then the PO compresses the results. The next experiment is a discussion protocol.

Instead of the PO receiving three separate reports, each finding gets discussed by the full team before the issue is written. The engineer presents a root cause, the UX agent responds with the interaction framing, the QA agent identifies what a test spec would need to cover. Only after that exchange does the PO write the consensus issue.

The hypothesis is that the synthesis step is currently doing too much work on its own. Forcing each finding through a structured conversation first should produce tighter acceptance criteria and fewer reopens - the kind of clarity that usually only comes from a standup where someone asks “but what does ‘fixed’ actually mean here?”

The four agent personas are defined as .claude/agents/ subagents and wire up automatically when the team is created. The agent teams feature is currently experimental - you need CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1 in your environment and Claude Code ≥ 2.1.32. If you are building something similar, the key is the team structure rather than the personas: the value comes from agents having genuinely different mandates on the same input, not from better prompts on a single model.
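If you script the setup, a small preflight check for those two prerequisites can save a confusing failure later. This is a sketch under assumptions: the helper names are made up, and the version string is supplied by the caller because the exact output format of the CLI's version command is not something this post specifies. The parser simply takes the first dotted number it finds.

```python
import os
import re

MIN_VERSION = (2, 1, 32)
FLAG = "CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS"


def parse_version(text: str) -> tuple[int, ...]:
    """Extract the first dotted number from text as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def agent_teams_ready(env: dict[str, str], version_text: str) -> bool:
    """True when the experimental flag is set to 1 and the version is new enough."""
    return env.get(FLAG) == "1" and parse_version(version_text) >= MIN_VERSION


if __name__ == "__main__":
    # The version string is passed in by hand here; obtain it however suits your setup.
    print(agent_teams_ready(dict(os.environ), "2.1.32"))
```

Comparing integer tuples rather than strings matters here: as text, "2.1.9" sorts after "2.1.32", which would wrongly pass an older build.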

Further reading

Claude Code agent teams - official docs for the experimental multi-agent feature
Claude Code sub-agents - how to define agent personas in .claude/agents/
