When the Agent Says Done, Ask It to Click

Field entry, 13 February.

There is a particular kind of AI-built app that looks finished until you ask it to do the one thing it exists to do.

The shell is there. The sidebar looks plausible. The composer has a reassuring rectangle where text ought to go. Documentation exists, which always gives a project a faint institutional smell. There may even be settings, icons, routes, and a few animations. Everything has the air of a railway timetable, right up until you type a message, press send, and discover that the train was never actually scheduled to depart, which was the Graft moment in miniature.

The task began in the reasonable way: read the design docs, compare them to the code, identify what was missing. The app had a substantial set of docs describing prompts, automations, input behavior, message streams, navigation, skills, tools, settings, and the rest of the Codex-like surface it wanted to replicate.

This is exactly the kind of work agents are good at in theory: give them a pile of docs, give them a codebase, and ask them to map aspiration to implementation.

The first output was useful, but the actual product truth arrived from the simplest possible interaction: entering text in the input and sending it did nothing. The chat path was not real enough.

That changed the assignment. It stopped being a coverage audit and became an end-to-end obligation.

The new test was explicit: use the browser. Send Hello!. See whether a response appears. Then send What's hello world in python. See whether it completes. Then test the UI elements, inspect errors, fix the 500s, wire /api/turns/start to the agent runtime, and stream events back to the client.

This is the part I want to remember because it is the difference between code generation and product development.

An agent can make code that satisfies a document. It can create routes, components, adapters, and types. It can make the codebase look more like the spec. But a user does not experience a spec. A user experiences the click.

The click is merciless because it has no respect for architecture diagrams, cleaner component names, plausible routes, or any other evidence that lives one layer away from the user. It asks one question: did the system respond?

For AI-built software, that question needs to move much earlier in the process. One should not wait until the end of the journey to discover whether the bridge has planks.

I have become increasingly suspicious of any session that spends too long in implementation without touching the running surface. Not because code inspection is bad. Code inspection is essential. But inspection and execution catch different species of bug.

Inspection finds missing routes, broken abstractions, duplicated logic, stale docs, unimplemented stubs. Execution finds whether the supposedly wired system survives contact with a browser, a real request, a real response, a real error, a real state transition.

The Graft session had both. It started with docs and code. It moved through implementation. But the quality bar only became honest when the browser became a participant.

That browser-driven loop also exposed a deeper problem: “UI complete” is often used far too generously. A composer that accepts typing but does not start a turn is not a composer. A chat stream that renders placeholder history but cannot show live agent output is not a chat stream. A route that returns 500 under the first real flow is not a route, at least not in the way users mean it.

The distinction sounds pedantic until you are using agents. Then it becomes survival.

Agents are very good at producing intermediate artifacts that look like progress. This is useful when those artifacts are scaffolding. It is dangerous when the scaffolding is mistaken for the building, especially when the scaffolding has tasteful typography.

The fix is not to trust agents less. The fix is to define “done” in a way that leaves less room for theater.

For a chat product, “done” means a user can type a message, send it, see the agent begin, watch useful intermediate state, and receive a final answer. It means errors are visible without being dumped as raw internals. It means refresh does not destroy the conversation. It means the app can perform the happy path twice, not once by accident.

That sounds like a lot because it is a lot, but it is also the product.

The agent can help build it, test it, and even write the browser tests that hold it accountable, but it should not be allowed to define completion by the presence of files; when it says “done”, ask it to click.

Hand-drawn notebook detail plate showing browser path, send action, and response evidence. — Browser path, send action, and response evidence.

Field note

My practical rule now is simple: for anything with a user interface, the first meaningful test should involve the interface early, not after the refactor, not after the last route is wired, and not after the agent has spent eight hours making the internal architecture more beautiful. Open the app, type the thing, press the button, watch the logs, refresh, and try again.

It is amazing how much false progress evaporates under that little ritual.