A few weeks ago I wrote about a product owner who came to me with a question I hadn’t heard before. He wanted to know whether his engineers were writing acceptance criteria that reflected product thinking or purely technical thinking. Outcome-driven inputs versus spec-and-contract-driven inputs. He could feel something was off. He just couldn’t point to where.
That question has stayed with me. Not because it was unusual, but because of what it revealed about where the real leverage point has moved in an agentic delivery system.
The monitoring and facilitation work that used to occupy a significant portion of coaching hours is automating. I made that argument in my previous post, and I’ll be going deeper on what that means for the coaching role in a future article in this series.
In an agentic model, input quality isn’t a hygiene issue. It’s an architectural one. Acceptance criteria happen to be the most familiar place most organizations encounter this shift. But the broader principle reaches further: every autonomous system is constrained by the quality of its inputs. The leverage point in agentic delivery isn’t execution. It never was.
Refinement is the cheapest intervention point
Have you noticed the bottleneck moving toward review since your teams started using AI-assisted development? More PRs backing up, more back-and-forth in comments, more time spent at the end of the cycle than you’d expect? Go look at what those PR comments are actually saying. If you’re seeing feedback that should have surfaced in refinement, you don’t have a code quality problem. You have an input quality problem.
The distinction matters more than most engineering managers want to reckon with right now. A conversation in refinement costs almost nothing by comparison. It’s people talking before anything is built. Catching the same issue in review means the execution cycle already ran: the AI wrote the code, a human reviewed it, found the problem, sent it back, and now the cycle runs again. You’ve paid for it twice. And that’s before you account for the decoupling problem. Once code is written with a flawed assumption baked in, everything built on top of that assumption carries the flaw forward. Decoupling it is expensive in a human-executed sprint. In an agentic pipeline running at speed, it compounds across dependent stories before anyone surfaces the pattern.
There’s a version of this argument that says refinement is going away as AI matures. I think the reverse is true. If we want agentic delivery to be cost effective, refinement becomes more important, not less. The human touchpoint before execution is the cheapest intervention point in the entire system. Shrinking it doesn’t save money. It moves the cost downstream where it’s harder and more expensive to address.
Three types of acceptance criteria, and why the third one barely exists yet
The PO’s question opened something up for me about how acceptance criteria actually function, or fail to, in most delivery organizations. I’ve started thinking about them in three distinct types. Most teams are operating with one. Some are operating with two. Almost nobody is doing the third. And in an agentic model, the third is the one that matters most.
Outcome-driven, product-centric criteria describe what changes for the user. They’re written from the user’s perspective: what the user can now do, what friction is removed, what value is realized. This is what most organizations mean when they say they’re writing good stories. These criteria tell the agentic stack what it’s building toward. They’re necessary. They’re not sufficient.
Technical spec and contract-driven criteria describe what the system must do. Behavioral requirements, edge cases, integration contracts, error states. Also necessary. Also not sufficient. When this is the only type of criteria being written, you’re optimizing for technical compliance rather than value delivery. The system does exactly what you asked. Whether what you asked was worth asking is a separate question that nobody encoded.
Product outcome and validation criteria are the type that’s almost entirely absent from most delivery organizations right now, and the type that becomes structurally necessary in an agentic model. This is where you encode the market behavior hypothesis before the code is written. Not what the feature does. Not what the user experiences. What you expect to observe in production if the feature worked. “The redesigned onboarding wizard should reduce drop-off at step three by at least 15% within 30 days of release.” “Variant B of this pricing page should outperform variant A on conversion within a two-week test window.” A testable commitment about real-world signal that makes the feature falsifiable before it ships.
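One way to picture a criterion like the onboarding example is as a small structured record that travels with the story. The sketch below is only an illustration in Python, not a prescribed format: the class name ValidationCriterion, its fields, and the 40% baseline drop-off figure are all hypothetical, chosen to show what a falsifiable commitment might look like once it is written down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationCriterion:
    """A market behavior hypothesis attached to a story (hypothetical shape)."""

    metric: str               # what is observed in production
    baseline: float           # value measured before release
    min_relative_drop: float  # 0.15 means "at least 15% lower than baseline"
    window_days: int          # how long after release the signal is read

    def target(self) -> float:
        """Highest post-release value that still counts as success."""
        return self.baseline * (1 - self.min_relative_drop)

    def is_met(self, observed: float) -> bool:
        """True if the value observed at the end of the window hits the target."""
        return observed <= self.target()


# The onboarding example from the text. The baseline is an assumed figure
# for illustration only; the source does not give one.
onboarding = ValidationCriterion(
    metric="drop-off rate at onboarding step three",
    baseline=0.40,
    min_relative_drop=0.15,
    window_days=30,
)

print(onboarding.is_met(observed=0.33))  # True: 0.33 is below the 0.34 target
print(onboarding.is_met(observed=0.36))  # False: the drop was smaller than 15%
```

The point of the structure is that the success condition exists before the code does. Anything that reads production metrics later has something concrete to compare against.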
This isn’t a new idea at the concept level. Hypothesis-driven development has been talked about for years. What’s new is the operational necessity. And that necessity comes directly from what the agentic model is trying to do.
Why the third type isn’t optional anymore
In The Living Product white paper, I argue that agentic AI doesn’t improve the software development lifecycle. It eliminates the justification for it. What replaces it is a product that observes its own behavior continuously, generates its own demand signals, and routes those signals back into the development motion without a human translation layer. A living product needs signal to learn from. And that signal has to be encoded somewhere before execution begins.
If your acceptance criteria never include a market behavior hypothesis, the agentic stack has no basis for evaluating whether what it built was worth building. It executes. It delivers. The feature ships. And then what? Without a defined success condition encoded at input time, nobody has an automated basis for the question that follows: does this feature earn its place in the codebase?
The white paper cites research showing that eighty percent of features are rarely or never used, and that building the wrong thing is the primary category of software development waste. Not slow delivery, not poor testing. The wrong thing. The more interesting question is why organizations keep building it.
In most organizations it comes down to one of two patterns. The first is catering to the loudest customer voice without weighing that against the ongoing cost of what gets built. A high-value account asks for a feature, the team builds it, and nobody does the math on what it costs to maintain, extend, and support that feature for the next three years against the revenue it actually protects. The second is chasing the next potential revenue stream without accounting for the hidden engineering cost of getting there. The roadmap fills up with bets that look compelling on a slide and quietly accumulate debt that the team carries long after the original rationale has been forgotten.
Both patterns share the same root. Nobody encoded a success condition before the work began. There was no testable commitment to evaluate the feature against. So the build-vs-keep conversation, when it finally happens, is a political negotiation rather than a data-grounded determination. And in most organizations, inertia wins that negotiation. Features persist long past their usefulness because there’s no agreed basis for retiring them.
The third type of acceptance criteria is the mechanism that closes that gap at the team level. It’s what makes the automated build-vs-keep determination possible. If the feature hit its adoption signal, it earns continued investment. If it didn’t, that’s a data-grounded conversation about whether the technical debt it’s accumulating is worth carrying. Not a political negotiation. Not a gut call. A determination the system can increasingly make with appropriate human oversight, because someone encoded the success condition before the work began.
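The determination described here can be sketched as a small decision function. This is an illustration under stated assumptions, not a real implementation: the names Verdict and build_vs_keep are hypothetical, the function assumes a metric where higher is better (such as an adoption rate), and the "too early" outcome is an added assumption about how a test window might be handled. A miss routes to a human review instead of retiring anything automatically, to mirror the human-oversight caveat.

```python
from enum import Enum


class Verdict(Enum):
    KEEP_INVESTING = "keep investing"
    REVIEW_FOR_RETIREMENT = "review for retirement"
    TOO_EARLY = "too early to tell"


def build_vs_keep(observed: float, target: float,
                  days_since_release: int, window_days: int) -> Verdict:
    """Compare a production signal with the success condition encoded at input time.

    Assumes a higher observed value is better (e.g. an adoption rate).
    """
    if days_since_release < window_days:
        # Assumption: no verdict is issued until the agreed window has elapsed.
        return Verdict.TOO_EARLY
    if observed >= target:
        return Verdict.KEEP_INVESTING
    # A miss deletes nothing. It opens a data-grounded conversation with humans.
    return Verdict.REVIEW_FOR_RETIREMENT


# Hypothetical numbers: 12% adoption against a 20% target, 45 days after release.
verdict = build_vs_keep(observed=0.12, target=0.20,
                        days_since_release=45, window_days=30)
print(verdict.value)  # review for retirement
```

None of this works without the target. The function has nothing to compare against unless someone committed to a number before the work began.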
Without that, you have an agentic execution stack that’s fast, capable, and flying blind about whether anything it’s building is worth building.
What this means for coaches and product people right now
I introduced the concept of the flow enabler in the previous post: the practitioner identity that emerges when the monitoring and facilitation layer automates and what’s left is the work of designing the systems that observe, encoding accumulated knowledge into detection logic, and handling the irreducibly human coordination work. That role has a very specific responsibility at the refinement stage.
Facilitation may be exactly the tool needed here, but oriented toward a specific outcome: the quality of what gets written before execution begins. The flow enabler, working alongside the product person, is the person in the room who knows what a good input looks like and can steer the conversation there. Who can distinguish between an outcome-driven story and a technically-compliant story that delivers no particular value. That’s a sophisticated facilitation act, not a departure from it. And increasingly, it means asking the question that most teams aren’t asking yet: what does success look like in production, and are we willing to commit to that before we build?
That last question is the coaching act that this moment in the technology curve is asking for. Not facilitation for its own sake. Not impediment routing. Encoding a market hypothesis into a testable criterion, at the story level, before execution begins.
The PO who came to me with that question about acceptance criteria was, without knowing it, asking about all three of these types at once. He could feel the gap. He just didn’t have the language for it yet.
Giving practitioners that language, and the framework to act on it, is one of the objectives of this series.
This is the first article in The Agentic Reality Series. The series builds on “Two Months Ago I Wrote About Coaching in an AI World. Here’s What I Underestimated.” and the white paper The Living Product, covering the human, team, and role-level questions that neither piece addresses directly. Next: Five is the Floor, on team composition, the resilience threshold, and the north star capability every agentic delivery team should be building toward.


