Golden as always Jonas - thanks for sharing
Thanks for sharing 👍
This is the thing I keep bumping into too: nearly all the effort goes into making the agent smarter, and almost none into how a person and the agent muddle through real work together.
The jagged frontier that keeps shifting as we muddle along :)
The evolution from the chatboxes that once blew our minds and were the coolest thing online is real. Great post!
This is really helpful... we structured... The shift from chatboxes to true collaborative surfaces is where the real friction is right now.
Augmentation beats automation every time judgment is involved.
For any high-value task, really: research, innovation, creative expression.
AI models are compression engines: they are great at synthesizing the world that was, but should not be trusted when it comes to decisions about the world that will be.
The tax-prep agent automated 7,000 returns at 97%. But the 3% it got wrong, roughly 210 returns, is where the real design problem lives. Did it know those returns were wrong? If it didn't, no amount of transparency or control in the interface would have caught them.
The first question for any agentic product is not how to design the handoff. It is how to detect when a handoff is needed. The agent will not tell you it is about to make a mistake. You have to build that detection separately.
True. This is one of the common threads running through the post, and it touches on all key design decisions: trust, boundaries, control, and learning.
Agents, like humans, are not infallible. They are non-deterministic systems, and that should always be top of mind when designing agentic products.
I thought it was clear from the post, but perhaps too implicitly 😆
Yeah I think it did come through. I suppose my point is that detection feels even more central than handoff design.
The handoff only works if the system can recognise when it is entering a risky state. Without that, the user is not supervising the agent so much as reviewing an output after the important failure signal has already been missed.
To me that feels like the layer where agentic products either become trustworthy or quietly brittle.
Claude Opus 4.8 has allegedly been a huge step up in terms of introspection capabilities
Haven't tried it yet, but for me this falls mostly in the domain of the model builders, not the product builders
Your harness and design should assume the model will make a mistake roughly 1 out of 3 times (according to the latest data, across all tasks from easy to really hard). But detecting those mistakes should be done by the models themselves, since they are the most powerful AI systems.
Probably should have linked this older post of mine on guardrails: https://metacircuits.substack.com/p/rogue-agents-and-what-to-do-about
I agree that stronger models should absolutely be used inside the detection loop. A verifier model, critic pass, uncertainty signal, or second-agent review can all be useful.
The distinction I’m trying to make is slightly different though. I wouldn’t want the product to rely on the same model that may have made the mistake to be the sole authority on whether a mistake has happened.
To me, “detection” is a product and harness responsibility, even if models are one of the tools inside it. The product still has to decide what counts as a risky state, which actions need deterministic checks, when to escalate, what evidence must be shown, and where the blast radius needs to be limited.
So I agree with you that the most capable AI systems should help detect errors. I just think that strengthens the case for an explicit detection layer rather than replacing it. The model can be a sensor in the system, but the harness decides when that signal is enough to pause, verify, or hand off.
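To make the "model as sensor, harness as decider" idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `ProposedAction` fields, the `decide` function, and the threshold values are hypothetical and do not come from the post or from any real framework.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    VERIFY = "verify"
    HAND_OFF = "hand_off"


@dataclass
class ProposedAction:
    name: str
    reversible: bool         # can the action be undone cheaply?
    amount: float            # hypothetical stand-in for blast radius
    model_confidence: float  # self-reported by the model, 0.0 to 1.0
    checks_passed: bool      # outcome of deterministic checks (schema, totals, ...)


def decide(action: ProposedAction,
           max_auto_amount: float = 1_000.0,
           min_confidence: float = 0.8) -> Decision:
    """Harness-side policy: the model's signal is one input, not the authority."""
    # Deterministic checks outrank anything the model says about itself.
    if not action.checks_passed:
        return Decision.HAND_OFF
    # Irreversible or large actions exceed the allowed blast radius.
    if not action.reversible or action.amount > max_auto_amount:
        return Decision.HAND_OFF
    # Low self-reported confidence triggers a second look, not a silent pass.
    if action.model_confidence < min_confidence:
        return Decision.VERIFY
    return Decision.PROCEED
```

Note that a confident model cannot talk its way past a failed check or a blast-radius limit here. The ordering of the rules is the design choice, and the thresholds are placeholders a product team would have to set for its own domain.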
The verification agent will have the same issues the action agent has, and as a non-deterministic entity it will still make mistakes.
As argued in the post, decision classification, interruptibility, transparency, and clear communication should ensure humans can close the verification gap effectively while maintaining full control.
I agree a verification agent has failure modes too. I’m not arguing for “agent A checks agent B” as a complete solution.
My point is more that the verification gap cannot be closed by human control alone unless the product has already done some upstream work to identify where control is needed.
Decision classification is itself part of the detection problem. So are interruptibility and transparency. The product still has to decide which states are risky, which checks are deterministic, when a model-based verifier is enough, when it needs another signal, and when the human should be brought in.
So I think we agree that the verifier cannot be treated as infallible. I’d just frame the solution less as “humans close the gap” and more as “the harness narrows and surfaces the gap so the human has something concrete to close.”
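As a rough illustration of "something concrete to close", the sketch below runs deterministic checks on a record and, when any fail, builds an escalation for the human that names the failed checks and carries the evidence. The check names, the record fields (`wages`, `interest`, `total_income`, `tax_due`), and the `Escalation` type are all hypothetical, chosen only to echo the tax-prep example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Record = dict[str, float]


@dataclass
class Escalation:
    task_id: str
    reason: str
    failed_checks: list[str] = field(default_factory=list)
    evidence: dict[str, str] = field(default_factory=dict)

    def summary(self) -> str:
        checks = ", ".join(self.failed_checks) or "none"
        return f"[{self.task_id}] {self.reason} (failed checks: {checks})"


# Hypothetical deterministic checks; a real product would define its own.
CHECKS: dict[str, Callable[[Record], bool]] = {
    "income_total_matches": lambda r: abs(
        r["wages"] + r["interest"] - r["total_income"]) < 0.01,
    "tax_not_negative": lambda r: r["tax_due"] >= 0,
}


def review(task_id: str, record: Record) -> Optional[Escalation]:
    """Return None when all checks pass, else an escalation for a human."""
    failed = [name for name, check in CHECKS.items() if not check(record)]
    if not failed:
        return None
    return Escalation(
        task_id=task_id,
        reason="deterministic checks failed",
        failed_checks=failed,
        evidence={key: str(value) for key, value in record.items()},
    )
```

The human still makes the call, but they are handed a named failure and the values behind it rather than a finished output to re-audit from scratch.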
Spot on Jonas.