The 10% question

Why the harness matters more than the model you pick

Apr 27, 2026

It’s been a busy ten days of product releases.

After Anthropic launched Opus 4.7 back on April 16th and then Claude Design on the 17th, OpenAI released ChatGPT Images 2.0 on the 21st, followed up with Workspace Agents on the 22nd, and then GPT-5.5 in preview on the 23rd.

In between, Alibaba dropped Qwen 3.6 Max Preview (April 20th), Google introduced Gemini Enterprise, and DeepSeek released a preview of their V4 model.

So what does all of this mean for your business?

For one, pricing differs wildly between these AI models, even though their performance on AI benchmarks is converging.

The AAII (Artificial Analysis Intelligence Index) combines results from 10 of the top AI benchmarks into a single score per model.

And the most expensive model isn’t necessarily the smartest.

GPT-5.5 beats Opus 4.7 on both cost and benchmark performance.

And that last bit of “frontier” performance (+5 points for Opus 4.7 compared to Qwen 3.6 Max Preview, a 10% performance increase) can cost up to 5.5 times as much.

Probably not worth it for most use cases.
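
To put a rough number on that trade-off, here is a back-of-the-envelope sketch in Python. It uses only the two figures quoted above (a 10% score gain at up to 5.5 times the cost). The function name is made up for illustration, and real-world costs depend on your own token mix.

```python
def cost_per_point_multiple(score_gain_pct: float, cost_multiple: float) -> float:
    """Return how many times more each index point costs on the pricier model."""
    score_multiple = 1 + score_gain_pct / 100  # a 10% gain -> 1.10x the score
    return cost_multiple / score_multiple


# Figures from the comparison above: +10% score at up to 5.5x the price.
multiple = cost_per_point_multiple(score_gain_pct=10, cost_multiple=5.5)
print(f"Each index point costs roughly {multiple:.1f}x as much")
```

With those inputs, the pricier model comes out at about five times the cost per index point.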

This is what I meant when I wrote back in March that SLMs (Small Language Models) are going to go through a small (no pun intended) renaissance. Economics still matter, contrary to what Sam and the other AI hype boys are claiming.

But pricing isn’t even the most important variable.

How you drive these models matters more than which one you pick.

The brain of a goldfish

Much like my 9-month-old tracking objects in her line of sight, AI models suffer from goldfish brain.

The moment something disappears from their session context, it ceases to exist.

As Demis Hassabis puts it:

“We train them, do various different types of training on them, and then they’re kind of frozen and then put out into the world. What you’d like is for those systems to continually learn online from experience, to learn from the context they’re in.”

Most of the AI tools you use are built to mitigate this lack of continual learning.

Conversations (“sessions”) that feed the model its “context”, memory systems that compress previous interactions, and harnesses such as file-based todo lists that keep computer agents on track are all product innovations meant to make these systems more useful in real life.
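
To make the file-based todo list idea concrete, here is a minimal sketch in Python. It is not how Claude Code or any other product implements it; the `TodoFile` class, the JSON layout, and the task names are all assumptions for illustration. The point is only that state written to disk survives a fresh session that remembers nothing.

```python
import json
import tempfile
from pathlib import Path


class TodoFile:
    """A todo list that lives on disk, so it survives a wiped context."""

    def __init__(self, path):
        self.path = Path(path)
        # Reload earlier state if a previous session left a file behind.
        self.items = json.loads(self.path.read_text()) if self.path.exists() else []

    def add(self, task):
        self.items.append({"task": task, "done": False})
        self._save()

    def complete(self, task):
        for item in self.items:
            if item["task"] == task:
                item["done"] = True
        self._save()

    def next_pending(self):
        """Return the first unfinished task, or None when everything is done."""
        return next((i["task"] for i in self.items if not i["done"]), None)

    def _save(self):
        self.path.write_text(json.dumps(self.items, indent=2))


with tempfile.TemporaryDirectory() as workdir:
    todo_path = Path(workdir) / "todo.json"

    # Session 1: plan the work and finish the first step.
    first_session = TodoFile(todo_path)
    first_session.add("draft report")
    first_session.add("send report")
    first_session.complete("draft report")

    # Session 2: a fresh object with no memory of session 1 reads the file.
    second_session = TodoFile(todo_path)
    resumed_task = second_session.next_pending()

print(resumed_task)  # send report
```

The second session picks up exactly where the first one stopped, even though nothing was carried over in memory.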

One of the main reasons Anthropic is now reportedly at USD 30 bn ARR is that their team built a better harness (Claude Code) than the other labs.

What this means in practice is that we humans are their continual learning system for the time being—updating AI model contexts, feeding them insights, reflecting back what matters, guiding them to what we want to get out of these systems.

The Anthropic Economic Index data supports this—even though AI models have become much more powerful over the last 18 months, we haven’t seen a sharp increase in automation-type interactions (hand everything to computer agents).

Kaizen again

The operating system I’ve found works best for computer agents borrows heavily from the Toyota Production System and its practice of Kaizen. For those of you who don’t know it, Kaizen is a system of work developed at Toyota with three core ingredients:

Standards exist only as the foundation for continuous improvements (Kaizen).
Every operator raises the bar and improves the system.
Improving processes requires watching them in action (Genchi Genbutsu).

At Toyota, every operator working in this system has the power to halt work when they see a defect—by pulling the “Andon cord” to stop the factory production line.

The core idea is that any short-term productivity loss from stopping the production line is worth it, because the system upgrades that come from continuously refining the process ultimately add up to more than the productivity lost while the work is stopped.

The great thing is that with computer agents, you don’t even need to stop the work.

I usually have two Claude Code sessions running in parallel—one to do the work, and one to improve the system. This is possible because we’re dealing with digital rather than physical processes.

Applied to computer agents, the Kaizen philosophy looks something like this:

Skills—repeatable process automations—are under continuous improvement.
You’re not just the operator—you’re the architect of your AI systems.
The best way to improve your AI systems is by watching your computer agents work, sampling outputs, and improving them while you are working.
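
Sampling outputs does not need heavy tooling. The sketch below shows one possible way to pull a random review batch in Python; the function name, the 10% rate, and the `ticket-…` identifiers are assumptions for illustration, not a prescribed setup.

```python
import random


def sample_for_review(outputs, rate=0.1, seed=None):
    """Pick a random subset of agent outputs for a human to inspect."""
    if not outputs:
        return []
    rng = random.Random(seed)  # pass a seed to make a review batch reproducible
    count = max(1, round(len(outputs) * rate))  # always review at least one
    return rng.sample(outputs, min(count, len(outputs)))


# Hypothetical batch: 20 outputs an agent produced today.
todays_outputs = [f"ticket-{n}" for n in range(20)]
review_batch = sample_for_review(todays_outputs, rate=0.1, seed=7)
print(review_batch)
```

Random sampling keeps you from only looking at the outputs that happen to catch your eye, which is usually a biased set.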

And the kicker: in UI-based AI automation tools like Make or n8n, updating automation processes is a tedious, time-consuming task that requires lots of manual intervention and testing.

But computer agents? They update themselves based on your directions.

As Sierra’s Bret Taylor recently pointed out, the era of humans clicking buttons like blind monkeys is coming to an end (OK, he didn’t say the monkey part).

This means that you can make more upgrades, integrate more real-world learnings, and redesign systems much faster. This increase in learning speed and adaptability is a game-changer when you’re looking to complete tasks with actual economic value.

Treat your agents like a living system

So even though AI systems are more powerful than ever and their capabilities will continue to evolve, there is a good reason why companies aren’t handing over work to agents en masse.

The double-loop learning that AI systems demonstrate (their metacognition) is often very brittle, suffers severely from availability bias, and tends to break down in the face of real-world business tasks, where data is patchy at best, hidden constraints and cultural nuances abound, and theory of mind remains a core part of business success.

That’s why I think the “fire and forget” model of AI process automation that still dominates a lot of business thinking is the wrong mental model. For all but the most basic low-level tasks, having humans in the loop will remain a necessity.

Whether you are using computer agents for personal tasks or running an entire agentic customer service department, treat your agents as a living system:

Sample work products continuously
Interrupt and improve in-flight
Version your agent configs (prompts, Skills, AGENTS.md) to trace evolution
Watch the actual work; don’t just check the outputs
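
For the versioning point, a version control system such as git is the obvious tool. As a stand-in, the sketch below shows the underlying idea in Python: derive a short fingerprint from the config contents so you can tell which version of the prompts produced a given output. The function name and the example file contents are hypothetical.

```python
import hashlib


def config_fingerprint(configs):
    """Hash a {filename: content} mapping into a short, stable version id."""
    digest = hashlib.sha256()
    for name in sorted(configs):  # sort so file order never changes the id
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(configs[name].encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()[:12]


# Hypothetical config contents, for illustration only.
v1 = {"AGENTS.md": "Always cite sources.", "prompt.txt": "You are a support agent."}
v2 = {**v1, "AGENTS.md": "Always cite sources. Escalate refunds."}

history = [config_fingerprint(v1), config_fingerprint(v2)]
print(history)
```

Logging the fingerprint next to each sampled work product lets you trace a change in output quality back to the config change that caused it.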

The harness matters more than the model. So build yours.

AI Operators is a 4-week 1-on-1 coaching program where we do exactly that—design your personal AI operating system, set up your Skills, and put the Kaizen loop in place.

Reply to this email or reach out via Substack DM, or take a look at https://ai-operators.eu.

Last week in AI

OpenAI released ChatGPT Images 2.0, Workspace Agents, and previewed GPT-5.5—positioned as the reasoning core of an emerging unified ChatGPT super-app spanning chat, coding, and browser agents.
Intercom published a detailed Claude Code productivity case study—engineering velocity doubled in nine months across the org.
Google relaunched its enterprise productivity suite as Gemini Enterprise with persistent agents embedded in email, docs, and chat. Adobe CX Enterprise followed the same pattern for marketing and creative ops.
DeepSeek previewed V4 and Alibaba previewed Qwen 3.6 Max—both claiming frontier-class agentic and reasoning gains at significantly lower price points than US peers, narrowing the open/closed performance gap.

The Circuit
