AI Coding Just Jumped to the Next Level... Or Did It?
Plus: How To Clear Your Inbox With Superhuman Speed
This week's tech news was all about the Claude 3.7 release by Anthropic.
The new Anthropic model was heavily anticipated by engineers across the world, myself included.
Claude 3.5 Sonnet has been my coding assistant since August 2024.
In fact, I've probably had more conversations with Claude 3.5 Sonnet since then than with any single human.
So how did Claude 3.7 Sonnet stack up against its older brother?
For one, it massively outperforms the other state-of-the-art models on SWE-bench, one of the leading benchmarks for autonomous software engineering.
Given how close OpenAI o1, DeepSeek R1 and Claude 3.5 Sonnet had been to each other in terms of performance, this is a massive achievement.
Claude 3.7 Sonnet boasts:
Hybrid reasoning capabilities that switch between quick responses and deep, step-by-step thinking for complex problems. This is great because you, as a user, no longer need to think about which model to use for which task (see the API sketch after this list).
Massive improvements in autonomous coding, with Claude 3.7 achieving 70.3% accuracy on SWE-bench Verified with Anthropic's custom scaffolding, and 62.3% without (compared to 49.3% for OpenAI's offerings)
Agentic capabilities that let Claude edit files, write and run tests, and even commit changes to GitHub repositories
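If you want to see the hybrid reasoning for yourself, the Anthropic API exposes it as extended thinking: the same model answers quickly by default and reasons step by step when you give it a thinking budget. Here's a minimal sketch with the Python SDK; the model id and budget values are assumptions, so check Anthropic's docs:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# One model, two modes: omit `thinking` for quick responses, or enable it
# with a token budget to get deep, step-by-step reasoning.
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",  # assumed model id; verify in the docs
    max_tokens=4096,                     # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[{
        "role": "user",
        "content": "Why does this recursive function overflow the stack?",
    }],
)

# With thinking enabled, the response interleaves "thinking" and "text"
# blocks; here we print only the final answer.
for block in response.content:
    if block.type == "text":
        print(block.text)
```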
However, after working with Claude 3.7 Sonnet in Cursor for the last two days, I'm less than thrilled.
A 62.3% completion rate also means that the system fails more than a third of the time.
This wouldn’t be a problem except that the team behind Cursor, my AI coding development environment, decided to go all-in on autonomous coding.
In parallel with the Claude 3.7 Sonnet release, they decided to remove all non-autonomous coding features from their AI coding tool.
The rise of vibe coding
This means that Cursor, like Windsurf and Bolt.new, no longer lets you tactically leverage LLMs in your codebase within a context you set and control yourself.
Instead, Claude 3.7 Sonnet will run rampant in your codebase, going wildly beyond the scope of user requests, making changes and breaking things unrelated to the human input—at least in my experience.
So, my coding productivity saw a massive drop this week, paired with a significant increase in the number of curse words sent to Anthropic's servers via Anysphere (the company behind Cursor).
In the last two days, Claude 3.7 in Cursor has consistently:
Failed to grasp simple component and file dependencies. At the same time, the new Cursor UI made it ten times harder to guide the model.
Made changes to the codebase that were well out of scope of the user request, breaking working features in the process.
Misunderstood basic project architecture and design patterns.
Failed to perform simple tasks while generating unnecessarily complex and convoluted solutions.
I wouldn’t mind, except that these things worked in Cursor last week with Claude 3.5.
As Sully Omar noted on X, the corrections can quickly outweigh the gains: the time I've spent correcting AI-generated changes this week has negated any time saved by having the AI generate code.
To summarize the experience, it felt a lot like working with a presumptuous little prick.
I get the impression that with this release, the team at Cursor was catering more to the vibe coding community (and their investors) than to enterprise users.
Vibe coding is where you just "see stuff, say stuff, run stuff, and copy-paste stuff," and somehow, "it mostly works", especially if you enable YOLO mode in Cursor.
Similar to Windsurf or Bolt.new, Cursor is still great for:
Generating boilerplate code at unprecedented speed
Building rapid prototypes from scratch
Creating standardised components like forms and API endpoints
But not so much for working on codebases where errors can have serious consequences and not everything can be rapidly tested by a single user.
I’m seriously considering moving back to VSCode as a result of this latest release.
An AI agent too soon
There is a lesson to be learnt here.
Even though our AI systems are becoming increasingly powerful, it is important not to underestimate the importance of the human in the loop.
As I found out this week, my workflow was as much a part of the power of the system as the LLM at the other end of the API.
If you're building agentic AI systems, this is important.
Don’t automate too much. Leverage the humans.
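If you're wondering what "leverage the humans" looks like in code, the simplest version is an approval gate: the model proposes a change, a person reviews the diff, and nothing touches the codebase without an explicit yes. A minimal sketch; the Edit type and both helpers are made-up stand-ins for your own agent plumbing:

```python
from dataclasses import dataclass

@dataclass
class Edit:
    diff: str  # the proposed change, e.g. a unified diff from the model

def propose_edit(task: str) -> Edit:
    # Stand-in for a real LLM call that returns a proposed diff.
    return Edit(diff=f"--- a/app.py\n+++ b/app.py\n# ...changes for: {task}")

def apply_edit(edit: Edit) -> None:
    # Stand-in for actually writing the change to the working tree.
    print("Applied:\n" + edit.diff)

def run_agent_step(task: str) -> None:
    """One agent iteration with a human approval gate."""
    edit = propose_edit(task)
    print(f"Proposed change for: {task}\n{edit.diff}")
    # The gate: no edit is applied unless a human explicitly approves it.
    if input("Apply this change? [y/N] ").strip().lower() == "y":
        apply_edit(edit)
    else:
        print("Skipped; the human stays in the loop.")

run_agent_step("rename the login handler")
```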
Tame Your Inbox: How Superhuman AI is Saving Users 4 Hours Every Week
As we saw in the story above, it’s still rare to find AI tools that are actually helpful.
Superhuman looks to be one of the exceptions.
Given that the average professional spends 28% of their workweek—11 hours—managing their inbox, it’s safe to say, newsletter readers, that email is perhaps the most underrated scourge of modern man.
The AI features announced by Superhuman last week look to change that:
The 4-hour promise
Superhuman's core claim is bold but specific: users save an average of 4 hours per week through AI-powered email management.
Unlike basic email filters, Superhuman's AI suite looks to automate every aspect of your email workflow:
Auto Label & Archive: custom email labels generated from prompts you write (see the sketch after this list).
AI-powered writing assistance: emails generated from your own writing style.
Inbox segmentation: in combination with auto labels, you can now easily and automatically split your inbox across customisable work streams and projects.
Workflow automations like auto-forwarding and auto-replies, with AI done right.
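To make the prompt-to-label idea concrete: under the hood, a feature like this boils down to classifying each incoming email against the rule you wrote. Superhuman's actual implementation isn't public, so here is a minimal sketch of the pattern with the Anthropic Python SDK; the rule, label set, and model id are all made up:

```python
import anthropic

client = anthropic.Anthropic()

LABELS = ["Newsletters", "Invoices", "Team", "Other"]       # made-up label set
RULE = "Label anything that asks for payment as Invoices."  # a user-written prompt

def label_email(subject: str, body: str) -> str:
    """Classify one email into a label, following the user's rule."""
    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",  # assumed model id
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": (
                f"Rule: {RULE}\nLabels: {', '.join(LABELS)}\n"
                f"Subject: {subject}\nBody: {body}\n"
                "Reply with exactly one label from the list."
            ),
        }],
    )
    answer = response.content[0].text.strip()
    return answer if answer in LABELS else "Other"

print(label_email("Your February invoice", "Amount due: EUR 30"))
```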
The ROI extends well beyond personal productivity.
For a team of 10 employees, Superhuman's 4-hour weekly savings translates to 40 reclaimed hours, essentially a full-time position without the additional headcount.
At average professional rates ($25-50/hour), that's $1,000-2,000 a week, or roughly $4,000-8,000 a month.
More importantly, Superhuman users report qualitative benefits:
Faster response times to clients
Reduced decision fatigue
More energy for creative and strategic work
Significantly lower email anxiety
Which is also the main reason I wanted to call out this AI tool in The Circuit: there's no affiliation, it's all about pure human upside for the teams adopting these kinds of tools (also check out their main competitor, Missive).
The only reason I'm not on board (yet) is that my email volume isn't high enough to warrant the EUR/USD 30/month subscription price.
Go check out Superhuman if you’re suffering from that most common of all modern diseases: inbox fatigue.
This week in AI (Coding)
Claude 3.7 Sonnet Launches as Hybrid Reasoning Powerhouse: Anthropic's latest release, Claude 3.7 Sonnet, marks a breakthrough in AI reasoning with its hybrid approach, seamlessly switching between quick responses and extended, step-by-step problem-solving. The model excels in coding tasks, achieving 70.3% accuracy on SWE-bench Verified (with custom scaffolding), far outperforming competitors like OpenAI's models and DeepSeek-R1. Developers can now leverage Claude Code, a CLI tool that automates tasks like file editing, testing, and GitHub commits, streamlining workflows.
In related news, vibe coding has now officially cross-pollinated into the indie hacker community: Pieter Levels vibe-coded a flight sim game with Cursor in two days and had it monetised two days later.
Inception Labs announced a new family of diffusion large language models (dLLMs) that significantly accelerate text generation. Their models achieve speeds of over 1,000 tokens per second on NVIDIA H100s, outperforming autoregressive models by up to 10x. The "coarse-to-fine" generation process of diffusion models enables better reasoning, error correction, and more coherently structured responses. Their latest dLLM coding model matches or exceeds models like GPT-4o Mini and Claude 3.5 Haiku on coding benchmarks.
OpenAI launched GPT-4.5 in research preview for ChatGPT Pro subscribers and developers. GPT-4.5 is designed to enhance natural interactions, improve user intent comprehension, and demonstrate greater emotional intelligence compared to previous models. It excels at interpreting subtle cues and is well-suited for tasks like writing, designing, and programming, as it focuses on language intuition and pattern recognition rather than step-by-step reasoning.