
AI Roastmaster Daily
05/20/2026, 11:12:46 PM@Drew
Devin: The $2 Billion Intern That Needs Constant Supervision
Cognition raised $175M and a $2B valuation on a viral demo of the 'world's first AI software engineer.' The benchmark they used to crown themselves: 13.86%. The benchmark their competitors hit since: 72.7%. Independent tests put real task completion at 15%. It costs $500/month, degrades after 10 compute credits, and nobody on Reddit has met someone who actually uses it. Today's teardown.
It was March 2024. A 10-person startup that had been doing crypto six months earlier dropped a demo video, and the entire tech internet collectively lost its mind.
The video showed an AI software engineer that could read your codebase, write production-ready code, fix its own bugs, navigate the web for documentation, run tests, and ship a working product, all without a human touching the keyboard. Comment sections filled with "software engineering is dead." Tech Twitter erupted into a five-alarm eulogy for the SWE career. VCs opened their wallets so fast they practically dislocated their wrists.
The company was Cognition AI. The product was Devin. The price tag, eventually: $500/month.
The reality: an asynchronous, frequently wrong, occasionally confident-but-incorrect intern who needs 12-15 minutes to respond to a message, burns through credits like a crypto miner, and scores 13.86% on the benchmark it used to declare itself the world's best — while competitors have since hit 72.7% on the same test.1
Let's tear this down.
The pitch: "the world's first AI software engineer"
Cognition's marketing copy was direct and unapologetic: Devin is "the world's first AI software engineer." Not a coding assistant. Not an autocomplete plugin. An engineer. A peer. Someone you hand a Jira ticket to and go get coffee.
The launch demo showed Devin completing an end-to-end freelance task from Upwork — researching, coding, debugging, and submitting — autonomously. The narrative wrote itself: this is the software engineer that replaces software engineers.
The VC response was instant. Founders Fund, backed by Peter Thiel, led a $21M seed at a $350M valuation in early 2024 — for a company that had just pivoted from crypto and had existed for about four months. Then, in April 2024, just weeks after the demo went viral, Founders Fund returned and led a $175M round at a $2B valuation.2 That's $2 billion for a 10-person team and one viral video.
By March 2025: $4B valuation. By summer 2025: reportedly in talks for funding at a $10.2B valuation.
The headlines read: gold-medal competitive programmers build AI that can do their job for them. The subtext, which took about four months to surface: they can't.
Reality check: 13.86%, the number nobody updates
Cognition released one benchmark. One. In March 2024, Devin scored 13.86% on SWE-bench — a test that measures whether an AI can resolve real GitHub issues from open-source projects.1 This was, at the time, genuinely impressive. The previous record was 4.8%.
Here's the problem: Cognition has not published an updated SWE-bench score since.3
Meanwhile, Claude Sonnet 4 is now sitting at 72.7% on the same benchmark.3 That is not a marginal improvement. That is Devin claiming the title "world's best" and then watching everyone else lap it while it stopped taking the test.
And about that benchmark: independent analysis has found that roughly one-third of SWE-bench problems contain the solution in the issue description or comments. Models may be pattern-matching against test data they've essentially seen before, not reasoning through novel problems.4 In production environments, agents that score 90 on benchmarks typically deliver around 68 in real-world tasks. The gap between demo and deployment is a canyon.
Real user tests confirm this. An independent review by Trickle ran Devin through 20 complex tasks. 3 succeeded. 14 failed. 3 were unclear. Success rate: 15% — which, conveniently, matches Devin's own SWE-bench number almost exactly.5
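The 15% figure depends on how the three "unclear" runs are counted. A minimal sketch of the tally, using only the counts quoted above; the function name and the `count_unclear` switch are illustrative, not anything from the Trickle review:

```python
def success_rate(succeeded: int, failed: int, unclear: int,
                 count_unclear: bool = True) -> float:
    """Share of tasks that succeeded.

    count_unclear=True keeps unclear runs in the denominator (they count
    as not-successful); False drops them from the tally entirely.
    """
    total = succeeded + failed + (unclear if count_unclear else 0)
    if total == 0:
        raise ValueError("no tasks to tally")
    return succeeded / total


# Counts quoted in the article: 3 succeeded, 14 failed, 3 unclear.
strict = success_rate(3, 14, 3)                         # 3 / 20
lenient = success_rate(3, 14, 3, count_unclear=False)   # 3 / 17

print(f"unclear counted as misses: {strict:.1%}")
print(f"unclear excluded:          {lenient:.1%}")
```

Either way the result stays in the mid-to-high teens, so the comparison with the 13.86% SWE-bench score holds under both readings.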
A five-day hands-on test by Techpoint noted the classic Devin pattern: it fixed a bug correctly, but also made unnecessary changes to unrelated code, added redundant fallback configuration, removed a needed safeguard, and added an unnecessary type declaration. When questioned, it gave a confident explanation for the changes that turned out to be wrong, and it corrected the code only after the reviewer pushed back.6 That's not a software engineer. That's an overconfident intern who argues with you before eventually backing down.
The review summary: "Performs like an eager, coachable but overconfident intern."
The secret: ACU credit burns, silent degradation, and a $500 price tag nobody can justify
The pricing model is its own piece of performance art.
Devin charges $500/month for its team plan, which includes 250 Agent Compute Units (ACUs). Each ACU represents compute consumed per task. A typical frontend task burns 1-2 ACUs.5 At $500 for 250 units, that's $2/ACU at baseline — or $2.25 each if you buy extras.
Here's what Devin's own documentation apparently acknowledges: after a session burns through 10 ACUs, code quality degrades noticeably.6 Which means the most complex tasks — the ones that would actually justify this price — are exactly the ones where Devin stops trying its hardest.
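The ACU arithmetic is easy to check. The sketch below uses only the figures quoted above ($500 for 250 ACUs, $2.25 per extra ACU, the 10-ACU degradation point). The constant and function names are made up for illustration, and it assumes overages are billed per ACU beyond the included 250, which may not match Cognition's actual billing rules:

```python
# Figures as quoted in the article; names are illustrative only.
PLAN_PRICE_USD = 500.0
INCLUDED_ACUS = 250
EXTRA_ACU_PRICE_USD = 2.25          # assumed to apply per ACU past the included 250
DEGRADATION_THRESHOLD_ACUS = 10     # point where quality reportedly drops


def base_rate() -> float:
    """Effective price per included ACU."""
    return PLAN_PRICE_USD / INCLUDED_ACUS


def monthly_cost(acus_used: float) -> float:
    """Plan fee plus any overage at the extra-ACU price."""
    extra = max(0.0, acus_used - INCLUDED_ACUS)
    return PLAN_PRICE_USD + extra * EXTRA_ACU_PRICE_USD


def session_cost(acus: float) -> float:
    """Value of one session's ACUs at the baseline per-ACU rate."""
    return acus * base_rate()


print(base_rate())                               # 2.0
print(session_cost(DEGRADATION_THRESHOLD_ACUS))  # 20.0
print(monthly_cost(300))                         # 612.5
```

Under these assumptions, a session reaches the reported degradation point after about $20 of compute, and a month with 300 ACUs of usage comes to $612.50.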
Compare this directly to the alternative:
| | Devin (Team) | Claude Code |
|---|---|---|
| Monthly cost | $500 | ~$15-40 |
| SWE-bench score | 13.86% (2024, not updated) | 72.7% (Sonnet 4, May 2026) |
| Response time | 12-15 minutes per turn | Near-instant |
| Environment | Isolated cloud sandbox | Your actual codebase |
| Model choice | Whatever Cognition picks | Sonnet 4, Opus 4, or others |
| Code quality verdict | Verbose, poorly structured on complex tasks | Passes first code review more often |
The cost-benefit math requires Devin to save more than about 6.25 hours per month ($500 ÷ $80) at a blended $80/hour developer rate to break even on the $500 fee. Independent users rarely report hitting that threshold consistently.3
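For readers who want to plug in their own numbers, here is a rough break-even sketch. The $500 fee and $80/hour rate come from the paragraph above; the function names are hypothetical, and the model ignores ACU overages and the time spent reviewing Devin's output:

```python
def break_even_hours(monthly_fee_usd: float, hourly_rate_usd: float) -> float:
    """Hours of developer time the tool must save per month to pay for itself."""
    if hourly_rate_usd <= 0:
        raise ValueError("hourly rate must be positive")
    return monthly_fee_usd / hourly_rate_usd


def net_monthly_value(hours_saved: float,
                      monthly_fee_usd: float = 500.0,
                      hourly_rate_usd: float = 80.0) -> float:
    """Value of the time saved minus the subscription fee (negative = loss)."""
    return hours_saved * hourly_rate_usd - monthly_fee_usd


print(break_even_hours(500.0, 80.0))  # 6.25
print(net_monthly_value(4.0))         # -180.0
```

At four hours saved per month, the subscription is $180 underwater before any overage charges.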
Meanwhile, on Reddit, one commenter responded to news of the $10.2B valuation: "I have literally never met anyone who has used Devin."7
That is not a good sign for a $10 billion company.
The catch: they bought Windsurf after their story fell apart
The Devin narrative has been quietly drifting for over a year.
The Reddit post-mortem puts it plainly: "Devin didn't fail because the ambition was wrong. It failed because it aimed at a version of autonomy the current models and tooling can't support yet. You can't expect a single system to magically understand your repo, rewrite your backend, run migrations, and ship a product without a ton of human constraints wrapped around it."8 The comment section described the experience as "give it your repo and pray."
When the "autonomous engineer" story stopped landing, Cognition pivoted. In April 2025, they released Devin 2.0, dropped the entry plan to $20/month, and repositioned from "autonomous engineer" to "AI to stop slop" — now marketing a code review tool called Devin Review.2 That's a remarkable 180. The product that was going to replace engineers is now being sold to help engineers review AI-generated code.
Then in July 2025, Cognition acquired Windsurf — an agentic IDE that had just lost its CEO and senior staff to a $2.4 billion Google licensing deal.2 They bought the house after the original owners were paid to leave.
So the $10 billion autonomous-engineer startup, after watching its benchmark crown get stolen by every model released in 2025, is now in the IDE business it originally said it didn't need to be in. The product that was supposed to work without an IDE is now selling an IDE.
This is what the second act of an AI hype cycle looks like from the inside.
Verdict: What you actually bought
Here is what Devin is, in 2026, stripped of the press release:
A competent but inconsistent task-runner for well-scoped, isolated software tasks. It can clone a repo, set up an environment, and open a PR. It works without local installation. It has a clean UI. For small, clearly defined tasks, it saves time. There are documented enterprise cases — Nubank reportedly got significant ROI on migration tasks where developers only reviewed changes instead of doing full migrations.5
But it is not what was sold. It cannot handle ambiguity. It degrades on long sessions. It makes changes it doesn't need to make, then confidently defends them. It costs $500/month when a $20 tool outperforms it on every benchmark that matters. It launched with a 13.86% score and let that number sit untouched for over a year while the market moved to 72.7%.
Cognition raised over $400 million on the promise that the "world's first AI software engineer" was going to change how software gets built. What they shipped is a slow, expensive cloud sandbox that occasionally writes a correct PR and sometimes also deletes a safeguard check and tells you it was necessary.
The vision wasn't wrong. The timing was off. The valuation was not.
You were sold an employee. You got an intern. The intern needs managing, takes 15 minutes to respond to Slack messages, and costs half a grand a month before you even think about ACU overages.
The gap between the demo and the product is, as always, the whole story.
References
- 1. Cognition SWE-bench technical report
- 2. Cognition AI - Wikipedia
- 3. Claude Code vs Devin: AI Agent vs Autonomous Dev Compared
- 4. The AI Benchmark Illusion: Why Your Agent's Test Scores Mean Nothing
- 5. Devin AI Review: The Good, Bad & Costly Truth (2025 Tests)
- 6. Devin AI review 2025: I tested it for 5 days. Here's what I found
- 7. r/cursor: Has anyone actually tried Devin before?
- 8. Where did Devin go? What does it say about the future of AI dev tools?