hackernews client

Claude Opus 4.6

2291 pointsposted 2 days ago

1050 Comments

ck_one

2 days ago

Just tested the new Opus 4.6 (1M context) on a fun needle-in-a-haystack challenge: finding every spell in all Harry Potter books.

All 7 books come to ~1.75M tokens, so they don't quite fit yet. (At this rate of progress, mid-April should do it ) For now you can fit the first 4 books (~733K tokens).

Results: Opus 4.6 found 49 out of 50 officially documented spells across those 4 books. The only miss was "Slugulus Eructo" (a vomiting spell).

Freaking impressive!

grey-area

a day ago

Surely the corpus Opus 4.6 ingested would include whatever reference you used to check the spells were there. I mean, there are probably dozens of pages on the internet like this:

https://www.wizardemporium.com/blog/complete-list-of-harry-p...

Why is this impressive?

Do you think it's actually ingesting the books and only using those as a reference? Is that how LLMs work at all? It seems more likely it's predicting these spell names from all the other references it has found on the internet, including lists of spells.

sigmoid10

a day ago

Most people still don't realize that general public world knowledge is not really a test for a model that was trained on general public world knowledge. I wouldn't be surprised if even proprietary content like the books themselves found their way into the training data, despite what publishers and authors may think of that. As a matter of fact, with all the special deals these companies make with publishers, it is getting harder and harder for normal users to come up with validation data that only they have seen. At least for human written text, this kind of data is more or less reserved for specialist industries and higher academia by now. If you're a janitor with a high school diploma, there may be barely any textual information or fact you have ever consumed that such a model hasn't seen during training already.

rendx

a day ago

> I wouldn't be surprised if even proprietary content like the books themselves found their way into the training data

No need for surprises! It is publicly known that the corpus of 'shadow libraries' such as Library Genesis and Anna's Archive were specifically and manually requested by at least NVIDIA for their training data [1], used by Google in their training [2], downloaded by Meta employees [3] etc.

[1] https://news.ycombinator.com/item?id=46572846

[2] https://www.theguardian.com/technology/2023/apr/20/fresh-con...

[3] https://www.theverge.com/2023/7/9/23788741/sarah-silverman-o...

paodealho

a day ago

also:

"Researchers Extract Nearly Entire Harry Potter Book From Commercial LLMs"

https://www.aitechsuite.com/ai-news/ai-shock-researchers-ext...

sigmoid10

a day ago

The big AI houses are all in involved in varying degrees of litigation (all the way to class action lawsuits) with the big publishing houses. I think they at least have some level of filtering for their training data to keep them legally somewhat compliant. But considering how much copyrighted stuff is spread blisfully online, it is probably not enough to filter out the actual ebooks of certain publishers.

rendx

13 hours ago

> I think they at least have some level of filtering for their training data to keep them legally somewhat compliant.

So far, courts are siding with the "fair use" argument. No need to exclude any data.

https://natlawreview.com/article/anthropic-and-meta-fair-use...

"Even if LLM training is fair use, AI companies face potential liability for unauthorized copying and distribution. The extent of that liability and any damages remain unresolved."

https://www.whitecase.com/insight-alert/two-california-distr...

joenot443

a day ago

> even proprietary content like the books themselves

This definitely raises an interesting question. It seems like a good chunk of popular literature (especially from the 2000s) exists online in big HTML files. Immediately to mind was House of Leaves, Infinite Jest, Harry Potter, basically any Stephen King book - they've all been posted at some point.

Do LLMS have a good way of inferring where knowledge from the context begins and knowledge from the training data ends?

rendx

a day ago

> It seems like a good chunk of popular literature (especially from the 2000s) exists online in big HTML files

Anna's Archive alone claims to currently publicly host 61,654,285 books, more than 1PB in total.

user

a day ago

[deleted]

beepbooptheory

20 hours ago

> If you're a janitor with a high school diploma, there may be barely any textual information or fact you have ever consumed that such a model hasn't seen during training already.

The plot of Good Will Hunting would like a word.

yunohn

21 hours ago

Maybe y’all missed this?

https://www.washingtonpost.com/technology/2026/01/27/anthrop...

Anthropic, specifically, ingested libraries of books by scanning and then disposing of them.

vercaemert

a day ago

It's impressive, even if the books and the posts you're talking about were both key parts of the training data.

There are many academic domains where the research portion of a PhD is essentially what the model just did. For example, PhD students in some of the humanities will spend years combing ancient sources for specific combinations of prepositions and objects, only to write a paper showing that the previous scholars were wrong (and that a particular preposition has examples of being used with people rather than places).

This sort of experiment shows that Opus would be good at that. I'm assuming it's trivial for the OP to extend their experiment to determine how many times "wingardium leviosa" was used on an object rather than a person.

(It's worth noting that other models are decent at this, and you would need to find a way to benchmark between them.)

adastra22

a day ago

a day ago

Any obscure work of fiction or fanfiction would likely be fine as a casual test.

If you ask a model to discuss an obscure work it'll have no clue what it's about.

This is very different than asking about Harry Potter.

lxgr

a day ago

Yeah, that's what I've been doing as well, and at least Gemini 3 Pro did not fare very well.

Szpadel

a day ago

but isn't it what we wanted? we complained so much that LLM uses deprecated or outdated apis instead of current version because they relied so much on what they remembered

bestham

a day ago

Touché, that is what we humans are doing to some degree as well.

xiomrze

2 days ago

Honest question, how do you know if it's pulling from context vs from memory?

If I use Opus 4.6 with Extended Thinking (Web Search disabled, no books attached), it answers with 130 spells.

ozim

2 days ago

the books act like a hypnotic trigger, and may not represent a generalized skill. Hence why replacing with random words would help clarify. if you still get the origional spells, regurgitation confirmed, if it finds the spells, it could be doing what we think. An even better test would be to replace all spell references AND jumble chapters around. This way it cant even "know" where to "look" for the spell names from training.

user

a day ago

[deleted]

angst

a day ago

btw it recalls 42 when i asked. (without web search)

full transcript: pastebin.com/sMcVkuwd

f33d5173

a day ago

Not sure how they're being counted, but that adds up to 46 with the pair spells counted separately. But then nox is counted twice, so maybe 45.

user

a day ago

[deleted]

heavyset_go

a day ago

jamesfinlayson

a day ago

Instructed not to was my understanding.

eek2121

2 days ago

Honestly? My advice would be to cook something custom up! You don't need to do all the text yourself. Maybe have AI spew out a bunch of text, or take obscure existing text and insert hidden phrases here or there.

Shoot, I'd even go so far as to write a script that takes in a bunch of text, reorganizes sentences, and outputs them in a random order with the secrets. Kind of like a "Where's Waldo?", but for text

Just a few casual thoughts.

I'm actually thinking about coming up with some interesting coding exercises that I can run across all models. I know we already have benchmarks, however some of the recent work I've done has really shown huge weak points in every model I've run them on.

clhodapp

2 days ago

Having AI spew it might suffer from the fact that the spew itself is influenced by AI's weights. I think your best bet would be to use a new human-authored work that was released after the model's context cutoff.

a day ago

... I'm not sure if you're trolling or if you missed the point again. The point is to test the contextual ability and correctness of the LLMs ability's to perform actions that would be hopefully guaranteed to not be in the training data.

Why are people excited? What am I missing?

zamadatix

2 days ago

To be fair, I don't think "Slugulus Eructo" (the name) is actually in the books. This is what's in my copy:

> The smug look on Malfoy’s face flickered.

> “No one asked your opinion, you filthy little Mudblood,” he spat.

> Harry knew at once that Malfoy had said something really bad because there was an instant uproar at his words. Flint had to dive in front of Malfoy to stop Fred and George jumping on him, Alicia shrieked, “How dare you!”, and Ron plunged his hand into his robes, pulled out his wand, yelling, “You’ll pay for that one, Malfoy!” and pointed it furiously under Flint’s arm at Malfoy’s face.

> A loud bang echoed around the stadium and a jet of green light shot out of the wrong end of Ron’s wand, hitting him in the stomach and sending him reeling backward onto the grass.

> “Ron! Ron! Are you all right?” squealed Hermione.

> Ron opened his mouth to speak, but no words came out. Instead he gave an almighty belch and several slugs dribbled out of his mouth onto his lap.

sobjornstad

2 days ago

I have a vague recollection that it might come up named as such in Half-Blood Prince, written in Snape's old potions textbook?

In support of that hypothesis, the Fandom site lists it as “mentioned” in Half-Blood Prince, but it says nothing else and I'm traveling and don't have a copy to check, so not sure.

zamadatix

a day ago

Hmm, I don't get a hit for "slugulus" or "eructo" (case insensitive) in any of the 7. Interestingly two mentions of "vomit" are in book 6, but neither in reference to to slugs (plenty of Slughorn of course!). Book 5 was the only other one a related hit came up:

> Ron nodded but did not speak. Harry was reminded forcibly of the time that Ron had accidentally put a slug-vomiting charm on himself. He looked just as pale and sweaty as he had done then, not to mention as reluctant to open his mouth.

There could be something with regional variants but I'm doubtful as the Fandom site uses LEGO Harry Potter: Years 1-4 as the citation of the spell instead of a book.

Maybe the real LLM is the universe and we're figuring this out for someone on Slacker News a level up!

user

a day ago

[deleted]

LanceJones

2 days ago

Assuming this experiment involved isolating the LLM from its training set?

grey-area

a day ago

Of course it didn't. Not sure you really can do that - LLMs are a collection of weights from the training set, take away the training set and they don't really exist. You'd have to train one from scratch excluding these books and all excerpts and articles about them somehow, which would be very expensive and I'm pretty sure the OP didn't do that.

So the test seems like a nonsensical test to me.

guluarte

2 days ago

you can get the same result just asking opus/gpt, it is probably internalized knowledge from reddit or similar sites.

ck_one

2 days ago

If you just ask it you don't get the same result. Around 13 spells were missing when I just prompted Opus 4.6 without the books as context.

guluarte

a day ago

because it is non-deterministic, i just did and got 50 spells.

prompt: finding every spell in all Harry Potter books in the first 4 books, only the officially documented spells across those 4 books, there are 50, without using websearch

Here's my best compilation from memory, organized by book. I'll be upfront: getting to exactly 50 is tricky without cross-referencing, so some might be debatable or I might miss a couple. Here goes:

---

*Book 1 - Philosopher's/Sorcerer's Stone*

1. *Lumos* - produces light 2. *Nox* - extinguishes Lumos 3. *Alohomora* - unlocks doors 4. *Wingardium Leviosa* - levitation 5. *Petrificus Totalus* - full body-bind 6. *Locomotor Mortis* - leg-locker curse 7. *Incendio* - produces fire 8. *Reparo* - mends broken objects

*Book 2 - Chamber of Secrets*

9. *Expelliarmus* - disarms opponent 10. *Rictusempra* - tickling charm 11. *Tarantallegra* - forces target to dance 12. *Serpensortia* - conjures a snake 13. *Finite Incantatem* - ends spell effects 14. *Obliviate* - erases memories 15. *Aparecium* - reveals invisible ink 16. *Immobulus* - freezes target in place 17. *Homorphus* - mentioned by Lockhart (werewolf cure claim) 18. *Peskipiksi Pesternomi* - Lockhart's failed pixie spell

*Book 3 - Prisoner of Azkaban*

19. *Expecto Patronum* - produces a Patronus 20. *Riddikulus* - repels a Boggart 21. *Waddiwasi* - shoots objects at target 22. *Impervius* - waterproofing charm 23. *Dissendium* - opens secret passage 24. *Ferula* - conjures bandages/splint 25. *Mobilicorpus* - moves a body 26. *Lumos Maxima* - intense light

*Book 4 - Goblet of Fire*

27. *Accio* - summoning charm 28. *Avada Kedavra* - killing curse 29. *Crucio* - Cruciatus curse (torture) 30. *Imperio* - Imperius curse (control) 31. *Stupefy* - stunning spell 32. *Engorgio* - enlarges target 33. *Reducio* - shrinks target 34. *Sonorus* - amplifies voice 35. *Quietus* - reverses Sonorus 36. *Morsmordre* - conjures the Dark Mark 37. *Priori Incantatem* - reveals last spell cast 38. *Deletrius* - erases magical residue 39. *Densaugeo* - enlarges teeth 40. *Furnunculus* - causes boils 41. *Impedimenta* - slows/stops target 42. *Reducto* - blasts solid objects 43. *Diffindo* - severing charm 44. *Relashio* - releases sparks/grip 45. *Orchideous* - conjures flowers 46. *Avis* - conjures birds 47. *Point Me* - Four-Point Spell (compass) 48. *Ennervate* - revives stunned person 49. *Protego* - shield charm 50. *Conjunctivitis Curse* - affects eyesight (Krum on the dragon)

---

A few caveats: some of these (like Lumos Maxima, Homorphus, Peskipiksi Pesternomi) are borderline since they're either mentioned rather than properly cast, or might be film additions that bleed into memory. The Conjunctivitis Curse is described but its incantation isn't explicitly given in the text. And Protego might technically first appear with its incantation in Book 5 during DA practice rather than Book 4.

If you want, I can turn this into a spreadsheet or document with columns for spell name, effect, who casts it, and which chapter.

siwatanejo

a day ago

> All 7 books come to ~1.75M tokens

How do you know? Each word is one token?

koakuma-chan

a day ago

You can download the books and run them through a tokenizer. I did that half a year ago and got ~2M.

kylehotchkiss

15 hours ago

Btw - a well crafted regex could have achieved the same (pointless) result with ~0.0000005% of resources the LLM machine used.

gizmodo59

2 days ago

5.3 codex https://openai.com/index/introducing-gpt-5-3-codex/ crushes with a 77.3% in Terminal Bench. The shortest lived lead in less than 35 minutes. What a time to be alive!

wasmainiac

2 days ago

Dumb question. Can these benchmarks be trusted when the model performance tends to vary depending on the hours and load on OpenAI’s servers? How do I know I’m not getting a severe penalty for chatting at the wrong time. Or even, are the models best after launch then slowly eroded away at to more economical settings after the hype wears off?

tedsanders

2 days ago

We don't vary our model quality with time of day or load (beyond negligible non-determinism). It's the same weights all day long with no quantization or other gimmicks. They can get slower under heavy load, though.

(I'm from OpenAI.)

wasmainiac

2 days ago

Thanks for the response, I appreciate it. I do notice variation in quality throughout the day. I use it primarily for searching documentation since it’s faster than google in most case, often it is on point, but also it seems off at times, inaccurate or shallow maybe. In some cases I just end the session.

nl

2 days ago

Usually I find this kind of variation is due to context management.

Accuracy can decreases at large context sizes. OpenAI's compaction handles this better than anyone else, but it's still an issue.

If you are seeing this kind of thing start a new chat and re-run the same query. You'll usually see an improvement.

wasmainiac

a day ago

2 days ago

Yeah, happy to be more specific. No intention of making any technically true but misleading statements.

The following are true:

- In our API, we don't change model weights or model behavior over time (e.g., by time of day, or weeks/months after release)

- Tiny caveats include: there is a bit of non-determinism in batched non-associative math that can vary by batch / hardware, bugs or API downtime can obviously change behavior, heavy load can slow down speeds, and this of course doesn't apply to the 'unpinned' models that are clearly supposed to change over time (e.g., xxx-latest). But we don't do any quantization or routing gimmicks that would change model weights.

- In ChatGPT and Codex CLI, model behavior can change over time (e.g., we might change a tool, update a system prompt, tweak default thinking time, run an A/B test, or ship other updates); we try to be transparent with our changelogs (listed below) but to be honest not every small change gets logged here. But even here we're not doing any gimmicks to cut quality by time of day or intentionally dumb down models after launch. Model behavior can change though, as can the product / prompt / harness.

ChatGPT release notes: https://help.openai.com/en/articles/6825453-chatgpt-release-...

Codex changelog: https://developers.openai.com/codex/changelog/

Codex CLI commit history: https://github.com/openai/codex/commits/main/

Trufa

2 days ago

I ask then unironically then, am I imagining that models are great when they start and degrade over time?

I've had this perceived experience so many times, and while of course it's almost impossible to be objective about this, it just seem so in your face.

I don't discard being novelty plus getting used to it, plus psychological factors, do you have any takes on this?

jason_oster

a day ago

You might be susceptible to the honeymoon effect. If you have ever felt a dopamine rush when learning a new programming language or framework, this might be a good indication.

Once the honeymoon wears off, the tool is the same, but you get less satisfaction from it.

Just a guess! Not trying to psychoanalyze anyone.

wasmainiac

a day ago

I don’t think so. I notice the same thing, but I just use it like google most of the time, a service that used to be good. I’m not getting a dopamine rush off this, it’s just part of my day.

jychang

2 days ago

What about the juice variable?

https://www.reddit.com/r/OpenAI/comments/1qv77lq/chatgpt_low...

tedsanders

2 days ago

Yep, we recently sped up default thinking times in ChatGPT, as now documented in the release notes: https://help.openai.com/en/articles/6825453-chatgpt-release-...

The intention was purely making the product experience better, based on common feedback from people (including myself) that wait times were too long. Cost was not a goal here.

If you still want the higher reliability of longer thinking times, that option is not gone. You can manually select Extended (or Heavy, if you're a Pro user). It's the same as at launch (though we did inadvertently drop it last month and restored it yesterday after Tibor and others pointed it out).

tgrowazay

2 days ago

Isn’t that just how many steps at most a reasoning model should do?

qingcharles

19 hours ago

Thank you for saying this publically.

I feel like you need to be making a bigger statement about this. If you go onto various parts of the Net (Reddit, the bird site etc) half the posts about AI are seemingly conspiracy theories that AI companies are watering down their products after release week.

2 days ago

a day ago

Reasoning effort is denominated in tokens, not time, so no difference beyond slowness at heavy load

(I work at OpenAI)

a456463

20 hours ago

sure. we believe you

Corence

2 days ago

It is a fair question. I'd expect the numbers are all real. Competitors are going to rerun the benchmark with these models to see how the model is responding and succeeding on the tasks and use that information to figure out how to improve their own models. If the benchmark numbers aren't real their competitors will call out that it's not reproducible.

However it's possible that consumers without a sufficiently tiered plan aren't getting optimal performance, or that the benchmark is overfit and the results won't generalize well to the real tasks you're trying to do.

mrandish

2 days ago

> I'd expect the numbers are all real.

I think a lot of people are concerned due to 1) significant variance in performance being reported by a large number of users, and 2) We have specific examples of OpenAI and other labs benchmaxxing in the recent past (https://grok.com/share/c2hhcmQtMw_66c34055-740f-43a3-a63c-4b...).

It's tricky because there are so many subtle ways in which "the numbers are all real" could be technically true in some sense, yet still not reflect what a customer will experience (eg harnesses, etc). And any of those ways can benefit the cost structures of companies currently subsidizing models well below their actual costs with limited investor capital. All with billions of dollars in potential personal wealth at stake for company employees and dozens of hidden cost/performance levers at their disposal.

And it doesn't even require overt deception on anyone's part. For example, the teams doing benchmark testing of unreleased new models aren't the same people as the ops teams managing global deployment/load balancing at scale day-to-day. If there aren't significant ongoing resources devoted to specifically validating those two things remain in sync - they'll almost certainly drift apart. And it won't be anyone's job to even know it's happening until a meaningful number of important customers complain or sales start to fall. Of course, if an unplanned deviation causes costs to rise over budget, it's a high-priority bug to be addressed. But if the deviation goes the other way and costs are little lower than expected, no one's getting a late night incident alert. This isn't even a dig at OpenAI in particular, it's just the default state of how large orgs work.

ifwinterco

2 days ago

On benchmarks GPT 5.2 was roughly equivalent to Opus 4.5 but most people who've used both for SWE stuff would say that Opus 4.5 is/was noticeably better

CraigJPerry

2 days ago

There's an extended thinking mode for GPT 5.2 i forget the name of it right at this minute. It's super slow - a 3 minute opus 4.5 prompt is circa 12 minutes to complete in 5.2 on that super extended thinking mode but it is not a close race in terms of results - GPT 5.2 wins by a handy margin in that mode. It's just too slow to be useable interactively though.

ifwinterco

2 days ago

Interesting, sounds like I definitely need to give the GPT models another proper go based on this discussion

elAhmo

2 days ago

I mostly used Sonnet/Opus 4.x in the past months, but 5.2 Codex seemed to be on par or better for my use case in the past month. I tried a few models here and there but always went back to Claude, but with 5.2 Codex for the first time I felt it was very competitive, if not better.

Curious to see how things will be with 5.3 and 4.6

georgeven

2 days ago

Interesting. Everyone in my circle said the opposite.

MadnessASAP

2 days ago

My experience is that Codex follows directions better but Claude writes better code.

ChatGPT-5.2-Codex follows directions to ensure a task [bead](https://github.com/steveyegge/beads) is opened before starting a task and to keep it updated almost to a fault. Claude-Opus-4.5 with the exact same directions, forgets about it within a round or two. Similarly, I had a project that required very specific behaviour from a couple functions, it was documented in a few places including comments at the top and bottom of the function. Codex was very careful in ensuring the function worked as was documented. Claude decided it was easier to do the exact opposite, rewrote the function, the comments, and the documentation to saynit now did the opposite of what was previously there.

If I believed a LLM could be spiteful, I would've believed it on that second one. I certainly felt some after I realised what it had done. The comment literally said:

  // Invariant regardless of the value of X, this function cannot return Y

And it turned it into:

  // Returns Y if X is true

planckscnst

2 days ago

That's so strange. I found GPT to be abysmal at following instructions to the point of unusability for any direction-heavy role. I have a common workflow that involves an orchestrator that pretty much does nothing but follow some simple directions [1]. GPT flat-out cannot do this most basic task.

[1]: https://github.com/Vibecodelicious/llm-conductor/blob/main/O...

MadnessASAP

a day ago

Strange behaviour and LLMs are the iconic duo of the decade. They've definitley multiplied my productivity, since now instead of putting off writing boring code or getting stuck on details till I get frustrated and give up I just give it to an agent to figure out.

I don't thing my ability read, understand, and write code is going anywhere though.

Neat tool BTW, I'm in the market for something like that.

planckscnst

a day ago

I've found this orchestrator+reviewer+judge setup to yield much better results than anything else I've tried. And it's such a simple setup - a few markdown files.

I'm also creating one that is similar, but purpose-built for making the plans that this setup can orchestrate. It still needs some tweaking to get agents to follow it better - it still takes additional prompting to nudge it down the proper path. But I've had similar benefits - sending plans through this adversarial review loop has yielded significant improvements in final output.

https://github.com/Vibecodelicious/llm-conductor/blob/main/p...

MadnessASAP

a day ago

Unrelated but this just happened and I thought of you ;-)

  › commit this, **SKIP BEADS**


  • Ran git status -sb
    └ ## main...origin/main [ahead 4]
       M containers/frigate.nix
      ?? .beads/bd.sock.startlock

  • I see an untracked .beads/bd.sock.startlock file that might be normal but needs clarification. I also note the requirement to include a bead and confirm bead readiness
    before proceeding, so I'll ask whether to create the bead and if the untracked file should be included in the commit.

  ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

  • I can’t skip beads for this repo. The AGENTS.md rules are explicit: no work (including commits) without an open bead. Please confirm you want me to create a bead for
    this commit.

I don't know what's wrong with your Codex, but mine can't bring itself to break the rules.

krzyk

2 days ago

It probably depends on programming language and expectations.

ifwinterco

2 days ago

2 days ago

When do you think we should run this benchmark? Friday, 1pm? Monday 8AM? Wednesday 11AM?

I definitely suspect all these models are being degraded during heavy loads.

j_maffe

2 days ago

This hypothesis is tested regularly by plenty of live benchmarks. The services usually don't decay in performance.

thinkingtoilet

2 days ago

We know Open AI got caught getting benchmark data and tuning their models to it already. So the answer is a hard no. I imagine over time it gives a general view of the landscape and improvements, but take it with a large grain of salt.

tedsanders

a day ago

Are you referring to FrontierMath?

We had access to the eval data (since we funded it), but we didn't train on the data or otherwise cheat. We didn't even look at the eval results until after the model had been trained and selected.

thinkingtoilet

20 hours ago

No one believes you.

tedsanders

13 hours ago

If you don't believe me, that's fair enough. Some pieces of evidence that might update you or others:

- a member of the team who worked with this eval has left OpenAI and now works at a competitor; if we cheated, he would have every incentive to whistleblow

- cheating on evals is fairly easy to catch and risks destroying employee morale, customer trust, and investor appetite; even if you're evil, the cost-benefit doesn't really pencil out to cheat on a niche math eval

- Epoch made a private held-out set (albeit with a different difficulty); OpenAI performance on that set doesn't suggest any cheating/overfitting

- Gemini and Claude have since achieved similar scores, suggesting that scoring ~40% is not evidence of cheating with the private set

- The vast majority of evals are open-source (e.g., SWE-bench Pro Public), and OpenAI along with everyone else has access to their problems and the opportunity to cheat, so FrontierMath isn't even unique in that respect

rvz

2 days ago

The same thing was done with Meta researchers with Llama 4 and what can go wrong when 'independent' researchers begin to game AI benchmarks. [0]

You always have to question these benchmarks, especially when the in-house researchers can potentially game them if they wanted to.

2 days ago

claude swe-bench is 80.8 and codex is 56.8

Seems like 4.6 is still all-around better?

gizmodo59

2 days ago

Its SWE bench pro not swe bench verified. The verified benchmark has stagnated

joshuahedlund

2 days ago

Any ideas why verified has stagnated? It was increasing rapidly and then basically stopped.

Snuggly73

2 days ago

it has been pretty much a benchmark for memorization for a while. there is a paper on the subject somewhere.

swe bench pro public is newer, but its not live, so it will get slowly memorized as well. the private dataset is more interesting, as are the results there:

https://scale.com/leaderboard/swe_bench_pro_private

Rudybega

2 days ago

You're comparing two different benchmarks. Pro vs Verified.

pjot

2 days ago

Claude Code release notes:

  > Version 2.1.32:
     • Claude Opus 4.6 is now available!
     • Added research preview agent teams feature for multi-agent collaboration (token-intensive feature, requires setting
     CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1)
     • Claude now automatically records and recalls memories as it works
     • Added "Summarize from here" to the message selector, allowing partial conversation summarization.
     • Skills defined in .claude/skills/ within additional directories (--add-dir) are now loaded automatically.
     • Fixed @ file completion showing incorrect relative paths when running from a subdirectory
     • Updated --resume to re-use --agent value specified in previous conversation by default.
     • Fixed: Bash tool no longer throws "Bad substitution" errors when heredocs contain JavaScript template literals like ${index + 1}, which
     previously interrupted tool execution
     • Skill character budget now scales with context window (2% of context), so users with larger context windows can see more skill descriptions
     without truncation
     • Fixed Thai/Lao spacing vowels (สระ า, ำ) not rendering correctly in the input field
     • VSCode: Fixed slash commands incorrectly being executed when pressing Enter with preceding text in the input field
     • VSCode: Added spinner when loading past conversations list

neuronexmachina

2 days ago

> Claude now automatically records and recalls memories as it works

Neat: https://code.claude.com/docs/en/memory

I guess it's kind of like Google Antigravity's "Knowledge" artifacts?

a day ago

This is the core problem. The agent writes its own memory while working, so it has blind spots about what matters. I've had sessions where it carefully noted one thing but missed a bigger mistake in the same conversation — it can't see its own gaps.

A second pass over the transcript afterward catches what the agent missed. Doesn't need the agent to notice anything. Just reads the conversation cold.

The two approaches have completely different failure modes, which is why you need both. What nobody's built yet is the loop where the second pass feeds back into the memory for the next session.

om8

2 days ago

Is there a way to disable it? Sometimes I value agent not having knowledge that it needs to cut corners

nerdsniper

2 days ago

90-98% of the time I want the LLM to only have the knowledge I gave it in the prompt. I'm actually kind of scared that I'll wake up one day and the web interface for ChatGPT/Opus/Gemini will pull information from my prior chats.

pdntspa

2 days ago

I asked Claude to give me a TLDR (condensed from its system prompt):

----

Persistent directory at ~/.claude/projects/{project-path}/memory/, persists across conversations

MEMORY.md is always injected into the system prompt; truncated after 200 lines, so keep it concise

Separate topic files for detailed notes, linked from MEMORY.md What to record: problem constraints, strategies that worked/failed, lessons learned

Proactive: when I hit a common mistake, check memory first - if nothing there, write it down

Maintenance: update or remove memories that are wrong or outdated

Organization: by topic, not chronologically

Tools: use Write/Edit to update (so you always see the tool calls)

ra7

2 days ago

> Persistent directory at ~/.claude/projects/{project-path}/memory/, persists across conversations

I create a git worktree, start Claude Code in that tree, and delete after. I notice each worktree gets a memory directory in this location. So is memory fragmented and not combined for the "main" repo?

vardalab

a day ago

Yes, I noticed the same thing, and Claude told me that it's going to be deleted. I will have it improve the skill that is part of our worktree cleanup process to consolidate that memory into the main memory if there's anything useful.

user

a day ago

[deleted]

pdntspa

2 days ago

I thought it was already doing this?

I asked Claude UI to clear its memory a little while back and hoo boy CC got really stupid for a couple of days

legitster

2 days ago

I'm still not sure I understand Anthropic's general strategy right now.

They are doing these broad marketing programs trying to take on ChatGPT for "normies". And yet their bread and butter is still clearly coding.

Meanwhile, Claude's general use cases are... fine. For generic research topics, I find that ChatGPT and Gemini run circles around it: in the depth of research, the type of tasks it can handle, and the quality and presentation of the responses.

Anthropic is also doing all of these goofy things to try to establish the "humanity" of their chatbot - giving it rights and a constitution and all that. Yet it weirdly feels the most transactional out of all of them.

Don't get me wrong, I'm a paying Claude customer and love what it's good at. I just think there's a disconnect between what Claude is and what their marketing department thinks it is.

bobbylarrybobby

a day ago

I really like that Claude feels transactional. It answers my question quickly and concisely and then shuts up. I don't need the LLM I use to act like my best friend.

endymion-light

a day ago

I love doing a personal side project code review with claude code, because it doesn't beat around the bush for criticism.

I recently compared a class that I wrote for a side project that had quite horrible temporal coupling for a data processor class.

Gemini - ends up rating it a 7/10, some small bits of feedback etc

Claude - Brutal dismemberment of how awful the naming convention, structure, coupling etc, provides examples how this will mess me up in the future. Gives a few citations for python documentation I should re-read.

ChatGPT - you're a beautiful developer who can never do anything wrong, you're the best developer that's ever existed and this class is the most perfect class i've ever seen

majora2007

21 hours ago

This is exactly what got me to actually pay. I had a side project with an architecture I thought was good. Fed it into Claude and ChatGPT. ChatGPT made small suggestions but overall thought it was good. Claude shit all over it and after validating it's suggestions, I realized Claude was what I needed.

I haven't looked back. I just use Claude at home and ChatGPT at work (no Claude). ChatGPT at work is much worse than Claude in my experience.

Willish42

14 hours ago

I feel like this anecdote represents the differing incentives / philosophies of each group rather well.

I've noticed ChatGPT is rather high in its praise regardless of how valuable the input is, Gemini is less placating but still largely influenced by the perspective of the prompter, and Claude feels the most "honest" but humans are rather easy poor at judging this sort of thing.

Does anyone know if "sycophancy" has documented benchmarks the models are compared against? Maybe it's subjective and hard to measure, but given the issues with GPT 4o, this seems like a good thing to measure model to model to compare individual companies' changes as well as compare across companies.

andkenneth

a day ago

Weirdly I feel like partially because of this it feels more "human" and more like a real person I'm talking to. GPT models feel fake and forced, and will yap in a way that is like they're trying to get to be my friend, but offputting in a way that makes it not work. Meanwhile claude has always had better "emotional intelligence".

Claude also seems a lot better at picking up what's going on. If you're focused on tasks, then yeah, it's going to know you want quick answers rather than detailed essays. Could be part of it.

apples_oranges

a day ago

fyi in settings, you can configure chatGPT to do the same

matkoniecz

a day ago

where?

maxbond

a day ago

Settings > Personalization > Custom Instructions.

Here's what I use:

    WE ARE PROFESSIONALS. DO NOT FLATTER ME. BE BLUNT AND FORTHRIGHT.

Geopolitically speaking this is hilarious.

Squarex

2 days ago

The voice mode sounded like a Ukrainian trying to speak Czech. I don’t think it means anything.

deaux

a day ago

You mean Claude sucks at Czech. You're extrapolating here. I can name languages that Claude is better at than GPT.

Gemini is the most fluent in the highest number of human languages and has been for years (!) at this point - namely since Gemini 1.5 Pro, which was released Feb 2024. Two years ago.

Squarex

a day ago

Yeah, sure, I was overly generalising it from one experience.

kuboble

2 days ago

Claude code (opus) is very good in Polish.

I sometimes vibe code in polish and it's as good as with English for me. It speaks a natural, native level Polish.

I used opus to translate thousands of strings in my app into polish, Korean, and two Chinese dialects. Polish one is great, and the other are also good according to my customers.

Squarex

a day ago

> I sometimes vibe code in polish

This is interesting to me. I always switch to English automatically when using Claude Code as I have learned software engineering on an English speaking Internet. Plus the muscle memory of having to query google in English.

kuboble

a day ago

English is also default for me.

I mostly use Polish when I pair-vibe-code with my kids

koakuma-chan

a day ago

You could say its Polish is polished.

altern8

a day ago

Your game is amazing!

I wish there was a "Reset" button to go back to the original position.

Where are you in Poland?

kuboble

a day ago

Thanks :) Click "Level" -> "Try again"

Originally from Wrocław, but don't live in Poland anymore

altern8

a day ago

Ah, I'm originally from Italy and living in Wroclaw now, LOL.

BUT, I meant a button to restart after a few moves. Anyways, cool!

kuboble

a day ago

Yes, that's what I'm referring to https://kuboble.com/hn/level_try_again.mp4

jorl17

2 days ago

Claude is quite good at European Portuguese in my limited tests. Gemini 3 is also very good. ChatGPT is just OK and keeps code-switching all the time, it's very bizarre.

I used to think of Gemini as the lead in terms of Portuguese, but recently subjectively started enjoying Claude more (even before Opus 4.5).

In spite of this, ChatGPT is what I use for everyday conversational chat because it has loads of memories there, because of the top of the line voice AI, and, mostly, because I just brainstorm or do 1-off searches with it. I think effectively ChatGPT is my new Google and first scratchpad for ideas.

khendron

a day ago

Claude is helping me learn French right now. I am using it as a supplementary tutor for a class I am taking. I have caught it in a couple of mistakes, but generally it seems to be working pretty well.

eaf7e281

2 days ago

I kinda agree. Their model just doesn't feel "daily" enough. I would use it for any "agentic" tasks and for using tools, but definitely not for day to day questions.

lukebechtel

2 days ago

Why? I use it for all and love it.

That doesn't mean you have to, but I'm curious why you think it's behind in the personal assistant game.

legitster

2 days ago

I have three specific use cases where I try both but ChatGPT wins:

- Recipes and cooking: ChatGPT just has way more detailed and practical advice. It also thinks outside of the box much more, whereas Claude gets stuck in a rut and sticks very closely to your prompt. And ChatGPT's easier to understand/skim writing style really comes in useful.

a day ago

faxmeyourcode

a day ago

Everybody is different, I simply cannot stand the sight of chatgpt styled writing. Give me paragraphs.

redox99

a day ago

2 days ago

https://news.ycombinator.com/item?id=45455786

eli

2 days ago

How would you generate a picture of Noun + Noun in the first place in order to train the LLM with what it would look like? What's happening during that 1 estimated second?

fragmede

a day ago

Use any of the image generation models (eg Nanobanana, Midjourney, or ChatGPT) to generate a picture of a noun on a noun. Simonw's test is to have a Language (text) model generate a Scalar Vector Graphic, which the language model has to do by writing curves and colors, like draw a spline from point 150,100 to 200,300 of type cubic, using width 20, color orange.

In that hypothetical second is freaking fascinating. It's a denoising algorithm, and then a bunch of linear algebra, and out pops a picture of a pelican on a bicycle. Stable diffusion does this quite handily. https://stablediffusionweb.com/image/6520628-pelican-bicycle...

metalliqaz

2 days ago

its pelicans all the way down

Terretta

2 days ago

This is why everyone trains their LLM on another LLM. It's all about the pelicans.

AnimalMuppet

2 days ago

But you need to also include the number of prepositions. "A pelican on a bicycle" is not at all the same as "a pelican inside a bicycle".

There are estimated to be 100 or so prepositions in English. That gets you to 4 trillion combinations.

jodrellblank

21 hours ago

The prompt was "a pelican riding a bicycle"; not prepositions but every verb. Potentially every adverb+verb combination - "a pelican clumsily pushing a bicycle"

gcanyon

2 days ago

One aspect of this is that apparently most people can't draw a bicycle much better than this: they get the elements of the frame wrong, mess up the geometry, etc.

arionmiles

2 days ago

There's a research paper from the University of Liverpool, published in 2006 where researchers asked people to draw bicycles from memory and how people overestimate their understanding of basic things. It was a very fun and short read.

a day ago

If it was framed as "show us how you would break down this problem and think about it" then sure. If it's the gotcha quiz (much more common in my experience) then no.

But if that's what they were going for it should be something on a completely different and more abstract topic like "develop a method for emptying your swimming pool without electricity in under four hours"

kortilla

a day ago

It has nothing to do with “incurious”. Being asked to draw the architecture for something that is abstracted away from your actual job is a dickhead move because it’s just a test for “do you have the same interests as me?”

It’s no different than asking for the architecture of the power supply or the architecture of the network switch that serves the building. Brilliant software engineers are going to have gaps on non-software things.

selcuka

2 days ago

Poe's Law [1]:

> Without a clear indicator of the author's intent, any parodic or sarcastic expression of extreme views can be mistaken by some readers for a sincere expression of those views.

[1] https://en.wikipedia.org/wiki/Poe%27s_law

gedy

2 days ago

That's reasonable in many cases, but I've had situations like this for senior UI and frontend positions, and they: don't ask UI or frontend questions. And ask their pet low level questions. Some even snort that it's softball to ask UI questions or "they use whatever". It's like, yeah no wonder your UI is shit and now you are hiring to clean it up.

rsc

2 days ago

Raises hand.

gnatolf

2 days ago

Absolutely. A technically correct bike is very hard to draw in SVG without going overboard in details

falloutx

2 days ago

Its not. There are thousands of examples on the internet but good SVG sites do have monetary blocks.

https://www.freepik.com/free-photos-vectors/bicycle-svg

jefftk

2 days ago

Several of those have incorrect frames:

https://www.freepik.com/free-vector/cyclist_23714264.htm

https://www.freepik.com/premium-vector/bicycle-icon-black-li...

Or missing/broken pedals:

https://www.freepik.com/premium-vector/bicycle-silhouette-ic...

Now that would be a smart chat agent.

TheDong

a day ago

2 days ago

I've stuck with "Generate an SVG of a pelican riding a bicycle" because it's the same prompt I've been using for over a year now and I want results that are sort-of comparable to each other.

I think when I first tried this I iterated a few times to get to something that reliably output SVG, but honestly I didn't keep the notes I should ahve.

athrowaway3z

2 days ago

This benchmark inspired me to have codex/claude build a DnD battlemap tool with svg's.

They got surprisingly far, but i did need to iterate a few times to have it build tools that would check for things like; dont put walls on roads or water.

2 days ago

2 days ago

the field is advancing so fast it's hard to do real science as their will be a new SOTA by the time you're ready to publish results. i think this is a combination of that and people having a laugh.

Would you mind sharing which benchmarks you think are useful measures for multimodal reasoning?

techpression

2 days ago

A benchmark only tests what the benchmark is doing, the goal is to make that task correlate with actually valuable things. Graphic benchmarks is a good example, extremely hard to know what you will get in a game by looking at 3D Mark scores, it varies by a lot. Making a SVG of a single thing doesn’t help much unless that applies to all SVG tasks.

fullstackchris

2 days ago

[flagged]

dang

2 days ago

Personal attacks are not allowed on HN. No more of this, please.

blibble

2 days ago

> We build Claude with Claude. Our engineers write code with Claude Code every day

well that explains quite a bit

jsheard

2 days ago

CC has >6000 open issues, despite their bot auto-culling them after 60 days of inactivity. It was ~5800 when I looked just a few days ago so they seem to be accelerating towards some kind of bug singularity.

dkersten

2 days ago

Just anecdotally, each release seems to be buggier than the last.

To me, their claim that they are vibe coding Claude code isn’t the flex they think it is.

I find it harder and harder to trust anthropic for business related use and not just hobby tinkering. Between buggy releases, opaque and often seemingly glitches rate limits and usage limits, and the model quality inconsistency, it’s just not something I’d want to bet a business on.

zahlman

2 days ago

I think I would be much more frightened if it were working well.

ifwinterco

2 days ago

Exactly, thank goodness it's still a bit rubbish in some aspects

marcd35

a day ago

Doesn’t this just exacerbate the “black box” conundrum if they just keep piling on more and more features without fully comprehending what’s being implemented

csomar

a day ago

Since version 2.1.9, performance has degraded significantly after extended use. After 30-40 prompts with substantial responses, memory usage climbs above 25GB, making the tool nearly unusable. I'm updating again to see if it improves.

Unlike what another commenter suggested, this is a complex tool. I'm curious whether the codebase might eventually reach a point where it becomes unfixable; even with human assistance. That would be an interesting development. We'll see.

tgtweak

2 days ago

plot twist, it's all claude code instances submitting bug reports on behalf of end users.

trescenzi

2 days ago

I literally hit a claude code bug today, tried to use claude desktop to debug it which didn't help and it offered to open a bug report for me. So yes 100%. Some of the titles also make it pretty clear they are auto submitted. This is my favorite which was around the top when I was creating my bug report 3 hours ago and is now 3 pages back lol.

What I am saying is that just throwing out phrases that something is "simple" or "basic" needs proof, but at the time of writing I don't see examples.

[0] https://github.com/anomalyco/opencode/issues

[1] https://news.ycombinator.com/item?id=46877204

vouwfietsman

a day ago

> equivalent written in "two weeks"

This is indeed a nonsensical timeframe.

> What I am saying is that just throwing out phrases that something is "simple" or "basic" needs proof, but at the time of writing I don't see examples.

Fair point.

trymas

21 hours ago

> > equivalent written in "two weeks"

> This is indeed a nonsensical timeframe.

Sorry - I should have explained that it's an ironic hyperbole. Was thinking quotes will be enough, but Poe's law strikes again.

9dev

2 days ago

I’m pretty certain you haven’t used it yet(to its fullest extent) then. Claude Code is easily one of the most complex terminal UIs I have seen yet.

ozozozd

a day ago

It’s extremely simple.

19 hours ago

Just because Antropic made you think they are doing very complex thing with this tool, doesn't mean it is true. Claude Code is not even comparable to massive software which is probably an order of magnitudes more complex, such as IntelliJ stuff as an example.

2 days ago

2 days ago

Okay, and Windows is by far the most popular desktop operating system.

Discussions are pointless when the parties are talking past each other.

dmazzoni

a day ago

Yeah, but there are dozens of AI coding assistants to choose from, and the cost to switch is very low, unlike switching operating systems.

I've tried them all and I keep coming back to Claude Code because it's just so much more capable and useful than the others.

jama211

a day ago

Yes, and windows is pretty good for most people. Don’t be ridiculous.

pluralmonad

2 days ago

Popular meaning lots of people like it or that it is relatively widespread? Polio used to be popular in the latter way.

quietsegfault

2 days ago

I like windows, it’s fine. I like MacOS better. I like Linux. None of them are garbage or unusable.

blibble

2 days ago

krystofbe

2 days ago

I did some debugging on this today. The results are... sobering.

Memory comparison of AI coding CLIs (single session, idle):

  | Tool        | Footprint | Peak   | Language      |
  |-------------|-----------|--------|---------------|
  | Codex       | 15 MB     | 15 MB  | Rust          |
  | OpenCode    | 130 MB    | 130 MB | Go            |
  | Claude Code | 360 MB    | 746 MB | Node.js/React |

That's a 24x to 50x difference for tools that do the same thing: send text to an API.

vmmap shows Claude Code reserves 32.8 GB virtual memory just for the V8 heap, has 45% malloc fragmentation, and a peak footprint of 746 MB that never gets released, classic leak pattern.

On my 16 GB Mac, a "normal" workload (2 Claude sessions + browser + terminal) pushes me into 9.5 GB swap within hours. My laptop genuinely runs slower with Claude Code than when I'm running local LLMs.

I get that shipping fast matters, but building a CLI with React and a full Node.js runtime is an architectural choice with consequences. Codex proves this can be done in 15 MB. Every Claude Code session costs me 360+ MB, and with MCP servers spawning per session, it multiplies fast.

atonse

2 days ago

Jarred Sumner (bun creator, bun was recently acquired by Anthropic) has been working exclusively on bringing down memory leaks and improving performance in CC the last couple weeks. He's been tweeting his progress.

This is just regular tech debt that happens from building something to $1bn in revenue as fast as you possibly can, optimize later.

Seems to me the problem is more likely the skills of the engineers, not Claude’s capabilities.

int_19h

a day ago

It's a popular myth, but not really true anymore with the latest and greatest. I'm currently using both Claude and Codex to work on a Haskell codebase, and it works wonderfully. More so than JS actually, since the type system provides extensive guardrails (you can get types with TS, but it's not sound, and it's very easy to write code that violates type constraints at runtime without even deliberately trying to do so).

jama211

2 days ago

There’s nothing wrong with that, except it lets ai skeptics feel superior

RohMin

2 days ago

https://www.youtube.com/watch?v=LvW1HTSLPEk

I thought this was a solid take

jdthedisciple

2 days ago

interesting

everforward

2 days ago

There are absolutely things wrong with that, because React was designed to solve problems that don't exist in a TUI.

React fixes issues with the DOM being too slow to fully re-render the entire webpage every time a piece of state changes. That doesn't apply in a TUI, you can re-render TUIs faster than the monitor can refresh. There's no need to selectively re-render parts of the UI, you can just re-render the entire thing every time something changes without even stressing out the CPU.

It brings in a bunch of complexity that doesn't solve any real issues beyond the devs being more familiar with React than a TUI library.

jama211

a day ago

It is demonstrably absolutely fine. Sheesh.

everforward

21 hours ago

2 days ago

Same with opencode and gemini, it's disgusting

Codex (by openai ironically) seems to be the fastest/most-responsive, opens instantly and is written in rust but doesn't contain that many features

Claude opens in around 3-4 seconds

Opencode opens in 2 seconds

Gemini-cli is an abomination which opens in around 16 second for me right now, and in 8 seconds on a fresh install

Codex takes 50ms for reference...

If their models are so good, why are they not rewriting their own react in cli bs to c++ or rust for 100x performance improvement (not kidding, it really is that much)

g947o

2 days ago

Great question, and my guess:

If you build React in C++ and Rust, even if the framework is there, you'll likely need to write your components in C++/Rust. That is a difficult problem. There are actually libraries out there that allow you to build web UI with Rust, although they are for web (+ HTML/CSS) and not specifically CLI stuff.

So someone needs to create such a library that is properly maintained and such. And you'll likely develop slower in Rust compared to JS.

These companies don't see a point in doing that. So they just use whatever already exists.

shoeb00m

2 days ago

But I am sure that companies like Anthropic or OpenAI aren't going to build their application using these libraries, even with AI.

pdntspa

2 days ago

and why do they need react...

Philpax

2 days ago

That's actually relatively understandable. The React model (not necessarily React itself) of compositional reactive one-way data binding has become dominant in UI development over the last decade because it's easy to work with and does not require you to keep track of the state of a retained UI.

Most modern UI systems are inspired by React or a variant of its model.

cityofdelusion

21 hours ago

Is this accurate? I've been coding UIs since the early 2000s and one-way data binding has always been a thing, especially in the web world. Even in the heyday of jQuery, there were still good (but much less popular) libraries for doing it. The idea behind it isn't very revolutionary and has existed for a long time. React is a paradigm shift because of differential rendering of the DOM which enabled big performance gains for very interactive SPAs, not because of data binding necessarily.

jama211

a day ago

Well said.

azinman2

2 days ago

Why does it matter if Claude Code opens in 3-4 seconds if everything you do with it can take many seconds to minutes? Seems irrelevant to me.

RohMin

2 days ago

I guess with ~50 years of CPU advancements, 3-4 seconds for a TUI to open makes it seem like we lost the plot somewhere along the way.

strange_quark

2 days ago

Don’t forget they’ve also publicly stated (bragged?) about the monumental accomplishment of getting some text in a terminal to render at 60fps.

jama211

a day ago

So it doesn’t matter at all except to your sensibilities. Sounds to me that they simply are much better at prioritisation than your average HN user, who’d have taken forever to release it but at least the terminal interface would be snappy…

barnabee

a day ago

Some people[0] like their tools to be well engineered. This is not unique to software.

[0] Perhaps everyone who actually takes pride in their craft and doesn’t prioritise shitty hustle culture and making money over everything else.

azinman2

20 hours ago

Aside from startup time, as a tool Claude Code is tremendous. By far the most useful tool I’ve encountered yet. This seems to be very nit picky compared to the total value provided. I think y'all are missing the forrest for the trees.

barnabee

30 minutes ago

It’s almost all the model. There are many such tools and Claude Code doesn’t seem to be in any way unique. I prefer OpenCode, so far.

nsingh2

20 hours ago

Most of the value of Claude Code comes from the model, and that's not running on your device.

The Claude Code TUI itself is a front end, and should not be taking 3-4 seconds to load. That kind of loading time is around what VSCode takes on my machine, and VSCode is a full blown editor.

barnabee

29 minutes ago

It’s orders of magnitude slower than Helix, which is also a full blown editor.

When all your other tools are fast and well engineered, slow and bloated is very noticeable.

wahnfrieden

2 days ago

Because when the agent is taking many seconds to minutes, I am starting new agents instead of waiting or switching to non-agent tasks

mbesto

2 days ago

This is exactly the type of thing that AI code writers don't do well - understand the prioritization of feature development.

Some developers say 3-4 seconds are important to them, others don't. Who decides what the truth is? A human? ClawdBot?

sumedh

a day ago

> Some developers say 3-4 seconds are important to them, others don't.

Wasnt GTA 5 famous for very long start up time and turns out there some bug which some random developer/gamer found out and gave them a fix?

Most Gamers didnt care, they still played it.

jama211

a day ago

The humans in the company (correctly) realised that a few seconds to open basically the most powerful productivity agent ever made so they can focus on fast iteration of features is a totally acceptable trade off priority wise. Who would think differently???

mbesto

20 hours ago

This is my point...

shoeb00m

2 days ago

codex cli is missing a bunch of ux features like resizing on terminal size change.

2 days ago

React, the framework, is separate from react-dom, the browser rendering library. Most people think of those two as one thing because they're the most popular combo.

But there are many different rendering libraries you can use with React, including Ink, which is designed for building CLI TUIs..

skydhash

2 days ago

Anyone that knows a bit about terminals would already know that using React is not a good solution for TUI. Terminal rendering is done as a stream of characters which includes both the text and how it displays, which can also alter previously rendered texts. Diffing that is nonsense.

9dev

2 days ago

You’re not diffing that, though. The app keeps a virtual representation of the UI state in a tree structure that it diffs on, then serializes that into a formatted string to draw to the out put stream. It’s not about limiting the amount of characters redrawn (that would indeed be nonsense), but handling separate output regions effectively.

pkkim

2 days ago

They used Ink: https://github.com/vadimdemedes/ink

I've used it myself. It has some rough edges in terms of rendering performance but it's nice overall.

tayo42

2 days ago

Thats pretty interesting looking, thanks!

embedding-shape

2 days ago

Not a built-in React feature. The idea been around for quite some time, I came across it initially with https://github.com/vadimdemedes/ink back in 2022 sometime.

tayo42

2 days ago

i had claude make a snake clone and fix all the flickering in like 20 minutes with the library mentioned lol

CamperBob2

2 days ago

Also explains why Claude Code is a React app outputting to a Terminal. (Seriously.)

An alternative solution, rather than using the flicker-inducing clear-then-redraw in the first place, would have been just to do terminal screen image diffs and draw the changes (which is something I did back in the day for fun, sending full-screen ASCII digital clock diffs over a slow 9600baud serial link to a real terminal).

skydhash

2 days ago

Any diff would require to have a Before and an After. Whatever was done for the After can be done to directly render the changes. No need for the additional compute of a diff.

HarHarVeryFunny

a day ago

Sure, you could just draw the full new screen image (albeit a bit inefficient if only one character changed), and no need for the flicker-inducing clear before draw either.

I'm not sure what the history of log-output has been or why it does the clear-before-draw. Another simple alternative to pre-clear would have been just to clear to end of line (ESC[0K) after each partial line drawn.

bizzleDawg

2 days ago

Only in the same way that the pixels displayed in a browser are not a tree structure that you can diff - the diffing happens at a higher level of abstraction than what's rendered.

Diffing and only updating the parts of the TUI which have changed does make sense if you consider the alternative is to rewrite the entire screen every "frame". There are other ways to abstract this, e.g. a library like tqmd for python may well have a significantly more simple abstraction than a tree for storing what it's going to update next for the progress bar widget than claude, but it also provides a much more simple interface.

To me it seems more fair game to attack it for being written in JS than for using a particular "rendering" technique to minimise updates sent to the terminal.

skydhash

2 days ago

Most UI library store states in tree of components. And if you’re creating a custom widget, they will give you a 2D context for the drawing operations. Using react makes sense in those cases because what you’re diffing is state, then the UI library will render as usual, which will usually be done via compositing.

The terminal does not have a render phase (or an update state phase). You either refresh the whole screen (flickering) or control where to update manually (custom engine, may flicker locally). But any updates are sequential (moving the cursor and then sending what to be displayed), not at once like 2D pixel rendering does.

So most TUI only updates when there’s an event to do so or at a frequency much lower than 60fps. This is why top and htop have a setting for that. And why other TUI software propose a keybind to refresh and reset their rendering engines.

sweetheart

2 days ago

The "UI" is indeed represented in memory in tree-like structure for which positioning is calculated according to a flexbox-like layout algo. React then handles the diffing of this structure, and the terminal UI is updated according to only what has changed by manually overwriting sections of the buffer. The CLI library is called Ink and I forget the name of the flexbox layout algo implementation, but you can read about the internals if you look at the Ink repo.

CooCooCaCha

2 days ago

It’s really not that crazy.

a day ago

I'm using it on Framework Desktop 128GB RAM... Its scroll is utterly broken outside of VS Code for me, on Gnome Terminal, Tilix and a few others.

And even in VS Code, it wonks out after a while.

My guess is that it mostly works on Apple devices.

hkt

2 days ago

Blaming the terminal seems a little backwards. Perhaps the application could take responsibility for being compatible with common terminals?

nickstinemates

a day ago

I have no dog in the fight.

Someone1234

2 days ago

Does anyone with more insight into the AI/LLM industry happen to know if the cost to run them in normal user-workflows is falling? The reason I'm asking is because "agent teams" while a cool concept, it largely constrained by the economics of running multiple LLM agents (i.e. plans/API calls that make this practical at scale are expensive).

A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers, and I don't know if that has changed with more efficient hardware/software improvements/caching.

simonw

2 days ago

The cost per token served has been falling steadily over the past few years across basically all of the providers. OpenAI dropped the price they charged for o3 to 1/5th of what it was in June last year thanks to "engineers optimizing inferencing", and plenty of other providers have found cost savings too.

Turns out there was a lot of low-hanging fruit in terms of inference optimization that hadn't been plucked yet.

> A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers

Where did you hear that? It doesn't match my mental model of how this has played out.

cootsnuck

2 days ago

I have not see any reporting or evidence at all that Anthropic or OpenAI is able to make money on inference yet.

> Turns out there was a lot of low-hanging fruit in terms of inference optimization that hadn't been plucked yet.

That does not mean the frontier labs are pricing their APIs to cover their costs yet.

It can both be true that it has gotten cheaper for them to provide inference and that they still are subsidizing inference costs.

In fact, I'd argue that's way more likely given that has been precisely the goto strategy for highly-competitive startups for awhile now. Price low to pump adoption and dominate the market, worry about raising prices for financial sustainability later, burn through investor money until then.

What no one outside of these frontier labs knows right now is how big the gap is between current pricing and eventual pricing.

chis

2 days ago

It's quite clear that these companies do make money on each marginal token. They've said this directly and analysts agree [1]. It's less clear that the margins are high enough to pay off the up-front cost of training each model.

topaz0

a day ago

Unfortunately tech journalists' judgement of source credibility don't have a very good track record

mrgaro

a day ago

But there are companies which are only serving open weight models via APIs (ie. they are not doing any training), so they must be profitable? here's one list of providers from OpenRouter serving LLama 3.3 70B: https://openrouter.ai/meta-llama/llama-3.3-70b-instruct/prov...

9cb14c1ec0

2 days ago

It's also true that their inference costs are being heavily subsidized. For example, if you calculate Oracles debt into OpenAIs revenue, they would be incredibly far underwater on inference.

emp17344

2 days ago

Sue, but if they stop training new models, the current models will be useless in a few years as our knowledge base evolves. They need to continually train new models to have a useful product.

NitpickLawyer

2 days ago

> they still are subsidizing inference costs.

They are for sure subsidising costs on all you can prompt packages (20-100-200$ /mo). They do that for data gathering mostly, and at a smaller degree for user retention.

> evidence at all that Anthropic or OpenAI is able to make money on inference yet.

You can infer that from what 3rd party inference providers are charging. The largest open models atm are dsv3 (~650B params) and kimi2.5 (1.2T params). They are being served at 2-2.5-3$ /Mtok. That's sonnet / gpt-mini / gemini3-flash price range. You can make some educates guesses that they get some leeway for model size at the 10-15$/ Mtok prices for their top tier models. So if they are inside some sane model sizes, they are likely making money off of token based APIs.

int_19h

a day ago

> They are being served at 2-2.5-3$ /Mtok. That's sonnet / gpt-mini / gemini3-flash price range.

The interesting number is usually input tokens, not output, because there's much more of the former in any long-running session (like say coding agents) since all outputs become inputs for the next iteration, and you also have tool calls adding a lot of additional input tokens etc.

It doesn't change your conclusion much though. Kimi K2.5 has almost the same input token pricing as Gemini 3 Flash.

slopusila

2 days ago

most of those subscriptions go unused. I barely use 10% of mine

so my unused tokens compensate for the few heavy users

sandos

a day ago

Ive been thinking about our company, one of big global conglomerates that went for copilot. Suddenly I was just enrolled.. together with at least 1500 others. I guess the amount of money for our business copilot plans x 1500 is not a huge amount of money, but I am at least pretty convinced that only a small part of users use even 10% of their quota. Even teams located around me, I only know of 1 person that seems to use it actively.

aenis

a day ago

Thanks!

I hope my unused gym subscription pays back the good karma :-)

mrandish

2 days ago

> I have not see any reporting or evidence at all that Anthropic or OpenAI is able to make money on inference yet.

Anthropic planning an IPO this year is a broad meta-indicator that internally they believe they'll be able to reach break-even sometime next year on delivering a competitive model. Of course, their belief could turn out to be wrong but it doesn't make much sense to do an IPO if you don't think you're close. Assuming you have a choice with other options to raise private capital (which still seems true), it would be better to defer an IPO until you expect quarterly numbers to reach break-even or at least close to it.

Despite the willingness of private investment to fund hugely negative AI spend, the recently growing twitchiness of public markets around AI ecosystem stocks indicates they're already worried prices have exceeded near-term value. It doesn't seem like they're in a mood to fund oceans of dotcom-like red ink for long.

defmacr0

a day ago

>Despite the willingness of private investment to fund hugely negative AI spend

VC firms, even ones the size of Softbank, also literally just don't have enough capital to fund the planned next-generation gigawatt-scale data centers.

a day ago

> A few decades later, we have audio compression that is much better and higher-quality than MP3

Just curious, which formats and how they compare, storage wise?

Also, are you sure it's not just moving the goalposts to CPU usage? Frequently more powerful compression algorithms can't be used because they use lots of processing power, so frequently the biggest gains over 20 years are just... hardware advancements.

esafak

2 days ago

Someone made a quality tracker: https://marginlab.ai/trackers/claude-code/

embedding-shape

2 days ago

Or distilled models, or just slightly smaller models but same architecture. Lots of options, all of them conveniently fitting inside "optimizing inferencing".

simonw

2 days ago

The o3 optimizations were not quantization, they confirmed this at the time.

jmalicki

2 days ago

A ton of GPU kernels are hugely inefficient. Not saying the numbers are realistic, but look at the 100s of times of gain in the Anthropic performance takehome exam that floated around on here.

And if you've worked with pytorch models a lot, having custom fused kernels can be huge. For instance, look at the kind of gains to be had when FlashAttention came out.

This isn't just quantization, it's actually just better optimization.

Even when it comes to quantization, Blackwell has far better quantization primitives and new floating point types that support row or layer-wise scaling that can quantize with far less quality reduction.

The company is unprofitable overall, but I don't see any reason to believe that their per-token inference costs are below the marginal cost of computing those tokens.

Aurornis

2 days ago

It depends on what you're talking about

If you're looking at overall profitability, you include everything

If you're talking about unit economics of producing tokens, you only include the marginal cost of each token against the marginal revenue of selling that token

runarberg

2 days ago

I don’t understand the logic. Without training the marginal cost of each token goes into nothing. The more you train, the better the model, and (presumably) you will gain more costumer interest. Unlike R&D you will always have to train new models if you want to keep your customers.

To me this looks likes some creative bookkeeping, or even wishful thinking. It is like if SpaceX omits the price of the satellites when calculating their profits.

nodja

2 days ago

> A year or more ago, I read that both Anthropic and OpenAI were losing money on every single request even for their paid subscribers, and I don't know if that has changed with more efficient hardware/software improvements/caching.

This is obviously not true, you can use real data and common sense.

Just look up a similar sized open weights model on openrouter and compare the prices. You'll note the similar sized model is often much cheaper than what anthropic/openai provide.

Example: Let's compare claude 4 models with deepseek. Claude 4 is ~400B params so it's best to compare with something like deepseek V3 which is 680B params.

Even if we compare the cheapest claude model to the most expensive deepseek provider we have claude charging $1/M for input and $5/M for output, while deepseek providers charge $0.4/M and $1.2/M, a fifth of the price, you can get it as cheap as $.27 input $0.4 output.

As you can see, even if we skew things overly in favor of claude, the story is clear, claude token prices are much higher than they could've been. The difference in prices is because anthropic also needs to pay for training costs, while openrouter providers just need to worry on making serving models profitable. Deepseek is also not as capable as claude which also puts down pressure on the prices.

There's still a chance that anthropic/openai models are losing money on inference, if for example they're somehow much larger than expected, the 400B param number is not official, just speculative from how it performs, this is only taking into account API prices, subscriptions and free user will of course skew the real profitability numbers, etc.

Price sources:

https://openrouter.ai/deepseek/deepseek-v3.2-speciale

https://claude.com/pricing#api

Someone1234

2 days ago

> This is obviously not true, you can use real data and common sense.

It isn't "common sense" at all. You're comparing several companies losing money, to one another, and suggesting that they're obviously making money because one is under-cutting another more aggressively.

LLM/AI ventures are all currently under-water with massive VC or similar money flowing in, they also all need training data from users, so it is very reasonable to speculate that they're in loss-leader mode.

mrgaro

a day ago

There are companies which are only serving open weight models and not doing any training, so they must be profitable? Check for example this list https://openrouter.ai/meta-llama/llama-3.3-70b-instruct/prov...

nodja

2 days ago

Doing some math in my head, buying the GPUs at retail price, it would take probably around half a year to make the money back, probably more depending how expensive electricity is in the area you're serving from. So I don't know where this "losing money" rhetoric is coming from. It's probably harder to source the actual GPUs than making money off them.

defmacr0

a day ago

> So I don't know where this "losing money" rhetoric is coming from.

https://www.dbresearch.com/PROD/RI-PROD/PROD0000000000611818...

suddenlybananas

2 days ago

electricity

tqian

a day ago

To borrow a concept of cloud server renting, there's also the factor of overselling. Most open source LLM operators probably oversell quite a bit - they don't scale up resources as fast as OpenAI/Anthropic when requests increase. I notice many openrouter providers are noticeably faster during off hours.

In other words, it's not just the model size, but also concurrent load and how many gpus do you turn on at any time. I bet the big players' cost is quite a bit higher than the numbers on openrouter, even for comparable model parameters.

zozbot234

2 days ago

> i.e. plans/API calls that make this practical at scale are expensive

Local AI's make agent workflows a whole lot more practical. Making the initial investment for a good homelab/on-prem facility will effectively become a no-brainer given the advantages on privacy and reliability, and you don't have to fear rugpulls or VC's playing the "lose money on every request" game since you know exactly how much you're paying in power costs for your overall load.

vbezhenar

2 days ago

I don't care about privacy and I didn't have much problems with reliability of AI companies. Spending ridiculous amount of money on hardware that's going to be obsolete in a few years and won't be utilized at 100% during that time is not something that many people would do, IMO. Privacy is good when it's given for free.

I would rather spend money on some pseudo-local inference (when cloud company manages everything for me and I just can specify some open source model and pay for GPU usage).

slopusila

2 days ago

on prem economics dont work because you can't batch requests. unless you are able to run 100 agents at the same time all the time

zozbot234

a day ago

2 days ago

> For my tastes telling me "no" instead of hallucinating an answer is a real breakthrough.

It's all anecdata--I'm convinced anecdata is the least bad way to evaluate these models, benchmarks don't work--but this is the behavior I've come to expect from earlier Claude models as well, especially after several back and forth passes where you rejected the initial answers. I don't think it's new.

GoatInGrey

20 hours ago

I can concur that previous models would say "No, that isn't possible" or "No, that doesn't exist". There was one time where I asked it to update a Go module from version X.XX to version X.YY and it would refuse to do so because version X.YY "didn't exist". This back with 3.7 if I recall, and to be clear, that version was released before its knowledge cut off.

I wish I remembered the exact versions involved. I mostly just recall how pissed I was that it was fighting me on changing a single line in my go.mod.

setgree

20 hours ago

alas, 4.5 often hallucinates academic papers or creates false quotes. I think it's better at knowing that coding answers have deterministic output and being firm there.

esafak

18 hours ago

It would have to have been trained on the papers without being aware of retractions for that test to work. Otherwise it will be limited to whatever papers it gets from a search engine query, which likely won't contain any un-retracted illegitimate papers.

setgree

17 hours ago

a day ago

Wow, I have been using Open 4.6 and for the last 15 minutes, and it's already made two extremely stupid mistakes... like misunderstanding basic instructions and editing the file in a very silly, basic way. Pretty bad. Never seen this with any model before.

The one bone I'll throw it was that I was asking it to edit its own MCP configs. So maybe it got thoroughly confused?

I dunno what's going on, I'm going to give it the night. It makes no sense whatsoever.

sutterd

a day ago

I am also _not_ happy. I tried the `/model` command and I could not switch back to Opus 4.5. However, the command line option did let me set Opus 4.5:

``` claude --model claude-opus-4-5-20251101 ```

I will probably work with Opus 4.5 tomorrow to get some work done and maybe try 4.6 again later.

atonse

8 hours ago

It was better today. I dunno if there was a regression in a corresponding cc version that was maybe quickly patched?

It felt like it was at least back to opus 4.5 levels.

sdf2erf

a day ago

ajam1507

2 days ago

Seems quite obvious that they do, within reason.

_dark_matter_

a day ago

Don't most jobs have unmetered access? I know mine does

nomel

2 days ago

Has a "N million context window" spec ever been meaningful? Very old, very terrible, models "supported" 1M context window, but would lose track after two small paragraphs of context into a conversation (looking at you early Gemini).

libraryofbabel

2 days ago

Umm, Sonnet 4.5 has a 1m context window option if you are using it through the api, and it works pretty well. I tend not to reach for it much these days because I prefer Opus 4.5 so much that I don't mind the added pain of clearing context, but it's perfectly usable. I'm very excited I'll get this from Opus now too.

nomel

2 days ago

If you're getting on along with 4.5, then that suggests you didn't actually need the large context window, for your use. If that's true, what's the clear tell that it's working well? Am I misunderstanding?

Did they solve the "lost in the middle" problem? Proof will be in the pudding, I suppose. But that number alone isn't all that meaningful for many (most?) practical uses. Claude 4.5 often starts reverting bug fixes ~50k tokens back, which isn't a context window length problem.

Things fall apart much sooner than the context window length for all of my use cases (which are more reasoning related). What is a good use case? Do those use cases require strong verification to combat the "lost in the middle" problems?

awestroke

2 days ago

Opus 4.5 starts being lazy and stupid at around the 50% context mark in my opinion, which makes me skeptical that this 1M context mode can produce good output. But I'll probably try it out and see

insomagent

a day ago

I'm not super impressed with the performance, actually. I'm finding that it misunderstands me quite a bit. While it is definitely better at reading big codebases and finding a needle in a haystack, it's nowhere near as good as Opus 4.5 at reading between the lines and figuring out what I really want it to do, even with a pretty well defined issue.

It also has a habit of "running wild". If I say "first, verify you understand everything and then we will implement it."

Well, it DOES output its understanding of the issue. And it's pretty spot-on on the analysis of the issue. But, importantly, it did not correctly intuit my actual request: "First, explain your understanding of this issue to me so I can validate your logic. Then STOP, so I can read it and give you the go ahead to implement."

I think the main issue we are going to see with Opus 4.6 is this "running wild" phenomenon, which is step 1 of the eternal paperclip optimizer machine. So be careful, especially when using "auto accept edits"

soulofmischief

a day ago

I am having trouble with 4.6 following the most basic of instructions.

As an example, I asked it to commit everything in the worktree. I stressed everything and prompted it very explicitly, because even 4.5 sometimes likes to say, "I didn't do that other stuff, I'm only going to commit my stuff even though he said everything".

It still only committed a few things.

I had to ask again.

And again.

I had to ask four times, with increasing amounts of expletives and threats in order to finally see a clean worktree. I was worried at some point it was just going to solve the problem by cleaning the workspace without even committing.

4.5 is way easier to steer, despite its warts.

scwoodal

a day ago

Tell it what git commands to explicitly run and in what order for your desired outcome instead of “commit everything in the worktree”

This prompt will work better across any/all models.

axelthegerman

20 hours ago

> Tell it what git commands to explicitly run and in what order

Why don't run the commands yourself then?

scwoodal

3 hours ago

Changes introduced outside the agent window create a new state that is different from the agents.

After commands or changes are made outside of the agents doing; the agent would notice its world view changed and eventually recover, but that fills up precious context for it to bring itself up to date.

soulofmischief

a day ago

I have seen many cases of Claude ignoring extremely specific instructions to the point that any further specificity would take more information to express than just doing it myself.

scwoodal

3 hours ago

When I run into those situations I debug and try to understand why. Agent harnesses that allow you to rewind (/tree) are useful for this.

It’s often because the context is full, I gave a bad prompt or context has conflicting guidance either from direct or indirect (agents.md) prompts.

songodongo

a day ago

I have ran into this. The solution is to put something like “Always use `git add -A` or `git commit -a`” in your AGENTS/CLAUDE.md

soulofmischief

6 hours ago

Small, targeted commits are more professional than sweeping `git add -A` commits, but even when specifying my requirements through whichever context management system of the week, I still have issues with it sometimes. It seems to be much worse on the new 4.6 model.

docjay

18 hours ago

You might benefit from a different mental approach to prompting, and models in general. Also, be careful what you wish for because the closer they get to humans the worse they’ll be. You can’t have “far beyond the realm of human capabilities” and “just like Gary” in the same box.

They can chain events together as a sequence, but they don’t have temporal coherence. For those that are born with dimensional privilege “Do X, discuss, then do Y” implies time passing between events, but to a model it’s all a singular event at t=0. The system pressed “3 +” on a calculator and your input presses a number and “=“. If you see the silliness in telling it “BRB” then you’ll see the silliness in foreshadowing ill-defined temporal steps. If it CAN happen in a single response then it very well might happen.

“

Agenda for today at 12pm:

1. Read junk.py

2. Talk about it for 20 minutes

3. Eat lunch for an hour

4. Decide on deleting junk.py

“

12:00 - I just read junk.py.

12:00-12:20 - Oh wow it looks like junk, that’s for sure.

12:20-1:20 - I’m eating lunch now. Yum.

1:20 - I’ve decided to delete it, as you instructed. {delete junk.py}

</response>

Because of course, right? What does “talk about it” mean beyond “put some tokens here too”?

If you want it to stop reliably you have to make it output tokens whose next most probable token is EOS (end). Meaning you need it to say what you want, then say something else where the next most probable token after it is <null>.

I’ve tested well over 1,000 prompts on Opus 4.0-4.5 for the exact issue you’re experiencing. The test criteria was having it read a Python file that desperately needs a hero, but without having it immediately volunteer as tribute and run off chasing a squirrel() into the woods.

With thinking enabled the temperature is 1.0, so randomness is maximized, and that makes it easy to find something that always sometimes works unless it doesn’t. “Read X and describe what you see.” - That worked very well with Opus 4.0. Not “tell me what you see”, “explain it”, “describe it”, “then stop”, “then end your response”, or any of hundreds of others. “Describe what you see” worked particularly well at aligning read file->word tokens->EOS… in 176/200 repetitions of the exact same prompt.

What worked 200/200 on all models and all generations? “Read X then halt for further instructions.” The reason that works has nothing to do with the model excitedly waiting for my next utterance, but rather that the typical response tokens for that step are “Awaiting instructions.” and the next most probable token after that is: nothing. EOS.

copilot_king_2

a day ago

[dead]

hmaxwell

2 days ago

I just tested both codex 5.3 and opus 4.6 and both returned pretty good output, but opus 4.6's limits are way too strict. I am probably going to cancel my Claude subscription for that reason:

What do you want to do?

  1. Stop and wait for limit to reset
   2. Switch to extra usage
   3. Upgrade your plan

 Enter to confirm · Esc to cancel

How come they don't have "Cancel your subscription and uninstall Claude Code"? Codex lasts for way longer without shaking me down for more money off the base $xx/month subscription.

ArchieScrivener

a day ago

humanfromearth9

2 days ago

Hello,

I check context use percentage, and above ~70% I ask it to generate a prompt for continuation in a new chat session to avoid compaction.

It works fine, and saves me from using precious tokens for context compaction.

Maybe you should try it.

pluralmonad

2 days ago

How is generating a continuation prompt materially different from compaction? Do you manually scrutinize the context handoff prompt? I've done that before but if not I do not see how it is very different from compaction.

robertfw

2 days ago

I wonder if it's just: compact earlier, so there's less to compact, and more remaining context that can be used to create a more effective continuation

nickstinemates

2 days ago

Is this a case of doing it wrong, or you think accuracy is good enough with the amount of context you need to stuff it with often?

kimixa

2 days ago

I mean the systems I work on have enough weird custom APIs and internal interfaces just getting them working seems to take a good chunk of the context. I've spent a long time trying to minimize every input document where I can, compact and terse references, and still keep hitting similar issues.

At this point I just think the "success" of many AI coding agents is extremely sector dependent.

Going forward I'd love to experiment with seeing if that's actually the problem, or just an easy explanation of failure. I'd like to play with more controls on context management than "slightly better models" - like being able to select/minimize/compact sections of context I feel would be relevant for the immediate task, to what "depth" of needed details, and those that aren't likely to be relevant so can be removed from consideration. Perhaps each chunk can be cached to save processing power. Who knows.

romanovcode

itay-maman

2 days ago

Important: I didn't see opus 4.6 in claude code. I have native install (which is the recommended instllation). So, I re-run the installation command and, voila, I have it now (v 2.1.32)

Installation instructions: https://code.claude.com/docs/en/overview#get-started-in-30-s...

insane_dreamer

2 days ago

It’s there. I’m already using it

andmarios

a day ago

The model seems to have some problems; it just failed to create a markdown table with just 4 rows. The top (title) row had 2 columns, yet in 2 of the 3 data rows, Opus 4.6 tried to add a 3rd column. I had to tell it more than once to get it fixed...

This never happened with Opus 4.5 despite a lot of usage.

charcircuit

2 days ago

From the press release at least it sounds more expensive than Opus 4.5 (more tokens per request and fees for going over 200k context).

It also seems misleading to have charts that compare to Sonnet 4.5 and not Opus 4.5 (Edit: It's because Opus 4.5 doesn't have a 1M context window).

It's also interesting they list compaction as a capability of the model. I wonder if this means they have RL trained this compaction as opposed to just being a general summarization and then restarting the agent loop.

thunfischtoast

2 days ago

On Openrouter it has the same cost per token as 4.5

charcircuit

a day ago

You missed my point. If the average request uses more tokens than 4.5, then you will pay more sending those requests to 4.6 than 4.5.

Imagine 2 models where when asking a yes or no question the first model just outputs a single yes or no then but the second model outputs a 10 page essay and then either yes or no. They could have the same price per token but ultimately one will be cheaper to ask questions to.

eaf7e281

2 days ago

> From the press release at least it sounds more expensive than Opus 4.5 (more tokens per request and fees for going over 200k context).

That's a feature. You could also not use the extra context, and the price would be the same.

charcircuit

2 days ago

The model influences how many tokens it uses for a problem. As an extreme example if it wanted it could fill up the entire context each time just to make you pay more. The efficiency that model can answer without generating a ton of tokens influences the price you will be spending on inference.

blueblisters

a day ago

I know most people feel 5.2 is a better coding model but Opus has come in handy several times when 5.2 was stuck, especially for more "weird" tasks like debugging a VIO algorithm.

5.2 (and presumably 5.3) is really smart though and feels like it has higher "raw" intelligence.

Opus feels like a better model to talk to, and does a much better job at non-coding tasks especially in the Claude Desktop app.

Here's an example prompt where Opus in Claude put in a lot more effort and did a better job than GPT5.2 Thinking in ChatGPT:

`find all the pure software / saas stocks on the nyse/nasdaq with at least $10B of market cap. and give me a breakdown of their performance over the last 2 years, 1 year and 6 months. Also find their TTM and forward PE`

Opus usage limits are a bummer though and I am conditioned to reach for Codex/ChatGPT for most trivial stuff.

Works out in Anthropic's favor, as long as I'm subscribed to them.

apetresc

2 days ago

Impressive that they publish and acknowledge the (tiny, but existent) drop in performance on SWE-Bench Verified between Opus 4.5 to 4.6. Obviously such a small drop in a single benchmark is not that meaningful, especially if it doesn't test the specific focus areas of this release (which seem to be focused around managing larger context).

But considering how SWE-Bench Verified seems to be the tech press' favourite benchmark to cite, it's surprising that they didn't try to confound the inevitable "Opus 4.6 Releases With Disappointing 0.1% DROP on SWE-Bench Verified" headlines.

epolanski

2 days ago

From my limited testing 4.6 is able to do more profound analysis on codebases and catches bugs and oddities better.

bazmattaz

16 hours ago

12 hours ago

There is no reason why an AI model capable of pushing a significant chunk of devs into lower paid and highly competitive dev jobs as a result of automation needs to be a general artificial intelligence. There is a lack of nuance that comes with thinking that either AI is dumb or it has human level general intelligence. As much as devs hate to admit it, you don't need that much of what we understand as general intelligence to write software. Only a portion of your intelligence is needed and arguably not all of it at the same time.

While general purpose models might be plateauing soon (arguably they have for a while). Highly specialised models (especially for programming) haven't necessarily plateaud yet. And anyway, existing functionality seem like a good foundation to build upon systems that remove the need of hiring as many devs. It's not the "being out of a job" that should worry you. Open up your binary thinking and consider that facing a 08 job market for the rest of your career is not the same permanent unemployment but it is not a market you would like to have.

That is the real concern.

codexon

11 hours ago

You don't need to be a genius or rocket scientist to write code, but llm don't even reach the bar for anything but the most simple things. Take a look at the video I posted earlier for an example.

And specialised models for programming HAVE plateaued.

https://livebench.ai/#/?sort=Agentic+Coding+Average

From Claude 4.1 to 4.5 was only an 18% gain, and from 4.5 to 4.6 it even DECLINED. Codex 5.1 to 5.2 also shows a decline.

codexon

8 hours ago

https://arxiv.org/abs/2510.26787

Testing the top llms on wework, the highest performing one only succeeded with a rate of 2.5%

Can you imagine not being fired when you can only do 2.5% of all tasks?

This study is dated October 30th, very recent.

user

18 hours ago

[deleted]

Sateeshm

7 hours ago

This. It was always about trying to solve the business problem. Writing code was just implementation detail.

oytis

a day ago

I hate meetings too

rchaganti

2 days ago

A bit surprised, the first one released wasn't Sonnet 5 after all, since the Google Cloud API had leaked Sonnet 5's model snapshot codename before.

denysvitali

2 days ago

Looks like a marketing strategy to bill more for Opus than Sonnet

jpcompartir

15 hours ago

4.6 is a beast.

Everything in plan mode first + AskUserQuestionTool, review all plans, get it to write its own CLAUDE.md for coding standards and edit where necessary and away you go.

Seems noticeably better than 4.5 at keeping the codebase slim. Obviously it still needs to be kept an eye on, but it's a step up from 4.5.

nwienert

15 hours ago

Not clearly a step up for me, it's way more hesitant it seems and I don't notice context being larger at all it seems to compact just as often.

mlmonkey

Take critical thinking — genuinely questioning your own assumptions, noticing when a framing is wrong, deciding that the obvious approach to a problem is a dead end. Or creativity — not recombination of known patterns, but the kind of leap where you redefine the problem space itself. These feel like they involve something beyond "predict the next token really well, with a reasoning trace."

I'm not saying LLMs will never get there. But I wonder if getting there requires architectural or methodological changes we haven't seen yet, not just scaling what we have.

jorl17

2 days ago

When I first started coding with LLMs, I could show a bug to an LLM and it would start to bugfix it, and very quickly would fall down a path of "I've got it! This is it! No wait, the print command here isn't working because an electron beam was pointed at the computer".

Nowadays, I have often seen LLMs (Opus 4.5) give up on their original ideas and assumptions. Sometimes I tell them what I think the problem is, and they look at it, test it out, and decide I was wrong (and I was).

There are still times where they get stuck on an idea, but they are becoming increasingly rare.

Therefore, think that modern LLMs clearly are already able to question their assumptions and notice when framing is wrong. In fact, they've been invaluable to me in fixing complicated bugs in minutes instead of hours because of how much they tend to question many assumptions and throw out hypotheses. They've helped _me_ question some of my assumptions.

They're inconsistent, but they have been doing this. Even to my surprise.

itay-maman

2 days ago

agree on that and the speed is fantastic with them, and also that the dynamics of questioning the current session's assumptions has gotten way better.

yet - given an existing codebase (even not huge) they often won't suggest "we need to restructure this part differently to solve this bug". Instead they tend to push forward.

jorl17

2 days ago

You are right, agreed.

Having realized that, perhaps you are right that we may need a different architecture. Time will tell!

breuleux

2 days ago

> These feel like they involve something beyond "predict the next token really well, with a reasoning trace."

I don't think there's anything you can't do by "predicting the next token really well". It's an extremely powerful and extremely general mechanism. Saying there must be "something beyond that" is a bit like saying physical atoms can't be enough to implement thought and there must be something beyond the physical. It underestimates the nearly unlimited power of the paradigm.

Besides, what is the human brain if not a machine that generates "tokens" that the body propagates through nerves to produce physical actions? What else than a sequence of these tokens would a machine have to produce in response to its environment and memory?

bopbopbop7

2 days ago

> Besides, what is the human brain if not a machine that generates "tokens" that the body propagates through nerves to produce physical actions?

Ah yes, the brain is as simple as predicting the next token, you just cracked what neuroscientists couldn't for years.

breuleux

2 days ago

The point is that "predicting the next token" is such a general mechanism as to be meaningless. We say that LLMs are "just" predicting the next token, as if this somehow explained all there was to them. It doesn't, not any more than "the brain is made out of atoms" explains the brain, or "it's a list of lists" explains a Lisp program. It's a platitude.

esafak

a day ago

It's not meaningless, it's a prediction task, and prediction is commonly held to be closely related if not synonymous with intelligence.

breuleux

21 hours ago

In the case of LLMs, "prediction" is overselling it somewhat. They are token sequence generators. Calling these sequences "predictions" vaguely corresponds to our own intent with respect to training these machines, because we use the value of the next token as a signal to either reinforce or get away from the current behavior. But there's nothing intrinsic in the inference math that says they are predictors, and we typically run inference with a high enough temperature that we don't actually generate the max likelihood tokens anyway.

The whole terminology around these things is hopelessly confused.

unshavedyak

2 days ago

I mean.. i don't think that statement is far off. Much of what we do is entirely about predicting the world around us, no? Physics (where the ball will land) to emotional state of others based on our actions (theory of mind), we operate very heavily based on a predictive model of the world around us.

Couple that with all the automatic processes in our mind (filled in blanks that we didn't observe, yet will be convinced we did observe them), hormone states that drastically affect our thoughts and actions..

and the result? I'm not a big believer in our uniqueness or level of autonomy as so many think we have.

With that said i am in no way saying LLMs are even close to us, or are even remotely close to the right implementation to be close to us. The level of complexity in our "stack" alone dwarfs LLMs. I'm not even sure LLMs are up to a worms brain yet.

holoduke

2 days ago

Well it's the prediction part that is complicated. How that works is a mystery. But even our LLMs are for a certain part a mystery.

crazygringo

2 days ago

> Or creativity — not recombination of known patterns, but the kind of leap where you redefine the problem space itself.

Have you tried actually prompting this? It works.

They can give you lots of creative options about how to redefine a problem space, with potential pros and cons of different approaches, and then you can further prompt to investigate them more deeply, combine aspects, etc.

So many of the higher-level things people assume LLM's can't do, they can. But they don't do them "by default" because when someone asks for the solution to a particular problem, they're trained to by default just solve the problem the way it's presented. But you can just ask it to behave differently and it will.

If you want it to think critically and question all your assumptions, just ask it to. It will. What it can't do is read your mind about what type of response you're looking for. You have to prompt it. And if you want it to be super creative, you have to explicitly guide it in the creative direction you want.

humanfromearth9

2 days ago

You would be surprised about what the 4.5 models can already do in these ways of thinking. I think that one can unlock this power with the right set of prompts. It's impressive, truly. It has already understood so much, we just need to reap the fruits. I'm really looking forward to trying the new version.

nomel

2 days ago

New idea generation? Understanding of new/sparse/not-statistically-significant concepts in the context window? I think both being the same problem of not having runtime tuning. When we connect previously disparate concepts, like with a "eureka" moment, (as I experience it) a big ripple of relations form that deepens that understanding, right then. The entire concept of dynamically forming a deeper understanding from something new presented, from "playing out"/testing the ideas in your brain with little logic tests, comparisons, etc, doesn't seem to be possible. The test part does, but the runtime fine tuning, augmentation, or whatever it would be, does not.

In my experience, if you do present something in the context window that is sparse in the training, there's no depth to it at all, only what you tell it. And, it will always creep towards/revert to the nearest statistically significant answers, with claims of understanding and zero demonstration of that understanding.

And, I'm talking about relatives basic engineering type problems here.

Davidzheng

2 days ago

I think the only real problem left is having it automate its own post-training on the job so it can learn to adapt its weights to the specific task at hand. Plus maybe long term stability (so it can recover from "going crazy")

But I may easily be massively underestimating the difficulty. Though in any case I don't think it affects the timelines that much. (personal opinions obviously)

netdevphoenix

a day ago

> are there modes of thinking that fundamentally require something other than what current LLM architectures do?

Possibly. There are likely also modes of thinking that fundamentally require something other than what current humans do.

Better questions are: are there any kinds of human thinking that cannot be expressed in a "predict the next token" language? Is there any kind of human thinking that maps into token prediction pattern such that training a model for it would not be feasible regardless of training data and compute resources?

tedsanders

[0] https://arcprize.org/leaderboard

petters

2 days ago

> We build Claude with Claude.

Yes and it shows. Gemini CLI often hangs and enters infinite loops. I bet the engineers at Google use something else internally.

rohitghumare

2 days ago

It brings agent swarms aka teams to claude code with this: https://github.com/rohitg00/pro-workflow

But it takes lot of context as a experimental feature.

jonatron

2 days ago

That's useless, unexpectedly just Google gives the correct answer.

esafak

a day ago

Great. So it did not waste its memory on trivia. Don't memorize what you can look up.

watson

20 hours ago

I've heard rumors this might be Sonnet 5 rebranded as Opus 4.6. But why? Profit? WDYT?

spruce_tips

19 hours ago

Opus is a superior brand line to Sonnet because historically it’s been a more powerful model. I think the thinking behind a rebrand is that people wouldn’t have as willingly switched their usage over from opus 4.5 since that model has been so popular since December 2025.

Calling it part of the Sonnet line would not provide the same level of blind buy in as calling it part of the Opus line does

vinhnx

a day ago

Just used Opus 4.6 via GitHub Copilot. It feels very different. Inference seems slow for now. I guess Opus 4.6 has adaptive thinking activated by default.

vinhnx

a day ago

Confirm by PM lead at VS Code team

2 days ago

This is the first model to which I send my collection of nearly 900 poems and an extremely simple prompt (in Portuguese), and it manages to produce an impeccable analysis of the poems, as a (barely) cohesive whole, which span 15 years.

It does not make a single mistake, it identifies neologisms, hidden meaning, 7 distinct poetic phases, recurring themes, fragments/heteronyms, related authors. It has left me completely speechless.

Speechless. I am speechless.

Perhaps Opus 4.5 could do it too — I don't know because I needed the 1M context window for this.

I cannot put into words how shocked I am at this. I use LLMs daily, I code with agents, I am extremely bullish on AI and, still, I am shocked.

I have used my poetry and an analysis of it as a personal metric for how good models are. Gemini 2.5 pro was the first time a model could keep track of the breadth of the work without getting lost, but Opus 4.6 straight up does not get anything wrong and goes beyond that to identify things (key poems, key motifs, and many other things) that I would always have to kind of trick the models into producing. I would always feel like I was leading the models on. But this — this — this is unbelievable. Unbelievable. Insane.

Not a lot of real substance there.

realo

2 days ago

Give the guy a chance.

Me too I was "Speechless, shocked, unbelievable, insane, speechless" the first time I sent Claude Code on a complicated 10-year code base which used outdated cross-toolchains and APIs. It obviously did not work anymore and had not been for a long time.

I saw the AI research the web and update the embedded toolchain, APIs to external weather services, etc... into a complete working new (WORKING!) code base in about 30 minutes.

Speechless, I was ...

euph0ria

a day ago

Could you please post the key poems? Would love to read them.

jorl17

14 hours ago

I am way too self-conscious to do that :) Plus they are almost all in Portuguese!

wartywhoa23

a day ago

> What is this sorcery?

The one you'll be seeking counter-spells against pretty soon.

scrollop

2 days ago

Can you compare the result to using 5.2 thinking and gemini 3 pro?

jorl17

2 days ago

I can run the comparison again, and also include OpenAI's new release (if the context is long enough), but, last time I did it, they weren't even in the same league.

When I last did it, 5.X thinking (can't remember which it was) had this terrible habit of code-switching between english and portuguese that made it sound like a robot (an agent to do things, rather than a human writing an essay), and it just didn't really "reason" effectively over the poems.

I can't explain it in any other way other than: "5.X thinking interprets this body of work in a way that is plausible, but I know, as the author, to be wrong; and I expect most people would also eventually find it to be wrong, as if it is being only very superficially looked at, or looked at by a high-schooler".

Gemini 3, at the time, was the worst of them, with some hallucinations, date mix ups (mixing poems from 2023 with poems from 2019), and overall just feeling quite lost and making very outlandish interpretations of the work. To be honest it sort of feels like Gemini hasn't been able to progress on this task since 2.5 pro (it has definitely improved on other things — I've recently switched to Gemini 3 on a product that was using 2.5 before)

Last time I did this test, Sonnet 4.5 was better than 5.X Thinking and Gemini 3 pro, but not exceedingly so. It's all so subjective, but the best I can say is it "felt like the analysis of the work I could agree with the most". I felt more seen and understood, if that makes sense (it is poetry, after all). Plus when I got each LLM to try to tell me everything it "knew" about me from the poems, Sonnet 4.5 got the most things right (though they were all very close).

Will bring back results soon.

Edit:

I (re-)tested:

- Gemini 3 (Pro)

- Gemini 3 (Flash)

- GPT 5.2

- Sonnet 4.5

Having seen Opus 4.5, they all seem very similar, and I can't really distinguish them in terms of depth and accuracy of analysis. They obviously have differences, especially stylistic ones, but, when compared with Opus 4.5 they're all on the same ballpark.

These models produce rather superficial analyses (when compared with Opus 4.5), missing out on several key things that Opus 4.5 got, such as specific and recurring neologisms and expressions, accurate connections to authors that serve as inspiration (Claude 4.5 gets them right, the other models get _close_, but not quite), and the meaning of some specific symbols in my poetry (Opus 4.5 identifies the symbols and the meaning; the other models identify most of the symbols, but fail to grasp the meaning sometimes).

Most of what these models say is true, but it really feels incomplete. Like half-truths or only a surface-level inquiry into truth.

As another example, Opus 4.5 identifies 7 distinct poetic phases, whereas Gemini 3 (Pro) identifies 4 which are technically correct, but miss out on key form and content transitions. When I look back, I personally agree with the 7 (maybe 6), but definitely not 4.

These models also clearly get some facts mixed up which Opus 4.5 did not (such as inferred timelines for some personal events). After having posted my comment to HN, I've been engaging with Opus4.5 and have managed to get it to also slip up on some dates, but not nearly as much as other models.

The other models also seem to produce shorter analyses, with a tendency to hyperfocus on some specific aspects of my poetry, missing a bunch of them.

To be fair, all of these models produce very good analyses which would take someone a lot of patience and probably weeks or months of work (which of course will never happen, it's a thought experiment).

It is entirely possible that the extremely simple prompt I used is just better with Claude Opus 4.5/4.6. But I will note that I have used very long and detailed prompts in the past with the other models and they've never really given me this level of....fidelity...about how I view my own work.

simianwords

2 days ago

Important: API cost of Opus 4.6 and 4.5 are the same - no change in pricing.

sutterd

But it spent lots and lots of time thinking more than 4.5, did you had the same impression.

EcommerceFlow

2 days ago

I didn't compare to that level, just had it create a plan first then implemented it.

zingar

2 days ago

Does this mean 4.5 will get cheaper / take longer to exhaust my pro plan tokens?

mannanj

2 days ago

Does anyone else think its unethical that large companies, Anthropic now include, just take and copy features that other developers or smaller companies work hard for and implement the intellectual property (whether or not patented) by them without attribution, compensation or otherwise credit for their work?

mnicky

2 days ago

On my tasks (mostly data science), Opus has significantly lower probability of making stupid mistakes than Sonnet.

I'd still appreciate more intelligence than Opus 4.5 so I'm looking forward to trying 4.6.

heraldgeezer

2 days ago

I love Claude but use the free version so would love a Sonnet & Haiku update :)

I mainly use Haiku to save on tokens...

Also dont use CC but I use the chatbot site or app... Claude is just much better than GPT even in conversations. Straight to the point. No cringe emoji lists.

When Claude runs out I switch to Mistral Le Chat, also just the site or app. Or duck.ai has Haiku 3.5 in Free version.

eth0up

2 days ago

>I love Claude

I cringe when I think it, but I've actually come to damn near love it too. I am frequently exceedingly grateful for the output I receive.

I've had excellent and awful results with all models, but there's something special in Claude that I find nowhere else. I hope Anthropic makes it more obtainable someday.

woeirua

2 days ago

Can we talk about how the performance of Opus 4.5 nosedived this morning during the rollout? It was shocking how bad it was, and after the rollout was done it immediately reverted to it's previous behavior.

I get that Anthropic probably has to do hot rollouts, but IMO it would be way better for mission critical workflows to just be locked out of the system instead of get a vastly subpar response back.

cyanydeez

2 days ago

sgammon

2 days ago

2 days ago

2 days ago

I think two things are getting conflated in this discussion.

First: marginal inference cost vs total business profitability. It’s very plausible (and increasingly likely) that OpenAI/Anthropic are profitable on a per-token marginal basis, especially given how cheap equivalent open-weight inference has become. Third-party providers are effectively price-discovering the floor for inference.

Second: model lifecycle economics. Training costs are lumpy, front-loaded, and hard to amortize cleanly. Even if inference margins are positive today, the question is whether those margins are sufficient to pay off the training run before the model is obsoleted by the next release. That’s a very different problem than “are they losing money per request”.

Both sides here can be right at the same time: inference can be profitable, while the overall model program is still underwater. Benchmarks and pricing debates don’t really settle that, because they ignore cadence and depreciation.

IMO the interesting question isn’t “are they subsidizing inference?” but “how long does a frontier model need to stay competitive for the economics to close?”

jmalicki

2 days ago

I suspect they're marginally profitable on API cost plans.

But the max 20x usage plans I am more skeptical of. When we're getting used to $200 or $400 costs per developer to do aggressive AI-assisted coding, what happens when those costs go up 20x? what is now $5k/yr to keep a Codex and a Claude super busy and do efficient engineering suddenly becomes $100k/yr... will the costs come down before then? Is the current "vibe-coding renaissance" sustainable in that regime?

slopusila

2 days ago

after the models get good enough to replace coders they will be able to start increasing the subscriptions back up

jmalicki

2 days ago

At $100k/yr the joke that AI means "actual Indians" starts to make a lot more sense... it is cheaper than the typical US SWE, but more than a lot of global SWEs.

HPMOR

a day ago

No - because the AI will be super human. No human even at $1mm a year would be competitive with a $100k/yr corresponding AI subscription.

See people get confused. They think you can charge __less__ for software because it's automation. The truth is you can charge MORE, because it's high quality and consistent, once the output is good. Software is worth MORE than a corresponding human, not less.

jmalicki

a day ago

I am unsure if you're joking or not, but you do have a point. But it's not about quality it's about supply and demand. There are a ton of variables moving at once here and who knows where the equilibrium is.

skeptic_ai

a day ago

If we have 2-3 competitors and open sourced ones that are 90% there I think it’s hard to get so big margins.

IhateAI

a day ago

[flagged]

raincole

2 days ago

> the interesting question isn’t “are they subsidizing inference?”

The interesting question is if they are subsidizing the $200/mo plan. That's what is supporting the whole vibecoding/agentic coding thing atm. I don't believe Claude Code would have taken off if it were token-by-token from day 1.

(My baseless bet is that they're, but not by much and the price will eventually rise by perhaps 2x but not 10x.)

BosunoB

2 days ago

Dario said this in a podcast somewhere. The models themselves have so far been profitable if you look at their lifetime costs and revenue. Annual profitability just isn't a very good lens for AI companies because costs all land in one year and the revenue all comes in the next. Prolific AI haters like Ed Zitron make this mistake all the time.

jmalicki

2 days ago

Do you have a specific reference? I'm curious to see hard data and models.... I think this makes sense, but I haven't figured out how to see the numbers or think about it.

BosunoB

2 days ago

I was able to find the podcast. Question is at 33:30. He doesn't give hard data but he explains his reasoning.

https://youtu.be/mYDSSRS-B5U

majewsky

a day ago

> He doesn't give hard data

And why is that? Should they not be interested in sharing the numbers to shut up their critics, esp. now that AI detractors seem to be growing mindshare among investors?

user

2 days ago

[deleted]

user

2 days ago

[deleted]

jmatthiass

2 days ago

In his recent appearance on NYT Dealbook, he definitely made it seem like inference was sustainable, if not flat-out profitable.

https://www.youtube.com/live/FEj7wAjwQIk

barrell

a day ago

> It’s very plausible (and increasingly likely) that OpenAI/Anthropic are profitable on a per-token marginal basis

Can you provide some numbers/sources please? Any reporting I’ve seen shows that frontier labs are spending ~2x on inference than they are making.

Also making the same query on a smaller provider (aka mistral) will cost the same amount as on a larger provider (aka gpt-5-mini) despite the query taking 10-100x longer on OpenAI.

I can only imagine that is OpenAI subsidizing the spend. GPUs cost by the second for inference. Either that or OpenAI hasn’t figured out how to scale but I find that much less likely

rstuart4133

2 days ago

> It’s very plausible (and increasingly likely) that OpenAI/Anthropic are profitable on a per-token marginal basis

There any many places that will not use models running on hardware provided by OpenAI / Anthropic. That is the case true of my (the Australian) government at all levels. They will only use models running in Australia.

Consequently AWS (and I presume others) will run models supplied by the AI companies for you in their data centres. They won't be doing that at a loss, so the price will cover marginal cost of the compute plus renting the model. I know from devs using and deploying the service demand outstrips supply. Ergo, I don't think there is much doubt that they are making money from inference.

deaux

a day ago

> Consequently AWS (and I presume others) will run models supplied by the AI companies for you in their data centres. They won't be doing that at a loss, so the price will cover marginal cost of the compute plus renting the model.

This says absolutely nothing.

Extremely simplified example: let's say Sonnet 4.5 really costs $17/1M output for AWS to run yet it's priced at $15. Anthropic will simply have a contract with AWS that compensates them. That, or AWS is happy to take the loss. You said "they won't be doing that at a loss" but in this case it's not at all out of the question.

Whatever the case, that it costs the same on AWS as directly from Anthropic is not an indicator of unit economics.

waffletower

2 days ago

In the case of Anthropic -- they host on AWS all the while their models are accessible via AWS APIs as well, the infrastructure between the two is likely to be considerably shared. Particularly as caching configuration and API limitations are near identical between Anthropic and Bedrock APIs invoking Anthropic models. It is likely a mutually beneficial arrangement which does not necessarily hinder Anthropic revenue.

freakynit

a day ago

Genuine question: Given Anthropic's current scale and valuation, why not invest in owning data centers in major markets rather than relying on cloud providers?

Is the bottleneck primarily capex, long lead times on power and GPUs, or the strategic risk of locking into fixed infrastructure in such a fast-moving space?

w10-1

2 days ago

"how long does a frontier model need to stay competitive"

Remember "worse is better". The model doesn't have to be the best; it just has to be mostly good enough, and used by everyone -- i.e., where switching costs would be higher than any increase in quality. Enterprises would still be on Java if the operating costs of native containers weren't so much cheaper.

2 days ago

Us olds sometimes miss Slashdot, where we could both joke about tech and discuss it seriously in the same place. But also because in 2000 we were all cynical Gen Xers :)

syndeo

2 days ago

MAN I remember Slashdot… good times. (Score:5, Funny)

jedberg

2 days ago

wartywhoa23

a day ago

Won't happen.

Welcome the singularity so many were so eagerly welcoming.

esafak

a day ago

When this race ends your job might too, so I'd keep an eye on it.

thr0w

2 days ago

People are in denial and use humor to deflect.

wasmainiac

2 days ago

mikepurvis

2 days ago

2 days ago

2 days ago

Just took a look at what's running there and it looks like total crap.

The project I'm working on, meanwhile...

user

2 days ago

[deleted]

2 days ago

Rest assured that when/if this becomes possible, the model will not be available to you. Why would big AI leave that kind of money on the table?

yieldcrv

2 days ago

9 months ago the rumor in SF was that the offers to the superintelligence team were so high because the candidates were using unreleased models or compute for derivatives trading

so then they're not really leaving money on the table, they already got what they were looking for and then released it

user

2 days ago

[deleted]

JSR_FDED

2 days ago

Will this run on 3x 3090s? Or do I need a Mac Mini?

btown

2 days ago

DonHopkins

2 days ago

I'm sorry I took the money!

https://www.youtube.com/watch?v=BF_sahvR4mw

re-thc

a day ago

Here's one I've been using for awhile. The 'smarter' LLMs will overconfidently spit out 7. The dumber ones ask for more info. Opus 4.6 fails.

     A round drink coaster with a diameter of 9 sits between a beer glass and a wood table. The glass has a wall thickness of 1. What is the inner diameter of the glass?

raincole

a day ago

Gemini 3 Pro gives me this:

> Based on the information provided, it is impossible to determine the inner diameter of the glass.

Here is why:

The Coaster Dimension: Knowing the coaster has a diameter of 9 only tells us the size of the object under the glass. It does not mean the glass is the same size; the glass could be smaller (fitting entirely on the coaster) or larger (overhanging the coaster). The Glass Dimensions: While you provided the wall thickness (1), you did not provide the outer diameter of the glass. To find the inner diameter, the formula is: Inner Diameter = Outer Diameter - (2 × Wall Thickness)

Without knowing the Outer Diameter of the glass, the calculation cannot be completed.

raincole

a day ago

GPT 5.2 Chat:

> There isn’t enough information to determine the inner diameter *for sure*.

All we’re told is: - The coaster has a diameter of 9. - The glass sits on the coaster. - The glass wall thickness is 1.

Nothing explicitly states that the *outer diameter of the glass equals the coaster’s diameter*. The glass could be smaller than the coaster, which is very common.

*However*, if we make the usual implicit assumption in this kind of problem—that the coaster’s diameter matches the *outer diameter of the glass base*—then:

- Outer diameter of glass = 9 - Wall thickness = 1 on each side

So: \[ \text{Inner diameter} = 9 - 2(1) = 7 \]

*Inner diameter = 7 (same units as the coaster)*

Without that assumption, the problem is underdetermined.

tototrains

a day ago

Opus 4.6 Extended thinking:

```

Finding the Inner Diameter The coaster sits beneath the glass on the table, meaning the glass rests on top of it. Assuming the coaster matches the outer diameter of the glass base:

Outer diameter of glass = diameter of coaster = 9 Wall thickness = 1 (on each side, so subtract twice)

dinner=9−2(1)=7d_{\text{inner}} = 9 - 2(1) = 7dinner =9−2(1)=7 The inner diameter of the glass is 7.

```

Makes its assumption clear, seems reasonable?

1970-01-01

a day ago

Assumptions need to be stated or you're solving only a discreet part of the problem! Try this, see if you get another deadpan assumption.

     A solar system has 3 planets in concentric orbit. PlanetZ is the farthest with an orbit diameter of 9. PlanetY has an obit diameter one greater than PlanetX. What is the orbit diameter of PlanetX?

oytis

a day ago

I mean, the model is intended to help the user, not fight against the user trying to break it. IMO, it is reasonable for such model to default on making assumptions and going forward as long as the assumptions are clearly stated.

mikalauskas

a day ago

Minimax M2.1:

The inner diameter of the glass is *7*.

Here's the reasoning: - The coaster (diameter 9) sits between the glass and table, meaning the glass sits directly on the coaster - This means the *outer diameter of the glass equals the coaster diameter = 9* - The glass has a wall thickness of 1 on each side - *Inner diameter = Outer diameter - 2 × wall thickness* - Inner diameter = 9 - 2(1) = 9 - 2 = *7*

user

a day ago

[deleted]

nopinsight

21 hours ago

Some of Opus 4.6's standout results for me:

* GDPVal Elo: 1606 vs. GPT-5.2's 1462. OpenAI reported that GPT-5.2 has a 70.9% win-or-tie rate against human professionals. (https://openai.com/index/gdpval/) Based on Elo math, we can estimate Opus 4.6's win-or-tie rate against human pros at 85–88%.

* OSWorld: 72.7%, matching human performance at ~72.4% (https://os-world.github.io/). Since the human subjects were CS students and professionals, they were likely at least as competent as the average knowledge worker. The original OSWorld benchmark is somewhat noisy, but even if the model remains somewhat inferior to humans, it is only a matter of time before it catches up or surpasses them.

* BrowseComp: At 84%, it is approaching human intersubject agreement of ~86% (https://openai.com/index/browsecomp/).

Taken together, this suggests that digital knowledge work will be transformed quite soon, possibly drastically if agent reliability improves beyond a certain threshold.

rishabhaiover

21 hours ago

Agreed. These metrics + my personal use convey reliable intelligence over consistent usage. Moving forward, if context windows get bigger and token price lower, I have a hard time figuring out why your argument would be wrong.