doctoboggan
6 hours ago
The cost per task chart is telling me that I should _never_ use Sonnet 5 above medium effort level - Opus always performs better for a given cost. So I guess the takeaway is that if Sonnet 5 medium isn't good enough for you, switch models, not effort levels.
AquinasCoder
6 hours ago
While I appreciate that they publish this information, it's increasingly hard to keep track of it all. I've lost the mental model of how different models at different effort levels perform and what tasks they are good at.
In practice, I tend to just use the default on Claude Code that works well enough. But I wonder to what degree other users really play around with these settings to optimize for their project.
m-dot-reviews
13 minutes ago
I've been plugging this perhaps too many times now, but I am trying to bootstrap a user-sourced corpus of exactly "what model is good at task X". So, not benchmarks, but high-level tasks. There's a bit of an ordering problem in that nobody wants to bother commenting on a site that has few comments - so PTAL and contribute if you can. https://model.reviews
matheusmoreira
2 hours ago
I always use Opus 4.8 at max effort for everything. The $20 subscription didn't have enough tokens, but the $100 one had too many of them. So now I just max out Opus in order to maintain 100% weekly utilization.
easygenes
an hour ago
I'm a heavy enough user that I have both the OAI and Anth $200 plans. I always use at least 50% of my weekly Opus quota at Extra setting (meaning I use double the limit of the $100 plan, at minimum). Max I rarely touch because it is twice as slow and the incremental capability gain is minimal. Usually if Opus can't sort something well at Extra, the answer isn't to use Max but to hand the issue off to GPT-5.5 at XHigh.
tyg13
19 minutes ago
I too have settled into a kind of dual Claude/GPT model setup. I will often use one to review the other's work, or critique the other's plan in some way. Sometimes I'll have Claude implement a feature one way, then have GPT do it the other way, then have them both review each other's implementation. Then synthesize a final plan from the previous implementations+reviews.
I might just be having fun with models, but I have actually noticed their capabilities vary somewhat, and so my (perhaps vain) hope is that by using both, each can catch the other's blind spots. It's still unclear to me if that's consistently happening, but I am making substantial progress in my personal and professional projects, so something seems to be working.
ATMLOTTOBEER
2 hours ago
Agreed I think your strategy is optimal. This is what I landed on as well
vcf
an hour ago
Me too, I rarely hit limits anymore on the $100 Max, except for the brief period with Fable
nolok
3 hours ago
Same boat as you, and my answer is "... Except when I ask for an overall or checkup task that is specifically heavy or overseeing, in which case I use the maximum level", which lately meant ultracode.
I'm not going to play around with thinking level every request because the goal is to make me save time not spend it in a different setting menu.
brobdingnagians
3 hours ago
I tend to run it on High and then step it up for problems where I'm noticing it struggles, bump it back down after. Sometimes I accidentally leave a session in Ultracode for a day and wonder why things are taking so long, but generally happy with the results.
sanderjd
5 hours ago
What I want is a harness that knows how to optimize this kind of thing for me.
nl
an hour ago
In practice I don't think any harness (happy to be corrected here!) uses the lesser capability models for writing code. The cost trade-offs are rarely worth it.
They are often used for reading code though.
To expand on this, while the "big model to write a plan, small model to write the specific code" idea is quite common it trips up on edge cases.
In theory the flow works like this:
- small fast models read lots of code, and pass details to the large model to write a plan
- large model takes those details and writes a detailed plan
- medium models write the code
The issue happens when the medium model hits something that the plan didn't take into account (which happens a lot - the big model didn't actually read the code). Then it has to either guess, or pass back to the large model.
If it guesses, the plan usually starts to fall to bits.
If it passes back to the large model, inevitably the large model has to start reading lots of code. In that case you are paying for the expensive tokens to read, so you might as well have it write the code too (far fewer tokens are written than are read).
It might be possible to get this to work, but I haven't seen anyone who has tried agentic work with frontier models be satisfied with this hybrid setup.
I'd note that Amp (mentioned above) is probably the leader in using multiple providers in a coding agent but still uses frontier models to write code.
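To put rough numbers on the pass-back case described above, here is a back-of-the-envelope sketch, not a measurement. Every price and token count is a made-up placeholder, and `task_cost` is a hypothetical helper; the only thing taken from the comment is the shape of the argument, namely that far more tokens are read than written.

```python
# Back-of-the-envelope cost comparison. All prices and token counts are
# hypothetical placeholders, not real pricing for any model.

def task_cost(read_tokens, written_tokens, price_in_per_m, price_out_per_m):
    """Dollar cost of one task given token counts and per-million-token prices."""
    return (read_tokens * price_in_per_m + written_tokens * price_out_per_m) / 1_000_000

LARGE = {"price_in_per_m": 5.0, "price_out_per_m": 25.0}    # assumed prices
MEDIUM = {"price_in_per_m": 3.0, "price_out_per_m": 15.0}   # assumed prices

READ_TOKENS = 400_000     # assumed: code the large model ends up reading
WRITTEN_TOKENS = 20_000   # assumed: code that actually gets written

# Case 1: the large model reads the code and writes the change itself.
large_only = task_cost(READ_TOKENS, WRITTEN_TOKENS, **LARGE)

# Case 2 (best case for the hybrid): the large model still pays for the reads,
# and the medium model writes the code without re-reading anything.
hybrid_best_case = (
    task_cost(READ_TOKENS, 0, **LARGE)
    + task_cost(0, WRITTEN_TOKENS, **MEDIUM)
)

saving = large_only - hybrid_best_case
saving_fraction = saving / large_only

print(f"large only:       ${large_only:.2f}")
print(f"hybrid best case: ${hybrid_best_case:.2f}")
print(f"saving:           ${saving:.2f} ({saving_fraction:.0%})")
```

With these placeholder numbers the hybrid saves about 8% even in its best case, because the expensive reads are paid either way. If the medium model also has to re-read code, the saving shrinks further or turns negative.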
cunningfatalist
5 hours ago
You might want to check out Amp: https://ampcode.com/
sanderjd
3 hours ago
I appreciate the suggestion! But it isn't clear to me, from reading their marketing site, what they bring to the table from this perspective. Can you give me a more targeted pitch?
manojlds
5 hours ago
Which is your own harness and your own evals for your tasks I guess
munk-a
3 hours ago
I don't demand a customized compiler for my code even if such a compiler could outperform gcc. There is a lot of value in focusing on correctness to an extreme degree even if the outcome might be suboptimal to something more tailored - a tool with a large customer base can justify more resources going into its maintenance.
sanderjd
5 hours ago
Maybe. But that sounds like a large amount of bespoke work for what seems like a common problem?
manojlds
4 hours ago
I was talking about enterprise agents and then realized the question is more about coding agents.
sanderjd
4 hours ago
Ah I see! Yes, I was talking about a coding harness, not an enterprise agent. I entirely agree with you that your suggestion of driving it via evals is the right thing for that use case!
jbvlkt
4 hours ago
Exactly this is my problem with all AI tools. I want someone else to create working tools for me so I can focus on my product. It is the same with other tools. I do not want to spend huge amounts of energy and time to set up my IDE, operating system or desk layout. I guess it is too early to have that now.
jerojero
3 hours ago
I think that's the whole selling point of lovable?
throwaway219450
an hour ago
Same advice as ever? We call it context engineering now, but prompt engineering still matters a lot. Most of the failures I run into are unspecified assumptions made by the model that derail the conversation, but usually updating the first prompt fixes it. Opus in my experience is a bit better about checking assumptions, while Sonnet will plow on ahead. An example is mentioning a file that doesn't exist: Sonnet will go ahead and try to grep your entire hard drive for it. Opus will say it's not local and request the path.
I trust neither for general knowledge and I still find Opus giving me answers that are completely BS. But the token spend for Q&A is nothing compared to coding, so I always use Opus + a lot of thinking. For coding, I find Opus to be better value/token but I haven't done any sort of rigorous test.
deadbabe
an hour ago
There are token optimization consultants that can help organizations find the right balance of models for their employees to minimize costs.
j45
2 hours ago
Just because it’s hard to keep track of doesn’t mean it’s not relevant.
It is incredibly helpful to schedule an hour or two on one's calendar each week to play around and learn the differences, while saving links throughout the week to try out.
jacooper
6 hours ago
Just use deepswe as a reference point.
paulddraper
3 hours ago
It's almost like you want an automatically intelligent choice of your artificial intelligence.
Understandable frankly.
2001zhaozhao
6 hours ago
There are two wrinkles to this:
- For Claude.ai subscriptions I think Sonnet is much cheaper than Opus. This is why there was a "Sonnet only" usage bar for Max tier for the longest time.
- For some tasks the sheer amount of raw input tokens is the most important. For example multimodal computer use tasks. You can't make them any more efficient on Opus by turning down the reasoning, so a cheaper model like Sonnet is useful for them
timcobb
6 hours ago
> This is why there was a "Sonnet only" usage bar for Max tier for the longest time.
it's still there. I still don't totally grok why I can't use all my tokens on Sonnet if I want to... maybe that signals something?
i000
4 hours ago
They want to encourage diversifying model use.
laughingcurve
4 hours ago
Distillation attacks? Volume of calls?
Torkel
6 hours ago
Yeah, I was looking at the same chart and was very surprised at where the curve is relative to opus... Feels like sonnet 5 is "what if opus had an extra-low effort level"?
energy123
6 hours ago
The arguable caveat is Sonnet may run faster (although this isn't known for sure, due to more tokens being used for the same task), so you can potentially get more done in a synchronous iterative workflow
I don't really believe this however, because so much time is spent fixing up after models, that a slower but more intelligent model is a net time saver in my experience.
kolinko
3 hours ago
From my benchmarks, sadly, it doesn't seem to be the case much. Surprisingly. I found Sonnet comparable in speed to Opus (sic), but perhaps I was testing it wrong?
riverbirch
2 hours ago
I can confirm this; I too am not seeing much of a difference in practice.
XCSme
3 hours ago
Well, it is a Sonnet model, it is indeed better[0] than Sonnet 4.6 (smarter, faster, cheaper), but I don't see why you would use it as opposed to Opus 4.8 low or GLM-5.2...
[0]: https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-med...
XCSme
3 hours ago
What's interesting is that Sonnet 5 is actually worse[0] than 4.6 without reasoning.
It makes some sense, as models are trained more and more with reasoning than without.
[0]: https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-non...
lucamark
5 hours ago
You're referring to the Agentic search, but if you look at the Agentic computer use the cost is basically halved.
However, I am also confused about market positioning. Too expensive to perform daily tasks - open source models are much cheaper - and not a frontier model to address complex real world problems.
Rarely used Sonnet btw.
energy123
5 hours ago
You're the second person that has said this but I cannot understand why you are interpreting the "Agentic computer use" graph in this manner.
The graph shows that Opus is cheaper than Sonnet for the same performance. Unless I am suffering a cognitive blindness thing right now.
lucamark
5 hours ago
Wrong! Look at it better. It shows that Opus has superior performance but at higher cost.
doctoboggan
5 hours ago
No, you are misunderstanding the graph. Draw a vertical line anywhere, that is a "constant cost" line. For any given cost, Opus 4.8 has a higher performance than Sonnet 5. Only where Sonnet 5 effort is at medium or low would it make any sense to use it, as there isn't even an equivalent Opus effort level to compare to.
Alternatively you can draw a horizontal "constant performance" line and see that Opus is cheaper for a given performance level.
827a
4 hours ago
Why are you comparing xhigh reasoning between Sonnet and Opus? Of course Sonnet xhigh is cheaper than Opus xhigh, but that isn't the point; the point is that e.g. 80% accuracy on Opus costs ~$0.45 (medium reasoning) whereas on Sonnet it costs ~$0.52 (xhigh/max reasoning).
brokencode
5 hours ago
That is a bad comparison. Compare Sonnet xhigh against Opus medium, which is both better and cheaper.
energy123
5 hours ago
No, that's apples and oranges. You need to compare Sonnet5's 79% with the interpolated Opus4.8's 79%.
annzabelle
3 hours ago
> Too expensive to perform daily tasks - open source models are much cheaper
There is a real advantage, especially for businesses, in using an off the shelf solution from a corporate provider.
Personally, the advantage of not having to set up multiple solutions from multiple sources outweighs the cost of a $20 a month subscription. Think about why a lot of consumers prefer Apple devices over Linux. There are a lot of advantages to Linux, but "never having to think about my tools" is its own advantage.
girvo
3 hours ago
The specific market positioning is... for me to use at my big tech company job, where we aren't allowed to use GLM and similar, but have fixed caps on how much token usage we're allowed to rack up a month.
johnfn
6 hours ago
That's just one benchmark, though. Tab to the next one and Sonnet 5 performs better as effort goes up just as you'd expect. I imagine the suggestion is that performance vs effort tradeoff is task dependent.
energy123
6 hours ago
No it doesn't? It's worse than Opus across the whole shared frontier on both plots.
acchow
4 hours ago
Agreed. The graphs clearly show that opus 4.8 performs strictly better at the same cost per task
jsnell
4 hours ago
But they don't show "strictly better" performance at cost per task!
The graphs show parts of the cost/performance Pareto frontier occupied by Opus 4.8 and others occupied by Sonnet 5.0. If Opus 4.8 were strictly better at cost per task like you say, by definition the entire frontier would be occupied by Opus.
So neither is Pareto-dominant over the other. In contrast, Sonnet 5.0 is Pareto-dominant over Sonnet 4.6 on those graphs.
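For anyone who wants to check the frontier claim mechanically, here is a minimal sketch of computing a Pareto frontier over discrete (cost, score) points. The data is illustrative only: two points loosely follow figures quoted elsewhere in this thread (~$0.45 at 80% and ~$0.52 at 79%), and the rest are invented placeholders, not values read off the blog post's chart. `pareto_frontier` is a hypothetical helper name.

```python
# Pareto frontier over discrete benchmark points: lower cost is better,
# higher score is better. All data points are illustrative placeholders.

POINTS = [
    # (label, cost per task in $, score in %)
    ("sonnet-low",    0.10, 60),
    ("sonnet-medium", 0.20, 68),
    ("sonnet-high",   0.40, 74),
    ("sonnet-xhigh",  0.52, 79),
    ("opus-low",      0.30, 75),
    ("opus-medium",   0.45, 80),
    ("opus-high",     0.70, 84),
]

def pareto_frontier(points):
    """Return the points that no other point dominates, sorted by cost."""
    frontier = []
    for label, cost, score in points:
        dominated = any(
            # The other point is at least as good on both axes...
            (c <= cost and s >= score)
            # ...and strictly better on at least one.
            and (c < cost or s > score)
            for _, c, s in points
        )
        if not dominated:
            frontier.append((label, cost, score))
    return sorted(frontier, key=lambda p: p[1])

frontier_labels = [label for label, _, _ in pareto_frontier(POINTS)]
print(frontier_labels)
```

With this placeholder data the frontier contains the two cheapest Sonnet points and all three Opus points, which is the "neither dominates the other" situation: the cheap end has no Opus point to compete with, and the rest is Opus.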
energy123
3 hours ago
> by definition the entire frontier would be occupied by Opus.
But the entire frontier is occupied by Opus under any reasonable interpolation scheme (piecewise linear which is what they've done, and most reasonable spline or polynomial fits would also lead to the same result) over the overlapping x values for which both are defined.
Under that interpolation scheme, for x > ($ cost of Opus low effort), Opus is Pareto-dominant over Sonnet 5. You can see this by picking any point on Opus's interpolation and realizing that you get strictly worse by switching to Sonnet for the same x value or the same y value. Meaning if you want to pay the same $x then you get a worse y, or if you want the same y you pay more $x.
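The interpolation argument can be sketched as follows, under loud assumptions: the data points are illustrative placeholders (not read off the actual chart), `interp` is a hand-rolled piecewise-linear helper, and the comparison is restricted to the cost range where both curves have measurements. Because the difference of two piecewise-linear curves is itself piecewise linear, checking every breakpoint inside the overlap is enough.

```python
# Compare two piecewise-linear cost/score curves over their shared cost range.
# All data points are illustrative placeholders.

SONNET = [(0.10, 60), (0.20, 68), (0.40, 74), (0.52, 79)]  # (cost $, score %)
OPUS = [(0.30, 75), (0.45, 80), (0.70, 84)]

def interp(x, curve):
    """Piecewise-linear interpolation; refuses to extrapolate."""
    xs = [p[0] for p in curve]
    ys = [p[1] for p in curve]
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x is outside the measured range (that would be extrapolation)")
    for i in range(len(xs) - 1):
        x0, x1 = xs[i], xs[i + 1]
        if x0 <= x <= x1:
            return ys[i] + (ys[i + 1] - ys[i]) * (x - x0) / (x1 - x0)

# Shared cost range: from the larger of the two minimums to the smaller
# of the two maximums.
lo = max(SONNET[0][0], OPUS[0][0])
hi = min(SONNET[-1][0], OPUS[-1][0])

# Every breakpoint of either curve that falls inside the overlap.
checkpoints = sorted({x for x, _ in SONNET + OPUS if lo <= x <= hi})

# Score gap (Opus minus Sonnet) at each checkpoint.
gaps = [(x, interp(x, OPUS) - interp(x, SONNET)) for x in checkpoints]
opus_ahead_on_overlap = all(gap > 0 for _, gap in gaps)

print(f"overlap: ${lo:.2f} to ${hi:.2f}")
for x, gap in gaps:
    print(f"  at ${x:.2f}: Opus ahead by {gap:.2f} points")
print("Opus ahead on the whole overlap:", opus_ahead_on_overlap)
```

Note that the claim is only about the overlap. Below the cheapest Opus point `interp` raises instead of guessing, so nothing here says anything about the low-cost end where only Sonnet has data.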
jsnell
3 hours ago
I really don't get what you're proposing. The cost ranges do not overlap at the low end. You can't (by definition!) interpolate outside of the range.
If you mean extrapolate, at that point you're just making up data. The available effort levels are discrete and covered totally by the benchmarks. You can draw on the monitor with a sharpie to show a "ultra-low" effort level for Opus that scores better than Sonnet "low" at the same price, but it doesn't magic the ultra-low effort into actual existence.
(Anyway, the blog post now has an erratum and a graph that shows substantially better relative performance for Sonnet 5.0 than the original graph.)
energy123
3 hours ago
That's why I said "over the shared frontier" in my first post and more precisely in my second post I said "over the overlapping x values for which both are defined."
It was a claim that applies to a range of x-values where both curves are defined.
Of course if you go beyond those x-values where only one of the two are defined, then trivially the one that is defined constitutes the Pareto frontier in that region. Which is what I understand to be your point?
jsnell
2 hours ago
The post I was replying to said "performs strictly better at the same cost per task". That claim was obviously not true: there are costs where Opus cannot do the task and Sonnet can, so Opus can't be performing strictly better at the same cost. It seems that you agree that it is not true.
You could make it true by artificially dropping some of the data points, but, like, why?
(Again, this is moot given the updated graph.)
> Of course if you go beyond those x-values where only one of the two are defined, then trivially the one that is defined constitutes the Pareto frontier in that region.
Not so! It's only sound to do that at the low end of the cost axis (x) or the high end of the performance axis (y). You can't do it at the low end of the performance axis or the high end of the cost axis.
seiru
5 hours ago
Worth noting that the default chart there is for "agentic search performance", not coding. I didn't see an effort comparison for coding specifically.
booi
5 hours ago
I actually exclusively use Sonnet at the low effort level. It's too slow otherwise, and at higher effort levels it is strictly worse than Opus.
intellijdd
6 hours ago
I noticed that as well but with the introductory pricing, I wonder how true that is.
It would be great to see these charts with the promotional pricing just because it’s here for about two whole months.
I guess I could get Sonnet 5 to do it.
manojlds
5 hours ago
Opus 4.8 high doing better and cheaper than Sonnet 5 xhigh
partsch
2 hours ago
I feel like the charts have been adjusted. I am quite sure they looked different a couple of hours ago...
callahad
an hour ago
They've absolutely both changed. The initial version I saw didn't include max effort data points on the first chart, and the plot itself was much less favorable to Sonnet at high/xhigh relative to Opus, but the new chart shows them as closer competitors. Weird.
goldenarm
3 hours ago
It's funny the exact same thing happened to Gemini 3.5 flash. Cheaper and more agentic model that ends up worse and more expensive than 3.5 pro low.
al_borland
5 hours ago
What is a "task" in real-world terms? If it will be $15/million output tokens, and high/xhigh is somewhere in the $7.50/task range, does that mean a single task is using 500k tokens? That seems like it would start to add up fast.
wyre
5 hours ago
I’ve found input tokens are around 5x more than output, so a task could be a couple million thinking tokens and then a couple hundred thousand output tokens?
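The arithmetic in this exchange is easy to check. A small sketch, using the $15/million output price and the $7.50/task figure from the question; the $3/million input price is a made-up placeholder (no input price is given in the thread), and `output_tokens_per_task` is a hypothetical helper.

```python
# How many output tokens fit in a per-task budget?
# The input price used below is an assumed placeholder.

def output_tokens_per_task(cost_per_task, price_out_per_m,
                           price_in_per_m=0.0, input_ratio=0.0):
    """Output tokens affordable for cost_per_task dollars.

    input_ratio is the number of input tokens consumed per output token;
    the default of 0 means the whole budget goes to output tokens.
    """
    blended_price_per_m = price_out_per_m + input_ratio * price_in_per_m
    return cost_per_task * 1_000_000 / blended_price_per_m

# Output-only reading of the question: $7.50 at $15 per million output tokens.
output_only = output_tokens_per_task(7.50, 15.00)

# Same budget with 5x as many input tokens as output tokens,
# at an ASSUMED input price of $3 per million.
with_input = output_tokens_per_task(7.50, 15.00, price_in_per_m=3.00, input_ratio=5)

print(f"output only: {output_only:,.0f} output tokens")
print(f"with input:  {with_input:,.0f} output tokens "
      f"+ {5 * with_input:,.0f} input tokens")
```

Under the output-only assumption the 500k figure checks out. With the 5x input ratio and the assumed input price, the same $7.50 covers about 250k output tokens plus 1.25M input tokens, so the per-task token total depends heavily on how the spend splits between input and output.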
Natelinathan
5 hours ago
I just re-wrote the /code-review skill Anthropic ships to use Sonnet 4.6 for some tasks as it was using Opus for simple git diff commands and similarly mechanical tasks (launched 100+ agents for one of my diffs, cmon). I wonder how Sonnet 5 will impact my usage.
Does anyone else have any review token saving measures?
nicce
5 hours ago
> Opus always performs better for a given cost.
Assume it to get deprecated sooner rather than later.
windexh8er
an hour ago
Except for the fact that Opus 4.8 is not good. Constant hallucinations, doesn't use the web very intentionally until you explicitly ask it to and it nopes out rather quick on benign items. Anthropic has been very disappointing as of late. All of the gatekeeping is taking a toll on what should be some of the better models out there, but you can't trust 4.8 to go off on its own. It will burn down tokens doing what it deems correct as per its guidance. Truly painful to use.
lukan
an hour ago
"but you can't trust 4.8 to go off on its own."
And what (available) model do you trust to go off on its own?
make3
2 hours ago
it might be worth it if speed is an issue
ZeWaka
6 hours ago
It's very interesting. Why even release a new product that underperforms at the same price level? Why not just lock it?
I guess it's probably a lot cheaper for them to run, and it cuts costs for them. Seems disingenuous, though.