jampa
15 days ago
Slightly off topic, but does anyone feel that they nerfed Claude Opus?
It's screwing up even in very simple rebases. I got a bug where a value wasn't being retrieved correctly, and Claude's solution was to create an endpoint and use an HTTP GET from within the same back-end! Now it feels worse than Sonnet.
All the engineers I asked today have said the same thing. Something is not right.
eterm
15 days ago
That is a well recognised part of the LLM cycle.
A model or new model version X is released, everyone is really impressed.
3 months later, "Did they nerf X?"
It's been this way since the original chatGPT release.
The answer is typically no; it's just that your expectations have risen. What was previously a mind-blowing improvement is now expected, and any missteps feel amplified.
quentindanjou
15 days ago
This is not always true. LLMs do get nerfed, and quite regularly: usually because the provider discovers that users are using them more than expected, because of user abuse, or simply because the product attracts a larger user base. One recent nerf is the Gemini context window, which was drastically reduced.
What we need is an open and independent way of testing LLMs, and stricter regulation requiring disclosure of product changes when the product is paid for under a subscription or prepaid plan.
landl0rd
15 days ago
There's at least one site doing this: https://aistupidlevel.info/
Unfortunately, it has paywalled most of the historical data since I last looked at it, but it's interesting that Opus has dipped below Sonnet on overall performance.
dudeinhawaii
14 days ago
Interesting! I was just thinking about pinging the creator of simple-bench.com and asking whether they intend to re-benchmark models after 3 months. I've noticed, in particular, Gemini models dramatically dropping in quality after the initial hype cycle. Gemini 3 Pro _was_ my top performer and has slowly declined to "is it even worth asking", complete with gpt-4o style glazing. It's been frustrating. I had been working on a very custom benchmark, and over the course of it Gemini 3 Pro and Flash both started underperforming by 20% or more. I wondered if I had subtly broken my benchmark, but ultimately I started seeing the same behavior in general online queries (Google AI Studio).
Analemma_
15 days ago
> What we need is an open and independent way of testing LLMs
I mean, that's part of the problem: as far as I know, no claim of "this model has gotten worse since release!" has ever been validated by benchmarks. Obviously benchmarking models is an extremely hard problem, and you can try and make the case that the regressions aren't being captured by the benchmarks somehow, but until we have a repeatable benchmark which shows the regression, none of these companies are going to give you a refund based on your vibes.
judahmeek
14 days ago
How hard is benchmarking models actually?
We've got a lot of available benchmarks & modifying at least some of those benchmarks doesn't seem particularly difficult: https://arc.markbarney.net/re-arc
To reduce cost & maintain credibility, we could have the benchmarks run through a public CI system.
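For concreteness, here is the kind of pinned regression check I have in mind, runnable on a CI schedule. A rough sketch only: the prompt cases, scoring, and model id are placeholder assumptions; the one real dependency is the Anthropic messages API.

    # Rough sketch: a pinned prompt set scored on every scheduled CI run,
    # so score drift over time shows up in public logs.
    # The cases and model id below are illustrative placeholders.
    import json
    import os

    from anthropic import Anthropic

    client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

    # Fixed prompt/expected-answer pairs checked into the repo so every run is comparable.
    CASES = [
        {"prompt": "What is 17 * 24? Reply with the number only.", "expect": "408"},
        {"prompt": "What HTTP status code means 'Not Found'? Number only.", "expect": "404"},
    ]

    def run_case(model: str, case: dict) -> bool:
        reply = client.messages.create(
            model=model,
            max_tokens=64,
            messages=[{"role": "user", "content": case["prompt"]}],
        )
        text = "".join(b.text for b in reply.content if b.type == "text")
        return case["expect"] in text

    def main() -> None:
        model = os.environ.get("BENCH_MODEL", "claude-opus-4-5")  # assumed model id
        score = sum(run_case(model, c) for c in CASES) / len(CASES)
        print(json.dumps({"model": model, "score": score}))

    if __name__ == "__main__":
        main()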
What am I missing here?
Maxious
15 days ago
Except the time it got bad enough that Anthropic had to acknowledge it? Which also revealed they don't have monitoring for this?
https://www.anthropic.com/engineering/a-postmortem-of-three-...
jampa
15 days ago
I usually agree with this. But I am using the same workflows and skills that used to be a breeze for Claude, and now they cause it to run in circles and require intervention.
This is not the same thing as "omg, the vibes are off": it's reproducible. I am using the same prompts and files and getting far worse results than from any other model.
eterm
15 days ago
When that once happened to me in a really bad way, I discovered I had written something wildly incorrect in the README.
It has a habit of trusting documentation over the actual code itself, causing no end of trouble.
Check your claude.md files (both the project-level one and the user-level one in ~) too; there could be something lurking there.
Or maybe it has horribly regressed, but that hasn't been my experience, certainly not back to Sonnet levels of needing constant babysitting.
F7F7F7
15 days ago
I’m a x20 Max user who’s on it daily. Unusable the last 2 days. GLM in OpenCode and my local Qwen were more reliable. I wish I was exaggerating.
mrguyorama
15 days ago
Also, people who were lucky and had lots of success early on, but then start to run into the actual problems of LLMs, will experience that as "it was good and then it got worse" even when it didn't actually get worse.
If LLMs have a 90% chance of working, there will be some who have only success and some who have only failure.
People are really failing to understand the probabilistic nature of all of this.
"You have a radically different experience with the same model" is perfectly possible with less than hundreds of thousands of interactions, even when you both interact in comparable ways.
olao99
14 days ago
Just because it's been true in the past doesn't mean it will always be the case.
ojr
14 days ago
Opus is a non-deterministic probability machine: past, present, and for the foreseeable future. The variance eventually shows up when you push it hard.
spike021
15 days ago
Eh, I've definitely had issues where Claude can no longer easily do what it's previously done. That's with constantly documenting things well in the appropriate markdown files and resetting context here and there to keep confusion to a minimum.
F7F7F7
15 days ago
I don’t care what anyone says about the cycle, or anyone implying that it’s all in our heads. It’s bad bad.
I’m a Max x20 user who had to stop using it this week. Opus was regularly failing on the most basic things.
I regularly use the front-end skill to pass in mockups, and Opus was always pixel-perfect. This last week it seemed like the skill had no effect.
I don’t think they are purposely nerfing it but they are definitely using us as guinea pigs. Quantized model? The next Sonnet? The next Haiku? New tokenizing strategies?
ryanar
14 days ago
I noticed that this week. I have a very straightforward Claude command that lists exact steps to follow to fetch PR comments and bring them into the context window. Stuff like: step one, call gh pr view my/repo, and it would call it with anthropiclabs/repo instead; it wouldn’t follow all the instructions; it wouldn’t run the exact command I had written. I pointed out the mistake and it went "oh, you are right!", then proceeded to make the same mistake again.
I used this command with Sonnet 4.5 too and never had a problem until this week. Something changed, either in the harness or the model. This is not just vibes. Workflows I have run hundreds of times have stopped working with Opus 4.5.
kachapopopow
15 days ago
They're A/B testing on the latest Opus model; sometimes it's good, sometimes it's worse than Sonnet, which is annoying as hell. I think they trigger it when you have excessive usage or high context use.
hirako2000
15 days ago
Or maybe when usage is low, so that we try again.
Or maybe when usage is high they tweak a setting that uses the cache when it shouldn't.
For all we know they run whatever experiments they want: to demonstrate theoretically better margins, or to analyse user patterns when a performance drop occurs.
Given what is done in other industries that don't even face an existential issue, it wouldn't surprise me if, in a few years, some whistleblowers tell us what's been going on.
root_axis
15 days ago
This has been said about every LLM product from every provider since ChatGPT4. I'm sure nerfing happens, but I think the more likely explanation is that humans have a tendency to find patterns in random noise.
measurablefunc
15 days ago
They are constantly trying to reduce costs, which means they're constantly trying to distill and quantize the models to reduce the energy cost per request. The models are constantly being "nerfed"; the reduction in quality is a direct result of seeking profitability. If they can charge you $200 but use only half the energy, they pocket the difference as profit. Otherwise they are paying more to run their workloads than you are paying them, which means every request loses them money. Nerfing is inevitable; the only questions are how much it reduces response quality and what their customers are willing to put up with.
landl0rd
15 days ago
I've observed the same random foreign-language characters (Chinese or Japanese, I believe) interspersed without rhyme or reason that I've come to expect from low-quality, low-parameter-count models, even while using "Opus 4.5".
An upcoming IPO increases pressure to make financials look prettier.
boringg
13 days ago
I've seen this too and ignored it. Weird.
mrstern
14 days ago
I've noticed a significant drop in Opus' performance in Claude Code since last week. It's more about "reasoning" than syntax. Feels more like Sonnet 4.1 than Opus 4.5.
epolanski
15 days ago
Not really.
In fact, as my prompts and documents get better, it seems to do increasingly better.
Still, it can't replace a human: I regularly need to correct it, and if I try to one-shot a feature I always end up spending more time refactoring it a few days later.
It's still a huge boost to productivity, but the point where it can take over without detailed info and oversight is far away.
cap11235
15 days ago
Show evals plz