nsingh2
8 hours ago
Oh, this seems bad, and it's fairly easy to reproduce using codex cli. Give it a puzzle prompt that it has to reason about and solve; occasionally it will seemingly short-circuit, think for exactly 516 tokens, and return the wrong result. When it ends up using 6000-8000 thinking tokens, it returns the correct result.
Maybe some issue with adaptive thinking? Another point for local models I guess, don't have to worry about silent server side changes.
Edit: To follow up, it seems to happen quite often. Out of 10 runs of the exact same prompt, 4 had this 516 thinking token issue, and every one of those had the wrong solution. So nearly half the time, 5.5 xhigh could be short-circuiting and degrading performance. Granted, the sample size is small.
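For a rough sense of how little 10 runs pins down, here is a minimal sketch of a 95% Wilson score interval for the 4/10 figure above. The function name and the choice of the Wilson interval are my own assumptions for illustration, not anything codex cli reports.

```python
import math


def wilson_interval(hits: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = hits / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# 4 short-circuited runs out of 10, as in the edit above.
low, high = wilson_interval(4, 10)
print(f"short-circuit rate: {low:.0%} to {high:.0%}")
```

With 4 out of 10 this comes out to roughly 17% to 69%, so the real rate could be much lower or higher than "nearly half". More runs would narrow it.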
postalcoder
4 hours ago
You still have to worry about misconfigured local models. Even the professionals get it wrong, which is why local model performance is uneven across providers.
miki123211
25 minutes ago
And to add insult to injury, some providers will ride on the good reputation of some local model, selling you a terrible quant instead.
With OpenAI, at least my gpt-5.5 is the same as your gpt-5.5. You can't say that about glm for example.
embedding-shape
9 minutes ago
> And to add insult to injury, some providers will ride on the good reputation of some local model, selling you a terrible quant instead.
I just started using OpenRouter for some control testing of local models, and what surprises me the most isn't that different providers serve different quantization levels. That makes sense. It's that I can't seem to find a way of seeing which provider+model+quantization is actually used?! https://openrouter.ai/models shows the models, then say https://openrouter.ai/moonshotai/kimi-k2.7-code shows the providers, but when I go to https://openrouter.ai/moonshotai/kimi-k2.7-code?endpoint=e7a... for example, why on earth isn't it showing the actual details about the actual weights they're serving?! It does have a "Precision" value that is sometimes filled out, but that seems to be a guess at best; even providers with the same value there give responses of wildly different quality.
I like the idea of OpenRouter, but holy hell does the implementation seem very far off from what it needs to be in order to be useful.
jdiff
2 hours ago
But in that case you have nobody but yourself to blame, and you can stabilize things yourself at any time by refraining from making any changes. You won't be surprised by a provider. Honestly? That's not just valuable—it's essential.
LiamPowell
2 hours ago
> Honestly? That's not just valuable—it's essential.
I'm curious if you wrote this or had an LLM write it.
I'm genuinely curious, to be clear, as I don't see why anyone would bother to go through an LLM to write such a short reply. Have we reached the point where Claudeisms that are this obnoxious have become part of regular speech?
topynate
an hour ago
I've noticed them trying to creep into my writing. It doesn't help that I was a heavy em-dash user ten years before GPT-3.
UqWBcuFx6NV4r
2 hours ago
LiamPowell
an hour ago
Sure, but it doesn't really fit there as a joke, it looks like it's just meant to be part of what they were trying to say.
subscribed
26 minutes ago
I also think it's a joke: it starts with the response / argument, and then flows into a tongue-in-cheek joke about the core issue of the post (Llama)
dannyw
5 hours ago
I wonder if testing at different times/days would show patterns? For example, whether the short-circuiting happens more often during workday peak hours.
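Something like this could tally it, as a rough sketch. It assumes you log each run yourself as a (timestamp, thinking token count) pair; that log format and the function name are hypothetical, and the only number taken from the thread is the 516-token marker.

```python
from collections import defaultdict
from datetime import datetime

# Thinking-token count reported upthread for the short-circuited runs.
SHORT_CIRCUIT_TOKENS = 516


def short_circuit_rate_by_hour(runs: list[tuple[datetime, int]]) -> dict[int, float]:
    """Return the fraction of short-circuited runs for each hour of day (0-23)."""
    counts: dict[int, list[int]] = defaultdict(lambda: [0, 0])  # hour -> [short, total]
    for started_at, thinking_tokens in runs:
        bucket = counts[started_at.hour]
        bucket[1] += 1
        if thinking_tokens == SHORT_CIRCUIT_TOKENS:
            bucket[0] += 1
    return {hour: short / total for hour, (short, total) in sorted(counts.items())}


# Made-up log entries, only to show the shape of the input.
example_runs = [
    (datetime(2025, 1, 6, 9, 5), 516),
    (datetime(2025, 1, 6, 9, 40), 7200),
    (datetime(2025, 1, 6, 22, 15), 6400),
]
print(short_circuit_rate_by_hour(example_runs))
```

You'd want a lot of runs per hour before reading anything into the differences, given how noisy 10 runs already are.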