sosodev
a day ago
I really hope this doesn't hinder development too much. As Simon says, Qwen3.5 is very impressive.
I've been testing Qwen3.5-35B-A3B over the past couple of days and it's a very impressive model. It's the most capable agentic coding model I've tested at that size by far. I've had it writing Rust and Elixir via the Pi harness and found that it's very capable of handling well defined tasks with minimal steering from me. I tell it to write tests and it writes sane ones ensuring they pass without cheating. It handles the loop of responding to test and compiler errors while pushing towards its goal very well.
misnome
a day ago
I've been playing with 3.5:122b on a GH200 the past few days for rust/react/ts, and while it's clearly sub-Sonnet, with tight descriptions it can get small-medium tasks done OK - as well as Sonnet if the scope is small.
The main quirk I've found is that it has a tendency to decide halfway through following my detailed instructions that it would be "simpler" to just... not do what I asked, and I find it has stripped all the preliminary support infrastructure for the new feature out of the code.
sheepscreek
a day ago
That sounds awfully similar to what Opus 4.6 does on my tasks sometimes.
> Blah blah blah (second-guesses its own reasoning half a dozen times, then goes). Actually, it would be simpler to just ...
Specifically on Antigravity, I've noticed it doing that trying to "save time" to stay within some artificial deadline.
It might have something to do with the system messages and the reinforcement/realignment messages that are interwoven into the context (but never displayed to end-users) to keep the agents on task.
jtonz
21 hours ago
As someone that started using Co-work, I feel like I am going insane with the frequency that I have to keep telling it to stay on task.
If you ask it to do something laborious, like reviewing a bunch of websites for specific content, it will constantly give up, instead providing you with information on how to continue the process yourself to save time. It's maddening.
zzrrt
21 hours ago
That’s pretty funny when compared with the rhetoric like “AI doesn’t get tired like humans.” No, it doesn’t, but it roleplays like it does. I guess there is too much reference to human concerns like fatigue and saving effort in the training.
martin-t
20 hours ago
This is what happens when a bunch of billionaires convince people autocomplete is AI.
Don't get me wrong, it's very good autocomplete and if you run it in a loop with good tooling around it, you can get interesting, even useful results. But by its nature it is still autocomplete and it always just predicts text. Specifically, text which is usually about humans and/or by humans.
selcuka
18 hours ago
You are not wrong, but after having started working with LLMs, I have this feeling that many humans are simply autocomplete engines too. So LLMs might actually be close to AGI, if you define "general" as "more than 50% of the population".
goodmythical
4 hours ago
Humans are absolutely auto-complete engines, and regularly produce incorrect statements and actions with full confidence in it being precisely correct.
Just think about how many thousands of times you've heard "good morning" after noon both with and without the subsequent "or I guess I should say good afternoon" auto-correct.
jrumbut
18 hours ago
Well, the essence of software engineering is taking these complex real-world tasks and breaking them down into simpler parts until they can be done by (conceptually) simple digital circuits.
So it's not surprising that eventually autocomplete can reach up from those circuits and take on some tasks that have already been made simple enough.
I think what's so interesting is how uneven that reach is. On some tasks it is better than at least 90% of devs and maybe even superhuman (by which, in this case, I mean better than any single human; I've never seen an LLM do something that a small team couldn't do better if given a reasonable amount of time). In other cases actual old-school autocomplete might do a better job: the extra capabilities added up to negative value and its presence was a distraction.
Sometimes there is an obvious reason why (solving a problem with lots of example solution online vs working with poorly documented proprietary technologies), but other times there isn't. They certainly have raised the floor somewhat, but the peaks and valleys remain enormous which is interesting.
To me that implies there is both lots of untapped potential and challenges the LLM developers have not even begun to face.
root_axis
19 hours ago
Yep. The veil of coherence extends convincingly far by means of absurd statistical power, but the artifacts of next-token prediction become far more obvious when you're running models that can work on commodity hardware.
justinclift
11 hours ago
> As someone that started using Co-work, I feel like I am going insane with the frequency that I have to keep telling it to stay on task.
Used to have the same thing happening when using Sonnet or Opus via Windsurf.
After switching to Claude Code directly though (and using "/plan" mode), this isn't a thing any more.
So I reckon the problem is in some of these UIs/harnesses, and probably isn't in the models they're sending the data to. Windsurf, for example, we no longer use due to the inferior results.
shinycode
13 hours ago
I found it better to split the work into smaller tasks from a first overall analysis, make it do only that subtask, and have it give me the next prompt once finished (or feed that to a system of agents). There is a real threshold beyond which quality is lost.
bandrami
20 hours ago
It really is like having an intern, then
throwup238
21 hours ago
In my experience all of the models do that. It's one of the most infuriating things about using them, especially when I spend hours putting together a massive spec/implementation plan and then have to sit there babysitting it going "are you sure phase 1 is done?" and "continue to phase 2"
I tend to work on things where there is a massive amount of code to write but once the architecture is laid down, it's just mechanical work, so this behavior is particularly frustrating.
dripdry45
15 hours ago
I hope you will excuse my ignorance on this subject; as a learning question for me: is it possible to add what you put there as an absolute condition, an overarching mandate that all available functions and data are present, so it's simply plug and chug?
elcritch
13 hours ago
Recently it seems that even if you add those conditions, the LLMs will tend to ignore them. So you have to repeatedly prompt them. Sometimes strong or emphatic language will help them keep it "in mind".
girvo
13 hours ago
Glad it's not just me then, it's been driving me slightly batty.
beepbooptheory
15 hours ago
Why keep using it then? I simply still read websites. It's not always great but sounds better than whatever that weird dynamic is!
wood_spirit
a day ago
Yeah that happened to me with Claude code opus 4.6 1M for the first time today. I had to check the model hadn’t changed. It was weird. I was imagining that maybe anthropic have a way of deciding how much resource a user actually gets and they had downgraded me suddenly or something.
e1g
a day ago
Claude Code recently downgraded the default thinking level to “medium”, so it’s worth checking your settings.
joecool1029
14 hours ago
recently being within the past 24 hours lol: https://github.com/anthropics/claude-code/releases/tag/v2.1....
darkwater
13 hours ago
> Re-introduced the "ultrathink" keyword to enable high effort for the next turn
Doh.
nekitamo
a day ago
Thank you. The difference was quite noticeable today.
wood_spirit
14 hours ago
Thank you thank you you give me hope :)
But how do you see the current thinking level, and how do you change it? I've been clicking around and searching and adding "effortLevel": "high" to .claude/settings.json, but I have no idea if this actually has any effect, etc.
varshar
12 hours ago
As per Anthropic support (for Mac and Linux respectively) -
$ echo 'export ANTHROPIC_EFFORT="high"' >> ~/.zshrc
$ source ~/.zshrc
$ echo 'export ANTHROPIC_EFFORT="high"' >> ~/.bashrc
$ source ~/.bashrc
I prefer settings.json (VSCode) -
"claudeCode.environmentVariables": [
  { "name": "ANTHROPIC_MODEL", "value": "claude-opus-4-6" },
  { "name": "CLAUDE_CODE_EFFORT_LEVEL", "value": "high" }
],
...
nnoremap
7 hours ago
Or the 2026 version: 'Hey Claude set your thinking level to high.'
jasonjmcghee
6 hours ago
I've found antigravity to be completely unusable.
It's amazing how much foundational prompting and harness matters.
mavamaarten
4 hours ago
Haha yeah I've had this happen to me too (inside copilot on GitHub). I ask it to make a field nullable, and give it some pointers on how to implement that change.
It just decided halfway that, nah, removing the field altogether means you don't have to fix the fallout from making that thing nullable.
Lmao.
varispeed
4 hours ago
Opus 4.6 found in my documentation how to flash the device and wanted to be clever and helpful and flash it for me after doing a series of fixes. I've got used to approving commands and missed that one. So it bricked the device. Then I wrote extra instructions saying flashing of any kind is forbidden. A few days later it did it again and apologised...
storus
a day ago
> to decide halfway through following my detailed instructions that it would be "simpler" to just... not do what I asked
That's likely coming from the 3:1 ratio of linear to quadratic attention usage. The latest DeepSeek also suffers from it, which the original R1 never exhibited.
nl
15 hours ago
There is no way you can diagnose this like that. Correlation isn't causation and much more likely is a common source of reinforcement training data.
shaan7
a day ago
> that it would be "simpler" to just... not do what I asked
That sounds too close to what I feel on some days xD
reactordev
a day ago
Turn down the temperature and you’ll see less “simpler” short cuts.
smokel
a day ago
For the uninitiated: Interestingly, it is not advisable to take this to the extreme and set temperature to 0.
That would seem logical, as the results are then completely deterministic, but it turns out that a suboptimal token may result in a better answer in the long run. Also, allowing for a little bit of noise gives the model room to talk itself out of a suboptimal path.
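A minimal sketch of what temperature actually does at sampling time (illustrative only, not any particular inference engine's implementation): the logits are divided by the temperature before the softmax, so T approaching 0 collapses to greedy argmax, while a modest T lets a near-tied runner-up token through occasionally.

```python
import numpy as np

def sample_token(logits, temperature, rng):
    """Sample a token index from logits with temperature scaling.

    temperature == 0 means greedy argmax (fully deterministic choice);
    higher temperature flattens the distribution, letting
    lower-probability tokens through.
    """
    if temperature == 0:
        return int(np.argmax(logits))  # always the same token
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()             # subtract max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
logits = [2.0, 1.9, 0.5]  # two near-tied tokens, one unlikely one

# T=0 picks token 0 every time; T=0.8 sometimes picks the runner-up,
# which is the "room to talk itself out of a suboptimal path".
greedy = {sample_token(logits, 0, rng) for _ in range(100)}
warm = {sample_token(logits, 0.8, rng) for _ in range(100)}
```

With the near-tied logits above, the greedy set contains only token 0, while the T=0.8 set contains more than one token.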
mejutoco
7 hours ago
Setting the temperature to zero does not make the LLM fully deterministic, although it is close.
LoganDark
a day ago
I like to think of this like tempering the output space. With a temperature of zero, there is only one possible output and it may be completely wrong. With even a low temperature, you drastically increase the chances that the output space contains a correct answer, through containing multiple responses rather than only one.
I wonder if determinism will be less harmful to diffusion models because they perform multiple iterations over the response rather than having only a single shot at each position that lacks lookahead. I'm looking forward to finding out and have been playing with a diffusion model locally for a few days.
reactordev
a day ago
Yup. I think of it as how off the rails do you want to explore?
For creative things or exploratory reasoning, a temperature of 0.8 leads us down all sorts of excursions into the rabbit hole. However, when coding and needing something precise, a temperature of 0.2 is what I use. If I don't like the output, I'll rephrase or add context.
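In practice this kind of per-task temperature is just a field in the request to an OpenAI-compatible endpoint (llama-server, LM Studio, etc.). A small sketch, assuming a hypothetical helper and placeholder model name (neither is from the thread):

```python
# Hypothetical helper: choose a sampling temperature by task type
# when building a request for an OpenAI-compatible chat endpoint.
# The model name is a placeholder, not a confirmed identifier.

TEMPERATURES = {
    "code": 0.2,      # precise, low-variance output
    "creative": 0.8,  # exploratory, more excursions
}

def build_request(task_type, prompt, model="qwen3.5-35b-a3b"):
    return {
        "model": model,
        "temperature": TEMPERATURES.get(task_type, 0.7),  # sensible default
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("code", "Write a function that reverses a string.")
```

The payload can then be POSTed to the server's /v1/chat/completions route with any HTTP client.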
slices
a day ago
I've seen behavior like that when the model wasn't being served with sufficiently sized context window
Aurornis
a day ago
> The main quirk I've found is that it has a tendency to decide halfway through following my detailed instructions that it would be "simpler" to just... not do what I asked,
This is my experience with the Qwen3-Next and Qwen3.5 models, too.
I can prompt with strict instructions saying "** DO NOT..." and it follows them for a few iterations. Then it has a realization that it would be simpler to just do the thing I told it not to do, which leads it to the dead end I was trying to avoid.
soulofmischief
10 hours ago
Claude Opus does this constantly for me, no matter how I prompt it or what is in my AGENTS.md, etc. It is the bane of my existence.
abhikul0
a day ago
Are you running it locally with llama.cpp? If so, is it working without any tweaking of the chat template? The tool calls fail for me when using the default chat template, however it seems to work a whole lot better with this: https://huggingface.co/Qwen/Qwen3.5-35B-A3B/discussions/9#69...
sosodev
19 hours ago
I’ve been running it via llama-server with no issues. Running the latest Bartowski 6-bit quant
brightball
18 hours ago
Bartowski? Like Chuck Bartowski from the TV show?
BoredomIsFun
13 hours ago
Different one. Bartowski is a minor celebrity in the local LLM world, together with Unsloth.
Balinares
4 hours ago
What's the selling point of these quants vs the Unsloth ones?
abhikul0
16 hours ago
Thanks, i'll check his quants.
arcanemachiner
a day ago
Have you tried the '--jinja' flag in llama-server?
abhikul0
a day ago
Yes, it fails too. I’m using the unsloth q4_km quant. Similarly fails with devstral2 small too, fixed that by using a similar template i found for it. Maybe it’s the quants that are broken, need to redownload I guess.
Twirrim
a day ago
I've been testing the same with some Rust, and it has spent a fair bit of time going through an infinite-seeming loop before finally unjamming itself. It seems a little more likely to jam up than some other models I've experimented with.
It's also driving itself crazy with deadpool & deadpool-r2d2 that it chose during planning phase.
That said, it does seem to be doing a very good job in general, the code it has created is mostly sane other than this fuss over the database layer, which I suspect I'll have to intervene on. It's certainly doing a better job than other models I'm able to self-host so far.
Aurornis
a day ago
> it has spent a fair bit of time going through an infinite-seeming loop before finally unjamming itself.
I think this is part of the model’s success. It’s cheap enough that we’re all willing to let it run for extremely long times. It takes advantage of that by being tenacious. In my experience it will just keep trying things relentlessly until eventually something works.
The downside is that it’s more likely to arrive at a solution that solves the problem I asked but does it in a terribly hacky way. It reminds me of some of the junior devs I’ve worked with who trial and error their way into tests passing.
I frequently have to reset it and start it over with extra guidance. It’s not going to be touching any of my serious projects for these reasons but it’s fun to play with on the side.
sosodev
a day ago
Some of the early quants had issues with tool calling and looping. So you might want to check that you're running the latest version / recommended settings.
misnome
a day ago
> and it has spent a fair bit of time going through an infinite-seeming loop before finally unjamming itself
I can live with this on my own hardware. Whereas Opus 4.6 has developed this tendency where it will happily chew through the entire 5-hour allowance on the first instruction, going in endless circles. I've stopped using it for anything except the extreme planning now.
cbm-vic-20
a day ago
I don't know much about how these models are trained, but is this behavior intentional (ie, the people pulling the levers knew that this is how it would end up), or is it emergent (ie, pulling the levers to see what happens)?
anana_
a day ago
I've had even better results using the dense 27B model -- less looping and churning on problems
androiddrew
21 hours ago
Which dense model are you referring to? The dense model isn’t finetuned for code instruction according to the model card.
anana_
20 hours ago
https://huggingface.co/Qwen/Qwen3.5-27B
I wasn't aware of that, which page mentions that?
zerebos
16 hours ago
Yeah the page you linked even shows the benchmarks in coding for this model, so I'd be curious where that claim comes from
nu11ptr
a day ago
What hardware do you have it running on? Do you feel you could replace the frontier models with it for everyday coding? Would/will you?
sosodev
a day ago
Around 20ish tokens a second with 6-bit quant at very long context lengths on my AMD AI Max 395+
I’m trying to use local models whenever possible. Still need to lean on the frontier models sometimes.
politelemon
a day ago
60 to 70 on a 5080, but only tinkering for now. The smaller models seem exceptionally good for what they are, and some can even do OCR reliably.
bigyabai
a day ago
I'm getting ~30 tok/s on the A3B model with my 3070 Ti and 32k context.
> Do you feel you could replace the frontier models with it for everyday coding? Would/will you?
Probably not yet, but it's really good at composing shell commands. For scripting or one-liner generation, the A3B is really good. The web development skills are markedly better than Qwen's prior models in this parameter range, too.
jasonjmcghee
6 hours ago
That seems oddly low / slower by a fair amount than i get on my m4. (I believe it was ~45 tok/s?)
What quant are you using? How much ram does it have?
paoliniluis
a day ago
what's your take between Qwen3.5-35B-A3B and Qwen3-Coder-Next?
sosodev
a day ago
In my experience Qwen3.5 is better even at smaller distillations. From what I understand the Qwen3-next series of models was just a test/preview of the architectural changes underpinning Qwen3.5. So Qwen3.5 is a more complete and well trained version of those models.
kamranjon
a day ago
In my experience Qwen3 Coder Next is better. I ran quite a few tests yesterday and it was much better at utilizing tool calls properly and understanding complex code. For its size, though, 3.5 35B was very impressive. Coder Next is an 80B model, so I think it's just a size thing - also, for whatever reason, Coder Next is faster on my machine. The only model that is competitive in speed is GLM 4.7 Flash.
xrd
a day ago
What do you use as the orchestrator? By this I mean opencode, or the like. Is that the right term?
simonw
a day ago
I use the term "harness" for those - or just "coding agent". I think orchestrator is more appropriate for systems that try to coordinate multiple agents running at the same time.
This terminology is still very much undefined though, so my version may not be the winning definition.
kamranjon
a day ago
I'm basically using the agentic features of the Zed editor: https://zed.dev/agentic
It's really easy to set up with any OpenAI-compatible API. I self-host Qwen Coder 3 Next on my personal MBP using LM Studio and just dial in from my work laptop with Zed and Tailscale so I can connect from wherever I might be. It's able to do all sorts of things like run linting checks and tests, look for issues, refactor code, and create files. I'm definitely still learning, but it's a pretty exciting jump from just talking to a chat bot and copying and pasting things manually.
nvader
a day ago
Another vote in favour of "harness".
I'm aligning on Agent for the combination of harness + model + context history (so after you fork an agent you now have two distinct agents)
And orchestrator means the system to run multiple agents together.
Zetaphor
15 hours ago
This has also been my understanding of all of these terms so far
nekitamo
13 hours ago
In my tests, Qwen3.5-35B-A3B is better, there is no comparison. Better tool calling and reasoning than Qwen3-Coder-Next for Html/Js coding tasks of medium size. Beware the quants and llama.cpp settings, they matter a lot and you have to try out a bunch of different quants to find one with acceptable settings, depending on your hardware.
karmakaze
a day ago
We don't have a Qwen3.5-Coder to compare with, but there is a chart comparing Qwen3.5 to Qwen3 including Qwen3-Next[0].
[0] https://www.reddit.com/r/LocalLLaMA/comments/1rivckt/visuali...
a3b_unknown
a day ago
What is the meaning of 'A3B'?
simonw
a day ago
It's the number of active parameters for a Mixture of Experts (misleading name IMO) model.
Qwen3.5-35B-A3B means that the model itself consists of 35 billion floating point numbers - very roughly 35GB of data at 8-bit precision - which are all loaded into memory at once.
But... on any given pass through the model weights, only 3 billion of those parameters are "active", aka have matrix arithmetic applied against them.
This speeds up inference considerably because the computer has to do fewer operations for each token that is processed. It still needs the full amount of memory though, as the 3B active parameters it uses are likely different on every iteration.
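The routing described above can be sketched in a few lines. This is a toy illustration of top-k Mixture-of-Experts routing (tiny made-up shapes, not Qwen's actual architecture): all expert matrices stay resident, but each token only multiplies through the k experts a small router selects.

```python
import numpy as np

# Toy MoE routing sketch. Shapes and expert count are invented for
# illustration; real models interleave this inside transformer layers.
rng = np.random.default_rng(0)
d, n_experts, k = 64, 8, 2

experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]  # all in memory
router = rng.standard_normal((d, n_experts))

def moe_forward(x):
    scores = x @ router                  # router scores every expert per token
    top_k = np.argsort(scores)[-k:]      # indices of the k highest-scoring experts
    weights = np.exp(scores[top_k])
    weights /= weights.sum()             # softmax over only the chosen experts
    # Only these k expert matrices are "active" for this token.
    out = sum(w * (x @ experts[i]) for w, i in zip(weights, top_k))
    return out, top_k

x = rng.standard_normal(d)
y, used = moe_forward(x)

active = k * d * d           # parameters actually multiplied for this token
total = n_experts * d * d    # parameters resident in memory
```

Here `active` is a quarter of `total`, mirroring (at toy scale) how a 35B-total model can do roughly 3B parameters' worth of arithmetic per token.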
zozbot234
a day ago
It will benefit from the full amount of memory for sure, but AIUI if you use system memory and mmap for your experts, you can execute the model with only enough memory for the active parameters; it's just unbearably slow since it has to swap in new experts for every token. So the more memory you have in excess of that, the more inactive but often-used experts can be kept in RAM for better performance.
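The mmap idea can be illustrated in miniature (this is not a real inference engine, just the mechanism): expert weights live in one file on disk, a memory map gives an array view of it, and the OS pages in only the slices you actually touch, so resident memory tracks the experts used rather than the whole file.

```python
import os
import tempfile
import numpy as np

# Toy illustration of memory-mapped expert weights. Sizes are invented.
n_experts, d = 8, 256
path = os.path.join(tempfile.mkdtemp(), "experts.bin")

# Write all expert weights to disk once.
weights = np.arange(n_experts * d * d, dtype=np.float32).reshape(n_experts, d, d)
weights.tofile(path)

# Map the file read-only; no expert is loaded into RAM yet.
mapped = np.memmap(path, dtype=np.float32, mode="r", shape=(n_experts, d, d))

def run_expert(i, x):
    # Reading mapped[i] faults in only that expert's pages;
    # "cold" experts stay on disk until first use.
    return x @ mapped[i]

x = np.ones(d, dtype=np.float32)
y = run_expert(3, x)
```

The page cache then keeps recently used experts warm, which is the "almost all of it in RAM" middle ground discussed below.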
EnPissant
a day ago
The ability to stream weights from disk has nothing to do with MoE or not. You can always do this. It will be unusable either way.
zozbot234
a day ago
Agreed, but for a dense model you'd have to stream the whole model for every token, whereas with MoE there's at least the possibility that some experts may be "cold" for any given request and not be streamed in or cached. This will probably become more likely as models get even sparser. (The "it's unusable" judgment is correct if you're considering close-to-minimum requirements, but for just getting a model to fit, caching "almost all of it" in RAM may be an excellent choice.)
EnPissant
18 hours ago
Unlike offloading weights from VRAM to system RAM, I just can't see a situation where you would want to offload to an SSD. The difference is just too large, and any model too large to run in system RAM is probably going to be unusable except in VRAM anyway.
zozbot234
8 hours ago
Unusable for anything like realtime response, yes. Might be usable and even quite sensible to power less-than-realtime uses on much cheaper inference platforms, as long as the slow storage bandwidth doesn't overly bottleneck compute.
whalesalad
a day ago
What hardware are you running this on?
Zetaphor
15 hours ago
I'm running this exact same setup on a Framework Desktop and I'm seeing ~30 tokens/second