senko
14 hours ago
I ran the Q4 quant (used with llama.cpp) through my "minesweeper" vibe-coding benchmark: https://senko.net/vibecode-bench/2026/minesweeper-gamma-4-12...
The result is decent, but it had a few bizarre/trivial syntax errors I had to fix manually: it added an extra closing bracket or paren a few times, and wanted to separate function definitions with a comma. Not sure what that was about, but otherwise the output ran just fine.
So, with those qualifiers, I think it's a decent local coding model. It roughly compares with GPT-4.1 (!!), released 14 months ago, on the output: https://senko.net/vibecode-bench/2025/minesweeper-gpt-4.1.ht... (actually I'd call it better, but those syntax errors...)
I ran the quantized version (4-bit GGUF) on my consumer-grade card with 12G of VRAM and got 5t/s for output. Too slow for interactive coding, but a fairly capable model.
To me, it's fascinating how much progress we got in over a year. GPT-4.1 was considered an extremely capable coding model. Now we got something with 12B of params performing roughly the same (in this specific benchmark, disclaimers, etc).
Lists of various models I tested: https://senko.net/vibecode-bench/
0xbadcafebee
11 hours ago
It was almost certainly not trained for coding, as it's got both audio and vision input, is only 12B, and nowhere in the announcement is coding mentioned. It will likely not have good performance on coding in general, compared to other small models like Qwen 3.6 35B A3B, Gemma 4 26B A4B, Nvidia Nemotron 3 Nano 30B-A3B, gpt-oss-20b.
For 16GB laptops, Qwen 3.5 9B is the undisputed champ.
Gemma 4 31B is the top dog at small model coding, but is dense so it needs ~48GB unified RAM for full context. If you want decent coding on a laptop you need a lot of RAM. But this shouldn't be surprising, dev machines have always needed lots of resources.
dotancohen
9 hours ago
> For 16GB laptops, Qwen 3.5 9B is the undisputed champ.
You seem like the guy to ask. For a laptop with 12GB VRAM (RTX 5070) and 32 GB system RAM, what is a good multilingual (English, Hebrew, Greek) model for conversing with personal notes in Org mode format? I don't care how long updating the model or RAG takes, and even inference can be reasonably slow, but the results of the query as they relate to my personal notes are important. I don't care about general knowledge, for those questions I can use e.g. ChatGPT. Thanks
akmarinov
3 hours ago
Join us over on Reddit at r/LocalLLaMA to get 10 different opinions on that
emmelaich
6 hours ago
You may like https://www.llmfit.org/
(not a recommendation, I've not used it ... yet)
hypfer
30 minutes ago
Just tried it and honestly it's a terrible experience lacking any sort of intent or reason.
Which is unsurprising in the AI space.
You get a wall of text showing you various random fine-tuned models by random people, and that is basically it.
Actual sane default requirements like "just give me the normal AI labs", "please filter for dense only" and "I want this exact context size at this quant" are not part of the tool, apparently. Neither is "compare these quants for me for the same model".
Or maybe it's just hidden well enough that I did not find them before I stopped caring.
Conway's law is at it again.
sourcecodeplz
6 hours ago
Any Gemma 4 model, they are great at translations, multilingual
dotancohen
12 minutes ago
While Gemini 4 seems fine, Gemma 4 does not do Hebrew well. I've replaced it with Aya Expanse and am getting much better results, but there is still much improvement to be had.
I'm not doing translations, rather querying Hebrew text with a Hebrew prompt.
silversmith
3 hours ago
For the biggest languages, Spanish, French, maybe.
For smaller ones like my native Latvian, the output could be confused for good translation from across the room, the words do look like Latvian words. But the quality is Google translate circa 20 years ago, tops.
It could probably do a decent enough translation to English, if all you need is to get the gist of text. But for smaller European language outputs, nothing comes close to Gemini.
tacomagick
3 hours ago
Gemma 4 26A4B
dirkg
3 hours ago
> For 16GB laptops, Qwen 3.5 9B is the undisputed champ.
you can run qwen 3.6 35BA3B on a 12-16GB vram gpu and it works pretty well.
https://www.youtube.com/watch?v=8F_5pdcD3HY&t=1s
even the 27B in some quants can fit.
https://www.reddit.com/r/LocalLLaMA/comments/1tkmgwj/qwen27b...
qwen IMO is far better for coding, esp agentic coding when combined with something like Pi, it comes probably close enough to Sonnet for a lot of use cases.
Gemma family is better for almost all other tasks you'd use a local llm for.
kajecounterhack
11 hours ago
Have you found Gemma 4 31B better than Qwen 3.6 27B Q8? I just started using Qwen + Pi agent and it's great, but "which model works best" is still totally crowdsourced and I was going off of peoples' opinions on reddit. Would love to hear more opinions if people have them.
embedding-shape
10 hours ago
> Have you found Gemma 4 31B better than Qwen 3.6 27B Q8?
Which quant of Gemma? For coding Qwen seems to be pretty far ahead. Gemma generally seems to have a "vaster" set of knowledge, but armed with a search tool that doesn't really matter, and Qwen 3.6 has been really great for all sorts of tool calling. I mostly do programming and related things though, fwiw.
> I was going off of peoples' opinions on reddit
It's extremely astroturfed all over the place, especially the larger subreddits, and especially the one related to a specific animal in a specific location. It's sad, as early on it was a great resource, but now it's mostly paid posts and a race to the bottom, with lots of piling, and all the knowledgeable people I used to recognize are nowhere to be found.
xenophonf
10 hours ago
It took me way too long to realize you were referring to r/localllama.
MoonWalk
9 hours ago
Why the obfuscation in the first place?
zozbot234
8 hours ago
I'm not sure that GP is correct, many people in that forum tend to hate Qwen for closing up many of their more recent models and leaving the whole local inference community 'stranded' on their older releases.
julianlam
4 hours ago
Are you sure? Prior to today the sub seems to be pretty partial to Qwen.
kajecounterhack
4 hours ago
That was definitely not the subreddit where I got my info.
thangalin
10 hours ago
Yes. I'm using Gemma-4 31B (gemma-4-31B-it-assistant.Q4_K_M.gguf) with llama.cpp to attribute quotations throughout chapters of my sci-fi novel. I started with Qwen3, but couldn't get it to work. Qwen3 TTS Voice Design, on the other hand, is incredible (Qwen3-TTS-12Hz-1.7B-VoiceDesign). I'm using both for an audiobook generator that produces a variety of voices.
Screens:
* https://i.ibb.co/TBBV5nJk/kl-01.png (voice design)
* https://i.ibb.co/nNvvKDyV/kl-02.png (quotation attributions)
qingcharles
2 hours ago
Gemma 4 31B is enormously impressive. You get 1000 requests/day for free on Google's API and another 1000/day off OpenRouter. Only problem is you get 503 like crazy.
jmpeax
6 hours ago
> nowhere in the announcement is coding mentioned
It's right there in the middle benchmark bar "LiveCode Bench" 72%.
senko
11 hours ago
Yeah, I agree 24B-36B sizes are better in general.
I don't have unified RAM tho and offloading to CPU is dog slow, which is why I'm interested in 7b-12b models.
iso1631
9 hours ago
I find ram crazy. My thinkpad has 32G of ram, it's a t470 that's nearly a decade old
Why do people with modern laptops have such little amounts of ram?
willy_k
8 hours ago
The ram that’s important for LLMs is gpu-accessible memory, meaning either systems with unified ram or VRAM, the latter of which is tied to the caliber of GPU one has.
doubled112
8 hours ago
My job still issues 16GB laptops as standard. You need a business reason to get more. This has been going on since before the price hikes.
I’m a system administrator and I can do my job with no issues at 16GB. Most days 8GB would likely be enough, since I’m just using and abusing other systems anyway.
Java devs at my last job were still running 16GB in 2020. Admittedly that was a while ago. Still not a decade.
Close some Chrome tabs?
zigzag312
12 hours ago
> It roughly compares with GPT-4.1 (!!), released 14 months ago
I think the major win for coding was reasoning. That's why such a small model can match GPT-4.1 in coding, but I suspect that GPT-4.1 still wins in general world knowledge due to its bigger size.
mdp2021
11 hours ago
> I suspect ... still wins in general world knowledge due to bigger size
Encyclopedic knowledge matters relatively little in perspective, given the expectable future developments: even the more knowledgeable of us will use that knowledge for reasoning and intuition (and we will have absorbed the intellectual keys during our training), but under our professional hat we should in theory be ready to go "I stand corrected" and "more precisely" with the actual data at hand.
I.e.: for the encyclopedic knowledge needed, the /understander/ will have a RAG subsystem and a corpus of knowledge to inquire upon processing queries.
(Corroboration: we can't delirate, and neither can the machine...)
bitexploder
10 hours ago
Don't LLMs work on attention though? The closer in their hyperdimensional space you can land your problem to their inherent understanding, the better they are at understanding your problem domain. RAG loops can be very slow and agents may simply lack the knowledge to use them correctly.
coldcity_again
11 hours ago
A great position to take. Strong opinions, weakly held.
DeathArrow
33 minutes ago
>The result is decent, but it had a few bizarre/trivial syntax errors I had to fix manually
Can you instruct it to use a lsp?
superkuh
11 hours ago
>consumer-grade card with 12G of VRAM and got 5t/s
That token output speed suggests to me that it's somehow running in hybrid mode and involving CPU + system RAM. ~5 tk/s is about what DDR4 RAM bandwidth gives you for a model of that size at 4-bit. Any consumer GPU with 12 GB, like an Nvidia RTX 2080 or RTX 3060, should be doing 20+ tk/s with llama.cpp and the CUDA backend.
senko
11 hours ago
Good catch. I haven't looked deeply into it. This is with Vulkan backend on Linux which I understand should be roughly comparable to CUDA? Gfx is rtx 3060(ti?).
I should play a bit more with llama.cpp options and see what happened there. Thanks!
superkuh
8 hours ago
I've had it happen in the past with llama.cpp on linux that the CPU will present itself as a vulkan device GPU1 with "PHYSICAL_DEVICE_TYPE_CPU" and had a mix-up. Might want to try llama-server --list-devices and then append --device Vulkan0 or whatever.
frikk
14 hours ago
Thank you for sharing this. Do you think the syntactical issues could be addressed with fine tuning or some other kind of parameter tweaking? That's frustrating hah.
profunctor
13 hours ago
With a harness you could feed the code to a linter and if there are errors feed that to a model automatically. It’s amazing that the models are good enough that I haven’t bothered doing this
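A minimal sketch of that kind of loop, under some loud assumptions: Python's own parser stands in for "a linter", and `fix` is a hypothetical callable that wraps whatever model you run (it is not part of any real library). The benchmark output is not necessarily Python, so treat this as the shape of the idea rather than a drop-in harness.

```python
import ast
from typing import Callable, Optional


def syntax_error(source: str) -> Optional[str]:
    """Return a short description of the first syntax error, or None if it parses."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


def repair_loop(
    source: str,
    fix: Callable[[str, str], str],  # hypothetical: (source, error) -> new source
    max_rounds: int = 3,
) -> str:
    """Feed syntax errors back to `fix` until the code parses or rounds run out."""
    for _ in range(max_rounds):
        error = syntax_error(source)
        if error is None:
            return source
        # In a real harness this is where the model gets re-prompted.
        source = fix(source, error)
    return source
```

The round limit matters: a model that keeps reintroducing the same stray paren would otherwise loop forever, so the caller should still check the final result.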
pseudosavant
10 hours ago
Models this small and this capable bode really well for the usefulness of a PC like the RTX Spark that Nvidia/Microsoft announced this week. 128GB of unified memory will likely be more than sufficient for effective local agentic coding, even if SOTA cloud models will still be even better.
Up until this point, I've found the cost/value to unequivocally favor using a cloud subscription, but I would be lying if I didn't worry that one day OpenAI is going to increase the price for my subscription by 5-10x. I rely on these tools enough that if there is a real viable local option, I'm going to take it.
dirkg
3 hours ago
The RTX/DGX Spark, Mac Ultras with 128GB unified ram are all ~$5k. It's still an expensive toy for rich people, it might as well be an H100 for 99.9% of the population (not devs with high paying jobs, of course).
the value of local models is allowing normal people to access AI without needing to subscribe to cloud services. this is esp imp for the rest of the world where even a 12GB gpu is extremely expensive.
there is no real viable local option that will come even close to Sonnet/Gemini Flash or the cheaper chinese models. Even if your pc costs <$2k you are never going to recoup the hw costs, and the results will be far worse.
zozbot234
9 hours ago
RTX Spark is pretty much the DGX Spark in a laptop form factor, plus some lower-performing chips in the same series to be released later according to rumors. We know quite well how the top-of-the-line chip performs: it's very interesting for some application areas, less so for others.
pseudollm
9 hours ago
> usefulness of the RTX Spark
Not really. There's a reason the announcement didn't include ANY benchmark (!) and didn't mention EXACTLY what the memory bandwidth is. It's going to be dog-slow and unusable for large models, as tok/sec is basically bandwidth divided by active weights. Rumoured 300GB/s / 30GB active weights (decent model) = 10 tokens per second, which is really slow
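The back-of-the-envelope rule above, written out as a rough sketch. It assumes decode is purely memory-bound and that every generated token reads all active weights once, so it is an upper bound at best; the function name is made up for illustration and the 300 GB/s figure is the rumour quoted above, not a confirmed spec.

```python
def decode_tokens_per_second(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Rough upper bound on decode speed: memory bandwidth / bytes read per token."""
    if active_weights_gb <= 0:
        raise ValueError("active_weights_gb must be positive")
    return bandwidth_gb_s / active_weights_gb


# Rumoured 300 GB/s against 30 GB of active weights.
print(decode_tokens_per_second(300, 30))  # 10.0
```

Real numbers will usually come in lower, since KV cache reads and compute overhead are ignored here.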
SwellJoe
9 hours ago
Yep, I have a Strix Halo and while it can run models bigger than Qwen 3.6 27b, it's not usable interactively when you do. ds4 patched for ROCm works, but at such a slow speed, it's not usable for coding agents.
The Nvidia boxes have only slightly more memory bandwidth, so I wouldn't expect them to be notably faster. At least not enough to make it useful interactively at that scale.
zozbot234
8 hours ago
Why does everyone expect interactivity from local AI? It's not the best use of the hardware, especially not miniPC hardware. Long-term batched inference with larger and more capable models is much more feasible AIUI.
int_19h
5 hours ago
I can't speak for others but IMO the only reason to run models locally right now is privacy - i.e. you don't trust any of the cloud providers to not look at your prompts. Price-wise the market is extremely competitive and cheap model serving favors large scale so anything that can be run locally can be run cheaper in the cloud. But if privacy is important, then it's important for everything, including traditional chatbot applications, which kinda do require interactivity.
SwellJoe
8 hours ago
Even batched it's uncomfortably slow. I started to benchmark ds4 with my security vulnerability benchmark (after Qwen 3.6 dense and MoE and a bunch of cloud models), but it was going to tie up the Strix Halo for more than a day, so I decided not to run it as it would prevent me from doing other stuff with it during that time.
Even batched usage needs to be fast enough to deliver results in a reasonable time. Overnight runs are useful, 24 hour runs are...less so.
Anyway, most of the time people are talking about interactive use, and there's currently an upper bound on how large a model can be for local hosting on a reasonable budget (i.e. not a crazy amount more expensive than what a high end developer desktop or laptop costs). The sweet spot is probably currently the big Qwen 3.6 or Gemma 4 models, which are in the ~60GB range for 8-bit quantization plus a large context.
hedgehog
7 hours ago
The 6-bit versions + 8-bit KV cache seem to save a good bit of memory without a significant loss of quality. The Qwen 35B is pretty fast in my testing, but MiniMax M2.7 230B is in some ways faster (way fewer tokens to arrive at an answer) even though it is much larger.
SwellJoe
7 hours ago
Qwen 3.6 35B-A3B with MTP at 8 bits is blazing fast, something like 50-60 tokens per second. That's plenty fast for interactive use, so I haven't tried lower bits. Unfortunately the MoE is notably dumber than the dense model (for the case I have data about...I've been benchmarking models for security vulnerability scanning, and 27B is notably better on hard bugs).
The dense model is almost usable, but feels really sluggish, even with MTP. I think it's about 12-15 tokens/second on the Strix Halo. Slow enough to where I'd rather pay to use a cloud model.
I might try the 6-bit version of the dense model to see how it behaves, though. Maybe it'll retain its bug hunting abilities while making it fast enough to use interactively and not take all day for benchmark runs.
hedgehog
5 hours ago
Same chip, with a 6 bit 35B and 8 bit KV cache I see about 500 prefill and 55 decode at 30k into the context window. MiniMax seemed a bit lower token rate but much, much less prone to 40k tokens of monologue before generating an answer. A pattern I like is to use a smaller model to do most execution and then a larger model to review transcripts and output and do any fixups and tooling improvements (this is all batch jobs so all I care about is overall throughput).
milch
3 hours ago
What hardware do you need to run MiniMax M2.7 230B locally?
hedgehog
20 minutes ago
Ryzen 395 is what I'm using, anything with 128GB+ of RAM accessible to the GPU should work fine for a 4 bit version of the model (so Spark or Mac Studio should be ok too).
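For a quick sanity check on why 128GB is roughly the floor, here is a rough estimate of weight size from parameter count and bits per weight. It is only a sketch: the function name is made up, it covers the weights alone, and real quant formats typically carry some overhead on top, plus KV cache and runtime buffers.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


# A 230B-parameter model at a nominal 4 bits per weight.
print(weight_size_gb(230, 4))  # 115.0
```

That leaves little headroom on a 128GB machine, which is why context size and KV cache precision end up mattering so much.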
McGlockenshire
12 hours ago
> my consumer-grade card with 12G of VRAM and got 5t/s for output
Thank you for giving me hope!