jtrn
3 hours ago
My quickie: MoE model heavily optimized for coding agents, complex reasoning, and tool use. 358B total / 32B active. vLLM/SGLang support is only on the main branches of those engines, not the stable releases. Supports tool calling in OpenAI-style format. Multilingual, with English/Chinese as the primary languages. Context window: 200k. Claims Claude 3.5 Sonnet/GPT-5 level performance. 716GB in FP16, probably ca. 220GB for Q4_K_M.
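Back-of-envelope on those sizes (my own rough arithmetic, assuming ~16 bits/param for FP16 and roughly 4.85 effective bits/param for Q4_K_M; real GGUF files vary a bit):

    # Rough size check for a 358B-parameter model at different precisions.
    params = 358e9

    def weights_gb(bits_per_param):
        # bytes = params * bits / 8; report in GB (1e9 bytes)
        return params * bits_per_param / 8 / 1e9

    print(f"FP16:   ~{weights_gb(16):.0f} GB")    # ~716 GB
    print(f"Q4_K_M: ~{weights_gb(4.85):.0f} GB")  # ~217 GB, roughly the ~220 GB figure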
My most important takeaway is that, in theory, I could get a "relatively" cheap Mac Studio and run this locally, and get usable coding assistance without being dependent on any of the large LLM providers. Maybe utilizing Kimi K2 in addition. I like that open-weight models are nipping at the heels of the proprietary models.
hasperdi
32 minutes ago
I bought a second-hand Mac Studio M1 Ultra with 128 GB of RAM, intending to run an LLM locally for coding. Unfortunately, it's just way too slow.
For instance, a 4-bit quantized model of GLM 4.6 runs very slowly on my Mac. It's not only the tokens-per-second speed but also the input processing, tokenization, and prompt loading; it takes so much time that it tests my patience. People often mention TPS numbers, but they neglect to mention the input loading times.
Reubend
4 minutes ago
Anything except a 3-bit quant of GLM 4.6 will exceed those 128 GB of RAM you mentioned, so of course it's slow for you. If you want good speeds, you'll at least need to fit the entire thing in memory.
mft_
an hour ago
I’m never clear, for these models with only a proportion of the parameters active (32B here), to what extent this reduces the RAM a system needs, if at all?
l9o
an hour ago
RAM requirements stay the same. You need all 358B parameters loaded in memory, since which experts activate is decided dynamically for each token. The benefit is compute: only ~32B params participate per forward pass, so you get much faster tok/s than a dense 358B model would give you.
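A rough way to see the tradeoff (my own sketch; the 358B/32B split is from the model card, everything else is simplification):

    # Memory scales with total params (all experts stay resident);
    # per-token compute scales roughly with active params only.
    total_params = 358e9
    active_params = 32e9

    resident_gb = total_params * 2 / 1e9          # FP16, 2 bytes/param -> ~716 GB
    compute_ratio = active_params / total_params  # fraction of a dense model's per-token work

    print(f"Weights resident in memory: ~{resident_gb:.0f} GB (same as a dense 358B)")
    print(f"Per-token compute vs. dense: ~{compute_ratio:.0%}")  # ~9%, which is where the tok/s win comes from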
deepsquirrelnet
an hour ago
For mixture of experts, it primarily helps with time-to-first-token latency, generation throughput, and memory usage for long contexts.
You still have to have enough RAM/VRAM to load the full parameters, but it scales much better for memory consumed by the input context than a dense model of comparable size.
noahbp
an hour ago
It doesn't reduce the amount of RAM you need at all. It does reduce the amount of VRAM/HBM you need, however: keeping just the parameters/experts used in one pass loaded on your GPU substantially increases token processing and generation speed, even if you have to load different experts for the next pass.
Technically you don't even need enough RAM to load the entire model, as some inference engines allow you to offload some layers to disk. Though even with top-of-the-line SSDs, this won't be workable unless you can accept very low, single-digit token generation rates.
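A rough sketch of that kind of setup, assuming llama.cpp's Python bindings (llama-cpp-python); the GGUF filename and layer count are hypothetical, and parameter names can shift between versions:

    # Keep only some layers in VRAM and mmap the rest of the weights from disk,
    # so the full model never has to sit in RAM/VRAM at once. Expect very low
    # token rates whenever the SSD has to page weights in.
    from llama_cpp import Llama

    llm = Llama(
        model_path="glm-4.7-q4_k_m.gguf",  # hypothetical filename
        n_gpu_layers=20,                   # offload only 20 layers to the GPU
        use_mmap=True,                     # page the remaining weights from disk on demand
        n_ctx=8192,
    )
    out = llm("Write a binary search in Python.", max_tokens=256)
    print(out["choices"][0]["text"])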
__natty__
3 hours ago
I can imagine someone from the past reading this comment and having a moment of doubt
embedding-shape
3 hours ago
> Supports tool calling in OpenAI-style format
So Harmony? Or something older? Since Z.ai also claims the thinking mode interweaves tool calling and reasoning, it would make sense if it were straight up OpenAI's Harmony.
> in theory, I could get a "relatively" cheap Mac Studio and run this locally
In practice, it'll be incredibly slow and you'll quickly regret spending that much money on it instead of just using paid APIs until proper hardware gets cheaper / models get smaller.
biddit
3 hours ago
> In practice, it'll be incredibly slow and you'll quickly regret spending that much money on it instead of just using paid APIs until proper hardware gets cheaper / models get smaller.
Yes, as someone who spent several thousand $ on a multi-GPU setup, the only reason to run local codegen inference right now is privacy or deep integration with the model itself.
It’s decidedly more cost-efficient to use frontier model APIs. Frontier models trained to work with their tightly coupled harnesses are worlds ahead of quantized models with generic harnesses.
theLiminator
3 hours ago
Yeah, I think without a setup that costs $10k+ you can't even get remotely close in performance to something like Claude Code with Opus 4.5.
cmrdporcupine
3 hours ago
10k wouldn't even get you 1/4 of the way there. You couldn't even run this or DeepSeek 3.2 etc for that.
Esp with RAM prices now spiking.
coder543
2 hours ago
$10k gets you a Mac Studio with 512GB of RAM, which definitely can run GLM-4.7 with normal, production-grade levels of quantization (in contrast to the extreme quantization that some people talk about).
The point in this thread is that it would likely be too slow due to prompt processing. (M5 Ultra might fix this with the GPU's new neural accelerators.)
embedding-shape
an hour ago
> $10k gets you a Mac Studio with 512GB of RAM, which definitely can run GLM-4.7 with normal, production-grade levels of quantization (in contrast to the extreme quantization that some people talk about).
Please do give that a try and report back the prefill and decode speeds. Unfortunately, I think what I wrote earlier will still apply:
> In practice, it'll be incredibly slow and you'll quickly regret spending that much money on it
I'd rather place that 10K on an RTX Pro 6000 if I were choosing between them.
rynn
11 minutes ago
> Please do give that a try and report back the prefill and decode speed.
M4 Max here w/ 128GB RAM. Can confirm this is the bottleneck.
I weighed a DGX Spark but thought the M4 would be competitive with equal RAM. Not so much.
cmrdporcupine
6 minutes ago
I think the DGX Spark will likely underperform the M4 from what I've read.
However it will be better for training / fine tuning, etc. type workflows.
coder543
28 minutes ago
> I'd rather place that 10K on an RTX Pro 6000 if I were choosing between them.
One RTX Pro 6000 is not going to be able to run GLM-4.7, so it's not really a choice if that is the goal.
benjiro
2 hours ago
> $10k gets you a Mac Studio with 512GB of RAM
Because Apple has not adjusted their pricing yet for the new RAM pricing reality. The moment they do, it's not going to be a $10k system anymore but more like $15k+...
The amount of wafer capacity going to AI is insane and will influence more than just memory prices. Do not forget, the only reason Apple is currently immune to this is that they tend to make long-term contracts, but the moment those expire ... they will push the costs down onto consumers.
tonyhart7
an hour ago
Generous of you to predict Apple will only make it 50% more expensive.
reissbaker
3 hours ago
No, it's not Harmony; Z.ai has their own format, which they modified slightly for this release (by removing the required newlines from their previous format). You can see their tool call parsing code here: https://github.com/sgl-project/sglang/blob/34013d9d5a591e3c0...
rz2k
2 hours ago
In practice, the 4-bit MLX version runs at 20 t/s for general chat. Do you consider that too slow for practical use?
What example tasks would you try?
whimsicalism
19 minutes ago
Commenters here are oddly obsessed with local serving, IMO; it's essentially never practical. It is okay to have to rent a GPU, but open weights are definitely good and important.
reissbaker
3 hours ago
s/Sonnet 3.5/Sonnet 4.5
The model outputs also IMO look significantly more beautiful than GLM-4.6's; no doubt helped in part by ample distillation data from the closed-source models. Still, not complaining, I'd much prefer a cheap and open-source model vs. a more expensive closed-source one.