hackernews client

GLM-5.2: The Most Powerful Open Model yet and the Brutal Reality of Running It

41 pointsposted 10 hours ago

27 Comments

CorpOverreach

10 hours ago

Yep. The entire thing. Instant turn-off when reading an article.

I'm sure the content does have some value, and perhaps someone spent time putting together an original copy that they thought was going to be made better by having AI "make it better".

Actually, I take some of that back - most of the site seems to be AI written, following the formula of "ingest multiple sources" => feed to AI => write article.

LeoPanthera

10 hours ago

I've been using "the one genuinely adjective noun" for years as a weird English tic, and it bothers me that it's become an LLM tell.

butvacuum

10 hours ago

That's because most the "tells" expose more about the "reader" than the content.

polotics

6 hours ago

That is so not true. The golden rule of "say what you mean" is about the writer. It gets broken constantly by LLMs: they always add words that do not reinforce the meaning, but dilute it.

Lockal

2 hours ago

New "delve" alternatives: honest (4x), real(ity) (11x), genuine (3x).

While older LLMs were tuned to slop, current ones are tuned to actively deceive.

tfirst

10 hours ago

If model performance continues to scale with model size, I have a hard time seeing how local models will have any chance of competing with models hosted on datacenter hardware.

1. There are strong economies of scale in hosting inference (batched prompts, high uptime, shared infrastructure).

2. There are physical limits on how much memory we will be able to produce over the next few years. Demand will probably scale at least as fast as production does, so we won't be saved by falling prices.

dlcarrier

7 hours ago

That's a big if.

The big commercial models seem to gain far more from pre-processing than they do from size, and you can already run pretty useful models on desktop hardware.+

Check out this video about how DeepMind significantly improved performance: https://youtu.be/Dkqzqw8rxXI They basically ran the LLM tuning through an old-school genetic or annealing style algorithm and trounced what a larger model could do alone.

butvacuum

10 hours ago

For your first point- You've just repeated "shared tenant." A scaling factor that's been used since before the turn of the milenium. Uptime is, as always, an irrelevancy for personal/homelab vs cloud. It shifts from uptime to pure financial (capex first, then how you account for "wasted" time).

2) The current memory crunch is more political than cyclical. The only reason we have fabs as far intro construction as we do is CHIPS Act. Which, predates LLMs public existance by more than 6mo. the horrific silicon prices are a direct result of openAI's openly Illegal dealings. Their pretense of needing it for stargate gets sundered further with each missed or cancelled deadline.

They predicted the political and regulatory outcome superbly.

walrus01

10 hours ago

There's a value for many people and organizations with running a model locally on hardware they fully own and control (or pay to colocate in a datacenter somewhere) vs running a model on something owned/controlled by any third party. For highly privacy sensitive, medical applications, etc. It's not just a question of raw efficiency in dollar per tokens per day or tokens/second.

dabinat

10 hours ago

Cloud models will always be ahead, but not every task needs Fable-level intelligence. The number of usable situations for local models will increase as hardware and open-weight models improve.

jauntywundrkind

8 hours ago

GLM hasn't ballooned model size yet is ridiculously scaled?

walrus01

10 hours ago

Before people go and drop a gargantuan sum of money on a server capable of running it entirely in GPU, there's still a fair amount of used x86-64 servers capable of running it in CPU and RAM (using llama-server) for probably under $6000. For example a Dell R640 with two older Xeon 18-core CPUs and 1TB of RAM. Test it out at a slow token/sec rate and see if it fits your needs.

Same idea for Kimi.

sgc

10 hours ago

To check whether I understand how this all works: Wouldn't a 4 bit quant run reasonably well (for that hardware) with far less ram, something like 1.5x the 476gb, or 714gb+ ram?

walrus01

9 hours ago

Yes, but the price difference between buying a used x86-64 server with 512GB and 1024GB isn't that great, and if you're already determined to buy the hardware to run in CPU a "large" model (eg: Not Qwen 3.6 35B-A3B, gemma4 or similar size), the loss of quality and sometimes suspicious nature of the output from a 4-bit quant might be undesirable vs running a Q8 quant or full precision.

You would also want a lot of RAM for context/kv cache to make it usable so just the amount of RAM that will fit a Q4 model and run it (before any cache starts getting populated through active use) isn't enough.

qingcharles

10 hours ago

Agreed. There are some crazy good deals on these older servers. For me, the inference speed would be fine as I'd just get on with a million other tasks between each response.

easygenes

10 hours ago

Article reads as though written by someone who doesn't have much experience with deployments like this. Underestimates the memory needed to run with a reasonable amount of context. Misses two other obvious targets:

  1) 4x DGX Spark (or equivalent other GB10 boxes) with a switch (MikroTik CRS504 or CRS804) and TP=4.
  2) 4x RTX PRO 6000 box. Probably the most practical for cost/perf if you want on-prem as an individual.

Both would be best to run a 2-bit quant so everything can stay resident (article claims you could run a 4-bit quant with 4x RTX 6000 Ada, and while technically true it would mean a lot of the weights are streaming from DRAM, so it would be slow and impractical. You would need 8x RTX PRO 6000 to run 4 bit at a good speed).

This model quantizes unusually well: https://unsloth.ai/docs/models/glm-5.2#quantization-analysis

redox99

9 hours ago

Can you really say you're running GLM 5.2 if its a 2 bit quant? It might be usable but the capabilities will definitely not be the same.

CorpOverreach

10 hours ago

I do think it's going to get harder and harder to run bleeding-edge models; this is just the start of it.

It being hard for the average joe to run these at its fullest potential is unfortunate, but the important part is that _you can_ assuming you can acquire the resources.

I think that's going to be important for the sake of preserving privacy and freedom of information in the long run. We're seeing this play out right now with Anthropic originally playing the "safety" card for why they can't let everyone at Mythos and subsequently got on the US Gov't radar with access to Fable being pulled.

The next biggest milestone will be an open-weights challenger to Mythos. There'll be consequences to that, but I feel those are less worse than someone else deciding what you can and can't use a model for.

blackoil

10 hours ago

I think people overrate 'local' part of open Models vs private. With OpenAI my choice is 1. I have to use them, even if they decide to double the cost or work with govt to blow my country. My $5 server can't run GLM but I have choice from many providers based on my requirements of cost, data residency, political alignment.

KaoruAoiShiho

10 hours ago

Terrible zero value article, I am extremely surprised it is upvoted.

That being said Artificial Analysis just came out with a brand new benchmark where it scored between opus 4.8 and gpt-5.5 and well behind fable-5 so it's definitely frontier-ish https://x.com/ArtificialAnlys/status/2067744637155226101

ma2kx

9 hours ago

Thats just stupid.

- Why should I run it on local hardware when there are already about a dozen US provider available?

- To compare the token usage per task with GLM 5.1 is worthless when GLM 5.1 is unable to do the task.

- Not even z.ai itself runs the model with BF16 weights.

- I couldn't care less how good the model is at drawing a pelican on a bicycle.

GLM-5.2: The Most Powerful Open Model yet and the Brutal Reality of Running It

27 Comments

kristianp

CorpOverreach

LeoPanthera

butvacuum

polotics

kibibu

Lockal

tfirst

dlcarrier

butvacuum

walrus01

dabinat

jauntywundrkind

walrus01

sgc

walrus01

qingcharles

easygenes

redox99

CorpOverreach

blackoil

KaoruAoiShiho

ma2kx

lamida

knbknb

raincole

nryoo