Irritating LLMisms:
- "real architecture trick"
- "the honest hardware reality of running it at home."
- "What it is — and what Z.ai claims"
- "The one genuinely new idea"
And many more.
Yep. The entire thing. Instant turn-off when reading an article.
I'm sure the content does have some value, and perhaps someone spent time putting together an original copy that they thought was going to be made better by having AI "make it better".
Actually, I take some of that back - most of the site seems to be AI written, following the formula of "ingest multiple sources" => feed to AI => write article.
I've been using "the one genuinely adjective noun" for years as a weird English tic, and it bothers me that it's become an LLM tell.
That's because most of the "tells" expose more about the "reader" than the content.
That is so not true. The golden rule of "say what you mean" is about the writer. It gets broken constantly by LLMs: they always add words that do not reinforce the meaning, but dilute it.
New "delve" alternatives: honest (4x), real(ity) (11x), genuine (3x).
While older LLMs were tuned to slop, current ones are tuned to actively deceive.
If model performance continues to scale with model size, I have a hard time seeing how local models will have any chance of competing with models hosted on datacenter hardware.
1. There are strong economies of scale in hosting inference (batched prompts, high uptime, shared infrastructure).
2. There are physical limits on how much memory we will be able to produce over the next few years. Demand will probably scale at least as fast as production does, so we won't be saved by falling prices.
That's a big if.
The big commercial models seem to gain far more from pre-processing than they do from size, and you can already run pretty useful models on desktop hardware.
Check out this video about how DeepMind significantly improved performance: https://youtu.be/Dkqzqw8rxXI They basically ran the LLM tuning through an old-school genetic or annealing style algorithm and trounced what a larger model could do alone.
For your first point: you've just repeated "shared tenant", a scaling factor that's been used since before the turn of the millennium. Uptime is, as always, irrelevant for personal/homelab vs cloud. The question shifts from uptime to pure finances (capex first, then how you account for "wasted" time).
2) The current memory crunch is more political than cyclical. The only reason we have fabs as far into construction as we do is the CHIPS Act, which predates LLMs' public existence by more than 6 months. The horrific silicon prices are a direct result of OpenAI's openly illegal dealings. Their pretense of needing it for Stargate gets sundered further with each missed or cancelled deadline.
They predicted the political and regulatory outcome superbly.
There's value for many people and organizations in running a model locally on hardware they fully own and control (or pay to colocate in a datacenter somewhere) vs running a model on something owned/controlled by a third party, for example in highly privacy-sensitive or medical applications. It's not just a question of raw efficiency in dollars per token per day or tokens/second.
Cloud models will always be ahead, but not every task needs Fable-level intelligence. The number of usable situations for local models will increase as hardware and open-weight models improve.
GLM hasn't ballooned model size yet is ridiculously scaled?
Before people go and drop a gargantuan sum of money on a server capable of running it entirely in GPU, there's still a fair amount of used x86-64 servers capable of running it in CPU and RAM (using llama-server) for probably under $6000. For example a Dell R640 with two older Xeon 18-core CPUs and 1TB of RAM. Test it out at a slow token/sec rate and see if it fits your needs.
Same idea for Kimi.
To check whether I understand how this all works: wouldn't a 4-bit quant run reasonably well (for that hardware) with far less RAM, something like 1.5x the 476 GB, or 714 GB+ of RAM?
Yes, but the price difference between buying a used x86-64 server with 512GB and 1024GB isn't that great, and if you're already determined to buy the hardware to run a "large" model on CPU (e.g. not Qwen 3.6 35B-A3B, gemma4 or similar size), the loss of quality and sometimes suspicious nature of the output from a 4-bit quant might be undesirable vs running a Q8 quant or full precision.
You would also want a lot of RAM for context/KV cache to make it usable, so having just enough RAM to fit a Q4 model and run it (before any cache starts getting populated through active use) isn't enough.
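A rough way to sanity-check the numbers in this exchange, as a sketch only. It takes the 476 GB figure and the 1.5x multiplier straight from the comments above; the KV cache formula is the usual one for plain grouped-query attention, which may not match this model's actual attention layout, and the layer/head numbers in the last line are made up purely for illustration.

```python
def ram_needed_gb(weights_gb: float, headroom: float = 1.5) -> float:
    """Weights times a flat multiplier covering KV cache, buffers, and the OS."""
    return weights_gb * headroom


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache size for plain grouped-query attention.

    One K and one V vector per layer per token, hence the leading 2.
    """
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem / 1e9


Q4_WEIGHTS_GB = 476  # 4-bit size as quoted in the thread

need = ram_needed_gb(Q4_WEIGHTS_GB)
for installed in (512, 1024):
    verdict = "fits" if installed >= need else "too small"
    print(f"{installed} GB server: need ~{need:.0f} GB -> {verdict}")

# Hypothetical architecture numbers, only to show how the cache grows with context.
cache = kv_cache_gb(layers=60, kv_heads=8, head_dim=128, tokens=100_000)
print(f"~{cache:.1f} GB of KV cache at 100k tokens")
```

With these inputs the 512 GB box falls short of the 714 GB estimate and the 1 TB box clears it, which matches the point that the smaller configuration leaves no room for context.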
Agreed. There are some crazy good deals on these older servers. For me, the inference speed would be fine as I'd just get on with a million other tasks between each response.
Article reads as though written by someone who doesn't have much experience with deployments like this. Underestimates the memory needed to run with a reasonable amount of context. Misses two other obvious targets:
1) 4x DGX Spark (or equivalent other GB10 boxes) with a switch (MikroTik CRS504 or CRS804) and TP=4.
2) 4x RTX PRO 6000 box. Probably the most practical for cost/perf if you want on-prem as an individual.
Both would be best to run a 2-bit quant so everything can stay resident (article claims you could run a 4-bit quant with 4x RTX 6000 Ada, and while technically true it would mean a lot of the weights are streaming from DRAM, so it would be slow and impractical. You would need 8x RTX PRO 6000 to run 4 bit at a good speed).
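The residency argument above can be checked with some back-of-envelope arithmetic. This is a sketch under loud assumptions: per-device memory is taken from public spec sheets (48 GB for RTX 6000 Ada, 96 GB for RTX PRO 6000, 128 GB unified for DGX Spark), the 4-bit size is the 476 GB quoted elsewhere in the thread, the 2-bit size is crudely assumed to be half of that, and the 10% reserve for KV cache and activations is an arbitrary placeholder.

```python
# Per-device memory in GB (assumed from public spec sheets, not from the thread).
DEVICE_GB = {"RTX 6000 Ada": 48, "RTX PRO 6000": 96, "DGX Spark": 128}

Q4_GB = 476        # 4-bit size quoted in the thread
Q2_GB = Q4_GB / 2  # crude assumption: size scales linearly with bit width


def pooled_gb(device: str, count: int) -> int:
    """Total memory across `count` identical devices."""
    return DEVICE_GB[device] * count


def stays_resident(weights_gb: float, device: str, count: int,
                   reserve: float = 0.10) -> bool:
    """True if the weights fit after reserving a fraction for cache and activations."""
    return weights_gb <= pooled_gb(device, count) * (1 - reserve)


configs = [
    ("RTX 6000 Ada", 4, Q4_GB, "4-bit"),
    ("RTX PRO 6000", 4, Q4_GB, "4-bit"),
    ("RTX PRO 6000", 4, Q2_GB, "2-bit"),
    ("DGX Spark", 4, Q2_GB, "2-bit"),
    ("RTX PRO 6000", 8, Q4_GB, "4-bit"),
]
for device, count, size, label in configs:
    ok = stays_resident(size, device, count)
    print(f"{count}x {device} ({pooled_gb(device, count)} GB), {label}: "
          f"{'resident' if ok else 'spills to system RAM'}")
```

Under those assumptions the 4-bit weights only stay resident on the 8-GPU box, while the 2-bit weights fit on both 4-device setups. Real quants mix bit widths per tensor, so actual file sizes will differ from the linear estimate.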
This model quantizes unusually well: https://unsloth.ai/docs/models/glm-5.2#quantization-analysis
Can you really say you're running GLM 5.2 if it's a 2-bit quant? It might be usable but the capabilities will definitely not be the same.
I do think it's going to get harder and harder to run bleeding-edge models; this is just the start of it.
It being hard for the average Joe to run these at their fullest potential is unfortunate, but the important part is that _you can_, assuming you can acquire the resources.
I think that's going to be important for the sake of preserving privacy and freedom of information in the long run. We're seeing this play out right now with Anthropic originally playing the "safety" card for why they can't let everyone at Mythos, and subsequently getting on the US Gov't radar, with access to Fable being pulled.
The next biggest milestone will be an open-weights challenger to Mythos. There'll be consequences to that, but I feel those are less bad than someone else deciding what you can and can't use a model for.
I think people overrate the 'local' part of open models vs private ones. With OpenAI I have exactly one choice: I have to use them, even if they decide to double the cost or work with the government to blow up my country. My $5 server can't run GLM, but I can choose among many providers based on my requirements for cost, data residency, and political alignment.
Terrible, zero-value article. I am extremely surprised it is upvoted.
That being said, Artificial Analysis just came out with a brand new benchmark where it scored between opus 4.8 and gpt-5.5 and well behind fable-5, so it's definitely frontier-ish: https://x.com/ArtificialAnlys/status/2067744637155226101
That's just stupid.
- Why should I run it on local hardware when there are already about a dozen US providers available?
- To compare the token usage per task with GLM 5.1 is worthless when GLM 5.1 is unable to do the task.
- Not even z.ai itself runs the model with BF16 weights.
- I couldn't care less how good the model is at drawing a pelican on a bicycle.
Pretty sure the article is fully written by an LLM without any editing at all. See all the em-dashes sprinkled all over.
The 4 em-dashes are in the headline, the intro text below the headline, the author bio, and the site's footer; all of them might have been inserted by the website editor, not the original author.